As low-end image sensors are available in many different configurations, using multiple cameras to increase the amount of depth information captured has become more common. Stanford University pioneered this approach with the publication  presenting a dense array of CMOS image sensors used to capture high-speed videos, and later with  using 100 cameras for synthetic aperture imaging. Besides such camera arrays, which exploit the increased depth information from additional sensors, binocular stereo computer vision systems are common. In principle, two views are sufficient to compute 3D information about an object depicted in both frames. Either a rig of two cameras can be used, or a single camera has to move around the object to capture frames from different viewpoints. The latter principle is commonly referred to as structure from motion, whereas the former is called stereoscopy. As both principles rely on at least two views sharing a common area, software implementations for structure from motion can, at least in part, be utilized for stereoscopy and vice versa, since they share the same mathematical foundation.
In the presented work the open source computer vision library OpenCV, version 3.4.4, is used. The multi-camera system referred to in this paper is designed to monitor the interior of a production machine with tools moving on a gantry in the X, Y and Z directions. Multiple processing stations may be placed within the range of motion of the gantry. Since the machine should be able to work its way through the interior without colliding with any object, the three-dimensional profile of the interior needs to be detected. Therefore multiple cameras, initially four, are mounted to the machine frame (see Figure 1). A stereo vision approach is used, which yields six stereo pairs for four cameras.
Generally speaking, the aim of stereoscopy algorithms is to calculate a disparity map from two frames of two cameras observing the same scene. The disparity map holds information about the different positions of common scene points projected into the individual camera frames. With this information each point can be reprojected to three-dimensional coordinates, resulting in a point cloud in which every point contains X, Y, Z values in real-world coordinates. In the presented setup this leads to six point clouds from six stereo pairs, depicting the machine interior from different viewpoints.
The calibration of a multi-camera system is an essential part of the process of reconstructing three-dimensional data from two-dimensional images, as it defines the achievable measurement accuracy as well as the scale. The aim of the calibration is to obtain the intrinsic and extrinsic parameters of every camera in order to transform a point from world coordinates to image pixel coordinates (see Figure 2). Later these parameters are used to undistort and rectify the images, as well as for the image transformations necessary for the reconstruction of 3D points.
The extrinsic parameters include the rotation matrix R and translation vector t. Intrinsic parameters are the camera matrix A, comprising the focal lengths fx, fy and the optical center (cx, cy), as well as the distortion coefficients d.
All calibration parameters remain valid as long as the position and orientation of the cameras, along with the focal settings, do not change. Therefore the calibration has to be carried out only once for a given camera setup.
Single Camera Calibration
To start the camera calibration, a camera model has to be chosen and described. The OpenCV calibration algorithm utilizes a pinhole camera model and introduces radial and tangential distortion . Distortion correction is vital, since the presented multi-camera system uses low-cost board-level cameras with S-mount lenses, which introduce an amount of distortion to the images that cannot be neglected.
At first a transformation from a three-dimensional point Pw(Xw, Yw, Zw) in world coordinates to a point P(u, v) in the image pixel coordinate system has to be found (see Figure 3). Equation (1) transforms a point Pw to a point Pc(Xc, Yc, Zc) in the camera coordinate system, where R is a 3x3 rotation matrix and t is a 3x1 translation vector.
The point Pc is now projected through the pinhole model in order to obtain physical coordinates on the image plane as P(x, y), see (2).
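Written out, the two transformations can be reconstructed as follows; this is a sketch assuming OpenCV's convention, in which (2) projects onto the normalised image plane so that the focal lengths enter later through the camera matrix A:

```latex
% (1) world to camera coordinates
P_c = R \, P_w + t
% (2) pinhole projection onto the image plane
x = \frac{X_c}{Z_c}, \qquad y = \frac{Y_c}{Z_c}
```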
By introducing the radial distortion coefficients (k1, k2, k3), the corrected point coordinates Pk(xk, yk) are defined as follows:
Here k1, k2 and k3 are the radial distortion coefficients and r² = x² + y². Since the lens is not aligned perfectly parallel to the image plane, tangential distortion is introduced. Its correction for a point Pp(xp, yp) can be described as:
In summary, the distortion coefficients are defined as d = (k1, k2, p1, p2, k3). The point Pq(xq, yq), corrected for tangential and radial distortion, is defined as the combination of equations (3) and (4):
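The distortion equations can be reconstructed from the standard OpenCV distortion model; the following is a sketch consistent with the symbols defined above, not a verbatim copy of the paper's typeset equations:

```latex
% (3) radial distortion, with r^2 = x^2 + y^2
x_k = x \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad
y_k = y \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6)
% (4) tangential distortion
x_p = x + 2 p_1 x y + p_2 (r^2 + 2 x^2), \qquad
y_p = y + p_1 (r^2 + 2 y^2) + 2 p_2 x y
% (5) combined correction
x_q = x \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2)
y_q = y \, (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y
```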
To translate the physical image plane coordinates to image plane pixel coordinates, the camera matrix A is needed, which contains the focal lengths fx and fy, expressed in pixel units, as well as the optical center in pixel coordinates as (cx, cy). γ represents the skewness, which is the angle error between the two axes of the pixel array. For industrial-grade image sensors the skewness is usually small and can be neglected.
Putting everything together, a point P(u, v) in pixel coordinates is defined as:
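The complete transformation chain of equations (1) to (7) can be sketched in a few lines of NumPy. This is an illustrative implementation under the OpenCV pinhole convention; the function name `project_point` is chosen here for illustration, while the symbols R, t, A and d = (k1, k2, p1, p2, k3) are taken from the text above:

```python
import numpy as np

def project_point(Pw, R, t, A, d):
    """Project a 3-D world point to pixel coordinates via the pinhole
    model with radial and tangential distortion, eqs. (1)-(7)."""
    k1, k2, p1, p2, k3 = d
    # (1) world -> camera coordinates
    Xc, Yc, Zc = R @ Pw + t
    # (2) projection onto the normalised image plane
    x, y = Xc / Zc, Yc / Zc
    # (3) radial term, r^2 = x^2 + y^2
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    # (4)/(5) combined radial and tangential correction
    xq = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
    yq = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
    # (6)/(7) camera matrix A maps image plane coordinates to pixels
    fx, fy, cx, cy = A[0, 0], A[1, 1], A[0, 2], A[1, 2]
    return fx * xq + cx, fy * yq + cy
```

For zero distortion coefficients this reduces to u = fx·Xc/Zc + cx and v = fy·Yc/Zc + cy.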
Stereo Camera Calibration
Stereo camera calibration, or binocular calibration, yields, in addition to the already obtained intrinsic and extrinsic parameters, the relative position and orientation of one camera of a stereo pair with respect to the other, along with the matrices necessary for the reconstruction of the scene.
We suppose the left camera extrinsic parameters are the rotation matrix RL and translation vector tL. For the right camera these are RR and tR. From these matrices and vectors the translation vector tLR from left to right camera coordinates and the corresponding rotation matrix RLR are derived.
The key concept behind stereoscopic reconstruction is epipolar geometry, which encapsulates the relation of the projective geometry between two views. Its centrepiece is the fundamental matrix F, which derives from the relation given in equation (8), with x as the projection of a real-world point X in the left camera frame and x′ in the right camera frame. The equation can be solved using image points with known real-world distances obtained from both cameras of a stereo pair. For a more in-depth explanation of epipolar geometry refer to .
To be able to transform images pairwise in accordance with epipolar geometry, the rectification homographies HL and HR need to be found. With a known fundamental matrix F, the algorithm described in  can be utilized.
Figure 6 shows the frames of stereo pair 2 after distortion correction and rectification, i.e. with the rectification homographies applied.
To be able to reproject image points to 3D coordinate space, disparity has to be introduced. Figure 4 depicts a simplified projection of a point P onto two already rectified images. Since these images are rectified, the projections P1 and P2 are located on the same epipolar line, therefore v1 = v2 and u1 ≠ u2. The equation for disparity is given by (9), with B as the baseline, i.e. the distance between the optical centers of the cameras (C1 and C2), f as the focal length, and z as the distance to point P in the 3D world coordinate system.
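From the similar triangles in Figure 4, equation (9) can be reconstructed as:

```latex
% (9) disparity of a scene point at depth z
d = u_1 - u_2 = \frac{B \, f}{z}
```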
One step in the stereo calibration is still missing: the computation of the disparity-to-depth mapping matrix Q. Its definition is given by equation (10). Since the vectors are written in homogeneous coordinates, their fourth entry refers to scale. With this equation a disparity map can be reprojected to the world coordinate system in 3D (see section 3).
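Equation (10) relates a pixel and its disparity to a homogeneous world point; a reconstruction consistent with the description above:

```latex
% (10) reprojection of pixel (u, v) with disparity d
Q \begin{pmatrix} u \\ v \\ d \\ 1 \end{pmatrix}
= \begin{pmatrix} X \\ Y \\ Z \\ W \end{pmatrix},
\qquad
P_w = \left( \frac{X}{W}, \; \frac{Y}{W}, \; \frac{Z}{W} \right)
```

For an ideal rectified pair with identical principal points, Q takes a well-known closed form (sign conventions vary between sources; the signs below are chosen so that Z = f B / d, consistent with equation (9)):

```latex
Q = \begin{pmatrix}
1 & 0 & 0 & -c_x \\
0 & 1 & 0 & -c_y \\
0 & 0 & 0 & f \\
0 & 0 & 1/B & 0
\end{pmatrix}
```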
In order to apply the relations elucidated above, a set of points with known distances is needed. Therefore a chessboard-type calibration target is used, as shown in Figure 5. A set of images with different positions of the target is taken, where the target should be moved to every position in the measurement volume in order to obtain a strongly calibrated stereo vision system. The positions of the corners on the calibration target are detected using the OpenCV function findChessboardCorners() and, to refine the corner positions to subpixel precision, cornerSubPix() is called afterwards. This routine is carried out for every image during the calibration process, leading to a set of points for every camera where every point refers to a corner on the chessboard pattern. With these points and an array of distance values in mm corresponding to the dimensions of the squares in the calibration pattern, the cameras can be calibrated. For every stereo pair, at first the extrinsic and intrinsic parameters of the two participating cameras are calculated (see section 2.1) using the OpenCV function calibrateCamera(), which is based on  and . This routine delivers the camera matrices AL and AR, the distortion coefficients dL and dR, a rotation matrix and translation vector for every pattern view, and a reprojection error. The distortion coefficients are later used to undistort images, as depicted in Figure 6 on the left side.
From the rotation and translation matrices for every pattern view, the rotation matrix RLR and translation vector tLR from left to right camera are derived.
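This derivation can be sketched as follows; `stereo_extrinsics` is a hypothetical helper name, and the relation assumes both sets of extrinsics stem from the same view of the calibration target:

```python
import numpy as np

def stereo_extrinsics(RL, tL, RR, tR):
    """Given one target pose seen by both cameras (P_left = RL @ Pw + tL,
    P_right = RR @ Pw + tR), return the left-to-right transform so that
    P_right = R_LR @ P_left + t_LR holds."""
    RLR = RR @ RL.T          # rotation from left to right camera frame
    tLR = tR - RLR @ tL      # translation between the optical centres
    return RLR, tLR
```

In practice the per-view estimates are averaged or refined jointly, since each pattern view yields a slightly different RLR and tLR.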
The above mentioned chessboard corner points in image pixel coordinates are now undistorted using the function undistortPoints(), and then used to obtain the F matrix via the OpenCV function findFundamentalMat(). This function utilizes the random sample consensus (RANSAC) algorithm to solve the problem delineated in equation (8). With the obtained F matrix the above mentioned rectification homographies can be calculated: in OpenCV the function stereoRectifyUncalibrated() computes the rectification homographies HL and HR using the algorithm presented in . The right side of Figure 6 depicts the rectified images of stereo pair 2.
After rectification a disparity map can be calculated using a matching algorithm, for example semi-global block matching (SGBM); this procedure is explained in section 3. As mentioned above, the Q matrix is needed to reproject a disparity map to 3D. The connection between disparity, world points and Q is given in (10). Since we are using the OpenCV function stereoRectifyUncalibrated(), which does not return a Q matrix, the disparity-to-depth mapping matrix needs to be calculated in another way. For this, equation (10) is utilized. At first the points obtained from the calibration target are triangulated using OpenCV's triangulatePoints(), which results in a matrix of points in homogeneous coordinates, denoted B in (11). The matrix D in the same equation is acquired by applying HL, respectively HR, to the image points obtained by findChessboardCorners() and cornerSubPix(). Equation (12) describes the derivation of a pseudoinverse matrix, which is used in (13) to solve equation (11) for Q. In this way it is possible to calculate the disparity-to-depth mapping matrix based on the points retrieved from the different views of the calibration target. Assuming the calibration target was moved through the whole observed volume, the maximum and minimum disparity values can also be calculated from these points.
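The pseudoinverse construction of equations (11) to (13) amounts to a single least-squares solve. A NumPy sketch, where the function name and the exact row layout of D (u, v, disparity, 1 per column) are assumptions for illustration:

```python
import numpy as np

def disparity_to_depth_matrix(B, D):
    """Estimate the 4x4 matrix Q from corresponding homogeneous point
    sets, eqs. (11)-(13): B = Q @ D  =>  Q = B @ D^+.
    B: 4xN triangulated points, D: 4xN columns (u, v, disparity, 1)."""
    D_pinv = D.T @ np.linalg.inv(D @ D.T)  # (12) right pseudoinverse of D
    return B @ D_pinv                      # (13)
```

Note that this assumes consistently scaled homogeneous coordinates in B; the output of triangulatePoints() would be normalised accordingly before the solve.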
RECONSTRUCTION OF THE SCENE
The calibration as a whole delivers information to rectify camera frames according to epipolar geometry. From this point a disparity map for a stereo pair needs to be calculated, which then can be reprojected to 3D world coordinates with the already obtained disparity-to-depth mapping matrix Q.
The OpenCV implementation of the semi-global block matching (SGBM) algorithm is applied to the rectified images of stereo pair 2 (see Figure 6, right side). The OpenCV class StereoSGBM uses a modified version of the algorithm from , where, instead of the mutual information cost function, a simpler sub-pixel metric from  is implemented.
In the first step, pixelwise matching costs are calculated as the minimum absolute difference of intensities within a half-pixel range along the epipolar line. As the name of the algorithm suggests, a global smoothness constraint is approximated: the smoothed cost for a pixel (or block, depending on the chosen block size) and disparity is calculated by summing the costs of all minimum-cost paths, here from 5 directions, that end in that pixel or block. The disparity map is determined by selecting, for every pixel of the source image, the disparity with the minimum aggregated cost. This process is carried out twice, first matching from the left to the right image and then from the right to the left image; therefore a left and a right matcher instance are created. The minimum disparity input value is obtained from the matrix D in equation (11), as is the number of disparities, which is the maximum disparity value minus the minimum disparity value. From these two maps one combined disparity map is obtained using the weighted least squares (WLS) filter with a left-right-consistency-based confidence map. In OpenCV these functions are implemented in the ximgproc.DisparityWLSFilter class. Figure 7 depicts the output disparity maps.
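The left/right matching and the consistency check can be illustrated with a deliberately simplified matcher. This is not OpenCV's SGBM (no path aggregation, plain SAD costs instead of the sub-pixel metric, no WLS filtering); it is only a sketch of the winner-takes-all and left-right-consistency ideas, with all function names invented for the example:

```python
import numpy as np

def box_sum(a, k):
    """Sum over every k x k window of `a`, via 2-D cumulative sums."""
    c = np.pad(np.cumsum(np.cumsum(a, axis=0), axis=1), ((1, 0), (1, 0)))
    return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

def sad_disparity(left, right, max_disp, block=5):
    """Winner-takes-all SAD block matching on a rectified pair: for each
    left-image window, pick the disparity with the lowest summed cost."""
    h, w = left.shape
    L = left.astype(np.float64)
    cost = np.empty((max_disp + 1, h - block + 1, w - block + 1))
    for d in range(max_disp + 1):
        # shift the right image by d pixels; pad the invalid strip
        shifted = np.full((h, w), 1e6)
        shifted[:, d:] = right[:, :w - d].astype(np.float64)
        cost[d] = box_sum(np.abs(L - shifted), block)
    return np.argmin(cost, axis=0)

def lr_consistent(disp_l, disp_r, tol=1):
    """Mask of left disparities confirmed by the right-to-left map."""
    h, w = disp_l.shape
    u = np.arange(w)
    mask = np.zeros((h, w), dtype=bool)
    for v in range(h):
        ur = u - disp_l[v]   # where each left pixel lands in the right map
        ok = ur >= 0
        mask[v, ok] = np.abs(disp_r[v, ur[ok]] - disp_l[v, ok]) <= tol
    return mask
```

A right-image disparity map for the consistency check can be obtained by matching the horizontally flipped images, e.g. `sad_disparity(right[:, ::-1], left[:, ::-1], max_disp)[:, ::-1]`.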
Constructing Point Clouds
The obtained disparity images (see Figure 7) need to be reprojected to 3D world coordinates using the disparity-to-depth mapping matrix Q calculated in section 2.2. In OpenCV the function reprojectImageTo3D() transforms a disparity map into a 3-channel point cloud: for each pixel (u, v) and its corresponding disparity value disparity(u, v), equation (10) is solved. This results in a dense point cloud with 5,038,848 points. Figure 8 depicts the valid part of the resulting point cloud.
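Per pixel, this reprojection solves equation (10). A NumPy sketch mirroring what reprojectImageTo3D() computes (the function name here is illustrative, not OpenCV's API):

```python
import numpy as np

def reproject_image_to_3d(disparity, Q):
    """Solve equation (10) for every pixel: Q @ (u, v, d, 1)^T gives a
    homogeneous point (X, Y, Z, W); dividing by W yields world
    coordinates, one 3-vector per pixel."""
    h, w = disparity.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    pix = np.stack([u, v, disparity.astype(np.float64), np.ones((h, w))])
    hom = np.tensordot(Q, pix, axes=1)  # shape (4, h, w)
    return hom[:3] / hom[3]             # X, Y, Z channels
```

Pixels with invalid (e.g. minimum) disparity would be masked out before use, which corresponds to the "valid part" of the point cloud mentioned above.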
CONCLUSION AND FUTURE WORK
This paper has presented a calibration technique for stereoscopic applications utilizing the functions and algorithms implemented in the OpenCV library. The calibration procedure does not rely solely on these functions but is extended by our own approaches, for instance the computation of the disparity-to-depth mapping matrix. With the presented calibration procedure it is possible to calibrate an array of cameras, divided into stereoscopic camera pairs, and retrieve 3D information about the observed scene.
In the future the individual point clouds will be registered and filtered to obtain one point cloud containing the information from all views. Another significant step is the use of structured light in order to make the SGBM algorithm work on all parts of the acquired images.
This paper was supported in part by the European Social Fund and Thüringer Aufbaubank.