## 1. INTRODUCTION

As low-end image sensors are available in many different configurations, using multiple cameras to increase the depth of information captured by the image sensors is becoming more common. Stanford University pioneered this approach with [1], which presents a dense array of CMOS image sensors used to capture high-speed video, and later with [2], which uses 100 cameras for synthetic aperture imaging. Besides these camera arrays, which exploit the increased depth of information from more sensors, binocular stereo computer vision systems are common. In principle, two views are enough to compute 3D information about an object depicted in both frames. Either a rig of two cameras is used, or a single camera is moved around the object to capture frames from different viewpoints. The latter principle is commonly referred to as structure from motion, whereas the former is called stereoscopy. As both principles rely on at least two views with a common area, software implementations for structure from motion can, at least in part, be used for stereoscopy and vice versa, as they share the same mathematical approach.

In the presented work the open source computer vision library OpenCV, version 3.4.4, is used. The multi-camera system referred to in this paper is designed to survey the interior of a production machine with tools moving on a gantry in the X, Y and Z directions. Multiple processing stations may be placed within the range of motion of the gantry. Since the machine should be able to move through the interior without colliding with any object, the three-dimensional profile of the interior needs to be detected. Therefore multiple cameras, initially four, are mounted to the machine frame (see Figure 1). A stereo vision approach is used, which leads to six stereo pairs for four cameras.
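
The count of six stereo pairs follows from choosing two out of four cameras. A minimal sketch, assuming every unordered pair of cameras is treated as one stereo pair; the camera names are hypothetical and not taken from the setup:

```python
from itertools import combinations

# Hypothetical camera identifiers; the text only states that four cameras are used.
cameras = ["cam0", "cam1", "cam2", "cam3"]

# Every unordered pair of distinct cameras is treated as one stereo pair.
stereo_pairs = list(combinations(cameras, 2))

for left, right in stereo_pairs:
    print(left, right)

print(len(stereo_pairs))  # 6
```

In general, n cameras yield n(n-1)/2 pairs, so the number of point clouds grows quadratically with the number of cameras.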

Generally speaking, the aim of stereoscopy algorithms is to calculate a disparity map from two frames by two cameras corresponding to the same scene. The disparity map holds information about the different positions of common scene points projected to the individual camera frames. With this information the point can be easily reprojected to three-dimensional coordinates, resulting in a point cloud with every point containing X, Y, Z coordinates in real world coordinates. In the presented setup this leads to six point clouds from six stereo pairs depicting the machine interior from different viewpoints.

## 2. CAMERA CALIBRATION

The calibration of a multi-camera system is an essential part of reconstructing three-dimensional data from two-dimensional images, as it determines the achievable measurement accuracy as well as the scale. The aim of the calibration is to obtain the intrinsic and extrinsic parameters of every camera, which transform a point from world coordinates to image pixel coordinates (see Figure 2). Later these parameters are used to undistort and rectify the images, as well as for the image transformations necessary for the reconstruction of 3D points.

The extrinsic parameters include the rotation matrix ***R*** and translation vector ***t***. Intrinsic parameters are the camera matrix ***A***, comprising the focal lengths *f*_{x}, *f*_{y} and the optical center (*c*_{x}, *c*_{y}), as well as the distortion coefficients ***d***.

All calibration parameters remain valid as long as the position and orientation of the cameras, along with the focal settings, do not change. Therefore the calibration has to be carried out only once for a camera setup.

### 2.1 Single Camera Calibration

To start the camera calibration, a camera model has to be chosen and described. The OpenCV calibration algorithm uses a pinhole camera model extended by radial and tangential distortion [3]. Distortion correction is vital since the presented multi-camera system uses low-cost board-level cameras with S-mount lenses, which introduce a non-negligible amount of distortion into the images.

At first a transformation from a three-dimensional point *P*_{w}(*X*_{w}, *Y*_{w}, *Z*_{w}) in world coordinates to a point *P*(*u, v*) in the image pixel coordinate system has to be found (see Figure 3). Equation (1) transforms a point *P*_{w} to a point *P*_{c}(*X*_{c}, *Y*_{c}, *Z*_{c}) in the camera coordinate system, where *R* is a 3 × 3 rotation matrix and *t* is a 3 × 1 translation vector.

The point *P*_{c} is now projected through the pinhole model in order to obtain physical coordinates on the image plane as *P*(*x, y*), see (2).
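
Equations (1) and (2) are not reproduced in this text. As a sketch, assuming the pinhole formulation documented for OpenCV's calibration functions, they would take the following form. In this convention *x* and *y* are normalized by the depth *Z*_{c} and are therefore dimensionless, which may differ from the units used in the original equations.

$$
\begin{pmatrix} X_c \\ Y_c \\ Z_c \end{pmatrix}
= R \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} + t
\tag{1}
$$

$$
x = \frac{X_c}{Z_c}, \qquad y = \frac{Y_c}{Z_c}
\tag{2}
$$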

By introducing radial distortion coefficients (*k*_{1}, *k*_{2}, *k*_{3}), the corrected point coordinates *P*_{k}(*x*_{k}, *y*_{k}) are defined by (3), with *k*_{1}, *k*_{2} and *k*_{3} being the radial distortion coefficients and *r*^{2} = *x*^{2} + *y*^{2}.

If the lens is not perfectly aligned with the image plane, tangential distortion is introduced. Its correction for a point *P*_{p}(*x*_{p}, *y*_{p}) is described by (4).

In summary the distortion coefficients are defined as *d* = (*k*_{1}, *k*_{2}, *p*_{1}, *p*_{2}, *k*_{3}). The tangential and radial distortion corrected point *P*_{q}(*x*_{q}, *y*_{q}) is, as a combination of the equations (3) and (4), defined by (5).

To translate the physical image plane coordinates to image plane pixel coordinates, the camera matrix *A* is needed, which contains information about the focal length in *mm* as *f*_{x} and *f*_{y}, as well as the optical center in pixel coordinates as (*c*_{x}, *c*_{y}). *γ* represents the skewness, which is the angle error between the two axes of the pixel array. For industrial grade image sensors skewness is usually small and can be neglected.
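
The distortion equations and the camera matrix are not reproduced in this text. As a sketch, assuming the distortion model documented for OpenCV's calibration functions and applied to the coordinates *x*, *y* from (2), they would read as follows. The numbering of (3) to (5) follows the references in the text. The layout of *A* with the skew term *γ* is the common convention and is an assumption here. Note that OpenCV itself reports *f*_{x} and *f*_{y} in pixel units.

$$
x_k = x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6), \qquad
y_k = y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)
\tag{3}
$$

$$
x_p = x + 2 p_1 x y + p_2 (r^2 + 2 x^2), \qquad
y_p = y + p_1 (r^2 + 2 y^2) + 2 p_2 x y
\tag{4}
$$

$$
\begin{aligned}
x_q &= x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2 x^2) \\
y_q &= y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y^2) + 2 p_2 x y
\end{aligned}
\tag{5}
$$

$$
A = \begin{pmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}
$$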

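In conclusion, a point *P*(*u, v*) in pixel coordinates is obtained by applying the camera matrix *A* to the distortion corrected point *P*_{q}(*x*_{q}, *y*_{q}) in homogeneous coordinates.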
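
The complete chain from world coordinates to pixel coordinates can be sketched in a few lines of NumPy. This is an illustration under the assumption of the OpenCV-style model (normalized image coordinates, focal lengths in pixels, zero skew), not the implementation used in this work. The function name `project_point` and all numeric values are hypothetical.

```python
import numpy as np

def project_point(P_w, R, t, A, d):
    """Project a 3D world point to pixel coordinates (pinhole model with distortion)."""
    k1, k2, p1, p2, k3 = d
    # World coordinates -> camera coordinates, cf. (1)
    X_c, Y_c, Z_c = R @ P_w + t
    # Pinhole projection onto the normalized image plane, cf. (2)
    x, y = X_c / Z_c, Y_c / Z_c
    # Radial and tangential distortion, cf. (3) to (5)
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_q = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_q = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Camera matrix applied to the homogeneous point (x_q, y_q, 1)
    u, v, w = A @ np.array([x_q, y_q, 1.0])
    return np.array([u / w, v / w])

# Hypothetical parameters: camera frame equals world frame, no skew.
R = np.eye(3)
t = np.zeros(3)
A = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
d_zero = (0.0, 0.0, 0.0, 0.0, 0.0)   # ideal lens
d = (-0.2, 0.05, 0.001, 0.001, 0.0)  # hypothetical distortion coefficients

P_w = np.array([0.1, -0.05, 1.0])
print(project_point(P_w, R, t, A, d_zero))  # ideal pinhole projection
print(project_point(P_w, R, t, A, d))       # same point with lens distortion
```

With all distortion coefficients set to zero the function reduces to the plain pinhole projection, which makes the effect of the coefficients easy to inspect by comparing the two printed results.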

### 2.2 Stereo Camera Calibration

Stereo camera calibration, or binocular calibration, determines, in addition to the already obtained intrinsic and extrinsic parameters, the position of one camera of a stereo pair relative to the other, along with the matrices necessary for the reconstruction of the scene.

We suppose the left camera extrinsic parameters are the rotation matrix *R*_{L} and translation vector *t*_{L}. For the right camera these are *R*_{R} and *t*_{R}. From these matrices and vectors the translation vector from left to right camera coordinates ***t***_{LR} and the corresponding rotation matrix ***R***_{LR} are derived.

The keyword for stereoscopic reconstruction is epipolar geometry, which encapsulates the relation of the projective geometry between two views. The centrepiece of this is the fundamental matrix ***F***, which derives from the relation given in equation (8), with *x* as the projection of a real-world point *X* in the left camera frame and *x*′ as its projection in the right camera frame. The equation can be solved for image points with known real world distances obtained from both cameras of a stereo pair. For a more in-depth explanation of epipolar geometry refer to [4].

To be able to transform images pairwise in accordance with epipolar geometry, rectification homographies ***H***_{L} and ***H***_{R} need to be found. With a known fundamental matrix ***F***, the algorithm described in [5] can be used.

Figure 6 shows frames of pair 2 distortion corrected and rectified, which means the rectification homographies are applied.
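
The relations in this section can be checked numerically. The sketch below assumes the extrinsic convention of (1), from which *R*_{LR} = *R*_{R}*R*_{L}^{T} and *t*_{LR} = *t*_{R} − *R*_{LR}*t*_{L} follow, and assumes that equation (8) is the usual epipolar constraint *x*′^{T}*F* *x* = 0. For illustration only, *F* is built in closed form from a known pose, whereas the procedure in this paper estimates it from point correspondences. All helper names and numeric values are hypothetical.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation about the Y axis, used only to build example poses."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def relative_pose(R_L, t_L, R_R, t_R):
    """Pose of the right camera relative to the left: P_R = R_LR @ P_L + t_LR."""
    R_LR = R_R @ R_L.T
    t_LR = t_R - R_LR @ t_L
    return R_LR, t_LR

def fundamental_from_pose(A_L, A_R, R_LR, t_LR):
    """Closed-form F from a known relative pose and camera matrices."""
    essential = skew(t_LR) @ R_LR
    return np.linalg.inv(A_R).T @ essential @ np.linalg.inv(A_L)

# Hypothetical camera matrices and extrinsics.
A_L = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
A_R = np.array([[810.0, 0.0, 315.0], [0.0, 810.0, 245.0], [0.0, 0.0, 1.0]])
R_L, t_L = rotation_y(0.05), np.array([0.0, 0.0, 0.5])
R_R, t_R = rotation_y(-0.05), np.array([-0.12, 0.0, 0.5])

R_LR, t_LR = relative_pose(R_L, t_L, R_R, t_R)
F = fundamental_from_pose(A_L, A_R, R_LR, t_LR)

# Project one world point into both (undistorted) views.
X_w = np.array([0.2, -0.1, 2.0])
P_L = R_L @ X_w + t_L
P_R = R_R @ X_w + t_R
x_left = A_L @ P_L / P_L[2]    # homogeneous pixel coordinates, left
x_right = A_R @ P_R / P_R[2]   # homogeneous pixel coordinates, right

residual = x_right @ F @ x_left
print(np.allclose(R_LR @ P_L + t_LR, P_R))
print(abs(residual))
```

The residual of the epipolar constraint is zero up to floating point error for noise-free projections; with detected corner points it is only approximately zero, which is why a robust estimator is used in practice.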

To be able to reproject image points to 3D coordinate space, disparity has to be introduced. Figure 4 depicts a simplified projection of a point *P* onto two already rectified images. Since these images are rectified, the projections *P*_{1} and *P*_{2} are located on one epipolar line. Therefore *v*_{1} = *v*_{2} and *u*_{1} ≠ *u*_{2}. The equation for disparity is given by (9) with *B* as baseline or distance between the optical centers of the cameras (*C*_{1} and *C*_{2}), *f* as focal length and *z* as distance to point *P* in 3D world coordinate system.
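
Equation (9) is not reproduced in this text. As a sketch, assuming rectified pinhole cameras with identical focal length *f* expressed in pixels, the relation follows from similar triangles:

$$
\text{disparity} = u_1 - u_2 = \frac{B\,f}{z}
\tag{9}
$$

For example, with the hypothetical values *B* = 120 mm, *f* = 800 px and *z* = 2000 mm, the disparity is 120 · 800 / 2000 = 48 px. Since disparity is inversely proportional to *z*, distant points produce small disparities and are reconstructed with lower depth resolution.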

One step in stereo calibration is still missing: the computation of the disparity-to-depth mapping matrix ***Q***. Its definition is given by equation (10). Since the vectors are given in homogeneous coordinates, their fourth entry refers to scale. With this equation a disparity map can be reprojected to the world coordinate system in 3D (see section 3).

### 2.3 Calibration Procedure

In order to apply the relations described above, a set of points with known distances is needed. Therefore a chessboard type calibration target is used, as shown in Figure 5. A set of images with different positions of the target is taken, where the target should be moved through the whole measurement volume in order to get a strongly calibrated stereo vision system. The positions of the corners on the calibration target are detected using the OpenCV function *findChessboardCorners()* and, to get the corner positions with subpixel precision, *cornerSubPix()* is called afterwards. This routine is carried out for every image during the calibration process. This leads to a set of points for every camera where every point refers to a corner on the chessboard pattern. With these points and an array with distance values in *mm* corresponding to the dimension of the rectangles in the calibration pattern, the cameras can be calibrated. For every stereo pair, at first the extrinsic and intrinsic parameters of the two participating cameras are calculated (see section 2.1) using the OpenCV function *calibrateCamera()*, which is based on [6] and [7]. This routine delivers the camera matrices *A*_{L} and *A*_{R}, the distortion coefficients *d*_{L} and *d*_{R}, a rotation matrix and translation vector for every pattern view, and a reprojection error. The distortion coefficients are later used to undistort images, as depicted in Figure 6 on the left side.

From the rotation matrices and translation vectors for every pattern view, the rotation matrix *R*_{LR} and translation vector *t*_{LR} from left to right camera are derived.

The above mentioned chessboard corner points in image pixel coordinates are now undistorted using the function *undistortPoints()*, and then used to obtain the *F* matrix via the OpenCV function *findFundamentalMat()*. This function uses the random sample consensus (RANSAC) algorithm to solve the problem described by equation (8). With the obtained *F* matrix the above mentioned rectification homographies can be calculated. In OpenCV the function *stereoRectifyUncalibrated()* computes the rectification homographies *H*_{L} and *H*_{R} using the algorithm presented in [5]. On the right side Figure 6 depicts rectified images of stereo pair 2.

After rectification a disparity map can be calculated using a matching algorithm, for example semi-global block matching (SGBM); this procedure is explained in section 3. As mentioned above, the *Q* matrix is needed to reproject a disparity map to 3D. The connection between disparity, world points and *Q* is given in (10). Since we are using the OpenCV function *stereoRectifyUncalibrated()*, which does not return a *Q* matrix, the disparity-to-depth mapping matrix needs to be calculated in another way, for which equation (10) is used. At first the points obtained from the calibration target are triangulated using OpenCV's *triangulatePoints()*, which results in a matrix with points in homogeneous coordinates, denoted as *B* in (11). The matrix *D* in the same equation is acquired by applying *H*_{L} and *H*_{R}, respectively, to the image points obtained by *findChessboardCorners()* and *cornerSubPix()*. Equation (12) describes the derivation of a pseudoinverse matrix, which is used in (13) to solve equation (11) for *Q*. In this way it is possible to calculate the disparity-to-depth mapping matrix based on the points retrieved from the different views of the calibration target. Assuming the calibration target was moved through the whole observed volume, maximum and minimum disparity values can also be calculated using these points.
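
The least-squares estimation of *Q* can be sketched with NumPy. Equations (11) to (13) are not reproduced in this text, so the sketch assumes the forms *B* = *Q* *D*, *D*^{+} = *D*^{T}(*D* *D*^{T})^{-1} and *Q* = *B* *D*^{+}, where the columns of *D* are (*u*, *v*, disparity, 1)^{T} and the columns of *B* are the triangulated points in homogeneous coordinates. The data is synthetic, and `Q_true` is a hypothetical matrix used only to generate it. Triangulated points from real images carry an arbitrary homogeneous scale per point, and how this scale is handled is not described here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth mapping used only to generate consistent test data.
f, cx, cy, baseline = 800.0, 320.0, 240.0, 120.0
Q_true = np.array([
    [1.0, 0.0, 0.0, -cx],
    [0.0, 1.0, 0.0, -cy],
    [0.0, 0.0, 0.0, f],
    [0.0, 0.0, 1.0 / baseline, 0.0],
])

# D: one column (u, v, disparity, 1) per rectified chessboard corner.
n_points = 50
D = np.vstack([
    rng.uniform(0.0, 640.0, n_points),   # u in pixels
    rng.uniform(0.0, 480.0, n_points),   # v in pixels
    rng.uniform(20.0, 100.0, n_points),  # disparity in pixels
    np.ones(n_points),
])

# B: homogeneous 3D points, here generated from Q_true instead of triangulation.
B = Q_true @ D

# Right pseudoinverse of D (assumed form of (12)), valid if D has full row rank.
D_pinv = D.T @ np.linalg.inv(D @ D.T)

# Least-squares estimate of Q (assumed form of (13)).
Q_est = B @ D_pinv

# The disparity search range can be read from the same points.
min_disparity = D[2].min()
max_disparity = D[2].max()

print(np.abs(Q_est - Q_true).max())
print(min_disparity, max_disparity)
```

At least four points in general position are needed for *D* *D*^{T} to be invertible. Using many points spread over the measurement volume makes the estimate less sensitive to corner detection noise.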

## 3. RECONSTRUCTION OF THE SCENE

The calibration as a whole delivers information to rectify camera frames according to epipolar geometry. From this point a disparity map for a stereo pair needs to be calculated, which then can be reprojected to 3D world coordinates with the already obtained disparity-to-depth mapping matrix *Q*.

Section 3.1 describes the computation of the disparity map, and section 3.2 explains the reprojection to 3D world coordinates.

### 3.1 Constructing Disparity

The OpenCV implementation of the semi-global block matching (SGBM) algorithm is applied to the rectified images of stereo pair 2 (see Figure 6, right side). The OpenCV class *StereoSGBM* uses a modified version of the algorithm from [8] in which, instead of the mutual information cost function, a simpler sub-pixel metric from [9] is implemented.

In the first step the pixelwise cost is calculated as the absolute minimum difference of intensities in the range of half a pixel in 5 directions along the epipolar line. As the name of the algorithm suggests, a global smoothness constraint is approximated. The smoothed cost for a pixel (or block, depending on the chosen block size) and disparity is calculated by summing the costs of all minimum cost paths that end in the pixel or block. The disparity map is determined by selecting, for every pixel of the source image, the disparity with the minimum cost. This process is carried out twice, first from the left to the right image and then from the right to the left image, for which a left and a right matcher instance are created. The minimum disparity passed to the matchers is obtained from the matrix *D* in equation (11), as is the number of disparities, which is the maximum disparity value minus the minimum disparity value. From these two maps one combined disparity map is obtained using the Weighted Least Squares filter with a left-right-consistency-based confidence map. In OpenCV these functions are implemented in the *ximgproc.DisparityWLSFilter* class. Figure 7 depicts the output disparity maps.
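
Deriving the matcher's disparity range from the calibration points can be sketched as follows. The OpenCV documentation requires the number of disparities of *StereoSGBM* to be divisible by 16, so the measured range is rounded up here. The helper name `sgbm_disparity_range` and the sample values are hypothetical, and the rounding strategy is an assumption rather than the procedure used in this work.

```python
import math

def sgbm_disparity_range(disparities, multiple=16):
    """Return (min_disparity, num_disparities) covering all given disparities.

    num_disparities is rounded up to the next multiple of `multiple`.
    """
    d_min = math.floor(min(disparities))
    d_max = math.ceil(max(disparities))
    span = max(d_max - d_min, 1)  # the range must be greater than zero
    num_disparities = multiple * math.ceil(span / multiple)
    return d_min, num_disparities

# Hypothetical disparities of rectified chessboard corners, in pixels.
sample = [23.4, 57.9, 101.2, 88.0]
print(sgbm_disparity_range(sample))  # (23, 80)
```

The two returned values correspond to the *minDisparity* and *numDisparities* parameters of the matcher. Rounding up slightly enlarges the search range, which costs computation time but does not exclude valid disparities.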

### 3.2 Constructing Point Clouds

The obtained disparity images (see Figure 7) need to be reprojected to 3D world coordinates. For this the disparity-to-depth mapping matrix *Q* is used (see sections 2.2 and 2.3). In OpenCV the function *reprojectImageTo3D()* transforms a disparity map to a 3-channel point cloud. For each pixel (*u, v*) and its corresponding disparity value *disparity*(*u, v*), equation (10) is solved. This results in a dense point cloud with 5,038,848 points. Figure 8 depicts the valid part of the resulting point cloud.
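
The per-pixel reprojection can be sketched in NumPy. Equation (10) is not reproduced in this text, so the sketch assumes the form documented for *reprojectImageTo3D()*: (*X*, *Y*, *Z*, *W*)^{T} = *Q* (*u*, *v*, disparity, 1)^{T}, followed by a division by *W*. The function name `reproject_to_3d`, the matrix `Q` and the tiny disparity map are hypothetical.

```python
import numpy as np

def reproject_to_3d(disparity, Q):
    """Map an (h, w) disparity image to an (h, w, 3) array of 3D points."""
    h, w = disparity.shape
    # Pixel coordinates: v is the row index, u is the column index.
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    homogeneous = np.stack(
        [u, v, disparity.astype(np.float64), np.ones((h, w))], axis=-1
    )
    xyzw = homogeneous @ Q.T  # applies Q to every pixel vector
    # Pixels with W == 0 (for example zero disparity) yield inf or nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        return xyzw[..., :3] / xyzw[..., 3:4]

# Hypothetical Q: f = 800 px, optical center (320, 240), baseline 120 mm.
Q = np.array([
    [1.0, 0.0, 0.0, -320.0],
    [0.0, 1.0, 0.0, -240.0],
    [0.0, 0.0, 0.0, 800.0],
    [0.0, 0.0, 1.0 / 120.0, 0.0],
])

disparity = np.full((2, 2), 48.0)  # tiny disparity map, in pixels
points = reproject_to_3d(disparity, Q)
print(points[0, 0])  # 3D point of pixel (u=0, v=0), in mm
```

The result has one 3D point per pixel, so an image with *h* × *w* pixels always produces *h* · *w* points. Invalid disparities have to be masked afterwards to obtain the valid part of the cloud.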

## 4. CONCLUSION AND FUTURE WORK

This paper has presented a calibration technique for stereoscopic applications that uses the functions and algorithms implemented in the OpenCV library. The calibration procedure does not rely on these functions alone but is extended by our own approaches, for instance the computation of the disparity-to-depth mapping matrix. With the presented calibration procedure it is possible to calibrate an array of cameras, divided into stereoscopic camera pairs, and to retrieve 3D information about the observed scene.

In the future the individual point clouds will be registered and filtered to obtain one point cloud containing all information from all views. Another significant step is the usage of structured light in order to make the SGBM algorithm work on all parts of the acquired images.

## ACKNOWLEDGMENTS

This paper was supported in part by the European Social Fund and Thüringer Aufbaubank.