Near-real-time stereo matching method using both cross-based support regions in stereo views

Abstract. This paper presents a near-real-time stereo matching method using both cross-based support regions in stereo views. By applying the logical AND operator to the cross-based support region in the reference image and target image, we can obtain an intersection support region, which is used as an adaptive matching window. The proposed method aggregates absolute difference estimates in the intersection support region, which are combined with the census transform results. The census transform with a fixed window size and shape is applied, and only the resultant binary code of the pixel in the intersection support region is used. From Middlebury images and their ground truth disparity maps, we compute the area similarity ratio of support regions in stereo views. Then, a conditional probability of observing a correct disparity estimate with respect to the area similarity ratio is examined. By taking a natural logarithm of the probability, a relative reliability weight about the area similarity of support regions is obtained. The initial matching cost is then combined with the reliability weight to obtain the final cost, and the disparity with the minimum cost is chosen as the final disparity estimate. Experimental results demonstrate that the proposed method can estimate accurate disparity maps.

Near-real-time stereo matching method using both cross-based support regions in stereo views 1 Introduction Stereo vision applications based on stereo matching are becoming increasingly common, ranging from mobile robotics to driver assistance system. The goal of stereo matching is to determine a precise disparity, which indicates the difference in the location of the corresponding pixels. The corresponding pixels between two images of the same scene are established based on similarity measures. Dense stereo matching to find the disparity for every pixel between two or more images has been actively researched for decades. [1][2][3][4][5][6][7][8][9][10] Stereo matching algorithms are classified into global and local approaches. 1 Local methods utilize the color or intensity values within a finite support window to determine the disparity for each pixel. Global methods compute all disparities of an image simultaneously by optimizing a global energy function. [2][3][4][5][6][7][8][9][10] Global methods, which define a global energy function with a data term and smoothness term, help produce accurate disparity maps. To find the minimum global energy function, various global optimizers, such as dynamic programming (DP), 5,6 belief propagation, 7 and graph cuts, 8 have been proposed. Local algorithms select the potential disparity with the minimal matching cost at the pixel; hence, they are efficient and easy to implement.
Local stereo methods commonly use matching windows with a fixed size and shape, but the estimation results are greatly influenced by irrelevant pixels within the window considered. Improved local algorithms that reduce the error effects of irrelevant pixels can be divided into two categories. 3 Local stereo methods focus on selecting either the optimal window among predefined multiple windows or selecting point by point to adaptively support a window's size and shape. 9,10 However, building windows of various sizes and shapes adaptive to neighboring intensity distribution is time-consuming. Adaptive-weight methods assign different support weights to pixels in the given window by evaluating color similarity and geometric proximity, 2 but textureless regions, repeated patterns, and occlusion regions are not readily amenable to this solution.
This paper introduces an adaptive stereo matching method using the cross-based support regions in stereo views. The size and shape of the cross-based support region are chosen adaptively according to local color information and spatial distance. By applying the AND logical operator to the cross-based support region in both the reference image (left) and target image (right), an intersection support region can be obtained, which is then used as an adaptive matching window. The proposed method aggregates absolute difference (AD) estimates in the adaptive matching window, which are combined with the census transform results.
From the Middlebury reference images and their ground truth depth maps, 11 a conditional probability of observing a correct disparity estimate with respect to the similarity ratio of the areas in the cross-based support regions in stereo views can be calculated. When the cross-based support region in the target image is similar to that in the reference image, the area similarity ratio is close to 1, and a more accurate disparity value can be obtained. By taking the natural log probability, a relative reliability weight value based on the area similarity ratio is obtained. The initial matching cost is then combined with the reliability weight to obtain the final cost, and the disparity with the minimum cost is chosen as the final disparity estimate. Experimental results demonstrate that the proposed algorithm based on reliability weight provides more accurate disparity estimates than previous methods.
The main contributions of this paper are twofold. First, using both support regions in stereo views enables a disparity map to be more accurately estimated. The intersection region of cross-based support regions in the reference image and target image is used as an adaptive matching window. Second, a reliability weight based on the area similarity ratio of both support regions in stereo views is introduced. The reliability weight represents the probability of correct disparity estimation with respect to the area similarity between stereo views. The shaded box in Fig. 1 represents the main components of our contribution.

Building Intersection Support Region
In a textureless region, large matching windows to consider enough pixels are needed to overcome their erroneous matching costs, whereas in highly textured regions at depth discontinuities, smaller windows are needed to avoid oversmoothing. To address this problem, a cross-based support region construction method that can adaptively alter the window's shape and size is proposed, 3,4 which considers only the support region of the reference image. Another method considers cross-based support regions in both target and reference images, but the regions are used only for initial matching cost aggregation. 12 In this paper, an area where the cross-based support regions in both the target and reference images intersect is used as an adaptive support window for matching cost computation.
Reference 4 presented the enhanced construction rules to determine the shape of the cross-based support regions. Figure 2 shows how an upright cross with four arms for the anchor pixel p ¼ ðx; yÞ is constructed. Table 1 shows a pseudocode to determine the arm length l and endpoint, p e of the horizontal and vertical directions with respect to p. Here, p n is the n'th pixel along the scanning direction of each arm: p n ¼ ðx-n; yÞ in the left arm and p n ¼ ðx þ n; yÞ in    the right arm, and the maximum number of p n is set to L 1 . In Table 1, we examine whether the color differences D c ðp n ; pÞ and D c ðp n ; p n−1 Þ are lower than τ 1 along the scanning direction of each arm. The color difference D c ðp n ; pÞ is defined as the maximum AD between p n and p in RGB channels: D c ðp n ; pÞ ¼ max i¼R;G;B jI i ðp n Þ − I i ðpÞj. Two successive pixels (p n and its predecessor p n−1 ) are examined so that the arm will not go beyond the edges of the image. When the arm length is longer than L 2 (n > L 2 ), the lower color threshold value τ 2 (τ 2 < τ 1 ) is used in the color difference computation to control the arm length more flexibly. In this paper, τ 1 , L 1 , τ 2 , and L 2 are the experimentally preset threshold values 27, 21, 15, and 13, respectively. Then, the support region of pixel p is modeled by merging the horizontal arms of all the pixels (for example, green colored row of q in Fig. 2) lying on the vertical arms of pixel p. Each horizontal line (left arm and right arm) of q is examined as shown in Table 1, and the arm length l and endpoint q e are determined. Since the color distribution is different for each row of pixel q lying on the vertical arms of p, the support region of p is not rectangular.
In the AD method, the matching cost is computed as the AD in color/intensity between corresponding pixels. 1 The matching costs in the support region are aggregated within two passes along the horizontal and vertical directions. In the first pass, the matching costs are aggregated from the endpoint's predecessor q e−1 of the left arm of any pixel q to the endpoint's predecessor of the right arm. In other words, the horizontal summation of matching costs is performed on every pixel q lying on the vertical arms of pixel p. Then, the intermediate results of q s on the vertical arms are aggregated vertically to obtain the final cost. Both passes can be efficiently computed with one-dimensional integral images. 3,4 Given a pixel p ¼ ðx; yÞ in the reference image, our goal is to establish its corresponding pd ¼ ðx-d; yÞ in the target image accurately. Since the stereo rig is assumed to be rectified, we only examined a horizontal translation. Here, we can acquire a support region SRðpÞ in the reference image and the support regions SR 0ðpdÞ in the target image along a candidate disparity level d. Using the logical AND operation of the cross-based support region in the reference image and target image, the support region with the same size and shape, called the intersection support region, can be obtained.
Assuming that neighboring pixels with similar colors have similar disparities, previous methods built cross-based support regions for initial matching cost aggregation. 3,4 However, when the intensity distribution is complex, constructing the cross-based support regions of a surface with the same depth information is difficult. Furthermore, a support region of insufficient size may be built in this case. If the total arm length in both horizontal directions is less than five pixels, the proposed method sets the length of the support region to five pixels to consider the minimum neighborhood region.

Computing Initial Matching Cost
In the cross-based support region method, 4 the initial matching cost Cðp; dÞ is computed by combining the AD measure and census transform in an exponential function. More specifically, given two corresponding pixels p and pd, two cost values C AD ðp; dÞ and C census ðp; dÞ are computed individually.
C AD ðp; dÞ is defined as the average intensity difference of p and pd in RGB channels by Eq. (1). In the original census transform, the brightness value of anchor pixel p is compared with the brightness values of pixels NðpÞ in the census window W. In Eq. (2), function ξðÞ returns 1 if the brightness value Iðp n Þ of the neighborhood pixel is higher than the counterpart IðpÞ of the central pixel and 0 if the brightness value of the neighborhood pixel is lower than that of the central pixel by comparing the brightness values between pixels. Using the concatenation operator ⊗, the pixel-based brightness comparison results are encoded into the bit-string CðpÞ. More specifically, CðpÞ, consisting of 0 and 1, refers to the relative brightness distribution of neighborhood pixels on the basis of the central pixel of the window.
By considering only the resultant binary code of the pixel p n in the intersection support region ISRðpÞ, we can reduce the errors caused by irrelevant pixels. This means that p n is a pixel in the intersection region of NðpÞ and ISRðpÞ. The census transform result is obtained from the hamming distance of the two bit-strings of pixel p and the corresponding pd E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 3 2 6 ; 4 9 9 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 3 2 6 ; 4 5 3 The length of the bit-string by census transform depends on the size of the census mask. The proposed method stores the bit-string CðpÞ by census transform in a one-byte unit. For example, when the census window is 9 × 7 pixels, 62 bits are generated by the brightness comparison. Then, eight strings of one-byte length are encoded, and the first two bits of the eighth string are "do not care" bits. An exclusive-OR operation of the binary string in the reference image and in the target image is performed in the census transform. A look-up table that contains the exclusive-OR operation results of all two bit-strings (of one-byte length) is used for computation efficiency.
C AD ðp; dÞ is normalized with the maximum intensity value (255) as Eq. (3). When the AD measure and census transform are combined, Mei et al. 4 used the exponential function. First, it maps different cost measures (AD measure and census transform) to the range [0,1], such that the cost values will not be severely biased by one of the measures. Second, it allows easy control of the influence of the outliers in each cost measure.
A clipping function is used for efficient implementation, instead of the exponential function used in Ref. 4. C N_AD ðp; dÞ is aggregated in the intersection support region ISRðp; dÞ, which are normalized with the area of the intersection support region area½ISRðp; dÞ as Eq. (3). The size and shape of the intersection support region is determined for the candidate disparity levels (0 − d max −1 ). In the same way, C census ðp; dÞ is normalized with the area of the intersection of NðpÞ and ISRðp; dÞ, and then clipped in Eq. (4). Here, λ AD and λ census are the threshold values for the clipping functions. Two parameter values are determined by considering the matching costs of the AD measure and census transform, and C N_AD ðp; dÞ and C census ðp; dÞ are normalized by λ AD and λ census , respectively. In Eq. (5), the initial matching cost Cðp; dÞ is computed with C SAD ðp; dÞ and C census ðp; dÞ. To give the relative weight of importance for two cost measures directly, w AD and w census parameters are included E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 6 3 ; 6 5 3 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 5 ; 6 3 ; 5 0 2 Cðp; dÞ ¼ w AD C SAD ðp; dÞ þ w census C census ðp; dÞ: The size and shape of the intersection support region vary greatly with disparity levels. If a census transform is applied to the intersection support regions at every disparity level, its computational load is very high. Hence, a census transform with a fixed window size and shape is applied, and only the resultant binary code of the pixel in the intersection support region is used as Eq. (2). Even if the size of the intersection support region is larger than the census transform window, the result of the census transform with the fixed window is used as CðpÞ.

Area Similarity Ratio of Intersection Support Regions
The proposed method introduces a relative reliability weight based on the similarity ratio of the areas of the cross-based support regions between stereo views. The area of the intersection support region is divided by the area of the crossbased support region in the reference image. The obtained area ratio R i (0.0 to 1.0) represents the area similarity between the support region in the reference image and target image. Figure 3 shows the process of computing the area similarity ratio and assigning reliability weights at each disparity level in more detail. The figure shows both the candidate corresponding pixels and their support regions in the target image with respect to the anchor pixel in the reference image. Here, the anchor pixel in the cross-based support region is indicated by a white dot. In the winner takes all (WTA) strategy, the disparity with the minimum cost value is generally selected as the final disparity estimate. 1 Accordingly, the reciprocal value of the reliability weight based on the probability is multiplied by the initial matching cost at each disparity level.
From the Middlebury reference images and their ground truth depth maps, the conditional probability of observing a correct disparity estimate on the condition that the area similarity ratio R i is given is examined. To compute this probability, the number of cases where correct disparity estimates are obtained from R i is divided by the total number of image pixels (width × height). The area similarity ratio is divided Fig. 3 Process of area similarity ratio computation at each disparity level.  When the cross-based support region in the target image is similar to that in the reference image (i.e., the area similarity ratio is close to 1), a more accurate disparity value is obtained. Here, an extremely small support region (area less than 5 × 5) is excluded from this estimation procedure because it will yield a small denominator, which may cause ambiguity in the area similarity ratio computation.
By taking the natural log probability, a relative reliability weight according to the area similarity ratio is obtained. In general, since the probability value is too small, its logarithm result may be negative. Here, the probability value is multiplied by 10 5 to make the logarithm results of the minimum probability value positive. Then, the logarithm results are normalized with the value at the last sampling level. Figure 4(b) shows the reciprocal value of the reliability weight according to the R i . The weighted matching cost distribution of the anchor pixel is examined in the disparity estimation procedure.

Disparity Refinement
Unreliable disparity values could still be obtained owing to occlusions, repetitive structures, and texture-less regions. To eliminate these matching ambiguities, a four-direction scanline optimizer based on a semiglobal matching method is used. 4,13,14 Given the scanline directions (two along the horizontal directions, and two along the vertical directions), the path cost at pixel p and disparity d is updated, penalizing the disparity changes between neighboring pixels. The final cost for pixel p and disparity d is obtained by averaging the path costs from four directions.
Using left-right consistency (LRC) checking, the matching ambiguity regions can be obtained. 2,3 The edge propagation (EDP) method is then used as an optimization process with color continuity and edge information. 15 The EDP method propagates the disparity values to the peripheral regions, considering the color differences of pixels and the edge costs of regions. When the cost value of an adjacent pixel is small, the disparity values of neighborhood pixels are propagated. The proposed method uses WTA to determine the disparity value in the cost volume, and then performs the LRC check for outlier detection. In addition, unstable disparity estimation values are refined by iterative region voting and proper interpolation procedures. 3

Experimental Results
The experiment was carried out using a PC with Intel(R) Core(TM) i7-6700 CPU @3.40 GHz, NVIDIA Geforce GTX 1070 graphics card, and VS 2013, OpenCV 2.4.13 development environment. The proposed algorithm was implemented on a GPU-based CUDA 8.0 platform with multithreads and parallel programming. Figure 5 shows the Tsukuba, Venus, teddy, and cones stereo datasets (2001 and 2003), their ground truth disparity maps, and the disparity map results from our method. Table 2 shows four reference benchmark image resolutions and their maximum disparity levels. Table 3 shows quantitative evaluation results by stereo matching algorithms for the Middlebury database set. In Middlebury stereo evaluation, the percentage of bad matched pixels (BMP) is used as the error measure. 1 The BMP is based on counting the disparity estimation errors exceeding a threshold value, and the most commonly used value is 1 pixel. In this paper, the threshold value for the BMP is set to 1. The matching errors of previous methods are listed with their original rank, as reported in the Middlebury benchmark. In this comparison, the area similarity ratio is divided into four sampling intervals (8,16,32, and 64 levels). Table 3 shows that the sampling interval of R i exhibits some influence on the disparity estimation results. However, when the sampling interval of R i is too narrow, it becomes difficult for the reliability weight to reflect precisely the correlation between R i and the correct disparity estimate. When the sampling interval of R i is 64 levels, the smallest average errors (bold values) are obtained. So, the sampling interval of R i is set to 64 levels in this experiment. The proposed Lee and Hong: Near-real-time stereo matching method using both cross-based support regions in stereo views method provides better results compared with previous algorithms except the AD-census method. 2,4,7,16,17 Table 3 shows that the performance of the proposed method is better or nearly the same as the AD-census method. 4 Using the adaptive matching window based on the intersection support regions in stereo views, the proposed method can reduce the errors caused by irrelevant pixels in the initial matching cost computation. The experimental results show that the reliability weight based on the area similarity ratio of support regions is helpful for accurate disparity estimation. A previous algorithm based on color segmentation and a plane estimation procedure is also considered. The performance of the method in the Middlebury test is high but involves significant computational load. Here, performances are evaluated in the nonoccluded region, all (including halfoccluded) regions, and regions near depth discontinuities, denoted as "non-occ," "all," and "disk," respectively. Here, λ AD , w AD , λ census , and w census are set to 0.1, 0.2, 0.8, and 1.0, respectively. Additionally, the spatial and color control parameters in EDP are set to 40 and 20, respectively. Reliability measures are used to evaluate how the reliable disparity value is obtained and to reduce the average error of the disparity map. 18,19 Two reliability measures are used to evaluate matching performance: the peak-ratio naive (PKRN) and the maximum likelihood metric (MLM). 19 PKRN computes the ratio of the minimum cost C 1 and the second minimum cost C 2 as Eq. (6). MLM obtains a probability density function for the disparity of the minimum cost. MLM measures the relative minimum cost value compared with the sum of the total minimum cost values as Eq. (7). Here, ε and σ are set to 128 and 8, respectively. More specifically, the confidence methods measure the disparity estimate's likelihood of being correct and generate reliable disparity maps by selecting among multiple hypotheses for each pixel E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 6 ; 3 2 6 ; 2 7 8 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 7 ; 3 2 6 ; 2 3 6 Figure 6 shows the confidence maps of the disparity estimation results by the proposed method. EDP in the refinement procedure replaces the matching cost distribution with a new quadric cost distribution based on the stable disparity value. Hence, we examine the disparity estimation results before performing the EDP process. In Fig. 6, the first row shows PKRN confidence maps of the depth estimation, and the second row shows MLM confidence maps. In addition, the first column shows the confidence map of the depth estimation without adaptive window and reliability weight. The second column shows the confidence map of the depth estimation  with adaptive window and no reliability weight consideration. The third column shows the confidence map of the depth estimation with both adaptive window and reliability weight. Confidence maps using PKRN and MLM are nonlinearly scaled for visualization. Here, brighter regions with high confidence values represent more reliable disparity estimates. Figure 6 shows that the proposed method with adaptive window and reliability weight can improve disparity estimation performance in the detailed parts, such as the book shelf and light stand (marked in colored circles). Table 4 shows the modular computation performance on CUDA implementation for stereo images. In many stereo matching studies, the computation loads have been evaluated based on the fixed maximum disparity levels of benchmark stereo images. 3,20 In Table 2, the resolution and the maximum disparity level of teddy images are 450 × 375 and 60, respectively, which are the same as those of Cones images. The resolution and maximum disparity level of both teddy and cones images are higher than the other two images. Overall, computation loads are highly dependent on both the image resolution and maximum disparity level. This means that less matching cost computations are performed in Tsukuba and Venus images.
Initially, cross-based support regions in stereo views and their intersection regions along the disparity level are built. The area similarity ratio of the intersection region and the census transform within the 9 × 7 window are also computed at this stage. In the initial matching cost computation step, two procedures have almost the same processing time: (1) finding the pixels in the intersection region and computing AD estimates, and (2) combining AD aggregation results with census transform results and considering the reliability weight in the matching cost computation. The proposed method obtains the final disparity map in 20.12 to 139.17 ms (7.2 to 49.7 frame∕s).
In this experiment, the average sizes of support regions in cones and teddy images are 235.01 and 454.77 pixels,  respectively. This means that there are more homogeneous color regions in teddy images than in cones images. More pixels are examined to construct the support regions in teddy images. Therefore, in teddy images, the computation time of the initial step with the support region construction is longer than that in Cones images. The size of the support region has little effect on the matching cost computation step, because the integral image of matching cost is  Table 4, the computation of the refinement step in cones images takes longer time than that in teddy images. Table 4 shows that the scanline optimization step requires longer computation time. To improve performance further, we will consider another optimization method based on parallel GPU architecture for computation efficiency. 21 In Figs. 7 and 8, the proposed method is qualitatively compared with previous methods. 17,22 The test is performed for 2005 and 2006 Middlebury database sets (aloe, laundry, rock1, and cloth4) and the book arrival sequence, 23 a realworld stereo video clip captured in an uncontrolled environment. Figures 7(c)-7(e) show the disparity maps by the MGM method, DCB grid method, and proposed method, respectively. In the MGM method, a refinement procedure is not included and the disparity results are expressed in pseudo color. However, we can see roughly matching performance by the MGM method. The disparity maps by the proposed method have more distinct boundaries of objects and are less noisy. However, further considerations to eliminate errors in the occluded and textureless regions (such as backgrounds in rock1 image) are needed. The reference images (the 1st and 20th frames) and their disparity results are shown in Fig. 8, respectively. The next two columns show the results produced by previous methods. (DCB grid's source code is downloaded from author's project website at https://www.cl.cam.ac.uk/research/ rainbow/projects/dcbgrid/, and disparity results by the MGM method are obtained using an online demo on author's website at http://dev.ipol.im/~faccilol/mgm.) The last column shows the results by the proposed method. In the same manner, the proposed method is compared with two algorithms on outdoor scene image sequence, as shown in Fig. 9.  Lee and Hong: Near-real-time stereo matching method using both cross-based support regions in stereo views

Conclusion
This paper presents a near-real-time stereo matching method using cross-based support regions in both the reference and target images. The proposed method obtains the intersection region of the cross-based support regions in stereo views. The intersection region is used as an adaptive matching window for initial matching cost computation. When the area of the support region in the reference image is similar to that in the target image, the candidate disparity is likely to be the correct disparity value. The proposed method computes the reliability weight based on this probability from the Middlebury reference database sets. Experimental results show that the proposed method with CUDA implementation provides improved matching accuracy and processing efficiency.