Range information from images is typically obtained using time-of-flight sensors,1 configurations based on pattern projection,2 illumination variation by photometric stereo,3 focus variation,4 multicamera systems,5 or light field cameras.6 Line scanning is a popular method for acquiring images of moving objects, especially in machine vision applications. From moving platforms, such as air- or spaceborne scanners, the so-called pushbroom principle is used to acquire sensor lines while moving along a predefined trajectory in space. We utilize this acquisition principle, extended to binocular stereo, for an application in ground reconstruction from a vehicular platform. The application area is the inspection of road surface conditions. Figure 1 shows a few examples of single road images as acquired by the proposed system. We will describe how to obtain depth information from line-scan stereo pairs, i.e., pairs of images taken concurrently from slightly displaced positions.
In stereo imaging, the range for each pixel is obtained from the estimated disparity, i.e., the displacement between corresponding points observed in two (or more) images. The epipolar constraint in a stereo vision system states that a point in one image is found along the corresponding epipolar line in the other image. Epipolar rectification in area-scan stereo pairs aligns epipolar lines to image lines, thus reducing the correspondence estimation to a search over an expected disparity range oriented along image lines. In the presented line-scan stereo system, one adjusts this geometrical constraint mechanically such that epipolar lines are aligned with sensor lines. Estimation of disparities is then performed along sensor lines.
We will discuss the stochastic binary local descriptor (STABLE) for disparity estimation.7 Since the introduction of the scale invariant feature transform (SIFT), a number of feature detectors and descriptors have been suggested over the last decades.8 Among others, the goal of speeding up SIFT was met in speeded up robust features (SURF).9 Some representations of local derivatives, e.g., gradient orientation histograms, are commonly used in those descriptors. Higher speed is sometimes also traded against reduced invariance properties, e.g., in binary robust independent elementary features (BRIEF).10 Efficient representations and fast matching are obtained by the family of binary descriptors. Oriented BRIEF (ORB) is an alternative to SIFT and SURF that is based on a binary description.11
STABLE belongs to a broad class of local binary descriptors, along with the census transform (CT),12 local binary patterns (LBP),13 BRIEF,10 binary robust invariant scalable keypoints (BRISK),14 or fast retina keypoints (FREAK).15 The most similar descriptor to STABLE is BRIEF;10 the main difference lies in the ability of STABLE to have more than one pair of pixels contributing to a single descriptor bit. In general, binary descriptors are known to be robust against intensity variations as relative pixel intensity comparisons are used in descriptor construction followed by bitstring matching, especially when compared to direct intensity comparison using sums of absolute or squared intensity differences. Furthermore, a higher speed could be expected from simple comparison operations.
STABLE can be related to the principle of “compressed sampling.”16 The compressed sampling theory claims that each signal with a sparse representation in some (potentially unknown) linear basis can be preserved and reconstructed from a small number of random projections. For natural images, this means that, due to the sparsity of image edges and inherent smoothness, it is sufficient to sample the image in a compressive manner without losing any significant information. While the reconstruction is not the main focus in our application, we exploit the principles of compressed sampling solely for deriving an efficient binary representation of any given pattern, i.e., for encoding the pattern into a constant number of bits that is largely independent of the pattern’s size.
We used two line-scan cameras sensitive in the visible spectrum for stereo acquisition of the road surface while the acquisition device was moving. The surface could be acquired using either one long image sensor line shared by two lenses or two collinearly arranged line-scan image sensors observing the same surface line patch. Figure 2(a) shows the selected setup using two collinearly arranged line-scan sensors observing the surface from two different viewpoints. The optical axes are verged to obtain a larger overlapping region. Some details on the geometrical setup are as follows: a baseline of , distance to the ground of , verging of the cameras of wrt the ground surface normal, and field of view of the camera lens of .
General design principles were that car driving speeds up to should be possible at lateral resolution on the order of magnitude of . The used cameras were able to achieve the required line rates of . Regarding the field of view, it was sufficient to cover a small stripe only, i.e., in the center of the region where car tires usually interact with the surface, which made it possible to restrict the line length to 1000 pixels.
Verging of the optical axis has two drawbacks. First, the object resolution decreases from left to right in one view and from right to left in the other view. Second, the limited depth of field might result in sharpness reduction depending on optical parameters and adjustment when compared to a canonical stereo system. Geometric calibration of the sensor lines ensures a constant object pixel size at the regular working distance for planar surfaces.
The depth of field was estimated to be on the order of magnitude of for a -number of 5.6, a magnification of 0.1, and a sensor pixel size of . For -numbers of 1.4 or 2.8, we would obtain a depth of field of or , respectively. Although these are quite low numbers, it turned out to be sufficient to compensate for the varying distance due to verging and the expected depth variation in road inspection.
The purpose of calibration is the alignment of the sensor lines to ensure that the plane spanned by the left optical axis and left sensor line is coplanar with the plane spanned by the right optical axis and right sensor line. This property is important to fulfill the epipolar constraint at each depth and requires a calibration procedure that ensures collinearity of the sensor lines at a number of distances. To meet this requirement, one has to ensure the collinearity of the sensors for at least two different distances. Using a calibration target similar to the one suggested by Luna et al.,17 where target patterns are present at parallel planes at different distances, one is able to determine the camera pose, including the epipolar plane orientation, of a single line-scan camera. A similar calibration target is required for line-scan stereo calibration. In our case, the concurrent mechanical adjustment of both sensor lines ensures the observation of corresponding patterns at different distances and for both cameras, i.e., the epipolar planes are the same for both sensors. Nevertheless, residual misalignment and vibrations of the system might result in problems during stereo matching. We suggest an additional correspondence search between lines adjacent to the concurrently taken sensor lines.
Stereo Image Processing
To obtain depth information from stereo image pairs, corresponding points need to be found. Corresponding points are typically identified via block matching, i.e., comparison of image patches between image pairs. Measures of block similarity include direct comparison of pixel intensities using similarity metrics such as the sum of absolute differences, the sum of squared errors, or the normalized cross correlation, as well as comparison based on measuring some distance between block feature descriptors. For descriptors such as SURF or SIFT, vector metrics in high-dimensional spaces are commonly used to quantify descriptor similarity, whereas for binary descriptors, the Hamming distance is applied in most cases.
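As a minimal illustration of the Hamming distance used for binary descriptors, the following Python sketch assumes that each descriptor is packed into a plain integer. The function names `hamming_distance` and `best_match` are hypothetical helpers introduced here for illustration only and are not part of the described system.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two descriptors packed into integers."""
    # XOR leaves a 1 wherever the two bitstrings disagree; count those ones.
    return bin(a ^ b).count("1")


def best_match(query: int, candidates: list[int]) -> int:
    """Index of the candidate descriptor with the smallest Hamming distance."""
    return min(range(len(candidates)),
               key=lambda i: hamming_distance(query, candidates[i]))
```

Because the distance reduces to an XOR followed by a bit count, matching cost is independent of the intensity scale of the underlying patches.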
Local Binary Descriptors
In general, binary descriptors have been used for tasks like texture analysis, recognition, and matching, e.g., LBP13,18 and the CT.12 In the context of local descriptors, several fast binary descriptors were also developed recently, e.g., BRIEF,10 BRISK,14 FREAK,15 etc. In our experiments, we considered the center-based descriptors CENSUS and LBP, where center-based refers to the fact that pairwise comparison always involves the central pixel, and the uncentered descriptors BRIEF and STABLE. The main difference in binary descriptors is in the sampling pattern for local intensity comparisons, which results in a binary descriptor vector. The CENSUS-dense descriptor is the only descriptor utilizing exactly all pixels in the considered matching window. We alternatively investigate the CENSUS-sparse descriptor, which uses a subsample of off-center pixels on a regular grid and compares those against the central pixel. The BRIEF descriptor uses a subsample of pixel pairs (typically sparsely) located at arbitrary positions in the matching window. The resulting descriptor lengths equal the number of pixel pair comparisons performed. Finally, with STABLE, we also get pixel pairs at random positions, but we are able to map a larger number of pixel pairs to a smaller number of descriptor bits. Figure 5 shows the compared descriptor masks (the meaning of the numbers in the mask will be explained in the next section).
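To make the notion of a center-based descriptor concrete, the following Python sketch computes a CENSUS-dense style bitstring for a single patch. It assumes an odd-sized square or rectangular patch and the common convention that a bit is set when a pixel is darker than the central pixel; the comparison direction and the function name `census_dense` are assumptions made for this illustration.

```python
import numpy as np


def census_dense(patch):
    """Compare every off-center pixel of an odd-sized patch with the center pixel.

    Returns a uint8 array of length h*w - 1 (row-major order, center skipped).
    Assumed convention: bit = 1 if the pixel is strictly darker than the center.
    """
    patch = np.asarray(patch)
    h, w = patch.shape
    center = patch[h // 2, w // 2]
    bits = (patch < center).ravel()
    center_index = (h // 2) * w + (w // 2)
    # The center compared with itself carries no information, so drop it.
    return np.delete(bits, center_index).astype(np.uint8)
```

The descriptor length grows with the window size here, which is the property that BRIEF and STABLE avoid.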
We consider an image patch of size pixels. The operation derives the th descriptor bit from patch as follows:
Figure 3 shows this operation schematically. A set of sparse filter masks from a dictionary are applied to the same image patch and, depending on the number and individual signs of the filter mask entries, a number of pixels is contributing to each descriptor bit.
A more efficient implementation of STABLE, avoiding binarized convolutions with sparse feature filters, uses a single index filter mask . This mask is of the same size as the image patch and encodes at nonzero pixel positions the position in the descriptor array and a sign. An accumulator array of size is used to perform a sign-dependent accumulation in cell of the pixel values in with corresponding filter mask index . After all accumulator cells are processed, the descriptor is derived by thresholding each cell entry of . The improved operation involving the filter index mask instead of the filter dictionary is shown in Fig. 4.
The concept of filter masks is also applicable to other binary descriptors, e.g., Fig. 5 shows filter masks corresponding to CENSUS-dense, CENSUS-sparse, LBP, BRIEF, and STABLE. The number in each cell indicates the bit to which a pixel contributes. The sign indicates whether the pixel value is taken as is () or if it is negated () when using the accumulator-based implementation scheme. The center-based descriptors in Figs. 5(a) to 5(c) utilize the central pixel for each descriptor bit, which is indicated by .
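The index filter masks for the uncentered descriptors can be sketched as follows. This is only one plausible construction and not necessarily the authors' generation procedure: it assumes that pixels are drawn in random pairs without reuse, that each pair contributes one positive and one negative entry to the same bit, and that pairs are distributed over the bits in round-robin fashion. The function name `make_index_mask` and its parameters are hypothetical.

```python
import numpy as np


def make_index_mask(window, n_bits, pairs_per_bit=None, seed=0):
    """Random signed index mask of shape (window, window).

    Entry 0 means the pixel is unused; +k / -k means the pixel value is added
    to / subtracted from the accumulator of bit k (1-based).
    pairs_per_bit=None : STABLE-like, all pixels used (all but one if odd).
    pairs_per_bit=1    : BRIEF-like, exactly one pixel pair per bit.
    """
    rng = np.random.default_rng(seed)
    n_pixels = window * window
    max_pairs = n_pixels // 2
    n_pairs = max_pairs if pairs_per_bit is None else n_bits * pairs_per_bit
    if not (1 <= n_bits <= n_pairs <= max_pairs):
        raise ValueError("bit count does not fit into the matching window")
    # Random pixel order; consecutive entries form one (+, -) pair.
    order = rng.permutation(n_pixels)[: 2 * n_pairs]
    bit_ids = np.arange(n_pairs) % n_bits + 1   # round-robin bit assignment
    mask = np.zeros(n_pixels, dtype=np.int32)
    mask[order[0::2]] = bit_ids
    mask[order[1::2]] = -bit_ids
    return mask.reshape(window, window)
```

Under these assumptions, a mask with the maximum possible number of bits assigns exactly one pixel pair to each bit, so the STABLE-like and BRIEF-like constructions coincide in that limit.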
In stereo matching, we consider a discrete range of disparities  for which the descriptors, corresponding to each image pixel position, are compared by the Hamming distance. This results in a cost stack of dimension , where and are the image dimensions and is the number of evaluated disparities. The cost stack is searched for the minimum cost at each pixel, which provides the associated disparity estimate. The cost stack is then filtered with a Gaussian kernel in the cost domain followed by filtering with a Gaussian kernel in the image domain. Finally, to efficiently obtain a subpixel accuracy disparity map, a parabola is fitted to three values around the initial integer disparity estimate, a procedure commonly applied in image processing.19
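The minimum search and the parabola-based subpixel refinement can be sketched in Python as follows. The sketch assumes a precomputed cost stack stored as a NumPy array with the disparity index on the last axis and at least three disparity levels; it omits the Gaussian filtering steps. The function name `subpixel_disparity` and the parameter `d_min` (the smallest evaluated disparity) are assumptions for illustration.

```python
import numpy as np


def subpixel_disparity(cost, d_min=0):
    """Winner-takes-all disparity with three-point parabola refinement.

    cost : array of shape (H, W, D) with matching costs, D >= 3.
    Returns a float array of shape (H, W).
    """
    cost = np.asarray(cost, dtype=np.float64)
    n_disp = cost.shape[2]
    d0 = cost.argmin(axis=2)                      # integer disparity index
    inner = (d0 > 0) & (d0 < n_disp - 1)          # both neighbors exist
    dc = np.clip(d0, 1, n_disp - 2)
    rows, cols = np.indices(d0.shape)
    c_m = cost[rows, cols, dc - 1]
    c_0 = cost[rows, cols, dc]
    c_p = cost[rows, cols, dc + 1]
    denom = c_m - 2.0 * c_0 + c_p                 # curvature of the parabola
    ok = inner & (denom > 0)
    offset = np.zeros_like(c_0)
    # Vertex of the parabola through the three samples, relative to d0.
    offset[ok] = 0.5 * (c_m[ok] - c_p[ok]) / denom[ok]
    return d_min + d0 + offset
```

Pixels whose minimum lies on the border of the disparity range, or whose three samples are flat, keep their integer estimate in this sketch.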
We will present results on synthetically disturbed data to estimate the robustness of STABLE in comparison to other binary descriptors. We also provide a detailed comparison to BRIEF, the most similar approach to STABLE, based on stereo matching performance on the Middlebury stereo dataset. Furthermore, we provide illustrative examples on real-world data of a freeway road surface. Finally, we provide run-time measurements on a GPU platform.
To evaluate the performance of the STABLE descriptor compared with other state-of-the-art local binary descriptors, we employed an evaluation scheme similar to the one suggested by Mikolajczyk and Schmid20 based on the analysis of receiver operating characteristic (ROC) curves. We extracted 1200 grayscale patterns from 48 natural images contained in the data set introduced in Ref. 20, always 25 patterns per image at random locations. Given the perturbation type, for each pattern, we introduced 25 synthetic perturbations, which gave a total number of 30,000 patches. In this study, we considered five different types of perturbations:
• Gaussian additive noise ();
• Gaussian blur ();
• shift in random direction ();
• scaling (); and
• rotation ().
We considered matching windows of size .
Given the set of 30,000 patches defined for each perturbation type, there is always a group of 25 associated perturbed versions for each patch in the data set. Making every patch a query, one can assess its Hamming distance to all patches in the data set making use of a particular feature descriptor. Knowing that for each query there are only 25 relevant elements, one can calculate the precision and recall values for all result sets associated with different thresholds put on the Hamming distance. The ROC curve is then defined by the obtained precision and recall values.
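The per-query evaluation can be illustrated with a short Python sketch. It assumes that a patch counts as retrieved when its Hamming distance to the query does not exceed the threshold and that precision is reported as 1.0 when nothing is retrieved; both conventions, as well as the function name `precision_recall`, are assumptions made here.

```python
import numpy as np


def precision_recall(distances, relevant, thresholds):
    """Precision and recall of one query for several Hamming thresholds.

    distances : Hamming distances from the query to all patches.
    relevant  : boolean flags, True for patches derived from the query pattern.
    Returns a list of (threshold, precision, recall) tuples.
    """
    distances = np.asarray(distances)
    relevant = np.asarray(relevant, dtype=bool)
    n_relevant = np.count_nonzero(relevant)
    curve = []
    for t in thresholds:
        retrieved = distances <= t
        n_retrieved = np.count_nonzero(retrieved)
        true_pos = np.count_nonzero(retrieved & relevant)
        precision = true_pos / n_retrieved if n_retrieved else 1.0
        recall = true_pos / n_relevant
        curve.append((t, precision, recall))
    return curve
```

Sweeping the threshold over all possible Hamming distances yields the points from which a curve and its area can be computed.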
In total, we compared five local binary descriptors: CENSUS-dense, CENSUS-sparse, LBP, BRIEF, and STABLE.
While for CENSUS and LBP, the descriptor size depends on the matching window, in the case of STABLE and BRIEF, the number of feature bits is defined independently from the matching window. Thus, we also looked into the relationship between matching performance, expressed in terms of the area under the ROC curve (AUC), and the descriptor size in bits. Furthermore, as both of these descriptors are generated stochastically, their performance was assessed as the average and standard deviation over 25 trials with different randomly generated filter masks. We believe this should provide a clear picture about the typical performance and stability of those stochastic descriptors.
Figure 6 shows the recognition performance obtained by different feature descriptors for a constant configuration of the descriptor size. Going from the worst to the best performing descriptors, it can be seen that the LBP provides the overall worst performance for all perturbation types. It is then followed by CENSUS-sparse and CENSUS-dense, both of which provide comparable performance despite their very different numbers of feature bits. For most perturbation types, it is then followed by BRIEF and finally by STABLE (notice the curve with circles exceeds all the other curves in most cases).
In Fig. 7, the matching performance is analyzed in relationship with the descriptor size. All descriptors with a constant number of bits are marked as points, whereas all the others are represented as curves. In this analysis, it is even more pronounced that the performance of both CENSUS descriptors and LBP is significantly worse than for STABLE and BRIEF at the respective bit counts. In the case of noise, blur, and shift perturbations, the STABLE descriptor outperforms the BRIEF descriptor, especially for medium numbers of feature bits.
The performance of STABLE versus BRIEF is documented in detail in Fig. 8. The advantage of STABLE over BRIEF is expressed in terms of the recognition performance gain defined as the ratio between the AUC values obtained by both descriptors using the same number of bits. It follows that AUC ratios above one mark the cases where STABLE outperformed BRIEF and vice versa. It is apparent that the advantage of STABLE is most pronounced for medium bit counts, while with increasing descriptor size, the difference becomes smaller as both descriptors become more similar to each other. It should be noted that at the maximum possible number of bits, both descriptors are in fact the same, as each bit is generated by just a pair of pixels. There are two cases in which STABLE significantly outperformed BRIEF, namely perturbations by (i) additive noise and (ii) blur. In the case of additive noise, a performance gain as large as 30% was obtained with 8-bit descriptors. For blur and shift perturbations, the highest AUC ratios exceeding 5% were obtained for 32-bit and 8-bit descriptors, respectively. For scale and rotation perturbations, STABLE generally performs slightly worse than BRIEF; however, the worst performance loss is still well below 5%.
Stereo Matching on Middlebury Stereo Dataset
We assessed the dense stereo reconstruction performance of STABLE versus BRIEF on real-world data. We used 10 evaluation training sets with disparity ground truth from the Middlebury Stereo Datasets 2014.21 For both STABLE and BRIEF, we used windows of size of and descriptor length of 8, 16, 32, and 64 bits. The left view served as the reference view.
As the error metric, we used the percentage of pixels with absolute disparity error greater than 2.0 (dubbed as bad 2.0). We did not include occluded pixels. For each of the 10 datasets, we performed 25 runs (each run with a different index mask for both descriptors) and recorded the best and average values for each metric. Performance gain of STABLE relative to BRIEF averaged over all datasets is shown in Fig. 9.
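The bad 2.0 metric described above can be written compactly in Python. The sketch assumes that disparity maps are NumPy arrays and that a boolean mask marks the non-occluded pixels with valid ground truth; the function name `bad_pixel_rate` and its argument names are hypothetical.

```python
import numpy as np


def bad_pixel_rate(disparity, ground_truth, valid, threshold=2.0):
    """Percentage of valid pixels whose absolute disparity error exceeds threshold.

    valid : boolean mask, True for non-occluded pixels with known ground truth.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    valid = np.asarray(valid, dtype=bool)
    bad = (np.abs(disparity - ground_truth) > threshold) & valid
    return 100.0 * np.count_nonzero(bad) / np.count_nonzero(valid)
```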
Results in Fig. 9 show that both for average and best cases and for all tested bit lengths, STABLE outperforms BRIEF. The largest performance gain (4.33% in bad 2.0) was measured for the length of 32 bits. Performance gain for 8 and 64 bits is significantly lower for both metrics. For illustration, Figs. 10(a) to 10(d) show the example where STABLE outperformed BRIEF the most and Figs. 10(e) to 10(h) show the example where STABLE was least superior to BRIEF. The green areas in the difference images in Figs. 10(c) and 10(g) depict areas where STABLE gained a better bad 2.0 score when compared to BRIEF, whereas magenta refers to a better bad 2.0 score for BRIEF. Black to white areas indicate that both methods obtained very similar bad 2.0 errors.
Road Surface Data
In this section, we present results on real-world data acquired by driving our system on a freeway. First, we compare STABLE and CENSUS-dense and provide results for STABLE with different descriptor lengths. Subsequently, we provide illustrative examples for selected features found during the road surface survey. The purpose of this survey is to assess the 3-D road surface, as poor road conditions lead to increased wear and tear on vehicles and have an impact on surface water transport, noise emission, etc.
Descriptor properties for road surface
Figures 11(a) and 11(b) show a stereo image pair depicting a top-down view of a washed concrete surface. The estimated depth maps shown in Figs. 11(c) and 11(d) are results of CENSUS-dense and STABLE with 64-bit descriptor length, respectively. The result of STABLE is less noisy (i.e., fewer “black” pixels) using just 64 bits, while achieving a qualitatively similar, or even slightly better, depth estimation as the bit long CENSUS descriptor.
Figure 12 shows the performance of STABLE with a descriptor length ranging from 16 bits to 112 bits, which is the maximum bit count possible for the matching window. While the 16-bit long descriptor still provides quite noisy results, using 32- or 64-bit descriptors improves the reconstruction quality significantly. On the other hand, increasing the size of the descriptor to full 112 bits does not seem to improve the result any further.
Finally, Fig. 13 shows the influence of spatial averaging and additive noise on CENSUS-dense and STABLE. In both cases, STABLE outperforms CENSUS-dense descriptor. The images show the estimated disparities, which are linearly related to depth measurements.
Sample images from road survey
Due to the lack of ground truth, we refer to a manual annotation of interesting properties visible to human observers and show the derived 3-D reconstruction from which these properties become clearly visible. In most of the results, there is a vertically oriented 3-D structure visible. This stems from diamond grinding, which is a pavement preservation technique used to remove surface irregularities to reduce noise and increase road safety. We applied postprocessing based on total variation (TV) regularization22 to obtain smoother 3-D renderings, shown in Fig. 14. The brighter the disparity, the closer the observed object point is to the observer.
Figure 14(a) shows an image of a ground concrete road surface with an expansion joint. The grinding stripes, as well as the expansion joint, are visible in the disparity map in Fig. 14(b). A 3-D rendering of the portion around the expansion joint is provided in Fig. 14(c). Figure 14(d) shows an image of a ground concrete pavement with a small hole; the corresponding disparity and 3-D rendering of the area of the hole are shown in Figs. 14(e) and 14(f), respectively. A grayscale image, disparity, and 3-D rendering of a larger breakout of the surface are shown in Figs. 14(g) to 14(i), respectively. Finally, an image showing two grinding lanes of different depths is provided in Fig. 14(j). Additionally, in the upper left corner, there is some material, which we assume is chewing gum, observed in the area of the deeper grinding. The disparity in Fig. 14(k) shows that the valley of the grinding is not reached at the position of this suspicious object. In the 3-D rendering in Fig. 14(l), the different grinding depths and the object are visible as well.
For computational complexity analysis, we compared STABLE and BRIEF with feature bits applied to image patches implemented using the index filter mask implementation, which was shown in Fig. 4. In general, there are two main operations required for using any of the local binary descriptors: building and matching. The matching operation is typically identical for all binary descriptors, i.e., making use of the Hamming distance applied to binary strings of length . The difference can thus be only in the computational complexity of the building operation.
Building the descriptors comprises three basic steps:
1. generating the index filter mask;
2. computing the accumulator values; and
3. binarization of the accumulator values.
The index filter mask is generated only once and can be considered an input parameter for the building operation. Therefore, this step can be omitted from our analysis. The binarization step uses the same thresholding algorithm for both analyzed descriptors and can be neglected as well. Hence, the only difference comes from the complexity of computing the accumulator values, as shown in Algorithm 1. While STABLE requires processing of elements from the index filter mask as well as from the image patch (or for odd number of pixels), BRIEF requires processing only such elements. Consequently, for a fixed , STABLE scales linearly with the number of patch pixels while BRIEF, in principle, requires only a constant time.
Algorithm 1. Computation of the accumulator values in BRIEF and STABLE using a single index filter mask.
|Require: image patch , index filter mask|
|initialize array to size with values of 0|
|for non-zero in do|
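The accumulator computation and the subsequent binarization can be sketched in Python as follows. The sketch follows the textual description of the index filter mask (signed, 1-based bit indices, with 0 marking unused pixels) and assumes thresholding at zero; the encoding details, the threshold, and the function name `build_descriptor` are assumptions. The returned count of visited mask entries illustrates why the effort scales with the number of utilized pixels.

```python
import numpy as np


def build_descriptor(patch, index_mask, n_bits):
    """Sign-dependent accumulation followed by thresholding.

    patch      : 2-D array of pixel intensities.
    index_mask : 2-D integer array of the same shape; 0 = unused pixel,
                 +k / -k = add / subtract the pixel value in accumulator k
                 (1-based, assumed encoding).
    Returns (bits, visited): the binary descriptor and the number of
    non-zero mask entries that were processed.
    """
    patch = np.asarray(patch)
    index_mask = np.asarray(index_mask)
    acc = np.zeros(n_bits, dtype=np.int64)
    visited = 0
    for value, idx in zip(patch.ravel(), index_mask.ravel()):
        if idx == 0:
            continue                      # pixel does not contribute
        visited += 1
        if idx > 0:
            acc[idx - 1] += int(value)
        else:
            acc[-idx - 1] -= int(value)
    bits = (acc > 0).astype(np.uint8)     # assumed threshold at zero
    return bits, visited
```

With one pixel pair per bit, the loop touches two pixels per bit as in BRIEF, whereas a mask without zero entries makes it touch every pixel of the patch as in STABLE.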
In practice, however, the difference between the actual execution time on CPU or GPU platforms and the theoretical one might be more in favor of STABLE due to caching in the on-chip memory. When a memory read for a cell is requested, often nearby cells are fetched and stored in the cache as well (details are hardware-dependent). To enable optimal caching, the data have to be well-organized in the memory, i.e., aligned with the hardware layout, and should be accessed using predictable memory access patterns, e.g., in the same order as they were stored. This is especially important for GPUs where the global memory latency is higher compared to the CPU memory and thus optimal utilization of the cache memory has a higher impact on the final performance. We believe that such memory caching mechanisms can be better utilized with STABLE as all elements in both index and image patch arrays are always accessed. In particular, they are accessed sequentially. Therefore, the memory access pattern can be fully optimized. On the other hand, as BRIEF uses a random-access sparse memory pattern, prediction algorithms implemented in various memory caching mechanisms are more prone to fail.
To practically measure the difference between execution times of building STABLE and BRIEF descriptors, we implemented the accumulator algorithm, described in Algorithm 1, for a CUDA-enabled GPU in C/C++. Namely, we used the CUDA Toolkit 7.5 and an NVIDIA GTX Titan GPU. As a reference, we also implemented the CENSUS-dense descriptor. The test data were a grayscale image of . The descriptors of length 32 and 64 bits were represented as packed binary strings using 32- and 64-bit integers, respectively, and were computed from windows of . Each CUDA thread computed one descriptor. Threads were arranged into thread blocks. The CUDA code was compiled with the preference on L1 cache memory size. We executed the algorithm 500 times, each time with a different randomly generated index filter mask for both descriptors. Average measured execution times are listed in Table 1.
Table 1. Average execution time measurements of descriptor generation for BRIEF, STABLE, and CENSUS-dense run on a GPU. A comparison relative to BRIEF is shown on the right side of respective columns.
|Parameters|Descriptor|Utilized pixels|Total time (ms)|Time per util. pixel (ns)|
Results in Table 1 show that total execution time of STABLE, in comparison to BRIEF, is lower than expected merely from the number of utilized pixels for both 32- and 64-bit lengths. When considering execution time per utilized pixel, execution time for STABLE is even lower by 44% and 22% for 32- and 64-bit length, respectively. This strongly points to a better utilization of GPU’s hardware memory caching.
In this paper, we have introduced the STABLE descriptor, suitable for high-performance dense stereo matching, for the application of line-scan stereo matching. STABLE relates to the compressed sensing theory for efficient representation of image patterns. We showed that STABLE provides significantly better matching quality with respect to the efficiency of data representation, as the pattern information is preserved in a highly compressed binary form.
Compared with other state-of-the-art binary descriptors, our descriptor achieves the same matching quality with considerably fewer descriptor bits required, or alternatively, significantly better matching quality making use of the same number of descriptor bits. This could be advantageous in storage- and/or memory-limited environments. STABLE offers increased stability and robustness, especially in the cases where data are subject to noise, blur, and/or slight misplacement, which is often observed in practice. In all of the considered data, i.e., synthetically perturbed data from the set introduced in Ref. 20, the Middlebury Stereo Dataset 2014 (Ref. 21) and real-world line-scan stereo data, encouraging results were achieved. Promising illustrative examples from the real-world road survey application were provided.
Unlike some other descriptors, the descriptor size and the matching window are defined independently in STABLE. Moreover, STABLE always utilizes all pixels of the given matching window for producing the required number of feature bits, which makes it suitable for many practical applications where a trade-off between the descriptor size, due to computational performance limitations, and the overall matching performance is necessary. Yet another indication of the same is that STABLE surpasses other analyzed descriptors predominantly in a small-medium range of feature bits.
Although STABLE requires more operations to compute than BRIEF for the same window size and bit length, it runs in less time per utilized pixel on a GPU, as it can take better advantage of the GPU’s memory caching mechanisms. Comparable results are expected on different computing platforms implementing similar caching mechanisms.
We have demonstrated that the proposed descriptor works very well for a broad class of natural patterns and that the inherent sparsity of those patterns satisfies the assumptions of the compressed sensing theory. One direction of our future research will go toward ways of mitigating certain matching artifacts originating from a typically rectangular matching window, where each pixel is utilized precisely one time. The calibration of line-scan stereo, which so far has been solved only by a mechanical camera adjustment, will also be considered in more detail in future investigations.
Kristián Valentín received his PhD in computer science from Comenius University in Bratislava, Slovakia, in 2015. Since 2014, he has worked at AIT, Vienna, Austria, in the field of computational imaging and computer vision.
Reinhold Huber-Mörk received his PhD in computer science from the University of Salzburg, Austria, in 1999. Since then he has worked at the Aerosensing GmbH, Oberpfaffenhofen, Germany, in remote sensing image analysis, at the Advanced Computer Vision GmbH, Vienna, Austria, in computer vision, and in 2006 he joined the AIT, Vienna, Austria, where he is currently a senior scientist in the field of machine vision.
Svorad Štolc received his master’s degree in computer science from Comenius University, Bratislava, in 2002, and his PhD in bionics and biomechanics from the Technical University of Košice and Slovak Academy of Sciences, Bratislava, in 2009. He is a researcher at the Digital Safety and Security Department of AIT GmbH, Vienna. His main research areas are image processing and computational imaging.