Full high-definition real-time depth estimation for three-dimensional video system

Abstract. Three-dimensional (3-D) video gives viewers a strong sense of visual depth, but it also introduces large data volumes and complex processing. The depth estimation algorithm is especially complex and is an obstacle to real-time system implementation. Meanwhile, high-resolution depth maps are necessary to provide good image quality on autostereoscopic displays, which deliver stereo content without the need for 3-D glasses. This paper presents a hardware implementation of a full high-definition (HD) depth estimation system that is capable of processing full HD resolution images with a maximum processing speed of 125 fps and a disparity search range of 240 pixels. The proposed field-programmable gate array (FPGA)-based architecture implements a fusion strategy matching algorithm for an efficient design. The system achieves high efficiency and stability by using a full pipeline design, multiresolution processing, synchronizers that avoid clock domain crossing problems, and efficient memory management. The implementation can be included in video systems for live 3-D television applications and can be used as an independent hardware module in low-power integrated applications.


Introduction
By using dense depth information, three-dimensional (3-D) video systems 1 such as 3-D Blu-ray and 3-D television (TV) sets provide stereo video, which has multiple views for different viewers. The results of depth estimation can also be used for driving assistance in autonomous vehicles and for robot navigation. In 3-D TV and free-viewpoint TV (FTV) 2 applications, depth image based rendering (DIBR) 3 is used to generate arbitrary virtual views. The partial content of each view is synthesized into one image, which includes eight or more views for an autostereoscopic display. A resolution reduction of each viewing zone is therefore inevitable for multiview 3-D, and high-definition input is necessary to provide high image quality under this condition.
Depth estimation has been thoroughly studied, and a broad spectrum of approaches is summarized and compared by Scharstein and Szeliski. 4 Hardware design based on a digital signal processor, 5 an application-specific integrated circuit (ASIC), 6,7 or a field-programmable gate array (FPGA) is better suited to real-time depth estimation and embedded design than PC-based design. State-of-the-art implementations continually improve the performance. Jin et al. 8 present an FPGA-based real-time stereo vision system that processes 640 × 480 images with a disparity range of 32 pixels at 230 fps. Considering only the pixel throughput, this system would be adequate for 1920 × 1080 at 34 fps. Riechert et al. 9 present a software algorithm that is capable of processing 1920 × 1080 disparity maps in real time, but on a system with two general-purpose CPUs and two high-end graphics processing units (GPUs), which is not suitable for embedded applications. The software algorithm presented by Mei et al. 10 achieves good accuracy and is currently the second-ranked performer in the Middlebury benchmark. 11 However, it is implemented on a PC with a GPU card and processes a low-resolution sequence (e.g., 512 × 384, 60 disparity levels) at ∼10 fps.
Researchers implement and evaluate various efficient matching algorithms in order to obtain high-performance depth estimation based on hardware design. Scharstein and Szeliski's taxonomy 4 has summarized the stereo matching algorithms. Compared to global stereo matching algorithms (e.g., graph cut 12 and belief propagation (BP) 13), whose calculation is complex, local (area-based) algorithms have hardware-friendly characteristics. In early research, a single matching strategy, such as the sum of absolute difference (SAD) based 14-17 or census based 18-21 approach, was widely used in hardware designs, and other matching methods, such as the dynamic programming algorithm 22 and BP, 23 were implemented on GPU platforms. Previous works illustrate that a hardware solution provides real-time processing. However, a single local method inevitably introduces errors under some conditions. For example, the census transform (CT) method introduces errors at repeated edges. Low image resolution is another problem in the available implementations. With the increasing requirements for accuracy and image size, two strategies have appeared in hardware implementations. One is the semi-global matching (SGM) method. The other is a fusion algorithm, which combines more than one algorithm to take advantage of different methods in the implementation.
Considering the analysis above, a high-definition depth estimation (HDDE) system, which is a real-time FPGA implementation, is proposed in this paper. It is capable of processing full HD (1920 × 1080) content with a maximum processing speed of 125 fps and a maximum disparity search range of 240 pixels. The HD depth map enhances the quality of the stereo content, and the large depth range ensures that objects close to the cameras can be measured. A fusion matching strategy is used and implemented. The method includes a multiresolution operation to support full HD video input and uses a synchronizer design to overcome the instability introduced by clock domain crossing (CDC). Finally, in order to evaluate the performance of the design, the resource usage and power consumption are analyzed, and the mega disparity evaluation per second (MdeS) is used to illustrate the overall performance.
The remainder of this paper is structured as follows. Section 2 analyzes the algorithms of the depth estimation, which include the fusion strategy stereo matching algorithm and a multiclock domain operation. Section 3 summarizes the details of the hardware implementation of the HDDE system. Results of the implementation are discussed in Sec. 4. A conclusion is given in Sec. 5.

Depth Estimation
In a stereo vision system, image matching algorithms are important. The task of a stereo vision algorithm is to analyze the images captured by a pair of cameras and to extract the shift of each object between the two images. This shift is counted in pixels and is called the disparity $d$. According to the geometry constraints, the real-world depth is $Z = bf/d$, where $b$ and $f$ are the baseline and focal length of the camera pair. The flow of the proposed depth estimation includes preprocessing, image matching, and postprocessing. This section describes the fusion matching method, which adopts the idea that a combined image matching measure reduces the errors caused by individual measures. An FPGA implementation is used because it is suited for consumer applications in terms of size, cost, and power consumption. This is one main motivation for implementing this algorithm.
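As a small worked illustration of the relation $Z = bf/d$, the following sketch converts a disparity in pixels to a depth. The function name and the rig parameters (baseline in meters, focal length in pixels) are hypothetical values chosen only for the example; they are not taken from the HDDE system.

```python
def depth_from_disparity(d_pixels, baseline_m, focal_px):
    """Return depth Z = b * f / d, or None when the disparity is not positive."""
    if d_pixels <= 0:
        # Zero disparity corresponds to a point at infinity.
        return None
    return baseline_m * focal_px / d_pixels


if __name__ == "__main__":
    # Hypothetical rig: 0.5 m baseline, focal length of 1000 pixels.
    for d in (240, 60, 10):
        print(d, depth_from_disparity(d, 0.5, 1000.0))
```

Because depth is inversely proportional to disparity, a large disparity search range is what allows objects close to the cameras to be measured.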

Stereo Matching
Based on the requirements of the hardware design, the algorithms adopted in the fusion strategy should have mutually reinforcing features and the potential for parallel processing. SAD and CT both belong to the local (area-based) methods. Table 1 compares CT and SAD with respect to different characteristics. The CT approach has advantages in being bias-independent, in handling homogeneous areas, and in low hardware complexity, while SAD has an advantage in feature-rich areas, especially in regions with repeated texture. They are strongly complementary to each other and, hence, have great potential for fusion in terms of matching efficiency.
The initial matching cost calculation includes two parts: the Hamming distance obtained from the census transformed image and the SAD value based on the original image. The hybrid matching cost 10 can be expressed by Eq. (1):
$$C(P,d) = \rho[C_{Census}(P,d), \lambda_{Census}] + \rho[C_{SAD}(P,d), \lambda_{AD}], \qquad (1)$$
where $\lambda_{AD}$ and $\lambda_{Census}$ are the integration parameters, which can be adjusted to control the influence of outliers. The two individual cost values $C_{Census}$ and $C_{SAD}$ are computed separately. $\rho(C, \lambda)$ can be chosen as $1 - \exp(-C/\lambda)$. With this normalization, Eq. (1) will not be severely biased by one of the measures.
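The normalization in Eq. (1) can be sketched in a few lines. This is a floating-point software model only, not the hardware datapath, and the default values of `lam_census` and `lam_ad` are placeholders assumed for illustration; the paper does not state the values used.

```python
import math


def rho(cost, lam):
    """Map a non-negative cost into [0, 1) with 1 - exp(-cost / lam)."""
    return 1.0 - math.exp(-cost / lam)


def fused_cost(c_census, c_sad, lam_census=30.0, lam_ad=10.0):
    """Hybrid cost of Eq. (1): sum of the two normalized costs."""
    return rho(c_census, lam_census) + rho(c_sad, lam_ad)


if __name__ == "__main__":
    # A very large SAD value cannot dominate: each term saturates below 1.
    print(fused_cost(c_census=5, c_sad=20))
    print(fused_cost(c_census=5, c_sad=20000))
```

Since each term is bounded by 1, the fused cost is bounded by 2, which is the sense in which neither measure can severely bias the result.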
CT is a nonparametric local transform proposed by Woodfill 24,18 in the early 1990s. Compared to a conventional algorithm, it can avoid noise between image pairs introduced by different cameras and simplify the hardware design by using integer calculation. In particular, the matching performance is high in regions with pronounced structural features, e.g., areas near object boundaries. The CT value of a pixel is a bit array that summarizes the local image structure in a specified window. The CT value of the left/right view, $I'$, is $I'(u,v) = \bigotimes_{n \in N} \bigotimes_{m \in M} \xi[p(u,v), p(u+n, v+m)]$, where the operator $\bigotimes$ denotes bitwise concatenation, $M \times N$ is the mesh window size, and $\xi(p_1, p_2)$ is 1 when the original pixel value $p_1$ is bigger than $p_2$ and 0 otherwise. $(u,v)$ are the coordinates of the corresponding pixel. If $k$ is the number of bits of each CT value, then $k = w \times w - 1$ for a window size of $w \times w$. Figures 1(a) and 1(b) show an example of CT where the mesh window is 3 × 3. The Teddy image and its CT image are shown in Figs. 1(c) and 1(d).
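A minimal software sketch of the transform defined above is given below. It assumes a square $w \times w$ window, a row-major scan order for the bit concatenation, and a value of 0 for border pixels where the window does not fit; these choices are illustrative assumptions and need not match the bit ordering of the hardware.

```python
def census_transform(img, w=3):
    """Census transform of a 2-D list of intensities with a w x w window.

    Each output value packs w*w - 1 bits; a bit is 1 when the center pixel
    is strictly greater than the neighbor. Border pixels are left at 0.
    """
    r = w // 2
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for v in range(r, rows - r):
        for u in range(r, cols - r):
            center = img[v][u]
            bits = 0
            for m in range(-r, r + 1):
                for n in range(-r, r + 1):
                    if m == 0 and n == 0:
                        continue  # the center is not compared with itself
                    bit = 1 if center > img[v + m][u + n] else 0
                    bits = (bits << 1) | bit
            out[v][u] = bits
    return out


if __name__ == "__main__":
    patch = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    # Center 5 is greater than 1, 2, 3, 4 and not greater than 6, 7, 8, 9.
    print(format(census_transform(patch)[1][1], "08b"))  # 11110000
```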
Different window sizes result in transform values with different lengths, which affects the matching results. Meanwhile, the computational complexity increases with the window size. Figures 2(a) and 2(b) show, respectively, the percentage of correct disparity estimates and the time consumption on a dual-core 2.67 GHz PC for window sizes from 7 × 7 to 37 × 37. They show that with an increasing transform window, the accuracy of the depth map does not increase linearly, while the computation increases markedly. Therefore, a tradeoff is needed between the window size and the computational complexity in the hardware design.
After running CT, the Hamming distance, $\mathrm{Hamming}[I'_1(u,v), I'_2(u+d,v)]$, is used as one part of the initial matching cost, where $I'_1$ and $I'_2$ are CT values. The value of the Hamming distance is the number of 1 bits in the bitwise exclusive OR of a pixel pair.
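The Hamming distance between two CT values can be sketched as an exclusive OR followed by a bit count. The helper below is a software model with an assumed name; in hardware the bit count would be an adder tree rather than a string operation.

```python
def hamming(ct_a, ct_b):
    """Number of bit positions in which two census values differ."""
    return bin(ct_a ^ ct_b).count("1")


if __name__ == "__main__":
    # The two values differ in bit 6 and bit 0.
    print(hamming(0b11110000, 0b10110001))  # 2
```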
Another well-known local matching method is SAD. 25 Its cost function $C_{SAD}$ is based on the absolute difference of the pixels' intensity values. In the fusion steps, different combinations of the two matching methods can be used. The normalization method and other methods are used to control the proper weight of the basic matching costs. After this combination, the matching costs are passed to the disparity optimization stage.
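For illustration, a windowed SAD cost can be sketched as follows. The window size, the function name, and the use of $u + d$ as the matching position in the second image (mirroring the convention of the Hamming term) are assumptions made for this example; the exact expression used in the paper is not reproduced here.

```python
def sad_cost(img1, img2, u, v, d, w=3):
    """Sum of absolute intensity differences over a w x w window.

    img1 is the primary image and img2 the secondary image; the window in
    img2 is displaced horizontally by the candidate disparity d.
    """
    r = w // 2
    total = 0
    for m in range(-r, r + 1):
        for n in range(-r, r + 1):
            total += abs(img1[v + m][u + n] - img2[v + m][u + n + d])
    return total


if __name__ == "__main__":
    row1 = [10, 20, 30, 40, 50, 60]
    row2 = [0, 10, 20, 30, 40, 50]   # row1 shifted right by one pixel
    left = [row1] * 3
    right = [row2] * 3
    print(sad_cost(left, right, u=2, v=1, d=1))  # 0 at the true shift
    print(sad_cost(left, right, u=2, v=1, d=0))  # 90 at the wrong shift
```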

Multiscale Processing
High-resolution image processing is one of the trends in 3-D video, 26 and it is necessary to provide a high-quality image on an autostereoscopic display. In contrast to natural video signals, depth maps are characterized by piecewise smooth regions bounded by sharp edges. A smooth interior surface is beneficial for a multiscale operation. We find that a depth map retains higher quality after interpolation than a normal texture image. Figure 3 shows the difference in interpolation quality between the depth map and the texture image for different interpolation methods. Methods 1 to 7 represent, respectively, nearest, bilinear, bicubic, box, lanczos2, lanczos3, and spline interpolation. 27 The figure shows that whichever interpolation is used, the depth map performs better than the texture image. Joint bilateral filtering 27 upsampling requires not only the low-resolution image data but also an additional guidance image for separately calculating the spatial filter kernel and the range filter kernel, so it results in additional storage consumption and high complexity. Therefore, the joint bilateral filter is not used for upsampling in this paper. The choice of method in this design depends not only on the quality of the results but also on the implementation complexity. Nearest interpolation is adopted in this design, and synchronizers are used in the hardware design to avoid metastability problems, since multiple clock domains are involved in the different resolution operations.
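Nearest interpolation of a low-resolution disparity map can be sketched as plain pixel replication. The scale factor of 2 and the optional rescaling of the disparity values by the same factor (because disparity is measured in pixels of the image it belongs to) are assumptions of this sketch; the paper does not state the factor or whether the values are rescaled at this point.

```python
def upsample_nearest(disp, factor=2, scale_values=True):
    """Replicate each disparity sample into a factor x factor block."""
    out = []
    for row in disp:
        wide = []
        for val in row:
            new_val = val * factor if scale_values else val
            wide.extend([new_val] * factor)
        for _ in range(factor):
            out.append(list(wide))  # repeat the widened row vertically
    return out


if __name__ == "__main__":
    low = [[1, 2],
           [3, 4]]
    for row in upsample_nearest(low):
        print(row)
```

Pixel replication needs no multipliers and no guidance image, which is consistent with the complexity argument made above.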
In addition to the above-mentioned image matching algorithm and multiscale operation, refining processes are also used to enhance the performance of the depth estimation. Improved cross-based regions are used for efficient cost aggregation. Support regions based on cross-skeletons allow fast aggregation with middle-ranking disparity results. As an optimizer, winner-takes-all (WTA) is used to produce the initial disparity map. A refining process that corrects various disparity errors is then used to improve the disparity results. The left-right consistency check marks as occluded the points where the two disparity maps are not negatives of each other. 8 Subpixel disparity 8 is another postprocessing method, which uses a parabolic fitting to generate the disparity with subpixel accuracy. Filtering can also be used to achieve subpixel-level smoothing. These steps are used to resolve problems such as mismatches caused by occlusions, disparity edges that are not aligned with object boundaries, and outliers.
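The parabolic fitting mentioned above can be sketched with the standard three-point formula: a parabola is fitted through the cost at the winning disparity and its two neighbors, and the location of its minimum is taken as the refined disparity. The function name and the fallback to the integer disparity at the range borders are assumptions of this sketch.

```python
def subpixel_disparity(costs, d):
    """Refine integer disparity d using a parabola through three costs."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)  # no neighbor on one side, keep the integer value
    c_prev, c_mid, c_next = costs[d - 1], costs[d], costs[d + 1]
    denom = c_prev - 2 * c_mid + c_next
    if denom <= 0:
        return float(d)  # flat or non-convex neighborhood, no refinement
    return d + (c_prev - c_next) / (2.0 * denom)


if __name__ == "__main__":
    # Costs sampled from a parabola whose minimum lies at 2.3.
    costs = [(x - 2.3) ** 2 for x in range(5)]
    print(subpixel_disparity(costs, 2))
```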

Hardware Implementation
The top-level block diagram of the proposed depth estimation hardware architecture is summarized in Fig. 4. The HDDE system involves eight submodules, internal block random access memories (BRAMs), and an external double data rate 2 (DDR2) memory for data buffering.
The source image data streams are captured from professional binocular cameras with a serial digital interface, and the data format is converted from YUV to 8-bit gray-scale intensities. When the depth estimation is completed, the depth maps corresponding to the original texture images are sent out for further processing such as coding. BRAMs are used to store rows of image data for the window operations in the algorithm implementation. The external memory bandwidth is important for the depth estimation of high-resolution images. The proposed memory management module controls the bandwidth and the data allocation scheme based on an asynchronous first-in first-out (FIFO) architecture. The eight submodules involved in the system consist of three main submodules and five other submodules. These modules provide functions such as data control, data transform, and further processing.

Matching Core Implementation
The matching core is the key architecture of the proposed hardware fusion algorithm implementation, and it adopts a parallel design to increase the throughput. The CT value of each pixel is calculated by comparing the center pixel with the pixels around it in the transform window. To meet this design requirement, it is necessary to provide as many line buffers as the window height, each with a word width of 8 bits. Meanwhile, the process should be finished in one pixel clock to ensure real-time processing. The Altera intellectual property (IP) core of a dual-port RAM was used to realize the line buffers for the window operation. The size of every line buffer is determined by the size of the input image. As shown in Fig. 5, a RAM with read and write ports forms each line buffer. The number of line buffers is determined by the window height $h$, and the depth of each line buffer is determined by the horizontal image resolution. In the experiments, full HD images (1920 × 1080 pixels) were processed, so nine line buffers are used for 9 × 9 windows and the minimum memory depth of each line buffer is 1920 words, where the word width is 8 bits. The outputs of the line buffers are merged by using register arrays to form the data for the window operation, as shown in Fig. 6. The diagram gives an example of a 5 × 5 window operation, which is composed of $w \times h$ cascaded registers. The CT diagram in Fig. 6 is used to get the CT value of one pixel serially. One CT value is 80 bits for a 9 × 9 window and is saved in a memory unit with an 80-bit word width. Figure 7 shows the processing architecture of the CT, which consists of comparators and outputs bit arrays with $m = w \times h - 1$ bits. As the output of this submodule, the original image data and the CT value are fed to the next stage.
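The sizing rules in this paragraph can be checked with a short calculation. The helper name and the returned field names are hypothetical, and the figure is a lower bound on raw storage only; it does not account for how a particular FPGA family packs words into its memory blocks.

```python
def line_buffer_budget(image_width, window_w, window_h, pixel_bits=8):
    """Minimum storage implied by a window_w x window_h window operation."""
    return {
        "line_buffers": window_h,                  # one buffer per window row
        "depth_words": image_width,                # one word per pixel in a row
        "total_bits": window_h * image_width * pixel_bits,
        "window_registers": window_w * window_h,   # cascaded register array
        "ct_bits": window_w * window_h - 1,        # center pixel excluded
    }


if __name__ == "__main__":
    budget = line_buffer_budget(1920, 9, 9)
    for key, value in budget.items():
        print(key, value)
```

For the full HD case with a 9 × 9 window this reproduces the nine line buffers of 1920 words and the 80-bit CT value quoted above.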
The initial matching cost value of each pixel is calculated by mixing the Hamming distance of the CT values and the SAD value determined by the original intensity values. In the matching process, the current pixel and all pixels within the disparity range in the reference image take part in the matching cost calculation, which generates $L_n = d_{max} - d_{min} + 1$ matching cost values for each pixel. To ensure real-time processing, all $L_n$ matching costs of the current pixel should be output simultaneously. Parallel processing of the costs for each pixel is achieved through cascaded register units. The hardware architecture of the Hamming distance module and the SAD module consists of adder and comparator logic. Figure 8 shows the cost fusion calculation architecture. The parameters are used in fusing the CT and SAD costs. Figure 9 shows the logic design of the Hamming distance, which uses a fine-grained pipeline method. The example shows the comparison of two 16-bit CT values by a three-level pipeline structure.

To further improve the matching accuracy, cost aggregation in a specified window is carried out in each of the disparity planes. In order to keep a good trade-off between quality and processing time, a 5 × 5 square window aggregation is used. The window operation in the aggregation processing is the same as in the CT and uses line buffers and register arrays. A five-stage pipeline is adopted. The first-stage outputs are 10 bits and the second-stage result is 11 bits, so a 14-bit result is output in the final stage. The cost results are then passed to the optimization stage to obtain the disparity.
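The bit widths quoted for the aggregation pipeline follow from a pairwise adder tree in which every stage adds one bit. The sketch below assumes 9-bit input costs, which is inferred from the 10-bit first-stage output rather than stated in the text, and a binary tree over the 25 values of the 5 × 5 window.

```python
import math


def adder_tree_widths(input_bits, n_inputs):
    """Output width after each stage of a pairwise (binary) adder tree."""
    stages = math.ceil(math.log2(n_inputs))
    return [input_bits + s for s in range(1, stages + 1)]


if __name__ == "__main__":
    # 25 costs of 9 bits each: five stages, growing from 10 to 14 bits.
    print(adder_tree_widths(9, 25))  # [10, 11, 12, 13, 14]
```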

Multiclock Domain Design
The design has to include several different clock domains because the FPGA chip and the peripheral devices have different operating frequencies; another reason is the multiscale operation, which is used to improve the processing efficiency for high-resolution video. When data are transferred from one clock domain to another, the received data may contain errors because of metastability, so a synchronizer is necessary for stable data communication. Based on the metastability characteristics of a synchronizing flip-flop, synchronizer reliability is typically expressed in terms of the mean time between failures (MTBF). 28 It can be calculated from the following quantities: $\tau$, the settling time of the flop; $T_W$, a parameter related to its time window of susceptibility; $f_s$, the synchronizer's clock frequency; and $f_d$, the frequency of pushing data across the clock domain boundary. In the proposed design, synchronizers are used in the multiscale operation and in the DDR2 memory management to improve the system stability. Besides the synchronizer above, an asynchronous FIFO is suitable for data bus synchronization, and a typical asynchronous FIFO structure is used. 29 In the DDR2 write management of the HDDE system, a data buffer consists of the FIFO memory unit and the FIFO control module, which generates the control signals.
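As a hedged illustration, the commonly used textbook form of the synchronizer MTBF is $e^{S/\tau} / (T_W f_s f_d)$, where $S$ is the time allowed for the flop to resolve. The symbol $S$ (the `resolve_time` argument below) does not appear in the parameter list above and is an assumption of this sketch, as are the numeric values, which are placeholders and not measurements of the target device.

```python
import math


def synchronizer_mtbf(resolve_time, tau, t_w, f_s, f_d):
    """Textbook MTBF estimate in seconds: exp(S / tau) / (T_W * f_s * f_d)."""
    return math.exp(resolve_time / tau) / (t_w * f_s * f_d)


if __name__ == "__main__":
    # Placeholder numbers: 5 ns resolve time, 50 ps time constant,
    # 100 ps susceptibility window, 200 MHz clock, 50 MHz data rate.
    mtbf_s = synchronizer_mtbf(5e-9, 50e-12, 100e-12, 200e6, 50e6)
    print("MTBF in years:", mtbf_s / (3600 * 24 * 365))
```

The exponential dependence on the resolve time is the reason an additional synchronizer stage, which adds one more clock period of settling, improves reliability so strongly.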

Memory Organization
High-resolution processing and complex algorithm processing need a large amount of memory, so memory management and organization are important parts of the HDDE system. The external DDR2 memory is used as the frame buffer, and BRAM is used for slice storage. In order to realize the communication between the FPGA internal data and the DDR2 memory, a data input/output buffer module is set up to resolve the CDC problem.
As shown in Fig. 10, write control in the internal clock domain, memory buffer, and read control in DDR2 clock domain compose the DDR2 data input buffer.In addition, the data bit width has been adjusted to make efficient use of the DDR2 data transfer ability.The data that were read under read control were transferred to DDR2 through a direct memory access module.The DDR access flow is shown in Fig. 11.As mentioned above, there are similar modules that are responsible for inverse data access.

Other Submodules
Other submodules, such as the input signal transform, WTA, postprocessing, or visual quantification module, are implemented using logic gates and logic units, such as comparators, counters, multiplexers, and finite-state machines.
Figure 12 shows the architecture of the basic unit of a WTA module, which includes a comparator and multiplexer.
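The behavior of a chain of such comparator and multiplexer units can be modeled in software as a running minimum. The function name and the tie-breaking rule (the first minimum is kept) are assumptions of this sketch.

```python
def wta(costs, d_min=0):
    """Winner-takes-all: return (disparity, cost) of the smallest cost."""
    best_cost, best_d = costs[0], d_min
    for i in range(1, len(costs)):
        if costs[i] < best_cost:          # comparator
            best_cost = costs[i]          # multiplexer selects the new cost
            best_d = d_min + i            # and the matching disparity index
    return best_d, best_cost


if __name__ == "__main__":
    print(wta([5, 3, 7, 3]))  # (1, 3): the first of the two equal minima
```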
Figure 13 shows the design of the median filter, which can reduce the random noise introduced in the disparity map assignment stage. Optional modules are the disparity-to-depth conversion module and the depth quantization module. If the camera pair is fixed, the camera intrinsic and external parameters are constant and the depth in the real world is in fixed inverse proportion to the disparity. The depth quantization can be expressed as $I_d(z) = \mathrm{round}\{255\,[(1/z) - (1/z_{max})] / [(1/z_{min}) - (1/z_{max})]\}$. Depth representation has two major advantages. First, because depth is the distance between the object and the camera, it avoids the dependence of disparity on the camera parameters, which is favorable for free-viewpoint rendering based on a depth map. Second, with this quantization an object close to the camera has a fine resolution and an object far from the camera has a coarse resolution, which matches the characteristics of human vision. Since implementing the formula directly would consume many resources, a look-up table is used to realize the conversion.
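A look-up table of this kind can be sketched by combining $Z = bf/d$ with the quantization formula, so that $1/z = d/(bf)$ is evaluated once per disparity level. The function name, the clamping of out-of-range depths to the nearest plane, and the numeric parameters are assumptions made for this illustration.

```python
def build_depth_lut(d_max, baseline, focal_px, z_min, z_max):
    """Table mapping each integer disparity 0..d_max to an 8-bit depth code."""
    inv_near, inv_far = 1.0 / z_min, 1.0 / z_max
    lut = []
    for d in range(d_max + 1):
        # 1/z = d / (b * f); zero disparity is treated as the far plane.
        inv_z = d / (baseline * focal_px) if d > 0 else inv_far
        inv_z = min(max(inv_z, inv_far), inv_near)   # clamp to [z_min, z_max]
        code = 255.0 * (inv_z - inv_far) / (inv_near - inv_far)
        lut.append(int(round(code)))
    return lut


if __name__ == "__main__":
    # Hypothetical rig with b * f = 100, near plane 0.5 and far plane 10.
    lut = build_depth_lut(240, baseline=1.0, focal_px=100.0, z_min=0.5, z_max=10.0)
    print(lut[10], lut[100], lut[200])
```

Because the table is indexed by disparity, the conversion costs one memory read per pixel and no divider is needed in the datapath.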

Results and Discussion
The proposed depth estimation system, the HDDE system, is implemented, and a stereo vision system based on it is set up for performance evaluation. Panasonic AG-3DA1MC professional binocular cameras are used to provide the full HD (1920 × 1080 pixels) video streams. An autostereoscopic eight-view LCD display is used for visual evaluation. The HDDE system is implemented on an EP2AGX260EF29C4 of the Arria II GX family, which is mounted on a printed circuit board (PCB) with a peripheral component interconnect (PCI) interface. Following the real-time depth map generation, a server PC is used for the remaining processing (e.g., DIBR). Part of the system is shown in Fig. 14, which also shows the FPGA-based PCB.
In the experimental video system, the depth map can be directly displayed on the LCD TV or can be used to generate a virtual image based on the DIBR.The autostereoscopic display shows the real-time stereo video based on the original images and synthesized virtual images.The visible depth map and stereo video effect are utilized to provide a direct measure of the depth estimation performance.
The HDDE system is capable of processing full HD content with a processing speed of 125 fps and a disparity search range of 240 pixels.This section will summarize the implementation results and resource consumption of the design, analyze the depth map quality, and, finally, further analyze the performance of the whole system through power analysis and data processing calculations.The core module of the HDDE system mentioned below refers to design modules and does not include peripheral interface IPs and management modules.The implemented system includes the core module, the necessary interface IP module, the external memory management, etc.

Hardware Implementation Result
The HDDE system has been implemented using Verilog, simulated using Mentor Graphics ModelSim 6.6, and executed using the Quartus II tool. In addition to the complete design, a core module has been developed to allow an easy migration to other FPGA or ASIC technologies. The resource consumption of the core module and of the HDDE system implementation is shown in Table 2. The internal storage consumption mainly comes from the CT and the matching cost aggregation, as well as from the filter window operations, which need line buffers for parallel calculation. The storage requirement therefore increases with the resolution and the processing window size. Taking into account quality and resource consumption, an appropriate window size can be determined; e.g., a size between 9 × 9 and 17 × 17 can be chosen for CT. Compared to the core module, the logic units and the BRAM of the HDDE system are slightly increased because of the additional external communication interface IP. The phase-locked loops (PLLs) are used for setting up the multiple clock domains.
Fig. 13 Circuit scheme of median filter.
Li, An, and Zhang: Full high-definition real-time depth estimation for three-dimensional video system. Optical Engineering.

Depth Map Analysis
In order to evaluate the depth estimation results, depth maps of real scenes collected directly from the implemented system are used. Figure 15 shows results captured in the implemented video system. As can be seen, there is high accuracy in texture-rich regions, which makes the result suitable for real scene reconstruction. The results show that both regions with repeating texture and areas near object edges are handled well, which also illustrates the reliability of the algorithm implementation.
We use the Middlebury data sets to measure the algorithm performance; the images Venus, Teddy, and Tsukuba are evaluated. Figure 16 shows the results, and the last row is the disparity image obtained by the proposed method, which also includes the postprocessing mentioned above. The disparity maps have high accuracy in highly textured regions, which is the advantage of the SAD algorithm. Good edge information illustrates the characteristics of the CT method.

Further Analysis of the Performance
A hardware-based design, especially a semiconductor design, can achieve low power consumption in the application system. Power consumption is a critical concern in hardware design, especially under the trend toward equipment miniaturization. The Altera PowerPlay Power Analyzer is used to analyze the power consumption of our HDDE system. Table 3 shows the results. Compared to the typical power of an Intel Core i7 processor, which is ∼338 W under working conditions, 30 the power consumption of the HDDE system in execution is only ∼3.2 W. The power consumption of the core module is only 801 mW.
Reducing the resource usage of the FPGA or reducing the operating clock frequency can further reduce the power consumption. A single aspect, e.g., image resolution, is not sufficient to measure the performance of a hardware implementation of depth estimation. The performance of the system can be illustrated by its data processing capability. Therefore, this paper uses megapixels per second and mega disparity evaluations per second (MdeS) as the assessment criteria. MdeS is defined as $\mathrm{MdeS} = \mathrm{width} \times \mathrm{height} \times \mathrm{disps} \times \mathrm{fps} / 1{,}000{,}000$. It is a more meaningful indicator of the overall performance of the system. Table 4 compares the existing approaches with the hardware implementation presented in this paper. It can be seen that the maximum frame rate in Refs. 8 and 30 is high, but real-time 3-D TV and 3-D video applications only need 30 fps for National Television Standards Committee (NTSC) and 25 fps for Phase Alternating Line (PAL). Considering only the pixel throughput, they would be adequate for 1080p at 34 and 65 fps because their resolution is VGA. The proposed implementation can compete with other hardware architectures in terms of full HD depth maps in real time and can dramatically boost the overall performance.
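The MdeS definition and the pixel-throughput comparison can be reproduced with a few lines of arithmetic. The function names are hypothetical; the input figures are the ones quoted in the text for the proposed system and for the system of Jin et al.

```python
def mdes(width, height, disparities, fps):
    """Mega disparity evaluations per second."""
    return width * height * disparities * fps / 1_000_000


def equivalent_fps(width, height, fps, target_w=1920, target_h=1080):
    """Frame rate at the target resolution with the same pixel throughput."""
    return width * height * fps / (target_w * target_h)


if __name__ == "__main__":
    print(mdes(1920, 1080, 240, 125))              # proposed system: 62208.0
    print(mdes(640, 480, 32, 230))                 # Jin et al.: 2260.992
    print(round(equivalent_fps(640, 480, 230), 1)) # about 34.1 fps at full HD
```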

Conclusions
An FPGA architecture of a real-time depth estimation system, the HDDE system, has been proposed and evaluated in this paper.It is capable of processing full HD (1920 × 1080) resolution stereo video at 125 fps with 240 disparity levels.It boosts the overall performance of the stereo vision system by efficient hardware parallel implementation, including pipeline and block window data parallel architectures.It uses a fusion stereo matching strategy, which enhances the quality of the depth map with hybrid CT and SAD matching algorithms.An autostereoscopic display with an integrated renderer can be connected to the hardware and provides a high-quality multiview video.
The system performs with high efficiency and stability by using a full pipeline design, multiresolution processing, synchronizers which avoid CDC problems, efficient memory management, etc. Results of power analysis demonstrate that the proposed architecture is capable of a low-power application.In future work, we intend to search for a more robust postprocessing method to further improve the quality of the depth map and try to use it as a part of the 3-D video system for high-performance 3-D reconstruction.

$I_1$ is the intensity value of the primary stereo image. $I_2$ is the intensity value of the secondary stereo image at disparity level $d$. The coordinates are

Fig. 2 Matching quality and time consumption for different census mask sizes.

Fig. 3 Interpolation performance for depth map and texture image.

Fig. 5 Line buffers for w × h window operation.

Fig. 8 Census transform and sum of absolute difference calculation for the initial matching cost.

Fig. 11 Double data rate 2 memory access flow.

Fig. 14 Stereo vision system and field-programmable gate array based PCB for HDDE system implementation.
Figure 15(a) is captured in the laboratory environment, Fig. 15(b) is the corresponding depth map, and Fig. 15(c) is the false-color rendered image of the depth map. To aid visual inspection, the distance coordinate in the real world is also shown in the figure.

Fig. 15 Results captured in the implemented video system. (a) Left input image of a real-scene scenario. (b) Depth map. (c) False-color result rendered for visual inspection, including the color bar.

Table 1 Comparison of census transform and sum of absolute difference (SAD).

Table 2 Specifications of target device for core module and high definition depth estimation (HDDE) system implementation on EP2AGX260EF29C4 of Arria II GX.

Table 3 Field-programmable gate array (FPGA) device power dissipation characteristics.