The research and development in three-dimensional (3-D) video are capturing the attention of the research community, application developers, and the game industry. Many interesting applications of 3-D video—such as 3-D television (3DTV), free-viewpoint television, 3-D cinema, gesture recognition systems, and other consumer electronics products—have been developed. An attractive 3-D video representation is a multiview video plus depth (MVD) format,1 which allows rendering numerous viewing angles from only two to three given input views. However, MVD results in a vast amount of data to be stored or transmitted, and efficient compression techniques for MVD are vital for achieving high 3-D visual experience with constrained bandwidth. In addition, the introduction of MVD format allows generating an arbitrary number of intermediate views with low-cost depth image–based rendering2 techniques, but the quality depends on the accuracy of the depth maps.3,4 Thus, in this article, we concentrate on the compression of depth information in an MVD format.
A new video coding standard, high efficiency video coding (HEVC),5 is now being finalized with a primary focus on efficient compression of monoscopic video. Preliminary results have already demonstrated that this new standard provides the same subjective quality at 50% of the bit rate compared to H.264/AVC High Profile. Recently, JCT-3DV has been formed for the development of new 3-D standards, including extensions of HEVC. Since depth maps generally have more spatial redundancy than natural images, depth down/upsampling can be combined with the HEVC framework to increase coding efficiency. Several works have proposed compressing a downsampled depth map at the encoder in the H.264/AVC framework.6–9 MPEG 3DV experiments also demonstrate that this down/upsampling-based depth coding approach can improve the depth map coding efficiency.10 At the same time, the 3D-AVC Test Model11 successfully exploits the possibility of subsampling depth data by a factor of 2, which substantially increases compression efficiency. Since the quality of the synthesized views depends on the accuracy of the depth map information, depth coding-induced distortion affects not only the depth quality but also the synthesized view quality. Therefore, the depth down/upsampling method needs to be carefully designed to guarantee synthesized view quality.
Classical techniques, such as pixel repetition, bilinear, or bicubic interpolation, cause jagged boundaries, blurred edges, and annoying artifacts around edges. The bilateral filter is a widely used edge-preserving filtering technique, where the weights of the filter are selected as a function of a photometric similarity measure of the neighboring pixels. In addition, a joint bilateral filter12 has been proposed that uses auxiliary information from high-resolution images, which is beneficial for edge preservation. The concepts of bilateral and joint bilateral filtering have been used for in-loop filtering13–15 and postfiltering16–18 on reconstructed depth images. Liu et al.15 designed a joint trilateral in-loop filter to reconstruct the depth map that takes into account both the similarity among depth samples and that among corresponding texture pixels. Wildeboer et al.16 proposed a joint bilateral upsampling algorithm that utilizes the high-resolution texture video in the process of depth upsampling; they calculated a weight cost based on pixel positions and intensity similarities. Ekmekcioglu et al.17 exploited an adaptive depth map upsampling algorithm with a corresponding color image in order to obtain coding gain while maintaining the quality of the synthesized view. Recently, Schwarz et al.18 introduced an adaptive depth filter utilizing edge information from the texture video to improve HEVC efficiency. However, the texture-assisted joint bilateral filter for depth images suffers from the texture copy problem.
Edge-directed interpolation techniques recover sharp edges while suppressing pixel jaggedness and blurring artifacts by imposing accurate source models. Li and Orchard19 proposed a new edge-directed interpolation (NEDI) algorithm for natural images, which exploits image geometric regularity by using the covariance of a low-resolution image to estimate that of a high-resolution image. Asuni and Giachetti20 improved the stability of NEDI by using edge segmentation. Zhang et al.21 estimated the low-resolution covariance adaptively with improved nonlocal edge-directed interpolation. Since NEDI needs a relatively large window to compute the covariance matrix for each missing sample, nonstationary local structures may lead to an incorrect covariance estimate and introduce spurious artifacts.
Preserving the edges of depth maps is important for improving the synthesized view quality. This article proposes a novel edge-preserving depth upsampling method for down/upsampling-based depth coding using both the texture and depth information. The optimal minimum mean square error (MMSE) upsampling coefficients are estimated from the local covariance matrix of the downsampled depth map. By using an adaptive weight model, which takes into account both the structural similarity within the depth map and the edge similarity between the depth map and its corresponding texture image, our proposed method is capable of suppressing artifacts caused by the different geometry structures in a local window.
The remainder of this article is organized as follows. Section 2 describes the depth map coding framework and details the proposed down- and upsampling algorithms. Section 3 presents some experimental results and comparative studies and Sec. 4 concludes the article.
Figure 1 shows the framework of the proposed depth map encoder and decoder based on an HEVC codec. We utilize the efficiency of HEVC and concentrate on depth down/upsampling to increase coding efficiency and synthesized view quality. The encoder contains a preprocessing block that reduces the spatial resolution of the depth data. The resulting depth map is then encoded with HEVC. In the decoding process, a novel edge-preserving upsampling (EPU) restores the spatial resolution of the decoded depth map, especially on object boundaries, by taking the depth and texture characteristics into account. The motivation is that, on one hand, with an efficient HEVC codec, encoding the depth data at reduced resolution can reduce the bit rate substantially. On the other hand, with an efficient upsampling algorithm, encoding the depth data at reduced resolution can still achieve a good synthesized view quality. The novelty of this approach lies in the two key components of the proposed depth map coding framework: the reliable median downsampling and the EPU filter. In what follows, we give a detailed description of the down/upsampling algorithm.
We use edge detection–based prefiltering before downsampling to preserve important object boundaries and remove potential high frequencies in constant depth regions. Figure 2 illustrates a block diagram of the prefiltering. It contains three blocks: boundary layer detection, Gaussian blur, and boundary enhancement. A Canny edge detector22 divides the input depth map into the smooth region and the boundary layer. The filtered depth map contains the enhanced boundaries and the blurred smooth region.
The smooth depth region is then filtered using a bilateral filter. The bilateral filter is an edge-preserving filtering technique in which the kernel weights are modified as a function of the photometric similarity between pixels, thus giving higher weights to pixels belonging to similar regions and reducing blurring at edges, where photometric discontinuities are present. Let $I(p)$ be the intensity of the pixel at position $p$ and $N(p)$ its neighborhood; the filtered pixel $\hat{I}(p)$ obtained with the bilateral filter is
$$\hat{I}(p) = \frac{1}{W(p)} \sum_{q \in N(p)} f(\lVert p - q \rVert)\, g(\lvert I(p) - I(q) \rvert)\, I(q), \qquad W(p) = \sum_{q \in N(p)} f(\lVert p - q \rVert)\, g(\lvert I(p) - I(q) \rvert),$$
where $f$ and $g$ are the spatial and photometric (range) kernels, respectively.
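As an illustration of the smoothing step, the following is a minimal sketch of a standard bilateral filter. It assumes Gaussian spatial and range kernels; the function name `bilateral_filter` and the parameters `radius`, `sigma_s`, and `sigma_r` are illustrative choices, not values taken from this article.

```python
import numpy as np


def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=10.0):
    """Brute-force bilateral filter on a 2-D array.

    Assumption: Gaussian spatial kernel (sigma_s) and Gaussian range
    kernel (sigma_r); both values are illustrative.
    """
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")  # replicate borders
    acc = np.zeros_like(img)    # weighted sum of neighbor intensities
    norm = np.zeros_like(img)   # sum of weights (normalization term)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbor image shifted by (dy, dx).
            shifted = padded[radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            photometric = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_r ** 2))
            weight = spatial * photometric
            acc += weight * shifted
            norm += weight
    return acc / norm
```

With a small `sigma_r`, neighbors on the other side of a depth discontinuity receive almost zero weight, so a sharp step between two flat depth layers passes through the filter essentially unchanged.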
The boundary layer is enhanced by a Gaussian high-pass filtering. We mark a 7-pixel wide area along depth edges as the boundary layer which includes foreground and background boundary information. In our experiment, the boundary layer is enhanced by a Gaussian high-pass filter with a size of and .
Reducing the resolution of the encoded depth map can reduce the bit rate substantially, but the loss of resolution also degrades the quality of the depth map. Therefore, the downsampling method should be designed so that the high-resolution depth can be recovered well after decoding. Conventional linear downsampling filters create new, unrealistic pixel values that spread over the entire depth map in the upsampling procedure and cause further distortion in the synthesized view.
Considering the above, we propose a reliable median filter for depth downsampling. The proposed reliable median filter is a nonlinear downsampling filter. The downsampled results are obtained in two steps:
Step 1 We obtain those reliable depth values of a block of the depth map in detail as follows.
Define as a block of the depth map, we sort all pixels in by intensity value and the mean value for is defined by
The pixels in are categorized into low and high groups by as
Let and be the maximum and minimum values of the block , respectively. If the maximum value and minimum value are very close, then the local window is a smooth region, so all pixels in the block are reliable candidates; otherwise, the local window contains foreground and background regions, so only the pixels belonging to the foreground region are chosen as the reliable candidates in order to avoid background covering foreground. The reliable candidates formulated as follows:
Step 2 The median of the reliable data is the filtering result. The reliable median filter for depth downsampling is
The proposed reliable depth downsampling filter has the following merits over other linear filters: (1) it is more robust against outliers; a noisy neighboring pixel does not affect the median value significantly and (2) the median filtering does not create new unrealistic pixel values when the filter straddles an edge since the median value must actually be the value of one of the pixels in the same object.
The proposed downsampling excludes dissimilar neighboring pixels from the filtering process, thereby separating pixels that belong to different objects. It is a generalized form of the 2-D median downsampling filter.6,7 When the downsampling factor is 2, the reliable median filter reduces to the 2-D median downsampling filter.
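The two steps of the reliable median downsampling can be sketched as follows. This is a minimal sketch under stated assumptions: the closeness test is modeled as a fixed `threshold` on the difference between the block maximum and minimum, the foreground is taken to be the group with the larger depth values (nearer objects have larger values), and the median of an even number of candidates is taken as the upper middle element so that the output is always an existing pixel value. None of these specifics is taken from the article.

```python
import numpy as np


def reliable_median_downsample(depth, factor=2, threshold=10):
    """Downsample a depth map block by block with a reliable median.

    Assumptions (illustrative): `threshold` decides whether a block is
    smooth; the foreground is the group with values >= the block mean.
    """
    depth = np.asarray(depth)
    h = depth.shape[0] - depth.shape[0] % factor  # crop to a multiple of factor
    w = depth.shape[1] - depth.shape[1] % factor
    out = np.empty((h // factor, w // factor), dtype=depth.dtype)
    for i in range(0, h, factor):
        for j in range(0, w, factor):
            block = depth[i:i + factor, j:j + factor].ravel()
            if int(block.max()) - int(block.min()) <= threshold:
                reliable = block                           # smooth block: keep all pixels
            else:
                reliable = block[block >= block.mean()]    # keep the high (foreground) group
            ordered = np.sort(reliable)
            # Upper median: always one of the input pixel values.
            out[i // factor, j // factor] = ordered[len(ordered) // 2]
    return out
```

For a 2x2 block such as `[[10, 10], [200, 205]]` the block is split by its mean, only the two foreground values remain, and the output is one of them rather than an averaged value that appears nowhere in the original depth map.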
Edge-Preserving Depth Upsampling
After HEVC encoding and decoding, the downsampled depth map must be restored to the original full resolution for rendering virtual views. An EPU is proposed for depth map reconstruction, utilizing edge information from the corresponding texture frame.
Figure 3 gives the sketch map of the upsampling process. Let denotes the input low-resolution depth map of size . We start with the simplest case of upsampling by a factor of 2 and assume is the high-resolution depth map after upsampling to size . We first copy the low-resolution depth map directly to its high-resolution version , i.e., and then interpolate , , and from in two steps. The first step is to interpolate from its four nearest neighbors , , , and along the diagonal directions of a square lattice. The second step is to interpolate other missing samples and from a rhombus lattice in the same way after a 45-deg rotation of the square grid. Therefore, the implementation of all the pixels is almost identical. For example, is calculated as
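The two-pass traversal of the lattice can be sketched as follows. This minimal sketch uses fixed placeholder coefficients `alpha` (equal weights, which amounts to plain averaging of the four neighbors); in the method described here the coefficients are instead estimated locally for each missing sample. The function name, the border replication, and the ordering of the four coefficients are assumptions made for illustration.

```python
import numpy as np


def upsample_by_two(low, alpha=(0.25, 0.25, 0.25, 0.25)):
    """Two-pass lattice upsampling by a factor of 2.

    Assumption: `alpha` is a fixed placeholder for the four interpolation
    coefficients; borders are handled by edge replication.
    """
    low = np.asarray(low, dtype=np.float64)
    h, w = low.shape
    padded = np.pad(low, 1, mode="edge")          # one replicated sample per side
    high = np.zeros((2 * h + 3, 2 * w + 3))
    high[0::2, 0::2] = padded                     # copy known samples to even sites
    a0, a1, a2, a3 = alpha
    # Pass 1: odd-odd sites from their four diagonal neighbors (square lattice).
    for y in range(1, 2 * h + 3, 2):
        for x in range(1, 2 * w + 3, 2):
            high[y, x] = (a0 * high[y - 1, x - 1] + a1 * high[y - 1, x + 1]
                          + a2 * high[y + 1, x - 1] + a3 * high[y + 1, x + 1])
    # Pass 2: remaining sites from their four axial neighbors
    # (the same lattice rotated by 45 degrees).
    for y in range(1, 2 * h + 2):
        for x in range(1, 2 * w + 2):
            if (y + x) % 2 == 1:
                high[y, x] = (a0 * high[y - 1, x] + a1 * high[y, x - 1]
                              + a2 * high[y, x + 1] + a3 * high[y + 1, x])
    return high[2:2 * h + 2, 2:2 * w + 2]         # drop the padding
```

The known low-resolution samples end up unchanged at the even positions of the output, and every missing sample is a four-tap combination of neighbors that are already available when it is computed.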
Since natural images typically consist of smooth areas, textures, and edges, they are not globally stationary. A reasonable assumption is that the sample mean and variance of a pixel are equal to the local mean and variance of all pixels within a fixed surrounding range. This assumption underlies most statistical image representations in previous work, as shown by Kuan23 and Lee.24 Moreover, depth maps are mostly more homogeneous than natural images; therefore, it is reasonable to treat depth maps as being locally stationary. Furthermore, optimal MMSE linear interpolation is successful in image recovery in that it effectively removes noise while preserving important image features (e.g., edges). Thus, under the assumption that a depth image can be modeled as a locally stationary Gaussian process, according to classical Wiener filtering theory, the optimal MMSE linear interpolation coefficients are given by
By exploiting the similarity between the high-resolution covariance and the low-resolution covariance, and can be estimated from a local window of its low-resolution depth map. As shown in Fig. 3, we estimate and based on the local statistics from a local window centered at the interpolated pixel location, leading to
We note that the covariance estimation in NEDI (Ref. 19) with each sample inside the window having the same weight is a special case of ours. In edge-preserving depth map upsampling, the samples used to calculate coefficients should have similar geometric structure (i.e., edge direction) with the region centered in the interpolated pixel . Otherwise, in the presence of a sharp edge, if a sample is interpolated across instead of along the edge direction, large and visually disturbing artifacts will be introduced. In this article, we introduce a weight model for each sample and make samples adaptive to the local characteristics of the depth map.
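A minimal sketch of how such coefficients can be estimated from a low-resolution window is given below. It follows the NEDI idea of regressing each low-resolution sample on its four diagonal low-resolution neighbors and solves the (optionally weighted) least-squares problem, which is equivalent to solving the normal equations built from the sample covariance. The function name, the window `radius`, the per-sample `weights` array, and the use of a minimum-norm solver for flat regions are assumptions for illustration; uniform weights correspond to the NEDI special case.

```python
import numpy as np


def mmse_coefficients(low, cy, cx, radius=2, weights=None):
    """Estimate four interpolation coefficients from a local window.

    Returns coefficients for the (up-left, up-right, down-left, down-right)
    neighbors. Assumption: `weights`, if given, has the same shape as `low`
    and holds one non-negative weight per sample.
    """
    low = np.asarray(low, dtype=np.float64)
    h, w = low.shape
    rows, targets, wts = [], [], []
    for y in range(max(1, cy - radius), min(h - 1, cy + radius + 1)):
        for x in range(max(1, cx - radius), min(w - 1, cx + radius + 1)):
            rows.append([low[y - 1, x - 1], low[y - 1, x + 1],
                         low[y + 1, x - 1], low[y + 1, x + 1]])
            targets.append(low[y, x])
            wts.append(1.0 if weights is None else float(weights[y, x]))
    sw = np.sqrt(np.asarray(wts))
    C = np.asarray(rows) * sw[:, None]   # weighted neighbor matrix
    t = np.asarray(targets) * sw         # weighted target vector
    # Minimum-norm least squares: stays well defined in flat regions,
    # where the covariance matrix is singular.
    alpha, *_ = np.linalg.lstsq(C, t, rcond=1e-8)
    return alpha
```

In a flat window this yields four equal coefficients of 0.25. Across a diagonal depth edge, the coefficients concentrate on the two neighbors that lie along the edge, which is the behavior that prevents interpolation across the edge.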
Aiming to take advantage of the geometric similarity within depth maps as well as the photometric similarity between the depth map and its corresponding texture sequence, we propose to use the pixel distance, intensity difference, and texture similarity to build a weight model with
The quantity in Eq. (9) is a function of the absolute difference between the current pixel value and center pixel value in a depth map, and given by
Different from texture sequences, the depth maps usually come with their accompanying texture video. It is known that they share similar structures, especially along the edges. Therefore, an additional term measuring this similarity is introduced in Eq. (9). Similar to depth samples’ similarity , the third subcost function means the similarity of texture intensity between the current texture pixel value and the center texture pixel value in texture image. It is measured by the absolute difference as given in
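The three-term weight model can be sketched as follows. This is only a sketch under assumptions: the three terms are combined as a product, each term is modeled as a Gaussian of the spatial distance, the absolute depth difference, and the absolute texture difference, respectively, and the texture is assumed to be sampled on the same grid as the depth window. The function name and the parameters `sigma_s`, `sigma_d`, and `sigma_t` are illustrative and not taken from the article.

```python
import numpy as np


def sample_weights(depth, texture, cy, cx, radius=2,
                   sigma_s=2.0, sigma_d=8.0, sigma_t=12.0):
    """Per-sample weights in a window centered at (cy, cx).

    Assumption: product of three Gaussian terms (distance, depth
    difference, texture difference); samples outside the window get 0.
    """
    depth = np.asarray(depth, dtype=np.float64)
    texture = np.asarray(texture, dtype=np.float64)
    h, w = depth.shape
    weights = np.zeros((h, w))
    y0, y1 = max(0, cy - radius), min(h, cy + radius + 1)
    x0, x1 = max(0, cx - radius), min(w, cx + radius + 1)
    yy, xx = np.mgrid[y0:y1, x0:x1]
    # Term 1: pixel distance to the window center.
    w_s = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2.0 * sigma_s ** 2))
    # Term 2: absolute depth difference to the center depth value.
    d_diff = np.abs(depth[y0:y1, x0:x1] - depth[cy, cx])
    w_d = np.exp(-(d_diff ** 2) / (2.0 * sigma_d ** 2))
    # Term 3: absolute texture difference to the center texture value.
    t_diff = np.abs(texture[y0:y1, x0:x1] - texture[cy, cx])
    w_t = np.exp(-(t_diff ** 2) / (2.0 * sigma_t ** 2))
    weights[y0:y1, x0:x1] = w_s * w_d * w_t
    return weights
```

Samples on the far side of a depth or texture edge receive weights close to zero, so they contribute little to the covariance estimate for the sample being interpolated.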
We study the performance of the proposed depth down/upsampling method for depth map coding using two types of test sequences in resolutions (: Poznan_Street,25 Undo_Dancer and : Newspaper, Bookarrival26), with YUV 4:2:0 8 bits per pixel (bpp) format. The test materials are provided by MPEG and depth maps have been estimated from original video based on the depth estimation reference software.27 For Poznan_Street sequence, view 3 and view 5 are selected as reference views. For Undo_Dancer sequence, view 4 and view 6 are selected as references and view 5 as the target view. For Book-Arrival sequence, view 8 and view 10 are selected as references and view 9 as the target view.
For each reference depth map, we downsample it by a factor of two before encoding with the 3-D-HEVC test model (HTM) version 4.128 using quantization parameters (QP) 24, 28, 32, 40, and 44. The texture video sequences use a fixed QP of 32. Thirty frames are coded for each sequence. Other encoder configurations follow those specified in the common test conditions29 for 3-D video coding. No multiview video coding is applied. After decoding, the intermediate view is synthesized by the view synthesis reference software.30 The efficiency of the proposed method is evaluated through rate distortion (RD) performance and the subjective quality of the synthesized view. For the RD curves, the x-axis is the total bit rate for the two depth maps and two texture sequences, and the y-axis is the Y_PSNR of the synthesized views compared to the original view.
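For reference, the luma PSNR used on the vertical axis of the RD curves can be computed as in the following minimal sketch. It assumes 8-bit luma planes (peak value 255) held as arrays of equal size; the function name `y_psnr` is illustrative.

```python
import numpy as np


def y_psnr(reference_y, synthesized_y, peak=255.0):
    """PSNR in dB between two luma planes (assumption: 8-bit, peak 255)."""
    ref = np.asarray(reference_y, dtype=np.float64)
    syn = np.asarray(synthesized_y, dtype=np.float64)
    mse = np.mean((ref - syn) ** 2)       # mean squared error over the plane
    if mse == 0.0:
        return float("inf")               # identical planes
    return 10.0 * np.log10(peak ** 2 / mse)
```

For a sequence, the per-frame values are typically averaged over all synthesized frames before plotting.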
First, the performance of the down/upsampling-based depth coding scheme is compared to that of full scale. For the full scale method, the depth maps are encoded without down/upsampling using HTM reference software. Figure 4 shows the RD curves comparison between the proposed method and the full resolution method.
It can be seen that the down/upsampling-based depth map coding scheme outperforms full-scale depth map coding at lower bit rates. Specifically, as shown in Table 1, the bit rate saving is up to 32.2% for “BookArrival” and 27.6% for “Newspaper” on depth maps, whereas it is 8.9% for “BookArrival” and 5.3% for “Newspaper” on total bit rates. Since the bit rates of depth maps are only about 10 to 20% of those of the texture sequences, the bit rate saving is smaller for the total bit rate than for the depth bit rate. At higher bit rates, the frames are encoded with smaller QP and preserve much more detail. Therefore, the influence of down/upsampling distortion becomes larger, and the RD performance falls below that of the full-scale case.
Table 1 Performance (bit rate versus synthesized view PSNR) of full-scale and down/upsampling depth map coding.
|Quantization parameter (QP)||T1+T2 (QP32)||D1+D2 (kb/s)||2T+2D (kb/s)||Y_PSNR (dB)||D1+D2 (kb/s)||2T+2D (kb/s)||Y_PSNR (dB)|
|BD_rate (2T+2D)=5.3%|
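A BD-rate figure such as the one reported in Table 1 is commonly obtained with the Bjontegaard delta method. The following minimal sketch assumes the usual formulation: a cubic polynomial fit of the logarithm of the bit rate as a function of PSNR, integrated over the PSNR range shared by both curves. The function name and argument names are illustrative, and the exact procedure used for Table 1 is not specified here.

```python
import numpy as np


def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Bjontegaard delta rate in percent.

    Negative values mean the test curve needs less bit rate than the
    reference curve at equal PSNR. Assumption: four RD points per curve,
    cubic fit of log10(rate) over PSNR.
    """
    log_ref = np.log10(np.asarray(rate_ref, dtype=np.float64))
    log_test = np.log10(np.asarray(rate_test, dtype=np.float64))
    p_ref = np.polyfit(psnr_ref, log_ref, 3)
    p_test = np.polyfit(psnr_test, log_test, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (10.0 ** avg_log_diff - 1.0) * 100.0
```

As a sanity check, a test curve whose bit rates are uniformly 10% lower than the reference at the same PSNR values gives a BD-rate of about -10%.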
Second, we evaluate the performance of the proposed downsampling, upsampling, and prefiltering methods separately. To test the effectiveness of the proposed downsampling algorithm, the original depth maps are downsampled using different downsampling methods and upsampled with the same EPU algorithm. Figure 5(a) shows the RD curves of the different depth downsampling methods, where “Median downsc.” stands for the downsampling proposed by Oh in Ref. 6 and “Reliable Median downsc.” stands for that described in Sec. 2.2.
To test the effectiveness of the proposed upsampling algorithm, the decoded depth maps are upsampled using different interpolation algorithms after being downsampled with the same reliable median downsampling before encoding. Figure 5(b) shows the RD curves of the different upsampling methods, where “EPU upsc.” stands for the upsampling method described in Sec. 2.3, “NEDI upsc.” stands for the upsampling method in Ref. 19, and “EWOC upsc.” and “JBU upsc.” stand for the recently published upsampling algorithms in Refs. 18 and 17, respectively.
Figure 5(c) shows the RD curves to compare the coding efficiency of the proposed methods against two advanced down/upsampling-based depth coding methods. “Method 1” is the combined method, where depth maps are preprocessed as described in Sec. 2.1, then downsampled as described in Sec. 2.2 and upsampled as described in Sec. 2.3. “Method 2” is the result with the proposed downsampling and upsampling. “EWOC” stands for the depth map coding method in Ref. 18. “JBU” stands for the down/upsampling algorithm for depth maps in Ref. 17, where depth maps are downsampled with median filtering and upsampled with JBU. No prefiltering is applied to either “Method 2” or JBU method.
We can see that both the proposed upsampling method and downsampling show good performance as shown in Fig. 5(a) and 5(b). By combining the proposed prefiltering, downsampling and the upsampling methods, additional gain can be achieved as shown in Fig. 5(c).
Synthesized View Quality
Depth map downsampling and upsampling directly impact the subjective quality of synthesized views. Figures 6–8 compare our proposed upsampling method with EWOC upsampling and JBU in terms of the subjective quality of the synthesized views at the decoder after depth map encoding at the same rate. The synthesized images obtained with EWOC interpolation and JBU upsampling exhibit strong jaggedness around object edges. In contrast, our proposed upsampling method uses the texture image, which provides edge information in the upsampling procedure; therefore, it obtains clearer and smoother edges along object boundaries.
Computational Complexity Analysis
Table 2 lists the processing times. The depth map coding time for the proposed method comprises the downsampling time (low-pass filtering and downsampling procedures), HEVC encoding time, decoding time, and the proposed upsampling time. The depth coding time for the full-scale method comprises the HEVC encoding time and decoding time. and denote the Newspaper and Book-Arrival sequences.
Processing times of full scale coding and down/upsampling coding.
|Enc T [s]||Dec T [s]||Down T [s]||Enc T [s]||Up T [s]||Dec T [s]|
|Sum T=71935 s||Sum T=24362 s|
|Sum T=90026 s||Sum T=40899 s|
Since the resolution of the encoded video in the down/upsampling-based method is lower than in the full-scale method, the encoding time of the downsampled depth maps is far less than that of the full-scale depth maps. Although additional downsampling and upsampling procedures are needed, the overall computation time of the down/upsampling-based method is less than that of the full-scale method, as shown in Table 2.
We have presented an edge-preserving depth upsampling method for down/upsampling-based depth coding within the HEVC framework. Different from the NEDI algorithm of Ref. 19, we introduced a weight model for each sample that incorporates geometric similarity as well as intensity similarity in both the depth map and its corresponding texture sequence, thus allowing an adaptation of interpolation coefficients to the edge orientation. An evaluation of performance in terms of coded data and synthesized views has been provided. Experimental results show that our proposed interpolation method for down/upsampling-based depth coding improves both the coding efficiency and synthesized view quality.
This work is supported in part by the National Science Foundation of China under Grant Nos. 61231010 and 61202301.
Huiping Deng received a BS degree in electronics and information engineering, an MS degree in communication and information system from Yangtze University, Jingzhou, China, in 2005 and 2008, respectively. She is currently working toward the PhD degree in the Electronics and Information Engineering Department, HUST. Her research interests are video coding and computer vision, currently focusing on three-dimensional video (3DV).
Li Yu received the BS degree in electronics and information engineering, the MS degree in communication and information system, and the PhD degree in electronics and information engineering, all from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1995, 1997, and 1999, respectively. In 2000, she joined the Electronics and Information Engineering Department, HUST, where she has been a professor since 2005. She is a co-sponsor and key member of the China AVS standard special working group. Her team has applied for more than 10 related patents and submitted 79 proposals to the AVS standard organization. Her current research interests include multimedia communication and processing, computer networks, and wireless communication.