Edge-preserving down/upsampling for depth map compression in high-efficiency video coding

Abstract. A down/upsampling method for compressing depth maps efficiently within the high-efficiency video coding (HEVC) framework is presented. A novel edge-preserving depth upsampling method is proposed that uses both texture and depth information. We take into account the edge similarity between depth maps and their corresponding texture images, as well as the structural similarity among depth maps, to build a weight model. Based on the weight model, the optimal minimum mean square error upsampling coefficients are estimated from the local covariance coefficients of the downsampled depth map. The upsampling filter is combined with HEVC to increase coding efficiency. The objective results demonstrate a maximum depth bit rate saving of 32.2% compared to the full-resolution method and 27.6% compared to a competing depth down/upsampling method. The subjective evaluation shows that the proposed method achieves better quality in synthesized views than existing methods.


Introduction
Research and development in three-dimensional (3-D) video are capturing the attention of the research community, application developers, and the game industry. Many interesting applications of 3-D video, such as 3-D television (3DTV), free-viewpoint television, 3-D cinema, gesture recognition systems, and other consumer electronics products, have been developed. An attractive 3-D video representation is the multiview video plus depth (MVD) format [1], which allows rendering numerous viewing angles from only two to three given input views. However, MVD results in a vast amount of data to be stored or transmitted, and efficient compression techniques for MVD are vital for achieving a high-quality 3-D visual experience with constrained bandwidth. In addition, the MVD format allows generating an arbitrary number of intermediate views with low-cost depth image-based rendering [2] techniques, but the quality depends on the accuracy of the depth maps [3,4]. Thus, in this article, we concentrate on the compression of depth information in an MVD format.
A new video coding standard, high-efficiency video coding (HEVC) [5], is now being finalized with a primary focus on efficient compression of monoscopic video. Preliminary results have already demonstrated that this new standard provides the same subjective quality at 50% of the bit rate compared to H.264/AVC High Profile. Recently, JCT-3DV has been formed for the development of new 3-D standards, including extensions of HEVC. Since depth maps generally have more spatial redundancy than natural images, depth down/upsampling can be combined with the HEVC framework to increase coding efficiency [8][9]. MPEG 3DV experiments also demonstrate that this down/upsampling-based depth coding approach can improve the depth map coding efficiency [10]. At the same time, the 3D-AVC Test Model [11] successfully exploits the possibility of subsampling depth data by a factor of 2, which substantially increases compression efficiency. Since the quality of the synthesized views depends on the accuracy of the depth map information, depth coding-induced distortion affects not only the depth quality but also the synthesized view quality. Therefore, the depth down/upsampling method at the decoder needs to be carefully designed to guarantee synthesized view quality.
Classical techniques, such as pixel repetition and bilinear or bicubic interpolation, cause jagged boundaries, blurred edges, and annoying artifacts around edges. The bilateral filter is a widely used edge-preserving filtering technique, where the weights of the filter are selected as a function of a photometric similarity measure of the neighboring pixels. In addition, a joint bilateral filter [12] has been proposed that uses auxiliary information from high-resolution images, which is beneficial for edge preservation. The concepts of bilateral and joint bilateral filtering have been used for in-loop filtering [13][14][15] and postfiltering [16][17][18] on reconstructed depth images. Liu et al. [15] designed a joint trilateral in-loop filter to reconstruct the depth map that takes into account both the similarity among depth samples and that among corresponding texture pixels. Wildeboer et al. [16] proposed a joint bilateral upsampling algorithm that utilizes the high-resolution texture video in the process of depth upsampling; they calculated a weight cost based on pixel positions and intensity similarities. Ekmekcioglu et al. [17] exploited an adaptive depth map upsampling algorithm with a corresponding color image in order to obtain coding gain while maintaining the quality of the synthesized view. Recently, Schwarz et al. [18] introduced an adaptive depth filter utilizing edge information from the texture video to improve HEVC efficiency. However, the texture-assisted joint bilateral filter for depth images suffers from the texture copy problem.
Edge-directed interpolation techniques recover sharp edges while suppressing pixel jaggedness and blurring artifacts by imposing accurate source models. Li and Orchard [19] proposed a new edge-directed interpolation (NEDI) algorithm for natural images, which exploits image geometric regularity by using the covariance of a low-resolution image to estimate that of a high-resolution image. Asuni and Giachetti [20] improved the stability of NEDI by using edge segmentation. Zhang et al. [21] estimated the low-resolution covariance adaptively with improved nonlocal edge-directed interpolation. Since NEDI needs a relatively large window to compute the covariance matrix for each missing sample, nonstationary structures inside the window may lead to an incorrect covariance estimate and introduce spurious artifacts in local structures.
Preserving the edges of depth maps is important for improving the synthesized view quality. This article proposes a novel edge-preserving depth upsampling method for down/upsampling-based depth coding using both the texture and depth information. The optimal minimum mean square error (MMSE) upsampling coefficients are estimated from the local covariance matrix of the downsampled depth map. By using an adaptive weight model, which takes into account both the structural similarity within the depth map and the edge similarity between the depth map and its corresponding texture image, our proposed method is capable of suppressing artifacts caused by the different geometric structures in a local window.
The remainder of this article is organized as follows. Section 2 describes the depth map coding framework and details the proposed down- and upsampling algorithms. Section 3 presents experimental results and comparative studies, and Sec. 4 concludes the article.

Proposed Method
Figure 1 shows the framework of the proposed depth map encoder and decoder based on an HEVC codec. We utilize the efficiency of HEVC and concentrate on depth down/upsampling to increase coding efficiency and synthesized view quality. The encoder contains a preprocessing block that enables the spatial resolution reduction of depth data. The resulting depth map is then encoded with HEVC. For the decoding process, a novel edge-preserving upsampling (EPU) is utilized to upsample the spatial resolution of the decoded depth map, especially on object boundaries, by taking the depth and texture characteristics into account. The motivation is that, on one hand, with an efficient HEVC codec, encoding the depth data at the reduced resolution can reduce the bit rate substantially. On the other hand, with an efficient upsampling algorithm, encoding the depth data at the reduced resolution can still achieve a good synthesized view quality. The novelty of this approach lies in the two key components of the proposed depth map coding framework: reliable median downsampling and the EPU filter. In what follows, we give a detailed description of the down/upsampling algorithm.

Depth Prefiltering
We use an edge detection-based prefiltering before downsampling to preserve important object boundaries and remove potential high frequencies in constant depth regions. Figure 2 illustrates a block diagram of the prefiltering. It contains three blocks: boundary layer detection, Gaussian blur, and boundary enhancement. A Canny edge detector [22] divides the input depth map into a smooth region and a boundary layer. The filtered depth map contains the enhanced boundaries and the blurred smooth region.
The smooth depth region is then filtered using a bilateral filter. The bilateral filter is an edge-preserving filtering technique in which the kernel weights are modified as a function of the photometric similarity between pixels, thus giving higher weights to pixels belonging to similar regions and reducing blurring at edges, where photometric discontinuities are present. Let $D_{full}(p)$ be the intensity of the pixel at position $p$ and $\Omega_p$ its neighborhood. The filtered pixel $D_{filt}(p)$ obtained with the bilateral filter is
$$D_{filt}(p) = \frac{1}{k_p} \sum_{q \in \Omega_p} D_{full}(q)\, f(p,q)\, g(\|D_{full}(p) - D_{full}(q)\|),$$
where $f$ is a two-dimensional (2-D) smoothing kernel, also known as the domain term, that measures the closeness of the pixels, and $g$ is the range term that measures the intensity similarity of the pixels. The scalar $k_p = \sum_{q \in \Omega_p} f(p,q)\, g(\|D_{full}(p) - D_{full}(q)\|)$ is a normalization factor. In our experiment, the filter size is 15 × 15, with $\sigma_f = 3.5$ and $\sigma_g = 15$.
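As an illustration of the bilateral filter above, the following is a minimal brute-force sketch. The source does not state the functional form of $f$ and $g$; Gaussian kernels with standard deviations $\sigma_f$ and $\sigma_g$ are assumed here, and the function name `bilateral_filter` and the edge-replication border handling are hypothetical choices for this example.

```python
import numpy as np

def bilateral_filter(depth, radius=7, sigma_f=3.5, sigma_g=15.0):
    """Brute-force bilateral filter for a 2-D depth map.

    radius=7 gives the 15 x 15 window quoted in the text.
    ASSUMPTION: both the domain term f and the range term g are Gaussians.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    padded = np.pad(depth, radius, mode="edge")  # replicate borders (assumed)

    # Domain term f(p, q): depends only on the spatial offset q - p.
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    f = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma_f ** 2))

    out = np.empty_like(depth)
    size = 2 * radius + 1
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + size, x:x + size]
            # Range term g: similarity of each neighbour to the centre pixel.
            g = np.exp(-((patch - depth[y, x]) ** 2) / (2.0 * sigma_g ** 2))
            weights = f * g
            # Division by the weight sum is the normalisation factor k_p.
            out[y, x] = np.sum(weights * patch) / np.sum(weights)
    return out
```

With these settings a sharp depth step of 200 levels is left essentially untouched, because the range term assigns negligible weight to pixels on the other side of the step.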
The boundary layer is enhanced by Gaussian high-pass filtering. We mark a 7-pixel-wide area along depth edges as the boundary layer, which includes foreground and background boundary information. In our experiment, the Gaussian high-pass filter has a size of 3 × 3 and $\sigma = 0.5$.

Depth Downsampling
Reducing the resolution of the encoded depth can reduce the bit rate substantially, but the loss of resolution also degrades the quality of the depth map. Therefore, the downsampling method should be designed so that the quality of the high-resolution depth can be recovered well after decoding. Conventional linear downsampling filters create new, unrealistic pixel values, which spread over the entire depth map in the upsampling procedure and further cause distortion in the synthesized view.
Considering the above, we propose a reliable median filter for depth downsampling. The proposed reliable median filter is a nonlinear downsampling filter. The downsampled results are obtained in two steps.
Step 1. We obtain the reliable depth values $R_{m \times n}$ of a block $W_{m \times n}$ of the depth map as follows.
Let $W_{m \times n}$ be an $m \times n$ block of the depth map. We sort all pixels in $W_{m \times n}$ by intensity value and compute the mean value of the block:
$$\mathrm{sort}[W(x,y)] = \{D_1, D_2, \ldots, D_{m \times n}\}, \qquad D_{ave} = \frac{1}{m \times n} \sum_{i=1}^{m \times n} D_i.$$
The pixels in $W_{m \times n}$ are categorized into a high group $S_{fg}$ and a low group $S_{bg}$ by $D_{ave}$ as
$$W(x,y) \in \begin{cases} S_{fg}, & \text{if } W(x,y) > D_{ave}, \\ S_{bg}, & \text{otherwise.} \end{cases}$$
Let $\max(W_{m \times n})$ and $\min(W_{m \times n})$ be the maximum and minimum values of the block $W_{m \times n}$, respectively. If the maximum and minimum values are very close, then the local window $W_{m \times n}$ is a smooth region, so all pixels in the block are reliable candidates; otherwise, the local window contains foreground and background regions, so only the pixels belonging to the foreground region are chosen as reliable candidates in order to avoid the background covering the foreground. The reliable candidates are formulated as
$$R_{m \times n} = \begin{cases} W_{m \times n}, & \text{if } \max(W_{m \times n}) - \min(W_{m \times n}) < T_0, \\ S_{fg}, & \text{otherwise,} \end{cases}$$
where the threshold $T_0 = 10$ in our experiment.
Step 2. The median of the reliable candidates is the filtering result; that is, the downsampled value of the block is $\mathrm{median}(R_{m \times n})$.
The proposed reliable depth downsampling filter has the following merits over linear filters: (1) it is more robust against outliers, since a noisy neighboring pixel does not affect the median value significantly, and (2) the median filtering does not create new, unrealistic pixel values when the filter straddles an edge, since the median value must actually be the value of one of the pixels in the same object.
The proposed downsampling excludes the nonsimilar neighbor pixels from the filtering process, thus discriminating against pixels that belong to different objects. It is a generalized form of the 2-D median downsampling filter [6,7]. When the downsampling factor is 2, the reliable median filter can be simplified to the 2-D median downsampling filter.
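The two steps of the reliable median downsampling can be sketched as follows. This is only an illustrative sketch: the function name `reliable_median_downsample` is hypothetical, the smoothness test is assumed to be a strict comparison against $T_0$, and for an even number of candidates the lower of the two middle values is taken (an assumption made so that the output is always an existing pixel value, as merit (2) requires).

```python
import numpy as np

def reliable_median_downsample(depth, m=2, n=2, t0=10):
    """Downsample a depth map by taking the median of reliable pixels per block."""
    depth = np.asarray(depth)
    rows, cols = depth.shape[0] // m, depth.shape[1] // n
    out = np.empty((rows, cols), dtype=depth.dtype)
    for by in range(rows):
        for bx in range(cols):
            block = depth[by * m:(by + 1) * m, bx * n:(bx + 1) * n].ravel()
            # Step 1: choose the reliable candidates.
            if int(block.max()) - int(block.min()) < t0:
                reliable = block                        # smooth block: keep all
            else:
                reliable = block[block > block.mean()]  # foreground group only
            if reliable.size == 0:                      # guard for t0 <= 0
                reliable = block
            # Step 2: median of the candidates (lower median, assumed), so the
            # result is always one of the original depth values.
            reliable = np.sort(reliable)
            out[by, bx] = reliable[(reliable.size - 1) // 2]
    return out
```

For a 2 × 2 block that straddles an edge, for example values 10, 10, 200, 210, the sketch returns 200, a foreground value, rather than an averaged value that exists in neither object.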

Edge-Preserving Depth Upsampling
After HEVC encoding and decoding, the downsampled depth map d must be restored to the original full resolution for rendering virtual views. An EPU is proposed for depth map reconstruction, utilizing edge information from the corresponding texture frame.
Each missing diagonal sample is interpolated as a weighted sum of its four diagonal neighbors,
$$D_{2x+1,2y+1} = k_0 D_{2x,2y} + k_1 D_{2x+2,2y} + k_2 D_{2x,2y+2} + k_3 D_{2x+2,2y+2}, \qquad (6)$$
where $k_0$, $k_1$, $k_2$, and $k_3$ are interpolation coefficients. Since natural images typically consist of smooth areas, textures, and edges, they are not globally stationary. A reasonable assumption is that the sample mean and variance of a pixel are equal to the local mean and variance of all pixels within a fixed surrounding range. This assumption is used in most statistical image representations in previous work, as shown by Kuan [23] and Lee [24]. Moreover, compared with natural images, depth maps are mostly more homogeneous; therefore, it is reasonable to treat depth maps as locally stationary. Furthermore, optimal MMSE linear interpolation is successful in image recovery in that it effectively removes noise while preserving important image features (e.g., edges). Thus, under the assumption that the depth image can be modeled as a locally stationary Gaussian process, according to classical Wiener filtering theory, the optimal MMSE linear interpolation coefficients $K = [k_0, k_1, k_2, k_3]^T$ are given by
$$K = R^{-1} r, \qquad (7)$$
where $R = E[\mathbf{D}\mathbf{D}^T]$, $\mathbf{D} = [D_{2x,2y}, D_{2x+2,2y}, D_{2x,2y+2}, D_{2x+2,2y+2}]^T$, and $r = E[D_{2x+1,2y+1}\mathbf{D}]$ are the local covariances at the high-resolution level.
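For readers who want the intermediate step, the Wiener solution for $K$ can be recovered with a short least-squares argument. The derivation below is the standard textbook one, written with the symbols $R$, $r$, and $\mathbf{D}$ defined above; the cost function name $J$ is introduced here only for illustration.

```latex
\begin{align*}
J(K) &= E\!\left[\left(D_{2x+1,2y+1} - K^{T}\mathbf{D}\right)^{2}\right] \\
     &= E\!\left[D_{2x+1,2y+1}^{2}\right] - 2K^{T}r + K^{T}RK, \\
\nabla_{K} J &= -2r + 2RK = 0
\quad\Longrightarrow\quad K = R^{-1}r .
\end{align*}
```

Because $R$ is positive semidefinite, the stationary point is a minimum of the mean square error whenever $R$ is invertible.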
By exploiting the similarity between the high-resolution covariance and the low-resolution covariance, R and r can be estimated from a local window of the low-resolution depth map. As shown in Fig. 3, we estimate R and r based on the local statistics from a local w × w window centered at the interpolated pixel location, leading to
$$R = \sum_{n=1}^{w^2} p_n c_n c_n^T, \qquad r = \sum_{n=1}^{w^2} p_n D_n c_n, \qquad (8)$$
where $D_n$ is a known pixel from the low-resolution map d ($D_{2x,2y} = d_{x,y}$), $p_n$ is the weight of sample $D_n$, and $c_n$ is a 4 × 1 vector whose entries are the four neighbors of $D_n$ along the diagonal directions, as shown in Fig. 3.
We note that the covariance estimation in NEDI (Ref. 19), in which each sample inside the w × w window has the same weight $p_n = 1/w^2$, is a special case of ours. In edge-preserving depth map upsampling, the samples $D' = [D_1, D_2, \ldots, D_{w^2}]^T$ used to calculate the coefficients should have a geometric structure (i.e., edge direction) similar to that of the region centered at the interpolated pixel $D_{2x+1,2y+1}$. Otherwise, in the presence of a sharp edge, if a sample is interpolated across instead of along the edge direction, large and visually disturbing artifacts will be introduced. In this article, we introduce a weight model for each sample and make the samples adaptive to the local characteristics of the depth map.
Aiming to take advantage of the geometric similarity within depth maps as well as the photometric similarity between the depth map and its corresponding texture sequence, we propose to use the pixel distance, intensity difference, and texture similarity to build the weight model $p_n$ of Eq. (9), which combines three terms $p^c_n$, $p^d_n$, and $p^t_n$. The term $p^c_n$ depends on the distance between the current pixel position $(x_n, y_n)$ and the center pixel position $(x_c, y_c)$, which is measured by the Euclidean distance
$$\mathrm{dist}(n) = \sqrt{(x_n - x_c)^2 + (y_n - y_c)^2},$$
and is expressed in terms of $\max_{dist}$ and $\min_{dist}$, the maximum and minimum pixel distances within the window W, respectively. The quantity $p^d_n$ in Eq. (9) is a function of the absolute difference $\mathrm{difD}(n) = |D_n - D_c|$ between the current pixel value $D_n$ and the center pixel value $D_c$ in the depth map, and is expressed in terms of $\max_{difD}$ and $\min_{difD}$, the maximum and minimum depth intensity differences within the window W, respectively.
Unlike texture sequences, depth maps usually come with an accompanying texture video. It is known that the two share similar structures, especially along edges. Therefore, an additional term $p^t_n$ measuring this similarity is introduced in Eq. (9). Similar to the depth similarity $p^d_n$, the third subcost function $p^t_n$ measures the similarity of texture intensity between the current texture pixel value $I_n$ and the center texture pixel value $I_c$ in the texture image. It is a function of the absolute difference $\mathrm{difT}(n) = |I_n - I_c|$ and is expressed in terms of $\max_{difT}$ and $\min_{difT}$, the maximum and minimum texture intensity differences within the window W, respectively. With this texture similarity, even if the reconstructed depth map has certain artifacts around the edges, we can still utilize the corresponding texture information to help recover depth boundaries.
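To make the roles of the three terms concrete, the sketch below builds one possible weight model from dist(n), difD(n), and difT(n). The exact functional form of the three terms and the way they are combined are not reproduced here, so two assumptions are made and should be read as illustrative only: each term is min-max normalized so that the smallest difference maps to 1 and the largest to 0, and the three terms are multiplied and then normalized to sum to 1. The names `sample_weights` and `_closeness` are hypothetical.

```python
import numpy as np

def _closeness(values):
    """Min-max normalise differences: smallest -> 1, largest -> 0 (assumed form)."""
    values = np.asarray(values, dtype=np.float64)
    lo, hi = values.min(), values.max()
    if hi == lo:                      # all samples equally close
        return np.ones_like(values)
    return (hi - values) / (hi - lo)

def sample_weights(depth_win, texture_win, centre_depth, centre_tex, centre_pos):
    """Weights p_n for every sample of a local window.

    centre_pos is the (row, col) of the interpolated pixel in window coordinates.
    ASSUMPTION: p_n is the normalised product of the three terms.
    """
    h, w = depth_win.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - centre_pos[1], ys - centre_pos[0])         # dist(n)
    dif_d = np.abs(np.asarray(depth_win, float) - centre_depth)      # difD(n)
    dif_t = np.abs(np.asarray(texture_win, float) - centre_tex)      # difT(n)
    p = _closeness(dist) * _closeness(dif_d) * _closeness(dif_t)
    total = p.sum()
    if total == 0:                    # degenerate window: fall back to uniform
        return np.full(p.shape, 1.0 / p.size)
    return p / total
```

Under these assumptions, samples on the far side of a depth or texture edge receive weights near zero, so they contribute little to the covariance estimate.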
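With the weight model in Eq. (9), we can estimate R and r using Eq. (8). Consequently, the interpolation coefficients in Eq. (6) can be obtained from Eq. (7) as
$$K = \left( \sum_{n=1}^{w^2} p_n c_n c_n^T \right)^{-1} \left( \sum_{n=1}^{w^2} p_n D_n c_n \right).$$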
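A minimal sketch of this coefficient estimation is given below, assuming the weighted sums over the window described above. The function name `mmse_coefficients`, the window parameterization by `half`, the edge-replication padding, and the small ridge term added to R (to keep the solve stable in flat regions, where R is singular) are all choices made for this example and are not taken from the source.

```python
import numpy as np

def mmse_coefficients(d, cy, cx, half=2, weights=None, ridge=1e-3):
    """Estimate K = [k0, k1, k2, k3] for the missing sample whose top-left
    low-resolution neighbour is d[cy, cx].

    The window holds (2*half) x (2*half) low-resolution samples D_n centred on
    the missing sample; c_n holds the four diagonal neighbours of D_n.
    """
    d = np.asarray(d, dtype=np.float64)
    pad = half + 1
    dp = np.pad(d, pad, mode="edge")           # border handling (assumed)
    samples, neighbours = [], []
    for i in range(cy - half + 1, cy + half + 1):
        for j in range(cx - half + 1, cx + half + 1):
            ip, jp = i + pad, j + pad
            samples.append(dp[ip, jp])
            neighbours.append([dp[ip - 1, jp - 1], dp[ip - 1, jp + 1],
                               dp[ip + 1, jp - 1], dp[ip + 1, jp + 1]])
    D = np.array(samples)                      # shape (w*w,)
    C = np.array(neighbours)                   # shape (w*w, 4), rows are c_n^T
    if weights is None:
        p = np.full(D.size, 1.0 / D.size)      # uniform weights: NEDI special case
    else:
        p = np.asarray(weights, dtype=np.float64).ravel()
    R = C.T @ (p[:, None] * C)                 # sum_n p_n c_n c_n^T
    r = C.T @ (p * D)                          # sum_n p_n D_n c_n
    # Ridge term (assumption) regularises the singular flat-region case.
    return np.linalg.solve(R + ridge * np.eye(4), r)
```

In a flat region the regularized solution is close to 0.25 for every coefficient, so the sketch degrades gracefully to plain averaging of the four neighbors.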

Experimental Results
We study the performance of the proposed depth down/upsampling method for depth map coding using test sequences at two resolutions (1920 × 1088 pixels: Poznan_Street [25] and Undo_Dancer; 1024 × 768 pixels: Newspaper and BookArrival [26]), in YUV 4:2:0 format with 8 bits per pixel (bpp). The test materials are provided by MPEG, and the depth maps have been estimated from the original video with the depth estimation reference software [27]. For the Poznan_Street sequence, view 3 and view 5 are selected as reference views. For the Undo_Dancer sequence, view 4 and view 6 are selected as references and view 5 as the target view. For the BookArrival sequence, view 8 and view 10 are selected as references and view 9 as the target view.
For each reference depth map, we downsample it by a factor of two before encoding using the 3-D-HEVC test model (HTM) version 4.1 [28] with quantization parameters (QP) 24, 28, 32, 40, and 44. The texture video sequences have a fixed QP of 32. Thirty frames are coded for each sequence. Other encoder configurations follow those specified in the common test conditions [29] for 3-D video coding. No multiview video coding is applied. After decoding is finished, the intermediate view is synthesized by the view synthesis reference software [30]. The efficiency of the proposed method is evaluated through rate distortion (RD) performance and the subjective quality of the synthesized view. For the RD curves, the x-axis is the total bit rate for the two depth maps and two texture sequences, and the y-axis is the Y_PSNR of the synthesized views compared to the original view.
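For reference, the luma PSNR used on the y-axis can be computed as in the sketch below, assuming 8-bit luma planes (peak value 255) and the usual mean-square-error definition. The function name `y_psnr` is hypothetical.

```python
import numpy as np

def y_psnr(reference, synthesized, peak=255.0):
    """PSNR in dB between two luma (Y) planes of equal size."""
    ref = np.asarray(reference, dtype=np.float64)
    syn = np.asarray(synthesized, dtype=np.float64)
    mse = np.mean((ref - syn) ** 2)
    if mse == 0:
        return float("inf")        # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For a sequence, the per-frame values would typically be averaged over the coded frames.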

Coding Performance
First, the performance of the down/upsampling-based depth coding scheme is compared to that of full-scale coding. For the full-scale method, the depth maps are encoded without down/upsampling using the HTM reference software. Figure 4 compares the RD curves of the proposed method and the full-resolution method.
It can be seen that the down/upsampling-based depth map coding scheme outperforms full-scale depth map coding at lower bit rates. Specifically, as shown in Table 1, the bit rate saving is up to 32.2% for "BookArrival" and 27.6% for "Newspaper" on depth maps, whereas it is 8.9% for "BookArrival" and 5.3% for "Newspaper" on total bit rates. Since the bit rates of depth maps are only about 10 to 20% of those of the texture sequences, the bit rate saving is smaller for the total bit rate than for the depth bit rate. At higher bit rates, the frames are encoded with a smaller QP and preserve much more detail. Therefore, the influence of the down/upsampling distortion becomes larger, and the RD performance falls below that of the full-scale case.
Second, we evaluate the performance of the proposed downsampling, upsampling, and prefiltering methods separately. In order to test the effectiveness of the proposed downsampling algorithm, the original depth maps are downsampled using different downsampling methods while being upsampled with the same EPU algorithm. Figure 5(a) shows the RD curves of the different depth downsampling methods, where "Median downsc." stands for the downsampling proposed by Oh in Ref. 6 and "Reliable Median downsc." for that described in Sec. 2.2.
In order to test the effectiveness of the proposed upsampling algorithm, the decoded depth maps are upsampled using different interpolation algorithms while being downsampled with the same reliable median downsampling before encoding. Figure 5(b) shows the RD curves of the different upsampling methods, where "EPU upsc." stands for the upsampling method described in Sec. 2.3, "NEDI upsc." stands for the upsampling method in Ref. 9, and "EWOC upsc." and "JBU upsc." stand for the recently published upsampling algorithms in Refs. 18 and 17, respectively.
Figure 5(c) shows the RD curves comparing the coding efficiency of the proposed methods against two advanced down/upsampling-based depth coding methods. "Method 1" is the combined method, where depth maps are preprocessed as described in Sec. 2.1, then downsampled as described in Sec. 2.2 and upsampled as described in Sec. 2.3. "Method 2" is the result with the proposed downsampling and upsampling only. "EWOC" stands for the depth map coding method in Ref. 18. "JBU" stands for the down/upsampling algorithm for depth maps in Ref. 17, where depth maps are downsampled with median filtering and upsampled with JBU. No prefiltering is applied to either "Method 2" or the JBU method.
We can see that both the proposed upsampling method and the proposed downsampling method perform well, as shown in Figs. 5(a) and 5(b). By combining the proposed prefiltering, downsampling, and upsampling methods, additional gain can be achieved, as shown in Fig. 5(c).

Synthesized View Quality
Depth map downsampling and upsampling directly impact the subjective quality of synthesized views. Figures 6-8 compare our proposed upsampling method with EWOC upsampling and JBU in terms of the subjective quality of the synthesized views at the decoder after depth map encoding at the same rate. The synthesized images obtained with EWOC interpolation and JBU upsampling exhibit strong jaggedness around object edges. In contrast, our proposed upsampling method employs the texture image, which provides edge information during upsampling; therefore, it obtains clearer and smoother edges along object boundaries.

Computational Complexity Analysis
The processing times are shown in Table 2. The depth map coding time for the proposed method comprises the downsampling time (low-pass filtering and downsampling procedures), the HEVC encoding time, the decoding time, and the proposed upsampling time. The depth coding time for the full-scale method comprises the HEVC encoding time and the decoding time. S1 and S2 denote the Newspaper and BookArrival sequences, respectively.
Since the resolution of the encoded video in the down/upsampling-based method is lower than in the full-scale method, the encoding time of the downsampled depth is far less than that of the full-scale depth. Although additional downsampling and upsampling procedures are needed for the down/upsampling method, its overall computation time is less than that of the full-scale method, as shown in Table 2.

Conclusions
We have presented an edge-preserving depth upsampling method for down/upsampling-based depth coding within the HEVC framework. Different from the NEDI algorithm of Ref. 19, we introduced a weight model for each sample that incorporates geometric similarity as well as intensity similarity in both the depth map and its corresponding texture sequence, thus allowing the interpolation coefficients to adapt to the edge orientation. An evaluation of performance in terms of coded data and synthesized views has been provided. Experimental results show that our proposed interpolation method for down/upsampling-based depth coding improves both the coding efficiency and the synthesized view quality.

Fig. 3 Covariance estimation based on local statistics from a local window.

Figure 3 gives a sketch of the upsampling process. Let d denote the input low-resolution depth map of size M × N. We start with the simplest case of upsampling by a factor of 2 and assume D is the high-resolution depth map after upsampling to size 2M × 2N. We first copy the low-resolution depth map d directly to its high-resolution version D, i.e., $D_{2x,2y} = d_{x,y}$, and then interpolate $D_{2x+1,2y+1}$, $D_{2x+1,2y}$, and $D_{2x,2y+1}$ from D in two steps. The first step is to interpolate $D_{2x+1,2y+1}$ from its four nearest neighbors $D_{2x,2y}$, $D_{2x+2,2y}$, $D_{2x,2y+2}$, and $D_{2x+2,2y+2}$ along the diagonal directions of a square lattice. The second step is to interpolate the other missing samples $D_{2x+1,2y}$ and $D_{2x,2y+1}$ from a rhombus lattice in the same way after a 45-deg rotation of the square grid. Therefore, the interpolation procedure is almost identical for all missing pixels. For example, $D_{2x+1,2y+1}$ is calculated as the weighted sum of its four diagonal neighbors given in Eq. (6).
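The two-step lattice traversal described above can be sketched as follows. To keep the example short, a fixed coefficient of 0.25 is used for every neighbor as a stand-in for the per-pixel MMSE coefficients, and borders are handled by edge replication; both are assumptions of this sketch, as is the function name `upsample_two_pass`.

```python
import numpy as np

def upsample_two_pass(d, k=0.25):
    """Factor-2 upsampling on the square lattice, then the rhombus lattice.

    k is a placeholder coefficient (plain averaging); the proposed method
    would instead use locally estimated coefficients at each missing pixel.
    """
    d = np.asarray(d, dtype=np.float64)
    m, n = d.shape
    dp = np.pad(d, 1, mode="edge")             # one-sample border (assumed)
    E = np.zeros((2 * (m + 2), 2 * (n + 2)))
    E[0::2, 0::2] = dp                         # copy: D[2x, 2y] = d[x, y]

    # Step 1: odd-odd samples from their four diagonal neighbours.
    E[1:-1:2, 1:-1:2] = k * (E[0:-2:2, 0:-2:2] + E[0:-2:2, 2::2]
                             + E[2::2, 0:-2:2] + E[2::2, 2::2])

    # Step 2: remaining samples from the rhombus (up, down, left, right).
    E[1:-1:2, 2:-2:2] = k * (E[0:-2:2, 2:-2:2] + E[2::2, 2:-2:2]
                             + E[1:-1:2, 1:-3:2] + E[1:-1:2, 3:-1:2])
    E[2:-2:2, 1:-1:2] = k * (E[1:-3:2, 1:-1:2] + E[3:-1:2, 1:-1:2]
                             + E[2:-2:2, 0:-2:2] + E[2:-2:2, 2::2])

    return E[2:2 * m + 2, 2:2 * n + 2]         # crop the padded border
```

The second step reuses the odd-odd samples produced by the first step, which is why the order of the two passes matters.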

Fig. 4 Rate distortion (RD) performance comparison of depth map encoding between the full-scale and down/upsampling-based methods: (a) BookArrival and (b) Newspaper.

Table 1 Performances (bit rate versus synthesized view PSNR) of full-scale and down/upsampling depth map coding.