## 1. Introduction

Three-dimensional (3D) video has drawn increasing attention as a new multimedia technique for capturing real-world scenes and is expected to be the next step in the evolution of two-dimensional video.^{1} 3D video captures both the photorealistic texture and the complex geometric shape of real-world scenes. Two types of 3D video applications are envisioned. Three-dimensional television (3DTV) aims to give the viewer depth perception of the scene by simultaneously rendering multiple images from different viewing angles.^{2} Free-viewpoint television (FTV), in contrast, allows interactive selection of the viewpoint and viewing direction in the scene within a certain operating range.^{3} To support 3DTV and FTV applications, multiview acquisition, multiview video coding (MVC), and virtual view rendering are being developed as key technologies.

However, the quality of the rendered virtual views is affected by the compression of texture videos and depth maps.^{4} Given a maximum bitrate budget for representing the 3D scene, how to distribute the bitrate between texture and depth so that the rendering distortion is minimized remains an open problem. In the European ATTEST project, the depth-map bitrate was fixed at 20% of the texture bitrate, but such a fixed ratio cannot guarantee an optimal bit allocation for practical multiview video plus depth (MVD) data. Morvan used a full-search method to exhaustively search for the optimal bitrate trade-off between texture and depth.^{5} Liu proposed a view synthesis distortion model to seek the optimal bitrate trade-off between texture and depth.^{6} Kim proposed a new distortion metric to quantify the effect of depth coding on synthesized view quality.^{7} Yuan proposed a planar distortion model and solved the bit allocation problem with the Lagrangian multiplier method.^{8} However, an accurate relationship between the view synthesis distortion and the corresponding quantization is still needed to select the optimal quantization parameters (QPs) for texture videos and depth maps.

To solve the bit allocation problem in 3D video coding with high accuracy, this letter proposes a novel optimized view synthesis distortion model. The proposed model separates the distortion into two independent terms: the distortion from the compression of texture videos and the distortion from the compression of depth maps. Each of the two terms is then modeled by a quadratic distortion model. Finally, the bit allocation problem is solved by determining the optimal QPs for texture and depth in the view synthesis distortion equation.

## 2. View Synthesis Distortion Model Optimization

An accurate bit allocation model plays an important role in 3D video coding because it improves rendering quality at low complexity. The optimization is to find the quantization levels for texture videos and depth maps that minimize the view synthesis distortion under the total bitrate constraint *R*_{c}:

$$\begin{array}{l}\underset{(R_t,R_d)\in Q}{\arg\min}\; D_v \\[6pt] \text{s.t.}\;\; R_d + R_t \le R_c, \end{array} \tag{1}$$

where *R*_{t} and *R*_{d} are the coding bitrates of texture videos and depth maps, respectively, *D*_{v} is the view synthesis distortion, and *Q* is the candidate set of bitrate pairs.

To model the expression in Eq. 1, let *S*_{v} denote the virtual image synthesized from the original texture images and the original depth maps, $\bar S_v$ the virtual image synthesized from the original texture images and the compressed depth maps, $\tilde S_v$ the virtual image synthesized from the compressed texture images and the original depth maps, and $\hat S_v$ the virtual image synthesized from the compressed texture images and the compressed depth maps. The view synthesis distortion is the mean squared error between *S*_{v} and $\hat S_v$. Our goal is to obtain a mathematical expression for the view synthesis distortion model as

$$\begin{aligned} D_v &= E\{[(S_v - \bar S_v) + (\bar S_v - \hat S_v)]^2\} \\ &= E[(S_v - \bar S_v)^2] + E[(\bar S_v - \hat S_v)^2] \\ &\quad + 2E[(S_v - \bar S_v)(\bar S_v - \hat S_v)]. \end{aligned} \tag{2}$$

Two hypotheses are made: (1) the cross term $E[(S_v - \bar S_v)(\bar S_v - \hat S_v)]$ is negligible; (2) the term $E[(\bar S_v - \hat S_v)^2]$ can be approximated by $E[(S_v - \tilde S_v)^2]$. Both hypotheses are validated by experiments: the term in the first hypothesis is close to zero, and the two distortion terms in the second hypothesis are very similar. Thus, the error propagation between texture and depth can be neglected. Finally, the following expression is obtained

$$D_v = \underbrace{E[(S_v - \bar S_v)^2]}_{D_d} + \underbrace{E[(S_v - \tilde S_v)^2]}_{D_t}. \tag{3}$$

The view synthesis distortion is thus separated into two independent terms: *D*_{t}, the distortion of the synthesized virtual view induced by texture compression, and *D*_{d}, the distortion of the synthesized virtual view induced by depth compression. Let *TQ*_{step} denote the quantization step of the texture images and *DQ*_{step} the quantization step of the depth maps. The aim of the proposed view synthesis distortion model is to find the mathematical relationship between distortion and quantization for texture and depth. Through experiments we find that, when the texture quality is fixed, the relationship between the distortion *D*_{d} induced by depth compression and *DQ*_{step} is approximately quadratic. Similarly, the relationship between the distortion *D*_{t} induced by texture compression and *TQ*_{step} can also be approximated by a quadratic model. Thus, the two terms in Eq. 3 are described by the quadratic distortion models

$$D_d(DQ_{\mathrm{step}}) \cong \alpha_d (DQ_{\mathrm{step}})^2 + \beta_d (DQ_{\mathrm{step}}) + \gamma_d, \tag{4}$$

$$D_t(TQ_{\mathrm{step}}) \cong \alpha_t (TQ_{\mathrm{step}})^2 + \beta_t (TQ_{\mathrm{step}}) + \gamma_t, \tag{5}$$

where *α*_{d}, *β*_{d}, *γ*_{d} and *α*_{t}, *β*_{t}, *γ*_{t} are model parameters for depth and texture, respectively.

To characterize the rate-quantization (R-Q) relationship, many R-Q models have been designed, such as the linear R-Q model and the quadratic R-Q model. Through experiments we find that the quadratic R-Q model fits the actual R-Q curve well.^{9} Thus, *R*_{d} and *R*_{t} can be expressed as

$$R_d(DQ_{\mathrm{step}}) \cong \mu_d (1/DQ_{\mathrm{step}})^2 + \nu_d (1/DQ_{\mathrm{step}}) + \gamma_d, \tag{6}$$

$$R_t(TQ_{\mathrm{step}}) \cong \mu_t (1/TQ_{\mathrm{step}})^2 + \nu_t (1/TQ_{\mathrm{step}}) + \gamma_t, \tag{7}$$

where *μ*_{d}, *ν*_{d}, *γ*_{d} and *μ*_{t}, *ν*_{t}, *γ*_{t} are model parameters for depth and texture, respectively.
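Both model families are linear in their parameters, so each can be fitted by ordinary least squares: the distortion models of Eqs. 4 and 5 are quadratic in the quantization step, and the R-Q models of Eqs. 6 and 7 are quadratic in its reciprocal. The following is a minimal sketch of such a fit, not the authors' implementation. The function names, the sample quantization steps, and the synthetic distortion and rate values are all assumptions made for illustration.

```python
def solve3(a, b):
    """Solve the 3x3 system a @ x = b by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ a*x^2 + b*x + c; returns (a, b, c)."""
    # Normal equations (X^T X) p = X^T y with design rows [x^2, x, 1].
    rows = [[x * x, x, 1.0] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return tuple(solve3(xtx, xty))


# Hypothetical depth quantization steps and synthetic measurements (not from the paper).
dq_steps = [8.0, 13.0, 20.0, 32.0, 40.0]
depth_dist = [0.02 * q * q + 0.5 * q + 3.0 for q in dq_steps]
depth_rate = [9000.0 / (q * q) + 1500.0 / q + 40.0 for q in dq_steps]

# Eq. 4: distortion as a quadratic in DQ_step.
alpha_d, beta_d, gamma_d = fit_quadratic(dq_steps, depth_dist)

# Eq. 6: rate as a quadratic in 1/DQ_step. The constant term is stored under its
# own name here because the text reuses the symbol gamma_d for both models.
mu_d, nu_d, rate_const_d = fit_quadratic([1.0 / q for q in dq_steps], depth_rate)
```

The texture models of Eqs. 5 and 7 would be fitted the same way from texture encodes at a few values of *TQ*_{step}.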

In the experiments, we encode the original depth maps and texture videos at a small number of quantization steps and fit the quadratic distortion models and quadratic R-Q models above by linear regression. With these models, minimizing the view synthesis distortion becomes an optimization over *TQ*_{step} and *DQ*_{step}. Thus, Eq. 1 can be rewritten as

$$\begin{array}{l}\underset{(DQ_{\mathrm{step}},TQ_{\mathrm{step}})\in Q}{\arg\min}\; D_d(DQ_{\mathrm{step}}) + D_t(TQ_{\mathrm{step}}) \\[6pt] \text{s.t.}\;\; R_d(DQ_{\mathrm{step}}) + R_t(TQ_{\mathrm{step}}) \le R_c. \end{array} \tag{8}$$

To solve Eq. 8, we restrict the range of the quantization steps and perform a simple search for the minimum view synthesis distortion cost. The basic idea is to construct a subset of all possible solutions and find an approximately optimal solution within it. Once the optimal quantization steps *TQ*_{step} and *DQ*_{step} are found, the optimal *QP*_{t} for texture video coding and *QP*_{d} for depth map coding are determined from the relationship between QP and quantization step *Q*_{step}, *Q*_{step} = 2^{(QP − 4)/6}.
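The search over a restricted range can be sketched as follows. This is only one possible reading of the "simple search": it enumerates every integer QP pair in a restricted range, discards pairs that exceed the bitrate budget, and keeps the pair with the lowest modeled distortion. The model parameters, the QP range, and the budget are hypothetical values chosen for illustration, not results from the letter.

```python
def qp_to_qstep(qp):
    """Quantization step from QP, using Q_step = 2^((QP - 4) / 6)."""
    return 2.0 ** ((qp - 4) / 6.0)


def quad(params, x):
    """Evaluate a*x^2 + b*x + c for params = (a, b, c)."""
    a, b, c = params
    return a * x * x + b * x + c


def allocate(d_depth, d_tex, r_depth, r_tex, rate_budget, qp_range):
    """Return (cost, qp_t, qp_d, rate) minimizing D_d + D_t subject to R_d + R_t <= budget."""
    best = None
    for qp_t in qp_range:
        tq = qp_to_qstep(qp_t)
        for qp_d in qp_range:
            dq = qp_to_qstep(qp_d)
            # The rate models are quadratic in the reciprocal step (Eqs. 6 and 7).
            rate = quad(r_depth, 1.0 / dq) + quad(r_tex, 1.0 / tq)
            if rate > rate_budget:
                continue  # violates the constraint in Eq. 8
            # The distortion models are quadratic in the step (Eqs. 4 and 5).
            cost = quad(d_depth, dq) + quad(d_tex, tq)
            if best is None or cost < best[0]:
                best = (cost, qp_t, qp_d, rate)
    return best


# Hypothetical fitted parameters (a, b, c); real values come from the regression step.
D_DEPTH = (0.01, 0.2, 1.0)
D_TEX = (0.03, 0.6, 2.0)
R_DEPTH = (2000.0, 300.0, 10.0)
R_TEX = (20000.0, 3000.0, 50.0)

best = allocate(D_DEPTH, D_TEX, R_DEPTH, R_TEX, rate_budget=600.0, qp_range=range(20, 45))
```

Because the models are cheap to evaluate, enumerating a few hundred QP pairs requires no additional encoding passes.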

## 3. Experimental Results

In the experiments, we used four typical 3D video sequences, “Altmoabit,” “Bookarrival,” “Doorflower,” and “Leavelaptop,” each with a resolution of 1024×768. The 8th and 10th views of “Altmoabit” and “Doorflower” are used to synthesize the 9th virtual view. The 9th and 11th views of “Bookarrival” and “Leavelaptop” are used to synthesize the 10th virtual view. For all experiments, we used the H.264 JM 17.2 video codec to encode texture videos and depth maps independently, and the VSRS 3.0 software to synthesize the virtual view.

To demonstrate the accuracy of the proposed model, Fig. 1 shows the relationship between view synthesis distortion and quantization step. The model for texture fits the actual distortion well because distortions in texture videos propagate directly to the virtual views. The model for depth is slightly less accurate because distortions in depth maps may induce geometry changes in the virtual views, but the correlation coefficients between the actual distortions and the fitted values are all larger than 0.98.
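The correlation coefficient used as the accuracy measure can be computed as in the sketch below. It assumes the Pearson coefficient is meant, which the text does not state explicitly, and the distortion values are made-up numbers for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measured distortions and the values predicted by a fitted model.
actual = [10.2, 14.9, 22.3, 35.8, 47.1]
fitted = [10.0, 15.3, 22.0, 36.1, 46.9]
r = pearson(actual, fitted)
```

A coefficient close to 1 indicates that the fitted model tracks the measured distortion across the sampled quantization steps.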

Then, using the obtained view synthesis distortion model, we can estimate the view synthesis quality at different texture and depth QPs without actually encoding the texture videos and depth maps. We compared the view synthesis rate-distortion (R-D) performance of the proposed bit allocation method with that of a bit allocation method using a fixed 5:1 ratio. The R-D curves of the two bit allocation methods for “Altmoabit” and “Doorflower” are shown in Fig. 2. As the results show, the proposed method achieves a higher peak signal-to-noise ratio (PSNR) than the fixed-ratio 5:1 method.
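For reference, the fixed-ratio baseline and the PSNR measure can be written in a few lines. The sketch assumes that the 5:1 ratio is texture to depth and that the synthesized views are 8-bit images with a peak value of 255. Neither assumption is stated in the text, and the function names are illustrative.

```python
import math


def split_fixed_ratio(rate_budget, texture_parts=5.0, depth_parts=1.0):
    """Split a total bitrate budget into (texture, depth) shares at a fixed ratio."""
    total = texture_parts + depth_parts
    return rate_budget * texture_parts / total, rate_budget * depth_parts / total


def psnr_db(mse, peak=255.0):
    """PSNR in dB for a given mean squared error, assuming 8-bit samples by default."""
    return 10.0 * math.log10(peak * peak / mse)


# Hypothetical total budget, in the same units as the rate models.
r_texture, r_depth = split_fixed_ratio(600.0)
```

A PSNR gain between two methods at the same total bitrate is then the difference of their `psnr_db` values.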

## 4. Conclusions

In this letter, a novel optimized view synthesis distortion model is proposed. The proposed model separates the view synthesis distortion into two independent terms, each of which is modeled by a quadratic distortion model. The experimental results show the effectiveness of the proposed model. In future work, a more accurate model, such as a cubic distortion model, will be investigated to represent the view synthesis distortion. In addition, a rate-distortion criterion for model decision in 3D video coding will be designed by employing the proposed model.

## Acknowledgments

This work was supported by the Natural Science Foundation of China (Grant Nos. 60902096, 60832003, and 61071120) and the Specialized Research Fund for the Doctoral Program of Higher Education of China (20093305120002). It was also sponsored by the K. C. Wong Magna Fund in Ningbo University.