## 1.

## Introduction

Multiview video codec (MVC) design has become popular,^{1} and it underpins widespread applications such as three-dimensional (3-D) video, free-viewpoint television (FTV), and video surveillance networks. The 3-D video provides high-quality, immersive multimedia entertainment that can be experienced through various channels, including movies, TV, and the internet. The FTV is an MVC system that allows switching among different viewpoints, each of which is captured by a camera from a specific view angle. For video surveillance networks, the MVC can be used to monitor and detect unusual events/objects. However, the MVC requires intersensor communication, which is expensive and not feasible in some applications. The amount of information and the computational load of an MVC codec are very large compared to those of a monoview codec, so efficiently processing and compressing multiview videos is challenging. The joint video team has been working on the MVC, which captures videos from different video cameras and encodes these signals with reference to each other to yield a single bitstream. To enhance codec performance, most MVC schemes exploit correlations between both intraview and interview frames. The encoder performs block motion compensation (MC) and disparity estimation to remove correlations between images along the intraview/temporal and interview video dimensions to achieve high compression efficiency. Under this MVC framework, the time complexity of the encoding operations is high in exchange for efficient compression. It therefore cannot provide low-complexity encoding for applications such as wireless video sensor/surveillance networks and low-power MVC capturing devices. The coding complexity has to be shifted to the decoder to make these applications feasible.

The distributed video coder (DVC)^{2}^{,}^{3} was proposed to effectively shift coding complexity to the decoder; it can capture and encode signals from several low-power devices independently and decode these signals jointly. It can be extended to deal with multiview video signals,^{4}^{,}^{5} in which the disparity information among images of different views can be exploited for removing correlations, in addition to correlations among intraview images. The DVC^{2} was developed based on lossless distributed source coding, also known as the Slepian–Wolf coder (SWC).^{6} An important aspect of the SWC is that separate encoding can theoretically achieve the same compression ratio as joint encoding, as long as the correlations among data streams are exploited by a joint decoder. This SWC framework was extended to lossy compression with side information (SI) at the decoder,^{7} as in the case of the Wyner–Ziv (WZ) coder. With the WZ coding algorithm, the DVC treats video compression as a channel coding problem. The input video of the DVC is decomposed into odd and even sequences, in which the former is encoded as key frames (KFs) and the latter as WZ frames (WZFs). The KFs are encoded with the H.264/AVC^{8} intramode, H.264/INTRA, and the WZFs are block-transformed, quantized, and transmitted through error correction codes in a bit-plane by bit-plane approach, in which only part of the parity bits are transmitted. At the decoder, the KFs are utilized to yield the SI, a noisy version of the WZF, which serves as the systematic part of an error correction code and cooperates with the received parity bits to correct channel errors. Compared to current video codecs, the DVC effectively shifts a considerable amount of the coding complexity from the encoder to the decoder. It can also be applied to error resilience control,^{9} which treats the side information frame (SIF) as additional reference information, SI, to correct channel errors.
Recently, a new distributed video codec based on modulo operation in the pixel domain has been proposed,^{10} which demonstrates lower decoding complexity.

Integrating the DVC into the MVC yields multiview distributed video coding (MDVC), which allows several low-power capturing devices to encode their signals independently while these signals are decoded jointly. A view-synthesis and disparity-based correlation model that exploits interview video correlation has been proposed to deliver error-resilient video in a distributed multicamera system.^{11} One simple MDVC example with a left-, a right-, and a central-view camera is shown in Fig. 1. The left- (L) and right-view (R) videos are encoded and decoded by the traditional video codec, e.g., H.264/INTRA, to act as KFs ($I$ frames) for the DVC decoding. The central-view video is encoded by interleaving one intra ($I$) frame and one WZF, i.e., with a group of pictures (GOP) size of 2, $|\mathrm{GOP}|=2$. At the decoder, the SI for a WZF can be estimated by exploiting the intraview and interview image correlations, respectively. The decoded KFs are utilized to jointly reconstruct the WZFs, ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$, based on inter- and intraview image correlations. These correlations are utilized by assigning weights to the different estimated motion vectors (MVs) obtained under the MDVC framework. This decoder-driven fusion method is adopted to improve the codec performance, e.g., peak signal-to-noise ratio (PSNR) and time complexity. In addition, the embedded DVC makes it feasible to set up low-complexity, mobile encoders for multiview video acquisition to enable low-delay and real-time processing of the MDVC. The decoder can absorb the shifted computational complexity by using a high-performance computer for central decoding, e.g., one with large buffers, disk arrays, and high-speed CPUs.

Many studies have addressed improving the MDVC SIF quality.^{12}^{–}^{14} An iterative SIF generation method uses the decoded WZF to refine the SIF,^{12} so that the second iteration can enhance the quality of decoded images. By performing interpolation along the intra- and interview video dimensions, respectively, to yield candidate SIFs, the final SIF can be fused from these candidates with a specific reliability measurement.^{13} The interview interpolated candidate SIF for fusion can be enhanced by using a perspectively transformed one,^{15}^{,}^{16} which helps to fuse better final SIFs and demonstrates better coding performance than the monoview DVC. Three new fusion techniques that exploit signal properties of neighboring residual frames along the intra- and interview directions were proposed for robustness and improved SIF quality.^{17} The fusion can also adopt a support vector machine to identify a set of features for classifying pixels into either the temporal or the disparity class, by which the fusion can yield a better SIF.^{18} It provides a good solution for fusing intra- and interview predictions. However, these fusion methods suffer from performance degradation due to low temporally predicted quality and irregular video motion. An adaptive filtering view interpolation method^{19}^{,}^{20} was proposed to minimize the difference between the SIF and the decoded KF, which can compensate for intercamera mismatches and improve SIF quality. When occlusion exists between interview videos, temporal frame interpolation is utilized to compensate for the deficiency of interview linear fusion^{20} and improve SIF quality. Various SI generation methods are evaluated and compared for better utilization efficiency.

By estimating motion on interpolated frames, the irregular motion artifacts can be eliminated and the SIF quality can be improved.^{21} One MDVC codec^{22} was designed to transmit a small amount of error control information to replace an untransmitted frame and the information is obtained from a low-dimensional blockwise projection of the frame, i.e., mean-based projection. The most prominent feature of this work is that it is performed as a postprocessing step after decoding and interpolating the received video, which allows easy integration with various video transmission systems.

The conventional video codec usually adopts a coding structure with a GOP size larger than 15, $|\mathrm{GOP}|>15$, to yield good enough rate-distortion (RD) performance. For the MDVC, the GOP size is usually set smaller because, when the WZ codec adopts longer GOP sizes, performing ME becomes difficult and less reliable, such that the reconstructed SIF quality is degraded. Previous research^{23} investigates the RD and complexity performance of the feedback-channel-based WZ codec as a function of the GOP size and justifies that the lowest encoder complexity, e.g., $|\mathrm{GOP}|=2$, yields the best RD performance, as compared with the conventional video codec. For the MDVC, the coding structure with $|\mathrm{GOP}|=2$ is adopted for simplicity and efficiency. Under the MDVC framework, we propose to process static and nonstatic image regions with different procedures. By exploiting correlations between images along the inter- and intraview dimensions, the proposed weighted block-matching prediction (BMP) can yield higher SIF quality. The proposed categorized block matching prediction with fidelity weights method is abbreviated as COMPETE. At the decoder, the scale-invariant feature transform (SIFT)^{24} is adopted to find stable key feature points in the first decoded KF images, ${\widehat{L}}_{0}$, ${\widehat{R}}_{0}$, and ${\widehat{I}}_{0}$, which are used for matching corresponding features among interview video images to estimate the homography matrices, ${\mathbb{H}}_{\mathrm{l}}$ and ${\mathbb{H}}_{\mathrm{r}}$, through a RANSAC^{25} algorithm. The SIFT processing time was found to be proportional to the image size. The ${\mathbb{H}}_{\mathrm{l}}$ and ${\mathbb{H}}_{\mathrm{r}}$ are estimated once at the decoder to perspectively transform side-view images to the central view. The homography matrix can also be estimated at a regular time interval or dynamically according to scene foreground/background changes.
In the proposed COMPETE algorithm, image blocks are categorized into motion, no-motion, and outlier blocks, which are processed in different ways. For motion blocks, with both the perspectively transformed images, ${\widehat{L}}_{t}^{\prime}$ and ${\widehat{R}}_{t}^{\prime}$, and the reconstructed central-view images, ${\widehat{I}}_{t}$, available at the decoder, the block MC procedure can be performed between adjacent images from these transformed and central-view ones to yield MVs. Combining the blocks reached by these MVs with weights proportional to block fidelity generates smoother and higher quality SIFs. For no-motion blocks, the current block is compensated by the co-located block in the previous frame. For blocks residing on the outlier, which results from perspective transformation, temporal bidirectional MC is performed between the central-view images ${\widehat{I}}_{2t-1}$ and ${\widehat{I}}_{2t+1}$. The proposed COMPETE algorithm helps to improve the SI confidence and the quality of decoded WZFs, ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$, for the MDVC system. The COMPETE also effectively decreases the computational load while achieving PSNR performance comparable with other SIF reconstruction methods, e.g., MVME^{26} and H.264/INTRA.

For rate control of the MDVC channel coding, the turbo codec is designed to let the decoder receive just enough parity bits from the encoder for signal reconstruction. The rate compatible punctured turbo (RCPT) code is adopted for the MDVC channel coding, which originated from unequal error protection for unstable transmission.^{27} An automatic repeat request (ARQ) rate control method was developed under RCPT^{28} to transmit the fewest parity bits needed for successful decoding. So that the turbo decoder can reference more reliable prior probabilities, reduce its number of iterations, and improve decoding efficiency, the correlation between the DCT coefficients of the original frame and those of its SIF is modeled as a Laplacian distribution.^{29} Different puncture patterns were designed for the direct current (DC) and alternating current (AC) coefficients, DCs and ACs, to yield the parity bits, based on which the correlation between bit-planes is exploited to estimate the a posteriori probability that provides the a priori probability for turbo decoding. Simulations verified that the turbo decoding time can be reduced to 37% of that of other SIF generation methods.

In what follows, SIF reconstruction methods developed based on the MDVC system and the proposed COMPETE methods are described in Sec. 2. The proposed rate control algorithm to improve the MDVC performance is described in Sec. 3. Section 4 is the simulation study. Section 5 concludes this paper.

## 2.

## Multiview Distributed Video Coding Side Information

For one MDVC with $|\mathrm{GOP}|=2$, half of central-view images are encoded as WZFs and the SIF quality at the decoder would dominate the WZ codec performance. The SIF at the decoder can be considered as a reconstructed image of the original WZF at the encoder transmitted through noise channels. If the SIF quality is high enough, fewer parity bits will be requested during decoding and higher codec efficiency can be achieved. In a monoview video codec, the general approach to yield SIF is performing temporal interpolation/extrapolation from KFs to yield SIF, and there are other approaches adopting motion compensated interpolation to improve SIF quality, such as using an optical flow predictor^{30} and hash-based estimator.^{31} For the MVC, the same scene is captured from different viewing angles by different cameras, such that the correlation among different view videos can be utilized for SIF generation. Under the MDVC framework, we proposed to utilize the SIFT^{24} feature extraction and the RANSAC^{25}^{,}^{32} algorithm to exploit feature correspondences among interview video images. The SIFT outperforms other feature descriptors on images with real geometric and photometric transformations,^{33} and the RANSAC helps to robustly fit a model to data in the presence of outliers, based on which the homography matrices^{34} can be estimated for perspective transform from side-view video to central view. The proposed BMP algorithm can then be carried out to yield high quality SIF and improve the quality of decoded WZF. Different SIF reconstruction methods developed based on the MDVC framework, such as motion compensated temporal interpolation (MCTI),^{35} MVME,^{26} and hybrid-MVME (H-MVME), will be first reviewed for performance comparisons in the following sections.

## 2.1.

### Side Information Reconstruction

The MCTI^{35} is an image reconstruction/interpolation method, in which block ME and MC are utilized to explore the temporal correlation of monoview videos. To interpolate the current frame, ${I}_{2t}$, the MVs estimated from its previous frame ${I}_{2t-1}$ and the next frame ${I}_{2t+1}$ are halved for bidirectional MC to yield the interpolated SIF, ${Y}_{\mathrm{SI}}$. The MVME scheme^{26} carried out at the decoder is shown in Fig. 2, in which KFs, $I\mathrm{s}$, are coded with H.264/INTRA and the WZF is to be reconstructed with its SIF. For one WZF, two ME paths can be adopted: the inner path is estimated by performing disparity vector estimation followed by MV estimation, as demonstrated by Fig. 3(a); the outer path can be obtained by reversing the above two vector estimation procedures, as shown in Fig. 3(b). To interpolate each block with $N\times N$ pixels in the WZF, let the side-view image at time $2t-1$, ${I}^{\text{side}}(2t-1)$, be the target image, in which a best matched block, with a disparity vector, ${\overrightarrow{v}}_{d}$, corresponding to the co-located block in the central-view image, ${I}^{\text{central}}(2t-1)$, is found. The best matched block in ${I}^{\text{side}}(2t-1)$ is then used to find another best matched block in ${I}^{\text{side}}(2t)$ with an MV ${\overrightarrow{v}}_{m}$. This procedure yields one reference ${\overrightarrow{v}}_{m}$, or one inner path MV, for the co-located block in the current WZF. By applying the same procedure to the other three sets of reference images, three other inner path MVs can be found for the current block in the WZF. The outer path MVs can be obtained by the same procedure but with MV estimation first and then disparity vector estimation.
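The "best matched block" search that underlies both the MV and the disparity vector estimation can be sketched as a full-search block match. This is a minimal illustration, not the implementation used in this work: it assumes the sum of absolute differences (SAD) as the matching cost, frames stored as nested lists of integers, and the hypothetical helper names `sad` and `best_match`.

```python
def sad(frame_a, frame_b, ya, xa, yb, xb, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(frame_a[ya + r][xa + c] - frame_b[yb + r][xb + c])
        for r in range(n)
        for c in range(n)
    )


def best_match(cur, ref, y, x, n, search):
    """Full search within +/- search pixels.

    Returns ((dy, dx), cost) for the displacement of the block at (y, x)
    in `cur` that minimises the SAD against `ref`.
    """
    height, width = len(ref), len(ref[0])
    best_vec, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + n > height or rx + n > width:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, y, x, ry, rx, n)
            if best_cost is None or cost < best_cost:
                best_vec, best_cost = (dy, dx), cost
    return best_vec, best_cost
```

Calling `best_match` with a side-view reference gives a disparity vector, and calling it with a temporally adjacent reference gives an MV; chaining the two calls in either order mirrors the inner and outer paths described above.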

When all ME paths of the WZF are included, i.e., four inner and four outer paths are used to perform MVME, eight estimated frames are obtained. The SIF can be reconstructed by taking the weighted/nonweighted average of the corresponding blocks reached by the estimated MVs. Although the MVME provides several estimated MVs for reference, it suffers from heavy computation. In addition, it may lead to trivial estimation errors for no-motion blocks. The MVME approach utilizes general ME operations, designed for intraview video, to estimate disparity vectors among interview images. To bridge this inherent gap between ${\overrightarrow{v}}_{m}$ and ${\overrightarrow{v}}_{d}$ estimation, we propose to estimate the homography matrix to perspectively transform the side-view video to the central view, such that applying ME on interview images becomes more suitable. This H-MVME approach can yield better PSNR performance than the MVME. In addition to handling the MVME in the hybrid approach, we propose to eliminate trivial ME operations for no-motion blocks and perform BMP based on calculating the weighted sum of MC blocks reached through different MVs, denoted as COMPETE as described above, to improve the MVME and yield a high quality SIF. In case the disparity/MV estimation operates on the outlier, i.e., regions without corresponding pixels resulting from the perspective transformation, the temporal MCTI is adopted to interpolate the current block in the WZF.

## 2.2.

### COMPETE Side Information Reconstruction

The COMPETE SIF reconstruction method is proposed to enhance the H-MVME to yield SIF with higher confidence. When homography matrices are not available for perspective transformation, we utilize the SIFT feature extraction and the RANSAC procedure to estimate homography matrices and then utilize BMP to yield high confidence SIF.

## 2.2.1.

#### Homography

The homography relates the pixel coordinates in two images. When it is applied to every pixel, the new image is a warped version of the original one. This homography relationship is independent of the scene structure. To be more specific, one homography matrix, $\mathbb{H}$, which is $3\times 3$, can transform one camera view to another.^{34} To estimate the ${\mathbb{H}}_{v\in \{l,r\}}\mathrm{s}$, the SIFT^{24} algorithm is first applied to the video images of different views, $L$, $R$, and $I$, to find stable key feature points. Tentative feature point pairs between two images are selected to provide candidate homography matrices, ${\mathbb{H}}_{v\in \{l,r\}}$. The feature point pairs and candidate ${\mathbb{H}}_{v\in \{l,r\}}\mathrm{s}$ are iteratively selected and verified by finding the maximum consensus set through the RANSAC procedure to yield the best ${\mathbb{H}}_{v\in \{l,r\}}$. This stage seeks all corresponding SIFT points, or matching pairs, between two different view images. Mismatches occur because the matching process assumes proximity and similarity, and some correspondences are located in outliers. In general, the RANSAC outperforms gradient descent methods^{36} in that too many outliers prevent the latter from converging to the global optimum.
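The consensus test at the heart of the RANSAC step can be illustrated as follows. The sketch assumes that the candidate $3\times 3$ matrices have already been fitted from sampled point pairs (the fitting itself is omitted), that a pair counts as an inlier when its reprojection error is below a hypothetical tolerance `tol`, and that the homogeneous scale `w` is nonzero; the function names are illustrative only.

```python
def apply_homography(h, x, y):
    """Map pixel (x, y) through a 3x3 homography given as nested lists."""
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]  # assumed nonzero here
    return xs / w, ys / w


def consensus_size(h, pairs, tol=1.0):
    """Count matched pairs whose reprojection error is below tol pixels."""
    count = 0
    for (x, y), (u, v) in pairs:
        px, py = apply_homography(h, x, y)
        if ((px - u) ** 2 + (py - v) ** 2) ** 0.5 < tol:
            count += 1
    return count


def pick_best(candidates, pairs, tol=1.0):
    """Return the candidate homography with the largest consensus set."""
    return max(candidates, key=lambda h: consensus_size(h, pairs, tol))
```

Mismatched feature pairs simply fail the tolerance test and do not influence which candidate is selected, which is why this selection is robust to outliers.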

## 2.2.2.

#### Scale-invariant feature transform

The SIFT^{24} procedure helps to represent one image with robust feature points. It transforms one image into scale-invariant feature coordinates corresponding to local features. The procedure discards low-contrast feature points and eliminates edge responses so that only the stable keypoints remain.

## 2.2.3.

#### Interpolation and homography

The SIF at the turbo decoder is generated by the “interpolation/homography” module, as shown in Fig. 4. We propose to exploit correlations among interview images, in addition to intraview ones, to prevent reference SIFs from having severe disparity. The reference central-view images can be obtained through the homography matrices, ${\mathbb{H}}_{l}$ and ${\mathbb{H}}_{r}$, from left- and right-view images. To estimate the ${\mathbb{H}}_{l}$ and ${\mathbb{H}}_{r}$, the first intracoded frames, ${\widehat{L}}_{0}$, ${\widehat{R}}_{0}$, and ${\widehat{I}}_{0}$, received and reconstructed at the decoder, are used as sample images to extract corresponding stable SIFT features between left/right-view and central-view images. To estimate the homography matrix based on the corresponding feature points, the RANSAC procedure is carried out to find the matrices, ${\mathbb{H}}_{l}$ and ${\mathbb{H}}_{r}$, that yield the maximum number of inliers. The reference central-view images can then be obtained by performing perspective transform through ${\mathbb{H}}_{l}$ and ${\mathbb{H}}_{r}$ from the decoded left- and right-view images, $\widehat{L}$ and $\widehat{R}$, i.e., ${\widehat{L}}^{\prime}={\mathbb{H}}_{l}(\widehat{L})$ and ${\widehat{R}}^{\prime}={\mathbb{H}}_{r}(\widehat{R})$, as shown in Fig. 5(a). With the reference central-view images, the BMP procedure can be carried out to yield the SIF, ${\widehat{I}}_{2t}^{\mathrm{int}}$. For one multiview video, the homography matrix that transforms the side-view video to the central view has to be estimated only once with reference to $\{{\widehat{R}}_{0},{\widehat{I}}_{0},{\widehat{L}}_{0}\}$ at the beginning of decoding. With the homography matrices estimated through the SIFT and the RANSAC procedures, the BMP among ${\widehat{L}}^{\prime}$, ${\widehat{R}}^{\prime}$, and the original decoded image ${\widehat{I}}_{2t-1}$ is performed to yield the SIF, as described in the following section.

## 2.3.

### Block Matching Prediction

Performing perspective transformation from side-view to central-view frames results in an outlier, i.e., a mistransformed area, as shown in Fig. 5(a). The perspectively transformed images, ${\widehat{L}}^{\prime}\mathrm{s}$ and ${\widehat{R}}^{\prime}\mathrm{s}$, and the reconstructed central-view images, ${\widehat{I}}_{2t-1}\mathrm{s}$, are used to perform block matching to estimate disparity vectors and MVs, denoted as ${\overrightarrow{v}}_{d}^{\prime}$ and ${\overrightarrow{v}}_{m}^{\prime}$, respectively. The SIF of a central-view image that is not transmitted can be reconstructed through weighted motion compensated prediction by the above ${\overrightarrow{v}}_{m}^{\prime}\mathrm{s}$ and ${\overrightarrow{v}}_{m}\mathrm{s}$, in which the latter are estimated from ${\widehat{I}}_{2t\pm 1}\mathrm{s}$. This BMP process reconstructs the SIF, ${\widehat{I}}_{2t}^{\mathrm{int}}$, shown in Fig. 5(b), where ${B}_{i}$ is the block in ${\widehat{I}}_{2t-1}$, and ${\overrightarrow{v}}_{{d}_{i}}^{\prime}$ and ${\overrightarrow{v}}_{{m}_{i}}^{\prime}$ are the disparity vectors and MVs estimated between reconstructed interview images, e.g., $\{{\widehat{L}}_{2t-1}^{\prime},{\widehat{I}}_{2t-1}\}$ and $\{{\widehat{R}}_{2t-1}^{\prime},{\widehat{I}}_{2t-1}\}$, and between ${\widehat{I}}_{2t\pm 1}\mathrm{s}$, respectively. The COMPETE flowchart is shown in Fig. 6. One ${\widehat{I}}_{2t-1}$ is partitioned into $M$ $8\times 8$ blocks, $\{{B}_{i}({\widehat{I}}_{2t-1})|i\in 1,\cdots ,M\}$, and a large block $L{B}_{i}({\widehat{I}}_{2t-1})$ consists of $2\times 2$ blocks, i.e., $L{B}_{i}({\widehat{I}}_{2t-1})=\{{B}_{i}^{11},{B}_{i}^{12},{B}_{i}^{21},{B}_{i}^{22}\}$, in which ${B}_{i}^{11}$ is the current block, i.e., ${B}_{i}={B}_{i}^{11}$.
The four block MVs in $L{B}_{i}$, $({\overrightarrow{v}}_{m}^{11},{\overrightarrow{v}}_{m}^{12},{\overrightarrow{v}}_{m}^{21},{\overrightarrow{v}}_{m}^{22})$, are obtained by performing motion estimation (ME) between ${\widehat{I}}_{2t-1}$ and ${\widehat{I}}_{2t+1}$ for the co-located $L{B}_{i}$. If $({\overrightarrow{v}}_{m}^{11},{\overrightarrow{v}}_{m}^{12},{\overrightarrow{v}}_{m}^{21},{\overrightarrow{v}}_{m}^{22})=\overrightarrow{0}$, it means ${B}_{i}$ in $L{B}_{i}$ is a no-motion block and can be reconstructed by direct copy from its previous image, i.e., ${B}_{i}^{11}({\widehat{I}}_{2t}^{\mathrm{int}})={B}_{i}({\widehat{I}}_{2t-1})$. If $({\overrightarrow{v}}_{m}^{11},{\overrightarrow{v}}_{m}^{12},{\overrightarrow{v}}_{m}^{21},{\overrightarrow{v}}_{m}^{22})\ne \overrightarrow{0}$, then ${B}_{i}$ is a motion block and the corresponding disparity block in side-view transformed images, ${\widehat{L}}^{\prime}$ and ${\widehat{R}}^{\prime}$, and ${B}_{i}$’s MVs are combined with weights proportional to block fidelity to yield a more accurate compensated block for the ${B}_{i}$ in ${\widehat{I}}_{2t}^{\mathrm{int}}$. We take the ME process for a ${B}_{i}$ by referencing left- and central-view images as an example and the right-view one can be carried out in the same way. The first-phase block disparity estimation is performed between ${\widehat{I}}_{2t-1}$ and ${\widehat{L}}_{2t-1}^{\prime}$, denoted as $\mathbb{B}{\mathbb{M}}_{2\times 2}({B}_{i}):{B}_{i}({\widehat{I}}_{2t-1})\to {B}_{i}({\widehat{L}}_{2t-1}^{\prime})$, which will yield the best matched block from ${\widehat{L}}_{2t-1}^{\prime}$ with a ${\overrightarrow{v}}_{d}^{\prime}$.
If the best matched block does not reside on the outlier of ${\widehat{L}}_{2t-1}^{\prime}$, the second-phase block ME is performed, in which the search range in ${\widehat{L}}_{2t}^{\prime}$ is two blocks wide along the vertical and horizontal directions and centered at the co-located coordinate of ${B}_{i}$ on ${\widehat{L}}_{2t-1}^{\prime}$ with the offset ${\overrightarrow{v}}_{d}^{\prime}$. It yields one ${\overrightarrow{v}}_{{m}_{1}}^{\prime}$, and the second ${\overrightarrow{v}}_{{m}_{2}}^{\prime}$ can be obtained by the same procedure $\mathbb{B}{\mathbb{M}}_{2\times 2}({B}_{i}):{B}_{i}({\widehat{I}}_{2t+1})\to {B}_{i}({\widehat{L}}_{2t+1}^{\prime})$. The other two MVs, ${\overrightarrow{v}}_{{m}_{3}}^{\prime}$ and ${\overrightarrow{v}}_{{m}_{4}}^{\prime}$, are estimated from the right-view video through the same procedure. When performing MC for an ${\widehat{I}}_{2t}$, if any image block reached through the inner-path MV, ${\overrightarrow{v}}_{{m}_{j}}^{\prime}$, resides on the outlier, then its ${w}_{j}$ is set to zero. Let ${B}_{i}(I,v)$ denote the image block obtained from the co-located block on an $I$ with its MV, $v$, and the ${B}_{i}$ reconstruction for the SIF, ${\widehat{I}}_{2t}^{\mathrm{int}}$, can be represented as

## (1)

$${B}_{i}({\widehat{I}}_{2t}^{\mathrm{int}})=\sum _{j=1,3}{w}_{j}\cdot {B}_{i}({\widehat{I}}_{2t-1},{\overrightarrow{v}}_{{m}_{j}}^{\prime})+\sum _{j=2,4}{w}_{j}\cdot {B}_{i}({\widehat{I}}_{2t+1},{\overrightarrow{v}}_{{m}_{j}}^{\prime}),$$

where the weights are given by

## (2)

$${w}_{j}=\frac{1}{{\mathrm{SAD}}_{j}}\Big/\sum _{k=1}^{4}\frac{1}{{\mathrm{SAD}}_{k}},\qquad j\le 4,$$

and, for a block residing on the outlier, temporal bidirectional MC gives

## (3)

$${B}_{i}({\widehat{I}}_{2t}^{\mathrm{int}})=\frac{1}{2}\left[{B}_{i}\left({\widehat{I}}_{2t-1},\frac{{\overrightarrow{v}}_{m}^{11}}{2}\right)+{B}_{i}\left({\widehat{I}}_{2t+1},\frac{-{\overrightarrow{v}}_{m}^{11}}{2}\right)\right].$$

In our experiments, the COMPETE is operated under the frame ratio $\mathrm{KF}:\mathrm{WZF}=5:1$, while the fusion-based homography method is operated under $\mathrm{KF}:\mathrm{WZF}=1:1$. The COMPETE can also be adapted to operate under the ratio $\mathrm{KF}:\mathrm{WZF}=1:1$. The COMPETE needs to transmit the first KF of each view to estimate the homography matrices, as shown in Fig. 7(a), and there are one MV and two disparity vectors that can be used to interpolate the SIF, ${\widehat{I}}_{2t}^{\mathrm{int}}$. To interpolate the SI of side-view images, say ${\widehat{L}}_{2t+1}^{\mathrm{int}}$, only one MV and one disparity vector can be referenced, as shown in Fig. 7(b). For the last central-view image, only two disparity vectors can be referenced to interpolate its SI, as shown in Fig. 7(c). When the WZF/KF ratio is larger than 1, it requires learning-based approaches^{37} that apply an expectation maximization algorithm for unsupervised learning of MVs.
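The fidelity weighting of Eqs. (1) and (2) can be sketched as below. This is an illustrative reading rather than the authors' code: it assumes that the weights of the remaining candidates are renormalized after an outlier weight is set to zero, adds a small hypothetical constant `eps` to guard against a zero SAD, and signals the all-outlier case by returning `None` so that the caller can fall back to the bidirectional MC of Eq. (3).

```python
def fidelity_weights(sads, on_outlier, eps=1e-6):
    """Inverse-SAD weights in the spirit of Eq. (2).

    sads       : SAD of each candidate block (one per MV).
    on_outlier : True where the candidate block resides on the outlier,
                 in which case its weight is forced to zero.
    Returns None when every candidate is an outlier.
    """
    inv = [0.0 if out else 1.0 / (s + eps) for s, out in zip(sads, on_outlier)]
    total = sum(inv)
    if total == 0.0:
        return None
    return [v / total for v in inv]  # renormalisation is an assumption


def fuse_blocks(blocks, weights):
    """Weighted sum of equally sized square blocks, as in Eq. (1)."""
    n = len(blocks[0])
    return [
        [sum(w * b[r][c] for w, b in zip(weights, blocks)) for c in range(n)]
        for r in range(n)
    ]
```

A candidate with a small SAD, i.e., high fidelity, therefore dominates the fused block, while candidates on the outlier contribute nothing.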

## 3.

## Multiview Distributed Video Coding Rate Control Algorithm

The internal signal processing flow of the MDVC (Fig. 1) is shown in Fig. 8. The encoder $E$ comprises both H.264 and WZ encoders, in which the left- and right-view images, $\{{L}_{t}\}$ and $\{{R}_{t}\}$, would be encoded by the former to yield KF bitstreams, ${s}_{l}$ and ${s}_{r}$, respectively. The central-view images, $\{{I}_{t}\}$, are separated into odd and even image sequences, $\{{I}_{2t-1}\}$ and $\{{I}_{2t}\}$. The odd images are encoded by H.264 Intra to provide the KF bitstream ${s}_{o}$ and the even ones by the WZ encoder with appended cyclic redundancy check (CRC) checksum to yield parity bits ${\tilde{p}}_{2t}$. For adaptive rate control, the RCPT^{28} code is adopted for channel coding, because it performs near the Shannon limit at low SNR, while providing excellent throughput at high SNR.^{28} The WZ encoder will determine whether to send more parity bits or not based on the feedback requested bits NAK from the WZ decoder. The decoder $D$ comprises one H.264 decoder, one WZ decoder, and one interpolation/homography function module. The received bitstreams, ${s}_{l}$, ${s}_{r}$, and ${s}_{o}$, will be decoded by the H.264 decoder to yield reconstructed images of left-, right-, and central-view odd images, ${\widehat{L}}_{t}$, ${\widehat{R}}_{t}$, and ${\widehat{I}}_{2t-1}$, respectively. They are inputs of the interpolation/homography modules that will reconstruct the SI, an interpolated central-view image ${\widehat{I}}_{2t}^{\mathrm{int}}$, for the WZ decoder to reconstruct ${\widehat{I}}_{2t}$ with reference to ${\widehat{I}}_{2t}^{\mathrm{int}}$. The multiplexer combines the reconstructed ${\widehat{I}}_{2t-1}$ and ${\widehat{I}}_{2t}$ to yield the final central-view video $\{{\widehat{I}}_{t}\}$.

## 3.1.

### Wyner–Ziv Coding

The WZ encoder in the MDVC system is shown in Fig. 9. The input image, ${I}_{2t}$, is divided into blocks with $4\times 4$ pixels, which are then transformed to frequency domain coefficients, ${c}_{2t}$, through $T$, and quantized through $Q$ to yield the quantized coefficients, ${q}_{2t}$. To reduce encoding complexity, the integer DCT is adopted for low complexity hardware implementation. In ${c}_{2t}$, the DC coefficient comprises most of the block signal energy and will be allocated more bits than other higher frequency ones, ACs. Coefficients in the $4\times 4$ block, ${c}_{2t}$, are partitioned into different bands. Each coefficient band is uniformly quantized with a ${2}^{{b}_{k}}$ level quantizer ($Q$), where ${b}_{k}$ denotes the number of bits assigned to the $k$’th coefficient. The number of quantization levels, ${2}^{{b}_{k}}\mathrm{s}$, for a $4\times 4$ DCT coefficient block^{38} is determined through an optimal bit allocation procedure on the ${c}_{2t}$ coefficients.

In practical implementation, the quantization step size of the $i$’th coefficient, ${\mathrm{\Delta}}_{i}$, was set with a loading factor of 4 on the coefficient standard deviation, ${\sigma}_{i}$, for a certain coefficient probability density function (PDF),^{39} i.e.,

## (4)

$${\mathrm{\Delta}}_{i}=\frac{4{\sigma}_{i}}{{2}^{{b}_{k}}},\quad \text{for}\ {b}_{k}\ne 0.$$

After quantization, each coefficient is represented by its quantization index ${q}_{2t}$. For simple demonstration, the parity bit generating process for one $16\times 16$ image is described. The $16\times 16$ image is decomposed into sixteen $4\times 4$ blocks on which the DCT is performed, and the numbers of bits used to represent the quantized indices of the DCs and ACs are 4 and 3, respectively. The DCs and ACs of these sixteen $4\times 4$ DCT blocks are rearranged such that the same frequency coefficients are grouped together and queued in zigzag scan order, i.e., ${\{{\mathrm{DC}}^{i}\}}_{i=1,2,\cdots ,n},{\{{\mathrm{AC}}_{1}^{i}\}}_{i=1,2,\cdots ,n},{\{{\mathrm{AC}}_{2}^{i}\}}_{i=1,2,\cdots ,n},\cdots ,{\{{\mathrm{AC}}_{a}^{i}\}}_{i=1,2,\cdots ,n}$, where $n$ is the total number of blocks in the image and $a$ is the number of ACs for a certain quantization pattern, as shown in the upper image of Fig. 10(a). For turbo encoding, these regrouped $4\times 4$ DC blocks are subject to bit-plane extraction, as shown in Fig. 10(a), such that bits of the same significance are grouped together and transmitted in bit-plane order, i.e., ${\mathrm{MSB}}_{k}=\{{\mathrm{MSB}}_{k}^{i}|i=1,2,\cdots ,16\}$ for $k=1,2,\cdots ,K$, where $i$ is the index of the original $4\times 4$ blocks and $k$ is the bit-plane index. For regrouped $4\times 4$ AC blocks, the above transmission order is reversed, i.e., from the LSB to the MSB. The bit-stream of these reordered bits, ${b}_{2t}$, is then used as the input to the CRC encoder, which appends the checksum of ${b}_{2t}$ and passes it to the turbo encoder.
After performing interleaving by the turbo encoder, it yields the parity bit-streams, ${\tilde{\mathbf{p}}}_{2t}={\tilde{\mathbb{P}}}_{i}^{1}\cup {\tilde{\mathbb{P}}}_{i}^{2}$, which can be represented as ${\tilde{\mathbb{P}}}_{i}^{1}=\{{\tilde{p}}_{1}^{1},{\tilde{p}}_{2}^{1},\cdots ,{\tilde{p}}_{16}^{1},\cdots \}$ and ${\tilde{\mathbb{P}}}_{i}^{2}=\{{\tilde{p}}_{1}^{2},{\tilde{p}}_{2}^{2},\cdots ,{\tilde{p}}_{16}^{2},\cdots \}$. Both parity bit streams are punctured with specific patterns of period $\psi =16$ to form sub-blocks queued in the transmission buffer, denoted as ${\tilde{P}}_{2t}^{1}$ and ${\tilde{P}}_{2t}^{2}$, which will be sent to the decoder upon request. The puncture pattern is designed to select parity bit according to the specified priority, as shown in Fig. 10(b). For turbo decoding, the skipped systematic bits at $E$ are replaced with the reconstructed SI at $D$, which would be reconstructed by different methods. The turbo decoder would request more parity bits in case it cannot correctly recover the data. In general, when the SI confidence is high, it would request fewer parity bits and improve the WZF quality. Detailed rate control steps will be described in Sec. 3.2.
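The quantization step of Eq. (4) and the bit-plane ordering described above can be sketched as follows. This is a minimal illustration, not the codec implementation: the function names, the example index values, and the use of non-negative AC magnitudes are assumptions made for the example.

```python
def step_size(sigma, bits, loading=4.0):
    """Eq. (4): step = loading * sigma / 2**bits, defined only for bits != 0."""
    if bits <= 0:
        raise ValueError("no step size is defined for a band with zero bits")
    return loading * sigma / (2 ** bits)


def bit_planes(indices, bits, msb_first=True):
    """Split non-negative quantization indices into bit planes.

    Returns a list of planes in transmission order; each plane holds one
    bit per index (i.e., one bit per block for a given coefficient band).
    """
    order = range(bits - 1, -1, -1) if msb_first else range(bits)
    return [[(q >> k) & 1 for q in indices] for k in order]


# Hypothetical quantization indices for four blocks (illustrative only).
dc_indices = [9, 3, 12, 7]   # 4-bit DC indices
ac_indices = [1, 0, 2, 5]    # 3-bit AC magnitudes

dc_planes = bit_planes(dc_indices, bits=4, msb_first=True)    # MSB plane first
ac_planes = bit_planes(ac_indices, bits=3, msb_first=False)   # LSB plane first

print(step_size(sigma=2.0, bits=4))
print(dc_planes[0], ac_planes[0])
```

The first DC plane collects the most significant bit of every block, while the first AC plane collects the least significant bits, which mirrors the reversed transmission order stated for the AC bands.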

To reconstruct the WZF, ${\widehat{I}}_{2t}$, from the received parity bits sub-block, $\{{\tilde{P}}_{2t}^{1},{\tilde{P}}_{2t}^{2}\}$, the WZ decoder shown in Fig. 11 needs to generate the SI, ${\widehat{I}}_{2t}^{\mathrm{int}}$, by the interpolation/homography module, as shown in Fig. 8. Before turbo decoding, the same $T$ and $Q$ processes will be applied to ${\widehat{I}}_{2t}^{\mathrm{int}}$ to yield ${\widehat{c}}_{2t}^{\mathrm{int}}$ and ${\widehat{q}}_{2t}^{\mathrm{int}}$, respectively. To increase the SI confidence for turbo decoding, the distribution of the error between the reconstructed SIF and the original WZF is modeled as Laplacian. A transform-domain correlation noise model parameter updating procedure^{29} was applied to fit the coefficient error distribution for each $4\times 4$ block with the Laplacian model. Since the original image encoded as a WZF is not available at the decoder, the MCTI image, ${\widehat{I}}_{2t}^{\mathrm{int}}$, interpolated from ${\widehat{I}}_{2t\pm 1}^{\mathrm{int}}\mathrm{s}$, was used instead. After being processed by $T$ and $Q$, the indexed signals, ${\widehat{q}}_{2t}^{\mathrm{int}}$, are reordered, grouped, and extracted by bit-plane to provide the systematic bits, ${\widehat{b}}_{2t}^{\mathrm{int}}$, for the turbo decoder. The turbo decoder performs the logarithmic maximum a posteriori (log-MAP) algorithm with the help of the received parity bits sub-blocks, $\{{\tilde{P}}_{2t}^{1},{\tilde{P}}_{2t}^{2}\}$, and CRC checksum verification, under a certain confidence measurement,^{40} to determine whether the decoding process has converged or more bits must be requested for the next iteration. After ${\widehat{b}}_{2t}$ has been decoded correctly, it is reversely processed by the combining bit-plane module to yield the quantized index, ${\widehat{q}}_{2t}$, which is used as the input of the reconstruction module to refine ${\widehat{c}}_{2t}^{\mathrm{int}}$ into ${\widehat{c}}_{2t}$.

The optimal reconstruction function that exploits the correlation between the original image for WZF and SI^{14} is adopted, in which the distribution of the residual signals between the original WZF and the reconstructed SIF is assumed to be Laplacian and it seeks to find the reconstructed samples that demonstrate MMSE. The optimal reconstruction value, ${\widehat{c}}_{2t}$, is the expectation ${\widehat{c}}_{2t}=E[{c}_{2t}|{c}_{2t}\in \{{\mathrm{\Delta}}_{i}^{l},{\mathrm{\Delta}}_{i}^{r}\},{\widehat{c}}_{2t}^{\mathrm{int}}]$, where ${\mathrm{\Delta}}_{i}^{l}/{\mathrm{\Delta}}_{i}^{r}$ denote the lower/upper boundary of the interval ${\mathrm{\Delta}}_{i}$ that ${\widehat{c}}_{2t}^{\mathrm{int}}$ resides, and the expected value yields the MMSE estimation of the source WZ. This procedure will prevent the reconstructed values from deviating from the original value too much due to low SI confidence. At the last stage, the ${\widehat{c}}_{2t}$ will be inversely transformed to yield the final reconstructed image, ${\widehat{I}}_{2t}$.
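The conditional expectation above can be approximated numerically. The sketch below assumes a Laplacian residual centered on the SI coefficient with parameter `alpha`, and integrates over one quantization interval with a midpoint rule. The function name, the interval bounds, and the numerical integration itself are illustrative assumptions, not the closed-form expression used in Ref. 14.

```python
import math


def mmse_reconstruct(lower, upper, side_info, alpha, steps=20000):
    """Approximate E[c | lower <= c < upper, side_info].

    The residual (c - side_info) is assumed Laplacian with parameter alpha;
    the normalising constant cancels in the ratio, so it is omitted.
    """
    width = (upper - lower) / steps
    num = 0.0
    den = 0.0
    for j in range(steps):
        c = lower + (j + 0.5) * width              # midpoint of sub-interval j
        w = math.exp(-alpha * abs(c - side_info))  # unnormalised Laplacian weight
        num += c * w
        den += w
    return num / den


# SI inside the decoded interval: the estimate is pulled toward the SI value.
inside = mmse_reconstruct(lower=0.0, upper=8.0, side_info=4.0, alpha=0.5)
# SI far below the interval: the estimate stays inside [lower, upper).
outside = mmse_reconstruct(lower=0.0, upper=8.0, side_info=-10.0, alpha=0.5)
print(inside, outside)
```

Because the estimate is an average over the decoded interval, it can never leave that interval, which is the clamping behavior the paragraph describes for low-confidence SI.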

## 3.2.

### Rate Control Mechanism

To improve the decoding efficiency, we proposed to impose specific puncture patterns with transmission order according to signal distribution properties for DCs and ACs, respectively. In the COMPETE framework, we proposed to collect all same order DCs/ACs together, which are then zig-zag scanned for turbo encoding. For block DCT-based video coding, the DC coefficient usually contains most block signal energy. Its MSBs contribute much more signal energy than LSBs, such that the assigned priority of the former is higher than the latter. As shown in Fig. 10(b), the system is designed to transmit the first MSBs of all DCs and then the second MSBs. The magnitude of ACs would be much smaller and around zero magnitude. Since ACs may be positive or negative, by taking its absolute value, it would lead to more skewed magnitude probability distribution. The “sign bit” of quantized ACs can be replaced by that of the quantized SIF at the decoder, under which the probability of LSBs to be 0 would be larger than MSBs, when represented with a fixed number of bits. As opposed to DCs, it transmits the LSB first and then the second LSB^{41} to speed up turbo decoding, as shown in Fig. 10(c). This transmission strategy for DCs and ACs helps to correct the decoding errors of systematic bits with fewest requested parity bits. Experiments showed that this rate control strategy yields 55% to 59%, fewer requested bits for the turbo decoder.

The proposed rate control algorithm, developed based on the RCPT puncturing mechanism,^{28} is demonstrated in Fig. 12.

In the COMPETE system, the RCPT code is designed to be with rate $1/3$ and puncturing period $\psi =16$, which is formed from two rate $1/2$ recursive systematic convolutional constituent codes with generator $\frac{1+D+{D}^{2}+{D}^{4}}{1+{D}^{2}+{D}^{3}}$. The puncturing table with different rates, $\{\frac{16}{16+V}|V=0,1,\cdots ,32\}$, will be generated, in which $V=0$ will not be used because the systematic bits will be discarded under the DVC framework. Figure 10 demonstrates part of the corresponding puncture table. When the first sub-block parity bits were received, the decoding would be carried out based on the CRC alone.^{28} When receiving those of the second sub-block, it would decode the first constituent encoding data and the iterative turbo decoding will start after the third sub-block being received, in which the maximum iteration number, ${T}_{\mathrm{iter}}$, is set. When decoded results are converged, i.e., an all-zero syndrome of CRC checking or the number of iteration exceeds ${T}_{\mathrm{iter}}$, the resultant bitstream will be subjected to a second confirmation procedure. Notwithstanding, a larger ${T}_{\mathrm{iter}}$ will lead to heavy computation and the tradeoff between setting ${T}_{\mathrm{iter}}$ and heavy computation should be well manipulated. The value of ${T}_{\mathrm{iter}}$ is determined from experiments on different complexity test videos under different bit rates that can yield convergence. The confidence measurement with the criteria ConfPr $\le {10}^{-3}$,^{40} in which

To improve the turbo decoding performance while requesting fewer parity bits, the correlation among coefficient bit planes was exploited and utilized to estimate the posteriori probability, which is used as the priori probability for turbo decoding. The probability distribution of the difference between a SIF and the original image coded as a WZF is assumed to be Laplacian, i.e.,

## (6)

$${p}_{{\widehat{q}}_{2t}^{\mathit{int}}}(n)=\frac{\alpha}{2}{e}^{-\alpha |{\widehat{q}}_{2t}^{\mathit{int}}-n|},\qquad \alpha =\sqrt{\frac{2}{{\sigma}_{x}^{2}}}.$$^{29}

The $b$’th decoded bit of DCs (${\mathrm{DC}}^{1}$) is represented as

## (7)

$${\widehat{b}}_{b}\equiv \underset{i\in \{0,1\}}{\mathrm{arg}\text{\hspace{0.17em}}\mathrm{max}}\text{\hspace{0.17em}}{Pr}_{{\mathrm{DC}}^{1}}(i|{\widehat{q}}_{2t}^{\mathit{int}},{\widehat{b}}_{b-1},\cdots ,{\widehat{b}}_{2},{\widehat{b}}_{1}).$$^{40}

Consider a quantized DC represented with four bits, ${b}_{1}{b}_{2}{b}_{3}{b}_{4}$, from MSB to LSB. The probability integrated over the shaded interval gives ${Pr}_{{\mathrm{DC}}^{1}}({\widehat{b}}_{3}=1|\cdots )$, and ${Pr}_{{\mathrm{DC}}^{1}}({\widehat{b}}_{3}=0|\cdots )$ can be calculated in a similar way. The turbo decoder then updates the a priori probability and performs log-MAP decoding. Experiments verified that this probability estimation and updating method helps the decoder to request fewer parity bits and reduces the turbo decoding time.
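The conditional bit probability in Eq. (7) can be illustrated with a small sketch. It assumes the Laplacian model of Eq. (6) evaluated on integer quantization indices (a discrete sum instead of the integral over the shaded interval) and a four-bit DC index; the function names and the example values are hypothetical.

```python
import math


def laplacian_weight(n, q_si, alpha):
    """Unnormalised Eq. (6) weight of index n given the SI index q_si."""
    return math.exp(-alpha * abs(q_si - n))


def bit_probability(q_si, alpha, decoded_prefix, total_bits=4):
    """Pr(next bit = 1 | SI index, already decoded more-significant bits).

    decoded_prefix lists the decoded bits from the MSB downward.
    """
    remaining = total_bits - len(decoded_prefix) - 1   # bits below the current one
    prefix = 0
    for b in decoded_prefix:
        prefix = (prefix << 1) | b
    mass = {0: 0.0, 1: 0.0}
    for bit in (0, 1):
        base = ((prefix << 1) | bit) << remaining
        # Sum the model weight over every index consistent with prefix + bit.
        for low in range(1 << remaining):
            mass[bit] += laplacian_weight(base + low, q_si, alpha)
    return mass[1] / (mass[0] + mass[1])


# SI index 10 (binary 1010); bits b1 b2 = 1 0 already decoded; estimate b3.
p_b3_is_one = bit_probability(q_si=10, alpha=1.0, decoded_prefix=[1, 0])
decoded_b3 = 1 if p_b3_is_one >= 0.5 else 0
print(p_b3_is_one, decoded_b3)
```

The returned probability can serve as the a priori input for the next bit plane, so a reliable SI index concentrates the probability mass and leaves fewer bits for the parity data to correct.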

## 4.

## Simulation Study

The COMPETE encoding performance is compared with other SIF reconstruction methods, such as MCTI, fusion-based homography (F-HOMO), MVME, and H-MVME, for evaluation. The H-MVME is the extension of MVME,^{26} in which estimated image blocks that reference to the outlier are obtained through MCTI. In the F-HOMO, both SIFs reconstructed from inter- and intraview images through DCVP^{42} and MCTI, respectively, are fused to yield the final SIF. The quality of ${\widehat{I}}_{2t}^{\mathrm{WZ}}$, which is reconstructed with its SIF generated by the above methods, is compared with those from H.264 with inter-, intra-, and inter-no-motion mode. The multiview CIF videos, Race1, Ballroom, Breakdancer, Exit, Ballet and Vassar, provided by ISO/IEC^{43} are used as test videos, whose frame rates are 30, 25, 15, 25, 15, and 25 fps, respectively. These videos present different scene complexities rated from high to low, in which the “Race1, Ballroom, and Breakdancer” are classified as high complexity videos, “Exit” as medium and (Ballet and Vassar) as low complexity ones, respectively. Three successions of the six views from a multiview video are used to provide left-, central- and right-view videos. For H.264, the CABAC function is enabled and GOP size is 12 for inter- and inter-no-motion modes. The ME search range for the former is set to be 32 and zero motion is assigned for the latter. For the H.264 coder to yield compromised decoded quality for different videos, different quantization parameters (QPs), $\mathrm{QP}\in \{30,28,26,24,20,18\}$, are used for different complexity videos. The MDVC codec adopts $|\mathrm{GOP}|=2$, in which the side-view video and central-view odd frames are encoded with H.264/INTRA to provide KFs for the decoder to reconstruct ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$. The quality of reconstructed ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$ with reference to the four SIF generation methods is compared by image PSNRs for evaluation.

## 4.1.

### Performance Analysis

To evaluate the performance of the proposed COMPETE, the error analysis based on reconstructed blocks is first carried out to investigate the signal processing behavior. Four SIF reconstruction methods, which comprise MCTI, F-HOMO, MVME, and H-MVME, are also implemented for comparisons. The SI confidence, quality of reconstructed WZFs, ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$, and time complexity of different methods are compared and evaluated. The time complexity of SI generation and encode/decode execution time will be discussed in Sec. 4.2.

## 4.1.1.

#### Error analysis

The error distributions of the COMPETE and MVME are investigated to justify how the SI confidence can be improved. In the COMPETE algorithm, by performing intraview ME between central-view images, blocks are classified into motion or no-motion to eliminate unnecessary ME/MC operations. For no-motion blocks, the co-located block of the previous frame is used as the MC blocks with zero motion. For motion blocks, when the search range comprises regions belonging to the outlier, only intraview ME on central-view images is performed. Otherwise, the regular weighted MVME process is carried out. Denote the number of no-motion, motion, and outlier blocks in one frame as ${K}_{n}$, ${K}_{m}$, and ${K}_{o}$, which can be normalized as ${k}_{n}$, ${k}_{m}$, and ${k}_{o}$, respectively, i.e., ${k}_{n}+{k}_{m}+{k}_{o}=1$. In the COMPETE, the MC interpolated frame can now be represented by

## (9)

$${\widehat{I}}_{2t}^{\mathit{int}}={\mathbb{B}}_{n}\cup {\mathbb{B}}_{m}\cup {\mathbb{B}}_{o},$$

## (10)

$${\sigma}_{B}^{2}=\frac{1}{\sum _{\tau}{K}_{\tau}}E[{({I}_{2t}-{\widehat{I}}_{2t}^{\mathit{int}})}^{2}]=\sum _{\tau \in \{n,m,o\}}{k}_{\tau}\cdot E[{({B}_{\tau}({I}_{2t})-{B}_{\tau}({\widehat{I}}_{2t}^{\mathit{int}}))}^{2}|{B}_{\tau}({I}_{2t})\in {\mathbb{B}}_{\tau}].$$

For one image block, a specific ME procedure corresponding to its categorization, i.e., motion, no-motion, or outlier, will be imposed. Table 1 shows the percentage of each block category for different videos and Table 2 shows the mean absolute error of block difference, between the original image and its reconstructed SIF, for the six test videos. As shown, the percentage of outlier blocks is very small and their average reconstruction error by the COMPETE is smaller than that of MVME. Both the estimated intraview MV and the interview disparity vector are utilized to improve the SI confidence, in which the four MVs through inner paths are utilized to perform intraview weighted MC for a central-view SIF. This SIF demonstrated higher confidence than that reconstructed through average MC in both MVME and H-MVME. As shown in Table 2, the average error of reconstructed blocks of the proposed COMPETE is smaller than that of MVME. Table 1 shows that the percentage of no-motion blocks is the highest for all videos except Race1; these blocks are mostly from the background region or static foreground objects. For no-motion blocks, the proposed COMPETE effectively eliminated the time-consuming ME process and prevented the noisy MVs that result from the regular ME process of other methods. For example, the MVME method, instead of identifying no-motion blocks and skipping the time-consuming ME process, treats all blocks as motion blocks but does not yield a more accurate estimation, as shown in Table 2. For motion blocks, the MVME does not differentiate the interview disparity vector from the intraview MV, such that the MC blocks would be more degraded as compared to those of COMPETE. 
As the COMPETE compensates no-motion blocks by the colocated ones of the previous decoded frame, in addition to reducing time complexity, the ME errors can also be decreased. In total, the proposed COMPETE effectively yielded higher SI confidence while reducing time complexity, as compared to MVME.
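The decomposition in Eq. (10) is the law of total expectation applied to the three block categories. The sketch below checks it on synthetic per-block squared errors; the category labels `n`, `m`, `o` follow the text, while the function name and the error values are made up for illustration.

```python
def class_weighted_mse(block_errors):
    """block_errors: list of (category, squared_error) pairs, one per block.

    Returns (overall_mse, weighted_sum, fractions), where weighted_sum is
    sum over categories of k_tau * E[error | category tau].
    """
    total = len(block_errors)
    sums, counts = {}, {}
    for category, err in block_errors:
        sums[category] = sums.get(category, 0.0) + err
        counts[category] = counts.get(category, 0) + 1
    fractions = {c: counts[c] / total for c in counts}           # k_n, k_m, k_o
    weighted = sum(fractions[c] * (sums[c] / counts[c]) for c in counts)
    overall = sum(err for _, err in block_errors) / total
    return overall, weighted, fractions


# Synthetic example: three no-motion, two motion, and one outlier block.
blocks = [("n", 1.0), ("n", 2.0), ("n", 1.5), ("m", 40.0), ("m", 60.0), ("o", 90.0)]
overall, weighted, fractions = class_weighted_mse(blocks)
print(overall, weighted, fractions)
```

Since the per-category errors are weighted by their block fractions, a large share of low-error no-motion blocks lowers the overall error even when the motion and outlier categories remain hard to predict.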

## Table 1

No-motion, motion, and outlier blocks distribution at QP=26.

Video | Motion blocks (%) | No-motion blocks (%) | Outlier blocks (%) |
---|---|---|---|
Race1 | 72.15 | 25.67 | 2.18 |
Ballroom | 37.64 | 60.06 | 2.30 |
Breakdancer | 42.48 | 54.47 | 3.05 |
Exit | 20.66 | 78.20 | 1.14 |
Ballet | 17.17 | 81.60 | 1.23 |
Vassar | 3.71 | 96.08 | 0.21 |

## Table 2

The comparison of estimation errors.

Video | Motion blocks, COMPETE (MAE) | Motion blocks, H-MVME (MAE) | No-motion blocks, COMPETE (MAE) | No-motion blocks, H-MVME (MAE) | Outlier blocks, COMPETE (MAE) | Outlier blocks, H-MVME (MAE) |
---|---|---|---|---|---|---|
Race1 | 194.28 | 423.11 | 1.46 | 1.79 | 255.08 | 292.30 |
Ballroom | 302.47 | 356.74 | 3.36 | 3.46 | 310.66 | 323.96 |
Breakdancer | 694.01 | 742.19 | 1.54 | 1.79 | 179.59 | 184.42 |
Exit | 729.95 | 804.74 | 1.91 | 2.73 | 191.55 | 206.61 |
Ballet | 134.69 | 281.63 | 1.71 | 2.11 | 129.74 | 127.08 |
Vassar | 82.81 | 364.24 | 2.59 | 2.70 | 182.27 | 196.24 |

## 4.1.2.

#### Side information confidence

The SI confidence in PSNR achieved by MCTI, F-HOMO, MVME, the COMPETE with direct linear transform (DLT) homography matrix generation method and the COMPETE performed on all test videos are shown in Fig. 14. As shown, the MCTI performance was severely degraded for high motion videos, Race1, Ballroom and Breakdancer, since it assumes linear motion and interpolates frames only along temporal dimension. For Race1, the SIF by COMPETE is 6.2 to 7.9 dB higher in PSNR than MCTI because it is a panning shot of moving objects such that MCTI cannot find the correct MVs to reconstruct SI. For the F-HOMO, it adopts pixel-based fusion and would lead to image discontinuity artifacts when fusing disparity synthesized and temporal interpolated (MCTI) images. The H-MVME outperforms MVME^{26} with 0.5 to 3 dB higher PSNR for both high and low complexity videos. For MVME, it performs ME from both inter- and intraview KFs, which may lead to false/trivial ME and degraded quality, in addition to being time consuming. The H-MVME improves the MVME by eliminating the interview disparity. The proposed COMPETE estimated MVs with reference to perspectively transformed images, ${\widehat{I}}_{2t-1}^{v}$, and detected no-motion blocks to eliminate regular ME operations. The SIFT followed by RANSAC would help to yield more stable matching point pairs, as compared to the COMPETE followed by DLT, as shown in Fig. 14. In comparison, the proposed COMPETE not only achieves the same reconstructed image quality as that of H-MVME but also decreases computation complexity. For the “Ballet,” the SIF by COMPETE is 0.1 to 2.3 dB higher in PSNR than H-MVME because the disparity problem of interview ME has been solved by the block prediction through perspective transform. In comparison, the COMPETE effectively reduced computational complexity and well utilized interview and temporal correlations to eliminate disparity block matching noises. 
Experiments also justified that the proposed COMPETE can yield the best SI confidence, as compared to the others.

## 4.1.3.

#### Objective performance evaluation

The PSNRs of ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$ coded by the five methods under the MDVC framework and reconstructed images by H.264, with intra-, inter- and inter-no-motion, are calculated for comparisons. The rate-distortion performance is similar to that of the SI confidence. For high-complexity videos, e.g., Race1, Ballroom, and Breakdancer, the SI confidence in PSNRs is comparable to COMPETE and H-MVME, both of which are 0.9 to 7.8 dB higher than MCTI and F-HOMO, as shown in Fig. 14. The reconstructed WZFs with the COMPETE, ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$, are 0.8 to 2.9 dB higher in PSNR than those of MCTI and F-HOMO, as shown in Figs. 15(a)–15(c). For high-complexity videos, both MCTI and F-HOMO cannot estimate accurate MVs to compensate for the reconstructed SIFs, which leads to more degraded WZFs. Both COMPETE and H-MVME yield higher SI confidence and hence better reconstructed quality for ${\widehat{I}}_{2t}^{\mathrm{WZ}}$. The COMPETE yielded 0.4 to 1 dB higher PSNR than H.264/INTRA for Breakdancer, 1 to 1.5 dB higher than H.264/INTRA for Ballroom and 0 to 0.5 dB higher than H.264/INTRA for Race1. The H.264 intra/inter-no-motion cannot well encode Race1, because the camera was tracking a moving object. For the medium-complexity video, Exit, the SI confidence in PSNRs reconstructed by the COMPETE is 2.4 to 3.9 dB higher than those of MCTI, as shown in Fig. 14. The average PSNRs of ${\widehat{I}}_{2t}^{\mathrm{WZ}}$ are 3.5 and 2.5 dB higher than those reconstructed from H.264 intra and MCTI, respectively, as shown in Fig. 15(d). For low complexity videos, Ballet and Vassar, as they demonstrate more static regions, the interpolation and fusion process can perform efficient for all methods and results in smaller difference of PSNR performances. The COMPETE yielded 0.8 to 2 dB higher PSNR than MCTI for ${\widehat{I}}_{2t}^{\mathrm{WZ}}\mathrm{s}$, and 1.5 to 2.2 dB higher than H.264/INTRA, as shown in Figs. 15(e) and 15(f). 
In addition, although the MVME-based methods,^{26} e.g., MVME, and H-MVME, demonstrate comparable PSNR performances with COMPETE, their time complexity is high. Experiments showed that the COMPETE outperforms the others in SIF and WZF, ${\widehat{I}}_{2t}^{\mathrm{WZ}}$, quality, in that it prevents reconstructing blocks in static regions from noise attacks during interpolation and block matching processes. Note that the KF quality setting would impact the SI confidence, and the KF quality depends on QP selection. To justify the COMPETE capability in improving the MDVC codec performance, the average image PSNR of KFs and WZFs under a fixed bit budget is provided for comparisons. As shown in Fig. 16, the COMPETE outperforms the others in PSNRs from 0.4 to 4 dB under different bitrates for both high and low complexity videos, Race1 and Vassar, respectively.

Experiments revealed that high confidence SI is much more important than the rate control method in DVC coding: (1) When SI confidence is low, the decoding confidence measure, *ConfPr* in Eq. (5), would not satisfy convergence condition, $\mathrm{ConfPr}\le {10}^{-3}$. Under this condition, either the rate control procedure was carried out or the decoder requested more parity bits, and the ConfPr could hardly converge. (2) When SI confidence is high enough and the rate control procedure transmits high priority parity bits first, the number of decoding iterations would be reduced and the convergence criteria, $\mathrm{ConfPr}\le {10}^{-3}$, would be reached quickly. One practical turbo decoder example^{44} shows that when KFs are severely attacked by channel noise, which leads to low confidence SI, the PSNRs of reconstructed WZFs will degrade rapidly because the turbo decoder cannot recover one WZF from a severely degraded SIF. The number of average requested bits and bit rate saving under different SIF reconstruction methods is provided and compared in Tables 3 and 4, respectively. As shown in Table 3, the proposed COMPETE requested the fewest parity bits among the four methods because it can yield the highest SI confidence. Table 4 shows that the proposed control mechanism enables the four SI reconstruction methods to largely reduce the requested bit rates.

## Table 3

The average requested bit rate of different SIF generation methods (15 fps, QCIF).

Video | MCTI | F-HOMO | H-MVME | COMPETE |
---|---|---|---|---|
Race1 | 125.81 | 128.32 | 73.63 | 69.06 |
Ballroom | 89.29 | 89.56 | 82.50 | 79.93 |
Breakdancer | 44.68 | 40.56 | 34.14 | 30.52 |
Exit | 46.22 | 52.77 | 45.52 | 38.54 |
Ballet | 25.99 | 25.83 | 22.72 | 20.65 |
Vassar | 40.19 | 44.92 | 37.33 | 36.32 |

## Table 4

The turbo decoded bit rate with (w/) and without (w/o) the rate control mechanism for different SI generation methods (15 fps, QCIF).

Video | MCTI w/ | MCTI w/o | F-HOMO w/ | F-HOMO w/o | H-MVME w/ | H-MVME w/o | COMPETE w/ | COMPETE w/o |
---|---|---|---|---|---|---|---|---|
Race1 | 125.81 | 278.96 | 128.32 | 284.52 | 73.63 | 167.34 | 69.06 | 157.31 |
Ballroom | 89.29 | 197.54 | 89.56 | 198.58 | 82.50 | 183.74 | 79.93 | 181.65 |
Breakdancer | 44.68 | 101.78 | 40.56 | 93.67 | 34.14 | 79.58 | 30.52 | 72.32 |
Exit | 46.22 | 105.05 | 52.77 | 118.85 | 45.52 | 105.37 | 38.54 | 91.54 |
Ballet | 25.99 | 61.73 | 25.83 | 61.50 | 22.72 | 54.75 | 20.65 | 50.37 |
Vassar | 40.19 | 91.34 | 44.92 | 102.55 | 37.33 | 87.84 | 36.32 | 85.66 |
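
The saving attributed to the rate control mechanism can be recomputed from the Table 4 entries. The snippet below is only an arithmetic check on the published values (copied as given, in the table's own units); the variable names are illustrative.

```python
# (with, without) rate-control values per video, in column order
# MCTI, F-HOMO, H-MVME, COMPETE, copied from Table 4.
TABLE4 = {
    "Race1":       [(125.81, 278.96), (128.32, 284.52), (73.63, 167.34), (69.06, 157.31)],
    "Ballroom":    [(89.29, 197.54), (89.56, 198.58), (82.50, 183.74), (79.93, 181.65)],
    "Breakdancer": [(44.68, 101.78), (40.56, 93.67), (34.14, 79.58), (30.52, 72.32)],
    "Exit":        [(46.22, 105.05), (52.77, 118.85), (45.52, 105.37), (38.54, 91.54)],
    "Ballet":      [(25.99, 61.73), (25.83, 61.50), (22.72, 54.75), (20.65, 50.37)],
    "Vassar":      [(40.19, 91.34), (44.92, 102.55), (37.33, 87.84), (36.32, 85.66)],
}

# Fractional saving = 1 - (rate with control) / (rate without control).
savings = [1.0 - w / wo for rows in TABLE4.values() for (w, wo) in rows]
lowest, highest = min(savings), max(savings)
print(f"saving range: {100 * lowest:.1f}% to {100 * highest:.1f}%")
```

The computed range is in line with the 55% to 59% reduction in requested bits quoted in Sec. 3.2.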

## 4.2.

### Time Complexity Analysis

The time complexities of the proposed COMPETE, together with the other SI reconstruction methods, are analyzed and discussed. At first, the number of arithmetic operations, addition/subtraction and multiplication/division, required to reconstruct SI is calculated for time complexity analysis. The practical execution time is also measured to justify the time analysis. Denote the image width and height as $W$ and $H$, respectively, and the block size and search range as ${B}_{w}$ and ${S}_{r}$, respectively.

## 4.2.1.

#### Motion Compensated Temporal Interpolation

The MCTI performs intraview ME between images ${\widehat{I}}_{2t-1}$ and ${\widehat{I}}_{2t+1}$ and then performs motion compensated prediction to interpolate SI for WZFs. It performs subtraction and addition operations to yield the absolute difference summation. For one block, it needs ${B}_{w}^{2}$ subtractions and ${B}_{w}^{2}-1$ additions to calculate the block error. As the search area is ${S}_{r}^{2}$, it requires $(2\cdot {B}_{w}^{2}-1)\cdot {S}_{r}^{2}$ operations for one block to finish ME operations. The number of total operations for one image to finish ME is $(2\cdot {B}_{w}^{2}-1)\cdot {S}_{r}^{2}\cdot (H\cdot \frac{W}{{B}_{w}^{2}})\approx 2\cdot {S}_{r}^{2}\cdot H\cdot W$. The time complexity of MCTI is denoted as ${T}_{\mathrm{MCTI}}=2\cdot {S}_{r}^{2}\cdot H\cdot W$.

## 4.2.2.

#### Fusion-based homography

The fusion-based homography was implemented based on the Fusion 1 algorithm in Ref. 15. After performing perspective transformation, the synthesized perspectively transformed images, denoted as ${\widehat{I}}_{v}^{\prime}(2t)=\text{synthesis}[{\widehat{I}}_{l}^{\prime}(2t),{\widehat{I}}_{r}^{\prime}(2t)]$, and the temporally interpolated image, ${\widehat{I}}_{c}^{\mathrm{int}}(2t)$, are considered as candidates for the fusion-based central-view image. For each pixel of the SIF to be reconstructed, it seeks to find the one, between ${\widehat{I}}_{v}^{\prime}(2t)$ and ${\widehat{I}}_{c}^{\mathrm{int}}(2t)$, that yields the minimum distance to both the previous and the next central-view image pixel values. Estimation of the initial $3\times 3$ homography matrix can be performed off-line, so its time complexity can be ignored. For perspective transformation, it needs 15 MUL/ADD operations for each pixel and $2\cdot 15\cdot H\cdot W$ to yield the two reference central-view images. To obtain the fusion-based image, it needs $2\cdot H\cdot W$ and $2\cdot {S}_{r}^{2}\cdot H\cdot W$ for temporal interpolation. In total, it needs $4\cdot H\cdot W$ operations to find the pixel that yields the minimum pixel value difference. The number of total operations for the fusion-based homography is $(36+2{S}_{r}^{2})\cdot H\cdot W$. The time complexity of this method is denoted as ${T}_{\text{homography}}=(36+2{S}_{r}^{2})\cdot H\cdot W$.

## 4.2.3.

#### Hybrid Multiview Motion Estimation

The H-MVME is an improved MVME.^{26} In MVME, four ME vectors through inner paths are obtained and averaged to yield the motion compensated prediction image. The MVME algorithm is designed on the assumption that the optical axes of all cameras are orthogonal to the motion. For multiview video, homography transformation is required and there exists an outlier region to which the MVME may not be applicable. The H-MVME performs bidirectional temporal MC when the search range resides on the outlier. The required operations comprise performing the four inner-path MEs, $4\cdot 2\cdot {T}_{\mathrm{MCTI}}$; calculating weights (three ADDs and eight DIVs), $11\cdot H\cdot \frac{W}{{B}_{w}^{2}}$; and calculating the average, $7\cdot H\cdot W$. The number of total operations, ${T}_{\mathrm{H}\text{-}\mathrm{MVME}}$, is $(\frac{11}{{B}_{w}^{2}}+7)\cdot H\cdot W+8\cdot {T}_{\mathrm{MCTI}}\approx 8\cdot {T}_{\mathrm{MCTI}}$. Its time complexity is smaller than that of MVME, ${T}_{\mathrm{MVME}}$, which is $16\cdot {T}_{\mathrm{MCTI}}$.^{26}

## 4.2.4.

#### COMPETE

The design target of the proposed COMPETE is to keep high quality reconstruction while reducing computation complexity. At first, it needs to perform perspective transformation of the side-view images to the central view, which requires $6\cdot 15\cdot H\cdot W$ operations (three left- and three right-view images). Then, it performs block ME and checks whether each block is a motion block or not, which needs at least ${T}_{\mathrm{MCTI}}$ operations. Assume the ratio of motion and no-motion blocks is $1:1$. For a no-motion block, direct copy from the co-located block of the previous image is adopted, and no operation is required. For motion blocks, the search range for finding disparity vectors can be minimized to ${S}_{r}^{2}/16$ in that the reference frames are perspectively transformed from side-view images. The COMPETE, as well as H-MVME, performs four inner paths ME two times. For interview ME, the first disparity vector estimation requires $4\cdot (2{B}_{w}^{2}-1)\cdot \frac{{S}_{r}^{2}}{16}\cdot \left(\frac{H\cdot W}{{B}_{w}^{2}}\right)\approx 0.5\cdot {S}_{r}^{2}\cdot H\cdot W$ operations. The second ME after disparity compensation is $4\cdot {T}_{\mathrm{MCTI}}$. Finally, by including all the required operations for computing weights and average, the number of total operations is ${T}_{\mathrm{MCTI}}+6\cdot 15\cdot H\cdot W+0.5[0.5\cdot {S}_{r}^{2}\cdot H\cdot W+4\cdot {T}_{\mathrm{MCTI}}+(\frac{11}{{B}_{w}^{2}}+7)\cdot H\cdot W]\approx 4\cdot {T}_{\mathrm{MCTI}}$, which is denoted as ${T}_{\mathrm{COMPETE}}$. The above time complexity analysis shows that

## (11)

$${\mathbf{T}}_{\mathrm{MCTI}}<{\mathbf{T}}_{\mathrm{F}\text{-}\mathrm{HOMO}}<{\mathbf{T}}_{\mathrm{COMPETE}}<{\mathbf{T}}_{\mathrm{H}\text{-}\mathrm{MVME}}.$$

Experiments show that the execution time of COMPETE is only half that of H-MVME while achieving the same SI confidence. The execution time for COMPETE is only four times that of MCTI.
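
The operation counts of Secs. 4.2.1 to 4.2.4 can be evaluated side by side to check the ordering in Eq. (11). The sketch below uses the formulas as stated in the text; the QCIF frame size, the block size of 8, and the search range of 32 are assumptions chosen for the example, and the function names are illustrative.

```python
def t_mcti(w, h, bw, sr):
    """Full-search ME cost per frame (Sec. 4.2.1)."""
    return (2 * bw**2 - 1) * sr**2 * (h * w / bw**2)


def t_fhomo(w, h, bw, sr):
    """Fusion-based homography cost (Sec. 4.2.2)."""
    return (36 + 2 * sr**2) * h * w


def t_hmvme(w, h, bw, sr):
    """H-MVME cost: weights and averaging plus eight ME passes (Sec. 4.2.3)."""
    return (11 / bw**2 + 7) * h * w + 8 * t_mcti(w, h, bw, sr)


def t_compete(w, h, bw, sr, motion_ratio=0.5):
    """COMPETE cost with a given fraction of motion blocks (Sec. 4.2.4)."""
    classify = t_mcti(w, h, bw, sr)                    # motion / no-motion check
    warp = 6 * 15 * h * w                              # six perspective transforms
    disparity = 4 * (2 * bw**2 - 1) * (sr**2 / 16) * (h * w / bw**2)
    refine = 4 * t_mcti(w, h, bw, sr)                  # ME after disparity compensation
    weights = (11 / bw**2 + 7) * h * w
    return classify + warp + motion_ratio * (disparity + refine + weights)


# Assumed parameters: QCIF frame, 8x8 blocks, search range 32.
W, H, BW, SR = 176, 144, 8, 32
costs = {
    "MCTI": t_mcti(W, H, BW, SR),
    "F-HOMO": t_fhomo(W, H, BW, SR),
    "COMPETE": t_compete(W, H, BW, SR),
    "H-MVME": t_hmvme(W, H, BW, SR),
}
for name, ops in costs.items():
    print(f"{name:8s} {ops:14.0f} ops  ({ops / costs['MCTI']:.2f} x MCTI)")
```

Lowering `motion_ratio` reduces only the COMPETE count, which corresponds to the larger gains reported for low-complexity videos.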

## 4.2.5.

#### Practical execution time evaluation

The above time complexity analysis for different SI reconstruction methods is verified by practical execution time. All practical executions are implemented and executed on the same computer for fairness. The execution times of MDVC light encoder and H.264 encoder are first investigated.

Table 5 lists the average encoding time for one frame by MDVC, H.264 intra, H.264 inter no motion and H.264 inter, respectively. As shown, the MDVC light encoder spends about 5 to 15 times less than the others, which justifies the above time analysis. Table 6 lists the average execution time for reconstructing one SIF by MCTI, F-HOMO, H-MVME, and COMPETE, respectively. As shown, the average execution time for reconstructing one SIF of H-MVME is about eight times that of MCTI. For the COMPETE, this average execution time can be largely reduced for lower complexity videos. As the probability to process motion blocks in high complexity videos is high, the percentage of time reduction is limited, which is 1.29 to 2.56 times less than that of H-MVME. Table 7 lists the average turbo decoding time for different SI reconstruction methods. The performance of time reduction was evaluated based on the MCTI execution time for simplicity. Experiments showed that the decoding time would be reduced for higher SI confidence, which justifies that the proposed COMPETE can provide better SI than the others.

## Table 5

The average time (ms/frame) to encode one QCIF image with MDVC (even frames) and with H.264 in intra, inter-no-motion, and inter coding modes (GOP=12); MDVC times for CIF images are given in parentheses for comparison.

Video | MDVC QCIF (CIF) | H.264 Intra | H.264 Inter no motion | H.264 Inter |
---|---|---|---|---|
Race1 | 6.70 (23.51) | 32.67 | 73.39 | 100.06 |
Ballroom | 6.06 (23.07) | 33.46 | 71.97 | 99.27 |
Breakdancer | 6.06 (22.87) | 29.83 | 68.18 | 106.38 |
Exit | 6.96 (26.42) | 30.15 | 66.92 | 90.12 |
Ballet | 5.42 (21.51) | 29.36 | 65.97 | 91.86 |
Vassar | 6.06 (22.93) | 31.57 | 67.71 | 89.02 |
Average | 6.17 (23.38) | 31.17 | 69.02 | 96.12 |

## Table 6

The average time to construct one SIF (ms/frame).

Video | MCTI | F-HOMO | H-MVME | COMPETE |
---|---|---|---|---|
Race1 | 64.1 | 102.0 | 498.7 | 386.5 |
Ballroom | 62.8 | 102.0 | 496.8 | 291.1 |
Breakdancer | 63.8 | 100.4 | 495.2 | 312.5 |
Exit | 62.8 | 100.8 | 496.2 | 242.3 |
Ballet | 62.8 | 102.0 | 499.4 | 238.2 |
Vassar | 63.1 | 102.4 | 497.1 | 193.9 |
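
The speed-up figures quoted in Sec. 4.2.5 follow directly from the Table 6 entries. The snippet below is an arithmetic check on the published values; the variable names are illustrative.

```python
# SIF reconstruction time per frame in ms, columns: MCTI, F-HOMO, H-MVME, COMPETE.
TABLE6 = {
    "Race1":       (64.1, 102.0, 498.7, 386.5),
    "Ballroom":    (62.8, 102.0, 496.8, 291.1),
    "Breakdancer": (63.8, 100.4, 495.2, 312.5),
    "Exit":        (62.8, 100.8, 496.2, 242.3),
    "Ballet":      (62.8, 102.0, 499.4, 238.2),
    "Vassar":      (63.1, 102.4, 497.1, 193.9),
}

# How many times faster COMPETE is than H-MVME, and H-MVME relative to MCTI.
compete_speedup = {v: hm / co for v, (_, _, hm, co) in TABLE6.items()}
hmvme_vs_mcti = {v: hm / mc for v, (mc, _, hm, _) in TABLE6.items()}

for video in TABLE6:
    print(f"{video:12s} H-MVME/COMPETE = {compete_speedup[video]:.2f}, "
          f"H-MVME/MCTI = {hmvme_vs_mcti[video]:.2f}")
```

The smallest ratio belongs to Race1 and the largest to Vassar, which matches the observation that the time reduction is limited for high-complexity videos.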

## Table 7

The average change in turbo decoding time with different SI reconstruction methods, relative to MCTI. Negative values indicate a time saving.

$$\Delta \mathrm{Time}(\%)=\frac{T(\text{SI method})-T(\mathrm{MCTI})}{T(\mathrm{MCTI})}$$

| Video | F-HOMO (%) | H-MVME (%) | COMPETE (%) |
|---|---|---|---|
| Race1 | 1.92 | $-33.32$ | $-37.45$ |
| Ballroom | 9.78 | $-4.59$ | $-6.42$ |
| Breakdancer | $-4.74$ | $-17.72$ | $-19.89$ |
| Exit | 13.64 | 3.31 | $-11.20$ |
| Ballet | 12.61 | $-11.38$ | $-17.44$ |
| Vassar | 13.30 | $-6.05$ | $-7.88$ |
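
The relative decoding time used in Table 7 can be written as a small helper. This is only a sketch: the function name is ours, the factor of 100 is our reading of the (%) unit, and the timings in the example are hypothetical values chosen for illustration, not measurements from the paper.

```python
def delta_time_percent(t_method_ms: float, t_mcti_ms: float) -> float:
    """Relative change in decoding time versus MCTI, in percent.

    A negative result means the SI method decodes faster than MCTI.
    """
    if t_mcti_ms <= 0:
        raise ValueError("the MCTI reference time must be positive")
    return (t_method_ms - t_mcti_ms) / t_mcti_ms * 100.0


# Hypothetical timings (ms): a method that needs 150 ms where MCTI needs 200 ms.
print(delta_time_percent(150.0, 200.0))  # -25.0, i.e., a 25% time saving
```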

## 4.3.

### Subjective Performance Evaluation

This section presents the subjective performance of the different methods on the test videos. The H.264 quantization parameter (QP) is set to 26.

## 4.3.1.

#### Reconstructed Side Information Frames

The SIFs reconstructed by MCTI and F-HOMO exhibit severe block artifacts, which are smoothed by both the proposed COMPETE and the modified H-MVME. However, the latter suffers from block noise in low complexity videos, because its regular interpolation and block matching introduce noise in static blocks. The proposed COMPETE effectively eliminates this block noise through weighted compensation and prediction.

## 4.3.2.

#### Reconstructed Wyner–Ziv Frame

The SI confidence affects the quality of the reconstructed WZF. For one ${I}_{2t}^{\mathrm{WZ}}$ reconstructed with MCTI or F-HOMO, many image blocks cannot be recovered well because the SI confidence is low. In comparison, COMPETE and H-MVME yield higher SI confidence and hence higher quality for ${\widehat{I}}_{2t}^{\mathrm{WZ}}$. Although COMPETE and H-MVME achieve comparable PSNRs for ${\widehat{I}}_{2t}^{\mathrm{WZ}}$, the former requires less computation. The resulting images are shown in Fig. 17. The reconstructed videos show that moving objects, such as cars and persons, are blurred in the MCTI- and F-HOMO-based WZFs, whereas both COMPETE and H-MVME effectively eliminate this artifact for slow-motion videos, e.g., the legs in Breakdancer.

## 4.4.

### Practical Applications

The WZ decoder combines the SI and the received parity bits to recover the original symbols. Additional parity bits are requested if the original symbols cannot be reliably decoded, and this request-and-decode process is repeated until an acceptable symbol error probability is reached.^{2} Performing rate control at the decoder reduces the computational load of the encoder. The feedback also lets the decoder flexibly choose SI generation methods ranging from simple to sophisticated, which helps it adapt to different encoder applications. However, the feedback channel makes decoding interactive and may hinder practical applications that require independent encoding and decoding. Instead of adopting this request-and-decode procedure, the decoder could be implemented with a correlation estimation algorithm, in which the rates of previously reconstructed frames are used to predict the required rates sent to the encoder. Feedback-free^{45} and unidirectional DVC^{46} schemes have been proposed to make decoder operations independent of those of the encoder.
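
The request-and-decode procedure described above can be sketched as a simple loop. This is not the decoder used in the paper: `try_decode` is a hypothetical stand-in for the turbo decoder, and `make_toy_decoder` is a toy error model we assume purely to make the example runnable (every 100 parity bits halve the residual error).

```python
from typing import Callable, Tuple


def request_and_decode(
    try_decode: Callable[[int], float],
    total_parity_bits: int,
    chunk_bits: int,
    max_error_prob: float,
) -> Tuple[int, float]:
    """Request parity bits chunk by chunk until decoding is reliable enough.

    `try_decode(n)` (hypothetical) returns the estimated symbol error
    probability when the SI is combined with the first n parity bits.
    Returns the number of parity bits used and the final error estimate.
    """
    if chunk_bits <= 0 or total_parity_bits <= 0:
        raise ValueError("chunk_bits and total_parity_bits must be positive")
    received = 0
    while True:
        # Feedback step: ask the encoder for one more chunk of parity bits.
        received = min(received + chunk_bits, total_parity_bits)
        error_prob = try_decode(received)
        # Stop when decoding is acceptable or no parity bits are left.
        if error_prob <= max_error_prob or received == total_parity_bits:
            return received, error_prob


def make_toy_decoder(si_error_rate: float) -> Callable[[int], float]:
    """Toy model (assumption): each 100 parity bits halve the residual error."""
    return lambda n: si_error_rate * 0.5 ** (n / 100)


good_si = request_and_decode(make_toy_decoder(0.02), 2000, 100, 1e-3)
poor_si = request_and_decode(make_toy_decoder(0.16), 2000, 100, 1e-3)
print("parity bits requested with good SI:", good_si[0])
print("parity bits requested with poor SI:", poor_si[0])
```

Under this toy model, the more accurate SI needs fewer request rounds, which mirrors the observation that higher SI confidence shortens decoding. A feedback-free scheme would replace the loop with a single rate estimate made before transmission.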

## 5.

## Conclusions

For an MVC that adopts DVC coding (MDVC), we proposed to utilize interview video correlations and to exploit the bit value probability distribution of transform coefficients under the block-DCT video codec framework, so as to improve the SIF confidence and the accuracy of decoded bits while speeding up the decoder rate control process. The contributions of this paper are as follows. (1) For specific multiview video applications, such as wireless video sensor and wireless video surveillance networks, the proposed MDVC combines the advantages of the DVC and multiview video frameworks to enable efficient and low complexity video encoding. Simulations verified that the MDVC reduces the encoding complexity to at least five times lower than that of H.264/INTRA while enhancing the quality of reconstructed WZFs. (2) To improve the MDVC decoding performance, a multiview SI generation algorithm, COMPETE, was proposed to improve the quality of reconstructed SIFs and WZFs. Both the temporal correlations among intraview images and the disparity correlations among interview images were utilized to enhance WZF reconstruction. Simulation results showed that the PSNRs of WZFs reconstructed by COMPETE are 0.5 to 3.8 dB higher than those by MCTI when encoding low to high complexity videos. (3) To improve the MDVC rate control performance, we exploited the probability distribution of transform coefficient bits and reordered the transmission priorities of DC and AC coefficients, such that the turbo decoder requests the fewest bits to decode the WZF. Simulations demonstrated that the PSNRs of decoded WZFs are 0.2 to 3.5 dB higher than those encoded with H.264/INTRA under the same bit rates.

The COMPETE also outperformed H-MVME by 0.15 to 2.93 dB in image PSNR, and H-MVME in turn outperforms MVME by 0.5 to 1 dB. In addition, the COMPETE effectively reduced the computational complexity: its average SIF reconstruction time is 1.29 to 2.56 times smaller than that of H-MVME. Some recent research on video coding focuses on free-view video codecs and transmission. The proposed SI reconstruction method, COMPETE, under the MDVC framework can be extended to enhance the performance of free-view video codecs that have to handle dynamic and mobile encoders and view reconstruction, which we consider as future research. The COMPETE can also be carried out with a pixel-level disparity model. In addition, how to embed a small amount of information at the encoder^{22} to improve the decoding efficiency, together with the pixel-level disparity model, is also left for future research.

## Appendices

## Appendix:

### Linear Minimum Mean Squared Error

The LMMSE predictor computes the weights ${w}_{j}$ for an MC block ${B}_{i}(I,v)$ from four observations. The prediction error to be minimized can be represented as

## (12)

$$E_i[e^2] = E_i\left\{\left[B_i(I_{2t}) - B_i\left(\widehat{I}_{2t}^{\mathit{int}}\right)\right]^2 \,\middle|\, \forall\, B_i \in I_{2t}\right\} = E_i\left[\left(\mathbf{x}_i - \sum_{j=1}^{4} w_j \widehat{\mathbf{x}}_{ij}\right)^2 \,\middle|\, \forall\, x_i \in I_{2t}\right],$$

## (13)

$$E_i\left[\widehat{\mathbf{x}}_{ij} \cdot \left(\mathbf{x}_i - \sum_{j=1}^{4} w_j \widehat{\mathbf{x}}_{ij}\right)\right] = 0, \quad \mathrm{or} \quad \sum_{j=1}^{4} w_j \mathbf{R}_{\widehat{\mathbf{x}}_{ij}\widehat{\mathbf{x}}_{ij}} = \mathbf{R}_{\mathbf{x}_i \widehat{\mathbf{x}}_{ij}}.$$

## Acknowledgments

This work is partially supported by the Taiwan Ministry of Science and Technology with Grant No. MOST 105-2221-E-011-116 and Taiwan Building Technology Center with Grant No. IBRC 105H451709.

## References

## Biography

**Shih-Chieh Lee** received his PhD from the National Taiwan University of Science and Technology in 2013 in electrical engineering. He is currently working at Nokia Networks as a network planning and optimization engineer. His research interests include image/video processing and the related topics in multimedia communications.

**Jiann-Jone Chen** received his PhD from the National Chiao-Tung University in 1997 in electronic engineering. He was a researcher with the Advanced Technology Center, Information and Communications Research Laboratories, Industrial Technology Research Institute (ITRI), Hsinchu. He is currently an associate professor in the Electrical Engineering Department of National Taiwan University of Science and Technology. His research interests include image/video processing, cloud video processing/streaming, image retrieval, and several topics in multimedia communications.

**Yao-Hong Tsai** received his PhD in information management from the National Taiwan University of Science and Technology (NTUST), Taipei, Taiwan, in 1999. He was a researcher with the Advanced Technology Center, Information and Communications Research Laboratories, Industrial Technology Research Institute (ITRI), Hsinchu. He is currently an associate professor with the Department of Information Management, Hsuan Chuang University, Hsinchu. His current research interests include image processing, pattern recognition, and computer vision.