Three-dimensional wavelet video coding has been investigated by many scholars because its multiresolution nature can support spatial and temporal scalability simultaneously. Most wavelet video coding schemes fall into two categories: “$\mathrm{T}+2\mathrm{D}$” and “$2\mathrm{D}+\mathrm{T}$”.^{1} The major difference between them is whether the temporal transform is applied before or after the spatial decomposition. Since motion-compensated temporal filtering (MCTF) is usually used for the temporal transform, “$\mathrm{T}+2\mathrm{D}$” is also called the spatial-domain MCTF (SDMCTF) scheme and “$2\mathrm{D}+\mathrm{T}$” the in-band MCTF (IBMCTF) scheme. The IBMCTF scheme is particularly attractive because of its inherent spatial scalability and flexible coding framework.

In the IBMCTF scheme, the coefficients of each spatial band obtained by the 2-D spatial wavelet decomposition carry some perceptual redundancy. At a given bitrate, fully coding such visually redundant coefficients reduces the bits available for the comparatively important coefficients in the band, so the overall perceptual quality of the coded video deteriorates. In fact, coefficients below the just-noticeable-distortion (JND) value can be removed safely, since human eyes cannot sense changes below the JND threshold around a coefficient, owing to their underlying spatial/temporal sensitivity and masking properties.^{2} From the signal compression viewpoint, removing the visually redundant coefficients frees coding bits for the visually important coefficients, improving visual quality.

In this paper, we propose a perceptually-adaptive preprocessing method for in-band MCTF-based 3-D wavelet video coding. A locally adaptive wavelet domain JND profile is first proposed, which is then incorporated into a preprocessor of the in-band MCTF to remove the visually redundant coefficients before performing the MCTF of each spatial band.

Figure 1 shows the framework of the proposed perceptually-adaptive in-band preprocessing scheme for 3-D wavelet video coding. The spatial wavelet transform is first applied to the original video sequence, which generates multiple spatial bands. Then each spatial band is preprocessed to remove the visually insignificant coefficients guided by a wavelet domain JND profile, which is built according to both the local property of each wavelet coefficient and the quantization noise visibility of each spatial band. After preprocessing, MCTF is performed to exploit the temporal correlation within each spatial band. For each temporal band of a certain spatial band, the spatial transform can be further employed to exploit the spatial correlation. Finally, the residual coefficients, motion vectors and modes of each spatiotemporal band are coded independently so that the server can simply drop the unnecessary spatiotemporal bands according to the resolution requested by the client.

Since human eyes have underlying spatial/temporal sensitivity and masking properties, an appropriate JND model can significantly improve the performance of video coding algorithms. Several methods for finding JND have been proposed, based on intensive research in the subband domain as well as some work in the image domain.^{3, 4, 5} Watson et al.^{3} modeled discrete wavelet transform (DWT) noise visibility thresholds as a function of scale, orientation, and display visual resolution. Their threshold model is based on the psychovisual detection of noise injected into wavelet bands. Because the model does not consider the local property of each wavelet coefficient, every coefficient in a spatial band shares the same threshold.

Based on Watson’s threshold model, we formulate a locally adaptive wavelet domain JND profile as given in Eq. 1, in which Watson’s band-wise thresholds are modulated by the local activity factor of each wavelet coefficient:

$$JN{D}_{TH}(l,\theta ,i,j)=T(\theta ,f)\cdot {S}_{t}(l,\theta ,i,j)\tag{1}$$
where $T(\theta ,f)$ is the quantization noise visibility threshold of each spatial band and ${S}_{t}(l,\theta ,i,j)$ is a local activity factor. Here $l$ denotes the scale of the spatial wavelet transform; $\theta $ indexes the spatial bands produced by each spatial wavelet transform, with possible values $\{1,2,3,4\}$ corresponding to the low-low-pass (LL), high-low-pass (HL), high-high-pass (HH), and low-high-pass (LH) bands; and $i$ and $j$ denote the coordinates of the coefficient within each spatial band. The threshold $T(\theta ,f)$ can be computed as follows^{3}:

$$\mathrm{log}\left[T(\theta ,f)\right]=\mathrm{log}\left(a\right)+k{\left[\mathrm{log}\left(f\right)-\mathrm{log}\left({g}_{\theta }{f}_{0}\right)\right]}^{2}\tag{2}$$

Considering that variance is a good indication of local activity, we define the local activity factor ${S}_{t}(l,\theta ,i,j)$ of each wavelet coefficient as follows:

$${S}_{t}(l,\theta ,i,j)=2-\frac{1}{1+\lambda {\sigma }_{R}^{2}(l,\theta ,i,j)}\tag{3}$$

where ${\sigma }_{R}^{2}(l,\theta ,i,j)$ is the local variance in an $R\times R$ window centered at $(i,j)$ in the spatial band $(l,\theta )$. The second term in the expression is similar to the best-known form of the empirical noise visibility function (NVF) in image restoration applications,^{6} which is the basic prototype for many adaptive regularization algorithms in the image domain.^{7, 8} Since the wavelet coefficients still exhibit strong local activity even in the spatial high-frequency bands, we can apply this prototype in the wavelet domain. Here $\lambda $ is a subband-dependent contrast adjustment parameter, computed as in Eq. 4 under the assumption that the noise can be modeled by a nonstationary Gaussian process^{7}:

$$\lambda =\frac{D}{{\sigma }_{\mathrm{max}}^{2}(l,\theta )}\tag{4}$$

where ${\sigma }_{\mathrm{max}}^{2}(l,\theta )$ is the maximum local variance for the spatial band $(l,\theta )$ and $D\in [50,100]$ is an empirical parameter.

This adjustment factor shows that the JND values in highly textured and edged areas are larger than those in flat regions of the same subband. With the above wavelet domain JND, we can define the following perceptually adaptive in-band preprocessor:

$$Co(l,\theta ,i,j)=\begin{cases}0 & \text{if}\ |Co(l,\theta ,i,j)|\le JN{D}_{TH}(l,\theta ,i,j)\\ Co(l,\theta ,i,j) & \text{otherwise}\end{cases}\tag{5}$$

In this preprocessor, a coefficient whose magnitude does not exceed the wavelet domain JND value is regarded as insignificant and set to zero. Since the JND profile is locally adaptive, the preprocessing removes the visually insignificant coefficients while keeping the visually significant ones. This benefits the subsequent processing of each spatial band, because more coding bits become available for the visually important coefficients, and thus the overall visual quality of the coded video is improved.
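The thresholding step of Eq. 5 is straightforward; a minimal sketch, assuming the band and its JND profile are given as arrays of the same shape:

```python
import numpy as np

def preprocess_band(coeffs, jnd):
    """Eq. 5: zero out coefficients whose magnitude does not exceed the
    local JND threshold; leave all other coefficients unchanged."""
    out = coeffs.copy()
    out[np.abs(coeffs) <= jnd] = 0.0
    return out
```

The operation is applied once per spatial band, before the temporal filtering of that band.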

We validated the perceptually-adaptive in-band preprocessing scheme in the MPEG scalable video coding (SVC) reference software of the wavelet ad hoc group.^{9} In the experiments, the video is first decomposed into four spatial bands with the 9/7 filter. The coefficients of each spatial band are then perceptually preprocessed with the proposed scheme, using a $5\times 5$ window for computing the local variance and a contrast adjustment factor $D$ set to 100. After the preprocessing step, a four-level MCTF with the 5/3 filter is performed in each spatial band.
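For illustration, one level of 5/3 lifting along the temporal axis can be sketched as below. Motion compensation is omitted (the actual MCTF filters along motion trajectories), so this is a simplified sketch rather than the scheme’s implementation:

```python
import numpy as np

def mctf_53_level(frames):
    """One level of 5/3 lifting along time, WITHOUT motion compensation
    (simplified sketch).  frames: array of shape (T, H, W), T even."""
    even = frames[0::2].astype(float)
    odd = frames[1::2].astype(float)
    # Predict step: high-pass = odd frame minus average of neighboring
    # even frames (symmetric extension at the sequence end).
    nxt = np.concatenate([even[1:], even[-1:]])
    high = odd - 0.5 * (even + nxt)
    # Update step: low-pass = even frame plus a quarter of the
    # neighboring high-pass frames (symmetric extension at the start).
    prv = np.concatenate([high[:1], high[:-1]])
    low = even + 0.25 * (prv + high)
    return low, high
```

Applying this four times to the low-pass output yields the four-level temporal decomposition used in the experiments; a static sequence produces all-zero high-pass bands, which is the source of the temporal coding gain.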

Figure 2 compares the visual quality of the decoded Foreman sequences with and without preprocessing. In the figure, a decoded sequence named “$\mathrm{Foreman}\_\mathrm{QCIF}\_15\mathrm{Hz}\_48\mathrm{k}$” means that the bit-stream of the “Foreman” sequence is decoded at QCIF image size, a frame rate of 15 frames/s, and a bitrate of 48 kbits/s. The visual quality is consistently better for the video decoded with the proposed preprocessing method across different resolutions, frame rates, and bitrates. As shown in the figure, some artifacts and noise are removed, which makes flat areas such as Foreman’s face and neck look smoother and more comfortable. In addition, some important texture detail, such as Foreman’s mouth, teeth, and ears, becomes clearer.

To further confirm the visual quality improvement achieved by the proposed scheme, we performed a subjective quality evaluation according to the double stimulus continuous quality scale method in Rec. ITU-R BT.500.^{10} The mean opinion score (MOS) scales for viewers to rate the quality after viewing are: excellent (100–80), good (80–60), fair (60–40), poor (40–20), and bad (20–0). Five observers took part in the experiments. The subjective assessment was performed in a typical laboratory environment, using a 21-in. SONY G520 professional color monitor with a resolution of $1600\times 1200$. The viewing distance was approximately six times the image height. Difference mean opinion scores (DMOS) are calculated as the difference between the MOS of the original video and that of the decoded video; the smaller the DMOS, the higher the perceptual quality of the decoded video. Table 1 shows the DMOS values averaged over all five subjects for the decoded Foreman sequences, where schemes I and II denote the IBMCTF without and with preprocessing, respectively. The subjective rating is consistently better for the sequences decoded with the proposed scheme, which achieves an average subjective quality gain of 6.71 in DMOS.
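The DMOS bookkeeping amounts to averaging per-viewer MOS differences; a trivial sketch (the scores in the usage note are hypothetical, not from the experiment):

```python
def average_dmos(mos_pairs):
    """Average DMOS over viewers.  Each pair is (MOS of the original,
    MOS of the decoded video); a smaller result means the decoded video
    is perceptually closer to the original."""
    return sum(orig - dec for orig, dec in mos_pairs) / len(mos_pairs)
```

For example, `average_dmos([(90, 60), (80, 70)])` yields 20.0 for two hypothetical viewers.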

## Table 1

Average objective and subjective performance for the Foreman (300 frames) sequence without preprocessing (scheme I) and with preprocessing (scheme II).

| Decoded sequence | Scheme | PSNR(Y) (dB) | PSNR(U) (dB) | PSNR(V) (dB) | DMOS |
|---|---|---|---|---|---|
| QCIF_7.5Hz_32k | I | 29.2284 | 36.1330 | 34.9841 | 33.70 |
| | II | 29.2434 | 36.2042 | 35.0537 | 29.42 |
| QCIF_15Hz_48k | I | 29.9829 | 36.7561 | 36.9186 | 35.67 |
| | II | 30.0137 | 36.8749 | 37.0029 | 28.34 |
| CIF_15Hz_96k | I | 30.8515 | 37.2755 | 38.0539 | 37.28 |
| | II | 30.8696 | 37.2846 | 38.0390 | 31.59 |
| CIF_15Hz_192k | I | 33.3847 | 39.1865 | 40.2487 | 28.96 |
| | II | 33.3322 | 39.1912 | 40.3064 | 20.81 |
| CIF_30Hz_256k | I | 33.7564 | 39.5507 | 40.7925 | 30.15 |
| | II | 33.7156 | 39.5924 | 40.8259 | 22.07 |

The PSNR results for the decoded Foreman sequences are also listed in Table 1. The IBMCTF scheme with the proposed preprocessing has almost the same PSNR performance as the scheme without it; interestingly, the objective coding performance does not increase. The underlying reason may be that in a conventional IBMCTF the signal distortion is introduced by the embedded quantization of the wavelet coefficients, whereas in the proposed scheme the additional distortion from the JND-adaptive preprocessing must also be considered. Therefore, although removing the visually insignificant coefficients saves some bits for coding the visually significant ones, it cannot guarantee an improvement in the overall objective quality measured by PSNR, because of the additional signal distortion introduced by the preprocessing. For the motion-compensated residue preprocessor in the closed-loop predictive coding paradigm,^{5} a method for determining the optimum parameter has been devised to improve PSNR at a given bitrate for nonscalable video coding. However, such an optimization is inapplicable to the open-loop MCTF coding paradigm, which has to adapt to a wide range of bitrates and spatiotemporal resolutions. The proposed preprocessing scheme therefore targets improvement of the overall subjective quality rather than the objective quality.

## Acknowledgment

This work was supported by National Natural Science Foundation of China under Grant Nos. 60332030 and 60502034, and Shanghai Rising-Star Program under Grant No. 05QMX1435.

## References

*Proc. Picture Coding Symp.*, pp. 589–593 (2004).