This paper reports on recent developments in video coding standardization, particularly focusing on the Call for
Proposals (CfP) on video coding technology made jointly in January 2010 by ITU-T VCEG and ISO/IEC MPEG and the
April 2010 responses to that Call. The new standardization initiative is referred to as High Efficiency Video Coding
(HEVC) and its development has been undertaken by a new Joint Collaborative Team on Video Coding (JCT-VC)
formed by the two organizations. The HEVC standard is intended to provide significantly better compression capability
than the existing AVC (ITU-T H.264 | ISO/IEC MPEG-4 Part 10) standard. The results of the CfP are summarized, and
the first steps towards the definition of the HEVC standard are described.
A locally adaptive up-sampling method that improves the efficiency of a spatially scalable representation of
images in a spatial pyramid is presented. While linear methods use a globally optimized up-sampling filter
design, the method presented locally switches between enhancement of significant structures and smoothing of
flat regions that are dominated by noise. It is based on a locally adaptive Wiener filter expression that can
be implemented by the bilateral filter. The performance of the method is assessed in a scenario resembling its
possible use in the current joint MPEG/ITU-T activity on scalable video coding (SVC).
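The locally adaptive smoothing described above can be sketched with a brute-force bilateral filter: flat, noise-dominated regions are averaged strongly, while significant structures are preserved by the range kernel. This is a minimal illustration under assumed parameter names (`sigma_s`, `sigma_r`), not the paper's optimized Wiener-derived design:

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=1.5, sigma_r=0.1):
    """Brute-force bilateral filter on a 2D image in [0, 1]: smooths flat
    (noisy) regions while preserving significant edges."""
    h, w = img.shape
    pad = np.pad(img, radius, mode="edge")
    out = np.empty_like(img, dtype=float)
    # Precompute the spatial (domain) Gaussian weights once.
    ax = np.arange(-radius, radius + 1)
    gx, gy = np.meshgrid(ax, ax)
    spatial = np.exp(-(gx**2 + gy**2) / (2 * sigma_s**2))
    for i in range(h):
        for j in range(w):
            window = pad[i:i + 2*radius + 1, j:j + 2*radius + 1]
            # Range weights: small intensity differences -> strong smoothing,
            # large differences (edges) -> little smoothing.
            rng = np.exp(-((window - img[i, j])**2) / (2 * sigma_r**2))
            wgt = spatial * rng
            out[i, j] = (wgt * window).sum() / wgt.sum()
    return out
```

In an up-sampling pyramid, such a filter would be applied to the interpolated image, so the switch between enhancement and smoothing happens per pixel rather than through one global filter design.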
Owing to its open-loop structure and good decorrelation capability, motion-compensated temporal filtering (MCTF) provides a robust basis for highly efficient scalable video coding. Combining MCTF with spatial wavelet decomposition and embedded quantization results in a 3D wavelet video compression system, providing temporal, spatial, and SNR scalability. Recent results indicate that the overall coding performance of these systems can be maximized if temporal filtering is performed in the spatial domain (t+2D approach). However, compared to non-scalable video coding, the performance of t+2D systems may not be satisfactory if spatial scalability needs to be provided. One important reason is the problem of spatial scalability of motion information. In this paper we present a conceptually new approach to t+2D-based video compression with spatially scalable
motion information. We call our approach overcomplete MCTF, since multiple spatial-domain temporal filtering operations are needed to generate the lower spatial scales of the temporal subbands. Specifically, the encoder performs MCTF-based generation of reference sequences for the coarser spatial scales. We find that the newly generated reference sequences are of satisfactory quality. Compared to the conventional t+2D system, our approach allows the reconstruction quality at lower spatial scales to be optimized with only limited impact on the reconstruction quality at high spatial scales and bitrates.
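The temporal filtering at the core of MCTF can be illustrated with the Haar lifting scheme on frame pairs. The sketch below omits motion compensation entirely (in real MCTF the predict and update steps operate along motion trajectories) and uses the unnormalized lifting form:

```python
import numpy as np

def haar_lifting_analysis(frames):
    """Motion-free temporal Haar lifting (predict + update) on frame pairs.
    In actual MCTF the predict/update steps are motion-compensated."""
    A, B = frames[0::2], frames[1::2]           # even / odd frames
    H = [b - a for a, b in zip(A, B)]           # predict: temporal highpass
    L = [a + 0.5 * h for a, h in zip(A, H)]     # update: temporal lowpass
    return L, H

def haar_lifting_synthesis(L, H):
    """Invert the two lifting steps in reverse order: reconstruction
    is exact (open-loop, perfect-reconstruction structure)."""
    A = [l - 0.5 * h for l, h in zip(L, H)]
    B = [h + a for a, h in zip(A, H)]
    frames = []
    for a, b in zip(A, B):
        frames += [a, b]
    return frames
```

The open-loop property mentioned in the abstract corresponds to the fact that synthesis simply reverses the lifting steps, regardless of how the subbands were quantized.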
With the multimedia content description interface MPEG-7, powerful tools for video indexing are available, enabling content-based search and retrieval of individual shots and scenes in video. We focus in particular on the parametric motion descriptor. The motion parameters that are ultimately coded in the descriptor values require robust content extraction methods. In this paper, we introduce our approach to the extraction of global motion from video. For this purpose, we apply a constrained feature-point selection and matching approach to find correspondences between images. Subsequently, an M-estimator is used for robust estimation of the motion model parameters. We evaluate the performance of our approach using affine and biquadratic motion models, and compare it with a standard least-median-of-squares based approach to global motion estimation.
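The robust model-fitting step can be sketched as follows: given feature-point correspondences, an affine motion model is fitted by iteratively reweighted least squares with Huber weights. This is a generic stand-in for the paper's M-estimator, not its actual implementation; function and parameter names are chosen for illustration:

```python
import numpy as np

def estimate_affine_irls(src, dst, iters=20, k=1.0):
    """Fit an affine global motion model [x', y']^T = A @ [x, y, 1]^T to
    point correspondences by iteratively reweighted least squares (IRLS)
    with Huber weights, so that outlier matches are downweighted."""
    n = len(src)
    X = np.hstack([np.asarray(src, float), np.ones((n, 1))])  # (n, 3)
    Y = np.asarray(dst, float)                                # (n, 2)
    w = np.ones(n)
    for _ in range(iters):
        sw = np.sqrt(w)[:, None]
        # Weighted least squares over both output coordinates at once.
        A, *_ = np.linalg.lstsq(sw * X, sw * Y, rcond=None)
        r = np.linalg.norm(Y - X @ A, axis=1)                 # residuals
        w = np.where(r <= k, 1.0, k / np.maximum(r, 1e-12))   # Huber weights
    return A.T                                                # 2x3 matrix
```

The biquadratic model mentioned in the abstract would simply add second-order terms (x², xy, y²) to the design matrix X; the IRLS loop stays the same.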
Video coding standards for object-based coding and tools for multimedia content description are now available. Hence, we have powerful tools that can be used for content-based video coding, description, indexing and organization. In the past, it was difficult to extract higher-level semantics, such as video objects, automatically. In this paper, we present a novel approach to moving object region detection. For this purpose, we developed a framework that applies bidirectional global motion estimation and compensation in order to identify potential foreground object regions. After spatial image segmentation, the results are assigned to image segments and further diffused over the image region. This enables robust object region detection even in cases where the investigated object does not move continuously. Finally, each image segment can be classified as being situated either in the foreground or in the background. Subsequent region merging delivers foreground object masks which can be used to define the region of attention for content-based video coding, as well as for contour-based object classification.
In interframe wavelet video coding, wavelet-based motion-compensated temporal filtering (MCTF) is combined with spatial wavelet decomposition, allowing for efficient spatio-temporal decorrelation and temporal, spatial and SNR scalability. Contemporary interframe wavelet video coding concepts employ block-based motion estimation (ME) and compensation (MC) to exploit temporal redundancy between successive frames. Due to occlusion effects and imperfect motion modeling, block-based MCTF may generate temporal high-frequency subbands with block-wise varying coefficient statistics, and low-frequency subbands with block edges. Both effects may reduce the spatial transform gain and cause blocking artifacts. As a modification of MCTF, we present spatial highpass transition filtering (SHTF) and spatial lowpass transition filtering (SLTF), introducing smooth transitions between motion blocks in the high- and low-frequency subbands, respectively. Additionally, we analyze the propagation of quantization noise in MCTF and present an optimized quantization strategy to compensate for variations in synthesis filtering for different block types. Combining these approaches leads to a reduction of blocking artifacts, smoother temporal PSNR performance, and significantly improved coding efficiency.
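The idea of a smooth transition between motion blocks can be illustrated in one dimension: instead of switching abruptly from one block's prediction to its neighbor's at the block boundary, the two are crossfaded over a short transition zone. This toy sketch (names and the linear ramp are illustrative assumptions, not the SHTF/SLTF filter design) shows the principle:

```python
import numpy as np

def blend_block_boundary(left_pred, right_pred, transition=4):
    """Replace the hard boundary between two neighboring block predictions
    (each extended over the full support) by a linear crossfade over
    `transition` samples on each side of the boundary."""
    assert left_pred.shape == right_pred.shape
    n = left_pred.shape[-1]
    w = np.ones(n)                     # weight of the left prediction
    # Ramp from 1 (pure left prediction) down to 0 (pure right prediction).
    ramp = np.linspace(1.0, 0.0, 2 * transition + 2)[1:-1]
    w[n // 2 - transition : n // 2 + transition] = ramp
    w[n // 2 + transition:] = 0.0
    return w * left_pred + (1 - w) * right_pred
```

Where the two predictions agree, the crossfade changes nothing; where they differ (occlusions, imperfect motion), the discontinuity is spread out, which is what reduces the block edges in the low-frequency subbands.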
The amount of multimedia data available worldwide is increasing every day. There is a vital need to annotate multimedia data in order to allow universal content access and to provide content-based search-and-retrieval functionalities. Since supervised video annotation can be time-consuming, an automatic solution is desirable. We review recent approaches to content-based indexing and annotation of videos for different kinds of sports, and present our application for the automatic annotation of equestrian sports videos. We concentrate in particular on MPEG-7 based feature extraction and content description. We apply different visual descriptors for cut detection. Furthermore, we extract the temporal positions of single obstacles on the course by analyzing MPEG-7 edge information and taking specific domain knowledge into account. Having determined the single shot positions as well as the visual highlights, we store this information jointly with additional textual information in an MPEG-7 description scheme. Using this information, we generate content summaries which can be utilized in a user front-end to provide content-based access to the video stream, as well as content-based queries and navigation on a video-on-demand streaming server.
To exploit temporal interdependencies between consecutive frames, existing 3D wavelet video coding concepts employ blockwise motion estimation (ME) and compensation (MC). Because of local object motion, rotation or scaling, the processing of occlusion areas is problematic: in these regions, correct motion vectors (MV) cannot always be calculated, and blocking artifacts may appear at the motion boundaries to the connected areas, for which uniquely referenced MVs could be estimated. To avoid this, smooth transitions can be introduced around the occlusion pixels, blurring out the blocking artifacts. The proposed algorithm is based on the MC-EZBC 3D wavelet video coder (motion-compensated embedded video coding using zeroblocks of subband/wavelet coefficients and context modeling), which employs a lifting approach for temporal filtering.
Object shape features are powerful when used in similarity search and retrieval and in object recognition, because object shape is usually strongly linked to object functionality and identity. Many applications, including those concerned with visual object retrieval or indexing, are likely to use shape features. Such systems have to cope with scaling, rotation, deformation and partial occlusion of the objects to be described. The ISO standard MPEG-7 contains different shape descriptors, of which we focus especially on the region-shape descriptor. Since we found that the region-shape descriptor is not very robust against partial occlusion, we propose a slightly modified feature extraction method based on central moments. Furthermore, we compare our method with the original region-shape implementation and show that, with the proposed changes, the robustness of the region-shape descriptor against partial occlusion can be increased significantly.
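Central moments, on which the modified extraction is based, can be computed directly from a binary object mask. The sketch below shows plain and scale-normalized central moments; it illustrates the invariance properties only, not the paper's descriptor modification itself:

```python
import numpy as np

def central_moments(mask, p, q):
    """Central moment mu_pq of a binary object mask: translation-invariant,
    because coordinates are taken relative to the object centroid."""
    ys, xs = np.nonzero(mask)
    xc, yc = xs.mean(), ys.mean()
    return (((xs - xc) ** p) * ((ys - yc) ** q)).sum()

def normalized_moment(mask, p, q):
    """Scale-normalized central moment eta_pq = mu_pq / mu_00^(1+(p+q)/2),
    approximately invariant to object size on a discrete grid."""
    mu00 = central_moments(mask, 0, 0)          # = object area in pixels
    return central_moments(mask, p, q) / mu00 ** (1 + (p + q) / 2)
```

Because the centroid is recomputed from whatever part of the object is visible, moment-based features of this kind degrade gracefully rather than catastrophically under partial occlusion, which is the property the modified descriptor exploits.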
A novel concept for SNR scalability with motion compensation in the enhancement layer is introduced. The quantization of the prediction error at different quantization step sizes is performed in the same loop. This allows the application of bit-plane coding if the quantizer configuration is chosen appropriately. Since layered prediction is employed at the encoder, drift can occur at a base-layer decoder. The concept is therefore extended by a drift limitation operation. In this context, two approaches are investigated. One is based on a modification of the prediction error; in the second, the drift is controlled by dynamic clipping of the enhancement prediction. The proposed SNR scalability concept is applied to the lowpass band of a wavelet-based video coding scheme. The performance is compared with a conventional approach to SNR scalability with two and three quantization layers, respectively.
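The embedded-quantizer condition that makes bit-plane coding possible can be sketched with uniform quantizers whose step sizes differ by a factor of two: the enhancement index is then exactly the base index plus one refinement bit. This is a generic illustration (uniform mid-point reconstruction, no deadzone), not the paper's specific quantizer configuration:

```python
import numpy as np

def embedded_quantize(x, base_step):
    """Quantize the same prediction error at two step sizes chosen so that
    the enhancement layer is a bit-plane refinement of the base layer:
    enhancement step = base_step / 2."""
    base_idx = np.floor(x / base_step).astype(int)      # coarse index
    resid = x - base_idx * base_step                    # in [0, base_step)
    refine_bit = (resid >= base_step / 2).astype(int)   # one extra bit plane
    enh_idx = 2 * base_idx + refine_bit                 # embedded fine index
    return base_idx, refine_bit, enh_idx

def dequantize(idx, step):
    """Mid-point reconstruction of a quantization index."""
    return (idx + 0.5) * step
```

A base-layer decoder uses only `base_idx`; an enhancement decoder appends `refine_bit` and halves the step, which is why both layers can share one prediction loop.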
Due to the rapidly growing amount of multimedia content available on the internet, it is highly desirable to index multimedia data automatically and to provide content-based search and retrieval functionalities. The first step in describing and annotating video data is to split the sequences into sub-shots that correspond to semantic units. This paper addresses unsupervised scene change detection and keyframe selection in video sequences. Unlike other methods, this is performed using a standardized multimedia content description of the video data. We apply the MPEG-7 scalable color descriptor and the edge histogram descriptor for shot boundary detection and show that this method performs well. Furthermore, we propose to store the output data of our system in a video segment description scheme to provide simple but efficient search and retrieval functionalities for video scenes based on color features.
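The principle behind histogram-based shot boundary detection can be sketched in a few lines: hard cuts show up as large distances between the color statistics of consecutive frames. The sketch below uses a plain luminance histogram with an L1 distance as a simplified stand-in for the MPEG-7 scalable color and edge histogram descriptors; the threshold value is an illustrative assumption:

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.5):
    """Detect hard cuts by thresholding the L1 distance between normalized
    luminance histograms of consecutive frames (values in [0, 1])."""
    hists = [np.histogram(f, bins=bins, range=(0.0, 1.0))[0] / f.size
             for f in frames]
    cuts = []
    for t in range(1, len(hists)):
        # Large histogram change => shot boundary between frame t-1 and t.
        if np.abs(hists[t] - hists[t - 1]).sum() > threshold:
            cuts.append(t)
    return cuts
```

Within a shot the histogram distance stays small even under moderate motion, which is why color-based descriptors work well for cut detection while remaining cheap to compute.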
Compression of stereoscopic and multiview video data is important, because the bandwidth necessary for storage and transmission increases linearly with the number of camera channels. This paper gives an overview of techniques that ISO's Moving Picture Experts Group has defined in the MPEG-2 and MPEG-4 standards, or that can be applied in the context of these standards. A good tradeoff between exploitation of spatial and temporal redundancies can be obtained by application of hybrid coding techniques, which combine motion-compensated prediction along the temporal axis with 2D DCT transform coding within each image frame. The MPEG-2 multiview profile extends hybrid coding towards exploitation of inter-view-channel redundancies by implicitly defining disparity-compensated prediction. The main feature of the new MPEG-4 multimedia standard with respect to video compression is the possibility to encode objects with arbitrary shape separately. As one component of the segmented object's shape, it shall be possible to encode a dense disparity map, which can be accurate enough to allow generation of alternative views by projection. This way, a very high stereo/multiview compression ratio can be achieved. While the main application area of the MPEG-2 multiview profile is expected to be stereoscopic TV, multiview aspects of MPEG-4 are expected to play a major role in interactive applications, e.g. navigation through virtual 3D worlds with embedded natural video objects.
This paper describes algorithms that were developed for a stereoscopic videoconferencing system with viewpoint adaptation. The system identifies foreground and background regions, and applies disparity estimation to the foreground object, namely the person sitting in front of a stereoscopic camera system with a rather large baseline. A hierarchical block matching algorithm is employed for this purpose, which takes into account the position of high-variance feature points and the object/background border positions. Using the disparity estimator's output, it is possible to generate arbitrary intermediate views from the left- and right-view images. We have developed an object-based interpolation algorithm which produces high-quality results. It takes into account the fact that a person's face has a more or less convex surface. Interpolation weights are derived both from the position of the intermediate view and from the position of a specific point within the face. The algorithms have been designed for a real-time videoconferencing system with telepresence illusion. Therefore, an important aspect during development was the constraint of hardware feasibility, while still retaining sufficient quality of the intermediate-view images.
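The basic mechanism of disparity-compensated view interpolation can be shown on a single rectified scanline: each intermediate-view pixel blends a disparity-shifted left sample and right sample, weighted by the viewpoint position. This is a toy sketch under an assumed disparity convention (disparity given per intermediate-view pixel), not the paper's object-based interpolator, which additionally adapts the weights to the position within the face:

```python
import numpy as np

def intermediate_view_1d(left, right, disparity, alpha):
    """Synthesize an intermediate view on a 1-D rectified scanline.
    alpha in [0, 1] is the viewpoint position (0 = left, 1 = right).
    Convention: a point at intermediate position x appears at
    x + alpha*d in the left view and x - (1-alpha)*d in the right view."""
    n = left.shape[0]
    out = np.empty(n)
    for x in range(n):
        d = disparity[x]
        xl = min(max(int(round(x + alpha * d)), 0), n - 1)
        xr = min(max(int(round(x - (1 - alpha) * d)), 0), n - 1)
        # Viewpoint-dependent blend of the two corresponding samples.
        out[x] = (1 - alpha) * left[xl] + alpha * right[xr]
    return out
```

For alpha = 0 or 1 the function reproduces the left or right view exactly; for intermediate alpha, consistent disparities make both shifted samples agree, so the blend lands on the correctly shifted intermediate image.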
Three-dimensional (3-D) frequency coding is an alternative to the hybrid coding concepts used in today's standards. However, the lack of Motion Compensation (MC) techniques is a drawback of 3-D coders when compared to hybrid MC coding. This paper presents a 3-D Subband Coding (SBC) scheme with MC, which is based on a separable structure. Motion-compensated 2-tap Quadrature Mirror Filters (QMFs) are employed for subband decomposition along the temporal axis, and a parallel Time Domain Aliasing Cancellation (TDAC) filter bank is used for decomposition in the spatial domain. The temporal-axis analysis and synthesis operations are performed in a cascade structure. A special substitution technique, which occasionally places unfiltered values into the temporal lowpass band at any stage of the cascade, guarantees perfect-reconstruction synthesis in the case of a non-uniform Motion Vector Field (MVF) with full-pel MC accuracy. With sub-pel accuracy, though coding efficiency is higher, perfect reconstruction is no longer guaranteed. Lattice Vector Quantization (LVQ) has been employed to encode the subband samples. In contrast to hybrid coders, it is straightforward to perform quantization with spatio-temporal perceptual weighting. Coding results are presented which show that standard hybrid coding concepts are outperformed by up to 4 dB by the 3-D MC-SBC/VQ coder.
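The 2-tap QMF pair along the temporal axis is, in its motion-free form, the normalized Haar butterfly; the sketch below shows its perfect-reconstruction property. Motion compensation and the substitution technique for unconnected pixels are omitted, so this illustrates only the filter-bank skeleton of the scheme:

```python
import numpy as np

def qmf2_analysis(a, b):
    """Normalized 2-tap QMF (Haar) analysis of a temporal frame pair:
    returns the temporal lowpass and highpass subbands."""
    s = np.sqrt(2.0)
    return (a + b) / s, (a - b) / s

def qmf2_synthesis(lo, hi):
    """Inverse butterfly: reconstructs the original frame pair exactly
    (perfect reconstruction in the absence of quantization)."""
    s = np.sqrt(2.0)
    return (lo + hi) / s, (lo - hi) / s
```

In the cascade described above, the lowpass output of one stage becomes the input frame of the next, halving the temporal resolution at each level; full-pel MC warps the frames before this butterfly, which is where the substitution technique is needed to preserve invertibility.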