Google started the WebM Project in 2010 to develop open-source, royalty-free video codecs designed specifically for media on the Web. The second-generation codec released by the WebM project, VP9, is currently served by YouTube and enjoys billions of views per day. Realizing the need for even greater compression efficiency to cope with the growing demand for video on the web, the WebM team, within a consortium of major tech companies called the Alliance for Open Media, embarked on an ambitious project to develop a next-edition codec, AV1, that achieves at least a generational improvement in coding efficiency over VP9. In this paper, we focus primarily on new tools in AV1 that improve the prediction of pixel blocks before transforms, quantization and entropy coding are invoked. Specifically, we describe tools and coding modes that improve intra, inter and combined inter-intra prediction. Results are presented on standard test sets.
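As an illustration of the combined inter-intra idea, the sketch below blends an intra predictor with an inter predictor using a weight that decays away from the reconstructed top/left edge. The weight shape and clipping range are illustrative assumptions, not AV1's normative definition.

```python
import numpy as np

def combined_inter_intra(intra_pred: np.ndarray, inter_pred: np.ndarray) -> np.ndarray:
    """Blend an intra and an inter predictor for one block.

    The weight decays with distance from the top-left edge, so samples near
    the reconstructed neighbours lean on the intra predictor and samples far
    from them lean on the inter predictor.  The exact decay is an
    illustrative assumption, not the AV1-normative mask.
    """
    h, w = intra_pred.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    # Normalised distance from the top-left corner, in [0, 1].
    dist = (rows / max(h - 1, 1) + cols / max(w - 1, 1)) / 2.0
    w_intra = 1.0 - dist          # more intra weight near the causal edge
    blended = w_intra * intra_pred + (1.0 - w_intra) * inter_pred
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```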
The demand for streaming video content is on the rise and growing exponentially. Network bandwidth is very costly, so there is a constant effort to improve video compression rates and enable the transmission of reduced data volumes while retaining quality of experience (QoE). One basic feature that exploits the spatial correlation of pixels for video compression is Intra-Prediction, which strongly influences the codec's compression efficiency. Intra prediction enables a significant reduction of the Intra-Frame (I-frame) size and, therefore, contributes to efficient exploitation of bandwidth. In this presentation, we propose new Intra-Prediction algorithms that improve the AV1 prediction model and provide better compression ratios. Two (2) types of methods are considered: (1) a new scanning-order method that maximizes spatial correlation in order to reduce prediction error; and (2) new Intra-Prediction modes implemented in AV1. Modern video coding standards, including the AV1 codec, utilize fixed scan orders in processing blocks during intra coding. Fixed scan orders typically result in residual blocks with high prediction error, mainly in blocks with edges. This means that fixed scan orders cannot fully exploit the content-adaptive spatial correlations between adjacent blocks, so the bitrate after compression tends to be large. To reduce the bitrate induced by inaccurate intra prediction, the proposed approach adaptively chooses the scanning order of blocks according to the criterion of first predicting the blocks with the maximum number of surrounding, already Inter-Predicted blocks; a sketch of this selection rule is given below. Using the modified scanning-order method and the new modes reduced the MSE by up to five (5) times when compared to the conventional TM mode with raster scan, and by up to two (2) times when compared to the conventional CALIC mode with raster scan, depending on the image characteristics (which determine the percentage of blocks predicted with Inter-Prediction and, in turn, the efficiency of the new scanning method). For the same cases, the PSNR was shown to improve by up to 7.4 dB and up to 4 dB, respectively. The new modes yielded a 5% improvement in BD-Rate over the traditionally used modes when run on the K-Frame, which is expected to yield roughly 1% of overall improvement.
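A minimal sketch of the stated scanning criterion follows: blocks are scheduled greedily, always taking next the block with the largest number of already-coded neighbours, with raster order as the tie-break. The 8-connected neighbourhood and the greedy tie-break are assumptions; the paper's exact rule may differ.

```python
import numpy as np

def num_coded_neighbours(done, r, c):
    """Count 8-connected neighbours of block (r, c) that are already coded."""
    rows, cols = done.shape
    return sum(
        done[r + dr, c + dc]
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols
    )

def adaptive_scan_order(coded: np.ndarray):
    """Return a processing order for the not-yet-coded blocks such that blocks
    surrounded by the most already-coded blocks come first.  `coded` marks
    blocks already handled (e.g. by inter prediction).  Ties fall back to
    raster order.  This is a sketch of the selection criterion only."""
    done = coded.astype(bool).copy()
    rows, cols = done.shape
    remaining = [(r, c) for r in range(rows) for c in range(cols) if not done[r, c]]
    order = []
    while remaining:
        # Pick the block with the maximum number of coded neighbours.
        best = max(enumerate(remaining),
                   key=lambda ic: (num_coded_neighbours(done, *ic[1]), -ic[0]))[1]
        order.append(best)
        done[best] = True
        remaining.remove(best)
    return order
```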
Google started the WebM Project in 2010 to develop open-source, royalty-free video codecs designed specifically for media on the Web. The second-generation codec released by the WebM project, VP9, is currently served by YouTube and enjoys billions of views per day. Realizing the need for even greater compression efficiency to cope with the growing demand for video on the web, the WebM team embarked on an ambitious project to develop a next-edition codec, VP10, that achieves at least a generational improvement in coding efficiency over VP9. Starting from VP9, a set of new experimental coding tools have already been added to VP10 to achieve decent coding gains. Subsequently, Google joined a consortium of major tech companies called the Alliance for Open Media to jointly develop a new codec, AV1. As a result, the VP10 effort is largely expected to merge with AV1. In this paper, we focus primarily on new tools in VP10 that improve coding of the prediction residue using transform coding techniques. Specifically, we describe tools that increase the flexibility of available transforms, allowing the codec to handle a more diverse range of residue structures. Results are presented on a standard test set.
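To make the notion of transform flexibility concrete, here is a hedged sketch of per-block selection among separable row/column transform pairs. scipy's DST-II stands in for the codec's ADST, and the count of non-zero quantised coefficients is a crude stand-in for a full rate-distortion cost; both are assumptions for illustration only.

```python
import numpy as np
from scipy.fft import dct, dst

def transform_2d(block, row_kind, col_kind):
    """Separable 2-D transform; `row_kind`/`col_kind` are 'dct' or 'dst'.
    scipy's DST-II stands in for the codec's ADST here (an assumption)."""
    t = {"dct": dct, "dst": dst}
    out = t[col_kind](block, type=2, axis=0, norm="ortho")
    return t[row_kind](out, type=2, axis=1, norm="ortho")

def pick_transform(residual, qstep=8.0):
    """Choose the row/column transform pair for one residual block using a
    crude rate proxy: the count of non-zero quantised coefficients.  A real
    encoder would evaluate a full rate-distortion cost instead."""
    candidates = [("dct", "dct"), ("dct", "dst"), ("dst", "dct"), ("dst", "dst")]
    costs = {}
    for pair in candidates:
        coeffs = transform_2d(residual, *pair)
        costs[pair] = np.count_nonzero(np.rint(coeffs / qstep))
    return min(costs, key=costs.get)
```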
The demand for high-quality video is permanently on the rise, and with it the need for more effective compression. Compression scope can be further expanded due to the increased spatial correlation of pixels within a high-quality video frame. One basic feature that takes advantage of pixels' spatial correlation for video compression is Intra-Prediction, which strongly influences the codec's compression efficiency. Intra-Prediction enables significant reduction of the Intra-frame (I-frame) size and, therefore, contributes to more efficient bandwidth exploitation. It has been observed that the intra-frame coding efficiency of VP9 is not as good as that of H.265/MPEG-HEVC. One possible reason is that HEVC's Intra-Prediction algorithm uses as many as 35 prediction directions, while VP9 uses only 9 directions, including the TM prediction mode. Therefore, there is high motivation to improve the Intra-Prediction scheme with new, original and proprietary algorithms that will enhance the overall performance of Google's future codec and bring its performance closer to that of HEVC. In this work, instead of using different angles for prediction, we introduce four unconventional Intra-Prediction modes for the VP10 codec: Weighted CALIC (WCALIC), Intra-Prediction using a System of Linear Equations (ISLE), Prediction of Discrete Cosine Transformation (PrDCT) Coefficients, and Reverse Least Power of Three (RLPT). Employed on a selection of eleven (11) typical images with a variety of spatial characteristics, and using the Mean Square Error (MSE) as the evaluation criterion, we show that our proposed algorithms (modes) were preferred, and thus selected, for around 57% of the blocks, reducing the average prediction error (MSE) by 26%. We believe that our proposed techniques will achieve higher compression without compromising video quality, thus improving the Rate-Distortion (RD) performance of the compressed video stream.
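The per-block selection criterion described above (lowest MSE among candidate modes) can be sketched as follows. The predictor internals (WCALIC, ISLE, PrDCT, RLPT) are not reproduced; they are assumed to be supplied as callables that generate a prediction from the block's causal neighbourhood.

```python
import numpy as np

def block_mse(pred, target):
    """Mean squared prediction error for one block."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff * diff))

def choose_intra_mode(target_block, candidate_predictors):
    """Pick, per block, whichever candidate mode yields the lowest MSE.

    `candidate_predictors` maps a mode name (e.g. 'TM', 'WCALIC', 'ISLE',
    'PrDCT', 'RLPT') to a zero-argument callable that produces a prediction
    from the block's causal neighbourhood (a closure over the frame state).
    """
    best_mode, best_err = None, float("inf")
    for name, predict in candidate_predictors.items():
        err = block_mse(predict(), target_block)
        if err < best_err:
            best_mode, best_err = name, err
    return best_mode, best_err
```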
Google started an open-source project, entitled the WebM Project, in 2010 to develop royalty-free video codecs for the web. The present generation codec developed in the WebM project called VP9 was finalized in mid-2013 and is currently being served extensively by YouTube, resulting in billions of views per day. Even though adoption of VP9 outside Google is still in its infancy, the WebM project has already embarked on an ambitious project to develop a next edition codec VP10 that achieves at least a generational bitrate reduction over the current generation codec VP9. Although the project is still in early stages, a set of new experimental coding tools have already been added to baseline VP9 to achieve modest coding gains over a large enough test set. This paper provides a technical overview of these coding tools.
Google has recently been developing a next-generation open-source video codec called VP9, as part of the experimental branch of the libvpx repository included in the WebM project (http://www.webmproject.org/). Starting from the VP8 video codec released by Google in 2010 as the baseline, a number of enhancements and new tools have been added to improve the coding efficiency. This paper provides a technical overview of the current status of this project, along with comparisons with other state-of-the-art video codecs H.264/AVC and HEVC. The new tools that have been added so far include: larger prediction block sizes up to 64x64, various forms of compound INTER prediction, more modes for INTRA prediction, ⅛-pel motion vectors and 8-tap switchable sub-pel interpolation filters, improved motion reference generation and motion vector coding, improved entropy coding and frame-level adaptation for various symbols, improved loop filtering, incorporation of Asymmetric Discrete Sine Transforms and larger 16x16 and 32x32 DCTs, frame-level segmentation to group similar areas together, etc. Other tools and various features are being actively worked on as well. The VP9 bitstream is expected to be finalized by early-to-mid 2013. Results show VP9 to be quite competitive in performance with mainstream state-of-the-art codecs.
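As an illustration of sub-pel interpolation with an 8-tap filter, the sketch below interpolates horizontal half-pel samples for one row of pixels. The filter coefficients are an illustrative low-pass kernel, not VP9's normative switchable filter sets.

```python
import numpy as np

# An illustrative 8-tap kernel for half-pel interpolation; the actual VP9
# switchable filters use different, normative coefficient sets.
HALF_PEL_TAPS = np.array([-1, 3, -11, 73, 73, -11, 3, -1], dtype=np.float64) / 128.0

def interpolate_half_pel_row(row: np.ndarray) -> np.ndarray:
    """Horizontally interpolate half-pel samples for one row of pixels using
    an 8-tap filter (edge samples are replicated for padding)."""
    padded = np.pad(row.astype(np.float64), (3, 4), mode="edge")
    out = np.empty_like(row, dtype=np.float64)
    for x in range(row.size):
        # Window covers pixels x-3 .. x+4, centred between x and x+1.
        out[x] = np.dot(HALF_PEL_TAPS, padded[x:x + 8])
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)
```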
The availability of 3D hardware has so far outpaced the production of 3D content. Although to date many
methods have been proposed to convert 2D images to 3D stereopairs, the most successful ones involve human
operators and, therefore, are time-consuming and costly, while the fully-automatic ones have not yet achieved
the same level of quality. This subpar performance is due to the fact that automatic methods usually rely on
assumptions about the captured 3D scene that are often violated in practice. In this paper, we explore a radically
different approach inspired by our work on saliency detection in images. Instead of relying on a deterministic
scene model for the input 2D image, we propose to "learn" the model from a large dictionary of stereopairs, such
as YouTube 3D. Our new approach is built upon a key observation and an assumption. The key observation is
that among millions of stereopairs available on-line, there likely exist many stereopairs whose 3D content matches
that of the 2D input (query). We assume that two stereopairs whose left images are photometrically similar
are likely to have similar disparity fields. Our approach first finds a number of on-line stereopairs whose left
image is a close photometric match to the 2D query and then extracts depth information from these stereopairs.
Since disparities for the selected stereopairs differ due to differences in underlying image content, level of noise,
distortions, etc., we combine them by using the median. We apply the resulting median disparity field to the 2D
query to obtain the corresponding right image, while handling occlusions and newly-exposed areas in the usual
way. We have applied our method in two scenarios. First, we used YouTube 3D videos in search of the most
similar frames. Then, we repeated the experiments on a small, but carefully-selected, dictionary of stereopairs
closely matching the query. This, to a degree, emulates the results one would expect from the use of an extremely
large 3D repository. While far from perfect, the presented results demonstrate that on-line repositories of 3D
content can be used for effective 2D-to-3D image conversion. With the continuously increasing amount of 3D data
on-line and with the rapidly growing computing power in the cloud, the proposed framework seems a promising
alternative to operator-assisted 2D-to-3D conversion.
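A minimal sketch of the fusion and synthesis steps is given below: a per-pixel median of the retrieved disparity fields, followed by forward warping of the 2-D query to obtain the right view. Stereo retrieval is out of scope here, and the simple left-neighbour hole filling merely stands in for the paper's occlusion handling.

```python
import numpy as np

def fuse_disparities(disparity_fields):
    """Per-pixel median of the disparity fields taken from the retrieved
    stereopairs (all assumed aligned to the 2-D query's resolution)."""
    return np.median(np.stack(disparity_fields, axis=0), axis=0)

def synthesize_right_view(left, disparity):
    """Forward-warp the left (query) image by the fused disparity to obtain a
    right view.  Occlusions and newly exposed areas are filled from the
    nearest rendered pixel on the left -- a stand-in for the paper's
    'usual' occlusion handling."""
    h, w = left.shape[:2]
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xr = int(round(x - disparity[y, x]))
            if 0 <= xr < w:
                right[y, xr] = left[y, x]
                filled[y, xr] = True
        # Fill holes on this row from the last rendered pixel to the left.
        for x in range(1, w):
            if not filled[y, x]:
                right[y, x] = right[y, x - 1]
    return right
```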
This work presents a new distributed multiview coding framework, based on the H.264/AVC standard operating
with mixed resolution frames. It allows for a scalable complexity transfer from the encoder to the decoder, which
is particularly suited for low-power video applications, such as multiview surveillance systems. Greater quality
sequences are generated by exploiting the spatial and temporal correlation between views at the decoder. The
results show a good potential for objective quality improvement over simulcast coding, with no extra rate cost.
In mobile-to-mobile video communications, both the transmitting and receiving ends may not have the necessary
computing power to perform complex video compression and decompression tasks. Traditional video codecs
typically have highly complex encoders and less complex decoders. However, Wyner-Ziv (WZ) coding allows for
a low complexity encoder at the price of a more complex decoder. We propose a video communication system
where the transmitter uses a WZ (reversed complexity) coder, while the receiver uses a traditional decoder,
hence minimizing complexity at both ends. For that to work, we propose to insert a transcoder in the network to convert the video stream. We present an efficient transcoder from a simple WZ approach to H.263. Our approach saves a large amount of computation by, among other things, reusing the motion estimation performed at the WZ decoder stage. Results are presented to demonstrate the transcoder performance.
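A sketch of the motion-estimation reuse is shown below: instead of a full search, the transcoder seeds a small local refinement with the motion vector already obtained at the WZ decoder. The SAD cost and the ±2 search window are illustrative assumptions.

```python
import numpy as np

def refine_motion_vector(ref, cur_block, top_left, seed_mv, search_range=2):
    """Refine a motion vector reused from the WZ decoder with a small local
    search, instead of running full-range motion estimation again.
    `top_left` is the block position in the current frame; SAD is the cost."""
    y0, x0 = top_left
    h, w = cur_block.shape
    best_mv, best_sad = seed_mv, float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            my, mx = seed_mv[0] + dy, seed_mv[1] + dx
            ry, rx = y0 + my, x0 + mx
            if 0 <= ry and ry + h <= ref.shape[0] and 0 <= rx and rx + w <= ref.shape[1]:
                sad = np.abs(ref[ry:ry + h, rx:rx + w].astype(int)
                             - cur_block.astype(int)).sum()
                if sad < best_sad:
                    best_mv, best_sad = (my, mx), sad
    return best_mv, best_sad
```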
A large number of practical coding scenarios deal with sources such as transform coefficients that can be well modeled as Laplacians. For regular coding of such sources, samples are often quantized by a family of uniform quantizers, possibly with a dead zone, and then entropy coded. For the Wyner-Ziv coding problem, when correlated side information is available at the decoder, the side information can be modeled as obtained by additive Laplacian or Gaussian noise on the source. This paper deals with the optimal choice of parameters for practical Wyner-Ziv coding in such scenarios, using the same quantizer family as in the regular codec to cover a range of rate-distortion trade-offs, given the variances of the source and additive noise. We propose and analyze a general encoding model that combines source coding and channel coding, and show that at practical block lengths and code complexities, not pure channel coding but a hybrid combination of source coding and channel coding with the right parameters provides optimal rate-distortion performance. Further, for the channel-coded bit-planes, we observe that only high-rate codes are useful. We also provide a framework for on-the-fly parameter choice based on a non-parametric representation of a set of seed functions, for use in scenarios where variances are estimated during encoding. A good understanding of the optimal parameter selection mechanism is essential for building practical distributed codecs.
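As a concrete example of the quantizer family mentioned above, here is a sketch of a uniform quantizer with a widened dead zone and its mid-point reconstruction. The parameterization (dead zone expressed as a fraction of the step) is one common convention, chosen here for illustration.

```python
import numpy as np

def deadzone_quantize(x, step, deadzone_ratio=1.0):
    """Uniform quantiser with a (possibly widened) dead zone around zero.

    Values with |x| <= deadzone_ratio * step / 2 map to index 0; outside the
    dead zone the quantiser is uniform with bin width `step`."""
    x = np.asarray(x, dtype=np.float64)
    dz = deadzone_ratio * step / 2.0
    mag = np.maximum(np.abs(x) - dz, 0.0)
    idx = np.sign(x) * np.ceil(mag / step)
    return idx.astype(np.int64)

def deadzone_dequantize(idx, step, deadzone_ratio=1.0):
    """Mid-point reconstruction for the dead-zone quantiser above."""
    dz = deadzone_ratio * step / 2.0
    mag = np.where(idx == 0, 0.0, dz + (np.abs(idx) - 0.5) * step)
    return np.sign(idx) * mag
```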
A printed photograph is difficult to reuse because the digital information that generated the print may no longer be
available. This paper describes a mechanism for approximating the original digital image by combining a scan of the
printed photograph with small amounts of digital auxiliary information kept together with the print. The auxiliary
information consists of a small amount of digital data to enable accurate registration and color-reproduction,
followed by a larger amount of digital data to recover residual errors and lost frequencies by distributed Wyner-Ziv
coding techniques. Approximating the original digital image enables many uses, including making good quality reprints from the original print, even when it has faded many years later. In essence, the print itself becomes the currency for archiving and repurposing digital images, without requiring computer infrastructure.
Codecs such as H.264/AVC involve computationally intensive tasks that often prohibit real-time implementation. It has been observed that the complexity of such video encoders can be tuned gracefully to a desired level through the use of a smaller set of macroblock types in mode decision and a lower motion-vector precision in motion estimation. The rate-distortion performance, however, is affected as a consequence. In this paper, we propose a flexible syntax mechanism (FSM) to tune the encoder complexity while maintaining sufficient rate-distortion performance. The key idea behind the proposed FSM is twofold: first, both the subset of macroblock types and the precision of the motion vectors to be evaluated by the encoder are specified at a higher level of the bitstream syntax; second, the entropy coders are redesigned accordingly to represent the selected macroblock types and motion vectors effectively. Since the entropy coding is optimized in terms of bitrate consumption specifically for the subset of macroblock modes and the motion-vector precision, the rate-distortion performance is enhanced compared to the scenario where identical entropy codes are adopted regardless. Another advantage of our approach is its intrinsic scalability in complexity for video encoding under different complexity constraints. The proposed approach may be considered for the next generation of video codecs with flexible complexity profiles.
Object segmentation is important in image analysis for imaging tasks such as image rendering and image retrieval. Pet owners have been known to be quite vocal about how important it is to render their pets perfectly. We present here an algorithm for pet (mammal) fur color classification and an algorithm for pet (animal) fur texture classification. Pet fur color classification can be applied as a necessary condition for identifying the regions in an image that may contain pets, much like skin-tone classification for human flesh detection. As a result of evolution, fur coloration of all mammals is caused by a natural organic pigment called melanin, which has only very limited color ranges. We have conducted a statistical analysis and concluded that mammal fur colors can only be in levels of gray or in two colors after proper color quantization. This pet fur color classification algorithm has been applied to pet-eye detection. We also present an algorithm for animal fur texture classification using the recently developed multi-resolution directional sub-band Contourlet transform. The experimental results are very promising, as these transforms can identify regions of an image that may contain fur of mammals, scales of reptiles, feathers of birds, etc. Combining the color and texture classification, one can obtain a set of strong classifiers for identifying possible animals in an image.
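A hedged sketch of the color-based necessary condition is given below: after coarse quantization, a candidate fur region should be essentially grey-level or contain at most two distinct colors. The quantization depth and tolerances are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def could_be_mammal_fur(region_rgb: np.ndarray, levels: int = 8,
                        max_colors: int = 2, gray_tol: int = 12) -> bool:
    """Necessary-condition test from the fur-colour observation: after coarse
    colour quantisation the region should be essentially grey-level, or use
    at most two distinct colours.  Thresholds here are illustrative."""
    pixels = region_rgb.reshape(-1, 3).astype(int)
    # Grey-level check: small average spread between the RGB channels.
    if np.mean(pixels.max(axis=1) - pixels.min(axis=1)) < gray_tol:
        return True
    # Coarse uniform quantisation, then count distinct colours.
    q = pixels * levels // 256
    distinct = len(set(map(tuple, q.tolist())))
    return distinct <= max_colors
```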
A spatial-resolution reduction based framework for incorporation of a Wyner-Ziv frame coding mode in existing video
codecs is presented, to enable a mode of operation with low encoding complexity. The core Wyner-Ziv frame coder
works on the Laplacian residual of a lower-resolution frame encoded by a regular codec at reduced resolution. The
quantized transform coefficients of the residual frame are mapped to cosets to reduce the bit-rate. A detailed rate-distortion
analysis and procedure for obtaining the optimal parameters based on a realistic statistical model for the
transform coefficients and the side information is also presented. The decoder iteratively conducts motion-based side-information
generation and coset decoding, to gradually refine the estimate of the frame. Preliminary results are presented
for application to the H.263+ video codec.
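The coset mapping of quantized coefficients can be sketched as follows; the per-band choice of modulus and the iterative motion-based refinement loop are not reproduced here.

```python
def to_coset(index: int, modulus: int) -> int:
    """Map a quantiser index to its coset (index mod modulus), so only
    log2(modulus) bits need to be sent per coefficient."""
    return index % modulus

def from_coset(coset: int, modulus: int, side_info_index: int) -> int:
    """Recover the index from its coset using the side-information index:
    choose the member of the coset closest to the side information."""
    base = side_info_index - ((side_info_index - coset) % modulus)
    candidates = (base, base + modulus)
    return min(candidates, key=lambda c: abs(c - side_info_index))
```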
In this paper, we propose a layered complexity-aware encryption scheme that partially encrypts scalable video bitstreams to achieve the level of security that the specific application requires, within the computational complexity it can afford. The proposed scheme naturally combines wavelet-based scalable video coding techniques with the concept of selective encryption. We also study the relationship between rate, distortion, and the involved encryption complexity (R-D-EC). Here, distortion is used to measure the level of security of the encrypted video bitstream, and the percentage of the bitstream that is encrypted is used to denote the encryption complexity. Our simulation results indicate that selective encryption using the prioritized bitstream structure provided by scalable video coding can achieve almost bitrate-independent encryption complexity control.
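A minimal sketch of layer-selective encryption is shown below: only a chosen fraction of the priority-ordered layers is encrypted, and the rest pass through. AES-CTR from the `cryptography` package is used purely as an example cipher; the paper's wavelet-codec-specific scheme is not reproduced.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_leading_layers(layers, fraction, key):
    """Encrypt only the first `fraction` of the priority-ordered layers of a
    scalable bitstream (most important layers first); remaining layers pass
    through unencrypted.  `key` must be a 16/24/32-byte AES key.  Returns a
    list of (nonce, payload) pairs; nonce is None for unencrypted layers."""
    n_encrypted = int(round(fraction * len(layers)))
    out = []
    for i, layer in enumerate(layers):
        if i < n_encrypted:
            nonce = os.urandom(16)
            enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
            out.append((nonce, enc.update(layer) + enc.finalize()))
        else:
            out.append((None, layer))
    return out
```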
Part 7 of MPEG-21 entitled Digital Item Adaptation (DIA), is an emerging metadata standard defining protocols and descriptions enabling content adaptation for a wide variety of networks and terminals, with attention to format-independent mechanisms. The descriptions standardized in DIA provide a standardized interface not only to a variety of format-specific adaptation engines, but also to format-independent adaptation engines for scalable bit-streams. A fully format-independent engine contains a decision-taking module operating in a media-type and context independent manner, cascaded with a bit-stream adaptation module that models the bit-stream adaptation process as an XML transformation operating on a high-level syntax description of the bit-stream, with parameters derived from decisions taken. In this paper, we describe the DIA descriptions and underlying mechanisms that enable such fully format-independent scalable bit-stream adaptation. Further, a new model-based, compact and lightweight transformation language for scalable bit-streams is described for use in the bit-stream adaptation module. Fully format-independent adaptation mechanisms lead to universal adaptation engines that substantially reduce adoption costs for new media types and formats because the same delivery and adaptation infrastructure can be used for different types of scalable media, including proprietary and encrypted content.
Recently a methodology for representation and adaptation of arbitrary scalable bit-streams in a fully content non-specific manner has been proposed on the basis of a universal model for all scalable bit-streams called Scalable Structured Meta-formats (SSM). According to this model, elementary scalable bit-streams are naturally organized in a symmetric multi-dimensional logical structure. The model parameters for a specific bit-stream along with information guiding decision-making among possible adaptation choices are represented in a binary or XML descriptor to accompany the bit-stream flowing downstream. The capabilities and preferences of receiving terminals flow upstream and are also specified in binary or XML form to represent constraints that guide adaptation. By interpreting the descriptor and the constraint specifications, a universal adaptation engine sitting on a network node can adapt the content appropriately to suit the specified needs and preferences of recipients, without knowledge of the specifics of the content, its encoding and/or encryption. In this framework, different adaptation infrastructures are no longer needed for different types of scalable media. In this work, we show how this framework can be used to adapt fully scalable video bit-streams, specifically ones obtained by the fully scalable MC-EZBC video coding system. MC-EZBC uses a 3-D subband/wavelet transform that exploits correlation by filtering along motion trajectories, to obtain a 3-dimensional scalable bit-stream combining temporal, spatial and SNR scalability in a compact bit-stream. Several adaptation use cases are presented to demonstrate the flexibility and advantages of a fully scalable video bit-stream when used in conjunction with a network adaptation engine for transmission.
The radio plays a song that you like but that you do not recognize. How do you find the title and the artist? Previous approaches to finding a song in a database are based on pattern recognition. In some of the previous work features are extracted from a hummed song and decision rules are used to retrieve probable candidates from the database. Feature matching has not resulted in reliable searches from microphone samples. In this work, to find the song, we process a short, microphone recorded sample from it. Both a feature vector and a signal are precomputed for each song in a database and also extracted from the recording. The database songs are first sorted by feature distance to the recording. Then, normalized cross-correlation, even though nonlinear, is applied using overlap-save FFT convolution. A decision rule presents likely matches to the user for confirmation but controls the number of false alarms shown. This system, tested using hundreds of recordings, is reliable because signals are matched. The addition of the feature-ordered search and the decision rule result in database searches five times faster than signal matching alone.
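The matching step can be sketched with FFT-based normalized cross-correlation, as below. A plain `fftconvolve` stands in for the paper's overlap-save implementation, and the local mean of each song window is not removed, so the score is only approximately normalized.

```python
import numpy as np
from scipy.signal import fftconvolve

def normalized_cross_correlation(song: np.ndarray, clip: np.ndarray) -> np.ndarray:
    """Normalised cross-correlation of a short microphone clip against a full
    database song, computed with FFT-based convolution.  A simple fftconvolve
    stands in for the paper's overlap-save implementation."""
    clip = clip - clip.mean()
    clip = clip / (np.linalg.norm(clip) + 1e-12)
    # Correlation == convolution with the time-reversed clip.
    num = fftconvolve(song, clip[::-1], mode="valid")
    # Local energy of each song window, for the normalisation denominator
    # (per-window mean removal is omitted for brevity).
    window = np.ones(clip.size)
    energy = fftconvolve(song.astype(np.float64) ** 2, window, mode="valid")
    return num / np.sqrt(np.maximum(energy, 1e-12))
```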
This paper motivates and develops an end-to-end methodology for representation and adaptation of arbitrary scalable content in a fully content non-specific manner. Scalable bit-streams are naturally organized in a symmetric multi-dimensional logical structure, and any adaptation is essentially a downward manipulation of this structure. Higher logical constructs are defined on top of this multi-tier structure to make the model more generally applicable to a variety of bit-streams involving rich media. The resultant composite model is referred to as the Structured Scalable Meta-format (SSM). Apart from the implicit bit-stream constraints that must be satisfied to make a scalable bit-stream SSM-compliant, two other elements that need to be formalized to build a complete adaptation and delivery infrastructure based on SSM are: a binary or XML description of the structure of the bit-stream resource and how it is to be manipulated to obtain various adapted versions; and a binary or XML specification of outbound constraints derived from capabilities and preferences of receiving terminals. By interpreting the descriptor and the constraint specifications, a universal adaptation engine can adapt the content appropriately to suit the specified needs and preferences of recipients, without knowledge of the specifics of the content, its encoding and/or encryption. With universal adaptation engines, different adaptation infrastructures are no longer needed for different types of scalable media.
There are essential motivations towards an object-based approach to video coding, including the possibility of object-based coding schemes. In this work, we present a region-based video coder which uses a segmentation map obtained from the previous reconstructed frame, thereby eliminating the need to transmit expensive shape information to the decoder. While the inspiration for this work is derived from previous work by Yokoyama et al., there are major differences between our work and the earlier effort in the segmentation scheme employed, the motion model, and the handling of overlapped and uncovered regions. We use an edge-flow-based segmentation scheme, which appears to produce consistent segmentation results over a variety of natural images. Since it combines luminance, chrominance and texture information for image segmentation, it is well suited to segmenting real-world images. For motion compensation, we choose an affine model and use hierarchical region matching for accurate affine parameter estimation. Heuristic techniques are used to eliminate overlapped and uncovered regions after motion compensation. Extensive coding results of our implementation are presented.
There is a growing need for new representations of video that allow not only compact storage of data but also content-based functionalities such as search and manipulation of objects. We present here a prototype system, called NeTra-V, that is currently being developed to address some of these content related issues. The system has a two-stage video processing structure: a global feature extraction and clustering stage, and a local feature extraction and object-based representation stage. Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. The spatio-temporal segmentation scheme combines the color/texture image segmentation and affine motion estimation techniques. Experimental results show that the proposed approach can handle large motion. The output of the segmentation, the alpha plane as it is referred to in the MPEG-4 terminology, can be used to compute local image properties. This local information forms the low-level content description module in our video representation. Experimental results illustrating spatio-temporal segmentation and tracking are provided.