H.264 is an emerging video coding standard that aims to compress high-quality video content at low bit rates. Although its encoding and decoding processes are similar to those of many previous standards, H.264 includes a number of new features and therefore requires much more computation than most existing standards. This complexity poses significant challenges to implementing the encoder and decoder in real time in software on personal computers. Even after a 2-3x performance improvement from media instructions on modern general-purpose processors and a further 2-4x improvement from algorithmic optimization, the H.264 encoder is still too computationally demanding to run in real time on a single processor. Based on a detailed analysis of the opportunities for parallelism in the H.264 encoder, we propose an efficient multithreaded implementation of the H.264 video encoder. To guarantee sufficient concurrency across the whole system, we present an elaborate scheduling scheme that exploits both macroblock-level and inter-frame parallelism. In addition, in contrast to other parallelization schemes, our macroblock-based multithreading scheme incurs almost no loss in video quality. Our results show that the multithreaded encoder obtains a further 3.96x speed-up on a four-processor system, or 4.6x on a four-processor system with Hyper-Threading Technology. The techniques demonstrated in this work can be applied not only to H.264 but also to other video/image encoding and decoding applications on personal computers.
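Macroblock-level parallelism within a frame is limited by the dependencies between neighbouring macroblocks. The sketch below is a generic "wavefront" illustration, not the scheduling scheme described above (which also exploits inter-frame parallelism). It assumes that macroblock (x, y) depends on its left, top-left, top and top-right neighbours, as in H.264 intra prediction and motion-vector prediction, and groups macroblocks into waves whose members could be encoded concurrently. The function name `wavefront_schedule` is hypothetical.

```python
def wavefront_schedule(mb_cols, mb_rows):
    """Group macroblocks into waves that can be processed concurrently.

    Assumption: macroblock (x, y) depends on (x-1, y), (x-1, y-1), (x, y-1)
    and (x+1, y-1). Each macroblock is placed one step after the latest of
    its dependencies, so all macroblocks in the same wave are independent.
    """
    step = [[0] * mb_cols for _ in range(mb_rows)]
    for y in range(mb_rows):
        for x in range(mb_cols):
            deps = []
            if x > 0:
                deps.append(step[y][x - 1])          # left
            if y > 0:
                deps.append(step[y - 1][x])          # top
                if x > 0:
                    deps.append(step[y - 1][x - 1])  # top-left
                if x + 1 < mb_cols:
                    deps.append(step[y - 1][x + 1])  # top-right
            step[y][x] = max(deps) + 1 if deps else 0

    # Collect macroblocks by wave index (row-major order within each wave).
    waves = {}
    for y in range(mb_rows):
        for x in range(mb_cols):
            waves.setdefault(step[y][x], []).append((x, y))
    return [waves[s] for s in sorted(waves)]


if __name__ == "__main__":
    for i, wave in enumerate(wavefront_schedule(4, 3)):
        print(i, wave)
```

Under these dependencies, macroblock (x, y) lands in wave x + 2y, so a frame of W×H macroblocks needs W + 2(H - 1) waves and at most about min(W/2, H) macroblocks are ready at once. This bound is one reason a scheme that targets four or more processors may also need parallelism across frames.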
Because emerging video coding standards such as H.264 target high-quality video at low bit rates, their encoding and decoding processes require much more computation than most existing standards. This paper analyzes the software implementation of a real-time H.264 decoder on general-purpose processors with media instructions. Specifically, we discuss how to optimize the speed of H.264 decoders on Intel Pentium 4 processors. This paper first analyzes the reference implementation to identify the time-consuming modules. Our study shows that several components, such as motion compensation and the inverse integer transform, are the most time-consuming modules in the H.264 decoder. Second, we present a list of performance optimization methods that use media instructions to improve the efficiency of these modules. After appropriate optimizations, the decoder runs more than 3x faster: it decodes a 720×480 video sequence at 48 frames per second on a 2.4 GHz Intel Pentium 4 processor, compared with 12 frames per second for the reference software. The optimization techniques demonstrated in this paper can also be applied to other video/image processing applications. Finally, after presenting detailed application behavior on general-purpose processors, this paper offers a few recommendations, in light of these hardware implications, for designing efficient and powerful future video/image applications and standards.
In this paper, we present the architecture of a resource management and adaptation framework that goes beyond existing peer-to-peer and content delivery infrastructures to accommodate and accelerate multimedia peer applications and services. We propose key technology components that enable seamless adaptation of resources to enhance quality of service and that support building better tools and applications that exploit the underlying power of the peer-computing network. We also present a prototype system that integrates these components, together with sample applications that can be built on the proposed infrastructure.
The aim of this paper is to analyze the computational requirements of video watermarking algorithms running on PC-based systems and to study their implications for the design of general-purpose processors and systems. Selected watermarking algorithms are analyzed from a computational point of view. Application examples are executed on current general-purpose processor architectures to understand the computational requirements and to detect potential bottlenecks. In addition to this workload analysis, we evaluate the potential exploitation of data-level parallelism through the SIMD instructions available on current architectures. Thread-level parallelism in current watermarking schemes is also studied to understand the potential benefit of simultaneous multithreading processors and symmetric multiprocessor systems for such applications. Although studying the individual watermarking algorithms is crucial to understanding the requirements of a system, it is not sufficient. Watermarking is very often only one kernel in a complete application, and the interaction between the watermarking kernel and the rest of the application can strongly influence the computational and memory bandwidth requirements of the system. Therefore, watermark detection in a video decoder is used as an example to understand the additional system implications of merging video decoding and watermarking algorithms.
This work discusses implementation issues of real-time video/image/signal processing applications on personal computers. We give a list of performance optimization guidelines and illustrate them by optimizing our video watermark detection scheme. In many applications, watermarking technology must have (1) the ability to be implemented at low cost, (2) robustness against common image processing operations, and (3) resilience against purely malicious attacks. Many works, including ours, have demonstrated watermark robustness and invisibility. This work demonstrates that, after some performance optimizations, we can decode a 704×480 MPEG-2 video and detect the watermarks, both in software, and display the decoded video frames in real time on an Intel® Pentium® III 500 MHz system. Watermark detection currently adds only 10.5% overhead to video decoding. The cost of our optimized implementation is 43% lower than that of the unoptimized version. The optimization techniques demonstrated in this work can be applied to other watermarking schemes and other video/image/signal processing applications.
In this paper, we use true motion vectors for better error concealment. Error concealment in video is intended to recover losses caused by channel noise by utilizing the available picture information. Our work does not change the bitstream syntax, so no additional bits are required. This work focuses on improving error concealment using transmitted true motion vectors. That is, we propose a "true" motion estimation at the encoder together with a post-processing error concealment scheme that exploits motion interpolation at the decoder. Given the locations of the lost regions and various temporal error concealment techniques, we demonstrate that our true motion vectors perform better than the motion vectors found by minimal-residue block matching. Additionally, we propose a new error concealment technique that improves reconstruction quality when the previous frame has been heavily damaged. It has been observed that, when a frame is heavily damaged, better predictions can be made from the past reference frame than from the damaged current reference frame. This is accomplished by extrapolating the decoded motion vectors so that they correspond to the past reference frame.
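The motion-vector extrapolation step can be sketched as follows, under assumptions that are ours rather than the paper's: motion is constant (linear) over the interval, vectors are in integer-pel units, a vector (dx, dy) means the block at (x, y) is predicted from (x + dx, y + dy) in the reference frame, and the past reference frame is two frames back. Scaling by the ratio of temporal distances then simply doubles the vector. Both function names are hypothetical, and out-of-frame coordinates are clamped to the border.

```python
def extrapolate_mv(mv, cur_dist=1, new_dist=2):
    """Scale a motion vector spanning cur_dist frames so it spans new_dist frames.

    Assumes constant-velocity motion; with the defaults the vector is doubled.
    """
    dx, dy = mv
    return (round(dx * new_dist / cur_dist), round(dy * new_dist / cur_dist))


def conceal_block(past_frame, x0, y0, size, mv_prev):
    """Rebuild a lost size x size block at (x0, y0) from the past reference frame.

    mv_prev is the decoded integer-pel vector that pointed into the damaged
    previous frame; it is extrapolated to reach the frame two steps back.
    past_frame is a list of rows of pixel values.
    """
    dx, dy = extrapolate_mv(mv_prev)
    h, w = len(past_frame), len(past_frame[0])
    block = []
    for y in range(y0, y0 + size):
        row = []
        for x in range(x0, x0 + size):
            # Clamp the source position so it stays inside the frame.
            sy = min(max(y + dy, 0), h - 1)
            sx = min(max(x + dx, 0), w - 1)
            row.append(past_frame[sy][sx])
        block.append(row)
    return block


if __name__ == "__main__":
    past = [[10 * y + x for x in range(8)] for y in range(8)]
    print(conceal_block(past, 2, 2, 2, (1, 0)))
```

A real decoder would work with fractional-pel vectors and interpolate the reference samples; the integer-pel copy here only illustrates how the decoded vector is redirected to the older, undamaged reference frame.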