H.264 is an emerging video coding standard that aims to compress high-quality video content at low bit-rates. Although its encoding and decoding processes resemble those of many previous standards, it adds a number of new features and therefore requires much more computation than most existing standards. This complexity poses many challenges to implementing the encoder or decoder in real time in software on personal computers. Even after a 2x to 3x performance improvement from media instructions on modern general-purpose processors and a further 2x to 4x improvement from algorithmic optimization, the H.264 encoder is still too complex to run in real time on a single processor. Based on a detailed analysis of the parallelism available in the H.264 encoder, we propose an efficient multithreaded implementation of the H.264 video encoder. To guarantee enough concurrency across the whole system, we present an elaborate macroblock-level and inter-frame parallel scheduling scheme. In addition, unlike other parallelization schemes, our macroblock-based multithreading scheme incurs almost no loss in video quality. Our results show that the multithreaded encoder obtains a further 3.96x speed-up on a four-processor system, or a 4.6x speed-up on a four-processor system with Hyper-Threading Technology. The techniques demonstrated in this work apply not only to H.264 but also to other video and image coding and decoding applications on personal computers.
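The abstract does not spell out the scheduling scheme, so the following is only a minimal sketch of one common way to expose macroblock-level parallelism: a wavefront order. It assumes each macroblock depends on its left, top-left, top, and top-right neighbours within the frame; the function names `mb_wave_step`, `mb_wave_count`, and `mb_same_wave` are hypothetical and are not taken from the paper.

```c
#include <stdbool.h>

/* Hypothetical helper: wavefront step at which macroblock (x, y) may start.
 * Assumption: a macroblock depends on its left, top-left, top and top-right
 * neighbours, so (x, y) has to wait until (x + 1, y - 1) has finished.
 * Every such neighbour has a strictly smaller step value. */
static int mb_wave_step(int x, int y)
{
    return x + 2 * y;
}

/* Number of sequential wavefront steps needed for a frame that is
 * mb_w macroblocks wide and mb_h macroblocks high. */
static int mb_wave_count(int mb_w, int mb_h)
{
    return mb_wave_step(mb_w - 1, mb_h - 1) + 1;
}

/* True when two distinct macroblocks fall on the same wavefront step and
 * can therefore be handed to different threads at the same time. */
static bool mb_same_wave(int ax, int ay, int bx, int by)
{
    bool same_block = (ax == bx) && (ay == by);
    return !same_block && mb_wave_step(ax, ay) == mb_wave_step(bx, by);
}
```

Under these assumptions, a 720×480 frame (45 × 30 macroblocks of 16×16 pixels) needs 103 wavefront steps instead of 1350 sequential macroblock steps. Because the wavefronts near the frame corners contain few macroblocks, overlapping work from neighbouring frames is one way to keep all threads busy.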
Because emerging video coding standards such as H.264 aim to deliver high-quality video content at low bit-rates, their encoding and decoding processes require much more computation than most existing standards. This paper analyzes the software implementation of a real-time H.264 decoder on general-purpose processors with media instructions. Specifically, we discuss how to optimize the speed of H.264 decoders on Intel Pentium 4 processors. First, we analyze the reference implementation to identify its time-consuming modules. Our study shows that a number of components, e.g., motion compensation and the inverse integer transform, are the most time-consuming modules in the H.264 decoder. Second, we present a list of performance optimization methods that use media instructions to improve the efficiency of these modules. After appropriate optimizations, the decoder speed improves by more than 3x: it decodes a 720×480 video sequence at 48 frames per second on 2.4GHz Intel Pentium 4 processors, compared to the reference software's 12 frames per second. The optimization techniques demonstrated in this paper can also be applied to other video and image processing applications. Finally, after presenting detailed application behavior on general-purpose processors, this paper offers a few recommendations on how to design efficient and powerful future video and image applications and standards with their hardware implications in mind.
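To make one of the named hot spots concrete, the following is a plain scalar sketch of the H.264 4×4 inverse integer transform applied to already dequantized coefficients. It is not the paper's optimized code: the function name `inverse_transform_4x4` is hypothetical, and a media-instruction version would process several rows or columns per instruction rather than looping as shown here.

```c
#include <stdint.h>

/* Scalar sketch of the H.264 4x4 inverse integer transform, in place.
 * Input: dequantized coefficients. Output: residual samples.
 * Assumption: >> on negative values is an arithmetic shift, which holds
 * on mainstream compilers but is implementation-defined in C. */
static void inverse_transform_4x4(int32_t blk[4][4])
{
    /* Horizontal pass: one butterfly per row. */
    for (int i = 0; i < 4; i++) {
        int32_t d0 = blk[i][0], d1 = blk[i][1];
        int32_t d2 = blk[i][2], d3 = blk[i][3];
        int32_t e0 = d0 + d2;
        int32_t e1 = d0 - d2;
        int32_t e2 = (d1 >> 1) - d3;
        int32_t e3 = d1 + (d3 >> 1);
        blk[i][0] = e0 + e3;
        blk[i][1] = e1 + e2;
        blk[i][2] = e1 - e2;
        blk[i][3] = e0 - e3;
    }
    /* Vertical pass: same butterfly per column, then round and scale by 64. */
    for (int j = 0; j < 4; j++) {
        int32_t f0 = blk[0][j], f1 = blk[1][j];
        int32_t f2 = blk[2][j], f3 = blk[3][j];
        int32_t g0 = f0 + f2;
        int32_t g1 = f0 - f2;
        int32_t g2 = (f1 >> 1) - f3;
        int32_t g3 = f1 + (f3 >> 1);
        blk[0][j] = (g0 + g3 + 32) >> 6;
        blk[1][j] = (g1 + g2 + 32) >> 6;
        blk[2][j] = (g1 - g2 + 32) >> 6;
        blk[3][j] = (g0 - g3 + 32) >> 6;
    }
}
```

The transform uses only additions, subtractions, and shifts, which is what makes it a natural target for packed integer media instructions. A block whose only non-zero coefficient is a DC value of 64 yields a residual of 1 in all sixteen positions.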