Emerging video coding standards such as H.264 aim to deliver high-quality video content at low bit rates, so their encoding and decoding processes require much more computation than most existing standards. This paper analyzes the software implementation of a real-time H.264 decoder on general-purpose processors with media instructions. Specifically, we discuss how to optimize the speed of H.264 decoders on Intel Pentium 4 processors. We first analyze the reference implementation to identify its time-consuming modules. Our study shows that a small number of components, e.g., motion compensation and the inverse integer transform, account for most of the decoding time. Second, we present a list of performance optimization methods that use media instructions to improve the efficiency of these modules. After appropriate optimizations, the decoder speed improved by more than 3x: it can decode a 720×480 video sequence at 48 frames per second on a 2.4 GHz Intel Pentium 4 processor, compared with 12 frames per second for the reference software. The optimization techniques demonstrated in this paper can also be applied to other video and image processing applications. Additionally, after presenting detailed application behavior on general-purpose processors, this paper offers a few recommendations on how to design efficient future video and image applications and standards with their hardware implications in mind.
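As a rough illustration of the kind of kernel discussed here, the following is a minimal scalar sketch of the 4×4 inverse integer transform used in H.264, written in plain C. It is not the paper's optimized code; the function names `inverse_1d` and `inverse_transform_4x4` are hypothetical. The same add, subtract, and shift butterfly is applied to every row and then to every column, and this regularity is what makes such a module a natural candidate for media (SIMD) instructions.

```c
#include <assert.h>
#include <stdint.h>

/* One 1-D pass of the 4-point inverse transform butterfly.
 * Only additions, subtractions, and shifts are needed.
 * Note: this assumes >> on negative values is an arithmetic shift,
 * which holds on common compilers but is implementation-defined in C. */
static void inverse_1d(const int32_t in[4], int32_t out[4])
{
    int32_t e0 = in[0] + in[2];
    int32_t e1 = in[0] - in[2];
    int32_t e2 = (in[1] >> 1) - in[3];
    int32_t e3 = in[1] + (in[3] >> 1);

    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}

/* In-place 4x4 inverse transform of dequantized coefficients
 * (hypothetical helper name). The result is the residual block,
 * rounded with (x + 32) >> 6. */
void inverse_transform_4x4(int16_t block[4][4])
{
    int32_t tmp[4][4];
    int32_t in[4], out[4];
    int r, c;

    assert(block != 0);

    /* Horizontal pass: transform each row. */
    for (r = 0; r < 4; r++) {
        for (c = 0; c < 4; c++)
            in[c] = block[r][c];
        inverse_1d(in, tmp[r]);
    }

    /* Vertical pass: transform each column, then round and scale. */
    for (c = 0; c < 4; c++) {
        for (r = 0; r < 4; r++)
            in[r] = tmp[r][c];
        inverse_1d(in, out);
        for (r = 0; r < 4; r++)
            block[r][c] = (int16_t)((out[r] + 32) >> 6);
    }
}
```

In this scalar form the four rows (and then the four columns) are processed one after another with identical operations. A SIMD version would typically hold several 16-bit values in one register and run the butterfly on all of them at once, although the exact instruction selection is beyond what this sketch shows.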