The DM642 is a next-generation multimedia processor with a powerful C64x DSP core and a rich set of peripherals to meet the requirements of various video applications. The set-top box (STB) is one of the most widely used applications in the audio-video broadcast arena, and the MPEG-2 transport demultiplexer is the core front-end module of an STB application. This paper presents an optimized, fully programmable demultiplexer architecture and its software implementation on the DM642. The architecture fully utilizes the CPU power and the support available from the peripherals, including PCR clock recovery, which is critical for the entire STB application. The data flow and control flow are tuned to minimize system overhead by reducing the data bandwidth requirement and improving cache performance. The paper also describes techniques for parsing the data efficiently by leveraging the 32-bit instructions and 64-bit load/store data accesses provided by the advanced C64x architecture. Benchmarks of the demultiplexer on a few typical transport streams are presented at the end.
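To illustrate the kind of word-oriented parsing mentioned above, the following is a minimal sketch in portable C, not the paper's implementation. It assembles the 4-byte MPEG-2 transport packet header into one 32-bit word and extracts the fields with shifts and masks instead of byte-by-byte tests. The names `ts_header_t` and `ts_parse_header` are hypothetical. The field layout follows the standard transport packet header. On a C64x device the word would presumably be fetched with a single load, which this sketch does not attempt to show.

```c
#include <stdint.h>

#define TS_PACKET_SIZE 188   /* fixed transport packet length in bytes */
#define TS_SYNC_BYTE   0x47  /* first byte of every transport packet */

/* Hypothetical container for the header fields used by a demultiplexer. */
typedef struct {
    unsigned transport_error;          /* 1 bit  */
    unsigned payload_unit_start;       /* 1 bit  */
    unsigned pid;                      /* 13 bits */
    unsigned adaptation_field_control; /* 2 bits */
    unsigned continuity_counter;       /* 4 bits */
} ts_header_t;

/* Parse the 4-byte header at pkt. Returns 0 on success, -1 on sync loss. */
static int ts_parse_header(const uint8_t *pkt, ts_header_t *h)
{
    /* Build one big-endian 32-bit word so every field is a shift and mask. */
    uint32_t w = ((uint32_t)pkt[0] << 24) | ((uint32_t)pkt[1] << 16) |
                 ((uint32_t)pkt[2] << 8)  |  (uint32_t)pkt[3];

    if ((w >> 24) != TS_SYNC_BYTE)
        return -1;

    h->transport_error          = (w >> 23) & 0x1;
    h->payload_unit_start       = (w >> 22) & 0x1;
    h->pid                      = (w >> 8)  & 0x1FFF;
    h->adaptation_field_control = (w >> 4)  & 0x3;
    h->continuity_counter       =  w        & 0xF;
    return 0;
}
```

A demultiplexer would typically call such a routine once per 188-byte packet and use the `pid` field to route the payload to the matching audio, video, or section filter.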
The TMS320C64x is a generation of high-speed DSPs with a rich instruction set and an efficient memory system for multimedia processing. Digital video decoding is one of the key applications in multimedia processing. It is computationally intensive and requires high bandwidth to external memory and an efficient DMA engine. Reference models for video decoders typically follow a simple data flow that operates sequentially on one macroblock (MB) at a time. This structure leads to inefficiencies in real-time implementations, including less-than-optimal utilization of program caches and DMA bandwidth. These issues become more significant on high-performance devices such as the C64x DSP, because the CPU efficiency and high clock rate allow the core processing to run much faster than on other processors, while the bandwidth to external memory has not increased at the same rate as the processing performance. Unless the system data flow is carefully designed, I/O bandwidth rather than processing can become the performance bottleneck. This paper describes an optimized flow for MPEG-2 decoding that processes multiple blocks at a time to obtain optimum cache performance and DMA bandwidth efficiency. With this approach, system overhead is reduced from as high as 100% for worst-case B frames with the conventional flow to less than 20%.
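The difference between the two flows can be sketched as a change in loop order. The C sketch below is only an illustration under stated assumptions, not the paper's decoder. The three stage names (VLD, IDCT, MC) and the batch size are assumptions, and `run_stage` does no real decoding: it only counts how often the active stage changes, as a rough stand-in for how often a different stage's code has to be brought back into the program cache.

```c
#include <stddef.h>

/* Assumed decoder stages; the source does not name them. */
enum { STAGE_NONE, STAGE_VLD, STAGE_IDCT, STAGE_MC };

typedef struct {
    int      last_stage; /* stage whose code ran most recently */
    unsigned switches;   /* number of times the active stage changed */
} flow_stats_t;

/* Placeholder for one stage's work on one macroblock; only counts switches. */
static void run_stage(flow_stats_t *s, int stage)
{
    if (s->last_stage != stage) {
        s->switches++;
        s->last_stage = stage;
    }
}

/* Conventional flow: every stage runs on one macroblock before the next MB. */
static unsigned decode_per_mb(size_t num_mbs)
{
    flow_stats_t s = { STAGE_NONE, 0 };
    for (size_t i = 0; i < num_mbs; i++) {
        run_stage(&s, STAGE_VLD);
        run_stage(&s, STAGE_IDCT);
        run_stage(&s, STAGE_MC);
    }
    return s.switches;
}

/* Batched flow: each stage runs over a whole batch of macroblocks.
   batch must be greater than zero. */
static unsigned decode_batched(size_t num_mbs, size_t batch)
{
    flow_stats_t s = { STAGE_NONE, 0 };
    for (size_t start = 0; start < num_mbs; start += batch) {
        size_t end = (start + batch < num_mbs) ? start + batch : num_mbs;
        for (size_t i = start; i < end; i++) run_stage(&s, STAGE_VLD);
        for (size_t i = start; i < end; i++) run_stage(&s, STAGE_IDCT);
        for (size_t i = start; i < end; i++) run_stage(&s, STAGE_MC);
    }
    return s.switches;
}
```

For 16 macroblocks, the per-macroblock loop changes stage 48 times, while the batched loop with a batch of 8 changes stage 6 times. In a real decoder the batch size would be bounded by the on-chip memory available for intermediate data.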