This paper describes MediaDSP3200, a single embedded media processor core fabricated in a six-layer-metal 0.18 μm CMOS process, which implements a RISC instruction set, a DSP data-processing instruction set, and a single-instruction-multiple-data (SIMD) multimedia-enhanced instruction set. MediaDSP3200 tightly fuses the RISC architecture with DSP computation capability, providing fundamental RISC, extended DSP, and SIMD instruction sets with various addressing modes in a unified pipeline architecture. These characteristics greatly enhance the system's digital signal processing performance. The test processor achieves 320 MOPS for 32x32-bit multiply-accumulate (MAC) operations and 1280 MOPS for 16x16-bit MAC operations. It dissipates 600 mW at 1.8 V and 320 MHz. The implementation primarily uses a standard-cell logic design style. MediaDSP3200 targets diverse embedded applications that need both powerful processing and control capability and a low cost, e.g. set-top boxes, video conferencing, and DTV. The paper describes the MediaDSP3200 instruction set architecture, addressing modes, pipeline design, SIMD features, split ALU, and MAC unit. Finally, performance benchmarks based on H.264 and MPEG decoder algorithms are given.
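As a rough cross-check of the throughput and power figures quoted above, the following sketch derives MAC operations per clock cycle and energy per MAC. It assumes that MOPS counts millions of MAC operations per second, that both throughput figures are peak values at 320 MHz, and that the 600 mW figure applies at that same operating point; none of this is stated explicitly in the abstract. The function names are hypothetical.

```python
def ops_per_cycle(mops: float, clock_mhz: float) -> float:
    """Operations completed per clock cycle (MOPS / MHz)."""
    return mops / clock_mhz


def energy_per_op_nj(power_mw: float, mops: float) -> float:
    """Energy per operation in nanojoules.

    mW / MOPS = (1e-3 J/s) / (1e6 op/s) = 1e-9 J/op, i.e. nJ per operation.
    """
    return power_mw / mops


if __name__ == "__main__":
    # Figures taken from the abstract; their interpretation is an assumption.
    print(ops_per_cycle(320, 320))      # 1.0  (32x32-bit MAC per cycle)
    print(ops_per_cycle(1280, 320))     # 4.0  (16x16-bit MACs per cycle)
    print(energy_per_op_nj(600, 320))   # 1.875 nJ per 32x32-bit MAC
    print(energy_per_op_nj(600, 1280))  # 0.46875 nJ per 16x16-bit MAC
```

Under these assumptions, the figures are consistent with one 32x32-bit MAC or four 16x16-bit MACs issued per cycle.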
To accelerate media processing, many media enhancement instructions have been adopted into the instruction sets of embedded processors. This paper proposes a novel method, called interaction between instructions and algorithms (IIA), to optimize these media enhancement instructions. Based on an analysis of the inherent characteristics of video processing algorithms and of the processor's architecture, three measures are proposed: three single-cycle bit-level manipulation instructions are implemented to speed up variable-length decoding (VLD); a data path is designed to resolve data misalignment in SIMD processing in hardware rather than in software; and a memory architecture is proposed to support 128-bit-word parallel processing. All of these measures are applied in the optimization of an embedded processor, MediaDSP3200, which tightly fuses the RISC architecture with DSP computation capability and provides reduced-instruction and 64-bit SIMD instruction sets with various addressing modes in a unified RISC pipeline architecture. Simulation results show that this optimization method reduces clock cycles by more than 26.4% for VLD, 47.8% for IDCT, and 66.8% for MC in real-time processing.
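To illustrate the data misalignment problem mentioned above, the sketch below shows the kind of software workaround that a dedicated data path would make unnecessary: a 64-bit SIMD operand at an unaligned address is assembled from two aligned 64-bit loads with shifts and an OR. This is a generic illustration, not MediaDSP3200's actual mechanism; little-endian byte order, the 8-byte alignment granularity, and all function names are assumptions.

```python
MASK64 = (1 << 64) - 1


def load_aligned64(mem: bytes, addr: int) -> int:
    """Model an aligned 64-bit load (little-endian byte order assumed)."""
    assert addr % 8 == 0, "aligned loads require an 8-byte-aligned address"
    return int.from_bytes(mem[addr:addr + 8], "little")


def load_unaligned64_sw(mem: bytes, addr: int) -> int:
    """Software emulation of an unaligned 64-bit load.

    Costs two aligned loads, two shifts and an OR, which is the overhead
    a hardware alignment path would remove.
    """
    base = addr & ~7          # aligned address at or below addr
    shift = (addr & 7) * 8    # misalignment in bits
    lo = load_aligned64(mem, base)
    if shift == 0:
        return lo             # already aligned, one load suffices
    hi = load_aligned64(mem, base + 8)
    return ((lo >> shift) | (hi << (64 - shift))) & MASK64


if __name__ == "__main__":
    mem = bytes(range(32))
    # Bytes 3..10 of memory, packed little-endian into one 64-bit word.
    print(hex(load_unaligned64_sw(mem, 3)))
```

In motion compensation, reference blocks can start at arbitrary byte offsets, so this extra shift-and-merge work would otherwise sit on the inner loop.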
Media processing such as real-time compression and decompression of video signals is now expected to be the driving force in the evolution of media processors. This paper introduces a hardware and software co-design approach for a 32-bit media processor, MediaDsp3201 (MD32 for short), which is implemented in a TSMC 0.18 μm process, runs at 200 MHz, and can achieve 200 million multiply-accumulate (MAC) operations per second. In our design, we have merged RISC and DSP into one processor (RISC/DSP). Based on an analysis of the inherent characteristics of video processing algorithms, media enhancement instructions are adopted into MD32's instruction set. The media extension instructions are physically realized in the processor core and improve video processing performance effectively at negligible additional hardware cost (2.7%). Because the media instructions involve highly complex operations, a technique named scalable super pipeline is used to resolve the timing delay of the pipeline stages (mainly the EX stage). Simulation results show that our method reduces the instruction count for IDCT by more than 31% and 23% compared with the MMX and SSE implementations, respectively, and by 40% for MC compared with the MMX implementation.
This paper proposes a pipelining and bypassing unit (BPU) design method for our 32-bit RISC/DSP processor, MediaDsp3201 (MD32 for short). MD32 is implemented in 0.18 μm technology at 1.8 V with a 200 MHz working clock and can achieve 200 million multiply-accumulate (MAC) operations per second. It tightly merges the RISC architecture with DSP computation capability, providing fundamental RISC, extended DSP, and single-instruction-multiple-data (SIMD) instruction sets with various addressing modes in a unified, customized DSP pipeline architecture. We first describe the pipeline structure of MD32 and compare it with a typical RISC-style pipeline structure. We then study the validity of two bypassing schemes in terms of their effectiveness in resolving pipeline data hazards: the centralized and distributed BPU design strategies (CBPU and DBPU). A bypassing circuit chain model is given for DBPU, in which register read is placed only at the ID pipeline stage. Because the processor's working clock is determined by the pipeline delay and the BPU contains a long serial path of combinational logic, the optimization of the serial priority-select circuit is also analyzed in detail. Finally, the performance improvement is analyzed.
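The serial select with priority described above can be sketched behaviorally as follows. This is a generic textbook-style model of operand bypassing, not MD32's actual BPU: the number and order of producer stages, the `StageResult` type, and the `bypass_select` function are all hypothetical. The operand is taken from the youngest in-flight instruction that writes the requested register, and from the register file only if no stage matches.

```python
from typing import NamedTuple, Optional, Sequence


class StageResult(NamedTuple):
    dest: Optional[int]  # destination register index, or None if nothing is written
    value: int           # result available for forwarding from this stage


def bypass_select(src: int,
                  regfile: Sequence[int],
                  stages: Sequence[StageResult]) -> int:
    """Return the operand value for register `src`.

    `stages` is ordered youngest producer first, so the first match wins.
    Each loop iteration corresponds to one level of the serial priority
    chain, which is why the chain length affects the critical path.
    """
    for stage in stages:
        if stage.dest == src:
            return stage.value   # forward from the youngest matching stage
    return regfile[src]          # no hazard: use the register file value


if __name__ == "__main__":
    regfile = [0, 10, 20, 30]
    # Two in-flight writers of r2; the younger one (value 99) must win.
    stages = [StageResult(2, 99), StageResult(2, 55), StageResult(None, 0)]
    print(bypass_select(2, regfile, stages))  # 99
    print(bypass_select(1, regfile, stages))  # 10
```

In hardware, this loop corresponds to a chain of comparators and multiplexers evaluated in one cycle, so flattening or reordering the chain is one plausible target for the circuit optimization the abstract mentions.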