Implementing processors for embedded systems raises several issues: the main constraints are cost, power dissipation, and die area. On the other hand, new terminals perform functions that require more computational flexibility and effort. Long code streams must be loaded into memories, which are expensive and power-hungry, to run on DSPs or CPUs. To address this issue, the proprietary “SlimCode” algorithm presented in this paper (patent-pending technology) reduces the size of the program memory. It runs offline and works directly on the binary code generated by the compiler, compressing it into a new binary file, about 40% smaller than the original, that is loaded into the program memory of the processor. The decompression unit is a small ASIC placed between the memory controller and the system bus of the processor. The internal CPU architecture is left unchanged, so the methodology is completely transparent to the core.
We present comparisons with the state-of-the-art IBM Codepack algorithm, along with the architectural implementation of SlimCode in the ST200 VLIW core family.
The DM642 is a next-generation multimedia processor with a powerful C64x DSP core and a rich set of peripherals to meet the requirements of various video applications. The set-top box (STB) is one of the most widely used applications in the audio-video broadcast arena. The MPEG-2 transport demultiplexer is the core front-end module in the STB application. This paper presents an optimized, fully programmable demultiplexer architecture and its software implementation on the DM642. The architecture fully utilizes the CPU power and the support available from the peripherals. This support includes PCR clock recovery, which is critical for the entire STB application. The data flow and the control flow are tuned to minimize system overheads by reducing the data bandwidth requirement and enhancing cache performance. The paper also describes techniques for parsing the data efficiently by leveraging the 32-bit instructions and 64-bit load/store data accesses provided by the advanced C64x architecture. Benchmarks of the demultiplexer with a few typical transport streams are presented at the end.
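As an illustration of the parsing step described above, the following minimal sketch reads the 4-byte MPEG-2 transport packet header as a single 32-bit word and extracts the PCR from the adaptation field. The field layout follows the MPEG-2 Systems standard; the function and type names are hypothetical, and the sketch does not reproduce the paper's DM642 implementation.

```python
from typing import NamedTuple, Optional

TS_PACKET_SIZE = 188  # fixed transport packet length in bytes
SYNC_BYTE = 0x47      # first byte of every transport packet


class TsHeader(NamedTuple):
    transport_error: bool
    payload_unit_start: bool
    pid: int
    adaptation_field_control: int
    continuity_counter: int


def parse_ts_header(packet: bytes) -> TsHeader:
    """Decode the 4-byte header from one 32-bit big-endian word."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid transport packet")
    word = int.from_bytes(packet[:4], "big")
    return TsHeader(
        transport_error=bool((word >> 23) & 1),
        payload_unit_start=bool((word >> 22) & 1),
        pid=(word >> 8) & 0x1FFF,               # 13-bit packet identifier
        adaptation_field_control=(word >> 4) & 0x3,
        continuity_counter=word & 0xF,
    )


def extract_pcr(packet: bytes) -> Optional[int]:
    """Return the PCR in 27 MHz ticks, or None if the packet carries none."""
    header = parse_ts_header(packet)
    if not header.adaptation_field_control & 0x2:
        return None                              # no adaptation field
    adaptation_length = packet[4]
    if adaptation_length < 7 or not packet[5] & 0x10:
        return None                              # PCR flag not set
    raw = int.from_bytes(packet[6:12], "big")    # 33-bit base, 6 reserved, 9-bit ext
    return (raw >> 15) * 300 + (raw & 0x1FF)
```

Reading the header as one word mirrors the idea of using wide loads to cut the number of memory accesses per packet; on a real DSP the same shifts and masks would operate on a register loaded in one access.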
Associative processors have become popular because of their ability to perform massively parallel operations. Their use for MPEG-4/H.263 video coding, in particular, has been found to yield low power consumption. However, they lack the ability to perform computationally intensive block transforms. This paper discusses the requirements of video processing and shows why associative processors are better suited to video coding than RISC architectures. We highlight the drawbacks of using associative processors for video coding and propose a new enhancement to the architecture, based on distributed arithmetic, that provides greater flexibility in implementing video coding algorithms. These modifications speed up computation of the DCT, and simulations of the proposed enhancement show that an MPEG-4 simple profile encoder can be implemented in less than 10 MIPS.
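The general idea of distributed arithmetic can be sketched as follows: an inner product with fixed coefficients (such as one DCT output) is computed bit-serially, replacing multiplications with lookups into a table of precomputed coefficient sums. This is a generic textbook formulation with hypothetical function names and integer (fixed-point) coefficients assumed; it is not the specific architectural enhancement proposed in the paper.

```python
def build_da_lut(coeffs):
    """Entry `addr` holds the sum of the coefficients whose bit is set in `addr`."""
    count = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << count)
    ]


def da_inner_product(lut, samples, bits=16):
    """Compute sum(coeffs[k] * samples[k]) with table lookups, shifts and adds.

    Samples are treated as `bits`-wide two's complement integers.
    """
    mask = (1 << bits) - 1
    words = [x & mask for x in samples]
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into one table address.
        addr = 0
        for k, word in enumerate(words):
            addr |= ((word >> j) & 1) << k
        partial = lut[addr] * (1 << j)
        # The most significant bit carries negative weight in two's complement.
        if j == bits - 1:
            acc -= partial
        else:
            acc += partial
    return acc
```

For example, with coefficients `[3, -2, 5, 7]` and samples `[10, -4, 0, -1]`, `da_inner_product(build_da_lut([3, -2, 5, 7]), [10, -4, 0, -1], bits=8)` returns 31, the same value as the direct sum of products.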
MPEG Layer III (MP3) is a widely used audio coding standard. It involves several complex coding techniques, which makes it difficult to create an efficient architecture design. Variable length decoding (VLD), e.g., Huffman decoding, is an important part of MP3 and requires a large number of search and memory access operations. In this paper, a data-driven variable length decoding algorithm is presented that exploits the signal statistics of variable length codes to reduce power, together with a two-level table lookup method. The decoder was designed for simplicity, low cost, and low power consumption while retaining high efficiency. The total power saving is about 67%.
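A two-level table lookup for variable length decoding can be sketched as below, assuming a generic scheme: short (statistically frequent) codes are resolved by a small first-level table, and only longer codes require a second lookup. The table layout, function names, and the toy code book are illustrative assumptions; the paper's actual tables and MP3 code books are not reproduced here.

```python
def build_tables(codes, first_bits):
    """codes maps symbol -> prefix-free bit string, e.g. {"a": "0", "b": "10"}."""
    max_len = max(len(code) for code in codes.values())
    second_bits = max(0, max_len - first_bits)
    first = [None] * (1 << first_bits)
    second = {}
    for symbol, code in codes.items():
        length = len(code)
        if length <= first_bits:
            # Short code: replicate over every index that starts with it.
            base = int(code, 2) << (first_bits - length)
            for fill in range(1 << (first_bits - length)):
                first[base + fill] = (False, symbol, length)
        else:
            # Long code: the first-level entry points to a second-level table.
            prefix = int(code[:first_bits], 2)
            table = second.setdefault(prefix, [None] * (1 << second_bits))
            first[prefix] = (True, prefix, 0)
            rest = code[first_bits:]
            base = int(rest, 2) << (second_bits - len(rest))
            for fill in range(1 << (second_bits - len(rest))):
                table[base + fill] = (symbol, length)
    return first, second, second_bits


def decode(bits, first, second, first_bits, second_bits):
    """Decode a bit string into a list of symbols."""
    padded = bits + "0" * (first_bits + second_bits)
    symbols, pos = [], 0
    while pos < len(bits):
        entry = first[int(padded[pos:pos + first_bits], 2)]
        if entry is None:
            raise ValueError("invalid code word")
        is_pointer, value, length = entry
        if is_pointer:
            start = pos + first_bits
            entry = second[value][int(padded[start:start + second_bits], 2)]
            if entry is None:
                raise ValueError("invalid code word")
            value, length = entry
        symbols.append(value)
        pos += length
    return symbols
```

With a 2-bit first level, a code book such as `{"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}` resolves `a` and `b` in a single access, so the second table is touched only for the rarer long codes.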
This paper presents a VLSI architecture and an efficient implementation of an embedded transform coprocessor for the H.264 video compression standard. The proposed coprocessor was designed to work with an ARM946E-S processor. To enhance performance, both data parallelism and a pipelined architecture are used in the design. In this study, the coprocessor was synthesized in 0.18 μm CMOS technology, and its footprint is only 0.0838 mm². The coprocessor can calculate the 2-D transform for a macroblock in 30 clock cycles. It dissipates 529 μW with a 1.55 V power supply at a 10 MHz clock rate.
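For reference, the 4x4 forward core transform used in H.264 can be computed with additions, subtractions, and shifts only. The sketch below assumes this is the transform in question and shows the standard row-column butterfly decomposition in software; the function names are hypothetical and the coprocessor's internal datapath is not described in the abstract.

```python
def butterfly4(a, b, c, d):
    """1-D H.264 forward core transform of four samples (no multipliers needed)."""
    s0, s1 = a + d, b + c
    s2, s3 = b - c, a - d
    return (s0 + s1, 2 * s3 + s2, s0 - s1, s3 - 2 * s2)


def forward_transform_4x4(block):
    """2-D transform Y = C X C^T, applied to rows first and then to columns."""
    rows = [butterfly4(*row) for row in block]
    cols = [butterfly4(*col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

The four rows (and then the four columns) are independent of each other, which is what makes this computation a natural fit for data-parallel and pipelined hardware.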
In this paper, a JPEG2000 encoder with a fast Embedded Block Coding with Optimized Truncation (EBCOT) algorithm is implemented on the Philips TriMedia TM-1300. EBCOT is the most important technology in the latest image coding standard, JPEG2000. Our aim is to combine the advantages of the fast EBCOT algorithm and the DSP. A fast EBCOT algorithm for JPEG2000 can benefit imaging and commercial applications, so the feasibility and cost of implementation are the key issues. The proposed design can be implemented quickly on the TM-1300 platform, which greatly reduces design time and cost. The fast algorithm used in the EBCOT context model reduces the clock cycle count to 32% to 38% of that of the original algorithm.
A principal challenge in reducing the cost of designing complex systems-on-chip is to pursue more generic systems for a broad range of products. For this purpose, we explore three new architectural concepts for state-of-the-art video applications. First, we discuss a reusable, scalable hardware architecture employing a hierarchical communication network that fits the natural hierarchy of the application. In a case study, we show that MPEG streaming in DTV occurs at a high level, while subsystems communicate at lower levels. The second concept is a software design that scales over a number of processors to enable reuse over a range of VLSI process technologies. We explore this via an H.264 decoder implementation that scales nearly linearly over up to eight processors by applying data partitioning. The third topic is resource scalability, which is required to satisfy real-time constraints in a system with a large number of shared resources. An example complexity-scalable MPEG-2 coder scales the required cycle budget by a factor of three, with a smooth degradation of quality.
Video compression is a critical component of many multimedia applications available today. The interest in multimedia has generated a lot of research on video coding in academia and industry alike, and several successful standards have emerged, e.g., ITU-T H.261, H.263, ISO/IEC MPEG-1, MPEG-2, and MPEG-4. Transform coding is used by all video standards today. The Discrete Cosine Transform (DCT) is the most popular transform for video coding and, in fact, is used in all current video coding standards. We present scalable architectures for the DCT that allow the complexity to be adjusted to the application under consideration. The range of possible architectures includes sequential and parallel processing of the transform butterflies at each stage.
As wireless video products evolve, they demand more sophisticated processing at higher resolutions and frame rates. Computational performance and energy efficiency have become critical design issues. This paper presents the Quantized Color Pack eXtension (QCPX) combined with a loop unrolling (LU) technique to improve the execution performance and energy efficiency of color image and video processing applications. QCPX applied to a 32-bit datapath processor supports parallel operations on two packed 16-bit YCbCr (Y: luminance, Cr and Cb: chrominance) color pixels, providing greater subword-level parallelism by increasing the number of smaller color pixels packed into a word. Instruction-level parallelism can be further enhanced through loop unrolling. These techniques provide greater performance and efficiency for multimedia workloads on mobile systems. Experimental results on a set of media benchmark applications indicate that the LU plus QCPX-optimized version achieves a speedup ranging from 3.8 to 7.9 while reducing energy consumption by 76% to 87% relative to the baseline version on identically configured, dynamically scheduled ILP superscalar processors. The LU plus QCPX-optimized version also outperforms the LU plus MDMX-like (MIPS's multimedia extension) version.
Motion estimation and compensation is a key component of video processing. Motion estimation is necessary for high-quality compression. It is also a key component in archive video restoration and motion picture post-production, two applications that usually require very accurate motion vectors. More accurate motion vectors can also lead to greater coding efficiency. Real-time, accurate motion estimation is currently not attainable on standard desktop PCs; it usually requires some kind of dedicated hardware, such as that found on video coding chips. Gradient-based motion estimation is one approach that gives good accuracy at a reasonable computational cost. This paper uses the Wiener-based motion estimator as a vehicle to explore the acceleration of gradient-based motion estimation on the PC.
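To make the gradient-based approach concrete, the sketch below estimates a single displacement for a block by solving the least-squares gradient constraint over all interior pixels. The damping term `mu` added to the diagonal is an assumption standing in for the regularization used by Wiener-based estimators; the function name and the single-block, single-iteration setup are illustrative, not the estimator or the acceleration scheme evaluated in the paper.

```python
def estimate_motion(prev, curr, mu=0.0):
    """Estimate one (dx, dy) displacement between two equally sized blocks.

    prev and curr are 2-D lists of intensities indexed as [y][x].
    """
    height, width = len(prev), len(prev[0])
    gxx = gxy = gyy = bx = by = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # horizontal gradient
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # vertical gradient
            it = curr[y][x] - prev[y][x]                  # temporal difference
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
            bx += ix * it
            by += iy * it
    gxx += mu  # damping on the diagonal (assumed regularization)
    gyy += mu
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        return (0.0, 0.0)  # flat or one-dimensional texture: no reliable estimate
    # Solve [[gxx, gxy], [gxy, gyy]] d = -[bx, by] in closed form.
    dx = (-gyy * bx + gxy * by) / det
    dy = (gxy * bx - gxx * by) / det
    return (dx, dy)
```

Every pixel contributes independent multiply-accumulate terms to the five sums, which is the kind of regular, data-parallel workload that lends itself to acceleration.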
Video processing algorithms, and in particular those found in high-end television receivers, often place challenging demands on system resources. Therefore, dedicated IC solutions are most often proposed to meet both the system and economic constraints. However, as the functional requirements increase and more diversity in application support is required, dedicated solutions become less economically attractive, and a more heterogeneous architecture becomes more economical. In this paper, we present an architecture that is suited to running multiple very demanding video processing applications in real time for consumer market applications.
Hardware/software co-simulation is a key step in the hardware/software co-design flow. In this paper, a reconfigurable co-simulation platform for media processors, called MPSP, is described. The platform can be quickly configured in both hardware and software to accommodate different media processors and different simulation specifications. The design of the co-simulation environment on MPSP is library-based: a reconfigurable IP library and a software package with an API are provided as part of MPSP. With this platform, FPGA-based co-simulation is greatly accelerated.
A revolutionary SOPC platform-based design methodology for multimedia communications is developed. We embed a soft-core processor in an FPGA to perform image compression. Then, we plug an Ethernet daughter board into the SOPC development platform. On this basis, a web surveillance platform is presented. The web surveillance system consists of three parts: image capture, a web server, and JPEG compression. In this architecture, the user can control the surveillance system remotely. Using the IP address configured on the Ethernet daughter board, the user can access the surveillance system via a browser. When the user accesses the surveillance system, the CMOS sensor captures the current remote image and feeds it to the embedded processor. The embedded processor immediately performs the JPEG compression, and the user then receives the compressed data via Ethernet. In summary, the whole system is implemented on an APEX20K200E484-2X device.
Under pressure from design productivity demands and a variety of special applications, the original DSP design method can no longer keep up with the required speed, and a novel design method is urgently needed. Intellectual Property (IP) reuse is a trend in DSP design, but simple plug-and-play approaches to IP cores almost never work. Therefore, appropriate control strategies are needed to connect all the IP cores used and to coordinate the whole DSP. This paper presents a new DSP design procedure that draws on System-on-a-Chip methodology, and then introduces a novel control strategy named DWC for implementing a DSP based on IP cores. The most important part of this control strategy, the pipeline control unit (PCU), is described in detail. Because a great number of data hazards occur in most computation-intensive scientific applications, a new, effective algorithm for checking data hazards is employed in the PCU. Following this strategy, the design of a general-purpose or special-purpose DSP can be finished in a shorter time, and the DSP has the potential to improve performance with little modification of the basic function units. The DWC strategy has been successfully implemented in a 16-bit fixed-point DSP.
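The kind of check a pipeline control unit must perform can be illustrated with a generic read-after-write (RAW) hazard scan: for each source register, find the nearest earlier instruction within the pipeline window that writes it. The instruction model, the `window` parameter, and the function name are assumptions made for illustration; the abstract does not describe the new algorithm used in the PCU.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def find_raw_hazards(program: List[Instr], window: int = 2):
    """Return (producer_index, consumer_index, register) for each RAW hazard.

    `window` is the number of preceding instructions whose results may not
    yet have been written back when the consumer reads its operands.
    """
    hazards = []
    for i, current in enumerate(program):
        for reg in current.srcs:
            # Scan backwards so that only the nearest producer is reported.
            for j in range(i - 1, max(i - window, 0) - 1, -1):
                if program[j].dest == reg:
                    hazards.append((j, i, reg))
                    break
    return hazards
```

A control unit would use such a result to decide, per instruction, whether to stall or to route the operand through forwarding logic.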
Register files (RFs) are widely used in the latest DSPs and media processors. It is very important to include a reasonable RF configuration in processor designs to help reduce chip area, power consumption, and architectural complexity. DSPs and media processors need many direct accesses to memory, which reduces the number of register accesses. Another way to reduce the RF access frequency is bypassing, or forwarding, implemented by a software mechanism, in which subsequent instructions directly use the result produced by a previous instruction through bypassing logic rather than the RF. The RF access frequency is thereby decreased further. According to the experimental results, this new RF configuration not only satisfies the requirements of traditional media processors but is also well suited to media processors with a very long instruction word (VLIW) architecture.
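The effect of bypassing on register file traffic can be illustrated with a simple count: an operand produced by one of the last few instructions is taken from the bypass path, and every other operand costs an RF read. The program representation, the `bypass_depth` parameter, and the function name are hypothetical; this toy model only illustrates the principle and is not the RF configuration evaluated in the paper.

```python
def count_rf_reads(program, bypass_depth):
    """Return (rf_reads, bypassed_reads) for a straight-line program.

    program is a list of (dest, srcs) pairs; dest may be None.
    bypass_depth is how many preceding results the bypass logic still holds.
    """
    rf_reads = 0
    bypassed = 0
    for i, (_dest, srcs) in enumerate(program):
        # Destinations of the instructions still visible on the bypass path.
        recent = [dest for dest, _ in program[max(0, i - bypass_depth):i]]
        for reg in srcs:
            if reg in recent:
                bypassed += 1
            else:
                rf_reads += 1
    return rf_reads, bypassed
```

For a four-instruction dependent sequence with eight operand reads in total, a bypass depth of 1 already removes three RF reads in this model, and a depth of 2 removes four.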