The demands of digital image processing, communications, and multimedia applications are growing more rapidly than traditional design methods can meet them. Previously, only custom hardware designs could provide the performance these applications require. However, hardware design has reached a crisis point: it can no longer deliver a product with the required performance and cost in a reasonable time and at a reasonable risk. Software-based designs running on conventional processors can deliver working designs quickly and with low risk, but cannot meet the performance requirements. What is needed is a media processing approach that combines very high performance, a simple programming model, complete programmability, short time to market, and scalability. The Universal Micro System (UMS) is a solution to these problems: a completely programmable (including I/O) system on a chip that combines hardware performance with the fast time to market, low cost, and low risk of software designs.
We have implemented software MPEG-2 decoders on two mediaprocessors, the Texas Instruments TMS320C80 and the Hitachi/Equator Technologies MAP1000. On the TMS320C80, the small instruction cache and an advanced digital signal processor architecture poorly suited to bitstream parsing limited performance to about twelve frames per second for 5 Mbit/s MPEG-2 bitstreams at 50 MHz. In contrast, the MPEG-2 decoder implemented on a 200-MHz MAP1000 achieved real-time decoding of MP@ML bitstreams. Implementation details, such as tight loops and data flow, are presented. We also compare the architectural features of the two mediaprocessors for MPEG-2 decoding. An instruction set specifically targeted at multimedia processing, better instruction cache utilization, and an independent variable-length decoder are among the advantages of the MAP1000.
Mediaprocessors offer several advantages over hardwired MPEG-2 decoder chips, such as the capability to perform multiple functions, update the algorithms, and customize the system with enhanced features. The MAP1000A is a highly integrated mediaprocessor platform with multiple processing units for parallel processing. The input bitstream is parsed and decoded by the VLx coprocessor, a small and fast processor designed for sequential operations. The decoded information is then passed to the VLIW Core, which performs the pixel-intensive operations such as inverse quantization, inverse DCT, half-pel interpolation, pixel averaging, and pixel addition. The VLIW Core's two-cluster architecture, with a 128-bit data path per cluster and partitioned operations, achieves high throughput on 8-bit and 16-bit pixel operations. Also, to keep the VLIW Core from waiting for data, a dedicated data transfer engine called the Data Streamer moves data among the MAP1000A, external memory, and I/O devices in parallel with the VLIW Core's execution. The MPEG-2 video decoder on the MAP1000A is written entirely in the C language, a significant advantage over previous processors that required assembly-language programming. At a 220-MHz clock frequency, the MPEG-2 decoder takes less than 40% of the MAP1000A's cycles. Two MP@ML streams can be decoded simultaneously in real time, with enough cycles remaining to perform other tasks such as audio and system decoding.
The computation of B-splines for iterative 3D geometric modeling and graphic animation sessions imposes large computational requirements, which suggests the use of high-performance VLSI architectures. In this paper we describe an architecture for the computation of rational B-spline surfaces and their derivatives. The architecture is based on a highly regular and modular structure, suitable for VLSI implementation, which permits reconfiguration of the system when no derivatives are required. A new scheduling scheme keeps the system fully utilized in both configuration modes by exploiting the parallel structure of the algorithm.
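For reference, the recurrence such an architecture pipelines is the Cox-de Boor recursion for the B-spline basis functions, combined with a weighted (rational) sum for NURBS surface points. The following is a minimal scalar sketch, not the paper's datapath; the function names and the half-open-interval convention for the degree-0 case are assumptions.

```python
def bspline_basis(i, p, t, knots):
    """Cox-de Boor recursion for the B-spline basis function N_{i,p}(t).
    Degree-0 functions are indicators of half-open knot spans."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] != knots[i]:
        left = ((t - knots[i]) / (knots[i + p] - knots[i])
                * bspline_basis(i, p - 1, t, knots))
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * bspline_basis(i + 1, p - 1, t, knots))
    return left + right

def rational_surface_point(ctrl, w, u, v, p, q, U, V):
    """Point on a rational B-spline (NURBS) surface: a weighted sum of
    3D control points, normalized by the sum of the weighted bases."""
    num = [0.0, 0.0, 0.0]
    den = 0.0
    for i in range(len(ctrl)):
        for j in range(len(ctrl[0])):
            b = bspline_basis(i, p, u, U) * bspline_basis(j, q, v, V) * w[i][j]
            den += b
            for k in range(3):
                num[k] += b * ctrl[i][j][k]
    return [c / den for c in num]
```

With all weights equal to 1, the rational surface reduces to an ordinary B-spline surface, and the basis functions form a partition of unity over the interior of a clamped knot vector.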
Lossless coding of image data has been a very active area of research in medical imaging, remote sensing, and document processing/delivery. While several lossless image coders such as JPEG and JBIG have been in existence for a while, their compression performance on continuous-tone images was rather poor. Recently, several state-of-the-art techniques such as CALIC and LOCO were introduced, with significant improvements in compression performance over traditional coders. However, these coders are very difficult to implement in dedicated hardware or in software on mediaprocessors because of the inherently serial nature of their encoding process. In this work, we propose a lossless image coding technique whose compression performance is very close to that of CALIC and LOCO while being very efficient to implement in both hardware and software. Comparisons on the JPEG-2000 image set show that the compression performance of the proposed coder is within 2-5% of the more complex coders while being computationally very efficient. In addition, the encoder is shown to be parallelizable at a hierarchy of levels. The execution time of the proposed encoder is smaller than that of LOCO, while the decoder is 2-3 times faster than the LOCO decoder.
Multi-dimensional applications, such as image processing and seismic analysis, usually require the high performance obtained from the implementation of Application-Specific Integrated Circuits (ASICs). The critical sections of such applications consist of nested loops, possibly containing conditional branch instructions. Current commercial systems use branch predication techniques, which can also be applied in the design of ASIC systems. Those techniques utilize predicate registers to control the validity of computed results, so the optimized design and allocation of such registers becomes a significant factor in system performance. By using predication to transform control dependencies into data dependencies, the application of multi-dimensional retiming to an MDFG permits the iterations of the original loop body to be naturally overlapped, making the existing parallelism explicit. Based on the retiming information, predicate registers are designed as shift registers that allow the correct execution of the filter function.
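The transformation of control dependencies into data dependencies can be seen in miniature through if-conversion: both arms of a branch execute every iteration, and a predicate selects the valid result, which is exactly what the predicate registers above control in hardware. A hypothetical scalar sketch (not the paper's MDFG machinery):

```python
def predicated_abs(xs):
    """Branch-free |x| over a list: both branch arms are computed
    unconditionally and a predicate selects which result is kept."""
    out = []
    for x in xs:
        p = int(x < 0)          # predicate "register": 1 if then-arm is valid
        t_then = -x             # then-arm result (x was negative)
        t_else = x              # else-arm result (x was non-negative)
        out.append(p * t_then + (1 - p) * t_else)  # predicate-controlled select
    return out
```

Because no branch remains in the loop body, the iterations carry only data dependencies and can be overlapped by retiming or software pipelining.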
Recently, several commercial DSP processors with VLIW (Very Long Instruction Word) architectures were introduced. VLIW architectures offer high performance over a wide range of multimedia applications that require parallel processing. In this paper, we implement an efficient 2D median filter for a VLIW architecture, specifically the Texas Instruments C62x. The median filter is widely used to remove impulse noise while preserving edges in still images and video. Efficient median filtering requires fast sorting. The sorting algorithms were optimized using software pipelining and loop unrolling to maximize the use of the available functional units while meeting the data dependency constraints. The paper describes and lists the optimized source code for the 3 × 3 median filter using an enhanced selection sort algorithm.
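An enhanced selection sort for the median exploits the fact that the median of nine values is in place after only five selection steps, so the remaining four need never be sorted. A scalar sketch of the idea (the name median3x3 and the copy-the-border convention are assumptions, not the listed C62x code):

```python
def median3x3(img):
    """3x3 median filter over a 2D list of pixel values.
    Border pixels are copied through unchanged (one common convention)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            win = [img[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            # Partial selection sort: five selection steps suffice to
            # place the median (rank 4 of 9) at index 4.
            for i in range(5):
                m = min(range(i, 9), key=win.__getitem__)
                win[i], win[m] = win[m], win[i]
            out[y][x] = win[4]
    return out
```

An isolated impulse (a single outlier among nine neighbors) is always replaced by a neighborhood value, while a step edge, which has at least five pixels on one side, survives.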
An 8-point inverse discrete cosine transform (IDCT) can be viewed as a matrix multiplication between an 8 × 8 coefficient matrix and an 8 × 1 input vector. It has been shown that the matrix computations can be significantly reduced by separating the even and odd elements of the input vector: the 8 × 8 matrix multiplication is divided into two 4 × 4 matrix multiplications, and the output elements are obtained by adding and subtracting the results of the two. On a mediaprocessor with a large number of multipliers, such as the MAP1000A from Equator Technologies, we found that this vector-product algorithm executes faster than other fast IDCT algorithms, because a highly complex operation such as a vector product takes the same number of machine cycles as a simple operation such as a data move. The MAP1000A can perform four 4-point vector products with a single instruction. We also found novel approaches that avoid moving data around when separating the even and odd elements of the input vector and when transposing the matrix between the row-wise and column-wise stages. As a result, an 8 × 8 IDCT can be computed in 60 cycles on the MAP1000A, about 273 ns at a 220-MHz clock frequency.
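The even/odd separation rests on a symmetry of the cosine basis: the even-indexed columns are symmetric in the output index n and the odd-indexed columns antisymmetric, so a final add/subtract butterfly recovers all eight outputs from two 4-point products. A floating-point sketch of the decomposition (illustrative only; the function names and the use of Python floats are assumptions, not the MAP1000A code):

```python
import math

# Full 8x8 IDCT basis: x[n] = sum_k (c_k / 2) * X[k] * cos((2n+1) k pi / 16),
# with c_0 = 1/sqrt(2) and c_k = 1 otherwise.
C = [[0.5 * (1 / math.sqrt(2) if k == 0 else 1.0)
      * math.cos((2 * n + 1) * k * math.pi / 16)
      for k in range(8)] for n in range(8)]

def idct8_direct(X):
    """Straightforward 8x8 matrix-vector product."""
    return [sum(C[n][k] * X[k] for k in range(8)) for n in range(8)]

def idct8_evenodd(X):
    """Same transform via two 4x4 products on the even/odd halves of X."""
    Xe, Xo = X[0::2], X[1::2]
    e = [sum(C[n][2 * k] * Xe[k] for k in range(4)) for n in range(4)]
    o = [sum(C[n][2 * k + 1] * Xo[k] for k in range(4)) for n in range(4)]
    # The even part satisfies e[7-n] = e[n] and the odd part o[7-n] = -o[n],
    # so the two halves of the output are e + o and (reversed) e - o.
    return ([e[n] + o[n] for n in range(4)]
            + [e[3 - n] - o[3 - n] for n in range(4)])
```

Each of the two 4 × 4 products maps directly onto a 4-point vector-product instruction, one output element per issue.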
The paper proposes a new mediaprocessor architecture specifically designed to handle state-of-the-art multimedia encoding and decoding tasks. To achieve this, the architecture efficiently exploits data-, instruction-, and thread-level parallelism while continuously adapting its computational resources to the most appropriate parallelism level among all the concurrent encoding/decoding processes. With respect to implementation constraints, several critical choices were adopted that solve the interconnection delay problem, reduce the effects of cache misses and pipeline stalls, and reduce register file and memory size by adopting a clustered simultaneous multithreaded architecture. We enhanced the classic model to exploit both instruction- and data-level parallelism through vector instructions. The vector extension is well justified for multimedia workloads and improves code density, crossbar complexity, register file ports, and decoding logic area, while still providing an efficient way to fully exploit a large set of functional units. An MPEG-2 encoding algorithm based on hybrid genetic search has been implemented that shows the ability of the architecture to adapt its resource allocation to better fulfill the application requirements.
Equator Technologies, Inc. has used a software-first approach to produce several programmable and advanced VLIW processor architectures that have the flexibility to run both traditional systems tasks and an array of media-rich applications. For example, Equator's MAP1000A is the world's fastest single-chip programmable signal and image processor targeted for digital consumer and office automation markets. The Equator MAP3D is a proposal for the architecture of the next generation of the Equator MAP family. The MAP3D is designed to achieve high-end 3D performance and a variety of customizable special effects by combining special graphics features with a high-performance floating-point and media processor architecture. As a programmable media processor, it offers the advantages of a completely configurable 3D pipeline, allowing developers to experiment with different algorithms and to tailor their pipeline to achieve the highest performance for a particular application. With the support of Equator's advanced C compiler and toolkit, MAP3D programs can be written in a high-level language. This allows the compiler to find and exploit the parallelism in a programmer's code, thus decreasing the time to market of a given application. The ability to run an operating system makes it possible to run concurrent applications on the MAP3D chip, such as video decoding while executing the 3D pipelines, so that integration of applications is easily achieved, for instance using real-time decoded imagery for texturing 3D objects. This novel architecture enables an affordable, integrated solution for high-performance 3D graphics.
Mediaprocessors, such as the Philips TriMedia and the Hitachi/Equator Technologies MAP, combine the computational power of high-end DSPs with various I/O capabilities in a single programmable chip. Due to their programmability, mediaprocessors have greater flexibility than ASICs and other special-purpose hardware. Early mediaprocessors, such as the Texas Instruments TMS320C80 introduced in 1994, have had limited success due to their difficulty of programming, insufficient computational power, and high cost. Fortunately, several newer mediaprocessors, available or under development, are easier to program, are less expensive, and/or have more computational power. However, due to the earlier difficulties and the inherent uncertainties of programmable solutions, mediaprocessor user companies (set makers) are often hesitant to adopt mediaprocessors in their products. Furthermore, set makers still need to expend a great deal of time and manpower to make a successful transition from hardwired to mediaprocessor-based products. Therefore, we introduce the Mediaprocessor (MP) Consortium, which aims to remove the barriers to the widespread use of programmable mediaprocessors. Through publications, a web site, training courses, software libraries, and objective evaluations of mediaprocessors, the MP Consortium can increase awareness of the benefits of mediaprocessors over ASICs, make the transition to mediaprocessor-based products easier for set makers, and help them take full advantage of mediaprocessors.
An MPEG-4-based visual presentation is a scene presentation composed of multiple visual objects. The MPEG-4 standard allows viewers to interactively change their viewing position relative to a scene. Therefore, an MPEG-4-compliant graphical rendering device should be able to transform a decoded video object in 3D space according to the viewer's position. This type of texture transformation is known as perspective texture warping.
The rate at which MPEG digital video is decoded depends primarily on the resolutions and the pixel rates of the images that are ultimately displayed. A variety of MPEG decoders is currently available for standard-definition video. Higher-resolution video, such as HDTV, requires much higher decoding rates. The processing speeds needed for these high rates may not be attainable in a decoder that uses conventional digital processing and memory technologies without the use of parallel processing. This will continue to be true even after high-speed decoders become available for HDTV resolutions, as there will be other video applications (e.g., virtual reality, scientific, and medical imaging) for which still higher resolutions are needed. Consequently, higher processing speeds will be required, along with parallel processing. The MPEG decoding algorithm, however, was designed to process an entire picture sequentially, and as such is not well-suited for parallel processing implementations. In this paper, the general problem of parallelism in the decoding of MPEG video is considered, and a simple, efficient method of partitioning it into parallel processes is described.
This paper presents a method of implementing the perspective transform specified in the MPEG-4 standard using 32-bit fixed-point, reduced-precision calculations instead of 64-bit floating-point, full-precision operations. We achieve this by removing some redundant calculations and truncating the numerator and denominator terms of the transform without loss of accuracy.
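As a rough illustration of the reduced-precision idea (not the paper's exact truncation scheme; the Q16.16 format and all names here are assumptions), a perspective warp of an integer pixel coordinate by a 3 × 3 homography can be computed entirely in integer arithmetic, with one division per output coordinate:

```python
FRAC = 16  # Q16.16 fixed point: 16 integer bits, 16 fractional bits

def to_fix(v):
    """Quantize a real coefficient to Q16.16."""
    return int(round(v * (1 << FRAC)))

def warp_fixed(M, x, y):
    """Perspective transform of integer pixel (x, y) by the homography
    M = (a, b, c, d, e, f, g, h), whose lower-right entry is 1:
        x' = (a x + b y + c) / (g x + h y + 1), similarly for y'.
    All arithmetic is integer; results are returned as floats for reading."""
    a, b, c, d, e, f, g, h = (to_fix(v) for v in M)
    one = 1 << FRAC
    num_x = a * x + b * y + c        # Q16.16, since x and y are integers
    num_y = d * x + e * y + f
    den = g * x + h * y + one
    # One rounded fixed-point division per coordinate.
    fx = (num_x * one + den // 2) // den
    fy = (num_y * one + den // 2) // den
    return fx / one, fy / one
```

The residual error comes from quantizing the homography coefficients and from the single truncating division, and for typical image coordinates it stays well below a pixel.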
This paper describes a method of motion estimation that determines the optimum prediction mode, along with the resulting motion vectors, as a part of the estimation process. It supports all six prediction modes that are allowed by the Main Profile of the MPEG-2 video coding standard, including the dual-prime modes, for frame and field picture types. If implemented on a processor whose architecture is optimized for the required operations, its computational complexity will not be much greater than that of conventional single-mode motion estimation.
The advent of the MPEG-4 standard for manipulation and coding of multimedia data has resulted in the exploration of new algorithms and architectural solutions for object- and content-based processing. MPEG-4 operates at the level of Video Object Planes (VOPs), in contrast to the frame-based processing of MPEG-1 and MPEG-2. The motion estimation process in MPEG-4 requires both 8 × 8 and 16 × 16 block motion processing. In addition, a new approach termed padding is employed to improve the accuracy of motion estimation/compensation. We have recently presented a scalable architecture for estimating the motion of both 8 × 8 and 16 × 16 blocks. In this paper, we present a novel architecture for padding in MPEG-4. Padding is carried out for the boundary macroblocks of the VOPs in two steps, namely horizontal padding and vertical padding, and is based on the shape information of the individual video objects. The asynchronous communication in the proposed architecture enables high throughput while maintaining low complexity. The architecture is simple, modular, and scalable, and has been simulated in VHDL.
Many contemporary microprocessor architectures incorporate multimedia extensions to accelerate media-rich applications using subword arithmetic. While these extensions significantly improve the performance of most multimedia applications, the lack of subword rearrangement support potentially limits the gain. Several means of adding architectural support for subword rearrangement have been proposed and implemented, but none of them provides a fully general solution. In this paper, a new class of permutation instructions based on the butterfly interconnection network is proposed to address the general subword rearrangement problem. It can be used to perform an arbitrary permutation (without repetition) of n subwords within log n cycles, regardless of the subword size. The instruction encoding and the low-level implementation are quite simple. An algorithm is also given to derive an instruction sequence for any arbitrary permutation.
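A single stage of such a butterfly permutation primitive is easy to model: at stride s, each subword pair (i, i+s) is conditionally swapped under one control bit. The name bfly_stage and the list-of-subwords representation below are illustrative, not the proposed ISA encoding:

```python
def bfly_stage(v, stride, ctrl):
    """One butterfly stage over a list of subwords.
    Each pair (i, i + stride) swaps when its control bit in ctrl is 1;
    bits are consumed in pair order, least significant first."""
    v = list(v)
    bit = 0
    for base in range(0, len(v), 2 * stride):
        for i in range(base, base + stride):
            if (ctrl >> bit) & 1:
                v[i], v[i + stride] = v[i + stride], v[i]
            bit += 1
    return v
```

Chaining log2 n such stages, with strides n/2 down to 1, realizes rearrangements such as a full reversal of four subwords with all control bits set; deriving the control bits for a given permutation is the role of the sequence-generation algorithm the paper describes.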