Multimedia processing presents challenges from the perspectives of both hardware and software. Each medium in a multimedia environment requires different processes, techniques, algorithms and hardware implementations. Each of the technologies involved in multimedia processing depends on VLSI implementation for cost efficiency. Although recent advances have solved many hardware problems, the high demands of multimedia processing still require special architectural approaches. In this paper, we propose a categorization of the main architectural alternatives for multimedia computing architecture design which, as in any other processor design, fall into two classes: dedicated and programmable. Combining dedicated and programmable modules in a multimedia architecture offers a compromise between the two strategies, yielding an architecture adapted to multimedia purposes. The design trends in each category and the classification of different multimedia architectures are detailed in this paper.
With the introduction of faster processors and special instruction sets tailored to multimedia, a number of exciting applications are now feasible on the desktop. Among these is DVD playback, which consists, among other things, of MPEG-2 video and Dolby Digital audio or MPEG-2 audio. Other multimedia applications such as video conferencing and speech recognition are also becoming popular on computer systems. In view of this tremendous interest in multimedia, a group of major computer companies has formed the Multimedia Benchmarks Committee as part of the Standard Performance Evaluation Corp. to address the performance issues of multimedia applications. The approach is multi-tiered, with three tiers of fidelity from minimal to fully compliant. In each case the fidelity of the bitstream reconstruction as well as the quality of the video or audio output is measured, and the system is classified accordingly. In the next step the performance of the system is measured. In many multimedia applications, such as DVD playback, the application needs to run at a specific rate. In this case the measurement of the excess processing power makes all the difference. All of this makes a system-level, application-based multimedia benchmark very challenging. Several ideas and methodologies for each aspect of the problem will be presented and analyzed.
Energy-efficient computing is growing in demand, as portable systems require energy efficiency in order to maximize battery life. Memory power consumption is becoming an increasingly large fraction of the total power consumption of a given system. In this paper, we provide data and insight into how the choice of cache parameters affects the memory power consumption of video algorithms. We use memory traces generated by running typical MPEG-2 motion estimation algorithms to simulate a large number of cache configurations. The cache simulation data is then combined with on-chip and off-chip memory power models to compute memory power consumption. We provide a detailed study of how varying cache size, block size, and associativity affects memory power consumption. The configurations of particular interest are the ones that optimize power under certain constraints. We also study the role of process technology in these experiments. In particular, we look at how moving to a more advanced process technology for the on-chip cache affects the optimal points of operation with respect to memory power consumption.
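The trace-driven flow described above (simulate a cache configuration on an address trace, then feed the hit and miss counts into a power model) can be sketched as follows. This is a minimal illustration, not the authors' simulator: the LRU replacement policy, the function names, and the per-access energy parameters are all assumptions introduced here.

```python
from collections import OrderedDict


def simulate_cache(trace, cache_bytes, block_bytes, ways):
    """Count (hits, misses) of a set-associative cache on a byte-address trace.

    LRU replacement is an assumption of this sketch.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        block = addr // block_bytes
        lines = sets[block % num_sets]
        if block in lines:
            hits += 1
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return hits, misses


def memory_energy(hits, misses, e_cache_access, e_offchip_per_byte, block_bytes):
    """Toy energy model (hypothetical): every access touches the on-chip cache,
    and every miss additionally fetches one full block from off-chip memory."""
    return (hits + misses) * e_cache_access + misses * block_bytes * e_offchip_per_byte


# Sequential scan of 64 bytes through a 64-byte direct-mapped cache, 16-byte blocks.
hits, misses = simulate_cache(range(64), cache_bytes=64, block_bytes=16, ways=1)
print(hits, misses, memory_energy(hits, misses, 1.0, 10.0, 16))
```

Sweeping `cache_bytes`, `block_bytes`, and `ways` over a real trace and keeping the lowest-energy configuration that meets a given constraint would mirror the kind of study the abstract describes.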
Object-based media refers to the representation of audiovisual information as a collection of objects - the result of scene-analysis algorithms - and a script describing how they are to be rendered for display. Such multimedia presentations can adapt to viewing circumstances as well as to viewer preferences and behavior, and can provide a richer link between content creator and consumer. With faster networks and processors, such ideas become applicable to live interpersonal communications as well, creating a more natural and productive alternative to traditional videoconferencing. In this paper I outline an example of object-based media algorithms and applications developed by my group, and present new hardware architectures and software methods that we have developed to meet the computational requirements of object-based and other advanced media representations. In particular we describe stream-based processing, which enables automatic run-time parallelization of multidimensional signal processing tasks even given heterogeneous computational resources.
SIMD processor arrays are becoming popular for their fast parallel execution of low- to medium-complexity image and video processing algorithms, and of most stages of the compression standards. In many existing techniques, visual data processing algorithms and compression standards possess a high degree of parallelism. In particular, the processing of a certain pixel/block does not generally require data from a distant pixel/block, and the instructions for processing these pixels/blocks are usually identical. Thus, these algorithms map naturally onto the architecture of SIMD processor arrays. In this paper, the architectures of recent SIMD processor arrays are reviewed together with algorithms demonstrating their superior features. Due to the length of the paper, only processor arrays implemented at the chip level are discussed, especially those whose logic circuits are merged/embedded in the SRAM or DRAM memory process. While some processor arrays are designed by embedding memory modules onto existing processors, a significant number of processor arrays are realized by integrating logic circuits onto existing RAM to save the inherently large memory bandwidth, thus achieving a performance on the order of tera-instructions per second.
Resizing techniques are commonly used in graphic applications and window environments to change image size and adapt it to the video resolution. Different algorithms have been proposed to perform image reduction, with different effects on the image content. In this paper we present a proposal for a graphic coprocessor, based on the TMS320C80, optimized for low-loss reduction of line drawings and suitable for VDT presentations. Based on the DCT and local thresholding, the algorithm is characterized by fine granularity and task independence, and consequently matches parallel processing well. In the proposed system, the line drawing is split into windows of 32 × 32 pixels. To perform the image reduction, each window is subjected to DCT computation and to inverse transformation of only the low-frequency coefficients. The reduced image is thresholded at a value that is a function of the black pixel percentage in the original image and of the reduction ratio. The coprocessor is able to reduce an A0 line drawing acquired at 300 dpi in less than 0.2 seconds, with a reduction ratio of 1/32, producing good information preservation in the new image, suitable for VDT presentation.
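The per-window reduction step described above (forward DCT, inverse transform of only the low-frequency coefficients, then thresholding) can be sketched in plain Python. This is a small illustration under stated assumptions, not the coprocessor's implementation: the orthonormal DCT-II, the `k / n` rescaling that preserves the mean level, and the fixed `threshold` argument are choices made here. The abstract's actual threshold depends on the black pixel percentage and the reduction ratio, and that function is not given.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (row = frequency u, column = sample i)."""
    return [[math.sqrt((1 if u == 0 else 2) / n)
             * math.cos((2 * i + 1) * u * math.pi / (2 * n))
             for i in range(n)] for u in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def reduce_window(window, k, threshold):
    """Reduce an n x n window (1 = black, 0 = white) to a k x k binary window."""
    n = len(window)
    cn, ck = dct_matrix(n), dct_matrix(k)
    coeffs = matmul(matmul(cn, window), transpose(cn))        # forward 2-D DCT
    # Keep only the k x k low-frequency corner; k / n keeps the mean level.
    low = [[coeffs[u][v] * k / n for v in range(k)] for u in range(k)]
    small = matmul(matmul(transpose(ck), low), ck)            # inverse k x k DCT
    return [[1 if p >= threshold else 0 for p in row] for row in small]


# An 8 x 8 window whose left half is black, reduced to 2 x 2.
half_black = [[1 if x < 4 else 0 for x in range(8)] for _ in range(8)]
print(reduce_window(half_black, k=2, threshold=0.5))
```

Because each window is processed independently, the windows can be distributed across parallel processors without communication, which is the property the abstract relies on.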
The length of a statically created instruction schedule determines to a great extent the performance of program execution on VLIW architectures. In this paper we present a simple yet effective method to reduce the length of a static instruction schedule by introducing new hardware operations, referred to as super operations. A super operation replaces a number of operations while maintaining functionality, hence decreasing the total number of operations to be executed and thereby eliminating the dependencies between them. In order to replace a number of operations, super operations must often process more operands and produce more results than traditional operations. The Philips TM-1000 is a VLIW-based architecture. Its CPU is a 5-issue machine with 27 functional units, each connected to one issue slot. To support super operations, we extend the hardware with special functional units that are connected to more than one issue slot. In this paper we discuss the modifications that were made to the compiler in order to support super operations, and we demonstrate the ease with which super operations can be applied by the application programmer. To a lesser extent, we address the hardware consequences of super operations. Furthermore, we demonstrate the benefit of super operations by showing the performance improvement for some multimedia applications.
This paper presents a design study of the memory system for a very long instruction word (VLIW) video signal processor (VSP). The gap between memory and modern processors is continuously widening, and thus memory systems have been a subject of active research for a long time. However, memory issues in VLIW machines have not yet been addressed. Real-time video signal processing requires a fast memory with high bandwidth and high connectivity. Efficient memory system design is particularly important for VSPs that combine significant amounts of on-chip memory with the processor, which we expect to become common in the next generation of VSPs. In this paper we use a trace-driven methodology to analyze the parallelism, especially that of memory operations, in video applications. With a scheduling range of up to one billion operations, we analyzed large traces of several real applications including H.263, MPEG-2 and MPEG-4. We found that even with a conservative configuration the average speedup is more than 8.
When implementing today's video compression standards on programmable processors, it is essential to optimize the algorithms with respect to the underlying hardware. As an example, the core decoder functions of the H.263 hybrid coding scheme were implemented on the HiPAR-DSP, a SIMD-controlled processor with four parallel VLIW data paths. The decoder tasks were implemented employing local memory, parallelization on several levels, and data statistics. Special attention was paid to the computation-intensive tasks of IDCT and motion-compensated frame reconstruction. To speed up the IDCT computation, a data-dependent approach was chosen, which distinguishes different block types. The determination of the IDCT block type could be parallelized together with other tasks, so no additional overhead is incurred. Frame reconstruction mainly benefits from data-parallel operations and transparent DMA transfers to and from external memory.
This paper surveys the state-of-the-art in very long instruction word (VLIW) architectures for video signal processing (VSP). Several factors make VLIW and video a good match, including the large amounts of data parallelism in video programs and the ability to implement VLIW VSPs on a single chip. The paper first introduces the canonical VLIW architecture, then considers several alternative architectural approaches for video processing, and finally discusses some VLIW VSP architectures in more detail.
Prefetch techniques may, in general, be applied to reduce the miss rate of a processor's data cache and thereby improve the overall performance of the processor. In particular, stream prefetch techniques can be applied to prefetch data streams that are often encountered in multimedia applications. Stream prefetch techniques exploit the fact that data from such streams are often accessed in a regular fashion. Implementing a stream prefetch technique involves two issues, viz. stream combined hardware/software stream prefetch technique. A special stream-prefetch instruction is introduced to alert the hardware that load instructions access a data stream. Subsequently, prefetching is handled by the hardware automatically, in such a way that the rate at which the data is prefetched is synchronized with the rate at which the prefetched data is processed by the application. Stream prefetch techniques of this kind have been proposed earlier, but they use instruction addresses for synchronization. The technique introduced in this paper uses a different synchronization mechanism that does not suffer from the drawbacks of instruction address synchronization.
Demand for highly flexible and fast implementations of bitstream parsing and variable-length decoding (VLD) arises when applications must support either MPEG-4 or multiple standards such as MPEG-2, H.263 or Dolby AC-3. The paper shows that today's multimedia-oriented RISC processors incorporating multiple parallel arithmetic units are slowed down especially by these kinds of bit-level operations. Therefore, a new architecture is proposed that adds function-specific blocks, highly adapted to the processing of variable-length coded bitstream data, to the data path of a RISC processor. The increased functional complexity of basic instructions results in a significant speedup over software implementations on standard RISC processors. Two typical functions that are frequently used in bitstream parsing, ShowBits and GetBits, are executed in a single clock cycle with a 64-bit rotator circuit. Constant input-rate VLD of one, two or four bits per clock cycle can be implemented using internal RAM. Look-up tables can be used for word-parallel decoding and VLC. Optionally, memory entries can be saved by using content-addressable memories (CAM) in addition to a data RAM. The proposed architecture has been implemented as a functional extension to an existing RISC core with an additional 9k gates of logic, 8k RAM and an interface to a CAM. Synthesis results show an estimated achievable clock frequency of 160 MHz using a 0.35 µm technology. The resulting performance is sufficient for MPEG-2 HDTV or MPEG-4 applications.
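For readers unfamiliar with the two parsing primitives named above, the following software sketch shows what ShowBits (peek) and GetBits (peek and consume) do, plus a look-up-table decoder for a prefix code. It is a functional model only, not the proposed hardware: the class and function names, the MSB-first bit order, the zero padding past the end of the buffer, and the three-symbol code table in the example are all assumptions made for illustration.

```python
class BitReader:
    """MSB-first bit reader over a bytes object (software model only)."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position in bits from the start of the buffer

    def show_bits(self, n):
        """Return the next n bits without consuming them (zero-padded at the end)."""
        value = 0
        for i in range(n):
            bit_index = self.pos + i
            byte_index = bit_index >> 3
            bit = 0
            if byte_index < len(self.data):
                bit = (self.data[byte_index] >> (7 - (bit_index & 7))) & 1
            value = (value << 1) | bit
        return value

    def get_bits(self, n):
        """Return the next n bits and advance the read position."""
        value = self.show_bits(n)
        self.pos += n
        return value


def build_lut(codes, max_len):
    """Expand a prefix code {bitstring: symbol} into a table indexed by max_len bits.

    Each entry holds (symbol, code_length), so one lookup decodes one symbol.
    """
    lut = {}
    for bits, symbol in codes.items():
        pad = max_len - len(bits)
        base = int(bits, 2) << pad
        for tail in range(1 << pad):
            lut[base | tail] = (symbol, len(bits))
    return lut


def decode_symbol(reader, lut, max_len):
    """Peek max_len bits, look up the symbol, then consume only its code length."""
    symbol, length = lut[reader.show_bits(max_len)]
    reader.pos += length
    return symbol


# Hypothetical three-symbol code: a = 0, b = 10, c = 11.
lut = build_lut({"0": "a", "10": "b", "11": "c"}, max_len=2)
reader = BitReader(bytes([0b01011000]))
print([decode_symbol(reader, lut, 2) for _ in range(3)])
```

In software each `show_bits` call costs several shifts, masks, and a possible buffer refill, which is the bit-level overhead the abstract argues a dedicated rotator can remove.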
In this work we discuss a technique denoted localized domain-pools in the context of parallelization of the encoding phase of fractal image compression. Performance problems occurring on distributed memory MIMD architectures may be resolved using this technique.
Block-based motion estimation (ME) has been a very active area of research in the field of video signal processing and coding. Traditional ME techniques consider only the sum of the absolute differences between the current and the reference blocks to find the motion vectors (MV). Recently, several rate-constrained ME techniques have been proposed which consider the effect of encoding the MV by adding an entropy constraint to the distortion measure. Such techniques show improved performance at low bit rates. The displaced frame difference (DFD) blocks are typically transformed using the discrete cosine transform and then quantized in the transform domain. We propose an entropy-constrained ME algorithm which considers the number of bits allotted for coding the DFDs in addition to those used for encoding the MV and the macroblock (MB) mode information. Simulation results show that the proposed technique performs significantly better than the standard technique for a wide range of bit rates. In addition, a computationally less complex scheme is also proposed, although with reduced performance improvements over the conventional scheme. The implementation aspects of the proposed schemes are also briefly discussed.
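The difference between traditional and rate-constrained block matching can be illustrated with a full search that minimizes a cost of the form SAD + λ · R(MV). The sketch below covers only the MV-rate term; it does not model the DFD or MB-mode bits that the proposed scheme adds. The function names, the use of a signed Exp-Golomb code length as a stand-in for the MV rate, and the value of `lam` are assumptions made here.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the current block and a displaced reference block."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def mv_bits(component):
    """Length in bits of a signed Exp-Golomb code, used as a stand-in MV rate."""
    code_num = 2 * abs(component) - (1 if component > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1


def rate_constrained_search(cur, ref, bx, by, n, search_range, lam):
    """Full search minimizing SAD + lam * MV bits; returns (cost, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose reference block falls outside the frame.
            if by + dy < 0 or by + dy + n > height or bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n) + lam * (mv_bits(dx) + mv_bits(dy))
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic 8 x 8 frames: the current frame is the reference shifted by (1, 1).
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 1, 7)] for x in range(8)] for y in range(8)]
print(rate_constrained_search(cur, ref, 2, 2, 4, 2, lam=0))    # pure SAD search
print(rate_constrained_search(cur, ref, 2, 2, 4, 2, lam=1e6))  # rate term dominates
```

With `lam = 0` the search reduces to the traditional SAD criterion. As `lam` grows, cheaper vectors are preferred even at higher distortion, which is the trade-off that matters at low bit rates.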
Motion estimation is a temporal image compression technique in which an n × n block of pixels in the current frame of a video sequence is represented by a motion vector with respect to the best-matched block in a search area of the previous frame, together with the DCT coefficients of the estimated error terms. In this paper, a fast technique for motion estimation is proposed and then mapped onto the SIMD structure of the computational*RAM (C*RAM). C*RAM is a conventional computer DRAM with built-in logic circuitry at the sense amplifiers to take advantage of the high on-chip memory bandwidth and massively parallel SIMD operations. The proposed technique first reduces the n-bit grayscale frames to 1-bit binary frames using morphological filters, and then searches for motion of the extracted features on the binary frames. While the reduction procedure requires a small percentage of the computation and operates on the full grayscale, the search procedure is performed by simple XOR logic operations and 1-bit distortion accumulations over the entire search area. The second part of the paper presents the mapping of the proposed technique onto the C*RAM architecture.
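The search stage described above, matching on 1-bit frames with XOR and counting mismatches, can be sketched as follows. This is an illustration under loud assumptions, not the paper's method: plain thresholding in `binarize` stands in for the morphological feature extraction, which the abstract does not specify, and the sequential loops model what C*RAM would evaluate in parallel. All function names are hypothetical.

```python
def binarize(frame, threshold):
    """Stand-in for morphological feature extraction (assumption): plain thresholding."""
    return [[1 if p >= threshold else 0 for p in row] for row in frame]


def xor_distortion(cur_bits, ref_bits, bx, by, dx, dy, n):
    """Number of mismatching bits between the current block and a displaced reference block."""
    return sum(cur_bits[by + y][bx + x] ^ ref_bits[by + dy + y][bx + dx + x]
               for y in range(n) for x in range(n))


def binary_motion_search(cur_bits, ref_bits, bx, by, n, search_range):
    """Full search on binary frames; returns (distortion, dx, dy) of the best match."""
    height, width = len(ref_bits), len(ref_bits[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates whose reference block falls outside the frame.
            if by + dy < 0 or by + dy + n > height or bx + dx < 0 or bx + dx + n > width:
                continue
            d = xor_distortion(cur_bits, ref_bits, bx, by, dx, dy, n)
            if best is None or d < best[0]:
                best = (d, dx, dy)
    return best


# A small bright feature in the previous frame, displaced by (1, 1) relative to the current frame.
prev = [[0] * 8 for _ in range(8)]
for y, x in [(3, 3), (3, 4), (4, 3), (5, 5)]:
    prev[y][x] = 200
cur = [[prev[min(y + 1, 7)][min(x + 1, 7)] for x in range(8)] for y in range(8)]
print(binary_motion_search(binarize(cur, 128), binarize(prev, 128), 2, 2, 4, 2))
```

Each candidate needs only single-bit XORs and a small counter, which is why the search suits bit-serial SIMD logic placed at the memory sense amplifiers.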