As part of our research into programmable media processors, we conducted a multimedia workload characterization study. The tight integration of architecture and compiler in any programmable processor requires evaluation of both technology-driven hardware tradeoffs and application-driven architectural tradeoffs. This study explores the latter area, providing an examination of the application-driven architectural issues from a compiler perspective. Using an augmented version of the MediaBench multimedia benchmark suite, the applications are compiled and analyzed using the IMPACT compiler. Characteristics including operation frequencies, basic block and branch statistics, data sizes, working set sizes, and scheduling parallelism are examined for purposes of defining the architectural resources necessary for programmable media processors.
The Chidi system is a PCI-bus media processor card which performs its processing tasks on a large field-programmable gate array (Altera 10K100) in conjunction with a general purpose CPU (PowerPC 604e). Special address-generation and buffering logic (also implemented on FPGAs) allows the reconfigurable processor to share a local bus with the CPU, turning burst accesses to memory into continuous streams and converting between the memory's 64-bit words and the media data types. In this paper we present the design requirements for the Chidi system, describe the hardware architecture, and discuss the software model for its use in media processing.
This paper presents a new parallelism manager for multimedia multiprocessors. An analysis of recent multimedia applications shows that the available parallelism is moving from the data level to the control level. New architectures are required to extract this kind of dynamic parallelism. Our proposed parallelism manager describes the parallelism with a topological description of the task dependence graph, which can represent varied and complex parallelism patterns. This parallelism description is separated from the program code so that the task manager can decode it in parallel with task execution. The task manager is based on a queue bank that stores the task graph. Control commands are inserted in the task dependence graph to allow dynamic modification of this graph, depending on the processed data. Simulations on classical multiprocessing benchmarks show that for simple parallelism we achieve performance similar to that of classical systems, while performance on complex applications improves by up to 12%. Multimedia applications have also been simulated; the results show that our task manager can efficiently handle complex dynamic parallelism structures.
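The topological decoding of a task dependence graph described in this abstract can be illustrated with a toy scheduler (a minimal sketch under our own assumptions, not the paper's queue-bank design): each task carries a count of unfinished predecessors and becomes ready when that count reaches zero.

```python
from collections import deque

def run_task_graph(deps, execute):
    """deps: {task: set of prerequisite tasks}; execute: callback per task.
    Runs tasks in an order consistent with the dependence graph."""
    pending = {t: len(p) for t, p in deps.items()}   # unfinished-predecessor counts
    succ = {t: [] for t in deps}
    for t, prereqs in deps.items():
        for p in prereqs:
            succ[p].append(t)                        # reverse edges: who waits on p
    ready = deque(t for t, n in pending.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        execute(t)
        order.append(t)
        for s in succ[t]:                            # retire t, releasing successors
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    return order
```

A dynamic graph, as in the paper, would additionally allow `deps` to be modified between retirements; this sketch only shows the static decode-and-dispatch core.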
We show that microSIMD architectures are more efficient for media processing than other parallel architectures like SIMD or MIMD parallel processor architectures, and VLIW or superscalar architectures. We define alternative mappings of data onto subwords, and show that the index mapping is an ideal mapping for achieving maximal subword parallelism with minimal revamping of the original serial loop code. We show an example where packed data loaded directly into registers from memory can be interpreted as index-mapped data rather than area-mapped data. This allows increased use of the subword parallelism provided by the microSIMD architecture, by exploiting data parallelism across loop iterations rather than within a loop. We also show how to convert rapidly between data mappings by using the Mix permutation instructions, first defined in the MAX-2 multimedia extensions for PA-RISC processors. We propose a new instruction, MixPair, which cuts in half the cost of parallel Mix functional units while achieving maximum subword permutation performance.
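The Mix-style subword permutation can be illustrated with a toy model. This is an assumption-laden sketch: registers are modeled as lists of four 16-bit subwords, and the selection pattern shown (interleaving even-indexed or odd-indexed subwords of two registers) is our reading of the abstract, not the exact MAX-2 encoding.

```python
def mix_left(a, b):
    """Interleave the even-indexed subwords of two 4-subword 'registers'."""
    return [a[0], b[0], a[2], b[2]]

def mix_right(a, b):
    """Interleave the odd-indexed subwords of two 4-subword 'registers'."""
    return [a[1], b[1], a[3], b[3]]

# Two packed registers, subwords labeled for readability.
a = [0xA0, 0xA1, 0xA2, 0xA3]
b = [0xB0, 0xB1, 0xB2, 0xB3]
print(mix_left(a, b))   # even subwords of a and b, interleaved
print(mix_right(a, b))  # odd subwords of a and b, interleaved
```

Applying such pairs of Mix operations repeatedly is what lets data be converted between area-mapped and index-mapped layouts without passing through memory.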
We have developed the University of Washington Image Computing Library (UWICL), a high-performance image processing library for a next-generation mediaprocessor currently under development, named the Media Accelerated Processor (MAP1000). The primary goal of this library is to provide algorithm developers and application programmers with a flexible and efficient library of core image computing functions. The UWICL is organized as a set of three hierarchical layers: each function in this multilayered framework consists of an application module, a function module, and a tight-loop module. The MAP has an intelligent DMA controller called the Data Streamer that allows efficient data flow management. In cache-based architectures, streaming image data from external memory generates many costly data-cache misses, in many cases leading to a severe performance bottleneck. The MAP's Data Streamer is designed to address this problem. To further reduce the number of data-cache misses, a ping-ponging data flow scheme is employed in UWICL functions, i.e., while the execution units are processing a block of data currently in the data cache, the Data Streamer brings the next data block into the data cache before it is actually needed. We compare the performance of key imaging functions on the MAP and the Texas Instruments TMS320C80, one of the most powerful mediaprocessors currently available. Typically, a MAP function is 1.5 to 6.6 times faster than the corresponding TMS320C80 implementation. We also demonstrate the advantages of the UWICL's multilayered library organization over a single-layered approach with an example, the implementation of the Canny edge detector. The multilayered implementation of this algorithm outperforms the single-layered version by 26%.
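The ping-ponging scheme described above is classic double buffering: compute on one block while the next is fetched. A minimal sketch, with `dma_fetch` and `process` as illustrative stand-ins for the Data Streamer and the actual image kernel:

```python
def dma_fetch(blocks, idx):
    """Stand-in for a Data Streamer transfer: return block idx, or None past the end."""
    return blocks[idx] if idx < len(blocks) else None

def process(block):
    """Stand-in for the real image kernel operating on an in-cache block."""
    return sum(block)

def stream_process(blocks):
    results = []
    current = dma_fetch(blocks, 0)        # prefetch block 0 into the 'ping' buffer
    for i in range(len(blocks)):
        nxt = dma_fetch(blocks, i + 1)    # DMA fills the 'pong' buffer in parallel
        results.append(process(current))  # execution units work on the 'ping' buffer
        current = nxt                     # swap roles: pong becomes ping
    return results
```

In the real library the two steps overlap in time; here they are sequential, but the buffer-swapping structure is the same.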
A wavelet Application Programmers' Interface (API) for the TriMedia TM1000 multimedia processor will be described and demonstrated. The challenge undertaken for this API was to provide a single scalable interface to a subband encoder with acceptable performance at bit rates from 9600 bps to 2 Mbps. Further, the API as designed supports the utilization of multiple TriMedia resources. Specifically, the encoder to be demonstrated is a fully motion-compensated wavelet transform that does not utilize the Discrete Cosine Transform for the encoding of the prediction errors. The video encoder has numerous novel features, including an integer-valued, limited-precision wavelet transform; a fully motion-compensated wavelet API; integral support for multiple processors; and simple mapping to hardware.
Convolution is widely used as an effective tool for enhancing image features, such as points, lines, or edges, and for smoothing noise. One major challenge in implementing convolution in real time has been its large computational requirement. For example, convolving a 512 X 512 image with a 7 X 7 kernel requires 50 million operations. Therefore, to achieve the computational performance needed in real-time applications, hardwired solutions with ASICs and/or fixed-function chips with little programmability have been used. The disadvantages associated with hardwired implementations are that they are rigid, unifunctional and not upgradable. Our approach has been programmable convolution, which is flexible, multi-functional, easily upgradable and has performance comparable to the hardwired implementations. This paper describes an efficient algorithm for convolution that can be implemented in software on the new generation of VLIW mediaprocessors. These processors can perform multiple multiplication, addition and load/store operations in a single instruction, which can be used effectively in convolution to reduce the execution time. We have implemented this algorithm on a new mediaprocessor called the MAP1000, where it takes 8.6 ms for the convolution of a 512 X 512 image with a 7 X 7 kernel. This performance is 7 times faster than the previously reported software-based convolution on the Texas Instruments TMS320C80 mediaprocessor and is comparable with hardwired implementations for the same image and kernel size. This algorithm and its implementation on the next-generation programmable mediaprocessor clearly demonstrate the feasibility of software-based convolution.
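A direct, unoptimized reference for the operation being accelerated may be useful here. Per output pixel it performs exactly kernel-height x kernel-width multiply-accumulates, which is the cost the VLIW implementation above amortizes across its parallel multiply, add, and load/store units (this is a definitional sketch, not the paper's MAP1000 algorithm):

```python
def convolve2d(img, kernel):
    """Direct 2D convolution over the valid region (no border handling).
    img and kernel are lists of lists of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += img[y + j][x + i] * kernel[j][i]  # one multiply-accumulate
            row.append(acc)
        out.append(row)
    return out
```

For a 512 X 512 image and 7 X 7 kernel the inner multiply-accumulate runs roughly 512 x 512 x 49 times, which is the scale of work behind the operation counts quoted in the abstract.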
Acceptable video quality in standards-based video communication requires the use of hardware acceleration. Traditional approaches in recent years have been a hardwired approach, in which video compression algorithms cannot be changed, and a fully programmable approach, in which a single VLIW processor is programmed in a high-level language (such as `C') or assembly. The drawback of the hardwired approach has been inflexibility and the difficulty of adding performance enhancements, since new silicon is needed to make improvements. The drawback of the fully programmable approach is the difficulty of developing software and tools for the processor, with the net result of lengthening the overall design cycle time. A new approach, described in this paper, uses multiple VLIW processing elements, some of which are pipelined, that have reprogrammable microcode. This allows particular algorithm classes, such as DCT/IDCT, motion compensation or motion estimation, to be reprogrammed to improve performance without changing the silicon. First, overviews of a new MPEG-4 Video Communication system architecture and its hardware synthesis process are given. Then, two methods for reprogramming the microcode for the motion estimator are discussed and compared. Finally, comparisons of our system architecture to the traditional hardwired architecture and the fully programmable architecture are given.
The planned transition from analog to digital television will affect most consumers before the next decade is over. Once analog broadcasting has been replaced with digital broadcasting, consumers will be required to purchase set-top boxes or new digital TVs. While the transition has already started, there is much confusion in the industry as to how to implement digital television. The HDTV standard specifies how digital TV signals are to be broadcast and encoded, but many typical broadcast operations have been overlooked. Also at this time, industry is looking towards computer-TV convergence as a way to make TV entertainment more interactive and less passive. This paper addresses what role MPEG-4 may play in the upcoming DTV evolution and how it can solve problems not addressed by the MPEG-2 standard.
One important factor in deciding the success of a new consumer product or integrated circuit is minimized time-to-market. A rapid prototyping methodology that encompasses algorithm development in the hardware design phase can greatly reduce time-to-market. In this paper, a proven hardware design methodology and a novel top-down design methodology based on Frontier Design's DSP Station tool are described. The proven methodology was used during development of the MC149570 H.261/H.263 video codec manufactured by Motorola. This paper discusses an improvement to this method that creates an integrated environment for both system and hardware development, thereby further reducing time-to-market. The rich features of the DSP Station tool are described, and it is shown how these features may be useful in designing from algorithm to silicon. How this methodology may be used in the development of a new MPEG-4 Video Communication ASIC is outlined. A brief comparison with a popular alternative, the Signal Processing WorkSystem tool by Cadence, is also given.
Within the European Emphasis project, the architecture of a co-processor for MPEG-4 video rendering has been specified. This co-processor has a dedicated unit that performs the inverse perspective transform of video objects for final frame composition. This paper presents a complete study of computation accuracy and proposes a possible architecture for hardware implementation.
The new MPEG-4 standard allows for interactivity, high compression, and/or universal accessibility and portability of multimedia content (natural audio and video, synthetic content). The visual part of MPEG-4 specifies algorithms for object-oriented audio-visual coding. In addition to the conventional frame-based functionalities of the MPEG-1 and MPEG-2 standards, MPEG-4 video coding also supports arbitrarily shaped video objects; therefore the concept of video object planes (VOPs) has been introduced. Each frame of an input video sequence is segmented into a number of regions, which may cover image or video content of interest. These regions are encoded in the so-called alpha plane, giving the object's contour. In contrast to the MPEG-1/2 standards, the video input is no longer considered a rectangular region. The shape, motion and texture information of a VOP is encoded and transmitted in a video object layer, covering all information of one video object (VO). Similar to the MPEG-1/2 coders, the MPEG-4 video coding scheme processes the successive images of a VOP sequence in a block-based manner (e.g. motion estimation/compensation, DCT). Coding and decoding are based on macroblocks (MBs) of 16x16 pixel size. Therefore an image padding technique is applied to the macroblocks of an MPEG-4 image which contain the shape edge of an object; these blocks are called contour macroblocks, and their non-object pixels are filled using the padding technique. This technique, which will be described in detail in chapter two, turns out to be a computationally complex and very irregular operation. A dedicated hardware accelerator for the MPEG-4 padding algorithm has been designed to remove this task from a general-purpose host processor. The accelerator architecture exploits the data dependencies of the padding algorithm to allow a very high macroblock throughput rate.
The global architecture is described in chapter three, while the data dependent scheduling of operation is sketched in chapter four. Chapter five will give a conclusion about the performance results and the hardware cost of the module.
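The padding idea for contour macroblocks can be sketched as follows. This is a simplified, horizontal-pass-only illustration under our own assumptions; the actual MPEG-4 algorithm also performs a vertical pass and has rules for fully transparent blocks, which is part of what makes it irregular and hard to accelerate.

```python
def pad_row(pixels, alpha):
    """Fill non-object pixels in one row of a contour macroblock.
    alpha[x] is nonzero where pixel x belongs to the object.
    Non-object pixels between two object pixels get their average;
    those outside the object span replicate the nearest edge pixel."""
    out = list(pixels)
    object_cols = [x for x, a in enumerate(alpha) if a]
    if not object_cols:
        return out  # fully transparent row: left to the (omitted) vertical pass
    for x in range(len(pixels)):
        if not alpha[x]:
            left = max((c for c in object_cols if c < x), default=None)
            right = min((c for c in object_cols if c > x), default=None)
            if left is not None and right is not None:
                out[x] = (pixels[left] + pixels[right]) // 2  # between two object edges
            else:
                out[x] = pixels[left if left is not None else right]  # replicate edge
    return out
```

The per-pixel searches for the nearest object pixel are exactly the data-dependent, shape-driven accesses that the dedicated accelerator exploits.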
Multimedia system design presents challenges from the perspectives of both hardware and software. Each medium in a multimedia environment requires different processes, techniques, algorithms and hardware implementations. Multimedia processing, which necessitates real-time digital video, audio, and 3D graphics processing, is as essential a part of new systems as 2D graphics and image processing is of current systems. Multimedia applications require efficient VLSI implementations of various media processing algorithms. Emerging multimedia standards and algorithms will result in hardware systems of high complexity. In addition to recent advances in enabling VLSI technology for high-density and fast circuit implementations, special investigation of architectural approaches is also required. In the past few years, multimedia hardware design has captured much attention among researchers. New programmable processors, high-speed storage and modern parallelism techniques are among the variety of subjects being addressed in this domain. A detailed categorization of available multimedia processing strategies is required to help designers adapt these techniques into new architectures. Some of the important options in multimedia hardware design include: processor structure, parallelization and granularity, data distribution techniques, instruction-level parallelism, memory interface and flexibility. In this paper, we address important issues in the design of a programmable multimedia processor.
In the past several years, there has been a surge of new programmable mediaprocessors introduced to provide an alternative solution to ASICs and dedicated hardware circuitries in the multimedia PC and embedded consumer electronics markets. These processors attempt to combine the programmability of multimedia-enhanced general purpose processors with the performance and low cost of dedicated hardware. We have reviewed five current multimedia architectures and evaluated their strengths and weaknesses.
In this paper, we present phase one of a three-phase project. The objective of the project is to design, develop and implement a flexible programmable video coprocessor. The processor targets applications for the MPEG format. Six basic processing tasks have been identified as the main job of the coprocessor. They contribute to a wide variety of operations frequently needed by multimedia applications. These tasks are resolution conversion, frame rate changing, quality and rate control (bits per pixel), filtering, video compositing and video cut detection. This phase presents a critical, comprehensive study of the algorithms capable of performing these operations in the DCT domain. Two cases were considered, with and without motion compensation. This phase is an essential step toward laying down the architecture of the different modules and achieving the most efficient implementation. Also included in this paper are the design philosophy that has been adopted, the design objectives that were set, and an outline of the coprocessor building blocks along with their interactions. Phase two will cover the details of the coprocessor design and, finally, phase three will be the actual implementation and testing of the chip.
Half-pel motion compensation, unlike its full-pel counterpart, requires the availability of up to four pixels from the reference picture to generate each compensated pixel. To compensate a 16 X 16 macroblock, a 17 X 17 array of pixels is needed. The number of memory access cycles necessary to process a macroblock, if half-pel motion compensation is employed, is greater than the number otherwise needed by 33, or 13% of the macroblock size. In some motion prediction modes, two 17 X 9 pixel arrays are used, and the number of additional cycles increases to 50, or 20% of the macroblock size. This affects the timing requirements for digital video decoding. In particular, a clock frequency higher than the pixel rate is required, as is buffering for pixel data to convert between the two rates. This paper considers the above problem and presents a method of reference picture memory access that eliminates the additional processing time required for half-pel motion compensation.
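The access-count arithmetic in this abstract can be spelled out directly (a check of the quoted figures only; nothing here is specific to the paper's proposed memory access method):

```python
# Full-pel compensation of a 16 x 16 macroblock reads 16 x 16 reference pixels;
# half-pel compensation needs one extra row and column: a 17 x 17 array.
full_pel = 16 * 16              # 256 reference pixels
half_pel = 17 * 17              # 289 reference pixels
extra = half_pel - full_pel
print(extra, round(100 * extra / full_pel))    # 33 extra accesses, ~13% of 256

# The prediction modes that read two 17 x 9 arrays instead:
two_arrays = 2 * (17 * 9)       # 306 reference pixels
extra2 = two_arrays - full_pel
print(extra2, round(100 * extra2 / full_pel))  # 50 extra accesses, ~20% of 256
```

Both results match the 33-cycle (13%) and 50-cycle (20%) overheads quoted above; eliminating them is the point of the proposed access scheme.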
Fractal image compression is a relatively new and effective technique with a high compression ratio and short decoding time. Its disadvantage, however, is evident: the time consumed in the encoding procedure is enormous. Here we develop a new encoding approach that gains a drastic improvement in speed compared with the conventional method (by Fisher and Jacobs). The essence of the algorithm is two-step matching, rather than one-step matching, when comparing domains with a range. In the first step we select some candidate domain blocks (CDBs) that are `nearer' to the range, and in the second step we select the best-matched domain from the CDBs. Both steps are very simple; as a result, the total time spent in the two steps is even shorter than in one step. Experiments show that the improved algorithm is 2 to 4 times faster than the conventional one (by Fisher and Jacobs). Furthermore, the quality of the recovered images is almost the same as that obtained with the conventional method, with at most a 0.1 dB reduction. In addition, the MMX technique is employed in the core part of the algorithm; experiments indicate that with MMX the speed is nearly 3 times as fast as before.
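The two-step matching idea can be sketched as follows. The abstract does not define its `nearness' measure, so mean-intensity distance is used here purely as an illustrative stand-in, and blocks are flattened to 1-D lists for brevity:

```python
def mean(block):
    return sum(block) / len(block)

def mse(a, b):
    """Full (expensive) comparison: mean squared error between two blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def two_step_match(range_block, domains, n_candidates=4):
    """Step 1: cheap filter keeps only the domains whose mean intensity is
    closest to the range block's. Step 2: the expensive comparison runs
    only over those candidates."""
    m = mean(range_block)
    candidates = sorted(domains, key=lambda d: abs(mean(d) - m))[:n_candidates]
    return min(candidates, key=lambda d: mse(range_block, d))
```

The speedup comes from running the expensive step-2 comparison on a few candidates instead of on every domain block; the cheap step-1 scan dominates but costs only one subtraction per domain once means are precomputed.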
This paper presents how to implement the block-matching motion estimation algorithm for video compression efficiently on a Field-Programmable Gate Array (FPGA) based Custom Computing Machine (CCM). The SPACE2 custom computing board consists of up to eight Xilinx XC6216 fine-grain, sea-of-gates FPGA chips. The results show that two Xilinx XC6216 FPGAs can perform at 960 MOPS; hence a real-time full-search motion estimation encoder can easily be implemented on our SPACE2 CCM system.
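The computation such a design accelerates is full-search block matching, commonly scored with a sum-of-absolute-differences (SAD) criterion. A pure-Python reference sketch (the SAD metric and all parameter names are our assumptions; the abstract does not detail the FPGA datapath):

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, bx, by, n, search):
    """Exhaustively test every displacement in [-search, search]^2 and
    return ((dx, dy), cost) for the best match."""
    best = (None, float('inf'))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # skip displacements that fall outside the reference frame
            if not (0 <= by + dy and by + dy + n <= len(ref)
                    and 0 <= bx + dx and bx + dx + n <= len(ref[0])):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best[1]:
                best = ((dx, dy), cost)
    return best
```

Every candidate displacement is scored independently with identical arithmetic, which is why full search maps so naturally onto arrays of fine-grain FPGA processing elements.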
Telepresence teaching is a very useful means of providing university teaching wherever there is a lack of local resources. To be effective it requires a user-friendly environment for both teachers and learners. The experience of the Technical University of Milan in delivering telepresence courses has prompted the investigation and development of effective teleteaching equipment. A layered equipment architecture has been introduced, and user requirements and available technologies are intertwined to identify equipment functional specifications. Modular equipment such as a Teacher Tracker, a Laser Beam Mouse, a Student Pointer and a Multiplexed Videocodec is among the designed and prototyped items. The activity indicates the possibility of applying well-established technologies in teleteaching systems by promoting university laboratories where user requirements and a large spectrum of technologies are investigated, to anticipate useful products whenever market specifications are not yet established.
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio. While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be exploited by VLSI architectures. The Imagine architecture supports the stream programming model by providing a bandwidth hierarchy tailored to the demands of media applications. Compared to a conventional scalar processor, Imagine reduces the global register and memory bandwidth required by typical applications by factors of 13 and 21 respectively. This bandwidth efficiency enables a single chip Imagine processor to achieve a peak performance of 16.2GFLOPS (single-precision floating point) and sustained performance of up to 8.5GFLOPS on media processing kernels.
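The stream programming model described above can be illustrated minimally (a sketch of the model only, not Imagine's actual kernel language): records flow through kernels as streams, each record touched once per kernel, which exposes both the parallelism (records are independent) and the locality (intermediate values never leave the pipeline).

```python
def kernel_scale(stream, k):
    """A kernel: consume a stream of records, emit each scaled by k."""
    for rec in stream:
        yield rec * k

def kernel_clamp(stream, lo, hi):
    """A kernel: consume a stream, emit each record clamped to [lo, hi]."""
    for rec in stream:
        yield min(max(rec, lo), hi)

# Compose kernels into a pipeline; records stream through one at a time.
pixels = [10, 200, 130]
pipeline = kernel_clamp(kernel_scale(pixels, 2), 0, 255)
print(list(pipeline))   # [20, 255, 255] -- 400 and 260 clamp to 255
```

Generators make the single-pass, producer-consumer structure explicit: no kernel rereads its input, mirroring the low data reuse and bandwidth-hierarchy friendliness the abstract attributes to media applications.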