This paper presents an overview of architectures for multimedia applications. Emphasis is placed on flexible, programmable processors that can handle different standardized or proprietary multimedia applications. Several parallelization strategies for enhancing performance, especially for video coding, are described, covering architectures such as SIMD, MIMD, and associative controlling. Exploitation of instruction-level parallelism through techniques such as VLIW and packed arithmetic extends this discussion. Reference is made to design examples from the literature. To help develop cost-effective architectures for a set of applications, two methods for modeling hardware and algorithms are explained. Instrumentation of algorithms implemented in software is discussed as a method for determining the characteristics and features of given algorithms. Additionally, a more general approach is presented that analyzes different parallelization potentials for a class of algorithms; these are mapped onto a simple hardware model that uses only a few parameters for performance evaluation. Limitations of software instrumentation and of the presented modeling approach are discussed.
Future desktop and portable computing systems will have as their core an integrated multimedia system. Such a system will seamlessly combine digital video, digital audio, computer animation, text, and graphics. Furthermore, such a system will allow for mixed-media creation, dissemination, and interactive access in real time. Multimedia architectures that need to support these functions have traditionally required special display and processing units for the different media types. This approach tends to be expensive and is inefficient in its use of silicon. Furthermore, such media-specific processing units are unable to cope with the fluid nature of the multimedia market, wherein needs and standards are changing and system manufacturers may demand a single-component media engine across a range of products. This constraint has led to a shift towards providing a single-component, multimedia-specific computing engine that can be integrated easily within desktop systems, tethered consumer appliances, or portable appliances. In this paper, we review some of the recent architectural efforts in developing integrated media systems. We primarily focus on two efforts, namely the evolution of multimedia-capable general-purpose processors and a more recent effort in developing single-component mixed-media co-processors. Design considerations that could facilitate the migration of these technologies to a portable integrated media system are also presented.
This paper presents a design methodology for a high-performance, programmable video signal processor (VSP). The proposed design methodology explores both technology-driven hardware tradeoffs and application-driven architectural tradeoffs for optimizing cost and performance within a class of processor architectures. In particular, this methodology allows concurrent consideration of these competing factors at different levels of design sophistication, ranging from early design exploration to full processor simulation. We present the results of this methodology for an aggressive very-long-instruction-word (VLIW) video signal processor design and discuss its utility for other programmable signal processor designs.
The generation of video and audio coding methods that will follow the present, pioneering generation has not yet been standardized, but it is possible to predict many of its characteristics. I discuss these, with particular reference to their impact on the design of software and hardware systems for multimedia.
The H-Bus is a dedicated input bus designed for transmitting digital media streams from multiple sources to one acquiring host device. Unlike previous solutions to this problem, which utilized a star topology to connect each source to a bank of analog-to-digital converters in the host device, the H-Bus places the A/D converters at each of the sources and uses a linear topology to connect the devices together in one contiguous chain. In this paper we explain the engineering requirements that motivated our design and describe the electrical and physical characteristics of, and applications for, the H-Bus.
Many increasingly important applications, such as video compression, graphics, or other multimedia applications, require only 8- or 16-bit data words. Using the full 64-bit data path available in most computers to perform these low-precision calculations is an inefficient use of resources. Many manufacturers have addressed this problem by introducing new instructions that allow the packing of subword quantities into a full data word. This paper presents a new software-only technique that accomplishes this same objective by packing subword integer quantities into a double-precision floating-point word. This technique works even in machines that have not been specially modified with new multimedia instructions. While the idea of packing subword integer quantities into a larger integer data word has been proposed before, this technique is unique in packing integer subwords into a single floating-point word with a shared exponent. The traditional floating-point arithmetic operations of add, subtract, and multiply are used to pack and unpack the subword quantities. Therefore, the algorithm will work on any machine that supports the IEEE double-precision floating-point arithmetic standard, with no machine-specific code required. Furthermore, the methodology can be implemented in a high-level language such as C. In this paper we describe this technique in general and then demonstrate its validity by implementing it in a public-domain MPEG decoder application, mpeg_play, distributed by the Berkeley Plateau Multimedia Research Group. We achieved an average speed-up of 13.8%. While there is some degradation in quality because calculations are carried out with lower precision, there is no noticeable difference in image appearance. A quantitative comparison of the image quality is presented.
Integrating processing elements in DRAM makes very large bus widths available: at least 2K processing elements fit in a 4 Mb chip, or 4K in a 16 Mb DRAM. The processors can add an area overhead as low as 10% and a power overhead of about 10-25%. To achieve these efficiencies, the processors have to be pitch-matched to the DRAM. Interprocessor communication is also severely limited, especially when going 'off-chip' while retaining low-cost packaging. These 'computing RAMs' (C*RAM) can form the main memory for SISD or MIMD hosts, making their contributions to the computing load scalable. The SIMD nature of C*RAM matches large image-processing tasks with high uniformity and locality of reference, making real-time DCT, anti-aliasing, and a variety of transformations available at the low cost required for consumer applications. Even given a PE 'budget' of 70-200 transistors, and with the limited interconnect characteristic of low-cost DRAM, there are quite a few architectural choices available to the computer architect. These can be made to favor the data widths and operations needed for image processing while retaining good generality.
The DSP architecture PRISMA for object-based video signal processing is presented in this paper. Considering the specific hardware requirements of object-based algorithms, a parallel architecture has been developed that consists of 8 programmable data paths. To utilize the processing power provided by these data paths, a new controlling scheme is employed by the PRISMA processor. This dynamic associative controlling distributes 3 independent instruction streams to the 8 data paths and combines the advantages of alternative controlling approaches, such as SIMD and MIMD. It allows efficient execution of data-dependent operations as well as flexible partitioning of the processing resources at runtime, which is advantageous for parallel processing of concurrent objects with different performance requirements.
Visual media processing is becoming increasingly important because of the wide variety of image- and video-based applications. Recently, several architectures have been reported in the literature to implement image and video processing algorithms. They range from programmable DSP processors to application-specific integrated circuits (ASICs). DSPs have to be software-programmed to execute individual operations in image and video processing. However, they are not suitable for real-time execution of highly compute-intensive applications such as fractal block processing (FBP). On the other hand, dedicated architectures and ASICs are designed to implement specific functions. Since they are optimized for a specific task, they cannot be used in a wide variety of applications. In this paper, we propose a parallel and pipelined architecture called fractal engine to implement the operations in FBP. Fractal engine is simple, modular, scalable, and optimized to execute both low-level and mid-level operations. We note that implementation of the basic operations by fractal engine enables efficient execution of a majority of visual computing tasks. These include spatial filtering, contrast enhancement, frequency-domain operations, histogram calculation, geometric transforms, indexing, vector quantization, fractal block coding, motion estimation, etc. The individual modules of fractal engine have been implemented in VHDL (VHSIC hardware description language). We have chosen to demonstrate the real-time execution capability of fractal engine by mapping a fractal block coding (FBC) algorithm onto the proposed architecture.
Chromatic announced the Mpact media processor architecture in October 1995. In October 1996 hardware and software were released to production and the second generation architecture was announced. This paper begins with an update on the first generation media processors, with hardware and software status. Next, it covers the growing goals of multimedia performance and describes Chromatic's next generation media processor architecture. Finally, the newer modules of the architecture are discussed in more detail.
Lately, VLIW architectures have become popular because of their good cost-performance ratio for applications such as multimedia. Multimedia applications are characterized by regular signal processing and, therefore, lend themselves to compiler analysis. VLIW architectures exploit this by scheduling the instruction stream at compile time and, thus, reducing the complexity and cost of instruction-issue hardware. However, we sometimes encounter signal processing algorithms that we would like to be regular and predictable but that are so only to a certain extent. Polyphase filtering is one such algorithm. It contains a regular filter part, but its input and output streams run at rates that are not correlated to each other in a simple way. Compile-time analysis is, therefore, only partly possible, which poses an inherent problem for VLIW architectures. In this paper, we describe the steps that we went through to optimize the polyphase filter for a specific instance of a VLIW architecture: the Philips TriMedia processor. We show which architectural features help to make the TriMedia processor more efficient for such irregular algorithms.
This paper describes how media processing programs may be accelerated by using the multimedia instruction extensions that have been added to general-purpose microprocessors. As a concrete example, it describes MAX2, a minimalist, second-generation set of multimedia instructions included in the PA-RISC 2.0 processor architecture. MAX2 implements subword parallel instructions, which utilize the microprocessor's 64-bit wide data paths to process multiple pieces of lower-precision data in parallel. It also includes innovative new instructions like Mix, which are very useful for matrix transpose and other common data rearrangements. The paper examines some typical multimedia kernels, like block match, matrix transpose, box filter, and the IDCT, coded with and without the MAX2 instructions, to illustrate programming techniques for exploiting subword parallelism and superscalar instruction parallelism. The kernels using MAX2 show significant speedups in execution time and more efficient utilization of the processor's resources.
Compression of video data is a highly compute-intensive activity consisting of both regular vector-style computations and general algorithmic computations. Furthermore, the conventional compression algorithms allow the encoder some degrees of freedom in the encode process, where picture quality and degree of compression can be traded off for amount of computation. These characteristics have led to a variety of approaches to video encoding. At one extreme, real-time compression can be achieved through the use of high-performance vector and general-purpose co-processors to generate high compression ratios and high quality. At the other end of the spectrum, compression can be performed in real-time quite easily by doing minimal analysis of the picture to enhance quality or improve compression. The DECchip 21230 strikes a compromise between these two extremes by supporting the regular vector-style computations on an inexpensive co-processor chip, but does most of the general algorithmic computation on the host CPU. This partitioning leads to a number of scheduling and buffering challenges that are addressed by a novel decomposition of the encoding process.
Graphics systems have long used standard libraries or APIs to insulate applications from implementation specifics. The same approach is applicable to natural image representations based on object primitives, such as those proposed for MPEG-4 standardization. The rendering of these image objects can be hidden behind APIs and supported either in hardware or software, depending on the level of representation they address, so that higher-level manipulation of these objects is made independent of the pixel level. We evaluate the trade-offs involved in the choice of these primitives to be used as pivotal intermediate representations. The example addressed is shape coding for image regions obtained from segmentation. Shape coding primitives based on contour (chain codes), union of elementary patterns, and alpha plane are evaluated with regard to both the feasibility of supporting them on different architecture models and the level of functionality they make available.
Since its introduction as a consumer product in 1994, the digital set-top box has experienced rapid growth and is predicted to grow at an even stronger rate through the end of this decade. In just the past three years, the architecture of MPEG decoders has undergone several changes. Initially the audio and video decoders were separate chips, but recently they have been integrated onto a single chip along with the transport, encryption, and user-interface functions of the set-top box. In addition to a higher level of integration, the system memory requirements have been reduced and more features for the end user have been added. In the future, the remainder of the digital functions in the set-top box will be integrated onto a single chip. To be successful in providing a cost-effective solution, a mixture of hardware and software modules will be needed to provide the appropriate amount of flexibility and the smallest implementation. The hardware/software partitioning will change with each technology node until a fully programmable implementation becomes the most cost-effective solution.
A new architecture for real-time MPEG-2 encoding/decoding is presented in this paper. This architecture is based on an array of TI MVPs. The main feature of this architecture is its programmability. The inherent parallelism of the MPEG-2 algorithm is investigated in order to map it to the processor array. An I/O algorithm for the major encoding function, motion estimation, is developed to demonstrate the possibility of overlapping processing and I/O.
In this paper, a computational random access memory (C*RAM) implementation of the MPEG-2 video compression standard is presented. This implementation has the advantage of processing image/video data in parallel and directly in the frame buffers. Therefore, savings in execution time and I/O bandwidth are achieved through massively parallel on-chip computation and reduced data transfer among chips. As a result, MPEG-2 video encoding can be realized in real-time on a programmable 64 Mb DRAM-based C*RAM.
A novel approach to the problem of clock synchronization in software MPEG-2 decoders is presented. A software MPEG decoder is attractive in terms of cost and performance. However, a software decoder is prone to timing uncertainty and delay jitter. By a clever use of adaptive filtering and subsampling of time stamps, a frequency-locked loop can be designed that delivers almost instantaneous capture of the unknown encoder clock frequency with high tolerance to delay jitter. The exact analysis of the system is complex. By invoking suitable approximations, a complete design methodology is derived. Computer simulations verify the illustrated design approach.
Recent developments in the technology of digital video compression, transmission, and displays have made multiple-viewpoint digital video viable for many applications, e.g., stereoscopic view for 3D TV and arbitrary-angle scene compositing for virtual camera. To facilitate these applications, the Motion Pictures Experts Group (MPEG) of the International Standards Organization (ISO), which successfully created the MPEG-1 and MPEG-2 standards, has been working on amending the MPEG-2 standard to create a new MPEG-2 profile, called the Multi-View Profile (MVP). MVP features a two-layer -- base layer and enhancement layer -- video coding scheme. The base layer video is coded as an MPEG-2 main profile (MP) bitstream. The enhancement layer video is coded with temporal scalability tools and exploits the correlation between the two viewing angles to improve compression efficiency. This two-layer approach guarantees backward and forward compatibility with main profile receivers and encoders; i.e., an MVP decoder will be able to decode any main profile bit stream at the same level, and a main profile receiver will be able to decode the base layer of an MVP stream to generate and display mono-view scenes of the same program. In stereoscopic video applications the base layer is assigned to the left-eye view and the enhancement layer to the right-eye view. This paper provides an overview of the MPEG-2 MVP and focuses in detail on the stereoscopic video compression algorithms.
The MPEG audio-video synchronization algorithms and a couple of implementation approaches are presented. This paper discusses the sources that cause audio and video to fall out of synchronization, the apparatus for settling synchronization status, and the mechanisms and risks of resynchronization. MPEG and AC3 audio formatting, video-on-demand (VOD), digital video disc (DVD), and CD-ROM features and applications are discussed. An area-reduced, scalable megacell handling the synchronization status settling is proposed.
A software MPEG decoder, though attractive in terms of performance and cost, opens up new technical challenges. The most critical questions are: When does a software decoder drop a frame? How can its timing performance be predicted well ahead of its implementation? It is not easy to answer these questions without introducing a stochastic model of the decoding time. With a double-buffering scheme, fluctuations in decoding time can be smoothed out to a large extent. However, dropping of frames cannot be totally eliminated. New ideas of slip and asymptotic synchronous locking are shown to answer critical design questions of a software decoder. Beneath the troubled world of frame droppings lies the beauty and harmony of our stochastic formulation.
An MPEG-2 Main Profile, High Level compliant HDTV video decoder requires a variable length decoder (VLD) that can decode macroblocks at rates exceeding 100 million code words per second. The implementation of a high-performance VLD for such an application presents a major challenge in architecture design. The capability of the VLD to process macroblocks in real-time can reduce system memory requirements and simplify decoder architectures. It is more desirable to conceive of a 'one-piece' VLD capable of operating with minimal logic and memory resources and in real-time. Parallel partitioning of the video decoder at the VLD level increases the overall complexity and memory utilization. The two-word bit stream segmentation method developed by Philips Research -- USA achieves high performance without the expense of high hardware complexity or the addition of extra system memory. In this respect, this VLD implementation is very suitable for consumer digital HDTV video decoders. However, the performance guarantee of this architecture is associated with carefully specified statistical tradeoffs. The described method of pair-match Huffman transcoding provides the VLD performance guarantee on a macroblock level without any statistical tradeoffs. Applied to the main body of the bit stream, this method produces excellent performance results for both consumer and professional MPEG profiles.
We have developed a high-speed image file server for a super-high-resolution (SHR) image display system. The SHR system displays images with a resolution of 2000 TV lines, and the data of one still image is about 24 MB in size. Therefore, we need an image file server that has a 24 MB/s data-transfer rate and that enables random access to any of the stored data. We applied this high-speed image file server to the SHR system, which requires 24 MB of image data per frame (equivalent in quantity to the UDTV1 standards) and is capable of continuously displaying one frame per second.
A fast disk array is designed for large continuous-image storage. It includes a high-speed data path architecture and the technology of data striping and organization on the disk array. The high-speed data path, constructed from two dual-port RAMs and some control circuitry, is configured to transfer data between a host system and a plurality of disk drives. The bandwidth can be more than 100 MB/s if the data path is based on PCI (peripheral component interconnect). The organization of data stored on the disk array is similar to RAID 4. Data are striped across a plurality of disks, and each striping unit is equal to a track. I/O instructions are performed in parallel on the disk drives. An independent disk is used to store the parity information in the fast disk array architecture. By placing the parity generation circuit directly on the SCSI (or SCSI 2) bus, the parity information can be generated on the fly, with little effect on the parallel data writing to the other disks. The fast disk array architecture designed in this paper can meet the demands of image storage.
Over the last decade, the video camera has become a common diagnostic tool for many scientific, industrial, and medical applications. The amount of data collected by video capture systems can be enormous. For example, standard NTSC video requires 5 MBytes/sec, with many groups wanting higher resolution in bit depth, spatial resolution, and/or frame rate. Despite great advances in video capture systems developed for the mass media and teleconferencing markets, the smaller markets of scientific and industrial applications have been ignored. This is primarily due to their need to maintain the independent nature of each camera system and to maintain the high quality of the video data. Many of the commercial systems are capable of digitizing a single camera (B/W or color) or multiple synchronized B/W cameras using an RGB color video capture chip set. In addition, most manufacturers utilize lossy compression to reduce the bandwidth before storing the data to disk. To address the needs of the scientific community, a high-performance data and video recorder has been developed. This system utilizes field programmable gate arrays (FPGAs) to control the analog and digital signals and to perform real-time lossless compression on the incoming data streams. Due to the flexibility inherent in the system, it can be configured for a variety of camera resolutions, frame rates, and compression algorithms. In addition, alternative general-purpose data acquisition modules are also being incorporated into the design. The modular design of the video/data recorder allows the carrier components to be easily adapted to new bus technology as it becomes available, or the data acquisition components to be tailored to a specific application. Details of the recorder architecture are presented along with examples applied to thermonuclear fusion experiments.
A lossless compression ratio of 3:1 has been obtained on fusion plasma images, with further reductions expected, allowing the video recorder to capture up to ten independent video inputs and apply the compression in real-time.
An optical mass storage system using high-density recording media has been developed for large file systems such as multimedia databases, video-on-demand servers, and file backup systems. Quadruple-density 130-mm diameter magneto-optical disks raise the maximum capacity of this system to 2 terabytes. The use of high-density media reduces the cost of file storage and the installation area and cuts the data transfer time. Compatibility with current systems using optical disks makes it easy to introduce this new system in place of older systems. To achieve a high transfer rate, we developed two disk drives. One uses two laser beams on one positioner and a new verification technique in the recording cycle by the write beam. The other uses overwritable media to reduce the overhead in the write cycle. The system cost is reduced by using standard components and high-density media, and high reliability is achieved by using cross access to drives and medium-handling mechanisms. This system is suitable for storing archived files on-line because it has enough capacity and access speed, unlike conventional systems whose archived data is stored off-line on magnetic tapes.
Compression in the context of texture and bump mapping brings high-performance texture mapping within the range of low-cost systems. The size of the texture memory is significantly reduced, and the necessary bandwidth between the memory and the texturing unit is lowered. Textures are compressed using a lossy compression method based on vector quantization. We propose an alternative approach to embossing textured surfaces, where bump maps are described with bitmaps and are part of the compressed texture data. The design of a texturing circuit architecture is presented in which embossing or engraving of textured surfaces is executed by dedicated decoding and filtering hardware. The circuit decompresses textures and decodes compressed bump maps to produce wrinkled textured surfaces. Applying encoding to normal vectors results in narrower data paths. Vectors are represented and handled in a compressed format defined by their vertical and horizontal angles. In order to enhance subjective image quality, an optional space-variant filter can be applied locally during the texture mapping phase to reduce the artifacts introduced by the lossy compression method.