In the field of consumer electronics, the advent of new features such as Internet access, games, video conferencing, and mobile communication has triggered the convergence of television and computer technologies. This convergence requires a generic media-processing platform that enables simultaneous execution of very diverse tasks, such as high-throughput stream-oriented data processing and highly data-dependent irregular processing with complex control flows. As a representative application, this paper presents the mapping of an MPEG-4 Main Visual profile decoder for High-Definition (HD) video onto a flexible architecture platform. A stepwise approach is taken, going from the decoder application toward an implementation proposal. First, the application is decomposed into separate tasks with self-contained functionality, clear interfaces, and distinct characteristics. Next, a hardware-software partitioning is derived by analyzing the characteristics of each task, such as the amount of inherent parallelism, the throughput requirements, the complexity of control processing, and the reuse potential across different applications and systems. Finally, a feasible implementation is proposed that includes, among others, a very-long-instruction-word (VLIW) media processor, one or more RISC processors, and several dedicated processors. The mapping study of the MPEG-4 decoder demonstrates the flexibility and extensibility of the media-processing platform, which enables effective HW/SW co-design yielding high performance density.
The objective of this research is to design and develop a flexible programmable video coprocessor targeting MPEG-2 applications. Five basic processing tasks have been identified as the coprocessor's main job; together they cover a wide variety of operations frequently needed by multimedia applications. These tasks are frame-rate conversion (increasing or decreasing the frame rate), resolution conversion, changing the number of bits per pixel, filtering, and video compositing operations (rotation or mirroring of frames). The first phase of this project presented a comprehensive critical study of the algorithms capable of performing these tasks in the DCT domain. In this paper, the details of the coprocessor design, its implementation, and the simulation of the chip are presented.
This paper describes a hardware/software codesign method for the extensible embedded RISC core VIRGO, which is based on the MIPS-I instruction set architecture. VIRGO is described in the Verilog hardware description language; it has a five-stage pipeline with a shared 32-bit cache/memory interface and is governed by a distributed control scheme. Each pipeline stage has a small controller that manages the stage's status and coordinates with the other pipeline stages. Because the description uses a high-level language and the control structure is distributed, the VIRGO core is highly extensible and can be adapted to the requirements of an application. Taking a high-definition television MPEG-2 MP@HL decoder chip as a case study, we constructed a hardware/software codesign virtual prototyping machine with which we can investigate the VIRGO core instruction set architecture, system-on-chip memory size requirements, system-on-chip software, and related issues. The virtual prototyping platform also allows evaluation of the system-on-chip design and the RISC instruction set.
We present the Light Field Video Camera, an array of CMOS image sensors for video image-based rendering applications. The device is designed to record a synchronized video dataset from over one hundred cameras to a hard disk array using as few as one PC per fifty image sensors. It is intended to be flexible, modular, and scalable, with extensive visibility into and control over the cameras. The Light Field Video Camera is a modular embedded design based on the IEEE 1394 High Speed Serial Bus, with an image sensor and MPEG-2 compression at each node. We show both the flexibility and the scalability of the design with a six-camera prototype.
The Object-Based Media Group at the MIT Media Laboratory is developing robust, self-organizing programming models for dense ensembles of ultra-miniaturized computing nodes that are deployed by the thousands in bulk fashion, e.g., embedded into building materials. While such systems potentially offer almost unlimited computation for multimedia purposes, the individual devices contain tiny amounts of memory, lack explicit addresses, have wireless communication ranges of only millimeters to centimeters, and are expected to fail at high rates. An unorthodox approach to handling multimedia data is required to achieve useful, reliable work in such an environment. We describe the hardware and software strategies and demonstrate several examples of processing images and sound in such a system.
In the past few years, programmable mediaprocessors (e.g., Hitachi/Equator Technologies MAP-CA and Texas Instruments TMS320C6x) have been replacing ASICs and other hardwired components in imaging applications (e.g., medical imaging modalities, machine vision systems, and video conferencing). Due to the high performance requirements of many imaging applications, older general-purpose processors were not suitable for these kinds of applications. For instance, in 1993 the TMS320C80 was about 50 times faster than the Intel 486 processor. However, recent advances in the architecture and instruction sets of general-purpose processors have significantly closed the performance gap between these processors and programmable mediaprocessors. For example, the MMX, SSE, and SSE2 extensions to the Pentium 4 instruction set give the Pentium 4 a legitimate multimedia instruction set comparable to those found in mediaprocessors, further blurring the boundary between general-purpose processors and mediaprocessors. The combination of the instruction set extensions and a new architecture that supports very high clock frequencies gives the Pentium 4 performance in imaging functions comparable to high-performance mediaprocessors, making it a candidate for applications where its large size, high cost, and high power consumption are not overriding issues.
Texas Instruments recently introduced its latest C6000-family DSP core, the TMS320C64x. The C64x is a VLIW DSP core with eight 32-bit functional units, two levels of on-chip memory, and a programmable Direct Memory Access (DMA) controller. We have developed about 25 image/video computing functions and have assessed the core's performance and suitability for image/video computing. We present some of these results, along with an example of how we mapped image warping optimally onto the C64x core. Although a 32-bit architecture, the C64x has a throughput similar to that of 64-bit architectures. It is powerful due to its large, multi-level on-chip memory, its many functional units, its high clock frequency, and its ease of programming.
This paper presents a wavelet-based codec (SMAWZ) that uses the most important feature of the SPIHT algorithm (zero-trees) while eliminating all concepts that are incompatible with a reduced hardware environment. This is done by replacing dynamic list structures with static maps (significance maps), which leads to a simpler and spatially oriented coefficient scan order. Additionally, extensions such as wavelet packet support and object-based coding are included.
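The core idea of replacing SPIHT's dynamic lists with a static significance map can be sketched as follows. This is a simplified illustration of bitplane coding over a flat coefficient array, not the SMAWZ codec itself; the function name and bit ordering are assumptions.

```python
import numpy as np

def bitplane_encode(coeffs, num_planes):
    """Encode coefficients bitplane by bitplane, tracking significance
    in a fixed-size boolean map rather than SPIHT-style dynamic lists
    (illustrative sketch, not the SMAWZ codec)."""
    mag = np.abs(coeffs).astype(np.int64)
    sig = np.zeros(coeffs.shape, dtype=bool)   # static significance map
    bits = []
    for p in range(num_planes - 1, -1, -1):    # most significant plane first
        t = 1 << p
        for i in range(coeffs.size):
            if not sig.flat[i]:
                newly = bool(mag.flat[i] >= t)
                bits.append(int(newly))                    # significance pass
                if newly:
                    sig.flat[i] = True
                    bits.append(int(coeffs.flat[i] < 0))   # sign bit
            else:
                bits.append(int((mag.flat[i] >> p) & 1))   # refinement pass
    return bits
```

Because the significance map has a fixed size and a fixed scan order, the memory footprint is known at design time, which is the property that matters for a reduced hardware environment.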
This paper presents the results of a quantitative evaluation of the instruction fetch characteristics of media processing. It is commonly known that multimedia applications typically exhibit a significant degree of processing regularity. Prior studies have examined this regularity and qualitatively noted that, in contrast with general-purpose applications, which tend to retain their data on-chip and stream program instructions in from off-chip, media-processing applications are exactly the opposite: they retain their instruction code on-chip and commonly stream data in from off-chip. This study expounds on that prior work and quantitatively validates its conclusions, while also providing recommendations on architectural methods that can enable more effective and affordable support for instruction fetching in media processing.
One approach to transformation-based compression is the Matching Pursuit Projection (MPP). MPP, or variants of it, has been suggested for designing image and video compression algorithms and has been among the top-performing submissions within the MPEG-4 standardization process. In the case of still image coding, however, the MPP approach comes at the price of enormous computational complexity. In this work, we discuss sequential as well as parallel speedup techniques for an MPP image coder that is competitive in terms of rate-distortion performance.
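The source of this complexity is easy to see from the basic matching pursuit loop: every iteration correlates the residual against the entire dictionary. The minimal sketch below illustrates the greedy algorithm itself, not the paper's coder or its speedup techniques; the unit-norm-rows dictionary layout is an assumption.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit (illustrative): at each step pick the
    dictionary atom most correlated with the residual, record its
    coefficient, and subtract its contribution from the residual."""
    residual = signal.astype(float).copy()
    atoms, coeffs = [], []
    for _ in range(n_atoms):
        corr = dictionary @ residual          # atoms are unit-norm rows
        k = int(np.argmax(np.abs(corr)))      # best-matching atom
        atoms.append(k)
        coeffs.append(float(corr[k]))
        residual -= corr[k] * dictionary[k]   # peel off that component
    return atoms, coeffs, residual
```

The full dictionary correlation in every iteration is exactly the step that sequential and parallel speedup techniques target.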
The MoNET™ (Media over Net) SX000 product family is designed around a scalable voice, video, and packet-processing platform to address applications with channel densities ranging from a few voice channels to four OC-3s per card. This platform is developed for bridging the public circuit-switched network to next-generation packet telephony and data networks. The platform consists of a DSP farm, RISC processors, and interface modules. The DSP farm executes voice compression, image compression, and line echo cancellation algorithms for a large number of voice, video, fax, modem, or data channels. The RISC CPUs perform various packetizations based on RTP, UDP/IP, and ATM encapsulations; they also participate in DSP farm load management and in communication with the host and other MoP devices. The MoNET™ S1000 communications device is designed for voice processing and for bridging TDM to ATM and IP packet networks. The S1000 consists of a DSP farm based on the Carmel DSP core and a 32-bit RISC CPU, along with Ethernet, Utopia, PCI, and TDM interfaces. In this paper, we describe the VoIP infrastructure; the building blocks of the S500, S1000, and S3000 devices; the algorithms executed on these devices and the associated channel densities; the detailed DSP architecture; the memory architecture; and the data flow and scheduling.
Advances in broadband access are expected to exert a profound impact on our everyday life and will be key to the digital convergence of communication, computer, and consumer equipment. A common thread facilitating this convergence comprises digital media and the Internet. To address this market, Equator Technologies, Inc. is developing the Dolphin broadband set-top box reference platform using its MAP-CA Broadband Signal Processor™ chip. The Dolphin reference platform is a universal media platform for the display and presentation of digital content on end-user entertainment systems. Its objective is to provide a complete set-top box system based on the MAP-CA processor, including all the necessary hardware and software components for the emerging broadcast and broadband digital media market based on IP protocols. Such a reference design requires broadband Internet access and high-performance digital signal processing. By using the MAP-CA processor, the Dolphin reference platform is completely programmable, allowing various codecs, such as MPEG-2, MPEG-4, H.263, and proprietary codecs, to be implemented in software. The software implementation also enables field upgrades to keep pace with evolving technology and industry demands.
The architecture for block-based video applications (e.g., MPEG/JPEG coding, graphics rendering) is usually based on a processor engine connected to an external background SDRAM memory where reference images and data are stored. In this paper, we reduce the memory bandwidth required for MPEG coding by up to 67% by identifying the optimal block configuration and applying embedded data compression by up to a factor of four. It is shown that independent compression of fixed-size data blocks with a fixed compression ratio can decrease the memory bandwidth only for a limited set of compression factors. To achieve this result, we exploit the statistical properties of the burst-oriented data exchange with memory. Embedded compression proves particularly attractive for bandwidth reduction when a compression ratio of 2 or 4 is chosen; this moderate compression factor can be obtained with a low-cost compression scheme such as DPCM, at a small, acceptable loss of quality.
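A low-cost DPCM scheme of the kind mentioned above can be sketched in a few lines: transmit the first sample of a fixed-size block, then coarsely quantized differences against the decoder's own reconstruction. This is an illustrative sketch under assumed parameters, not the paper's actual compression scheme.

```python
def dpcm_encode(block, shift):
    """Lossy DPCM over a fixed-size block (illustrative sketch):
    send the first sample, then prediction errors quantized by a
    right shift; the encoder tracks the decoder's reconstruction
    so quantization errors do not accumulate."""
    pred = int(block[0])
    out = [pred]
    for x in block[1:]:
        q = (int(x) - pred) >> shift   # quantized prediction error
        out.append(q)
        pred += q << shift             # mirror the decoder's state
    return out

def dpcm_decode(code, shift):
    """Inverse of dpcm_encode: accumulate dequantized differences."""
    vals = [code[0]]
    for q in code[1:]:
        vals.append(vals[-1] + (q << shift))
    return vals
```

Because each block is encoded independently with a fixed ratio, any block can be fetched from SDRAM and decompressed without touching its neighbours, which is what makes the scheme compatible with burst-oriented memory access.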
The ability to transmit an audio bitstream progressively is a highly desirable property for network transmission. MPEG-4 Version 2 audio supports fine-grain bit-rate scalability in the Generic Audio Coder (GAC) through its Bit-Sliced Arithmetic Coding (BSAC) tool, which provides scalability in steps of 1 kbit/s per audio channel. However, this fine-grain scalability tool is available only for mono and stereo audio material, and little work has been done on progressively transmitting multichannel audio sources. MPEG Advanced Audio Coding (AAC) is one of the most distinguished multichannel digital audio compression systems. Based on AAC, we develop a progressive syntax-rich multichannel audio codec in this work. It not only supports fine-grain bit-rate scalability for the multichannel audio bitstream but also provides several other desirable functionalities. A formal subjective listening test shows that the proposed algorithm achieves better performance at several different bit rates when compared with MPEG-4 BSAC on mono audio sources.
Among the many home-network technologies, IEEE 1394 stands out for supporting automatic configuration, QoS-guaranteed real-time A/V transmission, and high bandwidth. In the near future, IEEE 1394 home networks will be deployed, and IEEE 1394 digital home appliances will be universally available at a reasonable price. An integrated control technology will then be needed; such control technology is based on IEC 61883-1 FCP and the AV/C Command and Transaction Set (CTS). This paper addresses IEEE 1394-based technology for controlling digital home appliances, and the test results from its implementation show that it can serve in a home gateway as a simple but effective control technology.
A C-compiler is a basic tool for most embedded systems programmers. It is the tool by which the ideas and algorithms in an application (expressed as C source code) are transformed into machine code executable by the target processor. Our research was to develop an optimizing C-compiler for a specified 16-bit DSP. As one of the most important parts of the C-compiler, the code generator's efficiency and performance directly affect the resulting target assembly code. Thus, to improve the performance of the C-compiler, we constructed an efficient code generator based on RTL, an intermediate language used in GNU CC. The code generator accepts RTL as its main input, takes advantage of features specific to RTL and to the specified DSP's architecture, and generates compact assembly code for the DSP. In this paper, the features of RTL are first briefly introduced. Then, the basic principle behind the construction of the code generator is presented in detail. Following this principle, the paper discusses the architecture of the code generator, including syntax tree construction and reconstruction, basic RTL instruction extraction, behavior description at the RTL level, and instruction description at the assembly level. The optimization strategies used by the code generator to produce compact assembly code are also given. Finally, we conclude that the C-compiler using this code generator achieves the expected high efficiency.
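The mapping from RTL-like expression trees to compact assembly can be illustrated with a toy instruction selector: larger tree patterns are tried before smaller ones, so a multiply feeding an add collapses into a single multiply-accumulate. This is a minimal sketch under assumed names; the tuple-based RTL encoding and the mnemonics (`mac`, `add`, `mpy`) are hypothetical, not those of the paper's DSP.

```python
def select_insn(dst, expr):
    """Toy tree-pattern instruction selection (illustrative sketch).
    RTL-like expressions are nested tuples, e.g. ("plus", a, b);
    the largest pattern is matched first to emit compact code."""
    op = expr[0]
    # Largest pattern first: dst = a + (b * c) -> one MAC instruction.
    if op == "plus" and isinstance(expr[2], tuple) and expr[2][0] == "mult":
        _, b, c = expr[2]
        return [f"mac {dst}, {expr[1]}, {b}, {c}"]
    if op == "plus":
        return [f"add {dst}, {expr[1]}, {expr[2]}"]
    if op == "mult":
        return [f"mpy {dst}, {expr[1]}, {expr[2]}"]
    raise ValueError(f"no pattern for {op}")
```

Matching the composite pattern before its components is the essence of generating compact code on a DSP, where fused operations such as MAC are the common case in signal-processing loops.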
In this paper, a time-division-multiplexed (TDM) task schedule for an HDTV video decoder is proposed. Three tasks share the bus: fetching decoded data from SDRAM for display (DIS), reading reference data from SDRAM for motion compensation (REF), and writing motion-compensated data back to SDRAM (WB). The proposed schedule is based on a novel four-bank interleaved SDRAM storage structure, which reduces read/write overhead. Two 64-Mbit SDRAMs (4 banks × 512K × 32 bits) are used. Compared with two banks, the four-bank storage strategy reads and writes data in 45% less time, so the required data rates of the three tasks are reduced. The TDM schedule is built from round-robin scheduling with fixed slot allocation, using both MB slots and task slots. As a result, bus conflicts are avoided, and the buffer size is reduced by 48% compared with priority-based bus scheduling. Moreover, a compact bus schedule exists even for the worst case of stuffing, owing to the reduced task execution times. The buffer size is reduced and the control logic is simplified.
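The conflict-free property of fixed-slot round-robin scheduling can be sketched as follows. This is an illustrative model, not the paper's actual scheduler; the slot pattern, request names, and slot granularity are assumptions.

```python
from collections import deque

def tdm_schedule(slot_pattern, requests, n_slots):
    """Fixed-slot TDM bus schedule (illustrative sketch): each bus slot
    is statically assigned to one task, so a task only ever transfers
    in its own slot and bus conflicts cannot occur by construction."""
    queues = {task: deque(reqs) for task, reqs in requests.items()}
    log = []
    for slot in range(n_slots):
        task = slot_pattern[slot % len(slot_pattern)]  # round robin
        if queues[task]:
            log.append((slot, task, queues[task].popleft()))
        else:
            log.append((slot, task, None))  # idle slot, still conflict-free
    return log

# Three tasks (DIS/REF/WB) rotating over six bus slots:
log = tdm_schedule(["DIS", "REF", "WB"],
                   {"DIS": ["d0", "d1"], "REF": ["r0"], "WB": ["w0"]}, 6)
```

Because each task's worst-case wait is bounded by the slot pattern length rather than by the other tasks' behaviour, the buffers only need to absorb a fixed, known latency, which is why fixed slots can shrink buffer sizes relative to priority-based arbitration.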
The Holo-Chidi Video Concentrator Card is a frame buffer for the Holo-Chidi holographic video processing system. Holo-Chidi was designed at the MIT Media Laboratory for real-time computation of computer-generated holograms and the subsequent display of the holograms at video frame rates. The Holo-Chidi system is made of two sets of cards: the Processor cards and the Video Concentrator Cards (VCCs). The Processor cards are used for hologram computation, data archival/retrieval from a host system, and higher-level control of the VCCs. The VCC formats computed holographic data from multiple hologram-computing Processor cards, converting the digital data to analog form to feed the acousto-optic modulators of the Media Lab's Mark-II holographic display system. The Video Concentrator Card consists of: a High-Speed I/O (HSIO) interface through which data is transferred from the hologram-computing Processor cards; a set of FIFOs and video RAM that buffer the data for the hololines being displayed; a one-chip integrated microprocessor-and-peripheral combination that handles communication with other VCCs and furnishes the card with a USB port; a co-processor that controls display data formatting; and D-to-A converters that convert the digital fringes to analog form. The co-processor is implemented with an SRAM-based FPGA with over 500,000 gates and controls all the signals needed to format the data from the multiple Processor cards into the format required by Mark-II. A VCC has three HSIO ports through which up to 500 Megabytes of computed holographic data can flow per second from the Processor cards to the VCC. A Holo-Chidi system with three VCCs has enough frame-buffering capacity to hold up to thirty-two 36-Megabyte hologram frames at a time. Pre-computed holograms may also be loaded into the VCC from a host computer through the low-speed USB port.
Both the microprocessor and the co-processor in the VCC can access the main system memory used to store control programs and data for the VCC. The card also generates the control signals used by the scanning mirrors of Mark-II. In this paper, we discuss the design of the VCC and its implementation in the Holo-Chidi system.