This paper describes the new C64x DSP core including instruction set extensions that enhance performance for image and video processing. Key features include packed data processing and special instructions to accelerate algorithms such as motion estimation. Devices based on the C64x will be ideally suited for key target applications including video infrastructure and image analysis.
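To make the idea of packed data processing for motion estimation concrete, here is a minimal sketch of a sum of absolute differences (SAD) computed over four 8-bit pixels packed into one 32-bit word. It does not model any actual C64x instruction: the helper names (`unpack_bytes`, `pack_bytes`, `packed_sad`, `block_sad`) are hypothetical, and a real device would perform the per-lane work in hardware rather than in a loop.

```python
def unpack_bytes(word):
    """Split a 32-bit word into four unsigned 8-bit lanes (lowest lane first)."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def pack_bytes(lanes):
    """Pack up to four unsigned 8-bit values into one 32-bit word."""
    word = 0
    for lane, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * lane)
    return word


def packed_sad(word_a, word_b):
    """Sum of absolute differences over the four byte lanes of two words."""
    lanes_a = unpack_bytes(word_a)
    lanes_b = unpack_bytes(word_b)
    return sum(abs(a - b) for a, b in zip(lanes_a, lanes_b))


def block_sad(current, reference):
    """SAD between two equally sized pixel rows, four pixels per packed word."""
    total = 0
    for i in range(0, len(current), 4):
        word_cur = pack_bytes(current[i:i + 4])
        word_ref = pack_bytes(reference[i:i + 4])
        total += packed_sad(word_cur, word_ref)
    return total
```

A motion estimator would evaluate `block_sad` for each candidate displacement and keep the candidate with the smallest value.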
The architecture of mediaprocessors has become increasingly sophisticated to accommodate the need for more performance in processing various media data. However, due to the inability of mediaprocessor compilers to fully detect the parallelism available in a program and maximize the utilization of the mediaprocessor's on-chip resources, C intrinsics, which are hints to the compiler on which assembly instructions to use, have been employed to achieve better performance. Nonetheless, these intrinsics are mediaprocessor-dependent, thus limiting the portability of mediaprocessor software. To help increase the portability of mediaprocessor software, we have developed a Mediaprocessor Programming Interface (MPI), which translates one set of C intrinsics into another. In many cases, the translated code for the target mediaprocessor has similar performance to the code developed with native intrinsics. We believe that the MPI can facilitate the reuse of mediaprocessor software as well as the development of mediaprocessor-independent software.
Smart cameras use video/image processing algorithms to capture images as objects, not as pixels. This paper describes architectures for smart cameras that take advantage of VLSI to improve the capabilities and performance of smart camera systems. Advances in VLSI technology aid in the development of smart cameras in two ways. First, VLSI allows us to integrate large amounts of processing power and memory along with image sensors. CMOS sensors are rapidly improving in performance, allowing us to integrate sensors, logic, and memory on the same chip. As we become able to build chips with hundreds of millions of transistors, we will be able to include powerful multiprocessors on the same chip as the image sensors. We call these image sensor/multiprocessor systems image processors. Second, VLSI allows us to point a large number of these powerful sensor/processor systems at a single scene. VLSI factories will produce large quantities of these image processors, making it cost-effective to use a large number of them in a single location. Image processors will be networked into distributed cameras that use many sensors as well as the full computational resources of all the available multiprocessors. Multiple cameras make a number of image recognition tasks easier: we can select the best view of an object, eliminate occlusions, and use 3D information to improve the accuracy of object recognition. This paper outlines approaches to distributed camera design: architectures for image processors and distributed cameras, algorithms to run on distributed smart cameras, and applications of VLSI distributed camera systems.
We propose a robust programming model for dense ensembles of ultra-miniaturized computing nodes that are deployed in bulk fashion, e.g., embedded into building materials. We sketch a hardware reference model as an initial guide to the application domain, then describe a programming model based on mobile code fragments that self-assemble into larger structures. We outline a simple representative application that we have developed and tested on a system simulator.
Ray tracing is a rendering technique for producing high-quality, photo-realistic images. The basic requirement for ray tracing a specific geometric primitive is the calculation of intersections between rays and the geometry. The Bezier clipping algorithm is a promising technique for computing high-quality spline-ray interactions, but its slowness suggests the use of dedicated VLSI architectures. In this paper we present an improved algorithm for Bezier clipping and its VLSI implementation. Specifically, we have modified the original algorithm to avoid its non-constant number of cycles per iteration and its inherent, undesirable division operations. As a result, we obtain a regular architecture, characterized by a fixed and optimum scheduling, which minimizes the timing requirements of the Bezier clipping algorithm.
Over the past few years, technology drivers for microprocessors have changed significantly. Media data delivery and processing (such as telecommunications, networking, video processing, speech recognition, and 3D graphics) is increasing in importance and will soon dominate the processing cycles consumed in computer-based systems. This paper presents the architecture of the VASP-4096 processor. VASP-4096 provides high media performance with low energy consumption by integrating associative SIMD parallel processing with embedded microprocessor technology. The major innovation in the VASP-4096 is the integration in a single chip of thousands of processing units that are capable of supporting software-programmable, high-performance mathematical functions as well as abstract data processing. In addition to 4096 processing units, VASP-4096 integrates on a single chip a RISC controller that is an implementation of the SPARC architecture, 128 Kbytes of data memory, and I/O interfaces. The SIMD processing in VASP-4096 implements the ASProCore architecture, a proprietary implementation of SIMD processing, and operates at 266 MHz with program instructions issued by the RISC controller. The device also integrates a 64-bit synchronous main memory interface operating at 133 MHz (double data rate) and a 64-bit, 66 MHz PCI interface. Compared with other processor architectures that support media processing, VASP-4096 offers true performance scalability, support for deterministic and non-deterministic data processing on a single device, and software programmability that can be reused in future chip generations.
This paper deals with different aspects of wavelet packet (WP) based video coding. In introductory experiments we show that WP decomposition, and specifically WP decomposition in conjunction with the best basis algorithm, is superior in quality to the standard discrete wavelet transform but has prohibitive computational demands (especially for real-time applications). The main contribution of our work is therefore the examination of three parallelization methods for WP based video coding. Two inter-frame based parallelization methods (group-of-picture parallelization and frame-by-frame parallelization) exploit the properties of a video stream (full independence between GOPs and rather high independence between single frames) better than intra-frame parallelization, but they demand more memory and do not respect the frame order defined by the input video stream. We highlight the advantages and drawbacks of all three methods and show experimental results obtained on a Siemens hpcLine cluster and a Cray T3E.
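The group-of-picture parallelization mentioned above can be sketched as follows, under the assumption that GOPs are fully independent. This is not the authors' implementation: `encode_gop` is a hypothetical placeholder for WP decomposition and coding, and Python threads are used only to show the scheduling pattern, not to obtain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_gops(frames, gop_size):
    """Cut the frame sequence into consecutive groups of pictures."""
    return [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]


def encode_gop(gop):
    """Hypothetical placeholder: stands in for WP decomposition plus coding."""
    return [("coded", frame) for frame in gop]


def encode_video(frames, gop_size, workers=4):
    """Encode independent GOPs concurrently and return frames in input order."""
    gops = split_into_gops(frames, gop_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # GOPs may finish in any order; map() yields results in submission
        # order, so all coded GOPs are buffered until they can be emitted.
        coded_gops = list(pool.map(encode_gop, gops))
    return [frame for gop in coded_gops for frame in gop]
```

The buffering needed to restore the input order is one way to see why this style of parallelization costs more memory than splitting work inside a single frame.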
The wavelet transform is increasingly used in image and video compression. One of the best-known image compression algorithms is SPIHT, which involves the wavelet transform. As the parallelization of the wavelet transform has by now been sufficiently investigated, this work takes the next step and deals with the parallelization of the compression algorithm itself.
Sprite encoding/decoding is one of the new tools proposed in the MPEG-4 standard. Sprite encoding operates at the Video Object Plane level. Since the size of a Video Object Plane might change with time, it presents challenges for hardware implementation. This is unlike the other video object coding tools, which operate at a fixed-size block level that can be directly mapped to processor structures. Hence there is a need for an analysis that allows a priori estimation of the computational complexity in order to: (1) assess the resources required to implement Sprite Encoding in software; (2) design hardware for the various modules of Sprite Encoding; and (3) understand the static and dynamic nature of Sprite Encoding operations. This paper presents a detailed complexity analysis of the Sprite Encoding process in terms of basic operations and estimates its computational complexity. A modified Sprite Encoding algorithm is also proposed. The proposed algorithm provides a similar result in terms of sprite generation but is more amenable to a hardware-efficient implementation.
An embedded high-quality multichannel audio coding algorithm is proposed in this research. The Karhunen-Loeve Transform is applied to multichannel audio signals in the preprocessing stage to remove inter-channel redundancy. Then, after processing of several audio coding blocks, the transformed coefficients are quantized in layers and the bit stream is ordered according to their importance. The multichannel audio bit stream generated by the proposed algorithm has a fully progressive property, which is highly desirable for audio multicast applications in heterogeneous networks. Experimental results show that, compared with the MPEG Advanced Audio Coding algorithm, the proposed algorithm achieves better performance in both the objective Mask-to-Noise-Ratio measurement and the subjective listening test at several different bit rates.
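As an illustration of how a Karhunen-Loeve Transform removes inter-channel redundancy, here is a minimal two-channel sketch. It assumes only two channels, so that the eigenvectors of the 2x2 covariance matrix reduce to a single rotation angle; the function names are hypothetical and the proposed coder's actual multichannel procedure is not described in this abstract.

```python
import math


def covariance(x, y):
    """Population covariance of two equally long sample lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def klt_two_channels(left, right):
    """Rotate two channels onto the eigenvectors of their covariance matrix."""
    c_ll = covariance(left, left)
    c_rr = covariance(right, right)
    c_lr = covariance(left, right)
    # Rotation angle that diagonalizes [[c_ll, c_lr], [c_lr, c_rr]].
    theta = 0.5 * math.atan2(2.0 * c_lr, c_ll - c_rr)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # First output carries the larger variance, second the residual.
    first = [cos_t * l + sin_t * r for l, r in zip(left, right)]
    second = [-sin_t * l + cos_t * r for l, r in zip(left, right)]
    return first, second, theta
```

After the rotation the two output channels are uncorrelated, and most of the energy of strongly correlated inputs ends up in the first one, which is what makes the later quantization cheaper.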
MPEG-4 is a multimedia standard that requires Video Object Planes (VOPs). Generation of VOPs for any kind of video sequence is still a challenging problem that largely remains unsolved. Nevertheless, if this problem is treated by imposing certain constraints, solutions for specific application domains can be found. MPEG-4 applications in mobile devices are one such domain, where the opposing goals of low power and high throughput must both be met. Efficient memory management plays a major role in reducing power consumption. Specifically, efficient memory management for VOPs is difficult because the lifetimes of these objects vary and may overlap. Varying object lifetimes require dynamic memory management, where memory fragmentation is a key problem that needs to be addressed. In general, memory management systems address this problem with a combination of strategy, policy, and mechanism. For MPEG-4 based mobile devices that lack instruction processors, a hardware-based memory management solution is necessary. In MPEG-4 based mobile devices that have a RISC processor, using a real-time operating system (RTOS) for this memory management task is not expected to be efficient, because the strategies and policies used by the RTOS are often tuned for handling memory segments that are small compared to object sizes. Hence, a memory management scheme specifically tuned for VOPs is important. In this paper, different strategies, policies, and mechanisms for memory management are considered, and an efficient combination is proposed for VOP memory management, along with a hardware architecture that can handle the proposed combination.
In image and video compression, it is well known that the Wavelet Transform (WT) can achieve higher compression efficiency than the Discrete Cosine Transform (DCT) when a post-transform coding scheme of similar computational complexity is used. On the other hand, it is also well known that the wavelet approach has a higher computational complexity than the DCT, both in software and in hardware. When both audio and video compression are required, as in video recording, it is desirable to achieve higher compression efficiency using the WT and to share the same WT-based hardware. This paper presents an architecture for a WT slave processor. We first present our own results for image and audio compression to show the effectiveness of the wavelet transform. We then show, based on our own experience, that an integer-based wavelet transform has enough accuracy for both audio and video. We then present executable decompression code, which is an intermediate step before the hardware architecture. Finally, we show an architectural design for an integer Wavelet Slave Processor (WSP) for decompression. The proposed WSP can also be designed, as a variation on a theme, for the compression of audio and video data.
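To show what an integer-based wavelet transform looks like in practice, here is a minimal sketch of one level of the integer Haar transform (often called the S transform) in lifting form. The choice of Haar is an assumption made for brevity; the abstract does not say which wavelet filter the WSP uses, and the function names are hypothetical.

```python
def forward_int_haar(signal):
    """One level of the integer Haar (S) transform on an even-length int list."""
    approx, detail = [], []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        d = b - a            # predict step: difference of the pair
        s = a + (d >> 1)     # update step: floor of the pair average
        approx.append(s)
        detail.append(d)
    return approx, detail


def inverse_int_haar(approx, detail):
    """Undo forward_int_haar exactly, using the same integer operations."""
    signal = []
    for s, d in zip(approx, detail):
        a = s - (d >> 1)
        b = a + d
        signal.extend([a, b])
    return signal
```

Because the inverse repeats the same shifts and additions in reverse, reconstruction is exact with no floating-point arithmetic, which is the property that makes such transforms attractive for a small hardware slave processor.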
VR (Virtual Reality) technology is entering the DVR (Distributed Virtual Reality) era. DVR needs to meet real-time requirements. These requirements result in more complex and heavier computational tasks, significantly increasing the load on each node in a DVR system. High-performance parallel graphics systems can be used, but they are very expensive. To decrease the cost and make full use of existing network resources, we present a parallel graphics rendering system in this paper. The system adopts a master-slave model for its network topology. CORBA is used as the network computational environment of the distributed rendering system. To further improve system performance, we used several techniques, such as task division based on object space, task overlay, and dynamic load balancing. Experimental results indicate that the rendering time needed by each computer drops as the number of computers contributing to the computation increases. Although the network transfer time also grows, it has no notable impact on the overall time, because only a little more control information needs to be transferred than before.
Processor architecture has a great effect on the performance of the whole processor array. To improve the performance of the SIMD array architecture, we modified the structure of the BAP (bit-serial array processor) processing element, based on the BAP128 processor. The array processor chip of the modified bit-serial array processor (MBAP), designed in 0.35-micrometer CMOS technology, targets embedded image understanding systems. This paper presents the MBAP architecture and describes the architectural features of the design. The performance of BAP and MBAP is compared on basic macro instructions and low-level image understanding algorithms. The results show that MBAP performs considerably better than BAP, at the cost of a 5% increase in chip resources.
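For readers unfamiliar with bit-serial SIMD arrays, the following sketch shows how such an array might add two vectors: every processing element (PE) handles one bit per step and keeps a one-bit carry. It is a generic illustration under that assumption, not a model of the BAP or MBAP processing element, whose internal structure is not given in this abstract.

```python
def bit_serial_add(a_values, b_values, width):
    """Add two vectors one bit plane at a time, as a bit-serial array would."""
    count = len(a_values)
    carry = [0] * count      # one carry bit stored in each PE
    result = [0] * count
    for bit in range(width):             # one step per bit plane
        for pe in range(count):          # all PEs act in lockstep
            a_bit = (a_values[pe] >> bit) & 1
            b_bit = (b_values[pe] >> bit) & 1
            sum_bit = a_bit ^ b_bit ^ carry[pe]
            carry[pe] = (a_bit & b_bit) | (carry[pe] & (a_bit ^ b_bit))
            result[pe] |= sum_bit << bit
    return result, carry
```

The number of steps depends on the word width rather than on the number of elements, which is why the design of the individual PE has such a large effect on overall array performance.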