We present an instruction-level power dissipation model of the Intel XScale microprocessor. The XScale implements the ARM ISA, but uses an aggressive microarchitecture and a SIMD Wireless MMX co-processor to speed up execution of multimedia workloads in the embedded domain. Instruction-level power modelling was first proposed by Tiwari et al. in 1994. Adaptations of this model have been found to be applicable to simple ARM processors. Research also shows that instructions can be clustered into groups with similar energy characteristics. We adapt these methodologies to the significantly more complex XScale processor. We characterize the processor in terms of the energy costs of opcode execution, operand values, pipeline stalls and similar effects through accurate measurements on hardware. This instruction-based (rather than microarchitectural) approach allows us to build a high-speed, power-accurate simulator that runs at MIPS-range speeds while achieving accuracy better than 5%. The processor core accounts for only a portion of overall power consumption, and we move beyond the core to explore the issues involved in building a SystemC simulation framework that models power dissipation of complete systems quickly, flexibly and accurately.
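As a rough illustration of an instruction-level energy model in the style described above, the following minimal sketch sums a base cost per opcode, an overhead when consecutive opcodes differ, and a per-cycle stall cost. All cost values and names (`BASE_COST_NJ`, `SWITCH_OVERHEAD_NJ`, `STALL_COST_NJ`, `estimate_energy_nj`) are hypothetical and are not measurements or identifiers from the XScale work.

```python
# Hypothetical per-instruction base energy costs in nJ (illustrative only,
# not measured on any real processor).
BASE_COST_NJ = {"add": 1.0, "mul": 1.6, "ldr": 2.1, "str": 1.9}
# Hypothetical overhead charged when two consecutive opcodes differ.
SWITCH_OVERHEAD_NJ = 0.2
# Hypothetical energy charged per pipeline stall cycle.
STALL_COST_NJ = 0.5


def estimate_energy_nj(trace, stall_cycles=0):
    """Sum base costs, inter-instruction overhead and stall energy for a trace."""
    energy = 0.0
    previous = None
    for opcode in trace:
        energy += BASE_COST_NJ[opcode]
        if previous is not None and opcode != previous:
            energy += SWITCH_OVERHEAD_NJ
        previous = opcode
    return energy + stall_cycles * STALL_COST_NJ


trace = ["ldr", "add", "add", "mul", "str"]
total = estimate_energy_nj(trace, stall_cycles=2)
print(f"estimated energy: {total:.1f} nJ")
```

Because the estimate only needs a table lookup per executed instruction, a simulator built this way can run far faster than one that models microarchitectural activity.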
We profile the energy consumption of the H.264 video decoder on VLIW embedded processors using the Trimaran simulator. Based on this study, we observe that the branch operations in the quarter-pixel (QP) interpolation and the DCT slow down the issue rate of the VLIW processors. Several new instructions are then proposed to address this issue. These new instructions can be used to speed up the issue rate and reduce the total energy consumption. Finally, experimental results of the proposed instruction-level power-efficient strategies on the TI C6416 processor are reported and discussed.
Increasing demand for configuration-time-aware processing with stringent flexibility constraints necessitates the design and development of a fast, dynamically reconfigurable processor. This work presents results obtained from the hybrid FPGA architecture design methodology proposed in earlier work. The hybrid architecture is formed of ASIC units and LUT-based processing elements. The ASIC units represent tasks or core clusters, essentially recurring patterns, obtained through common sub-graph analysis between basic blocks within and across routines of computation-intensive applications. Results show that partial reconfiguration using computation cores embedded in a sea of LUTs offers potential for massive savings in gate density by eliminating the need for redundant sub-circuit pattern configurations. Since the ASICs cover only parts of the data flow graphs, the remaining computations are implemented on LUT-based reconfigurable hardware. A new packing algorithm is proposed to form the LUT-based processing elements. Its cost function prioritizes reducing the input/output pins of the clusters being formed. Results show that the proposed method yields significant savings in the number of nets to be routed.
Proc. SPIE 5683, Breaking the I/O bottleneck for high-compute performance processing with Xtensa LX configurable and extensible processor architecture, 0000 (8 March 2005); https://doi.org/10.1117/12.586013
The challenges of new embedded applications impose conflicting requirements: complex algorithms, evolving standards and shorter product cycles dictate programmable solutions, yet high data bandwidth, compute power and lower power consumption dictate carefully crafted hardwired functional modules. An application specific instruction set processor (ASIP) is ideally suited to provide most of the advantages of hardwired logic, while maintaining the time-to-market and programmability advantages of a general purpose processor. This paper presents the unique blend of high compute performance and I/O bandwidth of the configurable and extensible Xtensa LX ASIP architecture. Xtensa LX provides high compute performance with wide instruction words using multiple operation slots that enable superscalar performance suitable for data-intensive applications. Xtensa LX also provides high I/O bandwidth through its multiple load/store units, which provide parallel low-latency access or external DMA access to local memories, and through a virtually unlimited number of ports and queues directly connected to the processor core functional units and system control registers, which remove the I/O bottleneck of traditional processors. The advantages of the Xtensa LX features are demonstrated by its performance results: 171.6 ConsumerMark on out-of-the-box simulation of the EEMBC consumer suite and a BDTIsimMark2000™ score of 6150 at 370 MHz.
RISC and DSP, the two main architectures, each have their own features. The main idea of RISC is “simple is fast”. Acting as a controller, RISC is based on a load/store structure, a register-register instruction set architecture (ISA), general-purpose registers and cache. DSP, on the other hand, is designed for signal processing and emphasizes large data accesses and fast computation. It is based on a register-memory ISA, diverse addressing modes, a data address generator, a multiplier-accumulator and RAM. As embedded systems grow rapidly, no single core architecture, whether RISC or DSP, can meet the needs anymore, so a combination is necessary. There are two kinds of combination: dual-core and single-core. Single-core means that the RISC core and the DSP core are merged into one core with common resources and a unified ISA. A 32-bit media processor named MediaDSP3201 (MD32 for short) is a new member of this family. This paper introduces the MD32 design, concentrating on the ISA and pipeline design, both of which are important in architecture design. Compatibility runs through the whole design. The ISA should include features from both RISC and DSP ISAs, and the pipeline should fit the designed ISA as well as possible. MD32 was fabricated by TSMC on the first attempt in spring 2004. Application programs running on it show that the design is successful and that the chip is suitable for embedded system applications.
Consumer electronics products are multi-functional devices that combine a set of media applications. Media data in such products is largely processed in heterogeneous multiprocessor subsystems that are integrated into a system on chip (SoC). A product engineer configures each subsystem for a collection of predefined applications when deploying the SoC in a product. Oftentimes, the system supports a large number of desired application configurations, or ‘use cases’. The system moves from one configuration to the next by adapting the configuration of a running application, referred to as ‘dynamic reconfiguration’. This paper presents a practical approach to dynamic application reconfiguration in a heterogeneous multiprocessor subsystem. The targeted media applications are constructed as a graph of concurrently executing interconnected tasks that exchange information through streams of data. Configuring such a streaming graph entails the instantiation and interconnection of tasks, setting of task parameters, assignment of tasks to coprocessors, and the allocation of communication buffers in memory. The paper derives a reconfiguration interface that can be supported in hardware, yet isolates application configuration knowledge from the coprocessor hardware. Though simple and easy to use, the interface addresses the key challenge of reconfiguring individual tasks while maintaining real-time behavior and data integrity of the overall set of concurrently executing applications.
There has been an ever-increasing demand for fast and power-efficient solutions for mobile multimedia computing applications. The research discussed in this paper proposes an automated tool-set to design a reconfigurable architecture targeted towards multimedia applications, which are both data and control intensive. One important design step is custom memory design. This paper discusses a novel methodology to design a power-, area- and time-efficient memory architecture for a given Control Data Flow Graph (CDFG) of an application. It uses the concept of Predicated Data Flow Analysis to obtain the memory requirements of each control path of the CDFG, and a novel algorithm is used to merge these requirements. The final memory architecture is reconfigurable at run-time, and a dynamic memory manager has been designed to support this. An illustrative example involving a self-generated CDFG is shown to demonstrate the flow of the proposed algorithm. Results for various multimedia algorithms found in the MPEG-4 codec show the effectiveness of this approach over memory design based on conventional Data Flow Analysis techniques.
H.264 is the latest video compression standard. Its rate-distortion performance is greatly improved compared with MPEG-1, MPEG-2, MPEG-4, H.261 and H.263. Among the many features of H.264, sub-pixel motion compensation is one of the factors that make H.264 a better coding scheme. H.264 implements both half-pixel and quarter-pixel interpolation, so the computational complexity of sub-pixel motion compensation is high. This paper presents an efficient VLSI architecture for fast implementation of the sub-pixel interpolation of H.264. Several techniques are designed to reduce the number of memory accesses and accelerate the interpolation computations.
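To make the cost of sub-pixel interpolation concrete, here is a minimal one-dimensional sketch of H.264 luma interpolation: half-sample positions use the 6-tap filter (1, -5, 20, 20, -5, 1) with rounding and a shift by 5, and quarter-sample positions average two neighbouring samples. The function names and the edge-replication border handling are assumptions made for this illustration, and the sketch does not represent the VLSI architecture proposed in the paper.

```python
def clip8(value):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-sample between row[i] and row[i + 1] using the 6-tap filter."""
    taps = (1, -5, 20, 20, -5, 1)
    # Border handling by edge replication is a simplifying assumption.
    last = len(row) - 1
    samples = [row[min(max(i - 2 + k, 0), last)] for k in range(6)]
    acc = sum(t * s for t, s in zip(taps, samples))
    return clip8((acc + 16) >> 5)


def quarter_pel(row, i):
    """Quarter-sample between row[i] and the half-sample to its right."""
    return (row[i] + half_pel(row, i) + 1) >> 1


edge = [0, 0, 0, 100, 100, 100]
print(half_pel(edge, 2), quarter_pel(edge, 2))
```

Each half-sample needs six input samples, which is why reducing repeated memory accesses matters for a hardware implementation.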
The first step towards the design of video processors and video systems is to achieve an accurate understanding of the major video applications, including not only the fundamentals of the many video compression standards, but also the workload characteristics of those applications. Introduced in 1997, the MediaBench benchmark suite provided the first set of full application-level benchmarks for studying video processing characteristics, and has consequently enabled significant research in computer architecture and compilers for multimedia systems. To expedite the next generation of systems research, the MediaBench Consortium is developing the MediaBench II benchmark suite, incorporating benchmarks from the latest multimedia technologies, and providing both a single composite benchmark suite and separate benchmark suites for each area of multimedia. In the area of video, MediaBench II Video includes both the popular mainstream video compression standards, such as Motion-JPEG, H.263, and MPEG-2, and the more recent next-generation standards, including MPEG-4, Motion-JPEG2000, and H.264. This paper introduces MediaBench II Video and provides a comprehensive workload evaluation of its major processing characteristics.
The H.264 video compression standard uses a context-adaptive binary arithmetic coder (CABAC) as an entropy coding mechanism. While the coder provides excellent compression efficiency, it is computationally demanding. On typical general-purpose processors, it can take up to hundreds of cycles to encode a single bit. In this paper, we propose an architecture for a CABAC encoder that can easily be incorporated into system-on-chip designs for H.264 compression. The CABAC is inherently serial and we divide the problem into several stages to derive a design that can provide a throughput of two cycles per encoded bit. The engine proposed is capable of handling binarization of the syntactical elements and provides the coded bit-stream via a first-in first-out buffer. The design is implemented on an Altera FPGA platform that can run at 50 MHz enabling a 25 Mbps encoding rate.
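The quoted encoding rate follows directly from the clock frequency and the cycles needed per bit. The short sketch below reproduces that arithmetic; the function name `encoding_rate_bps` and the software comparison figures (1 GHz, 200 cycles per bit) are hypothetical and chosen only to illustrate the "hundreds of cycles" case.

```python
def encoding_rate_bps(clock_hz, cycles_per_bit):
    """Encoded bits per second for an engine needing `cycles_per_bit` cycles per bit."""
    return clock_hz / cycles_per_bit


# Figures from the abstract: 50 MHz clock, two cycles per encoded bit.
fpga_rate = encoding_rate_bps(50e6, 2)
# Hypothetical general-purpose processor: 1 GHz, 200 cycles per bit.
software_rate = encoding_rate_bps(1e9, 200)

print(f"FPGA engine: {fpga_rate / 1e6:.0f} Mbps")
print(f"hypothetical software encoder: {software_rate / 1e6:.0f} Mbps")
```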
The introduction of a variety of novel coding tools in H.264 has brought an increase in complexity that few processor architectures can accommodate. Prior coding loops, such as MPEG-2, provided fewer variations and optional capabilities as part of the standard implementation, and as such they were readily partitioned in an intuitive manner with little deviation. Driven by the need to scale to such high-complexity algorithms, homogeneous multiprocessor architectures are becoming more common. H.264 presents the software architect with several new options for partitioning the coding blocks most efficiently across a multiprocessor architecture. In this paper, we address issues that arise from the mapping of H.264 onto multiprocessor DSP (MDSP) chips. We discuss algorithm partitioning, reference frame coherency, and synchronization issues. We show flexible methods for mapping the algorithm onto MDSPs which allow scalability over coding tools, resolutions, and computation/bandwidth availability.
Register allocation is an important part of an optimizing compiler. Register allocation via graph coloring was first implemented by Chaitin and his colleagues and later improved by Briggs and others. Abstracting register allocation as graph coloring simplifies the allocation process. Because the number of physical registers is limited, coloring of the interference graph cannot succeed for every node, and the uncolored nodes must be spilled. Almost all allocation methods obey one assumption: once a register is allocated to a variable v, it cannot be used by other variables until v dies, even if v is not used for a long time. This may waste register resources. The authors relax this restriction under certain conditions. In the proposed method, one register can be mapped to two or more interfering live ranges at the same time if they satisfy certain requirements. An operation named merge is defined, which arranges for two interfering nodes to occupy the same register at some cost. Thus, registers can be used more effectively and the cost of memory accesses can be reduced greatly.
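For reference, the baseline that the abstract builds on can be sketched as a simplified Chaitin-style simplify/select allocator. This is a minimal illustration under stated assumptions: the interference graph is a symmetric dictionary of sets, the spill choice is simply the highest-degree node, and the function name `color_graph` is hypothetical. It does not implement the authors' merge operation.

```python
def color_graph(interference, num_registers):
    """Chaitin-style simplify/select. Returns (assignment, spilled)."""
    graph = {node: set(neigh) for node, neigh in interference.items()}
    work = {node: set(neigh) for node, neigh in graph.items()}
    stack, spilled = [], []
    while work:
        # Simplify: remove a node with fewer than k remaining neighbours.
        candidate = next(
            (n for n in sorted(work) if len(work[n]) < num_registers), None
        )
        if candidate is None:
            # Nothing is trivially colourable: spill the highest-degree node.
            candidate = max(sorted(work), key=lambda n: len(work[n]))
            spilled.append(candidate)
        else:
            stack.append(candidate)
        for neigh in work.pop(candidate):
            work[neigh].discard(candidate)
    # Select: pop nodes and give each the lowest register its neighbours lack.
    assignment = {}
    while stack:
        node = stack.pop()
        used = {assignment[n] for n in graph[node] if n in assignment}
        assignment[node] = next(r for r in range(num_registers) if r not in used)
    return assignment, spilled


interference = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
assignment, spilled = color_graph(interference, num_registers=2)
print(assignment, spilled)
```

With two registers, the triangle a-b-c cannot be colored, so one node is spilled to memory. This is the kind of spill that sharing a register between interfering live ranges aims to avoid.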
This research studies how to improve the performance of the in-loop deblocking filter module of the H.264/AVC video coding standard in embedded systems. A novel prediction scheme is presented to reduce the complexity of the filter selection process and hence increase overall performance. We first examine the H.264/AVC deblocking filters by studying their correlation in terms of the filter type and pattern among a sequence of consecutive P frames and I frames. The experimental results show a high correlation of the filter skip rate and the filter pattern between different P frames and their leading I frame. Based on the correlation analysis, a binary history table predictor (the BHT predictor) and a complete history table predictor (the CHT predictor) are proposed to facilitate the deblocking filter selection process while maintaining good subjective and objective visual quality. We further present a hybrid filter prediction scheme that integrates both BHT and CHT to further improve prediction results.
General purpose workstations must support a wide variety of application characteristics, but it is hard to find a single CPU scheduling scheme that satisfactorily schedules processes from all types of applications. It is particularly difficult to get periodic deadline-driven continuous media processes to co-exist satisfactorily with others. A number of schemes have been proposed to address this issue, but these all suffer from one or more of the following limitations: i) unacceptable inefficiency, ii) non-determinism (i.e. introducing significant burstiness or jitter), iii) inability to explicitly support deadlines (so that deadlines may be missed even when the CPU is underloaded). This paper presents SHRED (SHaretokens, Round-robin, Earliest-deadline-first, Deferred-processing), an efficient, proportional-share, deterministic scheduling scheme that enables periodic deadline-driven processes to meet their explicit deadlines wherever possible, and degrades gracefully and adaptively when this is not possible. The scheme simultaneously ensures that non-deadline processes always obtain their fair share of CPU time, whether in conditions of underload or overload. For experimental evaluation, a prototype of SHRED has been developed by replacing the standard Linux scheduler with the SHRED scheduler. The prototype has been evaluated against the standard Linux scheduler for various parameters, and also against two proportional-share schemes, namely Stride and VTRR scheduling, for its overhead and its effect on jitter.
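For context on the proportional-share baselines named above, a simplified Stride scheduler can be sketched as follows. This is not SHRED itself. The client names, ticket counts and the choice to start every pass value at zero are assumptions for illustration: each client advances its pass value by a stride inversely proportional to its tickets, and the client with the smallest pass value runs next.

```python
import heapq

STRIDE1 = 1 << 20  # large constant used to derive integer strides


def stride_schedule(tickets, quanta):
    """Return the order in which clients receive `quanta` time slices."""
    # Heap entries are (pass value, client name, stride); ties break by name.
    heap = [(0, name, STRIDE1 // share) for name, share in sorted(tickets.items())]
    heapq.heapify(heap)
    order = []
    for _ in range(quanta):
        pass_value, name, stride = heapq.heappop(heap)
        order.append(name)
        heapq.heappush(heap, (pass_value + stride, name, stride))
    return order


order = stride_schedule({"video": 3, "shell": 1}, 8)
print(order)
```

Over eight quanta the client holding three tickets receives six slices and the client holding one ticket receives two, but nothing in this baseline expresses a deadline, which is the limitation the abstract highlights.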
This paper proposes a pipelining and bypassing unit (BPU) design method for our 32-bit RISC/DSP processor, MediaDsp3201 (MD32 for short). MD32 is realized in 0.18 μm technology at 1.8 V with a 200 MHz working clock and can achieve 200 million multiply-accumulate (MAC) operations per second. It thoroughly merges RISC architecture and DSP computation capability, providing fundamental RISC, extended DSP and single instruction multiple data (SIMD) instruction sets with various addressing modes in a unified and customized DSP pipeline architecture. We first describe the pipeline structure of MD32, comparing it to a typical RISC-style pipeline structure. We then study two bypassing schemes in terms of their effectiveness in resolving pipeline data hazards: the centralized and distributed BPU design strategies (CBPU and DBPU). A bypassing circuit chain model is given for DBPU, in which register read is placed only at the ID pipeline stage. Because the processor’s working clock is determined by the pipeline delay, and the BPU contains a long serial path of combinational logic, the optimization of the serial priority-select circuit is also analyzed in detail. Finally, the performance improvement is analyzed.
The Wi-Fi walkman is a mobile multimedia application that we developed to investigate the technological and usability aspects of human-computer interaction with personalized, intelligent and context-aware wearable devices in peer-to-peer wireless environments such as the future home, office, or university campus. It is a small handheld device with a wireless link that contains music content. Users carry their own walkman around and listen to music. All this music content is distributed in the peer-to-peer network and is shared using ad-hoc networking. The walkman interacts naturally with its user and relates the interests of different users to each other in a peer-to-peer environment. Without annoying interactions, it can learn the users’ music interests and tastes and consequently provide personalized music recommendations according to the current context and the user’s interests.
In this paper we study several well-known timing synchronization algorithms and apply them to our distributed real-time camera system. We previously proposed a peer-to-peer framework for distributed real-time gesture recognition, and this study serves as the keystone of that system. Cameras and microprocessors are now cheap enough that many smart camera nodes can be used in a single system. A smart camera node contains not only video capturing devices, but also processing elements, so that capture and processing are performed in the same package. Sometimes, using multiple relatively inexpensive cameras with lower resolution might even provide better performance and be more cost-efficient than using a single high-end camera. To perform real-time video processing, multiple distributed cameras require distributed processing power, and the distributed nodes have to remain synchronized to ensure the correctness of distributed video processing. Though clock synchronization has been well studied for decades, most previous work uses domain knowledge to achieve good synchronization results on a certain network structure. Several previous approaches achieve microsecond precision. However, our distributed real-time video processing system does not require such precision; only frame precision, which is around tens of milliseconds, is needed. In our distributed camera system we assume no deterministic network structure for the camera nodes. Since no single time synchronization algorithm suits every network structure, we investigate the possibility of performing clock synchronization in the application layer, where domain knowledge of the underlying structure is not needed. We choose algorithms that achieve synchronization on heterogeneous network structures by exchanging messages between camera nodes.
We investigate three well-known message-exchanging timing synchronization methods, namely Lamport’s, Lundelius’ and Halpern’s algorithms, and perform experiments on our distributed camera system. All these methods can tolerate up to one third of the camera nodes being faulty, and Halpern’s algorithm can even survive as long as the network remains connected. Three different network configurations are used in our simulation, and each configuration represents certain realistic camera distributions. Among all the algorithms we studied, Halpern’s algorithm is the simplest in terms of computational complexity and achieves the most precise synchronization, regardless of the huge variation in transmission delay. Halpern’s algorithm also uses fewer message exchanges, which can even be hidden within normal data packets, making it a better choice for our distributed real-time camera system.
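The one-third fault bound can be illustrated with a fault-tolerant midpoint step in the spirit of averaging algorithms such as the one by Lundelius and Lynch. This is a simplified sketch under stated assumptions: clock readings from all nodes are already collected and delay-compensated, at most `max_faulty` of them are arbitrary, and the function name and sample values are hypothetical. It is not the paper's implementation of any of the three algorithms.

```python
def fault_tolerant_midpoint(readings, max_faulty):
    """Discard the f lowest and f highest readings, then take the midpoint."""
    if len(readings) <= 3 * max_faulty:
        raise ValueError("need more than 3f readings to tolerate f faulty nodes")
    ordered = sorted(readings)
    trimmed = ordered[max_faulty:len(ordered) - max_faulty]
    return (trimmed[0] + trimmed[-1]) / 2


# Hypothetical clock readings in milliseconds; two nodes report wild values.
readings_ms = [1000.0, 1004.0, 998.0, 1002.0, 5000.0, 1001.0, -300.0]
target_ms = fault_tolerant_midpoint(readings_ms, max_faulty=2)
print(target_ms)
```

Since only frame-level precision (tens of milliseconds) is needed, each node can adjust its clock toward the resulting value once per synchronization round.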
Multimedia streaming over wireless networks faces the problem of low-bandwidth data transmission in an error-prone environment. Furthermore, due to the frame dependency exploited by video coding schemes, packet loss can degrade the perceptual quality of the media streams. In this paper, we design and implement a group-of-pictures (GOP) based video packet interleaving technique to reduce the impact of bursty packet losses. At the server side of our system, the packets of B or P frames are interleaved into the packets of a single I frame. At the client side, the de-interleaving method is based on the RTP timestamp in the RTP header. We also apply the technique to the MPEG-4 video codec in the streaming system and integrate its error-resilience tools (video packet, data partitioning, and RVLC) to increase the performance of the GOP-based video packet interleaving technique. The experimental results show that our technique improves the perceptual quality more than the classic scheme does.
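One plausible way to picture the interleaving is sketched below: B/P packets are spread evenly between the packets of the I frame at the sender, and the receiver restores the original order by sorting on timestamp and sequence number. The packet layout (dictionaries with `timestamp`, `seq` and `frame` keys), the even-spreading rule and the function names are assumptions for illustration. The abstract does not specify the exact layout used by the authors.

```python
def interleave_gop(i_packets, other_packets):
    """Spread B/P packets evenly between the packets of the I frame."""
    if not i_packets:
        raise ValueError("a GOP needs at least one I-frame packet")
    per_slot, extra = divmod(len(other_packets), len(i_packets))
    out, cursor = [], 0
    for index, packet in enumerate(i_packets):
        out.append(packet)
        take = per_slot + (1 if index < extra else 0)
        out.extend(other_packets[cursor:cursor + take])
        cursor += take
    return out


def deinterleave(packets):
    """Restore decoding order using (RTP timestamp, sequence number)."""
    return sorted(packets, key=lambda p: (p["timestamp"], p["seq"]))


# Hypothetical GOP: one I frame split into 3 packets, then 4 one-packet P frames.
i_packets = [{"timestamp": 0, "seq": s, "frame": "I"} for s in range(3)]
others = [{"timestamp": 3000 * (k + 1), "seq": 3 + k, "frame": "P"} for k in range(4)]

sent = interleave_gop(i_packets, others)
restored = deinterleave(sent)
print([p["frame"] for p in sent])
```

With this layout, a burst that drops a few consecutive packets hits a mix of I-frame and P-frame packets rather than wiping out the whole I frame on which the rest of the GOP depends.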
With the proliferation of networked devices, today's multimedia applications operate in highly heterogeneous and dynamic environments. An attractive way of dealing with this situation is to make applications self-adaptive, i.e. able to observe themselves and their execution environment, to detect significant changes and to reconfigure their own behavior in QoS-specific ways.
This approach has been studied in many projects, especially in the context of multimedia applications. However, reconfiguration mechanisms are generally implemented in ad hoc ways and often hard-coded within application code. This requires predicting all possible situations at development time, and therefore several key requirements cannot be addressed, in particular: generality across a wide range of applications, customizability to each execution context, and flexibility of the reconfiguration mechanisms.
This paper describes PLASMA, a component-based framework for building self-reconfigurable multimedia applications. PLASMA relies on a recursive composition model, a hierarchical reconfiguration management and a dynamic Architecture Description Language (ADL), in order to arbitrarily compose multimedia applications and their reconfiguration policies. This paper describes the design concepts underlying PLASMA and illustrates the use of PLASMA with detailed examples.
Most current wireless communication devices use embedded processors for performing different tasks such as physical layer signal processing and multimedia applications. Embedded processors provide a reasonable trade-off between application-specific implementation and hardware sharing by different algorithms, giving a more optimal design and greater flexibility. At the same time, the widespread popularity of these processors drives the development of algorithms specifically tailored for embedded environments. The Fast Fourier Transform (FFT) is a universal tool that has found many applications in communications, and many application-specific architectures and Digital Signal Processor (DSP) implementations are available for it. In this paper our focus is on embedded algorithms for spread spectrum communication receivers, which use the FFT as an engine to compute convolutions. Using FFT-based correlators, one can search over all possible code phases of a direct sequence spread spectrum (DS-SS) signal in parallel with fewer operations than conventional correlators require. However, in many real-life scenarios the receiver is provided with timing assistance that confines the uncertainty in code phase to a limited range. The full FFT-based search then becomes redundant, and a reasonable strategy is to modify the FFT-based methods for better utilization of embedded processor resources. In this paper we suggest a reduced-complexity frequency domain convolution approach for the search over a limited number of code phases.
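The full parallel code-phase search that the abstract takes as its starting point can be sketched with a small radix-2 FFT: the circular correlation over all code phases is the inverse transform of the received spectrum multiplied by the conjugate of the code spectrum. The 8-chip code, the noiseless received signal and the function names are assumptions for illustration, and the input length must be a power of two. The reduced-complexity method proposed in the paper is not shown.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_correlation(received, code):
    """corr[p] = sum over n of received[(n + p) % N] * code[n], for every phase p."""
    n = len(code)
    spectrum = [r * c.conjugate() for r, c in zip(fft(received), fft(code))]
    return [v.real / n for v in fft(spectrum, inverse=True)]


code = [1, -1, 1, 1, -1, -1, 1, -1]        # hypothetical 8-chip spreading code
shift = 3
received = code[-shift:] + code[:-shift]   # the code delayed by `shift` chips
corr = circular_correlation(received, code)
best_phase = max(range(len(corr)), key=corr.__getitem__)
print(best_phase)
```

All N code phases are evaluated at once, which is wasteful when timing assistance has already narrowed the search to a few candidate phases.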
Associative processors can perform parallel operations on a massive scale, which makes them efficient for video coding. However, due to the inherent nature of the architecture, performing the DCT is computationally intensive. To overcome this drawback, multiple DCTs are performed in parallel. This approach results in heavy data traffic because it operates on multiple blocks of video data. In this paper we present a new approach to performing the DCT on an associative processor, which makes use of the shape of the DCT basis vectors to extract parallelism. This approach reduces both the average number of cycles and the data traffic involved in video coding. Furthermore, the video coding can be performed on a macroblock basis, thereby eliminating a large number of redundant operations.
Lossless compression of raw CCD images captured using color filter arrays has several benefits. The benefits include improved storage capacity, reduced memory bandwidth, and lower power consumption for digital still camera processors. The paper discusses the benefits in detail and proposes the use of a computationally efficient block adaptive scheme for lossless compression. Experimental results are provided that indicate that the scheme performs well for CCD raw images attaining compression factors of more than two. The block adaptive method also compares favorably with JPEG-LS. A discussion is provided indicating how the proposed lossless coding scheme can be incorporated into digital still camera processors enabling lower memory bandwidth and storage requirements.