Hardware accelerators are used to speed up the execution of specific tasks such as video coding. Often the purpose
of hardware acceleration is to allow a cheaper or more energy-efficient processor to execute the majority of the
application in software. However, hardware acceleration introduces new overheads, mainly due to the need to
transfer data to and from the accelerator and to signal the completion of the accelerator computation to the
processor. We find the traditional mechanisms suboptimal for fine-grain hardware acceleration, especially when
energy efficiency is important.
This paper explores a technique unique to Transport Triggered Architectures for interfacing with hardware
accelerators. The proposed technique places hardware accelerators in the processor data path, making them
visible to the programmer as regular function units. This way, communication costs are reduced, as data can
be transferred to the accelerator directly from other processor data path components, and synchronization can
be done by polling a simple ready flag in the accelerator function unit. Additionally, this setup enables the
instruction scheduler of the compiler to schedule the hardware accelerator like any other operation, thus partially
hiding its latency behind other program operations.
The paper presents a case study with an audio decoder application in which fine-grain and coarse-grain
hardware accelerators are integrated into the processor data path as function units. The case study examines
several different synchronization, communication, and latency-hiding techniques enabled by this kind of setup.
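The trigger-and-poll interaction described above can be illustrated with a small toy simulation. This is our own sketch, not code from the paper: all class and variable names are hypothetical, and the "cycle" model is a deliberate simplification of a real TTA pipeline.

```python
# Toy model of an accelerator exposed as a TTA function unit: a move to the
# trigger port starts the computation, and completion is observed by polling
# a ready flag. All names are illustrative, not from the paper.

class AcceleratorFU:
    """Models an accelerator function unit with a fixed latency in cycles."""

    def __init__(self, latency, op):
        self.latency = latency
        self.op = op            # the accelerated computation
        self._busy_for = 0
        self._result = None

    def trigger(self, operand):
        """A move to the trigger port starts the accelerator."""
        self._result = self.op(operand)   # simulation shortcut: compute eagerly
        self._busy_for = self.latency

    def tick(self):
        """Advance the simulated clock by one cycle."""
        if self._busy_for > 0:
            self._busy_for -= 1

    @property
    def ready(self):
        return self._busy_for == 0

    def result(self):
        assert self.ready
        return self._result


# Latency hiding: the "compiled schedule" issues independent work while the
# accelerator runs, then polls the ready flag before reading the result.
acc = AcceleratorFU(latency=3, op=lambda x: x * x)
acc.trigger(7)
other_work = 0
while not acc.ready:
    other_work += 1       # independent operations overlap the latency
    acc.tick()
print(acc.result(), other_work)
```

The point of the sketch is the last loop: the three cycles of accelerator latency are filled with other operations instead of stalling the processor.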
Video coding standards, such as MPEG-4, H.264, and VC-1, define hybrid transform-based, block-motion-compensated techniques that employ almost the same coding tools. This observation has been a foundation for defining the MPEG Reconfigurable Multimedia Coding framework, which aims to facilitate multi-format codec design. The idea is to send a description of the codec with the bit stream, and to reconfigure the coding tools accordingly on the fly. This kind of approach favors software solutions, and it poses a substantial challenge for the implementers of mobile multimedia devices that aim at high energy efficiency. In particular, as high-definition formats are about to be required from mobile multimedia devices, variable-length decoders are becoming a serious bottleneck. Even at current moderate mobile video bitrates, software-based variable-length decoders consume a major portion of the resources of a mobile processor. In this paper, we present a Transport Triggered Architecture (TTA) based programmable implementation of Context-Adaptive Binary Arithmetic Coding (CABAC) decoding, which is used, e.g., in the main profile of H.264 and in JPEG2000. The solution can also be used for other variable-length codes.
Proc. SPIE. 6821, Multimedia on Mobile Devices 2008
KEYWORDS: Digital signal processing, Modulation, Doppler effect, Video, Receivers, Signal processing, Analog electronics, Forward error correction, Orthogonal frequency division multiplexing, Contrast transfer function
In this paper, we present the system and software implementation of the Digital Video Broadcasting protocol for handheld
applications (DVB-H) on Sandbridge Technologies' multithreaded digital signal processor SB3011. The I and Q
base-band analog output signals from the tuner are digitized, filtered and further processed conforming to ETSI EN 302
304 V1.1.1 (2004-06). All processing blocks including the receiver synchronization and forward error correction are
executed entirely in software. At 1.5 Mbps the processor usage is less than 40% with maximum power consumption of
Block effect is one of the most annoying artifacts in digital video processing and is especially visible in low-bitrate
applications, such as mobile video. To alleviate this problem, we propose an adaptive quantization method for inter
frames that can reduce visible block effect in DCT-based video coding. In the proposed method, a set of quantization
matrices are constructed before processing the video data. Matrices are constructed by exploiting the temporal frequency
limitations of the human visual system. The method is adaptive to motion information and is able to select an appropriate
quantization matrix for each inter-coded block. Based on the experimental results, the proposed scheme can achieve
better subjective video quality compared to conventional flat quantization, especially in low-bitrate applications.
Moreover, it does not introduce extra computational cost in software implementation. This method does not change
standard bitstream syntax, so it can be directly applied to many DCT-based video codecs. Potential applications include
mobile phones and other digital devices with low-bitrate requirements.
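The idea of precomputing a family of quantization matrices and selecting one per block from motion information can be sketched as follows. This is an illustrative construction, not the paper's actual matrices: the base quantizer, the number of motion classes, and the 4-pixel binning are hypothetical parameters.

```python
# Sketch: precompute 8x8 quantization matrices whose high-frequency entries
# grow with block motion (exploiting reduced temporal-frequency sensitivity
# of the human visual system), then select one per inter-coded block from
# its motion-vector magnitude. All constants are hypothetical.

import math

BASE_Q = 16          # flat baseline quantizer (illustrative value)
NUM_LEVELS = 4       # number of motion-activity classes

def make_matrix(level):
    """Higher motion level -> coarser quantization of high frequencies."""
    m = [[0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            radial = math.hypot(u, v) / math.hypot(7, 7)   # 0..1 from DC
            m[u][v] = round(BASE_Q * (1.0 + level * radial))
    return m

MATRICES = [make_matrix(lvl) for lvl in range(NUM_LEVELS)]

def select_matrix(mv_x, mv_y):
    """Map motion-vector magnitude (in pixels) to a matrix index."""
    mag = math.hypot(mv_x, mv_y)
    level = min(NUM_LEVELS - 1, int(mag / 4))   # 4-pixel bins, hypothetical
    return MATRICES[level]

# A fast-moving block gets coarser high-frequency quantization:
slow = select_matrix(1, 0)
fast = select_matrix(10, 8)
print(slow[7][7], fast[7][7])
```

Since the matrices are built once before processing the video data, per-block selection is a table lookup, which matches the abstract's claim of no extra computational cost in software.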
Low-complexity video coding schemes aim to provide video encoding services also for devices with restricted
computational power. A video coding process based on the three-dimensional discrete cosine transform (3D DCT)
can offer a low-complexity video encoder by omitting the computationally demanding motion estimation operation.
In this coding scheme, an extended fast transform is used instead of motion estimation to decorrelate
the temporal dimension of video data. Typically, the most complex part of the 3D DCT based coding process
is the three-dimensional transform. In this paper, we demonstrate methods that can be used in lossy coding
process to reduce the number of one-dimensional transforms required to complete the full 3D DCT or its inverse
operation. Because unnecessary computations can be omitted, fewer operations are required to complete the
transform. Results include the obtained computational savings for standard video test sequences. The savings
are reported in terms of computational operations. Generally, the reduced number of computational operations
also implies longer battery lifetime for portable devices.
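One way the number of 1-D transforms can be reduced is shown in the sketch below. This is our illustration of the pruning idea, not the paper's exact algorithm: in the separable inverse 3D DCT, a 1-D inverse over an all-zero coefficient vector can be skipped outright, since its output is also all zeros.

```python
# Pruned separable inverse 3D DCT over an 8x8x8 cube: each pass applies
# 1-D inverse transforms along one axis and skips all-zero vectors.

import math

N = 8

def idct1(c):
    """Naive 1-D inverse DCT-II (orthonormal normalization)."""
    out = []
    for n in range(N):
        s = 0.0
        for k in range(N):
            a = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
            s += a * c[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        out.append(s)
    return out

def inverse_3d_dct_pruned(cube):
    """cube[t][y][x] in place; returns (reconstruction, transforms skipped)."""
    skipped = 0
    for t in range(N):                       # pass along x
        for y in range(N):
            if any(cube[t][y]):
                cube[t][y] = idct1(cube[t][y])
            else:
                skipped += 1
    for t in range(N):                       # pass along y
        for x in range(N):
            col = [cube[t][y][x] for y in range(N)]
            if any(col):
                col = idct1(col)
                for y in range(N):
                    cube[t][y][x] = col[y]
            else:
                skipped += 1
    for y in range(N):                       # pass along t
        for x in range(N):
            tube = [cube[t][y][x] for t in range(N)]
            if any(tube):
                tube = idct1(tube)
                for t in range(N):
                    cube[t][y][x] = tube[t]
            else:
                skipped += 1
    return cube, skipped

# After coarse quantization only the DC coefficient survives; of the
# 3 * 64 = 192 one-dimensional transforms, most operate on all-zero
# vectors and are skipped.
cube = [[[0.0] * N for _ in range(N)] for _ in range(N)]
cube[0][0][0] = 64.0
_, skipped = inverse_3d_dct_pruned(cube)
print(skipped)
```

The savings depend on where the surviving coefficients sit: zeros "spread" as the passes proceed, so the earliest passes skip the most transforms, which is consistent with the abstract's operation-count results.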
Proc. SPIE. 6507, Multimedia on Mobile Devices 2007
KEYWORDS: Digital signal processing, 3D acquisition, Video acceleration, Detection and tracking algorithms, Video, Denoising, Signal processing, Video processing, Algorithm development, Motion estimation
Recent developments in the field of embedded systems have enabled mobile devices with significant computation
power and long battery life. However, there is still a limited number of video applications for such platforms. Due to
the high computational requirements of video processing algorithms, intensive assembly optimization or even hardware
design is required to meet the resource constraints of the mobile platforms. One example of such challenging video
processing problem is video denoising.
In this paper, we present a software implementation of a state-of-the-art video denoising algorithm on a mobile
computational platform. The chosen algorithm is based on the three-dimensional discrete cosine transform (3D DCT)
and block-matching. Apart from its architectural simplicity, the algorithm allows computational scalability due to its
"sliding window" style of processing. In addition, the main components of this algorithm are the 8-point DCT and block
matching, which can be efficiently calculated with the hardware acceleration of a modern DSP.
Our target platform is the OMAP Innovator development kit, a dual-processor environment including an ARM925 RISC
general-purpose processor (GPP) and a TMS320C55x digital signal processor (DSP). The C55x DSP offers hardware
acceleration support for computing the DCT and block-matching operations intensively used in the chosen denoising
algorithm. Hardware acceleration can offer a significant speed-up in comparison to assembly-level optimization of the
source code. The results demonstrate the feasibility of implementing an efficient video denoising algorithm on a mobile
computational platform with limited computational resources.
This paper describes a device capable of performing the following tasks: it samples and decodes the composite
analog TV signal, encodes the resulting RGB data into an MPEG-4 stream, and sends it over a WiMAX link. On the
other end of the link a similar device receives the WiMAX signal, in either TDD or FDD mode, decodes the MPEG data
and displays it on the LCD display. The device can be a hand held device, such as a mobile phone or a PDA. The
algorithms for the analog TV, WiMAX physical layer, WiMAX MAC and the MPEG encoder/decoder are executed
entirely in software in real time, using Sandbridge Technologies' low-power SB3011 digital signal processor. The
SB3011 multithreaded digital signal processor includes four DSP cores with eight threads each, and one ARM processor.
The execution of the algorithms requires the entire four cores for the FDD mode. The WiMAX MAC is executed on the
Application-specific programmable processors tailored for the requirements at hand are often at the center of
today's embedded systems. Therefore, it is not surprising that considerable effort has been spent on constructing
tools that assist in codesigning application-specific processors for embedded systems. It is desirable that such
design toolsets support an automated design flow from application source code down to synthesizable processor
description and optimized machine code. In this paper, such a toolset is described. The toolset is based on a
customizable processor architecture template built on a VLIW-derived architecture paradigm called Transport
Triggered Architecture (TTA). The toolset addresses some of the pressing shortcomings found in existing toolsets,
such as the lack of automated design-space exploration, limited run-time retargetability of the design tools,
and restrictions in the customization of the target processors.
In this paper, we propose an image coding scheme using adaptive resizing algorithm to obtain more compact coefficient representation in the block-DCT domain. Standard coding systems, e.g. JPEG baseline, utilize the block-DCT transform to reduce spatial correlation and to represent the image information with a small number of visually significant transform coefficients. Because the neighboring coefficient blocks may include only a few low-frequency coefficients, we can use downsizing operation to combine the information of two neighboring blocks into a single block.
Fast and elegant image resizing methods operating in the transform domain have been introduced previously. In this paper, we introduce a way to use these algorithms to reduce the number of coefficient blocks that need to be encoded. At the encoder, the downsizing operation must be performed delicately to gain compression efficiency. The information of neighboring blocks can be efficiently combined if the blocks do not contain significant high-frequency components and if the blocks share similar characteristics. Based on our experiments, the proposed method can offer from 0 to 4 dB of PSNR gain for block-DCT based coding processes. The best performance can be expected for large images containing smooth, homogeneous areas.
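The merge decision and the downsizing step can be sketched as below. Note the hedges: the paper performs the resizing directly in the transform domain, while for clarity this sketch takes the equivalent spatial route (inverse DCT, 2:1 horizontal downsampling, forward DCT), and the energy threshold is a hypothetical parameter of our own choosing.

```python
# Combine two horizontally adjacent 8x8 DCT blocks into one when both are
# low-frequency dominated. Spatial-route reference sketch; the paper's
# method resizes in the DCT domain directly.

import math

N = 8

def _alpha(k):
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct1(x):
    return [_alpha(k) * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                            for n in range(N)) for k in range(N)]

def idct1(c):
    return [sum(_alpha(k) * c[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(N)) for n in range(N)]

def apply2d(block, f):
    """Separable 2-D transform: rows first, then columns."""
    rows = [f(r) for r in block]
    cols = [f([rows[y][x] for y in range(N)]) for x in range(N)]
    return [[cols[x][y] for x in range(N)] for y in range(N)]

def high_freq_energy(c):
    return sum(c[u][v] ** 2 for u in range(N) for v in range(N) if u + v >= 4)

def can_merge(ca, cb, thresh=1.0):        # thresh is hypothetical
    return high_freq_energy(ca) < thresh and high_freq_energy(cb) < thresh

def merge(ca, cb):
    """Two 8x8 DCT blocks -> one 8x8 DCT block of the downsized pair."""
    sa, sb = apply2d(ca, idct1), apply2d(cb, idct1)
    wide = [sa[y] + sb[y] for y in range(N)]                  # 8x16 spatial
    down = [[(wide[y][2 * x] + wide[y][2 * x + 1]) / 2.0      # 2:1 average
             for x in range(N)] for y in range(N)]
    return apply2d(down, dct1)

# Two smooth (constant) neighboring blocks qualify for merging:
a = [[10.0] * N for _ in range(N)]
b = [[20.0] * N for _ in range(N)]
ca, cb = apply2d(a, dct1), apply2d(b, dct1)
if can_merge(ca, cb):
    merged = merge(ca, cb)
print(round(merged[0][0]))   # DC of merged block: mean 15 * 8 = 120
```

Encoding one merged block instead of two halves the number of coefficient blocks for such smooth regions, which is where the abstract predicts the best performance.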
This paper describes how a scene cut detector can be utilized in a video codec based on the three-dimensional discrete cosine transform (3D DCT). In the 3D DCT based video codec, data is processed in 8x8x8 cubes, hence a set of 8 images needs to be available in memory at a time. A change of video scene may occur between any of the images stored in the memory. A rapid scene change within an 8x8x8 cube produces significant high-frequency coefficients in the temporal dimension of the DCT domain. If the important high-frequency coefficients are discarded, the information between the scenes is mixed around the scene cut position, causing ghost artifacts in the reconstructed video sequence. Therefore, an approach to handle each of the eight possible scene change positions within an 8x8x8 cube is proposed. The proposed method includes the utilization of the 8x8x4 DCT, forced-fill, repeat-previous-frame, and average-to-previous-frame techniques. By utilizing a scene cut detector in the 3D DCT based video codec, unnecessary quality drops can be avoided without reducing the compression ratio. Notable quality improvements can be achieved for images around a scene cut position.
In this paper, a simplified three-dimensional discrete cosine transform (3D DCT) based video codec is proposed. The computational complexity of the baseline 3D DCT based video codec is reduced by simplifying the transformation block. In video sequences with low motion activity, consecutive images are highly correlated in the temporal dimension, so the DCT does not usually produce significant coefficient values at the higher temporal frequencies. Therefore, it is possible to use a simple averaging operation and the 2D DCT, instead of the full 3D DCT operation, for some of the cubes. Furthermore, some of the resulting cubes can be combined to achieve a more efficient binary representation.
Based on our results, the simplifications considerably improved the compression efficiency of the 3D DCT based codec for video sequences with low motion activity, while the compression efficiency for video sequences with high motion activity was maintained. At the same time, the coding speed of the simplified 3D DCT based video codec increased compared to the original. Although the compression efficiency of the H.263 video codec was not reached, the encoding speed of the 3D DCT based video encoder was many times faster than that of H.263.
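The per-cube mode decision behind this simplification can be sketched as follows. The activity measure and threshold here are our hypothetical choices, not the paper's; the transform counts only illustrate the relative cost of the two paths (a separable 3D DCT on an 8x8x8 cube needs 3 * 64 = 192 one-dimensional transforms, a single 2D DCT only 2 * 8 = 16).

```python
# Mode decision sketch: low-motion cubes are reduced to a temporal average
# followed by one 2D DCT; high-motion cubes keep the full 3D DCT.
# Threshold and activity measure are illustrative assumptions.

def temporal_activity(cube):
    """Sum of absolute frame-to-frame differences across the cube."""
    return sum(abs(cube[t + 1][y][x] - cube[t][y][x])
               for t in range(7) for y in range(8) for x in range(8))

def encode_cube(cube, thresh=100.0):
    """Return (mode, number of 1-D transforms needed, averaged frame)."""
    if temporal_activity(cube) < thresh:
        avg = [[sum(cube[t][y][x] for t in range(8)) / 8.0
                for x in range(8)] for y in range(8)]
        return "avg+2D", 2 * 8, avg      # one 2D DCT on the averaged frame
    return "full3D", 3 * 64, None        # full separable 3D DCT

# A static cube takes the cheap path; a cube whose frames change takes
# the full transform.
static = [[[5.0 for _ in range(8)] for _ in range(8)] for _ in range(8)]
moving = [[[float(t * 10) for _ in range(8)] for _ in range(8)] for t in range(8)]
mode_s, cost_s, _ = encode_cube(static)
mode_m, cost_m, _ = encode_cube(moving)
print(mode_s, cost_s, mode_m, cost_m)
```

The 12x reduction in 1-D transforms on the cheap path is where the speed-up for low-motion sequences comes from.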
Video compression is a critical component of many multimedia applications available today. The interest in multimedia has generated a great deal of research in the area of video coding, in academia
and industry alike, and several successful standards have emerged, e.g., ITU-T H.261 and H.263, and ISO/IEC MPEG-1, MPEG-2, and MPEG-4. Transform-based video coding is used by all video standards today. The Discrete Cosine Transform (DCT) is the most popular transform for video coding and, in fact, is used in all current video coding standards. We present scalable architectures for the DCT that adjust the complexity to the considered application. The range of possible architectures includes sequential and parallel processing of transform butterflies at each stage.
In this paper, we study the number of Euclidean distance transform values. We show that there is, from the number-theoretic point of view, a high redundancy in the number of different Euclidean distance values. Our number-theoretic approach allows us to give an approximation of the number of algebraically independent transform values. This can be used to optimize future hardware implementations.
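The redundancy can be demonstrated with a small experiment in the spirit of the result (our illustration, not the paper's analysis): on a 2-D grid, the number of distinct squared Euclidean distances x**2 + y**2 is far smaller than the number of (x, y) offsets, because many offsets share a sum of two squares, e.g. 5**2 + 0**2 == 4**2 + 3**2 == 25.

```python
# Count distinct squared Euclidean distances versus grid offsets within
# a 64x64 quadrant. The grid size R is an arbitrary illustrative choice.

R = 64
pairs = [(x, y) for x in range(R) for y in range(x, R)]   # unordered offsets
distinct = {x * x + y * y for x, y in pairs}
print(len(pairs), len(distinct))   # distinct values are far fewer
```

In a hardware lookup table this means far fewer entries need to be stored or computed than a naive per-offset enumeration would suggest, which is the optimization opportunity the paper points to.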