As today's video applications are in demand on many portable end-user devices, and these devices are far from
capable of holding and processing large amounts of video data, there is a need for bit rate improvement in
compression algorithms. The objective of this paper is to propose a hardware-based post-compression enhancer
situated between the Video Coding Layer and the Network Abstraction Layer of H.264. Our research analyzes
the bit streams produced by the emerging H.264 standard. The goal is to enhance compression rates by
proposing simple post-compression techniques based on symbol statistics. The CABAC and CAVLC entropy
coders used in H.264 work optimally for 1-bit symbols, and the statistical distribution among them is nearly
optimal. Our studies reveal that the bit streams present similar results for 8-bit symbols, and thus post-compression
using well-known byte-based mechanisms will not yield better results; furthermore, our studies
also show that such mechanisms even degrade the original compression rate. Nevertheless, a non-uniform
distribution of 6-bit symbols in 2046-bit discrete data packets is found, which can be exploited to boost
compression. On average, this distribution varies between 5.4% for the most probable symbol and 0.98% for the
least probable symbol. Simply coding a few of the most probable symbols will therefore result in bit rate reduction.
A 1-bit "compression enhancer used" flag penalty must be introduced for each discrete packet, increasing its size in
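The kind of symbol statistics described in this abstract can be gathered with a short script. The following is only a minimal sketch: the helper names are hypothetical, the input is pseudo-random bytes standing in for a real H.264 bit stream, and the bit order (most significant bit first) is an assumption, since the abstract does not specify it.

```python
import random
from collections import Counter

PACKET_BITS = 2046   # packet size quoted in the abstract
SYMBOL_BITS = 6      # symbol width quoted in the abstract

def bits_of(data):
    """Yield the bits of a bytes object, most significant bit first (assumed order)."""
    for byte in data:
        for shift in range(7, -1, -1):
            yield (byte >> shift) & 1

def split_packets(bits, packet_bits=PACKET_BITS):
    """Group a bit sequence into full packets; a trailing partial packet is dropped."""
    bits = list(bits)
    full = len(bits) - len(bits) % packet_bits
    return [bits[i:i + packet_bits] for i in range(0, full, packet_bits)]

def symbol_histogram(packet, symbol_bits=SYMBOL_BITS):
    """Count the fixed-width symbols found in one packet."""
    counts = Counter()
    for i in range(0, len(packet) - symbol_bits + 1, symbol_bits):
        value = 0
        for bit in packet[i:i + symbol_bits]:
            value = (value << 1) | bit
        counts[value] += 1
    return counts

def ranked_frequencies(counts):
    """Relative frequencies of the symbols that occur, most probable first."""
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)

# Stand-in input: pseudo-random bytes, NOT a real H.264 bit stream.
prng = random.Random(0)
stream = bytes(prng.getrandbits(8) for _ in range(4096))

packets = split_packets(bits_of(stream))
hist = symbol_histogram(packets[0])
freqs = ranked_frequencies(hist)
print(len(packets), "packets,", sum(hist.values()), "symbols in the first packet")
print("most probable symbol frequency:", round(freqs[0], 4))
```

A 2046-bit packet holds exactly 341 six-bit symbols, and a perfectly uniform distribution over the 64 possible symbols would give each one a frequency of about 1.56%, which is the baseline against which the 5.4% and 0.98% figures can be read.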
KEYWORDS: Digital signal processing, Clocks, Video, Field programmable gate arrays, Video compression, Algorithm development, Computer architecture, Motion estimation, System on a chip, Passive elements
Due to the timing constraints of real-time video encoding, hardware accelerator cores are used for video compression.
System-on-Chip (SoC) design tools offer design methodologies for complex microprocessor systems
with easy Intellectual Property (IP) core integration. This paper presents a PowerPC-based SoC with a
motion-estimation accelerator core attached to the system bus. Motion-estimation (ME) algorithms are the
most critical part of video compression due to the huge amount of data transfers and processing time they require. The main
goal of our proposed architecture is to minimize the number of memory accesses, exploiting the bandwidth
of a direct memory connection. This architecture has been developed using Xilinx XPS, an SoC platform design
tool. The results show that our system is able to perform integer-pixel full-search block-matching (FSBM)
motion estimation and inter-frame mode decision for a QCIF frame (176×144 pixels), using a 48×48 pixel
search window, with an embedded PPC in a Xilinx Virtex-4 FPGA running at 100 MHz, in 1.5 ms, i.e., 4.5 % of the total processing time at 30 fps.
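To make the workload concrete, the following minimal software sketch shows integer-pixel full-search block matching with a sum-of-absolute-differences (SAD) cost, together with the arithmetic behind the quoted figures. It is not the accelerator's design: the SAD criterion, the function names, and the assumption that the 48×48 window is centred on a 16×16 macroblock are illustrative choices that the abstract does not confirm.

```python
import random

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        cur_row = cur[by + y]
        ref_row = ref[by + dy + y]
        for x in range(n):
            total += abs(cur_row[bx + x] - ref_row[bx + dx + x])
    return total

def full_search(cur, ref, bx, by, n=16, search=16):
    """Test every displacement in [-search, +search]^2 that keeps the candidate
    block inside the reference frame; return (dx, dy, cost) of the best match."""
    height, width = len(ref), len(ref[0])
    best = (None, None, float("inf"))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best

# Arithmetic behind the figures quoted in the abstract.
QCIF_W, QCIF_H, MB, WINDOW = 176, 144, 16, 48
qcif_macroblocks = (QCIF_W // MB) * (QCIF_H // MB)   # 11 * 9 macroblocks per frame
candidates_per_mb = (WINDOW - MB + 1) ** 2           # 33 * 33, centred-window assumption
frame_share = 1.5e-3 * 30                            # 1.5 ms out of a 1/30 s frame period

# Toy check on a small random frame (not real video): the current frame is the
# reference shifted so that the true displacement is (dx, dy) = (2, -1).
prng = random.Random(1)
H = W = 24
ref = [[prng.randrange(256) for _ in range(W)] for _ in range(H)]
cur = [[ref[(y - 1) % H][(x + 2) % W] for x in range(W)] for y in range(H)]
motion = full_search(cur, ref, bx=8, by=8, n=4, search=3)
print(qcif_macroblocks, candidates_per_mb, round(frame_share, 3), motion)
```

Under the centred-window assumption, each of the 99 QCIF macroblocks is compared against 1089 candidate positions, which illustrates why memory traffic dominates and why reusing overlapping candidate data is attractive.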
This paper presents efficient integer-pel and fractional-pel motion estimation VLSI architectures for the luma video component in H.264/AVC. The proposed architectures were designed as hardware accelerators for 32-bit processors to reduce computation cost and processing time. Both accelerators use the full-search block-matching algorithm to fulfil the standard requirements with maximum quality. The integer motion estimator is composed of a systolic array of 16x16 processing elements with optimal memory management and an effective data path. The array was designed to adjust the search window size and shape at the macroblock level without high control overhead. Simulation results show reductions in computation and processing time of 21.5% to 60.7% when using a non-square search window shape, with a maximum PSNR degradation of 0.014 dB. The fractional motion estimation architecture improves on the processing time of previous designs by means of two parallel pipeline stages, an effective block flow and faster interpolation modules. The design can process the 41 macroblock partitions and sub-partitions at quarter-pel resolution in 606 clock cycles. Operating at a 100-MHz clock frequency, the architecture supports the 720p HD video format at 30 fps for one reference frame. Implementation results based on FPGA devices using VHDL are included.
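The figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 606 clock cycles apply to one macroblock and that 720p means 1280x720 luma samples; the variable names are illustrative only.

```python
# H.264 inter-prediction block sizes inside a 16x16 macroblock (width, height).
BLOCK_SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]

def blocks_per_macroblock(width, height, mb=16):
    """Number of width x height blocks that tile one mb x mb macroblock."""
    return (mb // width) * (mb // height)

partition_counts = {size: blocks_per_macroblock(*size) for size in BLOCK_SIZES}
total_partitions = sum(partition_counts.values())        # 1 + 2 + 2 + 4 + 8 + 8 + 16

# Cycle budget for the fractional-pel stage (per-macroblock reading of "606 cycles").
WIDTH, HEIGHT, FPS = 1280, 720, 30
CYCLES_PER_MB = 606
CLOCK_HZ = 100_000_000

mbs_per_frame = (WIDTH // 16) * (HEIGHT // 16)           # 80 * 45
cycles_per_second = mbs_per_frame * FPS * CYCLES_PER_MB
clock_share = cycles_per_second / CLOCK_HZ

print(total_partitions, mbs_per_frame, cycles_per_second, round(clock_share, 3))
```

With these assumptions the fractional-pel stage needs roughly 65.4 million cycles per second, which fits within a 100-MHz clock and is consistent with the stated 720p at 30 fps support for one reference frame.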
KEYWORDS: Optical filters, Surgery, Wavelets, Finite impulse response filters, Field programmable gate arrays, Linear filtering, Discrete wavelet transforms, Medical imaging, Very large scale integration, Binary data
The present paper describes a new architecture for a Discrete Wavelet Packet Transform (DWPT) based on a folded Distributed Arithmetic (DA) implementation, which makes it possible to expand two complete stages (4-subband DWPT). The proposed parameterized architecture can use different CDF wavelet coefficients at different precisions.
Because the distributed arithmetic technique lends itself to scalable designs, the proposed architecture can be easily parameterized. The data input and coefficient precision can be increased by modifying the register size and the memory space, respectively. The number of coefficients can also be changed by increasing the memory and replicating the register structure. Our architecture uses only two FIR filters (high-pass and low-pass) that are folded to calculate several wavelet stages time-multiplexed on the same hardware. A DWPT using CDF(9/7) wavelet coefficients is implemented on a VIRTEX-E1000-6 FPGA for different precisions. Finally, the combined use of the folding technique and the DA structure yields an operating frequency of 75 MHz with 393 flip-flop slices (at 8-bit precision) on the FPGA.
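The principle of distributed arithmetic can be illustrated in software. The sketch below shows only the generic DA idea (a look-up table of coefficient sums addressed bit-serially by the input samples), not the folded architecture of the paper; the function names are hypothetical and the integer coefficients are placeholders, not quantized CDF(9/7) values.

```python
def build_lut(coeffs):
    """LUT[addr] = sum of the coefficients whose bit is set in addr."""
    taps = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << taps)]

def da_inner_product(lut, samples, bits):
    """Bit-serial inner product of `samples` (signed, `bits`-bit two's complement)
    with the coefficients baked into `lut`."""
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]          # two's-complement encoding
    acc = 0
    for b in range(bits):
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k          # bit b of every sample forms the address
        term = lut[addr] << b
        acc += -term if b == bits - 1 else term  # the sign bit carries negative weight
    return acc

def da_fir(coeffs, signal, bits=8):
    """FIR filter y[n] = sum_k coeffs[k] * x[n - k], with zero initial state."""
    lut = build_lut(coeffs)
    taps = len(coeffs)
    padded = [0] * (taps - 1) + list(signal)
    return [da_inner_product(lut, padded[n:n + taps][::-1], bits)
            for n in range(len(signal))]

# Placeholder coefficients and 8-bit test samples (illustrative values only).
coeffs = [3, -5, 7, 2]
signal = [12, -7, 100, -128, 127, 0, 55, -33]

da_out = da_fir(coeffs, signal, bits=8)
direct = [sum(c * (signal[n - k] if n - k >= 0 else 0) for k, c in enumerate(coeffs))
          for n in range(len(signal))]
print(da_out == direct)
```

The sketch also reflects the scaling behaviour described above: a wider input word adds iterations to the bit-serial loop (register size), while more coefficients or a different coefficient precision change only the table contents and size (memory space).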
KEYWORDS: Digital signal processing, Logic, Digital filtering, Wavelets, Discrete wavelet transforms, Very large scale integration, Structural design, Electronics engineering, Computer architecture, Time-frequency analysis
The present article describes a new, highly efficient architecture for the 1-D discrete wavelet packet transform (DWPT) based on lifting, folding and pipelining techniques, which makes it possible to expand three complete levels. An architecture for a CDF(2,2) wavelet basis is proposed. We have designed a filter bank using a lifting factorization for these coefficients, and we have used an extension of the recursive pyramid algorithm (RPA) to obtain the three complete levels. We have pipelined our architecture to reach a maximally fast structure with only one logic operator in the critical path. Moreover, our architecture achieves 75 % hardware utilization for a DWPT realization. A comparison between our DWPT architecture and other DWPT architectures is presented. Other DWPT architectures are based on memory access, which implies a lower operating frequency and higher power consumption than our lifting-based pipelined architecture.
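A software sketch of the transform itself may help to fix ideas. The code below implements one CDF(2,2) lifting step (predict, then update) and applies it recursively to every subband to obtain a three-level wavelet packet decomposition. The integer rounding, the mirrored boundary handling and the function names are assumptions made for this sketch; the abstract does not specify them, and the code does not model the folded, pipelined hardware.

```python
def lift_forward(x):
    """One CDF(2,2) lifting step on an even-length integer list -> (low, high)."""
    even, odd = x[0::2], x[1::2]
    half = len(odd)
    # Predict: detail = odd sample minus the mean of its two even neighbours.
    high = [odd[n] - ((even[n] + even[min(n + 1, half - 1)]) >> 1)
            for n in range(half)]
    # Update: smooth = even sample plus a quarter of the neighbouring details.
    low = [even[n] + ((high[max(n - 1, 0)] + high[n] + 2) >> 2)
           for n in range(half)]
    return low, high

def lift_inverse(low, high):
    """Undo lift_forward by running the lifting steps backwards."""
    half = len(low)
    even = [low[n] - ((high[max(n - 1, 0)] + high[n] + 2) >> 2)
            for n in range(half)]
    odd = [high[n] + ((even[n] + even[min(n + 1, half - 1)]) >> 1)
           for n in range(half)]
    x = []
    for e, o in zip(even, odd):
        x.extend((e, o))
    return x

def dwpt(x, levels=3):
    """Wavelet packet transform: split every subband at every level.
    The input length must be a multiple of 2**levels."""
    bands = [list(x)]
    for _ in range(levels):
        bands = [sub for band in bands for sub in lift_forward(band)]
    return bands

def idwpt(bands):
    """Inverse wavelet packet transform (subbands in the order produced by dwpt)."""
    bands = [list(b) for b in bands]
    while len(bands) > 1:
        bands = [lift_inverse(bands[i], bands[i + 1])
                 for i in range(0, len(bands), 2)]
    return bands[0]

signal = [3, 7, 1, 8, 2, 9, 4, 6, 5, 0, 7, 3, 8, 1, 6, 2]   # arbitrary test data
subbands = dwpt(signal, levels=3)
print(len(subbands), idwpt(subbands) == signal)
```

Three complete levels yield eight subbands, and because each lifting step is inverted exactly, the integer version reconstructs the input without error.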
The present work describes a new architecture for a CDF(2,2) wavelet basis. The proposed architecture is based on the recursive pyramid algorithm (RPA) and the multirate folding technique to obtain better performance. The use of folding and retiming techniques improves area and speed. In order to obtain a maximally fast structure, we have modified the initial architecture scheduling, introducing internal pipelining delays to reduce the logic depth to one adder.
Two different implementations, using the lifting scheme and polyphase decomposition, are discussed. The lifting implementation requires approximately 52 % fewer hardware resources than the polyphase structure. Finally, a comparison between our architecture and other folded architectures, which perform all the computations in one filter bank, is presented. Our folded architecture reduces the number of registers and logic operators, increasing the operating frequency and minimizing the occupied area with the same throughput (one input / one output). Moreover, by replicating the delay blocks we can easily scale this architecture up. Our architecture achieves 87.5 % hardware utilization.
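The way an RPA-style schedule shares one filter bank among several levels can be sketched as follows. This is an illustrative model only: the abstract states neither the number of levels nor the exact schedule, so the slot assignment and the function names below are assumptions. In this model level j is computed once every 2^j time slots, so the busy fraction for J levels is 1 - 2^(-J).

```python
def rpa_schedule(levels, slots):
    """Level computed in each time slot (slots numbered from 1); None marks an idle slot."""
    schedule = []
    for t in range(1, slots + 1):
        level = (t & -t).bit_length()    # 1 + number of trailing zero bits of t
        schedule.append(level if level <= levels else None)
    return schedule

def utilisation(levels):
    """Fraction of busy slots over one schedule period of 2**levels slots."""
    sched = rpa_schedule(levels, 1 << levels)
    return sum(s is not None for s in sched) / len(sched)

three_level = rpa_schedule(3, 8)
print(three_level, utilisation(3))
```

For three levels this model gives a busy fraction of 7/8, which coincides with the 87.5 % figure quoted above, although the abstract does not confirm that this is how the figure was obtained.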