KEYWORDS: Signal to noise ratio, Digital signal processing, Logic, Clocks, Field programmable gate arrays, Data processing, Signal processing, Logic devices, Algorithm development, Computer architecture
Applications based on Fast Fourier Transform (FFT) such as signal and image processing require high computational
power, plus the ability to choose the algorithm and architecture to implement it. This paper explains the realization of a
Split Radix FFT (SRFFT) processor based on a pipeline architecture reported before by the same authors. This
architecture has as basic building blocks a Complex Butterfly and a Delay Commutator. The main advantages of this
architecture are:
* It combines the higher parallelism of radix-4 FFTs (4r-FFTs) with the ability to process sequences whose length is
any power of two.
* The multipliers and adder-subtracters operate simultaneously, as is implicit in the SRFFT, which leads to faster
operation at the same degree of pipelining.
The processor has been implemented on a Field Programmable Gate Array (FPGA) to obtain high performance at an
economical price and with a short realization time. The Delay Commutator has been designed so that it can be
customized for even and odd SRFFT computation levels. It can be used with pipelined (segmented) arithmetic of any
depth in order to raise the operating frequency. The processor has been simulated at up to 350 MHz, with an
Altera Stratix II EP2S15F672C3 as the target device, for a transform length of 256 complex points.
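The split-radix decomposition that the abstract refers to can be illustrated in software. The following is a minimal recursive sketch of the textbook decimation-in-time split-radix FFT, not the authors' pipelined Complex Butterfly and Delay Commutator hardware; the function name `srfft` is an assumption made for illustration. It shows how the even samples feed a half-length (radix-2-like) transform while the samples at indices 4k+1 and 4k+3 feed two quarter-length (radix-4-like) transforms.

```python
import cmath

def srfft(x):
    """Recursive split-radix DIT FFT (illustrative; len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    u = srfft(x[0::2])    # half-length transform of the even samples
    z = srfft(x[1::4])    # quarter-length transform of samples 4k+1
    zp = srfft(x[3::4])   # quarter-length transform of samples 4k+3
    q = n // 4
    out = [0j] * n
    for k in range(q):
        # Two twiddle multiplications per butterfly: W^k and W^(3k).
        a = cmath.exp(-2j * cmath.pi * k / n) * z[k]
        b = cmath.exp(-2j * cmath.pi * 3 * k / n) * zp[k]
        s = a + b
        d = -1j * (a - b)  # multiplication by -j is only a swap and a negation
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] + d
        out[k + 3 * q] = u[k + q] - d
    return out
```

Calling `srfft` on a list of 256 complex samples gives a transform of the length quoted in the abstract; any other power-of-two length works the same way.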
Several commercial processors have adopted the radix-8 multiplier architecture to increase their speed by reducing
the number of partial products. Radix-8 encoding reduces the number of digits in a signed-digit representation. Its
performance bottleneck is the generation of the term 3X, also referred to as the hard multiple. This term is usually
computed by a shift-and-add operation, 3X=2X+X, in a high-speed adder. In a 2X+X addition, neighboring full adders
share the same input signal. This property permits simpler algebraic expressions for the 3X operation than for a
conventional addition. This paper shows that the 3X operation can be expressed in terms of two signals, Hi and Ki,
functionally equivalent to two carries. Hi and Ki are computed in parallel using architectures which lead to an area-
and speed-efficient implementation. For comparison, standard-cell implementations of conventional adders have been
compared with the proposed circuits based on the Hi and Ki signals. As a result, the delay of the proposed serial
scheme is reduced by roughly 67% without additional cost in area, the delay and area of the carry look-ahead scheme
are reduced by 20% and 17%, and those of the parallel prefix scheme are reduced by 26% and 46%, respectively.
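To make the role of the hard multiple concrete, here is a minimal software sketch of standard radix-8 Booth recoding. It is not the Hi/Ki circuit proposed in the paper, whose equations are not given in this abstract; the function names and the 16-bit default width are assumptions made for illustration. Each recoded digit lies in -4..4, so the multiples X, 2X and 4X are plain shifts and only 3X needs a real addition.

```python
def booth_radix8_digits(y, bits):
    """Recode a signed `bits`-bit integer into radix-8 Booth digits in -4..4."""
    groups = -(-bits // 3)            # ceil(bits / 3) digits
    width = 3 * groups
    u = y & ((1 << width) - 1)        # two's complement, sign-extended to `width`
    digits = []
    prev = 0                          # implicit bit to the right of the LSB
    for i in range(groups):
        b0 = (u >> (3 * i)) & 1
        b1 = (u >> (3 * i + 1)) & 1
        b2 = (u >> (3 * i + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev)
        prev = b2                     # overlapping bit for the next group
    return digits

def multiply_radix8(x, y, bits=16):
    """Multiply x by a signed `bits`-bit y using radix-8 partial products."""
    x3 = (x << 1) + x                 # hard multiple 3X = 2X + X
    multiples = {0: 0, 1: x, 2: x << 1, 3: x3, 4: x << 2}
    acc = 0
    for i, d in enumerate(booth_radix8_digits(y, bits)):
        pp = multiples[abs(d)]
        acc += (pp if d >= 0 else -pp) << (3 * i)
    return acc
```

With a 16-bit multiplier this recoding produces six partial products, which is the reduction in partial products that motivates radix-8 despite the cost of generating 3X.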
KEYWORDS: Signal to noise ratio, Statistical analysis, Video, Error analysis, Computer programming, Quantization, Very large scale integration, Computer architecture, Standards development, Video coding
H.264/AVC (Advanced Video Coding) is the latest standard for video coding. It assumes a scalar forward quantizer applied at the encoder, which can be implemented directly in integer arithmetic. This paper presents an efficient architecture for computing the forward quantization of H.264/AVC. It uses a modification of the quantization operation that reduces the number of arithmetic operations, and a truncated Booth multiplier based on an adaptive statistical approach, which reduces the hardware. The C code of the JM reference software has been rewritten to analyze the effect of the new algorithm and of the truncated Booth multiplier. Simulations run on popular test sequences used in video standardization show the validity of this approach. The results demonstrate that, at low QP, the PSNR improves by between 0.31 dB and 0.81 dB, with a slight increase in bit rate of around 0.8%. Finally, an architecture suitable for VLSI implementation is presented, which reduces the area by 26%, the power by 32% and the critical path delay by 21% in comparison with the classical implementation. Moreover, it also reduces the area and increases the speed in comparison with the architectures presented in the references.
This paper describes a high-throughput, cost-effective architecture for an 8x8 2-D DCT/IDCT processor. The 2-D DCT/IDCT is calculated using the separability property, so that the architecture is made up of two 1-D processors and a transpose buffer (TB) as intermediate memory. This transpose buffer has a regular structure based on D-type flip-flops with a double serial input/output data flow that is well suited to pipeline architectures. The processor has been designed with a parallel and pipelined architecture to attain high throughput, reduced hardware and maximum efficiency in all arithmetic elements. This architecture allows the processing elements and arithmetic units to work in parallel at half the frequency of the data input rate, except for the normalization of the transform, which is done in a multiplier operating at the maximum frequency. Moreover, a precision analysis has verified that the proposed processor meets the requirements of IEEE Std. 1180-1990, used in the ITU-T H.261 and ITU-T H.263 video codecs. The processor has been designed using a standard cell methodology and manufactured in a 0.35-μm CMOS CSD 3M/2P 3.3V process. It has an area of 6.25 mm² (the core is 3 mm²) and contains a total of 11.7k gates, of which 5.8k gates are flip-flops. A data input rate of 300 MHz has been established, with a latency of 172 cycles for the 2-D DCT and 178 cycles for the 2-D IDCT. The computing time of a block is close to 580 ns. Its computing speed and hardware complexity indicate that the proposed design is suitable for HDTV applications.
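The separability property mentioned above can be sketched in a few lines: a 1-D transform is applied to every row, the intermediate result is transposed, and the same 1-D transform is applied again. The floating-point sketch below only illustrates that row-column data flow under the usual orthonormal DCT-II definition; it does not model the fixed-point arithmetic, the half-rate processing elements or the flip-flop transpose buffer of the actual chip, and all function names are assumptions.

```python
import math

N = 8
# Orthonormal DCT-II basis: C[k][n] is the weight of sample n in coefficient k.
C = [[(math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
      * math.cos((2 * n + 1) * k * math.pi / (2 * N))
      for n in range(N)] for k in range(N)]

def dct_1d(v):
    return [sum(C[k][n] * v[n] for n in range(N)) for k in range(N)]

def idct_1d(v):
    return [sum(C[k][n] * v[k] for k in range(N)) for n in range(N)]

def transpose(block):
    # Software stand-in for the transpose buffer between the two 1-D passes.
    return [list(col) for col in zip(*block)]

def dct_2d(block):
    rows = [dct_1d(r) for r in block]              # first 1-D pass (rows)
    cols = [dct_1d(r) for r in transpose(rows)]    # second 1-D pass (columns)
    return transpose(cols)

def idct_2d(block):
    rows = [idct_1d(r) for r in block]
    cols = [idct_1d(r) for r in transpose(rows)]
    return transpose(cols)
```

Two 1-D passes cost on the order of 2 x 8 x 64 multiply-accumulates per block in this naive form, against 64 x 64 for the direct 2-D definition, which is why the row-column structure is the usual starting point for hardware.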
The Discrete Cosine Transform (DCT) is the most widely used transform for image compression. The Integer Cosine Transform denoted ICT (10, 9, 6, 2, 3, 1) has been shown to be a promising alternative to the DCT due to its implementation simplicity, similar performance and compatibility with the DCT. This paper describes the design and implementation of an 8×8 2-D ICT processor for image compression that meets the numerical requirements of IEEE Std. 1180-1990. The processor uses a low-latency data flow that minimizes the internal memory, and a parallel pipelined architecture based on a numerical strength reduction Integer Cosine Transform (10, 9, 6, 2, 3, 1) algorithm, in order to attain high throughput and continuous data flow. A prototype of the 8×8 ICT processor has been implemented using a standard cell design methodology and a 0.35-μm CMOS CSD 3M/2P 3.3V process on a 10 mm² die. Pipelining techniques have been used to attain the maximum operating frequency allowed by the technology, giving a critical path of 1.8 ns, which should be increased by 20% to allow for line delays, placing the estimated operating frequency at 500 MHz. The circuit includes 12446 cells, 6757 of which are flip-flops. Two clock signals are distributed, an external one (fs) and an internal one (fs/2). The high number of flip-flops has forced the use of a strategy to minimize clock skew, combining large buffers on the periphery with wide metal lines (clock trunks) to distribute the signals.
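As an illustration of why the coefficient set (10, 9, 6, 2, 3, 1) is usable as a transform, the sketch below builds an 8×8 integer matrix by substituting those integers into the sign pattern of the 8-point DCT, and checks that its rows are mutually orthogonal. The matrix layout is an assumption based on the usual order-8 ICT construction and is not taken from the paper; the direct matrix-vector product shown here is also not the numerical strength reduction algorithm used in the processor.

```python
# ICT coefficients (a, b, c, d, e, f) = (10, 9, 6, 2, 3, 1).
a, b, c, d, e, f = 10, 9, 6, 2, 3, 1

# Assumed layout: DCT sign pattern with integers in place of the cosines.
T = [
    [1,  1,  1,  1,  1,  1,  1,  1],
    [a,  b,  c,  d, -d, -c, -b, -a],
    [e,  f, -f, -e, -e, -f,  f,  e],
    [b, -d, -a, -c,  c,  a,  d, -b],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [c, -a,  d,  b, -b, -d,  a, -c],
    [f, -e,  e, -f, -f,  e, -e,  f],
    [d, -c,  b, -a,  a, -b,  c, -d],
]

def ict_1d(v):
    """Unnormalized 1-D ICT: integer multiplications and additions only."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]

def gram():
    """Inner products between all pairs of rows; diagonal if rows are orthogonal."""
    return [[sum(p * q for p, q in zip(r1, r2)) for r2 in T] for r1 in T]

# The odd rows are orthogonal because ab = ac + bd + cd (90 = 60 + 18 + 12).
assert a * b == a * c + b * d + c * d
```

The rows have different energies (the diagonal of `gram()`), so a per-coefficient normalization is still needed after the integer stage; in practice this scaling is often merged with quantization.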