Proc. SPIE. 7444, Mathematics for Signal and Information Processing
KEYWORDS: Signal to noise ratio, Digital signal processing, Optical spheres, Matrices, Digital filtering, Computing systems, Telecommunications, Embedded systems, Wireless communications, Filtering (signal processing)
Modern embedded and reconfigurable systems need to support a wide range of applications, many of which may
significantly benefit from hardware support for floating-point arithmetic. Some of these applications include
3D graphics, multiple-input multiple-output (MIMO) wireless communication algorithms, orthogonal frequency
division multiplexing (OFDM) based systems, and digital filters. Many of these applications have real-time
constraints that cannot tolerate the high latency of software-emulated floating-point arithmetic. Moreover,
software emulation can lead to higher energy consumption that may be unsuitable for applications in power-constrained
environments. This paper examines applications that can potentially benefit from hardware support
for floating-point arithmetic and discusses some approaches taken for floating-point arithmetic in embedded
and reconfigurable systems. Precision and range analysis is performed on emerging applications in the MIMO
wireless communications domain to investigate the potential for low-power floating-point units that utilize reduced
precision and exponent range.
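As a rough illustration of what a precision and range analysis involves, the sketch below rounds a value to a hypothetical reduced floating-point format. The function name `quantize`, its parameters, and the choices to saturate on overflow and flush to zero on underflow are assumptions made for this example, not details taken from the paper.

```python
import math

def quantize(x, frac_bits, exp_min, exp_max):
    """Round x to a binary float with frac_bits fraction bits and an
    exponent limited to [exp_min, exp_max] (hypothetical format)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << (frac_bits + 1)      # one hidden bit plus frac_bits fraction bits
    m = round(m * scale) / scale      # round to nearest, ties to even
    y = math.ldexp(m, e)
    exponent = math.frexp(y)[1] - 1   # exponent of the normalised form 1.f * 2**exponent
    if exponent > exp_max:            # overflow: saturate to the largest finite value
        largest = (2.0 - 2.0 ** -frac_bits) * 2.0 ** exp_max
        return math.copysign(largest, x)
    if exponent < exp_min:            # underflow: flush to zero (subnormals not modelled)
        return 0.0
    return y
```

Running an application's data through such a function for several values of `frac_bits` and exponent limits, then comparing the outputs against double-precision results, gives a first estimate of how much precision and range the application actually needs.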
KEYWORDS: Digital signal processing, 3D applications, Visualization, 3D modeling, Personal digital assistants, Light sources and illumination, Computer engineering, 3D displays, Computer graphics, 3D image processing
To support a broad dynamic range and a high degree of precision, many of 3D rendering's fundamental algorithms have traditionally been performed in floating point. However, fixed-point data representation is preferable to floating-point representation in graphics applications on embedded devices, where performance is of paramount importance and the dynamic range and precision requirements are limited by the small display sizes (current PDAs are 640 × 480 (VGA), while cell phones are even smaller). In this paper we analyze the efficiency of a CORDIC-augmented Sandbridge processor when implementing a vertex processor in software using fixed-point arithmetic. A CORDIC-based solution for vertex processing exhibits a number of advantages over classical Multiply-and-Accumulate solutions. First, since a single primitive is used to describe the computation, the code can easily be vectorized and multithreaded, and thus fits the major Sandbridge architectural features. Second, since a CORDIC iteration consists of only a shift operation followed by an addition, the computation may be deeply pipelined. Initially, we outline the Sandbridge architecture extension which encompasses a CORDIC functional unit and the associated instructions. Then, we consider rigid-body rotation, lighting, exponentiation, vector normalization, and perspective division (which are some of the most important data-intensive 3D graphics kernels) and propose a scheme to implement them on the CORDIC-augmented Sandbridge processor. Preliminary results indicate that the performance improvement within the extended instruction set ranges from 3× to 10× (with the exception of rigid body rotation).
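The shift-and-add structure of a CORDIC iteration can be sketched in software as follows. This is a generic rotation-mode CORDIC in fixed point, not the Sandbridge instruction set extension; the word format (`FRAC_BITS`), the iteration count, and all names are assumptions chosen for the example.

```python
import math

FRAC_BITS = 16      # assumed fixed-point format: 16 fraction bits
ITERATIONS = 16     # assumed number of CORDIC iterations

# Elementary rotation angles atan(2**-i), stored in fixed point.
ANGLES = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(ITERATIONS)]

# Each iteration stretches the vector; the accumulated gain is about 1.6468.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
INV_GAIN = round((1 << FRAC_BITS) / GAIN)

def to_fixed(value):
    """Convert a real number to the assumed fixed-point format."""
    return round(value * (1 << FRAC_BITS))

def cordic_rotate(x, y, angle):
    """Rotate fixed-point (x, y) by a fixed-point angle in radians.

    Valid for |angle| below about 1.74 rad. The result is scaled by GAIN
    unless the input was pre-multiplied by 1/GAIN.
    """
    z = angle
    for i in range(ITERATIONS):
        # Only shifts and additions are needed per iteration.
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    return x, y
```

Starting from `(INV_GAIN, 0)` the outputs approximate the cosine and sine of the angle, which is the core of a 2D rigid-body rotation; the other kernels mentioned above would need other CORDIC modes that are not shown here.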
This paper describes truncated squarers, which are specialized squarers with a portion of the squaring matrix eliminated. Rounding error and errors due to matrix reduction are quantified and analyzed. Constant and variable correction techniques are presented that minimize either the mean error or the maximum absolute error as required by the application. Area and delay estimates are presented for a number of designs, as well as error statistics obtained both analytically and numerically by exhaustive simulation. As an example, one design of a 16-bit truncated squarer using constant correction is 10.1% faster and requires 27.9% less area than a comparable standard squarer with true rounding. The range of error for this truncated squarer is -0.892 to +0.625 ulps, compared to +/-0.5 ulps for the standard squarer.
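A bit-level software model helps make the idea of eliminating part of the squaring matrix concrete. The sketch below assumes an unsigned operand and a folded squaring matrix in which columns below `k` are dropped; the function names and the way the correction constant is chosen are illustrative assumptions, not the designs evaluated in the paper.

```python
def truncated_square(a, n, k, correction=0):
    """Square an n-bit unsigned integer, ignoring matrix columns below k.

    In the folded squaring matrix, a_i*a_i contributes a_i to column 2*i and
    the pair a_i*a_j (i < j) contributes one bit to column i + j + 1.
    """
    total = correction
    for i in range(n):
        if not (a >> i) & 1:
            continue
        if 2 * i >= k:                          # diagonal term kept
            total += 1 << (2 * i)
        for j in range(i + 1, n):
            if (a >> j) & 1 and i + j + 1 >= k:  # folded off-diagonal term kept
                total += 1 << (i + j + 1)
    return total

def mean_error(n, k, correction=0):
    """Average of (truncated result - exact square) over all n-bit inputs."""
    errors = [truncated_square(a, n, k, correction) - a * a for a in range(1 << n)]
    return sum(errors) / len(errors)
```

For example, `round(-mean_error(8, 6))` gives one possible constant that brings the mean error of an 8-bit squarer with six dropped columns close to zero; minimizing the maximum absolute error instead would call for a different constant.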
It is becoming increasingly common to image time-resolved flow patterns through the vascular system in all three
spatial dimensions using non-invasive methods. The capability to generate four-dimensional (4D) (x, y, z and time)
vascular flow data is growing in several modalities. Vastly undersampled Isotropic PRojection (VIPR) is one such
method using high-resolution, fast Magnetic Resonance Imaging (MRI) of the vasculature system during intravenous
contrast injection. VIPR currently produces 4D data sets of twenty to forty frames of 256³ voxels each, and stronger
magnets will allow higher resolution time series that generate gigabytes of data. Real-time visualization and analysis of
4D data can quickly overwhelm the memory and processing capabilities of desktop workstations. 4D Cluster
Visualization (4DCV) offers a straightforward, scalable approach to interactively display and manipulate 4D,
reconstructed, VIPR data sets. 4DCV exploits the inherently parallel nature of 4D frame data to interactively manipulate
and render individual 3D data frames simultaneously across all nodes of a visualization cluster. An interactive
animation is produced in real-time by reading back the 2D rendered results to a central animation console where the
image sequence is assembled into a continuous stream for display. Basic 4DCV can be extended to allow rendering of
multiple frames on one node, compression of image streams for serving remote clinical workstations, and local archival
storage of 3D data frames at the cluster nodes for quick retrieval of medical exams. 4D Cluster Visualization concepts
can also be extended to distributed and Grid implementations.
Novel arithmetic units are needed to achieve the cost, performance, power, and functionality requirements of emerging multimedia systems. This paper presents the design and implementation of a 64-bit arithmetic and logic unit (ALU) for multimedia processing. The 64-bit ALU supports subword-parallel processing by allowing one 64-bit, two 32-bit, four 16-bit, or eight 8-bit operations to be performed in parallel. In addition to conventional ALU operations, the ALU also supports several operations for enhanced multimedia processing including parallel compare, parallel average, parallel minimum, parallel maximum, and parallel shift and add. To efficiently implement a variety of multimedia applications, the ALU supports saturating and wrap-around arithmetic operations on unsigned and two's complement operands. This paper compares the area and delay of the 64-bit multimedia ALU to those of a more conventional 64-bit ALU.
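The behaviour of a subword-parallel addition with wrap-around or saturating arithmetic can be modelled in a few lines. This is a sequential software model written for illustration; the function name and parameters are assumptions, and the hardware described above processes all lanes at once rather than in a loop.

```python
def parallel_add(a, b, width, saturate=False, signed=False):
    """Add two 64-bit words as independent lanes of `width` bits (8, 16, 32 or 64)."""
    mask = (1 << width) - 1
    result = 0
    for lane in range(64 // width):
        shift = lane * width
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        if signed:
            # Reinterpret each lane as a two's complement value.
            if x >> (width - 1):
                x -= 1 << width
            if y >> (width - 1):
                y -= 1 << width
            total = x + y
            if saturate:
                total = max(-(1 << (width - 1)), min((1 << (width - 1)) - 1, total))
        else:
            total = x + y
            if saturate:
                total = min(mask, total)
        # Masking implements wrap-around and blocks carries between lanes.
        result |= (total & mask) << shift
    return result
```

With 8-bit lanes, adding `0x01` to `0xFF` wraps to `0x00` by default and stays at `0xFF` when `saturate=True`; with 16-bit lanes the same operands give `0x0100`, because the carry now stays inside the wider lane.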
Truncated multipliers offer significant improvements in area, delay, and power. However, little research has been done on their use in actual applications, probably due to concerns about the computational errors they introduce. This paper describes a software tool used for simulating the use of truncated multipliers in DCT and IDCT hardware accelerators. Images that have been compressed and decompressed by DCT and IDCT accelerators using truncated multipliers are presented. In accelerators based on Chen's algorithm (256 multiplies per 8 × 8 block for DCT, 224 multiplies per block for IDCT), there is no visible difference between images reconstructed using truncated multipliers with 55% of the multiplication matrix eliminated and images reconstructed using standard multipliers with the same operand lengths and intermediate precision.
Barrel shifters are often utilized by embedded digital signal processors and general-purpose processors to manipulate data. This paper examines design alternatives for barrel shifters that perform the following functions: shift right logical, shift right arithmetic, rotate right, shift left logical, shift left arithmetic, and rotate left. Four different barrel shifter designs are presented and compared in terms of area and delay for a variety of operand sizes. This paper also examines techniques for detecting results that overflow and results of zero in parallel with the shift or rotate operation. Several Java programs are developed to generate structural VHDL models for each of the barrel shifters. Synthesis results show that data-reversal barrel shifters have less area and mask-based data-reversal barrel shifters have less delay than other designs. Mask-based data-reversal barrel shifters are especially attractive when overflow and zero detection is also required, since the detection is performed in parallel with the shift or rotate operation.
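The data-reversal idea can be illustrated with a behavioural model: a single right shifter built from logarithmic stages handles all six operations, and left operations are obtained by reversing the bits before and after it. The operation mnemonics, function names, and the assumption that the operand size `n` is a power of two are choices made for this sketch, not the VHDL models generated in the paper.

```python
def reverse_bits(x, n):
    """Reverse the order of the n bits of x."""
    r = 0
    for i in range(n):
        r = (r << 1) | ((x >> i) & 1)
    return r

def shift_right_stages(x, amount, n, fill_bit=0, rotate=False):
    """Right shift or rotate using log2(n) stages; stage s moves by 2**s."""
    for s in range(n.bit_length() - 1):        # n is assumed to be a power of two
        d = 1 << s
        if (amount >> s) & 1:
            if rotate:
                incoming = x & ((1 << d) - 1)  # low bits wrap around to the top
            else:
                incoming = ((1 << d) - 1) if fill_bit else 0
            x = (x >> d) | (incoming << (n - d))
    return x & ((1 << n) - 1)

def barrel_shift(x, amount, n, op):
    """op is one of 'srl', 'sra', 'ror', 'sll', 'sla', 'rol'."""
    left = op in ("sll", "sla", "rol")
    if left:
        x = reverse_bits(x, n)                 # turn a left operation into a right one
    if op in ("ror", "rol"):
        y = shift_right_stages(x, amount, n, rotate=True)
    elif op == "sra":
        y = shift_right_stages(x, amount, n, fill_bit=(x >> (n - 1)) & 1)
    else:                                      # srl, sll, sla fill with zeros
        y = shift_right_stages(x, amount, n)
    return reverse_bits(y, n) if left else y
```

In this model `sla` produces the same bit pattern as `sll`; the overflow and zero detection discussed above is not modelled.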
This paper presents a general FIR filter architecture utilizing truncated tree multipliers for computation. The average error, maximum error, and variance of error due to truncation are derived for the proposed architecture. A novel technique that reduces the average error of the filter is presented, along with equations for computing the signal-to-noise ratio of the truncation error. A software tool written in Java is described that automatically generates structural VHDL models for specific filters based on this architecture, given parameters such as the number of taps, operand lengths, number of multipliers, and number of truncated columns. We show that a 22.5% reduction in area can be achieved for a 24-tap filter with 16-bit operands, 4 parallel multipliers, and 12 truncated columns. For this implementation, the average reduction error is only 9.18 × 10⁻⁵ ulps, and the reduction error SNR is only 2.4 dB less than the roundoff SNR of an equivalent filter without truncation.
Symmetric table addition methods (STAMs) approximate functions by performing parallel table lookups, followed by multioperand addition. STAMs require significantly less memory than direct table lookups and are faster than piecewise linear approximations. This paper investigates the application of STAMs to the sigmoid function and its derivative, which are commonly used in artificial neural networks. Compared to direct table lookups, STAMs require between 23 and 41 times less memory for sigmoid and between 24 and 46 times less memory for sigmoid's derivative, when the input operand size is 16 bits and the output precision is 12 bits.
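The table-lookup-plus-addition principle can be sketched with a simplified two-table scheme. The code below is an assumption-laden illustration: it uses a 12-bit input in [0, 1), splits it into three 4-bit fields, stores unquantized table entries, and does not exploit the table symmetry that gives STAMs their name, so it does not reproduce the memory figures quoted above.

```python
import math

N0, N1, N2 = 4, 4, 4          # assumed split of the input bits: x = x0 + x1 + x2
N = N0 + N1 + N2
ULP = 2.0 ** -N

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Midpoints of the ranges covered by the x1 and x2 fields.
D1 = (2 ** N1 - 1) / 2 * 2.0 ** -(N0 + N1)
D2 = (2 ** N2 - 1) / 2 * ULP

# Table indexed by (x0, x1): the function value at the middle of the x2 range.
TABLE_A0 = [sigmoid(i * 2.0 ** -(N0 + N1) + D2) for i in range(2 ** (N0 + N1))]

# Table indexed by (x0, x2): a slope-based correction for the x2 field.
TABLE_A1 = [[sigmoid_deriv(i0 * 2.0 ** -N0 + D1 + D2) * (i2 * ULP - D2)
             for i2 in range(2 ** N2)]
            for i0 in range(2 ** N0)]

def stam_sigmoid(k):
    """Approximate sigmoid(k * 2**-N) with two lookups and one addition."""
    hi = k >> N2                  # the x0 and x1 fields together
    i0 = k >> (N1 + N2)           # the x0 field
    i2 = k & (2 ** N2 - 1)        # the x2 field
    return TABLE_A0[hi] + TABLE_A1[i0][i2]
```

The two tables hold 256 entries each, compared with 4096 entries for a direct lookup over the same 12-bit input, and the two lookups are independent, so they can proceed in parallel.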
In many digital signal processing and multimedia applications, results that overflow are saturated to the most positive or most negative representable number. This paper presents efficient techniques for performing saturating n-bit integer multiplication on unsigned and two's complement numbers. Unlike conventional techniques for saturating multiplication, which compute a 2n-bit product and then examine the n most significant product bits to determine if overflow has occurred, the techniques presented in this paper compute only the (n + 1) least significant bits of the product. Specialized overflow detection units, which operate in parallel with the multiplier, determine if overflow has occurred and the product should be saturated. These techniques are applied to designs for saturating array multipliers that perform either unsigned or two's complement saturating integer multiplication, based on an input control signal. Compared to array multipliers that use conventional methods for saturation, these multipliers have about half as much area and delay.
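For reference, the conventional behaviour that these techniques are compared against can be modelled as follows. This sketch forms the full product and then saturates it; it is a behavioural model with assumed names, and it does not implement the reduced-width multiplier or the parallel overflow detection units that the paper proposes.

```python
def saturating_multiply(a, b, n, signed=False):
    """Multiply two n-bit integers and saturate the result to n bits."""
    product = a * b                          # full 2n-bit product
    if signed:
        lowest, highest = -(1 << (n - 1)), (1 << (n - 1)) - 1
        return max(lowest, min(highest, product))
    if product >> n:                         # any upper product bit set means overflow
        return (1 << n) - 1
    return product
```

For 8-bit unsigned operands, `200 * 2` saturates to 255; for 8-bit two's complement operands, `-128 * -1` saturates to 127.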
This paper presents a high-speed method for approximating reciprocals and square roots. This method is based on piecewise second-order Taylor expansion, yet it requires one third as many coefficients and has less delay. It can be implemented using a table lookup, a multiplication and a merged operation, which has approximately the same area and delay as a multiplication. The table lookup and merged operation are independent and can be performed in parallel. To reduce the overall hardware requirements, this method can share hardware with an existing multiplier.
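As a baseline for comparison, a plain piecewise second-order Taylor approximation of the reciprocal can be sketched as below. It stores three coefficients per interval, which is the scheme the paper improves upon rather than the proposed method; the interval count, the input range [1, 2), and all names are assumptions for this example.

```python
INDEX_BITS = 6                       # assumed: 64 intervals over [1, 2)
STEP = 2.0 ** -INDEX_BITS

def build_table():
    """Precompute Taylor coefficients of 1/x around each interval midpoint."""
    table = []
    for i in range(1 << INDEX_BITS):
        x0 = 1.0 + (i + 0.5) * STEP
        # 1/x = 1/x0 - (x - x0)/x0**2 + (x - x0)**2/x0**3 - ...
        table.append((x0, 1.0 / x0, -1.0 / x0 ** 2, 1.0 / x0 ** 3))
    return table

COEFFS = build_table()

def taylor_reciprocal(x):
    """Approximate 1/x for 1 <= x < 2 using one lookup and a quadratic."""
    i = int((x - 1.0) / STEP)        # leading fraction bits select the interval
    x0, c0, c1, c2 = COEFFS[i]
    d = x - x0
    return c0 + d * (c1 + d * c2)    # Horner form of c0 + c1*d + c2*d**2
```

With 64 intervals the truncated third-order term bounds the error at roughly 2⁻²¹, so larger tables or higher-order terms trade memory against arithmetic in the usual way.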