Due to the synchronization method employed by most modern digital circuits, the maximum propagation delay through
an adder unit is typically used to set the system level delay for addition operations. The actual delay of a binary addition
computation is fundamentally tied to the longest carry propagation chain created by certain input operands. Although the
probability of lengthy carry propagation chains is quite low, modern synchronous adders devote a large portion of their
silicon area and energy consumption to accelerating carry propagation; considerable die area and system power are thus
spent optimizing an improbable worst-case delay. Using asynchronous self-timed circuits, similar adder performance can
be obtained at a fraction of the hardware cost and energy consumption.
This paper shows the inadequacy of characterizing self-timed adder performance using the assumption that typical input
operands are uniformly randomly distributed, and presents a new self-timed adder characterization benchmark based on
the SpecINT 2000 benchmark suite. The SpecINT 2000 suite was selected because no carry propagation chain distribution
has been published for this modern benchmark suite, and because its large code size and non-synthetic workloads make it
well suited for measuring carry propagation chains. In all, over 4.7 billion
addition/subtraction operations resulting from address and data calculations were tabulated to create the SpecINT 2000
Carry Propagation Distribution. The new distribution calls into question the accuracy of existing distributions based on
the Dhrystone integer benchmark and demonstrates that characterizing self-timed adders with uniformly random input
operands can overestimate their performance by over 50 percent.
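The carry propagation chain for a given operand pair can be measured directly from the generate (a AND b) and propagate (a XOR b) signals. A minimal Python sketch of this measurement (the function name and chain-length convention are illustrative, not taken from the paper):

```python
def longest_carry_chain(a: int, b: int, width: int = 32) -> int:
    """Length of the longest carry propagation chain produced by the
    addition a + b on width-bit operands.  A chain starts at a bit
    position that generates a carry and extends through each consecutive
    position that propagates it."""
    generate = a & b    # positions that create a carry
    propagate = a ^ b   # positions that pass a carry along
    longest = current = carry = 0
    for i in range(width):
        g = (generate >> i) & 1
        p = (propagate >> i) & 1
        if carry and p:
            current += 1              # chain continues through this bit
        else:
            current = 1 if g else 0   # chain (re)starts at a generate bit
        longest = max(longest, current)
        carry = g | (p & carry)
    return longest
```

For example, adding 0b0111 and 0b0001 generates a carry at bit 0 that ripples through bits 1 and 2, giving a chain of length 3, while adding all-ones to 1 produces a full-width chain.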
Artificial neural networks have been used in applications that require complex procedural algorithms and in systems that lack an analytical mathematical model. By designing a large network of computing nodes based on the artificial neuron model, new solutions can be developed for computational problems in fields such as image processing and speech recognition. Neural networks are inherently parallel since each neuron, or node, acts as an autonomous computational element. Artificial neural networks use a mathematical model for each node that processes information from other nodes in the same region. This processing entails a weighted-average computation followed by a nonlinear mathematical transformation; typical artificial neural network applications use the exponential function or trigonometric functions for the nonlinear transformation. Various simple artificial neural networks have been implemented using a processor to compute the output of each node sequentially. This approach relies on sequential processing and does not exploit the parallelism of a complex artificial neural network. In this work a hardware-based approach to artificial neural network applications is investigated. A Field Programmable Gate Array (FPGA) is used to implement an artificial neuron using hardware multipliers, adders, and CORDIC functional units. To create a large-scale artificial neural network, area-efficient hardware units such as CORDIC units are needed. High-performance and low-cost bit-serial CORDIC implementations are presented. Finally, the FPGA resources and the performance of a hardware-based artificial neuron are reported.
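The CORDIC algorithm computes trigonometric functions using only shifts and adds, which is what makes it attractive for area-efficient neurons. A floating-point Python sketch of rotation-mode CORDIC follows as a software model only; it illustrates the iteration, not the bit-serial hardware described above:

```python
import math

def cordic_sin_cos(theta: float, iterations: int = 32) -> tuple[float, float]:
    """Rotation-mode CORDIC: returns (cos(theta), sin(theta)) for
    theta in [-pi/2, pi/2] using only add/subtract and halving
    (the 2**-i factors are shifts in hardware)."""
    # Precomputed rotation angles atan(2^-i) and the constant scale factor K.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    k = 1.0
    for i in range(iterations):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate toward the residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x * k, y * k
```

Each iteration resolves roughly one additional bit of the result, which is why an n-bit output needs about n shift-add steps.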
Today FPGAs are used in many digital signal processing applications. To design high-performance, area-efficient DSP pipelines, various arithmetic functions and algorithms must be used. In this work, FPGA-based functional units for the Cosine, Arctangent, and Square Root functions are designed using bipartite tables and iterative algorithms. The bipartite tabular approach was 4-12 times faster than the iterative approach but requires 8-40 times more FPGA hardware resources to implement these functions. Next, these functions, along with the FPGA hardware multipliers and a reciprocal bipartite table unit, are used to build rectangular-to-polar and polar-to-rectangular conversion macro-functions. These macro-functions allow a 7-10 times performance improvement for the high-performance pipelines or an area reduction of 9-17 times for the low-cost implementations. In addition, a software tool for designing FPGA-based DSP pipelines using the Cosine, Sine, Arctangent, Square Root, and Reciprocal units together with the hardware multipliers is presented.
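A bipartite table replaces one large lookup table with two much smaller ones: a table of initial values indexed by the high and middle bits of the input, and a shared table of slope-based offsets indexed by the high and low bits. A Python sketch of the idea, shown here for the reciprocal on [1, 2); the field widths and function names are illustrative choices, not the dissertation's parameters:

```python
def make_bipartite_tables(f, n0=4, n1=3, n2=3):
    """Build the initial-value table (indexed by high and middle fields)
    and the offset table (indexed by high and low fields) for f on [1, 2)
    with an (n0 + n1 + n2)-bit input fraction."""
    n = n0 + n1 + n2
    step = 2.0 ** -n
    t_init, t_off = {}, {}
    for x0 in range(1 << n0):
        for x1 in range(1 << n1):
            base = 1.0 + ((x0 << (n1 + n2)) | (x1 << n2)) * step
            mid = base + ((2 ** n2 - 1) / 2) * step   # midpoint of the low field
            t_init[(x0, x1)] = f(mid)
        # The offset depends only on the high field's local slope, so one
        # small table is shared across all middle-field values.
        seg_mid = 1.0 + (x0 + 0.5) * 2.0 ** -n0
        df = (f(seg_mid + step) - f(seg_mid)) / step  # local slope estimate
        for x2 in range(1 << n2):
            t_off[(x0, x2)] = df * (x2 - (2 ** n2 - 1) / 2) * step
    return t_init, t_off

def bipartite_eval(t_init, t_off, x, n0=4, n1=3, n2=3):
    """Approximate f(x) as t_init[high, middle] + t_off[high, low]."""
    n = n0 + n1 + n2
    bits = int((x - 1.0) * 2 ** n)
    x0 = bits >> (n1 + n2)
    x1 = (bits >> n2) & ((1 << n1) - 1)
    x2 = bits & ((1 << n2) - 1)
    return t_init[(x0, x1)] + t_off[(x0, x2)]
```

The two tables together hold 2^(n0+n1) + 2^(n0+n2) entries instead of 2^(n0+n1+n2), which is the source of the area savings, at the cost of one extra addition per lookup.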
The current trend of exponentially increasing clock frequencies and transistor counts per die drives up power consumption, the die area dedicated to the clock distribution network, and the clock overhead incurred relative to the clock cycle time. Self-timed circuits may provide an alternative to synchronous circuit design that reduces the negative characteristics of the high-speed clocks synchronous circuits require. This work presents a gate-level performance model and transistor-level performance, power, and area approximations for both self-timed and static CMOS ripple-carry adders. These results show that, for self-timed circuits with uniformly random input operands, the average delay of a ripple-carry adder grows only logarithmically with operand width, improving performance by 37% at the cost of a 30% increase in total transistor width compared to a static CMOS ripple-carry adder.
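The logarithmic average-case behavior can be checked with a small simulation: iterate the carry recurrence c[i+1] = g[i] OR (p[i] AND c[i]) to a fixed point and count the steps, which models the completion time a self-timed ripple-carry adder would signal. This is a software sketch under uniform random inputs, not the paper's transistor-level model:

```python
import random

def settle_time(a: int, b: int, width: int = 32) -> int:
    """Iterate the carry recurrence to a fixed point and count the steps:
    a model of self-timed ripple-carry completion time, which equals the
    longest carry propagation chain for these operands."""
    g, p = a & b, a ^ b
    carries = [0] * (width + 1)
    steps = 0
    while True:
        new = [0] * (width + 1)
        for i in range(width):
            new[i + 1] = ((g >> i) & 1) | (((p >> i) & 1) & carries[i])
        steps += 1
        if new == carries:
            return steps - 1   # the last iteration changed nothing
        carries = new

def avg_settle(width: int, trials: int = 2000) -> float:
    """Average settle time over uniformly random operand pairs."""
    return sum(settle_time(random.getrandbits(width), random.getrandbits(width), width)
               for _ in range(trials)) / trials

random.seed(0)   # deterministic sampling for reproducibility
```

Running `avg_settle` for increasing widths shows the average near log2(width): doubling the operand width adds roughly one step, while a synchronous adder must budget for the full-width worst case.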
High-performance arithmetic algorithms are often based on functional iteration, and these algorithms do not directly produce a remainder. Without the remainder, rounding often requires additional computation or increased quotient precision. Multiplicative divide algorithms typically compute the quotient as the product of the dividend and the reciprocal of the divisor, <i>Q</i> = <i>a</i> <i>x</i> (1/<i>b</i>). Typical rounding techniques require that the quotient error be less than a maximum bound such as 1/2 unit in the last place (ulp). When using normalized floating point numbers the quotient error may be approximately twice as large as the reciprocal error, since a<sub>max</sub> ≈ 2 and E<sub>q</sub> ≈ 2 <i>x</i> E<sub>r</sub>. If the rounding algorithm requires |E<sub>q</sub>| < 1/2 ulp, then the reciprocal error bound must be |E<sub>r</sub>| < 1/4 ulp. This work proposes a quantitative method to relax the reciprocal error bound for normalized floating point numbers while achieving a fixed quotient error bound. The proposed error bound of |E<sub>r</sub>| < 1/(2 <i>x</i> <i>b</i>) guarantees a quotient error |E<sub>q</sub>| < 1/2 ulp while allowing the reciprocal error to lie in the range of 1/4 to 1/2 ulp. Under the relaxed bound, the reciprocal error may be larger in the region where the reciprocal is hardest to compute, without increasing the quotient error bound.
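The error scaling underlying this argument is easy to check numerically: since Q = a x (1/b), a reciprocal error E<sub>r</sub> becomes a quotient error E<sub>q</sub> = a x E<sub>r</sub>, so with a normalized dividend below 2 the quotient error stays below twice the reciprocal error. A toy Python check with illustrative values, not the paper's derivation:

```python
# Normalized significands in [1, 2); a is chosen near its maximum.
a, b = 1.9375, 1.5
r = 1.0 / b                    # exact reciprocal of the divisor
e_r = 1e-8                     # an injected reciprocal error E_r
q_err = a * (r + e_r) - a * r  # resulting quotient error, E_q = a * E_r
```

Here `q_err` equals `a * e_r` up to floating-point rounding, and because a < 2 it remains below 2 x E<sub>r</sub>, matching the a<sub>max</sub> ≈ 2 argument above.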
A parametric time delay model for comparing floating point unit implementations is proposed. The model is used to compare a previously proposed floating point adder based on a redundant number representation with other high-performance implementations. The operand width, the fan-in of the logic gates, and the radix of the redundant format serve as parameters to the model. The comparison is performed over a range of operand widths, fan-ins, and radices to show the merits of each implementation.
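The abstract does not give the model's functional form. Purely to illustrate how such a parametric comparison works, here is a hypothetical gate-level delay sketch in Python; every formula below is an assumption for illustration, not the paper's model:

```python
import math

def tree_levels(leaves: int, fanin: int) -> int:
    """Gate levels of a fan-in-limited reduction tree over `leaves` inputs."""
    levels = 0
    while leaves > 1:
        leaves = -(-leaves // fanin)   # ceiling division
        levels += 1
    return levels

def cla_delay(width: int, fanin: int) -> int:
    """Hypothetical delay of a lookahead-style conventional adder: one
    level for propagate/generate, a log-depth carry tree, one for sums."""
    return 2 + tree_levels(width, fanin)

def redundant_delay(width: int, fanin: int, radix: int) -> int:
    """Hypothetical delay of a redundant (carry-free) adder: each radix-r
    digit is resolved locally, so the delay is independent of width."""
    digit_bits = int(math.log2(radix))
    return 3 + tree_levels(max(digit_bits, 2), fanin)
```

Sweeping width, fan-in, and radix through such functions reproduces the shape of the comparison: conventional adder delay grows with log of the operand width, while redundant-format addition stays flat, at the cost of a wider final conversion.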
Schwarz demonstrates the reuse of a multiplier partial product array (PPA) to approximate higher-order functions such as the reciprocal, division, and square root. This work presents techniques to decrease the worst-case error of the reciprocal approximation computed on a fixed-height PPA. In addition, a compensation table is proposed that, when combined with the reciprocal approximation, produces a fixed-precision result. The design space for a 12-bit reciprocal is then studied and the area-time tradeoff for three design points is presented: increasing the reciprocal approximation computation decreases the area needed to implement the function while increasing the overall latency. Finally, the applicability of the proposed technique to the bipartite ROM reciprocal table is discussed. The proposed technique allows hardware reconfigurability: programmable inputs for the PPA allow the hardware unit to be reconfigured to compute various higher-order function approximations.
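The compensation-table idea can be sketched in software: a cheap approximation (standing in for the PPA sum) is paired with a small table that corrects it to a fixed precision. In the Python sketch below the linear term, table size, and precision target are illustrative assumptions, not the paper's PPA encoding:

```python
FRAC = 12          # illustrative target precision (fractional bits)
K = 11             # compensation table indexed by the top K fraction bits of b

def cheap_approx(b: float) -> float:
    """Stand-in for a partial-product-array approximation of 1/b: a crude
    linear term that is cheap to form but inaccurate on its own."""
    return 2.0 - b

# Compensation table: for each high-bits bucket, store the correction that
# recenters the cheap approximation on 1/b at the bucket midpoint, rounded
# to FRAC fractional bits.
comp = {}
for idx in range(1 << K):
    mid = 1.0 + (idx + 0.5) * 2.0 ** -K
    comp[idx] = round((1.0 / mid - cheap_approx(mid)) * 2 ** FRAC) / 2 ** FRAC

def recip_fixed(b: float) -> float:
    """Reciprocal of b in [1, 2) to roughly FRAC bits of precision."""
    idx = int((b - 1.0) * (1 << K))
    return cheap_approx(b) + comp[idx]
```

The tradeoff mirrors the one in the abstract: a more accurate base approximation would allow a smaller compensation table, while a cruder one (as here) shifts the burden onto table area.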