Binary sorting is a well-defined problem that has a range of proposed solutions. Svoboda showed how one particular sorting technique can be used to implement a parallel counter. Among other uses, this type of counter can be used to perform binary multiplication. This paper presents several extensions to Svoboda's work using multi-bit sorting groups as the base building blocks for sorting. Additionally, the existing bitonic sorting algorithm is modeled, synthesized, and compared to Svoboda sorting for a range of word sizes.
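The link between sorting and counting can be sketched concretely: a sorting network applied to a vector of bits produces a thermometer (unary) code, and the length of the run of 1s is exactly the count a parallel counter computes. The following Python sketch illustrates that principle with a bitonic network; it is an illustration of the idea, not Svoboda's circuit.

```python
def bitonic_sort(bits, descending=True):
    # classic recursive bitonic sorting network (length must be a power of two)
    def merge(a, desc):
        if len(a) <= 1:
            return a
        k = len(a) // 2
        for i in range(k):
            if (a[i] > a[i + k]) != desc:
                a[i], a[i + k] = a[i + k], a[i]
        return merge(a[:k], desc) + merge(a[k:], desc)

    def sort(a, desc):
        if len(a) <= 1:
            return a
        k = len(a) // 2
        # sort halves in opposite directions to form a bitonic sequence
        return merge(sort(a[:k], True) + sort(a[k:], False), desc)

    return sort(list(bits), descending)

def parallel_count(bits):
    # sorting the bits in descending order yields a thermometer code;
    # the position of the first 0 is the number of 1 inputs
    therm = bitonic_sort(bits, descending=True)
    return therm.index(0) if 0 in therm else len(therm)
```

For example, sorting the bit vector 1,0,1,1,0,0,1,0 yields 1,1,1,1,0,0,0,0, from which the count of four is read off directly.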
A large number of adder designs are available to suit the constraints of a particular application, e.g., speed, fanout, wire complexity, area, and power consumption. However, a lower bound has been established on the speed of these adders, and it has not been possible to design reliable adders faster than this bound. This paper deals with the design and implementation of a speculative adder that takes advantage of the probabilistic dependence of the maximum carry-propagate chain length on the adder operand size. That is, this type of adder is designed to produce correct results for the vast majority of inputs, namely those whose carry-propagate chains are shorter than the length for which the adder has been designed. An improvement to an earlier speculative adder design is proposed that uses Ling equations to speed it up. The resulting speculative adder, called the ACLA, is compared with the earlier design and with traditional adders such as the Ling and Kogge-Stone adders in terms of area, delay, and number of gates required. The ACLA is at least 9.8% faster and 20% smaller than the previous design. A circuit for error detection and error correction has also been implemented, resulting in the Reliable Adder (RA). When implemented as a sequential circuit, the combination of ACLA and RA can significantly increase the average speed of the adder unit.
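The probabilistic premise, that long carry-propagate chains are rare for random operands, can be checked with a quick Monte Carlo sketch in Python; this models the statistics only, not the adder circuit itself.

```python
import random

def longest_propagate_run(a, b, n):
    # bit i propagates an incoming carry when a_i XOR b_i = 1;
    # the worst-case carry chain length is the longest run of such bits
    run = best = 0
    for i in range(n):
        if ((a >> i) ^ (b >> i)) & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def fraction_covered(n, chain_limit, trials=20000, seed=1):
    # fraction of random operand pairs whose longest carry-propagate chain
    # fits within the length a speculative adder was designed for
    rnd = random.Random(seed)
    ok = sum(
        longest_propagate_run(rnd.getrandbits(n), rnd.getrandbits(n), n) <= chain_limit
        for _ in range(trials)
    )
    return ok / trials
```

Since the expected longest run grows only logarithmically with the word size, a 64-bit speculative adder designed for, say, 12-bit chains already covers the overwhelming majority of random inputs.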
Fast multipliers consist of an array of AND gates, a bit reduction stage, and a final two-operand addition. There are three widely recognized types of fast multipliers: Wallace, Dadda, and reduced area. These multipliers are distinguished by their techniques for the bit reduction stage; however, little research has been invested in optimizing the final addition stage. This paper presents an approach for investigating the optimal final adder structure for the three types of fast multipliers. Multiple designs are characterized using the Virginia Tech 0.25 μm TSMC standard cell library and Synopsys Design Vision to determine area, power consumption, and critical delay relative to a traditional carry-lookahead adder. A figure of merit that combines these measurements is used to compare the performance of the proposed adders. Although this analysis neglects loading, interconnect, and several other parameters needed to accurately model the multipliers in a modern VLSI process technology, the results obtained using the figure of merit suggest that the final adder in each of the three fast multipliers can indeed be optimized.
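The paper's exact figure of merit is not reproduced here; as a purely hypothetical example of how such a composite metric ranks candidate final adders (all numbers below are invented for illustration):

```python
def figure_of_merit(delay_ns, area_um2, power_mw):
    # hypothetical composite metric: smaller delay, area, and power are all
    # better, so take the reciprocal of their product (the paper's actual
    # weighting is not reproduced here)
    return 1.0 / (delay_ns * area_um2 * power_mw)

# invented characterization numbers for two candidate final adders:
# a faster but larger, hungrier design vs. a slower, smaller one
fom_fast = figure_of_merit(delay_ns=1.8, area_um2=6100.0, power_mw=1.6)
fom_small = figure_of_merit(delay_ns=2.1, area_um2=5200.0, power_mw=1.4)
```

A metric of this shape lets the speed/area/power trade-off be collapsed to a single number per design, which is the role the figure of merit plays in the comparison above.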
Newton-Raphson functional approximation is an attractive division strategy that provides quadratic convergence. With appropriate computational resources, it can be faster than digit-recurrence methods if an accurate initial approximation is available. Several table-lookup-based initial approximation methods have been proposed previously. This paper examines some of these methods and implements a 24-bit divider utilizing a ROM smaller than 1 Kb. A Taylor-series-based reciprocal approximation method is used that employs a table lookup followed by a multiplication. Simulations confirm that the design achieves the desired accuracy after one Newton-Raphson iteration.
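The iteration itself is compact. The Python sketch below uses a hypothetical 7-bit plain lookup table (not the paper's Taylor-series scheme) to seed the reciprocal of a divisor normalized to [1, 2), then applies the Newton-Raphson step x' = x(2 - dx), which squares the relative error on each pass.

```python
def reciprocal_nr(d, table_bits=7, iterations=1):
    # d is assumed normalized to [1, 2); a small ROM indexed by the top
    # fraction bits supplies the initial estimate, then each Newton-Raphson
    # step x' = x * (2 - d*x) squares the relative error
    assert 1.0 <= d < 2.0
    index = int((d - 1.0) * (1 << table_bits))       # top fraction bits
    midpoint = 1.0 + (index + 0.5) / (1 << table_bits)
    x = 1.0 / midpoint                               # ROM entry: midpoint reciprocal
    for _ in range(iterations):
        x = x * (2.0 - d * x)                        # quadratic convergence
    return x
```

With a 7-bit table the seed is good to roughly 8 bits, so one iteration reaches about 16 bits; a lookup-plus-multiply seed such as the Taylor-series method in the paper starts with roughly twice the bits, which is what makes 24-bit accuracy reachable in a single iteration.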
This paper compares several designs of spanning tree adders for 16- and 32-bit widths. The carry-select part of the spanning tree is implemented using ripple carry and carry-skip adders (4, 8, and 16 bits) and compared in terms of delay, complexity, and power consumption. The spanning tree design is also compared with a conventional carry-lookahead adder. All designs use only 2-input NAND and NOR gates and inverters in a 0.18 μm CMOS technology. Delay and power consumption are determined through simulations performed with Synopsys and Cadence design tools. The spanning tree adder realized with carry-skip adders is about 40% faster than the carry-lookahead adder, with an approximate increase of 17% in complexity and 22% in power.
With aggressive technology scaling there has been a rapid increase in leakage currents in modern CMOS processes. Leakage is composed mainly of sub-threshold and gate leakage, and it has been increasing exponentially with the scaling of threshold voltage and gate oxide thickness, drastically affecting power consumption. With the explosive growth in the use of embedded multimedia processors, high-performance circuits need to be ultra-low-power in standby mode while maintaining a moderate power budget during runtime. One of the most power-consuming high-performance macros of an embedded processor is the floating point unit. This paper examines various macros of a floating point unit designed using a dynamic-power-optimized logic style called Limited Switch Dynamic Logic, together with circuit solutions for reducing the impact of leakage on these macros. Results are obtained for the 90 nm and 65 nm CMOS technology nodes.
Although Dadda multipliers offer the greatest speed potential, with a delay proportional to log(n), they are not often used in everyday designs because of their irregular structure and the ensuing difficulty of implementation. This paper presents a program that automatically generates HDL code describing a Dadda multiplier of specified size. The resulting HDL code is then synthesized to a generic library in the TSMC13G (0.13 μm) process. It is observed that delay increases only marginally when the multiplier size grows from 16 to 64 bits, while total area increases drastically.
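The core of such a generator is Dadda's stage-height sequence: starting from 2, each maximum height is ⌊3d/2⌋ of the previous one, and an n-bit multiplier needs one reduction stage per sequence value below its initial partial-product height n. A minimal Python sketch of the sequence logic only (not the HDL emitter):

```python
def dadda_heights(n):
    # Dadda's maximum-height sequence d_1 = 2, d_{j+1} = floor(3*d_j/2);
    # an n x n multiplier's reduction passes through every value below n,
    # listed here from the first (largest) target height down to 2
    h = [2]
    while h[-1] * 3 // 2 < n:
        h.append(h[-1] * 3 // 2)
    return list(reversed(h))

def dadda_stages(n):
    # number of reduction stages for an n x n Dadda multiplier
    return len(dadda_heights(n))
```

Because the sequence grows geometrically, quadrupling the word size from 16 to 64 bits adds only four stages (6 to 10), which is consistent with the marginal delay growth observed above.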
There are many good concepts that have been developed that collectively enable the achievement of incredibly high levels of performance in current computers. In this paper, seven key concepts are identified.
A technique for reducing the sign extension overhead in adder trees is presented. A generalized version of the technique is shown to reduce the number of redundant sign extension computations required for reducing parallel adder trees from N terms to two terms. Additionally, the technique eliminates the fan-out latency that traditional sign extension places on late-arriving sign bits. Two's-complement number growth is also managed in carry-save form without the need for carry propagation. The application of the technique to 2N-term adder trees is demonstrated. The implementation requires no computational overhead and needs minimal hardware. This design not only reduces hardware complexity, but also reduces computation delay. Finally, a simple circuit transformation to the traditional 4-2 compressor allows simple construction of circuits utilizing the technique.
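The arithmetic identity underlying such sign extension techniques is that -s·2^(m-1) ≡ (1-s)·2^(m-1) + (2^W - 2^(m-1)) (mod 2^W): each sign bit can be complemented in place and the sign extensions of all N terms collapse into one precomputed constant. The Python sketch below verifies that bookkeeping; it models the identity, not the paper's 4-2 compressor circuit.

```python
def sum_with_sign_trick(values, m, W):
    # Accumulate m-bit two's-complement integers at W-bit width without
    # sign-extending any term: complement each sign bit in place, then add
    # one precomputed constant per term; modulo 2^W the result equals the
    # true sum (illustrative model, not a gate-level design)
    mask_w = (1 << W) - 1
    total = 0
    for v in values:
        u = v & ((1 << m) - 1)       # raw m-bit encoding of the term
        u ^= 1 << (m - 1)            # replace sign bit with its complement
        total += u
    # single correction constant per term: 2^W - 2^(m-1) (mod 2^W)
    total += len(values) * ((1 << W) - (1 << (m - 1)))
    return total & mask_w
```

In hardware the per-term constants are summed at design time, so the whole tree pays for one constant row instead of N fans of replicated sign bits.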
The two well-known fast multipliers are those presented by Wallace and Dadda. Both consist of three stages. In the first stage, the partial product matrix is formed. In the second stage, this partial product matrix is reduced to a height of two. In the final stage, these two rows are combined using a carry propagating adder. In the Wallace method, the partial products are reduced as soon as possible. In contrast, Dadda's method does the minimum reduction necessary at each level to perform the reduction in the same number of levels as required by a Wallace multiplier. It is generally assumed that, for a given size, the Wallace multiplier and the Dadda multiplier exhibit similar delay. This is because each uses the same number of pseudo adder levels to perform the partial product reduction. Although the Wallace multiplier uses a slightly smaller carry propagating adder, usually this provides no significant speed advantage. A closer examination of the delays within these two multipliers reveals this assumption to be incorrect. This paper presents a detailed analysis for several sizes of Wallace and Dadda multipliers. These results indicate that despite the presence of the larger carry propagating adder, Dadda's design yields a slightly faster multiplier.
A new MSB-first bit-serial adder/subtracter architecture is proposed. The architecture utilizes a modified Manchester carry chain to accommodate the carry from the future LSBs. The carry chain is shown to have a constant delay of two AND gates and one XOR gate regardless of the operand width, which allows a fast, constant operating clock frequency. Compared to the conventional parallel addition approach, in which the operand bits are stored and then added in parallel, the proposed architecture also provides a significant area saving. It is also shown that the proposed architecture can be generalized to radix-r operands.
This paper examines the design of a hybrid prefix adder under the condition of non-uniform input signal arrival. This condition is encountered in the final adder of fast parallel multipliers, which use column-compression reduction. The prefix graph scheme efficiently accommodates the non-uniform arrival times. Rules are presented for designing hybrid prefix adders under such conditions. These rules produce adders that are faster and less complex than those of previous work.
A carry-skip adder is faster than a ripple carry adder and has a simple structure. To maximize speed, it is necessary to optimize the widths of the blocks that comprise the carry-skip adder. This paper presents a simple algorithm to select the size of each block. Assuming that each logic gate has a unit delay, the algorithm achieves slightly faster designs for 64- and 128-bit adders than previous methods developed by Guyot et al. and Kantabutra.
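The optimization problem can be illustrated with a unit-delay model in Python (the gate weights below are assumptions for illustration, not the paper's exact model): the worst carry is generated in one block, skips the intermediate blocks, and ripples through a later block, which is why tapered block widths beat uniform ones.

```python
def carry_skip_delay(blocks, ripple=1, skip=1):
    # worst-case carry delay for a carry-skip adder with the given block
    # widths: a carry born in block i ripples out of it, skips the blocks
    # in between, and ripples into the block j where it is consumed
    worst = 0
    for i in range(len(blocks)):
        for j in range(i, len(blocks)):
            d = blocks[i] * ripple            # ripple out of block i
            d += skip * max(0, j - i - 1)     # skip the blocks in between
            if j > i:
                d += blocks[j] * ripple       # ripple into block j
            worst = max(worst, d)
    return worst
```

Under this model a tapered 16-bit partition such as [2, 3, 6, 3, 2] has a shorter worst-case path than the uniform [4, 4, 4, 4], which is the effect the block-sizing algorithm exploits.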
A high-speed multiplier-accumulator (MAC) architecture is presented that accepts up to 2k pairs of n-bit 2's complement input operands and generates one (2n+k)-bit result. The implementation uses a two-stage pipelined multiplier with Dadda reduction and a single carry-propagating adder. A significant speedup is achieved with this implementation over conventional MAC designs, while the complexity is reduced slightly. An example design with 24-bit input operands is presented in a 0.35 μm CMOS technology.
This paper presents a hybrid pipelined and multiplexed architecture for finite-duration impulse response (FIR) filters used in real-time applications. The emphasis is placed on efficient hardware utilization compared to conventional multiplexed or pipelined architectures. The multiply-accumulate (MAC) unit used here is a Booth-recoded, Wallace-tree-based multiplier in which the accumulator is merged with the partial product matrix. However, the approach is equally well suited to other multiplier/accumulator implementations. A novel sign extension technique is described and incorporated into the partial product reduction phase. It effectively reduces the number of rows that need sign extension, and can be used in conjunction with existing sign extension techniques. The final structure described is a form of pipelined tap-division multiplexing with the goal of maximum hardware re-use during run-time, given input data rate constraints. A method to compute the optimal hybrid pipeline depth is presented, based on the ratio of the input data rate to the critical speed of the hardware in the target technology. Performance is measured against FIR structures implemented using conventional, multiplexed, and pipelined approaches. It is shown that hardware complexity reductions of up to 18% over conventional pipelined architectures, and up to 29% over multiplexed approaches, can be achieved.
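As a toy illustration of the depth computation (the formula below is an assumed form for illustration, not the paper's): when the hardware's critical rate exceeds the input data rate, one MAC can be time-multiplexed across several taps, and only the taps left over after multiplexing need pipelined hardware.

```python
import math

def hybrid_pipeline_depth(hw_rate_hz, input_rate_hz, taps):
    # hypothetical sketch: mux_factor taps can share one MAC because the
    # hardware runs that many times faster than input samples arrive;
    # the taps remaining after multiplexing set the pipeline depth
    mux_factor = max(1, int(hw_rate_hz // input_rate_hz))
    return math.ceil(taps / mux_factor)
```

For instance, hardware four times faster than the sample rate would let a 16-tap filter run with a depth of 4 under this assumed model, sitting between the fully pipelined (depth 16) and fully multiplexed (depth 1, if the hardware were 16x faster) extremes.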
Fault tolerance is increasingly important as society has come to depend on computers for more and more aspects of daily life. The current concern about the Y2K problems indicates just how much we depend on accurate computers. This paper describes work on time-shared TMR, a technique used to provide arithmetic operations that produce correct results in spite of circuit faults.
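Time-shared TMR replaces three hardware copies with three passes through one unit followed by a majority vote, so a transient fault that corrupts a single pass is outvoted. A minimal software model in Python (the fault-injection wrapper is purely illustrative):

```python
def tmr_result(op, a, b):
    # run the operation three times on the (time-shared) unit and vote;
    # with three candidates, the median equals the majority when two agree
    results = sorted(op(a, b) for _ in range(3))
    return results[1]

# illustrative transient fault: the second of the three runs flips a bit
state = {'calls': 0}
def faulty_add(a, b):
    state['calls'] += 1
    r = a + b
    if state['calls'] == 2:
        r ^= 4          # injected single-run fault
    return r
```

Voting by median works here because a single faulty run leaves two matching results; against permanent faults, which corrupt every pass identically, time-sharing alone offers no protection.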
This paper discusses recent results in the area of merged arithmetic. The original sum of products technique has been extended to include add-multiply structures and cascaded multiplication. The merged implementations generally are less complex than conventional designs and offer equivalent to improved performance.
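The sum-of-products idea can be modeled simply: every AND-gate (partial-product) bit from every multiplication feeds one shared reduction, rather than each product being completed by its own adder and the results summed afterward. A Python sketch of the arithmetic being merged (a model of the dataflow, not a gate-level design):

```python
def merged_sop(pairs, n):
    # merged arithmetic for a sum of products: accumulate every weighted
    # partial-product bit of every n x n multiply directly into a single
    # total, instead of assembling each product separately
    total = 0
    for a, b in pairs:
        for i in range(n):
            if (a >> i) & 1:
                for j in range(n):
                    if (b >> j) & 1:
                        total += 1 << (i + j)   # one partial-product bit
    return total
```

In hardware, the single reduction over all partial-product bits is what makes the merged structure less complex than cascading separate multipliers and adders.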