The application of reconfigurable computing to exploit the inherent serialism in DSP algorithms is investigated with the aim of building cost-effective, real-time hardware. The DSP algorithm is expected to be decomposed so that the most compute-intensive and data-flow-oriented part is separated from the less compute-intensive and more control-flow-oriented part. The different modules in the compute-intensive part are expected to be serialized and implemented on a reconfigurable platform. The proposed architecture consists of an array of a minimum of two Field Programmable Gate Arrays (FPGAs). The FPGAs are grouped in two sets such that while one set executes the current batch of modules, the second set can be configured to execute the next batch of modules. The control-flow-oriented part of the algorithm is implemented on a DSP processor.
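The alternation between the two FPGA sets can be illustrated with a small timing model. This is only a sketch under stated assumptions: the function name `ping_pong_schedule`, the set labels "A" and "B", and the fixed per-batch execution and configuration times are hypothetical and do not come from the paper.

```python
def ping_pong_schedule(batches, exec_time, config_time):
    """Model two FPGA sets used alternately (hypothetical timing model).

    While one set executes batch i, the idle set is configured for batch i+1,
    so configuration is hidden whenever config_time <= exec_time.
    Returns (timeline, total_time), where each timeline entry is
    (batch, set_label, start, end).
    """
    timeline = []
    t = config_time  # the first batch must wait for its own configuration
    for i, batch in enumerate(batches):
        active = "A" if i % 2 == 0 else "B"
        start = t
        end = start + exec_time
        timeline.append((batch, active, start, end))
        # The idle set starts configuring at `start` and is ready at
        # start + config_time; the next batch also waits for this one to end.
        t = max(end, start + config_time)
    total = timeline[-1][3] if timeline else 0
    return timeline, total


if __name__ == "__main__":
    timeline, total = ping_pong_schedule(["m1", "m2", "m3"], exec_time=10, config_time=4)
    for entry in timeline:
        print(entry)
    print("total time:", total)
```

In this model, a single set that reconfigures between batches would need 3 * (4 + 10) = 42 time units for the same three batches, while the alternating arrangement needs 34, because only the first configuration is exposed.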
This paper addresses the problem of implementing narrow-band FIR filters using FPGAs. Rather than employing a conventional multiply-accumulate unit to compute the inner product, an alternative method based on requantization of the input data stream is presented. The requantization process preserves the dynamic range of the signal components contained in the bandwidth of the filter, while shifting the requantization noise to the spectral region to be rejected by the filter. The reduced bit-length representation of the requantized input data samples removes the requirement for a full multiplier in the filter hardware. This makes the method very attractive for realization using FPGA technology. The filtering technique is described, and implementation results using a Xilinx XC4010 FPGA are presented.
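The idea of trading word length for shaped noise can be sketched as follows. This is a minimal illustration, not the paper's design: it assumes a first-order error-feedback requantizer with a 1-bit output (which pushes noise toward high frequencies, so it suits a low-pass narrow-band filter), and the function names `requantize_1bit` and `fir_no_multiplier` are hypothetical.

```python
def requantize_1bit(samples):
    """First-order error-feedback requantizer (assumed structure).

    Maps samples in [-1, 1] to +1 or -1. The previous quantization error is
    subtracted from the next input, so the noise is shaped by (1 - z^-1)
    and concentrates at high frequencies.
    """
    out = []
    err = 0.0
    for x in samples:
        v = x - err               # feed back the previous error
        q = 1 if v >= 0 else -1   # 1-bit quantizer
        err = q - v               # error made on this sample
        out.append(q)
    return out


def fir_no_multiplier(bits, coeffs):
    """FIR filter on +/-1 data: each tap only adds or subtracts a coefficient."""
    history = [0] * len(coeffs)   # zero initial state
    out = []
    for b in bits:
        history = [b] + history[:-1]
        acc = 0.0
        for h, c in zip(history, coeffs):
            if h > 0:
                acc += c
            elif h < 0:
                acc -= c
        out.append(acc)
    return out


if __name__ == "__main__":
    bits = requantize_1bit([0.5] * 1000)
    smoothed = fir_no_multiplier(bits, [1.0 / 16] * 16)  # crude low-pass
    print(sum(bits) / len(bits), smoothed[-1])
```

Because the data samples are only +1 or -1, the inner product reduces to conditional additions and subtractions of the coefficients, which is the property that removes the full multiplier.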
This paper summarizes the development of an integrated tool for designing high-speed, real-time FIR filter circuits. The system is composed of a programmable IC and associated software for filter response analysis, synthesis of coefficients, and circuit programming. The architecture is highly regular and easily expandable, and its control is distributed. The chip can be programmed by a PC or by using an EPROM. The prototypes have been fabricated using the CMOS 1.5 micrometer standard cells of ES2. Moreover, some heuristics about multipliers upgraded to CMOS 1 micrometer (Cadence DFWII) are summarized.
The fast Fourier transform algorithm is specified in a data parallel version of 'C'. This specification is used to produce a custom circuit suitable for use in a system based on reconfigurable logic. Performance estimates indicate that this approach is capable of producing the 2D Fourier transform of images at real time video rates.
This paper investigates the concept of transformable computing as a means for achieving cost-effective high-performance computing. In this preliminary work we provide performance figures for some representative DSP problems, specifically linear convolution and Fourier transforms. Our implementations are based on a highly parallel approach. We present an experimental setup in which an EVC1-s board is interfaced with a 40 MHz Sparc 10 SunStation. We show that using an FPGA-based board provides significant speedups for the aforementioned problems over using the Sparc 10 processor alone. More importantly, it is shown that speedup will in general improve as the number of scheduled tasks increases.
This paper describes the use of the SPLASH-2 custom computing platform for real-time median and morphological filtering of images. SPLASH-2 is an FPGA-based attached processor that can be reconfigured to perform a wide variety of tasks. Although not specifically designed for image processing, the architecture is well suited for the repetitive computations and high data transfer rates that characterize most low-level image processing problems. Median filtering is a particularly good benchmark, since nonlinear rank ordering must be performed for 2D neighborhoods at every pixel location in an image. General-purpose workstations are inefficient at such tasks, whereas SPLASH-2 can be configured to perform this at a rate of 30 images per second. This paper presents the hardware/software codesign process that we have used to implement this operation, which can be pipelined with other operations by using additional SPLASH-2 processor boards. The results presented here illustrate that custom computing architectures are much faster than conventional uniprocessors, and offer an attractive alternative to dedicated special-purpose hardware when high performance is required.
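The rank-ordering workload that makes median filtering a demanding benchmark can be seen in a plain software version. The sketch below is a reference illustration only, not the SPLASH-2 design: it assumes a 3x3 neighborhood, copies border pixels unchanged, and uses the hypothetical function name `median3x3`.

```python
def median3x3(image):
    """3x3 median filter on a 2D list of pixel values.

    Border pixels are copied unchanged (an assumption made for brevity).
    Every interior pixel requires sorting its 9-pixel neighborhood, which is
    the per-pixel rank-ordering cost referred to in the abstract.
    """
    height = len(image)
    width = len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1)
                      for dx in (-1, 0, 1)]
            window.sort()
            out[y][x] = window[4]  # middle of 9 sorted values
    return out


if __name__ == "__main__":
    noisy = [[10, 10, 10],
             [10, 255, 10],
             [10, 10, 10]]
    print(median3x3(noisy))
```

The isolated bright pixel in the center is replaced by the neighborhood median, while the surrounding values are left as they were.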
The computation power needed in communication and image processing is increasing so much that parallel architecture can be seen as the only solution available. Serious drawbacks of such architectures are their lack of flexibility and the complexity of their programming. The goal of the ArMenX project is to offer a flexible development and prototyping platform for parallel signal and image processing. The ArMenX architecture consists of a set of replicated processing nodes. Each node consists of three tightly coupled units: a Transputer, an FPGA, and a DSP. The ArMenX nodes are interconnected by two media: one is an asynchronous serial link, and the other is a high-bandwidth parallel ring. Thanks to this flexible multi-DSP architecture, ArMenX allows efficient implementation of a wide variety of algorithms involved in signal and image processing. We present three examples of working applications on this parallel architecture: a parallel forward and backward propagation for a neural network; a parallel implementation of computing an electromagnetic field (for an electrostatic and magnetostatic problem) using an artificial neural network; and a large image compression algorithm using the wavelet transform. The present work deals with the integration of a high-level neural network programming environment. This environment will make it easier to take advantage of this architecture's flexibility.
The Cheops system is a compact, modular platform developed at the MIT Media Laboratory for acquisition, processing, and display of digital video sequences and model-based representations of moving scenes, and is intended as both a laboratory tool and a prototype architecture for future programmable video decoders. Cheops abstracts out a set of basic, computationally intensive stream operations that may be performed in parallel and embodies them in specialized hardware. However, Cheops incurs a substantial performance degradation when executing operations for which no specialized processor exists. As a solution to this problem, we have designed a new reconfigurable processor that combines the speed of special-purpose stream processors with the flexibility of general-purpose computing. Two SRAM-based field-programmable gate arrays are used in conjunction with a PowerPC 603 processor to provide a flexible computational substrate, which allows algorithms to be mapped to a combination of software and dedicated hardware within the data-flow paradigm. We review the Cheops system architecture, describe the hardware design of the reconfigurable processor, explain the software environment developed to allow dynamic reconfiguration of the device, and report on its performance.
A dynamic instruction set computer (DISC) has been developed to support demand-driven instruction set modification. Using partial reconfiguration, DISC pages instruction modules in and out of an FPGA as demanded by the executing program. Instructions occupy FPGA resources only when needed, and FPGA resources can be reused to implement an arbitrary number of performance-enhancing application-specific instructions. DISC further enhances the functional density of FPGAs by physically relocating instruction modules to available FPGA space. An image processing application was developed on DISC to demonstrate the advantages of paging application-specific instruction modules.
Volume visualization is a popular method for viewing simulated or experimental 3D data sets from applications such as medical imaging, computational fluid dynamics, and climate modeling. However, most software and low-cost hardware implementations of visualization algorithms do not have sufficient performance for interactive viewing. This paper discusses a method for low-cost, parallel hardware acceleration of volume rendering using a PC-hosted FPGA board. Our method uses a parallel distributed memory approach for compositing and transformation of volume data, and it provides insight into efficient use of low-cost memory systems.
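The compositing step in volume rendering is commonly expressed as accumulation of color and opacity along each viewing ray. The sketch below shows the standard front-to-back form for a single ray with scalar (grayscale) color; it is an assumption that this formulation resembles what the paper accelerates, and the function name `composite_front_to_back` is hypothetical.

```python
def composite_front_to_back(samples):
    """Front-to-back compositing along one ray.

    `samples` is a list of (color, opacity) pairs ordered from the viewer
    into the volume, with both values in [0, 1]. Each sample contributes in
    proportion to the transparency remaining in front of it.
    """
    color = 0.0
    alpha = 0.0
    for sample_color, sample_alpha in samples:
        remaining = 1.0 - alpha
        color += remaining * sample_alpha * sample_color
        alpha += remaining * sample_alpha
    return color, alpha


if __name__ == "__main__":
    # A half-opaque white sample in front of a half-opaque black sample.
    print(composite_front_to_back([(1.0, 0.5), (0.0, 0.5)]))
```

Each ray is independent of the others, which is what makes this step a natural candidate for parallel hardware.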
N-body methods are used to simulate the evolution and interaction of galaxies. These simulations are usually run on large-scale supercomputers or on very expensive full-custom hardware. This paper presents an alternative hardware method for acceleration of N-body simulations. The method yields a significant fraction of the performance of custom hardware and provides a great deal more flexibility. A prototype implementation is presented.
As field programmable gate arrays (FPGAs) and complex programmable logic devices (CPLDs) become faster, denser, and cheaper, many designers who previously used programmable logic devices (PLDs), and who need more functionality in a smaller footprint or board space for their next design, have switched to FPGAs or CPLDs. With the advent of the JTAG (IEEE 1149.1) boundary test specification came a specified method for reprogramming FPGAs and CPLDs live in the field. Using the electrically erasable manufacturing process, reconfigurable hardware or logic was invented. It is well suited to prototyping as well as to field applications, where upgrades can be done live in a matter of seconds from the personal computer on which a new design has just been compiled. In this paper we discuss several issues experienced while using the EPX780 reconfigurable FPGA, such as: 1) why the new design required a reconfigurable FPGA, 2) problems encountered in implementation, including place and route, compiling, simulating, and testing, and 3) the future use of reconfigurable hardware devices, including selection of proper development systems. Overall, we offer several tips and design rules for using reconfigurable devices in general and for FLEX 780 development specifically.
In this paper, the first user/field-programmable analog integrated circuit, an analog counterpart to a digital FPGA, is presented. This paper provides an overview of the new technology and explains its internal operation, how to use it, and how to benefit from it. Examples of various programmable functions and several applications are given to demonstrate the large degree of flexibility as well as the unprecedented ease of design.
We describe the use of a reconfigurable interface board based on FPGAs and a UNIX workstation to implement a correlation tracker with 3.8 ms latency. The correlation tracker is part of an active mirror system in use at the Swedish Vacuum Solar Telescope, La Palma, Canary Islands. The reconfigurable interface is used to leverage the workstation CPU, relieving it of tasks that it performs poorly, such as rapid context switching and low-level bit manipulation. The reconfigurable interface handles control of external devices, high-performance input (16 MB/s), and data preformatting. The workstation CPU, a 64-bit microprocessor, performs the bulk of the computation. For the key computations of the correlation tracker we are able to treat 8 pixels in parallel in the CPU's 64-bit integer datapath. We present the structure of the CCD interface configuration and the implementations of the key algorithms on the workstation CPU. We describe the design trade-offs that arose during the development of the system, and demonstrate the symbiosis between components implemented in software and configurable hardware.
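Treating 8 pixels in parallel in a 64-bit integer datapath can be illustrated with a generic packed-arithmetic trick. The abstract does not say which operations the tracker performs this way, so the lane-wise addition below is only an assumed example; the names `pack8`, `unpack8`, and `add8` are hypothetical.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every 8-bit lane
HIGH1 = 0x8080808080808080  # top bit of every 8-bit lane


def pack8(pixels):
    """Pack eight 8-bit pixel values into one 64-bit word (little-endian lanes)."""
    if len(pixels) != 8:
        raise ValueError("expected exactly 8 pixels")
    return int.from_bytes(bytes(pixels), "little")


def unpack8(word):
    """Split a 64-bit word back into eight 8-bit pixel values."""
    return list(word.to_bytes(8, "little"))


def add8(a, b):
    """Add eight 8-bit lanes at once, each modulo 256.

    The low 7 bits of each lane are added with an ordinary integer add, which
    cannot carry into the neighboring lane; the top bit of each lane is then
    fixed up with XOR.
    """
    return ((a & LOW7) + (b & LOW7)) ^ ((a ^ b) & HIGH1)


if __name__ == "__main__":
    x = [1, 2, 3, 250, 100, 200, 0, 255]
    y = [1, 1, 1, 10, 100, 100, 0, 1]
    print(unpack8(add8(pack8(x), pack8(y))))
```

One 64-bit add plus a few logical operations replaces eight separate byte additions, which is the kind of gain a wide integer datapath offers for pixel data.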
In-system programmable, SRAM-based field programmable gate arrays (FPGAs) can be used to create processors and coprocessors whose internal architecture as well as interconnections can be reconfigured to match the needs of a given application. Exploiting the inherent speed and parallelism of a hardware solution, FPGA-based coprocessors can execute computationally intensive tasks while maintaining the flexibility of a programmable solution. The subject of numerous research projects over the past few years, the use of FPGAs as reconfigurable computing elements is poised to expand rapidly in the commercial market. This manuscript gives an overview of the use of SRAM-based FPGAs as processing elements. Several prominent research projects and commercial applications are cited, and technology trends are discussed.
We define reconfigurable computing systems as those machines that use the reconfigurable aspects of field programmable gate arrays (FPGAs) to implement an algorithm. Researchers throughout the world have shown that computationally intensive software algorithms can be transposed directly into hardware designs for extreme performance gain. Hardware objects are algorithms implemented as dynamically downloadable hardware designs. Hardware objects execute on reconfigurable computing systems based on SRAM-style FPGAs. A hardware object can be created via schematic capture, the VHSIC hardware description language (VHDL), or the Verilog hardware description language. To use a hardware design in a software program, it must be converted into a hardware object. The hardware object can be used over and over or in combination with other hardware objects. This hardware object method of programming reconfigurable computers is the subject of this paper.
The most-significant-digit-first function evaluation method (E-method) allows efficient evaluation of polynomials and certain rational functions on custom hardware. The time required for the computation is of the order of m carry-free addition operations, m being the number of digits in the result. We discuss a digit-parallel and a digit-serial implementation of this method on a DecPeRLe-1 board, made up of Xilinx FPGAs. After a presentation of the E-method, we give a description of the architecture of the DecPeRLe-1 board, present our designs, and analyze their performance.
WILDFIRE is a commercial reconfigurable computer architecture based on field programmable gate array (FPGA) technology. Programmers achieve high processing performance by rapidly modifying the internal hardware architecture through software to efficiently accommodate the specific processing needs of an application. The WILDFIRE hardware and the accompanying software environment for application development and runtime operation are presented. Suitable applications for WILDFIRE and future capabilities are also discussed.
Prototypes are invaluable for studying special purpose parallel architectures and custom computing. We have built a configurable custom computing engine, based on field programmable gate arrays, to enable experiments on an interesting scale. The Teramac configurable hardware system can execute synchronous logic designs of up to one million gates at rates up to one megahertz. Search and retrieval of nontext data from very large databases can be greatly accelerated using special purpose parallel hardware. We are using Teramac to conduct experiments with special purpose processors involving search of nontext databases.
Partial reconfiguration is the ability of certain FPGAs to reconfigure only selected portions of the device while other portions continue to operate undisturbed. When used in conjunction with the runtime reconfiguration (RTR) implementation strategy, the performance of the system can be greatly enhanced. RRANN2 is an RTR artificial neural network that uses partial reconfiguration. Its operation is divided into a series of sequentially executed stages, with each stage implemented as a separate circuit module. System operation consists of sequencing through these modules at runtime, one configuration at a time. By carefully organizing each circuit module in order to establish a large number of functional and physical commonalities, partial reconfiguration is used to leave common circuitry resident on the FPGAs during system reconfiguration. Transitioning between configurations can then be accomplished by updating only the differences between circuit modules. This significantly enhances overall performance by reducing the amount of time the RTR application spends configuring. RRANN2 exhibited a 53.5% reduction in reconfiguration time through the use of partial reconfiguration. This paper presents the methodology used to design the RRANN2 system.
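Updating only the differences between circuit modules can be pictured with a toy model. The sketch below assumes that a configuration is a list of equally sized frames and that reconfiguration time is proportional to the number of frames written; both are simplifying assumptions, the frame contents and the function names `frames_to_update` and `reconfig_reduction` are hypothetical, and the example does not attempt to reproduce the 53.5% figure reported for RRANN2.

```python
def frames_to_update(current, target):
    """Return the indices of frames that differ between two configurations.

    Both configurations must describe the same device, so they are required
    to have the same number of frames.
    """
    if len(current) != len(target):
        raise ValueError("configurations must have the same number of frames")
    return [i for i, (old, new) in enumerate(zip(current, target)) if old != new]


def reconfig_reduction(current, target):
    """Fraction of configuration work saved by writing only changed frames,
    assuming time is proportional to the number of frames written."""
    changed = len(frames_to_update(current, target))
    return 1.0 - changed / len(target)


if __name__ == "__main__":
    # Hypothetical frame contents: shared control and memory logic stays
    # resident, and only the stage-specific arithmetic frame changes.
    stage_1 = ["control", "memory", "multiply_stage", "adder"]
    stage_2 = ["control", "memory", "update_stage", "adder"]
    print(frames_to_update(stage_1, stage_2))
    print(reconfig_reduction(stage_1, stage_2))
```

The more circuitry two consecutive stages share at the same physical location, the fewer frames differ, which is why the modules are organized to maximize functional and physical commonality.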
Stand-alone digital signal processors (DSPs) support many on-chip functions and are highly optimized for the demands of high-speed computing. The problem associated with this functional optimization is that the increase in performance comes at the expense of flexibility. To make the DSP general purpose enough for a wide variety of applications, a custom ASIC must be used to achieve the desired performance. DSPs and ASICs are not able to adapt easily on the fly to different algorithms. Even DSPs that can do this do not match the high level of optimization provided by an ASIC. Recent developments in FPGA design tools enable system designers to develop in-system reconfigurable adaptive DSP hardware. Designed to exploit register-rich, dynamically reconfigurable field programmable gate arrays, high-speed custom DSP functions can be created and implemented, resulting in significantly improved performance for compute-intensive applications, including graphics and image processing, telecommunications, networking, and instrumentation.