Many potential applications for reconfigurable computing need the dynamic range provided by floating-point arithmetic. However, implementing floating-point arithmetic on FPGAs is difficult because of the large amount of hardware required, particularly for multipliers. Some limited success has been obtained through digit-serial implementation of IEEE floating-point multipliers, but the IEEE representation is not easily or efficiently implemented in serial form. Therefore, we have been exploring alternative number representations. Signed-digit representations have shown some promise, since their form lends itself to serial computation, which consumes much less hardware than fully parallel approaches. We show how the signed-digit representation can be used to implement floating-point arithmetic, and we present prototype implementations using Altera FPGAs.
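For readers unfamiliar with the representation, the following sketch shows a radix-2 signed-digit encoding with digit set {-1, 0, 1}, the kind of redundant form the abstract refers to. The paper's actual FPGA format is not given here; the function names and the choice of non-adjacent form are illustrative assumptions.

```python
# Illustrative radix-2 signed-digit encoding with digit set {-1, 0, 1}.
# The authors' actual floating-point format is not reproduced here.

def sd_value(digits, radix=2):
    """Interpret a most-significant-digit-first signed-digit list as an integer."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value

def to_signed_digit(n):
    """Recode a non-negative integer into non-adjacent form (NAF),
    a canonical signed-digit form with no two adjacent nonzero digits."""
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)   # choose +1 or -1 so the next digit becomes 0
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return list(reversed(digits)) or [0]
```

The redundancy of the digit set is what permits carry-free, digit-serial addition, the property that makes serial FPGA arithmetic attractive.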
We present a design of high-radix digit slices for the implementation of an on-line multiply-add operator (OMA). Our evaluation of performance and cost shows that speedups above 1.5 can be obtained with respect to radix 2 at a reasonable increase in cost. The design and evaluation are based on Xilinx FPGAs. We also discuss the use of OMA modules in solving linear recurrences.
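The first-order linear recurrence targeted by such operators maps each step onto a single multiply-add. A plain software sketch (illustrative only; the on-line, digit-serial evaluation itself is not modeled):

```python
def linear_recurrence(a, b, x0, n):
    """Evaluate x[i+1] = a[i] * x[i] + b[i] for n steps.
    Each step is one multiply-add, the operation an OMA implements."""
    x = [x0]
    for i in range(n):
        x.append(a[i] * x[i] + b[i])
    return x
```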
Driven by the excellent properties of FPGAs and the need for high-performance, flexible computing machines, interest in FPGA-based computing machines has increased dramatically. Fixed-point adders are essential building blocks of any computing system. In this work, various high-speed addition algorithms are implemented in FPGA devices and their performance is evaluated, with the objective of finding and developing the addition algorithms most appropriate for implementation in FPGAs, and of laying the groundwork for evaluating and constructing FPGA-based computing machines. The results demonstrate that the performance of adders built by combining the FPGAs' dedicated carry logic with other addition algorithms is greatly improved, especially for larger adders.
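The abstract does not name the algorithms evaluated; as one representative high-speed scheme, here is the generate/propagate formulation of binary addition that underlies carry-lookahead adders, sketched in software (a hardware CLA evaluates the carry equations in parallel rather than in a loop):

```python
def gp_add(a_bits, b_bits, cin=0):
    """Add two LSB-first bit vectors using generate (g = a AND b) and
    propagate (p = a XOR b) terms: c[i+1] = g[i] OR (p[i] AND c[i]).
    Carry-lookahead hardware evaluates these carry equations in parallel."""
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x ^ y for x, y in zip(a_bits, b_bits)]
    c = [cin]
    for i in range(len(a_bits)):
        c.append(g[i] | (p[i] & c[i]))
    s = [p[i] ^ c[i] for i in range(len(a_bits))]
    return s, c[-1]
```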
This paper describes an application in high-performance signal processing using reconfigurable computing engines. The application is a 250 MHz cross-correlator for radio astronomy and was developed using the fastest available Xilinx FPGAs. We report experimental results on the operation of reconfigurable computers at 250 MHz, and describe the architectural innovations required to build a 250 MHz reconfigurable computer. Extensions of the technique to a variety of high-performance real-time signal processing algorithms are discussed.
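The core computation of such a correlator is the lag-domain cross-correlation. A direct, non-real-time software sketch for reference (the hardware dedicates one multiply-accumulate lane per lag; this loop is only a functional model):

```python
def xcorr(x, y, max_lag):
    """r[k] = sum over n of x[n] * y[n + k], for lags -max_lag..max_lag.
    Functional reference for what a hardware correlator accumulates."""
    n = len(x)
    result = []
    for k in range(-max_lag, max_lag + 1):
        acc = 0
        for i in range(n):
            if 0 <= i + k < n:   # ignore samples shifted off either end
                acc += x[i] * y[i + k]
        result.append(acc)
    return result
```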
The finite difference method is a numerical analysis technique used to solve problems involving irregular geometries, complicated boundary conditions, or both. Such problems are described by partial differential equations, whose solutions can be generated with the aid of a computer. As the geometries become increasingly complex, solving the partial differential equations becomes computationally more intensive. Configurable computing machines are an emerging class of computing platform that provides the computational performance of application-specific processors while retaining the flexibility and rapid reconfigurability of general-purpose processors across a diversity of tasks. Structural modeling of underwater vehicles relies on analysis involving complex boundary conditions; the finite difference method can be used to perform heat and shock analysis on the vehicles. This paper presents an implementation and performance figures for a specific domain of the finite difference method -- a two-dimensional heat transfer modeling system using a Splash-2 configurable computing machine (CCM).
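The kernel of a two-dimensional heat-transfer model is the pointwise finite-difference update. A minimal Jacobi-style sketch of that stencil (grid layout and fixed-boundary handling are illustrative assumptions, not the paper's Splash-2 mapping):

```python
def heat_step(grid):
    """One explicit finite-difference update for 2-D steady-state heat flow:
    each interior point becomes the average of its four neighbors, while
    boundary values are held fixed. Illustrative stencil only."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]          # copy so boundaries stay fixed
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new
```

Iterating this step until the grid stops changing yields the steady-state temperature field.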
Over the past three years we have developed a suite of software tools to support the use of systems with dynamically reconfigurable hardware by scientists and engineers who are not skilled hardware designers. A typical application for our system is the support software for a reconfigurable data acquisition card consisting of a fixed analog section and a configurable digital section built from one or more field programmable gate arrays (FPGAs), static RAM chips, and, perhaps, a dedicated processor. In this paper we discuss our software tools, the prototype data acquisition hardware that we have developed, and an example application.
Breakthroughs in Algorithms, Imaging, DSP, and Numerical Analysis II
A new general-purpose internal sorting algorithm, called ABCsort, appears unusually well suited to FPGA implementation. ABCsort is an O(N) algorithm (worst case, in both time and space) that, even in software, is both much faster than other internal sorts and extraordinarily flexible. ABCsort makes only read accesses to record keys, which facilitates its use in parallel on a shared-memory multiprocessor system. Although it will sort floating-point data, it requires no floating-point arithmetic; the algorithm is independent of data type except for its key-semantics logic, which makes it ideal for implementation in a reconfigurable FPGA.
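ABCsort's mechanism is not described in the abstract. As a generic stand-in, a stable counting (distribution) sort shows how an O(N)-class sort can make only read accesses to keys, the property highlighted above; all names here are illustrative and this is not ABCsort itself.

```python
def counting_sort(records, key):
    """Stable O(N + K) distribution sort over integer keys in a known range.
    Keys are only read, never modified. Generic stand-in, not ABCsort."""
    keys = [key(r) for r in records]
    lo, hi = min(keys), max(keys)
    buckets = [[] for _ in range(hi - lo + 1)]
    for r, k in zip(records, keys):
        buckets[k - lo].append(r)      # stable: input order kept per bucket
    return [r for b in buckets for r in b]
```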
Considerable success has been achieved in developing signal processing algorithms that are efficient in terms of the number of operations. What is needed now, however, is to develop new algorithms that are better adapted to existing hardware, or to devise new architectures that more efficiently exploit existing signal processing algorithms. The latter approach forms the basis of this paper. An FPGA architecture is described that takes advantage of the reduced computational requirements of the polynomial transform method for computing 2-D DFTs. The performance of the architecture is presented; it uses 36% fewer FPGA resources than a row-column DFT processor. A multi-FPGA architecture is described that is capable of processing twenty-four 512 x 512-pixel images per second; the multi-FPGA processor is 46% more area efficient than a row-column DFT implementation.
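For reference, the row-column method that serves as the paper's baseline computes a 2-D DFT as 1-D DFTs along every row and then every column. A direct (non-FFT) sketch; a real design would use an FFT or, as in the paper, a polynomial transform to reduce the work:

```python
import cmath

def dft(x):
    """Direct 1-D DFT, O(n^2); reference only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def dft2_row_column(img):
    """Row-column 2-D DFT: transform every row, then every column."""
    rows = [dft(r) for r in img]
    cols = [dft(list(c)) for c in zip(*rows)]   # transpose, transform columns
    return [list(r) for r in zip(*cols)]        # transpose back
```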
Communication technology is undergoing a revolutionary change: most traditional signal processing is moving from an analog to an algorithmic basis. Chiptelos, a software-based reconfigurable subsystem, provides applications up to radio frequencies for telemetry, telecommunication, and navigation. It allows a design to be tested and evaluated early in the project stage within a single integrated development environment.
Image processing typically requires either a very low frame rate or a considerable amount of dedicated hardware to achieve satisfactory results. Further, custom algorithms often require development of the entire video capture and processing system. Reconfigurable logic can be used to realize a dynamically alterable image capture and processing system. Algorithm changes at the frame rate are made possible with high speed and partial reconfiguration. Partial reconfiguration permits common functions such as video timing and memory control to be kept in place while the custom processing algorithms are dynamically replaced or modified. The use of a common framework with application overlays also allows the designer to concentrate on the image processing algorithm instead of worrying about the background functions. The use of reconfigurable logic for image processing provides the flexibility of a general purpose DSP with the speed of dedicated hardware.
Real-time video-rate image processing requires performance orders of magnitude beyond the capabilities of general-purpose computers. ASICs deliver the required performance, but they have the drawback of fixed functionality. Field programmable gate arrays (FPGAs) are reprogrammable SRAM-based ICs capable of real-time image processing. FPGAs deliver the benefits of hardware execution speeds and software programmability. An FPGA program creates a custom data processor, which executes the equivalent of hundreds to thousands of lines of C code on the same clock tick. FPGAs emulate circuits which would normally be built as ASICs. Multiple real-time video streams can be processed in Giga Operations' Spectrum Reconfigurable Computing (RC) Platform™. The Virtual Bus Architecture™ enables the same hardware to be configured into many image processing architectures, including 32-bit pipelines, global busses, rings, and systolic arrays. This allows an efficient mapping of data flows and memory access for many image processing applications, and the implementation of many real-time DSP filters, including convolutions, morphological operators, and recoloring and resampling algorithms. FPGAs provide significant price/performance benefits over ASICs where time to market, cost to market, and technical risk are issues, and FPGA designs migrate efficiently and easily into ASICs for downstream cost reduction.
The use of reconfigurable field-programmable gate arrays (FPGAs) for imaging applications shows considerable promise to fill the gap that often occurs when digital signal processor chips fail to meet performance specifications. Single-chip DSPs do not have the overall performance to meet the needs of many imaging applications, particularly in real-time designs. Using multiple DSPs to boost performance often presents major design challenges in maintaining data alignment and process synchronization; these challenges can impose serious cost, power-consumption, and board-space penalties. Image processing requires manipulating massive amounts of data at high speed. Although DSP chips can process data at high speed, their architectures can inhibit overall system performance in real-time imaging. The rate of operations can be increased when they are performed in dedicated hardware, such as special-purpose imaging devices and FPGAs, which provides the horsepower necessary to implement real-time image processing products successfully and cost-effectively. For many fixed applications, non-SRAM-based (antifuse or flash-based) FPGAs provide the raw speed to accomplish standard high-speed functions. However, in applications where algorithms are continuously changing and compute operations must be modified, only SRAM-based FPGAs give enough flexibility. The addition of reconfigurable FPGAs as a flexible hardware facility enables DSP chips to perform optimally. The benefits stem primarily from optimizing the hardware for the algorithms, or from using reconfigurable hardware to enhance the product architecture. With SRAM-based FPGAs capable of partial dynamic reconfiguration, such as the Cache-Logic FPGAs from Atmel, continuous modification of data and logic is not only possible but practical. First we review the particular demands of image processing. Then we present various applications and discuss strategies for exploiting the capabilities of reconfigurable FPGAs along with DSPs. We describe the benefits of a compute-oriented FPGA architecture and how partial dynamic reconfiguration delivers unprecedented capabilities for imaging systems and products.
Architectures for High-Performance Reconfigurable Computing
As field programmable gate arrays (FPGAs) and complex programmable logic devices (CPLDs) become faster, denser, and cheaper, many designers who previously used simpler programmable logic devices (PLDs), and who need more functionality in a smaller footprint or board space in their next design, have switched to FPGAs or CPLDs. With the advent of the IEEE 1149.1 (JTAG) boundary-scan specification, a standard method for reprogramming FPGAs and CPLDs live in the field became available, and electrically erasable process technology made reconfigurable hardware practical. Such hardware is well suited to prototyping as well as to field applications, where a newly compiled redesign can be downloaded live from a personal computer in a matter of seconds. In this paper I discuss several issues experienced while using the EPX780 reconfigurable FPGA: (1) why the new design required a reconfigurable FPGA; (2) problems encountered in implementation, including place and route, compiling, simulating, and testing; and (3) the future use of reconfigurable hardware devices, including selection of proper development systems. Throughout, I offer tips and design rules for using reconfigurable devices in general and FLEX 780s development in particular.
In 1945 the work of J. von Neumann and H. Goldstine created the principal architecture for electronic computation, which has now lasted fifty years. Nevertheless, alternative architectures have been created that have computational capability, for special tasks, far beyond what is feasible with von Neumann machines. The emergence of high-capacity programmable logic devices has made the realization of these architectures practical. The original ENIAC and EDVAC machines were conceived to solve special mathematical problems far from today's concept of 'killer applications.' In a similar vein, programmable hardware computation is being used today to solve unique mathematical problems. Our programmable hardware activity focuses on the research and development of novel computational systems based upon the reconfigurability of our programmable logic devices. We explore our programmable logic architectures and their implications for programmable hardware, and one programmable hardware board implementation is detailed.
This paper introduces a systematic approach to abstract modeling of VLSI digital systems using a hierarchical decomposition process and an HDL. In particular, the modeling of a back-propagation neural network on massively parallel reconfigurable hardware, rather than a toy example, is used to illustrate the design process. Based on the design specification of the algorithm, a functional model is developed through successive refinement and decomposition for execution on the reconfigurable machine. First, a top-level block diagram of the system is derived. Then, a schematic sheet of the corresponding structural model is developed to show the interconnections of the main functional building blocks. Next, the functional blocks are decomposed iteratively as required. Finally, the blocks are modeled using the HDL and verified against the block specifications.
The drive towards shorter design cycles for analog integrated circuits has given impetus to several developments in the area of field-programmable analog arrays (FPAAs). Various approaches have been taken in implementing topological and parametric programmability of analog circuits. Recent extensions of this work have married FPAAs to their digital counterparts (FPGAs), along with data conversion interfaces, to form field-programmable mixed-signal arrays (FPMAs). This survey paper reviews work to date in the area of programmable analog and mixed-signal circuits. The body of work reviewed includes university and industrial research, commercial products, and patents. A time-line of important achievements in the area is drawn and the status of various activities is summarized.
WILDFORCE is the first PCI-based custom reconfigurable computer based on the Splash 2 technology transferred from the National Security Agency and the Institute for Defense Analyses, Supercomputing Research Center (SRC). The WILDFORCE architecture has many of the features of the WILDFIRE computer, such as field-programmable gate array (FPGA) based processing elements, linear array and crossbar interconnection, and high-performance memory and I/O subsystems. New features introduced in the PCI-based WILDFORCE systems include memory/processor options that can be added to any processing element. These options include static and dynamic memory, digital signal processors (DSPs), FPGAs, and microprocessors. In addition to the memory/processor options, many different application-specific connectors can be used to extend the I/O capabilities of the system, including systolic I/O, camera input, and video display output. This paper also discusses how this new PCI-based reconfigurable computing engine is used for rapid prototyping, real-time video processing, and other DSP applications.
Currently, all networking hardware embodies predefined tradeoffs between latency and bandwidth, although in some applications one feature is more important than the other. We present a system where the tradeoff can be made on a case-by-case basis. To demonstrate this, we implement an extremely low-latency semaphore-passing network within a point-to-point system.
Wormhole run-time reconfiguration (RTR) is an attempt to create a refined computing paradigm for high performance computational tasks. By combining concepts from field programmable gate array (FPGA) technologies with data flow computing, the Colt/Stallion architecture achieves high utilization of hardware resources, and facilitates rapid run-time reconfiguration. Targeted mainly at DSP-type operations, the Colt integrated circuit -- a prototype wormhole RTR device -- compares favorably to contemporary DSP alternatives in terms of silicon area consumed per unit computation and in computing performance. Although emphasis has been placed on signal processing applications, general purpose computation has not been overlooked. Colt is a prototype that defines an architecture not only at the chip level but also in terms of an overall system design. As this system is realized, the concept of wormhole RTR will be applied to numerical computation and DSP applications including those common to image processing, communications systems, digital filters, acoustic processing, real-time control systems and simulation acceleration.
Computational bottlenecks are endemic to digital signal processing (DSP) systems, and several approaches have been developed to eliminate them cost-effectively. Application-specific integrated circuits (ASICs) effectively resolve specific bottlenecks, but for most applications they entail long development times and prohibitively high non-recurring engineering costs. Multiprocessor architectures are cost-effective, but performance gains are at best linear in the number of processors. We show how properly targeted and designed reconfigurable acceleration subsystems (RASs), implemented using field-programmable gate arrays (FPGAs), can resolve computational bottlenecks in a cost-effective manner for a broad range of DSP applications. A model is proposed to quantify the benefits of computational acceleration on reconfigurable platforms and to determine which DSP applications are amenable to effective computational acceleration. The architecture and functionality of X-CIM™, a reconfigurable acceleration subsystem recently introduced by MiroTech Microsystems, is described. X-CIM functions as a reconfigurable co-processor for TMS320C4x DSP processors, and impressive performance gains are reported: on a benchmark application consisting of a complex non-linear algorithm, a TMS320C40 processed images in 6,977 ms working alone and in 182 ms when supported by an X-CIM co-processor.
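The reported benchmark figures imply roughly a 38x speedup (6977/182 is about 38.3). A minimal Amdahl-style model of the kind such a paper proposes; the function names and the fixed overhead term are assumptions for illustration, not the paper's actual model:

```python
def accelerated_time(t_total, f_accel, s_hw, t_overhead=0.0):
    """Execution time when a fraction f_accel of the work runs on hardware
    s_hw times faster, plus a fixed transfer/reconfiguration overhead.
    Amdahl-style sketch; the paper's actual model is not reproduced here."""
    return t_total * (1 - f_accel) + t_total * f_accel / s_hw + t_overhead

def speedup(t_total, f_accel, s_hw, t_overhead=0.0):
    return t_total / accelerated_time(t_total, f_accel, s_hw, t_overhead)
```

The unaccelerated fraction bounds the gain: with f_accel = 0.9, no amount of hardware speed can deliver more than a 10x overall speedup.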
Design Methods and Tools for Reconfigurable Computing
The malleable architecture generator (MARGE) is a tool set that translates high-level parallel C to configuration bit streams for field-programmable logic based computing systems. MARGE creates an application-specific instruction set and generates the custom hardware components required to perform exactly those computations specified by the C program. In contrast to traditional fixed-instruction processors, MARGE's dynamic instruction set creation provides for efficient use of hardware resources. MARGE processes intermediate code in which each operation is annotated by the bit lengths of the operands. Each basic block (sequence of straight-line code) is mapped into a single custom instruction which contains all the operations and logic inherent in the block. A synthesis phase maps the operations comprising the instructions into register transfer level structural components and control logic which have been optimized to exploit functional parallelism and function unit reuse. As a final stage, commercial technology-specific tools are used to generate configuration bit streams for the desired target hardware. Technology-specific pre-placed, pre-routed macro blocks are utilized to implement as much of the hardware as possible. MARGE currently supports the Xilinx-based Splash-2 reconfigurable accelerator and National Semiconductor's CLAy-based parallel accelerator, MAPA. The MARGE approach has been demonstrated on systolic applications such as DNA sequence comparison.
The utility of configurable computing platforms has been demonstrated and documented for a wide variety of applications. Retargeting an application to custom computing machines (CCMs) has been shown to accelerate execution with respect to execution on a sequential, general-purpose processor. Unfortunately, these platforms have proven to be rather difficult to program when compared to contemporary general-purpose platforms. Retargeting applications is non-trivial, due to the lack of design tools which work at a high level and consider all available computational units in the target architecture. To make configurable computing accessible to a wide user base, high-level entry tools -- preferably targeted toward familiar programming environments -- are needed. Also, in order to target a wide variety of custom computing machines, such tools cannot depend on a particular, fixed architectural configuration. This paper introduces resource pools as an abstraction of general computing devices which provides a homogeneous description of FPGAs, ASICs, CPUs, or even an entire network of workstations. Also presented is an architecture-independent design tool which accepts a target architecture's description as a collection of resource pools, and partitions a program written in a high-level language onto that architecture, effectively synthesizing a hardware description for the FPGA portions of a CCM and a software description for any attached CPUs.
We introduce dynamic computation structures (DCS), a compilation technique that produces dynamic code for reconfigurable computing. DCS specializes directed graph instances into user-level hardware for reconfigurable architectures. Several problems, such as shortest path and transitive closure, exhibit the general properties of closed semirings, an algebraic structure for solving directed-path problems. Motivating our choice of closed semiring problems as the application domain is the fact that logic emulation software already maps a special case of directed graphs, namely logic netlists, onto arrays of field programmable gate arrays (FPGAs). A certain type of logic emulation software called virtual wires further allows an FPGA array to be viewed as a machine-independent computing fabric. Thus, a virtual wires compiler, coupled with front-end commercial behavioral logic synthesis software, enables automatic behavioral compilation into a multi-FPGA computing fabric. We have implemented a DCS front-end compiler to parallelize the entire inner loop of the classic Bellman-Ford algorithm into synthesizable behavioral Verilog. Leveraging virtual wire compilation and behavioral synthesis, we have automatically generated designs of 14 to 261 FPGAs from a single graph instance. We achieve speedups proportional to the number of graph edges -- from 10X to almost 400X versus a 125-SPECint SPARCstation 10.
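For reference, the algorithm in question: Bellman-Ford relaxes every edge on each pass, and it is this edge-relaxation inner loop that lends itself to per-edge hardware parallelism. A standard software formulation (not the authors' Verilog):

```python
def bellman_ford(n, edges, source):
    """Single-source shortest paths over n vertices, edges as (u, v, weight).
    After n-1 passes all distances have converged (assuming no negative cycles)."""
    dist = [float('inf')] * n
    dist[source] = 0
    for _ in range(n - 1):
        for u, v, w in edges:          # one relaxation per edge per pass
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist
```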
Over the past decade or more, processor speeds have increased much more quickly than memory speeds. As a result, a large, and still increasing, processor-memory performance gap has formed. Many significant applications suffer from substantial memory bottlenecks, and their memory performance problems are often either too unusual or too extreme to be mitigated by cache memories alone. Such specialized performance 'bugs' require specialized solutions, but it is impossible to provide case-by-case memory hierarchies or caching strategies on general-purpose computers. We have investigated the potential of implementing mechanisms like victim caches and prefetch buffers in reconfigurable hardware to improve application memory behavior. Based on technology and commercial trends, our simulation-based studies use a forward-looking model in which configurable logic is located on the CPU chip. Given such assumptions, our results show that the flexibility of being able to specialize configurable hardware to an application's memory referencing behavior more than balances the slightly slower response times of configurable memory hierarchy structures. For our three applications, small, specialized memory hierarchy additions such as victim caches and prefetch buffers can reduce miss rates substantially and can drop total execution times for these programs to between 60 and 80% of their original execution times. Our results also indicate that a different memory specialization may be most effective for each application; this highlights the usefulness of configurable memory hierarchies that are specialized on a per-application basis.
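The victim-cache mechanism can be sketched with a toy simulation; the sizes, the FIFO replacement policy, and the convention of counting victim-buffer rescues as hits are illustrative assumptions, not details from the paper:

```python
from collections import deque

class VictimCache:
    """Direct-mapped cache backed by a small fully associative victim buffer."""

    def __init__(self, main_lines=8, victim_lines=4):
        self.main = [None] * main_lines           # direct-mapped line tags
        self.victim = deque(maxlen=victim_lines)  # FIFO of recently evicted tags
        self.hits = self.misses = 0

    def access(self, block):
        idx = block % len(self.main)
        if self.main[idx] == block:
            self.hits += 1
        elif block in self.victim:
            # Conflict miss rescued by the victim buffer; swap the block
            # back into the main cache (counted as a hit here for simplicity).
            self.victim.remove(block)
            if self.main[idx] is not None:
                self.victim.append(self.main[idx])
            self.main[idx] = block
            self.hits += 1
        else:
            self.misses += 1
            if self.main[idx] is not None:
                self.victim.append(self.main[idx])
            self.main[idx] = block

# Two blocks that map to the same direct-mapped line would ping-pong
# forever, but the victim buffer turns all but the first two accesses
# into hits.
c = VictimCache()
for _ in range(8):
    c.access(0)
    c.access(8)
print(c.hits, c.misses)  # 14 2
```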
Configurable systems offer increased performance by providing hardware that matches the computational structure of a problem. This hardware is currently programmed with CAD tools and explicit library calls. To attain widespread acceptance, configurable computing must become transparently accessible from high-level programming languages, but the changeable nature of the target hardware presents a major challenge to traditional compiler technology. A compiler for a configurable computer should optimize the use of functions embedded in hardware and schedule hardware reconfigurations. The hurdles to be overcome in achieving this capability are similar in some ways to those facing compilation for heterogeneous systems. For example, traditional compilers have neither an interface to accept new primitive operators nor a mechanism for applying optimizations to new operators. We are building a compiler for heterogeneous computing, called Scale, which replaces the traditional monolithic compiler architecture with a flexible framework. Scale has three main parts: a translation director, a compilation library, and a persistent store which holds our intermediate representation as well as other data structures. The translation director exploits the framework's flexibility by using architectural information to build a plan to direct each compilation. The compilation library serves as a toolkit for use by the translation director. Our compiler intermediate representation, Score, facilitates the addition of new IR nodes by distinguishing features used in defining nodes from properties on which transformations depend. In this paper, we present an overview of the Scale architecture and its capabilities for dealing with heterogeneity, followed by a discussion of how those capabilities apply to problems in configurable computing. We then address aspects of configurable computing that are likely to require extensions to our approach and propose some extensions.
HML allows us to specify hardware at a very abstract level, and automatically generate VHDL from our specifications. The VHDL is used along with commercial CAD tools to generate field programmable logic. In this paper we present HML, a hardware description language based on SML, and discuss the translation process from HML to VHDL. As an example we use HML to specify a DTMF receiver. We present the HML for a Booth multiplier and discuss the design flow from HML to an FPGA implementation of that multiplier. HML is the only language available that applies advances in programming languages and type theory to hardware description.
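For readers unfamiliar with the multiplier used as the example, radix-2 Booth recoding can be sketched in plain Python (the paper's actual specification is in HML; the names and the 8-bit width here are assumptions):

```python
def booth_multiply(m, r, bits=8):
    """Multiply m by r (r a signed value fitting in `bits` bits) via Booth recoding."""
    rb = r & ((1 << bits) - 1)   # two's-complement view of r
    prod = 0
    prev = 0                     # implicit bit r[-1] = 0
    for i in range(bits):
        bit = (rb >> i) & 1
        if bit == 1 and prev == 0:
            prod -= m << i       # start of a run of 1s: subtract m * 2^i
        elif bit == 0 and prev == 1:
            prod += m << i       # end of a run of 1s: add m * 2^i
        prev = bit
    return prod

print(booth_multiply(3, 6))   # 18
print(booth_multiply(3, -2))  # -6
```

Runs of consecutive 1 bits in the multiplier collapse into one subtraction and one addition, which is what makes Booth's scheme attractive in hardware.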
During the development of scientific disciplines, mainstream periods alternate with revolution periods, in which 'out of the way' disciplines can become a new mainstream. Just at this moment, increasing turbulences announce a new revolution. The variety of 'high performance computing' scenes will be mixed up. Can an increasing application of structurally programmable hardware platforms (computing by the yard) break the monopoly of the von Neumann mainstream paradigm (computing in time) also in multipurpose hardware? From a co-design point of view, this paper tries to provide an overview of the turbulences and tendencies, and introduces a fundamentally new machine paradigm, which uses a field-programmable data path array (FPDPA) providing instruction-level parallelism. The paper drafts a structured design space for all kinds of parallel algorithm implementations and platforms: procedural programming versus structural programming, concurrent versus parallel, hardwired versus reconfigurable. A structured view obtained by rearranging the variety of computing science scenes seems to be feasible.
The versatility of field programmable gate array (FPGA) technology has led to the use of FPGA devices in a variety of video processing applications. FPGA use ranges from the implementation of 'glue logic' to the construction of reconfigurable coprocessors, and from prototyping only to large volume production. Several systems are cited herein as representative examples of the use of FPGA devices in video processing systems, including both research projects and commercial products.
The field programmable gate array (FPGA) is a promising technology for increasing computation performance by providing for the design of custom chips through programmable logic blocks. This technology was used to implement and test a hardware random number generator (RNG) against four software algorithms. The custom hardware consists of a Sun SBus-based board (EVC) designed around a Xilinx FPGA. A timing analysis indicates the Sun/EVC hardware generator computes 1 × 10^6 random numbers approximately 50 times faster than the multiplicative congruential algorithm. The hardware and software RNGs were also compared using a Monte Carlo photon transport algorithm. For this comparison, the Sun/EVC generator produces a performance increase of approximately 2.0 versus the software generators. This comparison is based upon 1 × 10^5 photon histories.
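The multiplicative congruential algorithm used as a software baseline can be sketched as follows; the abstract does not give the paper's parameters, so the well-known Park-Miller "minimal standard" constants are assumed here purely for illustration:

```python
M = 2**31 - 1   # modulus (a Mersenne prime); Park-Miller constants assumed
A = 16807       # multiplier

def lehmer(seed, n):
    """Return n values of the recurrence x_{k+1} = (A * x_k) mod M."""
    x = seed
    out = []
    for _ in range(n):
        x = (A * x) % M
        out.append(x)
    return out

print(lehmer(1, 3))  # [16807, 282475249, 1622650073]
```

The hardware generator's advantage comes from avoiding this sequential multiply-mod dependency on a general-purpose CPU.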
This paper presents an efficient methodology for decomposing and modularizing large computations so that they can be easily mapped onto FPGAs and other programmable logic structures. The paper focuses on the multidimensional discrete cosine transform (DCT). The main advantage of the proposed decomposition strategy is that it enables constructing large m-dimensional DCTs from a single stage of smaller size m-dimensional DCTs. We demonstrate the power of our technique by mapping 2-d DCT computations of various sizes on an FPGA-based transformable computer and report their performance (both in terms of speed and gate utilization).
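The principle such decompositions build on is the separability of the multidimensional DCT into 1-D transforms. The sketch below shows the standard row-column form for the 2-D case, not the paper's single-stage decomposition:

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

def dct_2d(block):
    """2-D DCT built entirely from 1-D DCTs: rows first, then columns."""
    rows = [dct_1d(r) for r in block]              # 1-D DCT along each row
    cols = [dct_1d(list(c)) for c in zip(*rows)]   # then along each column
    return [list(r) for r in zip(*cols)]           # transpose back

# A constant 4x4 block concentrates all energy in the DC coefficient.
d = dct_2d([[1.0] * 4 for _ in range(4)])
print(round(d[0][0], 6))  # 4.0
```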
This paper demonstrates how an FPGA-based transformable coprocessor can be used to implement a real-time MPEG-1 video decoder with enhanced features. The transformable coprocessor consists of an FPGA, local static RAM, and a host bus interface built into the FPGA. The gate-limited FPGA core is reconfigured frequently to implement various parts of the video decoding process in real time. Our results show that, through reconfiguration, FPGA-based processors can handle complex tasks (such as high-quality video decoding) adequately. We also identify the major bottlenecks that impede achieving higher speedups with FPGAs. For MPEG-1 video processing, the major slowdown is caused by excessive data transfers, bus-interface bottlenecks, and the lack of sufficient storage in the FPGA.
FPGAs have become a competitive alternative for high-performance DSP applications, previously dominated by general-purpose DSP and ASIC devices. This paper describes the benefits of using an FPGA as a DSP coprocessor, as well as a stand-alone DSP engine. Two case studies, a Viterbi decoder and a 16-tap FIR filter, are used to illustrate how the FPGA can radically accelerate system performance and reduce component count in a DSP application. Finally, different implementation techniques for reducing hardware requirements and increasing performance are described in detail.
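The direct-form structure of a 16-tap FIR filter like the second case study can be sketched as follows; the tap coefficients here (a simple moving average) are illustrative assumptions, not the paper's values:

```python
def fir(coeffs, samples):
    """Direct-form FIR: y[n] = sum_k coeffs[k] * x[n-k], with x[n] = 0 for n < 0."""
    taps = len(coeffs)
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k in range(taps):
            if n - k >= 0:
                acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out

coeffs = [1.0 / 16] * 16          # 16-tap moving average
y = fir(coeffs, [16.0] * 32)      # step input
print(y[15])  # 16.0, once all taps are filled
```

On an FPGA each tap becomes a multiply-accumulate stage, so all 16 products can be formed in parallel rather than in the sequential inner loop shown here.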
Many DSP applications require dedicated hardware to achieve acceptable levels of performance. This is particularly true of real-time applications that have strict timeline requirements on processing throughput and latency. This paper outlines an FPGA-based reconfigurable processor architecture targeted to embedded DSP applications. The processor core consists of a high gate count FPGA multichip module (MCM) supplemented with four dedicated floating point multipliers. A dual port data memory provides a 480 Mbyte/sec channel to the processor and a 240 Mbyte/sec channel to the external interface. Coefficient memories are also included for static look-up table storage. A configuration bit stream loaded from non-volatile memory or an external source is used to program the FPGA.
For a number of years, we have studied the large-scale fine-grained limit of cellular-logic-array calculations and computers, with particular emphasis on applications to physical simulation. Perhaps the most relevant lessons of this work for the FPGA community have to do with the applicability of virtual-processor techniques to these logic-array computations, and by extension to the design of FPGAs themselves. These techniques allow us to trade off speed against size, to balance resources devoted to data storage with those devoted to processing, and to time-share communication resources as we share processors. An application area of particular interest to the FPGA community is logic emulation, where a virtual-processing approach lets us maximize useful processor cycles by having processing hardware follow computational wavefronts through arrays of virtual logic ("temporal pipelining"). This technique is of direct relevance to FPGA design.
Our virtual-processor approach is embodied in our indefinitely scalable CAM-8 cellular automata (CA) machine. Personal-computer-scale prototypes, designed and built at MIT using 1988 technology, are still about as fast as any conventional computer for most large-scale physical CA applications. Using today's high-bandwidth DRAMs, machines with the same number of memory chips could be built that run 100 times faster. Rather than build a new dedicated CA machine processor, it is attractive instead to add appropriate DRAM I/O and data-buffering circuitry to an FPGA design, to create a general-purpose class of FPGAs optimized for large-scale virtual-processor applications.
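The virtual-processor idea, one physical update unit time-shared across many virtual cells, can be sketched with a toy 1-D cellular automaton; rule 90 (XOR of the two neighbors, with periodic boundaries) stands in here for an arbitrary cellular-logic rule:

```python
def step(cells):
    """One CA generation: a single 'physical processor' sweeps every
    virtual cell in sequence; hardware would do the same sweep with far
    fewer processors than cells, trading speed for size."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ cells[(i + 1) % n] for i in range(n)]

row = [0] * 7
row[3] = 1          # single seed cell
for _ in range(2):
    row = step(row)
print(row)  # [0, 1, 0, 0, 0, 1, 0]
```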