Selected parallel pipelined signal processing architectures are reviewed. These architectures are discussed in terms of the algorithms used and the methods of implementation whether analog or digital. The role of systolic arrays and wavefront processors in implementing matrix computations is discussed. Future directions of research are indicated.
Efficient computational procedures are described for updating the spectral decomposition of a complex Hermitian matrix to which a matrix of rank one has been added. These methods will be useful in on-line implementation of high-resolution direction finding and beamforming algorithms.
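One well-known route to such updates (a background sketch, not necessarily the paper's method): after diagonalizing, a rank-one modification diag(d) + ρ w wᵀ has eigenvalues given by the roots of a secular equation, one per interval between adjacent d's. A minimal real-symmetric sketch, assuming distinct, ascending d and ρ > 0 (function and variable names are ours):

```python
import numpy as np

def rank_one_eigenvalues(d, w, rho, iters=60):
    """Eigenvalues of diag(d) + rho * w w^T for ascending distinct d and
    rho > 0.  Root i of the secular equation
        f(lam) = 1 + rho * sum_j w_j**2 / (d_j - lam) = 0
    lies in (d_i, d_{i+1}); the last lies in (d_n, d_n + rho*||w||^2).
    f is increasing on each interval, so bisection locates each root."""
    d = np.asarray(d, float)
    w2 = np.asarray(w, float) ** 2
    hi_bounds = np.append(d[1:], d[-1] + rho * w2.sum())
    lam = np.empty(d.size)
    for i in range(d.size):
        lo, hi = d[i], hi_bounds[i]
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if 1.0 + rho * np.sum(w2 / (d - mid)) < 0.0:
                lo = mid
            else:
                hi = mid
        lam[i] = 0.5 * (lo + hi)
    return lam
```

Each eigenvalue costs O(n) per bisection step, so the full update is O(n²) once the modification vector has been rotated into the old eigenbasis.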
Systolic arrays for determining the Singular Value Decomposition of an m × n, m > n, matrix A of bandwidth w are presented. After A has been reduced to bidiagonal form B by means of Givens plane rotations, the singular values of B are computed by the Golub-Reinsch iteration. The products of plane rotations form the matrices of left and right singular vectors. Assuming each processor can compute or apply a plane rotation, O(wn) processors accomplish the reduction to bidiagonal form in O(np) steps, where p is the number of superdiagonals. A constant number of processors can then determine each singular value in about 6n steps. The singular vectors are computed by rerouting the rotations through the arrays used for the reduction to bidiagonal form, or else "along the way" by employing another rectangular array of O(wm) processors.
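The plane-rotation primitive assumed of each processor can be sketched as follows (real arithmetic, illustrative names): one cell type computes the rotation that zeroes an element, and another applies it to a row pair.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) with c**2 + s**2 = 1 such that the rotation
    [[c, s], [-s, c]] maps (a, b) to (r, 0), where r = hypot(a, b)."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def rotate_rows(A, i, j, c, s):
    """Apply the rotation from the left to rows i and j of A in place."""
    Ai, Aj = A[i].copy(), A[j].copy()
    A[i] = c * Ai + s * Aj
    A[j] = -s * Ai + c * Aj

A = np.array([[3.0, 1.0], [4.0, 2.0]])
c, s = givens(A[0, 0], A[1, 0])
rotate_rows(A, 0, 1, c, s)   # introduces a zero at A[1, 0]
```

Sequences of such rotations, pipelined through the array, effect the reduction to bidiagonal form described above.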
An iterative algorithm for the singular value decomposition (SVD) of a non-zero m x n matrix M is described and illustrated numerically. Derivations of the algorithm and sufficient conditions for convergence are outlined.
The difficulties in designing systolic processors can be reduced by applying the architectural transformations of code motion, retiming, slowdown, coalescing, parallel/serial compromises, and partitioning to a more easily designed combinational or semisystolic form of the processor. In this paper, the use of these transformations and the attendant tradeoffs in the design of architectures for adaptive filtering based on the Gram-Schmidt algorithm are considered. A modification to the classical Gram-Schmidt algorithm which eliminates the use of division under certain assumptions is suggested. Also, size and speed statistics are given for a projected 0.5-micron VHSIC implementation of the processor.
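For reference, the division that the proposed modification removes appears in the projection step of the ordinary Gram-Schmidt recurrence; a minimal sketch (modified-Gram-Schmidt ordering, illustrative names, not the paper's division-free variant):

```python
import numpy as np

def gram_schmidt(X):
    """Orthogonalize the columns of X by Gram-Schmidt projections.
    The division by q @ q in each projection is the operation a
    division-free variant would eliminate."""
    Q = []
    for x in X.T:
        v = x.astype(float).copy()
        for q in Q:
            v -= (q @ v) / (q @ q) * q   # projection step with division
        Q.append(v)
    return np.column_stack(Q)
```

The resulting columns are mutually orthogonal but not normalized, which is the form most convenient for adaptive-filtering decorrelation.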
An algorithm is described, based on that of Faddeev, which solves for Cx+D, given Ax=B. This algorithm provides a framework for systematic manipulations of matrices using a single systolic array and simple data flow. An example is given of its use in solving several least squares problems.
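Faddeev's construction can be sketched directly: row operations drawn from the top block of the compound array [[A, B], [−C, D]] annihilate the −C block, leaving D + C A⁻¹ B = Cx + D in the lower-right block. A minimal dense-arithmetic sketch (not the systolic data flow itself):

```python
import numpy as np

def faddeev(A, B, C, D):
    """Compute C @ X + D, where A @ X = B, by Gaussian elimination on
    the compound array [[A, B], [-C, D]]: zeroing the -C block with
    rows from the top block leaves D + C A^{-1} B in the lower right."""
    n = A.shape[0]
    M = np.block([[A, B], [-C, D]]).astype(float)
    for k in range(n):
        p = k + np.argmax(np.abs(M[k:n, k]))   # pivot within top block only
        M[[k, p]] = M[[p, k]]
        M[k] /= M[k, k]
        others = np.arange(M.shape[0]) != k
        M[others] -= np.outer(M[others, k], M[k])
    return M[n:, n:]
```

Choosing C and D appropriately yields A⁻¹B, A⁻¹, or a matrix product from the same array, which is the "single systolic array, simple data flow" appeal of the scheme.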
The ordinary Singular Value Decomposition (SVD) is widely used in statistical and signal processing computation, both for the insight it provides into the structure of a linear operator and as a technique for reducing the computational word length required for least-squares solutions and certain Hermitian eigensystem decompositions by roughly a factor of two, since it operates directly on a data matrix rather than on the corresponding estimated correlation or covariance matrix. Although the SVD has long been utilized as a method of off-line or non-real-time computation, parallel computing architectures for its implementation in near real time have begun to emerge. The Generalized Singular Value Decomposition (GSVD) bears the same relationship to the computation of certain Hermitian generalized eigensystem decompositions that the ordinary SVD bears to the corresponding ordinary eigensystem decompositions. This paper discusses methods for computing the GSVD via a sequence of more familiar computations and indicates the relation of the GSVD to the MUSIC algorithm of R. Schmidt.
This paper addresses the theoretical and algorithmic issues related to optimal temporal localization (and systolization) of Signal Flow Graph (SFG) computing networks. Based on a cut-set localization procedure, we propose an algorithm that computes the optimal localization of an SFG. The basic algorithm is then extended so that it can be used with hierarchically specified SFG's, thus significantly improving the computational efficiency. The algorithms can be easily coded and incorporated into a computer-aided-design (CAD) system.
Automated design and descriptive tools are essential for the practical application of highly parallel special-purpose hardware such as systolic arrays. The use of special-purpose hardware can greatly increase the capabilities of signal processing systems. However, the more limited applications base makes design costs a critical factor in determining technical and economic viability. Systolic systems can be described at several levels of abstraction, each of which has unique descriptive requirements. This paper focuses on the descriptive issues involved at the system architectural level. Tools at this level must bridge the gap between logic- and circuit-oriented computer-aided design tools and algorithmic descriptions of systolic architectures. Traditionally, hardware description languages (HDLs) have been used at this level to describe conventional computer architectures. Systolic architectures, however, have different requirements. This paper examines these requirements and develops a set of criteria for evaluating HDLs. Four popular HDLs are evaluated and their strengths and weaknesses noted. The final section of the paper summarizes ongoing efforts at Los Alamos to develop a systolic array HDL based on the CONLAN family of languages.
Cellular computations (i.e., systolic or wavefront computations) are embedded in a vector space, one of whose dimensions models time, and the others, space. Cellular computations that are related by a linear transformation may have different properties with respect to input/output schedules, chip area, communication topology, latency, period, and the presence/absence of broadcasting and pipelining. Linear transformations of space-time are used to explore design alternatives in a formal way. Cellular computations for convolution and matrix product are used to illustrate this linear transformation technique.
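A minimal sketch of the idea for the convolution example (the particular transform T below is an illustrative choice, not one from the paper): each computation is an index point, a linear map sends it to a (time, cell) pair, and a candidate design is admissible only if no cell receives two computations at the same time step.

```python
import numpy as np

# Each computation of the convolution y[i] = sum_k w[k] * x[i + k]
# is the index point (i, k).  A linear space-time transform T maps
# it to (time, cell); the rows below (one schedule, one allocation)
# are a single illustrative design among many.
T = np.array([[1, 1],    # time = i + k
              [0, 1]])   # cell = k

points = np.array([(i, k) for i in range(4) for k in range(3)])
mapped = points @ T.T    # rows are (time, cell) pairs

# Admissibility check: all (time, cell) slots must be distinct.
slots = {tuple(p) for p in mapped}
assert len(slots) == len(points)
```

Replacing T with another unimodular matrix yields a different schedule and communication topology for the same computation, which is precisely the design space the linear-transformation technique explores.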
We discuss in a tutorial manner the principles and techniques of on-line arithmetic. Several examples of on-line algorithms for the basic operations, the evaluation of vector and matrix expressions, the solution of linear systems, and the evaluation of polynomials are used to illustrate the characteristics of on-line arithmetic.
Hardware for performing matrix operations at high speeds is in great demand in signal and image processing and in many real-time and scientific applications. VLSI technology has made it possible to perform fast large-scale vector and matrix computations by using multiple copies of low-cost processors. Since any functional error in a high-performance system may seriously jeopardize the operation of the system and its data integrity, some level of fault tolerance must be provided to ensure that the results of long computations are valid. A low-cost checksum scheme has been proposed to obtain fault-tolerant matrix operations on multiple processor systems. However, this scheme can only correct errors in matrix multiplication; it can detect, but not correct, errors in matrix-vector multiplication, LU decomposition, and matrix inversion. To overcome these limitations of the checksum scheme, a very general matrix encoding scheme is proposed in this paper to achieve fault-tolerant matrix operations with multiple processor systems. Since many signal and image processing algorithms involving a "multiply-and-accumulate" type of expression can be transformed into matrix-vector multiplication operations and executed in a linear array, this scheme is extremely useful for cost-effective and fault-tolerant signal and image processing.
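The checksum scheme referred to above can be sketched for matrix multiplication, the case where it does correct a single error: augment A with a column-sum row and B with a row-sum column, then locate a corrupted entry from the row and column syndromes (function names and tolerances below are illustrative):

```python
import numpy as np

def checksum_product(A, B):
    """Full-checksum product: the last row of the result holds column
    sums and the last column holds row sums of the data part."""
    Ac = np.vstack([A, A.sum(axis=0)])
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])
    return Ac @ Br

def correct_single_error(Cf, tol=1e-6):
    """Locate one corrupted data entry of a full-checksum matrix from
    its row/column syndromes and subtract the discrepancy in place."""
    r = Cf[:, :-1].sum(axis=1) - Cf[:, -1]   # row syndromes
    c = Cf[:-1, :].sum(axis=0) - Cf[-1, :]   # column syndromes
    i = np.flatnonzero(np.abs(r[:-1]) > tol)
    j = np.flatnonzero(np.abs(c[:-1]) > tol)
    if i.size == 1 and j.size == 1:          # a single consistent fault
        Cf[i[0], j[0]] -= r[i[0]]
    return Cf
```

A single faulty element perturbs exactly one row syndrome and one column syndrome, so its position and magnitude are both recoverable; this is the property that fails for LU decomposition and inversion, motivating the more general encoding of the paper.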
An iterative algorithm for the solution of a quadratic matrix equation (the algebraic Riccati equation) is detailed. This algorithm is unique in that it allows the solution of a nonlinear matrix equation in a finite number of iterations to a desired accuracy. Theoretical rules for selection of the operation parameters and the number of iterations required are advanced, and simulation verification and quantitative performance on an error-free processor are provided. An error source model for an optical linear algebra processor is then advanced, analyzed, and simulated to verify and quantify our performance guidelines. A comparison of iterative and direct solutions of linear algebraic equations is then provided. Experimental demonstrations on a laboratory optical linear algebra processor are included for final confirmation. Our theoretical results, error source treatment, and guidelines are appropriate for digital systolic processor implementation and for digital-optical processor analysis.
The advent in positron-emission tomography of the measurement of the differential time-of-flight of annihilation photons, in addition to their line-of-flight, has necessitated the development of new reconstruction algorithms to replace the conventional filtered back-projection algorithm used when only line-of-flight information is available. The "confidence-weighted" and "estimated posterior-density weighted" algorithms are reviewed. The structure of these algorithms makes the use of parallel, pipelined processing architectures possible, and perhaps necessary, for their practical use in clinical studies.
This paper presents the vital role mathematical modeling plays in solving the signal processing problem. The paper compares the traditional open-loop methodology of signal processing with the modern closed-loop methodology, and discusses the advantages and disadvantages of each. In the past, the system discriminants have been derived from knowledge of the signal bandwidth and the characteristics of the receiver, while actual signal propagation has been ignored; such a formulation generally leads to incorrect statistics. This traditional approach has been improved by use of a closed-loop methodology (CLM) technique, which employs an adaptive processor that provides correct signature statistics.
The CMU programmable systolic chip (PSC) is an experimental, microprogrammable chip designed for the efficient implementation of a variety of systolic arrays. The PSC has been designed, fabricated, and tested. The chip has about 25,000 transistors, uses 74 pins, and was fabricated through MOSIS, the DARPA silicon broker, using a 4 micron nMOS process. A modest demonstration system involving nine PSCs is currently running. Larger demonstrations are ready to be brought up when additional working chips are acquired. The development of the PSC, from initial concept to silicon layout, took slightly less than a year, but testing, fabrication, and system demonstration took an additional year. This paper reviews the PSC, describes the PSC demonstration system, and discusses some of the lessons learned from the PSC project.
CMU is currently building a programmable, 32-bit floating-point systolic array processor using only off-the-shelf components. The 10-cell processor, with one cell implemented on one board, can process 1024-point complex FFTs at a rate of one FFT every 600 μs. Under program control, the same processor can perform many other primitive computations in signal, image, and vision processing, including two-dimensional convolution, dynamic programming, and real or complex matrix multiplication, at a rate of 100 million floating-point operations per second. This particular systolic array processor is called the Warp, suggesting that it can perform a variety of transformations at a very high speed. For a mobile robot demonstration planned in 1985, the Warp is expected to speed up the navigation process by at least one order of magnitude. The Warp has a relatively simple architecture given its performance. The processor is a linear array of cells (or processing elements) that in general takes inputs at one end and produces outputs at the other. The processor can efficiently implement many systolic algorithms where communication between adjacent cells is intensive. The processor can also efficiently implement many non-systolic algorithms where each cell operates on its own data independently from the rest. This paper describes the structure of the Warp.
ESL is building a high performance signal processor for DARPA and the Navy which will use several systolic arrays to do linear algebra operations. These systolic arrays are all constructed from just one type of systolic integrated circuit. This systolic chip performs 32-bit floating point arithmetic at a ten megaflop rate. The chip contains all the registers needed to make it a systolic cell. All the control and registers for doing complex multiplication and addition are included in the chip. The processor contains four types of systolic arrays. The first does matrix multiplication. The second updates the Cholesky factor of a matrix from the corresponding factor of a rank-one modification of the matrix. The third array does forward- and back-solves. The fourth does back-solves against many right-hand sides. These systolic arrays have address generators which allow them to be used on many different sized problems. The detailed control for the machine is created at compile time from high level commands given by the user. Much of the hardware design effort has gone into memories and address generators which support running many different problem sizes on fixed sized arrays.
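One standard sequential formulation of the rank-one Cholesky-factor update performed by the second array can be sketched as follows (a background sketch, not the array's data flow; names are ours): the new data row is rotated into the triangular factor with a sweep of Givens rotations.

```python
import numpy as np

def cholesky_update(R, x):
    """Given upper-triangular R with R.T @ R = A, return the upper-
    triangular factor R1 with R1.T @ R1 = A + x x^T, by zeroing the
    appended row x against the rows of R with Givens rotations."""
    R = R.astype(float).copy()
    x = np.asarray(x, float).copy()
    for k in range(R.shape[0]):
        r = np.hypot(R[k, k], x[k])
        c, s = R[k, k] / r, x[k] / r
        Rk, xk = R[k, k:].copy(), x[k:].copy()
        R[k, k:] = c * Rk + s * xk
        x[k:] = -s * Rk + c * xk     # x[k] becomes exactly zero
    return R
```

Because each rotation touches only rows k and the appended row, the sweep pipelines naturally across a triangular array of cells.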
Optical systolic processors perform rapid parallel multiplications of vector and matrix elements. These multiplications may be performed by convolving the digits of individual numbers. Current transducers (particularly acousto-optic cells) make time-domain convolution the method of choice. A time-domain convolver can operate in a time-integrating or space-integrating mode. These two modes can be considered as variations of a basic multiplier unit. Starting with the basic multiplier, four matrix-vector and two matrix-matrix multipliers can be derived. Some of these processors are new and some have been previously identified. Within this unifying framework the performance of all these processors may be analyzed and design tradeoffs identified.
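Digit-convolution multiplication, the operation such a convolver performs optically, can be sketched numerically (the carry propagation, done electronically in such systems, is included here for completeness; names are illustrative):

```python
import numpy as np

def multiply_by_convolution(a_digits, b_digits, base=10):
    """Multiply two numbers given as digit lists (most significant
    first) by convolving the digit sequences, then propagating
    carries through the resulting mixed-radix partial products."""
    mixed = np.convolve(a_digits, b_digits)
    digits, carry = [], 0
    for d in mixed[::-1]:                 # least significant end first
        carry, digit = divmod(int(d) + carry, base)
        digits.append(digit)
    while carry:                          # flush any remaining carry
        carry, digit = divmod(carry, base)
        digits.append(digit)
    return digits[::-1]

# 123 * 45 = 5535
multiply_by_convolution([1, 2, 3], [4, 5])   # → [5, 5, 3, 5]
```

The convolution step is carry-free and fully parallel, which is what makes it attractive for an analog optical implementation.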
A digital approach to integrated optical convolution and correlation is proposed and analyzed. The method relies on the application of a series of time-delayed signals to a large group of individually addressable electro-optic Bragg gratings in order to provide the digital equivalent of a traveling index wave. Unlike previous systems, the wavelength of the traveling index wave is under program control. The associated time-bandwidth product may be varied independently of the applied frequency making the technique suitable for signal processing applications. A modification of the theory of analog Bragg diffraction is developed which includes the effects of discrete spatial sampling in the convolution and correlation process for both continuous-time and sampled-time inputs. An analysis of the potential advantages of this approach for algebraic processors and broad-band ambiguity processors is also presented.
Optical systolic array processing algorithms and architectures that may be appropriate for the high-speed real-time linear estimation of scalar data sequences are considered. A Kalman-filter-based algorithm for the linear prediction of such sequences is described and analyzed using simple computer simulations, and possible optical systolic array processor hardware implementations of this algorithm are briefly discussed. The overall goal is the optimization of prediction speed and accuracy as a function of the assumed order of a linear model and practical constraints such as system power consumption, size, reliability, and cost.
Described in this paper is a highly-parallel algorithm (referred to as the Permutation Algorithm) for computing the singular value decomposition (SVD) of a matrix. This algorithm is an extension of an earlier algorithm developed by Brent and Luk. The Permutation Algorithm is unique in that it relies heavily on the repeated use of the more fundamental matrix-matrix multiply operation. Therefore, high-speed numerical optical processors currently being developed for performing the matrix-matrix multiply operation appear well suited for implementing the Permutation Algorithm as well.
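The Permutation Algorithm itself is developed in the paper; as background, the Brent-Luk-style Jacobi SVD it extends can be sketched sequentially in its one-sided (Hestenes) form, which orthogonalizes column pairs by plane rotations (a related formulation, not the paper's matrix-multiply version; names are ours):

```python
import numpy as np

def hestenes_svd(A, sweeps=10):
    """Singular values of A by one-sided Jacobi rotations that
    orthogonalize column pairs; the column norms of the converged
    matrix are the singular values (descending)."""
    U = np.asarray(A, float).copy()
    n = U.shape[1]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                x, y = U[:, p].copy(), U[:, q].copy()
                alpha, beta, gamma = x @ x, y @ y, x @ y
                if gamma**2 <= 1e-28 * alpha * beta:
                    continue                      # already orthogonal
                zeta = (beta - alpha) / (2.0 * gamma)
                sgn = 1.0 if zeta >= 0.0 else -1.0
                t = sgn / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                U[:, p] = c * x - s * y
                U[:, q] = s * x + c * y
    return np.sort(np.linalg.norm(U, axis=0))[::-1]
```

Each sweep visits all column pairs; the Brent-Luk ordering performs ⌊n/2⌋ of these pair rotations concurrently, which is what a square array of processors exploits.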
An electro-optical technique of performing analog sampled-data signal processing operations using the inherently fast matrix multiplication of incoherent optical systems has been under investigation at the Naval Ocean Systems Center for a number of years. Several papers have previously described both a fundamental mathematical treatment and a simplified model of the electro-optical processor (EOP) operation. The present paper will review the characteristics of these processors and, in particular, will describe the results of the characterization and evaluation of an incoherent optical signal processing system consisting of a light-emitting diode (LED), a matrix mask, and a customized charge-coupled device (CCD). Such a processor can perform a broad variety of useful one-dimensional linear transformations such as the Discrete Fourier Transform, multi-channel cross-correlation, and multi-channel matched filtering. The desirability of such a system is due to the high speed, ruggedness, reliability, and compact size achievable with this technology, which in many applications far surpasses the computational capabilities of conventional digital electronic systems. Based on the results of the comparison of an analytical model of the EOP and the measured performance of the EOP system, conclusions and recommendations concerning the capabilities and limitations of future systems using this technology can be discussed. In addition, a working system capable of performing Fourier transforms at up to 5 megasamples/sec. will be demonstrated.
The theory of operation of the Real-Time Acousto-Optic SAR Processor is reviewed and recent experimental results are presented. The results include a demonstration of the real-time imaging capability of the processor with simulated radar signals. An advanced version of this processor is then described in which a programmable reference function is entered via a second acousto-optic device to eliminate the need for a 2-D SLM. In this implementation the reference function is updated by electronic means to give the processor the flexibility to adapt rapidly to changes in the parameters of the radar/target geometry.
The general form of a system for optical signal or data processing is described. The concept of using systolic flow of information and massive parallelism for efficient, high throughput optical computation is presented. An accompanying problem, the fact that an upper limit of the processing speed is often imposed not by the optical processing elements, but by the devices that form the interfaces between the light and electronic domain, is also discussed. GaAs CCDs are a technology that can provide a solution to these "bottleneck" problems; examples of representative applications are provided.
We designed an optical processor using wavelength division multiplexing which performs matrix-vector and matrix-matrix multiplications. The Wavelength Division Multiplexing (WDM) Optical Processor can perform analog and digital operations. Its principal advantage over other processors is its parallel architecture that enables high speed calculations.
A technique is described for performing two-dimensional (2-D) space-variant operations using two acousto-optic modulators (AOM's) operating in a crossed fashion. A time-integrating detector performs the appropriate summation of output terms for the spatially sampled input. System constraints and the feasibility of performing real-time operations will be discussed, along with the presentation of experimental results. The approach has the potential for broadening the range of operations performed by optical computers.
A cyclic, multi-channel, analog encoding scheme is proposed as a way of representing numbers in optical computers. It can be represented as an n-tuple of complex numbers or phasors. It is similar to residue arithmetic but applies to non-integers. With this scheme the operation of addition is mapped into multiplication. Thus summation of many terms must be carried out by sequential modulation rather than superposition. Though this is a disadvantage for some optical architectures, it seems well suited to certain integrated optic structures and may be appropriate for certain systolic architectures.
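The encoding can be sketched directly (the moduli and function names below are illustrative): a number x is represented per channel by the phasor exp(2πi x/m), so adding two numbers corresponds to multiplying their phasors, and each channel's phase recovers x modulo that channel's modulus.

```python
import numpy as np

MODULI = (5.0, 7.0, 8.0)   # illustrative channel moduli

def encode(x, moduli=MODULI):
    """Represent x as one phasor per channel: exp(2*pi*i * x / m)."""
    return np.exp(2j * np.pi * x / np.array(moduli))

def add(p, q):
    """Addition of encoded values maps to elementwise multiplication."""
    return p * q

def decode_channel(p, m):
    """Recover x mod m from one channel's phase."""
    return (np.angle(p) / (2.0 * np.pi) * m) % m
```

As in residue arithmetic, the channels jointly determine x over a larger range than any single modulus, but here x need not be an integer.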