Previous reviews of signal processing computational needs and their systolic implementation have emphasized the need for a small set of matrix operations: primarily matrix multiplication, orthogonal triangularization, triangular backsolve, the singular value decomposition, and the generalized singular value decomposition. Algorithms and architectures for these tasks are sufficiently well understood to begin transitioning from research to exploratory development. Substantial progress has also been reported on parallel algorithms for updating symmetric eigensystems and the singular value decomposition. Another problem that has proved to be easier than expected is inner product computation for high-speed, high-resolution predictive analog-to-digital conversion. Although inner product computation in a general setting requires O(log n) time via a tree, the special structure of the prediction problem permits the use of a systolic transversal filter, producing a new predicted value in O(1) time. Problem areas still in an early stage of study include parallel algorithms for the Wigner-Ville distribution function, L1-norm approximation, inequality-constrained least squares, and the total least squares problem.
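The O(1)-per-output behaviour of the transversal filter mentioned above can be sketched in a few lines. This is an illustrative pure-Python model (names and sizes are ours, not from the review): each step shifts one new sample into the tap delay line and forms one inner product, the products being what a systolic array would compute in parallel, one per cell.

```python
# Sketch of a transversal (FIR) filter producing one inner product per
# time step, as in the prediction application described above.
# Illustrative pure-Python model; names and sizes are assumptions.

def transversal_filter(weights, samples):
    """Yield one predicted value per input sample: O(1) steady-state
    time per output, given len(weights) multiply-accumulate cells."""
    taps = [0.0] * len(weights)          # tap delay line
    for x in samples:
        taps = [x] + taps[:-1]           # shift the new sample in
        # In a systolic array the products below are formed in parallel,
        # one per cell; here we simply sum them sequentially.
        yield sum(w * t for w, t in zip(weights, taps))

out = list(transversal_filter([0.5, 0.25], [1.0, 2.0, 4.0]))
```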
Levinson's Toeplitz solver breaks down if any leading principal minor is zero, and methods have recently been proposed to handle such cases by "jumping over" the singularities. It is shown that, as might be expected, the occurrence of small leading minors causes a serious loss of accuracy in solvers such as the Levinson algorithm. However, no Toeplitz algorithm has previously been proposed to handle such near-singularities. A pivoting scheme has been incorporated into the Toeplitz solver of Bareiss which allows near-singularities to be treated without significant loss of accuracy. As special cases, the pivoted Toeplitz solver also handles exact singularities and numerical singularities, that is, singularities which appear as near-singularities when finite-precision floating point arithmetic is used. Bareiss's algorithm also yields the LU factors of a Toeplitz matrix, and it is shown that a restricted version of the pivoted Toeplitz solver produces an LU factorisation of a row- and column-permuted version of the original Toeplitz matrix. Finally, it is shown how the Toeplitz inverter of Trench and Zohar can be derived from the multipliers produced by the Bareiss algorithm, and this connection is used to derive a pivoted version of the Trench-Zohar algorithm from the pivoted Bareiss algorithm.
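For reference, the classical (unpivoted) Levinson recursion whose accuracy degrades near small leading minors can be sketched as follows. This is a minimal pure-Python solver for a symmetric positive-definite Toeplitz system with unit diagonal; it is an illustration of the baseline algorithm, not of the pivoted Bareiss scheme proposed in the paper.

```python
# Minimal Levinson solver for T x = b, where T is symmetric positive
# definite Toeplitz with unit diagonal and first row [1, r[0], ..., r[n-2]].
# Classical unpivoted recursion; breaks down when a leading minor vanishes
# (beta -> 0), which motivates the pivoted scheme described above.

def levinson_solve(r, b):
    n = len(b)
    x = [b[0]]
    y = [-r[0]]                               # Yule-Walker solution so far
    alpha, beta = -r[0], 1.0
    for k in range(1, n):
        beta *= 1.0 - alpha * alpha           # prediction-error update
        mu = (b[k] - sum(r[i] * x[k - 1 - i] for i in range(k))) / beta
        x = [x[i] + mu * y[k - 1 - i] for i in range(k)] + [mu]
        if k < n - 1:                         # extend the Yule-Walker solution
            alpha = -(r[k] + sum(r[i] * y[k - 1 - i] for i in range(k))) / beta
            y = [y[i] + alpha * y[k - 1 - i] for i in range(k)] + [alpha]
    return x

# Verify on a small system by multiplying back.
r = [0.5, 0.2]
b = [1.0, 2.0, 3.0]
x = levinson_solve(r, b)
T = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]]
residual = max(abs(sum(T[i][j] * x[j] for j in range(3)) - b[i])
               for i in range(3))
```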
Differing concepts of the stability of algorithms for solving linear equations are scrutinized. These concepts are expanded into two new concepts, strong stability and weak stability, and examples are given to demonstrate the need for them.
A preconditioned conjugate gradient method is developed for solving the linear equality constrained problem of minimizing ||p - Gw||_2 subject to the constraint C^H w = f. For signal processing problems, the method has the advantage that the solution w is easy to update after the matrix G has been updated.
Both the singular value decomposition (SVD) and the QR factorization play central roles in signal processing algorithms. The usual tradeoff is that the SVD is more expensive but can reveal rank more reliably. In this paper, we show how to construct a QR factorization which can also reveal the rank reliably. For matrices with low rank deficiency, the overhead over the usual QR procedures is negligible. It also appears possible to implement the new procedure in systolic arrays.
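The basic mechanism by which a QR factorization can expose rank can be illustrated with column pivoting: a small trailing diagonal entry of R reveals a (near-)rank deficiency. The sketch below, in pure Python via modified Gram-Schmidt, shows only this elementary idea; the procedure in the paper is more sophisticated and more reliable.

```python
# Column-pivoted QR (modified Gram-Schmidt): at each step the remaining
# column of largest norm is moved to the front, so the diagonal of R is
# non-increasing and a tiny trailing entry exposes rank deficiency.
# Illustrative sketch only; not the paper's procedure.

def qr_column_pivoting(A):
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # column-major copy
    perm = list(range(n))
    Rdiag = []
    for k in range(n):
        # pivot: bring the remaining column of largest norm to position k
        p = max(range(k, n), key=lambda j: sum(v * v for v in cols[j]))
        cols[k], cols[p] = cols[p], cols[k]
        perm[k], perm[p] = perm[p], perm[k]
        nrm = sum(v * v for v in cols[k]) ** 0.5
        Rdiag.append(nrm)
        if nrm == 0.0:
            continue
        q = [v / nrm for v in cols[k]]
        for j in range(k + 1, n):             # orthogonalize remaining columns
            r = sum(q[i] * cols[j][i] for i in range(m))
            cols[j] = [cols[j][i] - r * q[i] for i in range(m)]
    return Rdiag, perm

# Rank-2 matrix: the third column is the sum of the first two.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 2.0]]
Rdiag, perm = qr_column_pivoting(A)
```

The last diagonal entry of R comes out (numerically) zero, revealing that the matrix has rank 2.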
We present a new algorithm for computing the QR factorization of an (m+1) × (n+1) Toeplitz matrix in O(mn) multiplications. The algorithm exploits the procedure for the rank-1 modification and the fact that both principal m × n submatrices of the Toeplitz matrix are identical. The matrix R is generated row by row and the matrix Q column by column, starting from the first row and column respectively. Each row of R (column of Q) is calculated from the previous row (column) after three implicit rank-1 modifications of the matrix R: one updating and two downdatings. The procedure for rank-1 updating is due to Gill, Golub, Murray and Saunders, while that for rank-1 downdating is new and can be regarded as the reverse of rank-1 updating. Both updating and downdating operate on rows of R (columns of Q) from left to right (from top to bottom), which makes efficient parallel implementation possible. The QR factorization can be obtained in O(m+n) parallel steps with O(n) processors.
In this paper we present a parallel algorithm for the symmetric algebraic eigenvalue problem. The algorithm is based upon a divide and conquer scheme suggested by Cuppen for computing the eigensystem of a symmetric tridiagonal matrix. We extend this idea to obtain a parallel algorithm that retains a number of active parallel processes that is greater than or equal to the initial number throughout the course of the computation. Computation of the eigensystem of the tridiagonal matrix is reviewed. A brief analysis of the numerical properties and sensitivity to round-off error is also presented to indicate where numerical difficulties may occur. We show how to explicitly overlap the initial reduction to tridiagonal form with the parallel computation of the eigensystem of the tridiagonal matrix. The algorithm is therefore able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures. Computational results for several machines have been presented and are very encouraging with respect to both accuracy and speedup. A surprising result is that the parallel algorithm for the tridiagonal case, even when run in serial mode, can be significantly faster on large problems than the previously best sequential algorithm, and remains effective on moderate-size problems.
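The divide step of Cuppen's scheme splits a symmetric tridiagonal matrix into a block-diagonal pair of half-size tridiagonals plus a rank-one correction, so the two halves can be processed in parallel and then glued back together. A minimal pure-Python illustration of this tearing (our variable names, not the paper's notation):

```python
# Cuppen's divide step: for a symmetric tridiagonal T with diagonal
# `diag` and off-diagonal `off`, removing the coupling element off[k]
# gives T = blockdiag(T1, T2) + b * u * u^T, with the cut diagonal
# entries compensated by b on both sides.  Illustrative sketch only.

def cuppen_split(diag, off, k):
    """Split T after row k; return the modified half diagonals,
    the coupling scalar b and the rank-one vector u."""
    b = off[k]                       # coupling element removed by the split
    d1 = diag[:k + 1]
    d2 = diag[k + 1:]
    d1[-1] -= b                      # compensate on both sides of the cut
    d2[0] -= b
    u = [0.0] * len(diag)
    u[k] = u[k + 1] = 1.0
    return d1, d2, b, u

# Check: rebuild T from the pieces and compare with the original
# (split after row 1, so off[1] is the removed coupling element).
diag, off = [4.0, 3.0, 2.0, 1.0], [1.0, 0.5, 1.0]
d1, d2, b, u = cuppen_split(diag, off, 1)
n = 4
T = [[0.0] * n for _ in range(n)]
newdiag = d1 + d2
for i in range(n):
    T[i][i] = newdiag[i] + b * u[i] * u[i]
for i in range(n - 1):
    T[i][i + 1] = T[i + 1][i] = (off[i] if i != 1 else 0.0) + b * u[i] * u[i + 1]
```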
Algorithms of reduced complexity are developed for the computation of the SVD of complex matrices in the triangular systolic array of Luk. A sufficient condition for convergence of these algorithms is also given which is similar to that for the real-matrix problem.
Updating the eigenvalue decomposition (EVD) of the estimated correlation matrix arises in many signal processing problems. In this paper, we present an improved, highly parallel, linear rank-one EVD update algorithm which is based on a quadratic rank-one method due to Bunch et al. Both methods are evaluated and compared. A number of simulations involving the frequency estimation of non-stationary narrow-band signals are also performed, in which linear prediction techniques are used with eigenvector pre-processing.
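At the heart of rank-one EVD updating is the secular equation: after the update D + rho*z*z^T of a diagonalized matrix, the new eigenvalues are the roots of f(lam) = 1 + rho * sum_i z_i^2 / (d_i - lam), one per interlacing interval. The sketch below finds these roots by plain bisection; it captures the spirit of the quadratically convergent method of Bunch et al. without its refinements, and the variable names are ours.

```python
# Secular-equation root finder for the rank-one eigenvalue update
# D + rho * z * z^T, rho > 0 assumed.  Each new eigenvalue lies in an
# interlacing interval (d[i], d[i+1]), the last one in
# (d[n-1], d[n-1] + rho * ||z||^2).  Bisection sketch for illustration.

def secular_roots(d, z, rho, iters=100):
    n = len(d)
    znorm2 = sum(v * v for v in z)
    f = lambda lam: 1.0 + rho * sum(z[i] ** 2 / (d[i] - lam) for i in range(n))
    roots = []
    for i in range(n):
        lo = d[i]
        hi = d[i + 1] if i + 1 < n else d[i] + rho * znorm2
        for _ in range(iters):               # f is increasing on the interval
            mid = 0.5 * (lo + hi)
            if f(mid) > 0.0:
                hi = mid
            else:
                lo = mid
        roots.append(0.5 * (lo + hi))
    return roots

# D = diag(1, 2), z = (1, 1), rho = 1 gives A = [[2, 1], [1, 3]],
# whose exact eigenvalues are (5 +/- sqrt(5)) / 2.
roots = secular_roots([1.0, 2.0], [1.0, 1.0], 1.0)
```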
The CS decomposition and the Generalized Singular Value Decomposition are closely related. Indeed, the stable computation of the latter requires calculation of the former. In this paper we show how a number of important numerical activities in signal and image processing benefit from exploitation of these decompositions.
A systolic array for linearly constrained least-squares optimisation is described. It is shown how the "canonical" array, designed to perform least-squares minimisation with one simple constraint, may also be used for problems involving an arbitrary number of general linear constraints.
This paper concerns the recursive least squares algorithm of J.G. McWhirter. We analyze in detail how the method's stability depends on the condition of the data matrix, and we present a new procedure for estimating the error in the computed solution.
A novel approach to the general problem of signal parameter estimation is described. Though the technique (ESPRIT) is discussed in the context of direction-of-arrival estimation, it can be applied to a wide variety of problems including spectral estimation. ESPRIT exploits an underlying rotational invariance among signal subspaces induced by an array of sensors with a translational invariance structure (e.g., pairwise matched and co-directional antenna element doublets). The new approach has several significant advantages over earlier techniques such as MUSIC, including improved performance, reduced computational load, freedom from array characterization/calibration, and reduced sensitivity to array perturbations. Results of computer simulations carried out to evaluate the new algorithm are presented.
Exact characterization of the array in terms of its geometry and the sensor gain and phase is important in beamforming and directions-of-arrival estimation problems. In practice the sensor gain and phase may not be known exactly due to drift in the sensor characteristics. We present a subspace approach which uses the received array signals to estimate the sensor gain and phase. We assume that the directions-of-arrival of some of the several sources that impinge on the array are known and show how this information can be used to determine sensor parameters. The resulting estimates can then be used to improve the robustness of array processing algorithms both for directions-of-arrival estimation and beamforming. Results of simulations carried out to verify the performance of this method are presented.
Signal-subspace processors that exploit the eigenstructures of covariance matrices have directional responses that are much narrower than the responses of more conventional systems. EVEV methods are considered here in three specific forms that can be attributed to Pisarenko, Schmidt, and Johnson. The peaks of the processors' outputs are examined for deviation from the true direction of arrival and for the variance of the results over several observations. Beamformer outputs have been computed from input data recorded from an actual array. The results demonstrate (1) that the performance of the Pisarenko processor is unsatisfactory and (2) that improved performance is achieved by the use of EVEV systems such as MUSIC at the price of some increase in computational load.
The user of a high resolution microwave imaging system views it as a large phased array radar. To the designer, however, it is a huge signal processing system with a multiport input (the phased antenna array) and a single output port (the image display). Between the input and the output, there is a great deal of signal handling and processing. Of course, there is a transmitter and receivers with local oscillators, mixers, IF amplifiers and quadrature demodulators. These are standard items required for frequency translation, amplification and demodulation; the real action, as well as the intellectual depth in the creation of the system, lies in the signal processing.
A modification of the Maximum-Likelihood method of spatial spectrum estimation is here described which yields unbiased estimates of the covariance matrix of a subset of the signal arrivals without requiring estimates of directions-of-arrival of signals outside the subset. The method works provided the subset comprises members uncorrelated with arrivals outside the subset, and provided the matrix is of full rank. To obtain it, the overall covariance matrix of the array outputs, the noise power, and the arrival directions of the members of the subset are required. The latter two may be found via the MUSIC method. The procedure is useful in reducing the computational burden when only emissions from a limited region of space are of interest. An application is described wherein the method is used to more clearly observe totally coherent arrivals immersed in a larger set, by using the data obtained via the modified ML procedure to eliminate all but the coherent arrivals.
We describe an analogy between imaging in delay-doppler radar/sonar and positron-emission tomography. This suggests new processing algorithms for the radar/sonar imaging problem that may permit improved visualization of targets for practical ambiguity functions. A receiver architecture consisting of a bandpass matched filter, square-law envelope detector, and specialized processing is proposed to produce images.
This paper deals with a novel architecture that makes real-time projection-based algorithms a reality. The design is founded on raster-mode processing, which is exploited in a powerful and flexible pipeline. This architecture, dubbed "P3E" (Parallel Pipeline Projection Engine), supports a large variety of image processing and image analysis applications. The image processing applications include: discrete approximations of the Radon and inverse Radon transform, among other projection operators; CT reconstructions; 2-D convolutions; rotations and translations; discrete Fourier transform computations in polar coordinates; autocorrelations; etc. There is also an extensive list of key image analysis algorithms that are supported by P3E, thus making it a powerful and versatile tool for projection-based computer vision. These include: projections of gray-level images along linear patterns (the Radon transform) and other curved contours; generation of multi-color digital masks; convex hull approximations; Hough transform approximations for line and curve detection; diameter computations; calculations of moments and other principal components; etc. The effectiveness of our approach and the feasibility of the proposed architecture have been demonstrated by running some of these image analysis algorithms in conventional short pipelines, to solve some important automated inspection problems. In the present paper, we will concern ourselves with reconstructing images from their linear projections, and performing convolutions via the Radon transform.
Synthetic aperture radar (SAR) and inverse synthetic aperture radar (ISAR) are enhanced resolution imaging radar systems. These systems receive coherent radar signal returns from an illuminated scene and use the signal modulation related to terrain reflectivity and relative motion between the scene and the observation platform to form an image of the scene. The extraction, storage, and coherent processing of these modulation terms represent an enormous signal processing task which classically has been performed by optical techniques. The flexibility and reduced cost of modern high performance, high bandwidth digital signal processing systems is the motivation for full DSP based imaging radar systems. In this paper we review the common DSP tasks performed by basic SAR and ISAR imaging radars.
The Wigner-Ville distribution (WVD), a form of time-frequency analysis, is shown to be useful in the analysis of a variety of non-stationary signals both deterministic and stochastic. The properties of the WVD are reviewed and alternative methods of calculating the WVD are discussed. Applications are presented.
We have shown that for the class of likelihood problems resulting from a complete-incomplete data specification in which the complete-data x are nonuniquely determined by the measured incomplete-data y via some many-to-one set of mappings y=h(x), the density which maximizes entropy is identical to the conditional density of the complete data given the incomplete data which would be derived via rules of conditional probability. It is precisely this identity between the maxent density and the conditional density which results in the fact that maximum-likelihood estimation problems may be solved via an iterative maximization of the sum of the entropy plus expected log-likelihood; the maximization is jointly with respect to the maxent density and the likelihood parameters. In this paper we demonstrate that for the problems of tomographic image reconstruction and spectrum estimation from finite data sets, this view results in the derivation of maximum-likelihood estimates of the image parameters and covariance parameters via an iterative maximization of the entropy function.
A two-dimensional linear-convolution algorithm is presented which couples the efficiency of the polynomial-transform convolution algorithm with the parallelism of finite-field arithmetic. This algorithm requires fewer arithmetic operations than most two-dimensional convolution algorithms, and possesses a structure which facilitates systolic architectures. Forming the polynomial transform in a finite field allows large numbers to be broken into small residues, each of which may be processed in parallel. A systolic architecture is described which may be used to evaluate two-dimensional convolutions in O(N) processor cycles. This architecture is comprised of engagement processors, to form the polynomial transform, and pipelined processors, to form the polynomial product.
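The residue idea described above can be sketched in miniature: carry out an integer convolution (1-D here, for brevity) independently modulo two small coprime primes, then recombine the channel outputs with the Chinese Remainder Theorem. The moduli and sequences below are illustrative choices of ours, not those of the paper.

```python
# Convolution via residue channels: the two modular convolutions are
# independent and could run on parallel processors; the CRT step at the
# output reconstructs the full-precision result.  Illustrative sketch.

P1, P2 = 257, 263                    # coprime moduli; product bounds the result

def conv_mod(a, b, p):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out

def crt(r1, r2):
    # Reconstruct v with v % P1 == r1 and v % P2 == r2, 0 <= v < P1*P2.
    inv = pow(P1, -1, P2)            # modular inverse (Python 3.8+)
    return (r1 + P1 * (((r2 - r1) * inv) % P2)) % (P1 * P2)

a, b = [3, 1, 4], [1, 5, 9]
c1 = conv_mod(a, b, P1)              # channel 1
c2 = conv_mod(a, b, P2)              # channel 2 (independent of channel 1)
c = [crt(x, y) for x, y in zip(c1, c2)]
```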
A new VLSI design method is developed for digital signal processors based on residue number systems and linear systolic arrays. The method breaks large processors into a set of small parallel processors that are interconnected only at the input and the output. Each of the parallel processors is a linear systolic array made up of a set of nearest-neighbor-connected, minimally complex, identical cells. The method has been illustrated by the design of a digital finite impulse response (FIR) filter which is projected for single-chip VLSI implementation in 1.25 μm technology to have a throughput rate of more than 5 × 10^12 gate-Hz/cm^2. A proof-of-concept filter has been implemented in 4 micron nMOS by 23 custom chips, each of approximately 20,000-transistor complexity. The filter has a combination of characteristics difficult to achieve with other design techniques, namely 128 fully programmable taps and 136 dB dynamic range, yet it contains only 150,000 gates and has 5 MHz throughput. A fault-tolerant version of the filter provides the error-correction capability of triplication and voting, but it requires a redundancy of only 40%, one-third that required by triplication. In general, an RNS filter with N parallel processing channels can be provided with the fault tolerance of triplication and voting with one-Nth as much redundant hardware. By choosing a systolic architecture that computes outputs in parallel, an amount of fault tolerance can be provided that equals the use of triplication and voting at every stage of the filter.
In 1971, J.S. Walther generalized and unified J.E. Volder's coordinate rotation (CORDIC) algorithms. Using Walther's algorithms, a few commonly used functions, such as divide, multiply-and-accumulate, arctan, plane rotation, arctanh, and hyperbolic rotation, can be implemented on the same simple hardware (shifters and adders, elementary controller) and computed in approximately the same time. Can other useful functions be computed on the same hardware by further generalizing these algorithms?
Our positive answer lies in a deeper understanding of Walther's unification: the key to the CORDIC algorithms is that all of them effect the multiplication of a vector by the exponential of a 2 × 2 matrix. The importance of this observation is readily demonstrated, as it easily yields the convergence conditions for the CORDIC algorithms and an efficient way of extending the domain of convergence for the hyperbolic functions. A correspondence may be established between elementary functions such as square root, √(x^2 + y), inverse square root or cube root and exponentials of simple matrices. Whenever such a correspondence is found, a CORDIC-like algorithm for computing the function can be synthesized in a very straightforward manner. The algorithms thus derived have a simple structure and exhibit uniform convergence inside an adjustable, precisely defined domain.
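The matrix-exponential view can be made concrete in the circular case: rotating (x, y) by angle t is multiplying it by exp(t·[[0, -1], [1, 0]]), which CORDIC approximates with a fixed sequence of shift-and-add micro-rotations. The sketch below is a floating-point model of that mode only (iteration count and tolerances are our illustrative choices), not Walther's unified hardware algorithm.

```python
# CORDIC in rotation (circular) mode: each micro-rotation uses only a
# shift (multiplication by 2**-i) and adds; the constant scale factor of
# the un-normalized rotations is compensated once at the end.

import math

N = 40                                        # number of micro-rotations
ANGLES = [math.atan(2.0 ** -i) for i in range(N)]
GAIN = 1.0
for ang in ANGLES:
    GAIN *= math.cos(ang)                     # aggregate scale factor

def cordic_rotate(x, y, t):
    """Rotate (x, y) by angle t (|t| within the convergence range)."""
    for i, ang in enumerate(ANGLES):
        d = 1.0 if t >= 0.0 else -1.0         # steer the residual angle to 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        t -= d * ang
    return x * GAIN, y * GAIN

# Rotating (1, 0) by t yields (cos t, sin t).
c, s = cordic_rotate(1.0, 0.0, 0.6)
```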
A very time consuming step in the existing sequential stack decoding algorithm, known as stack reordering, is shown not to be fully necessary. The algorithm is then modified and restated in a more concise form so that it can be efficiently implemented by a special type of systolic array called a systolic priority queue. This scheme is shown to have a simple and regular structure as well as high decoding speed for any convolutional code, including long convolutional codes.
It is shown how the Viterbi algorithm and the Fast Fourier Transform can be described in terms of the same graph topology and different algebraic kernels. This similarity is exploited to show how a network of 2n processors, interconnected as an n-dimensional cube and with no shared memory, can be efficiently used to implement a concurrent Viterbi algorithm. The modularity of this approach allows the decoding of large constraint length codes with multi-chip VLSI systems operating at different degrees of parallelism. The interprocessor communication overhead is minimized by using the trace-back method, which completely avoids any survivor exchange or any global memory operation. Results are presented on the efficiency achieved for codes with constraint length up to 15 on a general purpose hypercube computer, and comparisons are made with other parallel architectures.
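As a point of reference for the recursion being parallelized, here is a minimal hard-decision Viterbi decoder with trace-back for the standard rate-1/2, constraint-length-3 convolutional code (generators 7 and 5 octal). The code choice and variable names are our illustrative assumptions; the paper's contribution is the mapping of exactly this kind of recursion, including the trace-back step, onto a hypercube of processors.

```python
# Viterbi decoding with trace-back: forward pass keeps one survivor
# metric per state plus a per-step predecessor table; the backward pass
# recovers the input bits from the best final state.

def encode(bits):
    s1 = s2 = 0
    out = []
    for b in bits:
        out += [b ^ s1 ^ s2, b ^ s2]          # generators 111 and 101 (binary)
        s1, s2 = b, s1
    return out

def viterbi(received):
    nsteps = len(received) // 2
    metric = {0: 0}                           # start in the all-zero state
    history = []                              # per-step predecessor tables
    for t in range(nsteps):
        r = received[2 * t: 2 * t + 2]
        new, back = {}, {}
        for state, m in metric.items():
            s1, s2 = state >> 1, state & 1
            for b in (0, 1):
                expected = [b ^ s1 ^ s2, b ^ s2]
                cost = m + (expected[0] != r[0]) + (expected[1] != r[1])
                nxt = (b << 1) | s1
                if nxt not in new or cost < new[nxt]:
                    new[nxt], back[nxt] = cost, (state, b)
        metric = new
        history.append(back)
    state = min(metric, key=metric.get)       # best final state
    bits = []
    for back in reversed(history):            # trace-back, newest step first
        state, b = back[state]
        bits.append(b)
    return bits[::-1]

msg = [1, 0, 1, 1, 0, 0, 1, 0]
decoded = viterbi(encode(msg))
```

With an error-free channel, as here, the decoder recovers the message exactly.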
Delayed Decision Feedback Sequence Estimation (DDFSE) is a new signal processing algorithm for the detection of signals in the presence of noise. This paper describes the role of DDFSE in digital recording. The potential performance and complexity advantages of this algorithm are demonstrated by comparison to existing, alternative techniques.
We introduce a unified checksum scheme for the LU decomposition, Gaussian elimination with pairwise pivoting, and the QR decomposition. The purpose is to detect and locate a transient error during a systolic array computation. We show how to represent the error as a rank-one perturbation to the original data, so that it is not necessary to know when the error occurred. Finally, we perform a floating point error analysis to determine the effects of rounding errors on the checksum scheme.
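The checksum idea can be shown in its simplest setting, matrix multiplication: a column-checksum A times a row-checksum B gives a full-checksum product, so a single transient error is both detected and located by the row and column whose sums disagree. The sketch below is only this elementary baseline; the paper extends the idea to LU, pairwise-pivoted elimination and QR on systolic arrays.

```python
# Algorithm-based fault tolerance via checksums, illustrated on C = A*B.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def with_checksums(A, B):
    Ac = A + [[sum(col) for col in zip(*A)]]          # append column sums to A
    Br = [row + [sum(row)] for row in B]              # append row sums to B
    return matmul(Ac, Br)                             # full-checksum product

def locate_error(C, tol=1e-9):
    m, n = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(C[i][:n]) - C[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(C[i][j] for i in range(m)) - C[m][j]) > tol]
    return bad_rows, bad_cols

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = with_checksums(A, B)
C[0][1] += 1.0                     # inject a single transient error
rows, cols = locate_error(C)       # the inconsistent row/column pinpoint it
```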
The computations of eigenvalues and singular values are key to applications including signal and image processing. Since these algorithms require large amounts of computation, and since many digital signal processing applications have real-time requirements, many special-purpose processor array structures have been proposed for these two problems. This paper develops a new methodology to incorporate fault tolerance capability into processor arrays which have been proposed for these problems. In the first part of this paper, earlier techniques of algorithm-based fault tolerance are applied to QR factorization and QR iteration. This technique encodes input data at a high level by using the specific property of each algorithm and checks the output data before they leave the system. In the second part of the paper, special properties of eigenvalues and singular values are used to achieve error detection without encoding the input data. Fault location and reconfiguration are performed only after an erroneous signal has been detected. The introduced overhead is extremely low in terms of both hardware and time redundancy.
An important technique for object recognition in electro-optics, signal processing, and image understanding is to use a training algorithm to create a data base against which to compare data for objects being tested. The data for each training and test object is represented as a vector in a space of possible high dimension, perhaps in the hundreds or thousands. It is usually desired to project this data onto a space of much lower dimension in such a way that separation of object clusters is preserved. The difficulty with using this approach as it is usually presented is that it leads to inordinately large generalized matrix eigensystems that must be analyzed. Just as drastic is the large amount of data required for implementation. For instance, Fisher's linear discriminant method usually requires having at least as many training vectors as the dimension of the representation space. This is a severe limitation in that it would be preferable to train on reasonable amounts of data, say on samples of 20 data vectors in each class of objects. In this paper, we present a new methodology based on the Fisher linear discriminant method, but for the underdetermined case; that is, for the case of having only relatively small amounts of training data for each cluster of objects. The new algorithm is based partly on the original Fisher algorithm and partly on more recent fast algorithms for matrix factorizations. We also present examples showing application of this algorithm to the problem of automatic target recognition using images from FLIR data.
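For orientation, the classical two-class Fisher discriminant that the paper generalizes computes the projection direction w = Sw^-1 (m1 - m2), where Sw is the pooled within-class scatter; this baseline needs Sw to be invertible, which is exactly what fails in the underdetermined (few-training-vectors) case the paper addresses. A pure-Python sketch in two dimensions with illustrative data of ours:

```python
# Classical two-class Fisher linear discriminant in 2-D:
# w = Sw^-1 (m1 - m2), with Sw the pooled within-class scatter matrix.
# Baseline only; the paper's method handles the underdetermined case.

def mean(X):
    n = len(X)
    return [sum(x[d] for x in X) / n for d in range(len(X[0]))]

def scatter(X, m):
    S = [[0.0, 0.0], [0.0, 0.0]]
    for x in X:
        dx = [x[0] - m[0], x[1] - m[1]]
        for i in range(2):
            for j in range(2):
                S[i][j] += dx[i] * dx[j]
    return S

def fisher_direction(X1, X2):
    m1, m2 = mean(X1), mean(X2)
    S1, S2 = scatter(X1, m1), scatter(X2, m2)
    Sw = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]   # must be nonzero
    d = [m1[0] - m2[0], m1[1] - m2[1]]
    return [(Sw[1][1] * d[0] - Sw[0][1] * d[1]) / det,
            (-Sw[1][0] * d[0] + Sw[0][0] * d[1]) / det]

X1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
X2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]
w = fisher_direction(X1, X2)
# Projections onto w separate the two clusters.
p1 = [x[0] * w[0] + x[1] * w[1] for x in X1]
p2 = [x[0] * w[0] + x[1] * w[1] for x in X2]
```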
Many important algorithms in signal and image processing, speech and pattern recognition or matrix computations consist of coupled systems of recurrence equations. Systolic arrays are regular networks of tightly coupled simple processors with limited storage that provide cost-effective high-throughput implementations of many such algorithms. While there are some mathematical techniques for finding efficient systolic implementations for uniform recurrence equations, there is no general theory for more general coupled systems of affine recurrence equations. The first elements of such a theory are presented in this paper.
Real time signal processing tasks impose stringent requirements on computing systems. One approach to satisfying these demands is to employ intelligently interconnected multiple arithmetic units, such as multipliers, adders, logic units and others, to implement concurrent computations. Two problems emerge: 1) Programming: programs with wide instruction words have to be developed to exercise the multiple arithmetic units fully and efficiently to meet the real-time processing loads; 2) Design: given a set of real-time signal processing tasks, design procedures are needed to specify the multiple arithmetic units and their interconnection schemes for the processor. This paper presents a compiler which provides a solution to both the programming and the design problems. The compiler translates blocks of RISC-like instructions into programs of wide microinstructions; each of these microinstructions initiates many concurrently executable operations. In so doing, we seek to achieve the maximum utilization of execution resources and to complete processing tasks in minimum time. The compiler is based on an innovative "Dispatch Stack" concept and has been applied to program Floating Point Systems (FPS) processors; the resulting programs for computing inner products and other signal processing tasks are as good as those obtained by laborious hand-compilation. We then show that the compiler developed for programming can be used advantageously to design real-time signal processing systems with multiple arithmetic units.