The structured condition number of the solution of the Yule-Walker system of equations is given. It is found that there is little difference between this structured condition number and the general condition number of a Toeplitz matrix. As a consequence, general-purpose linear system solvers are stable for solving the Yule-Walker equations. By constructing appropriate examples it is shown that the Levinson algorithm is only weakly stable and is less trustworthy than the LDL^T algorithm. Our round-off error analysis also suggests that for better accuracy the Schur coefficients should be computed by the Schur algorithm and then used in the Levinson algorithm for computing the solution of the Yule-Walker equations.
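As an illustration of the recursion these stability results concern, here is a minimal real-valued sketch of the Levinson-Durbin algorithm for the Yule-Walker equations (variable names are ours; this is not the authors' implementation):

```python
def levinson_durbin(r):
    """Solve the Yule-Walker system for the AR(p) coefficients given
    autocorrelations r[0..p]; returns a = [1, a_1, ..., a_p] such that
    T @ a[1:] = -r[1:] for the symmetric Toeplitz matrix T built from r,
    plus the final prediction error."""
    p = len(r) - 1
    a = [1.0] + [0.0] * p
    err = r[0]
    for k in range(1, p + 1):
        # reflection (Schur) coefficient from the current residual
        acc = r[k] + sum(a[j] * r[k - j] for j in range(1, k))
        ref = -acc / err
        a_new = a[:]
        for j in range(1, k):          # order-update of the predictor
            a_new[j] = a[j] + ref * a[k - j]
        a_new[k] = ref
        a = a_new
        err *= (1.0 - ref * ref)       # error shrinks by 1 - ref^2
    return a, err
```

The reflection coefficients `ref` computed here are exactly the Schur coefficients mentioned above; the abstract's suggestion is to obtain them from the Schur algorithm instead of from this recursion.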
In this paper, we present a numerical stability analysis of the Bareiss algorithm for solving a symmetric positive definite Toeplitz system of linear equations. We also compare the Bareiss algorithm with the Levinson algorithm and conclude that the former has superior numerical properties.
We study fast preconditioned conjugate gradient (PCG) methods for solving least squares problems min ||b - Tx||_2, where T is an m X n Toeplitz matrix of rank n. Two circulant preconditioners are suggested: one, denoted by P, is based on a block partitioning of T and the other, denoted by N, is based on the displacement representation of T^T T. Each is obtained without forming T^T T. We prove formally that for a wide class of problems the PCG method with P converges in a small number of iterations independent of m and n, so that the computational cost of solving such Toeplitz least squares problems is O(m log n). Numerical experiments using both P and N are reported, indicating similarly good convergence properties for each preconditioner.
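To make the circulant-preconditioning idea concrete, the following sketch (our own, not the paper's P or N) builds T. Chan's optimal circulant approximation to a square Toeplitz matrix and applies its inverse in O(n log n) via the FFT, which is what makes circulant preconditioners cheap inside PCG:

```python
import numpy as np

def chan_circulant(col, row):
    """First column of T. Chan's optimal circulant approximation to the
    n x n Toeplitz matrix with first column `col` and first row `row`."""
    n = len(col)
    c = np.empty(n, dtype=float)
    c[0] = col[0]
    for k in range(1, n):
        # weighted average of the two Toeplitz diagonals that wrap onto position k
        c[k] = ((n - k) * col[k] + k * row[n - k]) / n
    return c

def circulant_solve(c, b):
    """Solve C x = b for the circulant C with first column c, in O(n log n):
    C is diagonalized by the DFT, with eigenvalues fft(c)."""
    return np.fft.ifft(np.fft.fft(b) / np.fft.fft(c))
```

Inside a PCG iteration, `circulant_solve` plays the role of applying the preconditioner's inverse to a residual vector.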
A new derivation of the QR-based fast recursive least squares algorithms for Toeplitz matrices is presented. Algorithms for computing Q and R in the QR decomposition of the data matrix are proposed. These algorithms can be efficiently incorporated with the fast recursive least squares algorithm and can be performed only when they are needed.
A new preconditioner is proposed for the solution of an N X N Toeplitz system T_N x = b, where T_N can be symmetric indefinite or nonsymmetric, by preconditioned iterative methods. The preconditioner F_N is obtained by factorizing the generating function T(z) into the product of two terms corresponding, respectively, to minimum-phase causal and anticausal systems, and is therefore called the minimum-phase LU (MPLU) factorization preconditioner. Due to the minimum-phase property, ||F_N^{-1}|| is bounded. For rational Toeplitz T_N with generating function T(z) = A(z^{-1})/B(z^{-1}) + C(z)/D(z), where A(z), B(z), C(z), and D(z) are polynomials of orders p1, q1, p2, and q2, we show that the eigenvalues of F_N^{-1} T_N are repeated exactly at 1 except for at most alpha_F outliers, where alpha_F depends on p1, q1, p2, q2, and the number omega of roots of the numerator A(z^{-1})D(z) + B(z^{-1})C(z) of T(z) outside the unit circle. A preconditioner K_N in circulant form, generalized from the symmetric case, is also presented for comparison.
We present a simple matrix representation of the Berlekamp-Massey algorithm for solving a Hankel system of linear equations, and we show how pivoting can be added to the algorithm to improve numerical accuracy of the method.
We explore the application of the nonsymmetric Lanczos algorithm to two different problem domains, the theory of moments and orthogonal polynomials, and the factorization of Hankel matrices. The connection with a third problem domain, algorithm-based fault tolerant computing, was explored in a companion paper. We find that in the simplest case, where all leading submatrices are nonsingular, the methods reduce to classical algorithms such as the original nonsymmetric Lanczos method and the Chebyshev algorithm. We propose a back-up pivoting strategy for factorizing a Hankel matrix which avoids treating rank deficiency as a special case.
A general solution for the problem of time-frequency signal representation of nonlinear FM signals is provided, based on a generalization of the Wigner-Ville distribution. The Wigner-Ville distribution (WVD) is a second-order time-frequency representation: it gives ideal energy concentration for quadratic phase signals, and its ensemble average is a second-order time-varying spectrum. The same holds for Cohen's class of time-frequency distributions, which are smoothed versions of the WVD. The WVD may be extended so as to achieve ideal energy concentration for higher-order phase laws, and such that its expectation is a time-varying higher-order spectrum. The usefulness of these generalized Wigner-Ville distributions (GWVD) is twofold. First, because they achieve ideal energy concentration for polynomial phase signals, they may be used for optimal instantaneous frequency estimation. Second, they are useful for discriminating between nonstationary processes of differing higher-order moments. In the same way that the WVD is generalized, we generalize Cohen's class of TFDs by defining a class of generalized time-frequency distributions (GTFDs) obtained by a two-dimensional smoothing of the GWVD. Another result derived from this approach is a method based on higher-order spectra which allows the separation of cross-terms and auto-terms in the WVD.
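As a reference point for the generalizations described above, a direct (and computationally naive) sketch of the discrete Wigner-Ville distribution itself might look like this; note that because the symmetric lag doubles the phase increment, a tone at normalized frequency f peaks in FFT bin 2*f*N:

```python
import numpy as np

def wigner_ville(z):
    """Discrete Wigner-Ville distribution of an analytic signal z of length N.
    Row n is the FFT over the lag tau of z[n+tau] * conj(z[n-tau]); a tone at
    normalized frequency f concentrates in bin k = 2*f*N."""
    N = len(z)
    W = np.zeros((N, N))
    for n in range(N):
        tmax = min(n, N - 1 - n)          # largest symmetric lag available at n
        ker = np.zeros(N, dtype=complex)
        for tau in range(-tmax, tmax + 1):
            ker[tau % N] = z[n + tau] * np.conj(z[n - tau])
        W[n] = np.fft.fft(ker).real       # instantaneous autocorrelation -> spectrum
    return W
```

The cross-terms the abstract refers to arise when `z` contains more than one component, since the product kernel then mixes the components pairwise.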
A general approach for obtaining joint representations in signal analysis is presented. The method is applied to scale representations. We define a new scale operator and show that it leads to joint representations of the Altes and Marinovic type; we also define joint representations with inverse frequency and show that these lead to Bertrand-Bertrand type distributions. We derive the uncertainty principle for scale and obtain the minimum time-scale uncertainty signal.
In this paper, we show how quadratic time-frequency representations are a generalization of the spectrogram and we review our results for time-frequency analysis and display of chirps and speech. We then show comparative performance on phase-shifted keyed communication signals. The concept of quadratic filtering is then introduced and linked to Teager's energy detector and the resolution advantages over linear filtering are demonstrated.
The well-known uncertainty principle is often invoked in signal processing. It is also often considered to have the same implications in signal analysis as does the uncertainty principle in quantum mechanics. The uncertainty principle is often incorrectly interpreted to mean that one cannot locate the time-frequency coordinates of a signal with arbitrarily good precision, since, in quantum mechanics, one cannot determine the position and momentum of a particle with arbitrarily good precision. Renyi information of the third order is used to provide an information measure on time-frequency distributions. The results suggest that even though this new measure tracks time-bandwidth results for two Gabor log-ons separated in time and/or frequency, the information measure is more general and provides a quantitative assessment of the number of resolvable components in a time-frequency representation. As such, the information measure may be useful as a tool in the design and evaluation of time-frequency distributions.
This paper presents a review of some concepts associated with time-frequency distributions -- the instantaneous frequency, group delay, instantaneous bandwidth, and marginal properties -- and generalizes them in time-frequency via rotation of coordinates. This work emphasizes the need to examine time-frequency distributions in the general time-frequency plane, rather than restricting oneself to a time and/or frequency framework. This analysis leads to a generalized uncertainty principle, which has previously been introduced in radar theory. This uncertainty principle is invariant under rotation in the time-frequency plane, and should be used instead of the traditional definition of Gabor. It is desired to smooth a time-frequency distribution that is an energy density function into one that is an energy function. Most distributions are combinations of density and energy functions, but the Wigner-Ville distribution is purely a density function. By using a local version of the generalized uncertainty principle, the Wigner-Ville distribution is smoothed into a signal-dependent spectrogram using an iterative algorithm. It is believed that this procedure may represent, in some way, an optimum removal of signal uncertainty in the time-frequency plane.
The Wigner distribution and its smoothed versions, i.e., with Choi-Williams and Gaussian kernels, are evaluated for propulsion system diagnostics. The approach is intended for off-line kernel design, using the ambiguity domain to select the appropriate Gaussian kernel. The features produced by the Wigner distribution and its smoothed versions correlate remarkably well with documented failure indications. The selection of the kernel, on the other hand, is very subjective for our unstructured data.
The on-line identification of continuously adaptive rational models is considered for data records which are realizations of nonstationary processes. Emphasis is placed on the treatment of arbitrary time variations of the model coefficients. This motivates a novel approach to time-varying modeling, based only on limited a priori knowledge about the nature of the nonstationarity, namely the expected bandwidth of the time evolution of the parameter vector. The cost criterion considered is a constrained least squares functional which incorporates, with equal weight, all instantaneous errors up to the current time of observation. The constraint is specified from the expected time-evolution bandwidth through explicit filtering of the parameters to be estimated. Computational considerations and associated trade-offs are discussed for the modeling of various nonstationary data.
An efficient algorithm is presented for computing the continuous wavelet transform and the wideband ambiguity function on a sample grid with uniform time spacing but arbitrary sampling in scale. The method is based on the chirp z-transform and requires the same order of computation as constant-bandwidth analysis techniques, such as the short-time Fourier transform and the narrowband ambiguity function. An alternative spline approximation method which is more efficient when the number of scale samples is large is also described.
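The chirp z-transform at the heart of the method can be sketched via Bluestein's identity nk = (n^2 + k^2 - (k-n)^2)/2, which turns the transform into a linear convolution computable by FFT. This is a generic CZT, not the paper's wavelet-specific algorithm:

```python
import numpy as np

def czt(x, M, W, A):
    """Chirp z-transform: X[k] = sum_n x[n] * A**(-n) * W**(n*k), k = 0..M-1,
    evaluated via Bluestein's identity n*k = (n**2 + k**2 - (k-n)**2) / 2."""
    N = len(x)
    L = 1
    while L < N + M - 1:
        L *= 2                               # FFT length for the linear convolution
    n = np.arange(N)
    y = np.zeros(L, dtype=complex)
    y[:N] = x * A**(-n) * W**(n**2 / 2.0)    # pre-chirped input
    v = np.zeros(L, dtype=complex)
    v[:M] = W**(-np.arange(M)**2 / 2.0)      # chirp filter, non-negative lags
    v[L - N + 1:] = W**(-np.arange(N - 1, 0, -1)**2 / 2.0)  # negative lags, wrapped
    g = np.fft.ifft(np.fft.fft(y) * np.fft.fft(v))
    return g[:M] * W**(np.arange(M)**2 / 2.0)  # post-chirp
```

With A = 1 and W = exp(-2j*pi/N) this reduces to the ordinary DFT; choosing W along other contours gives the arbitrary scale sampling the abstract exploits.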
A high performance VLSI architecture to perform combined multiply-accumulate, divide, and square root operations is proposed. The circuit is highly regular, requires only minimal control, and can be pipelined right down to the bit level. The system can also be reconfigured on every cycle to perform one or more of these operations. The throughput rate for each operation is the same and is wordlength independent. This is achieved using redundant arithmetic. With current CMOS technology, throughput rates in excess of 80 million operations per second are expected.
This paper describes aspects of the arithmetic algorithms, architecture, and VLSI engineering of the 64-bit floating-point unit of the T9000 Transputer. The unit is fully conformant to the IEEE-754 floating-point arithmetic standard, and has been implemented in a 1 micrometer, triple-metal CMOS technology. The 160,000-transistor design performs addition in 40 ns, double-precision multiplication in 60 ns, and double-precision division or square root in 300 ns. It will sustain 17 MFlops on the Linpack benchmark, yet occupies less than 15 mm^2 of silicon -- about 8.5% of the die area of the T9000.
An arithmetic unit based on a high-speed multiplier with a redundant binary addition tree is proposed. It is efficient for numerical computations with iteration of multiplications and addition/subtractions. A new multiplier recoding method makes the arithmetic unit efficient for these computations.
In this paper we consider using 'digit serial' processing to build high performance parallel structures, in particular, parallel signal processors. Digit serial arithmetic processors have digit serial data transmission combined with digit serial computation. Three digit serial arithmetic processors are presented and compared with their digit parallel counterparts. We show that by using a digit serial approach we can achieve a higher throughput than with a digit parallel processor, even though the two processors are structurally similar and have components of similar complexity.
An arithmetic-level approach to increasing computation rate in solving recurrence problems is discussed. The approach, based on on-line arithmetic principles, is discussed in problems such as root finding and recursive filters. The main objective is to highlight the potential benefits and problems of the approach.
We present here some algorithms for on-line computation of elementary functions. These algorithms use shift-and-add as the elementary step and need signed-digit representations of numbers. We then give some theoretical results about on-line computation of functions. For instance, we show that a finite automaton (in practice, a bounded size and memory operator) can compute on-line only piecewise affine functions.
As CORDIC algorithms receive more and more attention in elementary function evaluation and signal processing applications, the problem of their VLSI realization has attracted considerable interest. In this work we review the CORDIC fundamentals covering algorithm, architecture, and implementation issues. Various aspects of the CORDIC algorithm are investigated such as efficient scale factor compensation, redundant and non-redundant addition schemes, and convergence domain. Several CORDIC processor architectures and implementation examples are discussed.
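A scalar floating-point model of the rotation-mode iteration reviewed here may be helpful; hardware realizations use fixed-point shifts and adds and typically fold the scale-factor compensation into the datapath, which this sketch applies only at the end:

```python
import math

def cordic_sincos(theta, iters=40):
    """Rotation-mode CORDIC: shift-and-add micro-rotations drive the residual
    angle z to zero. Convergence domain: |theta| <= sum(atan(2**-i)) ~ 1.743 rad."""
    angles = [math.atan(2.0**-i) for i in range(iters)]
    K = 1.0
    for i in range(iters):
        K /= math.sqrt(1.0 + 4.0**-i)        # cumulative inverse rotation gain
    x, y, z = 1.0, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0.0 else -1.0        # rotate toward z = 0
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * angles[i]
    return x * K, y * K                      # (cos(theta), sin(theta))
```

The scale-factor compensation and redundant addition schemes surveyed in the paper address exactly the `K` correction and the carry-propagation in the add steps above.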
The need to construct architectures in VLSI has focused attention on unnormalized floating-point arithmetic. Certain unnormalized arithmetics allow one to 'pipe on digits,' thus producing significant speed-up in computation and making the input problems of special-purpose devices such as systolic arrays easier to solve. We consider the error analysis implications of using unnormalized arithmetic in numerical algorithms. We also give specifications for its implementation. Our discussion centers on the example of Gaussian elimination. We show that the use of unnormalized arithmetic requires changes in the analysis of this algorithm. Only for certain classes of matrices, including diagonally dominant matrices (by either rows or columns), is Gaussian elimination as stable in unnormalized arithmetic as in normalized arithmetic. However, if the diagonal elements of the upper triangular matrix are post-normalized, then Gaussian elimination is as stable in unnormalized arithmetic as in normalized arithmetic for all matrices.
This paper describes a programmable radar signal processor architecture developed at the Naval Research Laboratory (NRL). The design incorporates T.I. TMS320C30 programmable digital signal processor devices, Xilinx programmable gate arrays, TRW FFT devices, and a parallel array of Inmos Transputer microprocessors. The architecture is extremely flexible and is applicable to a wide variety of applications.
Several important problems in signal processing, such as linear prediction, linear regression, or spectrum factorization, require close-to-Toeplitz matrices to be factored. To solve these problems, several fast algorithms have been derived. They differ in the kind of adaptivity (block processing or exponential weighting of the data) and in the kind of recursion (order or time recursion), but they all share a common vector ladder recursion involving a hyperbolic rotation. These algorithms are therefore well suited for implementation on an array processor. However, there are a number of applications where an efficient parallel implementation of these algorithms on a DSP network would be very attractive. The redundancy suppression performed in the equations to obtain a fast algorithm destroys the processing regularity of the corresponding standard algorithms, which prevents efficient high-level parallel implementation. Auxiliary quantities, such as generalized reflection coefficients, are introduced that do not have the same dimensions as the primary quantities. As a result there is a loss of efficiency in processing such tasks, and this leads to partitioning the array into as many sub-arrays as the number of DSPs available. If the number m of DSPs is low compared to the dimension p of the primary quantities (m << p/m), i.e., a low level of parallelism, and if the dimension alpha of the reflection coefficients is also low (alpha << p/m), the global efficiency of a parallel implementation on a DSP network may still be interesting, with, in addition, the advantages associated with such a network: simple design and simple control. To illustrate this, an application of the method to a fast recursive least squares algorithm is presented.
A programmable, reconfigurable DSP array has been designed in response to the need for a real-time adaptive sidelobe canceller to support wide bandwidth high frequency radar concepts. These concepts incorporate multiple (up to 128) mainbeams and many degrees of freedom using many auxiliary antenna elements, and employ frequency sub-banding to partition a 1 MHz instantaneous bandwidth. The real-time sidelobe canceller is based on the Gram-Schmidt orthogonalization procedure and uses concurrent block adaptation to derive an optimal solution for the available data. The sidelobe canceller implementation configures the modular DSP array into a two-dimensional Gram-Schmidt processor. A proof-of-concept sidelobe canceller implementation will provide the capability to perform sidelobe cancellation on two simultaneous mainbeams using eight auxiliary channels. The DSP array sidelobe canceller can be expanded to support more than 128 mainbeams using auxiliary channels. The array consists of TMS320C30-based processing nodes providing four independent data ports (two inputs and two outputs) connected via high-speed serial data links supporting two megawords-per-second sustained throughput rates (8 megawords-per-second burst rates). These manually configured data connections are separated from the control structure, allowing a variety of interconnect strategies (one-dimensional, two-dimensional, etc.). A host processor is used to download application code and control the system. This programmable, reconfigurable array processor can also be used for a variety of other applications including the singular value decomposition, matrix-matrix multiplication, and FFTs.
This paper describes an efficient implementation of auxiliary constraints for a concurrent block least squares adaptive sidelobe canceller when a single array of sensors is used to form one or more main beams. The approach is to compute the QR decomposition of the auxiliary data matrix and then send this information to the main beam processors, where the constraints are applied using a blocking matrix and the individual residuals are computed. The blocking matrix can be chosen with a special structure which is used to derive a new fast algorithm and architecture for constrained main beam processing that reduces the operation count from order n^3 to order n^2, where n is the number of auxiliary sensors.
We compare the performance of three parallel supercomputers executing a bispectrum estimation code used to remove distortions from astronomical data. We discuss the issues in parallelizing the code on an 8-processor shared-memory CRAY Y-MP and a 1024-processor distributed-memory nCUBE machine. Results show that elapsed times on the nCUBE machine are comparable to those on the CRAY Y-MP. Execution on the nCUBE was more than 40 times faster than on a single-processor CRAY-2, resulting in more than 50 times better cost performance. Cost performance on the nCUBE is more than 25 times better than on an 8-processor CRAY Y-MP.
The iterative phase gradient autofocus (PGA) algorithm for automatically focusing synthetic aperture radar images has been implemented on both a 16384-processor Connection Machine and the 1024-processor nCUBE 2 hypercube. Massive parallelism has proven its value by dramatically reducing processing times, by an order of magnitude or more, over those achieved on sequential machines. This is especially important for very large images or where high volumes of input and output are encountered. We provide an overview of the PGA algorithm, highlighting opportunities for significant speed-up on parallel architectures. This is followed by implementation details and timing results.
Two methods of focusing synthetic aperture radar (SAR) images are compared. Both a conventional subaperture cross-correlation method and a new phase gradient autofocus (PGA) algorithm developed at Sandia National Laboratories are shown to perform well if high-order phase errors are not present. With the introduction of significant high-order phase errors [e.g., due to uncompensated platform motion], both algorithms suffer a loss in performance. However, relative performance degradation is less for PGA than for the subaperture focusing technique. An explanation is presented for the observed behavior of the two autofocus techniques.
We present an analysis and computational results relating to the regularized restoration of subpixel information from undersampled data. The method makes use of a small set of images in various stages of defocus. An iterative implementation permits the incorporation of a non-negativity constraint. The problem we consider is fundamentally under-determined, but useful results can be obtained in reasonably low noise conditions.
In this paper, we introduce a rank-one spherical subspace update that is appropriate for tracking the dominant (signal) and/or subdominant (noise) subspaces associated with a slowly time-varying correlation matrix. This non-iterative, highly parallel, numerically stabilized subspace update is closely related to rank-one eigenstructure updating. However, a rank-one subspace update involves less computation than simple rank-one correlation accumulation. Moreover, the frequency tracking capabilities of the non-iterative subspace update are virtually identical to, and in some cases more robust than, those of the more computationally expensive eigen-based methods.
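The flavor of correlation-based subspace tracking can be conveyed by this deliberately simple sketch: a rank-one exponential update of the correlation estimate followed by a single orthogonal-iteration step. It is not the spherical subspace update of the paper, only a baseline of the kind such updates improve on:

```python
import numpy as np

def track_subspace(Q, x, R, beta=0.98):
    """One step of a naive dominant-subspace tracker: exponentially weighted
    rank-one correlation update, then one re-orthonormalized power step.
    Q: n x r orthonormal subspace estimate; x: new snapshot; R: correlation."""
    R = beta * R + np.outer(x, x.conj())    # rank-one correlation accumulation
    Q, _ = np.linalg.qr(R @ Q)              # orthogonal-iteration refinement
    return Q, R
```

A true rank-one subspace update avoids forming and multiplying by the full n x n correlation matrix, which is exactly the cost saving the abstract claims.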
Some recent results in the theory of adaptive array detection are presented. Detection curves are given for various detection algorithms. These detection algorithms are suitable for nulling out noise or interference sources in which the noise statistics are to be estimated from target-free data. They share the constant false alarm rate property, so that their false alarm rate can be set without knowledge of the noise covariance matrix. Also considered is a detection algorithm employing diagonal loading.
This paper describes a cascade decomposition of the generalized sidelobe canceller (GSC) implementation for linearly constrained minimum variance beamformers. The GSC is initially separated into an adaptive interference cancellation module followed by a non-adaptive beamformer. We prove that the adaptive interference cancellation module can be decomposed into a cascade of first (or higher) order adaptive interference cancellation modules, where the order corresponds to the number of adaptive degrees of freedom represented in the module. This distributes the computational burden associated with determining the adaptive weights over several lower order problems and facilitates simultaneous implementation of beamformers with differing numbers of adaptive degrees of freedom.
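A minimal sketch of the basic (undecomposed) GSC structure for a single linear constraint c^H w = 1 may help fix ideas; the blocking matrix is built here from an SVD, which is our construction for illustration, not the paper's cascade decomposition:

```python
import numpy as np

def gsc_weights(c, wa):
    """Generalized sidelobe canceller weights w = w_q - B @ wa for the single
    constraint c^H w = 1: a fixed quiescent beamformer plus an unconstrained
    adaptive path that is blocked from the constraint direction."""
    n = len(c)
    wq = c / (c.conj() @ c)                 # quiescent (non-adaptive) beamformer
    # blocking matrix: orthonormal basis for the null space of c^H,
    # taken from the right singular vectors of the row vector c^H
    _, _, Vh = np.linalg.svd(c.conj().reshape(1, n))
    B = Vh[1:].conj().T                     # n x (n-1), satisfies c^H B = 0
    return wq - B @ wa, B
```

The paper's cascade replaces the single (n-1)-dimensional adaptive stage `wa` with a chain of low-order adaptive modules, distributing the same total degrees of freedom.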
In many applications (radar, communication, plasma physics) the signal of interest is sinusoidal and is hidden in non-white noise. The multi-window method of spectrum estimation gives a constant false alarm test for the presence of a sinusoid in a time series if the noise can be assumed Gaussian. In this paper we generalize the method to an array and study the resulting test. Expressions for the probabilities of detection and of false alarm are obtained analytically and receiver operating characteristic curves are computed for a particular scenario under different conditions.
We compare the modified Wigner distribution functions obtained via the Choi-Williams kernel and its rotation, as well as by the tilted Gaussian kernel. Based on several commonly used examples, we demonstrate that the modified Wigner distribution obtained via the Gaussian kernel can minimize the artifacts more effectively and has the capability of selectively filtering out undesired components.
The performance of both single-frequency and multifrequency Beamspace Root-MUSIC incorporating coherent signal subspace processing is assessed employing data from an experimental array referred to as the Multi-Parameter Adaptive Radar System (MARS). Beamspace Root-MUSIC is a recently developed, computationally efficient beamspace implementation of MUSIC. The experiment that yielded the data to which Beamspace Root-MUSIC is applied emulated a low-angle radar tracking scenario with two coherent signals arriving within a beamwidth of each other at broadside. Accurate estimates are obtained with the incorporation of the multipath geometry as a priori information in the form of the bisector angle.
A new approach to the problem of detecting the number of signals in unknown colored noise environments is presented. Based on an assumption that the noise is correlated only over a limited spatial range, the principle of canonical correlation analysis is applied to the outputs of two spatially-separated arrays. The number of signals is determined by testing the significance of the sample canonical correlation coefficients. The new method is shown to work well in both white and unknown colored noise situations and does not require any subjective threshold setting. Instead, a set of threshold values is generated according to a specified or desired false alarm rate. Simulation results are included to illustrate the comparative performance of the proposed canonical correlation technique (CCT) versus the well-known AIC and MDL criteria in colored noise. It is found that the performance of the AIC and MDL criteria degrades very rapidly as the degree of color in the noise increases. On the other hand, the performance of the CCT method is relatively insensitive to variations in the degree of color.
Inverse synthetic aperture radar (ISAR) is a radar imaging method by which the rotation of a target is utilized to produce a two-dimensional image. To achieve high resolution in both range and cross range, a series of stepped-frequency waveforms is transmitted. These sample the target's radar response in frequency and time. Frequency transforms to time, which is proportional to range, and time transforms to Doppler frequency, which is proportional to cross range. Therefore a two-dimensional Fourier transform can be applied to the two-dimensional data set to produce a radar image. However, this is an approximation, as the data are in a polar format, which only approximates a rectangular grid. Therefore resampling (interpolation) is required to change the grid from a polar to a rectangular format. The resampling in this case is straightforward. In an attempt to obtain higher resolution images, the Fourier transform has been replaced by the multiple signal classification (MUSIC) algorithm. The justification for this is that the targets of interest are manmade and so have sharp edges and corners. Therefore they consist of a number of corner reflectors against a background of continuous reflectors. The corner reflectors by their nature will generally give much stronger reflections, so one can, with a certain degree of accuracy, approximate the ship as a collection of corner reflectors. Over the small change in aspect angle for which ISAR imaging is performed, corner reflectors can be approximated as point scatterers. This leads to the data being modeled as a collection of complex exponentials with added white Gaussian noise, the noise being due to thermal noise in the radar system. This type of data set is ideal for the two-dimensional MUSIC algorithm. There are two major difficulties in applying the MUSIC algorithm to ISAR imaging. First, the MUSIC estimator is more sensitive to the sampling grid being polar than the Fourier transform is.
Second, the resampling is less effective for the MUSIC algorithm than for the Fourier transform. Therefore it is more important to perform resampling when using the MUSIC algorithm, but harder to do it effectively. The reasons for this problem, and possible ways of solving it, are the topic of this paper.
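For reference, here is a 1-D analogue of the MUSIC estimator discussed above, applied to an exact covariance of two complex exponentials; the ISAR application is two-dimensional, but the subspace mechanics are the same (our sketch, not the paper's implementation):

```python
import numpy as np

def music_spectrum(R, d, freqs):
    """1-D MUSIC pseudospectrum from an n x n covariance R assumed to contain
    d complex exponentials: peaks occur where the steering vector is
    orthogonal to the noise subspace."""
    n = R.shape[0]
    _, V = np.linalg.eigh(R)                # eigenvalues in ascending order
    En = V[:, :n - d]                       # noise-subspace eigenvectors
    P = []
    for f in freqs:
        a = np.exp(2j * np.pi * f * np.arange(n))   # steering vector at f
        P.append(1.0 / np.linalg.norm(En.conj().T @ a)**2)
    return np.array(P)
```

The polar-grid sensitivity noted in the abstract enters because, off a rectangular grid, the data no longer match the complex-exponential steering model assumed here.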
We present an algorithm that can detect and isolate a single passive antenna failure under the assumption of slowly time varying signal sources. Our failure detection algorithm recursively computes an eigenvalue decomposition of the covariance of the "syndrome" vector. The sensor failure is detected using the largest eigenvalue, and the faulty sensor is located using the corresponding eigenvector. The algorithm can also be used in conjunction with existing singular value decomposition or orthogonal triangularization based recursive antenna array processing methods.
Estimation and tracking of the instantaneous amplitudes and frequencies of superimposed, slowly varying narrowband signals is a difficult signal processing problem that shows up in many applications. Our approach to this problem is to achieve global noise averaging via SVD-based rank reduction of a matrix constructed from the entire data record. Compared to methods that use local noise averaging over many smaller matrices, the strength of the new approach is in effecting better noise reduction at a lower computational cost. Moreover, no model is needed for the variation of the amplitudes and frequencies with time. In this paper, Cramer-Rao bounds for the variance of the error in tracking the instantaneous amplitudes and frequencies (without assuming a parametric model for their variation with time) will be presented. In addition, we will also show how the algorithm can be used on sensor array data to estimate the range and direction of multiple narrow-band sources.
A new type of high-performance VLSI systolic array is presented that is able to perform two-dimensional convolution with kernel sizes larger than the physical array of processing elements. This array is particularly well suited for neural network image processing algorithms that use large connected neighborhoods to model the transformations between layers of neurons. The new VLSI systolic array can also perform two-dimensional convolution with the small kernels (such as 3 X 3) that are often used in more standard image processing. In addition, the systolic array can perform one-dimensional convolution and matrix-vector multiplication. The interface of the array to external memory is designed such that a conventional linear memory architecture is used for accessing and storing data. No variable-length scan-conversion shift registers are needed by the systolic array to access an image stored in a conventional raster-scan format. Such scan-conversion variable-length shift registers are often required with other systolic array architectures. The VLSI array is extendible, so that both single-chip and multiple-chip architecture systems can be built.
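The operation such an array implements is, per output pixel, an inner product of the kernel with an image neighborhood. A direct scalar reference (in the correlation form common in image processing, without the flipped kernel of strict convolution) is:

```python
def conv2d_valid(image, kernel):
    """Direct 'valid' 2-D correlation: the per-pixel multiply-accumulate
    computation a systolic array of MAC cells carries out in parallel."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (iw - kw + 1) for _ in range(ih - kh + 1)]
    for r in range(ih - kh + 1):
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out
```

For a kernel larger than the physical array, the systolic approach described above partitions these accumulation loops across passes while this reference computes them all at once.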
As is well known, collision checking in a generalized sense is a substantial contributor to processor load in a wide range of path-planning applications. These include high-speed terrain following and obstacle avoidance in low-flying aircraft, path planning on local scales for autonomous vehicles, both undersea and on land, and a number of areas in robotic motion planning, not only for mobile robot navigation but also for such problems as arm motion in cluttered environments. Some algorithms have been quoted as requiring from 80% to as much as 95% of the available path-planning time for collision checking. Under such conditions, special-purpose hardware designed for the requirements of collision checking offers the promise of major overall performance improvements in real-time path planning. Recent work at the Lockheed Palo Alto Research Laboratories has produced a hardware system providing a major acceleration in collision-checking capability with minimal added hardware, and demonstrated an approach leading to still further major increments in speed for this function. This paper first describes the architecture of the system, which is known as TIGER (Three-dimensional Intersect & Geometrical Evaluator in Real time). This is followed by an analysis of the computational load of the collision-checking problem and a discussion of important design and performance considerations arising from the approach, both within TIGER itself and in the integration of TIGER into a higher-level path-planning system.