In this paper we present a novel approach to solving sets of linear equations as they appear in many digital signal processing problems. The approach avoids the back-substitution step or orthogonal transformation that follows the factorization step in the conventional direct methods of QR and LQ factorization. Instead, the novel algorithm computes the solution x directly from the factorization of the matrix A, using orthogonal or J-orthogonal transformations. It can be combined with the (generalized) Schur algorithm, which computes the Cholesky factorization of A efficiently when A is symmetric positive definite and Toeplitz or close to Toeplitz. When A is a general (nonsingular) matrix, the complete equation solver turns out to be an orthogonal equivalent of Faddeeva's algorithm and can be similarly generalized to a larger class of matrix arithmetic.
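The abstract does not spell out the novel algorithm, but the classical Faddeeva scheme it is compared against can be sketched as follows: eliminating A in the bordered matrix [[A, b], [-c^T, d]] by plain forward elimination (no back substitution) leaves d + c^T A^{-1} b in the corner. The function name and the unpivoted Gaussian elimination below are ours, a minimal illustration rather than the paper's orthogonal variant:

```python
import numpy as np

def faddeeva(A, b, c, d=0.0):
    """Classical Faddeeva scheme: eliminate A in the bordered matrix
    [[A, b], [-c^T, d]]; the bottom-right entry becomes d + c^T A^{-1} b.
    Assumes A is nonsingular and elimination needs no pivoting."""
    n = A.shape[0]
    M = np.zeros((n + 1, n + 1))
    M[:n, :n] = A
    M[:n, n] = b
    M[n, :n] = -c
    M[n, n] = d
    for k in range(n):  # forward elimination only, no back substitution
        M[k + 1:, :] -= np.outer(M[k + 1:, k] / M[k, k], M[k, :])
    return M[n, n]
```

Running this with c equal to each unit vector recovers the components of A^{-1} b, i.e., the solution of Ax = b, without a back-substitution sweep.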
For general linear algebraic computations in science, engineering and mathematics, problem sizes typically grow until constraints imposed by computation rates, I/O or memory availability result in unreasonably long execution times. From this point of view such computing architectures as systolic arrays provide a potential for processing much larger arrays. However, in order to utilize this potential, the problem of how to efficiently decompose large arrays onto a fixed systolic array in a numerically stable way must be solved. In some cases this is most efficiently done by partitioning the algorithm so that it can be run as a sequence of operations on input subarrays that are of the same size as the underlying hardware array. For some cases, such as those associated with banded matrices, it is necessary only to process the non-zero data. In all cases, however, the algorithms must run on the same hardware array without an undue amount of extra control or communication overhead. In this paper we describe algorithmic and architectural approaches to providing a fairly broad range of linear algebraic operations on matrices that are much larger than the underlying hardware array. All of the techniques we describe here have been simulated in detail using APL in order to verify correctness.
This paper will present preliminary concepts for the design of a systolic array of processors specifically aimed at efficient implementation of a core set of matrix operations consisting of matrix multiplication, QRD, SVD and generalized SVD. The algorithms to be implemented will be discussed briefly. Concepts for efficient implementation of the algorithms will be presented along with future plans.
The Wigner-Ville Distribution (WVD) is a valuable tool for time-frequency signal analysis. To implement the WVD in real time, an efficient algorithm and architecture have been developed that can be realized with commercial components. This algorithm successively computes the analytic signal corresponding to the input signal, forms a weighted kernel function, and analyses the kernel via a Discrete Fourier Transform (DFT). To evaluate the analytic signal required by the algorithm, it is shown that the time-domain definition implemented as a finite impulse response (FIR) filter is practical and more efficient than the frequency-domain definition of the analytic signal. The windowed resolution of the WVD in the frequency domain is shown to be similar to the resolution of a windowed Fourier transform. A real-time signal processor has been designed for evaluation of the WVD analysis system. The system is easily paralleled and can be configured to meet a variety of frequency and time resolutions. The arithmetic unit is based on a pair of high-speed VLSI floating-point multiplier and adder chips. Dual operand buses and an independent result bus maximize data transfer rates. The system is horizontally microprogrammed and utilizes a full instruction pipeline. Each microinstruction specifies two operand addresses, a result location, the type of arithmetic, and the memory configuration. Input and output are via shared memory blocks, with front-end processors to handle data transfers during the non-access periods of the analyzer.
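As an illustrative sketch of the pipeline the abstract describes (analytic signal via a time-domain FIR Hilbert transformer, weighted kernel, then a DFT over the lag variable), assuming a rectangular kernel window and with all function names our own:

```python
import numpy as np

def analytic_fir(x, L=31):
    """Analytic signal via a truncated FIR Hilbert transformer
    (the time-domain definition favored in the abstract):
    h[k] = 2/(pi*k) for odd k, 0 otherwise, over k = -L//2..L//2."""
    k = np.arange(-(L // 2), L // 2 + 1)
    # (k == 0) term only guards the division; h is 0 at even k anyway
    h = np.where(k % 2 != 0, 2.0 / (np.pi * k + (k == 0)), 0.0)
    return x + 1j * np.convolve(x, h, mode="same")

def wvd_slice(xa, n, L):
    """One windowed WVD frequency slice at time index n: form the
    kernel xa[n+tau] * conj(xa[n-tau]) over lags |tau| < L, then
    take its DFT over the lag variable."""
    tau = np.arange(-L, L)
    kern = xa[n + tau] * np.conj(xa[n - tau])
    return np.fft.fft(np.fft.ifftshift(kern))
```

For a pure tone the slice peaks at twice the tone's normalized frequency, reflecting the factor-of-two lag scaling in the WVD kernel.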
The laboratory realization of a hybrid time- and space-integrating acousto-optic array processor is described, with emphasis on the fabrication of the system and its electronic support and on a finite-element case study solved on the laboratory facility. The output detector system in this processor is unique and allows the use of different number representations. We emphasize the use of this system with a new sign-magnitude bipolar data representation that is quite attractive for use with a new one-channel LU decomposition algorithm and architecture for solving linear algebraic equations (LAEs). These features are employed in our finite-element case study. This work represents: the first laboratory optical matrix-vector multi-channel processor; the first laboratory realization and demonstration of a new LU decomposition algorithm; the first laboratory demonstration of a direct LAE solution; the first optical laboratory demonstration of a finite-element-method solution; a new mixed-radix to binary conversion technique; and a new partitioning technique that allows higher accuracy than the number of bit channels permits, together with system hardware and speed trade-offs. Such optical array processors are of use as linear algebra processors, associative memory processors, feature extractors for pattern recognition, and in nearest-neighbor classifiers as well as neural networks.
Architectures for systolic array processor elements for calculating the singular value decomposition (SVD) are proposed. These special-purpose VLSI structures incorporate coordinate rotation (CORDIC) algorithms to diagonalize 2x2 submatrices of a large array. The area-time complexity of the proposed architectures is analyzed, along with topics related to a prototype implementation.
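The 2x2 diagonalization step such a processing element performs can be written, in floating point rather than CORDIC shift-add form, as a left rotation that symmetrizes the block followed by a Jacobi rotation that annihilates the off-diagonals. This is a generic sketch of the mathematics, not the proposed VLSI datapath:

```python
import numpy as np

def rot(t):
    """2x2 plane (Givens) rotation by angle t."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

def svd2x2(M):
    """SVD of a real 2x2 block, M = U @ S @ V.T, via two plane
    rotations: a left rotation that symmetrizes M, then a Jacobi
    rotation that zeroes the off-diagonal of the symmetric result."""
    (a, b), (c, d) = M
    theta = np.arctan2(c - b, a + d)        # symmetrizing left angle
    Ms = rot(theta).T @ M                   # Ms is now symmetric
    p, q, r = Ms[0, 0], Ms[1, 1], Ms[0, 1]
    psi = 0.5 * np.arctan2(2 * r, p - q)    # Jacobi angle
    V = rot(psi)
    S = V.T @ Ms @ V                        # diagonal up to roundoff
    U = rot(theta) @ V
    return U, S, V
```

The diagonal of S holds the singular values up to sign and ordering; a CORDIC implementation approximates each rot() by shift-and-add iterations.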
Structures defined by sets of edge-based primitives are, alongside textures, essential features for understanding the visual world. A complete set of filters is introduced that forms edges or lines of several classes, as well as crossings of lines and areas characterized by different borders. It can be shown that such arrangements can be built without restriction on the size of the array they overlap, using a hierarchical linking procedure. The problem inherent in processing sequences of growing length, combinatorial explosion, is avoided by the proposed hierarchical system, which generalizes the continuity of sets of primitives by the primitives themselves. A second hierarchy, providing building blocks for primitives of 2^n-fold size, makes information from different levels of both hierarchies comparable. This makes it possible to predict the continuity and smoothness of a picture-describing element even when its connectivity is locally destroyed.
A stochastic search technique called simulated annealing can solve a class of problems termed non-convex optimization by seeking the lowest minimum of a multi-minima function. Simulated annealing is a generalized Monte Carlo technique with a continuously decreasing variance controlled by the temperature parameter.
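A minimal sketch of the technique (Metropolis acceptance with a geometrically decreasing temperature playing the role of the variance control); the objective function, schedule, and constants below are illustrative choices, not taken from the paper:

```python
import math
import random

def anneal(f, x0, step=0.5, t0=2.0, cooling=0.995, iters=5000, seed=1):
    """Simulated annealing on a 1-D objective f: propose a random move,
    always accept improvements, accept worsenings with probability
    exp(-delta/T), and shrink T geometrically each iteration."""
    rng = random.Random(seed)
    x, fx, T = x0, f(x0), t0
    best, fbest = x, fx
    for _ in range(iters):
        y = x + rng.uniform(-step, step)
        fy = f(y)
        if fy < fx or rng.random() < math.exp((fx - fy) / T):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        T *= cooling                     # decreasing "variance" / temperature
    return best, fbest
```

At high temperature the walk escapes local minima; as T falls, it settles into the lowest basin visited, which is what distinguishes annealing from simple descent on a multi-minima function.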
A VLSI implementation of a Fast Fourier Transform (FFT) processor consisting of a mesh interconnection of complex floating-point butterfly units is presented. The Cooley-Tukey radix-2 Decimation-In-Frequency (DIF) formulation of the FFT was chosen since it offered the best overall compromise between the need for fast and efficient algorithmic computation and the need for a structure amenable to VLSI layout. Thus the VLSI implementation is modular, regular, expandable to various problem sizes, and has a simple systolic flow of data and control. To evaluate the FFT architecture, VLSI area-time complexity concepts are used, but are here adapted to a complex floating-point number system rather than the usual integer-ring representation. We show by our construction that the Thompson area-time optimal bound for the VLSI computation of an N-point FFT, AT^(2α) = Ω[(N log N)^(1+α)], can be attained with an alternative number representation, and hence the theoretical bound is tight regardless of the number-system representation.
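The radix-2 DIF recursion underlying the butterfly mesh can be sketched in software as follows (a generic textbook formulation of the algorithm, not the VLSI design itself):

```python
import numpy as np

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT. Each stage applies
    butterflies (a+b, (a-b)*w) across halves of each block; the
    outputs emerge in bit-reversed order and are permuted back."""
    x = np.asarray(x, dtype=complex).copy()
    N = x.size                             # must be a power of two
    span = N // 2
    while span >= 1:
        w = np.exp(-2j * np.pi * np.arange(span) / (2 * span))
        for start in range(0, N, 2 * span):
            a = x[start:start + span].copy()
            b = x[start + span:start + 2 * span]
            x[start:start + span] = a + b
            x[start + span:start + 2 * span] = (a - b) * w
        span //= 2
    bits = int(np.log2(N))
    idx = np.array([int(format(i, f'0{bits}b')[::-1], 2) for i in range(N)])
    return x[idx]                          # undo bit reversal
```

In the hardware version each stage maps to a rank of butterfly units, and the bit-reversal permutation is absorbed into the interconnection pattern rather than performed as a final pass.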
The next generation of image processing hardware employs modular, high performance processors and a "patch panel" video data bus. Using the VME bus for command and control signals, the user is free to arrange image processing modules in parallel and/or pipelined fashion to meet the demands of his algorithm with respect to performance of a desired task.
We present a design and its VLSI implementation of a radix-2 on-line algorithm for the basic function Y = AX + B in NMOS technology and discuss its area/time characteristics. The design uses internal pipelining to achieve a short step time of about three gate delays. The on-line delay is 5. The implementation is modular, using 150-transistor bit-slices. We also illustrate the use of the module in implementing a root solver for a polynomial equation.
An evaluation of alternatives for the implementation of a processor for the Singular Value Decomposition (SVD) is presented. Cost and performance of implementations with replication (which includes linear systolic arrays) and with pipelining are evaluated. The algorithm used is the parallel method proposed by Brent and Luk, adapted for fewer processors. It is shown that the most efficient implementation depends on the throughput required. For low throughput, the optimal architecture is replication (i.e., a linear array) of a processor with one pipelined arithmetic unit (AU). However, a single pipelined processor is more efficient at throughputs higher than such an array can achieve. The architecture devised is a multilevel pipelined system which exploits the local parallelism present in subcomputations of the algorithm. The realization complexity of this scheme is similar to that of the linear systolic array, and the scheme computes the decomposition of matrices of any size without hardware modifications. The design is based on a systematic methodology applicable to multi-instance algorithms implemented with only one type of operation unit, where instances are divided into groups and dependences exist between corresponding instances in those groups. This methodology yields the most efficient solutions for different hardware costs.
This paper examines signal processing applications of high-speed bit-serial arithmetic and control circuitry which has been designed and tested for performance in the 50 MHz to 100 MHz range. A dense layout technique allows as many as 70 28-bit serial multipliers to be placed on a single 1.25-micron CMOS chip. At 70 MHz, this provides a throughput of 144 million multiplications per second. Applications to Fourier transform processors, digital filters, and matrix multiplication are presented as examples. The bit-serial hardware is shown to have several advantages over bit-parallel alternatives.
In a simple imaging device, the sampling speeds are usually limited by adjacent pixel interference (API). The correction for this distortion can be treated similarly to techniques developed in the field of communication, because signals in both fields are shaped and distorted by their respective transfer functions, i.e., a serial stream of past data points is convolved together. In the field of communication, intersymbol interference (ISI) has been under extensive study in the past decade and, as a consequence, simple techniques to separate these convolved data points have been developed. In contrast, because image processing has been concerned with two-dimensional optical signal distortions, the image world has corrected neighboring-pixel interference through sophisticated methods, i.e., complicated computer algorithms and sophisticated circuits. However, when limited frequency response in the signal processing circuit causes neighboring pixels to convolve together, they can be deconvolved with an inexpensive and simple transversal filter network at speeds in excess of a 5 MHz pixel rate. Furthermore, this filtering technique can enhance the signal-to-noise ratio of a given system by bandwidth enhancement of a low-noise, low-frequency amplifier. This paper briefly explains the phenomenon of ISI and the technique used for its correction, then demonstrates that an analogous API exists in the world of image processing and that the same correction technique can be used.
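A minimal sketch of such a transversal (tapped-delay-line) equalizer, designed here by zero-forcing least squares; the two-tap channel model and tap count are illustrative, not taken from the paper:

```python
import numpy as np

def zf_equalizer(h, n_taps):
    """Zero-forcing transversal-filter taps for a channel with impulse
    response h: build the convolution (Toeplitz) matrix and solve, in
    the least-squares sense, for taps c with h * c ~= a delayed unit
    sample, i.e. the deconvolved (ISI/API-free) response."""
    L = n_taps + len(h) - 1
    H = np.zeros((L, n_taps))
    for j in range(n_taps):
        H[j:j + len(h), j] = h           # column j = h delayed by j
    d = np.zeros(L)
    d[L // 2] = 1.0                      # desired: delayed unit sample
    c, *_ = np.linalg.lstsq(H, d, rcond=None)
    return c
```

With h = [1, 0.4] modeling each pixel leaking into its neighbor, a nine-tap filter restores a near-ideal unit pulse; this is the operation the analog transversal filter network performs at pixel rate.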
The development of automatic target tracking systems has enabled more accurate determination of target position, velocity, acceleration, and other parameters needed for weapons guidance and target designation. With the advent of low-cost, high-speed digital computers and image processing hardware, it has become more and more feasible to incorporate digital image processing techniques in target tracking systems. When computing target position (aim point) from discrete images, several problems can arise. A major problem is caused by noise. In target tracking, noise originates from two sources. The first source is system and sensor noise. This is usually modeled by additive white Gaussian noise. Another type of noise is caused by clutter near the target. Clutter objects can make it more difficult for the tracker to separate the target from its surrounding background. The result is that target pixels can be classified as background pixels and vice versa. This, in turn, causes errors in computing the target aim point. An investigation of the effects of system and sensor noise on target tracking is presented. Two statistical models are derived, and simulation results are presented showing the accuracy of the models. The results obtained are applicable to the clutter noise case when clutter causes pixel classification errors which are random in nature.
In thermographic measurements, quantitative surface temperature evaluation is often uncertain. The main reason is the lack of available reference points in transient conditions. Reflective markers were used for automatic marker recognition and pixel-coordinate computation. An algorithm selects marker icons to match marker references where particular luminance conditions are satisfied. Automatic marker recognition allows luminance compensation and temperature calibration of recorded infrared images. A biomedical application is presented: the dynamic behaviour of surface temperature distributions is investigated in order to study the performance of two different pumping systems for extracorporeal circulation. Sequences of images are compared and results are discussed. Finally, the algorithm makes it possible to monitor the experimental environment and to signal the presence of unusual experimental conditions.
A fault tolerant systolic processor system is proposed for computing the singular value decomposition of an n x n matrix. This approach uses only orthogonal interconnections and simple multiply-and-accumulate processors in the array. The fault tolerant properties are achieved through a composite of simple low-overhead structures. The square root computations and all fault tolerance computations are performed in one highly pipelined boundary processor. The architecture requires O(n) processors and O(n^2 log n) time.
The performance of an associative memory based on the Hopfield Model of a neural network is data dependent. When programmed memories are too similar (a small Hamming distance between memories) the associative memory system is easily confused; settling either to incorrect or in some cases, undefined states. This paper describes a series of computer simulations performed on a 100-node Hopfield network. The programs were written in the APL language, which is highly efficient for this type of system. The simulations examined the sources of confusion and led to a preprocessing approach which substantially reduces the confusion. The simulations were also extended in the direction of coupling several small neural networks to form one integrated low-confusion associative memory. The coupling of the neural subnetworks was through a voting scheme wherein each node of a subnetwork consulted the analogous node of the other subnetworks; the decision to change state or remain the same is based on majority rule. The performance of these two associative memory systems is detailed and compared to a conventional Hopfield system.
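A minimal sketch of the Hopfield associative memory being simulated (Hebbian outer-product weights, synchronous threshold updates); written here in Python rather than the paper's APL, with illustrative sizes:

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian (outer-product) weight matrix for +/-1 patterns,
    with self-connections zeroed."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def recall(W, x, steps=20):
    """Synchronous threshold updates until a fixed point (or limit)."""
    x = np.asarray(x, dtype=float)
    for _ in range(steps):
        y = np.where(W @ x >= 0, 1.0, -1.0)
        if np.array_equal(y, x):
            break
        x = y
    return x
```

When the stored patterns are nearly orthogonal (large mutual Hamming distance), a corrupted probe settles back to the stored memory; the confusion the abstract describes sets in as the patterns grow similar or their number approaches the network's capacity.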
The inherent parallelism, great processing speed, and massive interconnectability of optical systems have made the latter natural candidates for realizing various neural net models of associative memory, feature extraction, novelty filtering, and problem optimization. Moreover, through the use of nonlinear thresholding and feedback techniques, such systems (hereafter referred to as neuro-optic processors or NOPs) appear to achieve the degree of operational accuracy long sought after in analog optical processor systems. Key elements of a typical NOP architecture include an optical interconnect network, an array of nonlinear optical processing elements, and a suitable means for providing optical feedback. In this paper we address ourselves to the optical interconnect network. Specifically, we present practical design and information-storage-capacity analyses of the covariance matrix operator, R = Σ_{m=1..M} |v_m><v_m|, which characterizes this interconnect network. Here |v_m> is a Hilbert-space object (state) vector in discrete or continuous 2-D configuration space (in Dirac notation), and M represents the extent of NOP exposure to a particular object domain (within the context of pattern recognition, for example, M is the number of elements in a training pattern set). Central to our analyses is a discussion of the holographic technique which may be used to realize R in hardware form.
Following the neocognitron architecture described by Fukushima, an Artificial Neural System (ANS) has been programmed in Fortran and run on an IBM PC AT. Our independent experience with this ANS confirms the findings of Fukushima for the neocognitron architecture. Specifically we exercised both the learning and recognition modes of the ANS. In the learning mode, alphanumeric characters are learned and distinguished without instruction or outside correction of errors. In the recognition mode, alphanumeric characters are recognized with tolerance to position, scale and geometric distortion. We describe the neocognitron architecture and explain the basis of its operation for both the learning and recognition modes.
This manuscript is the second paper in a series describing alternative combinatorial-logic-based optical computing architectures. In the first paper, titled "Combinatorial Logic Based Optical Computing," the case for combinatorial logic is made through the coupled use of extensive optical interconnects with the natural "and-or-invert" capability of almost every optical system.1 This is followed by a simple example of an optical "text" processor, a word and phrase comparator which can be used for massive text look-up and search. This demonstrates the ability to design a wide range of digital optical architectures capable of performing numerous, if not myriad, sets of digital functions other than mathematical ones. The original paper goes on to describe a digital optical full adder enhanced by the full global broadcast capability of optics. Finally, a 2 x 2 bit systolic multiplier is described for matrix processing in which the limitations of the DMAC algorithm are finally overcome.2-5 The output is thus a full 4-bit binary-weighted result.
A previously proposed optical crossbar interconnected signal processor is reviewed for application to speech recognition. A traditional recognition approach is described that uses linear prediction, dynamic time warping, and parsing. A wait-and-see rule-based parser is used. The major parts of the system are implemented on the optical processor, showing that high performance should be achievable for large vocabularies. Further, hidden Markov models for exploiting temporal relations in establishing word and sentence hypotheses also appear to be efficiently implementable on the processor. Methods for recognizing phonemes in 2-D spectrograms are also expected to lend themselves to optical processing.
A coherent optical method and concept are presented for the generation of an intense, very precisely spaced 2-D array of laser beams which maintain their relative position over significant distances. The array is achieved through the use of a recently invented quadriprism.1 A signal pattern encoded on a 128x128 array has been precisely transported over several meters. An attractive feature of the system is the low optical energy in the array interstitial space. The interposing of grids or arrays of rectangular lattice electrodes yields very low diffraction losses, thereby making the beam array ideal, e.g., in pipeline processors. Some conceptual applications are discussed.
A new semiconductor device -- the gate-controlled photodiode (GCPD) -- is presented here. Its photocurrent depends on the product of the light intensity and the gate voltage. Its potential applications in optical matrix multiplication are discussed.
The use of acousto-optic and electro-optic signal processing for aiding the testing of high-speed and/or high-density VLSI circuits is introduced. Various VLSI testing scenarios that involve comparison of the chip output and the expected reference output are discussed. Two acousto-optic architectures are discussed that could be useful for recording and comparing high-speed digital data. Experimental proof-of-principle results are presented along with a prototype processor module. The acousto-optic and electro-optic implementation of a third architecture that is useful for comparing and compressing high-speed digital data is also discussed, along with initial experimental results.
Modeling (continuous and discrete) and numerical evaluation of the models (digital) have played an important role in the development of optical processors and associated devices. Discrete Fourier transform based models are attractive for various acousto-optic information processor architectures. DFT-based models for incoherent and coherent AO spectrum analyzers are presented. These models are then extended to systolic AO linear algebra processors, especially the frequency-multiplexed modulator/deflector type, which is very similar to an AO spectrum analyzer. The models are general and allow for prediction of system response to arbitrary input and prediction of error effects. In addition, the models may be extended to include architectural modifications, such as feedback paths or various forms of two-dimensional processing. Algorithms for fast digital evaluation of the models are presented in the Appendix, so that accurate calculation is performed with minimal computer run times.
Optical fibers are excellent communication media with broad bandwidth, low loss, light weight, low crosstalk, and immunity to electromagnetic interference. With high-speed lasers and photodetectors, optical fibers are also attractive for processing broadband signals. Many noncoherent optical fiber signal processing devices reported in the literature have been constructed to perform various functions such as frequency filtering, high-speed pulse generation, encoding, and decoding. The basic components of an optical fiber signal processing device include light sources, delay lines, attenuators, directional couplers, and photodetectors. Recently, lattice optical fiber structures, which consist of two-fiber directional couplers and delay lines of various lengths, have been investigated intensively. These structures provide such desirable features as a complete mathematical formulation, straightforward implementation, modularity, and convenience for filter synthesis. However, due to the positive-system property, i.e., the non-negative coefficients of noncoherent optical fiber signal processing devices, traditional signal processing design methods and procedures cannot be applied. For example, in filter design, the passband bandwidth, transition-band roll-off, and sidelobe response cannot be completely controlled without complicating the device structure. In this paper, we investigate the possibility of using optical fiber to design filters with traditional properties such as maximally flat or equal-ripple responses. We then report a new approach, which employs window functions and uses two-fiber as well as multi-fiber directional couplers, to synthesize filters with desired responses. The mathematical derivation, synthesis procedures, and frequency responses of various filters will also be presented.
A Space Integrating (SI) Optical Linear Algebra Processor (OLAP) employing space and frequency multiplexing, new partitioning and data flow, and achieving high-accuracy performance with a non-base-2 number system is described. Laboratory data on the performance of this system and on the solution of parabolic Partial Differential Equations (PDEs) are provided. A multi-processor OLAP system is also described for the first time. Its use in the solution of the multiple banded matrices that frequently arise is then discussed. The utility and flexibility of this processor compared to digital systolic architectures should be apparent.
An optical processor based on a photorefractive bismuth silicon oxide (BSO) crystal is proposed, which can perform four-product correlation, using all three dimensions of the crystal. The position of the correlation spot inside the crystal bears a one-to-one correspondence with the correlation parameters. A hexagonal crystal shape is designed for this application. Profitable variations on the basic experimental design are discussed, as well as the complicating effects of optical activity in BSO.
A three dimensional architecture is described using optical sources and detectors to interconnect integrated circuit layers. By stacking circuit planes and communicating through optical vias, substantial savings are realized in volume and cost along with increases in density and speed. Highly parallel architectures and efficient hardware for array processing are now possible using technologies that are rapidly becoming practical and available.
Threshold logic element designs in circulating packet form are presented for the implementation of addition and subtraction using modified signed-digit (MSD) arithmetic. This arithmetic is attractive for digital optical computing due to its inherent parallelism and pipelining characteristics, which capitalize on natural strengths of optics. To illustrate application of these concepts, a design for CORDIC rotation modules to accomplish the complex Givens rotations required for systolic-array QU matrix factorization is presented. This design accomplishes QU factorization using only threshold logic elements and bit-shift operations in a systolic configuration. Although implementable in principle by either electronic or optical means, the design is amenable to optical implementation because it involves high levels of parallelism and interconnection.
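The CORDIC rotation primitive on which such Givens-rotation modules are built can be sketched as follows (a floating-point emulation of the shift-and-add iterations; the iteration count and names are ours):

```python
import math

def cordic_rotate(x, y, theta, n=32):
    """CORDIC circular rotation of (x, y) by theta. Each iteration uses
    only additions and scalings by 2^-i (shifts, in hardware); the final
    multiply by K corrects the accumulated CORDIC gain. Valid for
    |theta| within the CORDIC convergence range (~1.74 rad)."""
    K = 1.0
    for i in range(n):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))  # gain correction
    z = theta
    for i in range(n):
        d = 1.0 if z >= 0 else -1.0                  # steer z toward 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return x * K, y * K
```

A Givens rotation that annihilates a matrix entry is obtained by running the same iteration in "vectoring" mode to find the angle, then applying that angle to the remaining row pairs, which is the role of the rotation modules in the systolic factorization array.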
A microcomputer-based optical linear transformation processing system is proposed. The technique utilizes a systolic array processing method to perform various types of linear transformations, such as the discrete Fourier transformation, discrete Hilbert transformation, discrete chirp-Z transformation, and many others. By partially parallel addressing of two magneto-optic spatial light modulators (MOSLMs), this proposed system would offer the high-speed, parallel-processing capability of optics and the programmability of a microcomputer.
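The systolic viewpoint rests on expressing each transform as a matrix-vector product; for the DFT, for example (a generic identity, not the proposed MOSLM datapath):

```python
import numpy as np

def dft_matrix(N):
    """The N-point DFT as an explicit matrix, so the transform reduces
    to the matrix-vector product y = dft_matrix(N) @ x that a systolic
    array computes; other transforms swap in a different matrix."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N)
```

Applying `dft_matrix(N)` to a signal vector reproduces the FFT result, which is why a single programmable matrix-vector engine covers the whole family of linear transformations the abstract lists.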