In this paper some of the techniques that have been been employed in the design of a set of high performance (20- 40 MHz) DSP standard products 1,2 are discussed. To support real-time operation, each device has an architecture that is dedicated to the function being performed. The chip set includes video line delays, a 64-tap 12-bit rank value filter, a 1024-tap binary template matcher, a 64-tap 8-bit FIR filter, a 9-tap 60 MHz FIR filter, a histogram and Hough transform processor, an FFT chip set and an object contour tracer.
The amount of data generated by computed tomography (CT) scanners is enormous, making the image reconstruction operation slow, especially for 3-D and limited-data scans requiring iterative algorithms. The inverse Radon transform, commonly used for CT image reconstructions from projections, and the forward Radon transform are computationally burdensome for single-processor computer architectures. Fortunately, the forward Radon transform and the back projection operation (involved in the inverse Radon transform) are easily calculated using a parallel pipelined processor array. Using this array the processing time for the Radon transform and the back projection can be reduced dramatically. This paper describes a unified, pipelined architecture for an integrated circuit that computes both the forward Radon transform and the back projection operation at a 10 MHz data rate in a pipelined processor array. The trade-offs between computational complexity and reconstruction error of different interpolation schemes are presented along with an evaluation of the architecture's noise characteristics due to finite word lengths. The fully pipelined architecture is designed to reconstruct 1024 pixel by 1024 pixel images using up to 1024 projections over 180 degrees. The chip contains three pipelined data-paths, each five stages long, and uses a single multiplier.
The use of picture coding in real applications will be limited by the availability of suitable VLSI products for the implementation of such systems. This paper looks at some of the progress to date in developing VLSI products for image coding followed by some applications for these products outside of image coding for communications and storage.
Two Application Specific Integrated Circuits (ASIC) have been developed for hierarchical representation and compression of medical images for Picture Archiving & Communication System (PACS) environments. Hierarchical image representation enables progressive transmission of images and provides an economic means of rapidly surveying several images simultaneously. Furthermore, it facilitates the adaptation of high resolution images to displays with lower resolution. Compression enables efficient use of storage media and transmission channel capacities. Together the chip set implements a loss-less transform coding system.The first device implements the S-Transform algorithm (a 2x2 pixel case of the Hadamard Transform) which creates a hierarchical representation of an image. The second device implements the Lempel-Ziv compression algorithm, an adaptive codec that maps variable length strings of characters to fixed length code words. Used in tandem the chip set achieves image processing rates of up to 7.5million pixels per second at a system clock rate of 10MHz, for 13 bit pixel digitization. Both chips are microprocessor programmable. Because of the limited modes of the algorithm the S-Transform chip implementation focused on efficient utilization of silicon area and high processing rate. On the other hand because of its potential use outside of imaging applications, the Lempel-Ziv codec focused on versatility and speed. We describe the architecture and features of each chip and the characteristics of a system designed around these devices.
Computer vision applications are characterised by the need to provide a rapid response to information derived from a high-speed continuous data stream. For example, real-time computer vision using noninterlaced (60 Hz) raster-scanned image frames requires the completion of all processing tasks (including data input and output) within a frame-time of 16.7 msec, suggesting processing rates of 10-1000 GOPS (Giga Operations Per Second (e.g. iO 8-bit additions per second) and input data rates approaching 1 Gbytes/sec. It is the requirement for such high-performance, unattainable by sequential architectures, which can only be achieved with massively parallel processing.
Real time computer vision applications require extremely high processing speeds which are very challenging for architectures as well as VLSI technologies. In this paper we investigate two different VLSI and architectural approaches to the decision analysis part of a complex low level image segmentation architecture (LISA). The two different technological approaches are full custom application specific integrated circuits (ASIC) and programmable gate arrays. The two different architectures are a modification of a classical statistical classifier and a connectionist approach using a layered feed forward net. We will also briefly mention the feature extraction part of the architecture, which is essential to understand the complete architecture and is realized using both technologies. LISA as a whole is capable of performing real time (20 M pixels/sec) grey level image segmentation based on grey level and textural properties of the objects. The architectures and technologies will be compared in terms of development time, design methodology, and finally experimental results will be shown.
This paper presents the architectural features and imaging applications of the Orthogonal MultiProcessor (OMP) system, which is under construction at the University of Southern California with research funding from NSF and assistance from several industrial partners. The prototype OMP is being built with 16 Intel i860 RISC microprocessors and 256 parallel memory modules using custom-designed spanning buses, which are 2-D interleaved and orthogonally accessed without conflicts. The 16-processor OMP prototype is targeted to achieve 430 MIPS and 600 Mflops, which have been verified by simulation experiments based on the design parameters used. The prototype OMP machine will be initially applied for image processing, computer vision, and neural network simulation applications. We summarize important vision and imaging algorithms that can be restructured with neural network models. These algorithms can efficiently run on the OMP hardware with linear speedup. The ultimate goal is to develop a high-performance Visual Computer (Viscom) for integrated low- and high-level image processing and vision tasks.
Parallel image analysis systems are shown by the Abingdon Cross benchmark to excel over all other architectures both in terms of speed and cost performance. The Abingdon Cross benchmark is task specific so that the performance of the system under test can be optimized and adjusted to take full advantage of the system's capabilities. The authors have devised a new set of benchmarks for parallel image analysis systems consisting of individual tests of the following operations: (1) Point process, (2) integer convolution, (3) Fourier transform, (4) Boolean algebra, (5) histograming, (6) maximum ranking, (7) median ranking, (8) erosion! dilation, (9) memory to disk transfer, and (10) memory to display transfer.
The Warwick Pyramid Machine (WPM) is an M-SIMD (Multiple-Single Instruction Multiple Data) heterogeneous pyramid architecture for image understanding. Details of the implementation are given. The properties and performance of the architecture are discussed. A generic image analysis task is the detection and identification of compact, convex, blob-like objects. The detection of such blobs is illustrated in detail with a modification of the circle Hough transform. This is shown to suit the global SIMD nature of the architecture. The subsequent stage of segmentation demonstrates the local processing capabilities of the M-SIMD architecture. The image analysis examples reported use forward-looking infrared images of vehicles, and electron micrographs of virus particles. In both cases the aim is to detect candidate regions of the image for further detailed analysis.
In previous work, we have discussed the issue of real-time control, which is important for the effective application of massively parallel architectures in real-time, or near real-time, signal processing. We first briefly review our work on how to incorporate adaptive programming decisions in systems accommodating massive data parallelism without losing machine efficiency. This involves re-examination of functions to be performed by the software versus those performed by the hardware. Our discussion is based on a Multiple Instruction Multiple Data (MIMD) controller that we have constructed for a Geometric Single Instruction Single Data (GSIMD) array. We then address two more issues facing a GSIMD architecture for its efficient use, especially in vision applications. These are: how to treat the overlap in small neighborhood operations when a large image array is segmented for processing on a smaller GSIMD array, and how to transmit small amounts of information between the array processor and external machines for adaptive signal processing. We describe the architecture of our hardware solutions.
In order to deliver the computational power required by real-time manipulation and display of multidimensional objects, we present a massively parallel octree architecture, based on a new Interconnection Network, the Partitionable Spanning Multibus Hypercube (PSMH). Its goal is, to use one Processing Element per obel (object element), as opposed to one Processing Element per voxel (volume element). The underlying idea of the PSMH, is to take advantage of the data hierarchical ordering to reduce the computational cost. As a basic tool, we first derive a routing algorithm for the case of an object shift. Its running time is of order O(max(n3, m)), for an 8n PSMH, where m is the message length in bits. As we do not consider voxels but obels, we design a compaction algorithm, which meets the routing requirements. We get a compression ratio of O(2n). This is followed by a parallel neighbor finding technique, to account for the compaction in the routing operations.
The problem of image estimation is posed in a Maximum A Posteriori (MAP) framework. Gibbs distributions are used for the prior and degradation models. Finding the MAP estimate then reduces to obtaining the global minimum of a non-convex energy function defined over the intensity and line processes. The latter are used to model discontinuities like edges present in any scene. When several constraints on the interactions between the line processes are removed, deterministic algorithms have been used to find "good" suboptimal solutions. These deterministic algorithms perform relaxation on the intensities alone. We have added the missing constraints on the line process interactions. Deterministic algorithms have been suggested which now perform relaxation on the intensities and the line processes. In the absence of the interaction constraints, our algorithms are shown to be equivalent to the previously suggested algorithms. This has been achieved by invoking the adiabatic approximation. The adiabatic approximation eliminates the dynamics of "fast" relaxing variables which in our case are the line processes. In demonstrating this equivalence, we describe the utility of two new processes; the gradient (GRAD) and gradient magnitude (GMAG) processes. The line process can be obtained through a monotonic transformation of the GMAG process.
A parallel approach to contour extraction and coding on an Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM) is presented and analyzed. The algorithm is intended for binary images. The labeled contours can be represented by lists of coordinates, and/or chain codes, and/or any other user designed codes. Using O(n2/log n) processors, the algorithm runs in O(logn) time, where n by n is the size of the processed binary image.
Decomposition into level sets refers to assigning a code with respect to intensity or elevation while labeling refers to assigning a code with respect to disconnected regions. We present a sequence of parallel algorithms for these two processes. The process of labeling includes re-assign labels into a natural sequence and compare different labeling algorithm. We discuss the difference between edge-based and region-based labeling. The speed improvements in this labeling scheme come from the collective efficiency of different techniques. We have implemented these algorithms on an in-house built Geometric Single Instruction Multiple Data (GSIMD) parallel machine with global buses and a Multiple Instruction Multiple Data (MIMD) controller. This allows real time image interpretation on live data at a rate that is much higher than video rate. The performance figures will be shown.
This paper presents an analytical method to measure the performance of a proposed parallel processing computer architecture for image processing applications. The main idea behind this architecture is to carry out image processing work in a highly parallel manner so that the response time is shorter. The proposed architecture keeps both the processor and the memory system as busy as possible in order to obtain faster response time and proper utilization of the hardware compared to a conventional parallel processing architecture. Our proposed architecture consists of an array of processing elements(PEs) , a system control unit(SCU), interconnection network and memory modules. Each PE contains two central processing units(CPU), one is responsible for the execution of all non-memory operations and the other is responsible for all memory operations. The overall response time of a job is faster because we divide the actual execution and the memory operation into two separate entities and carry them out concurrently. We develop an analytical method to evaluate the performance of the proposed architecture in terms of processor utilization and number of busy memory modules. We also present some image processing algorithms suitable for the proposed architecture and analyze their performance compared to the conventional system. The results we obtained indicate that our architecture provides better processor utilization, response time and memory usages compared to a conventional parallel architecture.
This paper discusses three-dimensional mathematical morphology using FCC (Face Centered Cubic) tessellation for executing both augmentation (dilation) and reduction (erosion). By treating gray level images as three-dimensional surfaces and using this tessellation with its 12-element neighborhood, the author has designed an architecture having flash-convolver speed for use in gray level image processing. The first step in this novel approach is to convert the gray level image into a binary surface by column encoding. After conversion the gray level image consists of volume elements (voxels) each of which has a binary value of either one or zero. The FCC convolution kernel is then ud to address each voxel and modify its value according a lookup table consisting of 213 = 8192 entries. This paper will describe the algorithms which can be executed using this methodology.
A multi-window video architecture is a unique coarse-grained multiprocessor implementation used for image processing. Such an architecture consists of many local rectangular regions (windows) that are processed in parallel by individual processing units. The architecture provides many windows, whose position, size, shape and sampling rate are individually controllable. Each window captures image data into its own local memory, so image memory contention is eliminated. The motivation behind multi-windowing stems from the fact that there is often an excess of information in an image; we usually are only interested in a few objects that require further processing. By performing the processing in parallel only within windows, processing speeds are greatly increased. A key problem in multi-windowing is how to automatically assign windows to the relevant objects within an image. In order to address this problem, we introduce a strategy that models itself closely to the human attention ability, therefore we refer to it as an attentive sensing strategy. The attentive sensing strategy is carried out in two stages. The first 'pre-attentive' stage consists of several parallel processing units which rapidly extract salient information in parallel across the entire image. The second stage, termed 'attentive sensing', consists of a distribution of focused sensory processing on only the salient objects in the image. Information gathered by the pre-attentive stage guides the processing of the attentive stage, resulting in the elimination of irrelevant information. We carry out this sensing strategy on a multi-windowing vision system designed and built for the inspection of integrated circuit wafers.
A class of orthogonal-access parallel organizations is studied for applications in image and vision analysis. These architectures consist of a massive memory and a reduced number of processors which access the shared memory. The memory can be envisaged as an array of memory modules in the k-dimensional space, with each row of modules along a certain dimension connected to one bus. Each processor has access to one bus along each dimension. It is shown that these organizations are communication-efficient and can provide processor time optimal solutions to a wide class of image and vision problems. In the two-dimensional case, the basic organization has ii processors and an n x n memory array which can hold an n x n image, and it provides 0(n) time solution to several image computations including: histograming, histogram equalization, computing connected components, convexity problems, and computing distances. Such problems also take 0(n) time on a two-dimensional mesh with n2 processors. For the general k-dimensional case, a class of orthogonal data movement operations can be implemented on such organizations to yield processor-time optimal image and vision algorithms.
This paper presents a simple and effective method for the concurrent manipulation of linearly ordered data structures on hypercube systems, and extends it to cater to multidimensional images. The method is based on the existence of a binomial search tree rooted at any arbitrary processor node of the hypercube such that (i) every edge of the tree corresponds to a direct link between a pair of hypercube nodes, and (ii) it spans any arbitrary sequence of n consecutive nodes as specified by a gray code ordering, using a fan-out of at most [log2n] and a depth of ([log2n] + 1) . The search trees spanning different processor lists are vertex disjoint and are formed dynamically and concurrently, They can be specified using information local to each node. Thus, they can be used for performing operations such as broadcast and merge simultaneously on image components with non-uniform sizes. The concurrent search reduces the complexity of several low and intermediate-level image processing algorithms to depend on the size of the largest image segment rather than the size of the entire image.
Using the Linda system, a parallel matrix multiplication program was written. This program uses N2 + 1 ICAP processors to compute the product of two N x N matrices. Due to the computational simplicity of matrix multiplication, a single ICAP processor running a sequential matrix multiplication program was faster than the Linda program. The computational complexity for which the Linda system becomes faster than a single ICAP processor was found. This was done by assuming that N2 computations still needed to be done, each requiring the row of one matrix and the column of another. Once a processor acquires these two arrays, an operation was simulated by running a delay loop. By increasing the size of the delay loop, more complex operations could be "simulated."
In this paper, an efficient parallel structure using the multirate techniques is introduced for the analysis of images by the methods of moments. The basic idea of the proposed method is to pass the original image through a bank of bandpass filters which split the image into a set of subband signals. These are lowpass translated by down-sampling resulting in a set of sub-images of lower dimensions, which are then analyzed separately using the method of moments. In image analysis using the method of moments, the major time-consuming task is the computation of the moments. Moreover, it is hard to reconstruct the detail of images from the moments since only moments of higher orders carry the fine detail of an image and they are vulnerable to white noise, such as the quantization noise. With the proposed technique, the computation time is reduced due to the parallel structure, the computational complexity is reduced by using only the dominant subimage in most applications and the fine detail of the image if necessary can be provided by the moments of upper-band subimages.
Implementation issues of neural network formulations onto an available computer architectures are unique. In conventional signal and image processing, for example, many of the frequently performed mathematical operations, like convolutions and transformations, are matrix-vector operations. Computational requirements of these algorithms are met by parallel hardware architectures tailored for efficient handling of vector data. However, many neural network systems require more than just making the computation parallel. Implementation of neural network architectures, such as those exemplified by Grossberg's Boundary Contour System, involve the solution of hundreds of coupled ordinary non-linear differential equations. A neurocomputer whose basic computational unit performs integration, rather the usual arithmetic operations, would be ideal for these systems. The training of the back propagation network, similarly, can be expressed as a problem of solving a system of 8ttff, coupled ordinary differential equations for the weights which are treated as continuous functions of time. The ability to efficiently solve differential equations on digital and parallel computers is therefore quite important for the implementation of artificial neural networks. The central idea of a mapping, described in this paper, involves the replacement of differentiation operators and functions, in the given equation, respectively with the so-called differentiation matrices and function matrices. With the help of a so-called projection matrix, a differential equation is transformed into a rectangular vector-matrix equation which can then be solved on a systolic processor. The algorithm is computationally efficient and enables one to numerically compute, separately, the general solution for the homogeneous part and the particular solution for any specified forcing function.
The theory of Artificial Neural Networks (ANN's) shows that ANN's can perform useful image recognition functions. Simulations on uniprocessor sequential machines, however, destroy the parallelism inherent in ANN models and this results in a significant loss of speed. Simulations on parallel machines are therefore essential to fully exploit the advantages of ANN's. We show how to simulate ANN's on an SIMD architecture, the Reduced Mesh of Trees (RMOT). The architecture has p PE's and n2 memory arranged in a p x p array of modules (p is a constant less than or equal to n). This massive memory is used to store connection weights. A fully connected, single layer neural network with n neurons can be mapped easily onto the architecture. An update in this case requires O(n2/p) time steps. A sparse network can also be simulated efficiently on the architecture. The proposed architecture can also be used for the efficient simulation of multilayer networks with a Back Propagation learning scheme. The architecture can easily be implemented within the framework of existing hardware technology.
The Alopex process is a biologically-influenced computational paradigm that uses a stochastic procedure to find the global optimum of linear and nonlinear functions. Unlike other neural networks, it is not a connectionist network with dense interconnections, and may be implemented in digital VLSI since the processing at the neuronal processing element (PE) level is simple and all PEs can be operated in synchrony. The Alopex algorithm uses the Boltzmann probability distribution function to generate probabilities of taking positive or negative steps away from its current position on the path toward the global optimal solution. Extensive simulations have verified the use of Alopex for solving a variety of problems. We present evaluation results of the effect of approximations, simplifications and level quantizations on the algorithm to yield simple, space-efficient and fast processing elements and other units. These modifications are necessary in order to implement Alopex in digital VLSI. Initial hardware modeling studies have involved a 4x4 array of neuronal PEs arranged for a Single-Instruction-Multiple-Data (SIMD) architecture. The model incorporates several digitally implementable solutions, such as a simpler Boltzmann probability density function (with a six-bit computation window), a binary annealing schedule, and a simpler cost function. Algorithmic simulations with an efficient image-processing system are described to support the feasibility and superiority of these modifications.
A hierarchical self-organizing neural network which can recognize and reconstruct the traces of the previously learned binary patterns is presented. The recognition and reconstruction properties of the network are invariant with respect to distortion, noise, translation, scaling and partial rotation of the original training patterns. If two or more patterns are presented simultaneously, the network pays attention to each pattern selectively. The network can incorporate new training patterns for recognition without loosing its previously learned information. We demonstrate the usefulness of the network in image recognition, reconstruction and segmentation with simulation results.
A modified Hopfield network model for parallel and distributed image restoration is presented. The proposed network does not require the zero autoconnection assumption which is one of the major drawbacks of the original Hopfield network. A new number representation scheme is also given for implementation of the network. As a tool for the convergence analysis of parallel and distributed algorithms, the convergence of descent algorithms theorem is presented. According to this theorem, the proposed image restoration algorithm with sequential updates and decoupled parallel updates is shown to converge. The sufficient condition for convergence of n-simultaneous updates is also given. If this condition is satisfied, the algorithm with totally asynchronous updates is guaranteed to converge. When the image restoration problem does not satisfy the convergence condition, a greedy algorithm is used which guarantees convergence at the expense of image quality. The proposed algorithm with sequential updates performs identically to the algorithm using the original Hopfield network, without checking the reduction of the energy function at the update of each neuron.
Neural networks have been used to mimic cognitive processes which take place in animal brains. The learning capability inherent in neural networks makes them suitable candidates for adaptive tasks such as recall and recognition. The synaptic reinforcements create a proper condition for adaptation, which results in memorization, formation of perception, and higher order information processing activities. In this research a model of a goal seeking neural network is studied and the operation of the network with regard to recall and recognition is analyzed. In these analyses recall is defined as retrieval of stored information where little or no matching is involved. On the other hand recognition is recall with matching; therefore it involves memorizing a piece of information with complete presentation. This research takes the generalized view of reinforcement in which all the signals are potential reinforcers. The neuronal response is considered to be the source of the reinforcement. This local approach to adaptation leads to the goal seeking nature of the neurons as network components. In the proposed model all the synaptic strengths are reinforced in parallel while the reinforcement among the layers is done in a distributed fashion and pipeline mode from the last layer inward. A model of complex neuron with varying threshold is developed to account for inhibitory and excitatory behavior of real neuron. A goal seeking model of a neural network is presented. This network is utilized to perform recall and recognition tasks. The performance of the model with regard to the assigned tasks is presented.
Map annotations (such as lettering) normally degrade automatic map-to-image optical correlation system performance which when removed, improves both the SNR and accuracy of such systems when generating conjugate data point pairs between the two optical formats. This paper describes improvements to the map-to-image correlation that results when an annotation removal preprocessor filter is applied first to map data. Specifically, the paper describes the impact of implementing a neural network annotation filter on the performance of map-to-image optical correlation systems. This new filter is capable of automatically identifying and then removing annotations before performing the optical correlation. As shown, this removal process impacts the correlation SNR and phase-only filtering systems. Greatest improvement in system performance is achieved when the annotation filter is applied first to map data before implementing a binary, phase-only filtering process.