The design and implementation of a parallel digital signal processing system on a chip containing 64 computational processors, 16 memory processors, and 16 I/O processors is described. The processors are interconnected by two levels of segmented buses. Each computational processor has a 16-bit data path and a control unit. The instruction set of the 16-bit processor supports computations on streams of data found in video, graphics, image processing, and digital communication applications. Two's complement arithmetic, saturation arithmetic, and packed instructions are supported. Higher data precision, such as 32-bit and 64-bit, can be achieved by cascading processors. The instruction memory of each computational processor holds sixteen 40-bit words. Data streaming through the processor is manipulated by the instructions in the instruction memory, and multiple operations can be performed in a single cycle in a processor. A handshake protocol is used for synchronization between the sending and receiving processors. Six programmable registers are available in each computational processor for storing data. Each memory processor has a 256 × 16 storage unit for storing additional data. The memory processors can be statically configured as a delay line, FIFO, lookup table, or random access memory; for each memory processor there are four FSMs supporting these four configurations. The I/O processors are provided for external communication: multiple parallel processing chips, digital output from sensors, and SRAM chips can be interconnected using them. The VLSI chip implementing the processors is organized as 16 clusters interconnected by a statically programmable hierarchical bus structure. The buses are segmented by programming the switches on the bus. Each cluster has six 16-bit data buses and four 2-bit control buses supporting communication between four computational processors, one memory processor, and one I/O processor.
In addition, adjacent processors can communicate using a bypass bus. The clusters are interconnected by sixteen 16-bit data buses and eight 2-bit control buses. Each cluster has 60 programmable switches to control the communication between the intracluster and intercluster buses. Each processor has 17 programmable switches to control the connections to the intracluster buses.
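The handshake protocol mentioned above can be illustrated with a small software sketch. The valid/ready structure below is our own minimal model of sender/receiver synchronization (the class and method names are ours, not the chip's): the sender blocks until the receiver is ready, publishes one word of the stream, and the receiver acknowledges before the next word may be sent.

```python
import threading

# A minimal valid/ready-style handshake between one sending and one
# receiving processor, sketched with threading primitives.  This is an
# illustrative model, not the chip's actual signaling.
class HandshakeChannel:
    def __init__(self):
        self._valid = threading.Event()
        self._ready = threading.Event()
        self._ready.set()          # receiver starts out ready
        self._data = None

    def send(self, word):
        self._ready.wait()         # block until the receiver is ready
        self._ready.clear()
        self._data = word
        self._valid.set()          # signal that the data is valid

    def recv(self):
        self._valid.wait()         # block until data is valid
        self._valid.clear()
        word = self._data
        self._ready.set()          # acknowledge: ready for the next word
        return word

# Stream three words from a sender thread to the main thread.
chan = HandshakeChannel()
out = []
sender = threading.Thread(target=lambda: [chan.send(w) for w in (1, 2, 3)])
sender.start()
for _ in range(3):
    out.append(chan.recv())
sender.join()
```

Because each transfer requires both events to toggle, neither side can run ahead of the other, which is the property the on-chip protocol provides between streaming processors.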
The theory and application of morphological associative memories, and morphological neural networks in general, are emerging areas of research in computer science. The concept of a morphological associative memory differs from a more conventional associative memory by the nonlinear functionality of the synaptic connection. By taking the maximum of sums instead of the sum of products, morphological network computation is inherently nonlinear. Hence, the morphological associative memory does not require any ad hoc methodology to interject a nonlinear state. In this paper, we introduce a very large scale integration analog circuit design that realizes the nonlinear functionality of the synaptic connection. We specifically describe the fundamental circuit needed to implement a basic additive maximum associative memory, and describe noise conditions under which this memory will perform flawlessly. As a potential application, we propose using the analog circuit for real-time operation on or near a focal plane array sensor.
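The maximum-of-sums computation described above can be made concrete with a short NumPy sketch of a Ritter-style morphological autoassociative memory (the function names are ours; the paper's contribution is the analog circuit, not this software model): storage takes a minimum of pairwise differences, and recall is the additive-maximum product.

```python
import numpy as np

def build_memory(X):
    # X: (patterns, n).  W[i, j] = min over stored patterns of (x_i - x_j),
    # the min-memory W_XX of morphological associative memory theory.
    diffs = X[:, :, None] - X[:, None, :]
    return diffs.min(axis=0)

def recall(W, x):
    # Additive-maximum product: x_hat[i] = max_j (W[i, j] + x[j]).
    # This max-of-sums replaces the sum-of-products of a linear memory.
    return (W + x[None, :]).max(axis=1)

# Store two patterns and recall them exactly.
X = np.array([[1.0, 2.0],
              [3.0, 1.0]])
W = build_memory(X)
```

For uncorrupted inputs this memory recalls every stored pattern perfectly, which is the baseline behavior the analog circuit must reproduce under the paper's noise conditions.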
This paper describes a PC-cluster system for real-time parallel video image processing. The PC-cluster consists of seven PCs connected by a very high speed network. The key issue in this system is the synchronization of distributed video data. A frame synchronization block is introduced to realize three kinds of synchronization: forward synchronization, barrier synchronization, and backward synchronization. Forward synchronization notifies processors of the timing to start processing. Barrier synchronization waits for all data that are processed at the same time. Backward synchronization cancels the processing and transfer of data that have become useless. Experimental results are also shown to confirm the performance of the PC-cluster.
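Of the three schemes, barrier synchronization is the easiest to sketch. Below, Python threads stand in for the networked PCs (the names and the toy per-node "work" are ours; the real system synchronizes over the network): every node must reach the barrier before any node may combine results for the current frame.

```python
import threading

# Barrier synchronization sketched with threads standing in for PCs:
# no node may merge frame results until all nodes have produced their
# partial data for that frame.
NUM_NODES = 4
barrier = threading.Barrier(NUM_NODES)
partial = [None] * NUM_NODES
results = []

def node(rank, frame_part):
    partial[rank] = sum(frame_part)  # this node's share of the frame's work
    barrier.wait()                   # wait for all data of this frame
    results.append(sum(partial))     # all partial results are now visible

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
threads = [threading.Thread(target=node, args=(r, parts[r]))
           for r in range(NUM_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Forward synchronization corresponds to releasing the threads in the first place, and backward synchronization to aborting work before the barrier when its result is known to be useless.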
Parallel processing of image analysis tasks is an essential method to speed up image processing and helps to exploit the full capacity of distributed systems. However, writing parallel code is a difficult and time-consuming process and often leads to an architecture-dependent program that has to be re-implemented when the hardware changes. It is therefore highly desirable to perform the parallelization automatically. To this end we have developed a special kind of thread concept for image analysis tasks. Threads derived from one subtask may share objects and run in the same context, yet follow separate threads of execution and work on different data in parallel. In this paper we describe the basics of our thread concept and show how it can be used as the basis for automatic task parallelization to speed up image processing. We further illustrate the design and implementation of an agent-based system that uses image analysis threads to generate and process parallel programs while taking the available hardware into account. Tests made with our system prototype show that the thread concept combined with the agent paradigm is suitable for speeding up image processing by automatic parallelization of image analysis tasks.
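The core idea, threads of one subtask sharing objects while working on different data, can be sketched in a few lines (the function names and the toy "analysis", foreground counting, are ours, not the paper's API):

```python
from concurrent.futures import ThreadPoolExecutor

# Threads derived from one subtask share the same image object but
# each follows its own path of execution over a disjoint set of rows.
def analyze_rows(image, rows):
    # Stand-in per-pixel analysis: count foreground (non-zero) pixels.
    return sum(1 for r in rows for v in image[r] if v > 0)

def parallel_analyze(image, n_threads=4):
    n = len(image)
    # Interleaved row sets: thread i takes rows i, i+n_threads, ...
    chunks = [range(i, n, n_threads) for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(lambda c: analyze_rows(image, c), chunks))
```

An automatic parallelizer as described in the paper would generate this kind of decomposition itself, choosing the number of threads to match the available hardware.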
Many vision tasks are very complex and computationally intensive, and real-time requirements further aggravate the situation. Such tasks usually involve both structured (low-level vision) and unstructured (high-level vision) computations. Parallel approaches offer hope in this context. Parallel approaches to vision tasks, and scheduling schemes for their implementation, receive special emphasis in this paper; architectural issues are also addressed. The aim is to design algorithms that can be implemented on low-cost heterogeneous networks running PVM. Issues connected with general-purpose architectures also receive attention. The proposed ideas are illustrated through a practical example (eye location from an image sequence). Next-generation multimedia environments are expected to routinely employ such high-performance computing platforms.
In this article, we present a parallel image processing system based on the concept of reactive agents. This means that, in our system, each agent has a very simple behavior which allows it to make a decision (detect an edge, a region, ...) according to its position in the image and to the information enclosed there. Our system is built on the oRis language, which makes it possible to describe the agents' behaviors very finely and simply. oRis is an interpreted and dynamic multiagent language. First of all, oRis is an object language, with classes grouping attributes and methods; the syntax is close to C++ and includes multiple inheritance. oRis is also an agent language: every object with a method `main()' becomes an agent. This method is cyclically executed by the system scheduler and corresponds to the agent's behavior. We also present an application built with oRis. This application detects concentric striae found on various natural `objects' (age rings of trees, fish otolith growth rings, striae of some minerals, ...). The stopping of the multiagent system is implemented through a technique borrowed from immunology: apoptosis.
In this paper, we present the design and implementation of a parallel image processing software library (the Parallel Image Processing Toolkit). The Toolkit not only supplies a rich set of image processing routines; it is designed principally as an extensible framework containing generalized parallel computational kernels to support image processing. Users can easily add their own image processing routines without knowledge or explicit use of the underlying data distribution mechanisms or parallel computing model. Shared memory and multi-level memory hierarchies are exploited to achieve high performance on each node, thereby minimizing overall parallel execution time. Multiple load balancing schemes have been implemented within the parallel framework that transparently distribute the computational load evenly in a distributed memory computing environment. Inside the Toolkit, a message-passing model of parallelism is designed around the Message Passing Interface standard. Experimental results are presented to demonstrate the parallel speedup obtained with the Parallel Image Processing Toolkit on a typical workstation cluster for some common image processing tasks.
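The data distribution such a toolkit hides from its users can be sketched simply. The function below is our own illustration of a static block distribution of image rows across workers, not the Toolkit's actual API:

```python
def row_blocks(n_rows, n_workers):
    # Split n_rows into n_workers contiguous blocks, giving earlier
    # workers one extra row when the division is uneven -- the usual
    # static block distribution a parallel image library performs
    # before scattering data with message passing.
    base, extra = divmod(n_rows, n_workers)
    blocks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks
```

A dynamic load-balancing scheme, as the Toolkit also provides, would instead hand out smaller blocks on demand so that faster nodes receive more work.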
The AIM (Adaptive Image Manager) is a client/server based system providing a computer vision and image processing specific protocol that interfaces to potentially numerous and varied computing platforms providing image processing and computer vision support. It provides a unified programming interface for the user despite the potential heterogeneity of the underlying hardware supporting the image processing. Computational platforms currently being studied include the Lockheed Martin PAL system, field-programmable gate arrays, and symmetric multiprocessors. The Open Distributed Processing Reference Model (ODP-RM) is an ISO standards effort (ISO/IEC 10746-1/2/3/4) to address the specification and implementation of distributed processing systems, providing an architecture for integrating distribution support, interworking, and portability. The ODP Reference Model is exploited in the design to support transparent distribution of image processing and computer vision processes to available computational devices. The AIM System's ODP Model is specified using Object-Z, a formal descriptive notation that is commonly employed for ODP specification. Use of Object-Z notation supports formal analysis of the system's properties, helping us verify that its design satisfies its goals. This paper presents an example image processing algorithm (Simple Image Statistic Thresholding) as a framework for understanding both the method of distribution and the benefits obtained from the use of our model description.
In this paper, we present a new global-search method for designing QMF (quadrature-mirror-filter) filter banks. We formulate the design problem as a nonlinear constrained optimization problem, using the reconstruction error as the objective and the other performance metrics as constraints. This formulation allows us to search for designs that improve over the best existing designs. Due to the nonlinear nature of the performance metrics, the design problem is a nonlinear constrained optimization problem with many local minima. We propose to solve this design problem using global-search methods based on Lagrangian formulations. After transforming the original constrained optimization problem into an unconstrained form using Lagrange multipliers, we apply a new global-search method to find good solutions. The method consists of a coarse-level global-search phase, a fine-level global-search phase, and a local search phase, and is suitable for parallel computation due to the minimal dependency between its key components. In our experiments, we show that our method finds better designs than existing global-search methods, including simulated annealing and genetic algorithms.
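The Lagrangian transformation at the heart of the method can be illustrated on a toy problem (this is our own minimal first-order sketch, minimizing f(x) = x² subject to 1 − x ≤ 0; it shows only the local-search ingredient, not the paper's coarse/fine global phases or the QMF metrics):

```python
def lagrangian_search(steps=5000, eta=0.05):
    # L(x, lam) = x**2 + lam * (1 - x): primal descent on x,
    # projected dual ascent on the multiplier lam >= 0.
    x, lam = 0.0, 0.0
    for _ in range(steps):
        x -= eta * (2 * x - lam)              # dL/dx = 2x - lam
        lam = max(0.0, lam + eta * (1 - x))   # dL/dlam = 1 - x
    return x, lam
```

The iterates converge to the saddle point x = 1, lam = 2, i.e. the constrained minimum with its multiplier; the paper's global phases exist precisely because, with many local minima, such local dynamics alone can stall at poor designs.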
We propose a perspective volume graphics rendering algorithm on SIMD mesh-connected computers and implement the algorithm on the Parallel Algebraic Logic computer. The algorithm is a parallel ray casting algorithm. It decomposes the 3D perspective projection into two transformations that can be implemented in SIMD fashion, solving the data redistribution problem caused by the irregular data access patterns of the perspective projection.
A parallel robust relaxation algorithm is proposed to improve the detection and correction of illegal disparities encountered in the automatic stereo analysis (ASA) algorithm. Outliers and noisy matches from correlation-based ASA matching are improved by relaxation labeling and robust statistical methods at each stage of the multiresolution coarse-to-fine analysis. A parallel version of the relaxation labeling algorithm has been implemented for the MasPar supercomputer. The performance scales nearly linearly with the number of processing elements and better than linearly with increasing workload. The algorithm is highly scalable both as the number of processors is increased for a fixed-size problem and as the size of the problem increases.
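Relaxation labeling, the core iteration being parallelized, can be sketched in its classic probabilistic form (a generic textbook-style update, our own illustration rather than the paper's exact scheme; compatibility coefficients lie in [-1, 1]): each pixel's label probabilities are repeatedly rescaled by the support they receive from neighboring labels, so an outlier disparity surrounded by consistent neighbors gets corrected.

```python
def relax(P, compat, iters=10):
    # P[i][l]: probability of label l at pixel i (1D pixel row here).
    # compat[l][m] in [-1, 1]: compatibility of label l with a
    # neighboring label m.  Each sweep rescales P by neighbor support.
    n, L = len(P), len(P[0])
    for _ in range(iters):
        new = []
        for i in range(n):
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
            q = [sum(compat[l][m] * P[j][m]
                     for j in nbrs for m in range(L)) / len(nbrs)
                 for l in range(L)]
            s = [P[i][l] * (1 + q[l]) for l in range(L)]
            z = sum(s)
            new.append([v / z for v in s])
        P = new
    return P
```

Because every pixel's update reads only its neighbors' previous probabilities, all pixels can be updated simultaneously, which is what makes the algorithm a natural fit for a SIMD machine like the MasPar.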
This paper addresses efficient parallel compression and classification for sets of similar images, such as those commonly generated in satellite imagery, medical imaging (CT and MR scans), or aerial surveillance. Our experiments show that the similarities within each class of images can be expressed more efficiently in the domain of image-compressing transforms. In particular, the paper shows that a single predictive compression model can be constructed for an entire class of similar images of the same nature, and then used for nearly optimal compression of any image in the class. Extracting the optimal class-compression model remains a computationally intensive process, which can be considerably accelerated on parallel computers. The paper demonstrates how a compression model for a database of similar images can be extracted in parallel, and how it can be used for parallel compression of such a database and for classification of new images into appropriate similarity classes. The results of the parallel similar-image analysis are demonstrated with MR and CT brain images obtained from the M.D. Anderson Cancer Center.
We present a parallel MPEG-2 video encoder on the Intel Paragon parallel computer. Given a video sequence or a set of sequences, the aim of the encoder is to achieve the maximum possible encoding rate. To achieve this aim, the parallel encoder performs combined scheduling of processors, I/O nodes, and disks, enabling the system to work in a highly balanced fashion by matching the encoding and I/O rates. An efficient data layout scheme for video frames is also proposed so that I/O can sustain the desired data transfer rates. Because only a small percentage of processors serve as I/O nodes, the utilization of the system is also high. More importantly, our encoder is scalable: an increase in the number of processors results in a proportional increase in the encoding rate. Given any machine configuration (that is, the number of compute processors, I/O processors, and disks), our proposed strategy can logically partition the system and match the I/O and encoding rates to reach the ideal encoding rate. The experimental results indicate about a two-fold gain in performance compared to previous studies. Our approach is useful for compressing a large video sequence or batches of sequences.
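The rate-matching partition can be illustrated with a toy calculation (our own sketch; the per-node rates below are illustrative numbers, not the Paragon's): given a fixed pool of nodes, choose the compute/I/O split that maximizes the sustained rate, which is the minimum of the aggregate encoding rate and the aggregate I/O rate.

```python
def partition(total_nodes, encode_fps_per_node, io_fps_per_node):
    # Try every split of the node pool into compute and I/O nodes and
    # keep the one maximizing min(compute rate, I/O rate) -- the
    # rate-matching idea behind the encoder's scheduling.
    best = (0, 0, 0.0)
    for io in range(1, total_nodes):
        compute = total_nodes - io
        rate = min(compute * encode_fps_per_node, io * io_fps_per_node)
        if rate > best[2]:
            best = (compute, io, rate)
    return best  # (compute nodes, I/O nodes, sustained frames/s)
```

With 10 nodes, 1 frame/s of encoding per compute node, and 4 frames/s of transfer per I/O node, the balanced split is 8 compute and 2 I/O nodes; dedicating more nodes to either side leaves the other side as the bottleneck.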
The ability to simplify an image whilst retaining such crucial information as shapes and geometric structures is of great importance for real-time image analysis applications. Here the technique of binary thresholding, which reduces image complexity, has generally been regarded as one of the most valuable methods, primarily owing to its ease of design and analysis. This paper surveys the state of developments in the field and describes a radically different approach to adaptive thresholding. The latter employs the analytical technique of histogram normalization to establish an optimal `contrast level' for the image under consideration. A suitable criterion is also developed to determine the applicability of the adaptive processing procedure. In terms of performance and computational complexity, the proposed algorithm compares favorably to the five established image thresholding methods selected for this study. Experimental results have shown that the new algorithm outperforms these methods on a number of important error measures, including a consistently low visual classification error. The simplicity of the algorithm's design also lends itself to efficient parallel implementation.
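The normalize-then-threshold pipeline can be sketched as follows (our own illustration: a linear histogram stretch stands in for the paper's optimal contrast level, and the stretched mean stands in for its threshold criterion):

```python
import numpy as np

def adaptive_threshold(img):
    # Normalize the histogram by linearly stretching intensities to
    # the full [0, 255] range, then binarize at the stretched mean.
    # Both choices are simplifying stand-ins for the paper's method.
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi > lo:
        img = (img - lo) * 255.0 / (hi - lo)
    return (img >= img.mean()).astype(np.uint8)
```

Because the stretch and the thresholding are independent per-pixel operations once the global statistics are known, the whole procedure parallelizes with only two cheap reductions (min/max and mean), consistent with the claim of efficient parallel implementation.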
A computationally efficient algorithm for computing openings by 1D flat structuring elements is proposed. The algorithm utilizes the run-length encoded image and allows implementation of the opening of a gray-scale image by a sequence of arbitrarily sized flat structuring elements. The new algorithm compares favorably to existing methods for recursive implementation of a sequence of openings, and its computation time decreases with the size of the structuring element.
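For reference, the operation being accelerated, a gray-scale opening by a flat 1D structuring element, is a sliding-window erosion (minimum) followed by a dilation (maximum). The direct O(n·k) version below is our own baseline sketch of what the opening computes; the paper's run-length algorithm produces the same result faster on run-length encoded data.

```python
def opening_1d(signal, k):
    # Opening by a flat structuring element of length k:
    # erosion = sliding minimum over each window of length k,
    # dilation = for each position, the max of the eroded values of
    # all windows that contain it.  Peaks narrower than k are removed;
    # wider plateaus pass through unchanged.
    n = len(signal)
    mins = [min(signal[j:j + k]) for j in range(n - k + 1)]
    return [max(mins[j] for j in range(max(0, i - k + 1),
                                       min(i, n - k) + 1))
            for i in range(n)]
```

The characteristic behavior, and the reason openings are used for size filtering, is visible directly: a plateau of width 2 is flattened by a length-3 element while a plateau of width 3 survives.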
Proc. SPIE 3452, Implementation of the morphological shared-weight neural network (MSNN) for target recognition on the Parallel Algebraic Logic (PAL) computer, 0000 (21 September 1998); doi: 10.1117/12.323468
The morphological shared-weight neural network (MSNN) is an effective approach to automatic target recognition. Implementation of the network in parallel is critical for real-time target recognition systems. Although there is significant parallelism inherent in the MSNN, it is a challenge to implement it on an SIMD parallel computer consisting of a large array of simple processing elements. This paper discusses issues related to detection accuracy and throughput in implementing the MSNN on the Parallel Algebraic Logic computer.
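The feature-extraction stage of an MSNN is built on the gray-scale hit-miss transform: an erosion by a "hit" structuring element minus a dilation by a "miss" element, with the two elements acting as the shared weights. The 1D sketch below is our own simplified illustration (both elements are applied over the same window, and the weights are fixed here rather than learned as in the MSNN):

```python
def hitmiss_1d(f, hit, miss):
    # Gray-scale hit-miss response at each window position:
    # erosion by `hit` (min of f - hit) minus dilation by `miss`
    # (max of f + miss).  High responses indicate windows matching
    # the hit shape while avoiding the miss shape.
    n, k = len(f), len(hit)
    out = []
    for i in range(n - k + 1):
        win = f[i:i + k]
        ero = min(w - h for w, h in zip(win, hit))
        dil = max(w + m for w, m in zip(win, miss))
        out.append(ero - dil)
    return out
```

Since the same min/max comparisons are applied at every window position, the transform maps naturally onto an SIMD array of simple processing elements, which is what makes the PAL implementation discussed in the paper feasible despite the simplicity of each element.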