We describe PRIME (parallel image media processing environment), a general-purpose environment for the development of parallel image processing and computer vision algorithms. 'General purpose' here means that the environment is designed to be used on a variety of multi-processor systems, ranging from tightly coupled to loosely coupled computers. The key point of the system is that it provides an architecture-independent programming environment for image processing and computer vision. We present an outline of PRIME, its implementation, and a preliminary performance evaluation.
A heterogeneous system for image processing was designed. The system consists of two layers, with an array of DSPs in the low-level layer and an array of PCs in the high-level layer, resembling a two-stage multi-layer architecture. In the preprocessing phase, image data is delivered to the low-level processors and operations are handled using the SPMD paradigm. The processed data is then delivered to the high-level processors, where feature extraction and recognition tasks are performed. The performance of this system is highly dependent on the communication mechanism used; the communication between the two heterogeneous processing layers is a critical design issue and is described in this paper. In our design, we have employed QuickRing, which provides a high transfer rate of 160 Mbytes/s and supports a multicast transmission mode suitable for broadcasting data.
Parallelization of image analysis tasks is key to processing huge image data in real time. To this end, suitable subtasks for parallel processing have to be extracted and mapped to components of a distributed system. Ideally, this task should be done by the processing system and not by the user, as automatic parallelization allows flexible resource management and reduces the time needed to develop image analysis programs. This paper describes a multi-agent-based system for planning and performing image analysis tasks within a distributed system. It illustrates a method for modeling image analysis tasks from the viewpoint of parallel processing and explains the special design requirements for parallelizing agents. Furthermore, we describe concepts for agent cooperation and for using each agent's ability to learn to allow long-term improvement of its planning and scheduling strategies. The presented image analysis system allows architecture-independent parallel processing of image analysis tasks with optimized resource management.
In this paper, it is shown that, through the use of model-integrated program synthesis (MIPS), parallel real-time implementations of image processing data flows can be synthesized from high-level graphical specifications. The complex details inherent to parallel and real-time software development become transparent to the programmer, enabling the cost-effective exploitation of parallel hardware for building more flexible and powerful real-time imaging systems. The model-integrated real-time image processing system (MIRTIS) is presented as an example. MIRTIS employs the multigraph architecture (MGA), a framework and set of tools for building MIPS systems, to generate parallel real-time image processing software which runs under the control of a parallel run-time kernel on a network of Texas Instruments TMS320C40 DSPs (C40s). The MIRTIS models contain graphical declarations of the image processing computations to be performed, the available hardware resources, and the timing constraints of the application. The MIRTIS model interpreter performs the parallelization, scaling, and mapping of the computations to the resources automatically, or determines that the timing constraints cannot be met with the available resources. MIRTIS is a clear example of how parallel real-time image processing systems can be built which are (1) cost-effectively programmable, (2) flexible, (3) scalable, and (4) built from commercial off-the-shelf (COTS) components.
This paper describes two different parallel computing approaches for image processing problems on a Pentium-based multiprocessor system. Such multiprocessor computers are often used as network servers. We demonstrate the utilization of one of these machines, equipped with four Intel Pentium processors, for a parallel image processing task. A parallel computation of motion vector fields based on correlation techniques is discussed to show the achievable acceleration. The computational results show that high efficiency can be reached, and even linear speedup is possible under certain conditions. Besides the correlation technique discussed, there are various other image processing problems that can easily be evaluated in parallel. Although massively parallel systems and special-purpose systems are much faster, off-line image processing can be accelerated by using these broadly available low-cost machines.
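As an illustration of the data-parallel structure described above, here is a minimal sketch (not the authors' code) of correlation-style motion estimation: each block of the current frame is matched against a search window in the previous frame using a sum-of-absolute-differences cost, and the block positions are divided among worker threads. The block size, search radius, and worker count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

B = 4        # block size in pixels (assumed)
SEARCH = 2   # search radius in pixels (assumed)

def get_block(frame, y, x):
    # flatten the B x B block whose top-left corner is (y, x)
    return [frame[y + j][x + i] for j in range(B) for i in range(B)]

def sad(a, b):
    # sum of absolute differences -- the correlation-style matching cost
    return sum(abs(p - q) for p, q in zip(a, b))

def best_vector(prev, cur, y, x):
    # exhaustive search over the window; returns the (dy, dx) with lowest SAD
    ref = get_block(cur, y, x)
    h, w = len(prev), len(prev[0])
    best = (10 ** 9, (0, 0))
    for dy in range(-SEARCH, SEARCH + 1):
        for dx in range(-SEARCH, SEARCH + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= h - B and 0 <= xx <= w - B:
                best = min(best, (sad(ref, get_block(prev, yy, xx)), (dy, dx)))
    return best[1]

def motion_field(prev, cur, workers=4):
    # data parallelism: each worker handles a share of the block positions
    positions = [(y, x)
                 for y in range(0, len(cur) - B + 1, B)
                 for x in range(0, len(cur[0]) - B + 1, B)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        vectors = list(pool.map(lambda p: best_vector(prev, cur, *p), positions))
    return dict(zip(positions, vectors))

# demo: a frame shifted right by one pixel yields vector (0, -1) for interior blocks
prev = [[16 * y + x for x in range(16)] for y in range(16)]
cur = [row[-1:] + row[:-1] for row in prev]
field = motion_field(prev, cur)
```

On real hardware the workers would be separate processors rather than threads; the partitioning of block positions is the same.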
This paper describes the use of an embedded multiprocessor to speed up the computer vision tasks of an autonomous mobile robot. A brief overview of the robot's vision system is given, identifying a signature segmentation algorithm as the most time-consuming task. A parallelization strategy for the segmentation algorithm is presented based on the exploitation of data parallelism. The parallel algorithm is implemented on a multi-DSP system and experimentally evaluated. Speed-up and efficiency figures are discussed.
We have constructed a prototype image processing board containing 384 processors in 8 VLSI chips. The goal of the prototype is to show how the fine-grain parallelism present in image processing applications can be exploited by using many simple processors interconnected in clever ways. Each processor has a 16-bit data path, a simple instruction set containing 12 instructions, a simple control unit, and a scan chain for loading data and program. Each VLSI chip, called PADDI-2, contains 48 processors. The programming model used for the processors is MIMD. Each processor has 8 words in its instruction memory. There are internal registers and queues in a processor for storing data and partial results. Data is assumed to enter the system as a stream and be processed by the processors. Each VLSI chip is connected to an external memory (64 K by 16). A hardware synchronization mechanism is used for communication between processors, memory, and the external environment. If a sender and receiver are within the same chip, communication can be done in one cycle over the hierarchical interconnect bus structure. Programming of the processors and the interconnections is done at compile time. The board is interfaced to a Sun SPARCstation using the SBus. Video input and output are supported by the board, and field buffers are used for buffering. Software tools for checking the board, running test programs at the assembly language level, and libraries for application development have been produced. Image processing applications are currently under development. The board is available for experimentation over the Internet. Further details are available from the project web page (http://infopad.eecs.berkeley.edu/spartan).
A computer systems architecture for processing medical images and other data coming over the Web is proposed. The architecture comprises a Java engine for communicating images over the Internet, storing data in local memory, and doing floating-point calculations, and a coprocessor, a MIMD parallel DSP, for doing the fine-grained operations found in video, graphics, and image processing applications. The local memory is shared between the Java engine and the parallel DSP. Data coming from the Web is stored in the local memory. This approach avoids the frequent movement of image data between a host processor's memory and an image processor's memory found in many image processing systems. A low-power and high-performance parallel DSP architecture containing many processors interconnected by a segmented hierarchical network has been developed. The instruction set of the 16-bit processor supports video, graphics, and image processing calculations. Two's complement arithmetic, saturation arithmetic, and packed instructions are supported. Higher data precision, such as 32-bit and 64-bit, can be achieved by cascading processors. A VLSI chip implementation of the architecture containing 64 processors organized in 16 clusters and interconnected by a statically programmable hierarchical bus is in progress. The buses are segmentable by programming switches on the bus. The instruction memory of each processor has sixteen 40-bit words. Data streaming through the processor is manipulated by the instructions. Multiple operations can be performed in a single cycle in a processor. A low-power handshake protocol is used for synchronization between the sender and the receiver of data. Temporary storage for data and filter coefficients is provided in each chip. A 256 by 16 memory unit is included in each of the 16 clusters. The memory unit can be used as a delay line, FIFO, lookup table, or random access memory. The architecture is scalable with technology. Portable multimedia terminals like U.C. Berkeley's InfoPad can be developed using the proposed parallel DSP architecture, a color display, a pen interface, and wireless network communication for use in clinics, hospitals, homes, offices, and factories.
Watershed transformation is a tool for image segmentation widely used in computer vision applications. However, the complexity of processing large images calls for fast parallel algorithms. In this paper, an improved SPMD (single program multiple data) watershed algorithm based on image integration and sequential scannings is presented. The task performed by the algorithm is an alternative to the classical simulated immersion for computing the watershed image. Although the technique converges slowly on a single-processor computer, due to the repeated raster and anti-raster scannings of the image, it performs much faster in parallel, when subimages of the global image are processed simultaneously. Additionally, a global connected components operation, employed for the parallel labeling of the seeds for region growing, increases the efficiency of the algorithm. Results of a message passing interface (MPI) implementation tested on a Cray T3D parallel computer demonstrate the resilience of the presented parallel design to an increasing number of processors as well as to larger image sizes.
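The watershed algorithm itself is more involved, but the raster/anti-raster scanning pattern it relies on can be illustrated with a classical two-pass city-block distance transform: one top-left-to-bottom-right scan followed by one bottom-right-to-top-left scan propagates values across the whole image. This is an illustrative sketch of the scanning scheme, not the paper's algorithm.

```python
INF = 10 ** 9

def distance_transform(binary):
    """City-block distance to the nearest foreground pixel, computed with
    one raster scan and one anti-raster scan over the image."""
    h, w = len(binary), len(binary[0])
    d = [[0 if binary[y][x] else INF for x in range(w)] for y in range(h)]
    # raster scan: propagate from the top and left neighbors
    for y in range(h):
        for x in range(w):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    # anti-raster scan: propagate from the bottom and right neighbors
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d

# demo: a single seed pixel in the middle of a 5x5 image
img = [[0] * 5 for _ in range(5)]
img[2][2] = 1
d = distance_transform(img)
```

In the parallel SPMD setting, each processor applies such scans to its subimage and exchanges border rows/columns with its neighbors until the values stabilize.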
The objective of thinning is to reduce the amount of information in image patterns to the minimum needed for recognition. A thinned image aids the extraction of important features such as end points, junction points, and connections from image patterns. The ultimate goal of parallel thinning algorithms is to minimize the execution time while producing a high-quality thinned image. Though much research has been performed on parallel thinning algorithms, there has been no systematic approach for comparing their execution speed. Several rough comparisons have been done in terms of iteration counts, but such comparisons may be misleading, since the time required per iteration varies from one algorithm to another. This paper proposes a formal method to analyze the performance of parallel thinning algorithms based on the PRAM (parallel random access machine) model. Six parallel algorithms, which show relatively high performance, are selected and analyzed using the proposed method. Experiments show that the proposed analysis method is sufficiently accurate to evaluate the performance of parallel thinning algorithms.
In this paper, we introduce an original fully parallel thinning algorithm. Our algorithm is based on the notion of P-simple points, which ensures that topology is preserved while sets of simple points are removed in parallel. We then propose a parallel implementation of this algorithm on MIMD architectures using the PVM computation model, including dynamic load balancing.
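The P-simple-point criterion itself is not reproduced here; as a hedged illustration of what "fully parallel" removal means, the sketch below uses the classical Zhang-Suen deletion conditions instead, in which every pixel satisfying the test in a subiteration is removed simultaneously rather than one at a time.

```python
def neighbors(img, y, x):
    # P2..P9 clockwise, starting from the pixel directly above
    return [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1], img[y + 1][x + 1],
            img[y + 1][x], img[y + 1][x - 1], img[y][x - 1], img[y - 1][x - 1]]

def thin_once(img, first_pass):
    """One Zhang-Suen subiteration: mark all deletable pixels, then
    remove them together (the 'parallel' step)."""
    h, w = len(img), len(img[0])
    to_delete = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if not img[y][x]:
                continue
            p = neighbors(img, y, x)
            b = sum(p)                                   # number of foreground neighbors
            a = sum(p[i] == 0 and p[(i + 1) % 8] == 1    # 0->1 transitions around the pixel
                    for i in range(8))
            if first_pass:
                cond = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
            else:
                cond = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
            if 2 <= b <= 6 and a == 1 and cond:
                to_delete.append((y, x))
    for y, x in to_delete:        # simultaneous removal of all marked pixels
        img[y][x] = 0
    return bool(to_delete)

def thin(img):
    # iterate both subiterations until no pixel changes
    while thin_once(img, True) | thin_once(img, False):
        pass
    return img

# demo: a 3-pixel-thick horizontal bar inside a 7x7 image
bar = [[0] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(1, 6):
        bar[y][x] = 1
before = sum(map(sum, bar))
thin(bar)
after = sum(map(sum, bar))
```

The P-simple-point approach of the paper replaces the deletion test above with a condition guaranteeing topology preservation for any set of points removed in parallel.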
The problem of shrinking the connected components of binary images to small residues is addressed. A new parallel algorithm is presented with the lowest known worst-case time complexity as measured by iteration counts, i.e., [(nc + nr)/2] iterations, where the foreground of the image is contained within a rectangular region of size nc x nr pixels. The new algorithm uses a parallel application of operators with small support. It reduces all connected components in any 2D binary image to certain small 1-, 2-, or 3-point residues. It is potentially useful as a fundamental step in connected component labeling algorithms implemented as fine-grain realizations on 2D mesh computing architectures.
The generalized matrix product includes in its formulation many common array manipulations. It also provides a framework for the expression of a number of important image processing algorithms. It is shown that the generalized matrix product may be implemented in its full generality on systolic array architectures. Two approaches are presented. One approach is to regard the generalized matrix product as a collection of products of small matrices and then consider arrangements of systolic configurations common to the smaller products. A second approach is to embed the two factors of the generalized matrix product in sparse matrices and multiply the sparse matrices using a conventional systolic array.
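The conventional systolic array product used in the second approach can be simulated cycle by cycle: operands are skewed so that A[i][k] and B[k][j] meet at processing element (i, j) at time t = i + j + k, and each PE accumulates one output element. This is an illustrative output-stationary simulation, not a hardware description.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array:
    rows of A stream in from the left, columns of B from the top, both
    skewed so that the right operand pair reaches PE (i, j) at each cycle."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for t in range(n + m + p):          # enough cycles for every wavefront
        for i in range(n):
            for j in range(p):
                k = t - i - j           # operand index arriving at PE (i, j) now
                if 0 <= k < m:
                    C[i][j] += A[i][k] * B[k][j]
    return C

# demo: a 2x3 by 3x2 product
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = systolic_matmul(A, B)
```

The sparse-embedding approach of the paper multiplies larger, mostly zero matrices on exactly this kind of conventional array.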
The two-dimensional (2D) Discrete Fourier Transform (DFT) frequently needs to be performed in digital image processing. Although the computing time of the 2D DFT can be dramatically reduced by using the 2D Fast Fourier Transform (FFT), the processing speed for very large arrays remains intolerable. The development of parallel processing systems promotes the application of the 2D FFT. In this paper, we present the implementation of the 2D FFT as a general procedure by the row-column method and the vector-radix method on a general-purpose massively parallel processing system, DAWN1000, developed in China. Even though the 2D FFT is naturally parallel, the requirement of corner-turning and the existence of data communication make its implementation more complicated. We analyze the impact of the machine capacity and the computing complexity on the algorithm efficiency and evaluate the implementation in terms of the arithmetic operations as well as the data transfer. The comparison of the two methods shows that each has its own advantages and disadvantages. Combining their traits, we design a new implementation algorithm with attention to flexibility, efficiency, and the complexity of the communication. As an example, we apply the new approach to spaceborne SAR image processing. Keywords: Parallel Processing, MPP, Row-Column Algorithm, Vector-Radix Algorithm, Block Algorithm, Two-Dimensional FFT, SAR Image Processing.
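The row-column method reduces the 2D FFT to 1D FFTs along the rows, a transpose (the 'corner turn', which becomes the main communication step on a distributed machine), and 1D FFTs along the former columns. A minimal sequential sketch, assuming power-of-two dimensions:

```python
import cmath

def fft(a):
    """Radix-2 decimation-in-time FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2d(rows):
    # 1) 1D FFT along each row (row-distributed, no communication)
    rows = [fft(r) for r in rows]
    # 2) corner turn: transpose so columns become rows
    cols = list(map(list, zip(*rows)))
    # 3) 1D FFT along each former column, then transpose back
    cols = [fft(c) for c in cols]
    return list(map(list, zip(*cols)))

# demo: a constant 4x4 image concentrates all energy in the DC bin
F = fft2d([[1.0] * 4 for _ in range(4)])
```

On a machine like the DAWN1000, step 2 is an all-to-all exchange of sub-blocks between processors, which is why the corner turn dominates the communication cost analysis.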
This paper gives an overview of developments at Silicon Graphics in the areas of scalable multiprocessing and visualization. These developments are grounded in a scalable, shared-memory architecture that allows a high degree of modularity and enormous flexibility of configuration. The first implementations of this architecture include graphics supercomputers with multiple processors and multiple graphics subsystems that enable parallelism at all levels, including parallelism applied to single graphics tasks. We first describe this architecture and then discuss the characteristics of its first-generation implementations.
This paper presents a method for minimizing the total elapsed time spent by n tasks running on m different processors working in parallel. The developed algorithm not only minimizes the total elapsed time but also reduces the idle time and waiting time of in-process tasks. This is very important in applications of computer vision in which the time to finish the total process is particularly critical: quality control in industrial inspection, real-time computer vision, and guided robots. The scheduling algorithm is based on two matrices, obtained from the precedence relationships between tasks, and on the data derived from them. The developed scheduling algorithm has been tested in an application of quality control using computer vision. The results obtained have been satisfactory across different image processing algorithms.
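The paper's matrix-based algorithm is not reproduced here; as a hedged sketch of the underlying problem of precedence-constrained scheduling on m processors, the following greedy list scheduler starts each task as soon as all its predecessors have finished and a processor is free. The ready-task ordering rule is an assumption, not the paper's heuristic.

```python
def schedule(durations, preds, m):
    """Greedy list scheduling: returns (task, processor, start) triples
    and the makespan. preds[t] lists the predecessors of task t."""
    n = len(durations)
    finish = [0] * n
    scheduled = [False] * n
    proc_free = [0] * m          # time at which each processor becomes idle
    order = []
    for _ in range(n):
        ready = [t for t in range(n)
                 if not scheduled[t] and all(scheduled[p] for p in preds[t])]
        # earliest-ready-first tie-break (an illustrative assumption)
        t = min(ready, key=lambda u: max((finish[p] for p in preds[u]), default=0))
        est = max((finish[p] for p in preds[t]), default=0)
        k = min(range(m), key=lambda j: max(proc_free[j], est))
        start = max(proc_free[k], est)
        finish[t] = start + durations[t]
        proc_free[k] = finish[t]
        scheduled[t] = True
        order.append((t, k, start))
    return order, max(finish)

# demo: a diamond-shaped precedence graph on two processors
durations = [3, 2, 2, 4]
preds = [[], [0], [0], [1, 2]]
order, makespan = schedule(durations, preds, m=2)
```

Tasks 1 and 2 run concurrently on the two processors once task 0 finishes, so task 3 can start at time 5, giving a makespan of 9.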
Proc. SPIE 3166, Distribution of image processing applications on a heterogeneous workstation network: modeling, load balancing, and experimental results, 0000 (19 September 1997); doi: 10.1117/12.279615
This work analyzes the distribution of computation in applications generated by a multilevel knowledge-based system for image processing called SVEX. The distribution has been carried out on a heterogeneous workstation network, aiming to take advantage of the availability and frequent under-utilization of this computational resource. The parallelization is based on the message-passing tool Parallel Virtual Machine (PVM). First, SVEX and its computational scheme are described, detailing the structure of the first level (the pixel processor). Then different distribution paradigms are studied, and data-based parallelism is selected for implementation. Given this alternative, the research addresses two fundamental problems: analysis of basic load-balancing schemes, and obtaining a model for predicting parallelization behavior as new machines are added to the computational network. The results produced in a series of experiments permit the comparison of load-balancing schemes and the validation of the proposed model. The experiments include the processing of both static images and sequences.
Contract-Linda is a novel programming architecture for heterogeneous parallel machines, particularly suited to image processing. Previous research has concentrated on static and predetermined scheduling of computation and on fine-grain parallelism. Predetermined scheduling is satisfactory in cases where the computational process is fully deterministic. However, with many image interpretation schemes the flow of control and the nature of the computational procedures can only be determined at run time. In this paper we describe a programming paradigm for coarse-grain and task-level parallelism. Task management is based on the contract net protocol and utilizes the Linda coordination language to provide run-time scheduling. This paradigm accommodates a closely coupled network of heterogeneous processing modules which differ greatly in computational capability; modules based on neural networks, semantic networks, and vector and scalar processors are accommodated. Contract-Linda allows specialized heterogeneous machines to be exploited using a straightforward generic programming model. It does this by providing an internal task management mechanism which ensures that the heterogeneous processing elements are used by the tasks most suited to them, and by exploiting dynamic parallelism within the problem as it is solved. By separating the task of describing the problem from that of describing how the work is carried out on the machine (and providing a solution for this problem), we allow applications to be developed quickly which can effectively utilize specialized machines without the need for specialized programming.
Edge detection is an important first step in many vision tasks, where improvements in speed and efficiency present a continuous challenge for developers of high-speed image recognizers. Classical techniques for accurate detection of edge features, as exemplified by the Canny operator, demand such expensive operations as the iterative use of Gaussians and Laplacians, and their designs are largely sequential. Alternatively, a variety of complex edge-preserving filters have been developed to reduce the effects of noise without significantly distorting the edge loci. This paper describes a cascaded precursor approach for edge detection based on selective local contrast modifications which combine pointwise image operators and nonlinear transformations. A principal advantage of the approach lies in its simplicity and uniformity of operations; the latter is a characteristic blueprint for efficient (parallel) low-level image processing algorithms. Further, unlike many enhancement algorithms, the characteristics of the proposed precursor can be studied analytically, thus allowing independent adjustment of the detector parameters for maximum performance in a specific environment.
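As a rough illustration of the precursor idea, the sketch below combines a pointwise nonlinear contrast modification with a simple local-difference detector. The logistic stretch and the threshold values are illustrative assumptions, not the paper's transform; the point is that every operation is pointwise or uses a tiny neighborhood, hence trivially parallelizable.

```python
import math

def precursor(img, gain=0.05, mid=128):
    """Pointwise nonlinear contrast modification (logistic stretch).
    The gain/midpoint parameters are illustrative assumptions."""
    return [[255 / (1 + math.exp(-gain * (p - mid))) for p in row] for row in img]

def edge_map(img, thresh=64):
    """Flag pixels whose horizontal neighbor difference exceeds a threshold."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(1, w):
            if abs(img[y][x] - img[y][x - 1]) > thresh:
                out[y][x] = 1
    return out

# demo: a vertical step edge between gray levels 60 and 200
img = [[60] * 4 + [200] * 4 for _ in range(4)]
edges = edge_map(precursor(img))
```

The contrast stretch widens the gap across the step while leaving flat regions flat, so a cheap local detector suffices afterwards; this is the role the precursor plays ahead of the detection stage.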
SIMD parallel systems have been employed for image processing and computer vision applications since their inception. This paper describes a system in which parallel programs are implemented using a machine-independent, retargetable object library that provides SIMD execution on the Lockheed Martin PAL-I SIMD parallel processor. Program performance on this machine is improved through on-the-fly execution analysis and scheduling. We describe the relevant elements of the system structure, the general scheme for execution analysis, and the current cost model for scheduling.
Mathematical morphology is a powerful and widespread tool for image analysis and coding, but for large analysis neighborhoods it rapidly becomes time-consuming on general-purpose machines. Much work has been done to decrease the overall complexity by splitting heavy computation tasks into several more efficient ones. Even though significant speedup factors have been achieved using dedicated architectures, the major drawback is that the number of operations required still depends on the neighborhood size considered. This paper presents a systolic network for real-time morphological processing that can perform erosions, dilations, openings, and closings in a single image scan. It implements a new method called 'causal recursive erosion', based on the decomposition of neighborhoods by both dilations and set unions. The resulting architecture has low complexity and offers constant computation time. Reconfiguring the application requires only modifying a few array contents. System synthesis onto one FPGA yields very high processing rates.
Encodings of N-dimensional images on planar screens are studied from the standpoint of their suitability for reducing N-dimensional image processing to the 2-dimensional case. Such a reduction, if possible, would allow us to implement N-dimensional image processing optically via 2-dimensional processing, which is natural to optics. With this goal in mind, a planar encoding commutative with respect to set and morphological operations is proposed. It is also shown how to simulate the N-dimensional geometric duality transform in the plane. Based on such a simulation, a constant-time optical algorithm for constructing the convex hull of N-dimensional images is proposed.
This article presents a new generation of parallel processing architecture for real-time image processing. The approach is implemented in a real-time image processor chip, called the Xium™-2, which combines a fully associative array, providing the parallel engine, with a serial RISC core on the same die. The architecture is fully programmable and can implement a wide range of color image processing, computer vision, and media processing functions in real time. The associative part of the chip is based on the patent-pending methodology of Associative Computing Ltd. (ACL), which condenses 2048 associative processors, each of 128 'intelligent' bits; each bit can be a processing bit or a memory bit. At only 33 MHz, in a 0.6-micron manufacturing process, the chip has a computational power of 3 billion ALU operations per second and 66 billion string search operations per second. The fully programmable nature of the Xium™-2 chip enables developers to use ACL tools to write their own proprietary algorithms combined with existing image processing and analysis functions from ACL's extended set of libraries.
A new algorithm to compute motion vectors has been developed based on the block-matching approach, in which bands are used instead of blocks. The new band-matching algorithm (BDMA) offers better performance than both fixed-size and variable-size block-matching algorithms because it uses fewer vectors and processes parts of the image instead of the whole image. The BDMA extracts the likely moving objects, forms the image of differences between initial successive frames in an image sequence, divides each 'object' into bands of different sizes, forms a 4 by 4 block around each edge in a band, determines the type of each edge, calculates the difference of the textures inside the band, and starts the matching process between frames. Bands are matched according to the number of edges, their types (created and stored using a kind of code book), and the texture values. Matching results produce a specific vector for the estimated motion. To enhance performance, two independent external memories are employed and accessed in parallel; one memory is indexed by the other, with one storing the incoming images and the second buffering the bands and results. Their sizes and features are adapted to the application for which the circuit is needed. The new algorithm has been evaluated using several standard video images.
Visual media processing is becoming increasingly important because of the wide variety of image- and video-based applications. Block rotation is an important operation in different image/video processing tasks such as graphics, fractal processing, pattern matching, and image registration. Remote sensing, medical imaging, computer vision, computer graphics, and video coding are typical applications of digital image rotation. However, a hardware implementation of the block rotation algorithm has not been realized, and software implementations are slow; hence, existing approaches are not suitable for real-time execution. In this paper, we propose a novel method for block rotation which is fast and suitable for hardware implementation. The algorithm employs area-based interpolation. Experimental results have shown a performance enhancement compared to classical interpolation algorithms at a similar level of complexity.
In this paper, we present a software-based H.263 video encoder using a cluster of workstations. H.263 is an international standard for video coding at low bit rates. Due to the high computational cost, a real-time software-based codec (especially the encoder) is impossible to implement on currently available single-processor systems. Parallel processing using multiple computers, however, is a natural way of implementing a software-based real-time codec. A cluster of workstations is analogous to a parallel machine consisting of several processors, each with its own memory, and is a cost-effective computing platform. The workstations, connected via an ATM switch, use the message passing interface (MPI) library for communication. Our implementation is based on a data-parallel approach in which parallelism is exploited within each frame of the video on a macroblock basis. Since the encoder is implemented in software, it provides the flexibility to enhance performance through modifications and improvements in various components. The encoder can be used with various numbers of processors connected via any physical hardware topology, and its performance scales accordingly. It provides options for user-defined parameters such as the number of processors, the size of the motion search window, the quantization parameter, the bit rate, etc. The experimental results indicate that our approach is suitable for real-time applications such as video telephony on conventional analog telephone lines. An encoding rate of 30 frames/sec (with QCIF format) is achieved using 12 workstations.
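The macroblock-level data partitioning can be sketched as follows: each frame is divided into 16x16 macroblocks, and macroblock rows are dealt out to encoder processes as evenly as possible. This is a sketch of the partitioning step only, with hypothetical function and parameter names, not the paper's encoder code.

```python
def partition_macroblocks(width, height, mb=16, workers=4):
    """Row-wise assignment of mb x mb macroblocks to `workers` encoder
    processes; returns one list of (row, col) macroblock indices per worker."""
    rows, cols = height // mb, width // mb
    per, extra = divmod(rows, workers)
    out, start = [], 0
    for w in range(workers):
        n = per + (1 if w < extra else 0)   # spread the remainder rows
        out.append([(r, c) for r in range(start, start + n)
                           for c in range(cols)])
        start += n
    return out

# demo: QCIF is 176x144 pixels, i.e. an 11x9 grid of macroblocks
assignment = partition_macroblocks(176, 144)
```

In the actual encoder each process would run motion estimation, DCT, and quantization on its share of macroblocks, with MPI gathering the coded bits per frame.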
We present a new parallel volume rendering algorithm based on the split-light model for rendering and the bulk synchronous parallel (BSP) model for parallelization. The BSP model provides a simple and architecture-independent approach to structure the parallel program. This parallel program has been tested on a shared memory SGI PowerChallenge machine, a distributed memory IBM SP2 machine and a network of UNIX workstations.
In this paper, we present the implementation of a volume graphics rendering algorithm using shift-restoration operations on the parallel algebraic logic (PAL) image processor. The algorithm is a parallel ray casting algorithm. In order to eliminate shading artifacts caused by inaccurate estimation of surface normal vectors, we use a gray-level volume instead of a binary volume and apply a low-pass filter to smooth the volume object surfaces. By transforming the volume to an intermediate coordinate system to which there is a simple mapping from the object coordinate system, we solve the data redistribution problem caused by the non-regular data access patterns in volume rendering. This has proved very effective in reducing the data communication cost of the rendering algorithm on the PAL hardware.
Chromosome image segmentation is an important step toward automatic karyotyping that involves visualization and interpretation of chromosomes. In this paper, we analyze the characteristics of chromosome images that can be effectively used for segmenting chromosomes and can be efficiently extracted on the Lockheed-Martin PAL parallel image processor. We design and implement a parallel algorithm that uses local features to split touching chromosomes.
Distributed computing provides a cost-effective solution for computation-intensive problems. With the emergence of network operating systems for personal computers (PCs), such as Windows NT, it is now feasible to develop distributed computing on a network of PCs. In addition, the computing power delivered by a PC keeps increasing while its cost decreases, implying that the performance/cost ratio of a PC is high and the computing power delivered by the network is enormous. In this paper, we describe a software system which enables users to develop distributed computing programs using the SPMD (single program multiple data) paradigm very quickly under the Windows NT operating system. The programming model for the system is simple, and a user can control the system through a graphical interface. The results show that our system provides a reasonable speedup in solving image processing problems.
The communication overhead in many multiprocessor computing platforms is a critical factor in performance. In this paper we present the communication performance of a large processing array built with TI 320C40 DSPs. Inter-processor communication is provided by message passing, a common method in multiprocessor systems. The system is developed for image processing; therefore, transmission of large data blocks and various forms of communication are required frequently. The processor used in this system has six built-in communication links. They are 8-bit, bi-directional links with a speed of 20 Mbytes/sec. A processing array built with these processors employs the MIMD paradigm and static interconnection. In this paper, the communication performance of such a DSP network is investigated and performance results are presented. The communication functions include broadcasting, scattering, gathering, and point-to-point transmission of messages.
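The semantics of the collectives measured above can be stated in a few lines of pure Python. This models only the data movement (who ends up with what), not the C40 link hardware or its timing.

```python
def broadcast(root_data, n):
    """Broadcast: every one of n nodes receives a copy of the root's data."""
    return [list(root_data) for _ in range(n)]

def scatter(data, n):
    """Scatter: the root splits its data into n nearly equal contiguous
    chunks, one per node (remainder elements go to the first nodes)."""
    q, r = divmod(len(data), n)
    chunks, start = [], 0
    for k in range(n):
        size = q + (1 if k < r else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks

def gather(chunks):
    """Gather: the root concatenates one chunk from each node, in rank order."""
    return [x for chunk in chunks for x in chunk]

# demo: scatter ten items over three nodes, then gather them back
chunks = scatter(list(range(10)), 3)
```

In an image processing context, scatter distributes image strips to the DSPs, gather collects the processed strips, and broadcast replicates shared data such as filter coefficients.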
In this article, we detect the boundaries of objects with the help of a multiagent system made up of reactive agents. Because reactivity is essential, the agents' behavior is very simple (perception-action): they nevertheless have the capacity to adapt locally to what they consider their environment, that is to say, the image. An agent can move and has its own position in its environment. The basic behavior of an agent consists of following the highest brightness gradient and inscribing its path, if it estimates that it is on an edge, in the memory shared by all agents. Its path thus corresponds to edges found in the image. Note that, in order to be noise resistant, the path is actually stored in the shared memory only if it is long enough. The notion of shared memory is very important because it allows interaction among agents and the coordination of their actions. The agents use already-found edges to find new ones or to complete those already detected. We have tested this system on different gray-scale image scenes, as well as on synthetic scenes that allow analysis of the results obtained. The results are promising and notably fast. Our multiagent system has been tested on a single-processor computer, and we have observed that the number of agents in a simulation affects neither the quality of the result nor the CPU time necessary for segmentation of a given scene. We think that this approach is original in its use of agents and may be used to implement parallel image processing by assigning, for instance, an agent to each processor.
A new concept for building computer systems whose architecture is controlled by the parameters of the images (CPI) is suggested. Methods and processor structures for determining image parameters are presented. Architectures of general-purpose multiprocessor CPI computer systems with fixed and reconfigurable structures were elaborated and investigated. Analytical estimates of the image processing time, as a function of the image parameters, the characteristics of the processor pipeline, and the image memory modules, are presented. Methods of system organization that optimize the number of processors are proposed.
The use of multiprocessor systems is a well-suited solution to the problem of implementing real-time image processing applications. We use a multiresolution image decomposition algorithm to show how the A3 methodology (algorithm architecture adequation), and the CAD software SynDEx which supports it, may improve the implementation of such algorithms on a multi-DSP architecture. The application algorithm as well as the hardware are specified with graphs; the implementation may then be formalized in terms of graph transformations. This methodology significantly reduces the development cycle of image processing applications by simplifying the test and debug process.
This paper presents the implementation of virtual topologies on top of PVM for developing portable parallel image processing on coarse-grained machines. It allows the user to define a high-level logical numbering to ease the management of, and interaction between, tasks. A template convolution primitive has been implemented on a 2D mesh topology.
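A high-level logical numbering for a 2D mesh typically maps task ranks to (row, column) coordinates in row-major order, from which each task can derive its mesh neighbors for border exchanges in a convolution. A minimal sketch with hypothetical function names:

```python
def mesh_coords(rank, cols):
    """Logical (row, col) coordinates of a rank, row-major order."""
    return divmod(rank, cols)

def mesh_rank(row, col, cols):
    """Inverse mapping: rank of the task at (row, col)."""
    return row * cols + col

def mesh_neighbors(rank, rows, cols):
    """North/south/east/west neighbor ranks on a (rows x cols) mesh,
    omitting directions that fall off the mesh boundary."""
    r, c = mesh_coords(rank, cols)
    nbrs = {}
    if r > 0:
        nbrs["north"] = mesh_rank(r - 1, c, cols)
    if r < rows - 1:
        nbrs["south"] = mesh_rank(r + 1, c, cols)
    if c > 0:
        nbrs["west"] = mesh_rank(r, c - 1, cols)
    if c < cols - 1:
        nbrs["east"] = mesh_rank(r, c + 1, cols)
    return nbrs

# demo: rank 5 sits at (1, 1) on a 3x4 mesh
nbrs = mesh_neighbors(5, 3, 4)
```

With such a numbering, a template convolution on an image partitioned over the mesh exchanges halo rows/columns only with the ranks returned by the neighbor lookup, independently of the physical machine.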
The paper introduces a software architecture to support a user from the image processing community in the development of time-constrained image processing applications on parallel computers. The architecture is based on abstract data types with a well-defined interface. The interface separates an application from the actual hardware used. On the application side of the interface, the programmer is presented with a familiar (sequential) programming model. On the hardware side of the interface, detailed knowledge of a parallel machine may be employed to arrive at efficient implementations of basic functionality. Knowledge of both suitable data distributions for images and performance characteristics of operations on those images allows automated selection of an appropriate data distribution scheme throughout the application. Experiments show that with little effort reasonable levels of efficiency and scalability are achieved on a 32-node MIMD architecture.