One major difficulty in designing an architecture for the parallel implementation of the Discrete Wavelet Transform (DWT) is that the DWT is not a block transform. As a result, processors must communicate frequently to exchange data so that correct boundary wavelet coefficients can be computed. This significant communication overhead limits the efficiency of parallel systems, especially for processor networks with large communication latencies. In this paper we propose a new technique, called Boundary Postprocessing, that allows the correct transform of boundary samples. The basic idea is to model the DWT as a Finite State Machine based on the lifting factorization of the wavelet filterbanks. Application of this technique leads to a new parallel DWT architecture, Split-and-Merge, which requires data to be communicated only once between neighboring processors for any number of wavelet decomposition levels. Example designs and performance analysis for 1D and 2D DWT show that the proposed technique can greatly reduce the interprocessor communication overhead. As an example, in a two-processor case our proposed approach shows an average speedup of about 30% compared to the best currently available parallel computation.
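To make the role of the lifting factorization concrete, the following is a minimal sketch of one decomposition level of a lifting-based DWT. It is not the Boundary Postprocessing or Split-and-Merge scheme itself: the choice of the integer LeGall 5/3 filter, the symmetric boundary extension, and the function names are all assumptions made for illustration.

```python
def lifting_53_forward(x):
    """One level of an integer 5/3 lifting transform (assumed filter choice).

    x is a list of ints with even length; returns (approx, detail).
    """
    half = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Predict step: each odd sample minus the mean of its two even neighbours.
    # The neighbour past the right edge is mirrored (symmetric extension).
    detail = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
              for i in range(half)]
    # Update step: each even sample plus a quarter of its two detail neighbours.
    # The neighbour past the left edge is mirrored.
    approx = [even[i] + ((detail[max(i - 1, 0)] + detail[i] + 2) >> 2)
              for i in range(half)]
    return approx, detail


def lifting_53_inverse(approx, detail):
    """Undo lifting_53_forward by reversing the two lifting steps."""
    half = len(approx)
    even = [approx[i] - ((detail[max(i - 1, 0)] + detail[i] + 2) >> 2)
            for i in range(half)]
    odd = [detail[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x
```

In this sketch, transforming a block in isolation reproduces the coefficients of the full-signal transform except for those next to the cut, which depend on samples or partial results held by the neighboring block. That dependence is the boundary problem the abstract describes.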
Visual Computing and Communications is becoming increasingly important with the advent of broadband networks and compression standards. The International Standards Organization is currently finalizing the MPEG-4 standard, which emphasizes object-based coding and content manipulation in video sequences. There are essentially two kinds of redundancy in a video sequence, namely spatial and temporal. The concept of video object planes (VOPs) has been introduced in MPEG-4, which allows for manipulation and coding of the various video objects. The temporal correlation in the VOPs is exploited by employing a motion estimation/compensation process similar to that of the MPEG-1 and MPEG-2 standards. However, the MPEG-4 motion estimation procedure includes some enhancements, particularly in terms of padding. Motion estimation is applied to blocks of both 8 × 8 and 16 × 16 pixels for the luminance component. In this paper, we propose the design of flexible architectures for implementing scalable motion estimation and padding. The proposed architecture is modular and has a regular data flow, and can therefore be implemented in VLSI.
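As a software illustration of block-based motion estimation, here is a minimal sketch of exhaustive block matching. The sum-of-absolute-differences (SAD) cost, the full-search strategy, and the function names are assumptions for illustration; the abstract does not state which matching criterion or search pattern the proposed architecture uses, and MPEG-4 padding is not modeled here.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block of the current frame
    at (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Try every displacement within +/- search_range and return
    (best_cost, dx, dy). Candidates that leave the frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + size <= height and
                      0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The same routine serves both block sizes mentioned above by passing `size=8` or `size=16`.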
A real-time distributed image processing system requires data transfer, synchronization, and error recovery. However, these mechanisms are difficult for a programmer to describe. To solve this problem, we are developing a programming tool for real-time image processing on a distributed system. Using the tool, a programmer specifies only the data flow between computers and the image processing algorithms on each computer. In this paper, we outline the specifications of the programming tool and show sample programs written with it.
Various researchers have realized the value of implementing loop fusion to evaluate dense (pointwise) array expressions. Recently, the method of template metaprogramming in C++ has been used to significantly speed up the evaluation of array expressions, allowing C++ programs to achieve performance comparable to or better than FORTRAN for numerical analysis applications. Unfortunately, the template metaprogramming technique suffers from several limitations in applicability, portability, and potential performance. We present a framework for evaluating dense array expressions in object-oriented programming languages. We demonstrate how this technique supports both common subexpression elimination and threaded implementation and compare its performance to object-library and hand-generated code.
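The idea of loop fusion for pointwise array expressions can be sketched with deferred evaluation: operators build an expression tree, and a single loop evaluates the whole tree element by element, so no temporary array is created for intermediate results. This is an illustrative analogue only, not the framework presented in the paper and not C++ template metaprogramming; all class and function names are hypothetical.

```python
class Expr:
    """Base class: arithmetic operators build a tree instead of computing."""

    def __add__(self, other):
        return BinOp(self, other, lambda p, q: p + q)

    def __mul__(self, other):
        return BinOp(self, other, lambda p, q: p * q)


class Array(Expr):
    """Leaf node holding concrete data."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def at(self, i):
        return self.data[i]


class BinOp(Expr):
    """Interior node: a pointwise operation on two subexpressions."""

    def __init__(self, lhs, rhs, fn):
        self.lhs, self.rhs, self.fn = lhs, rhs, fn

    def __len__(self):
        return len(self.lhs)

    def at(self, i):
        # Element i of the result is computed on demand from element i
        # of the operands; no intermediate array is stored.
        return self.fn(self.lhs.at(i), self.rhs.at(i))


def evaluate(expr):
    """The single fused loop over all elements of the expression."""
    return Array(expr.at(i) for i in range(len(expr)))
```

For example, `evaluate(a + b * c)` visits each index once, whereas eager operators would first materialize `b * c` and then add `a` in a second pass.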
We present a tutorial description of the CAP Computer-Aided Parallelization tool. CAP has been designed with the goal of giving the parallel application programmer complete control over how the application is parallelized, while at the same time removing the burden of explicitly managing a large number of threads and the associated synchronization and communication primitives. The CAP tool, a precompiler generating C++ source code, enables application programmers to specify at a high level of abstraction the set of threads present in the application, the processing operations offered by these threads, and the parallel constructs specifying the flow of data and parameters between operations. A configuration map specifies the mapping between CAP threads and operating system processes, possibly located on different computers. The generated program may run on various parallel configurations without recompilation. We discuss the issues of flow control and load balancing and show the solutions offered by CAP. We also show how CAP can be used to generate relatively complex parallel programs incorporating neighborhood-dependent operations. Finally, we briefly describe a real 3D image processing application, the Visible Human Slice Server, along with its implementation according to the previously defined concepts and its performance.
In this paper we describe AIDPG, an interactive prototype system that derives computer programs from their natural language descriptions. AIDPG shows how to analyze natural language, resolve ambiguities using knowledge, and generate programs. AIDPG consists of a natural language input model, a natural language analysis model, a program generation model (PGG model), and a human-machine interface control model. The PGG model has three sub-models: a program structure management sub-model, a data structure and type management sub-model, and a program base management sub-model. We passed an arithmetic problem described in Japanese to AIDPG and obtained executable C programs. Although AIDPG is currently basic, we obtained a significant result.
By mapping computations directly onto hardware, reconfigurable machines promise a tremendous speed-up over traditional computers. However, executing floating-point operations directly in hardware is a waste of resources. Variable-precision fixed-point arithmetic operations can save gates and reduce clock cycle times. This paper investigates the relation between precision and error for image compression/decompression. More precisely, this paper investigates the relationship between error and bit-precision for the Discrete Cosine Transform and JPEG.
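The relationship between bit-precision and error can be explored in software with a minimal sketch such as the one below. It quantizes only the DCT coefficients to a given number of fractional bits and leaves the accumulations in floating point; that simplification, the 1D orthonormal DCT-II, and the function names are assumptions for illustration, not the experimental setup of the paper.

```python
import math


def dct_matrix(n=8):
    """Rows of the orthonormal n-point DCT-II matrix."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (fixed point)."""
    step = 1 << frac_bits
    return round(value * step) / step


def dct(x, matrix):
    """Apply a transform matrix to the sample vector x."""
    return [sum(c * v for c, v in zip(row, x)) for row in matrix]


def max_error(x, frac_bits):
    """Largest absolute difference between the floating-point DCT and the
    DCT computed with coefficients quantized to frac_bits fractional bits."""
    exact_matrix = dct_matrix(len(x))
    fixed_matrix = [[quantize(c, frac_bits) for c in row]
                    for row in exact_matrix]
    exact = dct(x, exact_matrix)
    approx = dct(x, fixed_matrix)
    return max(abs(a - b) for a, b in zip(exact, approx))
```

Because each coefficient is off by at most half a quantization step, the error in this sketch is bounded by `sum(abs(v) for v in x) * 2 ** -(frac_bits + 1)`, so each additional fractional bit halves the worst-case bound.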
An adaptive neighborhood contrast enhancement (ANCE) technique was developed to improve the perceptibility of features in digitized mammographic images for use in breast cancer screening. The computationally intensive algorithm was implemented on a cluster of 30 DEC Alpha processors using the message passing interface. The parallel implementation of the ANCE technique utilizes histogram-based image partitioning, with each partition consisting of pixels of the same gray-level value regardless of their location in the image. The master processor allots one set of pixels to each slave processor. The slave returns the results to the master, and the master then sends a new set of pixels to the slave for processing. This procedure continues until there are no sets of pixels left. The subdivision of the original image based on gray-level values guarantees that slave processors do not process the same pixel, and is particularly well suited to the characteristics of the ANCE algorithm. The parallelism value of the problem is approximately 16, i.e., the performance does not improve significantly when more than 16 processors are used. The result is a substantial improvement in processing time, leading to the enhancement of 4K × 4K pixel images in the range of 20 to 60 seconds.
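The histogram-based partitioning described above can be sketched as follows. This shows only the grouping of pixel coordinates by gray level, which is what makes the work sets disjoint; the message passing between master and slaves and the ANCE computation itself are omitted, and the function name is hypothetical.

```python
from collections import defaultdict


def partition_by_gray_level(image):
    """Group pixel coordinates by gray-level value.

    image is a list of rows of integer gray levels. Returns a dict mapping
    each gray level to the list of (row, col) positions holding that value.
    Every pixel lands in exactly one set, so no two workers that receive
    different sets can be handed the same pixel.
    """
    groups = defaultdict(list)
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            groups[value].append((r, c))
    return dict(groups)
```

A master process could then hand out the sets one at a time, sending the next set to whichever worker returns a result first.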
In this paper, we present a new intelligent agent-based method to design filter banks that maximize compression quality. In this method, a multi-agent system containing cooperating intelligent agents with different roles is developed to search for filter banks that improve image compression quality. The multi-agent system consists of one generalization agent, and several problem formulation, optimization, and compression agents. The generalization agent performs problem decomposition and result collection. It distributes optimization tasks to optimization agents, and later collects results and selects one solution that works well on all training images as the final output. Problem formulation agents build optimization models that are used by the optimization agents. The optimization formulation includes both the overall performance of image compression and metrics of individual filters. The compression performance is provided by the image coding agent. Optimization agents apply various optimization methods to find the best filter bank for individual training images. Our method is modular and flexible, and is suitable for distributed processing. In experiments, we applied the proposed method to a set of benchmark images and designed filter banks that improve the compression performance of existing filter banks.
Volume holographic associative storage in a photorefractive crystal provides an inherent mechanism for developing a multichannel correlation system for real-time human face recognition with high parallelism. The wavelet transform is introduced to improve the parallelism and discrimination of the system. In this paper, the parameters of the system are optimized for maximum parallelism under hardware limitations. Two factors that mainly affect the parallelism of the system, the dynamic scanning scope of the reference beam and the angle interval of the 2D scanning setup, are analyzed. In our experiments, correlation outputs between an input human face and hundreds of face templates are obtained instantly in parallel. The face can be recognized by simply identifying the position of the correlation peak with the highest intensity. The invariance of the system for human face recognition is also studied. A novel method to recognize an input human face at any rotation angle is proposed and verified experimentally.
Projection is a frequently used process in image processing and visualization. In volume graphics, projection is used to render the essential content of a 3D volume onto a 2D image plane. For the Radon transform, projection is used to transform the image space into a parameter space. In this paper, we propose a matrix decomposition method called identity-plus-row decomposition for designing fast algorithms for projections. By applying this method, we solve the data redistribution problem caused by the irregular data access patterns present in those applications on SIMD mesh-connected computers, developing fast algorithms for volume rendering and the Radon transform on SIMD mesh-connected computers.
This paper presents the software architecture of a ubiquitous computing environment to support distributed image processing in general, and land mine detection and remediation in particular. The resource limitations of mobile clients and the low bandwidth of wireless networks are mitigated in our system. We use the distributed paradigm to shift the computation from the resource-scarce mobile client to networked high-performance computers. The system supports distributed processing of code and data resource components across a network. The system is implemented in Java using a three-tier client-proxy-server model. We also present a prototype of the software architecture.
Moments are among the best-known feature descriptors that can be extracted from an image; their mathematical properties and versatility as feature extractors are well studied. This paper presents a design of moment generators, using established techniques in digital filters and Very Large Scale Integration processing combined under a component-based design framework. Analytically, the moment generator architecture is constructed by cascading single-pole stages of a relatively simple filter that is suitable for implementation on an ASIC platform and is capable of producing a linear combination of moments. Individual sets of moments can be extracted by using dematrixing techniques, which could also be realized in the form of a preprogrammable logic table. A parallel implementation of the design is described using C*, a data-parallel extension of ANSI C. Preliminary evaluation of the design and implementation is also presented.
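The way a cascade of single-pole stages yields linear combinations of moments can be sketched in software. The example below assumes each stage is a plain accumulator (a single pole at z = 1) and recovers the 1D moments m0, m1, and m2 of a length-N sequence; the pole location, the 1D setting, and the function names are assumptions for illustration, not the filter design of the paper.

```python
def cascaded_accumulators(x, stages=3):
    """Feed x through a cascade of accumulators and return the final
    output of each stage after the last sample."""
    state = [0] * stages
    for sample in x:
        carry = sample
        for k in range(stages):
            state[k] += carry      # stage k integrates the stage before it
            carry = state[k]
    return state


def dematrix(outputs, n):
    """Recover m0, m1, m2 (with m_p = sum of i**p * x[i]) from the three
    accumulator outputs y1, y2, y3 of a length-n input, using
        y1 = m0
        y2 = n*m0 - m1
        y3 = ((n*n + n)*m0 - (2*n + 1)*m1 + m2) / 2
    """
    y1, y2, y3 = outputs
    m0 = y1
    m1 = n * m0 - y2
    m2 = 2 * y3 - (n * n + n) * m0 + (2 * n + 1) * m1
    return m0, m1, m2
```

The filter itself uses only additions; the multiplications are confined to the small, fixed dematrixing step, which is why that step lends itself to a lookup table.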
Edge detection and localization are important physical features of object images to be modeled and recognized by the human brain. To develop robust computer vision methodologies with a wide range of applicability, we need early vision operators capable of matching the level of human perceptual performance. In this paper, the physics of the interaction between human retinal cells and the incident light is developed. The suggested model, which includes the receptor, intermediate, and ganglion cells, summarizes the knowledge obtained from electrophysiological and histological data published in the open literature during the last twenty years. Our analysis identifies at what scale neighboring edges start influencing the response of the Laplacian of Gaussian operator. The use of human preattentive vision is the optimal choice for the electronic hardware implementation of the edge detector, because the concept of parallel processing is satisfied. The study of the functional aspects of this model gives some first suggestions for the development of a rational theory of visual information processing. A computer simulation is used to test the performance of this approach.
The Karhunen-Loeve (K-L) transform is very useful in image representation and classification. However, its parallel implementation by optical methods has rarely been investigated. In this paper, we construct a photorefractive-crystal-based optical processor and implement the time-consuming projection operations of the K-L transform in parallel so that the speed of image processing can be greatly improved. In our approach, a set of eigenimages extracted from a large number of training images by the K-L transform is stored in the crystal by using the two-wave mixing technique. When a new image is input to the processor, spatially separated beams with different light intensities are obtained in parallel. The intensity of each beam represents the projection of the input image onto the corresponding eigenimage. The high speed demonstrates the advantage of the optical-computing-based parallel architecture for image processing.
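The projection operation that the optical processor performs in parallel corresponds, in software, to one inner product per eigenimage. The following minimal sketch assumes images and eigenimages are given as flattened lists of equal length and that the eigenimages have already been computed; mean subtraction is omitted, and the function name is hypothetical.

```python
def project(image, eigenimages):
    """Return one projection coefficient per eigenimage.

    Each coefficient is the inner product of the flattened input image
    with one flattened eigenimage. The coefficients are independent of
    one another, which is why they can be computed in parallel.
    """
    return [sum(p * e for p, e in zip(image, eigen))
            for eigen in eigenimages]
```

On a sequential machine the cost grows with the number of eigenimages times the number of pixels, which is the time-consuming step the abstract refers to.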