A framework for autonomously controlling resource allocation at the application level in a distributed, heterogeneous, multi-tasking SPMD environment, called Operator-controlled distributed resource allocation, is motivated and described. The environment requiring such resource allocation is typified by interactive television. In an interactive television environment, applications intended to execute within a virtual machine environment on a viewer's integrated receiver decoder (IRD) are encoded and broadcast with the audio-video television program. Different viewers make different choices that cause the application, which is downloaded identically to each IRD, to execute differently, hence comprising a multi-tasking SPMD environment. In current deployments, the IRDs are furnished by the operator, who ensures that each IRD has sufficient resources to execute every program that is broadcast concurrently; hence, two viewers making exactly the same choices will execute the downloaded application identically. In the future, IRDs are expected to be available at retail stores, purchased by consumers who may choose more or less functionality so long as the IRD meets the minimally acceptable functionality defined by standards currently being developed by middleware and consumer electronics vendors in conjunction with broadcasters and MSOs.
The accelerated development of Peer-to-Peer (P2P) and Grid computing has positioned them as promising next-generation computing platforms. They enable the creation of Virtual Enterprises (VE) for sharing resources distributed across the world. However, resource management, application development, and usage modeling in these environments are complex undertakings. This is due to the geographic distribution of resources that are owned by different organizations or peers. The owners of these resources have different usage or access policies and cost models, and varying loads and availability. In order to address these complex resource management issues, we have proposed a computational economy framework for resource allocation and for regulating supply and demand in Grid computing environments. The framework provides mechanisms for optimizing resource provider and consumer objective functions through trading and brokering services. In a real-world market, various economic models exist for setting the price of goods based on supply and demand and their value to the user; they include commodity markets, posted prices, tenders, and auctions. In this paper, we discuss the use of these models for interaction between Grid components in deciding resource value, and the infrastructure necessary to realize them. In addition to the normal services offered by Grid computing systems, we need an infrastructure to support interaction protocols, allocation mechanisms, currency, secure banking, and enforcement services. Furthermore, we demonstrate the use of some of these economic models in resource brokering through Nimrod/G deadline- and cost-based scheduling for two different optimization strategies on the World Wide Grid (WWG) testbed, which contains peer-to-peer resources located on five continents: Asia, Australia, Europe, North America, and South America.
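The deadline- and cost-based brokering idea can be illustrated with a minimal sketch of a cost-optimization strategy: assign jobs to the cheapest resources first, adding more expensive ones only as needed to meet the deadline. The resource names, prices, and rates below are hypothetical; the real Nimrod/G broker is far more elaborate.

```python
# Minimal sketch of cost-optimized, deadline-constrained brokering.
# Each resource advertises a price per job and a rate (jobs per hour).

def cost_optimized_schedule(resources, n_jobs, deadline_hours):
    """Assign jobs to the cheapest resources first, adding more
    expensive ones only as needed to meet the deadline."""
    assignment = {}
    remaining = n_jobs
    for r in sorted(resources, key=lambda r: r["price"]):
        if remaining <= 0:
            break
        # Jobs this resource can finish before the deadline.
        capacity = int(r["rate"] * deadline_hours)
        take = min(capacity, remaining)
        if take > 0:
            assignment[r["name"]] = take
            remaining -= take
    if remaining > 0:
        raise ValueError("deadline cannot be met with available resources")
    return assignment

resources = [
    {"name": "cheap-cluster", "price": 1.0, "rate": 10.0},
    {"name": "fast-grid",     "price": 4.0, "rate": 50.0},
]
plan = cost_optimized_schedule(resources, n_jobs=100, deadline_hours=4)
```

A time-optimization strategy would instead spread jobs across all affordable resources in proportion to their rates, finishing as early as the budget allows.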
In this paper we present an architecture for service and computational grids and discuss two aspects of coordination on a grid: one related to resource allocation, and one driven by the process specification of a workflow.
SPANR (Schedule, Plan, Assess Networked Resources) is (i) a pre-run, off-line planning and (ii) a runtime, just-in-time scheduling mechanism. It is designed primarily to support commercial applications in that it optimizes throughput rather than individual jobs (unless they have highest priority); thus it is a tool for a commercial production manager to maximize total work. First, the SPANR Planner is presented, showing its ability to do predictive 'what-if' planning. It can answer such questions as (i) what is the overall effect of acquiring new hardware, or (ii) what would be the effect of a different scheduler. The ability of the SPANR Planner to formulate tree-trimming strategies in advance is useful in several commercial applications, such as electronic design or pharmaceutical simulations. The SPANR Planner is demonstrated using a variety of benchmarks. The SPANR Runtime Scheduler (RS) is briefly presented. The SPANR RS can benefit several commercial applications, such as airframe design and financial applications. Finally, a design is shown whereby SPANR can provide scheduling advice to most resource management systems.
MPEG-2 video encoders are now available in a variety of forms using both hardware- and software-based approaches. The software-based approach potentially offers better picture quality but is computationally quite intensive. MPEG-2 video encoding can be accelerated using parallelism, and a number of approaches using parallel machines or networks of workstations have been reported. While these approaches demonstrate sound concepts, they do not offer commercial solutions due to factors such as cost and size. In this paper, we propose a new approach with the aim of building a cost-effective and completely practical solution that is not only highly efficient but also scalable from single-processor to multiple-processor PCs. The highlights of the proposed work include an algorithm for enhancing the efficiency of motion estimation; speeding up the computation of motion estimation and the DCT with Intel's SIMD (Single Instruction, Multiple Data) style MMX and SSE instruction sets within a single processor; and the scheduling and allocation of a multithreading scheme on a multiple-processor PC for managing I/O, synchronization, audio and video encoding, and multiplexing. The proposed multithreaded encoder exploits temporal parallelism in MPEG video sequences with small overhead. The encoder, providing a complete compression solution, achieves faster-than-real-time and half-of-real-time encoding rates for CIF (352 x 288) and CCIR601 (720 x 576) video sequences, respectively, on a multiple-processor PC.
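The temporal-parallelism idea can be sketched as follows: independent groups of pictures (GOPs) are handed to worker threads and "encoded" concurrently. The encode step here is a stand-in (a simple frame sum), not real MPEG-2 compression, and the GOP size and thread count are illustrative only.

```python
# Sketch of temporal parallelism: independent GOPs encoded by worker
# threads, with output order preserved for the final bitstream.
from concurrent.futures import ThreadPoolExecutor

def encode_gop(gop):
    # Placeholder for motion estimation + DCT + entropy coding.
    return sum(gop)

def parallel_encode(frames, gop_size=4, workers=2):
    gops = [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves GOP order, so multiplexing stays simple.
        return list(pool.map(encode_gop, gops))

result = parallel_encode(list(range(8)), gop_size=4, workers=2)
```

Because GOPs are closed units (no prediction across GOP boundaries), this decomposition needs no inter-thread communication during encoding, which matches the small overhead claimed above.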
Application test suites used in the development of parallelizing compilers typically include single-file programs and algorithm kernels. The challenges posed by full-scale commercial applications are rarely addressed. It is often assumed that automatic parallelization is not feasible in the presence of large, realistic programs. In this paper, we reveal some of the hurdles that must be crossed in order to enable these compilers to apply parallelization techniques to large-scale codes. We use a benchmark suite that has been specifically designed to exhibit the computing needs found in industry. The benchmarks are provided by the High Performance Group of the Standard Performance Evaluation Corporation (SPEC). They consist of a seismic processing application and a quantum level molecular simulation. Both applications exist in a serial and a parallel variant. In our studies we compare the parallel variants with the automatically parallelized, serial codes. We use the Polaris parallelizing compiler, which takes Fortran codes and inserts OpenMP directives around loops determined to be dependence-free. We have found five challenges faced by an automatic parallelizing compiler when dealing with full applications: modularity, legacy optimizations, symbolic analysis, array reshaping, and issues arising from input/output operations.
There is evidence that some high-performance applications, such as Air Traffic Control and other real-time dynamic IT and database tasks, are NP-hard. This paper proposes that it is the nature of the basic computing model, not the tasks themselves, that results in intractable situations, and that it is the communication component of many high-performance IT applications that makes them intractable. Consequently, bigger, faster, higher-performance hardware using this model cannot solve these tasks. Networks and clusters of computers simply move the communication bottleneck, ultimately aggravating rather than helping the situation. New models of computing are needed: 'It is no easy matter to root out old prejudices, or to overturn opinions which have acquired an establishment by time, custom and great authorities.' It is proposed that the associative model can significantly reduce, if not eliminate, this bottleneck for many commercial high-performance applications. Moreover, the associative model is well suited to the coming generation of high-performance optical, biological and polymer computers.
Finite difference, time domain (FDTD) simulations are important to the design cycle for optical communications devices. High spatial resolution is essential, and the Courant condition limits the time step, making this problem require the level of high-performance system usually available only at a remote center. Model definition and result visualization can be done locally. Recent application of the alternating direction implicit (ADI) method to FDTD removes the Courant condition, promising larger time steps for meaningful turnaround in simulations. At each time step, tridiagonal equations are solved over single dimensions of a 3D problem, but all three dimensions are involved in each time step. Thus, for a distributed-memory multiprocessor, no partition of the data prevents tridiagonals from crossing processors without remapping at every time step. Likewise, for cache-based or vector computers, there is a stride of NxN for tridiagonals at every time step for an NxNxN grid. There is plenty of parallelism, because NxN tridiagonals can be solved simultaneously. This makes the problem well suited to a machine like the Cray multithreaded architecture (MTA), which has a large, flat memory and uses parallelism to hide memory latency. A Cray MTA implementation of the ADI-FDTD code executes serial tridiagonal solvers in parallel on multiple threads and successfully hides memory latency, achieving just over one FLOP per clock cycle per processor for a 200x200x200 grid on an 8-processor system at the San Diego Supercomputer Center. The 8-processor speed is 2.06 Gflops and the efficiency is 98%. Comparing one MTA processor (250 MHz clock) with a 500 MHz Alpha processor, the MTA is three times as fast for a 50x50x50 grid problem size. A vectorized version of the code run on one Cray T90 processor is three times faster than one MTA processor for a 100x100x100 grid size.
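The parallel structure described above rests on the fact that the NxN tridiagonal systems in each ADI sweep are mutually independent. A minimal sketch of the standard serial solver (the Thomas algorithm) that each thread would run on its own line of the grid is shown below; the coefficient values are illustrative only.

```python
# Sketch of the ADI kernel's unit of work: one tridiagonal solve.
# On the MTA, each of the N*N grid lines would be solved by its own
# thread, since the systems are independent within a sweep.
def thomas(a, b, c, d):
    """Solve a tridiagonal system via forward elimination and back
    substitution; a = sub-, b = main, c = super-diagonal, d = RHS."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# A small diagonally dominant example system (solution is all ones).
x = thomas([0, 1, 1], [4, 4, 4], [1, 1, 0], [5, 6, 5])
```

The Thomas algorithm is inherently sequential along one line, which is why the parallelism must come from solving many lines at once rather than from within a single solve.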
The United States Nuclear Regulatory Commission has developed the thermal-hydraulic analysis code TRAC-M to consolidate the capabilities of its suite of reactor safety analysis codes. One of the requirements for the new consolidated code is that it support parallel computations to extend code functionality and to improve execution speed. A flexible, request-driven Exterior Communication Interface (ECI) was developed at Penn State University for use with the consolidated code and has enabled distributed parallel computing. This paper reports the application of TRAC-M and the ECI at Purdue University to a series of practical nuclear reactor problems. The performance of the consolidated code is studied on a shared-memory machine, a DEC Alpha 8400, in which a Large Break Loss of Coolant Accident (LBLOCA) analysis is applied to the safety analysis of the new-generation reactor AP600. The problem demonstrates the importance of balancing the computational load for practical applications. Other computational platforms are also examined, including Linux and Windows running on multiprocessor PCs. In general, the parallel performance on UNIX and Linux platforms is found to be the most stable and efficient.
Fluid flow through porous materials is critical for understanding and predicting the behavior of systems as diverse in function and scale as hydrocarbon reservoirs, aquifers, filters, membrane separators and even catalytic converters. Recently, there have been calls to incorporate more physics in oil reservoir simulations, as well as to enhance computational capability through the use of High Performance Computing (HPC), in order to improve reservoir management. Accurate prediction of reservoir behavior depends on the physical properties of not only the fluid but also the underlying rock formation. Contemporary approaches to solving these flows involve simulation of only a single physical scale. We are currently developing HiMuST (Hierarchical Multiscale Simulator Technology), an integrated multiscale simulation system for flow through heterogeneous porous materials. HiMuST uses a hierarchy of simulation codes to address the issue of rock property characterization at the pore scale and can self-adjust according to available input data. At the microscopic scale, HiMuST employs the Lattice Boltzmann Method, based on magnetic resonance digitizations of actual rock samples. At the mesoscopic scale, a stochastic model represents a pore network as a randomly generated skeleton of cylindrical pipes, based on physical characteristics determined by the microscopic simulation. We present computational and computer science issues involved in the HPC implementation of the codes and in integrating them into a seamless simulation system. Issues such as portability, scalability, efficiency and extensibility of the final product are also discussed, as well as the numerical methods implemented at each scale. Example simulation results are presented.
This paper describes a new parallel algorithm for the multi-lattice Monte Carlo atomistic simulator for thin film deposition (ADEPT), implemented on a parallel computer using the PVM (Parallel Virtual Machine) message passing library. The parallel algorithm is based on domain decomposition with overlapping and asynchronous communication. Multiple lattices are represented by a single reference lattice through one-to-one mappings, with resulting computational demands comparable to those of the single-lattice Monte Carlo model. Asynchronous communication and domain overlapping are used to reduce the waiting and communication time among parallel processors. Results show that the algorithm is highly efficient with a large number of processors. The algorithm was implemented on a parallel machine with 50 processors, and it is suitable for parallel Monte Carlo simulation of thin film growth on either a distributed-memory parallel computer or a shared-memory machine with message passing libraries. The significant communication time in parallel Monte Carlo simulation of thin film growth is effectively reduced by adopting domain decomposition with overlapping between sub-domains and asynchronous communication among processors. The communication overhead does not increase appreciably, and speedup continues to rise as the number of processors increases; a near-linear increase in computing speed was achieved, with no theoretical limit on the number of processors that can be used. The techniques developed in this work are also suitable for implementing the Monte Carlo code on other parallel systems.
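The decomposition scheme can be sketched in outline: the lattice is split into sub-domains, and each sub-domain is extended by an overlap (ghost) region shared with its neighbours, so that local updates can proceed while boundary data travels asynchronously. The one-dimensional indexing and overlap width below are illustrative, not taken from ADEPT itself.

```python
# Sketch of overlapping domain decomposition: each processor owns a
# contiguous range of lattice sites and also holds a copy of the
# overlap region shared with its neighbours.
def decompose(n_sites, n_procs, overlap=1):
    """Return (owned, extended) index ranges for each processor."""
    base = n_sites // n_procs
    domains = []
    for p in range(n_procs):
        lo = p * base
        hi = n_sites if p == n_procs - 1 else lo + base
        # Extended range includes the overlap shared with neighbours.
        ext_lo = max(0, lo - overlap)
        ext_hi = min(n_sites, hi + overlap)
        domains.append(((lo, hi), (ext_lo, ext_hi)))
    return domains

doms = decompose(100, 4, overlap=2)
```

In the parallel simulator, a processor keeps working on the interior of its owned range while overlap updates from neighbours arrive asynchronously, which is the source of the reduced waiting time reported above.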
In many fields, such as data mining and e-commerce, performance issues are typically addressed by waiting for the next generation of processors and/or distributing the application in a parallel environment. An alternative has been to instrument the code so that observation can drive modifications to improve performance. Success is typically measured by the improvement in wall-clock time of program execution. In the latest generation of commercial processors (IBM Power/PowerPC, Compaq Alpha, Intel Pentium III), programmable counters are included in the hardware to gather data that can be used for performance monitoring. These counters allow internal events in the processor to be observed without impacting the performance of the program being monitored. This paper explores the use of performance monitoring to characterize the machine-learning-based data mining program C4.5 running on an IBM Power II processor node in an IBM RS/6000 SP. The development and verification of the methodology to utilize the performance monitoring hardware is presented. The starting point of this work is an existing performance monitoring application that has been extended to allow monitoring of individual programs running on the single-chip implementation of the Power II architecture. Examples of the data collected from the monitoring of C4.5 are presented and analyzed. With the experience gained from the work on a single node, the potential issues in extending this methodology to a parallel environment such as the IBM RS/6000 SP are explored.
Every software system changes over its lifetime. Some changes are predictable, and systems can be designed to be robust to them by considering future changes beforehand. However, it is impossible to predict all future changes and concerns: anticipating the various concerns is hard due to the diversity of client requirements and the rapid advances in the enabling technologies. Because unanticipated changes often require many parts of the system to be modified or redesigned, they are usually very costly. It is therefore necessary to engineer adaptability into software systems in order to meet future requirements; adaptability is the cornerstone of a successful design. In this paper we present an application framework, the Aspect Moderator Framework (AMF), which can make a software system adaptable and robust to changes over its lifetime, thus reducing overall system cost, and we show how the AMF can be used to build commercial applications that can evolve and adapt easily to new requirements.
In the development of an integrated, experimental search engine, Tocorime Apicu, the incorporation and emulation of the evolutionary aspects of the chosen biological model (honeybees) and of the field of high-performance knowledge discovery in databases couple diverse fields of research: evolutionary computation, biological modeling, machine learning, statistical methods, information retrieval systems, active networks, and data visualization. Computer systems provide inherent sources of self-similar traffic that result from the interaction of file transmission, caching mechanisms, and user-related processes; these processes are initiated by the user, by application programs, or by the operating system (OS) on the user's behalf. The effect of Web transmission patterns, coupled with the inherent sources of self-similarity associated with these file system characteristics, provides an environment for studying network traffic. The study was client-based, but involved no user interaction. New methodologies and approaches were needed as network packet traffic increased in the LAN, LAN+WAN, and WAN settings. Statistical tools and methods for analyzing datasets were used to organize data captured at the packet level for network traffic between individual source/destination pairs. Emulation of the evolutionary aspects of the biological model equips the experimental search engine with an adaptive system model that will eventually have the capability to evolve with an ever-changing World Wide Web environment. The results were generated using a Linux OS.
Novel initiatives amongst the Internet community, such as Internet2 and Qbone, are based on the use of high bandwidth and powerful computers. However, the experience of the majority of Internet users is light-years from these emerging technologies. We describe the construction of a distributed high-performance search engine, utilising advanced threading techniques on a diskless Linux cluster. The resulting Virtual Reality scene is passed to the client machine for viewing. This search engine bridges the gap between the Internet of today and the Internet of the future.
The Acxiom Cluster Testbed is a high-performance, low-cost cluster that is being constructed at the University of Arkansas for the purpose of designing, constructing, and evaluating cluster architectures for massive data processing. Our evaluation efforts include performance testing of cluster hardware and a variety of middleware configurations for the cluster. One of the most important middleware components for massive data processing is a high-performance cluster file system. In this paper we present system requirements for a parallel cluster-based file system, our experimental evaluation of the NFS file system and the PVFS parallel file system for Linux, and future research goals.
Phylogenies (that is, tree-of-life relationships) derived from gene order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Real-world interest in determining these relationships is strong. For example, pharmaceutical companies may use phylogeny reconstruction in drug discovery to identify synthetic pathways unique to the organisms they wish to target. Health organizations study the phylogenies of organisms such as HIV in order to understand their epidemiologies and to aid in predicting the behavior of future outbreaks. And governments are interested in aiding the production of such foodstuffs as rice, wheat and potatoes via genetics, through understanding of the phylogenetic distribution of genetic variation in wild populations. Yet few techniques are available for difficult phylogenetic reconstruction problems. Appropriate tools for the analysis of such data may help resolve phylogenetic problems that have resisted resolution for decades. With the rapid accumulation of whole-genome sequences for a wide diversity of taxa, especially microbial taxa, phylogenetic reconstruction based on changes in gene order and gene content is showing promise, particularly for resolving deep (i.e., ancient) branch splits. However, reconstruction from gene-order data is even more computationally expensive than reconstruction from sequence data, particularly in groups with large numbers of genes and highly rearranged genomes. We have developed a software suite, GRAPPA, that extends the breakpoint analysis (BPAnalysis) method of Sankoff and Blanchette while running much faster: in a recent analysis of chloroplast genome data for species of Campanulaceae on a 512-processor Linux supercluster with Myrinet, we achieved a one-million-fold speedup over BPAnalysis.
GRAPPA can use either breakpoint or inversion distance (computed exactly) for its computation and runs on single-processor machines as well as parallel and high-performance computers.
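The breakpoint distance that underlies breakpoint analysis can be sketched very simply: count the adjacent gene pairs in one genome that are not adjacent (in either orientation) in the other. The simplified version below treats genomes as unsigned, linear permutations, ignoring the signs and circularity that the real method handles.

```python
# Sketch of breakpoint distance between two gene orders: the number of
# adjacencies in g1 that are absent from g2. Signs and circular
# chromosomes are ignored in this simplified illustration.
def breakpoints(g1, g2):
    # Unordered adjacent pairs of g2, so (a, b) matches (b, a).
    adj = {frozenset(p) for p in zip(g2, g2[1:])}
    return sum(1 for p in zip(g1, g1[1:]) if frozenset(p) not in adj)

# Swapping genes 2 and 3 breaks two adjacencies: (1,2) and (3,4).
d = breakpoints([1, 2, 3, 4, 5], [1, 3, 2, 4, 5])
```

The expensive part of breakpoint analysis is not this distance itself but the search over candidate trees and median genomes that minimize it, which is where the parallelism in GRAPPA pays off.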
The publicly-funded effort to determine the complete nucleotide sequence of the human genome, the Human Genome Project (HGP), has currently placed more than 93% of the 3 billion nucleotides of the human genome into a preliminary `draft' format. In addition, several valuable sources of information have been developed as direct and indirect results of the HGP. These include the sequencing of model organisms (rat, mouse, fly, and others), gene discovery projects (ESTs and full-length), and new technologies and resources such as expression analysis (micro-arrays or gene chips). These resources are invaluable for researchers identifying the functional genes of the genome that transcribe and translate into the transcriptome and proteome, both of which potentially contain orders of magnitude more complexity than the genome itself. Preliminary analyses of this data identified approximately 30,000 - 40,000 human `genes.' However, the bulk of the effort still remains: to identify the functional and structural elements contained within the transcriptome and proteome, and to associate function in the transcriptome and proteome with genes. A fortuitous consequence of the HGP is the existence of hundreds of databases containing biological information that may hold data pertaining to the identification of disease-causing genes. The task of mining these databases for information on candidate genes is a commercial application of enormous potential. We are developing a system to acquire and mine data from specific databases to aid our efforts to identify disease genes. A high-speed cluster of Linux workstations is used to analyze sequences and perform distributed sequence alignments as part of our data mining and processing. This system has been used to mine GeneMap99 sequences within specific genomic intervals to identify potential candidate disease genes associated with Bardet-Biedl Syndrome (BBS).
Pricing of derivatives is one of the central problems in Computational Finance. Since the theory of derivative pricing is highly mathematical, numerical techniques such as the lattice approach and finite-difference and finite-element techniques, among others, have been employed in the past. Recently, the Fast Fourier Transform (FFT) has been used for such applications as derivative pricing. In the current work, we develop a parallel FFT algorithm and implement it to price options, with the main aim of studying its performance. For a data size of N and P processors, a blocked data distribution in general yields log(N) - log(P) iterations that involve only local data and log(P) iterations that require remote communication. The algorithm is therefore divided into two parts: local and remote. In the local part, the processors perform computations on their locally partitioned data elements without any communication; in the remote part, the computation requires communication with other processors. In this paper we focus on the remote communication and computation aspects of the algorithm. We discuss the performance of our algorithm and compare, in general terms, the results from the FFT algorithm and a binomial tree algorithm developed and implemented for the same or a similar problem, and we make some general observations on these two algorithms.
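The local/remote split can be illustrated with a small sketch: under a blocked distribution of N points over P processors, butterfly stages whose partner distance is smaller than the block size N/P touch only local data, and the remaining log(P) stages require remote communication. The stage ordering below assumes a radix-2 decimation-in-time layout where partner distances grow by powers of two; other FFT variants order the stages differently.

```python
# Sketch of classifying FFT butterfly stages as local or remote under a
# blocked data distribution (radix-2, partner distance doubles per stage).
import math

def classify_stages(n, p):
    """Count the stages that stay within one block (local) versus the
    stages whose butterfly partners live on another processor (remote)."""
    block = n // p
    local, remote = 0, 0
    for stage in range(int(math.log2(n))):
        distance = 1 << stage   # partner index offset in this stage
        if distance < block:
            local += 1
        else:
            remote += 1
    return local, remote

local, remote = classify_stages(n=1024, p=8)
```

For N = 1024 and P = 8 this gives log2(1024) - log2(8) = 7 local stages and 3 remote stages, matching the iteration counts stated above.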
Financial planning problems are formulated as large-scale, stochastic, multiperiod, tree-structured optimization problems. An efficient technique for solving this kind of problem is the nested Benders decomposition method. In this paper we present a parallel, portable, asynchronous implementation of this technique. To achieve our portability goals we chose the programming language Java for our implementation and used a high-level Java-based framework, called OpusJava, for expressing the parallelism potential as well as synchronization constraints. Our implementation is embedded within a modular decision support tool for portfolio and asset-liability management, the Aurora Financial Management System.