The OpenCL API allows for the abstract expression of parallel, heterogeneous computing, but implementations differ
substantially across hardware platforms. The abstractions provided by the OpenCL API are
often insufficiently high-level to conceal differences in hardware architecture. Additionally, implementations
often do not take advantage of potential performance gains from certain features due to hardware limitations
and other factors. These factors make it challenging to produce code that is portable in practice, resulting in
much OpenCL code being duplicated for each hardware platform being targeted. This duplication of effort
offsets the principal advantage of OpenCL: portability.
The use of certain coding practices can mitigate this problem, allowing a common code base to be adapted
to perform well across a wide range of hardware platforms. To this end, we explore some general practices
for producing performant code that are effective across platforms. Additionally, we explore some ways of
modularizing code to enable optional optimizations that take advantage of hardware-specific characteristics.
At a minimum, portability requires avoiding OpenCL features that are optional,
not widely implemented, poorly implemented, or missing in major implementations. Exposing multiple levels of
parallelism allows hardware to take advantage of the types of parallelism it supports, from the task level down
to explicit vector operations. Static optimizations and branch elimination in device code help the platform
compiler to effectively optimize programs. Modularization of some code is important to allow operations to
be chosen for performance on target hardware. Optional subroutines that use explicit memory locality allow
for different memory hierarchies to be exploited for maximum performance. The C preprocessor and JIT
compilation using the OpenCL runtime can be used to enable some of these techniques, as well as to factor in
hardware-specific optimizations as necessary.
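As a concrete illustration of the last point, the following sketch (C, OpenCL host API) shows how JIT compilation through the OpenCL runtime can combine with C preprocessor defines to specialize a single kernel source per device. The macro names TILE and USE_LOCAL_MEM and the trivial kernel are hypothetical, chosen only to show the mechanism.

    #include <CL/cl.h>

    /* One generic kernel source; hardware-specific variants are selected
     * purely through -D options supplied at build time. */
    static const char *src =
        "#ifndef TILE\n"
        "#define TILE 64\n"
        "#endif\n"
        "__kernel void scale(__global const float *in, __global float *out, float a) {\n"
        "    size_t g = get_global_id(0);\n"
        "#if USE_LOCAL_MEM\n"                 /* optional local-memory path */
        "    __local float buf[TILE];\n"
        "    size_t l = get_local_id(0);\n"
        "    buf[l] = in[g];\n"
        "    barrier(CLK_LOCAL_MEM_FENCE);\n"
        "    out[g] = a * buf[l];\n"
        "#else\n"                             /* plain path, e.g. for CPUs  */
        "    out[g] = a * in[g];\n"
        "#endif\n"
        "}\n";

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);

        /* Query the runtime and choose build options per device: here a GPU
         * gets the local-memory path (the local work size must then equal
         * TILE), while anything else gets the plain path. */
        cl_device_type type;
        clGetDeviceInfo(dev, CL_DEVICE_TYPE, sizeof(type), &type, NULL);
        const char *opts = (type & CL_DEVICE_TYPE_GPU)
                         ? "-DUSE_LOCAL_MEM=1 -DTILE=256"
                         : "-DUSE_LOCAL_MEM=0";
        clBuildProgram(prog, 1, &dev, opts, NULL, NULL);
        /* ... create the kernel, set arguments, and enqueue as usual ... */
        return 0;
    }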
The OpenCL standard for general-purpose parallel programming allows a developer to target highly parallel computations towards graphics processing units (GPUs), CPUs, co-processing devices, and field programmable gate arrays (FPGAs). The computationally intense domains of linear algebra and image processing have shown significant speedups when implemented in the OpenCL environment. A major benefit of OpenCL is that a routine written for one device can be run across many different devices and architectures; however, a kernel optimized for one device may not exhibit high performance when executed on a different device. For this reason, kernels must typically be hand-optimized for every target device family. Due to the large number of parameters that can affect performance, hand tuning for every possible device is impractical and often produces suboptimal results. For this work, we focused on optimizing the general matrix multiplication routine. General matrix multiplication is used as a building block for many linear algebra routines and often comprises a large portion of the run-time. Prior work has shown this routine to be a good candidate for high-performance implementation in OpenCL. We selected several candidate algorithms from the literature that are suitable for parameterization. We then developed parameterized kernels implementing these algorithms using only portable OpenCL features. Our implementation queries device information supplied by the OpenCL runtime and utilizes this as well as user input to generate a search space that satisfies device and algorithmic constraints. Preliminary results from our work confirm that optimizations are not portable from one device to the next, and show the benefits of automatic tuning. Using a standard set of tuning parameters seen in the literature for the NVIDIA Fermi architecture achieves a performance of 1.6 TFLOPS on an AMD 7970 device, while automatic tuning achieves a peak of 2.7 TFLOPS.
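To make the parameterization concrete, the following OpenCL C sketch shows the shape of a tiled GEMM kernel in which the tile size TS is left as a JIT-time tuning parameter. Real kernels from the literature expose several further parameters (work per thread, vector width, unrolling); the name TS, the square row-major matrices, and the assumption that N is a multiple of TS are simplifications made for illustration.

    /* Parameterized tile size; a tuner rebuilds the kernel with -DTS=... */
    #ifndef TS
    #define TS 16
    #endif

    __kernel void sgemm(const int N,
                        __global const float *A,
                        __global const float *B,
                        __global float *C)
    {
        const int col = get_global_id(0);   /* column of C this work-item owns */
        const int row = get_global_id(1);   /* row of C this work-item owns    */
        const int lc  = get_local_id(0);
        const int lr  = get_local_id(1);

        __local float Asub[TS][TS];         /* per-work-group tiles of A and B */
        __local float Bsub[TS][TS];

        float acc = 0.0f;
        for (int t = 0; t < N / TS; ++t) {  /* march over the K dimension      */
            Asub[lr][lc] = A[row * N + t * TS + lc];
            Bsub[lr][lc] = B[(t * TS + lr) * N + col];
            barrier(CLK_LOCAL_MEM_FENCE);
            for (int k = 0; k < TS; ++k)
                acc += Asub[lr][k] * Bsub[k][lc];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[row * N + col] = acc;
    }

A tuner along these lines can enumerate the TS values permitted by the device's reported work-group and local-memory limits and time a JIT build of each candidate.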
Imaging scenarios commonly involve erratic, unpredictable camera behavior or subjects that are prone to movement, complicating multi-frame image processing techniques. To address these issues, we developed three techniques that can be applied to multi-frame image processing algorithms in order to mitigate the adverse effects observed when cameras are panning or subjects within the scene are moving. We provide a detailed overview of the techniques and discuss the applicability of each to various movement types. In addition, we evaluated algorithm efficacy using field test video processed with our commercially available surveillance product. Our results show that algorithm efficacy is significantly improved in common scenarios, expanding our software's operational scope. Our methods introduce little computational burden, enabling their use in real-time and low-power solutions, and are appropriate for long observation periods. Our test cases focus on imaging through turbulence, a common use case for multi-frame techniques. We present results of a field study designed to test the efficacy of these techniques under expanded use cases.
EM Photonics has been investigating the application of massively multicore processors to a key problem area:
Computational Fluid Dynamics (CFD). While the capabilities of CFD solvers have continually increased and improved
to support features such as moving bodies and adjoint-based mesh adaptation, the software architecture has often lagged
behind. This has led to poor scaling as core counts reach the tens of thousands. In the modern High Performance
Computing (HPC) world, clusters with hundreds of thousands of cores are becoming the standard. In addition,
accelerator devices such as NVIDIA GPUs and Intel Xeon Phi are being installed in many new systems. It is important
for CFD solvers to take advantage of the new hardware as the computations involved are well suited for the massively
multicore architecture. In our work, we demonstrate that new features in NVIDIA GPUs are able to empower existing
CFD solvers by example using AVUS, a CFD solver developed by the Air Force Research Laboratory (AFRL) and the
Volcanic Ash Advisory Center (VAAC). The effort has resulted in increased performance and scalability without
sacrificing accuracy. There are many well-known codes in the CFD space that can benefit from this work, such as
FUN3D, OVERFLOW, and TetrUSS. Such codes are widely used in the commercial, government, and defense sectors.
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math
processor capable of over 1 TFLOPS of peak computational throughput at a cost similar to a high-end CPU with
excellent FLOPS-to-watt ratio. High-level sparse linear algebra operations are computationally intense, often requiring
large numbers of parallel operations, and would seem a natural fit for the processing power of the GPU. Our work is on a
GPU accelerated implementation of sparse linear algebra routines. We present results from both direct and iterative
sparse system solvers.
The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring
between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear
algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order
parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid
processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is
accomplished by allowing each platform to do the work it performs most naturally. For example, the CPU is responsible
for the graph theory portion of the direct solvers while the GPU simultaneously performs the low-level linear algebra.
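The overlap at the heart of this hybrid model can be sketched in a few lines of host code. The sketch below uses the OpenCL host API purely for illustration (the solvers discussed here target CUDA-capable GPUs, but the pattern is identical), and the names symbolic_factorization and numeric_update are hypothetical.

    #include <CL/cl.h>

    struct graph { int n; };                       /* hypothetical CPU-side graph */
    static void symbolic_factorization(struct graph *g) { (void)g; /* stub */ }

    /* One step of the hybrid model: the numeric (GPU) half and the
     * graph-theory (CPU) half of the solver run at the same time. */
    void hybrid_step(cl_command_queue q, cl_kernel numeric_update,
                     size_t global_size, struct graph *g)
    {
        /* Enqueue the low-level linear algebra; the call returns immediately
         * because OpenCL enqueues are asynchronous. */
        clEnqueueNDRangeKernel(q, numeric_update, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
        clFlush(q);                                /* ensure work is submitted */

        /* While the GPU runs, the CPU does the work it performs most
         * naturally: ordering / symbolic factorization on the sparsity graph. */
        symbolic_factorization(g);

        /* Join before the results of the two halves are combined. */
        clFinish(q);
    }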
Recently, GPU computing has taken the scientific computing landscape by storm, fueled by the attractive nature of the
massively parallel arithmetic hardware. When porting their code, researchers rely on a set of best practices that have
been developed over the few years that general purpose GPU computing has been employed. This paper challenges a
widely held belief that transfers to and from the GPU device must be minimized to achieve the best speedups over
existing codes by presenting a case study on CULA, our library for dense linear algebra computation on the GPU. Topics
discussed include the relationship between computation and transfer time for both synchronous and
asynchronous transfers, as well as the impact that data allocations have on memory performance and overall solution time.
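The two effects the study measures can be illustrated with a short host-side sketch using the CUDA runtime and cuBLAS (C-callable APIs): a pinned (page-locked) host allocation, and an asynchronous transfer on one stream overlapped with a GEMM on another. This is a generic sketch of the pattern under assumed placeholder sizes and setup, not CULA's internal implementation.

    #include <string.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Upload the next n x n block while a GEMM on the current data is running. */
    void overlapped_upload(int n, const float *next_block /* pageable host data */)
    {
        size_t bytes = (size_t)n * n * sizeof(float);
        float *h_pinned, *d_a, *d_b, *d_c, *d_next;

        cudaMallocHost((void **)&h_pinned, bytes); /* pinned: allows true async copies */
        cudaMalloc((void **)&d_a, bytes);          /* device operands (setup omitted)  */
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMalloc((void **)&d_next, bytes);
        memcpy(h_pinned, next_block, bytes);       /* stage into pinned memory         */

        cudaStream_t copy_stream, compute_stream;
        cudaStreamCreate(&copy_stream);
        cudaStreamCreate(&compute_stream);

        cublasHandle_t h;
        cublasCreate(&h);
        cublasSetStream(h, compute_stream);        /* GEMM runs on its own stream      */

        /* The transfer and the computation are issued to different streams
         * and therefore proceed concurrently. */
        const float one = 1.0f, zero = 0.0f;
        cudaMemcpyAsync(d_next, h_pinned, bytes, cudaMemcpyHostToDevice, copy_stream);
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, d_a, n, d_b, n, &zero, d_c, n);

        cudaStreamSynchronize(copy_stream);        /* wait for the upload   */
        cudaStreamSynchronize(compute_stream);     /* wait for the GEMM     */

        cublasDestroy(h);
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(compute_stream);
        cudaFreeHost(h_pinned);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); cudaFree(d_next);
    }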
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math
processor capable of nearly 1 TFLOPS of peak throughput at a cost similar to a high-end CPU, with an excellent
FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N³) operations,
and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU accelerated
implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular
value decomposition and QR decomposition along with applications like system solution and least squares. The GPU
execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
The modern graphics processing unit (GPU) found in many off-the-shelf personal computers is a very high
performance computing engine that often goes unutilized. The tremendous computing power coupled with
reasonable pricing has made the GPU a topic of interest in recent research. An application for such power would be
the solution to large systems of linear equations. Two popular solution domains are direct solution, via the LU
decomposition, and iterative solution, via a solver such as the Generalized Minimal Residual method (GMRES). Our
research focuses on the acceleration of such processes, utilizing the latest in GPU technologies. We show
performance that exceeds that of a standard computer by an order of magnitude, thus significantly reducing the run
time of the numerous applications that depend on the solution of a set of linear equations.
Unmanned Aerial Vehicle (UAV) system integration with naval vessels is currently realized in limited form. The
operational envelopes of these vehicles are constricted due to the complexities involved with at-sea flight testing.
Furthermore, the unsteady nature of ship airwakes and the use of automated UAV control software necessitate that
these tests be extremely conservative in nature. Modeling and simulation are natural alternatives to flight testing;
however, a fully-coupled computational fluid dynamics (CFD) solution requires many thousands of CPU hours. We
therefore seek to decrease simulation time by accelerating the underlying computations using state-of-the-art,
commodity hardware. In this paper we present the progress of our proposed solution, harnessing the computational
power of high-end commodity graphics processing units (GPUs) to create an accelerated Euler equations solver on
unstructured hexahedral grids.
Our group has employed modern graphics processing units (GPUs) for the acceleration of finite-difference
based computational electromagnetics (CEM) codes. In particular, we accelerated the well-known Finite-Difference
Time-Domain (FDTD) method, which is commonly used for the analysis of electromagnetic phenomena. This algorithm
uses difference-based approximations for Maxwell's Equations to simulate the propagation of electromagnetic fields
through space and materials. The method is very general and is applicable to a wide array of problems, but runtimes can
be very long, so acceleration is highly desired. In this paper we present GPU-based accelerated solvers for the FDTD
method in both its 2D and 3D embodiments.
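For reference, the core of the 2D method is the small difference stencil below (plain C, serial): every grid point applies the same update, which is what makes the algorithm such a good fit for GPU acceleration. Free space, a uniform grid, and the names NX, NY, ce, and ch are simplifying assumptions made for illustration; boundary conditions and sources are omitted.

    #define NX 256
    #define NY 256

    static float Ez[NX][NY], Hx[NX][NY], Hy[NX][NY];

    /* One FDTD time step for the 2D TM-z case.
     * ce = dt / (eps * dx), ch = dt / (mu * dx) for uniform grid spacing dx. */
    void fdtd_step(float ce, float ch)
    {
        /* Update E from the curl of H (interior points). */
        for (int i = 1; i < NX; ++i)
            for (int j = 1; j < NY; ++j)
                Ez[i][j] += ce * ((Hy[i][j] - Hy[i - 1][j])
                                - (Hx[i][j] - Hx[i][j - 1]));

        /* Update H from the curl of E. */
        for (int i = 0; i < NX - 1; ++i)
            for (int j = 0; j < NY - 1; ++j) {
                Hx[i][j] -= ch * (Ez[i][j + 1] - Ez[i][j]);
                Hy[i][j] += ch * (Ez[i + 1][j] - Ez[i][j]);
            }
    }

A GPU implementation typically maps one thread to each grid point and performs the same two sweeps at every time step.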
There is a growing need for miniature low-cost chemical sensors for use in monitoring environmental conditions. Applications range from environmental pollution monitoring, industrial process control, and homeland security threat detection to biomedical diagnostics. Integrated opto-chemical sensors can provide the required functionality by monitoring chemistry induced changes in the refractive, absorptive, or luminescent properties of materials. Mach-Zehnder (MZ) interferometers, using the phase shift induced by a chemically reactive film, have shown success for such applications but typically are limited to one chemical analysis per sensor. In this paper we present a MZ-like sensor using the dispersion properties of a photonic crystal lattice. Properly engineered dispersion guiding enables the creation of multiple parallel MZ-like sensors monitoring different chemical reactions in a device much smaller than a typical MZ sensor. The phase shift induced in one arm of the photonic crystal structure by the chemical reaction of a special film induces a change in the sensor output. The use of a dispersion guiding photonic crystal structure enables the use of lower refractive index materials because the creation of a bandgap is not necessary. This in turn increases coupling efficiency into the device. Other advantages of this type of structure include the ability to guide both TE and TM modes as well as reduced sensitivity to fabrication tolerances. Two-dimensional FDTD analysis is used to optimize and model the effectiveness of the structure.
Designing nanoscale devices presents a number of unique challenges. As device features shrink, the computational demands of the simulations necessary to accurately model them increase significantly. This is a result of not only the increasing level of detail in the device design itself, but also the need to use more accurate models. The approximations that are generally made when dealing with larger devices break down as feature sizes decrease. This can be seen in the optics field when contrasting the complexity of physical optics models with those requiring a rigorous solution to Maxwell's equations. This added complexity leads to more demanding calculations, stressing computational resources and driving research to overcome these limitations. There are traditionally two means of improving simulation times as model complexity grows beyond available computational resources: modifying the underlying algorithms to maintain sufficient precision while reducing overall computations and increasing the power of the computational system. In this paper, we explore the latter. Recent advances in commodity hardware technologies, particularly field-programmable gate arrays (FPGAs) and graphics processing units (GPUs), have allowed the creation of desktop-style devices capable of outperforming PC clusters. We will describe the key hardware technologies required to build such a device and then discuss their application to the modeling and simulation of nanophotonic devices. We have found that FPGAs and GPUs can be used to significantly reduce simulation times and allow for the solution of much larger problems.
The performance of modeling and simulation tools is inherently tied to the platform on which they are implemented. In
most cases, this platform is a microprocessor, either in a desktop PC, PC cluster, or supercomputer. Microprocessors are
used because of their familiarity to developers, not necessarily their applicability to the problems of interest. We have
developed the underlying techniques and technologies to produce supercomputer performance from a standard desktop
workstation for modeling and simulation applications. This is accomplished through the combined use of graphics
processing units (GPUs), field-programmable gate arrays (FPGAs), and standard microprocessors. Each of these
platforms has unique strengths and weaknesses but, when used in concert, can rival the computational power of a high-performance
computing (HPC) system. By adding a powerful GPU and our custom-designed FPGA card to a commodity desktop
PC, we have created simulation tools capable of replacing massive computer clusters with a single workstation. We
present this work in its initial embodiment: simulators for electromagnetic wave propagation and interaction. We
discuss the trade-offs of each independent technology, GPUs, FPGAs, and microprocessors, and how we efficiently
partition algorithms to take advantage of the strengths of each while masking their weaknesses. We conclude by
discussing enhancements to the computational performance of the underlying desktop supercomputer and its extension to other application areas.
This course teaches the basics of utilizing modern programmable graphics processing units (GPUs) for military applications. The modern GPU is a fully programmable parallel processor that performs computations an order of magnitude faster than the modern CPU. In this course, we will learn broadly about the architecture of the GPU and the situations where speedups may be obtained, and gain an understanding of the tools and languages that are available for development. Programming is not a part of the curriculum.
We will also discuss the available GPU platforms, with an emphasis on rugged, deployable, and low-power offerings. Lastly, the bulk of the course will center on applications and case studies, with emphasis on applications we have produced, including: real-time image processing for the reduction of atmospheric turbulence, applied accelerated linear algebra, image enhancement via super resolution, computational fluid dynamics, and computational electromagnetics.