Long-range video surveillance is usually limited by the wavefront aberrations caused by atmospheric turbulence, rather than by the quality of the imaging optics or sensor. These aberrations can be mitigated optically with adaptive optics, or corrected after detection by digital video processing. Video processing is preferred when the quality of the enhancement is acceptable, because the hardware is less expensive and has lower size, weight, and power (SWaP). Several competing video processing solutions may be employed: speckle imaging with bispectrum processing, lucky imaging, geometric correction, and blind deconvolution. Speckle imaging was originally developed for astronomy and has subsequently been adapted for the more challenging problem of low-altitude, slant-path imaging, where the atmosphere is denser and more turbulent. This paper considers a bispectrum-based video processing solution, called ATCOM, which was originally implemented on an i7 CPU and accelerated using a GPU by EM Photonics. The design has since been adapted in a joint venture with RFEL Ltd to produce a low-SWaP implementation based around Xilinx's Zynq 7045 all-programmable system-on-a-chip (SoC). This system is called ATACAMA. Bispectrum processing is computationally expensive and, for both ATCOM and ATACAMA, a sub-region of the image must be processed to achieve operation at standard video frame rates. This paper considers how the design may be optimized to increase the size of this region while maintaining high performance. Finally, use of Xilinx's next-generation UltraScale+ multiprocessor SoC (MPSoC), which has an embedded Mali-400 GPU as well as an ARM CPU, is explored to further improve functionality.
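For reference, the quantity at the core of this approach is the frame-averaged bispectrum (the standard speckle-imaging definition, with $F$ the Fourier transform of a single short-exposure frame):

$$\langle B(\mathbf{u},\mathbf{v})\rangle \;=\; \bigl\langle F(\mathbf{u})\,F(\mathbf{v})\,F^{*}(\mathbf{u}+\mathbf{v})\bigr\rangle,$$

whose phase is, to first order, insensitive to the per-frame atmospheric phase errors, so the object phase can be recovered recursively from it.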
Video tracking of rocket launches must inherently be done from long range. Due to the high temperatures produced at launch, cameras are often placed far from launch sites, and their distance to the rocket increases as it is tracked through its flight. Consequently, the imagery collected is generally severely degraded by atmospheric turbulence. In this talk, we present our experience in enhancing commercial space flight videos, covering the mission objectives, the unique challenges faced, and the solutions developed to overcome them.
Domain-specific languages are a useful productivity tool, allowing domain experts to program using familiar concepts and vocabulary while benefiting from performance choices made by computing experts. Embedding the domain-specific language in an existing language allows easy interoperability with non-domain-specific code and use of standard compilers and build systems. In C++, this is enabled through the template and preprocessor features. C++ embedded domain-specific languages (EDSLs) allow the user to write simple, safe, performant, domain-specific code that has access to all the low-level functionality that C and C++ offer, as well as the diverse set of libraries available in the C/C++ ecosystem.
In this paper, we will discuss several tools available for building EDSLs in C++ and show examples of projects
successfully leveraging EDSLs. Modern C++ has added many useful new features to the language which we have
leveraged to further extend the capability of EDSLs.
At EM Photonics, we have used EDSLs to allow developers to transparently benefit from high performance computing (HPC) hardware. We will show ways EDSLs combine with existing technologies and EM Photonics' high performance tools and libraries to produce clean, short, high performance code in ways that were not previously possible.
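As a minimal sketch of the expression-template technique that underpins many C++ EDSLs (the types below are invented for illustration and are not EM Photonics' API), the sum a + b + c is fused into a single loop at compile time instead of materializing temporaries:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Lazy node representing the element-wise sum of two operands.
template <typename L, typename R>
struct Add {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    Vec(std::size_t n, double v) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from any expression evaluates the whole tree in one pass.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Unconstrained for brevity; a real EDSL would restrict this overload.
template <typename L, typename R>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vec a(4, 1.0), b(4, 2.0), c(4, 3.0), out(4, 0.0);
    out = a + b + c;               // one fused loop, no temporaries
    std::cout << out[0] << "\n";   // prints 6
}
```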
Methods to reconstruct pictures from imagery degraded by atmospheric turbulence have been under development for
decades. The techniques were initially developed for observing astronomical phenomena from the Earth’s surface, but
have more recently been modified for ground and air surveillance scenarios. Such applications can impose significant
constraints on deployment options because they both increase the computational complexity of the algorithms
themselves and often dictate a requirement for low size, weight, and power (SWaP) form factors. Consequently,
embedded implementations must be developed that can perform the necessary computations on low-SWaP platforms.
Fortunately, there is an emerging class of embedded processors driven by the mobile and ubiquitous computing
industries. We have leveraged these processors to develop embedded versions of the core atmospheric correction engine
found in our ATCOM software. In this paper, we will present our experience adapting our algorithms for embedded systems-on-a-chip (SoCs), namely the NVIDIA Tegra, which couples general-purpose ARM cores with NVIDIA's graphics processing unit (GPU) technology, and the Xilinx Zynq, which pairs similar ARM cores with Xilinx's field-programmable gate array (FPGA) fabric.
Modern digital imaging systems are susceptible to imagery degraded by atmospheric turbulence. Notwithstanding significant improvements in resolution and speed, turbulence-induced degradation of captured imagery still hampers system designers and operators. Several techniques exist for mitigating the effects of turbulence on captured imagery; here we concentrate on the effects of the bi-spectrum speckle averaging approach to image enhancement on a dataset captured in conjunction with meteorological data.
Atmospheric turbulence degrades imagery by imparting scintillation and warping effects that can reduce the ability to identify key features of the subjects. While a human can intuitively see the improvement in visual information that turbulence mitigation techniques offer, this enhancement is rarely quantified in a meaningful way. In this paper, we discuss methods for measuring the potential improvement in system performance that video enhancement algorithms can provide. To accomplish this, we explore two metrics. We use resolution targets to determine the difference between imagery degraded by turbulence and that improved by atmospheric correction techniques. By comparing line scans of the data before and after processing, it is possible to quantify the additional information extracted. Further processing of this data can provide information about the effective modulation transfer function (MTF) of the system with atmospheric effects present and removed; from this we compute a second metric, the relative improvement in Strehl ratio.
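For reference, the Strehl ratio underlying the second metric is the standard definition

$$S \;=\; \frac{I_{\text{peak}}}{I_{\text{peak}}^{\text{diff}}},$$

the peak intensity of the measured point spread function relative to that of the ideal, diffraction-limited system; the relative improvement is then the ratio of $S$ computed after and before processing.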
When capturing imagery over long distances, atmospheric turbulence often degrades the data, especially when observation paths are close to the ground or in hot environments. These issues manifest as time-varying scintillation and warping effects that decrease the effective resolution of the sensor and reduce actionable intelligence. In recent years, several image processing approaches to turbulence mitigation have shown promise. Each of these algorithms has different computational requirements, usability demands, and degrees of independence from camera sensors. They also produce different degrees of enhancement when applied to turbulent imagery. Additionally, some of these algorithms are applicable to real-time operational scenarios while others may only be suitable for postprocessing workflows. EM Photonics has been developing image-processing-based turbulence mitigation technology since 2005. We will compare techniques from the literature with our commercially available, real-time, GPU-accelerated turbulence mitigation software. These comparisons will be made using real (not synthetic), experimentally obtained data for a variety of conditions, including varying optical hardware, imaging range, subjects, and turbulence conditions. Comparison metrics will include image quality, video latency, computational complexity, and potential for real-time operation. Additionally, we will present a technique for quantitatively comparing turbulence mitigation algorithms using real images of radial resolution targets.
Atmospheric turbulence degrades imagery by imparting scintillation and warping effects that blur the collected pictures and reduce the effective level of detail. While this reduction in image quality can occur in a wide range of scenarios, it is particularly noticeable when capturing over long distances, when close to the ground, or in hot and humid environments. For decades, researchers have attempted to correct these problems through device and signal processing solutions. While fully digital approaches have the advantage of not requiring specialized hardware, they have been difficult to realize in real-time scenarios due to a variety of practical considerations, including computational performance, the need to integrate with cameras, and the ability to handle complex scenes. We discuss these challenges and our experience in overcoming them. We enumerate the considerations for developing an image processing approach to atmospheric turbulence correction and describe how we approached them to develop software capable of real-time enhancement of long-range imagery.
When capturing image data over long distances (0.5 km and above), images are often degraded by atmospheric turbulence, especially when imaging paths are close to the ground or in hot environments. These issues manifest as time-varying scintillation and warping effects that decrease the effective resolution of the sensor and reduce actionable intelligence. In recent years, several image processing approaches to turbulence mitigation have shown promise. Each of these algorithms has different computational requirements, usability demands, and degrees of independence from camera sensors. They also produce different degrees of enhancement when applied to turbulent imagery. Additionally, some of these algorithms are applicable to real-time operational scenarios while others may only be suitable for post-processing workflows. EM Photonics has been developing image-processing-based turbulence mitigation technology since 2005 as a part of our ATCOM image processing suite. In this paper, we will compare techniques from the literature with our commercially available, real-time, GPU-accelerated turbulence mitigation software suite, as well as in-house research algorithms. These comparisons will be made using real, experimentally obtained data for a variety of conditions, including varying optical hardware, imaging range, subjects, and turbulence conditions. Comparison metrics will include image quality, video latency, computational complexity, and potential for real-time operation.
The use of commodity mobile processors in wearable computing and field-deployed applications has risen as these processors have become increasingly powerful and inexpensive. Battery technology, however, has not advanced as quickly, and as the processing power of these systems has increased, so has their power consumption. In order to maximize endurance without compromising performance, fine-grained control of power consumption by these devices is highly desirable. Various methodologies exist to affect system-level bias with respect to the prioritization of performance or efficiency, but these are fragmented and global in effect, and so do not offer the breadth and granularity of control desired. This paper introduces a method of giving application programmers more control over system power consumption using a directive-based approach similar to existing APIs such as OpenMP. On supported platforms the compiler, application runtime, and Linux kernel will work together to translate the power-saving intent expressed in compiler directives into instructions to control the hardware, reducing power consumption when possible while still providing high performance when required.
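To make the idea concrete, the following sketch shows the flavor of such directives; the pragma names are hypothetical, invented here for illustration, since the actual syntax is defined in the paper:

```cpp
#include <numeric>
#include <vector>

// Hypothetical directives (illustrative only): hints that let the compiler,
// runtime, and kernel bias core/frequency selection per code region.
double summarize(const std::vector<double>& samples) {
    // This phase tolerates a slow, low-power core or frequency.
    #pragma power(prefer_efficiency)
    double total = std::accumulate(samples.begin(), samples.end(), 0.0);

    // This region is latency-critical; request full performance.
    #pragma power(prefer_performance)
    return total / static_cast<double>(samples.size());
}
```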
Graph analytics is a key component in identifying emerging trends and threats in many real-world applications. Large-scale graph analytics frameworks provide a convenient and highly-scalable platform for developing algorithms to analyze large datasets. Although conceptually scalable, these techniques exhibit poor performance on modern computational hardware. Another model of graph computation has emerged that promises improved performance and scalability by using abstract linear algebra operations as the basis for graph analysis as laid out by the GraphBLAS standard. By using sparse linear algebra as the basis, existing highly efficient algorithms can be adapted to perform computations on the graph. This approach, however, is often less intuitive to graph analytics experts, who are accustomed to vertex-centric APIs such as Giraph, GraphX, and TinkerPop. We are developing an implementation of the high-level operations supported by these APIs in terms of linear algebra operations. This implementation is backed by many-core implementations of the fundamental GraphBLAS operations required, and offers the advantages of both the intuitive programming model of a vertex-centric API and the performance of a sparse linear algebra implementation. This technology can reduce the number of nodes required, as well as the run-time for a graph analysis problem, enabling customers to perform more complex analysis with less hardware at lower cost. All of this can be accomplished without the requirement for the customer to make any changes to their analytics code, thanks to the compatibility with existing graph APIs.
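As a minimal sketch of the correspondence this builds on (plain C++ rather than the GraphBLAS API itself), one breadth-first-search level is a sparse matrix-vector product over the Boolean (OR, AND) semiring, with the visited set acting as a mask:

```cpp
#include <cstdio>
#include <vector>

struct CSR {                     // sparse adjacency matrix, one row per vertex
    std::vector<int> rowptr, col;
};

int main() {
    // Edges: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
    CSR g{{0, 2, 3, 4, 4}, {1, 2, 3, 3}};
    const int n = 4;
    std::vector<char> frontier(n, 0), visited(n, 0);
    frontier[0] = visited[0] = 1;   // start BFS from vertex 0

    for (int level = 1; ; ++level) {
        std::vector<char> next(n, 0);
        // next = A^T * frontier over (OR, AND), masked by !visited
        for (int u = 0; u < n; ++u) {
            if (!frontier[u]) continue;
            for (int k = g.rowptr[u]; k < g.rowptr[u + 1]; ++k) {
                int v = g.col[k];
                if (!visited[v]) next[v] = visited[v] = 1;
            }
        }
        bool any = false;
        for (int v = 0; v < n; ++v)
            if (next[v]) { std::printf("vertex %d at level %d\n", v, level); any = true; }
        if (!any) break;
        frontier.swap(next);
    }
}
```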
The transmission characteristics of millimeter waves (mmWs) make them suitable for many applications in defense and security, from airport preflight scanning to penetrating degraded visual environments such as brownout or heavy fog. While the cold sky provides sufficient illumination for these images to be taken passively in outdoor scenarios, this utility comes at a cost: the diffraction limit of the longer wavelengths involved leads to lower-resolution imagery than in the visible or IR regimes, and the low power levels inherent to passive imaging make the data more susceptible to noise. Recent techniques leveraging optical upconversion have shown significant promise, but are still subject to fundamental limits in resolution and signal-to-noise ratio. To address these issues, we have applied techniques developed for visible and IR imagery to decrease noise and increase resolution in mmW imagery. We have developed these techniques into fieldable software, making use of GPU platforms for real-time operation of computationally complex image processing algorithms. We present data from a passive, 77 GHz, distributed-aperture imaging platform captured during field tests at full video rate. These videos demonstrate the increase in situational awareness that can be gained by applying computational techniques in real time without changes to the detection hardware.
Many ISR applications require constant monitoring of targets from long distance. When capturing over long distances, imagery is often degraded by atmospheric turbulence, which adds a time-variant blurring effect to captured data and can result in a significant loss of information. To recover this information, image processing techniques have been developed to enhance sequences of short-exposure images or videos by removing frame-specific scintillation and warping. While some of these techniques have been shown to be quite effective, the associated computational complexity and required processing power have limited their application to post-event analysis. To meet the needs of real-time ISR applications, video enhancement must be performed in real time in order to provide actionable intelligence as the scene unfolds. In this paper, we provide an overview of an algorithm capable of providing the desired enhancement and focus on its real-time implementation. We discuss the role that GPUs play in enabling real-time performance. This technology can benefit ISR applications by improving the quality of long-range imagery as it is collected, effectively extending sensor range.
Several image processing techniques for turbulence mitigation have been shown to be effective under a wide range of long-range capture conditions; however, complex, dynamic scenes have often required manual interaction with the algorithm's underlying parameters to achieve optimal results. While this level of interaction is sustainable in some workflows, in-field determination of ideal processing parameters greatly diminishes usefulness for many operators. Additionally, some use cases, such as those that rely on unmanned collection, lack human-in-the-loop usage. To address this shortcoming, we have extended a well-known turbulence mitigation algorithm based on bispectral averaging with a number of techniques that greatly reduce (and often eliminate) the need for operator interaction. We automated turbulence strength estimation (Fried's parameter) as well as the determination of optimal local averaging windows that balance turbulence mitigation against the preservation of dynamic scene content (non-turbulent motion). These modifications deliver a level of enhancement quality that approaches that of manual tuning, without the need for operator interaction. As a consequence, the range of operational scenarios where this technology is of benefit has been significantly expanded.
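For reference, the turbulence strength being estimated is Fried's coherence length, which for plane-wave propagation along the imaging path takes the standard form

$$r_0 \;=\; \left[\,0.423\,k^{2}\!\int_{\text{path}} C_n^{2}(z)\,dz\right]^{-3/5},$$

where $k = 2\pi/\lambda$ and $C_n^2$ is the refractive-index structure parameter; smaller $r_0$ means stronger turbulence.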
We examine the performance of a commercially available speckle imaging system in reconstructing static scenes from imagery corrupted by anisoplanatic distortions commonly observed when imaging over long horizontal paths near the ground. Performance is evaluated using the mean squared error (MSE) between system outputs and a diffraction-limited reference image. Input image frames are taken from a large library of simulated imagery of a static object observed over a 1 km horizontal path through volume turbulence under three turbulence conditions; 1000 image frames are available for each condition, allowing a statistically significant characterization of system performance over a range of turbulence conditions.
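The error metric is the usual per-pixel form

$$\mathrm{MSE} \;=\; \frac{1}{N}\sum_{p}\bigl(I_{\text{out}}(p) - I_{\text{ref}}(p)\bigr)^{2},$$

with $I_{\text{out}}$ the system output, $I_{\text{ref}}$ the diffraction-limited reference, and $N$ the number of pixels.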
The OpenCL API allows for the abstract expression of parallel, heterogeneous computing, but hardware implementations differ substantially. The abstractions provided by the OpenCL API are often insufficiently high-level to conceal differences in hardware architecture. Additionally, implementations often do not take advantage of potential performance gains from certain features, due to hardware limitations and other factors. This makes it challenging to produce code that is portable in practice, with the result that much OpenCL code is duplicated for each targeted hardware platform. This duplication of effort offsets the principal advantage of OpenCL: portability.
The use of certain coding practices can mitigate this problem, allowing a common code base to be adapted
to perform well across a wide range of hardware platforms. To this end, we explore some general practices
for producing performant code that are effective across platforms. Additionally, we explore some ways of
modularizing code to enable optional optimizations that take advantage of hardware-specific characteristics.
The minimum requirement for portability implies avoiding the use of OpenCL features that are optional,
not widely implemented, poorly implemented, or missing in major implementations. Exposing multiple levels of
parallelism allows hardware to take advantage of the types of parallelism it supports, from the task level down
to explicit vector operations. Static optimizations and branch elimination in device code help the platform
compiler to effectively optimize programs. Modularization of some code is important to allow operations to
be chosen for performance on target hardware. Optional subroutines exploiting explicit memory locality allow
for different memory hierarchies to be exploited for maximum performance. The C preprocessor and JIT
compilation using the OpenCL runtime can be used to enable some of these techniques, as well as to factor in
hardware-specific optimizations as necessary.
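A sketch of that last pattern follows: one kernel source is specialized per device by passing -D definitions to clBuildProgram at JIT time, so a hardware-specific tile size and an optional local-memory path are resolved by the preprocessor. Error handling is omitted, and the context, device, and tuning values are assumed to come from the application:

```cpp
#include <CL/cl.h>
#include <string>

const char* kSource = R"(
    #ifndef TILE
    #define TILE 16
    #endif
    __kernel void scale(__global float* buf, float s) {
        size_t i = get_global_id(0);
    #if USE_LOCAL_MEM
        __local float tile[TILE];        // hardware-specific fast path
        tile[get_local_id(0)] = buf[i];
        barrier(CLK_LOCAL_MEM_FENCE);
        buf[i] = tile[get_local_id(0)] * s;
    #else
        buf[i] = buf[i] * s;             // portable fallback
    #endif
    }
)";

cl_program buildForDevice(cl_context ctx, cl_device_id device,
                          int tile, bool useLocalMem) {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    // Hardware-specific choices become compile-time constants in the kernel.
    std::string opts = "-DTILE=" + std::to_string(tile) +
                       " -DUSE_LOCAL_MEM=" + (useLocalMem ? "1" : "0");
    clBuildProgram(prog, 1, &device, opts.c_str(), nullptr, nullptr);
    return prog;
}
```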
Imaging scenarios commonly involve erratic, unpredictable camera behavior or subjects that are prone to movement, complicating multi-frame image processing techniques. To address these issues, we developed three techniques that can be applied to multi-frame image processing algorithms to mitigate the adverse effects observed when cameras pan or subjects within the scene move. We provide a detailed overview of the techniques and discuss the applicability of each to various movement types. In addition, we evaluated the efficacy of each technique on field test video processed using our commercially available surveillance product. Our results show that algorithm efficacy is significantly improved in common scenarios, expanding our software's operational scope. Our methods introduce little computational burden, enabling their use in real-time and low-power solutions, and are appropriate for long observation periods. Our test cases focus on imaging through turbulence, a common use case for multi-frame techniques. We present results of a field study designed to test the efficacy of these techniques under expanded use cases.
Located at Edwards Air Force Base, Armstrong Flight Research Center (AFRC) is NASA's premier site for aeronautical research and operates some of the most advanced aircraft in the world. As such, flight tests for advanced manned and unmanned aircraft are regularly performed there. All such tests are tracked through advanced electro-optic imaging systems to monitor the flight status in real time and to archive the data for later analysis. This necessitates the collection of imagery of fast-moving targets from long-range camera systems at a significant distance. Such imagery is severely degraded by the atmospheric turbulence between the camera and the object of interest. The result is imagery that becomes blurred and suffers a substantial reduction in contrast, causing significant detail in the video to be lost. In this paper, we discuss the image processing techniques implemented in the ATCOM software, which uses a multi-frame method to compensate for the distortions caused by turbulence.
Long-range video surveillance performance is often severely diminished due to atmospheric turbulence. The larger
apertures typically used for video-rate operation at long-range are particularly susceptible to scintillation and blurring
effects that limit the overall diffraction efficiency and resolution. In this paper, we present research progress made toward a digital signal processing technique that aims to mitigate the effects of turbulence in real time. Our previous work in this area focused on an embedded implementation for portable applications; our more recent research has focused on
functional enhancements to the same algorithm using general-purpose hardware. We present some techniques that were
successfully employed to accelerate processing of high-definition color video streams and study performance under non-ideal
conditions involving moving objects and panning cameras. Finally, we compare the real-time performance of two
implementations using a CPU and a GPU.
EM Photonics has been investigating the application of massively multicore processors to a key problem area:
Computational Fluid Dynamics (CFD). While the capabilities of CFD solvers have continually increased and improved
to support features such as moving bodies and adjoint-based mesh adaptation, the software architecture has often lagged
behind. This has led to poor scaling as core counts reach the tens of thousands. In the modern High Performance
Computing (HPC) world, clusters with hundreds of thousands of cores are becoming the standard. In addition,
accelerator devices such as NVIDIA GPUs and Intel Xeon Phi are being installed in many new systems. It is important
for CFD solvers to take advantage of the new hardware as the computations involved are well suited for the massively
multicore architecture. In our work, we demonstrate that new features in NVIDIA GPUs can empower existing CFD solvers, using as an example AVUS, a CFD solver developed by the Air Force Research Laboratory (AFRL) and the Volcanic Ash Advisory Center (VAAC). The effort has resulted in increased performance and scalability without
sacrificing accuracy. There are many well-known codes in the CFD space that can benefit from this work, such as
FUN3D, OVERFLOW, and TetrUSS. Such codes are widely used in the commercial, government, and defense sectors.
Multi-frame algorithms for the removal of atmospheric turbulence have proven effective under ideal conditions where
the scene remains static; however, movement of the camera across a scene often introduces undesirable effects that
degrade the quality of processed imagery to the point where it becomes unusable. This paper discusses the development of two solutions to this problem, each with different computational costs and levels of efficacy. The first uses robust registration methods to align a window of input images to each other and processes them to
obtain a single improved frame, repeating the sequence of realignment and processing each time a new frame arrives.
While this approach produces high quality results, the associated computational cost precludes real-time implementation,
even on accelerated platforms. An alternative solution involves measuring and quantifying scene movement through lightweight registration. Registration results are used to make a global determination of "safe" approaches to processing in order to avoid degraded results. This method is computationally inexpensive at the cost of some efficacy. We discuss the performance of both of these modifications against the original, uncompensated algorithm in
terms of computational cost and quality of output imagery. Additionally, we will briefly discuss future work that aims to minimize additional computation while maximizing processing efficacy.
Recently, GPU computing has taken the scientific computing landscape by storm, fueled by the attractive nature of the
massively parallel arithmetic hardware. When porting their code, researchers rely on a set of best practices that have
been developed over the few years that general purpose GPU computing has been employed. This paper challenges a
widely held belief that transfers to and from the GPU device must be minimized to achieve the best speedups over
existing codes by presenting a case study on CULA, our library for dense linear algebra computation on GPUs. Topics discussed include the relationship between computation and transfer time for both synchronous and asynchronous transfers, as well as the impact that data allocations have on memory performance and overall solution time.
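As a small sketch of the kind of measurement involved (illustrative sizes, error checks omitted, and not CULA's internal code), pinned host memory and CUDA events make the synchronous/asynchronous comparison directly observable:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 32 << 20;                        // 32M floats (illustrative)
    float* h; cudaMallocHost(&h, n * sizeof(float));  // pinned: enables true async copies
    float* d; cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    // Synchronous transfer: the CPU blocks until the copy completes.
    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    std::printf("sync  H2D: %.2f ms\n", ms);

    // Asynchronous transfer: the copy is queued on a stream and the CPU is
    // immediately free, e.g. to launch computation that overlaps the copy.
    cudaStream_t s; cudaStreamCreate(&s);
    cudaEventRecord(t0, s);
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    cudaEventRecord(t1, s);
    cudaStreamSynchronize(s);
    cudaEventElapsedTime(&ms, t0, t1);
    std::printf("async H2D: %.2f ms\n", ms);

    cudaStreamDestroy(s);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    cudaFree(d); cudaFreeHost(h);
}
```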
The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math
processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent
FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N³) operations, and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU-accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition, and QR decomposition, along with applications like system solution and least squares. The GPU execution model featured by NVIDIA GPUs based on CUDA demands very strong parallelism, requiring between hundreds and thousands of simultaneous operations to achieve high performance. Some constructs from linear algebra map extremely well to the GPU and others map poorly. CPUs, on the other hand, do well at smaller-order parallelism and perform acceptably during low-parallelism code segments. Our work addresses this via a hybrid processing model, in which the CPU and GPU work simultaneously to produce results. In many cases, this is accomplished by allowing each platform to do the work it performs most naturally.
Modern image enhancement techniques have been shown to be effective in improving the quality of imagery. However,
the computational requirements of applying such algorithms to streams of video in real-time often cannot be satisfied by
standard microprocessor-based systems. While a scaled solution involving clusters of microprocessors may provide the
necessary arithmetic capacity, deployment is limited to data-center scenarios. What is needed is a way to perform these
techniques in real time on embedded platforms. A new paradigm of computing utilizing special-purpose commodity hardware, including Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), has recently emerged as an alternative to parallel computing using clusters of traditional CPUs. Recent research has shown that for many applications, such as image processing techniques requiring intense computation and large memory spaces, these hardware platforms significantly outperform microprocessors. Furthermore, while microprocessor technology has begun to stagnate, GPUs and FPGAs have continued to improve exponentially. FPGAs, flexible and powerful, are best targeted at embedded, low-power systems and specific applications. GPUs, inexpensive and readily available, are accessible to most users through their standard desktop machines. Additionally, as fabrication scale continues to shrink, heat and
power consumption issues typically limiting GPU deployment to high-end desktop workstations are becoming less of a
factor. The ability to include these devices in embedded environments opens up entire new application domains. In this
paper, we investigate two state-of-the-art image processing techniques, super-resolution and the average-bispectrum
speckle method, and compare FPGA and GPU implementations in terms of performance, development effort, cost,
deployment options, and platform flexibility.
The modern graphics processing unit (GPU) found in many off-the-shelf personal computers is a very high
performance computing engine that often goes unutilized. The tremendous computing power coupled with
reasonable pricing has made the GPU a topic of interest in recent research. An application for such power would be
the solution of large systems of linear equations. Two popular solution domains are direct solution, via the LU decomposition, and iterative solution, via a solver such as the Generalized Minimal Residual method (GMRES). Our research focuses on the acceleration of such processes, utilizing the latest in GPU technologies. We show
performance that exceeds that of a standard computer by an order of magnitude, thus significantly reducing the run
time of the numerous applications that depend on the solution of a set of linear equations.
The acquisition of high-resolution imagery is necessary in a wide variety of fields, such as intelligence gathering,
surveillance, and other defense applications. The quality of footage typically determines the usefulness of the obtained information, yet the use of low-resolution imaging devices may be unavoidable under circumstances where high-resolution equipment is unavailable or impossible to deploy. In these scenarios, super resolution methods can be applied to recover lost detail. These methods generally use computationally intense routines to process a series of low-resolution input frames in order to generate a higher-resolution output. Because of the algorithms' computational intensity, real-time operation for moderately sized frames cannot be realized using general-purpose CPU technology. Modern graphics
processing units (GPUs) offer computational performance that far exceeds current CPU technology, allowing real-time
operation to be achieved. This paper presents the development of a GPU-accelerated super resolution implementation.
The algorithm presented here employs gradient-based registration, weighted nearest neighbor (WNN) interpolation
techniques, and Wiener filtering. This accelerated implementation performs 40 times faster than a conventional CPU implementation and achieves processing rates suitable for valuable real-time applications.
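For reference, the Wiener filtering step mentioned above takes the standard deconvolution form

$$\hat{F}(u,v) \;=\; \frac{H^{*}(u,v)}{\,|H(u,v)|^{2} + K\,}\;G(u,v),$$

where $G$ is the spectrum of the observed frame, $H$ the estimated blur transfer function, and $K$ a constant approximating the noise-to-signal power ratio; the specific estimates used are those of the implementation described in the paper.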