Graph analytics is a key component in identifying emerging trends and threats in many real-world applications. Large-scale graph analytics frameworks provide a convenient and highly scalable platform for developing algorithms to analyze large datasets. Although conceptually scalable, these techniques exhibit poor performance on modern computational hardware. Another model of graph computation has emerged that promises improved performance and scalability by using abstract linear algebra operations as the basis for graph analysis, as laid out by the GraphBLAS standard. By using sparse linear algebra as the basis, existing highly efficient algorithms can be adapted to perform computations on the graph. This approach, however, is often less intuitive to graph analytics experts, who are accustomed to vertex-centric APIs such as Giraph, GraphX, and TinkerPop. We are developing an implementation of the high-level operations supported by these APIs in terms of linear algebra operations. This implementation is backed by many-core implementations of the fundamental GraphBLAS operations required, and offers the advantages of both the intuitive programming model of a vertex-centric API and the performance of a sparse linear algebra implementation. This technology can reduce the number of nodes required, as well as the run-time for a graph analysis problem, enabling customers to perform more complex analysis with less hardware at lower cost. All of this can be accomplished without requiring customers to change their analytics code, thanks to the compatibility with existing graph APIs.
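As a minimal illustration of the idea (not the implementation described above), a vertex-centric traversal such as breadth-first search can be written as repeated sparse vector-matrix products over a Boolean (OR, AND) semiring. The sketch below uses plain Python sets in place of a GraphBLAS library; the function name bfs_levels and the sample graph are hypothetical.

```python
# Sketch only: breadth-first search expressed as repeated sparse
# vector-matrix products over a Boolean (OR, AND) semiring.
# Plain sets stand in for sparse vectors and matrix rows.

def bfs_levels(adj, source):
    """Return {vertex: level} for every vertex reachable from source.

    adj maps each vertex to the set of its out-neighbours, i.e. the
    nonzero pattern of one row of a sparse adjacency matrix.
    """
    levels = {source: 0}
    frontier = {source}              # sparse Boolean vector
    depth = 0
    while frontier:
        depth += 1
        # "Multiply": OR together the matrix rows selected by the frontier.
        reached = set()
        for v in frontier:
            reached |= adj.get(v, set())
        # "Mask": drop vertices that already have a level assigned.
        frontier = reached.difference(levels)
        for v in frontier:
            levels[v] = depth
    return levels


# Hypothetical five-vertex directed graph.
graph = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}, 4: set()}
result = bfs_levels(graph, 0)
```

The caller only sees a per-vertex result, as in a vertex-centric API, while the inner loop has the multiply-then-mask shape that a sparse linear algebra backend could execute.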
The use of commodity mobile processors in wearable computing and field-deployed applications has risen as these processors have become increasingly powerful and inexpensive. Battery technology, however, has not advanced as quickly, and as the processing power of these systems has increased, so has their power consumption. To maximize endurance without compromising performance, fine-grained control of power consumption by these devices is highly desirable. Various mechanisms exist to bias a system toward either performance or efficiency, but these are fragmented and global in effect, and so do not offer the breadth and granularity of control desired. This paper introduces a method of giving application programmers more control over system power consumption using a directive-based approach similar to existing APIs such as OpenMP. On supported platforms, the compiler, application runtime, and Linux kernel will work together to translate the power-saving intent expressed in compiler directives into instructions to control the hardware, reducing power consumption when possible while still providing high performance when required.
The OpenCL API provides an abstract mechanism for massively parallel programming on a very wide range of
hardware, including traditional CPUs, GPUs, accelerator devices, FPGAs, and more. However, these different hardware
architectures and platforms function quite differently, so writing OpenCL applications that are usefully portable
is challenging. Developing an effectively portable OpenCL library, one that enables parallel application development
without fully separate code paths for each target platform, therefore requires certain considerations.
By making use of device detection and characterization provided by the OpenCL API, valuable information can be
obtained to make runtime decisions for optimization. In particular, the effects of memory affinity change depending on
the memory organization of the device architecture. Work partitioning and assignment depend on the device execution
model, in particular the types of parallel execution supported and available synchronization primitives.
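A minimal sketch of such a runtime decision follows. It assumes the device properties have already been queried (the fields correspond to what OpenCL reports for CL_DEVICE_TYPE, CL_DEVICE_LOCAL_MEM_TYPE, CL_DEVICE_MAX_WORK_GROUP_SIZE, and CL_DEVICE_MAX_COMPUTE_UNITS); the DeviceInfo container, the choose_strategy function, and the heuristics themselves are hypothetical and only illustrate the shape of the logic.

```python
from dataclasses import dataclass


@dataclass
class DeviceInfo:
    # Hypothetical container for values obtained via clGetDeviceInfo.
    device_type: str            # "CPU", "GPU", or "ACCELERATOR"
    local_mem_type: str         # "LOCAL" (dedicated) or "GLOBAL" (emulated)
    max_work_group_size: int
    compute_units: int


def choose_strategy(dev, problem_size):
    """Pick illustrative launch parameters for one device."""
    # Dedicated local memory makes explicit tiling worthwhile; when local
    # memory is emulated in global memory, staging data only adds copies.
    use_local_tiles = dev.local_mem_type == "LOCAL"
    if dev.device_type == "CPU":
        # Assumed heuristic: one work-item per compute unit, each of which
        # would loop over a contiguous chunk of the problem.
        group_size = 1
        num_groups = min(dev.compute_units, problem_size)
    else:
        # Assumed heuristic: many work-items per group, capped by the
        # device limit; enough groups to cover the whole problem.
        group_size = min(256, dev.max_work_group_size)
        num_groups = -(-problem_size // group_size)   # ceiling division
    return {
        "use_local_tiles": use_local_tiles,
        "group_size": group_size,
        "num_groups": num_groups,
    }


# Hypothetical devices used only to exercise the function.
gpu = DeviceInfo("GPU", "LOCAL", 1024, 32)
cpu = DeviceInfo("CPU", "GLOBAL", 8192, 8)
gpu_plan = choose_strategy(gpu, 1000)
cpu_plan = choose_strategy(cpu, 1000)
```

Keeping the decision in one function that maps device parameters to a plan lets the same host code serve every device without separate code paths.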
These considerations, in turn, affect the selection and invocation of kernel code. For certain devices, platform-specific
libraries are available, while others can benefit from generated kernel code based on the specified device parameters. By
parameterizing an algorithm based on how these considerations affect performance, a combination of device parameters
can be used to produce an execution strategy that will provide improved performance for that device or collection of
devices.
Many ISR applications require constant monitoring of targets from long distances. When capturing over long distances, imagery is often degraded by atmospheric turbulence. This adds a time-variant blurring effect to captured data, and can result in a significant loss of information. To recover it, image processing techniques have been developed to enhance sequences of short-exposure images or videos in order to remove frame-specific scintillation and warping. While some of these techniques have been shown to be quite effective, the associated computational complexity and required processing power limit their application to post-event analysis. To meet the needs of real-time ISR applications, video enhancement must be done in real time in order to provide actionable intelligence as the scene unfolds. In this paper, we will provide an overview of an algorithm capable of providing the enhancement desired and focus on its real-time implementation. We will discuss the role that GPUs play in enabling real-time performance. This technology can be used to add performance to ISR applications by improving the quality of long-range imagery as it is collected and effectively extending sensor range.
The OpenCL standard for general-purpose parallel programming allows a developer to target highly parallel computations towards graphics processing units (GPUs), CPUs, co-processing devices, and field programmable gate arrays (FPGAs). The computationally intense domains of linear algebra and image processing have shown significant speedups when implemented in the OpenCL environment. A major benefit of OpenCL is that a routine written for one device can be run across many different devices and architectures; however, a kernel optimized for one device may not exhibit high performance when executed on a different device. For this reason, kernels must typically be hand-optimized for every target device family. Due to the large number of parameters that can affect performance, hand tuning for every possible device is impractical and often produces suboptimal results. For this work, we focused on optimizing the general matrix multiplication routine. General matrix multiplication is used as a building block for many linear algebra routines and often comprises a large portion of the run-time. Prior work has shown this routine to be a good candidate for high-performance implementation in OpenCL. We selected several candidate algorithms from the literature that are suitable for parameterization. We then developed parameterized kernels implementing these algorithms using only portable OpenCL features. Our implementation queries device information supplied by the OpenCL runtime and uses this, together with user input, to generate a search space that satisfies device and algorithmic constraints. Preliminary results from our work confirm that optimizations are not portable from one device to the next, and show the benefits of automatic tuning. Using a standard set of tuning parameters seen in the literature for the NVIDIA Fermi architecture achieves a performance of 1.6 TFLOPS on an AMD 7970 device, while automatic tuning achieves a peak of 2.7 TFLOPS.
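The search-space generation step can be sketched as follows. This is an illustration under stated assumptions, not the implementation described above: the parameter names (tile, wpi), the candidate values, and the specific constraints are hypothetical stand-ins for a tiled matrix multiplication kernel, and the measure callback stands in for timing a compiled kernel on the device.

```python
from itertools import product


def gemm_search_space(max_work_group_size, local_mem_bytes,
                      tile_sizes=(8, 16, 32), work_per_item=(1, 2, 4),
                      bytes_per_element=4):
    """Enumerate (tile, wpi) settings that satisfy device and algorithm limits.

    max_work_group_size and local_mem_bytes are assumed to come from the
    OpenCL runtime's device queries; the candidate lists are user input.
    """
    space = []
    for tile, wpi in product(tile_sizes, work_per_item):
        if tile % wpi:
            continue                          # algorithmic: wpi must divide tile
        work_items = tile * (tile // wpi)     # work-group of tile x (tile / wpi)
        local_bytes = 2 * tile * tile * bytes_per_element   # one tile each of A and B
        if work_items <= max_work_group_size and local_bytes <= local_mem_bytes:
            space.append({"tile": tile, "wpi": wpi})
    return space


def autotune(space, measure):
    """Return the configuration with the lowest measured cost."""
    return min(space, key=measure)


# Hypothetical device limits: 256 work-items per group, 32 KiB local memory.
space = gemm_search_space(256, 32 * 1024)


def fake_cost(cfg):
    # Stand-in cost model used in place of a real on-device timing run.
    return abs(cfg["tile"] - 16) + cfg["wpi"]


best = autotune(space, fake_cost)
```

Pruning by device limits before any timing run keeps the tuner from compiling kernels that could never launch on the target device.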
The OpenCL API allows for the abstract expression of parallel, heterogeneous computing, but hardware implementations
differ substantially from one another. The abstractions provided by the OpenCL API are
often insufficiently high-level to conceal differences in hardware architecture. Additionally, implementations
often do not take advantage of potential performance gains from certain features due to hardware limitations
and other factors. These factors make it challenging to produce code that is portable in practice, resulting in
much OpenCL code being duplicated for each hardware platform being targeted. This duplication of effort
offsets the principal advantage of OpenCL: portability.
The use of certain coding practices can mitigate this problem, allowing a common code base to be adapted
to perform well across a wide range of hardware platforms. To this end, we explore some general practices
for producing performant code that are effective across platforms. Additionally, we explore some ways of
modularizing code to enable optional optimizations that take advantage of hardware-specific characteristics.
At a minimum, portability requires avoiding OpenCL features that are optional,
not widely implemented, poorly implemented, or missing in major implementations. Exposing multiple levels of
parallelism allows hardware to take advantage of the types of parallelism it supports, from the task level down
to explicit vector operations. Static optimizations and branch elimination in device code help the platform
compiler to effectively optimize programs. Modularization of some code is important to allow operations to
be chosen for performance on target hardware. Optional subroutines exploiting explicit memory locality allow
for different memory hierarchies to be exploited for maximum performance. The C preprocessor and JIT
compilation using the OpenCL runtime can be used to enable some of these techniques, as well as to factor in
hardware-specific optimizations as necessary.
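As a small sketch of the preprocessor-plus-JIT technique, the host program can compose a string of -D macro definitions from device characteristics and pass it as the build options when the OpenCL runtime compiles the kernel source. The macro names (TILE, VEC_WIDTH, USE_LOCAL_MEM), the build_options helper, and the kernel below are hypothetical; only the -D name=value option form is taken from the OpenCL build-option syntax.

```python
def build_options(local_mem_type, vector_width, tile=16):
    """Compose an OpenCL build-option string of -D macro definitions."""
    macros = {
        "TILE": tile,
        "VEC_WIDTH": vector_width,
        # Enable the explicit-locality path only for dedicated local memory.
        "USE_LOCAL_MEM": 1 if local_mem_type == "LOCAL" else 0,
    }
    return " ".join(f"-D {name}={value}" for name, value in macros.items())


# Hypothetical kernel: the optional path is selected at JIT-compile time,
# so the device compiler never sees a runtime branch.
KERNEL_SOURCE = """
__kernel void scale(__global float *x, const float a) {
    size_t i = get_global_id(0);
#if USE_LOCAL_MEM
    /* a hardware-specific variant would stage data in __local memory here */
#endif
    x[i] *= a;
}
"""

gpu_opts = build_options("LOCAL", 4)
cpu_opts = build_options("GLOBAL", 1, tile=8)
```

Because the macros are constants when the kernel is compiled, the unused path is removed entirely, which combines the static-optimization and modularization practices described above.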