The OpenCL API provides an abstract mechanism for massively parallel programming on a very wide range of
hardware, including traditional CPUs, GPUs, accelerator devices, FPGAs, and more. However, these different hardware
architectures and platforms function quite differently, which makes it challenging to write OpenCL applications that
are portable in a useful sense. Developing an effectively portable OpenCL library therefore requires attention to
several considerations, so that parallel applications can be written without fully separate code paths for each target
platform.
The device detection and characterization facilities of the OpenCL API provide valuable information for making
optimization decisions at runtime. In particular, the effects of memory affinity depend on the memory organization of
the device architecture, while work partitioning and assignment depend on the device execution model, specifically the
types of parallel execution it supports and the synchronization primitives it makes available.
These considerations, in turn, affect the selection and invocation of kernel code. For certain devices, platform-specific
libraries are available, while others can benefit from kernel code generated from the specified device parameters. By
parameterizing an algorithm based on how these considerations affect performance, a combination of device parameters
can be used to produce an execution strategy that will provide improved performance for that device or collection of