OpenCL Technology details

Hello, dear Habr community.
The previous article about OpenCL gave an overview of the technology, the possibilities it offers the user, and its current status.
Now let's look at the technology more closely. We'll try to understand how OpenCL models a heterogeneous system, what means it provides for interacting with devices, and what approach it offers for writing programs.
OpenCL was conceived as a technology for creating applications that can run in a heterogeneous environment. Moreover, it is designed to work comfortably even with devices that are still only on the drawing board, and with devices that no one has yet invented. To coordinate all these devices, a heterogeneous system always has one "main" device that interacts with all the others via the OpenCL API. Such a device is called the host; it is defined outside of OpenCL.
OpenCL therefore proceeds from the most general assumptions about what an OpenCL-capable device is. Since the device is meant to be used for computation, it has some kind of "processor" in the broad sense of the word: something that can execute commands. Since OpenCL is designed for parallel computing, such a processor may have internal means of parallelism (for example, the several cores of one CPU, or the several SPE processors in Cell). Another elementary way to increase parallel-computing performance is to install several such processors in the device (as in multiprocessor PC motherboards, for example). And naturally, a heterogeneous system can contain several such OpenCL devices, generally speaking with different architectures.
In addition to computing resources, the device has a certain amount of memory. No requirements are imposed on this memory: it may be located on the device, or it may even be allocated in the host's RAM (as is done with integrated video cards, for example).
And that is all; no further assumptions are made about the device.
Such a broad notion of a device imposes no restrictions on programs developed for OpenCL. The technology lets you develop applications that are highly optimized for the specific architecture of a particular OpenCL-capable device, as well as applications that will demonstrate stable performance on all types of devices (assuming equivalent performance of those devices).
OpenCL provides the programmer with a low-level API through which the programmer interacts with device resources. The OpenCL API can either be supported directly by the device or work through an intermediate API (as with NVIDIA, where OpenCL runs on top of the CUDA Driver API supported by the devices); this depends on the particular implementation and is not described by the standard.
Let's see how OpenCL provides such versatility while remaining low-level.
Next, I will give a free translation of part of the OpenCL 1.0 specification, with some comments and additions.
To describe the main ideas of OpenCL, we use a hierarchy of 4 models:
- Platform Model
- Memory Model
- Execution Model
- Programming Model
Platform Model
The OpenCL platform consists of a host connected to devices that support OpenCL. Each OpenCL device consists of compute units (Compute Units, CU), which are further divided into one or more processing elements (Processing Elements, hereafter PE).
An OpenCL application runs on the host in accordance with the native models of the host platform. The application sends commands from the host to the devices to perform computations on the PEs. The PEs within one compute unit execute a single stream of commands either as SIMD units (one instruction is executed by all PEs simultaneously, and processing of the next instruction does not start until every PE has finished the current one) or as SPMD units (each PE has its own program counter).
That is, OpenCL processes commands coming from the host, so the application is not rigidly tied to a particular OpenCL implementation: you can always replace the implementation without breaking the program. Even if a device is created that does not fit the OpenCL device model, it will still be possible to build an OpenCL implementation for it that translates host commands into a form more convenient for that device.
Execution Model
Running an OpenCL program consists of two parts: the host part of the program, and the kernels (with your permission I will keep using the English term, as it is more familiar to most of us) that run on an OpenCL device. The host part of the program defines the context in which the kernels execute and controls their execution.
The main part of the OpenCL execution model describes how kernels execute. When a kernel is queued for execution, an index space is defined (NDRange; the definition is given below). An instance of the kernel is executed for each index in this space. A kernel instance running for a specific index is called a work-item and is identified by a point in the index space; that is, each work-item is given a global ID. Every work-item executes the same code, but the specific execution path (branching, etc.) and the data it works on may differ.
Work-items are organized into groups (work-groups). Groups provide a coarser partition of the index space. Each group is assigned a group ID with the same dimensionality as that used to address individual items. Each item is assigned a local ID that is unique within its group. Thus, work-items can be addressed either by a global ID or by a combination of group ID and local ID.
Work-items in a group execute concurrently (in parallel) on the PEs of a single compute unit.
The unified device model is clearly visible here: several PEs make up a CU, several CUs make up a device, and several devices make up a heterogeneous system.
The index space in OpenCL 1.0 is called NDRange and can be 1-, 2-, or 3-dimensional. An NDRange is an integer array of length N that specifies the extent of the index space in each dimension.
The choice of NDRange dimensionality is a matter of convenience for a particular algorithm: when working with three-dimensional models it is convenient to index by three-dimensional coordinates, while for images or two-dimensional grids a dimensionality of 2 is more convenient. Four-dimensional objects are very rare in our world, so the dimensionality is limited to 3. Besides, whatever one may say, the main target of OpenCL is the GPU, and NVIDIA GPUs currently support index dimensionality up to 3 natively; implementing a higher dimensionality would require tricks and complications in either the CUDA Driver API or the OpenCL implementation.
Execution context and command queues in the OpenCL execution model.
The host defines the context in which the kernels execute. The context includes the following resources:
- Devices: A set of OpenCL devices that the host uses.
- Kernels: OpenCL functions that execute on devices.
- Program Objects: the source code and executables of the kernels.
- Memory Objects: a set of memory objects visible to both the host and the OpenCL devices. Memory objects contain the values that kernels operate on.
The context is created and managed using OpenCL API functions. To control the execution of kernels on the devices, the host creates a data structure called a command queue. The host submits commands to the queue, after which the scheduler dispatches them for execution on the devices within the given context.
Commands can be of the following types:
- Kernel execution command: executes a kernel on the PEs of a device.
- Memory Commands: Move data to, from, or between memory objects.
- Synchronization Commands: Control the order in which commands are executed.
The command queue schedules commands for execution on the device. Commands execute asynchronously between the host and the device. Relative to each other, commands can execute in two ways:
- In-order execution: commands are launched in the order in which they appear in the queue and complete in the same order; that is, commands execute sequentially.
- Out-of-order execution: commands are issued in order, but a command does not wait for the previous one to complete before it starts. In this case the programmer must use synchronization commands explicitly.
Multiple command queues can be associated with a single context. These queues execute concurrently and independently of one another, without any explicit means of synchronization between them.
Using a command queue allows great versatility and flexibility. Modern GPUs have their own scheduler, which decides what to execute, when, and on which compute units. Using the queue does not hamper that scheduler, which maintains its own internal queue of commands.
Execution model: kernel categories.
An OpenCL kernel can belong to one of two categories:
- OpenCL kernels: written in OpenCL C and compiled by the OpenCL compiler. All OpenCL implementations must support OpenCL kernels. Implementations may provide other mechanisms for creating them.
- Native kernels: accessed through host function pointers. A native kernel is queued for execution just like an OpenCL kernel and uses the same memory objects. For example, such kernels can be functions defined in the application code or exported from a library. Note that support for native kernels is optional, and their semantics are not defined by the standard. The OpenCL API includes functions for querying whether a device supports such kernels.
Memory Model
A work-item executing a kernel has access to four distinct types of memory:
- Global memory. This memory provides read and write access to items of all work-groups. Each work-item can read from and write to any part of a memory object. Reads and writes of global memory may be cached, depending on the capabilities of the device.
- Constant memory. A region of global memory that remains constant during kernel execution. The host allocates and initializes memory objects placed in constant memory.
- Local memory. A region of memory local to a work-group. It can be used to create variables shared by the entire group. It may be implemented as dedicated memory on the OpenCL device, or, alternatively, mapped onto a region of global memory.
- Private memory. A region of memory owned by a single work-item. Variables defined in the private memory of one work-item are not visible to others.
The specification defines four types of memory but, again, imposes no requirements on how memory is implemented in hardware. All four types may reside in global memory, with the separation performed at the driver level; conversely, the memory types may be strictly separated, as dictated by the device architecture.
The existence of precisely these memory types is quite logical: a processor core has its own cache, the processor has a shared cache, and the device as a whole has a certain amount of memory.
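In OpenCL C, these four memory types correspond directly to address-space qualifiers. The following schematic kernel is my own illustration (the kernel name and arguments are invented) and simply touches all four:

```c
/* OpenCL C device code (a .cl source), not host C.
   The address-space qualifiers map onto the four memory types above. */
__kernel void scale(__global float *out,           /* global: writable by all groups  */
                    __global const float *in,
                    __constant float *factors,     /* constant: read-only for kernels */
                    __local float *scratch)        /* local: shared within one group  */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    float tmp;                                     /* private: owned by one work-item */

    scratch[lid] = in[gid];                        /* stage input in local memory     */
    barrier(CLK_LOCAL_MEM_FENCE);                  /* synchronize the work-group      */

    tmp = scratch[lid] * factors[0];
    out[gid] = tmp;
}
```

The host passes the `__global` and `__constant` arguments as memory objects; a `__local` buffer is allocated per group by calling `clSetKernelArg` with a size and a NULL pointer.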
Programming Model
The OpenCL execution model supports two programming models: data parallelism and task parallelism; hybrids of the two are also supported. The primary model that shaped the design of OpenCL is data parallelism.
The data-parallel programming model.
This model defines a computation as a sequence of instructions applied to many elements of a memory object. The index space associated with the OpenCL execution model defines the work-items and how the data is distributed among them. In a strictly data-parallel model there is a strict one-to-one correspondence between a work-item and an element of the memory object that the kernel processes in parallel. OpenCL implements a relaxed data-parallel model in which such strict one-to-one correspondence is not required.
OpenCL provides a hierarchical data-parallel model, with two ways of specifying the hierarchical division. In the explicit model the programmer defines both the total number of items to be executed in parallel and how those items are divided into work-groups. In the implicit model the programmer defines only the total number of items to be executed in parallel, and the division into work-groups is performed automatically.
The task-parallel programming model.
In this model, each kernel instance executes independently of any index space. Logically this is equivalent to executing a kernel on a compute unit (CU) with a work-group consisting of a single item. In this model users express parallelism in the following ways:
- by using vector data types implemented by the device;
- by enqueueing many tasks;
- by enqueueing native kernels that use a programming model orthogonal to OpenCL.
The existence of two programming models is also a tribute to universality. The first model suits modern GPUs and Cell well, but not every algorithm can be implemented efficiently within it, and there is always the chance that a device will appear whose architecture is inconvenient for the first model. In that case, the second model allows writing applications specific to that other architecture.
What the OpenCL platform consists of
The OpenCL platform allows applications to use the host and one or more OpenCL devices as a single heterogeneous parallel computing system. The platform consists of the following components:
- OpenCL Platform Layer: allows the host to discover OpenCL devices, query their properties, and create contexts.
- OpenCL Runtime: allows the host program to manage contexts after they have been created.
- OpenCL Compiler: creates the executable programs that contain OpenCL kernels. The OpenCL C programming language implemented by the compiler supports a subset of the ISO C99 standard with extensions for parallelism.
How does it all work?
In the next article I will take a detailed look at the process of creating an OpenCL application, using one of the applications distributed with the NVIDIA Computing SDK as an example, and will show the application optimizations for OpenCL that NVIDIA offers as recommendations.
For now, here is a schematic outline of the steps involved in creating such an application:
- Create a context for executing our program on a device.
- Select the required device (you can simply pick the device with the highest FLOPS).
- Initialize the selected device within the created context.
- Create a command queue from the device ID and the context.
- Create the program from source code and the context, or from binaries and the context.
- Build the program.
- Create a kernel.
- Create memory objects for the input and output data.
- Enqueue a command to write the data from host memory to device memory.
- Enqueue an execution command for the kernel we created.
- Enqueue a command to read the data back from the device.
- Wait for the operations to complete.
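The steps above can be sketched in host C roughly as follows. This is a heavily abridged illustration, not a complete program: it assumes an installed OpenCL SDK, omits all error checking and resource release, simplifies device selection to the first default device, and the `square` kernel is invented for the example:

```c
#include <CL/cl.h>

/* Invented example kernel: squares each element of a buffer in place. */
static const char *src =
    "__kernel void square(__global float *d)"
    "{ size_t i = get_global_id(0); d[i] *= d[i]; }";

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* Context and command queue for the chosen device. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Program from source, built at run time (the JIT step described below). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "square", NULL);

    /* Memory object, data transfer, kernel launch, read-back. */
    float data[64];
    for (int i = 0; i < 64; ++i) data[i] = (float)i;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof data, NULL, NULL);
    clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, sizeof data, data, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof buf, &buf);
    size_t global = 64;  /* 1-D NDRange; work-group split left to the implementation */
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);
    clFinish(q);         /* wait for all queued operations to complete */
    return 0;
}
```

Passing NULL as the local work size in `clEnqueueNDRangeKernel` corresponds to the implicit model described earlier: the implementation chooses the division into work-groups itself.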
It is worth noting that the program is built at run time, which is essentially JIT compilation. The standard explains that this is done so that the program can be built taking the selected context into account; it also allows each OpenCL implementation vendor to optimize the compiler for its device. However, a program can also be created from binaries, or built once at the first launch and then reused; this possibility is also described by the standard. In any case, the compiler is built into the OpenCL platform, for better or worse.
Conclusion
As a result, the OpenCL model turned out to be very universal while remaining low-level, allowing applications to be optimized for a specific architecture. It also provides portability when moving from one type of OpenCL device to another. The vendor of an OpenCL implementation can optimize the interaction of its device with the OpenCL API in every possible way, striving to allocate the device's resources more efficiently. In addition, a well-written OpenCL application will remain efficient across generations of devices.
References:
- http://www.khronos.org/registry/cl/specs/opencl-1.0.48.pdf - the latest version (October 6, 2009) of the OpenCL specification, in English