Intel® Graphics Technology. Part II: offloading computation to the graphics

    We continue our discussion of Intel® Graphics Technology, namely, what we have at our disposal for writing code: the offload and offload_attribute pragmas for offloading, the target(gfx) and target(gfx_kernel) attributes, the __GFX__ and __INTEL_OFFLOAD macros, intrinsics, and a set of API functions for asynchronous offload. That is all we need to be happy. I almost forgot: of course, we also need the Intel compiler and the magic /Qoffload option.

    But first things first. One of the main ideas is that existing CPU code can be modified relatively easily to run on the graphics integrated into the processor.

    This is easiest to show with a simple example of summing two arrays:

    void vector_add(float *a, float *b, float *c){
        for(int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    With Intel® Cilk™ Plus technology, we can easily make it parallel by replacing the for loop with cilk_for:

    void vector_add(float *a, float *b, float *c){
        cilk_for(int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    In the next step, we offload the computation to the graphics with the #pragma offload directive in synchronous mode:

    void vector_add(float *a, float *b, float *c){
        #pragma offload target(gfx) pin(a, b, c:length(N))
        cilk_for(int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    Or we can create a kernel for asynchronous execution by placing the __declspec(target(gfx_kernel)) specifier in front of a function:

    __declspec(target(gfx_kernel))
    void vec_add(float *c, float *a, float *b, int N){
        cilk_for(int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    By the way, the letters GFX appear everywhere, which should remind us that we are working with integrated graphics (GFX stands for graphics), and not with a GPU, a term that is often understood to mean discrete graphics.

    As you have probably gathered, the procedure has a number of peculiarities. First, everything works only with cilk_for loops. Clearly, for this to work well the code needs a parallel version to begin with, but for now only the Cilk loop mechanism is supported; OpenMP, in other words, is left out. Also keep in mind that the graphics hardware does not handle 64-bit floats and integers very well, so do not expect high performance from such operations.

    There are two main modes for computing on the graphics: synchronous and asynchronous. The former is implemented with compiler directives, the latter with a set of API functions; in the asynchronous case, performing the offload requires putting the function (kernel) declared this way into a queue for execution.

    Synchronous mode
    It is enabled with the #pragma offload target(gfx) directive placed before the cilk_for loop of interest.
    In a real application the loop body may well call some function, in which case that function must also be declared with __declspec(target(gfx)).
    The synchronicity consists in the fact that the host (CPU) thread executing the code waits for the computation on the graphics to finish. At the same time, the compiler generates code for both the host and the graphics, which gives greater flexibility across different hardware: if offload is not supported, all of the code simply runs on the CPU. We already discussed how this is implemented in the first post.
    The directive accepts the following clauses:
    • if (condition) - the code will be executed on the graphics only if the condition is true
    • in | out | inout | pin (variable_list: length(length_in_elements))
      in, out, and inout specify which variables to copy between the CPU and the graphics
    • pin - marks variables as shared between the CPU and the graphics. In this case no data is copied, and the memory used cannot be swapped out.
    • length - required when working with pointers. It sets the size of the data to be copied to/from graphics memory, or to be shared with the CPU, expressed as a number of elements of the pointed-to type. For a pointer to an array, this is the number of elements in the array.

    An important note: using pin can significantly reduce the offload overhead. Instead of copying the data back and forth, we give both the host (CPU) and the integrated graphics access to the same physical memory. If the data size is small, though, we will not see a large gain.
    Since the OS does not know that the processor graphics are using this memory, the obvious solution was to make the affected memory pages non-swappable, to avoid unpleasant surprises. So be careful not to pin too much; otherwise we end up with a lot of pages that cannot be swapped out, and the performance of the system as a whole will certainly not improve.

    In our array-summing example, we simply use the clause pin(a, b, c:length(N)):

    #pragma offload target(gfx) pin(a, b, c:length(N))

    That is, the arrays a, b and c are not copied to graphics memory but remain accessible in shared memory, and the corresponding pages are not swapped out until we finish the work.
    By the way, the /Qoffload- option makes the compiler ignore the offload pragmas; handy if we suddenly get tired of offload. And nobody has cancelled ifdefs, so this technique is still perfectly relevant:

    #ifdef __INTEL_OFFLOAD
      cout << "\nThis program is built with __INTEL_OFFLOAD.\n"
           << "The target(gfx) code will be executed on target if it is available\n";
    #else
      cout << "\nThis program is built without __INTEL_OFFLOAD\n"
           << "The target(gfx) code will be executed on CPU only.\n";
    #endif

    Asynchronous mode
    Let us now look at the other offload mode, which is based on API functions. The graphics have their own execution queue, and all we need to do is create kernels (gfx_kernel) and put them into this queue. A kernel is created with the __declspec(target(gfx_kernel)) specifier in front of a function. When the host thread submits a kernel to the queue, it keeps running; it can, however, wait for the computation on the graphics to finish with the _GFX_wait() function.

    In synchronous mode, every time we enter an offload region we pin the memory (assuming we do not want to copy, of course), and when we leave the loop we unpin it. This happens implicitly and requires no extra constructs. So if the offload itself sits inside some outer loop, we incur very large overhead. In the asynchronous case, we can state explicitly, via the API functions, when to start pinning memory and when to stop.

    In addition, in asynchronous mode the compiler does not generate code for both the host and the graphics. You will therefore have to take care of a host-only implementation yourself.

    Here is the code that computes the array sums in asynchronous mode (the vec_add kernel was shown above):

    	float *a = new float[TOTALSIZE];
    	float *b = new float[TOTALSIZE];
    	float *c = new float[TOTALSIZE];
    	float *d = new float[TOTALSIZE];
    	a[0:TOTALSIZE] = 1;
    	b[0:TOTALSIZE] = 1;
    	c[0:TOTALSIZE] = 0;
    	d[0:TOTALSIZE] = 0;
    	_GFX_share(a, sizeof(float)*TOTALSIZE);
    	_GFX_share(b, sizeof(float)*TOTALSIZE);
    	_GFX_share(c, sizeof(float)*TOTALSIZE);
    	_GFX_share(d, sizeof(float)*TOTALSIZE);
    	_GFX_enqueue("vec_add", c, a, b, TOTALSIZE);
    	_GFX_enqueue("vec_add", d, c, a, TOTALSIZE);
    	_GFX_wait();
    	_GFX_unshare(a);
    	_GFX_unshare(b);
    	_GFX_unshare(c);
    	_GFX_unshare(d);

    So, we declare and initialize four arrays. With the _GFX_share function we state explicitly that this memory (the start address and length in bytes are given as the function's parameters) must be pinned, i.e. shared between the CPU and the graphics. Then we enqueue the desired function vec_add, which is declared with __declspec(target(gfx_kernel)) and, as always, uses a cilk_for loop. The host thread puts a second run of vec_add with new parameters into the queue without waiting for the first one to finish. With _GFX_wait we wait for all kernels in the queue to complete, and at the end we explicitly stop the memory pinning with _GFX_unshare.

    Do not forget that the API functions require the header file gfx_rt.h. In addition, to use cilk_for you need to include cilk/cilk.h.
    An interesting detail: by default the installed compiler could not find gfx_rt.h, and I had to add the path to its folder (C:\Program Files (x86)\Intel\Composer XE 2015\compiler\include\gfx in my case) to the project settings by hand.

    I also found one interesting option that I did not mention in the previous post when talking about code generation. If we know in advance which hardware we will run on, we can tell the compiler explicitly with the /Qgpu-arch option. So far there are only two choices: /Qgpu-arch:ivybridge or /Qgpu-arch:haswell. As a result, the linker will invoke the translator to convert the code from vISA to the architecture we need, and we save on JIT compilation.

    And finally, an important note about offload on Windows 7 (and DirectX 9): it is critical that the display be active, otherwise offload will not work. Windows 8 has no such limitation.

    And remember that we are talking about graphics integrated into the processor. The constructs described here do not work with discrete graphics; for that, use OpenCL.
