OpenCL: How to make this thing work
Many people who have given CUDA / OpenCL graphics accelerators a try did not get very good results. Yes, the benchmarks run and the simple examples show impressive speedups, but when it comes to real algorithms, getting a good result is very difficult.
How to make this technology work?
In this article I have tried to sum up six months of wrestling with OpenCL under Mandriva Linux and Mac OS X 10.6, on problems of complex string searching for bioinformatics. OpenCL was chosen because it is the "native" technology for the Mac (some Macs ship with AMD graphics cards, on which CUDA is not available even in theory), but the recommendations are fairly universal and apply just as well to NVIDIA CUDA.
So, what is needed for the graphics accelerator to work?
Code parallelism
1. You need to remember, and those who never knew them should learn, the techniques of code refactoring. If your algorithm is non-standard, be prepared to knead it for a long time before it parallelizes correctly. Throughout these torments the program's results must not change, so without a good test case you cannot even begin. The test case should not run too long, because you will run it often, but not too fast either, since you need to measure its speed. About 20 seconds is optimal.
2. Massive parallelism on an accelerator implies that the input and output of the algorithm are data arrays, and the size of these arrays cannot be smaller than the number of threads in the parallel algorithm. The algorithm being accelerated usually looks like a for loop (or several nested ones) in which no iteration depends on the results of the previous ones (see the sketch after this list). Do you have such an algorithm?
3. Parallel programming is not easy in itself, even without graphics accelerators. It is therefore highly recommended to parallelize your algorithm first with something simpler, for example OpenMP, where parallelism is switched on with a single directive. Just do not forget: if temporary buffer variables are used inside the loop, then under parallel execution each thread must get its own copy, either by replicating the buffer per iteration or by declaring it inside the loop body, as the sketch after this list does.
4. In order not to lose dozens of hours, you need to be 100% sure that at least the parallel part of the algorithm is completely free of memory errors. This can be verified, for example, with valgrind. This wonderful tool can also catch OpenMP parallelization errors, so catch everything in advance, before the code hits the accelerator, where there are far fewer tools.
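As an illustration of points 2 and 3, here is a minimal sketch (the computation itself is made up for the example): a loop whose iterations are independent of one another, turned parallel by a single OpenMP directive, with the temporary buffer declared inside the loop body so that each thread gets its own copy.

    #include <stdio.h>

    #define N 1000000
    #define W 8

    static float in[N], out[N];

    int main(void)
    {
        int i;

        for (i = 0; i < N; i++)
            in[i] = (float)i;

        /* One directive parallelizes the loop: no iteration depends
           on the results of the previous ones. */
        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            float tmp[W];           /* declared inside the loop: private to each thread */
            int j;
            for (j = 0; j < W; j++)
                tmp[j] = in[i] * j; /* stand-in for the real per-element work */
            out[i] = tmp[W - 1];
        }

        printf("out[%d] = %f\n", N - 1, out[N - 1]);
        return 0;
    }

Build with gcc -fopenmp; the same source built without the flag gives the serial reference for checking that the results do not change, and a run under valgrind shakes out the memory errors from point 4 before anything touches the accelerator.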
Accelerator performance considerations
1. You need to understand that the accelerator works with its own memory, whose volume is limited, and transfers back and forth are quite expensive (the host program after this list shows them explicitly). The specific numbers depend on the specific system, but this question has to be "kept in mind" all the time. Graphics accelerators work best with algorithms that receive a modest amount of data (but not less than the number of threads!) and crunch it mathematically for a long time. Long enough, but not too long: many systems impose a limit on the maximum execution time of a thread!
2. Memory is copied back and forth in contiguous regions. Therefore, the N-dimensional arrays usual in C, organized as arrays of pointers to pointers, are completely unsuitable; you have to use linearly organized multidimensional arrays (see the layout sketch after this list).
3. In the "device" code executed on the accelerator there is no memory allocation or deallocation, no input/output, and recursion is impossible. I/O and temporary buffers must be provided by the code running on the central processor.
4. The "ideal" algorithm for an accelerator is one where the same code runs over different data (SIMD) and does not branch on it (vector addition is the classic example). Any branching, and any loop whose number of iterations depends on the data, causes a significant slowdown.
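To illustrate point 2, a sketch of the difference between the two memory layouts (the function names are made up for the example):

    #include <stdlib.h>

    /* Unsuitable for the accelerator: each row is a separate allocation
       scattered across the heap, so the matrix cannot be copied to the
       device as one contiguous region. */
    float **make_ragged(int nrows, int ncols)
    {
        int i;
        float **a = malloc(nrows * sizeof(float *));
        for (i = 0; i < nrows; i++)
            a[i] = malloc(ncols * sizeof(float));
        return a;
    }

    /* Suitable: one contiguous block, transferred with a single copy;
       element (i, j) lives at a[i * ncols + j]. */
    float *make_linear(int nrows, int ncols)
    {
        return malloc((size_t)nrows * ncols * sizeof(float));
    }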
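And to tie points 1, 3 and 4 together, a minimal but complete OpenCL host program with a vector-addition kernel: all buffers are created and filled by the CPU-side code, the transfers to and from device memory are explicit, and the kernel itself is pure SIMD with not a single branch. Error checking is omitted to keep the sketch short; real code should test every status value it gets back.

    /* Minimal sketch: vector addition on an OpenCL 1.x device. */
    #include <stdio.h>
    #ifdef __APPLE__
    #include <OpenCL/opencl.h>
    #else
    #include <CL/cl.h>
    #endif

    static const char *src =
        "__kernel void vadd(__global const float *a,\n"
        "                   __global const float *b,\n"
        "                   __global float *c)\n"
        "{\n"
        "    int i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main(void)
    {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        size_t global = N;       /* total number of work-items (threads) */
        int i;

        for (i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Device-side buffers; the host arrays are copied in explicitly.
           These transfers are the expensive part from point 1. */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof a, a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof b, b, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", NULL);

        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

        /* Same code for every element of the data: SIMD (point 4). */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

        /* Copy the result back; CL_TRUE makes the call block until done. */
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

        printf("c[10] = %f (expected 30)\n", c[10]);
        return 0;
    }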
Some conclusions
Thus, to use the full power of GPGPU you will most likely have to rewrite your code with the above limitations in mind, and it is far from certain that it will work right away, or work at all. Alas, not all problems parallelize well by data. But the game is worth the candle: porting molecular dynamics problems to GPGPU allowed NVIDIA specialists to obtain very interesting results, which in the realities of our institute would mean counting not for three months on a supercomputer in another city, but for one day on a desktop machine. And that is worth the effort.
Materials used:
NVIDIA OpenCL
A series of articles on Habr
OpenCL for Mac OS X