Creating a bot to participate in the AI mini cup. GPU Experience


    Continued from Article 1 and Article 2.


    Below the cut, I will talk about the author's experience of using the GPU for computation, including as part of creating a bot for the AI mini cup. Or rather, this is an essay on the subject of the GPU.


    - You have a magical name...
    - You know, Joel?.. The magic is leaving...


    In childhood, at the age when chemistry is not yet taught in school or is only just starting, the author was fascinated by the combustion reaction. It so happened that his parents did not interfere, and the Moscow wasteland near his house was occasionally lit up by flashes of various children's activity: a homemade rocket on black powder, on sugar-nitrate caramel, and so on. Two circumstances limited these childhood fantasies: the decomposition of nitroglycerin in a home laboratory, its ceiling hissing from acids, and a trip to the police station after an attempt to get chemicals at one of the defense enterprises generously scattered around the Aviamotornaya metro area.


    And then came a physics school with Yamaha MSX computers, and a programmable MK calculator at home, and there was no more time for chemistry. The child's interests shifted to computers. And from his first acquaintance with the computer, what the author missed was the combustion reaction: his programs smoldered, there was none of that sensation of natural power. You could see the process of calculation being optimized in games, but at the time the author did not know that a call to sin() could be replaced with a table of that function's values; there was no Internet...


    So, here is how the author managed to get that feeling of joy from computing power, of clean burning: I use the GPU in calculations.


    There are several good articles on Habr about GPU computing, and there are many examples on the Internet, so I decided simply to write on a Saturday morning about personal impressions, and perhaps to nudge others toward massive parallelism.


    Let's start with simple forms. Several frameworks support GPU computing, but the best known are NVIDIA CUDA and OpenCL. We will take CUDA, and we immediately have to narrow our set of programming languages down to C++. There are libraries for connecting to CUDA from other programming languages, for example Alea GPU for C#, but that is rather the topic of a separate review article.


    Just as nobody ever managed to make a mass-market car with a jet engine, even though some of its characteristics are higher than those of an internal combustion engine, parallel computing cannot always be applied to real problems. The main requirement for parallel computing is a task containing some element of mass, of multiplicity. In our case of creating a bot, two things qualify as mass: the neural network (many neurons and neural connections) and the population of bots (computing the dynamics of movement and collisions for each bot takes time; with 300-1000 bots the central processor gives up, and you observe just the slow smoldering of your program, with long pauses between visualization frames).


    The best kind of mass character is when no element of the computation depends on the result of computing another element of the list. For example, the simple task of sorting an array is already overgrown with all kinds of tricks, since a number's position in the array depends on the other numbers, and the problem cannot be attacked head-on with a parallel loop. To simplify the wording: the first sign of successful mass character is that you do not need to change an element's position in the array; you can freely perform calculations on it, and read the values of other elements to do so, but you do not move it from its place. Something like a fairy tale: do not change the order of the elements, or the GPU will turn into a pumpkin.
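    A minimal sketch of that rule (the kernel and its names are illustrative, not taken from the bot's code): each thread reads its neighbors freely but writes only to its own position, into a separate output array.

        // Illustrative kernel: every thread reads its neighbors but writes
        // only its own slot; no element ever changes position.
        __global__ void smooth(const float* in, float* out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i > 0 && i < n - 1)
                out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
        }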


    Modern programming languages have constructs that can execute in parallel on several cores of a central processor, or on logical threads, and they are widely used; but here the author directs the reader's attention to massive parallelism, where the number of executing units runs to hundreds or thousands.


    And so the first of our parallel constructs appears: the parallel loop. For most tasks it will be enough. In a broad sense, it is the quintessence of parallel computing.


    An example of writing the main loop of a CUDA kernel (a grid-stride loop):


        int  tid = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
        int  threadN = gridDim.x * blockDim.x;              // total number of threads in the grid
        for (int pos = tid; pos < numElements; pos += threadN)
        {
            // Computations parameterized by pos. The loop iterations execute in
            // parallel; in other words, the loop splits into separate threads,
            // each with its own value of pos.
            // Important note: the execution order of individual threads does not
            // depend on you, so the thread with pos=1146 may run before the
            // thread with pos=956. Keep this in mind when working with parallel
            // algorithms. Here, as through the looking glass, many things are
            // unfamiliar to sequentially executed programs.
        }

    Much has been written in the CUDA documentation and in reviews about GPU blocks, about the threads spawned in those blocks, and about how to parallelize a task across them. But if you have an array of data that clearly consists of mass elements, use the loop form above: it is visually similar to an ordinary loop, which is pleasant in form, though unfortunately not in content.
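    For completeness, a minimal sketch of launching such a kernel from the main program (the names myKernel and d_data are illustrative placeholders): choose a block size and derive the number of blocks so that the threads cover the whole array.

        // Illustrative launch; myKernel and d_data are placeholder names.
        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, numElements);
        cudaDeviceSynchronize();  // wait for the GPU to finish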


    I think the reader already understands that the class of tasks narrows rapidly where massively parallel programming is concerned. If we are talking about creating games, 3D rendering engines, neural networks, video editing and other similar tasks, then the field for independent reader action is already well trodden: there are large programs, small programs, frameworks, known and unknown libraries for these tasks. What remains is exactly our topic: to create your own small computing rocket, not SpaceX or Roscosmos, but something homemade, yet truly fierce about calculations.



    Here a picture of a rocket in full flame is shown.


    Now, about tasks that a parallel loop in your hands cannot solve. The creators of CUDA, in the person of NVIDIA's developers, have already thought about this.


    There is the Thrust library, useful in places right up to the point where there is "no other option". By the way, I did not find a full review of it on Habr.


    To understand how it works, we first need to say three sentences about the principles of CUDA. If you need more words, you can read the link.


    The principles of CUDA:


    Computations take place on the GPU; its program is the kernel, and you have to write it in C. The kernel, in turn, communicates only with GPU memory, so you have to load data into the video processor's memory from the main program and unload it back into the program. Sophisticated algorithms in CUDA require flexibility of mind.
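    A minimal sketch of that round trip (the kernel myKernel and the sizes are illustrative assumptions):

        float h_data[1024];             // array in the main program (host)
        float* d_data;                  // pointer into GPU memory (device)
        size_t bytes = sizeof(h_data);

        cudaMalloc(&d_data, bytes);                                // allocate GPU memory
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // load data to the GPU
        myKernel<<<4, 256>>>(d_data, 1024);                        // run the kernel
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // unload the results
        cudaFree(d_data);                                          // release GPU memory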


    So, the Thrust library removes the routine and takes on some of the tasks that are "complex" for CUDA, such as summing arrays or sorting them. You no longer need to write a separate kernel, load pointers into memory, and copy data through those pointers into GPU memory. All the mystery happens before your eyes, in the main program, and at a speed only slightly inferior to bare CUDA. The Thrust library is itself written in CUDA, so in terms of performance they are berries from the same field.


    All you need to do in Thrust is create an array (thrust::vector) from its library, which is compatible with ordinary arrays (std::vector). Of course, not everything is quite that simple, but the meaning of what the author said is close to the truth. There really are two arrays: one on the GPU (device), the other in the main program (host).
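    A minimal sketch of this pair of arrays, together with the sorting and summing mentioned above (standard Thrust headers, illustrative values):

        #include <thrust/host_vector.h>
        #include <thrust/device_vector.h>
        #include <thrust/sort.h>
        #include <thrust/reduce.h>

        thrust::host_vector<int> h(4);      // array in the main program (host)
        h[0] = 3; h[1] = 1; h[2] = 4; h[3] = 1;

        thrust::device_vector<int> d = h;   // the host-to-GPU copy happens here
        thrust::sort(d.begin(), d.end());   // sorted on the GPU
        int sum = thrust::reduce(d.begin(), d.end(), 0); // summed on the GPU

        h = d;                              // and copied from the GPU back to the host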


    An example will show the simplicity of the syntax (X, Y, Z are thrust::device_vector<int> arrays):


        // initialize X to 0,1,2,3, ....
        thrust::sequence(X.begin(), X.end());
        // compute Y = -X
        thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());
        // fill Z with twos
        thrust::fill(Z.begin(), Z.end(), 2);
        // compute Y = X mod 2
        thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(), thrust::modulus<int>());
        // replace all the ones in Y with tens
        thrust::replace(Y.begin(), Y.end(), 1, 10);

    You can see how harmless it looks against the background of creating a CUDA kernel, and the set of functions in Thrust is large: from working with random variables, which in plain CUDA is handled by the separate cuRAND library (preferably run from a separate kernel), to sorting, summing, and writing your own functions with capabilities close to those of kernel functions.
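    Writing your own function means writing a small functor. Here is the classic saxpy example from the Thrust documentation, a user-defined operation executed on the GPU:

        // saxpy: Y <- a * X + Y, with the operation written as a functor
        struct saxpy_functor
        {
            const float a;
            saxpy_functor(float _a) : a(_a) {}
            __host__ __device__ float operator()(const float& x, const float& y) const
            {
                return a * x + y;
            }
        };

        // applied element-wise over device vectors X and Y:
        thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(2.0f));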


    The author has little experience with CUDA and C++: two months. With C#, about a year. This, of course, slightly contradicts the beginning of the article about his early acquaintance with computers, the physics school, and applied mathematics as an education. Let me put it this way: I am writing this article not because I have mastered everything so thoroughly, but because C++ turned out to be a comfortable language (I used to be a little afraid of it, against the background of Habr articles like "lambda functions, overloading of internal operators, let's redefine everything"), and it is clear that the years of its development have led to quite friendly development environments (IDEs). The language itself, in its latest versions, seems to collect garbage from memory; I do not know how it was before. At least, the author's programs, written with the simplest algorithmic constructs, ran bot calculations for days, and there were no memory leaks or other failures under high load. The same applies to CUDA: at first it seems complicated, but it is based on simple principles, and of course initialization on the GPU can be tricky in places when there is a lot of it, but then you will have your own small rocket, with smoke coming from the video card.


    Of the classes of objects for practicing with the GPU, the author recommends cellular automata. At one time their popularity and fashion were on the rise, but then neural networks seized the minds of developers. It goes all the way up to the claim that

    “every quantity in physics, including time and space, is finite and discrete”

    and what is that, if not a cellular automaton?
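    As a taste of the subject, a minimal sketch of one step of Conway's Game of Life as a CUDA kernel (the toroidal wrap-around at the edges is an illustrative choice):

        // One step of Conway's Game of Life; each thread updates one cell.
        // It reads neighbors from `in` and writes only its own cell in `out`.
        __global__ void lifeStep(const unsigned char* in, unsigned char* out,
                                 int w, int h)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= w || y >= h) return;

            int alive = 0;  // count the eight neighbors, wrapping around edges
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx != 0 || dy != 0)
                        alive += in[((y + dy + h) % h) * w + (x + dx + w) % w];

            unsigned char cell = in[y * w + x];
            out[y * w + x] = (alive == 3) || (cell && alive == 2); // birth or survival
        }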


    But it’s beautiful when three simple formulas can create this:



    If it would be interesting to read about cellular automata on CUDA, write in the comments; enough material has accumulated for a small article.
    And here are the cellular automata themselves (links to the source code are under the video):



    The idea of writing this article after breakfast, in one breath, seems to have worked out. Time for a second coffee. Have a nice weekend, reader.

