CUDA: block synchronization

When using parallel computing tools, a situation often arises in which an algorithm contains two sequential steps: (i) each j-th thread stores some intermediate result in the j-th memory cell, and then (ii) that thread must use the results of one or more "neighboring" threads. Obviously, the code must contain some kind of barrier that each thread crosses only after all threads have saved their intermediate results in the corresponding memory cells (step (i)). Otherwise, some thread may proceed to step (ii) while other threads have not yet finished step (i). Regrettably, the creators of CUDA decided that a special built-in mechanism for synchronizing an arbitrary number of threads on one GPU is not needed. So how can one deal with this? Judging by its search suggestions, Google is familiar with the issue, but I failed to find a ready-made recipe that satisfied my task, and there are some pitfalls on the way to the desired result for a beginner (which I am).



A few words about the CUDA architecture


To begin with, let me recall, based on the official documentation [1,2], slides [3,4], and materials from various third-party sites [5-11], the general picture that a programmer encounters when using CUDA. At the highest level of abstraction, he is given a parallel computing system with the SIMT architecture (Single-Instruction, Multiple-Thread): one instruction is executed by many more or less independent parallel threads. The totality of all these threads running as part of a single task (see Fig. 1) is called a grid.

Fig. 1.

The parallelism of a grid is ensured primarily by the presence on the video card of many identical scalar processors, on which the threads are actually executed (see Fig. 3). Physically (see Fig. 2), the scalar processors are parts of streaming multiprocessors (SM).

Fig. 2.

For example, my Tesla has 30 SMs, each containing 8 scalar processors. However, a grid can be run with a number of threads substantially larger than these 240 cores (1), thanks to hardware mechanisms for sharing the available resources (both the working time of the cores and the available memory). And certain features of the implementation of exactly these mechanisms determine the methodology for synchronizing threads when they access shared memory.

One of these important features is the grouping of threads into warps of 32 threads each, which in turn are parts of larger structures, the blocks. All threads of a block (for example, on my Tesla a block can contain at most 512 threads (1)) run strictly on one SM and therefore have access only to its resources. However, more than one block can be launched on one SM (see Fig. 3), in which case the resources are divided equally between them.

Fig. 3.

Each SM has a control unit that distributes processor time. It does so in such a way that at every moment all the cores of one SM execute strictly one warp, and when that warp completes, the next warp assigned to this SM is selected in a cunningly optimal way. It turns out that the threads of one warp are synchronized by this hardware feature of CUDA and are executed in a manner even closer to SIMD (Single Instruction, Multiple Data). But the threads of even one block that belong to different warps can get noticeably out of sync.
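
As a small illustration of this grouping (the kernel and array names below are mine, purely for demonstration), each thread can compute for itself which warp of its block it falls into; warpSize is a built-in device constant equal to 32 on current hardware:
__global__ void WhoAmI(int *warp_of, int *lane_of)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;   // global thread index in the grid
    warp_of[tid] = threadIdx.x / warpSize;           // which warp of the block this thread belongs to
    lane_of[tid] = threadIdx.x % warpSize;           // its position ("lane") inside that warp
}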

Another, no less important feature is the organization of memory in CUDA and the access of threads to its various parts. The highest degree of accessibility for threads is offered by global memory, physically implemented as the integrated circuits soldered onto the graphics card, the same video memory that nowadays is measured in gigabytes. Being located outside the processor makes this type of memory the slowest of those provided for computing on a video card. A smaller "degree of accessibility" belongs to shared memory: located in each SM (see Fig. 2), usually 16 KB in size (1), it is available only to those threads that run on the cores of that SM (see Fig. 1, Fig. 3). Since more than one block can be assigned for parallel execution to one SM, the entire amount of shared memory available in the SM is distributed equally between these blocks. It should be noted that shared memory is physically located very close to the SM cores, so its access speed is high, comparable to the speed of registers, the primary form of memory. It is registers that can serve as operands of elementary machine instructions, and they are the fastest memory. All registers of one SM are divided equally between all threads running on that SM, and the group of registers allocated to a particular thread is available to that thread and only to it. As an illustration of the power of CUDA (or, conversely, of the scale of the disaster): in the same Tesla, each SM provides 16384 32-bit general-purpose registers (1).
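
As a minimal sketch of these three kinds of memory (the kernel name and the block size of 64 threads are my assumptions, for illustration only), a kernel might touch all of them as follows:
__global__ void MemorySpacesDemo(float *g_data)       // g_data points into slow global memory
{
    __shared__ float buf[64];                         // shared memory: one copy per block, visible only to its threads
    float a;                                          // a register: private to each thread

    int i = threadIdx.x + blockIdx.x*blockDim.x;      // global thread index
    a = g_data[i];                                    // global memory -> register
    buf[threadIdx.x] = a;                             // register -> shared memory (only this thread's cell)
    g_data[i] = 2.0f * buf[threadIdx.x];              // shared memory -> global memory
}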

From all of the above we can conclude that interaction between the threads of one block should, if possible, go through their common fast shared memory, while threads of two different blocks can interact only through global memory. This is where the problem stated in the introduction arises: keeping track of the relevance of the data in memory areas that are open to all threads for reading and writing; in other words, the problem of thread synchronization. As already noted, within one block the threads of each warp are synchronized with each other. To synchronize the threads of a block regardless of warp membership, there are several barrier-type commands:
  • __syncthreads() is the surest way. This function makes each thread wait until (a) all other threads of the block have reached this point and (b) all shared and global memory accesses made by the threads of the block have completed and become visible to the threads of the block. This command must not be placed inside a conditional if statement; an unconditional call of this function by all threads of the block has to be ensured (see the sketch after this list).
  • __threadfence_block() makes the thread that called it wait until all the shared and global memory accesses it has made have completed and become visible to the threads of this block.
  • __threadfence() makes the thread that called it wait until its shared memory accesses become visible to the threads of this block, and its global memory accesses to all threads on the "device". By "device" the graphics adapter is meant.
  • __threadfence_system() is similar to __threadfence(), but it additionally enables synchronization with threads on the CPU ("host") when the very convenient page-locked memory is used. More details can be found in [1,2] and some other sources from the list below.
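
As a quick illustration, the first of these functions is typically used in the following write-barrier-read pattern (a minimal sketch; the block size, kernel and array names are mine, purely for demonstration):
#define BLOCK_SIZE 64                                      // assumed number of threads per block

__global__ void NeighborSum(float *out)
{
    __shared__ float partial[BLOCK_SIZE];

    partial[threadIdx.x] = (float)threadIdx.x;             // step (i): each thread stores its intermediate result
    __syncthreads();                                       // barrier: nobody proceeds until every cell is written

    float left  = partial[(threadIdx.x + BLOCK_SIZE - 1) % BLOCK_SIZE];  // step (ii): use the neighbors' results
    float right = partial[(threadIdx.x + 1) % BLOCK_SIZE];
    out[threadIdx.x + blockIdx.x*blockDim.x] = left + right;
}
Without the __syncthreads() call, a thread could read a neighboring cell of partial[] before that neighbor has written it; note, however, that this barrier protects only the threads of one block.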

The first command sets a single barrier for all threads of one block; the other three set an individual barrier for each thread that calls them. To synchronize the threads of the entire grid, something else has to be invented. Before considering this "something else", let us specify the task so that a meaningful example of C code can be given.

The task, more specifically


So, to be specific, consider the following example. Let two regions of 128 elements each be allocated in the adapter's global memory, for the arrays X[] and P[]. Let the array X[] be written from the host (by the central processor, from the computer's RAM). We create a grid of two blocks with 64 threads in each, that is, 128 threads in total (see Fig. 4).

Fig. 4.

Now step (i) can be performed: each thread with number j adds up all the elements of the array X[], writing the result to P[j]. Then step (ii) is performed: each j-th thread sums all the elements of the array P[], writing the result to the corresponding X[j]. Of course, using CUDA to perform the same computation 128 times in parallel is pointless, but in a real task each thread would have its own set of weighting coefficients with which the summation is carried out, and the transformations X -> P and back, P -> X, would be performed many times. In our example the coefficients are chosen equal to one, for clarity and simplicity, which does not violate generality.

Let us pass from theory to experiment. The algorithm is very transparent, and a person who has never dealt with multithreading might immediately suggest the following CUDA kernel code:
__global__ void Kernel(float *X, float *P)
{
    const int N = 128;                       // Number of elements and of threads used, as a constant.
    const int index = 
        threadIdx.x + blockIdx.x*blockDim.x; // Thread number.
    float a;                                 // Accumulator in a register; each thread has its own.
    /* step (i): */
    a = X[0];
    for(int j = 1; j < N; ++j)               // The summation loop itself
        a += X[j];
    P[index] = a / N;                        // Normalize so that the numbers do not grow large.
    /* end of step (i). */
    /* step (ii): */
    a = P[0];
    for(int j = 1; j < N; ++j)               // The summation loop itself
        a += P[j];
    X[index] = a / N;                        // Normalize so that the numbers do not grow large.
    /* end of step (ii). */
}

Repeated execution of this kernel shows that the array P[] is the same from run to run, while X[] can sometimes differ. Moreover, if there is a difference, it shows up not in a single element X[j] but in a group of 32 consecutive elements, and the index of the first element in the erroneous group is a multiple of 32. This is precisely a manifestation of the synchronization within warps and of the divergence between threads of different warps: if an error occurred in some thread, then all the other threads of its warp have it too. If we apply the synchronization mechanism proposed by the CUDA developers,
__global__ void Kernel(float *X, float *P)
{
    ...
    /* end of step (i). */
    __syncthreads();
    /* step (ii): */
    ...
}

then we will ensure that every thread of a block gives the same result, and if it is wrong somewhere, it is wrong for the whole block. Thus, it remains to somehow synchronize the different blocks.

Solution methods


Unfortunately, I only know two methods:
  1. A CUDA kernel terminates if and only if all of its threads terminate; thus, one kernel can be split into two and called from the main program sequentially (a sketch of this approach follows the list);
  2. Come up with a flag system in global memory.
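
A minimal sketch of the first method for our task might look like this (the kernel and function names are mine; the bodies are simply the two halves of the kernel shown above). Since kernel launches issued to the same stream execute one after another, all blocks finish step (i) before any block starts step (ii):
__global__ void Kernel_StageI(float *X, float *P)    // step (i) only
{
    const int N = 128;
    const int index = threadIdx.x + blockIdx.x*blockDim.x;
    float a = X[0];
    for (int j = 1; j < N; ++j)
        a += X[j];
    P[index] = a / N;
}
__global__ void Kernel_StageII(float *X, float *P)   // step (ii) only
{
    const int N = 128;
    const int index = threadIdx.x + blockIdx.x*blockDim.x;
    float a = P[0];
    for (int j = 1; j < N; ++j)
        a += P[j];
    X[index] = a / N;
}
/* Host side: the boundary between two sequential launches acts as a grid-wide barrier. */
void TransformMKL(float *devX, float *devP, int repeats)
{
    for (int r = 0; r < repeats; ++r)
    {
        Kernel_StageI <<<2, 64>>>(devX, devP);
        Kernel_StageII<<<2, 64>>>(devX, devP);
    }
    cudaDeviceSynchronize();                         // let the host wait for the last launch to finish
}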


I did not like the first option very much, because in my task such kernels have to be called often (thousands of times), and there is reason to fear additional overhead from starting the kernel itself, if only because at the beginning of each kernel some variables have to be prepared and the kernel function arguments processed. It would be more logical and faster to do this once in one "large" kernel and then not involve the CPU at all, leaving the graphics adapter to stew in its own juices with the data in its own memory.

As for the second option, with a system of flags, a similar mechanism is mentioned in section "B.5 Memory Fence Functions" of [1]; however, a slightly different kernel algorithm is considered there. To implement block synchronization we introduce two functions: the first prepares the value of the counter of finished blocks, and the second plays the role of a barrier, delaying each thread until all blocks are done. For example, these functions and the kernel that uses them may look like this:
__device__ unsigned int count;                 // Flag-counter of finished blocks. Four bytes are
                                               //allocated for it in the device's global memory.
/* Function that initializes the flag-counter: */
__device__ void InitSyncWholeDevice(const int index)
{
    if (index == 0)                            // The first thread of the grid (index 0) writes the initial
        count = 0;                             //zero value into the block counter.
    if (threadIdx.x == 0)                      // The first thread of each block waits until the flag-
        while (count != 0);                    //counter really becomes zero.
    // Make the remaining threads of each block wait until the first ones leave the loop:
    __syncthreads();
    // Done: the flag-counter is written, and all threads on the device run more or less abreast.
}
/* Function that synchronizes the threads on the device: */
__device__ void SyncWholeDevice()
{
    // Variable for the counter value before the increment:
    unsigned int oldc;
    // Each thread waits until what it wrote to gmem and smem becomes visible to the whole grid:
    __threadfence();
    // The first thread of each block atomically increments (once per block)
    //the flag-counter:
    if (threadIdx.x == 0)
    {
        // oldc receives the value of count before the "+1":
        oldc = atomicInc(&count, gridDim.x-1);
        // Let the thread wait until its increment "reaches" the cell in gmem:
        __threadfence();
        // If this is the last block (the others have already incremented count and are waiting in the
        //loop below), there is no need for it to poll count: we have already made sure its increment
        //is written to gmem. If we are in a block that has not yet "finished", its first thread spins
        //until all the other blocks "report" the completion of their work.
        if (oldc != (gridDim.x-1))
            while (count != 0);
    }
    // Make the threads of each block wait until the first ones leave the loop:
    __syncthreads();
}
__global__ void Kernel_Synced(float *X, float *P)
{
    InitSyncWholeDevice(threadIdx.x + blockIdx.x*blockDim.x);
    ...
    /* end of step (i). */
    SyncWholeDevice();
    /* step (ii): */
    ...
}

That, it would seem, is all: the flag has been set up, the functions written, and it only remains to compare the performance of the first and second methods. But, unfortunately, the SyncWholeDevice() function as written will increment the counter yet provide no barrier delay. Why, one might ask? After all, there is the while loop. Here we sail right up to the pitfall mentioned in the introduction: if you look at the ptx file generated by the nvcc compiler [12-14], it turns out that the compiler kindly throws away a loop that is, from its point of view, empty. You can force the compiler not to optimize the loop away in at least two ways.

An explicit insertion in the ptx assembler will certainly work. For example, such a function, a call to which should replace the while loop:
__device__ void do_while_count_not_eq(int val)
{
	asm("{\n\t"
		"$my_while_label: \n\t"
		"   .reg .u32      r_count;            \n\t"
		"   .reg .pred     p;                  \n\t"
		"   ld.global.u32  r_count, [count];   \n\t"
		"   setp.ne.u32    p, r_count, %0;     \n\t"
		"@p bra            $my_while_label;    \n\t"
		"}\n\t"
		: : "r"(val));
}
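
Inside SyncWholeDevice() only the spin-wait then changes; the rest of the function stays exactly as shown above (a fragment):
...
        // oldc receives the value of count before the "+1":
        oldc = atomicInc(&count, gridDim.x-1);
        // Let the thread wait until its increment "reaches" the cell in gmem:
        __threadfence();
        if (oldc != (gridDim.x-1))
            do_while_count_not_eq(0);          // the while loop replaced by the assembler spin-wait
...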

Another, syntactically more elegant way is to use the volatile specifier when declaring the counter flag. It tells the compiler that this variable in global (or shared) memory can be changed by any thread at any time, so accesses to it must not be optimized away. Only two lines of the code need to be changed:
__device__ volatile unsigned int count;        // Flag-counter of finished blocks. Four bytes are
                                               //allocated for it in the device's global memory.
...
        // oldc receives the value of count before the "+1":
        oldc = atomicInc((unsigned int*)&count, gridDim.x-1);
...


Assessment of solution methods


Let us now carry out a rough theoretical estimate of the performance of the two block synchronization methods. Rumor has it that a kernel launch takes ~10 microseconds; this is the price of synchronization by multiple kernel launches. In the case of synchronization with the loop-based barrier, ~10 threads (one per block, depending on how many blocks there are) increment and poll a single cell in global memory, where each access takes about 500 clock cycles. Let each block carry out about 3 such operations. Then about 10 * 500 * 3 = 1.5 * 10^4 clock cycles are spent on the synchronization, which at a core frequency of 1.5 GHz gives 1.0 * 10^(-5) s = 10 μs. That is, the order of magnitude is the same.

But, of course, it is interesting to look at the results of at least some tests. In Fig. 5 the reader can see a comparison of the time spent on performing 100 consecutive transformations X -> P -> X, repeated 10 times for each grid configuration. The 10 repetitions are done to average the time required for the 100 transformations (2).

Fig. 5.

The horizontal plane shows the number of launched blocks and the number of threads in each of them. The vertical axis represents the percentage time gain of the "single kernel launch" method (let us call it SKL) relative to the "multiple kernel launch" method (MKL). It is clearly seen that the gain on the grid configurations under consideration, although very small, is almost always positive. However, the more blocks there are, the less the MKL method lags behind; for 32 blocks it even slightly outperforms the SKL method. This is because the more blocks there are, the more threads (those with threadIdx.x == 0) read the variable count from slow global memory, and there is no mechanism of the kind "read it once, hand the value to all threads". If we consider the change in relative performance depending on the number of threads in a block, with the number of blocks fixed, some regularity can also be noticed. But here effects unknown to the author are at work, related to the synchronization of threads within a block and the scheduling of warps on the SM, so we refrain from further comments.

It is also interesting to look at the performance with the same total number of working threads (1024) but with different divisions into blocks. Figure 6 shows the renormalized time spent on the 100 * 10 transformations described above for the two methods (MKL and SKL).

Fig. 6.

In fact, this is a diagonal "slice" of Fig. 5. It is clearly seen that at first, as the blocks grow larger, the performance of both synchronization methods improves equally. The CUDA developers warn of this effect in the official documentation [2], but the author, again, unfortunately, does not know the details of the mechanisms behind it. The narrowing of the gap, and even the loss of the SKL method with the finest partition into blocks, is connected, as already mentioned, with the growing number of reads of the variable count.

It should be noted that the tests were carried out with the SKL method implemented by replacing the while loop with the ptx-assembler insert. Using the volatile specifier sometimes slows the process down and sometimes speeds it up, depending on the grid configuration: the slowdown reaches 0.20% and the speedup 0.15%. This behavior is most likely determined by the differences between the compiler's implementation of the while loop and the hand-written ptx-assembler insert, and it allows us to consider both implementations of the SKL method equally productive.

Conclusion


In this article I have tried to cover, at a basic level, the problem of thread synchronization and the methods of block synchronization, and, with the help of a few tests and pictures, to give a general description of the CUDA system. In addition, in the source code of the test program (2) the reader will find another example of the reliable use of a buffer in shared memory (the threads are synchronized via __syncthreads()). I hope someone finds this useful. For me personally, this information, collected in one place, would have saved several days of experimenting with the code and of "googling", since I have a silly tendency not to read documentation very attentively.


(1) To obtain technical information about the adapters available in the computer, it is suggested to use the CUDA API function cudaGetDeviceProperties(...) [1-2, 15]; a small sketch of such a query is given below.
(2) The source code for the test program, uploaded to pastebin.com.
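
As an illustration of footnote (1), a minimal host-side sketch of such a query might look like this (device number 0 and the particular fields printed are my choice):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                         // query adapter number 0
    printf("SMs: %d\n",                       prop.multiProcessorCount);
    printf("max threads per block: %d\n",     prop.maxThreadsPerBlock);
    printf("shared memory per block: %u B\n", (unsigned)prop.sharedMemPerBlock);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    return 0;
}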


List of sources of information


[1] CUDA C Programming Guide
[2] CUDA C Best Practices Guide
[3] Advanced CUDA Webinar: Memory Optimizations
[4] S. Tariq, An Introduction to GPU Computing and CUDA Architecture
[5] Vanderbilt University, ACCRE, GPU Computing with CUDA
[6] OmSTU, Faculty of Radio Engineering, Department of Integrated Information Protection, retraining program "Programming for GPUs"
[7] Summer Supercomputer Academy, High-performance computing on clusters using graphics accelerators NVIDIA
[8] iXBT.com: NVIDIA CUDA - non-graphical computing on GPUs
[9] cgm.computergraphics.ru: Introduction to CUDA technology
[10] THG.ru: nVidia CUDA: computing on the video card, or the death of the CPU?
[11] steps3d.narod.ru: CUDA basics; Programming in CUDA (part 2)
[12] The CUDA Compiler Driver (NVCC)
[13] Using Inline PTX Assembly in CUDA
[14] PTX: Parallel Thread Execution ISA Version 3.0
[15] CUDA API Reference Manual (PDF, HTML online)
