CUDA: Working with memory. Part II
The main topic of this part is optimizing work with global memory when programming GPUs.
The GPU has a number of peculiarities that, if ignored, can cost you a several-fold loss of performance when using global memory. But if you take all the subtleties into account, you can get genuinely efficient CUDA programs.
Let's get started.
What is wrong with global memory?
Global memory has the largest capacity of all the memory types, but in terms of its characteristics, namely read and write speed, it is the slowest.
In the previous part, I looked at an example of matrix transposition. To improve performance, a shared memory buffer was used, which sped the code up almost four-fold. It may have seemed strange that adding an extra intermediary gave such a gain. The secret lies in the correct use of global memory.
There are two ways to optimize work with global memory: aligning the sizes of the types used and coalescing memory requests.
Aligning the sizes of the types used
Aligning a data type allows a global memory request to be compiled into a single GPU instruction; otherwise, the compiler generates additional code, which can noticeably reduce performance. For optimal performance, the data type should be 4, 8, or 16 bytes in size.
If the type size is not 4, 8, or 16 bytes, it is better to use a type of larger size or to align it with the keyword __align__(alignment size).
An example of optimization using the built-in CUDA types.
The int3 type is 12 bytes in size, so memory access will not be optimal:
__device__ int3 data[512];

__global__ void initData()
{
    int idx = threadIdx.x;
    data[idx] = make_int3(idx, idx, idx);
}
It is better to use the int4 type (16 bytes), even if you don’t need the fourth component:
__device__ int4 data[512];

__global__ void initData()
{
    int idx = threadIdx.x;
    data[idx] = make_int4(idx, idx, idx, 0);
}
When working with structures, you must use the __align__ keyword, which allows you to align the type to a given size.
An example of aligning the size of a structure.
Before alignment, the size of the vector3 structure is 12 bytes:
#include <cstdio>

struct vector3
{
    float x;
    float y;
    float z;
};

int main()
{
    printf("%zu\n", sizeof(vector3));
    return 0;
}
The number 12 will be displayed on the console.
After alignment, the size of vector3 will be 16 bytes:
#include <cstdio>

struct __align__(16) vector3
{
    float x;
    float y;
    float z;
};

int main()
{
    printf("%zu\n", sizeof(vector3));
    return 0;
}
The number 16 will be displayed on the console.
Using coalesced memory requests
A much greater performance gain can be obtained by combining many global memory requests into a single one (requests are sometimes also called transactions). The nVidia documentation calls this coalescing global memory accesses. But before discussing what is needed to coalesce memory requests, you need to know a couple of additional things about the GPU.
The GPU uses the so-called warp to control thread execution. From a software point of view, a warp is a pool of threads. It is within a warp that the threads requested at kernel launch run in parallel, and it is within a warp that threads can interact with each other. The warp size for all GPUs is 32, that is, only 32 threads execute in parallel within a warp. Several warps can run on the GPU at the same time; their number is determined by the amount of available register and shared memory. Another interesting feature is that memory is accessed per half-warp: first the first 16 threads access memory, then the second group of 16 threads. Why access happens this way I cannot say for sure; I can only assume that it is related to the GPU's primary task, graphics processing.
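To make these terms concrete, here is a minimal sketch (the kernel and array names are my own, not from the original) showing how a thread can compute its warp, half-warp, and lane numbers from its index; warpSize is a built-in CUDA variable equal to 32:

__global__ void whereAmI(int *warpId, int *halfWarpId, int *lane)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    warpId[idx]     = threadIdx.x / warpSize;        // warp number within the block
    lane[idx]       = threadIdx.x % warpSize;        // position within the warp, 0..31
    halfWarpId[idx] = (threadIdx.x % warpSize) / 16; // 0 = first half-warp, 1 = second
}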
Now let's consider the requirements for coalescing global memory requests. Do not forget that memory access happens per half-warp.
The conditions required for coalescing depend on the Compute Capability version; I give them for versions 1.0 and 1.1. More details can be found in the nVidia documentation.
- Threads must access 32-bit words, producing a single 64-byte block (transaction), or 64-bit words, producing a single 128-byte block (transaction)
- If 128-bit words are accessed, two transactions are performed, each returning 128 bytes of data
- Threads must access memory sequentially: each subsequent thread must correspond to the next word in memory (some threads may not access their corresponding words at all)
- All 16 words must lie within the aligned block of memory being accessed
A couple of notes on the conditions:
- A word means any data type; what matters is that it has the required size.
- Word sizes are given in bits, while the sizes of the resulting data blocks are given in bytes.
Fig. 1. Access patterns that produce a coalesced memory transaction
Figure 1 shows examples of global memory accesses that result in a single transaction. On the left, all conditions are met: each thread of the half-warp accesses the 32-bit word corresponding to it in order, and the starting memory address is aligned to the size of the transaction block (16 threads * 4 bytes = 64 bytes). On the right is an example where some threads of the block do not access their corresponding words in memory at all.
Fig. 2. Access patterns that do not produce a coalesced memory transaction
Figure 2 shows examples that do not result in coalesced global memory access. On the left, the requirement that threads access their corresponding words in memory is violated. On the right, the requirement that the memory address be aligned to the block size is violated. As a result, instead of one combined transaction we get 16 separate ones, one for each thread of the half-warp.
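To show the difference in code, here is a small sketch (the kernel names, the offset parameter, and the array size N are my own example, not from the article). copyAligned satisfies the conditions above, so on Compute Capability 1.0/1.1 devices each half-warp's 16 reads and 16 writes coalesce into single transactions; copyShifted reads the same data starting at a non-zero offset, which breaks the alignment condition and splits the accesses into separate transactions:

#define N 1024  // example array size

__device__ float src[N + 16];
__device__ float dst[N];

// Coalesced: thread k of a half-warp accesses the k-th 32-bit word of an aligned 64-byte block.
__global__ void copyAligned()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    dst[idx] = src[idx];
}

// Not coalesced on Compute Capability 1.0/1.1: an offset that is not a multiple of 16
// breaks the alignment condition, so each half-warp's reads become 16 separate transactions.
__global__ void copyShifted(int offset)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    dst[idx] = src[idx + offset];
}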
Structure of arrays or array of structures?
A few words should be said about working with structures and how to gain performance there. If you need an array of structures, it is better to create separate arrays for the structure's components; thanks to coalescing, this reduces the number of requests to global memory.
Consider an example.
Inefficient use of global memory:
struct __align__(16) vec3
{
    float x;
    float y;
    float z;
};

__device__ vec3 data[SIZE];

__global__ void initData()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    data[idx].x = idx;
    data[idx].y = idx * 2;
    data[idx].z = idx * 3;
}
It is more efficient to use separate arrays:
__device__ float x[SIZE];
__device__ float y[SIZE];
__device__ float z[SIZE];

__global__ void initArr()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    x[idx] = idx;
    y[idx] = idx * 2;
    z[idx] = idx * 3;
}
In the first case, with an array of vectors, a separate memory request is needed for each field of each structure; in the second case, thanks to coalescing, three requests per half-warp (one per array) are enough. On average, this approach roughly doubles performance.
Conclusion
To sum up everything said above, I want to give the single most important piece of advice for working with memory in CUDA:
NEVER ATTEMPT TO CHANGE THE VALUE OF ONE MEMORY CELL WITH MULTIPLE THREADS SIMULTANEOUSLY.
This is the most common mistake in multithreaded programming. CUDA does not by default guarantee atomic access to a given memory location for each thread, so the results may not be what you expect. Although atomic operations do exist in CUDA, it is better to follow the concept of immutable data and store the results of computations in new objects, which are then passed on to the next stages of the computation.
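As a small illustration of this advice (the names and sizes below are my own example, not from the article): instead of having every thread add its value into the same memory cell, each thread writes its partial result into its own slot of an output array, and the final sum is computed in a later step. Where a single shared cell is truly unavoidable, CUDA's atomicAdd can be used, at the cost of serializing the conflicting accesses (atomicAdd on float requires Compute Capability 2.0 or higher).

#define N 256  // example size

__device__ float input[N];
__device__ float partial[N]; // one slot per thread: no two threads write the same cell
__device__ float total;      // a single cell: safe only with atomic operations

// Preferred: each thread stores its result in its own cell; a later kernel
// (or the host) reduces the partial array to one value.
__global__ void squareEach()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    partial[idx] = input[idx] * input[idx];
}

// If a single accumulator is unavoidable, use an atomic operation.
__global__ void accumulateAtomic()
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    atomicAdd(&total, input[idx] * input[idx]);
}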