Parallel programming with CUDA. Part 1: Introduction


Another CUDA article - why?


There are already quite a few good articles on CUDA on Habr - one, two, and others. However, a search for the combination "CUDA scan" returned only 2 articles that had nothing to do with the scan algorithm on the GPU, even though it is one of the most basic parallel algorithms. So, inspired by the Udacity course I recently watched - Intro to Parallel Programming - I decided to write a more complete series of articles about CUDA. I will say right away that the series is based on that course, and if you have the time, it will be far more useful to take it.

Content


The following articles are currently planned:
Part 1: Introduction.
Part 2: GPU hardware and parallel communication patterns.
Part 3: Fundamental GPU algorithms: reduce, scan, and histogram.
Part 4: Fundamental GPU algorithms: compact, segmented scan, sorting. Practical application of some algorithms.
Part 5: Optimization of GPU programs.
Part 6: Examples of parallelization of sequential algorithms.
Part 7: Additional topics of parallel programming, dynamic parallelism.

Latency vs. throughput



The first question everyone should ask before using a GPU to solve their problem is: what is a GPU actually good at, and when is it worth using? To answer that, we need to define two concepts:
Latency - the time it takes to complete a single instruction/operation.
Throughput - the number of instructions/operations completed per unit of time.
A simple example: we have a car with a speed of 90 km/h and a capacity of 4 people, and a bus with a speed of 60 km/h and a capacity of 20 people. If we take moving 1 person over 1 kilometer as one operation, then the latency of the car is 3600/90 = 40 s - that is how long it takes 1 person to cover 1 kilometer - and its throughput is 4/40 = 0.1 operations/second; the latency of the bus is 3600/60 = 60 s, and its throughput is 20/60 ≈ 0.33 operations/second.
So, the CPU is the car and the GPU is the bus: it has high latency but also high throughput. If, for your task, the latency of each individual operation matters less than the number of operations per second, it is worth considering a GPU.
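For readers who prefer to see the arithmetic as code, here is a tiny sketch (plain C++, values taken straight from the example above):
#include <iostream>

int main()
{
    // Car: 90 km/h, 4 seats; bus: 60 km/h, 20 seats.
    // One "operation" = moving one person over a distance of one kilometer.
    const double carLatency = 3600.0 / 90.0;           // 40 s to cover 1 km
    const double busLatency = 3600.0 / 60.0;           // 60 s to cover 1 km
    const double carThroughput = 4.0 / carLatency;     // 0.1 operations/s
    const double busThroughput = 20.0 / busLatency;    // ~0.33 operations/s
    std::cout << "car: latency " << carLatency << " s, throughput " << carThroughput << " ops/s" << std::endl;
    std::cout << "bus: latency " << busLatency << " s, throughput " << busThroughput << " ops/s" << std::endl;
    return 0;
}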

CUDA Basic Concepts and Terms


So, let's deal with CUDA terminology:

  • Device - the GPU. It plays the role of the "subordinate": it does only what the CPU tells it to.
  • Host - the CPU. It plays the controlling role: it launches tasks on the device, allocates memory on the device, and moves data to/from the device's memory. And yes, using CUDA assumes that the device and the host each have their own separate memory.
  • Kernel - a task launched by the host on the device.

When using CUDA, you simply write code in your favorite programming language (there is a list of supported languages besides C and C++), after which the CUDA compiler generates code separately for the host and separately for the device. A small caveat: device code must be written in C with some 'CUDA extensions'.

The main stages of the CUDA program


  1. The host allocates the right amount of memory on the device.
  2. The host copies data from its memory to the device’s memory.
  3. The host launches certain kernels on the device.
  4. The device runs the kernel.
  5. The host copies the results from the device memory to its memory.

Naturally, to use the GPU efficiently, the ratio of time spent on kernel work to time spent on memory allocation and data transfers should be as large as possible.
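To make these stages concrete, here is a minimal sketch - not from this article's project, just a made-up kernel that doubles every element of an array; the <<<...>>> launch syntax is explained in the next section, and error checking is omitted for brevity:
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel used only to illustrate the five stages above.
__global__ void doubleElements(float *data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_data = new float[n];
    for (int i = 0; i < n; ++i) h_data[i] = float(i);

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));                                // 1. allocate memory on the device
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice); // 2. copy host -> device
    doubleElements<<<(n + 255) / 256, 256>>>(d_data, n);                   // 3-4. launch the kernel; the device runs it
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost); // 5. copy device -> host

    printf("h_data[2] = %f\n", h_data[2]);                                 // prints 4.0
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}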

Kernels


Let us look in more detail at how kernel code is written and how kernels are launched. An important principle is that kernels are written as (almost) ordinary sequential programs - you will not see thread creation and startup inside the kernel code itself. Instead, to organize parallel computation, the GPU launches a large number of copies of the same kernel in different threads - or rather, you yourself specify how many threads to launch. And, returning to the question of using the GPU efficiently - the more threads you launch (provided they all do useful work), the better.
Kernel code differs from ordinary sequential code in the following ways:
  1. Inside a kernel you can find out the "identifier", or, more simply, the position of the thread that is currently executing - using this position, we make sure that the same kernel works on different data depending on the thread it runs in. By the way, this way of organizing parallel computation is called SIMD (Single Instruction Multiple Data) - several processors perform the same operation simultaneously on different data.
  2. In some cases, it is necessary to use various synchronization methods in the kernel code.

How do we specify the number of threads a kernel will be launched in? Since the GPU is, after all, a Graphics Processing Unit, this naturally influenced the CUDA model, namely the way the number of threads is specified:
  • First, the dimensions of the so-called grid are set, in 3D coordinates: grid_x, grid_y, grid_z . As a result, the grid will consist of grid_x * grid_y * grid_z blocks.
  • Then the block dimensions are set, also in 3D coordinates: block_x, block_y, block_z . As a result, the block will consist of block_x * block_y * block_z threads. In total, we get grid_x * grid_y * grid_z * block_x * block_y * block_z threads. An important note - the maximum number of threads in one block is limited and depends on the GPU model - typical values are 512 (older models) and 1024 (newer models).
  • Inside the kernel, the threadIdx and blockIdx variables with fields x, y, z are available - they contain the 3D coordinates of the thread within its block and of the block within the grid, respectively. The blockDim and gridDim variables with the same fields are also available - the block and grid dimensions, respectively.

As you can see, this way of launching threads really is well suited to processing 2D and 3D images: for example, if you need to process each pixel of a 2D or 3D image in a certain way, then, after choosing the block dimensions (depending on the image size, the processing method, and the GPU model), the grid dimensions are chosen so that the whole image is covered, possibly with a surplus - in case the image dimensions are not evenly divisible by the block dimensions (see the sketch below).
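Here is a minimal sketch with a made-up invertPixels kernel (purely illustrative, not part of this article's code), showing how blockIdx/threadIdx and the grid dimensions fit together for a 2D image of numCols × numRows pixels:
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per pixel of a grayscale image.
__global__ void invertPixels(unsigned char *image, int numRows, int numCols)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;   // column of this thread's pixel
    int y = blockDim.y * blockIdx.y + threadIdx.y;   // row of this thread's pixel
    if (x >= numCols || y >= numRows) return;        // threads in the "surplus" do nothing
    image[y * numCols + x] = 255 - image[y * numCols + x];
}

void launchInvert(unsigned char *d_image, int numRows, int numCols)
{
    dim3 blockSize(16, 16, 1);  // 256 threads per block, within the 512/1024 limit
    // Ceiling division: the grid covers the whole image, possibly with a surplus.
    dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                  (numRows + blockSize.y - 1) / blockSize.y,
                  1);
    invertPixels<<<gridSize, blockSize>>>(d_image, numRows, numCols);
}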

Writing our first CUDA program


Enough theory - time to write code. Instructions for installing and configuring CUDA on different operating systems are at docs.nvidia.com/cuda/index.html . For simplicity, we will use OpenCV to work with image files and OpenMP to compare CPU and GPU performance.
The task is quite simple: convert a color image to grayscale. The brightness of a pixel pix in grayscale is calculated by the formula: Y = 0.299 * pix.R + 0.587 * pix.G + 0.114 * pix.B .
First, write the skeleton of the program:
main.cpp
#include <iostream>
#include <chrono>
#include <cstring>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <cuda_runtime.h>
#include "openMP.hpp"
#include "CUDA_wrappers.hpp"
#include "common/image_helpers.hpp"
using namespace cv;
using namespace std;
int main( int argc, char** argv )
{
  using namespace std::chrono;
  if( argc != 2)
  {
    cout <<" Usage: convert_to_grayscale imagefile" << endl;
    return -1;
  }
  Mat image, imageGray;
  uchar4 *imageArray;
  unsigned char *imageGrayArray;
  prepareImagePointers(argv[1], image, &imageArray, imageGray, &imageGrayArray, CV_8UC1);
  int numRows = image.rows, numCols = image.cols;
  auto start = system_clock::now();
  RGBtoGrayscaleOpenMP(imageArray, imageGrayArray, numRows, numCols);
  auto duration = duration_cast<milliseconds>(system_clock::now() - start);
  cout<<"OpenMP time (ms):" << duration.count() << endl;
  memset(imageGrayArray, 0, sizeof(unsigned char)*numRows*numCols);  
  RGBtoGrayscaleCUDA(imageArray, imageGrayArray, numRows, numCols);
  return 0;
}


Everything is pretty obvious here - we read the image file, prepare pointers to the color and the grayscale image, run the OpenMP version and the CUDA version, and measure the time. The prepareImagePointers function looks like this:
prepareImagePointers
template <typename T1, typename T2>
void prepareImagePointers(const char * const inputImageFileName,
                          cv::Mat& inputImage, 
                          T1** inputImageArray, 
                          cv::Mat& outputImage,
                          T2** outputImageArray, 
                          const int outputImageType)
{
  using namespace std;
  using namespace cv;
  inputImage = imread(inputImageFileName, IMREAD_COLOR);
  if (inputImage.empty()) 
  {
    cerr << "Couldn't open input file." << endl;
    exit(1);
  }
  //allocate memory for the output
  outputImage.create(inputImage.rows, inputImage.cols, outputImageType);
  cvtColor(inputImage, inputImage, cv::COLOR_BGR2BGRA);
  *inputImageArray = (T1*)inputImage.ptr(0);
  *outputImageArray  = (T2*)outputImage.ptr(0); 
}


I went for a little trick here: the thing is that we do very little work per pixel, so with CUDA the above-mentioned problem of the ratio of useful work time to memory allocation and copying time arises, and as a result the total time of the CUDA version would be larger than that of the OpenMP version - but we want to show that CUDA is faster :) Therefore, for CUDA only the time spent on the actual image conversion will be measured, without the memory operations. In my defense I will say that for a large class of tasks the useful time still dominates, and CUDA is faster even with memory operations taken into account.
Next, we write the code for the OpenMP version:
openMP.hpp
#include <cuda_runtime.h>
#include <omp.h>
void RGBtoGrayscaleOpenMP(uchar4 *imageArray, unsigned char *imageGrayArray, int numRows, int numCols)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < numRows; ++i)
    {
        for (int j = 0; j < numCols; ++j)
        {
            const uchar4 pixel = imageArray[i*numCols+j];
            imageGrayArray[i*numCols+j] = 0.299f*pixel.x + 0.587f*pixel.y+0.114f*pixel.z;
        }
    }
}


Everything is pretty straightforward - we just added the omp parallel for directive to the single-threaded code - that is the beauty and power of OpenMP. I tried playing around with the schedule parameter, but the result was only worse than without it (an example of what such a variant might look like is shown below).
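For reference, "playing with the schedule parameter" means adding a schedule clause to the directive; a hypothetical variant (the function name RGBtoGrayscaleOpenMPDynamic and the chunk size are made up purely for illustration) could look like this:
// A hypothetical variant with an explicit schedule clause - one way of
// "playing with the schedule parameter"; the default turned out to work better here.
void RGBtoGrayscaleOpenMPDynamic(uchar4 *imageArray, unsigned char *imageGrayArray, int numRows, int numCols)
{
    #pragma omp parallel for collapse(2) schedule(dynamic, 1024)
    for (int i = 0; i < numRows; ++i)
    {
        for (int j = 0; j < numCols; ++j)
        {
            const uchar4 pixel = imageArray[i*numCols+j];
            imageGrayArray[i*numCols+j] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
        }
    }
}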
Finally, let's move on to CUDA. Here we will go into more detail. First you need to allocate memory for the input data, copy it from the CPU to the GPU, and allocate memory for the output data:
void RGBtoGrayscaleCUDA(const uchar4 * const h_imageRGBA, unsigned char* const h_imageGray, size_t numRows, size_t numCols)
{
  uchar4 *d_imageRGBA;
  unsigned char *d_imageGray;
  const size_t numPixels = numRows * numCols;
  cudaSetDevice(0);
  checkCudaErrors(cudaGetLastError());
  //allocate memory on the device for both input and output
  checkCudaErrors(cudaMalloc(&d_imageRGBA, sizeof(uchar4) * numPixels));
  checkCudaErrors(cudaMalloc(&d_imageGray, sizeof(unsigned char) * numPixels));
  //copy input array to the GPU
  checkCudaErrors(cudaMemcpy(d_imageRGBA, h_imageRGBA, sizeof(uchar4) * numPixels, cudaMemcpyHostToDevice));


It is worth noting the standard naming convention for variables in CUDA - data on the host (CPU) is prefixed with h_ (host), data on the device (GPU) with d_ (device). checkCudaErrors is a macro taken from the GitHub repository of the Udacity course. It looks like this:
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
#define checkCudaErrors(val) check( (val), #val, __FILE__, __LINE__)
template <typename T>
void check(T err, const char* const func, const char* const file, const int line) {
  if (err != cudaSuccess) {
    std::cerr << "CUDA error at: " << file << ":" << line << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
    exit(1);
  }
}


cudaMalloc is the GPU analogue of malloc ; cudaMemcpy is the analogue of memcpy and has an additional parameter - an enum that specifies the type of copy: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
Next, you need to set the dimensions of the grid and block and call the kernel, not forgetting to measure the time:
  dim3 blockSize;
  dim3 gridSize;
  int threadNum;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  threadNum = 1024;
  blockSize = dim3(threadNum, 1, 1);
  gridSize = dim3(numCols/threadNum+1, numRows, 1);
  cudaEventRecord(start);
  rgba_to_grayscale_simple<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
  float milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);
  std::cout << "CUDA time simple (ms): " << milliseconds << std::endl;


Pay attention to the kernel launch syntax - kernel_name<<<gridSize, blockSize>>>(arguments). The kernel code itself is also not very complicated:
rgba_to_grayscale_simple
__global__
void rgba_to_grayscale_simple(const uchar4* const d_imageRGBA,
                              unsigned char* const d_imageGray,
                              int numRows, int numCols)
{
    int y = blockDim.y*blockIdx.y + threadIdx.y;
    int x = blockDim.x*blockIdx.x + threadIdx.x;
    if (x>=numCols || y>=numRows)
      return;
    const int offset = y*numCols+x;
    const uchar4 pixel = d_imageRGBA[offset];
    d_imageGray[offset] = 0.299f*pixel.x + 0.587f*pixel.y+0.114f*pixel.z;
}


Here we calculate the y and x coordinates of the pixel being processed using the previously described threadIdx , blockIdx and blockDim variables, and perform the conversion. Note the check if (x >= numCols || y >= numRows) - since the image dimensions will not necessarily be evenly divisible by the block dimensions, some blocks may "run off the edge" of the image, so this check is necessary. Also, the kernel function must be marked with the __global__ specifier.
The last step is to copy the result back from the GPU to the CPU and free up the allocated memory:
  checkCudaErrors(cudaMemcpy(h_imageGray, d_imageGray, sizeof(unsigned char) * numPixels, cudaMemcpyDeviceToHost));
  cudaFree(d_imageGray);
  cudaFree(d_imageRGBA);


By the way, CUDA allows you to use a C++ compiler for host code - so you can easily write wrappers that free the allocated memory automatically.
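Here is a minimal sketch of such a wrapper (a hypothetical CudaBuffer class, not taken from the course code), assuming C++11 is available:
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// A hypothetical RAII wrapper: device memory allocated in the constructor
// is freed automatically in the destructor when the object goes out of scope.
template <typename T>
class CudaBuffer
{
public:
    explicit CudaBuffer(size_t count) : ptr_(nullptr)
    {
        if (cudaMalloc((void**)&ptr_, count * sizeof(T)) != cudaSuccess)
        {
            std::fprintf(stderr, "cudaMalloc failed\n");
            std::exit(1);
        }
    }
    ~CudaBuffer() { cudaFree(ptr_); }
    T* get() const { return ptr_; }
    // Forbid copying so the same pointer is never freed twice.
    CudaBuffer(const CudaBuffer&) = delete;
    CudaBuffer& operator=(const CudaBuffer&) = delete;
private:
    T* ptr_;
};

// Possible usage inside RGBtoGrayscaleCUDA:
//   CudaBuffer<uchar4> d_imageRGBA(numPixels);
//   cudaMemcpy(d_imageRGBA.get(), h_imageRGBA, sizeof(uchar4) * numPixels, cudaMemcpyHostToDevice);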
So, we run it and measure (the input image size is 10,109 × 4,542 pixels):
OpenMP time (ms):45
CUDA time simple (ms): 43.1941

The configuration of the machine on which the tests were carried out:
Processor: Intel® Core(TM) i7-3615QM CPU @ 2.30GHz.
GPU: NVIDIA GeForce GT 650M, 1024 MB, 900 MHz.
RAM: DDR3, 2x4GB, 1600 MHz.
OS: OS X 10.9.5.
Compiler: g++ (GCC) 4.9.2 20141029.
CUDA compiler: Cuda compilation tools, release 6.0, V6.0.1.
Supported version of OpenMP: OpenMP 4.0.

The result is somehow not very impressive :) But the problem is the same - too little work is done per pixel: we launch thousands of threads, each of which finishes almost instantly. With the CPU this problem does not arise - OpenMP launches a relatively small number of threads (8 in my case) and divides the work between them evenly, so the processor cores are busy almost 100% of the time, whereas on the GPU we are, in effect, not using all of its power. The solution is fairly obvious - process several pixels per thread in the kernel. The new, optimized kernel looks like this:
rgba_to_grayscale_optimized
#define WARP_SIZE 32
__global__
void rgba_to_grayscale_optimized(const uchar4* const d_imageRGBA,
                                 unsigned char* const d_imageGray,
                                 int numRows, int numCols,
                                 int elemsPerThread)
{
    int y = blockDim.y*blockIdx.y + threadIdx.y;
    int x = blockDim.x*blockIdx.x + threadIdx.x;
    const int loop_start =  (x/WARP_SIZE * WARP_SIZE)*(elemsPerThread-1)+x;
    for (int i=loop_start, j=0; j<elemsPerThread && i<numCols; i+=WARP_SIZE, ++j) // guard against running past the right edge of the image
    {
        const int offset = y*numCols+i;
        const uchar4 pixel = d_imageRGBA[offset];
        d_imageGray[offset] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
    }
}


Here things are not quite as simple as with the previous kernel. If you look closely, each thread now processes elemsPerThread pixels, and not consecutive ones, but with a distance of WARP_SIZE between them. What WARP_SIZE is, why it equals 32, and why it pays to process pixels this way will be covered in more detail in the following parts; for now I will only say that this gives us more efficient memory access. Since each thread now processes elemsPerThread pixels with a stride of WARP_SIZE between them, the x-coordinate of the first pixel for a given thread, based on its position in the block, is now calculated by a somewhat more complicated formula than before.
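To make the indexing concrete, here is a small self-contained sketch (purely illustrative, not part of this article's code) that prints which columns a few sample threads would process with WARP_SIZE = 32 and elemsPerThread = 16:
#include <iostream>

// Purely illustrative host-side program: reproduce the kernel's index arithmetic
// to see which columns a thread with a given x-coordinate would process.
int main()
{
    const int WARP_SIZE = 32, elemsPerThread = 16;
    const int sampleThreads[] = {0, 1, 32, 33};
    for (int x : sampleThreads)
    {
        const int loop_start = (x / WARP_SIZE * WARP_SIZE) * (elemsPerThread - 1) + x;
        std::cout << "thread x=" << x << ":";
        for (int i = loop_start, j = 0; j < elemsPerThread; i += WARP_SIZE, ++j)
            std::cout << " " << i;
        std::cout << std::endl;
    }
    // Thread x=0 touches columns 0, 32, ..., 480 and thread x=32 touches 512, 544, ..., 992:
    // at every loop step the 32 threads of one warp read 32 consecutive pixels.
    return 0;
}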
This kernel is launched as follows:
  threadNum=128;
  const int elemsPerThread = 16;
  blockSize = dim3(threadNum, 1, 1);
  gridSize = dim3(numCols / (threadNum*elemsPerThread) + 1, numRows, 1);
  cudaEventRecord(start);
  rgba_to_grayscale_optimized<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols, elemsPerThread);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
  milliseconds = 0;
  cudaEventElapsedTime(&milliseconds, start, stop);
  std::cout << "CUDA time optimized (ms): " << milliseconds << std::endl;


The number of blocks along the x-coordinate is now calculated as numCols / (threadNum*elemsPerThread) + 1 instead of numCols / threadNum + 1. Everything else stays the same.
Let's run it:
OpenMP time (ms):44
CUDA time simple (ms): 53.1625
CUDA time optimized (ms): 15.9273

We got a 2.76x speedup (again, not counting the time spent on memory operations) - quite good for such a simple problem. Yes, this task is too simple - the CPU copes with it well enough. As the second test shows, a naive GPU implementation can even lose to a CPU implementation in speed.
That's all for today - in the next part we will look at GPU hardware and the basic parallel communication patterns.
All the source code is available on bitbucket.
