CUDA and remote GPU

CUDA is great as long as you have an Nvidia video card at hand. But what do you do when there is no Nvidia card in your favorite laptop, or when you need to develop inside a virtual machine?


In this article I will look at the rCUDA (Remote CUDA) framework, which helps when an Nvidia video card does exist, just not in the machine that is supposed to run the CUDA applications.


TL;DR

rCUDA (Remote CUDA) is a framework that implements the CUDA API and lets you use a remote video card. It is in a working beta state and is available only for Linux. The main goal of rCUDA is full compatibility with the CUDA API: you do not need to modify your code at all, it is enough to set a few special environment variables.


What is rCUDA


rCUDA (Remote CUDA) is a framework that implements the CUDA API, which allows you to use a video card located on a remote machine for CUDA calculations without making any changes to your code. Developed at the Polytechnic University of Valencia (rcuda-team).


Restrictions


Currently only GNU/Linux systems are supported, although the developers promise Windows support in the future. The current version of rCUDA, 18.03beta, is compatible with CUDA 5-8, that is, CUDA 9 is not supported. The developers declare full compatibility with the CUDA API, with the exception of graphics interoperability.


Possible usage scenarios


  1. Running CUDA applications in a virtual machine when passing the video card through is inconvenient or impossible, for example when the card is occupied by the host or when there is more than one virtual machine.
  2. A laptop without a discrete video card.
  3. The desire to use several video cards (clustering). In theory, every video card available on the network can be used, including several at once (see the multi-server sketch in the client setup below).

Brief instructions


Test configuration


Testing was conducted on the following configuration:


Server:
Ubuntu 16.04, GeForce GTX 660


Client:
Virtual machine with Ubuntu 16.04 on a laptop without a discrete video card.


Getting rCUDA


The most difficult stage. Unfortunately, at the moment the only way to get your own copy of this framework is to fill in the corresponding request form on the official website. However, the developers promise to respond within 1-2 days. In my case, I was sent a distribution on the same day.


CUDA installation


First you need to install the CUDA Toolkit on both the server and the client (even if the client has no Nvidia video card). You can download it from the official site or use a repository; the main thing is that the version must not be higher than 8. This example uses the .run installer from the official site.


chmod +x cuda_8.0.61_375.26_linux.run
./cuda_8.0.61_375.26_linux.run

Important! On the client, decline the installation of the Nvidia driver. By default, the CUDA Toolkit will be installed to /usr/local/cuda/. Also install the CUDA Samples, they will be needed later.
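
If you want to double-check that the toolkit itself went in (no driver is needed for this), you can ask the compiler for its version; the path below assumes the default installation location mentioned above:

/usr/local/cuda/bin/nvcc --version

It should report release 8.0.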


Install rCUDA


Let's unpack the archive received from the developers into the home directory.


tar -xvf rCUDA*.tgz -C ~/
mv ~/rCUDA* ~/rCUDA

You need to do these actions on the server and on the client.


Running the rCUDA daemon on the server


export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/home/<XXX>/rCUDA/lib/cudnn
cd ~/rCUDA/bin
./rCUDAd  

Replace <XXX> with your username. Use ./rCUDAd -iv if you want to see detailed output.
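
To verify that the daemon is actually up, you can check that something is listening on rCUDA's port (8308, the same port that appears in the security section below); this assumes ss from iproute2 is available:

ss -tln | grep 8308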


Client setup


Let's open a terminal on the client, in which we will later run CUDA code. On the client side, we need to "replace" the standard CUDA libraries with the rCUDA libraries, for which we add the appropriate paths to the LD_LIBRARY_PATH environment variable. We also need to specify the number of servers and their addresses (in my example it will be one).


export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/home/<XXX>/rCUDA/lib/:$LD_LIBRARY_PATH
export RCUDA_DEVICE_COUNT=1                 # the number of video cards (servers); there may be several
export RCUDA_DEVICE_0=<SERVER IP ADDRESS>:0 # the address of the first server
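
If you have more than one rCUDA server (the clustering scenario from the list above), the same variables simply scale out. A minimal sketch with hypothetical addresses, assuming one video card per server:

export RCUDA_DEVICE_COUNT=2              # two remote video cards in total
export RCUDA_DEVICE_0=192.168.1.10:0     # card 0 on the first server (hypothetical address)
export RCUDA_DEVICE_1=192.168.1.11:0     # card 0 on the second server (hypothetical address)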

Build and Run


Let's try to build and run a few examples.


Example 1


Let's start with something simple: deviceQuery, an example that simply prints the parameters of a CUDA-compatible device, in our case the remote GTX 660.


cd <YYY>/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
make EXTRA_NVCCFLAGS=--cudart=shared

Important! Without EXTRA_NVCCFLAGS=--cudart=shared the miracle will not happen: rCUDA substitutes its own CUDA runtime library via LD_LIBRARY_PATH, so the sample must be linked against the shared cudart rather than the static one.
Replace <YYY> with the path you specified for the CUDA Samples when installing CUDA.


Run the compiled example:


./deviceQuery

If you did everything correctly, the result will be something like this:


Result
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 660"
  CUDA Driver Version / Runtime Version          9.0 / 8.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1994 MBytes (2090991616 bytes)
  ( 5) Multiprocessors, (192) CUDA Cores/MP:     960 CUDA Cores
  GPU Max Clock rate:                            1072 MHz (1.07 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 393216 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 660
Result = PASS

Most importantly, we should see:


Device0 = GeForce GTX 660
Result = PASS

Great! We managed to build and run a CUDA application on a machine without a discrete video card, using a video card installed on a remote server.


Important! If the output of the application begins with the lines of the form:


mlock error: Cannot allocate memory
rCUDA warning: 1007.461
mlock error: Cannot allocate memory 

it means that you need to add the following lines to the file /etc/security/limits.conf on both the server and the client:


*            hard    memlock           unlimited
*            soft    memlock           unlimited

This allows all users (*) to lock an unlimited amount of memory (memlock). It would be even better to replace * with the specific user that needs it and to pick a less generous limit instead of unlimited.


Example 2


Now let's try something more interesting. We will test an implementation of the dot product of two vectors using shared memory and synchronization ("CUDA by Example", Sanders J., Kandrot E., section 5.3.1).


In this example we compute the dot product of two vectors of dimension 33 * 1024 and compare the answer with the result obtained on the CPU.
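
As a sanity check on the numbers below: main() fills the vectors as a[i] = i and b[i] = 2*i, so the exact answer is 2 * (0^2 + 1^2 + ... + (N-1)^2) = (N-1) * N * (2N-1) / 3, which for N = 33 * 1024 is about 2.57 * 10^13. This is exactly what the sum_squares macro at the end of the listing computes for the CPU reference value.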


dotProd.cu
#include <stdio.h>

#define imin(a,b) (a<b?a:b)

const int N = 33 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (N+threadsPerBlock-1) / threadsPerBlock);

__global__ void dot(float* a, float* b, float* c) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

    // synchronize threads in this block
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}

int main(void) {
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the cpu side
    a = (float*)malloc(N*sizeof(float));
    b = (float*)malloc(N*sizeof(float));
    partial_c = (float*)malloc(blocksPerGrid*sizeof(float));

    // allocate the memory on the gpu
    cudaMalloc((void**)&dev_a, N*sizeof(float));
    cudaMalloc((void**)&dev_b, N*sizeof(float));
    cudaMalloc((void**)&dev_partial_c, blocksPerGrid*sizeof(float));

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // copy the arrays 'a' and 'b' to the gpu
    cudaMemcpy(dev_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

    dot<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_partial_c);

    // copy the array 'c' back from the gpu to the cpu
    cudaMemcpy(partial_c, dev_partial_c, blocksPerGrid*sizeof(float), cudaMemcpyDeviceToHost);

    // finish up on the cpu side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

    #define sum_squares(x) (x*(x+1)*(2*x+1)/6)
    printf("GPU - %.6g \nCPU - %.6g\n", c, 2*sum_squares((float)(N-1)));

    // free memory on the gpu side
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_partial_c);

    // free memory on the cpu side
    free(a);
    free(b);
    free(partial_c);
}

Build and Run:


/usr/local/cuda/bin/nvcc  --cudart=shared dotProd.cu -o dotProd
./dotProd

This output tells us that everything is fine:


GPU - 2.57236e+13
CPU - 2.57236e+13

Example 3


Let's run one more standard CUDA sample, matrixMulCUBLAS (matrix multiplication via cuBLAS).


cd <YYY>/NVIDIA_CUDA-8.0_Samples/0_Simple/matrixMulCUBLAS
make EXTRA_NVCCFLAGS=--cudart=shared
./matrixMulCUBLAS

Result

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 660" with compute capability 3.0


MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 436.24 GFlop/s, Time= 0.451 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS


NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.


Interesting to us:


Performance= 436.24 GFlop/s
Comparing CUBLAS Matrix Multiply with CPU results: PASS

Security


I did not find any mention of an authentication mechanism in the rCUDA documentation. I think the simplest thing you can do at the moment is to allow access to the rCUDA port (8308) only from a specific address.


With iptables, it will look like this:


iptables -A INPUT -m state --state NEW -p tcp -s <client address> --dport 8308 -j ACCEPT
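
If you prefer ufw, which Ubuntu ships with, a hypothetical equivalent rule would be:

ufw allow from <client address> to any port 8308 proto tcp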

Beyond that, I leave the question of security outside the scope of this post.


Sources and references

[1] http://www.rcuda.net/pub/rCUDA_guide.pdf
[2] http://www.rcuda.net/pub/rCUDA_QSG.pdf
[3] C. Reaño, F. Silla, G. Shainer and S. Schultz, “Local and Remote GPUs Perform Similar with EDR 100G InfiniBand”, in proceedings of the International Middleware Conference, Vancouver, BC, Canada, December 2015.
[4] C. Reaño and F. Silla, “A Performance Comparison of CUDA Remote GPU Virtualization Frameworks”, in proceedings of the International Conference on Cluster Computing, Chicago, IL, USA, September 2015.

