Writing in Rust + CUDA C

  • Tutorial

Hello!

In this guide I want to show how to combine CUDA C/C++ and Rust. As an example, we will write a small program in Rust that computes the scalar product of two vectors; the computation itself will be done on the GPU using CUDA C.

If you're interested, read on!

CUDA C


The first step is to install the CUDA compiler, nvcc. I will not describe what CUDA is and why you need it; you can read about that, for example, here. I will only say that it lets you write code that will run on NVIDIA video cards (hereinafter, the GPU) and use all of their power for parallel computing and graphics processing. Once again, this tutorial is not about how to write CUDA code, but about how to use it from Rust and run parallel computations on the GPU.

So, install nvcc and the CUDA Toolkit. This should not cause any difficulties: detailed instructions are available on the official site.
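You can check that nvcc is visible after installation, for example, with:

nvcc --version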

RUST + CUDA C


As mentioned earlier, in this tutorial we will write a program in Rust that finds the scalar product of two vectors; the computation itself will happen on the GPU.

The scalar product of two vectors.
Suppose we have two vectors, $a = [a_1, a_2, \ldots, a_n]$ and $b = [b_1, b_2, \ldots, b_n]$; the scalar product of these vectors is:

$a \cdot b = \sum_{i=1}^{n} a_i b_i$
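For example, for $a = [1, 2, 3]$ and $b = [4, 5, 6]$:

$a \cdot b = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32$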



Let's start creating our program. From here on, I assume that nvcc is successfully installed and that rustc and cargo are available for compiling the Rust code.

First, create the project folder. In it, create the Cargo.toml file, which contains the instructions for the cargo build system. The file looks like this:

[package]
name = "rust_cuda" # program name
version = "0.1.0" # program version
authors = ["MoonL1ght "] # author information
build = "build.rs" # build script for rust
links = "cudart" # the CUDA library, linked dynamically
[dependencies]
libc = "0.2" # rust library for working with C code
rand = "0.5.5" # rust library for working with random values
[build-dependencies]
cc = "1.0" # rust package for building C code

Also, in the project root, create the build.rs file, which will contain the instructions for building the Rust program and compiling the CUDA C code.

Add an src folder to the project root; this is where we will place the source files. In the src folder, create four files: main.rs - the main program code, dot.cpp - the C++ binding (a wrapper around CUDA C), and dot_gpu.h and dot_gpu.cu - the header and the file containing the code executed on the GPU.

In the end, we have the following project structure:

rust-cuda/
    src/
        main.rs
        dot.cpp
        dot_gpu.h
        dot_gpu.cu
    Cargo.toml
    build.rs

In the build.rs file, the most important thing is to write this:

println!("cargo:rustc-link-search=native=/Developer/NVIDIA/CUDA-10.1/lib");
println!("cargo:rustc-link-search=/Developer/NVIDIA/CUDA-10.1/lib");
println!("cargo:rustc-env=LD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-10.1/lib");
println!("cargo:rustc-link-lib=dylib=cudart");

where /Developer/NVIDIA/CUDA-10.1/lib is the path to the CUDA libraries. On a unix-like system, the location of the CUDA installation can be found, for example, with the command:

which nvcc
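Depending on the installation, this might print something like /Developer/NVIDIA/CUDA-10.1/bin/nvcc; the lib folder next to that bin folder is the one we need.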

In addition, in the build.rs file, you need to specify the paths to the dot.cpp and dot_gpu.cu files:

.files(&["./src/dot.cpp", "./src/dot_gpu.cu"])

The full build.rs code:
extern crate cc;
fn main() {
  cc::Build::new()
    .cuda(true) // compile .cu files with nvcc
    .cpp(true) // compile as C++
    .flag("-cudart=shared") // link the CUDA runtime dynamically
    .files(&["./src/dot.cpp", "./src/dot_gpu.cu"])
    .compile("dot.a"); // static library with our C++/CUDA code
  // tell the linker where the CUDA libraries live and link cudart
  println!("cargo:rustc-link-search=native=/Developer/NVIDIA/CUDA-10.1/lib");
  println!("cargo:rustc-link-search=/Developer/NVIDIA/CUDA-10.1/lib");
  println!("cargo:rustc-env=LD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-10.1/lib");
  println!("cargo:rustc-link-lib=dylib=cudart");
}
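The CUDA path above is hard-coded for my machine. As a small sketch of a more portable variant (the CUDA_PATH variable name is my own assumption, not something the cc crate requires), build.rs could read the path from an environment variable instead:

// Hypothetical sketch: take the CUDA location from CUDA_PATH,
// falling back to the hard-coded default used above.
let cuda_lib = std::env::var("CUDA_PATH")
    .map(|p| format!("{}/lib", p))
    .unwrap_or_else(|_| String::from("/Developer/NVIDIA/CUDA-10.1/lib"));
println!("cargo:rustc-link-search=native={}", cuda_lib);
println!("cargo:rustc-link-lib=dylib=cudart");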


Now we can start writing the main program code. In the main.rs file, we need to declare an interface to the C/C++ functions so that they can be called directly from Rust. You can read more about this in the FFI section of the official documentation.

extern "C" {
  // interface to the C function that computes the dot product of two vectors
  fn dot(v1: *mut c_float, v2: *mut c_float, N: size_t) -> c_float;
}

To call it, we need an unsafe block; as arguments we pass mutable raw pointers to the vectors' data:

unsafe {
  gpu_res = dot(v1.as_mut_ptr(), v2.as_mut_ptr(), VEC_SIZE);
}

The full main.rs code:
extern crate libc;
extern crate rand;
use libc::{c_float, size_t};
use rand::Rng;
const VEC_SIZE: usize = 10;
const MAX: f32 = 10.;
const MIN: f32 = 0.;
extern "C" {
  fn dot(v1: *mut c_float, v2: *mut c_float, N: size_t) -> c_float;
}
fn cpu_dot(v1: Vec<f32>, v2: Vec<f32>) -> f32 {
  let mut res: f32 = 0.;
  for i in 0..v1.len() {
    res += v1[i] * v2[i];
  }
  return res;
}
fn main() {
  let mut v1: Vec<f32> = Vec::new();
  let mut v2: Vec<f32> = Vec::new();
  let mut gpu_res: c_float;
  let mut cpu_res: f32 = 0.;
  let mut rng = rand::thread_rng();
  for _ in 0..VEC_SIZE {
    v1.push(rng.gen_range(MIN, MAX));
    v2.push(rng.gen_range(MIN, MAX));
  }
  println!("{:?}", v1);
  println!("{:?}", v2);
  println!("GPU computing started");
  unsafe {
    gpu_res = dot(v1.as_mut_ptr(), v2.as_mut_ptr(), VEC_SIZE);
  }
  println!("GPU computing finished");
  println!("GPU dot product result: {}", gpu_res);
  cpu_res = cpu_dot(v1, v2);
  println!("CPU dot product result: {}", cpu_res);
}


Now let's write the binding in C++, as well as the code for computing the scalar product of the vectors in CUDA C.

In the dot.cpp file, we write the binding function; this is the function we actually call from the Rust code (the blocksPerGrid constant comes from the dot_gpu.h header; a sketch of that header is given after the dot_gpu.cu listing below):

extern "C" {
  float dot(float *v1, float *v2, size_t N) {
    float *gpu_res;
    float res = 0.0;
    gpu_res = gpu_dot(v1, v2, N); // вычисление на GPU
    for (int i = 0; i < blocksPerGrid; i++) {
      res += gpu_res[i];
    }
    free(gpu_res);
    return res;
  }
}

The full dot.cpp code:
#include <iostream>
#include "dot_gpu.h"
using namespace std;
void display_vector(float *v, size_t N) {
  cout << "[";
  for (size_t i = 0; i < N; i++) {
    cout << v[i];
    if (i != N - 1) {
      cout << ", ";
    }
  }
  cout << "]" << endl;
}
extern "C" {
  float dot(float *v1, float *v2, size_t N) {
    cout << "Calling gpu dot product" << endl;
    cout << "Got two vectors from rust:" << endl;
    display_vector(v1, N);
    display_vector(v2, N);
    float *gpu_res;
    float res = 0.0;
    gpu_res = gpu_dot(v1, v2, N);
    for (int i = 0; i < blocksPerGrid; i++) {
      res += gpu_res[i];
    }
    delete[] gpu_res; // allocated with new[] in gpu_dot
    return res;
  }
}


Below is the code of the dot_gpu.cu file, where the main computation is performed. I will not explain this code here, since the tutorial is not dedicated to CUDA programming.

dot_gpu.cu
#include "dot_gpu.h"
__global__ void dot__(float *v1, float *v2, float *res, int N) {
  __shared__ float cache[threadsPerBlock];
  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int cacheIndex = threadIdx.x;
  float temp = 0.0;
  while (tid < N) {
    temp += v1[tid] * v2[tid];
    tid += blockDim.x * gridDim.x;
  }
  cache[cacheIndex] = temp;
  __syncthreads();
  int i = blockDim.x / 2;
  while (i != 0) {
    if (cacheIndex < i) {
      cache[cacheIndex] += cache[cacheIndex + i];
    }
    __syncthreads();
    i /= 2;   
  }
  if (cacheIndex == 0) {
    res[blockIdx.x] = cache[0];
  }
}
float * gpu_dot(float *v1, float *v2, size_t N) {
  float *dev_v1, *dev_v2, *dev_res, *res;
  res = new float[blocksPerGrid];
  cudaMalloc((void**)&dev_v1, N * sizeof(float));
  cudaMalloc((void**)&dev_v2, N * sizeof(float));
  cudaMalloc((void**)&dev_res, blocksPerGrid * sizeof(float));
  cudaMemcpy(dev_v1, v1, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dev_v2, v2, N * sizeof(float), cudaMemcpyHostToDevice);
  dot__<<<blocksPerGrid, threadsPerBlock>>>(dev_v1, dev_v2, dev_res, (int)N);
  cudaMemcpy(res, dev_res, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dev_v1);
  cudaFree(dev_v2);
  cudaFree(dev_res);
  return res;
}
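The dot_gpu.h header itself is not listed in this article, but both dot.cpp and dot_gpu.cu include it and use the threadsPerBlock and blocksPerGrid constants. A minimal sketch of what it has to contain (the concrete constant values are my assumptions; threadsPerBlock should be a power of two for the reduction in the kernel above to work):

#ifndef DOT_GPU_H
#define DOT_GPU_H

#include <cstddef>

// Kernel launch configuration (assumed values).
const int threadsPerBlock = 256; // must be a power of two for the reduction
const int blocksPerGrid = 32;    // number of partial sums returned by gpu_dot

// Computes per-block partial dot products on the GPU; returns a
// host array of blocksPerGrid partial sums allocated with new[].
float *gpu_dot(float *v1, float *v2, size_t N);

#endif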


Our little program is now written and ready to build. To build it, run the following command in the console:

cargo build

To run it:

cargo run

After the build, a target folder will appear in the project's root directory. The executable of our program will be located in ./target/debug/

However, if we simply run this executable directly, we will get the error: dyld library not loaded. That is, the loader cannot find the CUDA dynamic library. To solve this problem, you can either set the environment variable LD_LIBRARY_PATH=path_to_CUDA_lib_directory/ before running the executable, or create symbolic links to the CUDA libraries in the Rust toolchain folder:

ln -s /Developer/NVIDIA/CUDA-10.1/lib/* /Users/Alexander/.rustup/toolchains/nightly-x86_64-apple-darwin/lib

where /Users/Alexander/.rustup/toolchains/nightly-x86_64-apple-darwin/lib is my path to the installed Rust toolchain; it may differ slightly for you.
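Alternatively, instead of creating the symlinks, the variable can be set just for one run (using the same CUDA path as in build.rs and the binary name from Cargo.toml):

LD_LIBRARY_PATH=/Developer/NVIDIA/CUDA-10.1/lib ./target/debug/rust_cuda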

No such error occurs when the program is started via cargo run, because we set the LD_LIBRARY_PATH environment variable in the build.rs file.

Conclusion


We can now call CUDA C code directly from Rust. To verify this, we wrote a small program that works with vectors and performs all the calculations on the GPU. The full code can also be viewed on github.
