Overview of Pseudo Random Number Generators for CUDA
Because of the nature of my work, I often have to deal with GPU simulations that rely on pseudo-random number generators. Over time I have accumulated some experience, which I decided to share with the community.
So, here are the types of generators, with their advantages and disadvantages.
1. Linear Congruential Generator
The simplest and weakest generator, with a period of only 2^32. It is completely unsuitable for serious simulations such as Monte Carlo, but it can be used for debugging because it is extremely simple to implement. Each generator (stream) needs its own stored state; an array of unsigned int indexed by the thread ID is enough for that. For example, like this:
// One LCG step (Numerical Recipes constants). Marked __host__ __device__ so the
// same routine can also be called on the CPU, e.g. when seeding the Mersenne Twister below.
__host__ __device__ inline unsigned int lcg_rand(unsigned int& seed)
{
    seed = seed * 1664525u + 1013904223u;
    return seed;
}

__global__ void GenerateRandNums(unsigned int* out, unsigned int* states)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int s = states[tid];   // load this thread's state
    out[tid] = lcg_rand(s) + lcg_rand(s);
    states[tid] = s;                // store the updated state back
}
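For completeness, here is a rough host-side sketch of how the state array could be created, seeded and passed to the kernel (the sizes, seeds and launch configuration below are arbitrary illustrative assumptions):

#include <cuda_runtime.h>
#include <vector>

void RunLcgExample()
{
    const int threads = 256, blocks = 64;                 // assumed launch configuration
    const int n = threads * blocks;

    // Any distinct per-thread seeds will do for an LCG.
    std::vector<unsigned int> h_states(n);
    for (int i = 0; i < n; ++i)
        h_states[i] = 1234567u + 9781u * i;

    unsigned int *d_states, *d_out;
    cudaMalloc(&d_states, n * sizeof(unsigned int));
    cudaMalloc(&d_out, n * sizeof(unsigned int));
    cudaMemcpy(d_states, h_states.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    GenerateRandNums<<<blocks, threads>>>(d_out, d_states);
    cudaDeviceSynchronize();

    cudaFree(d_states);
    cudaFree(d_out);
}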
2. Combined Tausworthe Generator
This generator has a much longer period than an LCG but is almost as simple: about 27 instructions produce a stream with a period of 2^113. The period is achieved by combining several Tausworthe generators with exclusive or. Below is the code for one thread, a combined generator using 4 independent streams. It requires 4 unsigned ints to store its state, but its use is not fundamentally different from the LCG.
// S1, S2, S3, M are the shift and mask constants of the component generator
__device__ inline unsigned int taus_rand_step(unsigned int& state, int S1, int S2, int S3, unsigned int M)
{
    unsigned int b = (((state << S1) ^ state) >> S2);
    state = (((state & M) << S3) ^ b);
    return state;
}

// Combined Tausworthe generator (L'Ecuyer's LFSR113 parameters), period about 2^113.
// The components require seeds greater than 1, 7, 15 and 127 respectively.
__device__ unsigned int taus_rand(uint4& state)
{
    return
        taus_rand_step(state.x,  6, 13, 18, 4294967294UL) ^
        taus_rand_step(state.y,  2, 27,  2, 4294967288UL) ^
        taus_rand_step(state.z, 13, 21,  7, 4294967280UL) ^
        taus_rand_step(state.w,  3, 12, 13, 4294967168UL);
}
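Using it from a kernel is no different from the LCG. Below is a rough sketch in which each thread produces several numbers from its own stream (the kernel name and output layout are assumptions for illustration); keep in mind the seeding restriction mentioned in the comment above.

// Each thread writes `count` numbers from its own stream: out[tid], out[tid + n], ...
__global__ void GenerateTausNums(unsigned int* out, uint4* states, int count, int n)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint4 s = states[tid];                  // per-thread state: 4 unsigned ints
    for (int k = 0; k < count; ++k)
        out[tid + k * n] = taus_rand(s);
    states[tid] = s;                        // persist the state for the next launch
}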
3. Hybrid Generator
This generator is obtained by combining the two previous ones. If the periods of the combined generators are coprime, the period of the result is the product of their periods. Here three Tausworthe streams are combined with one LCG stream, giving a period of about 2^121. The code of the hybrid generator:
// Hybrid generator: three Tausworthe streams combined with one LCG stream.
__device__ unsigned int hybrid_taus(uint4& state)
{
    return
        taus_rand_step(state.x, 13, 19, 12, 4294967294UL) ^
        taus_rand_step(state.y,  2, 25,  4, 4294967288UL) ^
        taus_rand_step(state.z,  3, 11, 17, 4294967280UL) ^
        lcg_rand(state.w);
}
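In practice the 32-bit output is usually converted to a uniform float in [0, 1); the kernel below is only an illustrative sketch, where 2.3283064365386963e-10f is simply 1/2^32.

__global__ void GenerateUniformFloats(float* out, uint4* states)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint4 s = states[tid];
    // Scale the 32-bit integer into [0, 1).
    out[tid] = hybrid_taus(s) * 2.3283064365386963e-10f;
    states[tid] = s;
}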
4. XorShift
This generator has a period of 2^128 - 1 and requires 4 unsigned ints to store the state. Thanks to maratyszcza for the tip.
// Marsaglia's xor128; the state must never be all zeros.
__device__ inline unsigned int rng_xor128(uint4& s)
{
    unsigned int t = s.x ^ (s.x << 11);
    s.x = s.y;
    s.y = s.z;
    s.z = s.w;
    s.w = (s.w ^ (s.w >> 19)) ^ (t ^ (t >> 8));
    return s.w;
}
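Its use mirrors the previous generators; a minimal sketch with an assumed kernel name follows. The only seeding requirement is that a thread's uint4 state must not be all zeros.

__global__ void GenerateXorShiftNums(unsigned int* out, uint4* states)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint4 s = states[tid];                  // must be seeded to a non-zero state
    out[tid] = rng_xor128(s);
    states[tid] = s;
}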
5. Mersenne Twister
This generator is currently considered one of the best and has a very large period of 2^19937 - 1, but producing each element of the stream requires a fairly involved sequential algorithm. It also needs a lot of memory for its state, which has to live in global memory and can noticeably hurt GPU performance. Still, when statistical quality matters more than speed, this is the generator to prefer. As with the previous generators, each thread has its own generator and state. Below is the code (a modified example from the NVIDIA SDK), followed by an example of using it.
#define SEED 999
#define DCMT_SEED 4172
#define MT_RNG_PERIOD 607
typedef struct {
    unsigned int matrix_a;
    unsigned int mask_b;
    unsigned int mask_c;
    unsigned int seed;
} mt_struct_stripped;
#define MT_RNG_COUNT (4096*16)
#define MT_MM 9
#define MT_NN 19
#define MT_WMASK 0xFFFFFFFFU
#define MT_UMASK 0xFFFFFFFEU
#define MT_LMASK 0x1U
#define MT_SHIFT0 12
#define MT_SHIFTB 7
#define MT_SHIFTC 15
#define MT_SHIFT1 18
struct mt_state
{
    int iState;
    unsigned int mti1;
    unsigned int mt[MT_NN];
};
__device__ mt_struct_stripped ds_MT[MT_RNG_COUNT];
static mt_struct_stripped h_MT[MT_RNG_COUNT];
__device__ mt_state ds_mtState[MT_RNG_COUNT];
void InitGPUTwisters(const char* fname, unsigned int seed)
{
    FILE *fd = fopen(fname, "rb");
    if(!fd)
    {
        printf("initMTGPU(): failed to open %s\n", fname);
        return;
    }
    if( !fread(h_MT, sizeof(h_MT), 1, fd) )
    {
        printf("initMTGPU(): failed to load %s\n", fname);
        fclose(fd);
        return;
    }
    fclose(fd);

    // Give every twister its own seed, derived on the host with the LCG step above.
    mt_struct_stripped *MT = (mt_struct_stripped *)malloc(MT_RNG_COUNT * sizeof(mt_struct_stripped));
    for(int i = 0; i < MT_RNG_COUNT; i++)
    {
        MT[i] = h_MT[i];
        seed = lcg_rand(seed);
        MT[i].seed = seed;
    }
    cudaMemcpyToSymbol(ds_MT, MT, sizeof(h_MT));
    free(MT);
}
__device__ unsigned int GetRandomIntegerFast(unsigned int tid, mt_state* mts, mt_struct_stripped* config)
{
    int iState1, iStateM;
    unsigned int mti, mtiM, x;

    // Indices of the next element and of the element MT_MM positions ahead (circular buffer).
    iState1 = mts->iState + 1;
    iStateM = mts->iState + MT_MM;
    if(iState1 >= MT_NN)
        iState1 -= MT_NN;
    if(iStateM >= MT_NN)
        iStateM -= MT_NN;

    mti = mts->mti1;
    mts->mti1 = mts->mt[iState1];
    mtiM = mts->mt[iStateM];

    // Twist: mix the upper bits of mt[i] with the lower bits of mt[i+1].
    x = (mti & MT_UMASK) | (mts->mti1 & MT_LMASK);
    x = mtiM ^ (x >> 1) ^ ((x & 1) ? config->matrix_a : 0);
    mts->mt[mts->iState] = x;
    mts->iState = iState1;

    // Tempering.
    x ^= (x >> MT_SHIFT0);
    x ^= (x << MT_SHIFTB) & config->mask_b;
    x ^= (x << MT_SHIFTC) & config->mask_c;
    x ^= (x >> MT_SHIFT1);
    return x;
}
// Maps the 32-bit output to a float in (0, 1].
__device__ float GetRandomFloatFast(unsigned int tid, mt_state* mts, mt_struct_stripped* config)
{
    return ((float)GetRandomIntegerFast(tid, mts, config) + 1.0f) / 4294967296.0f;
}
// Initializes a thread's twister state from its per-twister seed.
__device__ void InitTwisterFast(unsigned int tid, mt_state* mts, mt_struct_stripped* config)
{
    mts->mt[0] = config->seed;
    for(int i = 1; i < MT_NN; i++)
        mts->mt[i] = (1812433253U * (mts->mt[i - 1] ^ (mts->mt[i - 1] >> 30)) + i) & MT_WMASK;
    mts->iState = 0;
    mts->mti1 = mts->mt[0];
}
__global__ void GenerateUniformNumbers(float* output)
{
    const unsigned int tid = gridDim.x * gridDim.y * threadIdx.x + blockIdx.y * gridDim.x + blockIdx.x;
    mt_state mts = ds_mtState[tid];
    mt_struct_stripped config = ds_MT[tid];
    InitTwisterFast(tid, &mts, &config);
    output[tid] = GetRandomFloatFast(tid, &mts, &config);
    ds_mtState[tid] = mts;
}
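On the host, putting it all together could look roughly like this (the data file name, grid shape and the lack of output handling are assumptions on my part; the NVIDIA SDK sample ships a precomputed parameter table that InitGPUTwisters expects):

void RunMersenneTwisterExample()
{
    // Assumed file name of the dcmt parameter table from the NVIDIA SDK sample.
    InitGPUTwisters("MersenneTwister.dat", SEED);

    float* d_out;
    cudaMalloc(&d_out, MT_RNG_COUNT * sizeof(float));

    // One thread per twister: 256 blocks * 256 threads = MT_RNG_COUNT (65536),
    // which matches the tid formula in GenerateUniformNumbers.
    GenerateUniformNumbers<<<256, 256>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
}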
6. WarpStandard generator
This generator was developed specifically for GPUs and has a period of 2^1024. It uses shared memory, so all threads in a block must execute in lockstep, which is not always possible. A very good description of this generator is available at the link.
Conclusion
Each of the presented generators suits a different case. LCG is fine for debugging; if you need much better quality and randomness, I would use the Mersenne Twister. If all threads on the GPU execute in lockstep, with no divergent branches inside the blocks, WarpStandard is a very good fit. In all other cases I use the hybrid generator.