# Effective number generation in a given interval

Original author: ME O'Neill
• Transfer The overwhelming majority of my posts on random number generation mainly dealt with the properties of various generation schemes. This may turn out to be unexpected, but the performance of the randomization algorithm may depend not on the chosen generation scheme, but on other factors. In this post (which I was inspired by an excellent article by Daniel Lemyr ), we will examine the main reasons for the decline in random number generation performance that often outweigh the performance of the PRN engine.

Imagine this situation:

As homework, Juan and Sasha implement the same randomized algorithm in C ++, which will run on the same university computer and with one data set. Their code is almost identical and differs only in the generation of random numbers. Juan is in a hurry to do his music lessons, so he simply chose the Mersenne whirlwind. Sasha, on the other hand, spent a few extra hours researching. Sasha conducted benchmarks of several of the fastest PRNGs, which he recently learned from social networks, and chose the fastest. At the meeting, Sasha was impatient to brag, and he asked Juan: “What GPRS did you use?”

“Personally, I just took the Mersenne’s vortex - it’s built into the language and seems to work well.

“Ha!” Sasha answered. "I used`jsf32`. It is much faster than the old and slow Mersenne whirlwind! My program runs in 3 minutes 15 seconds! ”

“Hmm, not bad, but mine can do it in less than a minute,” Juan says and shrugs. “Well then, I have to go to the concert. Will you come with me? ”

“ No, ”Sasha answers. "I ... uh ... need to look at my code again."

This awkward fictional situation is not particularly fictional; it is based on real results. If your randomized algorithm does not execute as fast as we would like, and the bottleneck seems to be random number generation, then, oddly enough, the problem may not be in the random number generator!

### Introduction: Random Numbers in Practice

Most modern high-quality random number generators create machine words filled with random bits, that is, they usually generate numbers in the interval [0..2 32 ) or [0..2 64 ). But in many cases of use, users need numbers in a certain interval - for example, to roll a die or choose a random playing card, numbers are needed in small constant intervals. However, many algorithms, from mixing and reservoir sampling to randomized binary search trees, require numbers taken from other intervals.

### Methods

We will look at many different methods. To simplify the discussion, instead of generating numbers in the interval [ i .. j ) or [ i .. j ], we will generate numbers in the interval [0 .. k ). Having such a scheme, we can, for example, generate numbers in the interval [ i .. j ) by setting k = j - i , generating a number in the interval [0 .. k ), and then adding i to it .

#### Built-in C ++

Many languages ​​have built-in tools to get a random number in a specified interval. For example, to remove a card from a deck with 52 cards in scripting languages ​​such as Perl and Python, we can write `int(rand(52))`and, respectively `random.randint(0,52)`. [Note CryptoPirate user : It seems to me an error here, in Python randint (a, b) generates numbers from a to b including b. And since there are 52 cards in the deck and the first one is “0”, then it should be random.randint (0,51) .] In C ++, we can use it in the same way `uniform_int_distribution`.

C ++ code to implement this approach is simple:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
std::uniform_int_distribution<uint32_t> dist(0, range-1);
return dist(rng);
}``````

Usually, one of the techniques described below is used in the built-in tools, but most users simply use these tools, not thinking about what is happening “under the hood,” believing that these tools are correctly designed and quite effective. In C ++, built-in tools are more complex because they should be able to work with fairly arbitrary generation engines - a generator that produces values ​​in the range from -3 to 17 can be quite valid and can be used `std::uniform_int_distribution`to create numbers in any interval, for example [0. .1000). That is, the C ++ built-in tools are too overcomplicated for most of the cases in which they are used.

#### The classic remainder of the division (skewed)

Let's move from an over-simplified approach to a too simplistic one.

When I studied programming, we generated numbers in the interval (for example, to select a card in a deck of 52 cards) using the remainder operator. To get a number in the interval [0..52), we wrote `rand() % 52`.

In C ++, this approach can be implemented as follows:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
return rng() % range;
}``````

Despite the simplicity of this approach, it demonstrates the reason that obtaining numbers in the desired interval is usually a slow task - it requires division (to calculate the remainder obtained by the operator `%`). The division is usually at least an order of magnitude slower than other arithmetic operations, so a single arithmetic operation takes longer than all the work performed by a fast PRNG.

But in addition to low speed, it is also skewed . To understand why it `rand() % 52`returns skewed numbers, suppose that it `rand()`creates numbers in the interval [0..2 32 ), and note that 52 does not divide 2 32 completely, it divides it 82 595 524 times with the remainder 48. That is, if we use`rand() % 52`, then we will have 82 595 525 ways to choose the first 48 cards from the deck and only 82 595 524 ways to choose the last four cards. In other words, there is a skew of 0.00000121% against these last four cards (perhaps these are kings!). When I was a student and I wrote homework about throwing dice or drawing cards, nobody usually cared about such tiny distortions, but with an increase in the interval, the distortion grows linearly. For a 32-bit PRNG, a limited interval of less than 2 24 has a skew of less than 0.5%, but above 2 31 the skew is 50% - some numbers will return twice as often as others.

In this article, we will mainly consider techniques that use strategies to eliminate a systematic error, but it is probably worth saying that for a 64-bit PRNG, the skew value in normal applications is likely to be negligible.

Another problem may be that some generators have weak low bits. For example, the GPRS families Xoroshiro + and Xoshiro + have low bits that do not pass statistical tests. When we execute `% 52`(because 52 is even), we pass the least significant bit directly to the output.

#### Multiply floating point numbers (skewed)

Another common technique is the use of a PRNG that generates floating point numbers in the interval [0..1) with the subsequent conversion of these numbers to the desired interval. This approach is used in Perl, it is recommended to use it `int(rand(10))`to generate an integer in the interval [0..10) by generating a floating-point number followed by rounding down.

In C ++, this approach is written like this:

``````static uint32_t bounded_rand(rng_t& rng, uint32_t range){
double zeroone = 0x1.0p-32 * rng();
return range * zeroone;
}``````

(Note that `0x1.0p-32`this is a binary floating-point constant for 2 -32 , which we use to convert a random integer in the interval [0..2 32 ) to double in the unit interval; instead, we can perform such a conversion with `ldexp(rng(), -32)`, but when I benchmarked this approach, it turned out to be much slower.)

This approach is as skewed as the classic remainder of division, but skew appears differently. For example, if we were to select numbers in the interval [0..52), then the numbers 0, 13, 26, and 39 would occur once less often than others.

This version, when generalizing to 64 bits, is even more unpleasant, because it requires a floating point type whose mantissa is at least 64 bits. On x86 machines with Linux and macOS, we can use `long double`to take advantage of the increased precision x86 floating-point numbers that have a 64-bit mantissa, but `long double`cannot be universally ported to all systems - it is `long double`equivalent on some systems `double`.

There is a good side - this approach is faster than residual solutions for PRNG with weak low bits.

#### Integer Multiplication (Skewed)

The multiplication method can be adapted to fixed rather than floating point arithmetic. In fact, we just constantly multiply by 2 32 ,

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
uint32_t x = rng();
uint64_t m = uint64_t(x) * uint64_t(range);
return m >> 32;
}``````

It might seem that this version requires 64-bit arithmetic, on x86 processors a good compiler will compile this code into a 32-bit instruction `mult`(which gives us two 32-bit output values, one of which is the return value). This version can be expected to be fast, but it is skewed exactly like the method of multiplying floating point numbers.

#### Drop division (no skew)

We can modify the floating point multiplication scheme into a division based scheme. Instead of multiplication, `x * range / 2**32`we calculate `x / (2**32 / range)`. Since we are working with integer arithmetic, rounding in this version will be performed differently, and sometimes generate values ​​outside the desired interval. If we discard these values ​​(for example, get rid of them and generate new values), then as a result we get a technique without distortions.

For example, in the case of pulling out the card with a 32-bit PRNG we can generate a 32-bit number and divide it by 2 32 /52 = 82595524, to select a card. This technique works if a random value from a 32-bit PRNG is less than 52 × 82595524 = 2 32/ 32 - 48. If the random value from the PRNG is one of the last 48 values ​​of the upper part of the generator interval, then it must be discarded and another should be sought.

Our code for this version uses a trick with division of 2 32 by `range`without using 64-bit math. For direct calculation, `2**32 / range`we need to represent the number 2 32 , which is too large (per unit!) To represent as a 32-bit integer. Instead, we consider that for unsigned integers, the unary negation operation `range`computes a positive value of 2 32 - `range`; dividing this value by `range`, we get a response one less than `2**32 / range`.

Therefore, C ++ code for generating numbers using division and drop looks like this:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
// calculates divisor = 2**32 / rangeuint32_t divisor = ((-range) / range) + 1;
if (divisor == 0) // overflow, it's really 2**32return0;
for (;;) {
uint32_t val = rng() / divisor;
if (val < range)
return val;
}
}``````

Of course, this approach requires two slow operations based on division, which are usually slower than other arithmetic operations, so you should not expect it to be fast.

#### The remainder of the division (double) without distortions - OpenBSD technique

We can also take the drop approach to eliminate the skew in the classic division remainder method. In the example with playing cards, we again need to drop 48 values. In this version, instead of discarding the last 48 values, we (equivalently) discard the first 48 values.

Here is the implementation of this approach in C ++:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
// calculates 2**32 % rangeuint32_t t = (-range) % range;
for (;;) {
uint32_t r = rng();
if (r >= t)
return r % range;
}
}``````

This technique removes skew, but it requires two time-consuming dividing operations with the remainder of each output value (and you may need an internal generator to create several numbers). Therefore, it should be expected that the method will be approximately two times slower than the classical skew approach.

At `коде arc4random_uniform`the OpenBSD (which is also used in OS X and iOS) used this strategy.

#### Remainder of division (single) without skew - Java methodology

Java uses a different approach to generating a number in an interval that uses only one remainder division operation, with the exception of fairly rare cases of discarding the result. Code:

``````static uint32_t bounded_rand(rng_t& rng, uint32_t range){
uint32_t x, r;
do {
x = rng();
r = x % range;
} while (x - r > (-range));
return r;
}``````

To understand why this option works, you need to think a little. Unlike the previous version based on residuals, which eliminates bias by removing part of the lowest values ​​from the internal generation engine, this version filters out values ​​from the upper part of the engine interval.

#### Skew integer multiplication - Lemira method

In much the same way that we removed the bias from the remainder of division method, we can eliminate the bias from the integer multiplication technique. This technique was invented by Lemyr.

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
uint32_t t = (-range) % range;
do {
uint32_t x = rng();
uint64_t m = uint64_t(x) * uint64_t(range);
uint32_t l = uint32_t(m);
} while (l < t);
return m >> 32;
}``````

#### Drop bit mask (no skew) - Apple technique

In our last approach, division and remainder operations are completely eliminated. Instead, it uses a simple masking operation to obtain a random number in the interval [0..2 k ), where k is the smallest value, such that 2 k is greater than the interval. If the value is too large for our interval, we discard it and try to get another. The code is shown below:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
uint32_t mask = ~uint32_t(0);
--range;
uint32_t x;
do {
x = rng() & mask;
} while (x > range);
return x;
}``````

This approach was adopted by Apple when (in the macOS Sierra release) it performed its own code revision`arc4random_uniform` .

### Benchmarking basic techniques

Now we have several approaches that can be evaluated. Unfortunately, when we are concerned about the costs of a single division operation, benchmarking becomes a non-trivial thing. No benchmark is able to take into account all the factors affecting the field of application, and there are no guarantees that the best option for your application will certainly be the best for mine.

We use three benchmarks and test the techniques with many different PRNGs.

#### Benchmark Large-Shuffle

Probably the most obvious benchmark is mixing. In this benchmark, we simulate performing large-scale mixing. To sort an array of size N, we must generate numbers in the intervals [0 .. N ), [0 .. ( N -1)), ..., [0..1). In this benchmark, we will assume that N is the maximum possible number (for `uint32_t`this, 2 32 -1). Code:

``````for (uint32_t i = 0xffffffff; i > 0; --i) {
uint32_t bval = bounded_rand(rng, i);
assert(bval < i);
sum += bval;
}``````

Note that we “use” each number, adding it to `sum`(so that it will not be thrown away by optimization), but do not perform any mixing to focus on the generation of numbers.

For testing 64-bit generation, we have a similar test, but it will be impractical to perform a test corresponding to mixing an array of 2 64 - 1 in size (because it will take many thousands of years to complete this larger benchmark). Instead, we cross the entire 64-bit interval, but generate the same number of output values ​​as in the 32-bit test. Code:

``````for (uint32_t i = 0xffffffff; i > 0; --i) {
uint64_t bound = (uint64_t(i)<<32) | i;
uint64_t bval = bounded_rand(rng, bound );
assert(bval < bound);
sum += bval;
}``````

##### Mersenne vortex results

The results shown below demonstrate the performance of this benchmark for each of the methods we examined when using the Mersenne vortex and testing on the 32-bit (using `std::mt19937`from `libstdc++`) considered in the article and a similar 64-bit code (using `std:mt19937_64`from `libstdc++`). The results are the geometric mean of 15 runs with different seed values, which is then normalized so that the classical division remainder method has a single run time. It may seem that we have clear answers about performance - it seems that you can build techniques for their perfection, and ask yourself what the developers were thinking `libstdc++`when they wrote such a terrible implementation for 32-bit numbers. But, as is often the case with benchmarking, the situation is more complicated than it seems from these results. Firstly, there is a risk that the results may be specific to the Mersenne vortex, so we will expand the many tested PRNGs. Secondly, there may be a subtle problem with the benchmark itself. Let's first deal with the first question.

##### Results of different PRNGs

32-bit PRNG we test using the `arc4_rand32`, `chacha8r`, `gjrand32`, `jsf32`, `mt19937`, `pcg32`, `pcg32_fast`, `sfc32`, `splitmix32`, `xoroshiro64+`, `xorshift*64/32`, `xoshiro128+`and `xoshiro128**`, while 64-bit PRNG - using the `gjrand64`, `jsf64`, `mcg128`, `mcg128_fast`, `mt19937_64`, `pcg64`, `pcg64_fast`, `sfc64`, `splitmix64`, `xoroshiro128+`, `xorshift*128/64`, `xoshiro256+`and `xoshiro256*`. These kits will give us some slow PRNs and a lot of very fast ones.

Here are the results: We can see the key differences from the results with the Mersenne vortex. Faster PRSPs shift the equilibrium towards the bounding code, and therefore the difference between the different approaches becomes more pronounced, especially in the case of 64-bit PRSPs. With a wider set of PRNGs, implementation no `libstc++`longer seems so terrible.

##### findings

In this benchmark by a significant margin, the approach based on multiplication with bias wins in speed. There are many situations in which the boundaries will be small relative to the size of the PRNG, and the performance is absolutely critical. In such situations, a slight bias is unlikely to have a noticeable effect, but the PRNG speed will have. One such example is Quicksort with a random anchor point. Of the skewed methods, the bitmask technique looks promising.

But before making serious conclusions, we need to point out the huge problem of this benchmark - most of the time is spent on very high boundaries, which most likely gives excessive importance to large intervals. Therefore, we need to go to the second benchmark.

#### Benchmark Small-Shuffle

This benchmark is similar to the previous one, but performs much less “array mixing” (multiple). Code:

``````for (uint32_t j = 0; j < 0xffff; ++j) {
for (uint32_t i = 0xffff; i > 0; --i) {
uint32_t bval = bounded_rand(rng, i);
assert(bval < i);
sum += bval;
}
}``````

##### Mersenne vortex results ##### Results of different PRNGs ##### findings

This benchmark avoids too much emphasis on large boundaries and more accurately reflects real-world use cases, but now completely discards large boundaries.

#### Benchmark for all intervals

This benchmark aims to avoid the disadvantages of the previous two; he performs testing at each size of the power of two so that each size is present, but his influence is not overestimated.

``````for (uint32_t bit = 1; bit != 0; bit <<= 1) {
for (uint32_t i = 0; i < 0x1000000; ++i) {
uint32_t bound = bit | (i & (bit - 1));
uint32_t bval = bounded_rand(rng, bound);
assert(bval < bound);
sum += bval;
}
}``````

##### Mersenne vortex results ##### Results of different PRNGs ##### findings

Many of our findings remain unchanged. The skew method is quick if we can put up with the error, and the bitmask scheme seems to be a good averaged choice.

We could end this if we did not want to go back, take a critical look at our code and make changes to it.

### Make improvements

Up to this point, all skew elimination methods required the use of an additional division remainder operation, which is why they are performed much more slowly than skew methods. It would be helpful if we could reduce this advantage.

#### Faster Threshold Based Drop

Some of our algorithms have code using a threshold value, for example:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
// calculates 2**32 % rangeuint32_t t = (-range) % range;
for (;;) {
uint32_t r = rng();
if (r >= t)
return r % range;
}
}``````

When `range`small compared to the PRNG output interval, most often the number will be much larger than the threshold. That is, if we can add a preliminary estimate of the threshold, which may be a little more, we will save on the expensive operation of taking the remainder of the division.

The following code handles this task:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
uint32_t r = rng();
if (r < range) {
uint32_t t = (-range) % range;
while (r < t)
r = rng();
}
return r % range;
}``````

This change can be applied to both “double mod without distortions” (see above) and to “integer multiplication without distortions”. The idea came up with Lemyr, who applied it to the second method (but not to the first).

##### Large-Shuffle Benchmark Results

This optimization leads to a significant improvement in the results of the 64-bit benchmark (in which the mod is even slower), but in fact it slightly worsens the performance in the 32-bit benchmark. Despite the improvements, the bitmask method still wins. ##### Small-Shuffle Benchmark Results

On the other hand, this change significantly speeds up the small-shuffle benchmark for both the method of integer multiplication and the double remainder of division method. In both cases, their performance shifts closer to the results of the options without distortions. The performance of the double remainder method (OpenBSD) is now almost equal to that of the single remainder method (Java). ##### Benchmark results for all intervals

We see a similar improvement in the benchmark for all intervals. It looks like we can announce a new universal winner: an optimized method for multiplying Lemire integers without skewing.

#### Division remainder optimization

Typically `a % b`, division is required for the calculation , but in situations where the `a < b`result is simple `a`and division is not required. And when `a/2 < b`, the result is simple `a - b`. Therefore, instead of computing

``a %= b;``

we can fulfill

``````if (a >= b) {
a -= b;
if (a >= b)
a %= b;
}``````

The cost of dividing is so significant that increasing the cost of this more complex code can justify itself by saving time due to the absence of division.

##### Large-Shuffle Benchmark Results

Adding this optimization greatly improves the large-shuffle benchmark results. This is again more noticeable in 64-bit code, where the operation of taking the remainder is more expensive. The double-remainder method (OpenBSD style) shows versions with optimizations for only one remainder operation and for both. In this benchmark, the bit mask is still the winner, but the border between it and Lemira’s approach has narrowed significantly.

##### Small-Shuffle Benchmark Results

Adding this optimization does not increase the performance of the small-shuffle benchmark, so the question remains only if it adds significant costs. In some cases, no; in others, the costs increase slightly. ##### Benchmark results for all intervals

In the benchmark for all intervals, the changes are also small. #### Bonus: PRSP comparison results

The main reason for using a lot of PRNGs for testing number schemes in intervals was to avoid inadvertent distortion of the results due to the operating characteristics of individual PRNC schemes. But we can use the same results of internal tests to compare the generation schemes themselves.

##### PRNG with 32-bit output

The graph below shows the performance of different 32-bit generation schemes, averaged for all methods and fifteen runs, normalized to the performance of the 32-bit Mersenne vortex: On the one hand, I am glad to see that it `pcg32_fast`is really fast - it was defeated only by a small version of Xoroshiro (which does not pass statistical tests). But this also shows why I rarely get upset because of the performance of modern high-performance general-purpose PRSPs - the difference between the different methods is very insignificant. In particular, the fastest four circuits differ in performance by less than 5%, and I believe that this is simply caused by “noise”.

##### PRNG with 64-bit output

The graph shows the performance of various 64-bit generation schemes averaged among all the techniques and fifteen runs normalized to the performance of the 32-bit Mersenne vortex. It may seem strange that normalization is performed using the 32-bit Mersenne vortex, but this allows us to see the additional costs of using 64-bit generation in cases where 32-bit generation is sufficient. These results confirm that it is `mcg128_fast`incredibly fast, but the last four techniques again differ by only about 5%, so it’s difficult to choose from the fastest methods. `pcg64`and `pcg64_fast`must be slower `mcg128_fast`, because their basic generators are 128-bit linear congruent generators (LCG, LCG) and 128-bit multiplicative congruent generators (MKG, MCG). Despite the fact that they are not the fastest techniques in this set, they are `pcg64`still 20% faster than the 64-bit Mersenne vortex.

But perhaps more importantly, these results also show that if you do not need 64-bit output, then a 64-bit PRNG is usually slower than 32-bit.

### findings

From our benchmarks, we can see that the transition from standardly used PRNGs (for example, the 32-bit Mersenne vortex) to faster PRNPs reduced the execution time of benchmarks by 45%. But the transition from the standard method of searching for a number in the interval to our fastest method allowed us to reduce the benchmark time by about 66%; in other words, up to one third of the original time.

The fastest method (without distortions) is the Lemira method (with my additional optimization). Here he is:

``````uint32_t bounded_rand(rng_t& rng, uint32_t range) {
uint32_t x = rng();
uint64_t m = uint64_t(x) * uint64_t(range);
uint32_t l = uint32_t(m);
if (l < range) {
uint32_t t = -range;
if (t >= range) {
t -= range;
if (t >= range)
t %= range;
}
while (l < t) {
x = rng();
m = uint64_t(x) * uint64_t(range);
l = uint32_t(m);
}
}
return m >> 32;
}``````

Using the Lemira method will improve the performance of most randomized algorithms more than moving from a fast generation engine to a faster one.

### Appendices: Testing Notes

The code of all tests is posted on GitHub . In total, I tested 23 methods for `bounded_rand`using 26 different PRNs (13 32-bit PRNs and 13 64-bit), in two compilers (GCC 8 and LLVM 6), which gave me 26 * 23 * 2 = 1196 executable files, each of which it was performed with the same 15 seed, which gives 1196 * 15 = 17,940 unique test runs, in each of which three benchmarks are combined. Basically, I ran tests on a 48-core machine with four 2.1 GHz Xeon E7-4830v3 processors. Performing a full set of tests took a little less than a month of processor time.

In the end, we return to the situation from the introduction of the article. Imagine that Sasha used `jsf32.STD-libc++`, and Juan -`mt19937.BIASED_FP_MULT_SCALE`. In benchmark 3, the latter takes 69.6% less time. That is, the time from this fictitious situation is based on data from reality.