A small review of SIMD in .NET / C#

This is a short overview of what the .NET Framework and .NET Core offer for vectorizing algorithms. The article aims to introduce these techniques to readers who have not met them before, and to show that .NET is not far behind the "real, compiled" languages for native development.

I am only beginning to learn vectorization techniques, so if someone from the community points out an obvious mistake or offers improved versions of the algorithms described below, I will be very happy.

A bit of history

SIMD first appeared in .NET in 2015 with the release of .NET Framework 4.6. That release added the types Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3 and Vector4, which allowed vectorized calculations. Later the Vector<T> type was added, which gave more room for vectorizing algorithms. Many programmers were still unhappy, because these types constrained how an algorithm could be expressed and did not expose the full power of the SIMD instructions of modern processors. More recently, the .NET Core 3.0 Preview introduced the System.Runtime.Intrinsics namespace, which gives much more freedom in choosing instructions. To get the best speed you need to use RyuJIT and either build for x64, or build for AnyCPU with Prefer 32-bit disabled.
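Whether the build settings above actually gave you SIMD can be checked at run time. The following is a minimal sketch that uses only the standard System.Numerics API; the class name SimdInfo is made up for this illustration.

```csharp
using System;
using System.Numerics;

public static class SimdInfo
{
    // Describes the SIMD support that System.Numerics gets in the current process.
    public static string Describe()
    {
        return $"64-bit process: {Environment.Is64BitProcess}, " +
               $"hardware accelerated: {Vector.IsHardwareAccelerated}, " +
               $"ints per Vector<int>: {Vector<int>.Count}, " +
               $"bytes per Vector<byte>: {Vector<byte>.Count}";
    }
}
```

Printing SimdInfo.Describe() at startup is usually enough: if "hardware accelerated" is False, the Vector<T> code still works, but it runs without SIMD instructions, so the build platform and JIT are the first things to check.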

Sum the array elements

I decided to start with the classic task that is usually written first when it comes to vectorization: finding the sum of the elements of an array. Let's write four implementations that sum the elements of the array Array:

Most obvious

public int Naive() {
    int result = 0;
    foreach (int i in Array) {
        result += i;
    }
    return result;
}

Using LINQ

public long LINQ() => Array.Aggregate<int, long>(0, (current, i) => current + i);

Using vectors from System.Numerics:

public int Vectors() {
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    // <= so that the last full vector is also processed when Length is a multiple of vectorSize
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v);
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}

Using code from the System.Runtime.Intrinsics namespace:

public unsafe int Intrinsics() {
    int vectorSize = 256 / 8 / 4;
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array) {
        // <= so that the last full vector is also processed when Length is a multiple of vectorSize
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v);
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++) {
        result += temp[j];
    }   
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}

I launched a benchmark on these 4 methods on my computer and got the following result:

Method	Item count	Median
Naive	10	75.12 ns
LINQ	10	1,186.85 ns
Vectors	10	60.09 ns
Intrinsics	10	255.40 ns

Naive	100	360.56 ns
LINQ	100	2,719.24 ns
Vectors	100	60.09 ns
Intrinsics	100	345.54 ns

Naive	1,000	1,847.88 ns
LINQ	1,000	12,033.78 ns
Vectors	1,000	240.38 ns
Intrinsics	1,000	630.98 ns

Naive	10,000	18,403.72 ns
LINQ	10,000	102,489.96 ns
Vectors	10,000	7,316.42 ns
Intrinsics	10,000	3,365.25 ns

Naive	100,000	176,630.67 ns
LINQ	100,000	975,998.24 ns
Vectors	100,000	78,828.03 ns
Intrinsics	100,000	41,269.41 ns

It can be seen that, starting from about 1,000 elements, the Vectors and Intrinsics solutions are much faster than the obvious solution and LINQ. Now we need to understand what happens in these two methods.

Let us consider the Vectors method in more detail:

Vectors

public int Vectors() {
    int vectorSize = Vector<int>.Count;
    var accVector = Vector<int>.Zero;
    int i;
    var array = Array;
    // <= so that the last full vector is also processed when Length is a multiple of vectorSize
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
        var v = new Vector<int>(array, i);
        accVector = Vector.Add(accVector, v);
    }
    int result = Vector.Dot(accVector, Vector<int>.One);
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}

int vectorSize = Vector<int>.Count; - this is how many 4-byte numbers fit into the vector. If hardware acceleration is used, this value shows how many 4-byte numbers fit into one SIMD register. In effect, it shows how many elements of this type can be processed in parallel;
accVector is the vector in which the result of the function is accumulated;
var v = new Vector<int>(array, i); - data is loaded from array into the new vector v, starting at index i. Exactly vectorSize elements are loaded.
accVector = Vector.Add(accVector, v); - the two vectors are added.
For example, if Array stores 8 numbers, {0, 1, 2, 3, 4, 5, 6, 7}, and vectorSize == 4, then:
In the first iteration of the loop, accVector = {0, 0, 0, 0} and v = {0, 1, 2, 3}; after the addition accVector holds {0, 0, 0, 0} + {0, 1, 2, 3} = {0, 1, 2, 3}.
In the second iteration, v = {4, 5, 6, 7}, and after the addition accVector = {0, 1, 2, 3} + {4, 5, 6, 7} = {4, 6, 8, 10}.
It only remains to get the sum of all elements of the vector. For this you can take the dot product with a vector filled with ones: int result = Vector.Dot(accVector, Vector<int>.One);
Then we get: {4, 6, 8, 10} * {1, 1, 1, 1} = 4 * 1 + 6 * 1 + 8 * 1 + 10 * 1 = 28.
Finally, if necessary, the numbers that did not fit into the last vector are added.
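Because Vector<int>.Count depends on the hardware, the 4-element example above cannot be reproduced exactly with Vector<T> on every machine. As a rough illustration, the same steps can be replayed with a scalar simulation in which a plain int array stands in for the SIMD register. The class LaneSumDemo and the lanes parameter are hypothetical names used only for this sketch; this is not how Vector<T> is implemented.

```csharp
using System;

public static class LaneSumDemo
{
    // Scalar simulation of the vectorized sum; 'lanes' plays the role of vectorSize.
    public static int Sum(int[] array, int lanes)
    {
        var acc = new int[lanes];                  // stands in for accVector, all zeros
        int i = 0;
        for (; i <= array.Length - lanes; i += lanes)
        {
            for (int lane = 0; lane < lanes; lane++)
                acc[lane] += array[i + lane];      // Vector.Add, one lane at a time
            Console.WriteLine("after block at " + i + ": {" + string.Join(", ", acc) + "}");
        }

        int result = 0;
        foreach (int lane in acc)
            result += lane;                        // Vector.Dot(accVector, Vector<int>.One)

        for (; i < array.Length; i++)
            result += array[i];                    // elements that did not fill a whole block
        return result;
    }
}
```

Calling LaneSumDemo.Sum(new[] { 0, 1, 2, 3, 4, 5, 6, 7 }, 4) prints the accumulator as {0, 1, 2, 3} and then {4, 6, 8, 10}, and returns 28, matching the walkthrough.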

If you look at the code for the Intrinsics method:

Intrinsics

public unsafe int Intrinsics() {
    int vectorSize = 256 / 8 / 4;
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array) {
        // <= so that the last full vector is also processed when Length is a multiple of vectorSize
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
            var v = Avx2.LoadVector256(ptr + i);
            accVector = Avx2.Add(accVector, v);
        }
    }
    int result = 0;
    var temp = stackalloc int[vectorSize];
    Avx2.Store(temp, accVector);
    for (int j = 0; j < vectorSize; j++) {
        result += temp[j];
    }   
    for (; i < array.Length; i++) {
        result += array[i];
    }
    return result;
}

You can see that it is very similar to Vectors with some exceptions:

vectorSize is a constant. This is because the method explicitly uses Avx2 instructions, which operate on 256-bit registers. A real application must check whether the current processor supports Avx2 instructions and, if it does not, call other code. It looks like this:
```
if (Avx2.IsSupported) {
    DoThingsForAvx2();
}
else if (Avx.IsSupported) {
    DoThingsForAvx();
}
...
else if (Sse2.IsSupported) {
    DoThingsForSse2();
}
...
```
var accVector = Vector256<int>.Zero; - accVector is declared as a 256-bit vector filled with zeros.
fixed (int* ptr = array) - the array is pinned and a pointer to its first element is stored in ptr.
The following operations are the same as in Vectors: data is loaded into a vector and two vectors are added.
The elements of the vector are summed as follows:
- an array is created on the stack: var temp = stackalloc int[vectorSize];
- the vector is stored into this array: Avx2.Store(temp, accVector);
- a loop sums the elements of the array.
Finally, the elements of the array that did not fit into a full vector are added.
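To make the IsSupported check concrete, here is one possible sketch of a sum that picks a code path at run time. It is not the code from the benchmark: the names SumDispatch, SumAvx2, SumVectorT and SumScalar are made up, and instead of pointers it reinterprets the array with MemoryMarshal.Cast so that it compiles without unsafe. It assumes .NET Core 3.0 or later.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SumDispatch
{
    // Picks the widest implementation the current processor and JIT support.
    public static int Sum(int[] array)
    {
        if (Avx2.IsSupported) return SumAvx2(array);
        if (Vector.IsHardwareAccelerated) return SumVectorT(array);
        return SumScalar(array, 0);
    }

    static int SumAvx2(int[] array)
    {
        // View the array as whole 256-bit blocks; leftover elements are not part of the view.
        var blocks = MemoryMarshal.Cast<int, Vector256<int>>(array.AsSpan());
        var acc = Vector256<int>.Zero;
        foreach (var block in blocks)
            acc = Avx2.Add(acc, block);

        int result = 0;
        for (int j = 0; j < Vector256<int>.Count; j++)
            result += acc.GetElement(j);           // horizontal sum of the accumulator

        return result + SumScalar(array, blocks.Length * Vector256<int>.Count);
    }

    static int SumVectorT(int[] array)
    {
        int width = Vector<int>.Count;
        var acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= array.Length - width; i += width)
            acc += new Vector<int>(array, i);
        return Vector.Dot(acc, Vector<int>.One) + SumScalar(array, i);
    }

    // Sums the elements from 'start' to the end of the array one by one.
    static int SumScalar(int[] array, int start)
    {
        int result = 0;
        for (int i = start; i < array.Length; i++)
            result += array[i];
        return result;
    }
}
```

The caller only sees SumDispatch.Sum(array). On a processor without AVX2 the first branch is simply never taken, so the same binary runs everywhere.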

Compare two arrays

The task is to compare two byte arrays. This is actually the task that made me start studying SIMD in .NET. Again, let's write several methods for the benchmark; they compare two arrays, ArrayA and ArrayB:

The most obvious solution:

public bool Naive() {
    for (int i = 0; i < ArrayA.Length; i++) {
        if (ArrayA[i] != ArrayB[i]) return false;
    }
    return true;
}

Solution through LINQ:

public bool LINQ() => ArrayA.SequenceEqual(ArrayB);

Solution via the memcmp function:

// count is size_t in C, so it is marshalled as UIntPtr
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
static extern int memcmp(byte[] b1, byte[] b2, UIntPtr count);
public bool MemCmp() => memcmp(ArrayA, ArrayB, (UIntPtr)(uint)ArrayA.Length) == 0;

Using vectors from System.Numerics:

public bool Vectors() {
    int vectorSize = Vector<byte>.Count;
    int i = 0;
    // <= so that the last full vector is also compared when Length is a multiple of vectorSize
    for (; i <= ArrayA.Length - vectorSize; i += vectorSize) {
        var va = new Vector<byte>(ArrayA, i);
        var vb = new Vector<byte>(ArrayB, i);
        if (!Vector.EqualsAll(va, vb)) {
            return false;
        }
    }
    for (; i < ArrayA.Length; i++) {
        if (ArrayA[i] != ArrayB[i])
            return false;
    }
    return true;
}

Using Intrinsics:

public unsafe bool Intrinsics() {
    int vectorSize = 256 / 8;
    int i = 0;
    const int equalsMask = unchecked((int) (0b1111_1111_1111_1111_1111_1111_1111_1111));
    fixed (byte* ptrA = ArrayA)
    fixed (byte* ptrB = ArrayB) {
        // <= so that the last full vector is also compared when Length is a multiple of vectorSize
        for (; i <= ArrayA.Length - vectorSize; i += vectorSize) {
            var va = Avx2.LoadVector256(ptrA + i);
            var vb = Avx2.LoadVector256(ptrB + i);
            var areEqual = Avx2.CompareEqual(va, vb);
            if (Avx2.MoveMask(areEqual) != equalsMask) {
                return false;
            }
        }
        for (; i < ArrayA.Length; i++) {
            if (ArrayA[i] != ArrayB[i])
                return false;
        }
        return true;
    }
}

The result of the benchmark on my computer:

Method	Item count	Median
Naive	10,000	66,719.1 ns
LINQ	10,000	71,211.1 ns
Vectors	10,000	3,695.8 ns
MemCmp	10,000	600.9 ns
Intrinsics	10,000	1,607.5 ns

Naive	100,000	588,633.7 ns
LINQ	100,000	651,191.3 ns
Vectors	100,000	34,659.1 ns
MemCmp	100,000	5,513.6 ns
Intrinsics	100,000	12,078.9 ns

Naive	1,000,000	5,637,293.1 ns
LINQ	1,000,000	6,622,666.0 ns
Vectors	1,000,000	777,974.2 ns
MemCmp	1,000,000	361,704.5 ns
Intrinsics	1,000,000	434,252.7 ns

I think the whole code of these methods is understandable, with the exception of two lines in Intrinsics:

var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask) {
    return false;
}

The first line compares the two vectors for equality and stores the result in the vector areEqual, in which all bits of an element are set to 1 if the corresponding elements of va and vb are equal. So if the byte vectors va and vb are completely equal, every element of areEqual must be 255 (11111111b). Avx2.CompareEqual is a wrapper over _mm256_cmpeq_epi8, so the pseudocode of this operation can be looked up on the Intel website.
The MoveMask method turns the vector into a 32-bit number whose bits are the high-order bits of each of the 32 single-byte elements of the vector. It wraps the _mm256_movemask_epi8 intrinsic, whose pseudocode can likewise be found in Intel's documentation.

Thus, if some bytes of va and vb do not match, the corresponding bytes of areEqual are 0, so the high bits of those bytes are 0 as well. The corresponding bits in the result of Avx2.MoveMask are then also 0, and the comparison with equalsMask fails.

Let us work through a small example, assuming a vector length of 8 bytes (to keep the notation short):

Let va = {100, 10, 20, 30, 100, 40, 50, 100} and vb = {100, 20, 10, 30, 100, 40, 80, 90}.
Then areEqual will be {255, 0, 0, 255, 255, 255, 0, 0}.
The MoveMask method will return 10011100b (written here with the bit for element 0 on the left; as an integer, where element 0 is the lowest bit, this is 00111001b), which is compared with the mask 11111111b. Since the two masks are not equal, the vectors va and vb are not equal either.
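The 8-byte example can also be replayed in code. The sketch below is a scalar model of the two operations, not the real AVX2 instructions: the class MoveMaskDemo and its methods are hypothetical, and it assumes both arrays have the same length of at most 32 elements.

```csharp
using System;

public static class MoveMaskDemo
{
    // Scalar model of CompareEqual on bytes: 0xFF where the elements are equal, 0x00 otherwise.
    public static byte[] CompareEqual(byte[] a, byte[] b)
    {
        var result = new byte[a.Length];
        for (int i = 0; i < a.Length; i++)
            result[i] = a[i] == b[i] ? (byte)0xFF : (byte)0x00;
        return result;
    }

    // Scalar model of MoveMask: bit i of the result is the high-order bit of element i.
    public static int MoveMask(byte[] v)
    {
        int mask = 0;
        for (int i = 0; i < v.Length; i++)
            mask |= (v[i] >> 7) << i;
        return mask;
    }
}
```

For va and vb from the example, MoveMaskDemo.MoveMask(MoveMaskDemo.CompareEqual(va, vb)) gives 57, that is 0b0011_1001: bits 0, 3, 4 and 5 are set. These are the same bits as 10011100 in the text, which lists element 0 first. Only the value 0b1111_1111 would mean that all eight elements match.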

Counting the number of times an item is found in a collection

Sometimes you need to count how many times a specific element occurs in a collection, for example a collection of ints. This algorithm can also be accelerated. Let's write several methods for comparison; they look for the element Item in the array Array.

The most obvious:

public int Naive() {
    int result = 0;
    foreach (int i in Array) {
        if (i == Item) {
            result++;
        }
    }
    return result;
}

Using LINQ:

public int LINQ() => Array.Count(i => i == Item);

Using vectors from System.Numerics.Vectors:

public int Vectors() {
    var mask = new Vector<int>(Item);
    int vectorSize = Vector<int>.Count;
    var accResult = new Vector<int>();
    int i;
    var array = Array;
    // <= so that the last full vector is also processed when Length is a multiple of vectorSize
    for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
        var v = new Vector<int>(array, i);
        var areEqual = Vector.Equals(v, mask);
        accResult = Vector.Subtract(accResult, areEqual);
    }
    int result = 0;
    for (; i < array.Length; i++) {
        if (array[i] == Item) {
            result++;
        }
    }
    result += Vector.Dot(accResult, Vector<int>.One);
    return result;
}

Using Intrinsics:

public unsafe int Intrinsics() {
    int vectorSize = 256 / 8 / 4;
    //var mask = Avx2.SetAllVector256(Item);
    //var mask = Avx2.SetVector256(Item, Item, Item, Item, Item, Item, Item, Item);
    var temp = stackalloc int[vectorSize];
    for (int j = 0; j < vectorSize; j++) {
        temp[j] = Item;
    }
    var mask = Avx2.LoadVector256(temp);
    var accVector = Vector256<int>.Zero;
    int i;
    var array = Array;
    fixed (int* ptr = array) {
        // <= so that the last full vector is also processed when Length is a multiple of vectorSize
        for (i = 0; i <= array.Length - vectorSize; i += vectorSize) {
            var v = Avx2.LoadVector256(ptr + i);
            var areEqual = Avx2.CompareEqual(v, mask);
            accVector = Avx2.Subtract(accVector, areEqual);
        }
    }
    int result = 0;
    Avx2.Store(temp, accVector);
    for(int j = 0; j < vectorSize; j++) {
        result += temp[j];
    }
    for(; i < array.Length; i++) {
        if (array[i] == Item) {
            result++;
        }
    }
    return result;
}

The result of the benchmark on my computer:

Method	Item count	Median
Naive	1,000	2,824.41 ns
LINQ	1,000	12,138.95 ns
Vectors	1,000	961.50 ns
Intrinsics	1,000	691.08 ns

Naive	10,000	27,072.25 ns
LINQ	10,000	113,967.87 ns
Vectors	10,000	7,571.82 ns
Intrinsics	10,000	4,296.71 ns

Naive	100,000	361,028.46 ns
LINQ	100,000	1,091,994.28 ns
Vectors	100,000	82,839.29 ns
Intrinsics	100,000	40,307.91 ns

Naive	1,000,000	1,634,175.46 ns
LINQ	1,000,000	6,194,257.38 ns
Vectors	1,000,000	583,901.29 ns
Intrinsics	1,000,000	413,520.38 ns

The Vectors and Intrinsics methods are identical in logic; they differ only in how specific operations are implemented. The whole idea is:

a vector mask is created in which every element holds the number being searched for;
a part of the array is loaded into the vector v and compared with mask; in areEqual, all bits are set in the elements that are equal. Since areEqual is a vector of ints, an element with all bits set equals -1 (unchecked((int)0b1111_1111_1111_1111_1111_1111_1111_1111) == -1);
the areEqual vector is subtracted from accVector, so accVector accumulates, for each position, the number of times Item has occurred at that position across all vectors v (subtracting -1 adds 1).
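The "-1 per equal element" behaviour is easy to check directly. The following is a small sketch using System.Numerics only; the class CountDemo and its method names are made up for illustration, and the Count method is a compact variant of the Vectors method above rather than the benchmarked code.

```csharp
using System;
using System.Numerics;

public static class CountDemo
{
    // First element of Vector.Equals for two broadcast values: -1 if they are equal, 0 otherwise.
    public static int EqualsLane(int a, int b) =>
        Vector.Equals(new Vector<int>(a), new Vector<int>(b))[0];

    public static int Count(int[] array, int item)
    {
        var mask = new Vector<int>(item);          // the searched value in every element
        var acc = Vector<int>.Zero;                // per-position counters
        int width = Vector<int>.Count;
        int i = 0;
        for (; i <= array.Length - width; i += width)
        {
            // Equal elements hold -1, so subtracting them adds 1 to the counters.
            acc -= Vector.Equals(new Vector<int>(array, i), mask);
        }

        int result = Vector.Dot(acc, Vector<int>.One);   // add up the per-position counters
        for (; i < array.Length; i++)
            if (array[i] == item) result++;              // elements that did not fill a vector
        return result;
    }
}
```

CountDemo.EqualsLane(7, 7) returns -1 and CountDemo.EqualsLane(7, 8) returns 0, which is exactly why the accumulator is updated with a subtraction rather than an addition.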

All the code from the article can be found on GitHub

Conclusion

I have covered only a very small part of what .NET offers for vectorizing computations. For a complete and current list of the intrinsics available in .NET Core on x86, you can refer to the source code. Conveniently, the summary of each intrinsic in the C# files gives its name from the C world, which makes it easier both to understand the purpose of the intrinsic and to port existing C/C++ algorithms to .NET. Documentation for System.Numerics.Vector is available on MSDN.

In my opinion, .NET has a big advantage over C++: since JIT compilation happens on the client machine, the compiler can optimize the code for the specific client processor, providing maximum performance. At the same time, a programmer writing fast code can stay within a single language and technology.
