# A small review of SIMD in .NET / C #

We offer you a small overview of the capabilities of the vectorization of algorithms in the .NET Framework and .NETCORE. The purpose of the article is to introduce those techniques to those who did not know them at all and to show that .NET is not far behind the "real, compiled" languages for native

development.

I'm just starting to learn the techniques of vectorization, so if someone from the community points out the obvious jamb, or offers improved versions of the algorithms described below, I will be wildly happy.

## A bit of history

In .NET, SIMD first appeared in 2015 with the release of the .NET Framework 4.6. Then the types Matrix3x2, Matrix4x4, Plane, Quaternion, Vector2, Vector3 and Vector4 were added, which allowed vectorized calculations. Later, the Vector <T> type was added, which provided more opportunities for vectorization of algorithms. But many programmers were still unhappy, because The types described above limited the flow of programmer's thoughts and did not allow using the full power of SIMD instructions of modern processors. Already in our time, in the .NET Core 3.0 Preview, the System.Runtime.Intrinsics namespace has appeared, which provides much more freedom in choosing instructions. To get the best results in speed, you need to use RyuJit and you need to either build under x64, or disable Prefer 32-bit and build under AnyCPU.

## Sum the array elements

I decided to start with the classic problem, which is often written first when it comes to vectorization. This is the task of finding the sum of the elements of the array. Let's write four implementations of this task, we will summarize the elements of the Array array:

Most obvious

```
publicintNaive() {
int result = 0;
foreach (int i in Array) {
result += i;
}
return result;
}
```

Using LINQ

`publiclongLINQ() => Array.Aggregate<int, long>(0, (current, i) => current + i);`

Using vectors from System.Numerics:

```
publicintVectors() {
int vectorSize = Vector<int>.Count;
var accVector = Vector<int>.Zero;
int i;
var array = Array;
for (i = 0; i < array.Length - vectorSize; i += vectorSize) {
var v = new Vector<int>(array, i);
accVector = Vector.Add(accVector, v);
}
int result = Vector.Dot(accVector, Vector<int>.One);
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
```

Using code from the System.Runtime.Intrinsics space:

```
publicunsafeintIntrinsics() {
int vectorSize = 256 / 8 / 4;
var accVector = Vector256<int>.Zero;
int i;
var array = Array;
fixed (int* ptr = array) {
for (i = 0; i < array.Length - vectorSize; i += vectorSize) {
var v = Avx2.LoadVector256(ptr + i);
accVector = Avx2.Add(accVector, v);
}
}
int result = 0;
var temp = stackallocint[vectorSize];
Avx2.Store(temp, accVector);
for (int j = 0; j < vectorSize; j++) {
result += temp[j];
}
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
```

I launched a benchmark on these 4 methods on my computer and got the following result:

Method | Itemscount | Median |
---|---|---|

Naive | ten | 75.12 ns |

LINQ | ten | 1 186.85 ns |

Vectors | ten | 60.09 ns |

Intrinsics | ten | 255.40 ns |

Naive | 100 | 360.56 ns |

LINQ | 100 | 2 719.24 ns |

Vectors | 100 | 60.09 ns |

Intrinsics | 100 | 345.54 ns |

Naive | 1000 | 1,847.88 ns |

LINQ | 1000 | 12 033.78 ns |

Vectors | 1000 | 240.38 ns |

Intrinsics | 1000 | 630.98 ns |

Naive | 10,000 | 18 403.72 ns |

LINQ | 10,000 | 102 489.96 ns |

Vectors | 10,000 | 7 316.42 ns |

Intrinsics | 10,000 | 3 365.25 ns |

Naive | 100,000 | 176 630.67 ns |

LINQ | 100,000 | 975 998.24 ns |

Vectors | 100,000 | 78 828.03 ns |

Intrinsics | 100,000 | 41 269.41 ns |

It can be seen that the solutions with Vectors and Intrinsics greatly benefit in speed compared with the obvious solution and with LINQ. Now we need to understand what is happening in these two methods.

Let us consider the Vectors method in more detail:

**Vectors**

```
publicintVectors() {
int vectorSize = Vector<int>.Count;
var accVector = Vector<int>.Zero;
int i;
var array = Array;
for (i = 0; i < array.Length - vectorSize; i += vectorSize) {
var v = new Vector<int>(array, i);
accVector = Vector.Add(accVector, v);
}
int result = Vector.Dot(accVector, Vector<int>.One);
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
```

- int vectorSize = Vector <int> .Count; - this is how many 4 byte numbers we can put in the vector. If hardware acceleration is used, this value indicates how many 4-byte numbers can be placed in one SIMD register. In fact, it shows how many elements of this type can be performed in parallel;
- accVector is the vector in which the result of the function will accumulate;

var v = new Vector <int> (array, i); - the data is loaded into the new vector v, from the array, starting at index i. It will load exactly vectorSize data. - accVector = Vector.Add (accVector, v); - two vectors are added.

For example, Array stores 8 numbers: {0, 1, 2, 3, 4, 5, 6, 7} and vectorSize == 4, then:

In the first iteration of the cycle, accVector = {0, 0, 0, 0}, v = {0, 1, 2, 3}, after adding in accVector will be: {0, 0, 0, 0} + {0, 1, 2, 3} = {0, 1, 2, 3}.

In the second iteration, v = {4, 5, 6, 7} and after addition, accVector = {0, 1, 2, 3} + {4, 5, 6, 7} = {4, 6, 8, 10}. - It remains only to somehow get the sum of all the elements of the vector, for this you can apply scalar multiplication by a vector filled with units: int result = Vector.Dot (accVector, Vector <int> .One);

Then we get: {4, 6, 8, 10}*{1, 1, 1, 1} = 4*1 + 6*1 + 8*1 + 10 * 1 = 28. - In the end, if required, numbers that do not fit in the last vector are added up.

If you look at the code for the Intrinsics method:

**Intrinsics**

```
publicunsafeintIntrinsics() {
int vectorSize = 256 / 8 / 4;
var accVector = Vector256<int>.Zero;
int i;
var array = Array;
fixed (int* ptr = array) {
for (i = 0; i < array.Length - vectorSize; i += vectorSize) {
var v = Avx2.LoadVector256(ptr + i);
accVector = Avx2.Add(accVector, v);
}
}
int result = 0;
var temp = stackallocint[vectorSize];
Avx2.Store(temp, accVector);
for (int j = 0; j < vectorSize; j++) {
result += temp[j];
}
for (; i < array.Length; i++) {
result += array[i];
}
return result;
}
```

You can see that it is very similar to Vectors with some exceptions:

- vectorSize is a constant. This happens because Avx2 instructions that operate on 256-bit registers are explicitly used in this method. In a real application, there must be a check to see if the current processor supports the Avx2 instructions and, if it does not, call other code. It looks like this:
`if (Avx2.IsSupported) { DoThingsForAvx2(); } elseif (Avx.IsSupported) { DoThingsForAvx(); } ... elseif (Sse2.IsSupported) { DoThingsForSse2(); } ...`

- var accVector = Vector256 <int> .Zero; accVector is declared as a 256 bit vector filled with zeros.
- fixed (int * ptr = Array) - the pointer to the array is entered into ptr.
- Further operations are the same as in Vectors: loading data into a vector and adding two vectors.
- For summing up the elements of the vector, the following method was used:
- an array is created on the stack: var temp = stackalloc int [vectorSize];
- the vector is loaded into this array: Avx2.Store (temp, accVector);
- the loop summarizes the elements of the array.

- then the elements of the array that do not fit in the last vector are reached.

## Compare two arrays

It is necessary to compare two byte arrays. Actually this is the task because of which I began to study SIMD in .NET. Let us write again several methods for the benchmark, we will compare two arrays: ArrayA and ArrayB:

The most obvious solution:

```
publicboolNaive() {
for (int i = 0; i < ArrayA.Length; i++) {
if (ArrayA[i] != ArrayB[i]) returnfalse;
}
returntrue;
}
```

Solution through LINQ:

`publicboolLINQ() => ArrayA.SequenceEqual(ArrayB);`

Solution via MemCmp function:

```
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
staticexternintmemcmp(byte[] b1, byte[] b2, long count);
publicboolMemCmp() => memcmp(ArrayA, ArrayB, ArrayA.Length) == 0;
```

Using vectors from System.Numerics:

```
publicboolVectors() {
int vectorSize = Vector<byte>.Count;
int i = 0;
for (; i < ArrayA.Length - vectorSize; i += vectorSize) {
var va = new Vector<byte>(ArrayA, i);
var vb = new Vector<byte>(ArrayB, i);
if (!Vector.EqualsAll(va, vb)) {
returnfalse;
}
}
for (; i < ArrayA.Length; i++) {
if (ArrayA[i] != ArrayB[i])
returnfalse;
}
returntrue;
}
```

Using Intrinsics:

```
publicunsafeboolIntrinsics() {
int vectorSize = 256 / 8;
int i = 0;
constint equalsMask = unchecked((int) (0b1111_1111_1111_1111_1111_1111_1111_1111));
fixed (byte* ptrA = ArrayA)
fixed (byte* ptrB = ArrayB) {
for (; i < ArrayA.Length - vectorSize; i += vectorSize) {
var va = Avx2.LoadVector256(ptrA + i);
var vb = Avx2.LoadVector256(ptrB + i);
var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask) {
returnfalse;
}
}
for (; i < ArrayA.Length; i++) {
if (ArrayA[i] != ArrayB[i])
returnfalse;
}
returntrue;
}
}
```

The result of the benchmark on my computer:

Method | Itemscount | Median |
---|---|---|

Naive | 10,000 | 66 719.1 ns |

LINQ | 10,000 | 71 211.1 ns |

Vectors | 10,000 | 3,695.8 ns |

Memcmp | 10,000 | 600.9 ns |

Intrinsics | 10,000 | 1,607.5 ns |

Naive | 100,000 | 588 633.7 ns |

LINQ | 100,000 | 651 191.3 ns |

Vectors | 100,000 | 34 659.1 ns |

Memcmp | 100,000 | 5 513.6 ns |

Intrinsics | 100,000 | 12,078.9 ns |

Naive | 1,000,000 | 5,637,293.1 ns |

LINQ | 1,000,000 | 6,622,666.0 ns |

Vectors | 1,000,000 | 777 974.2 ns |

Memcmp | 1,000,000 | 361 704.5 ns |

Intrinsics | 1,000,000 | 434 252.7 ns |

I think the whole code of these methods is understandable, with the exception of two lines in Intrinsics:

```
var areEqual = Avx2.CompareEqual(va, vb);
if (Avx2.MoveMask(areEqual) != equalsMask) {
returnfalse;
}
```

In the first two vectors are compared for equality and the result is stored in the areEqual vector, in which the bits in the element at a particular position are set to 1 if the corresponding elements in va and vb are equal. It turns out that if the vectors from bytes va and vb are completely equal, then in areEquals all elements must be equal to 255 (11111111b). Since Avx2.CompareEqual is a wrapper over _mm256_cmpeq_epi8, then on the Intel website you can see the pseudo-code of this operation:

The MoveMask method from the vector makes a 32-bit number. The bit values are the high-order bits of each of the 32 single-byte elements of the vector. Pseudocode can be viewed here .

Thus, if some bytes in va and vb do not match, then the corresponding bytes in areEqual will be equal to 0, therefore the high bits of these bytes will also be equal to 0, and therefore the corresponding bits in the Avx2 response. MovoveMask will also be equal to 0 and comparison with equalsMask will not work.

Let us analyze a small example, assuming that the vector length is 8 bytes (so that writing was less):

- Let va = {100, 10, 20, 30, 100, 40, 50, 100}, and vb = {100, 20, 10, 30, 100, 40, 80, 90};
- Then are Equal will be equal to {255, 0, 0, 255, 255, 255, 0, 0};
- The MoveMask method will return 10011100b, which will need to be compared with the mask 11111111b, since Since these masks are unequal, it turns out that both the vectors va and vb are unequal.

## Counting the number of times an item is found in a collection.

Sometimes it is necessary to count how many times a specific element is found in a collection, for example, ints, this algorithm can also be accelerated. Let's write several methods for comparison, we will look for the Item element in the Array array.

The most obvious:

```
publicintNaive() {
int result = 0;
foreach (int i in Array) {
if (i == Item) {
result++;
}
}
return result;
}
```

using LINQ:

`publicintLINQ() => Array.Count(i => i == Item);`

using vectors from System.Numerics.Vectors:

```
publicintVectors() {
var mask = new Vector<int>(Item);
int vectorSize = Vector<int>.Count;
var accResult = new Vector<int>();
int i;
var array = Array;
for (i = 0; i < array.Length - vectorSize; i += vectorSize) {
var v = new Vector<int>(array, i);
var areEqual = Vector.Equals(v, mask);
accResult = Vector.Subtract(accResult, areEqual);
}
int result = 0;
for (; i < array.Length; i++) {
if (array[i] == Item) {
result++;
}
}
result += Vector.Dot(accResult, Vector<int>.One);
return result;
}
```

Using Intrinsics:

```
publicunsafeintIntrinsics() {
int vectorSize = 256 / 8 / 4;
//var mask = Avx2.SetAllVector256(Item);//var mask = Avx2.SetVector256(Item, Item, Item, Item, Item, Item, Item, Item);var temp = stackallocint[vectorSize];
for (int j = 0; j < vectorSize; j++) {
temp[j] = Item;
}
var mask = Avx2.LoadVector256(temp);
var accVector = Vector256<int>.Zero;
int i;
var array = Array;
fixed (int* ptr = array) {
for (i = 0; i < array.Length - vectorSize; i += vectorSize) {
var v = Avx2.LoadVector256(ptr + i);
var areEqual = Avx2.CompareEqual(v, mask);
accVector = Avx2.Subtract(accVector, areEqual);
}
}
int result = 0;
Avx2.Store(temp, accVector);
for(int j = 0; j < vectorSize; j++) {
result += temp[j];
}
for(; i < array.Length; i++) {
if (array[i] == Item) {
result++;
}
}
return result;
}
```

The result of the benchmark on my computer:

Method | Itemscount | Median |
---|---|---|

Naive | 1000 | 2 824.41 ns |

LINQ | 1000 | 12 138.95 ns |

Vectors | 1000 | 961.50 ns |

Intrinsics | 1000 | 691.08 ns |

Naive | 10,000 | 27 072.25 ns |

LINQ | 10,000 | 113 967.87 ns |

Vectors | 10,000 | 7 571.82 ns |

Intrinsics | 10,000 | 4,296.71 ns |

Naive | 100,000 | 361 028.46 ns |

LINQ | 100,000 | 1,091,994.28 ns |

Vectors | 100,000 | 82 839.29 ns |

Intrinsics | 100,000 | 40 307.91 ns |

Naive | 1,000,000 | 1,634 175.46 ns |

LINQ | 1,000,000 | 6 194 257.38 ns |

Vectors | 1,000,000 | 583 901.29 ns |

Intrinsics | 1,000,000 | 413 520.38 ns |

Methods Vectors and Intrinsics completely coincide in logic, the differences are only in the implementation of specific operations. The whole idea is:

- creates a vector mask, in which the desired number is stored in each element;
- A part of the array is loaded into the vector v and compared to the mask, then all bits are set in equal elements in equal elements, since areEqual is a vector from ints, then if you set all the bits of one element, we get -1 in this element ((int) (1111_1111_1111_1111_1111_1111_1111_1111b) == -1);
- the areEqual vector is subtracted from accVector and then in accVector there will be the sum of how many times the item element is encountered in all vectors v for each position (minus gives minutes plus).

All the code from the article can be found on GitHub

## Conclusion

I considered only a very small part of the possibilities that .NET provides for vectorization of computations. For a complete and current list of available intrinsik in .NETCORE under x86, you can refer to the source code . Conveniently, there in C # files in the summary of each intrinsic is its own name from the world of C, which simplifies both the understanding of the purpose of this intrinsic and the translation of already existing C ++ / C algorithms to .NET. Documentation on System.Numerics.Vector is available on msdn .

In my opinion, .NET has a big advantage over C ++, since JIT compilation occurs already on the client machine, the compiler can optimize the code for a specific client processor, providing maximum performance. At the same time, a programmer for writing fast code can remain within the framework of one language and technology.