Indexers in C# under the hood: indexing better than the Dow Jones

    Good day. In this article I suggest we get acquainted with indexers over various types. We will look at the assembly code generated for these indexers and at the performance characteristics of each instruction. I will also offer some fairly obvious conclusions. But what to use in your particular situation, and whether to sacrifice convenience for speed or vice versa, is up to you.



    The assembly code is given for 64-bit systems. The following metrics were chosen for each instruction: the number of fused micro-operations, the total number of micro-operations, latency, reciprocal throughput, and, of course, the instruction count. I did not give overall numbers for each indexer, because the picture can vary depending on how you work with the indexed type and how it interacts with the cache.

    Below is a brief summary of the terminology, with no deep dives, just the basic concepts. My goal was to describe everything as simply as possible, for a common understanding.

    Micro-operation (uop) - an elementary operation of which every instruction consists. Micro-operations are used for optimizations such as fusion, caching, and reordering. For example, the MOV instruction consists of 1 micro-operation; the XCHG instruction on two registers consists of 3 micro-operations (it goes through a "temporary variable", that is, an internal register - thanks to leotsarev for the correction); and XCHG between a register and memory consists of 8 micro-operations.

    Fused micro-operations (fused uops) - as mentioned above, fusion is one of the optimizations: two micro-operations are replaced by a single, more complex one.

    Latency - the number of clock cycles after which the data produced by this instruction becomes available for use by another instruction.

    Reciprocal throughput - the average number of clock cycles per instruction when a sequence of identical instructions operating on independent data is executed.

    Based on these indicators, you can evaluate the performance of a particular instruction sequence. Note that we can only "evaluate": actual performance depends on many factors, such as cache hits or misses, data dependencies, and so on.

    These figures are for the Intel processor architecture Skylake-X. This corresponds to my Intel Core i7-6700 processor.

    It is also worth remembering that the 64-bit calling convention (fastcall) passes not 2 but 4 parameters in registers (rcx, rdx, r8, r9).

    Indexers in numbers

    1. Array indexer

    We will consider the following method as an example:

    public int IndexerArray(int index, int[] array)
    {
        return array[index];
    }

    Let's look at the assembly code for this snippet.

    1. cmp     edx,dword ptr [r8+8]
    2. jae     00007ff9`07288c78
    3. movsxd  rax,edx
    4. mov     eax,dword ptr [r8+rax*4+10h]

    The first line checks whether the index is out of the array's bounds; the second line jumps to the code that throws an exception if it is. Then the element's position in the array is computed. The first fields of an array object hold service information, so we need to skip them (an extra 10h = 16 bytes).
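As an aside (my own illustration, not from the original text): the cmp/jae pair above is the bounds check, and the JIT can elide it entirely when it can prove the index is always in range, for example when the loop bound is the array's own Length.

```csharp
using System;

public static class BoundsCheckDemo
{
    // When the loop bound is array.Length, the JIT can prove the index
    // is always in range and omit the cmp/jae pair inside the loop body.
    public static int SumCheckElided(int[] array)
    {
        int sum = 0;
        for (int i = 0; i < array.Length; i++)
            sum += array[i];          // no per-iteration bounds check
        return sum;
    }

    // With an independently supplied bound the JIT keeps the check,
    // because count may exceed array.Length.
    public static int SumChecked(int[] array, int count)
    {
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += array[i];          // cmp/jae emitted on each access
        return sum;
    }
}
```

Both methods compute the same sum; they differ only in what the JIT can prove about the index.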

    Information on instructions:
    (table of No. / fused uops / total uops / latency / reciprocal throughput; the data did not survive extraction)

    2. Everyone's favorite List<>

    C# code:

    public int IndexerList(int index, List<int> list)
    {
        return list[index];
    }

    Assembly language code:

    1. cmp     edx,dword ptr [r8+10h]
    2. jae     M00_L00
    3. mov     rax,qword ptr [r8+8]
    4. cmp     edx,dword ptr [rax+8]
    5. jae     00007ff9`07268f56
    6. movsxd  rdx,edx
    7. mov     eax,dword ptr [rax+rdx*4+10h]
    8. ret
    M00_L00:
       call    System.ThrowHelper.ThrowArgumentOutOfRange_IndexException()

    There are noticeably more instructions here; it is clear that the list's indexer wraps the array's indexer. An interesting point is that the bounds check is performed twice. The first instruction checks whether the index is out of the list's bounds; if it is, we jump (instruction 2) to an obvious call that throws the exception. This check uses the list's size field, which is second in layout order (offset 10h = 16 bytes from the start of the object: 8 bytes for the method table pointer and 8 for the reference to the internal array, the first field). On the third line we load into rax the address of that internal array (again, the 8-byte offset skips the method table pointer). Then comes the already familiar part - indexing into the array (lines 4-7).
    I tried to strip out everything not directly related to indexing, but here it is worth keeping the ret so it does not look as though every access to a list element ends in an exception :D

    By the way, so as not to torment you with speculation, take a look at the List<T> implementation in the .NET reference sources. Casting the index to unsigned int is used there to reduce the number of comparisons.
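As a rough sketch of what that implementation does (the field names _items and _size follow the real source, but this is a simplified illustration, not the production code):

```csharp
using System;

// Simplified sketch of the List<T> indexer. One unsigned comparison
// covers both index < 0 and index >= _size, because a negative int
// reinterpreted as uint becomes a huge value.
public class MyList<T>
{
    private T[] _items = new T[4];
    private int _size;

    public T this[int index]
    {
        get
        {
            if ((uint)index >= (uint)_size)
                throw new ArgumentOutOfRangeException(nameof(index));
            return _items[index];  // the second bounds check lives here,
                                   // inside the array's own indexer
        }
    }

    public void Add(T item)
    {
        if (_size == _items.Length)
            Array.Resize(ref _items, _items.Length * 2); // grow backing array
        _items[_size++] = item;
    }
}
```

The list's own check guards _size (the logical length), while the array access it wraps still carries the array's check against the physical length - hence the two cmp/jae pairs in the listing above.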

    As a result, we get 7 instructions for a successful access by index, which is 3 more than for the array.

    Information on instructions:
    (table of No. / fused uops / total uops / latency / reciprocal throughput; the data did not survive extraction)

    3. The newcomer: Span<>


    public int IndexerSpan(int index, Span<int> span)
    {
        return span[index];
    }

    And in assembly language:

    1. cmp     edx,dword ptr [r8+8]
    2. jae     00007ff9`07278f69
    3. mov     rax,qword ptr [r8]
    4. movsxd  rdx,edx
    5. mov     eax,dword ptr [rax+rdx*4]

    When spans were announced, we were promised they were done properly, with runtime support. And, what can I say, they did not lie. In fact, this differs from the classic array by only one instruction: an extra address load. Judging by this code, the span stores the address of the memory where the elements live, which we load on line 3. That address can point into an array, a string, or a chunk of stack memory.
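A short illustration of those three backing stores (my own example, not from the original article); in each case the span's indexer works against the same stored address-plus-length pair:

```csharp
using System;

static class SpanSources
{
    // Span over a heap array: no copy, the span records a reference
    // into the array plus a length.
    public static int FromArray()
    {
        int[] array = { 1, 2, 3, 4, 5 };
        Span<int> slice = array.AsSpan(1, 3);  // views elements 2, 3, 4
        return slice[0];                       // 2
    }

    // Span over stack memory: no GC object is allocated at all.
    public static int FromStack()
    {
        Span<int> local = stackalloc int[3];
        local[0] = 42;
        return local[0];                       // 42
    }

    // Read-only span over a string's characters.
    public static int FromString()
    {
        ReadOnlySpan<char> chars = "indexer".AsSpan();
        return chars.Length;                   // 7
    }
}
```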
    For those interested, the Span<T> indexer implementation can be found in the reference sources. You may notice there are 2 different implementations, selected by an environment variable: PROJECTN is the code name of the first version of .NET Native for UWP, so we are more interested in the second version of the indexer. It is marked [Intrinsic]. Moreover, if you look at the static Unsafe class used in this indexer's implementation, you will find a note that most methods in that file are treated as intrinsics.

    Method calls or field accesses marked with the [Intrinsic] attribute get special support from the runtime.

    In CoreCLR, the bodies of such methods are replaced by the EE (execution engine) with unsafe code. If you need more details, you can start digging from the getILIntrinsicImplementationForUnsafe method.

    If you are curious how this works in CoreRT (as I am, a little), start with Internal.IL.Stubs.UnsafeIntrinsics.

    Given such runtime support, to understand what exactly happens behind the scenes it makes sense to look at the assembly instructions, which is what we did.

    Information on instructions:
    (table of No. / fused uops / total uops / latency / reciprocal throughput; the data did not survive extraction)

    All the indexers are highly data-dependent: each instruction uses the results of the previous ones. There are no unusual results here, and there should not be. But now the overhead that appears in each case is clear. Some obvious conclusions: if the algorithm involves very frequent accesses by index, it makes sense to consider replacing the list with an array. If accesses are infrequent, it may be more convenient to use a list, which provides a very convenient API and does not carry that much overhead (just remember to keep an eye on the internal array's growth).
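One more option worth mentioning as an aside (it is not part of the measurements above, and it requires a newer runtime than the one used here): from .NET 5 onward, CollectionsMarshal.AsSpan lets you keep the list's convenient API but index through a span in hot loops, skipping the list's extra bounds check per access.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class HotLoop
{
    // Takes a span over the list's backing array once, then indexes
    // through the span, so each element access pays only the span's
    // single bounds check. Caveat: the span is invalidated if the list
    // grows (the backing array may be replaced), so do not Add while
    // iterating.
    public static long Sum(List<int> list)
    {
        Span<int> items = CollectionsMarshal.AsSpan(list);
        long sum = 0;
        for (int i = 0; i < items.Length; i++)
            sum += items[i];
        return sum;
    }
}
```

Usage: `HotLoop.Sum(new List<int> { 1, 2, 3, 4 })` returns 10.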

    Now let's look at the different ways we can represent a two-dimensional array: an array of arrays (jagged array) and a multidimensional array.

    4. Multidimensional array

    C# code:

    public int IndexerDimensionalArray(int index1, int index2, int[,] dimensionalArray)
    {
        return dimensionalArray[index1, index2];
    }

    Assembly language:

    1. mov     eax,edx
    2. sub     eax,dword ptr [r9+18h]
    3. cmp     eax,dword ptr [r9+10h]
    4. jae     00007ff9`00098fe6
    5. mov     edx,r8d
    6. sub     edx,dword ptr [r9+1Ch]
    7. cmp     edx,dword ptr [r9+14h]
    8. jae     00007ff9`00098fe6
    9. mov     ecx,dword ptr [r9+14h]
    10. imul    rcx,rax
    11. mov     rax,rdx
    12. add     rax,rcx
    13. mov     eax,dword ptr [r9+rax*4+20h]

    Everything here is straightforward in principle: 2 bounds checks (one per dimension, with the lower bound of each dimension subtracted first, hence the sub instructions), then computing the flat index and dereferencing it. This array is stored in memory as one contiguous block.
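To make the imul/add pair concrete, here is the row-major formula it computes, sketched in C# (my own illustration; the Buffer.BlockCopy flattening is just a way to demonstrate that the element really sits at that flat offset):

```csharp
using System;

static class FlatIndex
{
    // A rank-2 array [rows, columns] stores its elements row-major in
    // one contiguous block, so element [r, c] sits at flat offset
    // r * columns + c. This is what imul rcx,rax / add rax,rcx compute.
    public static int ReadViaFlatOffset()
    {
        int[,] grid = new int[3, 4];   // 3 rows, 4 columns, one block
        grid[2, 1] = 99;

        int columns = grid.GetLength(1);
        int flat = 2 * columns + 1;    // 2 * 4 + 1 = 9

        // Copy the contiguous block into a flat array and read the
        // same element through the computed offset.
        int[] copy = new int[grid.Length];
        Buffer.BlockCopy(grid, 0, copy, 0, grid.Length * sizeof(int));
        return copy[flat];             // 99
    }
}
```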

    Information on instructions:
    (table of No. / fused uops / total uops / latency / reciprocal throughput; the data did not survive extraction)

    5. Array of arrays (jagged array)

    public int IndexerJaggedArray(int index, int index2, int[][] jaggedArray)
    {
        return jaggedArray[index][index2];
    }


    1. cmp     edx,dword ptr [r9+8]
    2. jae     00007ff9`00098f95
    3. movsxd  rax,edx
    4. mov     rax,qword ptr [r9+rax*8+10h]
    5. cmp     r8d,dword ptr [rax+8]
    6. jae     00007ff9`00098f95
    7. movsxd  rdx,r8d
    8. mov     eax,dword ptr [rax+rdx*4+10h]

    And most interestingly, we end up with fewer instructions than with the type made specifically for multidimensionality.

    Information on instructions:
    (table of No. / fused uops / total uops / latency / reciprocal throughput; the data did not survive extraction)

    But regarding the last 2 examples, I advise you not to rush to conclusions. Because a two-dimensional array is a single object, initialized once, the memory for the entire array is allocated in one large block. This gives better cache locality, which can fundamentally change the picture. In an array of arrays, memory for each inner array is allocated separately, so the rows are likely to end up scattered across the heap wherever the allocator finds room for them.
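A small sketch of that allocation difference (the sizes here are arbitrary, chosen only to illustrate the point):

```csharp
using System;

static class AllocationShape
{
    public static void Demo()
    {
        // One allocation: 1000 * 1000 ints in a single contiguous
        // block (~4 MB, which lands on the large object heap).
        int[,] rectangular = new int[1000, 1000];

        // 1001 allocations: one for the outer array of references,
        // plus one per row. Rows can live anywhere on the heap and
        // can be replaced or dropped individually.
        int[][] jagged = new int[1000][];
        for (int i = 0; i < jagged.Length; i++)
            jagged[i] = new int[1000]; // each row ~4 KB, small object heap

        // A single row can be swapped out without touching the rest;
        // the old row becomes garbage on its own.
        jagged[500] = new int[10];

        Console.WriteLine(rectangular.Length); // 1000000
        Console.WriteLine(jagged[500].Length); // 10
    }
}
```

This per-row independence is exactly what the next paragraphs trade against the rectangular array's cache locality.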

    However, for some, this behavior may be preferable. Perhaps in some situations it is known that the instance's lifetime will be short, and you would rather not land on the large object heap (which is effectively part of the second generation for the garbage collector), where an object can stay much longer than you would like. Or after some time you want to keep only certain rows and let everything else be collected. Or the workload accesses random, non-sequential elements, so the cache cannot help much anyway.

    Also, with an array of arrays you are more likely to let the garbage collector get by with a sweep rather than provoking compaction. A reminder: when memory is fragmented, the total amount of free space may be sufficient for a new object, yet there is no contiguous free region of the required size. In that case compaction is performed: objects are moved in order to defragment memory. If a contiguous stretch of free memory large enough for the new object can be found, the object is simply placed into that free space. This is called a sweep.

    I hope this information helps you draw the right conclusions and back up your opinion in discussions about which to use.
