5 ways to compare two byte arrays. Benchmarking

From the sandbox

As a result of profiling my software, I concluded that it was necessary to optimize the buffer comparison function.
Because Since the CLR does not provide a standard way to compare two pieces of memory, the function was whipped up independently (if only it worked).
Googling the phrase "Best Way to Compare Byte Arrays in .Net", I was confused: in the vast majority of cases, people suggested using either LINQ or Enumerable.SequenceEqual (), which is almost the same thing. Even on StackOverflow, this was the most popular answer. Those. catastrophically popular misconception is:

"Compiler \ run-time environment will optimize your loop so you don't need to worry about performance." From here.

It was it that first made me think of writing this post.
I conducted a comparative test of five ways to compare the buffers available from C #, and based on the test results I gave recommendations in choosing a method.
In addition, I decompiled some functions, and analyzed the code generated by the JIT compiler for the x86 configuration, and also compared the machine code generated by the JIT compiler with the machine code of the CRT function for a similar purpose.

Background

After writing another utility and making it work, I launched the profiler and saw that comparing byte arrays took up a huge percentage of the time.
The CompareByteArrays () function was on a critical path, and there was something to be done.
I did not see an algorithmic solution to the performance problem: the algorithms were selected in advance, and I did not have any ideas on how to reduce the resulting complexity.
The search results on the World Wide Web were not very pleased: there were comparisons of the speed of the methods, but how were these numbers obtained, on what data and under what conditions no one bothered to write.
I had my own assumptions about who and under what conditions would be faster. I decided to check.
The results were interesting, so the thought of fasting here finally matured.

Что и чем я тестировал

Главное

The code was compiled and run on a Windows 7 x64 machine with all the latest updates for .Net 4.0 Client Profile, i.e. with Microsoft Visual Studion 2010 default settings, and only for x86 configuration, i.e. this is 32 bit code. Firstly, because most of the code I come across is written for 32-bit systems. Secondly, I believe that .Net is critical in this configuration. In particular, because a huge amount of existing components is written for 32-bit, and no one will rewrite them, but you need to interact with them, i.e. they will determine the configuration of the final application. Secondly, an extremely small number of applications really need 64 bits - the required bit depth is determined primarily by the memory requirements and the target platform. Modern 64-bit x86-compatible processors support 32-bit tasks in hardware [1] [2]. It is logical that Windows x64 uses this feature by executing 32-bit code directly [3]. Compiling an initially 32-bit application with 64 bits multiplies the memory needed for it and reduces its performance, because the compiler options do not change the processor TLB size, the first level cache too, and the pointer size is doubled, which in turn also has the necessary data alignment boundary changes [4].

Data sizes

In my original task, it was required to compare byte arrays of size 16 bytes and sizes from 1 to 4 * 1024 bytes. 16 bytes is the size of MD5, 4 Kb is the size of a standard memory page in Windows, so it was selected for the I / O buffer.
However, I tested the comparison on blocks from 1 byte to 1/2 megabytes, for the simple reason that hashes are not only 16 bytes in size, but also 4, and even 2 bytes (CRC16, CRC32), and memory pages not only 4 kilobytes, but also 2 MB [2] and even 4 MB. The results shown on 1/2 MB blocks were sufficient to draw a conclusion about working with large blocks, moreover, my time is not unlimited (it is better not to touch the computer during the testing process so as not to take time from the described process).

Based on the results of the first runs, I found it sufficient to compare the methods on 25 different sizes of test data. To do this, I built the testing cycle so that the first test set consisted of 1-byte arrays, and at each next step the size of the test arrays was multiplied by the same factor.

Parameters were set by constants in the code:

privateconstint CountArrays = 128; //Количество сравниваемых тестовых наборовprivateconstint StartArraySize = 1; //Стартовый размер тестового набораprivateconstint MaxArraySize = 512 * 1024; //Максимальный размер тестового набораprivateconstint StepsCount = 25; //Количество шаговprivateconstint MeasurementsCount = 10; //Количество измерений

The main testing cycle looked like this:

double sizeMultiplier = 1;
         DoInvestigationStep(sizeMultiplier);
         constint MaxMultiplier = MaxArraySize / StartArraySize;
         var stepMmltiplier = Math.Pow( MaxMultiplier, 1 / (double)StepsCount );
         for (var step = 1; step <= StepsCount; step++)
         {
            sizeMultiplier = Math.Pow(stepMmltiplier, (double) step);
            DoInvestigationStep(sizeMultiplier);
         }

The size of the array at each step was calculated like this:

var arraySize = (int) (StartArraySize * p_SizeMultiplier);

Tested methods

Using the System.Collections.IStructuralEquatable Interface

This is a relatively new method for C #, it appeared only in .NET 4, so it was of particular interest to me: it was curious how slow it is.

privatestaticboolByteArrayCompareWithIStructuralEquatable(byte[] p_BytesLeft, byte[] p_BytesRight)
      {
         IStructuralEquatable eqa1 = p_BytesLeft;
         return eqa1.Equals(p_BytesRight, StructuralComparisons.StructuralEqualityComparer);
      }

LINQ, or rather Enumerable.SequenceEqual ()

This is one of the simplest methods, the easiest to implement. It is also the most popular method.

privatestaticboolByteArrayCompareWithSequenceEqual(byte[] p_BytesLeft, byte[] p_BytesRight)
      {
         return p_BytesLeft.SequenceEqual(p_BytesRight);
      }

PInvoke, for the CRT memcmp () function

Theoretically, this should be the fastest way, but it turned out that this is not always the case: the overhead of interacting with the platform eats up quite a lot of time, and this method in some cases loses its attractiveness.
Here I bring it in the form where I found it on SO. It was in this form that I tested it.
On PInvoke.net it looks a bit different.

      [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
      privatestaticexternintmemcmp(byte[] p_BytesLeft, byte[] p_BytesRight, long p_Count);
      privatestaticboolByteArrayCompareWithPInvoke(byte[] p_BytesLeft, byte[] p_BytesRight)
      {
         // Validate buffers are the same length.// This also ensures that the count does not exceed the length of either buffer.  return p_BytesLeft.Length == p_BytesRight.Length
                    && memcmp(p_BytesLeft, p_BytesRight, p_BytesLeft.Length) == 0;
      }

Head-on. Those. direct comparison of bytes in a for (;;) loop

This is the most obvious way to compare two arrays, which is why it is of interest.
Initially, the method did not seem too fast, but it turned out that for many situations it is quite acceptable.

By the way, one of the variations of this comparison method was used in an article from the CLR Manufacturers themselves, moreover, in the very text in which I needed it.
Apparently, users have finished them with such questions.

privatestaticboolByteArrayCompareWithSimplest(byte[] p_BytesLeft, byte[] p_BytesRight)
      {
         if (p_BytesLeft.Length != p_BytesRight.Length)
            returnfalse;
         var length = p_BytesLeft.Length;
         for (int i = 0; i < length; i++)
         {
            if (p_BytesLeft[i] != p_BytesRight[i])
               returnfalse;
         }
         returntrue;
      }

Optimized unsafe method from Hafor Stefanson

The author of the method claims that this optimized method makes comparisons in blocks of 64 bits for the largest possible part of the array, assuming that the beginning of the array is aligned with the QWORD boundary. It will also work if there is no QWORD alignment, but at a lower speed.
However, in a 32-bit environment, comparisons in 64-bit blocks can only be achieved using assembler: general-purpose registers are 32-bit, and the compiler uses these to generate machine code.
So how this method actually works depends on how it is compiled. In my case, it’s definitely not 64 bits.

// Copyright (c) 2008-2013 Hafthor Stefansson// Distributed under the MIT/X11 software license// Ref: http://www.opensource.org/licenses/mit-license.php.privatestaticunsafeboolUnsafeCompare(byte[] a1, byte[] a2)
      {
         if (a1 == null || a2 == null || a1.Length != a2.Length)
            returnfalse;
         fixed (byte* p1 = a1, p2 = a2)
         {
            byte* x1 = p1, x2 = p2;
            int l = a1.Length;
            for (int i = 0; i < l / 8; i++, x1 += 8, x2 += 8)
               if (*((long*) x1) != *((long*) x2))
                  returnfalse;
            if ((l & 4) != 0)
            {
               if (*((int*) x1) != *((int*) x2)) returnfalse;
               x1 += 4;
               x2 += 4;
            }
            if ((l & 2) != 0)
            {
               if (*((short*) x1) != *((short*) x2)) returnfalse;
               x1 += 2;
               x2 += 2;
            }
            if ((l & 1) != 0)
               if (*((byte*) x1) != *((byte*) x2))
                  returnfalse;
            returntrue;
         }
      }

Testing methodology

Test set

In order to bring test conditions as close as possible to combat conditions and to exclude “jamming” of test data in the processor cache (working with memory as much as possible), it was decided to generate a lot of test arrays and compare them with each other, i.e. each array with each.

As a test set, a “jagged” array was selected (in the original documentation - “Jagged Array”). There were several alternatives, for example, List<byte[]>and LinkedList<byte[]>, but the alternatives did not allow access to the element, although the array in .NET does not shine here: the CLR in any case checks the index to go beyond the boundaries of the array, even if it is an array with zero base.

Test data was generated in such a way that all arrays were different, and the different element of the array was exactly in the middle.

Test data generation

privatestaticvoidPrepareTestData(int p_ArraySize)
      {
         for (int i = 0; i < CountArrays; i++)
         {
            var byteArray = newbyte[p_ArraySize];
            byteArray[p_ArraySize / 2] = (byte) (i & 0x000000ff);
            s_arrays[i] = byteArray;
         }
      }

Making comparisons

Since all arrays were supposed to be compared with all, the general view of the method, the execution time of which was measured, is two nested loops over all arrays of the set, and the test method is called inside them. The essence of what is happening can be expressed by the code:

for (int i = 0; i < CountArrays; i++)
            for (int j = 0; j < CountArrays; j++)
               Compare( s_arrays[i], s_arrays[j] );

However, there is one “but”: if you do not use the result of “calculations” in any way, then the CLR has every right to not perform the “calculation” itself. I came across this before when I was experimenting with C ++. With the “-O3” optimization turned on, it was very difficult to measure anything, because time after time it turned out to be zero. Therefore, it was decided to exclude such a scenario from the very beginning, so that the result of the comparison function was “accumulated” and returned, and analyzed at the highest level.

Comparison without ignoring the result

var result = true;
         for (int i = 0; i < CountArrays; i++)
            for (int j = 0; j < CountArrays; j++)
            {
               var tmp = Compare( s_arrays[i], s_arrays[j] );
               result = result && tmp;
            }
         return result;

Analysis of the comparison result

var stubLocalVar = p_СomparisonMethod();
         if (stubLocalVar)
            thrownew InvalidOperationException();

Since there are 5 comparison methods, and the above template is common, it seems logical to determine the general, "template" method, and pass the comparison method to it. In this way:

Incorrect function call method

privatestaticboolCompareArraysWithExternalMethod(Func<byte[], byte[], bool> p_Method)
      {
         var result = true;
         for (int i = 0; i < CountArrays; i++)
            for (int j = 0; j < CountArrays; j++)
            {
               var tmp = p_Method( s_arrays[i], s_arrays[j] );
               result = result && tmp;
            }
         return result;
      }

This method, of course, eliminates copying / pasting and reduces the likelihood of error, but at the same time creates excessive call indirection and eliminates inlining, which, in turn, leads to a loss in measurement accuracy, especially since there are many calls. That is why I had to create five almost identical methods:

privatestaticboolCompareArraysWithUnsafeMethod();
      privatestaticboolCompareArraysWithSimplestMethod();
      privatestaticboolCompareArraysWithSequenceEqualMethod();
      privatestaticboolCompareArraysWithPInvokeMethod();
      privatestaticboolCompareArraysWithIStructuralEquatableMethod();

However, a level higher, I could already apply the template method.

The code below shows that the MeasureComparisonTime () function, which measures the execution time of the comparison method, takes a delegate of type Func as a parameter. It is his runtime that is measured.

Method Measurement Functions

privatestaticlongGetMinimalMesuredValue(int p_MeasurementNumber, long p_PreviousValue,
                                                 long p_MeasuredValue)
      {
         var result = p_MeasurementNumber == 0
                            ? p_MeasuredValue
                            : Math.Min(p_PreviousValue, p_MeasuredValue);
         return result;
      }
      privatestaticlongMeasureComparisonTime(Func<bool> p_Method, long p_PreviousTime,
                                                int p_MeasurementNumber)
      {
         GC.Collect();
         GC.Collect();
         s_stopwatch.Start();
         var stubLocalVar = p_Method();
         s_stopwatch.Stop();
         if (stubLocalVar)
            thrownew InvalidOperationException();
         var result = GetMinimalMesuredValue(p_MeasurementNumber, p_PreviousTime, s_stopwatch.ElapsedTicks);
         s_stopwatch.Reset();
         return result;
      }

Measurement procedure

The time was measured using a standard stopwatch ( class System.Diagnostics.Stopwatch ), which is based on QueryPerformanceCounter () . Naturally, they were interested not in milliseconds, but in tics: in the case of tiny arrays, all measurements in milliseconds would be zero. There was also an idea to use your own “bicycle” based on RDTSC [6], but, firstly, the first few runs showed that accuracy was enough for the current technique, and secondly, a warning about the need to use ProcessThread.ProcessorAffinity, which appeared in the latest versions of the documentation, suggests that the work of this class is based on the aforementioned processor clock counter. All measurements were repeated 10 times, and the best time was taken. The number 10 was selected experimentally, as giving fairly accurate results with minimal time.

Function taking all measurements

privatestatic MeasurementsResults DoMeasurements()
      {
         MeasurementsResults result;
         result.SimplestTime = 0;
         result.SequenceEqualTime = 0;
         result.PInvokeTime = 0;
         result.IStructuralEquatableTime = 0;
         result.UnsafeTime = 0;
         for (int measurementNumber = 0; measurementNumber < MeasurementsCount; measurementNumber++)
         {
            Console.WriteLine("Стартует измерение номер {0}", measurementNumber);
            result.SimplestTime = MeasureComparisonTime(CompareArraysWithSimplestMethod,
                                                        result.SimplestTime,
                                                        measurementNumber);
            result.SequenceEqualTime = MeasureComparisonTime(CompareArraysWithSequenceEqualMethod,
                                                             result.SequenceEqualTime,
                                                             measurementNumber);
            result.PInvokeTime = MeasureComparisonTime(CompareArraysWithPInvokeMethod,
                                                       result.PInvokeTime,
                                                       measurementNumber);
            result.IStructuralEquatableTime = MeasureComparisonTime(CompareArraysWithIStructuralEquatableMethod,
                                                                    result.IStructuralEquatableTime,
                                                                    measurementNumber);
            result.UnsafeTime = MeasureComparisonTime(CompareArraysWithUnsafeMethod,
                                                      result.UnsafeTime,
                                                      measurementNumber);
         }
         return result;
      }

Measurement results and first conclusions

Before starting the experiment, I had some idea which method would turn out to be the winner and which would be the outsider, but the results nevertheless surprised me. My predictions were not entirely correct.

Best time (tics)	Array size (bytes)	unsafe	Head-on	Pinvoke	Sequencequal	IStructuralEquatable
276	one	1.7	one	6	10.5	55
276	one	1.7	one	6.3	10.1	55.7
325	2	1.3	one	5.3	eleven	95.2
374	four	1,1	one	4.8	11,4	121.3
413	eight	one	1.3	five	14.1	181.2
412	13	one	1.7	4.7	17.5	252.5
490	23	one	2,3	3,7	22.1	364.8
554	39	one	3.4	3,5	30.1	535.9
691	67	one	4.3	2.9	39.1	727.5
887	114	one	5,6	2,4	50.3	964.1
1212	194	one	4.6	2.1	61.5	1190.9
1875	328	one	4.8	1.8	65.8	1295.8
2897	556	one	five	1,6	71.5	1416.9
4565	942	one	5.3	1.4	76.3	1521.1
7711	1595	one	5.2	1.3	76,2	1521.2
12482	2702	one	5,4	1.3	79,4	1593.6
20692	4576	one	5.5	1,2	81.2	1626.2
34369	7750	one	5,6	1,2	82.8	1657.1
58048	13124	one	5,6	1,2	82.9	1661.9
97511	22226	one	5,6	1,2	83.6	1677.3
173805	37640	one	5,4	1,1	79.5	1585.7
295989	63743	one	5.3	1,1	79.3	1574.9
588797	107949	one	4.6	1,1	67.5	1340.9
1010768	182811	one	4.6	1,1	67	1340.4
1705668	309590	one	4.6	1,1	67.3	1334.1
2885452	524287	one	4.6	1,1	67.3	1329.1

The results surprised
me with the following: First: The CRT function simply must be well optimized, and I calculated that at a certain stage (the size of the test array) its speed would block the overhead of the platform call, but this did not happen. Later, I repeated testing with much larger arrays, but even at 1.5 megabytes the unsafe code turned out to be faster.

Second: I guessed that IStructuralEquatable would turn out to be slow, but the numbers simply struck me: I definitely did not expect this.

Diazassembling and analysis of individual functions

Such a serious difference between unsafe and Simplest on large arrays (I expected no more than two to three times the “advantage”) suggests that with the optimization of loops and accesses to array elements in .Net, not everything is as smooth as .Net adherents say. Accordingly, I did not deny myself the pleasure of looking at the results of JIT'er's work, i.e. directly generated machine code.

Analysis of the CompareArraysWithSimplestMethod () Function

To do this, Thread.Sleep (60 * 1000) was inserted at the beginning of the method; after launching the release build, I picked up the OllyDebug process. Such a tricky move was needed in order not to undermine the CLR optimization in any way - to see the picture as it is.

Screenshot of the debugger window with arrows

Screenshot of the debugger window with arrows

Going down the call stack from ntdll.NtDelayExecution () and SleepEx (), I found my function and, after careful study, was once again unpleasantly surprised. What I really did not like here:

Instead of a single RngChkFail check (for the entire loop at once), the CLR does this check for each index !!!
The cycle remained in the form in which I wrote it: the compiler did not “deploy” it.
Therefore, the comparison remained byte by byte, although the CLR was quite capable of optimizing it, for example, by making a comparison of 4 bytes (register size).
Instead of a single return with a single epilogue and clearing the stack, the CLR copied them three times, making the code “swollen”. In fact, machine code repeats C # code almost one to one. In this regard, the question arises: did he optimize the code at all ?! Does he know how to do this?
Probably thanks to the previous paragraph (swollen code), the function was not inlined.

UnsafeCompare () function analysis

The result of decompilation of this function kindled my curiosity and made me think about other functions. I decided to compare the CRT function and the unsafe option. First, I decided to look at the unsafe option. To do this, I did the same operations as for the first function, except that Thread.Sleep () was inserted not in the function itself, but before it was called. This could hardly have any effect on the essence of what was happening, but somewhat simplified decompilation: an unsafe function is clearly more complicated than the first.

Insertion Thread.Sleep ()

privatestaticboolCompareArraysWithUnsafeMethod()
      {
         var result = true;
         for (int i = 0; i < CountArrays; i++)
            for (int j = 0; j < CountArrays; j++)
            {
               Thread.Sleep( 60 * 1000 );
               var tmp = UnsafeCompare(s_arrays[i], s_arrays[j]);

The first half of this function is also of interest, i.e. until UnsafeCompare () is called. It, like the first one, demonstrates that any access to an element of the array causes the index to be checked for occurrence within the boundaries of the array. I emphasized this in the listing.

CPU Disasm CompareArraysWithUnsafeMethod

;Address   Hex dump          Command                                 Comments
001B0B88    55push    ebp                             ; Стандартный пролог
001B0B89    8BEC            mov     ebp, esp                        
001B0B8B    57push    edi                             
001B0B8C    56push    esi                             
001B0B8D    53push    ebx                             
001B0B8E    BF 01000000     mov     edi, 1                          ; result = true;
001B0B93    33DB            xor     ebx, ebx                        ; for (int i = 0;
001B0B95    33F6            xor     esi, esi                        ; for (int j = 0;
001B0B97    B9 60EA0000     mov     ecx, 0EA60                      ; Thread.Sleep( 60 * 1000 );
001B0B9C    E8 0CFBC161     call    clr.ThreadNative::Sleep         
001B0BA1    8B15 A8337A03   mov     edx, [s_arrays]                 ; EDX  <-- s_arrays
001B0BA7    3B5A 04         cmp     ebx, [edx+4]                    ; Compare(s_arrays.Length, первый индекс) (1) !!!
001B0BAA    7331           jae     short stub_CLR.JIT_RngChkFail   
001B0BAC    8B4C9A 0C       mov     ecx, [ebx*4+edx+0C]             ; ECX <-- s_arrays[i]
001B0BB03B72 04         cmp     esi, [edx+4]                    ; Compare(s_arrays.Length, второй индекс) (2) !!!
001B0BB3    7328           jae     short stub_CLR.JIT_RngChkFail   
001B0BB5    8B54B2 0C       mov     edx, [esi*4+edx+0C]             ; EDX <-- s_arrays[j]
001B0BB9    FF15 F0381400   call    near UnsafeCompare

The result of the decompilation of the function turned out to be quite large and did not fit on one screen, so I did not give a screenshot of the debugger window. Since a large integral listing is tedious and unproductive to read, here I give it in logically connected pieces, accompanying each piece with a brief comment.

UnsafeCompare (). ParametersChecking

;a1 ========> ECX
;a2 ========> EDX
;Address   Hex dump          Command                Comments
001B0BF8    55push    ebp            ; Стандартный пролог
001B0BF9    8BEC            mov     ebp, esp
001B0BFB    57push    edi            ; сохранение регистров в стеке нужно, вероятно, затем, чтобы
001B0BFC    56push    esi            ; использовать их в дальнейшем
001B0BFD    53push    ebx            ;
001B0BFE    83EC 0C         subesp, 0C; вершина стека выше сохранённых регистров на 3*sizeof(DWORD), т.е. будет 3 переменных
001B0C01    33C0xor     eax, eax       ;                      (!)
001B0C03    8945 F0         mov     [ebp-10], eax  ; var1 <-- 0           (!)
001B0C06    8945 EC         mov     [ebp-14], eax  ; var2 <-- 0           (!)
001B0C09    85C9            test    ecx, ecx       ; Compare(a1, null)
001B0C0B    74 0C           je      short return1 
001B0C0D    85D2            test    edx, edx       ; Compare(a2, null)
001B0C0F    74 08           je      short return1 
001B0C11    8B41 04         mov     eax, [ecx+4]   ; eax <-- a1.Length
001B0C14    3B42 04         cmp     eax, [edx+4]   ; Compare(eax, a2.Length)
001B0C17    74 0A           je      short argsIsValid 
return1:
001B0C19    33C0xor     eax, eax       ; result <-- 0001B0C1B    8D65 F4         lea     esp, [ebp-0C]  
001B0C1E    5B              pop     ebx            
001B0C1F    5E              pop     esi            
001B0C205F              pop     edi            
001B0C21    5D              pop     ebp            
001B0C22    C3              ret
argsIsValid:
; Часть кода опущена

According to the fastcall calling convention, which follows the .NET Framework [7] [8], the first and second parameters are in the ecx and edx registers, in order from left to right. In the above listing, a sequential check of the input parameters is clearly visible, it almost unambiguously corresponds to the code in C #

if (a1 == null || a2 == null || a1.Length != a2.Length)
            returnfalse;

However, the highlighted lines deserve special attention: their purpose is unclear, and they are confusing. To understand what it is and why it might be necessary, I had to put in another experiment, during which I wrote the simplest unsafe function in C #, which uses the fixed clause and pointer handling, and performed similar actions with it.

privatestaticunsafeintfunc1(byte[] param1)
      {
         fixed (byte* p = param1)
         {
            return *p;
         }
      }

The disasma of the simplest unsafe function


;param1 ========> ECX
Address   Hex dump          Command                                  Comments
$ ==>       55              push    ebp                              ; Стандартный пролог
$+18BEC            mov     ebp, esp
$+350              push    eax                              ;                                 (!)
$+433C0            xor     eax, eax                         ;                                 (!)
$+68945 FC         mov     [ebp-4], eax                     ; var1 <-- 0                      (!)
$+985C9            test    ecx, ecx                         ; Compare(param1, null)
$+B         7406           je      short param_is_null
$+D         83790400      cmp     dword ptr [ecx+4], 0             ; Compare(param1.Length, 0)
$+117507           jne     short not_zero_len
param_is_null:
$+1333D2            xor     edx, edx
$+158955 FC         mov     [ebp-4], edx                    ; var1 <-- 0
$+18        EB 0C           jmp     short ret_1
not_zero_len:
$+1A        83790400      cmp     dword ptr [ecx+4], 0
$+1E        7610           jbe     short call.JIT_RngChkFail
$+208D49 08         lea     ecx, [ecx+8]                    ; ecx <-- param1.BufferPointer
$+23894D FC         mov     [ebp-4], ecx                    ; var1 <-- ecx
ret_1:
$+268B45 FC         mov     eax, [ebp-4]                    ; eax <-- var1
$+29        0FB600          movzx   eax, byte ptr [eax]             ; eax <-- *eax
$+2C        8BE5            mov     esp, ebp                        ; Стандартный эпилог
$+2E        5D              pop     ebp
$+2F        C3              ret
call.JIT_RngChkFail:
$+30        E8 B3BDC861     call    clr.JIT_RngChkFail
$+35        CC              int3

From the above listing, it becomes clear that as a result of JIT compilation, the variable in the fixed clause is necessarily pushed onto the stack, regardless of the availability of general-purpose registers. This is clearly seen from the fact that the eax register was saved on the stack, was not used in the future and, therefore, was free and accessible for operations (up to offset 0x26 from the beginning of the function), but the offset variable stack variable was used to store temporary data [ ebp-4] (let's call it var1). The variable is immediately necessarily initialized to zero, despite the fact that then this value is not used, but simply overwritten. For example, at offset 0x15, zero is again entered in var1, despite the fact that by that moment zero was already stored there.

Thus, it becomes clear that the highlighted lines in the UnsafeCompare.CPU Disasm.ParametersChecking listing do not make much sense, this is just a side effect of compilation. Also, from the above listing, it becomes obvious that the array is checked in the fixed expression in three stages: first, the array is checked for null, then its length is checked for equality to zero (the jne command analyzes only ZF), and only then for equality to zero and negative values ( jbe checks both ZF and CF). From my point of view, it is very strange that the last two actions were not combined into one, and even more strange that the cmp command is executed twice, because the conditional jump does not reset the flags. Among other things, I am sincerely grateful to those of you who read this line, because it means that I tried not in vain,

The experiment greatly simplifies further code analysis.

Fixed clause of CompareArraysWithUnsafeMethod () function

argsIsValid:
001B0C23    8379 04 00      cmpdwordptr[ecx+4], 0       ; Compare(a1.Length, 0) это первая проверка длины первого массива
001B0C27    75 07           jneshorta1Len_NonZero        ; только неравенство
001B0C29    33C0xoreax, eax
001B0C2B    8945 F0mov[ebp-10], eax              ; var1 <-- 0
001B0C2EEB 10           jmpshorta1Len_Zeroa1Len_NonZero:
001B0C30    8379 04 00      cmpdwordptr[ecx+4], 0       ; Compare(a1.Length, 0) вторая (последняя) проверка длины первого массива
001B0C34    0F86B5000000jbecall.JIT_RngChkFail        ; если равно или меньше
001B0C3A    8D41 08         leaeax, [ecx+8]               ; eax <--a1.BufferPointer
001B0C3D    8945 F0mov[ebp-10], eax              ; var1 <--eaxa1Len_Zero:
001B0C40    837A 04 00      cmpdwordptr[edx+4], 0       ; Compare(a2.Length, 0) это первая проверка длины второго массива
001B0C44    75 07           jneshorta2Len_NonZero        ; только неравенство
001B0C46    33D2xoredx, edx
001B0C48    8955 ECmov[ebp-14], edx              ; var2 <-- 0
001B0C4BEB 10           jmpshortpart3a2Len_NonZero:
001B0C4D    837A 04 00      cmpdwordptr[edx+4], 0       ; Compare(a2.Length, 0) вторая (последняя) проверка длины второго массива
001B0C51    0F86 98000000   jbecall.JIT_RngChkFail        ; если равно или меньше
001B0C57    8D52 08         leaedx, [edx+8]               ; edx <--a2.BufferPointer
001B0C5A    8955 ECmov[ebp-14], edx              ; var2 <--edx

The compiler presented no surprises here, i.e. compilation algorithm gives easily predictable results. For example, variables initialized in a fixed clause are pushed on the stack anyway.
Initializing variables inside a fixed block:

part3:
; ECX              <======= a1
; EDX              <======= a2.BufferPointer
; var1             <======= a1.BufferPointer
; var2             <======= a2.BufferPointer
001B0C5D    8B45 F0         mov     eax, [ebp-10]              ; eax <-- var1001B0C60    8BF0            mov     esi, eax                   ; esi <-- eax001B0C62    8B45 EC         mov     eax, [ebp-14]              ; eax <-- var2001B0C65    8BF8            mov     edi, eax                   ; edi <-- eax001B0C67    8B41 04         mov     eax, [ecx+4]               ; eax <-- a1.Length001B0C6A    8945 E8         mov     [ebp-18], eax              ; var3 <-- eax

The initialization of variables inside the fixed block is interesting in that it demonstrates well the principle by which the JIT compiler generates code. It is clearly seen here that instead of directly sending values from stack variables to index registers, the values are first placed in the accumulator register. Maybe this is some kind of secret magical meaning, but it looks like (sending to eax) just like an extra action.

8 byte loop

001B0C6D    33C9            xor     ecx, ecx                   ; счётчик цикла <-- 0001B0C6F    8BD8            mov     ebx, eax                   ; ebx <-- a1.Length
001B0C71    85DB            test    ebx, ebx
001B0C73    7903           jns     short ebx_greaterZero
001B0C75    83C3 07add     ebx, 7                     ; если отрицательно, прибавляем 7
ebx_greaterZero:
001B0C78    C1FB 03         sar     ebx, 3                     ; ebx <-- a1.Length / 8001B0C7B    85DB            test    ebx, ebx
001B0C7D    7E 1D           jle     short fourBytesComparing
for8_body:
001B0C7F    8B06            mov     eax, [esi]
001B0C81    8B56 04         mov     edx, [esi+4]
001B0C84    3B57 04         cmp     edx, [edi+4]
001B0C87    7504           jne     short setResult_jumpReturn
001B0C89    3B07            cmp     eax, [edi]
001B0C8B    7404           je      short for8_increment
setResult_jumpReturn:
001B0C8D    33C0            xor     eax, eax                   ; result <-- 0001B0C8F    EB 56           jmp     short return2              ; gotoreturn
for8_increment:
001B0C91    41              inc     ecx
001B0C92    83C6 08add     esi, 8001B0C95    83C7 08add     edi, 8
for8_expression:
001B0C98    3BD9            cmp     ebx, ecx
001B0C9A  ^ 7F E3           jg      short for8_body

The structure of the cycle is quite trivial, for example, the cycle counter is traditionally located in ecx, the boundary value of the counter is located in ebx, which is also traditional. It is noteworthy here only that the first check of the loop condition is located immediately after the initialization and does not look like the main loop condition. It is also obvious that miracles do not happen, and the REX prefix is not available in either Protected Mode or Compatible Mode, i.e. comparison with QWORD blocks is not possible, therefore blocks with a DWORD size are compared. However, the code shows that before performing comparisons, the corresponding parts of the first array are loaded into the eax, edx registers, i.e. The JIT compiler attempted to produce machine code that was as close as possible to the source code.

It is striking that the compiler did not use the CMPSD string instruction here, namely, its “short” form, which compares DWORD blocks placed at esi and edi addresses, setting the appropriate flags. In this case, the size of the machine code would be several times smaller: the mov command here is 2 and 3 bytes in size, the cmp command is 2 and 3 bytes in size, and the size of each CMPSD command (in short form) would be only 1 byte, i.e. . for two teams only 2 bytes. This suggests that the JIT compiler does not want to optimize the code.

From the above listings, it is obvious that a couple of commands, the first of which is forwarding to eax, is a pattern that the compiler follows strictly.

An attempt to compare blocks of DWORD size is performed if such a volume remains:

fourBytesComparing:
001B0C9C    F745 E8 0400000 test    dword ptr [ebp-18], 00000004    ; ZF <-- (var3 & 4) == 0001B0CA3    7410           je      short length_lowerThenFour
001B0CA5    8B06            mov     eax, [esi]
001B0CA7    3B07            cmp     eax, [edi]
001B0CA9    7404           je      short dwords_equals              ; если блоки равны, переход к инкрименту регистров
001B0CAB    33C0            xor     eax, eax
001B0CAD    EB 38           jmp     short return2
dwords_equals:
001B0CAF    83C6 04add     esi, 4001B0CB2    83C7 04add     edi, 4

An attempt to compare blocks with the WORD size is performed if such a volume remains:

length_lowerThenFour:
001B0CB5    F745 E8 0200000 test    dword ptr [ebp-18], 00000002    ; ZF <-- (var3 & 2) == 0001B0CBC    7410           je      short length_lowerThenTwo
001B0CBE    0FBF06          movsx   eax, word ptr [esi]
001B0CC1    66:3B07         cmp     ax, [edi]
001B0CC4    7404           je      short words_equals              ; если блоки равны, переход к инкрименту регистров
001B0CC6    33C0            xor     eax, eax
001B0CC8    EB 1D           jmp     short return2
words_equals:
001B0CCA    46              inc     esi
001B0CCB    46              inc     esi
001B0CCC    47              inc     edi
001B0CCD    47              inc     edi

An attempt to compare blocks of BYTE size is performed if such a volume remains:

length_lowerThenTwo:
001B0CCE    F745 E8 0100000 test    dword ptr [ebp-18], 00000001   ; ZF <-- (var3 & 1) == 0001B0CD5    740B           je      short001B0CE2
001B0CD7    0FB606          movzx   eax, byte ptr [esi]
001B0CDA    3A07            cmp     al, [edi]
001B0CDC    7404           je      short001B0CE2                 ; если блоки равны, переход к инкрименту регистров
001B0CDE    33C0            xor     eax, eax
001B0CE0    EB 05           jmp     short return2
001B0CE2    B8 01000000     mov     eax, 1

Returning the result and throwing an exception:

return2:
001B0CE7    8D65 F4         lea     esp, [ebp-0C]
001B0CEA    5B              pop     ebx
001B0CEB    5E              pop     esi
001B0CEC    5F              pop     edi
001B0CED    5D              pop     ebp
001B0CEE    C3              ret
JIT_RngChkFail:
001B0CEF    E8 C4B1DB61     call    clr.JIT_RngChkFail
001B0CF4    CC              int3

CRT analysis of memcmp () function

This function is also interesting because it not only compares two buffers, but also finds out the relation between them, which means it is more complicated than those that were considered earlier.

After being hooked up by the debugger, I found out that C: \ Windows \ SysWOW64 \ msvcrt.dll version 7.0.7601.17744 was loaded into the process memory at the base address 0x76C20000.

Debugger Information Windows

This is an important clarification, since the code of different versions of libraries can be very different, because they can be compiled with different compiler options, and even more so, with different compilers.

The first look at the prepared function confused me a little: firstly, before the standard prologue, at the very beginning of the function, there was a strange instruction, and secondly, the dimensions of this function are amazing. The presence of “long” jumps was striking, in addition, the switch with 32 cases is intimidating.

Strange instruction:

76C37975   .  8BFF          mov     edi, edi                                   ; <--(!) Здесь начало функции INT msvcrt.memcmp(buf1,buf2,count)
76C37977   .  55            push    ebp
76C37978   .  8BEC          mov     ebp, esp

It copies the register into itself, without updating any flags, i.e. the result of its execution is zero, just like nop, but 2 bytes in size. Having scrolled the code in the debugger window, I found that many other functions start in the same way. Thanks to Raymond Chen’s blog, an explanation was quickly found. This is truly an analogue of nop.

It's a hot-patch point.
The MOV EDI, EDI instruction is a two-byte NOP, which is just enough space to patch in a jump instruction so that the function can be updated on the fly. The intention is that the MOV EDI, EDI instruction will be replaced with a two-byte JMP $ -5 instruction to redirect control to five bytes of patch space that comes immediately before the start of the function. Five bytes is enough for a full jump instruction, which can send control to the replacement function installed somewhere else in the address space.
blogs.msdn.com/b/oldnewthing/archive/2011/09/21/10214405.aspx

Long jumps and a very large switch

Address   Hex dump              Command                                               Comments
75A9797C   .  8B7D 10           mov     edi, [ebp+10]                                 ; edi <-- length75A9797F   .  8BC7              mov     eax, edi
75A97981   .  83E8 00subeax, 0                                        ;
75A97984   .  0F84 E7070100     je      msvcrt.zeroResult_GoReturn                     ; (length == 0)=> {result <-- 0, gotoreturn;} (!) Обратите внимание на размер инструкции
; часть кода опущена
; часть кода опущена
75A979BC   .  83FF 1F           cmp     edi, 1F                                       ; Switch (cases 1..1F, 32. exits)
75A979BF   .  775B             ja      short msvcrt.75A97A1C
75A979C1   .  FF24BD 1F8AA975   jmp     near [edi*4+msvcrt.75A98A1F]                  ; (!) Джамп на адрес из таблицы переходов (!) Обратите внимание на размер инструкции
; часть кода опущена
; часть кода опущена
75A98A1F   .  1C7AA975          dd      msvcrt.75A97A1C                               ; (00) Начало таблицы переходов, здесь jump на выход из функции
75A98A23   .  E88AA975          dd      msvcrt.75A98AE8                               ; (01)
75A98A27   .  CA8AA975          dd      msvcrt.75A98ACA                               ; (02)
75A98A2B   .  8C8BA975          dd      msvcrt.75A98B8C                               ; (03)
75A98A2F   .  0A7AA975          dd      msvcrt.75A97A0A                               ; (04)
75A98A33   .  088FA975          dd      msvcrt.75A98F08                               ; (05)
75A98A37   .  B88AA975          dd      msvcrt.75A98AB8                               ; (06)
75A98A3B   .  758BA975          dd      msvcrt.75A98B75                               ; (07)
75A98A3F   .  F479A975          dd      msvcrt.75A979F4                               ; (08)
75A98A43   .  238FA975          dd      msvcrt.75A98F23                               ; (09)
75A98A47   .  9F8AA975          dd      msvcrt.75A98A9F                               ; (0A)
75A98A4B   .  A18BA975          dd      msvcrt.75A98BA1                               ; (0B)
75A98A4F   .  DE79A975          dd      msvcrt.75A979DE                               ; (0C)
75A98A53   .  3A8FA975          dd      msvcrt.75A98F3A                               ; (0D)
75A98A57   .  FD8AA975          dd      msvcrt.75A98AFD                               ; (0E)
75A98A5B   .  ED8EA975          dd      msvcrt.75A98EED                               ; (0F)
75A98A5F   .  C879A975          dd      msvcrt.75A979C8                               ; (10)
75A98A63   .  518FA975          dd      msvcrt.75A98F51                               ; (11)
75A98A67   .  BA8EA975          dd      msvcrt.75A98EBA                               ; (12)
75A98A6B   .  6A98A975          dd      msvcrt.75A9986A                               ; (13)
75A98A6F   .  8990A975          dd      msvcrt.75A99089                               ; (14)
75A98A73   .  CD98A975          dd      msvcrt.75A998CD                               ; (15)
75A98A77   .  D58EA975          dd      msvcrt.75A98ED5                               ; (16)
75A98A7B   .  8598A975          dd      msvcrt.75A99885                               ; (17)
75A98A7F   .  1899A975          dd      msvcrt.75A99918                               ; (18)
75A98A83   .  E898A975          dd      msvcrt.75A998E8                               ; (19)
75A98A87   .  698FA975          dd      msvcrt.75A98F69                               ; (1A)
75A98A8B   .  9D98A975          dd      msvcrt.75A9989D                               ; (1B)
75A98A8F   .  3399A975          dd      msvcrt.75A99933                               ; (1C)
75A98A93   .  0099A975          dd      msvcrt.75A99900                               ; (1D)
75A98A97   .  848FA975          dd      msvcrt.75A98F84                               ; (1E)
75A98A9B   .  B598A975          dd      msvcrt.75A998B5                               ; (1F) Окончание таблицы переходов

Due to the great complexity of the analysis of the entire function, it was decided not to disassemble it completely, besides, only two things were of interest:

Overhead estimate for a platform call (PInvoke)
External evaluation of the main function code.

Platform overhead assessment

To estimate the overhead of calling the function, it was decided to trace the code from the managed code to the start of the function itself. To do this, a breakpoint was set at the beginning of the function, and after returning from Thread.Sleep (), tracing started. However, the result of the first trace attempt was unsuccessful: the trace log turned out to be too large (about 100 thousand lines), which probably indicated that DllMain () was executed, there was also the possibility that some CLR code was captured, perhaps JIT compiler code. What exactly was carried out there, I did not begin to find out: it was not of interest, because such a start code is executed only once and almost does not affect the overall picture. To exclude this code, I inserted another call to memcmp () before calling Thread.Sleep () and tracing again.

Trace Log Window

Part of the trace log:

main  00480AEA                    call0031C19C                        ESP=0016F368
; часть кода опущена
main  clr.628C3B5F                call    near [eax+14]                   ESP=0016F248        ; (1)
; часть кода опущена
main  00480B87                    mov     eax, [ebp-1C]                   EAX=00313880
main  00480B8A                    mov     eax, [eax+14]                   EAX=0031391C
main  00480B8D                    mov     ecx, [eax]                      ECX=75A97975        ; (2)
main  00480B8F                    push    dword ptr [ebp+0C]              ESP=0016F328
main  00480B92                    push    dword ptr [ebp+8]               ESP=0016F324
main  00480B95                    push    edi                             ESP=0016F320
main  00480B96                    push    dword ptr [ebp-10]              ESP=0016F31C
main  00480B99                    mov     dword ptr [ebp-2C], 0
main  00480BA0                    mov     [ebp-28], esp
main  00480BA3                    mov     dword ptr [ebp-24], 480BB0
main  00480BAA                    mov     byte ptr [ebx+8], 0
main  00480BAE                    call    ecx                             ESP=0016F318        ; (3)
main  msvcrt.memcmp               mov     edi, edi
--------  Logging stopped

It can be seen from the above log that, firstly, at least one indirect call occurs on the way to the function, and secondly, the CLR extracts the address of the final function from some data structure, i.e. the challenge has considerable indirectness and does not leave the transition prediction unit any hope of fulfilling its mission. This makes it senseless to take such functions outside the bounds of managed code if they do not process large chunks of data at one time and do not require a lot of computation time.

Evaluation of the main function code

In the best, least resource-intensive case, the function will detect the difference in the first byte and return the result. It is in this case that the function will spend the least time. In the worst case, the function compares the entire arrays and does not detect any differences. Obviously, in the second case, the data will be compared in blocks in a loop.

Since it is not possible to completely disassemble and analyze the entire function in a reasonable amount of time, it was decided to make two traces to evaluate the code being executed, and to evaluate both of these cases using the trace log. To analyze the best case, firstly, the test data generation function was modified so that all arrays differed starting from the first byte, and secondly, the function comparing all arrays was modified so that different arrays were compared in the first iteration of the loop .

Modified Array Comparison Function

privatestaticboolCompareArraysWithPInvokeMethod()
      {
         var result = true;
         for (int i = CountArrays - 1; i >= 0; i--) //Цикл в обратном направленииfor (int j = 0; j < CountArrays; j++)
            {
               var tmp = ByteArrayCompareWithPInvoke(s_arrays[i], s_arrays[j]);
               result = result && tmp;
            }
         return result;
      }

First trace (best case)

main  msvcrt.memcmp               mov     edi, edi
main  msvcrt.75A97977             push    ebp                             ESP=0041EC74
main  msvcrt.75A97978             mov     ebp, esp                        EBP=0041EC74
main  msvcrt.75A9797A             push    esi                             ESP=0041EC70
main  msvcrt.75A9797B             push    edi                             ESP=0041EC6C
main  msvcrt.75A9797C             mov     edi, [ebp+10]                   EDI=00080000                     ; edi <--countmainmsvcrt.75A9797Fmoveax, ediEAX=00080000                     ; eax <--edimainmsvcrt.75A97981subeax, 0                                                           ; if (eax == 0) {result <--0; return;}
mainmsvcrt.75A97984jemsvcrt.zeroResult_GoReturnmainmsvcrt.75A9798AdeceaxEAX=0007FFFFmainmsvcrt.75A9798Bjemsvcrt.75A98C10mainmsvcrt.75A97991deceaxEAX=0007FFFEmainmsvcrt.75A97992jemsvcrt.75A9E610mainmsvcrt.75A97998deceaxEAX=0007FFFDmainmsvcrt.75A97999jemsvcrt.75A9E5DFmainmsvcrt.75A9799FdeceaxEAX=0007FFFCmainmsvcrt.75A979A0jemsvcrt.75A98BD2mainmsvcrt.75A979A6movecx, [ebp+0C]                   ECX=034C53B8                    ; ecx <--buf1mainmsvcrt.75A979A9moveax, [ebp+8]                    EAX=05C41038                    ; eax <--buf2mainmsvcrt.75A979ACpushebxESP=0041EC68mainmsvcrt.75A979ADpush20ESP=0041EC64                    ; Очень необычный способ поместить число в регистр
mainmsvcrt.75A979AFpopedxEDX=00000020,ESP=0041EC68
;--------------------------------Начало цикла сравнения
mainmsvcrt.75A979B0cmpedi, edxmainmsvcrt.75A979B2jaemsvcrt.75A993A7mainmsvcrt.75A993A7movesi, [eax]                      ESI=4241403Fmainmsvcrt.75A993A9cmpesi, [ecx]
mainmsvcrt.75A993ABjnemsvcrt.75AA80E7                                                 ; Обнаружено отличие
mainmsvcrt.75AA80E7movzxesi, byteptr [eax]             ESI=0000003Fmainmsvcrt.75AA80EAmovzxebx, byteptr [ecx]             EBX=00000001mainmsvcrt.75AA80EDsubesi, ebxESI=0000003E                    ; Вычиление разницы между нулевыми байтами DWORD’а
mainmsvcrt.75AA80EFjnemsvcrt.75AA8178mainmsvcrt.75AA8178xorebx, ebxEBX=00000000                    ; Формирование результата в регистре ebxmainmsvcrt.75AA817Atestesi, esimainmsvcrt.75AA817CsetgblEBX=00000001mainmsvcrt.75AA817Fleaebx, [ebx+ebx-1]
mainmsvcrt.75AA8183movesi, ebxESI=00000001mainmsvcrt.75AA8185testesi, esimainmsvcrt.75AA8187jnemsvcrt.75A98AB1mainmsvcrt.75A98AB1moveax, esiEAX=00000001mainmsvcrt.75A98AB3jmpmsvcrt.75A97A1Emainmsvcrt.75A97A1EpopebxEBX=00852AE0,ESP=0041EC6Cmainmsvcrt.return1popediESP=0041EC70,EDI=034C53B8mainmsvcrt.75A97A20popesiESP=0041EC74,ESI=034C53B0mainmsvcrt.75A97A21popebpESP=0041EC78,EBP=0041ECC4mainmsvcrt.75A97A22retESP=0041EC7C

The first thing that catches your eye here is that the handling of special cases, i.e. cases where the parameter count lies within [0..4], made quite unusual. One can only guess whether this is the result of compiling the switch clause or whether a local variable was actually set up there, which was decremented at each step of the check. However, the debugging information claims that it was still a switch. From the point of view of optimization, this is quite a reasonable action, because loop initialization is not performed.

The second thing that immediately caught my eye was a very unusual way of entering the number 0x20 in the edx register (via the stack). This is very similar to some kind of compilation artifact, and clearly indicates that the function was not written in assembler. A person would not write this, because this is pointless: the stack is in memory, and accesses to it are always slower than to registers. I venture to suggest that this is an inline artifact.

After detecting the difference of double words in the buffers, the transition to the address 0x75AA8178 is performed, where the first bytes from the source buffers are loaded into the esi and ebx registers at the addresses where the difference was detected. Then follows the calculation of the difference between these bytes, and if no differences are found, then the next bytes are loaded, and so on for the entire double word. It is noteworthy that the result does not depend on the number of the byte in which the difference is detected. This can be seen from the fact that the code for generating the result for the last byte in DWORD is completely identical to the code for the same purpose for the first byte, given above in the tracing log of the first trace.

Sequential double word byte comparison

;Address   Hex dump          Command                                  Comments
75AA80E7   >  0FB630        movzx   esi, byte ptr [eax]
75AA80EA   .  0FB619        movzx   ebx, byte ptr [ecx]
75AA80ED   .  2BF3          subesi, ebx; Вычиление разницы между нулевыми байтами DWORD’а
75AA80EF   .- 0F85 83000000 jne     msvcrt.75AA8178                    ; Джамп на формирование результата в регистре ebx
75AA80F5   >  0FB67001     movzx   esi, byte ptr [eax+1]
75AA80F9   .  0FB659 01     movzx   ebx, byte ptr [ecx+1]
75AA80FD   .  2BF3          subesi, ebx
75AA80FF   .- 0F84 1EF9FEFFjemsvcrt.75A97A23; часть кода опущена
75A97A23   >  0FB67002     movzx   esi, byte ptr [eax+2]
75A97A27   .  0FB659 02     movzx   ebx, byte ptr [ecx+2]
75A97A2B   .  2BF3          subesi, ebx
75A97A2D   .- 74 15         jeshortmsvcrt.75A97A44; часть кода опущена
75A97A44   >  0FB67003     movzx   esi, byte ptr [eax+3]
75A97A48   .  0FB659 03     movzx   ebx, byte ptr [ecx+3]
75A97A4C   .  2BF3          subesi, ebx
75A97A4E   .- 0F84 5F190000jemsvcrt.75A993B3; Джамп куда-то внутрь цикла
75A97A54   .  33DB          xor     ebx, ebx                           ; Формирование результата в регистре ebx
75A97A56   .  85F6          test    esi, esi
75A97A58   .  0F9FC3        setg    bl
75A97A5B   .  8D5C1B FF     lea     ebx, [ebx+ebx-1]
75A97A5F   .  8BF3          mov     esi, ebx
75A97A61   .- E9 4D190000   jmp     msvcrt.75A993B3
; часть кода опущена
75A993B1   . |33F6          xor     esi, esi
75A993B3   > |85F6          test    esi, esi
75A993B5   .-|0F85 F6F6FFFF jne     msvcrt.75A98AB1

Code duplication is not good, but not scary, but sequential comparison of bytes is not the best way to compare double words, especially since the byte number does not affect the result. Thus, it is already visible that this code is somewhat strange, perhaps because the source code is even more strange.

First look at the log of the second trace: 8 double words are compared in one iteration of the cycle, and this is good: it is clear that the cycle is expanded, and on the other hand, after each comparison, exactly the same useless code goes: the register esi is entered 0 and the contents of the register esi are analyzed . I have no assumptions about why this was done, but there is an assumption about how this could happen: firstly, it is very similar to the result of some macro that formed the assembler code, and secondly, it seems that the Microsoft C compiler is not as good as I thought about it before.

Second trace (loop only)

;--------------------------------Начало цикла сравнения
main  msvcrt.75A979B0             cmp     edi, edx
main  msvcrt.75A979B2             jae     msvcrt.75A993A7
main  msvcrt.75A993A7             mov     esi, [eax]                      ESI=00000000
main  msvcrt.75A993A9             cmp     esi, [ecx]
main  msvcrt.75A993AB             jne     msvcrt.75AA80E7
main  msvcrt.75A993B1             xor     esi, esi
main  msvcrt.75A993B3             test    esi, esi
main  msvcrt.75A993B5             jne     msvcrt.75A98AB1
main  msvcrt.75A993BB             mov     esi, [eax+4]
main  msvcrt.75A993BE             cmp     esi, [ecx+4]
main  msvcrt.75A993C1             jne     msvcrt.75AA811F
main  msvcrt.75A993C7             xor     esi, esi
main  msvcrt.75A993C9             test    esi, esi
main  msvcrt.75A?993CB             jne     msvcrt.75A98AB1
main  msvcrt.75A993D1             mov     esi, [eax+8]
main  msvcrt.75A993D4             cmp     esi, [ecx+8]
main  msvcrt.75A993D7             jne     msvcrt.75A97A9A
main  msvcrt.75A993DD             xor     esi, esi
main  msvcrt.75A993DF             test    esi, esi
main  msvcrt.75A993E1             jne     msvcrt.75A98AB1
main  msvcrt.75A993E7             mov     esi, [eax+0C]
main  msvcrt.75A993EA             cmp     esi, [ecx+0C]
main  msvcrt.75A993ED             jne     msvcrt.75A97B1F
main  msvcrt.75A993F3             xor     esi, esi
main  msvcrt.75A993F5             test    esi, esi
main  msvcrt.75A993F7             jne     msvcrt.75A98AB1
main  msvcrt.75A993FD             mov     esi, [eax+10]
main  msvcrt.75A99400             cmp     esi, [ecx+10]
main  msvcrt.75A99403             jne     msvcrt.75A97BA4
main  msvcrt.75A99409             xor     esi, esi
main  msvcrt.75A9940B             test    esi, esi
main  msvcrt.75A9940D             jne     msvcrt.75A98AB1
main  msvcrt.75A99413             mov     esi, [eax+14]
main  msvcrt.75A99416             cmp     esi, [ecx+14]
main  msvcrt.75A99419             jne     msvcrt.75A97C29
main  msvcrt.75A9941F             xor     esi, esi
main  msvcrt.75A99421             test    esi, esi
main  msvcrt.75A99423             jne     msvcrt.75A98AB1
main  msvcrt.75A99429             mov     esi, [eax+18]
main  msvcrt.

    
    Also popular now: 
    
        
            Frontend Conf program for those who do not have a time flywheel / Conference Blog Oleg Bunin (Ontiko)
        
        
            7 tools that help us defeat procrastination / Digital Ecosystems Blog
        
        
            How and why we made our ATMs / Tinkoff.ru blog
        
        
            The "ability" of machine learning to predict the future of chaotic systems
        
        
            Can the Defense Ministry drown out Iridium, GlobalStar and OneWeb? / Bigd.host Blog
        
        
            How we defeated chaos in the central warehouse / CarPrice Blog
        
        
            “You can make your products on imported machines, this will not make it less domestic”
        
        
            An inside look at a modern domestic microelectronics plant and not only
        
        
            Chip aging accelerates
        
        
            New features of LCD serial interfaces