About imaginary and real optimizations 10 times, healing SSE, and all that

Published on May 18, 2013

About imaginary and real optimizations 10 times, healing SSE, and all that

    Based on one post yesterday about optimizing conditional transitions when calculating x = sign (a, b) * min (abs (a), abs (b)), supposedly 10 times. Short summary:

    • optimization is obvious, but the size is imaginary: not 10 times, but 2.5 times;
    • benchmarks must be done correctly: do not measure CPU stalls, RAM bandwidth, etc. instead of the function under study;
    • benchmarks must be done correctly: otherwise they can tremble wildly;
    • set priority only cool, but on short benchmarks in vain: + 0.5% speed, -15% jitter;
    • you need to measure the investigated function and only it, the only way you get the correct data;
    • it is necessary to warm percent, it is necessary to consider a minimum of N runs / seconds, only in this way you overcome trembling;
    • you need to use SSE, it turned out 8.6 times with it, and the code ... is read.

    In general, again a bunch of classical methodological errors in the benchmark. Who cares how to DO NOT make such mistakes, details, detailed debriefing, optimization several more times and, most importantly, the source code under the cut.

    I read the original post yesterday , I was very surprised at the acceleration of 10 times due to the elimination of transitions, even for synthetics. This is too much, transitions are expensive, but not so much. I tried to repeat the results, looked more carefully, and naturally: again kindergarten errors in the benchmark methodology! Well, it's time to disassemble them again, a good example.

    Error # 1 is that the given source code stupidly does not measure anything sane at all. The sum of the results llr () counts, and that's good. But the compiler sees that it is not used in any way, and therefore has every right to optimize. I just optimized. In addition, the originally published (now sorted out and fixed) optimization options considered the wrong result at all, and it was invisible. Oh…

    Moral # 1: boys, print the results, ALWAYS. You will immediately catch errors, and the compiler will not throw out the “unnecessary” loop. It also helps to declare the result as volatile, but you still need to print it.

    Error # 2 is that the author measures the rand () + llr () loop, then measures the rand () + llr2 () loop, then subtracts the runtime by hand and by eye. This is a bad idea for two reasons. rand is very slow, the benchmark turns out to be unreasonably long. ;) This time. In the experiment, the performance of the stuffing of two functions is measured, and this stuffing obviously behaves NOT as soon as the desired function. These are two.

    In the general case, the approach “let's measure the stuffing from the functions A + B” is unsuitable because the compiler can interfere with the calculations. It turns out that part of the “measured” function B hid in function A, and in fact we measure the unknown part from B. It is good when A + B is a combat pair that is used in this way. It’s bad when instead of A test syntax like rand ().

    In this case, such mixing does not occur, in call_rand () dysasma, without any attempts to inline it, but obviously there is some other misfortune. Which one, I don’t understand, but the working hypothesis about CPU stalls. The hypothesis is that by slightly delaying the start of llr () calculations, which starts with the test esi, esi instruction, where esi has just been returned from rand (), we can measurably speed up the original benchmark. At the same time, simply repeating the cycle 2 times, of course, does not give an effect, it is necessary to spread the calculations exactly:

    10.7 sec, rand ()
    13.3 sec, rand () + llr ()
    12.6 sec, rand () + llr () + 2x unroll

    // 2x unroll без ускорения, 13.3 sec
    int a = rand() - RAND_MAX / 2;
    int b = rand() - RAND_MAX / 2;
    x += LLR(a,b);
    a = rand() - RAND_MAX / 2;
    b = rand() - RAND_MAX / 2;
    x += LLR(a,b);
    // 2x unroll c ускорением, 12.6 sec
    int a = rand() - RAND_MAX / 2;
    int b = rand() - RAND_MAX / 2;
    int a2 = rand() - RAND_MAX / 2;
    int b2 = rand() - RAND_MAX / 2;
    x += LLR(a,b);
    x += LLR(a2,b2);
    


    In one of the experiments, by the way, acceleration up to 12.8 sec was generally given by the stupid insert asm {nop} before x + = LLR (a, b), but it was not possible to confidently repeat such a miracle. Some kind of random fluctuation. But in general, it is clearly evident that measuring the stuffing from heavy rand () and light llr () is an occupation, hmm, non-trivial. I assume that somewhere in the test / jxx instruction pair, stalls occur because the expression has just been counted. Maybe someone who has VTune on hand will be able to see more precisely, why.

    Moral # 2: boys, do not measure the minced meat from the investigated function and synthetics, measure only the function you need. How the compiler will mix that stuffing and what special effects will appear at the interface, not guess.

    We get rid of the braking rand (), make the pre-calculation of a sufficiently large block of random numbers per cycle, inside the cycle we leave only the calculation and summation llr (). In total, we measure in addition to the function an overhead for access to memory, but with linear reading it is minimal. SUDDENLY, instead of an imaginary acceleration of 10 times, we observe a real acceleration of a modest 2.5 times.

    Ok, this is already in line with expectations, but now it's not fast and good enough;)

    SSE comes to the rescue. I have a new Core2Duo E8500 on my desktop, but even it can SSE4. We expand the cycle 4 times, count 4 pairs at a time. Directly on the forehead we use the specifications for calculating sign, abs.

    1.073 sec, llr () baseline
    0.438 sec, llr2 () optimized, 2.5x
    0.125 sec, llr4 () sse + 4x unroll, 8.6x

    Interestingly, the code is pretty readable. The only thing is that you need to carefully comment on _mm_sign_epi32 (). First, he suddenly takes also the 2nd argument, and as if multiplies it by the sign of the 1st argument. What you need. The second, while _mm_sign (0) = 0, not 1, as you would like for some tasks. However, for our purposes there is no difference, maybe. if abs (a) or abs (b) is 0, then sign * min (abs) = 0, so there is no error.

    static inline __m128i LLR4(const int * pa, const int * pb)
    {
    	__m128i a = *(__m128i*)pa;
    	__m128i b = *(__m128i*)pb;
    	// sign(a,b)*min(abs(a),abs(b))
    	__m128i absa = _mm_abs_epi32(a);
    	__m128i absb = _mm_abs_epi32(b);
    	__m128i absm = _mm_min_epi32(absa, absb);
    	__m128i ab = _mm_mullo_epi32(a, b);
    	__m128i rr = _mm_sign_epi32(absm, ab);
    	return rr;
    }
    


    Marginal note: interestingly, summing the register components through _mm_hadd_epi32 instead of unloading the register into memory and then adding 4 normal ints does not produce results. I wish I had already seen the effect of hadd at least somewhere, for a long time I want.

    Another note: interestingly, the further development of the cycle also does not produce results. Apparently, it already depends on the speed of reading memory, there about 6.4 GB / sec comes out.

    Error # 3 , it means, is that SSE extensions have not been used for a long time, all the valiant optimizations for some reason are made for the platform, essentially i386.

    This is a controversial conclusion, I thought for a long time whether to write it or how. Is it a “mistake” or not? I personally think so. Because if you optimize, so big! The scholastics in kamenti will surely conquer, what is not. After all, this version is vectorized and works on 4 pairs of numbers at a time, and this is a decidedly incompatible change in the original function. Plus optimizations for i386 are also very valuable, suddenly the program will be launched not on i7, but on the monument of IT archeology. However, in any case, morality is certainly the same.

    Moral # 3: boys, now 2013, SSE4 is almost everywhere, SSE2 is everywhere, act accordingly, optimize (if possible) for 99% of users, and not 1% fallback.

    The resulting SSE version is now quite vigorous to watch how the time on different launches dances from 0.125 to 0.145 seconds. Almost 20% of the difference. Oh ...

    Error # 4 just appeared due to optimization;) Setting high priority of the thread and the process gives about 0.5% of productivity, but it doesn’t save from trembling measurements. First, an idle processor resets the frequency, but does not immediately gain it back. He succeeds in 10+ seconds, not in 0.1 seconds. Second, even if you run a short test 100 times, the time of each separate iteration will still dance. Both at the beginning and at the end of these 100 runs. You never know that there in a supposedly idle system with 99% idle it still works for itself and how it affects it.

    The processor must be “warmed up” and even after that the results of (fairly short) runs must be filtered. Just doing a little more iterations and averaging is not enough, the total time is still noticeably trembling. Throwing out the first “cold” iterations is not enough either, the result is trembling. You can consider the average and variance, throw outliers, where abs (value-mean)> = k * stddev, then recalculate the average, I tried, but it’s trembling!

    And least of all, such a simple method trembles: we make several runs (I do 10) and select from them the minimum time for one run. It turns out that such a minimum stands rooted to the spot, the differences in several launches by a maximum of 0.5%.

    Please note: these runs must be done in additionto the initial warm-up within 3-10 seconds, otherwise 10 * 0.1 = 1 second will not be enough again to gain the train level and despite the focus with a minimum, time will still tremble. Or it is necessary to do (strongly) more than 10 runs. In our example, the very first reference implementation at the same time works as such a warm-up. But if the line Bench (Test1) is commented out, the dances named after St. Vitus will begin again. Oops, throttling style!

    Moral # 4: boys, warm the percent for at least 3 seconds, and preferably 10 seconds, and in addition count at least N runs.

    With benchmarks, you can still make a lot of mistakes, and probably in the final code (by the way, here it is: gist.github.com/anonymous/5605201 ) something else is missing, but today we sorted out these. Hopefully not completely useless.

    Be vigilant, measure correctly, optimize without mercy and to the end;)