Quickly remove spaces from strings on ARM processors - alternative analysis

    Original article
    Posted by Martin Krastev


    A friend of mine drew my attention to an interesting article on habrahabr.ru - Russian translation of an article by Daniel Lemira Fast removal of spaces from strings on ARM processors . This article intrigued me for two reasons: firstly, someone actually spent time and efforts to find an optimal solution to a common problem on non-x86 architecture (hooray!), And secondly, the author gave a few results at the end of the article puzzled me: about a 6-fold advantage for Intel? The author made an unambiguous conclusion that ARM is very far in the ratio of “efficiency per clock” to “big iron” from Intel in this simple task.


    Challenge accepted!


    But let's start from the very beginning! The author started with some kind of baseline - sequential implementation, so I also decided to start from there and move up. Let's call this basis testee00for greater confusion:


    inline void testee00() {
        size_t i = 0, pos = 0;
        while (i < 16) {
            const char c = input[i++];
            output[pos] = c;
            pos += (c > 32 ? 1 : 0);
        }
    }

    I ran testee00on several amd64 processors and one arm64 processor, using different versions of the GCC and Clang compiler, always taking the best compilation result as a basis. Below are the results of the measure / symbol ratio, calculated perf -e cyclesdivided by the number of processed symbols (in our case, 5 10 ^ 7 16) and truncated to the 4th digit after the decimal point:


    CPUCompiler & flagsclocks / character
    Intel Xeon E5-2687W (SNB)g ++ - 4.8 -Ofast1.6363
    Intel Xeon E3-1270v2 (IVB)g ++ - 5.1 -Ofast1.6186
    Intel i7-5820K (HSW)g ++ - 4.8 -Ofast1.5223
    AMD Ryzen 7 1700 (Zen)g ++ - 5.4 -Ofast1.4113
    Marvell 8040 (Cortex-A72)g ++ - 5.4 -Ofast1.3805

    Table 1. testee00Desktop core performance


    Interesting, right? A small phone chip (3-Decoder OoO) really gives a better clock / symbol ratio than a larger desktop chip (at the end of this article you can see the actual statistics).


    So let's move on to SIMD. I do not pretend to be considered an experienced encoder on NEON, but sometimes I bother with ARM SIMD. I will not inline SIMD routines in the main body of this article so as not to scare the reader away; Instead, all test code and participating test procedures can be found at the end of this article.


    I took the liberty of changing the original SSSE3 pruning procedure for Daniel - in fact, I used my version for the test. Cause? I can't just take 2 ^ 16 * 2 ^ 4 = 1 MB of lookup table in my code - this would be a big cache eater for any scenarios where we do not just trim ascii threads, but calling the subroutine makes other work easier. The LSS-less SSSE3 version comes with the price of a few calculations, but it only works on registers, and, as you will see, the price for excluding the table is not prohibitive even with intensive cropping loads. Moreover, both the new version of SSSE3 and the version of NEON (ASIMD2) use the same algorithm, so the comparison is as direct as physically possible:


    CPUCompiler & flagsclocks / character
    Intel Xeon E5-2687W (SNB)g ++ - 4.8 -Ofast -mssse3.4230
    Intel Xeon E3-1270v2 (IVB)g ++ - 5.4 -Ofast -mssse3.3774
    Marvell 8040 (Cortex-A72)g ++ - 5.4 -Ofast -mcpu = cortex-a571.0503

    Table 2. testee01Desktop core performance


    Note: The microarchitecture tuning for A57 is transferred to the arm64 build, since the default scheduler from the compiler is clearly worse in this version as regards the NEON code, and A57 is a pretty “common” common denominator of ARMv8 when it comes to planning.


    As you can see, the efficiency per clock is 2x in favor of Sandy Bridge - the core, which with the same (or similar) fabnode will be 4 times larger in area A72. So everything is not so bad for telephone chips; )


    Bonus material: the same test on small arm64 and amd64 CP:


    CPUCompiler & flagsclocks / character, scalarclocks / character, vector
    AMD C60 (Bobcat)g ++ - 4.8 -Ofast -mssse33.57511.8215
    MediaTek MT8163 (Cortex-A53)clang ++ - 3.6 -march = armv8-a -mtune = cortex-a53 -Ofast2.65681.7100

    Table 3. Performance testee00on testee01at the entry-level cores




    Xeon E5-2687W @ 3.10GHz


    Scalar version


    $ g++-4.8 prune.cpp -Ofast
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
            421.886991      task-clock (msec)         #    0.998 CPUs utilized
         1,309,087,898      cycles                    #    3.103 GHz
         4,603,132,268      instructions              #    3.52  insns per cycle
           0.422602570 seconds time elapsed
    $ echo "scale=4; 1309087898 / (5 * 10^7 * 16)" | bc
    1.6363

    SSSE3 version (batch of 16, misaligned write)


    $ g++-4.8 prune.cpp -Ofast -mssse3
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234a
     Performance counter stats for './a.out':
            109.063426      task-clock (msec)         #    0.997 CPUs utilized
           338,414,215      cycles                    #    3.103 GHz
         1,052,118,398      instructions              #    3.11  insns per cycle
           0.109422808 seconds time elapsed
    $ echo "scale=4; 338414215 / (5 * 10^7 * 16)" | bc
    .4230



    Xeon E3-1270v2 @ 1.60GHz


    Scalar version


    $ g++-5 -Ofast prune.cpp
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
            810.515709 task-clock (msec)         #    0.999 CPUs utilized
         1,294,903,960 cycles                    #    1.598 GHz
         4,601,118,631 instructions              #    3.55  insns per cycle
           0.811646618 seconds time elapsed
    $ echo "scale=4; 1294903960 / (5 * 10^7 * 16)" | bc
    1.6186

    SSSE3 version (batch of 16, misaligned write)


    $ g++-5 -Ofast prune.cpp -mssse3
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234a
     Performance counter stats for './a.out':
            188.995814 task-clock (msec)         #    0.997 CPUs utilized
           301,931,101 cycles                    #    1.598 GHz
         1,050,607,539 instructions              #    3.48  insns per cycle
           0.189536527 seconds time elapsed
    $ echo "scale=4; 301931101 / (5 * 10^7 * 16)" | bc
    .3774



    Intel i7-5820K


    Scalar version


    $ g++-4.8 -Ofast prune.cpp
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
            339.202545      task-clock (msec)         #    0.997 CPUs utilized
         1,204,872,493      cycles                    #    3.552 GHz
         4,602,943,398      instructions              #    3.82  insn per cycle
           0.340089829 seconds time elapsed
    $ echo "scale=4; 1204872493 / (5 * 10^7 * 16)" | bc
    1.5060



    AMD Ryzen 7 1700


    Scalar version


    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
            356,169901      task-clock:u (msec)       #    0,999 CPUs utilized
            1129098820      cycles:u                  #    3,170 GHz
            4602126161      instructions:u            #    4,08  insn per cycle
           0,356353748 seconds time elapsed
    $ echo "scale=4; 1129098820 / (5 * 10^7 * 16)" | bc
    1.4113



    Marvell ARMADA 8040 (Cortex-A72) @ 1.30GHz


    Scalar version


    $ g++-5 prune.cpp -Ofast
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
            849.549040      task-clock (msec)         #    0.999 CPUs utilized
         1,104,405,671      cycles                    #    1.300 GHz
         3,251,212,918      instructions              #    2.94  insns per cycle
           0.850107930 seconds time elapsed
    $ echo "scale=4; 1104405671 / (5 * 10^7 * 16)" | bc
    1.3805

    ASIMD2 version (batch of 16, misaligned write)


    $ g++-5 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
            646.394560      task-clock (msec)         #    0.999 CPUs utilized
           840,305,966      cycles                    #    1.300 GHz
           801,000,092      instructions              #    0.95  insns per cycle
           0.646946289 seconds time elapsed
    $ echo "scale=4; 840305966 / (5 * 10^7 * 16)" | bc
    1.0503

    ASIMD2 version (batch of 32, misaligned write)


    $ clang++-3.7 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
           1140.643640      task-clock (msec)         #    0.999 CPUs utilized
         1,482,826,308      cycles                    #    1.300 GHz
         1,504,011,807      instructions              #    1.01  insns per cycle
           1.141241760 seconds time elapsed
    $ echo "scale=4; 1482826308 / (5 * 10^7 * 32)" | bc
    .9267



    AMD C60 (Bobcat) @ 1.333GHz


    Scalar version


    $ g++-4.8 prune.cpp -Ofast
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234
     Performance counter stats for './a.out':
           2208.190651 task-clock (msec)         #    0.997 CPUs utilized
         2,860,081,604 cycles                    #    1.295 GHz
         4,602,968,860 instructions              #    1.61  insns per cycle
           2.214173331 seconds time elapsed
    $ echo "scale=4; 2860081604 / (5 * 10^7 * 16)" | bc
    3.5751

    SSSE3 version (batch of 16, misaligned write)


    $ clang++-3.5 prune.cpp -Ofast -mssse3
    $ perf stat -e task-clock,cycles,instructions -- ./a.out
    alabalanica1234a
     Performance counter stats for './a.out':
           1098.519499 task-clock (msec)         #    0.998 CPUs utilized
         1,457,266,396 cycles                    #    1.327 GHz
         1,053,073,591 instructions              #    0.72  insns per cycle
           1.101240320 seconds time elapsed
    $ echo "scale=4; 1457266396 / (5 * 10^7 * 16)" | bc
    1.8215



    MediaTek MT8163 (Cortex-A53) @ 1.50GHz (sans perf)


    Scalar version


    $ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast
    $ time ./a.out
    alabalanica1234
    real    0m1.417s
    user    0m1.410s
    sys     0m0.000s
    $ echo "scale=4; 1.417 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
    2.6568

    ASIMD2 version (batch of 16, misaligned write)


    $ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast
    $ time ./a.out
    alabalanica1234
    real    0m0.912s
    user    0m0.900s
    sys     0m0.000s
    $ echo "scale=4; 0.912 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
    1.7100

    Martin Krastev.


    Translation: Dmitry Alexandrov


    Also popular now: