advbg July 26, 2017 at 10:30

Quickly remove spaces from strings on ARM processors - alternative analysis

→ Original article
Posted by Martin Krastev

A friend of mine drew my attention to an interesting article on habrahabr.ru - Russian translation of an article by Daniel Lemira Fast removal of spaces from strings on ARM processors . This article intrigued me for two reasons: firstly, someone actually spent time and efforts to find an optimal solution to a common problem on non-x86 architecture (hooray!), And secondly, the author gave a few results at the end of the article puzzled me: about a 6-fold advantage for Intel? The author made an unambiguous conclusion that ARM is very far in the ratio of “efficiency per clock” to “big iron” from Intel in this simple task.

Challenge accepted!

But let's start from the very beginning! The author started with some kind of baseline - sequential implementation, so I also decided to start from there and move up. Let's call this basis testee00for greater confusion:

inline void testee00() {
    size_t i = 0, pos = 0;
    while (i < 16) {
        const char c = input[i++];
        output[pos] = c;
        pos += (c > 32 ? 1 : 0);
    }
}

I ran testee00on several amd64 processors and one arm64 processor, using different versions of the GCC and Clang compiler, always taking the best compilation result as a basis. Below are the results of the measure / symbol ratio, calculated perf -e cyclesdivided by the number of processed symbols (in our case, 5 10 ^ 7 16) and truncated to the 4th digit after the decimal point:

CPU	Compiler & flags	clocks / character
Intel Xeon E5-2687W (SNB)	g ++ - 4.8 -Ofast	1.6363
Intel Xeon E3-1270v2 (IVB)	g ++ - 5.1 -Ofast	1.6186
Intel i7-5820K (HSW)	g ++ - 4.8 -Ofast	1.5223
AMD Ryzen 7 1700 (Zen)	g ++ - 5.4 -Ofast	1.4113
Marvell 8040 (Cortex-A72)	g ++ - 5.4 -Ofast	1.3805

Table 1. testee00Desktop core performance

Interesting, right? A small phone chip (3-Decoder OoO) really gives a better clock / symbol ratio than a larger desktop chip (at the end of this article you can see the actual statistics).

So let's move on to SIMD. I do not pretend to be considered an experienced encoder on NEON, but sometimes I bother with ARM SIMD. I will not inline SIMD routines in the main body of this article so as not to scare the reader away; Instead, all test code and participating test procedures can be found at the end of this article.

I took the liberty of changing the original SSSE3 pruning procedure for Daniel - in fact, I used my version for the test. Cause? I can't just take 2 ^ 16 * 2 ^ 4 = 1 MB of lookup table in my code - this would be a big cache eater for any scenarios where we do not just trim ascii threads, but calling the subroutine makes other work easier. The LSS-less SSSE3 version comes with the price of a few calculations, but it only works on registers, and, as you will see, the price for excluding the table is not prohibitive even with intensive cropping loads. Moreover, both the new version of SSSE3 and the version of NEON (ASIMD2) use the same algorithm, so the comparison is as direct as physically possible:

CPU	Compiler & flags	clocks / character
Intel Xeon E5-2687W (SNB)	g ++ - 4.8 -Ofast -mssse3	.4230
Intel Xeon E3-1270v2 (IVB)	g ++ - 5.4 -Ofast -mssse3	.3774
Marvell 8040 (Cortex-A72)	g ++ - 5.4 -Ofast -mcpu = cortex-a57	1.0503

Table 2. testee01Desktop core performance

Note: The microarchitecture tuning for A57 is transferred to the arm64 build, since the default scheduler from the compiler is clearly worse in this version as regards the NEON code, and A57 is a pretty “common” common denominator of ARMv8 when it comes to planning.

As you can see, the efficiency per clock is 2x in favor of Sandy Bridge - the core, which with the same (or similar) fabnode will be 4 times larger in area A72. So everything is not so bad for telephone chips; )

Bonus material: the same test on small arm64 and amd64 CP:

CPU	Compiler & flags	clocks / character, scalar	clocks / character, vector
AMD C60 (Bobcat)	g ++ - 4.8 -Ofast -mssse3	3.5751	1.8215
MediaTek MT8163 (Cortex-A53)	clang ++ - 3.6 -march = armv8-a -mtune = cortex-a53 -Ofast	2.6568	1.7100

Table 3. Performance testee00on testee01at the entry-level cores

Xeon E5-2687W @ 3.10GHz

Scalar version

$ g++-4.8 prune.cpp -Ofast
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
        421.886991      task-clock (msec)         #    0.998 CPUs utilized
     1,309,087,898      cycles                    #    3.103 GHz
     4,603,132,268      instructions              #    3.52  insns per cycle
       0.422602570 seconds time elapsed
$ echo "scale=4; 1309087898 / (5 * 10^7 * 16)" | bc
1.6363

SSSE3 version (batch of 16, misaligned write)

$ g++-4.8 prune.cpp -Ofast -mssse3
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234a
 Performance counter stats for './a.out':
        109.063426      task-clock (msec)         #    0.997 CPUs utilized
       338,414,215      cycles                    #    3.103 GHz
     1,052,118,398      instructions              #    3.11  insns per cycle
       0.109422808 seconds time elapsed
$ echo "scale=4; 338414215 / (5 * 10^7 * 16)" | bc
.4230

Xeon E3-1270v2 @ 1.60GHz

Scalar version

$ g++-5 -Ofast prune.cpp
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
        810.515709 task-clock (msec)         #    0.999 CPUs utilized
     1,294,903,960 cycles                    #    1.598 GHz
     4,601,118,631 instructions              #    3.55  insns per cycle
       0.811646618 seconds time elapsed
$ echo "scale=4; 1294903960 / (5 * 10^7 * 16)" | bc
1.6186

SSSE3 version (batch of 16, misaligned write)

$ g++-5 -Ofast prune.cpp -mssse3
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234a
 Performance counter stats for './a.out':
        188.995814 task-clock (msec)         #    0.997 CPUs utilized
       301,931,101 cycles                    #    1.598 GHz
     1,050,607,539 instructions              #    3.48  insns per cycle
       0.189536527 seconds time elapsed
$ echo "scale=4; 301931101 / (5 * 10^7 * 16)" | bc
.3774

Intel i7-5820K

Scalar version

$ g++-4.8 -Ofast prune.cpp
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
        339.202545      task-clock (msec)         #    0.997 CPUs utilized
     1,204,872,493      cycles                    #    3.552 GHz
     4,602,943,398      instructions              #    3.82  insn per cycle
       0.340089829 seconds time elapsed
$ echo "scale=4; 1204872493 / (5 * 10^7 * 16)" | bc
1.5060

AMD Ryzen 7 1700

Scalar version

$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
        356,169901      task-clock:u (msec)       #    0,999 CPUs utilized
        1129098820      cycles:u                  #    3,170 GHz
        4602126161      instructions:u            #    4,08  insn per cycle
       0,356353748 seconds time elapsed
$ echo "scale=4; 1129098820 / (5 * 10^7 * 16)" | bc
1.4113

Marvell ARMADA 8040 (Cortex-A72) @ 1.30GHz

Scalar version

$ g++-5 prune.cpp -Ofast
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
        849.549040      task-clock (msec)         #    0.999 CPUs utilized
     1,104,405,671      cycles                    #    1.300 GHz
     3,251,212,918      instructions              #    2.94  insns per cycle
       0.850107930 seconds time elapsed
$ echo "scale=4; 1104405671 / (5 * 10^7 * 16)" | bc
1.3805

ASIMD2 version (batch of 16, misaligned write)

$ g++-5 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
        646.394560      task-clock (msec)         #    0.999 CPUs utilized
       840,305,966      cycles                    #    1.300 GHz
       801,000,092      instructions              #    0.95  insns per cycle
       0.646946289 seconds time elapsed
$ echo "scale=4; 840305966 / (5 * 10^7 * 16)" | bc
1.0503

ASIMD2 version (batch of 32, misaligned write)

$ clang++-3.7 prune.cpp -Ofast -mcpu=cortex-a57 -mtune=cortex-a57
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
       1140.643640      task-clock (msec)         #    0.999 CPUs utilized
     1,482,826,308      cycles                    #    1.300 GHz
     1,504,011,807      instructions              #    1.01  insns per cycle
       1.141241760 seconds time elapsed
$ echo "scale=4; 1482826308 / (5 * 10^7 * 32)" | bc
.9267

AMD C60 (Bobcat) @ 1.333GHz

Scalar version

$ g++-4.8 prune.cpp -Ofast
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234
 Performance counter stats for './a.out':
       2208.190651 task-clock (msec)         #    0.997 CPUs utilized
     2,860,081,604 cycles                    #    1.295 GHz
     4,602,968,860 instructions              #    1.61  insns per cycle
       2.214173331 seconds time elapsed
$ echo "scale=4; 2860081604 / (5 * 10^7 * 16)" | bc
3.5751

SSSE3 version (batch of 16, misaligned write)

$ clang++-3.5 prune.cpp -Ofast -mssse3
$ perf stat -e task-clock,cycles,instructions -- ./a.out
alabalanica1234a
 Performance counter stats for './a.out':
       1098.519499 task-clock (msec)         #    0.998 CPUs utilized
     1,457,266,396 cycles                    #    1.327 GHz
     1,053,073,591 instructions              #    0.72  insns per cycle
       1.101240320 seconds time elapsed
$ echo "scale=4; 1457266396 / (5 * 10^7 * 16)" | bc
1.8215

MediaTek MT8163 (Cortex-A53) @ 1.50GHz (sans perf)

Scalar version

$ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast
$ time ./a.out
alabalanica1234
real    0m1.417s
user    0m1.410s
sys     0m0.000s
$ echo "scale=4; 1.417 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
2.6568

ASIMD2 version (batch of 16, misaligned write)

$ ../clang+llvm-3.6.2-aarch64-linux-gnu/bin/clang++ prune.cpp -march=armv8-a -mtune=cortex-a53 -Ofast
$ time ./a.out
alabalanica1234
real    0m0.912s
user    0m0.900s
sys     0m0.000s
$ echo "scale=4; 0.912 * 1.5 * 10^9 / (5 * 10^7 * 16)" | bc
1.7100

Martin Krastev.

Translation: Dmitry Alexandrov

Tags:

Quickly remove spaces from strings on ARM processors - alternative analysis

Also popular now: