Quickly remove spaces from strings on ARM processors
- Transfer
Suppose I gave you a relatively long line, and you want to remove all spaces from it. In ASCII, we can define spaces as a space character ('') and line endings ('\ r' and '\ n'). What interests me most is the algorithm and performance issues, so that we can simplify the task and remove all bytes with values less than or equal to 32.
In the previous article , where I asked about removing spaces for speed, the best answer was to use vectorization using 128-bit registers (SSE4). It turned out to be 5-10 times faster than head-on.
It is very convenient that all processors have 128-bit vector registers, as well as x64 processors. Are ARM processors as fast as x64 processors?
Let's first look at a quick scalar implementation:
It deletes all characters with values less than or equal to 32 and writes data back. It works very fast.
Is it possible to achieve even greater speed with vector instructions?
On x64 processors, the best strategy is to capture 16 bytes of data, quickly compare for empty characters, then extract the mask value (or bitset) created from 16 bits, one bit per character, where each bit corresponds to the value if a blank character is found or not. Such a bitset is quickly calculated on the x64 processor, since there is a special instruction (
So, we cannot process data on ARM like on x86 processors. What can be done?
As SS4 does this, we can quickly check that the byte values are less than or equal to 32, and thus determine the empty characters:
Now we can quickly check any of the 16 characters, it is empty using only two instructions:
This suggests a useful strategy. Instead of comparing characters one at a time, you can compare all 16 characters at once. If none of them is empty, then just copy 16 characters back to the input and move on. Otherwise, we are slipping into a slow scalar approach, but with the additional advantage that we do not need to repeat the comparison here:
The advantages of this approach will be most pronounced if you expect a lot of streams of 16 bytes without empty characters. In principle, this is true for many applications.
I wrote a benchmark in which I tried to estimate how much it takes to remove spaces, one at a time, based on input with a small number of empty characters scattered randomly. Source code is available , but you need an ARM processor to run. I ran it on a 64-bit ARM processor (made from A57 cores) . John Reger has a few more benchmarks on the same machine . It seems to me that the same cores work in the Nintendo Switch.
Technical specifications are not rich . However, the processor operates at a frequency of 1.7 GHz, which everyone can see if they start
In comparison, on the x64 processor, the scalar version uses something like 1.2 cycles per character, and ARM is about half as fast in cycles per character. This was to be expected, because A57 cores are unlikely to compete with x64 in cycle performance. However, when using SS4 on an x64 machine, I managed to achieve a performance of only 0.25 cycles per character, which is more than five times faster than on ARM NEON.
Such a big difference is due to the difference in the algorithms. On x64 processors, we use a combination
There are many nice things about ARM processors. Assembler code is much more elegant than similar code for x86 / x64 processors. Even ARM NEON instructions seem cleaner than SSE / AVX instructions. However, for many tasks, a complete lack of instructions
But maybe I underestimate ARM NEON ... can you complete the task more efficiently than I did?
Note. The article was edited: as one of the commentators noted, on 64-bit ARM processors it is possible to rearrange 16 bits with one instruction.
In the previous article , where I asked about removing spaces for speed, the best answer was to use vectorization using 128-bit registers (SSE4). It turned out to be 5-10 times faster than head-on.
It is very convenient that all processors have 128-bit vector registers, as well as x64 processors. Are ARM processors as fast as x64 processors?
Let's first look at a quick scalar implementation:
size_t i = 0, pos = 0;
while (i < howmany) {
char c = bytes[i++];
bytes[pos] = c;
pos += (c > 32 ? 1 : 0);
}
It deletes all characters with values less than or equal to 32 and writes data back. It works very fast.
Is it possible to achieve even greater speed with vector instructions?
On x64 processors, the best strategy is to capture 16 bytes of data, quickly compare for empty characters, then extract the mask value (or bitset) created from 16 bits, one bit per character, where each bit corresponds to the value if a blank character is found or not. Such a bitset is quickly calculated on the x64 processor, since there is a special instruction (
movemask
) there. On ARM processors, there is no such instruction. You can emulate movemask
with a few instructions. So, we cannot process data on ARM like on x86 processors. What can be done?
As SS4 does this, we can quickly check that the byte values are less than or equal to 32, and thus determine the empty characters:
static inline uint8x16_t is_white(uint8x16_t data) {
const uint8x16_t wchar = vdupq_n_u8(' ');
uint8x16_t isw = vcleq_u8(data, wchar);
return isw;
}
Now we can quickly check any of the 16 characters, it is empty using only two instructions:
static inline uint64_t is_not_zero(uint8x16_t v) {
uint64x2_t v64 = vreinterpretq_u64_u8(v);
uint32x2_t v32 = vqmovn_u64(v64);
uint64x1_t result = vreinterpret_u64_u32(v32);
return result[0];
}
This suggests a useful strategy. Instead of comparing characters one at a time, you can compare all 16 characters at once. If none of them is empty, then just copy 16 characters back to the input and move on. Otherwise, we are slipping into a slow scalar approach, but with the additional advantage that we do not need to repeat the comparison here:
uint8x16_t vecbytes = vld1q_u8((uint8_t *)bytes + i);
uint8x16_t w = is_white(vecbytes);
uint64_t haswhite = is_not_zero(w);
w0 = vaddq_u8(justone, w);
if(!haswhite) {
vst1q_u8((uint8_t *)bytes + pos,vecbytes);
pos += 16;
i += 16;
} else {
for (int k = 0; k < 16; k++) {
bytes[pos] = bytes[i++];
pos += w[k];
}
}
The advantages of this approach will be most pronounced if you expect a lot of streams of 16 bytes without empty characters. In principle, this is true for many applications.
I wrote a benchmark in which I tried to estimate how much it takes to remove spaces, one at a time, based on input with a small number of empty characters scattered randomly. Source code is available , but you need an ARM processor to run. I ran it on a 64-bit ARM processor (made from A57 cores) . John Reger has a few more benchmarks on the same machine . It seems to me that the same cores work in the Nintendo Switch.
scalar | 1.40 ns |
NEON | 1.04 ns |
Technical specifications are not rich . However, the processor operates at a frequency of 1.7 GHz, which everyone can see if they start
perf stat
. Here are how many cycles we need the character we need:scalar | ARM | Recent x64 |
---|---|---|
scalar | 2.4 cycles | 1.2 cycles |
vectorized (NEON and SS4) | 1.8 cycles | 0.25 cycles |
In comparison, on the x64 processor, the scalar version uses something like 1.2 cycles per character, and ARM is about half as fast in cycles per character. This was to be expected, because A57 cores are unlikely to compete with x64 in cycle performance. However, when using SS4 on an x64 machine, I managed to achieve a performance of only 0.25 cycles per character, which is more than five times faster than on ARM NEON.
Such a big difference is due to the difference in the algorithms. On x64 processors, we use a combination
movemask/pshufb
and branchless algorithm with a very small number of instructions. The version for ARM NEON is much weaker.There are many nice things about ARM processors. Assembler code is much more elegant than similar code for x86 / x64 processors. Even ARM NEON instructions seem cleaner than SSE / AVX instructions. However, for many tasks, a complete lack of instructions
movemask
can limit your work. But maybe I underestimate ARM NEON ... can you complete the task more efficiently than I did?
Note. The article was edited: as one of the commentators noted, on 64-bit ARM processors it is possible to rearrange 16 bits with one instruction.