How to speed up LZ4 decompression in ClickHouse

    When you run queries in ClickHouse, you may notice that the profiler often shows the LZ_decompress_fast function near the top. Why does this happen? This question became the starting point of a whole study on choosing the best decompression algorithm. Here I publish the entire study; the short version can be found in my talk at HighLoad++ Siberia.

    ClickHouse data is stored in compressed form. While executing queries, ClickHouse tries to do almost nothing, that is, to use a minimum of CPU resources. It may happen that all the computations that could take time are already well optimized, and the query itself is well written by the user. Then all that remains is the decompression.

    The question is: why can LZ4 decompression be a bottleneck? LZ4 seems to be a very lightweight algorithm: the decompression speed, depending on the data, is usually 1 to 3 GB/s per processor core. This is significantly faster than the disk subsystem. Moreover, we use all available cores, and decompression scales linearly across all physical cores.

    But there are two points to keep in mind. First, compressed data is read from disk, while the decompression speed is quoted in terms of uncompressed data. If the compression ratio is large enough, almost nothing needs to be read from the disks. But a lot of decompressed data is produced, and this naturally affects CPU consumption: in the case of LZ4, the amount of decompression work is almost proportional to the volume of the decompressed data itself.

    Second, reading data from disk may not be needed at all if the data is in a cache. For this we can rely on the page cache or use our own cache. In a column-oriented database, using the cache is more efficient, because only frequently used columns get into it, not all of them. This is why LZ4 often turns out to be the bottleneck in terms of CPU load.

    Hence two more questions. First, if decompression "slows things down", maybe the data should not be compressed at all? In practice this assumption is meaningless. Until recently, ClickHouse offered only two data compression options, LZ4 and Zstandard, with LZ4 as the default. Switching to Zstandard makes compression stronger and slower. But there was no way to disable compression completely: LZ4 is regarded as the reasonable minimum that can always be used. That is why I really love LZ4. :)

    But then a mysterious stranger appeared in the ClickHouse English chat room who said that he has a very fast disk subsystem (NVMe SSD) and everything is limited by decompression: it would be nice to be able to turn it off. I replied that there was no such option, but it was easy to add. A few days later we received a pull request implementing the compression method none. I asked for the results: how much it helped, how much faster the queries became. The person said that this new feature turned out to be useless in practice, since the uncompressed data started taking up too much space.

    The second question: if there is a cache, why not store already decompressed data in it? This is allowed, and in many cases it eliminates the need for decompression. ClickHouse does have such a cache: a cache of decompressed blocks. But it is a pity to spend a lot of RAM on it because of its low efficiency. It justifies itself only on small, consecutive queries that use almost the same data.

    General considerations: data should be compressed, preferably always. Always store it on disk compressed. Transmit it over the network with compression too. In my opinion, compression by default should be considered justified even when transferring within a data center over a 10-gigabit network without oversubscription, and transferring data between data centers without compression is simply unacceptable.

    Why LZ4?

    Why is LZ4 used? Is it possible to choose something even lighter? In principle it is possible, and that is right and useful. But let's first look at what class of algorithms LZ4 belongs to.

    First, it does not depend on the data type. For example, if you know in advance that you will have an array of integers, you can use one of the many variants of the VarInt algorithm, which will be more efficient on the CPU. Second, LZ4 does not rely too heavily on assumptions about the data model. Suppose you have an ordered time series of sensor readings, an array of float numbers. Then you can compute the deltas and compress them further, and this will give a better compression ratio.

    That is, LZ4 can be used without problems for arbitrary byte arrays, for any files. Of course, it has its own specialization (more on that below), and in some cases its use is pointless. But calling it a general-purpose algorithm would only be a small mistake. And note that, thanks to its internal structure, LZ4 automatically implements the RLE algorithm as a special case.

    Another question: is LZ4 the optimal algorithm of this class in terms of combined speed and compression strength? Algorithms like that are said to lie on the Pareto frontier: no other algorithm is strictly better in one respect and no worse in the others (and that on a wide variety of datasets, too). There are algorithms that are faster but give a lower compression ratio, and there are those that compress better but are slower at compression or decompression.

    In fact, LZ4 is not on the Pareto frontier. There are options that are slightly better. For example, LZTURBO from a certain powturbo. There is no doubt about the reliability of the results, thanks to the community on the largest (and practically the only) forum on data compression. But the developer does not distribute the source code or binaries; he only gives them to a limited circle of people for testing, or for a lot of money (it seems no one has paid so far). Also worth a look are Lizard (formerly LZ5) and Density. They can work slightly better than LZ4 at certain compression levels. Also pay attention to LZSSE, an extremely interesting thing. However, it is better to look at it after reading this article.

    How does LZ4 work?

    Let's look at how LZ4 works in general. It is one of the implementations of the LZ77 algorithm: L and Z stand for the authors' names (Lempel and Ziv), and 77 for the year 1977, when the algorithm was published. It has many other implementations: QuickLZ, FastLZ, BriefLZ, LZF, LZO, as well as gzip and zip at low compression levels.

    A data block compressed using LZ4 contains a sequence of records (commands, instructions) of two types:

    1. Literal: "take the next N bytes as is and copy them to the result."
    2. Match: "take N bytes that were already decompressed, at a distance of offset bytes back from the current position."

    Example. Before compression:
    Hello world Hello

    After compression:
    literals 12 "Hello world " match 5 12

    If we take a compressed block and move a cursor through it, executing these commands, we obtain the original, uncompressed data as a result.
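To make the scheme concrete, here is a toy decoder for these two kinds of records. This is an illustrative sketch of mine only; the real LZ4 byte layout is described below.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

/// One record of the toy format: either a literal or a match.
struct Record
{
    bool is_match;       /// false: literal, true: match
    size_t length;       /// how many bytes to produce
    size_t offset;       /// for matches: distance back from the current position
    std::string literal; /// for literals: the bytes themselves
};

std::string decode(const std::vector<Record> & records)
{
    std::string out;
    for (const auto & r : records)
    {
        if (!r.is_match)
            out += r.literal;
        else
            /// Byte-by-byte copy: the source range may overlap the destination,
            /// which is exactly how matches longer than their offset work.
            for (size_t i = 0; i < r.length; ++i)
                out += out[out.size() - r.offset];
    }
    return out;
}
```

Feeding it the example above, a literal of 12 bytes followed by a match of length 5 at offset 12, reproduces the original string.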

    We have roughly seen how the data is decompressed. The idea of compression is also clear: to compress, the algorithm encodes repeated byte sequences using matches.

    Some properties are also clear. This algorithm is byte-oriented: it does not dissect individual bytes, it only copies them whole. This is what distinguishes it from, for example, entropy coding. For instance, zstd is a composition of LZ77 and entropy coding.

    Note that the size of a compressed block is not chosen too large, so as not to spend a lot of RAM during decompression, not to slow down random access within a compressed file (which consists of many compressed blocks), and sometimes so that the block fits in some CPU cache. For example, you can choose 64 KB: then the buffers for compressed and uncompressed data will fit into the L2 cache with half of it still free.

    If we need to compress a larger file, we simply concatenate compressed blocks. It is also convenient to place additional data next to each compressed block: sizes, a checksum.

    The maximum offset of a match is limited; in LZ4 it is 64 kilobytes. This value is called the sliding window. Indeed, it means that as the cursor moves forward, matches may only refer to data within a 64-kilobyte window behind the cursor, a window that slides along with it.

    Now let's look at how to compress data, in other words, how to find matching sequences in a file. Of course, you can use a suffix trie (great if you've heard of it). There are methods that guarantee that the longest matching sequence among the preceding bytes is found during compression. This is called optimal parsing, and it gives almost the best compression ratio for a fixed compressed block format. But there are more efficient approaches: finding a match in the data that is good enough, but not necessarily the longest. The most efficient way to find it is to use a hash table.

    To do this, we move a cursor through the source data block and take a few bytes after the cursor, for example, 4 bytes. We hash them and put into the hash table the offset from the beginning of the block where these 4 bytes were encountered. The value 4 is called the min-match: with such a hash table we can find matches of at least 4 bytes.

    If we look into the hash table and there is already an entry there, and the offset does not exceed the sliding window, we check how many more bytes match after those four. Perhaps much more coincides. It is also possible that a collision occurred in the hash table and nothing matches. This is normal: we can simply replace the value in the hash table with the new one. Collisions in the hash table simply lead to a lower compression ratio, since fewer matches are found. By the way, this kind of hash table (of fixed size and without collision resolution) is called a cache table. This is logical, too: in the event of a collision, the cache table simply forgets the old entry.
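The scheme above can be sketched as follows. This is my illustrative code with hypothetical helper names, not the actual LZ4 source: a 4-byte min-match, a 4096-slot cache table, and no sliding-window check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

/// Hash 4 bytes into 12 bits (4096 slots) with a multiplicative hash.
inline uint32_t hash4(const uint8_t * p)
{
    uint32_t value;
    memcpy(&value, p, 4);
    return (value * 2654435761u) >> 20;
}

/// Returns (position, earlier position) pairs where at least 4 bytes coincide.
std::vector<std::pair<size_t, size_t>> findMatches(const uint8_t * data, size_t size)
{
    std::vector<ptrdiff_t> table(1 << 12, -1); /// "cache table": no collision resolution
    std::vector<std::pair<size_t, size_t>> matches;
    for (size_t pos = 0; pos + 4 <= size; ++pos)
    {
        uint32_t h = hash4(data + pos);
        ptrdiff_t candidate = table[h];
        /// The slot may hold a colliding entry, so verify the bytes really match.
        if (candidate >= 0 && memcmp(data + candidate, data + pos, 4) == 0)
            matches.emplace_back(pos, static_cast<size_t>(candidate));
        table[h] = static_cast<ptrdiff_t>(pos); /// On collision, just forget the old entry.
    }
    return matches;
}
```

On "Hello world Hello", for example, it finds that the second "Hell" coincides with bytes 12 positions earlier.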
    A task for the attentive reader. Suppose the data is an array of UInt32 numbers in little-endian format that forms part of the sequence of natural numbers: 0, 1, 2... Explain why, with LZ4, this data is not compressed (the amount of compressed data is no smaller than the amount of uncompressed data).

    How to speed things up

    So, I want to speed up LZ4 decompression. Let's see what the decompression loop looks like. Here it is in pseudocode:

    while (...)
    {
        read(input_pos, literal_length, match_length);

        copy(output_pos, input_pos, literal_length);
        output_pos += literal_length;

        read(input_pos, match_offset);

        copy(output_pos, output_pos - match_offset, match_length);
        output_pos += match_length;
    }

    The LZ4 format is designed so that literals and matches alternate in a compressed file. Obviously, a literal always comes first (because at the very beginning a match has nowhere to take data from). Therefore their lengths are encoded together.

    In fact, it is a little more complicated. One byte is read from the file, and two nibbles are taken from it, containing numbers from 0 to 15. If the corresponding number is not 15, it is taken as the length of the literal and of the match, respectively. If it is 15, the length is larger and is encoded in the following bytes: the next byte is read and its value added to the length; if it equals 255, we continue and read the next byte in the same way.
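As a sketch, the length decoding just described might look like this (a hypothetical helper of mine, not the actual LZ4 source):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

/// `nibble` is one of the two 4-bit values taken from the token byte;
/// `pos` points at the bytes that may extend the length.
size_t readLength(const uint8_t *& pos, size_t nibble)
{
    size_t length = nibble;
    if (nibble == 15) /// 15 means: the length continues in the following bytes.
    {
        uint8_t next;
        do
        {
            next = *pos++;
            length += next;
        } while (next == 255); /// 255 means: keep reading.
    }
    return length;
}
```

For example, a nibble of 15 followed by bytes 255 and 3 decodes to a length of 15 + 255 + 3 = 273, while a nibble below 15 consumes no extra bytes.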

    Note that the maximum compression ratio for the LZ4 format does not reach 255. And a second (useless) observation: if your data is very redundant, applying LZ4 twice will improve the compression ratio.

    Once we have read the length of the literal (and then the length of the match and the match offset), decompressing is simply a matter of copying two blocks of memory.

    How to copy a piece of memory

    It would seem that you could use the memcpy function, which is designed precisely for copying pieces of memory. But this is suboptimal and still incorrect.

    Why is using memcpy suboptimal? Because it:

    1. is usually located in the libc library (and libc usually links dynamically, so the memcpy call goes indirectly, via the PLT),
    2. is not inlined when the size argument is unknown at compile time,
    3. spends a lot of effort correctly handling the "tails" of a memory fragment whose size is not a multiple of the machine word or register size.

    The last point is the most important. Suppose we asked memcpy to copy exactly 5 bytes. It would be very good to copy 8 bytes at once, using two movq instructions. But then we would copy three extra bytes, writing beyond the bounds of the given buffer. The memcpy function has no right to do this: indeed, it would overwrite some data in our program, a memory stomp. And if we wrote to an unaligned address, those extra bytes could land on an unmapped virtual memory page or on a page without write access. Then we would get a segfault (which is good).

    Hello world Hello wo...
    ^^^^^^^^ - src
                ^^^^^^^^ - dst


    But in our case, we can almost always write extra bytes. We may read extra bytes from the input buffer as long as they lie entirely within it. Under the same condition, we may write extra bytes to the output buffer, because on the next iteration we will overwrite them anyway.

    This optimization is already in the original LZ4 implementation:

    inline void copy8(UInt8 * dst, const UInt8 * src)
    {
        memcpy(dst, src, 8); /// Actually, memcpy is not called here.
    }

    inline void wildCopy8(UInt8 * dst, const UInt8 * src, UInt8 * dst_end)
    {
        do
        {
            copy8(dst, src);
            dst += 8;
            src += 8;
        } while (dst < dst_end);
    }

    To take advantage of this optimization, we only need to check that we are far enough from the buffer boundary. This should be free, since we already check for buffer overrun. And the last few bytes, the "tail" of the data, can be processed after the main loop.

    However, there are some subtleties. There are two copies in the loop: a literal and a match. But when using the LZ4_decompress_fast function (instead of LZ4_decompress_safe), the check is performed only once, when we copy the literal. When copying a match, no check is performed; instead, the LZ4 format specification contains conditions that make it avoidable:

    The last 5 bytes are always literals
    The last match must start at least 12 bytes before the end of the block.
    Consequently, a block of fewer than 13 bytes cannot be compressed.

    Specially crafted input data can cause memory corruption. So if you use the LZ4_decompress_fast function, you need protection against bad data. At the very least, compressed data should carry a checksum. If you need protection against an attacker, use the LZ4_decompress_safe function. Other options: take a cryptographic hash function as the checksum (though this will almost certainly kill performance); allocate more memory for the buffers; or allocate buffer memory with a separate mmap call and create a guard page.

    When I see code that copies 8 bytes of data, I immediately ask: why exactly 8 bytes? You can copy 16 bytes using SSE registers:

    inline void copy16(UInt8 * dst, const UInt8 * src)
    {
    #if __SSE2__
        _mm_storeu_si128(reinterpret_cast<__m128i *>(dst),
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
    #else
        memcpy(dst, src, 16);
    #endif
    }

    inline void wildCopy16(UInt8 * dst, const UInt8 * src, UInt8 * dst_end)
    {
        do
        {
            copy16(dst, src);
            dst += 16;
            src += 16;
        } while (dst < dst_end);
    }

    Copying 32 bytes with AVX and 64 bytes with AVX-512 works similarly. In addition, you can unroll the loop several times. If you have ever looked at how memcpy is implemented, this is exactly the approach used there. (By the way, in this case the compiler will neither unroll nor vectorize the loop: that would require inserting cumbersome checks.)
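For illustration, a 32-byte analogue might look like this. This is my sketch under the same wild-copy assumptions, not code from LZ4 or ClickHouse; it falls back to memcpy on targets without AVX2.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

inline void copy32(uint8_t * dst, const uint8_t * src)
{
#if defined(__AVX2__)
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst),
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src)));
#else
    memcpy(dst, src, 32);
#endif
}

/// As with wildCopy8: both buffers must have enough slack for the overrun.
inline void wildCopy32(uint8_t * dst, const uint8_t * src, const uint8_t * dst_end)
{
    do
    {
        copy32(dst, src);
        dst += 32;
        src += 32;
    } while (dst < dst_end);
}
```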

    Why is this not done in the original LZ4 implementation? First, it is not obvious whether it is better or worse. The result depends on the sizes of the fragments to be copied. What if they are all short and the extra work is wasted? And second, it destroys those conditions in the LZ4 format that allow avoiding an extra branch in the inner loop.

    Nevertheless, we will keep this option in mind for now.

    Tricky copy

    Back to the question: is it always possible to copy data this way? Suppose we need to copy a match, that is, copy a piece of memory from the output buffer, located at some offset behind the cursor, to the cursor position.

    Imagine a simple case: we need to copy 5 bytes at an offset of 12:

    Hello world ...........
    ^^^^^ - src
                ^^^^^ - dst

    Hello world Hello wo...
    ^^^^^ - src
                ^^^^^ - dst

    But there is a more complicated case: when we need to copy a piece of memory that is longer than the offset. That is, it partially refers to data that has not yet been written to the output buffer. We copy 10 bytes at an offset of 3:

    ^^^^^^^^^^ - src
       ^^^^^^^^^^ - dst

    During compression we have all the data, and such a match may well be found. The memcpy function is not suitable for copying it: it does not support the case when the memory ranges overlap. Incidentally, the memmove function does not fit either, because the memory fragment from which the data is taken is not yet fully initialized. We need to copy as if byte by byte:

    op[0] = match[0];
    op[1] = match[1];
    op[2] = match[2];
    op[3] = match[3];

    Here's how it works: copying byte by byte, we create a repeating sequence:

    ^ - src
       ^ - dst

     ^ - src
        ^ - dst

      ^ - src
         ^ - dst

       ^ - src
          ^ - dst

        ^ - src
           ^ - dst

    In the original LZ4 implementation, surprisingly incomprehensible code was written for this:

    const unsigned dec32table[] = {0, 1, 2, 1, 4, 4, 4, 4};
    const int dec64table[] = {0, 0, 0, -1, 0, 1, 2, 3};

    const int dec64 = dec64table[offset];
    op[0] = match[0];
    op[1] = match[1];
    op[2] = match[2];
    op[3] = match[3];
    match += dec32table[offset];
    memcpy(op + 4, match, 4);
    match -= dec64;

    We copy the first 4 bytes byte by byte, shift the match pointer by some magic number, copy the next 4 bytes whole, and shift the match pointer by another magic number. The author of this code (Yann Collet), for some ridiculous reason, forgot to leave a comment about what this means. Moreover, the variable names are confusing: both are called dec...table, but we add one of them and subtract the other. In addition, one is unsigned and the other is int. To give credit where it is due, the author recently improved this place in the code.

    Here is how it actually works. We copy the first 4 bytes byte by byte:

    ^^^^ - src
       ^^^^ - dst

    Now we can copy 4 bytes at once:

     ^^^^ - src
           ^^^^ - dst

    And we can continue as usual, copying 8 bytes at once:

      ^^^^^^^^ - src
               ^^^^^^^^ - dst

    As we know from experience, sometimes the best way to understand code is to rewrite it. Here is what came out:

    inline void copyOverlap8(UInt8 * op, const UInt8 *& match, const size_t offset)
    {
        /// 4 % n.
        /// Or if 4 % n is zero, we use n.
        /// It gives equivalent result, but is better CPU friendly for unknown reason.
        static constexpr int shift1[] = {0, 1, 2, 1, 4, 4, 4, 4};

        /// 8 % n - 4 % n
        static constexpr int shift2[] = {0, 0, 0, 1, 0, -1, -2, -3};

        op[0] = match[0];
        op[1] = match[1];
        op[2] = match[2];
        op[3] = match[3];

        match += shift1[offset];
        memcpy(op + 4, match, 4);
        match += shift2[offset];
    }
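The tables can be checked against their comments with a few lines of my own verification code (not from LZ4 or ClickHouse): after the four byte copies, match should point to pattern position 4 mod n, and after the next 4-byte copy, to position 8 mod n, where positions are equivalent modulo the pattern length n.

```cpp
#include <cassert>

/// What shift1[n] should be: 4 % n, or n itself when 4 % n == 0
/// (equivalent modulo the pattern length n).
inline int expectedShift1(int n)
{
    int s = 4 % n;
    return s == 0 ? n : s;
}

/// What shift2[n] should be: 8 % n - 4 % n.
inline int expectedShift2(int n)
{
    return 8 % n - 4 % n;
}
```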

    Performance, of course, did not change at all. However, I really wanted to try the optimization where regular copying is done 16 bytes at a time.

    But this complicates the "special case" and causes it to be triggered more often (the condition offset < 16 is met at least as often as offset < 8). Copying the (beginning of) overlapping ranges for 16-byte copying looks like this:

    inline void copyOverlap16(UInt8 * op, const UInt8 *& match, const size_t offset)
    {
        /// 4 % n.
        static constexpr int shift1[]
            = {0, 1, 2, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};

        /// 8 % n - 4 % n
        static constexpr int shift2[]
            = {0, 0, 0, 1, 0, -1, -2, -3, -4, 4, 4, 4, 4, 4, 4, 4};

        /// 16 % n - 8 % n
        static constexpr int shift3[]
            = {0, 0, 0, -1, 0, -2, 2, 1, 8, -1, -2, -3, -4, -5, -6, -7};

        op[0] = match[0];
        op[1] = match[1];
        op[2] = match[2];
        op[3] = match[3];

        match += shift1[offset];
        memcpy(op + 4, match, 4);
        match += shift2[offset];
        memcpy(op + 8, match, 8);
        match += shift3[offset];
    }

    Can this function be implemented more optimally? We would like to find a magic SIMD instruction for such complex code: after all, we just want to write 16 bytes that consist entirely of a few bytes of input data (from 1 to 15), repeated in the correct order.

    There is such an instruction: pshufb (from the words packed shuffle bytes), part of SSSE3 (three letters S). It takes two 16-byte registers. One contains the source data. The other holds a "selector": each byte contains a number from 0 to 15, indicating which byte of the source register to take the corresponding result byte from. If the selector byte's value is greater than 127, the corresponding result byte is filled with zero.

    Here is an example:

    xmm0: abc.............
    xmm1: 0120120120120120
    pshufb %xmm1, %xmm0
    xmm0: abcabcabcabcabca
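This example can be checked with the _mm_shuffle_epi8 intrinsic. Here is a small demonstration of mine (not from the article's final code), with a scalar fallback for targets without SSSE3:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#ifdef __SSSE3__
#include <tmmintrin.h> /// SSSE3: _mm_shuffle_epi8 compiles to pshufb
#endif

/// Fill 16 bytes of dst by repeating the first 3 bytes of src,
/// like the xmm0/xmm1 example above. src must be 16 bytes readable.
inline void repeatPattern3(uint8_t * dst, const uint8_t * src)
{
#ifdef __SSSE3__
    const __m128i selector = _mm_setr_epi8(0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(dst),
        _mm_shuffle_epi8(
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)),
            selector));
#else
    for (size_t i = 0; i < 16; ++i)
        dst[i] = src[i % 3];
#endif
}
```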

    We fill every byte of the result with whichever source byte we choose: just what we need! Here is the resulting code:

    inline void copyOverlap16Shuffle(UInt8 * op, const UInt8 *& match, const size_t offset)
    {
    #ifdef __SSSE3__
        static constexpr UInt8 __attribute__((__aligned__(16))) masks[] =
        {
            0, 1, 2, 1, 4, 1, 4, 2, 8, 7, 6, 5, 4, 3, 2, 1, /* offset = 0, not used as mask, but for shift amount instead */
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* offset = 1 */
            0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
            0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0,
            0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
            0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0,
            0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3,
            0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
            0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7,
            0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6,
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5,
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4,
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3,
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0, 1, 2,
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 1,
            0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0,
        };

        _mm_storeu_si128(reinterpret_cast<__m128i *>(op),
            _mm_shuffle_epi8(
                _mm_loadu_si128(reinterpret_cast<const __m128i *>(match)),
                _mm_load_si128(reinterpret_cast<const __m128i *>(masks) + offset)));

        match += masks[offset];
    #else
        copyOverlap16(op, match, offset);
    #endif
    }

    Here _mm_shuffle_epi8 is an intrinsic that compiles to the pshufb instruction.

    Can we do this operation for more bytes at once using newer instructions? SSSE3 is a very old instruction set, available since 2006. AVX2 has an instruction that does this for 32 bytes at once, but only independently within individual 16-byte lanes. It is no longer called packed shuffle bytes but vector permute bytes: the words differ, but the meaning is the same. AVX-512 VBMI has another instruction that works on 64 bytes at once, but processors supporting it have appeared only recently. ARM NEON has similar instructions called vtbl (vector table lookup), but they only allow writing 8 bytes.

    In addition, there is a variant of the pshufb instruction with 64-bit MMX registers, which forms 8 bytes. It would be just right for replacing the original code. However, I decided to use the option that works with 16 bytes instead (for serious reasons).

    At the HighLoad++ Siberia conference, after my talk, a listener came up and said that for the 8-byte case one can simply use multiplication by a specially chosen constant (a shift will also be needed). I had not even thought of that before!
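I can only guess what the listener had in mind; one possible reconstruction (my own sketch, not verified against his proposal) is that the repeating 8-byte pattern can be produced by a single multiplication: take the pattern as a little-endian 64-bit number and multiply it by a constant with a bit set at every multiple of 8·n bits. The shifted copies land on disjoint bytes, so no carries occur.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

/// Repeat a pattern of n bytes (1 <= n <= 8) across 8 bytes
/// using one multiplication. Assumes a little-endian machine.
inline uint64_t repeat8(const uint8_t * pattern, size_t n)
{
    uint64_t value = 0;
    memcpy(&value, pattern, n); /// the low n bytes hold the pattern

    uint64_t multiplier = 0;
    for (size_t shift = 0; shift < 64; shift += 8 * n)
        multiplier |= 1ULL << shift;

    return value * multiplier; /// copies at byte positions 0, n, 2n, ... without carries
}
```

For a 3-byte pattern "abc" this yields the bytes "abcabcab" in one multiplication.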

    How to remove an extra if

    Suppose I want to use the variant that copies 16 bytes at a time. How can we avoid the extra check for buffer overflow?

    I decided simply not to do this check at all. The comment on the function will say that the developer should allocate a piece of memory a specified number of bytes larger than required, so that we can read and write unneeded garbage there. The function's interface becomes less convenient, but that is another problem.

    However, negative consequences may follow. Suppose the data we need to decompress was formed from blocks of 65,536 bytes each. Then the user gives us a piece of memory of 65,536 bytes for the decompressed data. But with the new function interface, the user will be required to allocate, for example, 65,551 bytes. Then the allocator may quite possibly be forced to actually allocate 96 or even 128 kilobytes, depending on its implementation. If the allocator is very bad, it may suddenly stop caching that memory region and start using mmap for each allocation (or freeing the memory with madvise). Such a process will be terribly slow because of page faults. As a result, a small optimization attempt could make everything slower.

    Is there any acceleration?

    So, I made a version of the code in which three optimizations are applied:

    1. Copy 16 bytes instead of 8.
    2. The shuffle instruction is used for the case offset < 16.
    3. One extra if is removed.

    I began testing this code on a wide variety of data sets and got unexpected results.

    Example 1:
    Xeon E2650v2, Yandex.Browser data, AppVersion column.
    reference: 1.67 GB/sec.
    16 bytes, shuffle: 2.94 GB/sec (76% faster).

    Example 2:
    Xeon E2650v2, Yandex.Direct data, ShowsSumPosition column.
    reference: 2.30 GB/sec.
    16 bytes, shuffle: 1.91 GB/sec (20% slower).

    At first I saw everything speed up by tens of percent and was happy. Then on other files I saw no speedup at all. On some, the speed even dropped, though not by much. I concluded that the results depend on the compression ratio: the more compressed the file, the greater the advantage of switching to 16 bytes. This is natural: the greater the compression ratio, the greater the average length of the fragments to be copied.

    To understand it better, I used C++ templates to make code variants for four cases: operating with 8- or 16-byte pieces; with or without the shuffle instruction.

    template <size_t copy_amount, bool use_shuffle>
    void NO_INLINE decompressImpl(
        const char * const source,
        char * const dest,
        size_t dest_size)

    On different files, completely different code variants won, but when testing on my work computer the shuffle variant always won. Testing on a work computer is inconvenient; you have to do this:

    sudo echo 'performance' | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    kill -STOP $(pidof firefox) $(pidof chromium)

    Then I went to one of the old "development" servers (with a Xeon E5645 processor), took even more datasets, and got almost diametrically opposite results, which completely confused me. It turned out that the choice of the optimal algorithm depends not only on the compression ratio but also on the processor model. The processor determines when it is better to use the shuffle instruction, as well as the threshold beyond which 16-byte copies are better.

    By the way, when testing on servers it makes sense to do this:

    sudo kill -STOP $(pidof python) $(pidof perl) $(pgrep -u skynet) $(pidof cqudp-client)

    Otherwise the results will be unstable. Also keep an eye on thermal throttling and power capping.

    How to choose the best algorithm

    So, we have four variants of the algorithm and need to choose the best one for the conditions. We could create a representative set of data and hardware, then run serious load tests and choose the method that is best on average. But we do not have a representative dataset. For testing I took a sample of data from Yandex.Metrica, Direct, Browser, and flights in the USA. But that is not enough: ClickHouse is used by hundreds of companies around the world, and by over-optimizing on one dataset we might miss a performance drop on other data. And if the results depend on the processor model, we would have to explicitly encode the conditions and test on each model (or consult reference tables of instruction timings; what do you think?). Either way, this is too time-consuming.

    So I decided to use a different method, obvious to colleagues who did not study at the ShAD (Yandex School of Data Analysis) in vain: "multi-armed bandits". The idea is that a variant of the algorithm is chosen randomly, and then, based on statistics, we start choosing the variants that perform well more often.

    We have many data blocks that need to be decompressed, so we have many independent calls to the decompression function. We can choose one of the four algorithms for each block and measure its running time. Such a measurement usually costs nothing compared to processing a data block, and in ClickHouse a block of uncompressed data is at least 64 KB. (Read an article about measuring time.)

    To understand how the "multi-armed bandit" method works, let's find out where the name comes from. It is an analogy with slot machines in a casino that have several levers a player can pull to get a random amount of money. The player can pull the levers many times, in any order. Each lever has a fixed corresponding probability distribution of the payout, but the player does not know it and can only estimate it from the experience of playing. Once he does, he can maximize his winnings.

    One approach to maximizing the winnings is to estimate, at each step, the probability distribution for each lever based on the statistics of previous steps. Then we mentally "play out" a random win for each lever, based on the estimated distributions. Finally, we pull the lever whose imagined outcome turned out best. This approach is called Thompson sampling.
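As an illustration of Thompson sampling, here is a simplified sketch of mine (not the ClickHouse code, a fragment of which appears later): each variant, or "arm", keeps a running mean of its observed times; to choose, we draw one sample from a normal distribution around each mean and pick the arm with the smallest draw, since we minimize time.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

/// Statistics for one algorithm variant ("arm").
struct Arm
{
    double sum = 0;   /// total observed time
    size_t count = 0; /// number of observations

    double mean() const { return sum / count; }

    /// Deliberately rough spread estimate: wide while count is small.
    double sigma() const { return mean() / std::sqrt(static_cast<double>(count)); }

    double sample(std::mt19937_64 & rng) const
    {
        if (count == 0)
            return -1e9; /// force unexplored arms to be tried first
        return std::normal_distribution<>(mean(), sigma())(rng);
    }
};

/// Thompson sampling: draw one imagined outcome per arm, pick the smallest.
size_t pickArm(const std::vector<Arm> & arms, std::mt19937_64 & rng)
{
    size_t best = 0;
    double best_value = arms[0].sample(rng);
    for (size_t i = 1; i < arms.size(); ++i)
    {
        double value = arms[i].sample(rng);
        if (value < best_value)
        {
            best_value = value;
            best = i;
        }
    }
    return best;
}
```

After each block, the measured time is added to the chosen arm's sum and count; over time the faster variant gets chosen almost always.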

    We, in turn, must choose a decompression algorithm. The result is the running time in picoseconds per byte: the less, the better. We will treat the running time as a random variable and estimate its distribution using methods of mathematical statistics. The Bayesian approach is often used for such tasks, but putting complex formulas into C++ code would be inefficient. We can use a parametric approach: assume that the random variable belongs to some family of distributions that depend on parameters, and then estimate those parameters.

    How to choose the family of distributions? For example, we could assume that the code execution time is normally distributed. But this is plainly wrong. First, the running time cannot be negative, while the normal distribution takes values on the whole real line. Second, the running time, I assume, has a heavy tail on the right.

    However, there are reasons why using a normal-distribution estimate purely for the purposes of Thompson Sampling is a good idea (even though the true distribution is obviously not normal). The required estimates of expectation and variance are very simple to compute, and after a sufficient number of iterations the normal distribution becomes rather narrow, not very different from the distributions we would have obtained by other methods. If we do not care much about the rate of convergence in the first steps, these details can be neglected.

    On the one hand, this is a somewhat "ignorant" approach. It is known from experience that the average time to execute a query, load a page, and so on is "garbage" that is not worth computing. It would be better to compute the median - a robust statistic. But that is somewhat more complicated, and, as I will show later, for practical purposes the described method justifies itself.

    At first I implemented the calculation of expectation and variance properly, but then decided it was too good, and that the code needed to be simplified to make it "worse":

    /// For better convergence, we don't use a proper estimate of stddev.
    /// We want to eventually separate two algorithms even when there is
    /// no statistically significant difference between them.
    double sigma() const
    {
        return mean() / sqrt(adjustedCount());
    }

    double sample(pcg64 & rng) const
    {
        return std::normal_distribution<>(mean(), sigma())(rng);
    }

    I arranged the code so that the first few iterations are not taken into account, to exclude the effect of memory latencies.

    The result is a test program that can itself select the best algorithm for the input data, and that can also run the reference LZ4 implementation or a fixed variant of the algorithm as additional operating modes.

    Thus, there are six working options:
    - reference (baseline): the original LZ4 without our modifications;
    - variant 0: copy 8 bytes, do not use shuffle;
    - variant 1: copy 8 bytes each, use shuffle;
    - variant 2: copy 16 bytes each, do not use shuffle;
    - variant 3: copy 16 bytes each, use shuffle;
    - the "bandit" option, which itself chooses the best of the four optimized variants above as it works.

    Testing on different CPUs

    If the result depends strongly on the CPU model, it would be interesting to find out exactly how. Maybe on some CPUs the difference is especially large?

    I prepared a set of datasets from different ClickHouse tables with real data: 256 different files in total, each with 100 MB of uncompressed data (the number 256 is a coincidence). Then I looked at what CPUs are in the servers on which I can run benchmarks. I found servers with the following CPUs:
    - Intel® Xeon® CPU E5-2650 v2 @ 2.60GHz
    - Intel® Xeon® CPU E5-2660 v4 @ 2.00GHz
    - Intel® Xeon® CPU E5-2660 0 @ 2.20GHz
    - Intel® Xeon® CPU E5645 @ 2.40GHz
    - Intel Xeon E312xx (Sandy Bridge)
    - AMD Opteron (TM) Processor 6274
    - AMD Opteron (tm) Processor 6380
    - Intel® Xeon® CPU E5-2683 v4 @ 2.10GHz
    - Intel® Xeon® CPU E5530 @ 2.40GHz
    - Intel® Xeon® CPU E5440 @ 2.83GHz
    - Intel® Xeon® CPU E5-2667 v2 @ 3.30GHz

    The most interesting are the processors provided by the R&D department:
    - AMD EPYC 7351 16-Core Processor - a new AMD server processor.
    - Cavium ThunderX2 - not x86 at all, but AArch64; my SIMD optimizations required some rework for it. The server reports 224 logical and 56 physical cores.

    There are 13 servers in total; on each, the test runs on 256 files in 6 variants (reference, 0, 1, 2, 3, adaptive), and each test is run 10 times, interleaving the variants. That gives 199,680 results, which we can compare.

    For example, we can compare different CPUs with each other. But do not draw hasty conclusions from these results: we only test the LZ4 decompression algorithm on a single core (a very narrow case - essentially a microbenchmark). For example, Cavium has the weakest core; but when I tested ClickHouse on it, it beat the Xeon E5-2650 v2 on heavy queries thanks to its larger number of cores, despite lacking many of the optimizations that ClickHouse has only for x86.

    ┌─cpu───────────────────┬──ref─┬─adapt─┬──max─┬─best─┬─adapt_boost─┬─max_boost─┬─adapt_over_max─┐
    │ E5-2667 v2 @ 3.30GHz  │ 2.81 │  3.19 │ 3.15 │    3 │        1.14 │      1.12 │           1.01 │
    │ E5-2650 v2 @ 2.60GHz  │ 2.5  │  2.84 │ 2.81 │    3 │        1.14 │      1.12 │           1.01 │
    │ E5-2683 v4 @ 2.10GHz  │ 2.26 │  2.63 │ 2.59 │    3 │        1.16 │      1.15 │           1.02 │
    │ E5-2660 v4 @ 2.00GHz  │ 2.15 │  2.49 │ 2.46 │    3 │        1.16 │      1.14 │           1.01 │
    │ AMD EPYC 7351         │ 2.03 │  2.44 │ 2.35 │    3 │        1.20 │      1.16 │           1.04 │
    │ E5-2660 0 @ 2.20GHz   │ 2.13 │  2.39 │ 2.37 │    3 │        1.12 │      1.11 │           1.01 │
    │ E312xx (Sandy Bridge) │ 1.97 │  2.2  │ 2.18 │    3 │        1.12 │      1.11 │           1.01 │
    │ E5530 @ 2.40GHz       │ 1.65 │  1.93 │ 1.94 │    3 │        1.17 │      1.18 │           0.99 │
    │ E5645 @ 2.40GHz       │ 1.65 │  1.92 │ 1.94 │    3 │        1.16 │      1.18 │           0.99 │
    │ AMD Opteron 6380      │ 1.47 │  1.58 │ 1.56 │    1 │        1.07 │      1.06 │           1.01 │
    │ AMD Opteron 6274      │ 1.15 │  1.35 │ 1.35 │    1 │        1.17 │      1.17 │              1 │
    │ E5440 @ 2.83GHz       │ 1.35 │  1.33 │ 1.42 │    1 │        0.99 │      1.05 │           0.94 │
    │ Cavium ThunderX2      │ 0.84 │  0.87 │ 0.87 │    0 │        1.04 │      1.04 │              1 │

    Here ref, adapt, max are speeds in gigabytes per second (the reciprocal of the arithmetic mean time over all runs on all datasets); best is the number of the best algorithm among the optimized variants, from 0 to 3; adapt_boost is the relative advantage of the adaptive algorithm over the baseline; max_boost is the relative advantage of the best non-adaptive variant over the baseline; adapt_over_max is the relative advantage of the adaptive algorithm over the best non-adaptive one.

    As you can see, on modern x86 processors we managed to speed up decompression by 12-20%. Even on ARM we gained 4%, despite paying less attention to optimizing for that architecture. It is also noticeable that, averaged over the various datasets, the "bandit" algorithm outperforms the statically chosen best variant on all processors except very old Intel ones.


    In practice, the usefulness of this work is debatable. Yes, LZ4 decompression itself was accelerated by 12-20% on average, and on some datasets the performance more than doubles. But overall, this affects query execution time much less. It is not so easy to find real queries that gain more than a couple of percent in speed.

    Keep in mind that on several Metrica clusters intended for "long" queries, we decided to use ZStandard level 1 instead of LZ4: for cold data, it is more important to save IO and disk space.

    The greatest benefit from decompression optimization is seen on highly compressible data - for example, on columns with repeating string values. But specifically for that scenario we developed a separate solution that speeds up such queries by orders of magnitude.

    Another useful consideration: decompression speed is often limited by the format of the compressed data. LZ4 uses a very good format, but Lizard, Density and LZSSE use other formats that can work faster. Perhaps, instead of trying to speed up LZ4, it would be better simply to integrate LZSSE into ClickHouse.

    Getting these optimizations into the LZ4 library itself is unlikely: to use them, the library interface would have to be changed or extended. In fact, this is a very common situation when improving algorithms: the optimizations do not fit into the old abstractions and require a revision. Meanwhile, several fixes have already landed in the original implementation; for example, the inc and dec tables are now correct. Besides, a few weeks ago the original implementation sped up decompression by the same 12-15% by copying 32 bytes instead of 16, as described above. We tried the 32-byte option ourselves - the results were not as impressive, but there is a speedup overall.

    If you look at the profile at the beginning of the article, you may notice that we could also remove one extra copy from the page cache into userspace (either with mmap, or with O_DIRECT plus a userspace page cache - both are problematic), and slightly improve the checksum computation (currently CityHash128 without CRC32-C is used; alternatives are HighwayHash, FARSH or XXH3). Speeding up these two operations matters for weakly compressed data, since they are performed on the compressed data.

    In any case, the changes have already been merged into master, and the ideas obtained in the course of this study have found application in other problems. Here is a video from HighLoad++ Siberia, and here is the presentation.
