Entropy coding with rANS, or how to write your own archiver

    This article may be of interest to those who work on data compression or want to write their own archiver. It is based mainly on material from the blog maintained by Fabian Giesen.





    Introduction


    The entropy coding method rANS (range + ANS) is a sibling of the FSE algorithm, which I wrote about earlier. The abbreviation ANS stands for Asymmetric Numeral Systems, and the word range in the name hints at this method's similarity to range coding. The author of rANS is Jarek Duda.

    The rANS method achieves almost optimal compression at very high speed. In this respect rANS is no worse than FSE, which is not surprising: both algorithms are built on the same theoretical base. However, rANS is much simpler to implement than FSE.

    First, there will be a long “theoretical” part, and then we will try to write a simple archiver.

    Method Description


    The operation of the algorithm is determined by the following simple formulas:

    Coding:   C(s,x): x := (x / Fs) * M + Bs + (x % Fs)
    Decoding:   D(x): s = sym[x % M], x := Fs * (x / M) + (x % M) - Bs

    Let's analyze them in detail.

    The encoding function C(s, x) takes the symbol s to be encoded (let it be an integer) and the current encoder state x (also an integer).

    Fs - the frequency of symbol s. The division by Fs above is integer division.
    M - the sum of the frequencies of all symbols of the alphabet (M = Σ Fs).
    Bs - the start of the interval corresponding to the encoded symbol s.
    x % Fs - the remainder of dividing x by Fs.

    The principle is the same as in arithmetic coding: we take the segment [0, M) and divide it into parts so that each symbol s gets an interval whose size equals the symbol's frequency Fs. The interval that the value x % M falls into identifies the encoded symbol.
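    To make Bs concrete: the interval starts are simply prefix sums of the frequencies. A minimal sketch, assuming a freq[] array is already filled (the names freq and range_start match the arrays used in the encoder below):

    void build_range_starts(const uint32_t* freq, uint32_t* range_start, size_t num_syms)
    {
        uint32_t sum = 0;
        for (size_t s = 0; s < num_syms; s++) {
            range_start[s] = sum; // Bs: start of the interval for symbol s
            sum += freq[s];       // after the loop, sum == M
        }
    }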


    At the beginning of encoding, initialize x with an arbitrary suitable value, then apply the function C(s, x) to all encoded symbols in sequence.

    Each application of C(s, x) increases the value of x. When it gets too big, you should flush data to the output:

    while (x >= x_max) {
        writeToStream(x % b); // write out the data
        x /= b; // shrink x
    }

    This step is called renormalization . After it, you can continue coding.

    New constants appeared in the code above: x_max and b. In theory, the numbers M, b, and x_max are tied together by certain relations, but in practice the following values work best when the type uint32 is used for the state x:

    M = 2^k, where k <= 16
    b = 2^16 (half the size of uint32)

    M = 2^k is chosen so that division by M and taking the remainder modulo M can be replaced with bitwise operations.
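    For example, a small self-check just to illustrate the point:

    constexpr uint32_t k = 10;
    constexpr uint32_t M = 1u << k;
    uint32_t x = 123456789;
    assert(x / M == (x >> k));      // division by M turns into a right shift
    assert(x % M == (x & (M - 1))); // remainder mod M turns into a bit mask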

    The value of k is chosen from the following considerations: the larger it is, the more accurate the Fs and the more efficient the compression. On the other hand, the frequency table has to be stored, and this overhead must be taken into account, so the maximum value of k is not always worth using.

    The value of x_max must be such that no overflow occurs. From the encoding function we get x < uint32_max * Fs / M, or put slightly differently: x_max <= (b * L) * Fs / M, where L <= uint32_max / b. In real code the condition takes the form x / b >= (L / M) * Fs to avoid overflow in the calculation.

    The value b = 2^16 (half the size of uint32) is chosen so that if x exceeds x_max, a single division by b is enough to continue working. As a result, the while loop finishes after the first iteration, which means it can be replaced with a simple if.

    As a result, the encoding function takes the following form:

    typedef uint32_t RansState;
    constexpr uint32_t RANS_L = 1u << 16;
    constexpr uint32_t k = 10; // as an example
    constexpr uint32_t RANS_M = 1u << k; // M = 2^k
    // Encode symbol s
    void RansEnc(RansState& x, uint32_t s, RansOutBuf& out)
    {
        assert(x >= RANS_L); // this invariant must always hold
        uint32_t Fs = freq[s]; // frequency of symbol s
        uint32_t Bs = range_start[s]; // start of the interval for s
        assert(Fs > 0 && Fs <= RANS_M);
        // renormalize
        if ((x >> 16) >= (RANS_L >> k) * Fs) { // x / b >= (L / M) * Fs
            out.put( x & 0xffff );
            x >>= 16;
        }
        x = ((x / Fs) << k) + Bs + (x % Fs); // C(s,x)
        assert(x >= RANS_L); // this invariant must always hold
    }

    At the end of encoding, you must save the value of x, since decoding will start from it. And yes, we will decode from the end to the beginning, that is, from the last encoded symbol to the first. (This point is explained in detail in the FSE article.)
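    Saving the final state can be sketched, for example, like this (by analogy with the encoder's 16-bit output above; the ready-made example has its own RansEncFlush for this):

    void RansEncFlushSketch(RansState x, RansOutBuf& out)
    {
        // write out all 32 bits of the final state as two 16-bit words
        out.put(x & 0xffff);
        out.put(x >> 16);
    }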

    I want to dwell a little more on how the encoding formula works.

    x := (x / Fs) * M + Bs + (x % Fs)

    After computing (x / Fs) * M, the variable x has k free low bits (remember that M = 2^k). Adding Bs + (x % Fs) essentially writes into these bits a value from the interval of the symbol s: Bs is the start of the interval, and (x % Fs) is a number within it (the interval's size is Fs). Thus, when decoding, we can determine the encoded symbol from the interval that (x % M) falls into.
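    A worked numeric example with M = 16 (k = 4): let symbol s have Fs = 3 and Bs = 5, so its interval is [5, 8). Encoding from the state x = 10 and decoding back:

    const uint32_t M = 16, Fs = 3, Bs = 5;
    uint32_t x = 10;
    x = (x / Fs) * M + Bs + (x % Fs);       // C(s,x): 3*16 + 5 + 1 = 54
    uint32_t x_mod = x % M;                 // 54 % 16 = 6, falls into [5, 8)
    assert(x_mod >= Bs && x_mod < Bs + Fs); // so we can recover s
    x = Fs * (x / M) + x_mod - Bs;          // D(x): 3*3 + 6 - 5 = 10
    assert(x == 10);                        // the original state is restored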

    Decoding

    Let's move on to the decoding function.

    D(x): s = sym[x % M], x := Fs * (x / M) + (x % M) - Bs

    As we learned above, the desired symbol s is determined by the remainder x % M: whichever interval the value (x % M) falls into, that is the symbol that was encoded.

    Earlier we deliberately chose M = 2^k to simplify the decoding function. Here is the result:

    uint32_t RansDecode(RansState& x, RansInBuf& in)
    {
        assert(x >= RANS_L); // check the invariant
        uint32_t x_mod = x & (RANS_M - 1); // = x % M
        // find the interval that x_mod fell into, using a lookup table
        assert(x_mod < dct.size());
        uint32_t s = dct[x_mod]; // the decoded symbol
        uint32_t Fs = freq[s]; // frequency of symbol s
        uint32_t Bs = range_start[s]; // start of the interval for s
        x = (x >> k) * Fs + x_mod - Bs;
        // renormalize
        if (x < RANS_L) {
            x = (x << 16) | in.read16(); // read 16 bits
        }
        assert(x >= RANS_L); // check the invariant
        return s;
    }

    Decoding begins with the same x that was obtained at the end of the encoding. To do this, it must be saved along with the encoded data.

    At the end of decoding, the decoder state x must be exactly the same as it was at the start of encoding. In general, at every step x must exactly match its value at the corresponding encoding step. This fact helps a lot when debugging.

    As you can see, decoding works faster than encoding, since there are no division operations.

    The trickiest part of the decoding function is determining which interval the value (x % M) fell into.

    The simplest and fastest method is a sym[] table of size M (it is called dct in the code above). In this case we get a table of the same size as in the FSE algorithm, with the difference that in rANS the table is not “shuffled”: the symbols lie in order, and such a table is much easier to build.
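    Building such a table is trivial: every slot of the interval [Bs, Bs + Fs) simply stores s. A minimal sketch:

    std::vector<uint8_t> build_sym_table(const uint32_t* freq, size_t num_syms)
    {
        std::vector<uint8_t> sym(RANS_M);
        uint32_t pos = 0; // running interval start, i.e. Bs
        for (size_t s = 0; s < num_syms; s++)
            for (uint32_t j = 0; j < freq[s]; j++)
                sym[pos++] = (uint8_t)s;
        return sym; // pos == RANS_M if the frequencies are normalized correctly
    }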

    The main drawback of this approach is the size of the sym table, which grows exponentially with k.

    Alias method


    The alias method was invented to determine the target interval more efficiently. It allows you to find the desired interval quickly using small tables sized by the number of symbols in the alphabet.


    A long and thorough explanation can be found here: Darts, Dice and Coins. In brief, the essence of the method is this: we take a piece of the longest interval and attach it to the shortest one, so that the total size is exactly M / N (where N is the number of symbols in the alphabet). It turns out that if you keep doing this, you always end up with N pairs of size M / N.

    Naturally, M must be divisible by N. And if you recall that M = 2^k, the alphabet size also turns out to be a power of two. This is not a problem, since you can always pad the alphabet to the desired size with unused symbols of zero frequency.

    The fact that a symbol's interval is split into several parts complicates the encoding procedure a little, but not much. If you recall FSE, there the intervals were spread across the entire range as if a crazy mixer had worked on them, and still it worked =)

    Determining the desired interval is easy: we divide (x % M) by the pair size M / N and land in one of the N pairs. Then we compare the offset within the pair, (x % M) % (M / N), with the boundary between the two segments in the pair, and end up in one interval or the other.
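    In code, the lookup could look something like this (a sketch: the bucket layout and field names are my assumptions, not the exact structures from the ready-made example). Since M and N are powers of two, the division and remainder here again reduce to a shift and a mask:

    struct AliasBucket {
        uint32_t divider;  // boundary between the two pieces inside the pair
        uint8_t  sym_low;  // symbol owning offsets [0, divider)
        uint8_t  sym_high; // symbol owning offsets [divider, M / N)
    };

    uint32_t alias_lookup(const AliasBucket* buckets, uint32_t x_mod, uint32_t bucket_size /* = M / N */)
    {
        uint32_t bucket = x_mod / bucket_size; // which of the N pairs we landed in
        uint32_t offset = x_mod % bucket_size; // position inside that pair
        const AliasBucket& b = buckets[bucket];
        return offset < b.divider ? b.sym_low : b.sym_high;
    }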

    Trying it in practice


    We will use the code of the ready-made example.

    We take the data for compression from the file:
    size_t in_size;
    uint8_t* in_bytes = read_file("book1", &in_size);

    1. First, you need to decide on the data structure.

    We use the simplest option: we will encode one byte at a time using the alphabet [0...255].

    2. The next step is to count the frequencies of the alphabet symbols:

    (function stats.count_freqs)
    for (size_t i = 0; i < in_size; i++) {
        freqs[in_bytes[i]]++;
    }

    3. So, we have the symbol frequencies, but now they need to be normalized, that is, proportionally scaled down (or up) so that their sum is exactly M = 2^k. This is not as simple a task as it might seem, so we use a ready-made function:

    stats.normalize_freqs(...);
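    For illustration, here is a naive normalization sketch (the real normalize_freqs is considerably more careful about rounding):

    void normalize_naive(uint32_t* freqs, size_t n, uint32_t M, uint64_t total)
    {
        uint32_t sum = 0;
        size_t best = 0; // index of the most frequent symbol
        for (size_t s = 0; s < n; s++) {
            if (freqs[s] == 0) continue;
            uint64_t f = (uint64_t)freqs[s] * M / total;
            freqs[s] = f ? (uint32_t)f : 1; // an occurring symbol must keep Fs >= 1
            if (freqs[s] > freqs[best]) best = s;
            sum += freqs[s];
        }
        // Dump the rounding error onto the most frequent symbol.
        // (A real implementation must also handle the case sum > M.)
        freqs[best] += M - sum;
    }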

    By the way, we need to pick a value of k. Since our alphabet consists of 256 symbols, it makes no sense to take k below 8. Let's start with the maximum, 16, and try other values later.

    4. Build the alias table:

    stats.make_alias_table();

    5. We encode from the end, so that decoding proceeds in the normal order.

    RansState rans; // the rANS state, which was earlier called x
    RansEncInit(&rans); // initialize with the starting value
    uint8_t* ptr = out_buf + out_max_size; // *end* of output buffer
    for (size_t i = in_size; i > 0; i--) { // NB: working in reverse!
        int s = in_bytes[i - 1];
        RansEncPutAlias(&rans, &ptr, &stats, s, prob_bits);
    }
    // Write out the final state. Decoding will start from it later.
    RansEncFlush(&rans, &ptr);

    Next, the linked example decodes the compressed data using the ready-made stats. In real life, to decode you need to store the frequency table (stats) along with the compressed data. In the simplest case it will cost N * k bits.
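    For completeness, the decoding loop in terms of the RansDecode function shown earlier might look like this (a sketch: reading the saved final state and the buffer handling are simplified, and read_initial_state is a hypothetical helper):

    RansState rans = in.read_initial_state(); // restore the saved x
    for (size_t i = 0; i < in_size; i++) {
        out_bytes[i] = (uint8_t)RansDecode(rans, in); // symbols come out in the original order
    }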

    As promised above, let's look at the compression results for different values of k (prob_bits in the code), taking into account the size increase due to storing the frequency table:

    (Original size of the file book1: 768771)
    k = 16: 435059 + 512 = 435571 bytes
    k = 15: 435078 + 480 = 435558 bytes
    k = 14: 435113 + 448 = 435561 bytes
    k = 13: 435239 + 416 = 435655 bytes
    k = 12: 435603 + 384 = 435987 bytes
    k = 11: 436530 + 352 = 436882 bytes
    k = 10: 440895 + 320 = 441215 bytes
    k = 9: 453418 + 288 = 453706 bytes
    k = 8: 473126 + 256 = 473382 bytes

    You can see that the larger k is, the better the compression. But at a certain point (here at k = 16), the overhead of the frequency table begins to outweigh the gain from increased accuracy. If you compress a smaller file, this effect will show up at smaller k.

    A few words also need to be said about a trick called "interleaved rANS", which is additionally implemented in the example. The idea is that alternating between two independent state variables makes better use of the processor's instruction-level parallelism. As a result, decoding becomes even faster.
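    The encoding side of two-way interleaving can be sketched like this (reusing the example's API names; the actual arrangement in the linked code differs in details):

    RansState r0, r1;
    RansEncInit(&r0);
    RansEncInit(&r1);
    for (size_t i = in_size; i > 0; i--) { // still encoding in reverse
        int s = in_bytes[i - 1];
        if ((i - 1) & 1)
            RansEncPutAlias(&r1, &ptr, &stats, s, prob_bits);
        else
            RansEncPutAlias(&r0, &ptr, &stats, s, prob_bits);
    }
    RansEncFlush(&r1, &ptr);
    RansEncFlush(&r0, &ptr);
    // The decoder alternates between the two states in the same way; since
    // the chains are independent, the processor can overlap their work.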

    In conclusion, I want to note that the chosen file compression method is too simple. It does not take the structure of the data into account, which is why the compression is far from optimal. If you look closely at the input data, you can find that some letter combinations occur more often than others, and some do not occur at all. By exploiting this fact, compression can be improved significantly. But that is a topic for a separate article.

    Afterword


    Why write your own archiver when there are many time-tested utilities? The answer is quite simple: an archiver tailored to a specific format compresses much better.

    In game development at Playrix, we often run into the need to reduce the size of the build. Games are constantly evolving, the amount of content keeps growing, and space is limited.

    Once again looking longingly at our resources, we realized that some files can be compressed much better than zip does, given their structure. In our experiments we managed to significantly reduce the size of our own animation format, and there is some progress in compressing graphics files as well.

    When developing compression algorithms, a universal entropy coder such as rANS or FSE is an indispensable tool. It completely takes over the task of writing symbols with the fewest possible bits, letting the developer focus on the essential details of the algorithm. And it works very fast, both when encoding and when decoding.

    I hope this article helps you get to know rANS and start using it in your projects.

    References


    Here you can see working examples of rANS implementation (with different optimization options):

    Fabian Giesen: github.com/rygorous/ryg_rans

    In Fabian's blog you can read interesting details and specifics (in English):

    rANS notes
    rANS with static probability distributions
    rANS in practice
