FSE encoding

    Finite State Entropy (FSE) is an entropy coding algorithm that is somewhat similar to both the Huffman algorithm and arithmetic coding. At the same time, it takes the best from both of them: it works as fast as Huffman coding, while achieving a compression ratio close to that of arithmetic coding.

    FSE belongs to the ANS family of codecs (Asymmetric Numeral Systems) invented by Jarek Duda. Building on his research, Yann Collet developed an optimized version of the algorithm, later called FSE.

    Yann Collet's notes are not easy to follow, so I will present the explanation in a slightly different order, one that is, in my opinion, easier to understand.



    Comparison with existing algorithms


    Recall how the Huffman algorithm works.

    Suppose we have a message consisting of characters in the following proportions:
    A: 50%; B: 25%; C: 12.5%; D: 12.5%

    With such a good distribution, you can encode characters as follows:

    For A, you need 1 bit; for B, 2 bits; C and D take 3 bits each (for example, with the codes A = 0, B = 10, C = 110, D = 111).

    If the probability distribution is less convenient, for example:
    A: 40%; B: 30%; C: 15%; D: 15%, then the Huffman algorithm produces exactly the same character codes, but the compression will not be as good.

    The compression achieved on a message is described by the average number of bits per character: sum(p_i * b_i), where p_i is the frequency with which character i occurs in the message and b_i is the number of bits used to encode it. It is known that for entropy codecs the optimal compression is given by Shannon's formula: -sum(p_i * log2(p_i)). Comparing this with the previous formula, b_i should be equal to -log2(p_i). More often than not, this logarithm is not an integer.
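
    To see the gap in numbers, here is a small self-contained C program (an illustration of my own, not from the original article) that compares the average Huffman cost with the Shannon limit for the second distribution:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Second distribution from the text: A 40%, B 30%, C 15%, D 15% */
        double p[4]       = { 0.40, 0.30, 0.15, 0.15 };
        int    huffman[4] = { 1, 2, 3, 3 };    /* same code lengths as before */
        double avg = 0.0, shannon = 0.0;
        for (int i = 0; i < 4; i++) {
            avg     += p[i] * huffman[i];      /* sum(p_i * b_i)        */
            shannon += -p[i] * log2(p[i]);     /* -sum(p_i * log2(p_i)) */
        }
        printf("Huffman: %.3f bits/char, Shannon limit: %.3f bits/char\n", avg, shannon);
        /* Prints roughly 1.900 and 1.871: about 1.5% of compression is lost. */
        return 0;
    }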

    Non-integer number of bits - what to do about it?


    In the Huffman algorithm, the number of bits is rounded to the nearest integer, which negatively affects the compression ratio.

    The arithmetic coding algorithm uses a different approach, which allows not rounding and thereby approaching the “ideal” compression. But it works relatively slowly.

    The FSE algorithm is a cross between the previous two approaches. It uses a variable number of bits to encode a character (sometimes more, sometimes less) so that on average it comes out to -log2(p_i) bits per character.

    How FSE works


    Consider an example. Let the characters occur in the message to be encoded in the following proportions:
    A: 5; B: 5; C: 3; D: 3
    We call these numbers normalized frequencies and denote them q_i.

    N = sum(q_i) is the sum of the (normalized) symbol frequencies.

    In our example, N = 16. For the algorithm to work quickly, N must be a power of two. If the sum of the symbol frequencies is not a power of two (which is almost always the case), then the frequencies need to be normalized: proportionally reduced (or increased) so that they sum to a power of two. More will be said about this later.

    Take a table of N cells and write the message characters into it so that each character occurs exactly as many times as its frequency. The order does not matter yet.
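
    For instance, one possible layout of the 16 cells for these frequencies (an illustration of my own; at this point any order works):

    // A:5, B:5, C:3, D:3 spread over 16 cells (one of many valid layouts)
    char table[16] = { 'A','B','C','A','B','D','A','B','C','A','B','D','A','B','C','D' };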



    The character A has its own set of cell ranges (Sub-ranges in the picture): the size of each range must be a power of two (for speed), and together they must cover all N cells (for now, it does not matter to us how these ranges are arranged).

    Similarly, we define its own set of ranges for each of the remaining characters.
    For the symbol D there will be 3 ranges:



    The code table is ready. Let's use it to encode the word "BCDA".

    1. The first character is "B". Choose any cell of the table containing this symbol. The cell number is our current state.


      Current state = 5.

    2. Take the next character, "C". We look at the code table for C and check which of its intervals our current state falls into.


      In our case, state 5 falls into the second interval.

      We write the offset from the beginning of this interval. In our case it is 1. Recording the offset takes 2 bits, as determined by the size of the range our state fell into.

      The range we fell into corresponds to the cell with the symbol C at position 11, which means the new state = 11.

    3. Repeat step 2 for each subsequent character: select the desired table, determine the range, write the offset, get a new state.

      For each character, only the offset within the interval is recorded. The number of bits used varies, depending on whether we land in a "large" or a "small" interval. On average, the number of bits per character tends to -log2(q_i / N). Proving this fact requires fairly involved theory, so just take it on faith.

    4. When the last character has been encoded, we are left with the final state. It needs to be saved; decoding will start from it.

    Decoding is performed in the reverse order - from the last encoded character to the first.



    1. We have the final state (4), and it uniquely identifies the last encoded character (A).

    2. Each state corresponds to an interval in the code table. We read the required number of bits according to the size of the interval (2 bits) and get a new state (15), which determines the next character.

    3. Repeat step 2 until we decode all the characters.

      Decoding ends either when the encoded data ends, or when a special stop symbol is encountered (requires an additional character in the alphabet).

    The distribution of characters in the table


    The theory tells us that coding will be better if the same characters are distributed evenly across the table.



    This is necessary so that the state does not “stagnate” at large intervals, losing an extra bit.

    It might seem that it would be nice to fall into small intervals all the time (on the left side of the table). But in this case, other characters will often fall into large intervals - in the end it will turn out only worse.

    Optimization


    Finally, let's move from theory to code. To begin with, let's figure out how to calculate, during encoding, which interval the current state falls into and how many bits need to be written.

    For coding, we need to split a table of size N (equal to a certain power of two) into intervals. The sizes of these intervals will also be powers of two: 2^maxbit and 2^(maxbit-1) - let's call them "large" and "small". Let the small intervals be on the left and the large intervals on the right, as in the figure below. For example, if N = 16 and we need q_i = 6 intervals, then the partition will be 2 + 2 + 2 + 2 + 4 + 4: four "small" intervals of size 2^1 and two "large" ones of size 2^2.




    Put simply, the size of a large interval is the smallest power of two such that
    2^maxbit >= N / q.
    Accordingly, maxbit = log2(N) - highbit(q), where highbit is the index of the most significant nonzero bit of q.

    Back to constructing the intervals. First, divide the table into "large" intervals: (N / 2^maxbit) of them. Then split several of them in half. Each split increases the total number of intervals by 1, so to end up with q intervals we need to split q - (N / 2^maxbit) large intervals. Accordingly, the number of small intervals will be (q - (N / 2^maxbit)) * 2.

    Now let's find out under what condition the state falls into a large interval. To determine the boundary between the small and the large intervals, multiply the number of small intervals by their size:

    (q - (N / 2^maxbit)) * 2 * 2^(maxbit-1) = (q * 2^maxbit) - N;

    It turns out that the state falls into a large interval provided that
    N + state >= q * 2^maxbit, otherwise it falls into a small interval.

    The size of the interval we fall into determines how many low-order bits of the current state are written during encoding. If the condition above holds, then nbBitsOut = maxbit, otherwise (maxbit - 1). The remaining bits of the state determine the interval number.
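
    To make the formulas concrete, here is a small self-contained snippet (my own illustration, not from the article's code) that computes the partition for the example above, N = 16 and q = 6:

    #include <stdio.h>

    // Index of the most significant set bit: highbit(6) = 2, highbit(16) = 4.
    static unsigned highbit(unsigned v) { unsigned r = 0; while (v >>= 1) r++; return r; }

    int main(void) {
        unsigned N = 16, q = 6;
        unsigned maxbit = highbit(N) - highbit(q);   // log2(N) - highbit(q) = 2
        unsigned nLarge = N >> maxbit;               // large intervals before splitting
        unsigned nSplit = q - nLarge;                // how many of them get split in half
        printf("%u small intervals of size %u, %u large intervals of size %u\n",
               nSplit * 2, 1u << (maxbit - 1), nLarge - nSplit, 1u << maxbit);
        printf("state falls into a large interval when N + state >= %u\n", q << maxbit);
        return 0;   // prints the 2+2+2+2+4+4 partition and the threshold 24
    }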

    Instead of the comparison operation, we apply a trick: shift the difference right by a sufficiently large amount:

    nbBitsOut = ((maxbit << 16) +  N + state - (count << maxbit)) >> 16;

    The absence of a branch in this expression lets the processor make better use of its internal parallelism. (The variable count corresponds to q in the previous formulas.) As a result, the encoding function takes the following form:




    // Compute the number of bits to write
    nbBitsOut = ((maxBitsOut << 16) + N + state - (count << maxBitsOut)) >> 16;
    // Write the nbBitsOut low-order bits of state to bitStream:
    bitStream.WriteBits(state, nbBitsOut);
    interval_number = (N + state) >> nbBitsOut;
    // New state:
    state = nextStateTable[deltaFindState[symbol] + interval_number];
    

    Everything that can be computed in advance, we put into a table:

    symbolTT[symbol].deltaNbBits = (maxBitsOut << 16) - (count << maxBitsOut);

    Note that state is used all the time in the form N + state. If nextStateTable also stores N + next_state, then the function takes the following form:

    nbBitsOut = (N_plus_state + symbolTT[symbol].deltaNbBits) >> 16;
    bitStream.WriteBits(N_plus_state, nbBitsOut);
    // New state:
    N_plus_state = nextStateTable[symbolTT[symbol].deltaFindState + (N_plus_state >> nbBitsOut)];
    

    As you can see, the calculations are very simple and fast.

    Decoding, accordingly, will look like this:

    // Output the decoded symbol
     outputSymbol(decodeTable[state].symbol);
     // Take from the table how many bits need to be read
     nbBits = decodeTable[state].nbBits;
     // The bits we read give the offset within the interval
     offset = readBits(bitStream, nbBits);
     // Compute the next state: start of the sub-range + offset
     state = decodeTable[state].subrange_pos + offset;

    Even easier!

    The real code can be found in Yann Collet's GitHub repository (linked at the end of the article). There you can also see how the symbolTT and decodeTable tables are filled in.

    About normalizing symbol frequencies


    Above, we proceeded from the assumption that the sum of the symbol frequencies is a power of two. In the general case, of course, this is not so. Then the frequencies q_i must be proportionally reduced so that their sum becomes a power of two. Keep in mind that the sum of the symbol frequencies is equal to the size of the code table. The larger the table, the more accurately the symbol frequencies are represented and the better the compression. On the other hand, the table itself also takes up space. To estimate the size of the compressed data you can use the Shannon formula

    -sum(q_i * log2(q_i / N)),

    rather boldly assuming that the compression approaches the ideal, and in this way choose the optimal tables.

    In the process of normalizing frequencies, many small problems can arise. For example, if some characters have a very small probability, then the smallest representable frequency, 1/N, may be an inaccurate approximation, but there is no way to go lower. One approach in this case is to immediately assign such symbols q_i = 1, place them at the end of the code table (this is more optimal), and only then deal with the remaining characters. Rounding values to the nearest integer may also not be the best option. These and many other nuances are described well by Yann Collet (1, 2, 3, 4).

    There are both fast heuristic frequency normalization algorithms, and slow, but more accurate ones. The choice is yours.
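
    As an illustration, here is a naive normalization sketch of my own (deliberately simple; it does not reproduce Yann Collet's heuristics and ignores the corner cases mentioned above):

    // Scale raw counts so that they sum to N = 1 << tableLog,
    // never dropping a symbol that actually occurs to zero.
    void normalizeCounts(int norm[], const unsigned count[], int nbSymbols,
                         unsigned total, int N)
    {
        int sum = 0, largest = 0;
        for (int s = 0; s < nbSymbols; s++) {
            if (count[s] == 0) { norm[s] = 0; continue; }
            norm[s] = (int)((unsigned long long)count[s] * (unsigned)N / total);
            if (norm[s] == 0) norm[s] = 1;            // rare symbols still get one slot
            if (norm[s] > norm[largest]) largest = s;
            sum += norm[s];
        }
        // Push the rounding error onto the most frequent symbol, where it costs least.
        norm[largest] += N - sum;
    }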

    Data Model Selection


    Let's look at a situation where we need to encode numerical values ​​from a potentially large range, for example, from -65536 to +65535. Moreover, the probability distribution of these numbers is very uneven: with a large peak in the center and very small probabilities at the edges, as in the picture. If you encode each value with a separate character, the table will be of unimaginable size.


    In such a situation, a whole range of values can be designated by a single character. With FSE we encode only that character, while the offset within the range is written "as is", using a number of bits determined by the size of the range.

    This technique allows you to make code tables more or less compact.
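
    For instance, one possible bucketing scheme (the mapping and all names below are my own illustration, not taken from the article or its code): the FSE symbol is the bit length of the zigzag-mapped value, and the offset inside the bucket is what gets written to the stream "as is".

    #include <stdio.h>

    // Map signed values to unsigned: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ...
    static unsigned zigzag(int v)       { return ((unsigned)v << 1) ^ (unsigned)(v >> 31); }
    // Index of the most significant set bit.
    static unsigned highbit(unsigned v) { unsigned r = 0; while (v >>= 1) r++; return r; }

    int main(void) {
        int samples[] = { 0, -1, 3, -700, 65535, -65536 };
        for (int i = 0; i < 6; i++) {
            unsigned mag    = zigzag(samples[i]) + 1;   // 1..131072, never zero
            unsigned symbol = highbit(mag);             // bucket number: 0..17
            unsigned extra  = mag - (1u << symbol);     // offset inside the bucket
            printf("%6d -> symbol %2u, offset %u in %u raw bits\n",
                   samples[i], symbol, extra, symbol);
        }
        // The whole range [-65536, +65535] needs only 18 FSE symbols.
        return 0;
    }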

    This example demonstrates that the choice of alphabet depends heavily on your data. A little imagination is needed to make the coding as efficient as possible.

    Mixed coding


    If different parts of the data have different frequency distributions, then it is better to use different tables to encode them.

    At the same time, heterogeneous data can be encoded using different tables in turn! The state obtained with one table is used to encode the next character with another table. The main thing is to be able to repeat the same order during decoding. The tables must also be the same size.

    For example, let your data be a sequence of the form "command - number - command - number ...". You always know that a number comes after a command, and this determines the order in which the tables are used.

    If you encode the data from end to beginning, then during decoding (which runs from beginning to end) it will be clear what comes next and, accordingly, which table to use to decode the next element.

    About storing the tables


    As mentioned above, the code table must be stored with the data in order to decode it. In this case, it is desirable that the table does not take up much space.

    Let's see which fields are in the dTable table:

    unsigned char symbol; // if the alphabet has at most 256 symbols
    unsigned short subrange_pos; // start of the corresponding interval
    unsigned char nbBits; // how many bits to read
    

    Total: 4 bytes per table entry. That is a lot.

    But if you think about it, instead of the dTable table you can store only the normalized symbol frequencies. For example, if the size of the code table is 2^8 = 256, then 8 bits per character are more than enough.
    Knowing the frequencies, you can build exactly the same dTable as was used for encoding.
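
    A minimal sketch of such a reconstruction (my own simplification: the symbols are spread sequentially rather than evenly, and the field names follow the dTable definition above):

    typedef struct {
        unsigned char  symbol;
        unsigned short subrange_pos;
        unsigned char  nbBits;
    } DTableEntry;

    static unsigned highbit(unsigned v) { unsigned r = 0; while (v >>= 1) r++; return r; }

    // Rebuild the decode table from the normalized frequencies norm[] alone.
    void buildDTable(DTableEntry dTable[], const int norm[], int nbSymbols, int tableLog)
    {
        int N = 1 << tableLog;
        int next[256];               // next "virtual state" per symbol (alphabet <= 256)
        int pos = 0;

        // 1. Spread the symbols: symbol s occupies norm[s] cells.
        //    (A real implementation spreads them more evenly across the table.)
        for (int s = 0; s < nbSymbols; s++) {
            for (int i = 0; i < norm[s]; i++)
                dTable[pos++].symbol = (unsigned char)s;
            next[s] = norm[s];
        }

        // 2. For every state, compute how many bits to read and where the
        //    destination sub-range starts.
        for (int state = 0; state < N; state++) {
            int s      = dTable[state].symbol;
            int x      = next[s]++;                      // runs from norm[s] to 2*norm[s]-1
            int nbBits = tableLog - (int)highbit((unsigned)x);
            dTable[state].nbBits       = (unsigned char)nbBits;
            dTable[state].subrange_pos = (unsigned short)((x << nbBits) - N);
        }
    }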

    At the same time, I advise you to make sure that the algorithm does not use float (it gives different results on different processors; it is better to use fixed point) and to use only stable_sort and the like, in order to guarantee reproducible results.

    By the way, these tables can also be compressed somehow. ;)

    Epilogue


    We at Playrix use FSE to encode vector animations. The deltas between frames consist of many small values with a distribution peak near zero. Switching from Huffman to FSE allowed us to reduce the size of the animations by about one and a half times. The compressed data is stored in memory and unpacked "on the fly" during playback; FSE allows this to be done very efficiently.

    References


    Now that you’ve figured out how FSE works, I recommend reading the details on Yann Collet's blog for a deeper understanding:

    1. Finite State Entropy - A new breed of entropy coder
    2. FSE decoding: how it works
    3. Huffman, a comparison with FSE
    4. A comparison of Arithmetic Encoding with FSE
    5. FSE: Defining optimal subranges
    6. FSE: distributing symbol values
    7. FSE decoding: wrap up
    8. FSE encoding: how it works
    9. FSE tricks - Memory efficient subrange maps
    10. Better normalization, for better compression
    11. Perfect normalization
    12. Ultra-fast normalization
    13. Taking advantage of unequalities to provide better compression
    14. Counting bytes fast - little trick from FSE

    GitHub code (by Yann Collet): https://github.com/Cyan4973/FiniteStateEntropy .
