Data Compression Methods

    My supervisor and I are preparing a small monograph on image processing. I decided to submit the chapter on image compression algorithms for the judgment of the Habr community. Since it is difficult to fit the whole chapter into one post, I decided to split it into three posts:
    1. Data compression methods;
    2. Lossless image compression;
    3. Lossy image compression.
    Below you can find the first post in the series.

    Currently, there is a large number of lossless compression algorithms, which can be loosely divided into two large groups:
    1. Stream and dictionary algorithms. This group includes algorithms of the RLE (run-length encoding), LZ*, and other families. A common feature of all algorithms in this group is that encoding uses not the frequencies of characters in the message, but information about sequences encountered earlier.
    2. Statistical (entropy) compression algorithms. The algorithms of this group compress information by exploiting the non-uniformity of the frequencies with which different characters occur in the message; they include arithmetic coding and prefix coding algorithms (using Shannon-Fano, Huffman, and secant-function trees).
    Information transformation algorithms can be singled out as a separate group. The algorithms of this group do not compress information directly, but applying them greatly simplifies further compression with stream, dictionary, and entropy algorithms.

    Stream and Dictionary Algorithms


    Run-Length Encoding


    Run-length encoding (RLE) is one of the simplest and most widely used data compression algorithms. In this algorithm, a sequence of repeated characters is replaced by the character and the number of its repetitions.
    For example, the string "AAAAA", which requires 5 bytes to store (assuming one byte per character), can be replaced by "5A", consisting of two bytes. Obviously, the longer the runs of repetitions, the more effective the algorithm.

    The main disadvantage of this algorithm is its extremely low efficiency on sequences of non-repeating characters. For example, the sequence "ABABAB" (6 bytes) turns into "1A1B1A1B1A1B" (12 bytes) after the RLE algorithm is applied. Various methods exist to deal with non-repeating characters.

    The simplest method is the following modification: the byte encoding the number of repetitions stores information not only about the number of repetitions, but also about whether there are any. If the first bit is 1, the next 7 bits indicate the number of repetitions of the corresponding character; if the first bit is 0, the next 7 bits indicate the number of characters to be copied without repetition. (In the implementation below, the count byte holds 128 + count for a run and 128 - count for a literal stretch, i.e. the count offset by 128.) If we encode "ABABAB" using this modification, we get "-6ABABAB" (7 bytes). Clearly, the proposed technique can significantly increase the efficiency of the RLE algorithm on non-repeating character sequences. An implementation of the proposed approach is shown in Listing 1:
    type
      TRLEEncodedString = array of byte;

    function RLEEncode(InMsg: ShortString): TRLEEncodedString;
    var
      MatchFl: boolean;
      MatchCount: shortint;
      EncodedString: TRLEEncodedString;
      N, i: word; // word rather than byte: the output can exceed 255 bytes
    begin
      N := 0;
      SetLength(EncodedString, 2 * length(InMsg));
      while length(InMsg) >= 1 do
      begin
        // Does a run of equal characters start here?
        MatchFl := (length(InMsg) > 1) and (InMsg[1] = InMsg[2]);
        MatchCount := 1;
        while (MatchCount <= 126) and (MatchCount < length(InMsg)) and ((InMsg[MatchCount] = InMsg[MatchCount + 1]) = MatchFl) do
          MatchCount := MatchCount + 1;
        if MatchFl then
        begin
          // Run: store 128 + count, then the repeated character
          N := N + 2;
          EncodedString[N - 2] := MatchCount + 128;
          EncodedString[N - 1] := ord(InMsg[1]);
        end
        else
        begin
          if MatchCount <> length(InMsg) then
            MatchCount := MatchCount - 1;
          // Literal stretch: store 128 - count, then the characters themselves
          N := N + 1 + MatchCount;
          EncodedString[N - 1 - MatchCount] := 128 - MatchCount;
          for i := 1 to MatchCount do
            EncodedString[N - 1 - MatchCount + i] := ord(InMsg[i]);
        end;
        delete(InMsg, 1, MatchCount);
      end;
      SetLength(EncodedString, N);
      RLEEncode := EncodedString;
    end;

    Decoding a compressed message is very simple and comes down to a single pass over the compressed message; see Listing 2:
    type
      TRLEEncodedString = array of byte;

    function RLEDecode(InMsg: TRLEEncodedString): ShortString;
    var
      RepeatCount: shortint;
      i, j: word;
      OutMsg: ShortString;
    begin
      OutMsg := '';
      i := 0;
      while i < length(InMsg) do
      begin
        // The count byte holds the count offset by 128:
        // positive means a run, negative means a literal stretch
        RepeatCount := InMsg[i] - 128;
        i := i + 1;
        if RepeatCount < 0 then
        begin
          RepeatCount := abs(RepeatCount);
          for j := i to i + RepeatCount - 1 do
            OutMsg := OutMsg + chr(InMsg[j]);
          i := i + RepeatCount;
        end
        else
        begin
          for j := 1 to RepeatCount do
            OutMsg := OutMsg + chr(InMsg[i]);
          i := i + 1;
        end;
      end;
      RLEDecode := OutMsg;
    end;
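
    A minimal round-trip check of Listings 1 and 2 (a usage sketch; the program name and test string are illustrative):

    program RLETest;
    { assumes the declarations from Listings 1 and 2 are in scope }
    var
      Encoded: TRLEEncodedString;
    begin
      Encoded := RLEEncode('AAAAABABAB');
      writeln(RLEDecode(Encoded)); // prints AAAAABABAB
    end.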

    The second way to increase the efficiency of the RLE algorithm is to use information transformation algorithms that do not compress the data directly, but bring it to a form more convenient for compression. As an example of such an algorithm, consider the Burrows-Wheeler transform (BWT), a permutation named after its inventors. This permutation does not change the characters themselves, only their order in the string; after the permutation, repeated substrings gather into dense groups, which are compressed much better by the RLE algorithm. The forward BWT reduces to the following sequence of steps:
    1. Appending to the original string a special end-of-string character that occurs nowhere else;
    2. Obtaining all cyclic permutations of the source string;
    3. Sorting the resulting strings in lexicographic order;
    4. Returning the last column of the resulting matrix.
    An implementation of this algorithm is shown in Listing 3.
    const
      EOMsg = '|';

    function BWTEncode(InMsg: ShortString): ShortString;
    var
      OutMsg: ShortString;
      ShiftTable: array of ShortString;
      LastChar: AnsiChar;
      N, i: word;
    begin
      InMsg := InMsg + EOMsg;
      N := length(InMsg);
      SetLength(ShiftTable, N + 1);
      // Build all cyclic shifts of the string (index 0 is left empty)
      ShiftTable[1] := InMsg;
      for i := 2 to N do
      begin
        LastChar := InMsg[N];
        InMsg := LastChar + copy(InMsg, 1, N - 1);
        ShiftTable[i] := InMsg;
      end;
      // Sort is assumed to order the strings lexicographically
      // (a minimal sketch of it is given after Listing 4)
      Sort(ShiftTable);
      // The result is the last column of the sorted matrix
      OutMsg := '';
      for i := 1 to N do
        OutMsg := OutMsg + ShiftTable[i][N];
      BWTEncode := OutMsg;
    end;

    The easiest way to explain this transform is with a concrete example. Take the string "ANANAS" ("pineapple") and agree that the end-of-string character will be "|". All cyclic permutations of this string and the result of their lexicographic sorting are given in Table 1.

    Table 1. Cyclic permutations of "ANANAS|" (left) and the same strings sorted lexicographically (right):

        ANANAS|    ANANAS|
        NANAS|A    ANAS|AN
        ANAS|AN    AS|ANAN
        NAS|ANA    NANAS|A
        AS|ANAN    NAS|ANA
        S|ANANA    S|ANANA
        |ANANAS    |ANANAS

    That is, the result of the forward transform is the string "|NNAAAS". It is easy to see that this string is compressed much better by the RLE algorithm than the original, since it contains long runs of repeated letters.
    Other effects can be achieved with other transformations, but the advantage of the BWT is that it is reversible, although the inverse transform is more complicated than the forward one. To restore the original string, perform the following steps:
    1. Create an empty matrix of size n * n, where n is the number of characters in the encoded message;
    2. Fill the rightmost empty column with the encoded message;
    3. Sort the table rows in lexicographic order;
    4. Repeat steps 2-3 while empty columns remain;
    5. Return the row that ends with the end-of-string character.

    In itself, implementing the inverse transform is straightforward; one possible implementation is shown in Listing 4.
    const
      EOMsg = '|';

    function BWTDecode(InMsg: ShortString): ShortString;
    var
      OutMsg: ShortString;
      ShiftTable: array of ShortString;
      N, i, j: word;
    begin
      OutMsg := '';
      N := length(InMsg);
      SetLength(ShiftTable, N + 1);
      for i := 0 to N do
        ShiftTable[i] := '';
      // N times: prepend the encoded column to the rows and re-sort them
      for i := 1 to N do
      begin
        for j := 1 to N do
          ShiftTable[j] := InMsg[j] + ShiftTable[j];
        Sort(ShiftTable);
      end;
      // The original string is the row ending with the end-of-string character
      for i := 1 to N do
        if ShiftTable[i][N] = EOMsg then
          OutMsg := ShiftTable[i];
      delete(OutMsg, N, 1);
      BWTDecode := OutMsg;
    end;

    In practice, however, efficiency depends on the chosen sorting algorithm. Trivial algorithms with quadratic complexity will obviously have an extremely negative impact on performance, so it is recommended to use efficient ones.
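
    The listings above call a Sort routine that is not shown. Below is a minimal insertion-sort sketch, itself quadratic and therefore for illustration only; as just noted, production code should use an efficient sort. Note that the unused empty string at index 0 of ShiftTable sorts first and stays in place:

    procedure Sort(var A: array of ShortString);
    var
      i, j: integer;
      tmp: ShortString;
    begin
      for i := 1 to high(A) do
      begin
        tmp := A[i];
        j := i - 1;
        // Shift larger strings right; ShortString comparison is lexicographic
        while (j >= 0) and (A[j] > tmp) do
        begin
          A[j + 1] := A[j];
          j := j - 1;
        end;
        A[j + 1] := tmp;
      end;
    end;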

    [Table 2. The step-by-step reconstruction of the permutation table for the "|NNAAAS" example.]

    After sorting the table obtained at the seventh step, we must select from it the row ending with the character "|". It is easy to see that this row is the only one. Thus, we have examined the BWT transform using a concrete example.

    To summarize, the main advantage of the RLE group of algorithms is simplicity and speed (including decoding speed), while the main disadvantage is inefficiency on non-repeating character sets. Using special permutations increases the efficiency of the algorithm, but also greatly increases the running time (especially decoding).

    Dictionary Compression (LZ Algorithms)


    The group of dictionary algorithms, in contrast to the algorithms of the RLE group, encodes not counts of repeated characters but previously encountered sequences of characters. While these algorithms run, a table is built dynamically that lists the sequences already encountered and the codes corresponding to them. This table is often called a dictionary, and the corresponding group of algorithms is called dictionary algorithms.

    The simplest version of a dictionary algorithm is described below:
    1. Initialize the dictionary with all the characters occurring in the input string;
    2. Find in the dictionary the longest sequence (S) that matches the beginning of the message being encoded;
    3. Output the code of the found sequence and remove the sequence from the beginning of the message;
    4. If the end of the message has not been reached, read the next character (c), add Sc to the dictionary, and go to step 2. Otherwise, exit.

    For example, the freshly initialized dictionary for the phrase "KUKUSHKAKUKUSHONKUKUPILAKAPYUSHON" (a transliterated Russian tongue twister) is shown in Table 3:

    [Table 3. The freshly initialized dictionary: one entry for each distinct character of the phrase.]

    During compression, the dictionary is replenished with the sequences encountered in the message. The process of replenishing the dictionary is shown in Table 4.

    [Table 4. Replenishment of the dictionary while the message is being compressed.]

    The description of the algorithm intentionally omitted the situation when the dictionary becomes completely full. Depending on the variant of the algorithm, different behavior is possible: full or partial clearing of the dictionary, stopping its replenishment, or extending the dictionary with a corresponding increase in the bit width of the codes. Each of these approaches has certain disadvantages. For example, stopping the replenishment of the dictionary can lead to a situation where the dictionary stores sequences that occur at the beginning of the string being compressed but not later on. At the same time, clearing the dictionary can remove frequent sequences. Most implementations in use start tracking the compression ratio once the dictionary is full, and when it drops below a certain level, the dictionary is rebuilt. Below we consider the simplest implementation, which simply stops replenishing the dictionary once it is full.

    To begin with, we define the dictionary as a record that stores not only the encountered substrings, but also the number of substrings stored in the dictionary:
    type
      TDictionary = record
        WordCount: integer; // integer rather than byte: the count reaches MAX_DICT_LENGTH = 256
        Words: array of string;
      end;

    The subsequences encountered earlier are stored in the Words array, and their codes are the indices of the subsequences in this array.
    We also define functions for searching the dictionary and for adding to it:
    const
      MAX_DICT_LENGTH = 256;

    function FindInDict(D: TDictionary; str: ShortString): integer;
    var
      r: integer;
      i: integer;
      fl: boolean;
    begin
      r := -1;
      fl := false; // initialized here: it was undefined for an empty dictionary
      if D.WordCount > 0 then
      begin
        i := D.WordCount;
        // i > 0 (not i >= 0): i is decremented before the access,
        // so the looser condition could read Words[-1]
        while (not fl) and (i > 0) do
        begin
          i := i - 1;
          fl := D.Words[i] = str;
        end;
      end;
      if fl then
        r := i;
      FindInDict := r;
    end;

    procedure AddToDict(var D: TDictionary; str: ShortString);
    begin
      if D.WordCount < MAX_DICT_LENGTH then
      begin
        D.WordCount := D.WordCount + 1;
        SetLength(D.Words, D.WordCount);
        D.Words[D.WordCount - 1] := str;
      end;
    end;
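
    The encoder and decoder below also call an InitDict routine that is not shown. Here is a minimal sketch; since InitDict receives no message, the initial alphabet must be fixed in advance and agreed between encoder and decoder (the Latin capital letters below are an illustrative assumption):

    procedure InitDict(var D: TDictionary);
    var
      c: char;
    begin
      D.WordCount := 0;
      SetLength(D.Words, 0);
      // Assumption: a pre-agreed single-character alphabet;
      // both sides must initialize the dictionary identically
      for c := 'A' to 'Z' do
        AddToDict(D, c);
    end;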

    Using these functions, the encoding process according to the described algorithm can be implemented as follows:
    type
      TEncodedString = array of byte; // one byte per code, since MAX_DICT_LENGTH = 256

    function LZWEncode(InMsg: ShortString): TEncodedString;
    var
      OutMsg: TEncodedString;
      tmpstr: ShortString;
      D: TDictionary;
      N: byte;
    begin
      SetLength(OutMsg, length(InMsg));
      N := 0;
      InitDict(D);
      while length(InMsg) > 0 do
      begin
        // Grow tmpstr while it can still be found in the dictionary
        tmpstr := InMsg[1];
        while (FindInDict(D, tmpstr) >= 0) and (length(InMsg) > length(tmpstr)) do
          tmpstr := tmpstr + InMsg[length(tmpstr) + 1];
        if FindInDict(D, tmpstr) < 0 then
          delete(tmpstr, length(tmpstr), 1);
        // Output the code of the longest match and consume it
        OutMsg[N] := FindInDict(D, tmpstr);
        N := N + 1;
        delete(InMsg, 1, length(tmpstr));
        // Add the match extended by the next character to the dictionary
        if length(InMsg) > 0 then
          AddToDict(D, tmpstr + InMsg[1]);
      end;
      SetLength(OutMsg, N);
      LZWEncode := OutMsg;
    end;

    The result of encoding is the sequence of the words' numbers in the dictionary.
    The decoding process reduces to directly decoding the codes; there is no need to transmit the constructed dictionary, it is enough that during decoding the dictionary is initialized in the same way as during encoding. Then the dictionary is fully reconstructed in the course of decoding itself by concatenating the previous subsequence and the current character.

    The only possible problem arises in the following situation: when a subsequence must be decoded that is not yet in the dictionary. It is easy to verify that this is possible only when the substring to be extracted is the one that should be added at the current step. This means the substring matches the pattern cSc, i.e. it starts and ends with the same character, where cS is the substring added at the previous step. The situation described is the only one in which a string not yet added has to be decoded. Taking the above into account, we can propose the following variant of decoding a compressed string:
    function LZWDecode(InMsg: TEncodedString): ShortString;
    var
      D: TDictionary;
      OutMsg, tmpstr: ShortString;
      i: byte;
    begin
      OutMsg := '';
      tmpstr := '';
      InitDict(D);
      for i := 0 to length(InMsg) - 1 do
      begin
        if InMsg[i] >= D.WordCount then
          // The cSc case: the code is not in the dictionary yet
          tmpstr := D.Words[InMsg[i - 1]] + D.Words[InMsg[i - 1]][1]
        else
          tmpstr := D.Words[InMsg[i]];
        OutMsg := OutMsg + tmpstr;
        // Re-create the dictionary entry the encoder added at this step
        if i > 0 then
          AddToDict(D, D.Words[InMsg[i - 1]] + tmpstr[1]);
      end;
      LZWDecode := OutMsg;
    end;
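
    As with RLE, a small round-trip sketch (illustrative only; it assumes the listings above together with the InitDict sketch, so the message may use only characters from the initial alphabet):

    program LZWTest;
    { assumes TDictionary, FindInDict, AddToDict, InitDict, LZWEncode, LZWDecode }
    var
      Encoded: TEncodedString;
    begin
      Encoded := LZWEncode('ABABABA');
      writeln(LZWDecode(Encoded)); // prints ABABABA
    end.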

    The advantages of dictionary algorithms include their greater compression efficiency compared to RLE. Nevertheless, it must be understood that the actual use of these algorithms is fraught with some implementation difficulties.

    Entropy coding


    Shannon-Fano Tree Coding


    The Shannon-Fano algorithm is one of the first compression algorithms ever developed. It is based on the idea of representing more frequent characters with shorter codes. Moreover, the codes obtained by the Shannon-Fano algorithm have the prefix property: no code is the beginning of any other code. The prefix property guarantees that the encoding is one-to-one. The algorithm for constructing Shannon-Fano codes is presented below:
    1. Split the alphabet into two parts whose total character probabilities are as close to each other as possible;
    2. Append 0 to the prefix codes of the first part of the characters and 1 to the prefix codes of the second part;
    3. For each part containing at least two characters, repeat steps 1-2 recursively.
    Despite its comparative simplicity, the Shannon-Fano algorithm is not without drawbacks, the most significant being non-optimal coding. Although the split at each step is optimal, the algorithm does not guarantee an optimal result as a whole. Consider, for example, the string "AAAABVGDEZH", consisting of four identical characters and six distinct ones. The corresponding Shannon-Fano tree and the codes obtained from it are presented in Fig. 1:

    [Fig. 1. The Shannon-Fano tree for the message "AAAABVGDEZH" and the codes obtained from it.]

    Without encoding, the message occupies 40 bits (assuming each character is encoded with 4 bits), while with the Shannon-Fano algorithm it takes 4*2 + 2 + 4 + 4 + 3 + 3 + 3 = 27 bits. The message size decreased by 32.5%, but it will be shown below that this result can be improved significantly.
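
    For concreteness, below is a minimal sketch of the recursive code construction. The record layout, the procedure signature, and the requirement that the symbols be pre-sorted by descending frequency are implementation assumptions, not part of the original formulation:

    type
      TSymbolInfo = record
        Symbol: char;
        Freq: integer;  // occurrence count in the message
        Code: string;   // accumulated prefix code
      end;
      TSymbolTable = array of TSymbolInfo;

    // Assigns Shannon-Fano codes to Syms[Lo..Hi];
    // Syms is assumed to be sorted by descending frequency.
    procedure ShannonFano(var Syms: TSymbolTable; Lo, Hi: integer);
    var
      Total, Acc, Split, i: integer;
    begin
      if Lo >= Hi then
        exit;
      Total := 0;
      for i := Lo to Hi do
        Total := Total + Syms[i].Freq;
      // Find the split making the totals of the two parts as close as possible
      Acc := 0;
      Split := Lo;
      for i := Lo to Hi - 1 do
      begin
        Acc := Acc + Syms[i].Freq;
        if 2 * Acc >= Total then
        begin
          Split := i;
          break;
        end;
      end;
      // Append 0 to the codes of the first part and 1 to those of the second
      for i := Lo to Split do
        Syms[i].Code := Syms[i].Code + '0';
      for i := Split + 1 to Hi do
        Syms[i].Code := Syms[i].Code + '1';
      ShannonFano(Syms, Lo, Split);
      ShannonFano(Syms, Split + 1, Hi);
    end;

    For the message above (frequency 4 for A and 1 for each of the six remaining characters), the first split puts A together with one unit-frequency character (total 5) against the remaining five (total 5), and the resulting code lengths reproduce the 27-bit total.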

    Huffman Tree Coding


    The Huffman coding algorithm, developed a few years after the Shannon-Fano algorithm, also has the prefix property and, in addition, provably minimal redundancy, which accounts for its extremely wide adoption. Huffman codes are obtained using the following algorithm (a small sketch of the tree construction is given after the list):
    1. All characters of the alphabet are represented as free nodes, while the weight of the node is proportional to the frequency of the symbol in the message;
    2. From the set of free nodes, two nodes with a minimum weight are selected and a new (parent) node is created with a weight equal to the sum of the weights of the selected nodes;
    3. Selected nodes are removed from the free list, and the parent node created on their basis is added to this list;
    4. Steps 2-3 are repeated until only one node remains in the free list;
    5. Based on the constructed tree, each symbol of the alphabet is assigned a prefix code;
    6. The message is encoded with the received codes.
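
    A minimal sketch of steps 1-4 is given below. The node record, the Free flag, and the quadratic minimum search are implementation assumptions chosen for brevity; the caller pre-fills T[0..LeafCount - 1] with the leaves (Symbol, Weight, Free = true, Left = Right = -1):

    type
      THuffNode = record
        Weight: integer;
        Left, Right: integer; // child indices; -1 marks a leaf
        Symbol: char;
        Free: boolean;        // is the node still in the free list?
      end;
      THuffTree = array of THuffNode;

    // Index of the free node with minimal weight, or -1 if none is left
    function MinFree(const T: THuffTree; Count: integer): integer;
    var
      i, r: integer;
    begin
      r := -1;
      for i := 0 to Count - 1 do
        if T[i].Free and ((r < 0) or (T[i].Weight < T[r].Weight)) then
          r := i;
      MinFree := r;
    end;

    // Builds the Huffman tree and returns the index of its root
    function BuildHuffman(var T: THuffTree; LeafCount: integer): integer;
    var
      a, b, Count: integer;
    begin
      Count := LeafCount;
      SetLength(T, 2 * LeafCount - 1); // leaves plus all parent nodes
      while true do
      begin
        a := MinFree(T, Count);
        T[a].Free := false;
        b := MinFree(T, Count);
        if b < 0 then
          break; // a single free node remained: it is the root
        T[b].Free := false;
        // Create the parent with the sum of the two minimal weights
        T[Count].Weight := T[a].Weight + T[b].Weight;
        T[Count].Left := a;
        T[Count].Right := b;
        T[Count].Free := true;
        Count := Count + 1;
      end;
      BuildHuffman := a;
    end;

    For the example message the leaf weights are {4, 1, 1, 1, 1, 1, 1}; the merges create parent nodes of weights 2, 2, 2, 4, 6 and 10, and the sum of the parent weights (26) equals the length of the encoded message computed below.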

    Consider the same example as for the Shannon-Fano algorithm. The Huffman tree and the codes obtained for the message "AAAABVGDEZH" are presented in Fig. 2:

    [Fig. 2. The Huffman tree and codes for the message "AAAABVGDEZH".]

    It is easy to calculate that the volume of the encoded message is 26 bits (the sum of the weights of all parent nodes of the tree: 2 + 2 + 2 + 4 + 6 + 10 = 26), which is less than with the Shannon-Fano algorithm. Separately, it is worth noting that, owing to the popularity of the Huffman algorithm, there are now many variants of Huffman coding, including adaptive coding, which does not require transmitting the symbol frequencies.
    Among the shortcomings of the Huffman algorithm, a significant share are problems related to the complexity of implementation. Using real-valued variables to store the symbol frequencies entails a loss of accuracy, which is why integer variables are often used in practice; but then the weights of the parent nodes grow continuously, and sooner or later an overflow occurs. Thus, despite the simplicity of the algorithm, implementing it correctly can still cause some difficulties, especially for large alphabets.

    Encoding with Secant Function Trees


    Coding with trees of secant functions is an algorithm, developed by the authors, that produces prefix codes. It is based on the idea of building a tree, each node of which contains a secant function. To describe the algorithm in more detail, several definitions are needed.
    A word is an ordered sequence of m bits (the number m is called the word length).
    A secant literal is a pair (bit position, bit value). For example, the literal (4,1) means that bit 4 of the word must equal 1. If the literal's condition is satisfied, the literal is considered true, otherwise false.
    A k-bit secant is a set of k literals. If all the literals are true, the secant function itself is true, otherwise it is false.
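
    Following these definitions, checking a literal is a single bit test, which is what makes the operation in each node cheap. A minimal evaluation sketch (the record layout and function names are illustrative assumptions):

    type
      TLiteral = record
        Bit: byte;   // bit position within the word
        Value: byte; // required value of that bit: 0 or 1
      end;
      TSecant = array of TLiteral;

    // True if the given bit of the word W equals the required value
    function LiteralTrue(W: word; L: TLiteral): boolean;
    begin
      LiteralTrue := ((W shr L.Bit) and 1) = L.Value;
    end;

    // A k-bit secant is true only if all of its k literals are true
    function SecantTrue(W: word; const S: TSecant): boolean;
    var
      i: integer;
    begin
      SecantTrue := true;
      for i := 0 to high(S) do
        if not LiteralTrue(W, S[i]) then
        begin
          SecantTrue := false;
          break;
        end;
    end;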

    The tree is built so that each node splits the alphabet into parts that are as close as possible. An example of a secant tree is shown in Fig. 3:

    [Fig. 3. An example of a tree of secant functions.]

    In general, a tree of secant functions does not guarantee optimal coding, but it provides extremely high speed thanks to the simplicity of the operation performed in each node.

    Arithmetic coding


    Arithmetic coding is one of the most effective ways to compress information. Unlike the Huffman algorithm, arithmetic coding allows messages with entropy of less than 1 bit per character to be encoded. Since most arithmetic coding algorithms are covered by patents, only the basic ideas are described below.
    Suppose the alphabet in use contains N characters a_1, ..., a_N with frequencies p_1, ..., p_N, respectively. Then the arithmetic coding algorithm looks as follows (a minimal sketch of the interval narrowing is given after the list):
    1. Take [0; 1) as the working half-interval;
    2. Divide the working half-interval into N non-overlapping half-intervals, the length of the i-th half-interval being proportional to p_i;
    3. If the end of the message has not been reached, take the half-interval corresponding to the next message character as the new working one and go to step 2. Otherwise, return any number from the working half-interval; its binary representation is the encoded message.
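
    Below is a minimal sketch of the interval narrowing using real arithmetic. The type and names are illustrative; the sketch assumes every message character occurs in Freqs, and practical coders use integer arithmetic with renormalization instead of floating point:

    type
      TSymbolFreq = record
        Symbol: char;
        P: double; // probability (relative frequency) of the symbol
      end;

    function ArithmeticEncode(const Msg: string; const Freqs: array of TSymbolFreq): double;
    var
      Lo, Hi, Range, Acc: double;
      i, j: integer;
    begin
      Lo := 0.0;
      Hi := 1.0;
      for i := 1 to length(Msg) do
      begin
        Range := Hi - Lo;
        Acc := 0.0;
        // Find the sub-interval corresponding to the current character
        for j := 0 to high(Freqs) do
        begin
          if Freqs[j].Symbol = Msg[i] then
          begin
            Hi := Lo + Range * (Acc + Freqs[j].P);
            Lo := Lo + Range * Acc;
            break;
          end;
          Acc := Acc + Freqs[j].P;
        end;
      end;
      // Any number from [Lo; Hi) identifies the message
      ArithmeticEncode := Lo;
    end;
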
    Fig. 4 shows the process of encoding the message "ABAAV".

    [Fig. 4. Encoding the message "ABAAV" by successive narrowing of the working half-interval.]

    When decoding, a similar sequence of actions is performed, except that at each step it is additionally necessary to determine which character was encoded.

    The obvious advantage of arithmetic coding is its efficiency, while the main drawback (apart from patent restrictions) is the extremely high complexity of the encoding and decoding processes.
