Images: formats and compression (1/3)

    Schematic image of PCX, GIF and PNG

    What is more traffic spent on loading a site? Most often these are pictures, and their total “weight” is often several times more than that of markup, scripts and styles. In common image files, raster data is stored in compressed form, and this is significantly better than uncompressed BMP. And if you want even better? Indeed, in fairly large projects, every byte counts (for example, in TradingView, which is really shy there).

    There are many utilities for compressing graphics, from highly specialized to omnipotent combines. The habr already has a wonderful overview of such programs, and the question of how to squeeze the picture is considered in more than detail.

    But howsuch programs work, what can be improved and how to make your own? I invite you to a sightseeing tour of image formats and raster data compression algorithms.

    Middle Ages


    The eighties of the last century became the time of the development of raster graphics. Graphics, as such, has been used before, but now it has become much more accessible, and not least, the gaming industry has influenced it. The Atari 2600 made it possible to draw something more refined than a white rectangle. And Commodore 64 had video memory with real pixels, and it was more convenient to work with it than “ switch the color no later than the 31415th beat ”.

    It’s difficult to draw Mona Lisa on the screen by manually setting pixels using a tricky algorithm. Yes, and it is not necessary, because to pull out a piece of data from the video memory and save it to a file, and then insert it from the file. Such a core dump in BASIC was made by the team BSAVE, and the format itself was named after it BSAVE'd. Editing images became much more convenient, and unpretentious graphic editors flourished in lush color, mostly indistinguishable from each other.

    But some of the editors outgrew childhood and became a very convenient and useful tool. So in 1984, PCPaint appeared, in which you could draw with the mouse. In addition to the obvious user-friendly features, PCPaint provided another advantage. The fact is that the BSAVE dump did not include data on the image size, color depth and the palette, and if there were few video modes (and the color image was imputedly shown in black and white), then for the palette we had to use a separate PAL file. The PCPaint editor's PIC format contained both a palette and a BSAVE dump. This is a small step for the programmer and a giant leap for the whole of mankind. Something like "but let's come up with the MKV format, in which subtitles can be stored inside, and so that you do not need to put them in the right folder".

    But another problem remained unresolved: on the PC, there are BSAVEon Apple BSAVE, but they generate files of different formats. This is not surprising, the internal representation of the picture in memory was different. There were transcoders, but it became clear that for a long time this vendor-dependent mess would not last. In 1984, Truevision introduced the TARGA format, better known as TGA. And in the next 1985, the PC Paintbrush saw the light. Although PC Paintbrush sold worse, its PCX format, portable and fairly simple, lasted longer than PIC.

    In both TGA and PCX, image size and palette data were stored explicitly, without strong binding to hardware. This became possible, because the pixel data no longer depended on a particular platform and was simply a scanline: from left to right, from top to bottom.

    But there was another important feature of these two formats, the raster data in them was stored in a compressed form. The applied RLE algorithm was not the top of efficiency, but it was already quite good.

    RLE


    RLE (Run length encoding) is a fairly simple algorithm, but it shows well how data compression works. Lossless data compression means getting rid of redundancy in them. To do this, take a dataset, find chains of repeating values ​​in them and replace them with something more compact.

    Typically, RLE is translated as “coding of series lengths,” and such duplicate values ​​refer to as “series”. And although I prefer the translation “run”, I can’t do anything, it’s already an established term.

    Most likely, you are already using it. Let's look at the line " AAAAAAAAAAAABBBAAAAAAAAA". If we have to dictate it over the phone, it will sound like “twelve capital letters A, three capital letters B, nine capital letters A”. If you write this down, you get " ", and so that there are no discrepancies, then " ". Much more compact.12A3B9A9A3A3B9A

    Now let's take another line, " 0KXQsNCx0YDQsNGF0LDQsdGA", and try to compress it like this. It will turn out " ...", stop-stop-stop! The string is twice as long as the original, but this is never a compression. We modify the algorithm and add letters to the numbers: the letter A means that the next character is written "as is"; if B, then two; if C, then three, and so on. It turns out " ". So, in the best case, we get a compression of 350%, and in the worst case, we lose only 4%. Of course, in real conditions, bytes are usually encoded, not letters of the Latin alphabet, and sequence lengths are encoded with values ​​from 0 to 255. Plus, usually meaningless values ​​of series lengths are ignored: in our example, 1 and A do the same thing, and 0 in general it makes no sense. But these are details, the essence remains the same.101K1XX0KXQsNCx0YDQsNGF0LDQsdGA



    Entropy


    No matter how one wants to avoid the theory, this thing is too important to ignore. Have you noticed that not all sequences can be compressed? Lines AAAAAAAAAAAABBBAAAAAAAAAand 0KXQsNCx0YDQsNGF0LDQsdGAone length, 24 characters, but in the second these characters are more chaotic, and it is more difficult to compress.

    A measure of such randomness is informational entropy; it is defined as the amount of information in a message.
    Little entropyMoreEven more

    The mere presence of informational entropy suggests that data can only be compressed to a certain length. No further, magic is not involved in the process. And according to Claude Shannon , the movie “Click” is not compressed to a kilobyte.

    To illustrate this, we turn to the fact that at first glance it is not related to IT.
    Fully Solved SudokuSudokuUnsolvable sudoku

    Sudoku on the left contains 81 digits and has already been decided. The one in the middle contains less information, 26 digits, but having solved it, you can restore all the original ones 81.

    But everything is very bad with the sudoku on the right - too much data has been deleted from it, and it is no longer being solved, that is, the original we will not receive the data set.

    Pcx


    But back to PCX and do something that no one has been doing for ten years, create a PCX file, and manually. You need to know your tool, so let's look at the specifications and find out what this format is. Here are the main points:

    • An image can have dimensions up to 65536 × 65536. A palette up to 256 colors, or a 24-bit color without a palette. The header is always 128 bytes in size, contains the image size, the number of colors and a palette of up to 16 colors, one byte per channel. Naturally, the 256-color palette does not fit into such a header, and if it does, it is appended to the end of the file and has a size of 769 bytes (1 byte of signature and 256 × 3 bytes of data). There is also a field for the CGA palette, but it is no longer supported.
    • The raw data stream consists of scanlines - horizontal lines in one letter. Scanline Within each may have up to four bit planes: R, G, B, A. All planes are written sequentially RRRRGGGGBBBBAAAA. The planes consist of a strictly even number of bytes, and, if necessary, placeholders are added to the raw data so that this condition is satisfied. Such placeholder data is ignored during decoding.
    • Raw raster data is compressed line by line using RLE. “Line by line” means that the RLE series cannot extend beyond the bit plane.
    • In RLE, byte values ​​from 192 to 255 encode the number of character repetitions from 0 to 63 times, respectively. The remaining values ​​are literal, if they are not found at the place where the number of repetitions is expected, it is believed that they are repeated once.


    PCX in practice


    Now, let’s take an example to analyze this technical data shmat. Take for example this picture (17 × 8, enlarged 8 times for convenience):

    Old school gradient

    Let's decide on the palette. There are three different colors in the image, so we have palettes of 4, 16 and 256 colors, as well as Truecolor. In a 4-color palette, we will have 4 values ​​in one byte (divide 8 byte bits into 2 bits of the number in the palette); in 16-color - 2 pixels per byte; in 256-color - a pixel per byte (plus 769 bytes of an additional palette); Truecolor has three bytes per pixel. The choice is obvious, 4 colors.

    We arrange the colors, for example, as follows:
    # 000000 0 (00)# 808080 1 (01)# 404040 2 (10)(not used)


    Now we write out the bit values ​​of the first line, starting with the most significant bits. Listing in a quaternary system, delimiters - byte boundaries. It turns out 4.25 bytes. RLE with fractional bytes does not work, we finish up to five. The documentation says that the scanline should have an even number of bytes. There is nothing to do, we are finishing it yet. We do the same with the rest of the lines, we get: Now let's see what values ​​can be compressed. We only note sequences of identical bytes longer than 2, because encoding a series of 2 bytes will take 2 bytes, and there is no point in that. Please note that although the zero bytes at the end of the penultimate line are asked to cling to the last line, this will not work, the border of the scanline does not cross the edges.

    0000 0000 0000 0000 0



    0000 0000 0000 0000 0000



    0000 0000 0000 0000 0000 0000



    0000 0000 0000 0000 0000 0000
    0111 1111 1111 1111 0000 0000
    0121 2121 2121 2121 0000 0000
    0212 1212 1212 1212 0000 0000
    0222 2222 2222 2222 0000 0000
    0202 0202 0202 0202 0000 0000
    0020 2020 2020 2020 0000 0000
    0000 0000 0000 0000 0000 0000



    0000 0000 0000 0000 0000 0000
    0111 1111 1111 1111 0000 0000
    0121 2121 2121 2121 0000 0000
    0212 1212 1212 1212 0000 0000
    0222 2222 2222 2222 0000 0000
    0202 0202 0202 0202 0000 0000
    0020 2020 2020 2020 0000 0000
    0000 0000 0000 0000 0000 0000


    Now the knight's move. The specification does not say anywhere what exactly needs to be done to scanline to an even number of bytes. Anyway, the decoder will throw out these values. Therefore, we can save as much as two bytes by doing this: Encode the resulting RLE. Perhaps it will be more convenient to switch to the more familiar hexadecimal form: We encode the first line. 6 bytes . Repeat 6 times the corresponding value 192 + 6 = 198 or . After it we write what value we are going to repeat 6 times, it turns out . The first scanline is ready. The second scanline begins with . This value can be encoded as (“once ”). But because of the RLE feature in PCX, we can write it just like that, and “once” will be implied.

    0000 0000 0000 0000 0000 0000
    0111 1111 1111 1111 0000 0000
    0121 2121 2121 2121 0000 0000
    0212 1212 1212 1212 0000 0000
    0222 2222 2222 2222 0000 0000
    0202 0202 0202 0202 0202 0202
    0020 2020 2020 2020 0000 0000
    0000 0000 0000 0000 0000 0000



    00 00 00 00 00 00
    15 55 55 55 00 00
    19 99 99 99 00 00
    26 66 66 66 00 00
    2A AA AA AA 00 00
    22 22 22 22 22 22
    08 88 88 88 00 00
    00 00 00 00 00 00

    0x00C60xC6 0x00

    0x150xC1 0x150x150x15

    Next, three identical bytes become . And at the end of two rows of zero bytes that can be written in two ways: on the forehead, or a cunningly . And so and so two bytes. It is unlikely that girls can be impressed with a tricky reception, but there are no other reasons to do things more intricately than required , so we take it simply . By the way, about catchy things. Insert something like , that is, a series of length 0 into the encoded data , and they will not affect the image itself. Wikipedia writes that it can be used for steganography, but as for me, such a stitch is sewn with white thread, and it’s not good at anything other than bloating the file size. Continuing in a similar way, we get:0xC3 0x55

    0x00 0x000xC2 0x000x00 0x00

    0xC0 0x73 0xC0 0x65 0xC0 0x63 0xC0 0x72 0xC0 0x65 0xC0 0x74



    C6 00
    15 C3 55 00 00
    19 C3 99 00 00
    26 C3 66 00 00
    2A C3 AA 00 00
    C6 22
    08 C3 88 00 00
    C6 00

    It remains to append a 128-byte header on top and you're done! And to admire the result, you can run this script . It is on node.js, but I am sure it will not be difficult to rewrite to your favorite language.

    PCX Practice 2: The Palette


    Now let's take another image and transfer it to PCX again, trying to compress as much as possible. This time the picture is smaller, 7 × 5.

    Highly Artistic UFO Image

    This is a UFO. I know that is not very similar.

    First things first, let's choose the color depth: 2 bits. These are four colors, just as much as we need. Define a palette, for example, like this:

    #FFFFFF 0 (00)# 84A7CA 1 (01)# 666666 2 (10)# 404040 3 (11)


    Now the raster data queue. The second time we will not make the tedious process of writing out the digits; we will immediately record the raw data and what came out of them in hexadecimal.

    Raw dataCompressed data
    0F C0
    3A B0
    FF FF
    D9 9C
    3F F0
    0F C1 C0
    3A B0
    C2 FF
    C1 D9 9C
    3F C1 F0


    Oops! The compressed data is larger than the original. This is because the values ​​from 0xC0to 0xFF are markers of the number of repetitions, and they can not be written just like that. Instead, 0xC0I had to substitute , instead  - and so on. In case we were lucky - there we did not lose precious bytes. But in general, a dull picture comes out: now RLE, instead of helping, only gets in the way. In the direction of hopelessness, let's see what can be done with this. Markers are bytes with the two most significant bits equal to one, 11 XXXXXXC1 C00xF00xC1 0xF0

    0xFF 0xFF

    . Data is written sequentially, starting with the most significant bit. However, with a two-bit color depth, pixels at a 4 × n offset affect whether a byte is a marker or not. Namely, the color pixels at number 3 (in our case, dark gray).

    Highlighted columns at offset 0 and 4

    Here they are, the culprits of the proliferation of file sizes. In the third line, dark gray pixels also fall into the selected columns, but they don’t do the weather for us: the line itself, when encoded, will give two identical bytes.

    We determine the order of colors in the palette ourselves, so we choose the color number 3 that is the least found at an offset of 4 × n. Blue is the best candidate; he has never been found in such places at all. Redefine the palette and make a second attempt.

    #FFFFFF 0 (00)# 666666 1 (01)# 404040 2 (10)# 84A7CA 3 (11)

    Raw dataCompressed data
    0A 80
    25 60
    AA AA
    B7 78
    2A A0
    0A 80
    25 60
    AA AA
    B7 78
    2A A0


    The compressed data turned out to be identical to the original. RLE this image was too tough, but at least there was no overhead. One way or another, the raster data is ready, and this is the maximum that you can squeeze.

    Well and demo: UFO generator

    Total


    Once again, briefly on techniques to help reduce PCX weight:

    • The choice of the minimum possible color depth (palette size);
    • Optimization of garbage data;
    • Removing meaningless data (series length 0);
    • Optimization palette.


    To be continued


    Well, that’s probably all. I won’t talk about TGA, although it differs from PCX, there are much more similarities than differences. But there were no other directly noteworthy graphic formats of that time.

    Except, of course, CompuServe's GIF format. We will delve into it the next time.

    Also popular now: