subzey May 21, 2013 at 02:32

Images: formats and compression (1/3)

What is more traffic spent on loading a site? Most often these are pictures, and their total “weight” is often several times more than that of markup, scripts and styles. In common image files, raster data is stored in compressed form, and this is significantly better than uncompressed BMP. And if you want even better? Indeed, in fairly large projects, every byte counts (for example, in TradingView, which is really shy there).

There are many utilities for compressing graphics, from highly specialized to omnipotent combines. The habr already has a wonderful overview of such programs, and the question of how to squeeze the picture is considered in more than detail.

But howsuch programs work, what can be improved and how to make your own? I invite you to a sightseeing tour of image formats and raster data compression algorithms.

Middle Ages

The eighties of the last century became the time of the development of raster graphics. Graphics, as such, has been used before, but now it has become much more accessible, and not least, the gaming industry has influenced it. The Atari 2600 made it possible to draw something more refined than a white rectangle. And Commodore 64 had video memory with real pixels, and it was more convenient to work with it than “ switch the color no later than the 31415th beat ”.

It’s difficult to draw Mona Lisa on the screen by manually setting pixels using a tricky algorithm. Yes, and it is not necessary, because to pull out a piece of data from the video memory and save it to a file, and then insert it from the file. Such a core dump in BASIC was made by the team BSAVE, and the format itself was named after it BSAVE'd. Editing images became much more convenient, and unpretentious graphic editors flourished in lush color, mostly indistinguishable from each other.

But some of the editors outgrew childhood and became a very convenient and useful tool. So in 1984, PCPaint appeared, in which you could draw with the mouse. In addition to the obvious user-friendly features, PCPaint provided another advantage. The fact is that the BSAVE dump did not include data on the image size, color depth and the palette, and if there were few video modes (and the color image was imputedly shown in black and white), then for the palette we had to use a separate PAL file. The PCPaint editor's PIC format contained both a palette and a BSAVE dump. This is a small step for the programmer and a giant leap for the whole of mankind. Something like "but let's come up with the MKV format, in which subtitles can be stored inside, and so that you do not need to put them in the right folder".

But another problem remained unresolved: on the PC, there are BSAVEon Apple BSAVE, but they generate files of different formats. This is not surprising, the internal representation of the picture in memory was different. There were transcoders, but it became clear that for a long time this vendor-dependent mess would not last. In 1984, Truevision introduced the TARGA format, better known as TGA. And in the next 1985, the PC Paintbrush saw the light. Although PC Paintbrush sold worse, its PCX format, portable and fairly simple, lasted longer than PIC.

In both TGA and PCX, image size and palette data were stored explicitly, without strong binding to hardware. This became possible, because the pixel data no longer depended on a particular platform and was simply a scanline: from left to right, from top to bottom.

But there was another important feature of these two formats, the raster data in them was stored in a compressed form. The applied RLE algorithm was not the top of efficiency, but it was already quite good.

RLE

RLE (Run length encoding) is a fairly simple algorithm, but it shows well how data compression works. Lossless data compression means getting rid of redundancy in them. To do this, take a dataset, find chains of repeating values in them and replace them with something more compact.

Typically, RLE is translated as “coding of series lengths,” and such duplicate values refer to as “series”. And although I prefer the translation “run”, I can’t do anything, it’s already an established term.

Most likely, you are already using it. Let's look at the line " AAAAAAAAAAAABBBAAAAAAAAA". If we have to dictate it over the phone, it will sound like “twelve capital letters A, three capital letters B, nine capital letters A”. If you write this down, you get " ", and so that there are no discrepancies, then " ". Much more compact.12A3B9A9A3A3B9A

Now let's take another line, " 0KXQsNCx0YDQsNGF0LDQsdGA", and try to compress it like this. It will turn out " ...", stop-stop-stop! The string is twice as long as the original, but this is never a compression. We modify the algorithm and add letters to the numbers: the letter A means that the next character is written "as is"; if B, then two; if C, then three, and so on. It turns out " ". So, in the best case, we get a compression of 350%, and in the worst case, we lose only 4%. Of course, in real conditions, bytes are usually encoded, not letters of the Latin alphabet, and sequence lengths are encoded with values from 0 to 255. Plus, usually meaningless values of series lengths are ignored: in our example, 1 and A do the same thing, and 0 in general it makes no sense. But these are details, the essence remains the same.101K1XX0KXQsNCx0YDQsNGF0LDQsdGA

Entropy

No matter how one wants to avoid the theory, this thing is too important to ignore. Have you noticed that not all sequences can be compressed? Lines AAAAAAAAAAAABBBAAAAAAAAAand 0KXQsNCx0YDQsNGF0LDQsdGAone length, 24 characters, but in the second these characters are more chaotic, and it is more difficult to compress.

A measure of such randomness is informational entropy; it is defined as the amount of information in a message.


Little entropy	More	Even more

The mere presence of informational entropy suggests that data can only be compressed to a certain length. No further, magic is not involved in the process. And according to Claude Shannon , the movie “Click” is not compressed to a kilobyte.

To illustrate this, we turn to the fact that at first glance it is not related to IT.

Sudoku on the left contains 81 digits and has already been decided. The one in the middle contains less information, 26 digits, but having solved it, you can restore all the original ones 81.

But everything is very bad with the sudoku on the right - too much data has been deleted from it, and it is no longer being solved, that is, the original we will not receive the data set.

Pcx

But back to PCX and do something that no one has been doing for ten years, create a PCX file, and manually. You need to know your tool, so let's look at the specifications and find out what this format is. Here are the main points:

An image can have dimensions up to 65536 × 65536. A palette up to 256 colors, or a 24-bit color without a palette. The header is always 128 bytes in size, contains the image size, the number of colors and a palette of up to 16 colors, one byte per channel. Naturally, the 256-color palette does not fit into such a header, and if it does, it is appended to the end of the file and has a size of 769 bytes (1 byte of signature and 256 × 3 bytes of data). There is also a field for the CGA palette, but it is no longer supported.
The raw data stream consists of scanlines - horizontal lines in one letter. Scanline Within each may have up to four bit planes: R, G, B, A. All planes are written sequentially RRRRGGGGBBBBAAAA. The planes consist of a strictly even number of bytes, and, if necessary, placeholders are added to the raw data so that this condition is satisfied. Such placeholder data is ignored during decoding.
Raw raster data is compressed line by line using RLE. “Line by line” means that the RLE series cannot extend beyond the bit plane.
In RLE, byte values from 192 to 255 encode the number of character repetitions from 0 to 63 times, respectively. The remaining values are literal, if they are not found at the place where the number of repetitions is expected, it is believed that they are repeated once.

PCX in practice

Now, let’s take an example to analyze this technical data shmat. Take for example this picture (17 × 8, enlarged 8 times for convenience):

Old school gradient

Let's decide on the palette. There are three different colors in the image, so we have palettes of 4, 16 and 256 colors, as well as Truecolor. In a 4-color palette, we will have 4 values in one byte (divide 8 byte bits into 2 bits of the number in the palette); in 16-color - 2 pixels per byte; in 256-color - a pixel per byte (plus 769 bytes of an additional palette); Truecolor has three bytes per pixel. The choice is obvious, 4 colors.

We arrange the colors, for example, as follows:

Now we write out the bit values of the first line, starting with the most significant bits. Listing in a quaternary system, delimiters - byte boundaries. It turns out 4.25 bytes. RLE with fractional bytes does not work, we finish up to five. The documentation says that the scanline should have an even number of bytes. There is nothing to do, we are finishing it yet. We do the same with the rest of the lines, we get: Now let's see what values can be compressed. We only note sequences of identical bytes longer than 2, because encoding a series of 2 bytes will take 2 bytes, and there is no point in that. Please note that although the zero bytes at the end of the penultimate line are asked to cling to the last line, this will not work, the border of the scanline does not cross the edges.

0000 0000 0000 0000 0

0000 0000 0000 0000 0000

0000 0000 0000 0000 0000 0000

0000 0000 0000 0000 0000 0000
0111 1111 1111 1111 0000 0000
0121 2121 2121 2121 0000 0000
0212 1212 1212 1212 0000 0000
0222 2222 2222 2222 0000 0000
0202 0202 0202 0202 0000 0000
0020 2020 2020 2020 0000 0000
0000 0000 0000 0000 0000 0000

0000 0000 0000 0000 0000 0000
0111 1111 1111 1111 0000 0000
0121 2121 2121 2121 0000 0000
0212 1212 1212 1212 0000 0000
0222 2222 2222 2222 0000 0000
0202 0202 0202 0202 0000 0000
0020 2020 2020 2020 0000 0000
0000 0000 0000 0000 0000 0000

Now the knight's move. The specification does not say anywhere what exactly needs to be done to scanline to an even number of bytes. Anyway, the decoder will throw out these values. Therefore, we can save as much as two bytes by doing this: Encode the resulting RLE. Perhaps it will be more convenient to switch to the more familiar hexadecimal form: We encode the first line. 6 bytes . Repeat 6 times the corresponding value 192 + 6 = 198 or . After it we write what value we are going to repeat 6 times, it turns out . The first scanline is ready. The second scanline begins with . This value can be encoded as (“once ”). But because of the RLE feature in PCX, we can write it just like that, and “once” will be implied.

0000 0000 0000 0000 0000 0000
0111 1111 1111 1111 0000 0000
0121 2121 2121 2121 0000 0000
0212 1212 1212 1212 0000 0000
0222 2222 2222 2222 0000 0000
0202 0202 0202 0202 0202 0202
0020 2020 2020 2020 0000 0000
0000 0000 0000 0000 0000 0000

00 00 00 00 00 00
15 55 55 55 00 00
19 99 99 99 00 00
26 66 66 66 00 00
2A AA AA AA 00 00
22 22 22 22 22 22
08 88 88 88 00 00
00 00 00 00 00 00

0x00C60xC6 0x00

0x150xC1 0x150x150x15

Next, three identical bytes become . And at the end of two rows of zero bytes that can be written in two ways: on the forehead, or a cunningly . And so and so two bytes. It is unlikely that girls can be impressed with a tricky reception, but there are no other reasons to do things more intricately than required , so we take it simply . By the way, about catchy things. Insert something like , that is, a series of length 0 into the encoded data , and they will not affect the image itself. Wikipedia writes that it can be used for steganography, but as for me, such a stitch is sewn with white thread, and it’s not good at anything other than bloating the file size. Continuing in a similar way, we get:0xC3 0x55

0x00 0x000xC2 0x000x00 0x00

0xC0 0x73 0xC0 0x65 0xC0 0x63 0xC0 0x72 0xC0 0x65 0xC0 0x74

C6 00
15 C3 55 00 00
19 C3 99 00 00
26 C3 66 00 00
2A C3 AA 00 00
C6 22
08 C3 88 00 00
C6 00

It remains to append a 128-byte header on top and you're done! And to admire the result, you can run this script . It is on node.js, but I am sure it will not be difficult to rewrite to your favorite language.

PCX Practice 2: The Palette

Now let's take another image and transfer it to PCX again, trying to compress as much as possible. This time the picture is smaller, 7 × 5.

Highly Artistic UFO Image

This is a UFO. I know that is not very similar.

First things first, let's choose the color depth: 2 bits. These are four colors, just as much as we need. Define a palette, for example, like this:

Now the raster data queue. The second time we will not make the tedious process of writing out the digits; we will immediately record the raw data and what came out of them in hexadecimal.

Raw data	Compressed data
`0F C0` `3A B0` `FF FF` `D9 9C` `3F F0`	`0F` `C1` `C0` `3A B0` `C2` `FF` `C1` `D9 9C` `3F` `C1` `F0`

Oops! The compressed data is larger than the original. This is because the values from 0xC0to 0xFF are markers of the number of repetitions, and they can not be written just like that. Instead, 0xC0I had to substitute , instead - and so on. In case we were lucky - there we did not lose precious bytes. But in general, a dull picture comes out: now RLE, instead of helping, only gets in the way. In the direction of hopelessness, let's see what can be done with this. Markers are bytes with the two most significant bits equal to one, 11 XXXXXXC1 C00xF00xC1 0xF0

0xFF 0xFF

. Data is written sequentially, starting with the most significant bit. However, with a two-bit color depth, pixels at a 4 × n offset affect whether a byte is a marker or not. Namely, the color pixels at number 3 (in our case, dark gray).

Highlighted columns at offset 0 and 4

Here they are, the culprits of the proliferation of file sizes. In the third line, dark gray pixels also fall into the selected columns, but they don’t do the weather for us: the line itself, when encoded, will give two identical bytes.

We determine the order of colors in the palette ourselves, so we choose the color number 3 that is the least found at an offset of 4 × n. Blue is the best candidate; he has never been found in such places at all. Redefine the palette and make a second attempt.

Raw data	Compressed data
`0A 80` `25 60` `AA AA` `B7 78` `2A A0`	`0A 80` `25 60` `AA AA` `B7 78` `2A A0`

The compressed data turned out to be identical to the original. RLE this image was too tough, but at least there was no overhead. One way or another, the raster data is ready, and this is the maximum that you can squeeze.

Well and demo: UFO generator

Total

Once again, briefly on techniques to help reduce PCX weight:

The choice of the minimum possible color depth (palette size);
Optimization of garbage data;
Removing meaningless data (series length 0);
Optimization palette.

To be continued

Well, that’s probably all. I won’t talk about TGA, although it differs from PCX, there are much more similarities than differences. But there were no other directly noteworthy graphic formats of that time.

Except, of course, CompuServe's GIF format. We will delve into it the next time.

Tags: