
Images: formats and compression (1/3)

What is more traffic spent on loading a site? Most often these are pictures, and their total “weight” is often several times more than that of markup, scripts and styles. In common image files, raster data is stored in compressed form, and this is significantly better than uncompressed BMP. And if you want even better? Indeed, in fairly large projects, every byte counts (for example, in TradingView, which is really shy there).
There are many utilities for compressing graphics, from highly specialized to omnipotent combines. The habr already has a wonderful overview of such programs, and the question of how to squeeze the picture is considered in more than detail.
But howsuch programs work, what can be improved and how to make your own? I invite you to a sightseeing tour of image formats and raster data compression algorithms.
Middle Ages
The eighties of the last century became the time of the development of raster graphics. Graphics, as such, has been used before, but now it has become much more accessible, and not least, the gaming industry has influenced it. The Atari 2600 made it possible to draw something more refined than a white rectangle. And Commodore 64 had video memory with real pixels, and it was more convenient to work with it than “ switch the color no later than the 31415th beat ”.
It’s difficult to draw Mona Lisa on the screen by manually setting pixels using a tricky algorithm. Yes, and it is not necessary, because to pull out a piece of data from the video memory and save it to a file, and then insert it from the file. Such a core dump in BASIC was made by the team
BSAVE
, and the format itself was named after it BSAVE'd. Editing images became much more convenient, and unpretentious graphic editors flourished in lush color, mostly indistinguishable from each other.But some of the editors outgrew childhood and became a very convenient and useful tool. So in 1984, PCPaint appeared, in which you could draw with the mouse. In addition to the obvious user-friendly features, PCPaint provided another advantage. The fact is that the BSAVE dump did not include data on the image size, color depth and the palette, and if there were few video modes (and the color image was imputedly shown in black and white), then for the palette we had to use a separate PAL file. The PCPaint editor's PIC format contained both a palette and a BSAVE dump. This is a small step for the programmer and a giant leap for the whole of mankind. Something like "but let's come up with the MKV format, in which subtitles can be stored inside, and so that you do not need to put them in the right folder".
But another problem remained unresolved: on the PC, there are
BSAVE
on Apple BSAVE
, but they generate files of different formats. This is not surprising, the internal representation of the picture in memory was different. There were transcoders, but it became clear that for a long time this vendor-dependent mess would not last. In 1984, Truevision introduced the TARGA format, better known as TGA. And in the next 1985, the PC Paintbrush saw the light. Although PC Paintbrush sold worse, its PCX format, portable and fairly simple, lasted longer than PIC.In both TGA and PCX, image size and palette data were stored explicitly, without strong binding to hardware. This became possible, because the pixel data no longer depended on a particular platform and was simply a scanline: from left to right, from top to bottom.
But there was another important feature of these two formats, the raster data in them was stored in a compressed form. The applied RLE algorithm was not the top of efficiency, but it was already quite good.
RLE
RLE (Run length encoding) is a fairly simple algorithm, but it shows well how data compression works. Lossless data compression means getting rid of redundancy in them. To do this, take a dataset, find chains of repeating values in them and replace them with something more compact.
Typically, RLE is translated as “coding of series lengths,” and such duplicate values refer to as “series”. And although I prefer the translation “run”, I can’t do anything, it’s already an established term.
Most likely, you are already using it. Let's look at the line "
AAAAAAAAAAAABBBAAAAAAAAA
". If we have to dictate it over the phone, it will sound like “twelve capital letters A, three capital letters B, nine capital letters A”. If you write this down, you get " ", and so that there are no discrepancies, then " ". Much more compact.12
A
3
B
9
A
9
A
3
A
3
B
9
A
Now let's take another line, "
0KXQsNCx0YDQsNGF0LDQsdGA
", and try to compress it like this. It will turn out " ...", stop-stop-stop! The string is twice as long as the original, but this is never a compression. We modify the algorithm and add letters to the numbers: the letter A means that the next character is written "as is"; if B, then two; if C, then three, and so on. It turns out " ". So, in the best case, we get a compression of 350%, and in the worst case, we lose only 4%.
Of course, in real conditions, bytes are usually encoded, not letters of the Latin alphabet, and sequence lengths are encoded with values from 0 to 255. Plus, usually meaningless values of series lengths are ignored: in our example, 1 and A do the same thing, and 0 in general it makes no sense. But these are details, the essence remains the same.1
0
1
K
1
X
X
0KXQsNCx0YDQsNGF0LDQsdGA
Entropy
No matter how one wants to avoid the theory, this thing is too important to ignore. Have you noticed that not all sequences can be compressed? Lines
AAAAAAAAAAAABBBAAAAAAAAA
and 0KXQsNCx0YDQsNGF0LDQsdGA
one length, 24 characters, but in the second these characters are more chaotic, and it is more difficult to compress. A measure of such randomness is informational entropy; it is defined as the amount of information in a message.
![]() | ![]() | ![]() |
---|---|---|
Little entropy | More | Even more |
The mere presence of informational entropy suggests that data can only be compressed to a certain length. No further, magic is not involved in the process. And according to Claude Shannon , the movie “Click” is not compressed to a kilobyte.
To illustrate this, we turn to the fact that at first glance it is not related to IT.
![]() | ![]() | ![]() |
---|
Sudoku on the left contains 81 digits and has already been decided. The one in the middle contains less information, 26 digits, but having solved it, you can restore all the original ones 81.
But everything is very bad with the sudoku on the right - too much data has been deleted from it, and it is no longer being solved, that is, the original we will not receive the data set.
Pcx
But back to PCX and do something that no one has been doing for ten years, create a PCX file, and manually. You need to know your tool, so let's look at the specifications and find out what this format is. Here are the main points:
- An image can have dimensions up to 65536 × 65536. A palette up to 256 colors, or a 24-bit color without a palette. The header is always 128 bytes in size, contains the image size, the number of colors and a palette of up to 16 colors, one byte per channel. Naturally, the 256-color palette does not fit into such a header, and if it does, it is appended to the end of the file and has a size of 769 bytes (1 byte of signature and 256 × 3 bytes of data). There is also a field for the CGA palette, but it is no longer supported.
- The raw data stream consists of scanlines - horizontal lines in one letter. Scanline Within each may have up to four bit planes: R, G, B, A. All planes are written sequentially
RRRRGGGGBBBBAAAA
. The planes consist of a strictly even number of bytes, and, if necessary, placeholders are added to the raw data so that this condition is satisfied. Such placeholder data is ignored during decoding. - Raw raster data is compressed line by line using RLE. “Line by line” means that the RLE series cannot extend beyond the bit plane.
- In RLE, byte values from 192 to 255 encode the number of character repetitions from 0 to 63 times, respectively. The remaining values are literal, if they are not found at the place where the number of repetitions is expected, it is believed that they are repeated once.
PCX in practice
Now, let’s take an example to analyze this technical data shmat. Take for example this picture (17 × 8, enlarged 8 times for convenience):

Let's decide on the palette. There are three different colors in the image, so we have palettes of 4, 16 and 256 colors, as well as Truecolor. In a 4-color palette, we will have 4 values in one byte (divide 8 byte bits into 2 bits of the number in the palette); in 16-color - 2 pixels per byte; in 256-color - a pixel per byte (plus 769 bytes of an additional palette); Truecolor has three bytes per pixel. The choice is obvious, 4 colors.
We arrange the colors, for example, as follows:
![]() | ![]() | ![]() | ![]() |
---|
Now we write out the bit values of the first line, starting with the most significant bits. Listing in a quaternary system, delimiters - byte boundaries. It turns out 4.25 bytes. RLE with fractional bytes does not work, we finish up to five. The documentation says that the scanline should have an even number of bytes. There is nothing to do, we are finishing it yet. We do the same with the rest of the lines, we get: Now let's see what values can be compressed. We only note sequences of identical bytes longer than 2, because encoding a series of 2 bytes will take 2 bytes, and there is no point in that. Please note that although the zero bytes at the end of the penultimate line are asked to cling to the last line, this will not work, the border of the scanline does not cross the edges.
0000 0000 0000 0000 0
0000 0000 0000 0000 0
000
0000 0000 0000 0000 0
000 0000
0000 0000 0000 0000 0
000 0000
0111 1111 1111 1111 0
000 0000
0121 2121 2121 2121 0
000 0000
0212 1212 1212 1212 0
000 0000
0222 2222 2222 2222 0
000 0000
0202 0202 0202 0202 0
000 0000
0020 2020 2020 2020 0
000 0000
0000 0000 0000 0000 0
000 0000
0000 0000 0000 0000 0
000 0000
0111
1111 1111 1111
0
000 0000
0121
2121 2121 2121
0
000 0000
0212
1212 1212 1212
0
000 0000
0222
2222 2222 2222
0
000 0000
0202 0202 0202 0202
0
000 0000
0020
2020 2020 2020
0
000 0000
0000 0000 0000 0000 0
000 0000
Now the knight's move. The specification does not say anywhere what exactly needs to be done to scanline to an even number of bytes. Anyway, the decoder will throw out these values. Therefore, we can save as much as two bytes by doing this: Encode the resulting RLE. Perhaps it will be more convenient to switch to the more familiar hexadecimal form: We encode the first line. 6 bytes . Repeat 6 times the corresponding value 192 + 6 = 198 or . After it we write what value we are going to repeat 6 times, it turns out . The first scanline is ready. The second scanline begins with . This value can be encoded as (“once ”). But because of the RLE feature in PCX, we can write it just like that, and “once” will be implied.
0000 0000 0000 0000 0
000 0000
0111
1111 1111 1111
0
000 0000
0121
2121 2121 2121
0
000 0000
0212
1212 1212 1212
0
000 0000
0222
2222 2222 2222
0
000 0000
0202 0202 0202 0202 0
202 0202
0020
2020 2020 2020
0
000 0000
0000 0000 0000 0000 0
000 0000
00 00 00 00 00 00
15
55 55 55
00 00
19
99 99 99
00 00
26
66 66 66
00 00
2A
AA AA AA
00 00
22 22 22 22 22 22
08
88 88 88
00 00
00 00 00 00 00 00
0x00
C6
0xC6
0x00
0x15
0xC1
0x15
0x15
0x15
Next, three identical bytes become . And at the end of two rows of zero bytes that can be written in two ways: on the forehead, or a cunningly . And so and so two bytes. It is unlikely that girls can be impressed with a tricky reception, but there are no other reasons to do things more intricately than required , so we take it simply . By the way, about catchy things. Insert something like , that is, a series of length 0 into the encoded data , and they will not affect the image itself. Wikipedia writes that it can be used for steganography, but as for me, such a stitch is sewn with white thread, and it’s not good at anything other than bloating the file size. Continuing in a similar way, we get:
0xC3
0x55
0x00 0x00
0xC2
0x00
0x00 0x00
0xC0
0x73
0xC0
0x65
0xC0
0x63
0xC0
0x72
0xC0
0x65
0xC0
0x74
C6
00
15
C3
55 00 00
19
C3
99 00 00
26
C3
66 00 00
2A
C3
AA 00 00
C6
22
08
C3
88 00 00
C6
00
It remains to append a 128-byte header on top and you're done! And to admire the result, you can run this script . It is on node.js, but I am sure it will not be difficult to rewrite to your favorite language.
PCX Practice 2: The Palette
Now let's take another image and transfer it to PCX again, trying to compress as much as possible. This time the picture is smaller, 7 × 5.

This is a UFO. I know that is not very similar.
First things first, let's choose the color depth: 2 bits. These are four colors, just as much as we need. Define a palette, for example, like this:
![]() | ![]() | ![]() | ![]() |
---|
Now the raster data queue. The second time we will not make the tedious process of writing out the digits; we will immediately record the raw data and what came out of them in hexadecimal.
Raw data | Compressed data |
0F C0 3A B0 FF FF D9 9C 3F F0 | 0F C1 C0 3A B0 C2 FF C1 D9 9C 3F C1 F0 |
Oops! The compressed data is larger than the original. This is because the values from
0xC0
to 0xFF
are markers of the number of repetitions, and they can not be written just like that. Instead, 0xC0
I had to substitute , instead - and so on.
In case we were lucky - there we did not lose precious bytes. But in general, a dull picture comes out: now RLE, instead of helping, only gets in the way.
In the direction of hopelessness, let's see what can be done with this. Markers are bytes with the two most significant bits equal to one, 11 XXXXXXC1
C0
0xF0
0xC1
0xF0
0xFF 0xFF
. Data is written sequentially, starting with the most significant bit. However, with a two-bit color depth, pixels at a 4 × n offset affect whether a byte is a marker or not. Namely, the color pixels at number 3 (in our case, dark gray).

Here they are, the culprits of the proliferation of file sizes. In the third line, dark gray pixels also fall into the selected columns, but they don’t do the weather for us: the line itself, when encoded, will give two identical bytes.
We determine the order of colors in the palette ourselves, so we choose the color number 3 that is the least found at an offset of 4 × n. Blue is the best candidate; he has never been found in such places at all. Redefine the palette and make a second attempt.
![]() | ![]() | ![]() | ![]() |
---|
Raw data | Compressed data |
0A 80 25 60 AA AA B7 78 2A A0 | 0A 80 25 60 AA AA B7 78 2A A0 |
The compressed data turned out to be identical to the original. RLE this image was too tough, but at least there was no overhead. One way or another, the raster data is ready, and this is the maximum that you can squeeze.
Well and demo: UFO generator
Total
Once again, briefly on techniques to help reduce PCX weight:
- The choice of the minimum possible color depth (palette size);
- Optimization of garbage data;
- Removing meaningless data (series length 0);
- Optimization palette.
To be continued
Well, that’s probably all. I won’t talk about TGA, although it differs from PCX, there are much more similarities than differences. But there were no other directly noteworthy graphic formats of that time.
Except, of course, CompuServe's GIF format. We will delve into it the next time.