How JPEG format works

Original author: Omar Shehata
  • Transfer

JPEG images are found everywhere in our digital life, but behind this cover of awareness lies algorithms that eliminate details that are not perceived by the human eye. The result is the highest visual quality with the smallest file size - but how exactly does it all work? Let's see what exactly our eyes do not see!

It’s easy to take, for granted, the ability to send a photo to a friend, and not worry about which device, browser, or operating system it uses — however, this has not always been the case. By the early 1980s, computers knew how to store and display digital images, but there were many competing ideas about the best way to do this. It was impossible to simply send an image from one computer to another and hope that everything worked.

To solve this problem, in 1986 a committee of experts from around the world was assembled under the name " Joint Group of Experts in Photography"(Joint Photographic Experts Group, JPEG), founded in the framework of the joint work of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) - two international standardization organizations, headquartered in Geneva (Switzerland).

A group of people under JPEG created the standard for compressing JPEG digital images in 1992. Anyone using the Internet has probably come across JPEG encoded images, the most common way to encode, send, and store images. From email to social networks, JPEG is used billions of times a day - almost every time we watch an image online or send it in. Without JPEG, the web would be less vibrant, slower, and probably there would be less pictures of cats!

This article is about how to decode a JPEG image. In other words, about what is required to convert compressed data stored on a computer into an image that appears on the screen. It is worth knowing about this, not only because it is important for understanding the technology that we use daily, but also because when disclosing compression levels, we better understand perception and vision, as well as the details to which our eyes are most sensitive.

In addition, playing with images in this way is very interesting.

Looking inside a jpeg

On a computer, everything is stored as a sequence of binary numbers. Typically, these bits, zeros and ones, are grouped in eight, making up bytes. When you open a JPEG image on a computer, something (browser, OS, something else) should decode the bytes, restoring the original image in the form of a list of colors that can be displayed.

If you download this cute cat photo and open it in a text editor, you will see a bunch of incoherent characters.

Here I use Notepad ++ to examine the contents of a file, as regular text editors such as Notepad from Windows will corrupt the binary after saving and it will no longer satisfy the JPEG format.

When you open an image in a text editor, you confuse the computer, just like you confuse your brain when you rub your eyes and begin to see colored spots!

These spots that you see are known as phosphenes., and are not the result of exposure to a light stimulus or hallucinations generated by the mind. They occur because your brain believes that any electrical signals in the optic nerves transmit information about the light. The brain needs to make such assumptions, because it is impossible to know whether the signal is sound, vision, or anything else. All nerves in the body transmit exactly the same electrical impulses. By pressing on the eyes, you send signals that are not visual, but activate the receptors of the eye that your brain interprets - in this case, incorrectly - as something visual. You are literally able to see the pressure!

It is funny to think about how computers look like a brain, but it is also a useful analogy that illustrates how strongly the significance of data - transmitted through the body by nerves, or stored on a computer - depends on their interpretation. All binary data consists of zeros and ones, basic components that can transmit information of any kind. Your computer often guesses how to interpret them using hints, such as file extensions. And now we are forcing him to interpret them as text, as this is what the text editor expects.

To understand how to decode JPEG, we need to see the original signals themselves - binary data. This can be done using the hex editor, or directly on the web page of the original article! There is an image, next to which in the text box are all its bytes (except the header), presented in decimal form. You can change them, and the script will recode and produce a new image on the fly.

You can learn a lot just by playing with this editor. For example, can you tell in what order the pixels are stored?

In this example, it is strange that changing some numbers does not affect the image at all, but, for example, if you replace the number 17 with 0 in the first line, the photo will completely go bad!

Other changes, for example, replacing 7 on line 1988 with the number 254 changes the color, but only of subsequent pixels.

Perhaps the strangest thing is that some numbers change not only the color, but also the shape of the image. Change 70 on line 12 to 2 and look at the top row of the image to see what I mean.

And no matter what JPEG image you use, you will always find these cryptic chess sequences when editing bytes.

When playing with the editor, it’s hard to understand how the photo is recreated from these bytes, since JPEG compression consists of three different technologies that are applied sequentially across the levels. We will study each of them separately to reveal the mysterious behavior that we observe.

Three levels of JPEG compression:

  1. Color downsampling .
  2. Discrete cosine transform and discretization .
  3. Encoding Series Lengths , Delta and Huffman

So that you can imagine the scale of compression, please note that the image above represents 79,819 numbers, that is, about 79 Kb. If we stored it without compression, for each pixel three numbers would be required - for the red, green and blue component. That would be 917,700 numbers, or approx. 917 Kb. As a result of JPEG compression, the resulting file has decreased by more than 10 times!

In fact, this image can be compressed much more strongly. Below are two images side by side - the photo on the right was shrunk to 16 Kb, that is 57 times smaller than the uncompressed version!

If you look closely, you will see that these images are not identical. Both of them are pictures with JPEG compression, but the right one is much smaller in volume. It also looks a little worse (look at the squares of the background colors). Therefore, JPEG is also called lossy compression; during compression, the image changes and loses some details.

1. Color downsampling

Here is an image using only the first compression level.

(Interactive version - in the original article). Removing one number destroys all colors. However, if you delete exactly six numbers, this practically does not affect the image.

Now the numbers are a little easier to decipher. This is an almost simple list of colors in which each byte changes exactly one pixel, but at the same time it is already half the size of an uncompressed image (which would occupy about 300 Kb in such a reduced size). Guess why?

You can see that these numbers do not denote the standard red, green, and blue components, because if we replace all the numbers with zeros, we get a green image (and not white).

This is because these bytes indicate Y (brightness),

Cb (relative blueness),

and Cr (relative redness) pictures.

Why not use RGB? Indeed, this is how most modern screens work. Your monitor can show any color, including red, green and blue with different intensities for each pixel. White is obtained by turning on all three at full brightness, and black - by turning them off.

It is also very similar to the work of the human eye. The color receptors in our eyes are called cones.", and are divided into three types, each of which is more sensitive to either red, or green, or blue [S-type cones are sensitive in violet-blue (S from the English Short - short-wave spectrum), M-type - in green-yellow (M from the English Medium - medium-wave), and L-type - in the yellow-red (L from the English. Long - long-wave) parts of the spectrum. The presence of these three types of cones (and rods sensitive in the emerald green spectrum) gives a person the color vision. / approx. Trans.]. The sticks , a different type of photoreceptors in our eyes can detect t lko changes in brightness, but they are much more sensitive. In our eyes, there are about 120 million rods and around six million cones.

Therefore, our eyes notice changes in brightness much better than changes in color. If you separate the color from the brightness, you can remove a little color, and no one will notice anything. Color downsampling is the process of representing the color components of an image at a lower resolution than the brightness components. In the example above, each pixel has exactly one Y component, and each individual group of four pixels has exactly one Cb component and one Cr. Therefore, the image contains four times less color information than the original.

YCbCr color space is used not only in JPEG. It was originally invented in 1938 for television shows. Not everyone has a color TV, so the separation of color and brightness allowed everyone to receive the same signal, and TVs without color just used only the brightness component.

Therefore, deleting one number from the editor completely destroys all colors. Components are stored as YYYY Cb Cr (in fact, not necessarily in that order - the storage order is specified in the file header). Removing the first number will lead to the fact that the first Cb value will be perceived as Y, Cr as Cb, and as a whole a domino effect will turn out, switching all the colors of the picture.

The JPEG specification does not oblige you to use YCbCr. But in most files it is used because it gives better quality images after downsampling compared to RGB. But you don’t have to take my word for it. See for yourself in the table below how the downsampling of each individual component will look like in RGB and in YCbCr.

(Interactive version - in the original article).

Removing blue is not as noticeable as red or green. This is because of the six million cones in your eyes, about 64% are sensitive to red, 32% to green and 2% to blue.

Sub-sampling of component Y (bottom left) is best seen. Noticeable is even a slight change.

Converting an image from RGB to YCbCr does not reduce file size, but makes it easier to find less noticeable details that can be removed. Lossy compression occurs in the second stage. It is based on the idea of ​​representing data in a more compressible form.

2. Discrete cosine transform and discretization

This compression level for the most part determines the essence of JPEG. After converting colors to YCbCr, the components are compressed separately, so we can concentrate on the Y component only later. And this is how the bytes of the Y component look after applying this level.

(Interactive version - in the original article). In the interactive version, clicking on a pixel scrolls the editor to the line that indicates it. Try removing numbers from the end or adding a few zeros to a specific number.

At first glance, it looks like a very bad compression. There are 100,000 pixels in the image, and 102,400 numbers are required to indicate their brightness (Y-components) - this is worse than not compressing anything at all!

However, note that most of these numbers are zero. Moreover, all these zeros at the end of lines can be deleted without changing the image. There are about 26,000 numbers left, and this is almost 4 times less!

At this level is the secret of chess patterns. Unlike other effects that we saw, the appearance of these patterns is not a glitch. They are the building blocks of the whole image. Each line of the editor contains exactly 64 numbers, discrete cosine transform (DCT) coefficients corresponding to intensities of 64 unique patterns.

These patterns are formed on the basis of the cosine graph. Here's what some of them look like:

8 out of 64 coefficients

Below is an image showing all 64 patterns.

(Interactive version - in the originalarticles).

These patterns are of particular importance since they form the basis of 8x8 images. If you are unfamiliar with linear algebra, this means that any 8x8 image can be obtained from these 64 patterns. DCT is the process of dividing images into 8x8 blocks and converting each block into a combination of these 64 coefficients.

The fact that any image can be composed of 64 specific patterns seems magic. However, this is the same as saying that any place on Earth can be described by two numbers - latitude and longitude [indicating hemispheres / approx. transl.]. We often consider the Earth's surface two-dimensional, so we need only two numbers. An 8x8 image has 64 dimensions, so we need 64 numbers.

It is not yet clear how this helps us in the sense of compression. If we need 64 numbers to represent an 8x8 image, why would this be better than just storing 64 luminance components? We do this for the same reason that we turned three RGB numbers into three YCbCr numbers: this allows us to remove invisible details.

It’s hard to see exactly which details are removed at this stage, because JPEG applies DCT to 8x8 blocks. However, no one forbids us to apply it to the whole picture. Here's what DCT looks like in component Y when applied to the whole picture:

From the end, you can delete more than 60,000 numbers with virtually no noticeable changes in the picture.

However, note that if we reset the first five numbers, the difference will be obvious.

The numbers at the beginning indicate low frequency changes in the image, and our eyes pick them up best. The numbers near the end indicate changes in high frequencies that are more difficult to notice. To “see what is not visible with the eye,” we can isolate these high-frequency parts by zeroing out the first 5000 numbers.

We see all areas of the image in which the largest change occurs from pixel to pixel. The cat's eyes, his mustache, terry blanket and shadows in the lower left corner stand out. You can go further by clearing the first 10,000 numbers:




These high-frequency parts are jpeg and deletes during the compression phase. Converting colors to DCT coefficients is lossless. Losses are generated at the discretization step, where high frequency values ​​or close to zero are removed. When you lower the quality of saving JPEG, the program increases the threshold for the number of deleted values, which gives a reduction in file size, but makes the picture more pixelated. Therefore, the image in the first section, which was 57 times smaller, looked like this. Each 8x8 block seemed to have a much smaller number of DCT coefficients compared to a better version.

You can make a cool effect like gradual streaming of images. You can display a blurry picture, which becomes more detailed as more and more coefficients are downloaded.

Here, just for fun, what happens when you use only 24,000 numbers:

Or just 5,000:

Very blurry, but as if recognizable!

3. Coding of series lengths, delta and Huffman

So far, all stages of compression have gone with losses. The last stage, on the contrary, goes without loss. It does not delete information, but significantly reduces the file size.

How can you compress something without discarding information? Imagine how we would describe a simple black rectangle of 700 x 437.

JPEG uses 5000 numbers for this, but a much better result can be achieved. Can you imagine a coding scheme that would describe such an image with as few bytes as possible?

The minimal scheme that I could come up with uses four: three to indicate color, and the fourth - how many pixels have that color. The idea of ​​representing repeated values ​​in such a compressed way is called series length coding. It has no losses, since we can restore the encoded data in its original form.

The size of a JPEG file with a black rectangle is much larger than 4 bytes - remember that at the DCT level, compression is applied to blocks of 8x8 pixels. Therefore, at least we need one DCT coefficient for every 64 pixels. We need one because instead of storing a single DCT coefficient followed by 63 zeros, encoding the lengths of the series allows us to store one number and indicate that “all the others are zeros”.

Delta coding is a technique in which each byte contains a difference from some value, rather than an absolute value. Therefore, editing certain bytes changes the color of all other pixels. For example, instead of storing

12 13 14 14 14 13 13 14

We could start at 12, and then simply indicate how much we need to add or subtract to get the next number. And this sequence in delta coding takes the form:

12 1 1 0 0 -1 0 1 The

converted data is not obtained less than the original, but it is already easier to compress it. Using delta coding before encoding series lengths can help a lot, while remaining lossless compression.

Delta coding is one of the few techniques used outside of 8x8 blocks. Of the 64 DCT coefficients, one is simply a constant wave function (solid color). It represents the average brightness of each block for the luminance components, or the average blueness for the Cb components, and so on. The first value of each DCT block is called the DC value, and each DC value undergoes delta coding with respect to the previous ones. Therefore, a change in the brightness of the first block will affect all blocks.

The last mystery remains: how does singular change completely spoil the whole picture? So far, compression levels did not have such properties. The answer lies in the JPEG header. The first 500 bytes contain metadata about the image - width, height, etc., and so far we have not worked with them.

Without a header, it's almost impossible (well, or very difficult) to decode JPEG. It will look as if I am trying to describe the picture to you, and begin to invent words in order to convey my impression. The description will probably be very concise, because I can invent words with exactly the meaning that I want to convey, but for all the others they will not make sense.

It sounds stupid, but that's exactly what happens. Each JPEG image is compressed with codes specific to it. A dictionary of codes is stored in the header. This technique is called the Huffman code, and the dictionary is called the Huffman table. In the header, the table is marked with two bytes - 255 and then 196. Each color component can have its own table.

Changes to the tables will radically affect any image. A good example is to change 1 on 12 on the 15th line.

This is because the tables indicate how to read individual bits. So far, we have only worked with binary numbers in decimal form. But this hides from us the fact that if you want to store the number 1 in a byte, then it will look like 00000001, since each byte should have exactly eight bits, even if only one is needed from them.

This is potentially a big waste of space if you have a lot of small numbers. The Huffman code is a technique that allows us to relax this requirement, according to which each number should occupy eight bits. This means that if you see two bytes:

234 115

That, depending on the Huffman table, these can be three numbers. To extract them, you first need to break them into separate bits:

11101010 01110011

Then we turn to the table to understand how to group them. For example, it can be the first six bits, (111010), or 58 in the decimal system, followed by five bits (10011), or 19, and finally the last four bits (0011), or 3.

Therefore, it is very difficult to understand bytes at this stage of compression. Bytes do not represent what it seems. I will not go into the details of working with the table in this article, but there are enough materials on this issue on the network .

One of the interesting tricks you can do if you know this is to separate the header from the JPEG and store it separately. In fact, it turns out that only you can read the file. Facebook is doing this to further reduce file size.

What else you can do is modify the Huffman table quite a bit. For others, it will look like a damaged picture. And only you will know the magic version of its correction.

To summarize: so what is needed for JPEG decoding? It is necessary:

  1. Extract the Huffman table (s) from the header and decode the bits.
  2. Extract discrete cosine transform coefficients for each color and brightness component for each 8x8 block by performing inverse transforms of coding of series lengths and deltas.
  3. Combine cosines based on coefficients to get pixel values ​​for each 8x8 block.
  4. Scale color components if downsampling has been performed (this information is in the header).
  5. Convert the resulting YCbCr values ​​for each pixel to RGB.
  6. Display image on screen!

Serious work for easy viewing pictures with a cat! However, what I like about it is that you can see how JPEG technology is human-centric. It is based on the characteristics of our perception, which allow us to achieve much better compression than conventional technologies. And now, understanding how JPEG works, you can imagine how these technologies can be transferred to other areas. For example, delta coding in video can give a serious reduction in file size, since there are often whole areas that do not change from frame to frame (for example, background).

The code used in the article is open, and contains instructions for replacing images with your own.

Also popular now: