The Magic of H.264
(a translation)

H.264 is a video compression standard, and it is ubiquitous: it is used to compress video on the Internet, on Blu-ray discs, in phones, surveillance cameras, drones, everywhere. Everyone is using H.264 now.
The engineering behind H.264 is worth noting. It is the result of more than 30 years of work with one single goal: reducing the channel bandwidth needed to transmit high-quality video.
From a technical point of view, it is very interesting. This article gives a high-level description of how some of its compression mechanisms work; I will try not to bore you with details. It is also worth noting that most of the technologies described below apply to video compression in general, not just to H.264.
Why bother compressing anything?
Uncompressed video is a sequence of two-dimensional arrays containing information about the pixels of each frame. In effect, it is a three-dimensional (2 spatial dimensions and 1 temporal) byte array. Each pixel is encoded in three bytes, one for each of the three primary colors (red, green and blue).
1080p @ 60 Hz = 1920x1080x60x3 => ~370 MB/s of data.
That would be almost impossible to work with. A 50 GB Blu-ray disc would hold just over 2 minutes of video. Copying would not be easy either: even an SSD would struggle writing such a stream from memory to disk.
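For a quick sanity check of those numbers, here is the arithmetic in a few lines of Python (a sketch; it assumes exactly 3 bytes per pixel and no container overhead):

```python
# Raw data rate of uncompressed 1080p @ 60 Hz video, 3 bytes per pixel.
width, height, fps, bytes_per_pixel = 1920, 1080, 60, 3

bytes_per_second = width * height * fps * bytes_per_pixel
print(f"{bytes_per_second / 1e6:.0f} MB/s")  # ~373 MB/s

# How long a 50 GB Blu-ray disc would last at that rate.
blu_ray_capacity = 50e9
print(f"{blu_ray_capacity / bytes_per_second / 60:.1f} minutes")  # ~2.2 minutes
```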
So yes, compression is necessary.
Why H.264?
I will certainly answer that question. But first, let me show you something. Take a look at the Apple homepage:

I saved the image and will use two files as an example:
Um... what? The file sizes seem to be mixed up.
No, the sizes are correct. The H.264 video, 300 frames long, weighs 175 KB. A single frame of that video saved as PNG weighs 1015 KB.
It looks like we are storing 300 times more data in the video, yet we get a file that is 5 times smaller. H.264 turns out to be 1,500 times more efficient than PNG.
How is this possible? What's the trick?
And there are a lot of tricks! H.264 uses every trick you can think of (and plenty you can't). Let's go through the main ones.
Get rid of excess weight.
Imagine you are preparing a car for a race and need to make it faster. What do you do first? You cut weight. Say the car weighs one ton. You start throwing out the parts you don't need... The rear seat? Pff, out it goes. The subwoofer? We can do without music. Air conditioning? Not needed. The transmission? Into the trash... wait, that one will come in handy.
In this way you get rid of everything except the essentials.
This approach of discarding unnecessary parts is called lossy compression. H.264 encodes with loss, discarding less significant parts while preserving the important ones.
PNG encodes without loss. All information is preserved pixel by pixel, so the original image can be recreated exactly from a PNG file.
Important parts? How can an algorithm determine which parts of a frame are important?
There are several obvious ways to crop an image. Perhaps the upper right quarter of the picture is useless; then we could cut that corner off and fit into ¾ of the original weight. The car would now weigh 750 kg. Or we could trim a strip of some width around the entire perimeter, since the important information is always in the middle. Yes, all of that is possible, but H.264 does none of it.
What does H.264 actually do?
H.264, like all lossy compression algorithms, reduces the level of detail. Below is a comparison of an image before and after the fine details are removed.

See how the holes in the speaker grille of the MacBook Pro have disappeared in the compressed image? If you don't zoom in, you might not even notice. The image on the right weighs only 7% of the original, and that is without compression in the traditional sense. Imagine a car weighing only 70 kg!
7%, wow! How is it possible to shed detail like that?
First, some math.
Information entropy
Now we come to the most interesting part! If you've taken a course in computer science theory, you may remember the concept of information entropy. Information entropy is the number of bits needed to represent some data. Note that this is not the size of the data itself; it is the minimum number of bits that must be used to represent every item of the data.
For example, if the data is the outcome of a single coin toss, the entropy is 1 bit. For two coin tosses, 2 bits are needed.
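Here is a minimal sketch of that definition in Python, applying the standard Shannon formula to the coin examples (the function name is my own):

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy: the minimum average number of bits per outcome."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_bits([0.5, 0.5]))  # one fair coin toss -> 1.0 bit
# Two independent tosses: four equally likely outcomes -> 2.0 bits
print(entropy_bits([0.25] * 4))
```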
Now suppose the coin is very strange: you tossed it 10 times, and it came up heads every time. How would you describe that to someone? You would hardly say something like HHHHHHHHHH; you would say "10 tosses, all heads". Boom! You've just compressed the information! Easy. I just saved you hours of mind-numbing lectures. This is, of course, a huge simplification, but you transformed the data into a short representation with the same information content. That is, you reduced the redundancy. The information entropy of the data was not affected; you only changed its representation. This approach is called entropy coding, and it is suitable for encoding any kind of data.
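The "10 tosses, all heads" trick is essentially run-length encoding, one of the simplest ways to remove redundancy. A minimal sketch (H.264 itself uses far more sophisticated entropy coders, such as CABAC):

```python
from itertools import groupby

def run_length_encode(symbols):
    """Collapse each run of repeated symbols into a (count, symbol) pair."""
    return [(len(list(run)), symbol) for symbol, run in groupby(symbols)]

print(run_length_encode("HHHHHHHHHH"))  # [(10, 'H')] -- "10 tosses, all heads"
print(run_length_encode("HTTH"))        # [(1, 'H'), (2, 'T'), (1, 'H')]
```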
Frequency space
Now that we've figured out information entropy, let's move on to transforming the data itself. Data can be represented in various basic number systems. In binary there are 0 and 1; in hexadecimal the alphabet consists of 16 characters. There is a one-to-one correspondence between these systems, so it is easy to convert one into the other. So far so good? Let's move on.
Now imagine that data changing in space or time can be represented in a completely different coordinate system. Take the brightness of an image, for example, and instead of a coordinate system with x and y, take a frequency system, so that the axes carry the frequencies freqX and freqY. This is called a frequency domain representation. And there is a theorem stating that any data can be represented in such a system without loss, given sufficiently high freqX and freqY.
OK, but what are freqX and freqY?
freqX and freqY are simply another basis for the coordinate system. Just as you can switch from binary to hexadecimal, you can switch from XY to freqX and freqY. The transition from one system to the other is shown below. The fine speaker grille of the MacBook Pro carries high-frequency information and sits in the high-frequency region. Fine details have high frequency, while smooth changes in color and brightness have low frequency; everything in between stays in between. In this representation, low-frequency details are near the center of the image and high-frequency ones are in the corners.

So far so good, but why is all this needed?
Because now you can take the image represented in the frequency domain and crop off the corners, in other words, apply a mask, thereby reducing the detail. If you then convert the image back to its familiar form, you will see that it still resembles the original but has less detail. Such a manipulation saves space. By choosing the right mask, you can control the level of detail in the image.
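Here is a minimal sketch of this masking idea in Python with NumPy (the function and parameter names are my own). It uses a 2D Fourier transform for illustration; H.264 itself uses a small integer transform related to the DCT, and real quantization divides coefficients rather than simply zeroing them:

```python
import numpy as np

def reduce_detail(image, keep_radius):
    """Transform to the frequency domain, keep only a circle of low
    frequencies around the center, and transform back to pixels."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # low freqs move to center
    h, w = image.shape
    y, x = np.ogrid[:h, :w]
    mask = (y - h / 2) ** 2 + (x - w / 2) ** 2 <= keep_radius ** 2
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real

# Example: smaller keep_radius -> fewer frequencies kept -> less detail.
image = np.random.rand(256, 256)  # stand-in for a grayscale photo
less_detailed = reduce_detail(image, keep_radius=32)
```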
Below is the familiar laptop again, now with circular masks applied to it. The percentage indicates the information entropy relative to the original image. If you don't zoom in, the difference is unnoticeable even at 2%! The car now weighs 20 kg! This is how you lose weight. This lossy process is called quantization.

That's impressive. What other tricks are there?
Color processing
The human eye is not good at distinguishing closely similar shades of color. It easily picks up the smallest differences in brightness, but not in color. So there must be a way to discard excess color information and save even more space.
On TVs, RGB colors are converted to YCbCr, where Y is the luma component (essentially the brightness of the black-and-white image), and Cb and Cr are the chroma components. RGB and YCbCr are equivalent in terms of information entropy.
Why complicate things, then? Isn't RGB enough?
In the days of black-and-white TV there was only the Y component. With the advent of color TV, engineers faced the task of transmitting a color RGB image alongside the black-and-white one. Instead of two separate transmission channels, it was decided to encode the color into the Cb and Cr components and transmit them together with Y; color TVs would then convert the luma and chroma components back into the usual RGB.
But here's the trick: the luma component is encoded at full resolution, while the chroma components are encoded at only a quarter of it. This can be gotten away with because the eye/brain barely distinguishes shades. The image size is thus cut in half with minimal visible difference. 2 times! The car now weighs 10 kg!
This technique of encoding images at reduced color resolution is called chroma subsampling. It has been used everywhere for a long time, and applies not only to H.264.
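A minimal sketch of the conversion and the subsampling in Python with NumPy, using the BT.601 weights (the exact constants vary between standards; the function names are my own):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an (h, w, 3) RGB array into Y, Cb, Cr planes (BT.601 weights)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b  # luma: perceived brightness
    cb = 0.564 * (b - y) + 128              # blue-difference chroma
    cr = 0.713 * (r - y) + 128              # red-difference chroma
    return y, cb, cr

def subsample(plane):
    """Keep every second sample in each direction: 4 pixels -> 1 sample."""
    return plane[::2, ::2]

rgb = np.random.randint(0, 256, (8, 8, 3)).astype(float)
y, cb, cr = rgb_to_ycbcr(rgb)
cb, cr = subsample(cb), subsample(cr)
# Storage per pixel: 1 (Y) + 1/4 (Cb) + 1/4 (Cr) = 1.5 values instead of 3.
```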
These are the most significant lossy techniques for reducing size. We got rid of most of the fine detail and cut the color information in half.
Can we go further?
Yes. Cropping the picture is only the first step. Until now we were dissecting a single frame. It's time to look at compression across time, where we get to work with a group of frames.
Motion compensation
H.264 is a standard built on motion compensation.
Motion compensation? What is it?
Imagine you are watching a tennis match. The camera is fixed, shooting from one angle, and the only thing that moves is the ball. How would you encode this? You would do the usual thing, right? A three-dimensional array of pixels: two coordinates in space and one frame number in time, right?
But why? Most of the image stays the same. The court, the net, and the spectators do not change; the only thing moving is the ball. What if we defined a single background image and one image of a ball moving across it? Wouldn't that save a lot of space? You see what I'm getting at, right? Motion compensation?
And that is exactly what H.264 does. H.264 splits the image into macroblocks, usually 16x16 pixels, which it uses to calculate motion. One frame is kept static, usually called an I-frame [intra frame], and it contains the complete image. Subsequent frames can be either P-frames [predicted] or B-frames [bi-directionally predicted]. In P-frames, a motion vector is encoded for each macroblock based on previous frames, so the decoder must use the previous frames: it takes the latest I-frame of the video and applies the changes frame by frame until it reaches the current one.
Things get even more interesting with B-frames, where the calculation runs in both directions, based on the frames before and after them. Now you understand why the video at the beginning of the article weighs so little: it is just 3 I-frames with macroblocks rushing around between them.
With this technique, only the differences expressed by motion vectors are encoded, which provides a high compression ratio for any video with motion.
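To make the macroblock idea concrete, here is a minimal sketch of exhaustive block matching on grayscale frames (a simplified illustration with names of my own; real encoders use much smarter search strategies, sub-pixel precision, and variable block sizes):

```python
import numpy as np

def find_motion_vector(prev, curr, top, left, block=16, search=8):
    """Find the displacement (dy, dx) at which a block of the current frame
    best matches the previous frame, by minimizing the sum of absolute
    differences (SAD) over an exhaustive search window."""
    target = curr[top:top + block, left:left + block].astype(np.int64)
    best_cost, best_vector = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                continue  # candidate block would fall outside the frame
            candidate = prev[y:y + block, x:x + block].astype(np.int64)
            cost = np.abs(target - candidate).sum()
            if cost < best_cost:
                best_cost, best_vector = cost, (dy, dx)
    return best_vector  # encode this instead of the block's pixels
```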
We have covered spatial and temporal compression. Quantization reduced the data size many times over, chroma subsampling halved the result, and now motion compensation lets us store only 3 of the 300 frames that were originally in the video under consideration.
It looks impressive. Now what?
Now we wrap things up with traditional lossless entropy coding. Why not?
Entropy coding
After the lossy compression steps, the I-frames still contain redundant data. The motion vectors of the macroblocks in the P-frames and B-frames also carry a lot of identical information, since macroblocks often move identically, as can be seen in the video at the beginning.
Such redundancy is eliminated by entropy coding. And there is no need to worry about the data itself: this is standard lossless compression, so everything can be restored.
And that's it! H.264 is built on the techniques described above. They form the foundation of the standard.
Excellent! But now I'm dying of curiosity to find out how much our car weighs.
The original video was shot at a non-standard resolution of 1232x1154. Doing the math:
5 seconds @ 60 fps = 1232x1154x60x3x5 bytes => ~1.2 GB
Compressed video => 175 KB
Relating this to the agreed-upon one-ton car, we get a weight of 0.14 kg. 140 grams!
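The same arithmetic in code, for the record:

```python
raw_bytes = 1232 * 1154 * 3 * 60 * 5  # pixels x bytes x fps x seconds, ~1.28 GB
compressed_bytes = 175 * 1024         # the 175 KB H.264 file
ratio = raw_bytes / compressed_bytes  # ~7,100x smaller
print(f"car: {1000 / ratio:.2f} kg")  # a one-ton car shrinks to ~0.14 kg
```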
Yes, it's magic!
Of course, what I have presented is a very simplified view of the results of decades of research in this area. If you want to know more, the Wikipedia page is quite informative.