How Yandex used computer vision to improve the quality of video broadcasts: DeepHD technology

    When people search for a picture or video on the Internet, they often add the phrase “in good quality” to the query. Quality usually means resolution: users want the image to be large and look good on a modern computer, smartphone, or TV screen. But what if a source in good quality simply does not exist?

    Today we will tell Habr readers how we improve video resolution in real time using neural networks. You will also learn how the theoretical approach to this problem differs from the practical one. If you are not interested in the technical details, feel free to skim the post - examples of our work await you at the end.

    The Internet is full of video content in low quality and low resolution. It may be films shot decades ago, or TV channel broadcasts that, for various reasons, are not transmitted in the best quality. When users stretch such video to full screen, the image becomes blurry and fuzzy. The ideal solution for old films would be to find the original reel, scan it on modern equipment, and restore it manually, but this is not always possible. Broadcasts are even harder: they have to be processed live. For these reasons, the most workable approach for us is to increase the resolution and clean up artifacts using computer vision technology.

    In the industry, the task of upscaling pictures and videos without losing quality is known as super-resolution. Many papers have been written on the topic, but the realities of production use proved much more complicated and interesting. Briefly, the main problems we had to solve in our own DeepHD technology:

    • We need to restore details that are absent from the original video because of its low resolution and quality - to “fill them in”.
    • Existing super-resolution solutions restore details, but they sharpen not only the objects in the video but also the compression artifacts, which viewers find off-putting.
    • Collecting a training set is a problem: we need a large number of pairs in which the same video exists in both low and high resolution and quality. In practice, a high-quality counterpart for bad content usually does not exist.
    • The solution should work in real time.

    Choosing the technology

    In recent years, neural networks have brought significant progress in virtually every computer vision task, and super-resolution is no exception. The most promising solutions seemed to us to be those based on GANs (Generative Adversarial Networks). They can produce photorealistic high-definition images, filling in missing details - for example, drawing hair and eyelashes in pictures of people.

    In the simplest case, the neural network consists of two parts. The first part, the generator, takes an image as input and returns it upscaled by a factor of two. The second part, the discriminator, receives a generated image and a “real” one as input and tries to tell them apart.
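    To make the two roles concrete, here is a toy sketch of this structure in Python with NumPy. The `generator` and `discriminator` below are deliberately trivial stand-ins (nearest-neighbour upscaling and a single linear layer with a sigmoid); the real networks are deep convolutional models, and all names and sizes here are illustrative, not the DeepHD code:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(img):
    """Toy stand-in for the generator: upscale 2x by nearest neighbour.
    A real generator is a convolutional network that also adds detail."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def discriminator(img, w):
    """Toy stand-in for the discriminator: one linear layer + sigmoid
    mapping an image to a 'probability of being real'."""
    score = float(img.ravel() @ w)
    return 1.0 / (1.0 + np.exp(-score))

low  = rng.random((32, 32))          # low-resolution input frame
fake = generator(low)                # generated 64x64 frame
real = rng.random((64, 64))          # "real" high-resolution frame
w = rng.normal(size=64 * 64) * 0.01  # discriminator weights

print(fake.shape)                    # (64, 64)
```

    During training the discriminator is rewarded for telling `fake` from `real`, while the generator is rewarded for fooling it - that adversarial game is what pushes generated frames toward photorealism.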

    Building the training set

    For training, we collected a few dozen videos in UltraHD quality. First, we downscaled them to 1080p, thereby obtaining reference examples. Then we downscaled these videos by another factor of two, compressing them in parallel at various bitrates to get something resembling real low-quality video. We split the resulting clips into frames and used them in this form to train the neural network.
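    The pair-generation idea can be sketched as follows. This is an illustrative NumPy mock-up rather than the production pipeline: here average pooling stands in for proper video resizing, and additive noise stands in for codec compression artifacts:

```python
import numpy as np

def downscale2x(img):
    """Average-pool by a factor of two (a crude stand-in for real resizing)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
ultra_hd = rng.random((2160, 3840))   # one UltraHD (4K) frame, values in [0, 1]

reference = downscale2x(ultra_hd)     # 1080p reference target
degraded  = downscale2x(reference)    # 540p input...
degraded += rng.normal(scale=0.05, size=degraded.shape)  # ...plus fake "compression" noise
degraded  = degraded.clip(0.0, 1.0)

print(reference.shape, degraded.shape)  # (1080, 1920) (540, 960)
```

    Each (`degraded`, `reference`) pair then serves as one training example: the network sees the degraded frame and is asked to reproduce the reference.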


    Of course, we wanted an end-to-end solution: train the neural network to produce high-resolution, high-quality video from the original in one pass. However, GANs turned out to be very capricious and constantly tried to sharpen compression artifacts rather than remove them. So we had to split the process into several stages. The first is suppressing video compression artifacts, also known as deblocking.

    An example of one deblocking method at work:

    At this stage, we minimized the mean squared error between the generated frame and the source frame. As a result, although we increased the image size, we did not get a real gain in resolution, because of regression to the mean: not knowing exactly which pixels a given edge in the image passes through, the neural network was forced to average several options, producing a blurred result. The main thing we achieved at this stage was the removal of video compression artifacts, so the generative network at the next stage only had to increase sharpness and add the missing small details and textures. After hundreds of experiments, we settled on an architecture, optimal in terms of performance and quality, vaguely reminiscent of DRCN:
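    The regression-to-the-mean effect is easy to demonstrate numerically. Suppose a low-resolution input is compatible with two equally plausible sharp reconstructions (the edge could fall on the left pixel or on the right one); a toy search over constant predictions shows that the MSE-optimal answer is their blurry average:

```python
import numpy as np

# Two equally plausible "sharp" ground truths for the same low-res input.
targets = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

# Try every constant prediction p in [0, 1] and measure its MSE over both targets.
candidates = np.linspace(0, 1, 101)
mse = [np.mean((targets - p) ** 2) for p in candidates]
best = candidates[int(np.argmin(mse))]

print(best)  # 0.5 -> the blurry average, not either sharp answer
```

    A pure MSE objective therefore always prefers a soft compromise over committing to one sharp hypothesis, which is exactly why an adversarial loss is needed later for sharpness.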

    The basic idea of this architecture is to make the network as deep as possible while avoiding convergence problems during training. On the one hand, each successive convolutional layer extracts ever more complex features of the input image, which makes it possible to determine what object is at a given point and to reconstruct complex, badly damaged details. On the other hand, the distance in the network graph from any layer to the output remains small, which improves convergence and allows a large number of layers to be used.
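    The idea of routing every layer's output directly to the final reconstruction can be sketched like this. The "layers" below are trivial elementwise maps standing in for convolutions, and the equal mixing weights are an illustrative assumption (in DRCN-style architectures they are learned):

```python
import numpy as np

def conv_like(x, w):
    """Toy stand-in for a convolutional layer (just an elementwise map here)."""
    return np.tanh(w * x)

def deep_net_with_skips(x, weights, mix):
    """Every layer's output feeds the final reconstruction directly, so the
    path from any layer to the output is short even when the net is deep."""
    outputs = []
    h = x
    for w in weights:
        h = conv_like(h, w)
        outputs.append(h)
    # Weighted sum of all intermediate reconstructions.
    return sum(m * o for m, o in zip(mix, outputs))

rng = np.random.default_rng(0)
x = rng.random(16)
weights = rng.normal(size=8)   # 8 "layers"
mix = np.full(8, 1.0 / 8)      # equal mixing weights (illustrative)

y = deep_net_with_skips(x, weights, mix)
print(y.shape)  # (16,)
```

    Because the output depends on every intermediate layer directly, gradients reach early layers without passing through the whole depth - the convergence benefit described above.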

    Training the generative network

    We took the SRGAN architecture as the basis for our resolution-upscaling neural network. Before training the adversarial pair, the generator has to be pre-trained - in the same way as at the deblocking stage. Otherwise, at the start of training the generator would output nothing but noise, the discriminator would immediately start “winning” - it is easy to learn to tell noise from real frames - and training would never get off the ground.

    Next, we train the GAN, but there are nuances here too. It is important to us that the generator not only creates photorealistic frames, but also preserves the information present in them. To do this, we add a content loss function to the classic GAN architecture. It uses several layers of a VGG19 network pretrained on the standard ImageNet dataset. These layers convert an image into a feature map that captures information about the image's content. The loss function minimizes the distance between such maps obtained from the generated and source frames. This loss also keeps the generator from being spoiled in the first steps of training, when the discriminator is not yet trained and provides useless feedback.
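    The content loss can be sketched as follows. Here a few fixed random filters stand in for the pretrained VGG19 layers; the point is only that the distance is measured between feature maps rather than raw pixels:

```python
import numpy as np

def feature_map(img, filters):
    """Toy stand-in for VGG19 features: fixed 'filters' applied to the image.
    In the real setup, the filters come from a network pretrained on ImageNet."""
    return np.stack([np.convolve(img.ravel(), f, mode="valid") for f in filters])

def content_loss(generated, source, filters):
    """L2 distance between feature maps, not between raw pixels."""
    fg = feature_map(generated, filters)
    fs = feature_map(source, filters)
    return float(np.mean((fg - fs) ** 2))

rng = np.random.default_rng(0)
filters = [rng.normal(size=9) for _ in range(4)]
src = rng.random((16, 16))

print(content_loss(src, src, filters))  # 0.0 -> identical content
```

    Comparing feature maps instead of pixels lets the generator alter fine texture freely while being penalized whenever it changes *what* is in the frame.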

    Speeding up the neural network

    Everything went well, and after a chain of experiments we obtained a good model that could already be applied to old films. However, it was still too slow for streaming video. It turned out that simply shrinking the generator without a significant loss in the quality of the final model was impossible. Then knowledge distillation came to the rescue: a lighter model is trained to reproduce the outputs of a heavier one. We took many real low-quality videos, processed them with the generative network obtained at the previous step, and trained a lighter network to produce the same results from the same frames. Thanks to this technique, we obtained a network only slightly inferior in quality to the original, but about ten times faster:
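    Here is a minimal numeric illustration of distillation. The "teacher" is an arbitrary function standing in for the heavy generator, and the "student" is a simple linear model fit to the teacher's outputs on unlabeled inputs - the same pattern as in the text, at toy scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Teacher": a heavy model we can query offline but not run in real time.
def teacher(x):
    return 3.0 * x + 1.0 + 0.5 * np.sin(5 * x)

# Unlabeled real-world inputs (for DeepHD: frames of real low-quality videos).
x = rng.uniform(-1, 1, size=1000)
y_teacher = teacher(x)

# "Student": a much lighter model (here: a line) fit to the teacher's outputs,
# not to any ground truth - that is the essence of distillation.
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y_teacher, rcond=None)
student = lambda t: coef[0] * t + coef[1]

gap = float(np.mean((student(x) - y_teacher) ** 2))
print(gap < 0.2)  # True: the cheap student closely tracks the heavy teacher
```

    Note that no ground-truth high-resolution frames are needed at this step: any real low-quality footage becomes training data once the teacher has labeled it.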

    Evaluating the quality of the solutions

    Perhaps the hardest part of working with generative networks is assessing the quality of the resulting models. There is no well-defined error function, as there is, for example, in classification. Instead, we only know the accuracy of the discriminator, which does not reflect the quality of the generator we actually care about (a reader familiar with the field might suggest the Wasserstein metric, but unfortunately it gave us noticeably worse results).

    People helped us solve this problem. We showed Yandex.Toloka users pairs of images, where one was the original and the other had been processed by the neural network, or both had been processed by different versions of our solution. For a fee, users chose the better video in each pair, giving us a statistically significant comparison of versions, even for changes that are hard to see with the naked eye. Our final models win in more than 70% of cases, which is quite a lot considering that users spend only a few seconds assessing each pair of videos.
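    A quick way to see why such pairwise votes become statistically significant is a one-sided binomial test against the null hypothesis that both versions are equally good. The sample size below is purely illustrative, not the actual number of comparisons we ran:

```python
from math import comb

def binom_p_value(wins, n, p=0.5):
    """One-sided p-value: probability of seeing >= wins out of n
    comparisons if both versions were actually equally good (p = 0.5)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(wins, n + 1))

# E.g. 70 wins out of 100 side-by-side comparisons:
p = binom_p_value(70, 100)
print(p < 0.001)  # True: a 70% win rate on 100 pairs is far beyond chance
```

    With thousands of comparisons, as crowdsourcing platforms make easy to collect, even differences of a few percentage points become detectable this way.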

    Another interesting result: 576p video enhanced by DeepHD to 720p beats the same video originally available in 720p in 60% of cases - that is, the processing not only increases the resolution but also improves the visual perception of the video.


    In the spring, we tested DeepHD technology on several old films that can be viewed on Kinopoisk: “Rainbow” by Mark Donskoy (1943), “The Cranes Are Flying” by Mikhail Kalatozov (1957), “My Dear Man” by Iosif Kheifits (1958), “Fate of a Man” by Sergei Bondarchuk (1959), “Ivan's Childhood” by Andrei Tarkovsky (1962), “Father of a Soldier” by Rezo Chkheidze (1964), and “Tango of Our Childhood” by Albert Mkrtchyan (1985).

    The difference between the versions before and after processing is especially noticeable in the details: study the facial expressions of the characters in close-ups, look at the texture of clothing or fabric patterns. We were also able to compensate for some flaws of digitization: for example, remove overexposure on faces or make objects in the shadows more visible.

    Later, DeepHD technology was used to improve the quality of broadcasts of some channels in the Yandex.Efir service. Such content is easy to recognize by the dHD tag.

    Now on Yandex you can watch “The Snow Queen”, “The Bremen Musicians”, “The Golden Antelope”, and other popular cartoons from the Soyuzmultfilm studio in improved quality. Several examples in motion can be seen in the video:

    For viewers who are particular about image quality, the difference will be especially noticeable: the image has become sharper, and tree leaves, snowflakes, stars in the night sky above the jungle, and other small details are easier to see.

    More to come.

    Useful links

    Jiwon Kim, Jung Kwon Lee, Kyoung Mu Lee. Deeply-Recursive Convolutional Network for Image Super-Resolution [arXiv:1511.04491].

    Christian Ledig et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network [arXiv:1609.04802].

    Mehdi S. M. Sajjadi, Bernhard Schölkopf, Michael Hirsch. EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis [arXiv:1612.07919].
