Which video codecs browsers (don't) use for video calls

Translation

A typical Voximplant technical support request: “Why does a video call between two Chrome browsers look better than a video call between MS Edge and a native iOS application?” Colleagues usually give a neutral answer: “because of codecs”. But we IT people are curious. Even if I am not building a new Skype-for-web, reading about what each browser can do and how browsers split one video into several streams of different quality enriches my picture of the world and gives me a fresh topic for discussion in the smoking room. An article by Dr Alex, well known in narrow circles, turned up at just the right time (with the best explanation of the term “media engine” I have seen). Add a little of our own experience and a couple of evenings in “Dial”, and a translation adapted for Habr is waiting under the cut!

Codecs and bandwidth

When people talk about video codecs, they most often discuss the balance between quality and the bandwidth used. Questions of CPU load and of how the video is technically transmitted tend to be ignored. That is quite reasonable if we are discussing the encoding of an already recorded video.

After all, if you have a finished video, it makes little difference whether compressing it takes a couple of minutes, a couple of hours or even a couple of days. Any CPU and memory cost is justified, because it is a one-time investment, and afterwards the video can be distributed to millions of users. The best video codecs compress video in several passes:

Pass 1: The video is divided into parts with common features: the action takes place against the same background, the scene is fast or slow, and the like.
Pass 2: Statistics for encoding are collected, along with information on how frames change over time (getting this information requires several frames).
Pass 3: Each part is encoded with its own codec settings, using the information obtained in the second pass.

Streaming is a different matter. No one will wait for the podcast, stream or show to end before starting to encode the video. You have to encode and send right away. A live broadcast is live precisely because minimal delay becomes the most important thing.

With physical media, DVD or Blu-ray discs, the video size is fixed and the codec's task is to ensure the maximum quality for a given size. If the video is distributed over the network, the codec's task is to prepare file(s) that give the maximum quality at a fixed bandwidth or, if you need to cut costs, the minimum bandwidth at a fixed quality. Network latency can be ignored: the client can buffer as many seconds of video as needed. For streaming, however, there is no particular need to fix either size or quality, and the codec has a different task: to reduce delay at any cost.

Finally, codec creators long kept only one usage scenario in mind: one and only one video is played on the user's computer. Moreover, that video can almost always be decoded by the video chip. Then came mobile platforms. And then WebRTC, where developers, in order to ensure minimal delay, really wanted to use Selective Forwarding Unit servers.

Using codecs for video calls is so different from their traditional use for video playback that comparing codecs head-on becomes meaningless. When VP8 and H.264 were compared at the dawn of WebRTC, one of the hottest debates was about codec settings: whether to make them “realistic”, taking unreliable networks into account, or “ideal”, for maximum video quality. The fighters for a “clean comparison of codecs” argued quite seriously that codecs should be compared without taking packet loss, jitter and other network problems into account.

Where do codecs stand now?

H.264 and VP8 are approximately the same in terms of video quality and bandwidth used;
H.265 and VP9 also roughly match each other, showing on average 30% better results than the previous generation of codecs at the cost of a 20% increase in CPU usage;
The new AV1 codec, an explosive mix of VP10, Daala and Thor, is better than the previous generation of codecs by about as much as those are better than their predecessors.

And now the surprise: no one cares about these differences when it comes to video calls and video conferencing. What matters most is how the codec plays in a team with the rest of the infrastructure. Developers are concerned with what goes by the new term “media engine”: how a browser or mobile application captures video, encodes and decodes it, splits it into RTP packets, and fights network problems (remember the video from our previous post? What was compared there was precisely the media engines; translator’s note). If the encoder cannot cope with a sharp drop in bandwidth or steadily sustain 20 frames per second, if the decoder cannot cope with the loss of a single network packet, then what difference does it make how well the codec compresses video? It is easy to see why Google sponsors research by a Stanford team into better interaction between the codec and the network. This is the future of video communications.

Codecs and the media engine: it's complicated

Video calls and video conferencing have almost the same tasks as regular media. But the priorities are completely different:

You need 30 frames per second (codec speed).
You need 30 frames per second with interactivity (minimal delay).

We also have the internet between the participants, and we can only guess at its quality. It is usually worse than we guess. Therefore:

You need to cope well with small changes in bandwidth, as when another visitor arrives at the coworking space.
You need to survive, at least somehow, large changes in bandwidth, as when that visitor starts downloading torrents.
You need to cope well with jitter (random delays between received packets, because of which packets can not only arrive late but also arrive in a different order from the one in which they were sent).
You need to survive packet loss.

3.1. Main tasks of the media engine

What does “you need 30 frames per second” mean? It means the media engine has 33 milliseconds to capture video from the camera and sound from the microphone, compress it with a codec, split it into RTP packets, protect the transmitted data (SRTP = RTP + AES) and send it over the network (UDP or TCP, in most cases UDP). All of this happens on the sending side. On the receiving side the same steps are repeated in reverse order. Since encoding is usually more complex than decoding, the sending side has the harder job.

Technically, the “30 frames per second” goal can be met even with delay. And the greater the delay, the easier the goal is to reach: if the sending side encodes not one frame at a time but several at once, it can save significantly on bandwidth (codecs compress several frames better by analyzing the changes between all of them, not just between the current and the previous one). At the same time, the delay between receiving the video from the camera and sending it over the network grows in proportion to the number of buffered frames, and compression becomes slower because of the additional computation. Many services use this trick: they advertise only the round-trip time of network packets between video call participants and stay silent about the encoding and decoding delay.
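As a rough illustration of that proportionality, here is a minimal Python sketch. The function names are made up for this example, and it counts only the time spent waiting for frames to arrive, not the extra computation.

```python
def frame_interval_ms(fps: float) -> float:
    """Per-frame time budget at a given frame rate, in milliseconds."""
    return 1000.0 / fps


def lookahead_delay_ms(fps: float, buffered_frames: int) -> float:
    """Extra capture-to-send delay when the encoder waits for
    `buffered_frames` additional frames before compressing.

    Assumption: only waiting time is counted, not encoding time.
    """
    return buffered_frames * frame_interval_ms(fps)


if __name__ == "__main__":
    for n in (0, 1, 5, 30):
        delay = lookahead_delay_ms(30, n)
        print(f"{n:>2} buffered frames at 30 fps: +{delay:.0f} ms")
```

At 30 frames per second, every buffered frame costs one more frame interval, so a one-second lookahead window that is harmless for a recorded file is already a one-second lag in a conversation.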

To make video calls feel like face-to-face communication, the creators of communication services discard all codec settings and profiles that may cause delay. The result is a degradation of modern codecs to frame-by-frame compression. At first this caused rejection and criticism from codec developers. But times have changed, and modern codecs now ship a set of “realtime” settings in addition to the traditional “minimum size” and “maximum quality” presets. Along with it comes a set for screen sharing, which is also part of video calls (it has its own specifics: high resolution, a picture that changes little, and the need for lossless compression, otherwise text will blur).

3.2. Media engine and public networks

Small bandwidth changes

Previously, codecs could not change the bitrate: when compression started, they took the target bitrate as a setting and then produced a fixed number of megabytes of video per minute. In those old days, video calls and video conferencing were the preserve of local networks and reserved bandwidth. And if there were problems, people called the administrator, who fixed the bandwidth reservation on the Cisco box.

The first evolutionary change was “adaptive bitrate” technology. A codec has many settings that affect the bitrate: video resolution, a slight decrease in frame rate from 30 to 25 frames per second, quantization of the video signal. The last item in this list is a “coarsening” of the transitions between colors, small changes in which are hardly noticeable to the human eye. Most often, the main “knob” for adaptive bitrate was precisely quantization. And the media engine told the codec how much bandwidth was available.
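The quantization “knob” can be sketched as a simple feedback loop. This is an illustrative controller written for this article, not the algorithm of any real media engine. The 0 to 51 quantizer range is an assumption borrowed from H.264, and the `next_qp` function and its 10% tolerance are hypothetical.

```python
QP_MIN, QP_MAX = 0, 51  # assumption: an H.264-style quantizer range


def next_qp(qp: int, measured_kbps: float, target_kbps: float,
            tolerance: float = 0.1) -> int:
    """Return the quantizer to use for the next frame.

    A higher QP means coarser quantization and therefore a lower bitrate.
    `target_kbps` is the bandwidth estimate supplied by the media engine.
    """
    if measured_kbps > target_kbps * (1 + tolerance):
        qp += 1  # over budget: compress harder
    elif measured_kbps < target_kbps * (1 - tolerance):
        qp -= 1  # headroom available: spend it on quality
    return max(QP_MIN, min(QP_MAX, qp))
```

Moving the quantizer by one step at a time keeps quality changes gradual, which suits the small bandwidth fluctuations described here and is too slow for the large drops discussed next.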

Large bandwidth changes

The adaptive bitrate mechanism helps the media engine keep streaming video through minor changes in bandwidth. But if your colleague starts downloading torrents and the available bandwidth drops by a factor of two or three, adaptive bitrate will not help. What helps is reducing the resolution and the frame rate. The latter is preferable, since our eyes are less sensitive to the number of frames per second than to the video resolution. Typically, the codec starts skipping one or two frames after each frame it sends, reducing the frame rate from 30 to 15 or even to 10.
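The frame-skipping arithmetic can be sketched as follows. The helper names are hypothetical, and the sketch makes the crude assumption that bitrate scales linearly with frame rate, which real codecs only approximate.

```python
import math


def keep_every_nth(required_kbps: float, available_kbps: float) -> int:
    """Pick N so that sending one frame out of every N fits the bandwidth.

    Assumption: bitrate is proportional to frame rate.
    """
    if available_kbps <= 0:
        raise ValueError("available bandwidth must be positive")
    return max(1, math.ceil(required_kbps / available_kbps))


def thin_frames(frames: list, n: int) -> list:
    """Keep the first frame of every group of N and drop the rest."""
    return frames[::n]


if __name__ == "__main__":
    one_second = list(range(30))        # 30 frames captured in one second
    n = keep_every_nth(1500, 500)       # bandwidth dropped to a third
    print(n, len(thin_frames(one_second, n)))
```

With the bandwidth cut to a third, the sketch keeps every third frame, which corresponds to the drop from 30 to 10 frames per second mentioned above.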

An important detail: the media engine skips frames on the sending side. If we have a multi-party video conference or a broadcast, and the network problem is not on the sender's side, then one “weak link” will degrade the video quality for all participants. In that situation the combination of simulcast (the sending side produces several video streams of different quality at once) and an SFU (Selective Forwarding Unit, a server that gives each participant of a video conference or broadcast a stream of suitable quality) helps. Some codecs can produce simulcast streams that complement each other, known as SVC: clients with the weakest connection receive a minimum-quality stream, clients with a better connection receive the same stream plus the first “upgrade”, clients with an even better connection receive two “upgrade” streams, and so on. This approach avoids transmitting the same data in multiple streams and saves about 20% of traffic compared to encoding several full-fledged video streams. It also simplifies the server's work: there is no need to switch streams, it is enough not to forward the “upgrade” packets to some clients. That said, any codec can be used for simulcast; it is a feature of the media engine and of how RTP packets are organized, not of the codec.
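The difference between the two selection strategies on the SFU can be sketched like this. The `Layer` type, the function names and the bitrates are invented for illustration; a real SFU works on RTP packets and bandwidth estimates rather than on static numbers.

```python
from typing import List, NamedTuple


class Layer(NamedTuple):
    name: str
    kbps: int  # bitrate of this stream or layer on its own


def pick_simulcast(streams: List[Layer], available_kbps: int) -> Layer:
    """Simulcast: forward the single best full stream that fits.

    If nothing fits, fall back to the lowest-bitrate stream.
    """
    fitting = [s for s in streams if s.kbps <= available_kbps]
    if fitting:
        return max(fitting, key=lambda s: s.kbps)
    return min(streams, key=lambda s: s.kbps)


def pick_svc(layers: List[Layer], available_kbps: int) -> List[Layer]:
    """SVC: always forward the base layer, then add "upgrade" layers
    in order while their cumulative bitrate still fits.

    Assumption: `layers` is ordered base layer first.
    """
    chosen = [layers[0]]
    total = layers[0].kbps
    for layer in layers[1:]:
        if total + layer.kbps > available_kbps:
            break
        chosen.append(layer)
        total += layer.kbps
    return chosen
```

For example, with SVC layers of 150, 300 and 800 kbps, a client with 500 kbps available would get the base layer and the first “upgrade”, and the server only has to stop forwarding the packets of the second one.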

Jitter and packet loss

Loss is the hardest to fight. Jitter is a bit simpler: it is enough to have a buffer on the receiving side in which to collect late and out-of-order packets. The buffer must not be too big, otherwise you break real time and turn into a buffering YouTube video.
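A receiving-side buffer of this kind can be sketched in a few lines. This is a deliberately simplified model: the `JitterBuffer` class is hypothetical, it sizes the buffer in packets rather than in milliseconds, and it ignores the wraparound of real 16-bit RTP sequence numbers.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Hold up to `depth` packets and release them in sequence order."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth
        self._heap: List[Tuple[int, bytes]] = []
        self._next_seq: Optional[int] = None  # lowest seq not yet released

    def push(self, seq: int, payload: bytes) -> List[Tuple[int, bytes]]:
        """Store a packet and return the packets that are now ready."""
        if self._next_seq is not None and seq < self._next_seq:
            return []  # arrived after its slot was played out: drop it
        heapq.heappush(self._heap, (seq, payload))
        ready = []
        while len(self._heap) > self.depth:
            released = heapq.heappop(self._heap)
            self._next_seq = released[0] + 1
            ready.append(released)
        return ready
```

The `depth` parameter is the trade-off from the paragraph above: a deeper buffer tolerates more reordering, and every extra slot adds delay before the video is shown.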

Packet loss is usually fought with retransmission (RTX). If the sender has a good connection to the SFU, the server can request the lost packet, receive it again and still fit within 33 milliseconds. If the network connection is unreliable (more than 0.01% packet loss), more complex loss-handling algorithms, such as FEC, are needed.
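The two decisions behind retransmission, which packets to ask for and whether asking is worth it, can be sketched as follows. Both helpers are hypothetical simplifications: sequence-number wraparound is ignored, and a resend is assumed to cost about one round trip.

```python
from typing import Iterable, List


def missing_sequence_numbers(received: Iterable[int]) -> List[int]:
    """Sequence numbers absent between the lowest and highest received."""
    seen = set(received)
    if not seen:
        return []
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


def retransmission_fits(rtt_ms: float, budget_ms: float = 33.0) -> bool:
    """A resend only helps if the round trip fits the frame budget.

    Assumption: requesting and receiving a packet takes one round trip.
    """
    return rtt_ms < budget_ms
```

For instance, after receiving packets 10, 11, 13 and 16, the receiver would ask for 12, 14 and 15, but only on a link whose round-trip time leaves room inside the 33 millisecond window.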

The best solution at the moment is to use SVC codecs. In that case, to get at least some video, only the “base” packets carrying the minimum-quality stream are needed. These packets are smaller, so they are easier to resend, and that is enough to “survive” even on a very bad network (more than 1% packet loss). If simulcast + SFU lets you deal with drops in bandwidth, then simulcast with an SVC codec + SFU solves both the bandwidth problems and the packet loss problems.

What browsers support now

Firefox and Safari use Google's media engine and update libwebrtc from time to time. They do so much less often than Chrome, which releases a new version every 6 weeks. From time to time they fall far behind, but then catch up again. The exception is support for the VP8 codec in Safari. Don't even ask.

Under the cut is a table with a full comparison of who supports what, but in general everything is quite simple. Everyone usually ignores Edge. The choice is between supporting mobile Safari and good video quality. iOS Safari supports only the H.264 video codec, while libwebrtc allows simulcast only with the VP8 (different streams with different frame rates) and VP9 (SVC support) codecs. But you can take libwebrtc and use it on iOS by building a native application. Then simulcast will work fine and users will get the highest possible video quality on an unstable internet connection. A few examples:

Highfive - a desktop application on Electron (Chromium) with H.264 simulcast (libwebrtc) and Dolby audio codecs;
Atlassian - an interesting client solution on React Native and libwebrtc for simulcast;
Symphony - Electron on the desktop, React Native on mobile, with simulcast supported on both, plus additional security features compatible with what banks want;
Tokbox - VP8 with simulcast in the mobile SDK, using a patched version of libvpx in libwebrtc.

Future

It is already clear that VP8 and VP9 will not come to Safari (unlike Edge, which supports VP8).

Although Apple supported the inclusion of H.265 in WebRTC, the latest news and a number of indirect signs point to AV1 as the “next big thing”. Unlike the rest of the article, this is my personal opinion. The packet format for transmitting AV1 data is already ready, but work on the codec itself is still underway. Right now the reference implementation of the encoder manages a sad 0.3 frames per second. That is not a problem for playing back pre-compressed content, but it is not yet usable for real-time communications. For now you can try AV1 video playback in Firefox, although this has nothing to do with RTC. The implementation comes from the Bitmovin team, which developed MPEG-DASH and received 30 million in investment to build next-generation video infrastructure.
