Streaming video through a browser with ultra-low latency (and WebRTC!)
While the first early adopters are integrating our new video conferences (up to 100 participants!) into their projects, we continue to cover interesting topics from the world of browser-based voice and video transmission. We will write about the video conferences themselves too, but later, once a critical mass of users has accumulated and interesting statistics have been collected. For now, I have translated and adapted Dr. Alex's story about where different protocols fit when transmitting video with low latency. The story is essentially a response to another article, and along the way the author points out the mistakes and inaccuracies his colleagues in the trade have made.
Network data: signaling separately, video separately
In modern systems, if you see video in a browser, the video stream and the signaling are most likely handled by different servers. The video part is self-explanatory, while the "signaling server" provides two things: discovery and handshake. Discovery is the choice of a data path: IP addresses and an intermediate server, if one is needed. The handshake is the agreement between the participants about the video and audio being transmitted: codecs, resolution, frame rate, quality. Interestingly, in the old Flash world, signaling and media transmission were not separated as they are in VoIP or WebRTC; both were provided by a single protocol, RTMP.
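To make the discovery/handshake split concrete, here is a minimal sketch of the two kinds of messages a client might push through a signaling server. The message shapes and field names are illustrative, not taken from any real protocol:

```python
import json

# "discovery": where the media can flow (addresses, optional relay).
# Field names here are hypothetical, chosen for readability.
discovery_msg = {
    "type": "candidate",
    "ip": "203.0.113.7",
    "port": 54321,
    "relay": None,          # an intermediate (TURN-like) server, if needed
}

# "handshake": what the media will look like once it flows.
handshake_msg = {
    "type": "offer",
    "codecs": ["VP8", "H264", "opus"],
    "resolution": "1280x720",
    "framerate": 30,
}

def frame(msg: dict) -> str:
    """Serialize a signaling message for the wire."""
    return json.dumps(msg)

print(frame(discovery_msg))
print(frame(handshake_msg))
```

The point is only the separation of concerns: neither message carries a single byte of video; the media stream travels elsewhere, over a different server and a different transport.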
The difference between signaling protocol and signaling transport
The signaling protocol defines the language with which the browser and the other participants negotiate discovery and handshake. It can be SIP for discovery in VoIP or WebRTC, and offer/answer for the handshake. Long ago, Flash used RTMP/AMF. And if you prefer, with WebRTC you can use the less familiar JSEP instead of SIP.
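The handshake half of that language is usually expressed as SDP inside the offer/answer exchange. As a sketch, here is how the negotiated codecs can be pulled out of an SDP blob; the SDP fragment is a typical WebRTC-style example, trimmed for brevity:

```python
# A trimmed, typical WebRTC-style SDP offer (illustrative, not captured
# from a real session).
SDP_OFFER = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
m=video 9 UDP/TLS/RTP/SAVPF 96 98
a=rtpmap:96 VP8/90000
a=rtpmap:98 H264/90000
"""

def codecs(sdp: str) -> list[str]:
    """Return codec names from a=rtpmap lines.
    Format: a=rtpmap:<payload> <name>/<clockrate>[/<channels>]"""
    out = []
    for line in sdp.splitlines():
        if line.startswith("a=rtpmap:"):
            out.append(line.split(" ", 1)[1].split("/")[0])
    return out

print(codecs(SDP_OFFER))  # → ['opus', 'VP8', 'H264']
```

The answer side sends back the same kind of description with the subset it accepts; that round trip is the entire "handshake" in miniature.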
The signaling transport protocol sits lower in the same stack: it is how the signaling protocol's packets are physically transmitted. Traditionally, Flash and SIP used TCP or UDP, but nowadays you will more often find WebSockets in the WebRTC + SIP combination. The WebSockets transport fills the TCP niche in browsers, where "raw" TCP and UDP sockets are not available.
The full signaling stack is now commonly described with phrases like "SIP over WebSockets", "JSEP over WebSockets", the obsolete "SIP over TCP/UDP", or the even older "the signaling part of RTMP".
Programming Anglicism: media codec
Most video streaming protocols are tied to one or more codecs. Video received from the camera is processed frame by frame, and network problems such as reduced bandwidth, packet loss, or delays between packets are handled by adjusting codec settings per frame. To learn about network problems in time, transport protocol mechanisms (RTP/RTCP) and bandwidth estimation mechanisms (REMB, Transport-CC, TMMBR) are used. One of the fundamental problems of Flash video was that RTMP could do neither, so the video simply stopped playing when the channel's bandwidth dropped.
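The feedback loop those mechanisms enable can be sketched as a toy loss-based bitrate controller. The thresholds and step sizes below are illustrative, not taken from any specification, but the shape (probe upward when the network is clean, back off multiplicatively on loss) is the spirit of what RTCP loss reports plus REMB/Transport-CC make possible:

```python
def adapt_bitrate(current_kbps: int, loss_fraction: float) -> int:
    """Toy per-report bitrate adaptation driven by RTCP-style loss
    feedback. Thresholds (10%, 2%) and factors are illustrative."""
    if loss_fraction > 0.10:
        # Heavy loss: back off in proportion to how bad it is.
        return int(current_kbps * (1 - 0.5 * loss_fraction))
    if loss_fraction < 0.02:
        # Clean network: probe upward gently.
        return int(current_kbps * 1.05)
    return current_kbps  # moderate loss: hold steady

rate = 1000
for loss in (0.0, 0.0, 0.15, 0.01):
    rate = adapt_bitrate(rate, loss)
    print(rate)
```

The encoder is told the new target before the next frame, which is exactly the per-frame adaptability that RTMP lacked.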
Another Anglicism: Streaming Media Protocol
It defines how to divide the video stream into small packets, which the transport protocol then sends over the network. Typically, the streaming protocol also provides mechanisms for dealing with network problems, packet loss and delay: a jitter buffer, retransmission (RTX), redundancy (RED), and Forward Error Correction (FEC).
Media Transfer Protocol
After the video from the camera has been split into small packets, they must be transmitted over the network. The transport protocol used here resembles the signaling transport, but since the payload is completely different, some protocols fit better than others. For example, TCP guarantees ordered delivery, but this adds no value to the stack, because similar mechanisms (RTX/RED/FEC) already exist at the streaming level. Meanwhile, TCP's retransmission delay is a clear drawback that UDP does not have. On the other hand, UDP is sometimes blocked outright as a "torrent protocol".
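The packetization step itself is simple to sketch: split one encoded frame into MTU-sized payloads and prepend a small RTP-like header (sequence number, timestamp, a marker bit on the last packet of the frame). The field layout below is simplified, not wire-exact RTP:

```python
import struct

MTU_PAYLOAD = 1200  # a typical safe payload size under a 1500-byte MTU

def packetize(frame: bytes, seq_start: int, timestamp: int) -> list[bytes]:
    """Split one encoded frame into RTP-like packets.
    Header: marker (1 byte), sequence number (2), timestamp (4)."""
    chunks = [frame[i:i + MTU_PAYLOAD]
              for i in range(0, len(frame), MTU_PAYLOAD)]
    packets = []
    for i, chunk in enumerate(chunks):
        marker = 1 if i == len(chunks) - 1 else 0  # last packet of frame
        header = struct.pack("!BHI", marker,
                             (seq_start + i) & 0xFFFF, timestamp)
        packets.append(header + chunk)
    return packets

frame = bytes(3000)  # a fake 3000-byte encoded frame
pkts = packetize(frame, seq_start=100, timestamp=90000)
print(len(pkts))     # → 3
```

Sequence numbers let the receiver detect loss and reorder; the shared timestamp lets it reassemble all packets of the same frame.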
The choice of protocol and network ports used to be hardcoded, but now protocols such as ICE in WebRTC let the parties negotiate ports and transport for each specific connection. In the near future we may see the QUIC protocol (which runs on top of UDP), actively discussed at the IETF, with advantages over TCP and UDP in speed and reliability. Finally, there are media streaming protocols such as MPEG-DASH and HLS, which use HTTP as transport and benefit from HTTP/2.
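What ICE actually negotiates is visible in the candidate lines exchanged during discovery. As a sketch, here is a parser for the standard candidate grammar (RFC 8445 and its SDP usage); the example line is hand-written but follows the real format:

```python
def parse_candidate(line: str) -> dict:
    """Parse an ICE candidate line:
    candidate:<foundation> <component> <transport> <priority>
              <ip> <port> typ <type> ..."""
    parts = line.split()
    return {
        "transport": parts[2].lower(),  # udp or tcp
        "ip": parts[4],
        "port": int(parts[5]),
        "type": parts[7],               # host / srflx / relay
    }

line = ("candidate:842163049 1 udp 1677729535 203.0.113.7 54400 "
        "typ srflx raddr 10.0.0.2 rport 54400")
print(parse_candidate(line))
```

Each side gathers several such candidates (direct, server-reflexive, relayed), and ICE probes the pairs until it finds a working path, so no port ever needs to be hardcoded.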
Media Transfer Security
Some engines protect data in transit over the network: either the media stream itself or the transport-layer packets. The process includes the exchange of the encryption keys themselves, for which separate protocols are used: SDES in VoIP and DTLS in WebRTC. The latter has the advantage, because in addition to the data it protects the transfer of the encryption keys.
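A quick way to see why SDES is the weaker option: with SDES, the SRTP master key travels in plain sight inside the SDP itself (the a=crypto line), so anyone who can read the signaling can decrypt the media. The sketch below extracts the key from such a line; the base64 value is the example key from RFC 4568, not a real session key:

```python
import base64

# Example a=crypto line using the illustrative key from RFC 4568.
crypto_line = ("a=crypto:1 AES_CM_128_HMAC_SHA1_80 "
               "inline:PS1uQCVeeCFCanVmcjkpPywjNWhcYD0mXXtxaVBR")

def extract_sdes_key(line: str) -> bytes:
    """Pull the base64 master key+salt straight out of the SDP line,
    exactly as a signaling-path eavesdropper could."""
    b64 = line.split("inline:")[1].split("|")[0]
    return base64.b64decode(b64)

key = extract_sdes_key(crypto_line)
print(len(key))  # → 30 (a 16-byte AES key plus 14-byte salt)
```

DTLS-SRTP avoids this entirely: the keys are derived inside an encrypted handshake on the media path and never appear in the signaling at all.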
What confuses me in all this
Some developers, for example the authors of this article, place pure transport protocols like WebSocket and QUIC on the same level as WebRTC, Flash, or HLS. To me such a grouping looks strange, because the latter three are a media streaming story: encoding and packetization happen before WebSocket or QUIC comes into play. The WebRTC reference implementation (libwebrtc/Chrome) from Google and Microsoft's ORTC use QUIC as a transport protocol.
No less surprising is the lack of any mention of HTTP/2 as an optimization for HTTP-based protocols such as HLS and MPEG-DASH. And the CMAF they mention is nothing more than a file format for HLS and MPEG-DASH, not a replacement for them.
Finally, SRT is just a transport protocol. Of course, it adds a number of features compared to the file-based HLS and MPEG-DASH, but all of these features live at a different level of the stack and are already implemented in RTMP or WebRTC. SRT also separates the media encoding from the network statistics, which prevents the codec from keeping this information as close together as possible. Such a decision can hurt the ability to adapt the outgoing video to varying network bandwidth.
File-based protocols such as HLS encode several streams and switch between them to adapt to the channel width. WebRTC can adapt the encoding of each frame in real time, which is much faster than switching to another stream in HLS, where up to 10 seconds of already-delivered data must be played out first.
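The HLS side of that comparison can be shown in miniature: parse the variant bitrates out of a master playlist and pick the highest one that fits the measured bandwidth. The playlist content is a typical hand-written example:

```python
# A typical hand-written HLS master playlist (illustrative).
MASTER_PLAYLIST = """\
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
mid.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
high.m3u8
"""

def pick_variant(playlist: str, bandwidth_bps: int) -> str:
    """Choose the highest-bitrate variant that fits the channel."""
    variants = []
    lines = playlist.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF"):
            bw = int(line.split("BANDWIDTH=")[1].split(",")[0])
            variants.append((bw, lines[i + 1]))
    fitting = [v for v in variants if v[0] <= bandwidth_bps]
    # Fall back to the lowest variant if even it doesn't fit.
    return max(fitting)[1] if fitting else min(variants)[1]

print(pick_variant(MASTER_PLAYLIST, 3_000_000))  # → mid.m3u8
```

A switch here means fetching new segments of a different pre-encoded stream, while WebRTC simply re-targets the encoder on the very next frame; that is the latency gap the paragraph above describes.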