Broadcasting h264 video without transcoding or delay

    It is no secret that video is often streamed from an aircraft down to the ground while it is being flown. UAV manufacturers usually provide this capability out of the box. But what do you do if the drone is built with your own hands?

    The challenge for us and our Swiss partners from Helvetis was to stream real-time video from a webcam attached to a low-power embedded device on a drone, over WiFi, to a Windows tablet. Ideally, we wanted:

    • a delay below 0.3 s;
    • low CPU load on the embedded system (less than 10% per core);
    • a resolution of at least 480p (preferably 720p).

    What could possibly go wrong?



    So, we settled on the following list of equipment:

    • Minnowboard: dual-core Atom E3826 @ 1.4 GHz, OS: Ubuntu 16.04
    • ELP USB100W04H webcam supporting several formats (YUV, MJPEG, H264)
    • ASUS VivoTab Note 8 Windows tablet

    Attempts to get by with standard solutions


    Simple solution with Python + OpenCV


    First, we tried a simple Python script that used OpenCV to grab frames from the camera, compressed them to JPEG, and sent them over HTTP to the client application.

    HTTP MJPEG streaming in Python
    from flask import Flask, render_template, Response
    import cv2
    class VideoCamera(object):
        def __init__(self):
            # Open the first V4L2 device (/dev/video0)
            self.video = cv2.VideoCapture(0)
        def __del__(self):
            self.video.release()
        def get_frame(self):
            # Grab a raw frame and compress it to JPEG on the CPU
            success, image = self.video.read()
            if not success:
                return None
            ret, jpeg = cv2.imencode('.jpg', image)
            return jpeg.tobytes()
    app = Flask(__name__)
    @app.route('/')
    def index():
        return render_template('index.html')
    def gen(camera):
        # Endless multipart response: one JPEG image per part
        while True:
            frame = camera.get_frame()
            if frame is None:
                continue
            yield (b'--frame\r\n'
                   b'Content-Type: image/jpeg\r\n\r\n' + frame + b'\r\n\r\n')
    @app.route('/video_feed')
    def video_feed():
        return Response(gen(VideoCamera()),
                        mimetype='multipart/x-mixed-replace; boundary=frame')
    if __name__ == '__main__':
        app.run(host='0.0.0.0', debug=True)
    


    This approach proved (almost) workable: any web browser could be used as the viewing application. However, we immediately noticed that the frame rate was lower than expected and that the CPU load on the Minnowboard was constantly at 100%. The embedded device simply could not encode frames in real time. On the plus side, the delay itself was very small, although the 480p video ran at no more than 10 frames per second.

    While looking for alternatives, we found a webcam that, in addition to uncompressed YUV frames, could output frames already compressed as MJPEG. We decided to use this handy feature to offload the CPU and to look for a way to transfer the video without transcoding.

    FFmpeg / VLC


    First of all, we tried everyone's favorite open-source Swiss-army knife, FFmpeg, which can, among many other things, read a video stream from a UVC device, encode it, and transmit it. After some immersion in the manual, we found command-line options that let us receive the already compressed MJPEG stream and retransmit it without transcoding:

    ffmpeg -f v4l2 -s 640x480 -input_format mjpeg -i /dev/video0 -c:v copy -f mjpeg udp://ip:port
    

    The CPU load was low. Rejoicing, we eagerly opened the stream in ffplay... To our disappointment, the delay was absolutely unacceptable (about 2-3 seconds). After trying every low-latency tip we could find in the documentation and on the Internet, we still could not get a better result and decided to abandon FFmpeg.

    After the failure with FFmpeg it was VLC's turn, or rather its console utility, cvlc. By default VLC uses a number of buffers which, on the one hand, help produce a smooth image, but on the other add a serious delay of several seconds. After a fair amount of suffering, we arrived at parameters with which the streaming looked fairly tolerable: the delay was not too large (about 0.5 s), there was no transcoding, and the client played the video quite smoothly (we did, however, have to keep a small 150 ms buffer on the client).

    This is the final cvlc command line:

    cvlc -v v4l2:///dev/video0:chroma="MJPG":width=640:height=480:fps=30 --sout="#rtp{sdp=rtsp://:port/live,caching=0}" --rtsp-timeout=-1 --sout-udp-caching=0 --network-caching=0 --live-caching=0
    

    Unfortunately, the stream was not entirely stable, and a delay of 0.5 s was still unacceptable for us.

    Mjpg-streamer


    Having stumbled upon an article describing practically our exact task, we decided to try mjpg-streamer. We tried it and liked it! With no changes at all, it covered our needs: no noticeable delay in the video at 480p.

    After the previous failures we were happy for quite a while, but then we wanted more, namely to put less strain on the channel and to raise the video quality to 720p.

    H264 streaming


    To reduce the load on the channel, we decided to switch to the h264 codec (having found a suitable webcam in our inventory). mjpg-streamer had no h264 support, so we decided to add it ourselves. During development we used two cameras with built-in h264 encoders, one from Logitech and one from ELP. As it turned out, the contents of the h264 streams they produced differed significantly.

    Cameras and stream structure


    An h264 stream consists of NAL (network abstraction layer) packets of several types. Our cameras generated five of them:

    • Picture parameter set (PPS)
    • Sequence parameter set (SPS)
    • Coded slice layer without partitioning, IDR picture
    • Coded slice layer without partitioning, non-IDR picture
    • Coded slice data partition

    An IDR (instantaneous decoding refresh) packet contains an encoded picture for which all the data needed to decode it is carried in the packet itself. The decoder needs such a packet to start producing images; usually the first frame of any h264-compressed video is an IDR picture.

    A non-IDR packet contains an encoded picture that references other frames. The decoder cannot reconstruct an image from a single non-IDR frame without the other packets.

    Besides the IDR frame, the decoder also needs the PPS and SPS packets, which carry metadata about the picture and the stream.

    Following the mjpg-streamer code, we used the V4L2 (video4linux2) API to read data from the cameras. As it turned out, a single video “frame” returned by V4L2 contained several NAL packets.

    It was in the contents of these “frames” that the cameras differed significantly. We used the h264bitstream library to parse the stream; there are also standalone utilities for inspecting a stream's contents.
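
    To make this concrete, the type of a NAL unit is easy to determine even without a library: in an Annex-B stream, NAL units are separated by 00 00 01 start codes, and the low five bits of the byte that follows the start code hold the nal_unit_type. Below is a minimal sketch of such a scan over one V4L2 buffer (our own illustration, not code from mjpg-streamer or h264bitstream).

    #include <cstdint>
    #include <cstdio>
    #include <cstddef>

    // nal_unit_type values relevant to our cameras (ITU-T H.264, Table 7-1)
    enum NalType {
        NAL_SLICE_NON_IDR = 1,  // coded slice, non-IDR picture
        NAL_PARTITION_A   = 2,  // coded slice data partition A (B = 3, C = 4)
        NAL_SLICE_IDR     = 5,  // coded slice, IDR picture
        NAL_SPS           = 7,  // sequence parameter set
        NAL_PPS           = 8   // picture parameter set
    };

    // Walk one buffer returned by V4L2 and print the type of every NAL unit in it.
    void dump_nal_types(const uint8_t* buf, size_t len)
    {
        for (size_t i = 0; i + 3 < len; ++i) {
            // Annex-B start code: 00 00 01 (possibly preceded by an extra 00)
            if (buf[i] == 0 && buf[i + 1] == 0 && buf[i + 2] == 1) {
                unsigned nal_type = buf[i + 3] & 0x1F; // low 5 bits of the NAL header
                printf("NAL unit at offset %zu, type %u\n", i, nal_type);
                i += 3;
            }
        }
    }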



    The Logitech camera's stream consisted mostly of non-IDR frames, additionally split into several data partitions. Only once every 30 seconds did the camera emit a packet containing an IDR picture together with SPS and PPS. Since the decoder needs an IDR packet to start decoding the video, this did not suit us at all: a client joining the stream could wait up to half a minute for the first picture. Unfortunately, there turned out to be no sensible way to control how often the camera generates IDR packets, so we had to abandon this camera.



    The ELP camera turned out to be far more convenient. Every frame we received contained PPS and SPS packets, and the camera produced an IDR packet every 30 frames (a period of about 1 s). That suited us fine, so we settled on this camera.

    Implementation of a broadcast server based on mjpg-streamer


    As the basis for the server side we took the aforementioned mjpg-streamer: its plugin architecture makes it easy to add new input and output plugins. We started with an input plugin that reads the h264 stream from the device, and for output we used the existing HTTP plugin.

    In V4L2 it was enough to request frames in the V4L2_PIX_FMT_H264 format to start receiving the h264 stream. Since an IDR frame is needed before anything can be decoded, we parsed the stream and waited for one; the stream was sent to the client application over HTTP starting from that frame.
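
    For reference, requesting the compressed stream from the driver looks roughly like this (a simplified sketch: error handling, buffer negotiation, and streaming setup are trimmed, and the function name is ours):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>
    #include <cstring>

    // Open the camera and ask it for h264 frames instead of raw YUV/MJPEG.
    int open_h264_camera(const char* dev, int width, int height)
    {
        int fd = open(dev, O_RDWR);
        if (fd < 0)
            return -1;

        v4l2_format fmt;
        memset(&fmt, 0, sizeof(fmt));
        fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        fmt.fmt.pix.width = width;
        fmt.fmt.pix.height = height;
        fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_H264;  // compressed stream straight from the camera
        fmt.fmt.pix.field = V4L2_FIELD_ANY;

        if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0 ||
            fmt.fmt.pix.pixelformat != V4L2_PIX_FMT_H264) {
            close(fd);  // the camera cannot produce h264
            return -1;
        }
        // Buffer setup (VIDIOC_REQBUFS, mmap) and VIDIOC_STREAMON follow as usual.
        return fd;
    }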

    On the client side we decided to use libavformat and libavcodec from the FFmpeg project to read and decode the h264 stream. In the first test prototype, FFmpeg was responsible for receiving the stream over the network, splitting it into frames, and decoding; converting the decoded image from NV12 to RGB and displaying it was done with OpenCV.
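
    The NV12-to-RGB conversion in that prototype boiled down to a couple of OpenCV calls, roughly as in the sketch below (it assumes the Y and UV planes have been copied into one contiguous buffer of width × height × 3/2 bytes; the function name is ours):

    #include <opencv2/opencv.hpp>
    #include <cstdint>

    // Wrap a contiguous NV12 buffer in a Mat and convert it to BGR for display.
    cv::Mat nv12_to_bgr(const uint8_t* nv12, int width, int height)
    {
        // NV12: full-resolution Y plane followed by an interleaved UV plane at half height
        cv::Mat yuv(height * 3 / 2, width, CV_8UC1, const_cast<uint8_t*>(nv12));
        cv::Mat bgr;
        cv::cvtColor(yuv, bgr, cv::COLOR_YUV2BGR_NV12);
        return bgr;  // ready for cv::imshow()
    }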

    The first tests showed that this way of broadcasting video works, but with a significant delay (about 1 second). Our suspicion fell on the HTTP transport, so we decided to switch to UDP for delivering the packets.

    Since we did not need to support existing protocols such as RTP, we implemented the simplest home-grown protocol we could, in which NAL packets of the h264 stream are transmitted inside UDP datagrams (sketched below). After a little reworking of the receiving side, we were pleasantly surprised by the low latency of the video on a desktop PC. However, the very first tests on the mobile device showed that software h264 decoding is no pastime for a mobile processor: the tablet simply could not keep up with the frames in real time.
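
    The sending side of this protocol is almost trivial (the names and framing below are ours; oversized NAL units are not fragmented here, although a single UDP datagram can carry up to 64 KB). The receiving side simply appends arriving datagrams to the decoder's input buffer.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstddef>
    #include <cstring>

    // Sends each Annex-B NAL unit (start code included) as one UDP datagram.
    struct NalSender {
        int sock;
        sockaddr_in dst;

        NalSender(const char* ip, uint16_t port) {
            sock = socket(AF_INET, SOCK_DGRAM, 0);
            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_port = htons(port);
            inet_pton(AF_INET, ip, &dst.sin_addr);
        }
        ~NalSender() { close(sock); }

        void send_nal(const uint8_t* nal, size_t len) const {
            sendto(sock, nal, len, 0,
                   reinterpret_cast<const sockaddr*>(&dst), sizeof(dst));
        }
    };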

    Since the Atom Z3740 processor in our tablet supports Quick Sync Video (QSV), we tried the QSV h264 decoder from libavcodec. To our surprise, it not only failed to improve the situation but increased the delay to 1.5 seconds, even on a powerful desktop PC! On the other hand, this approach did significantly reduce the CPU load.
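
    Selecting the hardware decoder in libavcodec itself is simple enough; a sketch of how it can be done is below (it assumes an FFmpeg build with QSV support, otherwise the lookup returns NULL, and the function name is ours):

    extern "C" {
    #include <libavcodec/avcodec.h>
    }

    // Create and open the Quick Sync h264 decoder instead of the software one.
    AVCodecContext* open_qsv_h264_decoder()
    {
        const AVCodec* codec = avcodec_find_decoder_by_name("h264_qsv");
        if (!codec)
            return nullptr;  // this FFmpeg build has no QSV support
        AVCodecContext* ctx = avcodec_alloc_context3(codec);
        if (!ctx || avcodec_open2(ctx, codec, nullptr) < 0) {
            avcodec_free_context(&ctx);
            return nullptr;
        }
        // Packets are then fed with avcodec_send_packet() / avcodec_receive_frame().
        return ctx;
    }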

    After trying various decoder configuration options in FFmpeg, we decided to abandon libavcodec and use the Intel Media SDK directly.

    The first surprise was the horror that awaits anyone who decides to develop with the Media SDK. The official sample offered to developers is an all-in-one monster that can do everything but is very hard to make sense of. Fortunately, we found like-minded people on the Intel forums who were equally unhappy with the sample and who pointed to old but far more digestible tutorials. Based on the simple_2_decode example, we ended up with the following code.

    Stream decoding with the Intel Media SDK
    mfxStatus sts = MFX_ERR_NONE;
    // Buffer holding the contents of the h264 stream
    mfxBitstream mfx_bitstream;
    memset(&mfx_bitstream, 0, sizeof(mfx_bitstream));
    mfx_bitstream.MaxLength = 1 * 1024 * 1024; // 1 MB
    mfx_bitstream.Data = new mfxU8[mfx_bitstream.MaxLength];
    // Our implementation of the UDP-based protocol
    StreamReader *reader = new StreamReader(/*...*/);
    MFXVideoDECODE *mfx_dec;
    mfxVideoParam mfx_video_params;
    MFXVideoSession session;
    mfxFrameAllocator *mfx_allocator;
    // Initialize the MFX session
    mfxIMPL impl = MFX_IMPL_AUTO;
    mfxVersion ver = { { 0, 1 } };
    sts = session.Init(impl, &ver);
    if (sts < MFX_ERR_NONE)
        return 0; // :(
    // Create the decoder and select the AVC (h.264) codec
    mfx_dec = new MFXVideoDECODE(session);
    memset(&mfx_video_params, 0, sizeof(mfx_video_params));
    mfx_video_params.mfx.CodecId = MFX_CODEC_AVC;
    // Decode into system memory
    mfx_video_params.IOPattern = MFX_IOPATTERN_OUT_SYSTEM_MEMORY;
    // Keep the internal queue depth at the minimum
    mfx_video_params.AsyncDepth = 1;
    // Read the stream metadata (SPS/PPS)
    reader->ReadToBitstream(&mfx_bitstream);
    sts = mfx_dec->DecodeHeader(&mfx_bitstream, &mfx_video_params);
    if (sts < MFX_ERR_NONE)
        return 0; // :(
    // Query the required number and size of frame surfaces
    mfxFrameAllocRequest request;
    memset(&request, 0, sizeof(request));
    sts = mfx_dec->QueryIOSurf(&mfx_video_params, &request);
    if (sts < MFX_ERR_NONE)
        return 0; // :(
    mfxU16 numSurfaces = request.NumFrameSuggested;
    // The decoder requires width and height to be multiples of 32
    mfxU16 width = (mfxU16)MSDK_ALIGN32(request.Info.Width);
    mfxU16 height = (mfxU16)MSDK_ALIGN32(request.Info.Height);
    // NV12 is a YUV 4:2:0 format, 12 bits per pixel
    mfxU8 bitsPerPixel = 12;
    mfxU32 surfaceSize = width * height * bitsPerPixel / 8;
    // Allocate memory for the surfaces the frames will be decoded into
    mfxU8* surfaceBuffers = new mfxU8[surfaceSize * numSurfaces];
    // Surface metadata for the decoder
    mfxFrameSurface1** pmfxSurfaces =
                    new mfxFrameSurface1*[numSurfaces];
    for (int i = 0; i < numSurfaces; i++)
    {
        pmfxSurfaces[i] = new mfxFrameSurface1;
        memset(pmfxSurfaces[i], 0, sizeof(mfxFrameSurface1));
        memcpy(&(pmfxSurfaces[i]->Info),
          &(mfx_video_params.mfx.FrameInfo), sizeof(mfxFrameInfo));
        pmfxSurfaces[i]->Data.Y = &surfaceBuffers[surfaceSize * i];
        pmfxSurfaces[i]->Data.U =
                          pmfxSurfaces[i]->Data.Y + width * height;
        pmfxSurfaces[i]->Data.V = pmfxSurfaces[i]->Data.U + 1; // NV12: interleaved UV
        pmfxSurfaces[i]->Data.Pitch = width;
    }
    sts = mfx_dec->Init(&mfx_video_params);
    if (sts < MFX_ERR_NONE)
        return 0; // :(
    mfxSyncPoint syncp;
    mfxFrameSurface1* pmfxOutSurface = NULL;
    mfxU32 nFrame = 0;
    int nIndex = 0;
    // Start of the stream decoding loop
    while (reader->IsActive() &&
        (MFX_ERR_NONE <= sts
            || MFX_ERR_MORE_DATA == sts
            || MFX_ERR_MORE_SURFACE == sts))
    {
        // Wait if the device was busy
        if (MFX_WRN_DEVICE_BUSY == sts)
            Sleep(1);
        if (MFX_ERR_MORE_DATA == sts)
            reader->ReadToBitstream(&mfx_bitstream);
        if (MFX_ERR_MORE_SURFACE == sts || MFX_ERR_NONE == sts)
        {
            // Find a surface that the decoder is not currently using
            nIndex = GetFreeSurfaceIndex(pmfxSurfaces, numSurfaces);
            if (nIndex == MFX_ERR_NOT_FOUND)
                break;
        }
        // Decode a frame
        // The decoder finds the NAL packets in the stream and consumes them itself
        sts = mfx_dec->DecodeFrameAsync(&mfx_bitstream,
                pmfxSurfaces[nIndex], &pmfxOutSurface, &syncp);
        // Ignore warnings
        if (MFX_ERR_NONE < sts && syncp)
            sts = MFX_ERR_NONE;
        // Wait until frame decoding has finished
        if (MFX_ERR_NONE == sts)
            sts = session.SyncOperation(syncp, 60000);
        if (MFX_ERR_NONE == sts)
        {
            // The frame is ready!
            ++nFrame;
            mfxFrameInfo* pInfo = &pmfxOutSurface->Info;
            mfxFrameData* pData = &pmfxOutSurface->Data;
            // The decoded frame is in NV12 format:
            // Y plane: pData->Y, full resolution
            // UV plane: pData->UV, half the resolution of Y
        }
    } // End of the decoding loop
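
    The StreamReader used above is our own class. A minimal sketch of what its ReadToBitstream does is shown below, written as a free function for brevity (it assumes, as in our protocol, that every UDP datagram carries whole NAL units and that the socket has already been opened).

    #include <mfxvideo++.h>
    #include <cstring>
    #ifdef _WIN32
    #include <winsock2.h>
    #else
    #include <sys/socket.h>
    #endif

    // Appends freshly received data to the decoder's input buffer. The decoder
    // advances bs->DataOffset as it consumes NAL units, so the unread tail is
    // first moved back to the start of the buffer.
    void ReadToBitstream(int udp_socket, mfxBitstream* bs)
    {
        // Compact: move the not-yet-consumed bytes to the beginning of the buffer
        memmove(bs->Data, bs->Data + bs->DataOffset, bs->DataLength);
        bs->DataOffset = 0;
        // Append one datagram (one or more whole NAL units)
        int received = recv(udp_socket,
                            reinterpret_cast<char*>(bs->Data + bs->DataLength),
                            bs->MaxLength - bs->DataLength, 0);
        if (received > 0)
            bs->DataLength += received;
    }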
    


    After implementing decoding with the Media SDK we found ourselves in the same situation: the video delay was 1.5 seconds. Desperate, we turned to the forums and found some tips for reducing the decoding delay.

    The Media SDK h264 decoder accumulates frames before it outputs a decoded image. We found that if the “end of stream” flag is set in the structure passed to the decoder (mfxBitstream), the delay drops to about 0.5 seconds:

    mfx_bitstream.DataFlag = MFX_BITSTREAM_EOS;
    

    We then found experimentally that the decoder still keeps 5 frames in its queue even with the end-of-stream flag set. So we had to add code that simulates a “definitive end of stream” and forces the decoder to flush frames from this queue:

    // Passing NULL instead of the bitstream tells the decoder that the stream has
    // ended, which forces it to return the frames it has buffered internally.
    if (no_frames_in_queue)
        sts = mfx_dec->DecodeFrameAsync(&mfx_bitstream, pmfxSurfaces[nIndex],
                                        &pmfxOutSurface, &syncp);
    else
        sts = mfx_dec->DecodeFrameAsync(0, pmfxSurfaces[nIndex],
                                        &pmfxOutSurface, &syncp);
    if (sts == MFX_ERR_MORE_DATA)
    {
        // The internal queue has been drained, go back to feeding real data
        no_frames_in_queue = true;
    }
    

    After that, the delay dropped to an acceptable level, i.e. imperceptible to the eye.

    Conclusions


    When we started on the task of broadcasting video in real time, we very much hoped to use existing solutions and avoid reinventing any wheels.

    Our main hopes were the video giants FFmpeg and VLC. Although they seem able to do what we needed (transmit video without transcoding), we were unable to get rid of the delay in the transmitted video.

    Having stumbled almost by accident upon the mjpg-streamer project, we were charmed by its simplicity and how well it does its job of broadcasting MJPEG video. If you ever need to stream that particular format, we strongly recommend it. It is no coincidence that we built our own solution on top of it.

    In the end we got a fairly lightweight solution for transmitting video without delay that is undemanding of resources on both the transmitting and the receiving side. The Intel Media SDK helped us a lot with decoding, even though we had to apply a little force to make it hand over frames without buffering.
