Experience building a real-time video sequencer on iOS

    Hi, my name is Anton and I am an iOS developer at Rosberry. Not long ago I worked on the Hype Type project and solved several interesting problems involving video, text, and animation. In this article I will talk about the pitfalls of writing a real-time video sequencer on iOS and possible ways around them.

    A little about the application itself...

    Hype Type lets the user record several short video clips and/or photos with a total duration of up to 15 seconds, add text to the resulting clip, and apply one of several animations to it.


    The main peculiarity of the video handling here is that the user should be able to manage the video fragments independently of one another: change playback speed, reverse, flip, and (possibly in future versions) reorder clips on the fly.


    Ready solutions?

    “Why not use AVMutableComposition?” you may ask, and in most cases you would be right: it really is a fairly convenient system video sequencer. Alas, it has limitations that ruled it out for us. First, tracks cannot be changed or added on the fly: to get a modified video stream you have to recreate the AVPlayerItem and reinitialize the AVPlayer. Working with still images in AVMutableComposition is also far from ideal: to add a static image to the timeline you have to use AVVideoCompositionCoreAnimationTool, which adds a fair amount of overhead and noticeably slows down rendering.

    A quick search on the Internet did not reveal any other solutions suitable for the task, so we decided to write our own sequencer.


    To begin with, a little about the structure of the render pipeline in the project. I will say right away that I will not go into too much detail and will assume you are more or less familiar with the topic, otherwise this article would grow to incredible proportions. If you are a beginner, I advise you to look at the fairly well-known GPUImage framework (Obj-C, Swift): it is a great starting point for understanding OpenGL ES through a clear example.

    The view that renders the incoming video on screen requests frames from the sequencer on a timer (CADisplayLink). Since the application works mostly with video, it is most logical to use the YCbCr color space and pass each frame as a CVPixelBufferRef. Once a frame is received, luminance and chrominance textures are created and handed to the shader program; the output is the RGB image the user sees. The refresh loop looks roughly like this:

    - (void)onDisplayRefresh:(CADisplayLink *)sender {
        // advance position of sequencer
        [self.source advanceBy:sender.duration];
        // check for new pixel buffer
        if ([self.source hasNewPixelBuffer]) {
            // get one
            PixelBuffer *pixelBuffer = [self.source nextPixelBuffer];
            // dispatch to gl processing queue
            [self.context performAsync:^{
                // prepare textures
                self.luminanceTexture = [self.context.textureCache textureWithPixelBuffer:pixelBuffer planeIndex:0 glFormat:GL_LUMINANCE];
                self.chrominanceTexture = [self.context.textureCache textureWithPixelBuffer:pixelBuffer planeIndex:1 glFormat:GL_LUMINANCE_ALPHA];
                // prepare shader program, uniforms, etc
                self.program.orientation = pixelBuffer.orientation;
                // ...
                // signal to draw
                [self setNeedsRedraw];
            }];
        }
        if ([self.source isFinished]) {
            // rewind if needed
            [self.source rewind];
        }
    }
    // ...
    - (void)draw {
        [self.context performSync:^{
            // bind textures
            [self.luminanceTexture bind];
            [self.chrominanceTexture bind];
            // use shader program
            [self.program use];
            // draw
            // ...
            // unbind textures
            [self.luminanceTexture unbind];
            [self.chrominanceTexture unbind];
        }];
    }

    Almost everything here is built on wrappers (around CVPixelBufferRef, CVOpenGLESTexture, etc.), which lets us move the main low-level logic to a separate layer and greatly simplify the basics of working with OpenGL. Of course, this has its drawbacks (mainly a small loss of performance and flexibility), but they are not critical. A few things worth explaining: self.context is a fairly thin wrapper around EAGLContext that makes working with CVOpenGLESTextureCache and multithreaded OpenGL calls easier; self.source is the sequencer, which decides which frame from which track to hand to the view.

    Now about how frames are obtained for rendering. Since the sequencer should work with both video and still images, it is most logical to hide everything behind a common protocol. The sequencer's job is then to track the playhead and, depending on its position, return a new frame from the corresponding track.

    @protocol MovieSourceProtocol
    // start & stop reading methods
    - (void)startReading;
    - (void)cancelReading;
    // methods for getting frame rate & current offset
    - (float)frameRate;
    - (float)offset;
    // method to check if we already read everything...
    - (BOOL)isFinished;
    // ...and to rewind source if we did
    - (void)rewind;
    // method for scrubbing
    - (void)seekToOffset:(CGFloat)offset;
    // methods used by the refresh loop
    - (void)advanceBy:(NSTimeInterval)interval;
    - (BOOL)hasNewPixelBuffer;
    // method for reading frames
    - (PixelBuffer *)nextPixelBuffer;
    @end

    The logic of obtaining frames falls to the objects implementing MovieSourceProtocol. This scheme keeps the system universal and extensible, since the only difference between image and video processing is the way frames are obtained.

    Thus, the VideoSequencer itself becomes very simple, and the main difficulty remains determining the current track and bringing all tracks to a single frame rate.

    - (PixelBuffer *)nextPixelBuffer {
        // get current track
        VideoSequencerTrack *track = [self trackForPosition:self.position];
        // get track source
        id<MovieSourceProtocol> source = track.source;
        // get pixel buffer
        return [source nextPixelBuffer];
    }
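    trackForPosition: is not shown; conceptually it just walks the tracks, accumulating durations until the playhead falls inside one. A sketch of that lookup in plain C (hypothetical names and durations, not the production code):

    ```c
    #include <stdio.h>

    // Find the track under the playhead: accumulate track durations
    // until the playhead position falls inside the current one.
    // Returns the track index, or -1 if the position is past the end;
    // *local receives the offset inside the found track.
    int track_for_position(const double *durations, int count,
                           double position, double *local) {
        double start = 0.0;
        for (int i = 0; i < count; ++i) {
            if (position < start + durations[i]) {
                *local = position - start;
                return i;
            }
            start += durations[i];
        }
        return -1;
    }

    int main(void) {
        double durations[] = { 1.0, 2.0, 0.5 }; // three clips
        double local = 0.0;
        int track = track_for_position(durations, 3, 1.5, &local);
        // playhead at 1.5 s lands 0.5 s into the second clip
        printf("%d %.1f\n", track, local);
        return 0;
    }
    ```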

    VideoSequencerTrack here is a wrapper around an object implementing MovieSourceProtocol, plus assorted metadata.

    @interface FCCGLVideoSequencerTrack : NSObject
    - (id)initWithSource:(id<MovieSourceProtocol>)source;
    @property (nonatomic, assign) BOOL editable;
    // ... and other metadata
    @end

    Working with stills

    Now we proceed directly to obtaining frames. Consider the simplest case: displaying a single image. It can come either from the camera, in which case we immediately get a CVPixelBufferRef in YCbCr format, which we simply copy (why that is important, I will explain a bit later) and return on request; or from the media library, in which case we have to do a little extra work and manually convert the image to the desired format. The conversion from RGB to YCbCr could be done on the GPU, but on modern devices the CPU copes with the task quickly enough, especially considering that the application additionally crops and compresses the image before using it. Otherwise everything is quite simple: all we need to do is return the same frame for the allotted time interval.

    @implementation ImageSource
    // init with pixel buffer from camera
    - (id)initWithPixelBuffer:(PixelBuffer *)pixelBuffer orientation:(AVCaptureVideoOrientation)orientation duration:(NSTimeInterval)duration {
        if (self = [super init]) {
            self.orientation = orientation;
            self.pixelBuffer = [pixelBuffer copy];
            self.duration = duration;
        }
        return self;
    }
    // init with UIImage
    - (id)initWithImage:(UIImage *)image duration:(NSTimeInterval)duration {
        if (self = [super init]) {
            self.duration = duration;
            self.orientation = AVCaptureVideoOrientationPortrait;
            // prepare empty pixel buffer
            self.pixelBuffer = [[PixelBuffer alloc] initWithSize:image.size pixelFormat:kCVPixelFormatType_420YpCbCr8BiPlanarFullRange];
            // get base addresses of image planes
            uint8_t *yBaseAddress = self.pixelBuffer.yPlane.baseAddress;
            size_t yPitch = self.pixelBuffer.yPlane.bytesPerRow;
            uint8_t *uvBaseAddress = self.pixelBuffer.uvPlane.baseAddress;
            size_t uvPitch = self.pixelBuffer.uvPlane.bytesPerRow;
            // get image data (assumes RGBA byte order in the CGImage backing store)
            CFDataRef pixelData = CGDataProviderCopyData(CGImageGetDataProvider(image.CGImage));
            uint8_t *data = (uint8_t *)CFDataGetBytePtr(pixelData);
            uint32_t imageWidth = image.size.width;
            uint32_t imageHeight = image.size.height;
            // do the magic (convert from RGB to YCbCr)
            for (int y = 0; y < imageHeight; ++y) {
                uint8_t *rgbBufferLine = &data[y * imageWidth * 4];
                uint8_t *yBufferLine = &yBaseAddress[y * yPitch];
                uint8_t *cbCrBufferLine = &uvBaseAddress[(y >> 1) * uvPitch];
                for (int x = 0; x < imageWidth; ++x) {
                    uint8_t *rgbOutput = &rgbBufferLine[x * 4];
                    int16_t red = rgbOutput[0];
                    int16_t green = rgbOutput[1];
                    int16_t blue = rgbOutput[2];
                    int16_t luma = 0.299 * red + 0.587 * green + 0.114 * blue;
                    int16_t u = -0.147 * red - 0.289 * green + 0.436 * blue;
                    int16_t v = 0.615 * red - 0.515 * green - 0.1 * blue;
                    yBufferLine[x] = CLAMP(luma, 0, 255);
                    cbCrBufferLine[x & ~1] = CLAMP(u + 128, 0, 255);
                    cbCrBufferLine[x | 1] = CLAMP(v + 128, 0, 255);
                }
            }
            CFRelease(pixelData);
        }
        return self;
    }
    // ...
    - (BOOL)isFinished {
        return (self.offset > self.duration);
    }

    - (void)rewind {
        self.offset = 0.0;
    }

    - (PixelBuffer *)nextPixelBuffer {
        if ([self isFinished]) {
            return nil;
        }
        return self.pixelBuffer;
    }
    // ...
    @end
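    The per-pixel math above is, approximately, a BT.601 full-range conversion. Pulled out into plain C with the CLAMP macro spelled out, it can be sanity-checked on its own (a standalone sketch, not the app's code):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    #define CLAMP(v, lo, hi) ((v) < (lo) ? (lo) : ((v) > (hi) ? (hi) : (v)))

    // Same coefficients as in ImageSource: BT.601-style RGB -> YCbCr,
    // with chroma shifted into the 0..255 range.
    static void rgb_to_ycbcr(int16_t r, int16_t g, int16_t b,
                             uint8_t *y, uint8_t *cb, uint8_t *cr) {
        int16_t luma = 0.299 * r + 0.587 * g + 0.114 * b;
        int16_t u = -0.147 * r - 0.289 * g + 0.436 * b;
        int16_t v = 0.615 * r - 0.515 * g - 0.1 * b;
        *y = CLAMP(luma, 0, 255);
        *cb = CLAMP(u + 128, 0, 255);
        *cr = CLAMP(v + 128, 0, 255);
    }

    int main(void) {
        uint8_t y, cb, cr;
        // pure red: Cr overshoots and is clamped to 255
        rgb_to_ycbcr(255, 0, 0, &y, &cb, &cr);
        printf("%u %u %u\n", y, cb, cr);
        // black maps to zero luma and neutral chroma
        rgb_to_ycbcr(0, 0, 0, &y, &cb, &cr);
        printf("%u %u %u\n", y, cb, cr);
        return 0;
    }
    ```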

    Working with video

    Now let's add video. For this we decided to use AVPlayer, mainly because it has a fairly convenient API for obtaining frames and takes care of the sound entirely. That sounds simple enough, but there are a few points worth paying attention to.
    Let's start with the obvious:

    - (void)setURL:(NSURL *)url withCompletion:(void(^)(BOOL success))completion {
        self.setupCompletion = completion;
        // prepare asset
        self.asset = [[AVURLAsset alloc] initWithURL:url options:@{
            AVURLAssetPreferPreciseDurationAndTimingKey : @(YES),
        }];
        // load asset tracks
        __weak VideoSource *weakSelf = self;
        [self.asset loadValuesAsynchronouslyForKeys:@[@"tracks"] completionHandler:^{
            // prepare player item
            weakSelf.playerItem = [AVPlayerItem playerItemWithAsset:weakSelf.asset];
            [weakSelf.playerItem addObserver:weakSelf forKeyPath:@"status" options:NSKeyValueObservingOptionNew context:nil];
        }];
    }

    - (void)observeValueForKeyPath:(NSString *)keyPath ofObject:(id)object change:(NSDictionary *)change context:(void *)context {
        if (self.playerItem.status == AVPlayerItemStatusReadyToPlay) {
            // ready to play, prepare output
            NSDictionary *outputSettings = @{
                (id)kCVPixelBufferPixelFormatTypeKey: @(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange),
                (id)kCVPixelBufferOpenGLESCompatibilityKey: @(YES),
                (id)kCVPixelBufferOpenGLCompatibilityKey: @(YES),
                (id)kCVPixelBufferIOSurfacePropertiesKey: @{
                    @"IOSurfaceOpenGLESFBOCompatibility": @(YES),
                    @"IOSurfaceOpenGLESTextureCompatibility": @(YES),
                },
            };
            self.videoOutput = [[AVPlayerItemVideoOutput alloc] initWithPixelBufferAttributes:outputSettings];
            [self.playerItem addOutput:self.videoOutput];
            if (self.setupCompletion) {
                self.setupCompletion(YES);
            }
        }
    }
    // ...
    - (void)rewind {
        [self seekToOffset:0.0];
    }

    - (void)seekToOffset:(CGFloat)offset {
        [self.playerItem seekToTime:[self timeForOffset:offset] toleranceBefore:kCMTimeZero toleranceAfter:kCMTimeZero];
    }

    - (PixelBuffer *)nextPixelBuffer {
        // check for new pixel buffer...
        CMTime time = self.playerItem.currentTime;
        if (![self.videoOutput hasNewPixelBufferForItemTime:time]) {
            return nil;
        }
        // ... and grab it if there is one
        CVPixelBufferRef bufferRef = [self.videoOutput copyPixelBufferForItemTime:time itemTimeForDisplay:nil];
        if (!bufferRef) {
            return nil;
        }
        PixelBuffer *pixelBuffer = [[PixelBuffer alloc] initWithPixelBuffer:bufferRef];
        CVBufferRelease(bufferRef); // balance the copy; the wrapper is assumed to retain
        return pixelBuffer;
    }

    We create an AVURLAsset, load the track information, create an AVPlayerItem, wait for the notification that it is ready to play, and create an AVPlayerItemVideoOutput with parameters suitable for rendering. So far, still quite simple.

    However, the first problem lurks right here: seekToTime is not fast enough, and looping shows noticeable delays. Loosening the toleranceBefore and toleranceAfter parameters does not change much, except that positioning inaccuracy is added on top of the delay. This is a system limitation and cannot be solved outright, but it can be worked around: prepare two AVPlayerItems and use them in turn. As soon as one finishes playing, the other starts immediately, while the first rewinds to the beginning. And so on in a circle.
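    The ping-pong scheme needs very little state. A minimal sketch in plain C (hypothetical names; in the app the two "items" are the two prepared AVPlayerItems):

    ```c
    #include <stdio.h>

    // Two prepared players: one active, one standby pre-seeked to the start.
    typedef struct {
        int active;          // index of the item currently playing
        double position[2];  // playback position of each item, seconds
    } PingPong;

    // Called when the active item reaches the end of the clip:
    // switch to the standby item immediately, and rewind the old
    // one off-screen, hiding the cost of the slow seek.
    void ping_pong_loop(PingPong *p) {
        int finished = p->active;
        p->active = 1 - p->active;   // standby item starts right away
        p->position[finished] = 0.0; // old item rewinds in the background
    }

    int main(void) {
        PingPong p = { .active = 0, .position = { 5.0, 0.0 } };
        ping_pong_loop(&p);
        printf("%d %.1f\n", p.active, p.position[0]);
        return 0;
    }
    ```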

    Another unpleasant but solvable problem: AVFoundation does not support changing the playback speed and reversing the way we need (seamlessly and smoothly) for all file types. While for camera recordings we control the output format, we have no such luxury when the user loads a video from the media library. Making the user wait for the video to be converted is not a good option, especially since they may never touch those settings, so we decided to re-encode in the background and quietly replace the original video with the converted one.

    - (void)processAndReplace:(NSURL *)inputURL outputURL:(NSURL *)outputURL {
        [[NSFileManager defaultManager] removeItemAtURL:outputURL error:nil];
        // prepare reader
        MovieReader *reader = [[MovieReader alloc] initWithInputURL:inputURL];
        reader.timeRange = self.timeRange;
        // prepare writer
        MovieWriter *writer = [[MovieWriter alloc] initWithOutputURL:outputURL];
        writer.videoSettings = @{
            AVVideoCodecKey: AVVideoCodecH264,
            AVVideoWidthKey: @(1280.0),
            AVVideoHeightKey: @(720.0),
        };
        writer.audioSettings = @{
            AVFormatIDKey: @(kAudioFormatMPEG4AAC),
            AVNumberOfChannelsKey: @(1),
            AVSampleRateKey: @(44100),
            AVEncoderBitRateStrategyKey: AVAudioBitRateStrategy_Variable,
            AVEncoderAudioQualityForVBRKey: @(90),
        };
        // fire up reencoding
        MovieProcessor *processor = [[MovieProcessor alloc] initWithReader:reader writer:writer];
        processor.processingSize = (CGSize){
            .width = 1280.0,
            .height = 720.0
        };
        __weak MovieStreamer *weakSelf = self;
        [processor processWithProgressBlock:nil andCompletion:^(NSError *error) {
            if (!error) {
                weakSelf.replacementURL = outputURL;
            }
        }];
    }

    MovieProcessor here is a service that takes frames and audio samples from the reader and feeds them to the writer. (It can also process the frames coming from the reader on the GPU, but this is used only when rendering the finished project, to overlay animation frames on top of the video.)

    Now for the harder part

    But what if the user wants to add 10-15 video clips to the project at once? Since the application should not limit the number of clips the user can work with, this scenario has to be handled.

    If we prepare each clip for playback only when it is needed, the delays become too noticeable. Preparing all the clips at once is also impossible, because iOS limits the number of H.264 decoders working simultaneously. The way out of this situation is fairly simple: prepare in advance a couple of tracks that will be played next, "cleaning up" the ones that will not be needed in the near future.

    - (void)cleanupTrackSourcesIfNeeded {
        const NSUInteger cleanupDelta = 1;
        NSUInteger trackCount = [self.tracks count];
        NSUInteger currentIndex = [self.tracks indexOfObject:self.currentTrack];
        if (currentIndex == NSNotFound) {
            currentIndex = 0;
        }
        NSUInteger index = 0;
        for (FCCGLVideoSequencerTrack *track in self.tracks) {
            // distance from the current track, taking looping into account
            NSUInteger currentDelta = MAX(currentIndex, index) - MIN(currentIndex, index);
            currentDelta = MIN(currentDelta, index + (trackCount - currentIndex - 1));
            if (currentDelta > cleanupDelta) {
                track.playheadPosition = 0.0;
                [track.source cancelReading];
                [track.source cleanup];
            }
            else {
                [track.source startReading];
            }
            ++index;
        }
    }

    In this simple way we achieved continuous playback and looping. Yes, scrubbing will inevitably lag a little, but that is not critical.
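    The wrap-around distance in the cleanup method can be isolated into a tiny helper. A sketch of a symmetric variant of that computation (hypothetical name; the production code inlines it):

    ```c
    #include <stdio.h>

    // Distance between two track indices on a looping timeline of n tracks:
    // the shorter of the direct path and the path that wraps past the end.
    unsigned ring_distance(unsigned a, unsigned b, unsigned n) {
        unsigned direct = a > b ? a - b : b - a;
        unsigned wrapped = n - direct;
        return direct < wrapped ? direct : wrapped;
    }

    int main(void) {
        // 5 tracks, playhead on track 0: track 4 is only 1 step away
        // when looping, so it must stay prepared; track 2 is 2 away.
        printf("%u %u\n", ring_distance(0, 4, 5), ring_distance(0, 2, 5));
        return 0;
    }
    ```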

    Pitfalls

    In closing, a few words about the pitfalls you may hit when solving problems like these.

    First, if you work with pixel buffers received from the device camera, either release them immediately or copy them if you want to use them later. Otherwise the video stream stalls. I found no mention of this restriction in the documentation, but the system apparently tracks the pixel buffers it hands out and will not give you new ones while the old ones linger in memory.

    Second, multithreading with OpenGL. OpenGL itself is not very friendly to it, but this can be worked around by using several EAGLContexts that belong to the same EAGLSharegroup, which lets you quickly and easily separate the rendering of what the user sees on screen from various background processes (video processing, export rendering, etc.).
