Deep Learning, now in OpenCV



    This article is a brief overview of the capabilities of dnn, the OpenCV module designed for working with neural networks. If you are interested in what it is, what it can do and how fast it works, read on.

    Many will probably agree that OpenCV is the best-known computer vision library. Over its long existence it has gained a vast audience of users and has become the de facto standard in the field of computer vision. A lot of algorithms that work out of the box, open source code, great support, a large community of users and developers, the ability to use the library from C, C++ and Python (as well as Matlab, C# and Java) on various operating systems - this is far from a complete list of what keeps OpenCV in demand. But OpenCV does not stand still: functionality is constantly being added. And today I want to talk about OpenCV's new capabilities in the field of deep learning.

    Loading models created in any of three popular frameworks (Caffe, TensorFlow, Torch) and obtaining predictions from them, fast operation on the CPU, support for the main layers of neural networks and, as always, cross-platform portability, open source code and support - this is what I am going to cover in this article.

    First of all, let me introduce myself. My name is Alexander Rybnikov. I am an Intel engineer, and I implement deep learning functionality in the OpenCV library.

    A few words about how OpenCV is organized. The library is a set of modules, each of which relates to a specific area of computer vision. There is a standard set of modules, the so-called "must have" for any computer vision task. These modules implement well-known algorithms and are well developed and tested. All of them live in the main OpenCV repository. There is also a repository with additional modules that implement experimental or new functionality. The requirements for experimental modules are, for obvious reasons, softer. And, as a rule, when one of these modules becomes sufficiently mature, complete and in demand, it can be moved to the main repository.

    This article is about one of the modules that recently took its place in the main repository: the dnn module (hereinafter simply dnn).

    The (N+1)-th deep learning framework - why at all?


    Why did OpenCV need deep learning? In recent years, in many areas, deep learning has shown results significantly superior to those of classical algorithms. This also applies to computer vision, where a great many tasks are now solved with neural networks. In light of this, it seems only logical to give OpenCV users the ability to work with neural networks.

    Why did we choose to write our own implementation instead of using an existing one? There are several reasons for this.

    First, it makes a lightweight solution possible. By keeping only the ability to perform a forward pass through the network, we can simplify the code and speed up installation and building.

    Second, having our own implementation lets us reduce external dependencies to a minimum. This simplifies the distribution of applications that use dnn. And if a project already uses the OpenCV library, adding support for deep networks to it takes little effort.

    Also, developing our own solution lets us make it universal: not tied to any particular framework with its limitations and shortcomings. And with our own implementation, all avenues for optimizing and speeding up the code remain open.

    Having our own module for running deep networks also greatly simplifies the creation of hybrid algorithms that combine the speed of classical computer vision with the remarkable generalization ability of deep neural networks.

    It is worth noting that the module is not, strictly speaking, a full-fledged deep learning framework. At the moment it provides only the ability to obtain the outputs of a network.

    Key features


    The main capability of dnn is, of course, loading and running neural networks (inference). A model can be created in any of three deep learning frameworks - Caffe, TensorFlow or Torch - and the way it is loaded and used is the same regardless of where it was created.

    By supporting three popular frameworks at once, we can combine the results of models loaded from them without having to recreate everything in a single framework.
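
    For illustration, here is a minimal sketch of the three loading entry points (the file names are placeholders); the resulting object is used identically in all three cases:

    import cv2 as cv

    # Each framework has its own reader; the file names here are illustrative.
    net_caffe = cv.dnn.readNetFromCaffe('model.prototxt', 'model.caffemodel')
    net_tf    = cv.dnn.readNetFromTensorflow('model.pb')
    net_torch = cv.dnn.readNetFromTorch('model.t7')
    # From this point on, all three are identical Net objects:
    # net.setInput(...), net.forward()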

    When loaded, models are converted into an internal representation similar to the one used in Caffe. This happened for historical reasons: Caffe support was added first. However, there is no one-to-one correspondence between the representations.

    All major layers are supported: from basic (Convolution and Fully connected) to more specialized ones - more than 30 in total.

    List of supported layers:
    AbsVal
    AveragePooling
    BatchNormalization
    Concatenation
    Convolution (with dilation)
    Crop
    DetectionOutput
    Dropout
    Eltwise
    Flatten
    FullConvolution
    FullyConnected
    LRN
    LSTM
    MaxPooling
    MaxUnpooling
    MVN
    NormalizeBBox
    Padding
    Permute
    Power
    PReLU
    PriorBox
    ReLU
    RNN
    Scale
    Shift
    Sigmoid
    Slice
    Softmax
    Split
    TanH

    If you cannot find the layer you need in this list, do not despair. You can create a request to add support for the layer you are interested in (and our team will try to help you in the near future), or implement everything yourself and submit a pull request.

    In addition to supporting individual layers, support for specific neural network architectures matters too. The module contains examples for classification (AlexNet, GoogLeNet, ResNet, SqueezeNet), segmentation (FCN, ENet) and object detection (SSD); many of these models have been verified on the original datasets, but more on that later.
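
    For example, running the SSD detector differs from classification mainly in how the output is decoded. A rough sketch, assuming a 300x300 Caffe SSD model (the file names and the 0.5 threshold are illustrative):

    import cv2 as cv

    net = cv.dnn.readNetFromCaffe('ssd.prototxt', 'ssd.caffemodel')
    image = cv.imread('example.jpg')
    net.setInput(cv.dnn.blobFromImage(image, 1, (300, 300), (104, 117, 123)))
    detections = net.forward()  # shape 1 x 1 x N x 7
    h, w = image.shape[:2]
    for _, class_id, confidence, x1, y1, x2, y2 in detections[0, 0]:
        if confidence > 0.5:
            # box coordinates are normalized to [0, 1]
            cv.rectangle(image, (int(x1 * w), int(y1 * h)),
                         (int(x2 * w), int(y2 * h)), (0, 255, 0), 2)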

    Building


    If you are an experienced OpenCV user, you can safely skip this section. If not, I will briefly describe how to get working examples from the source code on Linux or Windows.

    Brief build instructions
    First you need to install git (or Git Bash for Windows), cmake (http://cmake.org) and a C++ compiler (Visual Studio on Windows, Xcode on Mac, clang or gcc on Linux). If you intend to use OpenCV from Python, you also need Python itself (recent versions of 2.7.x or 3.x will work) and the corresponding version of numpy.

    Let's start by cloning the repository:

    mkdir git && cd git
    git clone https://github.com/opencv/opencv.git

    On Windows, the repository can also be cloned using, for example, TortoiseGit or SmartGit. Next, generate the build files:

    cd ..
    mkdir build && cd build
    cmake ../git/opencv -DBUILD_EXAMPLES=ON

    (on Windows, here and below, replace cmake with the full path to the cmake executable, for example "C:\Program Files\CMake\bin\cmake.exe", or use the cmake GUI)

    Now run the build itself:

    make -j5 (Linux)
    cmake --build . --config Release -- /m:5 (Windows)

    After that dnn is ready to use.
    The instructions above are quite brief, so here are also links to step-by-step instructions for installing OpenCV on Windows and Linux.

    Usage examples


    By good tradition, every OpenCV module includes usage examples. dnn is no exception; C++ and Python examples are available in the samples subdirectory of the source repository. The examples are commented, and overall everything is quite simple.

    Here is a brief example that performs image classification using the GoogLeNet model. In Python it looks like this:

    import numpy as np
    import cv2 as cv

    # read the class names (one per line: "<wnid> <name>")
    with open('synset_words.txt') as f:
        classes = [x[x.find(' ') + 1:].strip() for x in f]

    image = cv.imread('space_shuttle.jpg')

    # create a tensor with 224x224 spatial size and subtract the mean values
    # (104, 117, 123) from the corresponding (B, G, R) channels;
    # the image stays in OpenCV's BGR channel order
    input = cv.dnn.blobFromImage(image, 1, (224, 224), (104, 117, 123))

    # load the model from Caffe files
    net = cv.dnn.readNetFromCaffe('bvlc_googlenet.prototxt', 'bvlc_googlenet.caffemodel')

    # feed the input tensor to the model
    net.setInput(input)

    # perform inference and get the output
    out = net.forward()

    # take the 5 class indices with the highest probability
    indexes = np.argsort(out[0])[-5:]
    for i in reversed(indexes):
        print('class:', classes[i], ' probability:', out[0][i])

    This code loads an image, performs a little preprocessing and obtains the network output for that image. The preprocessing consists of scaling the image so that its smaller side becomes 224, cropping out the central part and subtracting the mean value from the elements of each channel. These operations are necessary because the model was trained on images of this size (224 x 224) with exactly this preprocessing.
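
    To make the preprocessing explicit, here is a minimal numpy sketch of roughly what happens under these settings (my own illustration, not the library code; the function name is hypothetical):

    import numpy as np
    import cv2 as cv

    def preprocess(image, side=224, mean=(104, 117, 123)):
        # scale so that the smaller side equals `side`
        h, w = image.shape[:2]
        f = side / min(h, w)
        image = cv.resize(image, (int(round(w * f)), int(round(h * f))))
        # crop out the central side x side patch
        h, w = image.shape[:2]
        y, x = (h - side) // 2, (w - side) // 2
        image = image[y:y + side, x:x + side]
        # subtract the per-channel mean and reorder HWC -> NCHW
        blob = image.astype(np.float32) - np.array(mean, dtype=np.float32)
        return blob.transpose(2, 0, 1)[np.newaxis, :]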

    The output tensor is interpreted as a vector of probabilities that the image belongs to each class, and the names of the 5 classes with the highest probabilities are printed to the console.

    Looks easy, right? The same thing written in C++ turns out a little longer; however, the most important part - the function names and the logic of working with the module - stays the same.

    Accuracy


    How do we know that one trained model is better than another? We must compare quality metrics for both models; very often the fight at the top of the model rankings is over fractions of a percent of quality. Since dnn reads models from various frameworks and converts them into its internal representation, a question arises about preserving quality after conversion: has the model been "spoiled" by loading? Without answering this question, that is, without verification, it is hard to speak of seriously using dnn.

    I tested models from the available examples across frameworks and tasks: AlexNet (Caffe), GoogLeNet (Caffe), GoogLeNet (TensorFlow), ResNet-50 (Caffe) and SqueezeNet v1.1 (Caffe) for object classification; FCN (Caffe) and ENet (Torch) for semantic segmentation. The results are shown in Tables 1 and 2.
    | Model (source framework) | Published acc@top-5 | Measured acc@top-5, original framework | Measured acc@top-5, dnn | Average per-element difference of output tensors | Maximum difference of output tensors |
    |---|---|---|---|---|---|
    | AlexNet (Caffe) | 80.2% | 79.1% | 79.1% | 6.5E-10 | 3.01E-06 |
    | GoogLeNet (Caffe) | 88.9% | 88.5% | 88.5% | 1.18E-09 | 1.33E-05 |
    | GoogLeNet (TensorFlow) | - | 89.4% | 89.4% | 1.84E-09 | 1.47E-05 |
    | ResNet-50 (Caffe) | 92.2% | 91.8% | 91.8% | 8.73E-10 | 4.29E-06 |
    | SqueezeNet v1.1 (Caffe) | 80.3% | 80.4% | 80.4% | 1.91E-09 | 6.77E-06 |
    Table 1. Quality assessment results for the classification task. Measured on the ImageNet 2012 validation set (ILSVRC2012 val, 50,000 examples).
    | Model (framework) | Published mean IoU | Measured mean IoU, original framework | Measured mean IoU, dnn | Average per-element difference of output tensors | Maximum difference of output tensors |
    |---|---|---|---|---|---|
    | FCN (Caffe) | 65.5% | 60.402874% | 60.402879% | 3.1E-7 | 1.53E-5 |
    | ENet (Torch) | 58.3% | 59.1368% | 59.1369% | 3.2E-5 | 1.20 |
    Table 2. Quality assessment results for the semantic segmentation task. The large maximum difference for ENet is explained below.

    The FCN results were computed on the validation set of the segmentation part of PASCAL VOC 2012 (736 examples). The ENet results were computed on the Cityscapes validation set (500 examples).

    A few words about the meaning of these numbers. For classification tasks, the generally accepted quality metric is the accuracy over the top 5 network answers (accuracy@top-5 [1]): if the correct answer is among the 5 answers with the highest confidence, the network's response is counted as correct. Accuracy is then the ratio of correct answers to the number of examples. This way of measuring tolerates imperfect labeling, for example when the labeled object occupies a far from central position in the frame.
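
    In code, the metric takes a few lines (a sketch of my own, assuming `outputs` holds per-class scores for N examples):

    import numpy as np

    def top5_accuracy(outputs, labels):
        # outputs: (N, C) class scores; labels: (N,) ground-truth indices
        top5 = np.argsort(outputs, axis=1)[:, -5:]
        correct = sum(int(label in row) for row, label in zip(top5, labels))
        return correct / len(labels)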

    For semantic segmentation, several metrics are used: per-pixel accuracy and mean intersection over union (mean IoU) [5]. Per-pixel accuracy is the ratio of correctly classified pixels to all pixels. Mean IoU is a more involved characteristic: for each class, it takes the ratio of the intersection (pixels correctly labeled as the class) to the union (pixels belonging to the class plus pixels labeled as the class, with the intersection counted once), and averages this ratio over the classes.
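
    A compact sketch of the metric (my own illustration; pred and gt are integer label maps of the same shape):

    import numpy as np

    def mean_iou(pred, gt, num_classes):
        ious = []
        for c in range(num_classes):
            intersection = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:  # skip classes absent from both maps
                ious.append(intersection / union)
        return float(np.mean(ious))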

    It follows from the tables that for both classification and segmentation there is no difference in accuracy between running a model in its original framework and in dnn. This remarkable fact means the module can be used without fear of unpredictable results. All the test scripts are available here, so you can verify the results yourself.

    The difference between the published numbers and those obtained in my experiments can be explained by the fact that the models' authors ran all their computations on the GPU, while I used the CPU implementation. It has also been observed that different libraries can decode the jpeg format slightly differently. This could have affected the FCN results, since the PASCAL VOC 2012 dataset contains images in exactly this format, and semantic segmentation models are quite sensitive to changes in the distribution of the input data.

    As you can see, Table 2 shows an abnormally large maximum difference between the dnn and Torch outputs for the ENet model. This fact caught my interest as well, and below I briefly explain its cause.

    Why is there a big difference between dnn and Torch for ENet?
    The ENet model uses several MaxPooling operations. This operation selects the maximum element in a neighborhood of each position, writes that maximum to the output tensor, and also passes on the indices of the selected maxima. These indices are later used by MaxUnpooling, an operation that is, in a sense, the inverse of MaxPooling: it writes the elements of its input tensor to the output positions given by those same indices. This is where the large error arises: in some neighborhood, MaxPooling selects an element with the "wrong" index. The difference between the correct Torch output and the dnn output for that layer lies within computational error (10E-7), but the differing indices point to neighboring elements of the neighborhood. That is, due to a tiny numerical fluctuation, a neighboring element turned out slightly larger than the element with the correct index. The result of MaxUnpooling, however, depends not only on the output of the previous layer but also on the indices from the corresponding MaxPooling operation, which sits much earlier in the model's computation graph. So MaxUnpooling writes a value that is correct in magnitude to the wrong position, and the error accumulates.
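
    A toy illustration of the effect (the numbers are made up):

    import numpy as np

    # two almost equal candidates in one pooling window
    window_a = np.float32([0.5000001, 0.5000000])  # "original framework"
    window_b = np.float32([0.5000000, 0.5000001])  # "dnn", after a ~1E-7 fluctuation

    print(np.argmax(window_a), np.argmax(window_b))  # 0 vs 1: different indices

    # MaxUnpooling scatters by index, so the value lands in a different
    # position, and the per-element difference at that position is as
    # large as the value itself.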

    Unfortunately, this error cannot be eliminated: its root cause most likely lies in slightly different implementations of the algorithms used during training and during inference, not in a bug in the implementation.
    However, to be fair, the average error per element of the output tensor remains low (see Table 2), meaning index mismatches occur quite rarely. Moreover, this error does not degrade the quality of the model, as the numbers in the same Table 2 show.

    Performance


    One of the goals we set when developing dnn was decent module performance across various architectures. Not long ago the module was optimized for the CPU, and dnn now shows good speed.

    I measured the running time of the various models in use; the results are in Table 3.
    | Model (source framework) | Image resolution | Original framework performance, CPU (acceleration library); memory | dnn performance, CPU (speedup over original framework); memory |
    |---|---|---|---|
    | AlexNet (Caffe) | 227x227 | 23.7 ms (MKL); 945 MB | 14.7 ms (1.6x); 713 MB |
    | GoogLeNet (Caffe) | 224x224 | 44.6 ms (MKL); 197 MB | 20.1 ms (2.2x); 172 MB |
    | ResNet-50 (Caffe) | 224x224 | 70.2 ms (MKL); 386 MB | 58.8 ms (1.2x); 224 MB |
    | SqueezeNet v1.1 (Caffe) | 227x227 | 12.4 ms (MKL); 113 MB | 5.3 ms (2.3x); 38 MB |
    | GoogLeNet (TensorFlow) | 224x224 | 17.9 ms (Eigen); 310 MB | 21.1 ms (0.8x); 135 MB |
    | FCN (Caffe) | various (500x350 on average) | 3873.6 ms (MKL); 4453 MB | 1229.8 ms (3.1x); 1332 MB |
    | ENet (Torch) | 1024x512 | 1105.0 ms; 828 MB | 218.7 ms (5.1x); 190 MB |
    Table 3. Running time measurements for various models. The experiments were conducted on an Intel Core i7-6700K.

    Times were averaged over 50 runs and measured as follows: for dnn, with the timer built into OpenCV; for Caffe, with the caffe time utility; for Torch and TensorFlow, with their existing time-measuring functions.
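
    For dnn, such a measurement can be reproduced roughly like this (a sketch; getPerfProfile reports the time of the last forward pass in OpenCV ticks):

    import numpy as np
    import cv2 as cv

    net = cv.dnn.readNetFromCaffe('bvlc_googlenet.prototxt', 'bvlc_googlenet.caffemodel')
    blob = cv.dnn.blobFromImage(np.zeros((224, 224, 3), np.uint8), 1, (224, 224), (104, 117, 123))

    times = []
    for _ in range(50):
        net.setInput(blob)
        net.forward()
        t, _ = net.getPerfProfile()  # ticks spent in the last forward pass
        times.append(t * 1000.0 / cv.getTickFrequency())
    print('average forward time: %.1f ms' % np.mean(times))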

    As Table 3 shows, dnn outperforms the original frameworks in most cases. Up-to-date data on the performance of dnn from OpenCV on various models, compared with other frameworks, can also be found here.

    Future plans


    Deep learning has taken a significant place in computer vision, and accordingly we have big plans for developing this functionality in OpenCV. They concern usability, a redesign of the module's internal architecture, and performance.

    On the usability side, we focus primarily on the wishes of the users themselves; we strive to add the functionality that developers and researchers need in real-world tasks. We also plan to add network visualization and to expand the set of supported layers.

    As for performance, despite the many optimizations already performed, we still have ideas for improving the results. One of them is to reduce the bit depth of the calculations, a procedure called quantization. Roughly speaking, we either drop some of the bits of the inputs and layer weights before computing a convolution (fp32 -> fp16), or compute scaling factors that map the range of the input numbers onto the range of int or short. The speed then increases (thanks to faster integer operations), but accuracy may suffer a little. However, publications and experiments in this field show that even fairly aggressive quantization in many cases does not lead to a noticeable drop in quality.
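
    A toy numpy sketch of both schemes (for illustration only, not the dnn code):

    import numpy as np

    weights = np.random.randn(64, 3, 3, 3).astype(np.float32)

    # scheme 1: fp32 -> fp16, i.e. simply drop part of the mantissa bits
    weights_fp16 = weights.astype(np.float16)

    # scheme 2: fp32 -> int8 via a scale factor mapping the range onto [-127, 127]
    scale = 127.0 / np.abs(weights).max()
    weights_int8 = np.round(weights * scale).astype(np.int8)

    # dequantize to see how much rounding error was introduced
    max_err = np.abs(weights - weights_int8.astype(np.float32) / scale).max()
    print('max int8 quantization error:', max_err)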

    Running layers in parallel is another optimization idea. In the current implementation, only one layer runs at a time, and each layer parallelizes its own computations as much as it can. However, in some cases the computation graph can also be parallelized at the level of the layers themselves. Potentially this gives each thread more work and thereby reduces overhead.
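
    As a toy illustration of the idea (not the dnn implementation), two independent branches of a graph, like parallel Inception branches, can be submitted as separate tasks:

    from concurrent.futures import ThreadPoolExecutor

    def branch_a(x):  # stand-ins for two independent layers
        return [v * 2.0 for v in x]

    def branch_b(x):
        return [v + 1.0 for v in x]

    x = [1.0, 2.0, 3.0]
    with ThreadPoolExecutor() as pool:
        fa, fb = pool.submit(branch_a, x), pool.submit(branch_b, x)
        merged = fa.result() + fb.result()  # concatenation, as a Concat layer would do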

    And now something quite interesting is being prepared for release. I think few have heard of the Halide programming language. It is not Turing-complete - some constructs cannot be expressed in it - which is perhaps why it is not widely popular. However, this drawback is also its advantage: code written in it can be automatically turned into code highly optimized for different hardware (CPU, GPU, DSP). There is no need to be an optimization guru; a special compiler does everything for you. Already, Halide speeds up some models: for example, semantic segmentation with the ENet model runs at 25 fps at a resolution of 512x256 on an Intel Core i7-6700K (versus 22 fps for dnn without Halide). And, best of all, without rewriting any code you can use the GPU integrated into the processor and gain an extra couple of frames per second.
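
    On the user side, enabling the Halide backend comes down to one call (a sketch, assuming OpenCV was built with Halide support; the model file name is illustrative):

    import cv2 as cv

    net = cv.dnn.readNetFromTorch('enet-model.net')
    # ask dnn to compile the network with Halide instead of the default backend
    net.setPreferableBackend(cv.dnn.DNN_BACKEND_HALIDE)
    # the rest of the pipeline (setInput / forward) is unchanged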

    In fact, we have high expectations for Halide. Thanks to its unique characteristics, it will provide faster execution without requiring extra manipulations from the user. We are striving to make sure that using Halide with OpenCV does not require installing any additional software, so that the out-of-the-box principle is preserved. And, as our experiments show, we have every chance of achieving this.

    Conclusion


    dnn already has everything it needs to be useful, and every day more users are discovering its capabilities. Nevertheless, there is still work to do. I will continue working on the module, expanding its capabilities and improving its functionality. I hope this article has been interesting and useful for you.

    If you have questions, suggestions or problems, or if you want to contribute by submitting a pull request, welcome to the github repository, as well as to our forum, where my colleagues and I will try to help you. If none of the above works, additional ways to get in touch can be found on our website. I am always glad of cooperation, constructive comments and suggestions. Thanks for your attention!

    P.S. I would like to thank my colleagues for their help with this work and with writing this article.

    References


    1. ImageNet Classification with Deep Convolutional Neural Networks
    2. Going deeper with convolutions
    3. Deep Residual Learning for Image Recognition
    4. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
    5. Fully Convolutional Networks for Semantic Segmentation
    6. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
    7. SSD: Single Shot MultiBox Detector
    8. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
    9. OpenCV GitHub repository
    10. Official OpenCV website
    11. OpenCV forum
    12. Halide
    13. Caffe
    14. TensorFlow
    15. Torch
