I see, therefore I am: a review of Deep Learning in Computer Vision (part 2)

    We continue exploring modern magic (computer vision). Part 2 does not mean that you need to read part 1 first. Part 2 means that things are getting serious: we want to understand the full power of neural networks in vision. Detection, tracking, segmentation, pose estimation, action recognition... The most fashionable and coolest architectures, hundreds of layers and dozens of brilliant ideas are already waiting for you under the cut!

    Previously in the series

    Let me remind you that in the first part we got acquainted with convolutional neural networks and their visualization, as well as with the tasks of classifying images and constructing their effective representations (embeddings). We even discussed the tasks of face recognition and re-identification of people.

    Also in the previous article we talked about different types of architectures (yes, those same tables that took me a month to make), and meanwhile Google did not waste time: they released yet another extremely fast and accurate architecture, EfficientNet. They created it using neural architecture search (NAS) and a special Compound Scaling procedure. Check out the article, it's worth it.

    While some researchers animate faces and look for kisses in films, we will deal with more pressing problems.

    People say: “image recognition”. But what is “recognition”? What is “understanding (of a scene)”? In my opinion, the answers to these questions depend on what exactly we want to “recognize” and what exactly we want to “understand”. If we are building an Artificial Intelligence that will extract information about the world from the visual stream as efficiently as people do (or even better), then we need to start from the tasks, from the needs. Historically, modern “recognition” and “scene understanding” can be divided into several specific tasks: classification, detection, tracking, pose and facial landmark estimation, segmentation, action recognition in video, and describing an image in text. This article focuses on detection and tracking (oops, the rest of the list is a spoiler for part three), so the current plan is this:

    1. Find me if you can: object detection
    2. Face Detection: Not Caught - Not a Thief
    3. Many letters: text detection (and recognition)
    4. Video and tracking: in a single stream

    Let's rock, superstars!

    Find me if you can: object detection

    The task sounds simple: given a picture, find the objects of predefined classes in it (a person, a book, an apple, an artesian-Norman basset-griffon, etc.). To solve this problem with neural networks, we pose it in terms of tensors and machine learning.

    We remember that a color picture is a tensor of shape (H, W, 3) (and if we do not remember, that is what part 1 is for). Previously, we only knew how to classify the whole picture, but now our goal is to predict the positions of the objects of interest (pixel coordinates) in the picture, as well as their classes.

    The key idea here is to solve two problems at once - classification and regression. We use a neural network to regress the coordinates and classify the objects inside them.

    Classification? Regression?
    Let me remind you that we are talking about machine learning tasks. In a classification problem, the ground-truth labels of objects are class labels, and we predict the class of an object. In a regression problem, the ground-truth labels are real numbers, and we predict a number (for example: weight, height, salary, the number of characters who die in the next episode of Game of Thrones...). For more detail, you are welcome to watch the 3rd lecture of DLSchool (FPMI MIPT).

    The coordinates of an object can, generally speaking, be formalized in different ways. In DL there are three main ones: detection (bounding boxes of objects), pose estimation (key points of objects) and segmentation (“masks” of objects). For now let's talk about predicting bounding boxes; key points and segmentation come later in the text.

    Most detection datasets are annotated with boxes in the format “coordinates of the upper left and lower right corners for each object in each picture” (this format is also called top-left, bottom-right), and most neural network approaches predict these coordinates.
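
    To make the box format concrete, here is a minimal sketch that converts a top-left, bottom-right box into a center-and-size box and back. The center format is not discussed in this article; it is mentioned only as another common convention, and the function names are made up for illustration.

```python
def corners_to_center(box):
    """Convert (x_min, y_min, x_max, y_max) to (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2, y_min + height / 2, width, height)


def center_to_corners(box):
    """Convert (x_center, y_center, width, height) back to corner format."""
    x_center, y_center, width, height = box
    return (
        x_center - width / 2,
        y_center - height / 2,
        x_center + width / 2,
        y_center + height / 2,
    )


# A box whose top-left corner is (10, 20) and bottom-right corner is (50, 100).
corner_box = (10, 20, 50, 100)
center_box = corners_to_center(corner_box)
print(center_box)
print(center_to_corners(center_box))
```

    Whatever format a dataset or a network uses, the important part is to know which one it is before computing losses or metrics.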

    About datasets and metrics in the detection problem
    After setting the task, it is best to see what data is available for training and what metrics are used to measure quality. This is what I talk about, unhurriedly, in the first half of the 13th lecture of Deep Learning School (at 2x speed it is just right).

    Before plunging into the types of neural networks for detection, let's think together about how to detect anything in an image at all. If we want to find a certain object in the picture, we probably know roughly what it looks like and what area it should occupy in the image (although that can vary).

    Inventing detection from scratch
    The naive and simplest approach would be a “template search” algorithm: let the picture be 100x100 pixels, and suppose we are looking for a soccer ball. Let there be a ball template of 20x20 pixels. We slide this template over the whole picture, just like a convolution, computing the pixel-by-pixel difference at every position. This is how template matching works (some form of correlation is often used instead of the pixel-by-pixel difference).
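
    Template matching as described above can be sketched in a few lines of plain Python. This is only an illustration on a tiny grayscale “image” stored as nested lists; the score is the sum of absolute pixel differences (lower is better), which is one possible choice of the pixel-by-pixel difference, and the function name is made up.

```python
def match_template(image, template):
    """Slide the template over the image and return (row, col, score) of the best match.

    The score is the sum of absolute pixel differences, so 0 means a perfect match.
    """
    img_h, img_w = len(image), len(image[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best = None
    for row in range(img_h - tpl_h + 1):
        for col in range(img_w - tpl_w + 1):
            score = sum(
                abs(image[row + i][col + j] - template[i][j])
                for i in range(tpl_h)
                for j in range(tpl_w)
            )
            if best is None or score < best[2]:
                best = (row, col, score)
    return best


# A 4x5 "picture" with a bright 2x2 patch and a template of that patch.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [9, 8],
    [7, 9],
]
print(match_template(image, template))  # (1, 2, 0): top-left corner at row 1, column 2
```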

    If there is no template but there is a neural network classifier, we can do this: slide a window of a fixed size over the picture and predict the class of the current region. Then we simply say that the most probable object regions are those where the classifier answered confidently. This way we handle the fact that the same object can look very different from picture to picture (since the classifier was trained on a very diverse sample).

    But then a problem pops up: the objects in pictures have different sizes. The same soccer ball can fill the whole height or width of the picture, or it can be far away at the goal, taking up only 10-20 pixels out of 1000. It is tempting to write a brute-force algorithm that simply loops over window sizes. Suppose the picture is 100x200 pixels; then we go over it with windows of 2x2, 2x3, 3x2, 2x4, 4x2, 3x3, ..., 3x4, 4x3... I think you can see that there are about 100 * 200 possible window sizes, and with each of them we have to go over the picture, performing about (100 - W_window) * (200 - H_window) classification operations, which takes a lot of time. I'm afraid we will never live to see such an algorithm finish.
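
    To get a feeling for the numbers, the brute-force search can simply be counted. The sketch below assumes a stride of one pixel and counts every window position exactly, that is (W - w + 1) * (H - h + 1) positions for a window of size w x h; the function name and the min_size parameter are made up for illustration.

```python
def count_window_evaluations(image_w, image_h, min_size=1):
    """Count classifier calls needed to try every window size at every position."""
    total = 0
    for win_w in range(min_size, image_w + 1):
        for win_h in range(min_size, image_h + 1):
            # Number of positions where a win_w x win_h window fits in the image.
            positions = (image_w - win_w + 1) * (image_h - win_h + 1)
            total += positions
    return total


print(count_window_evaluations(100, 200))
```

    For a tiny 100x200 picture this already gives about a hundred million classifier calls, and real images are far larger.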

    You can, of course, choose the most characteristic window sizes for the object, but this will still be very slow, and if it is fast, it is unlikely to be accurate: in real applications the sizes of objects in images vary enormously.

    From here on I will sometimes rely on a recent survey of the detection field from January 2019 (the pictures are also taken from it). It is simply a must-read if you want to quickly get the broadest possible view of DL in detection.

    One of the first papers on detection and localization with CNNs was OverFeat. The authors claim to have been the first to use a neural network for detection on ImageNet, reformulating the problem and changing the loss. The approach, by the way, was almost end-to-end (below is the OverFeat scheme).

    The next important architecture was the Region-based Convolutional Neural Network (RCNN), proposed by Ross Girshick and colleagues at UC Berkeley in 2014. Its essence is that it first proposes many so-called “regions of interest” (RoIs) that may potentially contain objects (using the Selective Search algorithm), and then classifies them and refines the box coordinates using a CNN.

    True, such a pipeline made the whole system slow, because every region was run through the neural network (thousands of forward passes per picture). A year later Ross Girshick, by then at Microsoft Research (and later at FAIR), upgraded RCNN to Fast-RCNN. Here the idea was to swap Selective Search and the network pass: first we pass the entire picture through a pre-trained neural network once, and then work with regions of interest on the feature map produced by the backbone network (the regions can come, for example, from the same Selective Search, but there are other algorithms). It was still quite slow, much slower than real time (for now, let's assume that real time means less than 40 milliseconds per picture).

    Most of the time was spent not in the CNN but in the box generation algorithm itself, so it was decided to replace it with a second neural network, the Region Proposal Network (RPN), which is trained to predict regions of interest. This is how Faster-RCNN appeared (yes, the authors clearly did not spend long on the name). Scheme:

    Then there was another improvement in the form of R-FCN; we won't talk about it in detail, but I do want to mention Mask-RCNN. Mask-RCNN is special: it is the first neural network that solves detection and instance segmentation at the same time, predicting exact masks (silhouettes) of the objects inside the bounding boxes. Its idea is actually quite simple: there are two branches, one for detection and one for segmentation, and the network is trained on both tasks at once. The main thing is to have labeled data. Mask-RCNN itself is very similar to Faster-RCNN: the backbone is the same, but at the end there are two “heads” (as the last layers of a neural network are often called) for two different tasks.

    These were the so-called two-stage (or region-based) approaches. In parallel with them, their counterparts were being developed: one-stage approaches. These include neural networks such as Single-Shot Detector (SSD), You Only Look Once (YOLO), Deeply Supervised Object Detector (DSOD), Receptive Field Block Network (RFBNet) and many others (see the map below, from this repository).

    One-stage approaches, unlike two-stage ones, do not use a separate algorithm for generating boxes, but simply predict several sets of box coordinates for each cell of the feature map produced by a convolutional neural network. YOLO works this way, SSD is slightly different, but the idea is the same: a 1x1 convolution predicts many numbers along the depth dimension of the feature maps, and we agree in advance on what each number means.

    For example, from a feature map of size 13x13x256 we predict a feature map of 13x13x(4 * (5 + 80)) numbers, where along the depth we predict 85 numbers for each of 4 boxes: the first 4 numbers are always the box coordinates, the 5th is the confidence of the box, and the remaining 80 are the probabilities of each class (classification). This fixed layout is needed so that the right numbers can be fed to the right losses and the neural network can be trained properly.
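
    The “agreement about what each number means” can be illustrated by splitting the 4 * (5 + 80) = 340 numbers predicted for one cell of the 13x13 grid. This is only a sketch: the box-by-box ordering of the numbers is an assumption made here, the function name is made up, and real YOLO or SSD implementations define their own layout and also apply activations and coordinate decoding.

```python
NUM_BOXES = 4                          # boxes predicted per grid cell (from the example above)
NUM_CLASSES = 80                       # class probabilities per box
VALUES_PER_BOX = 4 + 1 + NUM_CLASSES   # 4 coordinates + 1 confidence + 80 classes = 85


def split_cell_prediction(cell_vector):
    """Split the depth vector of one grid cell into per-box predictions.

    Assumed layout: the 85 numbers of box 0, then the 85 numbers of box 1, and so on.
    """
    if len(cell_vector) != NUM_BOXES * VALUES_PER_BOX:
        raise ValueError("expected %d values per cell" % (NUM_BOXES * VALUES_PER_BOX))
    boxes = []
    for b in range(NUM_BOXES):
        chunk = cell_vector[b * VALUES_PER_BOX:(b + 1) * VALUES_PER_BOX]
        class_scores = chunk[5:]
        best_class = max(range(NUM_CLASSES), key=lambda c: class_scores[c])
        boxes.append({
            "coords": chunk[0:4],      # goes to the regression loss
            "confidence": chunk[4],    # goes to the box confidence loss
            "class_id": best_class,    # class scores go to the classification loss
            "class_score": class_scores[best_class],
        })
    return boxes


# A fake cell prediction: everything is zero except the third box.
cell = [0.0] * (NUM_BOXES * VALUES_PER_BOX)
start = 2 * VALUES_PER_BOX
cell[start:start + 4] = [0.5, 0.5, 0.2, 0.3]
cell[start + 4] = 0.9
cell[start + 5 + 17] = 0.7
print(split_cell_prediction(cell)[2])
```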

    I want to draw attention to the fact that the quality of a detector depends directly on the quality of the neural network that extracts features (that is, the backbone network). Usually this role is played by one of the architectures I talked about in the previous article (ResNet, SENet, etc.), but sometimes authors come up with their own better-suited architectures (for example, Darknet-53 in YOLOv3) or modifications (for example, the Feature Pyramid Network (FPN)).

    Again, note that we train the network for both classification and regression at the same time. In the community this is called a multi-task loss: a single loss that is the sum of the losses for several tasks (with some coefficients).
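
    As a rough illustration, a multi-task loss for a single box can be written as a weighted sum of a regression term and a classification term. The particular losses (smooth L1 and cross-entropy) and the weight names are assumptions chosen for this sketch, not something fixed by the article; every detector defines its own terms and coefficients.

```python
import math


def smooth_l1(pred, target):
    """Regression loss over box coordinates: quadratic near zero, linear for large errors."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


def cross_entropy(class_probs, true_class):
    """Classification loss: negative log-probability of the correct class."""
    return -math.log(class_probs[true_class])


def multi_task_loss(pred_box, true_box, class_probs, true_class,
                    box_weight=1.0, cls_weight=1.0):
    """Weighted sum of the regression loss and the classification loss."""
    return (box_weight * smooth_l1(pred_box, true_box)
            + cls_weight * cross_entropy(class_probs, true_class))


loss = multi_task_loss(
    pred_box=[0.1, 0.2, 0.9, 0.8],
    true_box=[0.0, 0.2, 1.0, 0.8],
    class_probs=[0.7, 0.2, 0.1],
    true_class=0,
)
print(loss)
```

    Changing box_weight and cls_weight shifts the balance between the tasks, which is exactly where conflicts between tasks tend to show up.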

    News from the multi-task loss front line
    At Machines Can See 2019, one of the speakers used a multi-task loss for 7 tasks simultaneously, Carl. It turned out that some tasks were, by their very formulation, pulling against each other, and the resulting “conflict” kept the network from learning as well as it would if trained on each task separately. Conclusion: if you are using a multi-task loss, make sure that the tasks do not conflict by formulation (for example, predicting the boundaries of objects and their internal segmentation can interfere with each other, because they can rely on different features inside the network). The author got around this by adding separate Squeeze-and-Excitation blocks for each task.

    Recently, papers from 2019 have appeared whose authors claim an even better speed / accuracy trade-off in detection using point-based box prediction. I am talking about “Objects as Points” and “CornerNet-Lite”. ExtremeNet is a modification of CornerNet. It seems that they can now be called SOTA in neural network detection (but this is not certain).

    If my explanation of detectors still seems chaotic and incomprehensible, I go through it slowly in our video. Perhaps you should watch it first.

    Below are tables of neural network detectors with links to the code and a brief description of the key features of each network. I tried to collect only the networks that are really important to know (at least their ideas) in order to have a good picture of object detection today:

    Neural network detectors (two-stage)
    Year | Article | Key idea | Code
    2013-2014 | RCNN | generation of regions of interest and neural network prediction of classes within them | Caffe
    2015 | Fast-RCNN | first pass the picture through the network, and then generate regions of interest | Caffe
    2016 | Faster-RCNN | use RPN to generate regions of interest | PyTorch
    2016 | R-FCN | fully-convolutional approach instead of generating regions of interest | Caffe
    2017 | Mask-RCNN | two “heads” for solving two tasks at once, RoI-Align | Keras, TF
    2019 | Reasoning-RCNN | improving the quality of RCNN by constructing a graph of semantic relationships of objects | ---

    Neural network detectors (one-stage)
    Year | Article | Key idea | Code
    2013-2014 | OverFeat | one of the first neural network detectors | C++ (with wrappers for other languages)
    2015 | SSD | very flexible one-stage approach, now used in many applications | PyTorch
    2015 | YOLO | an idea similar to SSD, developed in parallel and no less popular (there are new versions) | C++
    2016 | YOLOv2 (aka YOLO9000) | a number of improvements to YOLO | PyTorch
    2017 | YOLOv3 | a number of improvements to YOLOv2 | PyTorch
    2017-2018 | DSOD | the Deep Supervision idea and ideas from DenseNet | Caffe
    2017-2018 | RFBNet | convolution filters are carefully selected based on the structure of the human visual system (RF block) | PyTorch

    Neural network detectors (miscellaneous)

    Neural network detectors (point-based)
    Year | Article | Key idea | Code
    2019 | CenterNet | a new approach to detection that makes it possible to quickly and efficiently find points, boxes and 3D boxes at the same time | PyTorch
    2019 | CornerNet | prediction of boxes based on pairs of corner points | PyTorch
    2019 | CornerNet-Lite | accelerated CornerNet | PyTorch
    2019 | ExtremeNet | prediction of “extreme” points of objects (geometrically accurate boundaries) | PyTorch

    To understand how speed and quality compare across architectures, you can look at this review or its more popular version.

    Architectures are fine, but detection is first of all a practical task. “Better one working network than a hundred networks” is my message. There are links to code in the tables above, but personally I rarely see detectors launched directly from their repositories (at least when the goal is further deployment to production). Most often a library is used for this, for example the TensorFlow Object Detection API (see the practical part of my lesson) or the library from the researchers at CUHK. I bring to your attention another super-table (you like them, right?):

    Libraries for running detection models
    Title | Authors | Description | Implemented neural networks | Framework
    Detectron | Facebook AI Research | Facebook repository with code for various detection and pose estimation models | All region-based | Caffe2
    TF Object Detection API | TensorFlow team | Many ready-to-use models (weights are provided) | All region-based and SSDs (with different backbones) | TensorFlow
    Darkflow | thtrieu | Ready-to-use YOLO and YOLOv2 implementations | All YOLO types (with modifications) except YOLOv3 | TensorFlow
    mmdetection | Open MMLab (CUHK) | A huge number of detectors in PyTorch, see their article | Almost all models except the YOLO family | PyTorch
    Darknet (modified) | AlexeyAB | Convenient implementation of YOLOv3 with many improvements over the original repository | YOLOv3 | C++

    Often you need to detect objects of only one class, but a specific and highly variable one. For example, to detect all faces in a photo (for further verification / counting of people), to detect whole people (for re-identification / counting / tracking) or to detect text in a scene (for OCR / translation of words in the photo). In general, the “ordinary” detection approach will work here to a certain extent, but each of these subtasks has its own tricks for improving quality.

    Face Detection: Not Caught - Not a Thief

    Some specifics appear here, since faces often occupy a fairly small part of the image. Plus, people do not always look at the camera, and often the face is visible only from the side. One of the first approaches to face detection was the famous Viola-Jones detector based on Haar cascades, invented back in 2001.

    Neural networks were not in fashion then and were not yet that strong in vision, but the good old hand-crafted approach did its job. It made heavy use of several types of special filter masks, which helped to extract face regions and their features from the image, and these features were then fed to an AdaBoost classifier. By the way, this method still works fine today: it is fast enough and runs out of the box with OpenCV. The disadvantage of this detector is that it only sees faces turned frontally to the camera. As soon as the face turns a little, the detection becomes unstable.
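
    The “filter masks” are Haar-like features: differences between sums of pixels in adjacent rectangles. Viola-Jones computes such sums quickly with an integral image, a detail not covered in the text above. Below is a minimal pure-Python sketch of one two-rectangle feature; the function names are made up, and this is not the OpenCV implementation.

```python
def integral_image(image):
    """Return a (H+1) x (W+1) table where table[y][x] is the sum of image[0:y][0:x]."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w and height h."""
    return table[y + h][x + w] - table[y][x + w] - table[y + h][x] + table[y][x]


def two_rect_feature(table, x, y, w, h):
    """Left half minus right half of a rectangle: responds to vertical edges."""
    half = w // 2
    return rect_sum(table, x, y, half, h) - rect_sum(table, x + half, y, half, h)


# Dark left half, bright right half: a strong vertical edge.
image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
table = integral_image(image)
print(two_rect_feature(table, 0, 0, 4, 2))  # 4 - 20 = -16
```

    After the table is built, every rectangle sum costs four lookups regardless of the rectangle size, which is a large part of why the detector is fast.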

    For such more complex cases, you can use dlib. It is a C++ library that implements many vision algorithms, including face detection.

    Of the neural network approaches to face detection, Multi-task Cascaded CNN (MTCNN) (MatLab, TensorFlow) is especially significant. It is still actively used today (for example, in facenet).

    The idea of MTCNN is to use three neural networks sequentially (hence a “cascade”) to predict the position of a face and its key points. There are exactly 5 key points on the face: the left eye, the right eye, the left corner of the mouth, the right corner of the mouth and the nose. The first neural network of the cascade (P-Net) is used to generate candidate face regions. The second (R-Net) refines the coordinates of the resulting boxes. The third (O-Net) regresses the box coordinates once more and, in addition, predicts the 5 key points of the face. The network is multi-task because three tasks are solved: regression of box coordinates, face / not face classification for each box, and regression of the facial key points. Moreover, MTCNN does all of this in real time, that is, it needs less than 40 ms per image.

    What, you still don't read arXiv papers yourself?
    In that case, I recommend trying to read the original MTCNN paper, if you already have some background in convolutional networks. The paper is only 5 pages long, but it contains all the information you need to understand the approach. Try it, you'll get hooked :)

    Among the modern state of the art, Dual Shot Face Detector (DSFD) and FaceBoxes are worth noting. FaceBoxes can run fast on a CPU (!), and DSFD has the best quality (released in April 2019). DSFD is more complex than MTCNN, since inside the network it uses a special feature enhancement module (with dilated convolutions), two branches for processing the features, and special loss functions. By the way, we will meet dilated convolutions more than once in the segmentation papers in the next part. Below is an example of DSFD at work (impressive, isn't it?).

    To learn how faces are recognized, do not forget to look at the previous article in the series, where I briefly talked about it.

    Many letters: text detection (and recognition)

    Pay attention to the photo above. It is easy to see that if you predict bounding boxes parallel to the coordinate axes (as we did before), the result is very poor. This often turns out to be critical if, for example, we want to feed these boxes to a recognition neural network that predicts the text from the picture.

    In such cases it is customary to predict rotated bounding boxes, or even to enclose the text in polygons instead of rectangles if it is curved (examples below). Rotated boxes are predicted, for example, by the EAST detector.

    The idea of the EAST detector is to predict not the coordinates of the box corners, but the following three things:

    1. Text Score Maps (probability of finding text in each pixel)
    2. The angle of rotation of each box
    3. Distances to the borders of the rectangle for each pixel
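
    The three predictions above are enough to reconstruct a rotated box for any pixel that belongs to text. The sketch below shows the geometry only. The order of the distances (top, right, bottom, left), the direction of rotation and the image coordinate system with y pointing down are assumptions made here; the actual EAST code may use different conventions.

```python
import math


def decode_rotated_box(x, y, distances, angle):
    """Return the 4 corners of a rotated box from per-pixel predictions.

    x, y      - pixel position inside the text region
    distances - (top, right, bottom, left) distances from the pixel to the box sides,
                measured along the axes of the rotated box
    angle     - rotation of the box in radians
    """
    top, right, bottom, left = distances
    # Corners relative to the pixel, in the coordinate frame of the box itself.
    local_corners = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Rotate each corner by the predicted angle and shift it to the pixel position.
    return [
        (x + dx * cos_a - dy * sin_a, y + dx * sin_a + dy * cos_a)
        for dx, dy in local_corners
    ]


# With a zero angle we get an ordinary axis-aligned box around the pixel (10, 10).
print(decode_rotated_box(10, 10, (2, 3, 4, 5), 0.0))
# The same distances with the box rotated by 30 degrees.
print(decode_rotated_box(10, 10, (2, 3, 4, 5), math.radians(30)))
```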

    So this resembles segmentation (extracting text masks) more than detection. An explanatory picture from the arXiv paper:

    Text recognition (and therefore text detection) is a very popular task, so there are alternatives: TextBoxes++ (Caffe) and SegLink, but EAST, in my opinion, is the simplest and most accessible.

    After detecting the text, we want to feed it straight to another neural network that recognizes it and outputs a string of characters. Notice the interesting change of modality here: from images to text. There is nothing to be afraid of, because everything depends only on the network architecture, on what exactly is predicted at the last layer, and on which loss is used. For example, MORAN (PyTorch code) and ASTER (TensorFlow code) cope with the task quite well.

    There is nothing supernatural in them, but two fundamentally different types of neural networks are skillfully combined: CNN and RNN. The first extracts features from the picture, and the second generates the text. Let's look closer using MORAN as an example: below is the architecture of its recognition network.

    However, in spite of the rotated boxes from EAST, the recognition network still receives a rectangular picture, which means that the text may occupy far from all of the space inside it. To make it easier for the recognizer to predict the text from the picture, the picture can be transformed in a certain way.

    We can apply an affine transformation to the input image to stretch / rotate the text. This can be achieved with a Spatial Transformer Network (STN), since it learns such transformations on its own and is easily integrated into other neural networks (by the way, this kind of alignment can be done for any pictures, not just for text). Below is a before / after example for STN.
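
    An affine transformation is fully described by a 2x3 matrix of six numbers. In an STN these six numbers are predicted by a small network; in the sketch below they are simply written by hand to show what the transformation does to pixel coordinates. The function name and the example matrices are made up for illustration.

```python
def apply_affine(theta, points):
    """Apply a 2x3 affine matrix [[a, b, tx], [c, d, ty]] to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = theta
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


corners = [(0, 0), (4, 0), (4, 2), (0, 2)]

# Identity: nothing changes.
print(apply_affine([[1, 0, 0], [0, 1, 0]], corners))

# Stretch twice along x and shift by 1 along y.
print(apply_affine([[2, 0, 0], [0, 1, 1]], corners))

# Rotate by 90 degrees around the origin.
print(apply_affine([[0, -1, 0], [1, 0, 0]], corners))
```

    Six parameters can express only shifts, rotations, scaling and shear, so an affine transformation cannot straighten curved text.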

    It makes no sense to talk about STN in detail here, because there is a wonderful article on Habr (the picture was taken from there, thanks to the author) and PyTorch code.

    But MORAN (the same text recognition network) does something even smarter: it is not limited to the family of affine transformations, but predicts a map of x and y displacements for each pixel of the input image, thus achieving any transformation that makes the recognition network easier to train. This method is called rectification, that is, correcting a picture with an auxiliary neural network (a rectifier). Below is a comparison of the image after an affine transformation and after rectification:
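
    The idea of a per-pixel displacement map can be sketched as follows: each output pixel takes its value from the input pixel shifted by its own (dx, dy). This is a toy version under clear assumptions: nearest-neighbour sampling with clamping at the borders is used here as a stand-in for the differentiable sampling a real network needs, the offsets are written by hand instead of being predicted, and the function name is made up.

```python
def rectify(image, offsets_x, offsets_y):
    """Warp an image with a per-pixel displacement map (nearest-neighbour sampling)."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Each output pixel looks at its own shifted position in the input.
            src_x = min(max(int(round(x + offsets_x[y][x])), 0), width - 1)
            src_y = min(max(int(round(y + offsets_y[y][x])), 0), height - 1)
            output[y][x] = image[src_y][src_x]
    return output


image = [
    [1, 2, 3],
    [4, 5, 6],
]
zeros = [[0, 0, 0], [0, 0, 0]]
shift_right = [[1, 1, 1], [1, 1, 1]]

print(rectify(image, zeros, zeros))        # unchanged
print(rectify(image, shift_right, zeros))  # every pixel samples its right neighbour
```

    Because every pixel has its own offset, such a map can bend the picture in ways no single affine matrix can.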

    However, besides the “modular” approach to text recognition (detection network -> recognition network), there is an end-to-end architecture: the input is a picture, and the output is the detections and the text recognized inside them. All of this is a single pipeline that learns both tasks at once. An impressive work in this direction is Fast Oriented Text Spotting with a Unified Network (FOTS) (PyTorch code), whose authors also note that the end-to-end approach is twice as fast as “detection + recognition”. Below is the FOTS network diagram; a special role is played by the RoIRotate block, which makes it possible to pass gradients from the recognition network back to the detection network (this is really more complicated than it seems).

    By the way, the ICDAR conference is held every year, with several competitions on recognizing text in all kinds of images timed to coincide with it.

    Current problems in detection

    In my opinion, the main problem in detection now is not the quality of the detector model but the data: annotation is usually slow and expensive, especially if there are many classes to detect (although, by the way, there is an example of a solution for 500 classes). That is why many works are now devoted to generating the most plausible data “synthetically” and getting the annotation “for free”. Below is a picture from my thesis, based on an article from Nvidia that deals specifically with generating synthetic data.

    But still, it's great that we can now say for sure what is where in a picture. And if we want, for example, to count the number of something in a frame, it is enough to detect it and output the number of boxes. For detecting people, the usual YOLO also works well; the main thing is to feed it a lot of data. The same Darkflow will do, and a “person” class is found in almost all major detection datasets. So if we want to use a camera to count the number of people who passed by in, say, one day, or the number of goods a person took in a store, we simply detect them and output the count...

    Stop. If we detect people in each image from the camera, we can count them in one frame, but no longer across two, because we cannot tell which person is which. We need an algorithm that counts exactly the unique people in a video stream. It could be a re-identification algorithm, but when it comes to video and detection, it would be a sin not to use tracking algorithms.

    Video and tracking: in a single stream

    So far we have only talked about tasks on pictures, but the most interesting things happen in video. To solve, say, action recognition, we need to use not only the so-called spatial component but also the temporal one, since a video is a sequence of images in time.

    Tracking is the analogue of image detection, but for video. That is, we want to teach the network to predict not a box in a picture, but a tracklet in time (which is essentially a sequence of boxes). Below is an example image showing the “tails”: the tracks of the people in the video.

    Let's think about how to solve the tracking problem. Let there be a video with frames #1 and #2. For now, let's consider only one object: we are tracking one ball. In frame #1 we can find it with a detector. In frame #2 we can also detect the ball, and if it is the only one there, everything is fine: we say that the box in frame #2 belongs to the same ball as the box in frame #1. We can continue in the same way for the remaining frames; below is a gif from the pyimagesearch vision course.

    By the way, to save time we do not have to run the neural network on the second frame: we can simply “cut out” the box with the ball from the first frame and look for the same patch in the second frame using correlation or a pixel-by-pixel comparison. Correlation trackers use this approach; they are considered simple and more or less reliable for simple cases like “tracking one ball in front of the camera in an empty room”. This task is also called Visual Object Tracking. Below is an example of a correlation tracker following one person.

    However, if there are several detections / people, we need to be able to match the boxes from frame #1 with the boxes from frame #2. The first idea that comes to mind is to match each box to the one with which it has the largest overlap, measured by intersection over union (IoU). True, with several overlapping detections such a tracker will be unstable, so even more information is needed.
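
    Here is a minimal sketch of this first idea: compute the IoU for boxes in the top-left, bottom-right format and greedily match each box from the previous frame to the best still-unused box in the current frame. The greedy strategy, the min_iou threshold of 0.3 and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_by_iou(prev_boxes, curr_boxes, min_iou=0.3):
    """Greedily match previous-frame boxes to current-frame boxes by IoU.

    Returns a dict {index in prev_boxes: index in curr_boxes}.
    """
    matches = {}
    used = set()
    for i, prev in enumerate(prev_boxes):
        best_j, best_iou = None, min_iou
        for j, curr in enumerate(curr_boxes):
            if j in used:
                continue
            score = iou(prev, curr)
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            matches[i] = best_j
            used.add(best_j)
    return matches


frame1 = [(0, 0, 10, 10), (20, 20, 30, 30)]
frame2 = [(21, 21, 31, 31), (1, 1, 11, 11)]   # same objects, moved a bit, listed in another order
print(match_by_iou(frame1, frame2))  # {0: 1, 1: 0}
```

    When boxes overlap heavily, the greedy choice depends on the order of the boxes, which is one source of the instability mentioned above.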

    The IoU approach relies only on the “geometric” features of the detections, that is, it simply tries to match them by proximity between frames. But we have the whole image at our disposal (even two in this case), and we can use the “visual” features inside these detections. Plus, we have a history of detections for each person, which lets us predict their next position more accurately from the speed and direction of movement; these can conditionally be called “physical” features.

    One of the first real-time trackers that was quite reliable and able to cope with difficult situations was Simple Online and Realtime Tracker (SORT) (Python code), published in 2016. SORT used no visual features and no neural networks; it only estimated a number of parameters for each box in each frame: the current velocity (x and y separately) and the size (height and width). The aspect ratio of a box is always taken from the very first detection of that box. The velocities are then predicted with Kalman filters (which are generally good and lightweight tools from the world of signal processing), an IoU matrix between the boxes is built, and the detections are assigned with the Hungarian algorithm.
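
    The “predict, then assign” loop can be sketched in a heavily simplified form. Two substitutions are made here and should not be mistaken for SORT itself: a plain constant-velocity shift stands in for the Kalman filter prediction, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimal assignment, but is only practical for a handful of boxes). All function names are made up.

```python
from itertools import permutations


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def predict_next(box, velocity):
    """Stand-in for the Kalman prediction: shift the box by its velocity (vx, vy)."""
    vx, vy = velocity
    return (box[0] + vx, box[1] + vy, box[2] + vx, box[3] + vy)


def assign(predicted, detections):
    """Stand-in for the Hungarian algorithm: try every assignment, keep the best total IoU.

    Assumes len(predicted) <= len(detections). Returns a list where element i is the
    index of the detection assigned to track i.
    """
    best_total, best_perm = -1.0, ()
    for perm in permutations(range(len(detections)), len(predicted)):
        total = sum(iou(predicted[i], detections[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm)


# Two tracks moving towards each other, and detections from the next frame.
tracks = [(0, 0, 10, 10), (50, 0, 60, 10)]
velocities = [(5, 0), (-5, 0)]
detections = [(46, 0, 56, 10), (4, 0, 14, 10)]

predicted = [predict_next(box, v) for box, v in zip(tracks, velocities)]
print(assign(predicted, detections))  # [1, 0]
```

    For real workloads an existing implementation of the Hungarian algorithm is the natural replacement for the permutation search.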

    If it seems that the math has become a bit too much, this article explains everything in an accessible way (it is on Medium :).

    Already in 2017, a modification of SORT was released: DeepSORT (TensorFlow code). DeepSORT started using a neural network to extract visual features and used them to resolve ambiguous matches. The tracking quality improved, and it is not for nothing that it is considered one of the best online trackers today.
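
    How visual features help can be shown with a toy example: each detection and each track gets an embedding vector, and an ambiguous detection is assigned to the track with the closest embedding. The sketch below uses cosine distance and hand-written 2D vectors; in a real tracker the embeddings come from a neural network, and the details of how DeepSORT combines them with motion are not reproduced here. The function names are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for vectors pointing the same way, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def closest_track(detection_embedding, track_embeddings):
    """Index of the track whose appearance embedding is closest to the detection."""
    return min(
        range(len(track_embeddings)),
        key=lambda i: cosine_distance(detection_embedding, track_embeddings[i]),
    )


# Toy embeddings: track 0 "looks like" [0, 1], track 1 "looks like" [1, 0].
track_embeddings = [[0.0, 1.0], [1.0, 0.0]]
new_detection = [0.9, 0.1]
print(closest_track(new_detection, track_embeddings))  # 1
```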

    The field of tracking is indeed actively developing: there are trackers with Siamese neural networks and trackers with RNNs. Keep your finger on the pulse, because any day now an even more accurate and faster architecture may come out (or already has). By the way, it is very convenient to follow such things on PapersWithCode, which always has links to papers and their code (if any).


    We have really been through a lot and learned a lot. But computer vision is an extremely vast field, and I am an extremely stubborn person. That is why we will meet again in the third article of this series (will it be the last? who knows...), where we will discuss segmentation, pose estimation, action recognition in video and generating a description of an image with neural networks in more detail.

    PS I want to express special thanks to Vadim Gorbachev for his valuable advice and comments in the preparation of this and the previous article.
