I see, therefore I exist: a review of Deep Learning in Computer Vision (part 1)

    Computer vision. Everyone talks about it now, and it is being applied and deployed in many places. Yet for quite a while there have been no review articles on CV on Habr with examples of architectures and modern tasks. There are a lot of them, and they are really cool! If you are interested in what is happening in Computer Vision right now, not only from the point of view of research and papers but also from the point of view of applied problems, you are welcome under the cut. The article can also serve as a good introduction for those who have long wanted to start understanding all this, but something kept getting in the way ;)


    Today PhysTech has an active collaboration between academia and industrial partners. In particular, the PhysTech School of Applied Mathematics and Computer Science hosts many interesting laboratories from companies such as Sberbank, Biocad, 1C, Tinkoff, MTS, and Huawei.

    I was inspired to write this article by my work in the Laboratory of Hybrid Intelligent Systems, opened by VkusVill. The laboratory has an ambitious goal: to build a store that works without checkouts, mainly with the help of computer vision. In almost a year there, I had the chance to work on many vision tasks, which will be discussed in these two parts.

    A store without checkouts? I have heard that somewhere before...
    Dear reader, you probably thought of Amazon Go. In a sense, the task is to repeat their success, but our solution is more about integration into existing stores than about building such a store from scratch for a lot of money.

    We will move according to the plan:

    1. Motivation and what's going on
    2. Classification as a lifestyle
    3. Convolutional neural network architectures: 1000 ways to achieve one goal
    4. Visualization of convolutional neural networks: show me passion
    5. I myself am a kind of surgeon: we extract features from neural networks
    6. Stay close: representation learning for people and faces
    7. Part 2: detecting, evaluating posture and recognizing actions without spoilers

    Motivation and what's going on

    Who is the article for?
    The article is aimed mostly at people who are already familiar with machine learning and neural networks. Still, I advise you to read at least the first two sections: you may find that everything is clear :)

    In 2019, everyone is talking about artificial intelligence, the fourth industrial revolution, and humanity's approach to the singularity. Cool, cool, but I want specifics. After all, we are curious techies who do not believe in fairy tales about AI; we believe in formal problem statements, mathematics, and programming. In this article we will talk about specific use cases of that very modern AI: the use of deep learning (namely, convolutional neural networks) in a variety of computer vision tasks.

    Yes, we will talk specifically about neural networks, occasionally mentioning ideas from "classical" vision (this is what we will call the set of vision methods that were used before neural networks, which in no way means that they are not used now).

    I want to learn computer vision from scratch
    I recommend Anton Konushin's course "Introduction to Computer Vision". Personally, I took its counterpart at SHAD, which gave me a solid foundation in image and video processing.


    In my opinion, the first really interesting application of neural networks in vision, covered in the media as early as 1993, is the recognition of handwritten digits implemented by Yann LeCun. He is now one of the leading AI figures at Facebook AI Research, and their team has already released a lot of useful open-source software.

    Today, vision is used in many areas. I will give just a few striking examples:


    • Self-driving cars from Tesla and Yandex
    • Medical image analysis and cancer prediction
    • Game consoles: Kinect 2.0 (although it also uses depth information, that is, RGB-D pictures)
    • Face recognition: Apple FaceID (using multiple sensors)
    • Facial landmark estimation: Snapchat masks
    • Biometrics based on face and eye movements (an example from a project of FPMI MIPT)
    • Search by image: Yandex and Google
    • Recognition of text in a picture (Optical Character Recognition)
    • Drones and robots: receiving and processing information through vision
    • Odometry: building a map and planning the movement of robots
    • Improving graphics and textures in video games
    • Picture translation: Yandex and Google
    • Augmented reality: Leap Motion (Project North Star) and Microsoft Hololens
    • Style and texture transfer: Prisma, PicsArt

    And that is not to mention the numerous applications in companies' internal tasks. Facebook, for example, also uses vision to filter media content. Computer vision methods are also used in industry for quality control and damage detection.

    Augmented reality deserves special attention here, since any day now it may become one of the main application areas of vision.

    Motivated. Charged. Go:

    Classification as a lifestyle


    As I said, in the 90s neural networks made a splash in vision. And they did so on a specific task: classifying pictures of handwritten digits (the famous MNIST dataset). Historically, it was image classification that became the basis for solving almost all subsequent tasks in vision. Consider a specific example:

    Problem: a folder of photos is given as input; each photo contains one object: a cat, a dog, or a person (let's assume there are no “garbage” photos; this is a highly unrealistic task, but we have to start somewhere). We need to sort the pictures into three folders, /cats, /dogs and /humans, placing into each folder only the photos with the corresponding objects.

    What is a picture / photo?
    Almost everywhere in vision it is customary to work with pictures in RGB format. Each picture has a height (H), a width (W), and a depth of 3 (colors). Thus, one picture can be represented as a tensor of dimension HxWx3 (each pixel is a set of three numbers: the intensity values in the channels).

    Imagine that we are not yet familiar with computer vision, but we know machine learning. Images are simply numerical tensors in the computer's memory. We formalize the task in machine learning terms: the objects are pictures, their features are the pixel values, and the answer for each object is a class label (cat, dog, or person). This is a pure classification task.

    If this already feels difficult...
    ... then it is better to first read the first 4 articles of the OpenDataScience ML Open Course and a more introductory article on vision, for example, a good lecture at Small ShAD.

    You can take some methods from “classical” vision or “classical” machine learning, that is, not a neural network. Basically, these methods extract certain features (keypoints) or local regions from the images that characterize the picture (“bag of visual words”). Usually it all boils down to something like an SVM over HOG / SIFT.

    But we gathered here to talk about neural networks, so we do not want to use features we invented ourselves; we want the network to do everything for us. Our classifier will take the features of an object as input and return a prediction (class label). Here the pixel intensity values act as features (see the picture model in the spoiler above). Remember that a picture is a tensor of size (Height, Width, 3) (if it is in color). During training, the network is usually fed neither one picture at a time nor the whole dataset, but batches, i.e. small portions of objects (for example, 64 images per batch).

    Thus, the network receives an input tensor of size (BATCH_SIZE, H, W, 3). You can “unroll” each picture into a row vector of H * W * 3 numbers and work with the pixel values just like with features in classical machine learning; an ordinary Multilayer Perceptron (MLP) would do exactly that. Frankly, though, this is only a baseline, because treating the pixels as a row vector ignores, for example, the translation invariance of objects in the picture. The same cat can be in the middle of the photo or in a corner, and an MLP will not learn this pattern.
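
    To make the shapes concrete, here is a minimal NumPy sketch. The batch of random 32x32 images and the white-square test pattern are made up for illustration; the only framework fact assumed is that PyTorch convolution layers expect the channels-first layout (BATCH_SIZE, 3, H, W).

```python
import numpy as np

# Hypothetical batch: 64 random RGB images of size 32x32, laid out as (BATCH_SIZE, H, W, 3).
batch_size, height, width = 64, 32, 32
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(batch_size, height, width, 3), dtype=np.uint8)

# "Unroll" every image into one row of H * W * 3 numbers, as an MLP would need.
flat = images.reshape(batch_size, -1).astype(np.float32) / 255.0

# Channels-first layout (BATCH_SIZE, 3, H, W), which PyTorch conv layers expect.
channels_first = images.transpose(0, 3, 1, 2)

# Why flattening is a weak baseline: shift a white square by one pixel and
# many of its values land at different positions of the flattened vector.
square = np.zeros((height, width, 3), dtype=np.float32)
square[10:14, 10:14, :] = 1.0
shifted = np.roll(square, shift=1, axis=1)

self_overlap = float(np.dot(square.ravel(), square.ravel()))
shift_overlap = float(np.dot(square.ravel(), shifted.ravel()))

print(flat.shape, channels_first.shape)
print(self_overlap, shift_overlap)
```

    A one-pixel shift already removes a quarter of the overlap between the two flattened vectors, although to a human the picture is the same.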

    So you need something smarter, for example, a convolution operation. And this is about modern vision, about convolutional neural networks :

    The training code for a convolutional network might look something like this (in the PyTorch framework):
    # taken from the official tutorial:
    # https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    import torchvision
    import torchvision.transforms as transforms

    # CIFAR-10 data as in the tutorial: 32x32 RGB images, 10 classes
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True)

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 5 * 5)  # flatten the feature maps for the FC layers
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x

    net = Net()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(2):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs
            inputs, labels = data
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:    # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
    print('Finished Training')

    Since we are now talking about supervised learning, we need several components to train a neural network:

    • Data (we already have it)
    • A network architecture (the most interesting part)
    • A loss function that tells the neural network how to learn (here it will be cross-entropy)
    • An optimization method (it will change the network weights in the right direction)
    • Hyperparameters of the architecture and the optimizer (for example, the optimizer step size, the number of neurons in the layers, regularization coefficients)

    This is exactly what is implemented in the code; the convolutional neural network itself is described in the Net class.
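
    To show what the cross-entropy loss measures, here is a small NumPy sketch of softmax followed by cross-entropy. It is an illustration of the formula, not PyTorch's implementation; the logits and labels are invented. Like nn.CrossEntropyLoss, it takes raw scores (logits) rather than probabilities.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores of shape (batch, classes) into probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-probability assigned to the correct class."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

# Made-up scores for two pictures and three classes (cat, dog, person).
logits = np.array([[4.0, 0.5, 0.1],
                   [0.2, 0.3, 3.0]])

loss_correct = cross_entropy(logits, np.array([0, 2]))  # labels agree with the scores
loss_wrong = cross_entropy(logits, np.array([1, 0]))    # labels disagree with the scores
print(loss_correct, loss_wrong)
```

    The loss is small when the network puts high probability on the true class and large otherwise, which is the signal the optimizer follows.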

    If you want to learn about convolutions and convolutional networks slowly and from the very beginning, I recommend a lecture from the Deep Learning School (FPMI MIPT) (in Russian) on this topic and, of course, Stanford's cs231n course (in English).

    Deep Learning School - what is it?
    Deep Learning School at the Laboratory of Innovation FPMI MIPT is an organization that is actively engaged in the development of an open Russian-language course on neural networks. In the article I will refer to these video tutorials several times .


    In short, the convolution operation lets you find patterns in images while accounting for their variability. When we train convolutional neural networks (CNNs), we are essentially finding convolution filters (neuron weights) that describe the images well, so well that the class can be determined accurately from them. Many ways of building such a network have been invented. More than you think...
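
    Here is a naive sketch of the convolution operation itself, in NumPy. As in deep learning frameworks, it is technically a cross-correlation (the kernel is not flipped), with no padding and stride 1. The edge filter is hand-made for illustration; in a CNN such filters are learned.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the window by the kernel element-wise and sum up.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-made filter that responds to vertical dark-to-bright edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)

# Two 6x6 images with the same edge at different horizontal positions.
edge_at_3 = np.zeros((6, 6), dtype=np.float32)
edge_at_3[:, 3:] = 1.0
edge_at_2 = np.zeros((6, 6), dtype=np.float32)
edge_at_2[:, 2:] = 1.0

response_3 = conv2d(edge_at_3, kernel)
response_2 = conv2d(edge_at_2, kernel)
print(response_3[0], response_2[0])
```

    The same filter fires wherever the edge is, and its response moves together with the edge. This is the property that the flattened-pixels MLP lacks.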

    Convolutional neural network architectures: 1000 ways to achieve one goal


    Yes, yes, another architectural review . But here I will try to make it as relevant as possible!

    First there was LeNet; it helped Yann LeCun recognize digits in 1998. This was the first convolutional neural network for classification. Its main feature was that it was built around convolution and pooling operations.


    Then there was a lull in the development of neural networks, but the hardware did not stand still: efficient computation on GPUs and XLA were developed. In 2012 AlexNet appeared and made a splash in the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) competition.

    A small digression about ILSVRC
    ImageNet was assembled by 2012, and a subset of thousands of pictures and 1000 classes was used for the ILSVRC competition. ImageNet currently has ~14 million pictures and 21,841 classes (taken from the official site), but for the competition only a subset is usually selected. ILSVRC then became the largest annual image classification competition. By the way, people recently figured out how to train on ImageNet in a matter of minutes.

    It was on ImageNet (in ILSVRC) that state-of-the-art image classification networks were produced from 2010 to 2018. Admittedly, since 2016 the competitions in localization, detection, and scene understanding have been more relevant than classification.

    Typically, architecture reviews cover the networks that took first place at ILSVRC from 2010 to 2016, plus a few individual networks. In order not to clutter up the story, I placed them under the spoiler below, trying to emphasize the main ideas:

    Architectures from 2012 to 2015
    Year | Article | Key idea | Weight
    2012 | AlexNet | two convolutions in a row; network training split into two parallel branches | 240 MB
    2013 | ZFNet | filter size, number of filters in layers | -
    2013 | OverFeat | one of the first neural network detectors | -
    2014 | VGG | network depth (13-19 layers), several Conv-Conv-Pool blocks with a smaller convolution size (3x3) | 549 MB (VGG-19)
    2014 | Inception (v1) (aka GoogLeNet) | 1x1 convolution (idea from Network-in-Network), auxiliary losses (or deep supervision), stacking the outputs of several convolutions (Inception block) | -
    2015 | ResNet | residual connections, very deep (152 layers...) | 98 MB (ResNet-50), 232 MB (ResNet-152)

    The ideas behind all these architectures (except ZFNet, which is rarely mentioned) were groundbreaking for vision neural networks in their time. However, many more important improvements appeared after 2015, for example, Inception-ResNet, Xception, DenseNet, and SENet. Below I tried to collect them in one place.

    Architectures from 2015 to 2019
    Year | Article | Key idea | Weight
    2015 | Inception v2 and v3 | decomposition of convolutions into 1xN and Nx1 convolutions | 92 MB
    2016 | Inception v4 and Inception-ResNet | combination of Inception and ResNet | 215 MB
    2016-17 | ResNeXt | 2nd place at ILSVRC, use of many branches (“generalized” Inception block) | -
    2017 | Xception | depthwise separable convolution, weighs less than Inception at comparable accuracy | 88 MB
    2017 | DenseNet | Dense block; light but accurate | 33 MB (DenseNet-121), 80 MB (DenseNet-201)
    2018 | SENet | Squeeze-and-Excitation block | 46 MB (SENet-Inception), 440 MB (SENet-154)

    Most of these models for PyTorch can be found here, and there is also this cool thing.

    You may have noticed that all of this weighs quite a lot (I would like 20 MB at most, or even less), while nowadays mobile devices are used everywhere and IoT is gaining popularity, which means you want to run neural networks there too.

    Relationship between model weight and speed
    Since neural networks internally do nothing but multiply tensors, the number of multiplication operations (read: the number of weights) directly affects the running speed (unless heavy post- or pre-processing is used). The speed of the network itself also depends on the implementation (framework), the hardware it runs on, and the size of the input image.
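
    As a rough illustration of how model weight follows from the number of weights, here is a sketch that counts the parameters of the small Net class from the training code above and converts them to megabytes. It assumes every parameter is stored as a 4-byte float32; the helper names are my own.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel of a square 2D convolution."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

def linear_params(in_features, out_features):
    """Weights plus one bias per output neuron of a fully connected layer."""
    return (in_features + 1) * out_features

# Layer sizes copied from the Net class in the training code above.
layers = {
    "conv1": conv_params(3, 6, 5),
    "conv2": conv_params(6, 16, 5),
    "fc1": linear_params(16 * 5 * 5, 120),
    "fc2": linear_params(120, 84),
    "fc3": linear_params(84, 10),
}

total_params = sum(layers.values())
size_mb = total_params * 4 / 1024 ** 2  # assumption: 4 bytes per float32 weight

for name, count in layers.items():
    print(f"{name}: {count}")
print(f"total: {total_params} parameters, about {size_mb:.2f} MB")
```

    Note that most of the weights sit in the first fully connected layer, not in the convolutions. By the same arithmetic, a 240 MB model stored as float32 holds roughly 60 million weights.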

    The authors of many articles took the path of inventing fast architectures; I collected their methods under the spoiler below:

    The numbers in all the tables are taken from the repositories, from the Keras Applications table, and from this article.

    You may ask: “Why did you write about this whole zoo of models? And why the classification task? We want to teach machines to see, and classification is just a narrow task...” The point is that neural networks for object detection, pose / keypoint estimation, re-identification, and image search use classification models as their backbone, and 80% of the success depends on them.

    But we would like to trust CNNs a bit more: people have invented black boxes, and what is “inside” is not obvious. To better understand how convolutional networks work, researchers came up with visualization.

    Visualization of convolutional neural networks: show me passion

    An important step towards understanding what happens inside convolutional networks is the article “Visualizing and Understanding Convolutional Networks”. In it, the authors proposed several ways to visualize exactly what (which parts of the picture) neurons in different CNN layers respond to (I also recommend watching the Stanford lecture on this topic). The results were very impressive: the authors showed that the first layers of a convolutional network respond to “low-level things” such as edges / corners / lines, while the last layers respond to whole parts of images (see the picture below), that is, they already carry some semantics.


    Next, a deep visualization project from Cornell University and colleagues took visualization a step further, while the famous DeepDream learned to distort images in a trippy, interesting style (the picture below is from deepdreamgenerator.com).


    In 2017 a very good article was published on Distill with a detailed analysis of what each layer “sees”, and most recently (in March 2019) Google invented activation atlases: unique maps that can be built for each network layer, which brings us closer to understanding the overall picture of how a CNN works.

    If you want to play with visualization yourself, I would recommend Lucid and TensorSpace .

    Okay, it seems that CNNs can be trusted to some degree. Now we need to learn how to use them in other tasks, not just in classification. Extracting image embeddings and Transfer Learning will help us with this.

    I myself am a kind of surgeon: we extract features from neural networks

    Imagine that there is a picture and we want to find others that look visually similar to it (this is, for example, what search by image in Yandex.Pictures does). Previously (before neural networks), engineers extracted features for this by hand, inventing something that describes the picture well and allows it to be compared with others. Basically, these methods (HOG, SIFT) operate on image gradients; they are usually called “classical” image descriptors. For those particularly interested, I refer you to the article and to Anton Konushin's course (this is not advertising, just a good course :)


    With neural networks we do not have to invent these features and heuristics ourselves; we can train the model properly and then take the output of one or more layers of the network as the features of the picture.


    A closer look at all the architectures makes it clear that classification in a CNN consists of two steps:
    1). Feature extractor layers that extract informative features from images using convolutional layers
    2). Fully Connected (FC) classifier layers trained on top of these features


    Image embeddings (features) are exactly this: you can take the features after the feature extractor of a convolutional neural network (although they can be aggregated in different ways) as an informative description of the images. That is, we train the network for classification and then simply take the output just before the classification layers. These are called features, neural network descriptors, or picture embeddings (although the term embeddings is more common in NLP; since this is vision, I will mostly say features). Usually this is some numerical vector, for example, 128 numbers, that you can already work with.

    But what about autoencoders?
    Yes, features can indeed be obtained with autoencoders. In my practice it has been done both ways, but, for example, in articles on re-identification (which will be discussed later) people more often take features after the extractor rather than train an autoencoder for this. It seems to me that if the question is what works better, it is worth running experiments in both directions.

    Thus, the pipeline for search by image can be arranged simply: we run the pictures through a CNN, take features from the desired layers, and compare the features of different pictures with each other. For example, we simply compute the Euclidean distance between these vectors.
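
    The comparison step of this pipeline can be sketched in a few lines of NumPy. The embeddings below are random stand-ins for real CNN features (128 numbers per picture), and the search function is a hypothetical helper, not part of any library.

```python
import numpy as np

def search(query, gallery, top_k=3):
    """Return (index, distance) pairs of the top_k gallery vectors nearest to the query."""
    distances = np.linalg.norm(gallery - query, axis=1)  # Euclidean distance to each row
    order = np.argsort(distances)[:top_k]
    return [(int(i), float(distances[i])) for i in order]

# Stand-in embeddings: in practice each row would be the CNN features of one picture.
rng = np.random.default_rng(42)
gallery = rng.normal(size=(100, 128)).astype(np.float32)

# A query that is a slightly perturbed copy of picture 17, e.g. the same photo re-encoded.
query = gallery[17] + 0.01 * rng.normal(size=128).astype(np.float32)

results = search(query, gallery)
print(results)
```

    For large galleries the brute-force scan is usually replaced by an approximate nearest-neighbor index, but the idea stays the same.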


    Transfer Learning is a well-known technique for efficiently adapting neural networks that have already been trained on some dataset to your own task. People often say Fine Tuning instead of Transfer Learning; the Stanford cs231n course notes separate these concepts: Transfer Learning is the general idea, and Fine Tuning is one implementation of the technique. This distinction is not so important for us; the main thing is to understand that we can train the network to predict well on a new dataset starting not from random weights but from weights trained on some large dataset such as ImageNet. This is especially relevant when there is little data and you still want to solve the problem well.

    However, simply taking the necessary features and fine-tuning from one dataset to another may not be enough, for example, for tasks such as searching for similar faces / people / something specific. Photos of the same person can sometimes look even less alike than photos of different people. We need to make the network pick out exactly those features that are inherent to one person / object, even if it is hard for us to do this by eye. Welcome to the world of representation learning.

    Stay close: representation learning for people and faces

    Terminology Note
    If you read scientific articles, it sometimes seems that different authors understand the phrase metric learning differently, and there is no consensus on which methods should be called metric learning and which should not. That is why in this article I decided to avoid this phrase and use the more logical representation learning. Some readers may disagree; I will be glad to discuss it in the comments.

    We set the tasks:

    • Task 1: there is a gallery (set) of photographs of people's faces; given a new photo, we want the network either to answer with the name of a person from the gallery (presumably this is them) or to say that there is no such person in the gallery (and, perhaps, we then add the new person to it)

    • Task 2: the same thing, but we work not with photographs of faces but with full-body crops of people

    The first task is usually called face recognition, the second re-identification (abbreviated as Reid). I combined them into one block because their solutions use similar ideas today: to learn effective image embeddings that can cope with rather difficult situations, people use different types of losses, such as triplet loss, quadruplet loss, contrastive-center loss, and cosine loss.

    There are also the wonderful Siamese networks, but I honestly have not used them myself. By the way, it is not only the loss itself that matters but also how you sample pairs of positives and negatives for it, as the authors of the article Sampling matters in deep embedding learning emphasize.

    The essence of all these losses and Siamese networks is simple: we want pictures of one class (person) to be “close” in the latent space of features (embeddings), and pictures of different classes (people) to be “far”. Proximity is usually measured as follows: we take the image embeddings from the neural network (for example, a vector of 128 numbers) and compute either the usual Euclidean distance between these vectors or the cosine similarity. Which one works better should be chosen on your own dataset / task.
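
    As an illustration of the “close / far” idea, here is a minimal NumPy sketch of a triplet loss together with cosine similarity. This variant uses plain (non-squared) Euclidean distances and a margin of 0.2, both chosen by me for the example; published formulations differ in such details. The 2D embeddings are invented.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings: 1 = same direction, -1 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(d_pos - d_neg + margin, 0.0))

# Toy 2D embeddings: two photos of one person and one photo of somebody else.
anchor = np.array([1.0, 0.0])
same_person = np.array([0.9, 0.1])
other_person = np.array([-1.0, 0.0])

good_embedding_loss = triplet_loss(anchor, same_person, other_person)
bad_embedding_loss = triplet_loss(anchor, other_person, same_person)  # roles swapped

print(good_embedding_loss, bad_embedding_loss)
print(cosine_similarity(anchor, same_person), cosine_similarity(anchor, other_person))
```

    Triplets that already satisfy the margin contribute zero loss and no gradient, which is one reason why the choice of positives and negatives in a batch matters so much.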

    A schematic representation of a problem solving pipeline on representation learning looks something like this:

    But to be more precise, like this
    At the training stage: we train the neural network either for classification (Softmax + CrossEntropy) or with a special loss (Triplet, Contrastive, etc.). In the second case you also need to select the positives and negatives in each batch correctly.

    At the prediction stage: if a special loss such as triplet was used, it already took embeddings as input, so we take those. If classification was used, you need to experiment: you can take features from one of the convolutional layers, or even the probabilities after the classifier (yes, people do this and it works). Then we compute the distance from the incoming test picture to all pictures in the gallery and output the label of the nearest one. The distance is measured with cosine or the Euclidean metric.

    There are several good articles specifically on face recognition : a review article ( MUST READ! ) , FaceNet , ArcFace , CosFace .


    There are also a lot of implementations: dlib, OpenFace, the FaceNet repo, and this was covered on Habr a long time ago. It seems that only ArcFace and CosFace have been added recently (write in the comments if I missed something; I will be glad to learn about it).

    However, these days it is more fashionable not to recognize faces but to generate them, isn't it?

    The re-identification task, in turn, is an area of active research: articles are published every month, people try different approaches, some things already work and some still do not work very well.

    I will explain the essence of the Reid problem with an example: there is a gallery with crops of people, say 10 people with 5 crops each (possibly from different sides), that is, 50 photos in the gallery. A new detection (crop) arrives, and we must say which person from the gallery it is, or say that this person is not there and create a new ID for them. The task is complicated by the fact that detections of people come from different angles (front, back, side, bottom), and the cameras the photos come from also differ (different lighting / white balance, etc.).
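
    The matching step from this example can be sketched as follows. Everything here is synthetic: the embeddings are random clusters standing in for real network features, and the distance threshold of 2.0 is an arbitrary value that would have to be tuned on real data.

```python
import numpy as np

def identify(query, gallery, gallery_ids, threshold):
    """Return the ID of the nearest gallery crop, or None if nothing is close enough."""
    distances = np.linalg.norm(gallery - query, axis=1)
    nearest = int(np.argmin(distances))
    if distances[nearest] > threshold:
        return None
    return int(gallery_ids[nearest])

# Synthetic gallery: 10 people, 5 crops each, 128-dimensional embeddings.
rng = np.random.default_rng(0)
num_people, crops_per_person, dim = 10, 5, 128
centers = rng.normal(size=(num_people, dim))  # one "true" point per person
gallery_ids = np.repeat(np.arange(num_people), crops_per_person)
gallery = centers[gallery_ids] + 0.05 * rng.normal(size=(len(gallery_ids), dim))

threshold = 2.0  # assumption: would be tuned on a validation set

# A new crop of person 3 and a crop of somebody who is not in the gallery.
known_crop = centers[3] + 0.05 * rng.normal(size=dim)
stranger_crop = rng.normal(size=dim)

known_id = identify(known_crop, gallery, gallery_ids, threshold)
stranger_id = identify(stranger_crop, gallery, gallery_ids, threshold)

# Nobody matched: create a new ID and add the crop to the gallery.
if stranger_id is None:
    new_id = int(gallery_ids.max()) + 1
    gallery = np.vstack([gallery, stranger_crop])
    gallery_ids = np.append(gallery_ids, new_id)

print(known_id, stranger_id, gallery.shape)
```

    On real data the clusters are far less clean than in this toy setup, which is exactly why the different angles and cameras make the task hard.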

    By the way, Reid is one of the key tasks in our laboratory. A lot of articles really do come out: some about a new, more effective loss, some just about a new way of mining negatives and positives.

    A good review of older Reid methods is given in a 2016 article. Now, as I wrote above, two approaches are used: classification or representation learning. However, the problem has its own specifics, and researchers tackle them in different ways. For example, the authors of Aligned Re-Id proposed aligning features in a special way (yes, they managed to improve the network using dynamic programming, Carl), and another article proposed using Generative Adversarial Networks (GAN).


    Despite all these advanced methods, in my experiments the classification approach, for some reason, performed best. Perhaps I missed something, but for now it is a little sad that so much has been invented, yet in the end good old classification is what works. But the main thing is to keep trying and not give up!

    Of the implementations, I would definitely mention OpenReid and TorchReid . Pay attention to the code itself - in my opinion, it is written competently from the point of view of the architecture of the framework, more details here . Plus, they are both on PyTorch, and the Readme has many links to articles on Person Re-identification, which is nice.

    In general, there is particular demand for face and Reid algorithms in China right now (if you know what I mean). Are we next in line? Who knows…

    A word about acceleration of neural networks

    We have already said that you can just come up with a lightweight architecture. But what if the network is already trained and it is cool, but you still need to compress it? In this case, one (or all) of the following methods may help:

    And, of course, nobody has cancelled the rule of not using float64 where, for example, float32 is enough. There is even a recent article about low-precision training. Recently, by the way, Google introduced MorphNet, which (sort of) helps to compress a model automatically.
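
    The effect of the storage type on model size is easy to check with NumPy. The sketch below casts a made-up weight matrix of one million values to lower precision and measures the memory and the rounding error; it only illustrates storage, and whether a real network keeps its accuracy after such a cast has to be verified separately.

```python
import numpy as np

# A made-up weight matrix with one million values (float64 is NumPy's default).
rng = np.random.default_rng(0)
weights64 = rng.normal(size=(1000, 1000))
weights32 = weights64.astype(np.float32)
weights16 = weights64.astype(np.float16)

# Memory footprint in megabytes for each precision.
size_mb = {
    "float64": weights64.nbytes / 1024 ** 2,
    "float32": weights32.nbytes / 1024 ** 2,
    "float16": weights16.nbytes / 1024 ** 2,
}

# Largest absolute rounding error introduced by each cast.
max_error32 = float(np.abs(weights64 - weights32).max())
max_error16 = float(np.abs(weights64 - weights16).max())

print(size_mb)
print(max_error32, max_error16)
```

    Each halving of the precision halves the size, while the rounding error grows by several orders of magnitude between float32 and float16.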

    What's next?


    We have indeed discussed a lot of useful and applied things in DL and CV: classification, network architectures, visualization, embeddings. However, modern vision has other important tasks too: detection, segmentation, scene understanding. And when it comes to video, you also want to track objects over time, recognize actions, and understand what is happening in the video. The second part of the review will be devoted to these things.

    Stay tuned!

    PS: What kind of education does FPMI MIPT (the PhysTech School of Applied Mathematics and Computer Science) offer now?
    FPMI MIPT runs a bachelor's program (by the way, it is now also available in English for foreign students), and master's programs are being actively developed, both full-time and online. The full list of educational opportunities at FPMI MIPT can be found here. If you are interested, I will be glad to discuss specific programs in the comments or in private messages, since PhysTech is no longer what it used to be (in a good way).
