An article about how we tried to apply modern neural network technologies to find helmets on people's heads



    Previously, we built all our intelligent modules on traditional video analysis algorithms (hereinafter, “classical” ones). Of course, we knew about neural networks and tried applying them back in 2008, in particular to compare images of people by cluster. But the results were not outstanding, partly because neural networks were far less developed at the time. So for many years we remained adherents of the "classics" of machine vision, and the only neural networks we had were the ones in our heads :)

    With the advent of convolutional neural networks came the hope that they would perform well on video analysis problems: first, by giving higher accuracy under the same conditions in which we used the previous algorithms, and second, by widening the range of conditions in which the algorithms can work.

    On top of that, this development method seemed much more reliable: it lets you decide almost immediately whether an approach will work or not. When you start working on a “classical” algorithm, you cannot tell right away whether you are on the right path or whether the problem can be solved at all, and it takes some (often substantial) time to reach a result that can be evaluated. For example, we spent about a month on a stereo attachment for the camera to implement a new visitor counter, but in the end we got nothing sensible out of it (see the article “The Birth of a Supernova: How New Functions Appear for Example 3D Visitor Counting”). With neural networks, everything is clearer: even a small sample of several pictures is enough to judge whether the approach will work. If not, you change the sample and check again. Finding the right data type and approach goes much faster, and after that you just need to improve the sample to get ever higher results.

    When the task of creating a missing-helmet detector arose (more on how it came about in the article “Custdev in developing products for video surveillance”), we did not immediately see how to solve it using traditional methods. We decided to check whether it could be done with neural networks. And in general, are modern neural networks really as good as people say?

    So, the detector's job comes down to finding a person's head and determining whether there is a helmet on it. During development, we tried to solve the problem in two ways.

    The neural network does not know in advance that it needs to distinguish two types of people precisely by the presence of a helmet. All it has is two sets of images (people in helmets and people without helmets), and it tries to find features by which these two sets can be told apart. It knows which picture belongs to which set, but not why, and it adjusts its own parameters so that it gives the right answer as often as possible.

    Method I

    First, we fed the neural network images in which people appeared in the frame at full height. We worried from the start that the accuracy would not be sufficient, but if it worked, development would be as quick and simple as possible.



    The fears were confirmed: on such pictures the neural network could not really learn, and the accuracy of finding helmets on a new test set was about 70%. That was completely unacceptable for a working module, but at the same time it showed that the problem could be solved with neural networks!

    In general, the accuracy of the helmet detector consists of sensitivity (the share of people without helmets that are "caught") and the false positive rate (the share of people in helmets that are "caught" by mistake). At a real enterprise where the detector will be used, people wear helmets in most cases, so even a small percentage of false positives turns into a large amount of incorrect data in the output.

    As the initial reference point, we set the target accuracy at no less than 60% sensitivity and no more than 3% false positives. In fact, these were serious demands.
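    To make these two numbers concrete, here is a minimal Python sketch of how sensitivity and the false positive rate are computed, and of why a small false positive rate still matters when most people wear helmets. The function names and the example site (1000 passages, 20 of them without a helmet) are our illustrative assumptions, not figures from the real deployment.

```python
def sensitivity(true_pos, false_neg):
    """Share of people WITHOUT helmets that the detector catches."""
    return true_pos / (true_pos + false_neg)


def false_positive_rate(false_pos, true_neg):
    """Share of people IN helmets that are wrongly flagged."""
    return false_pos / (false_pos + true_neg)


def alarm_precision(n_without, n_with, sens, fpr):
    """Share of raised alarms that point at a person really without a helmet.

    n_without / n_with are hypothetical counts of people without / with
    helmets passing the camera; sens and fpr are the detector's rates.
    """
    true_alarms = n_without * sens
    false_alarms = n_with * fpr
    return true_alarms / (true_alarms + false_alarms)


if __name__ == "__main__":
    # Assumed example: 1000 passages, only 20 of them without a helmet.
    for fpr in (0.03, 0.015):
        share = alarm_precision(20, 980, sens=0.60, fpr=fpr)
        print(f"false positive rate {fpr:.1%}: {share:.0%} of alarms are genuine")
```

    In this made-up scenario, fewer than a third of the alarms are genuine at 3% false positives, which is why the false positive limit is the harder of the two requirements.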

    Training on full-length images of people could not achieve such accuracy. Perhaps this was because such pictures contain, besides the person's head with or without a helmet, many other elements that “distract” the neural network, which takes as essential features things that in fact are not.

    Method II

    We decided that we could help the neural network by showing it not the whole person at full height, but only the head (with or without a helmet). To extract the image of the head, we applied a suitable classifier that we had written long ago for one of our modules, and trained new convolutional neural networks on the results of its work.


    By the way, practice has shown that the number of layers and neurons in the neural network is not that important, nor are its parameters in general. The main thing is the quality of the training sample. With a large and diverse sample, there is a good chance of success; with a small one, the neural network will simply memorize the correct answers without gaining the ability to generalize, and will not give correct answers on pictures that are new to it.

    Our sample was of medium size (several thousand pictures of heads with and without helmets) and included helmets of different colors and slightly different shapes. To improve the results and avoid overfitting, we had to work seriously on augmentation (artificial expansion of the training sample) and regularization (constraining the parameters of the neural network). As a result, accuracy on the test sample reached 85-88%. This is a good figure, but to reduce errors further we added post-processing: the decision that a person is without a helmet and that an “alarm” must be raised is made not from a single frame, but from the results of analyzing each individual person over several frames in a row.
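    The multi-frame post-processing could look roughly like the Python sketch below. The article does not say which rule was actually used, so the sliding window, the vote threshold, the class name `AlarmSmoother`, and the idea that each person arrives with a tracker-assigned `person_id` are all our assumptions.

```python
from collections import defaultdict, deque


class AlarmSmoother:
    """Raise an alarm only when several recent frames agree.

    Assumed rule (not from the article): keep the last `window` per-frame
    verdicts for each tracked person and alarm when at least `min_votes`
    of them say "no helmet".
    """

    def __init__(self, window=5, min_votes=4):
        self.window = window
        self.min_votes = min_votes
        # One fixed-length history of verdicts per tracked person.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, person_id, no_helmet):
        """Record one per-frame verdict; return True if an alarm is due."""
        votes = self.history[person_id]
        votes.append(bool(no_helmet))
        # Wait until the window is full, then count the "no helmet" votes.
        return len(votes) == self.window and sum(votes) >= self.min_votes


if __name__ == "__main__":
    smoother = AlarmSmoother(window=5, min_votes=4)
    # A single misclassified frame among correct ones never triggers an alarm.
    for verdict in [False, False, True, False, False]:
        print(smoother.update("worker-1", verdict))
```

    A rule like this trades a short delay for fewer false alarms: one wrong frame is outvoted by its neighbors.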

    During testing, we were also not very happy with the work of the head detector, so we added a refinement step for the heads found in the image ... also using a neural network. In fact, in both cases this is not a single network but several networks combined into a cascade for greater accuracy (here we simply call them neural networks).

    For our neural network, we took a classic convolutional architecture that has proven itself in classification problems. We also tried different architectures, including the most modern and complex ones, with hundreds of layers and hundreds of millions of parameters. Making the neural network more complex did not fundamentally improve the result. Our experience agreed with what Vapnik-Chervonenkis theory says: the complexity of the classifier must correspond to the complexity of the problem. If the classifier is too complex, it will simply memorize all the answers and fail on new data. If it is too simple, it will not be able to learn.

    For the relatively simple task of detecting helmets, a fairly simple neural network was enough for us.

    The second method turned out to be the more effective one. As a result, we solved the problem and:

    1) developed, in 2.5 months, a working module that went to the first sites for trial use. By our estimate, development by classical methods would have taken at least six months;

    2) use two sets of neural networks, trained on different data, in the missing-helmet detector. The first finds the heads of people in the frame, and the second determines whether each head is in a helmet or not;

    3) reached the declared accuracy threshold: more than 60% sensitivity at 1.5% false positives.
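    The two-stage arrangement from point 2 can be sketched in Python as follows. This is only an outline under our own assumptions: `detect_heads` and `helmet_probability` are hypothetical stand-ins for the two trained sets of networks, the frame is a plain list of pixel rows, and the 0.5 threshold is arbitrary.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x, y, width, height), assumed layout
Frame = Sequence[Sequence[int]]   # grayscale image as rows of pixel values


def crop(frame: Frame, box: Box) -> List[List[int]]:
    """Cut the head region out of the frame."""
    x, y, w, h = box
    return [list(row[x:x + w]) for row in frame[y:y + h]]


def find_bare_heads(
    frame: Frame,
    detect_heads: Callable[[Frame], List[Box]],
    helmet_probability: Callable[[List[List[int]]], float],
    threshold: float = 0.5,
) -> List[Box]:
    """Stage 1 finds heads; stage 2 classifies each cropped head.

    Returns the boxes of heads judged to be WITHOUT a helmet.
    Both callables are placeholders for trained models.
    """
    bare = []
    for box in detect_heads(frame):
        if helmet_probability(crop(frame, box)) < threshold:
            bare.append(box)
    return bare
```

    Keeping the two stages behind separate callables means either one can be retrained or swapped without touching the other.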

    Conclusion: it is possible and even necessary to use neural networks to solve the problems of video analysis, in particular, detecting the absence of a helmet on a person.

    The first successful experience raises a logical question: should all video analysis modules now be developed using neural networks? So far, it is hard to give a definite answer.
    There are modules that we currently see no point in moving to neural networks, because classical methods already solve the task well there. One example is visitor counting (especially in the new 3D implementation): it works very well on classical machine vision methods and reaches 98% accuracy, and it is not yet known whether neural networks would do as well. But neural networks are definitely suitable for smoke and fire detection.

    If we try to derive a criterion for the applicability of neural networks in video analysis, it can be formulated as follows: if it is clear in advance which features to use and how, you can get by with the “classics”; otherwise, it is worth trying neural networks.

    In 3D counting there is a good feature: the distance to a point. In the abandoned object detector, a feature is also easy to find: for example, a distinctive point on the border of the object that can be tracked and compared, or the outline. But with fire it is not clear which features to take. Color? There will always be something the same color as fire. Shape? Fire can take the most varied shapes. Flicker over time? It is not clear exactly what it should look like. Inventing features in advance is a losing game here, so it is better to let a neural network do it.

    But back to our task.

    So, it is solved: we got our answer and drew the relevant conclusions.

