Why and how we hide car license plates in Avito ads

    Hey. At the end of last year we started automatically hiding license plates in the photos on Avito car listing pages. Read on to find out why we did it and what approaches exist for solving problems like this.

    Hide my plate!

    Task


    In 2018, 2.5 million cars were sold on Avito. That is almost 7,000 a day. Every listing needs an illustration, a photo of the car, but from the license plate in it one can find a lot of additional information about the car. So some of our users try to cover the plate on their own.
    [Figures: examples of plates covered by users themselves, including the prototype for the illustration at the beginning of the article]

    The reasons users want to hide their license plate vary. For our part, we want to help them protect their data, and we try to improve the buying and selling experience. For example, an anonymous phone number service has been available for a long time: when you sell a car, a temporary cell number is created for you. And to protect license plate data, we anonymize the photos.


    Solution Overview


    To automate the protection of user photos, we can use convolutional neural networks to detect the polygon containing the license plate.
    Object detection today is done with architectures from two groups: two-stage networks, such as Faster R-CNN and Mask R-CNN, and single-stage (single-shot) networks, such as SSD, YOLO, and RetinaNet. Detecting an object means outputting the four coordinates of the rectangle that bounds the object of interest.


    The networks mentioned above can find many objects of different classes in a picture, which is already overkill for the license plate search problem: our pictures usually contain just one car (there are exceptions, when people photograph the car they are selling together with a random neighbor, but this happens rarely enough to neglect).

    Another feature of these networks is that by default they produce bounding boxes with sides parallel to the coordinate axes. This happens because detection relies on a set of predefined rectangular templates called anchor boxes. More precisely, a convolutional network (for example, resnet34) first extracts a feature map from the picture. Then, for each sliding-window position on that map, a classification decides for each of the k anchor boxes whether it contains an object, and a regression into the four frame coordinates refines the box's position.
    You can read more about this here.
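    To make this concrete, here is a minimal sketch of such an anchor-based head in Python with mxnet (the framework we use); the layer sizes and shapes are illustrative, not taken from any particular network:

        import mxnet as mx
        from mxnet.gluon import nn

        k = 9  # anchor boxes per feature-map cell

        # A shared 3x3 convolution over the backbone's feature map, then two
        # 1x1 branches: objectness per anchor and four coordinate offsets.
        shared = nn.Conv2D(256, kernel_size=3, padding=1, activation='relu')
        cls_branch = nn.Conv2D(k * 2, kernel_size=1)  # object / background per anchor
        reg_branch = nn.Conv2D(k * 4, kernel_size=1)  # dx, dy, dw, dh per anchor
        for block in (shared, cls_branch, reg_branch):
            block.initialize()

        feature_map = mx.nd.random.uniform(shape=(1, 512, 32, 32))  # backbone output
        h = shared(feature_map)
        objectness = cls_branch(h)  # shape (1, k * 2, 32, 32)
        offsets = reg_branch(h)     # shape (1, k * 4, 32, 32)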


    After that, there are two more heads:

    [Figure: a (not the most original) diagram of the architecture]

    - one for classifying the object (dog / cat / plant, etc.),
    - the other (the bbox regressor) for regressing the frame coordinates obtained in the previous step, to increase the ratio of the object's area to the frame's area.

    To predict a rotated bounding box, you need to change the bbox regressor so that it also outputs the frame's rotation angle. If you don't, the result will look something like this:

    [Figure: the result without angle prediction]

    Besides the two-stage Faster R-CNN there are one-stage detectors, such as RetinaNet. It differs from the previous architecture in that it predicts the class and the frame right away, without a preliminary stage of proposing image regions that might contain objects. To predict rotated boxes, you likewise have to change the head of its box subnet.
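    As a hedged sketch of that change (layer sizes again ours): the regression branch simply outputs five numbers per anchor instead of four, the fifth being the rotation angle.

        import mxnet as mx
        from mxnet.gluon import nn

        k = 9  # anchors per feature-map cell
        # Five regression targets per anchor: dx, dy, dw, dh and the angle.
        reg_branch_rot = nn.Conv2D(k * 5, kernel_size=1)
        reg_branch_rot.initialize()

        out = reg_branch_rot(mx.nd.random.uniform(shape=(1, 256, 32, 32)))
        # out has k * 5 channels; every fifth one carries a predicted angle.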


    One example of an existing architecture for predicting rotated bounding boxes is DRBox. It has no preliminary region-proposal stage like Faster R-CNN, so it is a modification of the one-stage methods. Training uses K bounding boxes rotated at fixed angles (rboxes). For each of the K rboxes, the network predicts the probability that it contains the target object, its coordinates, its size, and the rotation angle.


    Modifying the architecture and retraining one of the networks above on data with rotated bounding boxes is a feasible task. But our goal can be reached more simply, because the scope of our network is much narrower: it only has to hide license plates.
    So we decided to start with a simple network that predicts the four points of the plate, and complicate the architecture later if needed.

    Data


    Assembling the dataset breaks down into two steps: collecting pictures of cars and annotating the license plate region on them. The first task is already solved by our infrastructure: we carefully store all the ads ever placed on Avito. To solve the second, we use Toloka. At toloka.yandex.ru/requester we create a task:
    "In this task you are shown a photograph of a car. You need to outline the car's license plate with a quadrangle, as accurately as possible."

    Using Toloka, you can create data labeling tasks: for example, rating the quality of search results, labeling different classes of objects (texts and pictures), annotating video, and so on. They are performed by Toloka users for a fee that you set. In our case, the workers have to outline the polygon containing the car's license plate in the photo. Toloka is very convenient for labeling a large dataset, but getting high quality out of it is fairly hard. The crowd contains a lot of bots whose whole purpose is to extract money from you by answering randomly or with some simple strategy. To counter them there is a system of rules and checks. The main check is mixing in control questions: you manually label part of the tasks in the Toloka interface and then mix them into the main pool.

    For a classification task it is very easy to determine whether an answer is wrong; for a region annotation task it is not so simple. The classic way is to compute IoU (intersection over union): the ratio of the area of the intersection of the two regions to the area of their union.

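    For reference, here is what the classic computation looks like for two arbitrary quadrangles, sketched in Python with the shapely library (an illustration only; inside Toloka it would have to be reimplemented in JavaScript):

        from shapely.geometry import Polygon

        def quad_iou(quad_a, quad_b):
            """IoU of two quadrangles given as lists of four (x, y) points."""
            a, b = Polygon(quad_a), Polygon(quad_b)
            union = a.union(b).area
            return a.intersection(b).area / union if union > 0 else 0.0

        # Two unit-height rectangles overlapping by half:
        print(quad_iou([(0, 0), (2, 0), (2, 1), (0, 1)],
                       [(1, 0), (3, 0), (3, 1), (1, 1)]))  # -> 0.333...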

    If this ratio falls below a certain threshold on several tasks, the user is blocked. However, computing IoU for two arbitrary quadrangles is not that simple, especially since in Toloka it has to be implemented in JavaScript. So we made a small hack: we consider the user's answer correct if, for every point of the reference polygon, there is a point of the user's polygon within a small neighborhood. There are also other rules: blocking users who answer too quickly, captchas, disagreement with the majority opinion, and so on. With these rules set up you can expect fairly good labels; if you really need high-quality, complex annotation, it is better to hire dedicated freelance annotators. In the end our dataset came to 4,000 labeled images, and all of it cost $28 on Toloka.
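    The check itself boils down to something like this minimal Python sketch (the names and the eps radius are our illustration; the production version runs in JavaScript inside Toloka):

        def annotation_ok(reference_quad, annotated_quad, eps=10.0):
            """Accept the answer if every reference point has an annotated
            point within a radius of eps pixels."""
            def close(p, q):
                return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2
            return all(any(close(ref, ann) for ann in annotated_quad)
                       for ref in reference_quad)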

    Model


    Now let's build a network that predicts the four points of the plate region. We extract features with resnet18 (11.7M parameters versus 21.8M for resnet34), then add a head for regression to the four points (eight coordinates) and a head for classifying whether there is a license plate in the picture at all. The second head is needed because not every photo in a car ad shows a whole car: a photo may show only a part of the car.

    [Figure: an ad photo showing only a part of a car. Nothing like this needs to be detected, of course.]
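    A sketch of such a network in mxnet gluon (the class and attribute names are ours, and the real model's details may differ):

        import mxnet as mx
        from mxnet.gluon import nn
        from mxnet.gluon.model_zoo import vision

        class PlateNet(nn.HybridBlock):
            def __init__(self, **kwargs):
                super(PlateNet, self).__init__(**kwargs)
                # resnet18 feature extractor, pre-trained on ImageNet.
                self.features = vision.resnet18_v1(pretrained=True).features
                self.regressor = nn.Dense(8)   # four (x, y) points of the plate
                self.classifier = nn.Dense(2)  # plate present / absent

            def hybrid_forward(self, F, x):
                f = self.features(x)
                return self.regressor(f), self.classifier(f)

        net = PlateNet()
        net.regressor.initialize()   # heads start from scratch,
        net.classifier.initialize()  # the backbone keeps ImageNet weights
        coords, logits = net(mx.nd.random.uniform(shape=(1, 3, 224, 224)))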

    We train for both objectives at once by adding to the dataset photos without a license plate, giving them the all-zero bounding box target (0,0,0,0,0,0,0,0) and the "picture with/without plate" classifier target (0, 1).
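    The target construction might look like this (the exact label convention here is ours, for illustration):

        def make_targets(quad=None):
            """quad: four (x, y) corner points of the plate, or None if absent."""
            if quad is None:
                return [0.0] * 8, 0                         # no plate in the photo
            return [c for point in quad for c in point], 1  # flatten to 8 numbers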

    Then a single loss function for both objectives can be built as the sum of the following losses. For regression to the coordinates of the license plate polygon we use smooth L1 loss:

    smooth_L1(x) = 0.5 * x^2   if |x| < 1
                   |x| - 0.5   otherwise

    It can be interpreted as a combination of L1 and L2: it behaves like L1 when the absolute value of the argument is large and like L2 when the argument is close to zero. For classification we use softmax with cross-entropy loss. The feature extractor is resnet18 with weights pre-trained on ImageNet; we then fine-tune both the extractor and the heads on our dataset. For this problem we used the mxnet framework, since it is the main computer vision framework at Avito. In general, a microservice architecture means you don't have to be tied to one framework, but when you have a large code base it is better to reuse it than to write the same code again.
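    One possible composition of this combined loss in mxnet gluon (a sketch; gluon's HuberLoss with rho=1 is exactly smooth L1):

        from mxnet.gluon import loss

        smooth_l1 = loss.HuberLoss(rho=1.0)
        cross_entropy = loss.SoftmaxCrossEntropyLoss()

        def total_loss(pred_coords, true_coords, pred_logits, true_label):
            # Regression loss over the eight coordinates plus the
            # classification loss of the plate / no-plate head.
            return (smooth_l1(pred_coords, true_coords)
                    + cross_entropy(pred_logits, true_label))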

    Having reached acceptable quality on our dataset, we asked our designers to draw us a license plate with the Avito logo. (At first, of course, we tried to make one ourselves, but it didn't look very good.) Then we adjust the brightness of the Avito plate to the brightness of the original plate region and overlay it on the image.
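    Roughly like this (our helper, using an axis-aligned box for simplicity; the real quadrangle would need a perspective warp, e.g. with cv2.warpPerspective):

        import cv2
        import numpy as np

        def overlay_plate(photo, plate, box):
            """photo, plate: uint8 images; box: (x0, y0, x1, y1) of the detected plate."""
            x0, y0, x1, y1 = box
            region = photo[y0:y1, x0:x1].astype(np.float32)
            plate_f = plate.astype(np.float32)
            # Scale the branded plate's mean brightness to the original region's.
            scale = region.mean() / max(plate_f.mean(), 1.0)
            adjusted = np.clip(plate_f * scale, 0, 255).astype(np.uint8)
            photo[y0:y1, x0:x1] = cv2.resize(adjusted, (x1 - x0, y1 - y0))
            return photo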


    Launch in prod


    The problems of reproducibility, maintenance, and further development of projects, solved with tolerable error in the world of backend and frontend development, are still wide open wherever machine learning models are involved. You have probably had to dig into the legacy code of a model. It's good if the readme has links to the articles or open-source repositories the solution was based on. The retraining script may crash with errors: say, the cudnn version has changed, that version of tensorflow no longer works with this cudnn, and cudnn doesn't work with this version of the nvidia drivers. Maybe training used one data iterator and production testing another. This list can go on for quite a while. In short, reproducibility problems are real.

    We try to eliminate them by training models in an nvidia-docker environment: it contains all the necessary cuda dependencies, and we install the python dependencies in it as well. The library with the data iterator, augmentations, and model inference is shared, at the same version, between the training/experimentation stage and production. So to train the model on new data, you pull the repository to a server and run a shell script that builds the docker environment and brings up a jupyter notebook inside it. Inside you have all the notebooks for training and testing, which are guaranteed not to fail because of the environment. It would be better, of course, to have a single train.py file, but practice shows that you always need to eyeball what the model produces and tweak something during training, so in the end you will run jupyter anyway.

    Model weights are stored in git lfs, a technology for keeping large files in git. Before that we used an artifactory, but git lfs is more convenient: when you clone the service repository, you immediately get the current version of the weights, the same one as in production. Autotests are written for model inference, so you cannot roll out the service with weights that don't pass them. The service itself runs in docker inside our microservice infrastructure on a kubernetes cluster. We use grafana to monitor performance. After a rollout we gradually increase the load on the service instances with the new model. When rolling out a new feature, we set up A/B tests and decide the feature's fate based on statistical tests.
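    The autotests are smoke tests of roughly this shape (the class, path, and shapes here are hypothetical, reusing the PlateNet sketch from above):

        import mxnet as mx

        def test_inference_smoke():
            net = PlateNet()                                # hypothetical model class
            net.load_parameters('weights/platenet.params')  # weights tracked in git lfs
            coords, logits = net(mx.nd.zeros((1, 3, 224, 224)))
            assert coords.shape == (1, 8)
            assert logits.shape == (1, 2)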

    The result: we launched license plate masking on ads in the auto category for private sellers; the 95th percentile of the time to process one image and hide the plate is 250 ms.
