# Doodle classification. Yandex report

A few months ago, our colleagues from Google held a competition on Kaggle to create a classifier for images from the acclaimed game “Quick, Draw!”. The team that included Yandex developer Roman Vlasov took fourth place in the competition. At the January machine learning training session, Roman shared his team's ideas, the final implementation of the classifier, and interesting solutions used by his rivals.

- Hello! My name is Roma Vlasov, and today I will tell you about the Quick, Draw! Doodle Recognition Challenge.

There were five people in our team. I joined it right before the merge deadline. We were unlucky: the shake-up pushed us out of the money zone and out of a gold position, and we took an honorable fourth place.

(During the competition, teams watched their positions on a leaderboard computed on one part of the provided dataset. The final ranking, in turn, was computed on the other part, so that competitors could not tune their algorithms to specific data. Because of this, positions "shake up" at the end: on other data the results may differ. Roman's team was in the top three, which is the money zone, since only the first three places carried prizes. After the shake-up the team ended up in fourth place; in the same way another team lost the victory and a gold position. - Ed.)

The competition was also significant in that Evgeny Babakhnin earned a Grandmaster title for it, Ivan Sosin a Master, Roman Solovyov remained a Grandmaster, Alex Parinov earned a Master, and I became an Expert, and now I am a Master.

What is Quick, Draw!? It is a service from Google. Google aimed to popularize AI and wanted to use this service to show how neural networks work. You go there, click Let's Draw, and a new page opens where you are told: draw a zigzag, you have 20 seconds to do so. You try to draw a zigzag in 20 seconds, like here, for example. If you succeed, the network says it is a zigzag and you move on. There are only six such pictures.

If the network from Google could not recognize what you drew, the task was marked with a cross. Later I will explain what it means for us whether a drawing was recognized by the network or not.

This service gathered a fairly large number of users, and all the pictures that users drew were logged.

Almost 50 million pictures were collected this way. From these, the train and test data for our competition were formed. By the way, the amount of data in the test set and the number of classes are in bold for a reason; I will talk about them later.

The data format was as follows. These are not just RGB pictures but, roughly speaking, a log of everything the user did. Word is our target, countrycode is where the author of the doodle comes from, timestamp is the time. The recognized label simply shows whether the network from Google recognized the image or not. The drawing itself is a sequence of points approximating the curve that the user drew. And the timings: the time elapsed from the start of drawing the picture.
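For illustration, a record in this log format might look like the following sketch (the field names follow the description above; the concrete values and exact layout here are my assumption, not taken from the competition files):

```python
import json

# One illustrative record: each stroke is a triple of x-coordinates,
# y-coordinates, and timings measured from the start of drawing.
record_json = '''
{"word": "zigzag",
 "countrycode": "RU",
 "timestamp": "2017-03-01 12:00:00",
 "recognized": true,
 "drawing": [[[10, 40, 70], [20, 80, 20], [0, 150, 310]],
             [[80, 120], [30, 30], [450, 520]]]}
'''

record = json.loads(record_json)
strokes = record["drawing"]
print(record["word"], len(strokes))   # target class and stroke count
for xs, ys, ts in strokes:            # unpack one stroke
    print(list(zip(xs, ys)), ts)
```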

The data was provided in two formats: the first is raw, and the second is simplified. In the simplified one, the timings were cut out and the set of points was approximated with a smaller set of points using the Ramer-Douglas-Peucker algorithm. The idea of the algorithm: if you have a large set of points that roughly lie along a straight line, you can approximate that line with just two points.
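The simplification step can be sketched with a small pure-Python implementation of the Ramer-Douglas-Peucker algorithm (a minimal sketch for illustration, not the organizers' code):

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:                       # a and b coincide
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / norm

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: approximate a polyline with fewer points."""
    if len(points) < 3:
        return points
    # Find the point farthest from the chord between the endpoints.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= epsilon:                 # everything is close to the chord
        return [points[0], points[-1]]
    # Otherwise split at the farthest point and recurse on both halves.
    left = rdp(points[: index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right

# Ten collinear points collapse to just the two endpoints.
line = [(x, 0) for x in range(10)]
print(rdp(line, epsilon=0.5))           # [(0, 0), (9, 0)]
```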

The classes were distributed as follows: nearly uniform, with a few outliers. When we solved the problem, we did not look at this. The main thing was that there were no classes with really few examples, so we did not have to use weighted samplers or oversample the data.

What did the pictures look like? This is the “aircraft” class and examples from it with recognized and unrecognized tags. Their ratio was roughly 1 to 9. As you can see, the data is quite noisy. I would guess that this is a plane. If you look at the unrecognized ones, in most cases they are just noise. Someone even tried to write the word “plane”, but apparently in French.

Most participants simply took networks, rendered the data from this sequence of lines as RGB pictures, and fed them into the network. I drew in roughly the same way: I took a palette of colors, drew the first line with a color from the beginning of the palette, the last line with a color from the end of the palette, and interpolated across the palette for everything in between. By the way, this gave a better result than drawing just in black, as on the very first slide.
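The palette trick boils down to linearly interpolating a color for each stroke; a minimal sketch (the concrete start and end colors are my choice for illustration):

```python
def lerp_color(c0, c1, t):
    """Linear interpolation between two RGB colors, t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))

def stroke_colors(n_strokes, start=(255, 0, 0), end=(0, 0, 255)):
    """Assign each stroke a color along the palette from start to end."""
    if n_strokes == 1:
        return [start]
    return [lerp_color(start, end, i / (n_strokes - 1))
            for i in range(n_strokes)]

# First stroke gets the start of the palette, last stroke the end,
# strokes in between are interpolated evenly.
colors = stroke_colors(5)
print(colors[0], colors[-1])
```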

Other team members, such as Ivan Sosin, tried a slightly different approach to drawing. With one channel he simply drew a gray picture, with another channel he drew each stroke with a gradient from beginning to end, from 32 to 255, and the third channel drew a gradient across all strokes from 32 to 255.

Another interesting detail: Alex Parinov also fed countrycode information into the network.

The metric used in the competition was Mean Average Precision. What is the essence of this metric for the competition? You can give three predictions, and if none of these three is correct, you get 0. If there is a correct one, its rank is taken into account, and the result for that target is 1 divided by the rank of your prediction. For example, you made three predictions and the correct one is first: you divide 1 by 1 and get 1. If the correct prediction has rank 2, you divide 1 by 2 and get 0.5. And so on.
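This metric, usually written MAP@3, is easy to express directly from the description above (a minimal sketch):

```python
def map_at_3(predictions, targets):
    """Mean Average Precision at 3: 1/rank of the correct label, else 0."""
    total = 0.0
    for preds, target in zip(predictions, targets):
        for rank, p in enumerate(preds[:3], start=1):
            if p == target:
                total += 1.0 / rank
                break
    return total / len(targets)

preds = [["plane", "bird", "kite"],   # correct at rank 1 -> 1.0
         ["bird", "plane", "kite"],   # correct at rank 2 -> 0.5
         ["bird", "kite", "fish"]]    # correct missing   -> 0.0
targets = ["plane", "plane", "plane"]
print(map_at_3(preds, targets))       # (1.0 + 0.5 + 0.0) / 3 = 0.5
```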

We settled data preprocessing - how to draw the pictures and so on - fairly quickly. Which architectures did we use? We tried heavy architectures such as PNASNet and SENet, and such classics as SE-ResNeXt, which appear in new competitions more and more. There were also ResNet and DenseNet.

How did we train all this? All the models we took were pre-trained on ImageNet. Although there is a lot of data, 50 million pictures, a network pre-trained on ImageNet still showed a better result than one trained from scratch.

Which training techniques did we use? First, Cosine Annealing with Warm Restarts, which I will talk about a little later. It is a technique I use in almost all of my recent competitions, and with it the networks train quite well and reach a good minimum.

Next, Reduce Learning Rate on Plateau. You start training the network with a certain learning rate, and the loss gradually converges to some value. You check: for example, over ten epochs the loss has not changed at all. Then you reduce the learning rate by some factor and continue training. The loss drops a little again, converges at some minimum, you lower the learning rate again, and so on, until the network finally converges.
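The plateau logic can be sketched as a tiny scheduler (a simplified stand-in for schedulers like PyTorch's ReduceLROnPlateau; the thresholds and the halving factor here are illustrative):

```python
class ReduceLROnPlateau:
    """Minimal sketch: halve the learning rate when the loss stalls."""
    def __init__(self, lr, factor=0.5, patience=10, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience        # epochs to wait without improvement
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss            # loss improved, reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor  # plateau: reduce the learning rate
                self.bad_epochs = 0
        return self.lr

sched = ReduceLROnPlateau(lr=0.1, patience=3)
for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:   # improves once, then stalls
    lr = sched.step(loss)
print(lr)   # halved after three epochs without improvement
```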

The next interesting technique: don't decay the learning rate, increase the batch size. There is a paper with that name. When you train a network, you do not have to decrease the learning rate; you can simply increase the batch size instead.

By the way, this technique was used by Alex Parinov. He started with a batch size of 408, and when the network reached a plateau, he simply doubled the batch size, and so on.

In fact, I don't remember what batch size he ended up with, but interestingly, there were teams on Kaggle that used the same technique and had a batch size of about 10,000. By the way, modern deep learning frameworks such as PyTorch let you do this very easily: you generate your batch, but instead of feeding it to the network whole, you split it into chunks that fit on your video card, compute the gradients, and only after the gradient for the whole batch has been accumulated do you update the weights.
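The key point of this accumulation is that summing per-chunk gradients and normalizing at the end reproduces the full-batch gradient exactly. A toy check on a one-parameter linear model with squared error (my illustration, not the team's code):

```python
def grad_full(w, xs, ys):
    """Gradient of mean squared error for a 1-parameter linear model."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def grad_accumulated(w, xs, ys, chunk):
    """Same gradient, computed chunk by chunk as on a small GPU."""
    total = 0.0
    for i in range(0, len(xs), chunk):
        cx, cy = xs[i:i + chunk], ys[i:i + chunk]
        # Sum (not average) per-example gradients; normalize at the end.
        total += sum(2 * x * (w * x - y) for x, y in zip(cx, cy))
    return total / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
print(grad_full(0.5, xs, ys))
print(grad_accumulated(0.5, xs, ys, chunk=2))   # same value, chunk by chunk
```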

By the way, very large batch sizes paid off in this competition, because the data was rather noisy, and a large batch size helped approximate the gradient more accurately.

Pseudo-labeling was also used, mostly by Roman Solovyov. He sampled about half of the test data into each batch, and trained the network on such batches.

The size of the pictures mattered, but the fact is that there is a lot of data and you need to train for a long time; if the pictures are quite large, training takes very long while adding little to the quality of the final classifier. So a certain trade-off was worth it, and we tried only pictures of moderate size.

How was it all trained? First, small pictures were taken and several epochs were run, which was fast. Then the picture size was increased and the network trained further, then increased again, so as not to train from scratch and not spend a lot of time.

About optimizers: we used SGD and Adam. This way it was possible to get a single model that gave a score of 0.941-0.946 on the public leaderboard, which is pretty good.

If you ensemble the models in some way, you get around 0.951. If you apply one more technique, you get a final score of 0.954 on the public board, as we did. But more on that later: I will tell you how we assembled the models and how we managed to achieve such a final score.

Next, I would like to talk about Cosine Annealing with Warm Restarts, or Stochastic Gradient Descent with Warm Restarts. Roughly speaking, you can plug in any optimizer, but the point is this: if you just train one network, it gradually converges to some minimum; that is fine, you get one network that makes certain mistakes, but you can train it a little differently. You set some initial learning rate and gradually lower it according to this formula. As you lower it, the network reaches some minimum; then you save the weights, set the learning rate back to its value at the start of training, thereby climbing out of this minimum, and lower the learning rate again.
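The formula in question is the cosine annealing schedule from the SGDR paper; a minimal sketch with a fixed cycle length (the concrete rates and cycle length here are illustrative):

```python
import math

def cosine_annealing_lr(t, cycle_length, lr_max=0.1, lr_min=0.001):
    """Learning rate at epoch t, with a warm restart every cycle_length epochs."""
    t_cur = t % cycle_length                     # position within the cycle
    cos = math.cos(math.pi * t_cur / cycle_length)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos)

# At every restart the rate jumps back to lr_max, then decays toward lr_min.
print(cosine_annealing_lr(0, 10))    # start of a cycle: lr_max
print(cosine_annealing_lr(10, 10))   # warm restart: back to lr_max
print(round(cosine_annealing_lr(5, 10), 4))   # mid-cycle: halfway down
```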

This way you can visit several minima at once, where the loss is roughly the same. But the networks with these weights will make different errors on your data. By averaging them, you get some approximation, and your score will be higher.

About how we assembled our models. At the beginning of the presentation I said to pay attention to the amount of data in the test set and the number of classes. If you add 1 to the number of targets in the test set and divide by the number of classes, you get the number 330, and it was written on the forum that the classes in the test are balanced. This could be used.

Based on this, Roman Solovyov came up with a metric we called Proxy Score, which correlated quite well with the leaderboard. The idea: you make predictions, take the top 1 of your predictions, and count the number of objects for each class. Then subtract 330 from each count and sum the resulting absolute values.
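A minimal sketch of such a Proxy Score (my reconstruction from the description above; note that a class missing from the predictions entirely would also contribute the full expected count, which this toy version skips):

```python
from collections import Counter

def proxy_score(top1_predictions, expected_per_class=330):
    """Lower is better: deviation of per-class counts from the balanced 330."""
    counts = Counter(top1_predictions)
    return sum(abs(c - expected_per_class) for c in counts.values())

# Toy example with two classes and an expected count of 3 per class.
preds = ["cat"] * 4 + ["dog"] * 2
print(proxy_score(preds, expected_per_class=3))   # |4-3| + |2-3| = 2
```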

These are the values we got. This helped us avoid probing the leaderboard and instead validate locally and select coefficients for our ensembles.

With the ensemble you could get such a score. What else could be done? You could take advantage of the information that the classes in the test are balanced.

The balancing approaches differed. An example of one of them: the balancing from the team that took first place.

What did we do? Our balancing was quite simple; it was suggested by Evgeny Babakhnin. We first sorted our predictions by top-1 and chose candidates from them so that the count per class did not exceed 330. But for some classes you end up with fewer than 330 predictions. Okay, then we sorted by top-2 and top-3 and chose candidates from them as well.
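A rough sketch of this rank-by-rank balancing (my reconstruction from the description; the real version also ordered samples by confidence within each rank, which is omitted here for brevity):

```python
def balance_predictions(prob_rows, classes, cap=330):
    """Walk ranks 1..3; at each rank, assign still-unassigned samples to
    their class at that rank unless the class has already hit the cap."""
    final = [None] * len(prob_rows)
    counts = {c: 0 for c in classes}
    for rank in range(3):                  # top-1, then top-2, then top-3
        for i, probs in enumerate(prob_rows):
            if final[i] is not None:
                continue
            order = sorted(classes, key=lambda c: -probs[c])
            if rank >= len(order):
                continue
            c = order[rank]                # this sample's class at this rank
            if counts[c] < cap:
                counts[c] += 1
                final[i] = c
    # Anything still unassigned falls back to its top-1 class.
    for i, probs in enumerate(prob_rows):
        if final[i] is None:
            final[i] = max(classes, key=lambda c: probs[c])
    return final

# Both samples prefer "a", but with a cap of 1 the second gets its top-2.
rows = [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.2}]
print(balance_predictions(rows, ["a", "b"], cap=1))   # ['a', 'b']
```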

How did our balancing differ from the first-place balancing? They used an iterative approach: they took the most popular class and reduced its probabilities by some small number until it was no longer the most popular, then took the next most popular class, and so on, until the counts of all classes became equal.

Everyone used more or less the same approach to training networks, but not everyone used balancing. With balancing you could get into gold, and if you were lucky, into the money.

How to preprocess the data? Everyone preprocessed the data more or less the same way: making handcrafted features, trying to encode timings with different stroke colors, and so on. That is exactly what Alexey Nozdrin-Plotnitsky, who took 8th place, talked about.

He did it differently. He said that all these handcrafted features do not work; you do not need them, your network should learn all of it by itself. Instead, he came up with learnable modules that preprocess your data. He fed them the original data without preprocessing: point coordinates and timings.

Then he took differences of the coordinates and averaged over the timings, obtaining a rather long matrix. He applied 1D convolutions to it several times to get a matrix of size 64xn, where n is the total number of points, and 64 is chosen so that the resulting matrix can be fed to a convolutional layer that accepts 64 channels. Having obtained the 64xn matrix, he needed to build a tensor of some size so that the number of channels was 64. He normalized all the X, Y points to the range from 0 to 32 to make a 32x32 tensor. I do not know why he wanted 32x32; it just happened. And at each coordinate he placed the corresponding fragment of this 64xn matrix. Thus he simply obtained a 32x32x64 tensor, which could then be fed into a convolutional neural network. That's all I wanted to say.
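The final scattering step, placing per-point feature vectors onto a normalized 32x32 grid, can be sketched like this (a pure-Python illustration with made-up inputs; the 1D convolutions are replaced by stand-in features):

```python
def scatter_features(points, features, grid=32):
    """Normalize (x, y) to a grid x grid canvas and place each point's
    feature vector (here length 64) into its cell, giving grid x grid x C."""
    channels = len(features[0])
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    def norm(v, lo, hi):
        if hi == lo:
            return 0
        return min(grid - 1, int((v - lo) / (hi - lo) * grid))

    tensor = [[[0.0] * channels for _ in range(grid)] for _ in range(grid)]
    for (x, y), feat in zip(points, features):
        tensor[norm(y, min_y, max_y)][norm(x, min_x, max_x)] = list(feat)
    return tensor

points = [(0, 0), (100, 50), (255, 255)]
features = [[float(i)] * 64 for i in range(3)]   # stand-in for conv output
t = scatter_features(points, features)
print(len(t), len(t[0]), len(t[0][0]))           # 32 32 64
```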
