How to participate in machine learning competitions. A lecture at Yandex

    Many regulars of the ML trainings hold the well-founded opinion that participating in competitions is the fastest way into the profession. We have even published an article on this topic. The author of today's lecture, Arthur Kuzin, showed by his own example how, in a couple of years, you can retrain from a field entirely unrelated to programming into a data scientist.

    - Hello. My name is Arthur Kuzin, and I am a lead data scientist at Dbrain.

    Emil gave a rather detailed talk covering many aspects. I will focus on what I consider the most important and the most fun. Before getting to the topic of the talk, I want to introduce myself. I graduated from the Department of Physics, and for about eight years, starting from my third year, I worked in a laboratory located on the NK floor. That laboratory creates micro- and nanostructures.

    All that time I worked as a researcher, and it had nothing to do with ML or even with programming. This shows how low the barrier to entry into machine learning is and how quickly you can grow in it. Around 2013, friends invited me to a startup that was doing ML, and over the course of 2-3 years I learned programming and ML at the same time. My progress was rather slow: I studied the materials and dug into them, but not as fast as it happens now. Everything changed for me when I started participating in ML competitions. The first competition was from Avito, on car classification. I did not really know how to approach it, but I managed to take third place. Immediately after that another competition started, this time on ad classification. There were images, text, descriptions, prices: it was a complex competition. I took first place in it, after which I almost immediately got an offer and was hired by Avito. There was no junior position back then; I was hired straight in as a middle, almost without relevant experience.

    Later, while already working at Avito, I began to participate in competitions on Kaggle and earned the Grandmaster title in about a year. I am now in 58th place in the overall ranking; this is my profile. After a year and a half at Avito, I moved to Dbrain, where I now act as something of a data science director, coordinating the work of seven data scientists. Everything I use in my work I learned from competitions. That is why I think this is a very cool topic, and I strongly advocate participating in competitions and growing through them.

    People sometimes ask me what to do if they want to become a data scientist. There are two ways. The first is to take a course. There are quite a lot of them, and they are all of decent quality. But for me personally this does not work at all. All people are different, but I don't like it simply because courses, as a rule, pose very abstract problems, and when I go through a section I don't always understand why I need to know it. In contrast to this approach, you can simply pick a competition and start solving it. This is a completely different flow. The difference is that you acquire a certain baseline of knowledge and then start exploring a new topic only when you run into the unknown. That is, you start solving, realize you lack knowledge about how to train a neural network, then google it and learn it, exactly when you need it. This is very good for motivation and progress, because you already have a task that is strictly formulated within the framework of the competition, a target metric, and a lot of support from the Open Data Science community. And as a bonus down the road, your solution becomes a portfolio project that you would not otherwise have.

    Why is it so much fun? Where do the positive emotions come from? The idea is that when you send a submission and it is a little better than the previous one, you are told: you have improved the metric, great. You climb up the leaderboard. In contrast, if you do nothing and send a submission, you go down. This creates a feedback loop: you feel good when you progress, and vice versa. It is a cool mechanism that only Kaggle seems to exploit. One more thing: Kaggle exploits the same dependency mechanism as slot machines and Tinder. You do not know in advance whether your submission is better or worse, which creates anticipation of an unknown result. So Kaggle is very addictive, but in a fairly constructive way: you develop and keep trying to improve your solution.

    How do you get the first dose? You need to get into the Kernels section. People post pieces of pipelines there, or even entire solutions. A separate question is why people do this. Someone spent time on development; what is the point of making it public, if others can use it and outscore the author?

    The idea is that, first of all, the best solutions are not shared. As a rule, the published solutions are not optimal in terms of model training and lack all the nuances, but they contain an entire pipeline from start to finish, so that you do not have to solve the routine tasks of data processing, postprocessing, assembling submissions, and so on. This lowers the entry threshold and attracts new members. You should understand that the data science community is very open to discussion and, on the whole, quite positive; I have not seen that in the scientific community. The main motivation is that new people come with new ideas. This drives discussion of the task and the competition and lets the entire community develop.

    If you have taken someone else's solution, launched it, and started training, then the next thing I highly recommend is to look into the data. Banal advice, but you would not believe how many people at the top skip it. To understand why this is important, I advise you to watch the talk by Evgeny Nizhibitsky. He talks about leaks in image competitions and about the leak in the Airbus competition, which could also be found simply by looking at the data. It does not take much time and helps you understand the task. The leaks in the images were about the fact that, on different platforms and in different contests, it was possible to recover the test answers from the train set. That is, you could train no model at all and simply look at the data to figure out how to assemble the answers for the test, in part or in full. This habit is important not only in competitions but also in real practice, when you work as a data scientist. In real life the task will most likely be formulated poorly. You are not the one formulating it, but you need to understand its essence and its data. The habit of looking at the data is very important; spend time on it.
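The habit described above can start with a script this small. Here is a minimal sketch (with made-up sample ids and labels, not from any real competition) that prints the class balance and runs the most trivial leak check: whether any test ids also appear in the train set.

```python
from collections import Counter

def first_look(train, test):
    """A first pass over the data: class balance plus the most trivial leak
    check. `train` is a list of (sample_id, label) pairs and `test` a list of
    sample ids; both are hypothetical stand-ins for a competition's files."""
    labels = Counter(label for _, label in train)
    print("class balance:", dict(labels))
    # A leak in its simplest form: test samples that also occur in train,
    # meaning their answers can be copied instead of predicted.
    train_ids = {sid for sid, _ in train}
    overlap = sorted(set(test) & train_ids)
    print("test ids seen in train:", overlap)
    return labels, overlap

if __name__ == "__main__":
    first_look([("a", 0), ("b", 1), ("c", 1)], ["c", "d"])
```

On a real competition the two lists would come from the train and test files; the point is only that a ten-minute look like this can reveal both class imbalance and outright leaks.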

    Next you need to understand what the task is. Suppose you looked at the data and understand what the target is... But you, if I understand correctly, are mostly from the Physics and Technology Institute. You must have some critical thinking that prompts the question: did the people who set up the competition actually get everything right? Why not change, for example, the target metric, look for something else, and assemble the right target from a new metric? In my opinion, now that there are so many tutorials and so much public code, running fit/predict is not a problem. Training a model, training a neural network, is a very simple task accessible to a very wide range of people. But it is important to understand what your target is, what you are predicting, and how your target metric is computed. If you predict something irrelevant to objective reality, the model simply does not learn and you get a very bad score.

    Examples. There was a Konica-Minolta competition held on Topcoder.

    It went like this: you have two pictures, shown at the top, and on one of them there is a speck of dirt, a small dot on the right. You had to find it and segment it. It would seem a very simple task, one that neural networks should solve with ease. But the problem is that the two pictures were taken either at different times or with different cameras. As a result, one picture was shifted slightly relative to the other, and the shift was genuinely very small. There was another peculiarity of this task: the masks are also small. One picture is shifted relative to the other, and on top of that the mask is shifted relative to it. You can see the difficulty.
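For completeness: one standard way to quantify such a small shift between two pictures (not mentioned in the talk itself) is phase correlation. A minimal NumPy sketch:

```python
import numpy as np

def estimate_shift(a, b):
    """Estimate the integer (dy, dx) translation of image `a` relative to `b`
    via phase correlation: the normalized cross-power spectrum of two shifted
    copies of the same image inverse-transforms to a delta at the shift."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = Fa * np.conj(Fb)
    cross /= np.abs(cross) + 1e-12  # keep only phase information
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices past the midpoint correspond to negative shifts (wrap-around).
    if dy > a.shape[0] // 2:
        dy -= a.shape[0]
    if dx > a.shape[1] // 2:
        dx -= a.shape[1]
    return int(dy), int(dx)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((64, 64))
    shifted = np.roll(img, (3, -5), axis=(0, 1))
    print(estimate_shift(shifted, img))  # (3, -5)
```

Estimating and undoing the shift like this is one plausible preprocessing step for such misaligned pairs; it is not what any particular contestant did.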

    Alexander Buslaev, who took third place, used a Siamese neural network with two inputs, so that the Siamese heads would learn the transformations of the distorted picture. After that the features were combined, there was a set of convolutions, and a prediction came out. To compensate for this flaw in the data, he assembled a rather complex network. I, for example, have never trained a Siamese network; I never had to. He did it, which is very cool, and took third place. In first place was Evgeny (inaudible), who simply resized the picture. He spotted the flaw in the data because he looked at it, resized the picture, and trained a vanilla UNet. That is a very simple neural network, straight out of the textbook, with published articles. This shows that if you look at the data and choose the right target, you can reach the top with a simple solution.

    I was in second place, because I am friends with Zhenya; after that, for some reason, the Topcoder guys took offense at me and did not take me onto their team for Kaggle. But they are very cool guys: the Topcoder team that took 5th-6th place, this (inaudible) and Victor Durnov, and Alexander Buslaev, who took third place. They later joined forces and showed their class at a competition on Kaggle. That is also an example of a very beautiful solution, where the guys not only built a powerful architecture but, above all, picked the right target.

    Link from the slide

    Here the task was to segment cells, and not only to say where a cell is and where it is not, but to isolate individual cells: instance segmentation of each separate cell. There had been a lot of segmentation competitions before this one, and it was believed that the segmentation task had been solved by the ODS community quite well, at the state-of-the-art level, at the cutting edge of the science of solving this problem.

    At the same time, the task of segmentation where you need to separate individual cells was solved very badly. The state of the art before this competition was Mask R-CNN, which is a kind of detector: a feature extractor, then a block that produces the segmentation mask. All of this is rather difficult to train, since you need to train each piece of the pipeline separately; it is a whole saga.

    Instead, the Topcoder guys developed a pipeline where you predict only cells and borders. The segmentation pipeline becomes only slightly more complex and allows very clean instance segmentation by subtracting the borders from the cells. With this they raised the bar for the accuracy of the algorithm: a single neural network that predicts cells better than everything academics in this field had done before. This is great fun for the Topcoder guys and very bad for the academics. As far as I know, academics recently tried to publish an article on this dataset, and it was rejected because they could not beat the Kaggle result. Difficult times have come for academics: now you have to do something genuinely good, not just incremental work in your field.
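The cells-plus-borders idea can be sketched in a few lines: subtract the predicted borders from the predicted cells, then label the connected components that remain. This is only an illustration of the idea, not the winners' actual post-processing; `label_components` is a tiny stand-in written here so the sketch needs nothing beyond NumPy.

```python
import numpy as np

def label_components(mask):
    """4-connected components of a boolean mask (a tiny stand-in for
    scipy.ndimage.label, to keep the sketch dependency-light)."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    h, w = mask.shape
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        stack = [start]
        while stack:  # flood fill from this seed pixel
            y, x = stack.pop()
            if 0 <= y < h and 0 <= x < w and mask[y, x] and not labels[y, x]:
                labels[y, x] = current
                stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels, current

def instances_from_cells_and_borders(cell_mask, border_mask):
    """Turn two binary predictions (cells and borders) into instance labels by
    subtracting the borders and labeling what remains."""
    return label_components(cell_mask & ~border_mask)

if __name__ == "__main__":
    # Two touching "cells" separated only by a one-pixel predicted border.
    cells = np.zeros((5, 7), dtype=bool)
    cells[1:4, 1:6] = True
    borders = np.zeros_like(cells)
    borders[1:4, 3] = True
    labels, n = instances_from_cells_and_borders(cells, borders)
    print(n)  # 2: the border splits one blob into two instances
```

Without the border channel, plain segmentation would merge the two touching cells into one blob; predicting borders is exactly what makes the cheap connected-components step sufficient.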

    The next thing I strongly advocate, not only on Kaggle but also at work, is training pipelines. I do not see much value in building a monstrous neural network architecture or inventing cool blocks with attention and feature concatenations. All of that works, but it is much more important simply to be able to train a neural network. And that is not rocket science; it is quite a simple thing, considering how many articles, tutorials, and so on exist now. I see a lot of value in having a proper training pipeline. By that I mean code that runs from a config and trains a neural network for you in a controlled, predictable, and fairly fast way.

    This slide shows the training logs from the competition running right now, the Kaggle salt competition. I also still have a bunch of video cards, which is a bonus. The idea is that, using the pipeline, I ran a grid search over the architectures that seemed most interesting to me. I made a single launch config for all architectures, went over a whole zoo of neural networks, and trained them all without any real effort. This is a very big bonus, and it is what I reuse from competition to competition and at work. So I strongly urge you not just to train a neural network, but to think about what you are training and what you are writing in terms of the pipeline, so that you can reuse it.

    Here I have highlighted some key things that should be in a training pipeline. First, the launch config that fully defines the training process. You specify all the parameters there: about the data, about the neural network, about the losses. Everything should be in the launch config; it must be under control. Next, logging. The beautiful logs I showed are the result of recording every step I take.
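As an illustration of a launch config that fully defines a run, here is a minimal sketch with hypothetical registries (`MODELS`, `LOSSES`, and the config keys are all invented for this example): adding a new architecture is one dictionary entry, and a grid search over the "zoo" reduces to looping over configs.

```python
# Hypothetical registries: adding a new architecture or loss is one entry here.
MODELS = {
    "unet": lambda cfg: {"arch": "unet", "filters": cfg["filters"]},
    "linknet": lambda cfg: {"arch": "linknet", "filters": cfg["filters"]},
}
LOSSES = {"bce": "binary_cross_entropy", "dice": "dice_loss"}

def build_run(config):
    """Build everything for one run from the config and nothing else, so the
    run is fully reproducible from its config file alone."""
    model = MODELS[config["model"]](config)
    loss = LOSSES[config["loss"]]
    return {"model": model, "loss": loss, "lr": config["lr"]}

if __name__ == "__main__":
    # One config per architecture is all a grid search over the zoo needs.
    base = {"loss": "bce", "lr": 1e-3, "filters": 32}
    for arch in MODELS:
        run = build_run({**base, "model": arch})
        print(run["model"]["arch"], run["loss"], run["lr"])
```

In a real pipeline the factories would return actual networks and loss functions, and the config would live in a versioned JSON or YAML file; the structure, everything flowing from one config through registries, is the point.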

    Modularity means that adding a new neural network, a new augmentation, or a new dataset should not take much time. All of this should be very simple and well supported.

    Reproducibility is simply fixing the seeds, not only the random ones in NumPy and Random; there are also some PyTorch-specific things, which I will mention later. And reusability: once you have developed a pipeline, you can use it in other tasks. This is a big bonus. Those who start participating in competitions early can reuse these pipelines in later competitions and at work, which gives them a big advantage over other participants.
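Seed fixing can be collected into one helper. This is a sketch of the usual recipe; the torch section lists the PyTorch-specific flags commonly set for determinism and is guarded so the code also runs where PyTorch is not installed.

```python
import os
import random

import numpy as np

def fix_seeds(seed=42):
    """Fix the usual sources of randomness for a reproducible run."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # The "PyTorch-specific things" mentioned above: cuDNN picks
        # nondeterministic kernels unless told otherwise.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; the NumPy/random part still applies

if __name__ == "__main__":
    fix_seeds(0)
    a = np.random.rand(3)
    fix_seeds(0)
    b = np.random.rand(3)
    print(bool((a == b).all()))  # True: same seed, same numbers
```

Calling such a helper at the top of every launch, with the seed taken from the launch config, is what turns "it worked yesterday" into a rerunnable experiment.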

    Some may ask: I do not know how to code, so what do I do, how do I develop a pipeline? There is a solution.

    Links from the slide: first , second , third

    Sergey Kolesnikov is my colleague at Dbrain, and he has been developing such a thing for a long time. At first he called it PyTorch Common, then Prometheus, and now it is called Catalyst. Most likely the name will change again in a week, but for now, follow the Catalyst link.

    The idea is that Sergey developed a library that is essentially a train loop, and in its current version it has almost all the properties I described. There are also a bunch of examples of how to do classification, segmentation, and a bunch of other cool things he has developed.

    Here is the list of features that already exist and are being developed. You can take this library and start training your own algorithms, your own neural networks, on a competition that is running right now. I recommend everyone do so.

    By contrast, there is also FastAI, which recently released version 1.0, but its code is disgusting and nothing in it is clear.

    You can master it, and it will give you some gain, but because it is very poorly written in terms of code, and they have their own flow for how things should be written, from a certain point you will stop understanding what is happening. Therefore I do not recommend FastAI; I recommend using Catalyst.

    Now suppose you have gone through all of this: you have your pipeline and your solution, and you can now take part in a team. Emil was just asked how justified it is to join a team and how that works. It seems to me that teaming up is worth it in any case, even if you are not at the top but somewhere in the middle. If you have developed your own solution, it always differs from other people's solutions in some small details, and combining with other participants almost always gives a boost.

    Besides, it is fun. It is genuine teamwork in the sense that you now have a common repository where you can read each other's code, a common submission format, and a chat where all the fun happens. Social interaction and soft skills matter a lot at work too, and they are also worth developing.

    This is a big bonus in the sense that you now see other people's code and how they arrive at one solution or another. I often look into the repositories of my previous teams and find cool solutions there in terms of the code itself. This is something you can take away from a competition through team interaction.

    Suppose you have gone through this whole cycle. What have you gained?

    Most likely, you have learned how to run someone else's code. I really hope you have acquired the habit of looking into the data. You understand the problem, you have learned how to experiment, you have a solution of your own, and now you can package it as a project. Viewed abstractly, this is very similar to normal work at an IT company. And if you went through a competition and showed a good result, it is a strong point on a resume, at least for me. I interviewed some 20-25 people while hiring at Dbrain, and some boundary cases stood out. There was a guy who had simply run a public kernel without really understanding it. That looked bad to me; I just wanted the person to understand the question. I did not hire him.

    Another guy honestly said that he had overfitted to the leaderboard, but at the same time he explained all the details of his solution, which was for the Data Science Bowl. We hired him, and I really enjoy working with him. Kaggle, and your solution there, is quite a strong point on your resume if you can package it properly as a presentation and show it well to a future employer.

    I hope I have covered the questions about personal gain. Why do companies need this?

    I worked at Avito, and they regularly organized data analysis contests. There are several reasons for this. To hold a competition, you need at least to collect a dataset and to formulate, very precisely, a task that represents some real pain point.

    Link from the slide

    That is, the task formulation plus the dataset is already worth a lot to the company. For example, when Avito held the duplicates contests, which was before I started working there, they simply collected the data, retrained old models, and greatly improved quality. Dataset collection is exactly the work needed to raise model quality.

    In addition, when many participants work with a single dataset, they produce insights: useful information about how the data is structured. Not to mention that from the results of the competition you can see the upper bound for a solution. If the best data scientists solve the problem, that is most likely the ceiling of what can be achieved on the task.

    Besides, it is no secret that competitions are usually also a hiring tool. The company usually hires people from the top so that they can implement the solution in-house. For example, after the duplicates contest, Dmitry Sobolevsky was invited to work at Avito, and he worked there with me for about the same amount of time. It was a big bonus for Avito: they found a motivated person who wanted to finish the solution for production.

    From the employees' point of view, contests are also a kind of proxy for their skill level. There is a task that a data scientist grinds away at inside the company, and it is unclear how well he does it, even assuming he is sufficiently motivated. To benchmark his performance, to see how adequately he does his work, he can simply take part in competitions. Here is a tricky thing: the importance of each model's contribution to the second-level XGBoost. If you are not really into the subject, it shows how important each model in the ensemble is, though it says nothing about the quality of a model on its own. This shows how I participated with seven other guys. It seems my models were fine at the time; they made a good contribution to the ensemble. That gave me some peace of mind that I was doing my job well.
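To illustrate how a second-level model's weights serve as a proxy for each first-level model's contribution, here is a dependency-light sketch. It uses a least-squares blend instead of XGBoost (a deliberate simplification; the fitted weights play the same diagnostic role as feature importances), on synthetic predictions from a hypothetical strong model and a useless one.

```python
import numpy as np

def blend_weights(first_level_preds, target):
    """Fit a least-squares blend over first-level model predictions. The
    weight magnitudes are a crude proxy for each model's contribution, the
    role XGBoost feature importances play in the talk's example."""
    X = np.column_stack(first_level_preds)  # one column per first-level model
    w, *_ = np.linalg.lstsq(X, target, rcond=None)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    truth = rng.random(200)
    good = truth + 0.05 * rng.standard_normal(200)  # strong model: truth + noise
    useless = rng.random(200)                       # model unrelated to truth
    w = blend_weights([good, useless], truth)
    print(w[0] > w[1])  # the strong model gets the larger weight
```

The real picture is noisier, since importance in an ensemble also depends on correlation between models, but the principle is the same: a model that keeps earning weight in the second level is pulling its weight in the team.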

    If companies do not want to participate or organize competitions but simply have a relevant task, then, as a rule, after a competition people publish write-ups of their solutions and even code. If you have a very similar task of your own, you can take those and get a warm start in the form of a literature review and a review of the top solutions.

    As an example, there is a Coursera course on how to win competitions. Then there is the ML trainings site and the ODS chat for discussion. That is all I wanted to say.
