NeurIPS: how to conquer the best ML conference

NeurIPS is currently considered the top conference in the world of machine learning. Today I will share my experience of taking part in NeurIPS competitions: how to compete with the best academics in the world, finish on the podium, and publish a paper.

What the conference is about

NeurIPS promotes the adoption of machine learning methods across scientific disciplines. About 10 competition tracks are launched every year to tackle pressing problems in academia. Based on the competition results, the winners present their new methods and algorithms at the conference. I am most passionate about reinforcement learning (RL), so this is the second year I have taken part in the RL competitions held as part of NeurIPS.

Why NeurIPS

First, NeurIPS is primarily about science, not money. By taking part in the competitions you do something that really matters and work on real problems.

Second, the conference is a global event: scientists from different countries gather in one place, and you can talk to any of them.

In addition, the whole conference is packed with the latest scientific achievements and state-of-the-art results, which people working in data science really need to know and follow.

How to start?

Getting started in these competitions is quite simple. If you know enough deep learning to train a ResNet, that is enough: register and go. There is always a public leaderboard where you can soberly assess your level against other participants. And if something is unclear, there are always channels in Slack / Discord / Gitter / etc. to discuss any issues that come up. If the topic is really “yours”, nothing will stop you from getting the result you want: in every competition I took part in, all the approaches and solutions were studied and implemented during the competition itself.

NeurIPS through a specific case: Learning to Run

The problem

A person's gait is the result of the interaction of muscles, bones, the organs of sight, and the inner ear. When the central nervous system is impaired, movement disorders can occur, including a gait disturbance called abasia.
Researchers from the Stanford Neuromuscular Biomechanics Laboratory decided to bring machine learning into the treatment problem so that they could experiment and test their theories on a virtual skeleton model rather than on living people.

Problem statement

Participants were given a virtual human skeleton (in the OpenSim simulator) with a prosthesis in place of one leg. The task was to teach the skeleton to move in a given direction at a given speed. Both the direction and the speed could change during the simulation.

Reinforcement learning was proposed as the way to obtain a control model for the virtual skeleton. The simulator gave us a state of the skeleton S (a vector of ~400 numbers). We had to predict which action A to perform (the muscle activations for the legs, a vector of 19 numbers). During the simulation, the skeleton received a reward R: a constant minus a penalty for deviating from the given speed and direction.
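To make the S / A / R loop concrete, here is a minimal sketch of the interaction cycle. It is not the competition environment: `ToySkeletonEnv`, its made-up "dynamics", the `alive_bonus` constant, and the quadratic penalty are all assumptions chosen only to mirror the description above (a state vector of about 400 numbers, 19 activations, a constant minus a deviation penalty).

```python
import random

STATE_SIZE = 400   # "~400 numbers" in the text; the exact size is an assumption
ACTION_SIZE = 19   # muscle activations, as in the text


class ToySkeletonEnv:
    """Hypothetical stand-in for the simulator; it does no physics at all."""

    def __init__(self, target_speed=3.0, alive_bonus=10.0, max_steps=300, seed=0):
        self.target_speed = target_speed
        self.alive_bonus = alive_bonus      # the "constant" part of the reward
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def _observe(self):
        # Random numbers instead of real joint positions and velocities.
        return [self.rng.uniform(-1.0, 1.0) for _ in range(STATE_SIZE)]

    def reset(self):
        self.steps = 0
        return self._observe()

    def step(self, action):
        assert len(action) == ACTION_SIZE
        self.steps += 1
        # Toy "dynamics": speed is proportional to the mean activation.
        speed = 6.0 * sum(action) / ACTION_SIZE
        # Reward R: a constant minus a penalty for deviating from the target speed.
        reward = self.alive_bonus - (speed - self.target_speed) ** 2
        done = self.steps >= self.max_steps
        return self._observe(), reward, done


def random_policy(state, rng):
    # Ignores the state; a trained agent would map the state S to an action A.
    return [rng.random() for _ in range(ACTION_SIZE)]


def run_episode(env, policy, seed=0):
    """Run one episode and return the total reward, the quantity being maximized."""
    rng = random.Random(seed)
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    print(run_episode(ToySkeletonEnv(), random_policy))
```

Training replaces `random_policy` with a model whose parameters are adjusted so that the total returned by `run_episode` grows.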

About reinforcement learning

Reinforcement learning (RL) is a field concerned with decision theory and the search for optimal behavior policies.

Recall how you teach a ~~cat~~ dog new tricks. You repeat some action, give a treat when the trick is performed, and give nothing when it is not. The dog has to figure all this out and find a behavior strategy (a “policy” in RL terms) that maximizes the number of treats it receives.

Formally, we have an agent (the dog) that learns from the history of its interactions with an environment (the human). The environment evaluates the agent's actions and gives it a reward (a treat): the better the agent behaves, the larger the reward. Accordingly, the agent's task is to find a policy that maximizes the reward over the entire time it interacts with the environment.

Taking this idea further: rule-based solutions are software 1.0, where all the rules are written by the developer; supervised learning is that same software 2.0, where the system learns from available examples and finds dependencies in the data; reinforcement learning is a step further, where the system itself learns to explore, experiment, and find the required dependencies in its own decisions. The further we go, the better we try to replicate how humans themselves learn.

Features of the task

The task is a typical example of reinforcement learning with a continuous action space. It differs from ordinary RL in that, instead of choosing one specific action (pressing a joystick button), the action has to be predicted precisely, and there are infinitely many possibilities.

The basic approach (Deep Deterministic Policy Gradient) was invented in 2015, which is a long time ago by DL standards, yet the field continues to develop actively in application to robotics and real-world RL. There is plenty to improve: robustness (so as not to break a real robot), sample efficiency (so as not to collect data from real robots for months), and other RL problems (the exploration vs exploitation trade-off, etc.). We were not given a real robot in this competition, only a simulation, but the simulator itself was 2000 times slower than its open-source counterparts (on which everyone tests their RL algorithms), and so it took the sample-efficiency problem to a new level.
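For readers who have not met Deep Deterministic Policy Gradient, two of its core ingredients can be sketched in a few lines: the bootstrapped target for the critic and the slow "soft" update of the target networks. This is an illustration, not our competition code. The toy actor and critic are hypothetical placeholder functions, and the `gamma` and `tau` defaults are illustrative values.

```python
def td_target(reward, done, next_state, target_actor, target_critic, gamma=0.99):
    """Critic regression target: y = r + gamma * Q'(s', mu'(s')) for non-terminal s'."""
    if done:
        return reward
    next_action = target_actor(next_state)   # deterministic action, no sampling
    return reward + gamma * target_critic(next_state, next_action)


def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online_params, target_params)]


# Hypothetical toy networks, used only to make the sketch executable.
def toy_actor(state):
    return [0.5 for _ in range(2)]


def toy_critic(state, action):
    return sum(state) + sum(action)


if __name__ == "__main__":
    y = td_target(1.0, False, [1.0, 2.0], toy_actor, toy_critic)
    print("critic target:", y)
    print("updated target params:", soft_update([0.0, 0.0], [1.0, -1.0], tau=0.1))
```

The actor outputs a real-valued vector directly, which is what makes the method suitable for a continuous action space: there is no enumeration over actions anywhere.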

Stages of the competition

The competition itself took place in three stages, during which the task and conditions changed somewhat.

Stage 1: the skeleton learned to walk straight at a speed of 3 meters per second. The task was considered complete if the agent lasted 300 steps.
Stage 2: the speed and direction changed at regular intervals. The episode length increased to 1000 steps.
Stage 3: the final solution had to be packaged in a Docker image and submitted for evaluation. A total of 10 submissions were allowed.

The main quality metric was the total reward over the simulation, which showed how well the skeleton kept to the given direction and speed over the whole episode.

During stages 1 and 2, each participant's progress was shown on the leaderboard. The final solution had to be submitted as a Docker image. There were limits on running time and resources.

Coolstory: public leaderboard and RL

Because the leaderboard is public, nobody shows their best model, so that in the final round they can deliver “a little more than usual” and surprise their rivals.

Why Docker images are so important

Last year there was a small incident with the evaluation of solutions in the very first round. At that time, testing went through HTTP interaction with the platform, and a leak of the test conditions was found. It was possible to find out exactly which situations the agent was evaluated on and to retrain it only for those conditions. That, of course, did not solve the real problem. This is why it was decided to move the submission system to Docker images that are run on the organizers' remote servers. Dbrain uses the same system to compute competition results for exactly the same reasons.

Key points

Team

The first thing that matters for the success of the whole undertaking is the team. No matter how good you are (and how powerful your hands are), working in a team greatly increases your chances of success. The reason is simple: a variety of opinions and approaches, double-checking of hypotheses, and the ability to parallelize the work and run more experiments. All of this is extremely important when you face new problems.

Ideally, your knowledge and skills should be at the same level and complement each other. For example, this year I got our team onto PyTorch, and I in turn got some initial ideas for implementing a distributed agent training system.

How do you find a team? First, you can join the ranks of ods and look for like-minded people there. Second, there is a separate Telegram chat for RL folks, the RL club. Third, you can take the wonderful Practical RL course from the ShAD, after which you will definitely come away with a couple of acquaintances.

However, keep in mind the “submit first, or no deal” policy. If you want to team up, first build your own solution, submit it, appear on the leaderboard, and show your level. Practice shows that such teams are much better balanced.

Motivation

As I already wrote, if the topic is “yours”, nothing will stop you. This means that you do not merely like the field: it inspires you, you are fired up by it, and you want to become the best in it.
I first met RL 4 years ago, while taking the Berkeley 188x course (Intro to AI), and I still never cease to be amazed at the progress in this field.

Consistency

Third, and just as important: you need to be able to do what you promised, to put time into the competition every day and just ... solve it. Every day. No innate talent can compare with the ability to do something, even just a little, but every single day. This is what the motivation is for. To succeed at this, I recommend reading Deep Work and the AMA by ternaus.

Time management

Another extremely important skill is the ability to pace yourself and use your free time well. Combining a full-time job with competitions is no trivial task. The most important thing in these conditions is not to burn out and to withstand the whole load. To do that, you need to manage your time properly, soberly assess your strength, and remember to rest in time.

Overwork

In the final stage of a competition, there usually comes a point where in just one week you need to do not merely a lot, but A LOT. For the sake of a better result, you need to be able to make yourself sit down and make the final dash for the coveted prize.

Coolstory: deadline after deadline

Why might you need to put in overtime for the sake of a competition at all? The answer is quite simple: deadline extensions. In such competitions the organizers often cannot foresee everything, so the simplest way out is to give participants more time. This year the competition was extended 3 times: first by a month, then by a week, and at the very last moment (24 hours before the deadline) by another 2 days. And while during the first two extensions we just had to organize the extra time properly, in the last two days we simply had to grind.

Theory

Among other things, do not forget the theory: keep up with what is happening in the field and be able to pick out what is relevant. For example, last year's solution by our team started from the following papers:

Continuous control with deep reinforcement learning - the foundational paper on deep reinforcement learning for problems with a continuous action space.
Parameter Space Noise for Exploration - a study of adding noise to the agent's weights for better exploration of the environment. In my experience, one of the best exploration techniques in RL.
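The idea of exploring through the weights rather than through the actions can be sketched as follows. This is a simplified illustration, assuming a hypothetical one-layer `linear_policy` and a fixed noise scale `sigma`; it leaves out the adaptive scaling of the noise, and it is not the code we used in the competition.

```python
import math
import random


def linear_policy(weights, state):
    """Deterministic policy: one tanh unit per action dimension."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]


def perturb(weights, sigma, rng):
    """Return a copy of the weights with Gaussian noise added to every entry."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


def explore_episode(weights, states, sigma, rng):
    """Perturb the weights once, then act with the same noisy policy all episode."""
    noisy = perturb(weights, sigma, rng)
    return [linear_policy(noisy, s) for s in states]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[0.1, -0.2], [0.3, 0.4]]       # 2 actions, 2 state features
    states = [[1.0, -1.0], [0.5, 0.5], [1.0, -1.0]]
    for action in explore_episode(weights, states, sigma=0.1, rng=rng):
        print(action)
```

Because the noise is drawn once per episode, the same state always produces the same action within that episode (the first and third states in the usage example), which gives more consistent exploration than adding fresh noise to every action.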

This year another “couple” were added:

A Distributional Perspective on Reinforcement Learning - a new look at predicting possible rewards. Instead of simply predicting the mean, the whole distribution of the future reward is estimated.
Distributional Reinforcement Learning with Quantile Regression - a continuation of the previous work, but with the distribution represented by quantiles.
Distributed Prioritized Experience Replay - a work from the deep reinforcement learning at scale direction. It covers how to organize the experiment architecture to make maximum use of available resources and speed up agent training.
Distributed Distributional Deterministic Policy Gradients - combines the three previous papers for tasks with a continuous action space.
Addressing Function Approximation Error in Actor-Critic Methods - an excellent work on increasing the robustness of RL agents. I recommend reading it.
Data-Efficient Hierarchical Reinforcement Learning - a development of a previous paper on hierarchical reinforcement learning (HRL).
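The quantile-regression idea from the distributional papers in the list above can be sketched as a loss function. The critic predicts N quantiles of the future reward instead of a single mean, and each prediction is pulled towards the target samples with an asymmetric Huber penalty. This is a plain-Python illustration under my own naming (`quantile_huber_loss`, `kappa`), not an excerpt from any library or from our solution.

```python
def huber(u, kappa=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Sum over predicted quantiles of the mean asymmetric Huber loss to the targets."""
    n = len(pred_quantiles)
    loss = 0.0
    for i, theta in enumerate(pred_quantiles):
        tau = (2 * i + 1) / (2 * n)            # midpoint of the i-th quantile bin
        for z in target_samples:
            u = z - theta                       # positive if the prediction is too low
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            loss += weight * huber(u, kappa) / len(target_samples)
    return loss


if __name__ == "__main__":
    targets = [0.0, 1.0, 2.0, 3.0]
    print(quantile_huber_loss([0.5, 2.5], targets))   # spread-out predictions
    print(quantile_huber_loss([1.5, 1.5], targets))   # both predictions at the mean
```

The asymmetric weight is what makes each output settle on its own quantile: a high quantile is penalized more for being too low than for being too high, and a low quantile the other way round.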

Additional reading

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor - the authors proposed a method for training stochastic policies in off-policy reinforcement learning. Thanks to this paper it became possible to train non-deterministic policies even in tasks with a continuous action space.
Latent Space Policies for Hierarchical Reinforcement Learning - a continuation of the previous paper in the area of HRL, with multi-level stochastic policies.
Diversity is All You Need: Learning Skills without a Reward Function - the paper describes an approach that trains a set of random low-level stochastic policies without any reward from the environment. Later, once a reward function is given, the policies that correlate best with the reward can be used to train a high-level policy on top.
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review - a survey of all kinds of maximum entropy reinforcement learning methods by Sergey Levine.

I also recommend the OpenAI collection of reinforcement learning papers and its Mendeley version. And if reinforcement learning has caught your interest, join RL club and RL papers.

Practice

Knowing the theory alone is not enough: it is important to be able to implement all these approaches in practice and to set up a proper validation system for evaluating solutions. For example, this year we found out that our agent handles some edge cases poorly only 2 days before the end of the competition. Because of this, we did not have time to fully fix our model and fell literally a few points short of the coveted second place. If we had found it at least a week earlier, the result could have been better.

Coolstory: episode III

The final score of a solution was the reward averaged over 10 test episodes.

The chart shows the test results for our agent: in 9 of the 10 episodes our skeleton did just great (mean 9955.66), but one episode... episode 3 was beyond it (reward 9870). This one failure dragged the final score down to 9947 (-8 points).
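The effect of a single bad episode on an averaged score is easy to verify. The snippet below only redoes the arithmetic with the rounded numbers quoted above (9955.66 for the nine good episodes, 9870 for episode 3); the variable names are mine.

```python
# Values quoted in the text (already rounded there).
mean_of_nine = 9955.66   # average reward over the 9 good episodes
episode_three = 9870.0   # reward in the failed episode

# The final score is the plain average over all 10 test episodes.
final_score = (9 * mean_of_nine + episode_three) / 10
drop = final_score - mean_of_nine

print(f"final score: {final_score:.2f}")
print(f"change caused by one episode: {drop:.2f}")
```

One episode that is less than 1% worse than the rest costs more than 8 points in the average, which at the top of a leaderboard is enough to lose a place.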

Luck

And finally, do not forget about plain luck. Do not think this is a controversial point. On the contrary, a little luck goes a long way when combined with continuous work on yourself: even if the probability of success is only 10%, someone who has tried to take part in a competition 100 times will succeed far more often than someone who tried just once and gave up.

There and back again: last year's solution - third place

Last year, our team (Mikhail Pavlov and I) took part in NeurIPS competitions for the first time, and the main motivation was simply to participate in the first NeurIPS competition in reinforcement learning. I had just completed the Practical RL course at the ShAD and wanted to test my skills. As a result, we took an honorable third place, behind only nnaisense (Schmidhuber) and a university team from China. At that time our solution was “pretty simple” and was based on Distributed DDPG with parameter noise (publication and talk at ML Trainings).

This year's solution - third place

This year there were a couple of changes. First, I did not just want to participate in this competition, I wanted to win it. Second, the team changed too: Alexey Grinchuk, Anton Pechenko, and me. We did not manage to just go and win, but we again took 3rd place.
Our solution will be officially presented at NeurIPS, so for now we will limit ourselves to a few details. Building on last year's solution and this year's successes in off-policy reinforcement learning (the papers above), we added a number of our own developments, which we will describe at NeurIPS, and obtained the Distributed Quantile Ensemble Critic, with which we took third place.

All of our work (the distributed training system, algorithms, etc.) will be published and available in Catalyst.RL after NeurIPS.

Coolstory: big boys - big guns

Our team was confidently heading for 1st place throughout the whole competition. However, the big guys had other plans: 2 weeks before the end of the competition, two major players entered at once, FireWork (Baidu) and nnaisense (Schmidhuber). And while nothing could be done about the Chinese Google, we managed to fight the Schmidhuber team fairly for second place for quite a long time, losing only by a minimal margin. Not bad for amateurs, it seems to me.

What is all this for?

Connections. Top researchers come to the conference, and you can talk to them in person, which no email correspondence can give you.
Publication. If your solution wins a prize, the team is invited to the conference (and maybe more than one) to present the solution and publish a paper.
Job offers and PhD. A publication and a prize at a conference like this significantly increase your chances of getting a position at leading companies such as OpenAI, DeepMind, Google, Facebook, and Microsoft.
Real-world value. NeurIPS competitions are held to solve pressing problems of academia and the real world. You can be sure the results will not be shelved, but will really be in demand and help improve the world.
Drive. Solving competitions like this is ... just interesting. Under competitive conditions you can come up with many new ideas and test different approaches, all to be the best. And let's be honest, when else can you drive skeletons around and play games, all with a serious face and for the sake of science?

Coolstory: visa and RL

I strongly advise against trying to explain to the American officer interviewing you that you are going to a conference because you train virtual skeletons to run in a simulation. Just go to the conference to give a talk.

Takeaways

Taking part in NeurIPS is an experience that is hard to overestimate. Do not be intimidated by the big name: you just need to pull yourself together and start solving.

And do drop by Catalyst.RL while you are at it.
