Deephack: Reinforced Deep Learning Hackathon, or How We Enhanced Google Deepmind

    From July 19 to July 25, the DeepHack hackathon took place, where participants improved a reinforcement learning algorithm based on Google DeepMind's work. The goal of the hackathon was to learn to play the classic Atari games (Space Invaders, Breakout, etc.) better. We want to tell you why this matters and how it went.

    Authors of the article: Ivan Lobov (IvanLobov), Konstantin Kiselev (mKKonstantin), George Ovchinnikov (ovchinnikoff).
    Photos of the event: Maria Molokova, Polytechnic Museum.

    Why a reinforcement learning hackathon is awesome:
    • This is the first hackathon in Russia on deep learning and reinforcement learning;
    • The Google DeepMind algorithm is one of the latest advances in reinforcement learning;
    • If you are interested in artificial intelligence, this topic comes very close to that concept (although we ourselves would hesitate to call it AI).



    Why is reinforcement learning research interesting?


    Let's start with the state of machine learning and what can be solved with it. There are 3 main areas (intentionally simplifying):
    • Supervised learning is any task where the algorithm is trained on examples: given an input, produce the answer. This includes regression and classification. Real-world tasks: estimating real-estate prices, forecasting sales, predicting earthquakes;
    • Unsupervised learning is a task without “answers”, where you need to find patterns in the data, to find “similarity” or “dissimilarity”. Real-world tasks: customer clustering, mining association rules;
    • Reinforcement learning is an intermediate type of task, where learning happens through interaction with some environment. The algorithm (agent) performs actions in the environment and sometimes receives feedback. Many “interesting” human activities fall into this class: sports (at any given second there is no single “right” action, there is only the result - a goal is scored or not), negotiations, the process of scientific research, etc.

    Supervised learning problems have by now been solved for almost every area. The next big step is the part of machine learning where there is no obvious, instant result of interaction, and that is exactly the direction the field is moving in.

    How reinforcement learning algorithms work, using Atari games as an example (intuition)


    There is an agent, and the environment is a game. To the agent, the game looks like a black box: it does not know the rules and does not even realize that this is a game at all. At the start of training, the game gives the agent the set of actions it can perform, and all of these actions look identical to the agent. Then, at each step, the agent performs one of the actions and receives in response the state of the game (state) and points (reward). The state of the game is the picture on the screen. The reward, given for the actions performed, can be positive, negative, or zero. During training, the agent selects actions and tries them in the game, collecting points. The agent's task is to develop a strategy that maximizes the final number of points.
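    The interaction loop above can be sketched in a few lines of Python. This is our own toy illustration, not ALE or DeepMind's code: the class and method names are made up, and the "game" hides a single number instead of a screen image, with a secret rule the agent would have to discover.

```python
import random

class ToyEnvironment:
    """A stand-in for the Atari black box: a hidden rule, a fixed set of
    actions, a state (a step counter instead of a screen) and a reward."""
    def __init__(self):
        self.state = 0

    def actions(self):
        # To the agent, all actions look identical at the start.
        return [0, 1]

    def step(self, action):
        # The hidden rule the agent must discover: action 1 earns a point.
        reward = 1 if action == 1 else 0
        self.state += 1
        return self.state, reward

env = ToyEnvironment()
total_reward = 0
for _ in range(10):
    action = random.choice(env.actions())  # an untrained agent acts randomly
    state, reward = env.step(action)
    total_reward += reward
print(total_reward)
```

Since the untrained agent picks actions at random, the total reward here lands anywhere between 0 and 10; the whole point of training is to push it toward the maximum.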



    In effect, we mimic how a person learns to play a game (very roughly, since we do not actually know exactly how humans learn). There are some differences: an adult already has a lot of experience and a broad set of associative concepts, which lets him grasp the rules of a game at a glance and learn quickly. In our case, the agent's training is more like that of a 2-year-old child: at first it presses the control buttons at random and, gradually picking up the laws and principles of the game, starts to play better and better.

    For training, we need a model with which our algorithm approximates the rules of the agent's interaction with the environment. One common technique, which the DeepMind team successfully applied to Atari games, is Q-learning. In this technique, we model the action-value function Q, which estimates the total reward expected after taking a given action. Then, during play, the agent selects actions by the rule: take the action that maximizes Q.
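    A minimal, tabular sketch of Q-learning (again our own toy illustration, not DeepMind's code): two actions, one implicit state, and a hidden rule that action 1 always pays one point. The learning rate (alpha) and discount factor (gamma) are illustrative choices.

```python
# Tabular Q-learning on a toy problem: two actions, a hidden rule that
# action 1 pays one point. alpha and gamma values are illustrative.
alpha, gamma = 0.5, 0.9
Q = {0: 0.0, 1: 0.0}  # estimated value of each action

def reward(action):
    return 1.0 if action == 1 else 0.0  # the environment's hidden rule

# Training: try both actions in turn and nudge Q[action] toward
# reward + gamma * (best value achievable afterwards).
for step in range(100):
    action = step % 2
    r = reward(action)
    Q[action] += alpha * (r + gamma * max(Q.values()) - Q[action])

# Play: select the action that maximizes the learned Q function.
best_action = max(Q, key=Q.get)
print(best_action)  # → 1
```

In the Atari setting the table is replaced by a function approximator (DeepMind's convolutional network) and the state is the screen image, but the update rule is the same idea.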

    The model can be decision trees, a linear function, neural networks, etc. When we deal with complex, high-dimensional data such as images, convolutional neural networks have proven themselves well. DeepMind's innovation is to combine convolutional networks with Q-learning.

    What will the development of machine learning lead to?


    We can now teach a computer to play the simplest games better than humans, which does not make much of an impression on most people. The next step is to teach the computer to play, say, Doom, i.e., to make it learn in a 3-dimensional environment, and then gradually increase the complexity of the games. The main task is to develop general principles for finding optimal solutions in complex environments and to apply those principles, in the form of algorithms, to the real world. In this way, machines playing games can build an effective representation of their environment and use it to generalize past experience to new situations.



    If we manage to make a computer learn to play, for example, Need For Speed on its own, and play it well, then the resulting algorithms could, with minor modifications, be used to train robots to drive real cars. And not just cars ... This would enable the mass use of robots, from personal assistants to smart self-service systems and a smart urban environment in which machines, under human supervision, independently serve the entire complex urban infrastructure.

    Now it’s clear why Google acquired DeepMind for more than $400 million.

    How was the hackathon


    The organizers took the hackathon seriously: 7 full days with accommodation, a competition for a spot - 5 applicants per place, lecturers from among the top 10 researchers in machine learning, 15 GPU clusters, and 24/7 support for participants on any question. The venue was the MIPT campus in Dolgoprudny.

    The order of the competition:

    What did all this run on? (Software)


    Stella (Atari emulator) -> ALE (Arcade Learning Environment) -> a machine learning framework of the team's choice.

    All solutions were based on Google DeepMind's open-source code and the 2015 article in Nature. The teams solved the problem in one of three ML frameworks: Lua + Torch (the original code is written in it), Python + Theano, or C++ / Caffe. We chose Python + Theano, as we had more experience with it. We cannot unambiguously name the best framework; each had its drawbacks. In general, there is a feeling that the area is still fresh, so there is little proven, well-functioning code. Much has to be rewritten, double-checked, and debugged. We found no significant advantage for any of the frameworks: neither in computation speed (the bottleneck is convolution in cuDNN anyway), nor in convenience of prototyping.

    What did all this run on? (Hardware)


    For computation, each team was allocated a machine with four GRID K520 GPUs on AWS (g2.8xlarge), so up to four runs could go simultaneously. This was enough to run a number of experiments over the course of the week (a full training run takes about five days on one GPU, so we ran short ten-hour tests instead) and to test our first hypotheses. But it was not enough for a full-fledged study, so some teams are continuing their work after the hackathon. Although the question here is not so much one of hardware as of wall-clock time.

    Examples of how a model trained for 24 hours plays:


    Space Invaders
    Kung Fu Master

    How it was (photo)


    Still fresh, discussing ideas:



    The first (?) night, right in the coworking space:



    Despite the fatigue, everyone listens to the lectures:



    The last days before the finals. A few days with almost no sleep. Only the most persistent are still working:



    The finals at VDNKh. Everything has already been decided; all that remains is to root for the trained models:



    Participants in the hackathon:



    Instead of a conclusion


    We all come from very different backgrounds - advertising, IT, science. However, one thing unites us: we see the future in the development of machine learning, and we understand that very little is happening with it in Russia:
    • Educational institutions: only a handful have professors who publish anything;
    • Companies that actually use it: the fingers of two hands are enough to count them, and three fingers to count those who use deep learning at scale;
    • There are hundreds of specialists, maybe 1000 (?), who can put together a convolutional network.

    At the same time, all the knowledge, software, and even hardware for deep learning are available to any student, and the barrier to entry, in terms of effort required, is not prohibitively high.

    Where we would like to move:
    • Popularizing the possibilities of ML among the general public;
    • Popularizing ML and deep learning among students;
    • Promoting the use of ML and deep learning in business.

    If you have ideas or suggestions on how to do this, write in the comments.

    P.S. Many thanks to all the organizers, and especially to Yevgeny Botvinovsky, Sergey Plis, Mikhail Burtsev, Andrey Pakosh, Elizaveta Chernyagina, Vitaly Lvovich Dunin-Barkovsky, Valeria Tsveloy, and others.
