Teaching an agent to play Mario Kart using filters

Published on November 12, 2018



    Vladimir Ivanov (vivanov879), Sr. Deep Learning Engineer at NVIDIA, continues his series on reinforcement learning. In this article, we discuss training an agent to complete quests and how neural networks use filters for image recognition.

    The previous article covered training agents for simple shooters.

    Vladimir will talk about applying reinforcement learning in practice at the AI Conference on November 22.

    Last time, we examined video games where reinforcement learning solves the problem. Curiously, the neural network needed only visual information to play successfully: every fourth frame, it analyzes the screenshot and makes a decision.
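This frame-skipping scheme can be sketched as follows. The sketch is illustrative (the function names are assumptions, not the article's code): the agent runs its network only on every fourth frame and repeats the chosen action in between.

```python
# Illustrative frame-skip sketch: decide on every SKIP-th frame,
# repeat the last action for the frames in between.
SKIP = 4

def run_episode(frames, choose_action):
    """Replay a list of frames, invoking the policy only every SKIP-th frame."""
    actions = []
    last_action = None
    for i, frame in enumerate(frames):
        if i % SKIP == 0:
            last_action = choose_action(frame)  # network forward pass
        actions.append(last_action)
    return actions

# Toy usage: a "policy" that simply echoes the frame id it was shown.
acts = run_episode(list(range(8)), lambda f: f)
```

Repeating actions this way cuts the number of forward passes fourfold while barely affecting play, since consecutive frames are nearly identical.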


    At first glance, it looks like magic: a complex structure, a neural network, receives a picture as input and produces the correct decision. Let's walk through what happens inside, stage by stage: what turns a set of pixels into an action?

    Before we move to the computer, let's look at what a person sees.
    When a person looks at an image, the eye catches both small details (faces, human figures, trees) and the picture as a whole. Whether it is children playing in an alley or a football match, a person's life experience lets them grasp the content of the picture, its mood, and the context in which the shot was taken.



    When we admire a master's work in an art gallery, our life experience still tells us that characters hide behind the layers of paint. We can guess their intentions and the movement within the painting.



    In the case of abstract painting, the eye finds simple shapes in the image: circles, triangles, squares. These are much easier to spot. Sometimes that is all one manages to make out.



    Objects can be arranged so that the picture takes on an unexpected connotation.


    That is, we perceive a picture as a whole, abstracting away from its specific components. Unlike us, a computer does not initially have this ability. We have a wealth of life experience that tells us which objects matter and what physical properties they have. Let's think about what tool to give the machine so that it can study images.

    Many happy owners of phones with high-quality cameras apply various filters to a photo before posting it to a social network. With a filter, you can change the mood of the photo or make certain objects stand out more clearly.


    In addition, a filter can highlight the edges of objects in the photo.


    Since filters can highlight different objects in an image, let's give the computer the ability to pick them itself. What is a digital image? It is a rectangular matrix of numbers, where each point holds intensity values for three color channels: red, green, and blue. Now we give the neural network, say, 32 filters. Each filter is applied to the image in turn, its kernel sliding over neighboring pixels.
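The sliding-kernel operation can be written in a few lines. Here is a minimal sketch (assumed for illustration, not the article's code) of "valid" 2D convolution on a single-channel image:

```python
# Minimal sketch: slide a kernel over a single-channel image, producing
# one output per position where the kernel fully fits ("valid" mode).
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            # Weighted sum of the kernel over the pixel neighbourhood.
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A simple vertical-edge kernel applied to an image whose right half is bright.
img = [[0, 0, 0, 9, 9]] * 4
edge = [[-1, 0, 1]] * 3
result = conv2d(img, edge)
```

The output is large exactly where brightness changes from left to right, which is how a filter "highlights" edges.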


    Initially, the kernel values of each filter are random, but we give the neural network the ability to tune them depending on the task. After the first layer of filters, we can stack several more. Since there are many filters, we need a lot of data to tune them; any large bank of labeled images will do, for example, the MS COCO dataset.
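The random initialization described above might look like this. This is a hedged sketch: the filter count and value range are illustrative assumptions, and real frameworks handle initialization for you.

```python
import random

# Illustrative sketch: a bank of 32 filters whose 3x3 kernels start as
# small random numbers; training would then adjust these values.
random.seed(0)  # fixed seed so the sketch is reproducible

def init_filter_bank(n_filters=32, k=3):
    return [[[random.uniform(-0.1, 0.1) for _ in range(k)]
             for _ in range(k)]
            for _ in range(n_filters)]

bank = init_filter_bank()
```

Stacking a second layer simply means treating the 32 outputs of the first layer as a 32-channel "image" that the next bank of filters reads.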



    The neural network adjusts its weights to solve the task at hand. In our case, that task is image segmentation: determining the class of each pixel. Now let's see how the images look after each layer of filters.
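Per-pixel classification can be illustrated as follows. This is an assumed sketch of the final step only: the network's last layer produces a score map per class, and each pixel gets the class with the highest score.

```python
# Illustrative sketch: segmentation assigns a class to every pixel by
# taking the argmax over per-class score maps from the last layer.
def segment(score_maps):
    """score_maps[c][i][j] = score of class c at pixel (i, j)."""
    n_classes = len(score_maps)
    h, w = len(score_maps[0]), len(score_maps[0][0])
    return [[max(range(n_classes), key=lambda c: score_maps[c][i][j])
             for j in range(w)]
            for i in range(h)]

# Two classes ("background" = 0, "car" = 1) over a tiny 2x2 image.
labels = segment([
    [[0.9, 0.1], [0.2, 0.8]],   # background scores
    [[0.1, 0.9], [0.8, 0.2]],   # car scores
])
```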



    If you look closely, you will notice that the filters, to one degree or another, keep the car and clean out the surroundings: the road, trees, and sky.

    Let's return to the agent that is learning to play games. As an example, take the racing game Mario Kart.


    We gave it a powerful image-analysis tool, a neural network. Let's see which filters it selects in order to learn how to drive. To start, take an open area.


    Let's see how the image looks after the first 24 filters. Here they are arranged as an 8x3 table.
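Arranging the 24 filter outputs into that table is just a matter of chunking the list. A small sketch (the 8-column layout matching the figure is an assumption about how it was plotted):

```python
# Illustrative sketch: lay out 24 filter outputs as rows of 8, giving
# the 8x3 grid described in the article.
def as_grid(outputs, cols=8):
    return [outputs[i:i + cols] for i in range(0, len(outputs), cols)]

# Stand-in names for the 24 feature maps.
grid = as_grid([f"filter_{i}" for i in range(24)])
```

With this layout, "the second filter in the first row" referenced below is `grid[0][1]`, and "the first filter of the seventh row" would come from a larger grid over all 32 outputs.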


    It is not at all necessary that each of the 24 outputs has an obvious meaning, because the images continue on to the inputs of the following filters, and the dependencies can be completely different. Still, in this case some logic can be detected in the outputs. For example, the second filter in the first row highlights the road in black, and the first filter in the seventh row duplicates that function. In most of the other filters, the karts we control are clearly visible.

    In this game the surroundings change, and there is a tunnel. What does the racing neural network pay attention to when entering the tunnel?


    Outputs of the first filter layer:


    In the sixth row, the first filter marks the entrance to the tunnel. So over the course of the ride, the network learned to identify tunnels.

    And what happens when the kart enters the tunnel?


    The output of the first 24 filters:


    Even though both the lighting of the scene and the surroundings have changed, the neural network picks out what matters most: the road and the karts. Again, the second filter in the first row, which was responsible for finding the road in the open area, keeps its function in the tunnel. Likewise, the first filter of the seventh row still finds the road.

    Now that we have figured out what the neural network sees, let's try to use it to solve harder problems. So far we have considered tasks with practically no need to plan ahead: you only need to solve the problem right in front of you. In shooters and races, you act "reflexively", reacting quickly to sudden changes in the game. But what about completing a game quest? Take, for example, Montezuma's Revenge, in which you need to find keys and open locked doors to get out of the pyramid.



    Last time, we discussed why the agent did not learn to look for new keys and doors: those actions take a lot of game time, so the reward signal in the form of points is very rare. If you use points for defeated enemies as the agent's reward, it will endlessly knock out the rolling skulls and never look for new paths.

    Let's reward the agent for opening new rooms. We will use the a priori known fact that this is a quest, and all the rooms in it are different.



    Therefore, if the picture on the screen is fundamentally different from anything we have seen before, the agent receives a reward.
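One way to implement such a novelty bonus is to keep a memory of screens seen so far and reward the agent when the current screen is far from all of them. This is a simplified sketch under assumed names and a toy distance measure; real curiosity-driven methods typically compare learned features rather than raw pixels.

```python
# Hypothetical novelty bonus: reward the agent when the current screen
# differs strongly from every screen stored in memory.
def novelty_reward(frame, memory, threshold=50):
    """frame: flat list of pixel intensities; memory: list of past frames."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    if all(dist(frame, m) > threshold for m in memory):
        memory.append(frame)   # remember the newly discovered room
        return 1.0             # bonus for reaching it
    return 0.0

memory = [[0] * 16]                           # one known (dark) screen
r_new = novelty_reward([9] * 16, memory)      # a very different screen
r_old = novelty_reward([9] * 16, memory)      # the same screen again
```

The first visit to a new room pays out; revisiting it does not, so the agent is pushed to explore.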

    So far, we have considered game agents that rely solely on visual data during training. But if we have access to other data from the game, we will use it too. Consider, for example, the game Dota. Here, the network receives twenty thousand numbers as input, which fully describe the state of the game: for example, the positions of allies and the health of the towers.



    Players are divided into two teams of five. A game lasts 40 minutes on average. Each player chooses a hero with unique abilities, and each player can buy items that change damage, speed, and field-of-view parameters.

    Although at first glance the game is very different from Doom, the learning process remains the same, apart from a few points. Since the planning horizon in this game is longer than in Doom, we process the last 16 frames when making a decision. And the reward signal the agent receives is somewhat more complex: it includes the number of defeated enemies, the damage dealt, and the money earned in the game. To make the neural networks play as a team, we include the well-being of the agent's teammates in the reward.
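The composite reward described above could be sketched like this. The weights here are illustrative assumptions, not the actual coefficients used in training; the `team_weight` term blends individual and team outcomes in the spirit described in the text.

```python
# Hedged sketch of a composite Dota-style reward: kills, damage, and
# gold for the individual, blended with a team well-being term.
# All coefficients are illustrative assumptions.
def reward(kills, damage, gold, team_health, team_weight=0.3):
    individual = 1.0 * kills + 0.01 * damage + 0.005 * gold
    return (1 - team_weight) * individual + team_weight * team_health

r = reward(kills=2, damage=300, gold=400, team_health=5.0)
```

Raising `team_weight` makes each agent care more about its allies' state than its own statistics, which is what pushes the five networks to cooperate.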

    As a result, the team of bots beats fairly strong human teams but loses to the champions. The reason for the defeat is that the bots rarely played hour-long matches: games with real people dragged on longer than those in the simulator. In other words, when an agent finds itself in a situation it has not trained for, it starts to struggle.