A new implementation of curiosity in AI: training with a reward that depends on how hard the outcome is to predict

Original author: Karl Cobbe, Alex Nichol, Joshua Achiam, Phillip Isola, Alex Ray, Jonas Schneider, Jack Clark, Greg Brockman, Ilya Sutskever, Ben Barry, Amos Storkey, Alexei Efros, Deepak Pathak, Trevor Darrell, Andrew Brock, Antreas Antoniou , Stanislaw Jastrzebski, Ashle

Progress in the game Montezuma's Revenge has been considered by many to be synonymous with advances in the exploration of unfamiliar environments.

We have developed Random Network Distillation (RND), a prediction-based method that encourages reinforcement learning agents to explore their environments through curiosity. For the first time, this method exceeds average human performance on the computer game Montezuma's Revenge (except for an anonymous ICLR submission whose result is worse than ours). RND achieves state-of-the-art performance, periodically finds all 24 rooms, and passes the first level without demonstrations and without access to the underlying state of the game.

RND incentivizes the agent to visit unfamiliar states by measuring how hard it is to predict the output of a fixed random neural network applied to the state data. If a state is unfamiliar, the output is hard to predict, and the reward is therefore high. The method can be applied to any reinforcement learning algorithm; it is simple to implement and scales efficiently. Below is a link to an implementation of RND that reproduces the results from our paper.
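As a rough illustration of the reward described above, the following minimal NumPy sketch computes the squared difference between the outputs of a fixed random network and a predictor network. The network sizes, the two-layer architecture, and all function names are assumptions made for illustration; this is not the released implementation.

```python
import numpy as np

def make_network(rng, in_dim, hidden_dim, out_dim):
    """Random weights for a small two-layer network (hypothetical sizes)."""
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, hidden_dim)),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), size=(hidden_dim, out_dim)),
    }

def forward(params, obs):
    """Apply the network to a batch of observations."""
    hidden = np.maximum(obs @ params["w1"], 0.0)  # ReLU
    return hidden @ params["w2"]

def intrinsic_reward(target, predictor, obs):
    """Mean squared error between predictor and target outputs, per observation."""
    diff = forward(predictor, obs) - forward(target, obs)
    return np.mean(diff ** 2, axis=-1)

rng = np.random.default_rng(0)
target = make_network(rng, in_dim=8, hidden_dim=16, out_dim=4)     # fixed, never trained
predictor = make_network(rng, in_dim=8, hidden_dim=16, out_dim=4)  # would be trained on visited states
batch = rng.normal(size=(5, 8))                                    # stand-in for observations
rewards = intrinsic_reward(target, predictor, batch)
```

In a full agent the predictor would be trained to reduce this error on the states the agent visits, so the reward shrinks for familiar states and stays high for unfamiliar ones.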

Text of the paper, code

Results on Montezuma's Revenge

To achieve a desired goal, the agent must first explore what actions are possible in its environment and what constitutes progress towards the goal. The reward signals of many games provide a curriculum, so that even simple exploration strategies are enough to achieve the goal. In the seminal work that introduced DQN, Montezuma's Revenge was the only game where DQN scored 0% of the average human score (4,700). Simple exploration strategies are unlikely to collect any rewards or to find more than a few rooms in the level. Since then, progress on Montezuma's Revenge has been widely regarded as synonymous with advances in the exploration of unfamiliar environments.

Significant progress was made in 2016 by combining DQN with a count-based bonus, which produced an agent that found 15 rooms and reached a top score of 6,600 with an average of about 3,700. Since then, significant improvements in the score have been achieved only with demonstrations from human experts or with access to the underlying state of the emulator.

We ran a large-scale RND experiment with 1024 workers, obtaining a mean return of 10,000 over 9 runs and a best mean return of 14,500. In each run, the agent found 20 to 22 rooms. In addition, one run (out of 10) of a smaller-scale but longer experiment reached a best return of 17,500, which corresponds to passing the first level and finding all 24 rooms. The graph below compares these two experiments, showing the mean return as a function of parameter updates.

The visualization below shows the progress of the smaller-scale experiment. Driven by curiosity, the agent discovers new rooms and finds ways to score points. During training, this extrinsic reward drives it to return to those rooms later.

Rooms discovered by the agent and mean return during training. The opacity of a room corresponds to how many of the 10 runs discovered it. Video

Large-scale study of curiosity-driven learning

Prior to developing RND, we worked with collaborators from the University of California, Berkeley to investigate learning without any environment-specific rewards. Curiosity offers an easier way to teach agents to interact with any environment than a reward function engineered for a specific task, which is not even guaranteed to correspond to solving that task. Projects like ALE, Universe, Malmo, Gym, Gym Retro, Unity, DeepMind Lab, and CommAI open up a large number of simulated environments to the agent through a standardized interface. An agent using a generic reward function that is not specific to any particular environment can acquire a basic level of competence in a wide range of environments. This allows it to discover useful behavior even in the absence of carefully engineered rewards.

Text of the paper, code

In the standard reinforcement learning setup, at every discrete time step the agent sends an action to the environment, and the environment responds with a new observation, a transition reward, and an end-of-episode indicator. In our previous paper, we configured the environment to output only the next observation. There, the agent learns a next-state predictor model from its experience and uses the prediction error as an intrinsic reward. As a result, it is attracted to the unpredictable. For example, a change in the game score is rewarding only if the score is displayed on the screen and the change is hard to predict. The agent typically finds interactions with new objects rewarding, since the outcomes of such interactions are usually harder to predict than other aspects of the environment.

Like other researchers, we tried to avoid modeling every aspect of the environment, relevant or not, by choosing which features of the observation to model. Surprisingly, we found that even random features work well.
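The next-state prediction reward described above can be sketched as follows: a fixed random embedding maps observations to features, a forward model predicts the features of the next observation, and the squared prediction error serves as the reward. The linear forward model, the dimensions, the learning rate, and all names are assumptions chosen to keep the example short; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, N_ACTIONS = 6, 4, 3  # hypothetical sizes

# Fixed random feature embedding; it is never trained.
embed = rng.normal(size=(OBS_DIM, FEAT_DIM)) / np.sqrt(OBS_DIM)

def features(obs):
    """Random features of an observation."""
    return np.tanh(obs @ embed)

def one_hot(action):
    vec = np.zeros(N_ACTIONS)
    vec[action] = 1.0
    return vec

# Linear forward model: predicts next-state features from [features, action].
model = np.zeros((FEAT_DIM + N_ACTIONS, FEAT_DIM))

def curiosity_step(model, obs, action, next_obs, lr=0.1):
    """Return (reward, updated_model) for one observed transition."""
    inputs = np.concatenate([features(obs), one_hot(action)])
    error = inputs @ model - features(next_obs)
    reward = float(np.sum(error ** 2))                    # prediction error as reward
    updated = model - lr * 2.0 * np.outer(inputs, error)  # one gradient step
    return reward, updated
```

Calling `curiosity_step` repeatedly on the same transition yields a shrinking reward, which mirrors the idea that well-predicted transitions stop being interesting.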

What do curious agents do?

We tested our agent in more than 50 different environments and observed a range of competence levels, from seemingly random actions to deliberate interaction with the environment. To our surprise, in some cases the agent managed to beat the game even though it was never told the goal through an extrinsic reward.

Intrinsic reward at the beginning of training

Spike in intrinsic reward when the level is first passed

Breakout - the intrinsic reward spikes when the agent sees a new configuration of bricks early in training and when it passes the level for the first time after several hours of training.

Pong - we trained the agent to control both paddles simultaneously, and it learned to keep the ball in play, which led to long rallies. Even when trained against the in-game AI, the agent tried to prolong the game as much as possible rather than win.

Bowling - the agent learned to play the game better than agents trained to maximize the extrinsic reward directly. We think this happens because the agent is attracted to the hard-to-predict flashing of the scoreboard after each throw.

Mario - the intrinsic reward is particularly well aligned with the goal of the game: advancing through the levels. The agent is rewarded for finding new areas, since the details of a newly found area cannot be predicted. As a result, the agent discovered 11 levels, found secret rooms, and even defeated bosses.

The noisy-TV problem

Like a gambler at a slot machine who is attracted by chance outcomes, the agent sometimes falls into the trap of its own curiosity as a result of the noisy-TV problem. The agent finds a source of randomness in the environment and keeps observing it, always receiving a high intrinsic reward for such transitions. Watching a TV that plays static noise is an example of such a trap. We demonstrate this literally by placing the agent in a Unity maze with a TV that plays random channels.
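A toy calculation may help show why a source of randomness keeps the reward high. The sketch below assumes frames that are independent of everything the agent has seen, so the best possible prediction is the per-pixel mean; the frame sizes and the choice of random binary pixels are illustrative assumptions, not a model of the Unity experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "noisy TV": every pixel is an independent coin flip on every frame.
noisy_frames = rng.integers(0, 2, size=(10_000, 16)).astype(float)
# Hypothetical static scene: every frame is identical.
static_frames = np.ones((10_000, 16))

def best_constant_error(frames):
    """Mean squared error of the best history-independent prediction (the mean frame)."""
    prediction = frames.mean(axis=0)
    return float(np.mean((frames - prediction) ** 2))

noisy_error = best_constant_error(noisy_frames)    # stays near the pixel variance
static_error = best_constant_error(static_frames)  # drops to zero
```

However long the agent trains, the error on the noisy frames cannot fall below the variance of the pixels, so a reward based on that error never fades.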

Agent in a maze with noisy TV

Agent in a maze without a noisy TV

In theory the noisy-TV problem is a serious concern, but we still expected that in largely deterministic environments like Montezuma's Revenge, curiosity would drive the agent to discover rooms and interact with objects. We tried several variants of curiosity based on next-state prediction, combining the exploration bonus with the score from the game.

In these experiments, the agent controls the environment through a noisy controller that, with some probability, repeats the last action instead of the current one. This setup with sticky actions has been proposed as a best practice for training agents on fully deterministic games such as Atari, to prevent memorization. Sticky actions make the transition from room to room unpredictable.
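The noisy controller described above could be sketched as a small action filter. The class name, its methods, and the default repeat probability of 0.25 are assumptions made for illustration; the text only says that the last action is repeated "with some probability".

```python
import random

class StickyActions:
    """With probability `repeat_prob`, replay the previous action instead of the new one."""

    def __init__(self, repeat_prob=0.25, seed=None):
        self.repeat_prob = repeat_prob  # assumed value, not taken from the article
        self.rng = random.Random(seed)
        self.last_action = None

    def reset(self):
        """Forget the previous action at the start of an episode."""
        self.last_action = None

    def filter(self, action):
        """Return the action that is actually sent to the environment."""
        if self.last_action is not None and self.rng.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return action
```

An agent would pass each chosen action through `filter` before stepping the emulator, so the same policy no longer produces the same trajectory on every run.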

Random Network Distillation

Since next-state prediction is inherently susceptible to the noisy-TV problem, we identified the following relevant sources of prediction error:

  • Factor 1. The prediction error is high when the predictor fails to generalize from previously seen examples. Novel experience then corresponds to a high prediction error.
  • Factor 2. The prediction error is high because the prediction target is stochastic.
  • Factor 3. The prediction error is high because information needed for the prediction is missing, or because the class of predictor models is too limited to fit the complexity of the target function.

We determined that Factor 1 is a useful source of error, since it quantifies the novelty of experience, whereas Factors 2 and 3 cause the noisy-TV problem. To avoid Factors 2 and 3, we developed RND, a new exploration bonus based on predicting the output of a fixed, randomly initialized neural network on the next state, given the next state itself.

The intuition is that predictive models have low error on states similar to those they were trained on. In particular, the agent's predictions of the output of a randomly initialized neural network will be less accurate in novel states than in states the agent has visited often. The advantage of using a synthetic prediction problem is that it can be made deterministic (bypassing Factor 2), and that the predictor can be chosen from the same function class, with the same architecture, as the target network (bypassing Factor 3). This frees RND from the noisy-TV problem.
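The intuition above can be checked on a deliberately simplified toy problem. In the sketch below both networks are single linear layers (an assumption made so the outcome is easy to reason about), the familiar states use only the first four coordinates, and the novel states use only the last four. The sizes, learning rate, and step count are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OUT_DIM = 8, 4  # hypothetical sizes

# Familiar states use only the first four coordinates, novel states only the last four.
familiar = np.zeros((64, STATE_DIM))
familiar[:, :4] = rng.normal(size=(64, 4))
novel = np.zeros((64, STATE_DIM))
novel[:, 4:] = rng.normal(size=(64, 4))

target = rng.normal(size=(STATE_DIM, OUT_DIM))     # fixed random network (linear toy)
predictor = rng.normal(size=(STATE_DIM, OUT_DIM))  # same architecture, trainable

def prediction_error(states, predictor, target):
    """Mean squared distance between predictor and target outputs."""
    diff = states @ predictor - states @ target
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Distill the target into the predictor using the familiar states only.
for _ in range(500):
    residual = familiar @ predictor - familiar @ target
    gradient = 2.0 * familiar.T @ residual / len(familiar)
    predictor -= 0.1 * gradient

familiar_error = prediction_error(familiar, predictor, target)
novel_error = prediction_error(novel, predictor, target)
```

Because the target is deterministic and the predictor shares its architecture, the error on familiar states can be driven close to zero, while the error on the novel states is untouched by training and remains large.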

We combined the exploration bonus with the extrinsic rewards through a variant of Proximal Policy Optimization (PPO) that uses two value heads for the two reward streams. This allows us to use different discount rates for the different rewards and to combine episodic and non-episodic returns. Thanks to this additional flexibility, our best agent often finds 22 of the 24 rooms on the first level of Montezuma's Revenge, and occasionally passes the first level after finding the remaining two rooms. The same method achieves state-of-the-art performance on Venture and Gravitar.
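To make the idea of two reward streams concrete, the sketch below computes discounted returns separately for each stream: the extrinsic return is reset at episode boundaries, while the intrinsic return carries across them. The discount rates, the mixing coefficients, and the example rollout are assumptions for illustration. A full PPO implementation would also subtract a per-stream value baseline and combine advantages rather than raw returns, which is omitted here.

```python
import numpy as np

def discounted_returns(rewards, dones, gamma, episodic):
    """Backward pass over one rollout.

    If `episodic` is True the return is reset wherever `dones` marks the end
    of an episode; otherwise it carries across episode boundaries.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        if episodic and dones[t]:
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical rollout: two short episodes, each ending with done=True.
ext_rewards = np.array([0.0, 1.0, 0.0, 1.0])
int_rewards = np.array([0.3, 0.1, 0.4, 0.2])
dones = np.array([False, True, False, True])

# Assumed settings, not taken from the article.
GAMMA_EXT, GAMMA_INT = 0.999, 0.99
EXT_COEF, INT_COEF = 2.0, 1.0

ext_returns = discounted_returns(ext_rewards, dones, GAMMA_EXT, episodic=True)
int_returns = discounted_returns(int_rewards, dones, GAMMA_INT, episodic=False)
combined = EXT_COEF * ext_returns + INT_COEF * int_returns
```

Keeping the streams separate until the final sum is what makes it possible to give each one its own discount rate and its own treatment of episode ends.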

The visualization below shows a graph of the intrinsic reward over the course of a Montezuma's Revenge episode in which the agent finds the torch for the first time.

A competent implementation matters

Big-picture considerations, such as susceptibility to the noisy-TV problem, are important when choosing a good algorithm. However, we found that seemingly tiny changes to our simple algorithm strongly affect its performance: the difference between an agent that cannot leave the first room and an agent that passes the first level. To stabilize training, we avoided saturating the features and brought the intrinsic rewards into a predictable range. We also noticed significant improvements in the performance of RND every time we found and fixed a bug (our favorite involved accidentally zeroing an array, which caused extrinsic returns to be treated as non-episodic; we realized this only after puzzling over the extrinsic value function, which looked suspiciously periodic). Getting these details right was an important part of achieving high performance, even with algorithms that are conceptually similar to prior work. This is one reason to prefer simple algorithms where possible.
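One plausible way to bring intrinsic rewards into a predictable range and to keep features from saturating is sketched below: rewards are divided by a running estimate of their standard deviation, and normalized observations are clipped. The class, the function names, and the clipping limit of 5 are assumptions for illustration and are not confirmed by the text.

```python
import numpy as np

class RunningStd:
    """Running mean and standard deviation over a stream of batches."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64).ravel()
        n = batch.size
        if n == 0:
            return
        b_mean = batch.mean()
        b_m2 = np.sum((batch - b_mean) ** 2)
        delta = b_mean - self.mean
        total = self.count + n
        self.mean += delta * n / total
        self.m2 += b_m2 + delta ** 2 * self.count * n / total
        self.count = total

    @property
    def std(self):
        return np.sqrt(self.m2 / self.count) if self.count > 0 else 1.0

def normalize_intrinsic(rewards, stats, eps=1e-8):
    """Update the running statistics, then scale rewards by the running std."""
    stats.update(rewards)
    return np.asarray(rewards, dtype=np.float64) / (stats.std + eps)

def clip_observation(obs, mean, std, limit=5.0):
    """Whiten an observation and clip it so that network inputs stay bounded."""
    return np.clip((obs - mean) / std, -limit, limit)
```

Scaling by a running statistic keeps the size of the bonus comparable across environments and across stages of training, which makes a single set of hyperparameters more likely to transfer.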

Future work

We suggest the following directions for further research:

  • Analyzing the benefits of different exploration methods and finding new ways of combining them.
  • Training a curious agent on many different environments without reward and investigating transfer to a target environment with rewards.
  • Global exploration that involves coordinated decisions over long time horizons.
