Reinforcement learning never worked

Original author: Himanshu Sahni
  • Translation
TL;DR: Reinforcement learning (RL) has always been difficult. Don't worry if standard deep learning techniques don't solve it.

In his article, Alex Irpan laid out many of today's problems with deep RL. But most of them are not new - they have always existed. In fact, they are the fundamental problems that have underlain RL since its inception.

In this article, I hope to convey two points:

  1. Most of the problems Alex describes boil down to two core issues of RL.
  2. Neural networks help solve only a small part of these problems, while creating new ones of their own.

Note: this article in no way refutes Alex's claims. On the contrary, I support most of his conclusions and believe that researchers should communicate the existing limitations of RL more clearly.

Two main problems of RL


At the highest level, reinforcement learning means maximizing some form of long-term return on actions taken in an environment. There are two fundamental difficulties in solving RL problems: the exploration-vs-exploitation trade-off and long-term credit assignment.

As noted on the very first page of Sutton and Barto's book on reinforcement learning, these problems are unique to reinforcement learning.

There are related flavors of these main RL problems that bring their own scary monsters, such as partial observability, multi-agent environments, learning with and from humans, and so on. We will set all of that aside for now.


The constant state of an RL researcher. [Image caption: "This is fine"]

Supervised learning, on the other hand, deals with the problem of generalization: assigning labels to unseen data, given that we already have a pile of labeled data. Some parts of the fundamental problems of RL can be solved by good generalization. If the model generalizes well to unseen states, then such extensive exploration is not required. This is where deep learning usually comes in.

As we will see, reinforcement learning is a different and fundamentally harder problem than supervised learning. It is no surprise that an extremely successful supervised learning method such as deep learning does not fully solve it. In fact, while deep learning improves generalization, it brings its own demons.

What is really strange is the surprise at the current limitations of RL. The inability of DQN to handle long horizons, or the millions of environment steps needed for training - there is nothing new here, and it is not some mysterious quirk of deep reinforcement learning. It all follows from the very nature of the problem, and always has.

Let's take a closer look at these two fundamental problems, and it will become clear that there is nothing surprising about reinforcement learning not working yet.

Exploration versus exploitation


Sample inefficiency, reproducibility, and escaping local optima.

From the very beginning, every agent must learn to answer the question: should it keep following a strategy that gives good results, or take relatively suboptimal actions that might increase the payoff in the future? The question is hard because there is no single correct answer - there is always a trade-off.

Good start


The Bellman equations guarantee convergence to the optimal value function only if every state is visited an infinite number of times and every action is tried infinitely often in each of them. So right from the start we need an infinite number of training samples, and we need them everywhere!
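
To make that requirement concrete, here is a minimal value-iteration sketch of my own (the tabular MDP format `P[s][a] = [(prob, next_state, reward), ...]` is just an assumption for illustration); note that every sweep has to touch every state and every action, over and over:

```python
def value_iteration(P, gamma=0.99, tol=1e-6):
    """P[s][a] -> list of (prob, next_state, reward) tuples.
    Each sweep backs up every state and every action; the convergence
    guarantee relies on repeating those backups indefinitely."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```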

You may ask: "Why fixate on optimality at all?"

Fair enough. In most cases, it is enough to arrive at a reasonably successful strategy fairly quickly without breaking too many things along the way. In practice we are often happy if a good policy can be learned in a finite number of steps (20 million is far less than infinity). But it is hard to pin down these subjective notions without a number to maximize or minimize, and even harder to guarantee anything. More on this later.

So let's agree that we will be happy with an approximately optimal solution (whatever that means). Even so, the number of samples needed to reach the same quality of approximation grows exponentially with the size of the state and action spaces.

But hey, it gets worse


If you make no assumptions, the best exploration strategy is random. You can add heuristics such as curiosity, and in some cases they work well, but so far we have no complete solution. In the end, you have no reason to believe that some action in a given state will bring more or less reward until you try it.
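
In code, exploration without assumptions often comes down to something as blunt as epsilon-greedy, perhaps with a count-based "curiosity" bonus bolted on. The sketch below is my own illustration, not any particular published method:

```python
import random

def choose_action(Q, state, actions, eps=0.1, visit_counts=None, beta=0.0):
    """Epsilon-greedy action selection with an optional count-based
    'curiosity' bonus. With beta=0 this is pure random exploration."""
    if random.random() < eps:
        return random.choice(actions)  # blind random exploration
    def score(a):
        bonus = 0.0
        if visit_counts is not None and beta > 0:
            # give rarely tried actions a small extra pull
            bonus = beta / (1 + visit_counts.get((state, a), 0)) ** 0.5
        return Q.get((state, a), 0.0) + bonus
    return max(actions, key=score)
```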

Moreover, model-free reinforcement learning algorithms usually try to solve the problem in the most general way, making few assumptions about the reward distribution, the environment's transition dynamics, or the form of the optimal policy (for example, see this paper).

And that makes sense. Receiving a large reward once does not mean you will receive it every time you take the same action in that state. The only rational behavior is not to trust any single reward too much, but to slowly update your estimate of how good that action is in that state.

So you make small, conservative updates to functions that try to approximate expectations of arbitrarily complex probability distributions over an arbitrarily large number of states and actions.
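
That slow, conservative updating is exactly what a small learning rate in a one-step temporal-difference update does; a minimal sketch with assumed names:

```python
def td_update(Q, s, a, r, s_next, actions, alpha=0.05, gamma=0.99):
    """Nudge the estimate a small step toward one noisy sampled target,
    rather than trusting any single observed reward outright."""
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```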

But hey, it really gets worse


Let's talk about continuous states and actions.

The world at our scale looks mostly continuous. For RL this is a problem: how do you visit an infinite number of states an infinite number of times and try an infinite number of actions infinitely often in each? Only by generalizing acquired knowledge to unseen states and actions. Supervised learning!

Let me explain a little.

In RL, generalization goes by the name of function approximation. The idea is that states and actions can be fed into a function that computes their values, so you no longer need to store the value of every state and action in a giant table. Fit that function to data - and you are effectively doing supervised learning. Mission accomplished.
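
As a minimal sketch of what function approximation means in practice, here a small neural network maps a state vector to one Q-value per action, replacing the giant lookup table (PyTorch is used purely as an example; the architecture is arbitrary):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action, replacing the
    (state, action) -> value lookup table with a learned function."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# e.g. q_values = QNetwork(state_dim=4, n_actions=2)(torch.randn(1, 4))
```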

Not so fast


Even this is not straightforward in RL.

To begin with, let's not forget that neural networks bring their own considerable sample inefficiency, due to the slow pace of gradient descent.

But hey, the situation is actually even worse


In RL, the data for training the network has to arrive on the fly, during interaction with the environment. As exploration and data collection proceed, the estimate of the value function Q keeps changing.

Unlike supervised learning, here the ground-truth labels are not fixed! Imagine that at the start of ImageNet training you label an image as a cat, but later change your mind and see a dog, a car, a tractor, and so on in it. The only way to get closer to the true target function is to keep exploring.

In fact, even in the training set you never get samples of the true target function, which is the optimal value function or policy. And yet you can still learn! That is a big part of why reinforcement learning is so appealing.

So now we have two very unstable things that must both change slowly to prevent a complete collapse. Rapid exploration can cause sudden shifts in the target landscape that the network is so painstakingly trying to fit. This double blow of exploration plus network training makes the sample complexity far worse than in ordinary supervised learning.
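
One standard trick for keeping these two unstable processes from collapsing is to compute the regression targets from a frozen, periodically synced copy of the network, as DQN-style methods do. A rough sketch (tensor shapes and the surrounding training loop are assumed):

```python
import torch

def dqn_targets(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """The regression 'labels' are not fixed ground truth: they are built from a
    frozen target network and shift whenever that network is re-synced."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    preds = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return preds, targets

def sync_target(q_net, target_net):
    # periodically copy the online network into the frozen target network,
    # holding the moving goalposts still for a while
    target_net.load_state_dict(q_net.state_dict())
```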

Exploration under unstable dynamics also explains why RL is more sensitive to hyperparameters and random seeds than supervised learning. There is no fixed dataset the network is trained on: the training data depends directly on the network's own outputs, the exploration mechanism used, and the randomness of the environment. So the same algorithm on the same environment can produce completely different training sets across runs, which leads to large differences in performance. Again, the key to controlled exploration is seeing similar distributions of states - and the most general algorithms make no assumptions that would ensure this.

But hey! The situation is even ...


For continuous spaces, the most popular methods are on-policy. These methods can only use samples generated by the policy currently being executed. This also means that as soon as you update the current policy, all the experience collected in the past immediately becomes unusable. Most of the algorithms mentioned alongside those strange yellow humanoids and tube-like animals (MuJoCo) are on-policy.


The MuJoCo cheetah - a model made of tubes

Off-policy methods, on the other hand, can learn an optimal policy while observing the execution of any other policy. That is obviously much better, but we are not very good at it yet, unfortunately.
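
The practical difference shows up in what you are allowed to do with old experience. A schematic sketch (the `collect_rollouts`, `collect_transitions`, and `update` helpers are placeholders, not real library functions):

```python
from collections import deque
import random

# On-policy (e.g. a vanilla policy gradient): data must come from the policy
# currently being executed, so after every update the batch is thrown away.
def on_policy_step(policy, env, update):
    batch = collect_rollouts(policy, env)   # placeholder helper
    update(policy, batch)
    # `batch` is now stale: the policy that generated it no longer exists

# Off-policy (e.g. Q-learning): old transitions stay useful, so they are
# kept in a replay buffer and reused many times.
replay_buffer = deque(maxlen=100_000)

def off_policy_step(q_net, env, update, batch_size=64):
    replay_buffer.extend(collect_transitions(q_net, env))  # placeholder helper
    if len(replay_buffer) >= batch_size:
        update(q_net, random.sample(replay_buffer, batch_size))
```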

But hey!


No, actually, that's all for now. It does get worse again, but in the next section.


It's starting to look simple by comparison.

To summarize, these issues all arise from the central problem of reinforcement learning, and more broadly of all AI systems: exploration.

RainbowDQN needs 83 hours of training because it has no prior knowledge of what a video game is: that enemies shoot bullets at you, that bullets are bad, that a bunch of pixels that always move together on the screen is a bullet, that bullets exist in the same world as other objects, that the world is organized according to some principles rather than being just a maximum-entropy distribution. All these priors help us humans sharply narrow exploration down to a small set of high-quality states. DQN has to learn all of this through random exploration. The fact that after training it can beat human masters and surpass centuries of accumulated game wisdom, as in the case of AlphaZero, still seems astonishing.

Long-term credit assignment


Reward functions, their design, and credit assignment

You know how some people only scratch lottery tickets with their lucky coin, because once they did so and won big? RL agents essentially play the lottery at every step, trying to figure out what exactly they did to hit the jackpot. They maximize a single number that results from actions taken over many steps, mixed with a large dose of environmental randomness. Working out which specific actions actually produced the high reward is the credit assignment problem.
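
The standard way to spread that single jackpot back over the whole episode is to credit each step with the discounted sum of everything that followed it; a small sketch:

```python
def discounted_returns(rewards, gamma=0.99):
    """Credit each timestep with the discounted sum of all rewards that
    came after it - a crude answer to 'which action earned this?'"""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# e.g. a sparse jackpot at the very end:
# discounted_returns([0, 0, 0, 0, 1.0])  -> every earlier step gets some credit
```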

You want rewards to be easy to specify. The promise of reinforcement learning is that you simply tell the robot when it has done well, and over time it reliably learns the right behavior. You do not actually need to know the correct behavior yourself, and you do not need to provide supervision at every step.

In practice, the problem is that the time scale of rewards for meaningful tasks is far longer than today's algorithms can handle. The robot operates on a much denser time scale: it must adjust the velocity of every joint every millisecond, while the human rewards it only when it makes a good sandwich. Many events happen between those rewards, and if the gap between an important decision and the reward is too large, any modern algorithm will simply fail.

There are two options. One is to shrink the time scale of the rewards, i.e. hand them out more smoothly and more often. But as usual, if you show an optimization algorithm any weakness, it will exploit it relentlessly. If the reward is not carefully thought out, this leads to reward hacking.
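
A toy illustration of that trade-off (the `state` attributes here are invented for the example): the sparse reward is safe but rarely seen, while the hand-shaped dense reward is easier to learn from and also easier to exploit.

```python
def sparse_reward(state):
    # only the true goal pays out, so the learning signal is rare but honest
    return 1.0 if state.reached_goal else 0.0

def shaped_reward(state, prev_state):
    # denser signal: pay out for getting closer to the goal
    progress = max(0.0, prev_state.distance_to_goal - state.distance_to_goal)
    # careless shaping like this (only positive progress is rewarded) lets the
    # agent farm reward by approaching the goal, backing away, and repeating
    # forever instead of finishing the task - classic reward hacking
    return progress + (1.0 if state.reached_goal else 0.0)
```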



Ultimately, we fall into this trap because we forget that the agent optimizes the entire value landscape, not just the immediate reward. So even if the immediate reward structure looks harmless, the overall landscape can turn out to be unintuitive and full of such exploits if you are not careful.

This raises the question of why rewards are used in the first place. A reward is a way of specifying a goal that lets you harness the machinery of optimization to develop good policies. Reward shaping is a way of injecting more specific domain knowledge from above.

Is there a better way to specify goals? In imitation learning, you can cleverly sidestep the whole RL problem by requesting labels directly from the target distribution, i.e. from the optimal policy. There are other ways to learn without a direct reward, for example by giving agents goals in the form of images (don't miss the ICML workshop on goal specification in RL!).

Another promising way to cope with long horizons (strongly delayed rewards) is hierarchical reinforcement learning. I was surprised that Alex did not mention it in his article, because it is the most intuitive solution to the problem (though I may be biased here!).

Hierarchical RL tries to decompose a long-horizon task into a set of goals and subtasks. Decomposing the problem effectively stretches the time scale at which decisions are made. Things get really interesting when the policies learned for subtasks can be reused for other goals.

In general, the hierarchy can be arbitrarily deep. The canonical example is traveling to another city. The first choice is whether to go at all. After that, you need to decide how each stage of the journey will be carried out: taking the train to the airport, flying, and taking a taxi to the hotel seem like reasonable steps. The train stage breaks down into subtasks such as checking the schedule and buying tickets. Calling a taxi involves many individual movements to pick up the phone and vibrate the vocal cords.
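
Schematically, one level of such a hierarchy might look like the sketch below, where `meta_policy`, `low_level_policy`, and `subgoal_reached` (and the `env` interface) are placeholders rather than any specific algorithm:

```python
def hierarchical_episode(env, meta_policy, low_level_policy, max_low_steps=50):
    """The high-level policy picks sub-goals on a coarse time scale;
    the low-level policy acts every step to reach the current sub-goal."""
    state = env.reset()
    done = False
    while not done:
        subgoal = meta_policy(state)                   # e.g. "get to the airport"
        for _ in range(max_low_steps):
            action = low_level_policy(state, subgoal)  # e.g. individual movements
            state, reward, done = env.step(action)
            if done or subgoal_reached(state, subgoal):  # placeholder helper
                break
    return state
```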


A legitimate request in RL research

Though a bit simplistic, it is a convincing example in the good old spirit of the 1990s. A single scalar reward for arriving in the right city can be propagated through the Markov chain down to the different levels of the hierarchy.

Hierarchy promises great benefits, but we are still far from realizing them. Most state-of-the-art systems consider hierarchies only one level deep, and transferring the acquired knowledge to other tasks remains hard.

Conclusion


My conclusion is broadly the same as Alex's.

I am very glad there is so much activity in this area, and that we have finally taken on the problems I have always wanted to solve. Reinforcement learning has finally moved beyond toy simulators!


No panic!

I want to add only one thing: do not despair if standard deep learning methods fail to slay the reinforcement learning monsters. Reinforcement learning has two fundamental difficulties that are absent in supervised learning: exploration and long-term credit assignment. They have always been there, and solving them will take more than a really good function approximator. We need much better ways of exploring, of reusing samples from past exploration, of transferring experience between tasks, of learning alongside other agents (including humans), of acting at different time scales, and of solving hard problems with scalar rewards.

Despite the extremely difficult problems in RL, I still think it is the best framework we have today for working toward strong artificial intelligence. Otherwise I would not be doing it. When DQN played Atari from raw pixels, and when AlphaGo defeated the world champion in Go, we really were watching small steps on the path to strong AI.

I am excited about the future of reinforcement learning and artificial intelligence.
