Reinforcement learning or evolution strategies? Both

Original author: Arthur Juliani
  • Translation
Hello, Habr!

We rarely decide to post translations of texts that are two years old, contain no code, and have a clearly academic focus, but today we are making an exception. We hope that the dilemma raised in the title concerns many of our readers, and that you have already read, or will now read, the seminal work on evolution strategies that this post argues with. Welcome under the cut!



In March 2017, OpenAI stirred up the deep learning community by publishing the article "Evolution Strategies as a Scalable Alternative to Reinforcement Learning." The work described impressive results suggesting that reinforcement learning (RL) is not the only game in town, and that when training complex neural networks it makes sense to try other methods. A discussion then arose about the importance of reinforcement learning and whether it deserves its status as a "must-have" technology for teaching systems to solve problems. Here I want to argue that these two technologies should not be seen as competitors, one of which is clearly better than the other; on the contrary, they ultimately complement each other. Indeed, if you think a little about what it takes to create a general AI, and systems capable of learning, judgment, and planning throughout their existence, you will almost certainly conclude that some kind of combined solution will be required. Incidentally, it was precisely a combined solution that nature arrived at, endowing mammals and other higher animals with complex intelligence over the course of evolution.

Evolutionary strategies


The main thesis of the OpenAI article was that, instead of using reinforcement learning combined with traditional backpropagation, they successfully trained a neural network to solve complex problems with a so-called "evolution strategy" (ES). The ES approach consists in maintaining a distribution over network-wide weight values, with many agents working in parallel, each using parameters sampled from this distribution. Each agent acts in its own environment, and after a given number of episodes, or steps within an episode, the algorithm returns the cumulative reward as a fitness score. Given these scores, the parameter distribution can be shifted toward the more successful agents and away from the less successful ones. By repeating this operation millions of times with hundreds of agents, it is possible to move the weight distribution into a region of space that yields a high-quality policy for the agents' task. Indeed, the results presented in the article are impressive: it is shown that, with a thousand agents running in parallel, anthropomorphic bipedal locomotion can be learned in less than half an hour (whereas even the most advanced RL methods require more than an hour). For a more detailed overview, I recommend the excellent post from the authors of the experiment, as well as the scientific article itself.



Various learning strategies for anthropomorphic upright posture, studied using OpenAI's ES method.
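To make the procedure more concrete, below is a minimal sketch of this kind of evolution strategy in Python/NumPy. It is not OpenAI's implementation: the `evaluate` fitness function is a placeholder, and the population size, noise scale, and learning rate are assumptions chosen purely for illustration.

```python
import numpy as np

def evolution_strategy(evaluate, dim, pop_size=100, sigma=0.1, lr=0.02, iterations=1000):
    """Minimal ES sketch: keep a mean parameter vector, sample a population of
    perturbed copies around it, score each with a fitness function, and shift
    the mean toward the better-scoring samples."""
    theta = np.zeros(dim)  # mean of the parameter distribution
    for _ in range(iterations):
        noise = np.random.randn(pop_size, dim)  # one perturbation per "agent"
        rewards = np.array([evaluate(theta + sigma * eps) for eps in noise])
        # Normalize fitness scores so the update size does not depend on reward scale.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Move the distribution toward the more successful perturbations.
        theta += lr / (pop_size * sigma) * noise.T @ advantages
    return theta

# Hypothetical usage: evaluate(params) would run one episode of an environment
# with a policy parameterized by `params` and return the episode's total reward.
```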

Black box


The great benefit of this method is that it is easy to parallelize. Whereas RL methods such as A3C require workers to exchange information with a parameter server, ES only needs fitness estimates and high-level information about the parameter distribution. Thanks to this simplicity, the method beats modern RL methods in scalability. However, all this does not come for free: the network has to be optimized as a black box. Here "black box" means that during training the internal structure of the network is ignored entirely; only the overall result (the reward for the episode) is used, and it alone determines whether the weights of a particular network will be inherited by the next generation. In situations where we do not get much feedback from the environment (and in many traditional RL problems the reward stream is very sparse), the problem goes from being a "partial black box" to a "complete black box." In that case there is a serious performance gain to be had, so the trade-off is of course justified. "Who needs gradients if they are hopelessly noisy anyway?" is the general attitude.
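One way to keep the communication this cheap, and the trick used in the OpenAI paper, is to have each worker derive its parameter perturbation from a random seed known to the coordinator, so that only scalar fitness values ever need to be exchanged. The sketch below (reusing the update from the previous example) illustrates the idea; the sequential loop stands in for what would really be parallel worker processes, and `evaluate` is again just a placeholder.

```python
import numpy as np

def evaluate(params):
    """Placeholder fitness function; a real worker would run an episode with a
    policy parameterized by `params` and return the total reward."""
    return -float(np.sum(params ** 2))

def worker(theta, seed, sigma=0.1):
    # The perturbation is regenerated from a shared seed, so the worker only
    # has to report a single scalar back to the coordinator.
    eps = np.random.RandomState(seed).randn(theta.size)
    return evaluate(theta + sigma * eps)

def coordinator_step(theta, pop_size=100, sigma=0.1, lr=0.02):
    seeds = np.random.randint(0, 2**31 - 1, size=pop_size)
    rewards = np.array([worker(theta, s, sigma) for s in seeds])  # really: parallel workers
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # The coordinator reconstructs every worker's perturbation locally from its seed.
    noise = np.stack([np.random.RandomState(s).randn(theta.size) for s in seeds])
    return theta + lr / (pop_size * sigma) * noise.T @ advantages
```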

However, in situations where feedback is more plentiful, things start to go wrong for ES. The OpenAI team describes training a simple MNIST classification network with ES, and this time training was 1,000 times slower. The point is that the gradient signal in image classification is extremely informative about how to make the network a better classifier. The problem, then, lies not so much with the RL technique itself as with sparse rewards in environments that produce noisy gradients.

Solution found by nature


When thinking about how to develop AI, trying to learn from nature's example is sometimes dismissed as approaching the problem the wrong way. After all, nature operates under constraints that computer scientists simply do not have. There is a view that a purely theoretical approach to a given problem can yield more effective solutions than its empirical alternatives. Nevertheless, I still think it is worth examining how a dynamic system operating under a particular set of constraints (the Earth) produced agents (animals, mammals in particular) capable of flexible and complex behavior. While some of those constraints do not apply in the simulated worlds of data science, others hold up quite well.

Looking at the intelligent behavior of mammals, we see that it arises from the complex interplay of two closely interrelated processes: learning from others' experience and learning from one's own experience. The first is often identified with evolution by natural selection, but here I use a broader term to also cover epigenetics, microbiomes, and other mechanisms that allow experience to be shared between organisms that are not genetically related. The second process, learning first-hand, covers all the information an animal manages to assimilate over its lifetime, information tied directly to that animal's interaction with the outside world. This category includes everything from learning to recognize objects to mastering the communication involved in the learning process.

Roughly speaking, these two processes occurring in nature can be compared with two ways of optimizing neural networks. Evolutionary strategies, which update the organism using only aggregate information about its fitness, are close to learning from others' experience. Gradient methods, by contrast, where a particular experience leads to a particular change in the agent's behavior, are comparable to learning from one's own experience. If you think about the kinds of intelligent behavior, or the abilities, that each of these two approaches develops in animals, the comparison becomes even clearer. Evolutionary methods lend themselves to learning reactive behaviors that provide a certain level of fitness (enough to stay alive). Learning to walk or to escape captivity is in many cases equivalent to the more "instinctive" behaviors that are "hard-wired" into many animals at the genetic level. In addition, this example shows that evolutionary methods are applicable in cases where the reward signal is extremely sparse (such as, for example, the fact of successfully raising offspring). In such a case it is impossible to correlate the reward with any specific set of actions that may have been taken many years before that outcome. On the other hand, if we consider the case in which ES fails, namely image classification, the results are remarkably comparable with what animal training has achieved over more than 100 years of countless behavioral psychology experiments.

Animal training


The methods used in reinforcement learning are in many cases taken directly from the psychological literature on operant conditioning, and operant conditioning itself was studied within animal psychology. (Incidentally, Richard Sutton, one of the two founders of reinforcement learning, has a bachelor's degree in psychology.) In operant conditioning, animals learn to associate reward or punishment with specific behavioral patterns. Trainers and researchers can manipulate this reward association in one way or another, getting animals to demonstrate intelligence or particular behaviors. However, operant conditioning as used in animal research is nothing more than a more pronounced form of the same conditioning through which animals learn throughout their lives. We constantly receive positive reinforcement signals from the environment and adjust our behavior accordingly. Indeed, many neuroscientists and cognitive scientists believe that humans and other animals actually operate at an even higher level, constantly learning to predict the outcomes of their behavior in future situations in anticipation of potential rewards.

The central role of prediction in learning from one's own experience changes the dynamics described above in a significant way. The signal that previously seemed very sparse (the episodic reward) turns out to be very dense. In theory the situation is roughly this: at every moment the mammalian brain is predicting outcomes from a complex stream of sensory stimuli and actions, a stream in which the animal is simply immersed. The animal's actual behavior then provides a dense signal that guides the correction of predictions and the development of behavior. The brain uses all these signals to optimize its predictions (and, accordingly, the quality of the actions it takes) in the future. An overview of this approach is given in the excellent book "Surfing Uncertainty" by the cognitive scientist and philosopher Andy Clark. If we extrapolate this reasoning to the training of artificial agents, reinforcement learning reveals a fundamental flaw: the signal used in this paradigm is hopelessly weak compared to what it could be (or should be). In cases where it is impossible to make the signal any denser (perhaps because it is, by definition, weak, or tied to low-level reactivity), it is probably better to prefer a training method that parallelizes well, such as ES.

Better training of neural networks


Building on how the mammalian brain works, constantly engaged in prediction, there have recently been some successes in reinforcement learning that take the importance of such predictions into account. I can recommend two such works:


In both of these articles, the authors augment the typical policy network with predictions about future states of the environment. In the first article, prediction is applied to a set of measurement variables; in the second, to changes in the environment and to the agent's own behavior. In both cases the sparse signal associated with positive reinforcement becomes much denser and more informative, enabling both faster learning and the acquisition of more complex behaviors. Such improvements are only available with methods that use a gradient signal, not with methods that operate as a "black box," such as ES.
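As a rough illustration of the general idea (not of either paper's exact architecture), here is a PyTorch sketch of a policy network with an auxiliary head that predicts some features of the future state. The layer sizes, the `aux_weight` coefficient, and the use of a mean-squared-error auxiliary loss are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class PolicyWithPrediction(nn.Module):
    """A policy network with an auxiliary prediction head: the shared encoder is
    trained both by the (sparse) policy-gradient loss and by a dense supervised
    loss on the prediction head."""
    def __init__(self, obs_dim, n_actions, pred_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # action logits
        self.predict_head = nn.Linear(hidden, pred_dim)   # predicted future features

    def forward(self, obs):
        h = self.encoder(obs)
        return self.policy_head(h), self.predict_head(h)

def combined_loss(logits, predictions, actions, advantages, future_features, aux_weight=0.5):
    # Standard policy-gradient term, driven by the sparse reward signal...
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(chosen * advantages).mean()
    # ...plus a dense auxiliary term that supervises the same encoder.
    aux_loss = nn.functional.mse_loss(predictions, future_features)
    return pg_loss + aux_weight * aux_loss
```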

In addition, learning from one's own experience with gradient methods is far more data-efficient. Even in those cases where a problem could be learned faster with ES than with reinforcement learning, the gain was achieved at the cost of ES consuming many times more data than RL. Thinking again about how animals learn: the results of learning from others' example show up only after many generations, whereas a single personally experienced event is sometimes enough for an animal to learn a lesson for life. While this kind of learning from a single example does not yet fully fit into traditional gradient methods, it is far more within their reach than within that of ES. There are, for example, approaches such as neural episodic control, in which Q-values are stored during training and then consulted before taking an action. The result is a gradient method that makes it possible to learn to solve problems much faster than before. In the paper on neural episodic control, the authors mention the human hippocampus, which can retain information about an event after a single experience and therefore plays a critical role in recall. Such mechanisms require access to the internal organization of the agent, which is, by definition, impossible in the ES paradigm.
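To give a flavor of the idea (a deliberate simplification, not the actual neural episodic control algorithm), here is a sketch of a per-action episodic memory that stores state embeddings together with observed returns and estimates action values by nearest-neighbor lookup. The embedding representation, the number of neighbors, and the simple averaging rule are all assumptions made for illustration.

```python
import numpy as np

class EpisodicMemory:
    """A simple per-action episodic memory: store (embedding, return) pairs and
    estimate Q(s, a) by averaging the returns of the nearest stored embeddings."""
    def __init__(self, n_actions, k=5):
        self.k = k
        self.keys = [[] for _ in range(n_actions)]    # state embeddings per action
        self.values = [[] for _ in range(n_actions)]  # observed returns per action

    def write(self, action, embedding, episodic_return):
        self.keys[action].append(np.asarray(embedding, dtype=float))
        self.values[action].append(float(episodic_return))

    def q_value(self, action, embedding):
        if not self.keys[action]:
            return 0.0  # no experience with this action yet
        keys = np.stack(self.keys[action])
        dists = np.linalg.norm(keys - embedding, axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean(np.array(self.values[action])[nearest]))

# Hypothetical usage: before acting, compute q_value for every action and pick
# the argmax; after an episode, write back the returns that were actually observed.
```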

So why not combine them?


Most of this article has probably left the impression that I am advocating for RL methods. In fact, however, I believe that in the long run the best solution will be a combination of the two, with each used in the situations it suits best. Obviously, for many reactive policies, or in situations with very sparse positive-reinforcement signals, ES wins, especially if you have the computing power to run massively parallel training. On the other hand, gradient methods using reinforcement learning or supervised learning will be useful when rich feedback is available and the problem needs to be learned quickly and with less data.

Turning to nature, we find that the first method essentially lays the foundation for the second. That is why, over the course of evolution, mammals developed brains that allow extremely efficient learning from the complex signals coming from the environment. So the question remains open. Perhaps evolutionary strategies will help us invent effective learning architectures that will also be useful for gradient-based learning methods. After all, the solution found by nature has indeed been very successful.
