Unsupervised learning: the curious pupil

Original author: DeepMind
Over the past decade, machine learning has made unprecedented progress in areas as diverse as image recognition, self-driving cars, and complex games such as Go. These successes have been achieved largely by training deep neural networks with one of two paradigms: supervised learning and reinforcement learning. Both paradigms require training signals to be designed by a human and passed to the computer. In supervised learning, these are “targets” (for example, the correct label for an image); in reinforcement learning, they are “rewards” for successful behaviour (such as a high score in an Atari game). The limits of learning are therefore defined by people.

While some scientists believe that a sufficiently broad training regime, for example the ability to complete a wide variety of tasks, should be enough to give rise to general intelligence, others think that true intelligence will require more independent learning strategies. Consider, for example, how a toddler learns. Her grandmother can sit with her and patiently point out examples of ducks (acting as the instructive signal of supervised learning), or reward her with applause for solving a puzzle of wooden blocks (as in reinforcement learning). But most of the time, the infant is naively exploring the world, making sense of her surroundings through curiosity, play and observation. Unsupervised learning is a paradigm designed to create autonomous intelligence by rewarding agents (that is, computer programs) for learning about the data they observe, without regard to any particular task. In other words, the agent learns for the sake of learning.

The key motivation for unsupervised learning is this: while the data fed to learning algorithms is extremely rich in internal structure (images, video, text), the targets and rewards used for training are typically very sparse (the label “dog” for that whole species, or a single one or zero to indicate success or failure in a game). This suggests that most of what an algorithm learns must consist of understanding the data itself, rather than applying that understanding to particular problems.

Decoding the elements of vision


2012 was a landmark year for deep learning, when AlexNet (named after its lead architect, Alex Krizhevsky) swept the ImageNet classification competition. Its ability to recognise images was unprecedented, but what was happening under the hood was even more surprising. When scientists analysed what AlexNet was doing, they found that it interprets images by building increasingly complex internal representations of its input. Low-level features, such as edges and textures, are represented in the lower layers, and these are then combined in the higher layers into higher-level concepts such as wheels or dogs.
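As a rough illustration (not part of the original analysis), the pretrained AlexNet that ships with torchvision can be probed layer by layer. The sketch below assumes torchvision 0.13 or newer; the image file name is a hypothetical placeholder.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load the pretrained AlexNet that ships with torchvision (>= 0.13).
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
alexnet.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # hypothetical input

# Walk through the convolutional stack, recording each layer's activations.
activations = []
x = image
with torch.no_grad():
    for layer in alexnet.features:
        x = layer(x)
        activations.append(x)

# Early layers respond to edges and textures, later ones to object parts.
for i, act in enumerate(activations):
    print(f"layer {i}: activation shape {tuple(act.shape)}")
```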

This is strikingly similar to how information is processed in our own brains: simple edges and textures in the primary sensory areas are assembled into complex objects, such as faces, in higher brain areas. A complex scene can thus be composed from visual primitives, in much the same way that meaning arises from the individual words that make up a sentence. Without being explicitly instructed to do so, AlexNet's layers discovered a fundamental visual “vocabulary” suited to solving the task. In a sense, the network learned to play what Ludwig Wittgenstein called a “language game”, leading step by step from pixels to image labels.


The visual vocabulary of a convolutional neural network. For each layer, images are generated that maximise the activation of particular neurons. The response of those neurons to other images can then be interpreted as the presence or absence of visual “words”: textures, bookshelves, dog faces, birds.

Transfer learning


From the point of view of general intelligence, the most interesting thing about the AlexNet vocabulary is that it can be reused, or transferred, to other visual tasks, for example recognising not just individual objects but whole scenes. In an ever-changing world, transfer is absolutely essential, and humans excel at it: we can rapidly adapt the skills and understanding gleaned from experience (our model of the world) to whatever situation is at hand. For example, a classically trained pianist can pick up jazz with relative ease. Artificial agents that form the right internal representations of the world should, presumably, be capable of the same.
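A minimal sketch of such transfer, assuming a hypothetical 10-class target task: the visual vocabulary learned on ImageNet is frozen, and only a new output layer is trained.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from AlexNet features pretrained on ImageNet.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)

# Freeze the visual "vocabulary" learned during pretraining...
for param in model.features.parameters():
    param.requires_grad = False

# ...and replace the final classification layer for the new task
# (a hypothetical 10-way scene-recognition problem).
model.classifier[6] = nn.Linear(4096, 10)

# Only the parameters that still require gradients are updated.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```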

However, the representations learned by classifiers such as AlexNet have their limits. In particular, since the network is trained to assign a single class label (dog, cat, car, volcano), any other information, however useful it might be for other tasks, is ignored. For example, its representations may fail to capture the background of an image if the labels refer only to the objects in the foreground. A possible remedy is to provide richer training signals, such as detailed image captions: not just “dog”, but “a corgi catching a frisbee in a sunny park”. However, such labels are laborious to produce, especially at scale, and may still not capture all the information needed for a given task. The basic premise of unsupervised learning is that the best way to learn readily transferable representations is to try to learn everything that can be learned about the data.

If the notion of transfer through representation learning seems too abstract, imagine a child who has learned to draw people as stick figures. She has discovered a representation of the human form that is both highly compact and easily adapted. By adding particular features to each figure, she can create portraits of all her classmates: glasses for her best friend, her deskmate's favourite red T-shirt. And she developed this skill not to carry out a specific task or earn a reward, but in response to a basic urge to depict the world around her.

Learning through creativity: generative models


Perhaps the simplest objective of unsupervised learning is to train an algorithm to generate its own examples of data. So-called generative models should not merely reproduce the data they were trained on (that would be uninteresting memorisation), but build a model of the underlying class from which the data was drawn: not a particular photograph of a horse or a rainbow, but the set of all photographs of horses and rainbows; not a specific utterance by a specific speaker, but the general distribution of spoken utterances. The guiding principle of generative models is that the ability to construct a convincing example of the data is the strongest evidence of having understood it: as Richard Feynman put it, “What I cannot create, I do not understand.”

To date, the most successful generative model for images has been the generative adversarial network (GAN), in which two networks, a generator and a discriminator, engage in a contest of discernment akin to that between a forger and a detective. The generator produces images, trying to convince the discriminator that they are real; the discriminator is rewarded for spotting the fakes. The generated images are at first random and messy, then improve over many rounds, and the dynamic interplay between the two networks yields ever more realistic images, in many cases indistinguishable from real photographs. GANs can also render detailed landscapes from users' rough sketches.
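The adversarial game itself fits in a few dozen lines. The sketch below uses deliberately tiny fully connected networks on flattened 28×28 images; the architecture and names are illustrative assumptions, not the model behind the images discussed here.

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 28 * 28  # toy sizes for flattened images

# Generator: maps random noise to an image. Discriminator: scores realness.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    n = real_batch.size(0)
    fake = G(torch.randn(n, latent_dim))

    # The detective: reward the discriminator for spotting forgeries.
    d_loss = (bce(D(real_batch), torch.ones(n, 1)) +
              bce(D(fake.detach()), torch.zeros(n, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # The forger: reward the generator for fooling the discriminator.
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```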

A glance at the images below is enough to see that the network has learned to depict many key features of the photographs it was trained on: the structure of animals' bodies, the texture of grass, the fine play of light and shadow (even in the reflection in a soap bubble). Closer inspection reveals small anomalies, such as an extra leg on the white dog and the oddly right-angled jets of one of the fountains. Although the creators of generative models strive to eliminate such imperfections, the fact that we can see them points to one of the benefits of reconstructing data as familiar as images: by inspecting the samples, researchers can tell what the model has and has not learned.



Creation through prediction


Another noteworthy family of unsupervised networks is autoregressive models, in which the data is split into small pieces that the model predicts one at a time. Such models can generate data as follows: the network guesses what comes next, that guess is fed back in as input, and the network guesses again. The best-known example is the language model, in which each word is predicted from the words that precede it: these models drive the text predictions that appear in many applications and when composing an email. Recent advances in language modelling can produce strikingly plausible passages, such as this one from OpenAI's GPT-2 (a minimal sketch of the underlying sampling loop follows the example):
Prompt (written by a human)

In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

Model completion (after 10 tries)

The scientists named the population “Ovid's unicorns” for their distinctive horn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions were exploring the Andes Mountains when they found a small valley with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two rocky peaks and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on the surface,” said Pérez.

Pérez and his friends were astonished to see the herd of unicorns. The creatures could be seen from the air without having to move much to see them; they were so close that the explorers could touch their horns.
One interesting inconsistency is the description of the unicorns as “four-horned”: probing the limits of the network's understanding like this is entertaining in itself.
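The guess-and-feed-back loop described above can be sketched as follows. `TinyLM` is a hypothetical stand-in for any trained autoregressive model (not GPT-2 itself); in practice its weights would come from training on a large text corpus.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A toy autoregressive model: embedding, GRU, logits over the vocabulary."""
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        h, state = self.rnn(self.embed(tokens), state)
        return self.head(h), state

def generate(model, prompt_tokens, steps, temperature=1.0):
    """Sample token by token, feeding each guess back in as the next input."""
    model.eval()
    out = []
    with torch.no_grad():
        logits, state = model(prompt_tokens.unsqueeze(0))
        for _ in range(steps):
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            nxt = torch.multinomial(probs, 1)             # guess the next token
            out.append(nxt.item())
            logits, state = model(nxt.view(1, 1), state)  # feed it back in
    return out
```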

By controlling the input sequence used to condition the predictions, autoregressive models can also be used to translate one sequence into another. One demonstration uses a conditional autoregressive model to turn text into realistic handwriting. WaveNet converts text into natural-sounding speech and is now used to generate the voice of the Google Assistant. Similar advances in conditioning and autoregressive generation can be applied to translation from one language to another.
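At the heart of WaveNet-style models is the causal convolution, which guarantees that the prediction at each time step depends only on earlier time steps. The sketch below illustrates the idea (it is not DeepMind's implementation): padding only on the left keeps the convolution causal, and doubling the dilation at each layer grows the receptive field exponentially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """A dilated 1-D convolution whose output at time t sees only inputs <= t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

# Doubling the dilation layer by layer lets the stack condition on a long
# history of samples while keeping the number of layers small.
stack = nn.Sequential(*[CausalConv1d(32, dilation=2 ** i) for i in range(6)])
x = torch.randn(1, 32, 16000)  # one second of hypothetical 16 kHz features
print(stack(x).shape)          # time dimension is preserved: (1, 32, 16000)
```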

Autoregressive models learn about data by trying to predict each piece of it in a particular order. A more general class of unsupervised networks can be built by predicting any part of the data from any other part. For example, this could mean hiding one word in a sentence and trying to predict it from the remaining text. By training a system through many such local predictions, we force it to learn about the data as a whole.
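A sketch of this masked-prediction objective, with a hypothetical vocabulary size and a trivial stand-in encoder: the loss is computed only at the positions that were hidden, so the model cannot simply copy its input.

```python
import torch
import torch.nn as nn

vocab_size, mask_id = 1000, 999  # hypothetical vocabulary; 999 acts as [MASK]

def mask_tokens(tokens, mask_prob=0.15):
    """Hide a random subset of tokens; the model must reconstruct them."""
    corrupted = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted[mask] = mask_id
    # Unmasked positions get the ignore_index, so the loss skips them.
    labels = torch.where(mask, tokens, torch.full_like(tokens, -100))
    return corrupted, labels

# Any encoder over the corrupted sequence works; this stand-in just maps
# each token to vocabulary logits.
encoder = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

tokens = torch.randint(0, vocab_size - 1, (8, 32))  # a batch of sequences
corrupted, labels = mask_tokens(tokens)
logits = encoder(corrupted)
loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```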

One concern with generative models is their potential for misuse. Manipulating evidence in the form of photographs, video and audio has long been possible, but generative models could make malicious editing of such material far easier. We have already seen demonstrations of so-called deepfakes, for example a fabricated video of President Obama. It is encouraging to see serious efforts to meet these challenges: among them, statistical techniques for detecting synthetic material and verifying authentic material, raising public awareness, and discussions about limiting the availability of trained generative models. Moreover, generative models can themselves be used to detect fabricated material and anomalous data, for instance to detect synthesised speech or flag anomalous payments and protect customers from fraud. Researchers need to keep working on generative models in order to understand them better and to reduce future risks.

Reinventing intelligence


Generative models are fascinating in their own right, but at DeepMind we regard them as a stepping stone towards general intelligence. Giving an agent the ability to generate data is akin to giving it an imagination, and with it the capacity to plan and reason about the future. Our research shows that learning to predict different aspects of the environment, even without an explicit data-generation task, enriches an agent's world model and thereby improves its ability to solve problems.

These results resonate with our intuitions about the human mind. Our ability to learn about the world without explicit supervision is one of the fundamental hallmarks of intelligence. On a train journey we might gaze idly out of the window, run a hand over the velvet of the seats, study the passengers travelling with us. We have no goal in these explorations: we can hardly help gathering information, and our brains work tirelessly to understand the world around us and our place within it.
