The Limitations of Deep Learning and Its Future

Original author: Francois Chollet
This article is an adaptation of sections 2 and 3 of Chapter 9 of my book "Deep Learning with Python" (Manning Publications).

The article is intended for readers who already have significant experience with deep learning (for example, those who have read chapters 1-8 of the book). A fair amount of prior knowledge is assumed.



Limitations of Deep Learning


Deep Learning: A Geometric View


The most surprising thing about deep learning is how simple it is. Ten years ago, no one expected that we would achieve such remarkable results in machine perception problems using simple parametric models trained with gradient descent. Now it turns out that all we need are sufficiently large parametric models trained on sufficiently many examples. As Feynman once said about the Universe: "It's not complicated, it's just a lot of it."

In deep learning, everything is a vector, that is, a point in a geometric space. The model's input data (this can be text, images, etc.) and its targets are first "vectorized", that is, turned into an initial input vector space and a target output vector space. Each layer in a deep learning model performs one simple geometric transformation on the data that goes through it. Together, the chain of layers forms one very complex geometric transformation, broken down into a series of simple ones. This complex transformation attempts to map the input space onto the target space, one point at a time. The transformation is parameterized by the weights of the layers, which are iteratively updated based on how well the model is currently performing. A key property of this geometric transformation is that it must be differentiable, i.e., we must be able to learn its parameters via gradient descent. Intuitively, this means the geometric morphing from inputs to outputs must be smooth and continuous - a significant constraint.
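
To make this concrete, here is a minimal sketch in plain NumPy (my own illustration, not code from the book): a two-layer model viewed as a chain of simple, differentiable geometric transformations, fitted by gradient descent on a toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))                  # input points in a 2-D space
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # a target defined on that space

W1, b1 = rng.normal(size=(2, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
lr = 0.5

for step in range(2000):
    # Forward pass: each layer is one simple transformation of the data.
    h = np.tanh(X @ W1 + b1)                   # layer 1: affine map + smooth bend
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))       # layer 2: affine map + sigmoid
    # Backward pass: because every step is differentiable, gradients flow back.
    grad_logits = (p - y[:, None]) / len(X)
    gW2, gb2 = h.T @ grad_logits, grad_logits.sum(0)
    grad_h = grad_logits @ W2.T * (1 - h ** 2)
    gW1, gb1 = X.T @ grad_h, grad_h.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

Each iteration nudges the parameters of the chained transformations so that the overall morphing of the input space brings points a little closer to their targets.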

The whole process of applying this complex geometric transformation to the input data can be visualized in 3D by picturing a person trying to uncrumple a paper ball: the crumpled ball is the manifold of input data that the model starts with. Each movement of the person's fingers is like a simple geometric transformation performed by a single layer. The complete sequence of uncrumpling gestures is the complex transformation of the whole model. Deep learning models are mathematical machines for uncrumpling intricate manifolds of high-dimensional data.

That is the magic of deep learning: turning meaning into vectors, into geometric spaces, and then gradually learning complex geometric transformations that map one space onto another. All you need are spaces of sufficiently high dimensionality to capture the full scope of the relationships found in the original data.

Limitations of Deep Learning


The space of tasks that can be solved with this simple strategy is nearly endless. Still, many of them remain out of reach of current deep learning techniques - even given vast amounts of manually annotated data. Say, for example, that you could assemble a dataset of hundreds of thousands - even millions - of English-language descriptions of software features, written by product managers, together with the corresponding source code developed by teams of engineers to meet those requirements. Even with this data, you could not train a deep learning model to simply read a product description and generate the appropriate codebase. This is just one example among many. In general, anything that requires reasoning - like programming, or applying the scientific method - long-term planning, or algorithmic-style data manipulation, is beyond the reach of deep learning models, no matter how much data you throw at them. Even learning a sorting algorithm with a deep neural network is tremendously difficult.

The reason is that a deep learning model is "only" a chain of simple, continuous geometric transformations mapping one vector space into another. All it can do is map one data set X onto another set Y, provided there exists a learnable continuous transformation from X to Y, and provided a dense sampling of X:Y pairs is available as training data. So, while a deep learning model can be regarded as a kind of program, most programs cannot be expressed as deep learning models - for most tasks, either there is no deep neural network of practically reasonable size that solves the problem, or, if one exists, it may not be learnable: the corresponding geometric transformation may be too complicated, or there may be no suitable data available for training it.

Scaling up existing deep learning techniques - adding more layers and using more training data - can only superficially mitigate some of these problems. It will not solve the more fundamental problem that deep learning models are very limited in what they can represent, and that most of the programs one might wish to learn cannot be expressed as a continuous geometric morphing of a data manifold.

The risk of anthropomorphizing machine learning models


One very real risk with contemporary AI is misinterpreting what deep learning models do and overestimating their abilities. A fundamental feature of the human mind is our "theory of mind", our tendency to project intentions, beliefs, and knowledge onto the things around us. Draw a smiley face on a rock and it suddenly becomes "happy" - in our minds. Applied to deep learning, this means, for example, that if we can more or less successfully train a model to generate captions describing pictures, we are led to believe that the model "understands" the content of the images as well as the captions it generates. We are then very surprised when a slight deviation from the kinds of images present in the training data causes the model to generate completely absurd captions.



In particular, this is highlighted by "adversarial examples": input samples to a deep learning network that are specially crafted to be misclassified. You already know that it is possible to do gradient ascent in input space to generate inputs that maximize the activation of, say, a particular convolutional filter - this is the basis of the filter-visualization technique introduced in Chapter 5 (note: of the book "Deep Learning with Python"), as well as the Deep Dream algorithm from Chapter 8. In a similar way, through gradient ascent, you can slightly modify an image to maximize the class prediction for a given class. If we take a photo of a panda and add a "gibbon" gradient, we can make the neural network classify the panda as a gibbon. This shows both the fragility of these models and the deep difference between the input-to-output mapping they implement and our own human perception.
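
As a rough illustration of the idea (not the exact experiment described above), here is a hedged sketch of gradient ascent in input space, assuming a modern TensorFlow/Keras setup; the choice of model, the all-zeros image placeholder, and the "gibbon" class index are assumptions made for the example.

```python
import tensorflow as tf

# Stand-ins for a pretrained classifier and a preprocessed panda photo.
model = tf.keras.applications.MobileNetV2(weights="imagenet")
image = tf.zeros((1, 224, 224, 3))          # placeholder for a real, preprocessed image
target_class = 368                          # hypothetical index of the "gibbon" class

with tf.GradientTape() as tape:
    tape.watch(image)
    predictions = model(image)
    # Score that grows when the network assigns more probability to the target class.
    target_score = predictions[0, target_class]

# Gradient ascent in *input* space: nudge each pixel in the direction that
# increases the "gibbon" score, leaving the image visually almost unchanged.
gradient = tape.gradient(target_score, image)
adversarial_image = image + 0.007 * tf.sign(gradient)
```

The perturbation is imperceptible to a human, yet it can flip the model's prediction, because the model's input-to-output mapping has nothing in common with human perception.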



In general, deep learning models have no understanding of their input, at least not in any human sense. Our own understanding of images, sounds, and language is grounded in our sensorimotor experience as humans - as embodied, earthly creatures. Machine learning models have no access to such experience and thus cannot "understand" their inputs in any human-relatable way. By annotating large numbers of training examples for our models, we get them to learn a geometric transformation that maps data onto human concepts for that specific set of examples, but this mapping is just a simplistic sketch of the original model in our minds, the one developed from our experience as embodied agents - it is like a dim reflection in a mirror.



As a machine learning practitioner, always keep this in mind, and never fall into the trap of believing that neural networks understand the task they perform - they don't, at least not in any way that would make sense to us. They were trained on a different, far narrower task than the one we wanted to teach them: simply mapping training inputs to training targets, point by point. Show them anything that deviates from the training data, and they will break in the most absurd ways.

Local generalization versus extreme generalization


There seem to be fundamental differences between the direct geometric morphing from input to output that deep learning models perform and the way people think and learn. It is not just that humans learn from embodied experience rather than from a set of explicit training examples. Beyond the difference in learning processes, there is a fundamental difference in the nature of the underlying representations.

Humans are capable of far more than mapping an immediate stimulus to an immediate response, the way a neural network, or maybe an insect, does. People maintain complex, abstract models of their current situation, of themselves, and of other people, and can use these models to anticipate different possible futures and perform long-term planning. They are capable of merging known concepts into something they have never experienced before - like picturing a horse wearing jeans, for instance, or imagining what they would do if they won the lottery. This ability to handle hypotheticals, to expand our mental model space far beyond what we can experience directly - to perform abstraction and reasoning - is arguably the defining characteristic of human cognition. I call it "extreme generalization": the ability to adapt to novel, never-before-experienced situations using very little data or even no new data at all.

This stands in sharp contrast with what deep networks do, which I would call "local generalization": the mapping from inputs to outputs quickly stops making sense if new inputs differ even slightly from what the network saw at training time. Consider, for instance, the problem of learning the appropriate launch parameters to land a rocket on the moon. If you used a deep network for this task, training it with supervised or reinforcement learning, you would need to feed it thousands or even millions of launch trials - that is, a dense sampling of the input space - in order for it to learn a reliable mapping from input space to output space. In contrast, humans can use their power of abstraction to come up with physical models - rocket science - and derive an exact solution that will get the rocket to the moon after just a few trials. Similarly, if you developed a deep network controlling a human body and wanted it to learn to safely navigate a city without getting hit by cars, the network would have to die many thousands of times in various situations before it could infer that cars are dangerous and develop appropriate avoidance behaviors. Dropped into a new city, the network would have to relearn most of what it knew. Humans, on the other hand, are able to learn safe behaviors without having to die even once - again, thanks to their power of abstract modeling of hypothetical situations.



So, despite our progress in machine perception, we are still very far from human-level AI: our models can only perform local generalization, adapting to new situations that must stay very close to past data, while human cognition is capable of extreme generalization, quickly adapting to radically novel situations and planning far into the future.

Conclusions


Here is what you should remember: the only real success of deep learning so far has been the ability to map space X onto space Y using a continuous geometric transformation, given large amounts of human-annotated data. Doing this well is a game-changing achievement for essentially every industry, but it is still a very long way from human-level AI.

To lift some of these limitations and start competing with human brains, we need to move away from straightforward input-to-output mappings and on to reasoning and abstraction. A plausible substrate for abstract modeling of various situations and concepts is that of computer programs. We said earlier (note: in the book "Deep Learning with Python") that machine learning models can be defined as "learnable programs"; at the moment we can only learn programs that belong to a narrow and specific subset of all possible programs. But what if we could learn any program, in a modular and reusable way? Let's see how we might get there.

The future of deep learning


Given what we know about how deep networks work, about their limitations, and about the current state of research, can we predict where things are headed in the medium term? Here are some of my purely personal thoughts. Keep in mind that I don't have a crystal ball, so much of what I anticipate may fail to become reality. This is pure speculation. I share these predictions not because I expect them to be proven fully correct in the future, but because they are interesting and actionable in the present.

At a high level, these are the main areas that I consider promising:

  • Models will come closer to general-purpose computer programs, built on top of far richer primitives than our current differentiable layers - this is how we will get to reasoning and abstraction, whose absence is the fundamental weakness of current models.
  • New forms of learning will appear that make this possible - allowing models to move beyond merely differentiable transformations.
  • Models will require less involvement from developers - it shouldn't be your job to endlessly turn knobs.
  • There will be greater, more systematic reuse of previously learned features and architectures; meta-learning systems built on reusable and modular program subroutines.

Additionally, note that these considerations are not specific to supervised learning, which remains the bread and butter of machine learning - they apply to any form of machine learning, including unsupervised, self-supervised, and reinforcement learning. It fundamentally doesn't matter where your labels come from or what your training loop looks like; these different branches of machine learning are just different facets of the same construct.

Let's dive in.

Models as Programs


As we noted earlier, a necessary transformational development we can expect in the field of machine learning is a move away from models that perform pure pattern recognition and can only achieve local generalization, toward models capable of abstraction and reasoning that can achieve extreme generalization. Current AI programs that are capable of basic forms of reasoning are all hard-coded by human programmers: for instance, software that relies on search algorithms, graph manipulation, or formal logic. In DeepMind's AlphaGo, for example, most of the "intelligence" on display is designed and hard-coded by expert programmers (e.g., Monte Carlo tree search); learning from data happens only in specialized submodules - the value networks and policy networks. But in the future, such AI systems may well be fully learned, with no human involvement.

How could this be achieved? Consider a well-known type of network: the RNN. Importantly, RNNs have slightly fewer restrictions than feedforward networks. That is because RNNs are a bit more than mere geometric transformations: they are geometric transformations applied repeatedly inside a for loop. The temporal for loop itself is hard-coded by the developer: it is a built-in assumption of the network. Naturally, RNNs are still very limited in what they can represent, mainly because each step is still a differentiable geometric transformation and because they carry information from step to step via points in a continuous geometric space (state vectors). Now imagine neural networks that were "augmented", in a similar way, with programming primitives such as for loops - but not just a single hard-coded for loop with stitched-in geometric memory; rather, a large set of programming primitives that the model would be free to manipulate to expand its processing capabilities: if branches, while statements, variable creation, disk storage for long-term memory, sorting operators, advanced data structures such as lists, graphs, and hash tables, and much more. The space of programs such a network could represent would be far broader than what current deep learning networks can express, and some of these programs could achieve superior generalization power.
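
For reference, here is a minimal NumPy sketch (my own, with assumed shapes and names) of the point being made: the core of a simple RNN is one differentiable transformation applied repeatedly inside a hard-coded for loop.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """inputs: array of shape (timesteps, input_dim); returns the final state vector."""
    state = np.zeros(W_h.shape[0])
    for x_t in inputs:                                   # the temporal for loop, fixed by the developer
        state = np.tanh(x_t @ W_x + state @ W_h + b)     # one differentiable transformation per step
    return state

rng = np.random.default_rng(0)
timesteps, input_dim, state_dim = 10, 4, 8
final_state = rnn_forward(rng.normal(size=(timesteps, input_dim)),
                          rng.normal(size=(input_dim, state_dim)) * 0.1,
                          rng.normal(size=(state_dim, state_dim)) * 0.1,
                          np.zeros(state_dim))
```

The for loop is the only "programming primitive" the model gets, and even it is fixed in advance; the proposal above is to let models compose many such primitives freely.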

In a word, we will move away from having, on one hand, "hard-coded algorithmic intelligence" (handcrafted software) and, on the other hand, "learned geometric intelligence" (deep learning). Instead, we will have a blend of formal algorithmic modules that provide reasoning and abstraction capabilities, and geometric modules that provide informal intuition and pattern-recognition capabilities. The whole system will be learned with little or no human involvement.

A related subfield of AI that I think may be about to take off is program synthesis, in particular neural program synthesis. Program synthesis consists of automatically generating simple programs by using a search algorithm (possibly genetic search, as in genetic programming) to explore a large space of possible programs. The search stops when a program is found that matches the required specifications, often provided as a set of input/output pairs. As you can see, this closely resembles machine learning: "training data" is provided as input/output pairs, and we find a "program" that maps inputs to outputs and can generalize to new inputs. The difference is that instead of learning parameter values in a hard-coded program (a neural network), we generate source code via a discrete search process.
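
Here is a toy illustration of that idea (my own, not from the article): program synthesis as a discrete search over a tiny space of programs, guided only by input/output examples.

```python
import itertools

examples = [(1, 3), (2, 5), (5, 11)]        # target behavior: f(x) = 2 * x + 1

# The "language": sequences of primitive operations applied left to right.
primitives = {"inc": lambda x: x + 1, "double": lambda x: x * 2, "square": lambda x: x * x}

def run(program, x):
    for op in program:
        x = primitives[op](x)
    return x

# Enumerate programs up to length 3 and keep the first one that fits every example.
for length in range(1, 4):
    for program in itertools.product(primitives, repeat=length):
        if all(run(program, x) == y for x, y in examples):
            print("found:", program)         # e.g. ('double', 'inc')
            break
    else:
        continue
    break
```

Real program synthesis systems replace this brute-force enumeration with smarter search (genetic, neural-guided, etc.), but the interface is the same: examples in, source code out.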

I definitely expect a renewed wave of interest in this area within the next few years. In particular, I expect cross-pollination between the neighboring fields of deep learning and program synthesis, where instead of generating programs in general-purpose languages, we will generate neural networks (geometric data-processing flows) augmented with a rich set of algorithmic primitives, such as for loops and many others. This should be far more tractable and useful than direct source-code generation, and it would dramatically widen the scope of problems that can be solved with machine learning - the space of programs we can generate automatically, given appropriate training data. A blend of symbolic AI and geometric AI. Contemporary RNNs can be seen as a prehistoric ancestor of such hybrid algorithmic-geometric models.


Figure: A learned program simultaneously relying on geometric primitives (pattern recognition, intuition) and algorithmic primitives (reasoning, search, memory).

Beyond backpropagation and differentiable layers


If machine learning models become more like programs, they will mostly no longer be differentiable - certainly, these programs will still use continuous geometric layers as subroutines, and those will remain differentiable, but the model as a whole will not be. As a result, using backpropagation to adjust weight values in a fixed, hard-coded network cannot remain the method of choice for training models in the future - at least, it cannot be the whole story. We need to figure out how to train non-differentiable systems efficiently. Current approaches include genetic algorithms, "evolution strategies", certain reinforcement learning methods, and ADMM (the alternating direction method of multipliers). Naturally, gradient descent is not going anywhere - gradient information will always be useful for optimizing differentiable parametric functions. But our models will certainly become increasingly more ambitious than mere differentiable parametric functions, and so their automatic development (the "learning" in "machine learning") will require more than backpropagation.
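
As an illustration of gradient-free training, here is a minimal sketch of an evolution-strategies-style update (loosely following the common formulation; the fitness function and the constants are placeholders): a search direction is estimated from random perturbations rather than computed by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(params):
    # Stand-in for any non-differentiable score (e.g. the reward of a generated program).
    return -np.sum((params - 3.0) ** 2)

params = np.zeros(5)
sigma, lr, population = 0.1, 0.02, 50

for step in range(300):
    noise = rng.normal(size=(population, params.size))
    scores = np.array([fitness(params + sigma * n) for n in noise])
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    # Move parameters toward perturbations that scored well; no gradients required.
    params += lr / (population * sigma) * noise.T @ scores

print(params)   # should drift toward the optimum at 3.0
```

Nothing in this loop requires the fitness function to be differentiable, which is exactly the property needed once models stop being pure differentiable parametric functions.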

Besides, backpropagation is end-to-end, which is well suited to learning good chained transformations, but it is computationally inefficient because it does not fully exploit the modularity of deep networks. To make anything more efficient, there is one universal recipe: introduce modularity and hierarchy. So we can make backpropagation itself more efficient by introducing decoupled training modules with a synchronization mechanism between them, organized in a hierarchical fashion. This strategy is partly reflected in DeepMind's recent work on "synthetic gradients". I expect much, much more work along these lines in the near future.

One can imagine a future in which models that are globally non-differentiable (but feature differentiable parts) would be trained - grown - using an efficient search process that does not rely on gradients, while the differentiable parts would be trained even faster by exploiting gradients, using some more efficient version of backpropagation.

Automated machine learning


In the future, model architectures will themselves be learned rather than handcrafted by engineers. Learning architectures goes hand in hand with richer sets of primitives and with program-like machine learning models.

Currently, most of the work of a deep learning engineer consists of munging data with Python scripts, then lengthily tuning the architecture and hyperparameters of a deep network to obtain a working model - or even a state-of-the-art model, if the engineer is that ambitious. Needless to say, this is not an ideal state of affairs. But AI can help here too. Unfortunately, the data munging and preparation part is hard to automate, since it often requires domain knowledge as well as a clear, high-level understanding of what the developer wants to achieve. Hyperparameter tuning, however, is a simple search procedure, and in that case we already know what the developer wants to achieve: it is defined by the loss function of the network being tuned. It is already common practice to set up basic "AutoML" systems that take care of most of the knob-turning. I even set one up myself to win Kaggle competitions.

At the most basic level, such a system would simply tune the number of layers in a stack, their order, and the number of units or filters in each layer. This is typically done with libraries such as Hyperopt, which we discussed in Chapter 7 (note: of the book "Deep Learning with Python"). But we can go much further and try to learn an appropriate architecture from scratch, with as few constraints as possible. This is possible via reinforcement learning, for example, or via genetic algorithms.
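
As a hedged sketch of that most basic level, here is a plain random search over a few architecture knobs; the search space and the validation function are stand-ins invented for the example (in practice the score would come from actually building, training, and evaluating a model).

```python
import random

random.seed(0)

def validation_score(num_layers, units, learning_rate):
    # Placeholder objective: a real AutoML loop would train a model and return
    # its validation metric for this configuration.
    return -abs(num_layers - 3) - abs(units - 128) / 64 - abs(learning_rate - 1e-3) * 100

search_space = {
    "num_layers": [1, 2, 3, 4, 5],
    "units": [32, 64, 128, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

best = None
for trial in range(30):
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = validation_score(**config)
    if best is None or score > best[0]:
        best = (score, config)

print("best configuration:", best[1])
```

Libraries like Hyperopt replace the random sampling with smarter strategies (e.g. tree-structured Parzen estimators), but the overall loop - propose a configuration, score it, keep the best - is the same.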

Another important direction for AutoML is learning the model architecture jointly with the model weights. Training a new model from scratch every time we try a slightly different architecture is tremendously inefficient, so a truly powerful AutoML system would evolve architectures while the model's weights are being tuned via backpropagation on the training data, thereby eliminating all the redundant computation. As I write these lines, such approaches are already starting to appear.

When this starts to happen, machine learning engineers will not be out of a job - they will move higher up the value-creation chain. They will put far more effort into crafting complex loss functions that truly reflect business goals, and into deeply understanding how their models affect the digital ecosystems in which they are deployed (for example, the users who consume the model's predictions and generate the data used to train it) - problems that only the largest companies can currently afford to consider.

Lifelong learning and reuse of modular routines


If models become more complex and are built on top of richer algorithmic primitives, this increased complexity will require more intensive reuse across tasks, rather than training a model from scratch every time we face a new task or a new dataset. After all, many datasets do not contain enough information to develop a new complex model from scratch, and it will become necessary to leverage information from previously encountered datasets. You don't relearn English every time you open a new book - that would be impossible. Besides, training models from scratch on every new task is very inefficient, given the significant overlap between the current tasks and those encountered before.

In addition, a remarkable observation has been made repeatedly in recent years: training the same model on several loosely connected tasks results in a model that is better at each of these tasks. For instance, training the same neural machine translation model on both English-to-German and French-to-Italian translation results in a model that is better at each of those language pairs. Training an image classification model jointly with an image segmentation model, sharing the same convolutional base, results in a model that is better at both tasks. And so on. This is fairly intuitive: there is always some information overlap between seemingly disconnected tasks, so the joint model has access to more information about each individual task than a model trained only on that specific task.
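
A hedged Keras sketch of this setup (the input shape, layer sizes, and heads are assumptions for illustration, not code from the book): one shared convolutional base feeding two loosely related tasks, trained jointly.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
shared = layers.GlobalAveragePooling2D()(x)      # representation shared by both tasks

# Task 1: image classification head.
class_output = layers.Dense(10, activation="softmax", name="class")(shared)
# Task 2: a coarse segmentation-like head (here just a flat per-region mask score).
mask_output = layers.Dense(16 * 16, activation="sigmoid", name="mask")(shared)

model = keras.Model(inputs, [class_output, mask_output])
model.compile(optimizer="rmsprop",
              loss={"class": "categorical_crossentropy", "mask": "binary_crossentropy"})
```

Because both heads share the same convolutional base, gradient updates from either task improve the representation used by the other.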

What we currently do along the lines of reusing models across tasks is leverage pre-trained weights for models that perform common functions, such as extracting visual features. You saw this in practice in Chapter 5. I expect that in the future a far more general version of this technique will become commonplace: we will not only reuse previously learned features (sub-model weights), but also model architectures and training procedures. As models become more like programs, we will start reusing program subroutines, much like the functions and classes found in ordinary programming languages.
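
A brief sketch of today's most common form of such reuse, echoing the Chapter 5 workflow (the choice of base, input shape, and dense head are my own assumptions here): a pretrained convolutional base with frozen weights, topped by a new task-specific classifier.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Reuse visual features learned on ImageNet instead of learning them from scratch.
conv_base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                     input_shape=(150, 150, 3))
conv_base.trainable = False                      # freeze the pretrained feature extractor

model = keras.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # new task-specific head
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```

The frozen base plays the role of a reusable "subroutine"; only the small head on top is learned for the new task.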

Think about how software development works today: as soon as an engineer solves a specific problem (HTTP requests in Python, for example), they package it as an abstract, reusable library. Engineers who face a similar problem in the future simply search for existing libraries, download one, and use it in their own project. In the same way, in the future, meta-learning systems will be able to assemble new programs by sifting through a global library of high-level reusable blocks. If the system finds itself developing similar subroutines for several different tasks, it can release an "abstract", reusable version of the subroutine and store it in the global library. Such a process would open the door to abstraction, a necessary component for achieving "extreme generalization": a subroutine that proves useful across many tasks and domains can be said to "abstract" some aspect of problem solving. This definition of "abstraction" is similar to the notion of abstraction in software engineering. These subroutines could be either geometric (deep learning modules with pre-trained representations) or algorithmic (closer to the libraries that contemporary programmers work with).



Figure: A meta-learning system capable of quickly developing task-specific models using reusable primitives (algorithmic and geometric), thereby achieving "extreme generalization".

The bottom line: long-term vision


In short, here is my long-term vision for machine learning:

  • Models will become more like programs, and will have capabilities that extend far beyond the continuous geometric transformations of the input data that we currently work with. These programs will arguably be much closer to the abstract mental models that people maintain about their surroundings and about themselves, and they will be capable of stronger generalization thanks to their algorithmic nature.

  • In particular, models will blend algorithmic modules providing formal reasoning, search, and abstraction capabilities with geometric modules providing informal intuition and pattern-recognition capabilities. AlphaGo (a system that required a lot of manual software engineering and architectural design) is an early example of what such a hybrid of symbolic and geometric AI might look like.

  • They will be grown automatically (rather than handwritten by human programmers), using modular parts from a global library of reusable subroutines - a library that evolves by absorbing high-performing models developed across thousands of previous tasks and datasets. Once the meta-learning system identifies common problem-solving patterns, it turns them into reusable subroutines - much like functions and classes in modern software engineering - and adds them to the global library. This is how abstraction is achieved.

  • The global library and the associated model-growing system will be able to achieve some form of human-like "extreme generalization": faced with a new task or a new situation, the system will be able to assemble a new working model for that task using very little data, thanks to: 1) rich program-like primitives that generalize well, and 2) extensive experience with similar tasks. In the same way, humans can quickly learn a new, complex video game because they have experience with many previous games, and because the models derived from that previous experience are abstract and program-like rather than a simple mapping from stimulus to action.

  • In essence, this perpetually learning model-growing system can be interpreted as Strong Artificial Intelligence. But don't expect any singularitarian robot apocalypse to ensue: that is pure fantasy, born of a long list of profound misunderstandings of both intelligence and technology. However, that critique does not belong here.
