# Foundations are being laid for a general theory of neural networks


### The tremendous capabilities of neural networks are sometimes matched by their unpredictability. Now mathematicians are beginning to understand how the shape of a neural network affects its function.

When we design a skyscraper, we expect it to meet its specifications: the tower will bear a certain weight and withstand an earthquake of a certain strength.

Yet one of the most important technologies of the modern world is, in effect, designed blindly. We play with different designs and tinker with different settings, but until we take the system for a test run, we have no real idea what it can do or where it will fail.

That technology is the neural network, which underlies today's most advanced artificial intelligence systems. Neural networks are steadily moving into the core areas of society: they determine what we learn about the world through our social media feeds, they help doctors make diagnoses, and they even influence whether a person convicted of a crime will be sent to prison.

Yet “the best description of what we know is to say that we know practically nothing about how neural networks actually work, and what a theory describing them should look like,” said Boris Ganin, a mathematician at the University of Texas and a visiting scientist at Facebook AI Research who studies neural networks.

He compares the situation to the development of another revolutionary technology: the steam engine. At first, steam engines could do little more than pump water. Later they powered locomotives, which is probably about the level of sophistication neural networks have reached today. Then scientists and mathematicians developed the theory of thermodynamics, which let them understand exactly what was happening inside an engine of any kind. Eventually, that knowledge took us into space.

“At first there were great engineering achievements, then great trains, and then it took a theoretical understanding to move from this to rockets,” Ganin said.

Within the growing community of neural network developers, a small group of mathematically minded researchers is trying to build a theory of neural networks: one that could explain how they work and guarantee that a network built in a certain configuration will be able to perform certain tasks.

The work is still at an early stage, but over the past year researchers have published several papers that describe in detail the relationship between form and function in neural networks. The work takes neural networks all the way down to their foundations. It shows that long before we can certify that neural networks can drive cars, we need to prove that they can multiply.

## The best recipe for a brain

Neural networks aim to mimic the human brain, and one way to describe the brain's work is to say that it merges smaller abstractions into larger ones. On this view, the complexity of a thought is measured by the range of smaller abstractions it draws on and by the number of times lower-level abstractions can be combined into higher-level ones, as in tasks such as learning to tell dogs from birds.

“If a person learns to recognize a dog, they learn to recognize something shaggy with four legs,” said Maitra Ragu, a graduate student in computer science at Cornell University and a member of the Google Brain team. “Ideally, we would like our neural networks to do something similar.”

*Maitra Ragu*

Abstraction comes naturally to the human brain. Neural networks have to work for it. Like the brain, neural networks are made up of building blocks called “neurons” that are connected to one another in various ways. The neurons of a neural network are modeled on the neurons of the brain, but they do not try to imitate them fully. Each neuron can represent an attribute, or a combination of attributes, that the network considers at each level of abstraction.

Engineers have many options for combining these neurons. They have to decide how many layers of neurons the network should have (that is, its “depth”). Consider, for example, a neural network that recognizes images. The image is fed into the first layer of the system. In the next layer, the network might have neurons that simply detect edges in the image. The following layer combines lines to identify curves. The one after that combines curves into shapes and textures, and the final layer processes shapes and textures to reach a conclusion about what it is looking at: a woolly mammoth!

“The idea is that each layer combines several aspects of the previous one. A circle is curves in many different places, a curve is lines in many different places,” said David Rolnik, a mathematician at the University of Pennsylvania.

Engineers also have to choose the “width” of each layer, corresponding to the number of different features that the network considers at each level of abstraction. In the case of image recognition, the width of the layers will correspond to the number of types of lines, curves or shapes that the neural network will consider at each level.

Beyond the depth and width of a neural network, there are also choices about how to connect neurons within and between layers, and about how much weight to give each connection.
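The design choices described above (depth, width, connections and weights) can be made concrete with a minimal sketch. It assumes a plain fully connected network with ReLU activations and random weights; the function names `make_network`, `forward` and `count_connections`, and the example widths, are hypothetical and chosen only for illustration.

```python
import math
import random

def make_network(layer_widths, seed=0):
    """Create random weights for a fully connected network.

    layer_widths lists the neurons per layer, input layer first.
    [4, 8, 8, 2] means 4 inputs, two hidden layers of width 8, 2 outputs.
    """
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_widths, layer_widths[1:]):
        # One weight per connection between consecutive layers.
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(layers, inputs):
    """Pass the inputs through every layer in turn."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        # ReLU on hidden layers, raw values on the output layer.
        activations = pre if is_last else [max(0.0, z) for z in pre]
    return activations

def count_connections(layer_widths):
    """Number of weighted connections between consecutive layers."""
    return sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))

widths = [4, 8, 8, 2]
net = make_network(widths)
output = forward(net, [0.5, -1.0, 2.0, 0.0])
print(len(net), len(output), count_connections(widths))  # 3 2 112
```

Changing `widths` is all it takes to make the network deeper (more entries) or wider (larger entries), which is exactly the space of choices the article says engineers currently explore by hand.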

If you have a specific task in mind, how do you know which neural network architecture will perform it best? There are some broad rules of thumb. For image recognition, programmers usually use “convolutional” neural networks, in which the same pattern of connections between layers is repeated from layer to layer. For natural language processing, such as speech recognition or language generation, programmers have found that recurrent neural networks work best. In these, neurons can be connected to neurons in non-adjacent layers as well.

Beyond these general guidelines, however, programmers mostly have to rely on experimental evidence: they simply run 1,000 different neural networks and see which one does the job best.

“In practice, these choices are often made by trial and error,” Ganin said. “That's a hard way to do it, because there are infinitely many choices and no one really knows which will be the best.”

A better approach would rely less on trial and error and more on an upfront understanding of what a particular neural network architecture can give you. Several recently published papers have moved the field in that direction.

“This work is aimed at creating something like a recipe book for designing the right neural network. If you know what you want to achieve with it, you can pick the right recipe,” Rolnik said.

## Lassoing red sheep

One of the earliest theoretical guarantees about neural network architecture appeared three decades ago. In 1989, a computer scientist proved that if a neural network has only a single computational layer, but that layer may contain an unlimited number of neurons with an unlimited number of connections between them, then the network can perform any task asked of it.

It was a sweeping statement that turned out to be fairly intuitive and not especially useful. It is like saying that if you can identify an unlimited number of lines in an image, you can distinguish between all objects using just one layer. That may be true in principle, but good luck putting it into practice.

Today, researchers call such wide, shallow networks “expressive,” meaning that in theory they can capture a richer set of relationships between possible inputs (such as an image) and outputs (such as a description of the image). Yet these networks are extremely difficult to train, which means it is practically impossible to teach them to actually produce those outputs. They also require more computing power than any computer can provide.
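For readers who want the 1989 result in symbols, it is usually stated along the following lines. The notation is introduced here for illustration and does not come from the article: $f$ is any continuous function on a compact set $K \subset \mathbb{R}^n$, $\sigma$ is a sigmoidal activation, and $\varepsilon > 0$ is the tolerated error. The theorem then says there exist a width $N$ and parameters $\alpha_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ such that

$$
\left| \, f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma\!\left( w_i^{\top} x + b_i \right) \right| < \varepsilon
\qquad \text{for all } x \in K .
$$

The catch is that the theorem only asserts that some finite $N$ exists. It gives no bound on how large $N$ must be and no procedure for finding the parameters, which is why the guarantee is of little practical use.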

*Boris Ganin*

More recently, researchers have been trying to understand how far neural networks can be pushed in the opposite direction, by making them narrower (fewer neurons per layer) and deeper (more layers). You might only be able to recognize 100 different lines, but with connections that turn those 100 lines into 50 curves, which can in turn be combined into 10 different shapes, you have all the building blocks you need to recognize most objects.

In a paper completed last year, Rolnik and Max Tegmark of MIT proved that by increasing depth and decreasing width, a network can perform the same tasks with exponentially fewer neurons. They showed that if the situation being modeled has 100 input variables, the same reliability can be achieved using either $2^{100}$ neurons in a single layer or just $2^{10}$ neurons spread over two layers. They found that there is power in taking small pieces and combining them at higher levels of abstraction, rather than trying to capture all levels of abstraction at once.
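To put the two neuron counts quoted above side by side, here is the plain arithmetic (no assumptions beyond the figures in the text):

$$
2^{10} = 1024,
\qquad
2^{100} \approx 1.27 \times 10^{30},
\qquad
\frac{2^{100}}{2^{10}} = 2^{90} \approx 1.24 \times 10^{27}.
$$

In other words, the shallow design would need roughly a billion billion billion times as many neurons as the two-layer one.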

“The notion of depth in a neural network is linked to the idea that you can express something complicated by doing many simple things in sequence,” Rolnik said. “It's like an assembly line.”

Rolnik and Tegmark proved the value of depth by asking neural networks to perform a simple task: multiplying polynomial functions. (These are equations with variables raised to natural-number exponents, for example $y = x^3 + 1$.) They trained the networks by showing them examples of equations and their products. Then they asked the networks to compute the products of equations they had not seen before. Deeper neural networks learned the task with far fewer neurons than shallower ones.
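To give a feel for why depth helps with multiplication, here is a minimal sketch of one known way to approximate a product with four smooth neurons, followed by a binary tree of such units that multiplies many inputs. It is an illustration under stated assumptions, not the experiment described above: the weights are set by hand rather than learned, the softplus activation and the `scale` parameter are choices made here, and the names `product_gadget` and `deep_product` are hypothetical. The trick relies on a Taylor expansion: for small inputs, σ(a+b) + σ(-a-b) - σ(a-b) - σ(-a+b) ≈ 4σ''(0)·ab, and for softplus 4σ''(0) = 1.

```python
import math

def softplus(z):
    # Smooth activation; its second derivative at 0 equals 1/4.
    return math.log1p(math.exp(z))

def product_gadget(u, v, scale=0.01):
    """Approximate u * v with four softplus neurons (hand-set weights)."""
    a, b = scale * u, scale * v  # shrink inputs so the Taylor expansion holds
    hidden = [softplus(a + b), softplus(-a - b),
              softplus(a - b), softplus(-a + b)]
    # Output weights +1, +1, -1, -1, then undo the input scaling.
    return (hidden[0] + hidden[1] - hidden[2] - hidden[3]) / (scale * scale)

def deep_product(values, scale=0.01):
    """Multiply many numbers with a binary tree of product gadgets.

    Returns (approximate product, neurons used, number of layers).
    """
    level = list(values)
    neurons = 0
    depth = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(product_gadget(level[i], level[i + 1], scale))
            neurons += 4
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd element passes through unchanged
        level = next_level
        depth += 1
    return level[0], neurons, depth

values = [1.5, 0.5, 2.0, 1.0, 0.8, 1.25, 1.1, 0.9]
approx, neurons, depth = deep_product(values)
print(approx, math.prod(values), neurons, depth)
```

With eight inputs the tree uses 4 × 7 = 28 neurons across 3 layers, so under this construction the neuron count grows linearly with the number of inputs. The much higher cost of doing the same in a single layer is the claim reported in the article and is not demonstrated by this sketch.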

And while multiplication is unlikely to turn our world upside down, Rolnik says the paper makes an important point: “If a shallow neural network can't even multiply, you shouldn't trust it with anything else.”

*David Rolnik*

Other researchers are investigating the minimum width a network needs. At the end of September, Jesse Johnson, formerly a mathematician at the University of Oklahoma and now a researcher at the pharmaceutical company Sanofi, proved that at a certain point no amount of depth can compensate for a lack of width.

To get a sense of this, imagine sheep in a field, but make them punk-rock sheep: each has been dyed one of several colors. The neural network's job is to draw a border around all the sheep of the same color. In essence, this task resembles image classification: the network has a set of images (which it represents as points in a high-dimensional space), and it needs to group similar ones together.

Johnson proved that a neural network will fail at this task if the width of its layers is less than or equal to the number of inputs. Each of our sheep can be described by two inputs: the coordinates of its position in the field, x and y. The neural network then labels each sheep with a color and draws a border around the sheep of the same color. In this case, solving the problem takes at least three neurons per layer.

More specifically, Johnson showed that if the ratio of width to the number of variables is too small, the neural network cannot draw closed loops, which is the kind of loop it would need to draw if, say, all the red sheep were clustered in the middle of the pasture. “If none of the layers is wider than the number of input dimensions, there are certain shapes the function will never be able to create, no matter how many layers you add,” Johnson said.

Work like this is beginning to form the core of a theory of neural networks. For now, researchers can make only very basic statements about the relationship between architecture and function, and those statements are few compared with the number of tasks neural networks are being used for.

So although a theory of neural networks will not change how they are designed any time soon, blueprints are being drawn up for a new theory of how computers learn, one whose consequences may prove even greater than sending a person into space.
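The simplest version of the sheep-pen picture can be sketched in a few lines. The example below covers only a single hidden layer with two inputs (x and y): three ReLU neurons, each acting as one straight fence line, enclose a triangle, while two such neurons leave an open wedge. The fence directions, the names `score`, `is_inside` and `is_closed_loop`, and the sampling-based check are all assumptions made for illustration. Johnson's result concerns networks of any depth and is not proved or reproduced here.

```python
import math

# Three unit normals, 120 degrees apart (an illustrative choice of fences).
NORMALS = [
    (math.cos(math.radians(angle)), math.sin(math.radians(angle)))
    for angle in (90, 210, 330)
]

def relu(z):
    return max(0.0, z)

def score(point, normals):
    """One hidden layer: each neuron fires only beyond its own fence line."""
    x, y = point
    return sum(relu(nx * x + ny * y - 1.0) for nx, ny in normals)

def is_inside(point, normals):
    # A point counts as "inside the pen" when no hidden neuron fires.
    return score(point, normals) == 0.0

def is_closed_loop(normals, radius=10.0, samples=360):
    """Sampled check: is every point on a large circle outside the pen?"""
    for k in range(samples):
        angle = 2.0 * math.pi * k / samples
        point = (radius * math.cos(angle), radius * math.sin(angle))
        if is_inside(point, normals):
            return False  # the pen leaks out to the circle in this direction
    return True

print(is_closed_loop(NORMALS))      # three neurons: True
print(is_closed_loop(NORMALS[:2]))  # two neurons: False
```

With two fences the "inside" region stretches off to infinity in one direction, so red sheep in the middle of the pasture cannot be separated from sheep far away in that direction. The third neuron is what closes the loop.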