Hinton's Capsule Networks
On October 27, 2017, an article by Dr. Geoffrey Hinton et al. from Google Brain appeared. Hinton is more than a renowned scientist in the field of machine learning: he helped develop the mathematics of backpropagation of errors and was the scientific adviser of Yann LeCun, the author of the convolutional network architecture.
Although the presentation was rather modest, it is fair to speak of a revolutionary change in the approach to artificial neural networks (ANNs). The authors called the new approach "capsule networks". So far there is little information about them in the Russian-speaking part of the Internet, so I will fill this gap.
As it turns out, Hinton has spent recent years looking for ideas that would allow doing something better than convolutional networks in computer vision. What did he dislike about CNNs (convolutional neural networks)? Mainly the max-pooling layer and the fact that detection is invariant only to position in the image. Max-pooling is used to reduce the dimensionality of the outputs of convolutional layers and to ignore small differences in the spatial structure of images (or other data types).
From the image it is clear how this layer works: it selects the maximum within a window over the output of the convolutional layer. In the process, part of the information is lost. Admittedly, this is not so easy to prove, and there may even be classes of objects for which no information loss occurs. Moreover, a number of convolutional network architectures simply have no max-pooling layer at all. Nevertheless, max-pooling really is a layer that is open to criticism.
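As a toy illustration (not code from the paper), here is a minimal numpy sketch of 2x2 max pooling, showing the kind of positional information it discards:

```python
import numpy as np

def max_pool_2x2(x):
    """Naive 2x2 max pooling with stride 2 over a 2D feature map."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two different feature maps...
a = np.array([[0, 9],
              [0, 0]])
b = np.array([[0, 0],
              [0, 9]])

# ...collapse to the same pooled output: the position inside the window is lost.
print(max_pool_2x2(a))  # [[9]]
print(max_pool_2x2(b))  # [[9]]
```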
There is even more criticism of convolutional networks at the end of one of my previous articles. Hinton also speaks of similar problems.
In addition, the large elephant in the room that we engineers try not to notice is the minicolumn in the cerebral cortex, which clearly must have an understandable, limited, and not too primitive function.
So Hinton says: neural capsules are minicolumns; they can correspond to recognizable objects and their characteristics. The total activity of the neurons in a capsule characterizes the probability of recognition. The output of a capsule should be enough to restore its input almost without loss.
Architecture
For those who have worked even a little with convolutional networks, this picture already explains almost everything:
So, first comes the standard extraction of features that are invariant to translation across the image, using a convolutional layer. Even if you do not know what a convolutional layer is, you can still read on; this is not critical here.
Then another 32 convolutional layers are computed, each of which looks at the first convolutional layer. This is already unusual, but the picture does not explain why. The reason is dynamic routing (yes, "routing", even though it risks sounding like a conversation about network technology). This is one of the two "tricks" that distinguish this architecture from the rest, so it deserves a separate chapter.
Then comes the "DigitCaps" layer, the actual capsules Hinton is talking about. Although he calls the previous layer "PrimaryCaps", i.e. primary capsules, those do not by themselves carry a particularly new function. The only important thing is how they then connect to "DigitCaps".
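To make the wiring concrete, here is a minimal shape sketch, assuming the dimensions reported in the paper (28x28 MNIST input, a 9x9 convolution with 256 channels, then a PrimaryCaps convolution with 32 channels of 8-dimensional capsules):

```python
import numpy as np

# Shapes assumed from the paper: MNIST 28x28 input,
# Conv1: 256 filters 9x9 stride 1 -> 20x20x256,
# PrimaryCaps: 32 channels of 8D capsules, 9x9 conv stride 2 -> 6x6x(32*8).
primary_out = np.random.randn(6, 6, 32 * 8)   # stand-in for the conv output

# The "32 convolutional layers" are just one conv output regrouped into
# 6*6*32 = 1152 capsules, each an 8-dimensional vector.
capsules = primary_out.reshape(-1, 8)
print(capsules.shape)  # (1152, 8) -- these connect to 10 DigitCaps of dim 16
```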
So each capsule (or minicolumn) has a strict meaning: the length of the vector in each of them corresponds to the probability of one of the digits being in the image.
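In the paper this is enforced by the "squash" nonlinearity, which shrinks short vectors toward zero and long ones toward unit length without changing their direction, so the length can act as a probability. A minimal numpy sketch:

```python
import numpy as np

def squash(s, eps=1e-8):
    """The paper's squash nonlinearity: maps any vector to length in [0, 1)."""
    norm2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

v = squash(np.array([3.0, 4.0]))                      # input norm 5
print(np.linalg.norm(v))                              # ~0.96: "digit present"
print(np.linalg.norm(squash(np.array([0.1, 0.0]))))   # ~0.01: "digit absent"
```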
A rather important part of the architecture lurks at the bottom of the image: a decoder, which takes the DigitCaps layer as input, with all the capsules zeroed out except the one that "won" in length. The decoder is essentially an inverted ANN (artificial neural network) with a small vector at the input and the original image at the output. This decoder provides the second most important "trick" of this architecture: the contents of one capsule must fully describe the particular instance of the digit supplied to the input; otherwise the decoder will not be able to restore the original image. The decoder has another useful property: it acts as a regularization method. Codes that are close in capsule space correspond to images that are close in the Euclidean sense.
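A minimal numpy sketch of the masking plus decoding step; the 512-1024-784 layer sizes follow the paper, while the weight values here are random stand-ins for trained parameters:

```python
import numpy as np

def mask_and_decode(digit_caps, weights):
    """digit_caps: (10, 16) capsule outputs. Zero all but the longest
    capsule, then decode the surviving 16D code back to a 28x28 image."""
    lengths = np.linalg.norm(digit_caps, axis=-1)
    mask = np.zeros_like(digit_caps)
    mask[np.argmax(lengths)] = 1.0
    code = (digit_caps * mask).reshape(-1)        # (160,), mostly zeros

    relu = lambda x: np.maximum(x, 0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h1 = relu(code @ weights["W1"])               # (512,)
    h2 = relu(h1 @ weights["W2"])                 # (1024,)
    return sigmoid(h2 @ weights["W3"]).reshape(28, 28)

weights = {"W1": np.random.randn(160, 512) * 0.01,   # stand-ins, not trained
           "W2": np.random.randn(512, 1024) * 0.01,
           "W3": np.random.randn(1024, 784) * 0.01}
img = mask_and_decode(np.random.randn(10, 16), weights)
print(img.shape)  # (28, 28)
```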
For example, the article shows the result of "perturbing" the components in one of the capsules (in this case, the one responsible for the digit 6):
According to Hinton, the components of the capsule vector are responsible for the form in which the digit is drawn.
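To reproduce that kind of experiment, one can nudge a single component of the winning capsule and decode each variant; a sketch reusing the hypothetical mask_and_decode and weights from the decoder sketch above (the range of -0.25 to 0.25 in steps of 0.05 follows the paper):

```python
import numpy as np

# Reuses mask_and_decode and weights from the decoder sketch above.
digit_caps = np.random.randn(10, 16)   # stand-in for real DigitCaps output
dim = 3                                # which of the 16 components to vary
for delta in np.arange(-0.25, 0.30, 0.05):
    tweaked = digit_caps.copy()
    winner = np.argmax(np.linalg.norm(tweaked, axis=-1))
    tweaked[winner, dim] += delta
    img = mask_and_decode(tweaked, weights)
    # each img shows how one component shifts e.g. stroke thickness or slant
```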
Dynamic routing
Hinton, Frosst, and Sabour write: "There are many ways to implement the general idea of capsule networks, and dynamic routing in the proposed form is just one example." So it may turn out that more elegant solutions appear soon. But for now, everything works as follows:
Each capsule of the previous layer is connected to each capsule of the next. Here "connected" means that there is a weight matrix between every capsule of the previous layer and every capsule of the next. These weights are trained by backpropagation, as in a conventional artificial neural network. But on top of these weights, a coefficient multiplier is applied to each pair, and each multiplier should eventually tend to one or zero. In the limit, each capsule of the upper level "listens" to only one capsule of the lower level at a time. In the image above, the first capsule of the upper level pays attention only to the second capsule of the lower level, and the second upper capsule only to the third lower one. Note that this matrix of coefficients is temporary and must be recomputed for each example. Below I will call it the "temporary matrix".
How is this temporary matrix calculated? By a rather tricky iterative process: when a new example is presented to the capsule network, it is estimated what contribution each lower-level connection makes to each upper capsule's vector. The connection that makes the largest contribution increases its weight in the temporary matrix, and then everything is normalized so that the coefficients do not run off to infinity. This is repeated 3 times.
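Here is a compact numpy sketch of that routing loop as the paper describes it (the 1152/10/16 shapes are the MNIST ones; u_hat stands for the lower-capsule predictions already multiplied by the trained weight matrices):

```python
import numpy as np

def squash(s, eps=1e-8):
    # Same squash nonlinearity as in the earlier sketch.
    n2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: (1152, 10, 16) predictions of each lower capsule i for each
    upper capsule j. Returns the upper-level capsule outputs v: (10, 16)."""
    b = np.zeros(u_hat.shape[:2])              # routing logits, reset per example
    for _ in range(iterations):
        c = softmax(b, axis=1)                 # "temporary matrix": sums to 1 over j
        s = np.einsum('ij,ijk->jk', c, u_hat)  # weighted sum of predictions
        v = squash(s)
        b += np.einsum('ijk,jk->ij', u_hat, v) # agreement raises the coupling
    return v

v = dynamic_routing(np.random.randn(1152, 10, 16))
print(np.linalg.norm(v, axis=-1))  # 10 capsule lengths in [0, 1)
```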
It is not obvious, but this is in fact an implementation of the "winner takes all" principle; well, because of the iterative scheme, "the winner takes almost all". This is how the division of roles at the lower level occurs: each element comes to describe its own aspect of the situation for each of the upper-level capsules.
Thus, during recognition (as during training), for each capsule responsible for a specific digit, the following problem is solved: which of the lower-level capsules best describe the observed image. Then the length of each vector in the upper-level capsules is evaluated to decide which digit is most likely.
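For training, the paper scores these lengths with a margin loss that pushes the correct capsule's length above 0.9 and the others below 0.1; a minimal numpy sketch:

```python
import numpy as np

def margin_loss(v, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Paper's margin loss over capsule lengths. v: (10, 16) DigitCaps
    outputs, targets: one-hot (10,). Because each class is penalized
    independently, two digits can 'win' at once on overlapping images."""
    lengths = np.linalg.norm(v, axis=-1)
    present = targets * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1 - targets) * np.maximum(0.0, lengths - m_minus) ** 2
    return np.sum(present + absent)

targets = np.eye(10)[6]                        # example: the image shows a 6
print(margin_loss(np.random.rand(10, 16), targets))
```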
If you want to understand the mathematics in detail, it is described in the original article, and in even more detail here.
Experimental results
The experiments were carried out on the well-known MNIST image database, which contains the 10 handwritten digits in many variations.
What are these table rows about? Adding the decoder and dynamic routing with three iterations helps. The second column shows the result on MNIST when two digits are superimposed on each other. It is an interesting ability of capsule networks to pick not one winner capsule but two at once. The only case that does not work is when two copies of the same digit are drawn on top of each other in slightly different ways.
IMHO, on MNIST you can get any accuracy you like. The test set size is 10,000, which means that at 99.7% accuracy there are only 30 errors. By tweaking various parameters of the algorithm you can always reduce this error through pure randomness. Therefore we can only talk about the contribution of one mechanism or another to the result (and even then it will be debatable and statistically unreliable). For example, the whole dynamic routing mechanism gives an improvement of only 0.04%, i.e. 4 images: there were 29 errors, it became 25. The likelihood that dynamic routing did not actually help is still quite serious.
Of course, I think it all works and helps; just do not pay much attention to the result on MNIST. The concept itself is what matters, and MNIST made it possible to debug and verify the implementation. And that is an important step: now everyone can touch a working implementation, for example in TensorFlow, and figure out how it works first-hand.
What does all this mean? A strong AI that will enslave us all?
Due to the dynamic development of the concept of convolutional networks (and recurrent ones too, though Hinton does not explain how his ideas can absorb them, although they can), a couple of years ago it began to seem to everyone in the industry that nothing more was needed. Everything trains, just give it data! Solving all the remaining tasks seemed only a matter of time. But this is not so. And I believe it is very important that the person considered the father of deep learning speaks of a severe flaw in existing approaches and is looking for new directions.
This means that from October 27, 2017, more research groups will search for the next step in artificial intelligence, more people will articulate how "weak" AI differs from "strong", and competition will begin among theories that were previously marginal but may later set the trend, theories that should replace the existing models. And all this is very good! A new Wild West is opening up in the field of AI.
An announcement, taking the opportunity
Not so long ago, in this article, armed with the ideas of Alexey Redozubov, which are almost orthogonal to the existing established consensus, I tried to show an alternative approach using the example of license plates.
Our research has not stood still over the last year either. For example, we wrote a demonstration on MNIST in Keras that is no longer embarrassing to share with the community. And capsule networks only convinced me that formulating the function of minicolumns is extremely important at the current stage, and our formulation is somewhat different from Hinton's (for the better, of course).
Therefore, in the coming days I will publish a similar article following up on the MNIST training and the articles on arxiv, within the framework of a concept of "not neural networks at all" (we still cannot come up with a name for it).