Not neural networks at all
Recently, ZlodeiBaal wrote about the achievements of convolutional neural networks (CNNs) (and, by the way, successfully set up and trained a network right there to find the license plate area).
I want to talk about a fundamentally different and probably more complex model that Alexey Redozubov (AlexeyR) is currently developing, and about how we used it to recognize car license plates, admittedly ignoring some important elements of the model along the way.
In this article I will recall some aspects of the concept in a somewhat simplified form and show how it worked in our task.
Anyone familiar with computational neural networks who spends a little time reading the latest news from neurobiology will feel somewhat uneasy. A paranoid thought may even appear: "Maybe it is futile to try to assemble AI out of neurons; maybe they are not the main players at all?" Neurons were simply easier to study because of their electrical activity. Glial cells, for example, are far more numerous than neurons, and their function is still not fully understood.
I cannot shake the impression that people working in deep learning rely on ideas from scientists of the 1940s-60s and are in no hurry to work through the huge body of research by neurobiologists.
AlexeyR has taken aim at a concept of the brain that is consistent with somewhat more recent research by neurophysiologists.
For me personally, an important test of the adequacy of an idea is its constructiveness. Is it possible to arm yourself with this idea and build something that really works?
Here I will try to convince you that the concept is constructive and very promising, including from a practical point of view.
To begin, here is my interpretation of the main points of Redozubov's ideas that seem important to me:
Non-classical neuron and wave identifiers
Scientists have long noticed that the body of a neuron also carries metabotropic receptors outside the synaptic clefts, and it is not obvious what they do there. The model assigns them an original mechanism: such a receptor remembers a strictly defined pattern of activity in its surroundings, and when that pattern occurs, the neuron carrying the receptor fires a single spike (of the kind called spontaneous). This tricky mechanism makes it possible to remember and propagate a unique information wave. The source of this model can be found in this series of articles. On the whole, this is very similar to the spread of rumors in a social group: people do not all have to live nearby or gather together for word to spread; it is only important that ties between neighbors are tight enough.
Honestly, I am infinitely far from cellular neurophysiology, but even after reading the wiki article I can accept that information may be transmitted by some route other than the synaptic one.
Second, following from the first.
Different areas of the brain are connected by not that many axons. There are of course a great many of them, but certainly not from every neuron to every neuron of the next level, as is customary in computational neural networks today. And if we have learned to propagate unique identifier waves, then zones can communicate using such discrete identifiers. All information processed by the brain will be described by discrete identifier waves (simply put, by numbers), for example:
- position (say, the position of the projection of an object on the retina);
- time (subjective sensation of time by a person);
- scale;
- sound frequency;
- color, etc.
And we will use a packet description, i.e. simply a list of such discrete identifiers, to describe what the zone has recognized.
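A packet description of this kind can be sketched as a plain collection of (kind, code) pairs. The sketch below is only an illustration under my own assumptions: the names `Identifier` and `packet_overlap` are hypothetical, and comparing packets by the share of common identifiers (the Jaccard index) is one simple choice, not something prescribed by the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """One discrete identifier: what is described and its code (both hypothetical)."""
    kind: str   # e.g. "position_x", "scale", "color"
    code: int   # discrete value, "simply a number"


def packet_overlap(a, b):
    """Share of identifiers common to two packets (Jaccard index, an assumption)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two packets that differ only in scale.
seen = [Identifier("position_x", 12), Identifier("position_y", 40),
        Identifier("scale", 3), Identifier("color", 7)]
other = [Identifier("position_x", 12), Identifier("position_y", 40),
         Identifier("scale", 5), Identifier("color", 7)]

similarity = packet_overlap(seen, other)
print(similarity)
```

Because the identifiers are discrete, two packets can be compared with ordinary set operations, with no dense weight matrix between zones.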
Third.
Here everything is turned upside down compared with what everyone has been used to since Hubel and Wiesel.
A small digression into history:
In 1959 Hubel and Wiesel set up an interesting experiment. They observed how neurons in the first level of the visual cortex respond to various stimuli, and they found a certain organization there. Some neurons responded to one orientation of the visual stimulus, others to another. Moreover, within one minicolumn (a vertical structure of 100-300 neurons in the neocortex), all neurons responded to the same stimuli. Hundreds, if not thousands, of studies with modern equipment followed, and selectivity to all sorts of things was found in a variety of zones: to orientation, to spatial frequency, to position, to speed of movement, to sound frequency. Whatever parameter relevant to a given part of the brain was tested, some selectivity was invariably found.
[Figure: visual cortex, zone V1]
[Figure: selectivity depending on sound frequency, zone A1]
It is quite natural to conclude from this that, as they learn, neurons change their synaptic weights so as to recognize a line or edge of a certain orientation in their receptive field. Following these ideas, LeCun built convolutional networks.
But it would be much more effective, says Alexey Redozubov, if a minicolumn remembered not a specific feature at all, but a context. One context, for example, is the angle of rotation; a second is position; a third is scale. The features themselves would then be shared across some neighborhood of the cortex.
Thus, self-organizing maps are needed not for the input image (in the visual cortex) but for the various contexts relevant to the visual cortex. The closeness of these contexts can be estimated either from their proximity in time or from the similarity of their identifier codes.
Why is all this complexity with context needed? Because any information we deal with looks completely different depending on the context. When there is no a priori information about the context in which the observed entity currently appears, all possible contexts have to be considered. In the proposed model, this is the job of the neocortex minicolumns.
This approach is just a different interpretation of the same hundreds or thousands of experiments on self-organization in the cerebral cortex. Two results cannot be told apart in such an experiment:
1) I see a border tilted 45 degrees relative to the vertical.
2) I see a vertical border, rotated by 45 degrees.
These seem to be the same thing. But under the second interpretation, exactly the same activity can be found in the same minicolumn if we show, say, a person's face: "I see an upright face, rotated by 45 degrees". The zone is then not limited to perceiving only one type of object.
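The second interpretation can be illustrated with a toy sketch: the memory of features is shared, and each "minicolumn" only undoes its own context (here a rotation angle) before comparing the input with that memory. Everything in it is my own assumption for illustration: the point-list patterns, the four angles, the squared-distance mismatch and the function names are not taken from the model itself.

```python
import math


def rotate(points, degrees):
    """Rotate 2D points counter-clockwise around the origin."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def mismatch(a, b):
    """Sum of squared distances between corresponding points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b))


# Shared memory: each pattern is stored once, in its canonical (upright) form.
MEMORY = {
    "border": [(0.0, -1.0), (0.0, 0.0), (0.0, 1.0)],
    "corner": [(0.0, 1.0), (0.0, 0.0), (1.0, 0.0)],
}
# One hypothetical context per minicolumn: a rotation angle in degrees.
CONTEXTS = [0, 45, 90, 135]


def recognize(observed):
    """Try every (context, memory) pair and return the best (label, angle)."""
    hypotheses = (
        (mismatch(rotate(observed, -angle), pattern), label, angle)
        for angle in CONTEXTS
        for label, pattern in MEMORY.items()
    )
    _, label, angle = min(hypotheses)
    return label, angle


observed = rotate(MEMORY["border"], 45)   # a vertical border, rotated by 45 degrees
result = recognize(observed)
print(result)
```

Adding a face to `MEMORY` would not require any new contexts: the same 45-degree hypothesis would serve it as well, which is the point of the second interpretation.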
On the other hand, context and recognized phenomenon swap places in another zone. For example, there are two pathways along which visual information is analyzed: dorsal and ventral. For the dorsal stream, the context is location in space, direction of movement, etc. For the ventral stream, the context may be characteristics of the observed object, even the type of object. If one of the visual processing streams is damaged, the person does not go blind. There will be some problems with perceiving several objects at once, but over time the ability to perceive several objects and interact with them partly returns. That is, a good description of an object is obtained in both the dorsal and the ventral stream. But it is better not to get ill at all and to consider, on the one hand, hypotheses about position and orientation and, on the other hand, hypotheses like "I see a person", "I see a head", "I see a chair". Tens of millions of hypotheses of both kinds are analyzed simultaneously in the visual pathway.
And now, finally, about the license plate recognition algorithm based on AlexeyR's ideas.
License plate recognition
I will begin with examples of license plate recognition.
By the way, plate boundaries are not used here, i.e. this recognition method bears little resemblance to classical algorithms. As a result, precious percentage points are not lost to an initial error in locating the plate boundaries. Although with this approach nothing would have been lost anyway.
Architecture
To recognize license plates, we used two zones:
1) The first zone recognized everything that looked like the letters and digits of a license plate.
Input information: images.
The whole zone was divided along 5 parameters:
- position along X
- position along Y
- orientation
- scale along the X axis
- scale along Y
As a result, there were about 700,000 hypotheses. Even for a laptop video card this was not a big problem.
Zone output: a description of the form:
such-and-such character, position, orientation, scale.
Visualization of the output:
It is noticeable that far from all characters could be recognized at this level, and there were many false positives when the plate was dirty enough.
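To get a feel for where a figure like 700,000 comes from: the number of hypotheses in the first zone is simply the product of the grid sizes along the five context axes. The per-axis sizes below are invented so that the product matches the figure above; the actual step counts are not given here, so treat them purely as an assumption.

```python
from itertools import islice, product
from math import prod

# Hypothetical grid resolution per context axis (assumed values, chosen
# only so that the product equals the total quoted in the text).
grid = {
    "x": 40,
    "y": 25,
    "orientation": 7,
    "scale_x": 10,
    "scale_y": 10,
}

total_hypotheses = prod(grid.values())


def iter_contexts(grid):
    """Lazily yield every context as a dict of per-axis indices."""
    names = list(grid)
    for combo in product(*(range(n) for n in grid.values())):
        yield dict(zip(names, combo))


# Peek at the first few contexts without materializing all of them.
first_three = list(islice(iter_contexts(grid), 3))
print(total_hypotheses)
print(first_three)
```

Each such context is an independent hypothesis, which is why the whole set maps naturally onto a GPU.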
2) The second zone.
Input information: the output of the first zone.
It dealt with perspective transformations. There were about 6 million hypotheses. The only stable pattern recognized here was a license plate in the format of 6 larger characters on the left and 3 smaller characters on the right.
All possible perspective transformations were checked. Under one of them, the characters from zone 1 "fit" best into the license plate pattern known to us. This maximum won.
And since we know in which contexts the characters must lie for a given perspective transformation, we can project them back onto the first zone and find the correct local maxima in the space of the first zone.
The algorithm took about 15 s to run on a far-from-fastest NVIDIA GeForce GT 740M.
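The logic of the second zone can be sketched on a deliberately simplified toy. Assumptions, all mine: positions are one-dimensional, a hypothesis is just a (shift, scale) pair instead of a full perspective transformation, the slot offsets in `TEMPLATE` are invented, and the function names are hypothetical. The sketch checks every hypothesis, keeps the one under which the first-zone detections fit the plate template best, and then projects the template back to find the characters the first zone missed.

```python
from itertools import product

# Plate template: 6 larger character slots, then 3 smaller ones.
# Offsets are in template units and are made up for illustration.
TEMPLATE = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.2, 6.9, 7.6]
TOL = 0.5  # how close a detection must be to a slot to count as a match


def project(template, shift, scale):
    """Map template slots into image coordinates for one hypothesis."""
    return [shift + scale * t for t in template]


def score(detections, slots, tol=TOL):
    """Count the slots that have at least one detection within tol."""
    return sum(any(abs(d - s) <= tol for d in detections) for s in slots)


def best_hypothesis(detections, shifts, scales):
    """Exhaustively test every (shift, scale) pair and keep the maximum."""
    return max(product(shifts, scales),
               key=lambda h: score(detections, project(TEMPLATE, *h)))


# Simulated first-zone output: the true plate is at shift=50, scale=10,
# three characters were missed and two false positives were added.
true_slots = project(TEMPLATE, 50, 10)
detections = [s for i, s in enumerate(true_slots) if i not in (1, 4, 7)]
detections += [17.0, 203.0]

shift, scale = best_hypothesis(detections, range(0, 101, 5), range(6, 15))

# Back projection: where the characters must be under the winning hypothesis.
expected_positions = project(TEMPLATE, shift, scale)
recovered = [s for s in expected_positions
             if not any(abs(d - s) <= TOL for d in detections)]
print(shift, scale, recovered)
```

The winning hypothesis is supported by only six of the nine slots, yet it still tells the first zone exactly where to look for the three missing characters.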
More recognition examples
The third and fourth columns of license plates are especially revealing.
You can cover half of the plate (or even more), and the only correct hypothesis about the position and orientation of the plate will still be chosen, because we know its model. This is a fundamental property of the human brain: recognizing objects and phenomena (sometimes erroneously) from just a few features.
In the fourth example, most of the characters were "lost" because of the imperfect work of the first zone. But again, thanks to the notion of what a license plate looks like, the perspective transformation hypothesis was selected correctly. And then, thanks to back projection onto the first zone of information about where the characters should be, the plate is recognized successfully! Part of the information was lost, but it was restored from what we know: "there should be letters and digits there".
Training
In the example above, training was, on the one hand, the simplest possible: "shown once and remembered". On the other hand, to obtain a strong recognition algorithm it was enough to show just 22 characters to the first zone and one license plate to the second zone. This is superfast training; thousands of images were not needed.
Of course, the data was not noisy and all relevant features were defined in advance, so self-learning was not fully exercised in this implementation.
But the concept has an unusual and powerful self-learning mechanism. More precisely, there are two things to be learned:
1) How the same "features" are transformed in different contexts. This also means answering a no less difficult question: what is the context for a given zone? Intuitively, for sound it seems to be frequency and tempo; for movements, direction and tempo; for images, 2D transformations in some zones and already 3D transformations in others.
2) Finding stable situations, i.e. sets of features that often appear in different contexts.
Over the past few months we have managed to make some progress toward self-learning. For example, when working with visual images, it is enough for an object to appear in the frame twice in different contexts (scale, orientation, position) to obtain its portrait, which is then suitable for recognition despite a rather complicated background and interference. All of this looks like superfast self-learning.
Unfortunately, a detailed description of self-learning mechanisms will remain outside the scope of this article.
And what about Yann LeCun's convolutional networks?
Convolutional networks (CNNs) also fit perfectly into this model. A convolution with kernels is performed for every position X, Y. That is, at each level, one of the memories (kernels) is searched for at every position, so position is precisely the context. It is even necessary to keep neighboring positions next to each other, because this is used at the following downsample level, where the local maximum of 4 points is selected. Thanks to this, the network is effectively trained for all positions in the image at once. In other words, if we have learned to recognize a cat in the upper left corner, we will certainly recognize it in the center too. It should only be noted that we will not immediately recognize it rotated by 45 degrees or at a different scale. A large training set is required for that.
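The "position is the context" reading of a CNN is easy to check on a toy example. The sketch below is a minimal pure-Python illustration, not a real network: one hand-written 3x3 kernel plays the role of a memory, `correlate` applies it at every position, and `max_pool_2x2` takes the local maximum of 4 points. The pattern, the image size and the helper names are my own assumptions.

```python
def correlate(image, kernel):
    """Valid cross-correlation: apply the same kernel at every (x, y)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(len(image[0]) - kw + 1)]
            for y in range(len(image) - kh + 1)]


def max_pool_2x2(fmap):
    """Downsample by keeping the local maximum of each 2x2 block (4 points)."""
    return [[max(fmap[y][x], fmap[y][x + 1],
                 fmap[y + 1][x], fmap[y + 1][x + 1])
             for x in range(0, len(fmap[0]) - 1, 2)]
            for y in range(0, len(fmap) - 1, 2)]


def place(pattern, top, left, size=8):
    """Draw a pattern on an empty size x size image."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


def rotate90(pattern):
    """Rotate a square pattern by 90 degrees clockwise."""
    return [list(row) for row in zip(*pattern[::-1])]


corner = [[1, 1, 1],
          [1, 0, 0],
          [1, 0, 0]]  # the single "memory" (kernel)


def peak(image):
    """Strongest response of the kernel anywhere in the pooled feature map."""
    return max(max(row) for row in max_pool_2x2(correlate(image, corner)))


peak_top_left = peak(place(corner, 0, 0))
peak_center = peak(place(corner, 3, 4))
peak_rotated = peak(place(rotate90(corner), 3, 4))
print(peak_top_left, peak_center, peak_rotated)
```

The response is the same wherever the pattern is placed, but it drops as soon as the pattern is rotated: translation comes for free, every other transformation has to be learned from additional examples.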
Thanks to this design, convolutional networks have become a very powerful tool in skilled hands. But they have several drawbacks that can be compensated for by various methods yet remain fundamental:
- only the position context is hard-wired, although even when working with images, orientation, scale, perspective, speed of movement (for video) and other parameters are important;
- which means the training sample has to be enlarged, because a convolutional network has no transformation mechanism other than translation, and one has to hope that the remaining regularities will "emerge" after long training;
- there is no "remember everything" mechanism, only changes to the weights. What has been seen once will not help with further training, which means the training sample must be properly organized and, in any case, enlarged yet again. In other words, memory is separate from the CNN architecture;
- information is lost during downsampling and subsequent upsampling (in CNN-based autoencoders), i.e. if you build feedback or back projection in a CNN, there are problems with reproduction accuracy;
- each subsequent level continues to operate with spatial invariance "in the plane", and at the upper levels, for lack of constructive ideas, people fall back on, for example, a fully connected architecture;
- it is quite difficult to track and debug the "models" formed during training, and they are not always correct.
Convolutional networks can be strengthened, for example by "hard-wiring" several more parameters besides X and Y. But in general one has to develop far-from-trivial algorithms for self-organizing the context map, and organize "reinterpretation" of memories, generalization across contexts, and much more, none of which much resembles the basic ideas of CNNs.
Conclusion
One example of license plate recognition, implemented in a heavily truncated form, is of course not enough to claim that we hold a new constructive model of how the brain works. It is just a first attempt. There is still a lot of interesting work ahead. AlexeyR is now preparing examples with the classic MNIST in a single zone and with self-learning of several zones for speech recognition. In addition, we are learning to work with less structured information: human faces. Here powerful self-learning is indispensable. And it is precisely these practical examples that force us to revise some details of the concept literally every 2 weeks. You can talk about an elegant concept as beautifully as you like and write great books, but it is far more important to create programs that actually work.
As a result, step by step, we should arrive at a sufficiently developed model that is much more universal with respect to input information than existing CNNs or, say, RNNs, and that sometimes learns from just a few examples.
We invite you to follow the development of the model in this blog and in AlexeyR's blog!