The limitations of image recognition algorithms



    No, this is not about image recognition algorithms as such - it is about the limitations of their use, particularly when building AI.

    In my opinion, the recognition of visual images by a person and by a computer system are so different that they have almost nothing in common. When a person says “I see,” he is actually thinking more than he is seeing - which cannot be said of a computer system equipped with image recognition hardware.

    I know the idea is not new, but I propose to test its validity once more using the example of a robot that claims to possess intelligence. The test question: how must a robot see the surrounding world in order to fully resemble a person?

    Of course, the robot must recognize objects. Oh yes, the algorithms cope with this - through training on labeled samples, as I understand it. But this is catastrophically little!

    I.
    Firstly, every object in the surrounding world consists of many objects and is, in turn, a subset of other objects. I call this property nesting. But what if an object simply has no name, and is therefore missing from the database of samples the algorithm was trained on - what should the robot recognize in that case?

    The cloud I am currently observing through the window has no named parts, although it obviously consists of edges and a middle. There are simply no special terms for the edges or the middle of a cloud; none have been coined. To refer to the unnamed object I used a verbal formulation (“cloud” is the object type, “cloud edge” is the formulation), and producing such formulations is beyond the capabilities of an image recognition algorithm.

    It turns out that an algorithm without a logical block is of little use. When the algorithm detects a part of a whole object, it will not always be able to figure out - and accordingly the robot will not be able to say - what that part is.
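    For clarity, here is a rough sketch in Python of what that logical block might look like. Everything in it is invented for illustration - the label set, the stand-in recognizer, and the way an unnamed part is described through the name of the whole:

```python
# A minimal sketch of the closed-vocabulary problem. The "recognizer" is a
# stand-in for any model trained on a fixed set of labels.

KNOWN_LABELS = {"cloud", "sky", "tree", "house"}   # hypothetical training vocabulary

def recognize(region: dict):
    """Pretend recognizer: returns a known label, or None for anything unnamed."""
    label = region.get("true_label")
    return label if label in KNOWN_LABELS else None

def describe(region: dict, whole: dict = None) -> str:
    """The extra 'logical block': when a part has no name of its own,
    build a verbal formulation from the name of the whole."""
    label = recognize(region)
    if label:
        return label
    if whole and recognize(whole):
        # "edge of the cloud" is not a class in the vocabulary; it is a phrase
        # assembled by logic on top of recognition.
        return f"{region.get('position', 'part')} of the {recognize(whole)}"
    return "unnamed object"

cloud = {"true_label": "cloud"}
cloud_edge = {"true_label": None, "position": "edge"}
print(describe(cloud_edge, whole=cloud))   # -> "edge of the cloud"
```

    The phrase “edge of the cloud” never comes out of the recognizer itself: it is composed by the logic wrapped around it.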

    II.
    Secondly, the list of objects that make up the world is not closed: it is constantly being extended.

    A person has the ability to construct objects of reality, assigning names to newly discovered objects - new species of fauna, for example. He will call a horse with a human head and torso a centaur, but to do so he will first realize that the creature has a human head and torso while everything else is equine, thereby recognizing the object as a new one. This is what the human brain does. An algorithm, lacking such a sample in its input data, will classify the creature either as a person or as a horse: without operating on the characteristics of the types, it cannot establish their combination.

    For a robot to become like a human being, it must be able to define new types of objects and assign names to those types, describing each new type through the characteristics of known types. And if the robot cannot do that, what on earth do we need it for, beautiful as it is?
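    As a sketch of what this might mean in code - assuming, optimistically, that the recognizer can report parts with their own labels rather than a single whole-object class - something like the following, where all names and types are invented:

```python
# Coin a name for an unknown combination of known parts (the "centaur" case).

KNOWN_TYPES = {
    "human": {"human head", "human torso", "human legs"},
    "horse": {"horse head", "horse torso", "horse legs"},
}

def name_or_invent(observed_parts: set, lexicon: dict) -> str:
    """Return an existing type if the parts match it exactly; otherwise coin a
    new type, described through the known types whose parts it combines."""
    for name, parts in lexicon.items():
        if observed_parts == parts:
            return name
    sources = sorted({t for t, parts in lexicon.items() if observed_parts & parts})
    new_name = "new type combining " + " and ".join(sources)   # placeholder name
    lexicon[new_name] = set(observed_parts)   # the list of types is not closed
    return new_name

creature = {"human head", "human torso", "horse legs"}
print(name_or_invent(creature, KNOWN_TYPES))
# -> "new type combining horse and human"
```

    The interesting work - realizing that the head is human and the legs are equine - is hidden inside the assumption, and that is exactly the part current algorithms do not provide.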

    Let's say we send a reconnaissance robot to Mars. The robot sees something unusual but can identify objects only in the earthly terms it knows. What will that give the people listening to the robot's verbal reports? Sometimes something, of course (if earthly objects turn up on Mars), and in other cases nothing (if the Martian objects resemble nothing on Earth).

    An image is another matter: a person will be able to see everything for himself, evaluate it correctly and name it - not by means of a pre-trained image recognition algorithm, but with his more cunningly constructed human brain.

    III.
    Thirdly, there is a problem with the individualization of objects.

    The world around us consists of specific objects; in fact, you can only ever see specific objects. In some cases they need to be individualized verbally, either with personal names (“Vasya Petrov”) or with a simple indication of a specific object, spoken or implied (“this table”). What I call types of objects (“people”, “tables”) are merely collective names for objects that share certain common characteristics.

    Image recognition algorithms, once trained on suitable samples, can recognize both individualized and non-individualized objects - which is good. Face recognition in crowded places and all that. The bad part is that such algorithms will not understand which objects deserve to be treated as possessing individuality and which absolutely do not.

    The robot, as the owner of an AI, should from time to time burst out with remarks like:
    - Oh, I already saw this old woman a week ago!

    But it should not overdo such remarks when it comes to blades of grass, especially since there are well-founded doubts about whether the computing power would suffice for such a task.

    It is not clear to me where the fine line runs between an individualized old woman and the countless blades of grass in a field, which are no less individual than the old woman yet are of no interest to a person from the point of view of individualization. What is a recognized image in this sense? Almost nothing - the beginning of a difficult, even painful, perception of the surrounding reality.
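    Technically, re-identification of a specific object can be sketched as comparing an embedding of what is seen now against a memory of what was seen before. The threshold, the embedding source and, above all, the list of types worth individualizing are assumptions here - and that last list is precisely the fine line I cannot draw:

```python
import numpy as np

# A sketch of instance re-identification: compare an embedding of the current
# detection against remembered ones. The embeddings are assumed to come from
# some feature extractor; here they are just vectors.

SIMILARITY_THRESHOLD = 0.9        # assumed; real systems tune this per domain
TRACK_INDIVIDUALS = {"person"}    # which types deserve individuality - the hard question

memory = []   # list of (object_type, embedding) pairs seen before

def report(object_type: str, embedding: np.ndarray) -> str:
    if object_type not in TRACK_INDIVIDUALS:
        return f"a {object_type}"             # a blade of grass stays anonymous
    for seen_type, seen_emb in memory:
        sim = float(np.dot(embedding, seen_emb) /
                    (np.linalg.norm(embedding) * np.linalg.norm(seen_emb) + 1e-9))
        if seen_type == object_type and sim > SIMILARITY_THRESHOLD:
            return f"the same {object_type} I saw before"
    memory.append((object_type, embedding))
    return f"a new {object_type}"
```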

    IV.
    Fourthly, the dynamics of objects, determined by their mutual spatial arrangement. This, I tell you, is really something!

    I am sitting in a deep armchair in front of the fireplace and am now trying to get up.
    “What do you see, robot?”

    From our everyday point of view, the robot sees me rising from the chair. What should it answer? Probably the relevant answer would be:
    “I see you getting up from your chair.”

    To do this, the robot must know who I am, what a chair is, and what it means to get up...

    After appropriate tuning, an image recognition algorithm will be able to recognize me and the chair, and then, by comparing frames, it can establish that I have moved away from the chair. But what does it mean to “get up”? How does “getting up” happen in physical reality?

    If I have already stood up and walked away, everything is fairly simple. After I moved away from the chair, none of the objects in the room changed their spatial position relative to one another, except for me, who was initially in the chair and some time later was away from it. It is permissible to conclude that I left the chair.

    If I am still in the process of getting up from the chair, things are somewhat more complicated. I am still next to the chair, but the relative spatial positions of the parts of my body have changed:

    • initially the shins and trunk were vertical and the thighs horizontal (I was sitting),
    • a moment later all parts of the body were vertical (I had stood up).

    A person observing my behavior will instantly conclude that I am rising from the chair. For the person this is not so much a logical inference as visual perception: he will literally see me rising from the chair, although in fact what he sees is a change in the relative position of the parts of my body. In reality, though, it is a logical inference - one that either someone must explain to the robot, or the robot must work out on its own.

    Both options are equally difficult:

    • entering into the initial knowledge base the fact that standing up is a sequential change in the mutual spatial position of certain body parts is somehow not inspiring;
    • it is no less naive to hope that the robot, as an artificial thinking creature, will quickly work out for itself that the change in mutual spatial position described above is called standing up. In humans this process takes years - how long would it take a robot?

    And what do image recognition algorithms have to do with any of this? On their own, they will never determine that I am rising from a chair.
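    To make the gap concrete, here is a sketch of the hand-written rule that would have to sit on top of a keypoint detector. The keypoints are assumed to come from some pose-estimation model - that is the part recognition can actually supply; the word “getting up” is supplied by us:

```python
import math

def segment_is_vertical(p1, p2, tolerance_deg: float = 30.0) -> bool:
    """True if the segment between two (x, y) image points is close to vertical."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    angle = math.degrees(math.atan2(abs(dx), abs(dy)))   # 0 degrees = vertical
    return angle < tolerance_deg

def posture(keypoints: dict) -> str:
    """keypoints: image coordinates for 'shoulder', 'hip' and 'knee'."""
    trunk_vertical = segment_is_vertical(keypoints["shoulder"], keypoints["hip"])
    thigh_vertical = segment_is_vertical(keypoints["hip"], keypoints["knee"])
    if trunk_vertical and not thigh_vertical:
        return "sitting"
    if trunk_vertical and thigh_vertical:
        return "standing"
    return "unclear"

def describe_transition(frame_before: dict, frame_after: dict) -> str:
    before, after = posture(frame_before), posture(frame_after)
    if before == "sitting" and after == "standing":
        return "getting up"   # the name of the action comes from us, not the detector
    return f"{before} -> {after}"

sitting = {"shoulder": (100, 50), "hip": (100, 120), "knee": (160, 125)}
standing = {"shoulder": (100, 20), "hip": (100, 90), "knee": (102, 160)}
print(describe_transition(sitting, standing))   # -> "getting up"
```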

    V.
    "Standing up" is an abstract concept, determined by a change in the characteristics of material objects, in this case, a change in their mutual spatial position. In the general case, this is true for any abstract concepts, because abstract concepts themselves do not exist in the material world, but are completely dependent on material objects. Although often we perceive them as observed personally.

    Moving the jaw to the right or left without opening the mouth - what is that action called? There is no word for it. Undoubtedly because such a movement is, on the whole, uncharacteristic of a person. Using the algorithms under discussion, the robot will see something, but what of it? The required name is absent from the database of training samples, and the robot will find it difficult to name the action it has recorded. And image recognition algorithms are not trained to give detailed verbal formulations of unnamed actions, or of other abstract concepts.

    In effect, this duplicates the first point, only with respect to abstract concepts rather than objects. The other points, both previous and following, can also be tied to abstract concepts - I am simply drawing attention to the rising level of complexity when working with abstractions.

    VI.
    Sixthly, causal relationships.

    Imagine that you are watching a pickup truck fly off the road and tear down a fence. The cause of the fence being demolished is the movement of the pickup; conversely, the demolition of the fence is the effect of the pickup's movement.

    - I saw it with my own eyes!
    That is the answer to the question of whether you saw what happened or deduced it. But what did you actually see?

    A few objects in a particular dynamic:

    • the pickup drove off the road,
    • the pickup came right up to the fence,
    • the fence changed shape and location.

    Based on visual perception, the robot must realize that fences do not normally change shape and location: here it happened as a result of contact with the pickup. The cause-object and the effect-object must be in contact with each other; otherwise there is no causality in their relationship.

    Although here we fall into a logical trap, because other objects can also be in contact with the effect-object, not only the cause-object.

    Suppose that at the moment of impact a jackdaw was sitting on the fence. The pickup and the jackdaw were in contact with the fence at the same time: how do we determine which contact demolished the fence?

    Probably using repeatability:

    • if the fence is demolished every time a jackdaw sits on it, the jackdaw is to blame;
    • if the fence is demolished every time a pickup crashes into it, the pickup is to blame.

    Thus, the conclusion that the fence was demolished by the pickup is not exactly an observation but the result of an analysis based on observing the objects in contact.
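    The repeatability test itself is easy to sketch - it is pure counting over many observed episodes, with no vision involved at all, which is rather the point. The episode format and the 0.5 cut-off are invented for illustration:

```python
from collections import Counter

def likely_cause(episodes: list) -> str:
    """Each episode records which objects touched the fence and whether it was
    demolished; return the contact most strongly tied to the effect."""
    contact_total = Counter()
    contact_with_effect = Counter()
    for ep in episodes:
        for obj in ep["contacts"]:
            contact_total[obj] += 1
            if ep["fence_demolished"]:
                contact_with_effect[obj] += 1
    rates = {obj: contact_with_effect[obj] / contact_total[obj] for obj in contact_total}
    best = max(rates, key=rates.get)
    return best if rates[best] > 0.5 else "undetermined"   # arbitrary cut-off

episodes = [
    {"contacts": ["jackdaw"], "fence_demolished": False},
    {"contacts": ["jackdaw"], "fence_demolished": False},
    {"contacts": ["pickup", "jackdaw"], "fence_demolished": True},
    {"contacts": ["pickup"], "fence_demolished": True},
]
print(likely_cause(episodes))   # -> "pickup"
```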

    On the other hand, an action can also be carried out at a distance - for example, the action of a magnet on an iron object. How is the robot to guess that bringing a magnet closer to a nail causes the nail to rush toward the magnet? The visual picture suggests nothing of the sort:

    • the magnet approaches the nail but does not touch it,
    • at the same instant the nail rushes toward the magnet, seemingly on its own initiative, and comes into contact with it.

    As you can see, tracing cause-and-effect relationships is very difficult, even in cases where a witness declares with iron conviction that he saw it with his own eyes. Image recognition algorithms are powerless here.

    VII.
    Seventhly and lastly, the choice of goals for visual perception.

    The surrounding visual picture may consist of hundreds or thousands of objects nested within one another, many of which are constantly changing their spatial position and other characteristics. Obviously the robot does not need to perceive every blade of grass in a field, or every face on a city street: it needs to perceive only what is important, depending on the task at hand.

    Obviously, tuning an image recognition algorithm to perceive some objects and ignore others will not work, because it may not be known in advance what deserves attention and what can be ignored, especially as the current goals may change along the way. A situation may arise in which you first have to perceive many thousands of nested objects - literally every one of them - analyze them, and only then pass a verdict on which objects matter for the current task and which are of no interest. This is how a person perceives the world around him: he sees only the important, paying no attention to uninteresting background events. How he manages this is a mystery.
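    If one were to sketch goal-driven filtering anyway, it might look like the following: detect everything, then score each detection against the current goal. The goals, objects and scores are all made up; the catch is that the scoring table itself has to change at run time, and nobody can write it out in advance:

```python
# Hypothetical relevance tables for two goals a Mars robot might have.
RELEVANCE = {
    "survey terrain": {"rock": 3, "crater": 3, "cloud": 2},
    "report attack":  {"alien": 10, "weapon": 9},
}

def important_objects(detections: list, goal: str, top_k: int = 3) -> list:
    """Keep only the detections most relevant to the current goal."""
    scores = RELEVANCE.get(goal, {})
    ranked = sorted(detections, key=lambda obj: scores.get(obj, 0), reverse=True)
    return [obj for obj in ranked[:top_k] if scores.get(obj, 0) > 0]

scene = ["cloud", "rock", "rock", "alien", "crater"]
print(important_objects(scene, goal="survey terrain"))  # -> ['rock', 'rock', 'crater']
print(important_objects(scene, goal="report attack"))   # -> ['alien']
```

    Notice that under the “survey terrain” goal the alien is filtered out entirely - which is exactly the danger described below.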

    And the robot, even one equipped with the most modern and ingenious image recognition algorithms? If, during an attack by Martian aliens, it starts its report with the weather and continues with a description of the new landscape spread out before it, it may not have time to report the attack itself.

    Conclusions

    1. Simple recognition of visual images will not replace human eyes.
    2. Image recognition algorithms are an auxiliary tool with a very narrow scope.
    3. For a robot to begin not only to think but even to see in a human way, it needs not just pattern recognition algorithms but that same full-fledged and still unattainable human thinking.
