Computer vision. Lecture for Yandex Small ShAD

    The scope of computer vision is very wide: from barcode readers in supermarkets to augmented reality. In this lecture you will learn where computer vision is used and how it works, how images look in numbers, which tasks in this area are solved relatively easily, which are difficult, and why.

    The lecture is aimed at high school students attending the Small ShAD, but adults will find plenty to learn from it as well.

    The ability to see and recognize objects comes naturally to humans. For a computer, however, it remains an extremely difficult task. Researchers are now trying to teach computers even a fraction of what a person does every day without noticing it.

    The place an ordinary person most often encounters computer vision is probably the supermarket checkout, when a barcode is scanned. Barcodes were designed specifically to make machine reading easy. But there are harder tasks: reading license plates, analyzing medical images, flaw detection in manufacturing, face recognition, and so on. The use of computer vision for augmented reality systems is developing rapidly.

    The difference between human and computer vision

    A child learns to recognize objects gradually, coming to understand how an object's shape changes with its position and lighting. Later, when recognizing objects, a person relies on this accumulated experience. Over a lifetime, a person amasses an enormous amount of information; the learning process of the human neural network never stops for a second. It is not difficult for a person to reconstruct perspective from a flat picture and imagine how it would all look in three dimensions.

    For a computer, all of this is much harder, above all because of the problem of accumulating experience: a huge number of examples has to be collected, and so far that has met with limited success.

    In addition, a person always takes the surroundings into account when recognizing an object. Pull an object out of its usual environment and it becomes much harder to recognize. Here, too, the experience accumulated over a lifetime, which a computer lacks, plays a role.

    Boy or girl?

    Imagine that we need to learn to determine, at a glance, the gender of a (dressed!) person from a photograph. First we need to identify the factors that may indicate membership in one gender or the other. We also need to collect a training set, preferably a representative one. In our case we take everyone present in the audience as the training sample and look for distinguishing factors: for example, hair length, the presence of a beard, makeup, and clothing (skirt or trousers). Knowing what percentage of each gender exhibits certain factors, we can formulate fairly clear rules: the presence of particular combinations of factors will, with a certain probability, tell us the gender of the person in the photograph.
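    Such a hand-built rule can be sketched in a few lines of code. Everything here is invented for illustration: the factor names and the weights are guesses in the spirit of the example above, not values derived from real data.

```python
# Hypothetical rule-of-thumb classifier built from hand-picked factors.
# Factor names and weights are made up for this example.

def guess_gender(person):
    """Score simple appearance factors; positive score -> 'female'."""
    score = 0
    score += 2 if person.get("long_hair") else -1
    score += 2 if person.get("skirt") else -1
    score += 1 if person.get("makeup") else 0
    score -= 3 if person.get("beard") else 0
    return "female" if score > 0 else "male"

print(guess_gender({"long_hair": True, "skirt": True}))  # female
print(guess_gender({"beard": True}))                     # male
```

    The point is not the specific weights but the shape of the rule: each factor contributes evidence, and the combined score makes the decision "with a certain probability".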

    Machine learning

    Of course, this is a very simple, contrived example with a small number of high-level factors. In the real tasks posed to computer vision systems there are many more factors; defining them manually and computing the dependencies between them is beyond a human. So in such cases machine learning is indispensable: you can, for example, specify dozens of initial factors and provide positive and negative examples, and the dependencies between the factors are then found automatically, yielding a formula that makes the decision. Quite often the factors themselves are also extracted automatically.
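    As a minimal illustration of learning a decision rule from positive and negative examples, here is a perceptron on binary factors. The factors and the toy data are invented; real systems use far richer features and models.

```python
# A minimal sketch of learning a rule from labeled examples: a perceptron.
# The binary factors and the tiny dataset below are illustrative only.

def train_perceptron(samples, labels, epochs=20, lr=0.5):
    n = len(samples[0])
    w = [0.0] * n   # one weight per factor, found automatically
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):   # y is +1 or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:                   # update weights only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# toy factors: [long_hair, beard] -> label +1 or -1
X = [[1, 0], [1, 0], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

print(predict([1, 0]))   # 1
print(predict([0, 1]))   # -1
```

    The "formula that makes the decision" here is just the learned weights; nobody wrote the rule by hand.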

    Image in numbers

    The color space most commonly used for storing digital images is RGB. In it, each of the three axes (channels) is assigned its own color: red, green, and blue. Each channel is allocated 8 bits of information, so the color intensity along each axis can take values from 0 to 255. All colors in the RGB digital space are obtained by mixing the three primary colors.
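    This is easy to see in code: an image is just a grid of pixels, and each pixel is three 8-bit numbers. A tiny 2x2 example, stored as nested lists:

```python
# An RGB image as numbers: each pixel is three 8-bit values (0..255).
image = [
    [(255, 0, 0), (0, 255, 0)],      # red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],  # blue pixel, white pixel
]

r, g, b = image[0][0]
print(r, g, b)       # 255 0 0 -> pure red
# mixing all three channels at full intensity gives white:
print(image[1][1])   # (255, 255, 255)
```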


    Unfortunately, RGB is not always well suited to analyzing information: experiments show that geometric distance between colors in RGB is quite far from how a person perceives the closeness of colors to one another.

    But there are other color spaces. Particularly interesting in our context is HSV (Hue, Saturation, Value). It has a Value axis indicating the amount of light, with its own separate channel, unlike RGB, where that value has to be computed every time. In effect, this channel is a black-and-white version of the image, which can already be worked with on its own. Hue is represented as an angle and is responsible for the base tone. Saturation (the distance from the center to the edge) determines how saturated the color is.
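    Python's standard library can convert between the two spaces; `colorsys` works with channel values scaled to the 0..1 range, and reports hue as a fraction of a full turn:

```python
import colorsys

r, g, b = 255, 0, 0                              # pure red in RGB
h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
print(h * 360, s, v)                             # 0.0 1.0 1.0 -> hue 0 deg, full S and V

# a darker red: only V (the amount of light) changes
h2, s2, v2 = colorsys.rgb_to_hsv(128 / 255, 0, 0)
print(round(h2 * 360), s2, round(v2, 2))         # 0 1.0 0.5
```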


    HSV is much closer to how we think about colors. Show a person a red object and a green object in the dark and they will not be able to tell the colors apart. The same thing happens in HSV: the lower we move along the V axis, the smaller the difference between hues becomes, because the range of saturation values shrinks. On a diagram this looks like a cone, with a single black point at its tip.

    Color and light

    Why is it important to have data on the amount of light? In most cases color does not matter in computer vision, because it carries no essential information. Compare two pictures, one in color and one in black and white: recognizing the objects in the black-and-white version is hardly any harder than in the color one. The color adds nothing for us, yet it adds a great deal of computational work: working with the color version of an image, roughly speaking, triples the amount of data, since one channel becomes three.
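    A common way to obtain the "black and white" version is a weighted sum of the three channels. The weights below are the standard ITU-R BT.601 luma coefficients, one conventional choice among several:

```python
# Collapse three RGB channels into one grayscale channel per pixel.
# Weights are the ITU-R BT.601 luma coefficients.

def to_grayscale(rgb_image):
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

image = [[(255, 0, 0), (255, 255, 255)]]
print(to_grayscale(image))   # [[76, 255]] -> one number per pixel instead of three
```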


    Color is used only in the rare cases where it actually simplifies the computation. For example, when detecting a face it is easier to first find its likely location in the picture by looking for a range of skin tones, which removes the need to analyze the entire image.
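    The skin-tone trick can be sketched as a simple threshold in HSV: keep only pixels whose hue, saturation, and value fall inside a rough "skin" range. The thresholds below are illustrative guesses, not tuned values; real detectors use carefully fitted color models.

```python
# A sketch of skin-tone filtering: mark pixels whose HSV values fall in a
# rough "skin" range (thresholds are illustrative guesses, not tuned).
import colorsys

def skin_mask(rgb_image):
    mask = []
    for row in rgb_image:
        mask_row = []
        for (r, g, b) in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            is_skin = h * 360 < 50 and 0.2 < s < 0.7 and v > 0.35
            mask_row.append(1 if is_skin else 0)
        mask.append(mask_row)
    return mask

image = [[(220, 170, 130), (30, 140, 40)]]   # skin-like pixel vs green pixel
print(skin_mask(image))                      # [[1, 0]]
```

    The resulting mask marks candidate regions, so a heavier face detector only needs to run where the mask is 1.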

    Local and global features

    The features by which we analyze an image can be local or global. Looking at this picture, most people will say that it shows a red car:


    This answer implies that the person singled out an object in the image, which means they described a local color feature. By and large, the picture shows a forest, a road, and a small car; by area the car occupies the smaller part. But we understand that the car is the most important object in this picture. If a person is asked to find pictures similar to this one, they will first of all select images containing a red car.

    Detection and Segmentation

    In computer vision this corresponds to detection and segmentation. Segmentation is the division of the image into many parts that are related to each other visually or semantically. Detection is the discovery of objects in the image. Detection must be clearly distinguished from recognition: in the same picture with the car we may be able to detect a road sign, but it is impossible to recognize it, since it faces away from us. Likewise, in face recognition a detector can determine the location of a face, and the “recognizer” then tells whose face it is.


    Descriptors and Visual Words

    There are many different approaches to recognition.

    For example, this one: first, interesting points, or regions of interest, are found in the image, something that stands out from the background: bright spots, transitions, and so on. There are several algorithms for doing this.

    One of the most common methods is called Difference of Gaussians (DoG). By blurring the image with different radii and comparing the results, you can find the most contrasting fragments. The areas around those fragments are the most interesting.

    These areas are then described numerically: each region is divided into small cells, the direction of the gradient in each cell is determined, and vectors are obtained.

    The image below shows what this looks like. The resulting data is written into descriptors.
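    A minimal sketch of the idea: for each pixel of a grayscale patch, compute horizontal and vertical differences, then accumulate a histogram of gradient directions weighted by gradient strength. This is a rough, simplified cousin of what SIFT-style descriptors do, not the real algorithm.

```python
# Describe a patch by its gradient directions: an orientation histogram.
import math

def orientation_histogram(patch, bins=8):
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]   # horizontal difference
            dy = patch[y + 1][x] - patch[y - 1][x]   # vertical difference
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += magnitude
    return hist

# a patch whose brightness grows to the right -> all gradients point along +x
patch = [[x * 10 for x in range(5)] for _ in range(5)]
hist = orientation_histogram(patch)
print(hist.index(max(hist)))   # 0 -> the "east" bin dominates
```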


    So that identical descriptors are recognized as such regardless of rotation in the plane, they are rotated so that the largest vectors point in one direction. This is not always done, but it is needed when you want to match two identical objects lying in different planes.

    Descriptors can be written in numerical form, so a descriptor can be represented as a point in a multidimensional space. In the illustration we have a two-dimensional space; our descriptors land in it, and we can cluster them, that is, split them into groups.


    Next, for each cluster we describe a region of the space. When a descriptor falls into that region, what matters to us is no longer which descriptor it was but which region it fell into. We can then compare images by counting how many descriptors of one image ended up in the same clusters as the descriptors of another. Such clusters can be called visual words.
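    The clustering step can be sketched with plain k-means. The 2-D points below stand in for (much higher-dimensional) descriptors, and the naive initialization is just for the sketch; real systems use better seeding and many more clusters.

```python
# Cluster "descriptors" into visual words with a plain k-means.
import math

def kmeans(points, k, iters=20):
    # naive init for the sketch: spread starting centers across the data
    centers = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        for c, group in enumerate(groups):
            if group:   # move each center to the mean of its group
                centers[c] = tuple(sum(v) / len(group) for v in zip(*group))
    return centers

# two obvious blobs of descriptor points
points = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = kmeans(points, k=2)
print(sorted(round(c[0], 1) for c in centers))   # [0.1, 5.0]: one center per blob
```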

    To find not just identical pictures but images of similar objects, you take many images of the object and many pictures without it, extract descriptors from them, and cluster the descriptors. Then you find out which clusters the descriptors from the images containing the object fell into. Now, if the descriptors of a new image fall into the same clusters, the desired object is likely present in it.
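    Comparing images through visual words then looks like this sketch: each descriptor votes for its nearest cluster center, the votes form a histogram of word counts, and images are compared by how similar their histograms are. The centers and descriptor points are invented toy data.

```python
# Compare images by histograms of visual-word votes (toy data throughout).
import math

def to_histogram(descriptors, centers):
    hist = [0] * len(centers)
    for d in descriptors:   # each descriptor votes for its nearest word
        hist[min(range(len(centers)), key=lambda c: math.dist(d, centers[c]))] += 1
    return hist

def similarity(h1, h2):
    # cosine similarity between the two word histograms
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0

centers = [(0.0, 0.0), (5.0, 5.0)]               # two "visual words"
image_a = [(0.1, 0.0), (0.2, 0.1), (5.1, 5.0)]   # descriptors of image A
image_b = [(0.0, 0.2), (4.9, 5.1), (5.0, 5.2)]   # descriptors of image B
ha, hb = to_histogram(image_a, centers), to_histogram(image_b, centers)
print(ha, hb)                         # [2, 1] [1, 2]
print(round(similarity(ha, hb), 2))   # 0.8
```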

    Matching descriptors does not guarantee that the objects containing them are identical. One way to check further is geometric validation, in which the positions of the descriptors relative to one another are compared.

    Recognition and classification

    For simplicity, imagine that we can divide all images into three classes: architecture, nature, and portraits. Nature, in turn, can be divided into plants, animals, and birds. And once we have understood that it is a bird, we can say which one: an owl, a seagull, or a crow.


    The difference between recognition and classification is rather arbitrary. If we found an owl in the picture, that is more like recognition; if just a bird, some intermediate case; and if only nature, it is definitely classification. That is, the difference between recognition and classification is how deep we have gone down the tree. And the further computer vision advances, the lower the boundary between classification and recognition will slide.
