Clustering duplicates in image search

    Every month, over 20 million people use image search on Yandex. If one of them is looking for photos of [ Marilyn Monroe ], it does not mean they only want the most famous pictures of the actress. Results in which most of the images are copies of one another are unlikely to satisfy such users: they would have to page through many results to see different photos of Monroe. To make tasks like this easier, we need to organize the search results so that images are not repeated. And we learned to "lay them out on the shelves."

    When Yandex launched image search in 2002, there was no technology that allowed a computer to directly "see" what objects are in an image. Such technologies have appeared since, but they are not yet mature enough for a computer to recognize Marilyn Monroe by her face, or to recognize a forest in a photo and show it in response to a query. So improving the methods invented in those early stages remains relevant.

    So the computer still needs to understand what is shown in a picture. It cannot "see," but we do have technologies that search text documents well, and it is they that will help us: an image on the Internet is almost always accompanied by some text. That text does not necessarily describe exactly what is depicted, but it is almost always related to the picture in meaning and content. That is, we assume that next to a photo of Einstein his last name will most likely be mentioned. We call such texts picture texts.

    Based on these data, the computer understands which documents to show the user for the query [ Marilyn Monroe ]. As a result, the person sees thumbnails of relevant images in the search results. These thumbnails are identical for copies of the same image, which is where our name for them comes from: thumbnail duplicates. Pictures that are essentially identical may differ in size and degree of compression, but that does not change the content of the image. Sometimes further changes are made to a picture: a watermark or logo is added, colors are altered, or it is cropped. But that is not enough to consider the image new. Our task is to make sure that there are no duplicate thumbnails on the image search results page, and that for each group of copies a single one is displayed, representing them all.
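    The grouping step described above can be sketched with a classic union-find structure: given pairs of images already identified as copies, we merge them into groups and display one representative per group. Everything here (file names, the choice of representative) is invented for illustration and is not Yandex's actual pipeline.

```python
# Hypothetical sketch: grouping known duplicate pairs with union-find,
# then showing one representative thumbnail per group.

def find(parent, x):
    # Find the group root, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_duplicates(images, duplicate_pairs):
    parent = {img: img for img in images}
    for a, b in duplicate_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb        # merge the two groups
    groups = {}
    for img in images:
        groups.setdefault(find(parent, img), []).append(img)
    # Pick one representative per group (here: lexicographically first).
    return [sorted(g)[0] for g in groups.values()]

images = ["a.jpg", "a_small.jpg", "a_logo.jpg", "b.jpg"]
pairs = [("a.jpg", "a_small.jpg"), ("a_small.jpg", "a_logo.jpg")]
shown = group_duplicates(images, pairs)
# shown contains one thumbnail for the three copies of "a", plus "b.jpg"
```

    Transitivity comes for free here: even though "a.jpg" and "a_logo.jpg" were never compared directly, they end up in the same group through "a_small.jpg".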


    The picture texts, in principle, made it clear to us that the photographs found show Marilyn Monroe. But they are not enough to determine which of those photos are duplicates. At this stage, existing computer vision technologies can help us.
    When we say that a child takes after its parents, it often sounds like "dad's nose" or "mother's eyes." That is, we point out particular facial features of the parents that are preserved in the child.

    But what if we try to teach a computer to use a similar principle? In that case, it must first understand which points in the picture to look at. To do this, the image is processed with special filters that help to highlight contours. Using them, the computer finds key points that remain stable under changes to the image itself, and looks at what surrounds them. For the computer to "examine" these fragments, they must be converted into numerical form, so that the description stays valid even if the picture is stretched, rotated, or subjected to some other transformation. In effect, to some extent we train the computer to look at an image the way a person does.
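    As a toy illustration of a description that survives a transformation, consider a deliberately crude "descriptor": the sorted multiset of pixel values in a patch around a key point. Sorting discards the arrangement, so rotating the patch leaves the description unchanged. Real descriptors (SIFT-style and their successors) are far more sophisticated; the patch and functions below are invented purely for the demo.

```python
# Toy invariant description: sorting the patch values throws away their
# arrangement, so a rotated patch yields the same description.

def rotate90(patch):
    # Rotate a square patch (a list of rows) by 90 degrees clockwise.
    return [list(row) for row in zip(*patch[::-1])]

def describe(patch):
    # Order-independent description of the patch contents.
    return sorted(v for row in patch for v in row)

patch = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]

# The description is identical before and after rotation.
assert describe(patch) == describe(rotate90(patch))
```

    Of course, such a descriptor is far too crude to tell images apart in practice; it only demonstrates the idea of describing a fragment in a way that transformations cannot disturb.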

    As a result, each picture receives a set of descriptions of what is at its key points. And if many of these regions in one image are similar to many regions in another image, we can draw conclusions about their overall similarity.
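    The comparison of two such sets can be sketched as follows, under the simplifying assumption that a descriptor is just a small vector of numbers and two descriptors "match" when they are close in Euclidean distance. The vectors and the threshold below are invented for the demo; real systems use high-dimensional descriptors and much faster nearest-neighbor search.

```python
# Score two images by the fraction of descriptors in one that have a
# close counterpart in the other (all values here are illustrative).

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def similarity(desc_a, desc_b, max_dist=1.0):
    # Fraction of A's descriptors whose nearest descriptor in B is close.
    if not desc_a:
        return 0.0
    matched = sum(1 for d in desc_a
                  if min(dist(d, e) for e in desc_b) <= max_dist)
    return matched / len(desc_a)

# Toy descriptors: image B is image A with one region altered
# (say, a watermark), image C is an unrelated picture.
img_a = [(0, 0), (5, 5), (9, 1), (3, 7)]
img_b = [(0, 0.2), (5, 5.1), (9, 1.0), (50, 50)]
img_c = [(100, 100), (200, 50)]

sim_ab = similarity(img_a, img_b)   # high: likely duplicates
sim_ac = similarity(img_a, img_c)   # low: different images
```

    A high score between A and B despite the altered region reflects exactly the situation from the text: a watermark or crop changes some regions but leaves most key points, and therefore most descriptions, intact.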


    But there is a problem. It is impossible to decide that two images are duplicates without knowing what is shown in them. We could decree that a change of no more than 5% of the area lets us consider them duplicates, but imagine pictures of a chessboard before the start of a game and after one or two moves. These are different pictures, even though formally no more than 5% of them has changed. And if you add, say, a logo to one of two chessboard images with identical positions, they will also differ by 5%, yet remain duplicates. So yes, to some extent we have taught the computer to see a picture, but the technology has not yet reached a full understanding of what images depict, and we continue to work on this problem.

    All the operations described above have to be carried out for every image we index, and our index contains 10 billion images in all. And this figure keeps growing. To keep up with the growth of content on the Internet using our old algorithm, we would have had to scale up resources incredibly fast. Naturally, we had to find a more rational solution to this infrastructure problem. And we did.

    Now, for example, to add and process the 10 million new images that pass through Yandex.Pictures every day, there is no need to restart the process on the billions of images already in the database.
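    One way to see why such incremental updates are possible (a hedged sketch, not Yandex's actual design): if each descriptor is quantized into a hash bucket, a new image only needs to visit its own buckets, where any candidate duplicates are already waiting. The class, bucket scheme, and hash values below are invented for illustration.

```python
# Hypothetical inverted index from descriptor hashes to image ids.
# Adding an image touches only its own buckets, so existing images
# never need to be reprocessed.

from collections import defaultdict

class DuplicateIndex:
    def __init__(self):
        self.buckets = defaultdict(set)   # descriptor hash -> image ids

    def add(self, image_id, descriptor_hashes):
        # Candidate duplicates share at least one bucket with the new
        # image; only they need a detailed comparison afterwards.
        candidates = set()
        for h in descriptor_hashes:
            candidates |= self.buckets[h]
            self.buckets[h].add(image_id)
        return candidates

index = DuplicateIndex()
index.add("a.jpg", [101, 202, 303])
index.add("b.jpg", [404, 505])
cands = index.add("a_copy.jpg", [101, 202, 999])
# cands contains "a.jpg" but not "b.jpg"
```

    The cost of adding an image then depends on the number of its descriptors, not on the billions of images already indexed, which is the property the paragraph above describes.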

    In addition, the knowledge that some images on the Internet are copies of one another helps us rank web documents in regular search results. Identical images, like links, tie documents together. Thanks to this, we can take into account how valuable a given page is as an answer to a search query. So the technology described above matters not only for Yandex.Pictures but for our search as a whole.

    And returning to image search itself: another incidental task solved by merging duplicates is that we can find a similar image even when it is not accompanied by any picture text. This is useful when a picture in exactly that form is what answers the person's query.

