# Algorithm for quickly finding similar images

# Introduction

Recently I came across an article on Habrahabr comparing images: "'Looks alike.' How does a perceptual hash work?" Since I have been dealing with this topic for a long time (I am the author of the AntiDupl program), I wanted to share my experience on the issue here. In this article I will give two versions of an algorithm for finding similar images: a basic one and an improved one. Both have been tested by the author in practice within the framework of the above project. My presentation avoids rigorous proofs, complex formulas, and specialized mathematical terminology; I hope readers will forgive me for this.

# Basic Algorithm

## Image Similarity Measure

When comparing similar images, the first question that arises is: what should be considered a measure of image similarity? Obviously, this value is inversely related to how much the images differ from each other, so we need to choose a metric that characterizes the difference between images. Similar images are then those whose difference is less than a certain threshold. For images of the same dimensions, the root-mean-square (RMS) difference between the pixels of one image and the other is commonly used, although nothing prevents us from choosing another metric, for example the mean absolute difference between image pixels.

## Pictures of different sizes

The question becomes somewhat more complicated if you need to compare images of different sizes. A fairly obvious solution is to scale all images to the same size. It remains to choose that size: if it is too small, the differences between images will be smoothed out and we will get many false positives; too large a size unreasonably increases the resource consumption of the comparison algorithm. Experience shows that image sizes from 16x16 to 64x64 work best. Experience also shows that the color characteristics of an image have a much smaller effect on the visual sense of similarity than brightness does, so color can be discarded.

## Main steps

So, the simplest version of the algorithm for comparing similar images will include the following steps:

- Bringing all images to the same size (for definiteness, take 32x32).
- Discarding color information (converting to a grayscale image).
- Finding the RMS difference for each pair of reduced grayscale images.
- Comparing the obtained RMS difference with a chosen threshold.

Here the threshold determines the measure of image similarity. With a threshold of 0%, the algorithm finds only completely identical images. At a 5% threshold, the algorithm can also find visually similar images that differ in resolution, compression quality, small labels, cropping, etc., with a small number of false positives. At thresholds above 10%, as a rule, the number of false positives can exceed the number of true positives. This algorithm, in one form or another, is implemented in virtually any program that searches for similar images. Variations usually concern the method of obtaining the reduced image, its size, the way it is quantized over the color palette, and the choice of metric that measures the degree of difference between images.
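The basic steps above can be sketched as follows. This is a minimal illustration, assuming images have already been resized and decoded into flat lists of grayscale values in [0, 255]; the helper names `rms_difference` and `are_similar` are mine, not part of AntiDupl:

```python
import math

def rms_difference(a, b):
    """Root-mean-square difference between two equal-size grayscale
    images, given as flat lists of brightness values in [0, 255]."""
    assert len(a) == len(b)
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.sqrt(mse)

def are_similar(a, b, threshold_percent=5.0):
    """Images are considered similar when the RMS difference is below
    the threshold, expressed as a percentage of the full 0..255 range."""
    return rms_difference(a, b) <= threshold_percent / 100.0 * 255.0
```

With a 5% threshold this accepts an RMS difference of up to 12.75 brightness levels, which matches the behavior described above.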

## Disadvantages of the basic algorithm

The disadvantages of this algorithm include its poor sensitivity when comparing low-contrast images, as well as images without large-scale features (contour drawings, text images). The weak sensitivity is explained by the fact that when the original images are reduced to 32x32, such images, as a rule, all become indistinguishable from one another. Another drawback of this approach is the quadratic dependence of the execution time on the total number of images, since a comparison must be made for every pair. Comparing a pair of normalized images requires about 10^3 arithmetic operations, so a thousand images take about 10^9 operations, and a million images about 10^15 operations.

# Fast algorithm

As already mentioned, this basic implementation has two problems: low accuracy on images without large-scale features, and quadratic dependence of the execution time on the total number of images. The first problem is quite specific and characteristic of a rather narrow class of images, so we will focus on solving the second. Obviously, this algorithm parallelizes and vectorizes well; however, in this article I would like to dwell not on software optimization but on algorithmic optimization.

## Aspect ratio

Visually identical images, as a rule, have close width-to-height ratios, which suggests comparing only images with close aspect ratios. In practice, most images in a collection tend to have the same aspect ratio (for example, 3:4) but different orientations (vertical or horizontal). Thus we get two isolated classes of images that do not need to be compared with each other. The approximate speedup from this fact is about two times.
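Splitting a collection into these two orientation classes is straightforward; a minimal sketch (the function name and the `(name, width, height)` tuple format are assumptions for illustration):

```python
from collections import defaultdict

def group_by_orientation(images):
    """Split images into two orientation classes so that comparisons
    happen only within a class. `images` is a list of
    (name, width, height) tuples."""
    groups = defaultdict(list)
    for name, w, h in images:
        groups["landscape" if w >= h else "portrait"].append(name)
    return groups
```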

## Preliminary comparison

The first, most obvious step: if we reduced the image once, why not do it again? We make a 4x4 image from the 32x32 picture, where each point of the reduced image is the average brightness of the corresponding 8x8 block of the original. It can be rigorously proved that the RMS difference of these reduced images will be no greater than the RMS difference of the originals. Therefore, by preliminarily comparing the reduced 4x4 images, we create a filter that screens out most candidates before the full comparison (provided that most images in the sample are relatively unique). As is easy to calculate, this simple technique achieves almost a 50-fold speedup. But we can go further.
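The prefilter can be sketched as follows, again assuming flat grayscale lists and helper names of my own choosing; the key property is that a pair rejected on the 4x4 thumbnails can safely be skipped at full size:

```python
import math

def downsample(img, src=32, dst=4):
    """Average-pool a src x src grayscale image (flat list, row-major)
    down to dst x dst; each output pixel is the mean of a block."""
    k = src // dst
    out = []
    for by in range(dst):
        for bx in range(dst):
            block = [img[y * src + x]
                     for y in range(by * k, by * k + k)
                     for x in range(bx * k, bx * k + k)]
            out.append(sum(block) / (k * k))
    return out

def prefilter_passes(a, b, threshold):
    """The RMS difference of the 4x4 thumbnails never exceeds that of
    the 32x32 originals, so pairs failing this cheap test need no
    full comparison."""
    ta, tb = downsample(a), downsample(b)
    mse = sum((x - y) ** 2 for x, y in zip(ta, tb)) / len(ta)
    return math.sqrt(mse) <= threshold
```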

## One-dimensional image indexing

As is known, a search in an ordered data array can be performed in O(log N) time, in contrast to an unordered one, which requires O(N). Thus, if we somehow manage to order the images, we can theoretically reduce the quadratic complexity of the search, O(N^2), to quasilinear complexity, O(N*log N). So let us try to create an index by which to sort the images. To do this, take our 4x4 thumbnail and reduce it by a factor of two once more, to 2x2. From this picture it is easy to form the first index:

i[0] = (p[0][0] + p[0][1] + p[1][0] + p[1][1]) / 4,

which represents the average brightness of the image; in effect, the image is reduced to a size of 1x1. The RMS difference of the source images will be no less than the difference of their average brightnesses. Thus, if the threshold for comparing images is t, all possible candidates for comparison must have an average brightness in the range from i[0] - t to i[0] + t. If we sort all images by the index i[0], we need only compare against a significantly smaller fraction of the images: 2*t/255. It is easy to calculate that for a typical threshold of 5% we have a theoretical gain of 10 times. In practice the gain is smaller, since the statistical distribution of images by average brightness is not uniform over [0...255] but bell-shaped, with a maximum near 127 and falling off toward the edges of the range.
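A one-dimensional index of this kind can be sketched with a sorted list and binary search; the class and method names below are illustrative, not from AntiDupl:

```python
import bisect

class BrightnessIndex:
    """Images sorted by their average brightness i[0]; a range query
    returns every candidate whose brightness lies within threshold t."""

    def __init__(self, items):
        # items: iterable of (average_brightness, image_id) pairs
        self.items = sorted(items)
        self.keys = [b for b, _ in self.items]

    def candidates(self, brightness, t):
        # binary search for the [brightness - t, brightness + t] range
        lo = bisect.bisect_left(self.keys, brightness - t)
        hi = bisect.bisect_right(self.keys, brightness + t)
        return [img for _, img in self.items[lo:hi]]
```

Each query costs O(log N) to locate the range, plus the (hopefully small) number of candidates it returns.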

## Multidimensional Image Indexing

In addition to the average brightness i[0], three more indices can be built from the 2x2 image:

i[1] = 128 + (p[0][0] - p[0][1] + p[1][0] - p[1][1]) / 4,

i[2] = 128 + (p[0][0] + p[0][1] - p[1][0] - p[1][1]) / 4,

i[3] = 128 + (p[0][0] - p[0][1] - p[1][0] + p[1][1]) / 4,

These have a simple physical meaning: index i[1] shows how much the left half of the image is brighter than the right, i[2] how much the upper half is brighter than the lower, and i[3] how much the diagonals of the image differ. The images can be sorted by each of these indices just as by the first, and all possible candidates for comparison will lie in the ranges from i[j] - t to i[j] + t. That is, in the four-dimensional space formed by these indices, the search region is a 4-dimensional cube with side 2*t, centered at the coordinates given by the indices of the image in question. The fraction of the index space that must be searched equals the ratio of the cube's volume to the total volume of the index space, (2*t/255)^4 = 1/10000, for a theoretical speedup of four orders of magnitude. Practical acceleration is much more modest, since the effective width of the distribution of images over indices i[1] and i[2] is about 40-50%, and over i[3] only 20-30%, of the maximum possible. Because of this narrow actual range, using the last index is often not worthwhile in practice, so you can limit yourself to the first three indices. Nevertheless, the practically achievable speedup reaches two orders of magnitude (for a 5% threshold).
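Computing all four indices from a 2x2 thumbnail is a few lines; this sketch follows the formulas above directly (the function name is mine):

```python
def build_indices(p):
    """Compute the four indices from a 2x2 thumbnail p, given as two
    rows of brightness values in [0, 255]. The offset of 128 recenters
    the signed differences roughly into the [0, 255] range."""
    i0 = (p[0][0] + p[0][1] + p[1][0] + p[1][1]) / 4        # average brightness
    i1 = 128 + (p[0][0] - p[0][1] + p[1][0] - p[1][1]) / 4  # left vs right
    i2 = 128 + (p[0][0] + p[0][1] - p[1][0] - p[1][1]) / 4  # top vs bottom
    i3 = 128 + (p[0][0] - p[0][1] - p[1][0] + p[1][1]) / 4  # diagonal difference
    return [i0, i1, i2, i3]
```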

## The main stages of the improved algorithm

So, we can summarize our reasoning and list the main steps of the fast image comparison algorithm:

- Reducing all images to one size (32x32) and discarding color information.
- Building a 4x4 thumbnail for preliminary comparison.
- Building an n-dimensional index for each image from its preliminary-comparison thumbnail.
- Determining the boundaries of the search region in the n-dimensional index space.
- Checking the aspect ratios of the candidates from this region.
- Performing a preliminary comparison with the candidates that passed the previous step.
- Performing a full comparison with the candidates that passed the previous steps.
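The staged pipeline above can be sketched end to end. This is a simplified illustration under assumed conventions (flat 32x32 grayscale lists, a 5% threshold, my own helper names), and it uses only the one-dimensional i[0] brightness index rather than the full n-dimensional one:

```python
import math
from collections import defaultdict

THRESHOLD = 0.05 * 255  # 5% similarity threshold

def avg(values):
    return sum(values) / len(values)

def pool(img, src, dst):
    """Average-pool a src x src flat grayscale image down to dst x dst."""
    k = src // dst
    return [avg([img[y * src + x]
                 for y in range(by * k, by * k + k)
                 for x in range(bx * k, bx * k + k)])
            for by in range(dst) for bx in range(dst)]

def rms(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def find_similar(images):
    """images: dict name -> (orientation, flat 32x32 grayscale list).
    Returns pairs judged similar by the staged comparison."""
    thumbs = {n: pool(im, 32, 4) for n, (_, im) in images.items()}
    buckets = defaultdict(list)            # aspect-ratio classes
    for n, (orient, _) in images.items():
        buckets[orient].append(n)
    pairs = []
    for names in buckets.values():
        names.sort(key=lambda n: avg(thumbs[n]))   # 1-D index i[0]
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if abs(avg(thumbs[a]) - avg(thumbs[b])) > THRESHOLD:
                    break                  # sorted: no further candidates
                if rms(thumbs[a], thumbs[b]) > THRESHOLD:
                    continue               # preliminary 4x4 comparison
                if rms(images[a][1], images[b][1]) <= THRESHOLD:
                    pairs.append((a, b))   # full 32x32 comparison
    return pairs
```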

## The disadvantages of a fast algorithm

In general, indexing efficiency depends strongly on the similarity threshold: at a zero threshold the complexity of searching for similar images approaches linear, O(N), while at a large threshold it approaches quadratic, O(N^2). Still, this is significant progress, since in the basic version the search complexity is always quadratic, O(N^2). It is also worth noting that in practice, to implement sorting over the n-dimensional index space, one must organize an n-dimensional array holding lists of images that fall within given index ranges. This introduces additional overhead, makes indexing over many dimensions unjustified, and renders such indexing inefficient for small image collections.

# Conclusion

Summing up all the improvements applied in the improved version of the algorithm, we get a gain of 3-4 orders of magnitude compared to the basic implementation. Tests on large image collections confirm this: directly comparing the images with one another takes about 30 seconds on a single-core processor for a collection of 100 thousand images. In practice the comparison speed becomes an insignificant factor in the overall process of searching for similar images; after all, the images still need to be found, read from disk, and turned into thumbnails (although the last two steps need only be performed once, since thumbnails weighing only about 1000 bytes are obviously worth saving for reuse).

I hope readers found my presentation interesting, and perhaps someone even found it useful. I look forward to your feedback.

# P.S. Software implementation

A lot of time has passed since this article was written, so I decided to make a small addition. As part of the Simd project, a C++ class, ImageMatcher, was written that implements the algorithm described in this article. Enjoy!