Computer Vision: Recognizing Clothing in a Photo Using a Mobile Application

    Not so long ago, we decided to build a project that lets users search for clothes in various online stores by photo. The idea is simple: the user uploads an image, selects the region of interest (t-shirt, pants, etc.), optionally specifies refining parameters (gender, size, etc.), and the system looks for similar clothes in our catalogs, sorted by degree of similarity to the original.

    The idea itself is not new, but no one has implemented it well. The project www.snapfashion.co.uk has been on the market for several years, but the relevance of its search is very low: matching is based mainly on the dominant color of the image. For example, it can find a red dress, but not a dress with a particular cut or pattern. Its audience, incidentally, is not growing; we attribute this to the low relevance of the search, which in essence is no different from filtering a store's catalog by color on its website.

    In 2013, the project www.asap54.com appeared, and its search is a little better: the emphasis is on color plus a few options specified manually from a special catalog (short dress, long dress, medium-length dress). Faced with the difficulties of visual search, this project turned slightly toward a social network where fashionistas can share their looks, drifting from a "Shazam for clothes" toward an "Instagram for fashion".

    Despite these existing projects, the need for search by picture clearly remains unmet, and it is very relevant today. Addressing it with a mobile application, as SnapFashion and Asap54 did, fits the trends of the e-commerce market: according to various forecasts, the share of mobile sales in the USA may grow from 11% in 2013 to 25-50% by 2017. Such rapid growth of mobile commerce also promises rising popularity for applications that help people shop, and the stores themselves will most likely invest in developing and promoting such applications, as well as actively cooperate with them.

    After analyzing the competitors, we decided to tackle the topic ourselves and launched the Sarafan project (www.getsarafan.com).
    The corporate identity was meant to be bright from the start. We worked through many options:
    [image]

    As a result, we settled on a style with bright colors.
    [image]

    For the first client we chose iOS (iPhone). The design is built around a paint theme; the app talks to a REST service, and the main screen offers a choice: take a photo or pick one from the gallery.
    [image]

    This was perhaps the simplest part of the entire project. On the backend, meanwhile, things were not so rosy. Here is the story of our quest: what we did and where we ended up.

    Visual search

    We tested several approaches, but none of them yielded results good enough for a highly relevant search. In this article we describe what we tried and how it worked on different data. We hope this experience will be useful to readers.

    So, the main problem of such a search is the so-called semantic gap: the difference between which images (in this case, images of clothing) a person considers similar and which a machine does. For example, a person wants to find a black short-sleeved t-shirt:
    [image]
    A person can easily say that the match in the list below is the second image. The machine, however, will likely pick the third: it shows a women's t-shirt, but the scene has a very similar layout and the same color distribution.
    [image]

    A person expects the results to contain items of the same type (t-shirt, tank top, jersey, ...), of roughly the same style, and with roughly the same color distribution (color, texture, or pattern). In practice, satisfying all three conditions at once proved problematic.

    Let's start with the simplest: finding images with a similar color. Images are most often compared by color using the color histogram method. The idea is as follows: the whole color space is divided into a set of disjoint subsets that completely cover it; for each image a histogram is built that reflects the share of each color subset in the image's color gamut; and a notion of distance between histograms is introduced for comparison (a sketch in code follows the list below). There are many ways to form the color subsets; in our case it would be reasonable to derive them from our image catalog. However, even for such a simple comparison, the following conditions must be met:
    - images in the catalog must contain a single item against an easily separable background;
    - we must reliably separate the background from the clothing region of interest in the user's photos.
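    As a rough illustration, here is a minimal color-histogram comparison in Python with OpenCV. This is a sketch, not our production code: the file names are placeholders, and the HSV bin counts are illustrative.

        import cv2
        import numpy as np

        def color_histogram(image_bgr, bins=(8, 8, 8)):
            # Build a normalized 3D color histogram in HSV space
            # (hue in OpenCV spans 0..180).
            hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
                                [0, 180, 0, 256, 0, 256])
            cv2.normalize(hist, hist)
            return hist.flatten()

        def histogram_distance(h1, h2):
            # Chi-square distance: smaller means more similar.
            return cv2.compareHist(h1, h2, cv2.HISTCMP_CHISQR)

        query = color_histogram(cv2.imread('query_crop.jpg'))
        candidate = color_histogram(cv2.imread('catalog_item.jpg'))
        print(histogram_distance(query, candidate))

    Catalog items can then be ranked by this distance to the query crop.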
    In practice, the first condition is never satisfied; we will return to attempts to solve that problem later. The second condition is relatively simpler, because the region of interest in the user's image is selected with the user's active participation. There is, for example, a fairly effective background-removal algorithm, GrabCut (http://en.wikipedia.org/wiki/GrabCut). We proceeded from the assumption that the region of interest lies closer to the center of the circle the user draws than to its border, and that the background within that part of the image is relatively uniform in color. Using GrabCut and a few heuristics, we obtained an algorithm that works correctly in most cases.
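    A rough sketch of such GrabCut-based extraction, assuming the user's circled region is approximated by its bounding rectangle; the rectangle coordinates and file names are placeholders.

        import cv2
        import numpy as np

        image = cv2.imread('user_photo.jpg')
        mask = np.zeros(image.shape[:2], np.uint8)
        bgd_model = np.zeros((1, 65), np.float64)
        fgd_model = np.zeros((1, 65), np.float64)

        # (x, y, w, h) bounding box around the user-circled region.
        rect = (50, 50, 300, 400)
        cv2.grabCut(image, mask, rect, bgd_model, fgd_model, 5,
                    cv2.GC_INIT_WITH_RECT)

        # Definite and probable foreground pixels form the garment mask.
        fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                      1, 0).astype('uint8')
        garment = image * fg[:, :, np.newaxis]
        cv2.imwrite('garment.png', garment)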

    Now for selecting the region of interest in the catalog images. The first idea that comes to mind is to segment the image by color; the watershed algorithm (http://en.wikipedia.org/wiki/Watershed_(image_processing)), for example, is suitable here.
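    For reference, a compact version of the standard OpenCV marker-based watershed recipe (Otsu threshold, distance transform, seeded flood); the 0.5 threshold on the distance map is illustrative and would need tuning for real catalog photos.

        import cv2
        import numpy as np

        image = cv2.imread('catalog_item.jpg')
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

        # Sure foreground from the distance transform, sure background by dilation.
        dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
        _, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
        sure_fg = sure_fg.astype(np.uint8)
        sure_bg = cv2.dilate(binary, np.ones((3, 3), np.uint8), iterations=3)
        unknown = cv2.subtract(sure_bg, sure_fg)

        # Label the seeds, mark the unknown band with 0, and flood.
        _, markers = cv2.connectedComponents(sure_fg)
        markers = markers + 1
        markers[unknown == 255] = 0
        markers = cv2.watershed(image, markers)  # region borders become -1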
    However, a red skirt can appear in the catalog in several ways:
    [image]

    While the region of interest is relatively easy to segment in the first and second cases, in the third we would also pick up the jacket. For more complex cases this method does not work at all, for example:
    [image]

    It is worth noting that the image segmentation problem is far from completely solved: no existing method can select a region of interest as a single fragment the way a person can:
    [image]

    Instead, the image is divided into superpixels; the n-cuts and turbopixel algorithms are worth a look here.
    [image]
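    The text above mentions n-cuts and turbopixels; as an accessible stand-in, here is a SLIC over-segmentation with scikit-image, which likewise produces superpixels suitable for downstream grouping. The segment count and compactness are illustrative.

        from skimage import io, segmentation

        image = io.imread('catalog_item.jpg')
        labels = segmentation.slic(image, n_segments=200, compactness=10,
                                   start_label=1)

        # Draw the superpixel boundaries for inspection.
        boundaries = segmentation.mark_boundaries(image, labels)  # floats in [0, 1]
        io.imsave('superpixels.png', (boundaries * 255).astype('uint8'))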

    Downstream methods then work with combinations of superpixels. For example, the task of finding and localizing an object reduces to finding the combination of superpixels belonging to the object, instead of searching for a bounding box.
    [image]

    So, the task of labeling the catalog images came down to finding the combination of superpixels that corresponds to an item of a given type. This is already a machine-learning task. The idea was as follows: take many manually labeled images, train a classifier on them, and classify the different regions of a segmented image; the region with the maximum response is taken as the region of interest. But here we again had to decide how to compare images, because a simple color comparison is guaranteed not to work: we need to compare shape, or some representation of the scene. At the time, the gist descriptor (http://people.csail.mit.edu/torralba/code/spatialenvelope/) seemed suitable. The gist descriptor is essentially a histogram of the distribution of edges in an image: the image is divided by a grid into equal parts, and in each cell the distribution of edges of different orientations and sizes is computed and quantized. The resulting n-dimensional vectors can then be compared.
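    A much-simplified, gist-like descriptor as a sketch. The real GIST uses a bank of Gabor filters at several scales and orientations; here plain gradient-orientation histograms stand in, but the structure (grid of cells, one small histogram per cell, concatenated into one vector) is the same.

        import cv2
        import numpy as np

        def gistlike_descriptor(gray, grid=(4, 4), n_orientations=8):
            # Gradient magnitude and orientation, folded to [0, pi).
            gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
            gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
            magnitude = np.sqrt(gx ** 2 + gy ** 2)
            orientation = np.arctan2(gy, gx) % np.pi

            h, w = gray.shape
            cells = []
            for i in range(grid[0]):
                for j in range(grid[1]):
                    ys = slice(i * h // grid[0], (i + 1) * h // grid[0])
                    xs = slice(j * w // grid[1], (j + 1) * w // grid[1])
                    # Magnitude-weighted orientation histogram for this cell.
                    hist, _ = np.histogram(orientation[ys, xs],
                                           bins=n_orientations,
                                           range=(0, np.pi),
                                           weights=magnitude[ys, xs])
                    cells.append(hist / (hist.sum() + 1e-8))
            return np.concatenate(cells)

        gray = cv2.imread('catalog_item.jpg', cv2.IMREAD_GRAYSCALE)
        vector = gistlike_descriptor(gray)  # compare vectors with, e.g., L2 distance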

    A training sample was created: many images of about ten different classes were labeled manually. Unfortunately, even with cross-validation and with tuning of the algorithm's parameters, we could not push classification accuracy above 50%. Partly this is because a shirt, from the point of view of edge distribution, does not differ much from a jacket; partly the training sample was not large enough (gist is usually used to search very large collections of images); and partly because gist may simply not be applicable to this task at all.
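    Schematically, the experiment looked like this in scikit-learn; X and y below are synthetic placeholders standing in for the real descriptor vectors and labels.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        # Placeholders: in the experiment, X held descriptor vectors for the
        # labeled crops and y the garment class (about 10 classes).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 128))
        y = rng.integers(0, 10, size=500)

        clf = SVC(kernel='rbf', C=1.0, gamma='scale')
        scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
        print('mean accuracy: %.2f' % scores.mean())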

    Another way to compare images is to match local features. The idea is to detect salient points in the images (local features), describe the neighborhood of each point in some way, and count how many features of the two images coincide. We used SIFT as the descriptor. But matching local features also gave poor results, mainly because the method is designed to compare images of the same scene taken from different viewpoints.
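    A standard SIFT matching sketch with OpenCV, using Lowe's ratio test; the count of surviving matches serves as a crude similarity score. File names are placeholders.

        import cv2

        img1 = cv2.imread('query_crop.jpg', cv2.IMREAD_GRAYSCALE)
        img2 = cv2.imread('catalog_item.jpg', cv2.IMREAD_GRAYSCALE)

        sift = cv2.SIFT_create()
        kp1, des1 = sift.detectAndCompute(img1, None)
        kp2, des2 = sift.detectAndCompute(img2, None)

        # For each query feature take the two nearest catalog features and
        # keep the match only if the best is clearly better than the runner-up.
        matcher = cv2.BFMatcher(cv2.NORM_L2)
        matches = matcher.knnMatch(des1, des2, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        print('matching features:', len(good))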

    Thus, we failed to label the images from the catalog. Searching the unlabeled images with the methods described above sometimes returned roughly similar items, but in most cases the results had nothing in common with the query from a human point of view.

    When it became clear that we could not label the catalog, we tried to build a classifier for user images, i.e. to automatically determine the type of item the user wants to find (t-shirt, jeans, etc.). The main problem is the lack of a training sample. The catalog images are not suitable: first, they are not labeled; second, they cover a rather limited set of poses, and there is no guarantee the user will photograph an item in a similar pose. To get a large set of views of an item, we filmed a person wearing it on video, then cut the item out of a set of frames and built a training sample from them. The item was chosen to be contrasting, so that it separated easily from the background:
    [image]
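    A sketch of how the frames could be cut, assuming the garment's color contrasts strongly with everything else in the shot; the HSV bounds are placeholders that would be tuned per shoot, and the file name is hypothetical.

        import cv2
        import numpy as np

        cap = cv2.VideoCapture('fitting_session.mp4')
        lower = np.array([100, 80, 80])    # e.g. a saturated blue garment
        upper = np.array([130, 255, 255])
        frame_idx = saved = 0

        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % 10 == 0:  # keep every 10th frame
                hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
                mask = cv2.inRange(hsv, lower, upper)
                sample = cv2.bitwise_and(frame, frame, mask=mask)
                cv2.imwrite('sample_%04d.png' % saved, sample)
                saved += 1
            frame_idx += 1
        cap.release()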
    Unfortunately, this approach was quickly abandoned once it became clear how many videos we would have to shoot and process to cover all possible styles of clothing.

    Computer vision is a vast field, and we have not (so far) achieved the desired result of a highly relevant search. We do not want to sidetrack the project with extra auxiliary features; we will keep fighting to build the search itself. We will be glad to hear any advice and comments.
