Group similar applications from different stores by icon

    I once had the misfortune of noticing an attractive job opening. Everything would have been fine, but, as usual, it came with a test task. In short, I had to group links that point to the same application in different stores. The links included applications such as Skype, Skype WiFi, Skype Qik, Viber, and two games with the same name, Skyward. The stores were Google Play, the App Store, and the Windows Phone Store. The task also described the pitfalls: you could not rely on application names, the developer company's name, and so on. "But identical applications are easy to recognize across platforms just by the icon," I thought, and set out to work through the details. It turned out not to be so simple.

    This is what the Viber and Skype icons look like in different stores:



    Alas, the icons differ in both color scheme and size. The original idea of hashing the icons and comparing the hashes will still come in handy, but not for this case. At first I made the mistake of analyzing the icons as they are displayed in the browser, and those are quite small. A little later, after some digging, I found versions from 300 to 350 pixels in size, which made the measurements more accurate. In general, the code that downloads an image is quite simple.

    For my task, I found the OpenCV library, a very sophisticated toolkit for all kinds of image analysis. At first I got so carried away that I started studying feature matching, but that was not quite what I needed. What I needed was to extract the contours from the images and somehow compare them.

    To build the contours, the image has to be prepared first: the edges in it must be detected. For this I used the Canny edge detector.



    In the case of the Skype icon, the following results were obtained:



    It may seem that size is the only remaining difference, but it is not. The detected edges differ slightly, and scaling the icons to the same size only adds error.
    The only trick is choosing the lower and upper thresholds for the algorithm correctly. The values 100 and 200 worked well for me.

    Next, we find the contours. Two contours can be compared by computing a similarity score between them, which is a very useful property for my task. There is one nuance: rotating a contour does not affect this score, but in my case that hardly matters. For the Skype icon from Google Play, the detected contours look like this:



    There are not two contours here but four: a contour is traced along both the outer and the inner side of each previously detected edge. I searched for contours with the RETR_LIST flag, that is, without a hierarchy, and then sorted them by position, starting from the top-left corner of the image.

    For my algorithm, I also need the total length of the contours; OpenCV has a separate function for this, arcLength. The algorithm itself boils down to this: if the matching contours of two images account for more than 80% of the total contour length, the two images are considered icons of the same application. The contours themselves are compared with the matchShapes function: the smaller its result, the better. In my case, the upper bound for treating two contours as matching was 0.15.
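    The arithmetic of this rule is easier to follow on plain numbers. The sketch below takes contour lengths (as arcLength would return them) and pairwise scores (as matchShapes would return them) as ready-made lists. The function names and the sample values are made up for illustration. Only the 0.15 and 0.8 thresholds come from the text.

```python
def coverage_rating(lengths_a, lengths_b, scores, max_score=0.15):
    """Share of the total contour length covered by matching contour pairs."""
    total_a = float(sum(lengths_a))
    total_b = float(sum(lengths_b))
    rating = 0.0
    for len_a, len_b, score in zip(lengths_a, lengths_b, scores):
        if score < max_score:
            # Count the pair by the smaller of its two relative lengths.
            rating += min(len_a / total_a, len_b / total_b)
    return rating

def same_application(lengths_a, lengths_b, scores, min_rating=0.8):
    """Apply the 80% rule to decide whether two icons belong together."""
    return coverage_rating(lengths_a, lengths_b, scores) > min_rating

# Hypothetical data: four contour pairs, the second icon is about twice as large.
lengths_a = [400.0, 300.0, 200.0, 100.0]
lengths_b = [800.0, 600.0, 380.0, 220.0]
good_scores = [0.01, 0.02, 0.05, 0.30]  # first three pairs match
bad_scores = [0.01, 0.30, 0.40, 0.05]   # only the first and last pairs match

print(coverage_rating(lengths_a, lengths_b, good_scores))
print(same_application(lengths_a, lengths_b, good_scores))
print(same_application(lengths_a, lengths_b, bad_scores))
```

    Because each length is divided by the total for its own icon, the rating does not depend on the absolute icon size. This matters because the stores serve icons at different resolutions.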

    However, there is a second kind of icon that this algorithm could not handle, namely the Skyward game icons:



    At the time of writing, these icons differ in color, but some time ago both stores used the first color variant. The icons differed only in size, yet because of this the contours did not match at all, and nothing could be determined from them. Here the imagehash library came to the rescue: for the Skyward game, the hashes were simply compared directly. However, since the color scheme of the icons changed, this approach no longer works.
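    To show why a difference hash survives a change of size but not a change of color scheme, here is a simplified pure-Python version of the idea. It is not the imagehash implementation: imagehash resizes the image with PIL, while this sketch samples pixels directly. The synthetic disc icon and all function names are assumptions made for the example.

```python
def dhash(gray, hash_size=8):
    """Simplified difference hash of a grayscale image given as a list of rows."""
    height, width = len(gray), len(gray[0])
    bits = 0
    for r in range(hash_size):
        y = r * height // hash_size
        # Sample hash_size + 1 columns so that each row yields hash_size bits.
        row = [gray[y][c * width // (hash_size + 1)]
               for c in range(hash_size + 1)]
        for left, right in zip(row, row[1:]):
            # A bit is set when brightness increases from left to right.
            bits = (bits << 1) | (1 if right > left else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def make_icon(size):
    """Synthetic icon: a bright disc on a dark background."""
    center, radius = size / 2.0, size * 0.35
    return [[255 if (x - center) ** 2 + (y - center) ** 2 <= radius ** 2 else 0
             for x in range(size)] for y in range(size)]

small = make_icon(64)
large = make_icon(128)
inverted = [[255 - p for p in row] for row in small]

print(dhash(small) == dhash(large))            # same picture, different size
print(hamming(dhash(small), dhash(inverted)))  # same shape, swapped colors
```

    The hash encodes only where brightness rises between neighboring samples, so scaling the picture leaves it unchanged, while swapping light and dark areas flips the bits.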

    The "employer" did not react to my idea. It happens.

    Source code
    import numpy as np
    import cv2
    import requests
    from collections import namedtuple
    from bs4 import BeautifulSoup
    import imagehash
    from PIL import Image
    def itunes_find(content):
        icon, name = None, None
        soup = BeautifulSoup(content)
        found = soup.find(id="title")
        name = found.div.h1.get_text()
        found = soup.find('img',{'class':'artwork', 'alt': name})
        imageurl = found['src-swap-high-dpi']
        icon_r = requests.get(imageurl)
        if icon_r.status_code == 200:
            img_array = np.asarray(bytearray(icon_r.content), dtype=np.uint8)
            icon = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
        return name, icon
    def google_find(content):
        icon, name = None, None
        soup = BeautifulSoup(content)
        found = soup.find('div',{'class':'cover-container'})
        imageurl = found('img')[0]['src']
        icon_r = requests.get(imageurl)
        if icon_r.status_code == 200:
            img_array = np.asarray(bytearray(icon_r.content), dtype=np.uint8)
            icon = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
        found = soup.find('div',{'class':'document-title'})
        if not found:
            found = soup.find('h1',{'class':'document-title'})
        if not found:
            # Title not found: dump the page for inspection and skip this entry
            with open('olala1.html', 'wb') as f:
                f.write(content)
            return None, icon
        name = found.get_text()
        return name, icon
    def windows_find(content):
        icon, name = None, None
        soup = BeautifulSoup(content)
        found = soup.find('img', {'class':'appImage xlarge'})
        imageurl = found['src']
        icon_r = requests.get(imageurl)
        if icon_r.status_code == 200:
            img_array = np.asarray(bytearray(icon_r.content), dtype=np.uint8)
            icon = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
        found = soup.find(id="application")
        name = found('h1')[0].get_text()
        return name, icon
    class Entry:
        def __init__(self, url, name, icon):
            self.url = url
            self.name = name
            self.icon = icon
            self.icon_hash = None
            self.contours = None
    items = {}
    def _go(url):
        # Defaults in case the URL belongs to none of the known stores
        name, icon = None, None
        r = requests.get(url, headers = {'User-agent': 'Mozilla/5.0'}, verify=False)
        if r.status_code == 200:
            if url.startswith('https://itunes.apple.com'):
                name, icon = itunes_find(r.content)
            elif url.startswith('https://play.google.com'):
                name, icon = google_find(r.content)
            elif url.startswith('http://www.windowsphone.com'):
                name, icon = windows_find(r.content)
            if name and icon is not None:
                items[url] = Entry(url, name, icon)
    url_list = [
    'https://itunes.apple.com/en/app/skype-for-iphone/id304878510?mt=8',
    'https://itunes.apple.com/en/app/skype-for-ipad/id442012681?mt=8',
    'https://play.google.com/store/apps/details?id=com.skype.raider&hl=en',
    'http://www.windowsphone.com/ru-ru/store/app/skype/c3f8e570-68b3-4d6a-bdbb-c0a3f4360a51',
    'https://play.google.com/store/apps/details?id=com.skype.android.access&hl=en',
    'https://itunes.apple.com/en/app/skype-wifi/id444529922?mt=8',
    'https://play.google.com/store/apps/details?id=com.skype.android.qik&hl=en',
    'https://itunes.apple.com/us/app/skype-qik-group-video-messaging/id893994044?mt=8',
    'https://play.google.com/store/apps/details?id=com.viber.voip&hl=en',
    'https://itunes.apple.com/en/app/viber/id382617920?mt=8',
    'https://play.google.com/store/apps/details?id=com.viber.voip&hl=en',
    'https://play.google.com/store/apps/details?id=com.ketchapp.skyward&hl=en',
    'https://itunes.apple.com/us/app/skyward/id943273841?mt=8',
    'https://play.google.com/store/apps/details?id=cz.george.mecheche&hl=en',
    ]
    tr = 100  # lower Canny threshold; the upper one is tr*2
    def _s(contour):
        # Sort key: top-left corner of the contour's bounding box
        x, y, w, h = cv2.boundingRect(contour)
        return (x, y)
    def _do():
        for u in url_list:
            _go(u)
        for item in items.values():
            # Perceptual hash of the icon (PIL expects RGB, OpenCV stores BGR)
            icon_c = cv2.cvtColor(item.icon, cv2.COLOR_BGR2RGB)
            pil_im = Image.fromarray(icon_c)
            item.icon_hash = imagehash.dhash(pil_im)
            edges = cv2.Canny(item.icon, tr, tr*2)
            # [-2] picks the contour list in both the 2-value and 3-value return forms
            contours = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)[-2]
            item.contours = sorted(contours, key = _s)
            item.weight = sum([cv2.arcLength(cnt,True) for cnt in item.contours])
        matches = []
        ungrouped = []
        items_copy = list(items.values())
        while items_copy:
            group = []
            item = items_copy[0]
            current = items_copy[1:]
            items_copy = []
            for other in current:
                if item.icon_hash == other.icon_hash:
                    group.append(other.url)
                else:
                    rating = 0
                    count = min(len(item.contours), len(other.contours))
                    for v in range(count):
                        # Method 1 is the I1 comparison based on Hu moments
                        result = cv2.matchShapes(item.contours[v], other.contours[v], 1, 0.0)
                        if result < 0.15:
                            l = cv2.arcLength(item.contours[v],True)
                            lo = cv2.arcLength(other.contours[v],True)
                            rating += min(l/item.weight, lo/other.weight)
                    if rating > 0.8:
                        group.append(other.url)
                    else:
                        items_copy.append(other)
            if group:
                group.append(item.url)
                matches.append(group)
            else:
                ungrouped.append(item.url)
        for v in matches:
            print('Found group: %s' % ', '.join(set([items[u].name.strip() for u in v])))
            print('Urls:\n%s\n' % '\n'.join(v))
        print("Ungrouped:")
        for v in ungrouped:
            print('Name %s' % items[v].name)
            print('Url %s' % v)
    _do()
    

