Instagram analytics on Google App Engine

    Some time ago, an article was published on Habr about finding similar Twitter accounts. Unfortunately, its author did not respond to comments, so I had to reinvent the wheel. But in order not to build exactly the same thing, I decided to look for similar accounts on Instagram instead, and to run it on Google App Engine so that anyone could use the service. This is how * came about.

    The hardest part, of course, turned out to be making the service usable by everyone: staying within the free Google App Engine quotas while respecting the limits of the Instagram API.

    Everything works as follows:
    1. The user's request to analyze an account is checked against Instagram: if the specified user exists, a request for its analysis is added to the database. Each request has its own priority (for now, all requests have the same priority).
    2. Every 15 minutes, a cron job launches a task that picks one request from the queue in the database and creates a new task to fetch all followers of the requested user. On error, the task is retried:
      - name: followers-get
        rate: 1/m  # 1 task per minute
        bucket_size: 1
        max_concurrent_requests: 1
        retry_parameters:
          task_retry_limit: 2
          min_backoff_seconds: 30
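The selection step itself is simple. A minimal sketch of picking the next request, assuming a hypothetical request table with `priority` and `created` fields (higher priority served first, oldest first within the same priority; the real code queries the datastore):

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the request table in the datastore.
requests = [
    {'user_id': 10, 'priority': 1, 'created': datetime(2013, 5, 1, 12, 0)},
    {'user_id': 20, 'priority': 1, 'created': datetime(2013, 5, 1, 11, 0)},
    {'user_id': 30, 'priority': 2, 'created': datetime(2013, 5, 1, 13, 0)},
]

def pick_next_request(pending):
    """Choose the request to process next: highest priority first,
    oldest first within the same priority."""
    return min(pending, key=lambda r: (-r['priority'], r['created']))

print(pick_next_request(requests)['user_id'])  # 30: the priority-2 request wins
```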

      Each task, if not all followers were received in a single request, creates a new task to continue from the pagination cursor:
      # If the API returned a pagination cursor, enqueue a continuation task
      if users and users.get('pagination', {}).get('next_cursor'):
          cursor = users['pagination']['next_cursor']
          url = '/task/followers-get?user_id=' + user_id
          url += '&cursor=' + cursor
          taskqueue.add(queue_name='followers-get', url=url, method='GET')
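Collapsed into a single loop, the cursor-chaining logic that these chained tasks implement looks like this (the `fetch_page` stub stands in for the Instagram followers endpoint and its canned pages are illustrative):

```python
def fetch_page(user_id, cursor=None):
    """Stand-in for the paginated Instagram followers endpoint: returns a
    page of follower ids plus a cursor for the next page (if any)."""
    pages = {None: ([1, 2], 'c1'), 'c1': ([3, 4], 'c2'), 'c2': ([5], None)}
    ids, next_cursor = pages[cursor]
    pagination = {'next_cursor': next_cursor} if next_cursor else {}
    return {'data': ids, 'pagination': pagination}

def get_all_followers(user_id):
    """Keep following next_cursor until the API stops returning one; the
    real service does each iteration in a separate task."""
    followers, cursor = [], None
    while True:
        page = fetch_page(user_id, cursor)
        followers.extend(page['data'])
        cursor = page.get('pagination', {}).get('next_cursor')
        if not cursor:
            return followers

print(get_all_followers(42))  # [1, 2, 3, 4, 5]
```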

    3. Once all followers are received, the analysis of each one begins. For this, a huge number of tasks are created to fetch the list of users each follower is subscribed to (each task can spawn new tasks, just as with the followers above). To stay within Instagram's limit of 5'000 requests per hour, the task queue is configured as follows:
      - name: subscriptions-get
        rate: 5000/h

      Moreover, after each request completes, we additionally sleep for 0.72 seconds (= 60 * 60 / 5000), just in case.
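That pause can be derived directly from the rate limit, e.g. with a small throttling helper (a sketch; the real handlers just call `time.sleep` inline):

```python
import time

REQUESTS_PER_HOUR = 5000
PAUSE = 60 * 60 / float(REQUESTS_PER_HOUR)  # 0.72 seconds between requests

def throttled(fn):
    """Call fn, then sleep long enough to stay under the hourly limit."""
    result = fn()
    time.sleep(PAUSE)
    return result

print(PAUSE)  # 0.72
```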
      Unfortunately, the free tier of Google App Engine allows only 50'000 database writes per day. Since each task can create a new task, the initial approach of writing each task's result to the database had to be replaced: the result of the previous task is passed as a parameter to the new task, and only the last task in the chain writes the accumulated result to the database:
      if users and users.get('pagination', {}).get('next_cursor'):
          cursor = users['pagination']['next_cursor']
          params = {
              'user_id': user_id,
              'f_user_id': f_user_id,
              'cursor': cursor,
          }
          # Pass the accumulated result on to the continuation task instead
          # of writing it to the database
          if more_subscriptions:
              params['subscriptions'] = ','.join(more_subscriptions)
          taskqueue.add(queue_name='subscriptions-get',
                        url='/task/subscriptions-get', params=params,
                        method='POST')

      Some users (such as @instagram, for example) have millions of followers. To avoid wasting precious resources on fetching all of them, the task stops after receiving 100'000 followers.
    4. Because of the limit on database writes, it is impossible to properly track whether all tasks for a given user have completed. The normal solution would be to store a list of running task ids in the database and, as each task completes (or exhausts its retries), remove it from the list. But the huge number of tasks multiplied across all users makes that impractical, so the list of tasks is kept in memcache instead:
      # Track the pending per-follower tasks for up to two weeks (1209600 s)
      memcache.set('subscriptions' + str(user_id),
                   ','.join(str(x) for x in followers), 1209600)
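The bookkeeping around that key works roughly like this (a plain-Python sketch where a dict stands in for memcache; the real code uses `google.appengine.api.memcache` and the helper names are assumptions):

```python
# A dict stands in for memcache here.
cache = {}

def start_tracking(user_id, followers):
    """Record the ids of all follower-analysis tasks for this request."""
    cache['subscriptions' + str(user_id)] = ','.join(str(x) for x in followers)

def task_done(user_id, follower_id):
    """Remove a finished task's id from the list; an empty list means
    every subscriptions-get task for this request has completed."""
    key = 'subscriptions' + str(user_id)
    ids = cache.get(key, '').split(',')
    ids = [x for x in ids if x and x != str(follower_id)]
    cache[key] = ','.join(ids)
    return not ids  # True when the whole stage is finished

start_tracking(7, [100, 200])
task_done(7, 100)
print(task_done(7, 200))  # True: the last task has finished
```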

      Memcache data can be evicted at any time. To avoid a "hung" request (all of its tasks completed, but the memcache entry was evicted, so we never find out), a task runs every few hours that looks for requests that entered the follower-fetching stage more than 2 weeks ago (it is assumed that all tasks will definitely have finished by then). Any such requests are forcibly moved on to the next stage.
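The staleness check itself reduces to comparing timestamps. A minimal sketch, assuming each request records when its follower-fetching stage started (field names are illustrative):

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(weeks=2)  # assumed long enough for all tasks to finish

def find_stale(requests, now):
    """Requests whose follower-fetching stage started more than two weeks
    ago are assumed stuck (e.g. the memcache entry was evicted) and are
    forced on to the next stage."""
    return [r for r in requests if now - r['stage_started'] > STALE_AFTER]

now = datetime(2013, 6, 1)
reqs = [
    {'user_id': 1, 'stage_started': datetime(2013, 5, 10)},  # > 2 weeks old
    {'user_id': 2, 'stage_started': datetime(2013, 5, 25)},
]
print([r['user_id'] for r in find_stale(reqs, now)])  # [1]
```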
    5. At the next stage, all previously collected data is read from the database. It turned out there can be quite a lot of it, and the RAM allocated by GAE may not be enough. Therefore, the data is read in batches; an intermediate result is computed for each batch and merged into the running total. For this process, I had to disable the automatic caches:
      ctx = ndb.get_context()
      ctx.set_cache_policy(lambda key: False)
      ctx.set_memcache_policy(lambda key: False)
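The batch-and-merge idea can be sketched in plain Python (the batching helper and toy data are illustrative; the real code reads batches from ndb with query cursors):

```python
from collections import Counter

def iter_batches(records, size):
    """Yield the records in fixed-size batches."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def top_followed(subscription_lists, batch_size=2, limit=300):
    """Merge per-batch intermediate Counters instead of holding every
    record in memory at once, then keep the most popular accounts."""
    total = Counter()
    for batch in iter_batches(subscription_lists, batch_size):
        partial = Counter()
        for subs in batch:
            partial.update(subs)
        total += partial
    return [uid for uid, _ in total.most_common(limit)]

# Three followers and the accounts each of them follows (toy data):
data = [[1, 2, 3], [2, 3], [3]]
print(top_followed(data, limit=2))  # [3, 2]
```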

      The numerous calculations at this stage yield the 300 most popular accounts that the analyzed user's followers follow.
    6. For each of the 300 users, tasks are launched to fetch their details (name, picture, follower count, etc.). As in the process described above, the system either waits for all tasks to finish or forces the next stage after some time.
    7. At the last stage, the most similar users are computed and selected (taking into account the number of your followers and their total followers). The result looks something like this, and a link to it is sent by e-mail.
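The article does not give the exact similarity formula, but one plausible scoring consistent with the description (accounting for both your follower count and the candidate's total followers, so huge accounts don't win automatically) could look like this:

```python
def similarity(shared, your_followers, their_followers):
    """Illustrative score, not the service's actual formula: the overlap
    as a share of the candidate's audience, times the overlap as a share
    of your own audience."""
    if not your_followers or not their_followers:
        return 0.0
    return (shared / float(their_followers)) * (shared / float(your_followers))

# (shared followers with you, their total followers); names are toy data
candidates = {'a': (80, 100), 'b': (80, 10000)}
ranked = sorted(candidates,
                key=lambda c: similarity(candidates[c][0], 100, candidates[c][1]),
                reverse=True)
print(ranked)  # ['a', 'b']: same overlap, but 'a' is far more concentrated
```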

    The approaches and optimizations above keep the service within GAE's free quotas, although producing a result takes quite a long time. Your help is needed: add your users to the queue and see how long their analysis takes.

    In the future I plan to add recognition of real people vs. companies on Instagram to the service, but that won't work without machine learning, so it will be a separate task.

    * The Russian version of the site does not work yet: I can't figure out why Django translation doesn't work on GAE.
