We write a simple system of recommendations on the example of Habr


    Today we’ll talk about recommender systems, or rather, about the simplest form of collaborative filtering. In the program guide: what is a recommender system, what is it based on, what is the mathematical apparatus and how can it be embodied in code. As a bonus, we will provide the results in the form of a simple service.

    1. What is a recommendation system
    2. Intuition
    3. Theory
    4. Implementation: code and data
    5. Habra-recommendations service
    6. Habra analytics


    What is a recommendation system


    In fact, we run into recommender systems every day, even if sometimes we don’t notice it. In the most explicit form, their work is visible in eg Amazon's online stores.



    The main objective of the system here is to offer new products based on purchased-viewed. Several goals are pursued at once, but the main one is to offer the product to the buyer, who is most likely to lead to a sale and satisfy his needs. So, informally, the recommender system offers a sorted list of products based on the customer’s background.

    Intuition


    In this article, we are talking about user-based collaborative filtering. This name may seem formidable, but quite simple ideas are behind it. “Collaborative” means based on the preferences of a certain group. For example, if Vasya, Petya and Sasha are buyers of the bookstore and their tastes are similar, then you can recommend Sasha shopping based on the shopping history of Vasya and Petya.


    (picture taken from here ).

    The picture describes a simple situation when several users watched a video, and only some of them liked it. When we decide whether to recommend a video to a user, we find that similar users disliked this video. As a result, it is not worth recommending. In other words, we filter content based ona similar group, hence the name collaborative filtering .

    Theory


    In this article we will consider only the case of a binary rating: “liked” or “no rating”. This model is applicable to favorites on Habré. If the user has saved the article to himself, then he considers it interesting or useful, and if there is no rating, then this does not mean anything, maybe he just did not see this article.

    There are several ways to filter (or rather rank) content, we will consider the so-called user-based ( user-based ) method.

    Data

    We have two entities users and articles in favorites . With each user i we associate many articles u i .

    Related users

    We determine the "similarity" of two users i and j , as



    This is the so-called Jacquard coefficient , which determines the degree of similarity of the two sets.

    The idea is simple - to determine how much the total part of the articles of two users relates to their total number.

    Recommended article

    Let some article p (from post) not be in the favorites, i.e. does not belong to the set u i , then we define the similarity of likes (“how much you will probably like it”) between the user and the article as follows:

    where n p and J p are the number of users and the users themselves who added the post p favorites.

    The idea behind the formula is simple, the contribution of one user is equal to the degree of similarity, and normalization by the number of users themselves.

    Recommendations

    Recommendations are a few posts that have maximum likes .

    Implementation: code and data


    For implementation, we need to take several steps:

    1. Collect user list
    2. Gather recommendations
    3. Count n p
    4. Write the likes function and get maximum k results


    Collect user list

    The algorithm is simple: one of the first posts for the 2013th year was selected and for each post users who left a comment gathered, and the author of the post himself. In total, a list of 25 thousand users has been compiled. The function code get_all_user_names can be found through the git in the file: recommender.py , and the collected list of users in the HabraData repository (this is the repository where I collect all kinds of interesting data from Habr) in the file user_list.txt

    Gather recommendations

    Each user, for example here , has a favorites tab, which you can parse and get data from the list. The collected data can be found in the file user_favorites.csv , and the collection code itself is in the same source code as above.

    Count n p

    For each collected post, we go through all users and count the number of times a post appears. Data in the post_counts.csv file .

    Likes function

    The main function code is given in the spoiler below. Short description: for each user, we consider its similarity with all other users, and if for the other user the similarity is not zero, then we update the similarity of the input user and the corresponding post. At the end, we normalize and sort in descending order.

    def give_recommendations (username, preferences, weights)
    def give_recommendations(username, preferences, weights):
      preference = preferences[username] 
      rank = {}
      for user_other, preference_other in preferences.iteritems():
        if username != user_other:
          similarity = jaccard_index(preference, preference_other)
          if not similarity:
            continue
          for post in preference_other:
            if post not in preference:
              rank.setdefault(post, 0)
              rank[post] += similarity
      #normalize and convert to 
      post_list = [(similarity/weights[post], post) for post, similarity in rank.items()]
      post_list.sort(reverse=True)
    



    To read: the practical book Programming Collective Intelligence and this post .

    Habra-recommendations service


    Based on the algorithm, we will make a simple recommendation service for users:



    Available at:
    www.habr-analytics.com/recommender
    (carefully, the author tested the recommendation system at 4 a.m.)

    The algorithm considered in this article is one of the simplest and most naive ( according to the assumptions of the model), so do not overestimate the results of its work. On the other hand, advanced algorithms are largely based on the same ideas and use similar techniques to model recommendations, and therefore it is useful to have at least a general understanding of user-based filtering.

    Habra Analytics


    If analytics on the example of Habr was interested, then in addition to the recommendations, the current version of the project www.habr-analytics.com supports article monitor, user analysis and a number of other options. Read more about an interesting application of the system in the article " Step Syndrome and Habr Attendance Slice ". And also in this article with a description of each of the functions separately.




    Also popular now: