madcat1991 January 14, 2015 at 15:57

Non-personalized recommendations: association method

Personal recommendations allow you to familiarize the user with objects about which he might never have known (and would not have known), but which he may like based on his interests, preferences, and behavioral properties. However, often the user does not search for a new object, but, for example, object A similar to object B (“Fast and the Furious 2” is similar to “Fast and the Furious”), or object A, which is purchased / consumed with object B (cheese with wine, beer with baby food, buckwheat with stew, etc.). To build such recommendations allow non-personalized recommendation systems (LDCs).

You can recommend similar / related objects, focusing on knowledge about objects (properties, tags, parameters) or knowledge about actions associated with objects (purchases, views, clicks). The advantage of the first method is that it allows you to accurately determine objects similar in properties (“Fast and the Furious 2” and “Fast and the Furious” - similar actors, a similar genre, similar tags, ...). However, this method will not be able to recommend related objects: cheese and wine. Another disadvantage of this method is the fact that marking all the objects available on the service requires a lot of effort.

At the same time, almost every service logs information about which user viewed / bought / clicked which object. This information is enough to build LDCs, which will allow recommending both similar and related facilities.

Under the cutter, the method of associations is described, which allows you to build non-personalized recommendations based only on data on actions on objects. There is also Python code that allows you to apply the method to a large amount of data.

Building non-personalized recommendations

To begin, consider the basic algorithm for constructing non-personalized recommendations. Suppose we have objects - movies and users. Users watch movies. Our raw data is a sparse matrix D (movies x users). If user u watched movie f, then in the corresponding cell of matrix D it costs 1.

In order to find movies similar to a given movie f, you need to know the similarity of movie f with all other movies. Similarity data is stored in the S matrix (movies x movies).

The basic algorithm for constructing non-personalized recommendations is as follows:

for a given movie f find the corresponding row R in the matrix S;
select from row R the set of films most similar to f - FR;
FR and there are non-personalized recommendations (similar / related).

Affinity method

From the described it can be seen that the recommendations and their quality depend only on the method of constructing the matrix S, and to be more precise, on the method for determining the similarity of the two films.

How to determine the similarity of movies x and y, if they watched a lot of users X and Y, respectively? The simplest solution is the Jacquard coefficient , which calculates the similarity of two objects (x and y) as:

Here, the numerator is the number of users who have watched both movie x and movie y. Denominator is the number of users who watched either movie x or movie y.

The calculated value is symmetric: x is similar to y in the same way that y is similar to x. If we want to make the coefficient asymmetric, then we can change the formula to the following:

At first glance, this method is ideal: it increases the value of similarity to films that are watched together, and normalizes the metric relative to the number of users who have watched the film.

The Harry Potter Problem or Banana Trap

Consider the above formula for the case when the object y is very popular (for example, a Harry Potter movie). Since the film is very popular and watched by many people, sim (x, y) will tend to 1 for almost all x films. This means that movie y will be similar to all movies, which is bad in most cases. It is unlikely that Harry Potter will be like the movie The Green Elephant.

The "Harry Potter Problem" is also called the "banana trap" (banana trap). Suppose a store tries to increase profits by recommending to the buyer goods that are often taken along with what the buyer is about to purchase. One of the most purchased items at the grocery store is bananas. Using the formula above, the system will recommend that all customers purchase bananas. Bananas will be bought - all is well. But these are bad recommendations, as bananas would be bought without recommendations. By recommending bananas, we reduce the profit by at least one successfully recommended product, other than bananas.

Association method

Obviously, the formula needs to be modified so that the object x makes the object y more attractive. Those. it is necessary to take into account not only that objects x and y are taken together, but also that object x cannot be taken without y.

We modify the similarity formula as follows:

Here! X is the set of users who have not watched the movie x. If y is a very popular object, then the denominator in the formula will be large. Then the similarity value will be less, and the recommendations will be more relevant. This method is called the association method.

Method Comparison

To compare the operation of the method of associations and the Jacquard coefficient, we consider the search for similar films using these two methods using the following initial data.

movies \ users	A	B	C	D	E
1. Harry Potter and the Sorcerer's Stone	1	1	1	1	1
2. The Hobbit: An Unexpected Journey	1	1			1
3. The Hobbit: The Desolation of Smaug	1	1		1
4. The Chronicles of Narnia: Prince Caspian			1
5. Dragon Heart					1

The similarity matrix, constructed using the asymmetric Jacquard coefficient, is as follows (we zero the diagonal so as not to recommend the original film):

films \ films	1	2	3	4	5
1	0	0.6	0.6	0.2	0.2
2	1	0	0.667	0	0.333
3	1	0.667	0	0	0
4	1	0	0	0	0
5	1	1	0	0	0

The similarity matrix for the association method will look as follows (in addition to the diagonal, we zero out infinity - cases where

films \ films	1	2	3
1	0	0	0
2	1,5	0	2
3	1,5	2	0
4	0.25	0	0
5	0.25	0.5	0

As can be seen from the similarity matrix, the association method allows you to take into account the overpopularity of the film "Harry Potter and the Sorcerer's Stone." When constructing recommendations by the method of associations for the film “The Hobbit: An Unexpected Journey”, the weight of “Harry Potter” (1.5) will be less than the weight of the more relevant film “The Hobbit: The Desolation of Smaug” (2).

Implementation

Below is a function to build a similarity matrix based on the association method. The function is written in Python using scipy and scikit-learn. This implementation allows you to quickly and without much expense calculate the similarity matrix for a large amount of source data.

Since within one row of the matrix the values | X | and |! X | do not change, and similar objects will be within the same line, then | X | and |! X | were omitted in calculating the association metric. The final metric formula looks like this:

def get_item_item_association_matrix(sp_matrix):
    """Простой способ построения матрицы ассоциаций
    :param sp_matrix: матрица с исходными данными о просмотрах (фильмы x пользователи)
    :return: матрица схожести фильмов (фильмы x фильмы)
    """
    watched_x_and_y = sp_matrix.dot(sp_matrix.T).tocsr()
    watched_x = csr_matrix(sp_matrix.sum(axis=1))
    magic = binarize(watched_x_and_y).multiply(watched_x.T)
    watched_not_x_and_y = magic - watched_x_and_y
    rows, cols = watched_not_x_and_y.nonzero()
    data_in_same_pos = watched_x_and_y[rows, cols].A.reshape(-1)
    return csr_matrix((data_in_same_pos / watched_not_x_and_y.data, (rows, cols)), watched_x_and_y.shape)

Conclusion

The association method is just one way to build non-personalized recommendations. As for any other method, before applying it, a number of questions need to be solved:

- determine the minimum amount of data at which the method can be applied;
- determine the threshold value of the association metric;
- determine what to do if the recommended objects have an association metric less than a threshold;
- etc.

From the above it follows that the method of associations may not be applied to all objects. Therefore, it should be combined with one or more methods of constructing non-personalized recommendations. This approach will allow you to create good recommendations regardless of the amount of data about the user or object.

PS Good recommendations!

Tags: