Recommender systems: an introduction to the cold start problem
My name is Vasily, and for more than three months now I have been working as a mathematician at Surfingbird.
The first serious problem that I faced while working in the company was to solve the problem of a cold start. In this article I will describe the essence of the problem and the main directions of its solution.
The problem statement of the recommender system has already been described by Sergey Nikolenko in the article Recommender systems: problem statement .
Most recommendation systems are based on so-called collaborative filtering methods, and ours is no exception. Collaborative filtering algorithms rely only on the ratings users give and do not analyze the content of the resources (in our case, web pages). As a result, these algorithms need a sufficiently large number of ratings to work well, as a rule 10-20 per user. The task of producing relevant recommendations for new users and for new sites is called the cold start problem.
So the cold start problem splits into a cold start for users (what do we show new users?) and a cold start for sites (to whom do we recommend newly added sites?). Let's take them in order.
Cold start for users
A cold start for users can be handled on the basis of demographic data that users provide at registration. In our web-page recommendation task, a user's gender, date of birth, and location are known; we take these characteristics as basic. More demographic data can be obtained as well: using the APIs of social networks, we can find out the level of education, social status, and other characteristics.
There are two main approaches to using demographic information about a user in recommendations:
- Expertly compiled stereotypes for various demographic categories: an expert decides what to show each category on a cold start. The obvious disadvantage of this approach is the manual expert work required, and the user is recommended only popular sites subjectively chosen by the expert. The amount of expert work grows significantly with the number of categories.
- Demographic categories determined automatically by identifying clusters of users with similar interests. Recommendations are then based on the ratings given by users from the same category, i.e. of the same age, gender, location, and so on.
The second approach does not require experts and allows an arbitrary number of clusters, so I will dwell on it in more detail.
To build demographic categories, it is natural to use clustering methods. The objects being clustered are users; the features of the objects are the user's normalized demographic data: gender, age, location, education, and so on.
For clustering by demographic data, it is natural to use the k-means method, since each cluster is then defined by its center point and is therefore easy to interpret. The distance from an object (user) to a cluster center can, generally speaking, be defined in infinitely many ways; it is customary to use the Euclidean metric in the feature space.
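As a minimal sketch of this clustering step, assuming scikit-learn and a tiny made-up demographic feature set (the real features and their normalization are of course richer):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy demographic features: [gender (0/1), age, location id] -- hypothetical data.
users = np.array([
    [0, 18, 1], [0, 19, 1], [1, 20, 1],   # young users from city 1
    [1, 45, 2], [0, 50, 2], [1, 48, 2],   # older users from city 2
], dtype=float)

# Normalize so no single feature (e.g. age) dominates the Euclidean distance.
features = StandardScaler().fit_transform(users)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)           # cluster id per user
print(kmeans.cluster_centers_)  # each cluster is described by its center point
```

Normalizing first matters: without it, the age column would swamp the 0/1 gender column in the distance computation.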
However, in our task it is necessary to take into account rating data, and not just demographic data.
The situation is saved by the latent topics of the SVD algorithm, which are already computed and can be added as extra object features during clustering. The distance between users is then based both on the similarity of their demographic data and on their ratings.
Once the user clusters have been built, for each new user who has provided demographic data we know the cluster they belong to, and we can improve cold-start recommendations using group recommendations or filter bots. Let's look at both in a bit more detail.
The most natural method is group recommendation, whose name speaks for itself: for a new user, we select the recommendations that most users in their demographic category like.
There are a number of strategies for aggregating the ratings of different users into a group recommendation. For example, a group rating can be computed as follows:
GR = ∏_i r_i^(w_i),
where r_i is the rating of the i-th user and w_i is the weight of the i-th user. The product is taken over all users, or over a group selected by some criterion (for example, age and gender).
The weights w_i are given larger values for users with the same age, gender, and location as the new user, and smaller values for the rest (tuned manually).
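A small sketch of this aggregation; normalizing the weights to sum to 1 is my own choice here (it turns the product into a weighted geometric mean and keeps GR on the rating scale), and the ratings and weights are made-up:

```python
import math

def group_rating(ratings, weights):
    """Aggregate individual ratings into a group rating GR = prod(r_i ** w_i)."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize weights to sum to 1
    # Work in log space for numerical stability: exp(sum(w_i * log(r_i))).
    return math.exp(sum(w * math.log(r) for r, w in zip(ratings, norm)))

# Hypothetical example: three users in the cluster rated an item 4, 5 and 2;
# the first two match the new user's age/gender/location, so they weigh more.
print(group_rating([4, 5, 2], [3, 3, 1]))
```

With equal weights this reduces to the plain geometric mean of the ratings.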
An alternative approach is filter bots, which generate default ratings for a new user. At registration, filter bots automatically produce several ratings for the user based on their demographic data, and the collaborative filtering algorithms then use these ratings on a cold start. The advantages of this approach are ease of implementation and the fact that existing algorithms need no changes.
In addition, filter bots and group recommendations can be combined: the filter bots' default ratings are then the group ratings.
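A toy sketch of a filter bot seeding a new user's ratings from their demographic cluster; the seeding rule (per-item cluster averages) and all names and data here are hypothetical:

```python
# Existing users' ratings and their demographic cluster -- hypothetical data.
ratings = {  # user_id -> {item_id: rating}
    "u1": {"a": 5, "b": 3},
    "u2": {"a": 4, "b": 2, "c": 5},
}
cluster_of = {"u1": 0, "u2": 0}  # both belong to demographic cluster 0

def filterbot_seed(new_user, cluster, n_seed=2):
    """Write default ratings for new_user derived from their cluster's averages."""
    sums, counts = {}, {}
    for uid, user_ratings in ratings.items():
        if cluster_of[uid] != cluster:
            continue
        for item, r in user_ratings.items():
            sums[item] = sums.get(item, 0) + r
            counts[item] = counts.get(item, 0) + 1
    means = {item: sums[item] / counts[item] for item in sums}
    # Seed only the cluster's few top-rated items as the user's default ratings.
    top = sorted(means, key=means.get, reverse=True)[:n_seed]
    ratings[new_user] = {item: means[item] for item in top}
    return ratings[new_user]

print(filterbot_seed("new_user", cluster=0))
```

After seeding, the new user's row is non-empty, so the existing collaborative filtering code can treat them like any other user.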
Cold start for web pages
To solve the cold start problem for new web pages, various methods of analyzing the text and other content of a page (images, video, flash, links, etc.) are used.
The main methods of semantic text analysis that I would like to dwell on are LDA and relevance feedback.
The general outline of recommendations based on the textual content of a page is roughly as follows. First, useful content is extracted from every site (ads, menus, etc. are discarded). The words then undergo preprocessing: stop words are removed and lemmatization is performed. Next, a single dictionary of words is compiled, along with a table of word occurrences in the texts of the web pages (the pages' content profiles). Words are weighted by TF-IDF, and for overly long texts only the top N most significant words are kept.
The relevance feedback algorithm compiles a tag profile (i.e., a set of keywords) for each user based on the content profiles of the pages that user likes.
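A small sketch of building such a tag profile and matching it against a new page's content profile; summing liked pages' weights and using cosine similarity are illustrative choices, and all profiles are made-up toy data:

```python
import math

# Hypothetical TF-IDF content profiles (term -> weight) of pages a user liked.
liked_pages = [
    {"cats": 0.8, "naps": 0.4},
    {"cats": 0.5, "boxes": 0.7},
]

# The user's tag profile: accumulate term weights over liked pages.
tag_profile = {}
for page in liked_pages:
    for term, w in page.items():
        tag_profile[term] = tag_profile.get(term, 0.0) + w

def cosine(p, q):
    """Cosine similarity between two sparse term->weight profiles."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Content profile of a newly added page: it would be recommended to the users
# whose tag profiles score highest against it.
new_page = {"cats": 0.9, "videos": 0.6}
print(cosine(tag_profile, new_page))
```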
New sites are recommended to the users whose tag profiles correlate most strongly with the content profile of the newly added page.
The LDA algorithm acts differently. As a result of training a probabilistic model, the words of the dictionary are grouped into a fixed number of topics (for example, 100). For each web page a probability distribution over topics is constructed (that is, a vector of 100 features, each characterizing how strongly the page corresponds to that topic). To predict the probability of a like, a logistic regression is trained for each user on the LDA features of the pages the user has viewed, so each user also gets a weight vector over all LDA topics. When a new site is added, its LDA features are computed first, and then the site is recommended to the users with the highest predicted probability of a like, which is easy to compute from the known features of the user and the web page.
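A minimal sketch of this pipeline with scikit-learn, using a two-topic toy corpus instead of ~100 topics; all pages and the user's like history are hypothetical:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy corpus of cleaned page texts -- hypothetical stand-ins for real pages.
pages = [
    "cat kitten pet fur purr",
    "kitten cat purr pet toy",
    "stock market price trading",
    "market trading stock invest",
]
vec = CountVectorizer()
counts = vec.fit_transform(pages)

# Group dictionary words into a fixed number of topics; each page gets a
# probability distribution over those topics as its feature vector.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)  # shape: (n_pages, n_topics)

# One user's history: liked the cat pages, skipped the finance ones.
likes = np.array([1, 1, 0, 0])
clf = LogisticRegression().fit(topic_features, likes)

# A newly added page: compute its LDA features first, then this user's
# predicted like probability from the two weight/feature vectors.
new_counts = vec.transform(["cat pet purr fur"])
p_like = clf.predict_proba(lda.transform(new_counts))[0, 1]
print(p_like)
```

In production the vectorizer and LDA model are fitted once over the whole corpus, so scoring a new page is just a transform plus a dot product per user.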
Each of the algorithms mentioned here deserves a separate article with practical examples, but that is a topic for the future...