
Recommender systems: statement of the problem
Hello! My name is Sergey, I am a mathematician, and I determine the development of the Surfingbird recommendation system. With this article we open a cycle devoted to machine learning and recommendation systems in particular - I don’t know yet how many installations there will be in the cycle, but I will try to write them regularly. Today I will tell you what recommender systems are in general, and I will set the task a little more formally, and in the next series we will begin to talk about how to solve it and how our Tachikoma recommender system learns .

Recommender systems are models that know better than you what you want. Recently I heard a telling joke: as you know, supermarket chains usually try to predict what you want them to advertise for you (and this is an example of a recommendation system); in particular, the supermarket may try to recognize by changing preferences that the woman became pregnant and begin to use it. So, they saythat once an angry father burst into the office of a supermarket, whose daughter schoolgirls began to receive coupons for diapers and baby clothes by mail; the manager had to apologize for a long time and tell that all the recommender models are probabilistic, and mistakes are quite possible. After a couple of months, the father came again and apologized himself - it turned out that he did not know everything about his daughter ...
Collaborative filtering systems are models that try to predict how much you will like a particular product by receiving input on how you and other users have rated this and other products in the past. Collaborative filtering is the most popular type of recommendation system today. In Surfingbird, for example, your ratings are like and dislike buttons (as well as the fact that you looked at the page and did not give any rating, but more on that later). The more data we have about your preferences, the more interesting pages we can recommend to you!
Here are a few other examples of well-known recommendation systems.
So, back to our rams. Imagine that we have many users and many products (for Surfingbird these are web pages, for Netflix - movies, for Last.fm - compositions), and some users somehow rated some products. Formally speaking, the data consists of triples of the form
, where i denotes the user, a is the product, and
is the rating that user i gave the product a .
You can imagine this data as a matrix, each row of which corresponds to the user, and a column to the product. Our task is to predict the unknown elements of the matrix; to be precise, our task is to predict which of the unknown elements will be maximum in their row, that is, which products will be most liked by a particular user.
Collaborative filtering systems have several common problems that any model should solve one way or another.
In the next series, we will talk about what to do with these and other problems, as well as how to generally predict unknown ratings - stay tuned!

Recommender systems are models that know better than you what you want. Recently I heard a telling joke: as you know, supermarket chains usually try to predict what you want them to advertise for you (and this is an example of a recommendation system); in particular, the supermarket may try to recognize by changing preferences that the woman became pregnant and begin to use it. So, they saythat once an angry father burst into the office of a supermarket, whose daughter schoolgirls began to receive coupons for diapers and baby clothes by mail; the manager had to apologize for a long time and tell that all the recommender models are probabilistic, and mistakes are quite possible. After a couple of months, the father came again and apologized himself - it turned out that he did not know everything about his daughter ...
Collaborative filtering systems are models that try to predict how much you will like a particular product by receiving input on how you and other users have rated this and other products in the past. Collaborative filtering is the most popular type of recommendation system today. In Surfingbird, for example, your ratings are like and dislike buttons (as well as the fact that you looked at the page and did not give any rating, but more on that later). The more data we have about your preferences, the more interesting pages we can recommend to you!
Here are a few other examples of well-known recommendation systems.
- Amazon is one of the leaders in the region; Amazon recommends you books and other products based on what you bought, what you looked at, which ratings you put, which reviews you left ... Yes, as it usually happens, Big Brother collects everything, even if he is not able to use something.
- Netflix - in Russia they know little of this company, and it does not work for Russia, however, it was Netflix that made itself the loudest statement in the scientific community when it announced the famous Netflix Prize, having promised $ 1M for improving the quality of their prediction algorithm by 10% (we will talk about the Netflix Prize and the lessons that have been learned from it in one of the following series). Netflix's core business is movie rental; Now the company has switched to streaming video, but the first ten years of their life they sent physical DVDs by mail, which then had to be sent back to get the next one (the money was taken for a subscription). It’s difficult for a Russian person to understand how it is - to pay money to download a movie, or not even download it, but watch it online - but the model was very successful, and the data set published for Netflix Prize for several years became the main test case for collaborative filtering systems (now Netflixremoved him from open access due to possible deanonymization, and Yahoo! KDD Cup Dataset ).
- Last.fm and Pandora recommend music. They adhere to different recommendation strategies: Last.fm uses, in addition to the ratings of other users, exclusively “external” data about the music - the author, style, date, tags, etc., and Pandora is based on the “content” of the musical composition, using very an interesting idea is the Music Genome Project , in which professional musicians analyze the composition according to several hundred attributes (unfortunately, Pandora is not available in Russia now). True, no one knows how to analyze compositions automatically yet, and this is another interesting application of effort for machine learning ...
- Google , Yahoo! , Yandex - can we say that they also recommend sites to users? Formally, yes, but in reality they are different systems: search engines try to predict how relevant a given document is to a given query, and recommenders try to predict what rating a given user will give this product. Of course, in the success of search engines there is a considerable merit of models based on data from users (click logs), and, of course, search results are often personalized, but still the task is slightly different. A little closer to our task is the problem of which ads to show the user ( AdSense , Yandex.Directetc.) - here users really “vote with their feet” for advertisements, and you need to “recommend” those that are most likely to cause a positive reaction. But the matter is complicated by the economic side of the issue (advertisers pay money, an auction must be arranged between them for the right to advertise), so we will not consider these tasks either. However, leading search engines have a lot of side projects based on recommendation systems - for example, we already mentioned Yahoo! The Music .
So, back to our rams. Imagine that we have many users and many products (for Surfingbird these are web pages, for Netflix - movies, for Last.fm - compositions), and some users somehow rated some products. Formally speaking, the data consists of triples of the form


You can imagine this data as a matrix, each row of which corresponds to the user, and a column to the product. Our task is to predict the unknown elements of the matrix; to be precise, our task is to predict which of the unknown elements will be maximum in their row, that is, which products will be most liked by a particular user.
Collaborative filtering systems have several common problems that any model should solve one way or another.
- The rating matrix, as a rule, is very sparse - usually there are a lot of users and products, and the ratings are actually much less than their product, because the average user estimates very few products; the remaining elements of the matrix are unknown to us, and it is precisely them that must be predicted.
- The problem of a cold start . For users - when a new user arrives who does not yet have ratings, what to do with him? Well, well, when not at all, that's nothing - you can simply recommend the most popular products; but what if the user has already rated something, but so far so little? For products - how many ratings are needed for a new product before it can be confidently recommended? And where do these ratings come from if you don't recommend it to anyone?
In the next series, we will talk about what to do with these and other problems, as well as how to generally predict unknown ratings - stay tuned!