Rekko Challenge
Today we are launching the Rekko Challenge 2019, a machine learning competition from the Okko online cinema.
We invite you to build a recommendation system based on real data from one of the largest Russian online cinemas. We are confident that this task will be interesting for both beginners and experienced specialists. We have tried to leave as much room for creativity as possible, without overloading you with gigabyte-sized datasets and hundreds of precomputed features.
More about Okko, the task, the data, the prizes and the rules below.
Task
You have access to data on all views, ratings and additions to “Memorized” for movies and series by users over a period of N days (N > 60), as well as all the meta-information about the content. The goal is to predict which movies and series each user will buy or watch by subscription over the next 60 days.
In the next section, we describe the minimum you need to know about online cinemas in order to quickly understand the data and start analyzing it. If this information is not relevant to you, you can proceed directly to the description of the data.
About our service
A user who wants to watch movies legally on the Internet has three main options.
The first is to watch for free, with constant interruptions by commercials (AVOD, Advertising Video On Demand). The second is to buy a movie for your collection or rent it (TVOD, Transactional Video On Demand). The third is to subscribe for a specific period (SVOD, Subscription Video On Demand).
Okko works only on TVOD and SVOD models. In our service there is absolutely no advertising.
In total, the service has a little over 10 thousand movies and series; about 6 thousand of them are available by subscription, and the rest can only be bought or rented. At the same time, almost any subscription content can also be purchased. An exception is, for example, the Amediateka series, which can only be watched by subscription.
Which model a film is available under depends largely on the studio that owns the rights. Studios conclude contracts with online cinemas that stipulate the time frame and the rights under which the film will be available. As a rule, the conditions are the same for all market players, but sometimes studios make concessions to some cinemas or offer better conditions for more money. This is how exclusives appear.
For example, major worldwide releases do not get into the subscription immediately, but only 2-3 months after they appear in the service. Moreover, in the first few weeks they cannot even be rented; they can only be bought outright. Russian films, however, can be available by subscription immediately after release, and sometimes even simultaneously with the start of their theatrical run.
When the contract expires, the film becomes unavailable until the expired contract is renewed or a new one is concluded.
Periods without rights to content are clearly visible on graphs of the number of views. Below, for example, is such a graph for the movie "John Wick 2". At first glance it may seem that Hadoop was down for a couple of months, but no: the rights had expired.
The highest peak on the graph above (marked with a vertical line) coincides with the date the movie was added to the subscription: this is very characteristic behavior for high-profile new releases.
Our service has 12 subscriptions:
- eight thematic subscriptions,
- Amediateka Series,
- ABC Series,
- Russian films and series from the service "START",
- Films in 4K.
There are also two subscription packages: “Optimal”, which includes all thematic subscriptions, and “Optimal + Amediateka”.
The most popular, of course, are the packages. Among thematic subscriptions, users prefer World Cinema and Our Cinema.
Few users watch movies only by subscription; the majority either only buy films or buy films in addition to their subscriptions.
Most often, users buy new releases that are currently in theaters and major premieres of the past year.
The most popular source of purchases in the application is the “Recommendations” section, followed by “Search”, “News” and “Catalog”. Some movies are bought from “Similar” and “Memorized”.
One of the main problems that we at Okko are actively fighting is the problem of choosing content. If you look at the graph of purchase probability as a function of time spent in the service (data from last year), you will see that users are ready to choose and buy a film within the first 10 minutes, after which the probability of buying falls rapidly. At the same time, a fairly large share of users spend half an hour to an hour and still cannot choose suitable content.
10 minutes is not much. In this time, the user is physically unable to study the catalog in detail and select content to their liking.
This is where Rekko comes in: the internal recommendation system of the Okko online cinema. Rekko currently works in two sections of the service, “Recommendations” and “Similar”.
To assess user satisfaction with content, we analyze purchases, subscription views, viewing time, additions to “Memorized” and user ratings.
The rating scale in Okko is shown as five stars with half-star steps, so the rating takes integer values from 0 to 10.
The user can rate a movie at any time, regardless of whether they have bought or watched it. A rating can be changed an unlimited number of times, but cannot be withdrawn.
A movie can be “memorized” at any time; it then appears in “Memorized” in the user profile. It can be removed from there in the same way.
Work on Rekko started exactly a year ago and, according to A/B tests, it has so far allowed us to increase the average number of purchases by 4%, transactional revenue by 3% and conversion to subscription by 5%, while users began to choose films 18% faster.
Data
All data, except for viewing time and ratings, are anonymized or distorted. Time is expressed in abstract units, for which the order relation and distance are maintained.
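Because the abstract time units preserve order and distances, a local validation split can still be made by time. The following is a minimal sketch, assuming rows are (user_uid, element_uid, ts) tuples; the function name split_by_time and the holdout fraction are illustrative choices, not part of the competition materials.

```python
def split_by_time(rows, holdout_fraction=0.2):
    """Split (user_uid, element_uid, ts) rows into train and holdout by timestamp.

    holdout_fraction is the share of the observed time span (not of the rows)
    that goes to the holdout part; 0.2 is an arbitrary illustrative value.
    """
    timestamps = [ts for _, _, ts in rows]
    t_min, t_max = min(timestamps), max(timestamps)
    # Distances between timestamps are preserved, so a fraction of the span
    # is meaningful even though the unit itself is unknown.
    cutoff = t_max - holdout_fraction * (t_max - t_min)
    train = [row for row in rows if row[2] <= cutoff]
    holdout = [row for row in rows if row[2] > cutoff]
    return train, holdout, cutoff


# Toy data, not taken from the real dataset.
rows = [
    (1, 10, 100.0),
    (1, 11, 150.0),
    (2, 10, 180.0),
    (2, 12, 200.0),
]
train, holdout, cutoff = split_by_time(rows, holdout_fraction=0.25)
```

The real test period is 60 days long, but since the unit of ts is not disclosed, the fraction to hold out has to be chosen by the participant.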
transactions.csv
Records of all transactions and the content views made on them during the training period. A transaction here means buying a movie outright, renting it, or starting to watch it by subscription.
element_uid | user_uid | consumption_mode | ts | watched_time | device_type | device_manufacturer |
---|---|---|---|---|---|---|
3336 | 5177 | S | 44305181.2180206 | 4282 | 0 | 50 |
481 | 593316 | S | 44305180.606027626 | 2989 | 0 | 11 |
4128 | 262355 | S | 44305180.41444582 | 833 | 0 | 50 |
- element_uid - item identifier
- user_uid - user identifier
- consumption_mode - type of consumption (P - purchase, R - rent, S - view by subscription)
- ts - time of the transaction
- watched_time - number of seconds the user watched within this transaction
- device_type - anonymized type of the device from which the transaction was made
- device_manufacturer - anonymized manufacturer of the device from which the transaction was made
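As an illustration of the schema, here is a minimal sketch of parsing the file with the standard library. It assumes that transactions.csv is comma-separated with a header row matching the column names; both are assumptions about the real file. The helper read_transactions is hypothetical, and the sample rows are copied from the table above.

```python
import csv
import io

# Two sample rows from the table above; the comma delimiter and the header
# row are assumptions about the real transactions.csv layout.
SAMPLE = """element_uid,user_uid,consumption_mode,ts,watched_time,device_type,device_manufacturer
3336,5177,S,44305181.2180206,4282,0,50
4128,262355,S,44305180.41444582,833,0,50
"""


def read_transactions(file_obj):
    """Parse transactions into dicts with numeric fields converted."""
    int_fields = ("element_uid", "user_uid", "watched_time",
                  "device_type", "device_manufacturer")
    rows = []
    for record in csv.DictReader(file_obj):
        row = {name: int(record[name]) for name in int_fields}
        row["consumption_mode"] = record["consumption_mode"]  # "P", "R" or "S"
        row["ts"] = float(record["ts"])
        rows.append(row)
    return rows


transactions = read_transactions(io.StringIO(SAMPLE))

# Total seconds watched per user, as a first sanity check on the data.
watched_by_user = {}
for row in transactions:
    user = row["user_uid"]
    watched_by_user[user] = watched_by_user.get(user, 0) + row["watched_time"]
```

For the real file, the same function can be given an open file handle instead of the in-memory sample.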
ratings.csv
Information about user ratings for the training period. The information is aggregated: if a user changed a rating, only the last value is present in the table.
user_uid | element_uid | rating | ts |
---|---|---|---|
571252 | 1364 | 10 | 44305174.26309871 |
63140 | 3037 | 10 | 44305139.28281821 |
443817 | 4363 | 8 | 44305136.20584908 |
- element_uid - item identifier
- user_uid - user identifier
- rating - user-supplied rating (from 0 to 10)
- ts - time of the rating
bookmarks.csv
Records of users adding movies to “Memorized”. The information is aggregated: if a user removed a movie from “Memorized”, the table contains no record of it having been added.
user_uid | element_uid | ts |
---|---|---|
301135 | 7185 | 44305161.30743926 |
301135 | 4083 | 44305160.01187332 |
301135 | 10158 | 44305157.74463292 |
- element_uid - item identifier
- user_uid - user identifier
- ts - time the movie was added to “Memorized”
catalog.json
Meta-information about all the elements that can be recommended: movies, multipart movies and series.
{
  "1983": {
    "type": "movie",
    "availability": ["purchase", "rent", "subscription"],
    "duration": 140,
    "feature_1": 1657223.396513469,
    "feature_2": 0.7536096584,
    "feature_3": 39,
    "feature_4": 1.1194091265,
    "feature_5": 0.0,
    "attributes": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...]
  },
  "2166": {
    "type": "movie",
    "availability": ["purchase", "rent"],
    "duration": 110,
    "feature_1": 36764165.87817783,
    "feature_2": 0.7360206399,
    "feature_3": 11,
    "feature_4": 1.1386044027,
    "feature_5": 0.6547073468,
    "attributes": [16738, 13697, 1066, 1089, 7, 5318, 308, 54, 170, 33, ...]
  },
  ...
}
- type - takes the values movie, multipart_movie or series
- duration - duration in minutes, rounded to tens (for series and multipart movies, the duration of one episode)
- availability - available content rights (may contain the values purchase, rent and subscription)
- attributes - a bag of anonymized attributes
- feature_1..5 - five anonymized real-valued and ordinal features
The available rights are given as of the end of the training period and the beginning of the test period.
Important: JSON dictionary keys can only be strings, so do not forget to convert them to numbers if you read the identifiers in the tables as numbers (which is worth doing to save memory).
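A minimal sketch of that key conversion, using a trimmed inline stand-in for catalog.json (the real file has more fields per item):

```python
import json

# A trimmed stand-in for catalog.json with only two fields per item.
raw = '{"1983": {"type": "movie", "duration": 140}, "2166": {"type": "movie", "duration": 110}}'

catalog = json.loads(raw)

# JSON object keys are always strings, so convert them to int to match
# element_uid values that were read as integers from the CSV tables.
catalog = {int(element_uid): meta for element_uid, meta in catalog.items()}
```

After the conversion, a lookup such as catalog[1983] works with the integer identifiers from the tables, while the string key "1983" is no longer present.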
Metrics
As the metric, we use Mean Average Precision (MAP) at 20 elements, slightly modified. During the test period a user may have consumed fewer than 20 items. If we computed plain MAP in that case, the upper bound of the metric would be less than one and the values would be small. Therefore, if the user consumed fewer than 20 items, we normalize by their number rather than by 20.
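$$\mathrm{MAP@20} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\min(n_u, 20)} \sum_{i=1}^{20} r_u(i)\, p_u@i, \qquad p_u@i = \frac{1}{i} \sum_{j=1}^{i} r_u(j)$$
Here $U$ is the set of users in the test data, $r_u(i)$ indicates (1 or 0) whether the $i$-th predicted element is in the set of elements consumed during the test period by user $u$, and $n_u$ is the size of this set. If you have forgotten the ranking quality metrics, there is a great article about them.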
def average_precision(
        dict data_true,
        dict data_predicted,
        const unsigned long int k
) -> float:
    cdef:
        unsigned long int n_items_predicted
        unsigned long int n_items_true
        unsigned long int n_correct_items
        unsigned long int item_idx

        double average_precision_sum
        double precision

        set items_true
        list items_predicted

    if not data_true:
        raise ValueError('data_true is empty')

    average_precision_sum = 0.0

    for key, items_true in data_true.items():
        # Users without predictions contribute zero to the sum.
        items_predicted = data_predicted.get(key, [])

        n_items_true = len(items_true)
        n_items_predicted = min(len(items_predicted), k)

        if n_items_true == 0 or n_items_predicted == 0:
            continue

        n_correct_items = 0
        precision = 0.0

        for item_idx in range(n_items_predicted):
            if items_predicted[item_idx] in items_true:
                n_correct_items += 1
                # Precision at this position, counted only on hits.
                precision += <double>n_correct_items / <double>(item_idx + 1)

        # Normalize by the number of consumed items, capped at k.
        average_precision_sum += <double>precision / <double>min(n_items_true, k)

    return average_precision_sum / <double>len(data_true)


def metric(true_data, predicted_data, k=20):
    # Convert the true items to sets for fast membership checks.
    true_data_set = {key: set(items) for key, items in true_data.items()}

    return average_precision(true_data_set, predicted_data, k=k)
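To check an understanding of the metric without compiling Cython, the same logic can be ported to plain Python and run on a toy example. This is a sketch that mirrors the function above; the toy user and item identifiers are made up.

```python
def average_precision(data_true, data_predicted, k=20):
    """Pure-Python port of the Cython function above (same logic, no C types)."""
    if not data_true:
        raise ValueError("data_true is empty")
    total = 0.0
    for key, items_true in data_true.items():
        # Only the first k predictions are scored.
        items_predicted = data_predicted.get(key, [])[:k]
        if not items_true or not items_predicted:
            continue
        n_correct = 0
        precision_sum = 0.0
        for idx, item in enumerate(items_predicted):
            if item in items_true:
                n_correct += 1
                precision_sum += n_correct / (idx + 1)
        # Normalize by the number of consumed items, capped at k.
        total += precision_sum / min(len(items_true), k)
    return total / len(data_true)


# Toy data: user 1 consumed items 10 and 20, user 2 consumed item 30.
true_data = {1: {10, 20}, 2: {30}}
predicted = {1: [10, 99, 20], 2: [98, 30]}
score = average_precision(true_data, predicted)
```

For user 1 the hits at positions 1 and 3 give (1/1 + 2/3) / 2 = 5/6, for user 2 the hit at position 2 gives (1/2) / 1 = 1/2, so the score is (5/6 + 1/2) / 2 = 2/3.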
Prizes and rules
The prize fund is 600 thousand rubles:
- 300 thousand goes to the winner;
- 200 thousand goes to the participant in second place;
- 100 thousand goes to the participant in third place.
Standard rules apply: do not disrupt the platform, use only one account, do not exchange code privately with other participants, and do not be an employee of Okko or Rambler.
How to start
Getting started in a competition can be difficult even for experienced specialists: you need to quickly get to grips with a new domain, understand and analyze the data, and learn new libraries.
We hope that in this article we have managed to immerse you in the subject of online cinemas and describe the data in sufficient detail. In the archive with the task, you will find the file baseline.ipynb, which contains code for loading the data and an example of a simple solution using the K nearest neighbors algorithm.
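For readers who want a feel for neighborhood-based recommendations before opening the notebook, here is a minimal sketch of item-based collaborative filtering with cosine similarity on binary interactions. It is not the code from baseline.ipynb: the function recommend_item_knn is hypothetical, it sums similarities over all items rather than truncating to the K nearest neighbors, and excluding already consumed items is an assumption.

```python
from collections import defaultdict
from math import sqrt


def recommend_item_knn(user_items, top_n=20):
    """Recommend unseen items by summing cosine similarities to consumed items."""
    # Invert user -> items into item -> users.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    def similarity(a, b):
        # Cosine similarity between the binary user sets of two items.
        common = len(item_users[a] & item_users[b])
        return common / sqrt(len(item_users[a]) * len(item_users[b]))

    recommendations = {}
    for user, seen in user_items.items():
        scores = defaultdict(float)
        for candidate in item_users:
            if candidate in seen:
                continue  # assumption: do not recommend consumed items
            for item in seen:
                scores[candidate] += similarity(candidate, item)
        # Sort by descending score; ties are broken by item id.
        ranked = sorted(scores, key=lambda item: (-scores[item], item))
        recommendations[user] = ranked[:top_n]
    return recommendations


# Toy interactions: user id -> set of consumed item ids.
user_items = {
    1: {10, 20},
    2: {10, 20, 30},
    3: {10, 40},
}
recs = recommend_item_knn(user_items, top_n=2)
```

The nested loops are quadratic in the number of items, so on the real data a sparse-matrix implementation would be needed; the sketch only shows the idea.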
If anything in the data description or the domain remains unclear, we will be happy to answer your questions in the comments. You can also ask questions in the Telegram channel @boosterspro, where the main discussion of the competition will take place.
So, how to get started:
- Sign up at boosters.pro and join @boosterspro;
- Download the data on the competition page or here;
- Open baseline.ipynb, install the necessary packages, run all the code and upload your first solution;
- Try changing the baseline to improve the score;
- Experiment!
Rekko Challenge starts today, February 18. Submissions are accepted until April 18, 23:59:59 Moscow time.
We look forward to seeing everyone, and good luck!
By the way, we are hiring, including a developer of recommendation systems.