Rekko challenge


    Today we are launching Rekko Challenge 2019, a machine learning competition from the Okko online cinema.


    We invite you to build a recommender system on real data from one of the largest Russian online cinemas. We are confident that the task will interest both beginners and experienced specialists. We have tried to leave as much room for creativity as possible without burdening you with gigabyte-sized datasets containing hundreds of precomputed features.


    More about Okko, the task, the data, the prizes and the rules below.


    Task


    You are given data on all views, ratings and additions to “Memorized” that users made for movies and series over a period of N days (N > 60), together with all the meta-information about the content. The goal is to predict which movies and series each user will buy or watch by subscription over the following 60 days.


    In the next section we describe the minimum you need to know about online cinemas to quickly make sense of the data and start analyzing it. If this is not relevant to you, you can skip straight to the description of the data.


    About our service


    A user who wants to watch movies legally on the Internet has three main options.


    The first is to watch for free while constantly being interrupted by commercials (AVOD, Advertising Video On Demand). The second is to buy a movie for your collection or rent it (TVOD, Transactional Video On Demand). The third is to subscribe for a specific period (SVOD, Subscription Video On Demand).


    Okko works only on the TVOD and SVOD models. There is absolutely no advertising in our service.


    In total, the service has a little over 10 thousand films and series. About 6 thousand of them are available by subscription; the rest can only be purchased or rented. Almost any subscription content can also be purchased. One exception is the Amediateka series, which can only be watched by subscription.


    The distribution of the amount of content depending on the consumption model


    Which model a film is available under depends largely on the studio that owns the rights. Studios sign contracts with online cinemas that stipulate the time frame and the rights under which the film will be available. As a rule, the conditions are the same for all market players, but sometimes studios make concessions to some cinemas or offer better conditions for more money. This is how exclusives appear.


    For example, major international releases do not become available by subscription right away, but only 2-3 months after they appear in the service. Moreover, for the first few weeks they cannot even be rented; only permanent purchase is available. Russian films, by contrast, can be available by subscription immediately after release, sometimes even simultaneously with the start of their theatrical run.


    When the contract expires, the film becomes unavailable until the expired contract is renewed or a new one is concluded.


    An example of an inaccessible movie card


    Periods without content rights are clearly visible on graphs of the number of views. Below is such a graph for the movie "John Wick 2". At first glance it may seem that Hadoop was down for a couple of months, but no: the rights had expired.


    Number of views of the movie "John Wick 2"


    The highest peak on the graph above (marked with a vertical line) coincides with the date the movie was added to a subscription: this is very characteristic behavior for high-profile new releases.


    Our service has 12 subscriptions:


    • Eight thematic subscriptions,
    • Amediateka series,
    • ABC series,
    • Russian films and series from the "START" service,
    • Films in 4K.

    There are also two subscription packages: “Optimal”, which includes all thematic subscriptions, and “Optimal + Amediateka”.


    Subscription Interface


    The most popular, of course, are the packages. Among the thematic subscriptions, users prefer World Cinema and Our Cinema.


    The dynamics of the number of views by subscription


    Few users watch movies only by subscription; most either only buy films or make purchases in addition to their subscriptions.


    Most often, users buy films that are currently in theaters and major premieres of the past year.


    The most popular source of purchases in the application is the “Recommendations” section, followed by “Search”, “News” and “Catalog”. Users also buy some movies from “Similar” and “Memorized”.


    Distribution of purchase sources


    One of the main problems we at Okko are actively tackling is helping users choose content. If you look at the graph of purchase probability versus time spent in the service (data from last year), you will see that users are ready to choose and buy a film within the first 10 minutes, after which the probability of a purchase falls rapidly. At the same time, a fairly large share of users spend half an hour to an hour in the service and still cannot find suitable content.


    Probability of purchase versus time spent in the service


    10 minutes is not much. In that time a user physically cannot study the catalog in detail and pick content they will like.


    This is where Rekko comes in: the internal recommender system of the Okko online cinema. Rekko currently works in two sections of the service, “Recommendations” and “Similar”.


    “Recommendations” section on TV


    “Similar” section on TV


    To assess user satisfaction with content, we analyze purchases, subscription views, viewing time, additions to “Memorized” and user ratings.


    The rating scale in Okko is shown as five stars with half-star steps, so a rating takes integer values from 0 to 10.


    Rating interface


    A user can rate a movie at any time, whether or not they have bought or watched it. A rating can be changed any number of times, but it cannot be withdrawn.


    A user can “memorize” a movie at any time; it then appears in the “Memorized” section of the user profile. It can be removed from there in the same way.


    Memorization interface


    Memorized movies in profile


    Work on Rekko started exactly a year ago. So far, according to A/B tests, it has increased the average number of purchases by 4%, transactional revenue by 3% and conversion to subscription by 5%, and users now choose films 18% faster.


    Conversion to purchase in the control group and the group with Rekko


    Data


    All data except viewing time and ratings are anonymized or distorted. Time is expressed in abstract units that preserve ordering and distances.


    transactions.csv


    Records of all transactions, and of the content views within them, for the training period. A transaction here means buying a movie permanently, renting it, or starting a viewing by subscription.


    element_uid  user_uid  consumption_mode  ts                  watched_time  device_type  device_manufacturer
    3336         5177      S                 44305181.2180206    4282          0            50
    481          593316    S                 44305180.606027626  2989          0            11
    4128         262355    S                 44305180.41444582   833           0            50

    • element_uid - item identifier
    • user_uid - user ID
    • consumption_mode - type of consumption (P - purchase, R - rent, S - view by subscription)
    • ts - time of the transaction
    • watched_time - number of seconds the user watched within this transaction
    • device_type - anonymized type of device from which the transaction was made
    • device_manufacturer - anonymized manufacturer of the device from which the transaction was made
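
    As a minimal sketch of working with this table, the snippet below reads the three sample rows shown above and collects, for every user, the set of items they consumed. It assumes the real file is comma-separated with a header row; the function name load_user_items is hypothetical and is not part of the competition materials.

```python
import csv
import io
from collections import defaultdict

# Three sample rows from the article, written out as comma-separated text.
# The comma delimiter is an assumption about the layout of the real file.
SAMPLE = """element_uid,user_uid,consumption_mode,ts,watched_time,device_type,device_manufacturer
3336,5177,S,44305181.2180206,4282,0,50
481,593316,S,44305180.606027626,2989,0,11
4128,262355,S,44305180.41444582,833,0,50
"""


def load_user_items(handle):
    """Map each user_uid to the set of element_uid values they consumed."""
    user_items = defaultdict(set)
    for row in csv.DictReader(handle):
        # Identifiers are parsed as integers to keep memory usage low.
        user_items[int(row["user_uid"])].add(int(row["element_uid"]))
    return dict(user_items)


user_items = load_user_items(io.StringIO(SAMPLE))
print(user_items[5177])  # {3336}
```

    For the real data, the same function could be given an open file handle for transactions.csv instead of the in-memory sample.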

    Watched_time distribution


    ratings.csv


    User ratings for the training period. The data is aggregated: if a user changed a rating, only the latest value appears in the table.


    user_uid  element_uid  rating  ts
    571252    1364         10      44305174.26309871
    63140     3037         10      44305139.28281821
    443817    4363         8       44305136.20584908

    • element_uid - item identifier
    • user_uid - user ID
    • rating - user-supplied rating (from 0 to 10)
    • ts - time of the rating

    Rating distribution


    bookmarks.csv


    Records of users adding films to “Memorized”. The data is aggregated: if a user removed a movie from “Memorized”, the table contains no record of it having been added.


    user_uid element_uid ts
    301135 7185 44305161.30743926
    301135 4083 44305160.01187332
    301135 10158 44305157.74463292

    • element_uid - item identifier
    • user_uid - user ID
    • ts - time the film was added to “Memorized”

    catalog.json


    Meta-information about all recommendable items: movies, multi-part movies and series.


    {
      "1983": {
        "type": "movie",
        "availability": ["purchase", "rent", "subscription"],
        "duration": 140,
        "feature_1": 1657223.396513469,
        "feature_2": 0.7536096584,
        "feature_3": 39,
        "feature_4": 1.1194091265,
        "feature_5": 0.0,
        "attributes": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...]
      },
      "2166": {
        "type": "movie",
        "availability": ["purchase", "rent"],
        "duration": 110,
        "feature_1": 36764165.87817783,
        "feature_2": 0.7360206399,
        "feature_3": 11,
        "feature_4": 1.1386044027,
        "feature_5": 0.6547073468,
        "attributes": [16738, 13697, 1066, 1089, 7, 5318, 308, 54, 170, 33, ...]
      },
    ...
    }

    • type - takes the values movie, multipart_movie or series
    • duration - duration in minutes, rounded to tens (for series and multi-part movies, the duration of one episode)
    • availability - available content rights (may contain the values purchase, rent and subscription)
    • attributes - a bag of anonymized attributes
    • feature_1..5 - five anonymized real-valued and ordinal features

    Availability is given as of the end of the training period and the start of the test period.


    Important: JSON dictionary keys can only be strings, so remember to convert them to numbers if you read the identifiers in the tables as numbers (which you should, to save memory).
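

    A minimal sketch of that conversion is shown below. It uses a trimmed stand-in for catalog.json containing two entries from the example above (only three fields each, for illustration), converts the string keys to integers, and then selects the items available by subscription.

```python
import json

# A trimmed stand-in for catalog.json: two entries from the article,
# reduced to three fields each for illustration.
RAW = """
{
  "1983": {"type": "movie", "availability": ["purchase", "rent", "subscription"], "duration": 140},
  "2166": {"type": "movie", "availability": ["purchase", "rent"], "duration": 110}
}
"""

# JSON object keys are always strings, so convert them to int to match
# element_uid values that were read from the CSV tables as numbers.
catalog = {int(uid): meta for uid, meta in json.loads(RAW).items()}

# Items that can be watched by subscription according to the catalog.
subscription_items = {
    uid for uid, meta in catalog.items() if "subscription" in meta["availability"]
}

print(sorted(catalog))     # [1983, 2166]
print(subscription_items)  # {1983}
```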


    Distribution of features in the catalog


    Metrics


    As the metric we use Mean Average Precision (MAP) at 20 items, slightly modified. During the test period a user may have consumed fewer than 20 films. If we computed plain MAP in that case, the upper bound of the metric would be less than one and the values would be small. Therefore, if a user consumed fewer than 20 items, we normalize by that number rather than by 20.


    $\mbox{MNAP@20} = \frac{1}{\lvert U \rvert} \sum_{u \in U} \frac{1}{\min(n_u, 20)} \sum_{i=1}^{20} r_u(i) \, p_u@i$


    $p_u@k = \frac{1}{k} \sum_{i=1}^{k} r_u(i)$


    $r_u(i)$ indicates whether the $i$-th predicted item is in the set of items consumed by user $u$ during the test period, and $n_u$ is the size of that set. If you have forgotten how ranking quality metrics work, there is a great article about them.
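

    As a worked illustration with made-up numbers: suppose user $u$ consumed $n_u = 2$ items during the test period, and the predicted list hits them at positions 1 and 3 only. Then


    $p_u@1 = 1, \quad p_u@3 = \frac{2}{3}, \quad \sum_{i=1}^{20} r_u(i) \, p_u@i = 1 + \frac{2}{3} = \frac{5}{3}$


    Normalizing by $\min(n_u, 20) = 2$ gives $\frac{5}{6} \approx 0.83$ for this user, whereas dividing by 20 would give $\frac{1}{12} \approx 0.08$. Even a perfect ranking (hits at positions 1 and 2) would score only $\frac{2}{20} = 0.1$ without the modification, but scores exactly 1 with it.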


    Cython Metric Code
    def average_precision(
            dict data_true,
            dict data_predicted,
            const unsigned long int k
    ) -> float:
        cdef:
            unsigned long int n_items_predicted
            unsigned long int n_items_true
            unsigned long int n_correct_items
            unsigned long int item_idx
            double average_precision_sum
            double precision
            set items_true
            list items_predicted
        if not data_true:
            raise ValueError('data_true is empty')
        average_precision_sum = 0.0
        for key, items_true in data_true.items():
            items_predicted = data_predicted.get(key, [])
            n_items_true = len(items_true)
            n_items_predicted = min(len(items_predicted), k)
            if n_items_true == 0 or n_items_predicted == 0:
                continue
            n_correct_items = 0
            precision = 0.0
            for item_idx in range(n_items_predicted):
                if items_predicted[item_idx] in items_true:
                    n_correct_items += 1
                    precision += <double>n_correct_items / <double>(item_idx + 1)
            average_precision_sum += <double>precision / <double>min(n_items_true, k)
        return average_precision_sum / <double>len(data_true)
    def metric(true_data, predicted_data, k=20):
        # Convert each user's true items to a set for fast membership tests.
        true_data_set = {user: set(items) for user, items in true_data.items()}
        return average_precision(true_data_set, predicted_data, k=k)
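
    For local experiments without compiling Cython, the same computation can be sketched in plain Python. This is an illustrative re-implementation that mirrors the logic of the code above, not the organizers' scoring code; the name mnap_at_k and the toy data are hypothetical.

```python
def mnap_at_k(true_data, predicted_data, k=20):
    """Pure-Python sketch of MNAP@k, mirroring the Cython code above."""
    if not true_data:
        raise ValueError("true_data is empty")
    total = 0.0
    for user, true_items in true_data.items():
        true_set = set(true_items)
        predicted = predicted_data.get(user, [])[:k]
        if not true_set or not predicted:
            continue  # contributes 0, but the user still counts in the mean
        hits = 0
        precision_sum = 0.0
        for rank, item in enumerate(predicted, start=1):
            if item in true_set:
                hits += 1
                precision_sum += hits / rank
        # Normalize by the number of consumed items when it is below k.
        total += precision_sum / min(len(true_set), k)
    return total / len(true_data)


# Toy data: user 1 is hit at ranks 1 and 3, user 2 is not hit at all.
true_data = {1: [10, 20], 2: [30]}
predicted = {1: [10, 99, 20], 2: [98, 97]}
score = mnap_at_k(true_data, predicted)
print(round(score, 4))  # 0.4167
```

    Here user 1 scores (1 + 2/3) / 2 = 5/6 and user 2 scores 0, so the mean over both users is 5/12.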

    Prizes and rules


    The prize fund is 600 thousand rubles:


    • 300 thousand for the winner
    • 200 thousand for second place
    • 100 thousand for third place.

    The rules are standard: do not disrupt the platform, use only one account, do not exchange code privately with other participants, and do not be an employee of Okko or Rambler.


    How to start


    Getting started in a competition can be difficult even for experienced specialists: you need to quickly get up to speed on a new domain, explore and analyze the data, and learn new libraries.


    We hope this article has immersed you in the subject of online cinemas and described the data in sufficient detail. In the archive with the task you will find the file baseline.ipynb, which contains code for loading the data and an example of a simple solution using the K nearest neighbors algorithm.
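

    To give a feel for the nearest-neighbors idea before opening the notebook, here is a toy item-based sketch in plain Python. It is not the code from baseline.ipynb: the function names, the cosine similarity on binary interactions, and the choice to exclude already-seen items are all assumptions made for illustration, and the quadratic loop over items is only suitable for tiny examples.

```python
import math
from collections import defaultdict


def fit_item_neighbors(user_items, n_neighbors=2):
    """For each item, keep its n_neighbors most similar items (cosine similarity)."""
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)
    neighbors = {}
    for item, users in item_users.items():
        scored = []
        for other, other_users in item_users.items():
            if other == item:
                continue
            overlap = len(users & other_users)
            if overlap:
                sim = overlap / math.sqrt(len(users) * len(other_users))
                scored.append((sim, other))
        # Highest similarity first; ties broken by item id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors[item] = scored[:n_neighbors]
    return neighbors


def recommend(user, user_items, neighbors, k=20):
    """Score unseen items by summed similarity to the user's items; return top k."""
    seen = user_items.get(user, set())
    scores = defaultdict(float)
    for item in seen:
        for sim, other in neighbors.get(item, []):
            if other not in seen:
                scores[other] += sim
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Toy interactions: user -> set of consumed items.
user_items = {"a": {1, 2}, "b": {1, 2, 3}, "c": {2, 3}, "d": {1}}
neighbors = fit_item_neighbors(user_items)
print(recommend("d", user_items, neighbors))  # [2, 3]
```

    User "d" has only consumed item 1, whose audience overlaps more with item 2 than with item 3, so item 2 is ranked first.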


    If anything in the data description or the domain remains unclear, we will be happy to answer your questions in the comments. You can also ask questions in the Telegram channel @boosterspro, where the main discussion of the competition will take place.


    So, how to get started:


    1. Sign up at boosters.pro and join @boosterspro;
    2. Download the data from the competition page or here;
    3. Open baseline.ipynb, install the required packages, run all the code and upload your first solution;
    4. Try changing the baseline to improve performance;
    5. Experiment!

    Rekko Challenge starts today, February 18. Submissions are accepted until April 18, 23:59:59 Moscow time.


    We look forward to seeing you all. Good luck!


    By the way, we are hiring, and that includes a recommender systems developer.
