# MyTarget dynamic remarketing: non-personal product recommendations

Dynamic remarketing (dynrem) in myTarget is a targeted advertising technology that uses information about user actions on advertisers' websites and in their mobile applications. For example, a user viewed product pages in an online store or added items to the cart, and myTarget uses these events to show ads for exactly the goods and services in which the person has already shown interest. Today I will describe in more detail the mechanism for generating non-personal, that is, item2item recommendations, which let us diversify and complement the advertising output.

The customers for dynrem in myTarget are mainly online stores, which can have one or more product lists. When building recommendations, each pair "store + product list" should be treated as a separate unit; for simplicity, however, we will simply say "store" below. As for the scale of the input task: recommendations must be built for about a thousand stores, and the number of products per store can vary from several thousand to millions.

The recommendation system for dynrem must meet the following requirements:

- The banner contains products that maximize its CTR.
- Recommendations are built offline for a given time.
- The system architecture must be flexible, scalable, stable and work in a cold start environment.

Note that the requirement to build recommendations within a fixed time, combined with the initial conditions described above (we optimistically assume that the number of stores keeps growing), naturally leads to a requirement for economical use of machine resources.

Section 2 contains the theoretical foundations for constructing recommender systems, sections 3 and 4 discuss the practical side of the issue, and section 5 summarizes the overall result.

# Basic concepts

Consider the task of building a recommendation system for one store and list the basic mathematical approaches.

### Singular Value Decomposition (SVD)

A popular approach to building recommender systems is singular value decomposition (SVD). The rating matrix $R$ is represented as a product of two matrices $P$ and $Q$ so that $R \approx P^T Q$; the rating $\hat{r}_{ui}$ of user $u$ for item $i$ is then represented as the scalar product $\hat{r}_{ui} = \langle p_u, q_i \rangle$ [1], where $p_u$ and $q_i$ are vectors of dimension $k$ (the main parameter of the model). This formula serves as the basis for other SVD models. The task of finding $P$ and $Q$ reduces to optimizing the functional:

(2.1)

$$\sum_{(u,i) \in D} \mathcal{L}\left(r_{ui}, \hat{r}_{ui}\right) + \Lambda(P, Q) \to \min_{P,\,Q},$$

where $\mathcal{L}$ is an error function (for example, RMSE as in the Netflix competition), $\Lambda$ is a regularization term, and the summation runs over the set $D$ of pairs $(u, i)$ for which the rating is known. We rewrite expression (2.1) in explicit form:

(2.2)

$$\sum_{(u,i) \in D} \left(r_{ui} - \langle p_u, q_i \rangle\right)^2 + \lambda_p \|p_u\|^2 + \lambda_q \|q_i\|^2 \to \min_{P,\,Q}.$$

Here $\lambda_p$ and $\lambda_q$ are the L2-regularization coefficients for the user representations $p_u$ and item representations $q_i$, respectively. The basic model for the Netflix competition was:

(2.3)

$$\hat{r}_{ui} = \mu + b_u + b_i + \langle p_u, q_i \rangle,$$

(2.4)

$$\sum_{(u,i) \in D} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_u^2 + b_i^2 + \|p_u\|^2 + \|q_i\|^2\right) \to \min,$$

where $\mu$, $b_u$ and $b_i$ are the biases for the global rating, the user and the item, respectively. Model (2.3)-(2.4) can be improved by adding the user's implicit feedback to it. In the Netflix example, the explicit response is the score the user gives a film "at our request", while other interactions with the item (viewing the film, its description, reviews of it) form the implicit response: it gives no direct information about the film's rating, but it does indicate interest. Implicit feedback is accounted for in the SVD++ model:

(2.5)

$$\hat{r}_{ui} = \mu + b_u + b_i + \left\langle p_u + \frac{1}{\sqrt{|I_u|}} \sum_{j \in I_u} y_j,\ q_i \right\rangle,$$

where $I_u$ is the set of objects with which the user implicitly interacted, and $y_j$ is a representation of dimension $k$ for an object from $I_u$.
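To make the optimization (2.2) concrete, here is a minimal SGD training sketch for the regularized factorization in NumPy; the function name, initialization and hyperparameters are illustrative, not the production code:

```python
import numpy as np

def train_svd(ratings, n_users, n_items, k=8, lr=0.02, reg=0.02, epochs=300, seed=0):
    """SGD training for the regularized matrix factorization (2.2).

    ratings: iterable of (user, item, rating) triples with known ratings.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.5, size=(n_users, k))  # user vectors p_u
    Q = rng.normal(scale=0.5, size=(n_items, k))  # item vectors q_i
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                 # r_ui - <p_u, q_i>
            pu = P[u].copy()                      # keep old p_u for the q_i update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

Each step moves $p_u$ and $q_i$ along the gradient of the squared error while the `reg` term shrinks them, exactly the two forces in (2.2).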

### Factorization Machines (FM)

As can be seen from the different SVD models, one model differs from another in the set of terms included in the rating formula, and each extension of the model is a new task. We want such changes (for example, adding a new kind of implicit response, or taking time parameters into account) to be easy to implement without changing the model's implementation code. Models (2.1)-(2.5) can be written in a convenient universal form using the following parameterization. We represent the user and the item as sets of features:

(2.6)

$$x^{(u)} = \left(x_1^{(u)}, \ldots, x_n^{(u)}\right) \in \mathbb{R}^n,$$

(2.7)

$$x^{(i)} = \left(x_1^{(i)}, \ldots, x_m^{(i)}\right) \in \mathbb{R}^m.$$

*Fig. 1: An example of a feature matrix in the case of CF.*

For example, in the case of collaborative filtering (CF), when only user-item interaction data are used, the feature vectors are one-hot codes (Fig. 1). Introducing the concatenated vector $x = \left(x^{(u)}, x^{(i)}\right) \in \mathbb{R}^{n+m}$, the recommendation task reduces to a regression problem with target variable $r_{ui}$:

- Linear model:

(2.8)

$$\hat{r}(x) = w_0 + \sum_{j=1}^{n+m} w_j x_j,$$

- Poly2:

(2.9)

$$\hat{r}(x) = w_0 + \sum_{j=1}^{n+m} w_j x_j + \sum_{j=1}^{n+m} \sum_{l=j+1}^{n+m} w_{jl}\, x_j x_l,$$

- FM:

(2.10)

$$\hat{r}(x) = w_0 + \sum_{j=1}^{n+m} w_j x_j + \sum_{j=1}^{n+m} \sum_{l=j+1}^{n+m} \langle v_j, v_l \rangle\, x_j x_l,$$

where $w_0$, $w_j$ and $w_{jl}$ are model parameters, $v_j$ are vectors of dimension $k$ representing feature $j$ in a latent space, and $n$ and $m$ are the numbers of user and item features, respectively. Besides one-hot codes, content-based (CB) features can serve as features (Fig. 2), for example vectorized product descriptions and user profiles.

*Fig. 2: An example of an expanded feature matrix.*

The FM model introduced in [2] is a generalization of (2.1)-(2.5) and (2.8). The essence of FM is that it accounts for pairwise feature interactions via the scalar product $\langle v_j, v_l \rangle$ rather than an individual parameter $w_{jl}$. The advantage of FM over Poly2 is a significant reduction in the number of parameters: the vectors $v_j$ require $(n+m) \cdot k$ parameters, while the $w_{jl}$ require $(n+m)(n+m-1)/2$. For $n$ and $m$ of large orders, the first approach uses significantly fewer parameters.

Please note: if a specific pair of features never occurs together in the training set, then the corresponding term with $w_{jl}$ in Poly2 does not affect the training of the model, and the rating estimate is formed by the linear part alone. The approach (2.10), however, lets us establish relationships through other features: data on one interaction help estimate the parameters of features that are not present in that example.
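Formula (2.10) looks quadratic in the number of features, but the pairwise term can be computed in linear time using a well-known rearrangement from [2]. A NumPy sketch (names are illustrative):

```python
import numpy as np

def fm_score(x, w0, w, V):
    """FM rating estimate (2.10) for a single feature vector x.

    Uses the O(k * n) identity
      sum_{j<l} <v_j, v_l> x_j x_l
        = 0.5 * sum_f [ (V^T x)_f^2 - sum_j V_jf^2 x_j^2 ]
    instead of the naive O(k * n^2) double loop.
    """
    linear = w0 + w @ x
    s = V.T @ x                                          # (V^T x), shape (k,)
    pair = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return linear + pair
```

This rewriting is what makes one FM training epoch cheap even with millions of one-hot features.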

On the basis of FM, a so-called hybrid model is implemented in which CB features are added to the CF features. It solves the cold-start problem, and at the same time takes user preferences into account, allowing personalized recommendations.

### Lightfm

In the popular FM implementation LightFM, the emphasis is on separating user features from item features. The model parameters are the embedding matrices $E^U$ and $E^I$ of the user and item features:

(2.11)

$$E^U \in \mathbb{R}^{n \times k}, \qquad E^I \in \mathbb{R}^{m \times k},$$

as well as the biases $b^U \in \mathbb{R}^n$ and $b^I \in \mathbb{R}^m$. Using the user and item representations:

(2.12)

$$p_u = \sum_{j \in f_u} e_j^U,$$

(2.13)

$$q_i = \sum_{j \in f_i} e_j^I,$$

where $f_u$ and $f_i$ are the sets of features describing user $u$ and item $i$, we get the rating of the pair $(u, i)$:

(2.14)

$$\hat{r}_{ui} = \langle p_u, q_i \rangle + \sum_{j \in f_u} b_j^U + \sum_{j \in f_i} b_j^I.$$

### Loss functions

In our case, it is necessary to rank products for a specific user so that a more relevant product has a higher rating than a less relevant one. LightFM has several loss functions:

- Logistic: requires explicit negative examples, which most tasks do not provide.
- BPR [3]: maximizes the rating difference between positive and negative examples for a given user; negatives are obtained by bootstrap sampling. The quality functional optimized by the algorithm is similar to ROC-AUC.
- WARP [4]: differs from BPR in the way negative examples are sampled and in the loss function, which is also a ranking loss but optimizes the top of the user's recommendation list.
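As an illustration of the BPR idea (a sketch, not LightFM's internal code), the pairwise loss over sampled (positive, negative) score pairs can be written as:

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """BPR pairwise loss [3]: push the rating of a positive item above the
    rating of a sampled negative item for the same user.

    Computes the mean of -log sigmoid(pos - neg) in a numerically
    stable form, since -log sigmoid(d) = log(1 + exp(-d)).
    """
    diff = pos_scores - neg_scores
    return float(np.mean(np.log1p(np.exp(-diff))))
```

The loss goes to zero as the rating gap grows, which is exactly the ranking behavior we want from the model.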

# Practical implementation

To build recommendations within the given time, a parallel implementation on Spark is used. An independent task is launched for each store, and its execution is orchestrated by Luigi.

### Data preprocessing

Data preprocessing is performed with automatically scalable Spark SQL tools. The features selected for the final model are textual descriptions of products and catalogs, with standard transformations applied.

What helped us when working with Spark:

- Partitioning the prepared data (the user-item interaction matrices and their features) by store. At the training stage this saves time reading data from HDFS; otherwise every task has to load the data into Spark memory and filter it by store ID.
- Saving data to and reading data from Spark in chunks. The reason is that during either of these operations the data is loaded into JVM memory. Why not simply increase the JVM memory? First, that reduces the memory available for model training, and second, nothing needs to be stored in the JVM: here it acts only as temporary storage.

### Model training

The model for each store is trained in its own Spark container, so an arbitrary number of store tasks can run simultaneously, limited only by cluster resources.

LightFM lacks an early-stopping mechanism, so we spend extra resources on additional training iterations once the target metric has stopped growing. We chose AUC as the metric; its relationship with CTR has been confirmed experimentally.
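Such an early stop is easy to bolt on from outside. Below is a patience-based sketch: `fit_partial` follows LightFM's incremental-training API, `eval_auc` is any callable returning the validation AUC, and the wrapper itself is our illustration, not a LightFM feature:

```python
def fit_with_early_stopping(model, train, epochs=100, patience=3, eval_auc=None):
    """Train epoch by epoch and stop once the validation AUC has not
    improved for `patience` consecutive epochs. Returns the best AUC seen."""
    best_auc, best_epoch = -1.0, 0
    for epoch in range(epochs):
        model.fit_partial(train, epochs=1)   # one more pass over the data
        auc = eval_auc(model)
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            break                            # no growth: stop spending resources
    return best_auc
```

The `patience` parameter trades a little extra compute for robustness against a noisy metric.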

Denote:

$D$ - all known interactions between users and items, that is, pairs $(u, i)$;

$I$ - the set of all items $i$;

$U$ - the set of all users $u$.

For a specific user $u$ we also introduce $I_u$ - the set of items the user has interacted with. AUC can then be calculated as follows [ref]:

(3.1)

$$AUC_u = \frac{1}{|I_u| \cdot |I \setminus I_u|} \sum_{i \in I_u} \sum_{j \in I \setminus I_u} \mathbb{1}\left[\hat{r}_{ui} > \hat{r}_{uj}\right],$$

(3.2)

$$AUC = \frac{1}{|U|} \sum_{u \in U} AUC_u.$$

In formula (3.1) we have to compute the rating for every possible pair $(u, i)$ with $u$ fixed, and then compare the ratings of items from $I_u$ with the ratings of items from $I \setminus I_u$. Given that a user interacts with only a tiny part of the assortment, computing (3.1)-(3.2) costs on the order of $|U| \cdot |I|$ rating evaluations, while one epoch of FM training costs only on the order of $|D|$, with $|D| \ll |U| \cdot |I|$.

Therefore, we modified the AUC calculation. First, the sample is split into a training part $D^{train}$ and a validation part $D^{val}$, with $D = D^{train} \sqcup D^{val}$. Next, sampling is used to form a set of validation users $U^{val} \subset U$. For a user $u \in U^{val}$, the elements of the positive class are the items $I_u^{+} = \{i \mid (u, i) \in D^{val}\}$. As elements of the negative class we take a subsample $I_u^{-} \subset I$ such that no element of $I_u^{-}$ belongs to $I_u$. The subsample size can be taken proportional to the size of $I_u^{+}$, that is, $|I_u^{-}| = c \cdot |I_u^{+}|$. Formulas (3.1), (3.2) for calculating AUC then become:

(3.3)

$$AUC_u = \frac{1}{|I_u^{+}| \cdot |I_u^{-}|} \sum_{i \in I_u^{+}} \sum_{j \in I_u^{-}} \mathbb{1}\left[\hat{r}_{ui} > \hat{r}_{uj}\right],$$

(3.4)

$$AUC = \frac{1}{|U^{val}|} \sum_{u \in U^{val}} AUC_u.$$

As a result, we obtain constant time for calculating AUC, since we take only a fixed share of the users, and the sets $I_u^{+}$ and $I_u^{-}$ are small. Training for a store stops once AUC (3.4) ceases to improve.
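A NumPy sketch of the sampled AUC in the spirit of (3.3)-(3.4); for brevity it excludes only the validation positives when drawing negatives (in practice all of $I_u$ is excluded), and all names are illustrative:

```python
import numpy as np

def sampled_auc(score, val_users, pos_items, all_items, neg_ratio=3, seed=0):
    """Sampled AUC over a fixed set of validation users.

    score(u, items) -> array of ratings for user u over `items`;
    pos_items[u] is the positive set I_u^+; negatives I_u^- are drawn
    from the remaining items, neg_ratio positives-to-negatives.
    """
    rng = np.random.default_rng(seed)
    per_user = []
    for u in val_users:
        pos = np.array(sorted(pos_items[u]))
        candidates = np.array(sorted(all_items - pos_items[u]))
        neg = rng.choice(candidates,
                         size=min(len(candidates), neg_ratio * len(pos)),
                         replace=False)
        r_pos = score(u, pos)
        r_neg = score(u, neg)
        # share of (positive, negative) pairs ranked correctly: formula (3.3)
        per_user.append(np.mean(r_pos[:, None] > r_neg[None, :]))
    return float(np.mean(per_user))          # formula (3.4)
```

Since `val_users` and the per-user sets have fixed sizes, the cost of this check no longer grows with the full catalog.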

### Search for similar objects

As part of the item2item task, for each item we must select the most similar products, those that maximize the clickability of the banner. Our assumption: the candidates for the banner should be taken from the top-$N$ nearest items in embedding space. We tested the following methods for finding the nearest neighbors: Scala + Spark, ANNOY, SCANNs, HNSW.

For Scala + Spark, computing the exact cosine metric for a store with 500 thousand objects took 15 minutes and a significant amount of cluster resources, so we tested approximate nearest-neighbor methods. For SCANNs we explored the parameters `bucketLimit`, `shouldSampleBuckets`, `NumHashes` and `setSignatureLength`, but the results were poor compared with the other methods (very dissimilar objects ended up in the same buckets). The ANNOY and HNSW algorithms showed results comparable to the exact cosine, but worked much faster.

| | 200k products | | 500k products | | 2.2M products | |
| --- | --- | --- | --- | --- | --- | --- |
| Algorithm | ANNOY | HNSW | ANNOY | HNSW | ANNOY | HNSW |
| index build time (sec) | 59.45 | 8.64 | 258.02 | 25.44 | 1190.81 | 90.45 |
| total time (sec) | 141.23 | 14.01 | 527.76 | 43.38 | 2081.57 | 150.92 |

Since HNSW was faster than all the other algorithms, we decided to settle on it.
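For reference, the exact ("honest") cosine top-$N$ that ANNOY and HNSW approximate can be written as a brute-force NumPy baseline; it is quadratic in the number of items, which is exactly why approximate indices are needed at 500k+ products:

```python
import numpy as np

def topk_cosine(emb, k=5):
    """Exact cosine nearest neighbors by brute force.

    emb: (n_items, dim) item embedding matrix.
    Returns an (n_items, k) array of neighbor indices per item.
    """
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = norm @ norm.T                  # all pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)       # an item is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]
```

An approximate index returns (almost) the same ranking while avoiding the full `n_items x n_items` similarity matrix.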

We also run the nearest-neighbor search in a Spark container and write the result to Hive with the appropriate partitioning.

# Conclusion

Recall: we use WARP to train the model, AUC for early stopping, and the final quality assessment is carried out with an A/B test on live traffic.

We believe that at this point, in organizing the experiment and selecting the optimal composition of the banner, the data ends and the science begins. Here we learn to determine whether it makes sense to show recommendations for products for which retargeting has already fired, exactly how many recommendations to show, how many viewed products to include, and so on. We will talk about this in the following articles.

Further improvements to the algorithm, namely the search for universal embeddings that would place the products of all stores in a single space, are carried out within the framework of the paradigm described at the beginning of the article.

Thanks for your attention!

# Literature

[1] Ricci F., Rokach L., Shapira B. Introduction to Recommender Systems Handbook // Recommender Systems Handbook. Springer, Boston, MA, 2011. P. 147-160.

[2] Rendle S. Factorization Machines // 2010 IEEE International Conference on Data Mining. IEEE, 2010. P. 995-1000.

[3] Rendle S. et al. BPR: Bayesian Personalized Ranking from Implicit Feedback // Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009. P. 452-461.

[4] Weston J., Bengio S., Usunier N. WSABIE: Scaling Up to Large Vocabulary Image Annotation // Twenty-Second International Joint Conference on Artificial Intelligence. 2011.