How we learned to talk to millions of people

    Marketing at X5 means a large amount of data. Pyaterochka, for example, sends more than 30 million communications every month, and this number is constantly growing. A single customer can be sent several offers at the same time, so it is important to choose the right ones. Store promotions should be interesting to the customer and economically viable for the retailer. In this post we describe how we began to identify genuinely popular offers using machine learning and to avoid the spam effect.

    A large team at X5 Retail Group works on improving communication with the customers of Pyaterochka, Perekrestok and Karusel.

    Today, maintaining this system takes dozens of people and several systems: accurate data accumulation in Oracle, data analysis and campaign setup in SAS, and bonus calculation rules in Comarch. Every day analysts decide how to select the most relevant offer at the moment, choosing from a huge variety of options on the basis of historical data on campaign results. We work to make sure that communications are well targeted and have nothing in common with spam.

    We thought about how to automate the selection of a relevant offer for a customer using machine learning, so that:

    • the accumulated information on customers and past campaigns is taken into account
    • the system itself learns from new data
    • planning looks more than one step ahead

    That is how we arrived at the idea of self-organizing systems, and the era of Reinforcement Learning development at X5 began.

    A little bit about Reinforcement Learning
    RL (Reinforcement Learning) is one of the approaches to machine learning in which the system being trained (the agent) learns by interacting with some environment.

    Reinforcement learning theory operates with two concepts: action and state. Based on the state of the object, the algorithm chooses an action. As a result of the action taken, the object moves into a new state, and so on.

    Assuming that:

    • agent - client
    • action - communication with the client
    • state - the state of the client (a set of metrics)
    • goal function - subsequent client behavior (for example, an increase in revenue or a response to a targeted campaign)

    ... then the described system should achieve the set goal, and the agents (clients) will themselves choose their actions (rewards and campaigns) for the sake of their own comfort and a meaningful relationship with the store.

    What does the world's RL expertise offer?
    First we looked for examples of solving such problems described in open sources.
    We found some interesting examples:

    About tools:

    About the application of RL to similar tasks in marketing:

    But none of them fit our case or inspired confidence.

    Stage 1. The prototype of the solution

    Therefore, we decided to develop our own approach.

    To minimize risk and avoid spending a long time on a system that sees no real use and then fails to take off, we decided to start with a prototype that would not implement RL in its pure form but would deliver a clear business result.

    Basic implementations of reinforcement learning are built on a state-action-result matrix that is updated every time new information arrives from the environment.
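
    A minimal sketch of such a matrix, assuming each cell stores the running mean of the observed results for one state-action pair. The class name, the keys and the running-mean update rule are illustrative assumptions, not the code of the prototype.

```python
from collections import defaultdict


class ValueTable:
    """Running-mean estimate of the goal function per (state, action) cell."""

    def __init__(self):
        self.value = defaultdict(float)  # (state, action) -> mean observed result
        self.count = defaultdict(int)    # (state, action) -> number of observations

    def update(self, state, action, result):
        """Fold one new observation from the environment into the cell."""
        key = (state, action)
        self.count[key] += 1
        # Incremental mean: new = old + (x - old) / n
        self.value[key] += (result - self.value[key]) / self.count[key]

    def best_action(self, state, actions):
        """Action with the highest current estimate (unseen cells count as 0)."""
        return max(actions, key=lambda a: self.value.get((state, a), 0.0))


# Hypothetical state and action labels, for illustration only.
table = ValueTable()
table.update("segment_7", "offer_A", 1.0)
table.update("segment_7", "offer_A", 0.0)
table.update("segment_7", "offer_B", 1.0)
```

    Storing a count next to each mean keeps the update cheap and also shows how much evidence stands behind a cell.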

    To reduce the state space, the prototype moved from individual clients to segments: all clients were divided into 29 groups based on the following parameters:

    • average check
    • purchase frequency
    • basket stability
    • basket filling
    • customer loyalty (the ratio of the number of weeks with purchases to the number of weeks the person has participated in the store loyalty program)

    Thus, the task was reduced to learning the following matrix:

    At each intersection of a segment and an offer, the matrix had to be filled with the value of the goal function.

    In the first version of the algorithm, the campaign response rate was chosen as the goal function.

    We developed the first prototype in a couple of weeks in SQL (Oracle) and Python. We had historical data on communications, so we were able to partially fill the matrix with estimated weights for the segment-offer pairs. Unfortunately, it turned out that for some pairs there was not enough data. That did not stop us: we were eager for live tests.
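
    The partial fill from history can be sketched roughly as follows. The record layout, the function name and the `min_obs` threshold are assumptions made for illustration; the post does not say how "not enough data" was decided.

```python
from collections import defaultdict
from itertools import product


def fill_matrix(history, segments, offers, min_obs=30):
    """Average the goal function per (segment, offer); None marks sparse cells.

    history: iterable of (segment, offer, goal_value) tuples (assumed layout).
    min_obs: hypothetical threshold below which a cell is treated as unknown.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for segment, offer, goal_value in history:
        sums[(segment, offer)] += goal_value
        counts[(segment, offer)] += 1

    matrix = {}
    for pair in product(segments, offers):
        n = counts[pair]
        # Require at least one observation so the division is always defined.
        matrix[pair] = sums[pair] / n if n >= max(min_obs, 1) else None
    return matrix


# Toy history, invented for the example.
history = [("s1", "offer_A", 1), ("s1", "offer_A", 0), ("s2", "offer_A", 1)]
matrix = fill_matrix(history, ["s1", "s2"], ["offer_A", "offer_B"], min_obs=2)
sparse_pairs = [pair for pair, value in matrix.items() if value is None]
```

    Here `sparse_pairs` lists the segment-offer pairs that history cannot estimate; these are the cells that only live experiments can fill.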

    The marketing department of Pyaterochka entrusted us with data on two million clients for 10 weeks of experiments. During this time, these customers were disconnected from all other communications. We assigned half of the clients to the control group and tested the prototype on the rest.

    RL theory told us that we must not only choose the best action but also keep learning continuously. Therefore, in every launch we tested a random campaign on a small percentage of clients, while the remaining clients received the best offer (the best campaign). This gave us our own implementation of the ε-greedy method of choosing the best offer.
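
    In its simplest form, ε-greedy selection looks roughly like the sketch below. The function name, the `value` dictionary and the default `epsilon` are assumptions; the post only says that "a small percentage" of clients received a random campaign.

```python
import random


def choose_offer(segment, offers, value, epsilon=0.1, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, else exploit.

    value: dict mapping (segment, offer) -> estimated goal function.
    epsilon: hypothetical exploration share, not a figure from the post.
    """
    if rng.random() < epsilon:
        return rng.choice(offers)  # exploration: a random campaign
    # Exploitation: the best known campaign; pairs without data rank last.
    return max(offers, key=lambda o: value.get((segment, o), float("-inf")))


# Hypothetical estimates for one segment.
value = {("s1", "offer_A"): 0.1, ("s1", "offer_B"): 0.4}
picked = choose_offer("s1", ["offer_A", "offer_B"], value, epsilon=0.0)
```

    With `epsilon=0.0` the choice is purely greedy, so `picked` is the offer with the highest estimate; any positive `epsilon` keeps a share of traffic for collecting data on other offers.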

    After three launches of the system, it became clear that choosing the campaign with the best response does not increase RTO (retail turnover) per campaign, which is the main measure of the effectiveness of any targeted campaign in any organization.

    After changing the goal function (and therefore the algorithm for choosing the best campaign) to incremental RTO directly, we learned that the campaigns that are most successful by this measure are unprofitable in terms of ROI.

    So, by the eighth launch of the system, we changed the goal function for the third time, now to ROI.
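
    The difference between the three goal functions can be illustrated with a small sketch. The formulas below are common simplified definitions and are assumptions, not the exact calculations used in the pilot; all field names and numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class CampaignResult:
    contacted: int        # customers who received the offer
    responded: int        # customers who responded to it
    rto_test: float       # turnover of the contacted group
    rto_control: float    # turnover of an equally sized control group
    margin_rate: float    # share of turnover kept as margin (assumed constant)
    cost: float           # bonuses and discounts paid out

    def response_rate(self):
        return self.responded / self.contacted

    def incremental_rto(self):
        return self.rto_test - self.rto_control

    def roi(self):
        incremental_margin = self.incremental_rto() * self.margin_rate
        return (incremental_margin - self.cost) / self.cost


# Invented numbers: a generous offer and a modest one.
generous = CampaignResult(1000, 300, 600_000.0, 500_000.0, 0.2, 40_000.0)
modest = CampaignResult(1000, 100, 530_000.0, 500_000.0, 0.2, 4_000.0)
```

    In this toy data the generous campaign wins on response and on incremental RTO, yet its ROI is negative, while the modest campaign is the only profitable one. The ranking of campaigns depends on which goal function is optimized.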

    Conclusions from the development of the prototype

    Below are graphs of the effectiveness of the main indicators:

    • Net customer response to communication
    • Incremental RTO
    • Margin

    By the last launch, the effectiveness of the prototype (in terms of incremental RTO) had surpassed the average result of campaigns launched by analysts, and if we consider only the “best” segments and offers, the difference is more than twofold.

    For the future, we drew the following conclusions:

    1. Agreeing on KPIs with the business in advance may not be enough. The business customer's KPIs also change. (That is how we moved from RTO to margin.)
    2. Indirect goals (in our case, the response) are good, but sooner or later you will be asked to account for direct performance indicators.
    3. We found the best segment-campaign pairs, which show consistently good results. These campaigns were rolled out to the entire base and regularly generate profit.


    1. The scheme works.
    2. The cost per client must be taken into account (winning on incremental RTO did not ensure ROI growth).
    3. We would like to take the history of responses into account.
    4. Moving to the level of the individual client is no longer so scary.

    Stage 2. Refining the system

    Inspired by the results of the first stage, we decided to refine the system and make the following functional improvements:

    1) move from choosing an offer for a customer segment to choosing an offer individually for each client, who is described by a set of metrics:

    • Last Response Flag
    • The ratio of the client's RTO over 2 weeks to the RTO over 6 weeks
    • The ratio of the number of days since the last purchase to the average interval between purchases
    • The number of weeks since the last communication
    • The ratio of the amount of bonuses used per month to the RTO per month
    • Reaching the goal in the previous two weeks
    • Response flags for offers with different types of rewards

    2) choose not one campaign but a chain of two consecutive campaigns

    3) refine the goal function by adding RTO growth to it, in addition to the response :).
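
    As an illustration, the per-client metrics listed above could be assembled into a state as in the sketch below. All field names are hypothetical, the zero-denominator fallback is an assumption, and the response flags per reward type are left out for brevity.

```python
def safe_ratio(numerator, denominator, default=0.0):
    """Divide, falling back to a default when the denominator is zero."""
    return numerator / denominator if denominator else default


def client_state(raw):
    """Turn raw per-client aggregates into a dictionary of state metrics.

    All keys of `raw` are hypothetical field names.
    """
    return {
        "responded_last": int(raw["responded_last"]),
        "rto_2w_to_6w": safe_ratio(raw["rto_2w"], raw["rto_6w"]),
        "recency_ratio": safe_ratio(
            raw["days_since_purchase"], raw["avg_days_between_purchases"]
        ),
        "weeks_since_contact": raw["weeks_since_contact"],
        "bonus_share": safe_ratio(raw["bonuses_used_month"], raw["rto_month"]),
        "goal_reached_2w": int(raw["goal_reached_2w"]),
    }


# Invented client record, for illustration only.
example_state = client_state({
    "responded_last": True,
    "rto_2w": 1500.0,
    "rto_6w": 6000.0,
    "days_since_purchase": 6,
    "avg_days_between_purchases": 4,
    "weeks_since_contact": 2,
    "bonuses_used_month": 300.0,
    "rto_month": 3000.0,
    "goal_reached_2w": False,
})
```

    Using ratios rather than absolute amounts keeps clients with very different spending levels comparable within one state description.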

    Now, when choosing an individual offer for a client, we focus on the expected value of the goal function Q:

    • Q = 1 if the client responded to the campaign and his 2-week RTO grew by m% during the episode
    • Q = 0 if the client did NOT respond to the campaign and his 2-week RTO grew by m% during the episode
    • Q = 0 if the client responded to the campaign and his 2-week RTO grew by less than m% during the episode
    • Q = -1 if the client did NOT respond to the campaign and his 2-week RTO grew by less than m% during the episode
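
    The four cases above translate directly into a small function. In this sketch the function and parameter names are hypothetical, and treating growth of exactly m% as reaching the threshold is an assumption.

```python
def q_value(responded, rto_growth_pct, m):
    """Goal function Q for one client and one episode.

    responded: whether the client responded to the campaign.
    rto_growth_pct: growth of the client's 2-week RTO over the episode, in percent.
    m: growth threshold in percent (reaching exactly m counts as growth here).
    """
    grew = rto_growth_pct >= m
    if responded and grew:
        return 1
    if not responded and not grew:
        return -1
    return 0  # exactly one of the two conditions holds


def expected_q(outcomes, m):
    """Mean Q over observed (responded, rto_growth_pct) outcomes."""
    values = [q_value(responded, growth, m) for responded, growth in outcomes]
    return sum(values) / len(values)


# Invented outcomes for one offer, with m = 5.
outcomes = [(True, 12.0), (False, 8.0), (True, 1.0), (False, -3.0)]
mean_q = expected_q(outcomes, m=5.0)
```

    For these four invented outcomes the values of Q are 1, 0, 0 and -1, so the mean is 0.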

    The pilot of the second approach is now in full swing, but we have already surpassed the previous results.

    What's next

    Fortunately, the results encourage not only the implementation and development team but also the business customers, so in the future, in addition to functional improvements, we plan to build similar systems for real-time and online marketing.

    In addition, the attentive reader will notice that so far we have not used RL in its pure form, only its concept. Nevertheless, even with this simplification we see a positive result and are now ready to move on and make our algorithm more complex. With our example, we want to inspire others to go "from simple to complex."

    The X5 editorial team on Habr thanks Glowbyte Consulting for help in preparing the post. The pilot was carried out by a joint team of six Pyaterochka and Glowbyte specialists.

    By the way, we are looking for a Big Data product development manager, a data specialist, and an expert in loyalty program analysis and management.
