How we set up A/B testing. A Yandex lecture

    A/B testing is carried out constantly on Yandex services. "Roll out to such-and-such a share of the audience and see how people react" is such a standard practice that no one on the team ever asks why it is needed. And to keep the testing itself trouble-free, we have a dedicated infrastructure for experiments. Developers Sergey Myts and Danil Valgushev explain the details.


    Sergey:
    - I will try to simplify the description of the A/B testing task. There is an abstract system with users, we make some change to it, and we need to be able to measure the benefit of that change. So far everything is simple, but too abstract. An example: there is a web service for comparing pairs of cat photos. The user has to pick the photo he likes best. He can choose not only the left or the right shot, but also "against all", meaning we picked the pictures poorly. Our task is to improve the service in a well-founded way, proving it with numbers.

    How do we experiment? First we need to understand what "good" means. We want to improve the system, so we have to choose what to strive for, not necessarily in numbers, but in what we call the direction toward the ideal. We may want the user to say as rarely as possible that we found nothing good at all, that is, to have as few refusals as possible. It may also be good when we can correctly predict the user's choice: then let's try to make him like the left picture more often. We may also want the user to spend more time on our service. Suddenly we will then want to show ads, and the longer he uses the service, the more ads he will see. That is good for him, because he likes the service, and for us, because we like advertising. Just kidding.

    This notion of good needs to be expressed in numbers so that it can be measured. We introduce indicators of goodness: metrics. Possible metrics are the number of refusals, the number of correctly guessed "left" results, some weighted average of a user's actions on the service, the time spent on individual pictures, and so on - things we believe reflect our ideal.
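    To make this concrete, here is a minimal sketch of how such metrics could be computed from logged events; the event fields and function names are illustrative assumptions, not the actual implementation.

        # Hypothetical event records with the kind of fields described below.
        events = [
            {"user_id": "u1", "choice": "left",        "predicted": "left",  "time_on_page": 4.2},
            {"user_id": "u1", "choice": "against_all", "predicted": "left",  "time_on_page": 2.0},
            {"user_id": "u2", "choice": "right",       "predicted": "right", "time_on_page": 6.5},
        ]

        def refusal_rate(events):
            # Share of comparisons where the user chose "against all".
            return sum(e["choice"] == "against_all" for e in events) / len(events)

        def left_guess_rate(events):
            # Share of comparisons where our prediction of the choice was correct.
            return sum(e["choice"] == e["predicted"] for e in events) / len(events)

        def avg_actions_per_user(events):
            # Average number of actions per distinct user.
            users = {e["user_id"] for e in events}
            return len(events) / len(users)

        print(refusal_rate(events), left_guess_rate(events), avg_actions_per_user(events))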

    Now we need data to compute all of this. We single out user actions: button clicks, perhaps mouse movements. At the same time we want to record which pictures were shown and how long the user spent on a particular page. We collect everything that can help us compute the metrics.

    We learn to record these events on the client side. If this is a web service, it is most likely JavaScript that captures certain actions and saves them locally. Then we learn to deliver them to the server and save them on each machine, and then to aggregate them and put them into storage so they can be processed later.

    We know what we want and what data to look for it in. Now let's learn to count. We need an implementation of metric calculation: some process that, for our experiments, reports that on average a user has such-and-such metric values. The results should be stored with easy access. Not computed once and forgotten: managers will appear in analytics, for example, and they should be able to reach into this repository themselves and see the results.

    We would also like the search for results not to take too long, so the repository has to provide fast lookup and display of results, letting us move faster. Let's introduce a few terms from our internal vocabulary. An experimental sample is a combination of two things: a set of flags or parameters carrying the experimental functionality, and the subset of users exposed to these changes.

    An experiment is a collection of several samples. As a rule, one of them is a control, where users see our service without any experimental changes; all the others include the experimental functionality. A data slice is an auxiliary analytical tool. Often we want to see our metrics on some restricted group of users: sometimes we are interested in how users behave in a particular country, sometimes in how we change the results for commercial queries, because that is where the money comes from. It is interesting to look not at the entire data stream, but at individual slices.

    We must learn to create and run an experiment. In the description of an experimental sample we need to define the parameters that the experiment switches on. Suppose the experiment compares two image-selection algorithms: the first prefers cats by mustache, the second by fluffiness. Then the first experimental sample could carry the flag isMustache = true, and the second isFluffy = true.

    The description of an experimental sample also includes what percentage of users and, possibly, with what restrictions (for example, in which country) we want to run our experiment. That covers describing and modifying an experimental sample. It would also be good to be able to start and stop experiments. And we monitor health: in a large system it is important to notice when everything breaks down or when, as a result of our changes, something does not go the way we planned.
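    Putting the pieces together, an experiment description might look roughly like this; the field names and format are illustrative assumptions, only the flags, the percentage idea, and the country restriction come from the text above.

        # A hypothetical experiment description with two samples.
        experiment = {
            "description": "mustache vs. fluffiness selection of cat photos",
            "samples": [
                {"name": "A", "flags": {"isMustache": "true"}},
                {"name": "B", "flags": {"isFluffy": "true"}},
            ],
            "percent_of_users": 5,                 # share of the audience included
            "restrictions": {"country": ["RU"]},   # optional targeting restrictions
            "state": "running",                    # experiments can be started and stopped
        }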

    If we want to run not one but many experiments, it is very useful to watch what happens in each of them. For example, the mustache classifier might take a bit longer to run and degrade the response time, and that may not be a desirable situation.

    We need to learn how to draw conclusions: to display the metrics for an experiment and whether there are significant changes. There should be some interface that says these metrics show significant changes according to statistical criteria; look at them and draw conclusions. If it says that all the metrics are fine, we roll the change out. If things are bad and we don't understand why, we need to dig in. If there is still no understanding, next time we can do a more thorough job.
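    The talk does not say which statistical criteria are used, so purely as an illustration, here is a minimal sketch of checking one metric for a significant difference between a control and an experimental sample, using a Mann-Whitney test; the numbers are made up.

        from scipy.stats import mannwhitneyu

        # Per-user refusal rates in the control and experimental samples (toy data).
        control_refusals    = [0.10, 0.12, 0.08, 0.11, 0.09]
        experiment_refusals = [0.07, 0.06, 0.09, 0.05, 0.08]

        stat, p_value = mannwhitneyu(control_refusals, experiment_refusals,
                                     alternative="two-sided")
        if p_value < 0.05:
            print("significant change, p =", round(p_value, 3))
        else:
            print("no significant change, p =", round(p_value, 3))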

    It is also sometimes useful to look at features on separate important slices, for example to make sure the response time on mobile has not degraded. And it is convenient to have a tool that searches for possible problems and anomalies: instead of you watching every slice yourself, the tool can tell you that on this slice something is probably bad enough to hurt everyone. So far everything seems relatively simple.

    Danil:
    - Actually, it isn't. My name is Danil Valgushev. Things are more complicated, because Yandex is a large company and there are many interesting nuances, which I want to discuss using a specific example.

    We have more than just web search: there is image and video search, mail, maps, and many other services.

    We also have many users, many experimenters, and many experiments. Even within a single service there are many different areas we want to improve: in search we can improve the ranking algorithms, the interface, or build new features.

    How do our users interact with the experiment infrastructure? A simplified diagram looks like this: there are users, and there is Yandex, where the experiment infrastructure is built. Users send queries and receive results, which are modified in some way by an experiment. There are also developers, managers, and analysts inside Yandex who file applications for experiments; we then run those experiments and provide tools for analyzing the results.

    A typical experiment consists of three steps: the manager or analyst files an application, then runs the experiment on real users, and, when it is finished, analyzes the results and makes decisions.

    For example, suppose we decided to run an experiment to improve the layout of the search results page. We create an application and fill in all the fields. We write that the rollout criterion is an improvement in the basic front-end metrics, and that the application type is "interfaces". Next we create two samples. One, A, is empty: clean production. In sample B there is some flag, for example goodInterface = true. This flag then travels through our entire infrastructure to its destination, the code that renders the interface, and that code is switched by the flag. In the application we also specify the target slices on which we want to compute the metrics, and note which regions, browsers, and platforms we want to run the experiment on and at what percentage.
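    To make the flag plumbing concrete, here is a minimal sketch of how code at the destination might branch on the flag; only goodInterface comes from the example above, the helper names are hypothetical.

        def render_old_layout(results):
            return "old layout: " + ", ".join(results)

        def render_new_layout(results):
            return "new layout: " + " | ".join(results)

        def render_serp(results, flags):
            # The experiment flag reaches this code through the infrastructure;
            # sample B carries goodInterface=true, sample A carries no flags.
            if flags.get("goodInterface") == "true":
                return render_new_layout(results)
            return render_old_layout(results)

        print(render_serp(["result 1", "result 2"], {"goodInterface": "true"}))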

    Say we have filled out the application. It turns out we cannot simply roll it out to production: we must test it first. There are manual tests and automatic ones. Manual testing is when the creator of the experiment clicks through everything that matters to him, all the relevant parts of the interface, to make sure everything works correctly. Automated tests are there to prevent a failure when the experiment rolls out to production.

    Two examples: checking that certain service modules do not crash, and collecting assessor evaluations of the experiment, so that very bad experiments are caught before rollout rather than in production. There is also the problem that we may be running an experiment for the first time and are not completely sure we will not break anything. That is where our experts come to the rescue.

    For each service and for every aspect of service quality we have experts who moderate every application. They check that the description is clear and the flags are correct, give advice, see whether additional tests are needed, and generally accompany the experiments, helping people who do not yet understand them well enough.

    Once the application is approved, we need to get into production. Here another problem arises: users are limited, but there are many applications, so a queue forms.

    One solution is a multidimensional scheme. In a one-dimensional scheme, each user gets into exactly one experiment; in a multidimensional one, each user gets into more than one. Naturally, intersecting experiments must not conflict with each other: usually they concern either different services or different aspects of the quality of the same service.

    Say we have made it into production. How are users divided among experiments? We have a configuration that describes the rules, and we have come to the conclusion that this configuration is conveniently expressed as a decision graph. The leaves of the graph are experiments, and the nodes are decisions on request parameters, which include, for example, the user ID, the query text, the page address, the region, the user agent, and the time.
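    A toy version of such a configuration might look like this; the node structure, parameter names, and experiment names are illustrative assumptions, not the real format.

        # Internal nodes branch on a request parameter, leaves name an experiment.
        graph = {
            "param": "region",
            "branches": {
                "RU": {
                    "param": "platform",
                    "branches": {"mobile": "exp_mobile_ru", "desktop": "exp_desktop_ru"},
                    "default": "control",
                },
            },
            "default": "control",
        }

        def route(node, request):
            if isinstance(node, str):          # a leaf names an experiment
                return node
            value = request.get(node["param"])
            child = node["branches"].get(value, node["default"])
            return route(child, request)

        print(route(graph, {"user_id": "u1", "region": "RU", "platform": "mobile"}))
        # -> exp_mobile_ru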

    An application gets into the configuration at the moment when the configuration is being prepared for rollout. The configuration is assembled by removing old tickets and adding new ones, and it is usually rolled out several times a day.

    This raises another problem. We have seemingly tested every experiment, but nobody guarantees that nothing will break when we roll out a new configuration. Therefore we always monitor the key search indicators while rolling out a configuration, so we can roll it back if something goes wrong. This usually does not happen, but we play it safe anyway.

    There are smaller breakdowns, when a single experiment breaks. That is harder to see right away: we need to build charts for each experiment for key metrics such as the number of clicks and the number of queries. There is a system for automatic anomaly detection that notices when a chart starts to behave badly, and there is also an emergency shutdown system in case something goes wrong.

    How does the splitting work? How do we split users so that they are well mixed, but each user always gets into the same experiment?

    A simple solution is to take a hash of the user's identifier modulo N. We get N possible values and call them slots; we usually call such a partition a dimension.
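    A minimal sketch of this assignment; the choice of MD5 and 100 slots is illustrative.

        import hashlib

        def slot(user_id, n_slots=100):
            # Stable assignment: the same user always lands in the same slot.
            h = hashlib.md5(user_id.encode("utf-8")).hexdigest()
            return int(h, 16) % n_slots

        print(slot("user-12345"))   # same value on every call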

    Then experiments and algorithms can be hung on the slots. But a problem appeared. Suppose we previously ran an experiment in which users in one sample had it good and in the other a little worse. After the experiment was disabled, the users had become used to it and behaved differently. And when we launch our next experiment on the same slots, we get a bias: A and B are in unequal conditions.

    Because our algorithm is a graph, we can pull a clever trick: shuffle the users once more before they fall into samples A and B. That puts them back into equal conditions.

    The multidimensional scheme also looks quite simple. There is a special node that parallelizes the graph traversal: the traversal happens independently in each branch, and then the results are combined.

    When the splits are made in different branches, they usually use different salts, Salt1 and Salt2, so that the partitions are independent and do not correlate with each other.
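    A minimal sketch of how salted hashing keeps the dimensions independent; the same trick, rehashing with a fresh salt, is also what allows reshuffling users before they fall into A and B. Salt values and slot count are illustrative.

        import hashlib

        def slot(user_id, salt, n_slots=100):
            # Different salts give uncorrelated partitions of the same users.
            h = hashlib.md5((salt + user_id).encode("utf-8")).hexdigest()
            return int(h, 16) % n_slots

        user = "user-12345"
        slot_in_dimension_1 = slot(user, "salt1")   # e.g. ranking experiments
        slot_in_dimension_2 = slot(user, "salt2")   # e.g. interface experiments
        print(slot_in_dimension_1, slot_in_dimension_2)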

    The final question is how to assemble the configurations. It is important to remember that each experiment has its own set of restrictions: percentages, regions, browsers, platforms, and so on. Here is an example: four experiments targeting different regions. How do we place them, say, on 10 slots?

    If we place them like this, we see that each experiment has taken a small piece of each slot, and the last experiment could not be placed at all, because it intersects with all three.

    Fairly simple heuristics work well here. When we add a new ticket to the configuration, we usually try to choose slots where some experiments already sit. And when a heavyweight experiment with broad restrictions comes along, there must still be room left for it.
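    One possible reading of this heuristic as a greedy placement sketch; the data structures and the conflict rule (experiments conflict when their region restrictions can overlap) are assumptions for illustration.

        def conflicts(a_regions, b_regions):
            # An empty set means "no restriction", i.e. all regions.
            if not a_regions or not b_regions:
                return True
            return bool(a_regions & b_regions)

        def place(slots, name, regions, slots_needed):
            # Slots where the new experiment conflicts with nothing already placed.
            candidates = [i for i, s in enumerate(slots)
                          if all(not conflicts(regions, placed_regions)
                                 for _, placed_regions in s)]
            # Prefer already occupied slots, keeping empty slots in reserve
            # for future experiments with broad restrictions.
            candidates.sort(key=lambda i: len(slots[i]) == 0)
            if len(candidates) < slots_needed:
                raise RuntimeError("no room for " + name)
            for i in candidates[:slots_needed]:
                slots[i].append((name, regions))

        slots = [[] for _ in range(10)]
        place(slots, "exp_ru", {"RU"}, 3)      # restricted to one region
        place(slots, "exp_tr", {"TR"}, 3)      # can share slots with exp_ru
        place(slots, "exp_broad", set(), 4)    # unrestricted, needs free slots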

    So we ran the experiment, it worked, and we look at the metrics. We always look at the basic ones: the number of user queries and clicks, to understand how much data has been collected. Another standard metric is the share of pages without clicks, and the CTR. We have many different metrics, and acceptance decisions are made not on clicks and queries but on synthetic metrics; that is a separate topic, beyond this talk.

    There are also statistical tests. Once the experiment has run, we make a decision: first of all we check the rollout criteria against the metrics, add our product considerations, and always consult with the experts.

    After completion, the experiment goes into a dataset: we keep the whole history. First of all it lets us run various studies on experimentation methodology, and it is also needed to validate new metrics.


    Sergey:
    - That was a general overview of the infrastructure and tools. We have covered the first two topics: what the interaction interface can look like and how the splitting works. Now, what problems come up with logging in the real world?

    Since there are many services and many data sources, we end up with a zoo of data. The data comes in different sizes and at different speeds: some is ready immediately, some in a day, some in a week. A big problem is that responsibility for this source data is distributed: each team writes its own logs, and then we want to collect them, so we have to work with each team separately.

    The data zoo raises questions of delivery and aggregation. It means we need a serious infrastructure and well-functioning processes that collect logs from all the teams. It would also be nice to have compatible data formats, so that data can be processed across the whole company without going to each team with its own parser. A common library for working with logs helps here.

    In the end, the data should be aggregated and stored in one place where it is convenient to process further. We have dedicated teams that write these libraries and are responsible for data delivery times in the individual log-collection processes, at both low and high levels. So the experiments team had an easier task: much of this was already in place. We have a common library for working with logs; with it, any analyst can parse all of the company's main logs, provided he has the appropriate access. All data is stored in MapReduce-based storage and processed with MapReduce computations. We have our own system, YT; you can look it up, there have been talks about it.

    The data has been delivered; now it has to be counted. This is distributed data processing: the computations run over hundreds of terabytes and petabytes. From the interface point of view, we want to get the numbers we need for any day, any experiment, and any data slice. So the data has to be prepared in some way. We manage this by building extracts in which the data is laid out in a special way, so that it can be found quickly in the file system, simply by binary search and some additional pre-built indexes.
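    The idea of such pre-sorted extracts can be sketched like this; the keys, fields, and in-memory layout are illustrative, the real extracts live in the MapReduce storage.

        import bisect

        # Rows keyed by (experiment, date, slice), kept sorted by that key,
        # so a lookup is a binary search instead of a full scan.
        rows = sorted([
            (("exp42", "2017-09-01", "region=RU"), {"clicks": 1200, "queries": 5300}),
            (("exp42", "2017-09-01", "all"),       {"clicks": 4800, "queries": 21000}),
            (("exp43", "2017-09-01", "all"),       {"clicks": 5100, "queries": 20500}),
        ])
        keys = [k for k, _ in rows]

        def lookup(experiment, date, slice_name):
            key = (experiment, date, slice_name)
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return rows[i][1]
            return None

        print(lookup("exp42", "2017-09-01", "all"))   # -> {'clicks': 4800, ...}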

    As a result, individual tools can, within seconds, or minutes if the experiment is very complex and there is a lot of data, pull up the figures for any experiment ever run in the company.

    Many experiments mean many potential problems. The services are very different, they are developed separately, each has its own functionality and its own people experimenting on it, and we gather them all at a common point where each one can break something in its own way. So monitoring is essential. The first observation is that aggregating the collected logs takes time, so it is good to have monitoring on raw counters: we need to count at least how many queries, clicks, or other simple actions there were. This data is prepared very quickly, and it already shows when something has gone completely wrong.

    On the other hand, problems can be subtler: something can go wrong not in the simple counts but in some specific metric. For example, users may start spending less time in the interface or, conversely, taking longer to solve their task. On the one hand, this means we need to aggregate the logs, and metric calculations over fresh data have to be quicker; on the other hand, they should be fuller than raw counters. So we have a calculation based on half-hourly data: logs are prepared for every half-hour period, and shortly afterwards you can view these more complex metrics.
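    A minimal sketch of rolling raw events into half-hour buckets; the field names are illustrative.

        from collections import defaultdict

        def half_hour_bucket(ts):
            # ts is a UNIX timestamp in seconds; 1800 s = 30 min.
            return ts - ts % 1800

        def clicks_per_bucket(events):
            buckets = defaultdict(int)
            for e in events:
                if e["action"] == "click":
                    buckets[half_hour_bucket(e["ts"])] += 1
            return dict(sorted(buckets.items()))

        events = [{"ts": 1500000000, "action": "click"},
                  {"ts": 1500000100, "action": "query"},
                  {"ts": 1500002000, "action": "click"}]
        print(clicks_per_bucket(events))   # two half-hour buckets, one click each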

    Since we have a lot of experiments and metrics, they have to be monitored, and another problem appears: when there are a lot of numbers, it is hard to eyeball them all. So we have a tool that finds problems automatically. The values of all metrics of all experiments are fed into it, and it uses time-series analysis algorithms to find suspicious points, which we call faults, and then notifies the responsible person.
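    The talk does not describe the detection algorithm, so here is only a toy illustration of the idea: flag a point as a fault when it deviates strongly from a trailing window of the same metric.

        from statistics import mean, stdev

        def find_faults(series, window=5, threshold=3.0):
            # Compare each point against the mean and spread of the previous window.
            faults = []
            for i in range(window, len(series)):
                base = series[i - window:i]
                m, s = mean(base), stdev(base)
                if s > 0 and abs(series[i] - m) / s > threshold:
                    faults.append(i)
            return faults

        clicks_per_half_hour = [1020, 980, 1005, 990, 1012, 400, 1001]
        print(find_faults(clicks_per_half_hour))   # -> [5], the sudden drop is flagged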

    We also have a metrics-viewing interface where you can examine the data of your experiment for any day and any slice you ordered in advance. Sometimes you still want to explore an experiment in more detail: in the process an idea may come up to check something, to turn the data this way and that. A quick-analysis tool over arbitrary sets of pre-specified slices helps with this. You tell it: compute for me, over these days, these metrics and slices, the preliminary data for this experiment. After that you can quickly compute the metric on any combination of the specified slices. For example, you can quickly look through all combinations of browsers by regions by commercial queries, across the metric values, and notice that something is off somewhere. Perhaps you will find a bug, or perhaps you will come up with a new experiment.

    The last important part of the infrastructure is how we make sure that everything is correct and adequate. We can build all of this, but we want to experiment correctly and draw the right conclusions, so there must be mechanisms for keeping track of that.

    Several groups of people at different levels watch over the correctness of measurements. First of all, these are the developers of the ABT team, which Danil and I belong to. We develop the experiment infrastructure and the analysis tools, and we also analyze the problems that come up. We have an understanding of what should be done and how, but we lean more toward supporting the infrastructure for all of this.

    There are also research teams responsible for developing, validating, and implementing new, non-trivial metrics and approaches. Danil mentioned simple metrics, but for a number of years now we have been using complex statistical metrics, validated against the simple ones, which make decisions more sensitive and let us see changes better. Some teams develop metrics specifically. And there are domain experts who are responsible for the correctness of the experimentation procedure.

    Service analysts are one of the penultimate lines of defense. They are responsible for judging changes through the prism of their service's specifics. Anyone may run an experiment, but each service has one or more analysts who understand what is good and what is bad in their case and can stop something strange. This expertise helps keep us from logical mistakes.

    The topic is broad, and many things were only touched on briefly. Thank you.
