# Offline A/B testing in retail

*This is a true story. The events described in this post took place in a warm country in the 21st century. The names of the characters have been changed, just in case. Out of respect for the profession, everything is told as it really happened.*

Hi, Habr. In this post we will talk about the notorious A/B testing, which, unfortunately, cannot be avoided even in the 21st century. Online testing has been around for a long time and is flourishing, while offline testing has to adapt to circumstances. We will talk about one such adaptation in mass offline retail, spiced up with the story of an interaction with one top consulting firm. Details under the cut.

## Task

Some time ago I worked on a project at a large company that owns a chain of grocery stores, more than 500 of them. I am afraid I should not mention the company's name, so let us call this organization the Company. The point is that the stores come in different sizes and can differ in area tenfold; they can be in different cities, towns, and villages; they can be in different parts of a city, each with its own demographics. All of this leads me to believe that if you need to test a hypothesis here in the classic A/B testing paradigm, it is nearly impossible to do so without causing significant damage to the business. Let us walk through all of this using beer as an example. One day a consulting firm comes to the Company, you know the kind, from the very top, and says: "You know, dear, you have the wrong beer brands on your shop shelves, and they are laid out the wrong way." The methods of the Office cannot be questioned, since such people cannot lie. At least not to us. And so a task of the form "well, watch how they run the pilot, and help them if they need anything" lands on the author of these lines.

Having listened to a short lecture on how their method of generating a product display works, the desire to dig into the details of the algorithm vanished completely. I decided to concentrate instead on measuring quality, which is much more interesting from a theoretical point of view, and which also allows the Company not to invest in knowingly unprofitable projects. With access to parallel universes, one could run a perfect A/B test: in universe A everything proceeds as before, while in universe B the product display is changed. A/B testing is a type of controlled experiment where users are randomly divided into control and test groups. An intervention is made in the test group, a set amount of time passes, the effect of the intervention on the target metrics is measured, and finally the metrics of the two groups are compared. It is also desirable to minimize the bias between the control and test groups, for example, to avoid a situation where group A contains only cities and group B only villages. With websites, the bias problem seems easy to solve: show one version to users with an even ID and the other to users with an odd one. With a chain of stores, things are not so simple: however you split the users or the stores, it always turns out that groups A and B do not resemble each other. Group A shops in the afternoon, group B in the evening. Even out the time of day, and it turns out A comes on weekends more often than B. Align all such details, and it turns out that for statistically significant results you would have to wait half a year and cancel all marketing campaigns. Split by city, and it turns out Moscow is present in one group and absent in the other. In short, one group is always shifted relative to the other. On top of this come various marketing campaigns, global and store-local, holidays, and unforeseen circumstances like parking-lot repairs.

You remember that the Office is from the very top tier of world consultancies, and of course it has a solution to the testing problem. Let us consider their methodology, with its loud marketing name: the triple difference methodology.

## Triple Difference Methodology

The essence of the triple difference methodology is simplicity. And so that the Company's executives do not strain themselves while listening to the presentation, it will be delivered by a pretty-looking lady. The simplicity is achieved by relaxing the restrictions imposed by a proper A/B test. The only difficulty that remains for the Office is the choice of the control and test groups, but we will omit that part of the process too, since it contains nothing interesting apart from a large set of questionable assumptions. So, as a result of a thorough analysis of the existing chain, the Office chooses two stores: one for the control group (green) and one for the test group (blue).

We introduce the following notation:

- $t_s$: the pilot start date;
- $t_e$: the pilot end date;
- $t_s'$: the date corresponding to the pilot start date a year earlier;
- $t_e'$: the date corresponding to the pilot end date a year earlier.

So we have two time periods:

- $[t_s, t_e]$: the pilot period (the period of the experiment);
- $[t_s', t_e']$: the period corresponding to the pilot period a year earlier.

It is proposed to compare the revenues of the test store and the control store over the pilot period and over the same period a year earlier. To do this, three groups of differences need to be computed. Denote by $T_t$ the sales on day $t$ in the test store and by $C_t$ the sales in the control store. The first group sets the baseline from which the growth or decline in sales during the pilot period will be measured:

- $\Delta T_s = T_{t_s} - T_{t_s'}$: the difference in sales between the start of the pilot and the same date a year earlier in the test store;
- $\Delta T_e = T_{t_e} - T_{t_e'}$: the difference in sales between the end of the pilot and the same date a year earlier in the test store;
- $\Delta C_s = C_{t_s} - C_{t_s'}$: the difference in sales between the start of the pilot and the same date a year earlier in the control store;
- $\Delta C_e = C_{t_e} - C_{t_e'}$: the difference in sales between the end of the pilot and the same date a year earlier in the control store.

The second group of differences sets the growth or decline in sales during the pilot period:

- $\Delta T = \Delta T_e - \Delta T_s$: the difference in sales between the end and the start of the pilot in the test store (adjusted for the dates a year earlier);
- $\Delta C = \Delta C_e - \Delta C_s$: the same difference for the control store.

Finally, the decisive difference determines which store worked better during the pilot period:

- $D = \Delta T - \Delta C$

And the decision on implementing the project that costs a KamAZ truck of gold is very simple: if the decisive difference is positive ($D > 0$), the test store sold more beer, therefore the Office's method works and gives a positive effect, therefore it must be rolled out. That is all.
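For concreteness, the whole methodology fits in a few lines. A minimal sketch with made-up sales figures (every number here is hypothetical):

```python
# Hypothetical daily sales figures: T_* for the test store, C_* for the
# control store; *_prev are the same calendar dates one year earlier.
def triple_difference(T_start, T_end, T_start_prev, T_end_prev,
                      C_start, C_end, C_start_prev, C_end_prev):
    # First group: year-over-year differences at the pilot boundaries.
    dT_start = T_start - T_start_prev
    dT_end = T_end - T_end_prev
    dC_start = C_start - C_start_prev
    dC_end = C_end - C_end_prev
    # Second group: growth during the pilot, adjusted for last year.
    dT = dT_end - dT_start
    dC = dC_end - dC_start
    # Decisive difference: positive means the test store "won".
    return dT - dC

D = triple_difference(T_start=110, T_end=130, T_start_prev=100, T_end_prev=105,
                      C_start=200, C_end=210, C_start_prev=195, C_end_prev=208)
print(D)  # 18 -> D > 0, so by the Office's logic the method "works"
```

Note that nothing in this arithmetic says anything about statistical significance: a decisive difference of one ruble would also count as a win.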

## A / B test with ML baseline

After studying the triple difference methodology and learning that management had already approved this way of measuring and started planning the pilot, my palm hit my face hard. It turns out the Office was proposing that we invest a KamAZ of gold in the project even if the methodology does not work at all and the difference in sales was one ruble, by pure accident. Something had to be developed urgently that would give at least some confidence in the effectiveness of the new way of putting beer on the shelf. As you remember, one way to conduct an honest offline A/B test is to have parallel universes: in one we introduce the new beer layout, in the other we leave everything as it is, wait a while, and compare the results. So what if we simulate parallel universes with machine learning?

Suppose we have a time series of daily sales for each store. The solid gray line separates the periods *before the pilot* and *after the pilot*. The zone between the solid gray line and the dashed gray line is the period of customers' adaptation to the new display and the new brands; sales data from this period does not affect the test result and is simply ignored. The red line is the real sales of any store in the period before the pilot. The right-hand side of the plot differs between test stores and control stores. The green dashed line is the sales forecast for any store, built using only the data available before the pilot's launch.

- The red dashed line is the real sales of a control store in the period after the pilot's launch. *For stores from the control group, in the period after the pilot's start we see only the sales forecast (green dashed) and the real sales (red dashed).*
- The blue solid line is the real sales of a store from the test group in the period after the pilot's launch. *In the test stores we see only the sales forecast (green dashed) and the real sales (blue solid).*

The green dashed line is the machine learning baseline.

If the pilot is successful, i.e. the test intervention in the form of an updated assortment and a new display has a positive effect on daily sales, the actual sales in the test stores (blue solid) will be **on average** higher than the actual sales in the control stores (red dashed).

Let us see what "on average" means here. For this we have to make one assumption: the prediction errors of the model are normally distributed,

$$
\varepsilon_t = y_t - \hat{y}_t \sim \mathcal{N}(\mu, \sigma^2),
$$

where $y_t$ is the actual daily sales and $\hat{y}_t$ is the model's forecast.

Let us add another bold assumption: sales in the category of interest today depend linearly on sales in adjacent categories today, on sales in the category of interest yesterday and in the recent past, and on various store metadata added to account for bias in demographics and other attributes:

$$
y_t = \beta_0 + \sum_{i} \beta_i x_{i,t} + \sum_{j=1}^{k} \gamma_j y_{t-j} + \sum_{m} \theta_m z_m + \varepsilon_t,
$$

where $x_{i,t}$ are the sales of adjacent categories on day $t$, $y_{t-j}$ are the lagged sales of the target category, and $z_m$ are the store metadata.

We end up with a very familiar model: linear regression. It is worth noting that the choice of model here is not especially important; what matters is that the errors have a normal (or other known) distribution, so that a statistical test for the equality of means can be run. With such a problem statement it is always possible to test for normality at the model-building stage, and on almost any model the distribution turns out to be normal, as far as a normality test can tell.
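To make this concrete, here is a minimal sketch on synthetic data (all feature names and coefficients are made up, not taken from the real project): an ordinary least-squares fit of today's beer sales on adjacent-category sales, a lag, and a metadata feature, followed by the D'Agostino-Pearson normality test on the residuals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: beer sales explained by
# adjacent-category sales today, yesterday's beer sales, and one
# store-level metadata feature.
n = 500
adjacent = rng.normal(1000, 100, size=n)   # adjacent-category sales today
lagged = rng.normal(300, 30, size=n)       # beer sales yesterday
metadata = rng.normal(0, 1, size=n)        # store-level covariate
beer = 0.2 * adjacent + 0.5 * lagged + 5.0 * metadata + rng.normal(0, 10, size=n)

# Ordinary least squares with an intercept.
X = np.column_stack([np.ones(n), adjacent, lagged, metadata])
coef, *_ = np.linalg.lstsq(X, beer, rcond=None)
residuals = beer - X @ coef

# D'Agostino-Pearson test: a large p-value means normality is not rejected.
stat, p_value = stats.normaltest(residuals)
print(residuals.mean(), p_value)
```

With an intercept in the model, the residual mean is exactly zero by construction; what the normality test adds is a check on the shape of the error distribution.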

So, as the predictive model I used linear regression; this is not a requirement, I was simply guided by the model's simplicity and interpretability. Strictly speaking the model is predictive, but I would rather call it explanatory: we do not predict the future, since we use sales from adjacent categories on the same day, which is essentially a data leak. Rather, we try to explain today's beer sales through the store's sales as a whole. This creates a new problem: the features used in the model have to be selected carefully. Features belonging to the categories of related products can be divided into three groups:

- the group of goods of interest to us (light beer, dark beer, non-alcoholic beer, kvass, maybe even the dried fish snacks); some of these features form the target variable, and some are excluded from the model altogether;
- the groups of goods that are most likely correlated with the target group; for example, the well-worn story that sales of diapers and beer have a high positive correlation coefficient;
- the groups of goods that certainly have no significant correlation with the target group; this is a kind of regularization before building the model, and here there is a great temptation to put everything into the second group, just in case.

We add features from the second group as explanatory variables to the model. The idea is that changes in sales in the second group as a whole have a significant effect on the first group, while changes in the first group have no particular effect on the second as a whole (the second is much larger and more diverse).

A popular question during presentations of the method was this: what if the parking lot is being repaired at a test or control store, will the test break? The answer is no. Parking repairs affect the sales of the store as a whole, not beer specifically, and our beer sales depend on sales in the other categories and will accordingly fall together with everything else. For persuasiveness, one can run a couple of simulations on historical data.

It is also worth noting that we are not testing display method A against display method B; we are testing a new *behavior* against the *old* one. This means that the stores, and the group as a whole, should not cancel any planned marketing campaigns of the kind that ran before. For example, if over the last 6 months you halved the price of strong beer on even-numbered weeks, keep doing it; if you stop, the behavior changes. The only thing to refrain from is running new experiments in the selected stores.

The model-building stage is also not without pitfalls. The test and control groups can include completely different stores, and the task of our model is to put all the stores on a common footing, so that for any store the random prediction error is centered at zero (or equally shifted relative to zero). I initially expected to grind through all kinds of hyperparameters in validation until the desired result appeared, but it turned out that with a sufficient set of features this is achieved on the first try, which is interesting; the variance of the random error also did not differ much from store to store. This is probably one of the weakest points of the method, since there is no guarantee that these conditions will hold. A literature review did not help either: machine-learning baselines seem to be used in many places, but nowhere is there a theoretical guarantee. In short, after all these manipulations we get a model that is trained on *all of the data at once* and can make daily sales forecasts *for any chosen store*. We do not particularly care about accuracy, as long as the error distribution is equally shifted for all stores (better still, not shifted relative to zero at all). The fact that the variance can be large only affects the size of the dataset required for statistical significance of the test result (meaning that for a given a-priori significance level and statistical power, the number of observations required depends on the variance).

Let us return to the graph above with the red, green, and blue lines and finally define what **on average** higher or lower means. For control stores we subtract the daily sales predicted by the model (green dashed) from the real daily sales (red dashed). The result is a normal distribution of errors centered at zero: nothing has changed in these stores, so the model on average coincides with reality. For stores from the test group, we likewise subtract the model's daily predictions (green dashed) from the real daily sales (blue solid) and also get a normal distribution. If nothing has changed, its center will be somewhere near zero; if sales have improved, it will be shifted to the right; if they have worsened, to the left. This is how it looks on simulated data.

And so we find ourselves in the setting of an ordinary statistical test for the equality of the means of two distributions, and nothing prevents us from running it. For the test we need to know the following:

- $\alpha$ and $\beta$: the significance level and the type II error probability (one minus the power); choose them yourself, or, if you are lucky and educated people sit in marketing, choose them together;
- the variance: taken from historical data;
- the lift: we need to test not simply for equality of means, but that the increase in sales in the test group is no less than a certain amount of conditional Canadian dollars; we do not want to implement a project worth a KamAZ of gold only for it to pay for itself in a hundred years; we are not building a bridge to Crimea, after all.

This is enough to calculate the required number of days for the pilot. Another bonus of this approach is scalability. In our case the calculation gave 60 days, i.e. we need 60 daily observations in the test group and 60 in the control group to obtain statistically significant results. We can pick one store per group and wait two months, or two stores per group and wait one month, and so on. Naturally, adding new stores to the test group increases the budget of the experiment; finding that equilibrium is your task. To understand how the required number of observations is calculated, I recommend studying the standard material on statistical power analysis.
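The required number of daily observations per group can be sketched with the usual power-analysis formula for a one-sided two-sample z-test; the error variance comes from historical data, and all numbers below are purely illustrative:

```python
import math
from scipy import stats

def days_per_group(sigma, lift, alpha=0.05, power=0.8):
    """Observations per group needed to detect a mean shift of `lift`
    with a one-sided two-sample z-test, given an error standard
    deviation `sigma` estimated from historical data."""
    z_alpha = stats.norm.ppf(1 - alpha)   # critical value for significance
    z_beta = stats.norm.ppf(power)        # critical value for power
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / lift) ** 2)

# E.g. a daily error std of 50 units and a minimal interesting lift of 25:
n = days_per_group(sigma=50, lift=25)
print(n)  # 50 days per group (or 25 days with two stores per group, etc.)
```

Halving the lift you want to detect quadruples the required number of days, which is exactly why the minimal interesting lift should come from the economics of the project rather than from a desire to finish quickly.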

## Real data

Consider two plots with real sales; the model is trained on several years of historical data. Store number one:

And shop number two:

As you can see, *by eye* everything looks very good. You can easily notice the weekly patterns, and also that in one of the stores something clearly happened recently: the dynamics changed. If you look closely, you can also see that in both stores the model makes a significant error several times. In this case there are two options:

- return to the feature selection step and find a feature that explains this behavior, probably some kind of campaign;
- apply an anomaly detector and exclude from consideration the days with an anomalous error; if the error is systematic, these days will naturally not be flagged as anomalies; likewise, the exclusion is easy to explain to the business, since an innovation in the form of a new beer display must act systematically, not so that sales jump on one particular day and do not change on the others.
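The anomaly filter from the second option can be as simple as a z-score cutoff on the daily errors (a hypothetical sketch; the threshold is a matter of taste):

```python
import numpy as np

def drop_anomalous_days(errors, z_thresh=2.0):
    """Drop days whose model error lies more than z_thresh standard
    deviations from the mean. One-off spikes are removed; a systematic
    shift moves the mean itself and therefore survives the filter."""
    errors = np.asarray(errors, dtype=float)
    z = (errors - errors.mean()) / errors.std()
    return errors[np.abs(z) < z_thresh]

# Six ordinary days and one anomalous spike:
clean = drop_anomalous_days([1.2, -0.5, 0.3, 25.0, -0.8, 0.1, 0.4])
print(len(clean))  # 6 -> the 25.0 spike was excluded
```

On real data a robust variant (median and MAD instead of mean and standard deviation) would be less sensitive to the outliers it is trying to find, but the idea is the same.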

Consider the distributions of errors for two control stores and one test store:

It looks *fine*. For persuasiveness you can run a normality test and make sure that everything is *normal*. If some test does not come back normal, then either ignore it or roll back to the feature selection step. In that case there is no need to restart the pilot, only to rebuild the model and recalculate the numbers (so it is worth planning in advance a few more pilot days in the test period than the first version of the model calls for). In our case everything was as it should be.

Then we merge all the stores of the test group into one sample and all the stores of the control group into another; we may do this because we assumed above that the model error is equally shifted for any store. We get two distributions and run the statistical test.
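The final comparison is an ordinary two-sample test for equality of means on the pooled error samples. A deterministic toy sketch (in the real pilot the inputs would be the daily actual-minus-predicted errors of each group):

```python
import numpy as np
from scipy import stats

# Toy pooled daily errors (actual minus predicted), 60 days per group.
control_errors = np.tile([-10.0, 10.0], 30)      # centered at zero
test_errors = np.tile([-10.0, 10.0], 30) + 6.0   # shifted right by a lift of 6

# Welch's t-test for the equality of the two means.
t_stat, p_value = stats.ttest_ind(test_errors, control_errors, equal_var=False)
print(p_value < 0.05)  # True: the shift would count as a significant lift
```

With a one-sided alternative and a minimal interesting lift, one would instead shift the test sample by that lift before testing, but the mechanics are identical.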

As you might have guessed from my skepticism at the very beginning, the new unique method of displaying goods and selecting brands had no statistically significant effect on sales. Which, in principle, was expected, since I had seen how the new brands were chosen and how the display was designed. I am afraid I cannot talk about these unique techniques, but one of the photographers who went to competitors to photograph their beer shelves ... was rudely thrown out of the premises.

## Conclusion

A reasonable question may arise: why have a control group at all? We need it only to account for global changes, since the pilot can last 1-2 months. The method can be modified to test promotions lasting a week, for example sausage sales. If we believe that no global changes are likely within a week, then the test group is tested against the model itself, which acts as the control, and the statistical test checks whether the mean of the error distribution equals zero. We illustrate this as follows:

- time is on the x-axis and sales are on the y-axis;
- $\delta_1$ shows the growth of sales in the first week caused, for example, by global economic growth;
- $\delta_2$ shows the growth of sales over the remaining time caused by the same global growth.

The described pilot can last several months, but various promotions are usually short, about a week. If we can neglect the expected $\delta_1$, for example because the expected lift is much higher, we do not need a control group, and $\delta_1$ simply goes into the forecast error. If we cannot neglect $\delta_1$, we need a control group: both groups will then make the same error, they will be equally shifted, and the errors cancel each other out.
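For the short-promo variant without a control group, the test degenerates into a one-sample t-test of the error mean against zero. A sketch on made-up promo-week errors:

```python
import numpy as np
from scipy import stats

# Hypothetical daily errors (actual minus predicted) for the promo week
# in the test group; without a control we test whether their mean is zero.
promo_errors = np.array([4.0, 6.5, 3.0, 7.2, 5.1, 4.8, 6.0])

t_stat, p_value = stats.ttest_1samp(promo_errors, popmean=0.0)
print(p_value < 0.05)  # True: the promo week is shifted away from zero
```

Seven observations is obviously very little; in practice the same power-analysis calculation as above decides how many promo days (or stores) are needed.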

And what happened to the pilot? Everything is fine, it lives and thrives, waiting for the billion in return for the KamAZ of gold it was given. One of my last tasks on the project was introducing a methodology for testing promotions, but as soon as I left, the respected Office quickly brought back something resembling the triple difference test. Probably the same fate befell this test too, but I do not know.

By the way, the approach is now being successfully applied in another retail chain: the assortment optimization proposed by a different consulting office, not by this Office, is being tested there. The results met the expectations of the client and that office, and the client plans to roll out the new assortment optimization based on the results of the test.