What is wrong with A/B testing

https://www.locallyoptimistic.com/post/against-ab-tests

Translation

For Habr readers, we have prepared a translation of an article by Michael Kaminsky, former director of analytics at Harry's. He explains what is wrong with A/B testing. Gleb Sologub, Skyeng analytics director, comments on the material.

The concept of A/B testing rests on the fundamentally wrong assumption that there is a single solution that is, on average, better for all customers. Analysts should abandon the assumption that their audience is homogeneous and begin to build systems that allow for (and encourage) test outcomes other than binary ones.

Over the past few weeks, two very interesting articles on non-standard interpretations of A/B tests have been published. One, from the Uber engineering blog, is about calculating quantile effects, and the other (from StitchFix's consistently excellent data science blog) is about using contextual bandit algorithms to achieve personalization.

Both of these articles are interesting, but it seems to me that they go too deep into the details of interpreting and implementing tests and do not state the main point plainly. I will state the thesis here for clarity:

Traditional A/B testing relies on a fundamentally wrong assumption. In most cases, option A will be better for some subgroups and option B for others. Choosing either A or B is inherently inferior to a carefully targeted combination of A and B.

Unfortunately, applying this approach to testing, optimizing and developing software is not easy. It requires new statistical tools, new tooling for developing and maintaining software, and training for stakeholders if you want them to be involved in the process. In this article I will give a motivating example and then discuss some of the problems we will face in building systems that adapt to this new reality. I will not discuss the statistics underlying such systems (for that, it is better to read the StitchFix article and this article from Google), but I will talk about the opportunities I see at the strategic and architectural levels.

Motivating example

To convince you that this matters, let's consider a small example. Although the numbers are fictitious, they are representative of what I have seen countless times when evaluating real A/B tests.

Yet Another Mattress Company (YAMC) sells mattresses online (you may have seen their ads on the subway). They want to test an updated order form optimized for phones. The designers are a bit worried that although the updated version is less cumbersome, it also conveys less information during the ordering process, and this may hurt conversion among desktop users.

The team runs the test and gets the following results:

Damn it, there is no difference! On a hunch, you decide to split the traffic into mobile and desktop.

Wow! The new version did exactly what the designers expected: conversion improved for mobile users and got worse for desktop users.
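The effect described above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the visitor and order counts are invented for this purpose and are not the figures from the article. It shows how two variants can tie in aggregate while each one wins in a different segment, and how serving the better variant to each segment beats either variant alone.

```python
# Hypothetical (visitors, orders) counts, invented for illustration only.
RESULTS = {
    ("mobile", "old"): (5000, 200),
    ("mobile", "new"): (5000, 300),
    ("desktop", "old"): (5000, 400),
    ("desktop", "new"): (5000, 300),
}
SEGMENTS = ("mobile", "desktop")
VARIANTS = ("old", "new")


def segment_rate(segment, variant):
    """Conversion rate of one variant within one segment."""
    visitors, orders = RESULTS[(segment, variant)]
    return orders / visitors


def overall_rate(variant):
    """Conversion rate of one variant pooled over all segments."""
    visitors = sum(RESULTS[(s, variant)][0] for s in SEGMENTS)
    orders = sum(RESULTS[(s, variant)][1] for s in SEGMENTS)
    return orders / visitors


def best_variant_per_segment():
    """Pick the variant with the highest conversion rate in each segment."""
    return {s: max(VARIANTS, key=lambda v: segment_rate(s, v)) for s in SEGMENTS}


def blended_rate(assignment):
    """Estimated pooled conversion if each segment gets its assigned variant.

    Assumes each segment sees the same traffic as one arm of the test.
    """
    visitors = sum(RESULTS[(s, v)][0] for s, v in assignment.items())
    orders = sum(RESULTS[(s, v)][1] for s, v in assignment.items())
    return orders / visitors


if __name__ == "__main__":
    for v in VARIANTS:
        print(f"overall {v}: {overall_rate(v):.1%}")
    winners = best_variant_per_segment()
    print("per-segment winners:", winners)
    print(f"blended: {blended_rate(winners):.1%}")
```

With these made-up counts, the pooled comparison is a tie, while the per-segment view picks the new form for mobile and the old form for desktop.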

Too bad our A/B test showed no effect. I guess we should send our designers back to the drawing board.

But wait! What if we served the mobile-optimized version to users who visit the site from a phone, and the desktop-optimized version to users on desktops? What if we created a landing page that works better on weekends, when people have more time to read? What if we created an ad that works better in California than in Massachusetts?

What if there is no single web page that fits everyone?

Challenges

It is difficult to say whether this idea is obvious or revolutionary. It is so obvious that it seems almost silly. But if you look at how most companies develop, test and debug software products, it turns out to be a fairly fundamental shift in how software problems are approached.

In many companies, there is still only one production version of the website. Tests may be run, but as soon as one variant wins, the losing version is discarded and a single correct version remains, the “king of the hill”.

In order to cover the whole variety of clients and users, it is necessary to develop software solutions in a fundamentally different way. We need new and more advanced tools, and we need to train stakeholders in a new way of thinking.

Today, managing user flows with so many variables is very difficult (if it is possible at all). Because managing such a large number of variants is expensive, many companies do not even try to personalize the customer experience. Below I describe the problems in more detail and outline ways to solve them.

Programming tools

In our brave new world, where we provide a variety of content to different categories of users (in proportions that may change over time), we need tools for both developing and analyzing our software.

It seems that the most obvious consequence of using such a paradigm will be a significant increase in the amount of code in a project. Instead of deleting obsolete code branches after testing, we will have to support them (perhaps forever). It's horrible!

In fact, we need to make applications more modular, so that we can constantly develop, test, deploy and maintain new code branches (for example, new versions for testing).

To direct users to different branches of code based on their characteristics (the number of possible user-flow branches is potentially huge), we need an architecture that supports such branching. We need a centralized decision-making mechanism that can choose a route for a given user. The components along a path must also be interchangeable enough to carry the user along the route even if they were developed independently of each other and without a single user flow in mind.
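One possible shape for such a centralized decision point is sketched below. It is a minimal illustration under stated assumptions, not an architecture the article prescribes: the `User` fields, the `VariantRouter` class, the variant names and the renderers are all hypothetical. The rules live in one place, and each variant is an interchangeable component looked up by name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class User:
    """Hypothetical user attributes available at decision time."""
    device: str  # e.g. "mobile" or "desktop"
    region: str  # e.g. "CA" or "MA"


# A rule pairs a predicate on the user with the name of the variant to serve.
Rule = Tuple[Callable[[User], bool], str]


class VariantRouter:
    """Single place where the routing decision for a user is made."""

    def __init__(self, rules: List[Rule], default: str) -> None:
        self._rules = rules
        self._default = default

    def choose(self, user: User) -> str:
        # The first matching rule wins; otherwise fall back to the default.
        for predicate, variant in self._rules:
            if predicate(user):
                return variant
        return self._default


# Interchangeable components: each variant shares the same call signature,
# so variants can be developed and swapped independently of one another.
RENDERERS: Dict[str, Callable[[User], str]] = {
    "compact_form": lambda user: "compact order form",
    "detailed_form": lambda user: "detailed order form",
}

router = VariantRouter(
    rules=[(lambda user: user.device == "mobile", "compact_form")],
    default="detailed_form",
)


def render_order_form(user: User) -> str:
    """Route the user to a variant, then delegate to that variant's renderer."""
    return RENDERERS[router.choose(user)](user)
```

In this sketch the static rule list could later be replaced by a learned policy (for example, a contextual bandit) without touching the renderers, because callers depend only on `choose`.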

Finally, without a single, holistic user flow, we need tools that let product managers and designers visualize the customer journey through this garden of forking paths. How do we introduce and evaluate new features? How do we track which steps a given user went through in our application? How do we keep the application from turning into a shapeless mass of spaghetti code?

Communication and learning

Adopting this new view of software development will be particularly difficult for people far from the product creation process. Managers are used to thinking about a single user journey, a single brand voice and one universal customer experience. Once we begin to personalize the user experience, it is no longer possible to talk about the software from a single point of view.

We need to educate stakeholders about the value of this new approach and help them think about the user journey and the brand voice in this context. We need to develop methods for identifying the most common routes, and to give managers tools for exploring the product as a user from a specific subgroup, so that they can experience the personalized product from different users' points of view.

Statistical tools

Most likely, in a world without A/B testing we will have to give up many of the tools we have traditionally used to optimize web applications. All our efforts to train product managers and marketers to run and interpret A/B tests will no longer matter.

In this new world, we will need to develop new methods for exploring and visualizing samples of different sizes. We will also need new, more advanced comparison methods to avoid falling into the trap of multiple comparisons.
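To make the multiple-comparisons trap concrete: when a test is sliced into many segments, some segments will look significant at the usual threshold purely by chance. One standard correction is the Holm-Bonferroni step-down procedure, sketched below. This is a technique chosen here for illustration, not a method the article names, and the p-values in the usage note are hypothetical.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm-Bonferroni step-down procedure.

    Returns, for each p-value (in the original order), whether its null
    hypothesis is rejected while controlling the family-wise error rate
    at level alpha.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # with alpha / (m - k + 1) = alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Stop at the first failure; all larger p-values are retained.
            break
    return reject


def naive_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Uncorrected per-test decision, shown only for comparison."""
    return [p <= alpha for p in p_values]
```

For example, with hypothetical per-segment p-values `[0.01, 0.04, 0.03, 0.005]`, the naive rule flags all four segments, while `holm_reject` keeps only the first and the last.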

Conclusions

By taking the true diversity of our user base into account, we can improve the experience for a large number of users, which is very valuable. Unfortunately, as is often the case when the approach to developing and implementing technology changes, these advantages come at a cost. We have a long way to go from where we are now to an amazing, more personalized future, and I’m sure the journey will be exciting.

Author's Note:
I have left out all discussion of confidence intervals and statistical significance for the sake of simplicity. Sorry.

Commentary from Gleb Sologub, Skyeng Analytics Director

Michael summarizes the current trends of personalization and fantasizes about what the means and methods of development and analytics should be when all IT products are completely individualized for specific users.

So far, we have learned to do personalization in two ways: first, by building separate scenarios for different user segments, and second, by developing algorithmic solutions that display personalized content at individual steps of the funnel.

For example, Skyeng of course has optimized mobile versions of its site and learning platform, as well as different versions of these products for users of different ages. In addition, we ran A/B tests and found that users from different regions differ in price sensitivity, after which we introduced discounts whose size depends on the region.

Other examples of algorithmic personalization, besides those Michael gives, include the long-established and widely used lists of recommended products or content, as well as the relatively recent success in generating individual movie posters.

However, all this can be done by continuing to use the old methods of development and analytics.

In the future that Michael describes, A/B tests as we know them may indeed turn out to be useless, but creating an endless variety of completely individual user scenarios will take incredible software modularity and some new analytical methods.

We at Skyeng already have and are expanding a team of researchers and analysts who are studying these trends and are trying to apply them to improve our products.
