Using the computing power of R to test the hypothesis of equality of means
Recently, a need arose to solve a seemingly classical problem of mathematical statistics.
A test of the effect of a certain push on a group of people is conducted, and the effect has to be evaluated. Of course, you can do this using a probabilistic approach.
But talking to business about null hypotheses and p-values is completely useless and counterproductive.
How, as of February 2019, can this be done as simply and quickly as possible with an average laptop at hand? An abstract-style note, no formulas.
This is a continuation of previous publications.
Formulation of the problem
There are two statistically identical groups of users being measured (A and B). Group B is subjected to an effect. Does this effect lead to a change in the mean value of the measured indicator?
The most popular option is to compute statistical criteria and draw a conclusion. I like the example "Classic Statistical Methods: Chi-square Test". In this case it does not matter how it is done: with specialized programs, Excel, R, or anything else.
However, the reliability of the findings can be very doubtful for the following reasons:
- In fact, few people understand mathematical statistics from beginning to end. You should always keep in mind the conditions under which a given method may be applied.
- As a rule, tools are applied and results are interpreted on the principle of a single calculation and a "traffic light" decision. The fewer questions, the better for everyone involved in the process.
Criticism of p-value
There is plenty of material on this; here are links to the most striking pieces I found:
- Nature. "Scientific method: Statistical errors. P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume." Regina Nuzzo. Nature 506, 150-152.
- Nature Methods. "The fickle P value generates irreproducible results." Lewis G Halsey, Douglas Curran-Everett, Sarah L Vowler & Gordon B Drummond. Nature Methods 12, 179-185 (2015).
- Elsevier. "A Dirty Dozen: Twelve P-Value Misconceptions." Steven Goodman. Seminars in Hematology 45(3), July 2008, 135-140.
What can be done?
Now everyone has a computer at hand, so the Monte Carlo method saves the day. Instead of computing p-values, we move to computing confidence intervals for the difference of means.
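The core idea fits in a few lines of base R. Below is a minimal hand-rolled sketch on synthetic normal data, just to show the mechanics before we pull in a dedicated package; the names a and b, the seed, and the 5000-resample count are illustrative choices, not anything prescribed by a library:

# Percentile bootstrap CI for the difference of two group means
# (illustrative sketch on synthetic data)
set.seed(42)
a <- rnorm(10^3, mean = 500, sd = 150)  # control group
b <- rnorm(10^3, mean = 525, sd = 150)  # test group
boot_diff <- replicate(
  5000,
  mean(sample(b, replace = TRUE)) - mean(sample(a, replace = TRUE))
)
quantile(boot_diff, probs = c(0.025, 0.975))  # 95% CI for mean(b) - mean(a)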
There are many books and materials on this, but the idea in a nutshell (resampling & fitting) is presented very compactly in Jake Vanderplas's talk "Statistics for Hackers" at PyCon 2016. The slides are available separately.
One of the early works on this topic, including proposals for graphical visualization: "Confidence intervals rather than P values: estimation rather than hypothesis testing." MJ Gardner and DG Altman. Br Med J (Clin Res Ed). 1986 Mar 15; 292(6522): 746-750.
How to use R for this task?
In order not to do everything by hand at a low level, let's look at the current state of the ecosystem. Not long ago a very convenient package was ported to R: dabestr, Data Analysis using Bootstrap-Coupled Estimation.
The principles behind the calculations and the interpretation of the results used in dabestr are described here: ESTIMATION STATISTICS BETA: ANALYZE YOUR DATA WITH EFFECT SIZES.
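The package is on CRAN, so installation is a one-liner. Note that everything below was run with v0.2.0; later releases may change the API, so pinning the version is a reasonable precaution (the remotes call is one way to do that):

# dabestr is on CRAN; the analysis below was run with v0.2.0
install.packages("dabestr")
# If a newer release behaves differently, one option is to pin the version:
# remotes::install_version("dabestr", version = "0.2.0")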
---
title: "A/B testing using bootstrap"
output:
  html_notebook:
    self_contained: TRUE
editor_options:
  chunk_output_type: inline
---
library(tidyverse)  # data manipulation, tibble, gather
library(magrittr)   # the %T>% tee operator used below
library(tictoc)     # simple timing
library(glue)       # string interpolation
library(dabestr)    # bootstrap-coupled estimation
Simulation
We generate a lognormal distribution of operation durations.
my_rlnorm <- function(n, mean, sd){
  # Convert the desired arithmetic mean and sd into the lognormal
  # location/shape parameters:
  # https://en.wikipedia.org/wiki/Log-normal_distribution#Arithmetic_moments
  location <- log(mean^2 / sqrt(sd^2 + mean^2))
  shape <- sqrt(log(1 + (sd^2 / mean^2)))
  print(paste("location:", location))
  print(paste("shape:", shape))
  rlnorm(n, location, shape)
}
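A quick way to convince yourself that the moment conversion is right is to draw a large sample and compare the empirical moments with the requested ones (a throwaway check, not part of the analysis):

# Sanity check: with a large n the empirical moments should be close
# to the requested mean = 500 and sd = 150
x <- my_rlnorm(n = 10^5, mean = 500, sd = 150)
c(mean = mean(x), sd = sd(x))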
# N users in category A (Control)
A_control <- my_rlnorm(n = 10^3, mean = 500, sd = 150) %T>%
  {print(glue("mean = {mean(.)}; sd = {sd(.)}"))}
# N users in category B (Test)
B_test <- my_rlnorm(n = 10^3, mean = 525, sd = 150) %T>%
  {print(glue("mean = {mean(.)}; sd = {sd(.)}"))}
We put the data into the long format required by dabestr and run the analysis.
df <- tibble(Control = A_control, Test = B_test) %>%
  gather(key = "group", value = "value")
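In newer tidyr (>= 1.0), gather() is superseded by pivot_longer(); a hedged equivalent for readers on recent versions of the tidyverse:

# Equivalent reshaping for tidyr >= 1.0, where pivot_longer() supersedes gather()
df <- tibble(Control = A_control, Test = B_test) %>%
  pivot_longer(cols = everything(), names_to = "group", values_to = "value")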
tic("bootstrapping")
two_group_unpaired <- df %>%
  dabest(group, value,
         # The idx below passes "Control" as the control group,
         # and "Test" as the test group. The mean difference
         # will be computed as mean(Test) - mean(Control).
         idx = c("Control", "Test"),
         paired = FALSE,
         reps = 5000
  )
toc()
Let's take a look at the results:
two_group_unpaired
plot(two_group_unpaired)
The result as a confidence interval:

DABEST (Data Analysis with Bootstrap Estimation) v0.2.0
=======================================================

Unpaired mean difference of Test (n=1000) minus Control (n=1000)
223 [95CI  209; 236]

5000 bootstrap resamples.
All confidence intervals are bias-corrected and accelerated.
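If you need the numbers programmatically (e.g., to feed a report), they can be pulled from the returned object. The field names below are an assumption based on the v0.2.0 object structure; verify them with str(two_group_unpaired) before relying on this:

# Extract the point estimate and BCa interval from the dabest object
# (field names assumed from dabestr v0.2.0; check with str())
res <- two_group_unpaired$result
glue("mean difference = {round(res$difference, 1)} ",
     "[{res$ci}%CI {round(res$bca_ci_low, 1)}; {round(res$bca_ci_high, 1)}]")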
The chart produced by plot() is understandable and convenient for talking with business. All the calculations took about as long as a cup of coffee.
Previous publication: "Data Science 'special forces' in-house".