Sampling and calculation accuracy

    A number of my colleagues face the same problem: to calculate some metric, for example a conversion rate, they have to query the entire database, or run a detailed study for each client in a base of millions of customers. Such a query can run for quite a long time, even in storage built specifically for analytics. It is not much fun to wait 5-15-40 minutes for a simple metric, only to find out that you need to calculate something else or add something.


    One solution to this problem is sampling: instead of calculating the metric over the entire data array, we take a subset that represents the metrics we need well. This sample can be 1000 times smaller than the full data set, yet good enough to show the numbers we need.
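
    As a rough illustration (a minimal R sketch of my own with simulated data, not the code used later in the article), a 1,000-row sample out of a million gives nearly the same conversion rate as a full scan:

    # Illustrative only: 1 million fake 0/1 conversion flags with a true rate of 10%
    conv <- rbinom(1e6, 1, 0.1)

    # "Full scan": conversion rate over all observations
    mean(conv)                                # ~0.100

    # Sampled estimate: 1,000 randomly chosen rows, i.e. 1000 times less data
    mean(conv[sample(length(conv), 1000)])    # typically ~0.09-0.11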


    In this article, I decided to demonstrate how the sample size affects the error of the final metric.


    Problem


    The key question is: how well does the sample describe the "population"? Since we take a sample from the full array, the metrics we obtain are random variables. Different samples will give us different metric values. Different does not mean arbitrary: probability theory tells us that the metric values obtained by sampling should be grouped around the true metric value (computed over the entire data set) with a certain level of error. Moreover, the acceptable level of error differs from problem to problem. It is one thing to figure out whether we get a conversion of 50% or 10%, and quite another to need a result with an accuracy of 50.01% vs 50.02%.


    It is interesting that, from the point of view of theory, the conversion rate observed over the entire database is also a random variable, because the "theoretical" conversion rate can only be calculated on a sample of infinite size. This means that even all the observations in our database actually give an estimate of the conversion with its own accuracy, although these calculated numbers seem absolutely accurate to us. It also leads to the conclusion that if today's conversion rate differs from yesterday's, it does not necessarily mean that something has changed; it may only mean that today's sample (all observations in the database) drawn from the general population (all possible observations for this day, those that occurred and those that did not) gave a slightly different result than yesterday's.
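
    For a back-of-the-envelope feel (my own example, with an assumed 10% rate), even the "full" base of 1 million observations carries a standard error (the formula appears in the theory section below) of about 0.03 percentage points, so day-to-day wobbles of that size are expected noise:

    # Standard error of the conversion estimate even on the "full" base of 1M rows
    p <- 0.1        # assumed conversion rate
    N <- 1e6        # all observations in the database
    sqrt(p * (1 - p) / N)   # ~0.0003, i.e. about 0.03 percentage points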


    Task statement


    Let's say we have 1,000,000 records in a database of type 0/1, which tell us whether a conversion occurred for an event. Then the conversion rate is simply the sum of the 1s divided by 1 million.


    Question: if we take a sample of size N, how much and with what probability will the conversion rate differ from the one calculated over the entire data set?


    Theoretical considerations


    The task reduces to calculating the confidence interval of the conversion rate for a sample of a given size under the binomial distribution.


    From theory, the standard deviation for the binomial distribution is:
    S = sqrt(p * (1 - p) / N)


    Where
    p - conversion rate
    N - sample size
    S - standard deviation
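
    As a quick sanity check of the formula, here is a small helper of my own (conv_sd is not part of the original script):

    # Standard deviation of a sampled conversion-rate estimate
    conv_sd <- function(p, N) sqrt(p * (1 - p) / N)

    conv_sd(0.1,   1000)    # ~0.0095 -> 10% +/- about 1 p.p. on a sample of 1,000
    conv_sd(0.1,   10000)   # ~0.003
    conv_sd(0.001, 1000)    # ~0.001  -> the error is as large as the rate itself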


    I will not go through the confidence interval derivation itself. The math there is rather involved and confusing, and in the end it relates the standard deviation to the final estimate of the confidence interval.


    Let's develop an "intuition" about the standard deviation formula:


    1. The larger the sample size, the smaller the error. The error falls with the inverse square root of the sample size, i.e. increasing the sample by a factor of 4 improves the accuracy only by a factor of 2. This means that at some point increasing the sample size stops giving any particular advantage, and also that fairly high accuracy can be obtained with a fairly small sample.


    2. The error also depends on the value of the conversion rate. The relative error (that is, the ratio of the error to the conversion rate itself) has the "vile" tendency to grow as the conversion rate falls (see the sketch after this list).


    3. At low conversion rates the error "flies up" into the sky. This means that if you sample rare events, you need large sample sizes, otherwise you will get a conversion estimate with a very large error.
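
    For intuition, the relative error can be written as S / p = sqrt((1 - p) / (p * N)). A minimal R sketch of my own (the rel_error helper is not part of the original script) shows how quickly it grows as the rate falls:

    # Relative error of the estimate (standard deviation divided by the rate itself)
    rel_error <- function(p, N) sqrt((1 - p) / (p * N))

    rel_error(0.5,    10000)   # ~0.01 -> about 1% relative error
    rel_error(0.01,   10000)   # ~0.10 -> about 10%
    rel_error(0.0001, 10000)   # ~1.0  -> the error is comparable to the rate itself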

    Modeling


    We can move away from the theoretical solution entirely and solve the problem "head on." Thanks to the R language, this is now very easy to do. To answer the question of what error we get when sampling, we can simply draw a thousand samples and see what error comes out.


    The approach is this:


    1. We take different conversion rates (from 0.01% to 50%).
    2. For each rate we draw 1000 samples of size 10, 100, 1,000, 10,000, 50,000, 100,000, 250,000 and 500,000 elements.
    3. We calculate the conversion rate for each group of samples (1000 coefficients per group).
    4. We build a histogram for each group of samples and determine the bounds within which 60%, 80% and 90% of the observed conversion rates lie.

    R code generating data:


    library(data.table)

    sample.size <- c(10, 100, 1000, 10000, 50000, 100000, 250000, 500000)
    bootstrap = 1000
    Error <- NULL
    len = 1000000

    for (prob in c(0.0001, 0.001, 0.01, 0.1, 0.5)){
      # full "population": len observations with the given conversion probability
      CRsub <- data.table(sample_size = 0, CR = 0)
      v1 = seq(1, len)
      v2 = rbinom(len, 1, prob)
      set = data.table(index = v1, conv = v2)
      print(paste('probability is: ', prob))
      for (j in 1:length(sample.size)){
        for(i in 1:bootstrap){
          # draw a random sample of ss rows and compute its conversion rate
          ss <- sample.size[j]
          subset <- set[round(runif(ss, min = 1, max = len), 0),]
          CRsample <- sum(subset$conv)/dim(subset)[1]
          CRsub <- rbind(CRsub, data.table(sample_size = ss, CR = CRsample))
        }
        print(paste('sample size is:', sample.size[j]))
        # quantiles of the 1000 sampled conversion rates for this sample size
        q <- quantile(CRsub[sample_size == ss, CR], probs = c(0.05, 0.1, 0.2, 0.8, 0.9, 0.95))
        Error <- rbind(Error, cbind(prob, ss, t(q)))
      }
    }

    As a result, we get the following table (there will be graphs later, but the details are better visible in the table).


    Conversion rate | Sample size | 5%        | 10%       | 20%       | 80%       | 90%       | 95%
    0.0001          | 10          | 0         | 0         | 0         | 0         | 0         | 0
    0.0001          | 100         | 0         | 0         | 0         | 0         | 0         | 0
    0.0001          | 1,000       | 0         | 0         | 0         | 0         | 0         | 0.001
    0.0001          | 10,000      | 0         | 0         | 0         | 0.0002    | 0.0002    | 0.0003
    0.0001          | 50,000      | 0.00004   | 0.00004   | 0.00006   | 0.00014   | 0.00016   | 0.00018
    0.0001          | 100,000     | 0.00005   | 0.00006   | 0.00007   | 0.00013   | 0.00014   | 0.00016
    0.0001          | 250,000     | 0.000072  | 0.0000796 | 0.000088  | 0.00012   | 0.000128  | 0.000136
    0.0001          | 500,000     | 0.00008   | 0.000084  | 0.000092  | 0.000114  | 0.000122  | 0.000128
    0.001           | 10          | 0         | 0         | 0         | 0         | 0         | 0
    0.001           | 100         | 0         | 0         | 0         | 0         | 0         | 0.01
    0.001           | 1,000       | 0         | 0         | 0         | 0.002     | 0.002     | 0.003
    0.001           | 10,000      | 0.0005    | 0.0006    | 0.0007    | 0.0013    | 0.0014    | 0.0016
    0.001           | 50,000      | 0.0008    | 0.000858  | 0.00092   | 0.00116   | 0.00122   | 0.00126
    0.001           | 100,000     | 0.00087   | 0.00091   | 0.00095   | 0.00112   | 0.00116   | 0.0012105
    0.001           | 250,000     | 0.00092   | 0.000948  | 0.000972  | 0.001084  | 0.001116  | 0.0011362
    0.001           | 500,000     | 0.000952  | 0.0009698 | 0.000988  | 0.001066  | 0.001086  | 0.0011041
    0.01            | 10          | 0         | 0         | 0         | 0         | 0         | 0.1
    0.01            | 100         | 0         | 0         | 0         | 0.02      | 0.02      | 0.03
    0.01            | 1,000       | 0.006     | 0.006     | 0.008     | 0.013     | 0.014     | 0.015
    0.01            | 10,000      | 0.0086    | 0.0089    | 0.0092    | 0.0109    | 0.0114    | 0.0118
    0.01            | 50,000      | 0.0093    | 0.0095    | 0.0097    | 0.0104    | 0.0106    | 0.0108
    0.01            | 100,000     | 0.0095    | 0.0096    | 0.0098    | 0.0103    | 0.0104    | 0.0106
    0.01            | 250,000     | 0.0097    | 0.0098    | 0.0099    | 0.0102    | 0.0103    | 0.0104
    0.01            | 500,000     | 0.0098    | 0.0099    | 0.0099    | 0.0102    | 0.0102    | 0.0103
    0.1             | 10          | 0         | 0         | 0         | 0.2       | 0.2       | 0.3
    0.1             | 100         | 0.05      | 0.06      | 0.07      | 0.13      | 0.14      | 0.15
    0.1             | 1,000       | 0.086     | 0.0889    | 0.093     | 0.108     | 0.1121    | 0.117
    0.1             | 10,000      | 0.0954    | 0.0963    | 0.0979    | 0.1028    | 0.1041    | 0.1055
    0.1             | 50,000      | 0.098     | 0.0986    | 0.0992    | 0.1014    | 0.1019    | 0.1024
    0.1             | 100,000     | 0.0987    | 0.099     | 0.0994    | 0.1011    | 0.1014    | 0.1018
    0.1             | 250,000     | 0.0993    | 0.0995    | 0.0998    | 0.1008    | 0.1011    | 0.1013
    0.1             | 500,000     | 0.0996    | 0.0998    | 0.1       | 0.1007    | 0.1009    | 0.101
    0.5             | 10          | 0.2       | 0.3       | 0.4       | 0.6       | 0.7       | 0.8
    0.5             | 100         | 0.42      | 0.44      | 0.46      | 0.54      | 0.56      | 0.58
    0.5             | 1,000       | 0.473     | 0.478     | 0.486     | 0.513     | 0.52      | 0.525
    0.5             | 10,000      | 0.4922    | 0.4939    | 0.4959    | 0.5044    | 0.5061    | 0.5078
    0.5             | 50,000      | 0.4962    | 0.4968    | 0.4978    | 0.5018    | 0.5028    | 0.5036
    0.5             | 100,000     | 0.4974    | 0.4979    | 0.4986    | 0.5014    | 0.5021    | 0.5027
    0.5             | 250,000     | 0.4984    | 0.4987    | 0.4992    | 0.5008    | 0.5013    | 0.5017
    0.5             | 500,000     | 0.4988    | 0.4991    | 0.4994    | 0.5006    | 0.5009    | 0.5011
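
    The charts below were built from this data. As a rough sketch of how a similar picture can be drawn (assuming the Error matrix from the script above and the ggplot2 package; the original article does not show its plotting code):

    library(ggplot2)
    library(data.table)

    # Error is a matrix with columns: prob, ss and the 5%/10%/20%/80%/90%/95% quantiles
    err <- as.data.table(Error)
    setnames(err, c("prob", "ss", "q05", "q10", "q20", "q80", "q90", "q95"))

    # Edges of the 5-95% interval of the sampled CR for one "true" rate, e.g. 10%
    ggplot(err[prob == 0.1], aes(y = factor(ss))) +
      geom_point(aes(x = q05)) +
      geom_point(aes(x = q95)) +
      geom_vline(xintercept = 0.1) +
      labs(x = "conversion rate", y = "sample size")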

    Let's look at the cases with a 10% conversion and with a low 0.01% conversion, because all the peculiarities of working with sampling are clearly visible on them.


    At 10% conversion, the picture looks pretty simple:



    The points are the edges of the 5-95% confidence interval, i.e. when taking a sample we will, in 90% of cases, get a CR on the sample within this interval. The vertical axis is the sample size (logarithmic scale), the horizontal axis is the conversion rate value. The vertical bar is the "true" CR.


    We see the same thing as in the theoretical model: accuracy grows with sample size and converges quite quickly, so the sample gives a result close to the "true" one. With a sample of 1,000 we get 8.6% - 11.7%, which is already enough for a number of tasks, and with 10 thousand it is 9.5% - 10.55%.


    Things are worse with rare events and this is consistent with the theory:



    With a low conversion rate of 0.01%, there are problems even with the full statistics of 1 million observations, and with samples the situation gets even worse. The error becomes simply gigantic. On samples of up to 10,000 the metric is essentially not valid. For example, on a sample of 10 observations my generator simply got zero conversions 1000 times, so there is only one point there. At 100 thousand we have a spread from 0.005% to 0.016%, i.e. with such sampling we can be off by almost half the value of the coefficient.


    It is also worth noting that when you observe a conversion of such a small magnitude on 1 million trials, you simply have a large natural error. It follows that conclusions about the dynamics of such rare events must be drawn on really large samples, otherwise you are just chasing ghosts, random fluctuations in the data.


    Conclusions:


    1. Sampling is a working method for obtaining estimates.
    2. The accuracy of the sample grows with the sample size and falls as the conversion rate decreases.
    3. The accuracy of the estimates can be modeled for your task, letting you choose the sampling that is optimal for you.
    4. It is important to remember that rare events do not sample well.
    5. In general, rare events are difficult to analyze; they require large data sets with no sampling at all.
