Why software tasks always take longer than you think

Original author: Erik Bernhardsson
  • Translation
Everyone in the software industry knows how hard it is to estimate a project's deadline. It is difficult to objectively assess how long a hard task will take. One of my favorite theories is that this is simply a statistical artifact.

Suppose you estimate a project at 1 week. Suppose there are three equally likely outcomes: it takes either 1/2 week, 1 week, or 2 weeks. The median outcome is in fact the same as the estimate: 1 week, but the mean (aka average, aka expected value) is 7/6 ≈ 1.17 weeks. The estimate is actually calibrated (unbiased) for the median (which is 1), but not for the mean.

A reasonable model for the "blowup factor" (actual time divided by estimated time) would be something like a lognormal distribution. If the estimate is one week, then let's model the actual outcome as a random variable distributed according to a lognormal distribution with its median at one week. In this situation the median of the distribution is exactly one week, but the mean is much larger:
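To make this concrete, here is a minimal simulation sketch (mine, not from the original analysis) that draws lognormal blowup factors with median 1 and compares the median with the mean:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0  # standard deviation of log(actual / estimated)

# log(blowup) ~ Normal(0, sigma), so the median blowup factor is exp(0) = 1
blowup = rng.lognormal(mean=0.0, sigma=sigma, size=1_000_000)

print(np.median(blowup))     # ~1.00: the estimate is calibrated for the median
print(blowup.mean())         # ~1.65: the mean is pulled up by the right tail
print(np.exp(sigma**2 / 2))  # 1.65: closed-form mean of this lognormal
```

The closed-form mean exp(σ²/2) makes the gap explicit: the larger the uncertainty σ, the further the mean drifts above the median.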



If we take the logarithm of the blowup factor, we get a plain normal distribution centered around 0. This corresponds to a median blowup factor of 1x, and, as you hopefully remember, log(1) = 0. However, different tasks may have different amounts of uncertainty around 0. We can model this by varying the parameter σ, which corresponds to the standard deviation of the normal distribution:



Just to put real numbers on this: when log(actual / estimated) = 1, the blowup factor is exp(1) = e ≈ 2.72. It is just as likely that a project stretches to exp(2) ≈ 7.4 times the estimate as that it finishes in exp(-2) ≈ 0.14, i.e. 14% of the estimated time. Intuitively, the reason the mean is so large is that tasks which finish faster than estimated cannot compensate for tasks which take much longer than estimated. We are bounded by 0, but unbounded in the other direction.

Is this just a model? You bet! But I'll get to real data shortly and show that it actually matches empirical reality quite well.

Estimating Software Development Timelines


So far so good, but let's try to really understand what this means for estimating software development timelines. Suppose we look at a roadmap of 20 different software projects and try to estimate how long it will take to complete them all.

This is where the mean becomes crucial. Means add up; medians do not. So if we want an idea of how long the sum of N projects will take, we need to look at the mean. Suppose we have three different projects, all with the same σ = 1:

          Median   Mean    99%
Task A    1.00     1.65    10.24
Task B    1.00     1.65    10.24
Task C    1.00     1.65    10.24
SUM       3.98     4.95    18.85

Note that the means add up, 4.95 = 1.65 × 3, but the other columns do not.
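These numbers are straightforward to reproduce by Monte Carlo; here's a quick sketch (my own code, matching the table up to simulation noise):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent tasks, each with a median-1 lognormal blowup, sigma = 1
tasks = rng.lognormal(mean=0.0, sigma=1.0, size=(1_000_000, 3))
total = tasks.sum(axis=1)

print(np.median(total))          # ~3.98: medians do not add (3 * 1.00 = 3.00)
print(total.mean())              # ~4.95: means do add (3 * 1.65)
print(np.percentile(total, 99))  # ~18.85: well below 3 * 10.24
```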

Now let's add up three projects with different σ:

                    Median   Mean    99%
Task A (σ = 0.5)    1.00     1.13    3.20
Task B (σ = 1)      1.00     1.65    10.24
Task C (σ = 2)      1.00     7.39    104.87
SUM                 4.00     10.18   107.99

The means still add up, but the sum is nowhere near the naive 3-week estimate you might have expected. Note how the highly uncertain project with σ = 2 dominates the mean completion time. And for the 99th percentile it doesn't just dominate, it literally swallows all the others. Here's a bigger example:

                    Median   Mean    99%
Task A (σ = 0.5)    1.00     1.13    3.20
Task B (σ = 0.5)    1.00     1.13    3.20
Task C (σ = 0.5)    1.00     1.13    3.20
Task D (σ = 1)      1.00     1.65    10.24
Task E (σ = 1)      1.00     1.65    10.24
Task F (σ = 1)      1.00     1.65    10.24
Task G (σ = 2)      1.00     7.39    104.87
SUM                 9.74     15.71   112.65

Again, the single nasty task largely dominates the estimate, at least at the 99th percentile. Even for the mean, the one crazy project ends up accounting for about half of the total time, even though all the tasks have similar medians. For simplicity I assumed that all tasks have the same estimate, just different uncertainties; the math still holds if the estimates vary as well.
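The same Monte Carlo sketch reproduces this table's SUM row and shows how the σ = 2 task takes over (again my own code, not the original analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Seven independent tasks: three with sigma 0.5, three with sigma 1, one with 2
sigmas = np.array([0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 2.0])
tasks = rng.lognormal(mean=0.0, sigma=sigmas, size=(1_000_000, 7))
total = tasks.sum(axis=1)

print(np.median(total))          # ~9.7
print(total.mean())              # ~15.7: the sigma = 2 task contributes ~7.4
print(np.percentile(total, 99))  # ~113: dominated almost entirely by sigma = 2
```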

Funny enough, I've had this gut feeling for a long time. Adding up estimates rarely works when you have many tasks. Instead, figure out which tasks carry the highest uncertainty: those tasks will usually dominate the mean completion time.

The diagram below shows the mean and the 99th percentile as a function of the uncertainty σ:
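For a median-1 lognormal, both curves have closed forms, so the plot can be reproduced without simulation; a sketch:

```python
import numpy as np
from scipy.stats import norm

sigmas = np.linspace(0.5, 3.0, 6)
mean = np.exp(sigmas**2 / 2)           # mean blowup: exp(sigma^2 / 2)
p99 = np.exp(norm.ppf(0.99) * sigmas)  # 99th percentile: exp(2.326 * sigma)

for s, m, p in zip(sigmas, mean, p99):
    print(f"sigma = {s:.1f}   mean = {m:8.2f}   p99 = {p:10.2f}")
```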



Now the math explains my gut feeling! I've started taking this into account when planning projects. I really do think that adding up task estimates is deeply misleading and paints a false picture of how long the whole project will take, because of these crazy skewed tasks that end up consuming all the time.

Where is the empirical evidence?


For a long time I kept this filed in my brain under "amusing toy models", occasionally thinking it was a neat illustration of a real-world phenomenon. But one day, wandering around the web, I stumbled upon an interesting dataset of project time estimates and actual completion times. Fantastic!

Let's make a quick scatter plot of estimated vs. actual time:



The median blowup factor for this dataset is 1x, while the mean blowup factor is 1.81x. Again, this confirms the hunch that developers estimate the median well, but the mean ends up much higher.
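Computing these two numbers from such a dataset is a one-liner each. A sketch, where the file name and column names (estimates.csv, estimated_hours, actual_hours) are made up for illustration; the real dataset's schema may differ:

```python
import pandas as pd

# Hypothetical schema: one row per task, with estimated and actual hours
df = pd.read_csv("estimates.csv")
blowup = df["actual_hours"] / df["estimated_hours"]

print(blowup.median())  # ~1.00 on the dataset discussed here
print(blowup.mean())    # ~1.81 on the dataset discussed here
```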

Let's look at the distribution of the log of the blowup factor:



As you can see, it is pretty well centered around 0, where the blowup factor is exp(0) = 1.

Breaking out the statistical tools


Now I'm going to nerd out a bit on statistics, so feel free to skip this part if it doesn't interest you. What can we conclude from this empirical distribution? You might expect the logarithms of the blowup factors to follow a normal distribution, but that's not quite true. Note that the σ values are themselves random and differ from project to project.

One convenient way to model the σ values is to draw them from an inverse gamma distribution. If we assume (as before) that the log of the blowup factor is normally distributed, then the "global" distribution of log blowup factors ends up being a Student's t-distribution.
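Fitting the t-distribution is a one-liner with SciPy. A self-contained sketch with simulated stand-in data (the parameters below are illustrative, not the fitted values from the real dataset):

```python
import numpy as np
from scipy.stats import t

# Stand-in for the real log(actual / estimated) values: simulate something
# heavy-tailed; in the real analysis this array would come from the dataset.
rng = np.random.default_rng(1)
log_blowup = t.rvs(df=3, loc=0.0, scale=0.5, size=5_000, random_state=rng)

# Maximum-likelihood fit returns (degrees of freedom, location, scale)
nu, loc, scale = t.fit(log_blowup)
print(f"nu = {nu:.2f}, loc = {loc:.3f}, scale = {scale:.3f}")
```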

Let's fit a Student's t-distribution to the histogram above:



It fits decently, in my opinion! The parameters of the t-distribution also define the inverse gamma distribution of the σ values:



Note that values of σ > 4 are very unlikely, but when they do occur, they cause a mean blowup of several thousand times (for σ = 4, the mean blowup is exp(σ²/2) = exp(8) ≈ 3000x).

Why software tasks always take longer than you think


Assuming this dataset is representative of software development (questionable!), we can draw a few more conclusions. We have the parameters for the Student's t-distribution, so we can compute the mean time to complete a task without knowing its σ.

While the median blowup factor from this fit is 1x (as before), the 99th-percentile blowup factor is 32x, and if you go to the 99.99th percentile, it's a whopping 55 million! One (loose) interpretation is that some tasks end up being essentially impossible. In fact, these extreme edge cases have such an outsized impact on the mean that the mean blowup factor of any task becomes infinite. This is pretty bad news for anyone trying to hit deadlines!
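You can see the infinite mean directly in simulation: because the t-distribution has polynomial tails, E[exp(T)] diverges, and the running Monte Carlo average never settles. A sketch (ν = 3 is illustrative, not the fitted value):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)
log_blowup = t.rvs(df=3, size=1_000_000, random_state=rng)
blowup = np.exp(log_blowup)

# Running mean of the blowup factor: it jumps whenever an extreme draw
# arrives and keeps growing with the sample size instead of converging.
running_mean = np.cumsum(blowup) / np.arange(1, blowup.size + 1)
for n in [10**3, 10**4, 10**5, 10**6 - 1]:
    print(f"n = {n:>7}   running mean = {running_mean[n]:.2f}")
```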

Summary


If my model is right (a big if), then here is what we can conclude:

  • People estimate the median completion time well, but not the mean.
  • The mean turns out to be much larger than the median because the distribution is skewed (lognormal).
  • When you add up estimates for n tasks, things get even worse.
  • Tasks with the highest uncertainty (rather than the largest size) often dominate the mean time to complete all tasks.
  • The mean completion time of a task we know nothing about is actually infinite.

Notes


  • Obviously, these findings are based on just one dataset I found online. Other datasets may give different results.
  • My model is, of course, also highly subjective, like any statistical model.
  • I'd be happy to apply the model to a much larger dataset to see how well it holds up.
  • I assumed all tasks are independent. In reality they may be correlated, which would make the analysis much more annoying, but (I think) lead to similar conclusions in the end.
  • The sum of lognormally distributed values is not lognormally distributed. This is a weakness of the lognormal, since you could argue that most tasks are really just sums of subtasks. It would be nice if our distribution were stable.
  • I removed small tasks (estimated time of 7 hours or less) from the histogram, since they skew the analysis and there was a strange spike at exactly 7.
  • The code is on GitHub, as usual.
