# How we measure Yandex.Mail download speed

If your site is slowly loading, you risk that people will not appreciate how beautiful it is or how convenient it is. No one will like it when everything slows down. We regularly add new functionality to Yandex.Mail, sometimes we fix bugs, which means we constantly get new code and new logic. All this directly affects the speed of the interface.
Yandex.Mail is opened every day by millions of people from different parts of the globe. And it should not slow down anyone, therefore our work is not complete without various dimensions. In this post, alexeimoisseev and kurau and I decided to talk about what metrics we have and what tasks they solve. Perhaps this is useful to you.

### What we are interested in

- The initial load time of the interface.
- The time to render any block on the page (from the click until the block appears in the DOM and is ready for user interaction).
- The number of abnormally long page renders and their causes (for example, we consider any transition longer than two seconds abnormally long).

We measure the first load of the mail page using the Navigation Timing API (NTA). We use it as follows: the first-load speed (the part the frontend can influence) is measured from PerformanceTiming.domLoading until the moment of full rendering (not onload, but the real moment the letters are first painted). I emphasize this deliberately, because many people measure from PerformanceTiming.navigationStart. A lot of time can pass between navigationStart and domLoading: redirects, DNS lookup, connection setup, and so on. A metric built that way is misleading. The NOC and the administrators, not the front-end developers, are responsible for DNS lookup and connection time. So even in metrics like these, it is very important to separate areas of responsibility.

All modern browsers, starting with IE9, support the NTA.
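As a rough illustration (the names here are ours for the example, not the actual Yandex.Mail code), the computation boils down to subtracting domLoading from our own first-render timestamp:

```javascript
// Sketch of the idea, not the production code: count first-load time
// from domLoading (what the frontend can influence), NOT from
// navigationStart (which includes redirects, DNS lookup and connects).
function firstLoadTime(timing, firstRenderTs) {
  // timing is window.performance.timing (the legacy Navigation Timing
  // API object); firstRenderTs is Date.now() taken at the moment the
  // letters are actually painted.
  return firstRenderTs - timing.domLoading;
}

// In the browser it would be wired up roughly like this:
// onFirstRender(function () {
//   report('firstLoad', firstLoadTime(performance.timing, Date.now()));
// });
```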

But these measurements are not enough. A user loads the mail page only once, and then opens dozens of letters without reloading it. It is important for us to know how fast that happens, too.

Every page change goes through a single module that sets timers at various stages (preparation, requesting data from the server, templating, updating the DOM) and forwards them to consumer modules. The timers use plain Date.now(): at the moment of a click on a link we store the value of Date.now() in a variable; after the DOM update we take Date.now() again and compute the difference.

Interestingly, we did not arrive at splitting the update process into stages right away: in the first versions we measured only the total execution time and the time of the server request. The stages and detailed measurements appeared after an unlucky release in which we slowed down badly and could not figure out why. Now the update module logs all of its stages itself, and the cause of a slowdown is easy to identify: either the server started responding more slowly, or the JavaScript took too long.

It looks something like this:

`this.timings['look-ma-im-start'] = Date.now();`

`this.timings['look-ma-finish'] = Date.now();`

All timings are collected and computed at send time. The stages do not compute the "end" − "start" differences themselves; all the arithmetic happens at the end:

`var totalTime = this.timings['look-ma-finish'] - this.timings['look-ma-im-start'];`

And similar records arrive on the server:

`serverResponse=50&domUpdate=60&yate=100`
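Put together, the timing module can be sketched roughly like this (the names and the exact shape are assumptions for the example, not the real module):

```javascript
// Illustrative sketch: stages store raw Date.now() marks, and the
// "end" - "start" differences are computed only once, at send time.
function Timings() {
  this.timings = {};
}

// Record a raw timestamp for a named point, e.g. 'domUpdate-start'.
Timings.prototype.mark = function (name) {
  this.timings[name] = Date.now();
};

// Compute durations from [startMark, endMark] pairs and serialize them
// into a query string like "serverResponse=50&domUpdate=60".
Timings.prototype.serialize = function (stages) {
  var t = this.timings;
  return Object.keys(stages).map(function (metric) {
    var pair = stages[metric];
    return metric + '=' + (t[pair[1]] - t[pair[0]]);
  }).join('&');
};
```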

### What do we measure

Stages of the first load:

- preparation,
- loading static assets (the HTTP request and parsing),
- executing modules (declaring models, views, and so on),
- initializing the base objects,
- rendering,
- running the "first render" event handlers.

Stages of rendering any page:

- preparing the server request,
- requesting data from the server,
- templating,
- updating the DOM,
- processing events in views,
- running the "after render" callbacks.

Note that, to keep ourselves honest, the "total execution time" is not the sum of all the stage metrics but a separate metric, "end" − "start". This way no stage of the update gets lost. The detailed stage metrics let you find a problem quickly, and ideally they should add up to roughly the total execution time. Exact equality is unattainable because of Promise and setTimeout scheduling.

*- Ok, now we have metrics, and we can send them to the server.*

- What next?

- Let's build a graph!

- And what shall we plot?

### Let's calculate the average

When I hear such a phrase, I remember two jokes:

- On average, a person has less than two hands.
- The salary of a deputy is 100,000 rubles, the salary of a doctor is 10,000 rubles. The average salary is 55,000 rubles.

As you have probably guessed, the "average" in the sense we most often use it is nothing more than the arithmetic mean. More generally it has a special name, the "expected value", which in the discrete case (the one we will consider) is exactly the arithmetic mean. In statistics more broadly, "average" refers to a whole family of measures of central tendency, each of which characterizes the location of the data distribution with some accuracy.

In our situation we are dealing with data that contains outliers, which strongly affect the arithmetic mean. For clarity, let's take real data for one day and build a histogram. Recall that with a sufficiently large amount of data, a histogram approximates the density of the distribution.

We calculate the arithmetic mean:

Horrifying. Note that this value changes depending on how many outliers there are. That becomes obvious if you compute the arithmetic mean for, say, 99% of users, discarding the "large" values:

Evaluating a sample over a subset of the data rather than all of it is a common technique when there are outliers. For this, one turns to robust estimates of central tendency based on truncating the data. The first of these is the median (Md).

**Median.** As you know, the median is the middle value in a sample, not the mean. For the numbers 1, 2, 2, 3, 8, 10, 20 the median is 3, while the mean is about 6.6. On the whole, the median shows very well how long loading takes for the typical user. Even if the users split into "fast" and "slow" groups, it still yields a sensible value.
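The two estimates are easy to compare on the sample numbers above; a minimal sketch:

```javascript
// Mean vs median on the sample from the text.
function mean(xs) {
  return xs.reduce(function (a, b) { return a + b; }, 0) / xs.length;
}

function median(xs) {
  var s = xs.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(s.length / 2);
  // Odd length: the middle element; even: average of the two middle ones.
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

median([1, 2, 2, 3, 8, 10, 20]); // 3: the single large value barely matters
```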

Suppose the median is 1 s. Is that good or bad? And if we speed up by 100 ms and get 0.9 s, what does that mean?

### OK, I sped up rendering by 100 ms

If things speed up or slow down, the median will of course change. But it cannot tell you how many users got faster and how many got slower. Browsers get faster, computers get upgraded, the code gets optimized, and in the end you are left with a single number that says very little.

To understand which group of users a change affected, we can build the following graph: take the intervals 0–100 ms, 100–300 ms, 300–1000 ms, and 1000 ms–infinity, and count what percentage of requests falls into each of them.
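Counting those shares is straightforward; a sketch using the interval bounds from the text:

```javascript
// Percentage of requests per latency bucket: [0, 100), [100, 300),
// [300, 1000), [1000, Infinity) milliseconds.
function bucketShares(timesMs) {
  var bounds = [100, 300, 1000, Infinity];
  var counts = [0, 0, 0, 0];
  timesMs.forEach(function (t) {
    for (var i = 0; i < bounds.length; i++) {
      if (t < bounds[i]) { counts[i] += 1; break; }
    }
  });
  return counts.map(function (c) { return (100 * c) / timesMs.length; });
}

bucketShares([50, 150, 500, 2000]); // [25, 25, 25, 25]
```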

But here a problem arises. Every time, we had to draw conclusions by hand: a little better here, a little worse there. Can we draw a conclusion at a glance? Can the graph be simplified even further?

### Honey, I made another graph

Once you learn to compute metrics and draw graphs, everyone wants to build them for EVERYTHING. The result is countless excellent graphs and a pile of disparate metrics, where everyone shows the boss whichever one suits them best. Is that bad? Of course it's bad! When there is a problem, it is unclear where to look: hundreds of graphs, and all of them "correct".

The standard situation: the backend builds its own graphs, the database another set, the frontend a third. But where is the user in all this? In the end, we all work for the user, and the graph should be built from the user's point of view. How do we do that?

#### APDEX

APDEX is an integral metric that immediately says "good" or "bad". It works very simply. We pick a time interval [0; t] such that if the page is shown within it, the user is happy. We take another interval, (t; 4t] (four times as long), and assume that if the page is shown within that time, the user is on the whole satisfied with the speed, but no longer quite so happy. Then we apply the formula:

(number of happy users + number of merely satisfied users / 2) / (total number of users).
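In code, the formula is a short pass over the request times; a sketch (t is the "happy" threshold):

```javascript
// APDEX: happy if the time is within t, merely satisfied within (t; 4t],
// anything slower counts as unhappy and contributes nothing.
function apdex(timesMs, t) {
  var happy = 0;
  var satisfied = 0;
  timesMs.forEach(function (x) {
    if (x <= t) { happy += 1; }
    else if (x <= 4 * t) { satisfied += 1; }
  });
  return (happy + satisfied / 2) / timesMs.length;
}

apdex([100, 200, 500, 5000], 300); // (2 + 0.5) / 4 = 0.625
```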

The result is a number between zero and one which, it seems, best shows whether Mail is working well or poorly.

In the APDEX formula, merely satisfied and unhappy users pull the score down more than happy users push it up, which means those are the users worth working on. Ideally, the score should be one.

APDEX is used quite widely at Yandex. It gained such popularity largely because its results can be processed automatically: it is just one number. A graph with multiple intervals, by contrast, requires a human to decide "good or bad".

At the same time, using APDEX does not make other charts unnecessary. Percentiles, for example, are still necessary and useful when analyzing problems: they show what is actually going on. They serve as auxiliary charts.

### What is the correct graph?

The correct graph is the one that shows the real experience of users on your site. You can improve the backend endlessly and make it arbitrarily fast, but the user, by and large, does not care. If the frontend is slow, a fast backend will not help, and vice versa. To find the problem, you should always start from the end user.

Take, for example, an abstract user from Yekaterinburg. Long ago, when we first started introducing speed metrics, we found that the farther a user is from Moscow, the slower Mail works for them. Why? Very simple: our data centers were located in the capital at the time, and the speed of light is finite. The signal has to travel thousands of kilometers over wires. A quick calculation shows that light covers a distance of 2,000 km in about 7 ms. In reality it takes even longer, because the light does not travel in a vacuum or in a straight line, there are many routers along the way, and so on. So optimize or not, every TCP packet will carry a delay of tens of milliseconds. In a situation like this, the right investment is not code optimization but building a CDN, so that we are closer to every user.
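The back-of-the-envelope figure from the paragraph above is easy to verify:

```javascript
// Light covers roughly 300 km per millisecond in a vacuum; in fibre it
// is slower still (about 2/3 of c), so real delays are even larger.
var LIGHT_KM_PER_MS = 299792.458 / 1000;

function propagationDelayMs(distanceKm) {
  return distanceKm / LIGHT_KM_PER_MS;
}

propagationDelayMs(2000); // ~6.7 ms one way, before routers and TCP overhead
```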

### One more thing

Sometimes you see flat graphs while users complain about slowness. That always means you either have a measurement error or are measuring the wrong thing. Metrics need to be tested under load to rule out errors in the metrics themselves. Moreover, such testing should not rely on the metric itself, but be driven from the outside.

Slow the backends down, add delays, or make them respond with errors. Watch how the metrics change at every stage, from the backend to the frontend and the browser. Only this way can you be sure you are measuring what you actually need.

For example, during one such test we got to the point where every second request returned an error. That let us determine whether re-requested data shows up in the metrics or not.

### Conclusion

It is very important that optimization is not a one-off or occasional effort. A proper process should be built around the speed metrics. To start with, real-time charts and testing every release for speed are enough. That keeps us honest with ourselves and shows exactly where we are slow. An established process lets you track the releases in which speed changed, which means the regressions can be fixed. Even if your team has no time to focus on optimization constantly, you can at least make sure things do not get worse.