# How statistics help make Yandex. Traffic better

How are Yandex.Traffic arranged? Where does the initial traffic data come from? How do they turn into a traffic jam map? Is traffic information always reliable? How to check it? And most importantly, how to make traffic data more accurate? For all this, traffic jams use statistics: science is both strong and insidious. In this lecture , Yandex.Probok analyst Leonid Mednikov tells students how to distinguish reliable from random results, and how statistics are used in various practical problems.

The principle of Yandex.Traffic is quite simple. A separate lecture is not required to describe it. The applications Yandex.Maps and Yandex.Navigator using GPS determine the location of the device on which they are running, and transfer this information to the server. And he, in turn, based on these data creates a picture of traffic jams. But how exactly does this happen? This is already a completely non-trivial process that requires the use of special algorithms and statistics. How to distinguish a slowly advancing motorist from a pedestrian or cyclist? How to verify the reliability of the transmitted information? And most importantly, how to make the traffic data the most accurate?

The very first problem that you have to deal with when compiling a traffic jam map is that GPS technology is far from ideal and does not always determine the exact location. By tracking the movement of one object, you can get quite strange data: instantaneous movements over fairly large distances, transition to the side of the road, etc. As a result, we can get the following picture: The algorithm needs to decide what it was. To refine the data, the algorithm will try to find the most probable and allowed route rules. Next, we need to determine which object we are facing? Indeed, in a traffic jam, cars can drive at the speed of a pedestrian, and GPS accuracy does not allow us to say with certainty whether an object is moving along a roadway or along a roadside. Here, several factors can be taken into account: firstly, if the number of objects moving at low speed is large enough, we can assume that there is really a traffic jam on this road and cars drive slowly. We can also take into account the history of the speed of the object over the past few hours. Suppose, if an object has not developed a speed of more than 10-15 kilometers per hour over the past four hours, then this is most likely a pedestrian.

So, we determined the speed of the machines in a certain direction for a certain period, processed them with special algorithms, averaged and, as a result, obtained an approximate general speed of the flow. All this happens in real time and on a stream consisting of hundreds of thousands of cars.

#### Algorithms

As we said above, algorithms are engaged in determining the type of an object, its speed of movement, and averaging the speed of a stream. But how to evaluate the quality of their work, to understand which way you need to move in order to improve them?

The first problem is solved by launching test cars in the city. Like ordinary users, they send data about their movements, but at the same time we know exactly what the real conditions were. After that, on the basis of metrics, you can compare the real traffic jam map in a certain area with the calculated algorithms, and determine how well the system copes with its task.

Comparing the real situation with the results of several algorithms, we can evaluate which one is better. Suppose we test a new algorithm and compare it with the old one. We run it on 67 segments and get the following results: In 54 cases, the new algorithm worked better than the old, and in 13 worse. In terms of the percentage of correct answers and errors, the new algorithm is clearly better. Now imagine that we are adding another algorithm to the comparison. But he ran on fewer segments - six. The percentage of hits and errors is even better than the first. But is it worth trusting verification on such a small number of segments? To clearly answer this question, we turn to statistics.

#### Statistics

To begin with, we’ll deal with random values. Let's say we flip a coin three times. We take the eagle for zero, and the tails for one. In total, we can get eight combinations. If we calculate the sum of the fallen values ​​for each of the combinations, we will see that the extreme values ​​fall out less often, and as we approach the center, the probability increases. Now consider an example in which we throw a coin N times. We will take the number of objects as k, and the number of combinations as C. #### Average and observed average

Now we pose another problem. Imagine that we have a coin, and we want to understand whether it is even, i.e. whether it falls on each side with the same probability. We will test this hypothesis by tossing a coin. If the coin is even, then the more times we throw a coin, the closer to the average should be the largest number of results. At the same time, on a low number of shots, the probability of making an incorrect conclusion is much higher.

Why are we even talking about accidents? Indeed, errors in algorithms for constructing a traffic jam map most often arise for some objective reasons. But not all of these reasons are known to us. If we knew all the reasons, we could create the perfect algorithm, and never improve it again. But until this happens, we are forced to take some algorithm errors as random variables. Therefore, we check how much better the algorithm began to work using approximately the same apparatus as the flat coin hypothesis. Only in this case, as a basic hypothesis, we accept that the algorithm has not changed at all. And then, using the data run through the algorithm, we determine whether there are distortions in one direction or another. And just as with the coin, it’s important that there are as many runs as possible,

If we return to our example with two algorithms, then we can still say with confidence that the first new algorithm is better than the old. But the second raises more doubts, despite the fact that the percentage of correct answers and errors is better for him. Perhaps it really works better, but to verify this, you need to conduct more checks. 