How Yandex.Taxi uses machine learning to predict car arrival time

Imagine you need to call a taxi. You open the app, see that the car will arrive in seven minutes, tap "Order", and... the car turns out to be 15 minutes away from you, if one is found at all. Frustrating, isn't it?

Under the cut, we'll talk about how machine learning methods help Yandex.Taxi predict ETA (Estimated Time of Arrival) more accurately.

To begin, let's recall what the user sees in the app before placing an order:

On the map, the optimal pickup points are marked in blue. The red pin is the point to which the user calls the taxi, and it shows how soon the car will arrive. In an ideal world, that is. In the real world, other people nearby are also ordering cars through the Yandex.Taxi app, and we do not know which car will go to whom, because cars are assigned only after the order is placed. Once a car is assigned, we can use Yandex.Maps routing to estimate its travel time along the optimal route, and show that time (perhaps with a small margin) right after the order. The question remains: how do we predict ETA before the order?

This is where machine learning comes in. We build a sample of objects with known correct answers and train an algorithm to predict the answer from the object's features. In our case, the objects are user sessions and the answers are the time after which the car actually arrived. The features can be any numerical parameters known before the order: the number of drivers and app users near the pin, the distances to the nearest available cars, and other potentially useful values.
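As a rough sketch of what such a training sample might look like (all field and feature names here are invented for illustration; the real feature set is far richer), assembling (features, label) pairs from user sessions could go like this:

```python
# Toy sketch of building a training sample from user sessions.
# Field names are hypothetical, not the real Yandex.Taxi schema.

def make_example(session):
    """Turn one user session into a (features, label) pair."""
    features = [
        session["drivers_nearby"],         # drivers around the pin
        session["users_nearby"],           # competing app users around the pin
        session["dist_to_nearest_car_m"],  # metres to the closest free car
        session["hour_of_day"],            # time-of-day feature
    ]
    label = session["actual_arrival_s"]    # seconds until the car actually arrived
    return features, label

sessions = [
    {"drivers_nearby": 5, "users_nearby": 2, "dist_to_nearest_car_m": 400,
     "hour_of_day": 9, "actual_arrival_s": 310},
    {"drivers_nearby": 1, "users_nearby": 4, "dist_to_nearest_car_m": 1200,
     "hour_of_day": 18, "actual_arrival_s": 540},
]
X, y = zip(*(make_example(s) for s in sessions))
print(X[0], y[0])  # → [5, 2, 400, 9] 310
```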

Why it matters

In an ideal world, people do everything in advance and plan their time precisely. But we live in the real world. If someone is late for work or, worse, for a flight, it is important for them to know whether a taxi will get them there in time.
When deciding what to order, a future passenger takes the waiting time into account, among other things. It can differ a lot between taxi apps, and between fare classes within one app. So that the user does not regret their choice, it is very important to show an accurate ETA.

It seems simple: come up with more features, train a model (say, CatBoost), predict the time until the car arrives, and call it a day. But experience shows it is better not to rush; think the problem through carefully first, and only then act.

At first we had no doubt that we should predict the time after which the driver would actually reach the user. True, before the order we do not know exactly which car will be assigned, but we can predict ETA using data not about a specific driver, but about the drivers in the vicinity of the order. Of course, the forecast must be honest enough that the user can plan their time.

But what does "honest" mean? Any prediction algorithm is only good statistically: there are both successful and frankly bad predictions, but "on average" it must not deviate much from the correct answers. And here we must realize that "on average" can mean different things. The average covers at least three distinct concepts from statistics: the mean (expectation), the median, and the mode. The picture from Darrell Huff's great book "How to Lie with Statistics" illustrates the difference perfectly:

We want the model to be only slightly wrong on average. Depending on what "on average" means, there are two options for measuring forecast quality. The first is to show the user the expected (mean) time until the taxi arrives. This leads to a model that minimizes the mean squared forecast error (Mean Squared Error, MSE):

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \rightarrow \min$

Here $y_i$ are the correct answers and $\hat{y}_i$ are the model's predictions.

The other option is to avoid erring predominantly in one direction, either up or down. In that case we show the user the median of the distribution of time until the taxi arrives. This leads to a model that optimizes the mean absolute forecast error (Mean Absolute Error, MAE):

$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \rightarrow \min$
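The difference is easy to see numerically (a toy sketch with made-up arrival times): among constant predictions, the mean minimizes MSE and the median minimizes MAE, and on a skewed sample the two disagree noticeably.

```python
import statistics

# Skewed "arrival time" sample, seconds: one big outlier pulls the mean up.
times = [180, 200, 210, 220, 240, 260, 900]

mean_t = statistics.mean(times)      # MSE-optimal constant prediction
median_t = statistics.median(times)  # MAE-optimal constant prediction

def mse(c):
    return sum((t - c) ** 2 for t in times) / len(times)

def mae(c):
    return sum(abs(t - c) for t in times) / len(times)

# The mean wins on MSE, the median wins on MAE.
print(mean_t, median_t)                                    # ≈315.7 vs 220
print(mse(mean_t) <= mse(median_t), mae(median_t) <= mae(mean_t))
```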

But then we realized we were getting ahead of ourselves.

Rethinking the Problem Statement

After the assignment we know exactly which car is going to the user, which means we can estimate its travel time with Yandex.Maps. That is the time shown in the pin after the order. On the one hand, we now have more information and the forecast is more accurate; on the other hand, it is still an estimate with its own error.

And that was the catch in the pin-ETA problem: while the driver is not yet assigned, we should predict exactly the time that Yandex.Maps routing will later show, not the actual time until the car arrives.

It sounds absurd: instead of the exact value, take another forecast as the target? But it makes sense, and here is why. On the way to you, the assigned car may be delayed: the driver ran into a tricky situation on the road, got stuck in a traffic jam caused by an accident, or stepped out to buy water. Such delays are hard to predict. They add extra noise to the target variable, which makes the already difficult task of predicting the pin ETA even harder.

How do we get rid of this noise? Predict a smoothed target variable: the time that will be displayed after the car is assigned, computed from the route to the user.

The same logic works from the business side: the travel time along the best route cannot be removed from the ETA anyway, while the additional delays can be reduced by working with drivers.
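A toy simulation makes the point concrete (the delay model and all numbers below are invented for illustration): if the actual arrival time is the routing estimate plus unpredictable driver delays, then switching the label to the routing estimate strips that noise out of the target.

```python
import random
import statistics

random.seed(0)

# Simulate orders: a "clean" route time plus a rare, unpredictable driver delay.
route_times = [random.uniform(120, 600) for _ in range(10_000)]        # routing estimate, s
delays = [random.choice([0, 0, 0, 0, 120, 300]) for _ in route_times]  # occasional long delays
actual_arrival = [r + d for r, d in zip(route_times, delays)]

# The routing-based target is strictly less noisy than the actual arrival time.
var_actual = statistics.pvariance(actual_arrival)
var_route = statistics.pvariance(route_times)
print(var_route < var_actual)
```

With the noisier target, any model would spend capacity trying to explain delays that are essentially unexplainable from pre-order features.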

Quality metric, data, model and training

So, for the pin ETA we should predict not the actual arrival time, but the time that routing will produce after the car is assigned. Of the two quality metrics, MAE and MSE, we chose MAE. Perhaps, in terms of how intuitive the forecast is, estimating the expectation (MSE) is more natural than the median (MAE). But MAE has a nice property: the trained model is more robust to outliers among the training examples.

The features fall into several groups:
- derived from the current time;
- geo features (coordinates, distance to the city center and to significant objects on the map);
- pin features (how many cars are nearby and of what kind, with their density computed in several ways);
- statistics for the zone (how wrong we usually are there, how much we usually predict);
- data about the nearest drivers (how long it would take them to arrive, how much closer the first one is than the second, and so on).
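For instance, the nearest-drivers group might be computed along these lines (a hypothetical sketch with straight-line distances; the real features are richer and route-based):

```python
def nearest_driver_features(pin, drivers, k=3):
    """Features from the k drivers closest to the pin.

    Uses straight-line distance for illustration only; real features
    would use routing times. Coordinates are (x, y) in kilometres.
    """
    def dist(d):
        return ((d[0] - pin[0]) ** 2 + (d[1] - pin[1]) ** 2) ** 0.5

    dists = sorted(dist(d) for d in drivers)[:k]
    feats = {f"dist_{i + 1}": x for i, x in enumerate(dists)}
    if len(dists) >= 2:
        feats["gap_1_2"] = dists[1] - dists[0]  # how much closer the 1st is than the 2nd
    feats["drivers_nearby"] = sum(1 for d in drivers if dist(d) < 2.0)  # within 2 km, say
    return feats

print(nearest_driver_features((0.0, 0.0), [(1.0, 0.0), (0.5, 0.0), (3.0, 4.0)]))
```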

On these features we trained CatBoost, of course. The decisive argument was that gradient boosting over balanced trees, as implemented in CatBoost, has long established itself as a very powerful machine learning method, and CatBoost's encoding of categorical features regularly pays off in our tasks. Another nice feature of the library is fast training on the GPU.
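As a toy stand-in for what the library does under the hood, here is a minimal pure-Python gradient-boosting sketch that optimizes MAE with one-feature decision stumps (the actual training, of course, just calls CatBoost, e.g. `CatBoostRegressor(loss_function='MAE', task_type='GPU')`; the data below is made up):

```python
import statistics

def fit_stump(xs, residuals):
    """Best single-split stump; leaf values are residual medians (MAE-optimal)."""
    order = sorted(set(xs))
    best = None
    for a, b in zip(order, order[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = statistics.median(left), statistics.median(right)
        err = sum(abs(r - (lv if x <= t else rv)) for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]

def boost_mae(xs, ys, rounds=30, lr=0.5):
    """Gradient boosting for MAE: start from the median, add shrunken stumps."""
    pred = [statistics.median(ys)] * len(xs)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        t, lv, rv = fit_stump(xs, resid)
        pred = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return pred

# Toy data: ETA grows with distance to the nearest car, plus one outlier.
xs = [0.2, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [150, 200, 260, 330, 400, 470, 540, 2000]
pred = boost_mae(xs, ys)
mae = sum(abs(y - p) for y, p in zip(ys, pred)) / len(ys)
baseline = sum(abs(y - statistics.median(ys)) for y in ys) / len(ys)
print(mae < baseline)  # boosting beats the constant-median baseline
```

Because each leaf value is the median of its residuals, every boosting round is guaranteed not to increase the training MAE, which is exactly the robustness-to-outliers property mentioned above.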

Now a few words about which models were compared. The original ETA (before machine learning) was computed from the time in which the car closest to the user could arrive. The current model (the one used in the app now) is what we built with machine learning; it is what this article is about. In addition, a new model will soon roll out to production: it uses an order of magnitude more features relevant to the task. The table below shows quality measurements for these models on historical data. By the way, we still have a lot of plans; come help.

ETA forecast quality on validation*

| Model | Mean Absolute Error, s | Error over 1 minute, % | Error over 2 minutes, % | Error over 5 minutes, % |
|---|---|---|---|---|
| Original ETA | | | | |
| Current model | 79.276 (–3.4) | 29.33 (–2.1) | 16.98 (–6.3) | 3 (–19.2) |
| New model | 78.414 (–4.5) | 28.95 (–3.4) | 16.62 (–8.2) | 2.8 (–23.2) |

* The change relative to the original ETA is shown in parentheses, %.

Machine learning won us about two seconds, or 3.4% of the mean forecast error, and the new model adds almost another second, for a total of 4.5%. From these numbers alone it is hard to see that ETA improved substantially. To feel the benefit of machine learning, look at the last column: forecasts that miss by more than 5 minutes became 19.2% rarer, and with the new model even 23.2% rarer. Moreover, with the machine-learned models such errors occur in only 3% and 2.8% of cases, respectively.


We refined the pin ETA primarily to give users a reliable forecast. But, of course, in any business application of machine learning one must evaluate the economic effect and check whether it is comparable to the cost of building and deploying the models. An online A/B test showed that machine learning gave us a statistically significant increase in the conversion from order to trip (after all, an order can be cancelled) and in the conversion from a user session to an order.

In both cases the effect is on the order of 0.1 percentage points. This, by the way, does not contradict statistical significance: at our data volumes, even such a difference is reliably detected within 2–4 weeks. And from the business perspective things look good too: the cost of refining the ETA is recouped by the conversion gain in just a few months.
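A back-of-envelope power calculation shows why detecting a 0.1 p.p. lift takes weeks of large-scale traffic (the 70% baseline conversion below is an assumed illustrative number, not Yandex's real figure):

```python
import math

def sessions_per_arm(p_base, lift, z_alpha=1.96, z_beta=0.8416):
    """Approximate sample size per A/B arm for a two-proportion z-test.

    z_alpha = 1.96 gives a two-sided 5% significance level,
    z_beta = 0.8416 gives 80% power; classic normal-approximation formula.
    """
    p_avg = p_base + lift / 2
    n = ((z_alpha + z_beta) ** 2 * 2 * p_avg * (1 - p_avg)) / lift ** 2
    return math.ceil(n)

# Assumed 70% session-to-order conversion and a 0.1 p.p. (0.001) lift.
n = sessions_per_arm(0.70, 0.001)
print(n)  # several million sessions per arm
```

Millions of sessions per experiment arm is exactly the scale at which a couple of weeks of data from a large taxi service becomes enough.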

In the end, we got a useful and illustrative case. Refining the pin ETA became an instructive story about choosing the target variable carefully. On the product side it is a very motivating example: we improved the app and saw that users appreciated it. We hope the updated ETA will help our passengers make it to meetings, trains and planes more often.

P.S. If you are interested in other Yandex.Taxi technologies, we recommend the post about dynamic pricing that my colleague published recently.
