How to launch an ML prototype in one day. A report from Yandex.Taxi

    Machine learning is used throughout the entire Yandex.Taxi ride cycle, and the number of service components powered by ML is constantly growing. To build them in a uniform way, we needed a separate process. Roman Khalkachev, head of the machine learning and data analysis service, spoke about data preprocessing, applying models in production, the prototyping service, and related tools.


    - In my opinion, new ideas are much easier to grasp when they are explained with a simple example. So, to keep this report from being dry, I decided to talk about one of the tasks we are solving. Using it as an example, I will show why we work the way we do.

    Let's formulate the problem. There are taxi users who need to get from point A to point B, and there are drivers who, for a certain fee, are ready to take these users from point A to point B. The user goes through several states: he calls a taxi, selects point A, point B, the fare, and so on, then gets into the car, rides, and finally gets out. Today I would like to talk about getting into the car and the problems that can arise there.



    As a rule, these problems stem from the fact that a person has to choose the place where the taxi should arrive. A number of difficulties arise here, related to the four things listed on the slide.



    First of all, the location may be unfamiliar to the user. Imagine you have come to some large shopping center that you rarely visit. You want to leave, but you don't really know where you can call a taxi to: where a car can pull up and where it cannot, for example, because of a barrier. In some places there are many people and many cars, and it is hard to find your car. There are places where people usually get into the car, and boarding is simply easier there, but in a new place (not necessarily a shopping center) you may not know exactly where to board. Finally, difficulties may arise because the driver cannot drive up to the place where you called the taxi: entry is prohibited, or it is a large shopping center exit in front of which stopping is not allowed, and so on.

    On the other hand, problems may arise for you as a user. The driver has arrived, everything is fine, but it is inconvenient for you to get in because the road is dug up. You ask the driver to move somewhere else. There are other reasons as well.

    The most illustrative example, the quintessence of all of the above, is the airport, where almost all of these problems come together. Even if you fly out of Sheremetyevo very often, it is still an unfamiliar location, because a lot changes there all the time. There are many people and many cars; there are convenient places for boarding and inconvenient ones, but as a rule, none of us remembers which is which.



    The solution can be read from the title of the slide: let's recommend to the user places where, in our opinion, it is convenient to board. The idea seems obvious, but there are many nuances.

    For starters, "convenient" is a subjective notion. Before solving the problem, we need to formulate criteria for what counts as solving it correctly. We formulated three main ones. The first criterion is the same as in any recommendation task: recommendations are probably good if they are used. If we show points that users actually depart from, these are probably good points. But that is not all, because you can teach the system to recommend something, show it, and nudge the user to use it without getting any tangible benefit for anyone: neither for us as a platform, nor for the user, nor for the driver. Therefore it is very important to look at other metrics as well. We chose two.

    If we suggest a pickup point that the driver can easily reach, the car's arrival time should decrease. On the other hand, if it is easier for the user to find the car at this point and easier to get in, then the time the driver spends waiting for the user should also decrease. These are hypotheses that we take as given, and these are the metrics we watch when making recommendations. Of course, they are not the only metrics one could look at; each of you could easily come up with a dozen more.

    Here are a few more examples. It could be the share of cancellations before the trip: in theory, it should decrease if it is easier for the user to board. It could be calls, when the user phones the driver trying to find him, or, conversely, the driver phones the user before the trip begins. It could be support tickets, and a dozen more.

    So we have formulated the problem and roughly understood the criteria by which we can tell whether we are solving it. Now let's think about how to solve it. The first thing that comes to mind: let's recommend proven, well-understood pickup points. The slide shows an example of the Evropeisky shopping center. We know for sure that you can drive up to its exits, and an exit is a landmark that helps the user find the driver. It can also be some organization: there is an example with an Azbuka Vkusa store in a shopping center, Yerevan Plaza if I remember correctly. This is also a landmark for the user and the driver, and we know you can drive up to it.



    These can be landmarks at the airports I mentioned. In Sheremetyevo, for instance, there are pillars with numbers; it is convenient to call a taxi to one of them and get into the car there. It is a good solution, but it has a drawback: it does not scale. We operate in many countries and hundreds of cities, with a huge number of shopping centers, airports, complex interchanges, and unfamiliar places. Making these points manually for all of them is hard, and keeping them up to date is even harder. This is where what is loudly called "artificial intelligence" comes to our aid. I prefer to call it data mining or machine learning.

    Machine learning needs data, and we do have that data. So another way to solve the problem, automatically this time, is to use it. The high-level idea: we have GPS data, application logs, and a road graph. From these we can understand where users actually get into the car: not the points where they ordered it, but where they boarded. And based on that, we can build something like this.



    These are points generated automatically for the Aurora business center, where our Yandex.Taxi team currently sits.

    I have described the task at a high level. Now let's talk in more detail about the stages that make up the solution. Clearly, there is a data preparation stage.



    What data do we have? First, the GPS data of our users and of our drivers. When they use our applications, we know their approximate locations. GPS has a considerable error, around 13-15 meters, but it is still something. Second, the application logs contain the moment when the driver switched from the status "I am waiting for the user" to the status "I am taking the user". We can assume that at about that time the driver finished waiting, the user got into the car, and they drove off: the boarding happened somewhere around that place. And we have a road graph, which is not just a set of edges and streets but also additional meta-information: barriers, parking data, and so on. From this data we can already derive automatic points.
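    As an illustration, here is a minimal Python sketch of how such boarding events could be extracted from status logs. The schema (driver_id, ts, status, lat, lon) and the status values are hypothetical, not the actual Yandex.Taxi log format.

```python
import pandas as pd

def extract_boarding_events(status_log: pd.DataFrame) -> pd.DataFrame:
    """Find the moments when a driver switches from 'waiting' to 'riding'.

    status_log columns (hypothetical schema): driver_id, ts, status, lat, lon.
    Returns one row per boarding with the driver's GPS position at that moment.
    """
    log = status_log.sort_values(["driver_id", "ts"])
    prev_status = log.groupby("driver_id")["status"].shift(1)
    boarded = log[(prev_status == "waiting") & (log["status"] == "riding")]
    return boarded[["driver_id", "ts", "lat", "lon"]]
```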

    That was the input data. As output, we want two things. The first is the so-called pickup point candidates. Where do they come from? It's a pity I couldn't show the video, but roughly the following happens. We have many GPS points at which we know the driver switched from the status "Waiting for a passenger" to the status "Let's go". We can snap them to the graph, that is, project them onto the road graph, because the car usually starts moving from some road. Then, on this graph, we cluster these points and get a large number of candidates: places where some users actually got into the car, and it worked out fine and conveniently for them. Not where they ordered the taxi, but where they ended up getting in.
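    The talk does not name the exact clustering algorithm, so here is a minimal sketch of this step with DBSCAN from scikit-learn standing in for it, assuming the boarding points have already been snapped to the road graph and projected to metric coordinates; eps_m and min_pts are illustrative values, not the production parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def generate_candidates(points: np.ndarray, eps_m: float = 20.0, min_pts: int = 10):
    """Cluster snapped boarding points; each cluster centroid becomes a
    pickup point candidate. points: (N, 2) array in metric coordinates."""
    labels = DBSCAN(eps=eps_m, min_samples=min_pts).fit_predict(points)
    # label -1 is DBSCAN noise: isolated boardings that form no cluster
    return [points[labels == c].mean(axis=0) for c in set(labels) if c != -1]
```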

    The second thing: when we have many candidates and a user comes online (we know his location; he has opened the app and wants to call a taxi), we can select the best five out of the many candidates and show them. The best five are determined by a machine learning model that ranks all candidates by the probability that this user, right now, given his location and travel history, will find it most convenient to depart from each of them. In roughly this way we can generate these points automatically. Moreover, if at some point a road gets dug up and calling a taxi there becomes inconvenient, or a no-stopping sign goes up, and drivers and users really stop boarding at this place, such points automatically stop being recommended.
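    The online part could look roughly like the sketch below. The feature set and the build_features helper are hypothetical; the talk only says that the model ranks candidates by the probability that the user departs from them.

```python
import math

def build_features(user, cand):
    """Hypothetical feature vector: distance to the user, candidate
    popularity, and time of day."""
    dist = math.hypot(user["x"] - cand["x"], user["y"] - cand["y"])
    return [dist, cand["popularity"], user["hour_of_day"]]

def recommend_pickup_points(user, candidates, model, top_k=5):
    """Rank candidates by the predicted probability that the user will
    actually depart from them, and return the best top_k."""
    features = [build_features(user, cand) for cand in candidates]
    probs = model.predict_proba(features)[:, 1]  # P(user departs from point)
    ranked = sorted(zip(candidates, probs), key=lambda pair: -pair[1])
    return [cand for cand, _ in ranked[:top_k]]
```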



    This is roughly the block diagram of how we prepare the data. It is fairly standard, as in any machine learning pipeline. There is data preparation; there is candidate generation by the algorithm I described in simplified form; we store these candidates in a database. After that, we build a training pool (a training sample) containing, roughly, the user, the time, meta-information, the set of candidates, and which point the user eventually departed from. On this pool we train a classification model, and then rank the candidates by its predicted probabilities. When the model is ready, we upload it to a cloud where it is safely stored.
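    A compressed sketch of the offline part of this pipeline, with a scikit-learn classifier standing in for whatever model is actually used in production; the pool layout and file path are assumptions.

```python
import joblib
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_ranking_model(pool_X, pool_y, model_path="pickup_points.joblib"):
    """pool_X: features of (user, time, candidate) examples from the pool;
    pool_y: 1 if the user actually departed from that candidate, else 0."""
    X_tr, X_val, y_tr, y_val = train_test_split(pool_X, pool_y, test_size=0.2)
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    print(f"validation AUC: {auc:.3f}")  # offline metric, before any A/B test
    joblib.dump(model, model_path)       # in production: upload to the model cloud
    return model
```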



    What tools do we use for data preparation? Basically, all of it is written in Python, on the Python stack: the standard NumPy, Pandas, Scikit-learn, and so on. We have a lot of data: millions of trips per month, lots of GPS data, driver tracks, and application logs, so we have to process it on a cluster. For this we use our internal Yandex MapReduce system called YT, which has a Python library that lets you launch mappers and reducers and run computations on a large cluster.
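    The exact API is not shown in the talk, but YT's Python wrapper follows the usual MapReduce pattern, where a mapper is a plain Python generator over rows. A sketch in that spirit (the table paths and fields are hypothetical):

```python
import yt.wrapper as yt  # Python wrapper for the YT MapReduce system

def boarding_mapper(row):
    """Keep only 'waiting -> riding' transitions from the raw status log."""
    if row.get("prev_status") == "waiting" and row.get("status") == "riding":
        yield {
            "driver_id": row["driver_id"],
            "ts": row["ts"],
            "lat": row["lat"],
            "lon": row["lon"],
        }

# run the mapper on the cluster over a large log table
yt.run_map(boarding_mapper, "//logs/driver_statuses", "//ml/boarding_events")
```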

    Finally, when the pipeline is ready, we need to automate it so that the data stays up to date. For this we use Nirvana and Hitman, also internal Yandex tools. Nirvana is a framework for managing cluster computations. It can run almost any program, is fault-tolerant, works across data centers, restarts things that fail, can trigger launches when certain events occur, and so on.



    This is roughly what the web interface of our MapReduce cluster looks like. You can see that we have a lot of machines, nodes on which computations run.



    And this is what a typical data preprocessing and model training process looks like in the web interface. It is a dependency graph. There are data dependencies, when one part (one cube) waits for data from another cube, and logical dependencies (first we prepare all the data, then start the training). The whole thing is automated, and we generally use Python for all of it.

    So we formulated the problem, formulated success criteria, learned to solve the problem offline, and even built a model. It seems to work according to the offline metrics: it really predicts the points users depart from and finds points that, it would seem, should reduce the waiting time and the car's arrival time.



    Let's now try to use these models and this data. To do that, let's look at what the Yandex.Taxi service is.

    A very high-level diagram looks like this. There are users with their app, and there are drivers with their own app called Taximeter. These apps communicate with the backend, and the backend is a set of microservices that talk to each other; Ilya spoke about this earlier. One of those microservices is ours. Our team builds it, and it is called ML as a Service, or MLaaS.



    All you need to know about it: MLaaS is written in C++ on top of the so-called Fastcgi Daemon. This is an open-source library that is, roughly speaking, a framework for writing a web server handling GET and POST requests, all quite standard. It was once written at Yandex and open-sourced; we use a modified version. What can this service do? It knows how to work with models: apply them, keep them locally and update them from time to time, going to that cloud where models are regularly updated and saved, and downloading them.

    Each piece of functionality, for example these boarding points (internally we call them pickup points), or the point B suggestions that Ilya talked about, and kept breaking, in the previous report, each such ML-powered feature corresponds to a handler that holds the logic of receiving a request, generating the machine learning features, applying the models, and generating a response. Of course, the service is not isolated: it can go to additional data sources, databases, and other microservices.



    This is how it works; the architecture is quite simple, and I won't dwell on this slide in detail. A request arrives; there is a model factory that periodically downloads the models from the cloud. Each model is stored in memory in a single copy. For each request, a fairly lightweight model object is created, which extracts features, applies the model, and generates a response.
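    The service itself is C++ and not public, but the pattern is easy to sketch in Python: one heavyweight model copy shared in memory, plus a cheap per-request object. All names here are mine, not the actual MLaaS internals.

```python
import threading

class ModelFactory:
    """Keeps a single copy of each model in memory; a background job
    periodically downloads fresh versions from the model cloud."""

    def __init__(self):
        self._models = {}
        self._lock = threading.Lock()

    def update(self, name, model):
        with self._lock:  # called after downloading a fresh model
            self._models[name] = model

    def get(self, name):
        with self._lock:
            return self._models[name]

class RequestScopedModel:
    """Lightweight per-request object: extracts features, applies the
    shared model, builds the response. Holds no heavy state of its own."""

    def __init__(self, factory: ModelFactory, name: str):
        self._model = factory.get(name)  # shared reference, not a copy

    def handle(self, request: dict) -> dict:
        features = [request.get("lat", 0.0), request.get("lon", 0.0)]  # hypothetical
        return {"scores": self._model.predict_proba([features])[0].tolist()}
```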



    So where does that leave us? As I said, data preparation, training, research, and experiments are all written on the Python stack, while production is written in C++, simply because we have high demands on efficiency and performance. Living in such an ecosystem creates two problems.

    The first is the problem of experiments. Say a data scientist on our team has an idea: run some clustering or classification algorithm with slightly different parameters and get better quality. He tests his hypothesis offline, plugs it into our Python process, computes it, and it really works. Now he wants an A/B experiment: show the new algorithm to part of the users and measure the metrics online. Does the waiting time really drop? Is usage growing? To do this, he takes, say, five versions of his algorithm that he believes in, which give good quality offline, implements them in C++, and runs an A/B experiment. And after this A/B experiment, perhaps all five will be discarded: their online quality may turn out worse than what is already in production. So the experimentation process takes a long time, because there are, in effect, two different languages and two different technologies.

    That was about existing features; there are also new ones. Once, these pickup points were also just an idea that we wanted to check quickly, without spending two months of development on it; ideally, getting something in three weeks. But building such a prototype was quite laborious. You write the feature extraction in Python, simply because it is convenient: move fast, as they say. You can build any prototype in Python; there are plenty of data analysis libraries. You have experimented on your laptop, and now you want to test it on users, and it turns out that making a production prototype is rather hard. We came to the conclusion that we need an additional service for assembling such prototypes quickly, in a week or even a day, and for running A/B experiments.



    We created such a service and called it PyMLaaS. What is it? In essence, a complete analogue of the MLaaS I described earlier, but written in Python on top of Flask, nginx, and Gunicorn. The architecture is quite simple, the same as MLaaS, but it lets you quickly plug in a prototype from your offline experiments. In addition, we set up proxying at the nginx level, so that we can forward part of the load from MLaaS to PyMLaaS and experiment that way.
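    A minimal PyMLaaS-style handler might look like the sketch below; the route, payload format, and model file are assumptions, but the stack (Flask behind Gunicorn and nginx) is the one named in the talk.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("pickup_points.joblib")  # hypothetical model file

@app.route("/pickup_points", methods=["POST"])
def pickup_points():
    payload = request.get_json()
    # hypothetical features per candidate: distance to user and popularity
    features = [[c["dist_m"], c["popularity"]] for c in payload["candidates"]]
    probs = model.predict_proba(features)[:, 1]
    best = sorted(zip(payload["candidates"], probs), key=lambda p: -p[1])[:5]
    return jsonify({"points": [c for c, _ in best]})

# started under Gunicorn behind nginx, e.g.: gunicorn -w 4 app:app
```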



    That is, we tweaked some parameters and want to see how that affects users: we send 5% of the load to PyMLaaS and watch what happens in the experiment. Finally, it is convenient for prototypes: you build a prototype of a new feature, put it in PyMLaaS, and can immediately test it in production.
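    The talk does not show the exact nginx setup; one standard way to implement the 5% split described above is the split_clients directive. A hypothetical snippet (upstream names, addresses, and the percentage are illustrative; this belongs in the http context):

```nginx
# route ~5% of users to the Python prototype, the rest to production MLaaS
split_clients "${remote_addr}" $ml_upstream {
    5%   pymlaas_backend;
    *    mlaas_backend;
}

upstream mlaas_backend   { server 10.0.0.1:8080; }
upstream pymlaas_backend { server 10.0.0.2:8000; }

server {
    listen 80;
    location /ml/ {
        proxy_pass http://$ml_upstream;
    }
}
```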

    We liked it so much that the idea came up: why not use it all the time? We can't everywhere, because some features involve heavy load, say 1000 RPS, large memory requirements, and a need for fairly flexible parallelism. But for features, products, and services that don't have such high demands on load, performance, and RPS, we use this service quite successfully.



    To summarize. Right now we have a working scheme for creating products that use machine learning. First an idea appears. We try to be fashionable and test the idea on data: we examine and prepare the data, set up experiments, perhaps train models, look at offline metrics and analytics. Then we implement it as a handler in PyMLaaS, run an A/B experiment, and send part of the load to this service. If the feature takes off, we move it to MLaaS, where it lives on its own, works, and brings happiness to users and drivers.

    Returning to the pickup points problem: the hypothesis turned out to be true, and the benefit of recommending convenient boarding places is quite tangible. The car's arrival time has dropped, and so has the waiting time. About 30% of all trips now start from a point we recommend. Thank you very much for your attention.
