Where is the money on the road (an algorithm that allows one and a half times to reduce costs in a taxi)

There are probably very few people who have never had to use taxi services, especially since the popularity of this type of transport has been growing recently, and the price has been steadily falling. From personal experience I can say that this is a convenient way, for example, to go to the clinic with a small child.

Have you ever wondered what we pay for using a taxi?

Undoubtedly, the main part is the fee for the driver’s time and the cost of the car, but it would be rather rash to assume that this fee applies only while you are on the road. When a taxi driver has to wait a long time for the next order, who will pay for his time and a simple car? - in the end, we are with you. A taxi driver rightfully does not agree to remuneration for a day lower than its value in the labor market, and at the same time carry out an average of five orders or twenty-five. I was curious about what time downtime was not beneficial to anyone and how it could be reduced. Below I would like to share with you the results of my own research on this issue.

How does a modern taxi driver work?

He receives an order from the dispatcher or through the application for taxi drivers, takes the passenger to the place he needs, and most often he waits for the next order near the landing point.

Why is this policy bad?

It can lead to a situation where in some areas of the city there are too many taxi drivers who have to stand in line for the next order for a long time, while
in other areas there is a shortage of free cars and some of the orders are therefore lost. In the most acute form, the described problem arises as a result of daily migrations: in the morning, taxi drivers, together with the flow of the working population, “flow away” from residential areas, where at the same time there is an increased demand for their services, and accumulate in the center, where the order density is, on the contrary, lowered. and especially few who wish to return home.

Are the previous conclusions plausible reasoning or are they somehow tested in practice ?

When I thought about this for the first time, I relied solely on common sense and little experience in communicating with taxi drivers who brought me. However, at that moment when I got the first sketch of an algorithm optimizing the work of a taxi service, of course I wanted to test my hypotheses and evaluate the effectiveness of the solution. For obvious reasons, not one of the major market players, such as Uber, Gett, or "domestic" Yandex.Taxi, would like to open the information I need. Fortunately, the administration of distant Chicago has long been collecting and making publicly available data on all trips of a significant part of city taxi services. (For more details, see "Data Notes").

What was the state of affairs in Chicago for 2016?

According to my estimates (their methodology is described in more detail in the Notes on Estimating Service Time), of the time spent by all taxi drivers at the workplace, at best, only 46% of it was occupied by the execution of orders. As for the negative manifestation of the effect of daily migration, on average, the number of available cars within administrative regions is 30% more or less than the number of incoming orders at this time.

What and how can be improved?

It goes without saying that there are two obvious lines of action:

1) To organize the transfer of cars from areas where there are too many of them, to areas where there is a shortage or expectation of them.
2) Maintain in each region the optimal number of machines awaiting order.

Of course, you can set a more general task of maximizing profits, and in the course of solving it, even refuse to serve part of the "disadvantageous" routes, but such a strategy is fraught with a loss of commitment for the company as a whole to its beneficial users if a competing player serves them along the "disadvantageous" routes . Given this possibility, an attempt to serve all available orders, but at the lowest cost, seems reasonable enough.

What requirements should be presented to a good traffic redistribution algorithm?

Here, in my opinion, the following conditions are desirable:

a) The total time spent on the transfer, since it is not paid by passengers, it is advisable to make it as small as possible (In the program I wrote, for example, special minimizing algorithms on graphs are used for this, but about them later);
b) Since drivers' abundance most often directly depends on the volume of orders executed by them, it is desirable that the overhead costs of implementing the algorithm fall on each of them equally. It should also be noted that the driver who received the order to move to another area will most likely fulfill it if it takes 15 minutes, and rather ignore it if it is 50.
c)The degree of deficit in the number of free cars in each region varies during the day, therefore, the algorithm should eliminate the imbalance with sufficient efficiency and, when planning transfer flows, calculate them taking into account the situation in the future.
d) The computational complexity of the algorithm should make it possible to put it into practice.

A little trick that makes things much easier

Imagine twenty soldiers standing in front of you in a row. How to achieve that the place before the first becomes occupied, and the place of the last is vacated? One way is to ask the person standing last to go forward. Another way to do the same, but twenty times faster, is to ask all the soldiers to move just one position. This idea, applied to taxi cars, means the following: instead of transferring them from surplus area A directly to remote deficient area B, you can “move” cars only one position along the chain of areas leading from A to B. This solution, with one on the one hand, it allows eliminating the imbalance almost instantly, compared with the rate of change in the deficit on the map, on the other hand, requiring drivers to spend no more than 15 minutes each time on “shifts”.

What is the mathematical method for implementing the traffic redistribution algorithm?

(Here I will describe in detail the technical details of the algorithm I developed, a reader who is interested only in the results may skip this part of the article.)

Let's try to reduce the problem to a well-known problem on the search graphs in the production-consumption network of the supply stream of minimum cost. Since the demand for taxi cars and their distribution in the city depends on the time of day and day of the week, we will construct a separate column in the manner described below for each day of the week and time interval of one hour. The vertices of the graph will be administrative districts. Those in which the expected number I of machines freed up after the order exceeds the expected number of orders U will take the role of production centers with a capacity of S = I - U, the rest - the role of consumption centers with a capacity of S = U - I. The value of S is called the imbalancepeaks (district). We assume that there is an edge from region A to region B if (and only if) from A to B can be reached on average in time t not exceeding 15 minutes, while t is the cost of this rib (for more details see "Notes Estimation of travel time between regions "). In the classical problem of the production-consumption network, the restrictions on the allowed flow quantity have edges of the graph, however, in the case under consideration it is difficult to assume that only taxi drivers can cause traffic congestion, but twice in the same interval between orders to “shift” the driver would be bad form from our side. By the last remark, the “shear” flow from the top should not exceed I.

Is there a way to transform the graph in such a way that it is possible to apply algorithms for the standard task of finding the flow of minimum cost?

Let's choose any vertex and replace it with two new ones so that the first one includes all the edges that are in the original, and the second one that leaves all the edges that go out of the original. It remains to connect these two new vertices with a single edge with zero cost and bandwidth I and ascribe the first imbalance to S, and the second to zero. Having done such a transformation with each vertex of the original graph, you will transfer it to the form standard for the minimum cost flow problem, but this is not all.

So that the task of finding a flow that satisfies the needs of all the "centers of consumption" and realizes the power of all the "centers of production" is generally solvable, the total imbalance of all the vertices must obviously be zero. Is it possible to expect the fulfillment of the last property for taxi service graphs? It seems very plausible that in the graphs constructed for the morning hours, when the need for taxi cars is growing, the total imbalance will be negative, and for the evening ones, vice versa. Let's imagine how a taxi service can cope with this problem: it would be reasonable to introduce a work schedule for taxi drivers in which the number of cars involved would always correspond adequately to demand. To take this into account, in my algorithm I added an additional vertex “house” to the graph

By the way, if you ever need to numerically solve the problem of finding a stream of minimum cost, I strongly advise against resorting to dubious sources such as Wikipedia or various forums that advise outdated Ford-Bellman algorithms or a potentially looping network simplex method. Good algorithms in terms of runtime can be found in this article.

“Under the hood” of the programs attached to the article, my version of the Edmonds – Karp algorithm works.

What are the comparative overheads of implementing a redistribution algorithm in the city of Chicago?

On average, taxi drivers spend 275 million seconds together on a week to complete orders in Chicago, stand idle in anticipation of 315 million seconds, while the redistribution algorithm would take them only 30 million seconds.

Is it possible to consider that all the time costs for the ideal organization of transportation is the lead time of either orders or the requirements of the traffic redistribution algorithm?

Unfortunately not! Suppose, for example, in a certain area on Monday between the 9th and 10th mornings that 25 orders are expected, 15 vacated after the passengers disembarked and another 10 additionally transferred from surplus areas. It would seem that the number of free and required cars is in balance over this period, is it worth counting on being able to complete all orders? Even if the expected numbers coincide with the real ones, the order in which free cars will arrive and applications for them, generally speaking, is arbitrary. To guarantee the availability of a free car at the time of the order, the latter will always have to keep a small margin. The stock of vehicles, which with a high probability will not be depleted during the delay in their redistribution (of the order of 15 minutes),

Quite unexpectedly, the time lost in such queues, which is much more than the time spent on the redistribution algorithm, is approximately 114 million seconds per week. Is there any way to reduce it?

You can come up with something. One of the ideas is this: if taxi cars delivering passengers to a certain area arrive at an average interval per minute, then you should not refuse the next order, even when there are no free cars at the time of its arrival: at most three minutes later one would be sure to appear. True, this wait-and-see strategy has one minus - it increases the "feed time" of the car. In my algorithm, I reduced the queue size in each area by the average number of cars arriving there in 5 minutes. Such a measure should only in rare cases increase the “filing time” by five minutes, but the total downtime losses in the queues for calculations should be reduced by more than two, reducing them to 50 million seconds per week.

What is the final result?

For Chicago, in accordance with the calculations (their program you will find but in the links below), using the algorithm for redistributing free cars in combination with managing the size of their queues in each individual area would increase the share of time attributable to passenger transportation from 46% to 77%. If the calculations are correct, then their implementation in Chicago would save about $ 15 million a year, and in cities of the size of Moscow about $ 100 million - money that now simply remains on the road and does not benefit anyone.

Code projects related to this task are available here .

Data Notes

In the calculations, I used data for the whole of 2016. Two months ago there were more fields in the dataset and there was a lot of information among them about the exact coordinates and about paying for the order. Apparently, some of them were hidden in the commercial interests of taxi services, and some because of problems with their personal lives. I did not need all the fields that were available at that time, and the file with which I finally started working can be found here . One entry in it consists of:

order number cipher,
machine number cipher,
date and time of embarkation and disembarkation, rounded to the nearest marks on the clock, multiple of 15 minutes,
long miles
lead time in seconds
administrative zones of Chicago, in which the landing and, accordingly, the landing,
approximate geo-coordinates of the landing site and landing site.

This file has a size of almost eight gigabytes, with a good two-thirds of them occupying inappropriately long ciphers. To make it more convenient for me to work with data, I replaced the ciphers with their serial number during dictionary ordering, converted some numbers from text format to numeric, and saved the records as objects of the user class “date unit” class. In the process, I threw out a small number of records with completely unrealistic data. Compact packaging has reduced space for the same amount of information to 750 megabytes. The latter file can also be found by reference , and the storage format, for example, in this program .

Note on Estimating Service Time

The average lead time for the week I received by summing the lead times indicated in the record for each trip and divided by 52. As you can see, there is no magic. To find the downtime, I first formed a time-ordered list of completed orders for each machine and subtracting the time of the previous order from the start time of the next order, I received new lists of so-called “timeouts”. Two difficulties arose here. As I mentioned in the notes about the data, the entries about the time of embarkation and disembarkation are not real times, but their rounding to the nearest marks on the dial, multiple of 15 minutes. The second difficulty was not to consider breaks between work shifts as downtime. I decided the second by unloading the distribution of "timeouts" of random 30 machines, from where it was clear that waiting times of the order of 15-45 minutes are a characteristic phenomenon, and quite often there are hourly downtimes. From this, I concluded that only “timeouts” of less than one and a half hours should be included in the calculation of the downtime, and the rest should be ignored. I circumvented the first difficulty simply by summing up all the timeouts taken into account. Why can this be done? This could be an interesting task in probability theory - to show that if a pencil of length L is randomly placed on a ruler and rounded off the coordinates of its ends to the nearest integers, then the mathematical expectation of the difference of these rounded values will be exactly the length of the pencil. From this, I concluded that only “timeouts” of less than one and a half hours should be included in the calculation of the downtime, and the rest should be ignored. I circumvented the first difficulty simply by summing up all the timeouts taken into account. Why can this be done? This could be an interesting task in probability theory - to show that if a pencil of length L is randomly placed on a ruler and rounded off the coordinates of its ends to the nearest integers, then the mathematical expectation of the difference of these rounded values will be exactly the length of the pencil. From this, I concluded that only “timeouts” of less than one and a half hours should be included in the calculation of downtime, and the rest should be ignored. I circumvented the first difficulty simply by summing up all the timeouts taken into account. Why can this be done? This could be an interesting task in probability theory - to show that if a pencil of length L is randomly placed on a ruler and rounded off the coordinates of its ends to the nearest integers, then the mathematical expectation of the difference of these rounded values will be exactly the length of the pencil.

Notes on Estimating Travel Time Between Districts

The first idea of such an assessment was to average the time over all available records of trips from one region to another for a given time of day and day of the week. However, it seemed to me that this value would be slightly overestimated due to all sorts of time for baggage packing, disembarkation and boarding of passengers, as well as passage to the house porch along narrow streets where there is absolutely no need to send cars, realizing just their transfer. Therefore, I simplistically assumed that the order time is some delay constant plus a value proportional to the distance specified in the order. I cut off the constant by applying the least squares method (see in this program), and then averaged only the second term. If the proportion of the constant in the order time was more than a third, I subtracted only a third.

Tags: