
We distinguish a bus from a car by GPS-tracks

Photo by Artem Svetlov
To build a plausible cork picture, the Mail.Ru Maps project processes a large amount of information on the GPS tracks of traffic participants. Often, little is known about the source of the tracks, including for safety reasons. But to determine the true situation on the roads, I always wanted to know more. At least in order to understand how much the speed of the source machine corresponds to the speed of the rest of the stream. This article will focus on the method of extracting route vehicles (buses, trolleybuses, minibuses and trams) from the raw GPS data stream.
Why is it important
Route vehicles most often do not move at the speed of the rest of the stream. Of course, they can be indicators of the transport situation, but with some specifics:- Buses and trolleybuses, as a rule, have their own timetable with a large number of stops on the route. This means that on a free road the bus will travel deliberately slower than the stream and often stop for a short time. At rush hour, when buses run at intervals of 7-10 minutes, they can send a sufficient amount of information about the decrease in flow rate in the area of the stop.
- Thanks to the allocated lanes, the bus can go faster than the stream in traffic.
- Minibus drivers often drive contrary to any rules.
Separately, I would like to describe the trams that almost always drive along the allocated lanes passing near or in the center of the streets with automobile traffic. Therefore, the tram track is practically indistinguishable from the car track.
Initial data
I will make a reservation in advance that the purpose of the article is not to compare which of the satellite navigation systems is better. Almost all client devices now have chips that receive data from all available systems and provide generalized coordinates. To save space, hereinafter I will call the track obtained using the satellite navigation system, GPS-track.To get started, let's define what a GPS track is. A GPS track is a sequence of coordinates of a device’s position in time. Unfortunately, the only thing we know about each device sending a track is its unique identification number. These are strict privacy requirements.
All tracks have a different nature and come from different suppliers. In this article I will consider the case when the device is rigidly fixed to the vehicle and sends data at regular intervals. This simplification will allow me not to consider the situation when the device recording the track was in the hands of someone, then this someone got on the bus and drove a couple of stops on it.
The purpose of the analysis is to isolate from the general list of tracks those that travel most of the time along the same street sequences - routes.
Solution method
First of all, the initial continuous track must be divided into single trips, which we will compare with each other. As described earlier - physically on the machines there is a GPS tracker that sends its coordinates every few seconds. Most often, the tracker works when the car’s ignition is turned on, but there are devices that work around the clock. Therefore, the trip separator will take a long period of time at which the speed was always 0 or the device did not send data.
Example of dividing a track into trips
Now for each vehicle we have a set of travel tracks that it has made over a certain period of time. Among them there are both real trips, and loosely coupled tracks caused by errors in determining coordinates, movements within the closed zone of the enterprise, “reparking” and the like garbage. In order not to waste computational resources on it, I filter all tracks with a length of less than 400 meters, a number of points less than 10 and a geographic spread of less than 200 meters for a bounding box. This will avoid the asterisk tracks that are formed due to large random errors of the GPS receiver.

Characteristic asterisk tracks
The next task is to compare these tracks with each other and determine whether they go along the same route. To do this, the first thing I will bring all the GPS-tracks in a single form, linking them to our road graph. I wrote about the work of the binder in my last post. Since then, he has undergone some changes, but the basic principles have remained the same. At the exit from the binding, I get the track in the form of a chain of pairs (id of the edge of the graph, direction (forward or backward)). At this stage, you can filter out tracks that do not fit on our road graph. These can be tracks from airplanes / helicopters, containers in the seas, combine harvesters. Or simply from cars that drove to places where for one reason or another we do not have a road graph. I want to note that only those tracks that do not correspond to the road graph at all are filtered here. If the car left the parking lot, where we do not have a road graph, then drove along the roads for a long time, where it became attached to the road graph, and at the end of the road drove into the parking lot (where there is no road graph again), then such a track will be counted.
The resulting chains are much easier to compare with each other. I looked at various comparison metrics and ended up focusing on the Levenshtein metric . The alphabet in this case is the set of all possible edge-direction pairs. Thus, I got the opportunity to numerically determine the “similarity” of tracks as the number of edits of the edges of the route (adding / removing / replacing edges) in order to get another route from one route.
The next step is to group tracks on routes. This problem is solved by data clustering algorithms. Since I already have a one-dimensional metric of “similarity” of tracks, I took the simplest algorithm for hierarchical data clustering: dendrogram. A tree is built on the basis of Levenshtein’s minimum distance, and then its branches are broken, differing by more than n edges. It was imperative to calculate the optimal n equal to 16.
As a result, I get a set of clusters that store similar routes in themselves. Possessing this information, it is already possible to conclude whether the vehicle travels along a predetermined route. I had ideas to use different n depending on the number of edges in the route, but this improvement does not increase the quality of the search, and I decided to leave a fixed n.
Initially, it was thought that most vehicles have 2 routes (from final to final) in both directions. But, as practice has shown, sometimes the route can be circular, or consist of several parts.

Route vehicles do not always move along the route. There are trips to the garage, gas station, etc.

Tracks from the route. Close view
Thus, most vehicles have at least one cluster in which trips are accumulated, and several service ones: one-time or more rare routes (to the garage, to the gas station, and so on). Based on the data obtained, one more hypothesis can be checked: since we have vehicle routes and route comparison metrics, we can distinguish vehicles operating on the same route. To do this, you just need to take separate clusters of different vehicles and compare them with each other (all the more, the cluster comparison function is already in the hierarchical tree implementation).


Two different buses moving along the same routes.
Thus, I can clarify the route and group vehicles in parks.