How to find the nearest cafe, attraction, free taxi through the eyes of a programmer

From the sandbox

Services that solve any problems in the context of our location are firmly embedded in our lives. Most smartphones, if you have Internet access, can call us a taxi, calculate how long the bus will arrive, get directions based on traffic jams and various user preferences, or show friends nearby. Tasks such as finding nearby cafes or attractions have become trivial for them and can usually be solved without any access to the World Wide Web. In this article I want to consider some tools for solving such problems and compare their performance with each other.

Formulation of the problem

You must choose the tools to develop a service in which users periodically upload their location, while other users look for their "neighbors". Examples of services that solve such problems can be taxi services, social networks and games like Ingress .

The way to solve the problem

There is some theoretical introduction in the article , in more detail in Wikipedia . Further, purely practical issues will be considered.
To solve the problem, adapter classes for several selected services will be implemented. The adapter data interface is shown in the listing:

from abc import ABCMeta, abstractmethod
classAbstractStorage(object):
    __metaclass__ = ABCMeta
    @abstractmethoddefprepare_storage_for_experiment(self, test_data):pass    @abstractmethoddefexperiment_search(self, test_data):pass    @abstractmethoddefexperiment_update(self, test_data):pass    @abstractmethoddefclear_storage(self):pass

Time is measured using Profilehooks .

Accepted Simplifications

Hereinafter, all code is written in Python; you can work with all the tools described below from other common programming languages, unless otherwise indicated. The ability to speed up the system by rewriting everything in a faster programming language like c / c ++ will not be considered in the framework of this article, although it can be used in combat conditions if the feasibility of such a solution is proved.
In the given system, all requests are processed sequentially, which is equivalent to having a queue of requests before the module in question and the operation of the module in one thread; when using the system in battle, the developed module will most likely process requests in several threads.
All tests run on a laptop with the following hardware: 8 Gb RAM, Intel Core i5 2.6 GHz, SSD. The configuration of server hardware is likely to be different.
All used tools will be used with the default configuration, with the exception of the same amount of allocated RAM (where this moment lends itself to configuration by standard means). The configuration of the selected tools in this article will not be considered.

Implementation

A line (a document or another, depending on the accepted terminology) consists of id and a pair of coordinates in the internal representation of the system. An index is constructed for each of the columns if the system allows this. In all implementations, the code responsible for the search and update will be presented. A link to the full project code on github will be given at the end of the article.

Implementation 1. MongoDB v3.2.6

Link to the documentation for the geo-search

Code responsible for testing the speed of search and updates:

@timecall(immediate=True)defexperiment_search(self, test_data):deffind_point(point):
        cursor = self.collection.find(
            {
                MongoStorage.key_position:
                    {
                        '$near':
                            {
                                '$geometry':
                                    {
                                        'type': "Point",
                                        'coordinates': [point['lng'], point['lat']]
                                    },
                                '$maxDistance': 10000
                            }
                    }
            }
        )
        return cursor[0] if cursor.count() > 0elseNone@timecall(immediate=True)defexperiment_update(self, test_data):for t in test_data:
        self.collection.update_one(
            {
                MongoStorage.key_id: t["id"]
            },
            {
                '$set': {
                    MongoStorage.key_position: {
                        'type': "Point",
                        'coordinates': [t['position']['lng'], t['position']['lat']]
                    }
                }
            }
        )

Explain for the search term:

{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "taxi_geo_experiment_test_db.taxi_driver_collection",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"key_position" : {
				"$near" : {
					"$geometry" : {
						"type" : "Point",
						"coordinates" : [
							37.3816328351611,
							55.01604115262
						]
					},
					"$maxDistance" : 10000
				}
			}
		},
		"winningPlan" : {
			"stage" : "GEO_NEAR_2DSPHERE",
			"keyPattern" : {
				"key_position" : "2dsphere"
			},
			"indexName" : "key_position_2dsphere"
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "host",
		"port" : 27017,
		"version" : "3.2.6",
		"gitVersion" : "05552b562c7a0b3143a729aaa0838e558dc49b25"
	},
	"ok" : 1
}

We see that the geo-index is used.

Implementation 2.1. PostgreSQL v9.5.2 using ST_DWithin

Documentation link (postgis)

The code responsible for testing the speed of searches and updates:

@timecall(immediate=True)defexperiment_search(self, test_data):
    cursor = self.db_connect.cursor()
    for item in test_data:
        request = "SELECT * FROM {table_name} " \
                  "WHERE ST_DWithin({column_geo},ST_MakePoint({lat},{lng}), 10000)".format(
            table_name=PostgreSQLStorage.table_name,
            column_geo=PostgreSQLStorage.column_position,
            lat=item["lat"],
            lng=item["lng"])
        cursor.execute(request)
        search_result = cursor.fetchall()
@timecall(immediate=True)defexperiment_update(self, test_data):
    cursor = self.db_connect.cursor()
    for item in test_data:
        request = "UPDATE {table_name} set {update_column_name}=ST_MakePoint({lat},{lng}) " \
                  "where {id_column_name}={id}".format(
            table_name=PostgreSQLStorage.table_name,
            update_column_name=PostgreSQLStorage.column_position,
            id_column_name=PostgreSQLStorage.column_id,
            lat=item["position"]["lat"],
            lng=item["position"]["lng"],
            id=item["id"]
        )
        cursor.execute(request)
        self.db_connect.commit()

Explain for the search term:

Index Scan using geo_index on taxi_drivers  (cost=0.14..10.51rows=1 width=36)
   Index Cond: (driver_position && '0101000020E6100000C8EA77DD72C44B404C0305B698B04240'::geography)
   Filter: (('0101000020E6100000C8EA77DD72C44B404C0305B698B04240'::geography && _st_expand(driver_position, '10000'::doubleprecision)) AND _st_dwithin(driver_position, '0101000020E6100000C8EA77DD72C44B404C0305B698B04240'::geography, '10000'::doubleprecision, true))

We also see the use of a geo-index.

Implementation 2.2. PostgreSQL v9.5.2 using ST_Distance

Documentation link (postgis)

Code responsible for testing search speed:

@timecall(immediate=True)defexperiment_search(self, test_data):
    cursor = self.db_connect.cursor()
    for item in test_data:
        request = "SELECT * FROM {table_name} " \
                  "WHERE ST_Distance({column_geo},ST_MakePoint({lat},{lng})) < 10000".format(
            table_name=PostgreSQLStorage.table_name,
            column_geo=PostgreSQLStorage.column_position,
            lat=item["lat"],
            lng=item["lng"])
        cursor.execute(request)
        search_result = cursor.fetchall()

The code responsible for testing the update speed does not differ from the implementation described in the previous paragraph.
Explain for the search term:

 Seq Scan on taxi_drivers  (cost=0.00..8241.00rows=10000 width=60)
   Filter: (_st_distance(driver_position, '0101000020E61000003B2CDE5519AA4B402B1697185FED4240'::geography, '0'::doubleprecision, true) < '10000'::doubleprecision)

In this case, the index is not used, all values in the table will be viewed, which is much slower.
Learn more about EXPLAIN in PostgreSQL .

Implementation 3. Redis v3.2.0

Geofunction documentation link

Code responsible for testing search and update speed:

@timecall(immediate=True)defexperiment_search(self, test_data):
    i = 0for item in test_data:
        command = "GEORADIUS {key} {lng} {lat} {dist_km} km WITHCOORD WITHDIST".format(
            key=RedisStorage.KEY_DRIVERS,
            lat=item["lat"],
            lng=item["lng"],
            dist_km=10
        )
        result = self._redis.execute_command(command)
@timecall(immediate=True)defexperiment_update(self, test_data):for item in test_data:
        command_rm = "ZREM {key} \"{id}\"".format(
            key=RedisStorage.KEY_DRIVERS,
            id=item["id"]
        )
        self._redis.execute_command(command_rm)
        command_add = "GEOADD {key} {lng} {lat} \"{id}\"".format(
            key=RedisStorage.KEY_DRIVERS,
            lat=item["position"]["lat"],
            lng=item["position"]["lng"],
            id=item["id"]
        )
        self._redis.execute_command(command_add)

There is no analogue to explain for redis, since there is no need for such a command, redis is not intended for complex queries in which similar functionality may be required.

It is easy to notice one more feature - in redis you cannot remove from the key (the closest analogue of the key in SQL is the table; the key can contain either a simple value, for example, a number, or a complex one, for example, a set) one of the geo objects, for this you need to use the knowledge about internal representation: the ZREM command removes a value from a set.

Conclusion

Testing was conducted on 30,000 objects and the same number of requests. If necessary, you can test on other sets of values by replacing the corresponding parameters in the code. The test results are presented in the table:

Tool	Search Time	Update Time
Mongodb	320.415	14.275
PostgreSQL (ST_DWithin)	114.878	8.908
PostgreSQL (ST_Distance)	1829.920	- (implementation and result are similar to PostgreSQL (ST_DWithin))
Redis	1098.604	5.016

All data in the table are presented in seconds, the average value for 5 tests.

Link to the project repository.

If you know another tool for effectively solving the task - write in the comments, I will consider it with interest.

Update 1: Added PostgreSQL implementation using ST_Distance. This implementation does not use an index; accordingly, the search lasts longer.

Tags: