GAE: batch put, data distribution, and a few feints
A cryptic title and no picture: a post in thoroughly grumpy-old-man style.
Once upon a time, Emelya wanted to find out where the visitors to his home page came from (I exaggerate), so Emelya turned to MaxMind and asked for their IP2City database.
Anyway, that's a smooth enough opening ...
Task: determine the user's city by IP in a GAE application.
First problems:
- you cannot upload the database file directly (the 1 MB/file limit)
- free resources are limited
And yes, those are not the last of the problems.
Of course, I could have spun up a small server and loaded the database into MySQL, or even used the binary database with the Python API, but I didn't want to breed a zoo of technologies, plus there is one more, smaller "but".
I didn't dare upload 3.5 million objects into the main application, for a couple of reasons:
- the application itself needs its resources
- there is no mutual dependency (the services split apart easily)
So a separate application was created; it will later be called via urlfetch and will return the data for a given IP.
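The post doesn't show the lookup call itself; as a rough sketch, the main application might query the geo application like this (the URL and the JSON response shape are my assumptions, and inside GAE the request would go through google.appengine.api.urlfetch rather than urllib):

```python
# Hypothetical client-side lookup against the separate geo application.
# Endpoint and response format are assumptions, not the original code.
import json
import urllib.request

GEO_SERVICE = "http://geoip-app.example.com/lookup"  # hypothetical endpoint

def lookup_url(ip):
    """Build the request URL for a given IP."""
    return "%s?ip=%s" % (GEO_SERVICE, ip)

def city_for_ip(ip):
    """Ask the geo service for the city of `ip` (assumed JSON response)."""
    resp = urllib.request.urlopen(lookup_url(ip))
    return json.loads(resp.read())["city"]
```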
The 1 MB limit was worked around by creating a RequestHandler that receives a part of the file and loads that part's objects into the datastore.
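The handler's code is not shown in the post; the per-chunk work boils down to parsing each posted CSV line into an IP range (in the real app each parsed row would become a datastore entity — here only the pure-Python parsing step is sketched, and the field layout is an assumption):

```python
# Sketch of the per-chunk parsing a loader RequestHandler would do.
# Assumed line format: start_ip,end_ip,location_id
import csv
import socket
import struct

def ip_to_int(ip):
    """Convert a dotted-quad IP to an integer for range comparisons."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def parse_chunk(lines):
    """Parse CSV lines into (start, end, location_id) tuples."""
    rows = []
    for start_ip, end_ip, loc_id in csv.reader(lines):
        rows.append((ip_to_int(start_ip), ip_to_int(end_ip), int(loc_id)))
    return rows
```

Storing the range boundaries as integers is what makes a "which block contains this IP" query possible later.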
On the client side, a small script sends the data with POST requests.
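A minimal sketch of that client script, assuming the 200-lines-per-request batching mentioned below (the loader URL is hypothetical):

```python
# Read the IP-blocks file and POST it to the loader in 200-line chunks.
import urllib.request

CHUNK = 200  # lines per request, as in the post
LOADER_URL = "http://geoip-app.example.com/load"  # hypothetical endpoint

def chunks(lines, size=CHUNK):
    """Yield successive `size`-line slices of the input."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

def upload(lines):
    for batch in chunks(lines):
        data = "\n".join(batch).encode("utf-8")
        urllib.request.urlopen(LOADER_URL, data=data)  # passing data= makes it a POST
```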
And here ... the "hard disco" dancing begins. It turned out that after 40,000 inserted objects, 1 hour of CPU time had been burned. Figuring that a separate put() per object was wasteful, I decided to try db.put() (the so-called batch put), which inserts objects in batches.
It turned out that db.put() not only failed to reduce the CPU time per request (recall, the local script sends 200 lines of the IP-blocks file per request, and so on to the end of the file), it actually increased it by about 10%. Agree, something is wrong here. I still don't understand what the problem is, but it looks as if db.put() simply calls put() on every object in the list (yes, it also inserts the objects in a single transaction where possible, but that is not our case).
A rough recalculation gives something around $10 to load the entire database (without locations).
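The arithmetic behind that estimate, using the throughput measured above; the per-CPU-hour price is my assumption about the GAE rate at the time, not a figure from the post:

```python
objects_total = 3_500_000      # rows in the IP-blocks table (from the post)
objects_per_cpu_hour = 40_000  # measured throughput (from the post)
usd_per_cpu_hour = 0.10        # assumed GAE billing rate at the time

cpu_hours = objects_total / objects_per_cpu_hour  # 87.5 CPU-hours
cost = cpu_hours * usd_per_cpu_hour               # ~$8.75, i.e. "around $10"
```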
In total, when developing for GAE:
- Think through how you will upload large amounts of data. Better still, figure out how to accumulate it over the application's lifetime.
- Use your quota of 10 applications: create web services for your own services and be the most service-oriented of services. One way or another, this reduces resource costs and, when necessary, lets you sidestep the "hard" limits. This is the main way to stretch GAE's capabilities to your needs.
- At the moment, db.put() has no performance advantage over Model.put().
See you in my LJ.