Working with large amounts of data, and the Habraeffect

    One of the goals of creating bullshitbingo.ru was to see how Google App Engine (GAE) behaves under more or less realistic conditions. I was particularly interested in collecting my own statistics, because what GAE and Google Analytics give me does not suit me, for reasons I will explain below. There was no particular reaction to the post itself, but it made it to the front page, and within a day the site received about 15 thousand page loads, which was quite enough. The peak load was 3-4 requests per second; as a result, the free quota allocated by GAE was not exceeded.

    What follows is a description of the peculiarities of working with statistics in GAE and, in the second part, graphs of the load received: my own and the ones Google generates. I tried to write so that it would be understandable even to those who have never dealt with GAE at all.


    Part One:
    GAE, of course, shows graphs of its own, but there are a number of issues:
    • they become unavailable after a while: graphs are shown only for the last day;
    • the presentation of the data is fixed and cannot be customized;
    • there is no way to build a graph with your own conditions; in the case of bullshitbingo I would be interested in watching different games separately.
    Google Analytics is a much more interesting thing in this respect, but it has its own problems: JavaScript, a minimum time interval limited to an hour, and so on. In short, none of this suited me. So, on every page load, I wrote information about the request to the database. As a result, I ended up with about 15,000 request records for which, say, I want to build a graph.
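    In sketch form, such logging could look something like this, assuming the Python db API; the model and field names here are illustrative rather than the actual code:

        from google.appengine.ext import db

        class Request(db.Model):
            # one record per page load
            date = db.DateTimeProperty(auto_now_add=True)   # filled in automatically on put()
            path = db.StringProperty()
            ip = db.StringProperty()
            referer = db.StringProperty()

        def log_request(handler):
            # called from the page handler on every request
            Request(path=handler.request.path,
                    ip=handler.request.remote_addr,
                    referer=handler.request.headers.get('Referer', '')).put()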

    The problem: GAE fundamentally cannot return more than 1000 records per query. This is a consequence of the non-relational data model, and perhaps it does not bother anyone in normal use, but it gets badly in the way of statistics. Another consideration: building complex queries and computing anything directly on the GAE side can be very expensive, since you pay for CPU time and for storing large amounts of data. And since statistics are, generally speaking, “dead” data not needed for the site to function, it is entirely reasonable to pull them off the GAE servers and process them elsewhere; it is even more convenient that way. So I decided to export the statistics as CSV files and work with them locally.

    Exporting the data
    Exporting the data is a task in itself, because GAE cannot select records with an offset. Or rather, it can, but the offset is actually implemented on the client side (on the application side, that is, not in the HTTP client). That is, when I want to get 10 records starting from the 100th, I can do it, and fetch() even has a parameter for exactly this, but in reality all 110 records are pulled from the datastore and the API simply throws away the first hundred for me. Which also means it is simply impossible to get 100 records starting from the 1000th. This is even mentioned in the documentation, though somewhat vaguely.
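    For illustration (a sketch using the same assumed Request model as above):

        # Returns 10 entities, but the datastore actually reads all 110:
        # the first hundred are fetched and silently discarded on the application side.
        rows = Request.all().order('date').fetch(10, offset=100)

        # There is no way to reach records beyond the first thousand this way,
        # which is exactly the limitation described above.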

    The way out is to use not the row number as the offset, as I was used to doing in relational databases, but the date/time. Timestamps in GAE are stored with six decimal places, so the probability of several records getting exactly the same time is extremely small. Strictly speaking, one could create an artificial unique field with monotonically increasing values, but I do not see the need for that.

    All the statistics can be sorted by date and fetched in batches of 1000, each time remembering the date we managed to reach and starting from it the next time. Once the statistics are confirmed to have been exported, they can be deleted. Below I call these fragments of 1000 records pages. You could pick a different number, but 1000 turned out to be no worse than, say, 100, so I settled on 1000.

    The export script takes as a parameter the maximum date (a date proper, with no time part) up to which the data should be exported. There are two reasons for this:
        1) statistics keep arriving, but the data “for yesterday and earlier” will no longer change;
        2) exporting all the statistics at once may simply fail because of the amount of data and the limits on execution time, whereas exporting at least one day at a time will work.

    So, in broad strokes, the algorithm turned out as follows:
    1. select the 1000 oldest records with a query along the lines of: SELECT * FROM Request WHERE date < $date ORDER BY date LIMIT 1000;
    2. generate a CSV record for each row;
    3. remember $last_date, the maximum date received;
    4. run the query again, adding a condition on $last_date: SELECT * FROM Request WHERE date < $date AND date >= $last_date ORDER BY date LIMIT 1000;
    5. while the result of the query is non-empty, go to step 2.
    After receiving the CSV file, you should check ten times over that the statistics have been exported completely, correctly and for the required period, after which they can be deleted from the GAE datastore.

    The comparison with $last_date is non-strict (>=) because, theoretically, the possibility of two different requests getting exactly the same time still exists. To avoid duplicate rows at page junctions, you need to check their unique keys (GAE generates such a key for every object stored in the datastore) and skip a row if it has already been exported.
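    Put together, the export could look roughly like the sketch below (again with the illustrative Request model; a real handler would more likely stream the CSV as the response rather than build it in memory):

        import csv
        import StringIO

        PAGE = 1000   # page size; 1000 records per fetch worked fine

        def export_csv(max_date):
            # Export all Request records with date < max_date as CSV text.
            out = StringIO.StringIO()
            writer = csv.writer(out)
            seen = set()          # keys already written, to drop duplicates at page junctions
            last_date = None
            while True:
                q = Request.all().filter('date <', max_date).order('date')
                if last_date is not None:
                    # non-strict >=, since two requests may share the same timestamp
                    q.filter('date >=', last_date)
                page = q.fetch(PAGE)
                if not page:
                    break
                for r in page:
                    key = str(r.key())
                    if key in seen:
                        continue                  # already exported on the previous page
                    seen.add(key)
                    writer.writerow([r.date, r.path, r.ip])
                last_date = page[-1].date
                if len(page) < PAGE:
                    break                         # last, partially filled page
            return out.getvalue()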

    In the case of bullshitbingo, the data for the day I published the post on Habr took a bit over 20 seconds to export, which is on the verge of a foul. With slightly more data, the export would have to be split not into days but into hours.

    Deleting statistics. This, of course, is a problem again. The documentation assures that it is more efficient to delete records from the datastore in bulk rather than one at a time. But an attempt to simply delete everything older than the given date invariably ends in a timeout. Moreover, when I checked this procedure locally on comparable volumes, it worked slowly, but it worked. I had to rewrite the deletion to go one record at a time: within the 30 seconds allotted to a request, 400-600 records were deleted. Then I rewrote the procedure once more to delete records in batches of 100; the process seemed to speed up, and I had no strength left to figure out exactly what was going on there. It deletes, and fine: in about ten runs everything was cleaned out.
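    The batched variant of the deletion, in sketch form (a keys-only query is enough here, since the entities themselves are not needed to delete them):

        from google.appengine.ext import db

        def delete_exported(max_date, batch=100):
            # Delete already-exported Request records in batches of 100;
            # re-run until everything older than max_date is gone.
            deleted = 0
            while True:
                keys = Request.all(keys_only=True).filter('date <', max_date).fetch(batch)
                if not keys:
                    break
                db.delete(keys)        # one bulk delete per batch
                deleted += len(keys)
            return deleted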

    Part Two: "Habraeffect?"
    There have already been several articles on this subject, for example here; that article is also about GAE, but in Java, and it only gives a picture, whereas I have a fully functional project.

    In general, I would not call this effect anything special either: at the peak the load was 4 requests per second. For no more than an hour the load stayed around two requests per second, after which it declined steadily; all of this is visible on the graphs. Full statistics. The 4-requests-per-second peak is not visible here because of averaging. The time is Moscow time; the post was published at 14:40 and apparently reached the front page about an hour later. GAE stores time in GMT; I did the conversion at the stage of drawing the graphs, although it could probably have been done when loading the CSV files. Separate statistics for the main page. Statistics for the Habr game.
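    The shift itself is trivial wherever it is done; here is a sketch of doing it while loading the CSV (the fixed +3 hour offset and the column layout are assumptions, and daylight saving time is ignored):

        import csv
        import datetime

        MSK_OFFSET = datetime.timedelta(hours=3)   # illustrative offset from GMT

        def load_rows(path):
            # Read the exported CSV and shift timestamps from GMT to Moscow time.
            for date_str, url, ip in csv.reader(open(path)):
                ts = datetime.datetime.strptime(date_str[:19], '%Y-%m-%d %H:%M:%S')
                yield ts + MSK_OFFSET, url, ip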

    Statistics for the next day: by 11 a.m. the bulk of the audience had read down to the second page of the Habr feed. The GAE administrative interface. The consumed resources can also be seen here; the screenshots were taken 10 hours after the post was published. I like my own graphs better. There are also statistics for two days, so you can judge the scale.

    It should be noted that I did no optimization at all. That is, on every request everything that could possibly be selected from the database was selected (all captions, headings, names and descriptions, all the words for the games); about the only thing I did not do was insert artificial idle loops. This was done firstly out of laziness and secondly to see what would happen. As a result, roughly 70% of the resources that GAE provides for free were consumed. The bottleneck turned out to be the CPU; of all the other resources, only 1-2% was used. In addition, several timeout errors occurred when accessing the database, so most of the database work was later moved to memcache.
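    The memcache-backed reads could look something like this sketch (the Game model and the key scheme are illustrative):

        from google.appengine.api import memcache
        from google.appengine.ext import db

        class Game(db.Model):
            # illustrative model: a game's title and its word list
            title = db.StringProperty()
            words = db.StringListProperty()

        def get_game(game_id):
            # Try memcache first and fall back to the datastore on a miss.
            key = 'game:%s' % game_id
            game = memcache.get(key)
            if game is None:
                game = Game.get_by_key_name(game_id)
                if game is not None:
                    memcache.set(key, game, time=600)   # cache for 10 minutes
            return game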

    Some more statistics
    After the post, about 150 people tried to get into the administrative interface, 90 games were created, almost 30 of them were non-empty, and 15 were filled with genuinely meaningful content; greetings to their authors.

    Total number of visitors (by IP address): 6814
    Loaded more than one page: 3573, or 52%
    Loaded more than two pages: 2334, or 34%
    More than ten: 215, or 3%

    Conclusions
    Working with large amounts of data in GAE is not very convenient, but it is quite possible. For real use you would have to write scripts that export the statistics on a schedule, automatically check them for correctness, and then trigger cleanup of the already-exported data on the GAE side. In other words, all this leads to quite noticeable overhead and creates quite definite, though surmountable, difficulties.
