CouchDB: The Story of One Crash
I want to share the story of how our project went down for an hour and a half, and what we learned while tracking down the cause.
At some point we realized that part of the site was loading with a 15-minute delay, while the rest did not work at all and returned a 504 error.
Attention! Since people like to show off their cleverness and do not like to read, I will state it here: the purpose of this post is to suggest a way out of an emergency. Everything else is just background, which for some reason is what everyone focuses on.
I work on a project that uses CouchDB as its database. The site has a “Poster” section where users can add events. In particular, a user can add a recurring event by setting the start and end dates of its period.
When an event is added, an event document is created in the database, and a time interval for each day of the period is written to a separate field of that document. A view over these intervals produces the listing shown on the site; in effect, it simply selects the time intervals from all documents.
So an event added for 7 days gives us a document with 7 records in the period field, and 7 records are displayed.
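To make the growth concrete, here is a minimal sketch of how such a document might be built. Only the period field name comes from the description above; the function name, the other field names, the interval format, and the dates are assumptions made for illustration, not the project's actual code.

```python
from datetime import date, datetime, time, timedelta


def build_event_document(title, start, end,
                         starts_at=time(19, 0), ends_at=time(21, 0)):
    """Return an event document with one [begin, end] interval per day.

    Hypothetical layout: "period" holds one pair of ISO timestamps
    for every day from start to end, inclusive.
    """
    if end < start:
        raise ValueError("end date must not precede start date")
    periods = []
    day = start
    while day <= end:
        periods.append([
            datetime.combine(day, starts_at).isoformat(),
            datetime.combine(day, ends_at).isoformat(),
        ])
        day += timedelta(days=1)
    return {"type": "event", "title": title, "period": periods}


# A one-week event yields 7 intervals.
week = build_event_document("Concert", date(2012, 3, 1), date(2012, 3, 7))
print(len(week["period"]))

# An event running until 2100 yields tens of thousands of intervals
# in a single document (the start date here is made up).
century = build_event_document("Prank", date(2012, 3, 1), date(2100, 3, 1))
print(len(century["period"]))
```

With this layout the document size grows linearly with the length of the period, and every interval becomes a row that the view has to process.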
Fail
The server had no check on the maximum period of an event. For some reason nobody provided for this, probably in the hope that only users with paid accounts would add events, and that such users would behave responsibly.
Dirty user
A user with a paid account shows up and, just for fun, adds an event whose end date is in the year 2100.
Php-fpm on the server gets busy adding roughly 365 * 100 time intervals. It did add something, but the user never saw a message confirming success. Probably deciding that something was buggy or that the connection had dropped, the user clicked to add the event again, changing the event time slightly, and the process started a second time. Php-fpm did not create any serious load, but the top command showed more php-fpm processes than usual, which was confusing and sent me in the wrong direction for a while.
As a result, we have 2 documents in the database, each with about 365 * 100 time intervals. CouchDB starts updating the view and never returns a result.
In the server logs there was something like:
[<0.738.0>] Exit from linked
When trying to open the database in Futon, we got an os_process_error error. The Status section in Futon showed an entry that never went away, with a note that it belonged to the event database (see line 1).
My first thought was that something was buggy or that the database was corrupted, but service couchdb restart did not help, and neither did replacing the database on the server with the latest copy replicated from another server. After some googling, the answer turned up in the CouchDB mailing list archive: the view update was running into os_process_timeout = 5000 (5 seconds). The view simply could not process a document in the time allotted to it. After raising the value in the config to 15 seconds, the changes were finally applied and the site started working normally.
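For reference, os_process_timeout lives in the [couchdb] section of the configuration, and on CouchDB 1.x (the Futon era) it can also be changed through the HTTP _config endpoint. The sketch below only builds such a request. The server address is an assumption, the helper name is hypothetical, and it presumes a server without admin credentials; newer CouchDB releases expose configuration under a different path.

```python
import json
import urllib.request


def build_timeout_request(base_url, timeout_ms):
    """Build a PUT request that sets couchdb/os_process_timeout.

    Assumes the CouchDB 1.x endpoint /_config/{section}/{key}, where
    the new value is sent as a JSON string in the request body.
    """
    url = base_url.rstrip("/") + "/_config/couchdb/os_process_timeout"
    body = json.dumps(str(timeout_ms)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# 15 seconds, expressed in milliseconds; the address is hypothetical.
request = build_timeout_request("http://localhost:5984", 15000)
print(request.get_method(), request.full_url, request.data)

# To actually apply the change on a live server:
# urllib.request.urlopen(request)
```

Editing the value in the config file and restarting CouchDB achieves the same thing; the HTTP route is merely convenient when the server is still responding.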
Once we understood why the site was not loading and returning a 504 error, and had sorted out the database, we could finally reconstruct the whole scenario and take measures to prevent it from happening again.
By the way, I had to delete the 2 offending documents with a quickly written script, because the browser simply refused to open them in Futon: it froze solid, evidently trying to render the array of time intervals.
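The original script is not shown here, so the following is only a sketch of how a document can be deleted without opening it in a browser. It relies on two standard CouchDB HTTP calls: HEAD on the document returns its current revision in the ETag header, and DELETE with ?rev= removes it. The server address, database name, and document ID are placeholders.

```python
import urllib.parse
import urllib.request


def rev_from_etag(etag):
    """Extract the revision from an ETag header value such as '"2-abc"'."""
    if etag.startswith("W/"):
        etag = etag[2:]
    return etag.strip('"')


def doc_url(base_url, db, doc_id):
    """Build the document URL, escaping the database name and document ID."""
    return "{}/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(db, safe=""),
        urllib.parse.quote(doc_id, safe=""),
    )


def build_delete_request(base_url, db, doc_id, rev):
    """Build a DELETE request for one specific revision of a document."""
    query = urllib.parse.urlencode({"rev": rev})
    return urllib.request.Request(
        doc_url(base_url, db, doc_id) + "?" + query, method="DELETE"
    )


def delete_document(base_url, db, doc_id):
    """Look up the current revision with HEAD, then delete the document."""
    head = urllib.request.Request(doc_url(base_url, db, doc_id), method="HEAD")
    with urllib.request.urlopen(head) as response:
        rev = rev_from_etag(response.headers["ETag"])
    delete = build_delete_request(base_url, db, doc_id, rev)
    with urllib.request.urlopen(delete) as response:
        return response.status


# Hypothetical usage against a live server:
# delete_document("http://localhost:5984", "events", "some-document-id")
```

Using HEAD rather than GET matters in this situation, because it avoids downloading the oversized document body just to learn its revision.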
The sequence of events was reconstructed in roughly the reverse order of my narrative, which made me quite nervous, because it was the first time I had to deal with something like this myself. I have to give credit to my managers, who did not run around demanding that the site be fixed immediately and calmly let me work through the problem.
From all this, a few conclusions suggest themselves:
- do not store a lot of data in one document, especially if it is an array that a view selects on. Whether many small documents are better than a few large ones is still debatable, but in our case overly large documents did not show their best side;
- if opening the database produces an os_process_error error and the word timeout appears in the logs, try increasing os_process_timeout in the config. This lets the database process the latest changes, start working again, and return results;
- a golden rule that, alas, we did not follow in this particular case: validate the data entered by the user, and think like a cunning trickster of a user.
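The last point could look roughly like the server-side check below. This is a sketch, not the fix we actually deployed: the function name and the 31-day limit are made up for illustration, and the real limit depends on what the site considers a reasonable event.

```python
from datetime import date

# Hypothetical upper bound on the length of an event, in days.
MAX_EVENT_DAYS = 31


def validate_event_period(start, end, max_days=MAX_EVENT_DAYS):
    """Return the number of days in the period, or raise ValueError.

    Both boundary days are counted, so a period from the 1st to the
    7th of a month is 7 days long.
    """
    if end < start:
        raise ValueError("end date must not precede start date")
    days = (end - start).days + 1
    if days > max_days:
        raise ValueError(
            "event period of {} days exceeds the limit of {}".format(days, max_days)
        )
    return days


print(validate_event_period(date(2012, 3, 1), date(2012, 3, 7)))

try:
    validate_event_period(date(2012, 3, 1), date(2100, 1, 1))
except ValueError as error:
    print("rejected:", error)
```

The check has to run on the server before any intervals are generated; validating only in the browser would not stop a determined user.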
I hope this post will spare those who run into this for the first time, as we did, from agonizing googling. Our project went down, alas, during peak traffic; I hope that does not happen again.
P.S. A useful article on Habr: 16 practical tips for working with CouchDB.