GitLab PostgreSQL postmortem

    On January 31, 2017, GitLab had an incident with its PostgreSQL DBMS, as a result of which part of the data was deleted and the project was taken down for the duration of the recovery. Several months have passed, a lot has been written on the topic, and GitLab itself published an exhaustive post-mortem describing what happened, what was done to recover, and what measures will be taken to prevent such accidents. It is a very instructive read, and we recommend it even to those who are far from Postgres.



    In the comments to our interview with Alexei Lesovsky, some members of the community jokingly complained that we mentioned the GitLab incident but never did a detailed debriefing. We decided to make up for it and asked Alexei to write a short analysis. The main purpose of this publication is to go through the post-mortem in detail, highlight the key points, try to analyze them, and offer recommendations on how to act in a similar situation. And, of course, to look at the measures the GitLab team plans to take to prevent such incidents in the future.

    The list of key points in chronological order that we pay attention to is as follows:

    1. LVM snapshots as a means of transferring a database from production to staging.
    2. What to do when the replica falls behind the master?
    3. Doing pg_basebackup correctly, part 1.
    4. Running out of max_wal_senders, but how?
    5. max_connections = 8000.
    6. “Hanging” pg_basebackup, or doing pg_basebackup correctly, part 2.
    7. strace: what is good for the sysadmin is not always what the DBA needs.
    8. rm or not rm: what to do with sensitive data?
    9. Backup methods.
    10. Different versions of Postgres on production.
    11. Broken mail in cron.

    Let's start the debriefing. At the end there will be a second list, in which we briefly go over the measures that have been taken, or will be taken, to avoid a recurrence of similar incidents.

    1. LVM snapshots as a means of transferring a database from production to staging.

    The task of refreshing a staging environment is quite common, so well-established practices and solutions have long existed. One such solution is to take a snapshot and transfer it to staging. The solution is quite convenient and easy to set up, but it has several drawbacks. Most snapshot implementations are based on copy-on-write, so for as long as a snapshot exists there is additional load on the disk storage. This mostly matters for write-intensive workloads: if latency is important to you, this can be a problem.

    A more correct way to transfer a database from production to staging is via backups. There is a regularly scheduled backup, and staging is updated from it. With this approach no extra load is created by snapshots, and the backups get an indirect restore check: if staging fails to update, it is a clear sign that something is wrong with the backups.

    2. What to do when the replica falls behind the master?

    Unfortunately, the post-mortem does not mention how the engineers discovered the replication lag, but, given that the replica eventually fell behind irrecoverably, we can conclude that they noticed it too late. What can be recommended here? Monitoring of replication lag, ideally with alerting. But even a quick-and-dirty check will do: a cron script that measures the lag and then sends an email, an SMS, or anything else.

    When using streaming replication, lag monitoring is one of those things that should be set up right after configuring replication. There are many reasons why a replica may start to lag. For monitoring replication, Postgres provides the pg_stat_replication view; it is a very handy thing and, with some experience, it even lets you determine the cause of the lag: network, disks, long-running queries, and so on.
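
    As an illustration of such a quick-and-dirty cron check, here is a minimal sketch in Python. It is an assumption-laden example, not GitLab's setup: the connection string, threshold and e-mail addresses are placeholders, and the query uses the PostgreSQL 9.6 names (pg_current_xlog_location, replay_location); on version 10 and later they are pg_current_wal_lsn and replay_lsn.

        # check_replication_lag.py: run from cron on the master, alert by e-mail
        # when any standby in pg_stat_replication is too far behind.
        import smtplib
        from email.message import EmailMessage

        import psycopg2

        LAG_THRESHOLD_BYTES = 512 * 1024 * 1024   # alert when a standby is > 512 MB behind

        conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres")
        with conn, conn.cursor() as cur:
            cur.execute("""
                SELECT application_name, client_addr::text, state,
                       pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
                FROM pg_stat_replication
            """)
            lagging = [row for row in cur.fetchall()
                       if row[3] is not None and row[3] > LAG_THRESHOLD_BYTES]

        if lagging:
            msg = EmailMessage()
            msg["Subject"] = "replication lag alert"
            msg["From"] = "monitoring@example.com"
            msg["To"] = "dba@example.com"
            msg.set_content("\n".join("%s (%s, %s): %s bytes behind" % row for row in lagging))
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)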

    If you find that the replica is lagging and you do not know how soon it will fall behind irrecoverably, a temporary solution is to increase the wal_keep_segments parameter. This parameter can be changed without restarting Postgres, with an ordinary reload (see pg_reload_conf()), but keep in mind that the price may be increased disk usage: one segment is 16 MB. Once the parameter has been raised, you can start looking for and eliminating the cause of the lag.
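
    The temporary fix itself is short; here is a sketch, assuming a superuser connection (ALTER SYSTEM appeared in 9.4, wal_keep_segments was replaced by wal_keep_size in PostgreSQL 13, and the value 1000, roughly 16 GB of WAL, is purely illustrative):

        import psycopg2

        conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres")
        conn.autocommit = True   # ALTER SYSTEM cannot run inside a transaction block
        with conn.cursor() as cur:
            cur.execute("ALTER SYSTEM SET wal_keep_segments = 1000")   # keep ~16 GB of WAL
            cur.execute("SELECT pg_reload_conf()")   # apply without restarting Postgres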

    It should also be noted that the team admitted they were not using WAL archiving, which was another mistake: it would have helped to restore the replica. But, in general, the role of a WAL archive is not limited to protecting the replica. If you back up with pg_basebackup, keeping WAL archives lets you manage recovery flexibly (see Point-in-Time Recovery).

    3. Doing pg_basebackup correctly, part 1.

    As we know from the post-mortem, before starting pg_basebackup the engineer cleared the directory into which the base backup was to be copied. Indeed, pg_basebackup exits immediately if it detects that the target directory is not empty. However, when dealing with sensitive data, it is always better to use mv instead of rm. The days of small disks are long gone, and this rule can make life much easier. It applies not only to directories, but also to tables in databases, databases renamed with an _old suffix, and much more. Although this does not work for terabyte-scale databases, it is always better to rename the object and, after a couple of weeks, having made sure that nobody needs it, quietly delete it, than to delete it right away.
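
    A tiny sketch of this habit, with a purely hypothetical path; the point is simply that a rename on the same filesystem is instant and reversible, unlike rm:

        import datetime
        import os

        victim = "/var/lib/postgresql/9.6/main.broken"   # hypothetical directory we "no longer need"
        suffix = datetime.date.today().strftime("%Y%m%d")
        os.rename(victim, victim + ".old." + suffix)      # park it; actually delete it weeks later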

    4. Running out of max_wal_senders, but how?

    When pg_basebackup tried to connect, an error was raised saying that the max_wal_senders limit had been exceeded. This parameter limits the number of concurrent replication connections. If you remember, GitLab had only one replica, and that one had already fallen behind irrecoverably, so all of those connections should have been free. Yet the engineering team hit this error, and the only plausible explanation is that several pg_basebackup processes were started at the same time.

    In general, this is pointless: if the first pg_basebackup does not work, why would a second one? But if you do run into this error, it is worth remembering that changing max_wal_senders requires a Postgres restart (which is often unacceptable); it would have been better to find out who was holding the connections (and these were the waiting pg_basebackup processes) and stop them. Why raise the limit at all? To run even more concurrent pg_basebackup processes?
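
    For illustration, a sketch of how one might look at who is occupying the walsender slots instead of raising the limit; a superuser connection is assumed, and pg_basebackup sessions show up in pg_stat_replication with state 'backup':

        import psycopg2

        conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres")
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute("SELECT pid, application_name, client_addr, state FROM pg_stat_replication")
            for pid, app, addr, state in cur.fetchall():
                print(pid, app, addr, state)
            # Having identified the stuck sessions (e.g. abandoned pg_basebackup runs),
            # free the slots instead of raising max_wal_senders and restarting Postgres:
            # cur.execute("SELECT pg_terminate_backend(%s)", (stuck_pid,))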

    5. max_connections = 8000

    The next point is the connection limit set to 8000. That is a very large value for this setting. Postgres spawns a separate backend process for every client connection. This works well as long as the number of actively working backends does not exceed the number of CPU cores in the system; beyond that, performance starts to decline as the number of processes keeps growing.

    To avoid keeping a large number of connections to Postgres, pgbouncer was invented: it lets you compress a large number of client connections into a relatively small number of connections to Postgres. For example, those same 8000 client connections can be terminated at pgbouncer, which will open connections to Postgres only as needed. Using pgbouncer would potentially have helped avoid wasting time on the error that occurred and on an extra Postgres restart.

    6. “Hanging” pg_basebackup, or doing pg_basebackup correctly, part 2.

    Moving on: the connection limit was fixed, but now we need to figure out why pg_basebackup hangs. The reason for this apparent hang lies in how pg_basebackup works: before it starts copying the data, it has to wait for a checkpoint to complete. A checkpoint is a procedure that can put a heavy load on Postgres and make performance drop noticeably. Therefore, so as not to stress the system, pg_basebackup by default waits for the next regular checkpoint to finish, whose timing and speed depend on the Postgres settings. If the -c fast option is given, pg_basebackup requests an immediate checkpoint and starts without delay.
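
    For illustration, a hedged sketch of such an invocation; the host, user and target directory are placeholders and not GitLab's actual layout:

        import subprocess

        subprocess.run([
            "pg_basebackup",
            "-h", "db1.example.com",                # copy from the master
            "-U", "replication_user",
            "-D", "/var/lib/postgresql/9.6/main",   # target directory, must be empty
            "-X", "stream",                         # stream WAL alongside the data copy
            "--checkpoint=fast",                    # same as '-c fast': do not wait for the regular checkpoint
            "-P", "-v",                             # progress reporting and verbose messages
        ], check=True)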

    As you might guess, while waiting for the checkpoint to finish, pg_basebackup does not print any messages, and this played a cruel joke on the engineer. However, thanks to the GitLab team, Postgres will now get better in this respect.

    7. strace: what is good for the sysadmin is not always what the DBA needs.

    Next, the engineers tried to find out why pg_basebackup was hanging, and used strace for that. From my own experience I can say that strace is not quite the right tool here. strace is needed when the process under investigation fails with errors or misbehaves at particular moments. That a process is “hanging” can be seen with the top utility: if we see 0 in the %CPU field, the process is most likely waiting for something.

    To investigate the causes of such “hangs” we need the stack trace of the process, to understand which function it is currently sitting in. /proc/<pid>/stack is suitable for this and, with some experience, so is gdb. Unfortunately, Postgres itself does not provide a way to see that the process is waiting for the checkpoint before starting the copy, so engineers simply have to know about this peculiarity of pg_basebackup.
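
    A small sketch of that approach, with a made-up PID; note that reading /proc/<pid>/stack normally requires root:

        import pathlib

        PID = 12345   # hypothetical pid of the "hung" pg_basebackup

        # Process state: 'S' (sleeping) together with ~0% CPU in top usually means
        # the process is simply waiting for something, not spinning or broken.
        status = pathlib.Path("/proc/%d/status" % PID).read_text()
        print(next(line for line in status.splitlines() if line.startswith("State:")))

        # Kernel-side stack: shows the syscall or kernel function the process is blocked in.
        print(pathlib.Path("/proc/%d/stack" % PID).read_text())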

    8. rm or not rm: what to do with sensitive data?

    We have come to the moment when the database directory on the master was deleted. In fact, this point repeats what was said above: try to use mv whenever possible, and use rm only after you have made sure that nobody misses the supposedly unneeded data.

    9. Backup methods.

    So, the accident has occurred, and you need to perform recovery from backups. Let's see what backup methods the team relied on:

    • daily pg_dump: a rather crude method that allows recovery only to a single point in time, namely the moment pg_dump was started;
    • daily LVM snapshot - there are no complaints about this method, provided that the database does not experience write-intensive loads;
    • daily Azure disk snapshot: nothing to say here, I have not had to use it;
    • replication from the master: replication is not considered to be a backup, since accidental deletion of data is simply replicated to the replica. You can use replication with delayed recovery (see recovery_min_apply_delay), but in this situation you need to have an understanding of how to recover if something was accidentally deleted.

    One way or another, the general recommendation is: try not to use replicas as backups; there are dedicated tools for backup. The most preferable approach is pg_basebackup plus WAL archiving. Its main advantage is the ability to restore data to a specific point in time or transaction ID within the retained archive. This solution has a drawback, however: storage requirements. Each base backup is a full copy of the instance, and the size of the archive depends on the amount of write activity. Therefore, when setting up backups, you should plan the total size of the backup storage, taking into account the retention period in days.
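
    As a rough illustration of that sizing exercise, a back-of-the-envelope calculation with entirely made-up numbers; substitute your own base backup size, WAL volume and retention policy:

        base_backup_gb = 300      # size of one full pg_basebackup copy
        wal_per_day_gb = 50       # WAL generated by the cluster per day
        retention_days = 14       # how far back point-in-time recovery must reach
        base_backups_kept = 2     # e.g. one fresh weekly base backup plus the previous one

        storage_gb = base_backups_kept * base_backup_gb + retention_days * wal_per_day_gb
        print("roughly %d GB of backup storage needed" % storage_gb)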

    At the moment there are several products that implement this backup model:

    1. Barman, if you have your own servers.
    2. WAL-E if your servers are on Amazon.

    The last thing to add is the need to monitor that: 1) backups complete successfully; 2) backups exist at all; 3) there are results of a test restore from a backup.
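
    The first two checks are easy to script; a sketch assuming one directory per base backup under a hypothetical /backups/postgresql (the third check, a successful test restore, means actually restoring the newest copy, for example onto staging, and recording the result):

        import datetime
        import pathlib
        import sys

        BACKUP_ROOT = pathlib.Path("/backups/postgresql")   # hypothetical layout
        MAX_AGE = datetime.timedelta(hours=26)               # daily backups plus some slack

        backups = [p for p in BACKUP_ROOT.iterdir() if p.is_dir()]
        if not backups:
            sys.exit("CRITICAL: no backups exist at all")

        newest = max(backups, key=lambda p: p.stat().st_mtime)
        age = datetime.datetime.now() - datetime.datetime.fromtimestamp(newest.stat().st_mtime)
        if age > MAX_AGE:
            sys.exit("CRITICAL: newest backup %s is %s old" % (newest.name, age))
        print("OK: %d backups, newest is %s old" % (len(backups), age))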

    10. Different versions of Postgres on production.

    The next point worth criticizing is the presence of different Postgres versions on production. In GitLab's case it was exactly this that made it impossible to take a backup. Therefore, any plan for upgrading to the next major version should include a step to remove the old Postgres packages.

    11. Broken mail in cron.

    It seems to me (and many may disagree) that email is becoming a rather rare notification mechanism. Nevertheless, whatever mechanism you use, you should always be sure that it actually works. In my practice there was a case where the monitoring system that notified the team by SMS also sent a test SMS to the on-duty administrator every night, to check that everything was fine with the mobile operator and the account balance. If the test failed, email spam was triggered and a flag lit up in the web interface; and it really helped.

    In GitLab's case, mail for cron was simply never configured. If you put something important into cron, think about notifications.
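
    A sketch of such a wrapper for cron jobs, with a placeholder command and addresses; the idea is simply that a failed job must turn into a notification through a channel that is known to work:

        import smtplib
        import subprocess
        from email.message import EmailMessage

        result = subprocess.run(["/usr/local/bin/make_backup.sh"],   # the real cron job, placeholder name
                                capture_output=True, text=True)
        if result.returncode != 0:
            msg = EmailMessage()
            msg["Subject"] = "cron job failed with code %d" % result.returncode
            msg["From"] = "cron@example.com"
            msg["To"] = "oncall@example.com"
            msg.set_content(result.stdout + "\n" + result.stderr)
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)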

    As a result, the database was restored from the LVM snapshot; copying took 18 hours, which is a very long time. It so happened that the snapshot was on staging, and that turned out to be the most acceptable option. All of this is the consequence of an unfortunate backup strategy (once every 24 hours) and a combination of circumstances. The takeaway should be a more thoughtful approach to backups, with more flexible recovery options that guarantee minimal data loss.

    The GitLab team posted information on the measures that should be taken to avoid similar events in the future. Let's take a quick look at them too.

    1. Update PS1 across all hosts to more clearly differentiate between hosts and environments (#1094).

    It looks quite reasonable. Customizing the command line prompt is a fairly common and good practice.

    2. Prometheus monitoring for backups (#1095).

    This measure also looks very reasonable; however, monitoring should be set up after the entire backup system has been reviewed and rebuilt. Monitoring should cover not only backup age and size, but also the number of available backups, as well as information about which copies have been successfully verified by a test restore.
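
    One possible way to feed such numbers into Prometheus is the node_exporter textfile collector; a sketch with assumed paths, metric names and backup layout:

        import pathlib
        import time

        BACKUP_ROOT = pathlib.Path("/backups/postgresql")                          # hypothetical layout
        TEXTFILE_DIR = pathlib.Path("/var/lib/node_exporter/textfile_collector")   # node_exporter --collector.textfile.directory

        backups = [p for p in BACKUP_ROOT.iterdir() if p.is_dir()]
        newest_mtime = max((p.stat().st_mtime for p in backups), default=0)

        metrics = (
            "pg_backup_count %d\n" % len(backups)
            + "pg_backup_last_timestamp_seconds %f\n" % newest_mtime
            + "pg_backup_last_age_seconds %f\n" % (time.time() - newest_mtime)
        )
        (TEXTFILE_DIR / "pg_backup.prom").write_text(metrics)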

    3. Set PostgreSQL's max_connections to a sane value (#1096).

    Even 800 is too much; I would recommend looking at pgbouncer and trying to use it. However, it imposes its own limitations on the application that works through it, so it needs to be tested beforehand.

    4. Investigate Point in time recovery & continuous archiving for PostgreSQL (#1097).

    Yes, this must be done.

    5. Hourly LVM snapshots of the production databases (#1098).

    A dubious idea, given that the snapshots were used to update staging. For that purpose, the backups made as part of #1097 are quite sufficient.

    6. Azure disk snapshots of production databases (#1099).

    It’s hard to comment, but if you want to duplicate the main backup functionality, then why not?

    7. Move staging to the ARM environment (#1100).

    The task looks strange. Restoring production from staging, where all the data may have been substantially changed, is a dubious undertaking; in the end, such a recovery would only make things worse.

    8. Recover production replica(s) (#1101).

    This, of course, needs to be done ASAP. Any more or less serious Postgres installation these days is deployed with at least one streaming replica.

    9. Automated testing of recovering PostgreSQL database backups (#1102).

    Yes, this is a very necessary thing, it is definitely needed.

    10. Improve PostgreSQL replication documentation/runbooks (#1103).

    Yes, the runbooks need to be updated so that even the night watchman from the warehouse next door could recover from them.

    11. Investigate pgbarman for creating PostgreSQL backups (#1105).

    Yes, Barman is definitely a good tool, but considering that GitLab uses Amazon S3, WAL-E looks preferable. However, the Barman developers do not rule out this possibility and even suggest sponsoring Amazon S3 support in Barman.

    12. Investigate using WAL-E as a means of Database Backup and Realtime Replication (#494).

    Yes, it is a reliable thing: despite the considerable number of issues in its tracker, it works stably and without complaints.

    13. Build Streaming Database Restore.

    Yes, exactly like that: there are backups, and restoring them is verified by recovering onto staging environments.

    14. Assign an owner for data durability.

    A responsible person is definitely needed. In addition, it would be good for everyone to read the runbooks produced as part of #1103.



    Many thanks to Alexei for the detailed analysis. Be sure to leave your comments and thoughts about the GitLab incident and the measures taken! And, of course, we invite everyone to discuss this topic with Alexei in St. Petersburg at PG Day Russia this summer! :)