nAbdullin March 26, 2019 at 19:38

How we used deferred replication for disaster recovery with PostgreSQL

Transfer

Replication is not a backup. Or not? Here's how we used deferred replication for recovery by accidentally deleting shortcuts.

Infrastructure specialists at GitLab are responsible for running GitLab.com , the largest instance of GitLab in nature. There are 3 million users and almost 7 million projects, and this is one of the largest open source SaaS sites with a dedicated architecture. Without the PostgreSQL database system, the GitLab.com infrastructure will not go far, and we just don’t do it for fault tolerance in case of any failures when data can be lost. It is unlikely that such a catastrophe will happen, but we are well prepared and stocked up with different backup and replication mechanisms.

Replication is not a database backup tool for you ( see below ). But now we will see how to quickly recover accidentally deleted data using delayed replication: on GitLab.com, the user deleted the shortcut for the project gitlab-ceand lost contact with the merge requests and tasks.

With a delayed replica, we recovered data in just 1.5 hours. See how it was.

Point-in-time recovery with PostgreSQL

PostgreSQL has a built-in function that restores the state of the database at a specific point in time. It is called Point-in-Time Recovery (PITR) and uses the same mechanisms that maintain the relevance of the replica: starting with a reliable snapshot of the entire database cluster (basic backup), we apply a number of state changes until a certain point in time.

To use this function for a cold backup, we regularly make a basic database backup and store it in an archive (GitLab archives live in Google cloud storage ). We also monitor database state changes by archiving the write-ahead log (WAL) log . And with all this, we can perform PITR for disaster recovery: we start with the picture taken before the error and apply the changes from the WAL archive until the failure.

What is deferred replication?

Deferred replication is the application of delayed WAL changes. That is, the transaction occurred in an hour X, but it will appear in the replica with a delay dof one hour X + d.

PostgreSQL has 2 ways to configure the physical replica of the database: restore from archive and streaming replication. Restoring from the archive , in fact, works like PITR, but continuously: we constantly extract the changes from the WAL archive and apply them to the replica. And streaming replication directly retrieves the WAL stream from the upstream database host. We prefer recovery from the archive - it is easier to manage and has normal performance, which does not lag behind the working cluster.

How to set up deferred recovery from archive

Recovery options are described in the file recovery.conf. Example:

standby_mode = 'on'
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 4 "%f" "%p"'
recovery_min_apply_delay = '8h'
recovery_target_timeline = 'latest'

With these parameters, we set up a lazy replica with recovery from the archive. Here wal-e is used to extract WAL ( restore_command) segments from the archive, and the changes will be applied after eight hours ( recovery_min_apply_delay). The replica will monitor changes in the timeline in the archive, for example, due to failover in the cluster ( recovery_target_timeline).

You recovery_min_apply_delaycan set up delayed streaming replication, but there are a couple of tricks that are associated with replication slots, hot spare feedback, and so on. The WAL archive allows you to avoid them.

The parameter recovery_min_apply_delayappeared only in PostgreSQL 9.3. In previous versions, for deferred replication, you need to configure a combination of recovery management functions ( pg_xlog_replay_pause(), pg_xlog_replay_resume()) or keep WAL segments in the archive for the time delay.

How does PostgreSQL do this?

Curious to see how PostgreSQL implements deferred recovery. Let's look at recoveryApplyDelay(XlogReaderState). It is called from the main loop for each entry in the WAL.

static bool
recoveryApplyDelay(XLogReaderState *record)
{
    uint8       xact_info;
    TimestampTz xtime;
    long        secs;
    int         microsecs;
    /* nothing to do if no delay configured */
    if (recovery_min_apply_delay <= 0)
        return false;
    /* no delay is applied on a database not yet consistent */
    if (!reachedConsistency)
        return false;
    /*
     * Is it a COMMIT record?
     *
     * We deliberately choose not to delay aborts since they have no effect on
     * MVCC. We already allow replay of records that don't have a timestamp,
     * so there is already opportunity for issues caused by early conflicts on
     * standbys.
     */
    if (XLogRecGetRmid(record) != RM_XACT_ID)
        return false;
    xact_info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;
    if (xact_info != XLOG_XACT_COMMIT &&
        xact_info != XLOG_XACT_COMMIT_PREPARED)
        return false;
    if (!getRecordTimestamp(record, &xtime))
        return false;
    recoveryDelayUntilTime =
        TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
    /*
     * Exit without arming the latch if it's already past time to apply this
     * record
     */
    TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
                        &secs, µsecs);
    if (secs <= 0 && microsecs <= 0)
        return false;
    while (true)
    {
        // Shortened:
        // Use WaitLatch until we reached recoveryDelayUntilTime
        // and then
        break;
    }
    return true;
}

The bottom line is that the delay is based on the physical time recorded in the transaction commit timestamp ( xtime). As you can see, the delay applies only to commits and does not touch other records - all changes are applied directly, and the commit is delayed, so that we will see the changes only after the delay is configured.

How to use lazy replica to recover data

Let's say we have a database cluster in production and a replica with an eight-hour delay. Let's see how to recover data using the example of accidentally deleting shortcuts .

When we found out about the problem, we paused recovery from the archive for the lazy replica:

SELECT pg_xlog_replay_pause();

With a pause, we had no risk that the replica would repeat the request DELETE. Useful thing if you need time to figure it out.

The bottom line is that the deferred replica must reach the moment before the request DELETE. We roughly knew the physical time of removal. We removed recovery_min_apply_delayand added recovery_target_timeto recovery.conf. So the replica reaches the right moment without delay:

recovery_target_time = '2018-10-12 09:25:00+00'

With time stamps, it is better to reduce the excess so as not to miss. True, the larger the decrease, the more data we lose. Again, if we skip the request DELETE, everything will be deleted again and you will have to start all over again (or even take a cold backup for PITR).

We restarted the deferred instance of Postgres, and the WAL segments were repeated until the specified time. You can track the progress at this stage by request:

SELECT
  -- current location in WAL
  pg_last_xlog_replay_location(),
  -- current transaction timestamp (state of the replica)
  pg_last_xact_replay_timestamp(),
  -- current physical time
  now(),
  -- the amount of time still to be applied until recovery_target_time has been reached
  '2018-10-12 09:25:00+00'::timestamptz - pg_last_xact_replay_timestamp() as delay;

If the time stamp does not change anymore, recovery is complete. You can configure the action recovery_target_actionto close, advance, or pause an instance after retry (by default, it pauses).

The database came to a state before that ill-fated request. Now you can, for example, export data. We exported the deleted data about the shortcut and all the connections with tasks and merge requests and transferred them to the working database. If the losses are large-scale, you can simply promote the replica and use it as the main one. But then all changes will be lost after the moment to which we have recovered.

Instead of timestamps, it is better to use transaction IDs. It is useful to write these IDs, for example, for DDL statements (of type DROP TABLE), with log_statements = 'ddl'. If we had a transaction ID, we would take recovery_target_xidand run everything up to the transaction before the request DELETE.

Getting back to work is very simple: remove all changes from recovery.confand restart Postgres. Soon, an eight-hour delay will appear in the replica again, and we are ready for future troubles.

Recovery Benefits

With a delayed replica, instead of a cold backup, you do not have to restore the entire image from the archive for hours. For example, we need five hours to get the entire basic 2 TB backup. And then you still have to apply the entire daily WAL to recover to the desired state (in the worst case).

A delayed replica is better than a cold backup in two ways:

No need to get the entire base backup from the archive.
There is a fixed eight-hour window of WAL segments that need to be repeated.

We also constantly check whether it is possible to make PITR from WAL, and we would quickly notice damage or other problems with the WAL archive, monitoring the lag of the delayed replica.

In this example, it took us 50 minutes to recover, that is, the speed was 110 GB of WAL data per hour (the archive was still on AWS S3 then ). In total, we solved the problem and restored the data in 1.5 hours.

Bottom line: where the delayed replica comes in handy (and where not)

Use deferred replication as a first aid tool if you accidentally lose data and notice this disaster within the configured delay.

But keep in mind: replication is not a backup.

У бэкапа и репликации разные цели. Холодный бэкап пригодится, если вы случайно сделали DELETE или DROP TABLE. Мы делаем бэкап из холодного хранилища и восстанавливаем предыдущее состояние таблицы или всей базы данных. Но при этом запрос DROP TABLE почти моментально воспроизводится во всех репликах на рабочем кластере, поэтому обычная репликация тут не спасет. Сама по себе репликация поддерживает базу данных доступной, когда сдают отдельные серверы, и распределяет нагрузку.

Даже с отложенной репликой нам иногда очень нужен холодных бэкап в безопасном месте, если вдруг произойдет сбой дата-центра, скрытое повреждение или другие события, которые сразу не заметишь. Тут от одной репликации толку нет.

Note . At GitLab.com, we now protect against data loss only at the system level and do not restore data at the user level.

Tags: