Disaster Recovery Implementation

    Sergey Burladyan (Avito)



    Hello everyone, my name is Sergey Burladyan, I work at Avito as a database administrator. I work with the following systems:



    This is our central database, 2 TB in size, on 4 servers: 1 master and 3 standbys. We also have londiste-based logical replication (that is from Skytools), an external sphinx index, and various exports to external systems, such as DWH. We also have our own development for remote procedure calls, the so-called xrpc. And there is a storage of 16 databases. One more number: our backup takes 6 hours, and restoring it takes about 12. I would like the downtime of our site after failures of any of these systems to be no more than 10 minutes.

    If you try to picture the various connections between these systems, they look roughly like this:



    How can we avoid losing all of this in a failure?

    What can a failure look like?



    I mainly consider failures where a server is lost; for the master there is an additional failure mode: data corruption.


    Let's get started.



    Suppose some administrator mistakenly ran an UPDATE without a WHERE clause. We have had such cases several times. How do we protect ourselves against this? We keep a standby that applies WALs with a 12-hour delay. When such a failure happened, we took the data from that standby and loaded it back into the master.
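
    The talk does not show the exact mechanism of the delayed standby; a minimal sketch of one way to build it on 9.2 (where recovery_min_apply_delay does not exist yet) is a restore_command wrapper that refuses to hand over a WAL segment until it is at least 12 hours old. The archive path and script name below are illustrative, not our production script:

```python
#!/usr/bin/env python
# delayed_restore.py - hypothetical restore_command wrapper for a delayed standby.
# recovery.conf: restore_command = 'delayed_restore.py /wal_archive/%f %p'
# A non-zero exit simply makes the standby retry later, so "too fresh" WALs wait.
import os, shutil, sys, time

DELAY = 12 * 3600  # 12 hours, in seconds

def main(archived_wal, target_path):
    if not os.path.exists(archived_wal):
        return 1                       # not archived yet, retry later
    if time.time() - os.path.getmtime(archived_wal) < DELAY:
        return 1                       # too fresh, keep the 12-hour delay
    shutil.copy(archived_wal, target_path)
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```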



    The second failure that can happen to the master is losing the server. We use asynchronous replication, and after losing the server we have to promote one of the standbys. And since replication is asynchronous, we have to run various procedures to bring the connected systems back in sync. Our master is central, it is the data source, so when it switches over and replication is asynchronous, we lose some transactions, and part of the connected systems ends up in a future that is unreachable for the new master.



    Doing all of this by hand is hard, so you need a script right away. What does the failure look like? In external systems there are ads that no longer exist on the master; sphinx returns non-existent ads in search results; sequences have jumped backwards; and logical replicas (londiste) have stopped working, partly because of this.



    But it is not all that bad; all of this can be restored. We sat down, thought it over, and planned the recovery procedure. In particular, we can simply re-export to DWH. And since our downtime is only 10 minutes, the loss of those items is simply not visible in the monthly reports.

    How do we restore xrpc? We use xrpc for geocoding, for calling asynchronous procedures on the master, and for computing user karma. Accordingly, if we geocoded something, i.e. turned an address into coordinates on the map, and then that address disappeared, it is fine that it stays geocoded; we simply will not geocode the same address a second time, so nothing needs to be restored. The asynchronous local procedure calls are local: they live on one database server, even in one database, so after the database switchover they are consistent, and nothing needs to be restored there either. As for user karma, we decided that if a user did something bad, and then a failure happened and we lost those bad items, user karma does not have to be restored either: the user did those bad things anyway.



    The site sphinx. We have two sphinx instances: one for the site, the other for the backoffice. The site sphinx is implemented so that it completely rebuilds its entire index every 10 minutes. Accordingly, a failure happens, we recover, and after 10 minutes the index is fully rebuilt and matches the master. For the backoffice one we decided it is also not critical: we can re-upload the part of the ads that changed after the recovery, and in addition we completely rebuild the whole backoffice sphinx once a month, so all of these post-failure leftovers get cleaned out.

    How do we restore sequences so that they do not jump backwards? We simply picked the sequences that matter to us, such as item_id, user_id, and the payment primary key, and after a failure we advance them by 100 thousand (we decided that would be enough).
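
    A minimal sketch of that step (the sequence names are just examples; the 100 thousand margin is the one mentioned above):

```python
# advance_sequences.py - hypothetical sketch of the "move sequences forward" step
import psycopg2

IMPORTANT_SEQUENCES = ["item_id_seq", "user_id_seq", "payment_id_seq"]  # examples
MARGIN = 100000

conn = psycopg2.connect("host=new-master dbname=site")
cur = conn.cursor()
for seq in IMPORTANT_SEQUENCES:
    # sequence names come from our own fixed list, so plain interpolation is acceptable here
    cur.execute("SELECT last_value FROM %s" % seq)
    last_value = cur.fetchone()[0]
    cur.execute("SELECT setval(%s, %s)", (seq, last_value + MARGIN))
conn.commit()
```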

    We restore logical replication with our own tool: a patch for londiste that implements UNDO for the logical replica.



    The Undo patch adds three commands: the undo command itself plus two commands to add and remove Undo on a logical replica. We also added a flag to londiste replay so that it passes the TICK_ID from the master into a Postgres session variable.



    This is needed for the Undo implementation itself, because Undo is implemented simply as triggers on all subscriber tables. The trigger writes into a history table which operation happened in the target table, and it records the tick_id passed from the master in that entry. Accordingly, when a failure happens, the logical replica is in the future, and it has to be cleaned up: the changes from that unreachable future must be rolled back. This is done by executing inverse queries: for an insert we do a delete, for an update we do an update with the previous values, and for a delete we do an insert.
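
    Roughly, the inverse queries look like this; the sketch below only illustrates the idea, with hypothetical history fields and no proper SQL quoting:

```python
# Sketch of the Undo idea: for every history entry newer than the new master's
# last tick, build the inverse statement and execute it, walking the history backwards.
def inverse_sql(entry):
    table, op, old, new = entry["table"], entry["op"], entry["old"], entry["new"]
    if op == "I":   # an INSERT is undone by deleting the same primary key
        return "DELETE FROM %s WHERE id = %s" % (table, new["id"])
    if op == "U":   # an UPDATE is undone by restoring the previous values
        sets = ", ".join("%s = %r" % (k, v) for k, v in old.items() if k != "id")
        return "UPDATE %s SET %s WHERE id = %s" % (table, sets, old["id"])
    if op == "D":   # a DELETE is undone by re-inserting the old row
        cols = ", ".join(old)
        vals = ", ".join(repr(v) for v in old.values())
        return "INSERT INTO %s (%s) VALUES (%s)" % (table, cols, vals)

# for entry in history_entries_after(last_master_tick, order="DESC"):
#     execute(inverse_sql(entry))
```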



    We do not do any of this by hand, we do it with a script. What is special about our script? We have three asynchronous standbys, so before switching we have to find out which one is closest to the master. Then we pick that standby, wait until it replays the remaining WALs from the archive, and choose it as the future master. We use Postgres 9.2, and a peculiarity of this version is that for the other standbys to follow the newly promoted master, they have to be stopped; in 9.4 this should no longer be necessary. So we do promote, advance the sequences, run our Undo procedure, and start the standbys. Then there is another interesting point: we have to wait until the standbys connect to the new master. We do this by waiting until the timeline of the new master appears on the corresponding standby.
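
    The script itself is on the slide; as an illustration of its first step, here is a minimal sketch (connection strings are examples) of choosing the most caught-up standby by comparing pg_last_xlog_replay_location(). The real script then also waits for the archive to be fully replayed before doing promote:

```python
# pick_promote_candidate.py - hypothetical "find the standby closest to the master"
import psycopg2

STANDBYS = ["host=standby1", "host=standby2", "host=standby3"]

def replay_location(dsn):
    cur = psycopg2.connect(dsn + " dbname=postgres").cursor()
    cur.execute("SELECT pg_last_xlog_replay_location()")
    return cur.fetchone()[0]

def lsn_to_int(lsn):
    # textual LSN 'X/Y' -> comparable integer (9.2 has no pg_lsn type yet)
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

candidate = max(STANDBYS, key=lambda dsn: lsn_to_int(replay_location(dsn)))
print("promote candidate:", candidate)
```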



    And here it turns out that Postgres has no SQL function for this; you cannot find out via SQL which timeline a standby is on. But we solve it this way: we can connect to the standby via the Postgres replication protocol, and after the first command the standby reports its timeline (highlighted in red on the slide).
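
    For readers of the transcript, the same trick in code (a sketch assuming psycopg2 >= 2.7 with replication connection support; at the time we talked to the protocol directly, and the host and user names are examples):

```python
# standby_timeline.py - read a standby's current timeline over the replication protocol
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "host=standby1 user=replicator",
    connection_factory=psycopg2.extras.PhysicalReplicationConnection)
cur = conn.cursor()
cur.execute("IDENTIFY_SYSTEM")   # the first command on a replication connection
row = cur.fetchone()             # (systemid, timeline, xlogpos[, dbname])
print("standby is on timeline", row[1])
```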

    This is our master recovery script.



    Let us go further: how do we recover when some of the connected systems themselves fail? For example, a standby. Since we have three standbys, as I said, if one of them fails we simply switch to the remaining ones. As a last resort, even if we lose all the standbys, we can switch the traffic to the master. Some traffic will be lost, but in principle the site will keep working. There was also one trick here: at first I always created new standbys from the backup, and even after we got servers with SSDs I kept restoring standbys from the backup. Then it turned out that restoring from the backup takes 12 hours, while simply taking pg_basebackup from some working standby takes much less time. If you have several standbys, you may want to try this yourself.
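
    In essence it is a single command; a sketch wrapped in Python for consistency with the other examples (host names and paths are illustrative):

```python
# rebuild_standby.py - recreate a standby from another healthy standby, not from the backup
import subprocess

subprocess.check_call([
    "pg_basebackup",
    "-h", "standby2",                          # take the copy from a working standby
    "-U", "replication",
    "-D", "/var/lib/postgresql/9.2/main",      # empty target data directory
    "-X", "stream",                            # also stream the WAL needed for consistency
    "-P",                                      # show progress
])
```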



    What if the site sphinx breaks? The site sphinx is written so that it rebuilds the whole index from scratch, and it covers all active site ads; right now all 30 or 35 million ads on the site are indexed by this system. Indexing runs off a separate logical replica that is specially prepared for indexing: everything there fits in memory, so indexing is very fast, and we can re-index from scratch every 10 minutes. We have two of these logical replicas, and if we lose one replica, we switch to its reserve. And if something happens to sphinx itself, then within 10 minutes it is completely re-indexed and everything is fine again.



    How can the export to DWH be restored? Suppose we exported something, a failure happened on the DWH side, and we lost some of the latest data. We export to DWH through a separate logical replica, and the last four days are kept on that replica, so we can simply re-run the export script by hand and upload all of that data again. There is also an archive covering six months. And as a last resort, since we have several standbys, we can take one of them, pause it, and re-upload essentially all the data from the master into DWH.



    Xrpc is implemented on top of pgq (also from Skytools), and thanks to this we can do tricky things. Pgq is, in essence, just a table in the database where events are stored. It looks roughly like in the picture: there is an event time and a transaction id. When we have restored an xrpc consumer, we can move back in this queue and replay the events that are missing on the receiving side.
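
    One way this replay can be done (a sketch; the queue and consumer names are examples, and the exact pgq helpers depend on the Skytools version, so check pgq.tick and pgq.register_consumer_at in yours):

```python
# replay_xrpc.py - hypothetical sketch of moving a pgq consumer back in the queue
import psycopg2

conn = psycopg2.connect("dbname=xrpc_source")
cur = conn.cursor()

# find the tick that was current just before the failure ...
cur.execute("""
    SELECT t.tick_id
      FROM pgq.tick t
      JOIN pgq.queue q ON q.queue_id = t.tick_queue
     WHERE q.queue_name = %s AND t.tick_time <= %s
     ORDER BY t.tick_id DESC LIMIT 1
""", ("xrpc_events", "2016-01-01 12:00:00"))
tick_id = cur.fetchone()[0]

# ... and re-register the consumer at that position so the missing events are delivered again
cur.execute("SELECT pgq.unregister_consumer(%s, %s)", ("xrpc_events", "xrpc_consumer"))
cur.execute("SELECT pgq.register_consumer_at(%s, %s, %s)",
            ("xrpc_events", "xrpc_consumer", tick_id))
conn.commit()
```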



    Xdb is our storage made up of several databases: 16 databases spread over eight machines. This storage is backed up as follows: Postgres binary replication is simply set up from one machine to the next, i.e. the first machine has a standby on the second, the second on the third, and so on, with the eighth replicated back to the first. In addition, WAL replay there has a four-day delay, so in effect we have a four-day-old backup of each of these nodes.



    Now I will describe the replica in more detail, what it actually is. We built our logical replica on top of standard Postgres capabilities: a view on the master plus deferred triggers on the tables we need. These triggers call a special function that writes into a separate table; it can be thought of as a materialized view. That table is then replicated by londiste to the logical replica.



    It looks roughly like this; I will not dwell on it in detail.
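
    For readers of the transcript, a minimal sketch of the construction (the table and column names are hypothetical; the real triggers project many more fields):

```python
# create_materialized_table.py - the "view on the master + deferred trigger" trick:
# changes to the source table are flattened into a side table, which londiste then
# replicates to the logical replica.
import psycopg2

DDL = """
-- side table that acts as the materialized representation
CREATE TABLE item_mat (
    item_id  bigint PRIMARY KEY,
    user_id  bigint,
    title    text,
    price    numeric
);

-- trigger function: project the row we want on the replica
CREATE OR REPLACE FUNCTION item_to_mat() RETURNS trigger AS $$
BEGIN
    DELETE FROM item_mat WHERE item_id = NEW.item_id;
    INSERT INTO item_mat (item_id, user_id, title, price)
         VALUES (NEW.item_id, NEW.user_id, NEW.title, NEW.price);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- deferred trigger: fires on the master at commit time
CREATE CONSTRAINT TRIGGER item_mat_sync
    AFTER INSERT OR UPDATE ON item
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE item_to_mat();
"""

with psycopg2.connect("dbname=master_db") as conn:
    conn.cursor().execute(DDL)
```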



    And why is a separate logical replica server needed at all? It is a separate server whose distinguishing feature is that everything there sits in memory: shared_buffers is sized so that this whole table and its indexes fit into it completely. That lets such logical replicas handle a large load; one replica, for example, serves 7000 transactions per second while about 1000 events per second are queued into it from the master. Because this is a logical replica implemented with londiste and pgq, there is a convenient feature: tracking which transactions have already been applied on this logical replica. And on top of that you can build things like Undo.



    I have already said that we have two such replicas, so we can recover simply by switching: if one replica is lost, we switch to the second. This is possible because pgq allows several consumers to subscribe to one queue. So a replica has failed, and now we need to rebuild a copy of it. If we do it simply with londiste, it currently takes 4 hours for the site replica and 8 hours for the sphinx one, because triggers fire there that reshape the data into a form convenient for sphinx indexing, and that is all very slow. But it turned out that there is another way to rebuild a failed replica: you can do it with pg_dump.



    However, if you simply take pg_dump and start londiste on top of it, it will not work, because londiste tracks the current position of processed transactions on both the master and the logical replica. So additional steps are needed: after restoring the dump, the tick_id on the master has to be adjusted so that it matches the tick_id on the restored replica. With this pg_dump-based copy, the whole procedure takes no more than 15 minutes.



    The algorithm itself looks roughly like this.
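
    Very roughly, and with all names hypothetical (how exactly the tick position is fixed depends on the Skytools version), the outline is:

```python
# rebuild_replica.py - outline of the pg_dump-based rebuild of a logical replica
import subprocess

SRC = "host=replica1 dbname=site_replica"   # the surviving copy of the logical replica
DST = "host=replica3 dbname=site_replica"   # the copy being rebuilt

# 1. stop the londiste worker of the broken replica

# 2. copy the replicated tables with pg_dump (about 15 minutes for us instead of
#    hours of the initial londiste copy)
subprocess.check_call(
    'pg_dump --data-only --table=item_mat "%s" | psql "%s"' % (SRC, DST),
    shell=True)

# 3. make the positions agree: the tick_id on the master must match the position of
#    the restored replica, e.g. by registering its consumer at the tick corresponding
#    to the moment of the dump (see the earlier pgq example)

# 4. start the londiste worker again; it continues from that tick
```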



    Backups are meant to protect against failures, but failures can happen to the backup itself. For example, the example WAL archiving command in Postgres does not say that an fsync should be done when a WAL is written to the archive, yet this is important and protects you against, say, a crash restart of the archive server. In addition, our backup is itself protected by being copied to an external cloud. We also plan to set up two active archive servers so that archive_command writes WALs to both. I can also mention that at first we experimented with pg_receivexlog in order to receive WALs directly on the archive servers themselves, but it turned out that in 9.2 it is practically unusable: it does not fsync, and it does not track which WALs it has already received from the master, i.e. which ones can be cleaned up at a checkpoint. This has since been fixed in Postgres.
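
    As an illustration of the fsync point, here is a minimal sketch of an archiving script that flushes the archived segment (and the rename) to disk; paths are illustrative and error handling is simplified, this is not our production script:

```python
#!/usr/bin/env python
# archive_wal.py - hypothetical archive_command = 'archive_wal.py %p %f'
import os, shutil, sys

ARCHIVE_DIR = "/wal_archive"   # example archive location

def main(wal_path, wal_name):
    dst = os.path.join(ARCHIVE_DIR, wal_name)
    if os.path.exists(dst):
        return 1                                   # never overwrite an archived segment
    tmp = dst + ".tmp"
    shutil.copy2(wal_path, tmp)
    with open(tmp, "rb") as f:
        os.fsync(f.fileno())                       # flush the file contents to disk
    os.rename(tmp, dst)
    dir_fd = os.open(ARCHIVE_DIR, os.O_DIRECTORY)
    os.fsync(dir_fd)                               # flush the rename itself
    os.close(dir_fd)
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```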



    We do not use streaming replication; everything I have been talking about is based solely on the WAL archive. We did this because with streaming it is hard to guarantee a complete archive: if, for example, we take a backup from a standby, the backup finishes, but the master has not yet managed to archive all the WALs needed to restore that backup, we end up with a broken backup. This can be worked around if, for example, the standby we back up from lags by 12 hours, like ours. Also, Postgres 9.5 added the archive_mode = always setting so that this problem goes away: it will be possible to safely take a backup from a standby and ship WALs to the archive directly from the standby as well.



    It is not enough just to make a backup; you also have to check that it was made correctly. We do this on a test server, and for that we wrote a special backup verification script. It is based on checking the server log for error messages after the server has been restored and started. And for each database in the restored cluster a special verification function called check_backup is run, which performs additional checks. In particular, it checks that the date of the last transaction differs from the date of the last ad by no more than a minute. If there are no gaps, we consider the backup restored correctly.
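
    A sketch of the kind of check our check_backup function performs (the table and column names are hypothetical, and pg_last_xact_replay_timestamp() is used here as "the date of the last transaction" on the restored copy):

```python
# verify_backup.py - hypothetical sketch of one check_backup-style check
import psycopg2

conn = psycopg2.connect("host=backup-test dbname=site")
cur = conn.cursor()
cur.execute("""
    SELECT pg_last_xact_replay_timestamp() - max(updated_at) <= interval '1 minute'
      FROM item
""")
ok = cur.fetchone()[0]
print("backup looks complete" if ok else "backup is missing recent data")
```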



    On the slides you can see which specific errors we look for in the log when checking a backup.

    We used to verify backups by running a vacuum over the entire database and reading all the tables, but then we dropped that, because we compute reports on the restored backup, and if the reports come out correct, with no gaps or strange values, then the backup was made correctly.



    I have been talking about asynchronous replication, but sometimes you want synchronous. Our Avito consists of many services; one of them is the payment service. Because it is separated out and runs on its own database, we can enable synchronous replication for it: the load there is not that high, and normal network latency allows us to turn synchronous replication on.
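
    Once synchronous_standby_names is set on that master, it is easy to confirm that the payment standby really is synchronous (a sketch; the connection parameters and application name are examples):

```python
# check_sync.py - confirm which standby is synchronous on the payments master
import psycopg2

conn = psycopg2.connect("host=payments-master dbname=payments")
cur = conn.cursor()
cur.execute("SELECT application_name, sync_state FROM pg_stat_replication")
for name, state in cur.fetchall():
    print(name, state)     # the payment standby should report sync_state = 'sync'
```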



    What can be said in conclusion? Even though replication is asynchronous, you can live and recover in this mode; if you look at your connected systems, you can work out how to restore them. And it is important to test your backups.



    One more remark. We have a recovery script, and at the end of it we need to change DNS records, because which server is the master and which is the standby is recorded in DNS. We are now thinking about using some system such as ZooKeeper to switch DNS automatically. Such are the plans.

    This report is a transcript of one of the best talks at the HighLoad++ conference for developers of highly loaded systems. We are now actively preparing the 2016 conference; this year HighLoad++ will take place in Skolkovo on November 7 and 8.

    The Avito team traditionally gives very strong talks; this year, for example, there will be:

    Some of these materials are also used in our online training course on developing highly loaded systems, HighLoad.Guide: a series of specially selected letters, articles, materials, and videos. Our textbook already contains more than 30 unique materials. Join in!
