Features of Dropbox's distributed storage architecture



    We present to Habrahabr readers a transcript of Vyacheslav Bakhmutov's talk (the video is at the end of the publication), given at the HighLoad++ conference held in Skolkovo near Moscow on November 7-8 of last year.

    My name is Slava Bakhmutov, I work at Dropbox. I am a Site Reliability Engineer (SRE). I love Go and promote it, and together with friends I record the golangshow podcast.

    What is Dropbox?


    It is cloud storage where users keep their files. We have 500 million users and more than 200 thousand business customers, as well as a huge amount of data and traffic (more than 1.2 billion new files per day).

    Conceptually, the architecture consists of two large pieces. The first is the metadata server: it stores information about files, relationships between files, user information, and some business logic, all backed by a database. The second big piece is block storage, which stores the user data itself. Initially, in 2011, all data was kept in Amazon S3. In 2015, when we managed to move all of those exabytes to our own infrastructure, we started talking about writing our own cloud storage.

    We called it Magic Pocket. It is written in Go, partly in Rust, and quite a lot of it in Python. The architecture of Magic Pocket is cross-zone: there are several zones. In America there are three zones, combined into a Pocket. There is a Pocket in Europe which does not overlap with the American one, because Europeans do not want their data to be in America. The zones replicate data to each other. Inside a zone there are cells, and a master that controls those cells; replication runs between the zones. In each cell there is a Volume Manager that watches over the servers on which the data is stored, and these are quite large servers.

    On each of these servers the data is grouped into buckets; a bucket is 1 GB. We operate on buckets when we move data around, delete it, clean up, or defragment, because the data blocks we receive from users are 4 MB each, and operating on those individually is very hard. All the components of Magic Pocket are well described in our technical blog, so I will not go through them here.

    I will talk about the architecture from the SRE point of view. From this point of view, availability and durability of data are what matter most. What are they?

    Availability is a tricky concept and is calculated differently for different services, but usually it is the ratio of successful requests to the total number of requests, and it is usually described in nines: 99.9%, 99.99%, 99.999%. In reality, though, it is a function of the time between disasters or incidents and the time it takes you to fix the problem, no matter how you fix it: with a patch in production or by simply rolling back the version.
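    As a rough illustration of that relationship (my own back-of-envelope sketch with made-up numbers, not a formula from the talk), availability can be estimated from the mean time between failures and the mean time to repair:

        # Back-of-envelope availability estimate: availability = MTBF / (MTBF + MTTR).
        # The numbers are invented purely for illustration.
        mtbf_minutes = 30 * 24 * 60   # one failure a month on average
        mttr_minutes = 10             # ten minutes to fix forward or roll back

        availability = mtbf_minutes / (mtbf_minutes + mttr_minutes)
        print(f"availability ~ {availability:.5f}")  # ~0.99977, between three and four nines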

    What is durability?


    What do we mean by durability? You take some data and write it to a hard drive. Then you have to wait until it is actually synced to the disk. It sounds funny, but many people use NoSQL solutions that simply skip this step.
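    In Python terms, waiting for the data to actually reach the disk looks roughly like this (a minimal sketch; write() alone only hands the bytes to the OS page cache, and skipping the fsync is exactly the step that gets left out):

        import os

        def durable_write(path: str, data: bytes) -> None:
            """Write data and make sure it has actually reached stable storage."""
            with open(path, "wb") as f:
                f.write(data)         # data is now in Python/OS buffers, not necessarily on disk
                f.flush()             # push the userspace buffer down to the OS
                os.fsync(f.fileno())  # block until the OS has written it to the device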

    How is durability calculated?


    There is AFR, the annualized failure rate of disks. There are various statistics on which failures occur and how often they occur on particular hard drives. You can look at these statistics:



    Next, you can replicate your data across different hard drives to increase durability. You can use RAID, and you can keep replicas on different servers or even in different data centers. And if you model the probability of losing a single byte with Markov chains, you get something like 27 nines. Even at Dropbox scale, where we have exabytes of data, we will not lose a single byte in the foreseeable future. But all of this is fragile: one operator error, or one logic error in your code, and the data is gone.
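    A deliberately naive back-of-envelope version of that kind of estimate (not the Markov-chain model mentioned in the talk, and with invented numbers) might look like this:

        # Naive per-block durability estimate for 3 independent replicas.
        # Assumptions (invented for illustration): a block is lost only if, after the
        # first disk fails, the other two replicas also fail before re-replication
        # completes; failures are independent.
        afr = 0.04                # annualized failure rate of a single disk
        repair_days = 1.0         # time to re-replicate a lost copy
        p_fail_during_repair = afr * repair_days / 365.0

        # First replica fails at some point during the year, then the remaining two
        # fail within the repair window.
        p_block_lost_per_year = afr * p_fail_during_repair ** 2
        print(f"P(block lost in a year) ~ {p_block_lost_per_year:.1e}")  # ~4.8e-10, roughly 9 nines

    A fuller Markov-chain model of the actual replication scheme is what yields the much larger figure quoted above.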

    How do we actually improve the availability and durability of data?


    I divide it into 4 categories, these are:

    1. Isolation;
    2. Protection;
    3. Verification;
    4. Automation.

    Automation is very important.

    Isolation can be:

    • Physical;
    • Logical;
    • Operational.

    Physical isolation. At the scale of Dropbox, or a company like Dropbox, it is very important to work closely with the data center: we need to know how our services are placed inside the data center, how power is delivered to them, and what network connectivity those services have. We do not want to keep in a single rack the database servers that we have to back up constantly; say each backup generates 400 Mbit/s of traffic, and we simply will not have enough network capacity. The deeper you go down this stack, the more expensive and complicated your solution becomes. How low to go is your decision, but you certainly should not put all the replicas of your databases in one rack, because the power will go out and you will have no databases left.

    You can also look at all this in another dimension, from the point of view of the equipment manufacturer. It is very important to use different equipment manufacturers, different firmware, different drivers. Why? Because although manufacturers say their hardware is reliable, in reality that is not the case. Well, at least it does not explode.

    Because of all this, it is important to keep critical data not only in backups on your own infrastructure, but also on external infrastructure. For example, if you run on Google's cloud, put the important data into Amazon as well, and vice versa: if your whole infrastructure goes down, there will be nowhere to restore the backup from.

    Logical isolation. Almost everyone already knows about it. The main problems: if one service starts having trouble, other services start having trouble too; if there is a bug in the code of one service, that bug starts spreading to other services and you end up with corrupted data. How do you deal with this?

    Loose coupling! But it very rarely works. There are systems that are inherently not loosely coupled: databases, ZooKeeper. If load hits ZooKeeper and your quorum cluster goes down, everything goes down. It is roughly the same with databases: if there is heavy load on the master, most likely the whole cluster will go down.

    What have we done at Dropbox?




    This is a high-level diagram of our architecture. We have two zones, and between them we made a very simple interface: essentially just put and get, for storage. That was very hard for us, because we kept wanting to make it more complicated. But it is very important, because inside each zone everything is very complicated: there are ZooKeeper, databases, quorums. And all of that periodically falls over, sometimes all at once. So that this does not drag down the remaining zones, there is this simple interface between them. When one zone goes down, the other one most likely keeps working.
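    A sketch of what such a deliberately narrow inter-zone interface might look like (hypothetical names, not the real Magic Pocket API):

        from abc import ABC, abstractmethod

        class ZoneStorage(ABC):
            """The only surface one zone exposes to another: put and get, nothing else.
            Everything complicated (ZooKeeper, databases, quorums) stays hidden inside."""

            @abstractmethod
            def put(self, block_id: str, data: bytes) -> None: ...

            @abstractmethod
            def get(self, block_id: str) -> bytes: ...

        def replicate(block_id: str, data: bytes, local: ZoneStorage, remote: ZoneStorage) -> None:
            """Cross-zone replication only ever goes through the narrow interface,
            so a failure inside one zone cannot leak into the other."""
            local.put(block_id, data)
            remote.put(block_id, data)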

    Operational isolation. No matter how well you spread your code across servers, no matter how logically isolated it is, there will always be a person who does something wrong. For example, at Odnoklassniki there was an incident where a broken Bash shell was rolled out to all the servers, and all the servers went down. Problems like that happen too.

    There is also a joke that if all the programmers and system administrators went off on vacation, the system would run much more stably than while they are working. And it is true: during freezes (many companies have the practice of freezing deployments before the New Year holidays) the system runs much more stably.

    Access control and the release process: everything is actively tested, then tested on staging, which is, for example, what your own company uses. Then we roll the changes out to one zone. Once we are sure everything is fine, we roll out to the other two zones. If something goes wrong, we can restore the data from them. That is how it works for storage; our product services we update constantly, about once a day.

    Protection. How should we protect our data?


    It is validation of operations, it is the ability to recover the data, and it is testing. What is validation of operations?

    The biggest risk to the system is the operator.
    There is a story about Margaret Hamilton, who worked on the Apollo program. At some point during testing her daughter came to visit and flipped a switch, and the whole system crashed: that switch wiped the navigation data of the Apollo. Hamilton went to NASA, said there was such a problem, and proposed adding protection so that while the ship was in flight the data could not be wiped. But NASA said the astronauts are professionals and would never flip that switch. On the next flight one of the astronauts flipped it by accident, and everything was wiped. Fortunately they had a clear description of the problem and were able to restore the navigation data.

    We had a similar example. We have a tool called JSSH, a distributed shell that lets you execute commands on many servers at once.



    In this command we needed to run the upgrade.sh script on the memcache servers that are in the reinstall state, that is, servers that are already out of service and need to be reimaged. Usually all of this is done automatically, but sometimes you have to do it by hand.

    The problem was that the whole thing needed to be wrapped in quotes:



    Since it was all unquoted, the argument lifecycle=reinstall was applied to every memcache server and they all went down for reinstall. That did not do the service much good. The operator is not to blame; anything can happen.

    What have we done?


    We changed the command syntax (gsh) so that such problems can no longer happen. We forbade destructive operations on live services (DB, memcache, storage): under no circumstances should we reboot a database, or a memcache, without first stopping it and taking it out of production. And we are trying to automate all such operations.



    We added two slashes to the end of this operation, and only after them do we specify our script; this small fix has let us avoid such problems ever since.

    Second example: SQLAlchemy, a Python library for working with databases. Its update, insert, and delete constructs have an argument called whereclause, in which you specify what you want to delete or update. But if you pass where instead of whereclause, SQLAlchemy will not complain; it will simply delete everything, without a WHERE, and that is a very big problem. We also have several services, for example ProxySQL, a proxy for MySQL that lets you forbid many destructive operations (DROP TABLE, ALTER, updates without WHERE, etc.). In ProxySQL you can also throttle queries you do not recognize, limiting their number so that some clever query does not accidentally take down the master.
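    A minimal sketch of the kind of guard this motivates (my own illustration, not Dropbox's code): refuse to execute an UPDATE or DELETE that has no WHERE clause attached.

        from sqlalchemy import MetaData, Table, Column, Integer, String, create_engine, delete

        engine = create_engine("sqlite:///:memory:")  # stand-in for MySQL
        metadata = MetaData()
        users = Table("users", metadata,
                      Column("id", Integer, primary_key=True),
                      Column("name", String(64)))
        metadata.create_all(engine)

        def safe_execute(conn, stmt):
            """Reject mass UPDATE/DELETE: a statement without a WHERE clause is
            almost always an operator or programming error."""
            if getattr(stmt, "whereclause", None) is None:
                raise ValueError("refusing to run UPDATE/DELETE without a WHERE clause")
            return conn.execute(stmt)

        with engine.begin() as conn:
            conn.execute(users.insert(), [{"name": "a"}, {"name": "b"}])
            safe_execute(conn, delete(users).where(users.c.id == 1))  # fine
            # safe_execute(conn, delete(users))                       # raises ValueError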

    Recovery. It is very important not only to create backups but also to verify that those backups can actually be restored. Facebook recently published an article describing how they constantly make backups and constantly restore from them. We do essentially the same thing.
    Here is an example from our Orchestrator, over a short period of time:



    You can see that we are constantly cloning databases: we have up to 32 databases on one server and we are constantly moving them around, so we are constantly cloning. There are constant promotions to master and slave, and so on. There is also a huge number of backups; we back up to our own infrastructure and to Amazon S3 as well. But we also constantly run restores. If we do not verify that every database we backed up can actually be restored, then in effect we do not have that backup at all.
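    A simplified sketch of what constantly verifying that backups restore can mean in practice (hypothetical scratch host and port, and a very crude sanity check, not our actual tooling):

        import subprocess

        def verify_backup(dump_path: str, scratch_host: str = "127.0.0.1",
                          scratch_port: int = 3307) -> bool:
            """Load a dump into a throwaway MySQL instance and sanity-check it."""
            # Load the dump (assumes an empty scratch instance is already running).
            with open(dump_path, "rb") as dump:
                load = subprocess.run(
                    ["mysql", "-h", scratch_host, "-P", str(scratch_port), "-u", "root"],
                    stdin=dump)
            if load.returncode != 0:
                return False
            # Crude sanity check: the restored instance answers queries and has tables.
            check = subprocess.run(
                ["mysql", "-h", scratch_host, "-P", str(scratch_port), "-u", "root",
                 "-N", "-e", "SELECT COUNT(*) FROM information_schema.tables"],
                capture_output=True, text=True)
            return check.returncode == 0 and int(check.stdout.strip()) > 0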

    Testing


    Everyone knows that unit testing and integration testing are useful. From the availability point of view, a bug caught by testing has an MTTR of essentially zero, because you found it before production rather than in production and fixed it there; availability did not dip. That is also very important.

    Verification


    Someone always messes up: either the programmers or the operators do something wrong. That in itself is not the problem; you need to be able to find it and fix it.

    We have a huge number of verifiers for storage.



    There are actually eight of them, not five as shown here. We have more verification code than storage code, and 25% of our internal traffic is verification. At the lowest level run disk scrubbers, which simply read blocks from the hard drive and check their checksums. Why do we do this? Because hard drives lie and SMART lies: it is not in the manufacturers' interest for SMART to report errors, because then they would have to take those drives back. So you always have to check yourself, and as soon as we see a problem we try to recover that data.
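    Conceptually a scrubber does little more than this (a simplified sketch: re-read every block and compare it with the checksum stored for it):

        import zlib

        BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks, as in the talk

        def scrub(volume_path: str, expected_checksums: list[int]) -> list[int]:
            """Re-read every block on a volume and return the indexes of corrupt ones.
            Trust the actual read, not the drive's SMART self-reporting."""
            bad_blocks = []
            with open(volume_path, "rb") as vol:
                for index, expected in enumerate(expected_checksums):
                    block = vol.read(BLOCK_SIZE)
                    if zlib.crc32(block) != expected:
                        bad_blocks.append(index)  # trigger repair/re-replication for this block
            return bad_blocks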

    We also have a trash inspector. When we delete something, move something, or do anything else destructive with data, we first put that data into a kind of trash bin and then check whether we really meant to delete it. It is kept there for two weeks, for example; we pay for the capacity to hold two weeks' worth of deleted data because it is very important. We also sample part of the traffic that comes into storage and record those operations in Kafka. Then we replay those operations against storage, treating it as a black box, to check that the data which arrived and was acknowledged as written can really be read back.
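    A sketch of that black-box check (hypothetical helper names; in reality the sampled operations come out of Kafka): for every sampled, acknowledged write, read the block back and compare digests.

        import hashlib
        from typing import Callable, Iterable, Optional, Tuple

        def verify_writes(storage_get: Callable[[str], Optional[bytes]],
                          sampled_ops: Iterable[Tuple[str, str]]) -> list[str]:
            """sampled_ops yields (block_id, sha256_hex) pairs recorded at write time.
            Storage is treated purely as a black box: just get() and compare."""
            missing_or_corrupt = []
            for block_id, expected_digest in sampled_ops:
                data = storage_get(block_id)
                if data is None or hashlib.sha256(data).hexdigest() != expected_digest:
                    missing_or_corrupt.append(block_id)
            return missing_or_corrupt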

    We also constantly check that data replicated from one storage to another really matches: if a piece of data is in one storage, it must be in the other as well.

    Verifiers are very important if you want to achieve genuinely high durability.

    It is also very important to understand that verifiers you have not tested essentially verify nothing: you cannot claim they really do what you want.

    That is why it is very important to run Disaster Recovery Testing (DRT). What is it? It is when, for example, you completely switch off some internal service that other services depend on, and see whether you were able to detect that something stopped working, or whether you still think everything is fine. Could you react quickly, repair it, restore everything? This matters because it catches problems in production. Production differs from staging in that the traffic is completely different, and the infrastructure is simply different. For example, on staging you may have a certain set of services on one rack, and they are all different: a web service, a database, storage. In production it can be laid out completely differently.

    We once had exactly such a problem: our database lived alongside web services and we could not back it up in time. This also needs to be watched.

    It is very important to prove our assumptions. That is, if something fails and we think we know how to restore it because we wrote a script for it, we must prove that when it really does fail, the recovery actually works, because the code changes, the infrastructure changes, everything can change. It also gives peace of mind to the on-call engineers, because otherwise they do not sleep well: nobody likes being woken up at night, and that causes real problems. This is a fact; psychologists have studied it.

    If the on-call engineers know that a particular kind of failure can happen, they are ready for it: they know how to recover when a real problem occurs. This should be done regularly, in production.

    Automation


    This is the most important part. When your number of servers grows linearly or exponentially, the number of people does not: people have to be found and trained, and that happens at its own pace. You cannot grow headcount in proportion to the number of your servers.

    Therefore, you need automation that will do the work for these people.

    What is automation?


    For automation it is very important to collect metrics from your infrastructure. I have said almost nothing about metrics in this talk because metrics are the core of your service and hardly need mentioning: they are the most important part. If you have no metrics, you do not know whether your service is working at all. It is also very important to collect metrics quickly: if they are collected once a minute and your problem fits inside a minute, you will not know about it. And it is very important to react quickly. When a human reacts, a minute or so passes between something happening, the bug showing up in a metric, and you receiving the alert. Our policy is 5 minutes: within that time you must start doing something in response to the alert. You start digging in, and in practice the problem takes 10-15 minutes on average to resolve, depending on the problem. Automation speeds this up, not by solving the problem for you, but by gathering the information about it before you even start working on it.

    We have a tool for this called Naoru: paranoid automation.



    Automation consists of several components.

    It starts with incoming alerts. An alert can be a simple Python script that connects to a server and checks that it is reachable; it can come from Nagios or Zabbix, whatever you use. The main thing is that it arrives quickly. Then we need to understand what to do with it: we must diagnose. For example, if a server is unreachable, we try connecting over SSH, then over IPMI, and see whether there is no answer at all, whether it is frozen, or something else, and based on that we prescribe a treatment.
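    Put together, the flow looks something like this (a hypothetical sketch in the spirit of what is described, not Naoru's real API):

        import subprocess

        def host_is_reachable(host: str) -> bool:
            """Cheap alert-style check: a single ICMP ping."""
            return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                                  capture_output=True).returncode == 0

        def diagnose(host: str) -> str:
            """Narrow the problem down before proposing a remediation."""
            if host_is_reachable(host):
                return "ok"
            if subprocess.run(["ssh", "-o", "ConnectTimeout=5", host, "true"],
                              capture_output=True).returncode == 0:
                return "ping blocked but ssh works"  # network issue, not the host itself
            return "host down"                       # next step would be IPMI / power checks

        def remediate(host: str, diagnosis: str, confirm) -> None:
            """Every proposed action goes through an operator confirmation callback first."""
            if diagnosis == "host down" and confirm(f"power-cycle {host}?"):
                print(f"would power-cycle {host} via IPMI here")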

    Furthermore, when you write automation, everything should at first go through an operator. Our policy is that any new automation runs through operator confirmation for about 3-6 months.

    We gather all the information about the problem and present it to the operator: such-and-such a server is unreachable for such-and-such a reason, here is what should be done, and the operator is asked to confirm. The operator has very important knowledge. He knows that right now there is some other issue with the service; for example, he knows that this particular server must not be rebooted because something else is running on it, so it cannot simply be restarted. So every time an operator hits a problem, he makes some improvement to the automation script, and each time it gets better.

    There is one big problem: operator laziness. Operators start automating the automation and automatically feed a yes into the prompt:



    So for a while we asked not for yes/no but for a full phrase like "Yes, I really want to do this and that", with randomized letter case, and we checked that the input really came from an interactive console. This is very important.
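    A sketch of that kind of deliberately inconvenient confirmation (my own illustration): require a randomized phrase typed on a real terminal, so that a piped yes does not get through.

        import random
        import sys

        def confirm(action: str) -> bool:
            """Ask for explicit confirmation that cannot be satisfied by `yes |`."""
            if not sys.stdin.isatty():
                raise RuntimeError("confirmation must come from an interactive console")
            # Randomize the letter case so a hard-coded answer in a wrapper script fails too.
            phrase = "".join(random.choice([c.lower(), c.upper()])
                             for c in f"yes, i really want to {action}")
            answer = input(f"Type exactly '{phrase}' to continue: ")
            return answer == phrase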

    It is also important to have hooks, because not everything can be checked generically. Hooks can be very simple: for example, if ZooKeeper runs on the server, we need to check that all members of the ensemble are up (the script shown contains errors); essentially we just check something:



    These hooks live throughout the flow: there are hooks after alerts, in the diagnostic plugin, and so on. You can write your own hook for your own service.
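    A minimal sketch of such a hook (hypothetical member list; it uses ZooKeeper's standard ruok four-letter command, to which a healthy node answers imok):

        import socket

        def zookeeper_ok(host: str, port: int = 2181, timeout: float = 2.0) -> bool:
            """Send ZooKeeper's 'ruok' command; a healthy node answers 'imok'."""
            try:
                with socket.create_connection((host, port), timeout=timeout) as sock:
                    sock.sendall(b"ruok")
                    return sock.recv(4) == b"imok"
            except OSError:
                return False

        def ensemble_hook(members: list[str]) -> None:
            """Pre-remediation hook: refuse to touch a server if any ensemble member is down."""
            down = [m for m in members if not zookeeper_ok(m)]
            if down:
                raise RuntimeError(f"ZooKeeper members not healthy: {down}")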

    Next comes the actual remediation:



    It can be very simple: for example, restart the server, or run Puppet, Chef, and so on.

    That is Naoru: reactive remediation, which lets us respond to events very quickly and repair them very quickly.

    There are also open-source solutions. The most popular is StackStorm, from Russian guys; it is a very good and quite popular solution. You can also assemble something similar yourself from tools like Riemann or OpsGenie, etc.

    There are proprietary solutions too: Facebook has FBAR (Facebook auto-remediation), LinkedIn has Nurse. They are closed, but their authors talk about them regularly; for example, Facebook recently gave a talk about migrating an entire rack with their tool.
    But automation is not only reactive (here and now); there is also automation of work that has to be carried out over a long period of time. For example, we need to update 10 thousand servers and reboot them, say to update a driver, the kernel, or something else. That can take a long time: a month, a year, and so on. A reactive system cannot keep track of that.

    For that we have another system, called Wheelhouse.

    Here is a diagram of how it works; let me walk through it.



    Essentially, we have a database cluster with one master and two slaves. We need to replace the master, for example because we want to update the kernel. To replace it, we need to bring up an extra slave, depromote the master, and remove it. At Dropbox we have a requirement that in this configuration a cluster must always have two slaves. The cluster has a certain state: hostA is in production, it is the master, it has not yet been released; we have two slaves, but for this operation we need three.

    We have some kind of state machine that does all this.



    From replace_loop (the blue arrow) we see that our server is in production and we do not have enough slaves, so we go into the state of adding another slave. In that state we kick off a job to create a new slave as a clone of the master. It runs somewhere in the Orchestrator and we wait. If the job completes and everything is fine, we move on to the next step; if it failed, we go to the failure state. Next we add the new slave to production for this host, again by starting a job in the Orchestrator, wait, and return to replace_loop.

    We now have something like this:



    One host is the master and there are three slaves, so we now satisfy the condition of having enough slaves. After that we go to the promote state.



    This is actually a depromote, because we need to turn the master into a slave and some slave into the master. Everything works the same way here: a job is added in the Orchestrator, the condition is checked, and so on; the remaining steps are roughly the same. After that we remove the master from production and take traffic off it. We put the old master into the installer so that the people responsible for this server can update whatever they need on it.

    This block is very small, but it takes part in more complex blocks. For example, we may need to move an entire rack, because the drivers on the switches are being updated and we simply have to take the whole rack down to reboot the switch. We have described many such state diagrams for different services: how to shut them down and how to bring them back up. It all works, and you can even prove mathematically that the system will always end up in a working state. Why do this with a state machine rather than procedurally? Because if it is a long process (cloning, for example, can take an hour or so), something else may happen in the meantime and the state of your system will change. With a state machine you always know which state you are in and how to react to it.
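    A stripped-down sketch of such a state machine for the master-replacement flow (hypothetical state names and helpers, loosely following the diagram above):

        from enum import Enum, auto

        class State(Enum):
            REPLACE_LOOP = auto()
            ADD_SLAVE = auto()
            PROMOTE = auto()
            DONE = auto()
            FAILURE = auto()

        def step(state: State, cluster) -> State:
            """One transition of the replacement state machine. Long-running work
            (cloning, promotion) is handed off to the Orchestrator; we only track
            state, so an hour-long clone or a crash midway never leaves us guessing."""
            if state is State.REPLACE_LOOP:
                return State.ADD_SLAVE if cluster.slave_count() < 3 else State.PROMOTE
            if state is State.ADD_SLAVE:
                job = cluster.start_clone_from_master()   # async job in the Orchestrator
                return State.REPLACE_LOOP if job.wait() else State.FAILURE
            if state is State.PROMOTE:
                job = cluster.start_promote_new_master()  # old master becomes a slave
                return State.DONE if job.wait() else State.FAILURE
            return state

        def run(cluster) -> State:
            state = State.REPLACE_LOOP
            while state not in (State.DONE, State.FAILURE):
                state = step(state, cluster)
            return state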

    We are now using this approach in other projects as well. You can use these four methods to improve the availability and durability of your own data. I did not talk about specific solutions, because every large company writes its own. Of course we use some standard building blocks: for databases we use MySQL, but on top of it we built our own graph storage; for replication we used semi-sync, but now we have our own replication, something like Apache BookKeeper; we also use ProxySQL. On the whole, though, we have our own solutions everywhere, which is why I did not describe them: they most likely do not look like yours. But you can apply these methods with any of your solutions, open source or not, to improve your availability.

    That's all, thanks!

