Build it yourself: how we made Amazon-style storage for small hosters

    More and more Russian Internet projects want a cloud server and look towards Amazon EC2 and its analogues. We see the flight of large customers to the West as a challenge for Runet hosters: they too would like "their own Amazon," with all the trimmings. To meet hosters' demand for distributed data storage that can be deployed on relatively modest hardware, we built Parallels Cloud Server (PCS).

    In this post (below the cut) I will describe the architecture of the storage layer, one of the main highlights of PCS. It lets you build, on commodity hardware, a storage system comparable in speed and fault tolerance to expensive SAN arrays. A second post (already in preparation) will be of interest to developers: it will cover what we learned while building and testing the system.


    The vast majority of Habr readers know Parallels as the developer of solutions for running Windows on a Mac. Another part of the audience knows our container virtualization products: the commercial Parallels Virtuozzo Containers and Parallels Server Bare Metal, as well as the open-source OpenVZ. The ideas behind containers are used to one degree or another by cloud providers of Yandex and Google scale, but in general Parallels' strong suit is software for small and/or fast-growing service providers. Containers let hosters run multiple copies of operating systems on a single server, giving 3-4 times higher density of virtual environments than hypervisors. Container virtualization also allows migration of virtual environments between physical machines, transparently to the customers of the hosting or cloud provider.

    As a rule, service providers store client data on the local drives of their existing servers. Local storage has two problems. First, moving data around is complicated by the size of virtual machine images: a single file can be very large. Second, high availability of virtual machines is hard to achieve if a server goes down, say, after a power outage.

    Both problems can be solved, but the price of the solution is quite high. There are storage systems such as SAN/NAS arrays: big boxes, whole racks, and in the case of a large provider, entire data centers holding many, many petabytes. Servers connect to them over some protocol and write data there or read it back. SAN storage provides redundancy, fault tolerance, its own disk health monitoring, and self-healing and error correction. All this is very cool, with one exception: even the simplest SAN will cost at least $100,000, which is serious money for our partners, small hosting companies. Yet they want something similar, and they keep telling us so. The problem defined itself: offer them a solution with similar functionality at a significantly lower cost of purchase and ownership.

    PCS Architecture: First Strokes

    We started from the premise that our take on distributed storage should be as simple as possible, and we proceeded from an equally simple task: the system must keep virtual machines and containers running.

    Virtual machines and containers place specific demands on the consistency of stored data: the results of operations on the data must correspond to the order in which those operations were performed.

    Recall how any journaling file system works. It has a journal, metadata, and the data itself. The file system first writes the data to the journal, waits for that write to complete, and only then writes the data to its proper place on the disk. The freed space in the journal is then taken by a new entry, which after some time is likewise flushed to its final location. After a power outage or a system crash, everything recorded in the journal up to the last committed point is intact; data still in flight to its final location can be replayed from the journal.
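The journaling scheme above can be sketched in a few lines. This is a toy in-memory model of the write-ahead idea, not the code of any real file system; the names and structures are ours for illustration.

```python
# Minimal sketch of write-ahead journaling: data goes into the journal
# first; only after that write is durable is it moved to its final
# place, so a crash can never leave an update half-applied and lost.

journal = []   # stand-in for the on-disk journal
disk = {}      # stand-in for final on-disk locations

def journaled_write(offset, data):
    journal.append((offset, data))  # 1. log the intent (fsync here in reality)
    disk[offset] = data             # 2. write data to its final place
    journal.pop(0)                  # 3. journal entry can be dropped after checkpoint

def replay(log, target):
    # After a crash, re-apply everything still in the journal:
    # entries there are complete, so replaying is always safe.
    for offset, data in log:
        target[offset] = data

journaled_write(0, b"hello")
journaled_write(4096, b"world")
```

A crash between steps 1 and 3 simply leaves an entry in the journal, and `replay` brings the final locations back in line with it.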

    There are many kinds of consistency. The ordering that all file systems expect to see is called immediate, or strict, consistency. Strict consistency has two distinguishing features. First, all reads return the values just written: stale data is never visible. Second, data becomes visible in the same order in which it was written.

    Surprisingly, most cloud storage solutions do not provide these properties. In particular, Amazon S3 object storage, used by various web services, gives completely different guarantees: so-called eventual consistency, where written data is guaranteed to become visible not immediately but only after some time. For virtual machines and file systems this is unacceptable.

    Additionally, we wanted our storage system to have a number of useful properties:
    • It must be able to grow as needed: storage should expand dynamically as new drives are added.
    • It must be able to allocate more space than exists on any single drive.
    • It must be able to spread an allocated volume across several disks, because 100 TB will not fit on one disk.

    Spreading data across several disks brings a pitfall: as soon as we distribute the data array over many disks, the likelihood of data loss rises sharply. With one server and one drive, the probability of that drive failing is fairly small. But with 100 disks in the storage, the probability that at least one of them dies becomes substantial, and it grows with every disk added. Accordingly, the data must be stored redundantly, in several copies.
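The effect is easy to quantify. A quick sketch, assuming an illustrative 3% per-disk failure rate over some period (the number is ours, not from the text):

```python
# Probability that at least one of n disks fails, given a per-disk
# failure probability p over some fixed period.
p = 0.03  # assumed per-disk failure rate, purely illustrative

def prob_any_failure(n_disks, p=p):
    # P(at least one fails) = 1 - P(none fail)
    return 1 - (1 - p) ** n_disks

one = prob_any_failure(1)        # ~3% for a single disk
hundred = prob_any_failure(100)  # ~95% for a 100-disk cluster
```

With one disk the risk is marginal; with a hundred disks, at least one failure becomes near certain, which is exactly why replication is mandatory.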

    Distributing data across many disks also has its advantages. For example, a failed disk's contents can be rebuilt by reading from many live disks in parallel. In a cloud storage system such an operation typically takes minutes, unlike traditional RAID arrays, which need hours or even days to rebuild. Roughly speaking, the probability of data loss grows with the square of the recovery time, so the faster we restore the data, the less likely it is to be lost.

    So it was decided: we split the entire data array into chunks of a fixed size (64-128 MB in our case), replicate each chunk in a configured number of copies, and distribute the copies across the cluster.
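The chunking scheme can be sketched as follows. The chunk size comes from the text; the round-robin placement policy is a toy of ours, not the real PCS allocator.

```python
# Cut a large image into fixed-size chunks and assign each chunk to
# `replicas` distinct servers (toy round-robin placement).
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, lower bound of the range in the text

def split_into_chunks(image_size, chunk_size=CHUNK_SIZE):
    # Returns (offset, length) pairs covering the whole image.
    return [(off, min(chunk_size, image_size - off))
            for off in range(0, image_size, chunk_size)]

def place_replicas(n_chunks, servers, replicas=3):
    # Map each chunk index to a list of distinct servers.
    return {c: [servers[(c + r) % len(servers)] for r in range(replicas)]
            for c in range(n_chunks)}

chunks = split_into_chunks(10 * 1024**3)  # a 10 GB VM image -> 160 chunks
placement = place_replicas(len(chunks), ["s1", "s2", "s3", "s4", "s5"])
```

A real allocator would balance by free space and failure domains; the point here is only the shape of the data: fixed-size chunks, each with several homes.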

    Next, we simplified everything as much as possible. First of all, it was clear that we did not need a general-purpose POSIX-compatible file system; we only needed to optimize the system for large objects, virtual machine images tens of gigabytes in size. Since the images themselves are created, deleted, and renamed infrequently, metadata changes are rare, so these operations need not be optimized, which matters a great deal. Inside a container or virtual machine, the object holds its own file system, which is optimized for metadata changes, provides standard POSIX semantics, and so on.

    Attack of the clones

    You often hear that at providers only individual cluster nodes fail. Alas, it also happens that an entire data center loses power and the whole cluster goes down. From the very beginning we decided to account for such failures. Let us step back and ask: why is it hard to guarantee strict consistency of stored data in a distributed storage system? And why did the large providers (the same Amazon S3) start with eventual-consistency storage?

    The problem is simple. Suppose we have three servers, each storing a copy of an object. At some point the object is modified. Two of the three servers manage to record the change; the third was unavailable for some reason. Then the rack or the whole data center loses power, and when it returns the servers start booting. It may happen that the server that missed the change boots first. If no special measures are taken, the client can end up reading an old, stale copy of the data.

    If we split a large object (a virtual machine image in our case) into several pieces, things only get more complicated. Suppose the distributed storage divides an image into two fragments, writing which involves six servers. If not every server manages to record its change, then after a power failure, as the servers boot, we already have four possible combinations of the two fragments, and two of those combinations never existed at all. A file system is simply not designed for this.

    So we must somehow attach versions to objects. This can be implemented in several ways. The first is to use a transactional file system (for example, BTRFS) and update the version together with the data. That is correct, but with traditional (rotating) hard drives it is slow: performance drops several-fold. The second option is to use a consensus algorithm (for example, Paxos) so that the servers performing the modification agree among themselves. This is also slow. The servers holding the data cannot track fragment versions themselves, because they do not know whether anyone else has changed the data. So we concluded that data versions should be maintained somewhere off to the side.

    Versions are tracked by the metadata server (MDS). The version does not need to be bumped on every successful write; it changes only when one of the servers failed to record a write and we exclude it. In normal operation, therefore, data is written at the maximum possible speed.
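A toy model of this versioning rule, with class and method names invented for illustration:

```python
# The MDS bumps a chunk's version only when a replica misses a write;
# on read, a replica whose version lags the MDS's is rejected as stale.
class MDS:
    def __init__(self):
        self.version = {}  # chunk id -> current version

    def register_chunk(self, chunk):
        self.version[chunk] = 1

    def replica_failed_write(self, chunk):
        # A replica missed an update: bump the version so the lagging
        # copy can never be served as current after a restart.
        self.version[chunk] += 1
        return self.version[chunk]

    def is_current(self, chunk, replica_version):
        return replica_version == self.version[chunk]

mds = MDS()
mds.register_chunk("c1")
# Writes that land on all replicas leave the version untouched (fast path).
v = mds.replica_failed_write("c1")  # one replica was down during a write
stale = mds.is_current("c1", 1)     # the lagging replica -> rejected
fresh = mds.is_current("c1", v)     # the up-to-date replicas -> accepted
```

This is why the stale-replica-boots-first scenario above is harmless: the old copy carries version 1 while the MDS insists on version 2.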

    All these conclusions add up to the solution's architecture. It consists of three components:
    • Clients, which talk to the storage over ordinary Ethernet.
    • The MDS, which knows where each file lives and which fragment versions are current.
    • The fragments themselves, distributed across the local hard drives of the servers.

    Obviously, the MDS is the bottleneck of the entire architecture: it knows everything about the system. Therefore it must be made highly available, so that the system survives the failure of the machine it runs on. There was a temptation to put the MDS state into MySQL or another popular database. But that is not our way: SQL servers are quite limited in queries per second, which immediately kills the system's scalability, and building a reliable cluster of SQL servers is harder still. We looked at the approach described in the GoogleFS paper. In its original form, GoogleFS did not fit our task: it does not provide the strict consistency we need, since it was designed for appending data for search indexing, not for modifying it. Our solution was this: the MDS keeps the complete state of all objects and their versions entirely in memory. That turns out to be not much: each fragment is described by just 128 bytes, so the state of a multi-petabyte storage fits in the memory of a modern server, which is acceptable. MDS state changes are written to a metadata journal, which grows, albeit relatively slowly.
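The memory claim is easy to check with the numbers given in the text (128 MB chunks, 128 bytes of metadata per chunk):

```python
# Back-of-the-envelope check: how much MDS memory does a petabyte need?
CHUNK = 128 * 1024**2   # 128 MB chunks (upper bound of the range in the text)
META_PER_CHUNK = 128    # bytes of in-memory state per chunk

def mds_memory_bytes(storage_bytes):
    return (storage_bytes // CHUNK) * META_PER_CHUNK

one_pb = mds_memory_bytes(1024**5)
# 1 PB / 128 MB = 8,388,608 chunks x 128 B = exactly 1 GiB of metadata
```

So even several petabytes of storage need only a few gigabytes of MDS RAM, well within a single modern server.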

    Our task is to make several MDS instances agree among themselves and keep the service available even if one of them crashes. But there is a problem: the journal file cannot grow forever; something has to be done with it. So a new journal is created and all changes start going there, while at the same time a snapshot of the in-memory state is created and written out asynchronously. Once the snapshot is complete, the old journal can be deleted. After a crash, you replay the snapshot and then the journal on top of it. When the new journal grows large again, the procedure repeats.
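The rotation-and-recovery cycle can be sketched like this; the function names are ours, and real journals and snapshots live on disk, not in Python objects:

```python
# Journal rotation: start a new journal, snapshot the in-memory state,
# drop the old journal. Recovery = snapshot + replay of current journal.
state, journal, snapshot = {}, [], {}

def apply(entry):
    key, value = entry
    state[key] = value
    journal.append(entry)  # every state change is journaled

def rotate():
    global journal, snapshot
    journal = []           # new changes go to a fresh journal
    snapshot = dict(state) # snapshot written out asynchronously in reality
    # ...once the snapshot is durable, the old journal can be deleted

def recover():
    # After a crash: load the snapshot, then replay the journal on top.
    recovered = dict(snapshot)
    for key, value in journal:
        recovered[key] = value
    return recovered

apply(("c1", 1)); apply(("c2", 1))
rotate()
apply(("c2", 2))           # recorded only in the new journal
result = recover()
```

The invariant is that snapshot plus current journal always reproduces the full state, so the old journal is redundant the moment the snapshot is durable.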

    More hints

    Here are a couple of tricks that we implemented in the process of creating our distributed cloud storage.

    How do we achieve high write speed for fragments? Distributed storage typically keeps three copies of the data. A naive approach would have the client itself send the new data to all three servers. That is slow: three times slower than the network bandwidth. With gigabit Ethernet we would move only 30-40 MB per second per copy, an unimpressive result, since even an HDD writes significantly faster. To use the hardware and the network more efficiently, we used chain replication. The client sends data to the first server, which, as soon as it receives the first piece (64 KB), writes it to disk and simultaneously forwards it down the chain to the other servers. As a result, a large request starts being written as early as possible and streams to the other participants asynchronously. The hardware ends up running at about 80% of peak performance even with three copies of the data. It also helps that Ethernet is full duplex, receiving and sending simultaneously, so in practice it carries two gigabits per second rather than one: the server that receives data from the client forwards it down the chain at the same rate.
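The data flow of such a chained write can be sketched as follows. This is a synchronous toy model; in the real system each hop forwards pieces asynchronously while still receiving the next ones.

```python
# Chain replication sketch: the client streams 64 KB pieces to the
# first server; each server writes a piece and forwards it down the
# chain, so all copies are written nearly in parallel.
PIECE = 64 * 1024

def chain_write(data, chain):
    for i in range(0, len(data), PIECE):
        piece = data[i:i + PIECE]
        for server in chain:      # each hop writes, then forwards
            server.append(piece)  # (append stands in for a disk write)

replicas = [[], [], []]           # stand-ins for three servers' disks
chain_write(b"x" * (256 * 1024), replicas)
```

The key property is that the client only ever pushes one copy's worth of traffic; fan-out to the other replicas rides on the servers' own full-duplex links.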

    SSD caching. SSDs help speed up any storage. The idea is not new, though it is worth noting that there are no really good open-source implementations, while SAN arrays have used such caching for a long time. The trick relies on SSDs delivering random-access performance orders of magnitude higher than HDDs can. Start caching data on an SSD and you get roughly a tenfold speedup. In addition, we compute checksums of all data and store them on the SSDs as well. By periodically re-reading all the data, we verify its availability and its checksums. This is sometimes called scrubbing (as a skin scrub removes loose particles, this removes latent defects); it increases the reliability of the system and catches errors before the data is actually needed.
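A minimal sketch of such a scrubbing pass, using CRC32 purely for illustration (the text does not say which checksum PCS uses):

```python
# Scrubbing sketch: checksums are stored separately from the data
# (on SSD in PCS); a background pass re-reads every chunk and flags
# any whose contents no longer match the recorded checksum.
import zlib

chunks = {"c1": b"alpha", "c2": b"beta"}  # stand-ins for on-disk chunks
checksums = {cid: zlib.crc32(data) for cid, data in chunks.items()}

def scrub():
    # Re-read all data and report chunks failing their checksum.
    return [cid for cid, data in chunks.items()
            if zlib.crc32(data) != checksums[cid]]

ok = scrub()             # no corruption yet -> empty list
chunks["c2"] = b"bets"   # simulate silent on-disk corruption
bad = scrub()            # the corrupted chunk is detected
```

Once a bad chunk is flagged, a healthy replica elsewhere in the cluster can be used to rewrite it, which is exactly what makes early detection valuable.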

    There is another reason SSD caching matters for virtual machines. For object stores such as Amazon S3 and its analogues, per-request latency is not very important: they are usually accessed over the Internet, where an extra 10 ms is simply not noticeable. But for a virtual machine on a hosting provider's server, when the OS issues a series of consecutive synchronous requests, the delays add up and become very noticeable. Moreover, most scripts and applications are synchronous by nature: they perform operations one after another rather than in parallel. As a result, whole seconds or even tens of seconds show up in the user interface or in responses to user actions.
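The arithmetic behind this is simple; the request counts and latencies below are illustrative, not measurements from the text:

```python
# Synchronous requests do not overlap, so per-request latency
# multiplies by the number of requests.
def total_delay_ms(n_requests, latency_ms):
    return n_requests * latency_ms

web_request = total_delay_ms(1, 10)      # one Internet round trip: invisible
vm_workload = total_delay_ms(2000, 10)   # 2000 sequential I/Os at 10 ms: 20 s
vm_with_ssd = total_delay_ms(2000, 0.5)  # same workload at SSD-class latency: 1 s
```

A single 10 ms request is imperceptible, but two thousand of them in a row is 20 seconds of wall-clock stall, which is why shaving per-request latency with an SSD cache changes the user experience dramatically.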

    Results & Conclusions

    In the end we got a data storage system whose properties make it suitable for hosters to deploy on their own infrastructure:

    • The ability to run virtual machines and containers directly from the storage, since our system provides strict consistency semantics.
    • The ability to scale to petabyte volumes across multiple servers and drives.
    • Support for data checksums and SSD caching.

    A few words about performance. It turned out to be comparable to enterprise storage arrays. On a cluster of 14 servers with 4 HDDs each (Intel hardware) we measured 13,000 random I/O operations per second with small (4 KB) requests on ordinary SATA hard drives, which is quite a lot. SSD caching sped up the same storage by almost two orders of magnitude, approaching 1 million I/O operations per second. Rebuilding one terabyte of data took 10 minutes, and the more disks, the faster the recovery.

    A comparable SAN array costs from several hundred thousand dollars. Yes, a large company can afford to buy and maintain one, but we built our distributed storage for hosters who want a solution with similar characteristics for much less money and on their existing hardware.


    As promised at the beginning, "to be continued": a post about how the storage layer of Parallels Cloud Server was developed is in the works. Suggestions for what you would like to read about are welcome in the comments; I will try to take them into account.
