How DataLine's S3 storage works



    Hello, Habr!

    It's no secret that modern applications handle huge amounts of data, and the flow keeps growing. This data has to be stored and processed, often by a large number of machines, and that is not an easy task. Cloud object storage exists to solve it, and such storage is usually an implementation of Software Defined Storage (SDS) technology.

    At the beginning of 2018, we launched (and really did launch!) our own 100% S3-compatible storage based on Cloudian HyperStore. As it turned out, there are very few Russian-language publications about Cloudian itself, and even fewer about actually running this solution.

    Today, based on DataLine's experience, I will tell you about the architecture and internals of the Cloudian software, including how its SDS implementation builds on a number of architectural ideas from Apache Cassandra. Separately, we will look at the most interesting part of any SDS storage: the logic of fault tolerance and object distribution.

    If you are building your own S3 storage or are busy maintaining one, this article will come in handy.

    First of all, I will explain why our choice fell on Cloudian. It's simple: there are very few worthy options in this niche. For example, a couple of years ago, when we were just planning the build, there were only three options:

    • Ceph + RADOS Gateway;
    • MinIO;
    • Cloudian HyperStore.

    For us, as a service provider, the decisive factors were a high degree of API compatibility with the original Amazon S3, built-in billing, scalability with multi-region support, and the availability of third-line vendor support. Cloudian has all of this.

    And yes, the most important thing (of course!) is that DataLine and Cloudian have similar corporate colors. You must admit that we could not resist such beauty.



    Unfortunately, Cloudian is not the most widespread software, and there is practically no information about it in RuNet. Today we will correct this injustice: we will talk about the features of the HyperStore architecture, examine its most important components, and deal with the main architectural nuances. Let's start with the most basic question: what is Cloudian under the hood?

    How Cloudian HyperStore Storage Works


    Let's take a look at the diagram and see how the Cloudian solution works.


    Diagram of the main storage components.

    As we can see, the system consists of several main components:

    • Cloudian Management Console (CMC) - the management console;
    • Admin Service - the internal administration module;
    • S3 Service - the module responsible for supporting the S3 protocol;
    • HyperStore Service - the actual storage service;
    • Apache Cassandra - the centralized store of service data;
    • Redis - a store for the most frequently read service data.

    Of greatest interest to us are the two main services, S3 Service and HyperStore Service; we will examine their work in detail later. But first it makes sense to find out how services are distributed across the cluster and what fault tolerance and data durability this solution provides as a whole.




    By common services in the diagram above we mean the S3, HyperStore, CMC, and Apache Cassandra services. At first glance, everything looks beautiful and neat. But on closer inspection it turns out that only a single node failure is handled gracefully. The simultaneous loss of two nodes can be fatal for cluster availability: Redis QoS (on node 2) has only one slave (on node 3). The picture is the same for the risk of losing cluster management: the Puppet Server runs on only two nodes (1 and 2). However, the probability of two nodes failing at once is very low, and you can live with it.

    Nevertheless, to increase the reliability of the storage, we use 4 nodes at DataLine instead of the minimum three. The resulting resource distribution looks like this:



    One more nuance is immediately evident: Redis Credentials is placed not on every node (as one might assume from the official diagram above), but only on 3 of them. Meanwhile, Redis Credentials is consulted on every incoming request. It turns out that, because the fourth node has to reach out to someone else's Redis, there is a slight imbalance in its performance.

    For us, this is not yet significant. During stress testing we did not notice any meaningful differences in node response times, but on large clusters of dozens of working nodes it makes sense to correct this nuance.

    This is what the migration scheme looks like on 6 nodes:



    The diagram shows how service migration is implemented in case of a node failure. Only the failure of one server per role is taken into account; if both servers fail, manual intervention is required.

    Here, too, things were not without subtleties. Role migration relies on Puppet, so if you lose it or accidentally break it, automatic failover may not work. For the same reason, you should not hand-edit the manifests of the built-in Puppet: it is not entirely safe, since changes can be silently overwritten, because the manifests are managed through the cluster admin panel.

    From the point of view of data durability, things are much more interesting. Object metadata is stored in Apache Cassandra, and each record is replicated to 3 of the 4 nodes. A replication factor of 3 is also used for the data itself, but you can configure a higher one. This ensures data safety even if 2 of the 4 nodes fail simultaneously. And if you manage to rebalance the cluster in time, you can lose nothing even with a single remaining node. The main thing is to have enough space.



    This is what happens when two nodes fail. The diagram clearly shows that even in this situation, the data remains safe.  

    At the same time, the availability of data and storage depends on the consistency strategy. It is configured separately for data and metadata, for reads and for writes.

    The valid options are one node, quorum, or all nodes.
    This setting determines how many nodes must confirm a write or read for the request to be considered successful. We use quorum as a reasonable compromise between request processing time and write reliability / read consistency. That is, of the three nodes involved in an operation, a consistent answer from 2 of them is enough for error-free operation. Accordingly, to stay afloat after the failure of more than one node, you would have to switch to a one-node write/read strategy.
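
    To make the arithmetic explicit, here is a minimal sketch (my own illustration, not Cloudian code) of how these consistency levels behave with a replication factor of 3:

```python
# A toy model of consistency levels versus failed replicas (illustration only).

def required_acks(level: str, replication_factor: int = 3) -> int:
    """How many replica nodes must confirm an operation for it to succeed."""
    if level == "one":
        return 1
    if level == "quorum":
        return replication_factor // 2 + 1   # 2 of 3
    if level == "all":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")

def operation_succeeds(level: str, alive_replicas: int) -> bool:
    return alive_replicas >= required_acks(level)

assert operation_succeeds("quorum", alive_replicas=2)      # one replica down: OK
assert not operation_succeeds("quorum", alive_replicas=1)  # two replicas down: fail
assert operation_succeeds("one", alive_replicas=1)         # the fallback strategy
```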

    Request processing in Cloudian


    Below we will consider the two flows for processing incoming requests in Cloudian HyperStore, PUT and GET. This is the main job of the S3 Service and the HyperStore Service.

    Let's start with how the write request is processed:



    Surely you have noticed that each request generates a lot of checks and metadata lookups, at least 6 calls from component to component. This is where the write latency and high CPU time consumption when working with small files come from.

    Large files are transferred in chunks. Individual chunks are not treated as separate requests, and some of the checks are skipped.

    The node that received the initial request then determines on its own where and what to write, even if the data is not written to that node itself. This hides the internal organization of the cluster from the end client and makes it possible to use external load balancers. All of this has a positive effect on the maintainability and fault tolerance of the storage.
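
    Since the storage is 100% S3-compatible, the client side of this flow is ordinary S3 traffic. Here is a minimal sketch using boto3; the endpoint URL, bucket name, file name, and credentials are placeholders made up for illustration:

```python
import boto3

# The external load balancer hides the cluster topology behind one endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# A small object goes in a single PUT: one pass through the full check chain.
s3.put_object(Bucket="demo-bucket", Key="small.txt", Body=b"hello")

# A large object is uploaded in parts, matching the chunked transfer above.
mpu = s3.create_multipart_upload(Bucket="demo-bucket", Key="big.bin")
parts = []
with open("big.bin", "rb") as f:
    part_number = 1
    while chunk := f.read(16 * 1024 * 1024):  # 16 MiB parts
        resp = s3.upload_part(
            Bucket="demo-bucket", Key="big.bin",
            UploadId=mpu["UploadId"], PartNumber=part_number, Body=chunk,
        )
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1
s3.complete_multipart_upload(
    Bucket="demo-bucket", Key="big.bin",
    UploadId=mpu["UploadId"], MultipartUpload={"Parts": parts},
)
```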



    As you can see, the read logic is not too different from the write logic, and it shows the same high sensitivity of performance to the size of the processed objects. Therefore, thanks to the significant savings on metadata work, it is much cheaper to fetch one large object cut into chunks than many separate objects of the same total volume.
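
    To illustrate this on the client side, a sketch reusing the boto3 client from the previous example (object names are again made up): reading one large object costs a single pass through the request pipeline, while reading many small objects repeats that pipeline for every object.

```python
# One 1 GiB object: a single request, streamed in chunks.
body = s3.get_object(Bucket="demo-bucket", Key="big.bin")["Body"]
while chunk := body.read(16 * 1024 * 1024):
    pass  # process the chunk

# A thousand 1 MiB objects of the same total volume: a thousand full
# passes through the checks and metadata lookups shown in the diagram.
for i in range(1000):
    s3.get_object(Bucket="demo-bucket", Key=f"small-{i:04d}.bin")["Body"].read()
```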

    Data storage and redundancy


    As you can see from the diagrams above, Cloudian supports different schemes of data storage and redundancy:

    Replication - the system maintains a configurable number of copies of each data object and stores each copy on a different node. For example, with 3X replication, 3 copies of each object are created, and each copy “lives” on its own node.

    Erasure Coding - each object is encoded into a configurable number of data fragments (known as the K number) plus a configurable number of parity fragments (the M number). Each of the K + M fragments of an object is unique, and each fragment is stored on its own node. An object can be decoded from any K fragments. In other words, the object remains readable even if M nodes are inaccessible.

    For example, with erasure coding by the 4 + 2 formula (4 data fragments plus 2 parity fragments), each object is split into 6 unique fragments stored on six different nodes, and the object can be restored and read as long as any 4 of the 6 fragments are available.

    The advantage of erasure coding over replication is space savings, albeit at the cost of a significant increase in processor load, worse response times, and the need for background procedures that verify object consistency. In any case, metadata is stored separately from the data (in Apache Cassandra), which increases the flexibility and reliability of the solution.
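
    To make the trade-off concrete, a back-of-the-envelope calculation (my own illustration, not vendor figures): 3X replication and 4+2 erasure coding both survive the loss of two nodes, but their raw-space costs differ by a factor of two.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte with N-copy replication."""
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte with K+M erasure coding."""
    return (k + m) / k

print(replication_overhead(3))  # 3.0 -> 1 TB of data occupies 3 TB of raw space
print(erasure_overhead(4, 2))   # 1.5 -> 1 TB of data occupies 1.5 TB of raw space
```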

    Briefly about other HyperStore features


    As I wrote at the beginning of this article, several useful tools are built into HyperStore. Among them:

    • Flexible billing, with resource prices that can depend on volume and tariff plan;
    • Built-in monitoring;
    • The ability to limit resource usage for users and user groups;
    • QoS settings and built-in procedures for balancing resource usage between nodes, as well as regular rebalancing procedures between nodes and between the disks of a node, or when new nodes join the cluster.

    However, Cloudian HyperStore is still not perfect. For example, for some reason you cannot move an existing account to another group or assign multiple groups to one account. It is also impossible to generate interim billing reports: all reports become available only after the reporting period closes. So neither the clients nor we can find out in real time how much a bill has grown.

    Cloudian HyperStore Logic


    Now we will dive even deeper and look at the most interesting part of any SDS storage: the logic of distributing objects across nodes. In Cloudian storage, metadata is kept separately from the data itself. Cassandra is used for the metadata; the proprietary HyperStore solution is used for the data.

    Unfortunately, there is still no official Russian translation of the Cloudian documentation, so below I give my own translation of its most interesting parts.

    The role of Apache Cassandra in HyperStore


    In HyperStore, Cassandra is used to store object metadata, user account information, and service usage data. In a typical deployment, on each HyperStore node the Cassandra data is stored on the same drive as the OS. The system also supports keeping Cassandra data on a dedicated drive on each node. Cassandra data is not stored on HyperStore data disks. When vNodes are assigned to a host, they are distributed only across the HyperStore data disks; no vNodes are assigned to the drive where the Cassandra data lives.
    Within the cluster, the metadata in Cassandra is replicated according to the policy (strategy) of your storage. Cassandra data replication uses vNodes as follows (a small sketch after this list shows the first step in code):

    • When a new Cassandra object is created (a primary key and its corresponding values), it is hashed, and the hash is used to associate the object with a specific vNode. The system checks which host that vNode is assigned to, and the first replica of the Cassandra object is stored on the Cassandra drive of that host.
    • For example, suppose a host is assigned 96 vNodes distributed across several HyperStore data disks. Cassandra objects whose hash values fall within the token ranges of any of these 96 vNodes will be written to the Cassandra drive of this host.
    • Additional replicas of the Cassandra object (the number of replicas depends on your configuration) are associated with the vNodes that follow in sequence and are stored on the nodes those vNodes are assigned to, skipping vNodes where necessary so that each replica of the Cassandra object is stored on a different host machine.
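
    Here is a minimal sketch of that first step, assuming a made-up toy ring in the simplified 0..960 token space used later in this article. This is not Cloudian code, just the textbook "lowest token greater than or equal to the hash" rule (the hash function here is MD5 purely for illustration):

```python
import bisect
import hashlib

# Toy ring: token -> host owning that vNode (made-up values).
ring = {5: "host1", 325: "host3", 330: "host2", 620: "host1", 945: "host2"}
tokens = sorted(ring)

def owner_vnode(primary_key: bytes, token_space: int = 960) -> tuple[int, str]:
    """Hash the key into the token space and find the owning vNode."""
    h = int.from_bytes(hashlib.md5(primary_key).digest(), "big") % token_space
    i = bisect.bisect_left(tokens, h) % len(tokens)  # wrap past the highest token
    return tokens[i], ring[tokens[i]]

token, host = owner_vnode(b"user:42")
print(f"first replica goes to vNode {token} on {host}")
```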

    How HyperStore Storage Works


    The placement and replication of S3 objects in a HyperStore cluster is based on a consistent hashing scheme that uses an integer token space from 0 to 2^127 - 1. Integer tokens are assigned to HyperStore nodes. For each S3 object, a hash is calculated as it is uploaded to the storage. The object is stored on the node that was assigned the lowest token value greater than or equal to the object's hash value. Replication is implemented by also storing the object on the nodes that were assigned the next tokens in ascending order.

    In “classic” storage based on consistent hashing, one token is assigned to one physical node. The Cloudian HyperStore system uses and extends the “virtual node” (vNode) functionality introduced in Cassandra 1.2: a large number of tokens (up to 256) is assigned to each physical host. In effect, the storage cluster consists of a very large number of “virtual nodes”, with many virtual nodes (tokens) on each physical host.

    The HyperStore system assigns a separate set of tokens (virtual nodes) to each disk on each physical host. Each disk on the host is responsible for its own set of object replicas. A disk failure affects only the replicas stored on it; the other drives on the host continue to operate and carry out their data storage duties.

    Let's take an example: a cluster of 6 HyperStore hosts, each of which has 4 S3 storage disks. Assume that 32 tokens are assigned to each physical host and that there is a simplified token space from 0 to 960; the 192 tokens in this system (6 hosts with 32 tokens each) then take the values 0, 5, 10, 15, 20, and so on up to 955.

    The diagram below shows one possible distribution of tokens across the cluster: the 32 tokens of each host are spread evenly across its 4 disks (8 tokens per disk), and the tokens themselves are distributed randomly across the cluster.



    Now suppose you configured HyperStore for 3X replication of S3 objects. Let's agree that an S3 object is uploaded into the system, and the hash algorithm applied to its name gives the hash value 322 (in this simplified hash space). The diagram below shows how the three instances, or replicas, of the object will be stored in the cluster (a code sketch after the list reproduces this walk):

    • With its name hash of 322, the first replica of the object is stored at token 325, because that is the lowest token value greater than or equal to the hash value of the object. Token 325 (highlighted in red in the diagram) is assigned to hyperstore2:Disk2, so the first replica of the object is stored there.

    • The second replica is stored on the disk assigned the next token (330, highlighted in orange), that is, on hyperstore4:Disk2.
    • The third replica is saved to the disk assigned the token that follows 330, namely 335 (yellow): hyperstore3:Disk3.
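
    This walk is easy to reproduce in code. Below is a sketch under the same simplified assumptions (token space 0..960, a hand-picked slice of the example ring); the rule about skipping already-used hosts is included, but this is an illustration, not Cloudian's actual implementation:

```python
import bisect

TOKEN_SPACE = 960

# A slice of the example ring: token -> (host, disk), values from the diagram.
ring = {
    325: ("hyperstore2", "Disk2"),
    330: ("hyperstore4", "Disk2"),
    335: ("hyperstore3", "Disk3"),
    340: ("hyperstore2", "Disk1"),
}
tokens = sorted(ring)

def place_replicas(obj_hash: int, replicas: int = 3):
    """Walk the ring from the object's hash, placing one replica per host."""
    placements, used_hosts = [], set()
    i = bisect.bisect_left(tokens, obj_hash % TOKEN_SPACE)
    while len(placements) < replicas:
        token = tokens[i % len(tokens)]
        host, disk = ring[token]
        if host not in used_hosts:  # skip vNodes of already-used hosts
            placements.append((token, host, disk))
            used_hosts.add(host)
        i += 1
    return placements

print(place_replicas(322))
# [(325, 'hyperstore2', 'Disk2'), (330, 'hyperstore4', 'Disk2'),
#  (335, 'hyperstore3', 'Disk3')]
```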


    Comment: from a practical point of view, this optimization (distributing tokens not only across physical nodes but also across individual disks) is needed not only for availability, but also to ensure an even distribution of data between disks. No RAID array is used here; the entire logic of placing data on disks is controlled by HyperStore itself. On the one hand, this is convenient and controllable: if a disk is lost, everything rebalances on its own. On the other hand, I personally trust good RAID controllers more; after all, their logic has been optimized over many years. But these are my personal preferences; we never caught HyperStore making real mistakes as long as we followed the vendor's recommendations when installing the software on physical servers. An attempt to use virtualization, with virtual disks on top of a single LUN on a storage array, did fail, however.

    How disks are organized inside the cluster


    Recall that each host has 32 tokens and that each host's tokens are evenly distributed among its disks. Let's take a closer look at hyperstore2:Disk2 (in the diagram below). We can see that tokens 325, 425, 370 and so on are assigned to this disk.

    Since the cluster is configured for 3X replication, the following will be stored on hyperstore2:Disk2 (a sketch below computes these ranges):

    For the disk's token 325:
    • first replicas of objects with hash values from 320 (exclusive) to 325 (inclusive);
    • second replicas of objects with hash values from 315 (exclusive) to 320 (inclusive);
    • third replicas of objects with hash values from 310 (exclusive) to 315 (inclusive).

    For the disk's token 425:
    • first replicas of objects with hash values from 420 (exclusive) to 425 (inclusive);
    • second replicas of objects with hash values from 415 (exclusive) to 420 (inclusive);
    • third replicas of objects with hash values from 410 (exclusive) to 415 (inclusive).

    And so on.

    As noted earlier, when placing second and third replicas, HyperStore may in some cases skip tokens so as not to store more than one copy of an object on the same physical node. This rules out using hyperstore2:Disk2 as storage for a second or third replica of the same object.
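
    The ranges above follow from simple arithmetic. Here is a sketch, assuming the idealized evenly spaced tokens of this example and ignoring the host-skipping exception just mentioned:

```python
TOKEN_STEP = 5  # spacing between consecutive tokens in the simplified space

def ranges_for_token(token: int, replicas: int = 3):
    """(exclusive, inclusive] hash ranges whose n-th replica lands on this token."""
    return [
        (n, token - TOKEN_STEP * n, token - TOKEN_STEP * (n - 1))
        for n in range(1, replicas + 1)
    ]

for n, lo, hi in ranges_for_token(325):
    print(f"replica #{n}: hashes in ({lo}, {hi}]")
# replica #1: hashes in (320, 325]
# replica #2: hashes in (315, 320]
# replica #3: hashes in (310, 315]
```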



    If Disk 2 fails, Disks 1, 3, and 4 will continue to store and serve data, and the objects from Disk 2 will still be available in the cluster, because they have been replicated to other hosts.
    Comment: as a result, the distribution of object replicas and/or fragments in a HyperStore cluster follows the design of Cassandra, adapted to the needs of file storage. To decide where an object physically goes, a hash of its name is taken, and based on its value the numbered "tokens" for placement are chosen. The tokens are distributed randomly across the cluster in advance in order to balance the load. When choosing a token number for placement, the restrictions on placing replicas and parts of the same object on the same physical nodes are taken into account. Unfortunately, this design has a side effect: if you need to add or remove a node in the cluster, the data has to be reshuffled, and this is a rather resource-intensive process.

    One storage system across multiple data centers


    Now let's see how HyperStore works across several data centers and regions. In our case, multi-data-center mode differs from multi-region mode in the number of token spaces used. In the first case, the token space is shared. In the second, each region has an independent token space with (potentially) its own consistency settings, capacity, and storage configurations.
    To understand how this works, let us again turn to the translated documentation, the “Multi-Data Center Deployments” section.

    Consider deploying HyperStore in two data centers; call them DC1 and DC2. Each data center has 3 physical nodes. As in our previous examples, each physical node has four disks, 32 tokens (vNodes) are assigned to each host, and we assume a simplified token space from 0 to 960. In this multi-data-center scenario, the token space is divided into 192 tokens: 32 tokens for each of the 6 physical hosts. The tokens are distributed across the hosts completely at random.

    Also suppose that the replication of S3 objects is in this case configured as two replicas in each data center.

    Let's look at how a hypothetical S3 object with a hash value of 942 will be replicated across the 2 data centers (a sketch after the list reproduces the walk):

    • The first replica is stored at vNode 945 (marked red in the diagram below), which is located in DC2, on hyperstore5:Disk3.
    • The second replica is stored at vNode 950 (marked orange), in DC2, on hyperstore6:Disk4.
    • The next vNode, 955, is located in DC2, where the configured replication level has already been reached, so this vNode is skipped.
    • The third replica lands at vNode 0 (yellow), in DC1, on hyperstore2:Disk3. Note that after the token with the highest number (955) comes the token with the lowest number (0).
    • The next vNode (5) is located in DC2, where the configured replication level has already been reached, so this vNode is skipped too.
    • The fourth and last replica is stored at vNode 10 (green), in DC1, on hyperstore3:Disk3.
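
    The same walk can be sketched with a per-data-center quota. As before, this is an illustration under the article's simplified assumptions (a hand-picked slice of the ring; the same-host restriction shown earlier is omitted for brevity), not Cloudian's actual code:

```python
import bisect

TOKEN_SPACE = 960

# A slice of the two-DC example ring: token -> (dc, host, disk).
ring = {
    945: ("DC2", "hyperstore5", "Disk3"),
    950: ("DC2", "hyperstore6", "Disk4"),
    955: ("DC2", "hyperstore4", "Disk1"),
    0:   ("DC1", "hyperstore2", "Disk3"),
    5:   ("DC2", "hyperstore6", "Disk2"),
    10:  ("DC1", "hyperstore3", "Disk3"),
}
tokens = sorted(ring)

def place(obj_hash: int, per_dc: int = 2):
    """Walk the ring, skipping vNodes in a DC whose replica quota is filled."""
    quotas = {"DC1": per_dc, "DC2": per_dc}
    placements = []
    i = bisect.bisect_left(tokens, obj_hash % TOKEN_SPACE)
    while sum(quotas.values()) > 0:
        token = tokens[i % len(tokens)]
        dc, host, disk = ring[token]
        if quotas[dc] > 0:  # otherwise this vNode is skipped
            placements.append((token, dc, host, disk))
            quotas[dc] -= 1
        i += 1
    return placements

for p in place(942):
    print(p)
# (945, 'DC2', 'hyperstore5', 'Disk3')
# (950, 'DC2', 'hyperstore6', 'Disk4')
# (0, 'DC1', 'hyperstore2', 'Disk3')
# (10, 'DC1', 'hyperstore3', 'Disk3')
```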


    Comment: another property of the scheme above is that, for storage policies spanning both data centers to work properly, the amount of space, the number of nodes, and the number of disks per node must match in both data centers. As I said above, the multi-region scheme does not have this limitation.
    This concludes our overview of the Cloudian architecture and its key features. In any case, this topic is too serious and too large for an exhaustive manual to fit into one Habr article. So if you are interested in details I have omitted, or you have questions or suggestions on presenting the material in future articles, I will be glad to talk with you in the comments.
    In the next article, we will look at the implementation of S3 storage at DataLine: we will talk in detail about the infrastructure and the network fault tolerance technologies used, and as a bonus I will tell you the story of how it was built!
