Russian distributed storage: how it works
This spring, the Radix team prepared and released the first version of its software for building a distributed block storage system running on Elbrus-4.4 server platforms built around Elbrus-4C microprocessors.
The value of such a combination is plain to see: storage systems built on Russian hardware and Russian software make an attractive offering on the domestic market, particularly for customers focused on import substitution.
However, the potential of the software is not limited to Russian server platforms. Compatibility with standard x86-64 servers, which dominate the market, is currently being tested. In addition, the product's functionality continues to be refined, which will eventually allow it to be offered outside the Russian market as well.
Below is a short overview of how this software solution (called RAIDIX RAIN) is organized: it combines local server media into a single fault-tolerant storage cluster with centralized management and both horizontal and vertical scaling.
Distributed Storage Features
Traditional storage systems, built as a single hardware-software complex, share a common scaling problem: performance is capped by the controllers, whose number is limited, while adding expansion shelves with drives increases capacity but not performance.
As capacity grows, the same fixed set of controllers must serve ever more access operations against a larger data volume, so overall storage performance per unit of data actually falls.
RAIDIX RAIN, by contrast, supports horizontal (scale-out) scaling: adding nodes (server units) increases not only capacity but also performance, and does so linearly. This is possible because each RAIDIX RAIN node contributes not just media but also the compute resources that handle I/O and data processing.
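As a back-of-the-envelope illustration of the difference (with made-up per-component numbers, not RAIDIX measurements), the two scaling models can be contrasted like this:

```python
# Illustrative sketch only: contrast scale-up vs scale-out storage.
# The per-controller and per-node IOPS figures are hypothetical.

def scale_up_iops(controllers, shelves, iops_per_controller=200_000):
    """Traditional storage: shelves add capacity, but IOPS stay capped
    by the fixed number of controllers (shelves do not appear in the result)."""
    return controllers * iops_per_controller

def scale_out_iops(nodes, iops_per_node=100_000):
    """Scale-out cluster: each node adds media AND compute, so IOPS grow linearly."""
    return nodes * iops_per_node

# Doubling the shelves leaves the traditional system's ceiling unchanged...
assert scale_up_iops(2, shelves=4) == scale_up_iops(2, shelves=8)
# ...while doubling the node count doubles the scale-out cluster's throughput.
assert scale_out_iops(12) == 2 * scale_out_iops(6)
```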
RAIDIX RAIN targets all the major application scenarios for distributed block storage: cloud storage infrastructure, high-load databases, and Big Data analytics. At sufficiently large data volumes, and given a matching customer budget, it can also compete with traditional storage systems.
Public and private clouds
The solution provides the elastic scaling required for cloud infrastructure deployment: performance, throughput, and storage volume increase with each node added to the system.
High-load databases
In an all-flash configuration, the RAIDIX RAIN cluster is an effective back end for high-load databases, and an affordable alternative to Oracle Exadata products for Oracle RAC.
Big Data Analytics
Combined with additional software, the solution can also serve big data analytics, providing noticeably higher performance and easier maintenance than an HDFS cluster.
RAIDIX RAIN supports two deployment options: dedicated (external, converged) and hyper-converged (HCI, hyper-converged infrastructure).
Dedicated deployment option
In the dedicated option, the RAIDIX RAIN cluster is classic software-defined storage. The solution is deployed on the required number of dedicated server nodes (at least three; the upper limit is practically unbounded), whose resources are used entirely for storage tasks.
Fig. 1. Dedicated deployment option
In this case the RAIDIX RAIN software is installed directly on bare metal. The applications and services that use RAIN to store data run on external hosts and connect to it over a storage network (the classic data-center architecture).
Hyper-Converged Deployment Option
The hyper-converged option co-locates compute (the hypervisor and production VMs) and storage resources (the software-defined storage) on a single set of nodes, which is relevant first of all for virtual infrastructures. With this approach the RAIN software is installed on every infrastructure host (node) as a virtual machine.
Fig. 2. Hyper-converged deployment option
RAIN cluster nodes communicate with each other and with the final consumers of storage resources (servers, applications) over iSCSI (IP, IPoIB), iSER (RoCE, RDMA), or NVMe-oF.
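For a dedicated deployment, attaching a consumer host over iSCSI would look roughly like this with the standard open-iscsi tools (the portal address and target IQN below are made up for illustration):

```shell
# Hypothetical example: attach a host to a RAIN volume over iSCSI.
# Discover targets exposed by a cluster node...
iscsiadm -m discovery -t sendtargets -p 10.0.0.21:3260
# ...then log in to the discovered target (IQN is a placeholder).
iscsiadm -m node -T iqn.2019-01.com.example:rain-vol0 -p 10.0.0.21:3260 --login
# The volume then appears as a new local block device (check with lsblk).
```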
The hyper-converged deployment option provides the following benefits:
- Consolidation of computing and storage resources (no need to implement and maintain a dedicated external storage system).
- Joint horizontal scaling of compute and storage resources.
- Ease of implementation and maintenance.
- Centralized management.
- Saving rack capacity and power consumption.
In terms of media, RAIDIX RAIN supports three configurations:
- All-flash: cluster nodes carry only flash media (NVMe, SSD);
- HDD: cluster nodes carry only hard drives;
- Hybrid: two independent storage tiers, on HDD and SSD.
High-performance fault tolerance
The core value of RAIDIX RAIN is the optimal balance of performance, fault tolerance and efficient use of storage capacity.
Within a client IT infrastructure, RAIDIX RAIN is also attractive because it exposes genuine block access, which distinguishes it from most comparable products on the market.
Currently, most competing products show high performance only when mirroring is used, which cuts usable capacity by a factor of two or more: two-way replication (mirroring) carries 50% capacity overhead, and three-way replication (double mirroring) carries 66.6%.
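The arithmetic behind these figures is straightforward; the sketch below (illustrative only, with a hypothetical k+m erasure-coding layout for comparison) makes it explicit:

```python
# Quick arithmetic behind the capacity figures above (illustrative only).

def replication_overhead(copies):
    """Fraction of raw capacity spent on redundancy with n-way replication."""
    return 1 - 1 / copies

assert replication_overhead(2) == 0.5               # mirroring: 50% overhead
assert round(replication_overhead(3), 3) == 0.667   # double mirroring: ~66.6%

def ec_overhead(data_chunks, parity_chunks):
    """Erasure coding k+m: only the m parity chunks out of k+m are overhead."""
    return parity_chunks / (data_chunks + parity_chunks)

# E.g. a hypothetical 10+2 (RAID-6-like) layout survives two lost chunks
# at roughly 16.7% overhead, versus 50% for plain mirroring.
assert round(ec_overhead(10, 2), 3) == 0.167
```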
The storage-optimization technologies implemented in distributed storage systems, such as EC (Erasure Coding), deduplication, and compression, typically cause significant performance degradation, which is unacceptable for latency-sensitive applications.
In practice such solutions therefore tend to run without these technologies, or enable them only for "cold" data.
RAIDIX RAIN was originally designed with a clear set of initial requirements for fault tolerance and system availability:
- A cluster of more than four nodes must survive the failure of at least two nodes; clusters of three or four nodes are guaranteed to survive the failure of one node.
- A node must survive the failure of at least two of its disks, provided the node holds at least five disks.
- Redundancy overhead in a typical cluster (16 nodes or more) must not exceed 30%.
- Data availability must be at least 99.999%.
This greatly influenced the existing product architecture.
Erasure Coding features in distributed storage
The main mechanism behind RAIDIX RAIN's fault tolerance is the company's own Erasure Coding technology, known from its flagship RAIDIX product and here applied to distributed storage. It delivers performance comparable to mirrored configurations, on both random and sequential workloads, while meeting the target fault-tolerance level and significantly increasing usable capacity: overhead is at most 30% of raw storage capacity.
The performance of RAIDIX EC on sequential operations deserves special mention, particularly when large SATA drives are used.
In general, RAIDIX RAIN offers three erasure-coding options:
- RAID 3, optimal for 3 nodes;
- RAID 4, optimal for 4 nodes;
- network RAID 6, the best choice for storage subclusters of 5 to 20 nodes.
Fig. 3. Erasure coding variants
All variants distribute data uniformly across the cluster nodes and add redundancy in the form of checksums (correction codes). This parallels the Reed-Solomon codes used in conventional RAID 6 arrays, which tolerate the failure of up to two drives; network RAID 6 works the same way but distributes data across cluster nodes and tolerates the failure of two nodes.
With network RAID 6, if one or two drives fail within a single node, they are rebuilt locally, without touching the distributed checksums; this minimizes the amount of data to be recovered, the network load, and the overall degradation of the system.
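To make the parity idea concrete, here is a deliberately simplified sketch: a single XOR parity chunk, which is enough to rebuild one lost chunk. Real network RAID 6 uses Reed-Solomon-style codes with two independent checksums, which is how it tolerates two simultaneous failures:

```python
# Simplified illustration (not RAIDIX code): one XOR parity chunk per stripe.
# Network RAID 6 keeps two independent checksums; a single XOR shows the idea.

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe: data chunks placed on three nodes, parity stored on a fourth.
chunks = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
parity = b"\x00\x00"
for c in chunks:
    parity = xor_bytes(parity, c)

# Node 1 fails: its chunk is rebuilt from the survivors plus the parity.
rebuilt = parity
for i, c in enumerate(chunks):
    if i != 1:
        rebuilt = xor_bytes(rebuilt, c)
assert rebuilt == chunks[1]  # the lost chunk is recovered exactly
```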
RAIN supports the concept of fault (availability) domains. Nodes can be logically grouped into failure domains, so the system tolerates the loss not only of individual nodes but of entire server racks or enclosures: data is distributed for fault tolerance at the domain level rather than the node level, so the simultaneous failure of every node in a domain (for example, a whole server rack) can be survived. With this approach the cluster is divided into independent subgroups (subclusters) of at most 20 nodes each, which keeps the fault-tolerance and availability requirements satisfiable; the number of subgroups is unlimited.
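A minimal sketch of domain-aware placement (rack and node names are made up, and the real placement logic is RAIDIX's own):

```python
# Illustrative placement sketch: chunks of one stripe land in distinct
# failure domains, so losing a whole rack costs at most one chunk per stripe.

# Hypothetical layout: failure domain (rack) -> nodes in that rack.
domains = {
    "rack-a": ["node1", "node2"],
    "rack-b": ["node3", "node4"],
    "rack-c": ["node5", "node6"],
    "rack-d": ["node7", "node8"],
}

def place_stripe(n_chunks, domains):
    """Assign each chunk of a stripe to a node in a different failure domain."""
    racks = list(domains)
    if n_chunks > len(racks):
        raise ValueError("stripe is wider than the number of failure domains")
    # Take the first node of each rack; a real placer would also balance load.
    return [(rack, domains[rack][0]) for rack in racks[:n_chunks]]

placement = place_stripe(4, domains)
racks_used = {rack for rack, _ in placement}
assert len(racks_used) == len(placement)  # no two chunks share a rack
```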
Fig. 4. Failure domains
Any failure (of a disk, a node, or the network) is handled automatically, without stopping the system.
In addition, all RAIDIX RAIN cluster devices are protected from power loss by connection to uninterruptible power supplies (UPS). Devices connected to the same UPS form a power-failure group.
Characteristics and functionality
Consider the basic functional characteristics of RAIDIX RAIN.
Table 1. Basic RAIDIX RAIN Specifications
| Characteristic | Value |
|---|---|
| Supported node types | Russian server platforms based on Elbrus-4C processors; standard x86-64 servers (planned) |
| Supported media types | SATA and SAS HDD, SATA and SAS SSD, NVMe |
| Maximum storage capacity | 16 EB |
| Maximum cluster size | 1024 nodes |
| Basic functionality | Hot volume expansion; hot addition of nodes to the cluster; failover without downtime |
| Fault-tolerance technologies | Tolerates node, media, and network failures; Erasure Coding across the cluster: network RAID 0/1/5/6; local correction codes at the node level (local RAID 6) |
An important functional feature of RAIDIX RAIN is that services such as initialization, reconstruction, and restriping (during scaling) run in the background, and each can be assigned a priority parameter.
The priority setting lets the user throttle these services relative to the client load, speeding them up or slowing them down. For example, a priority of 0 means a service runs only when there is no load from client applications.
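The semantics can be sketched as follows (the budget formula below is an assumption for illustration, not the actual RAIDIX scheduler):

```python
# Sketch of priority-based background throttling. Assumed semantics:
# priority 0 = run only when idle; higher priorities get a larger share.

def background_budget(priority, client_busy, max_share=0.5):
    """Fraction of I/O bandwidth granted to a background service
    (initialization, reconstruction, restriping)."""
    if not client_busy:
        return 1.0        # idle system: background work runs at full speed
    if priority == 0:
        return 0.0        # priority 0 yields entirely to client I/O
    return min(priority / 10, max_share)  # capped share, scaled by priority

assert background_budget(0, client_busy=True) == 0.0   # fully deferred
assert background_budget(0, client_busy=False) == 1.0  # runs when idle
assert background_budget(3, client_busy=True) == 0.3   # throttled share
```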
The RAIDIX RAIN cluster expansion procedure is simple and largely automated: the system redistributes data in the background, taking the capacity of new nodes into account, so the load stays balanced and even while overall performance and capacity grow proportionally. Horizontal scaling happens "hot", with no downtime and no need to stop applications or services.
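A rough model of why such rebalancing is cheap: with an even chunk layout, only the new node's share of the data has to move, not the whole dataset (an illustrative sketch, not the actual rebalancing algorithm):

```python
# Back-of-the-envelope look at rebalancing cost when a node is added.

def chunks_to_move(total_chunks, old_nodes, new_nodes):
    """Chunks that must migrate so every node again holds total/new_nodes:
    only the newcomers' share moves, not a full reshuffle."""
    per_node_after = total_chunks // new_nodes
    added_nodes = new_nodes - old_nodes
    return per_node_after * added_nodes

# 6000 chunks on 6 nodes, then a 7th node joins: about 1/7 of the data
# migrates to the newcomer, while the rest stays in place.
assert chunks_to_move(6000, 6, 7) == 857
```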
Fig. 5. Scale process diagram
RAIDIX RAIN is a software product and is not tied to a specific hardware platform: by design it can be installed on any compatible server hardware.
Each customer, guided by the specifics of their infrastructure and applications, chooses the deployment option that fits best: dedicated or hyper-converged.
Support for different media types makes it possible, depending on budget and workload, to build on RAIDIX RAIN either:
1. a distributed all-flash store with very high performance and guaranteed low latency; or
2. an economical hybrid system that covers most common load types.
To conclude, here are some figures obtained by testing RAIDIX RAIN on a 6-node NVMe cluster configuration. Note again that on this assembly (with x86-64 servers) the product is still being finalized, so these figures are not final.
- 6 nodes, each with 2 NVMe HGST SN100 drives
- InfiniBand NIC: Mellanox MT27700 family [ConnectX-4]
- Linux kernel 4.11.6-1.el7.elrepo.x86_64
- Local RAID: RAID 0
- Network (external) RAID: RAID 6
- Benchmark: fio 3.1
UPD: random workloads used 4 KB blocks and sequential workloads 1 MB blocks, with a queue depth of 32. The load was launched on all cluster nodes simultaneously, and the table shows the aggregate result. Latency did not exceed 1 ms (99.9th percentile).
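For reference, the 4 KB random-read test described above could be approximated with a fio job file along these lines (the device path and runtime are placeholders, not the actual test setup):

```ini
# Hypothetical fio job approximating the 4 KB random-read test above
[global]
ioengine=libaio
direct=1
iodepth=32
runtime=60
time_based
group_reporting

[randread-4k]
rw=randread
bs=4k
# placeholder device path; point at the RAIN block device under test
filename=/dev/nvme0n1
```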
Table 2. Test Results
| Workload | Result |
|---|---|
| Random read 100% | 4,098,000 IOPS |
| Random write 100% | 517,000 IOPS |
| Sequential read 100% | 33.8 GB/s |
| Sequential write 100% | 12 GB/s |
| Random read 70% / random write 30% | 1,000,000 / 530,000 IOPS |
| Random read 50% / random write 50% | 530,000 / 530,000 IOPS |
| Random read 30% / random write 70% | 187,000 / 438,000 IOPS |