Erasure Code - More Storage Space on Nutanix



    This post first appeared a couple of months ago. Unfortunately, we received stern warnings from across the ocean not to talk about unreleased features, so the text had to be taken down. Now, in NOS 4.1.3, Erasure Code is available as a public beta (experiment with it, but hold off on production for a while, since we are still optimizing the code), which means it can be discussed in public.

    If you have read my post about the structure of NDFS, the Nutanix Distributed File System that underlies everything Nutanix does, you probably noticed that NDFS is rather “generous” in how it consumes disk space.

    Let me remind you that we do not use RAID in the classical sense, where a mirror copy is kept for each disk (RAID-1) or an additional redundancy code is calculated for a group of disks (RAID-5 or RAID-6). Instead, we store every block of data written to disk in two (or even three) places, on different disks and on different nodes. This scheme is called RAIN (Redundant Array of Independent Nodes, a play on RAID, which is the same thing but with Disks). From the point of view of disk capacity, however, RF = 2, the option where one copy is stored for each block, consumes as much space as RAID-1: 50% of the raw capacity is available for data (minus a further, variable percentage for service structures and metadata, which we will leave aside here).

    Yes, we get fault tolerance, reliability, and fast (minutes) recovery from failures; all of that is true. Still, the cost is quite high, especially for people who habitually think about drives in terms of raw or RAID-5 capacity. You can say as often as you like that RAID-5 is bad and unreliable, that it is slow on writes, and finally that, at current HDD prices, the gigabytes given up for fault tolerance are a small price for the reliability and performance we get in return. It does not matter. “We have four one-terabyte disks in our system. Why do we have less than two terabytes for our data?”

    That is why Nutanix came up with an idea that is now being actively implemented. The person working on it at Nutanix is, by the way, one of “ours”, a Russian-speaking programmer.
    The idea is called “erasure code” (we named it EC-X, Erasure Code-X). As often happens with names chosen by engineers, this one is not self-describing, and nobody knows why it stuck. The most accurate rendering in Russian would translate as “redundancy code”.
    Here is how it works.

    Suppose we have data that we are too stingy to keep on “RAID-1”, that is, in Replication Factor (RF) = 2 mode. We can then switch the storage mode for this data from RF = 2 (or RF = 3) to erasure code. A special background process starts up, similar to the one that performs deduplication in the cluster. After some time, instead of block-and-copy, the disks of the cluster nodes hold block-block-block-...-and-redundancy-information-for-them, which makes it possible to unambiguously restore the contents of any block in the chain if it is lost, for example when a disk in one of the nodes fails.
    When the background process finishes, instead of each block and its copy the cluster stores many blocks combined into a group, plus a separate block holding the redundancy code. The container for which we enabled erasure code instead of RF holds the same amount of stored information, yet has more free space for new data.
    Again, this is a bit like post-process deduplication.

    You are surely ready to say at this point: “So you have just reinvented the wheel, that is, RAID-5!” Not exactly at the mathematical level, but loosely speaking the principle is similar, yes.

    The price to pay (nothing comes free, as we recall) is that, in exchange for more disk space for data, we accept a higher CPU load whenever data has to be recovered. Clearly, instead of simply copying a block, we have to reconstruct its contents from the contents of the other data blocks and the redundancy code, and that is a significantly more resource-intensive procedure.
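
    To make the recovery cost concrete, here is a minimal sketch using plain XOR parity, the RAID-5-style scheme mentioned above. It is not the EC-X algorithm, which is Nutanix's own and is not described in this post; the function names and the toy blocks are made up for illustration. The point is only that restoring one lost block means reading every surviving block in the group plus the redundancy block and computing over all of them, rather than copying a replica.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column, one byte position at a time.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one redundancy block for a group of data blocks (illustrative only)."""
    return xor_blocks(data_blocks)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical group of three data blocks spread over three nodes.
group = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(group)

# Simulate losing the second block, then rebuild it from everything else.
lost_index = 1
survivors = [block for i, block in enumerate(group) if i != lost_index]
restored = recover_block(survivors, parity)
assert restored == group[lost_index]
```

    A single XOR redundancy block can rebuild only one lost block per group; it is used here purely because it is the simplest scheme that shows the read-everything-and-compute pattern.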

    It is also important that, with Erasure Code, the redundancy is sufficient to recover from the failure of two disks, nodes, or other cluster components. In terms of fault tolerance that is the equivalent of RF = 3, for which the usable volume is about 33% of raw.
    And what about erasure code?
    It depends on the size of the cluster. The more nodes it has, the more worthwhile erasure code becomes and the greater the difference.



    For a 4-node cluster with 80 TB raw, approximately 40 TB usable is obtained with RF = 2. When you switch the container to erasure coding, the usable space becomes about 53 TB.
    On 5 nodes the figures are 100 TB raw, 50 TB with RF = 2, and 75 TB with erasure code; on 6 nodes, 120, 60, and 96 TB; on 7 nodes, 140, 70, and 116 TB. As you can see, the “storage efficiency” of erasure code grows with cluster size and reaches roughly 80% of raw capacity on the larger clusters.
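
    The figures above can be reproduced with a little arithmetic. The sketch below is my own curve fit, not a documented Nutanix formula: it assumes 20 TB raw per node (80 TB / 4 nodes) and an erasure-code efficiency of (N - 2) / (N - 1) for an N-node cluster, simply because that ratio matches the quoted numbers. It says nothing about how EC-X actually lays out its groups. The function names are hypothetical.

```python
def usable_rf(raw_tb, rf):
    """Usable capacity when every block is stored rf times."""
    return raw_tb / rf


def usable_ec(raw_tb, nodes):
    """Usable capacity under the ASSUMED efficiency (nodes - 2) / (nodes - 1).

    This ratio is inferred from the figures in the post, not taken from
    any Nutanix documentation.
    """
    if nodes < 3:
        raise ValueError("the assumed ratio needs at least 3 nodes")
    return raw_tb * (nodes - 2) / (nodes - 1)


TB_PER_NODE = 20  # assumption: 80 TB raw / 4 nodes

for nodes in (4, 5, 6, 7):
    raw = TB_PER_NODE * nodes
    print(
        f"{nodes} nodes: raw {raw} TB, "
        f"RF=2 {usable_rf(raw, 2):.0f} TB, "
        f"RF=3 {usable_rf(raw, 3):.1f} TB, "
        f"erasure code {usable_ec(raw, nodes):.1f} TB "
        f"({usable_ec(raw, nodes) / raw:.0%} of raw)"
    )
```

    Under these assumptions the erasure-code column agrees with the numbers quoted above to within rounding, and the RF = 3 column shows the roughly 33% of raw mentioned earlier.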

    What kind of coding is used? No, it is not the Reed-Solomon code that is well known in the industry and often used for such tasks. We had to come up with our own algorithm, which offers faster processing and computation. And of course we use the distributed capabilities of the Nutanix cluster: the algorithm is distributed, map-reduce style, and runs on all nodes of the cluster, which makes it both reliable and fast. It is also important to note that EC-X does not violate our Data Locality principle. If a virtual machine runs on a given virtualization host (cluster node), its data on the SSD (the performance tier of our storage) also resides locally on that node, with both the RF and the EC-X storage options, which gives low latency and high disk performance.

    Why and where can this be applied?

    First of all, it lowers the cost of storage ($/GB), which matters most for cold storage and capacity nodes, especially on large clusters, where the information kept on Nutanix is valuable but not too “hot” or active, and where you are willing to pay for the extra free space with increased CPU load and longer recovery time.
    Note, however, that in normal operation with data under erasure code the CPU load on data access does not increase significantly; it increases only during recovery.
    You are also free to choose how to protect your data with redundancy. You can keep different data containers on one cluster, some with RF = 2, others with RF = 3, and some with erasure code. For data that is hot and critical enough you can choose one of the RF options, and for data that is not so hot and sits on nodes where increased CPU load during recovery is not critical, Erasure Code.
    Again: the choice of storage mode is yours and depends on your priorities.

    Erasure Code arrives in the latest release of Nutanix OS, which will reach your Nutanix systems as a regular update. Updating Nutanix, by the way, does not require stopping virtual machines or making data unavailable; the system is updated over the air, “like an iPhone”, but more on that in the next post.
