NDFS - Nutanix Distributed File System, Nutanix “Foundation”

    image

    Nutanix is a “converged platform 2.0”, if we consider “converged platform 1.0” to be the early experiments in mechanically combining storage and servers into a single product with a more or less common architecture, as undertaken in VCE (EMC + Cisco) Vblock, NetApp FlexPod or Dell VRTX. Unlike those, in Nutanix the servers, network and storage are not simply wired together in a 19" rack; the integration goes much deeper and is much more interesting.

    But let's start with the basics, with what lies at the foundation of Nutanix as an architecture and a solution: the Nutanix Distributed File System, NDFS.


    As the name suggests, NDFS is a distributed, clustered file system.

    One of the most important problems a developer of a clustered, distributed file system has to solve is how to work with so-called metadata.
    What is “metadata”, and how does it differ from the data itself?
    Any data we store on a storage system always has auxiliary data associated with it: the file name, the path to the file (that is, where it sits in the storage structure), its attributes, size, creation time, modification time, access time, flag bits such as “read-only”, and access rights, from the simplest up to fully elaborated ACLs. This is not the stored data itself, but it is associated with the data and makes access to it possible. Such additional data is commonly called “metadata”. As a rule, access to data begins with access to the metadata stored in the file system, and only with its help can the OS start working with the data itself.
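
    As a simple illustration (a minimal sketch, not Nutanix-specific), here is the kind of metadata an ordinary local file system keeps alongside a file's contents; the path below is just an example:

        import os
        import stat
        import time

        path = "/tmp/example.txt"          # hypothetical file, purely for illustration
        with open(path, "w") as f:
            f.write("the data itself")     # this is the "data"

        st = os.stat(path)                 # everything below is "metadata"
        print("size:", st.st_size)
        print("modified:", time.ctime(st.st_mtime))
        print("accessed:", time.ctime(st.st_atime))
        print("mode bits:", stat.filemode(st.st_mode))   # e.g. -rw-r--r--
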
    The problem begins when more than one party wants to work with the data. With a single writer everything is simple: read the path to the file, read the needed piece, change it, write it back, and, if necessary, update the metadata. But what happens when, while one process on one cluster node is writing its changes to a file, another process on another node has read that same file and started making its own changes?

    One of the classic mistakes of novice system administrators is to mount the same iSCSI LUN from a storage system on two or more servers (which is physically possible), format it with something like NTFS, and then try to read and write to it from the different servers. Those who have been through this know what happens next; for those who have not, here is how it goes. Everything works fine as long as only one server writes to such a “disk”. As soon as the second one starts writing, the file system promptly (and more or less fatally) falls apart. Why? Because NTFS is not a clustered file system and knows nothing about a situation where two different servers may write to the same file system.


    The file system has to monitor and “dispatch” access to data from the different nodes of the cluster. The simplest way to do this is to designate a single node as the “file system head”: all metadata changes go through it, it operates on the metadata exclusively, and it answers requests for metadata changes, blocking some and allowing others. This is how, for example, Lustre or pNFS works. Unfortunately, it means that metadata operations sooner or later become a bottleneck for scaling. When all metadata operations go through exactly one server, that server eventually becomes the choke point. Not immediately, not always, but sooner or later it does.
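
    To make the idea concrete, here is a minimal, purely illustrative sketch of such a “head node”: a single process that serializes all metadata changes behind one lock. Every name here is hypothetical; real Lustre or pNFS metadata servers are, of course, far more involved:

        import threading

        class MetadataHead:
            """A single node through which ALL metadata changes must pass."""

            def __init__(self):
                self._lock = threading.Lock()    # every change serializes here
                self._meta = {}                  # path -> attributes

            def update(self, path, **attrs):
                # Whichever node in the cluster wants to change metadata,
                # it has to wait its turn on this one lock, on this one host.
                with self._lock:
                    self._meta.setdefault(path, {}).update(attrs)

            def lookup(self, path):
                with self._lock:
                    return dict(self._meta.get(path, {}))

        head = MetadataHead()
        head.update("/vm/disk1.img", size=10 * 2**30, owner="node-3")
        print(head.lookup("/vm/disk1.img"))

    The single lock is exactly the scaling limit described above: it works, but every node in the cluster ends up contending for it.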

    For example, those who have tried to put many (hundreds of) virtual machines on a single LUN with a VMFS partition (and VMFS is a classic cluster file system, with locking via SCSI reservations) know that under a heavy metadata workload a LUN with VMFS can slow down noticeably and unpleasantly. “Metadata operations” here are things like rebooting a VM, resizing a VMDK, creating, deleting or extending a VMFS datastore on the LUN, and creating templates or deploying a VM from a template; all of these lock the LUN and briefly suspend I/O entirely for all VMs on it. This problem is partially addressed by Atomic Test and Set (ATS) locking offloaded to the storage system, if it supports VAAI, but even so the fix is often only partial.
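
    The idea behind ATS-style locking can be sketched as an atomic test-and-set on a small lock record, so that only the contended record is locked instead of the whole LUN. This is a conceptual illustration under those assumptions, not the actual VAAI/VMFS implementation:

        import threading

        class AtomicTestAndSet:
            """Conceptual ATS: lock a single record, not the whole LUN."""

            def __init__(self):
                self._guard = threading.Lock()    # stands in for the array's atomicity
                self._owners = {}                 # lock record -> owner host

            def try_lock(self, record, host):
                # "Test and set" in one atomic step: succeed only if the record
                # is free or already ours; all other records stay untouched.
                with self._guard:
                    if self._owners.get(record) in (None, host):
                        self._owners[record] = host
                        return True
                    return False

            def release(self, record, host):
                with self._guard:
                    if self._owners.get(record) == host:
                        del self._owners[record]

        ats = AtomicTestAndSet()
        print(ats.try_lock("vmfs-header-42", "esxi-01"))   # True
        print(ats.try_lock("vmfs-header-42", "esxi-02"))   # False: only this record is contended
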

    Therefore, building a good, truly scalable cluster system is largely a question of building a well-scaling metadata path.
    When creating NDFS, the NoSQL database Apache Cassandra, originally developed internally at Facebook and later open-sourced, was chosen to store the file system metadata across the cluster. This is not the first time it has been used in this role, since it scales very well, but almost always, alongside the scaling problem, the problem of database consistency at cluster scale arises. So the task for the Nutanix developers was to ensure not only distribution (Cassandra runs distributed at Facebook on clusters of thousands of nodes) but also consistency. For Facebook it is not a big problem if your photos in an album do not update instantly on every node of the cluster; for Nutanix, consistency of the storage metadata is critical. Therefore “vanilla” Cassandra was substantially reworked to provide the required consistency of metadata in the cluster.
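
    To give a feel for the consistency question, here is a sketch using stock, open-source Cassandra and its tunable consistency levels (a QUORUM write plus a QUORUM read guarantees the reader sees the latest acknowledged write). It uses the DataStax Python driver with a hypothetical keyspace and table; it is only an illustration of the trade-off, not the reworked Cassandra that Nutanix actually runs:

        # pip install cassandra-driver; contact points, keyspace and table are made up.
        from cassandra import ConsistencyLevel
        from cassandra.cluster import Cluster
        from cassandra.query import SimpleStatement

        cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
        session = cluster.connect("ndfs_meta")            # hypothetical keyspace

        # Write metadata so that a majority of replicas must acknowledge it ...
        insert = SimpleStatement(
            "INSERT INTO extent_map (extent_id, node, offset) VALUES (%s, %s, %s)",
            consistency_level=ConsistencyLevel.QUORUM,
        )
        session.execute(insert, ("eg-0001", "node-2", 4096))

        # ... and read it back at QUORUM as well: the two quorums overlap,
        # so the read cannot return stale metadata.
        select = SimpleStatement(
            "SELECT node, offset FROM extent_map WHERE extent_id = %s",
            consistency_level=ConsistencyLevel.QUORUM,
        )
        row = session.execute(select, ("eg-0001",)).one()
        print(row.node, row.offset)
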

    At the same time, because Cassandra is distributed throughout the system, there is no “main node” whose failure could affect the operation of the cluster as a whole. The metadata database is spread across the cluster, and Nutanix has no “main node” or “file system head” acting as the single point of control over metadata access.

    So how does disk access work in a Nutanix cluster?
    In broad strokes, it looks like this:

    image

    Unexpectedly for many, Nutanix does not use RAID to provide high availability of data on its disks. We are so used to RAID on disks in enterprise systems that the claim “no RAID” is often met with bewilderment: “But what about the required reliability of disk storage?” That's right, no RAID. Nevertheless, data redundancy is provided, and it is provided by an approach commonly called RAIN - Redundant Array of Independent Nodes. Redundancy of the stored data is ensured by the fact that, at the logical level, every data block written is placed not only on the disks of the cluster node it is destined for, but also, simultaneously, on the disks of one other node (or, optionally, two).
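
    A minimal sketch of the RAIN idea, with an assumed replication factor of 2 and entirely hypothetical node names (the real placement logic in NDFS is, of course, far more sophisticated):

        import hashlib

        NODES = ["node-1", "node-2", "node-3", "node-4"]   # hypothetical cluster
        REPLICATION_FACTOR = 2                              # each block lives on 2 nodes

        def placement(block_id: str, local_node: str) -> list[str]:
            """Return the nodes a block is written to: the local node plus
            REPLICATION_FACTOR - 1 other nodes, chosen deterministically."""
            replicas = [local_node]
            others = [n for n in NODES if n != local_node]
            # Deterministic but spread-out choice of the remote replica(s).
            start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(others)
            for i in range(REPLICATION_FACTOR - 1):
                replicas.append(others[(start + i) % len(others)])
            return replicas

        # The VM runs on node-2, so the block lands on node-2's disks
        # and on one more node for redundancy.
        print(placement("extent-0042:block-7", "node-2"))
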

    What does giving up RAID buy us? First of all, faster and more flexible recovery when a disk is lost, and consequently higher reliability: during a rebuild, classical RAID runs at reduced reliability, and disk performance drops as well, since the RAID group is busy with its internal reading and writing of blocks for the rebuild.
    Secondly, data placement becomes much more flexible. The file system knows where each data block is written and can restore exactly what is needed at the moment, or write data in whatever way is optimal for it, because it fully controls the entire process of writing data to the physical disks, unlike a RAID controller, which is logically separated from the OS level.

    How is access to disks organized?
    First, remember that Nutanix is a hypervisor platform. Everything inside a Nutanix server runs on top of a bare-metal hypervisor, such as ESXi, MS Hyper-V, or Linux KVM.

    Among the virtual machines there is one special one, called the CVM, or Controller VM. This is a so-called “virtual appliance” in which all the machinery that builds and serves the Nutanix file system runs. Physically, the CVM is a virtual machine running CentOS Linux with numerous services inside: Apache Cassandra, the NoSQL store for the file system metadata, I have already mentioned above, and beyond that there is a whole “menagerie” of processes providing everything Nutanix can do. Strictly speaking, Nutanix is this CVM: it is its heart, its brain, and its main intellectual property.
    Yet to the hypervisor it is just one big virtual machine with many special processes inside.

    This virtual machine, as the diagram shows, passes the I/O traffic between the virtual machines and their virtual disks through itself. From the point of view of the guest VMs, the physical disks sit not just “behind the hypervisor” but behind the CVM as well. The CVMs of all nodes joined into the cluster take the physical disks directly mapped into them - individual SSDs and HDDs - and build a common storage pool out of them. They create it and present it to the hypervisor, which then sees shared storage. The CVM presents this storage in whatever form is most convenient for the given hypervisor: for VMware ESXi it is a “virtual” NFS datastore, for 2012R2 Hyper-V a share over the SMB3 protocol, and for KVM, iSCSI.
    For each hypervisor (three are currently supported) there is a corresponding CVM, which is installed on the hypervisor during the initial setup of the cluster.

    The process that serves I/O over the chosen protocol between the physical disks on one side and the VMs on the hypervisor on the other is called Stargate; the process that distributes tasks across the cluster nodes, including all MapReduce tasks such as load balancing across nodes and scrubbing (online integrity checking) of disks, is called Curator. Prism is the management interface, including the GUI, and Zookeeper (also an Apache project) stores and maintains the cluster configuration.
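
    To illustrate what a MapReduce-style maintenance pass of the kind attributed to Curator might look like, here is a purely conceptual scrub sketch: a map phase checksums the replicas each node holds, and a reduce phase flags extent groups whose replicas disagree. None of this is Nutanix code, and the replica data is made up for illustration:

        from collections import defaultdict
        import zlib

        replicas = {
            ("eg-0001", "node-1"): b"block contents A",
            ("eg-0001", "node-3"): b"block contents A",
            ("eg-0002", "node-2"): b"block contents B",
            ("eg-0002", "node-4"): b"block contents B - silently corrupted",
        }

        def map_phase(replicas):
            """Each node checksums the replicas it holds."""
            for (extent_group, node), data in replicas.items():
                yield extent_group, (node, zlib.crc32(data))

        def reduce_phase(mapped):
            """Group checksums by extent group and flag any disagreement."""
            by_group = defaultdict(list)
            for extent_group, item in mapped:
                by_group[extent_group].append(item)
            for extent_group, items in by_group.items():
                if len({crc for _, crc in items}) > 1:
                    yield extent_group, items   # replicas disagree -> needs repair

        for group, items in reduce_phase(map_phase(replicas)):
            print("mismatch in", group, items)
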

    image

    Since, as I said above, Nutanix does not use RAID and lays data blocks out on the disks itself, it gains great flexibility. For example, in the diagram you can see SSD drives. They hold the so-called hot tier, that is, the data blocks that are being actively accessed, read or modified. The hot tier also holds newly written blocks. They stay there until, through inactivity, they are demoted to the cold tier on SATA HDDs. And since the machinery in the CVM does the “laying out” of blocks onto disks itself, it fully controls where, how, and for how long each block will live.
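
    A minimal sketch of the hot/cold tiering idea: new and recently touched blocks live on the SSD tier and are demoted to the HDD tier once they have been idle for a while. The threshold and names are invented for illustration; the real tiering logic in NDFS is far richer:

        import time

        SSD_IDLE_LIMIT = 3600.0   # invented threshold: demote after 1 hour of inactivity

        class TieredStore:
            def __init__(self):
                self.ssd = {}   # block_id -> (data, last_access_ts)  "hot tier"
                self.hdd = {}   # block_id -> data                    "cold tier"

            def write(self, block_id, data):
                # New writes always land on the hot tier first.
                self.ssd[block_id] = (data, time.time())

            def read(self, block_id):
                if block_id in self.ssd:
                    data, _ = self.ssd[block_id]
                    self.ssd[block_id] = (data, time.time())   # refresh "heat"
                    return data
                data = self.hdd.pop(block_id)                  # cold hit ...
                self.write(block_id, data)                     # ... promote back to SSD
                return data

            def demote_cold_blocks(self, now=None):
                # Background pass: move blocks idle longer than the limit to HDD.
                now = now or time.time()
                for block_id, (data, ts) in list(self.ssd.items()):
                    if now - ts > SSD_IDLE_LIMIT:
                        self.hdd[block_id] = data
                        del self.ssd[block_id]
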

    When the Nutanix nodes are joined into a single cluster, we get the following structure:

    image

    Directly on top of the physical disks sits the data store. Data is stored in the form of extents, that is, contiguous, addressable runs of blocks, and groups of such extents. As the addressable store, the well-proven ext4 file system was chosen, of which only the ability to store and address extents is used. All of the metadata logic, as described above, is lifted up to the Nutanix level itself.
    In the diagram below, yellow is the physical SSDs and SATA HDDs; green is NDFS, consisting of ext4 as the extent store plus the cluster metadata store in Cassandra; and on top of that sit the data blocks of the guest VM file systems, which can be NTFS, ext3, XFS, or whatever you like.
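
    Here is a small sketch of how one might model extents and extent groups as data structures; the field names and sizes are assumptions for illustration, not the actual NDFS on-disk format:

        from dataclasses import dataclass, field

        @dataclass
        class Extent:
            """A contiguous, addressable run of blocks belonging to one vDisk."""
            vdisk_id: str
            offset: int          # byte offset within the vDisk
            length: int          # length of the contiguous run, in bytes

        @dataclass
        class ExtentGroup:
            """A group of extents stored together as one file on the ext4 extent
            store, and replicated as a unit to other nodes (RAIN)."""
            group_id: str
            extents: list[Extent] = field(default_factory=list)
            replica_nodes: list[str] = field(default_factory=list)

        eg = ExtentGroup(group_id="eg-0001", replica_nodes=["node-2", "node-4"])
        eg.extents.append(Extent(vdisk_id="vm01-disk0", offset=0, length=1 * 2**20))
        eg.extents.append(Extent(vdisk_id="vm01-disk0", offset=1 * 2**20, length=1 * 2**20))
        print(eg)
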

    image

    In future posts I plan to continue the “technical” part of the story about what happens “under the hood” of Nutanix: more about how RAIN and fault tolerance are provided, how deduplication and compression work, how the distributed cluster is implemented, how data replication for DR works, how backups can be done, and much more.

    If you have only just subscribed to our hub, you will most likely also be interested in the other publications about Nutanix on Habrahabr:
    http://habrahabr.ru/company/nutanix/blog/240859/

    In addition, for those who want to dig deeper into the topic, I can recommend a blog where you will also find many technical articles about how Nutanix works “under the hood”:
    http://blog.in-a-nutshell.ru
