AERODISK: DSS the Russian way, or a severe Lyubertsy-to-Sakhalin startup straight from the factory

    image

    Good afternoon, colleagues.

    In today's article we will talk about the Russian vendor AERODISK, a developer and manufacturer of data storage (hardware DSS) and virtualization solutions, and about the difficulties the vendor has faced in the market. We will also look in more detail at the Aerodisk Engine storage line (hardware + software).

    Right away, I will note that on October 9 (Tuesday), from 11:00 to 12:30 (Moscow time), a free technical webinar will take place, where we will show on a live system how everything works.

    I want to start the article with one of my favorite parables:

    One day a wanderer came to a city where a grand construction project was underway. Men were hauling large stones under the scorching sun. "What are you doing?" our hero asked one of the workers, who was slowly dragging a cobblestone. "Can't you see? I'm dragging stones!" he answered angrily. Then the wanderer noticed another worker pulling a cart loaded with big stones and asked: "What are you doing?" "I am earning money for my family," came the answer. The wanderer approached a third worker, who was doing the same thing, but working harder and faster. "What are you doing?" "I am building a temple," he smiled.

    What is this parable about?

    The point is that if you believe in a big idea and see a big goal, then nothing is impossible for you; everything is achievable. And every step that brings you closer to that big goal doubles your strength. At some point we saw a big goal and believed in a big idea.

    This is the first article from our company, AERODISK, and in it I will talk about the solutions we release. In addition, I will describe one of the products in more detail.

    Our team is located in Moscow, and we have been developing data storage systems (DSS) since 2011. We are not reinventing the wheel, just trying to make it better and more accessible. We are not a newfangled fly-by-night company set up to cash in on the Yarovaya law and other government decrees on import substitution; we develop systems aimed at business objectives. The idea of creating our own storage was born out of long, hands-on experience with the storage systems of world leaders such as EMC, NetApp, HDS, and IBM/Lenovo, and a clear understanding that this is not rocket science and that their results can, at the very least, be reproduced by a team of motivated engineers and developers.

    Indeed, there is a well-established myth that storage systems (like any serious product) can only be developed by a transnational corporation with hundreds or thousands of programmers around the world. In real life, this is not the case. Almost always (with rare exceptions), a new serious product is created by a small team of up to 10 people (and usually 2-3). At that creation stage, about 80% of the product's total functionality is laid down. Then the product either enters the market through the efforts of this small team (or at least manages to loudly declare itself) and investors pick it up (funds, holdings, or major manufacturers), or it dies, which, as you have understood, is not about us. In the first, successful case, it is only after this small team joins the ranks of a large corporation that the proverbial 100,500 programmers and engineers appear, who actually polish the product and handle its support. There are plenty of examples: 3PAR first built its storage system and sold it quite successfully in the States, and only then was bought by HP. The same story happened with Compellent and Dell. Riverbed (which, as you understand, is not about storage, but is also serious hardware) was originally written by two developers over a year and a half, and only then attracted investors. The point is that it doesn't take gods to fire pots, and we understood that clearly.

    Since the idea was born, a lot has happened. As befits a normal startup, for the first few years we (all three and a half of us) sat in an office that, for lack of money, was located on the premises of one of the factories just outside the MKAD.

    This is how we went to work every morning. To get into the office, we had to walk through the glamorous corridor of an abandoned factory floor.

    image

    But this did not get in the way of productive work, and in 2013 we managed to build the prototype of our first AERODISK ENGINE storage system, which, although rough around the edges and still rather raw, already performed its main function: reliable data storage.

    And this is what the first management console of our storage system looked like; one of our customers affectionately called it GreenDOS.

    image

    The web interface and other goodies, of course, appeared much later.

    Now, for example, it looks like this.

    image

    image

    That said, we have not abandoned the console: it is still there, it is convenient and functional, and it lets you automate many operations with scripts.

    By 2014 we had managed to polish the system to a stable state (or at least we thought so ourselves), began cautiously testing it with our customers, and started selling it. On the whole, the first external tests were successful (not without failures, of course), and we got our first sales. There were not many of them: in a year and a half we sold a little more than a dozen systems. The reason was that in order to sell something unnecessary, you first have to buy something unnecessary, and we had no money and a shortage of human and financial resources. As a result, in mid-2016 we realized that we could not carry this project with 3.5 people and that we needed an investor to reach the desired level. I will not go into the details of the search for investors; that process is fun only in American films about Silicon Valley. In our reality it is tedious and nerve-racking. In the end, at the close of 2016 we came to an agreement with an investor and, joyful, full of strength and ambitious plans, went off to celebrate the New Year.

    From the beginning of 2017 we began to live in a new way. We expanded the staff, moved from the factory to a proper office, built a normal test lab and demo zone, launched the development of vAIR and RAILGUN, and began actively selling ENGINE. And here we ran into the first surprise.

    Since we are a new vendor, in almost one hundred percent of cases customers tested our storage system with a passion before buying, and about 80% of those tests ended in an epic fail. Once we even had to give a customer a 100% discount (i.e., give the system away for free), because he tested it after purchase, in production (it happens), and the result did not satisfy him (to put it mildly). Of course, there were the 20% that ended in successful sales, and those customers were satisfied, but it was clear that the product in its current state was of poor quality and that something urgently had to change. We seriously overhauled our development and testing processes, and by the autumn of 2017 we managed to produce a genuinely stable storage release, which immediately showed in external tests and sales. The ratio of successful to unsuccessful tests flipped. By the end of the year almost all customer tests were successful, which helped us sell dozens of storage systems and, with difficulty, meet the sales plan set by the investors.

    We started 2018 no less joyfully: sales were going well, and ENGINE was getting good reviews from customers. Encouraged by the success, we released the next version of the ENGINE software, which added a bunch of new useful functions and, accordingly, went into new tests with customers. And here a second surprise was waiting for us.

    Although the new version of ENGINE passed all the required internal tests (the ones that had allowed us to release a stable version in the autumn of 2017), those tests were no longer sufficient for the more complex new software. The new functionality brought new tasks and, accordingly, new kinds of external tests at customer sites. As a result, the effectiveness of external tests dropped again (not as badly as the first time, but noticeably), and we started looking for the cause. Then it dawned on us: making the product roughly twice as complex requires making its internal testing roughly four times as thorough. But since by then we were already old hands, it was not hard for us to quickly adapt our tests to the new reality. We expanded the test lab and automated many of the tests we used to run by hand. As a result, we were able to quickly bring the new version back to the required stability. Later we applied the experience gained from those earlier bumps and bruises when releasing our second product, vAIR, which is no longer quite a DSS but rather a hyperconverged system (although vAIR can also pretend to be a DSS if you ask it nicely). There were no surprises with it: the first customer tests delivered results right away, and sales took off.

    Looking back, we can say that a long road has been travelled, and, as it turned out, what matters most is not what you are now, but what you want to become, what you consider right, and how quickly you are ready to adapt to changing external conditions. At the same time, it is important to have a clear understanding that much more can and should be done. The main thing is to work hard and not lose heart if the result does not come immediately.

    Now about the AERODISK products themselves


    At the moment we are working on four products, two of which are already on sale, and the third and fourth are on the way.

    1. AERODISK ENGINE is a classic unified data storage system for mid-range and entry-level (Mid-range and Low-end) corporate tasks. ENGINE has almost all the functionality of a modern storage system, including fault tolerance in Active-Active mode (ALUA), block and file access, various SSD and RAM caching options, deduplication, multi-level data placement (tiering), an All-Flash mode, etc. It is this product that the rest of this article is about.
    2. AERODISK vAIR is a hyperconverged system that lets you run server virtualization, a software-defined network, and a horizontally scalable storage system in one box. vAIR can work both with its native hypervisor (KVM) and with an external VMware hypervisor. In addition, unlike our other products, vAIR supports installation on third-party hardware, fully implementing the SDDC (Software-Defined Data Center) concept.
    3. AERODISK RAILGUN (this is a working title; the final one has not yet been chosen). Initially we had our own virtual RAID written from scratch, which we decided to name after the first letter of the alphabet and, at the same time, of our company, i.e. RAID-A. In the end it grew into a separate product, namely a Hi-End-class storage system. As a separate product it is in beta testing, and everything points to the first version being released at the beginning of next year. This storage system is designed for the most critical tasks and the most extreme loads. RailGun's distinctive features are fault-tolerant symmetric I/O (simultaneous utilization of all controllers at once), multi-controller operation (up to 16 controllers in SAN mode), and fully automated real-time tiering with any number of storage levels.
    4. Adaptive deduplication (deduplication with variable-length blocks), which does not even have a working name yet, but does have a working, combat-ready version (which is more important). Initially, as with RAID-A, it was (and remains) one of the options for ENGINE. But along the way it turned out that it works perfectly not only with ENGINE but with any block device. As a result, we ended up with a separate piece of software that performs block-level data deduplication while automatically selecting the block size, which significantly increases deduplication efficiency (we save a lot of disk space) and at the same time increases I/O performance and reduces latency (simply because far fewer writes actually reach the disks, since most of them are deduplicated). This software is available either as an additional license for any of our storage systems, or (after the New Year) as a separate product (hardware or software). A rough sketch of the general idea is shown below.
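    To give a feel for what variable-length (content-defined) block deduplication means in general, here is a minimal, purely illustrative Python sketch. It is not AERODISK's actual algorithm (which is not published here): the splitter uses a toy checksum instead of a real rolling fingerprint, and all names are invented for the example. What it demonstrates is that block boundaries follow the content, so identical data produces identical chunks even when it is shifted by an insertion.

```python
import hashlib
import random

def split_chunks(data: bytes, mask: int = 0x0FFF,
                 min_size: int = 512, max_size: int = 8192):
    """Split data into variable-length chunks. A chunk boundary is declared
    where a toy checksum matches `mask` (content-defined), so an insertion
    early in the stream does not shift every later chunk."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup(data: bytes):
    """Keep only unique chunks; the logical stream becomes a list of hashes."""
    store, refs = {}, []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # physical copy is stored only once
        refs.append(digest)               # logical stream = list of references
    return store, refs

random.seed(0)
block = bytes(random.randrange(256) for _ in range(16_384))
payload = block * 6 + b"X" + block * 6    # repeated data plus one insertion
store, refs = dedup(payload)
print(f"logical chunks: {len(refs)}, unique chunks stored: {len(store)}")
```

    Running it shows far fewer unique chunks stored than logical chunks referenced, which is exactly where the disk-space savings come from.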

    Now let's get down to specifics. As I wrote above, this article is about ENGINE. We will cover the other products in detail in our next articles.

    In this article I will describe the hardware, the storage architecture, and the fault tolerance principles of the ENGINE storage system. The next article about ENGINE will cover the extended functionality and detailed performance tests (with detailed methodology, various load profiles, graphs, and other goodies). For now, let's focus on the hardware and architecture.

    Hardware


    To be honest, at first I did not want to dwell on the hardware side, since in a storage system the most valuable thing is the intelligence, i.e. the software installed on the hardware, and our hardware, albeit with its own particularities, is roughly the same as in 99% of storage systems from all vendors (both Western and Eastern). Sometimes it even comes from the same factories.

    But unlike, for example, our SDDC product (i.e. vAIR), the ENGINE storage system is still a hardware storage system and is delivered as a complete package (hardware + software + service); accordingly, it is not classic SDS (the ENGINE software is extremely difficult to install on ordinary x86 servers, since there are a number of hardware dependencies). So in ENGINE's case the hardware description is quite important, and here it is.

    So, the hardware. The AERODISK ENGINE range includes three modular platforms.

    • ENGINE N1 - a single-controller storage system for non-critical tasks and for storing archives/backups
    • ENGINE N2 - a fault-tolerant storage system for critical tasks and medium loads up to 150,000 IOPS
    • ENGINE N4 - a fault-tolerant storage system for critical tasks and high loads up to 300,000 IOPS

    All platforms share the following hardware feature: they are built from modules that can easily be replaced at any time, which greatly simplifies operating the storage system and replacing components. This applies not only to the HA platforms (i.e., the fault-tolerant N2 and N4) but also to the single-controller N1.

    For fault-tolerant configurations (N2 and N4) we use a chassis with two controllers (also called "heads" or nodes) and a common backplane. The front-mounted disks (12 or 24 drives, 2.5 or 3.5 inch) are installed in the chassis. Through the backplane the disks are simultaneously visible to both storage heads over the SAS interface. Additional disks are connected via disk shelves (24, 60, or 102 disks per shelf) over SAS 48G (4x12 Gbit) adapters installed in the controller heads. Shelf cascading is supported.

    In SAN mode (scale-up) the number of controllers cannot exceed two (if you need more for SAN, that is already RAILGUN territory), while in NAS mode the number of controllers can be scaled up to 8 (N4).

    Now for the pictures


    AERODISK ENGINE N1


    image
    image

    AERODISK ENGINE N2


    image
    image

    AERODISK ENGINE N4


    image
    image
    The heads themselves are mounted at the back and are hot-swappable. Two power supplies are also installed at the back, providing redundant power for the storage system. Inside the chassis (N2 and N4) the controllers are connected by two interconnects.

    • A PCI bridge, which provides minimal latency and high throughput and is used to synchronize the RAM caches between the storage heads.
    • 1 Gbit Ethernet, which is used as the cluster heartbeat.

    Inside each head there is a motherboard, Intel Xeon processors (the model and count vary depending on the specific storage model), RAM (the amount also varies, from 32 GB to 2 TB per controller), and internal boot drives (M.2 SATA SSD) on which our software is installed.

    RAM is protected against power loss by an additional battery (BBU), which provides up to 10 minutes of autonomous operation for the motherboard, RAM, processor, and boot drive. During that time, in the event of a power failure, the system automatically flushes the unwritten data from the RAM cache to the internal boot drive, thereby preventing data loss. Once power is restored, the data from the boot drive is automatically flushed to the storage disks, so everything ends up intact. Depending on the model, the storage controllers carry the following I/O ports: 1/10/25/40 Gbit Ethernet and/or 8/16/32 Gbit FC. A separate bonus is the ability to install both FC and Ethernet in one box, without additional gateways or servers, which gives you both file and block access in a single system.
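    To make the sequence easier to follow, here is a toy model of the power-loss protection logic as described above. It is an illustration of the idea only, not the actual controller firmware; the class and method names are invented for the example.

```python
from enum import Enum, auto

class CacheState(Enum):
    CLEAN = auto()    # everything is already on the data disks
    DIRTY = auto()    # unwritten data exists only in RAM
    STAGED = auto()   # dirty data has been dumped to the internal M.2 boot SSD

class WriteCache:
    """Toy model of the BBU-protected write cache sequence described above."""
    def __init__(self):
        self.ram, self.m2_ssd, self.data_disks = [], [], []
        self.state = CacheState.CLEAN

    def write(self, block):
        self.ram.append(block)            # write acknowledged from RAM cache
        self.state = CacheState.DIRTY

    def on_power_loss(self):
        # The BBU keeps the board alive long enough to dump RAM to the M.2 SSD.
        self.m2_ssd.extend(self.ram)
        self.ram.clear()
        self.state = CacheState.STAGED

    def on_power_restored(self):
        # Staged data is replayed onto the storage disks.
        self.data_disks.extend(self.m2_ssd)
        self.m2_ssd.clear()
        self.state = CacheState.CLEAN

cache = WriteCache()
cache.write("block-1"); cache.write("block-2")
cache.on_power_loss()
cache.on_power_restored()
print(cache.state, cache.data_disks)   # CacheState.CLEAN ['block-1', 'block-2']
```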

    ENGINE N2 - rear view


    image

    Below are the main hardware characteristics of the ENGINE range.

    image

    Software


    And now for the interesting part. The software component is, of course, based on the good old Linux kernel. There are two distributions, one based on Astra and the other on Debian; they do not differ in functionality, and this split exists mainly for certification purposes, so you can ignore it from here on. On top of the Linux kernel we installed our own hand-written modules, as well as open-source solutions that we modified.

    A feature, and one of the main advantages, is that the same software is installed regardless of the hardware platform (N1, N2, N4). That means even the youngest ENGINE model gives you all the advantages of the older ones, so you get Enterprise functions in entry-level models, as well as easy and seamless upgrades.

    Storage architecture


    The most important thing in a storage system is its storage organization principles, i.e. the storage architecture. In ENGINE we combined two approaches, and hence two storage architectures, in one system; they are implemented as two types of virtual RAID.

    1. Dynamic Disk Pool (DDP) - for high-performance block access, as well as random and mixed load (primarily all-flash)
    2. RAID Distributed Group (RDG) - for intelligent file access, large amounts of data, and sequential load

    The natural question is: why split it this way? Both RDG and DDP are storage groups (or storage pools), each with its own pros and cons. Some vendors do not bother and take the easier route (for themselves, of course), producing two separate, unrelated storage lines for different tasks. We try to make our systems as universal as possible, so that within one storage system you can mix different types of tasks and workloads.

    Below is a table comparing DDP and RDG.

    image

    image

    Now let's look at the internals of each group type.

    Dynamic Disk Pool (DDP)


    In DDP, physical disks (of one or two types) are combined into a disk pool; they are not pre-formatted but simply tagged with a label indicating which pool they belong to. A pool consists of virtual blocks (chunks). When creating a pool, the administrator can specify a chunk size of 4 or 16 MB; the default is 4. Also, when a pool is created, disks identical to those in the pool but not added to it are automatically marked as hot spare disks. Accordingly, if a disk in the pool fails during operation and a hot spare is available, the failed disk is removed from the pool, the hot spare is added in its place, and the chunks that resided on the failed disk are rebuilt onto it from the surviving disks (a partial rebuild).
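    A rough mental model of a DDP (illustrative only, not the actual ENGINE code) can be sketched in a few lines: disks are just containers of chunks, a mirrored LUN is a set of chunk pairs placed on different disks, and a partial rebuild copies only the chunks that lived on the failed disk onto the hot spare. All names and structures below are assumptions made for the example.

```python
import random

CHUNK_MB = 4  # default DDP chunk size

class Pool:
    """Toy DDP model: disks hold chunks; a failed disk triggers a partial
    rebuild of only the chunks that lived on it (not the whole raw disk)."""
    def __init__(self, disks, spares):
        self.disks = {d: set() for d in disks}   # disk -> chunk ids stored on it
        self.spares = list(spares)
        self.mirror = {}                         # chunk id -> its mirror copy

    def create_mirrored_lun(self, name, size_mb):
        for i in range(size_mb // CHUNK_MB):
            a, b = f"{name}:{i}:A", f"{name}:{i}:B"
            d1, d2 = random.sample(sorted(self.disks), 2)  # copies on 2 disks
            self.disks[d1].add(a)
            self.disks[d2].add(b)
            self.mirror[a], self.mirror[b] = b, a

    def fail_disk(self, disk):
        lost = self.disks.pop(disk)
        if not self.spares:
            raise RuntimeError("no hot spare: the pool stays degraded")
        spare = self.spares.pop(0)
        self.disks[spare] = set()
        for chunk in lost:                       # partial rebuild
            twin = self.mirror[chunk]
            assert any(twin in held for held in self.disks.values()), "data lost"
            self.disks[spare].add(chunk)         # re-create only the lost copies
        return len(lost)

random.seed(1)
pool = Pool(disks=["d1", "d2", "d3", "d4"], spares=["d5"])
pool.create_mirrored_lun("lun0", size_mb=400)    # 100 mirrored chunk pairs
print("chunks rebuilt onto the hot spare:", pool.fail_disk("d1"))
```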

    When creating a pool, the system automatically shows how many hot spare disks are available.

    image

    A single default hot spare drive can serve any number of DDP and RDG groups that contain similar drives. There is no need to assign drives for hot sparing explicitly; you just need to pay attention to their number when creating a DDP.

    An important feature of DDP is that the RAID level is set at the LUN level, not at the pool/group level (in RDG it is the other way around). Thus, in one pool you can have LUNs with different RAID levels, and the RAID level of a LUN can be changed at any time without interrupting access to the data.

    After creation, a pool can also be grown or shrunk by adding or removing physical disks (an RDG can only have disks added). If LUNs already exist in the pool, then after adding new disks their chunks can be redistributed onto the new disks to gain more performance and capacity.

    In the DDP pool there are 3 multi-level data storage options that are assigned to the LUN by the storage administrator.

    • RAM read/write cache
    • SSD read/write cache
    • SSD tier (online tiering)

    In all three cases the options are assigned to a specific LUN. You can assign the options to every LUN in the pool, but keep in mind that the LUNs will then compete for the faster SSD and RAM resources. DDP currently has no priority mechanism for multi-level storage (RAILGUN does, but that is a separate system).

    Schematically, the organization of storage in DDP is as follows.

    image

    In our example, three LUNs are created in the disk pool: one stores data only on HDDs, the second on HDD + SSD tier, and the third on HDD + SSD tier + RAM cache.
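    In configuration terms, the example boils down to a set of per-LUN flags. The sketch below (with invented option names, purely for illustration) simply records which LUN gets which acceleration options, to underline that in DDP these options belong to the LUN, not to the pool as a whole.

```python
from dataclasses import dataclass

@dataclass
class LunOptions:
    """Per-LUN acceleration options in a DDP pool (illustrative names)."""
    ram_cache: bool = False    # read/write cache in controller RAM
    ssd_cache: bool = False    # read/write cache on SSD
    ssd_tier: bool = False     # online tiering to an SSD level

# The three LUNs from the diagram above. Because the options are per LUN,
# the LUNs compete for the pool's RAM/SSD resources (DDP has no priorities).
luns = {
    "lun-hdd":            LunOptions(),
    "lun-hdd-tier":       LunOptions(ssd_tier=True),
    "lun-hdd-tier-cache": LunOptions(ssd_tier=True, ram_cache=True),
}
for name, opts in luns.items():
    print(name, opts)
```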

    DDP supports RAID levels 0, 1/10, 5/50, 6/60 (see table below)

    image

    Unlike RDG, DDP supports only standard RAID levels (0, 1/10, 5/50, 6/60). A LUN's chunks can be spread evenly across all disks in the pool when RAID level 1/10 is chosen. If RAID 5 or 6 is chosen, then when creating the LUN the administrator selects the maximum number of disks for it (the limit exists because of how RAID 5 and 6 work). It is still possible to use all the disks in the pool with fives and sixes: for that you use levels 50 or 60, which are essentially stripes of virtual devices built on RAID 5 or 6.
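    A quick back-of-the-envelope comparison (my own arithmetic, not vendor figures; the function name is invented for the example) shows the capacity trade-off between a single RAID-6 set and a RAID-60 built from several RAID-6 virtual devices striped together:

```python
def usable_disks(level: str, disks_per_device: int, devices: int = 1) -> int:
    """Data-bearing disks for the levels DDP supports. RAID-50/60 are modelled
    as `devices` RAID-5/6 virtual devices striped together."""
    overhead = {"0": 0, "10": disks_per_device // 2, "5": 1, "6": 2}
    return (disks_per_device - overhead[level]) * devices

# 12 disks for one LUN: a single RAID-6 vs. RAID-60 built from two 6-disk sixes.
print(usable_disks("6", 12))             # 10 data disks in one failure domain
print(usable_disks("6", 6, devices=2))   # 8 data disks, but each 6-disk device
                                         # tolerates two failed disks on its own
```

    RAID-60 gives up a little capacity to the extra parity, but the LUN can span all the disks of the pool and each virtual device tolerates failures independently.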

    The main advantages of using all disks of the pool at once are:

    • Increased reliability: the more disks a LUN uses, the lower the chance of losing data (see the examples in the failure scenarios)
    • Linear performance scaling: the more disks store the LUN's data, the better the performance (especially in all-flash scenarios)

    Failure scenarios

    Let's take a simple example: a LUN located only on HDDs with RAID level 10. The pool is initially built on four disks. The LUN is in RAID 10, which means all its chunks are distributed evenly across all the disks of the pool and form a mirror of two stripes. For simplicity, let's assume there are no hot spare disks (this does happen, usually due to reduced social responsibility in production).

    Schematically, the data layout of such a LUN looks like this:
    image

    To lose data in such a layout, it is enough for the following to fail:

    • any one of the stripes entirely (2 disks) plus 1 disk from the other stripe (3 disks out of 4);

    image

    • one disk in each stripe holding mirrored chunks (in our case, chunks 3 and 4) - 2 disks out of 4
    image

    At the same time, if a disk holding chunks 1 and 2 fails in one stripe and a disk holding chunks 3 and 4 fails in the other, the data will not be lost. I/O will continue, but hot spare disks are needed to remedy the situation.

    image

    If we need to increase reliability and at the same time performance, we can add disks to the pool and stretch the LUN to new disks.

    Pool scaling can be performed by any number of disks (you can add one at a time, for example). LUN will be stretched by transferring chunks to new disks in the background (after confirmation from the administrator).

    The administrator also has the choice of how to use new disks in the pool:

    • just redistribute chunks, adding performance and reliability
    • Increase LUN capacity with new disks, adding capacity, as well as performance and reliability.

    Let's grow the pool by another 4 disks and redistribute the chunks without increasing the LUN volume.

    image

    Chunks are evenly distributed to all disks of the pool.

    Now, to destroy the data, the following should fail:

    • any one of the stripes entirely (4 disks) plus 1 disk from the other stripe (5 disks out of 8);

    image

    • one disk in each stripe holding mirrored chunks (take the case of chunk 3) - still just 2 specific disks, as in the first example, only now out of 8.

    image

    In other words, the more disks in the pool, the lower the chance that both copies of a chunk end up on failed disks. Accordingly, with 100+ disks in the pool, the chance that 2 disks fail in different stripes while holding identical chunks is extremely low. Thus, in large pools DDP reliability is very high, and with hot spare disks it is higher still.
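    A back-of-the-envelope check of this claim, under the simplifying assumption (as in the diagrams above) that both copies of a given chunk sit on one specific pair of disks: the probability that a random simultaneous failure of two disks lands exactly on that pair falls quickly as the pool grows.

```python
from math import comb

def p_pair_hit(n_disks: int) -> float:
    """Probability that a random simultaneous failure of two disks out of
    n_disks hits exactly one given mirrored pair (both copies of a chunk)."""
    return 1 / comb(n_disks, 2)

for n in (4, 8, 24, 100):
    print(n, f"{p_pair_hit(n):.5f}")
# 4 -> 0.16667, 8 -> 0.03571, 24 -> 0.00362, 100 -> 0.00020
```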

    RAID Distributed Group (RDG)

    In RDG, physical disks are first combined into virtual devices according to the parity scheme of the RAID level specified when the group is created. These virtual devices are then combined into one long stripe, which is the RDG itself.

    RDG follows a few simple rules.

    • The more virtual devices in RDG, the higher its performance and reliability;
    • The higher the RAID level of a single virtual device, the higher the reliability of the group, but the effective volume decreases;
    • The more disks in a single virtual device, the higher the reliability, the larger the usable volume, but the higher the group increment step (see the example below).

    The table below shows the supported RDG levels with the parity per virtual device and the percentage of usable capacity. These are not all possible combinations; if needed, you can define your own virtual device pattern.

    image

    The diagram below shows an example of a triple-parity RAID-60P group of 33 disks. The RAM cache is not shown here because in RDG it is read-only and always on.

    image
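    The usable-capacity arithmetic for this diagram works out as follows (a back-of-the-envelope calculation assuming the 33 disks form three 11-disk virtual devices with triple parity each; the function name is made up for the example):

```python
def rdg_usable(total_disks: int, disks_per_vdev: int, parity_per_vdev: int):
    """Usable capacity of an RDG built from identical virtual devices.
    Each device contributes (disks_per_vdev - parity_per_vdev) data disks."""
    vdevs, rem = divmod(total_disks, disks_per_vdev)
    if rem:
        raise ValueError("disk count must be a multiple of the virtual device size")
    data = vdevs * (disks_per_vdev - parity_per_vdev)
    return data, data / total_disks

# RAID-60P from the diagram: 33 disks = 3 virtual devices of 11, triple parity.
data_disks, fraction = rdg_usable(33, 11, 3)
print(data_disks, f"{fraction:.1%}")   # 24 data disks, roughly 72.7% usable
```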

    A separate useful bonus is that the number of disks (and virtual devices) in the RDG is logically unlimited, which allows you to create really huge groups.

    The hot spare logic in RDG is similar to DDP. When creating a group, the system automatically shows how many hot spare disks will be available. A single default hot spare drive can serve any number of groups that contain similar drives. There is no need to assign hot spare disks explicitly; just pay attention to their number when creating the RDG. Hot spare drives for RDG can simultaneously serve as hot spare drives for DDP and vice versa.

    image

    The rebuild of a failed disk onto a hot spare in RDG is also partial, i.e. the whole disk is not rewritten, only the parts of the data that were damaged, which significantly speeds up the rebuild. This process is also governed by preset policies. At the storage level you can preset one of the following policies:

    • Recovery speed. This policy is intended for cases when the rebuild has to finish quickly. During the rebuild, I/O performance will be reduced (by 20-30%).
    • I/O performance. This policy is intended for cases when performance must not be lost during a rebuild and rebuild speed matters less. In this mode I/O performance is not significantly reduced during the rebuild (no more than 10%), but the rebuild takes 1.5-2 times longer.

    Optionally, an SSD cache and an additional SSD storage tier (online tiering) can be added to an RDG. The minimum number of disks in both cases is two. The SSD cache and SSD tier are always built as RAID-10 virtual devices, regardless of the RDG's level. In the diagram below we added an SSD cache and an SSD tier to our group. Unlike DDP, in RDG the SSD levels are added not to individual LUNs but to the group as a whole, i.e. all of the group's objects (LUNs and file shares) use the SSD levels once they are added.

    image

    A created RDG, as well as its SSD levels, can always be expanded with additional disks on the fly (but cannot be shrunk, unlike DDP). When adding disks, the parity of the RDG and SSD levels has to be taken into account. In our example:

    • the minimum expansion step for the lower level is 11 disks, since the virtual device size is 11 disks. If we had chosen a smaller device size initially, the expansion step would be correspondingly smaller (for RAID-10, for example, it is obviously two disks);
    • for the cache and tiering levels the step is always two disks, since the SSD levels are always RAID 10, so you always add at least a pair.

    In the diagram below we expanded each storage level by one virtual device. We now have 4 virtual devices of 11 disks each (44 disks) at the lower level and 2 virtual devices at each of the SSD levels.

    image

    After new virtual devices are added, the system automatically redistributes the workload across them; no further action is required from the administrator.
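    The expansion rule itself can be expressed as a one-line check (an illustrative sketch of the constraint described above, not a vendor tool): the number of added disks must be a whole number of virtual devices for the level being expanded.

```python
def expansion_valid(added_disks: int, vdev_size: int) -> bool:
    """An RDG level grows only by whole virtual devices, so the number of
    added disks must be a multiple of the virtual device size."""
    return added_disks > 0 and added_disks % vdev_size == 0

# Data level built from 11-disk virtual devices; SSD cache/tier levels are
# always RAID-10 virtual devices, i.e. pairs of disks.
print(expansion_valid(11, 11))  # True  - one more 11-disk virtual device
print(expansion_valid(6, 11))   # False - not a whole virtual device
print(expansion_valid(2, 2))    # True  - one more mirrored SSD pair
```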

    After a group is created (or during creation), you can attach useful extras to it, such as compression and deduplication. Both options can be enabled either at the RDG level (then they apply to the whole group, both block devices and file shares) or at the level of an individual storage object.

    image

    Next, objects are created inside the RDG; this is where the data actually lives. These can be block devices (iSCSI/FC access) or file shares (NFS/SMB/FTP access). LUNs and shares can be created side by side in the same group.

    image

    Now let's consider the failure scenarios for our group.


    To lose data, any of the virtual devices must be destroyed.

    The diagram shows that the lower level consists of four virtual devices, each with three parity disks. Accordingly, to kill a virtual device, four disks have to fail within it (if three disks are lost, the device keeps working normally). Looking at the lower level of the RDG as a whole, we can lose up to twelve disks (three in each virtual device). None of this math takes hot spare disks into account; if they are present, the tolerable number of failures grows by the number of free hot spare disks. In this example, as in the DDP example, we assume for simplicity that there are no hot spares (again, do not do this in a production environment).

    Scenario 1


    Failure of 12 disks out of 44 (3 in each virtual device)

    image

    In each virtual device, the maximum number of disks allowed for this configuration has failed (we leave the SSDs aside; the logic there is identical, just with RAID 10). I/O continues, but the situation is critical: disks need to be added or replaced urgently.

    Scenario 2


    Failure of 13 disks out of 44 (4 of them in one virtual device)

    image

    Nothing can be done here: one of the virtual devices has died, so I/O is stopped and the data has to be restored from backups (we do, of course, have our own recovery tools, but that is a topic for a separate article). Obviously, this situation should never be allowed to happen, so use hot spares and watch the alerts in the storage system or in the monitoring system connected to it (SNMP is supported).
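    Both scenarios can be verified with a few lines of arithmetic. This is a toy model of the reasoning above, assuming four 11-disk virtual devices with triple parity and no hot spares:

```python
def rdg_survives(failed_per_vdev, parity_per_vdev=3):
    """The group survives as long as no single virtual device loses more
    disks than it has parity disks."""
    return all(f <= parity_per_vdev for f in failed_per_vdev)

# Scenario 1: 12 failed disks, 3 per virtual device -> still serving I/O
print(rdg_survives([3, 3, 3, 3]))   # True (critical, but data is intact)

# Scenario 2: 13 failed disks, 4 of them in one virtual device -> group is lost
print(rdg_survives([4, 3, 3, 3]))   # False
```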

    DDP vs RDG


    If we compare the two types of groups, it is easiest to say the following.

    If you need high performance for block access with random or mixed loads and a dynamic RAID, you definitely want DDP. If you need file access or sequential loads, or the ability to create huge RAID groups (petabytes), then RDG is definitely for you.

    Moreover, if you need to change the group type during operation, data migration between RDG and DDP is supported; the process is, of course, non-trivial, but doable.

    Fault tolerance


    Now for high availability in ENGINE. As I wrote earlier, an asymmetric Active/Active scheme (ALUA - Asymmetric Logical Unit Access) is used.

    The ENGINE cluster software works with both block and file access. The heartbeat between the nodes runs over the interconnect. The cluster automatically switches between optimal and non-optimal paths, and also automatically changes the owner of storage groups in the following cases:

    • Controller failure (ownership change)
    • Failure of the storage I/O ports involved (ownership change)
    • Failure of a port on the host (path change from optimal to non-optimal)

    As a visual example, the diagram shows a 2-controller configuration connected to two ports of a host running multipath in the OS. Four storage groups are created on the system: for two of them the first controller (Engine-0) is assigned as owner, and for the other two Engine-1 is the owner. Both controllers (and all 4 groups) are visible to both host ports.

    For DDP0 and RDG0, the owner is Engine-0, so the paths through this controller are optimal for these groups. At the same time there is a non-optimal path (via the interconnect and Engine-1), which is activated if the primary port on the host fails. For DDP1 and RDG1 the opposite is true: Engine-1 is the owner, the optimal path goes through it, and the non-optimal path goes through the interconnect and Engine-0.

    image

    At any time the storage administrator can change the owner of any group. The ownership change takes about 5-10 seconds and happens without interrupting I/O. The administrator performs the same operation to put a controller into maintenance mode, for example when a hardware or software update of the storage system is required.

    Port failure


    Suppose a port fails on the host, the one that was optimal for the DDP0 and RDG0 groups (via Engine-0). In this case the storage system automatically uses the non-optimal path through Engine-1 and the interconnect, which preserves access to the data, but with additional latency.

    image

    When the port on the host is restored, traffic automatically returns to the optimal path.

    Controller failure


    A more complex scenario is a controller failure. In this case non-optimal paths will not save us: if a controller is physically lost (or both I/O ports on a controller fail), the system forcibly changes the owner of all storage groups that lived on the failed controller. This ownership change also happens without interrupting I/O.

    image

    When the Engine-0 controller comes back into operation, the administrator needs to manually move ownership back to Engine-0.
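    To summarize the failover logic in this section, here is a toy model of the ALUA behaviour described above (illustrative only; the real path handling is done by the host's MPIO stack and the storage firmware, and the class is invented for the example):

```python
class AluaCluster:
    """Toy model: each group has an owner controller; paths through the owner
    are optimal, paths through the partner (via interconnect) are non-optimal."""
    def __init__(self):
        self.controllers = {"Engine-0": True, "Engine-1": True}   # alive flags
        self.owner = {"DDP0": "Engine-0", "RDG0": "Engine-0",
                      "DDP1": "Engine-1", "RDG1": "Engine-1"}

    def path_state(self, group, controller):
        if not self.controllers[controller]:
            return "unavailable"
        return "optimal" if self.owner[group] == controller else "non-optimal"

    def fail_controller(self, controller):
        self.controllers[controller] = False
        partner = next(c for c in self.controllers if c != controller)
        # ownership of every group on the failed controller moves to the partner
        for group, owner in self.owner.items():
            if owner == controller:
                self.owner[group] = partner

cluster = AluaCluster()
print(cluster.path_state("DDP0", "Engine-1"))   # non-optimal (via interconnect)
cluster.fail_controller("Engine-0")
print(cluster.path_state("DDP0", "Engine-1"))   # optimal after ownership change
```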

    Conclusion


    As the examples show, AERODISK ENGINE currently offers a rather flexible storage architecture applicable to a wide range of tasks, and a high degree of reliability, which allows the system to be used for business-critical workloads.

    At present, about a hundred ENGINE storage systems have been installed in different regions of Russia, and the number is growing steadily. Most of these installations went through long tests at our partners' and customers' sites, and the majority of them completed successfully. Of course, these are not the tens of thousands of installations that the transnational top vendors have, but Moscow was not built in a day, and the best is still ahead.

    In the following articles about ENGINE, I will describe additional functional goodies such as deduplication, compression, snapshots, and replication, followed by a separate article on performance tests of RDG and DDP groups. After that, expect a breakdown of our other products.

    In addition, on October 9 (Tuesday), from 11:00 to 12:30 (Moscow time), a free technical webinar will take place together with TS Solution, where we will tell about and, most importantly, show on a live system how everything works.
