How developers quietly sat in St. Petersburg eating mushrooms, and then wrote an OS for data storage systems



    At the end of 2008, a then still small St. Petersburg company got something like this from a Western media holding:
    “Is it true you went hardcore over there and adapted SSE instructions to implement Reed-Solomon codes?”
    - Yes, but we don't ...
    - I don't care. Want an order?

    The problem was that video editing required hellish performance, and back then RAID-5 arrays were used. The more disks in a RAID-5, the higher the probability of a failure right during an editing session (about 6% for 12 disks, and already 17-18% for 36 disks). A disk dropping out during editing is unacceptable: even when a disk dies in a high-end storage system, speed degrades sharply. The media holding got tired of banging its head against the wall every time, so someone pointed them to the gloomy Russian genius.
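    As a rough sanity check of those figures (my own back-of-the-envelope estimate, not the vendor's math): if each disk independently has about a 0.5% chance of dying during one such session, the probability that at least one of n disks fails is 1 - (1 - 0.005)^n, which comes out to roughly 5.8% for n = 12 and about 16.5% for n = 36, in line with the numbers above.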

    Much later, when our compatriots had grown up, a second interesting task appeared: Silent Data Corruption. This is a type of storage error where a bit in the data and the corresponding check bit change on the platter at the same time. If we are talking about video or photos, nobody would even notice. But if we are talking about medical data, it becomes a diagnostics problem. So a special product appeared for this market.
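    To make the failure mode concrete, here is a toy illustration (my example, not RAIDIX code): with plain XOR parity, a data bit and the matching parity bit flipping together leaves the check satisfied, so the corruption stays silent.

        #include <stdio.h>
        #include <stdint.h>

        /* Toy illustration: plain XOR parity cannot see a paired flip. */
        static int parity_ok(const uint8_t data[4], uint8_t parity) {
            uint8_t x = 0;
            for (int i = 0; i < 4; i++) x ^= data[i];
            return x == parity;
        }

        int main(void) {
            uint8_t data[4] = {0x12, 0x34, 0x56, 0x78};
            uint8_t parity = data[0] ^ data[1] ^ data[2] ^ data[3];

            data[2] ^= 0x01;   /* a bit flips on the platter...          */
            parity  ^= 0x01;   /* ...and so does the matching check bit  */

            printf("check still passes: %d\n", parity_ok(data, parity)); /* prints 1 */
            return 0;
        }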

    Below is the story of what they did, a bit of math, and the result: an OS for high-load storage. Seriously, the first Russian OS actually polished and released. Albeit an OS for storage.

    Media holding


    The story began in 2008-2009 as a project for an American customer. The task was to develop a storage system that would deliver high performance while costing less than a cluster of off-the-shelf high-end arrays. The holding had a lot of standard commodity hardware of the Amazon kind: x86 architecture, typical disk shelves. The assumption was that "these Russians" would be able to write control software that would combine the devices into clusters and thereby ensure reliability.

    The problem, of course, is that RAID-6 required a lot of processing power for the polynomial arithmetic, which was not an easy task for an x86 CPU. That is why storage vendors used, and still use, their own specialized solutions and ultimately ship the rack as a kind of “black box”.

    But back to 2009. The main task at the start of the project was fast data decoding. In general, RAIDIX (the name our heroes took much later) has always been about high decoding performance.

    The second task was to make sure read/write speed does not collapse when a disk fails. The third is related to the second: dealing with errors on HDDs that appear at an unknown time on an unknown medium. In effect, a whole new field of work opened up: detecting latent data defects on read and correcting them.

    At the time, these problems mattered only for large, very large storage and genuinely fast data. By and large, the architecture of “ordinary” storage systems suited everyone except those who constantly read and wrote really large volumes of data (rather than spending 80% of the time on a “hot” data set amounting to 5-10% of the DBMS).

    Back then there were no end-to-end data protection standards as such (more precisely, there was no sane implementation), and even now not all disks and controllers support them.

    First tasks


    The project began. Andrei Rurikovich Fedorov, a mathematician and the company's founder, started by optimizing data recovery for the typical architecture of Intel processors. Right then the first project team found a simple but really effective approach to vectorizing multiplication by a primitive element of the Galois field: using SSE registers, 128 field elements are multiplied by x at once with just a few XORs. And as you know, multiplication by any element of a finite field can be reduced to repeated multiplication by a primitive element, since every nonzero element is a power of it. In general, there were a lot of ideas built on the advanced features of Intel processors.
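    A minimal sketch of how such a bit-sliced multiply-by-x can look (my reconstruction, assuming GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x^2 + 1; the field and data layout RAIDIX actually uses may differ). Eight 128-bit registers hold bit i of 128 different elements each, so a single XOR touches all 128 elements at once:

        #include <emmintrin.h>  /* SSE2 intrinsics */

        /* b[i] holds bit i of 128 GF(2^8) elements (bit-sliced layout).
         * Multiplying all 128 elements by the primitive element x is then
         * just a renaming of the slices plus three 128-bit XORs, since
         * x^8 reduces to x^4 + x^3 + x^2 + 1. */
        static void gf256_mul_by_x_bitsliced(__m128i b[8])
        {
            __m128i carry = b[7];   /* x^7 coefficients of all 128 elements */

            b[7] = b[6];
            b[6] = b[5];
            b[5] = b[4];
            b[4] = _mm_xor_si128(b[3], carry);
            b[3] = _mm_xor_si128(b[2], carry);
            b[2] = _mm_xor_si128(b[1], carry);
            b[1] = b[0];
            b[0] = carry;
        }

    Multiplying by an arbitrary fixed element then decomposes into a chain of such multiply-by-x steps and XORs.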

    When it became clear that the ideas were, on the whole, working out, but that an OS-level product operating at the lowest level was needed, a department was first carved out, and then a separate company, RAIDIX, was founded.

    There were many ideas; to study and verify them, people from St. Petersburg State University were brought in. Work began at the intersection of science and engineering: attempts to create new algorithms. For example, a lot of work went into matrix inversion (an algorithmically hard task, but very important for decoding). They picked at Reed-Solomon, tried increasing the field size to 2^16 and even 2^256, and looked for quick ways to detect Silent Data Corruption. They ran experiments on assembler prototypes and estimated the algorithmic complexity of each option. Most experiments ended negatively; roughly one attempt in 30-40 paid off in performance. For example, that same increase in field size had to be dropped: wonderful in theory, but in practice it degraded decoding.

    Next came systematic work on extending RAID 7.N. They checked what happens as the number of disks, disk partitions and so on grows. Intel added the AES instruction set for security, and among it a very convenient instruction for multiplying polynomials (PCLMULQDQ) was found. They thought it could be used for the code, but after testing they found no advantage over the existing performance.
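    For reference, here is roughly what that instruction does (a sketch of mine, not the code path they actually benchmarked): PCLMULQDQ performs a carry-less multiplication of two 64-bit polynomials over GF(2), and the 127-bit product still has to be reduced modulo the field polynomial.

        #include <wmmintrin.h>   /* PCLMULQDQ intrinsics; compile with -mpclmul */
        #include <stdint.h>

        /* Carry-less multiply of two 64-bit polynomials over GF(2).
         * The immediate 0x00 selects the low 64-bit halves of both operands. */
        static __m128i clmul_64x64(uint64_t a, uint64_t b)
        {
            __m128i va = _mm_set_epi64x(0, (long long)a);
            __m128i vb = _mm_set_epi64x(0, (long long)b);
            return _mm_clmulepi64_si128(va, vb, 0x00);
        }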

    The company has grown to 60 people dedicated exclusively to data storage.

    In parallel, they began working on a fault-tolerant configuration. At first it was assumed that the failover cluster would be built on open source software, but the quality and universality of that code turned out to be insufficient for the specific practical problems. Meanwhile new problems kept appearing: for example, when a host interface went down, the controller traditionally went through re-election and master switchover, which together took an astronomically long time, up to a minute or two. A new approach to hosts was needed: points were assigned for each session (the more open sessions, the more points), and new hosts performed discovery. In the third generation it turned out that even with synchronous replication, due to quirks of the software and hardware implementation, a session could appear on one controller earlier than on the other, and an unwanted failover occurred. A fourth generation was needed: their own cluster manager, built specifically for storage, in which failures of host interfaces and backend interfaces are handled correctly with all the hardware quirks taken into account. A lot of low-level work had to be finished, but the result was worth it: switchover now takes a couple of seconds at most, Active-Active has become much more correct, and auto-configuration of drive enclosures was added.

    In the end, the move to RAID 7.3 yielded a very good SATA optimization: data recovery without loss of performance.

    Implementation


    The solution is used by storage vendors as well as by owners of large storage in the USA, China and Korea. The second category is custom integrator projects, most often media, supercomputers, and healthcare. During the Olympic Games the end consumer was Panorama, a sports broadcasting studio: they were producing the broadcast picture from the Olympics. There are RAIDIX users in Germany, India, Egypt, Russia, and the USA.

    Here's what happened:


    On one controller: regular x86 hardware + OS = fast and very, very cheap storage.


    Two controllers: the result is a redundant system (but more expensive).

    An important feature is partial volume recovery: each stripe carries three checksums.



    Thanks to its own RAID calculation algorithm, it can rebuild only the specific area of a disk that contains corrupted data, which shortens the array's recovery time. This is very effective for high-capacity arrays.

    The second thing: a proactive reconstruction mechanism is implemented that excludes from the read path up to two (RAID 6) or up to three (RAID 7.3) disks whose read speed is lower than the rest. When reconstructing the data is faster than reading it, reconstruction is naturally what gets used.

    It works like this: out of K strips, only K-N are needed to assemble a section of data. If the data is intact, reading of the remaining strips stops.
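    A toy model of that read path, under simplifying assumptions of mine (plain XOR parity, i.e. N = 1, and made-up latencies; the real code deals with Reed-Solomon and up to three parity strips): the slowest strip is simply not waited for and is recomputed from the others.

        #include <stdio.h>

        #define K     5     /* strips per stripe, one of them XOR parity */
        #define STRIP 8     /* bytes per strip in this toy               */

        int main(void) {
            unsigned char strips[K][STRIP] = { "strip-0", "strip-1", "strip-2", "strip-3", {0} };
            int latency_ms[K] = {4, 3, 90, 5, 4};              /* strip 2 is the laggard */

            /* build the XOR parity strip */
            for (int i = 0; i < K - 1; i++)
                for (int j = 0; j < STRIP; j++)
                    strips[K - 1][j] ^= strips[i][j];

            /* skip the slowest strip and reconstruct it from the K-1 others */
            int slow = 0;
            for (int i = 1; i < K; i++)
                if (latency_ms[i] > latency_ms[slow]) slow = i;

            unsigned char rebuilt[STRIP] = {0};
            for (int i = 0; i < K; i++)
                if (i != slow)
                    for (int j = 0; j < STRIP; j++)
                        rebuilt[j] ^= strips[i][j];

            printf("skipped slow strip %d, reconstructed \"%s\"\n", slow, (char *)rebuilt);
            return 0;
        }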

    This means that in RAID 7.3 on 24 disks with 3 failures, at 12 Gb/s per core (4 cores), the speed of reconstruction exceeds the speed of reading from a spare and even of accessing RAM: despite the lost disks, read performance is preserved.

    The next low-level problem is an attempt to read a damaged sector. Even on enterprise systems the delay can reach 8 seconds; you have probably seen such “hangs” of HDD-based storage. With this algorithm, data not coming back from three out of 24 disks simply means the read slows down by a few milliseconds.

    Moreover, the system remembers the drives with the longest response times and stops sending them requests for one second, which reduces the load on system resources. Drives with consistently long response times are given the status “slow” and a notification suggests they should be replaced.
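    A rough sketch of that bookkeeping as I understand it (the threshold, strike count and field names are my own assumptions, not RAIDIX internals):

        #include <stdint.h>
        #include <stdbool.h>

        typedef struct {
            uint64_t benched_until_ms;   /* no requests sent before this time */
            unsigned slow_strikes;
            bool     flagged_slow;       /* "replace me" notification         */
        } drive_state;

        /* Called when a drive answers a request. */
        void on_reply(drive_state *d, uint64_t now_ms, uint64_t latency_ms) {
            if (latency_ms > 100) {                  /* assumed slowness threshold   */
                d->benched_until_ms = now_ms + 1000; /* stop sending requests for 1 s */
                if (++d->slow_strikes > 10)          /* assumed strike count          */
                    d->flagged_slow = true;          /* mark as "slow", suggest swap  */
            } else if (d->slow_strikes > 0) {
                d->slow_strikes--;
            }
        }

        /* The I/O scheduler checks this before dispatching to a drive. */
        bool may_send_request(const drive_state *d, uint64_t now_ms) {
            return now_ms >= d->benched_until_ms;
        }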


    Screenshots of the interface

    Given the advantages of RAIDIX, many customers decided to migrate to it. Migration is held back by the company's small size: it is hard for a St. Petersburg developer to account for every peculiarity of database mirroring and other specific data. The latest version made big strides toward smooth migration, but it still will not be as seamless as on a high-end array with a mirrored Active-Active connection; a shutdown will most likely be required.

    Details


    In Russia, personally, I see an opportunity to get quite interesting storage options for minimal money. We will assemble solutions on normal, stable hardware and deliver them to customers ready to run. The main advantage, of course, is the rubles per GB/s. Very cheap.

    For example, here is a configuration:
    • HP DL380p Gen8 server (Intel Xeon E5-2660v2, 24 GB of memory + LSI SAS HBA 9207-8i controllers).
    • The RAIDIX 4.2 OS goes onto 2 disks; the remaining 10 are 2 TB SATA.
    • External interface: 10 Gb/s Ethernet.
    • 20 TB of usable space.
    • Licenses for 1 year (including technical support and updates).
    • Sales price: $30,000.

    An expansion shelf connected via SAS with 12 disks of 2 TB each costs $20,000 per the price list. The price includes OS preinstallation. 97% of the space on data disks is usable for data. LUNs of unlimited size. Fibre Channel, InfiniBand (FDR, QDR, DDR) and iSCSI are supported; there are SMB, NFS, FTP, AFP, Hot Spare, UPS support, and up to 64 disks in a RAID 0/6/10/7.3 array (the latter with triple parity). 8 Gb/s on RAID 6. There is QoS. The result is an optimal solution for post-production, in particular color grading and editing, for TV broadcasting, and for ingesting data from HD cameras. With a family of nodes you can get 150 Gb/s without a significant drop in reliability, even under Lustre: this is highload territory.

    Here is a link to the spec with more details (PDF).

    Tests


    1. Single-controller configuration. SuperMicro SSG-6027R-E1R12L server, 2U, with 12 SATA 3.5” drives of 4 TB each. External interface: 8 Gbps FC. 48 TB of raw space for $12,000.

    2. Dual-controller configuration. SuperServer 6027TR-DTRF server with 2 boards (like blades), plus a shelf with 45 disks of 4 TB each. External interface: 8 Gbps FC. 180 TB of raw space for $30,000.

    Configuration (a): RAID 7.3 on 12 disks, 36 TB of usable capacity, $0.33/GB.
    Configuration (b): three RAID 7.3 arrays of 15 disks each, $0.166/GB.

    FC 8G performance, sequential read/write (version 4.2.0-166, RAID 6, single controller, ATTO 8 × 2):

    Operation   Block size   IOps      MB/s     Avg max response time
    read        4K           80360.8   329.1    55.8
    read        128K         11946.7   1565.8   54.3
    read        1M           1553.5    1628.9   98.3
    write       4K           18910.8   77.4     44.8
    write       128K         9552.2    1252.0   54.9
    write       1M           1555.7    1631.2   100.4

    Here are the other results.

    Overall


    I am very glad that we suddenly found such a vendor close by, solving very specific problems. The company does not produce its own components, has no other service business, and does not plan to do system integration, so we agreed to cooperate. As a result, my department now works, in particular, with solutions based on the RAIDIX OS. The first deployments in Russia will, of course, go strictly hand in hand with the vendor.

    We tested several configurations on a demo stand and are generally satisfied, although we found a couple of pitfalls (which is normal for new software versions). If you are interested in the implementation details, write to atishchenko@croc.ru and I will tell you in more detail whether it is worth it in your case.
