History of the struggle for IOPS in a home-grown SAN

    Hello!

    One of my projects runs on something resembling a private cloud: several servers for data storage and several diskless ones responsible for virtualization. The other day I seem to have finally closed the question of squeezing maximum performance out of this solution's disk subsystem. It turned out to be quite interesting, and at some points quite unexpected. So I want to share with the habra-community a story that began back in 2008, before the arrival of the "First Cloud Provider in Russia" and the campaigns for mailing out free water meters.

    Architecture

    Virtual hard disks are exported over a separate gigabit network using the AoE protocol. In short, it is the brainchild of Coraid, who proposed carrying ATA commands directly over Ethernet. The protocol specification is only about a dozen pages long! Its main feature is the absence of TCP/IP: data transfers carry minimal overhead, but the price of this simplicity is that the protocol cannot be routed.
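    On the storage side such an export can be served by the vblade userspace target, with aoetools on the initiator side. A minimal sketch (shelf/slot numbers, interface and device names are illustrative, not our actual configuration):

    # storage node: expose a block device as AoE target e0.1 on eth1
    vblade 0 1 eth1 /dev/vg0/vm-disk-1

    # diskless node: load the AoE driver and discover exported targets
    modprobe aoe
    aoe-discover
    ls /dev/etherd/    # the target shows up as /dev/etherd/e0.1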

    Why such a choice? Setting aside a retelling of the official sources, it largely came down to the banal low cost.

    Accordingly, the storage servers used ordinary 7200 rpm SATA disks. Their weakness is known to everyone: low IOPS.



    RAID10

    The very first, most popular and obvious way to attack random-access performance. We picked up mdadm, typed a couple of appropriate commands into the console and raised LVM on top (after all, the plan was to hand out block devices to the virtual machines).
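    In command form it boiled down to roughly this (a sketch; device names and sizes are illustrative, not our actual commands):

    # four SATA disks into a RAID10 array
    mdadm --create /dev/md127 --level=10 --raid-devices=4 /dev/sd[b-e]1

    # LVM on top: logical volumes will become the virtual machines' disks
    pvcreate /dev/md127
    vgcreate vg0 /dev/md127
    lvcreate -L 50G -n vm-disk-1 vg0

    Then we ran a few naive tests: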

    root@storage:~# hdparm -tT /dev/md127
    /dev/md127:
     Timing cached reads:   9636 MB in  2.00 seconds = 4820.51 MB/sec
     Timing buffered disk reads: 1544 MB in  3.03 seconds = 509.52 MB/sec
    


    To be honest, we were afraid to even measure IOPS: apart from migrating to SCSI or writing our own crutches, there were no options for solving the problem anyway.
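    (Had we dared, something like the following fio run would have produced the number. A sketch with an illustrative device name; be careful reading directly from a device that is in use.)

    # 4K random reads for 60 seconds, bypassing the page cache
    fio --name=randread --filename=/dev/md127 --direct=1 --rw=randread \
        --bs=4k --ioengine=libaio --iodepth=32 --runtime=60 --time_based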

    Network and MTU

    Although the network was gigabit, read speed from the diskless servers did not reach the expected ~100 MiB/s. Naturally, the network card drivers were to blame (hi, Debian). Fresh drivers from the manufacturer's website seem to have partially fixed the problem...

    Every AoE speed-optimization manual lists setting the maximum MTU as step one. At that moment ours was 4200. It sounds ridiculous now, but compared with the standard 1500 the linear read speed really did reach ~120 MiB/s. Cool! Even with a light load on the disk subsystem from all the virtual servers, local caches smoothed things over, and inside each virtual machine linear reads held at 50 MiB/s or better. Pretty good, actually! Over time we changed the network cards and the switch, and raised the MTU to the maximum 9K.
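    Raising the MTU is a one-liner per interface, provided the switch also supports jumbo frames (interface name illustrative):

    # on every storage and diskless node
    ip link set dev eth1 mtu 9000
    # or persistently on Debian: an "mtu 9000" line in the iface stanza of /etc/network/interfaces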

    Until MySQL Came Along

    Yes, some of the projects hammered MySQL 24/7, with both reads and writes. In iotop it looked something like this:
    Total DISK READ: 506.61 K/s | Total DISK WRITE: 0.00 B/s
      TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                       
    30312 be/4 mysql     247.41 K/s   11.78 K/s  0.00 % 11.10 % mysqld
    30308 be/4 mysql     113.89 K/s   19.64 K/s  0.00 %  7.30 % mysqld
    30306 be/4 mysql      23.56 K/s   23.56 K/s  0.00 %  5.36 % mysqld
    30420 be/4 mysql      62.83 K/s   11.78 K/s  0.00 %  5.03 % mysqld
    30322 be/4 mysql      23.56 K/s   23.56 K/s  0.00 %  2.58 % mysqld
    30445 be/4 mysql      19.64 K/s   19.64 K/s  0.00 %  1.75 % mysqld
    30183 be/4 mysql       7.85 K/s    7.85 K/s  0.00 %  1.15 % mysqld
    30417 be/4 mysql       7.85 K/s    3.93 K/s  0.00 %  0.36 % mysqld
    

    Harmless? Not quite. A huge stream of small requests, 70% iowait on that virtual server, 20% load on each of the storage's hard disks (according to atop), and this dreary picture on the other virtual machines:
    root@mail:~# hdparm -tT /dev/xvda
    /dev/xvda:
     Timing cached reads:   10436 MB in  1.99 seconds = 5239.07 MB/sec
     Timing buffered disk reads:  46 MB in  3.07 seconds =  14.99 MB/sec
    

    And that was one of the better moments! Often the linear read speed was no more than 1-2 MiB/s.

    I think everyone has already guessed what we had run into: the low IOPS of SATA drives, RAID10 notwithstanding. A single 7200 rpm spindle is good for roughly 75-100 random IOPS, and an array multiplies that only by the number of spindles at best.

    Flashcache

    These guys showed up just in time! This was salvation, this was it! Life was looking up, we would be saved!
    An urgent purchase of Intel SSDs, the flashcache module and utilities added to the live image of the storage servers, a write-back cache configured, and fire in our eyes. Hmm, all the counters sat at zero. Well, the quirks of LVM + flashcache are easy to google, and the problem was quickly resolved.
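    Creating the cache device itself takes a single command from the flashcache utilities; judging by the devices visible in the statistics below, ours would have looked roughly like this (a sketch, not the exact invocation):

    # write-back cache "cachedev": SSD /dev/sda in front of the RAID10 array /dev/md2
    flashcache_create -p back cachedev /dev/sda /dev/md2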

    On the virtual server with MySQL, loadavg dropped from 20 to 10. Linear reads on the other virtual machines climbed to a stable 15-20 MiB/s. But don't be fooled!

    After some time, I collected the following statistics:
    root@storage:~# dmsetup status cachedev
    0 2930294784 flashcache stats: 
            reads(85485411), writes(379006540)
            read hits(12699803), read hit percent(14)
            write hits(11805678) write hit percent(3)
            dirty write hits(4984319) dirty write hit percent(1)
            replacement(144261), write replacement(111410)
            write invalidates(2928039), read invalidates(8099007)
            pending enqueues(2688311), pending inval(1374832)
            metadata dirties(11227058), metadata cleans(11238715)
            metadata batch(3317915) metadata ssd writes(19147858)
            cleanings(11238715) fallow cleanings(6258765)
            no room(27) front merge(1919923) back merge(1058070)
            disk reads(72786438), disk writes(374046436) ssd reads(23938518) ssd writes(42752696)
            uncached reads(65392976), uncached writes(362807723), uncached IO requeue(13388)
            uncached sequential reads(0), uncached sequential writes(0)
            pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)
    

    A read hit percent of 14 and a write hit percent of 3, plus a huge number of uncached reads/writes. So flashcache was working, but not at full strength. There were only a couple of dozen virtual machines, the total volume of virtual disks did not exceed a terabyte, and disk activity was light. In other words, such a low cache-hit percentage could not be blamed on noisy neighbors.

    Insight!

    Looking at this for the hundredth time:
    root@storage:~# dmsetup table cachedev
    0 2930294784 flashcache conf:
            ssd dev (/dev/sda), disk dev (/dev/md2) cache mode(WRITE_BACK)
            capacity(57018M), associativity(512), data block size(4K) metadata block size(4096b)
            skip sequential thresh(0K)
            total blocks(14596608), cached blocks(3642185), cache percent(24)
            dirty blocks(36601), dirty percent(0)
            nr_queued(0)
    Size Hist: 512:117531108 1024:61124866 1536:83563623 2048:89738119 2560:43968876 3072:51713913 3584:83726471 4096:41667452 
    

    I decided to open my favorite Excel (well, LibreOffice Calc):

    The diagram is built from the last line of that output: a histogram of the distribution of requests by block size.
    As we all know, hard drives traditionally operate on 512-byte sectors, and AoE does exactly the same. The Linux kernel, however, works in 4096-byte blocks, and the data block size in flashcache is also 4096.

    Summing up the number of requests with block sizes other than 4096, the total suspiciously matched the number of uncached reads plus uncached writes from the flashcache statistics. Only whole 4K blocks were being cached! Now remember that our MTU was initially 4200? Subtract the AoE packet header and round down to whole 512-byte sectors, and the data payload comes out to 3584 bytes. That means any 4K request to the disk subsystem was split into at least two AoE packets, 3584 bytes plus 512 bytes, which is exactly what I saw in the original histogram. Even in the diagram above, the predominance of 512-byte requests is noticeable. And the 9K MTU recommended on every corner has the same problem: its data payload is 8704 bytes, i.e. two whole 4K blocks plus a 512-byte tail (again, exactly what the diagram shows). Ouch! The solution, I think, is obvious to everyone.
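    Incidentally, the bucket arithmetic is easy to reproduce straight from the dmsetup output (a sketch; compare the printed sum with uncached reads + writes from dmsetup status):

    dmsetup table cachedev | awk '/Size Hist/ {
        for (i = 3; i <= NF; i++) {      # fields look like "512:117531108"
            split($i, kv, ":")
            if (kv[1] != 4096) sum += kv[2]
        }
        print "requests not a whole 4K block:", sum
    }'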

    MTU 8700
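    The change is the same one-liner as before, just with a value that looks odd at first glance (interface name illustrative):

    ip link set dev eth1 mtu 8700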



    With an MTU of 8700 the AoE data payload comes out to 8192 bytes, exactly two whole 4K blocks, so requests are no longer chopped into sub-4K fragments that fly past the cache. The diagram was made a few days after updating the configuration on one of the diskless nodes; once the MTU is updated on the others, the picture should get even better. And loadavg on the virtual server with MySQL dropped to 3!

    Conclusion

    Not being system administrators with 20 years of experience, we solved our problems using the "standard", most popular approaches known to the community at the time. But in the real world there is always room for imperfections, crutches and assumptions, and that is exactly what we ran into.
    That's the whole story.
