SSD + raid0 - not so simple
Colleagues from the neighboring department (UCDN) came to me with a rather interesting and unexpected problem: when testing raid0 on a large number of SSDs, performance changed in this sad way:
The X axis is the number of disks in the array, the Y axis is megabytes per second.
I began to study the problem. The initial diagnosis was simple: the hardware raid could not cope with a large number of SSDs and ran into its own performance ceiling.
After the hardware raid was thrown out, an HBA was put in its place, and the disks were assembled into raid0 using linux-raid (often called 'mdadm' after its command line utility), the situation improved. But the problem did not go away completely: the numbers went up, but were still below the expected ones. The key parameter here was not IOPS but multithreaded linear writes (that is, large chunks of data written to random places).
The situation was unusual for me: I had never chased pure bandwidth on raids. IOPS are everything to us. And here the requirement was lots and lots per second, and then some more.
I started by establishing a baseline, that is, single-disk performance. I did it mostly to clear my conscience.
Here is a linear read graph with one SSD.
Seeing the result, I really blew up, because it very much resembled the tricks that manufacturers of cheap USB flash drives resort to. They put fast memory in the region where the FAT (the allocation table of the FAT32 file system) lives and slower memory in the data area. This gives a slight performance gain on small metadata operations, on the assumption that users who copy large files are, first, willing to wait, and second, that those operations will happen in large blocks anyway. More about this heartbreaking phenomenon: lwn.net/Articles/428584
I was sure that I had found the root cause of all the problems and was already preparing a stinging message (see the captions in the picture) explaining what dismal, low-quality, “fertilizer”-grade hardware had ended up on the test bench, along with many other words that are better not repeated.
However, I was bothered by the kernel version on the test bench: 3.2. From previous experience, knowing the deplorable habit of LSI of changing literally everything in its driver (mpt2sas) from version to version, I thought, “what if”?
A bit of background. mpt2sas is the LSI driver for its HBAs. It lives an incredibly stormy life, going from version v00.100.11.15 through 01.100.0x.00 all the way up to 16.100.00.00 (I wonder what the number “100” means?). Along the way the driver has distinguished itself by reordering drive letter names on a minor kernel update, by a disk order that differs from the one announced by the BIOS, by crashes on “unexpected” LUN configurations, on backplane timeouts and on an unexpected number of disks, by logging errors to dmesg at the speed of an infinite loop inside the kernel itself (de facto the equivalent of a system crash), and by similar funny things.
I updated the kernel and ran the test again. And the “what if” happened. This is what the same chart looks like on 3.14. And I had almost rejected innocent SSDs.
After single-disk performance stabilized, a second test was conducted: independent tests were run on all disks in parallel. The goal was simple: to check whether there was a bottleneck somewhere on the bus or the HBA. Disk performance turned out to be quite decent, and there was no bottleneck on the bus. The main task was solved. However, the performance graph of the array still fell short: not by much, but with a clear hint of worse-than-linear scaling of write speed.
Why do writes behave this way as the number of disks in an array increases? The graph (at the beginning of the article) very much resembled the graphs of multithreaded application performance against the number of threads, the kind that programmers and Intel typically show when they talk about thread lock contention...
During the test something strange was observed in blktop: some of the disks were loaded to the ceiling, while others were almost idle. Moreover, the disks loaded to the ceiling were the ones showing poor performance, while the “fast” disks were idle. On top of that, disks sometimes swapped places: a disk previously loaded at 100% would suddenly show higher speed and lower load, and vice versa, a disk that had been loaded at 50% would suddenly be 100% loaded while showing lower speed. Why?
And then it dawned on me.
raid0 depends on the latency of the worst drive
If we write a lot of data, the writes usually go in large chunks. These are split into smaller chunks by the raid0 driver, which writes them simultaneously to all the disks in the raid0. Thanks to this, we get an N-fold increase in performance (for raid0 on N drives).
But let's take a closer look at a write...
Let's say the raid uses 512 KB chunks. There are 8 disks in the array. The application wants to write a lot of data, and writes to the raid in 4 MB pieces.
Now watch closely:
- raid0 receives a write request and divides the data into 8 pieces of 512 KB each
- raid0 sends (in parallel) 8 requests to 8 devices to write 512 KB (each device gets its own piece)
- raid0 waits for confirmation from all 8 devices that the write has completed
- raid0 reports “written” to the application (that is, returns control from the write() call)
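The steps above can be sketched in Python. This is an illustration only, not the md driver's code: `split_request` is a hypothetical helper, and the plain round-robin mapping assumes that all disks in the array are the same size.

```python
def split_request(offset, length, chunk_size, n_disks):
    """Map a byte range of the array onto (disk, disk_offset, size) pieces.

    Assumes a simple round-robin layout over equally sized disks.
    """
    pieces = []
    end = offset + length
    while offset < end:
        chunk_index = offset // chunk_size      # which chunk of the array
        within = offset % chunk_size            # position inside that chunk
        size = min(chunk_size - within, end - offset)
        disk = chunk_index % n_disks            # chunks rotate over the disks
        stripe = chunk_index // n_disks         # full rows of chunks before this one
        pieces.append((disk, stripe * chunk_size + within, size))
        offset += size
    return pieces


KIB = 1024
# A 4 MB write at the start of an 8-disk array with 512 KB chunks.
for disk, disk_offset, size in split_request(0, 4096 * KIB, 512 * KIB, 8):
    print(f"disk {disk}: write {size // KIB} KB at offset {disk_offset}")
```

Each of the 8 disks receives exactly one 512 KB piece, and the request as a whole is finished only when every piece has been acknowledged.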
Now imagine that the disks completed their writes in the following times (in milliseconds):
Disk 1 | Disk 2 | Disk 3 | Disk 4 | Disk 5 | Disk 6 | Disk 7 | Disk 8
Question: how long will it take to write a 4 MB block to this array? Answer: 9.7 ms. Question: what will the utilization of disk 4 be during this time? Answer: about 10%. And disk 6? 100%. Note that for this example I picked the most extreme values from the operations log, but even with a smaller spread the problem persists. Compare the read and write graphs (here is the same picture again):
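The arithmetic behind these answers fits in a few lines of Python. The per-disk times below are hypothetical placeholders, with two exceptions taken from the text: disk 6 is set to 9.7 ms, and disk 4 is chosen so that its utilization comes out at roughly 10%. `stripe_write` is a name made up for this sketch.

```python
def stripe_write(latencies_ms):
    """Return (total time, per-disk utilization) for one striped write.

    The write completes only when the slowest disk acknowledges,
    so the total time is the maximum of the per-disk latencies.
    """
    total = max(latencies_ms)
    utilization = [t / total for t in latencies_ms]
    return total, utilization


# Hypothetical per-chunk write times in ms for disks 1..8
# (only the 9.7 ms of disk 6 comes from the text).
latencies = [1.2, 1.5, 1.1, 1.0, 2.0, 9.7, 1.3, 1.4]
total, util = stripe_write(latencies)
print(f"4 MB write completes in {total} ms")
for number, share in enumerate(util, start=1):
    print(f"disk {number}: busy {share:.0%} of that time")
```

Seven of the eight disks spend most of the request waiting for the eighth, which is exactly the picture blktop showed.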
See how uneven writing is compared with reading?
SSDs have very uneven write latency. This is due to their internal structure (a large block is written at a time, with data moved from place to place when necessary). The larger this block, the stronger the latency peaks (i.e. momentary performance dips). Regular magnetic disks have completely different graphs: they resemble a flat line with almost no deviations. With linear sequential IO this line runs high, with sustained random IO it runs low, but the key point is that it is constant. The latency of hard drives is predictable, the latency of SSDs is not. Note that all SSDs have this property. On the most expensive ones the latency is merely shifted (either very fast, or very, very fast), but the spread still persists.
With such latency fluctuations, SSD performance is excellent on average, but at some moments a write may take longer than at others. On the tested disks, speed fell at such moments to shameful values on the order of 50 MB/s (about half the linear write speed of modern HDDs).
When requests are queued to a device and completed independently, this does not matter. Well, yes, one request was executed quickly, another slowly, and on average everything is fine.
But what if a write depends on all the disks in the array? In this case, any stalled disk slows down the entire operation. As a result, the more disks in the array, the more likely it is that at least one of them is running slowly at any given moment. The more disks there are, the closer the performance of the raid0 gets to the sum of their minimum performance (rather than of their averages, as we would like).
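A rough way to see why the number of disks matters: if each disk independently has some small probability of stalling during a given write, the chance that at least one of N disks stalls is 1 - (1 - p)^N. The independence assumption and the 5% figure below are mine, chosen only for illustration; `p_any_slow` is a hypothetical helper.

```python
def p_any_slow(p_slow, n_disks):
    """Probability that at least one of n_disks is slow on a given write.

    Assumes every disk stalls independently with probability p_slow.
    """
    return 1.0 - (1.0 - p_slow) ** n_disks


# Hypothetical 5% chance of a stall per disk per write.
for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} disks: {p_any_slow(0.05, n):.1%} of writes hit a slow disk")
```

Under this assumption the share of writes that wait on a stalled disk grows quickly with the size of the array, even though each individual disk is fast almost all of the time.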
Here is a graph of actual performance depending on the number of drives. The pink line is the predictions based on average disk performance, the blue line is the actual results.
In the case of 7 disks, the difference was about 10%.
A simple mathematical simulation (using latency data from a real disk, applied to the case of many disks in an array) predicted that as the number of disks grows, the degradation can reach 20-25%.
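A simulation of this kind can be sketched as follows. The original used latency data from a real disk; this sketch substitutes a made-up latency distribution (mostly 1-2 ms with occasional 3-5 ms spikes), so its numbers illustrate the shape of the effect and are not expected to match the 10% or 20-25% figures. `sample_latency` and `efficiency` are hypothetical names.

```python
import random


def sample_latency(rng):
    """Hypothetical SSD write latency in ms: mostly fast, sometimes a spike."""
    if rng.random() < 0.02:              # assumed 2% chance of a spike
        return rng.uniform(3.0, 5.0)
    return rng.uniform(1.0, 2.0)


def efficiency(n_disks, stripes=20000, seed=1):
    """Actual raid0 throughput as a fraction of the 'average disk' prediction.

    Every stripe write waits for the slowest of n_disks chunk writes, so the
    result is the mean single-disk latency divided by the mean stripe latency.
    """
    rng = random.Random(seed)
    sum_all = 0.0
    sum_max = 0.0
    for _ in range(stripes):
        latencies = [sample_latency(rng) for _ in range(n_disks)]
        sum_all += sum(latencies)
        sum_max += max(latencies)
    mean_latency = sum_all / (stripes * n_disks)
    mean_stripe = sum_max / stripes
    return mean_latency / mean_stripe


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} disks: {efficiency(n):.0%} of the linear prediction")
```

To reproduce the original approach, `sample_latency` would be replaced with draws from a recorded latency log of the actual disk.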
Unlike the HBA replacement or the driver version change, in this case there was nothing significant that could be changed, and the finding was simply taken into account.
Which is better - HDD or SSD?
I must say right away: the worst latency of an SSD is better than the constant latency of an HDD (if that sounded too complicated: an SSD is better than an HDD).
Another matter is that an array of 20-30 HDDs is normal, while 30 SSDs in raid0 will make geeks drool and give the finance department an attack of hepatic colic. That is, people usually compare many HDDs with a few SSDs. If we normalize the numbers by IOPS (oh ho ho), that is, get the same “parrots” out of the HDDs as out of the SSDs, then the picture suddenly changes: an array of a large number of HDDs will greatly outrun an array of a few SSDs in write speed.
Then again, a large array of HDDs is an extreme of a different kind, and it has surprises of its own due to shared use of the bus, HBA performance and the behavior of backplanes.
And raid1/5/6?
It is easy to see that for all these arrays the problem of waiting for the “slowest” disk persists, and even gets slightly worse (that is, the problem appears at a smaller block size and a lower load intensity).
Admin: I do not like LSI. If you see any complaints about disk behavior in a system with LSI hardware, debugging should begin by comparing the behavior of different versions of the mpt2sas driver. This is exactly the case where a version change can affect performance and stability in the most dramatic way.
Academic: When planning highly loaded systems using SSDs in raid0, keep in mind that the more SSDs there are in the array, the stronger the effect of uneven latency becomes. As the number of devices in raid0 grows, the array's performance tends toward the number of devices multiplied by the minimum disk performance (rather than the average, as one would expect).
Recommendations: for this type of load, try to choose devices with the smallest spread in write latency and, if possible, use devices with a larger capacity (to reduce the number of devices).
Particular attention should be paid to configurations in which some or all of the drives are connected over a network with uneven delay. Such a configuration will cause much greater difficulties and degradation than local drives.