Why failures of modern SSDs make me nervous

Original author: Chris Siebenmann


Today, one of the SSDs in one of our new Linux file servers died. It is not the first SSD death we have faced and probably will not be the last, but, as almost always in such cases, it unnerved me. The cause is a combination of how SSDs fail, their "black box" character, and their solid-state nature.

Like most of our other SSD failures, this one was abrupt: the disk went from working perfectly to not responding to anything at all in about 50 seconds, with no warning from SMART or anything else. One moment it was happily serving read and write requests (by all external signs, including ZFS, which did not complain about checksums), and the next there was no Crucial MX300 on that SAS port.

The first Linux kernel message about failed IO operations came at 8:31:34 PM, and the kernel officially declared the disk missing at 8:32:15 PM. In reality, though, the disk may have stopped responding right from the start; the driver messages are not entirely clear to me.
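To make the window between those two kernel messages concrete, here is a minimal Python sketch that computes the gap between them. The helper name `seconds_between` is purely illustrative, and the timestamps are written in 24-hour form on the assumption that both messages were logged on the same evening.

```python
from datetime import datetime


def seconds_between(start: str, end: str, fmt: str = "%H:%M:%S") -> float:
    """Return the number of seconds from `start` to `end` (same day assumed)."""
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    return (t1 - t0).total_seconds()


# 8:31:34 PM (first IO failure message) and 8:32:15 PM (disk declared missing),
# written in 24-hour form.
window = seconds_between("20:31:34", "20:32:15")
print(window)  # 41.0
```

The gap between the two messages is 41 seconds, which fits the figure of roughly 50 seconds from working to gone. It is only a bound on what the kernel reported, not on when the disk actually stopped responding.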

What worries me most about these abrupt SSD failures is how inscrutable they are: I cannot tell myself a story about exactly what went wrong. A spinning hard disk can also die suddenly, but at least you can construct an explanation of what happened: the motor seized, or some other physical failure brought everything to an abrupt stop. SSDs are solid-state and mysterious, and I have no explanation for what went wrong, especially when the disk is still young and should be nowhere near the end of the lifespan of its flash cells.

When an HDD dies young, one can imagine that it had manufacturing defects that went undetected until they finally gave way. In theory this should not happen with an SSD, so its early death is especially troubling. Then again, perhaps flash cells can have undetectable manufacturing defects too.

And when I have no explanation for what happened, my thoughts turn anxious. Perhaps the disk was lying to us about its health in its SMART data, and it had actually been using up its last spare cells until they ran out. Or perhaps it had some firmware bug that we accidentally triggered, after which it turned into a brick.

We have had an SSD die in this way and then come back to life when it was pulled out and reinserted. It then looked completely healthy, which did not inspire any confidence at all. That was a different type of SSD, though. We have also seen weird bugs from SSDs in the Crucial MX500 series.

In addition, when I have no explanation for SSD failures, every SSD looks like an unpredictable time bomb. Are they healthy, or will they die tomorrow? It seems that I have to rely on statistics: that not too many of them will die, and that they will not die faster than we can replace them. And even this hope rests on the assumption that failures are not correlated, that is, that what happened to this SSD is not more likely to happen to the others sitting next to it.

This problem is not limited to our file servers; I have the same worry about my home computer. I mirror all of my data, but what are the real chances of both SSDs failing?
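One way to put a rough number on that question is a back-of-the-envelope calculation. The Python sketch below is only an illustration under loud assumptions: the two drives fail independently, each has a constant failure rate, and both the 2% annual failure rate and the 7-day replacement window are made-up inputs, not measurements from our servers. The function names are hypothetical as well.

```python
import math


def prob_fail_within(afr: float, days: float) -> float:
    """Chance that one drive fails within `days`, given an annual failure
    rate `afr` and an assumed constant failure rate over time."""
    if not 0.0 <= afr < 1.0:
        raise ValueError("afr must be in [0, 1)")
    daily_rate = -math.log(1.0 - afr) / 365.0
    return 1.0 - math.exp(-daily_rate * days)


def prob_mirror_loss_per_year(afr: float, replace_days: float) -> float:
    """Approximate chance per year that one drive of a two-way mirror fails
    and the survivor also fails before the replacement is in place.
    Assumes the two drives fail independently of each other."""
    p_any_first_failure = 1.0 - (1.0 - afr) ** 2
    return p_any_first_failure * prob_fail_within(afr, replace_days)


# Hypothetical inputs: 2% annual failure rate, 7 days to replace a dead drive.
print(prob_mirror_loss_per_year(afr=0.02, replace_days=7))
```

Under the independence assumption the result is far smaller than the failure rate of a single drive. That assumption is exactly what a mysterious sudden death calls into question: if the failures come from a shared firmware bug or a bad production batch, the survivor's chance of dying during the replacement window may be much higher than this estimate suggests.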

In theory, I know that SSDs should be much more reliable than spinning rust. We also have a bunch of SSDs that have been working quietly for many years. But after mysterious sudden failures like this, they no longer seem so reliable. I really wish we got some kind of warning before an SSD fails, because with HDs that was quite often possible (for example, I got such warnings about an HD in one of my desktops, even though I ignored them).
