Discussions about Software Defined Storage: What's Wrong with IO?
Abstract: On the new trend of software defined storage, and on the main birth trauma of block devices: the promise of infinite reliability.
A lyrical digression
On the horizon, a new buzzword: software defined $thing. We already have an established circle of everything related to software defined networks (SDN); now it is storage's turn (SDS). Apparently software defined computing or something similar comes next, and then HP / VMware will suddenly wake up, catch up, and offer a (private) "software defined enterprise", which will mean everything that came before, only more fashionable and relevant.
However, this story is not about buzzwords. Behind every such strange name (grid, elastic, cloud) lies a further development of technology: the construction of additional layers of interaction between components (um... interaction between the participants of the interaction, there is no other way to put it). The main motive is to get away from the granularity of individual computers, so that all the terminology and the whole subject area move away from "interprocess communication" and become self-contained. In a more or less decent form (and as an accomplished fact) we see this in the magical world of JavaScript, or rather in the workings of the WWW, where we do not care at all which servers the tasks run on: all communication happens between the browser (with its intimate details: DOM, JS and so on) and an abstraction called a URI, and it does not matter whether that is one server or hundreds of different ones.
This kind of interaction looks very tempting, so it is being extended to every other area where that is possible.
Before we talk about SDS, let's look at what has already taken place: SDN (software defined network).
In SDN, all network equipment (real hardware or virtual switches on virtualization hosts) is used as dumb executors, and all the intellectual work of building the actual network is delegated to an application that somehow understands what is needed and shapes the network topology accordingly. I omit the names of specific technologies (OpenFlow, Big Switch, Floodlight, Nicira), since the main thing in SDN is the idea of forming the network configuration in software, not the implementation details.
So, what is Software Defined Storage (SDS) then? By analogy, it is a storage system in which all the intellectual work of building the storage is delegated to software, while the hardware and the "local software" (host level) act as dumb executors.
Probably the most successful and exemplary solution here is OpenStack's Swift, which builds stable and scalable blob storage out of dumb disks with XFS on them, from which nothing is required except capacity and a little performance. The software does everything else.
But Swift is not exactly "storage", it is "object storage". That is, file storage: with no ability to write into the middle of a file, and certainly without tens of thousands of write IOPS at microsecond latencies.
And that is exactly what the public longs for. Reliable, cheap, with arbitrary and guaranteed redundancy, fault tolerant, highly available, geo-replicated, auto-balanced, self-healing, built from commodity hardware (i.e. cheap again), high performance, with unlimited scaling of performance and capacity as nodes are added, multi-tenant, accountable (at this point the client can no longer contain the excitement, falls on the carpet and starts kicking). All of that, and a spoon too.
In reality
The SDN-SDS analogy has one small nuance that complicates everything. In SDN, the only thing required from the network equipment (which is dumb and simply obeys the command center) is to move bytes around. In SDS, the dumb storage devices are required not only to accept bytes and carry them to and from the client, but also to store them.
This is where the biggest, most complex and most unpleasant problem lies. We can take a dying switch and throw it out. We can even do it programmatically. No one will notice a thing.
But we cannot simply take a working "dumb" storage node and throw it out. Before another storage node can take over, someone has to go and copy the data onto it.
Yes, the whole point is the storing. If we had write-only burial grounds for information, implementing them would be trivial. Can't write here? Bring up another node and start writing there.
But we would also have to read back what was written to the dead node... And the node is dead. Oops?
Thus, the SDS model coincides fully with SDN as far as the IO process goes. But storing is a completely new, separate problem. It is called the CAP theorem. And no solution is in sight there.
So what is the problem? If a problem cannot be solved, its conditions have to be changed.
And here the most interesting part begins: if the upper classes cannot and the lower classes will not, this is the beginning of a revolution, right? Changing the task means changing the model by which we work with block devices. All the fuss around SDS is, after all, about a file system on a block device, on which you can put an SQL database and work with it very, very quickly, reliably, cheaply and consistently (once again the client goes into a happy tantrum...).
Good TCP and the evil file system
If someone gives you a network that loses 1 packet in 10,000, you will consider it an ideal network. All network applications, without exception, are prepared for packet loss, and problems begin to appear only when losses rise to tens of percent.
Good, kind TCP forgives almost everything: duplicates, losses, jitter (sharp changes in latency), bandwidth changes, data corruption inside a packet... If things get really bad, TCP starts to work slowly and listlessly. But it works! Moreover, even when conditions become unbearable for TCP itself (for example, 70-80% packet loss), most network applications are prepared for a broken connection and simply reconnect, without far-reaching consequences.
Compare this with block devices. What happens if you are sold a disk device that loses 1 request in 1,000,000? The evil file system will not forgive it. What if you improve the quality 100 times and only 1 request in 100,000,000 fails? The file system will not forgive that either. And it will not merely refuse to forgive, it will take revenge in the most terrible way. If the file system detects that 1 write request in a trillion has failed, it will refuse to work with such a shameful block device. At best it will switch to read-only mode; at worst it will simply stop working.
And what happens to a program whose file system pulls something like that? No one knows. Maybe it just exits. Or maybe it starts to misbehave. Or hangs. If the page file was on this block device, some operating systems will panic. Especially if some important data was there (for example, a piece of a file read buffer for the cat program), and then the entire server with all its thousands of clients goes off to blink three LEDs on the keyboard.
What, for example, will a database do if an error changes just one block out of a billion (one 4k sector on a 4 TB drive)? First, it will not notice. Second, if it does notice (it dislikes something in what it read), it will declare the database defective, subject it to apartheid, circumcision and deprivation of civil rights, and pronounce it basa non grata in the system.
In other words, infinite reliability is expected from the disk stack.
The entire block stack is merciless toward errors. Vendors ask tens and hundreds of millions of rubles for systems that almost never make mistakes. But even their systems make mistakes. Less often than commodity hardware, yes. But what good is that if you are not forgiven even one error per quadrillion operations (1 failed block per 4 EB written or read, with 4k blocks)?
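The two ratios quoted above (one block in a billion on a 4 TB drive, one in a quadrillion per 4 EB of traffic) can be checked with a few lines of arithmetic. This is only a sketch: the helper `blocks_in` is a made-up name, and binary units are assumed, so the counts come out slightly above the round figures.

```python
# Back-of-the-envelope check of the ratios quoted in the text.
# Assumption: binary units (4 TB = 4 * 2**40 bytes, 4 EB = 4 * 2**60 bytes);
# decimal units give slightly smaller counts of the same order of magnitude.
BLOCK_SIZE = 4 * 2**10  # one 4k block, in bytes


def blocks_in(capacity_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks that fit into the given capacity."""
    return capacity_bytes // block_size


blocks_on_4tb_drive = blocks_in(4 * 2**40)  # 2**30, roughly a billion
blocks_in_4eb = blocks_in(4 * 2**60)        # 2**50, roughly a quadrillion

print(f"blocks on a 4 TB drive: {blocks_on_4tb_drive:,}")
print(f"blocks in 4 EB of IO:   {blocks_in_4eb:,}")
```

So "one error per quadrillion operations" is not a figure of speech: at that level a single bad block in exabytes of traffic is already enough to get a device rejected.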
Of course, the solution is to increase reliability. RAIDs, cluster systems, mainframes... We have seen this somewhere before. It comes out not just expensive but prohibitively expensive. If laptops were built with mainframe technology, they would break a thousand times less often and cost a million times more.
Someone is whispering something about RAID. Well, let's see what a RAID does. A RAID takes several block devices and builds a new block device out of them, with increased reliability (and maybe performance). At the same time, it places exactly the same demands on the quality of the devices underneath: one error, and the disk is declared bad. Forever and ever. Then comes a rebuild of varying degrees of civility.
The most advanced proprietary solutions allow drives to make occasional errors and reject them only after a certain threshold is exceeded.
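A threshold policy of this kind might look roughly like the sketch below. Everything in it is hypothetical: the names `DriveHealth` and `should_eject` and the numbers (a 1% error rate, a minimum of 1000 requests) are illustrations only and are not taken from any real RAID product.

```python
from dataclasses import dataclass


@dataclass
class DriveHealth:
    """Per-drive counters (hypothetical structure, for illustration only)."""
    requests: int = 0
    errors: int = 0

    def record(self, ok: bool) -> None:
        """Account for one finished IO request."""
        self.requests += 1
        if not ok:
            self.errors += 1

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_eject(drive: DriveHealth,
                 max_error_rate: float = 0.01,   # assumed threshold
                 min_requests: int = 1000) -> bool:  # assumed warm-up
    """Eject a drive only after enough traffic AND a rate above the threshold."""
    return (drive.requests >= min_requests
            and drive.error_rate() > max_error_rate)


# A drive with a single failed request out of 10,000 stays in service.
drive = DriveHealth()
for i in range(10_000):
    drive.record(ok=(i != 5))
print(should_eject(drive))
```

The point of the design is that one error is treated as a statistic rather than a verdict.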
But all the same, any error of the RAID itself (for example, an IO timeout) leads to the same verdict: the whole RAID is declared "bad", with the same consequences for the applications that use data on the file system on top of it. In other words, the RAID is expected to turn several unreliable devices into... again, infinite reliability (zero probability of failure). Probability theory is outraged.
... And kind, all-forgiving TCP looks upon the lost souls with compassion and love.
What to do?
First, we must admit that nothing is ideal. If DNA, with a billion years of evolution behind it, has failed to protect itself from errors, then pinning our hopes on a few years (or decades) of engineering is, to put it mildly, unreasonable. Errors can happen. And the main thing to learn about these errors is not to throw a tantrum over the tiniest imperfection.
We got an error back? Retry; if the retry fails, return the error up the stack. The file system quietly goes and puts the metadata somewhere else if it could not write it to the intended place (and does not throw a tantrum the size of the whole server). On a write (or read) error in its log, the DBMS does not declare the database possessed and does not curse every application using it unto the seventh generation; it simply pulls out a backup copy or, if there is none, carefully marks the data as damaged and returns an error or a damage marker. An application working with the database, on receiving this, does nothing stupid but calmly works with what it has, trying to minimize the damage and honestly reporting the extent of the damage to whoever uses the data. And each level fully checks the correctness of the data coming from the level below.
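The "retry, then fall back, then report the damage upward" behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an existing API: `ReadResult`, `read_with_tolerance`, the `backup` callable and the retry count are all hypothetical names and values, and IO failures are assumed to surface as `OSError`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReadResult:
    data: Optional[bytes]  # None when the block could not be recovered
    damaged: bool          # True when the caller must treat the block as lost


def read_with_tolerance(
    read_block: Callable[[int], bytes],   # hypothetical lower-level read
    block_no: int,
    retries: int = 3,                     # assumed retry budget
    backup: Optional[Callable[[int], bytes]] = None,
) -> ReadResult:
    # 1. Retry the failing layer a few times.
    for _ in range(retries):
        try:
            return ReadResult(read_block(block_no), damaged=False)
        except OSError:
            continue
    # 2. Fall back to a backup copy, if there is one.
    if backup is not None:
        try:
            return ReadResult(backup(block_no), damaged=False)
        except OSError:
            pass
    # 3. Report damage for this one block; nothing else is taken down.
    return ReadResult(None, damaged=True)
```

The caller receives either data or an explicit damage marker for one block, so the damage domain stays the size of the block instead of growing to the size of the file system or the server.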
Yes, one bank transaction on your card is damaged. Yes, we do not know exactly how much money was debited from you. But we have interim balances, so you can keep using the card, and we will either write off the damaged data as time-barred or restore it within the next week. That instead of "Unknown error. Card operation is not possible, contact your bank card support service."
The loss of a small piece of data should not damage a larger piece of data. Hebrew mythology describes a case in which all of humanity was discarded over a single bitten apple: the whole of paradise was dispersed, the snake had all its legs torn off and, in general, someone behaved the way a modern file system behaves when it finds a bitten hard disk. As far as I know, this event is considered a tragic mistake. Let's not do that again. Someone bit the apple? Throw out the apple, and nothing more.
Thus, the main change SDS must bring is a change of attitude toward block device errors, so that 1% of failed disk operations counts as a not very good but tolerable figure, and 0.01% as simply wonderful service.
Under these conditions it becomes possible to build services without expecting infinite reliability: reasonable expectations for reasonable money.
Block devices of the future
So what does the software defined storage of the future look like? If we allow ourselves to make occasional errors, our task is no longer to prevent them but to reduce their number.
For example, we can parallelize operations heavily. If 1000 nodes are responsible for storing the data, then the failure of one or two of them means only 0.1% or 0.2% read or write errors for us. We do not have to tie ourselves in knots with guaranteed synchronous replication. Well, yes: "a node died, it was thrown out of service, a new one was added." In principle this is not a great situation (if a couple more die later, we creep up to 0.4% loss, which lowers the quality of storage). But we can bring up a backup node. Yes, its data will be a day old, and for part of the data we will lie shamelessly (return something other than what was written). But the higher level is ready for that, right? And since only 2-3% of the node's data has changed in the meantime, instead of 0.1% read failures (and almost 0% write failures, since we write to other nodes) we get 0.002%.
0.002% means 99.998% reliability. A dream? If you are ready for it, yes.
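The arithmetic behind these percentages is simple enough to write down. The sketch below assumes that data and requests are spread evenly across nodes; `read_failure_fraction` is a made-up helper name, and `stale_fraction` stands for the share of a node's data that changed after the backup was taken.

```python
def read_failure_fraction(nodes: int, failed: int,
                          stale_fraction: float = 1.0) -> float:
    """Fraction of reads that fail or return wrong data.

    Assumes data and reads are spread evenly over all nodes.
    stale_fraction is the share of a failed node's data that a restored
    backup gets wrong; 1.0 means there is no backup at all.
    """
    return failed / nodes * stale_fraction


no_backup = read_failure_fraction(1000, 1)            # 0.1% of reads
two_nodes = read_failure_fraction(1000, 2)            # 0.2% of reads
with_backup = read_failure_fraction(1000, 1, 0.02)    # 2% of the node is stale

for label, value in [("one node lost", no_backup),
                     ("two nodes lost", two_nodes),
                     ("one node lost, day-old backup", with_backup)]:
    print(f"{label}: {value * 100:.4f}% bad reads")
```

With 2% of the node's data stale the result is the 0.002% quoted above; with 3% it would be 0.003%, which is still about 99.997% of reads answered correctly.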
And the resulting design turns out to be incredibly simple: a Swift-like storage system for blocks, spread over a heap of servers and a heap of disks. There are no special requirements for mandatory data integrity: if we occasionally return outdated data, that is just "garbage on read", and if we do not do it too often, everyone is happy. We can "lose" a client's request at any moment and be sure the client will resend it if necessary. We can work not in the revolutionary-heroic mode of "Storage should be made of such people: there would be no more reliable storage in the world", but in a comfortable mode where diligence most of the time fully compensates for rare errors.
Where is the SDS?
In everything above there was not a word about SDS. Where is the "software defined" here?
In the scheme described above, the executing nodes only do what the software tells them. The software, in turn, produces a description of what to read from where, and where to write. In principle, all of this already exists. The previous generation of cluster file systems, Ceph, possibly BTRFS once it has evolved a bit toward the network level, possibly elliptics arriving just in time: it is practically ready. What remains is to write proper multi-tenancy and a translation from the client's logical view of the topology into commands for the "dumb hardware" (the counterpart of the controller in SDN), and you are done.
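To make the "description of where to read and where to write" a little more concrete, here is one possible sketch of the controller side. It uses rendezvous (highest random weight) hashing, which is a technique swapped in for illustration: the text does not prescribe any placement algorithm, and this is not how Swift or Ceph place data. The function name `place_block`, the volume and node names, and the number of copies are all hypothetical.

```python
import hashlib


def place_block(volume: str, block_no: int, nodes: list[str],
                copies: int = 2) -> list[str]:
    """Choose `copies` distinct nodes for a block by rendezvous hashing.

    Each node gets a pseudo-random score for this particular block;
    the highest-scoring nodes win. Storage nodes need no logic of
    their own: they only execute reads and writes sent to them.
    """
    def score(node: str) -> int:
        key = f"{volume}:{block_no}:{node}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

    return sorted(nodes, key=score, reverse=True)[:copies]


# Hypothetical cluster of ten dumb storage nodes.
nodes = [f"node{i}" for i in range(10)]
print(place_block("client-volume", 42, nodes))
```

A useful property of this choice is that when a node that does not hold a given block disappears, the placement of that block does not change, so "throwing out" a node only affects the blocks it actually stored.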
Summary
The main conclusion: the key problem in the development of block devices today is the excessively high (infinite) expectations placed on their reliability, together with the bad existing tradition of inflating block device errors by enlarging the damage domain to the size of the task domain (and sometimes beyond it). Giving up on 100% reliability always and everywhere will make it possible, with far less effort (i.e. at lower cost), to create the conditions for building (or even applying existing) SDS solutions.