Defenseless data

    Among large corporate IT consumers, the focus has finally shifted from business applications to the data those applications process: in the phrase “data center”, the emphasis now deservedly falls on “data” rather than “center”. Along with this recognition of the central role of data in business came a near-panic fear of losing it. Indeed, according to IDC statistics, most companies expect bankruptcy in the event of a prolonged loss of access to their operational data.

    There are two fundamentally different approaches to ensuring reliable data storage. The first is backup. Two key concepts are associated with it: RPO (recovery point objective) and RTO (recovery time objective). RPO is the point in time to which data can be restored, i.e. the moment of the last backup; RTO is the time the backup / restore process takes. Naturally, as corporate data grows, RTO grows in proportion to the data volume, while recovery points become less and less frequent. This means that the most recent, most valuable data is also the most vulnerable, and there is ever more of it.
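
    To make the relationship concrete, here is a minimal back-of-the-envelope sketch in Python; the volumes, backup interval and restore rate are assumed figures for illustration only, not vendor data. It shows how the worst-case RPO is bounded by the backup interval, while RTO grows with the volume that has to be restored.

        # Back-of-the-envelope illustration; all numbers below are assumptions.

        def rpo_hours(backup_interval_hours):
            # Worst case: everything written since the last backup is at risk.
            return backup_interval_hours

        def rto_hours(data_tb, restore_tb_per_hour):
            # Restore time grows linearly with the volume to be restored.
            return data_tb / restore_tb_per_hour

        for data_tb in (1, 10, 50):
            print(f"{data_tb:>3} TB: RPO up to {rpo_hours(24):.0f} h with nightly backups, "
                  f"RTO ~ {rto_hours(data_tb, 0.5):.0f} h at 0.5 TB/h restore")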

    The second approach is “the data is always there”: data is protected directly in the storage system, at the moment it arrives. This means a real-time RPO and an RTO approaching zero. This approach is steadily promoted by the storage giants (EMC in particular). The most popular way to provide protection under this concept is RAID (redundant array of independent disks; incidentally, “independent” originally read “inexpensive”, which hardly applies to modern Fibre Channel disks). The principle is to combine several disks into a group and store both the data and redundant information across it. There is little point in going through all the RAID levels here, because we are interested in the most popular one: RAID5.
    In a RAID5 group, data is striped across all the disks, and the parity — the information required to recover data — is striped along with it. The redundancy overhead is one disk's worth of capacity per group (for example, 25% of the usable capacity in a typical 4+1 configuration). RAID5 is built so that the group can survive the failure of one disk at a time.
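
    The recovery mechanism itself is easy to illustrate. The toy Python sketch below models a single stripe as four data blocks plus one XOR parity block (block contents and sizes are invented for the example); if any one block is lost, it can be rebuilt by XOR-ing the survivors.

        from functools import reduce

        def xor_blocks(blocks):
            # Byte-wise XOR of equally sized blocks.
            return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

        # One stripe spread across four data disks.
        data_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
        parity = xor_blocks(data_blocks)          # the redundant block

        # Disk 2 "fails": rebuild its block from the surviving blocks and parity.
        survivors = data_blocks[:2] + data_blocks[3:] + [parity]
        rebuilt = xor_blocks(survivors)
        assert rebuilt == data_blocks[2]
        print("rebuilt block:", rebuilt)

    In a real RAID5 group the parity blocks rotate across all the disks rather than living on a dedicated one; the toy model only shows the reconstruction arithmetic.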

    It would seem that with such a storage technology, the data is really always there. Let's see how “always” is. The subtle point here is that a group can withstand only one drive at a time. Even if you instantly replace this disk, the group needs some time to restore data and correction codes (rebuild) to this disk. The data, of course, is available at the same time, but if another disk fails during the rebuild procedure, the group will be destroyed. The more disks in a group and the larger the volume of each disk, the more frequent one of them will fail, and the more time it takes for rebuild. Up to the point that a RAID5 group of a large number of inexpensive volume disks can completely collapse several (3-4 times) once a year!
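
    How long that exposure window lasts is easy to estimate. The sketch below uses assumed figures (an effective rebuild rate of 30 MB/s, which under production load is often far below a disk's raw sequential speed) purely to show how the window stretches with disk capacity.

        def rebuild_hours(disk_tb, rebuild_mb_per_s):
            # Time to rewrite one whole disk at the effective rebuild rate.
            return disk_tb * 1024 * 1024 / rebuild_mb_per_s / 3600

        for disk_tb in (0.5, 1, 2):
            print(f"{disk_tb} TB disk -> rebuild window ~ {rebuild_hours(disk_tb, 30):.0f} h")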

    The solution to this problem is double parity: RAID6 or RAID-DP. Such a group can survive the failure of two disks at the same time (and, as we saw above, “the same time” for large groups can stretch over a rather long rebuild). Two disks failing back to back is not a frequent event. In theory, for groups under 20 TB built from disks with average parameters, RAID6 provides about two orders of magnitude better protection (measured as mean time to data loss) than RAID5.
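
    The “orders of magnitude” claim follows from the standard textbook MTTDL (mean time to data loss) approximation, which assumes independent failures. Those are idealized assumptions, so the sketch below (with an assumed 1,000,000-hour MTBF, a 12-disk group and a 24-hour rebuild) only shows the shape of the comparison, not real-world figures.

        def mttdl_raid5(n, mtbf_h, rebuild_h):
            # Data is lost if a second disk fails during the rebuild window.
            return mtbf_h ** 2 / (n * (n - 1) * rebuild_h)

        def mttdl_raid6(n, mtbf_h, rebuild_h):
            # Data is lost only if a third disk fails while two rebuilds overlap.
            return mtbf_h ** 3 / (n * (n - 1) * (n - 2) * rebuild_h ** 2)

        n, mtbf, rebuild = 12, 1_000_000, 24   # assumed: 12 disks, 10^6 h MTBF, 24 h rebuild
        to_years = lambda h: h / (24 * 365)
        print(f"RAID5 MTTDL ~ {to_years(mttdl_raid5(n, mtbf, rebuild)):,.0f} years")
        print(f"RAID6 MTTDL ~ {to_years(mttdl_raid6(n, mtbf, rebuild)):,.0f} years")

    As the next paragraph argues, the independence assumption is exactly what breaks down in practice.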

    Practice, however, makes one doubt probability theory: the failure of a second disk during a rebuild is quite probable, especially on systems under serious load. Two factors contribute to this. First, the rebuild procedure on a production system seriously loads the disks: the number of read / write operations grows significantly on an already heavily loaded system. Second, at the current level of microelectronics, disks come off the assembly line like clones, so such an important parameter as MTBF is almost identical across them. Thus one disk that has exhausted its service life leads to an increased load on the whole group, the remaining disks wear out their resource faster than under normal conditions, and the probability that another disk fails goes up as a result. A kind of blackout effect.

    Storage manufacturers fight this as best they can. IBM, for example, ships storage systems with disks from different manufacturers and different lots in order to introduce some heterogeneity into the MTBF and reduce the likelihood of two disks in a group failing at the same time. Even so, the “data is always there” concept is not a guarantee, and backups continue to be used alongside in-place data protection — which, incidentally, does not provide 100% protection against hardware failure either...

    Keep this in mind: your business is as vulnerable as your data. Absolute data protection is impossible, but by combining approaches to data protection, using reliable devices and fully redundant storage systems, the likelihood of losing corporate data can be minimized.
