Transformation of backup storage technologies: software products and data deduplication devices

    The market for disk-based backup storage is measured in billions of dollars. It includes quite a few companies whose products are known throughout the world: EMC Data Domain, Symantec NetBackup, HP StoreOnce, IBM ProtecTIER, ExaGrid and others. How did this market begin, in what technological direction is it developing now, and how can different software products and deduplication devices be compared with each other?

    The first deduplication storage systems appeared in the early 2000s. They were created to solve the problem of backing up exponentially growing data. As data in companies' production systems grew, backups to tape took so long that full backups no longer fit into the backup window, while the disk storage systems of that time were difficult to use for backup because of their insufficient capacity. As a result, backups could fail either for lack of time (in the case of tape) or for lack of space (in the case of disks). The space problem could be solved by buying high-capacity storage systems, but then the high cost of that storage became the problem.

    Backup software products were originally designed on the assumption that the backup target is a tape drive and that the rotation scheme is the father-son-grandson scheme (a simple schedule sketch follows the list):
    • “Father” (a full backup once a week),
    • “Son” (incremental copies on the other six days of the week),
    • “Grandson” (an old full backup, usually sent to off-site storage).

    This approach produced a large volume of backup data. It was relatively inexpensive for companies as long as tapes were used, but with disks its cost increased significantly.

    In those days only a small number of backup software products offered built-in deduplication of backup data. Storage systems with built-in deduplication appeared specifically to solve this problem: reducing the cost of storing data on disks (ideally, down to the cost of tape). A key factor in the success of these new devices was that deduplication in the storage worked transparently and did not require any modifications to existing backup software.

    Since then, however, almost all backup software products have acquired built-in deduplication, and the cost of disks (the original weakness of disk storage systems) has dropped significantly. Moreover, many backup products can now deduplicate on the source side, that is, backup data is deduplicated even before it is transferred to the backup repository for storage. This reduces the load on the network, speeds up the backup and shortens the backup window. For this reason, many disk storage systems now include functions for integrating with such software products.
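
    The idea of source-side deduplication can be sketched in a few lines of Python. This is a simplified illustration that assumes fixed-size chunks and SHA-256 fingerprints; real products typically use variable-size chunking and their own hash indexes and transfer protocols.

        import hashlib

        CHUNK_SIZE = 4 * 1024 * 1024   # assumed fixed 4 MiB chunks for simplicity

        def backup_chunks(data: bytes, known_hashes: set) -> tuple[list[str], int]:
            """Split the data into chunks, fingerprint each chunk, and 'transfer'
            only the chunks whose fingerprints the repository has not seen yet.
            Returns the backup recipe (list of fingerprints) and the number of
            bytes that actually had to cross the network."""
            recipe, bytes_sent = [], 0
            for offset in range(0, len(data), CHUNK_SIZE):
                chunk = data[offset:offset + CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()
                if digest not in known_hashes:          # new chunk: must be sent and stored
                    known_hashes.add(digest)
                    bytes_sent += len(chunk)
                recipe.append(digest)                   # known chunk: only a reference is kept
            return recipe, bytes_sent

    If the same volume is backed up again with only a few changed blocks, almost every fingerprint is already known to the repository, so almost nothing needs to be sent over the network.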

    Today, storage systems positioned as backup repositories are under additional competitive pressure from storage systems designed to serve production workloads (primary storage), since deduplication is often included in them at no extra cost.

    A logical question arises: why, then, are specialized Backup Target storage systems needed, and how should they be used? Summarizing the information from various manufacturers of such systems, they pursue the following three strategies:
    1. They claim that (under certain conditions) deduplication on Backup Target storage systems has advantages over the deduplication built into backup products;
    2. They position their storage systems not only as a place to keep the backup repository, but also as a possible location for the organization's electronic document archive;
    3. They bundle backup software with their storage systems, or simply integrate their storage systems with backup software products (including products from other manufacturers).


    Strategy #1 (which deduplication is better?)

    In such comparisons, manufacturers' arguments come down to a comparative analysis of deduplication coefficients, backup window durations, total equivalent storage capacity and replication efficiency. In practice, however, this analysis strongly depends on "environmental" factors, that is, on the experimental conditions: if the customer's actual conditions differ from the experimental ones, the measured coefficients will differ as well.

    Take, for example, the deduplication coefficient. Here you need to determine correctly what is measured and how. Some manufacturers state that their products achieve a deduplication coefficient of 30 to 1, which certainly sounds impressive. At the same time, other manufacturers quote a coefficient an order of magnitude lower, for example 3 to 1. Does this mean that the first manufacturers' products are better than the second's? No, because the figures were obtained on different data sets, which is what produced such different coefficients. In other words, a "deduplication coefficient" quoted as a constant is largely a marketing term: it describes the deduplication of different data by different manufacturers, and products cannot be compared on its basis unless you try the different products yourself on a specially prepared, identical test data set. At the moment, however, there is no industry (or even de facto) standard for estimating the deduplication coefficient. In the antivirus industry, for example, there is the EICAR standard test file, which any antivirus must detect. A reference test data set could likewise be created for calculating the deduplication coefficient of different software products and storage systems, but in reality no such reference exists.
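
    A toy illustration of why the same deduplication logic reports very different coefficients on different data (both data sets below are synthetic and chosen only to make the point):

        import hashlib, os

        def dedup_ratio(chunks: list[bytes]) -> float:
            """Logical size of all chunks divided by the size of the unique chunks only."""
            logical = sum(len(c) for c in chunks)
            unique = {}
            for c in chunks:
                unique.setdefault(hashlib.sha256(c).digest(), len(c))
            return logical / sum(unique.values())

        # Data set A: 30 practically identical full copies of the same volume -> about 30 to 1
        one_full_copy = [bytes([i]) * 1024 for i in range(100)]
        dataset_a = one_full_copy * 30

        # Data set B: mostly unique data (e.g. already-compressed media) -> about 1 to 1
        dataset_b = [os.urandom(1024) for _ in range(3000)]

        print(f"data set A: {dedup_ratio(dataset_a):.1f} to 1")
        print(f"data set B: {dedup_ratio(dataset_b):.1f} to 1")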

    Differences in deduplication coefficients can also arise because different products use different backup algorithms. Suppose a backup software product is used with a scheme of one full copy per week and incremental copies on the other days, and the product deduplicates and compresses the backups. Now compare this with a Backup Target storage system that, say, receives a full copy of the disk volume every time and deduplicates it before writing the data to disk. In the second case the deduplication coefficient will be much higher, while the actual saving of repository disk space, on the contrary, will be much smaller.
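
    A back-of-the-envelope sketch of this effect. All of the numbers (a 10 TB volume, a 5% weekly change rate, 2:1 savings from the software product's deduplication and compression) are assumptions chosen only to show the direction of the difference:

        protected_tb = 10               # size of the production volume
        weekly_change = 0.05            # share of the data that changes per week
        weeks = 4

        # Scheme 1: backup software, one full per week plus daily incrementals,
        # with deduplication and compression on the software side.
        scheme1_logical = protected_tb + weeks * weekly_change * protected_tb
        scheme1_stored = scheme1_logical / 2        # assumed 2:1 dedup + compression
        print(f"scheme 1: ratio {scheme1_logical / scheme1_stored:.0f} to 1, "
              f"stored {scheme1_stored:.1f} TB")

        # Scheme 2: a full copy of the volume is sent to the Backup Target every day,
        # and the storage deduplicates the nearly identical fulls before writing them.
        scheme2_logical = protected_tb * 7 * weeks
        scheme2_stored = protected_tb * (1 + weeks * weekly_change)
        print(f"scheme 2: ratio {scheme2_logical / scheme2_stored:.0f} to 1, "
              f"stored {scheme2_stored:.1f} TB")

    Under these assumptions the second scheme reports a coefficient of roughly 23 to 1 against 2 to 1 for the first, yet it occupies about 12 TB of repository space against about 6 TB, which illustrates why the coefficient alone says little about the actual savings.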

    Ultimately, the most correct criterion for comparing deduplication tools is the repository disk space actually saved over a given period of time, not the deduplication coefficient. Unfortunately, this is usually impossible to find out in advance, before the purchase.

    The " equivalent storage capacity " (or the storage size that is required to save data without deduplication) is another, but also purely marketing criterion, since it is based on the same deduplication coefficient and is calculated through it (manufacturers simply multiply the actual usable capacity storage for deduplication coefficient). As a result, using one disputed coefficient, another disputed coefficient is obtained.

    Sometimes an "equivalent backup performance" coefficient is used. The idea behind it is that the user installs a special software client that performs initial deduplication on the source side (to minimize network traffic) and then sends the data to the Backup Target storage, where it is deduplicated globally (to minimize the disk space used). Such clients are usually installed on the database server, the application server and the backup server. Equivalent backup performance is measured in terabytes per hour and is defined as the amount of data actually stored on the storage in an hour multiplied by... the deduplication coefficient. Obviously, comparing different storage systems by this coefficient, if it is given in the product materials, will be incorrect.
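
    The arithmetic behind this metric is equally simple (again, the numbers are assumptions for illustration):

        stored_tb_per_hour = 2          # data physically written to the storage per hour
        claimed_dedup_ratio = 20        # measured by the vendor on its own data set

        # "Equivalent backup performance" = data actually stored per hour multiplied by
        # the deduplication coefficient, so the figure depends entirely on which data
        # set the coefficient was measured on and is not comparable between systems.
        equivalent_tb_per_hour = stored_tb_per_hour * claimed_dedup_ratio
        print(f"{equivalent_tb_per_hour} TB/h 'equivalent' "
              f"from {stored_tb_per_hour} TB/h actually written")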

    Only the transfer rate of the original data can be considered an objective metric.

    Strategy #2 (Backup Target storage as an electronic archive)

    Repositioning Backup Target storage as a system that can hold not only the backup repository but also an organization's electronic archives is a good idea. However, the storage requirements in these two cases differ significantly. Archives, unlike backups, by their very nature rarely contain duplicate information. Archives must allow fast search for individual items, whereas backups are accessed relatively rarely. These differences in requirements suggest that storage systems still need different architectures for these tasks. Manufacturers are taking steps in this direction, for example by changing the file system architecture of their storage systems; in doing so, however, they are essentially moving towards a universal file system and a universal storage system (and the competition with universal storage systems has already been mentioned above).

    Strategy #3 (integration of storage systems with backup software products)

    As for the idea of integrating backup software products with storage systems, it looks very reasonable if the integration is carried out not just in marketing materials but at the technological level. For example, a storage system can take hardware snapshots of its disks as efficiently as possible (achieving the minimum possible RPO in practice, since any software implementation by a third-party vendor is likely to be slower). At the same time, backup software products handle other important backup functions well: building the repository and organizing long-term backup storage, testing backups, and quickly recovering data in the event of a failure (minimizing RTO). Such technological "symbiosis" between manufacturers of backup software and of hardware storage systems produces the most effective solutions for the user.

    In conclusion

    • Over the past ten years the market for deduplication products and devices has evolved technologically: software and appliances now profitably complement each other's functionality. There has been a shift from deduplication on the backup repository to deduplication on the source side, or to a combination of both approaches.
    • It makes little sense to compare deduplication effectiveness using "deduplication coefficients" and metrics derived from them, since they depend heavily on the source data, on the nature of its daily changes, on network bandwidth and on other "environmental" factors.
    • When designing a backup infrastructure today, it is best not to look separately at "hardware storage with deduplication" and "backup software products", but at integrated, complementary software-plus-storage bundles.
