How to securely store large amounts of data within a moderate budget

    Good afternoon, Habrahabr! Today we'll talk about how storage requirements are changing as data volumes grow, and why the traditional systems we trust can no longer keep up with capacity expansion or guarantee reliable storage. This is my first post after a long break, so let me introduce myself just in case: I'm Oleg Mikhalsky, product director at Acronis.

    If you follow industry trends, you have probably already come across the concept of software-defined anything. The idea is to move the key functions of IT infrastructure (scalability, manageability, reliability and interoperability with other components) up to the software level. Gartner names Software Defined Anything among the 10 key trends of 2014, and IDC has already published a dedicated review of the Software Defined Storage segment, predicting that by 2015 customers will be buying $1.8 billion worth of commercial solutions of this type. It is storage systems of this new kind that we will discuss below.


    To get started, let's look at the statistics on data growth and draw some conclusions. A few years ago the volume of data created worldwide exceeded 1 zettabyte, which is roughly a billion completely filled 1 TB hard drives, and it already exceeds all the storage capacity available today. According to the forecast of EMC, the world leader in the storage market, data volumes will grow another 50-fold over the current decade, creating a shortage of storage space of more than 60%.



    Figure: The deficit of storage space for created information keeps growing.
    Source: IDC, The Digital Universe Decade - Are You Ready? (2010)

    How much and why?


    What is behind the avalanche-like growth of information volumes:
    • creating new information is now much cheaper than before: the cost of storing and processing it has dropped 6-fold since 2005
    • IT budgets over the same period grew only 1.5-fold
    • by 2020, the number of devices that create information will increase 8-fold: from higher-resolution smartphones and cameras to all kinds of sensors and smart personal devices
    • additional information is derived from data that already exists: first of all backups, but also logs and archives of digital audio and video


    In turn, the shortage of storage space is explained by the fact that hardware storage systems have long evolved under the motto "faster, higher, stronger": from tape to larger disks, faster disks, flash drives, and multi-shelf systems combining drives of different types and speeds. Storage optimization was tailored to the needs of companies with large budgets: fast storage for virtualization, ultra-fast storage for real-time data processing, smart storage optimized for specific business applications. Meanwhile, backups, archives and logs, which do not directly create business value and simply take up space, seem to have been forgotten by customers, and storage manufacturers did not give them much thought either (try to name a hardware storage vendor whose product is marketed specifically as "the cheapest and most reliable storage for backups of your data").

    You're doing it WRONG


    From practice I know of cases where hundreds of terabytes of backups and logs are stored on branded vendor shelves designed for online storage of business application data, or, at the other extreme, on a home-made JBOD several petabytes in size, half of which is a full second copy kept "for reliability". The result is a paradox: the cost of storing the data (around 10-15 cents per gigabyte per month) is several times higher than storing it in the Amazon cloud, the hardware's data-processing capabilities go unused, and the reliability required for backups and long-term storage is still not provided (we will look at reliability in more detail below). In the JBOD case, the cost of supporting and expanding the system also keeps growing. But, as noted above, for a long time this problem was simply not a priority for companies.

    Development in the right direction


    Not surprisingly, the first to notice the problem were the developers and engineers who deal with large data arrays directly: at Google and Facebook, as well as in scientific experiments such as the famous Large Hadron Collider. They started solving it with the software tools available to them, and then sharing their best practices in publications and at conferences. Perhaps this is partly why the storage segment of Software Defined Anything quickly filled up with a large number of open-source projects, as well as startups offering highly specialized solutions for specific types of problems, once again bypassing backups and long-term archives.

    Storage reliability is part of this article's title, so let's analyze why storing large amounts of data on ordinary storage systems becomes not just difficult but downright risky as the data grows. This is especially important for backups and logs (a category that, incidentally, includes video surveillance archives), which may be needed rarely, but on extremely important occasions, for example to conduct an investigation. The point is that in traditional storage systems, the more data you have, the higher both the storage costs and the risk of data loss due to hardware failure.

    Calculations and entertaining statistics


    It has been found that, on average, hard drives fail with a probability of 5-8% per year (Google data). For a petabyte-scale storage system this means several disk failures per month, and at 10 petabytes disks can fail every day.



    Fig. How hard drives fail (Google data).
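    To put these numbers in perspective, here is a minimal back-of-the-envelope sketch in Python of how many drive failures per month such a failure rate implies. The 1 TB drive size and the 6% annual failure rate are my own illustrative assumptions picked from the ranges mentioned above, not exact figures from the studies.

```python
# Expected number of failed drives per month at a given raw capacity.
# Assumptions (illustrative): 1 TB drives, 6% annual failure rate (AFR),
# i.e. the middle of the 5-8% range quoted above.
DISK_TB = 1.0
AFR = 0.06

def failures_per_month(storage_pb: float) -> float:
    """Expected drive failures per month for `storage_pb` petabytes of raw capacity."""
    n_disks = storage_pb * 1000 / DISK_TB      # PB -> TB -> drive count
    return n_disks * AFR / 12                  # spread the annual rate over 12 months

for size_pb in (1, 10):
    print(f"{size_pb:>2} PB: ~{failures_per_month(size_pb):.0f} failed drives per month")
# 1 PB  -> ~5 failures per month
# 10 PB -> ~50 failures per month, i.e. more than one per day
```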

    Example: with RAID 5 and an unrecoverable read error probability of 10^-15 per bit, real data can be lost on roughly every 26th rebuild, i.e. every few months. For instance, if the system has 10 thousand disks and the mean time between failures for a single disk is 600 thousand hours, a disk rebuild will be needed every few days (based on data from an Oracle article).
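    To show where a figure like "every 26th rebuild" can come from, here is a rough sketch under the simplifying assumption that a RAID 5 rebuild must read the full contents of all surviving drives and that any unrecoverable read error (URE) ruins the rebuild. The group size and drive capacity are illustrative choices of mine, not numbers taken from the Oracle article.

```python
import math

# Assumptions (illustrative): a 6-drive RAID 5 group of 1 TB disks,
# one disk has failed, so 5 surviving disks must be read in full;
# URE probability of 1e-15 per bit, as quoted above.
URE_PER_BIT = 1e-15
SURVIVING_DRIVES = 5
DRIVE_BYTES = 1e12

bits_to_read = SURVIVING_DRIVES * DRIVE_BYTES * 8
# 1 - (1 - p)^n, computed via log1p/expm1 to stay numerically accurate
p_failed_rebuild = -math.expm1(bits_to_read * math.log1p(-URE_PER_BIT))
print(f"P(URE during one rebuild) ~= {p_failed_rebuild:.1%}")            # ~3.9%
print(f"i.e. roughly one bad rebuild in {round(1 / p_failed_rebuild)}")  # ~26
```

    With larger drives or bigger groups the amount of data read during a rebuild grows, and this probability grows with it.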

    It should be noted that RAID-based systems rebuild failed drives with certain limitations, and the rebuild time depends on the disk size: the larger the drive, the longer the rebuild takes, which increases the likelihood of a second failure and, with it, data loss. So as disk sizes and total storage capacity grow, reliability goes down. In addition, there are errors that RAID simply does not detect. For those who want more detail, an excellent overview of RAID problems has been published on Habré here.
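    As a minimal sketch of why larger drives make things worse, the snippet below estimates the rebuild window from the drive size and an assumed rebuild throughput, and then the chance that one of the remaining drives in the group fails inside that window. The 50 MB/s throughput, 6% annual failure rate and 7 remaining drives are assumptions for illustration only.

```python
import math

REBUILD_MB_S = 50.0     # assumed effective rebuild throughput
AFR = 0.06              # assumed annual failure rate per drive
OTHER_DRIVES = 7        # assumed number of remaining drives in the group

def second_failure_risk(drive_tb: float) -> tuple[float, float]:
    """Return (rebuild_hours, probability that another drive fails during the rebuild)."""
    rebuild_hours = drive_tb * 1e6 / REBUILD_MB_S / 3600   # TB -> MB -> s -> h
    hourly_rate = AFR / (365 * 24)                         # crude constant-rate model
    p = -math.expm1(-hourly_rate * OTHER_DRIVES * rebuild_hours)
    return rebuild_hours, p

for size_tb in (1, 4, 8):
    hours, p = second_failure_risk(size_tb)
    print(f"{size_tb} TB drive: rebuild ~= {hours:.0f} h, P(second failure) ~= {p:.2%}")
```

    The per-rebuild risk looks small, but it scales linearly with drive size, and at petabyte scale rebuilds happen constantly, so the risk keeps compounding.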

    Add to that the NetApp research finding that, on average, one out of 90 disks has latent corruption related to checksums, block write errors or incorrect parity bits that traditional storage systems do not detect. As another study shows, traditional file systems are unable to catch such errors either. The probability of even the most common of these error types is low, but as the data array grows, so does the likelihood of loss, and the storage system stops being reliable.
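    As a quick illustration of how a "rare" defect stops being rare at scale, the sketch below takes the 1-in-90 figure at face value and computes the chance that at least one drive in a fleet carries silent corruption; the fleet sizes are arbitrary examples.

```python
# Probability that at least one drive in a fleet has latent corruption,
# taking the "1 in 90 disks" figure from the NetApp study at face value.
P_SILENT = 1 / 90

for n_disks in (10, 100, 1000):
    p_any = 1 - (1 - P_SILENT) ** n_disks
    print(f"{n_disks:>4} disks: P(at least one silently corrupted drive) ~= {p_any:.0%}")
# 10 disks   -> ~11%
# 100 disks  -> ~67%
# 1000 disks -> ~100%
```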

    The reliability of hardware designed to handle a limited amount of data is simply not enough to store hundreds of terabytes and petabytes dependably.

    Software Defined Storage


    Against this background, and drawing on accumulated experience of working with ever-growing data volumes, the concept of Software Defined Storage began to take shape. The first developments in this field did not single out any one problem such as reliability. Guided by the needs of their own projects, Google's developers, for example, tried to solve several problems at once: scalability, availability, performance and, yes, reliability when storing large amounts of data on inexpensive commodity components such as desktop hard drives and no-name chassis, which fail more often than expensive branded hardware.

    For this reason, the Google File System (GFS) can in some ways be considered the progenitor of the class of solutions discussed below. Other development teams, such as the open-source projects Gluster (later acquired by Red Hat) and Ceph (now backed by Inktank), focused primarily on high-performance data access. The list would not be complete without HDFS (the Hadoop file system), which grew out of Google's work and is geared toward high-performance data processing. The list goes on, but a thorough review of existing technologies is beyond the scope of this article. I will only note that long-term storage optimization as such was never the priority; it was addressed, as it were, as a side effect of optimizing the cost of the solution as a whole.



    Clearly, building a commercial solution on top of open source is a complex and risky undertaking that only a large company or system integrator can afford: one with enough expertise and resources to work with open-source code that is hard to install, integrate and support, and with sufficient commercial motivation to do so. But, as mentioned above, commercial vendors direct their main efforts toward high-budget areas such as high-speed storage for virtualization or parallel data processing.

    Ready-made solutions


    The closest to solving the problem of inexpensive and reliable storage were startups focused on cloud backup, but many of them have already dropped out of the race, while others were absorbed by large companies and stopped investing in technology development. The vendors who made the most progress, such as Backblaze and Carbonite, bet on deploying cloud storage in their own data centers on commodity components and managed to gain a foothold in the market with their cloud services. But given the extremely tough competition in their core market, they do not actively promote their storage technology as a standalone Software Defined Storage product: first, so as not to create competitors for themselves, and second, so as not to spread their resources across entirely different lines of business.

    As a result, the storage administrators responsible for keeping backups, logs, video surveillance archives, television programs and voice recordings face a choice. On one side there are convenient but expensive solutions that easily cover current needs, given a sufficient budget, up to roughly 100-150 TB of data. They will be reliable and safe; as the industry saying goes, nobody ever got fired for buying hardware from a top-tier vendor. But as soon as capacity crosses the 150-200 TB threshold, further expansion becomes a problem: combining all the hardware into a single file system, freely redistributing space and upgrading to larger drives requires costly migrations, expensive add-on components and specialized "storage virtualization" software. In terms of total cost of ownership, such a system eventually becomes far from optimal for cold data. The other alternative is to build the storage system yourself on the basis of Linux and JBOD. That may suit a specialized company such as a hosting or telecom provider, which has experienced, qualified specialists ready to take responsibility for the performance and reliability of a home-grown solution. An ordinary small or mid-size company whose core business is not data storage most likely has the budget for neither expensive hardware nor qualified specialists.
    For such companies, Acronis's own development, Acronis Storage, can be an interesting alternative. It is a software solution that lets you quickly deploy highly reliable and easily expandable storage on inexpensive commodity chassis and disks, which can be combined arbitrarily, swapped one by one on a live system, and used to grow capacity in arbitrary increments from a few terabytes to tens or hundreds of terabytes, requiring essentially nothing more than the skills needed to assemble a PC and an intuitive web interface, accessible to a non-specialist, for configuring and monitoring the entire storage system and its individual nodes and disks. This product grew out of an internal cloud backup startup at Acronis, which has since expanded to several petabytes across three data centers.

    To summarize


    A review of approaches to storing large amounts of data would not be complete without mentioning solutions that are software-based but come to market as hardware-software appliances. In some cases this allows rapid deployment and may suit a smaller company with limited resources. But a predefined hardware configuration limits fine-tuning and naturally sets a higher price floor than pure software, since the hardware is already included in the price. And, of course, this approach inherits many of the constraints of dedicated hardware storage when it comes to upgrading an individual server (scale-up by replacing disks with larger, faster ones or swapping the network for a faster one).

    In conclusion, let us turn once more to the storage industry analysts and fix a few takeaways. According to a Forrester Forrsights Hardware Survey conducted at the end of 2012, 20% of companies already had backup volumes of up to 100 TB per year, and the difficulty of expanding backup storage was a problem for 42% of respondents. Every company is different, of course, but these figures should prompt specialists to think about long-term planning of the storage capacity their organization may need over the next several years. Assuming companies are roughly similar in how they store backups, almost half of them will face the problem of optimizing their backup storage, and possibly other cold data as well, in the coming years. And the figures on traditional RAID-based storage systems presented above suggest that simply scaling up the same hardware will not solve it.
