Speed or volume? Automated storage management with heterogeneous characteristics

    Enterprise-class solid-state drives (SSDs) have emerged as a new storage technology and are already widely used in the design of high-performance systems. SSDs are significantly faster than hard disk drives (HDDs), but they are also more expensive. For now, their role is to participate in hybrid storage environments (SSDs and HDDs together), since their cost prevents most companies from switching to them entirely. We are dealing with a tiered storage environment in which solid-state drives are the fastest and most expensive devices, while high-capacity hard drives are, conversely, the slowest and cheapest. In this article the term “hybrid” and the phrase “tiered storage system” can be treated as synonyms.
    The key to successfully using SSDs in a hybrid storage environment is software that automatically manages the placement of data and ensures that the most frequently used data is hosted on fast SSDs.
    Teradata Virtual Storage (TVS) is a product designed by Teradata Labs specifically for managing such a hybrid storage environment. In this article, we will look at the goals behind TVS and talk about how it works.
    This article is based on my talk at the Teradata Forum 2012 conference, which was held in Moscow at the end of November.

    At the heart of hybrid storage systems lies the observation that different data is accessed with different frequency. For example, daily reporting and operational analysis need data from the last day or week, while data older than three years is not touched every day, and then only for one or two indicators. Characterizing data by access frequency is conveniently expressed as temperature, in the sense familiar from school physics, except that here it is data blocks that “move” instead of molecules. Data that sits as dead weight is called cold, and the most frequently requested data is called hot.
    By temperature-based data management we mean the ability to prioritize the allocation of system resources according to business rules, making fuller use of the capabilities of different storage devices while data volumes keep growing.
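    The hot/warm/cold idea above can be sketched in a few lines. This is purely an illustration, not Teradata code: the thresholds relative to the average access count are assumptions chosen for the example.

```python
# Illustrative sketch: label data blocks hot, warm, or cold by their access
# frequency relative to the system-wide average. The factors 2.0 and 0.5 are
# invented thresholds, not values used by any real product.
from statistics import mean

def classify(access_counts, hot_factor=2.0, cold_factor=0.5):
    """Map each block name to 'hot', 'warm', or 'cold'."""
    avg = mean(access_counts.values())
    labels = {}
    for block, count in access_counts.items():
        if count >= hot_factor * avg:
            labels[block] = "hot"       # accessed far more often than average
        elif count <= cold_factor * avg:
            labels[block] = "cold"      # dead weight, rarely touched
        else:
            labels[block] = "warm"
    return labels

counts = {"daily_sales": 900, "last_week": 300, "archive_2009": 3}
print(classify(counts))
```

    Note that the labels are relative, which matches the article's point: the same access rate can be "hot" in one system and "warm" in another.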

    The amount of data also matters. Looking at how much data in the system is hot and how much is cold yields a characteristic that is unique to a particular system at a particular point in time. It turns out that access frequency and data volume do not correlate with each other, and the combination of their values, averaged over a certain period, characterizes the system we are analyzing.
    Regarding data volume, it should also be noted that over the past ten years processor power and I/O subsystem throughput have grown at different rates. Comparing them directly is not the purpose of this article, but the ratio of processor seconds per unit of I/O changes every year, and not in favor of the HDD.

    “HDD-only” system configurations have evolved (and are developing) as follows:
    • The amount of data per node is growing;
    • The growth granularity is quite large (73 GB - 136 GB - 300 GB ...);
    • As the number of drives grows, so does the rack space occupied in the data center;
    • CPU performance per terabyte of data drops.
    The constant increase in CPU performance has driven up the number of disks needed to keep the load on processors and disks balanced. More disks means more data. There is also another way: few disks and a lot of data. But if the system has many users (read: concurrent queries), a small number of disks becomes a bottleneck. After all, access speed on HDDs has changed little over the past decade, and random access speed is the key characteristic of a disk subsystem under multisession load.
    Such growth suits companies and systems with low CPU requirements per terabyte. Everyone else has to accept the excess capacity of hard drives (you can no longer find an HDD smaller than 300 GB, which for Teradata means a minimum of 14 TB per node).
    The arrival of SSDs on the market has changed this situation. The distinguishing features of SSDs are well known: they are flash memory with an interface physically and electrically identical to that of an HDD. Enterprise-class devices have additional characteristics:
    • Single-level cell (SLC) technology is used, when only one bit of information is stored in one cell;
    • Chips are selected through factory testing;
    • Error correction algorithms;
    • Integrated controllers with performance optimization and reliability control software.

    Teradata spent many years of effort preparing for the change in storage devices and developed a virtualization layer that sits between the logical Teradata File System and the physical devices. The result was named Teradata Virtual Storage.
    The key design goals were:
    • Automation. Teradata Labs did not plan to create new tasks for the DBA.
    • Distribution of I/O between the devices of a hybrid storage system.
    • Performance. We always think about it, no matter what is meant by this term.

    Automatic operation, as a fundamental requirement, meant eliminating any need for manual intervention in placing and migrating the hottest data to the fastest devices, and vice versa. The DBA has plenty of other work without having to watch whether the user load profile matches how the data is laid out across the disks.
    Optimal load distribution is influenced by many factors: the total amount of data in the system, the capacity of each medium, the temperature of the data being accessed, the current level of activity in the system, and the rate at which access patterns change. All these indicators change dynamically, so it makes no sense to assume that all I/O will come from the fastest media (unless all the data in the system fits on them).

    The philosophy of TVS was to achieve a balanced distribution of I/O operations: the storage location should match the temperature of the data. The goal was defined as follows: at least 80% of I/O should go through the fastest devices, with the remainder served by the slower tiers. Knowing in advance that the slow tiers will also be accessed, you need to choose the ratio of SSDs to HDDs carefully so that the hard drives can withstand that remaining, potential, 20%.
    By keeping hot data on fast SSDs, most of the load is shifted off the HDDs. As a result, we get better and more tightly clustered response times from both. As soon as activity on an HDD increases, the disk queue grows and access times slow down or become erratic. This is especially noticeable when a short tactical query that needs a single I/O operation gets queued behind large analytical queries. By moving hot data to SSDs, we shorten the disk queues and deliver good response times.

    At the same time, disks and disk arrays differ in their performance characteristics. Teradata Labs did a great deal of work profiling arrays from different vendors, of different generations and different types (SSD/HDD). The resulting database is used by TVS during configuration to determine the speed grade of a particular device, or of a location within it.
    In particular, slow and fast zones are identified inside each HDD. On hard drives, the lower Logical Block Addresses (LBAs) of a physical drive correspond to its outer tracks and its fastest response times. On a partitioned disk, each partition gets its own speed rating, and hence its own position on the device speed scale.
    Data in the Teradata DBMS is distributed (ideally evenly) between virtual processors called AMPs. Each AMP stores its part of a user object's (table's) data, and only that AMP has access to it. At the Teradata File System level, individual records are grouped into data blocks, which in turn are grouped into cylinders (groups of data blocks laid out contiguously on disk). A data block contains sorted records of a single table. Blocks vary in size, with a maximum of 127.5 KB, i.e. 255 sectors of 512 bytes each. In a mature system, data blocks average about 96 KB. From the file system's point of view, a table can be thought of as a collection of data blocks spread across all AMPs.
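    The sizes above are easy to verify with simple arithmetic. Only the 512-byte sector, 255-sector maximum, and 96 KB average come from the article; the table size and AMP count below are hypothetical numbers chosen for illustration.

```python
# Quick check of the block sizes mentioned above (plain arithmetic, not
# Teradata code): the maximum data block is 255 sectors of 512 bytes.
SECTOR_BYTES = 512
MAX_SECTORS_PER_BLOCK = 255

max_block_bytes = SECTOR_BYTES * MAX_SECTORS_PER_BLOCK
print(max_block_bytes / 1024)  # 127.5 (KB)

# With the mature-system average of 96 KB per block, a hypothetical 10 GB
# table spread over 100 AMPs works out to roughly this many blocks per AMP:
avg_block_kb = 96
table_kb = 10 * 1024 * 1024   # 10 GB, an assumed table size
amps = 100                    # an assumed system size
print(table_kb // avg_block_kb // amps)
```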

    Teradata Virtual Storage distinguishes three levels of storage device speed: fast, medium, and slow. The ranking takes various factors into account: device type, rotation speed (for HDDs), and physical location (inner or outer tracks of an HDD). TVS sets aside a portion of the fast-device space, called the soft reserve, for critical DBMS objects:
    • Spool - Temporary tables containing intermediate and final results when executing queries.
    • WAL (Write-Ahead Log) - Contains transaction log entries, as well as write-ahead log entries used to protect against loss of changes to data blocks processed in memory.
    • DEPOT - A small data area that protects, against disk array failure, those data blocks that are updated “in place”, when new data is written directly over existing data.
    Fast access to these objects is critical for overall system performance, so we always keep them on high-speed media.
    The boundaries between the soft reserve and the fast, medium, and slow storage areas are redefined for each platform and DBMS release.
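    The ranking and reservation logic can be sketched as follows. This is a toy model: the grading rules and the 20% reserve fraction are invented for illustration, whereas real TVS uses a measured performance database and platform-specific boundaries.

```python
# Toy model of speed-class assignment and soft-reserve carving. The rules
# below (SSD -> fast, outer HDD tracks -> medium, inner -> slow, 20% reserve)
# are assumptions for the sketch, not documented TVS behavior.
def speed_class(device_type, hdd_zone=None):
    """Assign fast/medium/slow to a device or a zone within it."""
    if device_type == "SSD":
        return "fast"
    # lower LBAs map to outer tracks, which respond faster than inner ones
    return "medium" if hdd_zone == "outer" else "slow"

def carve_soft_reserve(fast_capacity_gb, reserve_fraction=0.2):
    """Set aside part of the fast tier for Spool, WAL, and DEPOT."""
    reserve = fast_capacity_gb * reserve_fraction
    return {"soft_reserve_gb": reserve,
            "user_fast_gb": fast_capacity_gb - reserve}

print(speed_class("SSD"))                    # fast
print(speed_class("HDD", hdd_zone="outer"))  # medium
print(carve_soft_reserve(400))
```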

    During configuration, devices and device regions of different speeds are distributed evenly between AMPs so that every AMP gets equivalent performance. During operation, TVS controls the distribution of data between these devices; that is, performance gains are achieved by redistributing data.
    Managing a hybrid storage environment manually would be a burdensome and time-consuming task requiring constant monitoring and administrator action. With TVS, we are dealing with a fully automated service.
    As already mentioned, Teradata Virtual Storage distributes data automatically according to its temperature. TVS monitors the access frequency of each data block and stores it as a temperature aggregated at the cylinder level (that is, all blocks of the same cylinder always have the same temperature). Temperature here is a relative value within the system.
    TVS periodically lowers the temperature of all cylinders in the system. When a cylinder is first written, its temperature is set to the system average; over time, old data is accessed less often than new data, and this principle is expressed through the aging process. When it runs, the temperature of every cylinder in the system is decreased uniformly.
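    A minimal sketch of this aging mechanism, under stated assumptions: the starting value, the decay factor, and the per-access bump are all invented for illustration; only the scheme itself (start at the system average, cool uniformly, warm on access) comes from the text.

```python
# Sketch of temperature aging: new cylinders start at the system average,
# accesses raise the temperature, and a periodic aging pass cools every
# cylinder uniformly. All numeric constants here are illustrative assumptions.
SYSTEM_AVERAGE = 100.0

def write_cylinder(temps, cyl_id):
    temps[cyl_id] = SYSTEM_AVERAGE  # newly written cylinders get the average

def touch(temps, cyl_id, bump=10.0):
    temps[cyl_id] += bump           # an access makes the cylinder hotter

def age_all(temps, decay=0.9):
    for cyl in temps:
        temps[cyl] *= decay         # uniform decrease across the whole system

temps = {}
write_cylinder(temps, "c1")
write_cylinder(temps, "c2")
touch(temps, "c1")   # c1 keeps being accessed, c2 does not
age_all(temps)
print(temps)         # c1 ends up warmer than the untouched c2
```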

    But this is only part of the work of the module named the Migrator. As you might guess, the Migrator uses a cylinder's temperature to decide whether it should be moved to a device of a different speed. On each AMP, the Migrator maintains an ordered queue of “improperly” placed cylinders. Those containing hot data that lies on slow media are placed at the front of the queue. Cold data sitting on fast media goes to the back: there is no immediate benefit in moving it, and only when space on the fast media runs out does the Migrator begin to clear cold data off it.
    For example, at some initial point in time, blocks of different temperatures sit on arbitrary media. Gradually, as the Migrator works, the picture changes: hot blocks migrate to fast media, cold ones to slow media, and warm ones to medium-speed media.
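    The queue ordering can be illustrated with a toy priority scheme. The numeric ranks and the scoring function are assumptions made for the sketch; the real Migrator's internals are not public.

```python
# Toy model of the per-AMP Migrator queue: hot-on-slow cylinders are urgent
# and go to the front; cold-on-fast cylinders go to the back (evicted only
# under space pressure). The rank tables and score are invented for the sketch.
SPEED_RANK = {"fast": 2, "medium": 1, "slow": 0}
TEMP_RANK = {"hot": 2, "warm": 1, "cold": 0}

def misplacement(cyl):
    """Positive: hotter than its medium warrants (urgent to promote).
    Negative: cold data parked on fast media (demote only when space is needed).
    Zero: correctly placed."""
    return TEMP_RANK[cyl["temp"]] - SPEED_RANK[cyl["medium"]]

def build_queue(cylinders):
    misplaced = [c for c in cylinders if misplacement(c) != 0]
    return sorted(misplaced, key=misplacement, reverse=True)

cyls = [
    {"id": "a", "temp": "hot", "medium": "slow"},     # front of the queue
    {"id": "b", "temp": "warm", "medium": "medium"},  # correctly placed
    {"id": "c", "temp": "cold", "medium": "fast"},    # back of the queue
]
print([c["id"] for c in build_queue(cyls)])  # ['a', 'c']
```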
    TVS processes run on every node. The Migrator picks cylinders from the queue and performs, on average, one cylinder migration per AMP every 5 minutes, with no more than two parallel migrations per node. This continues until the queue is empty. At this rate, approximately 10% of all occupied cylinders in the system can migrate in a week. The load this mode places on the system, about 2% of CPU and I/O, seems a more than acceptable price for the resulting high-quality placement of data on devices.
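    The stated rate is easy to sanity-check. The per-AMP cylinder count below is inferred from the article's own numbers, not a documented figure.

```python
# Sanity check of the migration rate: one cylinder per AMP every 5 minutes.
migrations_per_week = 7 * 24 * 60 // 5
print(migrations_per_week)  # 2016 cylinder migrations per AMP per week

# For 2016 migrations to be ~10% of an AMP's occupied cylinders, the AMP
# would hold on the order of 20,000 cylinders (an inferred, illustrative
# figure, not a value given in the article).
print(round(migrations_per_week / 0.10))
```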
    While a cylinder is migrating, it cannot be written to, and a cylinder that is being written to cannot migrate.
    TVS also supports another migration mode called optimize. In this mode, the Migrator uses all available system resources for the fastest possible data migration, and the same 10% of the data can migrate in about 8 hours. Naturally, this is not something to run on a production system under load, but it is sometimes useful.
    When the system is more than 95% full, the Migrator stops working.
    In addition to the two modes described above, there is a third: asynchronous migration, which starts even less often than regular migration - when the free space in the soft reserve, or on fast or medium-speed media, drops to 10% of capacity. The purpose of this mode is to free up space on fast media for fresh hot data. The Migrator keeps working until the 10% threshold is restored or no “incorrectly placed” cylinders remain; in this mode, hot data does not leave fast media, and warm data does not leave medium media.
    One interesting TVS control feature is the so-called initial data temperature. By default it is set in the system parameters, but using the query band session property (a text string) with the TVSTemperature keyword, you can change this setting for the data loaded by that session.
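    As a sketch of how a load script might set this: the SET QUERY_BAND ... FOR SESSION statement and the TVSTemperature keyword come from the article; the helper function, the accepted value set, and the DB-API-style cursor are assumptions for illustration (a real script would use a driver such as teradatasql).

```python
# Sketch: set the initial data temperature for a loading session via the
# query band. The SQL form is Teradata's; the validation set {HOT, WARM, COLD}
# and the cursor interface are assumptions made for this example.
def set_initial_temperature(cursor, temperature):
    allowed = {"HOT", "WARM", "COLD"}  # assumed accepted values
    if temperature.upper() not in allowed:
        raise ValueError(f"unsupported temperature: {temperature}")
    sql = f"SET QUERY_BAND = 'TVSTemperature={temperature.upper()};' FOR SESSION;"
    cursor.execute(sql)
    return sql

class FakeCursor:
    """Stand-in cursor so the sketch runs without a database connection."""
    def execute(self, sql):
        self.last_sql = sql

cur = FakeCursor()
print(set_initial_temperature(cur, "hot"))
```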

    As an example, let's look at the configuration options for one of Teradata's larger systems, the Active EDW 6690. Its disk-related features are as follows:
    • 15 - 20 SSDs of 400 GB each;
    • volume of hot data: 1.6 - 2.2 TB;
    • 60 - 160 HDDs of 300 or 600 GB each;
    • volume of warm and cold data: 4.9 - 26 TB;
    • up to 174 drives per node.

    Only 2.5” form-factor disks are used, and SSDs and HDDs sit in the same disk array. Moreover, the disks do not have to be installed all at once; you can add them later.
    When a new system is just starting its life, the distribution of data by temperature cannot be predicted. But for existing systems, Teradata offers an assessment of the actual data temperature. This assessment determines the optimal amount of data per node and the recommended share of SSDs in it.
    The assessment uses a data collection tool (a utility called IOCNT). The utility is installed on several nodes of the customer's existing system and collects information about the ratio of hot and cold data. To reduce measurement error, it connects to several AMPs and aggregates information about actual data access activity over selected time periods.
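    The aggregation step could look something like the sketch below. IOCNT's actual data format and algorithm are not public, so everything here (the sample structure, the threshold, the hot-share metric) is an invented illustration of the general idea of combining per-AMP access counts.

```python
# Illustrative sketch (NOT the real IOCNT): combine access-count samples from
# several AMPs and estimate the share of "hot" extents. The sample format and
# the threshold of 100 accesses are assumptions made for this example.
def hot_share(per_amp_samples, hot_threshold=100):
    """per_amp_samples: list of {extent_id: access_count} dicts, one per AMP."""
    totals = {}
    for sample in per_amp_samples:
        for extent, count in sample.items():
            totals[extent] = totals.get(extent, 0) + count
    hot = sum(1 for c in totals.values() if c >= hot_threshold)
    return hot / len(totals)

samples = [{"e1": 80, "e2": 5}, {"e1": 40, "e3": 2}]
print(hot_share(samples))  # e1 totals 120 -> hot; 1 hot extent out of 3
```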

    As already mentioned, the temperature distribution differs from system to system. But we can single out a couple of stable cases, when the average data temperature in the system is high or low, and offer balanced configuration options for them.
    If the average data temperature in the system is high, the capacity of fast media should be comparable to that of slow media. Some of the HDDs in the system are replaced with SSDs; such a configuration holds fewer terabytes per node, and possibly fewer disks, but the available processor power per terabyte of data is high.
    Otherwise, when a large share of the data is rarely used, we supplement high-capacity HDDs with a small number of SSDs to speed up I/O for basic operations and critical objects. In this case, the ratio of processor power per terabyte of data almost matches the baseline.

    In conclusion, here is an example screenshot of the TVS Monitor portlet from the Teradata Viewpoint administration tool. The portlet lets you track statistics on data temperature and the load on devices of different speeds. The administrator can see how data is distributed now, assess the current need for migration, and also study how the distribution has changed in the past.
