Non-obvious features of storage systems


Colleagues, I would like to share some practical experience of operating virtualization environments and related systems.

This article discusses the peculiarities of storage systems. In principle, the information also applies to physical servers that use storage systems, but we will mostly talk about virtualization, and VMware in particular.

Why VMware? It's simple: I specialize in this virtualization environment and hold VCP5 status. For a long time I worked with large customers and had access to their systems.

So, let's start from the basics. I will try not to burden readers with obscure terms and complicated calculations. Not everyone is a specialist in the field, and this may come in handy for many.

What is virtualization (in the server field)? It is a software and hardware system that allows computing resources to be logically separated from the hardware. In the classic setup, only one operating system can run on a server and manage it: all computing resources are given to that operating system, and it owns them exclusively. With virtualization, we add a software layer that emulates some of the server's computing resources in the form of an isolated container, and there can be many such containers (virtual machines). Each container can run its own operating system, which may not even suspect that its "hardware server" is actually a virtual container.

How does the hypervisor work? Roughly speaking, all requests from the operating systems in the containers (the guest operating systems) are accepted by the hypervisor and processed in turn. This applies to CPU power, RAM and network cards, as well as to the data storage system. The latter is what we will discuss below.
The disk resources that the hypervisor provides to the operating systems in the containers are typically backed by the hypervisor's own disk resources. These can be either the disks of the local physical server or disk resources connected from an external storage system. The connection protocol is secondary here and will not be considered.

In essence, any disk system is characterized by three parameters:
1. Data channel bandwidth
2. Maximum number of I/O operations (IOPS)
3. Average latency under the maximum allowable load

1. Channel bandwidth is usually determined by the storage connection interface and the performance of the subsystem itself. In practice, the average bandwidth load is extremely small and rarely exceeds 50-100 megabytes per second, even for a group of 20-30 virtual servers. Of course, there are specialized workloads, but here we are talking about the proverbial "average temperature across the hospital", and practice points to just such numbers. Naturally, there are peak loads, and at those moments the bandwidth may not be enough, so when sizing (planning) your infrastructure you need to focus on the maximum possible loads.
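
To make the point concrete, here is a minimal sizing sketch. All numbers (per-VM throughput, peak factor) are illustrative assumptions, not measurements; the only claim taken from the text above is that 20-30 average VMs tend to land in the 50-100 MB/s range.

```python
# Rough bandwidth sizing sketch: estimate average vs. peak aggregate
# throughput for a group of VMs. Numbers are illustrative assumptions.

def required_bandwidth_mbps(vm_count, avg_mbps_per_vm, peak_factor):
    """Return (average, peak) aggregate throughput in MB/s."""
    average = vm_count * avg_mbps_per_vm
    peak = average * peak_factor  # size the channel for peaks, not the average
    return average, peak

avg, peak = required_bandwidth_mbps(vm_count=30, avg_mbps_per_vm=3, peak_factor=10)
print(f"average ~{avg} MB/s, plan the channel for ~{peak} MB/s peaks")
# -> average ~90 MB/s (in line with the 50-100 MB/s observed above),
#    while peaks may need an order of magnitude more headroom.
```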

2. I/O operations can be divided into single-threaded and multi-threaded. Given that modern operating systems and applications have thoroughly learned to work multi-threaded, we will consider the entire load multi-threaded. I/O operations can further be divided into sequential and random. With random access everything is clear, but what about sequential access? With the load coming from a large number of virtual machines, each of them multi-threaded, what the storage ends up seeing is almost completely random access to data. Of course, specific cases with sequential access and a small number of threads are possible, but again, we are considering the average case. Finally, I/O operations can be divided into reads and writes. The classic model tells us about 70% reads and 30% writes. Perhaps that is true for applications inside virtual machines. Yet many people take these statistics as the basis for testing and sizing storage systems, and they make a huge mistake: they confuse the access statistics of applications with the access statistics of the disk subsystem. These are not the same thing. In practice, the following split is observed at the disk system level: about 30% reads and 70% writes.
Why such a difference? It is due to caches working at different levels. A cache can exist in the application itself, in the operating system of the virtual machine, in the hypervisor, in the controller, in the disk array, and finally in the disk itself. As a result, some read operations are served by a cache at some level and never reach the physical disks, while write operations always do. This must be clearly remembered and understood when sizing storage systems.
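
A minimal sketch of this effect, assuming a hypothetical aggregate read cache hit rate (the 80% figure is an assumption chosen for illustration, not a measured value):

```python
# Why a 70/30 application read/write mix turns into roughly 30/70 at the disks:
# reads served from any cache layer never reach the disks, while write-back
# caches only delay writes - all of them reach the disks eventually.

def disk_level_mix(app_read_share, app_write_share, read_cache_hit_rate):
    disk_reads = app_read_share * (1 - read_cache_hit_rate)
    disk_writes = app_write_share
    total = disk_reads + disk_writes
    return disk_reads / total, disk_writes / total

r, w = disk_level_mix(app_read_share=0.7, app_write_share=0.3, read_cache_hit_rate=0.8)
print(f"disk-level mix: ~{r:.0%} reads / ~{w:.0%} writes")
# -> ~32% reads / ~68% writes, close to the 30/70 split seen in practice.
```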

3. Latency of the storage system is the time it takes for the guest operating system to receive the requested data from its disk. Schematically and simplified, a request travels as follows: application - operating system - virtual machine - hypervisor - storage system - hypervisor - virtual machine - operating system - application. In reality there are two to three times more intermediate links in the chain, but let's omit the deep technical details.
What is most interesting in this chain? First of all, the response of the storage system itself and the hypervisor's work with the virtual machine. With the storage system everything seems clear: if we have SSDs, the time is spent reading data from the required cell, and that is basically it; latency is minimal, on the order of 1 ms. If we have 10k or 15k SAS disks, the data access time is made up of many factors: the depth of the current queue, the position of the head relative to the next track, the angular position of the platter relative to the head, and so on. The head positions itself, waits for the platter to rotate until the required data passes under it, performs the read or write, and flies off to a new position. A smart controller keeps a queue of data accesses, reorders it according to the head's trajectory and swaps positions in the queue, trying to optimize performance. If the disks are in a RAID array, the access logic becomes even more complex: a mirror, for example, holds two copies of the data on its two halves, so why not read different data from different locations with both halves of the mirror simultaneously? Controllers behave similarly with other RAID types. As a result, the typical latency for fast SAS disks is 3-4 ms, and for their slower cousins, NL-SAS and SATA, it worsens to about 9 ms. A rough breakdown of the latency components is sketched below.
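
A back-of-the-envelope breakdown of per-request latency on a spinning disk, ignoring queueing and controller reordering (which, as noted above, bring the effective figures down). The seek times are typical vendor-sheet assumptions, not measurements:

```python
# Latency components for a single random request on a rotating disk.

def avg_rotational_latency_ms(rpm):
    # On average the platter must turn half a revolution.
    return (60_000 / rpm) / 2

drives = {              # name: (rpm, assumed average seek time in ms)
    "15k SAS":   (15_000, 3.5),
    "10k SAS":   (10_000, 4.5),
    "7.2k SATA": (7_200, 8.5),
}

for name, (rpm, seek_ms) in drives.items():
    rot = avg_rotational_latency_ms(rpm)
    print(f"{name}: ~{rot:.1f} ms rotation + ~{seek_ms} ms seek = ~{rot + seek_ms:.1f} ms per request")
# A smart controller that reorders its queue and a cache that absorbs some
# reads bring the observed averages down towards the 3-4 ms (SAS) and
# ~9 ms (NL-SAS/SATA) figures mentioned above.
```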

Now consider the hypervisor-virtual machine link of the chain. Hard disk controllers in virtual machines are also virtual, usually SCSI devices, and the guest operating system talks to its disk using SCSI commands as well. When the disk is accessed, the virtual machine is paused by the hypervisor and does not run. During this time the hypervisor intercepts the SCSI commands of the virtual controller and then resumes the virtual machine. The hypervisor itself now accesses the virtual machine's disk file and performs the required operations on it. After that, the hypervisor pauses the virtual machine again, generates SCSI commands for the virtual controller and, on behalf of the virtual machine's disk, responds to the guest operating system's recent request. This I/O emulation costs roughly 150-700 clock cycles of the physical server's CPU, that is, around 0.16 microseconds per operation. On the one hand, not that much; but on the other hand? What if the machine has heavy I/O, say 50,000 IOPS? And what if it communicates just as intensively over the network? Add to this the quite probable loss of data from the CPU caches, which may have been evicted while waiting for the hypervisor to handle the intercepted requests. Or, worse still, the virtual machine may have been rescheduled onto a different execution core. The result is a noticeable drop in the overall performance of the virtual machine, which is rather unpleasant.
In practice, I have seen performance drop by up to 40% of the nominal value due to the impact of hyperactive network and disk I/O from a virtual machine.
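
A rough estimate of the raw interception cost, using the cycle counts quoted above; the 3 GHz clock speed is an assumption for the sake of the calculation:

```python
# Raw CPU cost of intercepting guest SCSI commands.

cycles_per_io = 500            # somewhere in the 150-700 range quoted above
cpu_hz = 3_000_000_000         # assumed 3 GHz core
iops = 50_000                  # the "hyperactive" VM from the example

cycles_per_second = cycles_per_io * iops
core_share = cycles_per_second / cpu_hz
print(f"~{core_share:.1%} of one core just for I/O interception")
# -> under 1% - "not so much" on its own. The real damage comes from the side
#    effects: evicted CPU caches, stalls while the VM is paused, and
#    rescheduling onto another core.
```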
A slow disk system has a tremendous impact on applications inside the guest machines. Unfortunately, many specialists underestimate this influence and, when sizing hardware, try to save on the disk subsystem: they choose cheap, large and slow disks and economize on controllers and fault tolerance. Loss of computing power leads to trouble and downtime; losses at the disk system level can lead to the loss of everything, including the business. Remember this.

But let us get back to read and write operations and how they behave at different RAID levels. If we take a stripe (RAID0) as 100% performance, the other array types show the following figures:

Read:
- RAID0: 100%
- RAID10: 100%
- RAID5: ≈90%
- RAID6: ≈90%

Write:
- RAID0: 100%
- RAID10: 50%
- RAID5: 25%
- RAID6: 16%

As we can see, RAID5 and RAID6 suffer enormous write-performance losses. Keep in mind that read and write operations load the disk system together and cannot be considered separately. An example: a hypothetical system delivers 10,000 IOPS in RAID0. We build RAID6 out of it and apply the classic virtualization load of 30% reads and 70% writes. We end up with about 630 IOPS of reads with a simultaneous 1,550 IOPS of writes. Not a lot, is it? Of course, write-back caches in storage systems and controllers raise performance somewhat, but everything has its limits. IOPS must be calculated correctly, as shown in the sketch below.
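
A minimal sketch of the classic write-penalty calculation behind this example; the raw IOPS figure and the 30/70 mix are taken from the example above, the penalty values are the standard ones for each RAID level:

```python
# Front-end IOPS a RAID group can deliver for a given read/write mix.

def usable_iops(raw_backend_iops, read_share, write_share, write_penalty):
    # Each front-end read costs ~1 back-end operation; each write costs
    # `write_penalty` of them (6 for RAID6, 4 for RAID5, 2 for RAID10).
    backend_per_frontend_io = read_share * 1 + write_share * write_penalty
    return raw_backend_iops / backend_per_frontend_io

total = usable_iops(raw_backend_iops=10_000, read_share=0.3, write_share=0.7, write_penalty=6)
print(f"total ~{total:.0f} IOPS: ~{total * 0.3:.0f} reads + ~{total * 0.7:.0f} writes")
# -> roughly 2,200 IOPS in total, i.e. about 660 reads and 1,550 writes -
#    in the same ballpark as the figures quoted above.
```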

A few words about reliability, which has already been mentioned repeatedly. When large, slow disks fail, the array rebuild process begins (did we take care of hot spares?). On 4 TB disks, a RAID5 or RAID6 rebuild takes about a week, and even longer if the load on the array is high. The rebuild also sharply increases the load on the array's disks, which raises the probability of another disk failing. With RAID5 that means permanent data loss; with RAID6 we run a high risk of losing a third drive. By comparison, when a RAID10 array is rebuilt, data is essentially just copied from one half of the mirror (one disk) to the other. This is a much simpler process and takes relatively little time: for a 4 TB drive under average load, the rebuild takes about 10-15 hours, as the estimate below illustrates.
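
A rough estimate of the mirror rebuild time; the sustained copy speed is an assumption, since real arrays throttle the rebuild when production load runs on top of it:

```python
# Mirror (RAID10) rebuild time estimate for a single failed disk.

disk_size_bytes = 4 * 10**12    # a 4 TB drive
copy_speed_bps = 100 * 10**6    # assumed ~100 MB/s sustained copy under load

hours = disk_size_bytes / copy_speed_bps / 3600
print(f"RAID10 rebuild: ~{hours:.0f} hours")
# -> ~11 hours, consistent with the 10-15 hour figure above. A RAID5/6 rebuild
#    has to read every surviving disk and recompute parity, which is why it
#    stretches into days or a week under load.
```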

I would add that there also exist smart storage systems, such as the Dell Compellent SC800, which take a very flexible approach to data placement based on configurable tier levels. For example, such a system can write new data only to SLC SSD disks in RAID10 and then, in the background, redistribute data blocks across other disk types and RAID levels depending on the access statistics of those blocks. These systems are quite expensive and aimed at a specific class of customer.
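
A toy illustration of block-level auto-tiering of the kind described above: blocks are placed on tiers according to how often they are accessed. The thresholds and tier names are invented for the example and do not reflect any vendor's actual policy:

```python
# Hypothetical tier-selection policy based on per-block access statistics.

def choose_tier(accesses_per_day):
    if accesses_per_day > 1000:
        return "SSD, RAID10"       # hot blocks stay on fast media
    if accesses_per_day > 50:
        return "15k SAS, RAID10"
    return "NL-SAS, RAID6"         # cold blocks migrate to cheap capacity

block_stats = {"block-001": 5000, "block-002": 120, "block-003": 2}
for block, hits in block_stats.items():
    print(block, "->", choose_tier(hits))
```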

Summing up, the following points remain:
- A storage system for virtualization must be fast and reliable, with minimal latency
- When designing a virtualization environment, plan roughly 40% of the total hardware budget for the storage system
- When sizing the storage system, it is generally worth focusing on RAID10
- Backups will save the world!
