ESergey September 22, 2017 at 18:01

Storage Utilization Analysis

How can you tell that a storage system is performing poorly? How do you determine that its performance headroom is exhausted? Which parameters indicate this? This article covers the analysis of storage utilization, as well as identifying and predicting performance problems. The material may not be of interest to experienced storage administrators, since only general points are considered, without going into the logic of the more intricate performance optimization mechanisms.

First, let's define the terminology. Several terms and abbreviations are close in meaning: storage system, disk array, SAN, storage array, or just storage. I'll try to clarify.

SAN (Storage Area Network) is the set of equipment that carries traffic between servers and a storage system.

SHD (the Russian abbreviation for "data storage system"), or disk array, is the equipment on which data is stored with online access. There are also archive storage systems, but we will not consider them here. The same abbreviation can also be expanded as "data storage network", but among Russian-speaking specialists SHD means the data storage system.

Storage can provide two ways to access data:

Block access: the server operating system works with the storage as with a SCSI hard drive (simplified).
File access: the server operating system works with the storage as with file storage, using NFS, SMB and similar protocols.

Typically, storage systems providing block access have higher performance requirements than systems providing file access; this is due to the specifics of the tasks being solved. The rest of the article discusses block-access storage systems using the Fibre Channel protocol.

Three main metrics are used to evaluate storage performance.

Service Time (ST), often referred to as latency or response time, is measured in milliseconds and means:
- when reading: the time from the moment the storage system receives a request to read a block until the requested data is sent.
- when writing: the time from the moment the block to be written is received until its successful write is acknowledged.
IO/s: the number of input/output operations per second (IOPS).
MB/s: the number of megabytes transferred per second.

The IO/s and MB/s parameters are tied to each other through the data block size: one megabyte of data can be written in 4k blocks, which takes 256 I/O operations, or in 64k blocks, which takes 16.
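The relationship between the two metrics can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and a megabyte is taken as 1024 KiB, which is what the 256 and 16 figures above imply.

```python
def ios_per_megabyte(block_size_kib: float) -> float:
    """Number of IO operations needed to move one megabyte (1024 KiB)."""
    return 1024 / block_size_kib


def iops_to_mb_per_s(iops: float, block_size_kib: float) -> float:
    """Throughput in MB/s produced by a given IO rate at a fixed block size."""
    return iops * block_size_kib / 1024


print(ios_per_megabyte(4))           # 256.0
print(ios_per_megabyte(64))          # 16.0
print(iops_to_mb_per_s(10_000, 32))  # 312.5
```

The same 10,000 IO/s means very different MB/s at 4k and at 64k, which is why both metrics should be read together with the block size distribution.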

Consider the most typical manifestations of storage performance problems in terms of Service Time, IO/s and MB/s.

Increased Service Time

For each storage system, there is a limiting Service Time value that corresponds to maximum performance. Beyond that point, even a slight increase in load leads to a significant increase in Service Time, degrading demanding applications.

For example, below are graphs of Service Time versus IOPS for two storage configurations.

ST for All flash storage, 2 Node, 24x1.9 TB SSD, RAID 5, Random 32k, 50/50 Read / Write.

ST for classic storage, 2 Node, 24x1.8 TB HDD, RAID 5, Random 32k, 50/50 Read / Write.

As a general rule, a Service Time below 1 ms is considered acceptable for All Flash storage, and up to 20 ms for classic storage. The acceptable threshold depends on the number of controllers, the disk speed and, of course, the storage model itself, so it may differ from the values given here.
You also need to take into account the disk subsystem latency at which the application still works normally, and always keep the necessary margin.
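One way to reason about the margin is to take a measured Service Time versus IOPS curve and find the highest load at which latency still fits the application's limit. The sketch below assumes such a curve is available as a list of pairs; the numbers are invented for illustration and are not measurements from any real array.

```python
def max_iops_within_latency(curve, latency_limit_ms):
    """Return the highest IOPS point whose Service Time is within the limit.

    curve: iterable of (iops, service_time_ms) pairs from a load test.
    Returns None if even the lowest measured point exceeds the limit.
    """
    best = None
    for iops, st_ms in sorted(curve):
        if st_ms > latency_limit_ms:
            break  # beyond this point the application would suffer
        best = iops
    return best


# Hypothetical measurements for an all-flash array (illustrative only).
curve = [(10_000, 0.3), (20_000, 0.4), (30_000, 0.6), (40_000, 0.9), (45_000, 2.5)]

ceiling = max_iops_within_latency(curve, latency_limit_ms=1.0)
current_load = 30_000
headroom = 1 - current_load / ceiling

print(ceiling)   # 40000
print(headroom)  # 0.25
```

Here the application tolerates 1 ms, the usable ceiling is 40,000 IO/s, and a current load of 30,000 IO/s leaves a 25% margin.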

MB/s plateau

A plateau in MB/s most often indicates that the bandwidth of the channel or FC adapter is exhausted.

Competing values for MB/s or IO/s

The sum (orange graph) of two or more parameters over a time interval has a constant ceiling and never exceeds it. This can be observed when there is competition for the bandwidth of a channel or a storage port.
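A rough way to spot this pattern in exported statistics is to check whether the sum of two series stays almost flat while each series on its own fluctuates. The following is a minimal sketch under that assumption; the function name, the 5% tolerance and the sample values are all hypothetical, and the series are assumed to contain positive values.

```python
from statistics import mean, pstdev


def looks_like_competition(series_a, series_b, tolerance=0.05):
    """True if the sum is nearly constant while both components vary.

    Variation is measured as the coefficient of variation (stdev / mean).
    """
    total = [a + b for a, b in zip(series_a, series_b)]
    total_cv = pstdev(total) / mean(total)
    a_cv = pstdev(series_a) / mean(series_a)
    b_cv = pstdev(series_b) / mean(series_b)
    return total_cv <= tolerance and min(a_cv, b_cv) > tolerance


# Two hosts sharing one port: whatever one gains, the other loses (MB/s).
host_a = [300, 500, 200, 600, 400]
host_b = [500, 300, 600, 200, 400]
print(looks_like_competition(host_a, host_b))  # True

# Unrelated loads: the sum moves together with the busier host.
host_c = [100, 120, 110, 105, 115]
print(looks_like_competition(host_a, host_c))  # False
```

A positive result is only a hint to look at the shared channel or port; it does not prove where the bottleneck is.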

IO decrease with increasing ST

If the percentage distribution of block sizes has not changed, but ST starts to increase while IO falls, this may indicate hardware problems in the storage system, degradation of one of the controllers, or high CPU utilization.
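This symptom can be checked mechanically by comparing two halves of an observation window. The sketch below is an assumption-laden illustration: the 20% change threshold and the sample data are invented, and the caller is expected to have already verified that the block size distribution did not change.

```python
from statistics import mean


def degradation_suspected(iops, service_time_ms, min_change=0.2):
    """True if IO/s dropped and ST rose by at least min_change (as a fraction)
    between the first and second half of the window."""
    half = len(iops) // 2
    iops_before, iops_after = mean(iops[:half]), mean(iops[half:])
    st_before, st_after = mean(service_time_ms[:half]), mean(service_time_ms[half:])
    iops_dropped = iops_after <= iops_before * (1 - min_change)
    st_rose = st_after >= st_before * (1 + min_change)
    return iops_dropped and st_rose


# Hypothetical samples: the second half shows fewer IOs served more slowly.
iops = [20_000, 21_000, 19_500, 20_500, 15_000, 14_000, 13_500, 14_500]
st_ms = [2.0, 2.1, 1.9, 2.0, 4.0, 4.5, 5.0, 4.8]
print(degradation_suspected(iops, st_ms))  # True
```

If ST rises together with IO, the array is simply being loaded harder; the combination of falling IO and rising ST is what points to the storage side.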

CPU utilization

CPU utilization of the storage controllers should generally not exceed 70%. If it is constantly above 70%, the storage system lacks processing power.
It should be noted that storage systems can be divided into two large groups:

With an ASIC: data transfer inside the array is handled by a separate high-performance chip, while the CPU is left with service tasks such as creating and deleting disks and snapshots, collecting statistics, and so on.
Without an ASIC: all tasks are performed by the CPU.

CPU utilization must be interpreted differently for storage with and without ASIC, but in any case, it should not be higher than 70% in the absence of running service tasks.
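Since the rule concerns utilization that is constantly above 70%, short spikes should not trigger an alert. A simple sketch of such a check is shown below; the sampling interval, the required run length of six samples and the sample data are assumptions made for this example.

```python
def sustained_above(samples, threshold=70.0, min_consecutive=6):
    """True if at least min_consecutive samples in a row exceed the threshold."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical controller CPU samples in percent, e.g. one per 5 minutes.
steady_load = [65, 72, 75, 71, 80, 78, 74, 69]
spiky_load = [65, 90, 60, 95, 62, 88, 61, 70]

print(sustained_above(steady_load))  # True
print(sustained_above(spiky_load))   # False
```

Samples taken while service tasks such as snapshot operations are running should be excluded before applying the check.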

Slow growth of read IO

This problem can occur if the storage system uses tiering, that is, it moves data between media of different speeds (for example, SSD and NL SATA).

For example, a database runs under high load one day a week and is idle the rest of the time. Data that has not been accessed for a long time is moved to the slow media, and read speed then rises gradually as the data migrates back to the fast media (so-called data warming).

What kind of load does not indicate a problem?

Sawtooth IO graph

Jumps in MB/s and IO/s

Bouncing values

None of the load patterns listed above indicates a problem on the storage side. The load is generated by the hosts connected to the storage system and depends on the logic of the processes using the disk space.

How to define thresholds for Service Time, IO/s and MB/s?

These parameters can be calculated theoretically by adding up the performance of the disks and accounting for the penalty of the selected RAID level. You can also use sizing tools (sizers) if they are available. Either way, the result will be very approximate, because the real load profile is not taken into account. To determine exact threshold values that indicate, for example, 90% storage load, you need to run load tests with specialized software, generating a load profile close to the real one, and measure the maximum IO/s and MB/s. But what about Service Time? Here the dependence is nonlinear. To determine the Service Time corresponding to a 90% load, generate 90% of the maximum IO/s achieved and measure the Service Time at that load. The results can be extrapolated to storage systems with a similar configuration.
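The theoretical estimate mentioned above can be sketched as follows. The write penalties are the commonly cited values (2 for RAID 10, 4 for RAID 5, 6 for RAID 6); the figure of 140 IO/s per HDD is an assumption chosen for illustration, not a vendor specification. Controller cache, hot spares and the real load profile are ignored, so the result is only a rough starting point.

```python
# Commonly cited number of back-end IOs generated by one host write.
RAID_WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def estimate_host_iops(disk_count, iops_per_disk, read_fraction, raid_level):
    """Rough front-end IOPS ceiling for a disk group.

    Each read costs one back-end IO, each write costs `penalty` back-end IOs.
    """
    penalty = RAID_WRITE_PENALTY[raid_level]
    raw_iops = disk_count * iops_per_disk
    write_fraction = 1.0 - read_fraction
    return raw_iops / (read_fraction + write_fraction * penalty)


# 24 HDDs, RAID 5, 50/50 read/write; 140 IO/s per disk is an assumed value.
ceiling = estimate_host_iops(24, 140, 0.5, "raid5")
threshold_90 = 0.9 * ceiling

print(round(ceiling), round(threshold_90))  # 1344 1210
```

The 90% figure computed here is the IO/s level at which Service Time would then be measured under a load test; the estimate itself says nothing about latency.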

Instead of a conclusion

Analyzing and interpreting storage performance parameters is in most cases not a trivial task: you need to understand the architecture and operating principles of the particular storage system, have a SAN port diagram, and know the nuances of the FC adapters used. I did not consider the impact of replication or the use of converged solutions, since these technologies significantly complicate the description of the processes that affect performance and narrow the list of general recommendations. The article also does not cover controller cache usage, disk load, or utilization of the storage system's internal switching ports, since the interpretation of this data depends heavily on the specific storage model and the technologies used.
