"20 thousand IOPS per node - good performance, taking into account delays of 5 ms." For OLTP - no



    The reason for writing this article was a very decent review, How we tested VMware vSAN... by CROC. The review is worthy, but it contains a phrase I have been fighting for more than a decade now. Storage administrators, virtualization specialists, and integrators keep repeating: "5 ms latency is an excellent figure." Even the number itself, 5 ms, has not changed in ten years. I have heard it in person from highly respected admins at least a dozen times, from less respected ones dozens of times, and I have lost count of how often I have read it online... No, no, no. For OLTP loads, 5 ms - especially measured the way it is usually measured - is an epic fail. I have had to explain why many times already; this time I decided to collect my thoughts in a reusable form.


    Let me say right away that the article mentioned above contains no such errors; rather, the phrase simply worked as a trigger.


    Typical start


    Everything described in this article holds for the common DBMSs used for typical business OLTP. My experience is mostly with MS SQL Server, but for PostgreSQL, Oracle, and Sybase many of the points and conclusions will hold as well.


    Everybody is usually unhappy with DBMS performance. If a large system has a DBMS in it - and, surprise, it almost always does - then that DBMS is the bottleneck. Or it will immediately become the bottleneck as soon as you start optimizing everything else. And so the customer comes and says in a human voice: "Help! Save us! We paid $NNNNNNNNN for the server and the storage, but the speed is not growing! The administrator tuned it and the vendor advised us, but it still will not budge." If the system developers fit Lavrov's definition (we can do without the exact quote), and the operations staff "fight incidents by rebooting the server", then the problem is often plain and simple: no indexes, crooked queries, fatal configuration errors (the ones where the documentation says in bold "never do this!!!"), excessive locking, deadlocks, and other simple and understandable nonsense. Such cases are many - most, in fact - but not all. If the system survives those troubles, or the load crosses some invisible threshold, then it will either die of these problems or move on to the next level.


    SQL Server Diagnostic Tips

    IMHO, the best tool right now is the SQL Server First Responder Kit , promoted by Brent Ozar . The tool is being developed very actively. There is also a decent set by Glenn Berry , who has not abandoned his project either. Both sets are beautiful in their own way, and reading the comments and the queries for the first time reveals a lot of new things. Myself, I always start looking around with sys.dm_os_wait_stats, a quick glance at the Error log, and finding out whether there is at least some working backup system.
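
    For a first look at the waits, a minimal sketch along those lines (the DMV and its columns are real; the exclusion list of harmless background waits is abridged and can be extended):

        -- Top cumulative waits since instance start; idle background waits filtered out.
        SELECT TOP (10)
            wait_type,
            waiting_tasks_count,
            wait_time_ms,
            1.0 * wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms,
            signal_wait_time_ms                                 -- CPU-pressure component
        FROM sys.dm_os_wait_stats
        WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                                N'XE_TIMER_EVENT', N'CHECKPOINT_QUEUE',
                                N'REQUEST_FOR_DEADLOCK_SEARCH', N'BROKER_TASK_STOP')
        ORDER BY wait_time_ms DESC;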


    At this level the server is no longer under the director's desk, the disks are no longer inside the server but in a storage array, the developers know about indexes, the administrators have learned PowerShell, and the IT managers have started saying buzzwords like SLA and RPO/RTO. And at this level an interesting situation arises:


    • The DBMS is the bottleneck.
    • The server, by all indicators, should be sufficient.
    • The DBMS can be improved further in software, but it is hard (either move to more expensive licenses, or push the optimization into the "red zone of the Shipilev curve").
    • The disk system was bought expensive and, it seems, was even configured somehow.

    But no. The crocodile is not caught, the coconut does not grow, and system performance is the same as on the old server, or lower. I look into sys.dm_os_wait_stats and see WRITELOG, PAGEIOLATCH_SH, and PAGEIOLATCH_EX at the top, with average waits of 5+ ms. Well, typically: "Hey, admins and DBAs, you have a disk bottleneck here" (a measurement sketch follows the list below) - and here begins the old song about 5 ms:


    • We have 5 ms in the SLA
    • But our shelf pulls 20,000 IOPS
    • The vendor told us that all database files can live on one partition
    • We have virtualization and hyperconvergence, and we cannot allocate separate disks for the database
    • According to our data, utilization is 5%
    • Everything is configured according to the recommendations
    • Your database does not need high performance, it does no more than 300 IOPS (and our shelf, again, does 20,000 IOPS)
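
    Instead of arguing about the shelf's headline IOPS, you can measure what the DBMS itself experiences, per file. A minimal sketch (the DMVs and columns are real; the figures are cumulative since instance start, so sample twice and take the difference for a live picture):

        -- Average IO latency per database file; for log files watch avg_write_ms.
        SELECT
            DB_NAME(vfs.database_id)                                   AS database_name,
            mf.type_desc                                               AS file_type,  -- ROWS / LOG
            1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
            1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN sys.master_files AS mf
          ON mf.database_id = vfs.database_id
         AND mf.file_id     = vfs.file_id
        ORDER BY avg_write_ms DESC;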

    By the way, all of the above applies not only to "your own" servers but also to cloud services and virtualization. Each has a handful of its own specifics, but the typical clinical picture is about the same: a moderately optimized database, sensible development and maintenance staff, headroom in CPU and memory, and an almost zero "exhaust" from further investment.


    So here it is. This whole song about "5 ms" is nonsense and rubbish. If you say this to yourself, read this article. And if you say it to others, prepare your arguments. Earlier, when I heard these words, I would get angry, but I am not angry anymore. Like that bowl of petunias from The Hitchhiker's Guide to the Galaxy, I have only one thought: "Oh no, not again..."


    Who is to blame?


    Why is the database so slow? Well, it would seem that a typical server with 20-64 cores at 2-3 GHz is capable of performing 50-150 billion simple operations per second, yet the best (synthetic) database tests show only 10,000-50,000 transactions per second on such machines. Hey! That is from one million to tens of millions of potential simple operations per transaction. That is not just a lot, it is staggeringly much.
    Such is the overhead of the ACID transaction requirements.


    • Atomicity - either the whole transaction completes, or none of it does.
    • Consistency - on entering and leaving a transaction, the system is in a consistent state.
    • Isolation - transactions do not see each other's intermediate states.
    • Durability - if a transaction was successfully committed, then, whatever the circumstances, the changes it made must remain in the system.

    By the way, to the letter these requirements are almost never met anywhere, and in distributed systems simply never (the CAP theorem gets in the way). For our situation, the "D" requirement is probably the most expensive of all, and it is provided by the key mechanism of every common OLTP DBMS: the WAL, write-ahead log (PostgreSQL), also known as the transaction log (SQL Server), also known as the REDO log (Oracle). Here it is - the millstone around performance's neck and, at the same time, the foundation of transaction durability.


    What is WAL


    Let's forget for a moment about modern SSDs and cool storage arrays. Suppose we have a server with one or several disks in it.
    Any transaction, even the insertion of a single record, is at least potentially - and in practice almost always - a non-atomic action. We almost always have to change not only the page where the record lives, but also index pages and possibly service pages. Within one transaction the same page can change many times. Plus, we execute other transactions in parallel, and neighboring transactions constantly "tug" at the same pages. If we waited for each page to be written to disk before continuing - which is essentially what Durability demands - we would have to write many times more data and wait for every single write to reach non-volatile media. No caches, no reordering of operations in the queue, otherwise integrity is lost! On top of that, we somehow have to mark which data belongs to committed transactions and which does not yet (and what the data was before). For scale: a typical single hard disk (HDD) in this mode yields 50-100 IOPS, and that has been a constant for 20 years. Even one small transaction requires 5-10 write operations. Oh yes, and to know what to write, you first have to read; even very write-heavy OLTP systems read about 3 times more than they write. So our transaction costs 20-40 IOs, which means 0.2-0.8 seconds per disk.
    2 transactions per second. Not enough, is it? Shall we try spreading the load across disks? Oh, but we still have to wait until the previous write finishes, so in the end there is no parallelism. What can be done? Let's introduce a log file into which we sequentially record all write operations and transaction marks! The pros:


    • A record of the operation can be much more compact than writing the whole page (a typical page is 8 KiB; a log record is often 0.5-1 KiB).
    • Instead of recording in the page itself whether the transaction is committed or not, it is enough to mark the start and the commit of the transaction in the log.
    • Pages need not be written after every transaction; they can be written many times less often. The data read/write process becomes completely "decoupled" from the log.
    • The main thing: if we put our log on a separate disk and write its records sequentially, then, since the disk heads no longer have to reposition constantly, even a consumer HDD squeezes out up to 1000 IOPS in this mode. Given that small transactions "cost" 2-4 log records, you can squeeze out 200-400 TPS.
    • In case of a failure, the state of the data file can be restored from such a log, and if a transaction is cancelled, its changes can be rolled back using it.

    This log is called write-ahead log / transaction log / REDO log.


    Hooray! Super! It was 2 transactions per second, now it is 300: a 150x improvement. And at what cost? As it turns out, the price is considerable:


    • In all common DBMSs, logging is strictly sequential. A single thread is responsible for writing the log. Do you have 100 processors? Great. The log will still be written by one thread. The queue depth is exactly one.
    • Still no OS caches and no reordering of operations: the durability requirement remains. Writes are write-through: until the disk answers "I wrote it, and wrote it to the actual media, not to a cache", the DBMS does not continue.
    • If you put the log file on a data disk, almost all the benefits of sequential writes disappear. Strictly speaking, if there are several databases on the server, there should even be several separate log disks.
    • Rolling back a transaction (at least in MS SQL Server) means reading the log and restoring the prior state from it. That is as many write operations as the transaction contained, or even more. Rollback is expensive!

    This explanation is very simplified, "on the fingers", but it is enough for our topic. The WAL is the key, fundamental mechanism of transactionality: mandatory write-through, single-threaded, sequential-write-only access; in storage terms, queue depth 1.
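
    If you want to see this per-commit flush on your own instance, here is a minimal sketch (SQL Server 2016+ because of sys.dm_exec_session_wait_stats; the scratch table dbo.ping is hypothetical, and on very fast log devices some flushes may complete without registering a wait):

        -- Run a batch of tiny autocommit transactions, then look at this
        -- session's accumulated WRITELOG waits.
        CREATE TABLE dbo.ping (id int NOT NULL);

        DECLARE @i int = 0;
        WHILE @i < 100
        BEGIN
            INSERT dbo.ping (id) VALUES (@i);   -- autocommit: one log flush per INSERT
            SET @i += 1;
        END;

        -- Expect on the order of 100 WRITELOG waits;
        -- wait_time_ms / waiting_tasks_count approximates your log latency.
        SELECT wait_type, waiting_tasks_count, wait_time_ms
        FROM sys.dm_exec_session_wait_stats
        WHERE session_id = @@SPID
          AND wait_type  = N'WRITELOG';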


    If you are interested in this topic

    The write-ahead logging topic should be known, at least to a minimal extent, to everyone who administers a DBMS or DBMS infrastructure in any way, or who develops databases.


    WAL and storage


    Storage vendors "from birth" are faced with a DBMS. It is for databases that business buys these insanely expensive complexes: from street price vaults for Dell-EMC, HP, Hitachi, NetApp, when creating a budget, eyes are filled with tears from most top managers, if they, of course, do not receive a percentage of this price. But there is an engineering and marketing conflict. I will explain it using the example of Dell-EMC, but only because I remember where their documentation is.


    So:


    1. The log is single-threaded, with no queue to speak of.
    2. The log is write-through, so its latency is "an eternity" compared to CPU speed.
    3. An OLTP load is a lot of relatively small transactions.
    4. Most other DBMS loads parallelize one way or another.

    Amdahl's law mercilessly tells us that a single-threaded, non-parallelizable load makes adding processors useless: performance will be determined by the log. Moreover, at that point we could not care less about the storage system's IOPS; only latency matters.
    But the other disk operations must not be dismissed either: reads and writes to the data files and to tempdb. Reading is also a "waiting" operation: until a data page has been read from disk into memory, the processor cannot work on it. But for these operations deep queues and reordering within the queue are possible: the DBMS often knows in advance which pages it needs in memory and which pages to flush to disk, and it queues many reads at once. Since what matters in this scenario is when the last operation of the batch finishes, for this load IOPS matters more than the latency of an individual operation. For scale: reads are 85-95% of the operations in a typical OLTP system. Yes, writes are far fewer.


    The storage vendor's development engineers work closely with the database vendors and know all the technical nuances of how a DBMS works with the disk subsystem. Proper planning, partitioning, and allocation of disk resources for a DBMS is a complex and important competence of the storage administrator. At the same Dell-EMC, even the basic white papers H14621 and H12341 with partitioning recommendations for SQL Server run to a hundred pages. Hey! These are not detailed docs, these are the most general white papers! There is a pile of more specific ones ( h15142 , h16389 ... there are legions of them). The "subcontractors" from VMware are not lagging behind either: Architecting Microsoft SQL Server on VMware vSphere. Note that these documents are aimed not only, and not so much, at DBAs as at infrastructure and storage administrators.
    I will also note that all these documents carve out separate LUNs for data, for logs, and for tempdb. Yes, somewhere in the fresher documents it is carefully mentioned that for All-Flash solutions there is no point in putting the logs on physically separate media, but they still suggest cutting separate LUNs. If data and logs land on one LUN, then from the OS point of view it is one IO queue, and there lies the problem: latency on log operations immediately becomes an order of magnitude higher, and because non-reorderable log writes sit in the queue, IOPS on the data files and tempdb drop. This is no "discovery of the century"; it is an ABC truth of working with databases, and it was neither outdated nor cancelled by the arrival of All-Flash. Yes, SSD operations are an order of magnitude faster than HDD operations, but they are still a couple of orders of magnitude slower than memory operations. IO is still the DBMS bottleneck.
    And the technical documents correctly emphasize that for transaction logs the number of IOPS is not what matters; what matters is that latency be minimal (in modern terms, under 1 ms).
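
    One way to watch the log flushes themselves, rather than the raw disk, is through the performance counters the server exposes. A minimal sketch (the counter names are real; the per-second counters are cumulative, so take two samples and divide by the interval):

        -- Cumulative log-flush counters per database.
        SELECT instance_name AS database_name,
               counter_name,
               cntr_value
        FROM sys.dm_os_performance_counters
        WHERE object_name LIKE '%:Databases%'
          AND counter_name IN (N'Log Flushes/sec', N'Log Flush Waits/sec',
                               N'Log Flush Wait Time', N'Log Bytes Flushed/sec')
        ORDER BY instance_name, counter_name;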


    And the marketers need to sell. Hyperconvergence! Virtualization! Deployment flexibility! Deduplication! Easy setup! Lots and lots of IOPS! Beautiful presentations, confident voices, strict suits. And how else do you sell a solution with a 6-7-digit dollar price tag? What gets conveniently forgotten is that you can get either latency or throughput out of a storage system, but not both at once; that a certain load-balancer license costs as much as another shelf; that if intensive writing lasts more than an hour, the controllers run out of cache memory and performance drops to "as if there were no cache"; that training the customer's staff costs another 100,000 rubles for the first course alone; and similar tricks ...


    5 ms


    Either from listening to marketers, or out of laziness, or because of some other cockroaches in the head, storage administrators often do roughly this: take a big shelf, pool all of it into something flat, cut it into thin-provisioned LUNs, and hand out one LUN per server. Or two, because "the system partition deduplicates well". And when I see that SQL is doing badly with the disk subsystem, the same song starts: "5 ms is an excellent figure", "100,000 IOPS", "your load on the storage is less than 5%".


    NO.


    • For OLTP systems, 5 ms on the partition with the WAL / transaction logs is an unacceptable figure. On "near-commodity" hardware 1000 (in words: one thousand) times cheaper, the normal figure is now 0.1-0.3 ms, and tomorrow it will be 0.01 ms. Nobody needs the speed of a 2008 HDD at the price of an entire Moscow apartment-block entrance. No "enterprise-ness" is worth that.
    • The vendor writes that transaction logs are not demanding on IOPS and can be put on HDDs? That is true. But for that, nothing else whatsoever besides the DBMS log writes must touch those disks, and the storage must acknowledge the write to the server as soon as the data lands in non-volatile memory (which is much earlier than it is actually written out).
    • Thin-provisioned disks for real OLTP databases are evil.
    • For the WAL it is entirely uninteresting how many IOPS can be squeezed out at queue depth 10 or 20. There is no depth there.
    • For the WAL, the observation that "the IO queue in the OS is only about 1" proves absolutely nothing. It will never be more than that.
    • No, DBAs and DB developers are not "clumsy woodpeckers who can't properly configure the WAL so its writes get parallelized" (an actual admin's opinion).
    • The amateur logic of judging by utilization - "since your crookedly configured single-partition system doesn't do 10,000 IOPS, it should be moved from the high-end array to mid-range" - is wrong logic.
    • If a 40-core server shows 2.5% CPU load, it does not mean it has nothing to do; most likely it means there is some task blocking all the others.

    When some data load takes 5 minutes on a developer's laptop, while on a 40-core server with 1 TB of RAM and a half-million-dollar storage array the same task takes an hour, even the most patient customers start asking questions about the reasonableness of the costs.


    With an average latency of X on the WAL partition, there will never be more transactions per second than 1000 ms / X, because the log is written by a single thread at queue depth 1:

    Average WAL latency    Max transactions per second
    5 ms                   200
    1 ms                   1,000
    0.5 ms                 2,000
    0.1 ms                 10,000
    0.05 ms                20,000
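
    The same bound as a one-liner, if you want to plug in your own measured latency (plain arithmetic, nothing vendor-specific):

        -- Upper bound on commit rate at queue depth 1: 1000 ms / average log latency.
        SELECT latency_ms,
               CAST(1000.0 / latency_ms AS int) AS max_transactions_per_second
        FROM (VALUES (5.0), (1.0), (0.5), (0.1), (0.05)) AS t(latency_ms);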

    What to do


    Tips for administrators and DBAs


    For OLTP, stop looking at "utilization" and IOPS. Separately, I will note: do not look at IOPS at large queue depths at all; even on data partitions, a large queue is usually a short burst or something that does not affect actual OLTP performance.


    Splitting disk space into separate LUNs is not a DBA's whim. A database has several distinct disk load profiles. At a minimum, the following can be distinguished:


    • Data files. Usually random reads and writes of 8/64 KiB blocks, with 80-95% reads. Queues arise during maintenance windows, during bulk loads, under inefficient or massive queries, and at checkpoints. Performance here is driven mostly by reads. It is important that the 8/64 KiB block alignment is preserved "through" the entire storage stack.
    • tempdb: the same as data files, but reads are usually 40-75% and write responsiveness can matter. In modern MS SQL systems this database can be loaded several times harder than the user database. In a non-clustered DBMS configuration this partition should be excluded from any storage replication: its contents are destroyed on service restart anyway.
    • Archive data / DWH. Reads close to 100%. A typical read block is 64 KiB. Queries read a lot and sequentially, so the queue can jump to 1000 and beyond.
    • Transaction logs. Reads happen only during maintenance (backup, replication, etc.); application performance is affected only by writes. Writes come in blocks of 0.5-64 KiB, with no queue, in a single thread. Latency is critical for the application.
    • Backup and restore. From the database's point of view this is reading in large blocks (often 1 MiB). This load can saturate the channels/buses (both FC and Ethernet) and, in some cases, the storage processors. Backing up one server can affect the performance of other servers on the same SAN/storage.
    • Application files: logs, the default trace, binaries, etc. This load is rarely significant and matters mostly at system startup.

    There are other load types, but they are somewhat exotic (for example, a repository of files stored in the database as a FileStream directory). All these load types have different, often conflicting disk requirements. If they are all dumped onto one partition, you not only degrade performance; crucially, you lose the ability to understand why the system is slow, and you lose the ability to improve just the part that needs improving without a global storage upgrade. Hence the main recommendation:


    Read the recommendations of the storage and DBMS vendors, try to understand why they advise what they advise, and design the system with the different load types in mind. Spread fundamentally different load types across different partitions.
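
    A quick way to check where you stand today is to see which volume each file actually sits on. A minimal sketch (real DMVs; requires VIEW SERVER STATE): if data, log, and tempdb files all report the same mount point, the loads above are competing in one queue.

        -- Which volume does each database file live on?
        SELECT DB_NAME(mf.database_id) AS database_name,
               mf.type_desc,                     -- ROWS / LOG
               mf.physical_name,
               vs.volume_mount_point
        FROM sys.master_files AS mf
        CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
        ORDER BY vs.volume_mount_point, database_name;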


    And a few more things, for good measure


    • Read technical documentation, not marketing gibberish. For Dell/EMC and SQL Server, start with the links in this article.
    • Verify the technical documentation with measurements. Compare those measurements against some "commodity" baseline (for example, NUCs with SSDs, as an option, yes ). Work in a hypothesis-test-analysis cycle, honestly testing your hypotheses.
    • Storage and virtualization-farm administrators need to plan deployments together with the DBAs whenever there are heavily loaded databases (even 200 transactions per second counts).
    • If you use geo-distributed synchronous clustering (MetroCluster and its relatives), check whether it introduces unacceptable delays. Such clustering can easily add 0.5 ms, and when it was 0.2 ms and becomes 0.7 ms with the cluster, performance can drop up to 3x.
    • Design with the understanding that the situation will change over time and the system will have to change with it. If your tempdb partition is idle now, perhaps the DB developers will enable RCSI in the next release, and within 12 hours your carriage will turn into a pumpkin.
    • Latency is almost always more important than throughput. This stays true in cloud technologies, in "hyperconverged systems", and in virtual farms. If you improve throughput by increasing latency, you are most likely making a mistake. Do that only with a very good reason.

    MS SQL Server


    Speaking specifically of MS SQL, there are in fact a number of ways to reduce the load on the transaction-log bottleneck; some of them may help:


    1. I often see the recommendation not to make transactions large. It is correct. But transactions should not be too small either: 1000 consecutive single-row transactions can take 5-30 times longer than one transaction with 1000 INSERTs (see the sketch after this list). And yes, let me remind you: if you did not open a transaction explicitly, the default behavior is "each statement is a separate transaction".
    2. The tempdb transaction log is not a real one: it is cached, with no write-through. So if you need intermediate calculations, do not do them in permanent tables.
    3. If you need to insert many records, use BULK INSERT or another option for minimally logged operations. Only operations such as inserting records and rebuilding an index can be minimally logged, and only in the "Simple" and "Bulk logged" recovery models. And, by the way, in all other cases there is no performance difference between the Simple / Bulk logged and Full models. The most useful reference on bulk loading is still The Data Loading Performance Guide , although that article is gradually becoming obsolete . For dessert (about ETL rather than OLTP), another article from nearly a decade ago: We Loaded 1TB in 30 Minutes with SSIS, and So Can You
    4. Newer versions of SQL Server have Delayed Transaction Durability : a mode in which you buy performance at the cost of durability guarantees.
    5. Newer versions of SQL Server have In-Memory OLTP . A technology with a pile of restrictions, but with proper targeted use it can be helpful.
    6. Look for unnecessary synchronous mirrors and unnecessary synchronous AlwaysOn replicas.
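
    To illustrate point 1, a minimal sketch (the table dbo.t and the row counts are hypothetical; the actual ratio depends on your log latency): the first loop pays a synchronous log flush per row, the second pays roughly one flush at COMMIT.

        -- Hypothetical scratch table.
        CREATE TABLE dbo.t (id int NOT NULL, payload char(100) NOT NULL DEFAULT 'x');

        -- Slow: 1000 autocommit transactions = 1000 synchronous log flushes.
        DECLARE @i int = 0;
        WHILE @i < 1000
        BEGIN
            INSERT dbo.t (id) VALUES (@i);
            SET @i += 1;
        END;

        -- Faster: one explicit transaction = roughly one flush at COMMIT.
        BEGIN TRANSACTION;
        DECLARE @j int = 0;
        WHILE @j < 1000
        BEGIN
            INSERT dbo.t (id) VALUES (@j + 1000);
            SET @j += 1;
        END;
        COMMIT;

    For point 4, the switch is ALTER DATABASE ... SET DELAYED_DURABILITY = ALLOWED (or FORCED); with it, commits stop waiting for the log flush, at the price of possibly losing the last few transactions on a crash.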

    ***


    That's all. There is no magic. 20,000 IOPS at 5 ms latency and queue depth 4-16 say nothing about storage performance for OLTP tasks. For OLTP you need to partition the storage properly, choose the right performance metric, and be able to measure it.


    P.S.: one last note, about SSDs.

    On the horizon there is a very interesting new contender in the fight for database storage performance: Intel Optane. For today's SSDs, the "pretty" performance figures start at queue depth 4, which is not very lifelike. Besides, that write performance is provided by a certain amount of RAM inside the SSD, and if writing is intensive enough, a substantial performance degradation follows. And an SSD has a limited write endurance. Yes, on modern server and top consumer models it is quite large, but by no means infinite. And here Intel Optane enters the stage: judging by tests of the non-server models (one , two ), latency on random operations at queue depth 1 comes out at around 20 microseconds. That is essentially without caching and without degradation; SSDs under the same profile slow down to 100-300 µs. The endurance already exceeds the typical SSD endurance.
    Well, the price, yes. But potentially this hardware opens new performance horizons for "traditional", non-in-memory OLTP architecture without giving up ACID. On the other hand, a 20 µs latency makes one ponder the fate of "ordinary" storage arrays: under low-latency requirements it will be very hard for them to compete with Optane (hello again, server-internal storage?).
    All of this is very cool, and I hope for Optane's success (and for it getting cheaper).


    Acknowledgments

    To eugeneb0 and apatyukov for kind advice and proofreading.

