TSDB Analysis in Prometheus 2
- Transfer

The time series database (TSDB) in Prometheus 2 is a great example of an engineering solution that offers major improvements over the v2 storage in Prometheus 1 in terms of data collection speed and query execution, resource efficiency. We implemented Prometheus 2 in Percona Monitoring and Management (PMM), and I had the opportunity to understand the performance of Prometheus 2 TSDB. In this article I will talk about the results of these observations.
Prometheus average workload
For those who are used to dealing with basic databases, the Prometheus regular workload is pretty curious. The speed of data accumulation tends to a stable value: usually the services that you monitor send about the same number of metrics, and the infrastructure changes relatively slowly.
Information requests may come from different sources. Some of them, such as alerts, also strive for a stable and predictable value. Others, such as user queries, can cause spikes, although this is not typical for most of the load.
Load test
During testing, I focused on the ability to accumulate data. I deployed Prometheus 2.3.2 compiled with Go 1.10.1 (as part of PMM 1.14) on the Linode service using this script: StackScript . For the most realistic load generation, using this StackScript, I launched several MySQL nodes with a real load (Sysbench TPC-C Test), each of which emulated 10 Linux / MySQL nodes.
All of the following tests were performed on a Linode server with eight virtual cores and 32 GB of memory, on which 20 load simulations of monitoring two hundred MySQL instances were launched. Or, in Prometheus terms, 800 targets, 440 scrapes per second, 380 thousand samples per second and 1.7 million active time series.
Design
The usual approach of traditional databases, including the one used by Prometheus 1.x, is memory limit . If it is not enough to withstand the load, you will encounter large delays and some requests will not be fulfilled.
Memory usage in Prometheus 2 is configured through a key
storage.tsdb.min-block-durationthat determines how long the records will be stored in memory before flushing to disk (by default, this is 2 hours). The amount of memory needed will depend on the number of time series, labels, and the intensity of data collection (scrapes) in total with the net input stream. In terms of disk space, Prometheus aims to use 3 bytes per record (sample). On the other hand, memory requirements are much higher.Despite the fact that it is possible to configure the block size, it is not recommended to manually configure it, so you are confronted with the need to give Prometheus as much memory as it asks for your load.
If there is not enough memory to support the incoming metrics stream, Prometheus will drop out of memory or get an OOM killer.
Adding swap to delay the crash when Prometheus runs out of memory does not really help, because using this function causes explosive memory consumption. I think the thing is Go, its garbage collector and how it works with swap.
Another interesting approach is setting the head block to be reset to a disk at a certain time, instead of counting it from the start of the process.

As you can see from the graph, disk flushes occur every two hours. If you change the min-block-duration parameter to one hour, then these discharges will occur every hour, starting in half an hour.
If you want to use this and other graphics in your Prometheus installation, you can use this dashboard . It was developed for PMM, but, with minor modifications, is suitable for any installation of Prometheus.
We have an active block called head block, which is stored in memory; blocks with older data are available through
mmap(). This removes the need to configure the cache separately, but also means that you need to leave enough space for the operating system cache if you want to query data older than the head block.This also means that the consumption of Prometheus virtual memory will look quite high, which is not worth worrying about.

Another interesting design point is the use of WAL (write ahead log). As can be seen from the storage documentation, Prometheus uses WAL to avoid loss during falls. The specific mechanisms for ensuring data survivability, unfortunately, are not well documented. Prometheus version 2.3.2 flushes the WAL to disk every 10 seconds, and this parameter is not user configurable.
Seals (Compactions)
Prometheus TSDB is designed in the image of an LSM repository (Log Structured merge - a log-structured tree with merging): the head block is periodically flushed to disk, while the compression mechanism combines several blocks together to avoid scanning too many blocks during requests. Here you can see the number of blocks that I observed on the test system after a day of work.

If you want to know more about the repository, you can examine the meta.json file, which contains information about the blocks available and how they appeared.
{
       "ulid": "01CPZDPD1D9R019JS87TPV5MPE",
       "minTime": 1536472800000,
       "maxTime": 1536494400000,
       "stats": {
               "numSamples": 8292128378,
               "numSeries": 1673622,
               "numChunks": 69528220
       },
       "compaction": {
               "level": 2,
               "sources": [
                       "01CPYRY9MS465Y5ETM3SXFBV7X",
                       "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                       "01CPZ6NR4Q3PDP3E57HEH760XS"
               ],
               "parents": [
                       {
                               "ulid": "01CPYRY9MS465Y5ETM3SXFBV7X",
                               "minTime": 1536472800000,
                               "maxTime": 1536480000000
                       },
                       {
                               "ulid": "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                               "minTime": 1536480000000,
                               "maxTime": 1536487200000
                       },
                       {
                               "ulid": "01CPZ6NR4Q3PDP3E57HEH760XS",
                               "minTime": 1536487200000,
                               "maxTime": 1536494400000
                       }
               ]
       },
       "version": 1
}Seals in Prometheus are tied to the time a head block was flushed to disk. At this point, several such operations may be performed.

Apparently, seals are unlimited in any way and can cause large disk I / O jumps at runtime.

CPU load spikes

Of course, this has a rather negative effect on the speed of the system and is also a serious challenge for LSM-storages: how to make seals to support high query speeds and not cause too much overhead?
The use of memory in the process of compaction also looks quite interesting.

We can see how, after compaction, most of the memory changes state from Cached to Free: it means that potentially valuable information was removed from there. Curious if used here
fadvice() or some other minimization technique, or is it caused by the fact that the cache was freed from blocks destroyed during compaction?Crash Recovery
Disaster recovery takes time, and it is justified. For an incoming stream of one million records per second, I had to wait about 25 minutes while the recovery was performed taking into account the SSD-drive.
level=info ts=2018-09-13T13:38:14.09650965Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=v2.3.2, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T13:38:14.096599879Z caller=main.go:223 build_context="(go=go1.10.1, user=Jenkins, date=20180725-08:58:13OURCE)"
level=info ts=2018-09-13T13:38:14.096624109Z caller=main.go:224 host_details="(Linux 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 1bee9e9b78cf (none))"
level=info ts=2018-09-13T13:38:14.096641396Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T13:38:14.097715256Z caller=web.go:415 component=web msg="Start listening for connections" address=:9090
level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T13:38:14.098718401Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536530400000 maxt=1536537600000 ulid=01CQ0FW3ME8Q5W2AN5F9CB7R0R
level=info ts=2018-09-13T13:38:14.100315658Z caller=web.go:467 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-09-13T13:38:14.101793727Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ78486TNX5QZTBF049PQHSM
level=info ts=2018-09-13T13:38:14.102267346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536732000000 ulid=01CQ78DE7HSQK0C0F5AZ46YGF0
level=info ts=2018-09-13T13:38:14.102660295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4RM21Y0PT5GNSS146Q
level=info ts=2018-09-13T13:38:14.103075885Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SV8WJ3C2W5S3RTAHC2GHB
level=error ts=2018-09-13T14:05:18.208469169Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum d0465484, want 0" file=/opt/prometheus/data/.prom2-data/wal/007357 pos=15504363
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T14:05:19.471604598Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499156711Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499228186Z caller=main.go:502 msg="Server is ready to receive web requests."The main problem of the recovery process is high memory consumption. Despite the fact that in a normal situation the server can work stably with the same amount of memory, when it crashes, it may not rise due to OOM. The only solution I found was to disable data collection, raise the server, allow it to recover and reboot with the collection already enabled.
Warm up
Another behavior that should be remembered during the warm-up is the ratio of low productivity and high resource consumption right after the start. During some, but not all starts, I observed a serious load on the CPU and memory.


The memory lapses indicate that Prometheus cannot configure all charges from the start, and some information is lost.
I did not find out the exact reasons for the high load on the processor and memory. I suspect that this is due to the creation of new time series in the head block with a high frequency.
CPU load spikes
In addition to the seals, which create a rather high I / O load, I noticed serious jumps in the load on the processor every two minutes. Bursts last longer with a high incoming stream and it looks like they are caused by the Go garbage collector, at least some kernels are fully loaded.


These leaps are not so insignificant. It appears that when they occur, the internal entry point and Prometheus metrics become inaccessible, which causes data gaps at the same time intervals.

You can also notice that the Prometheus exporter shuts up for one second.

We can see correlations with garbage collection (GC).

Conclusion
TSDB in Prometheus 2 is fast, able to handle millions of time series and at the same time with thousands of records made per second using fairly modest hardware. The utilization of the CPU and disk I / O is also impressive. My example showed up to 200,000 metrics per second per core used.
To plan the extension, you need to remember about sufficient memory volumes, and this should be real memory. The amount of memory used, which I observed, was about 5 GB per 100,000 entries per second of the incoming stream, which combined with the operating system cache about 8 GB of occupied memory.
Of course, there is still a lot of work to tame the bursts of CPU and disk I / O, and this is not surprising given how young the TSDB Prometheus 2 is compared to InnoDB, TokuDB, RocksDB, WiredTiger, but they all had similar problems at the beginning of the life cycle.