Is storage speed suitable for etcd? Ask fio
A short story about fio and etcd
The performance of an etcd cluster depends heavily on the performance of its storage. etcd exports metrics to Prometheus that provide the necessary information about storage performance, for example, wal_fsync_duration_seconds. The etcd documentation says that for storage to be considered fast enough, the 99th percentile of this metric should be less than 10 ms. If you plan to run an etcd cluster on Linux machines and want to evaluate whether your storage (for example, SSDs) is fast enough, you can use fio, a popular tool for testing I/O operations. Run the following command, where test-data is a directory under the storage mount point:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
Then just look at the results and check that the 99th percentile of fdatasync duration is less than 10 ms. If it is, your storage is fast enough. Here is an example of the results:
  sync (usec): min=534, max=15766, avg=1273.08, stdev=1084.70
  sync percentiles (usec):
   |  1.00th=[  553],  5.00th=[  578], 10.00th=[  594], 20.00th=[  627],
   | 30.00th=[  709], 40.00th=[  750], 50.00th=[  783], 60.00th=[ 1549],
   | 70.00th=[ 1729], 80.00th=[ 1991], 90.00th=[ 2180], 95.00th=[ 2278],
   | 99.00th=[ 2376], 99.50th=[ 9634], 99.90th=, 99.95th=,
   | 99.99th=
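If you run fio from scripts, you can pull that number out automatically. Below is a minimal sketch; the file name fio.log and the threshold check are our choices, and for illustration the sample output above is embedded in a here-doc. In practice you would save real fio output instead, for example with fio's --output option.

```shell
# Sketch: extract the 99.00th-percentile fdatasync latency (in usec) from
# saved fio output and compare it with the 10 ms (10000 usec) target.
# The here-doc embeds the sample from this post; replace it with your own log.
cat > fio.log <<'EOF'
  sync (usec): min=534, max=15766, avg=1273.08, stdev=1084.70
  sync percentiles (usec):
   | 99.00th=[ 2376], 99.50th=[ 9634]
EOF
p99=$(sed -n 's/.*99\.00th=\[ *\([0-9]*\)\].*/\1/p' fio.log)
echo "99th percentile of fdatasync: ${p99} usec"
if [ "$p99" -lt 10000 ]; then
  echo "storage looks fast enough for etcd"
else
  echo "storage is too slow for etcd"
fi
```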
- We configured the --size and --bs options for our specific scenario. To get a useful result from fio, substitute your own values. Where do you get them? Read below about how we learned to configure fio.
- During the test, all of the I/O load comes from fio. In a real scenario, the storage will likely receive other write requests besides those reflected in wal_fsync_duration_seconds. That extra load will increase the value of wal_fsync_duration_seconds, so if the 99th percentile is already close to 10 ms, your storage will not be fast enough.
- Use fio version 3.5 or newer; earlier versions do not report percentiles of fdatasync duration.
- The output above is only a snippet of the full fio results.
A long story about fio and etcd
What is WAL in etcd
Databases typically use a write-ahead log; etcd does too. We will not discuss the write-ahead log (WAL) in detail here. We only need to know that each member of the etcd cluster maintains one in persistent storage. etcd writes every key-value operation (for example, an update) to the WAL before applying it to the store. If one of the cluster members crashes and restarts between snapshots, it can locally recover the transactions performed since the last snapshot from the WAL contents.
When a client adds a key to the key-value store or updates the value of an existing key, etcd records the operation in the WAL, which is a regular file in persistent storage. Before proceeding, etcd MUST be completely sure that the write to the WAL really happened. On Linux, a single write system call is not enough for this, since the actual write to physical storage may be delayed. For example, Linux may keep the WAL record for some time in a kernel-memory cache (such as the page cache). For the data to be durably written to persistent storage, the write must be followed by the fdatasync system call, and that is exactly what etcd does (as this strace output shows, where 8 is the WAL file descriptor):
21:23:09.894875 lseek(8, 0, SEEK_CUR)   = 12808 <0.000012>
21:23:09.894911 write(8, ".\0\0\0\0\0\0\202\10\2\20\361\223\255\266\6\32$\10\0\20\10\30\26\"\34\"\r\n\3fo"..., 2296) = 2296 <0.000130>
21:23:09.895041 fdatasync(8)            = 0 <0.008314>
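The number in angle brackets at the end of each line is the duration of the call (here, fdatasync took about 8.3 ms). If you collect such traces with strace's -T option, a small sketch like the following can pull the durations out; the file name etcd.strace is our placeholder, and for illustration the sample line is embedded in a here-doc.

```shell
# Sketch: extract fdatasync durations (the <...> suffix printed by strace -T)
# from a saved strace log. The here-doc holds the sample line from this post;
# in practice, point awk at your own strace output files.
cat > etcd.strace <<'EOF'
21:23:09.895041 fdatasync(8)            = 0 <0.008314>
EOF
awk '/fdatasync\(8\)/ { gsub(/[<>]/, "", $NF); print $NF " s" }' etcd.strace
```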
Unfortunately, writing to persistent storage is not instantaneous. If fdatasync calls are slow, the performance of the etcd system suffers. The etcd documentation says that storage is considered fast enough if, at the 99th percentile, fdatasync calls take less than 10 ms to write to the WAL file. There are other useful storage metrics, but in this post we focus only on this one.
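If your cluster already ships etcd metrics to Prometheus, a query along these lines shows the live 99th percentile. The metric is exported as a histogram, so histogram_quantile applies; the 5-minute window here is our choice, not something etcd prescribes.

```promql
# 99th percentile of WAL fsync duration over the last 5 minutes
histogram_quantile(0.99,
  rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
```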
Assess storage with fio
If you need to evaluate whether your storage is suitable for etcd, use fio, a very popular I/O load testing tool. Keep in mind that disk operations can vary widely: synchronous and asynchronous, many classes of system calls, and so on. As a result, fio can be quite hard to use: it has many parameters, and different combinations of their values produce completely different I/O workloads. To get numbers that are meaningful for etcd, you should make sure that the write load generated by fio is as close as possible to the real load etcd generates when writing WAL files.
Consequently, fio should at minimum generate a series of sequential writes to a file, where each write consists of a write system call followed by an fdatasync system call. For sequential writes, fio needs the --rw=write option. To make fio use the write system call rather than pwrite, specify the --ioengine=sync parameter. Finally, for fdatasync to be called after each write, add the --fdatasync=1 parameter. The other two options in this example (--size and --bs) are scenario specific. In the next section, we will show how to configure them.
Why fio and how we learned to configure it
In this post we describe a real case. We had a Kubernetes v1.13 cluster, which we monitored with Prometheus. etcd v3.2.24 was hosted on an SSD. The etcd metrics showed fdatasync latencies that were too high, even when the cluster was idle. The metrics looked strange, and we did not really know what they meant. The cluster consisted of virtual machines, so we had to figure out where the problem was: in the physical SSDs or in the virtualization layer. In addition, we often made changes to the hardware and software configuration and needed a way to evaluate their effect. We could have run etcd in each configuration and looked at the Prometheus metrics, but that would have been too cumbersome. We were looking for a reasonably simple way to evaluate a specific configuration. We also wanted to check whether we were interpreting the Prometheus metrics from etcd correctly.
But for this we needed to solve two problems. First, what does the I/O load that etcd creates when writing to the WAL look like? Which system calls are used? What is the size of the writes? Second, given answers to those questions, how do we reproduce a similar workload with fio? Remember that fio is a very flexible tool with many options. We solved both problems with one approach, using the lsof and strace commands. lsof displays all file descriptors used by a process and the files associated with them. With strace, you can examine an already running process, or start a process and examine it. strace prints all system calls made by the process under study and its child processes. The latter is important, since etcd runs as multiple threads, which strace traces the same way as child processes.
First of all, we used strace to study the etcd server backing Kubernetes while there was no load on the cluster. We saw that almost all WAL writes were about the same size: 2200-2400 bytes. That is why the command at the beginning of this post specifies --bs=2300 (bs is the size in bytes of each fio write). Note that the size of an etcd write depends on the etcd version, the distribution, parameter values, and so on, and it affects the duration of fdatasync. If you have a similar scenario, examine your own etcd processes with strace to get the exact numbers.
Then, to get a complete picture of etcd's file system activity, we ran it under strace with the -ff, -tt, and -T options: we traced the child processes, wrote the output for each of them to a separate file, and collected detailed reports on the start time and duration of each system call. We used lsof to confirm our analysis of the strace output and to see which file descriptor was used for which purpose. This is how we obtained the strace results shown above. The synchronization-time statistics confirmed that the wal_fsync_duration_seconds metric from etcd corresponds to fdatasync calls on WAL file descriptors.
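Concretely, the session looked something like the following. This is a sketch: the PID 1234 and the output prefix etcd-trace are placeholders, and strace and lsof must be installed and run with sufficient privileges.

```shell
# Follow child processes/threads (-ff), writing one file per child
# (etcd-trace.<pid>); print timestamps (-tt) and per-call durations (-T).
strace -ff -tt -T -o etcd-trace -p 1234

# In another terminal: which file descriptor points at the WAL file?
lsof -p 1234 | grep wal
```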
We studied the fio documentation and selected options for our scenario so that fio would generate a load similar to etcd's. We also verified the system calls and their durations by running fio under strace, just as we had done for etcd.
We chose the value of the --size parameter carefully; it represents the total I/O load from fio, in our case the total number of bytes written to storage. It is directly proportional to the number of write (and fdatasync) system calls: for a given bs value, the number of fdatasync calls = size / bs. Since we were interested in a percentile, we needed enough samples for the result to be reliable, and we calculated that 10^4 would be enough (that comes out to 22 mebibytes). With a smaller --size, outliers can skew the result (for example, a few fdatasync calls that take longer than usual could dominate the 99th percentile).
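The arithmetic behind --size=22m can be sketched as follows (the variable names are ours):

```shell
# Why --size=22m: with --bs=2300 we want about 10^4 fdatasync samples
# for a stable 99th percentile, and size = bs * samples.
bs=2300
samples=10000
size_bytes=$((bs * samples))            # 23000000 bytes
size_mib=$((size_bytes / 1024 / 1024))  # ~21 MiB, integer division rounds down
echo "need $size_bytes bytes (~$size_mib MiB) -> --size=22m"
```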
Try it yourself
We have shown how to use fio to find out whether your storage is fast enough for high-performance etcd. Now you can try it in practice yourself, for example using virtual machines with SSD storage in the IBM Cloud.