Slow Cooker: stress testing network services

Original author: Steve Jenson

Linkerd, our service mesh for cloud applications, has to handle large volumes of network traffic over long periods of time. Before each new release, we carefully verify that it still meets this requirement. In this article we describe the load testing strategies and tools we use and look at a few issues they uncovered. We also introduce slow_cooker, an open source load testing tool written in Go, built specifically to run long-lived load tests and surface lifecycle issues.


linkerd acts as a transparent proxy. For requests to a given service it adds connection pooling, failover, retries, latency-aware load balancing, and much more. To be viable in production, linkerd must be able to handle a very large number of requests over long periods of time in a changing environment. Fortunately, linkerd is built on top of Netty and Finagle, whose code is among the most thoroughly tested and production-proven in network programming. But code is one thing, and real-world performance is another.


To evaluate how the system will behave in production, linkerd must be subjected to rigorous and comprehensive load testing. Moreover, since linkerd is part of the underlying infrastructure, its instances rarely stop or restart, and each of them can pass billions of requests while the behavior of services and their clients changes around it. This means we also need to test for lifecycle issues. For high-throughput network servers such as linkerd, lifecycle issues include memory and socket leaks, bad GC pauses, and saturation of the network and disk subsystems. Such problems occur infrequently, but if you do not learn to handle them properly, the consequences can be disastrous.


Who tests the testing software?


In the early stages of linkerd development, we used the popular load testing tools ApacheBench and hey. (Of course, they only speak HTTP, while linkerd proxies various protocols, including Thrift, gRPC, and Mux. But we had to start somewhere.)
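For reference, a quick test with these tools might look something like the following; the target URL is one of the test servers used later in this article, and the exact flags are only an illustration:

# ApacheBench: 100,000 requests over 10 concurrent connections
$ ab -n 100000 -c 10 http://perf-target-1:8080/

# hey: run for 60 seconds at about 500 req/s (50 QPS per worker x 10 workers)
$ hey -z 60s -q 50 -c 10 http://perf-target-1:8080/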


Unfortunately, we quickly realized that, however useful these tools are for getting quick performance numbers, they are not well suited to uncovering the lifecycle problems we wanted to catch. They print a single summary at the end of the test, and with that approach problems can easily go unnoticed. Moreover, they rely on means and standard deviations, which in our view is not the best way to characterize system performance.


To identify lifecycle issues, we needed better metrics and the ability to see how linkerd behaves during long tests that run for hours or days, not minutes.


To get tender code, cook it low and slow


Since we could not find a suitable tool, we built our own: slow_cooker. slow_cooker is a load testing tool designed specifically to run long tests and to identify lifecycle problems. We use slow_cooker extensively to find performance issues and to validate changes to our products. slow_cooker provides incremental reports on the progress of a test, change detection, and all the metrics we need.


So that others can use slow_cooker and take part in its development, today we are open sourcing it. See the slow_cooker source on GitHub and the recently published 1.0 release.


Let's walk through the features slow_cooker provides.


(For simplicity, we will test web services directly. In practice, of course, we use slow_cooker primarily to find problems in linkerd, not in the services it proxies.)


Incremental latency reports


Since slow_cooker is aimed primarily at lifecycle problems that appear over long periods of time, it is built around the idea of incremental reports. Too much can be missed if we only analyze averages over a very large number of samples, especially for bursty phenomena such as garbage collection or network saturation. With incremental reports, we can watch throughput and latency change on a live, running system.


The example below shows slow_cooker output from a load test of linkerd. In this test, linkerd balances load across three nginx servers, each serving static content. Latencies are given in milliseconds, and we report the min, p50, p95, p99, p999, and max latencies recorded in each ten-second interval.


$ ./slow_cooker_linux_amd64 -url http://target:4140 -qps 50 -concurrency 10 http://perf-target-2:8080
# sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
#                      good/b/f t     good%   min [p50 p95 p99  p999]  max change
2016-10-12T20:34:20Z   4990/0/0 5000  99% 10s   0 [  1   3   4    9 ]    9
2016-10-12T20:34:30Z   5020/0/0 5000 100% 10s   0 [  1   3   6   11 ]   11
2016-10-12T20:34:40Z   5020/0/0 5000 100% 10s   0 [  1   3   7   10 ]   10
2016-10-12T20:34:50Z   5020/0/0 5000 100% 10s   0 [  1   3   5    8 ]    8
2016-10-12T20:35:00Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:35:11Z   5020/0/0 5000 100% 10s   0 [  1   3   5   11 ]   11
2016-10-12T20:35:21Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:35:31Z   5019/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:35:41Z   5020/0/0 5000 100% 10s   0 [  1   3   6   10 ]   10
2016-10-12T20:35:51Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:36:01Z   5020/0/0 5000 100% 10s   0 [  1   3   5   10 ]   10
2016-10-12T20:36:11Z   5020/0/0 5000 100% 10s   0 [  1   3   5    9 ]    9
2016-10-12T20:36:21Z   5020/0/0 5000 100% 10s   0 [  1   3   6    9 ]    9

In this report, throughput is shown in the good% column: how close we came to the requested number of requests per second (RPS).
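For example, in the first interval above, 4990 requests succeeded out of the 5000 expected in a 10-second window at 500 req/s, so good% works out to 4990 / 5000, reported as 99%; the remaining intervals hit the full goal and show 100%.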


This report looks good: the system is fast and its response times are stable. Just as importantly, we should be able to see clearly where and when trouble starts. slow_cooker's output is designed to make problems and spikes easy to spot visually, thanks to vertical alignment and a change indicator column. Let's look at an example where we run into a very slow server:


$ ./slow_cooker_linux_amd64 -totalRequests 100000 -qps 5 -concurrency 100 http://perf-target-1:8080
# sending 500 req/s with concurrency=100 to http://perf-target-1:8080 ...
#                      good/b/f t     good%   min [p50 p95 p99  p999]  max change
2016-11-14T20:58:13Z   4900/0/0 5000  98% 10s   0 [  1   2   6    8 ]    8 +
2016-11-14T20:58:23Z   5026/0/0 5000 100% 10s   0 [  1   2   3    4 ]    4
2016-11-14T20:58:33Z   5017/0/0 5000 100% 10s   0 [  1   2   3    4 ]    4
2016-11-14T20:58:43Z   1709/0/0 5000  34% 10s   0 [  1 6987 6987 6987 ] 6985 +++
2016-11-14T20:58:53Z   5020/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3 --
2016-11-14T20:59:03Z   5018/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3 --
2016-11-14T20:59:13Z   5010/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3 --
2016-11-14T20:59:23Z   4985/0/0 5000  99% 10s   0 [  1   2   2    3 ]    3 --
2016-11-14T20:59:33Z   5015/0/0 5000 100% 10s   0 [  1   2   3    4 ]    4 --
2016-11-14T20:59:43Z   5000/0/0 5000 100% 10s   0 [  1   2   3    5 ]    5
2016-11-14T20:59:53Z   5000/0/0 5000 100% 10s   0 [  1   2   2    3 ]    3
FROM    TO #REQUESTS
   0     2 49159
   2     8 4433
   8    32 8
  32    64 0
  64   128 0
 128   256 0
 256   512 0
 512  1024 0
1024  4096 0
4096 16384 100

As you can see, the system is fast and responsive, with the exception of a single interval at 2016-11-14T20:58:43Z, during which throughput dropped to 34% before returning to normal. As the owner of this service, you would probably want to check the logs or performance metrics to find out what caused the incident.
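The FROM/TO table printed at the end of the run is a latency histogram in milliseconds. It tells the same story: almost all requests completed in under 2 ms, while 100 requests landed in the 4096-16384 ms bucket, corresponding to the slow interval above.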


Lifecycle Issue Example: GC Pauses


To demonstrate the advantage of incremental reports over reports that only display summary data, let's simulate a server that suffers from garbage collector pauses. In this example, we test a single nginx process serving static content directly. To simulate GC pauses, we pause and resume nginx in a loop at five-second intervals (using kill -STOP $PID and kill -CONT $PID).
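A minimal sketch of such a pause/resume loop is shown below; NGINX_PID is assumed to hold the pid of the nginx process under test:

# Freeze and thaw nginx every five seconds to imitate long GC pauses.
# NGINX_PID is assumed to hold the pid of the nginx process under test.
$ while true; do kill -STOP "$NGINX_PID"; sleep 5; kill -CONT "$NGINX_PID"; sleep 5; done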


For comparison, let's start with a report from ApacheBench:


$ ab -n 100000 -c 10 http://perf-target-1:8080/
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking perf-target-1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests
Server Software:        nginx/1.9.12
Server Hostname:        perf-target-1
Server Port:            8080
Document Path:          /
Document Length:        612 bytes
Concurrency Level:      10
Time taken for tests:   15.776 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      84500000 bytes
HTML transferred:       61200000 bytes
Requests per second:    6338.89 [#/sec] (mean)
Time per request:       1.578 [ms] (mean)
Time per request:       0.158 [ms] (mean, across all concurrent requests)
Transfer rate:          5230.83 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       3
Processing:     0    1  64.3      0    5003
Waiting:        0    1  64.3      0    5003
Total:          0    2  64.3      1    5003
Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      2
 100%   5003 (longest request)

Here we see a mean latency of about 1.5 ms, with a handful of multi-second outliers. Such a report can easily be mistaken for a healthy one, even though the service under test was unresponsive for exactly half of the test run. If the target SLA is 1 second, the service violated it for more than half of the run, and you might never notice it from this report!


With slow_cooker's incremental reports, we can see that there is a persistent throughput problem. It is also much more obvious that p999 stays high throughout the entire test:


$ ./slow_cooker_linux_amd64 -totalRequests 20000 -qps 50 -concurrency 10 http://perf-target-2:8080
# sending 500 req/s with concurrency=10 to http://perf-target-2:8080 ...
#                      good/b/f t    good%    min [p50 p95 p99  p999]  max change
2016-12-07T19:05:37Z   2510/0/0 5000  50% 10s   0 [  0   0   2 4995 ] 4994 +
2016-12-07T19:05:47Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4999 ] 4997 +
2016-12-07T19:05:57Z   2519/0/0 5000  50% 10s   0 [  0   0   1 5003 ] 5000 +
2016-12-07T19:06:07Z   2521/0/0 5000  50% 10s   0 [  0   0   1 4983 ] 4983 +
2016-12-07T19:06:17Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4987 ] 4986
2016-12-07T19:06:27Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4991 ] 4988
2016-12-07T19:06:37Z   2520/0/0 5000  50% 10s   0 [  0   0   1 4995 ] 4992
2016-12-07T19:06:47Z   2520/0/0 5000  50% 10s   0 [  0   0   2 4995 ] 4994
FROM    TO #REQUESTS
   0     2 19996
   2     8 74
   8    32 0
  32    64 0
  64   128 0
 128   256 0
 256   512 0
 512  1024 0
1024  4096 0
4096 16384 80
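Once again, the final histogram confirms the picture: the vast majority of requests completed in under 2 ms, while 80 requests landed in the 4096-16384 ms bucket, matching the roughly five-second stalls we injected.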

Percentile latency reports


As the ApacheBench example shows, some load testing tools report only the mean and standard deviation. These metrics, however, are poorly suited to latencies, which do not follow a normal distribution and often have very long tails.


In slow_cooker we do not use the mean and standard deviation; instead we display the minimum, the maximum, and several high-order percentiles (p50, p95, p99, and p999). This approach is increasingly common in modern systems, where a single request can fan out into dozens or even hundreds of calls to other systems and services. In such situations, metrics like the 95th and 99th percentiles give a much better picture of the latency callers actually experience.
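The GC pause example above illustrates the difference: ApacheBench reported a mean of roughly 1.6 ms even though the server was stalled for about half of the run, while slow_cooker's p999 column showed latencies of roughly five seconds in every single interval.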


Conclusion


Although writing a load testing tool is not especially difficult these days (particularly in a modern, network-oriented language with built-in concurrency support, such as Go), the choice of measurements and the structure of the reports have a significant impact on how useful such a tool turns out to be.


We currently use slow_cooker extensively to test linkerd and other services (such as nginx). linkerd is tested 24x7 against a variety of services, and slow_cooker has helped us not only prevent deployments of code with serious bugs, but also find performance problems in releases already running in production. The use of slow_cooker at Buoyant has become so ubiquitous that we have started calling load testing "slowcooking".


You can get started with slow_cooker by visiting the releases page on GitHub. Download the tool and point it at your favorite server to see whether it has performance problems. slow_cooker has helped us a great deal in testing linkerd, and we hope you find it just as useful.

