Selecting and Configuring a Garbage Collector for a High-Load System on the HotSpot JVM

Introduction
In RTB (Real-Time Bidding), one of the key metrics is the time it takes to show an ad to a user visiting a site. That time is made up of several stages, one of which is an auction for the ad slot that an SSP (Supply-Side Platform) runs among several DSPs (Demand-Side Platforms). The critical value here is the time within which a DSP must respond with its ad and its bid for the impression. The upper limit is typically about 100 milliseconds. Since ad campaigns only perform well when the DSP handles tens of thousands of requests per second, meeting this limit can be far from trivial.
Our Ad Server, which handles the core work of the GetIntent DSP, is written in Java and runs on the standard HotSpot JVM with its well-known garbage collection (GC) mechanisms. The best approach is therefore to analyze exactly how the application uses memory and, based on that, to choose the most suitable garbage collection algorithm and tune it. That is the subject of this article.
Overall, the result we are after is the best balance between the number of servers (the fewer, the better) and the total duration and frequency of GC pauses, during which we can lose potential impressions.
How we tested
Two workstations were used for testing. On the first, the JVM was started with:
-Xmx4500m
On the second:
-Xmx12g
JVM version: Oracle 1.8.0_66-b17
The CMS (Concurrent Mark Sweep) and G1 (Garbage First) garbage collectors were compared.
Each test ran for 16 hours under a load that fully matched the production load.
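Before a long run like this, it helps to confirm that the JVM really picked up the intended heap limit and collector. The following is a minimal sketch using the standard java.lang.management API; the class name GcSetupReport is hypothetical and not part of the original test harness. On a JDK 8 HotSpot build, the reported collector names are typically "ParNew" and "ConcurrentMarkSweep" for CMS, and "G1 Young Generation" and "G1 Old Generation" for G1.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original test setup.
class GcSetupReport {

    // Names of the collectors registered in the running JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // Maximum heap the JVM will attempt to use, in megabytes (driven by -Xmx).
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("Collectors:    " + collectorNames());
    }
}
```

The value reported by Runtime.maxMemory() can be somewhat lower than the -Xmx setting, depending on the collector, so it is best read as a sanity check rather than an exact echo of the flag.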
CMS (Concurrent Mark Sweep)
CMS can significantly reduce the delays associated with garbage collection. However, using it inevitably brings two main problems that call for additional tuning:
- Memory fragmentation
- High allocation rate
The first problem can be mitigated by controlling the promotion rate. To do that, you need to determine how many objects are promoted to Tenured and how many "die young" in Eden.
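As a rough illustration of what "promotion rate" means numerically, it can be estimated from the growth of old generation occupancy between two young collections. This is only a sketch: the class PromotionRate and the sample figures are made up, and it assumes no old generation collection ran between the two samples.

```java
// Hypothetical helper: estimates promotion rate from two old generation samples.
class PromotionRate {

    // oldGenBeforeBytes / oldGenAfterBytes: old generation occupancy observed after
    // two consecutive young collections (no old generation collection in between).
    // elapsedSeconds: wall-clock time between the two samples.
    static double megabytesPerSecond(long oldGenBeforeBytes, long oldGenAfterBytes,
                                     double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        double promotedMb = (oldGenAfterBytes - oldGenBeforeBytes) / (1024.0 * 1024.0);
        return promotedMb / elapsedSeconds;
    }

    public static void main(String[] args) {
        // Made-up numbers: the old generation grew by 20 MB in 10 seconds.
        long before = 500L * 1024 * 1024;
        long after = 520L * 1024 * 1024;
        System.out.println(megabytesPerSecond(before, after, 10.0) + " MB/s"); // prints "2.0 MB/s"
    }
}
```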
Testing was carried out with the following parameters:
-XX:+UseConcMarkSweepGC
-XX:NewRatio=1, 3, 5
The following flags were used for logging:
-XX:+PrintGCDetails -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCDateStamps -XX:+PrintCMSInitiationStatistics -XX:PrintCMSStatistics=1
G1 (Garbage First)
G1 looks like a tempting choice for an RTB bidder, since its main goal is to keep Stop-The-World (STW) pauses stable and predictable. This also explains how simple and clear its configuration is: in practice, you work with a single parameter, the maximum acceptable duration of an STW pause, -XX:MaxGCPauseMillis.
In our case, we can afford to sacrifice a small fraction of throughput to eliminate occasional long pauses.
Ever since G1 appeared as an experimental garbage collector, certain prejudices have built up around it, the main one being that MaxGCPauseMillis is not honored. Oracle also recommends using it with sufficiently large heaps (>= 6 GB).
Our tests will show how much of this still holds. We will also spend a little time on String Deduplication, a feature exclusive to G1.
Testing was carried out with the following parameters:
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100, 60, 40
Additionally, tests were conducted with the parameter:
-XX:MaxTenuringThreshold=8
The following flags were used for logging:
-XX:+PrintGCDetails -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC
Max Heap Size 4.5Gb
Summary table of the distribution of Stop The World pauses:

The clear winner in this configuration is CMS with the flag
-XX:NewRatio=5
As you can see, although the ms/sec pause indicator for this configuration is slightly worse than for the others, it is the most stable: the average pause is about 12 ms and almost 98% of pauses fit within the norm, an excellent result for us. With numbers like these, a single Full GC in 16 hours can be overlooked.
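The three indicators used in this comparison (average pause, share of pauses within the norm, and ms/sec) are simple to compute once the pause durations have been extracted from the GC log. Below is a minimal sketch; the class PauseSummary, the sample pauses, and the 100 ms limit passed in main are illustrative assumptions, not the article's actual tooling or data.

```java
import java.util.Arrays;

// Hypothetical helper: summarizes Stop-The-World pause durations.
class PauseSummary {

    // Mean pause length in milliseconds (0 if there were no pauses).
    static double averageMillis(double[] pausesMillis) {
        return Arrays.stream(pausesMillis).average().orElse(0.0);
    }

    // Percentage (0..100) of pauses that are at or below the given limit.
    static double percentWithin(double[] pausesMillis, double limitMillis) {
        if (pausesMillis.length == 0) {
            return 100.0;
        }
        long ok = Arrays.stream(pausesMillis).filter(p -> p <= limitMillis).count();
        return 100.0 * ok / pausesMillis.length;
    }

    // The "ms/sec" indicator: milliseconds of pause per second of run time.
    static double pauseMillisPerSecond(double[] pausesMillis, double runSeconds) {
        if (runSeconds <= 0) {
            throw new IllegalArgumentException("runSeconds must be positive");
        }
        return Arrays.stream(pausesMillis).sum() / runSeconds;
    }

    public static void main(String[] args) {
        double[] pauses = {10, 12, 14, 120}; // made-up pause lengths, in ms
        System.out.println("average ms: " + averageMillis(pauses));
        System.out.println("% within:   " + percentWithin(pauses, 100));
        System.out.println("ms/sec:     " + pauseMillisPerSecond(pauses, 10));
    }
}
```

Dividing the ms/sec figure by 10 gives the percentage of wall-clock time spent in pauses, since one second is 1000 ms.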
Latency distribution chart for the best G1 and CMS indicators:

CMS Results Analysis
We experimented with parameter sets in which the young generation (sized via -XX:NewRatio) took 1/2, 1/4, and 1/6 of the total heap. The average promotion rate for these configurations was 1.7, 2.75, and 2.79 MB/s respectively, which is logical: the smaller the young generation, the more garbage manages to leak into Old Gen. As you can see, beyond a certain point the size of the young generation has only a slight effect on this indicator. In our case, we can accept a higher promotion rate (and with it more frequent Old Gen collections and a greater likelihood of fragmentation) in exchange for the lowest possible average delay while the application is running.
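The fractions 1/2, 1/4, and 1/6 follow directly from how -XX:NewRatio is defined: it is the ratio of the old generation to the young generation, so the young generation takes 1/(NewRatio + 1) of the heap. A small worked sketch follows; the class NewRatioMath is hypothetical, and the 4500 MB heap is taken from the -Xmx4500m setting. Eden itself is somewhat smaller than the figures shown, because the Survivor spaces are carved out of the young generation.

```java
// Hypothetical helper: arithmetic behind -XX:NewRatio.
class NewRatioMath {

    // -XX:NewRatio=N means old:young = N:1, so young = heap / (N + 1).
    static long youngMegabytes(long heapMegabytes, int newRatio) {
        if (newRatio < 1) {
            throw new IllegalArgumentException("newRatio must be >= 1");
        }
        return heapMegabytes / (newRatio + 1);
    }

    public static void main(String[] args) {
        long heapMb = 4500; // from -Xmx4500m
        for (int ratio : new int[] {1, 3, 5}) {
            System.out.println("NewRatio=" + ratio + " -> young generation ~"
                    + youngMegabytes(heapMb, ratio) + " MB");
        }
        // NewRatio=1 -> 2250 MB, NewRatio=3 -> 1125 MB, NewRatio=5 -> 750 MB
    }
}
```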
G1 Results Analysis
G1 is clearly cramped in such a small heap. Mixed pauses are very frequent, -XX:MaxGCPauseMillis has little effect on the final result, and the configuration with a 40 ms pause target could not avoid a Full GC.
However, another point puzzled us. By default, G1 allows up to 15 ages in the Survivor area (the default MaxTenuringThreshold). We decided to check whether we really need that many:

The picture is clearly odd. Starting at around age 8, the size stays at roughly the same level. This suggests that these are long-lived objects that will most likely end up in Tenured anyway, and until then every minor collection just copies them between Survivor spaces for nothing, when we could have placed them in Old Gen right away. A good solution is to set MaxTenuringThreshold=8.
However, with the 4.5 GB heap we did not notice a big difference in the results, so we omit them for brevity. Let's see whether anything changes on the larger heap.
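The plateau in the tenuring distribution can also be located programmatically from the per-age sizes printed by -XX:+PrintTenuringDistribution. The sketch below is one possible heuristic, not the method used for this article: the class TenuringPlateau, the 5% tolerance, and the sample sizes are all made up for illustration.

```java
// Hypothetical helper: finds the age from which Survivor sizes stop shrinking.
class TenuringPlateau {

    // bytesByAge[0] is the size for age 1, bytesByAge[1] for age 2, and so on.
    // Returns the first age whose size all later ages stay within the given
    // relative tolerance of, or -1 if no such plateau exists.
    static int plateauAge(long[] bytesByAge, double tolerance) {
        for (int start = 0; start < bytesByAge.length - 1; start++) {
            double base = bytesByAge[start];
            boolean flat = base > 0;
            for (int i = start + 1; flat && i < bytesByAge.length; i++) {
                if (Math.abs(bytesByAge[i] - base) / base > tolerance) {
                    flat = false;
                }
            }
            if (flat) {
                return start + 1; // convert array index to age
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up sizes (in KB) for ages 1..15, flattening out around age 8.
        long[] sizes = {900, 600, 400, 300, 220, 170, 140, 100, 98, 99, 97, 98, 96, 97, 96};
        System.out.println("Plateau starts at age " + plateauAge(sizes, 0.05)); // age 8
    }
}
```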
Max Heap Size 12Gb
Summary table of distribution of Stop-the-World pauses:

The G1 line-up has changed slightly, because the parameter MaxTenuringThreshold=8 (mtt = 8 in the table) started to produce noticeable results in this configuration.
On the large heap, G1 spread its wings and pulled ahead both in the overall distribution of pauses and in its very short maximum pause. Moreover, the average time spent on GC was under 7 ms per second, i.e. less than 0.7%.
Latency distribution chart for the best G1 and CMS indicators:

CMS Results Analysis
The main problem with CMS is commonly believed to be scalability, and our testing confirms this: almost all indicators are worse than with the small heap. On the plus side, thanks to the larger memory size the effect of fragmentation is noticeably lower, with not a single Full GC during the entire experiment.
G1 Results Analysis
The results clearly show that G1 is indeed much more stable with large amounts of memory, and the targets specified in the settings are met quite closely. The undisputed winner is the configuration with the 40 ms pause target. The average pause grew by only 3 ms, while the heap grew roughly 2.7-fold (from 4500 MB to 12 GB). As for the ms/sec indicator, it is about twice as good.
G1 String Deduplication
Since our bidder works with the text-based OpenRTB protocol, writes a lot of string logs, keeps string caches, and so on, it is natural to expect a big effect from this new feature. In theory, the number of garbage collections should go down while the average collection time goes up. We added the flag (-XX:+UseStringDeduplication) to the configuration with MaxGCPauseMillis=100 and Xmx=4500m:

Although the average pause stays within the specified limit, the number of pauses longer than 1000 ms exceeded what we can tolerate. This can be seen in the graph:

Attempts to set a shorter pause target led to a very sharp increase in CPU consumption. We decided not to use this option.
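For readers unfamiliar with what this feature targets, the sketch below shows the kind of duplication a bidder tends to produce: separate String objects with identical contents. The class DuplicateStrings and the "USD" field are illustrative assumptions. G1 String Deduplication does not merge the String objects themselves; it lets equal strings that have survived a few collections share one backing character array, which is different from String.intern(), where callers end up with the same object.

```java
// Hypothetical illustration of duplicate strings; not code from the bidder.
class DuplicateStrings {

    // Builds a fresh String each call, as parsing a request field typically would.
    static String parseField(char[] raw) {
        return new String(raw);
    }

    public static void main(String[] args) {
        char[] raw = {'U', 'S', 'D'};
        String a = parseField(raw);
        String b = parseField(raw);

        System.out.println(a.equals(b));              // true: same contents
        System.out.println(a == b);                   // false: two distinct objects
        System.out.println(a.intern() == b.intern()); // true: interning yields one object
    }
}
```

With deduplication enabled, a and b above would remain two objects, but their contents could eventually be stored only once on the heap.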
Summary
We carried out a detailed analysis of the CMS and G1 garbage collectors. Its main purpose was to understand how far we can reduce the impact of GC on latency, the most critical indicator for our system.
As expected, there is no single answer. For a VM with a heap of about 4.5 GB, the winner was CMS with -XX:NewRatio=5: despite a large maximum pause, over the lifetime of the application it showed a more stable result, with better percentiles and a lower average delay. On a VM with a 12 GB heap, however, G1 beat CMS by a wide margin, which bears out Oracle's recommendation: the ms/sec figure is 1.94 times better and the maximum pause 13.3 times better.
Thanks to this study, we no longer have to work blindly, guided only by isolated recommendations and conflicting opinions. Instead, we were able to find the right balance of configurations for our heterogeneous system, getting maximum stability and, as a result, more profit from what we have today.
Authors of the article - absorbb and dmart28