High CPU, or how the Garbage Collector can kill performance

Good afternoon.

About two weeks ago, our monitoring tool (NewRelic) began to detect a large number of site outages, each lasting no more than a minute but occurring very frequently. In addition, it was visually noticeable that the overall performance of the web application (Umbraco 6.1.6, .NET 4.0) had dropped.

The red stripes in the picture are our outages.


One caveat: shortly before we all noticed this, a new blog module had been installed and the company's blog had been migrated from WordPress to Umbraco.

As a result, we had the following inputs: the application began to store much more data + a third-party module was installed = High CPU.

Let's hit the road


Before starting the investigation, we checked Google Analytics to make sure that the number of users had not changed (it had not) and ran load testing to determine the throughput.

Here we were completely disappointed: our application was dying at 30 simultaneous sessions. The site would not open in a browser at all. And this was production.

Step 1 - collect performance dumps under load using the Debug Diagnostic Tool


1. Install it on the production server.
2. Run it and create a new rule of the “Performance” type.



3. Specify that dumps should be collected based on Performance Counters.



4. In our case, select % Processor Time with a threshold of 80% and a duration of 15 seconds.



This means that dumps will be collected whenever the CPU is more than 80% loaded for 15 seconds.

5. Study the results.

What you need to pay attention to is highlighted in red rectangles.







Namely:
  • At the time of the dump, the Garbage Collector was running (at first I did not pay any attention to this);
  • The heap size is very large;
  • All 4 threads are owned by the Garbage Collector and are consuming 100% of the CPU.


Here I would like to point out that the problem lies not in the GC itself, but in the incorrect memory allocation that forces it to work this way.

A bit of theory


The most time-consuming garbage collection is a Gen 2 collection (which also triggers collections of Gen 1 and Gen 0). Each generation has its own threshold; when it is exceeded, a garbage collection starts automatically. This means that the more often a threshold is exceeded, the more often garbage collection runs.

A small example:

Assume the Gen 2 threshold is 300 MB.
In one second, the GC can clear 100 MB (Gen 2).
Each user causes 10 MB per second to be allocated (in Gen 2).

If we have 10 users, then 10 * 10 = 100 MB per second, so there is no problem.
If we have 40 users, then 400 MB is allocated every second, the threshold is exceeded, a collection is triggered, and so on.
In other words, the more users there are, the more memory is allocated, and the more often garbage collection runs, each time taking a long time to complete.
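
To make the arithmetic above tangible, here is a minimal, self-contained C# sketch (not from the original project) that simulates a growing per-second allocation rate and prints how often Gen 2 collections fire. The exact numbers depend on the machine and GC mode, but the trend of more collections at higher "user" counts is the point.

```csharp
using System;
using System.Threading;

class GcPressureDemo
{
    static void Main()
    {
        int lastGen2 = GC.CollectionCount(2);

        for (int users = 10; users <= 40; users += 10)
        {
            // Simulate "users x 10 MB" of allocations per second.
            // Arrays this big land on the large object heap, which is collected
            // together with Gen 2, so the pressure shows up in Gen 2 collections.
            var allocations = new byte[users][];
            for (int i = 0; i < users; i++)
                allocations[i] = new byte[10 * 1024 * 1024];

            Thread.Sleep(1000);

            int gen2 = GC.CollectionCount(2);
            Console.WriteLine("users={0,3}  heap={1,10:N0} KB  new Gen 2 collections={2}",
                users, GC.GetTotalMemory(false) / 1024, gen2 - lastGen2);
            lastGen2 = gen2;
        }
    }
}
```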

In .NET 4.0, when garbage collection starts, the GC threads are given the highest priority. This means that all server resources are directed to garbage collection and, in addition, all other threads (those processing incoming requests) are temporarily suspended until the collection finishes. This is why the server did not respond to requests at all, even under partial load.
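
How painful these pauses are depends on which GC mode the worker process is running. A tiny check using the standard GCSettings API (nothing Umbraco-specific, just a hedged illustration) can be dropped into any console app or page to confirm whether the blocking server GC is in use:

```csharp
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // On .NET 4.0 the server GC is blocking for Gen 2: managed threads are
        // suspended while it runs, which matches the "site stops responding" symptom.
        Console.WriteLine("Server GC:    {0}", GCSettings.IsServerGC);
        Console.WriteLine("Latency mode: {0}", GCSettings.LatencyMode);
    }
}
```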

Therefore, we can conclude: the cause is the incorrect allocation of a large amount of memory over short periods of time. To solve the problem, we need to find where in our code these so-called memory leaks occur.


Step 2 - find the objects which occupy the most memory (Memory Profiling)

For this, I used dotMemory as a memory profiler.

We start dotMemory under load and try to take a snapshot of the memory when its volume starts to grow rapidly. (The green zone in the image below is Gen 2.)
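
If you cannot sit and watch the profiler graph, the same growth can be spotted from the standard ".NET CLR Memory" performance counters. A rough sketch of such a watcher; the "w3wp" instance name is an assumption, pick the counter instance that matches your worker process:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class HeapGrowthWatch
{
    static void Main()
    {
        // Standard CLR memory counters; the instance name depends on the hosting process.
        var gen2  = new PerformanceCounter(".NET CLR Memory", "Gen 2 heap size", "w3wp");
        var total = new PerformanceCounter(".NET CLR Memory", "# Bytes in all Heaps", "w3wp");

        while (true)
        {
            Console.WriteLine("Gen 2: {0,12:N0} bytes   All heaps: {1,12:N0} bytes",
                gen2.NextValue(), total.NextValue());
            Thread.Sleep(1000);
        }
    }
}
```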



Next, we analyze the snapshot: HttpContext, DynamicNode, and Node occupy the most memory.

HttpContext can be excluded, since it only stores references to the DynamicNode and Node objects.

Next, we group by generations, since we only need objects of the Gen 2 generation. Within Gen 2, we group again by dominators.



This reliably identifies the objects that occupy the most memory. After that, you need to inspect specific instances to determine exactly what kind of object each one is (id, properties, etc.).



At this point it became clear which data was the source of the problem; all that remained was to find the place where it is created and fix it.

Step 3 - Fix the Problem

In my particular case, the problem was in the control that generated the main site navigation. This control was not cached, that is, it ran on every page request. The specific 'memory leak' was associated with a call to the native Umbraco DynamicNode.isAncestor() method. As it turned out, in order to determine the parent, the method loaded the entire site tree into memory. This was confirmed by the fact that the problem only started to appear as the data grew, specifically after the blog import.


Therefore, the fix was to replace the isAncestor method with our own implementation + apply OutputCache to the control, as sketched below.
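
For illustration, here is a minimal sketch of what such a replacement could look like, assuming DynamicNode exposes Id and Path (the comma-separated list of ancestor ids) as it does in Umbraco 6. The IsAncestorFast name is my own; this is not the project's actual code.

```csharp
using System;
using System.Linq;
using umbraco.MacroEngines; // DynamicNode in Umbraco 6

public static class NavigationExtensions
{
    // Instead of letting isAncestor load the whole content tree, use the node's
    // Path property, e.g. "-1,1051,1237", which already lists all ancestor ids.
    public static bool IsAncestorFast(this DynamicNode ancestor, DynamicNode descendant)
    {
        if (ancestor == null || descendant == null)
            return false;

        var ancestorIds = descendant.Path
            .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(int.Parse);

        // A node is not treated as its own ancestor.
        return descendant.Id != ancestor.Id && ancestorIds.Contains(ancestor.Id);
    }
}
```

The caching side is plain ASP.NET output caching: for a user control it is a directive such as <%@ OutputCache Duration="600" VaryByParam="None" %> in the .ascx (the duration here is just an example value).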

Conclusions


- High CPU is not only recursion or heavy load, but also the GC;
- Object creation should be deliberate and consistent with the architecture of the application;
- Output cache: anytime, anywhere;
- Everything that is not visible during normal testing will manifest itself under stress!

And note:

At the time of writing, NewRelic did not help me catch the source of the High CPU, but the % Time in GC performance counter pointed to the source of the problem easily.

If the CPU peaks on the graph rise together with the % Time in GC peaks, and % Time in GC stays above the 20% line, then the High CPU is caused by the garbage collector.
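
A small hedged sketch of checking that rule of thumb programmatically, using the standard "% Time in GC" counter (again, the "w3wp" instance name is an assumption):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class TimeInGcWatch
{
    static void Main()
    {
        var timeInGc = new PerformanceCounter(".NET CLR Memory", "% Time in GC", "w3wp");
        timeInGc.NextValue(); // the first sample is always 0, so prime the counter

        while (true)
        {
            Thread.Sleep(1000);
            float value = timeInGc.NextValue();
            if (value > 20f)
                Console.WriteLine("% Time in GC = {0:F1}% - High CPU is likely GC-driven", value);
        }
    }
}
```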

Thanks for your attention. I hope it was interesting.
