CPU Load: when to start worrying?

This note is a translation of an article from Scout's blog. The article gives a simple and clear explanation of what load average is. It is aimed at novice Linux administrators, but it may be useful to more experienced ones as well.

You are probably already familiar with the concept of load average. Load average is the three numbers displayed when you run the top and uptime commands. They look something like this:
load average: 0.35, 0.32, 0.41

Most people intuitively understand that these three numbers show the average processor load over progressively longer time intervals (one, five, and fifteen minutes), and that lower values are better. Higher numbers mean the server is overloaded. But what values are considered the limit? Which values are “bad” and which are “good”? When should you merely keep an eye on the load average, and when should you drop everything else and fix the problem as fast as possible?
To begin, let's figure out what load average actually means. Consider the simplest case: a server with a single-core processor.
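Before diving into the analogy, note that on Linux the same three numbers that top and uptime display can be read straight from the kernel. A minimal sketch (the /proc/loadavg file is Linux-specific):

```shell
# /proc/loadavg fields: 1-min, 5-min, 15-min load averages,
# then runnable/total task counts and the PID of the newest task.
read one five fifteen rest < /proc/loadavg
echo "1 min: $one  5 min: $five  15 min: $fifteen"
```

This is exactly the data source uptime itself uses, so the numbers will match.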

Traffic flow analogy


A single-core processor is like a single-lane road. Imagine that you manage the flow of cars across a bridge. Sometimes your bridge is so busy that cars have to wait in line to cross it. You want to let people know how long they will have to wait to get to the other side of the river. A good way to do this is to show how many cars are waiting in line at a given point in time. If there are no cars in the queue, approaching drivers know they can cross the bridge immediately. Otherwise, they know they will have to wait their turn.
So, bridge manager, what scale will you use? How about this:
  • 0.00 means there are no cars on the bridge. In fact, any value from 0.00 to 1.00 means there is no queue: an approaching car can cross the bridge without waiting;
  • 1.00 means there are exactly as many cars on the bridge as it can hold. Everything is still fine, but if traffic increases, problems are likely;
  • Values greater than 1.00 mean a queue has formed at the entrance. How big? A value of 2.00 means there are as many cars waiting in line as there are on the bridge. 3.00 means the bridge is completely full and twice as many cars as it can hold are waiting. And so on.

[image: load average = 1.00]
[image: load average = 0.50]
[image: load average = 1.70]
This is the essence of CPU load. “Cars” are processes that are either using a slice of processor time (“crossing the bridge”) or queued up to use it. On Unix this is called the run queue length: the number of processes currently running plus the number waiting to run.
As bridge manager, you would like cars (processes) never to have to wait in line. Ideally, then, the CPU load should always stay below 1.00. Occasional traffic spikes above 1.00 are fine, but if the load constantly exceeds that value, it is time to worry.
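On Linux you can peek at the run queue directly: the fourth field of /proc/loadavg shows currently runnable tasks versus the total number of tasks. A quick sketch:

```shell
# The 4th field of /proc/loadavg looks like "2/415":
# 2 tasks runnable right now, 415 tasks in total.
awk '{print "runnable/total:", $4}' /proc/loadavg
```

The load averages are essentially this runnable count smoothed over one, five, and fifteen minutes.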

So you say 1.00 is the ideal load average?


Not quite. The problem with 1.00 is that you have no headroom left. In practice, many system administrators draw the line at 0.70:
  • The “needs looking into” rule of thumb: 0.70. If the load average stays above 0.70, find out why the system is behaving this way before it turns into a real problem;
  • The “fix it now!” rule of thumb: 1.00. If the load average stays above 1.00, urgently find the cause and eliminate it. Otherwise, you risk being woken up in the middle of the night, and it certainly won't be fun;
  • The “arrgh, it's 3 AM, WTF?!” rule of thumb: 5.00. If the load average exceeds 5.00, you are in serious trouble. The server may hang or run extremely slowly, and it will most likely happen at the worst possible moment, say, in the middle of the night or while you are giving a talk at a conference.
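The rules of thumb above are easy to turn into a monitoring check. A minimal sketch, assuming GNU coreutils' nproc is available (the per-core scaling is explained in the next section):

```shell
#!/bin/sh
# Check the 5-minute load average against the 0.70 ("investigate")
# and 1.00 ("fix it now") rules of thumb, scaled by core count.
cores=$(nproc)
load=$(awk '{print $2}' /proc/loadavg)   # 5-minute average

warn=$(awk -v c="$cores" 'BEGIN {printf "%.2f", c * 0.70}')
crit=$(awk -v c="$cores" 'BEGIN {printf "%.2f", c * 1.00}')

# awk does the floating-point comparison; exit 0 means "exceeded".
if awk -v l="$load" -v t="$crit" 'BEGIN {exit !(l > t)}'; then
    echo "CRITICAL: load $load exceeds $crit on $cores cores"
elif awk -v l="$load" -v t="$warn" 'BEGIN {exit !(l > t)}'; then
    echo "WARNING: load $load exceeds $warn on $cores cores"
else
    echo "OK: load $load on $cores cores"
fi
```

awk is used for the comparisons because plain sh arithmetic cannot handle the fractional load values.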

What about multiprocessor systems? My server shows a load of 3.00 and everything is OK!


Got a quad-core system? Then a load average of 3.00 is perfectly fine.
On multiprocessor systems, the load should be interpreted relative to the number of available processor cores. Full (100%) utilization corresponds to 1.00 on a single-core machine, 2.00 on a dual-core, 4.00 on a quad-core, and so on.
Returning to our bridge analogy, 1.00 means "one fully loaded lane." If there is only one lane on the bridge, 1.00 means that the bridge is 100% loaded, but if there are two lanes, it is only 50% loaded.
The same goes for processors: 1.00 means 100% utilization of a single-core processor, 2.00 means 100% utilization of a dual-core, and so on.
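This "divide by the number of lanes" view is easy to compute directly. A one-liner sketch, again assuming nproc is available:

```shell
# Normalize the 1-minute load by the core count:
# a result of 1.00 means "100% busy" regardless of how many cores.
awk -v cores="$(nproc)" \
    '{printf "load per core: %.2f\n", $1 / cores}' /proc/loadavg
```

On a quad-core box a raw load of 2.00 prints as 0.50, i.e. the "bridge" is half full.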

Multi-core vs. multiprocessing


Which is better: one processor with two cores, or two separate single-core processors? Performance-wise, the two are roughly equal. Yes, roughly: there are many nuances involving cache sizes, process hand-offs between processors, and so on. Still, for interpreting the load average, the only thing that matters is the total number of cores, regardless of how many physical chips they sit on.
Which brings us to two more practical rules:
  • "Number of cores = maximum load." On a multi-core system, the load should not exceed the number of available cores;
  • "A core is a core is a core." How the cores are distributed across processors doesn't matter. Two quad-cores = four dual-cores = eight single-core processors. Only the total number of cores counts.

Bring it all together


Let's look at the load averages with the uptime command:
~$ uptime
 09:14:44 up  1:20,  5 users,  load average: 0.35, 0.32, 0.41

These are the numbers from a quad-core system, and they show plenty of headroom. I won't even start thinking about it until the load average exceeds 3.70.
Which average should I watch: the one-, five-, or fifteen-minute one?

For the thresholds discussed earlier (1.00 = fix it now, etc.), look at the five- and fifteen-minute intervals. If your system exceeds 1.00 on the one-minute interval, everything is still fine. It is when the five- or fifteen-minute average exceeds 1.00 that you should start taking action (adjusted, of course, for the number of cores in your system).
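Since only the five- and fifteen-minute values are worth alerting on, a quick sketch for pulling just those two fields:

```shell
# Fields 2 and 3 of /proc/loadavg are the 5- and 15-minute averages;
# short 1-minute spikes (field 1) are usually harmless.
awk '{print "5 min:", $2, " 15 min:", $3}' /proc/loadavg
```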
The number of cores matters for reading the load average correctly. How do I find it out?

The command cat /proc/cpuinfo prints information about every CPU in your system. To extract the core count, pipe its output through grep:
~$ cat /proc/cpuinfo | grep 'cpu cores'
cpu cores	: 4
cpu cores	: 4
cpu cores	: 4
cpu cores	: 4
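Note that the "cpu cores" line is printed once per logical CPU and shows cores per physical package, not a system-wide total, so for a multi-socket box you cannot simply read one line. A couple of more direct alternatives (a sketch, assuming GNU coreutils and util-linux's lscpu are installed):

```shell
# Count the logical CPUs the kernel actually schedules on:
nproc
grep -c '^processor' /proc/cpuinfo

# lscpu summarizes the full topology: sockets, cores, threads.
lscpu | grep -E '^(Socket|Core|Thread|CPU\(s\))'
```

For interpreting the load average, nproc's count is the number to divide by.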

Translator Notes


The above is the translation of the article itself. A lot of interesting information can also be found in the comments on it. One commenter points out that keeping a performance margin and never letting the load exceed 0.70 is not important for every system: sometimes we need the server to run flat out, and in such cases load average = 1.00 is just what the doctor ordered.

PS


Habrauser dukelion added a valuable remark in the comments: in some scenarios, to squeeze maximum efficiency out of the hardware, it is worth keeping the load average slightly above 1.00, at the cost of each individual process running somewhat less efficiently.

PPS


Habrauser enemo added a remark in the comments that a high load average can be caused by a large number of processes performing read/write operations. That is, load average > 1.00 on a single-core machine does not always mean your system has no CPU headroom left; a closer look at the cause is needed. Incidentally, that would make a good topic for a new Habr post :-)
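This is because on Linux, tasks in uninterruptible sleep (state "D", usually waiting on disk I/O) count toward the load average alongside runnable tasks. A quick sketch for spotting them:

```shell
# List tasks in "D" state and count them. A high load average with
# idle CPUs plus many D-state tasks points at an I/O bottleneck,
# not CPU contention.
ps -eo stat,comm | awk '$1 ~ /^D/ {print; n++} END {print n+0, "tasks in D state"}'
```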

PPPS


Habrauser esvaf asks in the comments how to interpret load average values on a processor with HyperThreading technology. I have not found a definitive answer so far. This article states that a processor with two virtual cores on a single physical core is 10-30% faster than a plain single-core one. If we take that as true, I think only the number of physical cores should be counted when interpreting the load average.
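If you want to follow that interpretation, you first need to know how physical and logical counts differ on your machine. A sketch, assuming util-linux's lscpu is available (its parseable output lists one logical CPU per line with its core and socket IDs):

```shell
# Unique (core, socket) pairs = physical cores;
# nproc = logical CPUs. If they differ, SMT/HyperThreading is on.
physical=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
logical=$(nproc)
echo "physical cores: $physical, logical CPUs: $logical"
```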
