You are measuring the CPU load incorrectly

Original author: Brendan Gregg
  • Translation
The metric we call “CPU utilization” is in fact understood by many people not quite correctly. What does CPU utilization actually measure? Is it how busy our processor is? No, it is not. And yes, I am talking about that classic CPU load figure that every performance analysis tool shows, from the Windows Task Manager to the top command on Linux.

So what can “the processor is 90% loaded right now” actually mean? Perhaps you think it looks something like this:



But actually it looks like this:



“Stalled” means that the processor is able to execute instructions but is not doing so, because it is waiting for something — most often for data to arrive from RAM. The proportion of real work to stalled time in the figure above is what I see day to day in real applications on real servers. There is a significant chance that your program spends its time in roughly the same way, and you simply don’t know it.

What does this mean for you? Knowing how much time the processor actually spends executing instructions, and how much it merely waits for data, sometimes lets you change your code to reduce traffic to RAM. This is especially relevant on today’s cloud platforms, where auto-scaling policies are often tied directly to CPU load, which means that every extra clock cycle spent stalled costs us real money.

What does CPU utilization actually measure?


The metric we call “CPU utilization” actually measures something like “non-idle time”: the amount of time the CPU spent in all threads other than the special idle thread. The kernel of your operating system (whichever it is) accounts for this time at context switches between threads. If a non-idle thread ran for 100 milliseconds before being switched out, the kernel counts those 100 milliseconds as time the CPU spent doing real work in that thread.
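To make the idea concrete, here is a rough sketch (my own simplification in C, not actual kernel code) of the bookkeeping that happens at a context switch:

#include <stdint.h>
#include <stdbool.h>

struct cpu_stats {
    uint64_t busy_ns;   /* time accumulated in non-idle threads */
    uint64_t idle_ns;   /* time accumulated in the idle thread  */
};

/* Conceptually called at every context switch, with the time the
 * outgoing thread just spent on the CPU. */
void account_context_switch(struct cpu_stats *s, bool prev_was_idle,
                            uint64_t ran_ns)
{
    if (prev_was_idle)
        s->idle_ns += ran_ns;   /* idle thread: reported as idle time */
    else
        s->busy_ns += ran_ns;   /* any other thread: reported as "utilization",
                                   even if it mostly stalled waiting on memory */
}

Note that nothing in this accounting distinguishes cycles spent executing instructions from cycles spent stalled: both end up in busy_ns.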

This metric first appeared in this form together with time-sharing operating systems. The programmer’s manual for the Apollo lunar module guidance computer (a pioneering time-sharing system of its day) gave its idle thread the special name “DUMMY JOB”, and engineers compared the number of instructions executed by this thread with the number executed by working threads — this gave them a measure of processor utilization.

So what's wrong with this approach?

Today, processors have become much faster than RAM, and waiting for data has come to occupy the lion’s share of what we still call “CPU time”. When you see a high CPU percentage in the output of top, you may conclude that the processor (the physical chip on the motherboard under the heatsink and fan) is the bottleneck, when in reality it is an entirely different component — the RAM.

And the situation keeps getting worse. For a long time, processor manufacturers increased the speed of their cores faster than memory manufacturers increased access speed and reduced latency. Around 2005, processors with a frequency of 3 GHz appeared on the market, and manufacturers shifted their focus to more cores, hyper-threading, and multi-socket configurations — all of which placed even greater demands on memory bandwidth and latency. Processor manufacturers tried to mitigate the problem with larger caches, faster buses, and so on. This helped somewhat, but did not turn the tide. Most of what we now call “CPU utilization” is already spent waiting for memory, and it is only getting worse.

How to understand what the processor is actually doing


Using hardware performance counters. On Linux, they can be read with perf and similar tools. Here, for example, is a measurement of the whole system over 10 seconds:

# perf stat -a -- sleep 10
 Performance counter stats for 'system wide':
     641398.723351      task-clock (msec)         #   64.116 CPUs utilized            (100.00%)
           379,651      context-switches          #    0.592 K/sec                    (100.00%)
            51,546      cpu-migrations            #    0.080 K/sec                    (100.00%)
        13,423,039      page-faults               #    0.021 M/sec                  
 1,433,972,173,374      cycles                    #    2.236 GHz                      (75.02%)
    <not supported>      stalled-cycles-frontend  
    <not supported>      stalled-cycles-backend   
 1,118,336,816,068      instructions              #    0.78  insns per cycle          (75.01%)
   249,644,142,804      branches                  #  389.218 M/sec                    (75.01%)
     7,791,449,769      branch-misses             #    3.12% of all branches          (75.01%)
      10.003794539 seconds time elapsed

The key metric here is instructions per cycle (insns per cycle: IPC), which shows how many instructions the processor executed, on average, per clock cycle. Simplifying: the higher the number, the better. In the example above this number is 0.78, which at first glance may not seem so bad (useful work 78% of the time?). But no: on this processor the maximum possible IPC is 4.0 (this follows from how modern processors fetch and execute instructions). That is, our IPC of 0.78 is only 19.5% of the maximum possible instruction throughput. And on Intel processors starting with Skylake, the maximum IPC is already 5.0.
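For reference, the IPC in the perf output above is simply the ratio of the instructions counter to the cycles counter:

    IPC = instructions / cycles
        = 1,118,336,816,068 / 1,433,972,173,374 ≈ 0.78

and 0.78 / 4.0 ≈ 19.5% of the theoretical maximum on a 4-wide processor.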

In the clouds


When you work in a virtualized environment, you may not have access to the real performance counters (this depends on the hypervisor and its settings). Here's an article on how this works in Amazon EC2.

Data Interpretation and Response


If your IPC < 1.0, then congratulations: your application is stalled waiting for data from RAM. Your optimization strategy in this case is not to reduce the number of instructions in the code, but to reduce the number of RAM accesses and make more active use of the caches, especially on NUMA systems. From the hardware side (if you can influence it), it makes sense to choose processors with larger caches, faster memory, and a faster bus.
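As a toy illustration of the kind of change that helps here (my own example, not from the original article): traversing a large two-dimensional array in the order it is laid out in memory, instead of striding across it, cuts cache misses dramatically and raises IPC.

#include <stddef.h>

#define N 4096

/* Strided access: each iteration jumps N * sizeof(double) bytes,
 * so almost every load misses the cache and stalls on RAM. */
double sum_column_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Sequential access: walks memory in order, friendly to caches and
 * hardware prefetchers; same result, far fewer stalls. */
double sum_row_major(const double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

Running each version under perf stat should show the difference directly as a higher IPC for the sequential variant.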

If your IPC > 1.0, then your application suffers not so much from waiting on data as from the sheer number of instructions being executed. Look for more efficient algorithms, eliminate unnecessary work, cache the results of repeated operations. Plotting and analyzing Flame Graphs is a great way to sort out where the time goes. From the hardware side, you can use faster processors and increase the number of cores.

As you may have noticed, I drew the line at an IPC of 1.0. Where did I get that number? I calculated it for my platform, and if you don’t trust my estimate, you can calculate it for yours. To do so, write two programs: one that loads the processor 100% with a stream of instructions (without actively touching large blocks of RAM), and another that, on the contrary, actively manipulates data in RAM while avoiding heavy computation. Measure the IPC of each and take the midpoint. That will be the approximate turning point for your architecture.
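A rough sketch of what those two calibration programs could look like (my own code, with arbitrary constants — adjust the buffer size so it comfortably exceeds your last-level cache); run each under perf stat and compare the reported IPC:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* CPU-bound: a long chain of dependent register arithmetic,
 * essentially no memory traffic. */
static uint64_t cpu_bound(uint64_t iters) {
    uint64_t x = 1;
    for (uint64_t i = 0; i < iters; i++)
        x = x * 2862933555777941757ULL + 3037000493ULL;   /* LCG step */
    return x;
}

/* Memory-bound: dependent random loads over a buffer much larger
 * than the last-level cache, so most loads stall on RAM. */
static uint64_t memory_bound(uint64_t iters, size_t n_elems) {
    uint64_t *buf = malloc(n_elems * sizeof *buf);
    for (size_t i = 0; i < n_elems; i++)
        buf[i] = (uint64_t)rand() % n_elems;
    uint64_t idx = 0, sum = 0;
    for (uint64_t i = 0; i < iters; i++) {
        idx = buf[idx];          /* pointer-chasing style dependent load */
        sum += idx;
    }
    free(buf);
    return sum;
}

int main(int argc, char **argv) {
    uint64_t iters = 200000000ULL;              /* arbitrary workload size */
    if (argc > 1 && argv[1][0] == 'm')
        printf("%llu\n", (unsigned long long)memory_bound(iters, 1u << 25));
    else
        printf("%llu\n", (unsigned long long)cpu_bound(iters));
    return 0;
}

Something like perf stat ./calibrate and perf stat ./calibrate mem (the binary name is of course up to you) will then give the two IPC values to average.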

What performance monitoring tools should actually show


I believe that every performance monitoring tool should show the IPC value right next to CPU utilization. This is done, for example, in the tiptop tool on Linux:

tiptop -                  [root]
Tasks:  96 total,   3 displayed                               screen  0: default
  PID [ %CPU] %SYS    P   Mcycle   Minstr   IPC  %MISS  %BMIS  %BUS COMMAND
 3897   35.3  28.5    4   274.06   178.23  0.65   0.06   0.00   0.0 java
 1319+   5.5   2.6    6    87.32   125.55  1.44   0.34   0.26   0.0 nm-applet
  900    0.9   0.0    6    25.91    55.55  2.14   0.12   0.21   0.0 dbus-daemo

Other reasons why "CPU utilization" gets misinterpreted


The processor can make slower progress not only because of time lost waiting for data from RAM. Other factors include:

  • CPU thermal throttling
  • Clock rate variation due to Turbo Boost
  • Clock rate variation due to CPU frequency scaling
  • The averaging problem: 80% average load over a one-minute measurement interval may not be a disaster, but it can hide bursts all the way up to 100%
  • Spin locks: the processor is busy executing instructions and shows a high IPC, but in reality the application is spinning on locks and doing no real work

Conclusions


CPU utilization has become a badly misunderstood metric: it includes time spent stalled waiting for data from RAM, which can take even longer than executing real instructions. You can determine the actual processor load with additional metrics, such as instructions per cycle (IPC). Values below 1.0 indicate that you are limited by the speed of memory access, higher values indicate that you are limited by the instruction stream itself. Performance measurement tools should be improved to display IPC (or something similar) right next to CPU utilization, giving the user a complete picture. With all this data, developers can optimize their code precisely in those places where it will do the most good.
