An uncalled function slows a program down fivefold

Translation

Slow Windows, Part 3: Completing Processes

The author works on optimizing Chrome's performance at Google. (Translator's note.)

In the summer of 2017, I struggled with a Windows performance issue. Process termination was slow, serialized, and blocked the system input queue, which led to repeated mouse cursor hangs when building Chrome. The main cause was that at process exit Windows spent a lot of time looking for GDI objects while holding the system-global user32 critical section. I described this in the article "24-core processor, but I can not move the cursor".

Microsoft fixed the bug and I went back to my own work, but later the bug seemed to have returned: people complained that the LLVM tests ran slowly and that input froze frequently.

In fact, the bug had not returned. The cause was a change in our code.

2017 problem

Every Windows process contains several standard GDI object handles. For processes that do nothing with graphics, these handles are usually NULL. When a process exits, Windows calls some functions on these handles, even if they are NULL. That didn't matter, because the functions were fast, until the release of Windows 10 Anniversary Edition, in which some security changes made these functions slow. While running, they held the same lock that is used for input events. When a large number of processes exit at the same time, each one makes several calls to a slow function that holds this critical lock, which ends up blocking user input and making the cursor hang.

Microsoft's patch was to stop calling these functions for processes without GDI objects. I don't know the details, but I think the fix was something like this:

+ if (IsGUIProcess())
+   NtGdiCloseProcess();
- NtGdiCloseProcess();

In other words, skip the GDI cleanup if the process is not a GUI/GDI process. Since compilers and the other processes that we create and destroy rapidly did not use GDI objects, this patch was enough to fix the UI hangs.

2018 problem

It turned out that a process can very easily end up with some standard GDI objects. If your process loads gdi32.dll, you automatically get GDI objects (DCs, surfaces, regions, brushes, fonts, and so on) whether you need them or not. (Note that these standard GDI objects are not counted among the process's GDI objects in Task Manager.)

But that should not be a problem. After all, why would a compiler load gdi32.dll? Well, it turned out that if you load user32.dll, shell32.dll, ole32.dll, or many other DLLs, you automatically get gdi32.dll as well (with the standard GDI objects mentioned above). And it is very easy to load one of these libraries by accident.

At the start of each process the LLVM tests called CommandLineToArgvW (shell32.dll), and sometimes they called SHGetKnownFolderPath (also shell32.dll). These calls were enough to pull in gdi32.dll and create those dreaded standard GDI objects. Since the LLVM test suite spawns a great many processes, it ended up serialized on process exit, causing huge delays and input freezes, much worse than those in 2017.

But this time we already knew about the underlying lock problem, so we knew right away what to do.

First, we got rid of the call to CommandLineToArgvW by parsing the command line manually. After that, the LLVM test suite rarely called any function from any of the problematic DLLs. But we knew in advance that this alone would not affect performance: the remaining conditional call was still enough to always pull in shell32.dll, which in turn pulled in gdi32.dll and created the standard GDI objects.
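To give a sense of what manual command line parsing can look like, here is a minimal sketch. It is not LLVM's actual code: the helper name SplitCommandLine is hypothetical, and the quoting rules are deliberately simplified (spaces and tabs separate arguments, double quotes group text, and backslash escaping is not handled, unlike in CommandLineToArgvW).

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not LLVM's actual code: splits a command line into
// arguments without calling CommandLineToArgvW, so shell32.dll is not needed.
// Simplified rules (an assumption): spaces and tabs separate arguments and
// double quotes group text; backslash escaping of quotes is NOT handled.
std::vector<std::wstring> SplitCommandLine(const std::wstring& command_line) {
  std::vector<std::wstring> args;
  std::wstring current;
  bool in_quotes = false;
  bool have_token = false;  // true once the current argument has started
  for (wchar_t ch : command_line) {
    if (ch == L'"') {
      in_quotes = !in_quotes;  // quotes toggle grouping and are dropped
      have_token = true;       // "" still counts as an (empty) argument
    } else if (!in_quotes && (ch == L' ' || ch == L'\t')) {
      if (have_token) {  // whitespace outside quotes ends the argument
        args.push_back(current);
        current.clear();
        have_token = false;
      }
    } else {
      current.push_back(ch);
      have_token = true;
    }
  }
  if (have_token) args.push_back(current);  // flush the last argument
  return args;
}
```

On Windows the input string would typically come from GetCommandLineW, which lives in kernel32.dll and so does not bring in shell32.dll.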

The second fix was to delay load shell32.dll. Delay loading means that a library is loaded on demand, when one of its functions is first called, instead of when the process starts. This meant that shell32.dll and gdi32.dll would be loaded only rarely rather than always.
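With the MSVC toolchain, delay loading is normally a linker feature (the /DELAYLOAD option together with delayimp.lib), so it needs no source changes. The load-on-first-use idea behind it can be sketched by hand, though. The class below is a hypothetical illustration of that idea, not the mechanism the linker generates and not what the LLVM patch does.

```cpp
#include <functional>
#include <utility>

// Hypothetical helper: runs `load` on the first Get() call rather than at
// process startup, which is the essence of delay loading.
// Not thread-safe; a sketch only.
template <typename Handle>
class LazyModule {
 public:
  explicit LazyModule(std::function<Handle()> load) : load_(std::move(load)) {}

  // Loads on first use, then returns the cached handle.
  Handle Get() {
    if (!loaded_) {
      handle_ = load_();
      loaded_ = true;
    }
    return handle_;
  }

  // Reports whether the load has happened yet.
  bool loaded() const { return loaded_; }

 private:
  std::function<Handle()> load_;
  Handle handle_{};
  bool loaded_ = false;
};
```

On Windows, one might instantiate it as LazyModule<HMODULE> with a loader that calls LoadLibraryW(L"shell32.dll"), and look up functions with GetProcAddress only on the rare code path that needs them. A process that never reaches that path never loads shell32.dll or gdi32.dll.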

After this, the LLVM test suite ran five times faster: in one minute instead of five. And there were no more mouse hangs on developers' machines, so people could work normally while the tests ran. This is a crazy speedup for such a modest change, and the author of the patches was so grateful for my investigation that he nominated me for a corporate bonus.

Sometimes the smallest changes have the biggest consequences. You just need to know where to type "zero".

The path not taken

It is worth repeating that we paid attention to code that was not executed, and that turned out to be the key change. If you have a command line tool that does not use gdi32.dll, then adding code with a conditional function call will make process exit many times slower if that call causes gdi32.dll to be loaded. In the example below, CommandLineToArgvW is never called, but its mere presence in the code (without delay loading) hurts performance:

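#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW; link with shell32.lib

int main(int argc, char* argv[]) {
  if (argc < 0) {  // never true, so the call below never executes
    CommandLineToArgvW(nullptr, nullptr); // shell32.dll, pulls in gdi32.dll
  }
}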

So yes, deleting a function call, even if the code is never executed, may be sufficient to significantly improve performance in some cases.
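One way to check whether a tool has ended up with gdi32.dll in its address space is to ask the loader at runtime. The sketch below assumes the Win32 GetModuleHandleW function. The helper name IsModuleLoaded is hypothetical, and the non-Windows branch is a stub that exists only so the snippet compiles elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reports whether a DLL with the given name is currently
// loaded into this process. GetModuleHandleW does not load the module and does
// not change its reference count; it only looks it up.
bool IsModuleLoaded(const wchar_t* module_name) {
#ifdef _WIN32
  return GetModuleHandleW(module_name) != nullptr;
#else
  (void)module_name;  // stub: DLL lookup is Windows-specific in this sketch
  return false;
#endif
}
```

Calling IsModuleLoaded(L"gdi32.dll") early in main, for example in a debug build, shows whether some import has already dragged the library in before any of your own code ran.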

Reproducing the pathology

When I investigated the original bug, I wrote a program (ProcessCreateTests) that created 1000 processes and then killed them all in parallel. This reproduced the hang, and when Microsoft fixed the bug, I used the program to verify the patch: see the video. After the bug's reincarnation, I modified the program by adding a -user32 option, which makes each of the thousand test processes load user32.dll. As expected, with this option the time to terminate all the test processes increases dramatically, and mouse cursor hangs are easy to notice. Process creation time also increases with -user32, but there are no cursor hangs during process creation. You can use this program to see how bad the problem can get.

Here are some typical results from my quad-core, eight-thread laptop after a week of uptime. The -user32 option increases the time for everything, but the increase in UserCrit lock blocking during process destruction is especially dramatic:

> ProcessCreatetests.exe
Process creation took 2.448 s (2.448 ms per process).
Lock blocked for 0.008 s total, maximum was 0.001 s.

Process destruction took 0.801 s (0.801 ms per process).
Lock blocked for 0.004 s total, maximum was 0.001 s.

> ProcessCreatetests.exe -user32
Testing with 1000 descendant processes with user32.dll loaded.
Process creation took 3.154 s (3.154 ms per process).
Lock blocked for 0.032 s total, maximum was 0.007 s.

Process destruction took 2.240 s (2.240 ms per process).
Lock blocked for 1.991 s total, maximum was 0.864 s.

Digging deeper, just for fun

I thought of some ETW techniques that could be used to study the problem in more detail, and I had already started writing them up. But I ran into behavior so inexplicable that I decided to devote a separate article to it. Suffice it to say that in that case Windows behaves even more strangely.

Other articles in the series:

Slow Windows, Part 0: Arbitrary Slowdown of VirtualAlloc
Slow Windows, Part 1: File Access
Slow Windows, Part 2: Creating Processes
Slow Windows, Part 3: This

Literature

The first report on the UI hangs: "24-core processor, but I can not move the cursor"
The follow-up article, which led to an understanding of the problem: "What *does* Windows do while holding this lock"
An article about a different UI hang caused by the interaction between Gmail, ASLR v8 workers, CFG memory allocation policies and slow WMI scanning: "24-core CPU, but I can't type an email"
A compiler loading gdi32.dll seems strange, but it is even stranger for a compiler to load mshtml.dll, which VC++ used to do in some cases
Sometimes weeks of research lead to small but critical changes, as discussed in the article "Know where to type zero"
A video demonstrating the use of ProcessCreateTests and ETW to verify the bug fix
The first LLVM change: manual command line parsing
The second LLVM fix: delay loading shell32.dll
