How I found a bug in Intel Skylake processors

Original author: Xavier Leroy
Instructors of introductory programming courses know that students will blame anything for the errors in their programs. The sorting routine discards half of the data? “It could be a Windows virus!” The binary search never works? “The Java compiler is acting up today!” Experienced programmers know all too well that bugs are usually in their own code, occasionally in third-party libraries, very rarely in system libraries, even more rarely in the compiler, and never in the processor. I thought so too until recently, when I ran into a bug in Intel Skylake processors while debugging mysterious crashes in OCaml programs.

First manifestation


In late April 2016, shortly after the release of OCaml 4.03.0, a Very Serious Industrial OCaml User (referred to below as OSIP) contacted me privately with bad news: one of their applications, written in OCaml and compiled with OCaml 4.03.0, was crashing randomly. Not on every run, but every now and then it segfaulted, at different places in the code. Moreover, the crashes were observed only on their newest machines, which ran Intel Skylake processors. (Skylake was the code name of the latest generation of Intel processors at the time; the latest generation is now Kaby Lake.)

Over the past 25 years many OCaml bugs have been reported to me, but this report was particularly worrying. Why only Skylake processors? I could not even reproduce the crashes with OSIP's binaries on the machines at my company, Inria, because they all had older Intel processors. And why were the crashes not reproducible? OSIP's application is single-threaded and makes no network I/O, only file I/O, so its execution should be fully deterministic, and whatever bug caused the segfault should show up on every run, at the same place in the code.

My first guess was that OSIP's hardware was faulty: a bad memory chip? overheating? In my experience, a machine with such a fault can boot and run a GUI normally, yet crash under load. So I advised OSIP to run a memory test, lower the processor clock speed, and disable Hyper-Threading (HT). The HT suggestion was prompted by a recent report of a Skylake bug in AVX vector arithmetic that showed up only with HT enabled (see description).

OSIP did not like my advice. They objected, quite reasonably, that they were running other CPU- and memory-intensive tasks and tests, yet only programs written in OCaml crashed. Clearly, they had decided that their hardware was fine and the bug was in my software. Well, great. I still persuaded them to run a memory test, which found no errors, but they ignored my request to turn off HT. (A pity, because it would have saved us a lot of time.)

Meanwhile, OSIP carried out an impressive investigation using different versions of OCaml, different C compilers to build the OCaml runtime system, and different operating systems. The verdict was as follows. OCaml: 4.03 crashes, including its early betas, but 4.02.3 does not. C compilers: GCC builds crash, Clang builds do not. Operating systems: Linux and Windows crash, MacOS does not. Since MacOS uses Clang and the Windows port is built with GCC, OCaml 4.03 and GCC were clearly identified as the culprits.

OSIP's reasoning was logical enough: the OCaml 4.03 runtime system must contain a piece of bad C code (with undefined behavior, as we say in the business) that makes GCC generate bad machine code, as C compilers are allowed to do in the presence of undefined behavior. It would not be the first time that GCC handled undefined behavior in the least helpful way possible. For example, see this security vulnerability or this broken benchmark.

This explanation seemed quite plausible, but it did not account for the random nature of the crashes. GCC may generate bizarre code because of undefined behavior, but that code is still deterministic. The only source of randomness I could think of was Address Space Layout Randomization (ASLR), an OS feature that changes absolute memory addresses from one run to the next. The OCaml runtime system uses absolute addresses in a few places, including for indexing memory pages in a hash table. But the crashes remained random even with ASLR disabled, in particular when running under the GDB debugger.

It was now May 2016, and it was my turn to get my hands dirty: OSIP dropped a subtle hint by giving me shell access to their famous Skylake machine. First of all, I built a debug version of OCaml 4.03 (to which I planned to add more debugging instrumentation later) and rebuilt OSIP's application with it. Unfortunately, this debug version did not crash. So I went back to OSIP's original executable, first interactively by hand under GDB (which drove me crazy, because I sometimes had to wait an hour for a crash), and then with a small OCaml script that ran the program 1000 times and saved a core dump on every crash.
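The kind of harness described above can be sketched in a few lines. This is not the author's OCaml script; it is a minimal C sketch with hypothetical function names (`run_once`, `count_crashes`), assuming a POSIX system. It runs a command repeatedly and counts the runs that were killed by a signal such as SIGSEGV. Whether a core dump is actually written depends on the system's settings (for example `ulimit -c`), which the sketch does not touch.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the command once. Returns the number of the signal that killed it,
   0 if it terminated without being killed by a signal, -1 on error. */
int run_once(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Run the command `runs` times and count the runs killed by a signal. */
int count_crashes(char *const argv[], int runs)
{
    int crashes = 0;
    for (int i = 0; i < runs; i++) {
        int sig = run_once(argv);
        if (sig > 0) {
            crashes++;
            fprintf(stderr, "run %d: killed by signal %d\n", i, sig);
        }
    }
    return crashes;
}
```

Calling `count_crashes(argv, 1000)` with the command line of the program under test gives a crashes-per-1000-runs figure of the kind reported later in the article.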

Debugging the OCaml runtime system is no fun, and post-mortem debugging from core dumps is downright awful. Analysis of 30 core dumps showed segfaults in seven different places: two in the OCaml garbage collector (GC) and five in the application. The most popular place, with 50% of the crashes, was the function mark_slice in the OCaml garbage collector. In all cases the OCaml heap was corrupted: a well-formed data structure contained a bad pointer, that is, a pointer that pointed not to the first field of a Caml block but to its header, to the middle of the block, or even to an invalid (already freed) memory address. All 15 crashes in mark_slice were caused by a pointer that pointed two words ahead inside a block of size 4.

All these symptoms were consistent with familiar mistakes, such as the compiler forgetting to register a memory object with the garbage collector. However, such mistakes would cause reproducible crashes that depend only on memory allocations and garbage collector actions. I had no idea what kind of OCaml memory management error could cause random crashes!

For lack of a better idea, I listened once more to the inner voice whispering “hardware bug!”. I had a vague impression that the crashes were more frequent when the machine was under heavier load, as if it were simply overheating. To test this theory, I changed my OCaml script to run N copies of OSIP's program in parallel. For some runs I also turned off the OCaml memory compactor, which leads to higher memory consumption and more garbage collector activity. The results were not what I expected, but striking nonetheless:

N     System load     Default settings    Compaction off
1     3 + epsilon     0 crashes           0 crashes
2     4 + epsilon     1 crash             3 crashes
4     6 + epsilon     12 crashes          19 crashes
8     10 + epsilon    17 crashes          23 crashes
16    18 + epsilon    16 crashes



The numbers are crashes per 1000 runs of the test program. See the jump between $N = 2$ and $N = 4$? And the plateau for higher values of $N$? To explain these numbers, I need to say more about the Skylake test machine. It has 4 physical cores and 8 logical cores, since HT is enabled. Two cores were kept busy in the background by two long-running tests (not mine), but otherwise the machine was idle. Hence the system load was $2 + N + \epsilon$, where $N$ is the number of tests running in parallel.

When no more than four processes are active at the same time, the OS scheduler spreads them evenly over the four physical cores and tries hard not to place two processes on the two logical cores of the same physical core, because that would leave other physical cores underused. This is what happens for $N = 1$ and, most of the time, for $N = 2$. When the number of active processes exceeds 4, the OS starts to use HT and places processes on both logical cores of the same physical core. This is the case for $N = 4$. Only when all 8 logical cores of the machine are busy does the OS fall back on traditional time sharing between processes. In our experiment, these are the cases $N = 8$ and $N = 16$.
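The arithmetic behind these three regimes can be written down as a small function. This is a simplified model with hypothetical names (`classify` and the `regime` values), not a description of how any real scheduler is implemented: it compares the load (background processes plus N test processes, ignoring epsilon) with the number of physical and logical cores.

```c
/* Simplified model of the three scheduling regimes described above. */
enum regime {
    ONE_PER_CORE,   /* every process can have a physical core to itself */
    HT_SHARING,     /* some physical cores run two processes on sibling logical cores */
    TIME_SHARING    /* more processes than logical cores: classic time slicing */
};

enum regime classify(int background, int n, int physical, int logical)
{
    int load = background + n;      /* the "2 + N" of the text, without epsilon */
    if (load <= physical)
        return ONE_PER_CORE;
    if (load <= logical)
        return HT_SHARING;
    return TIME_SHARING;
}
```

With 2 background processes, 4 physical cores and 8 logical cores, this model puts N = 1 and N = 2 in the first regime, N = 4 in the second, and N = 8 and N = 16 in the third. In the third regime sibling logical cores are busy as well, which fits the plateau in the crash counts.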

It was now clear that the crashes start only when Hyper-Threading kicks in, more precisely when the OCaml program runs alongside another thread on the other logical core of the same physical core.

I sent the results of the experiments to OSIP, begging them to accept my theory that Hyper-Threading was to blame. This time they listened and turned off HT on their machine. After that, the crashes disappeared completely: two days of continuous testing did not reveal a single problem.

Problem solved? Yes! Happy ending? Not really. Neither OSIP nor I tried to report the problem to Intel or anyone else: OSIP was content with compiling OCaml with Clang and did not want unpleasant publicity along the lines of “OSIP's products crash randomly!”. As for me, I was thoroughly tired of this problem, I did not know how to report such things (unlike most of us, Intel has no public bug tracker), and I suspected that the bug was specific to OSIP's machines (for example, a batch of defective chips that ended up in the wrong bin at the factory).

Second manifestation


The rest of 2016 was quiet: nobody else reported that the sky (or rather Skylake, pun intended) was falling because of OCaml 4.03, so I happily forgot about this little episode with OSIP (and went on making terrible puns).

Then, on January 6, 2017, Enguerrand Decorne and Joris Giovannangeli from Ahrefs (another Very Serious Industrial OCaml User, and a member of the Caml Consortium to boot) reported mysterious random crashes with OCaml 4.03.0: this is PR#7452 in the Caml bug tracker.

In their repro case, the ocamlopt.opt compiler itself sometimes crashed or produced meaningless output while compiling a large source file. This is not too surprising, since ocamlopt.opt is itself an OCaml program compiled by ocamlopt.byte, but it made the problem easier to discuss and reproduce.

The public comments on PR#7452 show quite well what happened next, and the Ahrefs people described their bug hunt in detail in this article. So I will only highlight the key points of the story.

  • 12 hours after the ticket was opened, with 19 comments already in the discussion, Enguerrand Decorne reported that "all the machines that were able to reproduce the bug are running Intel Skylake family processors".
  • The next day, I mentioned the random crashes at OSIP and suggested disabling Hyper-Threading.
  • A day later, Joris Giovannangeli confirmed that the bug does not reproduce when Hyper-Threading is disabled.
  • Meanwhile, Joris discovered that the crash occurs only if the OCaml runtime system is built with gcc -O2, not with gcc -O1. In retrospect, this explains the absence of crashes with the debug version of the OCaml runtime and with OCaml 4.02, since both are built with gcc -O1 by default.
  • I enter the stage and post the following comment:
    Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that triggers a hardware bug in (some steppings of) Skylake processors with Hyper-Threading? Perhaps it is crazy. On the other hand, there is already one documented hardware issue with Hyper-Threading and Skylake (link)
  • Mark Shinwell contacted colleagues at Intel and managed to push a report through Intel's customer support.

Then nothing happened for 5 months, until ...

The reveal


On May 26, 2017, the user "ygrek" posted a link to the following changelog entry from the Debian microcode package:

  * New upstream microcode datafile 20170511 [...]
  * Likely fix nightmare-level Skylake erratum SKL150. Fortunately,
    either this erratum is very-low-hitting, or gcc/clang/icc/msvc
    won't usually issue the affected opcode pattern and it ends up
    being rare.
    SKL150 - Short loops using both the AH/BH/CH/DH registers and
    the corresponding wide register *may* result in unpredictable
    system behavior. Requires both logical processors of the same
    core (i.e. sibling hyperthreads) to be active to trigger, as
    well as a "complex set of micro-architectural conditions"

Erratum SKL150 was documented by Intel in April 2017 and is described on page 65 of the 6th Generation Intel Processor Family Specification Update. Similar errata are listed as SKW144, SKX150, SKZ7 for the Skylake variants and as KBL095, KBW095 for the newer Kaby Lake architecture. The words “nightmare-level” do not appear in Intel's documentation, but they describe the situation fairly accurately.

Despite the rather vague description (“a complex set of micro-architectural conditions”, you don't say!), this erratum hits the nail on the head: Hyper-Threading enabled? Check! Strikes pseudo-randomly? Check! Has nothing to do with floating-point or vector instructions? Check! On top of that, a microcode update that fixes the erratum was already available, nicely packaged by Debian and ready to be installed on our test machines. A few hours later, Joris Giovannangeli confirmed that the crashes disappeared after the microcode update. I ran more tests on my brand-new workstation with a Skylake processor (thanks to Inria's procurement department) and came to the same conclusion: a test that crashed in under 10 minutes with the old microcode ran for 2.5 days without any problem with the new microcode.

There is another reason to believe that SKL150 is the culprit behind our problems: the problematic code pattern described in the erratum is exactly what GCC generates when compiling the OCaml runtime system. For example, in the file byterun/major_gc.c, the function sweep_slice contains the following C code:

hd = Hd_hp (hp);
/*...*/
Hd_hp (hp) = Whitehd_hd (hd);

After macro expansion, it looks like this:

hd = *hp;
/*...*/
*hp = hd & ~0x300;

Clang compiles this code in a trivial way, using only full-width registers:

movq    (%rbx), %rax
[...]
andq    $-769, %rax             # imm = 0xFFFFFFFFFFFFFCFF
movq    %rax, (%rbx)

GCC, however, prefers to use the 8-bit register %ah to operate on bits 8 to 15 of the full register %rax, leaving the other bits unchanged:

movq    (%rdi), %rax
[...]
andb    $252, %ah
movq    %rax, (%rdi)

The two code sequences are functionally equivalent. One possible reason for GCC's choice is that its code is more compact: the 8-bit constant $252 fits in one byte of code, whereas the 32-bit constant $-769 (sign-extended to 64 bits) takes 4 bytes. In any case, the code generated by GCC uses both %rax and %ah and, depending on the optimization level and bad luck, such code may end up in a loop small enough to trigger the SKL150 bug.
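The equivalence of the two sequences can be checked in plain C. The sketch below uses hypothetical function names: `whitehd_wide` mirrors the full-width AND with ~0x300 (the constant -769), and `whitehd_high_byte` mimics what `andb $252, %ah` does by masking only bits 8 to 15. It illustrates the bit arithmetic only; it does not reproduce the erratum, which depends on the actual machine instructions.

```c
#include <stdint.h>

/* Full-width form: one 64-bit AND with ~0x300, i.e. with the constant -769. */
uint64_t whitehd_wide(uint64_t hd)
{
    return hd & ~(uint64_t)0x300;
}

/* Byte form: AND bits 8..15 (the %ah part of %rax) with 252 = 0xFC,
   leaving every other bit of the word unchanged. */
uint64_t whitehd_high_byte(uint64_t hd)
{
    uint8_t ah = (uint8_t)(hd >> 8);    /* extract bits 8..15 */
    ah &= 252;                          /* clear the two low bits of that byte */
    return (hd & ~(uint64_t)0xFF00) | ((uint64_t)ah << 8);
}

/* Returns 1 if both forms agree on a series of pseudo-random inputs. */
int masks_agree(void)
{
    uint64_t x = 0x0123456789ABCDEFULL;
    for (int i = 0; i < 100000; i++) {
        /* simple linear congruential step, used only to vary the inputs */
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        if (whitehd_wide(x) != whitehd_high_byte(x))
            return 0;
    }
    return 1;
}
```

Both functions clear bits 8 and 9 of the word and nothing else, which is why a compiler is free to pick either encoding.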

So, in the end, it was a hardware bug after all. Told ya!

Epilogue


Intel has released microcode updates for the Skylake and Kaby Lake processors that fix or work around the problem. Debian has published detailed instructions on how to check whether your processor is affected and how to obtain and apply the microcode updates.

The disclosure of the bug and the release of the microcode turned out to be very timely, because mysterious crashes had started to show up in several OCaml projects, for example Lwt, Coq, and Coccinelle.

A number of technical sites wrote about the hardware bug, for example Ars Technica, HotHardware, Tom's Hardware, and Hacker News [and GeekTimes (translator's note)].
