How a Spectre bug that could break an industry was kept secret for seven months

When Graz University of Technology researcher Michael Schwarz first contacted Intel, he thought he was about to ruin the company's day. Together with his colleagues Daniel Gruss, Moritz Lipp and Stefan Mangard, he had found a problem in Intel's chips. The vulnerability was both deep and easy to exploit. His team finished writing an exploit on December 3, a Sunday afternoon. Recognizing how serious their find was, they emailed Intel immediately.
It was nine days before Schwarz heard back. But when the company finally called, Schwarz got a surprise: Intel already knew about the CPU problems and was desperately trying to figure out how to fix them. Moreover, the company was doing everything it could to make sure no one else found out. They thanked Schwarz for his contribution, but told him that what he had found was top secret, and gave him a precise date after which the secret could be revealed.
The flaw that Schwarz (and, as he later learned, many others) had discovered was potentially disastrous: a design-level vulnerability that could slow down every processor in the world, with no perfect fix short of redesigning the chip. It affected almost every major technology company in the world, from Amazon's server farms to chipmakers like Intel and ARM. But Schwarz had also run into a second problem: how do you keep a flaw this serious secret long enough for it to be fixed?
Disclosure is an old problem in the security world. When a researcher finds a bug, the custom is to give vendors a head start of a few months to fix the problem before it goes public and attackers get a chance to exploit it. But the more companies and products a bug affects, the more complicated this dance becomes. The more software has to be quietly developed and deployed, the more people have to be told about the problem and asked to keep it secret. In the case of Meltdown and Spectre, that coordination broke down, and the secret spilled out before anyone was ready.
The premature disclosure had consequences. After the release, basic questions of fact became confused: are AMD chips vulnerable to Spectre attacks (they are), and is Meltdown specific to Intel (ARM chips are also affected)? Antivirus systems were caught off guard and inadvertently blocked many critical patches from being installed. Other patches had to be pulled after they crashed machines. One of the best tools for dealing with the vulnerability, Retpoline, was developed by Google's incident response team, which originally planned to release it alongside the news of the bug. But although the Retpoline team says it was not caught off guard, the code for the tool was not made public until the day after the vulnerability was first announced, in part because of the haphazard break in the embargo.
Most worrying of all, some critical vulnerability response groups were left entirely out of the loop. The most influential warning about the vulnerability came from Carnegie Mellon's CERT division, which works with the Department of Homeland Security on vulnerability disclosures. But according to senior vulnerability analyst Will Dormann, CERT did not know about the problem until the Meltdown and Spectre websites went live, which added to the chaos. The initial report named replacing the CPU as the only solution. For a flaw in the processor design, the advice was technically correct, but it only stoked panic among IT managers, who pictured prying out and replacing the CPU in every device in their care. A few days later, Dormann and his colleagues decided the advice was not actionable and changed the recommendation to simply installing patches.
“I would have liked to have known,” Dormann says. “If we had learned about this earlier, we could have produced a more accurate document, and people would have been much better informed right from the start, as opposed to now, when we have been testing patches and updating the document for the past week.”
But perhaps these problems were unavoidable? Even Dormann is not sure. “This is the largest multi-party vulnerability we have ever dealt with,” he told me. “With a vulnerability of this magnitude, there's no way it comes out cleanly and leaves everyone happy.”
The first step in the Meltdown and Spectre disclosure came six months before Schwarz's discovery, in a June 1 email from Jann Horn of Google Project Zero. Sent to Intel, AMD and ARM, the message laid out the flaw that would become known as Spectre, with a demonstrated exploit against Intel and AMD processors and troubling implications for ARM. Horn approached the matter cautiously, giving the vendors only the minimum information they needed. He had contacted the three chipmakers deliberately, calling on each company to work out its own exposure and to notify any other companies that might be affected. At the same time, Horn warned them not to spread the word too far or too fast.
“Please note that so far, we have not notified other parts of Google,” Horn wrote. “When you notify other parties about this issue, please be careful to avoid leaking information unnecessarily.”
Establishing exactly who was vulnerable turned out to be difficult. It started with the chipmakers, but it soon became clear that operating systems would need to be patched as well, which meant bringing in another circle of researchers. Browsers would be affected too, along with the massive cloud platforms run by Google, Microsoft and Amazon, arguably the most tempting targets for the new bug. In the end, dozens of companies around the world would have to release a patch of some kind.
Project Zero's official policy is to allow 90 days before publishing its findings, but as more companies joined the fold, Project Zero gave ground and ended up more than doubling that window. As the months passed, companies began releasing their own patches, doing their best to disguise what they were fixing. Google's incident response team was notified in July, a month after the first warning from Project Zero. Microsoft's Insider program shipped a quiet early patch in November. (During the same period, Intel CEO Brian Krzanich made more controversial moves, arranging in October an automated stock sale to be executed on November 29.) On December 14, Amazon Web Services customers received a warning that a wave of reboots on January 5 could affect performance. Another Microsoft patch was compiled and released on New Year's Eve, which suggests the company's team worked on it through the night. In each case, the reasons for the change were left vague, and users had little idea what was being fixed.
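The arithmetic behind that extended window can be checked with a short sketch. It assumes the June 1 email was sent in 2017 and that the January dates given later in this account (the planned release on January 9 and the actual one on January 3) fall in 2018; the text itself states no years, and all names in the code are purely illustrative.

```python
from datetime import date

# Assumption: the article gives no years. The first report is taken as
# June 1, 2017 and the January release dates as 2018.
FIRST_REPORT = date(2017, 6, 1)
ACTUAL_RELEASE = date(2018, 1, 3)
PLANNED_RELEASE = date(2018, 1, 9)
STANDARD_WINDOW_DAYS = 90  # Project Zero's stated policy in the article


def window_days(reported: date, released: date) -> int:
    """Number of days between the first report and public release."""
    return (released - reported).days


actual_days = window_days(FIRST_REPORT, ACTUAL_RELEASE)
planned_days = window_days(FIRST_REPORT, PLANNED_RELEASE)

print(f"Actual window:  {actual_days} days "
      f"({actual_days / STANDARD_WINDOW_DAYS:.1f}x the standard window)")
print(f"Planned window: {planned_days} days "
      f"({planned_days / STANDARD_WINDOW_DAYS:.1f}x the standard window)")
```

Under those assumed years, the actual window comes to 216 days and the planned one to 222, both well over twice the standard 90 days and a little over seven months.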
Still, you cannot rewrite the basic infrastructure of the internet without someone noticing. The strongest hints came from the Linux world. Linux runs most of the cloud servers on the internet, so it has to play a large part in any fix for Spectre and Meltdown. But because it is an open-source system, any changes have to be made in public. Every update was posted to a public Git repository, and all official discussion took place on a public mailing list. When kernel patches began to appear one after another for a mysterious “page table isolation” feature, close observers realized that something was up.
The biggest hint came on December 8, when Linus Torvalds merged a new patch that changed how the Linux kernel works with x86 processors. “This, besides helping fix KASLR leaks (the pending Page Table Isolation (PTI) work), also robustifies the x86 entry code,” Torvalds explained. The latest kernel release had come out just one day earlier. Normally a patch would wait to be included in the next release, but for some reason this one was too important. Why would the famously cranky Torvalds so casually accept an out-of-band update, especially one that seemed likely to slow down the kernel?
It looked even stranger when a month-old email surfaced proposing that older kernels be patched with the new fix retroactively. Taking stock of the rumors on December 20, Linux veteran Jonathan Corbet wrote that the page table work “has all the signs of a security patch released under deadline pressure.”
And yet they still did not know everything. Page table isolation is a way of separating kernel space from user space, so the problem was clearly some kind of leak from the kernel. But it remained unclear what exactly was going wrong in the kernel, or how far the effects of the bug might reach.
The next break came from the chipmakers themselves. In the new patch, Linux marked all x86 processors as vulnerable, including AMD's. Since the patch tended to slow processors down, AMD was not happy to be included. The day after Christmas, AMD engineer Tom Lendacky sent an email to the Linux kernel mailing list explaining why AMD chips did not need the patch.
“The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault,” Lendacky wrote.
That is a lot of technical terms, but to anyone trying to work out the nature of the bug, it sounded like a fire alarm. Here was an AMD engineer who clearly knew about the vulnerability firsthand, saying that the kernel problem stemmed from something processors had been doing for almost 20 years. If speculative references were the problem, then it was everyone's problem, and fixing it would take far more than a simple kernel patch.
“That was the trigger,” says Chris Williams, an editor at The Register. “Until that moment, no one had mentioned speculative memory references. Only when that email appeared did we realize that something was seriously wrong.”
Once it was clear that the problem lay in speculative memory references, published research could fill in the rest of the picture. For years, security researchers had been looking for ways to crack the kernel through speculative execution; Schwarz's team at Graz had published a paper on the subject in June. Anders Fogh published his own attempt at a similar attack in July, although it was unsuccessful. Just two days after the AMD email, a researcher who goes by the handle brainsmoke presented related work at the Chaos Communication Congress in Leipzig. None of this work produced an exploitable bug, but it made clear what such a bug would look like, and it looked extremely bad.
Fogh said it was clear from the very beginning that any working bug would be a disaster. “When you start looking into something like this, you already know that it will be really bad if you succeed,” he told me. After the release of Meltdown and Spectre and the chaos that followed, Fogh decided not to publish any further research on the topic.
Over the following week, rumors of the bug began to leak out through Twitter, mailing lists and forums. A casual benchmark circulating on the PostgreSQL mailing list found a 17% slowdown, a terrifying number for anyone waiting for the patch. Other researchers wrote informal posts describing everything they knew, stressing that it was all just rumor. “This post so far is mainly speculation until the embargo lifts,” one of the authors wrote. “One can expect fireworks and much drama when that day arrives.”
By the New Year, the rumors had become impossible to ignore. Williams decided it was time to write something. On January 2, The Register published an article on what it called an “Intel processor design flaw.” It described what had been happening on the Linux mailing list, the ominous email from AMD, and the early research. “It appears, from what AMD software engineer Tom Lendacky was suggesting, that Intel's CPUs speculatively execute code, potentially without performing security checks,” the article said. “That would allow ring-3-level user code to read ring-0-level kernel data. And that is not good.”
The decision to publish the article proved controversial. The whole industry had been counting on an embargo to hold back the information and give companies time to issue patches. Breaking the news early cut into that time and gave criminals a better chance of exploiting the vulnerabilities before patches were in place. But Williams maintains that by the time the article ran, the secret was already out. “I felt we had to warn people that when these patches come out, they definitely need to install them,” says Williams. “If you are smart enough to exploit this bug, you could probably have worked it out without us.”
In any case, the embargo would hold for only one more day. The official release had been scheduled for January 9, timed to coincide with Microsoft's Patch Tuesday and landing right in the middle of the Consumer Electronics Show, which might have muffled the bad news. But the combination of wild rumors and available research made the news impossible to contain. Reporters flooded researchers with emails, and everyone involved strained to stay silent as the chances of the secret lasting another week kept shrinking.
The tipping point came from brainsmoke. He was one of the few kernel researchers not bound by the embargo, so he took the rumors as a roadmap and set out to find the bug. The morning after the Register article, he found it, and tweeted a screenshot of his terminal as proof. “No pagefault required,” he wrote in a follow-up tweet. “Massaging everything in/out of the right cache seems to be the crux.”
Once researchers saw that tweet, the game was up. The Graz team had been determined not to show its hand before Google or Intel did, but once public proof that the bug could be exploited was circulating, word came from Google that the embargo would be lifted that same day, January 3, at 2 p.m. Pacific time. At the appointed hour, the full research went live on two specially built websites, complete with logos prepared in advance for each vulnerability. Reports poured in from ZDNet, Wired and The New York Times, often describing research that had been pulled together just hours before. After more than seven months of planning, the secret was finally out.
It's hard to say how much the early release cost. Patches are still being developed, and benchmarks are still tallying the resulting performance losses. Would things have gone more smoothly with one more week to prepare? Or would that only have delayed the inevitable?
There are plenty of formal documents describing how the announcement of a vulnerability like this should unfold, from the International Organization for Standardization, the US Department of Commerce and CERT, among others, although they offer little clear advice for a problem this broad. Experts have wrestled with these questions for years, and the most experienced among them have given up on ever finding a perfect answer.
Katie Moussouris helped write Microsoft's playbook for events like this, along with the ISO standards and countless other guides. When I asked her to grade this week's response, she was kinder than I expected.
“This is probably the best that could have been done,” she said. “The ISO standards can tell you what to think about, but they can't tell you what to do at the height of a situation like this. It's like reading the instructions and running a couple of fire drills. It's good to have a plan, but when your house is on fire, you don't act the way the plan says.”
The more centralized and interconnected technology becomes, the harder it is to avoid this kind of fire alarm. As software like OpenSSL spreads, so does the risk of a massive bug like Heartbleed, the internet's version of a crop blight. This week showed the same effect in hardware. Speculative execution became an industry standard before we had time to make it safe. And since most web services run on the same chips and the same cloud platforms, the risk is multiplied many times over. When a vulnerability finally did surface, disclosing it properly became an almost impossible task.
That kind of confusion is hard to avoid whenever a core technology fails. “In the '90s, our motto was one vulnerability, one vendor, and that covered most of the vulnerabilities you saw. Now almost everything involves some element of coordination among multiple parties,” says Moussouris. “This is simply what multi-party disclosure looks like.”