C is not a low-level language.

Transfer

Your computer is not a fast version of the PDP-11.

Hi, Habr!

My name is Anton Dovgal, I'm a C (and not only) developer at Badoo.

I came across an article by David Chiznell, a researcher at Cambridge University, in which he challenged the generally accepted proposition that C is a low-level language, and his arguments seemed interesting enough to me.

In light of the newly discovered vulnerabilities Meltdown and Specterit is worth spending time figuring out the reasons for their appearance. Both of these vulnerabilities exploited speculative execution of instructions by processors and allowed an attacker to get results through third-party channels. Vulnerability-causing features of processors, along with some others, were added so that C programmers continue to believe that they have been programming in a low-level language, although this has not been the case for decades.

Processor manufacturers are not alone in this. C / C ++ compiler developers have also contributed.

What is a low level language?

An American scientist in the field of computer technology and the first Turing Award Priest Alan Perlis gave the following definition:

"A programming language is low-level if programs written on it require attention to the insignificant."

Although this definition refers to C, it does not give an understanding of what people want to see in a low-level language. Different properties make people consider the language low-level. Imagine a scale of programming languages with an assembler at one end and an interface to an Enterprise computer at the other. Low-level languages are “closer to the gland”, while high-level languages are closer to how people think.

To be “closer to the gland,” the language must provide abstractions that correspond to abstractions of the target platform. It is easy to prove that C was a low-level language in the PDP-11. The sequential execution of programs, the flat address space, even the pre- and post-increment operators perfectly fit the PDP-11 addressing modes.

PDP-11 fast emulators

The key reason for the appearance of Specter and Meltdown vulnerabilities is that the processor makers did not just make fast processors, but made fast processors with PDP-11 interface. This is important because it allows C programmers to continue to believe that their language is close to the hardware.

Code C mainly provides a sequential abstract automaton (up to C11 - fully sequential, if you exclude non-standard extensions). Creating a new thread is a call to the library function, an operation that is rather expensive. Therefore, processors, wanting to continue to execute C code, rely on instruction-level parallelism (ILP) command-level parallelism. They analyze neighboring operations and perform independent operations in parallel. This greatly complicates the processors and leads to an increase in energy consumption, but allows programmers to write mostly sequential code. In contrast, graphics processors (GPUs) achieve high performance in a different way: they require writing parallel programs.

High command-level concurrency is the direct cause of Specter and Meltdown. A modern Intel processor executes up to 180 instructions at a time (unlike the sequential abstract machine C, which expects the previous instruction to execute before the next one starts). A typical C code heuristic indicates that there is one branch for an average of every seven instructions. If you want to keep the instruction pipeline full, then you need to guess the next 25 branches. This, in turn, adds complexity - the processor incorrectly guessed the branch will first calculate, and then throw the calculation results, which negatively affects the power consumption. These discarded data have visible indirect results, which was used in Specter and Meltdown attacks.

Renaming registers consumes a lot of energy and chip space in modern processors. It cannot be turned off or its energy consumption reduced, which makes it inconvenient in the era of “ dark silicon ”, when transistors are low, but the transistors involved are a valuable resource. This device is absent in the GPU, where concurrency is achieved by using threads instead of attempting parallel execution of initially sequential code. If the instructions do not have dependencies that need to be rebuilt, then there is no need to rename the registers either.

Consider another fundamental part of the C design: flat memory. It has not existed for a couple of decades. A modern processor often has three levels of caching between registers and main memory, so as to reduce the time to access the latter.

The cache is hidden from the programmer and therefore inaccessible from C. Effective use of the cache is one way to speed up the execution of code on a modern processor, but it is completely hidden from an abstract machine and programmers have to rely on knowledge of the implementation details of the cache (for example, that two aligned 64-bit values can be in one cache line) for writing efficient code.

C Optimization

One of the common characteristics attributed to low-level languages is speed. In particular, they should be easily translated into fast code without a complicated compiler. The argument that a smart enough compiler can make a language fast is often ignored by supporters of C when they talk about other languages.

Unfortunately, with the help of simple translation it is impossible to get a fast code from C.
Processors architects make heroic efforts to create chips that can execute C code quickly. But the performance levels that programmers expect to see are achieved only with the help of incredibly complex optimizations performed by the compiler.
The Clang compiler (including the relevant parts of LLVM) is about 2 million lines of code. For the analysis and transformation of the code, which is necessary to accelerate C, we need about 200,000 lines of code (excluding comments and empty lines).

For example, to process large amounts of data in C, you need to write a loop that processes each element sequentially. For optimal execution of this loop on a modern processor, the compiler must determine that the iterations of the loop are independent of each other. The restrict keyword can help in this case - it ensures that writing to one pointer will not interfere with reading from another pointer. This information in C is much more limited than in a language like Fortran, which is the main reason why C failed to force it out of high-performance computing.

After the compiler has determined that the iterations are independent of each other, the next step is an attempt to vectorize the result, because the throughput of modern processors is four to eight times higher for vectorized code than for scalar code. A low-level language for such processors would have its own vector types of arbitrary length. In the intermediate representation of LLVM there are just such types, because it is always easier to break large operations with vectors into several small ones than to construct larger vector operations.

At this stage, optimizers have to contend with the rules of operation of memory C. C ensures that structures with the same prefix can be used interchangeably, and provides access to the offset of the fields of structures in the language. This means that the compiler cannot change the order of fields in the structure or add alignment to improve vectorization (for example, by transforming the structure from arrays into an array of structures or vice versa). This is usually not a problem in low-level languages, where it is possible to control the arrangement of fields in the structure, but it makes the task of accelerating C more difficult.

C also requires alignment at the end of the structure, since it guarantees the lack of alignment in arrays. Alignment is a rather complicated part of the C specification, which interacts poorly with other parts of the language. For example, you should be able to compare two structures using the type-less comparison method (that is, the memcmp () function), so the copy of the structure must also be aligned. In some cases, copying alignment takes considerable time.

Consider two basic optimizations that the C compiler produces: SROA (scalar replacement of aggregates, scalar replacement of aggregates) and loop opening .

SROA attempts to replace fixed-size structures and arrays with separate variables. This allows the compiler to handle access to them independently of each other and ignore the operation if it is obvious that its result is not used. In some cases, the indirect effect of this optimization is to remove alignment.

The second optimization, opening the loop, converts the loop with the condition into a condition with different cycles in both branches. This changes the order of execution as opposed to the statement that the programmer knows what will be performed in a low level language. And it also creates serious problems with how undefined variables and undefined behavior are handled in C.

In C, an uninitialized variable has an undefined value, which may be different with each call. This is important because it allows for lazy recycling of memory pages. For example, in FreeBSD, the malloc () implementation tells the system that the pages are no longer involved, and the system uses the first entry in the page as proof that this is not the case. Appeal to the newly allocated memory can get the old value, then the operating system can reuse the memory page, and then replace it with the page filled with zeros at the next entry to another page location. The second call to the same place on the page will receive a zero value.

If the condition uses an undefined value, the result is also undefined — anything can happen. Imagine a loop open optimization where the loop runs zero times. In the original, the entire cycle is a dead code. In the open version, there is now a condition with a variable that may not be initialized.
As a result, the dead code can be converted to undefined behavior. This is just one of many optimizations that, with a more thorough study of C semantics, turn out to be unreliable.

As a result, you can make the code in C work quickly, but only by spending thousands of person-years to create a smart enough compiler. But this is possible only if some rules of the language are violated. Compiler creators allow C programmers to imagine that they write code that is “close to hardware,” but they have to generate machine code that behaves differently so that programmers continue to believe in what they write in fast language.

Understanding c

One of the basic attributes of a low-level language is that programmers can easily understand how an abstract language machine is transferred to a physical machine. This was definitely the case on the PDP-11, where C expressions were translated into one or two instructions. Similarly, the compiler put variables into the slots of the stack and converted simple types into understandable ones for PDP-11.

Since then, C implementations have become much more difficult to maintain the illusion that C is easily transferred to the hardware platform and works fast. In 2015, the results of a survey among C programmers, authors of compilers, and members of the standardization committee showed that there are problems with understanding C. For example, this language allows implementations to add alignment to structures (but not to arrays) to ensure that all fields are properly aligned for the target platform. If you fill this structure with zeros and then specify the value of some fields, will the alignment bits have zeros? According to the survey, 36% were confident that they would, and 29% of respondents did not know the answer. Depending on the compiler and the level of optimization, this may be true (or not).

This is a rather trivial example, but many programmers either give the wrong answer, or cannot answer at all.

If you add pointers, the semantics of C becomes even more confusing. BCPL modelwas pretty simple: all meanings are words. Each word is either a data or an address in memory. Memory is a flat array of cells indexed by address.

Model C allows implementation for different platforms, including segmented architectures, where the pointer can consist of segment ID and offset, as well as virtual machines with a garbage collector. Specification C restricts the allowed operations with pointers to avoid problems with such systems. The response to the Defect Report 260 contains a reference to the origin of the pointer:

“Implementations can follow the origin of a set of bits and handle those containing an undefined value differently than those that contain a specific one. They can handle pointers differently depending on their origin, even if they are the same in terms of their bit value. "

Unfortunately, the word "origin" is not in the C11 specification, so the compilers themselves decide what it means. GCC and Clang, for example, differ in whether the pointer retains its origin and is converted back and forth. Compilers may decide that two pointers to malloc () results always give a negative result when comparing, even if they point to the same address.

These misunderstandings are not purely academic. For example, vulnerabilities have already been observed that resulted from overflowing a signed integer (undefined behavior in C) or dereferencing a pointer before checking for NULL , while the compiler was told that the pointer could not be NULL.

In the presence of such problems, it is difficult to expect from a programmer a complete understanding of how a C program translates to the appropriate architecture.

Representing a processor not for C

The proposed fixes for protection from Specter and Meltdown cause a serious deterioration in performance, undermining all the achievements of micro-architecture over the past decade. Perhaps it's time to stop thinking about how to make C code faster, and instead think about new programming models on processors that are designed for speed.

There are many examples of architectures that have not been focused on traditional C code and from which to draw inspiration. For example, multi-threaded processors like the Sun / Oracle UltraSPARC Tx do not require so much cache to keep their executive devices occupied. Research processors have expanded this concept to a very large number of hardware-planned threads. The key idea is that with a sufficient number of threads, the processor can suspend those threads that are waiting for data and fill the execution units with instructions from other threads. The problem is that C programs usually have very few threads.

ARM's SVE (Scalar Vector Extensions, scalar vector extensions) is another similar work from Berkeley, which offers a look at the improved interface between the program and the hardware. Conventional vectorization blocks implement operations with vectors of a fixed size and expect the compiler to adapt the algorithm to the specified size. In contrast, the SVE interface offers the programmer to independently describe the level of parallelism and expects the hardware to independently adapt it to the executive devices available. Using it in C is difficult, because the auto vectorizer must compute parallelism based on loops in the code.

Caches are large, but this is not the only reason for their complexity. Cache Coherence Support Protocol- one of the most difficult components of a modern processor. Most of the complexity arises from the need to maintain a language in which data can be simultaneously shared and editable. As an inverse example, an Erlang-style abstract machine, where each object is either local or immutable, can be cited. The cache coherence protocol for such a system would have only two cases: mutable data and shared data. The cache of the program stream that was transferred to another processor must be explicitly disabled, but this is a relatively rare operation.

Immutable objects can simplify caches even more, as well as make some operations cheaper. In the Max Lab project from Sun Labs, it was noted that the objects in the cache and the newly created objects are almost always the same. If objects die before they are excluded from the cache, then you can not write them to the main memory and thus save energy consumption. The Maxwell project proposed a garbage collector that worked in the cache and allowed to quickly recycle memory. With immutable heap objects and a mutable stack, the garbage collector becomes a very simple state machine, which is easily implemented in the hardware and makes it possible to effectively use a relatively small cache.

A processor that is designed exclusively for speed, and not for a compromise between speed and C support, probably should support a large number of threads, have large blocks of vectorization and a simpler memory model. It will be difficult to execute C code on such a processor, therefore, given the amount of old C code in the world, it is unlikely to have commercial success.

In the field of software development, there is a myth that parallel programming is difficult. Alan Kay would be very surprised to hear this: he taught the children to use the model of actors, with which they wrote programs for more than 200 streams. This is not known to Erlang programmers, who often write programs with thousands of parallel components. It is more correct to say that parallel programming is difficult in a language with an abstract machine like C. And if you pay attention to the prevalence of parallel hardware (from multi-core processors to multi-core GPUs), then this is just another way of saying that C is not suitable for modern hardware provide.

Tags: