Pessimism about multithreading

Original author: Eric Raymond
Massive concurrency and hardware parallelism are hot topics of the 21st century. There are several pleasant reasons for this and one rather sad one.

Two pleasant reasons: the spectacular success of GPUs in games, combined with their unexpected secondary use in training deep-learning AI, since both exploit massive parallelism implemented in hardware. The sad reason is that single-processor speeds have been running up against the laws of physics since about 2006. Current problems with leakage and thermal runaway sharply limit increases in clock frequency, and the classic strategy of lowering voltage now stumbles over serious problems with quantum noise.

Competing for consumers' attention, processor manufacturers are cramming more and more cores into each chip and touting theoretical total throughput. Rapidly growing effort is also going into pipelining and speculative execution, techniques that use parallelism under the hood so that the single processor visible to the programmer can retire instructions faster.

The inconvenient truth is that many of our less glamorous computational problems simply cannot make good use of visible multithreading. There are various reasons for this, with different consequences for the programmer, and there is a lot of confusion about them. In this article I want to clarify the situation.

First, you need to understand clearly where hardware parallelism works best and why. Look at the computations used for graphics, neural networks, signal processing and Bitcoin mining. There is a pattern: parallelizable algorithms work best on hardware that (a) was specifically designed to run them and (b) can do nothing else!

We also see that the input data for the most successful parallel algorithms (sorting, string matching, the fast Fourier transform, matrix operations, inverse quantization of images, and so on) look rather similar. As a rule, the data has a metric structure that implies a distinction between "near" and "far" elements, which allows it to be cut into pieces, because the coupling between distant elements is negligible.
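
To make the "cut it into pieces" idea concrete, here is a minimal Go sketch that sums a large slice by splitting it into independent chunks, one per core. The chunk layout and the toy sum operation are illustrative assumptions, not something from the original article.

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // parallelSum splits data into one chunk per worker; each chunk is
    // independent of the others, so the goroutines never need to coordinate
    // until the final combine step.
    func parallelSum(data []int) int {
        workers := runtime.NumCPU()
        if workers > len(data) {
            workers = 1
        }
        partial := make([]int, workers) // one slot per worker: no shared writes
        chunk := (len(data) + workers - 1) / workers

        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            lo, hi := w*chunk, (w+1)*chunk
            if hi > len(data) {
                hi = len(data)
            }
            if lo >= hi {
                continue
            }
            wg.Add(1)
            go func(w, lo, hi int) {
                defer wg.Done()
                for _, v := range data[lo:hi] {
                    partial[w] += v // each goroutine touches only its own slot
                }
            }(w, lo, hi)
        }
        wg.Wait()

        total := 0
        for _, p := range partial {
            total += p
        }
        return total
    }

    func main() {
        data := make([]int, 1_000_000)
        for i := range data {
            data[i] = i % 7
        }
        fmt.Println(parallelSum(data))
    }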

In terms of the previous article on semantic locality, we can say that parallel methods apply mainly where the data has good locality. And they run best on hardware that supports only "near" connections, like the systolic array at the heart of a GPU.

On the other hand, it is very difficult to write software that performs such a partitioning effectively for input data with poor locality on general-purpose (von Neumann architecture) computers.

As a result, we can formulate a simple heuristic: The chances of using parallel computing are inversely proportional to the degree of irreducible semantic nonlocality in the input data.

Another limitation of parallel computing is that some important algorithms cannot be parallelized at all, even in theory. When I first discussed this topic on my blog, I coined the term "SICK algorithm", where SICK stands for "Serial, Intrinsically - Cope, Kiddo!". Significant examples include Dijkstra's shortest-path algorithm; cycle detection in directed graphs (used in 3-SAT solvers); depth-first search; computing the nth member of a cryptographic hash chain; network-flow optimization... and this is not a complete list.

Here, too, poor locality of the input data plays a role, especially in graph and tree contexts. Cryptographic hash chains cannot be parallelized because the entries must be computed in strict order; that ordering rule is essential for protecting the chain against forgery.
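
As a sketch of why a hash chain is inherently serial, here is the nth-entry computation in Go using the standard crypto/sha256 package: every iteration needs the previous result, so there is nothing to hand off to a second core. The seed value and chain length are just placeholders.

    package main

    import (
        "crypto/sha256"
        "fmt"
    )

    // hashChain returns the nth element of the chain h(h(...h(seed)...)).
    // Step i cannot start until step i-1 has finished: a SICK algorithm.
    func hashChain(seed []byte, n int) []byte {
        cur := seed
        for i := 0; i < n; i++ {
            sum := sha256.Sum256(cur) // depends on the previous iteration's output
            cur = sum[:]
        }
        return cur
    }

    func main() {
        fmt.Printf("%x\n", hashChain([]byte("genesis"), 1000))
    }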

And such algorithms block: you cannot parallelize anything while a SICK algorithm is running.

We are not done. There are at least two more classes of obstacles, and very common ones.

First, the tools we need are missing. Most languages support nothing beyond mutexes and semaphores. That is convenient because the primitives are easy to implement, but it causes terrible explosions of complexity in your head: it is almost impossible to reason about more than four interacting locks.
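
Here is a hedged Go sketch of the problem with interacting locks: two goroutines take the same two mutexes in opposite orders, which can deadlock depending on scheduling. With just two locks the bug is visible at a glance; the point above is that with four or more the combinations stop fitting in anyone's head.

    package main

    import "sync"

    func main() {
        var a, b sync.Mutex
        var wg sync.WaitGroup
        wg.Add(2)

        go func() { // goroutine 1: locks a, then b
            defer wg.Done()
            a.Lock()
            defer a.Unlock()
            b.Lock() // waits forever if goroutine 2 already holds b
            defer b.Unlock()
        }()

        go func() { // goroutine 2: locks b, then a
            defer wg.Done()
            b.Lock()
            defer b.Unlock()
            a.Lock() // waits forever if goroutine 1 already holds a
            defer a.Unlock()
        }()

        // May finish, or may deadlock, depending on the interleaving; when it
        // deadlocks, the Go runtime reports that all goroutines are asleep.
        wg.Wait()
    }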

If you are lucky, you get a more tractable set of primitives, such as Go channels (a.k.a. Communicating Sequential Processes) or Rust's ownership/Send/Sync system. But the truth is, we do not know what the "right" set of primitives for implementing parallelism on a von Neumann architecture is. Perhaps there is no single correct set. Perhaps two or three different sets suit different problem domains but are as incommensurable as one and the square root of two. As of 2018, nobody really knows.
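
For contrast, a small sketch of the channel style mentioned above: in the CSP spirit of Go channels, a producer and a worker hand values to each other instead of sharing memory under a lock. The three-stage pipeline and the squaring step are, of course, just illustrative.

    package main

    import "fmt"

    func main() {
        nums := make(chan int)
        squares := make(chan int)

        go func() { // producer: sends values, then closes its channel
            for i := 1; i <= 5; i++ {
                nums <- i
            }
            close(nums)
        }()

        go func() { // worker: reads, transforms, forwards
            for n := range nums {
                squares <- n * n
            }
            close(squares)
        }()

        for s := range squares { // consumer runs in main
            fmt.Println(s)
        }
    }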

Last but not least, the human brain. Even with a clean algorithm, good data locality and efficient tools, parallel programming is simply hard for people, even when the algorithm itself is quite simple to apply. Our brains are not very good at modeling even the simpler state spaces of purely sequential programs, and parallel ones are far worse.

We know this because there is plenty of real-world evidence that debugging parallel code is more than merely difficult. It is plagued by race conditions, deadlocks, livelocks, and insidious data corruption caused by subtly unsafe instruction orderings.
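
A textbook example of the first of these hazards, as a Go sketch: the loop below has an unsynchronized counter increment, many interleavings still print 1000, and nothing in the code hints at the bug. Go's race detector (go run -race) will flag it.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        counter := 0
        var wg sync.WaitGroup
        for i := 0; i < 1000; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                counter++ // unsynchronized read-modify-write: a data race
            }()
        }
        wg.Wait()
        fmt.Println(counter) // often 1000, sometimes less, depending on scheduling
    }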

I think that understanding these limitations is becoming more important now that Dennard scaling has broken down. Because of all these bottlenecks in programming, multi-core systems will always run software that cannot load the hardware at 100% of its computing power. Put another way, we have excess hardware for our current tasks. How much money and effort are we wasting?

Processor manufacturers want you to overestimate the functional benefit of smart new chips with ever more cores. How else can they cover the enormous cost of production and still remain profitable? Marketing does its best to make sure you never ask which tasks such multithreading actually benefits.

To be fair, such tasks exist. Servers in data centers that process hundreds of thousands of concurrent transactions per second are likely to spread the load across cores fairly well. Smartphones and embedded systems qualify too; in both cases, considerable effort goes into minimizing cost and energy consumption, which makes it hard to bring excess capacity into play.
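
As a rough sketch of why the server case works so well, Go's standard net/http package runs each request handler in its own goroutine, and the runtime spreads those goroutines across all available cores. The port and the toy handler below are illustrative assumptions.

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        // Each request is independent of the others: good locality across
        // requests, which is exactly what makes the load easy to spread
        // over many cores.
        fmt.Fprintf(w, "handled %s\n", r.URL.Path)
    }

    func main() {
        http.HandleFunc("/", handler)
        // ListenAndServe blocks; every incoming connection is served in its
        // own goroutine with no extra effort from the programmer.
        log.Fatal(http.ListenAndServe(":8080", nil))
    }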

But for ordinary users of desktops and laptops? I am plagued by vague doubts. The situation is hard to assess here, because real performance gains tend to come from other factors, such as the move from HDDs to SSDs. Such gains are easy to mistake for CPU speedups unless you do careful profiling.

Here are the reasons for such suspicions:

  1. Serious parallel computing on desktops and laptops happens only on the GPU.
  2. More than two cores in a processor are usually useless. Operating systems can distribute application threads, but typical software cannot exploit parallelism, and most users rarely manage to run enough different CPU-intensive applications at the same time to fully load their hardware.
  3. Consequently, most quad-core systems spend most of their time doing nothing but generating heat.

Many of my readers will surely be able to comment sensibly on this hypothesis. It will be interesting to see what they say.

UPDATE. A commenter on G+ pointed out one interesting benefit of multi-core processors: they compile code very quickly. Source code in languages like C has good locality: well-separated compilation units (source files) are compiled into object files, which the linker then joins together.
