How processors are designed and manufactured: the basics of computer architecture
We think of the central processor as the “brain” of a computer, but what does that really mean? What exactly happens inside the billions of transistors that make a computer work? In this new four-part mini-series, we look at how computer hardware is designed and explain the principles behind its operation.
In this series we will cover computer architecture, processor circuit design, VLSI (very-large-scale integration), chip manufacturing, and future trends in computing. If you have ever wanted to understand the details of how processors work, this series is a good place to start.
We will start with a very high-level explanation of what a processor does and how its building blocks connect into a functioning design. In particular, we will look at processor cores, the memory hierarchy, branch prediction, and more. First, we need a simple definition of what the CPU does. The simplest explanation: the processor follows a set of instructions to perform some operation on a set of input data. For example, it might read a value from memory, add it to another value, and finally store the result back to memory at a different address. It can also be something more complicated, such as dividing two numbers only if the result of a previous calculation is greater than zero.
Programs, such as an operating system or a game, are themselves sequences of instructions that the CPU must execute. In a simple processor, these instructions are loaded from memory and executed one after another until the program terminates. Software developers write programs in high-level languages such as C++ or Python, but the processor cannot understand them. It understands only ones and zeros, so the code has to be represented in that form somehow.
Programs are compiled into a set of low-level instructions called assembly language, which is defined by the instruction set architecture (ISA). The ISA is the set of instructions that the CPU is built to understand and execute. Some of the most common ISAs are x86, MIPS, ARM, RISC-V, and PowerPC. Just as the syntax for writing a function in C++ differs from the syntax for a function that does the same thing in Python, each ISA has its own syntax.
These ISAs can be divided into two main categories: fixed length and variable length. The RISC-V ISA uses fixed-length instructions, which means that a predetermined number of bits at a fixed position in each instruction determines its type. x86 is different: it uses variable-length instructions, which can be encoded in different ways with different numbers of bits for different parts. Because of this complexity, the instruction decoder in an x86 processor is usually the most complex part of the whole design.
Fixed-length instructions are simple to decode because their structure is constant, but they limit the total number of instructions an ISA can support. The popular versions of the RISC-V architecture have roughly 100 instructions and the specification is open, whereas the x86 architecture is proprietary and there is no official count of its instructions. It is generally believed that there are several thousand x86 instructions, but nobody publishes an exact number. Despite the differences between ISAs, they all provide essentially the same basic functionality.
An example of some RISC-V instructions. The opcode on the right is 7 bits long and determines the type of instruction. Each instruction also contains bits that specify the registers used and the function to perform. This is how assembly instructions are broken down into binary code that the processor can understand.
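To make the bit fields concrete, here is a minimal Python sketch that splits one 32-bit RISC-V instruction word into its parts. It assumes the standard R-type layout (register-to-register arithmetic); the function name `decode_r_type` and the example constant are illustrative and not part of the original article.

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit RISC-V R-type instruction word into its bit fields."""
    return {
        "opcode": word & 0x7F,          # bits 6..0, the 7-bit instruction type
        "rd":     (word >> 7) & 0x1F,   # bits 11..7, destination register
        "funct3": (word >> 12) & 0x07,  # bits 14..12, operation selector
        "rs1":    (word >> 15) & 0x1F,  # bits 19..15, first source register
        "rs2":    (word >> 20) & 0x1F,  # bits 24..20, second source register
        "funct7": (word >> 25) & 0x7F,  # bits 31..25, operation selector
    }


# Machine-code encoding of the assembly instruction `add x3, x1, x2`.
ADD_X3_X1_X2 = 0x002081B3

fields = decode_r_type(ADD_X3_X1_X2)
for name, value in fields.items():
    print(f"{name:7s} = {value:#x}")
```

Because every field sits at a fixed position, decoding is nothing more than a few shifts and masks, which is why fixed-length ISAs are comparatively easy to decode.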
Now we are ready to turn on the computer and start executing programs. Executing an instruction involves several basic steps, which are spread across the many stages of the processor.
The first stage fetches the instruction from memory into the processor so that execution can begin. In the second stage, the instruction is decoded so that the CPU can work out what type of instruction it is. There are many types, including arithmetic instructions, branch instructions, and memory instructions. Once the CPU knows what type of instruction it is executing, the operands for the instruction are fetched from memory or from the CPU's internal registers. If you want to add the number A to the number B, you cannot do so until you know the values of A and B. Most modern processors are 64-bit, meaning that each data value is 64 bits in size.
64 bits refers to the width of the processor's registers, data path, and/or memory addresses. For ordinary users, it indicates how much information the computer can process at a time, and it is easiest to understand in comparison with its smaller relative, the 32-bit processor. A 64-bit architecture can process twice as many bits of information at a time (64 bits versus 32).
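A short Python sketch can show what register width means in practice. Python integers are unbounded, so the helper functions below (hypothetical names, not from the article) emulate a fixed-width register by masking the result.

```python
def max_unsigned(bits: int) -> int:
    """Largest unsigned value that fits in a register of the given width."""
    return (1 << bits) - 1


def wrap_add(a: int, b: int, bits: int) -> int:
    """Add two values the way a fixed-width register would, discarding overflow."""
    return (a + b) & max_unsigned(bits)


print(max_unsigned(32))  # largest 32-bit value
print(max_unsigned(64))  # largest 64-bit value

# Adding 1 to the largest 32-bit value overflows a 32-bit register
# but fits comfortably in a 64-bit one.
print(wrap_add(max_unsigned(32), 1, 32))
print(wrap_add(max_unsigned(32), 1, 64))
```

Doubling the number of bits does far more than double the range: a 64-bit register can hold 2^32 times as many distinct values as a 32-bit one.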
Having received the operands for the instruction, the processor transfers them to the execution stage, where the operation is performed on the incoming data. This can be adding numbers, performing logical manipulations with numbers, or simply passing numbers without changing them. After calculating the result, memory access may be required to store it, or the processor may simply store the value in one of its internal registers. After saving the result, the CPU updates the state of the various elements and proceeds to the next instruction.
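The stages described so far (fetch, decode, operand read, execute, write back) can be sketched as a toy interpreter in Python. The three-field instruction format, the four registers, and the function name `run` are all assumptions made for illustration; no real ISA looks like this.

```python
def run(program, memory):
    """Execute a toy program one instruction at a time and return the final state."""
    registers = [0] * 4
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        instruction = program[pc]          # fetch
        op, dest, src = instruction        # decode
        if op == "load":                   # read operands and execute
            result = memory[src]
        elif op == "add":
            a, b = src
            result = registers[a] + registers[b]
        elif op == "store":
            result = registers[src]
        else:
            raise ValueError(f"unknown operation: {op}")
        if op == "store":                  # write the result back
            memory[dest] = result
        else:
            registers[dest] = result
        pc += 1                            # move on to the next instruction
    return registers, memory


program = [
    ("load", 0, 0),        # r0 = memory[0]
    ("load", 1, 1),        # r1 = memory[1]
    ("add", 2, (0, 1)),    # r2 = r0 + r1
    ("store", 2, 2),       # memory[2] = r2
]
registers, memory = run(program, [5, 7, 0])
print(registers, memory)
```

This mirrors the earlier example of reading a value, adding it to another, and saving the result at a different address, with each instruction finishing completely before the next one starts.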
This explanation is, of course, greatly simplified, and most modern processors split these few stages into 20 or more smaller stages to improve efficiency. This means that although the processor starts and finishes several instructions in each cycle, a single instruction may take 20 or more cycles to execute from start to finish. This model is usually called a pipeline, because it takes a while to fill a pipe with liquid and for the liquid to pass all the way through, but once the pipe is full the flow rate (the output of results) is constant.
An example of a 4-stage pipeline. The colored rectangles represent instructions that are independent of each other.
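The fill-then-steady-flow behavior of a pipeline can be sketched in a few lines of Python. This assumes an idealized pipeline that starts one instruction per cycle and never stalls; the stage letters and function names are illustrative.

```python
STAGES = ["F", "D", "E", "W"]  # fetch, decode, execute, write back (assumed names)


def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction finishes all of its stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """One instruction enters per cycle; the last one needs n_stages cycles to drain."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def pipeline_rows(n_instructions: int, stages=STAGES) -> list:
    """One text row per instruction, one column per cycle."""
    total = pipelined_cycles(n_instructions, len(stages))
    rows = []
    for i in range(n_instructions):
        cells = ["."] * total
        for s, name in enumerate(stages):
            cells[i + s] = name  # instruction i is in stage s during cycle i + s
        rows.append("".join(cells))
    return rows


for row in pipeline_rows(3):
    print(row)

print(unpipelined_cycles(1000, 20), pipelined_cycles(1000, 20))
```

With 20 stages, a single instruction still takes 20 cycles, but 1000 instructions take only slightly more than 1000 cycles in this idealized model instead of 20,000.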
The full cycle that an instruction goes through is a very carefully coordinated process, but not all instructions take the same amount of time to complete. For example, addition is very fast, while division or a load from memory can take hundreds of cycles. Rather than stalling the entire processor until one slow instruction completes, most modern processors execute instructions out of order. That is, they determine which instruction is most advantageous to execute at any given moment and buffer the other instructions that are not yet ready. If the current instruction is not ready, the processor can look ahead in the code to see whether something else is.
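As a rough illustration of why this helps, the Python sketch below compares in-order and out-of-order issue for a tiny program. It is a deliberately simplified model: one instruction issues per cycle, every destination register is written only once, dependencies only point to earlier instructions, and the latencies are invented for the example.

```python
import math


def simulate(instructions, out_of_order: bool) -> int:
    """Return the cycle in which the last result becomes available."""
    # A register written by some instruction is unavailable until that instruction issues.
    ready_at = {dest: math.inf for _, dest, _, _ in instructions}
    pending = list(instructions)
    cycle = 0
    finish = 0
    while pending:
        # In-order: only the oldest instruction may issue. Out-of-order: any ready one may.
        candidates = pending if out_of_order else pending[:1]
        for instr in candidates:
            name, dest, sources, latency = instr
            if all(ready_at.get(s, 0) <= cycle for s in sources):
                ready_at[dest] = cycle + latency
                finish = max(finish, cycle + latency)
                pending.remove(instr)
                break  # at most one instruction issues per cycle
        cycle += 1
    return finish


# (name, destination register, source registers, latency in cycles) -- assumed values
program = [
    ("load", "r1", [], 100),    # slow load from memory
    ("add", "r2", ["r1"], 1),   # must wait for the load
    ("mul", "r3", [], 3),       # independent of the load
    ("sub", "r4", ["r3"], 1),   # depends only on the multiply
]

print(simulate(program, out_of_order=False))
print(simulate(program, out_of_order=True))
```

In the in-order run, the multiply and subtract sit idle behind the stalled add; in the out-of-order run they complete while the load is still in flight, so the program finishes sooner.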
In addition to out-of-order execution, modern processors use what is called a superscalar architecture. This means that at any given time the processor is executing many instructions at once in each stage of the pipeline, and hundreds more may be waiting to begin execution. To make this possible, the processor contains several copies of each pipeline stage. If the processor sees that two instructions are ready to execute and there is no dependency between them, it does not wait for them to finish one after the other but executes them at the same time. A common related technique is Simultaneous Multithreading (SMT), which Intel markets as Hyper-Threading. Intel and AMD processors currently support two-way SMT, while IBM has developed chips that support up to eight-way SMT.
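The dependency check that decides whether two instructions can run side by side can be sketched as follows. This is a conservative illustration that assumes no register renaming; the tuple format and the function name `can_issue_together` are hypothetical.

```python
def can_issue_together(first, second) -> bool:
    """Return True if `second` can execute in the same cycle as the older `first`.

    Each instruction is a pair: (destination register, list of source registers).
    """
    dest1, sources1 = first
    dest2, sources2 = second
    reads_result = dest1 in sources2     # second needs the value first produces
    same_target = dest1 == dest2         # both write the same register
    overwrites_input = dest2 in sources1 # second overwrites a value first still reads
    return not (reads_result or same_target or overwrites_input)


add = ("r3", ["r1", "r2"])   # r3 = r1 + r2
sub = ("r4", ["r3", "r5"])   # r4 = r3 - r5, needs the result of the add
mul = ("r6", ["r4", "r5"])   # r6 = r4 * r5, unrelated to the add

print(can_issue_together(add, sub))
print(can_issue_together(add, mul))
```

Only the first of the three conditions is a true data dependency; the other two are naming conflicts, which real processors typically remove with register renaming.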
To make this carefully coordinated execution possible, the processor contains many additional elements besides the basic core. A processor has hundreds of separate modules, each with a specific function, but we will cover only the basics. The most important and beneficial are the caches and the branch predictor. There are other structures that we will not cover: reorder buffers, register renaming tables, and reservation stations.
The need for caches can be confusing at first, because they store data just like RAM or an SSD. What sets caches apart is their latency and access speed. Even though RAM is extremely fast, it is orders of magnitude slower than what the CPU needs. RAM may take hundreds of cycles to respond with data, and the processor would have nothing to do in the meantime. And if the data is not in RAM, it may take tens of thousands of cycles to fetch it from an SSD. Without caches, processors would stall constantly.
Processors typically have three levels of cache that form what is called the memory hierarchy. The L1 cache is the smallest and fastest, L2 is in the middle, and L3 is the largest and slowest of the caches. Above the caches in the hierarchy are small registers, each of which stores a single data value during computation. These registers are the fastest storage in the system by orders of magnitude. When the compiler converts a high-level program into assembly language, it determines the best way to use these registers.
When the CPU requests data from memory, it first checks whether that data is already stored in the L1 cache. If it is, the data can be accessed in just a couple of cycles. If it is not there, the processor checks L2 and then the L3 cache. Caches are implemented so that they are generally transparent to the core. The core simply requests data at a given memory address, and whichever level of the hierarchy holds it responds. With each subsequent level of the memory hierarchy, size and latency usually increase by orders of magnitude. Finally, if the CPU does not find the data in any of the caches, it goes to main memory (RAM).
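The lookup order described above can be sketched as a small Python model. The latencies are assumed round numbers chosen only to show the trend, the miss cost is modeled as the sum of every level checked, and eviction is not modeled at all; `LEVELS` and `access` are illustrative names.

```python
# (level name, assumed access latency in cycles), checked from fastest to slowest
LEVELS = [("L1", 4), ("L2", 12), ("L3", 40), ("RAM", 200)]


def access(address: int, caches: dict):
    """Look up an address level by level; return (level that answered, cycles spent)."""
    cycles = 0
    for i, (name, latency) in enumerate(LEVELS):
        cycles += latency
        if name == "RAM" or address in caches[name]:
            # Copy the data into every faster level on the way back to the core.
            for upper, _ in LEVELS[:i]:
                caches[upper].add(address)
            return name, cycles
    raise AssertionError("unreachable: RAM always answers")


caches = {"L1": set(), "L2": set(), "L3": set()}

first = access(0x1000, caches)   # nothing is cached yet, so RAM answers
second = access(0x1000, caches)  # the same address is now in L1
print(first, second)
```

The core issues the same request both times; only the level that answers, and therefore the delay, changes.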
In a typical processor, each core has two L1 caches: one for data and one for instructions. L1 caches usually have a total capacity of about 100 kilobytes, although the size varies considerably depending on the chip and processor generation. Each core usually also has its own L2 cache, although in some architectures it may be shared between two cores. L2 caches are typically several hundred kilobytes in size. Finally, there is a single L3 cache shared by all cores, with a size on the order of tens of megabytes.
When the processor executes the code, the most frequently used instructions and data values are cached. This significantly speeds up execution, because the processor does not need to constantly go to the main memory for the necessary data. In the second and third parts of the series, we will talk more about how these memory systems are implemented.
In addition to caches, one of the most important building blocks of a modern processor is an accurate branch predictor. Branch instructions are the processor's equivalent of "if" statements. One set of instructions is executed if the condition is true, and another if it is false. For example, we may need to compare two numbers and call one function if they are equal and a different function if they are not. These branch instructions are extremely common and can make up about 20% of all instructions in a program.
At first glance, branch instructions do not seem like they should cause problems, but executing them correctly can be very difficult for the processor. At any given time, the processor may be in the middle of executing ten or twenty instructions at once, so it is very important to know which instructions to execute. It may take 5 cycles to determine that the current instruction is a branch, and another 10 cycles to determine whether the condition is true. During that time, the processor may already have started executing dozens of additional instructions without even knowing whether they are the right ones to execute.
To get around this problem, all modern high-performance processors use a technique called speculation. This means that the processor keeps track of branch instructions and guesses whether each conditional branch will be taken or not. If the prediction is correct, the processor has already started executing the subsequent instructions, which yields a performance gain. If the prediction is incorrect, the processor stops execution, discards all the wrong instructions it had started executing, and starts again from the correct point.
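A back-of-the-envelope Python calculation shows why prediction accuracy matters. It uses the figures mentioned in this article (about 20% of instructions are branches, and resolving a branch can take roughly 5 + 10 = 15 cycles); the base cost of one cycle per instruction, the 10% misprediction rate, and the function name `effective_cpi` are assumptions for illustration.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: float) -> float:
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# 20% branches, 15-cycle penalty, and an assumed 10% of branches mispredicted.
print(effective_cpi(1.0, 0.20, 0.10, 15))

# With perfect prediction the penalty disappears entirely.
print(effective_cpi(1.0, 0.20, 0.0, 15))
```

Under these assumptions, mispredicting one branch in ten already adds about 30% to the average cost of every instruction.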
Such branch predictors are among the simplest forms of machine learning, because the predictor learns the behavior of branches as the program runs. If it predicts incorrectly too often, it begins to learn the correct behavior. Decades of research into branch prediction techniques have resulted in prediction accuracy of more than 90% in modern processors.
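One classic textbook scheme that learns in this way is the two-bit saturating counter, sketched below in Python. It is shown only as an illustration of the idea; predictors in current processors are far more elaborate, and the class and function names here are hypothetical.

```python
class TwoBitPredictor:
    """Counter values 0-1 predict 'not taken'; values 2-3 predict 'taken'."""

    def __init__(self):
        self.counter = 2  # start in the 'weakly taken' state

    def predict(self) -> bool:
        return self.counter >= 2

    def update(self, taken: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(outcomes) -> float:
    """Fraction of branch outcomes that the predictor guessed correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# A loop branch: taken nine times, then not taken once when the loop exits,
# with the whole loop run ten times over.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(loop_branch))
```

Because the counter needs two wrong guesses in a row to change its mind, a single loop exit does not disturb the prediction for the next run of the loop.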
Although speculation provides a huge performance boost, because the processor can execute instructions that are ready instead of waiting in line for earlier ones to finish, it also creates security vulnerabilities. The famous Spectre attack exploits flaws in branch prediction and speculation. The attacker uses specially crafted code to make the processor speculatively execute code that leaks values from memory. To prevent such leaks, certain aspects of speculation had to be redesigned, which led to a slight drop in performance.
Over the past decades, the architecture used in modern processors has come a long way. Innovation and careful design have led to higher performance and better use of the underlying hardware. CPU makers, however, guard the secrets of their technologies closely, so we cannot know exactly what goes on inside their chips. Nevertheless, the fundamental principles of how processors work are the same across all architectures and models. Intel may add its secret ingredients to raise the cache hit rate, and AMD may add an improved branch predictor, but the processors of both companies perform the same task.
In this first overview, we covered the basics of how processors work. In the next part, we will explain how the components that make up a processor are designed, and talk about logic gates, clock frequencies, power management, circuit design, and more.