How processors are designed and manufactured: CPU design
Now that we know how processors work at a high level, it is time to delve deeper into the process of designing their internal components. This is the second article in a series on processor development. I recommend reading the first part before this one, so that you understand the concepts outlined below.
Part 1: Computer architecture basics (instruction set architectures, caching, pipelines, hyperthreading)
Part 2: CPU design process (circuitry, transistors, logic elements, synchronization)
Part 3: Chip layout and physical manufacturing (VLSI and silicon fabrication)
Part 4: Current trends and important future directions in computer architecture (sea of accelerators, three-dimensional integration, FPGAs, near-memory computing)
As you may know, processors and most other digital devices are made up of transistors. The easiest way to think of a transistor is as a controlled switch with three contacts. When the gate is turned on, electric current can flow through the transistor. When the gate is off, current cannot flow. The gate is like a light switch in a room, only it is much smaller, faster and can be controlled electrically.
There are two main types of transistors used in modern processors: pMOS and nMOS. The nMOS transistor passes current when the gate is charged or has a high voltage, and the pMOS transistor passes current when the gate is discharged or has a low voltage. By combining these types of transistors in a complementary way, we can create CMOS logic elements. In this article we will not analyze in detail how transistors operate, but we will touch on this in the third part of the series.
A logic element, or logic gate, is a simple device that receives input signals, performs an operation, and outputs a result. For example, the AND element turns on its output if and only if all of its inputs are turned on. The inverter, or NOT element, turns on its output if its input is off. You can combine these two and get a NAND element, whose output is on unless all of its inputs are turned on. There are other elements with their own logic functions, for example OR, NOR (OR with inversion), XOR (exclusive OR) and XNOR (exclusive OR with inversion).
The following shows how two simple elements are assembled from transistors: an inverter and a NAND. In the inverter, the pMOS transistor (top) is connected to power, and the nMOS transistor (bottom) is connected to ground. The symbol for a pMOS transistor has a small circle attached to the gate. We said that pMOS devices pass current when the input is turned off, and nMOS devices pass current when the input is turned on, so it is easy to see that the output signal (Out) will always be the opposite of the input signal (In). Looking at the NAND element, we see that it requires four transistors, and that the output will be on as long as at least one of the inputs is turned off. Connecting transistors in this way to form simple networks is the same process that is used to design more complex logic elements and other circuits inside processors.
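The behavior of these two elements can be sketched in a few lines of Python by treating each transistor as a switch. This is only a rough switch-level model, not an electrical simulation. The function names are made up for illustration, and the NAND wiring assumed here (two pMOS transistors in parallel to power, two nMOS transistors in series to ground) is the standard CMOS arrangement rather than something taken from the figure.

```python
def nmos_conducts(gate: bool) -> bool:
    """An nMOS transistor passes current when its gate is high."""
    return gate


def pmos_conducts(gate: bool) -> bool:
    """A pMOS transistor passes current when its gate is low."""
    return not gate


def inverter(a: bool) -> bool:
    pull_up = pmos_conducts(a)    # pMOS connects Out to power
    pull_down = nmos_conducts(a)  # nMOS connects Out to ground
    # In a complementary design exactly one of the two paths is active.
    assert pull_up != pull_down
    return pull_up


def nand(a: bool, b: bool) -> bool:
    # Assumed standard CMOS NAND: two pMOS in parallel, two nMOS in series.
    pull_up = pmos_conducts(a) or pmos_conducts(b)
    pull_down = nmos_conducts(a) and nmos_conducts(b)
    assert pull_up != pull_down
    return pull_up


if __name__ == "__main__":
    for a in (False, True):
        for b in (False, True):
            print(f"A={a!s:5} B={b!s:5} NAND={nand(a, b)}")
```

Running the loop prints the NAND truth table: the output is off only in the row where both inputs are on.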
Building blocks in the form of logical elements are so simple that it is difficult to understand how they turn into a functioning computer. The design process consists of combining several elements to create a small device that can perform a simple function. You can then combine many of these devices to create something that performs a more complex function. The process of combining individual components to create a working structure is exactly the process that is used today to create modern chips. The only difference is that a modern chip consists of billions of transistors.
As a small example, let's take a simple adder - a 1-bit full adder. It receives three input signals - A, B, and Carry-In (the carry input), and creates two output signals - Sum and Carry-Out (the carry output). The simplest circuit uses five logic elements, and such adders can be connected together to create an adder of any size. In modern designs this process is improved by optimizing part of the logic and the carry signals, but the fundamental principles remain the same.
The Sum output is on either when exactly one of A or B is on and there is no carry-in, or when there is a carry-in and A and B are both on or both off. The carry output is a bit more complicated. It is active when either A and B are on at the same time, or there is a Carry-In and one of A or B is on. To connect multiple 1-bit adders into a wider adder, we just need to connect the Carry-Out of the previous bit to the Carry-In of the current bit. The more complicated the circuits become, the harder the logic is to follow, but this is the simplest way to add two numbers. Modern processors use more sophisticated adders, but their circuits are too complicated for an overview like this one. In addition to adders, processors also contain units for division and multiplication, as well as floating-point versions of all these operations.
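The adder described above can be written out as a minimal sketch in Python. The five-element version assumed here uses two XOR, two AND and one OR element, which is one common way to build it. The function names and the default 8-bit width are chosen only for illustration.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """1-bit full adder built from five elements: 2 XOR, 2 AND, 1 OR."""
    partial = a ^ b                              # XOR 1: exactly one of A, B is on
    total = partial ^ carry_in                   # XOR 2: the Sum output
    carry_out = (a & b) | (partial & carry_in)   # AND, AND, OR: the Carry-Out output
    return total, carry_out


def ripple_carry_add(x: int, y: int, width: int = 8) -> tuple[int, int]:
    """Chain `width` full adders; each Carry-Out feeds the next Carry-In."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        bit, carry = full_adder(a, b, carry)
        result |= bit << i
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(5, 3))    # (8, 0)
    print(ripple_carry_add(255, 1))  # (0, 1): the 8-bit result wraps and the carry is set
```

The carry has to ripple through every bit in turn, which is why this simple design is slow for wide numbers and why real processors use more elaborate adders.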
Combining elements in this way to compute a function of the input signals is called combinational logic. However, this is not the only type of logic used in computers. It would not be of much use if we could not store data or keep track of state. In order to be able to store data, we need sequential logic.
Sequential logic is built by carefully connecting inverters and other logic elements so that their outputs feed back into the inputs of the elements. Such feedback loops are used to store one bit of data and are called static RAM, or SRAM. This memory is called static RAM, as opposed to dynamic RAM (DRAM), because the stored data is always directly connected to positive voltage or ground.
The standard way to implement one SRAM bit is with the 6-transistor circuit shown below. The uppermost signal, marked WL (Word Line), is the address, and when it is turned on, the data stored in this 1-bit cell is transferred to the Bit Line, marked BL. The BLB output is called Bit Line Bar; this is just the inverted value of the Bit Line. You should recognize the two types of transistors and see that M3 with M1, like M4 with M2, form an inverter.
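The idea of a feedback loop holding a bit can be illustrated with a toy Python model of such a cell. It ignores all electrical detail. The class name and methods are hypothetical, and the write behavior (the bit lines overpowering the loop while the word line is on) is an assumption about how such cells are normally written, not something described above.

```python
class SramCell:
    """Toy model of one SRAM bit: two inverters wired in a feedback loop."""

    def __init__(self, initial: bool = False):
        self.q = initial          # stored value
        self.q_bar = not initial  # its inverse, held by the other inverter

    def _settle(self) -> None:
        # Each inverter keeps regenerating its output from the other one,
        # which is what holds the bit in place.
        self.q_bar = not self.q
        self.q = not self.q_bar

    def read(self, word_line: bool):
        """Return (BL, BLB) while the word line is on, otherwise None."""
        self._settle()
        if not word_line:
            return None
        return self.q, self.q_bar

    def write(self, word_line: bool, bit_line: bool) -> None:
        """Assumed write: the bit lines force the loop while WL is on."""
        if word_line:
            self.q = bit_line
        self._settle()


if __name__ == "__main__":
    cell = SramCell()
    cell.write(word_line=True, bit_line=True)
    print(cell.read(word_line=True))   # (True, False)
    cell.write(word_line=False, bit_line=False)  # ignored: cell not addressed
    print(cell.read(word_line=True))   # (True, False)
```

A write with the word line off leaves the stored value untouched, which is how a single cell is selected out of many that share the same bit lines.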
SRAM is used to build ultrafast caches and registers inside processors. This memory is very stable, but requires six to eight transistors to store each bit of data. Therefore, in comparison with DRAM, it is extremely expensive in terms of cost, complexity and area on the chip. Dynamic RAM, on the other hand, stores data in a tiny capacitor, rather than using logic gates. It is called dynamic, because the voltage on the capacitor can vary significantly, since it is not connected to power or ground. There is only one transistor used to access the data stored in the capacitor.
Since DRAM requires only one transistor per bit and is highly scalable, it can be packed densely and cheaply. The disadvantage of DRAM is that the charge on the capacitor is so small that it needs to be constantly refreshed. That is why, after turning off the power to the computer, all capacitors are discharged and data in RAM is lost.
Companies such as Intel, AMD, and Nvidia do not publish circuit diagrams of their processors, so it is impossible to show such complete circuitry for modern processors. However, this simple adder allows you to get the idea that even the most complex parts of the processor can be divided into logical and storage elements, and then into transistors.
Now that we know how some processor components are built, we need to figure out how to put everything together and synchronize it. All key components of the processor are connected to a clock signal. It alternates between a high and a low voltage at a fixed interval, and the number of such cycles per second is called the frequency. The logic inside the processor usually switches values and performs calculations when the clock signal changes voltage from low to high. By synchronizing all the parts, we can guarantee that the data always arrives at the right time so that there are no glitches in the processor.
You may have heard that you can increase the clock speed to increase processor performance. The gain comes from the transistors and logic inside the processor switching more often than they were designed to. Since there are more cycles per second, more work can be done and the processor performs better. However, this is true only up to a point. Modern processors usually operate at frequencies from 3.0 GHz to 4.5 GHz, and this value has not changed much over the past ten years. Just as a metal chain is no stronger than its weakest link, a processor can run no faster than its slowest part. By the end of each clock cycle, every processor element should have completed its work. If some parts have not finished yet, then the clock is too fast and the processor will not work. Designers call this slowest part the critical path, and it determines the maximum frequency at which the processor can run. Above a certain frequency, transistors simply cannot switch quickly enough and begin to fail or produce incorrect output values.
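The relationship between the critical path and the maximum frequency is simple arithmetic, sketched below in Python. The stage names and delays are invented numbers used only for illustration and do not describe any real processor.

```python
# Hypothetical delays, in picoseconds, of the logic between clock edges.
stage_delays_ps = {
    "fetch": 180.0,
    "decode": 210.0,
    "execute": 250.0,
    "writeback": 160.0,
}


def critical_path(delays: dict[str, float]) -> tuple[str, float]:
    """Return the name and delay of the slowest part."""
    name = max(delays, key=delays.get)
    return name, delays[name]


def max_frequency_ghz(delays: dict[str, float]) -> float:
    """The clock period cannot be shorter than the slowest delay."""
    _, slowest_ps = critical_path(delays)
    return 1000.0 / slowest_ps  # a period of 1000 ps corresponds to 1 GHz


if __name__ == "__main__":
    print(critical_path(stage_delays_ps))      # ('execute', 250.0)
    print(max_frequency_ghz(stage_delays_ps))  # 4.0
```

In this made-up case, speeding up any part other than "execute" would not raise the maximum frequency at all, which is why designers focus their effort on the critical path.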
By increasing the voltage of the processor, we can accelerate the switching of transistors, but this also works to a certain limit. If too much voltage is applied, then we risk burning the processor. When we increase the frequency or voltage of the processor, it always starts to radiate more heat and consume more power. This is because the processor power is directly proportional to the frequency and proportional to the square of the voltage. To determine the power consumption of the processor, we consider each transistor as a small capacitor that needs to be charged or discharged when its value changes.
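As a rough worked example of this proportionality, the Python sketch below computes how power scales with frequency and voltage. The second function uses the common textbook model in which dynamic power is the product of an activity factor, the switched capacitance, the square of the voltage and the frequency. That model, the function names and the sample numbers are assumptions made for illustration.

```python
def relative_power(frequency_scale: float, voltage_scale: float) -> float:
    """Power relative to the baseline: proportional to f and to V squared."""
    return frequency_scale * voltage_scale ** 2


def dynamic_power_watts(capacitance_farads: float, voltage_volts: float,
                        frequency_hz: float, activity: float = 1.0) -> float:
    """Assumed textbook model: P = activity * C * V^2 * f."""
    return activity * capacitance_farads * voltage_volts ** 2 * frequency_hz


if __name__ == "__main__":
    # Raising both frequency and voltage by 10% costs about 33% more power.
    print(round(relative_power(1.1, 1.1), 3))
    # Hypothetical chip: 1 nF of switched capacitance at 1.0 V and 3 GHz.
    print(round(dynamic_power_watts(1e-9, 1.0, 3e9), 3))
```

Because voltage enters squared, a small voltage increase costs more power than the same relative increase in frequency.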
Power delivery is such an important part of the processor that in some cases up to half of the physical contacts on the chip are used only for power or ground. Some chips at full load can draw more than 150 amperes, and all this current has to be managed extremely carefully. For comparison: the central processor generates more heat per unit area than a nuclear reactor.
The clock signal in modern processors accounts for about 30-40% of the processor's total power, because it is very complex and must drive many different devices. To conserve energy, most low-power processors disable parts of the chip when they are not in use. This can be done by turning off the clock (this method is called clock gating) or turning off the power (power gating).
Clock signals create another difficulty in processor design: as their frequencies keep growing, the laws of physics start to get in the way. Even though the speed of light is extremely high, it is not high enough for high-performance processors. If you connect a clock signal to one end of the chip, then by the time the signal reaches the other end, it will be out of sync by a significant amount. To synchronize all parts of the chip, the clock signal is distributed using a so-called H-tree. This is a structure that ensures that all endpoints are at exactly the same distance from the center.
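To get a feel for the scale of the problem, the short Python calculation below estimates how far light travels in a vacuum during one clock cycle. This is only an upper bound assumed for illustration: electrical signals on a real chip travel more slowly, and the function name is made up.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in a vacuum


def distance_per_cycle_cm(frequency_ghz: float) -> float:
    """Distance light covers in a vacuum during one clock period, in cm."""
    period_s = 1.0 / (frequency_ghz * 1e9)
    return SPEED_OF_LIGHT_M_PER_S * period_s * 100.0


if __name__ == "__main__":
    for ghz in (1.0, 3.0, 4.5):
        print(f"{ghz} GHz: {distance_per_cycle_cm(ghz):.2f} cm per cycle")
```

At 4.5 GHz this upper bound is under 7 cm per cycle, which is only a few times larger than a chip, so the time a signal spends travelling across the die is a noticeable fraction of a clock period.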
It may seem that designing every individual transistor, clock signal and power contact in the chip is an extremely tedious and difficult task, and this is indeed so. Even though thousands of engineers work for companies like Intel, Qualcomm, and AMD, they would not be able to manually design every aspect of the chip. To design chips of this scale, they use many sophisticated tools that automatically generate designs and electrical circuits. Such tools usually take a high-level description of what the component should do, and determine the best hardware configuration that meets these requirements. Recently, an approach called High Level Synthesis has emerged, which allows developers to specify the required functionality in code, after which computers determine how best to implement it in hardware.
In the same way that you can describe computer programs through code, designers can describe hardware devices with code. Languages such as Verilog and VHDL allow hardware designers to express the functionality of any circuit they create. After such designs have been simulated and verified, they can be synthesized into the specific transistors that will make up the circuit. Although the verification phase may not seem as exciting as designing a new cache or core, it is much more important. For each design engineer hired by a company, there may be five or more verification engineers.
Verification of a new design often takes more time and money than creating the chip itself. Companies spend so much time and money on verification because, once a chip has gone into production, it cannot be fixed. If there is an error in software, you can release a patch, but hardware works differently. For example, Intel discovered a bug in the floating-point division unit of some Pentium chips, and it ended up costing the company the equivalent of $2 billion in today's money.
It is difficult to comprehend that there can be several billion transistors in one chip and understand what they all do. If you break the chip into its individual internal components, it becomes a little easier. Logic elements are composed of transistors, logic elements are combined into functional modules that perform a specific task, and these functional modules are connected together to form the computer architecture that we discussed in the first part of the series.
Most of the design work is automated, but the above gives us an idea of just how complex the new CPU we just bought really is.
In this second part of the series, I talked about the CPU design process. We discussed transistors, logic gates, power and clock signals, design synthesis, and verification. In the third part, we will find out what is required for the physical production of the chip. All companies love to brag about how modern their manufacturing process is (Intel - 10nm, Apple and AMD - 7nm, etc.), but what do these numbers really mean? We will talk about this in the next part.