Prospects for the development of central processors

    Seventy years ago, in 1941, the first programmable computer was created. Much water has flowed under the bridge since then, and computers now surround us everywhere. Many aspects of computer design have evolved enormously; many others, on the contrary, have not changed at all in essence. In particular, the very principle of operation of central processors, the algorithmic model, has not changed and probably never will. The physical limitations of this model are well understood, so the limit to which central processors can develop in terms of speed is clearly visible. Technologically, this ceiling is still quite far away: several decades of development and several orders of magnitude in speed. But that should not prevent us from thinking seriously about what processors will look like on the verge of the limit of their high-speed development.

    It is precisely because the possible development paths are strictly dictated by fundamental physical constraints, such as the finiteness of the speed of light, the laws of thermodynamics, and the restriction on the minimum size of elements (they cannot be smaller than a single molecule), that it is possible to make predictions in this area at all.

    Part 1. The principles of computing devices today and in the future

    In the early decades of their existence, computers were used only for computing. The original meaning of the word "computer" corresponds to the Russian term rendered as "automatic computing machine" (AVM). A little later, computers also began to be used as control elements of other devices, and the word "computer" acquired its modern meaning: a general-purpose device that executes programs. Today, AVMs are highly specialized devices used in computing centers, while the computer as a control element is, on the contrary, used in almost all automated devices, from microwave ovens to robotic production lines and from mobile phones to spacecraft. In particular, it plays the controlling role in laptops, desktop workstations, and game consoles, the devices we call household computers.

    There are several significantly different models of computation, so future AVMs will probably be organized in completely unexpected ways. However, a computer used as a control device must in any case contain an executive unit (the central processor), and the design of this unit is rigidly fixed by its functions: it must be able to
    - issue sequential instructions to other devices and
    - choose a further course of action depending on the prevailing conditions.
    These two requirements automatically imply a device model called "relaxed sequential execution" in English or the "algorithmic model" in the Russian-language tradition. This model, unlike many others, has no quantum analogue, since quantum computation is fundamentally reversible, while making a choice is fundamentally irreversible. This model is discussed in the next two parts of the article.

    To complete the picture, here is a list of all the scalable hardware models of computation known at the moment:
    - an algorithmic model, that is, (conditionally) step-by-step execution of instructions (“conditionally” means that parallel execution of independent instructions is allowed);
    - stream (vector) computing and its natural generalization, quantum pipelines;
    - cellular/network automata and their natural generalization, quantum automata.

    Currently, apart from the sequential execution model, only streaming vector computation is used in practice: in household computers in specialized coprocessors for three-dimensional graphics (GPUs), as well as in their higher-powered analogues (GPGPU) in computing centers for numerical simulation and the processing of large data arrays. The central processor delegates certain specific tasks to the stream processor, but only the CPU itself performs the control functions in full. With the development of quantum computing, this state of affairs is unlikely to change: quantum pipelines and automata will only be additions to the central processor, not its replacement.

    Part 2. Element base of processors today and in the future

    At the elementary level, processors consist of two types of elements: switches and tracks connecting them. The task of the tracks is to carry the signal, and the task of the switches is to transform this signal. There are many different implementations of switches and tracks.

    In the digital electronics around us, transistors are used as switches, and tracks are made of conductors.

    § 2.1. Signal Transmission
    Binary digital electronics uses voltage to encode the signal. A track carries the signal "zero" if a potential of (U0 ± ε) is applied to it, and the signal "one" if the potential is (U1 ± ε). The base levels U0 and U1 are chosen so that (U0 + ε) < (U1 - ε), where ε is the unavoidable error due to technical reasons. A transition from one state to the other means stopping, or conversely accelerating, a huge number of electrons in the conductor. Both of these processes take considerable time and are inevitably associated with large energy losses.
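
    As a toy illustration of this encoding, here is a minimal sketch in Python; the concrete values of U0, U1 and ε are invented for the example and do not correspond to any real process.

      # Sketch: decoding a track's voltage into a logical bit. The concrete
      # values of U0, U1 and EPS are illustrative, not taken from a real process.
      U0, U1, EPS = 0.0, 1.0, 0.2          # base levels and tolerance, (U0 + EPS) < (U1 - EPS)

      def decode(voltage):
          """Return 0 or 1 if the voltage lies inside a valid band, None otherwise."""
          if abs(voltage - U0) <= EPS:
              return 0
          if abs(voltage - U1) <= EPS:
              return 1
          return None                      # still switching, or the signal is corrupted

      print(decode(0.1), decode(0.95), decode(0.5))   # -> 0 1 None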

    There are alternatives to digital electronics that lack this drawback: one can send a constant stream of electrons (so-called spintronics) or photons (so-called photonic logic) along the tracks and encode the signal in the polarization of that stream. Both electrons and photons moving along a one-dimensional waveguide can be in exactly two orthogonal polarization states, one of which is declared to be zero and the other one. Which is better, photonic logic or spintronics, is not yet clear:
    - in both cases the particle stream flows with unavoidable thermal losses, but these losses can be made arbitrarily small;
    - in both cases the flow velocity is below the speed of light, but it can be made arbitrarily close to it;
    - in both cases complex techniques are required so that electromagnetic interference, thermal noise, and background radiation (random particles flying past) do not distort the signal.
    In other words, the known physical limitations of the two technologies are the same. Most likely, combined technologies will give the best results: transporting photons wins at long distances and loses at short ones, and working prototypes of converters (in both directions) between photon polarization and electron-stream polarization have already been developed. Graphene is an ideal material for spintronic tracks: it provides excellent speed with very low heat loss. In addition, working prototypes of the following elements already exist:
    - tracks that reliably maintain the polarization of the current over short and medium distances;
    - memory cells capable of storing a polarization and later emitting a stream with the stored polarization.

    Graphene Track Information:
    2010-06:  Mass fabrication of graphene nanowires. Georgia Institute of Technology.
    2011-01:  Researchers generate a spin current in graphene. City University of Hong Kong.
    2011-02:  Creation of a pure spin current in graphene. City University of Hong Kong.
    2011-02:  Mass-production-ready bottom-up synthesis of soluble defect-free graphene nanoribbons. Max Planck Institute for Polymer Research in Mainz.

    § 2.2. Switches
    The transistor used in digital electronics is a three-contact device that operates as follows:
    - if a high potential is applied to the middle contact, the outer contacts are connected and the switch is in the "conductor" position;
    - if a low potential is applied to the middle contact, the outer contacts are disconnected and the switch is in the "insulator" position.
    Transistors consisting of a single molecule have already been successfully created and tested. Their speed and compactness come close to the theoretical limit, but so far there are no technologies that would allow their mass production and use. All modern digital electronics is built on metal-oxide-semiconductor field-effect transistors (MOSFETs). For several decades chip manufacturers have done little more than reduce the linear dimensions and power consumption of such transistors; however, industrial tests of several types of multi-terminal and junctionless transistors (BDT, JNT) have already been conducted. Such transistors should be able to deliver 4-6 times higher performance. This transition will not require major changes in the manufacturing process and is expected within the next 5-10 years.

    For fundamental thermodynamic reasons, transistor-type switches cannot operate efficiently: an efficient switch must be conservative and reversible as a logic element. These conditions are met by the symmetric switch, also known as the Fredkin gate. This device has two inputs In1 and In2, two outputs Out1 and Out2, and a pass-through control track C. If C is in position "0", the inputs and outputs are connected straight through: In1 to Out1 and In2 to Out2; if C is in position "1", they are connected crosswise: In1 to Out2 and In2 to Out1. Most likely, spin-based symmetric switches can be implemented in a single-molecule version, combining maximum compactness, speed, and efficiency.

    Logic circuits of any complexity can be built from either transistors or symmetric switches, but symmetric switches are far more economical. For example, cloning a signal and the logical operations NOT, AND, OR, AND-NOT and implication each require only one symmetric switch, while XOR, NOR and NAND require two. Implementing the same operations with transistors takes 4 to 6 transistors each.
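
    Below is a minimal behavioral sketch in Python of the symmetric switch and of how the one-switch operations listed above can be obtained by fixing some of its inputs to constants; the particular input assignments are a standard textbook construction, not the exact circuits from the papers cited at the end of this section.

      # Behavioral sketch of a symmetric switch (Fredkin gate).
      def fredkin(c, in1, in2):
          """If c == 0 the inputs pass straight through; if c == 1 they are swapped."""
          return (in1, in2) if c == 0 else (in2, in1)

      # One switch with constants on its data inputs yields the basic operations.
      def NOT(a):        return fredkin(a, 1, 0)[0]
      def CLONE(a):      return fredkin(a, 1, 0)[1]      # a copy of the control signal
      def AND(a, b):     return fredkin(a, 0, b)[0]      # b reaches Out1 only when a == 1
      def OR(a, b):      return fredkin(a, b, 1)[0]      # 1 reaches Out1 when a == 1
      def IMPLIES(a, b): return fredkin(a, 1, b)[0]      # "a implies b"

      # XOR needs two switches: one prepares NOT(b), the other selects b or NOT(b) by a.
      def XOR(a, b):     return fredkin(a, b, NOT(b))[0]

      for a in (0, 1):
          for b in (0, 1):
              assert AND(a, b) == (a & b) and OR(a, b) == (a | b) and XOR(a, b) == (a ^ b)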


    A fast single-bit adder (half adder) uses four symmetric switches (instead of 12 transistors) and, ideally, offers a significantly higher operating speed than its transistor counterpart: its operating time is only 2τ, where τ is the time it takes a signal to pass through one switch and one track. In more complex circuits the gains in compactness and speed from symmetric switches are even more pronounced. The building block of a multi-bit adder (the full adder) consists of dozens of transistors, while an implementation on symmetric switches requires from 4 (the most compact variant) to 7 (the fastest variant) switches.
     

    A unique feature of symmetric switches is that they allow maximum compactness and speed for multiplexers and demultiplexers, which are very common building blocks of processors. Unfortunately, symmetric switches become attractive only with spintronics or photonic logic; conventional electronics is doomed to use ordinary transistors.

    * This section uses materials and images from the articles by Bruce, Thornton et al., "Efficient Adder Circuits Based on a Conservative Reversible Logic" (Mississippi State University, Proc. IEEE Symposium on VLSI, 2002); Van Rentergem and De Vos, "Optimal Design of a Reversible Full Adder" (Universiteit Gent, International Journal of Unconventional Computing, 2005), as well as private correspondence. The image of the single-molecule transistor is taken from the corresponding press release on the Yale University website.

    § 2.3. Conclusion
    So, at the limit, the central processor of the future is likely to be a huge web in which the threads are graphene ribbons conducting a polarized stream of electrons and the nodes are miniature symmetric switches. Over long distances the signal will be transcoded into light, delivered by photons, and transcoded back at the destination. A number of special tasks (in particular, searching and sorting) will be delegated to quantum pipelines and automata.

    Part 3. Architecture of central processors today and in the future

    All central processors in common use today are register machines of the von Neumann architecture (a minimal sketch of such a machine follows the list below):
    - the processor contains a certain number (for example, 16) of numbered memory cells (for example, 64 bits each), called operational registers;
    - the processor's job is to process the contents of the registers in accordance with a program;
    - the program is a sequence of instructions, for example:
      - take the values from registers No. 2 and No. 5 and write their sum to register No. 8;
      - read into register No. 4 the value from the external memory cell whose address is in register No. 6;
      - send the number from register No. 12 to external device NNN;
    - the data and the programs themselves are stored in RAM, the program instructions being encoded as numbers and written to consecutive memory cells;
    - the number of the cell whose instruction is currently being executed is stored in a special register called the instruction pointer, which is advanced to the next memory cell when the instruction completes.
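
    A minimal sketch of such a register machine in Python; the instruction names and their encoding as tuples are invented for the example (a real machine encodes instructions as numbers stored in the same RAM as the data).

      # Sketch of a von Neumann register machine: registers, RAM, instruction pointer.
      # The instruction format is invented for the example: (opcode, arg1, arg2, arg3).
      regs = [0] * 16            # 16 operational registers
      ram  = {}                  # external memory: address -> value
      program = [
          ("load_const", 2, 7, 0),    # reg2 <- 7
          ("load_const", 5, 35, 0),   # reg5 <- 35
          ("add", 2, 5, 8),           # reg8 <- reg2 + reg5
          ("store", 8, 100, 0),       # ram[100] <- reg8
          ("halt", 0, 0, 0),
      ]

      ip = 0                                      # instruction pointer
      while True:
          op, a, b, c = program[ip]               # fetch the instruction at address ip
          if op == "halt":
              break
          elif op == "load_const":
              regs[a] = b
          elif op == "add":
              regs[c] = regs[a] + regs[b]
          elif op == "store":
              ram[b] = regs[a]
          ip += 1                                 # move on to the next memory cell

      print(regs[8], ram[100])                    # -> 42 42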

    In the more than half a century since this architecture appeared, dozens of alternative architectures for sequential execution have been proposed, but all of them are inferior to it in their performance ceiling. The performance ceiling is determined (because of the finiteness of the speed of light) by the linear dimensions of the critical part of the structure, and of all sequential-execution architectures proposed so far it is the classical register architecture that provides the greatest locality of computation. Accordingly, it is this architecture that will most likely be preserved in the future, for at least a few more decades.

    § 3.1. Functional Logical Units
    The switches and tracks described in the previous part are the "raw material" for building functional logic units such as the adder, multiplier, multiplexer, and demultiplexer. Multiplexers and demultiplexers are devices similar to marshalling yards on a railway. The multiplexer receives n numbers (a1 .. an) and a control number m as input, and outputs the number a_m, that is, the m-th of the n input numbers. The demultiplexer is the mirror image of the multiplexer: it routes the incoming signal to the m-th of its n outputs.
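
    A behavioral sketch of both devices in Python (pure functions over lists rather than a gate-level model):

      # Behavioral sketch of a multiplexer and a demultiplexer.
      def mux(inputs, m):
          """Select the m-th of n inputs (1-based, as in the text)."""
          return inputs[m - 1]

      def demux(signal, m, n):
          """Route the signal to the m-th of n outputs; the rest stay at 0."""
          outputs = [0] * n
          outputs[m - 1] = signal
          return outputs

      print(mux([10, 20, 30, 40], 3))      # -> 30
      print(demux(7, 2, 4))                # -> [0, 7, 0, 0]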


    The task of the adder is to add the numbers supplied to it. An n-bit adder receives as input two n-bit numbers a and b plus a single-bit number c (0 or 1). The adder adds the inputs and produces the result (a + b + c), which is an (n + 1)-bit number.
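
    The same behavior as a short sketch in Python, with the bit width n as a parameter; the point is that the result always fits into n + 1 bits.

      # Behavioral sketch of an n-bit adder: two n-bit words plus a carry-in bit.
      def adder(a, b, c, n):
          assert 0 <= a < 2**n and 0 <= b < 2**n and c in (0, 1)
          result = a + b + c
          assert result < 2**(n + 1)       # an (n + 1)-bit number is always enough
          return result

      print(adder(0xFFFF, 0xFFFF, 1, 16))  # -> 131071, i.e. a 17-bit result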


    Functional logic units do not respond to changes in their arguments instantly: the signal needs to travel all the way from the beginning of the circuit to its end, including the delays incurred in passing through switches and any cycles in the circuit. For each functional logic unit, the operating time is calculated at design time, and an extensive theory of designing functional logic units with minimum operating time has been developed. Thus, the simplest multi-bit adder, which performs addition in the manner of column addition on paper, needs an operating time that grows linearly with the bit width. In the 1960s, carry-lookahead schemes were developed that reduced this growth to logarithmic. Using them, an adder built on symmetric switches is guaranteed to produce its result within ⌈3 + log2(bit width)⌉ · τ, which is the theoretical speed limit. For a 64-bit adder this gives roughly a 15-fold speedup compared with the naive implementation. Ingenious optimizations of multiplication, division, and the computation of trigonometric functions sometimes achieve speedups of hundreds of times.
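
    To make the numbers concrete, here is a small sketch comparing the two growth laws; the assumption of roughly 2τ per bit for the naive ripple-carry adder is made up for the comparison, while the logarithmic formula is the one quoted above.

      from math import ceil, log2

      # Delay of a naive ripple-carry adder vs. a carry-lookahead adder, in units
      # of tau (one switch plus one track). The 2*tau-per-bit figure for the
      # ripple-carry adder is an assumption made only for this comparison.
      def ripple_delay(bits):
          return 2 * bits

      def lookahead_delay(bits):
          return ceil(3 + log2(bits))

      for bits in (8, 16, 32, 64):
          r, l = ripple_delay(bits), lookahead_delay(bits)
          print(f"{bits:2d} bits: ripple ~{r} tau, lookahead {l} tau, speedup ~{r / l:.0f}x")
      # 64 bits: ripple ~128 tau, lookahead 9 tau, speedup ~14x, of the same order
      # as the roughly fifteen-fold figure quoted above.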

    § 3.2. Operation Blocks
    An operation block is a device that can execute instructions of one type, for example addition and subtraction. The basic element of an operation block is a memory cell (the instruction register) into which the instruction to be executed is written. On completing the instruction, the block clears this cell. The rest of the operation block consists of a set of functional logic units connected to one another and to the instruction register. As an example, consider an operation block capable of executing instructions such as "add the values from registers No. 2 and No. 5 and write the sum to register No. 8".
    We need:
    - an instruction register, which stores the operation code for addition and the numbers of the operand registers (2, 5, and 8 in the command above);
    - an adder;
    - two multiplexers connected to the register block on one side and to the inputs of the adder on the other: they select from the registers the two whose numbers are stored in the instruction register (No. 2 and No. 5) and feed their contents to the inputs of the adder;
    - one demultiplexer routing the output of the adder to the register in which the result is to be stored.

    In modern processors, operation blocks are capable of more than one kind of action. For example, a block may contain a multiplier in addition to the adder. For the technical implementation it is enough to place, in parallel with the adder, a multiplier framed by a demultiplexer and a multiplexer that choose whether to route the signal through the adder or the multiplier, depending on the operation code stored in the instruction register.
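
    A behavioral sketch of such an operation block in Python, connecting the register block, the instruction register, the operand selection, the two functional units, and the result routing; the opcode names and the register count are invented for the example.

      # Behavioral sketch of an operation block with an adder and a multiplier.
      # Opcode names and the register count are invented for the example.
      registers = [0] * 16
      registers[2], registers[5] = 7, 35

      def execute(instruction_register):
          op, src1, src2, dst = instruction_register
          a = registers[src1]          # first multiplexer: select an operand by its register number
          b = registers[src2]          # second multiplexer
          if op == "add":              # the opcode selects which functional unit the signal goes through
              result = a + b
          elif op == "mul":
              result = a * b
          registers[dst] = result      # demultiplexer's role: route the result to register dst

      execute(("add", 2, 5, 8))        # "add the values from registers No. 2 and No. 5 into No. 8"
      print(registers[8])              # -> 42
      execute(("mul", 2, 5, 9))
      print(registers[9])              # -> 245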

    § 3.2.1. Organization of the operating cycle: synchronization, superscalarity, pipelining
    The operation of the processor is a repetition of the following cycle, during which information travels around the loop "registers → operation blocks → registers":
    - load the instructions to be executed into the operation blocks;
    - wait for the results;
    - save the results to the registers.

    There are two approaches to organizing the processor cycle: synchronous and asynchronous. Synchronous processors have a common clock signal, the "conductor". When it changes ("the conductor waves his baton"), the values produced by the operation blocks are stored in the registers and the next instructions are sent to the operation blocks. In asynchronous processors there is no common conductor signal; the operation blocks themselves report when a result is ready. From the moment an instruction and its operands are available, the block begins tracking the readiness of the result; when the computation finishes, the block gives the go-ahead to store the result and requests the next task.

    Synchronous processors have a fixed clock frequency: the number of "waves of the conductor's baton" per second. The time between "waves" has to be chosen for the most pessimistic scenario, although in practice, in the vast majority of cases, an operation needs several times less time. Three factors affect the execution speed:
    - the initial state of the operation block: if the instruction or some of the arguments are loaded in advance, part of the block settles into the required state ahead of time;
    - the values of the operands: multiplying short numbers, for instance, takes much less time than multiplying long ones;
    - the level of thermal noise: at a processor temperature of 20 °C all operations complete much faster than at 90 °C.

    The synchronous approach cannot take these factors into account; moreover, the dependence of the noise level on temperature is determined by the concentration of crystal defects, a value that varies from chip to chip and grows as the processor ages. The clock frequency therefore has to be chosen for the worst case: a worn-out, mediocre chip operating at 85-90 °C. This fact is exploited by people who "overclock" processors: good chips with sufficient cooling keep working stably well above the clock frequency set by the manufacturer.

    Asynchronous processors begin the next operation as soon as the results of the previous one are ready. They have no notion of clock frequency, and their speed is always the maximum attainable under the current conditions: put a cup of hot coffee on an asynchronous processor and it slows down; cool it with liquid nitrogen and it speeds up. However, because of its complexity, asynchronous circuitry has so far been used only in a small number of experimental processors (from ILLIAC I in 1951 to the GA144 in 2010). In the future it will probably be used more and more, since no other avenues of optimization remain.

    For more than a decade, processors have carried several operation blocks of each type. Such processors are called superscalar: they can perform independent actions simultaneously, for example executing the instructions "c = a + b" and "d = a · b" in parallel. Another widely used optimization technique is pipelining: the instruction is written to the operation block before its operands are ready, so the multiplexers and demultiplexers have time to settle into the required state in advance and the operation time drops sharply. However, the full benefit of pipelining can be obtained only on an asynchronous processor, since a synchronous processor grants each instruction a predetermined number of "baton waves" to execute, and the gain from preloading instructions into blocks is comparatively small. If, in addition, the readiness of each operand is tracked separately, an asynchronous processor can gain time not only from preloading instructions but also from the earlier readiness of one of the operands. Sometimes an instruction can even complete before all of its operands are ready: if one operand of a multiplication is zero, the result is zero, without waiting for the second operand.

    The only effective way to track operand readiness separately is single assignment of registers. The term means that during the execution of each subroutine, every register is written at most once (values are never overwritten); when the subroutine finishes, all the registers it used are reset to the empty state. This makes the readiness check trivial: an operand is ready as soon as the register holding it is non-empty. A straightforward implementation of separate readiness tracking would require a huge number of registers, but in practice this is not necessary: a large number of logical registers can be emulated while actually making do with a small number of hardware registers.
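
    A minimal sketch of single-assignment registers with readiness tracking; the tiny dispatch loop and the instruction format are invented for the example.

      # Sketch: single-assignment registers with readiness tracking.
      # A register is "ready" as soon as it is non-empty; it is written at most once.
      EMPTY = None
      regs = [EMPTY] * 8

      def write(i, value):
          assert regs[i] is EMPTY, "single assignment: a register is never overwritten"
          regs[i] = value

      def ready(*indices):
          return all(regs[i] is not EMPTY for i in indices)

      # Pending instructions: (operation, source registers, destination register).
      pending = [("add", (0, 1), 2), ("mul", (2, 3), 4)]

      write(0, 7); write(1, 35); write(3, 2)      # operands become available in some order

      while pending:                              # naive dispatch loop
          for instr in list(pending):
              op, (s1, s2), dst = instr
              if ready(s1, s2):                   # fire only when both operands are ready
                  write(dst, regs[s1] + regs[s2] if op == "add" else regs[s1] * regs[s2])
                  pending.remove(instr)

      print(regs)   # -> [7, 35, 42, 2, 84, None, None, None]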

    § 3.3. Control blocks and interrupt block
    The control unit is responsible for distributing tasks among the operation blocks in accordance with the program stored in main memory. It loads program instructions from RAM, decodes them, and sends them in decoded form to the instruction registers of free operation blocks. In addition, it executes jump and branch instructions, that is, it selects one of the program's branches depending on the value of an operational register. A processor may contain several control units, which makes it possible to execute several programs in parallel.

    Central processors not only execute programs but also respond to hardware requests. A hardware request is a signal supplied by external devices to processor inputs specially allocated for this purpose; for example, clicking a mouse button sends the processor a request to handle that event. Hardware requests are handled by the interrupt block. On receiving a request, the interrupt block looks for a free control unit and orders it to run the handler program. If there are no free units, it picks one, temporarily interrupts the program running on it, and redirects it to handle the hardware request; when handling is finished, the interrupted program resumes. The interrupt block owes its name to the fact that most of its work consists precisely of interrupting and resuming the execution of programs.

    § 3.3.1. Control instructions: jumps, calls, branching, and loops
    A program is a set of numbers encoding instructions, located in consecutive memory cells. If for one reason or another consecutive instructions cannot be placed in consecutive cells, a jump instruction is normally used: on reading such an instruction, the control unit writes the address specified in it into the instruction pointer. In traditional architectures, the state of the remaining registers does not change during the jump.

    Historically, the main use of jump instructions was to call subroutines, but nowadays separate call and return instructions are used for that. The call instruction masks or clears all registers (spilling their values to specially allocated storage), except for the range of registers holding the arguments, so that the called subroutine starts "from a clean slate". (Masking is faster than moving values to storage, but it can only be implemented on processors that use dynamic register renaming.) Subroutines end with a return instruction, which clears the registers used by the subroutine and unmasks or restores the registers saved by the call, except for the range holding the subroutine's results.

    A special case is the so-called tail call: the situation in which some subroutine f ends with a pair of instructions of the form "call g and return the result of its execution to the level above". In this case, when calling g, there is no need to save or mask the current registers and the return point; instead, g can be told to return its result not to f but directly to the level above. The tail call differs from the jump described at the beginning of this section only in that it states explicitly which registers should be kept and which discarded at the jump. In the future, the tail-call jump will probably displace the traditional jump (which leaves the operational registers untouched), since the traditional jump interferes with asynchronous execution and single assignment of registers, while the tail-call jump does not. (In fact, the return instruction is itself a special case of the tail-call jump, but a discussion of this beautiful concept is beyond the scope of this text.)

    In addition to calls and jumps, control units execute branching, that is, they select program branches depending on the value of one of the operational registers. Branching is traditionally done with conditional jump instructions, but on asynchronous processors it is much better to use conditional call or conditional tail-call instructions instead. An instruction of this kind points to a table that records which subroutine to call for each value of one of the operational registers. With conditional (tail-)call instructions one can efficiently implement constructs such as if-then-else, match-case, conditional recursion, and hence every kind of loop; no additional control-flow operations are required.
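
    A sketch of such a table-driven conditional call in Python; the handler names are invented for the example, and an ordinary dictionary stands in for the hardware table.

      # Sketch: a conditional call driven by a dispatch table instead of branch instructions.
      # The handler names are invented for the example.
      def on_equal():   print("a and b are equal")
      def on_greater(): print("a is greater than b")
      def on_less():    print("b is greater than a")

      # The table maps the value of an operational register to the subroutine to call.
      dispatch = {0: on_equal, 1: on_greater, -1: on_less}

      def compare(a, b):                   # plays the role of the "<>" three-way comparison
          return (a > b) - (a < b)

      a, b = 3, 5
      dispatch[compare(a, b)]()            # conditional call: -> "b is greater than a"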

    Often, by the time execution approaches a branch point and computing resources are free, most of the arguments for the subroutines that might be called are ready, but the value that determines which of them will actually be called is not. In that case one can speculatively begin executing one or several of the most probable subroutines, avoiding only operations that affect the "outside world": storing values to shared RAM, accessing external devices, and returning from the subroutine. In many situations this optimization yields more than a twofold gain in performance.

    Branching example:
      match (a <> b) {
        case 0: {
          print "a and b are equal"
        }
        case 1: {
          print "a is greater than b"
        }
        case -1: {
          print "b is greater than a"
        }
      }
    The most common special case of branching is binary if-then-else branching, shown below first in its usual form and then as an equivalent match:
      if (condition) {
        do_this
      } else {
        do_that
      }
    
      match (condition) {
        case True: {
          do_this
        }
        case False: {
          do_that
        }
      }
    Recursion example:
      def gcd(a, b) = {
        if (b = 0)  return a
        else        return gcd(b, a % b)  // tail recursion
      }
    


    § 3.4. Conclusion
    Apparently, the architecture of central processors will remain von Neumann. The growth in performance will come from better parallelization of instructions and from the transition to asynchronous circuitry. To achieve the best possible parallelism, it will be necessary to move to more advanced instruction sets, in particular ones suited to single assignment of registers.

    Modern processors use from 1 to 16 control units and from 4 to 64 operation blocks. With the transition to asynchronous circuitry it will become justified to use several dozen control units and several hundred operation blocks. Such a transition, together with the corresponding increase in the number of blocks, will raise peak performance by more than two orders of magnitude and average performance by more than one order of magnitude.

    Part 4. ↁⅠⅩ wanted!

    In 1962, Donald Knuth began writing the famous book series The Art of Computer Programming, which describes efficient algorithms and analyzes their running time. To keep the ability to describe low-level algorithms and estimate their running time precisely, he decided not to use high-level programming languages but to write in assembly language. However, the architectures and assemblers of real processors are full of bizarre technical features that distract from the essence; in those days they were the result of necessary engineering compromises and technical tricks, while today their main source is the requirement of backward compatibility with older models. In order not to overload the reader with unnecessary and quickly outdated details, and to preserve the universality of the presentation, Knuth decided to design his own computer architecture intended specifically for teaching. This architecture is called MIX.

    Over the following three decades, major changes took place in computing, and MIX became largely outdated. To keep The Art of Computer Programming a relevant source, the author, together with leading industrial processor developers, designed MMIX, a new fictional computer that is an idealized and improved analogue of the computers of the second half of the 1990s. MMIX is a well-thought-out architecture, suitable for direct hardware implementation and noticeably more advanced than any processor architecture in wide use today.

    The desire to make a machine that could be implemented efficiently "in hardware" with the means of the late 1990s prevented a sufficient reserve for the future: unfortunately, the MMIX instruction set is practically unsuitable for realizing the potential of asynchronous superscalar processors. The problem is that the organization of calls in MMIX is based on traditional jumps and requires overwriting registers. For a practical study of promising programming techniques, this instruction set needs to be revised in line with the inevitable trends in the development of central processors described above; fortunately, that is not such a difficult task.
