Ballad of "Multiclet"

    No, I won’t tell you the riddle hiding in the name MCp0411100101, but I’ll try to answer nerudo’s comment in the “Multiclet processors” topic in more detail:

    Reading the description of the architectural innovations of this multiclet, I want to borrow a phrase from a neighboring topic: “I do not understand.”


    In short, MCp is a dataflow processor with an original EPIC architecture. EPIC stands for Explicitly Parallel Instruction Computing: computing with explicit instruction parallelism. I use the term here in this literal sense, as an abbreviation, and not as a reference to the Itanium architecture. Explicit parallelism in MCp is of a completely different kind.

    About the benefits of MCp


    For starters, I will say that the EPIC in MCp is such that it gives the processor a number of properties that are attractive (at least to me personally).

    1. Good energy efficiency, which is ensured by the fact that MCp can:
      • synchronize its architectural state much less often than processors of traditional architectures;
      • naturally overlap, in parallel and asynchronously, the execution of memory instructions, arithmetic instructions, and code prefetch.
    2. The non-standard organization of branching opens up interesting implementation possibilities for so-called managed runtimes, or (in classical terminology) safe programming languages.
    3. Functional languages can be mapped onto the MCp architecture “more naturally” than onto traditional machines.
    4. The way programs are encoded (machine code is meant) and the way they are executed make MCp a gracefully degrading processor. That is, it need not fail completely when functional units fail: it can naturally switch to a mode of operation with a different distribution of computations and lower performance. Unlike traditional fault-tolerant processors, in which functional units (or whole processors) are simply triplicated, MCp in normal mode, when no hardware errors occur, can use its computing power to the full.
    5. On top of all this, MCp can be scaled relatively easily (process technology permitting) and run in multi-threaded mode (meaning SMT, Simultaneous Multi-Threading), even with dynamic sharing of resources between threads.


    Now I’ll try to explain where these properties come from and where the name “multicellular” comes from, that is, what a “cell” is. I know nothing about the mysterious numbers in the part marking. Maybe they are the key to some MultiClet quest? :) No, seriously, I don’t know.

    Cells


    A cell (that is its name) is the main element of the MCp microarchitecture. There may be a simpler way to explain what it is, but it is easier for me to start with a description of existing processors.

    Any modern CPU contains a set of functional units. They can be divided into several types. Let's divide them (to explain the features of MCp I don't need a very detailed description, so everything here is superficial).

    1. ALU: arithmetic logic units, in a broad sense. That is, units that perform all kinds of transformations on data. They have input ports, to which an operation code and operands are supplied, and output ports, on which the results appear.
    2. LSU: memory access units (Load/Store Unit). Naturally, these do not transform data, but write it to memory or read it back. They have their own input and output ports.
    3. RF: register file. These units capture data from some buses (not quite the correct term, but that is beside the point) and transfer it to other buses, guided by the commands and values on their input ports. These buses are connected to LSU or ALU ports. It is often said that registers are a kind of fast internal processor memory. It is more correct to say that the RF is a very efficient internal processor memory, and the semantics of registers are the interface for accessing it. Because there exists His Majesty...
    4. CU: the control unit. This is the unit that drives the many ALUs, LSUs and RFs (it is now fashionable to have one shared RF per core, but it hasn't always been that way; I'm just clarifying), controlling signal transfer between them and thereby executing the program. In modern traditional high-performance processors, the CU itself is very complex and consists of further components: schedulers, branch predictors, decoders, queues, retirement buffers, etc. But for the purposes of this story it is more convenient for me to treat all of this as a single unit, in the same way that I do not break the ALU down into adders and shifters.


    We can say that the CU is the unit that determines the type of processor. In almost all modern traditional processors, the ALUs, LSUs and RFs are functionally the same (if you don't go into subtle implementation details and don't distinguish between vector and scalar ALUs; yes, I agree, the statement is a loose one). The whole variety of CPU, GPU, PPU, SPU and other xPU models comes from the difference in logic between different CU variants (and this difference is much more significant than the difference between vector and scalar ALUs).

    It may be the logic of a simple stack processor, whose CU operates in a trivial cycle. Read both registers from its RF, which consists of two registers: IP (pointer to the current instruction) and SP (top of the stack). Put the read operation code and the contents of IP on the LSU input ports (most likely, the CU in this case will simply connect the RF output to the LSU input) and get the answer: the instruction code. If, for example, it is a jump instruction, the CU should set the LSU ports to read the value at the top of the stack, change SP by one, send that value to the RF, and on the next clock switch the LSU output port to the RF input port (to another port, setting the value that selects a write to IP). Then repeat the cycle. Really simple, I think.
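
    To make this cycle concrete, here is a minimal sketch in C (all names and the instruction encoding are hypothetical, invented just to illustrate the loop described above):

    #include <stdint.h>

    enum { OP_JMP = 1 /* , ... the other opcodes ... */ };

    uint32_t mem[1 << 16];  /* memory, reached through the "LSU" */
    uint32_t ip, sp;        /* the whole "RF": two registers     */

    void cu_cycle(void)
    {
        for (;;) {
            uint32_t insn = mem[ip];     /* LSU read at the address in IP      */
            if (insn == OP_JMP) {
                ip = mem[sp];            /* LSU read from the top of the stack */
                sp -= 1;                 /* SP changes by one                   */
            } else {
                /* ... decode and execute the other opcodes ... */
                ip += 1;
            }
        }
    }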

    Or it may be the logic of a fancy superscalar, multi-threaded POWER8 with out-of-order execution, which fetches several instructions per cycle, decodes the previous fetch group, renames registers, drives a huge register file (even for i686 with its 16 visible registers, register files could be 128 entries of 64 bits), predicts branches, and so on. You can't build such a processor as a homework project.

    Or it can be a fairly simple RISC-like CU, which in a GPU distributes the same command to all the ALUs, LSUs and RFs packed into a multiprocessor.

    In today's high-performance CPUs, it is the CU that is the most complex device, occupying most of the chip. But that is not the point yet. The main thing is that in all these cases there is exactly one CU, even though it can simultaneously load with work and control many other functional units. This can be equivalently formulated as follows: in modern processors you can execute several control flows (chains of instructions) using one CU (SMT, for example Hyper-Threading); but you cannot execute a single control flow using multiple CUs.

    Yeah! My young Padawan (we are all young at heart, and we know that we know nothing :), the solution to the Multiclet mystery is close. Naturally, I will now say that a multicellular processor is designed so that it contains several CUs, which work according to a certain protocol and together form a single distributed CU capable of executing one thread of execution (one thread, that is, one control flow) on several cores. But first, a few other things.

    So, a cell is the analog of a core in a conventional CPU. It contains its own CU and ALU (one, but quite advanced even in the early version of the processor: capable of operations on float[2] values, including complex arithmetic; the version currently in development will support computation with double). Cells can have access to a common RF and LSU, or they can have their own, which can operate in mirrored mode or even RAID-5-style (if necessary; remember, the most important word at this stage of the project is “fault tolerance”). And one of the most pleasant things about the MCp architecture is that although the RF works much more slowly in such modes, MCp performance does not suffer significantly, because the main data exchange during computation goes not through the RF and bypass networks, but through another non-memory device: the switch.

    The main feature of the cells is that their CUs, working according to a special protocol and with a special representation of the program, together make up one distributed CU that can execute one thread (in the sense of a control flow). And they can execute this thread in a parallel, asynchronous, overlapped mode, in which instruction fetch, work with memory and the RF (this is done very neatly), arithmetic transformations, and computation of branch targets all happen simultaneously (the last one I am personally thrilled by, because pattern matching from high-level languages maps onto it perfectly). And, even more remarkably, these CUs turned out to be much :) that is, substantially, simpler than the CUs of modern superscalar out-of-order processors, which are also capable of such parallel execution of a program (I'll clarify: not thanks to simplicity and distribution, but rather thanks to complexity and centralization, which are needed to form special knowledge about the program being executed; more on this in the next part of the text).

    In my opinion (which may differ from the opinion of the engineers who developed and are improving MCp), the most important achievement in this processor is precisely these CUs, which provide the fault tolerance and energy efficiency that matter most at the current stage of the processor's existence. And the proposed principle of their construction is important not only for microprocessors, but also for other high-performance distributed computing systems (for example, the RiDE system is built on similar principles).

    Energy efficiency. MCp is a parallel processor capable of executing 4 instructions per cycle, which is not bad at all. And for this it does not need a complex and large (in area) central CU; it manages with relatively small units local to each cell. Small means less energy. Local means shorter signal wires, which means less dissipated energy and a higher frequency potential. All of this is +3 to energy efficiency.

    Fault tolerance. If the CU dies in a traditional processor, the entire processor dies. If one of the CUs dies in MCp, one of the cells dies, but the computation can continue on the remaining cells, albeit more slowly. To ensure reliability, conventional processors are traditionally triplicated: three processors are installed that run the same program, and if one of them starts to fail, this is detected and it is switched off. The MCp architecture allows a single processor to work in this mode by itself, and this can be controlled in software: when needed, it can compute in high-performance mode; when needed, in a mode with cross-checking of results, without spending additional hardware resources, which can also fail. Other modes are possible too (as far as I know, they have not been patented so far).

    The birth of nonlinearity


    Now I'll try to explain why such a distributed CU is possible at all, why it can really be simple, why it needs a different way of encoding the program, and why the encoding proposed by the authors of MCp is cool. Again, it's easier for me to start by describing traditional architectures (GPUs and VLIWs I also count as traditional).

    Let's finally compile something; I haven't compiled anything for two days now, and my hands are itching.

    cat test-habr.c && gcc -S test-habr.c && cat test-habr.s

    typedef struct arrst Arrst;
    struct arrst
    {
    	void * p;
    	char a[27];
    	unsigned x;
    };
    struct st2
    {
    	Arrst a[23];
    	struct st2 * ptr;
    };
    struct st2 fn5(unsigned x, char y, int z, char w, double r, Arrst a, Arrst b)
    {
    	int la[27];
    	char lb[27];
    	double lc[4];
    	struct st2 ld[1];
    	return ((struct st2 *)b.p)[a.a[((Arrst *)b.p)->a[13]]].ptr->ptr->ptr[lb[10]];
    }

    	.file	"test-habr.c"
    	.text
    	.globl	fn5
    	.type	fn5, @function
    fn5:
    .LFB0:
    	.cfi_startproc
    	pushq	%rbp
    	.cfi_def_cfa_offset 16
    	.cfi_offset 6, -16
    	movq	%rsp, %rbp
    	.cfi_def_cfa_register 6
    	subq	$1016, %rsp
    	movq	%rdi, -1112(%rbp)
    	movl	%esi, -1116(%rbp)
    	movl	%edx, %eax
    	movl	%ecx, -1124(%rbp)
    	movl	%r8d, %edx
    	movsd	%xmm0, -1136(%rbp)
    	movb	%al, -1120(%rbp)
    	movb	%dl, -1128(%rbp)
    	movq	56(%rbp), %rdx
    	movq	56(%rbp), %rax
    	movzbl	21(%rax), %eax
    	movsbl	%al, %eax
    	cltq
    	movzbl	24(%rbp,%rax), %eax
    	movsbq	%al, %rax
    	imulq	$928, %rax, %rax
    	addq	%rdx, %rax
    	movq	920(%rax), %rax
    	movq	920(%rax), %rax
    	movq	920(%rax), %rdx
    	movzbl	-134(%rbp), %eax
    	movsbq	%al, %rax
    	imulq	$928, %rax, %rax
    	leaq	(%rdx,%rax), %rcx
    	movq	-1112(%rbp), %rax
    	movq	%rax, %rdx
    	movq	%rcx, %rsi
    	movl	$116, %eax
    	movq	%rdx, %rdi
    	movq	%rax, %rcx
    	rep movsq
    	movq	-1112(%rbp), %rax
    	leave
    	.cfi_def_cfa 7, 8
    	ret
    	.cfi_endproc
    .LFE0:
    	.size	fn5, .-fn5
    	.ident	"GCC: (GNU) 4.7.2"
    	.section	.note.GNU-stack,"",@progbits


    This is traditional assembler for a register machine. That such code can be executed only with the help of a single CU is demonstrated very well by this chain (I remind you that in AT&T assembler the result is written to the right operand):
    	imulq	$928, %rax, %rax
    	addq	%rdx, %rax
    	movq	920(%rax), %rax
    	movq	920(%rax), %rax
    	movq	920(%rax), %rdx
    	movzbl	-134(%rbp), %eax


    Meditating on this code, we can say that the first five instructions must be executed strictly one after another, while the last, sixth, could run in parallel with all of them. But the instructions themselves do not directly contain the information that the code may be executed this way. In traditional register code, an instruction is bound to the data it processes in only one way: by its position in the linearly ordered chain of commands.
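
    The same chain rewritten schematically in C, with registers as variables, makes the dependency structure plain (the load64/load8 helpers are illustrative stand-ins for the movq/movzbl memory accesses; this is a picture of the dependencies, not something to run on arbitrary addresses):

    #include <stdint.h>

    static uint64_t load64(uint64_t a) { return *(const uint64_t *)(uintptr_t)a; }
    static uint8_t  load8 (uint64_t a) { return *(const uint8_t  *)(uintptr_t)a; }

    void chain(uint64_t rax, uint64_t rdx, uint64_t rbp)
    {
        rax = rax * 928;                  /* 1: needs rax                    */
        rax = rax + rdx;                  /* 2: needs the rax from 1         */
        rax = load64(rax + 920);          /* 3: needs the rax from 2         */
        rax = load64(rax + 920);          /* 4: needs the rax from 3         */
        rdx = load64(rax + 920);          /* 5: needs the rax from 4         */
        uint64_t eax = load8(rbp - 134);  /* 6: needs only rbp: independent  */
        (void)eax; (void)rdx;
    }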

    Each instruction is, in effect, an operator that acts on a certain value: the architectural state of the processor (in theory, of course, it also acts on memory and on the outside world, and for complete rigor and formality we would have to talk about that too; but the CU can only answer for the processor itself; mathematicians terribly dislike this, by the way), which was formed as the result of executing the previous instructions (taking branches into account, of course). To try to execute this code using several independent, asynchronous CUs, you need to answer the questions:
    • how to split the architectural state between these CUs?
    • how to determine to which part of the architectural state the next (which?) instruction should be applied in order to preserve the semantics of the code?
    • how to put everything back together afterwards?
    • is it necessary to exchange architectural states?


    To answer these questions, you need to analyze the code according to its sequential semantics, in which the next instruction is applied to the result of all the previous ones, and collect the results of this analysis in one place. That is, you need a single point of analysis: one CU. Everyone's favorite superscalars with out-of-order execution, Core iX, AMD FX or POWER8, are able to perform such an analysis and plan code execution based on it. The CUs of these processors are capable of such “miracles”: of parsing a chain of sequential code into independent parallel pieces, and of assembling them back into a sequential chain of retirement operations. But nothing comes for free. Such CUs are the most expensive parts of modern high-performance CPUs, both in transistor budget and in power consumption. These are very complex circuits; you could even call them masterpieces. It seems the invention of the first of them was rewarded with a Turing Award.

    It is probably better to see it once. This is a photograph of a VIA Isaiah die (not the most sophisticated of processors) with the functional blocks labeled. Everything not belonging to the caches, PLL, FP, IU and Load/Store is execution control circuitry.


    That is why ARM plays at big.LITTLE: relying on the traditional register architecture, they cannot make a processor that is both economical and fast. And that is why NVIDIA can beat its chest and report that one core of their processor has 4 times more arithmetic units than one core of Sandy Bridge. In SB, the remaining space is occupied by the circuitry that analyzes and schedules the execution of a stream of linearly ordered instructions.

    In VLIW, and in the EPIC variant from Intel and HP (Itanium), the semantics of instructions are exactly the same: they must be executed one after another, changing the architectural state of the processor (roughly speaking, the values in the registers). Yes, the instructions of such processors are complex and encode a large number of operations. But the requirement of a linear ordering of these complex instructions remains, so the CU in VLIW processors and Itaniums cannot be distributed either. Of course, in these processors the CU is much simpler than in processors with out-of-order execution. But these architectures have drawbacks of their own, which MCp does not have. For example, to sustain a high execution rate of VLIW code, the processors must have complex register files: in one of the Itanium models, up to 6 values can be written to the RF simultaneously and, correspondingly, 12 read. There is also a problem with memory accesses: if an instruction contains an operation touching a memory region not yet mapped into the cache (or other fast memory), the entire VLIW processor stops and waits for that operation to complete. Itanium tried to solve this problem with forced software prefetch, but it worked out well only for scientific computations, whose memory access patterns are regular, predictable, and derivable at compilation time. In the other high-performance markets, (so far) either RISC SMT processors or out-of-order processors reign, since they can do other work during the delays caused by memory accesses.

    So. Traditional encoding of a program as a linearly ordered sequence of instructions that refer to registers requires a centralized CU (a very complex one, if you want to work efficiently with memory and several ALUs), which must manage all the other units in the processor (hence, the more units, the bigger and more complex this CU), which consumes a lot of energy, and which is a single point of failure for the whole processor.

    At this point the Multiclet engineers pull off a graceful trick and pose a question to the public: why do we need to cling to a linear order when encoding a program? And then they answer it even more elegantly and brilliantly: we don't!

    Paragraphs and Jumps


    Hurrah! Compilation time again! Let's look at the same program through the eyes (no, we haven't attached eyes to it yet) of MCp.

    cat test-habr.c && rcc -target=mcp < test-habr.c

    typedef struct arrst Arrst;
    struct arrst
    {
    	void * p;
    	char a[27];
    	unsigned x;
    };
    struct st2
    {
    	Arrst a[23];
    	struct st2 * ptr;
    };
    struct st2 fn5(unsigned x, char y, int z, char w, double r, Arrst a, Arrst b)
    {
    	int la[27];
    	char lb[27];
    	double lc[4];
    	struct st2 ld[1];
    	return ((struct st2 *)b.p)[a.a[((Arrst *)b.p)->a[13]]].ptr->ptr->ptr[lb[10]];
    }
    


    What follows is a slightly cleaned-up assembler listing: for convenience I removed our debugging technical comments and the .local/.global directives (they do not change the essence). The .alias directive plays the role of a #define; it is used to improve code readability. All the non-optimality and naivety of the compiler has been preserved. It will remain at the beta stage for some time (it is, however, already usable for working with the processor and emits correct code), so do not judge too harshly. We know about the various optimization techniques and will introduce them gradually. In the meantime this is truly naive and suboptimal code, no point hiding it. But we are discussing the architecture of the processor itself, not the compiler.

    .alias SP 39	; stack pointer
    .alias BP 38	; function frame base pointer
    .alias SI 37	; source address
    .alias DI 36	; destination address
    .alias CX 35	; counter
    .text
    fn5:
    	.alias fn5.2.0C #BP,8
    	.alias fn5.x.4C #BP,12
    	.alias fn5.y.8C #BP,16
    	.alias fn5.z.12C #BP,20
    	.alias fn5.w.16C #BP,24
    	.alias fn5.r.20C #BP,28
    	.alias fn5.a.24C #BP,32
    	.alias fn5.b.60C #BP,68
    	.alias fn5.2.0A #BP,8
    	.alias fn5.x.4A #BP,12
    	.alias fn5.y.8A #BP,16
    	.alias fn5.z.12A #BP,20
    	.alias fn5.w.16A #BP,24
    	.alias fn5.r.20A #BP,28
    	.alias fn5.a.24A #BP,32
    	.alias fn5.b.60A #BP,68
    	.alias fn5.lb.27AD #BP,-27
    	.alias fn5.1.32RT #BP,-32
    	.alias fn5.2.36RT #BP,-36
    	.alias fn5.3.40RT #BP,-40
    	.alias fn5.4.44RT #BP,-44
    	.alias fn5.5.48RT #BP,-48
    	jmp	fn5.P0
    	getl	#SP
    	getl	#BP
    	subl	@2, 4
    	subl	@3, 56
    	wrl	@3, @2
    	setl	#SP, @2
    	setl	#BP, @4
    	complete
    fn5.P0:
    	jmp	fn5.P1
    	rdsl	fn5.y.8C
    	wrsb	@1, fn5.y.8A
    	complete
    fn5.P1:
    	jmp	fn5.P2
    	rdsl	fn5.w.16C
    	wrsb	@1, fn5.w.16A
    	complete
    fn5.P2:
    	jmp	fn5.P3
    	getsl	0x340
    	wrsl	@1, fn5.1.32RT
    	complete
    fn5.P3:
    	jmp	fn5.P4
    	rdsb	fn5.lb.27AD + 10
    	rdsl	fn5.1.32RT
    	mulsl	@1, @2
    	wrsl	@1, fn5.2.36RT
    	complete
    fn5.P4:
    	jmp	fn5.P5
    	rdl	fn5.b.60A
    	wrl	@1, fn5.3.40RT
    	complete
    fn5.P5:
    	jmp	fn5.P6
    	rdl	fn5.3.40RT
    	addl	@1, 0x11
    	rdsb	@1
    	exa	fn5.a.24A + 4
    	addl	@2, @1
    	rdsb	@1
    	rdsl	fn5.1.32RT
    	mulsl	@1, @2
    	wrsl	@1, fn5.4.44RT
    	complete
    fn5.P6:
    	jmp	fn5.P7
    	getsl	0x33c
    	wrsl	@1, fn5.5.48RT
    	complete
    fn5.P7:
    	jmp	fn5.P7.blkloop
    	rdl	fn5.3.40RT
    	rdsl	fn5.4.44RT
    	rdsl	fn5.5.48RT
    	addl	@2, @3
    	addl	@1, @2
    	rdsl	fn5.5.48RT
    	rdl	@2
    	addl	@1, @2
    	rdsl	fn5.5.48RT
    	rdl	@2
    	addl	@1, @2
    	rdl	@1
    	rdsl	fn5.2.36RT
    	addl	@1, @2
    	rdl	fn5.2.0A
    ; This horror is the setup for copying the structure :)
    	getl	0x0000ffff
    	patch	@1, @3
    	patch	@2, @3
    	setq	#SI, @2
    	setq	#DI, @2
    	getl	0xfcc1ffff
    	patch	@1, 0
    	setq	#CX, @1
    	getl	#MODR
    	or	@1, 0x38
    	setl	#MODR, @1
    	complete
    ; the copying itself; the registers numbered CX, SI and DI are updated automatically
    fn5.P7.blkloop:
    	exa	#CX
    	jne	@1, fn5.P7.blkloop
    	je	@2, fn5.P7.blkclean
    	rdb	#SI
    	wrb	@1, #DI
    	complete
    fn5.P7.blkclean:
    	jmp	fn5.PF
    	getl	#MODR
    	and	@1, 0xffffffc7
    	setl	#MODR, @1
    	complete
    fn5.1L:
    fn5.PF:
    	rdl	#BP, 4
    	jmp	@1
    	getl	#BP
    	rdl	#BP, 0
    	addl	@2, 4
    	setl	#BP, @2
    	setl	#SP, @2
    	complete
    


    It looks like ordinary assembler, but this impression is misleading. First, note that all the code is divided into sections between a label and a complete flag. Such a section is called a paragraph.

    Each paragraph contains a set of instructions. The instructions vary: some refer to registers (marked with #), but some refer to values obtained as results of previous instructions. Such a reference has the form @N, where N is the distance to the earlier instruction in the paragraph whose result the reference points to. The distance is written right in the instruction and, of course, is counted backwards, that is, from the current command upwards through the text.

    So, in fact, the @-references in a paragraph describe the dataflow graph of a certain section of the program. And the instructions are, for the most part, not operators acting on the previous architectural state, but simple operations to be performed on one or two values. The semantics of most MCp instructions are “lighter” than the semantics of traditional instructions. The architectural state is changed only by instructions that write to registers (setX) or to memory (wrX).
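
    As a toy illustration (this is not the real MCp binary format, just a model in C), a paragraph can be thought of as an array of operations whose operands are backward distances, and the result of “executing” it depends only on that graph:

    #include <stdio.h>

    enum op { LIT, ADD, MUL };

    struct insn { enum op op; int lit, b1, b2; };  /* b1, b2 are @-links */

    /* (3 + 4) * (3 + 4) as a dataflow graph, in the spirit of
     *     getsl 3; getsl 4; addsl @1, @2; mulsl @1, @1          */
    static const struct insn par[] = {
        { LIT, 3, 0, 0 },
        { LIT, 4, 0, 0 },
        { ADD, 0, 1, 2 },   /* @1 is the 4, @2 is the 3 */
        { MUL, 0, 1, 1 },   /* both operands are @1     */
    };

    int main(void)
    {
        int r[4];
        for (int i = 0; i < 4; i++) {
            const struct insn *in = &par[i];
            if      (in->op == LIT) r[i] = in->lit;
            else if (in->op == ADD) r[i] = r[i - in->b1] + r[i - in->b2];
            else                    r[i] = r[i - in->b1] * r[i - in->b2];
        }
        printf("%d\n", r[3]);   /* prints 49 */
        return 0;
    }

    The loop here evaluates in textual order, but any order that respects the @-links would give the same result, and that is exactly the freedom the cells exploit.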

    When executing a paragraph, the multicellular processor in effect reduces the dataflow graph described by the paragraph's code. It does so as follows.

    Say N cells are operating in the processor. Then cell number n, having received the address of the next paragraph to execute, starts reading the commands at positions N*k + n (k = 0, 1, ...) of that paragraph until it encounters a complete flag (the assembler makes sure it will encounter one). And it begins executing whichever instructions are ready for execution.
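
    A toy model in C of this fetch rule (names are illustrative only; real cells do this in hardware, and in parallel):

    #include <stdio.h>
    #include <string.h>

    /* The assembler lays paragraphs out so that every cell meets a
     * complete flag on its own sequence of positions N*k + n. */
    static const char *par[] = {
        "jmp fn5.P3", "rdsl fn5.1.32RT", "mulsl @1, @2",
        "wrsl @1, fn5.2.36RT", "complete", "complete",
    };
    static const int len = sizeof par / sizeof par[0];

    static void fetch(int n, int N)   /* cell n out of N */
    {
        for (int k = 0; N * k + n < len; k++) {
            int pos = N * k + n;
            printf("cell %d fetches [%d] %s\n", n, pos, par[pos]);
            if (strcmp(par[pos], "complete") == 0)
                break;   /* this cell is done fetching the paragraph */
        }
    }

    int main(void)
    {
        for (int n = 0; n < 2; n++)
            fetch(n, 2);   /* two cells: positions 0, 2, 4 and 1, 3, 5 */
        return 0;
    }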

    An instruction is considered ready for execution when the cell has received all the operands it needs. When an instruction completes, the cell announces its result to all other cells, and to itself, through the processor's broadcast switch. The result is assigned a tag, by which it can be matched with an @-reference (these are called references to the switch) of some other instruction stored in the buffer of this or another cell.

    Then the cycle of finding ready-to-execute instructions and broadcasting their results to the other cells repeats. Of course, the buffers that track instruction readiness are relatively complex devices, but they are much, much simpler than the CUs of traditional out-of-order processors. And compared with VLIW and the other EPIC processors, the cells calmly tolerate delays in accessing data in memory (or in registers), because their CU buffers will most likely contain ready-to-execute instructions that do not depend on the result of that particular read (or write).
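
    The readiness rule itself can be sketched like this (hypothetical structures; the real buffers are hardware): an instruction sits in a cell's buffer until results with the tags its @-links resolve to have arrived over the switch:

    #include <stdbool.h>

    struct pending {
        int  need[2];   /* tags of results this instruction still waits for */
        int  nneed;     /* how many operands it takes                       */
        bool ready;     /* may now be issued for execution                  */
    };

    /* Called in every cell whenever any cell broadcasts a result tag. */
    static void on_broadcast(struct pending buf[], int nbuf, int tag)
    {
        for (int i = 0; i < nbuf; i++) {
            bool ok = true;
            for (int j = 0; j < buf[i].nneed; j++) {
                if (buf[i].need[j] == tag)
                    buf[i].need[j] = -1;          /* operand has arrived */
                ok = ok && (buf[i].need[j] == -1);
            }
            buf[i].ready = ok;
        }
    }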

    But that is not all. Cells work with memory and the RF in a special way. Since the MCp architecture does not assume that instructions are applied to an architectural state, no order is fixed among the RF and memory access operations within a single paragraph. It is only guaranteed that all write operations will complete before the first read of data from memory or the RF in the next paragraph.

    Remember the solid Load/Store/MOB block in the Isaiah photograph? Most of it is the MOB, the Memory Ordering Buffer: a special device that helps decide whether one read or write operation may be performed before (or after) another. This analysis is also hard: you have to compare addresses and dependencies. That is why the circuit is large.

    A cell always answers: YES! All in-flight memory and RF operations within a paragraph may be reordered arbitrarily. And this is possible because the programmer and the compiler have a means of building the required ordering explicitly, with @-references and paragraphs. For example, the construction:
    volatile int a;
    a += a;
    

    translates into code like this (slightly schematic):
    	rdsl	a
    	rdsl	a
    	addsl	@1, @2
    	wrsl	@1, a
    

    in which the write will be performed strictly after both reads have been made, even on cells that do not track the order of memory accesses. Thus, the required memory ordering can be established without the help of a complex MOB, which adds another +1 to MCp's energy efficiency.

    But that is not all. Now my favorite part. Dataflow architectures have appeared many times in the history of processor engineering, and each time they were discarded, because on a purely dataflow machine it is very hard to organize branches and loops. Organizing them purely as a dataflow graph implies that all possible branches must be described in that graph in advance, and the processor must load and execute such a graph as a whole. For large programs this is impossible.

    In MCp this problem is solved elegantly and simply. Besides the non-linear order of instructions within a paragraph, there is also a sequential, linear (well, to be completely accurate, linear temporal) order of execution of the paragraphs themselves. Each paragraph may contain several jump instructions: jmp (unconditional) and jCC (various conditional jumps), which compute the address of the paragraph to be executed next.

    As soon as one of the jump instructions fires, and once all cells have reached the complete flag of the current paragraph, they begin fetching the instructions of the next paragraph. That is why, in the translation of the C source above, the jmp instructions stand at the very beginning of the paragraphs, and the code still executes correctly.

    This feature of MCp already provides qualitatively new possibilities in code generation. Consider, for example, a program with this structure:
    	doSomething;
    	if(condition)
    	{
    		doStuff;
    	}
    


    When translating this construct into code for traditional processors, it is impossible to overlap the computation of the branch target implied by condition with the execution of doSomething. You can overlap the computation of condition itself with it, by spending a register to hold the result (and in most cases registers are too precious, so this is almost never done). But the conditional control-transfer instruction itself can only stand strictly after doSomething.

    Until the processor has executed doSomething, it must not perform the jump. Modern traditional high-performance processors, thanks to their complex CUs, may decline to stall at this point and instead begin speculative execution of the branch, guessing which way execution will go. To guess correctly more often, they use branch predictors. When the branch condition is finally computed, they can say to themselves either: oh, well done me, I guessed right!, or: oh no, oh no! I guessed wrong, now everything must be redone.

    In MCp, the branch target can be computed by the code above easily, without strain and ahead of time, in the same paragraph as doSomething (provided, of course, that the data dependencies allow it). Therefore the cells in MCp can avoid starving (running out of instructions) without speculative execution (not wasting energy on it) and without a branch predictor (not spending transistors and energy on it). Minus one branch predictor, with no loss of instruction fetch rate: +1 to energy efficiency.
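
    In software one can draw a rough analogy using the GNU C computed-goto extension (illustrative only; MCp does this in hardware, within one paragraph): the target is computed up front, alongside the useful work, and the final jump is then just a dispatch:

    static void doSomething(void) { /* ... */ }
    static void doStuff(void)     { /* ... */ }

    void example(int condition)
    {
        void *target = condition ? &&stuff : &&done;  /* target known early   */
        doSomething();                                /* useful work proceeds */
        goto *target;                                 /* dispatch, no guessing */
    stuff:
        doStuff();
    done:
        return;
    }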

    So, MCp really does have a distributed CU, formed by the several interacting CUs of the cells. These CUs really are simple, and they provide overlapped parallel processing: instruction fetch, arithmetic, and memory and RF access operations.

    Cell growth difficulties


    Such are some of the features MCp has. Next I will try to return to the beginning of the text and argue for the declared benefits more thoroughly. But first I have to mention some problems that knowledgeable readers have probably already spotted. MCp has a couple of important problems; one will be resolved in future versions of the processor, and the other is still undergoing an intensive brainstorm.

    The first problem is interrupt handling. For now, interrupts are handled rather crudely, according to the following principle. If during the execution of a paragraph there have been no writes to memory or the RF and no accesses to peripherals, the execution of the paragraph is abandoned (as if it had never started) and control is transferred to the interrupt handler. If one of those operations has already been performed, the processor waits for the paragraph to complete and only then transfers control to the handler. Clearly, for real-time systems this behavior is not optimal (to put it mildly). But the problem has a solution.

    First, for everything to work correctly, peripherals in most cases have to be accessed with interrupts disabled anyway. Secondly, write operations can be accumulated in a special write-back buffer, with their actual execution starting once the paragraph completes. Such a WB queue is, of course, a limited resource, but since MCp tolerates delays in working with memory and the RF (yes, I repeat, registers in MCp can be relatively slow, which, by the way, is another +1 to energy efficiency), this will not be a big problem. With this queue, interrupts will be handled without additional unpredictable delays.

    The second problem is the virtual memory management unit (MMU). Nothing is clear with it yet. On the one hand, there are the traditional OSes one wants: Linux, Plan 9 :) On the other hand, an MMU seems to be an expensive pleasure: it is believed that in SPEC tests 17.5% of CPU time is spent on MMU management, and Sun, in a fit of Java advocacy, counted as much as 40% on certain workloads. On the third hand, why have an MMU in control systems and supercomputing (the immediate targets for MCp)? We compute on CUDA without any virtual memory. On the fourth hand, contemplating the progress of managed (safe) languages and the growing popularity of Java, .Net, JavaScript and Go, one wonders: should the focus be on supporting such languages instead?

    On the fifth hand, the nature of interrupt handling, where the rollback point can only be the beginning of a paragraph, can cause problems. Suppose the translation cache (TLB) in the MMU holds 32 translations, and a paragraph contains 33 reads from different pages of memory. Such a paragraph can never complete. Here something has to be specially coordinated and limited. And so on. In short, the brainstorming over this is in full swing.

    Justification of the benefits of MCp


    Let us finally go through the points declared at the beginning of this article. Energy efficiency has probably bored you already :) But I repeat: it stems from the ability to execute code in parallel, at a good fetch rate, using a simple distributed CU. In addition, the architecture lets you choose operating scenarios. Suppose, for example, the system decides it can run in a low-performance, reduced-power mode. Then it can select one cell, tell it that from now on it must fetch the instructions at positions N*k + n (k = 0, 1, ...) with N = 1 and n = 0, and switch the remaining cells off. After that, the same code will be executed by a single cell. Profit? PROFIT! And no big.LITTLE is needed.

    Exactly the same mechanism provides graceful degradation. If during the execution of some paragraph it is discovered that a cell has died, the surviving cells must be reprogrammed, changing each one's N and n, after which the computation can be retried (and it, of course, should start sending SMS messages to everybody: trouble! trouble!).
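
    A sketch of what such reprogramming amounts to (the interface is hypothetical): the survivors are renumbered so that the same N*k + n fetch rule covers every position of a paragraph again:

    #include <stdbool.h>

    /* After a failure, give the surviving cells consecutive numbers
     * n = 0, 1, ... and set N to the count of survivors. */
    static void remap_cells(const bool alive[], int total,
                            int n_of[], int *N)
    {
        int n = 0;
        for (int c = 0; c < total; c++)
            if (alive[c])
                n_of[c] = n++;
        *N = n;   /* the fetch rule N*k + n now skips no positions */
    }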

    In managed runtimes you often need to check the behavior of code as it executes, because it is not always possible to guarantee a program's correctness by static analysis alone (in theory it is possible, of course, but in practice, how is a compiler to verify that some furious physics function has certain properties?). The control-transfer features of the MCp architecture allow such checks (for example, array bounds checks), together with the transfer of control to an exception handler, to be mixed into the paragraphs that perform the useful computation. In traditional architectures code cannot be mixed like this, because the control-transfer instruction must stand immediately before the code whose validity it guards.

    Fans of functional programming should be glad that MCp instructions are not operators with side effects, but pure operations whose result depends only on the operands. This, for example, simplifies code verification, which is important for dependable applications. In addition, thanks to the same branching feature, everyone's favorite pattern matching can be performed more efficiently. For example, when executing this code (a classic):

    fib :: (Integral t) => t -> t
    fib 0 = 1
    fib 1 = 1
    fib n = fib (n - 1) + fib (n - 2)
    


    A CPU with a traditional architecture will first check whether the argument of fib equals zero and branch accordingly, then check 1, and so on. A fancy out-of-order superscalar will try to perform the subsequent checks speculatively, after the first one, but will spend its internal buffers and energy on that. MCp can carry out all three checks in parallel within a single paragraph, without particular strain and saving electricity.

    Finally, scaling and hardware threads. The algorithm of fetching instructions, broadcasting results and picking ready-to-execute commands is not rigidly tied to the number of cells. There can be as many cells as the manufacturing process allows (how many transistors fit, wire lengths, etc.). The developers have simulated a 16-cell processor in sophisticated industrial software that is trusted to answer reliably the question: can such a processor actually be built, and will it run at the given frequency? The answer was positive (I don't know the details). But it is clear that the architecture itself imposes no particular limits, and such scaling, I repeat, requires no change to the cell architecture.

    Hardware threads also fit easily into the MCp algorithms and protocols. For a multicellular processor to support the execution of several threads, it suffices to replicate the register files (and, in future versions, the buffers accumulating write operations) and the instruction counters, and to add a thread identifier to each tag value in the switch. Nothing more is needed. The number of cells running each thread can then be adjusted simply by reprogramming the counters mentioned above.
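
    Schematically (sizes and field names here are illustrative, not the real hardware layout), the per-thread state amounts to this:

    #include <stdint.h>

    struct hw_thread {
        uint64_t rf[40];    /* a register file copy (the listing uses #0..#39) */
        uint64_t pc;        /* this thread's current paragraph address         */
        /* ... in future versions: a write-back buffer for pending stores ... */
    };

    struct switch_tag {
        uint16_t thread;    /* which hw_thread the result belongs to */
        uint16_t result;    /* which instruction produced the result */
    };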

    That was a lot of words. Thank you for your attention!


    I hope this voluminous text (sorry, a shorter one didn't work out) helps me share with someone the pleasure I get from working with this processor.
