Microprocessor "out of the garage"
Surely everyone who works with electronics and FPGAs knows the opencores.org website, which hosts a lot of useful (and not so useful) designs: dozens, maybe hundreds, of implementations of processors and peripherals, both re-implementations of existing devices and new developments. This article discusses a 32-bit microprocessor with an original instruction set, built on the Mars rover2 board.
Our team has been working on the L4 microkernel for 10 years, and at some point we realized that the microkernel itself could be implemented as a unit of the processor. And even if a full-fledged microkernel is very hard to implement in hardware, the software part can at least be helped by moving some functions into hardware. At first we decided to take the easy and optimal route: study the existing solutions, choose a suitable one, and add features useful to the microkernel. The work took about a month, and we studied almost every solution we found on opencores. Taking a ready-made solution as a basis offered quite good opportunities in the form of ready-made compilers and various libraries. At some point, however, we stopped liking the existing solutions: some turned out to be complicated, some not optimal, some simply unfinished, and some, to be honest, had a not very suitable license and terms of use. Gathering our courage and gritting our teeth, we took a gamble and decided to develop a processor from scratch.
What does a microprocessor start with? Ask a systems programmer and the answer will be: the instruction set. Despite the universal fashion for RISC architectures, we decided not to tie the instruction length to the size of the machine word, so we ran several experiments. Oddly enough, a very convenient tool for designing an instruction set turned out to be ... Microsoft Excel. First we set aside several columns for numbering the instructions in three numeral systems: decimal, hexadecimal, and binary. That gave 256 rows, one for each value that fits in one byte. Then we tried to group the instructions logically so that the decoding circuit would be as simple as possible. The first block was occupied by single-byte instructions: prefixes, modifiers, and simple instructions.
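The spreadsheet layout described here is easy to reproduce in a few lines. The sketch below is in Python rather than Excel, `opcode_rows` is a hypothetical helper name, and the mnemonic column is left empty because the article does not give the full opcode map.

```python
def opcode_rows():
    """Return one row per possible first byte: (decimal, hex, binary, mnemonic)."""
    rows = []
    for code in range(256):
        # The last column is a blank slot, to be filled in while grouping
        # instructions so that related opcodes share bit patterns.
        rows.append((code, f"{code:02X}", f"{code:08b}", ""))
    return rows


if __name__ == "__main__":
    # Print the first few rows of the table.
    for dec, hexa, binary, name in opcode_rows()[:4]:
        print(f"{dec:3d}  {hexa}  {binary}  {name}")
```

Having the binary column next to the hexadecimal one makes it easy to see which opcodes share their upper bits, which is what keeps the decoder simple.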

The next step was to decide on the number and types of registers. How many registers do you think are optimal for most tasks? The answers vary greatly with the person asked: for some even 32 are not enough, while others advocate register-less architectures. We settled on 16 general-purpose registers. This number is quite convenient for assembly language programming, fits our architecture well, and is easy to implement in HDL.
Having settled on the registers, we decided to make the instruction set completely position-independent: the architecture has no jump instruction to an absolute address, and all jumps are made relative to the current instruction. We are simply obsessed with compactness, so every jump instruction has three forms, with signed offsets of 1, 2, and 3 bytes. For example, jumps with a 16-bit offset are shown below:
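The following is not the original encoding table, only a rough Python sketch of how a relative jump target could be computed. It assumes, without confirmation from the article, that the offset is stored big-endian and is added to the address of the jump instruction itself; `jump_target` is a hypothetical name.

```python
def jump_target(instr_addr: int, offset_bytes: bytes) -> int:
    """Compute a relative jump target from a 1-, 2- or 3-byte signed offset.

    Assumptions (not confirmed by the article): the offset is big-endian,
    it is relative to the address of the jump instruction, and addresses
    wrap around at 32 bits.
    """
    if len(offset_bytes) not in (1, 2, 3):
        raise ValueError("offset must be 1, 2 or 3 bytes long")
    offset = int.from_bytes(offset_bytes, byteorder="big", signed=True)
    return (instr_addr + offset) & 0xFFFFFFFF


if __name__ == "__main__":
    # The same 16-bit offset gives the same relative distance (-16 bytes)
    # wherever the code happens to be loaded.
    for base in (0x1000, 0x5000):
        print(hex(jump_target(base, b"\xff\xf0")))
```

Under these assumptions the three forms reach about ±128 bytes, ±32 KB, and ±8 MB, so short loops cost only one offset byte.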

Finally, we abandoned the concept of a hardware stack in favor of a stack organized "by convention". For this we introduced a special NOTCH prefix with the following scheme: if a conditional or unconditional jump instruction is preceded by this prefix, the address of the next instruction, i.e. the return address, is placed in register R15. Accordingly, the RETURN instruction jumps to the address held in R15. With nested subroutine calls, the responsibility for saving the return address therefore rests with the programmer or the compiler. At first glance this seems neither convenient, nor optimal, nor familiar, but on reflection it has several advantages. First, leaf routines (routines that do not call other subroutines) save several clock cycles because they do not have to store this register in external memory. Second, the NOTCH prefix can be placed before a conditional branch instruction, which gives conditional function calls: a small saving, but a saving nonetheless. As for the added complexity of assembly programming, it is hidden behind macros, which act as higher-level assembler mnemonics.
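The calling convention can be modelled in a few lines. This is a toy Python model, not the real core: the class `Cpu` and its methods are hypothetical, addresses are absolute for readability (the real jumps are relative), and it assumes that a NOTCH-prefixed branch that is not taken leaves R15 untouched.

```python
class Cpu:
    """Toy model of the return-address convention described in the article."""

    def __init__(self):
        self.regs = [0] * 16   # R0..R15; R15 holds the return address
        self.stack = []        # software stack kept "by convention"
        self.pc = 0

    def jump(self, target, next_addr, notch=False, condition=True):
        """Jump to target; with the NOTCH prefix, also record the return address."""
        if not condition:
            self.pc = next_addr        # branch not taken: fall through
            return
        if notch:
            self.regs[15] = next_addr  # NOTCH turns the jump into a call
        self.pc = target

    def ret(self):
        """RETURN: jump to the address held in R15."""
        self.pc = self.regs[15]

    def push_r15(self):
        self.stack.append(self.regs[15])

    def pop_r15(self):
        self.regs[15] = self.stack.pop()


def demo():
    """Nested call: main -> outer (saves R15) -> leaf (does not)."""
    cpu, trace = Cpu(), []
    cpu.jump(0x200, next_addr=0x104, notch=True)  # main calls outer
    cpu.push_r15()                                # outer is not a leaf
    cpu.jump(0x300, next_addr=0x208, notch=True)  # outer calls leaf
    cpu.ret()                                     # leaf returns, no stack traffic
    trace.append(cpu.pc)
    cpu.pop_r15()
    cpu.ret()                                     # outer returns to main
    trace.append(cpu.pc)
    return trace


if __name__ == "__main__":
    print([hex(a) for a in demo()])
```

The leaf routine never touches the stack, which is where the saved clock cycles come from. Only routines that make further calls pay for a push and a pop.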
Position-independent code raises another issue: access to constant data. Since the code can be located at an arbitrary address, the constant data that travels with it can end up at an arbitrary address too. The solution turned out to be quite simple: when the same NOTCH prefix is used with an instruction that loads a constant into a register, the constant is treated as an offset relative to the instruction being executed. This solves the problem of addressing data in position-independent code.
After designing the instruction set, which took about a year in total, we armed ourselves with the Quartus and Icarus Verilog environments and ... realized that we had been too hasty. Implementing the instruction set in Verilog turned out to be pretty darn complicated. Knowledgeable people advised us to try our solutions out on a software model first, writing the decoder and the other functional units in ordinary C. After we implemented an emulator of the not-yet-existing processor and ran test programs on it, things went better. Another six months were needed to implement the processor in Verilog. It should be said that FPGA programming can be incredibly difficult for a beginner, and many years of experience with high-level languages can even make it harder. This is where simulation tools come to the rescue. At the first stage, Icarus Verilog, a free circuit simulator, together with GTKWave, a waveform viewer, turned out to be extremely useful. With these tools you can see what is happening inside the device at any moment. At some stage the capabilities of Icarus Verilog were no longer enough, and we switched to the ModelSim simulator from Mentor Graphics, a very powerful commercial tool, a stripped-down version of which can be installed for free together with the Altera Quartus environment.
One could talk about the debugging process for a long time. At some point, when a full third of the FPGA resources were occupied, we suddenly realized that the resulting processor could already be used in some projects.
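A small Python sketch shows why instruction-relative addressing finds the constant wherever the code is loaded. The helper names `load_relative` and `place` are hypothetical, and it assumes that the offset is counted from the address of the load instruction and that the constant is a big-endian 32-bit word; the article confirms neither.

```python
def load_relative(memory: bytes, instr_addr: int, offset: int) -> int:
    """Read a 32-bit constant located `offset` bytes from the load instruction.

    Assumptions: the offset is relative to the instruction's own address
    and the word is stored big-endian.
    """
    addr = instr_addr + offset
    return int.from_bytes(memory[addr:addr + 4], "big")


def place(image: bytes, base: int, size: int = 64) -> bytes:
    """Copy a code image to address `base` inside a zero-filled memory."""
    memory = bytearray(size)
    memory[base:base + len(image)] = image
    return bytes(memory)


# Image: 8 bytes standing in for code, followed by one constant.
IMAGE = b"\x00" * 8 + (0x05F5E100).to_bytes(4, "big")

if __name__ == "__main__":
    for base in (0, 16, 40):
        # The instruction sits at `base`; its constant is always 8 bytes further.
        print(base, hex(load_relative(place(IMAGE, base), base, 8)))
```

The offset 8 is fixed at assembly time, while the base address is only known at load time. The lookup never needs the base to be patched into the code.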

To demonstrate the capabilities of the processor, we wrote a very simple firmware that displays the following menu on the remote terminal at startup:
┌──────────> Welcome to Everest core <──────────┐
│ 1 - Load binary file via X-modem protocol     │
│ 2 - Run previously loaded binary file         │
│ 3 - Show RAM (0x100000-0x100140)              │
│ 4 - Test of message registers                 │
│ 5 - Show previously loaded ANSI picture       │
│ 6 - Show built-in ANSI pic # 1                │
│ 7 - Show built-in ANSI pic # 2                │
└───────────────────────────────────────────────┘
If you press 1 and your terminal supports file transfer over the X-modem protocol, you can upload a file of up to 4 KB. This can be text or an ANSI picture, in which case pressing the 5 key displays the text or picture on the screen. But would that alone be worth an article? Of course not. When you press the 2 key, control is transferred to the code loaded through the first menu item. If you transfer control to uploaded text or an ANSI picture, then after a few steps the processor will stumble upon a nonexistent (not yet defined) instruction or access nonexistent memory. In that case the processor switches to single-step mode: each character received from the terminal causes one processor instruction to be executed, and the bus state is printed to the remote terminal.

At that point it is time to press Reset. We assigned "reset" to the left button on the Mars rover2 board.
To make the device do something meaningful, you will need the Macro Assembler. In this archive, in addition to the assembler itself and a few examples, we have placed the source code of the processor microcode. Below is an example of a simple user program that can be converted to a binary file with the assembler and loaded into the processor.
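function user_main
    load r14, 0x2000        ; set up R14, the software stack pointer
    push r15                ; save the return address
loop:
    call _get_sysclock      ; R1:R0 = clock pulses since power-on or reset
    load r2, 0x05F5E100     ; clock pulses per second
    call _div64             ; convert pulses to seconds
    call _print_dec         ; print the number in decimal
    lea r1, $shw_str
    call _puts              ; print the trailing text
    call _uart_status
    rcr r0, 2               ; RCV_RDY bit into the carry flag
    jc done                 ; exit the loop if a key was pressed
    load r0, 0x01000000
    call _delay             ; pause before the next update
    jmp loop
done:
    pop r15                 ; restore the return address
    return
end
include tty.asm
include delay.asm
include mul.asm
include div.asm
include print_dec.asm
include sysclock.asm
$shw_str db ' seconds since boot',13,10,0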
Until a key is pressed in the terminal, this program loops, printing to the remote terminal the number of seconds since the device was started or reset. To try it in action, you will need the generated usr_demo2.bin file.
A short explanation of the program. The _get_sysclock subroutine returns the number of crystal oscillator pulses since the device was powered on or reset. A dump of the subroutine:
; ------------------- _get_sysclock ------------------
0198: 37 e3 ; DEC R14, 4
019a: 60 e3 ; MOV (R14), R3
019c: e3 ff fe ff f8 ; LOAD R3, 0xfffefff8
01a1: 68 03 ; MOV R0, (R3)
01a3: 36 33 ; INC R3, 4
01a5: 68 13 ; MOV R1, (R3)
01a7: 68 3e ; MOV R3, (R14)
01a9: 36 e3 ; INC R14, 4
01ab: 05 ; RETURN
On exit from the _get_sysclock subroutine, register R0 contains the low 32 bits of the result and register R1 the high 32 bits.
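The dump can be decoded mechanically, which gives a feel for how compact the encoding is. The Python sketch below covers only the six opcodes that appear in this one listing, and the field layout is inferred from the listing alone. In particular, reading the low nibble of INC/DEC as "amount minus one" is a guess. `disassemble` and `LISTING` are hypothetical names.

```python
# Machine code bytes copied from the _get_sysclock dump.
LISTING = bytes.fromhex("37e3 60e3 e3fffefff8 6803 3633 6813 683e 36e3 05")


def disassemble(code: bytes, base: int = 0):
    """Decode only the opcodes seen in the _get_sysclock listing."""
    out = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == 0x05:
            text, size = "RETURN", 1
        elif op in (0x36, 0x37):
            # Guess: high nibble = register, low nibble = amount - 1.
            reg, step = code[pc + 1] >> 4, (code[pc + 1] & 0x0F) + 1
            name = "INC" if op == 0x36 else "DEC"
            text, size = f"{name} R{reg}, {step}", 2
        elif op == 0x60:
            dst, src = code[pc + 1] >> 4, code[pc + 1] & 0x0F
            text, size = f"MOV (R{dst}), R{src}", 2
        elif op == 0x68:
            dst, src = code[pc + 1] >> 4, code[pc + 1] & 0x0F
            text, size = f"MOV R{dst}, (R{src})", 2
        elif op & 0xF0 == 0xE0:
            # Register number in the opcode, then a big-endian 32-bit immediate.
            imm = int.from_bytes(code[pc + 1:pc + 5], "big")
            text, size = f"LOAD R{op & 0x0F}, 0x{imm:08x}", 5
        else:
            raise ValueError(f"opcode 0x{op:02x} is not covered by this sketch")
        out.append((base + pc, text))
        pc += size
    return out


if __name__ == "__main__":
    for addr, text in disassemble(LISTING, base=0x198):
        print(f"{addr:04x}: {text}")
```

The whole routine, including saving and restoring R3 through R14, fits in 20 bytes.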
The constant 0x05F5E100 (100,000,000 in decimal) is the number of clock pulses in one second.
You can download the latest firmware for the Mars rover2 board here.
If you do not hear from our project for a while, it means we are busy moving the L4 microkernel onto the FPGA.
Thank you for your attention.
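The arithmetic behind the demo program can be checked with a short Python sketch. The function name `seconds_since_boot` is hypothetical; it simply joins the R1:R0 pair into one 64-bit value and divides by the constant above.

```python
CLOCK_HZ = 0x05F5E100  # 100,000,000 pulses per second, from the article


def seconds_since_boot(r0: int, r1: int) -> int:
    """Combine R1:R0 into a 64-bit pulse count and convert it to whole seconds."""
    ticks = ((r1 & 0xFFFFFFFF) << 32) | (r0 & 0xFFFFFFFF)
    return ticks // CLOCK_HZ


if __name__ == "__main__":
    print(seconds_since_boot(0x05F5E100, 0))  # exactly one second of pulses
    print(seconds_since_boot(0, 1))           # the low word has just overflowed
```

At this clock rate the low 32-bit word alone overflows after about 43 seconds, which is why the counter is 64 bits wide and the program needs a 64-bit division routine.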