Forth processor in VHDL
In this article I will show how to write a processor in VHDL yourself. There will not be much code (at least I hope so). The full code is posted on GitHub, where you can also see several iterations of the design.
The processor belongs to the class of soft processors.
Architecture
First of all, you need to choose the processor architecture. I will use a RISC architecture for the processor and a Harvard architecture for the memory.
The processor has no pipeline and alternates between two states:
- Fetching a command and its operands
- Executing the command and saving the result
Since we are writing a Forth processor, it will be stack-based. This reduces the width of the command, because the command does not have to hold the codes of the registers used in a calculation. Operations have access to the top two numbers on the stack.
The data stack and the return stack will be separate.
The FPGA has block memory with a configuration of 18 bits * 1024 cells. With that in mind, I chose a command width of 9 bits (2048 commands fit in one memory block).
The data memory width will be the “standard” 32 bits.
“Communication” with peripheral devices is implemented through a bus.
The block diagram of all this mess looks approximately as follows.
Command set
With the architecture settled, it is time to "try to take off with all this": we need to come up with a command set.
All processor commands can be divided into several groups:
- Loading literals (numbers) onto the stack
- Transitions (conditional branch, subroutine call, return)
- Access to data memory (read and write)
- Access to the bus (the same idea as memory access)
- ALU commands
- Other commands
So, we have 9 bits per command, and everything has to fit into them.
Loading literals
The command is narrower than the data, so a mechanism for loading numbers is needed.
I chose the following command format for loading literals onto the stack:
Mnemonic | Bit 8 | Bits 7..0 |
---|---|---|
LIT | 1 | Number |
The most significant bit (bit 8) of the command flags a number load. The remaining 8 bits are the number pushed onto the stack.
But the data width is 32 bits, and so far only 8 bits can be loaded.
Let's agree that several LIT commands in a row load a single number. The first command pushes its value onto the stack (sign-extending it); each subsequent one modifies the top number on the stack, shifting it 8 bits to the left and writing the value from the command into the low byte. Thus a number of any width can be loaded with a sequence of several LIT commands.
Any other command (for example, NOP) can be used to separate two numbers.
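To make the LIT rule concrete, here is a small simulation-only sketch that builds the value 0x12345678 from four consecutive LIT commands. The entity name, the types and the use of numeric_std are my assumptions for illustration; this is a behavioural model of the rule described above, not code taken from cpu.vhd.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative model of the LIT rule; all names here are hypothetical.
entity lit_sequence_demo is
end entity;

architecture sim of lit_sequence_demo is
  constant DataWidth : natural := 32;
  subtype TCmd  is std_logic_vector(8 downto 0);
  subtype TData is signed(DataWidth - 1 downto 0);
  type TProgram is array (natural range <>) of TCmd;

  -- Four LIT commands in a row (bit 8 = '1', bits 7..0 = payload).
  constant Program : TProgram(0 to 3) := (
    "100010010",   -- LIT 0x12
    "100110100",   -- LIT 0x34
    "101010110",   -- LIT 0x56
    "101111000");  -- LIT 0x78
begin
  process
    variable top       : TData   := (others => '0');
    variable prevIsLit : boolean := false;
  begin
    for i in Program'range loop
      if Program(i)(8) = '1' then
        if prevIsLit then
          -- Continuation: shift left by 8 bits, write the new byte below.
          top := top(DataWidth - 9 downto 0) & signed(Program(i)(7 downto 0));
        else
          -- First LIT of a sequence: sign-extend the 8-bit value.
          top := resize(signed(Program(i)(7 downto 0)), DataWidth);
        end if;
        prevIsLit := true;
      else
        prevIsLit := false;
      end if;
    end loop;

    assert top = to_signed(16#12345678#, DataWidth)
      report "unexpected literal value" severity failure;
    wait;
  end process;
end architecture;
```

One consequence of the sign extension: a single LIT 0xC8 yields -56, so loading 200 as a positive number takes two commands, LIT 0x00 followed by LIT 0xC8.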
Command grouping
I decided to split all other commands into groups to simplify decoding. They are grouped by how they affect the stack.
Mnemonic | Bit 8 | Bits 7..4 | Bits 3..0 |
---|---|---|---|
Any other command | 0 | Group | Command |
Command groups:
Group | Takes from the stack | Pushes on the stack | Example |
---|---|---|---|
0 | 0 | 0 | Nop |
1 | 0 | 1 | Depth |
2 | 1 | 0 | Drop |
3 | 1 | 1 | DUP @ |
4 | 2 | 0 | !, OUTPORT |
5 | 2 | 1 | Arithmetic (+, -, AND) |
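The benefit of grouping is that the change of the data stack pointer depends only on the group field. A minimal sketch of that idea, assuming the group sits in bits 7..4 as in the format table; the function name StackDelta and the entity are hypothetical, and the behaviour for undefined groups is my assumption:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative only: net stack pointer change derived from the group field.
entity group_decode_demo is
end entity;

architecture sim of group_decode_demo is
  subtype TCmd is std_logic_vector(8 downto 0);

  -- (numbers pushed) - (numbers taken) for a non-LIT command.
  function StackDelta(cmd : TCmd) return integer is
  begin
    case to_integer(unsigned(cmd(7 downto 4))) is
      when 0      => return  0;  -- takes 0, pushes 0
      when 1      => return  1;  -- takes 0, pushes 1
      when 2      => return -1;  -- takes 1, pushes 0
      when 3      => return  0;  -- takes 1, pushes 1
      when 4      => return -2;  -- takes 2, pushes 0
      when 5      => return -1;  -- takes 2, pushes 1
      when others => return  0;  -- undefined groups: assumed to leave the stack alone
    end case;
  end function;
begin
  process
  begin
    -- The low four bits do not matter for this decision.
    assert StackDelta("001010000") = -1 report "group 5" severity failure;
    assert StackDelta("001011111") = -1 report "group 5" severity failure;
    assert StackDelta("000010000") =  1 report "group 1" severity failure;
    assert StackDelta("001000000") = -2 report "group 4" severity failure;
    wait;
  end process;
end architecture;
```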
Transitions:
Mnemonic | Bit 8 | Group (bits 7..4) | Command (bits 3..0) |
---|---|---|---|
JMP | 0 | 2 | 0 |
CALL | 0 | 2 | 1 |
IF | 0 | 4 | 0 |
RET | 0 | 0 | 1 |
The JMP and CALL commands take the address from the data stack and jump to it (CALL additionally pushes the return address onto the return stack).
The IF command takes the jump address (the top number on the stack) and the jump flag (the next number). If the flag is zero, the jump to the address is taken.
The RET command works with the return stack: it pops the top number and jumps to it.
If the command is not a transition, the command counter is incremented by one.
Command table
Commands are described in stack notation, which looks like this:
<stack state before word execution> -- <stack state after word execution>
The top of the stack is on the right, i.e. writing 2 3 -- 5 means that before the word was executed the number 3 was on top of the stack with 2 below it; after execution these numbers have been removed, and the number 5 has appeared on top in their place.
Example:
DUP ( a -- a a )
DROP ( a b -- a )
Let's take a minimal set of commands with which you can at least do something.
H \ L | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | Nop | Ret | | | | | | | | |
1 | TEMP> | Depth | RDEPTH | Dup | Over | | | | | |
2 | Jmp | Call | Drop | | | | | | | |
3 | @ | INPORT | NOT | SHL | SHR | SHRA | | | | |
4 | IF | ! | OUTPORT | | | | | | | |
5 | Nip | + | - | AND | OR | Xor | = | > | < | * |
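Read as "row = high four bits (group), column = low four bits", the table gives each opcode as group * 16 + column. The package below is a sketch of what the matching constants could look like. The names cmdPLUS, cmdJMP, cmdCALL, cmdIF, cmdRET and cmdDROP appear in the code fragments of this article; every other name is hypothetical, and the values assume the cells are packed left to right exactly as shown, so the real cpu.vhd may differ.

```vhdl
-- Illustrative opcode constants derived from the command table.
-- Integers are used because the article's code passes them to
-- conv_std_logic_vector and uses them as "case" choices.
package cpu_opcodes_sketch is
  constant CodeWidth   : natural := 9;

  -- Group 0: takes 0, pushes 0
  constant cmdNOP      : integer := 16#00#;
  constant cmdRET      : integer := 16#01#;
  -- Group 1: takes 0, pushes 1
  constant cmdTEMP     : integer := 16#10#;  -- TEMP>
  constant cmdDEPTH    : integer := 16#11#;
  constant cmdRDEPTH   : integer := 16#12#;
  constant cmdDUP      : integer := 16#13#;
  constant cmdOVER     : integer := 16#14#;
  -- Group 2: takes 1, pushes 0
  constant cmdJMP      : integer := 16#20#;
  constant cmdCALL     : integer := 16#21#;
  constant cmdDROP     : integer := 16#22#;
  -- Group 3: takes 1, pushes 1
  constant cmdFETCH    : integer := 16#30#;  -- @
  constant cmdINPORT   : integer := 16#31#;
  constant cmdNOT      : integer := 16#32#;
  constant cmdSHL      : integer := 16#33#;
  constant cmdSHR      : integer := 16#34#;
  constant cmdSHRA     : integer := 16#35#;
  -- Group 4: takes 2, pushes 0
  constant cmdIF       : integer := 16#40#;
  constant cmdSTORE    : integer := 16#41#;  -- !
  constant cmdOUTPORT  : integer := 16#42#;
  -- Group 5: takes 2, pushes 1
  constant cmdNIP      : integer := 16#50#;
  constant cmdPLUS     : integer := 16#51#;
  constant cmdMINUS    : integer := 16#52#;
  constant cmdAND      : integer := 16#53#;
  constant cmdOR       : integer := 16#54#;
  constant cmdXOR      : integer := 16#55#;
  constant cmdEQUAL    : integer := 16#56#;
  constant cmdGREATER  : integer := 16#57#;
  constant cmdLESS     : integer := 16#58#;
  constant cmdMUL      : integer := 16#59#;
end package;
```

The JMP, CALL, IF and RET values agree with the transition table (groups 2, 2, 4, 0 and commands 0, 1, 0, 1).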
Command | Stack notation | Description |
---|---|---|
Nop | ( -- ) | No operation. Takes one processor cycle |
Depth | ( -- n ) | Pushes the number of items that were on the data stack before this word executed |
RDEPTH | ( -- n ) | Pushes the number of items that were on the return stack before this word executed |
Dup | ( a -- a a ) | Duplicates the top number |
Over | ( a b -- a b a ) | Copies the second number from the top onto the top |
Drop | ( a -- ) | Removes the top number |
@ | ( A -- D ) | Reads data memory at address A |
INPORT | ( A -- D ) | Reads data from the bus at address A |
NOT | ( a -- flag ) | Logical NOT of the top number (0 is replaced by -1, any other number by 0) |
SHL | ( a -- b ) | Shifts the top number left by 1 bit |
SHR | ( a -- b ) | Shifts the top number right by 1 bit |
SHRA | ( a -- b ) | Arithmetic shift of the top number right by 1 bit (the sign of the number is preserved) |
! | ( D A -- ) | Writes data D to data memory at address A |
OUTPORT | ( D A -- ) | Writes data D to address A on the "bus" (the iowr signal is set for one clock cycle; a peripheral should "catch" its address while this signal is high) |
Nip | ( a b -- b ) | Removes the second number from the top of the stack (the number is saved in the TempReg register) |
TEMP> | ( -- a ) | Pushes the content of the TempReg register |
+ | ( A B -- A+B ) | Adds the top two numbers |
- | ( A B -- A-B ) | Subtracts the top number from the second number |
AND | ( A B -- C ) | Bitwise AND of the top two numbers |
OR | ( A B -- C ) | Bitwise OR of the top two numbers |
Xor | ( A B -- C ) | Bitwise XOR of the top two numbers |
= | ( A B -- flag ) | Tests the top two numbers for equality. If they are equal, leaves -1 on the stack, otherwise 0 |
> | ( A B -- flag ) | Compares the top two numbers. If A > B, leaves -1 on the stack, otherwise 0. The comparison is signed |
< | ( A B -- flag ) | Compares the top two numbers. If A < B, leaves -1 on the stack, otherwise 0. The comparison is signed |
* | ( A B -- A*B ) | Multiplies the top two numbers |
Only one number can be written to the stack per processor clock cycle. Forth has a SWAP command that swaps the top two numbers on the stack, so implementing it takes two commands. The first, NIP ( a b -- b ), removes the second number “a” from the stack and stores it in a temporary register; the second, TEMP> ( -- a ), takes this number out of the temporary register and puts it on top of the stack.
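The two-step SWAP can be traced with a small behavioural model. This is a simulation-only sketch with hypothetical names (swap_demo, stack, depth, tempReg); it models the stack as an integer array and says nothing about how cpu.vhd actually wires the temporary register.

```vhdl
-- Illustrative trace of SWAP = NIP followed by TEMP>.
entity swap_demo is
end entity;

architecture sim of swap_demo is
begin
  process
    type TStack is array (0 to 3) of integer;
    variable stack   : TStack  := (others => 0);
    variable depth   : natural := 0;  -- number of items on the stack
    variable tempReg : integer := 0;
  begin
    -- Start with ( 2 3 ): 3 is on top, 2 is below it.
    stack(0) := 2;
    stack(1) := 3;
    depth    := 2;

    -- NIP ( a b -- b ): save the second item in tempReg, keep the top.
    tempReg          := stack(depth - 2);
    stack(depth - 2) := stack(depth - 1);
    depth            := depth - 1;

    -- TEMP> ( -- a ): push the saved item back on top.
    stack(depth) := tempReg;
    depth        := depth + 1;

    -- The stack is now ( 3 2 ): the two items have changed places.
    assert depth = 2 and stack(0) = 3 and stack(1) = 2
      report "SWAP sequence failed" severity failure;
    wait;
  end process;
end architecture;
```

Each step writes at most one stack cell, which is why the swap has to be split in two.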
Getting started with the code
Memory implementation
The code and data memories are implemented with the following template:
process(clk)
begin
  if rising_edge(clk) then
    -- synchronous write through port A
    if WeA = '1' then
      Ram(AddrA) <= DinA;
    end if;
    -- registered reads on both ports
    DoutA <= Ram(AddrA);
    DoutB <= Ram(AddrB);
  end if;
end process;
Ram is a signal declared as follows:
subtype RamSignal is std_logic_vector(RamWidth-1 downto 0);
type TRam is array(0 to RamSize-1) of RamSignal;
signal Ram: TRam;
The memory can be initialized as follows:
signal Ram: TRam :=
(0 => conv_std_logic_vector(0, RamWidth),
1 => conv_std_logic_vector(1, RamWidth),
2 => conv_std_logic_vector(2, RamWidth),
-- ...
others => (others => '0'));
Stacks are implemented through a similar template.
process(clk)
begin
  if rising_edge(clk) then
    if WeA = '1' then
      Stack(AddrA) <= DinA;
      DoutA <= DinA;            -- forward the written value to the output
    else
      DoutA <= Stack(AddrA);
    end if;
    DoutB <= Stack(AddrB);
  end if;
end process;
The only difference from the memory template is that it “forwards” the written value to the output. With the previous template, the written value would appear at the output only on the clock cycle after the write.
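For readers who want something they can drop into a simulator, the stack template can be wrapped in a complete entity along these lines. This is a sketch under my own assumptions: the entity name, the generics, and the use of numeric_std with unsigned addresses are not taken from cpu.vhd, and the default size of 16 cells by 32 bits is simply the one that appears in the synthesis report for the data stack.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative two-port stack memory with write forwarding on port A.
entity stack_ram_sketch is
  generic (
    AddrWidth : natural := 4;    -- 2**4 = 16 cells
    DataWidth : natural := 32);
  port (
    clk   : in  std_logic;
    WeA   : in  std_logic;
    AddrA : in  unsigned(AddrWidth - 1 downto 0);
    AddrB : in  unsigned(AddrWidth - 1 downto 0);
    DinA  : in  std_logic_vector(DataWidth - 1 downto 0);
    DoutA : out std_logic_vector(DataWidth - 1 downto 0);
    DoutB : out std_logic_vector(DataWidth - 1 downto 0));
end entity;

architecture rtl of stack_ram_sketch is
  type TStack is array (0 to 2**AddrWidth - 1)
    of std_logic_vector(DataWidth - 1 downto 0);
  signal Stack : TStack := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if WeA = '1' then
        Stack(to_integer(AddrA)) <= DinA;
        DoutA <= DinA;                        -- written value is visible at once
      else
        DoutA <= Stack(to_integer(AddrA));
      end if;
      DoutB <= Stack(to_integer(AddrB));      -- port B is read-only
    end if;
  end process;
end architecture;
```

Port A would serve the top of the stack and port B the number below it, so a command can see both operands in the same cycle.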
The synthesizer automatically recognizes these templates and generates the corresponding memory blocks. This is visible in the report. For example, for the data stack it looks like this:
-----------------------------------------------------------------------
| ram_type | Distributed | |
-----------------------------------------------------------------------
| Port A |
| aspect ratio | 16-word x 32-bit | |
| clkA | connected to signal | rise |
| weA | connected to signal | high |
| addrA | connected to signal | |
| diA | connected to signal | |
| doA | connected to internal node | |
-----------------------------------------------------------------------
| Port B |
| aspect ratio | 16-word x 32-bit | |
| addrB | connected to signal | |
| doB | connected to internal node | |
-----------------------------------------------------------------------
I don't think there is any point in giving the complete memory code; it is essentially boilerplate.
The main cycle of the processor: on the first clock cycle the command is fetched, on the second it is executed. A fetching signal tracks which of the two cycles the processor is in.
process(clk)
begin
  if rising_edge(clk) then
    if reset = '1' then
      -- reset the signals
      ip <= (others => '0');
      fetching <= '1';
    else
      if fetching = '1' then
        fetching <= '0';
      else
        fetching <= '1';
        -- execute the command and form the address for the next fetch
      end if;
    end if;
  end if;
end process;
The simplest way to decode and execute a command is one big “case” covering all the options. To make it easier to write, it is better to split it into several parts.
In this project I split it into 3 parts:
- a case that forms the data stack address and the write-enable signal;
- a case that executes the command;
- a case that forms the new command counter (ip).
-- Data stack addr and we
case conv_integer(cmd(8 downto 4)) is
  when 16 to 31 => -- LIT
    if PrevCmdIsLIT = '0' then
      DSAddrA <= DSAddrA + 1;
    end if;
    DSWeA <= '1';
  when 0 => -- group 0; pop 0; push 0
    null;
  when 1 => -- group 1; pop 0; push 1;
    DSAddrA <= DSAddrA + 1;
    DSWeA <= '1';
  when 2 => -- group 2; pop 1; push 0;
    DSAddrA <= DSAddrA - 1;
  when 3 => -- group 3; pop 1; push 1;
    DSWeA <= '1';
  when 4 => -- group 4; pop 2; push 0;
    DSAddrA <= DSAddrA - 2;
  when 5 => -- group 5; pop 2; push 1;
    DSAddrA <= DSAddrA - 1;
    DSWeA <= '1';
  when others => null;
end case;
The selector is only part of the command: the lower 4 bits are not used.
All the declared command groups are covered. This case will need to change only when a new group of commands appears.
The next case is responsible for executing the command. It forms the data for the data stack (sorry for the tautology), the iowr signal for the OUTPORT command, and so on.
-- Data stack value
case conv_integer(cmd) is
  when 256 to 511 => -- LIT
    if PrevCmdIsLIT = '1' then
      DSDinA <= DSDoutA(DataWidth - 9 downto 0) & Cmd(7 downto 0);
    else
      DSDinA <= sxt(Cmd(7 downto 0), DataWidth);
    end if;
  when cmdPLUS =>
    DSDinA <= DSDoutA + DSDoutB;
  when others => null;
end case;
So far only 2 commands are implemented: loading numbers onto the stack and adding the top two numbers on the stack. This is enough to “test the idea”, and if these 2 commands work, most of the rest can be implemented “by template” without any problems.
The last case forms the next address for the command counter:
-- New ip and ret stack;
case conv_integer(cmd) is
  when cmdJMP => -- jmp
    ip <= DSDoutA(ip'range);
  when cmdIF => -- if
    if conv_integer(DSDoutB) = 0 then
      ip <= DSDoutA(ip'range);
    else
      ip <= ip + 1;
    end if;
  when cmdCALL => -- call
    RSAddrA <= RSAddrA + 1;
    RSDinA <= ip + 1;
    RSWeA <= '1';
    ip <= DSDoutA(ip'range);
  when cmdRET => -- ret
    RSAddrA <= RSAddrA - 1;
    ip <= RSDoutA(ip'range);
  when others => ip <= ip + 1;
end case;
This implements the basic transition commands. The jump address is taken from the stack.
Testing
Before moving on, it is a good idea to test the code written so far. I created a testbench that does nothing except drive the processor's reset signal for the first 100 ns.
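A testbench of that kind can be very short. The sketch below only generates a clock and holds reset for 100 ns. The entity name, the 10 ns clock period and the assumption that the processor has clk and reset ports are mine; the real cpu_tb.vhd in the repository may look different.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative testbench skeleton: clock plus a 100 ns reset pulse.
entity cpu_tb_sketch is
end entity;

architecture sim of cpu_tb_sketch is
  constant ClkPeriod : time := 10 ns;  -- assumed clock period
  signal clk   : std_logic := '0';
  signal reset : std_logic := '1';
begin
  -- Free-running clock.
  clk <= not clk after ClkPeriod / 2;

  -- Hold reset high for the first 100 ns, then release it.
  reset <= '1', '0' after 100 ns;

  -- The processor would be instantiated here with clk and reset
  -- connected, and its remaining ports wired as cpu.vhd requires.

  -- Self-check of the reset waveform.
  check : process
  begin
    wait for 99 ns;
    assert reset = '1' report "reset released too early" severity failure;
    wait for 2 ns;
    assert reset = '0' report "reset still asserted" severity failure;
    wait;
  end process;
end architecture;
```

Because the clock never stops, the simulation has to be run for a fixed time rather than until it ends by itself.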
The code memory was initialized as follows:
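signal CodeMemory: TCodeMemory := (
  0 => "000000000", -- NOP; lit tests
  1 => "100000000", -- LIT 0x00: push 0
  2 => "100000001", -- LIT 0x01: top becomes 0x0001
  3 => "100000010", -- LIT 0x02: top becomes 0x0102
  4 => "000000000", -- NOP: ends the LIT sequence
  5 => "100001111", -- LIT 0x0F: push 0x0F
  6 => "000000000", -- NOP
  7 => "100010000", -- LIT 0x10: push 0x10
  8 => "100001000", -- LIT 0x08: top becomes 0x1008
  9 => conv_std_logic_vector(cmdPLUS, CodeWidth), -- 0x0F + 0x1008 = 0x1017
  10 => conv_std_logic_vector(cmdPLUS, CodeWidth), -- 0x0102 + 0x1017 = 0x1119
  11 => conv_std_logic_vector(cmdDROP, CodeWidth), -- drop the sum
  12 => "100010011", -- LIT 0x13: push 19
  13 => conv_std_logic_vector(cmdJMP, CodeWidth), -- jmp to 19
  14 => "100000010", -- LIT 0x02: push 2 (start of the subroutine)
  15 => "000000000", -- NOP
  16 => "100000010", -- LIT 0x02: push 2
  17 => conv_std_logic_vector(cmdPLUS, CodeWidth), -- 2 + 2 = 4
  18 => conv_std_logic_vector(cmdRET, CodeWidth), -- ret (back to 21)
  19 => "100001110", -- LIT 0x0E: push 14
  20 => conv_std_logic_vector(cmdCALL, CodeWidth), -- call to 14
  21 => "111111111", -- LIT 0xFF: push -1 (sign-extended)
  others => (others => '0')
);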
First a few numbers are loaded, the addition operation is tested, and the stack is cleared with the DROP command. Then the jump, subroutine call, and return are tested.
The simulation results are shown in the following figures:
[Figure: entire test]
[Figure: number load test]
Walkthrough of number loading
The figure shows the execution of the LIT 0 command. After the reset signal is released, the command counter is zero (ip = 0) and the processor is in the command fetch phase (fetching = '1'). On the first clock cycle the fetch is performed. The first command is NOP, which does nothing except increment the command counter (in fact, any unknown command increments the command counter, and may also do something to the data stack, depending on the group it falls into).
Command #1 loads the number 0 onto the stack. Three signals are set during the execution step: the data stack address is incremented by 1, the data is set, and the write-enable signal is set.
On the next fetch cycle the value “0” is written to the stack at address “1”. The value is also immediately "forwarded" to the output (so that the next command operates on the new value). The write-enable signal is cleared.
Command #2 is also a command to load a number onto the stack. Since it follows a LIT command, no new number is pushed; instead the top one is modified: it is shifted 8 bits to the left and the value from the command (0x01) is written into the low byte.
Command #3 performs the same operations as command #2. After it executes, the number on the stack equals 0x0102.
Conclusion
The first commands are tested. Almost all the remaining commands are written by the same pattern (“draw some circles, then draw the rest of the owl”).
The purpose of the article was to show that you can write a processor yourself, and I hope I managed it at least to some extent. The next step is to write a bootloader and a cross-compiler, if this article is of interest to the Habr community.
GitHub project: github.com/whiteTigr/vhdl_cpu
Processor code: github.com/whiteTigr/vhdl_cpu/blob/master/cpu.vhd
Testbench code (although there is practically nothing in it): github.com/whiteTigr/vhdl_cpu/blob/master/cpu_tb.vhd