whiteTigr August 16, 2012 at 15:53

Forth VHDL processor

In this article I will show how you can write a processor in VHDL yourself. There will not be much code (at least I hope so). The full code is posted on GitHub, where you can also see several iterations of the design.

The processor belongs to the class of soft processors.

Architecture

First of all, you need to choose the processor architecture. I will use a RISC architecture for the processor and a Harvard architecture for memory.
The processor has no pipeline and alternates between two states:

Fetching a command and its operands
Executing the command and saving the result

Since we are writing a Forth processor, it will be stack-based. This reduces the command width, because a command does not have to hold the codes of the registers used in a computation. Operations have access to the top two numbers on the stack.
The data stack and the return stack will be separate.

The FPGA has block memory configured as 18 bits * 1024 cells. With that in mind, I chose a command width of 9 bits (2048 commands fit into one memory block).
The data width will be the “standard” 32 bits.
Communication with peripheral devices goes through a bus.

The block diagram of this whole contraption comes out roughly as follows.

[Figure: processor block diagram]

Command set

With the architecture settled, it is time to "try to take off with all this": we need to come up with a command set.
All processor commands can be divided into several groups:

Loading literals (numbers) onto the stack
Jumps (conditional branch, subroutine call, return)
Data memory access (read and write)
Bus access (same idea as memory access)
ALU commands
Other commands

So we have 9 bits per command, and everything has to fit into them.

Loading literals

The command width is smaller than the data width, so we need a mechanism for loading numbers.

I chose the following command format for loading literals onto the stack:

Mnemonics	8	7	6	5	4	3	2	1	0
Lit	1	8-bit literal

The most significant bit of the command, bit 8, flags a literal load. The remaining 8 bits are the number that is loaded onto the stack.
But the data width is 32 bits, and so far only 8 bits can be loaded.
Let's agree that several LIT commands in a row load a single number. The first command pushes the number onto the stack (sign-extending it); each subsequent one modifies the top number on the stack, shifting it 8 bits to the left and writing the value from the command into the low byte. Thus a number of any width can be loaded with a sequence of several LIT commands.
Any other command (for example, NOP) can be used to separate consecutive numbers.
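
The LIT rule above can be tried out in isolation. The following is a minimal simulation-only sketch, not code from the project: the entity name lit_demo and the function lit_step are made up for illustration, and it uses ieee.numeric_std rather than the Synopsys packages used in the article's listings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lit_demo is
end entity;

architecture sim of lit_demo is
  constant DataWidth : natural := 32;
  subtype TData is std_logic_vector(DataWidth - 1 downto 0);

  -- Hypothetical helper: new top-of-stack value after one LIT command.
  -- prev_is_lit = false: first LIT of a run, sign-extend the 8-bit literal.
  -- prev_is_lit = true : shift the old top left by 8 bits and append the literal.
  function lit_step(top         : TData;
                    imm         : std_logic_vector(7 downto 0);
                    prev_is_lit : boolean) return TData is
  begin
    if prev_is_lit then
      return top(DataWidth - 9 downto 0) & imm;
    else
      return std_logic_vector(resize(signed(imm), DataWidth));
    end if;
  end function;
begin
  process
    variable top : TData := (others => '0');
  begin
    -- LIT 0x00, LIT 0x01, LIT 0x02 in a row build the single number 0x00000102
    top := lit_step(top, x"00", false);
    top := lit_step(top, x"01", true);
    top := lit_step(top, x"02", true);
    assert top = x"00000102" report "unexpected LIT result" severity failure;

    -- a lone LIT 0xFF is sign-extended to -1
    top := lit_step(top, x"FF", false);
    assert signed(top) = -1 report "sign extension failed" severity failure;
    wait;
  end process;
end architecture;
```

Because the first literal is sign-extended, a small negative number needs only one LIT command, while a full 32-bit constant needs four.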

Command grouping

I decided to split all other commands into groups for easy decoding. They are grouped by how they affect the stack.

Mnemonics	8	7	6	5	4	3	2	1	0
Other	0	Group				Command

Command groups:

Group	Pops from the stack	Pushes onto the stack	Example
0	0	0	Nop
1	0	1	Depth
2	1	0	Drop
3	1	1	@, NOT
4	2	0	!, OUTPORT
5	2	1	Arithmetic (+, -, AND)

Jumps:

Mnemonic	Group (bits 7..4)	Command (bits 3..0)
Jmp	2	0
Call	2	1
IF	4	0
Ret	0	1

The JMP and CALL commands take an address from the data stack and jump to it (CALL additionally pushes the return address onto the return stack).
The IF command takes the jump address (the top number on the stack) and the jump flag (the next number). If the flag is zero, the jump to the address is taken.
The RET command works with the return stack: it pops the top number and jumps to it.
If the command is not a jump (or the jump is not taken), the command counter is incremented by one.

Command table

Commands are described using stack notation, which looks like this:
<stack state before the word executes> - <stack state after the word executes>
The top of the stack is on the right, i.e. 2 3 - 5 means that before the word was executed the number 3 was on top of the stack with 2 below it; after execution these numbers were removed, and the number 5 appeared on top in their place.
Examples:
DUP (a - a a)
DROP (a b - a)

Take the minimum set of commands with which you can at least do something.

H \ l	0	1	2	3	4	5	6	7	8	9
0	Nop	Ret
1	TEMP>	Depth	RDEPTH	Dup	Over
2	Jmp	Call	Drop
3	@	INPORT	NOT	SHL	SHR	SHRA
4	IF	!	OUTPORT
5	Nip	+	-	AND	OR	Xor	=	>	<	*
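
Reading the table as "row = group (bits 7..4), column = command (bits 3..0)", each command code works out to group * 16 + column. The package below is only a sketch of that reading. The package name cmd_codes is hypothetical, the names cmdNOP, cmdDUP, cmdNIP and cmdTEMPFROM are made up for illustration, and the values are derived from the table rather than copied from cpu.vhd, so the real constants may differ.

```vhdl
package cmd_codes is
  -- code = group * 16 + column, bit 8 is 0 for every non-LIT command
  -- (values derived from the command table; an assumption, not taken from cpu.vhd)
  constant cmdNOP      : integer := 16#00#;  -- group 0, column 0
  constant cmdRET      : integer := 16#01#;  -- group 0, column 1
  constant cmdTEMPFROM : integer := 16#10#;  -- TEMP>, group 1, column 0
  constant cmdDUP      : integer := 16#13#;  -- group 1, column 3
  constant cmdJMP      : integer := 16#20#;  -- group 2, column 0
  constant cmdCALL     : integer := 16#21#;  -- group 2, column 1
  constant cmdDROP     : integer := 16#22#;  -- group 2, column 2
  constant cmdIF       : integer := 16#40#;  -- group 4, column 0
  constant cmdNIP      : integer := 16#50#;  -- group 5, column 0
  constant cmdPLUS     : integer := 16#51#;  -- group 5, column 1
end package;
```

Integer constants are used here because the listings in this article compare them against conv_integer(cmd) in case statements.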

Command	Stack notation	Description
Nop		No operation. One idle processor cycle
Depth	- D	Pushes the number of items that were on the data stack before this word executed
RDEPTH	- D	Pushes the number of items that were on the return stack before this word executed
Dup	A - A A	Duplicates the top number
Over	A B - A B A	Copies the second number from the top onto the top
Drop	A -	Removes the top number
@	A - d	Reads data memory at address A
INPORT	A - d	Reads data from the bus at address A
NOT	A - 0 | -1	Logical NOT of the top number (0 is replaced by -1, any other number is replaced by 0)
SHL	A - B	Shifts the top number 1 bit to the left
SHR	A - B	Shifts the top number 1 bit to the right
SHRA	A - B	Arithmetic shift of the top number 1 bit to the right (the sign of the number is preserved)
!	D A -	Writes data D to data memory at address A
OUTPORT	D A -	Writes data D to the bus at address A (the iowr signal is asserted for one clock cycle; a peripheral should “catch” its address while this signal is high)
Nip	A B - B	Removes the second number from the top of the stack (the number is saved in the TempReg register)
TEMP>	- A	Pushes the contents of the TempReg register
+	A B - A+B	Adds the top two numbers
-	A B - A-B	Subtracts the top number from the second number
AND	A B - A and B	Bitwise AND of the top two numbers
OR	A B - A or B	Bitwise OR of the top two numbers
Xor	A B - A xor B	Bitwise XOR of the top two numbers
=	A B - 0 | -1	Tests the top two numbers for equality. If they are equal, leaves -1 on the stack, otherwise 0
>	A B - 0 | -1	Compares the top two numbers. If A > B, leaves -1 on the stack, otherwise 0. The comparison is signed
<	A B - 0 | -1	Compares the top two numbers. If A < B, leaves -1 on the stack, otherwise 0. The comparison is signed
*	A B - A*B	Multiplies the top two numbers

Only one number can be written to the stack per processor cycle, yet Forth has a SWAP word that swaps the top 2 numbers on the stack. Implementing it therefore takes 2 commands. The first command, NIP (a b - b), removes the second number “a” from the stack and saves it in a temporary register; the second command, TEMP> (- a), takes this number out of the temporary register and pushes it onto the top of the stack.
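
The two-step SWAP can be checked with a small behavioural model. This is a simulation-only sketch with made-up names (swap_demo, stack, sp, temp_reg); it models the stack as an array of integers and is not how the processor itself stores data. Note that each step writes exactly one stack cell, which is the constraint that forces the split into two commands.

```vhdl
entity swap_demo is
end entity;

architecture sim of swap_demo is
begin
  process
    type TStack is array (0 to 15) of integer;
    variable stack    : TStack  := (others => 0);
    variable sp       : natural := 0;  -- index of the top item (0 = empty)
    variable temp_reg : integer := 0;  -- models the TempReg register
  begin
    -- initial stack: 2 3 (3 on top)
    stack(1) := 2;
    stack(2) := 3;
    sp       := 2;

    -- NIP (a b - b): save the second item in TempReg, keep only the top
    temp_reg      := stack(sp - 1);
    stack(sp - 1) := stack(sp);
    sp            := sp - 1;

    -- TEMP> (- a): push the saved item back on top
    sp        := sp + 1;
    stack(sp) := temp_reg;

    -- the stack is now 3 2 (2 on top), i.e. the two items are swapped
    assert stack(sp) = 2 and stack(sp - 1) = 3
      report "SWAP emulation failed" severity failure;
    wait;
  end process;
end architecture;
```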

Getting started coding

Memory implementation.
The code and data memories are implemented using this template:

process(clk)
begin
  if rising_edge(clk) then
    if WeA = '1' then
      Ram(AddrA) <= DinA;
    end if;
    DoutA <= Ram(AddrA);
    DoutB <= Ram(AddrB);
  end if;
end process;

Ram is a signal declared as follows:

subtype RamSignal is std_logic_vector(RamWidth-1 downto 0);
type TRam is array(0 to RamSize-1) of RamSignal;
signal Ram: TRam;

The memory can be initialized as follows:

signal Ram: TRam :=
(0 => conv_std_logic_vector(0, RamWidth),
 1 => conv_std_logic_vector(1, RamWidth),
 2 => conv_std_logic_vector(2, RamWidth),
 -- ...
 others => (others => '0'));

Stacks are implemented through a similar template.

process(clk)
begin
  if rising_edge(clk) then
    if WeA = '1' then
      Stack(AddrA) <= DinA;
      DoutA <= DinA;
    else
      DoutA <= Stack(AddrA);  
    end if;
    DoutB <= Stack(AddrB);
  end if;
end process;

The only difference from the memory template is that it “forwards” the value being written to the output. With the previous template, the written value would appear at the output only on the clock cycle after the write.
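
For reference, here is how the stack template might look as a complete entity. This is a sketch under assumptions: the entity name stack_ram and the generic AddrWidth are made up, the port names simply follow the template, and address conversion uses ieee.numeric_std instead of the Synopsys packages seen in the other listings. The actual code in the repository may be organized differently.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity stack_ram is
  generic (
    AddrWidth : natural := 4;   -- assumed depth: 2**4 = 16 words
    DataWidth : natural := 32
  );
  port (
    clk   : in  std_logic;
    WeA   : in  std_logic;
    AddrA : in  std_logic_vector(AddrWidth - 1 downto 0);
    AddrB : in  std_logic_vector(AddrWidth - 1 downto 0);
    DinA  : in  std_logic_vector(DataWidth - 1 downto 0);
    DoutA : out std_logic_vector(DataWidth - 1 downto 0);
    DoutB : out std_logic_vector(DataWidth - 1 downto 0)
  );
end entity;

architecture rtl of stack_ram is
  type TStack is array (0 to 2**AddrWidth - 1)
    of std_logic_vector(DataWidth - 1 downto 0);
  signal Stack : TStack := (others => (others => '0'));
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if WeA = '1' then
        Stack(to_integer(unsigned(AddrA))) <= DinA;
        DoutA <= DinA;  -- forward the written value to the output
      else
        DoutA <= Stack(to_integer(unsigned(AddrA)));
      end if;
      DoutB <= Stack(to_integer(unsigned(AddrB)));
    end if;
  end process;
end architecture;
```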

The synthesizer automatically recognizes these patterns and generates the corresponding memory blocks. This is visible in the report. For example, for a data stack, it looks like this:

-----------------------------------------------------------------------
| ram_type           | Distributed                         |          |
-----------------------------------------------------------------------
| Port A                                                              |
|     aspect ratio   | 16-word x 32-bit                    |          |
|     clkA           | connected to signal            | rise     |
|     weA            | connected to signal          | high     |
|     addrA          | connected to signal        |          |
|     diA            | connected to signal         |          |
|     doA            | connected to internal node          |          |
-----------------------------------------------------------------------
| Port B                                                              |
|     aspect ratio   | 16-word x 32-bit                    |          |
|     addrB          | connected to signal        |          |
|     doB            | connected to internal node          |          |
-----------------------------------------------------------------------

I don't think it makes sense to give the complete memory code; it is essentially boilerplate.

The main processor cycle: on the first clock cycle the command is fetched, on the second it is executed. A fetching signal indicates which of the two cycles the processor is in.

process(clk)
begin
  if rising_edge(clk) then
    if reset = '1' then
      -- reset the signals
      ip <= (others => '0');
      fetching <= '1';
    else      
      if fetching = '1' then
        fetching <= '0';
      else
        fetching <= '1';
        -- execute the command, form the address for the next fetch
      end if;
    end if;
  end if;
end process;

The simplest way to decode and execute a command is one large “case” over all the options. To make it easier to write, it is better to split it into several parts.
In this project, I split it into 3 parts:

a case that generates the data stack address and the write-enable signal;
a case that executes the command;
a case that forms the new command counter (ip).

-- Data stack addr and we
case conv_integer(cmd(8 downto 4)) is
  when 16 to 31 => -- LIT
    if PrevCmdIsLIT = '0' then
      DSAddrA <= DSAddrA + 1;
    end if;
    DSWeA <= '1';          
  when 0 => -- group 0; pop 0; push 0
    null;
  when 1 => -- group 1; pop 0; push 1;
    DSAddrA <= DSAddrA + 1;
    DSWeA <= '1';          
  when 2 => -- group 2; pop 1; push 0;
    DSAddrA <= DSAddrA - 1;                        
  when 3 => -- group 3; pop 1; push 1;
    DSWeA <= '1';          
  when 4 => -- group 4; pop 2; push 0;
    DSAddrA <= DSAddrA - 2;          
  when 5 => -- group 5; pop 2; push 1;
    DSAddrA <= DSAddrA - 1;
    DSWeA <= '1';             
  when others => null;
end case;

The selector is only the upper part of the command; the lower 4 bits are not used.
All the declared command groups are listed. This case needs to change only when a new command group appears.

The next case is responsible for executing the command. It forms the data for the data stack (pardon the tautology), the iowr signal for the OUTPORT command, and so on.

-- Data stack value
case conv_integer(cmd) is
  when 256 to 511 => -- LIT
    if PrevCmdIsLIT = '1' then
      DSDinA <= DSDoutA(DataWidth - 9 downto 0) & Cmd(7 downto 0);
    else
      DSDinA <= sxt(Cmd(7 downto 0), DataWidth);              
    end if;
  when cmdPLUS =>            
    DSDinA <= DSDoutA + DSDoutB;
  when others => null;
end case;

So far, only 2 commands are implemented: loading a number onto the stack and adding the top two numbers on the stack. This is enough to “test the idea”, and if these 2 commands work, most of the rest can be implemented “by template” without any problems.

The last case forms the next address for the command counter:

-- New ip and ret stack;
case conv_integer(cmd) is
  when cmdJMP => -- jmp
    ip <= DSDoutA(ip'range);
  when cmdIF => -- if
    if conv_integer(DSDoutB) = 0 then
      ip <= DSDoutA(ip'range);
    else
      ip <= ip + 1;
    end if;
  when cmdCALL => -- call
    RSAddrA <= RSAddrA + 1;
    RSDinA <= ip + 1;
    RSWeA <= '1';
    ip <= DSDoutA(ip'range);
  when cmdRET => -- ret
    RSAddrA <= RSAddrA - 1;            
    ip <= RSDoutA(ip'range);
  when others => ip <= ip + 1;
end case;

This implements the basic jump commands. The jump address is taken from the stack.

Testing

Before moving on, it is advisable to test the code written so far. I created a testbench that does nothing but drive the reset signal to the processor for the first 100 ns.
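
A testbench of that kind might look roughly like the sketch below. The entity name cpu_tb_sketch, the 10 ns clock period, and the port names in the commented-out instantiation are assumptions made for illustration; the real testbench is in the repository.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity cpu_tb_sketch is
end entity;

architecture sim of cpu_tb_sketch is
  constant ClkPeriod : time := 10 ns;  -- assumed clock period
  signal clk   : std_logic := '0';
  signal reset : std_logic := '1';
begin
  -- free-running clock; run the simulation for a fixed time, since it never stops
  clk <= not clk after ClkPeriod / 2;

  -- hold reset for the first 100 ns, then release it
  reset <= '1', '0' after 100 ns;

  -- The processor under test would be instantiated here, for example:
  --   uut: entity work.cpu port map (clk => clk, reset => reset);
  -- (entity name and port list are assumptions; check cpu.vhd for the real ones)
end architecture;
```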

The code memory was initialized as follows:

signal CodeMemory: TCodeMemory := (
  0  => "000000000", -- lit tests
  1  => "100000000",
  2  => "100000001",
  3  => "100000010",
  4  => "000000000",
  5  => "100001111",
  6  => "000000000",
  7  => "100010000",
  8  => "100001000",
  9  => conv_std_logic_vector(cmdPLUS, CodeWidth),
  10 => conv_std_logic_vector(cmdPLUS, CodeWidth),
  11 => conv_std_logic_vector(cmdDROP, CodeWidth),
  12 => "100010011",
  13 => conv_std_logic_vector(cmdJMP, CodeWidth), -- jmp to 19
  14 => "100000010",
  15 => "000000000",
  16 => "100000010",
  17 => conv_std_logic_vector(cmdPLUS, CodeWidth),
  18 => conv_std_logic_vector(cmdRET, CodeWidth), -- ret
  19 => "100001110",
  20 => conv_std_logic_vector(cmdCALL, CodeWidth), -- call to 14
  21 => "111111111",
  others => (others => '0')
);

First a few numbers are loaded, the addition operation is tested, and the stack is cleaned up with the DROP command. Then the jump, subroutine call, and return are tested.

The simulation results are shown in the following figures:

[Figure: entire test]
[Figure: number load test]

Walkthrough of number loading

The figure shows the execution of the LIT 0 command. After the reset signal is released, the command counter is zero (ip = 0) and the processor is in the command fetch phase (fetching = '1'). On the first clock cycle the fetch is performed. The first command is a NOP, which does nothing except increment the command counter (in fact, any unknown command increments the command counter, and may also do something to the data stack depending on the group it falls into).

Command #1 loads the number 0 onto the stack. Three signals are set in the execution step: the data stack address is incremented by 1, the data is set up, and the write-enable signal is asserted.
On the next fetch cycle the value “0” is written to the stack at address “1”. The value is also immediately “forwarded” to the output (so that the next command operates on the new value). The write-enable signal is deasserted.

Command #2 is also a command to load a number onto the stack. Since it follows a LIT command, no new number is pushed; instead the top one is modified. It is shifted 8 bits to the left, and the value from the command (0x01) is written into the low byte.

Command #3 performs the same operation as command #2. After it executes, the number on the stack equals 0x0102.

Conclusion

The first commands have been tested. Almost all the remaining commands are written following the same template (“draw some circles, then draw the rest of the owl”).
The purpose of the article was to show that you can write a processor yourself, and I hope I managed that at least to some extent. The next step is to write the bootloader and the cross-compiler, if this article is of interest to the Habr community.

GitHub project: github.com/whiteTigr/vhdl_cpu
Processor code: github.com/whiteTigr/vhdl_cpu/blob/master/cpu.vhd
Testbench code (although there is practically nothing in it): github.com/whiteTigr/vhdl_cpu/blob/master/cpu_tb.vhd
