Forth VHDL processor

    In this article I will show how to write a processor in VHDL yourself. There will not be much code (at least I hope so). The full code is posted on GitHub, where you can also see several iterations of its development.

    The processor belongs to the class of soft processors.


    First of all, you need to choose the processor architecture. I will use a RISC architecture for the processor and a Harvard memory architecture.
    The processor will have no pipeline and will work in two states:

    1. Fetching a command and its operands
    2. Executing the command and saving the result

    Since we are writing a Forth processor, it will be stack-based. This reduces the width of a command, because the command does not have to hold the codes of the registers used in the computation. For operations, the processor has the top two numbers of the stack available.
    The data stack and the return stack will be separate.

    The FPGA has block memory configured as 18 bits * 1024 cells. With that in mind, I chose a command width of 9 bits (2048 commands fit in one memory block).
    The data memory width will be the “standard” 32 bits.
    Communication with peripheral devices is implemented over a bus.
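
    The arithmetic behind these widths can be written down as a small package of constants. This is just a sketch: the package name cpu_params and the constants BramBits and CommandsPerBlock are made up for illustration, while CodeWidth and DataWidth are the names that appear in the project code fragments.

```vhdl
package cpu_params is
  -- One block RAM: 18 bits wide, 1024 cells deep
  constant BramBits         : natural := 18 * 1024;
  -- Width of one command and of one data word
  constant CodeWidth        : natural := 9;
  constant DataWidth        : natural := 32;
  -- 18432 bits / 9 bits per command = 2048 commands per block
  constant CommandsPerBlock : natural := BramBits / CodeWidth;
end package;
```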

    The block diagram of all this mess comes out roughly as follows.

    Command system

    The architecture is settled; now let's "try to take off with all this." The next step is to design the command system.
    All processor commands can be divided into several groups:
    • Loading literals (numbers) onto the stack
    • Jumps (conditional branch, subroutine call, return)
    • Access to data memory (read and write)
    • Access to the bus (same idea as access to memory)
    • ALU commands
    • Other commands

    So, we have 9 bits per command, and everything has to fit into them.

    Loading literals

    The command is narrower than the data word, so a mechanism for loading numbers is needed.

    I chose the following command format for loading literals onto the stack:

    The most significant bit (bit 8) of the command is the flag for loading a number. The remaining 8 bits are the number itself, which is loaded onto the stack.
    But the data width is 32 bits, and so far only 8 bits can be loaded.
    Let's agree that several LIT commands in a row are treated as loading a single number. The first command pushes the number onto the stack (sign-extending it); each subsequent one modifies the top number on the stack, shifting it 8 bits to the left and writing the value from the command into the low byte. Thus, a number of any width can be loaded with a sequence of several LIT commands.
    Any other command (for example, NOP) can be used to separate two consecutive numbers.
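
    The chaining rule for LIT can be sketched as a small pure function. This is an illustration under assumptions, not the project's code: the package name lit_pkg, the function name lit_step and the use of ieee.numeric_std are my own choices; only the behaviour (sign-extend the first byte, then shift in the following bytes) follows the description above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package lit_pkg is
  constant DataWidth : natural := 32;
  subtype TData is std_logic_vector(DataWidth - 1 downto 0);

  -- New top-of-stack value produced by one LIT command.
  -- prev_is_lit = '0': the sign-extended 8-bit literal is pushed.
  -- prev_is_lit = '1': the current top is shifted left by 8 bits and
  --                    the literal is written into the low byte.
  function lit_step(prev_is_lit : std_logic;
                    top         : TData;
                    imm         : std_logic_vector(7 downto 0)) return TData;
end package;

package body lit_pkg is
  function lit_step(prev_is_lit : std_logic;
                    top         : TData;
                    imm         : std_logic_vector(7 downto 0)) return TData is
  begin
    if prev_is_lit = '1' then
      return top(DataWidth - 9 downto 0) & imm;
    else
      return std_logic_vector(resize(signed(imm), DataWidth));
    end if;
  end function;
end package body;
```

    With this rule, three LIT commands in a row carrying 0x00, 0x01 and 0x02 leave 0x00000102 on the stack. Because the first byte is sign-extended, a single LIT 0x80 gives 0xFFFFFF80, so a positive 0x80 needs a leading LIT 0x00.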

    Command grouping

    I decided to break all other commands into groups for easy decoding. We will group them by the way they affect the stack.

    Command groups:
    Group   Takes from the stack   Pushes on the stack   Example
    3       1                      1                     DUP @
    4       2                      0                     !, OUTPORT
    5       2                      1                     Arithmetic (+, -, AND)


    The JMP and CALL commands take the address from the stack and jump to it (CALL additionally pushes the return address onto the return stack).
    The IF command takes the jump address (the top number on the stack) and the jump flag (the next number). If the flag is zero, the jump to the address is taken.
    The RET command works with the return stack: it pops the top number and jumps to it.
    If the command is not a jump, the command counter is incremented by one.

    Command table

    The commands are described using stack notation, which looks like this:
    <stack state before word execution> - <stack state after word execution>

    The top of the stack is on the right, i.e. writing 2 3 - 5 means that before the word was executed the number 3 was at the top of the stack, with 2 below it; after execution these numbers are removed, and the number 5 appears on top in their place.
    DUP (a - a a)
    DROP (a b - a)

    Let's take a minimal set of commands with which you can do at least something.

    Command   Stack notation  Description
    NOP       -               No operation. One idle processor cycle
    DEPTH     - D             Pushes the number of items that were on the data stack before this word executed
    RDEPTH    - D             Pushes the number of items that were on the return stack before this word executed
    DUP       A - A A         Duplicates the top number
    OVER      A B - A B A     Copies the second number from the top onto the top
    DROP      A -             Removes the top number
    @         A - D           Reads data memory at address A
    INPORT    A - D           Reads data from the bus at address A
    NOT       A - 0 | -1      Logical NOT of the top number (0 is replaced by -1, any other number by 0)
    SHL       A - B           Shifts the top number left by 1 bit
    SHR       A - B           Shifts the top number right by 1 bit
    SHRA      A - B           Arithmetic shift of the top number right by 1 bit (the sign of the number is preserved)
    !         D A -           Writes data D to data memory at address A
    OUTPORT   D A -           Writes data D to the bus at address A (the iowr signal is asserted for one clock cycle; a peripheral should "catch" its own address while this signal is high)
    NIP       A B - B         Removes the second number from the top of the stack (the number is saved in the TempReg register)
    TEMP>     - A             Pushes the content of the TempReg register
    +         A B - A+B       Adds the top two numbers
    -         A B - A-B       Subtracts the top number from the second number
    AND       A B - A and B   Bitwise AND of the top two numbers
    OR        A B - A or B    Bitwise OR of the top two numbers
    XOR       A B - A xor B   Bitwise XOR of the top two numbers
    =         A B - 0 | -1    Tests the top two numbers for equality. Leaves -1 on the stack if they are equal, otherwise 0
    >         A B - 0 | -1    Signed comparison of the top two numbers. Leaves -1 on the stack if A > B, otherwise 0
    <         A B - 0 | -1    Signed comparison of the top two numbers. Leaves -1 on the stack if A < B, otherwise 0
    *         A B - A*B       Multiplies the top two numbers
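
    To make the OUTPORT description a bit more concrete, here is a rough sketch of a peripheral sitting on the bus. Only the iowr signal comes from the description above; the entity name led_port, the ioaddr and iodata lines, their widths, and the address 16 are assumptions for illustration and not the project's actual bus.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity led_port is
  generic (
    MyAddr : natural := 16  -- hypothetical bus address of this peripheral
  );
  port (
    clk    : in  std_logic;
    iowr   : in  std_logic;                      -- write strobe, high for one cycle
    ioaddr : in  std_logic_vector(31 downto 0);  -- assumed address lines
    iodata : in  std_logic_vector(31 downto 0);  -- assumed write data lines
    leds   : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of led_port is
  signal reg : std_logic_vector(7 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- Latch the data only when the strobe is high and the address is ours
      if iowr = '1' and unsigned(ioaddr) = MyAddr then
        reg <= iodata(7 downto 0);
      end if;
    end if;
  end process;

  leds <= reg;
end architecture;
```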

    Only one number can be written to the stack per processor clock cycle, but Forth has a SWAP command that swaps the top two numbers on the stack. Implementing it takes two commands. The first, NIP (a b - b), removes the second number “a” from the top and saves it in a temporary register; the second, TEMP> ( - a), takes this number out of the temporary register and puts it on top of the stack.
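
    The NIP plus TEMP> trick can be tried out in a tiny simulation-only model. This is a behavioural sketch, not the processor's implementation: the entity name swap_demo, the integer stack and the procedure names nip and temp_from are invented for illustration. Each procedure performs exactly one write to the stack, which is the restriction described above.

```vhdl
entity swap_demo is
end entity;

architecture sim of swap_demo is
begin
  process
    type TStack is array (0 to 15) of integer;
    variable Stack   : TStack  := (others => 0);
    variable sp      : natural := 0;  -- address of the top number, 0 = empty
    variable TempReg : integer := 0;

    -- NIP (a b - b): one stack write, the second number goes to TempReg
    procedure nip is
    begin
      TempReg       := Stack(sp - 1);
      Stack(sp - 1) := Stack(sp);
      sp            := sp - 1;
    end procedure;

    -- TEMP> ( - a): one stack write, the saved number becomes the new top
    procedure temp_from is
    begin
      sp        := sp + 1;
      Stack(sp) := TempReg;
    end procedure;
  begin
    -- Put 2 and 3 on the stack (3 on top)
    Stack(1) := 2;
    Stack(2) := 3;
    sp       := 2;

    nip;        -- stack: 3, TempReg = 2
    temp_from;  -- stack: 3 2

    assert Stack(sp) = 2 and Stack(sp - 1) = 3
      report "SWAP sequence failed" severity failure;
    wait;
  end process;
end architecture;
```

    Starting from 2 3, the two steps leave 3 2 on the stack, so the assertion holds.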

    Getting started coding

    Memory implementation.
    Code and data memory are implemented using this template:
    process(clk)
    begin
      if rising_edge(clk) then
        if WeA = '1' then
          Ram(AddrA) <= DinA;
        end if;
        DoutA <= Ram(AddrA);
        DoutB <= Ram(AddrB);
      end if;
    end process;

    Ram is a signal declared as follows:
    subtype RamSignal is std_logic_vector(RamWidth-1 downto 0);
    type TRam is array(0 to RamSize-1) of RamSignal;
    signal Ram: TRam;

    The memory can be initialized as follows:
    signal Ram: TRam :=
    (0 => conv_std_logic_vector(0, RamWidth),
     1 => conv_std_logic_vector(1, RamWidth),
     2 => conv_std_logic_vector(2, RamWidth),
     -- ...
     others => (others => '0'));

    Stacks are implemented using a similar template.
    process(clk)
    begin
      if rising_edge(clk) then
        if WeA = '1' then
          Stack(AddrA) <= DinA;
          DoutA <= DinA;            -- forward the written value to the output
        else
          DoutA <= Stack(AddrA);
        end if;
        DoutB <= Stack(AddrB);
      end if;
    end process;

    The only difference from the memory template is that it “forwards” the written value to the output. With the previous template, the written value would appear on the output only one clock cycle after the write.
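
    For anyone who wants to simulate the stack fragment on its own, it could be wrapped in an entity roughly like this. The port names follow the fragment; the entity name stack_ram, the AddrWidth generic, the std_logic_vector address ports and the ieee.numeric_std conversions are assumptions, and the project's real declaration may differ.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity stack_ram is
  generic (
    AddrWidth : natural := 4;   -- 2**4 = 16 cells
    DataWidth : natural := 32
  );
  port (
    clk   : in  std_logic;
    WeA   : in  std_logic;
    AddrA : in  std_logic_vector(AddrWidth - 1 downto 0);
    AddrB : in  std_logic_vector(AddrWidth - 1 downto 0);
    DinA  : in  std_logic_vector(DataWidth - 1 downto 0);
    DoutA : out std_logic_vector(DataWidth - 1 downto 0);
    DoutB : out std_logic_vector(DataWidth - 1 downto 0)
  );
end entity;

architecture rtl of stack_ram is
  type TStack is array (0 to 2**AddrWidth - 1)
    of std_logic_vector(DataWidth - 1 downto 0);
  signal Stack : TStack := (others => (others => '0'));
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if WeA = '1' then
        Stack(to_integer(unsigned(AddrA))) <= DinA;
        DoutA <= DinA;  -- write-first: the new value is visible right away
      else
        DoutA <= Stack(to_integer(unsigned(AddrA)));
      end if;
      -- Port B is read-only
      DoutB <= Stack(to_integer(unsigned(AddrB)));
    end if;
  end process;
end architecture;
```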

    The synthesizer automatically recognizes these patterns and generates the corresponding memory blocks. This is visible in the report. For example, for a data stack, it looks like this:
    | ram_type           | Distributed                         |          |
    | Port A                                                              |
    |     aspect ratio   | 16-word x 32-bit                    |          |
    |     clkA           | connected to signal            | rise     |
    |     weA            | connected to signal          | high     |
    |     addrA          | connected to signal        |          |
    |     diA            | connected to signal         |          |
    |     doA            | connected to internal node          |          |
    | Port B                                                              |
    |     aspect ratio   | 16-word x 32-bit                    |          |
    |     addrB          | connected to signal        |          |
    |     doB            | connected to internal node          |          |

    I think it makes no sense to provide a complete code for the implementation of memory, it is, in fact, boilerplate.

    The main loop of the processor: on the first clock cycle the command is fetched, on the second it is executed. A fetching signal indicates which of the two cycles the processor is in.
    process(clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          -- reset the signals
          ip <= (others => '0');
          fetching <= '1';
        else
          if fetching = '1' then
            fetching <= '0';
          else
            fetching <= '1';
            -- execute the command, form the address for the next fetch
          end if;
        end if;
      end if;
    end process;

    The simplest way to decode and execute a command is one large “case” over all the options. To make it easier to write, it is better to split it into several parts.
    In this project, I split it into 3 parts:
    • a case that forms the data stack address and generates the write signal;
    • a case that executes the command;
    • a case that forms the new command counter (ip).

    -- Data stack addr and we
    case conv_integer(cmd(8 downto 4)) is
      when 16 to 31 => -- LIT
        if PrevCmdIsLIT = '0' then
          DSAddrA <= DSAddrA + 1;
        end if;
        DSWeA <= '1';          
      when 0 => -- group 0; pop 0; push 0
      when 1 => -- group 1; pop 0; push 1;
        DSAddrA <= DSAddrA + 1;
        DSWeA <= '1';          
      when 2 => -- group 2; pop 1; push 0;
        DSAddrA <= DSAddrA - 1;                        
      when 3 => -- group 3; pop 1; push 1;
        DSWeA <= '1';          
      when 4 => -- group 4; pop 2; push 0;
        DSAddrA <= DSAddrA - 2;          
      when 5 => -- group 5; pop 2; push 1;
        DSAddrA <= DSAddrA - 1;
        DSWeA <= '1';             
      when others => null;
    end case;

    The selector is only the upper part of the command; the lower 4 bits are not used here.
    All declared command groups are covered. This case will need to be changed only when a new command group appears.
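
    The same grouping can also be written as a pure function, which is handy for checking the groups in simulation. This is only a sketch: the package name group_pkg, the record TStackEffect and the function stack_effect are invented for illustration, and ieee.numeric_std is used here instead of conv_integer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package group_pkg is
  type TStackEffect is record
    delta : integer range -2 to 1;  -- change of the data stack address
    we    : std_logic;              -- '1' if the command writes a new top
  end record;

  function stack_effect(cmd         : std_logic_vector(8 downto 0);
                        prev_is_lit : std_logic) return TStackEffect;
end package;

package body group_pkg is
  function stack_effect(cmd         : std_logic_vector(8 downto 0);
                        prev_is_lit : std_logic) return TStackEffect is
  begin
    if cmd(8) = '1' then                 -- LIT: upper-part values 16 to 31
      if prev_is_lit = '1' then
        return (delta => 0, we => '1');  -- modify the current top
      else
        return (delta => 1, we => '1');  -- push a new number
      end if;
    end if;

    -- Bit 8 is '0' here, so bits 7..4 alone give the group number
    case to_integer(unsigned(cmd(7 downto 4))) is
      when 0      => return (delta => 0,  we => '0');  -- pop 0, push 0
      when 1      => return (delta => 1,  we => '1');  -- pop 0, push 1
      when 2      => return (delta => -1, we => '0');  -- pop 1, push 0
      when 3      => return (delta => 0,  we => '1');  -- pop 1, push 1
      when 4      => return (delta => -2, we => '0');  -- pop 2, push 0
      when 5      => return (delta => -1, we => '1');  -- pop 2, push 1
      when others => return (delta => 0,  we => '0');  -- undeclared groups
    end case;
  end function;
end package body;
```

    A new command inside an existing group needs no change in such a function; only a new group adds a branch.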

    The next case is responsible for executing the command. It forms the data for the data stack (sorry for the tautology), the iowr signal for the OUTPORT command, and so on.
    -- Data stack value
    case conv_integer(cmd) is
      when 256 to 511 => -- LIT
        if PrevCmdIsLIT = '1' then
          DSDinA <= DSDoutA(DataWidth - 9 downto 0) & Cmd(7 downto 0);
        else
          DSDinA <= sxt(Cmd(7 downto 0), DataWidth);
        end if;
      when cmdPLUS =>            
        DSDinA <= DSDoutA + DSDoutB;
      when others => null;
    end case;

    So far, only 2 commands have been implemented: loading numbers onto the stack and adding the top two numbers on the stack. This is enough to “test the idea”, and if these 2 commands work, most of the rest can be implemented “by template” without any problems.

    And the last case forms the next value of the command counter:
    -- New ip and ret stack;
    case conv_integer(cmd) is
      when cmdJMP => -- jmp
        ip <= DSDoutA(ip'range);
      when cmdIF => -- if
        if conv_integer(DSDoutB) = 0 then
          ip <= DSDoutA(ip'range);
        else
          ip <= ip + 1;
        end if;
      when cmdCALL => -- call
        RSAddrA <= RSAddrA + 1;
        RSDinA <= ip + 1;
        RSWeA <= '1';
        ip <= DSDoutA(ip'range);
      when cmdRET => -- ret
        RSAddrA <= RSAddrA - 1;            
        ip <= RSDoutA(ip'range);
      when others => ip <= ip + 1;
    end case;

    The basic jump commands are now implemented. The jump address is taken from the stack.


    Before moving on, it is advisable to test the code written so far. I created a testbench that only asserts the reset signal to the processor for the first 100 ns.
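
    A testbench of this kind could look roughly like the sketch below. The entity name cpu_tb_sketch, the 10 ns clock period and the 1 us run time are assumptions; the processor's entity and port names are not shown in the article, so the instantiation is left as a comment.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity cpu_tb_sketch is
end entity;

architecture sim of cpu_tb_sketch is
  constant ClkPeriod : time := 10 ns;   -- assumed clock period
  signal clk   : std_logic := '0';
  signal reset : std_logic := '1';
  signal done  : boolean   := false;
begin
  -- Free-running clock, stopped when the test is over
  clk <= not clk after ClkPeriod / 2 when not done else '0';

  -- Hold reset for the first 100 ns, then let the processor run
  stimulus : process
  begin
    reset <= '1';
    wait for 100 ns;
    reset <= '0';
    wait for 1 us;      -- assumed run time for a short test program
    done <= true;
    wait;
  end process;

  -- The processor under test would be instantiated here and connected
  -- to clk and reset (hypothetical entity name and ports):
  -- uut : entity work.cpu port map (clk => clk, reset => reset);
end architecture;
```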

    The code memory was initialized as follows:
    signal CodeMemory: TCodeMemory := (
      0  => "000000000", -- lit tests
      1  => "100000000",
      2  => "100000001",
      3  => "100000010",
      4  => "000000000",
      5  => "100001111",
      6  => "000000000",
      7  => "100010000",
      8  => "100001000",
      9  => conv_std_logic_vector(cmdPLUS, CodeWidth),
      10 => conv_std_logic_vector(cmdPLUS, CodeWidth),
      11 => conv_std_logic_vector(cmdDROP, CodeWidth),
      12 => "100010011",
      13 => conv_std_logic_vector(cmdJMP, CodeWidth), -- jmp to 19
      14 => "100000010",
      15 => "000000000",
      16 => "100000010",
      17 => conv_std_logic_vector(cmdPLUS, CodeWidth),
      18 => conv_std_logic_vector(cmdRET, CodeWidth), -- ret
      19 => "100001110",
      20 => conv_std_logic_vector(cmdCALL, CodeWidth), -- call to 14
      21 => "111111111",
      others => (others => '0')
    );

    First, several numbers are pushed, the addition operation is tested, and the stack is cleaned up with the DROP command. Then the jump, the subroutine call, and the return are tested.

    The simulation result is shown in the following pictures:

    [Figure: entire test]
    [Figure: number load test]

    Walking through number loading

    The figure shows the execution of the LIT 0 command. After the reset signal is released, the command counter is zero (ip = 0) and the processor is in the command fetch phase (fetching = '1'). On the first clock cycle the fetch is performed. The first command is NOP, which does nothing except increment the command counter (note that any unknown command also increments the command counter, and may also do something with the data stack, depending on the group it falls into).

    Command #1 loads the number 0 onto the stack. Three signals are set at the execution step: the data stack address is increased by 1, the data is set, and the write enable signal is asserted.
    On the next fetch cycle, the value “0” is written to the stack at address “1”. The value is also immediately "forwarded" to the output (so that the next command operates on the new value). The write enable signal is deasserted.

    Command #2 is also a command to load a number onto the stack. Since it follows a LIT command, no new number is pushed onto the stack; instead, the top one is modified. It is shifted 8 bits to the left, and the value from the command (0x01) is written into the low byte.

    Command #3 performs the same operations as command #2. After it executes, the number on the stack equals 0x0102.


    The first commands are tested. Almost all the remaining commands are written by the same template (“draw the circles, then draw the rest of the owl”).
    The purpose of the article was to show that you can write a processor yourself, and I hope I managed that at least to some extent. The next step is to write the bootloader and the cross-compiler, if this article is of interest to the Habr community.

    Github project:
    Processor code:
    Testbench code (although there is practically nothing in it): blob / master / cpu_tb.vhd
