Part 2: RocketChip: connecting RAM


In the previous part, we assembled a microcontroller without any RAM on an Altera/Intel FPGA. However, the board has a SO-DIMM slot with a 1 GB DDR2 module installed, which I obviously want to use. To do this, we need to wrap the DDR2 controller, which exposes the ALTMEMPHY interface, in a module that speaks TileLink, the memory protocol used throughout RocketChip. Below the cut: debugging by touch, brute-force programming, and stepping on rakes.


As you know, Computer Science has two main problems: cache invalidation and variable naming. In the lead picture you can see a rare moment: the two main problems of CS have met and are plotting something.


DISCLAIMER: In addition to the warning from the previous article, I strongly recommend that you read the article to the end before repeating the experiments, in order to avoid damage to the FPGA, memory module or power circuits.


This time I wanted, if not to boot Linux, then at least to connect the RAM, of which my board already has a whole gigabyte (and up to four can be installed). I propose the following success criterion: the ability to read and write through the GDB + OpenOCD combination, including at addresses not aligned to 16 bytes (the width of a single memory request). At first glance, you just need to tweak the config a bit: surely the SoC generator cannot lack RAM support out of the box. It does support RAM, but through the MIG interface (and possibly some other interface from Microsemi). It also supports the standard AXI4 interface, but, as I understand it, getting one is not so easy (at least not without mastering Platform Designer).


Lyrical digression: as far as I understand, there is a rather popular family of on-chip AXI interfaces developed by ARM. One might think it is all patented and closed. But after I registered (no "university programs" or anything of the sort, just an e-mail and a questionnaire) and got access to the specification, I was pleasantly surprised. Of course, I'm not a lawyer, but the standard seems to be fairly open: you either use licensed cores from ARM, or you do not claim ARM compatibility at all, and then everything seems to be OK. But in general, of course, read the license, consult lawyers, and so on.


Monkey and TileLink (fable)


The task seemed quite simple, so I opened the description of the ddr2_64bit module that the board vendor had already included in the project:


Intel's property, and all that:
module ddr2_64bit (
    local_address,
    local_write_req,
    local_read_req,
    local_burstbegin,
    local_wdata,
    local_be,
    local_size,
    global_reset_n,
    pll_ref_clk,
    soft_reset_n,
    local_ready,
    local_rdata,
    local_rdata_valid,
    local_refresh_ack,
    local_init_done,
    reset_phy_clk_n,
    mem_odt,
    mem_cs_n,
    mem_cke,
    mem_addr,
    mem_ba,
    mem_ras_n,
    mem_cas_n,
    mem_we_n,
    mem_dm,
    phy_clk,
    aux_full_rate_clk,
    aux_half_rate_clk,
    reset_request_n,
    mem_clk,
    mem_clk_n,
    mem_dq,
    mem_dqs);
    input   [25:0]  local_address;
    input       local_write_req;
    input       local_read_req;
    input       local_burstbegin;
    input   [127:0] local_wdata;
    input   [15:0]  local_be;
    input   [2:0]   local_size;
    input       global_reset_n;
    input       pll_ref_clk;
    input       soft_reset_n;
    output      local_ready;
    output  [127:0] local_rdata;
    output      local_rdata_valid;
    output      local_refresh_ack;
    output      local_init_done;
    output      reset_phy_clk_n;
    output  [1:0]   mem_odt;
    output  [1:0]   mem_cs_n;
    output  [1:0]   mem_cke;
    output  [13:0]  mem_addr;
    output  [1:0]   mem_ba;
    output      mem_ras_n;
    output      mem_cas_n;
    output      mem_we_n;
    output  [7:0]   mem_dm;
    output      phy_clk;
    output      aux_full_rate_clk;
    output      aux_half_rate_clk;
    output      reset_request_n;
    inout   [1:0]   mem_clk;
    inout   [1:0]   mem_clk_n;
    inout   [63:0]  mem_dq;
    inout   [7:0]   mem_dqs;
  ...

Popular wisdom says: "Any manual in Russian should begin with the words: 'So, it does not work.'" But the interface here is not entirely intuitive, so we read the manual anyway. The description tells us right away that working with DDR2 is no easy task. You need to set up the PLL, perform some calibration, say a few magic words, and then the local_init_done signal is asserted and you can get to work. In general, the naming logic is roughly as follows: names with the local_ prefix form the "user" interface; the mem_ ports must be routed directly to the pins connected to the memory module; pll_ref_clk must be fed a clock of the frequency specified when the module was configured, and all the other frequencies are derived from it; and then there are assorted reset inputs and outputs, plus clock outputs with which the user interface must operate synchronously.


Let's describe the external memory signals and the interface of the ddr2_64bit module:


trait MemIf {
  val local_init_done   = Output(Bool())
  val global_reset_n    = Input(Bool())
  val pll_ref_clk       = Input(Clock())
  val soft_reset_n      = Input(Bool())
  val reset_phy_clk_n   = Output(Clock())
  val mem_odt   = Output(UInt(2.W))
  val mem_cs_n  = Output(UInt(2.W))
  val mem_cke   = Output(UInt(2.W))
  val mem_addr  = Output(UInt(14.W))
  val mem_ba    = Output(UInt(2.W))
  val mem_ras_n = Output(UInt(1.W))
  val mem_cas_n = Output(UInt(1.W))
  val mem_we_n  = Output(UInt(1.W))
  val mem_dm    = Output(UInt(8.W))
  val phy_clk           = Output(Clock())
  val aux_full_rate_clk = Output(Clock())
  val aux_half_rate_clk = Output(Clock())
  val reset_request_n = Output(Bool())
  val mem_clk   = Analog(2.W)
  val mem_clk_n = Analog(2.W)
  val mem_dq    = Analog(64.W)
  val mem_dqs   = Analog(8.W)
  def connectFrom(mem_if: MemIf): Unit = {
    local_init_done := mem_if.local_init_done
    mem_if.global_reset_n := global_reset_n
    mem_if.pll_ref_clk := pll_ref_clk
    mem_if.soft_reset_n := soft_reset_n
    reset_phy_clk_n := mem_if.reset_phy_clk_n
    mem_odt <> mem_if.mem_odt
    mem_cs_n <> mem_if.mem_cs_n
    mem_cke <> mem_if.mem_cke
    mem_addr <> mem_if.mem_addr
    mem_ba <> mem_if.mem_ba
    mem_ras_n <> mem_if.mem_ras_n
    mem_cas_n <> mem_if.mem_cas_n
    mem_we_n <> mem_if.mem_we_n
    mem_dm <> mem_if.mem_dm
    mem_clk <> mem_if.mem_clk
    mem_clk_n <> mem_if.mem_clk_n
    mem_dq <> mem_if.mem_dq
    mem_dqs <> mem_if.mem_dqs
    phy_clk := mem_if.phy_clk
    aux_full_rate_clk := mem_if.aux_full_rate_clk
    aux_half_rate_clk := mem_if.aux_half_rate_clk
    reset_request_n := mem_if.reset_request_n
  }
}
class MemIfBundle extends Bundle with MemIf

class ddr2_64bit extends BlackBox {
  override val io = IO(new MemIfBundle {
    val local_address     = Input(UInt(26.W))
    val local_write_req   = Input(Bool())
    val local_read_req    = Input(Bool())
    val local_burstbegin = Input(Bool())
    val local_wdata       = Input(UInt(128.W))
    val local_be          = Input(UInt(16.W))
    val local_size        = Input(UInt(3.W))
    val local_ready       = Output(Bool())
    val local_rdata       = Output(UInt(128.W))
    val local_rdata_valid = Output(Bool())
    val local_refresh_ack = Output(Bool())
  })
}

This is where the first batch of rakes was waiting for me. First, having looked at the ROMGenerator class, I thought that the memory controller could be pulled out of the depths of the design through a global variable, and Chisel would somehow route the wires by itself. It did not work. So I had to make a wiring harness, MemIfBundle, that stretches across the whole hierarchy. Why doesn't it simply stick out of the BlackBox and get connected in one go? The thing is that in a BlackBox all external ports are placed into val io = IO(new Bundle { ... }). If the whole MemIfBundle were a single field inside that bundle, the field's name would become a prefix of all the port names, and the names would simply not match the block's interface. It can probably be done in some saner way, but let's leave it like this for now.


Next, by analogy with other TileLink devices (which mostly live in rocket-chip/src/main/scala/tilelink), and BootROM in particular, we describe our interface to the memory controller:


class AltmemphyDDR2RAM(implicit p: Parameters) extends LazyModule {
  val MemoryPortParams(MasterPortParams(base, size, beatBytes, _, _, executable), 1) = p(ExtMem).get
  val node = TLManagerNode(Seq(TLManagerPortParameters(
    Seq(TLManagerParameters(
      address = AddressSet.misaligned(base, size),
      resources = new SimpleDevice("ram", Seq("sifive,altmemphy0")).reg("mem"),
      regionType = RegionType.UNCACHED,
      executable = executable,
      supportsGet = TransferSizes(1, 16),
      supportsPutFull = TransferSizes(1, 16),
      fifoId = Some(0)
    )),
    beatBytes = 16
  )))
  override lazy val module = new AltmemphyDDR2RAMImp(this)
}
class AltmemphyDDR2RAMImp(_outer: AltmemphyDDR2RAM)(implicit p: Parameters)
    extends LazyModuleImp(_outer) {
  val (in, edge) = _outer.node.in(0)
  val ddr2 = Module(new ddr2_64bit)
  val mem_if = IO(new MemIfBundle)
  // TODO: draw the rest of the owl here
}
trait HasAltmemphyDDR2 { this: BaseSubsystem =>
  val dtb: DTB
  val mem_ctrl = LazyModule(new AltmemphyDDR2RAM)
  mem_ctrl.node := mbus.toDRAMController(Some("altmemphy-ddr2"))()
}
trait HasAltmemphyDDR2Imp extends LazyModuleImp {
  val outer: HasAltmemphyDDR2
  val mem_if = IO(new MemIfBundle)
  mem_if <> outer.mem_ctrl.module.mem_if
}

Using the standard ExtMem key, we extract the external memory parameters from the SoC config. This strange syntax lets me say: "I know I will get back an instance of the MemoryPortParams case class (the key's type guarantees this when the Scala code is compiled, just as in pattern matching), provided that at runtime we do not crash while unwrapping an Option[MemoryPortParams] equal to None (but in that case there was no point in creating a memory controller in System.scala anyway). And I do not need the case class itself, only some of its fields." Next, we create the manager port of the TileLink device. The TileLink protocol handles the interaction of almost everything related to memory: the DDR controller and other memory-mapped devices, processor caches, and maybe something else. Each device can have several ports, and a device can be both a manager and a client. beatBytes, as I understand it, sets the size of one transfer, and we exchange 16 bytes at a time with the controller. We mix HasAltmemphyDDR2 and HasAltmemphyDDR2Imp in at the right places in System.scala and write a config:


class BigZeowaaConfig extends Config (
  new WithNBreakpoints(2) ++
    new WithNExtTopInterrupts(0) ++
    new WithExtMemSize(1L << 30) ++
    new WithNMemoryChannels(1) ++
    new WithCacheBlockBytes(16) ++
    new WithNBigCores(1) ++
    new WithJtagDTM ++
    new BaseConfig
)
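
The pattern-matching val used above to unpack p(ExtMem).get can be tried in isolation. The sketch below is plain Scala 2 style with no rocket-chip dependency: the two case classes are hypothetical, simplified stand-ins (the real MasterPortParams has more fields), and the values are only examples.

```scala
// Hypothetical, simplified stand-ins for the rocket-chip case classes.
final case class MasterPortParams(base: BigInt, size: BigInt, beatBytes: Int, executable: Boolean)
final case class MemoryPortParams(master: MasterPortParams, nMemoryChannels: Int)

object DestructuringDemo {
  // Plays the role of p(ExtMem): an Option that may be None.
  val extMem: Option[MemoryPortParams] = Some(
    MemoryPortParams(MasterPortParams(BigInt("80000000", 16), BigInt(1) << 30, 16, true), 1)
  )

  // One definition does three things:
  //  - binds base, size and executable to the nested fields,
  //  - ignores beatBytes with `_`,
  //  - demands that the channel count is exactly 1 (MatchError otherwise).
  // `.get` on None would throw NoSuchElementException instead.
  val MemoryPortParams(MasterPortParams(base, size, _, executable), 1) = extMem.get
}
```

After this, DestructuringDemo.base, DestructuringDemo.size and DestructuringDemo.executable are ordinary vals. If the config asked for two memory channels, the same definition would fail at elaboration time with a MatchError rather than silently building a single-channel controller.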

Having made a rough "sketch of the owl" in AltmemphyDDR2RAMImp, I synthesized the design (it only closed timing at about 30 MHz; luckily I clock it from 25 MHz) and, with my fingers on the memory module and the FPGA chip, flashed it to the board. Then I saw what a truly intuitive interface is: you issue a memory write command in gdb, and from the hung processor and the strong heat under your (almost burnt) fingers you understand that you must urgently press the reset button on the board and fix the controller.


Reading the documentation for the DDR2 controller


Apparently, it's time to read the controller documentation beyond the list of ports. So, what do we have here?.. Oops: it turns out that the I/O with the local_ prefix must be driven synchronously not with pll_ref_clk, which is 25 MHz, but with phy_clk, which runs at half the memory frequency for a half-rate controller, or, in our case, with aux_half_rate_clk (or is it aux_full_rate_clk after all?), which runs at the full memory frequency, and that, mind you, is 166 MHz.


So we need to cross clock domain boundaries. Out of old habit, I decided to use latches, or rather a chain of them:


  +-+  +-+  +-+  +-+
--| |--| |--| |--| |--->
  +-+  +-+  +-+  +-+
   |    |    |    |
---+    |    |    |
inclk   |    |    |
        |    |    |
--------+----+    |
outclk            |
                  |
------------------+
output enable

But after tinkering for a while, I came to the conclusion that I could not manage two chains of "scalar" latches (one into the high-frequency domain and one back), each carrying signals that travel in opposite directions (ready and valid), while also making sure that no stray bit lags a cycle or two behind along the way. After some time I realized that describing ready/valid synchronization without a common clock is a task akin to building lock-free data structures: you have to think hard and prove a lot formally, it is easy to make a mistake and hard to notice it, and, most importantly, it has all been implemented before us. Intel has the dcfifo primitive, a queue of configurable depth and width that is written and read from different clock domains. So I used an experimental feature of recent Chisel, namely parameterized black boxes:


class FIFO (val width: Int, lglength: Int) extends BlackBox(Map(
  "intended_device_family" -> StringParam("Cyclone IV E"),
  "lpm_showahead" -> StringParam("OFF"),
  "lpm_type" -> StringParam("dcfifo"),
  "lpm_widthu" -> IntParam(lglength),
  "overflow_checking" -> StringParam("ON"),
  "rdsync_delaypipe" -> IntParam(5),
  "underflow_checking" -> StringParam("ON"),
  "use_eab" -> StringParam("ON"),
  "wrsync_delaypipe" -> IntParam(5),
  "lpm_width" -> IntParam(width),
  "lpm_numwords" -> IntParam(1 << lglength)
)) {
  override val io = IO(new Bundle {
    val data = Input(UInt(width.W))
    val rdclk = Input(Clock())
    val rdreq = Input(Bool())
    val wrclk = Input(Clock())
    val wrreq = Input(Bool())
    val q = Output(UInt(width.W))
    val rdempty = Output(Bool())
    val wrfull = Output(Bool())
  })
  override def desiredName: String = "dcfifo"
}

And I wrote a simple little wrapper for arbitrary data types:



object FIFO {
  def apply[T <: Data](
                        lglength: Int,
                        output: T, outclk: Clock,
                        input: T, inclk: Clock
                      ): FIFO = {
    val res = Module(new FIFO(width = output.widthOption.get, lglength = lglength))
    require(input.getWidth == res.width)
    output := res.io.q.asTypeOf(output)
    res.io.rdclk := outclk
    res.io.data := input.asUInt()
    res.io.wrclk := inclk
    res
  }
}
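
The "lpm_showahead" -> "OFF" setting matters for the controller-side logic further down: in this normal (non-show-ahead) mode, rdreq is a request, and the word appears on q only after the clock edge on which rdreq was sampled. Below is a minimal behavioural sketch of that timing in plain Scala. It is a deliberate simplification: a single clock instead of two, Int data, and flags that update instantly, whereas the real dcfifo synchronizes rdempty and wrfull across domains with some latency. The class name NormalModeFifo is made up for illustration.

```scala
import scala.collection.mutable

// Single-clock behavioural model of a normal-mode (non-show-ahead) FIFO.
final class NormalModeFifo(depth: Int) {
  private val mem  = mutable.Queue.empty[Int]
  private var qReg = 0                       // registered output, like dcfifo's q

  def q: Int           = qReg
  def rdempty: Boolean = mem.isEmpty
  def wrfull: Boolean  = mem.size >= depth

  // One rising clock edge. Requests are checked against the flags as they
  // were before the edge, mimicking overflow/underflow checking = ON.
  def step(wrreq: Boolean, data: Int, rdreq: Boolean): Unit = {
    val doRead  = rdreq && !rdempty
    val doWrite = wrreq && !wrfull
    if (doRead) qReg = mem.dequeue()         // q changes only after this edge
    if (doWrite) mem.enqueue(data)
  }
}
```

Writing 42 and then asserting rdreq for one cycle leaves q at its old value during the request cycle and shows 42 only on the following one, which is why logic that consumes q has to look at a one-cycle-delayed copy of its own read strobe.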

Debugging


After that, the code boiled down to passing messages between the domains through two now unidirectional queues: tl_req / ddr_req and ddr_resp / tl_resp (anything with the tl_ prefix is clocked together with TileLink, anything with ddr_ together with the memory controller). The problem was that everything still deadlocked, and sometimes got pretty warm. And while the overheating was caused by asserting local_read_req and local_write_req at the same time, the deadlocks were not so easy to beat. At that point the code looked something like this:


class AltmemphyDDR2RAMImp(_outer: AltmemphyDDR2RAM)(implicit p: Parameters)
    extends LazyModuleImp(_outer) {
  val addrSize = log2Ceil(_outer.size / 16)
  val (in, edge) = _outer.node.in(0)
  val ddr2 = Module(new ddr2_64bit)
  require(ddr2.io.local_address.getWidth == addrSize)
  val tl_clock = clock
  val ddr_clock = ddr2.io.aux_full_rate_clk
  val mem_if = IO(new MemIfBundle)
  class DdrRequest extends Bundle {
    val size = UInt(in.a.bits.size.widthOption.get.W)
    val source = UInt(in.a.bits.source.widthOption.get.W)
    val address = UInt(addrSize.W)
    val be = UInt(16.W)
    val wdata = UInt(128.W)
    val is_reading = Bool()
  }
  val tl_req = Wire(new DdrRequest)
  val ddr_req = Wire(new DdrRequest)
  val fifo_req = FIFO(2, ddr_req, ddr_clock, tl_req, clock)
  class DdrResponce extends Bundle {
    val is_reading = Bool()
    val size   = UInt(in.d.bits.size.widthOption.get.W)
    val source = UInt(in.d.bits.source.widthOption.get.W)
    val rdata = UInt(128.W)
  }
  val tl_resp = Wire(new DdrResponce)
  val ddr_resp = Wire(new DdrResponce)
  val fifo_resp = FIFO(2, tl_resp, clock, ddr_resp, ddr_clock)
  // logic for talking to TileLink
  withClock(ddr_clock) {
    // logic for talking to the memory controller
  }
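
The addrSize computation and the 16-bit be field in the listing above follow directly from the 16-byte request width. Here is a small plain-Scala sketch of the arithmetic. The base address 0x80000000 is taken from the gdb sessions later in the article; the assumption that bit i of local_be corresponds to byte i of the 16-byte word (lowest address first) is mine and is not confirmed by the source.

```scala
object BeatAddressing {
  val Base: Long     = 0x80000000L   // start of external memory
  val BeatBytes: Int = 16            // one controller word = 128 bits
  val MemSize: Long  = 1L << 30      // WithExtMemSize(1 GiB)

  // Number of local_address bits needed: log2(MemSize / BeatBytes).
  val AddrBits: Int = java.lang.Long.numberOfTrailingZeros(MemSize / BeatBytes)

  // Splits a byte access into (controller word address, byte-enable mask).
  // Assumes the access does not cross a 16-byte boundary.
  def split(addr: Long, len: Int): (Long, Int) = {
    require(addr >= Base && addr + len <= Base + MemSize, "outside external memory")
    val offset = addr - Base
    val lane   = (offset % BeatBytes).toInt
    require(lane + len <= BeatBytes, "access crosses a 16-byte word")
    val wordAddr   = offset / BeatBytes
    val byteEnable = ((1 << len) - 1) << lane   // assumed lane ordering, see above
    (wordAddr, byteEnable)
  }
}
```

AddrBits comes out as 26, matching the [25:0] width of local_address in the Verilog port list. A 4-byte access at 0x80000111 lands in controller word 0x11 with byte enables 0x001E, which is the kind of request the "not aligned to 16 bytes" success criterion exercises.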

To localize the problem, I decided to simply comment out all the code inside withClock(ddr_clock) (visually it looks like spawning a thread, doesn't it?) and replace it with a stub that is sure to work:


  withClock (ddr_clock) {
    ddr_resp.rdata      := 0.U
    ddr_resp.is_reading := ddr_req.is_reading
    ddr_resp.size       := ddr_req.size
    ddr_resp.source     := ddr_req.source
    val will_read = Wire(!fifo_req.io.rdempty && !fifo_resp.io.wrfull)
    fifo_req.io.rdreq := will_read
    fifo_resp.io.wrreq := RegNext(will_read)
  }

As I later realized, this stub did not work either, because the Wire(...) construct, which I had added "for reliability" to show that this is a named wire, actually uses its argument only as a prototype for the type of its value and does not bind the wire to the argument expression. Also, when I tried to read what had been generated, I found that in simulation mode there is a rich set of assertions about TileLink protocol violations. They will probably come in handy later, but so far I have not tried to run a simulation: what would I run it with? Verilator probably knows nothing about Altera's IP cores, and ModelSim Starter Edition would most likely refuse to simulate such a large project; besides, it complained about the missing simulation model of the controller. And to generate one, I would probably first have to migrate to the new version of the controller (because the old one was configured in an ancient Quartus).
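
The Wire(...) pitfall is easy to reproduce in isolation. The sketch below is a minimal module that assumes only that chisel3 is on the classpath; the module name WillReadDemo and its port names are made up for illustration. It shows two ways to get a named signal that really is driven by the expression.

```scala
import chisel3._

// Hypothetical demo module: only the two FIFO flags from the stub are modelled.
class WillReadDemo extends Module {
  val io = IO(new Bundle {
    val rdempty = Input(Bool())
    val wrfull  = Input(Bool())
    val viaVal  = Output(Bool())
    val viaWire = Output(Bool())
  })

  // Option 1: a plain Scala val already names the result of the expression.
  val willRead = !io.rdempty && !io.wrfull
  io.viaVal := willRead

  // Option 2: Wire() takes a *type*; the value is attached separately with :=.
  val willReadWire = Wire(Bool())
  willReadWire := !io.rdempty && !io.wrfull
  io.viaWire := willReadWire
}
```

Both outputs compute the same thing. The form used in the stub, Wire(expr), creates a wire of the expression's type and leaves it undriven, so everything downstream of it sees a signal that has nothing to do with the FIFO flags.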


In fact, the code blocks above were taken from an almost working version, not from the one I had been actively debugging a few hours earlier. But it's better for you this way ;) By the way, rebuilding the design over and over goes faster if you replace WithNBigCores(1) with WithNSmallCores(1) in the config: as far as the basic functionality of the memory controller is concerned, there seems to be no difference. And a small trick: so as not to type the same commands into gdb every time (at least for me, it keeps no command history between sessions), you can simply type something like this on the command line


../../rocket-tools/bin/riscv32-unknown-elf-gdb -q -ex "target remote :3333" -ex "x/x 0x80000000"
../../rocket-tools/bin/riscv32-unknown-elf-gdb -q -ex "target remote :3333" -ex "set variable *0x80000000=0x1234"

and rerun it as needed using the shell's own history.


Summing up


In the end, I arrived at the following code for talking to the controller:


  withClock(ddr_clock) {
    val rreq = RegInit(false.B) // read request (not accepted yet)
    val wreq = RegInit(false.B) // write request (not accepted yet)
    val rreq_pending = RegInit(false.B) // read request (waiting for data)
    ddr2.io.local_read_req := rreq
    ddr2.io.local_write_req := wreq
    // some magic constants :)
    ddr2.io.local_size := 1.U
    ddr2.io.local_burstbegin := true.B
    // data from the request (relying on the FIFO's q output being registered)
    ddr2.io.local_address := ddr_req.address
    ddr2.io.local_be := ddr_req.be
    ddr2.io.local_wdata := ddr_req.wdata
    // copy the information about which request is being served
    ddr_resp.is_reading := ddr_req.is_reading
    ddr_resp.size := ddr_req.size
    ddr_resp.source := ddr_req.source
    // fetch the next request only if **absolutely everything** is ready
    val will_read_request = !fifo_req.io.rdempty &&
                               !rreq && !wreq && !rreq_pending && ddr2.io.local_ready
    // respond if there is something to say
    val will_respond = !fifo_resp.io.wrfull &&
                          (  (rreq_pending && ddr2.io.local_rdata_valid) ||
                             (wreq && ddr2.io.local_ready))
    val request_is_read = RegNext(will_read_request)
    fifo_req.io.rdreq := will_read_request
    fifo_resp.io.wrreq := will_respond
    // the request ordered on the previous cycle has been read out
    when (request_is_read) {
      rreq := ddr_req.is_reading
      rreq_pending := ddr_req.is_reading
      wreq := !ddr_req.is_reading
    }
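    when (will_respond) {
      rreq := false.B
      wreq := false.B
      // assumed fix: without clearing the pending flag, will_read_request
      // could never become true again after the first read
      rreq_pending := false.B
      ddr_resp.rdata := ddr2.io.local_rdata
    }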
    // the read data has not arrived yet, but the request has been accepted
    when (rreq && ddr2.io.local_ready) {
      rreq := false.B
    }
  }
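
To see how the three registers interact, the logic above can be replayed cycle by cycle in plain Scala. This is only a sketch of the listing, not the hardware: the names Regs, In and step are invented here, the `if` statements are applied in the same order as the `when` blocks so that later ones win, and it assumes rreq_pending is cleared together with the response, since otherwise will_read_request could never become true again after the first read.

```scala
object DdrSideModel {
  // Registers of the controller-side logic (names follow the listing).
  final case class Regs(
    rreq: Boolean = false,
    wreq: Boolean = false,
    rreqPending: Boolean = false,
    requestIsRead: Boolean = false    // RegNext(will_read_request)
  )

  // Signals sampled during one ddr_clock cycle.
  final case class In(
    rdempty: Boolean,     // request FIFO is empty
    wrfull: Boolean,      // response FIFO is full
    localReady: Boolean,  // controller accepts a command
    rdataValid: Boolean,  // controller returns read data
    isReading: Boolean    // ddr_req.is_reading on the FIFO output
  )

  // Returns (registers after the clock edge, fifo_req.rdreq, fifo_resp.wrreq).
  def step(r: Regs, in: In): (Regs, Boolean, Boolean) = {
    val willReadRequest =
      !in.rdempty && !r.rreq && !r.wreq && !r.rreqPending && in.localReady
    val willRespond = !in.wrfull &&
      ((r.rreqPending && in.rdataValid) || (r.wreq && in.localReady))

    var rreq = r.rreq; var wreq = r.wreq; var pending = r.rreqPending
    if (r.requestIsRead) { rreq = in.isReading; pending = in.isReading; wreq = !in.isReading }
    if (willRespond) { rreq = false; wreq = false; pending = false } // pending clear is assumed
    if (r.rreq && in.localReady) rreq = false
    (Regs(rreq, wreq, pending, willReadRequest), willReadRequest, willRespond)
  }
}
```

For a single read with local_ready held high, the trace is: the request is fetched from the FIFO, one cycle later rreq and rreq_pending are set, on the next cycle the controller accepts the command and rreq drops, and when local_rdata_valid finally arrives the response is pushed and all registers return to idle.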

Here we will slightly tighten the success criterion: I have already seen written data "read back" with no actual memory traffic at all, because of the cache. So let's compile a simple piece of code:


#include <stdint.h>
static volatile uint8_t *x = (uint8_t *)0x80000000u;
void entry()
{
  for (int i = 0; i < 1<<24; ++i) {
    x[i] = i;
  }
}

../../rocket-tools/bin/riscv64-unknown-elf-gcc test.c -S -O1

As a result, we obtain the following fragment of assembler listing, initializing the first 16 MB of memory:


    li  a5,1
    slli    a5,a5,31
    li  a3,129
    slli    a3,a3,24
.L2:
    andi    a4,a5,0xff
    sb  a4,0(a5)
    addi    a5,a5,1
    bne a5,a3,.L2

We put it at the beginning of bootrom/xip/leds.S. Now it is unlikely that everything will get by on the cache alone. What remains is to run the Makefile, rebuild the project in Quartus, flash it to the board, connect OpenOCD + GDB and... presumably, hooray, victory:


$ ../../rocket-tools/bin/riscv32-unknown-elf-gdb -q -ex "target remote :3333"
Remote debugging using :3333
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0x0000000000010014 in ?? ()
(gdb) x/x 0x80000000
0x80000000:     0x03020100
(gdb) x/x 0x80000100
0x80000100:     0x03020100
(gdb) x/x 0x80000111
0x80000111:     0x14131211
(gdb) x/x 0x80010110
0x80010110:     0x13121110
(gdb) x/x 0x80010120
0x80010120:     0x23222120
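
The values in the session above can be checked against the test loop by hand. A small plain-Scala sketch, assuming (as on any RISC-V target) little-endian byte order and gdb's default 4-byte word for x/x; the object and function names are made up here:

```scala
object ExpectedPattern {
  val Base: Long = 0x80000000L

  // Byte stored by the test loop at a given address: x[i] = i, truncated to 8 bits.
  def byteAt(addr: Long): Long = (addr - Base) & 0xff

  // 32-bit little-endian word that `x/x addr` should print.
  def wordAt(addr: Long): Long =
    (0 until 4).map(k => byteAt(addr + k) << (8 * k)).sum
}
```

For example, wordAt(0x80000111L) gives 0x14131211, the same as the unaligned read in the gdb output, so the bytes were written and read back at the right offsets within the 16-byte controller word.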

Whether that is really so, we will find out in the next installment (I also cannot say anything yet about performance, stability, etc.).


Code: AltmemphyDDR2RAM.scala.

