SoC: bringing up a simple DMA in the FPGA


Good afternoon! In a previous article I described how to bring up Altera's SoC from scratch.
There we focused on measuring the throughput between the CPU and the FPGA when the copying is done by the processor.

This time we will go a little further and implement a primitive DMA controller in the FPGA.
If you are interested, read on.

Hardware used

Last time we used the SoCrates board from EBV.
This time I will use a board of our own design - it is the one shown in the photo.

The main difference is that our board has 2 Gigabit Ethernet interfaces, and they are connected to the FPGA rather than to the CPU.
This allows very flexible traffic processing. In addition, a large number of pins are routed out to connectors.

But these differences will only become important in the following articles.
In one, we will implement a NIC in the FPGA - for this, of course, we will use the gigabit interfaces. In another, we will write framebuffer support for the ILI9341 display, again in the FPGA - this will require an expansion board.

For the steps described below, any board with a Cyclone V SoC is suitable.


In this article I will show only the important pieces of code, with explanations.
All the source code can be viewed on GitHub.


I will not repeat the details of building the kernel, obtaining the bootloader, and the other steps described in the previous article.

A note about the kernel - it is better to use the more recent kernel version 3.18 from here:
git clone git:// 
git checkout remotes/origin/socfpga-3.18

Thinking about implementation

DMA controller selection

So, our goal is to transfer data from the FPGA to the processor and/or vice versa with maximum bandwidth and minimum CPU load.
Copying by the processor is ruled out immediately, so we need DMA. But what can play the role of the DMA controller?
There are two options for our SoC - either the FPGA or the DMA-330 controller integrated into the HPS. Judging by discussions online, the DMA-330 is not very fast, and the corresponding driver may not even be fully functional. Perhaps someday we will try to “revive” the DMA-330, but for now our choice is the FPGA.

Interface selection

To act as a DMA controller, the FPGA must be a bus master. This can be implemented on one of two interfaces:
  • FPGA-to-HPS ( fpga2hps )
  • FPGA-to-HPS SDRAM ( fpga2sdram )

Block diagram of the HPS components and the interfaces between them:
HPS Architecture

Let's see what are the advantages and disadvantages of each option.

fpga2hps allows masters in the FPGA to access almost all the slaves in the system. That is, not only memory but also various peripherals.

fpga2sdram lets the FPGA work with the SDRAM “owned” by the HPS. In this case, access is limited to RAM only.

fpga2sdram allows you to get more bandwidth.

When using fpga2hps, the exchange goes through a single interface. If the FPGA needs several masters, arbitration is required. So you either write your own modules or use the ones generated by Qsys, and those are quite resource-intensive. With fpga2sdram, on the other hand, you can create up to 6 independent ports, and all arbitration is handled by the DDR controller.
Note: the number 6 is not entirely “honest” - 6 command ports, 4 write ports and 4 read ports are available.
At the same time, a single 128-bit interface uses 1 command port, 2 write ports and 2 read ports.

Both fpga2hps and fpga2sdram must be initialized before use by writing to the appropriate registers. Unfortunately, for fpga2sdram this has to be done after the FPGA has been programmed, but at a moment when no transactions are taking place on the interface. In practice, when using Linux, this means that the FPGA has to be programmed from U-Boot. Details can be read here.

When working with fpga2hps, the master in the FPGA must use byte addresses; when working with fpga2sdram, it must use word addresses.

For more information, see the Cyclone V Device Handbook, Volume 3: Hard Processor System Technical Reference Manual,
chapter 8 (HPS-FPGA Bridges) and chapter 11 (SDRAM Controller Subsystem).

For our task it makes no fundamental difference which one we use. Let's choose fpga2sdram, hoping to get more bandwidth.

Choosing a DMA Controller Implementation

We have decided that the DMA controller will be implemented in the FPGA and which interface it will use.
But how do we build the controller itself? You could use one of the open IP cores, for example this one, which is also available through Qsys.

This is a good DMA controller with many useful features. We will return to it when we implement our NIC.
But for our current task such a controller brings unnecessary functionality and unnecessary complexity.
For a learning exercise, it is much better to sketch a couple of counters in the FPGA and see that the essence of a DMA controller is very simple.

Top level

From the software side, everything is also quite simple - we need a driver that will allocate memory, get the bus address of this memory, configure and start the DMA controller in FPGA, wait for the transaction to complete and receive data.

And we will write it. But we will not start with the driver, but with a slightly strange program in userspace that will perform the same functions.
This will allow us to work with DMA controllers in FPGA without having to write something at the kernel level.
For "production" such solutions are usually not used, but sometimes it can be convenient for debugging.

For simplicity of firmware in FPGA we will transfer data in the direction of FPGA -> CPU.
Data transmission in the opposite direction is almost completely similar, with the exception of one nuance, which will be discussed below.
We will work with the CPU -> FPGA direction when we implement the framebuffer for the LCD.

So the plan:
  • Firmware for FPGA
  • Program in userspace
  • Kernel driver

FPGA firmware implementation

Let's start with our beloved Qsys. We will need three IP cores:
  • Processors and Peripherals -> Hard Processor Systems -> Arria V / Cyclone V Hard Processor System
  • Basic Functions -> Bridges and Adaptors -> Memory Mapped -> Avalon-MM Pipeline Bridge
  • Basic Functions -> Bridges and Adaptors -> Clock -> Clock Bridge

For HPS, we leave everything almost the same as in the previous article.
On the FPGA Interfaces tab, you need to add the FPGA-to-HPS SDRAM interface.
Choose the type Avalon-MM Bidirectional , width - 128 bits.

You also need to check the box next to Enable FPGA-to-HPS Interrupts .
This will allow our DMA controller to “inform” the CPU about the completion of the transaction through interruption.

Also, the width of the HPS-to-FPGA interface must be set to 64 bits. This is the interface through which the CPU will configure the DMA controller.
Its width can be any, we set 64 bits simply because I had chosen such a width, and the source code described below is configured for this value.

Here's what you should get:
FPGA Interfaces

Now for the Avalon-MM Bridge.
This core will serve as a converter. We need to export the HPS-to-FPGA interface from the auto-generated Qsys module to the outside.
But if we simply do that, we get an AXI interface, which is much more complicated than Avalon-MM, and we do not want to deal with AXI at all. Once this module is added, Qsys will automatically convert AXI to Avalon. It will cost some resources, but it will be much more convenient to work with.

You need to configure the module like this:
Avalon-mm bridge

We move on to the last module. We need it to export the clock from the HPS to the outside and to clock the DMA controller from it. Its configuration is trivial - you just need to set the number of clocks to 1.

After that you need to connect all our modules (pay attention to the names in the Export column ):
Qsys Connections

It remains to save and generate files.

It's time to implement our primitive DMA controller. How will we configure it?
For configuration, we will use so-called Control and Status Registers (CSR).
These are fixed-size registers that the CPU can read and write (control) or only read (status).

Access to these registers will be via HPS-to-FPGA.
Since the interface is 64 bits wide, you can either make the registers the same width or add a converter.
Making the registers 64-bit is very expensive, because very often only a few bits of a whole register are used.
It is better to make the registers 16-bit and, when a wider word is needed, use 2 or 4 adjacent registers.

In theory, we could use the converter generated by Qsys by specifying a width of 16 bits for the Avalon-MM Bridge IP core, but in practice this did not work - Qsys generated a non-working module. That's okay, we will use our own :)

The module is used as a converter , and the registers themselves are implemented in the regfile_with_be.v module

The logic of the register module is extremely simple - depending on the address we set the contents of the desired register on the data bus. If a write signal has also arrived, then we save the input data into the register. The address specifies the register number, not the byte number. The method of dividing into control and status registers is set by the parameter during assembly - either by the high bit of the address (the address space in this case is divided evenly between control and status registers), or by the number of registers indicated by the parameters.
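To make the register-file behaviour easier to picture, here is a minimal software model in C. It is only a sketch and not a translation of regfile_with_be.v: the type and function names, the 4-bit address width, and the choice that a set high address bit selects the status half are all assumptions made for illustration, and byte enables are ignored.

```c
#include <stdint.h>

#define REG_ADDR_BITS 4u                           /* assumed address width */
#define REGS_PER_BANK (1u << (REG_ADDR_BITS - 1))  /* half control, half status */

struct regfile {
    uint16_t ctrl[REGS_PER_BANK];  /* writable by the CPU */
    uint16_t stat[REGS_PER_BANK];  /* driven by the logic, read-only for the CPU */
};

/* Assumption: the high address bit selects the status half. */
static int regfile_is_status(uint32_t addr)
{
    return (int)((addr >> (REG_ADDR_BITS - 1)) & 1u);
}

/* The address is a register number, not a byte offset. */
static uint16_t regfile_read(const struct regfile *rf, uint32_t addr)
{
    uint32_t idx = addr & (REGS_PER_BANK - 1u);
    return regfile_is_status(addr) ? rf->stat[idx] : rf->ctrl[idx];
}

/* Writes only land in control registers; writes to status registers are dropped. */
static void regfile_write(struct regfile *rf, uint32_t addr, uint16_t data)
{
    if (!regfile_is_status(addr))
        rf->ctrl[addr & (REGS_PER_BANK - 1u)] = data;
}
```

With this model, a value written to control register 1 reads back unchanged, while a write to any address in the status half has no effect.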

We pass directly to the DMA controller. For simplicity, it is located in the top module .

All that our DMA controller will consist of are three counters and a pair of signals.

Let me remind you that our controller drives data onto an Avalon-MM interface. A detailed description can be found here, but in general it is a fairly simple interface.
To write data, you need to drive the following signals:
  • sdram0_address - address (I remind you that for fpga2sdram it should be the address of the word).
  • sdram0_writedata - data for writing.
  • sdram0_byteenable - a signal indicating which bytes from the data should be written. For simplicity, set it to 16'hFFFF .
  • sdram0_burstcount - a signal for controlling burst. Again, for simplicity, set it to 1.
  • sdram0_write - this signal must be set to 1 to perform a write transaction.

The only caveat to keep in mind is the sdram0_waitrequest signal. If it is 1, the slave cannot process the transaction right now and the master must hold all its signals unchanged. How often sdram0_waitrequest is asserted is ultimately what determines the throughput of our DMA.
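The effect of waitrequest on throughput can be illustrated with a tiny software model. This is a sketch in C, not the FPGA logic: the function name and the idea of describing the slave by a repeating stall pattern are assumptions made purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the clock cycles needed to complete `words` single-word writes when
 * the slave follows a repeating waitrequest pattern (1 = stall, 0 = accept).
 * Returns 0 if the pattern never accepts, since the transfer would never end. */
static uint32_t cycles_for_transfer(uint32_t words,
                                    const uint8_t *waitrequest,
                                    size_t pattern_len)
{
    size_t i, accepting = 0;
    uint32_t cycles = 0, done = 0;

    for (i = 0; i < pattern_len; i++)
        if (!waitrequest[i])
            accepting++;
    if (accepting == 0)
        return 0;

    while (done < words) {
        /* A write completes only in a cycle where waitrequest is low. */
        if (!waitrequest[cycles % pattern_len])
            done++;
        cycles++;
    }
    return cycles;
}
```

A slave that never stalls finishes 256 words in 256 cycles; one that stalls every other cycle needs twice as many, which halves the throughput.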

So, let us describe the counters used. The first is the address counter, addr_cnt. When a DMA transaction starts, it is set to the address specified by the CPU. After each successful transaction (when sdram0_waitrequest is not 1), this counter is incremented by 1.

The second is the data_cnt counter, used to emulate data. You can write anything you like as data. The main requirement is that, after the transaction completes, the software reads back from memory exactly the data that was written. A plain counter is therefore not a good choice - the data would contain many zeros and it would be hard to verify that the write was correct. A pseudo-random sequence would be ideal, but for simplicity the counter together with its inverted value is enough.

The third counter - the cycle counter, cycle_cnt , will be reset to 0 at the start of the DMA transaction and then increase by 1 in each cycle.
We need it so that we can find out how many clocks our DMA transaction took and calculate the throughput.

In total, for the counters we get the following code:
Description of counters
// Emulated data
logic [63:0] data_cnt;
// Current address on SDRAM iface
logic [31:0] addr_cnt;
// Overall cycles count.
logic [31:0] cycle_cnt;
// Form pseudo-data
always_ff @( posedge clk_w )
  if( !test_is_running )
    data_cnt <= '0;
  else
    if( !sdram0_waitrequest )
      if( data_cnt != ( dma_data_size - 1 ) )
        data_cnt <= data_cnt + 1;
// Increase address if no waitrequest
always_ff @( posedge clk_w )
  if( run_test_stb )
    addr_cnt <= dma_addr;
  else
    if( !sdram0_waitrequest )
      addr_cnt <= addr_cnt + 1;
// Count clock cycles while the test is running
always_ff @( posedge clk_w )
  if( test_is_running_stb )
    cycle_cnt <= '0;
  else
    if( test_is_running )
      cycle_cnt <= cycle_cnt + 1;

Back to the signals. We need only:
  • test_is_running is a signal indicating whether a DMA transaction is currently in progress.
  • run_test_stb - strobe signal, active for 1 clock at the moment when the CPU starts the DMA controller
  • test_finished - a signal indicating that the required amount of data has been written. It also triggers the interrupt.

The formation of these signals is trivial.

What do we need in order to configure the DMA controller (these will be our control registers)?
  • The address of the buffer to copy data into
  • The size of the data to write
  • A signal to start a transaction, from which we then detect the rising edge

Status registers will be:
  • DMA controller busy signal
  • Counter value cycle_cnt

So this is what our register declaration looks like:
Register Declaration
// Control registers
`define DMA_CTRL_CR        0
        `define DMA_CTRL_CR_RUN_STB      0
`define DMA_ADDR_CR0       1
`define DMA_ADDR_CR1       2
`define DMA_SIZE_CR0       3
`define DMA_SIZE_CR1       4
// Status registers
`define DMA_STAT_SR        0
        `define DMA_STAT_SR_BUSY         0
`define DMA_CYCLE_CNT_SR0  1
`define DMA_CYCLE_CNT_SR1  2

And here is how the registers are assigned:
Register Assignment
// Control from CPU -- bit for start, DMA buffer address and transaction size.
assign run_test       = cregs_w[`DMA_CTRL_CR][`DMA_CTRL_CR_RUN_STB];
assign dma_addr       = { cregs_w[`DMA_ADDR_CR1], cregs_w[`DMA_ADDR_CR0] };
assign dma_data_size  = { cregs_w[`DMA_SIZE_CR1], cregs_w[`DMA_SIZE_CR0] };
// Status for CPU -- current state and overall cycles count.
assign sregs_w[`DMA_STAT_SR][`DMA_STAT_SR_BUSY] = test_is_running;
assign { sregs_w[`DMA_CYCLE_CNT_SR1], sregs_w[`DMA_CYCLE_CNT_SR0] } = cycle_cnt;
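On the software side, a 32-bit value such as the DMA address has to be split across two 16-bit registers in the same order as the Verilog concatenation { CR1, CR0 }. A minimal sketch in C follows; the helper names are hypothetical and do not come from the article's sources.

```c
#include <stdint.h>

/* Low 16 bits go to the *_CR0 register, high 16 bits to *_CR1. */
static inline uint16_t reg_lo(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}

static inline uint16_t reg_hi(uint32_t value)
{
    return (uint16_t)(value >> 16);
}

/* Mirror of the Verilog concatenation { CR1, CR0 }. */
static inline uint32_t reg_join(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}
```

The same helpers apply to the two-register cycle counter on the status side, where reg_join reassembles the value read from SR1 and SR0.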

That's it, the project can be compiled. First, let's run Analysis & Synthesis.

After that, create a SignalTap file - with it we can look at the signal values inside the FPGA.
To do this, go to File -> New -> SignalTap II Logic Analyzer File and click OK.
In the window that appears, you need to add the necessary signals. You should get something like:
SignalTap File

We save the file, add it to the project and finish the build.

After the build is complete, we need to get the .rbf file:
quartus_cpf -c etln.sof dma.rbf

That's it, the firmware is ready. We move on to the software part.

Attention: remember that after changing the settings in Qsys (in particular, after enabling fpga2sdram), you need to regenerate and rebuild the Preloader.

Also note that only the files with Verilog code and the file with the Qsys settings are uploaded to GitHub for the FPGA part.
The project files (.qpf, .qsf, etc.) are omitted because they do not carry any really useful information.

Implementation of userspace program

What do we need in order to work with the DMA controller from the software side?

First, we need to be able to configure and run the DMA controller. For this, we use the mem program from the previous article.

Secondly, we need to get a memory area, the address of which we can pass to the DMA controller.

A small digression is needed here. Usually all userspace processes, and even most kernel code, work with so-called virtual addresses. But the DMA controller must be given a physical address (more precisely, a bus address, but on the platforms we use it is equal to the physical one).

For such tasks the kernel has a set of special functions that let you convert a virtual address to a physical one (and back), or allocate a memory area and get both addresses pointing to it at once.

What can we do in userspace? The wonderful file /proc/[PID]/pagemap, which contains the mapping of all virtual pages to physical ones for any process, will help us.

The information for each page in this file takes 8 bytes. The lower 55 bits contain the physical page number - the Page Frame Number (PFN), and the upper 9 bits contain various flags (page present, page swapped, etc.). A detailed description can be found here or in man proc.

Thus, knowing the virtual address and the page size, it is easy to calculate the virtual page number. After that, you just need to read 8 bytes from /proc/[PID]/pagemap at the corresponding offset, and the lower 55 bits will hold the physical page number. It is then easy to turn it into a physical address, which we will write to the DMA controller.
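The arithmetic of this lookup can be sketched in C as follows. This is an illustration only, not the code of phys_addr.c: the macro and function names are made up, and returning 0 for a page that is not present is a simplification chosen for the example. The bit positions (PFN in bits 0..54, "present" flag in bit 63) follow the Linux pagemap documentation. Note also that on newer kernels the PFN field reads as zero for unprivileged processes.

```c
#include <stdint.h>

#define PAGEMAP_ENTRY_BYTES 8u
#define PAGEMAP_PFN_MASK    ((UINT64_C(1) << 55) - 1)  /* bits 0..54 */
#define PAGEMAP_PRESENT     (UINT64_C(1) << 63)        /* page is in RAM */

/* Byte offset of the entry for vaddr inside /proc/[PID]/pagemap. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    return (vaddr / page_size) * PAGEMAP_ENTRY_BYTES;
}

/* Physical address for vaddr, given the 8-byte pagemap entry read at
 * pagemap_offset(). Returns 0 if the page is not present in RAM. */
static uint64_t pagemap_to_phys(uint64_t entry, uint64_t vaddr,
                                uint64_t page_size)
{
    if (!(entry & PAGEMAP_PRESENT))
        return 0;
    return (entry & PAGEMAP_PFN_MASK) * page_size + (vaddr % page_size);
}
```

In a real program the entry would be obtained by seeking to pagemap_offset() in /proc/self/pagemap and reading 8 bytes from there.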

If our memory area starts on a page boundary, everything becomes even a little easier.
Therefore, instead of malloc(), it is better to use posix_memalign(), which lets you request the desired alignment.

Also, to prevent the data from being swapped out of RAM, it is advisable to use mlock().

The things described above are performed by the phys_addr.c program .
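The allocation step can be sketched like this. It is a minimal example under stated assumptions and not the code of phys_addr.c itself: the function name and its out-parameters are hypothetical, and it allocates exactly one page because a single page is the largest physically contiguous buffer this method can rely on.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate one page-aligned page. On success returns the buffer, stores the
 * page size in *page_size and reports in *locked whether mlock() succeeded.
 * The caller releases the buffer with free(). */
static void *alloc_dma_page(size_t *page_size, int *locked)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    if (ps <= 0 || posix_memalign(&buf, (size_t)ps, (size_t)ps) != 0)
        return NULL;

    memset(buf, 0, (size_t)ps);               /* touch the page so it is really mapped */
    *locked = (mlock(buf, (size_t)ps) == 0);  /* try to keep it out of swap */
    *page_size = (size_t)ps;
    return buf;
}
```

Touching the page before looking it up in pagemap matters, because a freshly allocated page may not be backed by physical memory yet.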

Important note - pages that are adjacent in the virtual address space are not necessarily adjacent in RAM.
Therefore, with this method the DMA controller cannot write more data than fits in one page.
We can get around this limitation when we write a driver.

Intermediate check

So, the firmware and the test program are ready; time to test them a little.

We copy the binaries to the SD card, connect the USB-Blaster and power up our board.

I wrote above that the fpga2sdram interface must be enabled before Linux boots. That is true, but not always.
If you enable the interface from within Linux and then try to read memory from the FPGA, the system hangs completely.
Writes, however, succeed. Naturally, this approach should not be used on a production system, and below I will describe how to initialize the fpga2sdram interface properly. But for an intermediate test it suits us fine.

First, we program the FPGA:
cat dma.rbf > /dev/fpga0 

Now enable the HPS-to-FPGA interface :
echo 1 > /sys/class/fpga-bridge/hps2fpga/enable 

If we run SignalTap now , we will see that the sdram0_waitrequest signal is constantly hanging at 1. This is because the fpga2sdram interface is turned off.

Turn it on:
./mem.o 0xFFC25080 0x3fff

Writing ones to the bits of the register at 0xFFC25080 enables the corresponding ports of the fpga2sdram interface. Which bits are responsible for which ports is described in the Handbook mentioned above. For simplicity, it is enough for us to enable all the ports (14 bits of the register are used in total).

Now, in SignalTap, the signal sdram0_waitrequest has become 0.

We run the phys_addr utility :

It allocates a buffer and prints its physical address. In my case it is 0x2d593000.
Remember that the fpga2sdram interface uses word addresses.
Since our words are 128 bits wide, the word address is calculated as follows:
0x2d593000 / 16 = 0x2d59300
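The same conversion as a small C helper, for those who prefer to script it. The macro and function names are hypothetical; the 16-byte word size follows from the 128-bit interface width chosen in Qsys, and rejecting misaligned addresses is a safety check added for this sketch.

```c
#include <stdint.h>

#define FPGA2SDRAM_WORD_BYTES 16u  /* 128-bit interface */

/* Convert a physical byte address to an fpga2sdram word address.
 * Returns -1 if the address is not aligned to a whole word. */
static int64_t byte_to_word_addr(uint32_t phys)
{
    if (phys % FPGA2SDRAM_WORD_BYTES != 0)
        return -1;
    return (int64_t)(phys / FPGA2SDRAM_WORD_BYTES);
}
```

A page-aligned buffer always passes the alignment check, since the page size is a multiple of 16 bytes.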

We write this address in the FPGA registers:
./mem.o 0xC0000002 0x2d59300 

For the address we use control registers 1 and 2. Each register is 16 bits, or 2 bytes. Since the HPS-to-FPGA window starts at 0xC0000000, control register 1 has the byte address 0xC0000002.
Let me remind you that mem.c uses byte addresses.

After that, we write the length of the DMA transaction into control register 3. The length must not exceed the page size, which for us is 4096 bytes. Since our fpga2sdram interface is 128 bits wide and the transaction size is specified in words, we must write 256 into register 3:
./mem.o 0xC0000006 256 
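Both calculations, the byte address of a control register and the transaction length in words, can be written down as a short C sketch. The names are hypothetical; the base address, the 2-byte register size and the 16-byte word size are the values used in this article.

```c
#include <stdint.h>

#define HPS2FPGA_BASE 0xC0000000u  /* start of the HPS-to-FPGA window */
#define CSR_BYTES     2u           /* each control register is 16 bits */
#define WORD_BYTES    16u          /* one 128-bit fpga2sdram word */

/* Byte address of control register number reg_index, as expected by mem.c. */
static uint32_t csr_byte_addr(uint32_t reg_index)
{
    return HPS2FPGA_BASE + reg_index * CSR_BYTES;
}

/* Transaction length in fpga2sdram words for a buffer of nbytes bytes. */
static uint32_t bytes_to_words(uint32_t nbytes)
{
    return nbytes / WORD_BYTES;
}
```

For register 3 this gives 0xC0000006, and for one 4096-byte page it gives 256 words.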

Next, we configure SignalTap to trigger on the falling edge of the test_is_running signal and start the DMA controller.
To do this, first write 0 to bit 0 of register 0 (if it is not 0 already), and then write 1. Keep in mind that the mem.o utility performs 4-byte transactions, which cover 2 of our registers. So if we are not careful, we will overwrite the data in the neighbouring register.

All in all, we first need to read the data at 0xC0000000 and then write it back with bit 0 set.

We read:
./mem.o 0xC0000000 

I read 0x93000000. We write it back with bit 0 set:

./mem.o 0xC0000000 0x93000001 
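The read-modify-write step boils down to two bit operations on the 32-bit word, which holds control register 0 in its lower half and control register 1 in its upper half. A minimal C sketch, with hypothetical names:

```c
#include <stdint.h>

#define DMA_CTRL_RUN_BIT 0x1u  /* bit 0 of control register 0 */

/* Clear the run bit, leaving the neighbouring register untouched. */
static uint32_t with_run_cleared(uint32_t word)
{
    return word & ~DMA_CTRL_RUN_BIT;
}

/* Set the run bit, leaving the neighbouring register untouched. */
static uint32_t with_run_set(uint32_t word)
{
    return word | DMA_CTRL_RUN_BIT;
}
```

Writing the cleared value first and the set value second produces the 0 -> 1 transition that starts the controller, while the upper 16 bits (the low half of the DMA address) are preserved.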

After that, we should get something like this in SignalTap :
SignalTap Result

As you can see, the value of the cycle_cnt counter when the transaction ends is 3167.
Let's calculate the throughput. The clock frequency in my project is 150 MHz (to be able to vary the frequency over a wider range, I do not use the clock from the HPS but take it from a PLL instead. These changes are trivial, but they are not on GitHub).
The width is 128 bits. In 3167 clock cycles, 256 words were transferred. In total:
128 * 150 / (3167/256) ≈ 1552 Mbit/s
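For repeated measurements it is handy to have the formula as a function. The following C sketch simply restates the calculation above; the function and parameter names are made up for the example.

```c
/* Throughput in Mbit/s for `words` bus words of `width_bits` bits each,
 * transferred in `cycles` clock cycles at `clk_mhz` MHz.
 * Equivalent to width_bits * clk_mhz / (cycles / words). */
static double dma_throughput_mbit(double width_bits, double clk_mhz,
                                  double words, double cycles)
{
    return width_bits * clk_mhz * words / cycles;
}
```

Calling it with 128, 150, 256 and the cycle_cnt value read from the status registers reproduces the figure for this run.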

UPDATE: this low throughput is the result of a typo; see the conclusions for details.

It remains to verify that the data was written correctly. Resume the paused phys_addr utility by pressing Enter.
We should see the following text:
Phys_addr execution result

0: 0x0
1: 0xffffffffffffffff
2: 0x1
3: 0xfffffffffffffffe
507: 0xffffffffffffff02
508: 0xfe
509: 0xffffffffffffff01
510: 0xff
511: 0xffffffffffffff00

If you see this, everything went well.

After experimenting with different parameters, I saw that the clock frequency has almost no effect on the throughput.
It stays approximately the same at both 25 MHz and 150 MHz.
The width of the fpga2sdram interface, on the other hand, gives an almost linear dependence - this was tested at 64 and 128 bits. I did not test 256 bits.

Naturally, due to the fact that the amount of recorded data is small (only 4096 bytes), the measurement error is quite large.
We can increase the size of a DMA transaction by writing our own primitive driver.

Writing the driver

The article turned out a little longer than I expected, so I will describe the driver only very briefly.
Besides, we will still be working with it in the following articles.
The code is on GitHub, so anyone interested can look at the details.

The basic idea is simple - when starting the driver, we set with the parameter what size of transaction we need.
The driver allocates memory and writes the bus address and transaction size to the FPGA.

The driver also registers the interrupt handler that we set in the FPGA firmware.

After that, the driver creates two char devices:
  • /dev/etn-ctrl - to start a DMA transaction
  • /dev/etn-data - to get the data

Reading from /dev/etn-ctrl starts a DMA transaction.
After that, the call blocks until the interrupt arrives from the FPGA.

When the interrupt arrives, the call returns. This means that the data has been written and can be read from /dev/etn-data.
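From userspace, this sequence might look like the following C sketch. It is an assumption-laden illustration rather than a tested client for the driver: the function name is hypothetical, and the size of the read on the control device (1 byte here) is a guess, since the text only says that a read starts the transaction and blocks until the interrupt.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Start a DMA transaction by reading from ctrl_path, then read up to len
 * bytes of the result from data_path.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t dma_fetch(const char *ctrl_path, const char *data_path,
                         void *buf, size_t len)
{
    char dummy;
    ssize_t n = -1;
    int ctrl = open(ctrl_path, O_RDONLY);

    if (ctrl < 0)
        return -1;

    /* Per the description above, this read blocks until the FPGA interrupt. */
    if (read(ctrl, &dummy, 1) >= 0) {
        int data = open(data_path, O_RDONLY);
        if (data >= 0) {
            n = read(data, buf, len);
            close(data);
        }
    }
    close(ctrl);
    return n;
}
```

On the board it would be called as dma_fetch("/dev/etn-ctrl", "/dev/etn-data", buf, size), with size no larger than the transaction size passed to the driver.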

For the driver to work, add the following lines to the .dts file:
Changes to .dts

fpga {
    compatible = "mtk,etn";
    interrupts = <0x0 0x28 0x1>;
};

The first line defines the compatible driver, and the second - the number and type of interrupt from the FPGA.

With a 4 MB transaction, the throughput is about 2000 Mbit/s (see the UPDATE in the conclusions).


Conclusions

I wrote a primitive DMA controller in the FPGA and measured its throughput.

It was about 2 Gbit/s.
This low throughput is due to a typo in the DDR3 settings.
Namely, the PLL reference clock was specified as 125 MHz rather than 25 MHz, which is what it actually is.
Because of this, the multiplier and divider values for the PLL were calculated incorrectly.
As a result, DDR3 ran at 66 MHz instead of the intended 333 MHz.

With the correct coefficients and an interface width of 256 bits, the throughput is about 16-17 Gbit/s, which corresponds to the theoretical value for a DDR3 interface with a width of 32 bits and a frequency of 333 MHz.

I will describe in more detail in the next article.

The further outline of the articles is as follows, unless, of course, they are of interest to anyone:
  • Framebuffer implementation for ILI9341 in FPGA
  • Working with SGDMA Controller
  • Gigabit 2-Port NIC Implementation in FPGA Using SGDMA Controller

Thanks to those who got to the end! Good luck.
