Ethernet + PCIe + FPGA = LOVE

You cannot get onto an Ethernet network without a network interface card (NIC). At low speeds (up to 1G) the NIC is integrated into the motherboard, while at high speeds (10G / 40G) it sits on a separate PCIe board. The heart of such a board is an integrated circuit (ASIC) that receives and sends packets at the lowest level. For most tasks, the capabilities of this chip are more than enough.
But what if the capabilities of the network card are not enough, or the task requires the closest possible access to the low level? Then boards with reprogrammable logic, FPGAs, come onto the scene.
What tasks they solve, what goes onto such boards, and the most interesting representatives, you will find under the cut!
Caution: there will be pictures!
Plan:
- FPGA board applications
- General scheme
- PCIe card review
- Conclusion
FPGA Board Applications
DPI, filtering and firewall

A server with such a board can be inserted inline ("into the gap") and inspect all the packets passing through it. The intelligent DPI part is implemented on the CPU, while packet forwarding and simple filtering (for example, a large set of 5-tuple rules) are done in the FPGA.
How this can be done:
- Flows that we trust, or for which the FPGA already has a verdict in its table, pass through the chip with minimal delay; the rest are copied to the CPU and processed there.
- The FPGA can take part of the load off the CPU and search for suspicious signatures itself, for example using a Bloom filter. A Bloom filter has a non-zero false-positive probability, so if a packet contains a string the filter reacted to, the packet is copied to the CPU for additional analysis (a minimal sketch of such a membership test is shown below).
- Only the traffic of interest is processed on the CPU: the FPGA selects packets by the specified criteria (for example, HTTP requests or SIP traffic) and copies them to the CPU; everything else (torrents, video, etc.) passes through the FPGA without significant delay.
These three options can be combined in various ways. The FPGA can also take on other dirty work, for example acting as a shaper / policer or collecting per-flow statistics.
In the illustration above, the green packets are processed on the CPU, the burgundy and yellow ones passed through the filters in the FPGA, and the pink ones were dropped (also in the FPGA).
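To make the false-positive trade-off concrete, here is a minimal sketch of a Bloom-filter membership test in C. It is an illustration only: the bit-array size, the number of hash functions, and the seeded FNV-1a hash are assumptions, and a real FPGA design would implement the same idea with on-chip block RAM and parallel hash units rather than software.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS (1u << 20)   /* size of the bit array (assumed) */
#define NUM_HASHES 4            /* number of hash functions (assumed) */

static uint8_t bloom[BLOOM_BITS / 8];

/* Seeded FNV-1a, used here purely for illustration. */
static uint32_t bloom_hash(const uint8_t *data, size_t len, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

/* Add a signature to the filter. */
void bloom_add(const uint8_t *sig, size_t len)
{
    for (uint32_t k = 0; k < NUM_HASHES; k++) {
        uint32_t bit = bloom_hash(sig, len, k);
        bloom[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns true if the signature *may* be present (false positives possible),
 * false if it is definitely absent -- hence the extra check on the CPU. */
bool bloom_maybe_contains(const uint8_t *sig, size_t len)
{
    for (uint32_t k = 0; k < NUM_HASHES; k++) {
        uint32_t bit = bloom_hash(sig, len, k);
        if (!(bloom[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}
```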
Traffic analysis and capture

Sometimes these boards are used to capture traffic for further post-processing on the CPU (writing pcap files, delay analysis, etc.). In that case an optical splitter is inserted into the links (or the traffic is taken from a mirror port). The result is a non-intrusive connection, similar to what we did in the RTP stream monitoring project.
Here the FPGA is needed for:
- Filtering by fields (5-tuple style): to select only the traffic of interest.
- PTP synchronization for hardware timestamping of packets: the time is latched when a packet arrives, and the timestamp is appended to the end of the packet (see the sketch below). On the CPU you can then calculate, for example, the response time to a request.
- Slicing: keeping only the necessary piece of data (most often the first N bytes of the packet, so that only the headers are copied, because the payload is often of no interest).
- Packet buffering:
- if the CPU cannot keep up with writing during bursts, they can be smoothed out by buffering packets in a couple of gigabytes of external memory
- if we want to guarantee capture of packets for a short period (for example, after a trigger fires); this is most relevant at high speeds (40G / 100G).
- Distributing packets across queues and CPU cores.
Having access to the lowest level (well, almost), you can support any protocol or tunneling scheme without waiting for Intel to add it to its cards.
In the illustration above, the FPGA receives all the packets after the traffic tap, but only those of interest to us (the pink ones) are copied to the CPU.
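Here is a minimal sketch of how the host side might unpack such a captured frame, assuming the FPGA appends a 64-bit nanosecond timestamp as a trailer and slices the frame to the first SNAP_LEN bytes. The trailer layout and the SNAP_LEN value are assumptions for illustration; every board defines its own capture format.

```c
#include <stdint.h>
#include <string.h>

#define SNAP_LEN 128  /* bytes kept by the FPGA slicer (hypothetical value) */

typedef struct {
    const uint8_t *data;  /* sliced frame as delivered by the card           */
    uint32_t       len;   /* length including the appended timestamp trailer */
} capture_buf_t;

/* Extract the hardware timestamp and the length of the sliced headers.
 * Returns 0 on success, -1 if the buffer is too short to hold a trailer. */
int parse_capture(const capture_buf_t *buf, uint64_t *ts_ns, uint32_t *hdr_len)
{
    if (buf->len < sizeof(uint64_t))
        return -1;

    /* the 64-bit PTP-derived timestamp sits at the very end of the frame */
    memcpy(ts_ns, buf->data + buf->len - sizeof(uint64_t), sizeof(uint64_t));

    *hdr_len = buf->len - (uint32_t)sizeof(uint64_t);
    if (*hdr_len > SNAP_LEN)
        *hdr_len = SNAP_LEN;
    return 0;
}
```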
Network card
An FPGA card can be used as a regular NIC, but there is little point in doing so:
- ASIC-based network cards already exist for every Ethernet speed (up to 100G inclusive), and they are cheaper than FPGA solutions.
- If you build such a card yourself, you have to implement a pile of extra features (RSS, LSO, LRO) just to get reasonably serious performance.
It only makes sense when you need a unique feature that will never appear in a chip from Intel, for example hardware encryption according to GOST or the Kuznyechik ("Grasshopper") cipher.
Network accelerator
CPU load reduction
Once speeds get high, the processor no longer has time to do everything, so we want to offload some of its tasks. For example, what happens when you copy a large amount of data over the network?
The processor has to:
- take a chunk of data
- wrap it in TCP and split it into several packets according to the MTU
- fill in the headers (MAC / IP addresses)
- calculate the IP and TCP checksums (although most NICs already take care of this; a software version is sketched below)
- pass a descriptor to the NIC
It also has to:
- track the replies
- retransmit packets when they are lost
- shrink / grow the TCP window, and so on
The TCP stack can be implemented in the FPGA: the CPU only has to provide a pointer to the raw data and the receiver's IP + port, and all the low-level work (establishing the connection, retransmissions, etc.) is done by the hardware.
There are ready-made IP cores that do all this, for example the TCP and UDP stack implementations from PLDA.
They have standard interfaces (Avalon or AXI), which makes them easy to connect to other IP cores.
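As a reference point for what gets offloaded, here is the classic Internet checksum (RFC 1071) in plain C, i.e. the software version of one per-packet chore that checksum offload in an ASIC NIC or an FPGA TCP stack removes from the CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum (RFC 1071) over a buffer. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {                          /* sum 16-bit words */
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                                   /* odd trailing byte */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)                          /* fold the carries */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```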

Faster Response
There is a class of tasks where money is made not by raw processing power but by reaction speed. I am, of course, talking about High Frequency Trading. The role of FPGAs in HFT is covered in this article.
The PLDA website provides a video and an example of how this is done. Using hardware TCP and UDP cores lets you cut the latency of buy / sell orders.

Hidden text
I apologize for the red underlines: the picture was taken from the PLDA website, and they are there in the original...
There are special IP cores that decode market data feeds and are ready to interface with the hardware TCP and UDP stacks.
Of course, using off-the-shelf cores and approaches will not give you an edge over the competition: exclusive solutions get developed, because "they want even lower latency".
Measuring equipment
Network emulators
It often happens that an engineering solution has to be checked in the lab first, because testing it in the field can be very expensive.
Recently there was an article from CROC about traffic optimization in the North under large RTTs. To check what the quality of service will be under real conditions, you first have to recreate those conditions in the lab. You can use an ordinary Linux machine for this, but there is also dedicated hardware built for network emulation.
Most often you need to be able to set parameters such as delay / jitter and packet loss (errors). Hardware support (read: FPGA) is indispensable here, but a "smart" processor is also needed to emulate various protocols (user sessions). To avoid developing hardware from scratch, you can take a server and plug in a PCIe card with an FPGA.
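For a feel of what such an emulator does per packet, here is a minimal sketch in C, assuming only fixed delay, uniform jitter, and independent random loss. Real emulators (Linux netem or FPGA-based boxes) also model reordering, duplication, rate limiting and more; the parameter names here are just illustrative.

```c
#include <stdlib.h>

typedef struct {
    double loss_prob;   /* e.g. 0.01 for 1% packet loss     */
    double delay_us;    /* base one-way delay, microseconds */
    double jitter_us;   /* +/- uniform jitter, microseconds */
} emu_params_t;

/* Per-packet decision: returns a negative value if the packet should be
 * dropped, otherwise the delay (in microseconds) to apply before forwarding. */
double emulate_packet(const emu_params_t *p)
{
    if ((double)rand() / RAND_MAX < p->loss_prob)
        return -1.0;                                        /* drop */

    double jitter = ((double)rand() / RAND_MAX * 2.0 - 1.0) * p->jitter_us;
    return p->delay_us + jitter;
}
```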
Faster Computing
Such cards can also be used to accelerate computations or simulations, for example in biology or chemistry. An example of such modeling was described by Algeronflowers in this article. In this case Ethernet ports may not be needed at all, although they do come in handy if you want to build a farm of boards and feed the input and output data over Ethernet.
OpenCL

Sometimes there is no need to squeeze every last drop out of the hardware: time-to-market matters more. Many developers avoid FPGAs because they are put off by clock-level optimization (plus you have to learn new languages and tools). They would rather write code at a "high" level and let the compiler map everything onto flip-flops and memory blocks. One such option is OpenCL; it is supported by both Altera and Xilinx.
OpenCL on FPGAs is a topic for a separate article (more than one, in fact). I recommend the Altera presentation with an overview of the technology and of the development flow for FPGAs.
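To show what "high level" means here, this is a trivial kernel in OpenCL C (a C dialect) that the vendor offline compilers turn into a pipelined hardware datapath; how it maps onto registers and memory blocks is the compiler's problem, not the developer's. Vector addition is just a toy example, not tied to any of the boards above.

```c
/* OpenCL C kernel: element-wise vector addition.
 * The host enqueues buffers a, b, c and launches one work-item per element. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float       *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```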
Highload
On the Internet you can find plenty of news about the giants eyeing FPGAs for big data processing in their data centers.
For example, there was a report that Microsoft uses FPGAs to accelerate the Bing search engine. Technical details can be found in A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services.
Unfortunately, there is no good technical article about this in Russian, although the topic is very interesting. Perhaps olgakuznet_ms or her colleagues can fix that?
I hope the release of CPU + FPGA chips will encourage developers of highly loaded systems to move part of their computations to the FPGA. Yes, FPGA development is "harder" than CPU development, but for specific tasks it can produce excellent results.
Development / debugging of IP cores and software
Such boards can also be used by ASIC / FPGA developers to verify their IP cores, which will later run on completely different hardware.
It often happens that software is written while the hardware is still being developed / manufactured, and that software already needs to be debugged somewhere. By software here I mean the FPGA firmware plus the various drivers and user-space programs. In our 100G analyzer and balancer project, two tasks came up that we had never tackled before:
- FPGA configuration (CSRs: control and status registers) had to happen over PCIe
- to Linux, the FPGA with its bunch of interfaces had to look like a network card: drivers had to be written and packets transferred to / from the host
Of course, there were other tasks in parallel (such as generating / filtering 100G traffic), but those were quietly handled in the simulator, whereas these two are not something you can really exercise in a simulator. What did we do? It turned out we had a devboard from Altera. Even though it carries a completely different chip, a different PCIe block and so on, we debugged the FPGA + driver combination on it, and when the production department handed us the B100 board, the whole stack worked without problems once the hardware was brought up.
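For a flavor of the CSR-over-PCIe part, here is a minimal user-space sketch that reads a register by mmap()-ing BAR0 through sysfs. The device path and register offset are placeholders, not values from our board, and a production design would of course do this inside a kernel driver rather than from user space.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_PATH   "/sys/bus/pci/devices/0000:01:00.0/resource0" /* hypothetical */
#define CSR_VERSION 0x0000                                        /* hypothetical offset */

int main(void)
{
    int fd = open(BAR0_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    size_t map_len = 4096;                       /* one page of the BAR is enough here */
    void *map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint32_t *csr = (volatile uint32_t *)map;
    printf("CSR[0x%04x] = 0x%08x\n", CSR_VERSION, csr[CSR_VERSION / 4]);

    munmap(map, map_len);
    close(fd);
    return 0;
}
```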
General scheme

Before reviewing the boards, let's look at the general layout of such PCIe cards.
Ethernet
The boards are equipped with standard Ethernet connectors:
- SFP - 1G
- SFP+ - 10G
- QSFP - 40G
- CFP / CFP2 / CFP4 - 100G
The most common combinations are:
- 4 x SFP / SFP+
- 2 x QSFP
- 1 x CFP
You can read about what happens at the low level and how 10G links connect to integrated circuits, for example, here.
PCIe
A standard connector that plugs into a computer with an ordinary motherboard. At the moment top FPGAs have hardened PCIe IP cores for Gen3 x8, but that bandwidth (8 GT/s x 8 lanes with 128b/130b encoding, roughly 63 Gbps) is not enough for every task. Some boards therefore carry a PCIe switch that combines two Gen3 x8 links into one Gen3 x16.
For future chips, Altera and Xilinx have announced hardware support for Gen3 x16 and even Gen4.
Connectors

Sometimes boards carry connector(s) for attaching expansion cards, but there is no de facto standard here (the way USB is). The most common are VITA (FMC) and HSMC.
Avago MiniPod

The connectors above have one drawback: they are electrical, and the attenuation at high frequencies / over long distances can be significant.
In response, Avago developed the Avago MiniPod: optical transceivers that carry 12 lanes at 10-12.5 GBd each. The connector is about the size of a coin. With it you can link not only neighboring boards but also build interconnects inside supercomputers or between servers in a rack.
When our colleagues showed a MiniPod demo on such a board, they said that no additional IP cores or Verilog code had to be added: the modules simply connect to the FPGA transceiver I/O, and everything works.
External memory
There is not much memory inside an FPGA: top chips have 50-100 Mbit. To work with large amounts of data, external memory is connected to the chip.
Two types of memory are used: static (SRAM, e.g. QDR) and dynamic (DRAM, e.g. DDR).
When choosing, you weigh parameters such as price, capacity, sequential / random access latency, bandwidth, power consumption, availability of memory controllers, ease of routing / replacement, and so on.
Altera has an External Memory Interface Handbook which, as you might guess, is dedicated to external memory. In the chapter Selecting Your Memory the interested reader will find tables comparing different memory types and selection tips. The handbook itself is available here (careful, the file is large).
If you look at how memory is used in networking equipment, the guidelines are roughly as follows:
- DRAM is used to create large buffers (for packets)
- SRAM:
- lookup tables / structures that decide where to send a packet
- queue management structures
- calculation of packet statistics (RMON, etc.)
- a hybrid approach is also possible: DRAM stores the packet payload, while only the header is placed in SRAM (see the sketch below)
If you open Cisco's Anatomy of Internet Routers presentation, you can see that some routers use RLDRAM in the DRAM role.
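A minimal sketch of what that hybrid split might look like in a packet descriptor, with illustrative field sizes; the point is simply that the latency-critical part (header, lookup results, queue state) lives in SRAM while the bulky payload is parked in DRAM.

```c
#include <stdint.h>

/* Descriptor kept in fast SRAM / on-chip memory (field sizes are assumptions). */
typedef struct {
    uint8_t  header[64];    /* first bytes of the frame: MAC / IP / TCP headers */
    uint32_t flow_id;       /* result of the 5-tuple lookup                     */
    uint16_t queue;         /* egress queue chosen by the scheduler             */
    uint16_t frame_len;     /* full frame length                                */
    uint64_t payload_addr;  /* address of the rest of the frame in DRAM         */
} pkt_descriptor_t;
```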
Hmc
HMC (Hybrid Memory Cube) is a new type of RAM that may supersede DDR / QDR memory in some applications: it promises a significant bandwidth boost and lower power consumption. On Habr you can find related news: one and two. In the comments people worry that all of this is still far off, and so on.
I assure you things are not that bad. Six months (!) ago our colleagues from EBV showed a working demo board with four Stratix V chips (on the sides) and an HMC (in the center).

Commercial samples (for mass production) are expected in 2015.
PCIe Card Review
"Review" is probably too big a word: I will simply try to show the most interesting representatives from different companies. There will be no comparison tables or unboxing. In fact, there is not much variety between the boards; they all fit the "template" described earlier. I am sure you can find another five to seven companies producing such boards, and a dozen or so more boards.
NetFPGA 10G

Hidden text
FPGA:
- Xilinx Virtex-5 TX240T
- 240K logic cells
- 11,664 Kbit block RAM
10-gigabit Ethernet networking ports:
- 4 SFP+ connectors
Quad Data Rate Static Random Access Memory (QDRII SRAM):
- 300 MHz quad data rate (1.2 giga-transactions per second), synchronous with the logic
- Three parallel banks of 72 Mbit QDRII+ memories
- Total capacity: 27 MBytes
- Cypress: CY7C1515KV18
Reduced Latency Random Access Memory (RLDRAM II):
- Four x36 RLDRAM II on-board devices
- 400 MHz clock (800 MT/s)
- 115.2 Gbps peak memory throughput
- Total capacity: 288 MByte
- Micron: MT49H16M36HT-25
This is far from a top-end card, but I could not leave it out:
- NetFPGA boards are positioned as "open research platforms": they are used worldwide (in more than 150 institutions) by students and researchers for various labs and projects.
- the project is positioned as open source: there is an organization of the same name on GitHub. A private repository there holds various reference designs (network card, switch, router, etc.), written in Verilog and distributed under the LGPL. They become available after a simple registration.
Advanced IO V5031

Hidden text
- Altera Stratix V
- Quad 10 Gigabit Ethernet SFP+ optical ports
- 2 banks of 1GB to 8GB 72-bit 1066MHz DDR3 SDRAM
- 4 banks of 36Mbit to 144Mbit 18-bit 350MHz QDRII+ SRAM
- x8 PCI Express Gen 3
- PPS Interface for time synchronization with microsecond resolution
This board has a twin brother, the captureXG 1000, which is positioned as a ready-made card for capturing data streams:
Hidden text
- Time Synchronization: IRIG-A, B and G time synchronization via a front panel SMA connector
- Filters: 128 programmable 5-tuple filters ( IPv4, TCP, UDP, ICMP, ARP )
- Packet Capture: PCAP Next Generation format or raw data format
Essentially, they took the card shown above and wrote FPGA firmware and a driver for it, and that already makes it a different product, one that works out of the box. I wonder what the price difference between the two is.
Napatech NT40E3-4-PTP

Another card for recording and analyzing traffic:
Hidden text
- FPGA: Xilinx Virtex-7
- Quad 10 Gigabit Ethernet SFP+ optical ports
- 4 GB DDR3
- PCIe x8 Gen 3
Alas, I could not squeeze any more technical details out of the marketing brochure.
In its enclosure the card looks very pretty. In terms of hardware it is not fundamentally different from the others, but Napatech treats it as a finished product and has packed in a pile of features implemented in the FPGA:
Hidden text
- Hardware Time Stamp
- Full line-rate packet capture
- Frame buffering
- Frame and protocol information
- Time Stamp Injection
- Buffer size configuration
- Onboard IEEE 1588-2008 (PTP v2) support
- Inter-Frame Gap Control
- Frame Classification
- HW Time Synchronization
- Extended RMON1 port statistics
- Advanced Statistics
- Synchronized statistics delivery
- Flow identification based on hash keys
- Dynamic hash key selection
- Frame and flow filtering
- Deduplication
- Slicing
- Intelligent multi-CPU distribution
- Cache pre-fetch optimization
- Coloring
- IP fragment handling
- Checksum verification
- Checksum generation
- GTP tunneling support
- IP-in-IP tunneling support
- Filtering inside tunnels
- Slicing inside tunnels
All of this can be done on other cards as well; you just have to spend the time on it.
COMBO-80G

Hidden text
- Virtex-7 FPGA chip manufactured by Xilinx company
- 2× QSFP+ cage multi/single mode, CWDM or copper
- 4× 10G to 40G fanout modules for 10G Ethernet technology
- PCI Express 3.0 x8, throughput up to 50Gb/s to software
- 2× 72Mbits QDRII+ SRAM memory
- 2× 1152Mbits RLDRAM III memory
- 2× 4GB DDR3 memory
- External PPS (Pulse per second) synchronization
- Unique on-the-fly FPGA boot system (no need for host computer reboot)
Nallatech 385A and Nallatech 385C

Hidden text
385A:
- Arria 10 1150 GX FPGA with up to 1.5 TFlops
- Network Enabled with (2) QSFP 10/40 GbE Support
385C:
- Altera Arria 10 GT FPGA with up to 1.5 TFlops
- Network Enabled with (2) QSFP28 100 GbE support
Common to both:
- Low Profile PCIe form factor
- 8 GB DDR3 on-card memory
- PCIe Gen3 x8 Host Interface
- OpenCL tool flow
As you can see, these are twin brothers: the 385A uses the more budget-friendly GX FPGA with 17.4 Gbps transceivers, which is enough for 10/40G, while the 385C uses the Arria 10 GT, because 100G in its 4x25G form requires 28 Gbps transceivers.
Note that Nallatech provides an OpenCL BSP for these cards.
HiTech Global 100G NIC

Hidden text
- x1 Xilinx Virtex-7 H580T
- x16 PCI Express Gen3 (16x8Gbps)
- x1 CFP2 (4x25Gbps)
- x1 CFP4 (4x25Gbps)
- x1 Cypress QDR IV SRAM
- x2 DDR3 SODIMMs (with support up to 16GB)
- x4 Avago MiniPod (24 Tx and 24 Rx) for board-to-board high-speed communications
- x1 FMC with 8 GTH transceivers and 34 LVDS pairs (LA0-LA33)
Here we see both the FMC connector for connecting other boards, and the Avago MiniPod, which was mentioned earlier.
Bonus:
Nallatech 510t


This card has no Ethernet at all, but it is a real bomb.
Hidden text
- GPU Form Factor Card with (2) Arria 10 10A1150GX FPGAs
- Dual Slot Standard Configuration
- Single Slot width possible, if user design fits within ~100W power footprint
- PCIe Gen3 x 16 Host Interface
- 290 GBytes/s Peak Aggregate Memory Bandwidth:
- 85GB/s Peak DDR4 Memory Bandwidth per FPGA (4 Banks per FPGA)
- 30GB/s Write + 30GB/s Read Peak HMC Bandwidth per FPGA
Here we have two fat top-end chips fabricated on a 20 nm process, plus DDR4 and HMC. The promised performance is up to 3 TFlops!
Judging by the render, real hardware is still some way off, and it will probably cost its weight in gold, but it will find its niche: the card is positioned as a coprocessor for data centers. OpenCL support is promised, which means nobody will hand-tune this board down to the clock cycle: ready-made algorithms will be thrown at it and the watts burned. Who knows, maybe YouTube, Facebook and VK will transcode video on boards like this, replacing dozens of servers? Or maybe the special effects for the next Avatar will be rendered on such farms?
Conclusion
Having looked at all this variety of boards, my colleagues and I thought: why don't we make such a card ourselves?
In terms of PCB complexity it would be no harder than the B100, we seem to know how to write FPGA firmware and Linux software, and there are companies and departments with a demand for this kind of hardware.
My colleagues and I argued a bit about which card to build, and we are curious what you think.
Thanks for your attention! I will be happy to answer questions in the comments or via PM.