NETMAP (by Luigi Rizzo): a simple and convenient open-source framework for processing traffic at 10 Gbit/s, or 14 Mpps

The capacity of communication channels keeps growing: if a couple of years ago a server with a 10 Gbit/s uplink was the privilege of a few, offers affordable to small and medium-sized companies have now appeared on the market. At the same time, the TCP/IP protocol stack was designed in an era when speeds of about 10 Gbit/s could only be dreamed of. As a result, the code of most modern general-purpose operating systems carries a lot of overhead that wastes resources. Under these conditions, high-performance processing of network flows becomes especially important.

The article is based on my talk at Highload++ 2012 and is intended as a quick introduction to a convenient and very effective open-source framework called NETMAP, which is included in FreeBSD HEAD/STABLE and allows you to work with packets at speeds of 1-10 Gbit/s on conventional *nix operating systems, without specialized hardware.


Netmap uses well-known performance-enhancing techniques: mapping the network card buffers into memory, batching I/O, and using ring transmit and receive buffers that correspond to the hardware buffers of the network card. This allows it to generate and receive traffic at up to 14 million packets per second, which corresponds to the theoretical maximum packet rate of a 10 Gbit/s link.

The article contains key fragments of the publications by NETMAP's author, Luigi Rizzo, and discusses the architecture and key features of the internal implementation of the netmap framework, which encapsulates the critical work with the OS kernel and the network card and gives userland a simple and understandable API.

Below the cut, the basic primitives of using the framework for building applications that process packets at 14 Mpps are covered, along with practical experience of using netmap to develop the L3 component of a DDoS protection system. Separately, comparative performance figures are given for netmap on 1/10 Gbit/s links, on one or several processor cores, with large and short packets, and in comparison with the FreeBSD and Linux network stacks.

1. Introduction


Modern general-purpose operating systems provide rich and flexible capabilities for low-level packet processing in network monitoring programs, traffic generators, software switches, routers, firewalls and intrusion detection systems. Software interfaces such as raw sockets, the Berkeley Packet Filter (BPF), the AF_PACKET interface and the like are currently used to build most of these programs. Apparently, high packet processing speed was not the main goal when these mechanisms were designed, since each of the implementations carries significant overhead.

At the same time, the significantly increased throughput of transmission media, network cards and network devices calls for a low-level packet processing interface that allows processing at wire speed, i.e. millions of packets per second. For better performance, some systems run directly in the operating system kernel or access the network card structures (hereinafter NIC) directly, bypassing the TCP/IP stack and the network card driver. The efficiency of these systems comes from relying on hardware-specific features of network cards that provide direct access to NIC data structures.

The NETMAP framework successfully combines and extends the ideas implemented in these direct-NIC-access solutions. In addition to a dramatic increase in performance, NETMAP gives userland a hardware-independent interface for high-performance packet processing. Here is just one figure to convey the packet processing speed: moving a packet from the transmission medium (the cable) to userspace takes less than 70 CPU cycles. In other words, using NETMAP on a single processor core running at 900 MHz, it is possible to forward 14.88 million packets per second (Mpps), which corresponds to the maximum Ethernet frame rate on a 10 Gbit/s link.

Another important feature of NETMAP is that it offers a device-independent interface in place of direct access to NIC registers and data structures. Those structures, as well as critical areas of kernel memory, remain inaccessible to the user program, which improves reliability: from userspace it is difficult to pass an invalid pointer to a memory region and crash the OS kernel. At the same time, NETMAP uses a very efficient data model that allows zero-copy packet forwarding, i.e. forwarding packets without copying memory, which yields tremendous performance.

The article focuses on the architecture and capabilities that NETMAP provides, as well as on the performance indicators that can be obtained with it on ordinary hardware.

2. The TCP/IP stack of a modern OS


This part of the article will be of particular interest to developers of applications such as software switches, routers, firewalls, traffic analyzers, intrusion detection systems and traffic generators. General-purpose operating systems, as a rule, do not provide efficient mechanisms for accessing raw packets at high rates. In this section we concentrate on the TCP/IP stack of the OS, look at where the overheads come from, and estimate the cost of processing a packet at the different stages of its path through the OS stack.

2.1. NIC Data Structures and Operations


Network adapters (NICs) process incoming and outgoing packets using ring queues (rings) of memory buffer descriptors, as shown in Fig. 1.

Fig. 1. NIC data structures and their relationship with OS data structures

Each slot in the ring queue contains the length and the physical address of a buffer. NIC registers addressable by the CPU hold the state of the receive and transmit queues.
When a packet arrives at the network card, it is placed into the current memory buffer, its size and status are recorded in the slot, and the fact that new incoming data is ready for processing is recorded in the corresponding NIC register. The network card raises an interrupt to inform the CPU that new data has arrived.
When a packet is sent to the network, the NIC expects the OS to fill the current buffer, place the size of the transmitted data in the slot, and write the number of slots to transmit into the corresponding NIC register, which triggers the sending of the packets to the network.
At high receive and transmit rates, a large number of interrupts can leave the CPU unable to do any useful work ("receive livelock"). To avoid this, operating systems use polling or interrupt throttling. Some high-performance NICs provide multiple receive/transmit queues, which makes it possible to spread the load across several processor cores, or to split the network card into several devices for use by virtual machines.
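For orientation, the receive side of this mechanism can be sketched roughly as follows. This is purely schematic: descriptor layouts are NIC-specific, and none of the names below are taken from a real driver.

/* schematic RX descriptor: real NICs define their own layouts */
struct rx_desc {
    uint64_t buf_paddr;   /* physical address of the packet buffer */
    uint16_t len;         /* length written back by the NIC */
    uint16_t status;      /* e.g. a "descriptor done" flag */
};

/* after an interrupt (or a poll), the driver consumes completed slots */
while (rx_ring[head].status & RX_DESC_DONE) {
    deliver_packet(buf_virt[head], rx_ring[head].len); /* hand off to the stack */
    rx_ring[head].status = 0;                 /* return the slot to the NIC */
    head = (head + 1) % RX_RING_SIZE;
    /* ... eventually tell the NIC, via a register write, how far we got ... */
}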

2.2. Kernel and API for the user


From the NIC data structures the OS copies packets into queues of OS-specific memory buffers. In FreeBSD these are mbufs; their equivalents are sk_buffs in Linux and NdisPackets in Windows. At their core, these memory buffers are containers carrying a large amount of metadata about each packet: its size, the interface it arrived from or is destined to, and various attributes and flags that determine how the buffer's data is handled by the NIC and/or the OS.
The NIC driver and the TCP/IP stack of the operating system (hereinafter the host stack) generally assume that packets can be split into an arbitrary number of fragments, so both the driver and the host stack must be prepared to handle such fragmentation. The corresponding APIs exported to userspace assume that various subsystems may keep packets for deferred processing; therefore memory buffers and metadata cannot simply be passed by reference during a call, but must either be copied or managed with reference counting. All of this is the price, paid in heavy overheads, for flexibility and ease of use.
This API was designed quite a long time ago and is too expensive for today's systems. The cost of memory allocation, buffer management and traversal of the buffer chains often grows beyond a linear dependence on the amount of useful data carried in the packets.
The standard API for raw packet input/output in a user program requires, at a minimum, a memory allocation to copy data and metadata between the OS kernel and userspace, and one system call per packet (in some cases, per batch of packets).
Consider the overheads that occur in FreeBSD when sending a UDP packet from user level with the sendto() function. In this example the userspace program sends a UDP packet in a loop. Table 1 shows the average time spent processing the packet in userspace and in the various kernel functions. The time field contains the average time, in nanoseconds, recorded when the function returns; the delta field indicates how much time elapses before the next function in the system-call chain starts. For example, execution in the userspace context takes 8 nanoseconds, and entering the kernel context takes 96 nanoseconds.
For the measurements, a set of FreeBSD macro definitions was used (the original listing is not reproduced in this translation):
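As an illustration of the idea only, TSC-based timing macros of this kind could look as follows; the names and details here are assumptions, not the author's code.

#include <stdint.h>

/* read the CPU time-stamp counter (x86) */
static inline uint64_t
ts_rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

/* accumulate the cycles spent between TS_BEGIN and TS_END */
#define TS_BEGIN(t)        do { (t) = ts_rdtsc(); } while (0)
#define TS_END(t, total)   do { (total) += ts_rdtsc() - (t); } while (0)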



The test was performed on a computer running FreeBSD HEAD 64-bit, with an i7-870 2.93 GHz CPU (TurboBoost), an Intel 10 Gbit NIC and the ixgbe driver. The values are averaged over several dozen 5-second runs.

As can be seen from Table 1, several functions consume a critically large amount of time at every level of packet processing in the OS stack. Any API for network input/output, be it TCP, RAW_SOCKET or BPF, is forced to push a packet through several very expensive layers. With this standard API there is no way to bypass memory allocation and copying into mbufs, route validation, and the preparation and construction of TCP/UDP/IP/MAC headers; at the end of this processing chain the mbuf structures and metadata must still be converted into the NIC format to transmit the packet to the network. Even local optimizations, such as caching routes and headers instead of building them from scratch, do not give the radical speed-up required for processing packets on 10 Gbit/s interfaces.

2.3. Modern techniques for increasing performance when processing packets at high speeds



Since the problem of high-speed packet processing has been around for quite a while, various techniques have already been developed and are used to qualitatively increase the processing speed.

Socket API


The Berkeley Packet Filter (hereinafter BPF) is the most popular mechanism for accessing raw packets. BPF attaches to the network card driver and delivers a copy of each received or sent packet to a file descriptor, through which the user program can receive or send packets. Linux has a similar mechanism, the AF_PACKET socket family. BPF works alongside the TCP/IP stack, although in most cases it puts the network card into promiscuous mode, which results in a large flow of extraneous traffic being delivered to the kernel only to be dropped there.

Packet filter hooks


Netgraph (FreeBSD), Netfilter (Linux) and NDIS miniport drivers (MS Windows) are in-kernel mechanisms used when copying packets is not required and an application, such as a firewall, must be inserted into the packet processing chain. These mechanisms receive traffic from the network card driver and pass it to processing modules without additional copying. Naturally, all the mechanisms listed here still rely on the representation of packets as mbuf/sk_buff structures.

Direct buffer access


One of the easiest ways to avoid extra copying when moving a packet between kernel space and user space is to give the application direct access to the NIC structures. Typically this requires the application to run inside the OS kernel. Examples are the Click software router project and the kernel-mode pkt-gen traffic generator. For all its convenience, kernel space is a very fragile environment, where errors can crash the whole system, so a more correct mechanism is to export packet buffers to userspace. Examples of this approach are PF_RING and Linux PACKET_MMAP, which export an area of shared memory containing preallocated regions for network packets. The operating system kernel then copies data between sk_buffs and the packet buffers in shared memory. This allows batch processing of packets, but the overheads associated with copying and managing the sk_buff chains remain.
Even better performance can be achieved by giving userspace direct access to the NIC. This approach requires special NIC drivers and increases certain risks, since the NIC DMA engine can write data to any memory address, and an incorrectly written client may accidentally "kill" the system by overwriting critical data somewhere in the kernel. To be fair, many modern network cards work with an IOMMU, which restricts where the NIC DMA engine may write. Examples of this approach are PF_RING_DNA and some commercial solutions.

3. NETMAP architecture


3.1. Key Features


The previous section surveyed various mechanisms for increasing packet processing performance at high rates and analyzed the costly operations in data handling: copying data, managing metadata, and the other overheads that arise as a packet travels from userspace through the TCP/IP stack to the network.
The framework presented here, called NETMAP, is a system that gives userspace applications very fast access to network packets, both for receiving and for sending, and both when exchanging packets with the network and when working with the OS TCP/IP stack (the host stack). At the same time, efficiency is not bought at the price of the risks that arise when NIC data structures and registers are fully exposed to userspace: the framework manages the network card on its own, while the operating system keeps protecting memory.
Another distinctive feature of NETMAP is its tight integration with existing OS mechanisms and its independence from the hardware features of specific network cards. To achieve the desired high performance, NETMAP uses several well-known techniques:

• Compact and lightweight packet metadata structures. Simple to use, they hide device-specific mechanisms and provide a convenient and easy way to work with packets. In addition, NETMAP metadata is designed so that many packets can be handled in a single system call, reducing the per-packet cost.
• Linear, preallocated buffers of fixed size, which reduce memory management overhead.
• Zero-copy operation when forwarding packets between interfaces, as well as between interfaces and the host stack.
• Support for useful hardware features of network cards, such as multiple hardware queues.

In NETMAP, each subsystem does exactly what it was designed for: the NIC moves data between the network and RAM, while the OS kernel protects memory and provides multitasking and synchronization.



Fig. 2. In NETMAP mode, NIC queues are disconnected from the OS TCP/IP stack. Exchange between the network and the host stack goes only through the NETMAP API

At the highest level, when an application puts the network card into NETMAP mode through the NETMAP API, the NIC queues are disconnected from the host stack. The program thereby gets the ability to control the exchange of packets between the network and the OS stack, using ring buffers called "netmap rings", which live in shared memory. To synchronize with the queues of the NIC and of the OS stack, the ordinary system calls select()/poll() are used. Despite being disconnected from the network card, the TCP/IP stack of the operating system continues to work and perform its functions.

3.2. Data structures


The key NETMAP data structures are shown in Fig. 3. They were designed with the following goals in mind:
• Reducing overheads during packet processing
• Increasing efficiency in packet transmission between interfaces, as well as between interfaces and the stack
• Support for multiple hardware queues in network cards



Fig. 3. Structures exported by NETMAP to userspace

NETMAP contains three types of objects visible from userspace:
• Packet buffers
• Ring queues (netmap rings)
• Interface descriptor (netmap_if)

All objects of all netmap-enabled interfaces in the system reside in the same region of non-pageable shared memory, allocated by the kernel and accessible to user-space processes. Using a single dedicated memory region makes it convenient to perform zero-copy packet exchange between any interfaces and the stack. At the same time, NETMAP supports separating interfaces or queues so that the memory regions given to different processes are isolated from each other.
Since different user processes operate in different virtual address spaces, all references in the exported NETMAP data structures are relative, i.e. they are offsets.
Packet buffers have a fixed size (currently 2 KB) and are used both by the NIC and by user processes. Each buffer is identified by a unique index; its virtual address is easily computed by the user process, and its physical address is easily computed by the NIC DMA engine.
All netmap buffers are allocated when the network card is switched into NETMAP mode. The metadata describing a buffer, such as its index, size and a few flags, is stored in slots, the basic element of the netmap ring described below. Each buffer is attached to a netmap ring and to the corresponding queue in the network card (the hardware ring).
A netmap ring is an abstraction of the hardware ring queue of a network card. A netmap ring is characterized by the following fields:
• ring_size, the number of slots in the ring
• cur, the current read/write slot in the ring
• avail, the number of available slots: in the TX case these are empty slots into which data to be sent can be placed; in the RX case these are slots that the NIC DMA engine has filled with received data
• buf_ofs, the offset between the beginning of the ring and the beginning of the array of fixed-size packet buffers (netmap buffers)
• slots[], an array of ring_size fixed-size metadata entries. Each slot contains the index of the packet buffer holding the received data (or the data to be sent), the packet length, and a few flags used while processing the packet.

Finally, netmap_if contains read-only information describing the netmap interface: the number of queues (netmap rings) associated with the network card and the offsets used to obtain a pointer to each of the associated rings.
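In C terms, the exported objects can be sketched like this. This is a simplified sketch that follows the field names used above; the real definitions in net/netmap.h differ in detail.

/* metadata for one packet, the basic element of a netmap ring */
struct netmap_slot {
    uint32_t buf_idx;   /* index of the packet buffer */
    uint16_t len;       /* length of the packet in the buffer */
    uint16_t flags;     /* e.g. "buffer changed", used for zero-copy swaps */
};

/* shadow of one hardware ring */
struct netmap_ring {
    ssize_t  buf_ofs;    /* offset to the array of packet buffers */
    uint32_t ring_size;  /* number of slots in the ring */
    uint32_t avail;      /* slots currently available to the user program */
    uint32_t cur;        /* current read/write position */
    struct netmap_slot slot[0]; /* ring_size metadata entries */
};

/* read-only descriptor of a netmap-enabled interface */
struct netmap_if {
    uint32_t ni_num_queues;  /* number of rings attached to the interface */
    ssize_t  ring_ofs[0];    /* offsets of the individual netmap rings */
};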

3.3. Data Processing Contexts


As mentioned above, NETMAP data structures are shared between the kernel and user programs. NETMAP strictly defines the "access rights" and the owner of each structure so as to protect the data. In particular, netmap rings are always managed by the user program, except during a system call: while a system call is in progress, kernel code updates the netmap rings, but it does so in the context of the user process. Interrupt handlers and other kernel threads never touch netmap rings.
The packet buffers between cur and cur + avail - 1 are likewise controlled by the user program, while the remaining buffers are handled by kernel code; in practice, only the NIC accesses those buffers. The boundary between the two regions is updated during each system call.

4. Basic operations in NETMAP


4.1. Netmap API


To put a network card into netmap mode, the program opens a file descriptor on the special device /dev/netmap and executes ioctl(..., NIOCREGIF, arg). The arguments of this system call contain the interface name and, optionally, which netmap ring we want to bind to the newly opened file descriptor. On success, the call returns the size of the shared memory region that holds all the data structures exported by NETMAP and the offset of the netmap_if area through which we obtain pointers to those structures.
After the network card has been switched into netmap mode, the following two system calls are used to trigger packet reception or transmission:
• ioctl(..., NIOCTXSYNC) — synchronizes the transmit queues (netmap rings) with the corresponding queues of the network card, which is equivalent to sending packets to the network; synchronization starts from position cur
• ioctl(..., NIOCRXSYNC) — synchronizes the network card queues with the corresponding netmap rings in order to receive packets from the network; writing starts from position cur
Both system calls are non-blocking, perform no unnecessary data copying (apart from the transfers between the network card and the netmap rings), and can handle one or many packets in a single call. This property is key and gives a dramatic reduction in per-packet overheads. During these system calls, the kernel part of the NETMAP handler performs the following actions:
• checks the cur/avail fields of the ring and the contents of the slots involved in the processing (sizes and indices of packet buffers in the netmap rings and in the hardware rings, i.e. the network card queues)
• synchronizes the contents of the slots involved in the processing between the netmap rings and the hardware rings, issues a command to the network card to send packets, or reports the availability of new free buffers for receiving data
• updates the avail field in the netmap rings
As you can see, the NETMAP kernel handler does a minimum of work, and it validates the user-supplied data to prevent a system crash.
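For illustration, the receive side can be driven by NIOCRXSYNC alone, without poll(). This is a minimal sketch assuming fd and nifp were obtained as described above; process_packet() is a hypothetical handler.

struct netmap_ring *rx = NETMAP_RXRING(nifp, 0);

for (;;) {
    ioctl(fd, NIOCRXSYNC, NULL);       /* ask the kernel to refill avail */
    while (rx->avail > 0) {
        uint32_t i = rx->cur;
        char *buf = NETMAP_BUF(rx, rx->slot[i].buf_idx);

        process_packet(buf, rx->slot[i].len);  /* hypothetical handler */
        rx->cur = NETMAP_NEXT(rx, i);
        rx->avail--;
    }
}

Note that this loop busy-waits and burns a whole core; the blocking variant based on poll() is described in the next section.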

4.2. Blocking primitives


Blocking I/O is supported through the select()/poll() system calls on the /dev/netmap file descriptor. The call returns as soon as avail > 0 on one of the rings of interest. Before returning control from the kernel context, the system performs the same actions as the ioctl(..., NIOC**SYNC) calls. Using this technique, a user program can check the state of the queues in a loop without loading the CPU, spending only one system call per pass.

4.3. Multi-Queue Interface


NETMAP allows powerful network cards with multiple queues to be configured in two ways, depending on how many queues the program needs to control. In the default mode, a single /dev/netmap file descriptor controls all netmap rings; but if the ring_id field is specified when the descriptor is registered, the descriptor is associated with a single RX/TX pair of netmap rings. With this technique, the handlers of the different netmap rings can be bound to specific processor cores via setaffinity() and run independently, without any need for synchronization.
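A sketch of this scheme on FreeBSD is shown below; the thread argument structure and the helper function are assumptions for illustration, not part of the NETMAP API (netmap headers and error handling omitted).

#include <sys/cpuset.h>
#include <pthread.h>
#include <pthread_np.h>

/* pin the calling thread to CPU "core" and bind its descriptor
 * to hardware ring "ring" of the interface */
static void
bind_ring_to_core(struct thread_arg *t, int ring, int core)
{
    cpuset_t mask;

    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

    t->nmr.ringid = ring | NETMAP_HW_RING;  /* one RX/TX ring pair only */
    ioctl(t->fd, NIOCREGIF, &t->nmr);
}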

4.4. Usage example


The example below is a prototype of the simplest traffic generator based on the NETMAP API and is intended to illustrate how easy the API is to use. It uses the NETMAP_XXX macros, which make the code easier to read, to compute pointers to the corresponding NETMAP data structures. No libraries are needed to use the NETMAP API; it was designed so that the code would be as simple and straightforward as possible.

/* simplified prototype: variable declarations and error handling omitted */
fds.fd = open("/dev/netmap", O_RDWR);
strcpy(nmr.nm_name, "ix0");
ioctl(fds.fd, NIOCREGIF, &nmr);            /* put ix0 into netmap mode */
p = mmap(NULL, nmr.memsize, PROT_READ | PROT_WRITE,
         MAP_SHARED, fds.fd, 0);           /* map the shared memory region */
nifp = NETMAP_IF(p, nmr.offset);
fds.events = POLLOUT;
for (;;) {
   poll(&fds, 1, -1);                      /* wait for free TX slots */
   for (r = 0; r < nmr.num_queues; r++) {
      ring = NETMAP_TXRING(nifp, r);
      while (ring->avail-- > 0) {
         i = ring->cur;
         buf = NETMAP_BUF(ring, ring->slot[i].buf_index);
         //... store the payload into buf ...
         ring->slot[i].len = ... // set packet length
         ring->cur = NETMAP_NEXT(ring, i);
       }
   }
}

Prototype traffic generator.

4.5. Sending/receiving packets to/from the host stack


Even when the network card is switched into netmap mode, the OS network stack still believes it manages the network interface and knows nothing about being disconnected from the card. The user can still run ifconfig, and the stack can still generate packets for the interface or expect packets from it. This traffic, received from or directed to the OS network stack, can be processed through a special pair of netmap rings associated with the /dev/netmap file descriptor.
When NIOCTXSYNC is executed on this netmap ring, the netmap kernel handler encapsulates the packet buffers in the mbuf structures of the OS network stack, thereby delivering the packets to the stack. Accordingly, packets coming from the OS stack are placed into a special netmap ring and become available to the user program after a call to NIOCRXSYNC. The responsibility for moving packets between the netmap rings associated with the host stack and those associated with the NIC thus lies with the user program.

4.6. Security Considerations


A process using NETMAP, even if it does something wrong, cannot crash the system, unlike with some other systems such as UIO-IXGBE or PF_RING_DNA. The memory region exported by NETMAP to user space contains no critical kernel areas, and all indices and sizes of packet and other buffers are easily validated by the OS kernel before use.

4.7. Zero copy packet forwarding


Having all buffers of all network cards in the same shared memory region allows very fast (zero-copy) packet transfer from one interface to another, or to the host stack. To do this it is enough to swap the packet buffer indices in the slots of the netmap rings associated with the incoming and outgoing interfaces, copy the packet length and the slot flags, advance the current position (cur) in the rings, and update their avail values, which signals that there is a new packet to receive or to send.

ns_src = &src_nr_rx->slot[i]; /* locate src and dst slots */
ns_dst = &dst_nr_tx->slot[j];
/* swap the buffers */
tmp = ns_dst->buf_index;
ns_dst->buf_index = ns_src->buf_index;
ns_src->buf_index = tmp;
/* update length and flags */
ns_dst->len = ns_src->len;
/* tell kernel to update addresses in the NIC rings */
ns_dst->flags = ns_src->flags = BUF_CHANGED;
dst_nr_tx->avail--; // for clarity of the code, the check
src_nr_rx->avail--; // avail > 0 is omitted


5. Example: the NETMAP API in the traffic scrubbing subsystem of a DDoS protection system


5.1. Primary requirements


Thanks to the combination of extremely high performance and convenient mechanisms for accessing packet contents and steering packets between interfaces and the network stack, NETMAP is a very convenient framework for systems that process network packets at high rates. Examples of such systems are traffic monitoring applications, IDS/IPS systems, firewalls, routers and, especially, traffic scrubbing systems, which are a key component of DDoS protection.
The main requirements for the traffic scrubbing subsystem of a DDoS protection system are the ability to filter packets at maximum rates and the ability to run packets through a chain of filters implementing the various techniques currently known for countering DDoS attacks.

5.2. Preparing and enabling netmap mode


Since the prototype of the traffic scrubbing subsystem is expected to analyze and modify packet contents, as well as maintain its own lists and data structures needed for DDoS protection, CPU resources must be reserved for the DDoS module itself, leaving NETMAP, as far as possible, only the minimum it needs to run at full speed. For this purpose, several CPU cores are bound to work each with "its own" netmap ring.

struct nmreq nmr;
//…
for (i = 0; i < MAX_THREADS; i++) {
// …
 targ[i]->nmr.ringid = i | NETMAP_HW_RING;  /* bind to hardware ring i only */
 //…
 ioctl(targ[i]->fd, NIOCREGIF, &targ[i]->nmr);
 //…
 targ[i]->mem = mmap(0, targ[i]->nmr.nr_memsize, PROT_WRITE | PROT_READ,
                     MAP_SHARED, targ[i]->fd, 0);
 targ[i]->nifp = NETMAP_IF(targ[i]->mem, targ[i]->nmr.nr_offset);
 targ[i]->nr_tx = NETMAP_TXRING(targ[i]->nifp, i);
 targ[i]->nr_rx = NETMAP_RXRING(targ[i]->nifp, i);
 //…
}


If we also plan to exchange packets with the OS network stack, the pair of netmap rings responsible for interacting with the stack must be opened as well.

struct nmreq nmr; 
//…
 /* NETMAP associates the netmap ring with the highest ringid with the network stack */
 targ->nmr.ringid = stack_ring_id | NETMAP_SW_RING;
// …
 ioctl(targ->fd, NIOCREGIF, &targ->nmr);
// …

5.3. The main loop of rx_thread

After all the necessary preparation is done and the network card is switched into NETMAP mode, the worker threads are started:

for (i = 0; i < MAX_THREADS; i++) {
    /* start rx thread i */
    targs[i].used = 1;
    if (pthread_create(&targs[i].thread, NULL, rx_thread, &targs[i]) == -1) {
        D("Unable to create thread %d", i);
        exit(-1);
    }
}
//…
/* wait until the threads finish their loops */
for (r = 0; r < MAX_THREADS; r++) {
    pthread_join(targs[r].thread, NULL);
    ioctl(targs[r].fd, NIOCUNREGIF, &targs[r].nmr);
    close(targs[r].fd);
}
//…




As a result, after all the threads have been started, MAX_THREADS + 1 independent threads are running in the system, each working with its own netmap ring and without any need to synchronize with the others. Synchronization is required only when exchanging packets with the network stack.
The wait-and-receive loop inside rx_thread() looks as follows.

while (targ->used) {
    ret = poll(fds, 2, 1 * 100);
    if (ret <= 0)
        continue;
    //…
    /* run filters */
    for (i = targ->begin; i < targ->end; i++) {
        ioctl(targ->fd, NIOCTXSYNC, 0);
        ioctl(targ->fd_stack, NIOCTXSYNC, 0);
        targ->rx = NETMAP_RXRING(targ->nifp, i);
        targ->tx = NETMAP_TXRING(targ->nifp, i);
        if (targ->rx->avail > 0) {
            //…
            /* process rings */
            cnt = process_incoming(targ->id, targ->rx, targ->tx,
                                   targ->stack_rx, targ->stack_tx);
            //…
        }
    }
}


Thus, once the poll() system call signals that one of the netmap rings has received incoming packets, control is passed to the process_incoming() function, which runs the packets through the filters.

5.5. process_incoming ()


Once control reaches process_incoming(), we need to access the contents of the packets so they can be analyzed and processed by the various DDoS recognition techniques.
limit = nic_rx->avail;
while (limit-- > 0) {
    struct netmap_slot *rs = &nic_rx->slot[j]; // rx slot
    struct netmap_slot *ts = &nic_tx->slot[k]; // tx slot

    eth = (struct ether_header *)NETMAP_BUF(nic_rx, rs->buf_idx);
    if (eth->ether_type != htons(ETHERTYPE_IP)) {
        goto next_packet; // non-IP packet: skip the IP filters
    }
    /* get the IP header of the packet */
    iph = (struct ip *)(eth + 1);
//  … run the filter chain, then forward or drop the packet …
next_packet:
//  … advance the slot indices j, k and the rings' cur/avail …
    ;
}
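Continuing the fragment above, the verdict of the filter chain then decides whether a packet is forwarded, using the zero-copy buffer swap from section 4.7, or dropped by simply consuming the RX slot and never touching the TX ring. A sketch of that step (filters_verdict() and VERDICT_DROP are hypothetical names, not part of NETMAP):

    /* hypothetical filter chain implementing the DDoS checks */
    if (filters_verdict(eth, iph) == VERDICT_DROP) {
        goto next_packet;               /* drop: the TX ring is not touched */
    }
    /* forward: zero-copy swap of buffer indices, as in section 4.7 */
    uint32_t tmp = ts->buf_idx;
    ts->buf_idx = rs->buf_idx;
    rs->buf_idx = tmp;
    ts->len = rs->len;
    ts->flags = rs->flags = BUF_CHANGED;
    nic_tx->avail--;
    k = NETMAP_NEXT(nic_tx, k);         /* advance the TX slot index */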


The code examples above cover the basic techniques of working with NETMAP, from putting the network card into NETMAP mode to accessing the contents of the packets as they pass through the filter chain.

6. Performance


6.1. Metrics


When testing performance, the first thing to do is to choose the metrics. Many subsystems take part in packet processing: the CPU, caches, the data bus, and so on. This report looks at CPU load, since this parameter depends most strongly on how well the packet processing framework is implemented.
CPU load is usually measured with two approaches: as a function of the amount of data transferred (per-byte cost) and as a function of the number of packets processed (per-packet cost). With NETMAP, because packet forwarding is zero-copy, per-byte measurements are not very interesting compared to per-packet ones: there is no memory copying, so when large volumes are transferred the CPU load is minimal. On the other hand, NETMAP does a relatively large amount of work per packet, so per-packet measurements are of particular interest. The measurements were therefore carried out with the shortest packets, 64 bytes in size (60 bytes + 4 bytes of CRC).
Two programs were used for measurements: a traffic generator based on NETMAP and a traffic receiver, which performed exclusively incoming packet counting. The traffic generator takes as parameters: the number of cores, the size of the transmitted packet, the number of packets transmitted per one system call (batch size).

6.2. Test hardware and OS


The test hardware was a system with an i7-870 4-core 2.93 GHz CPU (3.2 GHz in turbo-boost mode), RAM running at 1.33 GHz, and a dual-port network card based on the Intel 82599 chipset. FreeBSD HEAD/amd64 was used as the operating system.
All measurements were carried out on two identical systems connected directly to each other by cable. The results correlate well; the maximum deviation from the average is about 2%.
The first test results showed that NETMAP is very efficient and completely fills the 10 Gbit/s channel with the maximum number of packets. Therefore, to carry out the experiments the processor frequency was lowered, in order to isolate the effect of the NETMAP code and to obtain the various dependencies. The base frequency of the Core i7 CPU is 133 MHz, so by using the CPU multiplier (max x21) the system can be run at a set of discrete frequencies up to 3 GHz.

6.3. Speed depending on processor frequency, number of cores, etc.


The first test is the generation of traffic at different processor frequencies, using a different number of cores, transferring many packets in one system call (batch mode).



When transmitting 64-byte packets, NETMAP completely fills the 10 Gbit/s channel on a single core at a frequency of 900 MHz. Simple calculations show that it takes about 60-65 processor cycles to process one packet. Obviously, this test measures only the costs that NETMAP itself adds to packet processing: no analysis of the packet contents or other useful work is performed.
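For reference, the arithmetic behind these numbers: a minimum-size Ethernet frame occupies 64 + 8 (preamble) + 12 (inter-frame gap) = 84 bytes = 672 bits on the wire, so a 10 Gbit/s link carries at most 10^10 / 672 ≈ 14.88 Mpps; a 900 MHz core then has 900·10^6 / 14.88·10^6 ≈ 60 cycles per packet.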

Further increasing the number of cores or the processor frequency simply leaves the CPU idle while the network card is busy sending packets and has not yet reported new free transmit slots.
With increasing frequency, the CPU load on a single core looks as follows:


6.4. Speed depending on packet size


The previous test shows performance on the shortest packets, which are the most expensive in terms of per-packet costs. This test measures the performance of NETMAP depending on the size of the transmitted packet.


As the figure shows, the packet sending rate decreases with increasing packet size roughly as 1/size. At the same time, surprisingly, the packet reception rate behaves unusually: when packets of 65 to 127 bytes are sent, the receive rate drops to 7.5 Mpps. This behaviour was reproduced on several network cards, including 1 Gbit/s ones.

6.5. Speed depending on the number of packets per system call


Obviously, working with many packets at once reduces overheads and lowers the cost of processing a single packet. Since not all applications can afford to work this way, it is interesting to measure the processing rate as a function of the number of packets handled per system call.


7. Conclusion


NETMAP's author, Luigi Rizzo, managed to achieve a dramatic increase in performance by eliminating the overheads that arise as a packet passes through the OS network stack. The speed obtainable with NETMAP is limited only by the channel bandwidth. NETMAP combines the best techniques for speeding up network packet processing, and the concept embodied in the NETMAP API offers a new, healthy approach to developing high-performance applications for processing network traffic.
At the moment, the FreeBSD community is evaluating NETMAP's effectiveness; NETMAP is included in the HEAD version of FreeBSD, as well as in the stable/9 and stable/8 branches.
