Traffic generation in user space
Traffic generation with MoonGen + DPDK + Lua, as the artist sees it.
Neutralizing DDoS attacks in real conditions requires testing various mitigation techniques in advance. Network equipment and software should be tested in artificial conditions that are close to real ones: under intensive traffic flows that simulate attacks. Without such experiments, it is extremely difficult to obtain reliable information about the specific features and limitations of any complex tool.
In this article we describe some of the traffic generation methods used at Qrator Labs.
We strongly advise readers not to use these tools to attack real infrastructure. DoS attacks are illegal and can lead to severe punishment. Qrator Labs conducts all tests in an isolated laboratory environment.
State of the art
An illustrative task in our field is to saturate a 10G Ethernet interface with small packets, which means processing 14.88 Mpps (millions of packets per second). Hereinafter we consider the smallest Ethernet frames, 64 bytes, because our main interest is to maximize the number of packets transmitted per unit of time. A simple calculation shows that we have only about 67 nanoseconds to process one such packet.
For comparison, this is close to the time a modern processor needs to fetch a piece of data from memory on a cache miss. Things become even harder when we move to 40G and 100G Ethernet interfaces and try to saturate them up to line rate (the maximum possible throughput of the network device).
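The arithmetic behind these figures can be sketched in a few lines of Lua. This is only an illustration: the function names are made up for this example, and the 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap) are the standard Ethernet values, which the text above does not spell out.

```lua
-- Bytes that occupy the wire for every frame but are not part of the frame:
-- 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
local WIRE_OVERHEAD_BYTES = 7 + 1 + 12

-- Maximum packets per second for a link speed (bit/s) and a frame size (bytes).
local function lineRatePps(linkBitsPerSec, frameBytes)
    local bitsOnWire = (frameBytes + WIRE_OVERHEAD_BYTES) * 8
    return linkBitsPerSec / bitsOnWire
end

-- Time budget for a single packet, in nanoseconds.
local function nsPerPacket(linkBitsPerSec, frameBytes)
    return 1e9 / lineRatePps(linkBitsPerSec, frameBytes)
end

for _, gbit in ipairs({10, 40, 100}) do
    local link = gbit * 1e9
    print(string.format("%3dG: %.2f Mpps, %.2f ns per packet",
        gbit, lineRatePps(link, 64) / 1e6, nsPerPacket(link, 64)))
end
```

For 64-byte frames this gives roughly 14.88 Mpps and 67.2 ns per packet at 10G, and the per-packet budget shrinks proportionally at 40G and 100G.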
Normally, data passes from an application in user space through the kernel and finally into the network controller (NIC), so the first and most straightforward idea is to generate packets directly in the kernel. An example of such a solution is the pktgen kernel module. This method significantly improves performance but is not flexible enough: the slightest change in kernel source code leads to a long cycle of rebuilding, reloading kernel modules or even rebooting the whole system, and then testing, which reduces overall productivity (that is, it costs the programmer more time and effort).
Another possible approach is to access the memory buffers of the network controller directly from userspace. This path is more complicated, but the higher performance is worth the effort. The disadvantages are high complexity and low flexibility. Examples of this approach are netmap, PF_RING and DPDK.
Another effective, albeit very costly, way to achieve high performance is to use specialized rather than general-purpose equipment. Example: Ixia.
There are also DPDK-based solutions that use scripts, which increases flexibility in controlling generator parameters and also allows the type of generated packets to be varied at run time. Below we describe our own experience with one of these tools, MoonGen.
The distinctive features of MoonGen are:
- Packet processing with DPDK in userspace, which is the main reason for the performance gains;
- The Lua stack [5], with simple scripts at the top level and bindings to the DPDK library, written in C, at the bottom;
- Thanks to JIT (just-in-time) compilation, Lua scripts run fairly fast, which somewhat contradicts common beliefs about the efficiency of scripting languages.
MoonGen can be thought of as a Lua wrapper around the DPDK library. At least the following DPDK operations are exposed at the Lua interface level:
- Configuring network controllers;
- Allocation of, and direct access to, memory pools and buffers, which for optimization purposes should be allocated as contiguous aligned regions;
- Direct access to the RSS queues of network controllers;
- An API for managing threads that takes non-uniform memory access into account (NUMA and CPU affinity) [12].
MoonGen architecture; diagram from [1].
MoonGen is a high-speed scriptable packet generator based on the DPDK library. Lua scripts control the entire process: a user-written script creates, modifies and sends packets. Thanks to the very fast LuaJIT and the DPDK packet processing library, this architecture can saturate a 10-gigabit Ethernet interface with 64-byte packets using only one CPU core. MoonGen achieves this rate even when the Lua script modifies every packet. It does not rely on tricks such as reusing the same network controller buffer.
MoonGen can also receive packets, for example to check which packets have been dropped by the system under test. Since packet reception is controlled entirely by a user-written Lua script, it can also be used to build more complex test scenarios. For example, two MoonGen instances can establish connections with each other. Such a configuration can be used, in particular, to test so-called middleboxes (equipment that sits between the sender and the receiver of traffic), such as firewalls. MoonGen focuses on four main areas:
- High performance and multi-core scaling: more than 20 million packets per second on a single CPU core;
- Flexibility: each packet is generated in real time by a user-written Lua script;
- Precise timestamps: on commodity hardware, timestamping is performed with sub-microsecond precision;
- Precise control of the intervals between sent packets: reliable generation of the required traffic patterns and types on commodity hardware.
DPDK stands for Data Plane Development Kit. It is a set of libraries whose main purpose is to speed up network packet processing on a wide variety of CPU architectures.
In a world where computer networks are becoming the foundation of human communication, performance, bandwidth and latency are increasingly critical for systems such as wireless networks and cable infrastructure, including all their individual components (routers, load balancers, firewalls), as well as for applications such as media streaming, VoIP and so on.
DPDK is a lightweight and convenient basis for building tests and scripts. Moving data entirely within userspace is something we see less often, mainly because most applications communicate with network hardware through the operating system and the kernel network stack, which is the opposite of the DPDK model.
The main purpose of Lua is to provide simple and flexible means of expression that can be extended for the task at hand, rather than a set of primitives applicable to only one programming paradigm. As a result, the base language is very lightweight: the entire interpreter is only 180 KB in compiled form and adapts easily to a wide range of uses.
Lua is a dynamic language. It is so compact that it fits on virtually any device. Lua supports a small set of types: boolean values, numbers (double-precision floating point), and strings. Common data structures such as arrays, sets, and lists can be represented by the only built-in data structure in Lua, the table, which is a heterogeneous associative array.
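As a small illustration of how one table type covers these cases, here is a sketch in plain Lua. The variable names and values are invented for the example and have nothing to do with MoonGen itself.

```lua
-- Array: consecutive integer keys starting at 1.
local ports = {80, 443, 8080}
ports[#ports + 1] = 53                  -- append an element

-- Set: the elements are the keys, the value is just a marker.
local seen = {}
for _, p in ipairs(ports) do seen[p] = true end

-- Record: string keys, with values of different types mixed freely.
local flow = {src = "10.0.0.1", dstPort = 443, syn = true}

-- Singly linked list: each node is a table that points to the next node.
local list = nil
for _, p in ipairs(ports) do list = {value = p, next = list} end

print(#ports, seen[443], seen[22], flow.src, list.value)
```

Looking up a missing key, such as seen[22], simply yields nil, which is what makes the set idiom work.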
With JIT (just-in-time) compilation, as implemented in LuaJIT, Lua shows performance comparable to compiled languages such as C despite being a scripting language [10].
As a company specializing in DDoS mitigation, Qrator Labs needs a reliable way to create, upgrade and test its own security solutions. Testing in particular requires various methods of generating traffic that mimic real attacks. However, it is not so easy to imitate a dangerous yet straightforward flood attack at layers 2-3 of the OSI model, primarily because of the difficulty of achieving high packet generation performance.
In other words, for a company whose business is continuous availability and DDoS mitigation, simulating various DoS attacks in an isolated laboratory environment is a way to understand how the different pieces of equipment that make up the company's hardware behave in reality.
MoonGen is a good way to generate traffic at rates close to the limit of the network controller using a minimal number of CPU cores. Moving data within userspace significantly improves the performance of the stack in question (MoonGen + DPDK) compared to many other ways of generating large volumes of traffic. Using pure DPDK requires much more effort, so our desire to streamline the work should come as no surprise. We also maintain a clone [7] of the original MoonGen repository in order to extend its functionality and implement our own tests.
To achieve maximum flexibility, the packet generation logic is specified by the user in a Lua script, which is one of the main features of MoonGen. For relatively simple packet processing, this solution is fast enough to saturate a 10G interface on a single CPU core. A typical way of modifying incoming packets and creating new ones is to work with packets of the same type in which only some of the fields change.
An example is the l3-tcp-syn-ack-flood test described below. Note that any modification of a packet can be made in the same buffer in which the packet was generated or received at the previous step. Indeed, this kind of packet transformation is very fast, since it involves no expensive operations such as system calls, access to potentially uncached memory, and so on.
Tests on Qrator Labs hardware
Qrator Labs conducts all tests in the laboratory on a variety of equipment. In this case, we used the following network interface controllers:
- Intel 82599ES 10G
- Mellanox ConnectX-4 40G
- Mellanox ConnectX-5 100G
We note separately that with network controllers faster than 10G the performance problem becomes more acute. Today it is not possible to saturate a 40G interface with a single core, although it is already possible with a small number of cores.
For Mellanox network controllers, some device parameters and settings can be changed by following the tuning guide [3] provided by the manufacturer. This makes it possible to increase performance and, in some special cases, to change the behavior of the NIC more deeply. Other manufacturers may have similar documents for their own high-performance devices intended for professional use. Even if you cannot find such a document publicly, it always makes sense to contact the manufacturer directly. In our case, the Mellanox representatives were very kind and, in addition to providing documentation, quickly answered the questions that arose, which allowed us to reach 100% bandwidth utilization, something that was very important to us.
TCP SYN flood test
l3-tcp-syn-ack-flood is an example of simulating a SYN flood attack [6]. It is Qrator Labs' extended version of the l3-tcp-syn-flood test from the main MoonGen repository and is kept in our clone of the repository.
Our test can run three kinds of processes:
- Generate a stream of TCP SYN packets from scratch, varying the required fields, such as source IP address, source port number, etc.;
- Create a valid ACK response for each received SYN packet according to the TCP protocol;
- Create a valid SYN-ACK response for each received ACK packet according to the TCP protocol.
For example, the inner (and therefore the “hottest”) loop of the code that creates ACK responses looks as follows:
```lua
local tx = 0
local rx = rxQ:recv(rxBufs)
for i = 1, rx do
    local buf = rxBufs[i]
    local pkt = buf:getTcpPacket(ipv4)
    if pkt.ip4:getProtocol() == ip4.PROTO_TCP and
       pkt.tcp:getSyn() and
       (pkt.tcp:getAck() or synack) then
        local seq = pkt.tcp:getSeqNumber()
        local ack = pkt.tcp:getAckNumber()
        pkt.tcp:unsetSyn()
        pkt.tcp:setAckNumber(seq+1)
        pkt.tcp:setSeqNumber(ack)
        local tmp = pkt.ip4.src:get()
        pkt.ip4.src:set(pkt.ip4.dst:get())
        pkt.ip4.dst:set(tmp)
        … -- some more manipulations with packet fields
        tx = tx + 1
        txBufs[tx] = buf
    end
end
if tx > 0 then
    txBufs:resize(tx)
    txBufs:offloadTcpChecksums(ipv4) -- offload checksums to NIC
    txQ:send(txBufs)
end
```
The general idea of creating a response packet is as follows. First, take a packet from the RX queue and check whether its type matches the expected one. If it does, prepare a response by modifying some fields of the original packet. Finally, put the resulting packet into the TX queue, using the same buffer. To improve performance, instead of taking and modifying packets one by one, we process them in batches: extract all available packets from the RX queue, create the corresponding responses, and put them all into the TX queue. Despite the fairly large number of manipulations per packet, performance remains high, primarily because LuaJIT compiles all these operations into a small number of processor instructions. Many other tests, not just TCP SYN / ACK, follow the same principle.
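The batch-and-rewrite pattern can also be tried out without MoonGen or a NIC. The following is a simplified model in plain Lua, not the code of the actual test: packets are ordinary tables, the field and function names are hypothetical, and only packets with both SYN and ACK set are answered.

```lua
-- Rewrite one packet in place into its reply; returns false if it is skipped.
-- Packets are plain tables here (hypothetical fields), not MoonGen buffers.
local function makeReply(pkt)
    if not (pkt.syn and pkt.ack) then return false end
    local seq, ackNum = pkt.seqNumber, pkt.ackNumber
    pkt.syn = false                       -- the reply carries ACK only
    pkt.ackNumber = seq + 1               -- acknowledge the peer's SYN
    pkt.seqNumber = ackNum
    pkt.src, pkt.dst = pkt.dst, pkt.src   -- send it back to the sender
    return true
end

-- Handle a whole batch: matching packets are modified in place and
-- collected at the front of the TX batch, everything else is skipped.
local function processBatch(rxBatch)
    local txBatch, tx = {}, 0
    for i = 1, #rxBatch do
        if makeReply(rxBatch[i]) then
            tx = tx + 1
            txBatch[tx] = rxBatch[i]      -- same table, no copy
        end
    end
    return txBatch
end

local batch = {
    {src = "10.0.0.2", dst = "10.0.0.1", syn = true,  ack = true,
     seqNumber = 1000, ackNumber = 501},
    {src = "10.0.0.3", dst = "10.0.0.1", syn = false, ack = true,
     seqNumber = 7, ackNumber = 9},
}
local out = processBatch(batch)
print(#out, out[1].src, out[1].dst, out[1].seqNumber, out[1].ackNumber)
```

The reply reuses the table of the received packet, which mirrors the reuse of the same buffer described above; only the second packet, which is not a SYN-ACK, is left out of the TX batch.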
The table below shows the results of the SYN flood test (generating SYN packets without attempting to respond) on a Mellanox ConnectX-4. This NIC has two 40G ports, with a theoretical performance ceiling of 59.52 Mpps for a single port and 2 * 50 Mpps when both ports are used. The way the NIC is attached to the PCIe bus somewhat limits the bandwidth, yielding 2 * 50 Mpps instead of the expected 2 * 59.52 Mpps.
| cores per port | 1 port, Mpps | 2 ports, Mpps per port |
SYN flood test; NIC: Mellanox Technologies MT27700 Family (ConnectX-4), dual 40G port; CPU: Intel® Xeon® Silver 4114 CPU @ 2.20GHz
The following table shows the results of the same SYN flood test conducted on a Mellanox ConnectX-5 with a single 100G port.
SYN flood test; NIC: Mellanox Technologies MT27800 Family (ConnectX-5), single 100G port; CPU: Intel® Xeon® Silver 4114 CPU @ 2.20GHz
Note that in all cases we achieve more than 96% of the theoretical performance ceiling on a small number of processor cores.
Capture incoming traffic and save to PCAP files
Another example test is rx-to-pcap, which tries to capture all incoming traffic and save it to a given number of PCAP files [8]. Although this test is not about packet generation as such, it demonstrates that the file system is the weakest link when moving data through userspace. Even the tmpfs virtual file system slows the stream down significantly. In this case, 8 CPU cores are needed to write out 14.88 Mpps, while a single core is enough to receive (and drop or forward) the same amount of traffic.
The following table shows the amount of traffic (in Mpps) that was received and saved to PCAP files that are in the ext2 file system on the SSD (second column) or on the tmpfs file system (third column).
| cores | on SSD, Mpps | on tmpfs, Mpps |
rx-to-pcap test; NIC: Intel 82599ES 10-Gigabit; CPU: Intel® Xeon® CPU E5-2683 v4 @ 2.10GHz
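To get a feeling for the write load behind these numbers, the classic PCAP layout (a 24-byte file header followed by a 16-byte header in front of every captured frame) can be modelled in plain Lua. This is a rough sketch and not the rx-to-pcap code: the helper names are invented, and the 60-byte captured frame assumes that the NIC strips the 4-byte checksum from a 64-byte frame.

```lua
-- Pack an unsigned integer into n little-endian bytes.
local function le(value, n)
    local bytes = {}
    for i = 1, n do
        bytes[i] = string.char(value % 256)
        value = math.floor(value / 256)
    end
    return table.concat(bytes)
end

-- PCAP file header: magic, version 2.4, time zone, sigfigs, snaplen, link type.
local function pcapGlobalHeader(snaplen)
    return le(0xa1b2c3d4, 4) .. le(2, 2) .. le(4, 2) ..
           le(0, 4) .. le(0, 4) .. le(snaplen, 4) .. le(1, 4)  -- 1 = Ethernet
end

-- One record: timestamp (s, us), captured length, original length, data.
local function pcapRecord(tsSec, tsUsec, frame)
    return le(tsSec, 4) .. le(tsUsec, 4) ..
           le(#frame, 4) .. le(#frame, 4) .. frame
end

local frame = string.rep("\0", 60)   -- assumption: 64-byte frame minus FCS
local record = pcapRecord(0, 0, frame)
local bytesPerSec = 14.88e6 * #record
print(#pcapGlobalHeader(65535), #record,
      string.format("%.2f GB/s", bytesPerSec / 1e9))
```

Under these assumptions every minimal frame costs 76 bytes on disk, so 14.88 Mpps translates into a little over 1.1 GB/s of sustained writes, before any file system overhead.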
MoonGen modification: tman task manager
We would also like to present our own extension of MoonGen's functionality, which provides another way to launch a group of test tasks. The main idea is to separate the overall configuration from the settings specific to each task, which makes it possible to run an arbitrary number of different tasks (that is, Lua scripts) at the same time. The implementation of MoonGen with the task manager [9] is available in our clone of the MoonGen repository; here we only briefly list its main features.
The new command line interface allows multiple tasks of different types to be run at the same time. The general form of the command is:
./build/tman [tman options...] [-- <task1-file> [task1 options...]] [-- <task2-file> [task2 options...]] [-- ...]
In addition, ./build/tman -h provides detailed help.
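The structure of such a command line, with manager options first and then one group per task separated by "--", can be modelled as follows. This is an illustrative sketch only and is not taken from the tman sources; the function name, the option names and the file names in the example are all hypothetical.

```lua
-- Split an argument list into the manager's own options and one group per
-- task. Every "--" starts a new task: first its file, then its options.
local function splitArgs(argv)
    local managerOpts, tasks = {}, {}
    local current = managerOpts
    for _, a in ipairs(argv) do
        if a == "--" then
            current = {file = nil, options = {}}
            tasks[#tasks + 1] = current
        elseif current == managerOpts then
            managerOpts[#managerOpts + 1] = a
        elseif current.file == nil then
            current.file = a
        else
            current.options[#current.options + 1] = a
        end
    end
    return managerOpts, tasks
end

-- Hypothetical invocation: one manager option and two tasks.
local opts, tasks = splitArgs({
    "-v",
    "--", "syn-flood.lua", "--rate", "10",
    "--", "capture.lua",
})
print(#opts, #tasks, tasks[1].file, #tasks[1].options, tasks[2].file)
```

Keeping each task's options in its own group is what lets different task files accept different, even identically named, parameters.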
However, there is a limitation: regular MoonGen Lua scripts are incompatible with the tman interface. A tman task file must explicitly define the following objects:
- A configure(parser) function that describes the parameters of the task;
- A task(taskNum, txInfo, rxInfo, args) function that describes the actual task process. Here txInfo and rxInfo are arrays of TX and RX queues, respectively; args contains the parameters of the task manager and of the task itself.
Examples can be found in examples/tman.
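The shape of such a task file can be sketched as below. Only the two function names and their parameter lists come from the description above; everything else is an assumption made for illustration: that the functions are defined as globals, the parser:option call, the "--rate" option, the return value of task, and the stand-in harness that replaces the real task manager so the skeleton can be run on its own.

```lua
-- Skeleton of a task file: the two functions the task manager looks for.
function configure(parser)
    -- Hypothetical option; the real parser API may differ.
    parser:option("--rate", "Transmit rate in Mpps")
end

function task(taskNum, txInfo, rxInfo, args)
    -- txInfo / rxInfo: arrays of TX / RX queues handed over by the manager.
    -- A real task would send and receive here; this one only reports.
    return string.format("task %d: %d TX queue(s), %d RX queue(s), rate=%s",
        taskNum, #txInfo, #rxInfo, tostring(args.rate))
end

-- Stand-in harness (not part of tman): records declared options and
-- calls the task once with dummy queue names.
local declared = {}
local stubParser = {
    option = function(self, name, description)
        declared[#declared + 1] = name
    end,
}
configure(stubParser)
local report = task(1, {"txq0"}, {"rxq0", "rxq1"}, {rate = "10"})
print(declared[1], report)
```

The working task files in examples/tman remain the authoritative reference for the real parser and queue objects.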
The task manager allows greater flexibility in launching heterogeneous tests.
The approach that MoonGen offers turned out to suit our goals well, and the team is satisfied with the results. We got a high-performance tool while keeping both the test environment and the language fairly simple. The high performance of this setup comes from two main features: direct access to the buffers of the network interface controller and just-in-time compilation in Lua.
As a rule, reaching the theoretical performance ceiling of a network interface controller is quite feasible. As we have demonstrated, a single core can be enough to saturate a 10G port, and with more cores, fully loading a 100G port is not a particular problem either.
We are especially grateful to the Mellanox team for their help with their equipment and to the MoonGen team for their quick response in fixing bugs.
- [1] MoonGen: A Scriptable High-Speed Packet Generator. Paul Emmerich et al., Internet Measurement Conference 2015 (IMC'15), 2015
- [3] Mellanox tuning guide
- [4] Data Plane Development Kit
- [6] SYN flood
- [7] Qrator Labs' clone of the MoonGen repository
- [8] PCAP file format
- [9] Task manager
- [10] Lua performance
- [11] Network Functions Virtualization Whitepaper
- [12] NUMA, non-uniform memory access