How to drop 10 million packets per second
(Translation)
In our company, the team that deals with DDoS attacks is called "the packet droppers". While all the other teams do cool things with the traffic passing through our network, we have fun finding new ways to get rid of it.
Photo: Brian Evans, CC BY-SA 2.0
The ability to quickly drop packets is very important in resisting DDoS attacks.
Packets destined for our servers can be dropped at several levels, each with its own pros and cons. Below we look at everything we tried.
Translator's note: in the output of some of the commands presented, extra spaces have been removed to preserve readability.
Test bench
For ease of comparison we will give some numbers, but you should not take them too literally because of the artificial nature of the tests. We will use one of our Intel servers with a 10 Gbit/s network card. The rest of the hardware is not that important, because we want to focus on the limits of the operating system, not of the hardware.
Our tests will look like this:
- We generate a flood of small UDP packets, reaching 14 million packets per second;
- All of this traffic is directed to a single CPU core of the selected server;
- We measure the number of packets per second the kernel handles on that single core.
The artificial traffic is generated so as to create maximum load: the source IP address and port are randomized. Here is roughly how it looks in tcpdump (a toy generator sketch follows the dump):
$ tcpdump -ni vlan100 -c 10 -t udp and dst port 1234
IP 198.18.40.55.32059 > 198.18.0.12.1234: UDP, length 16
IP 198.18.51.16.30852 > 198.18.0.12.1234: UDP, length 16
IP 198.18.35.51.61823 > 198.18.0.12.1234: UDP, length 16
IP 198.18.44.42.30344 > 198.18.0.12.1234: UDP, length 16
IP 198.18.106.227.38592 > 198.18.0.12.1234: UDP, length 16
IP 198.18.48.67.19533 > 198.18.0.12.1234: UDP, length 16
IP 198.18.49.38.40566 > 198.18.0.12.1234: UDP, length 16
IP 198.18.50.73.22989 > 198.18.0.12.1234: UDP, length 16
IP 198.18.43.204.37895 > 198.18.0.12.1234: UDP, length 16
IP 198.18.104.128.1543 > 198.18.0.12.1234: UDP, length 16
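For illustration only, here is a toy generator in C that produces packets of this shape: a random source address in 198.18.0.0/16, a random source port, a fixed destination of 198.18.0.12:1234 and a 16-byte payload. A plain sendto() loop like this will not get anywhere near 14 million packets per second; the real tests used a dedicated traffic generator.
/* flood-sketch.c — illustrative only: shows the shape of the test traffic,
 * not the tool used in the measurements. Requires root (raw socket). */
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/udp.h>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); /* we supply the IP header */
    if (fd < 0) { perror("socket"); return 1; }
    unsigned char pkt[sizeof(struct iphdr) + sizeof(struct udphdr) + 16] = {0};
    struct iphdr *ip = (struct iphdr *)pkt;
    struct udphdr *udp = (struct udphdr *)(pkt + sizeof(struct iphdr));
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = inet_addr("198.18.0.12");
    ip->version = 4;
    ip->ihl = 5;
    ip->ttl = 64;
    ip->protocol = IPPROTO_UDP;
    ip->tot_len = htons(sizeof(pkt));
    ip->daddr = dst.sin_addr.s_addr;
    udp->dest = htons(1234);
    udp->len = htons(sizeof(struct udphdr) + 16);
    udp->check = 0;                    /* UDP checksum is optional over IPv4 */
    for (;;) {
        /* randomize the source: any address in 198.18.0.0/16, random port */
        ip->saddr = htonl(0xC6120000u | (rand() & 0xFFFF));
        udp->source = htons(1024 + rand() % 64000);
        ip->check = 0;                 /* the kernel fills in the IP checksum */
        if (sendto(fd, pkt, sizeof(pkt), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0)
            perror("sendto");
    }
}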
On the selected server all the packets land in a single RX queue and are therefore handled by a single core. We arrange this with hardware flow steering:
ethtool -N ext0 flow-type udp4 dst-ip 198.18.0.12 dst-port 1234 action 2
Performance testing is a tricky business. While preparing the tests we noticed that any active raw sockets hurt performance, so before each run you have to make sure nothing like a stray tcpdump is running. There is an easy way to check for offending processes:
$ ss -A raw,packet_raw -l -p|cat
Netid State Recv-Q Send-Q Local Address:Port
p_raw UNCONN 525157 0 *:vlan100 users:(("tcpdump",pid=23683,fd=3))
Finally, we disable Intel Turbo Boost on our server:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
Although Turbo Boost is a great thing and increases throughput by at least 20%, it noticeably worsens the standard deviation of our measurements: with turbo enabled the deviation reaches ±1.5%, without it only ±0.25%.
Step 1. Dropping packets in the application
Let's start with the idea of delivering all the packets to the application and ignoring them there. To keep the experiment clean, make sure iptables is not affecting the results:
iptables -I PREROUTING -t mangle -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT
iptables -I PREROUTING -t raw -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT
iptables -I INPUT -t filter -d 198.18.0.12 -p udp --dport 1234 -j ACCEPT
The application is a simple loop in which incoming data is read and immediately discarded:
s = socket.socket(AF_INET, SOCK_DGRAM)
s.bind(("0.0.0.0", 1234))
while True:
    s.recvmmsg([...])
We have prepared the code in advance; run it:
$ ./dropping-packets/recvmmsg-loop
packets=171261 bytes=1940176
With this setup the kernel manages to take only about 175 thousand packets per second from the hardware queue, as measured with ethtool and our mmwatch utility:
$ mmwatch 'ethtool -S ext0|grep rx_2'
rx2_packets: 174.0k/s
Technically the server receives 14 million packets per second, but a single processor core cannot cope with that volume; mpstat confirms it:
$ watch 'mpstat -u -I SUM -P ALL 1 1|egrep -v Aver'
01:32:05 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
01:32:06 PM 0 0.00 0.00 0.00 2.94 0.00 3.92 0.00 0.00 0.00 93.14
01:32:06 PM 1 2.17 0.00 27.17 0.00 0.00 0.00 0.00 0.00 0.00 70.65
01:32:06 PM 2 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00
01:32:06 PM 3 0.95 0.00 1.90 0.95 0.00 3.81 0.00 0.00 0.00 92.38
As we can see, the application is not the bottleneck: CPU #1 is only 27.17% + 2.17% busy, while interrupt handling takes 100% of CPU #2.
Using recvmmsg(2) here matters: since the Spectre vulnerabilities were discovered, system calls have become even more expensive because of the KPTI and retpoline mitigations in the kernel, so batching many datagrams per call helps (a sketch of such a batched loop follows the listing below).
$ tail -n +1 /sys/devices/system/cpu/vulnerabilities/*
==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Mitigation: PTI
==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Mitigation: __user pointer sanitization
==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Mitigation: Full generic retpoline, IBPB, IBRS_FW
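A minimal sketch of such a receive-and-ignore loop in C, batching up to 1024 datagrams per recvmmsg(2) call to amortize the syscall cost. The batch size and buffer length here are illustrative, not the values used by the actual test program from the repository.
/* recvmmsg-sketch.c — read batches of UDP datagrams and throw them away. */
#define _GNU_SOURCE
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#define BATCH  1024
#define BUF_SZ 2048
int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(1234);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    static char bufs[BATCH][BUF_SZ];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = BUF_SZ;
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    unsigned long long packets = 0, bytes = 0;
    for (;;) {
        /* one syscall can return many datagrams at once */
        int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
        if (n < 0) { perror("recvmmsg"); return 1; }
        for (int i = 0; i < n; i++)
            bytes += msgs[i].msg_len;   /* count, then simply drop the data */
        packets += n;                   /* counters could be printed periodically */
    }
}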
Step 2. Killing conntrack
We deliberately generated the load with varying source IPs and ports in order to stress conntrack as much as possible. During the test the number of conntrack entries approaches the configured maximum, and we can verify this:
$ conntrack -C
2095202
$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 2097152
Moreover, in dmesg you can see conntrack complaining:
[4029612.456673] nf_conntrack: nf_conntrack: table full, dropping packet
[4029612.465787] nf_conntrack: nf_conntrack: table full, dropping packet
[4029617.175957] net_ratelimit: 5731 callbacks suppressed
So let's turn it off:
iptables -t raw -I PREROUTING -d 198.18.0.12 -p udp -m udp --dport 1234 -j NOTRACK
And restart the tests:
$ ./dropping-packets/recvmmsg-loop
packets=331008 bytes=5296128
This allowed us to reach the level of 333 thousand packets per second. Hooray!
P.S. Using SO_BUSY_POLL we can reach as much as 470 thousand packets per second, but that is a topic for a separate post.
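For reference, busy polling is enabled per socket with an ordinary setsockopt call. A minimal sketch, where the 50-microsecond budget is an arbitrary illustrative value rather than the one used in our tests:
/* Enable busy polling on an already-created UDP socket (sketch). */
#include <sys/socket.h>
static int enable_busy_poll(int fd) {
    int budget_usec = 50;   /* example value: how long to busy-poll the device queue */
    return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                      &budget_usec, sizeof(budget_usec));
}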
Step 3. Berkeley packet filter
Let's go further. Why deliver the packets to the application at all? Although it is not a common solution, we can attach a classic Berkeley Packet Filter to the socket by calling setsockopt(SO_ATTACH_FILTER) and program the filter to drop packets while they are still in the kernel (a small illustrative sketch follows the output below). We have prepared the code; run it:
$ ./bpf-drop
packets=0 bytes=0
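The filter itself lives in the linked repository. As an illustration only, here is a minimal sketch of attaching a classic BPF program that drops every packet arriving on the socket (a single "return 0" instruction); the real bpf-drop program may match the test traffic more selectively.
/* bpf-attach-sketch.c — attach a drop-everything classic BPF filter to a UDP socket. */
#include <linux/filter.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>
int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(1234);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    /* One cBPF instruction: "return 0". The return value is the number of
     * bytes to keep, so 0 means the packet is dropped in the kernel before
     * it ever reaches the socket's receive queue. */
    struct sock_filter code[] = {
        { BPF_RET | BPF_K, 0, 0, 0 },
    };
    struct sock_fprog prog = {
        .len = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };
    if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt(SO_ATTACH_FILTER)");
        return 1;
    }
    for (;;)
        pause();   /* nothing to do: packets are discarded in the kernel */
}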
Using a packet filter (classic and extended BPF give roughly the same performance) gets us to about 512 thousand packets per second. What's more, dropping the packet while still in software-interrupt (softirq) context saves the CPU cycles needed to wake up the application.
Step 4. iptables DROP after routing
Now we can drop packets by adding the following rule to the iptables INPUT chain:
iptables -I INPUT -d 198.18.0.12 -p udp --dport 1234 -j DROP
Remember that we have already disabled conntrack with the -j NOTRACK rule. These two rules together give us 608 thousand packets per second. Look at the iptables counters:
$ mmwatch 'iptables -L -v -n -x | head'
Chain INPUT (policy DROP 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
605.9k/s 26.7m/s DROP udp -- * * 0.0.0.0/0 198.18.0.12 udp dpt:1234
Well, not bad, but we can do better.
Step 5. iptables DROP in PREROUTING
An even faster technique is to drop packets before routing, with this rule:
iptables -I PREROUTING -t raw -d 198.18.0.12 -p udp --dport 1234 -j DROP
This allows us to discard a solid 1.688 million packets per second.
Frankly, this is a somewhat surprising jump in performance. We have not dug into the reasons; perhaps our routing is complex, or perhaps it is simply a misconfiguration on the server.
In any case, the iptables raw table is much faster.
Step 6. nftables DROP
The iptables utility is getting old these days; nftables has come to replace it. Check out this video explaining why nftables is superior. nftables promises to be faster than the aging iptables for a variety of reasons, including the claim that retpolines slow iptables down considerably.
But this article is still not about comparing iptables and nftables, so let's just try the fastest thing I could come up with:
nft add table netdev filter
nft -- add chain netdev filter input { type filter hook ingress device vlan100 priority -500 \; policy accept \; }
nft add rule netdev filter input ip daddr 198.18.0.0/24 udp dport 1234 counter drop
nft add rule netdev filter input ip6 daddr fd00::/64 udp dport 1234 counter drop
The counters can be viewed like this:
$ mmwatch 'nft --handle list chain netdev filter input'
table netdev filter {
chain input {
type filter hook ingress device vlan100 priority -500; policy accept;
ip daddr 198.18.0.0/24 udp dport 1234 counter packets 1.6m/s bytes 69.6m/s drop # handle 2
ip6 daddr fd00::/64 udp dport 1234 counter packets 0 bytes 0 drop # handle 3
}
}
The nftables ingress hook showed about 1.53 million packets per second, slightly less than the raw PREROUTING chain in iptables. That is a bit of a mystery: in theory the netdev ingress hook fires earlier than iptables PREROUTING and should therefore be faster.
In our test nftables was a little slower than iptables, but nftables is still cooler. :P
Step 7. tc DROP
Somewhat unexpectedly, the tc (traffic control) ingress hook fires even earlier than iptables PREROUTING. tc allows us to select packets by simple criteria and, of course, discard them. The syntax is a bit unusual, so to set it up we suggest using this script. We need a rather complex rule that looks like this:
tc qdisc add dev vlan100 ingress
tc filter add dev vlan100 parent ffff: prio 4 protocol ip u32 match ip protocol 17 0xff match ip dport 1234 0xffff match ip dst 198.18.0.0/24 flowid 1:1 action drop
tc filter add dev vlan100 parent ffff: protocol ipv6 u32 match ip6 dport 1234 0xffff match ip6 dst fd00::/64 flowid 1:1 action drop
And we can check it in action:
$ mmwatch 'tc -s filter show dev vlan100 ingress'
filter parent ffff: protocol ip pref 4 u32
filter parent ffff: protocol ip pref 4 u32 fh 800: ht divisor 1
filter parent ffff: protocol ip pref 4 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 1.8m/s success 1.8m/s)
match 00110000/00ff0000 at 8 (success 1.8m/s )
match 000004d2/0000ffff at 20 (success 1.8m/s )
match c612000c/ffffffff at 16 (success 1.8m/s )
action order 1: gact action drop
random type none pass val 0
index 1 ref 1 bind 1 installed 1.0/s sec
Action statistics:
Sent 79.7m/s bytes 1.8m/s pkt (dropped 1.8m/s, overlimits 0 requeues 0)
The tc hook allowed us to drop up to 1.8 million packets per second on a single core. Excellent!
But we can go even faster...
Step 8. XDP_DROP
And finally, our heaviest weapon: XDP, the eXpress Data Path. With XDP we can run extended Berkeley Packet Filter (eBPF) code directly in the network driver context and, most importantly, before memory is allocated for the skbuff, which promises a serious speed-up. An XDP project usually consists of two parts:
- the loadable eBPF code;
- a loader that attaches the code to the right network interface.
Writing your own loader is a chore, so we simply use a recent iproute2 feature and load the code with one command:
ip link set dev ext0 xdp obj xdp-drop-ebpf.o
Ta-dam!
The source code of the loadable eBPF program is available here. It inspects a few characteristics of the IP packet: that the protocol is UDP, plus the destination subnet and destination port (a fuller self-contained sketch follows this fragment):
if (h_proto == htons(ETH_P_IP)) {
    if (iph->protocol == IPPROTO_UDP
        && (htonl(iph->daddr) & 0xFFFFFF00) == 0xC6120000 // 198.18.0.0/24
        && udph->dest == htons(1234)) {
        return XDP_DROP;
    }
}
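For a sense of what a complete program of this kind looks like, here is a condensed, self-contained sketch. It is not the code from the repository: it handles IPv4 only, assumes libbpf's helper headers are available, and ignores IP options.
/* xdp_drop_sketch.c — drop UDP packets for 198.18.0.0/24 port 1234, pass the rest.
 * A typical build command: clang -O2 -target bpf -c xdp_drop_sketch.c -o xdp_drop_sketch.o
 * Note: older iproute2 loaders expect the program section to be named "prog". */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
SEC("xdp")
int xdp_drop_udp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    /* Every pointer is bounds-checked against data_end, otherwise the
     * verifier will refuse to load the program. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;
    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;
    if (iph->protocol != IPPROTO_UDP)
        return XDP_PASS;
    struct udphdr *udph = (void *)(iph + 1);  /* assumes no IP options */
    if ((void *)(udph + 1) > data_end)
        return XDP_PASS;
    /* destination 198.18.0.0/24 and destination port 1234 */
    if ((bpf_ntohl(iph->daddr) & 0xFFFFFF00) == 0xC6120000 &&
        udph->dest == bpf_htons(1234))
        return XDP_DROP;
    return XDP_PASS;
}
char _license[] SEC("license") = "GPL";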
The XDP program must be built with a modern clang that can emit BPF bytecode. After that we can load the object and verify that the BPF program is in place:
$ ip link show dev ext0
4: ext0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc fq state UP mode DEFAULT group default qlen 1000
link/ether 24:8a:07:8a:59:8e brd ff:ff:ff:ff:ff:ff
prog/xdp id 5 tag aedc195cc0471f51 jited
And then look at the statistics in ethtool:
$ mmwatch 'ethtool -S ext0|egrep "rx"|egrep -v ": 0"|egrep -v "cache|csum"'
rx_out_of_buffer: 4.4m/s
rx_xdp_drop: 10.1m/s
rx2_xdp_drop: 10.1m/s
Woo-hoo! With XDP we can drop up to 10 million packets per second!
Photo: Andrew Filer, CC BY-SA 2.0
Conclusions
We repeated the experiment for IPv4 and for IPv6 and prepared this diagram:
In general, it can be argued that our configuration for IPv6 is a bit slower. But since IPv6 packets are somewhat larger, the difference in speed is expected.
In Linux, there are many ways to filter packets, each with its own speed and complexity of configuration.
To protect against DDoS, it is quite reasonable to give packets to the application and process them there. A well-tuned application can perform well.
For DDoS attacks with random or spoofed IPs, it may be useful to disable conntrack to get a small increase in speed, but beware: there are attacks against which conntrack is very useful.
In other cases it makes sense to add a Linux firewall as one of the ways to mitigate a DDoS attack. In such cases it is better to use the "-t raw PREROUTING" table, since it is much faster than the filter table.
For the most severe cases we always use XDP. And yes, it is a very powerful thing. Here is the same chart as above, but with XDP added:
If you want to repeat the experiment, here is the README, in which we have documented everything.
At CloudFlare we use almost all of these techniques. Some of the userspace tricks are integrated into our applications. The iptables technique is used by our Gatebot. Finally, we are replacing our own kernel-level solution with XDP.
Many thanks to Jesper Dangaard Brouer for his help with this work.