Large traffic flows and interrupt management in Linux
In this article I will describe methods for increasing the performance of a Linux router. For me the topic became relevant when the traffic passing through one Linux router grew quite high (> 150 Mbit/s, > 50 kpps). Besides routing, the router also does traffic shaping and acts as a firewall.
For high loads it is worth using Intel network cards based on the 82575/82576 (gigabit) or 82598/82599 (10-gigabit) chipsets, or similar. Their appeal is that they create eight interrupt queues per interface: four for rx and four for tx (the RPS/RFS technologies that appeared in kernel 2.6.35 may be able to do the same for ordinary network cards). These chips also do a good job of accelerating traffic processing in hardware.
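If only an ordinary single-queue card is available, RPS can spread receive processing across cores in software. A minimal sketch, assuming a kernel with RPS support (2.6.35 or newer) and an interface named eth0 with a single receive queue; the value uses the same hexadecimal CPU mask format as smp_affinity below:

# distribute receive processing of eth0's rx queue across CPU0-CPU3 (mask f)
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus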
To get started, look at the contents of /proc/interrupts; this file shows what generates interrupts and which cores handle them.

# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 53 1 9 336 IO-APIC-edge timer
1: 0 0 0 2 IO-APIC-edge i8042
7: 1 0 0 0 IO-APIC-edge
8: 0 0 0 75 IO-APIC-edge rtc0
9: 0 0 0 0 IO-APIC-fasteoi acpi
12: 0 0 0 4 IO-APIC-edge i8042
14: 0 0 0 127 IO-APIC-edge pata_amd
15: 0 0 0 0 IO-APIC-edge pata_amd
18: 150 1497 12301 473020 IO-APIC-fasteoi ioc0
21: 0 0 0 0 IO-APIC-fasteoi sata_nv
22: 0 0 15 2613 IO-APIC-fasteoi sata_nv, ohci_hcd:usb2
23: 0 0 0 2 IO-APIC-fasteoi sata_nv, ehci_hcd:usb1
45: 0 0 0 1 PCI-MSI-edge eth0
46: 138902469 21349 251748 4223124 PCI-MSI-edge eth0-rx-0
47: 137306753 19896 260291 4741413 PCI-MSI-edge eth0-rx-1
48: 2916 137767992 248035 4559088 PCI-MSI-edge eth0-rx-2
49: 2860 138565213 244363 4627970 PCI-MSI-edge eth0-rx-3
50: 2584 14822 118410604 3576451 PCI-MSI-edge eth0-tx-0
51: 2175 15115 118588846 3440065 PCI-MSI-edge eth0-tx-1
52: 2197 14343 166912 121908883 PCI-MSI-edge eth0-tx-2
53: 1976 13245 157108 120248855 PCI-MSI-edge eth0-tx-3
54: 0 0 0 1 PCI-MSI-edge eth1
55: 3127 19377 122741196 3641483 PCI-MSI-edge eth1-rx-0
56: 2581 18447 123601063 3865515 PCI-MSI-edge eth1-rx-1
57: 2470 17277 183535 126715932 PCI-MSI-edge eth1-rx-2
58: 2543 16913 173988 126962081 PCI-MSI-edge eth1-rx-3
59: 128433517 11953 148762 4230122 PCI-MSI-edge eth1-tx-0
60: 127590592 12028 142929 4160472 PCI-MSI-edge eth1-tx-1
61: 1713 129757168 136431 4134936 PCI-MSI-edge eth1-tx-2
62: 1854 126685399 122532 3785799 PCI-MSI-edge eth1-tx-3
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 418232812 425024243 572346635 662126626 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
PND: 0 0 0 0 Performance pending work
RES: 94005109 96169918 19305366 4460077 Rescheduling interrupts
CAL: 49 34 39 29 Function call interrupts
TLB: 66588 144427 131671 91212 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 199 199 199 199 Machine check polls
ERR: 1
MIS: 0

In this example, Intel 82576 network cards are used. You can see that network interrupts are distributed evenly across the cores. By default, however, this will not happen: interrupts have to be pinned to specific processors. To do this, run
echo N > /proc/irq/X/smp_affinity

where N is the processor mask (it determines which processor will receive the interrupt) and X is the interrupt number, shown in the first column of /proc/interrupts. To determine the processor mask, raise 2 to the power of cpu_N (the processor number) and convert the result to hexadecimal. With bc it can be computed as: echo "obase=16; $[2 ** $cpu_N]" | bc. In this example, the interrupts were distributed as follows:

# CPU0
echo 1 > /proc/irq/45/smp_affinity
echo 1 > /proc/irq/54/smp_affinity
echo 1 > /proc/irq/46/smp_affinity
echo 1 > /proc/irq/59/smp_affinity
echo 1 > /proc/irq/47/smp_affinity
echo 1 > /proc/irq/60/smp_affinity
# CPU1
echo 2 > /proc/irq/48/smp_affinity
echo 2 > /proc/irq/61/smp_affinity
echo 2 > /proc/irq/49/smp_affinity
echo 2 > /proc/irq/62/smp_affinity
# CPU2
echo 4 > /proc/irq/50/smp_affinity
echo 4 > /proc/irq/55/smp_affinity
echo 4 > /proc/irq/51/smp_affinity
echo 4 > /proc/irq/56/smp_affinity
# CPU3
echo 8 > /proc/irq/52/smp_affinity
echo 8 > /proc/irq/57/smp_affinity
echo 8 > /proc/irq/53/smp_affinity
echo 8 > /proc/irq/58/smp_affinity
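To avoid computing each mask by hand, the same distribution can be scripted. This is only a sketch under the assumption that the IRQ-to-CPU mapping is written out explicitly; the IRQ numbers are the ones from the output above and will differ on other machines:

#!/bin/bash
# pin each IRQ to the CPU listed next to it; the mask is 2^cpu in hex
while read irq cpu; do
    mask=$(echo "obase=16; $((2 ** cpu))" | bc)
    echo "$mask" > "/proc/irq/$irq/smp_affinity"
done <<EOF
45 0
54 0
46 0
59 0
48 1
61 1
50 2
55 2
52 3
57 3
EOF
# (extend the list with the remaining IRQ/CPU pairs in the same way)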
Also, if the router has two interfaces, one for incoming and one for outgoing traffic (the classic scheme), then rx queues of one interface should be grouped with tx queues of the other interface on the same processor core. In this example, interrupts 46 (eth0-rx-0) and 59 (eth1-tx-0) were pinned to the same core.
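As a quick sanity check, the applied masks can be read back from /proc/irq/<N>/smp_affinity; a paired rx and tx queue should report the same value (IRQ numbers taken from the output above):

# eth0-rx-0 and eth1-tx-0 should both show a mask of 1 (CPU0)
cat /proc/irq/46/smp_affinity
cat /proc/irq/59/smp_affinity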
Another very important parameter is the delay between interrupts. The current value can be viewed with ethtool -c ethN (the rx-usecs and tx-usecs options). The higher the value, the higher the latency, but the lower the CPU load. Try reducing this value to zero during peak hours.
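For example, the coalescing values can be inspected and lowered with ethtool (eth0 here is a placeholder interface name, and not every driver supports changing these options):

# show current interrupt coalescing settings
ethtool -c eth0
# set the delay to the minimum for rx and tx
ethtool -C eth0 rx-usecs 0 tx-usecs 0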
When preparing a server with an Intel Xeon E5520 (8 cores, each with HyperThreading), I chose the following interrupt distribution scheme:

# CPU6
echo 40 > /proc/irq/71/smp_affinity
echo 40 > /proc/irq/84/smp_affinity
# CPU7
echo 80 > /proc/irq/72/smp_affinity
echo 80 > /proc/irq/85/smp_affinity
# CPU8
echo 100 > /proc/irq/73/smp_affinity
echo 100 > /proc/irq/86/smp_affinity
# CPU9
echo 200 > /proc/irq/74/smp_affinity
echo 200 > /proc/irq/87/smp_affinity
# CPU10
echo 400 > /proc/irq/75/smp_affinity
echo 400 > /proc/irq/80/smp_affinity
# CPU11
echo 800 > /proc/irq/76/smp_affinity
echo 800 > /proc/irq/81/smp_affinity
# CPU12
echo 1000 > /proc/irq/77/smp_affinity
echo 1000 > /proc/irq/82/smp_affinity
# CPU13
echo 2000 > /proc/irq/78/smp_affinity
echo 2000 > /proc/irq/83/smp_affinity
# CPU14
echo 4000 > /proc/irq/70/smp_affinity
# CPU15
echo 8000 > /proc/irq/79/smp_affinity
/proc/interrupts on this server under no load can be viewed here. I am not including it in the note itself because of its size.
UPD:
If the server works only as a router, then TCP stack tuning does not matter much. However, there are sysctl options for increasing the size of the ARP cache, which may be relevant. If the ARP cache size becomes a problem, the message “Neighbor table overflow” will appear in dmesg.
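A quick way to check whether the limit is being hit and how full the cache is (the exact message wording varies between kernel versions, so the grep pattern is an assumption):

# look for neighbour table overflow messages
dmesg | grep -i "table overflow"
# current number of IPv4 neighbour entries, to compare against gc_thresh3
ip -4 neigh show | wc -l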
The cache size limits can be raised, for example:

net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096

Description of the parameters:
gc_thresh1 - the minimum number of entries to keep in the ARP cache. If the number of entries is below this value, the garbage collector does not clear the ARP cache.
gc_thresh2 - the soft limit on the number of entries in the ARP cache. If the number of entries reaches this value, the garbage collector runs within 5 seconds.
gc_thresh3 - the hard limit on the number of entries in the ARP cache. If the number of entries reaches this value, the garbage collector runs immediately.
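A minimal way to apply these values at runtime and keep them across reboots, assuming a standard sysctl setup (the config file location may differ between distributions):

# apply immediately
sysctl -w net.ipv4.neigh.default.gc_thresh1=1024
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
# persist: add the same three lines to /etc/sysctl.conf, then reload
sysctl -p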