Large traffic flows and Linux: interrupts, router, and NAT server
Written following the publication of "Large traffic flows and interrupt management in Linux".
Our city network has more than 30 thousand subscribers, and the total capacity of our external channels is more than 3 gigabits. We worked through the advice given in that article several years ago, so I want to expand on the topic and share my experience with readers on the matter at hand.
This note describes the nuances of tuning a router and NAT server running Linux, along with a few clarifications about interrupt distribution.
Interrupts
Distributing network card interrupts across different cores is the very first thing a system administrator runs into as load on a Linux router grows. The topic is covered in sufficient detail in the article mentioned above, so we will not dwell on it for long.
I just want to note:
- if you distribute interrupts manually, you need to stop the irqbalance service. This service is designed specifically to distribute interrupts across processor cores automatically; if you do this work by hand, it is better to stop it;
- do not forget to add the corresponding commands to your startup scripts (for example, /etc/rc.local), because after a server reboot all interrupts will once again pile up on a single core;
- after a reboot, the network cards may (and most likely will) receive new interrupt numbers. Therefore, it is better not to hard-code specific interrupt numbers in /etc/rc.local, but to have a small helper script detect which interrupt each network interface received (a minimal sketch follows this list).
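A minimal sketch of such a script, assuming a single-queue card named eth0 and binding it to core 1 (these names and the mask are only examples; multi-queue cards register several interrupts per interface and need one such line per queue):

#!/bin/sh
# Look up the IRQ number that eth0 currently holds in /proc/interrupts
IRQ=$(grep 'eth0' /proc/interrupts | awk '{print $1}' | tr -d ':')
# Bind it to core 1; the mask is hexadecimal: 1 = core 0, 2 = core 1, 4 = core 2, and so on
echo 2 > /proc/irq/$IRQ/smp_affinity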
Router
The original article contains the phrase "if the server works only as a router, then tuning the TCP stack does not really matter." That statement is fundamentally wrong. Of course, on small flows tuning does not play a big role, but if you have a large network and the corresponding load, you will have to tune the network stack.
First of all, if gigabits of traffic flow across your network, it makes sense to pay attention to the MTU on your servers and switches. In a nutshell, the MTU is the maximum amount of data that can be transmitted in one packet without resorting to fragmentation, i.e. how much one of your routers can send to another in a single packet. When the volume of data transmitted over the network grows significantly, it is much more efficient to send larger packets less often than to send many small ones frequently.
Increase MTU on Linux
/sbin/ifconfig eth0 mtu 9000
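The same can be done with the ip utility from the iproute2 package (eth0 here, as above, is just an example interface name):

/sbin/ip link set dev eth0 mtu 9000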
Increase MTU on switches
On switching equipment this is usually called a jumbo frame. For the Cisco Catalyst 3750, for example, it is set like this (note that the switch must be reloaded afterwards). By the way, the system mtu jumbo setting applies only to gigabit links; the command does not affect 100 Mbps ports.
3750(config)# system mtu jumbo 9000
3750(config)# exit
3750# reload
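After the reload, you can check whether the setting took effect (to the best of my knowledge, the show command is):

3750# show system mtu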
Increase the transmit queue on Linux
/sbin/ifconfig eth0 txqueuelen 10000
The default value is 1000. For gigabit links it is recommended to set 10000. In a nutshell, this is the length of the transmit queue: packets waiting to be sent are buffered here before being handed to the network card.
Keep in mind that if you change the MTU on the interface of one device, you must do the same on the interface of its "neighbor". That is, if you increased the MTU to 9000 on the Linux router's interface, you must enable jumbo frames on the switch port that the router is plugged into. Otherwise the network will work, but very badly: roughly every other packet will make it through.
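An easy way to check that jumbo frames really pass between the two neighbors is to ping with fragmentation forbidden; 8972 bytes of payload plus 28 bytes of IP and ICMP headers give exactly 9000 (the address is just an example of the neighbor's IP):

ping -M do -s 8972 -c 3 192.168.0.1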
Summary
As a result of all these changes, ping times on the network will go up slightly, but the overall throughput will noticeably increase and the load on the active equipment will decrease.
NAT Server
NAT (Network Address Translation) is one of the most expensive (that is, resource-intensive) operations. Therefore, if you have a large network, you cannot do without tuning the NAT server.
Increase the number of tracked connections
To do its job, the NAT server needs to "remember" every connection that passes through it. Whether it is a ping or someone's ICQ session, the NAT server tracks all of these sessions in a special table in memory. When a session is closed, its entry is removed from the table. The size of this table is fixed, so if there is a lot of traffic through the server and the table is too small, the NAT server starts dropping packets and breaking sessions, the Internet works with terrible interruptions, and it can even become impossible to reach the NAT server itself over SSH.
To prevent such horrors, you need to increase the table size adequately, in line with the traffic passing through the NAT:
/sbin/sysctl -w net.netfilter.nf_conntrack_max=524288
It is strongly recommended that you do not set such a large value if your NAT server has less than 1 gigabyte of RAM.
You can see the current value like this:
/sbin/sysctl net.netfilter.nf_conntrack_max
To see how full the connection tracking table already is, use:
/sbin/sysctl net.netfilter.nf_conntrack_count
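Both values can also be requested with a single command, which makes it easy to see how close you are to the limit:

/sbin/sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max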
Increase the size of the hash table
The hash table in which the lists of conntrack records are stored must be proportionally increased.
echo 65536 > /sys/module/nf_conntrack/parameters/hashsize
The rule is simple: hashsize = nf_conntrack_max / 8
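Note that this echo does not survive a reboot by itself. One way to make it permanent (assuming the usual modprobe.d layout on your distribution) is to pass the value as a module option:

echo 'options nf_conntrack hashsize=65536' > /etc/modprobe.d/nf_conntrack.conf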
Decrease the timeout values
As you recall, the NAT server only tracks live sessions that pass through it. When a session is closed, its entry is removed so that the table does not overflow. Session entries are also removed on timeout: if there has been no traffic on a connection for a long time, it is considered closed and its entry is likewise removed from the NAT server's memory.
However, by default, timeout values are quite large. Therefore, with large traffic flows, even if you stretch nf_conntrack_max to the limit, you still run the risk of quickly encountering table overflows and connection breaks.
To prevent this from happening, you must correctly set the timeouts for connection tracking on the NAT server.
Current values can be viewed, for example, like this:
sysctl -a | grep conntrack | grep timeout
As a result, you will see something like this:
net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15
These are the timeout values in seconds. As you can see, net.netfilter.nf_conntrack_generic_timeout is 600 (10 minutes): the NAT server keeps a session entry in memory as long as at least one packet passes through it every 10 minutes. At first glance that seems fine, but in fact it is far too long. And if you look at net.netfilter.nf_conntrack_tcp_timeout_established, you will see 432000 there: in other words, your NAT server will keep tracking an idle TCP session as long as some packet passes through it at least once every 5 days (!).
Putting it even more simply, DDoS-ing such a NAT server is easier than easy: its NAT table (the nf_conntrack_max limit) fills up in no time with the simplest flood, after which it starts breaking connections and, in the worst case, quickly turns into a black hole.
It is recommended to set the timeout values in the range of 30-120 seconds. This is quite enough for subscribers to work normally, and it is enough for the NAT table to be cleaned in time, preventing it from overflowing.
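For example, applying values from that range could look like this (the specific numbers are only an illustration; pick what fits your traffic profile):

/sbin/sysctl -w net.netfilter.nf_conntrack_generic_timeout=120
/sbin/sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=120
/sbin/sysctl -w net.netfilter.nf_conntrack_udp_timeout=30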
And do not forget to write the corresponding changes to /etc/rc.local and /etc/sysctl.conf so that they are applied again after a reboot.
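For instance, the sysctl part in /etc/sysctl.conf might look like this (the same example values as above):

net.netfilter.nf_conntrack_max = 524288
net.netfilter.nf_conntrack_generic_timeout = 120
net.netfilter.nf_conntrack_tcp_timeout_established = 120
net.netfilter.nf_conntrack_udp_timeout = 30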
Summary
After this tuning you will get a perfectly viable and productive NAT server. Of course, this is only "basic" tuning; we did not touch kernel tuning and the like. However, in most cases even such simple actions are enough for the normal operation of a fairly large network. As I said, our network has more than 30 thousand subscribers whose traffic is handled by 4 NAT servers.
In the following issues:
- large traffic flows and a high-performance shaper;
- large traffic flows and a high-performance firewall.