I'm going deeper underground, or what you should know about optimizing a network application
Greetings, friends!
In the previous two articles (one, two), we dove into the complexities of choosing between technologies and looked for the optimal settings for our solution at Ostrovok.ru. What topic will we raise today?
Every service has to run on some server, communicating with the hardware through the tools of the operating system. There are a great many of these tools, and just as many settings for them. In most cases their defaults are more than enough. In this article I would like to talk about the cases where the standard settings were still not enough and I had to get acquainted with the operating system a little more closely, in our case with Linux.
Using cores wisely
In the previous article I talked about Haproxy's cpu-map option. We used it to pin the Haproxy processes to the threads of one CPU on a dual-processor server. The second CPU was given over to handling the network card's interrupts.
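For reference, a minimal sketch of what such a binding can look like in haproxy.cfg, assuming HAProxy's multi-process (nbproc) mode; the process count and CPU numbers below are purely illustrative, and newer HAProxy versions use a different threading model and cpu-map syntax:

global
    nbproc 4
    # pin HAProxy processes 1-4 to logical CPUs 0-3 (threads of the first processor)
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3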
Below is a screenshot where you can see a similar split: on the left, the cores are busy with Haproxy in user space; on the right, with interrupt handling in kernel space.

The network card's interrupts are bound to CPUs automatically by this Bash script:
#!/bin/bash

interface=${1}

if [ -z "${interface}" ];then
    echo "no interface specified"
    echo "usage: ${0} eth1"
    exit 1
fi

# Number of physical processors (sockets) and of logical CPUs (threads).
nproc=$(grep 'physical id' /proc/cpuinfo|sort -u|wc -l)
ncpu=$(grep -c 'processor' /proc/cpuinfo)
cpu_per_proc=$[ncpu / nproc]
queue_threads=$[cpu_per_proc / 2]

# Build a table of all ncpu-bit binary masks (one bit per CPU).
binary_map=""
cpumap=""
for (( i=0; i < ncpu; i++ ));do
    cpumap=${cpumap}1
    b+='{0..1}'
done
binary_map=($(eval echo ${b}))

### Here we also try to raise the number of network card queues
### to the required number of threads, if the card supports it.
ethtool -L ${interface} combined ${queue_threads} || true

count=${ncpu}

# Walk the card's IRQs from /proc/interrupts and pin each queue
# to a single CPU, starting from the last one.
while read irq queue;do
    let "cpu_num=$[count-1]"
    let "cpu_index=$[2**cpu_num]"
    printf "setting ${queue} to %d (%d)\n" $((2#${binary_map[${cpu_index}]})) ${cpu_num}
    printf "%x\n" "$((2#${binary_map[${cpu_index}]}))" > /proc/irq/${irq}/smp_affinity
    [ ${interface} != ${queue} ] && count=$[count-1]
    [ $[ncpu - count] -gt ${queue_threads} ] && count=${ncpu}
done < <(awk "/${interface}/ {if(NR > 1){ sub(\":\", \"\", \$1); print \$1,\$(NF)} }" /proc/interrupts)

exit 0
There are plenty of scripts on the Internet, both simpler and more elaborate, that do the same job, but this one is enough for our needs.
In Haproxy we bound the processes to cores starting from the first core; this script binds the interrupts starting from the last one. This way we can split the server's processors into two camps.
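To check that the affinity actually took effect, you can compare the card's IRQ numbers with their masks. A small sketch, with eth0 standing in for your interface:

interface=eth0  # substitute your interface
for irq in $(awk "/${interface}/ {sub(\":\", \"\", \$1); print \$1}" /proc/interrupts); do
    printf "IRQ %s -> affinity mask %s\n" "${irq}" "$(cat /proc/irq/${irq}/smp_affinity)"
done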
For a deeper dive into interrupts and networking, I highly recommend reading this article.
Unlocking the capabilities of network devices
It happens that a great many frames can arrive over the network in a single moment, and the card's queue may not be ready for such an influx of guests, even though it has the capacity to handle it.
Let's talk about the network card's ring buffer. Most often the default values do not use the whole buffer that is available. You can view the current settings with the powerful ethtool utility.
Command usage example:
> ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256
Now let's take everything the hardware has to offer:
> ethtool -G eno1 rx 4096 tx 4096
> ethtool -g eno1
Ring parameters for eno1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Now you can be sure that the card is not being held back and is working at the peak of its capabilities.
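When there are many machines with different cards, it is convenient not to hard-code the numbers. Below is a minimal sketch that reads the preset maximums from ethtool -g and applies them; eno1 is just a placeholder, and keep in mind that ring buffer settings do not survive a reboot, so they have to be reapplied on boot:

iface=eno1  # substitute your interface
# pull the "Pre-set maximums" RX/TX values out of ethtool -g
rx_max=$(ethtool -g "${iface}" | awk '/Pre-set maximums/ {m=1} m && /^RX:/ {print $2; exit}')
tx_max=$(ethtool -g "${iface}" | awk '/Pre-set maximums/ {m=1} m && /^TX:/ {print $2; exit}')
ethtool -G "${iface}" rx "${rx_max}" tx "${tx_max}"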
Minimum sysctl settings for maximum benefit
Sysctl offers a huge variety of options, in every color and size you can imagine. And, as a rule, articles on the Internet that touch on optimization cover a rather impressive share of these parameters. I will only consider the ones that were genuinely useful to change in our case.
net.core.netdev_max_backlog - the size of the queue into which frames from the network card go before being processed by the kernel. With fast interfaces and large volumes of traffic it can fill up quickly. Default: 1000.
We can see this queue overflowing by looking at the second column of the /proc/net/softnet_stat file.
awk '{print $2}' /proc/net/softnet_stat
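One caveat: the values in softnet_stat are printed in hexadecimal. If you want human-friendly numbers, one line per CPU, something like this will do (strtonum is a gawk extension):

awk '{printf "cpu%d dropped: %d\n", NR - 1, strtonum("0x" $2)}' /proc/net/softnet_stat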
The file itself describes the netif_rx_stats structure, one line per CPU in the system. Specifically, the second column holds the number of dropped packets. If that value keeps growing over time, it is probably worth increasing net.core.netdev_max_backlog or getting a faster CPU.

net.core.rmem_default / net.core.rmem_max && net.core.wmem_default / net.core.wmem_max - these parameters set the default / maximum values for the socket read and write buffers. The default can be overridden at the application level when the socket is created (by the way, Haproxy has a parameter that does this). We have had cases where the kernel threw more packets at Haproxy than it managed to shovel away, and then the problems started. So this one matters.
net.ipv4.tcp_max_syn_backlog - the limit on new, not yet established connections for which a SYN packet has been received. If there is a large stream of new connections (for example, a lot of HTTP requests with Connection: close), it makes sense to raise this value so as not to waste time on retransmitted packets.

net.core.somaxconn - this one concerns connections that are already established but not yet processed by the application. If the server is single-threaded and two requests come in, the first will be handled by accept(), while the second will sit in the backlog, whose size is set by this parameter.

nf_conntrack_max is probably the best known of all these parameters. I think almost everyone who has dealt with iptables knows about it. Ideally, if you do not need iptables masquerading, you can unload the conntrack module and not think about it. In my case Docker is used, so you can't really unload anything.
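To tie these together, here is a sketch of how such values can be applied at runtime and persisted across reboots. The numbers are purely illustrative, not a recommendation, and the file name is arbitrary; pick your own values based on the observations above:

# apply at runtime (illustrative values)
sysctl -w net.core.netdev_max_backlog=10000
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.core.somaxconn=8192
sysctl -w net.netfilter.nf_conntrack_max=1048576

# persist across reboots
cat <<'EOF' > /etc/sysctl.d/90-network-tuning.conf
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_max_syn_backlog = 8192
net.core.somaxconn = 8192
net.netfilter.nf_conntrack_max = 1048576
EOF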
Monitoring: the obvious and the not so obvious
So that you don't have to search blindly for why "your proxy is slow", it is useful to set up a couple of graphs and put triggers on them.
nf_conntrack_count is the most obvious metric. It lets you watch how many connections are currently in the conntrack table. When the table overflows, the path for new connections is closed.
The current value can be found here:
cat /proc/sys/net/netfilter/nf_conntrack_count
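A small sketch that turns the raw counter into a utilization percentage, which is easier to hang a trigger on:

count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
echo "conntrack table usage: $((100 * count / max))%"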
TCP segments retransmitted - the number of retransmitted segments. This metric is very capacious, since it can point to problems at different levels. Growth in retransmits may indicate network problems, the need to tune system settings, or even that the end software (for example, Haproxy) is not doing its job. Either way, abnormal growth of this value is a reason to investigate.
In our case, a rise in this value most often points to problems with one of our suppliers, although there have been performance problems with both the servers and the network.
Example for verification:
netstat -s|grep 'segments retransmited'
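The counter reported by netstat is cumulative since boot, so on a graph you usually want its rate. A rough sketch of measuring the delta by hand (note that netstat itself misspells the word as "retransmited"):

prev=$(netstat -s | awk '/segments retransmited/ {print $1}')
sleep 10
curr=$(netstat -s | awk '/segments retransmited/ {print $1}')
echo "segments retransmitted in the last 10 seconds: $((curr - prev))"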
Socket Recv-Q - remember how we talked about the moments when an application does not have enough time to process requests and the socket backlog grows? Growth of this indicator makes it clear that something is wrong with the application and it cannot cope. I have seen mountains on graphs of this metric when the maxconn parameter in Haproxy was left at its default value (2000) and it simply stopped accepting new connections.
And again an example:
ss -lntp|awk '/LISTEN/ {print $2}'
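It is also handy to see right away which listening socket and which process are falling behind. A sketch (run it as root so ss can show process names):

ss -lntp | awk 'NR > 1 && $2 > 0 {print "Recv-Q", $2, "on", $4, $6}'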
It will not be superfluous to have a graph with a breakdown of TCP connections by state, and to render time-wait and established separately, because their values are usually very different from the rest.
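Here is a minimal sketch of collecting such a breakdown with ss; a monitoring agent would ship the same numbers:

ss -ant | awk 'NR > 1 {state[$1]++} END {for (s in state) print s, state[s]}'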
Besides these metrics, there are many other, more obvious ones - for example, the load on the network interface or the CPU. Choosing them will depend more on the specifics of your workload.
Instead of a conclusion
That's about it. I tried to describe the key points I ran into while setting up an HTTP reverse proxy. The task may not seem difficult, but as the load grows, so does the number of pitfalls, which always pop up at the worst possible time. I hope this article helps you avoid the difficulties I had to face.
Peace to all!