Network Performance Comparison for Kubernetes

Original author: Machine Zone (translation)


Kubernetes requires that each container in the cluster has a unique, routable IP. Kubernetes does not assign IP addresses itself, leaving this task to third-party solutions.

The goal of this study is to find the solution with the least latency, highest throughput, and lowest configuration cost. Since our load depends on the delays, we measure the delays of high percentiles at a sufficiently active network load. In particular, we focused on performance in the region of 30-50 percent of the maximum load, as this best reflects typical situations for non-congested systems.

Options


Docker with --net=host


Our reference setup; all other options were compared against it.

The --net=host option means that containers inherit the IP addresses of their host machines, i.e. there is no network containerization.

The absence of network containerization a priori gives better performance than any implementation of it, which is why we used this setup as the baseline.
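
As a minimal illustration of this mode (the image name is just an example, not from the original setup):

    # The container shares the host's network stack: no veth pair, no bridge, no NAT
    $ docker run -d --net=host nginx
    # nginx now listens directly on the host's own IP address and port 80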

Flannel


Flannel is a virtual network solution maintained by the CoreOS project. It is well tested and production-ready, so the cost of deployment is minimal.

When you add a machine running flannel to the cluster, flannel does three things:

  1. Assigns a subnet to the new machine using etcd.
  2. Creates a virtual bridge interface on the machine (the docker0 bridge).
  3. Configures a packet forwarding backend:

    • aws-vpc: registers the machine's subnet in the Amazon VPC route table. The number of entries in this table is limited to 50, i.e. you cannot have more than 50 machines in a cluster if you use flannel with aws-vpc; in addition, this backend only works with Amazon AWS;
    • host-gw: creates IP routes to the subnets via the IP addresses of the remote machines. This requires direct L2 connectivity between the hosts running flannel;
    • vxlan: creates a virtual VXLAN interface.

Because flannel uses a bridge interface to forward packets, each packet passes through two network stacks when it is sent from one container to another.
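
For illustration, the host-gw backend just installs ordinary kernel routes that send each remote machine's container subnet via that machine's host IP (the subnets and addresses below are hypothetical):

    # Routes added by flannel host-gw on one host (hypothetical addresses)
    $ ip route
    10.244.1.0/24 via 192.168.0.12 dev eth0   # container subnet of host 192.168.0.12
    10.244.2.0/24 via 192.168.0.13 dev eth0   # container subnet of host 192.168.0.13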

IPvlan


IPvlan is a driver in the Linux kernel that allows you to create virtual interfaces with unique IP addresses without the need for a bridge interface.

To assign an IP address to a container using IPvlan, you need:

  1. Create a container without a network interface at all.
  2. Create an ipvlan interface in the default network namespace.
  3. Move the interface to the network namespace of the container.

IPvlan is a relatively new solution, so there are no ready-made tools to automate this process. This makes deploying IPvlan across many machines and containers harder, i.e. the cost of deployment is high. However, IPvlan does not require a bridge interface and forwards packets directly from the NIC to the virtual interface, so we expected it to perform better than flannel.
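
A minimal manual sketch of these three steps (the interface name, container name, and address are ours; a real deployment would have to automate this for every container):

    # 1. Start a container with no network at all
    $ docker run -d --net=none --name web nginx
    $ pid=$(docker inspect -f '{{.State.Pid}}' web)

    # 2. Create an ipvlan interface in the default network namespace,
    #    attached to the physical NIC
    $ ip link add ipvl0 link eth0 type ipvlan mode l2

    # 3. Move it into the container's network namespace and configure it
    $ ip link set ipvl0 netns $pid
    $ nsenter -t $pid -n ip addr add 10.0.0.42/24 dev ipvl0
    $ nsenter -t $pid -n ip link set ipvl0 up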

Load testing scenario


For each option, we performed the following steps:

  1. Set up the network between the two test machines.
  2. Launched tcpkali in a container on one machine, configured to send requests at a constant rate.
  3. Launched nginx in a container on the other machine, configured to respond with a file of fixed size.
  4. Collected the system metrics and the tcpkali results.

We ran this test at several request rates: from 50,000 to 450,000 requests per second (RPS).

For each request, nginx responded with a static file of fixed size: 350 bytes (a 100-byte body plus 250 bytes of headers) or 4 kilobytes.
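
As a rough sketch of the two sides of the test (the exact tcpkali flags and container invocations are our assumptions, not taken from the article):

    # Responder machine: nginx in a container serving the fixed-size static file
    $ docker run -d -p 80:80 nginx

    # Generator machine: tcpkali over many persistent connections at a constant
    # request rate; the CPU pinning is described in the Configuration section
    $ taskset -ac 0-17 tcpkali --threads 18 --connections 60000 -T 60s ...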

Results


  1. IPvlan shows the lowest latency and the highest maximum throughput. Flannel with host-gw and aws-vpc follows closely, with host-gw performing better under maximum load.
  2. Flannel with vxlan showed the worst results in all tests. However, we suspect that its exceptionally bad 99.999th percentile is caused by a bug.
  3. The results for the 4 KB response are similar to the 350-byte case, with two noticeable differences:

    • the maximum RPS is much lower, since with 4 KB responses it takes only ≈270,000 RPS to fully load the 10 Gbps NIC;
    • IPvlan gets much closer to --net=host as the bandwidth limit is approached.

Our current choice is flannel with host-gw. It has few dependencies (in particular, it does not require AWS or a new Linux kernel version), it is easy to deploy compared to IPvlan, and it offers sufficient performance. IPvlan is our fallback: if flannel ever gains IPvlan support, we will switch to that combination.

Although aws-vpc performed slightly better than host-gw, its 50-machine limitation and the hard dependency on Amazon AWS were the decisive factors for us.

50,000 RPS, 350 bytes



At 50,000 requests per second, all candidates show acceptable performance. The main trend is already visible: IPvlan shows the best results, host-gw and aws-vpc follow it, and vxlan is the worst.

150,000 RPS, 350 bytes




Latency percentiles at 150,000 RPS (≈30% of maximum RPS), ms


IPvlan is slightly better than host-gw and aws-vpc, but it has the worst 99.99th percentile. host-gw performs slightly better than aws-vpc.

250,000 RPS, 350 bytes




We expect this level of load to be typical for production, so these results are especially important.

Latency percentiles at 250,000 RPS (≈50% of maximum RPS), ms


IPvlan again shows the best performance, but aws-vpc has the best 99.99th and 99.999th percentiles. host-gw beats aws-vpc at the 95th and 99th percentiles.

350,000 RPS, 350 bytes




In most cases the latency is close to the 250,000 RPS (350 bytes) results, but it grows rapidly above the 99.5th percentile, which means we are approaching the maximum RPS.

450,000 RPS, 350 bytes





Interestingly, host-gw shows much better performance than aws-vpc here:


500,000 RPS, 350 bytes


At 500,000 RPS only IPvlan continues to work, even outperforming --net=host, but the latency is so high that we cannot call it acceptable for latency-sensitive applications.


50,000 RPS, 4 kilobytes




Larger responses (4 KB versus the previously tested 350 bytes) produce a heavier network load, but the leaderboard remains almost unchanged:

Latency percentiles at 50,000 RPS (≈20% of maximum RPS), ms


150,000 RPS, 4 kilobytes




host-gw has a surprisingly poor 99.999th percentile, but it still shows good results at the lower percentiles.

Latency percentiles at 150,000 RPS (≈60% of maximum RPS), ms


250,000 RPS, 4 kilobytes




This is the maximum RPS with the large (4 KB) response. Unlike the small-response (350-byte) case, aws-vpc is significantly better than host-gw here.

vxlan was once again excluded from the chart.

Testing Environment


The basics


To better understand this article and reproduce our test environment, you should be familiar with the basics of high-performance networking.

These articles provide useful information on this topic:


Machines


  • We used two c4.8xlarge instances on Amazon AWS EC2 running CentOS 7.
  • Both machines have enhanced networking enabled.
  • Each machine is a two-socket NUMA system; each processor has 9 cores with 2 hardware threads (hyperthreads) per core, for an effective 36 threads per machine.
  • Each machine has a 10Gbps network interface card (NIC) and 60 GB of RAM.
  • To support enhanced networking and IPvlan, we installed the Linux 4.3.0 kernel with the Intel ixgbevf driver.

Configuration


Modern NICs provide Receive Side Scaling (RSS) via multiple interrupt request lines (IRQs). EC2 exposes only two such lines in a virtualized environment, so we tested several RSS and Receive Packet Steering (RPS) configurations and settled on the following settings, partly recommended by the Linux kernel documentation:

  • IRQ affinity. The first core of each of the two NUMA nodes is configured to receive interrupts from the NIC. To map CPUs to NUMA nodes, use lscpu:

    $ lscpu | grep NUMA
    NUMA node(s):          2
    NUMA node0 CPU(s):     0-8,18-26
    NUMA node1 CPU(s):     9-17,27-35

    This is configured by writing 0 and 9 to /proc/irq/<IRQ number>/smp_affinity_list, where the IRQ numbers are obtained with grep eth0 /proc/interrupts:

    $ echo 0 > /proc/irq/265/smp_affinity_list
    $ echo 9 > /proc/irq/266/smp_affinity_list

  • Receive Packet Steering (RPS). Several RPS configurations were tested. To reduce latency, we relieved the IRQ-handling cores of packet processing by using only CPUs 1-8 and 10-17. Unlike smp_affinity for IRQs, the rps_cpus sysfs file has no _list suffix, so bitmasks are used to list the CPUs to which RPS may steer traffic (see the Linux kernel documentation: RPS Configuration; a sanity check of these masks is given after this list):

    $ echo "00000000,0003fdfe" > /sys/class/net/eth0/queues/rx-0/rps_cpus
    $ echo "00000000,0003fdfe" > /sys/class/net/eth0/queues/rx-1/rps_cpus

  • Transmit Packet Steering (XPS). All NUMA node 0 CPUs (including hyperthreads, i.e. CPUs 0-8 and 18-26) were assigned to tx-0, and NUMA node 1 CPUs (9-17, 27-35) to tx-1 (see the Linux kernel documentation: XPS Configuration):

    $ echo "00000000,07fc01ff" > /sys/class/net/eth0/queues/tx-0/xps_cpus
    $ echo "0000000f,f803fe00" > /sys/class/net/eth0/queues/tx-1/xps_cpus

  • Receive Flow Steering (RFS). We planned to use 60,000 persistent connections, and the official documentation recommends rounding that number up to the nearest power of two. Each RX queue gets half of the flow entries, since rps_sock_flow_entries is split across the two queues:

    $ echo 65536 > /proc/sys/net/core/rps_sock_flow_entries
    $ echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
    $ echo 32768 > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt

  • Nginx. Nginx used 18 worker processes, each pinned to its own CPU (0-17). This is configured with worker_cpu_affinity:

    worker_processes 18;
    worker_cpu_affinity 1 10 100 1000 10000 ...;

  • Tcpkali. Tcpkali has no built-in support for pinning to specific CPUs. To make use of RFS, we ran tcpkali under taskset and tuned the scheduler to reassign threads infrequently:

    $ echo 10000000 > /proc/sys/kernel/sched_migration_cost_ns
    $ taskset -ac 0-17 tcpkali --threads 18 ...
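
As a sanity check of the bitmasks above (our own verification, not part of the original setup): the rps_cpus value 0003fdfe selects CPUs 1-8 and 10-17, i.e. the first hyperthreads minus the IRQ cores 0 and 9, while the xps_cpus values split all 36 CPUs by NUMA node. The masks can be reproduced with shell arithmetic:

    # RPS: CPUs 0-17 without the IRQ cores 0 and 9
    $ printf '%08x\n' $(( ((1 << 18) - 1) & ~((1 << 0) | (1 << 9)) ))
    0003fdfe

    # XPS tx-0: NUMA node 0, CPUs 0-8 and 18-26
    $ printf '%08x\n' $(( (((1 << 9) - 1) << 18) | ((1 << 9) - 1) ))
    07fc01ff

    # XPS tx-1: NUMA node 1, CPUs 9-17 and 27-35; the value does not fit in 32 bits,
    # so sysfs takes it as two comma-separated 32-bit words: 0000000f,f803fe00
    $ printf '%x\n' $(( (((1 << 9) - 1) << 27) | (((1 << 9) - 1) << 9) ))
    ff803fe00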

This configuration allowed us to spread the interrupt load evenly across the CPU cores and achieve better throughput while keeping the same latency as in the other configurations we tested.

Cores 0 and 9 service only the NIC interrupts and do not handle packets, but they remain the busiest:



We also used tuned from Red Hat with the network-latency profile enabled.

NOTRACK rules were added to minimize the impact of nf_conntrack.

sysctl was configured to support a large number of TCP connections:

fs.file-max = 1024000
net.ipv4.ip_local_port_range = "2000 65535"
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1

From the translator: Many thanks to our colleagues at Machine Zone, Inc. for this testing! It helped us, so we wanted to share it with others.

P.S. You might also be interested in our article “Container Networking Interface (CNI) - Network Interface and Standard for Linux Containers”.
