Fault-tolerant balancing of VoIP traffic. Switching load between data centers at peak time
A few words about what we do. DINS takes part in developing and supporting a UCaaS service for corporate clients on the international market. The service is used by small companies and startups as well as by large enterprises. Clients connect over the Internet using SIP over TCP, TLS, or WSS. This creates a rather heavy load: almost 1.5 million connections from endpoints, namely Polycom / Cisco / Yealink phones and software clients for PC / Mac / iOS / Android.
In this article, I describe how our VoIP entry points are arranged.
On the perimeter of the system (between the endpoint devices and the core) sit commercial SBCs (Session Border Controllers).
Since 2012, we have used solutions from Acme Packet, which was later acquired by Oracle. Before that, we used NatPASS.
Here is a brief list of the SBC functionality that we use:
• NAT traversal;
• SIP normalization (allowed / disallowed headers, header manipulation rules, etc.);
• TLS & SRTP offload;
• Transport conversion (inside the system we use SIP over UDP);
• MOS monitoring (via RTCP-XR);
• ACLs, brute-force detection;
• Reduced registration traffic thanks to a longer contact expiration (low expire on the access side, high on the core side);
• Per-method SIP message throttling.
Commercial systems have obvious advantages (out-of-the-box functionality, commercial support) and drawbacks (price, delivery time, no way to get the new features we need or too long a wait for them, slow problem resolution, etc.). Gradually the drawbacks began to outweigh the advantages, and it became clear that we needed to develop our own solution.
Development started a year and a half ago. In the border subsystem, we traditionally distinguish two main components, SIP servers and media servers, with load balancers in front of each. I work on the entry points / balancers, so that is what I will talk about. The main requirements for the balancers:
- Fault tolerance: the system must keep providing service if one or more instances in a data center, or an entire data center, fails;
- Serviceability: we want to be able to switch load from one data center to another;
- Scalability: we want to increase capacity quickly and inexpensively.
We chose IPVS (aka LVS) in IPIP mode (traffic tunneling). I will not go into a comparative analysis of the NAT / DR / TUN / L3DSR modes and will mention only our reasons:
- We do not want to require backends to be on the same subnet as LVS (pools contain backends from both the local and remote data centers);
- The backend should receive the original source IP of the client (or its NAT), in other words, source NAT is not suitable;
- The backend should support simultaneous work with multiple VIPs.
We balance media traffic as well (this turned out to be very difficult, and we are going to abandon it), so the current deployment scheme in a data center looks like this:
The current IPVS balancing strategy is “sed” (Shortest Expected Delay); a few words about it. Unlike Weighted Round Robin and Weighted Least-Connection, it does not spill traffic over to backends with lower weights until a certain threshold is reached. The shortest expected delay is calculated as (Ci + 1) / Ui, where Ci is the number of connections on backend i and Ui is its weight. For example, if the pool contains backends with weights 50,000 and 2, new connections are distributed among the backends with weight 50,000 until each of them reaches 25,000 connections or hits its uthreshold, the upper limit on the number of connections to a server.
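The selection rule above can be illustrated with a small simulation. This is only a sketch of the (Ci + 1) / Ui formula, not the kernel implementation: the `Backend` class, the function names, and the backend names are made up for illustration, and the threshold handling is simplified (a backend at `uthreshold` is simply skipped).

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    weight: int      # Ui in the formula
    conns: int = 0   # Ci in the formula


def sed_cost(b):
    """Shortest Expected Delay cost of one backend: (Ci + 1) / Ui."""
    return (b.conns + 1) / b.weight


def pick_backend(pool, uthreshold=None):
    """Return the backend with the lowest cost, skipping backends at uthreshold."""
    candidates = [
        b for b in pool
        if b.weight > 0 and (uthreshold is None or b.conns < uthreshold)
    ]
    return min(candidates, key=sed_cost) if candidates else None


def simulate(pool, new_connections, uthreshold=None):
    """Assign new connections one by one and return the per-backend counts."""
    for _ in range(new_connections):
        backend = pick_backend(pool, uthreshold)
        if backend is None:
            break  # every backend is at its threshold
        backend.conns += 1
    return {b.name: b.conns for b in pool}


pool = [
    Backend("dc1-a", 50000), Backend("dc1-b", 50000),  # local data center
    Backend("dc2-a", 2), Backend("dc2-b", 2),          # adjacent data center
]
counts = simulate(pool, 49_998)
print(counts)
```

With these weights, an idle backend of weight 2 has a cost of 1 / 2 = 0.5, while a backend of weight 50,000 stays below 0.5 until it holds just under 25,000 connections, which is where the threshold in the example comes from.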
You can read more about balancing strategies in man ipvsadm.
The IPVS pool looks like this (here and below, the IP addresses are made up):
# ipvsadm -ln
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port      Forward Weight ActiveConn InActConn
TCP  18.104.22.168:5060 sed
  -> 10.11.100.181:5060      Tunnel  50000  5903       4
  -> 10.11.100.192:5060      Tunnel  50000  5905       1
  -> 10.12.100.137:5060      Tunnel  2      0          0
  -> 10.12.100.144:5060      Tunnel  2      0          0
The load on the VIP is distributed among the servers with a weight of 50,000 (they are deployed in the same data center as this LVS instance). If they become overloaded or are blacklisted, the load spills over to the backup part of the pool, in the adjacent data center.
Exactly the same pool, but with the weights reversed, is configured in the adjacent data center (in production the number of backends is, of course, much larger).
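The "same pool, reversed weights" idea can be sketched as a small generator. This is an illustration under assumptions, not our provisioning code: the function names and the VIP 192.0.2.10 are hypothetical, and the pools are rendered as plain ipvsadm commands only for readability (in the real system Keepalived manages IPVS).

```python
LOCAL_WEIGHT = 50000   # weights taken from the example pool above
REMOTE_WEIGHT = 2


def build_pool(local_dc, backends_by_dc):
    """Return [(backend_ip, weight), ...] for the LVS located in local_dc."""
    pool = []
    for dc, backends in backends_by_dc.items():
        weight = LOCAL_WEIGHT if dc == local_dc else REMOTE_WEIGHT
        pool.extend((ip, weight) for ip in backends)
    return pool


def ipvsadm_commands(vip, port, pool):
    """Render a pool as ipvsadm commands: sed scheduler, IPIP tunneling (-i)."""
    service = f"{vip}:{port}"
    commands = [f"ipvsadm -A -t {service} -s sed"]
    for ip, weight in pool:
        commands.append(f"ipvsadm -a -t {service} -r {ip}:{port} -i -w {weight}")
    return commands


backends = {
    "dc1": ["10.11.100.181", "10.11.100.192"],
    "dc2": ["10.12.100.137", "10.12.100.144"],
}
for dc in backends:
    print(f"# pool for the LVS in {dc}")
    print("\n".join(ipvsadm_commands("192.0.2.10", 5060, build_pool(dc, backends))))
```

Both data centers get the same list of backends; only the weights differ depending on where the LVS instance runs.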
Synchronizing connections via ipvs sync allows the backup LVS to be aware of all current connections.
For synchronization between data centers we used a “dirty” trick, which nevertheless works fine. IPVS sync works only over multicast, which was difficult for us to deliver correctly to the neighboring DC. Instead, we duplicate the synchronization traffic with the iptables TEE target from the IPVS master into an IP-IP tunnel to a server in the neighboring DC; there can be several target hosts / data centers:
#### start ipvs sync master role:
ipvsadm --start-daemon master --syncid 10 --sync-maxlen 1460 --mcast-interface sync01 --mcast-group 22.214.171.124 --mcast-port 8848 --mcast-ttl 1
#### duplicate all sync packets to remote LVS servers using iptables TEE target:
iptables -t mangle -A POSTROUTING -d 126.96.36.199/32 -o sync01 -j TEE --gateway 172.20.21.10 # ip-ip remote lvs server 1
iptables -t mangle -A POSTROUTING -d 188.8.131.52/32 -o sync01 -j TEE --gateway 172.20.21.14 # ip-ip remote lvs server 2
#### start ipvs sync backup role:
ipvsadm --start-daemon backup --syncid 10 --sync-maxlen 1460 --mcast-interface sync01 --mcast-group 184.108.40.206 --mcast-port 8848 --mcast-ttl 1
#### be ready to receive sync packets from remote LVS servers:
iptables -t mangle -A PREROUTING -d 220.127.116.11/32 -i loc02_srv01 -j TEE --gateway 127.0.0.1
iptables -t mangle -A PREROUTING -d 18.104.22.168/32 -i loc02_srv02 -j TEE --gateway 127.0.0.1
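Since there can be several remote LVS servers, the master-side TEE rules are a natural thing to generate. A minimal sketch, assuming one rule per remote tunnel gateway as in the listing above; the function name is hypothetical and the multicast group 224.0.0.81 is only a placeholder value.

```python
def tee_rules(mcast_group, sync_iface, remote_gateways):
    """One POSTROUTING rule per remote LVS: copy each sync packet to its gateway."""
    rules = []
    for gateway in remote_gateways:
        rules.append(
            f"iptables -t mangle -A POSTROUTING -d {mcast_group}/32 "
            f"-o {sync_iface} -j TEE --gateway {gateway}"
        )
    return rules


# Placeholder multicast group; the gateways are the IP-IP tunnel peers
rules = tee_rules("224.0.0.81", "sync01", ["172.20.21.10", "172.20.21.14"])
print("\n".join(rules))
```

Adding one more data center then means adding one more gateway to the list.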
In fact, each of our LVS servers plays both roles at once (master and backup). On the one hand, this is simply convenient, since it eliminates role changes when switching traffic; on the other hand, it is necessary, since each DC by default handles the traffic of its own group of public VIPs.
Load transfer between data centers
In normal operation, each public IP address is announced to the Internet from every data center (two in this diagram). Incoming VIP traffic is routed to the DC we currently need using the BGP MED (Multi Exit Discriminator) attribute, with different values for the active DC and the backup DC. The backup DC is always ready to accept traffic if something happens to the active one:
By changing the BGP MED values and using cross-location IPVS sync, we can smoothly move traffic from the backends of one data center to another without affecting established phone calls, which will naturally end sooner or later. The process is fully automated (for each VIP we have a button in the management console) and looks like this:
The SIP VIP is active in DC1 (left); the cluster in DC2 (right) is the backup and, thanks to IPVS synchronization, holds information about the established connections in memory. On the left, active VIPs are announced with a MED of 100; on the right, with a MED of 500:
The toggle button changes the so-called “target_state” (an internal concept that declares the BGP MED values at a given moment). At this point we do not assume that DC1 is healthy and ready to handle traffic, so LVS in DC2 enters the “force active” state, lowers its MED to 50, and thus pulls traffic to itself. If the backends in DC1 are alive and reachable, calls will not break. All new TCP connections (registrations) will be sent to DC2 backends:
DC1 receives the replicated target_state and sets its MED to the backup value (500). When DC2 learns about this, it normalizes its own value (50 => 100). It remains to wait for all active calls in DC1 to finish and then terminate the established TCP connections. SBC instances in DC1 put the relevant services into the so-called “graceful shutdown” state: they answer subsequent SIP requests with “SIP 503” and close the connections, while new connections are not accepted. These instances are also blacklisted on LVS. When its connection breaks, a client establishes a new registration / connection, which now arrives in DC2:
The process ends when all traffic is in DC2.
DC1 and DC2 swapped roles.
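The sequence of MED changes described above can be written down as a tiny state model. This is a sketch built only from the values in the text (100 for active, 500 for backup, 50 for "force active"); the function names and the way a DC learns its peer's MED are assumptions made for illustration.

```python
MED_FORCE_ACTIVE = 50
MED_ACTIVE = 100
MED_BACKUP = 500


def med_for(target_state, peer_med):
    """MED a DC announces, given its target_state and the MED its peer announces."""
    if target_state == "backup":
        return MED_BACKUP
    # Target is "active": stay in force active until the peer has demoted itself
    if peer_med == MED_BACKUP:
        return MED_ACTIVE
    return MED_FORCE_ACTIVE


def preferred_dc(meds):
    """All else being equal, the announcement with the lowest MED attracts traffic."""
    return min(meds, key=meds.get)


def switch_dc1_to_dc2():
    """Replay the switchover steps and return the MEDs after each one."""
    meds = {"DC1": MED_ACTIVE, "DC2": MED_BACKUP}
    timeline = [dict(meds)]
    meds["DC2"] = med_for("active", meds["DC1"])   # toggle pressed: force active
    timeline.append(dict(meds))
    meds["DC1"] = med_for("backup", meds["DC2"])   # DC1 gets the new target_state
    timeline.append(dict(meds))
    meds["DC2"] = med_for("active", meds["DC1"])   # DC2 normalizes 50 => 100
    timeline.append(dict(meds))
    return timeline


for step in switch_dc1_to_dc2():
    print(step, "-> traffic goes to", preferred_dc(step))
```

At every step exactly one DC has the lowest MED, so the VIP traffic is never split between the two.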
With a constantly high load on the entry points, being able to switch traffic at any time turned out to be very convenient. The same mechanism starts automatically if the backup DC suddenly begins to receive traffic. To protect against flapping, automatic switching is triggered only once, in one direction, and a lock is then set on further automatic switching; human intervention is required to remove it.
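The anti-flapping rule is essentially a one-shot latch. A minimal sketch, assuming the switch itself is passed in as a callback; the class and method names are hypothetical and do not come from our management server.

```python
class AutoSwitchGuard:
    """Allows one automatic switchover, then blocks until an operator unlocks it."""

    def __init__(self):
        self.locked = False

    def on_backup_traffic(self, switch):
        """Call when the backup DC unexpectedly receives traffic.

        Returns True if the switchover was triggered, False if it was blocked.
        """
        if self.locked:
            return False
        self.locked = True  # lock first, so a failing switch cannot cause a retry loop
        switch()
        return True

    def unlock(self):
        """Manual operator action that re-enables automatic switching."""
        self.locked = False


guard = AutoSwitchGuard()
events = []
guard.on_backup_traffic(lambda: events.append("switch to DC2"))
guard.on_backup_traffic(lambda: events.append("switch back to DC1"))  # blocked
print(events)
```

Setting the lock before calling the switch is a design choice of this sketch: even if the switch fails halfway, the system will not try again on its own.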
VRRP cluster & IPVS manager: Keepalived. Keepalived is responsible for moving VIPs within the cluster, as well as for backend health checking and blacklisting.
BGP stack: ExaBGP. It is responsible for announcing routes to the VIP addresses and setting the corresponding BGP MEDs, and it is fully controlled by the management server. ExaBGP is a robust BGP daemon written in Python; it is actively developed and does its job 100%.
Management server (API / monitoring / sub-component management): Pyro4 + Flask. It is the provisioning server for Keepalived and ExaBGP, manages all other system settings (sysctl / iptables / ipset / etc.), provides monitoring (gnlpy), and adds and removes backends on demand (backends talk to its API).
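One common way to drive ExaBGP from an external program is its process API: a helper process prints text commands to its stdout and ExaBGP turns them into announcements. The sketch below shows how a VIP could be announced with a given MED this way. It is an assumption about the mechanism, not a description of how our management server actually talks to ExaBGP; the function names and the VIP 192.0.2.10 are made up.

```python
import sys


def announce_line(vip, med):
    """Build an ExaBGP text API command announcing a /32 VIP with the given MED."""
    return f"announce route {vip}/32 next-hop self med {med}"


def announce(vip, med, out=sys.stdout):
    """Send the command to ExaBGP, which reads the helper's stdout line by line."""
    out.write(announce_line(vip, med) + "\n")
    out.flush()  # without a flush the command may sit in the buffer


if __name__ == "__main__":
    announce("192.0.2.10", 100)   # active DC
    announce("192.0.2.10", 500)   # re-announce with the backup MED
```

Re-announcing the same prefix with a different MED is how this sketch models the change of state for a VIP.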
A virtual machine with four cores of an Intel Xeon Gold 6140 CPU @ 2.30GHz serves a 300 Mbps / 210 Kpps traffic stream (media traffic; about 3 thousand simultaneous calls pass through it at peak time). CPU utilization in this case is 60%.
This is currently enough to serve the traffic of 100 thousand end devices (desk phones). To serve all traffic (more than 1 million endpoints), we are building about 10 pairs of such clusters in several data centers.