How to implement almost instant switching of a website between sites when one goes down

    It happens that sites go down because of failures at the hosting provider, in network channels, and so on. I have been running a hosting service for 7 years and see such problems regularly.

    A couple of years ago I realized that a backup-site service - one that does not require customers to modify their website or service - is very important to them. In theory, everything is simple:
    1. Have a copy of all the data in another data center.
    2. In the event of a failure, switch the operation to the backup DC.

    In practice, the system went through 2 complete technical redesigns (the main ideas were kept, but a significant part of the tooling was replaced), 3 moves to new hardware, and 1 move between service providers (from a German data center to two Russian ones). It took 2 years of studying how different systems behave in real conditions under client load.

    So even if a hoster uses a cluster solution to host client VDSs, the vast majority of existing cluster solutions are designed to work within a single data center, or as one complex system whose breakdown shuts down the entire cluster.

    Main points
    1. Replication of local disks via DRBD, WITHOUT drbd-proxy
    2. Switching between data centers by announcing our own block of IP addresses, so domains never change their addresses
    3. A separate backup copy (a regular copy, not a replica) plus monitoring in a third data center
    4. Hypervisor-based virtualization (in this case, KVM)
    5. Internal L3 network with dynamic routing: IPsec + mGRE + OSPF
    6. Simple composable components, each of which can be temporarily disabled or replaced
    7. The principle that there is no "most reliable system"
    8. The fatal point of failure

    The reasons for exactly these choices are described below.

    System requirements
    1. Run client services without modifications (i.e. run projects not designed for clustering)
    2. Keep working after any single hardware failure, up to the loss of any one data center
    3. Restore client services within 15 minutes after a fatal failure; losing up to 1 minute of data before the failure is acceptable
    4. Be able to recover data after the loss of any two data centers
    5. Maintain / replace our own hardware and service providers without downtime for customer services

    Data storage
    The best option turned out to be DRBD without a proxy; traffic compression happens together with encryption in IPsec. In synchronization protocol B, with data cached locally inside the virtual machines, the 10-12 ms latency (the ping between the Moscow and St. Petersburg data centers) does not affect performance; moreover, the delay applies only to writes, while reads are served quickly from the local disk.
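    For illustration, a minimal sketch of what one replicated resource could look like in DRBD 8.4 syntax (the host names, addresses and backing LVM volume below are made up for the example):

        resource vds101 {
            net {
                protocol B;                # memory-synchronous: a write completes once the
                                           # peer has received the data into its buffers
            }
            device    /dev/drbd101;
            disk      /dev/vg0/vds101;     # local LVM volume backing the VDS image
            meta-disk internal;

            on node-msk { address 10.254.0.1:7801; }   # addresses on the private
            on node-spb { address 10.254.0.2:7801; }   # (mGRE + IPsec) network
        }

    The virtual machine then uses /dev/drbd101 as its disk: reads are served locally, and only writes wait for the remote acknowledgement.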

    We investigated the following options:
    1, 2. Ceph, as rbd (block device) and as CephFS (distributed file system)
    3. XtreemFS
    4. DRBD
    5. DRBD with DRBD-Proxy

    Common for Ceph
    Advantages:
    Convenient bandwidth scaling. Simple capacity addition, automatic distribution / redistribution of data across servers.

    Disadvantages: 
    It works within a single data center; with two distant data centers you need two clusters and replication between them. It works poorly with a small number of disks (data-integrity checks stall ongoing disk operations). Synchronous replication only.
    Diagnosing a collapsed cluster is a difficult task: either spend a very long time breaking and repairing this system yourself, or have a contract for fast support from the developers.
    When upgrading a cluster there is a critical moment - restarting the monitors that hold the quorum. If something goes wrong at that moment, the cluster stops and then has to be reassembled manually.

    CephFS
    It was planned as a single file system for storing all the data of LXC / OpenVZ containers.
    Advantages:
    Ability to create snapshots of the file system. The file system and individual files may be larger than the local disk. All data is simultaneously available on all physical servers.

    Disadvantages:
    For every file-open operation the server has to contact the metadata server to check whether the file exists, where to find it, and so on. Metadata is not cached, so sites become noticeably slower - visible to the naked eye. Local caching is not implemented.
    The developers warn that the system is not ready yet and that it is better not to store important data in it.

    Ceph rbd
    Intended use - one block device per container.
    Advantages:
    Convenient snapshot creation and image cloning (for format 2 images). An image may be larger than the local disk. Local operations are cached by the host / container operating system. There are no delays when frequently reading small files.

    XtreemFS
    Advantages:
    Declared support for replication over long distances, including offline operation; partial replicas are supported.

    Disadvantages:
    In tests it turned out to be very slow and was not investigated further. It feels intended for distributed storage of an array of data / documents / disks - so that, for example, every office has its own copy - rather than for actively changing files such as databases or virtual server images.

    DRBD
    Advantages:
    Replication of block devices. Reads come from the local disk. It can work standalone, without a cluster. Data is stored in its natural form: in case of problems with DRBD you can attach to the backing device directly and keep working with it. Several synchronization modes.

    Disadvantages:
    Each image can only be synchronized between two servers (replication to 3-4 servers is possible, but when the primary server is switched, difficulties with distributing metadata between the servers are expected, and bandwidth drops several times).
    The size of the device cannot exceed the size of the local disk.

    DRBD with DRBD-Proxy
    A paid add-on to DRBD for long-distance replication.

    Advantages:
    1. It compresses traffic well, 2-10 times compared with operation without compression.
    2. A large local buffer that accepts write operations and sends them gradually to the remote server without slowing down operations on the primary one (in asynchronous replication mode).
    3. Sane support with fairly quick answers at certain times (apparently if you catch them during business hours).

    Disadvantages:
    Right at startup I hit a bug that had already been fixed but whose fix had not yet been released - they sent a separately built binary with the fix.
    In tests it proved extremely unstable: the simplest test, random writes at high speed, hung the proxy service so badly that the whole server had to be restarted, not just the proxy.
    From time to time it falls into a spinlock and simply eats an entire CPU core.

    Switching traffic between data centers
    Two options were considered: 
    1. Switching through changing records on DNS servers
    2. By announcing your own network of IP addresses from two data centers

    The best choice is the announcement of your network via BGP.

    Changing records on DNS servers
    Advantages:
    Ease of implementation
    By pinging you can see which data center the traffic is going to.

    Disadvantages:
    Some clients are not ready to delegate their domains to other DNS servers.
    Long switching time: DNS caches are often more aggressive than the stated TTL, and even with a TTL of 5-15 minutes someone will still hit the old server an hour later, and individual crawlers even days later.
    Client servers cannot keep their IP addresses when moving between data centers.
    If connectivity to a data center is only partially lost, the DNS servers can start handing out different IP addresses and the switchover will only be partial.

    Announcing our own block of addresses
    Advantages:
    Fast, guaranteed traffic switching between data centers. In tests within Moscow, a change in the BGP announcement propagates in a few seconds. Worldwide it may take longer, but it is still faster and more reliable than DNS.
    It is possible to cut traffic off from a half-working data center that the system has lost contact with but that part of the Internet can still see.

    Disadvantages:
    Internal routing configuration becomes more complicated. Switching individual system resources is only partially possible: traffic arrives at the data center that is closer to the client and leaves from the data center where the virtual server is running.
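    A sketch of what the announcement side could look like. The article does not say which routing daemon is used; below is a hypothetical BIRD configuration with invented AS numbers, prefixes and neighbor addresses. The same block is announced from both data centers, and the announcement is simply withdrawn (or made less attractive with AS-path prepending) in a data center that has to be taken out of service:

        # bird.conf on the border router of one data center (sketch)
        router id 192.0.2.1;

        protocol static our_block {
            route 198.51.100.0/24 reject;       # our provider-independent block
        }

        protocol bgp uplink {
            local as 64500;
            neighbor 192.0.2.254 as 64501;      # this data center's upstream
            export where proto = "our_block";
            import none;
        }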

    Backup in the third data center
    The situation of data loss in two data centers is quite real, for example, a software error that synchronously deleted data on the primary and backup servers. Or a hardware failure on the primary server while the backup or data resynchronization is in progress.
    For such cases a server is installed in a third data center, excluded from the overall cluster system. All it does is monitoring and storing backups.
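    As a minimal sketch of how such a regular copy (not a replica) might be collected - the host names, paths, schedule and copy direction here are assumptions for illustration:

        # crontab on the backup server in the third data center
        0 3 * * * rsync -aH --delete root@node-msk:/var/backups/dumps/ /srv/backup/msk/
        0 4 * * * rsync -aH --delete root@node-spb:/var/backups/dumps/ /srv/backup/spb/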

    Virtualization method
    The following options were considered:
    LXC, OpenVZ, KVM, Hyper-V

    The choice was made in favor of KVM, as it provides the greatest freedom of action.

    LXC
    Advantages:
    Easy to install, runs on a standard Linux kernel without modifications. Provides basic isolation of containers.
    No performance loss on virtualization.

    Disadvantages:
    Weak isolation.
    There is no live migration between servers.
    Inside the container, only Linux systems can be launched. Even within Linux systems, there are limitations on the functionality of kernel modules.

    OpenVZ
    Advantages:
    There is no loss of performance for virtualization.
    There is live migration.

    Disadvantages:
    It runs on a modified kernel; additional modules have to be built manually, and compatibility problems are possible because of the non-standard environment.
    Inside the container, only Linux works. Even within Linux systems, there are limitations on the functionality of kernel modules.

    KVM
    Advantages:
    It works without modifying the system kernel.
    There is live migration.
    You can connect equipment directly to the virtual machine (disks, usb devices, etc.).

    Disadvantages:
    Loss of performance on hardware virtualization

    Hyper-V
    Advantages:
    Good integration with Windows
    There is live migration

    Disadvantages:
    Features needed on the Linux side are not supported: SSD caching, replication of local disks, remote client access to the VDS console.
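    Since the choice fell on KVM (managed in this sketch through libvirt, which is an assumption, as are the guest and host names), moving a client VDS between nodes looks roughly like this:

        # live-migrate the guest to the other node without stopping it;
        # with replicated storage the disk itself does not need to be copied
        virsh migrate --live --persistent --undefinesource vds101 qemu+ssh://node-spb/system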

    Choosing an Internal Network
    The task is to provide an internal network with addressing that does not depend on the external network or on where the internal server is located. The ability to quickly redirect traffic to a server when its physical location changes (moving a VDS) is needed, as is arbitrary routing to each specific server (i.e. a server moves without changing its IP address). A fully meshed network between data centers is desirable, as is traffic protection.

    Initially the tinc option with a fully meshed L2 network was used. It is the easiest to configure and a relatively flexible option, but not always predictable. After a series of routing experiments I concluded that routing at the L3 level is exactly what is needed: predictable, manageable, fast. Dynamic routing via OSPF runs inside the internal network; a route is registered for each private IP address, i.e. every router knows through which one each specific server is reachable.
    With an L2 network the tables would be about the same, but less transparent, because they would be hidden inside the software rather than sitting in the kernel's standard routing tables.
    If necessary (in case of problems with OSPF as the number of routes grows), this scheme can easily be replaced by fully static route registration driven by our own services.
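    In other words, every virtual server is reachable through a host route that OSPF keeps pointed at whichever node currently runs it; the same routes could be installed statically by our own services if OSPF had to be dropped. A sketch with invented addresses:

        # on each router: one /32 route per VDS, next hop = the node that hosts it now
        ip route replace 10.10.3.15/32 via 10.254.0.2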

    Options considered:
    OpenVPN, tinc (L2), GRE, mGRE

    The mGRE option was selected. L2 traffic and multicast are not needed at the moment; if necessary, multicast can be added in software on the nodes, and there is no need for L2 traffic. The lack of encryption is compensated for by configuring IPsec for traffic between the nodes; IPsec also compresses it at the same time.
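    On Linux a multipoint GRE interface can be created with iproute2 alone; neighbors are added as mappings from internal addresses to the nodes' public endpoints, and OSPF then runs on top of this interface. A sketch with invented addresses:

        # one mGRE interface per node; no "remote" endpoint means multipoint mode
        ip tunnel add mgre0 mode gre key 1 ttl 64
        ip addr add 10.254.0.1/24 dev mgre0
        ip link set mgre0 up

        # static NBMA mapping: internal neighbor 10.254.0.2 sits behind
        # the other data center's public address 203.0.113.2
        ip neigh add 10.254.0.2 lladdr 203.0.113.2 dev mgre0 nud permanent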

    When setting this up in real conditions an interesting detail was discovered: despite filtering being fully disabled in the data center, its equipment looks inside the GRE protocol and analyzes what is inside. In the process, packets with OSPF traffic are dropped and an extra 2 ms of delay appears. So IPsec turned out to be needed not only for abstract encryption, but for the system to work at all.
    The data center's specialists asked the equipment vendor why such filtering happens, but have not yet received an answer (for 1-2 months already).
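    The GRE packets between the nodes' public addresses are therefore wrapped in IPsec transport mode. A sketch of what this could look like in strongSwan's ipsec.conf (the choice of strongSwan, the addresses and the pre-shared-key setup are assumptions; compress=yes is what provides the traffic compression mentioned earlier):

        # /etc/ipsec.conf on one node; the peer has a mirror-image conn
        conn dc-peer
            type=transport        # protect the GRE packets themselves, no extra tunnel
            authby=secret         # pre-shared key kept in /etc/ipsec.secrets
            left=198.51.100.10    # this node's public address
            right=203.0.113.2     # the other data center's public address
            leftprotoport=gre
            rightprotoport=gre
            compress=yes          # IPComp compression of the replication traffic
            auto=start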

    OpenVPN
    Advantages
    Already familiar. It works well. Able to work in L2 / L3 modes.

    Disadvantages:
    Works with point-to-point or star topologies. To build a fully meshed network, a large number of tunnels would have to be maintained.

    Tinc
    Advantages:
    Initially able to organize a fully connected L2-network between an arbitrary number of points. Easy to set up. Used it for about 1-2 years in the previous version of the system, before moving the servers to Russia.

    Disadvantages:
    Routing becomes unpredictable if two machines with the same MAC appear on the network (for example, after a split-brain in the cluster). Detecting that a server has changed location when it moves takes about 10-30 seconds.
    Carries L2 traffic, while in practice L3 is what is needed.

    GRE
    Advantages:
    Works in the kernel. Easy to configure. Can carry L2 traffic.

    Disadvantages:
    No encryption. A large number of interfaces has to be maintained.

    mGRE
    Advantages:
    Works in the kernel. Easy to configure. A mesh network is built with just one tunnel interface, simply by adding neighbors.

    Disadvantages:
    No encryption. Cannot carry L2 traffic; no multicast out of the box.

    Ready-made cluster solutions and the principle that there is no "most reliable vendor / hardware / program"
    While using the supposedly reliable cephfs / cephrbd storage, I managed to break it so badly that repairing it required the developers. Over several days I received the necessary consultations via the IRC channel, and in the course of the diagnostics it became clear that diagnosing and fixing such a problem on your own is practically impossible: it requires deep knowledge of the system's internals and a lot of experience with this kind of diagnostics. Moreover, when such a system breaks, the cluster stops working at all, which is unacceptable even with a support contract. On top of that, contracts for round-the-clock fast support for any such product are very expensive, which immediately rules them out for a mass-market service - you cannot sell cheaply what you bought expensively.

    The same applies to any "supplier of reliability" whose internals are closed or have not been studied down to a natural understanding of every detail. The cluster was built so that in an emergency each component of the system can be switched off, if it stops working as it should, and replaced by something else, at least temporarily, with the loss of some functionality but keeping client services running overall.

    DRBD can be switched off, or even removed entirely, and the LVM volumes used directly. Synchronization between the nodes and live migration will stop working; as an emergency mode, while the DRBD breakage is being repaired (most likely metadata, configs, or a rollback of software versions), this is acceptable. If the problem drags on, replication via rsync, LVM snapshots, lvmsync and so on is possible.
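    A rough sketch of that emergency mode (resource, volume and host names are invented; with internal metadata the data on the backing LVM volume remains directly usable):

        # stop the DRBD resource and point the VM at the backing volume instead
        drbdadm down vds101
        virsh edit vds101      # change the disk source from /dev/drbd101 to /dev/vg0/vds101

        # crude temporary copy to the other node while DRBD is being repaired
        lvcreate -s -L 10G -n vds101_snap vg0/vds101
        dd if=/dev/vg0/vds101_snap bs=4M | ssh node-spb 'dd of=/dev/vg0/vds101 bs=4M'
        lvremove -f vg0/vds101_snap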

    The KVM systems are never updated all at the same time. If one server fails, all client services keep running on the backup while repair work is going on. All client services are moved off a node before work on it begins.

    The third data center holds monitoring and backups. If the backups are lost, they can temporarily be replaced by LVM snapshots on the main nodes of the system. If monitoring is lost, the hosting system and all client services keep working, but automatic repair of broken resources stops. For now this is an acceptable compromise; if necessary, this part can also be duplicated.

    Dynamic VPN internal network. If this network breaks down, it is possible to move all resources to one data center and work without a VPN network.

    The public block of IP addresses. This is currently a compromise and a single point of failure: if for some reason the address block stops working (we forget to pay, the organization that issued the block closes, the block is reclaimed as part of address-space optimization), access to client resources will be lost. The assumption here is that such things usually do not happen unexpectedly and there will be time to prepare for losing the block. If the block is nevertheless lost unexpectedly, there is a fallback: take an address block from the data centers. Client resources can then be restored within 1-2 days - mostly the time needed for the data center to approve and configure the block; the operation of the cluster itself does not depend on this, and only the domains' DNS records need updating.
    In the future this point of failure can also be eliminated by obtaining a second block of addresses through another company.

    Data centers
    In practice, data centers sometimes lose power and Internet connectivity despite all their backup systems. Sometimes the most reliable pieces of hardware break and the Internet goes down simultaneously in several data centers that all depended on that piece of iron; data centers get shut down by state bodies during investigative measures, and so on. All of this has been experienced first-hand with Russian and foreign data centers.
    The final version of the hosting takes these problems into account as well: the main hosting system sits in two independent data centers located in different regions of the Russian Federation, Moscow and St. Petersburg. The data centers use communication channels that are independent of each other, are not legally connected with each other, and have no special shared coordination points such as common routers.

    Fatal point of failure
    This system protects the end customer from any single failure in any subsystem and from some multiple failures, with the exception of the continued operation of the service provider itself. The client inevitably takes on this risk when using someone else's infrastructure; the alternative is to build and maintain a similar system of their own.

    What came of it
    1. A client comes and says: I do not want to go down.
    2. In most cases no adaptation of the project is needed, but it has to move onto our servers. As a rule, the client gives us access to their hosting and our support does everything ourselves.
    3. We give the client a test address for accessing the server. The client checks that everything works the way they are used to, and we fix minor issues. Hardware virtualization lets us copy any project exactly, together with its existing operating system, programs and settings, without having to repeat the configuration on a new server and risk forgetting something on the old one.
    4. We switch over without an interruption; the longest part is the new IP addresses. When sites are transferred, the move ends up with an interruption of only a few minutes.
    5. A hosting contract is signed and an invoice is issued. For a small typical project it costs 4500 per month (this includes hosting, support and duplication).
    6. After that, usually nothing goes down.
