ViPNet Failover crypto gateway, or how not to implement fault tolerance

For about three years I was engaged in the integration of Infotex products. During this time I became closely acquainted with most of them, and on the whole I believe they have deservedly become so widespread in Russia. Their main advantages are FSB and FSTEC certificates, a wide product range covering both software and hardware-software solutions, easy and convenient scaling and network administration, good technical support, convenient licensing, ease of installation and configuration, and, of course, the price compared to analogues. There are disadvantages too, but who doesn't have them? However, the most unsuccessful product in the entire line, in my opinion, is the ViPNet Failover hot standby cluster, and below I will explain why.

As the documentation says, the hot standby cluster mode is designed so that, if one server running the ViPNet software fails, another server takes over its functions on the fly. The hot standby cluster consists of two interconnected computers, one of which (the active one) acts as the ViPNet server (coordinator), while the other (the passive one) is in standby mode. In the event of failures critical to the operation of the ViPNet software on the active server (primarily a malfunction of the network or network equipment), the passive server switches to active mode, taking over the load and acting as the coordinator instead of the server that recorded the failure. When working in hot standby cluster mode, the failure protection system functions in the same way as in single mode.

I should note that ViPNet Coordinator Linux itself is not bad; it is the hot standby scheme that spoils everything.

From the point of view of other computers on the network, the entire cluster has one IP address on each of its network interfaces. This address belongs to whichever server is currently in active mode. A server in passive mode has a different IP address, which other computers do not use to communicate with the cluster. Unlike the active-mode addresses, in passive mode each server has its own address on each interface, and these addresses differ between the two servers.

The failover.ini config contains three types of parameter sections. A [channel] section describes the IP addresses of a backed-up interface:

[channel] 
device= eth1 
activeip= 192.168.10.50 
passiveip= 192.168.10.51 
testip= 192.168.10.1 
ident= if-1 
checkonlyidle= yes 


The [network] section describes the general failover parameters:

[network] 
checktime= 10 
timeout= 2 
activeretries= 3 
channelretries= 3 
synctime= 5 
fastdown= yes


The [sendconfig] section specifies the network interface over which the two servers synchronize the cluster configuration:

[sendconfig] 
device= eth0


Once again, here is the failover algorithm as given in the documentation:

The algorithm on the active server is as follows. Every checktime seconds, the health of each interface listed in the configuration is checked. If the checkonlyidle parameter is set to yes, the incoming and outgoing traffic passing through the interface is analyzed. If the difference in the number of packets between the beginning and the end of the interval is positive, the interface is considered to be functioning normally and its failure counter is reset. If no packets were sent or received during the interval, an additional verification mechanism kicks in, which consists in sending pings to the nearest routers. If checkonlyidle is set to no, the additional check is used instead of the main one, that is, packets are sent to the testip addresses every checktime seconds. Responses are then awaited for the timeout period. If no response arrives from any of the testip addresses on some interface, its failure counter is incremented. If the failure counter on at least one interface is non-zero, new packets are immediately sent to all testip addresses and responses are again awaited for timeout. If, during these repeated sends, a response arrives on an interface whose failure counter is non-zero, its counter is reset. If after any send the failure counters on all interfaces become zero, the algorithm returns to the main loop and waits for checktime again, and so on. If, after a certain number of repeated sends, the failure counter of at least one interface reaches channelretries, a complete interface failure is declared and a system reboot begins. Thus, the maximum time before an interface failure is detected is checktime + (timeout * channelretries).
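To make this description easier to follow, here is a rough model of the active-server loop in Python. It is only a sketch of the algorithm as described above, not Infotex code: packet_delta, ping and reboot_system are hypothetical placeholders, and the parameter values are taken from the sample config.

import time

CHECKTIME = 10        # [network] checktime
TIMEOUT = 2           # [network] timeout
CHANNELRETRIES = 3    # [network] channelretries

# one entry per [channel] section
interfaces = {
    "eth1": {"testip": "192.168.10.1", "checkonlyidle": True, "failures": 0},
}

def packet_delta(dev):
    """Placeholder: packets that passed through dev since the previous check."""
    return 0

def ping(ip, timeout):
    """Placeholder: probe the test address and wait up to timeout for a reply."""
    return False

def reboot_system():
    """Placeholder: the real cluster reboots the whole node at this point."""
    raise SystemExit("complete interface failure -> reboot")

while True:
    time.sleep(CHECKTIME)
    for dev, st in interfaces.items():
        # with checkonlyidle=yes the traffic counters are checked first,
        # and pings are sent only if the interface was idle
        if st["checkonlyidle"] and packet_delta(dev) > 0:
            st["failures"] = 0
            continue
        if ping(st["testip"], TIMEOUT):
            st["failures"] = 0
        else:
            st["failures"] += 1
    # while any counter is non-zero, probe all testip again without waiting
    while any(st["failures"] for st in interfaces.values()):
        for st in interfaces.values():
            if ping(st["testip"], TIMEOUT):
                st["failures"] = 0
            else:
                st["failures"] += 1
            if st["failures"] >= CHANNELRETRIES:
                reboot_system()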

On the passive server the algorithm is slightly different. Once every checktime seconds, the entries for all activeip addresses are deleted from the system ARP table. Then UDP requests are sent from all interfaces to the activeip addresses; as a result, the system first sends an ARP request and transmits the UDP request only if a response to it is received. After the timeout interval has elapsed, the presence of an ARP entry for each activeip in the system ARP table is checked; its presence indicates that the corresponding interface on the active server is working. If no response was received on any interface, the failure counter (there is a single one for all interfaces) is incremented. If a response was received on at least one interface, the failure counter is reset to zero. If the failure counter reaches activeretries, the server switches to active mode. Thus, the maximum time before the switchover is checktime + (timeout * activeretries).
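The passive side can be sketched in the same spirit. Again, this is only an illustration of the algorithm described above; flush_arp_entry, send_udp_probe, arp_entry_exists and become_active are hypothetical helpers, not a real ViPNet API.

import time

CHECKTIME = 10
TIMEOUT = 2
ACTIVERETRIES = 3

ACTIVE_IPS = ["192.168.10.50"]      # activeip of every backed-up interface

def flush_arp_entry(ip):
    """Placeholder: delete the ARP entry for ip from the system ARP table."""

def send_udp_probe(ip):
    """Placeholder: send a UDP packet to ip; the kernel issues an ARP request first."""

def arp_entry_exists(ip):
    """Placeholder: check whether an ARP entry for ip has reappeared."""
    return False

def become_active():
    raise SystemExit("active server looks dead -> switching to active mode")

failures = 0                         # a single counter shared by all interfaces
while True:
    time.sleep(CHECKTIME)
    for ip in ACTIVE_IPS:
        flush_arp_entry(ip)
        send_udp_probe(ip)
    time.sleep(TIMEOUT)              # give the active server time to answer
    if any(arp_entry_exists(ip) for ip in ACTIVE_IPS):
        failures = 0                 # at least one interface of the peer is alive
    else:
        failures += 1
        if failures >= ACTIVERETRIES:
            become_active()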

The total system downtime during a failure may be slightly longer than checktime * 2 + timeout * (channelretries + activeretries). This is because, after the failed server starts rebooting, the system does not bring its interfaces down immediately, but only after a while, once the other subsystems have stopped. So, for example, if two interfaces are monitored and only one of them fails, the address of the second interface will remain reachable for some time, during which the passive server will keep receiving responses from it. Typically the time from the start of the reboot to the interfaces going down does not exceed 30 seconds, but it can depend heavily on the speed of the computer and the number of services running on it.
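To put numbers on this: with the sample values above (checktime = 10, timeout = 2, channelretries = 3, activeretries = 3), the active server needs up to 10 + 2 * 3 = 16 seconds to detect an interface failure, the passive server needs up to 10 + 2 * 3 = 16 seconds to decide to take over, and the worst-case switchover comes to about 10 * 2 + 2 * (3 + 3) = 32 seconds, plus up to roughly 30 more seconds while the failed server shuts its interfaces down during the reboot.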


At first glance everything looks fine: as soon as there is a problem with the active server, it reboots and the passive one takes its place. What do we get in practice?

  • You can't just connect a protected resource (for example, a 1C server) to the failover through an unmanaged switch. More precisely, of course you can, but then you will have to specify the address of the protected resource itself as the test IP (see the example after this list). As a result, when the 1C server is rebooted, say for scheduled maintenance or a software update, the failover goes down with it. This is not a problem if protecting access to that resource is the only task, but in most cases it is not.
  • The fault tolerance of the failover depends on the fault tolerance of every one of the test addresses, and it shrinks with each additional network interface. In my practice there was a case when a data-center employee accidentally unplugged a switch that served as a test node; as a result the cluster went into a continuous reboot cycle, local machines with ViPNet clients could not connect to the domain, and their work was paralyzed until we were let into the data center.
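For illustration, in the unmanaged-switch scenario from the first point, the [channel] section ends up looking something like this (the addresses are made up; the point is that testip has to be the protected resource itself, since there is no router behind the switch to ping):

[channel] 
device= eth1 
activeip= 192.168.20.10 
passiveip= 192.168.20.11 
testip= 192.168.20.5 
ident= if-1 
checkonlyidle= yes 

Here 192.168.20.5 would be the 1C server itself, so from the failover's point of view a planned reboot of that server looks exactly like an interface failure.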


It turns out that the fault tolerance of the failover is achievable only in a spherical vacuum, where the entire network infrastructure works without failures. Someone may say this is a necessary trade-off: if something happens to one of the servers, for example a network interface dies, it will reboot, its functionality will be restored, and the failover algorithm will prove its usefulness. In practice, however, there was a case when one of the failover's interfaces really did die, and it began rebooting according to the scheme described above, while the problem did not go away after the reboot. To get rid of the glitch I had to power the unit off manually, so in this case the fault tolerance existed only on paper.

All of this could be fixed if the algorithm included one more condition: the active server should not reboot if the passive one is down. Yet this simple condition, which would provide real fault tolerance, was never added by the developers. I can only assume this is because it would require changes to the kernel and its re-certification.
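In terms of the active-server sketch above, the missing check would amount to a couple of lines; passive_peer_alive here is a purely hypothetical probe (for example, over the sendconfig interface), not something the product actually provides.

def passive_peer_alive():
    """Placeholder: check whether the passive node answers on the sync link."""
    return False

def handle_interface_failure():
    # reboot only if there is actually a passive node ready to take over;
    # otherwise keep serving on the interfaces that are still up
    if passive_peer_alive():
        reboot_system()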

With a healthy network environment and server hardware, uptime was almost continuous. One of the first failover units I installed has been running for three years, and in all that time the hot standby scheme kicked in only during software upgrades or general maintenance work in the data center. In practice, the real benefit of such a scheme appears only in the event of a hardware failure on one of the cluster nodes, which in my experience has not happened yet.

There was some hope for ViPNet Cluster for Windows, but Infotex, unfortunately, never brought it to fruition, which is a pity: its redundancy scheme looked very promising.

In general, my advice is this: if you need a fault-tolerant crypto gateway for real and not just to tick a box, it is better not to bother with the Failover and to use the ordinary ViPNet Coordinator Linux instead. It is quite reliable on its own, especially if you don't touch it ;)
