ToFoIn - Toggle Failover of Internet or switching between two external channels in FreeBSD

Abstract


One way to make an Internet connection more reliable is to use two external uplinks and switch between them automatically. This article briefly reviews several ways of solving this problem and proposes a solution based on bash scripts running on FreeBSD. It gives instructions for building the complete system and the source code of the required scripts.

Introduction


To make Internet access more reliable, corporate setups use two or more external network channels. Using them simultaneously (for example, with load balancing) or alternately (switching from one to the other) is not entirely trivial, but the problem has already been solved in many ways. Here are some of them:

  1. SOHO-class routers with two uplinks to the external network (from here on, “external network” means the Internet and “internal network” means the company's local network);
  2. Layer 3 switches, usually carrier-class, which have a large number of tunable parameters and can, among other things, solve this problem;
  3. Numerous home-grown scripts in various languages for Unix- and Linux-like systems, most often of dubious quality;
  4. Channel balancing with NAT rules;
  5. Balancing or switching by means of a proxy server.

Each of the above approaches has its advantages and disadvantages. Option one, SOHO routers:

Advantages:
  • low price;
  • easy to install and configure.

Disadvantages:
  • insufficient reliability for the corporate segment due to lack of redundancy;
  • inflexible configuration and limited functionality. (Such devices typically solve a very narrow range of problems; stepping outside that range is either impossible or fraught with difficulties.)

Second option, Layer 3 switches:

Advantages:
  • reliability;
  • flexibility of customization;

Disadvantages:
  • price (such devices typically cost upwards of 50 thousand);
  • difficult to configure (a professional-grade device demands matching expertise).

Third option, switching scripts:

Advantages:
  • price (free of charge, not counting the working time to configure).

Disadvantages:
  • unpredictable reliability (the skill level of the authors is often unknown, so the quality of a script is hard to judge without studying it in detail);
  • inflexible and hard to customize (such scripts are usually written for specific conditions, and it is sometimes easier to write your own than to understand someone else's, which explains why there are so many of them).

Fourth option, balancing with NAT rules:

Advantages:
  • price (free of charge, not counting the working time to configure);
  • relative ease of setup.

Disadvantages:
  • the channels must have roughly equal throughput.

It is also doubtful how well the connection performs when one of the external channels goes down.

And finally, the fifth option, the use of a proxy server:

Advantages:
  • price (free of charge, not counting the working time to configure);
  • flexibility of customization.

Disadvantages:
  • slower data flow;
  • additional configuration required on user machines;
  • difficult to configure for unusual situations.

When development began several years ago, I chose to write my own script, for the following reasons. First, price: by this criterion the Layer 3 switches (option 2) drop out, since enterprise-level solutions are a luxury for a local network of 10 machines. Unfortunately, I did not know about the devices from option 1 when the decision was made, and today they would fail the reliability requirement anyway. Option 4 does not fit because the available Internet channels differ in speed by a factor of tens, so such a scheme is, in my opinion, not justified; there are also doubts about the quality of the connection to the external network when one of the channels goes down. Option 5 was rejected because it slows down traffic and because I wanted a solution that does not depend on optional components. That left option 3: after studying other people's scripts and trying to adapt them, I abandoned that idea and decided to write my own.

Over time, a backup “router”, also running FreeBSD, was installed next to the main one, and the DNS, DHCP, NAT and ipfw settings were changed more than once. Everything gradually evolved and improved except the script itself, so in the end I decided to rewrite it around the following principles: modularity, a single settings file, flexible and simple configuration on any Unix-like system, and ease of adding new modules.

Goals and objectives


What is the ultimate goal of this project? To create a universal, easily scalable software package built on a client-server model (agent-server would be a more accurate name) that detects problems in external and internal connections and automatically switches to working ones. Here the agent collects information about the current state of the external and internal connections, and the server is the part of the program that decides which connection has priority and, if necessary, issues the commands to switch to it. The machine acting as the server does not have to run an agent.

So:

  • There are n “routers”, each with m external channels, and all n “routers” are arranged in a strict hierarchy.
  • An agent runs independently on each machine. Its task is to collect the results of testing the external channels and deliver them to the server, that is, to the “router” with the highest priority at the moment, and to check whether that server is available. (The server part is assumed to be always installed alongside the agent, whereas an agent is not required to perform server functions.)
  • The server, in turn, analyzes the received data and determines which channel and which “router” currently has priority. This is why the article covers the DHCP server configuration: changing the gateway means changing the dhcpd settings.
  • If the server fails, a procedure starts on all agents that selects a new server from among them according to predefined priorities and hands over to it the collection of information about the external connections and the switching decisions. Once the original server is back in working order, the reverse happens: control switches back to it automatically.

The algorithm could be described at much greater length; only its general outline is given above. I do not dispute that n and m (from the example above) exceed 2 only very rarely, but such cases do occur, so why not make the tool universal?

While writing the scripts I ran into some limitations of the bash language, so for now a more elegant solution to the problem above is only a vague prospect. What exists so far is a solution for a stand-alone “router”, designed with further extension in mind.

Solution


For a number of reasons, I decided to use an old machine (Pentium III, 512 MB of RAM) running FreeBSD, currently version 9.2, as the core of the local network and as the gateway to the Internet. Later, to increase reliability, a second, similar machine was installed to work in tandem with the first. Over the past two years there have been exactly two breakdowns: the first time a power supply failed, the second time one of the network cards. In both cases the local network kept working without interruption, because the backup machine took over. So using old hardware in this scheme has practically no effect on the stability of the network. There are also two external channels from different Internet providers. The general layout is shown in the diagram below, where:

Blue and red arrows are external communication channels.
Black arrows are internal communication channels.

The system looks like this:

[image: general network diagram]

The switch separates the providers' traffic using VLANs. In this particular case it is a Cisco SF300-08.
In more detail, here is what runs on the machines themselves:
Firewall - IPFW
NAT - in-kernel NAT provided by IPFW
DNS - BIND 9 (the latest version available for FreeBSD)
DHCP - isc-dhcpd
ToFoIn - the subject of this article.

The article does not go into the details of configuring DNS and DHCP, since the reader is assumed to be already familiar with such systems. There is also plenty of material on the subject, and some links are given at the end of the article. The technical part contains my current firewall and NAT rules for ipfw in full, with almost no comments (again, plenty of material is available on this topic), as well as the kernel options and rc.conf.

Now let us look in detail at how the script works. First, the modules and their functions:

Daemon - as the name implies, the main process; it starts the testing and switching modules on a timer.
Tester - tests the connection through the external channels using the ping command.
Judge - based on the test results, determines which external channel is working and whether a switch is necessary.
Logger - responsible for event logging. It ensures that information about events is not duplicated, which makes the log easier to read.
Watchdog - runs on a schedule from crontab. It detects “hung” modules and, where possible, tries to fix the problems that have arisen.
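
The idea behind Logger, not writing the same event twice in a row, can be illustrated with a short sketch. This is not the actual ToFoIn module: the function name log_event, the log path and the line format are assumptions made for the example.

```sh
#!/bin/sh
# Hypothetical log location; ToFoIn itself writes to its own log file.
LOGFILE="${TMPDIR:-/tmp}/tofoin_example.$$.log"

# log_event MESSAGE: append a timestamped line unless the previous
# line already carries exactly the same message.
log_event() {
    # Fields 1 and 2 are the date and time; the message starts at field 3.
    last=$(tail -n 1 "$LOGFILE" 2>/dev/null | cut -d' ' -f3-)
    [ "$last" = "$1" ] && return 0
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1" >> "$LOGFILE"
}

log_event "channel 0 is down"
log_event "channel 0 is down"      # suppressed: same as the previous entry
log_event "switched to channel 1"
```

With this approach a channel that stays down for an hour produces one log line instead of one line per check.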

In addition to the scripts themselves, several other important files are worth mentioning:

Tofoin.conf - the single settings file.
Tofoin.log - the single event log file.
Result_<internal channel number> - a working file in which the test results accumulate.

A number of other working files are also used, and, of course, each script creates a pid file at startup and deletes it on shutdown.
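
A pid file of this kind is usually handled along the following lines. The sketch is only an illustration under assumptions: the path in PIDFILE and the function names are hypothetical and are not taken from the ToFoIn sources.

```sh
#!/bin/sh
# Hypothetical pid file location for this example.
PIDFILE="${TMPDIR:-/tmp}/tofoin_example.$$.pid"

# create_pidfile: record our pid, refusing to start if a live
# process already owns the pid file.
create_pidfile() {
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "already running" >&2
        return 1
    fi
    echo $$ > "$PIDFILE"
}

# remove_pidfile: clean up on shutdown.
remove_pidfile() {
    rm -f "$PIDFILE"
}

# Create the file and make sure it is removed when the script exits.
create_pidfile && trap remove_pidfile EXIT
```

A leftover pid file whose process no longer exists is exactly the kind of “hang” evidence a watchdog can look for.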

The operation of Logger and Watchdog will not be described in detail; those who are interested can study them on their own. Let us look more closely at the main modules: Daemon, Tester and Judge. Daemon starts Tester and Judge on timers whose intervals are stored in the configuration file. It works like this: at startup the tests are launched and their timestamp is remembered; then, every n seconds (n is the sensitivity setting), Daemon checks whether it is time to run the next test or to evaluate the current state of the connection. In other words, Daemon remembers the timestamps of the last test and the last evaluation and compares them with the current time. If the difference exceeds the interval given in the configuration file, a test or an evaluation, respectively, is started and the corresponding timestamp is replaced with the current one. And so on.
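
The timer logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual ToFoIn code: the names is_due, daemon_tick, TEST_INTERVAL, JUDGE_INTERVAL and SENSITIVITY are hypothetical, and the values are made up.

```sh
#!/bin/sh
# Hypothetical settings; in ToFoIn the real values live in the configuration file.
TEST_INTERVAL=30     # seconds between Tester runs (assumed)
JUDGE_INTERVAL=10    # seconds between Judge runs (assumed)
SENSITIVITY=1        # how often the loop wakes up, in seconds (assumed)

# is_due LAST NOW INTERVAL: succeeds when more than INTERVAL seconds
# have passed between the timestamps LAST and NOW.
is_due() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

# daemon_tick NOW: one loop iteration. Prints the actions that are due
# and replaces the remembered timestamps of those actions with NOW.
daemon_tick() {
    now=$1
    if is_due "$last_test" "$now" "$TEST_INTERVAL"; then
        echo "run tester"
        last_test=$now
    fi
    if is_due "$last_judge" "$now" "$JUDGE_INTERVAL"; then
        echo "run judge"
        last_judge=$now
    fi
}

# The main loop would look like this (left commented out on purpose):
# last_test=$(date +%s); last_judge=$last_test
# while :; do daemon_tick "$(date +%s)"; sleep "$SENSITIVITY"; done
```

Keeping the decision in a function that takes the current time as an argument makes the timing easy to check without waiting for real seconds to pass.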

Tester is the simplest module so far. It takes two arguments:
./tester.sh a b

where a is the routing table number and b is the task (normally b = 10, which means a full test with the result recorded).

The Tester module also has trial modes: b = 0 pings only the first target (from the configuration file), b = 1 pings only the second target (from the configuration file), and b = <destination>, for example b = habrahabr.ru, pings an arbitrary target. In the latter case, for routing table 0, the command looks like this:
./tester.sh 0 habrahabr.ru
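
How the second argument might be turned into ping commands can be sketched as follows. The functions targets_for and ping_cmd, the target addresses and the ping options are assumptions for the example, as is the guess that the full mode pings both configured targets; the real tester.sh may differ. The commands are only printed, not executed.

```sh
#!/bin/sh
# Hypothetical targets; in ToFoIn they come from the configuration file.
TARGET1="192.0.2.1"
TARGET2="192.0.2.2"

# targets_for MODE: print the hosts that the given mode would ping.
targets_for() {
    case "$1" in
        10) echo "$TARGET1 $TARGET2" ;;  # full test (assumed: both targets)
        0)  echo "$TARGET1" ;;           # first target only
        1)  echo "$TARGET2" ;;           # second target only
        *)  echo "$1" ;;                 # arbitrary host given by the user
    esac
}

# ping_cmd FIB HOST: print the command that pings HOST through routing table FIB.
ping_cmd() {
    echo "setfib $1 ping -c 3 -q $2"
}

# Show what "./tester.sh 0 habrahabr.ru" would run in this sketch.
for host in $(targets_for habrahabr.ru); do
    ping_cmd 0 "$host"
done
```

The setfib prefix is what ties a test to a particular external channel: the same ping is routed through whichever routing table is named.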

The main component of the program is, obviously, the Judge module. In general terms, its algorithm is as follows:
  1. The current external channel is determined from the current ipfw rules.
  2. A loop builds an array of up-to-date state data for the external channels.
  3. A second loop determines the preferred external channel.
  4. Next, a function decides whether the channel needs to be switched and, if so, calls the switching function, passing it the internal number of the channel to switch to. (The return to the main channel does not happen immediately. This prevents flapping back and forth when the main channel is unstable: the switch back takes place only once the main external channel has begun to work stably.)
  5. Finally, if a switch is needed, the switching function substitutes the required ipfw settings, restarts ipfw, and restarts BIND with the required routing table.

Of course, all key actions are recorded in the event log; in an emergency the cause of the error is recorded as well and Watchdog is called.
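
The delayed return to the main channel from step 4 can be illustrated with a small decision function. This is a sketch under assumptions, not the Judge module itself: channel 0 is taken to be the main one and channel 1 the backup, and the names decide and STABLE_NEEDED, as well as the threshold value, are hypothetical.

```sh
#!/bin/sh
# Consecutive successful checks of the main channel required before
# switching back to it (assumed value).
STABLE_NEEDED=3

# decide CURRENT MAIN_OK BACKUP_OK MAIN_STREAK
#   CURRENT      active channel: 0 = main, 1 = backup
#   MAIN_OK      1 if the main channel passed the last test, else 0
#   BACKUP_OK    1 if the backup channel passed the last test, else 0
#   MAIN_STREAK  number of consecutive successful tests of the main channel
# Prints the channel that should be active after this check.
decide() {
    current=$1 main_ok=$2 backup_ok=$3 streak=$4
    if [ "$current" -eq 0 ]; then
        # On main: leave it only if it is down and the backup is up.
        if [ "$main_ok" -eq 0 ] && [ "$backup_ok" -eq 1 ]; then
            echo 1
        else
            echo 0
        fi
    else
        # On backup: return once main has been stable long enough,
        # or at once if the backup itself has failed while main works.
        if [ "$main_ok" -eq 1 ] &&
           { [ "$streak" -ge "$STABLE_NEEDED" ] || [ "$backup_ok" -eq 0 ]; }; then
            echo 0
        else
            echo 1
        fi
    fi
}
```

For example, decide 1 1 1 1 prints 1 (stay on the backup, the main channel has only just come back), while decide 1 1 1 3 prints 0.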

So, the basic principles of operation have been covered; now let us see how all this is implemented in practice.

Technical part


Equipment

The equipment has already been mentioned; in this section I will describe it in more detail. To run DNS, DHCP, NAT and IPFW in my case (an internal network of about 30 machines), a Pentium III-based Celeron, 512 MB of RAM, a 40 GB HDD and a 350 W PSU with the connectors required by the motherboard are enough. Two additional PCI network cards are also installed. The two routers are roughly equal in performance.

One may object that in some respects this is even more capacity than needed; however, these machines were not purchased specially but assembled from what was left over after the user machines were upgraded. Most likely, the minimum required set of services would run on much weaker hardware. It would also be wise to play it safe and set up a mirrored RAID. Unfortunately, I did not think of this in advance, and doing it now involves certain difficulties, but that is a different story altogether.

In my opinion, this is a perfectly worthy use for old working hardware, which otherwise often either gathers dust in a storeroom or is thrown out or given away.

Preliminary configuration

For this system to work, some preliminary configuration is, of course, required.

First, configure the primary and secondary DNS servers. If you have only one “router”, a primary DNS server alone is enough to begin with. As already mentioned, BIND 9 is used here. Some links on tuning it are given at the end of the article. The book “DNS and BIND” by Cricket Liu and Paul Albitz is a great help here.

Second, you need to configure a DHCP failover peer. If you have only one “router”, the usual settings for a standalone DHCP server are enough. Again, links are given at the end of the article. In case the linked article about setting up a DHCP failover peer is unavailable (which has been the situation for the last few months), I provide here a script for synchronizing the settings, as well as the key points of the setup.
Configure failover dhcpd
In order to configure a DHCP failover peer you need to:
  1. In /usr/local/etc, create the main configuration file dhcpd.conf, which is referenced in rc.conf. Mine looks like this:
    /usr/local/etc/dhcpd.conf
    
    # dhcpd.conf
    #
    # option definitions common to all supported networks...
    option domain-name "companyname.local";
    option domain-name-servers 10.0.0.2, 10.0.0.1;
    option ntp-servers 10.0.0.2, 10.0.0.1;
    option log-servers 10.0.0.1;
    update-static-leases on;
    # 1 hour
    default-lease-time 3600;
    # 1 day
    max-lease-time 86400;
    # Use this to enable / disable dynamic dns updates globally.
    ddns-update-style interim;
    # If this DHCP server is the official DHCP server for the local
    # network, the authoritative directive should be uncommented.
    authoritative;
    # Use this to send dhcp log messages to a different log file (you also
    # have to hack syslog.conf to complete the redirection).
    log-facility local7;
    set vendorclass = option vendor-class-identifier;
    # DNS key
    include "/usr/local/etc/dhcpd/dns.key";
    zone companyname.local.{
    	primary 127.0.0.1;
    	key DHCP_UPDATER;
    }
    zone 0.0.10.in-addr.arpa.{
    	primary 127.0.0.1;
    	key DHCP_UPDATER;
    }
    # DHCP Failover, Primary
    include "/usr/local/etc/dhcpd/dhcpd.conf_primary";
    # Subnet declaration
    include "/usr/local/etc/dhcpd/dhcpd.subnet";
    # Static IP addresses
    include "/usr/local/etc/dhcpd/dhcpd.static";
    


    Here dns.key is the key used to communicate with the DNS server; these matters are covered in detail in articles on configuring DNS together with DHCP.
  2. Create the directory /usr/local/etc/dhcpd and, in it, the following files with approximately this content:
    /usr/local/etc/dhcpd/dhcpd.conf_primary
    
    ##########################
    # DHCP Failover, Primary #
    ##########################
    failover peer "dhcpdpeer" {              # Failover configuration
    	primary;                         # I am the primary
            address 10.0.0.1;                # My IP address
            port 1111;
            peer address 10.0.0.2;           # Peer's IP address
            peer port 2222;
            max-response-delay 60;
            max-unacked-updates 10;
            mclt 3600;
            split 128;                       # Leave this at 128, only defined on Primary
            load balance max seconds 3;
    }
    

    /usr/local/etc/dhcpd/dhcpd.subnet
    
    subnet 10.0.0.0 netmask 255.255.255.0 {
    	pool {
    		failover peer "dhcpdpeer";
    		range 10.0.0.15 10.0.0.240;
    	}
    	option subnet-mask 255.255.255.0;
    	option routers 10.0.0.2, 10.0.0.1;
    	option broadcast-address 10.0.0.255;
    	option netbios-name-servers 10.0.0.3;
    	option netbios-dd-server 10.0.0.3;
    	option netbios-node-type 8;
    }
    

    Here the NetBIOS name server is a Windows server running the WINS service; Samba can also play this role.
    /usr/local/etc/dhcpd/dhcpd.static
    
    host SERVER3 {
      hardware ethernet 11:11:11:11:11:11;
      fixed-address 10.0.0.3;
    }  	
    host SERVER4 {
      hardware ethernet 22:22:22:22:22:22;
      fixed-address 10.0.0.4;
    }
    

    This file, as you might guess, holds the static addresses.
  3. On the second “router”, the files look like this:
    /usr/local/etc/dhcpd.conf
    
    # dhcpd.conf
    #
    # option definitions common to all supported networks...
    option domain-name "companyname.local";
    option domain-name-servers 10.0.0.2, 10.0.0.1;
    option ntp-servers 10.0.0.2, 10.0.0.1;
    option log-servers 10.0.0.1;
    update-static-leases on;
    # 1 hour
    default-lease-time 3600;
    # 1 day
    max-lease-time 86400;
    # Use this to enable / disable dynamic dns updates globally.
    ddns-update-style interim;
    # If this DHCP server is the official DHCP server for the local
    # network, the authoritative directive should be uncommented.
    authoritative;
    # Use this to send dhcp log messages to a different log file (you also
    # have to hack syslog.conf to complete the redirection).
    log-facility local7;
    set vendorclass = option vendor-class-identifier;
    # DNS key
    include "/usr/local/etc/dhcpd/dns.key";
    zone companyname.local.{
    	secondary 127.0.0.1;
    	key DHCP_UPDATER;
    }
    zone 0.0.10.in-addr.arpa.{
    	secondary 127.0.0.1;
    	key DHCP_UPDATER;
    }
    # DHCP Failover, Secondary
    include "/usr/local/etc/dhcpd/dhcpd.conf_secondary";
    # Subnet declaration
    include "/usr/local/etc/dhcpd/dhcpd.subnet.DONOTEDIT";
    # Static IP addresses
    include "/usr/local/etc/dhcpd/dhcpd.static.DONOTEDIT";
    

    /usr/local/etc/dhcpd/dhcpd.conf_secondary
    
    ###########################
    # DHCP Failover,Secondary #
    ###########################
    failover peer "dhcpdpeer" {              # Failover configuration
    	secondary;                       # I am the secondary
    	address 10.0.0.2;                # My IP address
    	port 2222;
    	peer address 10.0.0.1;           # Peer's IP address
    	peer port 1111;
    	max-response-delay 60;
    	max-unacked-updates 10;
    	mclt 3600;
    	load balance max seconds 3;
    }
    

    The remaining files can be copied from the first “router”, changing only their names, or you can finish the configuration and the files will be transferred automatically when isc-dhcpd is restarted (how this works is described below).
  4. Create an executable file with the following contents:
    /usr/local/bin/dhcpd-sync
    
    #!/bin/sh
    # backup generation
    date=`date -v-1d '+%Y%m%d-%H%M%S'`
    month=`date '+%m%Y'`
    sudo -u dhcp-updater cp -f /usr/local/etc/dhcpd/dhcpd.subnet /var/dhcp-backup/dhcpd.subnet.$date
    sudo -u dhcp-updater bzip2 -f -k -z /var/dhcp-backup/dhcpd.subnet.$date
    sudo -u dhcp-updater tar -r -f /var/dhcp-backup/dhcpd.subnet.$month.tar -C /var/dhcp-backup dhcpd.subnet.$date.bz2
    sudo -u dhcp-updater cp -f /usr/local/etc/dhcpd/dhcpd.static /var/dhcp-backup/dhcpd.static.$date
    sudo -u dhcp-updater bzip2 -f -k -z /var/dhcp-backup/dhcpd.static.$date
    sudo -u dhcp-updater tar -r -f /var/dhcp-backup/dhcpd.static.$month.tar -C /var/dhcp-backup dhcpd.static.$date.bz2
    sudo -u dhcp-updater scp -P 22 -q /var/dhcp-backup/dhcpd.subnet.$date.bz2 dhcp-updater@10.0.0.2:/var/dhcp-backup
    sudo -u dhcp-updater ssh -p 22 10.0.0.2 tar -r -f /var/dhcp-backup/dhcpd.subnet.$month.tar -C /var/dhcp-backup dhcpd.subnet.$date.bz2
    sudo -u dhcp-updater scp -P 22 -q /var/dhcp-backup/dhcpd.static.$date.bz2 dhcp-updater@10.0.0.2:/var/dhcp-backup
    sudo -u dhcp-updater ssh -p 22 10.0.0.2 tar -r -f /var/dhcp-backup/dhcpd.static.$month.tar -C /var/dhcp-backup dhcpd.static.$date.bz2
    sudo -u dhcp-updater ssh -p 22 10.0.0.2 rm /var/dhcp-backup/dhcpd.subnet.$date.bz2
    sudo -u dhcp-updater ssh -p 22 10.0.0.2 rm /var/dhcp-backup/dhcpd.static.$date.bz2
    sudo -u dhcp-updater rm /var/dhcp-backup/dhcpd.subnet.$date
    sudo -u dhcp-updater rm /var/dhcp-backup/dhcpd.static.$date
    sudo -u dhcp-updater rm /var/dhcp-backup/dhcpd.subnet.$date.bz2
    sudo -u dhcp-updater rm /var/dhcp-backup/dhcpd.static.$date.bz2
    # sync and restart secondary DHCP
    sudo -u dhcp-updater scp -P 22 -q /usr/local/etc/dhcpd/dhcpd.subnet dhcp-updater@10.0.0.2:/usr/local/etc/dhcpd/dhcpd.subnet.DONOTEDIT
    sudo -u dhcp-updater scp -P 22 -q /usr/local/etc/dhcpd/dhcpd.static dhcp-updater@10.0.0.2:/usr/local/etc/dhcpd/dhcpd.static.DONOTEDIT
    sudo -u dhcp-updater ssh -p 22 10.0.0.2 sudo /usr/local/etc/rc.d/isc-dhcpd restart
    
  5. Create a dhcp-updater user with the appropriate rights on both servers, add it to the sudo configuration, set up key-based ssh login from the primary to the secondary “router”, and remove the password. You may also need to create the /var/dhcp-backup/ directory on both machines.
  6. Modify a piece of the /usr/local/etc/rc.d/isc-dhcpd file as follows:
    Before:
    
    dhcpd_checkconfig ()
    {
            local rc_flags_mod
            setup_flags
    	rc_flags_mod="$rc_flags"
            # Eliminate '-q' flag if it is present
    	case "$rc_flags" in
    	*-q*)	rc_flags_mod=`echo "${rc_flags}" | sed -Ee 's/(^-q | -q | -q$)//'` ;;
    	esac
            if ! ${command} -t -q ${rc_flags_mod}; then
                    err 1 "`${command} -t ${rc_flags_mod}` Configuration file sanity check failed"
            fi
    }
    

    After:
    
    dhcpd_checkconfig ()
    {
            local rc_flags_mod
            setup_flags
    	rc_flags_mod="$rc_flags"
            # Eliminate '-q' flag if it is present
    	case "$rc_flags" in
    	*-q*)	rc_flags_mod=`echo "${rc_flags}" | sed -Ee 's/(^-q | -q | -q$)//'` ;;
    	esac
            if ! ${command} -t -q ${rc_flags_mod}; then
                    err 1 "`${command} -t ${rc_flags_mod}` Configuration file sanity check failed"
    	else sh /usr/local/bin/dhcpd-sync	
            fi
    }
    

  7. If everything is configured correctly, then when the DHCP server is restarted on the main machine the current configuration is archived and synchronized with the second server, and the restart happens on both machines.
  8. It is also useful to add the following job to crontab:
    0	0	*	*	*	root	/usr/local/etc/rc.d/isc-dhcpd restart
  9. This completes the failover dhcpd configuration.


Third, for routing tables other than table 0 to become available, and for in-kernel NAT and ipfw to work, you need to rebuild the kernel with the following options (variations are of course possible; see, again, the links at the end):

options		IPFIREWALL		
options		IPFIREWALL_VERBOSE
options         IPFIREWALL_VERBOSE_LIMIT=50
options         IPFIREWALL_NAT
options		LIBALIAS
options		DUMMYNET		
options		HZ=1000			
options		ROUTETABLES=2

For the second routing table (number “1”, since the first one is number “0”) to work after a reboot, create a file with the following contents in rc.d (mine is in /usr/local/etc/rc.d/):
/usr/local/etc/rc.d/setfib1

#!/bin/sh
#
# PROVIDE: SETFIB1
# REQUIRE: NETWORKING
# BEFORE: DAEMON
#
# Add the following lines to /etc/rc.conf to set up fib 1 at startup
# setfib1_enable (bool): Set to "NO" by default.
#                Set it to "YES" to enable setfib1
# setfib1_defaultrouter (str): Set to "" by default
#       Set it to ip address of default gateway for use in fib 1
. /etc/rc.subr
name="setfib1"
rcvar=`set_rcvar`
load_rc_config $name
[ -z "$setfib1_enable" ] && setfib1_enable="NO"
[ -z "$setfib1_defaultrouter" ] && setfib1_defaultrouter=""
start_cmd="${name}_start"
stop_cmd="${name}_stop"
setfib1_start()
{
	if [ -n "${setfib1_defaultrouter}" ]
	then
		setfib 1 route add -net default ${setfib1_defaultrouter}
	else
		echo "Can not set default route for fib 1 - setfib1_defaultrouter is not assigned in rc.conf!"
	fi
}
setfib1_stop()
{
	setfib 1 route del -net default
}
run_rc_command "$1"

And also add several lines to rc.conf, for example, for the primary “router”:

setfib1_enable="YES"
setfib1_defaultrouter="2.2.2.1"

In essence, this boot script does nothing more than add a default route to the second table. If necessary, you can use up to 65536 routing tables (in FreeBSD 10) by copying the script above with minor changes and adding the corresponding parameters to rc.conf. (Of course, the kernel must first be built with that number of tables.)
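
Instead of copying the script once per table, the same idea can be expressed as a loop. The sketch below is an assumption-laden illustration rather than a tested rc.d script: the FIBS list, the setfib2_defaultrouter variable and the address 3.3.3.1 are hypothetical, and the route commands are printed instead of being executed.

```sh
#!/bin/sh
# Hypothetical rc.conf-style settings: one default gateway per extra table.
setfib1_defaultrouter="2.2.2.1"
setfib2_defaultrouter="3.3.3.1"
FIBS="1 2"

# fib_route_cmds: print the command that would add a default route
# to every routing table listed in FIBS.
fib_route_cmds() {
    for fib in $FIBS; do
        # Indirect lookup of setfib<N>_defaultrouter.
        eval "gw=\${setfib${fib}_defaultrouter}"
        if [ -n "$gw" ]; then
            echo "setfib $fib route add -net default $gw"
        else
            echo "fib $fib: no default router assigned" >&2
        fi
    done
}

fib_route_cmds
```

With two extra tables, as in this example, the kernel would have to be built with ROUTETABLES=3 rather than 2.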

My rc.conf on the main “router” is given below.

But first, a few comments:
eth0 is the physical interface of the main external channel.
eth1 is the physical interface of the backup external channel.
eth2 is the physical interface of the internal channel.
vlan1 is the interface of the main external channel.
vlan2 is the interface of the backup external channel.
vlan3 and vlan4 are reserved for future functionality; more about this at the end of the article.
10.0.0.1 is the address of the “router” in the internal network; the backup one will have, for example, 10.0.0.2.
1.1.1.2 and 1.1.1.1 are the IP address and default gateway of the main external channel.
2.2.2.2 and 2.2.2.1 are the IP address and default gateway of the backup external channel.
## ATTENTION! The interface names and IP addresses are examples only; in each case they will be different! ##
/etc/rc.conf

hostname="SERVER1.companyname.local"
keymap="ru.koi8-r"
font8x8="cp866-8x8"
font8x14="cp866-8x14"
font8x16="cp866-8x16"
scrnmap="koi8-r2cp866"
cursor="destructive"
ifconfig_eth0="up"
vlans_eth0="vlan1 vlan3"
create_args_vlan1="vlan 1"
create_args_vlan3="vlan 3"
ifconfig_eth1="up"
vlans_eth1="vlan2 vlan4"
create_args_vlan2="vlan 2"
create_args_vlan4="vlan 4"
ifconfig_eth2="inet 10.0.0.1 netmask 255.255.255.0"
ifconfig_vlan1="inet 1.1.1.2/24"
ifconfig_vlan3="inet 10.0.1.1/30"
ifconfig_vlan2="inet 2.2.2.2/24"
ifconfig_vlan4="inet 10.0.2.1/30"
defaultrouter="1.1.1.1"
setfib1_enable="YES"
setfib1_defaultrouter="2.2.2.1"
gateway_enable="YES"
sshd_enable="YES"
moused_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
hald_enable="YES"
dbus_enable="YES"
dumpdev="AUTO"
firewall_enable="YES"
firewall_logging="YES" 
firewall_script="/etc/firewall.sh"
named_enable="YES"
named_program="/usr/sbin/named"
named_flags="-u bind -c /etc/namedb/named.conf"
dhcpd_enable="YES"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_ifaces="eth2"

Below are the NAT and Firewall settings that work for me:

When working through the main external channel:
/etc/rules.firewall0

#!/bin/sh
# Delete all rules
/sbin/ipfw -q -f flush
/sbin/ipfw -q -f pipe flush
/sbin/ipfw -q -f queue flush
/sbin/ipfw -q -f nat 1 delete
/sbin/ipfw -q -f table all flush
# Parameters
ipfw="/sbin/ipfw -q add"
extM_if="vlan1"
extM_ip="1.1.1.2"
extS_if="vlan2"
extS_ip="2.2.2.2"
int_if="eth2"
int_ip="10.0.0.1"
lan_net="10.0.0.0/24"
odmin="10.0.0.111"
# Tables
# Table 1 - non-routable networks
/sbin/ipfw table 1 add 192.168.0.0/16
/sbin/ipfw table 1 add 172.16.0.0/12
/sbin/ipfw table 1 add 10.0.0.0/8
/sbin/ipfw table 1 add 127.0.0.0/8
/sbin/ipfw table 1 add 0.0.0.0/8
/sbin/ipfw table 1 add 169.254.0.0/16
/sbin/ipfw table 1 add 192.0.2.0/24
/sbin/ipfw table 1 add 204.152.64.0/23
/sbin/ipfw table 1 add 224.0.0.0/3
# Choose route table
$ipfw setfib 0 all from any to any via $int_if 
# Allow all traffic on loopback
$ipfw allow all from any to any via lo0
# Deny access to lo0 from outside
$ipfw deny log all from any to 127.0.0.0/8
# Deny outgoing packets from lo0
$ipfw deny log all from 127.0.0.0/8 to any
# Allow return traffic of established connections
$ipfw check-state
# Deny IPv6
$ipfw deny log ipv6 from any to any
# Antispoofing
$ipfw deny log all from any to any not antispoof in
# Block any delayed packets (fragments)
$ipfw deny all from any to any frag
########################################
# Internal interface, outgoing traffic #
########################################
# Allow all traffic from gateway to lan
$ipfw allow all from any to $lan_net out via $int_if
# Deny and log other
$ipfw deny log all from any to any out via $int_if
########################################
# Internal interface, incoming traffic #
########################################
# Deny all Netbios 
$ipfw deny tcp from any to any 81,137,138,139 in via $int_if
# Allow traffic on internal interface
# DHCP
$ipfw allow udp from any to me 67,68,1515,1516 in via $int_if
# Mail
$ipfw allow tcp from $lan_net to any 25,110,143,465,993,995 in via $int_if
# Time
$ipfw allow tcp from $lan_net to any 37 in via $int_if
$ipfw allow udp from $lan_net to any 123 in via $int_if
# ICQ
$ipfw allow tcp from $lan_net to any 443,5190,5222 in via $int_if
# FTP and some other
$ipfw allow tcp from $lan_net to any 21,22,49152-65535 in via $int_if
# HTTP
$ipfw allow tcp from $lan_net to any 80 in via $int_if
# Output whois
$ipfw allow tcp from $lan_net to any 43 in via $int_if
# DNS
$ipfw allow udp from $lan_net to any 53 in via $int_if
$ipfw allow tcp from $lan_net 53 to $int_ip in via $int_if
$ipfw allow tcp from $lan_net to $int_ip 53 in via $int_if
# Ping
$ipfw allow icmp from $lan_net to any icmptypes 0,3,8,11 in via $int_if
# For admin
$ipfw allow all from $odmin 1025-6000,11111,22222,50000-60000 to any in via $int_if
$ipfw allow all from 10.0.0.2 22 to $int_ip in via $int_if
$ipfw 55100 allow all from any to $int_ip 22 in via $int_if
# Deny and log other
$ipfw deny log all from any to any in via $int_if
########################################
# External interface, outgoing traffic #
########################################
# Deny all outgoing traffic to non-routable networks
$ipfw deny log all from any to 'table(1)' out via $extM_if
$ipfw deny log all from any to 'table(1)' out via $extS_if
# Deny broadcast ICMP on ext interface
$ipfw deny icmp from any to 255.255.255.255 out via $extM_if
$ipfw deny icmp from any to 255.255.255.255 out via $extS_if
# Deny multicast on ext interface
$ipfw deny all from 224.0.0.0/4 to any out via $extM_if
$ipfw deny all from 224.0.0.0/4 to any out via $extS_if
# Allow me go to internet
$ipfw allow all from $extM_ip to any out via $extM_if setup keep-state 
$ipfw allow all from $extS_ip to any out via $extS_if setup keep-state
# DNS BIND
$ipfw allow udp from $extM_ip to any 53 out via $extM_if keep-state
$ipfw allow udp from $extS_ip to any 53 out via $extS_if keep-state
# Time
$ipfw allow udp from $extM_ip to any 123 out via $extM_if keep-state
$ipfw allow tcp from $extM_ip to any 37 out via $extM_if setup keep-state
# Output whois
$ipfw allow tcp from $extM_ip to any 43 out via $extM_if setup keep-state
# NAT
/sbin/ipfw -q nat 1 config log if $extM_if reset same_ports deny_in unreg_only redirect_port tcp 10.0.0.111:33333 33333 redirect_port udp 10.0.0.111:11111 11111 redirect_port tcp 10.0.0.111:22222 22222 redirect_port udp 10.0.0.111:22222 22222
# NAT outgoing traffic
$ipfw nat 1 ip from any to any out via $extM_if
# Allow traffic on external interface
# Mail
$ipfw allow tcp from any to any 25,110,143,465,993,995 out via $extM_if
# ICQ
$ipfw allow tcp from any to any 443,5190,5222 out via $extM_if
# FTP and some other
$ipfw allow tcp from any to any 21,22,49152-65535 out via $extM_if 
# HTTP
$ipfw allow tcp from any to any 80 out via $extM_if
# Ping
$ipfw allow icmp from any to any icmptypes 0,3,8,11 out via $extM_if
$ipfw allow icmp from any to any icmptypes 0,3,8,11 out via $extS_if
# For admin
$ipfw allow tcp from any 1025-6000 to any out via $extM_if
$ipfw allow all from any 11111,22222,50000-60000 to any out via $extM_if
# Deny and log other
$ipfw deny log all from any to any out via $extM_if
$ipfw deny log all from any to any out via $extS_if
########################################
# External interface, incoming traffic #
########################################
# Deny all incoming traffic from non-route networks
$ipfw deny log all from 'table(1)' to any in via $extM_if
$ipfw deny log all from 'table(1)' to any in via $extS_if
# Deny ident
$ipfw deny tcp from any to any 113 in via $extM_if
$ipfw deny tcp from any to any 113 in via $extS_if
# Deny all Netbios
$ipfw deny tcp from any to any 81,137,138,139 in via $extM_if
$ipfw deny tcp from any to any 81,137,138,139 in via $extS_if
# SSH (also for internal network)
$ipfw allow all from any to me 22 in via $extM_if
$ipfw allow all from any to me 22 in via $extS_if
# NAT incoming traffic
$ipfw nat 1 ip from any to any in via $extM_if
# Allow traffic on external interface
# Mail
$ipfw allow tcp from any 25,110,143,465,993,995 to any in via $extM_if
# ICQ
$ipfw allow tcp from any 443,5190,5222 to any in via $extM_if
# FTP and some other
$ipfw allow tcp from any 21,22,49152-65535 to any in via $extM_if
# HTTP
$ipfw allow tcp from any 80 to any in via $extM_if
# Ping
$ipfw allow icmp from any to any icmptypes 0,3,8,11 in via $extM_if
$ipfw allow icmp from any to any icmptypes 0,3,8,11 in via $extS_if
# For admin
$ipfw allow tcp from any to $odmin 1025-6000 in via $extM_if
$ipfw allow all from any to $odmin 11111,22222,50000-60000 in via $extM_if
# Deny and log other
$ipfw deny log all from any to any in via $extM_if
$ipfw deny log all from any to any in via $extS_if
$ipfw deny log all from any to any

When working through the backup external channel, all rules stay the same; only the header changes:
Header of /etc/rules.firewall1

# Parameters
ipfw="/sbin/ipfw -q add"
extM_if="vlan2"
extM_ip="2.2.2.2"
extS_if="vlan1"
extS_ip="1.1.1.1"
int_if="eth2"
int_ip="10.0.0.1"
lan_net="10.0.0.0/24"
odmin="10.0.0.111"
serv="10.0.0.4"

sshguard is also configured on the “routers”; an experienced reader will be able to find and install it without further instructions.

Script source

ToFoIn stands for Toggle Failover of Internet. The name is probably more ambitious than the product deserves, but I could not think of one that describes it more accurately. The scripts and related files follow, with brief explanations.
tofoin.conf

##         tofoin.conf        ##
## by LordNicky v0.6 20140719 ##
## A few words about the modules and their functions.
## Tester - Testing the availability of the Internet on selected channel.
## Judge - Test results analysis, the decision to switch 
## from one channel to another.
## Logger - Event logging.
## Watchdog - Testing and debugging of the scripts.
## Configuration.
## Number of Internet channels.
CNUMBER=2
## Main Internet channel properties.
## Interface name.
EXT_0_IF=vlan10
## Id number of the routing table.
RTABLE_0=0
## Reserve Internet channel properties.
## Interface name.
EXT_1_IF=vlan20
## Id number of the routing table
RTABLE_1=1
## Hosts used to check the availability of the Internet
## channel. PTARGET_0 should be a domain name, and
## PTARGET_1 should be an IP address.
## Attention: the two resources must be different.
PTARGET_0=ya.ru
PTARGET_1=8.8.8.8
## Number of ICMP packets sent to each resource per test.
PNUMBER=2
## Launch period of the "Tester" module (in seconds).
## Values below 60 are strongly discouraged.
TESTERPERIOD=240
## Launch period of the "Judge" module (in seconds).
## Values below TESTERPERIOD are strongly discouraged.
## TESTERPERIOD + 60 is usually enough.
JUDGEPERIOD=300
## Launch sensitivity of the Tester and Judge modules: how often
## (in seconds) the daemon checks whether a period has elapsed.
## 60 is usually enough.
SENSITIVITY=60
## The maximum operating time for the module Tester.
TESTERMAXDELAY=40
## The maximum operating time for the module Judge.
JUDGEMAXDELAY=30
## The maximum operating time for the module Logger.
LOGGERMAXDELAY=20
## Number of successful tests required before returning
## to the main channel. The time between recovery of the
## main channel and the switch back to it is therefore
## approximately (WNUMBER+1)*JUDGEPERIOD seconds.
WNUMBER=3
## Frequency of writing repeated error messages to the log file.
## The first occurrence of a message is written in full. After
## LOGFREQ1 repetitions the logger writes a single line saying
## that the message was repeated LOGFREQ1 times. After that it
## writes one such line for every LOGFREQ2 repetitions. The
## algorithm applies only to identical messages that follow
## one another directly.
LOGFREQ1=5
LOGFREQ2=20
## File paths.
## Paths to the IPFW configuration scripts.
## Default file (the one referenced in rc.conf).
FIRESETDEF=/etc/firewall.sh
## Settings for main Internet channel.
FIRESET_0=/etc/rules.firewall0
## Settings for reserve Internet channel.
FIRESET_1=/etc/rules.firewall1
## Paths for all ToFoIn files.
## Daemon.
DAEMON=/path/to/file/tofoin_daemon.sh
## Tester.
TESTER=/path/to/file/tofoin_tester.sh
## Judge.
JUDGE=/path/to/file/tofoin_judge.sh
## Logger.
LOGGER=/path/to/file/tofoin_logger.sh
## Watchdog.
WATCHDOG=/path/to/file/tofoin_watchdog.sh
## Log file. It is recommended to place it in /var/log.
LOGFILE=/path/to/file/tofoin.log
## Directory for test results. It is recommended
## to place it in /tmp.
TESTER_RESULT=/path/to/directory
## Auxiliary file of the Judge module. It is recommended
## to place it in /tmp.
JUDGEMETER=/path/to/file/judgemeter
## Auxiliary files of the Logger module. It is recommended
## to place them in /tmp.
LOGTMP=/path/to/file/logger.tmp
LOGMETER=/path/to/file/logmeter
## PID files for all executable modules. It is recommended
## to place them in /var/run.
DAEMON_PID=/path/to/file/tofoin_daemon.pid
TESTER_PID=/path/to/directory
JUDGE_PID=/path/to/file/tofoin_judge.pid
LOGGER_PID=/path/to/file/tofoin_logger.pid
WATCHDOG_PID=/path/to/file/tofoin_watchdog.pid
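
The timing parameters depend on one another, so it can help to work through the arithmetic once. The following is a minimal sketch that evaluates the (WNUMBER+1)*JUDGEPERIOD estimate from the configuration comments with the values shown above. The function name return_delay is hypothetical and is not part of ToFoIn.

```shell
# Values copied from tofoin.conf above.
WNUMBER=3
JUDGEPERIOD=300

# Hypothetical helper: approximate time in seconds between recovery of the
# main channel and the switch back to it, per the formula in the config.
# Arguments: wnumber judgeperiod
return_delay() {
  echo $(( ($1 + 1) * $2 ))
}

return_delay "$WNUMBER" "$JUDGEPERIOD"   # (3 + 1) * 300 = 1200 seconds
```

With the default values the return to the main channel therefore takes roughly 20 minutes after it recovers.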

tofoin_daemon.sh

#!/usr/local/bin/bash
# by LordNicky v0.5 20140717
. /root/ToFoIn/tofoin.conf
test_time=`date +%s`;
judge_time=`date +%s`;
echo $$ > $DAEMON_PID;
$LOGGER "DAEMON: start successfully with pid $$" &
tester_0="$TESTER $RTABLE_0 10 0";
tester_1="$TESTER $RTABLE_1 10 1";
$tester_0 & $tester_1 &
while true
do
  current_time=`date +%s`;
  if [ "`expr $current_time - $test_time`" -ge "$TESTERPERIOD" ]
  then $tester_0 & $tester_1 & test_time=`date +%s`;
  else :;
  fi
  if [ "`expr $current_time - $judge_time`" -ge "$JUDGEPERIOD" ]
  then $JUDGE & judge_time=`date +%s`;
  else :;
  fi
  sleep $SENSITIVITY;
done  	

tofoin_tester.sh

#!/usr/local/bin/bash
# by LordNicky v0.7 20140717
. /root/ToFoIn/tofoin.conf
exit_function () {
rm $tester_pid; 
exit $exit_code;
}
tester_pid=$TESTER_PID/tofoin_test_$3\.pid;
if [ -e $tester_pid ];
then $WATCHDOG "tofoin_test" "$tester_pid" "$3" & exit 0;
else echo `date +%s` $$ > $tester_pid;
     if [ "$2" -eq 10 ];
     then if setfib $1 ping -c $PNUMBER $PTARGET_0 > /dev/null;
          then echo `date +%s` "0 0" > $TESTER_RESULT/result_$3;
          exit_code=0; exit_function;
          else if setfib $1 ping -c $PNUMBER $PTARGET_1 > /dev/null;
               then echo `date +%s` "0 1" > $TESTER_RESULT/result_$3;
	       exit_code=0; exit_function;
	       else echo `date +%s` "1 1" > $TESTER_RESULT/result_$3;
	       exit_code=0; exit_function;
	       fi
          fi
     elif [ "$2" -eq 0 ];
     then setfib $1 ping -c $PNUMBER $PTARGET_0;
     exit_code=0; exit_function;
     elif [ "$2" -eq 1 ];
     then setfib $1 ping -c $PNUMBER $PTARGET_1;
     exit_code=0; exit_function;
     else setfib $1 ping -c $PNUMBER $2;
     exit_code=1; exit_function;
     fi
fi     

As mentioned earlier, the tester module has slightly extended functionality for manual launch; the “solution” section describes how to use it. As the text of the script shows, the tester writes its results to a file only during a regular launch.
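
During a regular launch the tester writes one line of three space-separated fields to its result file: a timestamp, a state (0 if a ping succeeded, 1 if both targets failed), and, apparently, the index of the last target pinged. The judge reads these fields by character position with cut. The sketch below shows an alternative way to read the same line with read. The function name parse_result and the sample values are hypothetical.

```shell
# Hypothetical helper: split a tester result line "<epoch> <state> <target>"
# into its fields. $1 is the path to a result file such as
# $TESTER_RESULT/result_0.
parse_result() {
  read -r res_time res_state res_target < "$1"
  echo "time=$res_time state=$res_state target=$res_target"
}

# Usage with a sample file that mimics what the tester writes.
sample=$(mktemp)
echo "1405600000 0 1" > "$sample"
parse_result "$sample"
rm -f "$sample"
```

Reading whole fields does not depend on the timestamp being exactly ten characters long, which the cut -c 1-10 approach assumes.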
tofoin_judge.sh

#!/usr/local/bin/bash
# by LordNicky v0.7 20140717
. /root/ToFoIn/tofoin.conf
exit_function () {
rm $JUDGE_PID; 
exit $exit_code;
}
decision_function () {
if [ "$actualchan" -eq "$prefchan" ];
then if [ "$actualchan" -eq 0 ];
     then $LOGGER "JUDGE: No problems detected" & 
     exit_code=0; exit_function;
     elif [ "$actualchan" -eq 1 ];
     then echo -e "0" > $JUDGEMETER; 
     $LOGGER "JUDGE: No problems detected at channel $actualchan" & 
     exit_code=0; exit_function;
     else $LOGGER "JUDGE(decision): Invalid actualchan = $actualchan" & 
     exit_code=1; exit_function;
     fi
else if [ "$prefchan" -eq 1 ];
     then switch_function; exit_code=0; exit_function;
     elif [ "$prefchan" -eq 0 ];
     then if [ "$actualstate" -eq 0 ]
          then meter=`cat $JUDGEMETER`;
               if [ "$meter" -eq "$WNUMBER" ];
	       then switch_function; exit_code=0; exit_function;
	       elif [ "$meter" -lt "$WNUMBER" ];
	       then expr $meter + 1 > $JUDGEMETER; 
	       exit_code=0; exit_function;
	       else echo -e "0" > $JUDGEMETER; exit_code=0; exit_function;
	       fi
	  elif [ "$actualstate" -eq 1 ]
	  then $LOGGER "JUDGE: Emergency switch to $prefchan"; 
	  switch_function; exit_code=0; exit_function;
	  else $LOGGER "JUDGE(decision): Invalid actualstate = $actualstate" & exit_code=1; exit_function;
	  fi     	
     else $LOGGER "JUDGE(decision): Invalid prefchan = $prefchan" & 
     exit_code=1; exit_function;
     fi	   	 
fi
} 
switch_function () {
echo -e "0" > $JUDGEMETER;
if [ "$prefchan" -eq 0 ];
then /etc/rc.d/named stop; 
cp $FIRESET_0 $FIRESETDEF; 
/etc/rc.d/ipfw restart; 
setfib $RTABLE_0 /etc/rc.d/named start; 
$LOGGER "JUDGE: Now switching on channel $RTABLE_0" & 
exit_code=0; exit_function;
elif [ "$prefchan" -eq 1 ]
then /etc/rc.d/named stop;
cp $FIRESET_1 $FIRESETDEF;
/etc/rc.d/ipfw restart;
setfib $RTABLE_1 /etc/rc.d/named start;
$LOGGER "JUDGE: Now switching on channel $RTABLE_1" & 
exit_code=0; exit_function;
else $LOGGER "JUDGE(switch): Invalid prefchan = $prefchan" & 
exit_code=1; exit_function;
fi
}	
createarea_function () {
for ((a=0; a < CNUMBER ; a++))
do
  current_time=`date +%s`
  timearea[$a]=`cut -c 1-10 $TESTER_RESULT/result_$a`;
  if [ "`expr $current_time - ${timearea[$a]}`" -ge 0 ];
  then if [ "`expr $current_time - ${timearea[$a]}`" -lt "`expr $TESTERPERIOD + 120`" ];
       then :;
       else $LOGGER "JUDGE: MAX period" & 
       $WATCHDOG & 
       exit_code=1; exit_function;
       fi
  else $LOGGER "JUDGE: testmodule $a in future" & 
  $WATCHDOG & 
  exit_code=1; exit_function;
  fi	
  statearea[$a]=`cut -c 12 $TESTER_RESULT/result_$a`;
  if [ "$actualchan" -eq "$a" ]
  then actualstate=${statearea[$a]};
  else :;   
  fi	  
done
}
findarea_function () {
for ((a=0; a < CNUMBER ; a++))
do
  if [ "${statearea[$a]}" -eq 0 ]
  then prefchan=$a; decision_function; 
  exit_code=0; exit_function; 
  else if [ "${statearea[$a]}" -eq 1 ]
       then continue 
       else $LOGGER "JUDGE: Invalid channel state" & 
       exit_code=1; exit_function;
       fi  
  fi
done
}
if [ -e $JUDGE_PID ]
then $WATCHDOG "tofoin_judge" "$JUDGE_PID" & exit 0;
else echo `date +%s` $$ > $JUDGE_PID;	
     if ipfw list | grep nat | egrep -q $EXT_0_IF;
     then actualchan=0;
     elif ipfw list | grep nat | egrep -q $EXT_1_IF;
     then actualchan=1;
     else $LOGGER "JUDGE: NAT error" & 
     prefchan=0; switch_function; 
     exit_code=1; exit_function;
     fi
     createarea_function;
     findarea_function;
     $LOGGER "JUDGE: All channels down" & 
     exit_code=1; exit_function;
fi     

The judge module has room for further improvement, but overall it contains nothing fancy.
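
The branching inside decision_function is easier to follow when it is separated from logging and switching. The following is a minimal sketch of the same decision table as a side-effect-free function. The name decide and the result words (stay, switch, wait, reset, error) are hypothetical; the real module calls switch_function and updates $JUDGEMETER instead of printing a word.

```shell
# Hypothetical pure function mirroring the branches of decision_function.
# Arguments: actualchan prefchan actualstate meter wnumber
# Prints one of: stay, switch, wait, reset, error
decide() {
  actualchan=$1; prefchan=$2; actualstate=$3; meter=$4; wnumber=$5
  if [ "$actualchan" -eq "$prefchan" ]; then
    echo stay                  # already on the preferred channel
  elif [ "$prefchan" -eq 1 ]; then
    echo switch                # main channel failed: go to the reserve at once
  elif [ "$prefchan" -eq 0 ]; then
    if [ "$actualstate" -eq 1 ]; then
      echo switch              # current (reserve) channel is down: emergency return
    elif [ "$actualstate" -ne 0 ]; then
      echo error               # state must be 0 or 1
    elif [ "$meter" -eq "$wnumber" ]; then
      echo switch              # main channel stayed healthy long enough
    elif [ "$meter" -lt "$wnumber" ]; then
      echo wait                # count one more successful check
    else
      echo reset               # counter out of range: start over
    fi
  else
    echo error                 # only channels 0 and 1 are handled
  fi
}

decide 1 0 0 1 3   # on the reserve, main is back, 1 good check so far
decide 1 0 0 3 3   # enough good checks collected
```

The first call prints wait and the second prints switch.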
tofoin_logger.sh

#!/usr/local/bin/bash
# by LordNicky v0.5 20140713
. /root/ToFoIn/tofoin.conf
exit_function () {
rm $LOGGER_PID; 
exit $exit_code;
}
main_function () {
if [[ `tail -n 1 $LOGFILE | grep -o "$1" | grep -o "JUDGE: No problems detected"` = "JUDGE: No problems detected" ]];
then exit_code=0; exit_function;
else if [[ `cat $LOGTMP` = $1 ]];
     then meter=`cat $LOGMETER`;
          if [ "$meter" -ge "$LOGFREQ2" ];
	  then echo -e "0" > $LOGMETER; 
	  echo -e "`date -j +%Y%m%d%H%M` last message repeat $LOGFREQ2 times" >> $LOGFILE; 
	  exit_code=0; exit_function;
	  elif [ "$meter" -ge "$LOGFREQ1" ];
	  then if [[ `tail -n 1 $LOGFILE | grep -o "last message repeat $LOGFREQ1 times"` = "last message repeat $LOGFREQ1 times" ]];
	       then expr $meter + 1 > $LOGMETER; 
	       exit_code=0; exit_function;
	       elif [[ `tail -n 1 $LOGFILE | grep -o "last message repeat $LOGFREQ2 times"` = "last message repeat $LOGFREQ2 times" ]];
               then expr $meter + 1 > $LOGMETER; 
	       exit_code=0; exit_function;
	       else echo -e "`date -j +%Y%m%d%H%M` last message repeat $LOGFREQ1 times" >> $LOGFILE; 
	       exit_code=0; exit_function;
	       fi
	  elif [ "$meter" -ge 0 ];
	  then expr $meter + 1 > $LOGMETER; 
	  exit_code=0; exit_function;
	  else echo -e "0" > $LOGMETER; 
	  echo -e "`date -j +%Y%m%d%H%M` LOGGER: logmeter index error, write 0" >> $LOGFILE; 
	  exit_code=1; exit_function;
	  fi
     else if [ `cat $LOGMETER` -eq 0 ];
	  then echo -e "$1" > $LOGTMP; 
	  echo -e "`date -j +%Y%m%d%H%M` $1" >> $LOGFILE; 
	  exit_code=0; exit_function;
	  else echo -e "0" > $LOGMETER; 
	  echo -e "$1" > $LOGTMP; 
	  echo -e "`date -j +%Y%m%d%H%M` $1 ; LOGMETER now zero" >> $LOGFILE;
	  exit_code=0; exit_function;
	  fi
     fi
fi
}
if [ -e $LOGGER_PID ];
then sleep $((RANDOM%5+1)); 
     if [ -e $LOGGER_PID ];
     then $WATCHDOG "tofoin_logger" "$LOGGER_PID" & exit 0;
     else echo `date +%s` $$ > $LOGGER_PID;
     main_function "$1";
     fi
else echo `date +%s` $$ > $LOGGER_PID;
main_function "$1";
fi

In my opinion the logger is the hardest module to read, but unfortunately I could not write it more simply. Most of the script is devoted to suppressing duplicate messages, hence the apparent complexity.
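
The core idea of duplicate suppression can be shown in a much smaller form. The sketch below is a deliberate simplification and not the author's logger: it keeps its state in shell variables, handles only one repetition threshold, and prints to standard output. The real module has to keep the last message and the counter in $LOGTMP and $LOGMETER because every logger call is a separate process. The function name log_dedup is hypothetical.

```shell
# Simplified sketch: state lives in shell variables for illustration only.
LOGFREQ1=5
last_msg=""
repeat_count=0

# Print a message the first time it is seen; for consecutive repeats print
# only one summary line after every LOGFREQ1 repetitions.
log_dedup() {
  if [ "$1" = "$last_msg" ]; then
    repeat_count=$((repeat_count + 1))
    if [ "$repeat_count" -eq "$LOGFREQ1" ]; then
      echo "last message repeat $LOGFREQ1 times"
      repeat_count=0
    fi
  else
    last_msg=$1
    repeat_count=0
    echo "$1"
  fi
}

# Six identical messages produce the message itself plus one summary line.
for i in 1 2 3 4 5 6; do
  log_dedup "JUDGE: MAX period"
done
```

Note that log_dedup must be called in the current shell; inside a command substitution the counters would be lost when the subshell exits.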
tofoin_watchdog.sh

#!/usr/local/bin/bash
# by LordNicky v0.5 20140713
. /root/ToFoIn/tofoin.conf
exit_function () {
rm $WATCHDOG_PID; 
exit $exit_code;
}
kill_function () {
if [[ "`ps -o command -p $proc_pid | grep -o "$proc_name"`" = "$proc_name" ]];
then $LOGGER "WATCHDOG: Other $proc_s_name working during $diff, kill him" &
kill $proc_pid; 
else $LOGGER "WATCHDOG: None or other process on $proc_s_name pid, cleaning pid file" &
fi
if [[ "$proc_name" = "tofoin_watchdog" ]];
then main_function;
else rm $proc_pid_file;
fi
}	
main_function () {
echo `date +%s` $$ > $WATCHDOG_PID;
proc_name=${one:-all};
return_wait=10
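# b and c are initialised before the test: placed between the test and
# "then", the assignment's exit status would make this branch always run.
b=0; c=0
if [[ "$proc_name" = "all" ]];
then for ((a=0; a < CNUMBER ; a++))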
     do 
       current_time=`date +%s`;
       tester_result=$TESTER_RESULT/result_$a;
       tester_time=`cut -c 1-10 $tester_result`;
       diff=`expr $current_time - $tester_time`;
       if [ "$diff" -ge 0 ]
       then if [ "$diff" -lt "`expr $TESTERPERIOD + 120`" ];
            then :;
	    else proc_name=tofoin_daemon; proc_pid=`cat $DAEMON_PID`;
		 if  [[ "`ps -o command -p $proc_pid | grep -o "$proc_name"`" = "$proc_name" ]];
                 then $LOGGER "WATCHDOG: Restart daemon" & 
                 kill $proc_pid; $DAEMON & 
                 else $LOGGER "WATCHDOG: None daemon process, start" &
		 $DAEMON & 
                 fi
		 exit_code=0; exit_function;
            fi
       else $LOGGER "WATCHDOG: Check date" &
       fi
     done
elif [[ "$proc_name" = "tofoin_test" ]];
then proc_pid_file=$two; cnumber=$three; 
test_function; return_val=$?;
     if [[ "$return_val" = "$return_wait" ]];
     then sleep $TESTERMAXDELAY; test_function "nowait";
     else :;
     fi
elif [[ "$proc_name" = "tofoin_judge" ]];
then proc_pid_file=$JUDGE_PID;
judge_function; return_val=$?;
     if [[ "$return_val" = "$return_wait" ]];
     then sleep $JUDGEMAXDELAY; judge_function "nowait";
     else :;
     fi
elif [[ "$proc_name" = "tofoin_logger" ]];
then proc_pid_file=$LOGGER_PID;
logger_function; return_val=$?;
     if [[ "$return_val" = "$return_wait" ]];
     then sleep $LOGGERMAXDELAY; logger_function "nowait";
     else :;
     fi
else $LOGGER "WATCHDOG: Incorrect process name";
fi
exit_code=0; exit_function;
}	
test_function () {
if [ -e $proc_pid_file ];
then proc_pid=`cut -c 12-18 $proc_pid_file`;
     proc_s_name="tester $cnumber";
     start_time=`cut -c 1-10 $proc_pid_file`;
     current_time=`date +%s`;
     diff=`expr $current_time - $start_time`;
     if [ "$diff" -ge 0 ];
     then if [ "$diff" -lt "$TESTERMAXDELAY" ];
          then if [[ "$1" = "nowait" ]];
	       then if [ "$proc_pid" = "$proc_temp_pid" ];
	            then kill_function; return 0;
		    else $LOGGER "WATCHDOG: Pid of $proc_s_name was changed, exit" &
		    fi
	       else $LOGGER "WATCHDOG: $proc_s_name now working, try wait" &
	       proc_temp_pid=$proc_pid;
	       return $return_wait;
	       fi
	  else kill_function; return 0;
	  fi
     else $LOGGER "WATCHDOG: Time error in $proc_s_name = $diff" &
     kill_function; return 0;
     fi 
else return 0;
fi
}	
judge_function () {
if [ -e $proc_pid_file ];
then proc_pid=`cut -c 12-18 $proc_pid_file`;
     proc_s_name="judge";
     start_time=`cut -c 1-10 $proc_pid_file`;
     current_time=`date +%s`;
     diff=`expr $current_time - $start_time`;
     if [ "$diff" -ge 0 ];
     then if [ "$diff" -lt "$JUDGEMAXDELAY" ];
          then if [[ "$1" = "nowait" ]];
	       then if [ "$proc_pid" = "$proc_temp_pid" ];
	            then kill_function; return 0;
		    else $LOGGER "WATCHDOG: Pid of $proc_s_name was changed, exit" &
		    fi
	       else $LOGGER "WATCHDOG: $proc_s_name now working, try wait" &
	       proc_temp_pid=$proc_pid;
	       return $return_wait;
	       fi
	  else kill_function; return 0;
	  fi
     else $LOGGER "WATCHDOG: Time error in $proc_s_name = $diff" &
     kill_function; return 0;
     fi 
else return 0;
fi
}	
logger_function () {
if [ -e $proc_pid_file ];
then proc_pid=`cut -c 12-18 $proc_pid_file`;
     proc_s_name="logger";
     start_time=`cut -c 1-10 $proc_pid_file`;
     current_time=`date +%s`;
     diff=`expr $current_time - $start_time`;
     if [ "$diff" -ge 0 ];
     then if [ "$diff" -lt "$LOGGERMAXDELAY" ];
          then if [[ "$1" = "nowait" ]];
	       then if [ "$proc_pid" = "$proc_temp_pid" ];
	            then kill_function; return 0;
		    else echo -e "`date -j +%Y%m%d%H%M` WATCHDOG: Pid of $proc_s_name was changed, exit" >> $LOGFILE;
		    fi
	       else echo -e "`date -j +%Y%m%d%H%M` WATCHDOG: $proc_s_name now working, try wait" >> $LOGFILE;
	       proc_temp_pid=$proc_pid;
	       return $return_wait;
	       fi
	  else kill_function; return 0;
	  fi
     else echo -e "`date -j +%Y%m%d%H%M` WATCHDOG: Time error in $proc_s_name = $diff" >> $LOGFILE;
     kill_function; return 0;
     fi 
else return 0;
fi
}	
one=$1;
two=$2;
three=$3;
if [ -e $WATCHDOG_PID ];
then proc_pid=`cut -c 12-18 $WATCHDOG_PID`;
     proc_name="tofoin_watchdog";
     proc_s_name="watchdog";
     start_time=`cut -c 1-10 $WATCHDOG_PID`;
     current_time=`date +%s`;
     diff=`expr $current_time - $start_time`;
     if [ "$diff" -ge 0 ];
     then if [ "$diff" -lt "`expr $TESTERMAXDELAY + $JUDGEMAXDELAY + $LOGGERMAXDELAY + 30`" ];
          then $LOGGER "WATCHDOG: Other $proc_s_name already working, exit" & exit 0;
	  else kill_function;
	  fi
     else $LOGGER "WATCHDOG: Time error in $proc_s_name = $diff" &
     kill_function; 
     fi 
else main_function;
fi

Watchdog is the largest and perhaps the most debatable of the scripts presented. It turned out this way because I tried to provide for every possible failure. For now it stays as it is. Since this module is meant to be launched by cron, something like the following should be added to /etc/crontab:

0	*	*	*	*	root	/path/to/file/tofoin_watchdog.sh
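
The per-module checks in the watchdog (test_function, judge_function, logger_function) all follow one pattern: read the start time and PID from a PID file in the "<epoch> <pid>" format, compute the age, and compare it with the module's maximum delay. The following is a minimal sketch of that pattern as a single function. The name pid_file_age_check, the optional third argument and the result words are hypothetical; the real functions additionally wait, log and kill.

```shell
# Hypothetical helper: classify a PID file written as "<epoch> <pid>".
# Arguments: pid_file max_age_seconds [now]
# Prints "fresh <pid>", "stale <pid>" or "time-error <pid>".
pid_file_age_check() {
  read -r start_time proc_pid < "$1"
  now=${3:-$(date +%s)}          # current time unless given explicitly
  diff=$((now - start_time))
  if [ "$diff" -lt 0 ]; then
    echo "time-error $proc_pid"  # start time lies in the future
  elif [ "$diff" -lt "$2" ]; then
    echo "fresh $proc_pid"       # module may still be working
  else
    echo "stale $proc_pid"       # module exceeded its maximum delay
  fi
}

# Usage with a sample PID file and a fixed "now" for reproducibility.
sample=$(mktemp)
echo "1405600000 4242" > "$sample"
pid_file_age_check "$sample" 40 1405600100
rm -f "$sample"
```

A single parameterised function of this kind could replace the three nearly identical functions, with the module name and maximum delay passed as arguments.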


Results


The script was tested for six months. No critical errors were found, and the minor ones were fixed. All modules work according to the given algorithm without deviations or unpredictable actions. The event log is quite informative: it shows which problems arose, when they occurred and when they were resolved. Thus, we can conclude that the initial goal has been achieved; further development plans are outlined below.

Plans


The plans for the further development of the script:
  1. Place files in the appropriate system directories;
  2. Consider the need to run as a special user using sudo for specific tasks. In case of a positive decision, adapt the script;
  3. Add a module for communication with zabbix;
  4. Build a client-server system. vlan3 and vlan4 were configured for exactly this purpose: if the "routers" lose contact over the internal channel, they are supposed to try communicating through the VLANs configured on the external interfaces;
  5. Perhaps in the distant bright future, rewrite the entire script in a language with more features. At the moment there is a desire to squeeze out everything that is possible from bash.


Questions


Naturally, many questions arose while writing the script, and even more afterwards. The most important one is the following.
Suppose there are the following variables:


a=<we set the value ourselves>
HI_1="123"
HI_2="321"

I need to refer to the variables HI_1 and HI_2 by changing only a, i.e. the reference would look something like this:

${HI_$a}    ## this deliberately invalid expression is given only as an example

If a = 1 is set in advance, the expression should evaluate to 123, and if a = 2, to 321. I studied the bash literature that, in my opinion, should answer this question, but unfortunately I did not find out how to do it. Such a feature would greatly simplify the script and make it easy to extend.
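
For what it is worth, bash does provide a mechanism of this kind: indirect expansion, written ${!name}, where name holds the name of another variable. The sketch below is bash-specific and uses the HI_1 and HI_2 variables from the question; the helper name hi_value is hypothetical. An indexed array is shown as an alternative that is often simpler when the suffix is a number.

```shell
HI_1="123"
HI_2="321"

# Hypothetical helper: print the value of HI_<n> using indirect expansion.
hi_value() {
  local name="HI_$1"   # build the variable name as a string
  echo "${!name}"      # ${!name} expands to the value of the variable named by $name
}

hi_value 1   # prints 123
hi_value 2   # prints 321

# Alternative: an indexed array keyed by the same number.
HI=([1]="123" [2]="321")
a=2
echo "${HI[$a]}"   # prints 321
```

Applied to tofoin.conf, the same pattern would allow something like name="RTABLE_$a" followed by ${!name} inside a loop over the channels, instead of a separate branch per channel.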

The rest, of course, are general questions. How relevant is this solution? What mistakes were made in the script? What is the best way to resolve the issues identified in the plans and in the text of the article? Your comments are welcome.

If you would like to help improve the script, write to me in a private message and we will discuss possible cooperation.


While configuring the system and writing the script, I also used many materials from opennet.ru, lissyara.su, habrahabr.ru and other sites. Unfortunately, many links have been lost over time, so if you recognize fragments taken from elsewhere, I will be happy to add links to them. Special thanks to Alexei Eresko and Valery Druba for advice and assistance in solving difficulties while preparing and writing the script, and to Oleg Matusevich for help in preparing the article.

P.S. When using the materials of this article, a link to the source and an indication of the author are obligatory.
