A Pacemaker / Corosync cluster without Validol (no sedatives required)
Imagine the situation: Saturday evening. You are a PostgreSQL administrator who, after a hard week of work, has left for a dacha 200 km from your beloved job and you feel great... until an SMS from the Zabbix monitoring system disturbs your peace. A failure has occurred on the DBMS server, and the database is currently unavailable. You have little time to solve the problem, and no choice but to saddle the company scooter with a heavy heart and rush back to work. Alas!

But it could have been different. You receive an SMS from the monitoring system saying that a failure has occurred on one of the servers, yet the DBMS keeps working: the PostgreSQL failover cluster has lost one node but continues to function. There is no need to rush to work and restore the database server; investigating the cause of the failure and the recovery work can calmly wait until Monday.
Either way, failover cluster technology for the PostgreSQL DBMS is worth knowing. In this article we will talk about building a failover PostgreSQL cluster using the Pacemaker and Corosync software.
Pacemaker-based PostgreSQL Failover Cluster
Today, in business-critical IT systems, demand for broad functionality is fading into the background; the reliability of the system comes first. To achieve fault tolerance, system components have to be made redundant, and that redundancy is managed by special software.
An example of such software is Pacemaker, a ClusterLabs solution for organizing failover clusters. Pacemaker runs on a wide range of Unix-like operating systems: RHEL, CentOS, Debian, Ubuntu.
This software was not created specifically for PostgreSQL or any other DBMS; the scope of Pacemaker and Corosync is much wider. There are solutions tailored specifically for PostgreSQL, for example multimaster, which is part of Postgres Pro Enterprise (Postgres Professional), or Patroni (Zalando). But the Pacemaker/Corosync-based PostgreSQL cluster considered in this article is quite popular and, in terms of simplicity, reliability and cost of ownership, suits a considerable number of situations. It all depends on the specific tasks; comparing the solutions is beyond the scope of this article.
So, Pacemaker is the brain and, at the same time, the manager of cluster resources. Its main task is to achieve maximum availability of the resources it manages and to protect them from failures.
During cluster operation, various events occur: failure or joining of nodes and resources, transition of nodes to maintenance mode, and so on. Pacemaker responds to these events by performing the actions it is programmed for, for example stopping or moving resources.
To make it clear how Pacemaker is structured and works, let's look at what it consists of. So, let's move on to Pacemaker's entities.

Figure 1 - Pacemaker entities: cluster nodes
The first and most important entity is the cluster node. A node of the cluster is a physical server or virtual machine with Pacemaker installed. Nodes intended to provide the same services must have the same software configuration. That is, if the postgresql resource is supposed to run on nodes node1 and node2, and it is installed in non-standard paths, then these nodes must have the same configuration files, the same PostgreSQL installation paths and, of course, the same PostgreSQL version.

The next important group of Pacemaker entities is cluster resources. In general, for Pacemaker a resource is a script written in any language. Usually these scripts are written in bash, but nothing prevents you from writing them in Perl, Python, C or even PHP. Such a script manages services in the operating system. The main requirement for the scripts is to be able to perform three actions - start, stop and monitor - and to share some meta-information. True, in our case of a PostgreSQL cluster these actions are supplemented by promote, demote and other PostgreSQL-specific commands.
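Below is a minimal sketch of what such a resource agent script might look like, just to illustrate the start/stop/monitor contract. The service name is a made-up assumption, and a real agent (such as the pgsql agent used later in this article) implements the full OCF API and metadata:

#!/bin/bash
# Minimal sketch of an OCF-style resource agent (illustrative only).
# "my-service" is a hypothetical service managed by this agent.

SERVICE="my-service"

case "$1" in
    start)
        systemctl start "$SERVICE" && exit 0 || exit 1    # 0 = OCF_SUCCESS, 1 = OCF_ERR_GENERIC
        ;;
    stop)
        systemctl stop "$SERVICE" && exit 0 || exit 1
        ;;
    monitor)
        # 0 = running (OCF_SUCCESS), 7 = not running (OCF_NOT_RUNNING)
        systemctl is-active --quiet "$SERVICE" && exit 0 || exit 7
        ;;
    meta-data)
        echo '<resource-agent name="my-service"/>'         # a real agent prints full OCF metadata XML here
        exit 0
        ;;
    *)
        exit 3                                              # OCF_ERR_UNIMPLEMENTED
        ;;
esac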
Examples of resources:
- an IP address;
- a service started in the operating system;
- a block device;
- a file system;
- others.
Resources have many attributes that are stored in Pacemaker's XML configuration file. The most interesting of them are: priority, resource-stickiness, migration-threshold, failure-timeout, multiple-active.
Let's consider them in more detail.
The priority attribute is the priority of a resource, which is taken into account if a node has exhausted its limit on the number of active resources (the default is 0). If the cluster nodes are not equal in performance or availability, you can increase the priority of one of the nodes so that it is the active one whenever it is up.
The resource-stickiness attribute is the "stickiness" of a resource (default 0). Stickiness indicates how much a resource "wants" to stay where it currently is. For example, after a node fails, its resources move to other nodes (more precisely, they are started on other nodes), and after the failed node is restored the resources may or may not return to it; this behavior is described by the stickiness parameter.
In other words, stickiness indicates how desirable it is for a resource to return to the restored node after a failure.
Since the default stickiness of all resources is 0, Pacemaker itself arranges the resources on the nodes “optimally” at its discretion.
But this may not always be optimal from the point of view of the administrator. For example, if the nodes in the failover cluster have different performance, the administrator will want to start services on the node with higher performance.
Pacemaker also allows you to set different resource stickiness depending on the time of day and the day of the week, which makes it possible, for example, to move a resource back to its original node during non-working hours.
The migration-threshold attribute - how many failures must occur for Pacemaker to decide that the node is unsuitable for the given resource and transfer (migrate) it to another node. By default, this parameter is also equal to 0, i.e., for any number of failures, the automatic transfer of resources will not occur.
But, from the point of view of fault tolerance, it is correct to set this parameter to 1, so that at the first failure Pacemaker moves the resource to another node.
The failure-timeout attribute is the number of seconds after a failure during which Pacemaker considers that no failure has occurred and takes no action, in particular does not move resources. The default value is 0.
The multiple-active attribute tells Pacemaker what to do with a resource if it turns out to be running on more than one node. It can take the following values (a sketch of setting these attributes with pcs follows the list):
- block - set the unmanaged option, i.e. deactivate the resource;
- stop_only - stop the resource on all nodes;
- stop_start - stop the resource on all nodes and start it on only one (the default value).
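As a minimal sketch of setting these attributes with pcs (the resource name Virtual-IP and the values are assumptions chosen for illustration):
- sudo pcs resource meta Virtual-IP resource-stickiness=100 migration-threshold=1 failure-timeout=60s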
By default, after starting a resource the cluster does not check whether it is still alive. To enable resource tracking, you need to add a monitor operation when creating the resource; the cluster will then monitor the state of the resource (a sketch of adding such an operation is shown below). The interval parameter of this operation defines how often the check is performed.

If a failure occurs on the primary node, Pacemaker "moves" the resources to another node (in fact, Pacemaker stops the resources on the failed node and starts them on another one). The process of "moving" resources to another node is fast and invisible to the end client.
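A monitor operation can be specified either when the resource is created or added later. A minimal sketch with pcs, where the Virtual-IP resource, the netmask and the 10-second interval are assumptions for illustration:
- sudo pcs resource create Virtual-IP ocf:heartbeat:IPaddr2 ip=10.3.3.3 cidr_netmask=24 op monitor interval=10s
- sudo pcs resource op add Virtual-IP monitor interval=10s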
Resource groups
Resources can be combined into groups - lists of resources that must be launched in a certain order, stopped in the reverse order and executed on one node. All resources of a group are launched on one node and are launched sequentially, according to the order in the group. But keep in mind that if one of the group’s resources fails, the whole group will move to another node.
If a resource in a group is stopped, all subsequent resources of the group are stopped as well. For example, a PostgreSQL resource of type pgsql and a Virtual-IP resource of type IPaddr2 can be put into a group.
The startup order in such a group is as follows: PostgreSQL is started first, and once it has started successfully, the Virtual-IP resource is started after it.
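A sketch of creating such a group with pcs, assuming both resources already exist and using hypothetical names:
- sudo pcs resource group add pgsql-group PostgreSQL Virtual-IP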
Quorum
What is quorum? A cluster is said to have quorum when it has a sufficient number of "live" nodes. The sufficiency of the number of "live" nodes is determined by the formula below.
n > N/2, where n is the number of live nodes and N is the total number of nodes in the cluster.
As this simple formula shows, a cluster has quorum when the number of "live" nodes is more than half of the total number of nodes. For example, in a three-node cluster (N = 3) quorum requires at least two live nodes, since 2 > 3/2.

Figure 2 - Failover cluster with quorum
As you probably understand, in a cluster of two nodes there will be no quorum if one of them fails. By default, if there is no quorum, Pacemaker stops the resources.
To avoid this, when configuring Pacemaker you can tell it not to take the presence or absence of quorum into account. This is done with the no-quorum-policy=ignore option.
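A sketch of how this cluster property can be set with pcs:
- sudo pcs property set no-quorum-policy=ignore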
Pacemaker Architecture
The Pacemaker architecture consists of three levels:

Figure 3 - Pacemaker Levels
- The cluster-independent level - resources and agents. At this level the resources themselves and their scripts are located. In the figure it is marked in green.
- The resource manager (Pacemaker) - the "brain" of the cluster. It responds to events occurring in the cluster: failure or joining of nodes and resources, transition of nodes to maintenance mode and other administrative actions. In the figure it is marked in blue.
- The information level (Corosync) - at this level the network interaction of the nodes takes place, i.e. the transfer of service commands (start/stop of resources, nodes, etc.) and the exchange of information about cluster membership and completeness (quorum). In the figure it is marked in red.
What do you need for Pacemaker to work?
For the failover cluster to function properly, the following requirements must be met:
- Time synchronization between nodes in a cluster
- Cluster host name resolution
- Network Stability
- The cluster nodes have the power management / reboot function using IPMI (ILO) for organizing the fencing of the node.
- Allow traffic through protocols and ports
Consider these requirements in more detail.
Time synchronization - all nodes must have the same time; this is usually achieved by installing a time server (ntpd) on the local network.
Name resolution - implemented by installing a DNS server on the local network. If it is not possible to install a DNS server, you need to add entries with the host names and IP addresses of all cluster nodes to the /etc/hosts file on every node.
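A minimal sketch of such /etc/hosts entries for a three-node cluster (the IP addresses are assumptions for illustration):
10.3.3.11    node1
10.3.3.12    node2
10.3.3.13    node3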
Network stability. You need to get rid of false positives. Imagine an unstable local network in which the link between the cluster nodes and the switch is lost every 5-10 seconds. In this case, Pacemaker considers a link loss of more than 5 seconds to be a failure. The link disappears - your resources "move". Then the link comes back, but Pacemaker already considers the node "failed" and has already "moved" the resources to another node. At the next failure Pacemaker will "move" the resources to the next node, and so on until all nodes are exhausted and a denial of service occurs. Thus, due to false positives, the whole cluster may stop functioning.
The cluster nodes must support power management/reboot via IPMI (iLO) for "fencing". When a node fails, it must be isolated from the other nodes. "Fencing" eliminates the split-brain situation, in which two nodes simultaneously act as the PostgreSQL DBMS Master.
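A minimal sketch of configuring such a fencing (STONITH) device with pcs, assuming the fence_ipmilan agent and made-up IPMI addresses and credentials (parameter names vary between fence-agent versions):
sudo pcs stonith create fence-node1 fence_ipmilan \
    ipaddr="10.4.4.11" login="admin" passwd="secret" lanplus="true" \
    pcmk_host_list="node1"
A similar stonith resource is created for each node of the cluster.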
Allowing traffic on the necessary protocols and ports. This is an important requirement, because in various organizations the security departments often impose restrictions on traffic between subnets or restrictions at the switch level.
The table below lists the protocols and ports that are required for a failover cluster to function.

Table 1 - List of protocols and ports necessary for the functioning of the failover cluster
The table shows the data for the case of a failover cluster of three nodes - node1, node2, node3. It is also assumed that the cluster nodes and the node power management interfaces (IPMI) are on different subnets. As can be seen from the table, you need to ensure not only the availability of the neighboring nodes on the local network, but also the availability of the nodes on the IPMI network.
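As a sketch, on distributions with firewalld the required ports can usually be opened with the predefined high-availability service plus the PostgreSQL port; adjust this to the actual port list from the table:
- sudo firewall-cmd --permanent --add-service=high-availability
- sudo firewall-cmd --permanent --add-port=5432/tcp
- sudo firewall-cmd --reload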
Features of using virtual machines for a failover cluster
When using virtual machines to build failover clusters, the following features should be considered.
- fsync. The fault tolerance of the PostgreSQL DBMS relies heavily on the ability to synchronize writes to persistent storage (disk) and on the correct functioning of this mechanism. Different hypervisors implement caching of disk operations differently, and some do not guarantee timely flushing of data from the cache to the storage system.
- realtime corosync. In a Pacemaker-based failover cluster, the corosync process is responsible for detecting failures of cluster nodes. For it to work correctly, the OS must be guaranteed to schedule its execution on the processor (allocate CPU time to it). For this reason the process runs with realtime (RT) priority. In a virtualized OS there is no way to guarantee such scheduling, which leads to false positives in the cluster software.
- fencing. In a virtualized environment the fencing mechanism becomes more complex and multi-level: at the first level you need to power off the virtual machine through the hypervisor, and at the second level you power off the entire hypervisor (the second level is triggered when fencing did not work correctly at the first level). Unfortunately, some hypervisors do not provide a fencing option.
We recommend that you avoid virtual machines when building a failover cluster.
Features of using PostgreSQL in a failover cluster
When using PostgreSQL in failover clusters, the following features should be considered:
- When starting a cluster with PostgreSQL, Pacemaker places the LOCK.PSQL lock file on the node with the DBMS Master. Usually this file is located in the directory /var/lib/pgsql/tmp. This is done to prevent PostgreSQL from starting automatically after a failure on the Master. Thus, after a failure on the Master, the DBA's intervention is always required to eliminate the cause of the failure.
- Since a failover cluster uses the standard PostgreSQL Master-Slave scheme, certain failures can lead to a situation with two Masters - the so-called split-brain. An example of such a failure is the loss of network connectivity between one of the nodes and the other nodes (all types of failures are discussed below). To avoid this situation, two important conditions must be met when building failover clusters:
  - the cluster must have quorum, which means it must contain at least three nodes. It is not necessary for all three nodes to run the DBMS; it is enough to have the Master and the Replica on two nodes, while the third node acts only as a "voter";
  - the nodes running the DBMS must have fencing devices. If a failure occurs, the fencing devices isolate the failed node by sending a command to power it off or restart it (poweroff or hard-reset).
- It is recommended to place WAL archives on shared storage accessible to both the Master and the Replica. This simplifies restoring the Master after a failure and switching it to Slave mode (a sketch of the corresponding archive_command follows this list).
- To manage the PostgreSQL DBMS, a resource of type pgsql is created when configuring the cluster. When creating this resource, PostgreSQL-specific settings are taken into account, such as the data path (for example, pgdata="/var/lib/pgsql/9.6/data/"), the paths to the binaries (psql="/usr/pgsql-9.6/bin/psql" and pgctl="/usr/pgsql-9.6/bin/pg_ctl"), the replication type (rep_mode="sync"), the virtual IP address for the Master (master_ip="10.3.3.3"), as well as some parameters added to the recovery.conf file on the replica (restore_command="cp /var/lib/pgsql/9.6/pg_archive/%f %p" and primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5").
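As a sketch, WAL archiving into such a shared archive directory could be enabled in postgresql.conf roughly as follows; the directory matches the restore_command example above, but treat the exact settings as an assumption to adapt to your environment:
archive_mode = on
archive_command = 'cp %p /var/lib/pgsql/9.6/pg_archive/%f'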
An example of creating a PostgreSQL resource of type pgsql for a cluster of three nodes pgsql01, pgsql02, pgsql03:

sudo pcs resource create PostgreSQL pgsql \
    pgctl="/usr/pgsql-9.6/bin/pg_ctl" \
    psql="/usr/pgsql-9.6/bin/psql" \
    pgdata="/var/lib/pgsql/9.6/data/" \
    rep_mode="sync" \
    node_list="pgsql01 pgsql02 pgsql03" \
    restore_command="cp /var/lib/pgsql/9.6/pg_archive/%f %p" \
    primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
    master_ip="10.3.3.3" \
    restart_on_promote='false'
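In a typical pgsql-based setup, this resource is then turned into a master/slave (promotable) resource and tied to the virtual IP with constraints. A minimal sketch using the older pcs 0.9 syntax, assuming a Virtual-IP resource of type IPaddr2 has already been created (names and options are illustrative, not the article's exact configuration):

sudo pcs resource master msPostgresql PostgreSQL \
    master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
sudo pcs constraint colocation add Virtual-IP with master msPostgresql INFINITY
sudo pcs constraint order promote msPostgresql then start Virtual-IP symmetrical=false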
Pacemaker Management Commands
Here are some useful Pacemaker management commands (all of them require OS superuser rights). The main cluster management utility is pcs. Before configuring and starting the cluster for the first time, you need to authorize the nodes in the cluster once:
- sudo pcs cluster auth node1 node2 node3 -u hacluster -p 'password'
Starting the cluster on all nodes:
- sudo pcs cluster start --all
Start / stop on one node:
- sudo pcs cluster start
- sudo pcs cluster stop
View cluster status using the Corosync monitor:
- sudo crm_mon -Afr
Clearing failure counters:
- sudo pcs resource cleanup
Monitoring cluster state with crm_mon
Pacemaker has a built-in cluster monitoring utility. The system administrator can use it to see what is happening in the cluster, which resources are currently located on which nodes.
Using the crm_mon command, you can monitor the status of the failover cluster:
- sudo crm_mon -Afr
The screenshot shows the status of the cluster.

Figure 4 - Monitoring the cluster state using the crm_mon
The crm_mon output includes several PostgreSQL-specific node attributes.

pgsql-status - the state of the PostgreSQL instance on the node:
- PRI - master state;
- HS:sync - synchronous replica;
- HS:async - asynchronous replica;
- HS:alone - the replica cannot connect to the master;
- STOP - PostgreSQL is stopped.

pgsql-data-status - the data state of the node:
- LATEST - the state inherent to the master; this node is the master;
- STREAMING:SYNC or STREAMING:ASYNC - shows the replication status and the type of replication (SYNC/ASYNC);
- DISCONNECT - the replica cannot connect to the master; this usually happens when there is no connection between the replica and the master.

pgsql-master-baseline - shows the timeline. The timeline changes every time the promote command is executed on a replica node; after that, the DBMS starts a new timeline.
Types of failures on cluster nodes
What types of failures does the Pacemaker-based failover cluster protect against?
- Power failure on the current Master or on a Replica. A power failure means loss of power and shutdown of the server; it can happen to either the Master or one of the Replicas.
- PostgreSQL process failure. Failure of the main PostgreSQL process - the system may kill the postgres process for various reasons, for example lack of memory, insufficient file descriptors, or exceeding the maximum number of open files.
- Loss of network connectivity between any of the nodes and the other nodes. This is network inaccessibility of a node, caused for example by a failure of the network card or a switch port.
- Failure of the Pacemaker/Corosync process. A failure of the Corosync/Pacemaker process, similar to a PostgreSQL process failure.
Types of scheduled maintenance of the failover cluster
For routine maintenance, individual nodes periodically have to be taken out of the cluster (a sketch of the corresponding pcs commands follows this list):
- Taking the Master or a Replica out of service for scheduled work is necessary in the following cases:
  - replacement of failed equipment (not leading to a failure);
  - equipment upgrade;
  - software update;
  - other cases.
- Swapping the roles of the Master and a Replica. This is necessary when the Master and Replica servers differ in resources. For example, the failover cluster may include a powerful server acting as the PostgreSQL Master and a weaker server acting as the Replica. After a failure of the more powerful Master server, its functions pass to the weaker Replica. It is logical to assume that after eliminating the cause of the failure on the former Master, the administrator will want to return the Master role back to the powerful server.
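A minimal sketch of taking a node out of the cluster for maintenance and returning it with pcs (on older pcs versions the subcommands are pcs cluster standby / unstandby, on newer ones pcs node standby / unstandby; the node name is an assumption):
- sudo pcs cluster standby node2
- sudo pcs cluster unstandby node2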
Important! Before swapping roles or taking the Master out of service, you must use the crm_mon -Afr command to verify that a synchronous replica is present in the cluster, since the Master role is always assigned to a synchronous replica.
Since the goal of this short article is only to introduce you to one of the fault tolerance solutions for the PostgreSQL DBMS, the installation and detailed configuration commands of the failover cluster are not covered.
The author of the article is Igor Kosenkov , an engineer at Postgres Professional.
Drawing - Natalya Levshina .
Survey: what PostgreSQL clustering system are you using?
- corosync/pacemaker: 51.6% (16 votes)
- patroni: 29% (9 votes)
- repmgr: 16.1% (5 votes)
- stolon: 6.4% (2 votes)