A high-availability PostgreSQL cluster based on Patroni, HAProxy and Keepalived

Hello, Habr! I recently faced the task of configuring a maximally reliable cluster of PostgreSQL 9.6 servers.

As planned, I wanted to get a cluster that survives the loss of any server, or even several servers, and can automatically return failed servers to service.

While planning the cluster I studied many articles, from the core PostgreSQL documentation to various howtos, including ones on Habr; I tried to configure a standard cluster with RepMgr and experimented with pgpool.

In general it worked, but I occasionally had problems with failover, manual intervention was required to recover from outages, and so on. So I decided to look for more options.

As a result, somewhere (I don't remember exactly where) I found a link to the beautiful Zalando Patroni project, and decided to build the whole thing around it...

Introduction


Patroni is a Python daemon that automatically maintains PostgreSQL clusters with various types of replication and automatic role switching.

Its special beauty, in my opinion, is that it keeps the cluster state and performs master elections through distributed DCS stores (Zookeeper, etcd and Consul are supported).

Thus the cluster integrates easily into almost any system: you can always find out who is currently the master, and the status of all servers, by querying the DCS or Patroni itself over HTTP.

Well, it's just beautiful :)
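For example, with Consul as the DCS you can peek at the current leader directly (a quick sketch: the KV path follows Patroni's convention of storing cluster state under service/&lt;scope&gt;/, and the Consul address and scope are the ones used later in this article):

~# curl -s 'http://consul.services.local:8500/v1/kv/service/cluster-pgsql/leader?raw'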

I tested how Patroni works: I killed the master and other servers, and restored databases of different sizes (a ~25 GB database is automatically brought up from scratch over a 10 Gb network in a few minutes); overall I liked the Patroni project a lot. After fully implementing the scheme described below, I tested it with a simple bencher that connected to the database at a single address while I killed every element of the cluster (the master server, haproxy, keepalived).

The delay in handing the role over to the new master was a couple of seconds. When the former master returns to the cluster, or a new server is added, no role change occurs.

To automate cluster deployment and the addition of new servers, I decided to use the familiar Ansible (I will give links to the resulting roles at the end of the article). Consul, which we already use, serves as the DCS.

This article has two main goals: to show PostgreSQL users that such a wonderful thing as Patroni exists (it is barely mentioned in the Russian-speaking internet in general, and on Habr in particular), and at the same time to share some experience of using Ansible with a simple example, for those who are just starting to work with it.

I will try to explain the whole process by walking through the Ansible roles and playbooks. Those who do not use Ansible can transfer all the actions to their favorite automated server-management tool, or perform them manually.

Since most of the YAML scripts are long, I will wrap them in spoilers.
The story is divided into two parts: preparing the servers and deploying the cluster itself.

For those who are familiar with Ansible, the first part will not be interesting, so I recommend switching directly to the second.

Part I


For this example I use CentOS 7-based virtual machines. The VMs are deployed from a template that is periodically updated (kernel, system packages), but that topic is beyond the scope of this article.

I will only note that no application or server software was installed on the virtual machines in advance. Any cloud resources, e.g. from AWS, DO, vScale, etc., will do just as well: for them there are dynamic-inventory scripts that integrate with Ansible, or you can attach Terraform, so the whole process of creating and removing servers can be automated from scratch.

First you need to create an inventory of the resources used for Ansible. Mine (as is the default) is located in /etc/ansible. Create the inventory in the file /etc/ansible/hosts:

[pgsql]
cluster-pgsql-01.local
cluster-pgsql-02.local
cluster-pgsql-03.local

We use the internal domain zone .local, so the servers have these names.

Next, you need to prepare each server for the installation of all necessary components, and working tools.

For this purpose, create a playbook in /etc/ansible/tasks:

/etc/ansible/tasks/essentialsoftware.yml
---
- name: Install essential software
  yum: name={{ item }} state=latest
  tags: software
  with_items:
   - ntpdate
   - bzip2
   - zip
   - unzip
   - openssl-devel
   - mc
   - vim
   - atop
   - wget
   - mytop
   - screen
   - net-tools
   - rsync
   - psmisc
   - gdb
   - subversion
   - htop
   - bind-utils
   - sysstat
   - nano
   - iptraf
   - nethogs
   - ngrep
   - tcpdump
   - lm_sensors
   - mtr
   - s3cmd
   - gcc
   - git
   - python2-pip
   - python-devel
- name: install the 'Development tools' package group
  yum:
    name: "@Development tools"
    state: present


The essential set of packages creates a familiar working environment on any server.
The Development tools package group, some -devel packages and Python libraries are needed by pip to build Python modules for PostgreSQL.

We use virtual machines based on VMware ESXi, and for ease of administration they need to run the VMware agent.

To do this, we run the open agent vmtoolsd, and describe its installation in a separate playbook (since not all our servers are virtual, some of them may not need this task):

/etc/ansible/tasks/open-vm-tools.yml
---
- name: Install open VM tools for VMWARE
  yum: name={{ item }} state=latest
  tags: open-vm-tools
  with_items:
   - open-vm-tools
- name: VmWare service start and enabling
  service: name=vmtoolsd.service state=started enabled=yes
  tags: open-vm-tools


To finish preparing a server for the installation of the main software, in our case the following steps are needed:

1) configure time synchronization using NTP
2) install and start the Zabbix agent for monitoring
3) roll out the required ssh keys and authorized_keys.

In order not to bloat the article with details unrelated to the cluster itself, I will briefly quote the Ansible playbooks that perform these tasks:

NTP:

/etc/ansible/tasks/ntpd.yml
---
    - name: setting default timezone
      set_fact:
        timezone: Europe/Moscow
      when: timezone is not defined
    - name: setting TZ
      timezone: name={{ timezone }}
      when: timezone is defined
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd
    - name: Configurating cron for ntpdate
      cron: name="ntpdate" minute="*/5" job="/usr/sbin/ntpdate pool.ntp.org"
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd
    - name: ntpd stop and disable
      service: name=ntpd state=stopped enabled=no
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd
      ignore_errors: yes
    - name: crond restart and enabled
      service: name=crond state=restarted enabled=yes
      tags:
      - tz
      - tweaks
      - ntp
      - ntpd


First, it checks whether a custom timezone is set for this server, and if not, Moscow is set (the majority of our servers use it).

We do not use ntpd because of clock-drift problems on ESXi virtual machines, after which ntpd refuses to synchronize the time (and tinker panic 0 does not help). So we simply run the NTP client from cron once every 5 minutes.

Zabbix-agent:

/etc/ansible/tasks/zabbix.yml
---
    - name: set zabbix ip external
      set_fact:
        zabbix_ip: 132.xx.xx.98
      tags: zabbix
    - name: set zabbix ip internal
      set_fact:
        zabbix_ip: 192.168.xx.98
      when: ansible_all_ipv4_addresses | ipaddr('192.168.0.0/16')
      tags: zabbix
    - name: Import Zabbix3 repo
      yum: name=http://repo.zabbix.com/zabbix/3.0/rhel/7/x86_64/zabbix-release-3.0-1.el7.noarch.rpm state=present
      tags: zabbix
    - name: Remove old zabbix
      yum: name=zabbix2* state=absent
      tags: zabbix
    - name: Install zabbix-agent software
      yum: name={{ item }} state=latest
      tags: zabbix
      with_items:
        - zabbix-agent
        - zabbix-release
    - name: Creates directories
      file: path={{ item }}  state=directory
      tags:
      - zabbix
      - zabbix-mysql
      with_items:
        - /etc/zabbix/externalscripts
        - /etc/zabbix/zabbix_agentd.d
        - /var/lib/zabbix
    - name: Copy scripts
      copy: src=/etc/ansible/templates/zabbix/{{ item }} dest=/etc/zabbix/externalscripts/{{ item }} owner=zabbix group=zabbix  mode=0755
      tags: zabbix
      with_items:
        - netstat.sh
        - iostat.sh
        - iostat2.sh
        - iostat_collect.sh
        - iostat_parse.sh
        - php_workers_discovery.sh
    - name: Copy .my.cnf
      copy: src=/etc/ansible/files/mysql/.my.cnf dest=/var/lib/zabbix/.my.cnf owner=zabbix group=zabbix  mode=0700
      tags:
      - zabbix
      - zabbix-mysql
    - name: remove default configs
      file: path={{ item }} state=absent
      tags: zabbix
      with_items:
        - /etc/zabbix_agentd.conf
        - /etc/zabbix/zabbix_agentd.conf
    - name: put zabbix-agentd.conf to default place
      template: src=/etc/ansible/templates/zabbix/zabbix_agentd.tpl dest=/etc/zabbix_agentd.conf owner=zabbix group=zabbix force=yes
      tags: zabbix
    - name: link zabbix-agentd.conf to /etc/zabbix
      file: src=/etc/zabbix_agentd.conf dest=/etc/zabbix/zabbix_agentd.conf state=link
      tags: zabbix
    - name: zabbix-agent start and enable
      service: name=zabbix-agent state=restarted enabled=yes
      tags: zabbix


When installing Zabbix, the agent config is rendered from a template; only the server address needs to be changed.

Servers located inside our network report to 192.168.x.98, while servers that have no access to it use the external address of the same server.

Deploying ssh keys and configuring ssh has been moved to a separate role, which can be found, for example, on ansible-galaxy.

There are many options, and the essence of the changes is quite trivial, so I don’t see the point of quoting all its content here.

It is time to roll the created configuration onto the servers. In general I install all the components and the cluster itself in one step, with the full config, but for the purposes of this tutorial it seems better to split the process into two steps, matching the two parts of the article.

Create a playbook for a group of servers:

/etc/ansible/cluster-pgsql.yml
---
- hosts: pgsql
  pre_tasks:
    - name: Setting system hostname
      hostname: name="{{ ansible_host }}"
    - include: tasks/essentialsoftware.yml
    - include: tasks/open-vm-tools.yml
    - include: tasks/ntpd.yml
  post_tasks:
    - include: tasks/zabbix.yml
  roles:
     - ssh.role
     - ansible-role-patroni


We start processing all the servers:

Hidden text
~ # ansible-playbook cluster-pgsql.yml --skip-tags patroni

If you have downloaded my full example from the GitHub repository, you will also have the Patroni role, which we do not need to run yet.

The --skip-tags argument makes Ansible skip the steps marked with that tag, so the ansible-role-patroni role will not be executed now.

If it is not on disk, nothing bad will happen: Ansible will simply ignore this flag.

My Ansible logs into the servers directly as root. If you need to run Ansible as an unprivileged user, add the special flag «become: true» to the steps that require root privileges, which makes Ansible use sudo for them.

The preparation is complete.

Part II


We proceed to deploy the cluster directly.

Since configuring the cluster takes a lot of work (installing PostgreSQL with all the components and filling in individual configs for them), I separated this whole process into a dedicated role.

Roles in Ansible allow you to group sets of adjacent tasks, and thus simplify the writing of scripts and their maintenance in working condition.

I took the role template for installing Patroni from https://github.com/gitinsky/ansible-role-patroni, for which I thank its author.
For my purposes I reworked the existing playbook and added my own haproxy and keepalived playbooks.

My roles live in the /etc/ansible/roles directory. Create a directory for the new role, and subdirectories for its components:

~# mkdir -p /etc/ansible/roles/ansible-role-patroni/tasks
~# mkdir -p /etc/ansible/roles/ansible-role-patroni/templates

In addition to PostgreSQL, our cluster will consist of the following components:

1) haproxy for monitoring server status and redirecting requests to the master server.
2) keepalived to ensure a single entry point into the cluster - virtual IP.

We list all the playbooks executed by this role in the file that Ansible runs by default:

/etc/ansible/roles/ansible-role-patroni/tasks/main.yml
- include: postgres.yml
- include: haproxy.yml
- include: keepalived.yml


Next, we begin to describe individual tasks.

The first playbook installs PostgreSQL 9.6 from the official repository, plus additional packages required by Patroni, and then checks Patroni itself out from GitHub:

/etc/ansible/roles/ansible-role-patroni/tasks/postgres.yml
---
- name: Import Postgresql96 repo
  yum: name=https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-centos96-9.6-3.noarch.rpm state=present
  tags: patroni
  when: install is defined
- name: Install PGsql96
  yum: name={{ item }} state=latest
  tags: patroni
  with_items:
    - postgresql96
    - postgresql96-contrib
    - postgresql96-server
    - python-psycopg2
    - repmgr96
  when: install is defined
- name: checkout patroni
  git: repo=https://github.com/zalando/patroni.git dest=/opt/patroni
  tags: patroni
  when: install is defined
- name: create /etc/patroni
  file: state=directory dest=/etc/patroni
  tags: patroni
  when: install is defined
- name: put postgres.yml
  template: src=postgres0.yml.j2 dest=/etc/patroni/postgres.yml backup=yes
  tags: patroni
  when: install is defined
- name: install python packages
  pip: name={{ item }}
  tags: patroni
  with_items:
    - python-etcd
    - python-consul
    - dnspython
    - boto
    - mock
    - requests
    - six
    - kazoo
    - click
    - tzlocal
    - prettytable
    - PyYAML
  when: install is defined
- name: put patroni.service systemd unit
  template: src=patroni.service dest=/etc/systemd/system/patroni.service backup=yes
  tags: patroni
  when: install is defined
- name: Reload daemon definitions
  command: /usr/bin/systemctl daemon-reload
  tags: patroni
- name: restart
  service: name=patroni state=restarted enabled=yes
  tags: patroni


In addition to installing the software, this playbook also uploads the config for the current Patroni server and a systemd unit for launching the daemon, after which it starts the Patroni daemon. The config template and the systemd unit must be in the templates directory inside the role.

Patroni config template:

/etc/ansible/roles/ansible-role-patroni/templates/postgres0.yml.j2
name: {{ patroni_node_name }}
scope: &scope {{ patroni_scope }}
consul:
  host: consul.services.local:8500
restapi:
  listen: 0.0.0.0:8008
  connect_address: {{ ansible_default_ipv4.address }}:8008
  auth: 'username:{{ patroni_rest_password }}'
bootstrap:
  dcs:
    ttl: &ttl 30
    loop_wait: &loop_wait 10
    maximum_lag_on_failover: 1048576 # 1 megabyte in bytes
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        archive_mode: "on"
        wal_level: hot_standby
        archive_command: mkdir -p ../wal_archive && cp %p ../wal_archive/%f
        max_wal_senders: 10
        wal_keep_segments: 8
        archive_timeout: 1800s
        max_replication_slots: 5
        hot_standby: "on"
        wal_log_hints: "on"
pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  - host replication replicator 192.168.0.0/16 md5
  - host all all 0.0.0.0/0 md5
postgresql:
  listen: 0.0.0.0:5432
  connect_address: {{ ansible_default_ipv4.address }}:5432
  data_dir: /var/lib/pgsql/9.6/data
  pg_rewind:
    username: superuser
    password: {{ patroni_postgres_password }}
  pg_hba:
  - host all all 0.0.0.0/0 md5
  - hostssl all all 0.0.0.0/0 md5
  replication:
    username: replicator
    password: {{ patroni_replicator_password }}
    network:  192.168.0.0/16
  superuser:
    username: superuser
    password: {{ patroni_postgres_password }}
  admin:
    username: admin
    password: {{ patroni_postgres_password }}
  restore: /opt/patroni/patroni/scripts/restore.py


Since each cluster server requires an individual Patroni configuration, the config is stored as a jinja2 template (the postgres0.yml.j2 file), and the template step makes Ansible render this template, substituting variables whose values are taken from a separate description for each server.

We specify the variables common to the entire cluster directly in the inventory, which now takes the following form:

/etc/ansible/hosts
[pgsql]
cluster-pgsql-01.local
cluster-pgsql-02.local
cluster-pgsql-03.local
[pgsql:vars]
patroni_scope: "cluster-pgsql"
patroni_rest_password: flsdjkfasdjhfsd
patroni_postgres_password: flsdjkfasdjhfsd
patroni_replicator_password: flsdjkfasdjhfsd
cluster_virtual_ip: 192.xx.xx.125

And per-server variables go into the host_vars/server_name directory:

patroni_node_name: cluster_pgsql_01
keepalived_priority: 99


Let me explain what some of these variables are for:

patroni_scope - the cluster name used when registering in Consul
patroni_node_name - the server name used when registering in Consul
patroni_rest_password - the password for the Patroni HTTP interface (required for sending cluster-changing commands)
patroni_postgres_password - the password for the postgres user. It is set if Patroni creates a new database.
patroni_replicator_password - the password for the replicator user. Replication to the slaves is performed as this user.

This file also lists some other variables used in other playbooks or roles, such as ssh settings (keys, users), the server's timezone, the server's priority in the keepalived cluster, etc.

The configuration for the remaining servers is similar, the server name and priority are changed accordingly (for example, 99-100-101 for three servers).
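For example, the per-server file for the second node could look like this (a sketch that simply follows the article's naming scheme; the path assumes the default /etc/ansible layout):

~# cat > /etc/ansible/host_vars/cluster-pgsql-02.local <<'EOF'
patroni_node_name: cluster_pgsql_02
keepalived_priority: 100
EOF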

Install and configure haproxy:

/etc/ansible/roles/ansible-role-patroni/tasks/haproxy.yml
---
- name: Install haproxy
  yum: name={{ item }} state=latest
  tags:
    - patroni
    - haproxy
  with_items:
    - haproxy
  when: install is defined
- name: put config
  template: src=haproxy.cfg.j2 dest=/etc/haproxy/haproxy.cfg backup=yes
  tags:
    - patroni
    - haproxy
- name: restart and enable
  service: name=haproxy state=restarted enabled=yes
  tags:
    - patroni
    - haproxy


Haproxy is installed on every host; its config lists all the PostgreSQL servers, checks which of them is currently the master, and sends requests to it.
This check uses a wonderful Patroni feature - the REST interface.

When the URL server:8008 is requested (8008 is the default port), Patroni returns a JSON report on the state of the node, and the HTTP response code also shows whether this server is the master: if it is, the response code is 200; if not, 503.

I highly recommend looking through the Patroni documentation; the HTTP interface is quite interesting, as it also allows forced role switching and cluster management.
The same can be done with the patronictl.py console utility shipped with Patroni.
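The check haproxy relies on is easy to reproduce with curl (host names as in this article): the master answers 200, a replica 503, and /patroni returns the full JSON report:

~# curl -s -o /dev/null -w '%{http_code}\n' http://cluster-pgsql-01.local:8008
~# curl -s http://cluster-pgsql-01.local:8008/patroni | python -m json.tool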

The haproxy configuration is quite simple:

/etc/ansible/roles/ansible-role-patroni/templates/haproxy.cfg
global
maxconn 800
defaults
log global
mode tcp
retries 2
timeout client 30m
timeout connect 4s
timeout server 30m
timeout check 5s
frontend ft_postgresql
bind *:5000
default_backend postgres-patroni
backend postgres-patroni
  option httpchk
  http-check expect status 200
  default-server inter 3s fall 3 rise 2
{% for host in groups['pgsql'] %}
  server {{ hostvars[host].patroni_node_name }} {{ host }}:5432 maxconn 300 check port 8008
{% endfor %}


According to this configuration, haproxy listens on port 5000 and forwards traffic to the master server (one server line is generated for each cluster node).

Status checks happen every 3 seconds. To take a server out of rotation, 3 failed checks are required (a non-200 response); to bring it back, 2 successful checks (code 200).
At any time, you can directly contact any haproxy, and it will correctly route traffic to the master server.
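A quick way to convince yourself is to connect through haproxy on any node and check where you landed (a sketch using port 5000 and the admin user from the configs above):

~# psql -h cluster-pgsql-03.local -p 5000 -U admin -d postgres -c 'SELECT pg_is_in_recovery();'
~# # expect 'f' (false): haproxy routed the connection to the master, even though we connected to a slave node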

Patroni also ships a template for configuring the confd daemon, with an example of its integration with etcd, which lets you change the haproxy config dynamically when servers are removed or added.

For now I am building a fairly static cluster; extra automation in this situation, IMHO, can lead to unforeseen problems.

We wanted the clients to need no special logic changes, server liveness tracking, and so on, so we create a single entry point into the cluster using keepalived.

The keepalived daemon talks to its neighbors over the VRRP protocol, and once one of the daemons is elected as the main one (the priority is set in the config and templated into the keepalived_priority variable in host_vars for each server), it brings up the virtual IP address on itself.
The other daemons wait patiently. If the current main keepalived server dies for any reason, or signals a failure to its neighbors, a re-election takes place and the next server by priority takes over the virtual IP address.

To protect against haproxy crashes, the keepalived daemons perform a check by running killall -0 haproxy once per second. It returns code 0 if a haproxy process exists, and 1 if it does not.
If haproxy disappears, the keepalived daemon signals a failure over VRRP and removes the virtual IP.
The virtual IP is immediately picked up by the next highest-priority server that has a live haproxy.
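The check itself is easy to try by hand:

~# killall -0 haproxy ; echo $?   # prints 0 while haproxy is running
~# systemctl stop haproxy
~# killall -0 haproxy ; echo $?   # prints 1 now, and chk_haproxy starts failing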

Install and configure keepalived:

/etc/ansible/roles/ansible-role-patroni/tasks/keepalived.yml
---
- name: Install keepalived
  yum: name={{ item }} state=latest
  tags:
    - patroni
    - keepalived
  with_items:
    - keepalived
  when: install is defined
- name: put alert script
  template: src=alert.sh.j2 dest=/usr/local/sbin/alert.sh backup=yes mode=755
  tags:
    - patroni
    - keepalived
  when: install is defined
- name: put config
  template: src=keepalived.conf.j2 dest=/etc/keepalived/keepalived.conf backup=yes
  tags:
    - patroni
    - keepalived
- name: restart and enable
  service: name=keepalived state=restarted enabled=yes
  tags:
    - patroni
    - keepalived


In addition to installing keepalived, this playbook also copies a simple script for sending alerts via Telegram. The script accepts the message as an argument and simply calls the Telegram API with curl.

In this script, you only need to specify your token and telegram group ID to send alerts.
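The script itself is not quoted in the article, but it boils down to roughly the following (a minimal sketch of alert.sh: the sendMessage call is the standard Telegram Bot API; the token and chat id are placeholders you must fill in):

#!/bin/sh
# /usr/local/sbin/alert.sh: send the first argument as a message to a Telegram group
TOKEN="your_bot_token"      # placeholder: your bot token
CHAT_ID="your_group_id"     # placeholder: your group/chat id
curl -s "https://api.telegram.org/bot${TOKEN}/sendMessage" \
     --data-urlencode "chat_id=${CHAT_ID}" \
     --data-urlencode "text=$1" > /dev/null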

The keepalived configuration is described as a jinja2 template:

/etc/ansible/roles/ansible-role-patroni/templates/keepalived.conf.j2

global_defs {
   router_id {{ patroni_node_name }}
}
vrrp_script chk_haproxy {
        script "killall -0 haproxy"
        interval 1
        weight -20
        debug
        fall 2
        rise 2
}
vrrp_instance {{ patroni_node_name }} {
        interface ens160
        state BACKUP
        virtual_router_id 150
        priority {{ keepalived_priority }}
        authentication {
            auth_type PASS
            auth_pass secret_for_vrrp_auth
        }
        track_script {
                chk_haproxy weight 20
        }
        virtual_ipaddress {
                {{ cluster_virtual_ip }}/32 dev ens160
        }
        notify_master "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became MASTER'"
        notify_backup "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became BACKUP'"
        notify_fault "/usr/bin/sh /usr/local/sbin/alert.sh '{{ patroni_node_name }} became FAULT'"
}


The variables patroni_node_name, cluster_virtual_ip and keepalived_priority are filled in with the corresponding data from the inventory and host_vars.

The keepalived config also calls the script that sends status-change messages to the Telegram channel.

We roll the full cluster configuration onto the servers:

~# ansible-playbook cluster-pgsql.yml

Since Ansible is idempotent, i.e. it performs steps only if they have not been done before, you can run the playbook without additional parameters.

If you don't want to wait, or you are sure the servers are completely ready, you can run ansible-playbook with the -t patroni switch.

Then only steps from the Patroni role will be completed.
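That run looks like this:

~# ansible-playbook cluster-pgsql.yml -t patroni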

Note that I do not explicitly assign a role to any server - master or slave. This configuration creates an empty database, and the first server configured simply becomes the master.

When adding new servers, Patroni will see through DCS that the cluster master is already there, will automatically copy the database from the current master, and connect the slave to it.

If you start a slave that has fallen behind the master for a while, Patroni will automatically bring it up to date using pg_rewind.

Let's make sure all the servers have started and picked up their roles:

~# journalctl -f -u patroni

Messages from the slave (server cluster-pgsql-01):

spoiler
Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,254 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_01
Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,255 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_01
Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,255 INFO: does not have lock
Feb 17 23:50:32 cluster-pgsql-01.local patroni.py[100626]: 2017-02-17 23:50:32,255 INFO: no action. i am a secondary and i am following a leader


Messages from the master (in this case it is cluster-pgsql-02):
spoiler
Feb 17 23:52:23 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:23,457 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:23 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:23,874 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:24 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:24,082 INFO: no action. i am the leader with the lock
Feb 17 23:52:33 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:33,458 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:33 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:33,884 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:52:34 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:52:34,094 INFO: no action. i am the leader with the lock


The logs clearly show that each server constantly monitors its status and the status of the master.
Let's try to stop the master:

~# systemctl stop patroni

spoiler
Feb 17 23:54:03 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:54:03,457 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:54:03 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:54:03,880 INFO: Lock owner: cluster_pgsql_02; I am cluster_pgsql_02
Feb 17 23:54:04 cluster-pgsql-02.local patroni.py[4913]: 2017-02-17 23:54:04,092 INFO: no action. i am the leader with the lock
Feb 17 23:54:11 cluster-pgsql-02.local systemd[1]: Stopping Runners to orchestrate a high-availability PostgreSQL...
Feb 17 23:54:13 cluster-pgsql-02.local patroni.py[4913]: waiting for server to shut down.... done
Feb 17 23:54:13 cluster-pgsql-02.local patroni.py[4913]: server stopped


And here is what happened at that moment on the slave:

spoiler
Feb 17 19:54:12 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:12,353 INFO: does not have lock
Feb 17 19:54:12 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:12,776 INFO: no action. i am a secondary and i am following a leader
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:13,440 WARNING: request failed: GET http://192.xx.xx.121:8008/patroni (HTTPConnectionPool(host='192.xx.xx.121', port=8008
): Max retries exceeded with url: /patroni (Caused by NewConnectionError(': Failed to establish a new connection: [Er
rno 111] Connection refused',)))
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:13,444 INFO: Got response from cluster_pgsql_03 http://192.xx.xx.122:8008/patroni: {"database_system_identifier": "63847
30077944883705", "postmaster_start_time": "2017-02-17 05:36:52.388 MSK", "xlog": {"received_location": 34997272728, "replayed_timestamp": null, "paused": false, "replayed_location": 34997272
728}, "patroni": {"scope": "clusters-pgsql", "version": "1.2.3"}, "state": "running", "role": "replica", "server_version": 90601}
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: server promoting
Feb 17 19:54:13 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:13,961 INFO: cleared rewind flag after becoming the leader
Feb 17 19:54:14 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:14,179 INFO: promoted self to leader by acquiring session lock
Feb 17 19:54:23 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:23,436 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_01
Feb 17 19:54:23 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:23,857 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_01
Feb 17 19:54:24 cluster-pgsql-01 patroni.py: 2017-02-17 23:54:24,485 INFO: no action. i am the leader with the lock


This server took over the master role.

Now return server 2 back to the cluster:

~# systemctl start patroni

spoiler
Feb 18 00:02:11 cluster-pgsql-02.local systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL.
Feb 18 00:02:11 cluster-pgsql-02.local systemd[1]: Starting Runners to orchestrate a high-availability PostgreSQL...
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,186 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,190 WARNING: Postgresql is not running.
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,190 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,398 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,400 INFO: starting as a secondary
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,412 INFO: rewind flag is set
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,609 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,609 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,609 INFO: changing primary_conninfo and restarting in progress
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:13,631 INFO: running pg_rewind from user=superuser host=192.xx.xx.120 port=5432 dbname=postgres sslmode=prefer sslcompression=1
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: servers diverged at WAL position 8/26000098 on timeline 25
Feb 18 00:02:13 cluster-pgsql-02.local patroni.py[56855]: rewinding from last common checkpoint at 8/26000028 on timeline 25
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: Done!
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:14,535 INFO: postmaster pid=56893
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: < 2017-02-18 00:02:14.554 MSK > LOG: redirecting log output to logging collector process
Feb 18 00:02:14 cluster-pgsql-02.local patroni.py[56855]: < 2017-02-18 00:02:14.554 MSK > HINT: Future log output will appear in directory "pg_log".
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: localhost:5432 - accepting connections
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: localhost:5432 - accepting connections
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,790 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,791 INFO: Lock owner: cluster_pgsql_01; I am cluster_pgsql_02
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,791 INFO: does not have lock
Feb 18 00:02:15 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:15,791 INFO: establishing a new patroni connection to the postgres cluster
Feb 18 00:02:16 cluster-pgsql-02.local patroni.py[56855]: 2017-02-18 00:02:16,014 INFO: no action. i am a secondary and i am following a leader


Patroni discovered that it was joining a cluster with an existing master, updated the database to the current state, and correctly assumed the slave role.

Now let's cause a failure in another layer of the cluster, by stopping haproxy on the main keepalived server.

By priority, in my case this role belongs to the second server:

[root@cluster-pgsql-02 ~]# ip a
2: ens160: mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a9:b8:7b brd ff:ff:ff:ff:ff:ff
inet 192.xx.xx.121/24 brd 192.168.142.255 scope global ens160
valid_lft forever preferred_lft forever
inet 192.xx.xx.125/32 scope global ens160 <---- the cluster's virtual address
valid_lft forever preferred_lft forever
inet6 fe80::xxx::4895:6d90/64 scope link
valid_lft forever preferred_lft forever


Stop haproxy:

~# systemctl stop haproxy ; journalctl -fl

Feb 18 00:18:54 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Script(chk_haproxy) failed
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Instance(cluster_pgsql_02) Received higher prio advert
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Instance(cluster_pgsql_02) Entering BACKUP STATE
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: VRRP_Instance(cluster_pgsql_02) removing protocol VIPs.
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_vrrp[25018]: Opening script file /usr/bin/sh
Feb 18 00:18:56 cluster-pgsql-02.local Keepalived_healthcheckers[25017]: Netlink reflector reports IP 192.xx.xx.125 removed

Keepalived caught the problem, removed the virtual address from itself, and signaled its neighbors about it.

Now let's see what happened on the other server:
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:56 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) forcing a new MASTER election
Feb 18 00:18:57 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Transition to MASTER STATE
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Entering MASTER STATE
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) setting protocol VIPs.
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: Opening script file /usr/bin/sh
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Received lower prio advert, forcing new election
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_healthcheckers[41189]: Netlink reflector reports IP 192.xx.xx.125 added
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Received lower prio advert, forcing new election
Feb 18 00:18:58 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125
Feb 18 00:19:03 cluster-pgsql-01.local Keepalived_vrrp[41190]: VRRP_Instance(cluster_pgsql_01) Sending gratuitous ARPs on ens160 for 192.xx.xx.125

The election happened twice (because the third cluster server managed to send its advert before the first election finished); server 1 took the master role and set the virtual IP.

Let's verify:
[root@cluster-pgsql-01 log]# ip a
2: ens160: mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:a9:f0:90 brd ff:ff:ff:ff:ff:ff
inet 192.xx.xx.120/24 brd 192.xx.xx.255 scope global ens160
valid_lft forever preferred_lft forever
inet 192.xx.xx.125/32 scope global ens160 <---- the cluster's virtual address is present!
valid_lft forever preferred_lft forever
inet6 fe80::1d75:40f6:a14e:5e27/64 scope link
valid_lft forever preferred_lft forever

The virtual IP now lives on a server that is not the replication master. This does not matter, though: we access the database through haproxy, which tracks the cluster state on its own and always sends requests to the master.

When haproxy comes back on the second server, another election takes place (the keepalived instance with the highest priority returns to duty), and the virtual IP moves back to its place.

In rare cases a slave cannot catch up with the master (for example, it went down a very long time ago and the WAL segments it needs have already been rotated away). In that case you can completely wipe the data directory on the slave:

"rm -rf /var/lib/pgsql/9.6/data", and restart Patroni. It will copy the entire database from the master.
(Be careful when wiping "unnecessary" databases: double-check which server you are running the command on!!!)


A better way in this case is the patronictl utility: its reinit command safely wipes a specific cluster node, and it will not run against the master.
Thanks to CyberDemon for this addition.
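With the names used throughout this article, re-initializing a stale third node would look roughly like this (a sketch; patronictl asks for confirmation and refuses to reinit the current master):

~# /opt/patroni/patronictl.py -c /etc/patroni/postgres.yml reinit cluster-pgsql cluster_pgsql_03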

The patronictl utility itself lets you see the current cluster state from the command line, without querying the DCS, and manage the cluster.

Cluster status report example:
/opt/patroni/patronictl.py -c /etc/patroni/postgres.yml list cluster-pgsql:
+---------------+------------------+-----------------+--------------+------------------+-----------+
|    Cluster    |      Member      |       Host      |     Role     |      State       | Lag in MB |
+---------------+------------------+-----------------+--------------+------------------+-----------+
| cluster-pgsql | cluster_pgsql_01 | 192.xxx.xxx.120 |    Leader    |     running      |       0.0 |
| cluster-pgsql | cluster_pgsql_02 | 192.xxx.xxx.121 | Sync standby |     running      |       0.0 |
| cluster-pgsql | cluster_pgsql_03 | 192.xxx.xxx.122 |              | creating replica |   33712.0 |
+---------------+------------------+-----------------+--------------+------------------+-----------+



Here the third node is being populated; its lag behind the master is 33 GB.
Once this process completes, it too enters the running state with zero lag.
Note also that the Role field of the third node is empty. That is because my cluster runs in synchronous mode: to reduce the lag caused by synchronous replication, one slave works synchronously and the other in normal asynchronous mode. If the master disappears, the roles shift, and the remaining slave switches to synchronous mode, following whichever slave has become the new master.
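You can also see which standby is currently synchronous by asking the master itself (a sketch that goes through the cluster's virtual IP and haproxy from this article; sync_state is a standard column of the pg_stat_replication view):

~# psql -h 192.xx.xx.125 -p 5000 -U admin -d postgres -c 'SELECT application_name, sync_state FROM pg_stat_replication;'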

Afterword


The only thing this cluster lacks for complete happiness, in my opinion, is connection pooling, plus routing read queries to all the slaves to improve read performance while sending inserts and updates only to the master.

In a configuration with asynchronous replication, offloading reads to the slaves can return stale answers if a slave lags behind the master; this has to be taken into account.

Streaming (asynchronous) replication does not ensure cluster consistency at any given time, and synchronous replication is required for this.

In this mode, the master server will wait for confirmation of copying and applying transactions to the slaves, which will slow down the database. However, if transaction losses are unacceptable (for example, some financial applications), synchronous replication is your choice.

Patroni supports all options, and if synchronous replication is more suitable for you, you just need to change the value of several fields in Patroni configs.
Questions about different replication methods are well-documented in the Patroni documentation.

Someone will probably suggest pgpool, which by itself, in fact, covers all the functionality of this setup: it can monitor databases, proxy requests and set up a virtual IP, and it also implements client connection pooling.

Yes, it can do all of that. But in my opinion the Patroni-based scheme is much more transparent (that is only my opinion, of course), and during my experiments with pgpool I ran into strange behavior of its watchdog and virtual addresses, which I chose not to debug too deeply, deciding to look for another solution instead.

Of course, it is possible that the problem is only in my hands, and I plan to return to testing pgpool later.

However, in any case, pgpool cannot fully automatically manage a cluster, commission new servers and (especially) return failed ones, or work with a DCS. In my opinion, this is the most interesting functionality of Patroni.

Thank you for your attention; I will be glad to see suggestions for further improvement of this solution, and to answer questions in the comments.

Many thanks to Zalando for Patroni, to the authors of the original Governor project that served as the basis for Patroni, and to Alex Chistyakov for the Ansible role template.

The full code of the playbooks and Ansible templates described in the article lives here. I would be grateful for improvements from Ansible and PostgreSQL gurus. :)

Main articles and sources used:

Several options for PgSQL clusters:

https://habrahabr.ru/post/301370/
https://habrahabr.ru/post/213409/
https://habrahabr.ru/company/etagi/blog/314000/

Zalando's blog post about Patroni
Patroni Project
ansible-role-patroni by Alex Chistyakov
Governor - unfortunately the development has long been frozen.
Ansible for DevOps is a great tutorial with tons of Ansible examples.
