An “ideal” cluster. Part 3.1: Deploying a MySQL Multi-Master Cluster


Continuing the series of articles on the “Ideal” cluster, I want to share my experience deploying and configuring MySQL multi-master clusters.

My other publications on the “Ideal” cluster

To begin with, it is worth deciding which implementation of MySQL Galera technology you will use. There are Galera implementations from Percona and MariaDB on the market, and these two account for the lion's share of MySQL Galera deployments.

Both forks use Percona's XtraDB storage engine, an enhanced drop-in replacement for InnoDB.

This engine is based on the InnoDB-plugin code and is fully compatible with it, but offers noticeably higher performance thanks to integrated patches from Google and Percona.
In particular, XtraDB improves memory management and the InnoDB I/O subsystem, adds support for multiple read and write threads and for bandwidth management,
implements read-ahead data access and adaptive checkpointing, extends scalability for large projects, adapts the lock subsystem
to systems with many CPUs, and adds extra facilities for accumulating and analyzing statistics.

At the same time, MariaDB Galera has a number of other improvements and features:

  • Aria (formerly Maria) - a highly reliable MyISAM-based storage engine, characterized by increased stability and preservation of data integrity after a crash, with full compatibility with MyISAM. Thanks to operation logging, the results of the current operation are rolled back in the event of a crash. It also supports restoring state from any point in the operation log (including support for CREATE / DROP / RENAME / TRUNCATE).
  • PBXT (PrimeBase XT) - a storage engine developed from scratch that supports multi-version concurrency control (MVCC), which eliminates locks on read operations.
    PBXT supports ACID-compliant transactions, fast transaction rollback, and recovery after server failures. It provides tools for referential integrity, including foreign key definitions, cascading updates, and cascading deletes. It also supports direct streaming input and output of binary data (BLOBs) in the database;
  • FederatedX - positioned as a replacement for the Federated engine developed by Sun Microsystems. FederatedX lets you access remote tables as if they were local.
    It supports transactions, multiple simultaneous connections to a remote DBMS, and "LIMIT" operations;
  • OQGRAPH - a storage engine for organizing hierarchical (tree-like) structures and complex graphs (nodes with many connections);
  • Sphinx - a storage engine for building search systems. The built-in Sphinx client allows MariaDB to exchange data with searchd, run search queries, and receive search results;

In addition, MariaDB Galera 10 has a number of improvements over version 5.5:

  • The new CONNECT storage engine, which provides access to arbitrary local or remote data as if it were stored in a table. For example, you can map a virtual table onto the contents of a file in a specific format;
  • The new Cassandra Storage Engine (SE), which lets MariaDB and MySQL access data stored in a distributed Apache Cassandra database. Using Cassandra SE, developers can read and add data in a Cassandra database using regular SQL queries. The column-family (ColumnFamily) storage model used in Cassandra is mapped onto ordinary MariaDB / MySQL tables, against which you can use the standard SQL statements SELECT, INSERT, DELETE and UPDATE, as well as perform JOIN operations with other tables.
  • Integration of the SPIDER storage engine, which implements a sharding system that distributes large tables across multiple servers. From the point of view of writing queries, such tables do not differ from ordinary local tables, but with SPIDER the chunks of data that make up one table are in fact stored on different servers. To ensure high availability of tables distributed with SPIDER, the new replication tools can be used.
  • The Sequence storage engine for creating virtual tables filled with ascending or descending sequences (for example, seq_1_to_5 or seq_5_to_1_step_2).
  • An improved implementation of dynamic columns, which allows each row in a table to have its own set of "virtual columns".
  • Added support for JSON queries and the ability to integrate with the Cassandra database;
  • Numerous performance optimizations that make a number of operations in MariaDB 10 several times faster than in MySQL and previous MariaDB branches. Key optimizations include support for parallel replication and further development of the group commit system. Additional optimizations for nested queries have been added, for example, conversion of “NOT EXISTS” expressions into “IN” blocks;
  • Improved replication tools. Replicating slave servers are protected from problems in the event of a crash.
  • Added support for replicating data from multiple master servers (multi-source replication). Typical uses include gathering data distributed across different machines into one place for analytical queries, or creating a backup copy;
  • Support for global transaction identifiers;
  • Ability to use IF (NOT) EXISTS checks in ALTER TABLE expressions;
  • Improved error message output. All numeric error codes are now accompanied by explanatory text.
  • Support for the “SHOW EXPLAIN FOR thread_id” statement for examining a query running in a given thread. Since “SHOW EXPLAIN” reflects the optimizer's plan for the real running query, it gives figures closer to reality than running the query inside “EXPLAIN”;
  • Additional InnoDB optimizations that significantly speed up transactions performing no writes or data changes. A new command, “TRANSACTION READ ONLY”, has been added for read-only transactions;
  • Optimized execution of the “LIMIT ... ORDER BY” construct;
  • Support for automatic timestamp updating in DATETIME columns;
  • In-memory tables with efficient support for the VARCHAR and BLOB types;
  • A universal system for collecting statistics on activity and table contents for use by the query optimizer, implemented without reference to specific storage engines;
  • Support for analyzing memory consumption per individual thread;
  • Significantly faster ALTER TABLE for the Aria and MyISAM storage engines when unique key checks are present;

Improvements ported from MySQL 5.6:

  • An updated version of the InnoDB storage engine.
  • Support for the PERFORMANCE_SCHEMA engine and the performance_schema database associated with it, which provides low-level tools for monitoring query execution and various events during DBMS operation;
  • Read-only mode for transactions in InnoDB, support for the expression "TRANSACTION READ ONLY";
  • Optimization of query execution speed of the form "ORDER BY ... LIMIT".
  • Support "--plugin-load-add";
  • Ability to perform "ALTER TABLE" online;
  • Setting privileges for temporary tables;
  • Extensions related to encoding support;
  • The expression "GET DIAGNOSTICS";
  • Temporal literals (e.g. TIME'12:34:56').

I should add that both forks also support the HandlerSocket and Memcached plugins.

A more detailed description of the stable release of the MariaDB 10.0 DBMS can be found in the source article on opennet.

Why did I choose MariaDB Galera 10?


MariaDB Galera 10 supports the MySQL Query Cache out of the box. Installation instructions for every MySQL Galera implementation explicitly tell you to disable the Query Cache. As a result, when moving from a single database server to a cluster, the speed of complex read queries drops several-fold, and the load on the server grows proportionally.
Percona XtraDB Cluster 5.6 has also come close to full Query Cache support, but there you have to enable it “live”, after the node starts, using the queries:

SET GLOBAL query_cache_size = 128*1024*1024;
SET GLOBAL query_cache_type = 1;

When Query Cache is enabled, up to 95% of queries return the result from the cache instead of being executed again.

I want to immediately give a couple of my comments.

The cache should not be large. The largest size generally worth setting is no more than 512MB, and even 512MB is a lot; you really need less. Here is why:

If changes occur in any table whose result sets are in the cache (rows are inserted or changed), MySQL removes those result sets from the cache. This approach speeds MySQL up, but it can be inefficient for systems with many table-modifying queries: tables simply end up stuck in the "Waiting for query cache lock" state.

The query cache can be thought of as a hash whose keys are queries and whose values are query results.
If the query cache is enabled, then on receiving a query MySQL checks whether its first three characters are “SEL”. If so, MySQL looks for an entry in the query cache with a key equal to the query.

Two important rules follow from here:

  • MySQL compares queries byte by byte, so queries that differ by even one character (for example, SELECT * FROM table and select * from table) are treated as two different queries. Therefore you should write queries in a single consistent style;
  • In MySQL prior to version 5.0, queries that have a space or a comment at the beginning will never be taken from the cache.

In addition to the results, MySQL stores in the cache the list of tables from which each cached result set was selected.
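To check whether the cache is actually earning its keep, the standard counters can be inspected; the ~95% figure above corresponds to a high ratio of Qcache_hits to Com_select:

```sql
-- Current cache configuration:
SHOW VARIABLES LIKE 'query_cache%';
-- Runtime counters: hits, inserts, prunes due to low memory, etc.:
SHOW GLOBAL STATUS LIKE 'Qcache%';
-- Hit ratio is roughly Qcache_hits / (Qcache_hits + Com_select):
SHOW GLOBAL STATUS LIKE 'Com_select';
```

A steadily growing Qcache_lowmem_prunes counter is the sign that the cache is churning rather than helping.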

More information about the query cache can be found in the source on habrahabr

From words to deeds

I assume you have already decided which implementation you will use. From here on I describe working with MariaDB Galera 10, but almost everything below is also true for Percona XtraDB Cluster 5.6.

When converting a single MySQL installation into a cluster, we must:

  • Make sure that none of our databases contain tables using the MyISAM engine:
    SELECT table_schema, table_name FROM INFORMATION_SCHEMA.TABLES WHERE engine = 'myisam';
  • Make sure that all tables in our databases have primary keys:

    SELECT table_catalog, table_schema, table_name, engine
    	FROM information_schema.tables
    	WHERE (table_catalog, table_schema, table_name) NOT IN
    	(SELECT table_catalog, table_schema, table_name
    	FROM information_schema.table_constraints
    	WHERE constraint_type = 'PRIMARY KEY')
    	AND table_schema NOT IN ('information_schema', 'pg_catalog');

There are two ways to solve the first problem:

# Option 1

mysql database_name -e "show table status where Engine='MyISAM';" | awk 'NR>1 {print "ALTER TABLE "$1" ENGINE = InnoDB;"}' | mysql database_name

# Option 2

mysql database_name -e "show table status where Engine='MyISAM';" | awk '{print $1}' | xargs -t -i pt-online-schema-change --alter-foreign-keys-method=auto --alter "ENGINE=InnoDB" --execute --statistics --set-vars="wait_timeout=10000,innodb_lock_wait_timeout=10,lock_wait_timeout=180" --progress=time,1 D=database_name,t={}

For small tables, the first option works quite quickly. But with large tables there are problems: since the conversion takes a long time, the table is locked and all operations on it become impossible, which will certainly affect service availability. The pt-online-schema-change utility from the percona-toolkit will help us solve this.

This utility is installed from the repository for CentOS :

rpm -Uhv

Important: the table being converted must have either a primary (PRIMARY) or a unique (UNIQUE) key, otherwise the tool reports an error such as this:
Cannot chunk the original table `database`.`NAMETABLE01_NOKEY`: There is no good index and the table is oversized. at /usr/bin/pt-online-schema-change line 5442.

To solve the second problem there is, alas, only one way: add a PRIMARY or UNIQUE key through ALTER TABLE.

All tables should have a primary key (multi-column primary keys are supported). DELETE operations are unsupported on tables without a primary key. Also, rows in tables without a primary key may appear in a different order on different nodes.

That is, node drops, deadlocks, and other problems are possible, and row order can diverge between nodes. This must be fixed first.
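A minimal sketch of such an ALTER; the table name `orders` and column name `id` are placeholders:

```sql
-- Add a surrogate auto-increment primary key to a table that has none.
ALTER TABLE orders
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
```

If the table already has a column combination that is unique and non-NULL, promoting it with ADD PRIMARY KEY (col1, col2) is preferable to adding a new column.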

With these problems behind us, we can proceed to installing and configuring the database server itself.

cat > /etc/yum.repos.d/MariaDB.repo << EOL
# MariaDB 10.0 CentOS repository list - created 2015-02-18 14:04 UTC
[mariadb]
name = MariaDB
baseurl =
EOL

yum install MariaDB-Galera-server MariaDB-client rsync galera ntp nscd

chkconfig nscd on && /etc/init.d/nscd start

# You need to disable selinux, this is a requirement of MariaDB developers

sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
echo 0 > /selinux/enforce

# On all nodes the time must be set correctly; this is mandatory. Otherwise you will find that during SST from the donor the joining node just waits for something without any sign of activity.

yum install ntp -y
chkconfig ntpd on
/etc/init.d/ntpd stop
/etc/init.d/ntpd start

To configure the MariaDB servers and the Galera cluster I wrote a script that creates a configuration file for each server individually.

I want to stress again: the output is a template that requires further editing.

# wget --no-check-certificate -q -O - '' | bash -x -
# fetch -o ''
# sh
if [ "$(uname)" == 'Linux' ]; then
    # Linux: 60% of total RAM for the InnoDB buffer pool, 2 threads per CPU
    IBS=innodb_buffer_pool_size\ \=\ $((`free -m | grep Mem | awk '{print $2}'`*60/100000))G;
    cpu=$((`cat /proc/cpuinfo | grep -c processor`*2))
else
    # FreeBSD branch
    IBS=innodb_buffer_pool_size\ \=\ $((`dmesg | grep real\ memory | awk '{print $5}' | cut -c 2- | tail -1`*60/100000))G;
    cpu=$((`sysctl hw.ncpu | awk '{print $2}'`*2))
fi
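As a sanity check on the sizing arithmetic, here is the same computation with the amount of RAM hard-coded (4096 MB is just an example value; the real script reads it from `free -m`):

```shell
#!/bin/sh
# 60% of RAM in whole gigabytes, exactly as the generator script computes it:
# megabytes * 60 / 100000  (integer division; ~= MB * 0.6 / 1000)
mem_mb=4096                       # stand-in for: free -m | awk '/^Mem/{print $2}'
ibs_gb=$((mem_mb * 60 / 100000))
echo "innodb_buffer_pool_size = ${ibs_gb}G"
```

With 4096 MB this prints innodb_buffer_pool_size = 2G. Note that below roughly 1.7 GB of RAM the integer division yields 0G, so the generated value needs a manual check on small machines.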
mkdir -p ~/backup/mysql > /dev/null 2>&1
mkdir $conf/mysql.d > /dev/null 2>&1
mkdir $conf/mysql.d/ssl > /dev/null 2>&1
mkdir /var/log/mysql > /dev/null 2>&1
chown mysql:mysql $conf/mysql.d
chown mysql:mysql $conf/mysql.d/ssl
chown -R mysql:mysql /var/log/mysql
if [ -f $conf/my.cnf ]; then
    cp $conf/my.cnf ~/backup/mysql/my.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/000-galera.cnf ]; then
    cp $conf/mysql.d/000-galera.cnf ~/backup/mysql/000-galera.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/001-server.cnf ]; then
    cp $conf/mysql.d/001-server.cnf ~/backup/mysql/001-server.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/002-myisam.cnf ]; then
    cp $conf/mysql.d/002-myisam.cnf ~/backup/mysql/002-myisam.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/003-rep-master.cnf ]; then
    cp $conf/mysql.d/003-rep-master.cnf ~/backup/mysql/003-rep-master.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/004-rep-slave.cnf ]; then
    cp $conf/mysql.d/004-rep-slave.cnf ~/backup/mysql/004-rep-slave.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/005-mariadb-opt.cnf ]; then
    cp $conf/mysql.d/005-mariadb-opt.cnf ~/backup/mysql/005-mariadb-opt.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/006-ssl.cnf ]; then
    cp $conf/mysql.d/006-ssl.cnf ~/backup/mysql/006-ssl.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/007-handlersocket.cnf ]; then
    cp $conf/mysql.d/007-handlersocket.cnf ~/backup/mysql/007-handlersocket.cnf.`date +%Y-%m-%d_%H-%M`
fi
if [ -f $conf/mysql.d/008-threadpool.cnf ]; then
    cp $conf/mysql.d/008-threadpool.cnf ~/backup/mysql/008-threadpool.cnf.`date +%Y-%m-%d_%H-%M`
fi
cat > $conf/my.cnf << EOL
!includedir $conf/mysql.d/
EOL
# galera-only
cat > $conf/mysql.d/000-galera.cnf << EOL
wsrep_provider = /usr/lib64/galera/
wsrep_cluster_address = gcomm://,,,,,,
# Node4 address
wsrep_node_address =
# Cluser name
wsrep_cluster_name = cluster
wsrep_node_name = prod-db-new-04
#wsrep_slave_threads = $cpu
innodb_autoinc_lock_mode = 2
# SST method
wsrep_sst_method = xtrabackup
wsrep_sst_auth = "sstuser:s3cretPass"
##wsrep_sst_method = rsync
wsrep_retry_autocommit = 3
wsrep_provider_options = "gcache.size=5G; repl.commit_order=1; gmcast.segment=2"
EOL
cat > $conf/mysql.d/001-server.cnf << EOL
default_storage_engine = InnoDB
innodb_file_per_table = 1
#character-set-server = utf8
# network
connect_timeout = 600000
wait_timeout = 28800
max_connections = 600
max_allowed_packet = 512M
max_connect_errors = 10000
net_read_timeout = 600000
net_write_timeout = 600000
# innodb engine settings
innodb_open_files = 512
innodb_buffer_pool_instances = 2
innodb_file_format = barracuda
innodb_locks_unsafe_for_binlog = 1
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
transaction-isolation = READ-COMMITTED
innodb-data-file-path = ibdata1:10M:autoextend
innodb-log-file-size = 256M
innodb_log_buffer_size = 8M
# performance settings
query_cache_size                = 128M
query_cache_type                = 1
query_cache_min_res_unit        = 2K
join_buffer_size        = 8M
read_rnd_buffer_size    = 3M
table_definition_cache  = 2048
table_open_cache        = 2048
thread_cache_size       = 128
tmp_table_size          = 2048M
max_heap_table_size     = 2048M
log_error	=	/var/log/mysql/mysql-error.log
#slow_query_log_file	=	/var/log/mysql/mysql-slow.log
EOL
# myisam
cat > $conf/mysql.d/002-myisam.cnf << EOL
key_buffer_size = 512M
EOL
# rep-master
cat > $conf/mysql.d/003-rep-master.cnf << EOL
#log-bin = /var/log/mysql/mysql-bin
# cluster
# binlog_format=ROW
# single installation
server_id = 226
sync-binlog = 0
expire-logs_days = 3
max-binlog-size = 1G
EOL
# rep-slave
cat > $conf/mysql.d/004-rep-slave.cnf << EOL
slave-skip-errors = 1062
log_slave_updates = 1
relay-log = /var/log/mysql/mysql-relay-bin
relay-log-index = /var/log/mysql/mysql-relay-bin.index
relay-log-info-file = /var/log/mysql/
# replicate-rewrite-db=from_name->to_name
# replicate-ignore-table=db_name.table_name
# replicate-wild-ignore-table=db_name.table_name
EOL
# mariadb-opt
cat > $conf/mysql.d/005-mariadb-opt.cnf << EOL
EOL
# ssl
cat > $conf/mysql.d/006-ssl.cnf << EOL
#ssl-ca          =  $conf/mysql.d/ssl/ca-cert.pem
#ssl-cert        =  $conf/mysql.d/ssl/server-cert.pem
#ssl-key         =  $conf/mysql.d/ssl/server-key.pem
EOL
# handlersocket
cat > $conf/mysql.d/007-handlersocket.cnf << EOL
EOL
# threadpool
cat > $conf/mysql.d/008-threadpool.cnf << EOL
thread_handling = pool-of-threads
thread_pool_size = $cpu
EOL

Life does not stand still, and I, like you, keep developing. It is better to take the latest version of the script straight from its permanent page; much has probably changed in it since this article was written.

Explanations for the config and script generation

wsrep_sst_method = xtrabackup

If you use rsync mode, the donor is completely blocked for writes while a node synchronizes from it. In xtrabackup mode the lock lasts only a few seconds, while xtrabackup "attaches" to the database.
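For the xtrabackup SST method to work, the user named in wsrep_sst_auth must exist on every node with backup privileges. A sketch, using the sstuser:s3cretPass pair from the generated config above (the exact privilege list may vary by xtrabackup version):

```sql
-- Run on the first node before the others join the cluster:
CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cretPass';
GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';
FLUSH PRIVILEGES;
```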
If you use HAProxy as described in “HAProxy for Percona or Galera on CentOS. Its configuration and monitoring in Zabbix”, then to keep working with a server while it is in donor mode we need to edit the clustercheck script on the nodes.

# Replace the line

# with the line

In this mode, if all cluster nodes go down completely, we can keep node-synchronization downtime to a minimum.

transaction_isolation = REPEATABLE-READ

It is worth trying to change this to transaction-isolation = READ-COMMITTED, i.e. switching to snapshot-based transaction execution. Each transaction becomes a kind of independent sandbox, a snapshot of the data.
This is somewhat similar to Oracle's isolation level. All SELECT ... FOR UPDATE and SELECT ... LOCK IN SHARE MODE statements lock only the index records themselves, not the gap before them, so they freely allow new records to be inserted after the locked ones. UPDATE and DELETE statements that use a unique index with a unique search condition lock only the index record found, not the gap before it. But for range-type UPDATE and DELETE, InnoDB must set next-key or gap locks and block inserts by other users into the covered range. This is necessary because phantom rows must be locked for replication and recovery to work correctly in MySQL. Consistent reads work just as in Oracle: every consistent read, even within a single transaction, sets and reads its own fresh snapshot.
In most cases the switch speeds up concurrent writes, but the phantom-read effect becomes possible. In my practice I have met only one application that suffered from phantoms. Still, applications using the DBMS need to be checked for their ability to work in this mode.
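The isolation level can be tried out at runtime before baking it into the config (standard MySQL syntax; @@tx_isolation is the pre-5.7 variable name):

```sql
-- Try READ COMMITTED for the current session only:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Or server-wide, for new connections:
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
-- Verify:
SELECT @@tx_isolation;
```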

innodb_flush_log_at_trx_commit = 2

A value of "1" means that every completed transaction synchronously flushes the log to disk. This is the default and the most reliable option in terms of data safety, but the slowest.
A value of "2" does the same, except that the log is flushed not to disk but to the operating system cache (that is, a real flush does not happen after every operation). This value is suitable in most cases, because it avoids an expensive write after each transaction; the log reaches the disk with a delay of a few seconds, which is still reasonably safe from a data-safety point of view.
And we have a cluster anyway: in case of a crash the data will still be transferred from a donor. The main thing is that the transaction is committed on the other nodes; then we get the data back via SST.

innodb_buffer_pool_instances = 2

By default, InnoDB uses a single instance for the Buffer Pool.
However, several instances can be configured, and in some cases InnoDB works with them noticeably more efficiently, because there is less lock contention on the cache when writing data.

innodb_file_format = barracuda

This format is the newest and supports compression, which lets you reduce the IO (disk) load. As a recommendation, you can use a 16K compressed block size.

Here is an alter example:
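A sketch of what such an ALTER looks like (the table name is a placeholder; barracuda and innodb_file_per_table must already be enabled):

```sql
-- Convert a table to the compressed Barracuda row format:
ALTER TABLE big_table
    ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=16;  -- compressed block size in KB; 1, 2, 4, 8 or 16
```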

Here are the results of testing the speed and size of data under compression.
But compression has downsides. ALTERs take much longer on compressed tables, and ALTER, like any DDL, locks the table and with it the entire cluster. ALTER is not a transactional statement, which means it is not replicated as ROW binary diffs but is transmitted as the statement itself, and until that statement has executed on all cluster nodes, all commits are frozen.
That is, compression makes sense either on large tables where no DDL is planned at all, or on standalone mysql instances.

innodb_flush_method = O_DIRECT

Flush data bypassing the OS disk cache. This prevents double buffering of data in the innodb_buffer_pool and the file-system cache, and allows more rational use of RAM.

It is also worth adding the skip-innodb_doublewrite parameter, which matters from a performance standpoint.

Even though double write requires each page written twice its overhead is far less than double. Write to double write buffer is sequential so it is pretty cheap. It also allows Innodb to save on fsync()s – instead of calling fsync() for each page write Innodb submits multiple page writes and calls fsync() which allows Operating System to optimize in which order writes are executed and use multiple devices in parallel. This optimization could be used without doublewrite though, it was just implemented at the same time. So in general I would expect no more than 5-10% performance loss due to use of doublewrite.

tmp_table_size = 2048M
max_heap_table_size = 2048M

Very interesting parameters; their values should be pushed to the maximum. This reduces the number of temporary tables created on disk. As a rule, it is precisely the creation of temporary tables on disk that takes up most of the time in sorts, groupings, and other complex SELECTs.
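Whether the limits are high enough can be checked with the standard counters; if the on-disk share keeps growing, raise the limits:

```sql
-- Created_tmp_disk_tables vs Created_tmp_tables gives the on-disk share:
SHOW GLOBAL STATUS LIKE 'Created_tmp%';
```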


There can be problems with application compatibility with the database after migrating to Percona 5.6 and Galera 10. The most significant of them are worth heading off immediately with the parameters

thread_handling = pool-of-threads
thread_pool_size = number_of_cores

That is, it is worth using the thread pool.

wsrep_retry_autocommit = 3

Important! If a deadlock occurs in the database, commits will be retried, i.e. the node will not drop out of the cluster at the first hiccup but will keep working, and we will not lose the commit.

wsrep_provider_options = "gcache.size=5G; repl.commit_order=1; gmcast.segment=2"

A detailed description is available in the linked source; I usually always set these parameters by default.

The wsrep_replicate_myisam = 1 parameter is almost a 100% guarantee of cluster death if even one production MyISAM table appears there.

This feature is still experimental, and enabling it adds statement-based replication on top of ROW (binary-diff-based) replication, the same as when replicating DDL commands. This means constant conflicts, locks, and cluster collapse after almost any write to a MyISAM table.

If you have questions, difficulties or need advice:
My contacts in the profile
