Delivering updates from a MySQL database to an application using the libslave replication client



    Any reasonably large project runs into a familiar set of problems, and one of them is how quickly the system can receive updates. Arranging fast delivery of small updates is relatively easy. Receiving large updates once in a while is also not hard. But what if you need to quickly update a large data set?

    For Mail.Ru Target, as well as for any advertising system, quick tracking of changes is important for the following reasons:
    • the ability to quickly turn off the campaign if the advertiser stopped it in the interface or if it runs out of money, which means we won’t show it for free ;
    • convenience for the advertiser: he can change the price of the banner in the interface, and in a few seconds his banners will begin to appear at the new cost;
    • quick response to changing situations: changes in CTR, the arrival of new data for the training of mathematical models. All this allows you to adjust the advertising display strategy, sensitively responding to external factors.

    In this article I will talk about receiving updates to data that lives in large tables in a MySQL database, with a focus on speed and consistency - after all, we would not want to receive a new banner without also receiving its advertising campaign.

    How could we solve this with standard MySQL tools? We could periodically re-read entire tables. That is the least optimal option: along with new data, lots of old, already known data gets transferred, putting a huge load on the network and on the MySQL server. Another option is to prepare the storage schema accordingly: add a last-modified timestamp to every table and select over the needed time intervals. However, if there are many tables, and many machines that must keep a copy of the data, there will be a lot of SELECTs, and the load on the MySQL server again turns out to be high. On top of that, you have to take care of the consistency of the received updates yourself.

    libslave offers us a convenient way to solve the problem:
    • data arrives from the database by itself - there is no need to poll the database when there are no updates at the moment;
    • updates arrive in the order in which they were performed on the master, and we see the whole history of changes;
    • no need to specially prepare tables or add timestamps that are unnecessary from the business-logic point of view;
    • transaction boundaries are visible - that is, the points at which the data is consistent;
    • low load on the master: it does not execute queries and does not burn CPU - it simply sends files.

    This article is about how we use libslave for our purposes. I will briefly describe how MySQL data replication works (I think everyone knows this well, but still), how libslave itself works and how to use it, and I will present benchmark results and a short comparison of the existing implementations.

    How replication works


    The master database writes every statement that changes data or the data schema into a special file, the so-called binary log. When a binary log reaches a certain size, writing switches to the next file. There is a special index of these binary logs, as well as a set of commands for managing them (for example, for deleting old binlogs).

    The slave receives these files over the network in raw form (which simplifies the master's implementation) and applies them to its data. The slave remembers the position up to which it has read the binlogs, so after a restart it asks the master to resume sending logs from the desired position.

    There are two replication modes - STATEMENT and ROW. In the first mode, the master writes into the binlog the original queries it executed to modify the data (UPDATE / INSERT / DELETE / ALTER TABLE / ...). On the slave, all these queries are executed exactly as they were executed on the master.

    In ROW mode, available since MySQL 5.1, it is not the queries that are written to the binlog but the data already modified by those queries (queries that change the data schema (DDL), however, are still written as-is). An event in ROW mode is:
    • one row of data - for the INSERT and DELETE commands: the inserted row is written for INSERT, the deleted row for DELETE;
    • two rows - BEFORE and AFTER - for UPDATE.

    For some mass data changes such binlogs turn out to be much larger than if we had recorded the query itself, but that is a special case. For using ROW replication in our daemon, it is very convenient that we do not need to be able to execute queries, and we immediately get exactly what we need - the changed rows.
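    To make the event shapes concrete, a ROW-mode event can be modeled roughly like this (a simplified illustrative sketch, not libslave's actual structures):

```cpp
#include <map>
#include <optional>
#include <string>

// Simplified model of a ROW-mode binlog event (illustrative only).
// A "row" is a map from column name to its value, kept here as text.
using Row = std::map<std::string, std::string>;

struct RowEvent {
    enum Kind { Insert, Update, Delete } kind;
    Row row;                   // INSERT: inserted row; DELETE: deleted row; UPDATE: AFTER image
    std::optional<Row> before; // UPDATE only: the BEFORE image of the row
};

// An INSERT or DELETE carries one row image, an UPDATE carries two.
inline int image_count(const RowEvent& e) {
    return e.kind == RowEvent::Update ? 2 : 1;
}
```

    This is also why a mass UPDATE produces roughly twice the binlog volume of the equivalent INSERT, which matches the benchmark numbers later in the article.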

    By the way, it is not necessary to enable ROW replication on your most important master server. The master can feed several slaves via ordinary STATEMENT replication (replicas, backups), and some of those slaves can write ROW logs, which the daemons will then read.

    How does it work for us?


    Getting data from the database takes place in two stages:
    1. initial loading of data at the start of the daemon;
    2. receiving updates.

    The initial loading of data can, in principle, be eased by having dumps: the daemon can dump its state to disk and write the binlog position into the same dump. Then, at startup, it can load the data into memory from the dump and continue reading the binlog from the saved position. But when there is no dump, something else is needed. And that something is, of course, SELECT.
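    The dump-based startup can be sketched as follows (a hypothetical format of my invention, not libslave's: a header line with the binlog position, followed by the serialized state):

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Sketch: persist the daemon's state together with the binlog position,
// so that on restart we can load the dump and resume reading the binlog
// from where we stopped. (Hypothetical format, for illustration only.)
void write_dump(const std::string& path, const std::string& binlog_name,
                uint32_t binlog_pos, const std::string& payload) {
    std::ofstream out(path, std::ios::trunc);
    out << binlog_name << ' ' << binlog_pos << '\n' << payload;
}

bool read_dump(const std::string& path, std::string& binlog_name,
               uint32_t& binlog_pos, std::string& payload) {
    std::ifstream in(path);
    if (!(in >> binlog_name >> binlog_pos)) return false;  // no dump or corrupt header
    in.ignore();                                           // skip the newline
    std::getline(in, payload, '\0');                       // rest of the file is the state
    return true;
}
```

    On startup, if read_dump succeeds, we resume replication from (binlog_name, binlog_pos); otherwise we fall back to the SELECT-based load.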

    The data must be selected so that it is consistent in memory. To achieve this, we could select the data from all tables in one transaction. But after that we need to start reading updates via libslave, and we must decide which position to read from. We cannot take the current position after the selects, because during the selects new data could have been written to the database that did not make it into the selects, while the binlog position moved on. We also cannot take the position before the selects start, since between taking the current position and the start of the selects new data may arrive. Starting a transaction with BEGIN and then fetching the current binlog position will not work either: there is no synchronization between them, and while one client executed BEGIN, other clients could be writing data, shifting the binlog position accordingly.

    All these considerations lead us to the idea that ensuring read consistency will partly be the daemon's job. The data in memory is organized so that if an inconsistent object arrives (inconsistent in the sense that we can detect it - for example, it lacks the required links), it simply will not be kept in the daemon's memory; if it later arrives in a consistent state, it will be inserted. If it arrives consistent twice, its last state remains in memory. And if it was consistent in memory and an inconsistent state arrives, we do not apply the inconsistent state, and the object in the daemon's memory remains unchanged.

    Based on all this, the bootstrap model that we consider correct looks like this:
    1. Get the current binlog position p1 on the master - at an arbitrary point in time.
    2. Perform all the selects.
    3. Get the new current binlog position p2 on the master.
    4. Using libslave, read all the events between p1 and p2. The daemon may receive both new objects that appeared during the selects and modified versions of old ones that are already in memory.
    5. After that the daemon has a consistent copy of the data and is ready to work - we can answer requests and receive updates via libslave, starting from position p2.

    I emphasize that since data consistency is maintained by the daemon, we do not need to perform the selects in step 2 inside one transaction. Consider, for example, the following tricky sequence of events:
    1. We get the current binlog position p1.
    2. We start reading the campaign table, which contains campaign1.
    3. A new banner banner1 is created in state 1 in the existing campaign1.
    4. A new campaign campaign2 is created; it does not make it into the select results.
    5. A new banner banner2 is created, attached to campaign2.
    6. banner1 moves to state 2.
    7. We finish reading the campaign table and switch to reading the banner table; banner1 in state 2 and banner2 will make it into this read.
    8. We read the banner table.
    9. A new banner banner3 is created in campaign2.
    10. We get the new current binlog position p2.

    And now let's see what happens in the daemon:
    1. The daemon selects all campaigns and stores them in memory (perhaps checking some criteria - maybe we do not need all campaigns in memory).
    2. The daemon switches to selecting banners. It reads banner1 already in state 2 and remembers it, linking it to the previously read campaign1.
    3. It reads banner2 and throws it away, because there is no campaign2 in its memory - it did not make it into the select results.
    4. The selects are finished. We proceed to read the changes from position p1 to position p2.
    5. We encounter the creation of banner1 in state 1. Recall that it is already in its latest state 2 in the daemon's memory; nevertheless, we apply this update and move the banner to state 1. This is not a problem, because the data will only be used for work after we have read up to position p2, and before that position we will receive the changes to this banner again.
    6. We read the creation of the new campaign2 - and remember it.
    7. We read the creation of banner2 attached to it - now we remember it, because there is a corresponding campaign for it and the data is consistent.
    8. We read the transition of banner1 to state 2 - applied; now it is consistent here as well.
    9. We read the creation of banner3 in campaign2 - inserted into memory.
    10. We reach position p2 and stop - all the data has been loaded consistently, we can serve it to users and read further updates in normal mode.

    Note that at the initial data selection stage, inconsistency of the data does not mean there are errors in the database - it is simply a feature of the loading process. Such inconsistency will be fixed by the daemon later, while reading the updates. But if inconsistencies remain after that - then yes, it really is the database.
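    The "keep only consistent objects" policy from the walkthrough above can be sketched like this (an illustrative model with campaigns and banners, not our production code):

```cpp
#include <cstdint>
#include <map>

// A banner is consistent only if its campaign is already in memory.
struct Banner { uint32_t campaign_id; int state; };

struct Store {
    std::map<uint32_t, bool>   campaigns;  // campaign_id -> present
    std::map<uint32_t, Banner> banners;    // banner_id -> latest consistent state

    void add_campaign(uint32_t id) { campaigns[id] = true; }

    // Apply a banner insert/update: if the referenced campaign is unknown,
    // the object is inconsistent and the update is dropped; otherwise the
    // latest state overwrites whatever was stored before.
    void apply_banner(uint32_t id, const Banner& b) {
        if (campaigns.count(b.campaign_id))
            banners[id] = b;  // consistent: keep the latest state
        // inconsistent: drop the update, leave memory as it was
    }
};
```

    Replaying the scenario above through this model gives exactly the behavior described: banner2 is dropped while campaign2 is unknown and accepted once campaign2 arrives from the binlog.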

    The code looks something like this:
    slave::MasterInfo sMasterInfo;  // fill in the options for connecting to the database
    slave::Slave sSlave(sMasterInfo);  // create the object that will read the data
    // remember the last binlog position
    const slave::Slave::binlog_pos_t sInitialBinlogPos = sSlave.getLastBinlog();
    select();  // select the data from the database
    // get the new last binlog position - it moved while we were selecting
    const slave::Slave::binlog_pos_t sCurBinlogPos = sSlave.getLastBinlog();
    // now we need to read data from the slave up to this position
    init_slave();  // sets the table callbacks, calls Slave::init and Slave::createDatabaseStructure
    sMasterInfo.master_log_name = sInitialBinlogPos.first;
    sMasterInfo.master_log_pos = sInitialBinlogPos.second;
    sSlave.setMasterInfo(sMasterInfo);
    sSlave.get_remote_binlog(CheckBinlogPos(sSlave, sCurBinlogPos));
    

    The CheckBinlogPos functor makes binlog reading stop once it reaches the sCurBinlogPos position. After that, the data is finally prepared for use, and reading from the slave is started from the last position, this time without any functors.
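    The stop-functor itself can be sketched roughly as follows, assuming (as the snippet above suggests) that a binlog position is a (file name, offset) pair; the real CheckBinlogPos queries the Slave object for its current position, which this simplified version receives as an argument instead:

```cpp
#include <cstdint>
#include <string>
#include <utility>

// A binlog position: (log file name, byte offset). Binlog file names sort
// lexicographically (mysqld-bin.000001, mysqld-bin.000002, ...), so plain
// pair comparison orders positions correctly.
using BinlogPos = std::pair<std::string, uint32_t>;

// Returns true (i.e. "stop reading") once the current position has reached
// or passed the target position. (Sketch, not libslave's CheckBinlogPos.)
struct StopAtPos {
    BinlogPos target;
    bool operator()(const BinlogPos& current) const {
        return current >= target;
    }
};
```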

    Inside libslave


    Let's take a closer look at what libslave is. My description is based on the most popular implementation. Below I will compare several forks and one completely different implementation.

    Libslave is a C++ library that can be used in your application to receive updates from MySQL. Libslave is not tied to the MySQL server at the code level; it is built and linked only against the client library, libmysqlclient. The library has been tested against masters from version 5.1.23 to 5.5.34 (not all of them, of course - only those we had at hand).

    To work, we need a MySQL server with enabled recording of binlogs in ROW mode. To do this, it should have the following lines in the config:
    [mysqld]
    log-bin = mysqld-bin
    server-id = 1
    binlog-format = ROW
    

    The user that libslave connects as needs the REPLICATION SLAVE and REPLICATION CLIENT privileges, as well as SELECT on the tables it will process (for tables that appear in the binlogs but are skipped, SELECT is not needed). The SELECT privilege is required to read the table schema.

    The tiny class nanomysql::Connection is built into libslave for executing ordinary SQL queries against the server. In practice we use it not only as part of libslave, but also as a general-purpose MySQL client (we did not want to pull in mysqlpp, mysql-connector, and the like).

    The main class is called Slave. Before starting, we register custom callbacks for events on the tables we want to follow. A callback receives event information in a RecordSet structure: the event type (Write / Update / Delete) and the data (the inserted / updated / deleted record and, in the case of Update, also its previous state).

    When the library is initialized with the master's parameters, the following checks take place:
    1. We verify that we can connect to the server.
    2. We run SELECT VERSION() and make sure the version is at least 5.1.23.
    3. We run SHOW GLOBAL VARIABLES LIKE 'binlog_format' and make sure the binlog format is ROW.
    4. We read the saved position of the last processed event through a user-defined function. If it returns nothing (an empty binlog name and zero position), we read the master's current binlog position via SHOW MASTER STATUS.
    5. We read the schema of the tables we are going to follow.
    6. We generate a slave_id that does not collide with any of those returned by SHOW SLAVE HOSTS.
    7. We register the slave on the master by executing simple_command(COM_REGISTER_SLAVE).
    8. We request the binlog dump with simple_command(COM_BINLOG_DUMP).
    9. We start the loop that processes incoming packets: parsing them, invoking the appropriate callbacks, handling errors.

    The fifth point, reading the database structure, deserves a separate mention. A true MySQL slave always knows the correct table structure, because it started from some SQL dump and then read the binlog from the corresponding position, applying all the DDL statements along the way. Libslave, in the general case, starts from whatever binlog position the user provides (for example, the one saved last time, or the master's current position). In general it has no prior knowledge of the database structure, so it obtains the table schema by parsing the output of SHOW FULL COLUMNS FROM. It is from there that it learns which fields, of which types, and in which order to parse from the binlog. This can cause a problem: we get the current table descriptions, but we may start reading binlogs from an earlier point, when a table still looked different. In that case libslave will most likely fail with an error saying the data is inconsistent, and you will have to start reading from the master's current position.

    Changing table definitions while libslave is running is not a problem. It recognizes ALTER TABLE and CREATE TABLE queries and re-reads the table structures immediately after receiving them. Of course, trouble is still possible here. Suppose we quickly changed the structure of a table twice, writing some data between those two changes. If libslave receives the record of the first ALTER only after the second has already completed, then SHOW FULL COLUMNS FROM will immediately return the second state of the table, and the subsequent update event that corresponds to the first description has a chance of stopping replication. In practice, however, this happens extremely rarely (it has never happened to us), and if it does, it is cured by restarting the daemon from scratch.

    With libslave you can track transaction boundaries. Even though none of a transaction's records reaches the binlog until the transaction completes, distinguishing transactions can still matter: if you have two related changes in different tables, you may not want to use one updated object until the second one has been updated as well. BEGIN events do not appear in the binlog: at the start of a transaction the changed rows simply begin, and they end with a COMMIT. That is, transactions are delimited not by BEGIN / COMMIT, but by two consecutive COMMITs.
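    Grouping events into transactions by COMMIT markers can be sketched like this (an illustrative model; how libslave actually reports events and commits differs in detail):

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch: in ROW binlogs there are no BEGIN markers -- a transaction is just
// the run of row events between two COMMITs -- so we buffer events and hand
// the whole batch over when a COMMIT arrives. A batch is a consistency point.
struct TxGrouper {
    std::vector<std::string> pending;  // events of the currently open transaction
    std::function<void(const std::vector<std::string>&)> on_commit;

    void on_event(const std::string& ev) { pending.push_back(ev); }

    void on_commit_marker() {
        if (!pending.empty()) on_commit(pending);  // deliver a consistent batch
        pending.clear();
    }
};
```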

    If the master has disappeared


    The main libslave loop, invoked by the get_remote_binlog function, takes a user functor as a parameter, which is checked before reading each new packet. If the functor returns true, the loop ends.

    If any error occurs, it is written to the log and the loop continues. In particular, if the server is rebooted, libslave will keep trying to reconnect until it succeeds, after which it continues reading data. It can also get stuck forever: for example, if the logs libslave is reading were removed from under the master, libslave will endlessly complain in the log that the required logs are missing.

    To handle network failures, the connection is configured with a 60-second timeout. If no new packet arrives within 60 seconds (which can simply mean there are no updates), libslave reconnects to the database and continues reading the event stream from the last read position.

    In principle, you can exit the loop earlier, without waiting for the timeout to expire. Suppose you want to be able to terminate the application quickly but correctly with Ctrl+C, without possibly waiting 60 seconds when the network is down or there are no updates. Then it is enough for the signal handler to set a flag that will make the next call of the user functor return true, and to call the Slave::close function, which force-closes the MySQL socket. The blocked packet read then fails with an error, and when the user functor's response is checked, the loop exits.
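    The shutdown scheme can be sketched like this (hypothetical glue code; the Slave::close call is shown only in a comment because the sketch does not create a real Slave object):

```cpp
#include <atomic>
#include <csignal>

// Ctrl+C handling: the handler only sets a flag (and, in a real daemon, would
// also call Slave::close() to break the blocked read); the stop-functor passed
// to get_remote_binlog() then returns true and the read loop exits.
std::atomic<bool> g_stop{false};

extern "C" void on_sigint(int) {
    g_stop.store(true);
    // in the real daemon: sSlave.close();  // force-close the MySQL socket
}

// The functor checked by the main loop before each packet read.
struct StopFlagChecker {
    bool operator()() const { return g_stop.load(); }  // true => stop the loop
};
```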

    Statistics
    The library has an abstract class ExtStateIface, to which libslave reports various information: the number of reconnects, the time of the last event, the database connection status. The same class is responsible for saving and loading the current binlog position to and from some permanent storage. There is a default implementation of this class, DefaultExtState, which works through a mutex (since the slave may set statistics in one thread while someone reads them in another). The sad news is that a correct implementation of this class is required for libslave to work correctly; that is, it is not merely a statistics object - it is an object that can control the operation of the library.
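    In the spirit of ExtStateIface / DefaultExtState, a minimal mutex-protected position store might look like this (a simplified sketch; the real interface also tracks reconnects and other statistics):

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>

// The slave thread saves the last processed binlog position; another thread
// may read it at any time. A mutex keeps the (name, offset) pair coherent so
// a reader never sees the name of one position with the offset of another.
class PositionState {
    mutable std::mutex mtx_;
    std::string log_name_;
    uint32_t log_pos_ = 0;
public:
    void save(const std::string& name, uint32_t pos) {
        std::lock_guard<std::mutex> l(mtx_);
        log_name_ = name;
        log_pos_ = pos;
    }
    std::pair<std::string, uint32_t> load() const {
        std::lock_guard<std::mutex> l(mtx_);
        return {log_name_, log_pos_};
    }
};
```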

    Benchmarks


    Benchmarks were carried out on two sets of machines.

    The first set was one machine on which the database was installed, and the test was performed on it. Configuration:
    • CPU: Intel® Core (TM) i7-4770K CPU @ 3.50GHz
    • mem: 32 GB 1666 MHz
    • MB: Asus Z87M-PLUS
    • HDD: SSD OCZ-VERTEX3
    • OS Gentoo Linux, kernel 3.12.13-gentoo x86_64
    The database settings were left at their defaults. Frankly, I do not think they matter much for the master, which essentially just streams a file over the network.

    The second set consisted of two machines. The first, the DB machine:
    • CPU: 2 x Intel® Xeon® CPU E5620 @ 2.40GHz
    • mem: 7 x 8192 MB TS1GKR72V3N 800 MHz (1.2ns), 1 x 8192 MB Kingston 9965434-063.A00LF 800 MHz (1.2ns) , 4 x Empty
    • MB: ETegro Technologies ETRS370G3
    • HDD: 14 x 300 GB HUS156030VLS600, 2 x 250 GB WDC WD2500BEVT-0
    • PCI: LSI Logic / Symbios Logic SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor], Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
    • OS CentOS release 6.5 (Final) kernel 2.6.32-220.13.1.el6.x86_64

    Test machine:
    • CPU: 2 x Intel® Xeon® CPU E5-2620 0 @ 2.00GHz
    • mem: 15 x 8192 MB Micron 36KSF1G72PZ-1G4M1 1333 MHz (0.8ns), 1 x 8192 MB Micron 36KSF1G72PZ-1G6M1 1333 MHz (0.8ns)
    • MB: ETegro Technologies ETRS125G4
    • HDD: 2 x 2000 GB Hitachi HUA72302, 2 x 250 GB ST250DM 1BD14
    • PCI: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit, Intel Corporation C600 / X79 series chipset 6-Port SATA AHCI Controller
    • OS CentOS release 6.5 (Final) kernel 2.6.32-358.23.2.el6.x86_64

    The network between the machines is 1 Gb/s.

    It should be noted that in both tests the database did not touch the disk, i.e., the binlogs were cached in memory, and the test was not limited by data transfer. CPU utilization in all tests was 100%. This means the bottleneck was the libslave library itself, i.e., we were measuring its performance.

    Two tables were created for the test - small and large:

    CREATE TABLE short_table (
        id int NOT NULL auto_increment,
        field1 int NOT NULL,
        field2 int NOT NULL,
        PRIMARY KEY (id)
    );
    CREATE TABLE long_table (
        id int NOT NULL auto_increment,
        field1 timestamp NOT NULL DEFAULT 0,
        field2 timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        field3 enum('a','b','c') NOT NULL DEFAULT 'a',
        field4 enum('a','b','c') NOT NULL DEFAULT 'a',
        field5 varchar(255) NOT NULL,
        field6 int NOT NULL,
        field7 int NOT NULL,
        field8 text NOT NULL,
        field9 set('a','b','c') NOT NULL DEFAULT 'a',
        field10 int unsigned NOT NULL DEFAULT 2,
        field11 double NOT NULL DEFAULT 1.0,
        field12 double NOT NULL DEFAULT 0.0,
        field13 int NOT NULL,
        field14 int NOT NULL,
        field15 int NOT NULL,
        field16 text NOT NULL,
        field17 varchar(255) NOT NULL,
        field18 varchar(255) NOT NULL,
        field19 enum('a','b','c') NOT NULL DEFAULT 'a',
        field20 int NOT NULL DEFAULT 10,
        field21 double NOT NULL,
        field22 double NOT NULL DEFAULT 1.0,
        field23 double NOT NULL DEFAULT 1.0,
        field24 double NOT NULL DEFAULT 1.0,
        field25 text NOT NULL DEFAULT "",
        PRIMARY KEY (id)
    );
    

    Each table contained one million records. When inserting the data, one text field was filled with a short string; all other fields were left practically empty, taking their default values. That is, this insertion method produced full-fledged binlogs, but most string fields were empty:
    INSERT INTO short_table (field1, field2) values (1, 2);
    INSERT INTO long_table (field5, field25) values ("short_string", "another_short_string");

    Each of these tables was first inserted into the database, after which it was completely updated with queries:
    UPDATE short_table SET field1 = 12;
    UPDATE long_table SET field6 = 12;

    This way we obtained a stretch of binlogs with INSERT events and a stretch with UPDATE events (the latter roughly twice as large, since in addition to the changed row they contain its previous state). The binlog position was recorded before each operation, giving four binlog intervals:
    short table inserts (5429180 - 18977180 => 13548000)
    short table updates (18977180 - 45766831 => 26789651)
    long table inserts (45768421 - 183563421 => 137795000)
    long table updates (183563421 - 461563664 => 278000243)

    The test was compiled by the gcc-4.8.2 compiler with the flags -O2 -fomit-frame-pointer -funroll-loops. Each test was run three times, the results of the third test were taken as a result.

    And now a few tables and graphs. But first, the notation:
    • "no callbacks" means we asked libslave to read a given stretch of binlogs without attaching any callbacks to the tables, i.e., no RecordSet records were created;
    • "with benchmark callbacks" means callbacks were attached that measured per-second throughput (needed for the graphs), trying to minimize their impact on the total test time. They did nothing else: all of libslave's work consisted of parsing the record, creating the RecordSet object(s), and passing them to the user function by reference;
    • "with lockfree-malloc" means the lockfree-malloc allocator was used in the test.

    “Time 1” and “Time 2” are the test execution times for a set of machines 1 and 2, respectively.

    Test                                                                             Time 1, sec   Time 2, sec
    Inserts into the small table, no callbacks                                           00.0299       00.1595
    Inserts into the small table, benchmark callbacks                                    02.4092       03.8958
    Updates to the small table, no callbacks                                             00.0500       00.2336
    Updates to the small table, benchmark callbacks                                      04.8499       07.4892
    Inserts into the large table, no callbacks                                           00.2627       01.1842
    Inserts into the large table, benchmark callbacks                                    20.2901       33.9604
    Inserts into the large table, benchmark callbacks, lockfree-malloc                   19.0906       34.5743
    Updates to the large table, no callbacks                                             00.6225       02.3860
    Updates to the large table, benchmark callbacks                                      40.4330       70.7851
    Updates to the large table, benchmark callbacks, lockfree-malloc                     37.9637       68.3616
    Inserts and updates to both tables, no callbacks                                     00.9499       03.9179
    Inserts and updates to both tables, benchmark callbacks on the short one             08.0445       14.8126
    Inserts and updates to both tables, benchmark callbacks on the long one              62.8213      100.9520
    Inserts and updates to both tables, benchmark callbacks on both                      67.8092      118.3860
    Inserts and updates to both tables, benchmark callbacks on both, lockfree-malloc     64.5951      113.3920


    Below is a graph of the reading speed for the last benchmark on both machines. The graph decreases in steps: small inserts are read fastest, then small updates, then large inserts, and large updates are processed most slowly. This gives a rough idea of the processing speed of each type of binlog event.



    I did not measure DELETE processing speed in this benchmark, but I suspect it is identical to the INSERT speed, since an event of the same length as for an insert is written to the log.

    Other implementations


    At the moment I know of only two libslave implementations. One of them is the already mentioned one, open-sourced by the Begun company some time ago; a lot has been written about it (for example, on OpenNet). This implementation is used in the FreeBSD ports.

    Mail.Ru Group uses a fork of Begun's implementation, which I occasionally patch. Some of the changes in it have also made it into Begun's fork. Among those not yet merged: removal of unused code, fewer header inclusions, more tests (tests for BIGINT and for long SETs), checking the format version of each binlog, mysql-5.5 support, the decimal type (returned as double; naturally, it is not used in billing, only where a rough idea of the balances is enough), and support for bit fields (borrowed from a fork that is now in almost the same state as mine).

    The second implementation I know of is by the aforementioned Pianist and Dimarik. What it looks like architecturally and performance-wise, I have yet to find out.

    Code examples


    There are code examples in the library itself, but I will give a few comments.

    The types.h file shows, via typedefs, the mapping between MySQL types and C++ types. Note that all string types, including BLOBs, are simply std::string, and all integer types are unsigned. That is, even if the table definition says plain int (not unsigned), the library will return the uint32_t type.

    The same file contains two convenient functions for converting the DATE and DATETIME values provided by libslave into an ordinary time_t. These two functions are external (not called inside libslave) for historical reasons: initially libslave returned these dates as strangely encoded numbers, and I did not change that.

    The recordset.h file contains the definition of the RecordSet structure representing a single binlog record. It holds the event type, its time, the names of the database and table it belongs to, and two rows: the new one and the previous one (for Update).

    A row is an associative array from column name to an object of type boost::any, which holds a value of the type described in types.h corresponding to the field.
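    Working with such a row looks roughly like this; libslave uses boost::any, which I model here with std::any (C++17) to keep the sketch self-contained:

```cpp
#include <any>
#include <cstdint>
#include <map>
#include <string>

// A row modeled as "column name -> type-erased value". Per the mapping in
// types.h, integer columns arrive as unsigned types (e.g. uint32_t) and
// string/BLOB columns as std::string.
using AnyRow = std::map<std::string, std::any>;

inline uint32_t get_uint32(const AnyRow& row, const std::string& col) {
    return std::any_cast<uint32_t>(row.at(col));  // throws on wrong type or missing column
}

inline std::string get_string(const AnyRow& row, const std::string& col) {
    return std::any_cast<std::string>(row.at(col));
}
```

    A callback would typically pull the fields it needs out of the row this way and update the daemon's in-memory structures.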

    The main Slave object is described in the Slave.h file.

    The simplest code to start reading replication is:
    void callback(const slave::RecordSet& event) {
        switch (event.type_event) {
        case slave::RecordSet::Update: std::cout << "UPDATE"; break;
        case slave::RecordSet::Delete: std::cout << "DELETE"; break;
        case slave::RecordSet::Write:  std::cout << "INSERT"; break;
        default: break;
        }
    }

    slave::MasterInfo masterinfo;
    masterinfo.host = host;
    masterinfo.port = port;
    masterinfo.user = user;
    masterinfo.password = password;

    slave::Slave slave(masterinfo);
    slave.setCallback("database", "table", callback);
    slave.init();
    slave.createDatabaseStructure();
    slave.get_remote_binlog();
    

    Example of a session with the test client test_client
    For clarity, let's walk through a small example session: creating a user, a database, and a table, filling it with data, and the corresponding output of test_client, an example of a ready-made replication client shipped with the libslave sources.
    The test_client invocation looks like this:
    Usage: ./test_client -h -u -p -d

    See also: dev.mysql.com/doc/internals/en/event-data-for-specific-event-types.html
