Data replication. Attunity Replicate and Greenplum



    In this article, I would like to continue describing the technologies TCS Bank uses to build its DWH. It may be of interest to those who plan to use log-mining Change Data Capture (CDC) to replicate data from operational sources into an online storage built on the Greenplum DBMS.


    Instead of an introduction


    Any modern IT or financial organization runs many different operational accounting systems designed to store and process operational data. Since loading these sources for reporting purposes is not acceptable (it can affect the performance of business-critical processes), organizations build a DWH, and in most cases data is loaded into the warehouse at night, during the period of least load on the operational systems.
    CDC log-mining technology solves several problems at once:
    - it reduces the load on operational sources while data is loaded into the DWH: products implementing this technology mostly read the transaction log files of the source systems, parse them, and apply the changes to the target system, so the parsing is done by third-party software and the load on the sources drops;
    - it provides an online (or near-online) replica of the source data in the target system, which enables near-real-time reporting;
    - it lets ETL developers access replicas of source tables at any time (not only in the windows allocated for reading the source systems);
    - most products can, in parallel with the replicas, maintain so-called change tables that record the entire history of changes to the source tables (a hypothetical sketch of such a table follows below).
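
    As a minimal, hypothetical sketch of what such a change table might look like for a source table accounts(id, balance); the table and column names here are illustrative assumptions, not any specific product's actual layout:

    -- Hypothetical change table for a source table accounts(id, balance)
    create table accounts_ct (
        change_seq   bigint,       -- ordering of changes as read from the transaction log
        change_time  timestamp,    -- commit time of the change at the source
        operation    char(1),      -- 'I' = insert, 'U' = update, 'D' = delete
        id           bigint,       -- replicated key column
        balance      numeric       -- replicated data column (value after the change)
    );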

    Summary figures for the Bank's source systems involved in replication


    The Bank's DWH is built on the Greenplum DBMS; all data sources involved in replication run on the Oracle DBMS. Summary figures for the replicated sources are given in the table:
    Parameter                                     Value
    Number of source systems for replication      5
    Number of replicated tables                   ~200
    Transaction log volume (GB/hour):
      • minimum                                   10
      • maximum                                   160
      • average                                   ~40
    Total number of operations per day            ~50 million

    Search for ready-made solutions


    As a CDC solution, we considered the following products:
    • Oracle GoldenGate. A product currently developed and maintained by Oracle; it impressed us with the stability and performance of its change capture from source systems. Its drawbacks, however, are that it does not support Greenplum out of the box and has no built-in means of initial data loading.
    • Informatica Data Replicator. The product supports Oracle as the source and Greenplum as the target and has initial-load tools. It guarantees data integrity between source and target, but this requires the transaction logs for all transactions that were open at the start of replication to still be available, which is not always feasible given the disk-space limits of the storage systems holding the transaction logs.
    • Attunity Replicate. A new product that supports Oracle as a source and Greenplum as a target and, like Informatica Data Replicator, has initial-load tools. The solution has a simple management console and is aimed at working out of the box.

    The mechanisms for applying changes to the target differ between CDC solutions and largely depend on the level of supplemental logging at the source (I recommend the excellent article on supplemental logging by Alexander Ryndin). Oracle GoldenGate works beautifully with full supplemental logging (successively applying all INSERTs, the last UPDATE per key, and all DELETEs to the target as a batch). A peculiarity of one of our sources is that its tables have many columns and rows are updated very frequently, but each individual update touches only a few fields. Enabling full supplemental logging on all fields of that source caused an enormous growth of the transaction logs, so we decided to enable supplemental logging on primary keys only. Because of this, we had to write complex scripts for aggregating and applying the changes.
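
    For reference, the two logging levels in standard Oracle DDL (the table name is a placeholder):

    -- Full supplemental logging: every column of a changed row is written to redo.
    -- On wide, frequently updated tables this inflates the logs dramatically.
    alter table src_schema.wide_table add supplemental log data (all) columns;

    -- Primary-key-only supplemental logging: redo carries the key plus the
    -- columns actually touched, which is the option we settled on.
    alter table src_schema.wide_table add supplemental log data (primary key) columns;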

    Launching the project in production


    So, we chose Attunity Replicate. It turned out to be the only product that met all the requirements of our pilot project. At the time of deployment, version 2.0 was current.
    That version implemented only one method of applying changes, row-by-row, in which each change on the source was applied to the target as an independent SQL query:
    update target set f1='a', f2='b', f3='c' where key='key1'

    The shortcomings of this approach with Greenplum showed up as soon as we included in the replication data whose volumes (both in row counts and in numbers of operations) exceeded those of the pilot project. Clearly, Greenplum cannot digest the combined transaction volume of the source databases: on the one hand, it is an analytical DBMS not designed for a high transactional load, and on the other hand, the combined transaction volume of all the sources is so large that it would be a problem for any DBMS. For this kind of CDC system, the optimal mode in our view is mini-batches, in which changes are applied to the target database as pre-aggregated transaction details: changes accumulate in the CDC system and are then applied to the target, say, once a minute. We had already used this mode when evaluating GoldenGate, and scripts for aggregating and applying changes had been developed. We handed those scripts to Attunity for study, and after examining them in detail Attunity released Replicate 2.1, which introduced the mini-batch mode. Updates can now be applied as follows:
    update target
    set f1 = decode(net.f1, null, target.f1, '', null, net.f1)
       ,f2 = decode(net.f2, null, target.f2, '', null, net.f2)
       ,...
    from net
    where target.key=net.key
    

    Here net is the table into which Attunity Replicate pre-aggregates ("nets") the changes. With this refinement, the number of queries against the target database dropped sharply, and we can now handle the full volume of changes our sources produce. Attunity named this Turbo Stream CDC technology, and the name is justified: under average source load, data is replicated into Greenplum with a typical delay of 60-80 seconds. There are, of course, problems. Under peak source load (for example, at the close of the trading day) the delay grows, sometimes considerably, up to 1.5 hours. The reason is that Attunity Replicate works through LogMiner, which does not always keep up with the database logs during peak load.
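    A minimal runnable illustration of the decode() convention above, on hypothetical data (a NULL in net means "column not changed", an empty string means "explicitly set the column to NULL"):

    create temp table target (key text, f1 text, f2 text);
    create temp table net    (key text, f1 text, f2 text);
    insert into target values ('k1', 'old1', 'old2');
    insert into net    values ('k1', null,   '');

    update target
       set f1 = decode(net.f1, null, target.f1, '', null, net.f1)
          ,f2 = decode(net.f2, null, target.f2, '', null, net.f2)
      from net
     where target.key = net.key;
    -- Result: f1 stays 'old1' (not touched by the change), f2 becomes NULL.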
    In the process of applying changes, so-called apply conflicts are possible: conflicts in applying changes to the target. The most common are "0 rows affected" (no records were updated on the target) and "duplicate key" (inserting a record that already exists on the target). In normal operation these conflicts can occur during the initial load, because, to guarantee data consistency on the target, Replicate starts capturing changes slightly before the full load begins. Attunity Replicate detects such conflicts and switches to row-by-row mode, recording them in a special system table for further analysis.
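    Both conflict types are easy to reproduce by hand; a minimal sketch on hypothetical data:

    create temp table t (key int primary key, val text);
    insert into t values (1, 'a');

    -- "0 rows affected": the UPDATE finds no matching row on the target.
    update t set val = 'b' where key = 2;   -- UPDATE 0

    -- "duplicate key": the INSERT collides with an existing row.
    insert into t values (1, 'c');          -- ERROR: duplicate key value violates unique constraint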

    Working with LogMiner


    CDC solutions can use two different mechanisms for reading the source transaction logs: their own binary reader or the source DBMS's own facility. In the case of Oracle, that facility is Oracle LogMiner, which can return data in two modes:
    • committed data (returns DML for committed transactions only)
    • raw data (returns data about all transactions)

    In committed-data mode, Oracle LogMiner's performance is poor and it consumes additional resources on the source. In raw mode, performance is much better and fewer resources are needed, but then the application must take on the additional parsing of the results returned by LogMiner and generate from them the SQL statements to apply to the target.
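    For reference, a SQL*Plus sketch of starting LogMiner in each of the two modes using the standard DBMS_LOGMNR package (the log file path is a placeholder):

    -- Register a redo/archive log to mine:
    execute dbms_logmnr.add_logfile(logfilename => '/arch/redo_0001.log', options => dbms_logmnr.new);

    -- Committed-data mode: only DML of committed transactions is returned,
    -- but LogMiner has to buffer and reorder changes, which costs performance.
    execute dbms_logmnr.start_logmnr(options => dbms_logmnr.dict_from_online_catalog + dbms_logmnr.committed_data_only);

    -- Raw mode: drop COMMITTED_DATA_ONLY; operations come back in redo order,
    -- and the reader must track transactions (XID, commit/rollback) itself.
    execute dbms_logmnr.start_logmnr(options => dbms_logmnr.dict_from_online_catalog);

    -- The change stream is then read from V$LOGMNR_CONTENTS:
    select scn, xid, operation, sql_redo from v$logmnr_contents;

    execute dbms_logmnr.end_logmnr;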
    At this stage, Attunity Replicate works through Oracle LogMiner in raw mode, but Attunity plans to stabilize its own Binary Reader, which can be configured to read the transaction logs directly from the storage array where they reside, so that even the slight load on the source system during change capture will disappear entirely.

    Project Current Status


    Version 2.2 is currently running successfully in production, and version 3.0 of Attunity Replicate is being tested in parallel. The load on the source systems (outside the initial-load periods) turned out to be insignificant, since the LogMiner calls use raw-data mode. As for Greenplum, everything is in order there too: although the number of transactions performed by Attunity Replicate is large, it is far smaller than the number of transactions on the source systems, and because each transaction makes only a small volume of changes, the load on the target is also insignificant.


    Instead of a conclusion


    This article is the second step in covering the DWH technologies at our Bank. Ahead is a story about the online warehouse (whose foundation, as you may guess, is log-mining CDC) and much more.
