
How we replicated NoSQL to a relational DBMS
NoSQL continues to gain popularity these days, but few people know that non-relational DBMSs appeared even before relational algebra itself. Forty and even fifty years ago, only NoSQL products simmered in the primordial broth of the emerging IT industry. And most interestingly, the products born in those difficult times are still alive and doing great.
One of these products is the GT.m DBMS, developed by Greystone Technology in the 1970s and 80s. The DBMS is widely used in medicine, insurance and banking.
Our bank also uses GT.m, and the tool does an excellent job of processing a large number of transactions. But there is one problem: GT.m is not built for analytics. It has no SQL, no analytic queries, none of the things that make financial analysts happy. So we built our own "bicycle" for replicating data from GT.m to "relational" DBMSs.

Psst... want some more GT.m? Even in those prehistoric times, GT.m already had ACID support, transaction serialization, indexes, and the procedural language MUMPS. By the way, MUMPS is not just a language; it is a whole movement that appeared back in the 1960s!
One of the most successful and popular MUMPS-based DBMSs is Caché, and you have most likely heard of it.
The foundation of MUMPS DBMSs is hierarchical structures called globals. Imagine JSON, XML, or the folder-and-file structure on your computer: roughly the same thing. And our fathers and grandfathers were enjoying all of this before it became mainstream.
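A global can be pictured as a sparse nested dictionary addressed by "coordinates". A minimal sketch in Python; the node names here are invented for illustration, not taken from any real system:

```python
# A global is a sparse hierarchical key-value tree.
# Here we model it as nested dicts; all names are hypothetical.
client = {
    "CLIENT": {
        "1001": {
            "NAME": "John Smith",
            "PHONE": {"1": "+1-555-0101", "2": "+1-555-0102"},
        }
    }
}

# Addressing a node by its "coordinates",
# much like ^CLIENT(1001,"PHONE",2) in MUMPS notation:
phone = client["CLIENT"]["1001"]["PHONE"]["2"]
print(phone)  # +1-555-0102
```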
One important point: in 2000 the DBMS went Open Source.
So, old GT.m is reliable and, despite its advanced years, effortlessly serves a large number of specialized transactions, unlike, for example, its SQL counterparts (a flame-war-worthy claim, of course, but for us it is a fact: under certain loads NoSQL is still faster than SQL). All the problems begin, however, when we need to do even the simplest analytics, push data to analytical systems or, God forbid, automate any of it.
For a long time the solution to this problem was the ubiquitous "unloading": CSV files generated by procedures written in the M language (MUMPS). Each such export was developed, tested and deployed by highly qualified specialists; the effort involved was enormous, and the contents of two different exports could overlap significantly. It would happen that a customer requested an export differing from an existing one by just a few fields, and everything had to be done from scratch. On top of that, the M language itself is quite hard to read and understand, which adds further costs to developing and supporting all this hardcode.

Solving the unloading problem
ODS (Operational Data Store)
We already had an established architectural pattern, the ODS (Operational Data Store), into which we replicate our sources with lags ranging from 2 seconds to 15 minutes.
From the ODS we load data into the data warehouse (DWH) or build operational reports on top of it. With relational DBMSs such as Oracle, MS SQL or Sybase there is no problem: source tables are loaded into identical tables in the ODS.

We really wanted to implement similar replication from GT.m into the ODS.
But how do you load NoSQL structures into flat tables? How, for example, do you load a client who may have 2 phone numbers, or maybe 22?
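The classic answer is normalization: the variable-cardinality attribute goes into a child table keyed by the parent. A sketch of the idea, with invented names and data:

```python
# Sketch: flattening a variable-cardinality attribute (phone numbers)
# into a parent table and a child table. Names and data are invented.
clients = [
    {"id": 1, "name": "Ivanov", "phones": ["+7-900-000-0001", "+7-900-000-0002"]},
    {"id": 2, "name": "Petrov", "phones": ["+7-900-000-0003"]},
]

# Parent table: one row per client.
client_rows = [(c["id"], c["name"]) for c in clients]

# Child table: one row per phone, keyed by (client_id, seq).
phone_rows = [
    (c["id"], seq, phone)
    for c in clients
    for seq, phone in enumerate(c["phones"], start=1)
]

print(client_rows)  # [(1, 'Ivanov'), (2, 'Petrov')]
print(phone_rows)
```

A client with 2 phones produces 2 child rows, one with 22 phones produces 22; the table shapes stay fixed either way.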

We understood that it would be best to base replication on the DBMS's binary logs (other DBMSs call them WAL, redo, transaction log and so on), and GT.m does log every transaction and every piece of modified data. Moreover, off-the-shelf products already existed on the market, one of them being the Evolve Replicator from CAV Systems.
Evolve by CAV Systems
Evolve reads the transaction logs, transforms them and writes rows to tables on the relational receiver.
But there was one tiny problem: this solution did not suit us... Our structures contained a large number of computed values (Computed Data Items, or CDI).
Let me explain in simple terms. A CDI is somewhat like a "virtual column" in DBMSs such as Oracle: the value is not stored but is computed at the moment the field is accessed. A CDI can have fairly complex logic and can be based on data from child nodes, and so on. As you have probably guessed, such Computed Data Items cannot be replicated from the DBMS logs: only changes to physical fields are written there, so the logs simply contain no information about them. But we badly need these ghost fields for analytics; they carry complex logic, and an analytical replica without them would be pointless.
Implementing the same computed-field logic on the replica side is unrealistic: first, because of performance, and second, rewriting all that hardcode from M to SQL is a thankless task.
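Why a CDI never shows up in the logs can be seen from a toy model: the value is derived on access and nothing is ever written to storage. Names here are invented for illustration:

```python
# Sketch of a Computed Data Item (CDI): a "virtual" field whose value
# is derived from child nodes at access time and is never stored,
# so it never appears in the transaction log. All names are invented.
account = {
    "id": 42,
    "transactions": {"1": 150.0, "2": -30.0, "3": 75.5},  # child nodes
}

def balance(acc):
    """CDI: computed on access from the child nodes, never stored."""
    return sum(acc["transactions"].values())

# Nothing under account stores the balance; it exists only on read.
print(balance(account))  # 195.5
```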

FIS Profile
Besides the data level, our system also has an application level written in M. Yes, nowadays it sounds wild, but most banking systems still live in the two-tier architecture paradigm.
One such application is FIS Profile (hereinafter Profile), an automated banking system fully integrated with GT.m. Besides banking automation, Profile provides the following functionality:
1. Simple SQL (select * from table where id = 1)
2. Data access via JDBC
3. Representation of globals in tabular form, where one global can be represented as several different tables
4. Triggers
5. Security
In effect, we get one DBMS on top of another: one of them relational, the other NoSQL.
Profile is fully proprietary software, but Open Source counterparts exist, for example VistA FileMan.

Logical levels of our GT.m system.
Concept implementation
To replicate NoSQL data structures into a SQL DBMS, you first of all need to:
1. Present the globals in tabular form.
Moreover, one node of the "tree" may be represented as several interlinked tables. Profile already provides this capability; all we have to do is configure such table views properly. The task, although difficult, is quite solvable.
2. Capture changes.
Unfortunately, the presence of CDIs in our system rules out "proper replication" from the DBMS logs. The only remaining option is logical replication with triggers: a value in a table changes, the trigger catches it and writes the change to a journal table. By the way, a journal table is itself just another global.
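Our capture runs inside GT.m and Profile, but the trigger idea itself is generic. A minimal sketch using SQLite as a stand-in; table and column names are invented:

```python
import sqlite3

# Trigger-based change capture: every update to the data table is
# recorded in a journal table. SQLite stands in for the real system;
# all names here are illustrative, not Profile's actual schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE journal (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,
    acc_id  INTEGER,
    old_val REAL,
    new_val REAL,
    ts      TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER account_upd AFTER UPDATE OF balance ON account
BEGIN
    INSERT INTO journal (acc_id, old_val, new_val)
    VALUES (NEW.id, OLD.balance, NEW.balance);
END;
""")

con.execute("INSERT INTO account VALUES (1, 100.0)")
con.execute("UPDATE account SET balance = 250.0 WHERE id = 1")

rows = con.execute("SELECT acc_id, old_val, new_val FROM journal").fetchall()
print(rows)  # [(1, 100.0, 250.0)]
```

The replicator then only ever reads the journal table; the business tables stay untouched.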
This is what a typical global looks like:

(figure: example global)

Admittedly, it looks strange, to say the least, but in those distant years notions of beauty were quite different. The global structure is also called a "multidimensional sparse array", and the key is, in effect, the coordinate of the data stored under it.
By the way, you can also build indexes on the "data", which is very convenient for the tabular representation.
From such a global we can actually derive two tables:
TABLE_HEADER:

(figure: TABLE_HEADER contents)

TABLE_SHED, the change log:

(figure: TABLE_SHED contents)
By the way, numeric values are converted to dates, for example for the TJD field.
Based on the available tables, a query is executed (shown in the figure), where:

STARTPOINT is the date of the last run;
'T' is the current date (it looks strange, to say the least, but this function is an analogue of sysdate() or now() in other DBMSs).
As you can see, the "tables" are joined, but the join is in fact local, within each node, so it does not create a significant load.
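The original query is not reproduced here, but the incremental-extraction idea it implements can be sketched. SQLite again stands in for Profile's table views, and all names and data are invented:

```python
import sqlite3

# Sketch of incremental extraction: pick up journal entries made since
# the last run and join them to the header table. All names are
# invented; the real query runs against Profile's table views.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table_header (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE table_shed   (id INTEGER, tjd TEXT, field TEXT, new_val TEXT);
INSERT INTO table_header VALUES (1, 'Ivanov'), (2, 'Petrov');
INSERT INTO table_shed VALUES
  (1, '2019-01-01', 'PHONE', '111'),
  (2, '2019-01-05', 'PHONE', '222');
""")

startpoint = "2019-01-02"   # date of the last run
today = "2019-01-31"        # stand-in for GT.m's 'T'

rows = con.execute("""
    SELECT h.id, h.name, s.field, s.new_val
    FROM table_shed s
    JOIN table_header h ON h.id = s.id
    WHERE s.tjd > ? AND s.tjd <= ?
""", (startpoint, today)).fetchall()
print(rows)  # only the change made after the last run is picked up
```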
3. Fetch data from the journal tables and transfer it to the ODS.
The JDBC driver that existed at the time handled atomic queries fine, but leaked memory on massive SELECTs. We had to substantially rewrite the existing driver.
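The usual cure for this class of problem is to stream the result set in bounded batches instead of materializing it all at once. A sketch of the pattern (not the actual driver fix; names are illustrative):

```python
import sqlite3

# Sketch: stream a large result set in fixed-size batches so memory
# stays bounded. This illustrates the general pattern, not the actual
# JDBC driver rewrite; all names are invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE journal (seq INTEGER PRIMARY KEY, payload TEXT)")
con.executemany("INSERT INTO journal (payload) VALUES (?)",
                [(f"row-{i}",) for i in range(10_000)])
con.commit()

def stream(cursor, batch_size=1000):
    """Yield rows batch by batch; memory is bounded by batch_size."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch

cur = con.execute("SELECT seq, payload FROM journal ORDER BY seq")
total = sum(1 for _ in stream(cur))
print(total)  # 10000
```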
4. Deliver and apply the changes.
A very important aspect is applying the data quickly on the receiver. GT.m copes fine with a huge number of atomic transactions, but for relational DBMSs such as Oracle that pattern is a heavy load. And data from a large number of other sources (about 15 in total) is pouring into our ODS at the same time.
To solve this, such operations have to be collected into batches and applied as a group. These are bulk operations, and they depend entirely on the specific receiver DBMS.
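The batching idea itself is simple; a sketch with SQLite's `executemany` standing in for the receiver-specific bulk mechanism (names invented):

```python
import sqlite3

# Sketch of bulk application: instead of one round trip per change,
# accumulate changes and apply them as a single batch. The real bulk
# mechanism depends on the receiver DBMS; names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mirror_client (id INTEGER PRIMARY KEY, name TEXT)")

# Changes accumulated since the last flush:
pending = [(1, "Ivanov"), (2, "Petrov"), (3, "Sidorov")]

# One batched statement instead of len(pending) individual inserts:
con.executemany(
    "INSERT OR REPLACE INTO mirror_client (id, name) VALUES (?, ?)",
    pending,
)
con.commit()

count = con.execute("SELECT COUNT(*) FROM mirror_client").fetchone()[0]
print(count)  # 3
```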

The current replication architecture
Our application (we named it Profile Loader, by the way) loads two types of tables into the ODS: journal and mirror. We will try to cover the ODS in future articles, but in short:
journal tables are change-log tables; they are convenient for incremental loading, for example into analytical systems and the DWH;
mirror tables contain a complete copy of the source data; they are convenient for audit and operational analytics.
5. Control room.
For convenient administration we built a web UI for starting and stopping replication flows. All the main logic is written in Java anyway, which let us use ready-made Open Source components for some specific cases.
Tasks solved by the SQL replica
1. Getting rid of scattered exports. We got a single window for all data consumers.
2. Audit. Auditing is simplified because the data is in a convenient form, and the power of SQL lets you manipulate it quickly and conveniently.
3. Data quality. For example, GT.m has only two data types: numeric and string. When data arrives in the ODS it is converted to other types, including dates. If a date is in the wrong format, we can easily catch the incident and improve data quality right at the source.
4. Calculating the increment for further loading into the DWH.
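The data-quality check from point 3 boils down to "convert or flag". A sketch; the field format and values are invented for illustration:

```python
from datetime import datetime

# Sketch of the data-quality check from point 3: source values arrive
# as strings, and conversion to a date either succeeds or raises an
# incident. The format and sample values are invented.
def to_date(value, fmt="%Y%m%d"):
    try:
        return datetime.strptime(value, fmt).date(), None
    except ValueError:
        return None, f"bad date: {value!r}"

incidents = []
for raw in ["20190131", "20191345", "not-a-date"]:
    parsed, err = to_date(raw)
    if err:
        incidents.append(err)

print(len(incidents))  # 2 bad values caught for follow-up at the source
```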
Further development
In the future we plan to:
1. Completely get rid of the existing CSV exports. They are still alive for now, but we will slowly phase them out.
2. Sort out some performance issues.
3. Share ideas with the interested community and possibly maintain the project as Open Source.
4. Try integration with Oracle GoldenGate at the change-delivery level.
5. Possibly build a reverse replica (Replica -> GT.m, optional, not the ODS) for service processes that improve data quality.
6. Develop operational reporting on top of the ODS.
Summary
In this article we talked about our brainchild, Profile Loader, and about how NoSQL data can be analyzed in a SQL DBMS. The solution may not be the most correct or elegant, but it works great and does its job.
If you decide to replicate your NoSQL database into a "relational" environment for convenient analytics, first evaluate the volume of changes, the data model, and the capabilities of the technologies that will support all of this.
We wish you success in your endeavors!
We are always ready to answer your questions.
PS We also express our gratitude to the colleagues who participated and actively helped in the project: Dmitry Shevelev, Nikolay Chebanov, Roman Bubon, Denis Bystrov, Kaisar Beispekov, Andrey Pavel, Kudyurov, Sergey Vorobyov, Olha Lysych, Lena Kuleshovtech Denis, Julia, Yury Pasynkov and colleagues from CAV Systems and FIS.