Big Data at Raiffeisenbank

    Hello! In this article we will talk about Big Data at Raiffeisenbank. But before getting to the point, I would like to clarify what we mean by Big Data: over the past few years the term has been used in so many contexts that its boundaries have blurred and its meaning has been diluted. At Raiffeisenbank we identified three areas that we consider to be Big Data:



    (Note that although this scheme looks quite simple, there are many borderline cases. When they arise, we rely on an expert assessment to decide whether Big Data technologies are needed for the incoming task or whether we can get by with solutions based on "classic" RDBMS technologies.)

    This article will focus primarily on the technologies used and the solutions developed with their help.

    First, a few words about the prerequisites for interest in technology. By the time work on Big Data began, the bank had several solutions for working with data:

    • Data Warehouse (DWH, Enterprise Data Warehouse)
    • Operational Data Store (ODS)

    What made us look towards Big Data?


    The business expected IT to deliver a universal solution that would let us analyze all the data available to the bank as efficiently as possible, in order to build digital products and improve the customer experience.

    At that time, DWH and ODS had limitations that prevented them from evolving into universal tools for analyzing all of our data:

    1. The stringent data quality requirements for DWH noticeably delay data availability in the warehouse (data becomes available for analysis only the next day).
    2. Lack of historical data in ODS (by definition).
    3. The relational DBMS behind ODS and DWH can only handle structured data, and having to define a data model before writing to DWH / ODS (schema on write) entails additional development costs.
    4. No horizontal scaling, and vertical scaling is limited.

    Once we recognized these limitations, we decided to look towards Big Data technologies. It was already clear that competence in this area would become a competitive advantage, so we needed to build up expertise internally. Since the bank had no hands-on experience at the time, we essentially had two options:

    - either build a team from the market (external hires);
    - or find enthusiasts through internal transfers, without actually expanding headcount.

    We chose the second option because it seemed more conservative to us.

    Later we came to understand that Big Data is just a tool, and that a specific problem can be solved with this tool in many different ways. The problem we had to solve imposed the following requirements:

    1. We must be able to analyze data jointly, in all the variety of its forms and formats.
    2. We must be able to solve a wide range of analytical problems - from flat deterministic reports to exotic visualizations and predictive analytics.
    3. We need to find a compromise between large data volumes and the need to analyze them online.
    4. Ideally, we need an unlimitedly scalable solution ready to serve requests from a large number of employees.

    After studying the literature, reading forums and reviewing the available information, we found that a solution meeting these requirements already exists as an established architectural pattern called a "Data Lake". By deciding to build a Data Lake, we aimed to get a self-sufficient "DWH + ODS + Data Lake" ecosystem able to handle any data-related task, be it management reporting, operational integration or predictive analytics.

    Our Data Lake implements a typical lambda architecture, in which the incoming data is divided into two layers:



    - the “speed” layer, which mainly processes streaming data: volumes are small and transformations are minimal, but this is where we achieve the lowest latency between an event occurring and it becoming visible in the analytical system. For processing we use Spark Streaming, and we store the results in HBase (a minimal sketch of such a job follows after this list).

    - the “batch” layer, which processes data in batches that may contain several million records at once (for example, balances on all accounts after the close of a business day). This may take some time, but it gives us throughput for rather large volumes of data. We store the batch-layer data in HDFS and access it with Hive or Spark, depending on the task.
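
    As an illustration, here is a minimal sketch of a speed-layer job in the spirit described above, using the PySpark Streaming API. The socket source, host name and the way results are persisted are placeholders (in our pipeline the results land in HBase), so treat it as a sketch rather than production code.

      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext

      # 10-second micro-batches: small volumes, minimal transformations, low latency.
      sc = SparkContext(appName="speed-layer-sketch")
      ssc = StreamingContext(sc, 10)

      # Hypothetical text source; a real pipeline would read from its actual event stream.
      events = ssc.socketTextStream("events-host", 9999)

      # Count events per key over a sliding 60-second window.
      counts = (events
                .map(lambda line: (line.split(",")[0], 1))
                .reduceByKeyAndWindow(lambda a, b: a + b, None, 60, 10))

      def persist(time, rdd):
          # Stand-in for writing results to HBase (e.g. via happybase or an HBase-Spark connector).
          for key, cnt in rdd.collect():
              print(time, key, cnt)

      counts.foreachRDD(persist)
      ssc.start()
      ssc.awaitTermination()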

    Spark deserves a separate mention. We use it widely for data processing, and for us its most significant advantages are the following:

    • It can be used as an ETL tool.
    • It is faster than standard MapReduce jobs.
    • Code is written faster than with Hive / MapReduce, since it is less verbose, in particular thanks to DataFrames and Spark SQL (a short illustration follows this list).
    • It is more flexible and supports more complex processing pipelines than the MapReduce paradigm.
    • Python and the JVM languages are supported.
    • It has a built-in machine learning library.
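
    To give a feel for the difference in verbosity: an aggregation that would require a full MapReduce job (mapper, reducer, driver) fits in a few lines with DataFrames or Spark SQL. The paths and column names below are invented for the example.

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

      # Read a batch of transactions from the batch layer (hypothetical HDFS path).
      tx = spark.read.parquet("hdfs:///data/lake/transactions")

      # The same aggregation with the DataFrame API...
      daily = (tx.groupBy("account_id", "business_date")
                 .agg(F.sum("amount").alias("turnover")))

      # ...and with Spark SQL - whichever reads better for the task at hand.
      tx.createOrReplaceTempView("transactions")
      daily_sql = spark.sql("""
          SELECT account_id, business_date, SUM(amount) AS turnover
          FROM transactions
          GROUP BY account_id, business_date
      """)

      daily.write.mode("overwrite").parquet("hdfs:///data/lake/marts/daily_turnover")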

    We try to store data in the Data Lake in its original, raw form, following the “schema on read” approach. For process orchestration we use Oozie as the task scheduler.

    Structured input data is stored in the Avro format, which gives us the following advantages:

    • The data schema can change over the life cycle without breaking applications that read these files.
    • The schema is stored together with the data, so there is no need to describe it separately.
    • It is natively supported by many frameworks.

    For the data marts that users will access through BI tools, we plan to use the Parquet or ORC formats, since their columnar storage will speed up queries in most cases.
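
    As a sketch of how this fits together: raw data is read from Avro (the schema travels with the files, so nothing has to be declared up front) and a mart is written out in Parquet. This assumes the spark-avro module is on the classpath (the short format name "avro" works from Spark 2.4 onward; older versions use the Databricks spark-avro package with the format "com.databricks.spark.avro"); the paths are illustrative.

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("avro-to-parquet-sketch").getOrCreate()

      # Schema on read: the Avro files carry their own schema,
      # so the raw data can be loaded without describing it separately.
      raw = spark.read.format("avro").load("hdfs:///data/lake/raw/accounts")
      raw.printSchema()

      # A mart for BI tools goes out in a columnar format,
      # which speeds up typical analytical queries over a few columns.
      (raw.select("account_id", "business_date", "balance")
          .write.mode("overwrite")
          .partitionBy("business_date")
          .parquet("hdfs:///data/lake/marts/account_balances"))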

    As for the Hadoop distribution, we considered Cloudera and Hortonworks. We chose Hortonworks because its distribution contains no proprietary components. In addition, Hortonworks ships Spark 2 out of the box, while Cloudera at the time offered only 1.6.

    Among the analytical applications that use Data Lake data, two are worth mentioning.

    The first is JupyterHub with Python and a set of installed machine learning libraries, which our Data Scientists use for predictive analytics and model building.
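
    For a flavour of what this looks like from a notebook, here is a minimal sketch: a feature table is pulled from the Data Lake through Spark, converted to pandas and fed into scikit-learn. The table and column names are hypothetical, and the real models (such as the ATM cash-demand forecast mentioned below) are of course more involved.

      from pyspark.sql import SparkSession
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import train_test_split

      spark = SparkSession.builder.appName("notebook-sketch").getOrCreate()

      # Pull a (hypothetical) feature table from the lake into pandas;
      # all feature columns are assumed to be numeric.
      features = spark.table("atm_withdrawal_features").toPandas()

      X = features.drop(columns=["withdrawal_amount"])
      y = features["withdrawal_amount"]
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

      # Fit a simple baseline model; in practice XGBoost or other tools would be tried as well.
      model = RandomForestRegressor(n_estimators=100)
      model.fit(X_train, y_train)
      print("R^2 on hold-out:", model.score(X_test, y_test))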

    For the second, we are currently evaluating a Self-Service BI application with which users can prepare most of the standard retrospective reports on their own: tables, graphs, pie charts, histograms and so on. The idea is that IT's role will be limited to adding data to the Data Lake and granting the application and its users access to that data... and that's all. Users can do the rest themselves, which, among other things, should reduce the overall time it takes to find answers to the questions that interest them.

    In conclusion, I would like to tell you what we have achieved so far:

    • We brought the Batch Layer branch to production. The data loaded there is used both for retrospective analysis (that is, analysts use it to answer the question “how did we get here?”) and for predictive analysis: a daily machine-learning forecast of cash withdrawal demand at ATMs and optimization of the cash collection service.
    • We set up JupyterHub and gave users the ability to analyze data with modern tools: scikit-learn, XGBoost, Vowpal Wabbit.
    • We are actively developing the Speed Layer branch and preparing to launch it in production, implementing a Real Time Decision Making system on top of the Data Lake.
    • We compiled a product backlog whose implementation will let us mature the solution as quickly as possible. Among the planned items:

      1. Disaster tolerance. The solution is currently deployed in a single data center, so in effect we do not guarantee continuity of service, and we could irreversibly lose the accumulated data if something happened to that data center (the probability is small, but it exists). We ran into a problem here: built-in HDFS replication cannot guarantee that data is stored across different data centers. There is ongoing work in this direction, but its fate is still unclear, so we plan to implement our own solution.
      2. Metadata enrichment (Apache Atlas), metadata-driven data management / governance, and metadata-based role access.
      3. Exploring alternatives to some of the chosen architectural components. First candidates: Airflow as an alternative to Oozie, and more advanced CDC tools as an alternative to Sqoop for loading data from relational DBMSs.
      4. Implementing a CI / CD pipeline. Given the variety of technologies and tools we use, we want any code change to be rolled out to the production environment automatically and as quickly as possible, while guaranteeing delivery quality.

    • There are still many plans for using Big Data at Raiffeisenbank, and we will be sure to tell you about them.

    Thanks for your attention!
