Machine learning against credit risk, or "come on, Gini, come on"

A bank is, by definition, a "credit organization," and its future depends on how successfully it issues loans and gets them repaid. Working with loans successfully requires understanding the financial situation of borrowers, and this is where credit risk factors (CRFs) help. Credit analysts identify these factors in vast amounts of banking data, process them, and predict how they will change. Descriptive and diagnostic analytics are usually used for this, but we decided to bring machine learning tools into the work. This post describes what came of it.

Some credit risk factors lie on the surface; others have to be dug out of the depths of the bank's data. Changes in the dollar exchange rate, a customer's revenue and debt load, falling sales and ratings, lawsuits, criminal cases, mergers and acquisitions: all of these give a statistical signal of varying strength. To assemble an accurate overall picture of a borrower, it is necessary not only to catch every signal associated with them, but also to evaluate each signal's strength.
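Evaluating a signal's strength can be illustrated with the information value (IV) statistic, a measure commonly used in credit scoring to rank predictors. The bucket counts below are invented purely for the example; this is a sketch of the general technique, not the bank's actual method.

```python
from math import log

def information_value(buckets):
    """Information value of a factor split into buckets, where each bucket
    is (number of 'good' borrowers, number of 'bad' borrowers).
    IV = sum over buckets of (good_share - bad_share) * ln(good_share / bad_share)."""
    total_good = sum(g for g, b in buckets)
    total_bad = sum(b for g, b in buckets)
    iv = 0.0
    for good, bad in buckets:
        gs, bs = good / total_good, bad / total_bad
        iv += (gs - bs) * log(gs / bs)   # WoE-weighted contribution of this bucket
    return iv

# Invented example: borrowers bucketed by debt load (low / medium / high)
debt_load_buckets = [(400, 10), (300, 30), (100, 60)]
print(round(information_value(debt_load_buckets), 3))  # → 1.406
```

By the usual rule of thumb, an IV above roughly 0.3 marks a strong predictor, so a factor scoring 1.4 would be a very strong signal, while one near zero could be dropped.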

Descriptive and diagnostic analytics have proven themselves in working with CRFs, but these methods are not without flaws. Their use is constrained by regulators: not every advanced method or model can get approved. Classic analytics is not flexible and does not let you slice the data in arbitrary ways, which is often exactly what is needed. Performance is not great either. And sometimes there is simply not enough data to run certain analytical models at all.

Why not try machine learning for this? It could well improve the calculation of the significance of credit risk factors; in technical terms, it could raise by several percentage points the Gini coefficient, by which we estimate the accuracy of our predictive models. The better the CRF calculation, the more accurate the assessment of clients' financial condition, the higher the quality of the bank's loan portfolio, and the lower the share of manual labor.
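For readers unfamiliar with the metric: the Gini coefficient of a scoring model is directly related to ROC AUC by Gini = 2·AUC − 1. A minimal sketch, with purely illustrative labels and scores:

```python
# Gini coefficient of a scoring model, computed from predicted default risks.
# The sample data is invented; the relation Gini = 2*AUC - 1 is standard.

def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen defaulted case is scored above a random non-defaulted one,
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels, scores):
    return 2 * roc_auc(labels, scores) - 1

# Toy example: 1 = borrower defaulted, score = model's predicted risk
y = [0, 0, 1, 0, 1, 1, 0, 1]
risk = [0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.35]
print(round(gini(y, risk), 3))  # → 0.875
```

A Gini of 0 means the model ranks borrowers no better than chance, and 1 means it separates defaulters from non-defaulters perfectly, which is why a gain of even a few percentage points matters.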

Project progress

Cloudera Hadoop was chosen for storing the large data volumes; Apache Spark and Apache Hive were deployed for access to raw data, and Apache Oozie coordinated and launched the data-loading and calculation flows. Apache Zeppelin and JupyterHub were used to visualize and explore the data. In addition, several machine learning libraries supporting parallel processing were used: Spark MLlib, PySpark, and H2O.

Seven nodes were allocated to all this:

  • 3 master nodes with 64 GB of vRAM and 2 TB of disk space each
  • 3 data nodes with 512 GB of vRAM and 8 TB each
  • 1 node for applications with 128 GB vRAM, 2.5 TB

The whole project took three months and consisted of three demo stages of four weekly sprints each. Twenty-two credit risk factors were selected for calculation during the project.

In the first stage, we deployed the infrastructure and connected the first data sources:

  • The corporate information warehouse (KIH), the bank's main data repository. To operate freely on the data within the Data Lake without putting load on production systems, we loaded it virtually in its entirety.
  • The rating calculation system (CPP), one of the main databases for assessing risks associated with corporate clients' activities. It contains companies' ratings and financial-statement indicators.
  • Data from external sources, reflecting affiliation and other criteria.
  • Separate files with additional information and data for the data scientists' work.

At the second stage, we calculated the first CRFs, tried to build models on these indicators, set up a BI tool, and discussed how to visualize CRF dynamics. In the end we decided to keep the familiar Excel-style table structure in the new tool, leaving advanced visualization for the future.
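An Excel-style layout like the one described amounts to a flat table of factor values per reporting period. A small sketch of such an export; the factor names and figures here are made up for illustration:

```python
import csv

# Hypothetical quarterly values of a few credit risk factors for one borrower;
# the factor names and numbers are illustrative only.
crf_dynamics = [
    {"factor": "debt_load",    "2017Q4": 0.42, "2018Q1": 0.47, "2018Q2": 0.55},
    {"factor": "revenue_drop", "2017Q4": 0.05, "2018Q1": 0.12, "2018Q2": 0.18},
    {"factor": "fx_exposure",  "2017Q4": 0.30, "2018Q1": 0.28, "2018Q2": 0.33},
]

# Flat, Excel-friendly table: one row per factor, one column per period.
with open("crf_dynamics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["factor", "2017Q4", "2018Q1", "2018Q2"])
    writer.writeheader()
    writer.writerows(crf_dynamics)
```

The appeal of this layout is that analysts can open the file directly in their usual spreadsheet workflow while the data itself now comes from the cluster.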

Finally, at the last stage, we loaded all the remaining data, including data from an external source. The bank feared that its statistical significance would be low, so we ran statistical tests that proved otherwise. At the final demo we showed the data science tools, the BI, and the regular loading and updating of data. Of the 22 factors in the pilot, only two could not be calculated, and only for external reasons: no data of the required quality was available.
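One standard way to test whether an external flag carries real signal is to compare default rates with and without the flag. A sketch using a two-proportion z-test; the counts are invented for illustration and this is not necessarily the test the bank used:

```python
from math import sqrt, erf

def two_proportion_z_test(d1, n1, d2, n2):
    """Two-sided z-test for the difference of two default rates:
    d1 defaults out of n1 borrowers with the external flag,
    d2 defaults out of n2 borrowers without it."""
    p1, p2 = d1 / n1, d2 / n2
    p = (d1 + d2) / (n1 + n2)                      # pooled default rate
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # standard error under H0
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p_value

# Illustrative counts: borrowers flagged by the external source default
# noticeably more often than the rest.
z, p = two_proportion_z_test(d1=60, n1=400, d2=45, n2=800)
print(f"z = {z:.2f}, p = {p:.6f}")
```

A |z| above 1.96 (p below 0.05) would mean the flagged group's default rate differs significantly, i.e. the external factor is worth keeping.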


The Hadoop cluster scales easily and lets us feed the models more data, and computations can run in parallel. The Gini coefficient has grown: the models have become more accurate at predicting events related to credit risk factors.

Previously, our analysts had to go through the IT department to get SQL queries written against the corporate warehouse, and then run the models on their personal computers. Now the pilot cluster lets analysts write queries themselves, so pulling the raw data and running the models is much faster.


This year we will continue developing the project. We will deploy the Data Lake infrastructure on dedicated hardware to increase the speed of sampling and processing, organize a single centralized resource for credit analytics on top of the "lake," connect more data sources, and add new machine learning libraries.

Other divisions of the bank have become interested in our project: CRM, internal audit (searching for fraudsters and identifying suspicious operations), operational support (anti-fraud), and industry analysts. With the sandbox we built, they will get easy access to the data, the ability to connect any data sources, and the chance to experiment on them with machine learning models.
