The secrets of our front office's resiliency

    How does a modern bank work? There is a back office, where operations are processed, accounts are kept and reports are prepared. There is a middle office, where decisions are made, credit risks are assessed and fraud is countered. And there is a front office, which serves customers and is responsible for their interaction with the bank through various channels.



    Sberbank runs hundreds of systems with varying availability and reliability: in-house developments and boxed solutions with different degrees of customization and different SLAs. All of these systems are integrated with each other in a huge number of ways. In this post we will describe how this whole front-end anthill is assembled so that it provides uninterrupted customer service.

    Let's start with the theory. The key principles on which a fault-tolerant system is built can be borrowed from a submarine:

    1. The submarine is divided into independent compartments. If one compartment is flooded, the rest still survive.
    2. All critical components are duplicated: engines, oxygen cylinders. The Beatles even had periscopes with portholes in reserve.
    3. The submarine is protected from critical conditions on the surface - if necessary, it can go deep and work there as if nothing had happened.

    Let us illustrate the first principle with an example from our practice. We had a distributed cache system, and once, under load, one of its data nodes failed. No problem: the controller redistributed the data across the remaining nodes to maintain the required replication factor. But the redistribution made network traffic spike, and packets started getting lost - including the cache's own service traffic. At some point the controller decided that another data node had failed, redistributed the data again, traffic grew further... As a result, in less than a minute the system went down completely. Fortunately, this happened in the load-testing environment and no one was hurt, but we spent a lot of time finding the cause.

    One could argue that this does not happen with clustered databases and high-end servers, since redundancy is built in at the hardware level. But, to quote Werner Vogels, CTO of Amazon: "Everything fails all the time." Both database clusters and high-end servers went down on us - because of configuration errors, because of problems in the management software. With every such problem we solved, our confidence in these solutions shrank. In the end we came to a conclusion: the only systems that do not fail are those divided into mutually independent parts - independent, first of all, in how they are controlled.

    Multi-block architecture


    Our answer to these problems was a multi-block architecture. In it, all hardware components, including databases, are divided into loosely coupled, almost independent blocks. Each block serves a portion of the clients, much like sharding in databases. Nodes within each block are redundant at every level, including geo-redundancy. A problem in one block does not affect the others, and as the number of customers grows we can simply add new blocks and keep working as usual.


    The overall block architecture. All blocks are redundant according to the 2N scheme. Each data center has a production load balancer. The data centers are connected by 2-3 independent communication channels.

    Servers are divided into five types of blocks (a rough sketch of the layout follows the list):

    • Router block - a control block that distributes clients to the other blocks.
    • Client block - the main block type; each client block serves up to 10 million clients.
    • Pilot block - here we test new application versions on loyal customers (about 300 thousand people, mostly Sberbank employees).
    • Guest block - serves unauthenticated users, for example those who come in through the website.
    • Backup block - a safety net, powerful enough to replace two client blocks at once.
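
    To make the layout a bit more tangible, here is a minimal sketch of how these block types could be modeled in code. The enum and its comments are our illustrative assumptions, not the actual Sberbank implementation.

```java
// Hypothetical model of the five block types described above.
public enum BlockType {
    ROUTER,  // control block: decides which block a client goes to
    CLIENT,  // main block type: each one serves up to ~10 million clients
    PILOT,   // new versions are tried here first on ~300k loyal customers
    GUEST,   // handles unauthenticated users, e.g. visitors from the website
    BACKUP   // safety net, sized to replace two client blocks at once
}
```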

    Within each block, the application servers and web servers are split by service channel, while the databases are shared. This lets us isolate the most common failure scenarios so that they do not spread beyond their channel.

    How does it work?


    First, the user lands on the router block. It checks which client block the person belongs to and sends them there (or to the guest block). The person then works quietly inside their block. If their home block fails, the person returns to the router and is automatically directed to the backup block to continue working.
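
    A minimal sketch of this routing decision, under our own assumptions: the `Router` class, its client-to-block map and the health check are hypothetical stand-ins for the real router block's logic.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical router sketch: authenticated clients go to their home block,
// unauthenticated visitors go to the guest block, and if the home block is
// unavailable the client is sent to the backup block.
public class Router {
    private final Map<String, String> clientToBlock; // clientId -> home block id
    private final String guestBlock;
    private final String backupBlock;

    public Router(Map<String, String> clientToBlock, String guestBlock, String backupBlock) {
        this.clientToBlock = clientToBlock;
        this.guestBlock = guestBlock;
        this.backupBlock = backupBlock;
    }

    public String route(Optional<String> clientId) {
        if (clientId.isEmpty()) {
            return guestBlock;                 // not authenticated
        }
        String home = clientToBlock.get(clientId.get());
        if (home != null && isHealthy(home)) {
            return home;                       // normal case: the home client block
        }
        return backupBlock;                    // home block is down or unknown
    }

    // In a real system this would come from monitoring; here it is a stub.
    private boolean isHealthy(String blockId) {
        return true;
    }
}
```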

    What happens to the data in the meantime? Information about the client's interaction with the bank is continuously replicated from the client blocks into an archive database. When the backup block receives a user, it pulls the necessary information about them from the archive database and serves it where needed - so the user is not left hanging if a problem occurs on our side.

    Operations performed in the backup block are stored there. When the user's home client block is restored, the user is switched back, and the operations accumulated in the backup block are transferred asynchronously to the appropriate client blocks. While the data is being brought back to consistency, the client sees a message saying that all operations have been accepted and saved, but that the latest ones may not be displayed yet because of technical work.
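
    Below is a rough sketch of that asynchronous transfer: operations recorded in the backup block are queued and replayed into the client's home block once it is back. The `Operation` record, the queue and `sendToHomeBlock` are assumptions made for the illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: operations performed in the backup block are queued
// and replayed asynchronously into the home client block after it recovers.
// Until the queue drains, the client sees the "latest operations may not be
// displayed yet" notice described above.
public class BackupReplay {
    record Operation(String clientId, String homeBlockId, String payload) {}

    private final BlockingQueue<Operation> pending = new LinkedBlockingQueue<>();

    public void recordInBackup(Operation op) {
        pending.add(op);                   // stored in the backup block first
    }

    public void replayLoop() throws InterruptedException {
        while (true) {
            Operation op = pending.take(); // blocks until there is work to replay
            sendToHomeBlock(op);           // independent of the user's session
        }
    }

    private void sendToHomeBlock(Operation op) {
        // In a real system: write the operation into the database of op.homeBlockId().
    }
}
```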


    The general scheme of the system

    In some cases, switching to the backup block is planned in advance - for example, during an update of a client block. In that case the backup block does not take over existing client sessions; at some point it simply starts handling all new operations instead of the client block. If an urgent switch to the backup block is needed, the administrator can terminate all sessions: the user's session is interrupted, and they start a new one on the backup block. The router block, by the way, has its own dedicated backup block - so nobody is left without a spare wheel.

    System update


    New software versions are deployed first to the pilot block and shown to a limited audience, then gradually to the client blocks, and only at the very end to the backup block. So if a client block runs into problems with a new software version, we can move its clients to the backup block, which is still running the old one.

    When new functionality is rolled out to a block, it is not switched on automatically. Administrators enable it with feature toggles. Clients can be switched to the new version in groups - this is how we check how an update behaves as the audience grows.
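
    A minimal illustration of such group-by-group toggles. The percentage-based grouping and the class itself are assumptions for the sketch; the real mechanism may group clients differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical feature-toggle sketch: a feature is deployed to a block but
// only enabled for a configurable share of clients, which an administrator
// raises step by step while watching how the update behaves.
public class FeatureToggles {
    // feature name -> percentage of clients (0..100) that see the new version
    private final Map<String, Integer> rolloutPercent = new ConcurrentHashMap<>();

    public void setRollout(String feature, int percent) {
        rolloutPercent.put(feature, percent);
    }

    public boolean isEnabled(String feature, long clientId) {
        int percent = rolloutPercent.getOrDefault(feature, 0);
        // Stable bucketing: the same client always falls into the same group.
        return Math.floorMod(clientId, 100) < percent;
    }
}
```

    An administrator could start with a small group, for example `setRollout("someFeature", 5)` (a made-up feature name), and raise the percentage as monitoring stays quiet.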

    Autonomy


    On its own, our system is reliable, but it still depends on the backend systems that actually execute the operations. How do we protect ourselves against problems there? We use three tools.

    1. Pending requests. The client requests an operation. We save it in our database and try to execute it in the backend. If the backend does not respond, we show the client a message that the operation has been accepted and is being processed. When the backend comes back up, a separate worker reads the incomplete operations from the database and pushes them into the backend system in batches. To avoid loading the main operations table with a large number of inefficient queries, we additionally use a so-called marker table - a list of identifiers of pending operations. And to avoid knocking over the freshly recovered backend with hundreds of thousands of operations, we use batching: we send a couple of hundred operations and then wait, say, a few seconds (see the first sketch after this list).



      But what if something important changed between the user's request and the backend's recovery - for example, a currency rate moved? For such cases we have double verification: these operations are validated as they are entered and then checked again at execution time. If something does not match, the operation is adjusted or rejected.
    2. Data caching. When a user opens, say, Sberbank Online, all the essential information about them should already be visible there - accounts, cards, loans and so on. This data is requested through a service bus from a dozen systems. If the response comes together quickly, within a few seconds, we show the data to the client and save it in the system cache in our database. If not, we look for previously cached data in the database and show that to the client - provided, of course, that the cache is not older than a certain age. When the service bus eventually collects the requested data, the cache in the database is updated and the fresh data is sent to the client in place of the old (see the second sketch after this list).

      In the application this means that a person sees the state of their accounts at most a few seconds after logging in, even if the data is somewhat stale. If that happens, the data is usually replaced by the current values a few seconds later, once the service bus has collected everything it needs.

      In addition, we pre-cache some data using replication - mostly various reference data. We load this data in advance, so the client simply requests an operation and we serve it. Even if the systems that own this data are down, the user does not have to wait.
    3. Technical breaks. If a backend system is down or undergoing maintenance, we flag it, and operations going through it fail immediately. This saves the application server from being flooded with requests waiting for a timeout. In this mode, the operation and data caching described above can still be used. Technical breaks are set per integration scenario, either manually by the administrator or automatically, based on request statistics (see the third sketch after this list).
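
    A sketch of the pending-requests worker from point 1: it reads identifiers from the marker table, pushes the corresponding operations to the backend in small batches and pauses between batches so the freshly recovered backend is not flooded. The table access interfaces, batch size and pause length are illustrative assumptions.

```java
import java.util.List;

// Hypothetical pending-requests worker (point 1 above).
public class PendingRequestWorker {
    private static final int BATCH_SIZE = 200;   // "a couple of hundred operations"
    private static final long PAUSE_MS = 3_000;  // "wait a few seconds"

    private final OperationStore store; // wraps the operations and marker tables
    private final Backend backend;

    public PendingRequestWorker(OperationStore store, Backend backend) {
        this.store = store;
        this.backend = backend;
    }

    public void drain() throws InterruptedException {
        while (true) {
            // The marker table holds only identifiers, so this query stays cheap.
            List<Long> ids = store.nextPendingIds(BATCH_SIZE);
            if (ids.isEmpty()) {
                return;                          // nothing left to push
            }
            backend.submitBatch(store.loadOperations(ids));
            store.markCompleted(ids);            // drop the ids from the marker table
            Thread.sleep(PAUSE_MS);              // let the backend breathe
        }
    }

    interface OperationStore {
        List<Long> nextPendingIds(int limit);
        List<Object> loadOperations(List<Long> ids);
        void markCompleted(List<Long> ids);
    }

    interface Backend {
        void submitBatch(List<Object> operations);
    }
}
```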
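
    A sketch of the caching fallback from point 2: wait a few seconds for the service bus and, if it does not answer, fall back to previously cached data that is not older than a configured age. The timeout, the maximum cache age and the `ServiceBus` interface are our assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.*;

// Hypothetical sketch of point 2: ask the service bus for fresh client data,
// but if it does not answer within a few seconds, serve a cached copy that is
// not older than the allowed age. A fresh answer also refreshes the cache.
public class ClientDataProvider {
    record CachedData(String json, Instant storedAt) {}

    interface ServiceBus {
        String collectClientData(String clientId); // accounts, cards, loans, ...
    }

    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final ConcurrentMap<String, CachedData> cache = new ConcurrentHashMap<>();
    private final Duration busTimeout = Duration.ofSeconds(3); // assumption
    private final Duration maxCacheAge = Duration.ofHours(1);  // assumption
    private final ServiceBus bus;

    public ClientDataProvider(ServiceBus bus) {
        this.bus = bus;
    }

    public Optional<String> dataFor(String clientId) {
        Future<String> fresh = executor.submit(() -> bus.collectClientData(clientId));
        try {
            String data = fresh.get(busTimeout.toMillis(), TimeUnit.MILLISECONDS);
            cache.put(clientId, new CachedData(data, Instant.now())); // refresh the cache
            return Optional.of(data);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            CachedData old = cache.get(clientId);
            if (old != null && old.storedAt().isAfter(Instant.now().minus(maxCacheAge))) {
                return Optional.of(old.json()); // stale, but recent enough to show
            }
            return Optional.empty();            // nothing usable to show the client
        }
    }
}
```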
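
    Finally, a sketch of the technical breaks from point 3 - in effect a simple circuit breaker: a per-scenario flag that an administrator sets or that trips automatically, making calls fail fast instead of queueing up on timeouts. The threshold and the reset rule are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of point 3: a "technical break" flag per integration
// scenario. While the flag is on, calls fail immediately instead of piling up
// on the application server waiting for timeouts.
public class TechnicalBreaks {
    private static final int AUTO_TRIP_THRESHOLD = 50; // assumption

    private final Map<String, Boolean> manualBreak = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> failures = new ConcurrentHashMap<>();

    public void setManualBreak(String scenario, boolean on) {
        manualBreak.put(scenario, on);        // set by the administrator
    }

    public void recordFailure(String scenario) {
        failures.computeIfAbsent(scenario, s -> new AtomicInteger()).incrementAndGet();
    }

    public void recordSuccess(String scenario) {
        failures.computeIfAbsent(scenario, s -> new AtomicInteger()).set(0);
    }

    // Callers check this before calling the backend and fail fast if it is true.
    public boolean isOpen(String scenario) {
        if (manualBreak.getOrDefault(scenario, false)) {
            return true;                      // administrator flagged the scenario
        }
        AtomicInteger recent = failures.get(scenario);
        return recent != null && recent.get() >= AUTO_TRIP_THRESHOLD; // automatic break
    }
}
```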




    In any case, we strive to minimize waiting for the user: if a problem does occur, they immediately receive a message that the operation cannot be performed. We try to keep the number of such messages to a minimum, so we extend the lifetime of some cached data - this lets normal interaction with the bank's services continue for longer.

    In some scenarios caching is not appropriate - for example, when dispensing cash, where fraud on the client's side is possible. Such operations at ATMs and branches are not cached. In the online bank this is simpler: we accept the request and then either process it or reject it.

    Following the principles described in this article, it is possible to build systems with 99.99% availability and higher.

    Our plans


    Our plans now are to minimize the time-to-market of our unified system, to provide an omnichannel experience that takes into account the technical and business specifics of each channel, and to migrate legacy systems while keeping them operational during the move.

    We thank Roman Shekhovtsov for his active help in preparing this post.
