javaspecialist May 3, 2011 at 07:39

Building a fault tolerant system

In the development of banking software, this aspect of the system is given the greatest attention. Often, when describing a fault-tolerant system, they use the words: Fault Tolerance, Resilience, Reliability, Stability, DR (disaster recovery). This characteristic is the essence of the ability of the system to continue to work correctly when one or more of the subsystems on which it depends. I will briefly describe what approaches can be applied in this area and give a couple of examples.

Immediately I ask you to ask that some things will be specific only to java, but still, by and large, everything described below applies to any platform, so the topic is placed on this topic.

Where to begin?

First of all, you need to know the architecture of your application well, to cover the entire stack of your system - from software to hardware.

You can also look towards such a thing as fault tree analysis. This is when you build a diagram of your components in the form of a dependency tree, and from the bottom up, you begin to put down the cumulative probability of the fall of your components. In this diagram, the most vulnerable areas that could potentially cause the greatest damage will be clearly visible.

The most difficult option is when your system depends on subsystems that are outside your competence and are not fault tolerat. First of all, you should certainly try to convince the owners of these subsystems to provide you with a fault-tolerant solution, thus making it their problem, not yours. But, unfortunately, this is not always possible to do, so in this case, first of all, you will need to understand in detail how these subsystems work, and make fault-tolerant decisions yourself, for example, by connecting simultaneously to two identical duplicated instances of the subsystems, but more on that below .

Implementation methods

There are two fundamentally different approaches, although they may well combine independently or semi-independently for your system. One approach is DR (disaster recovery), when a full copy of your entire system can be picked up in another data center. This method helps out in almost any situation, however, it can have a very long period of inactivity. I know one system in which it takes about an hour. But here, too, you need to be very careful, not only the configuration of the system, but also the hardware in the DR must match the production. For example, when we tested DR, in one of the systems I worked with, it turned out that under heavy load one of the components was designed to work in a single-threaded environment and its throughput was proportional to the processor frequency, and we had a mega piece of hardware on DR with lots of cores but with a small clock speed, so this component began to shut up. In general, the test was very unsuccessful and the “cooler" hardware on the current configuration of our system was clearly not suitable for us.

The second way is to programmatically implement fault tolerance for each of the components and their interaction. There are many different techniques to make your system more stable. Let's look at some of them.

low level fault tolerant services

The principle here is the following - your system should consist of more or less independent systems, each of which must be fault tolerant. Ideally, do not reinvent the wheel, but use ready-made solutions. This is actually the best option when your whole system will rely on the so-called low level fault tolerant services.

Single point of failure

Avoid architecture in which your entire system crashes when one of the components crashes. This can be achieved either by using the principle of redundancy (see below), or by making the components as independent as possible so that when one of the components crashes, only part of the functionality ceases to work, and the rest of the system continues to work. Of course, the last solution is not suitable for the main functionality of your system, but if there was a problem with problems in some of the things in your system that provide auxiliary functions, and this turned off your entire system, then this is probably not very good.

Redundancy

This is when there is an excessive amount of components you need on your system. And if one of these redundant components falls, everything should continue to work. With this design approach, two strategies can be distinguished: active-active and active-passive.

Active active

This means that you are working simultaneously with two identical components at the same time. For example, in a system where a client receives quotation prices, he can subscribe to two different components at once and receive prices simultaneously from two places. If one of these components falls, the client will not notice at all that some problem has occurred in the system, which is an undoubted advantage of this approach. As disadvantages, it can be noted that the amount of traffic and the time for processing it doubles, as well as an additional server infrastructure, which is constantly in working mode and consumes resources. Still, this approach can not be implemented due to various business constraints. For example, a trading system cannot send the same order simultaneously to two independent connectors to the exchange,

Active passive

This strategy represents only one constantly working component, in the event of a fall of which the second one automatically rises, restores the state and takes on all the work. For example, such functionality is in TIBCO EMS. If you enable this option, then the working instance of EMS is constantly monitored, and in case of failure, the second instance of EMS is launched, sends all unread messages and starts receiving new ones. But here it is very important to understand what you will have to pay for. In one of the projects where we turned on this option, the maximum throughput of this component fell about two times.

It is also important to understand that if you are not using a ready-made solution, then its acttive-passive implementation is likely to require more efforts than an active-active solution, since here you need a reliable way to verify that the active component is functioning, and you need to be able to restore the state to moment of fall or constantly synchronize it. Also, this solution will always have a certain delay in operation when the component falls. But the passive component is likely to consume a lot less resources and you can get a fault-tolerant solution on weaker hardware.

Load balancing

Usually this term pops up when building highly loaded systems. However, for fault-tolerant systems, this principle can also be applied. The idea is that you have several identical components between which the entire load is more or less evenly distributed. Unlike the active-active strategy described above, here only one component performs each task. This mechanism is ideal for stateless components, otherwise for fault tolerance you will constantly have to synchronize the state. For example, in the case of web servers, do session replication. In this solution, it is very important to have at least N + 1 redundancy, i.e. if for peak loads you need N components working on the entire coil, then N + 1 such components should be present in your system, since otherwise, if you have one of the elements falling,

The most common load balancing example is probably the load balancing of web servers. There are both software solutions (Nginx), and special glands. In backend systems, load-balancing is often used to implement balancing; they implement queues using JMS, hanging several identical listeners on it. In this case, the queue guarantees that the message will fall into only one of the components, and until it processes it, it will not be able to take the next message. By the way, in the same system, if we take topic as a queue, we can implement an active-active system.

Protective Programming (Defencive coding)

To make your application as fault-tolerant as possible, you need to not only think about it during design, but also in programming. You should always think, and what if something happens and write code that handles all such unforeseen situations. I will give a few examples.

Infinite Loop . Make sure that if an error occurs during the processing of the message, you do not go into an endless loop that constantly tries to perform this operation, especially if the message arrives from outside you, and you cannot guarantee that it is correct. I have never seen how the whole system hangs because of this error. What to do in this case? For example, you can send an error message to the monitoring system (see below) and proceed with processing the next task.

Reconnection. Be sure to write reconnect logic, especially on systems with a firewall .

Degradate gracefully. Throttling . If your application receives messages from outside, then be sure to think about what you will do if you receive a lot more messages than you can process. For example, you can start to redirect requests or emulate slow connection, slowing down the response.

State management. Try to make sure that your system maintains its checkpoint from time to time, so that in case of big problems you can always roll back to the last consistent state. For example, in the FIX protocol, common in financial applications, there is the concept of sequence. Each message has its own serial number, the server remembers the number of the last sent message, and the client remembers the number of the last received message. At the restart, they exchange these numbers, and if there is a gap between them, the server sends all the missed messages.

Infrarastructure. If an error has occurred in your application from which you cannot recover, for example OutOfMemoryError in java, then you can try to stop the application and start it again. This can be automated using a tool like Tanuki Java Service Wrapper, which I wrote about in detail here .

NullPointerException . However, don't go crazy by checking every input parameter of a method or constructor for null. Better to take it as a rule: never pass null between components on a system.

Infrastructure isolation

Sharing the iron of your system with any other applications is very dangerous. It is also called the principle of student kitchen (student kitchen), if someone does something there, then it cannot but affect other people who use the same things. The most unpleasant thing is that this case is very difficult to determine if the analysis is not carried out immediately, but after a certain time. There are many examples of this problem. For example, the third application began to actively use RAM - this way your processes began to swap and your system’s productivity is terribly degrading. Or the third system began to actively use the hard drive, and you write logs synchronously, in this case your application will slow down again. Overfilling the hard drive with third systems and clogging the communication channel are also possible problems, True, they can also occur when iron is used only by your system, it just consists of a large number of processes. For example, a very common case of how applications written in java affect each other is a parallel garbage collector. When it works in one of the components, by default it uses all available CPU resources to the fullest, so if you have an application that needs a quick response on this machine, this problem is not contrived. Although the number of available threads for garbage collection can always be limited to special flags when starting the JVM. A very common case of how applications written in java affect each other is a parallel garbage collector. When it works in one of the components, by default it uses all available CPU resources to the fullest, so if you have an application that needs a quick response on this machine, this problem is not contrived. Although the number of available threads for garbage collection can always be limited to special flags when starting the JVM. A very common case of how applications written in java affect each other is a parallel garbage collector. When it works in one of the components, by default it uses all available CPU resources to the fullest, so if you have an application that needs a quick response on this machine, this problem is not contrived. Although the number of available threads for garbage collection can always be limited to special flags when starting the JVM.

Monitoring

To increase the fault tolerance of your system, you can also use monitoring. Very often after learning about upcoming problems in advance, you can take certain actions to avoid their onset. For example, seeing that the disk is running out of space, you can quickly go to the box and clean the logs.

There are many ready-made solutions for monitoring (Triton, Nagious), both paid and open-source. Standard features are disk, RAM, processor and traffic monitoring. There are also various plugins that allow you to monitor log files and when errors occur there, send a message to the monitoring system.

Despite the availability of ready-made solutions, banks for some reason develop their own solutions. However, they are already more focused on receiving messages sent programmatically from internal applications. For example, if for some reason a certain back-end system was not able to process the order, then you can send messages with all the details of the order to the described monitoring system so that support can enter the details of the order manually.

Another type of monitoring is the health monitor, when your application sends special heartbeat, and if nothing was heard from the application for a certain period of time, a fault message pops up in the tracking system.

Problem Management

Of course, you should start thinking about the fault tolerance of your system as early as possible, even at the design stage. But bringing the resilience of your system to the ideal is never too late. It is very important to investigate every problem that has happened in your system, find the true cause of the errors and make sure that in the future it no longer causes problems in your system.

In custody

I talked about several approaches to building fault-tolerant systems. Which one should you choose? I would recommend the one that is the most transparent and easiest to implement, provided that it still meets the minimum fault tolerance requirements of your system.

Tags: