How to ensure the availability of a web service in the cloud in the event of a data center failure

    This article describes an approach to keeping a web service deployed in the cloud available when its data center malfunctions. The proposed solution is a compromise based on partial duplication: a backup system is deployed in another data center and can run in a limited-functionality mode while the main data center is unavailable. The scheme is aimed primarily at short-term outages, but it also makes it possible to quickly promote the backup system to the main one if the problem turns out to be large-scale.



    Description of the problem


    Last year we were affected by an incident in the data center of a well-known cloud provider: one of our services was unavailable to users for half an hour. We saw with our own eyes that when a cloud data center has problems, there are practically no levers for restoring the application, and the team responsible for it can do nothing but sit and wait. This experience made us think seriously about how we use clouds for our products.

    What exactly happened that day we never found out. We tend to perceive clouds as an indestructible outpost, but that is not the case. The truth is that there is no one-hundred-percent guarantee of service availability in the cloud, just as there is not anywhere else. A cloud is an abstraction behind which hide the same racks of hardware in data centers and the same human factor. Any hardware sooner or later fails (for data centers, hardware failures are in fact a routine occurrence). On top of that, there are more serious problems that can make an entire data center unreachable: fires, DDoS attacks, natural disasters, power and Internet outages, and so on.

    As for the human factor, it is far from the least common cause of outages: according to statistics, people are to blame for 80% of network infrastructure failures. People, however good their intentions, are unreliable. Even you and your colleagues, who are directly interested in the stability of the products you support, have surely made mistakes, to say nothing of the staff of someone else's company, for whom your instances are no different from thousands of others. However professional the team behind the infrastructure, the next glitch is only a matter of time.

    Everything has a price. When you move to the cloud, you get a simple abstraction that is convenient to work with and a weaker dependence on your own operations department, in exchange for giving up full control over the situation. And if you do not take care of yourself in advance by anticipating other people's mistakes, no one will do it for you.

    Solution options


    For us, the unavailability of the service even for a few minutes is already critical, so we decided to find a way to insure ourselves against similar problems in the future without abandoning the clouds.

    When tackling the problem of service availability in the cloud, keep in mind that availability is a fairly broad concept, and different scenarios for ensuring it are considered depending on what exactly is meant by it. Although this article deals only with availability in the face of a data center failure, it is worth saying a few words about solutions to other availability problems.

    Availability as the technical ability to serve a resource within a given response time under a given load. The problem arises when the service is running, but because of limited resources and the architectural constraints of the system, not all users can access it within a certain response time. This task is most often solved by deploying additional application instances, and clouds handle that kind of scaling very well.

    Availability of the web service for users from a specific region. The obvious solution here is sharding: splitting the system into several independent applications in different data centers, each with its own data, and assigning every user to a particular instance, for example based on geo-location. With sharding, the failure of one data center will, in the worst case, make the service unavailable only to the users tied to that data center. Another argument in favor of sharding is the difference in ping time to a data center from different regions.

    Often, however, restrictions on working with the cloud and the need for decentralization come from legislative requirements, which are usually taken into account as early as the system design stage. These include the Yarovaya law, which requires storing the personal data (PD) of Russian users in Russia; the General Data Protection Regulation (GDPR), which restricts the cross-border transfer of EU users' PD to some countries; and Chinese Internet censorship, under which all communications and all parts of the application should be located in China, preferably on Chinese servers.

    The problem of technical unavailability of a data center is solved by duplicating the service in another data center, and this is not an easy technical task. The main obstacle to running the service in parallel in different data centers is the database. Small systems typically use a single-master architecture, in which case the failure of the data center hosting the master makes the entire system inoperative. A master-master replication scheme is possible, but it imposes strong limitations that not everyone appreciates: it does not actually scale database writes and even adds a small time penalty, because every node must confirm that a transaction has been accepted. Write latency grows further when the nodes are spread across different data centers.

    Rationale for the solution


    An analysis of the load on our service showed that, on average, about 70% of API calls are GET requests, and these methods only read from the database.

    Distribution of web service API calls by HTTP method

    I believe these results reflect the overall picture for public web services, so we can say that in a typical web service API, read methods are called much more often than write methods.
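
    One way to check this ratio for your own service is to count the HTTP methods in the web server access log. Below is a minimal sketch in Python; it assumes the common nginx/Apache "combined" log format and a file called access.log, both of which are assumptions to adapt to your setup.

        # count_methods.py: a rough estimate of the read/write ratio of an API
        # by counting HTTP methods in the access log (combined log format assumed).
        import collections
        import re

        # In the combined log format the request line looks like "GET /path HTTP/1.1".
        REQUEST_RE = re.compile(r'"([A-Z]+) [^"]*"')

        counts = collections.Counter()
        with open("access.log") as log:  # hypothetical file name
            for line in log:
                match = REQUEST_RE.search(line)
                if match:
                    counts[match.group(1)] += 1

        total = sum(counts.values())
        for method, count in counts.most_common():
            print(f"{method:7s} {count:8d} {100 * count / total:5.1f}%")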

    The second claim I would like to make is that, when we talk about absolute availability, what the clients of the service really need is not the availability of the entire wealth of API methods, but only of those needed to continue their "usual" work with the system and execute "normal" queries. No one will be upset if a method that is called a couple of times a month is unavailable for a few minutes. And the "normal" flow is often covered by read methods.

    Therefore, guaranteeing the availability of the read methods alone can already be considered a viable short-term solution to the problem of system availability during a data center failure.

    What we want to implement


    In case of a data center failure, we would like to switch traffic to a backup system in another data center. In the backup system all read methods should remain available, and when any of the remaining methods is called and writing to the database cannot be avoided, a correct error should be returned.

    In normal operation, a user request goes to the balancer, which redirects it to the main API. If the main service is unavailable, the balancer detects this and redirects requests to the backup system running in limited-functionality mode. Meanwhile, the team analyzes the problem and decides whether to wait for the data center to recover or to switch the backup system into the main mode.



    Implementation algorithm


    Necessary infrastructure changes


    1. Setting up a slave replica of the database in another data center.
    2. Setting up deployment of the web service, log collection and metrics in the second data center.
    3. Configuring the balancer to switch traffic to the spare data center when the first one is unavailable (the failover logic is illustrated in the sketch after this list).
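
    In practice the switching is done by the balancer itself (nginx, HAProxy or a cloud load balancer; in nginx, for instance, the second upstream server can simply be marked as backup), but the logic it performs can be illustrated with a minimal Python sketch. The upstream addresses, port and timeout below are hypothetical, and only GET requests are proxied to keep the example short.

        # failover_proxy.py: an illustration of the failover logic a balancer performs.
        # Try the primary data center first; if it does not answer, send the request
        # to the backup running in limited-functionality mode.
        import urllib.error
        import urllib.request
        from http.server import BaseHTTPRequestHandler, HTTPServer

        PRIMARY = "http://api.primary-dc.example.com"  # main API (hypothetical address)
        BACKUP = "http://api.backup-dc.example.com"    # read-only backup (hypothetical)
        TIMEOUT = 2  # seconds after which the primary is considered unavailable


        class FailoverProxy(BaseHTTPRequestHandler):
            def do_GET(self):
                for upstream in (PRIMARY, BACKUP):
                    try:
                        with urllib.request.urlopen(upstream + self.path,
                                                    timeout=TIMEOUT) as resp:
                            body = resp.read()
                            self.send_response(resp.status)
                            self.send_header("Content-Type",
                                             resp.headers.get("Content-Type",
                                                              "application/json"))
                            self.end_headers()
                            self.wfile.write(body)
                            return
                    except (urllib.error.URLError, OSError):
                        # Connection error or timeout: fall through to the backup.
                        # A real balancer would also distinguish HTTP error codes here.
                        continue
                self.send_error(502, "Both data centers are unreachable")


        if __name__ == "__main__":
            HTTPServer(("", 8080), FailoverProxy).serve_forever()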

    Code changes


    1. Adding a separate connection to the replica in the web service.
    2. Switching all read-only API routes to the replica.
    3. For the remaining methods, introducing a read-only mode, toggled by an environment variable or some other trigger, in which they either do the part of their work that does not require writing to the database or, if they cannot function without writes, return a correct error (see the sketch after this list).
    4. Changes on the frontend to display a correct error when write methods are called.
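
    What this might look like in code is sketched below for a Flask application with SQLAlchemy and two database connections. The environment variable names (DATABASE_URL, REPLICA_URL, READ_ONLY) and the /items endpoints are hypothetical; the point is only to show reads going to the replica and write methods returning a correct error in read-only mode.

        # app.py: a minimal sketch of read-only mode in the API (Flask + SQLAlchemy).
        import os

        from flask import Flask, jsonify, request
        from sqlalchemy import create_engine, text

        app = Flask(__name__)

        # Separate connections: writes go to the master, reads go to the replica.
        master = create_engine(os.environ["DATABASE_URL"])  # hypothetical env vars
        replica = create_engine(os.environ["REPLICA_URL"])

        # The trigger for limited-functionality mode; set it in the backup data center.
        READ_ONLY = os.environ.get("READ_ONLY") == "1"


        @app.get("/items")
        def list_items():
            # Read methods always use the replica, so they work in both modes.
            with replica.connect() as conn:
                rows = conn.execute(text("SELECT id, name FROM items")).mappings().all()
            return jsonify([dict(row) for row in rows])


        @app.post("/items")
        def create_item():
            # Write methods cannot work without the master: return a correct error.
            if READ_ONLY:
                return jsonify(error="The service is temporarily in read-only mode, "
                                     "please try again later"), 503
            with master.begin() as conn:
                conn.execute(text("INSERT INTO items (name) VALUES (:name)"),
                             {"name": request.json["name"]})
            return jsonify(status="created"), 201

    Instead of a check in every write handler, the same behavior can be implemented in one place, for example in a before_request hook that rejects all non-GET requests while read-only mode is on.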

    Pros and cons of the described solution


    Benefits


    • The main advantage of the proposed scheme is that a duplicate service is always ready to serve users. If problems arise in the main data center, you will not have to write deployment scripts for some other infrastructure and bring everything up in a hurry.
    • The solution is cheap to implement and maintain. If you have a microservice architecture and the product consists of many services rather than one, moving all of them to this scheme should not pose any particular problems.
    • There is no threat of data loss, since there is always a full copy of the database on the replica in another data center.
    • The solution is intended primarily for temporary traffic switching, for up to about half an hour. That half hour is exactly what you need to get your bearings when infrastructure problems arise. If the first data center has not recovered within this period, the slave replica of the database is promoted to master and the duplicate service becomes the main one.
    • In the proposed scheme, the application and the database live in the same data center. If your API and database are in different data centers, it is best to move them into one: this significantly reduces query execution time. For example, our measurements showed that in Google Cloud a request from the API to the database within one data center takes about 6 ms on average, while fetching data from another data center adds tens of milliseconds (a measurement sketch is shown after this list).
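
    A simple way to obtain such numbers is to time a trivial query from the application host, repeating it many times, first from the same data center as the database and then from a different one. The sketch below assumes PostgreSQL with the psycopg2 driver and a hypothetical connection string.

        # db_latency.py: measure the round trip from the application to the database.
        # Run it from a host in the same data center and from one in another
        # data center to compare the numbers.
        import time

        import psycopg2  # assumes PostgreSQL and the psycopg2 driver

        DSN = "host=db.example.com dbname=app user=app password=secret"  # hypothetical
        RUNS = 100

        conn = psycopg2.connect(DSN)
        cur = conn.cursor()

        timings = []
        for _ in range(RUNS):
            start = time.perf_counter()
            cur.execute("SELECT 1")  # a trivial query, so the time is mostly network
            cur.fetchone()
            timings.append(time.perf_counter() - start)

        print(f"average round trip: {1000 * sum(timings) / RUNS:.1f} ms")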

    Disadvantages


    • The main drawback of the scheme is that instant traffic switching requires a balancer that is not located in the same data center as the main service. The balancer becomes a point of failure: if the data center hosting it goes down, your service becomes unavailable anyway.
    • The need to deploy the code to another server and to monitor additional resources, for example, to watch the replica so that it does not lag behind (see the sketch after this list).
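
    For PostgreSQL, for instance, the replica lag can be checked with a query like the one in the sketch below (psycopg2 and the connection string are assumptions); the resulting value can be pushed to your monitoring system and alerted on.

        # replica_lag.py: check how far the replica is behind the master (PostgreSQL).
        import psycopg2  # assumes PostgreSQL and the psycopg2 driver

        REPLICA_DSN = "host=replica.example.com dbname=app user=monitor"  # hypothetical
        LAG_THRESHOLD_SECONDS = 30

        with psycopg2.connect(REPLICA_DSN) as conn:
            with conn.cursor() as cur:
                # Seconds since the last transaction replayed on the replica.
                # Note: this value also grows while the master is simply idle.
                cur.execute(
                    "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))"
                )
                lag = cur.fetchone()[0]

        if lag is None:
            print("no transactions replayed yet on this replica")
        elif lag > LAG_THRESHOLD_SECONDS:
            print(f"WARNING: replica is {lag:.0f} s behind the master")
        else:
            print(f"replica lag: {lag:.1f} s")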

    Conclusion


    You cannot build a system that is resistant to every type of failure. Nevertheless, protecting yourself against specific types is a feasible task. The solution described in this article, which keeps the application available during data center malfunctions, can be interesting and useful in many practical cases.

    Turning an ordinary web service into a fully distributed system just to protect against hypothetical data center failures is most likely impractical. At first glance even the proposed scheme seems redundant and "heavy", but these drawbacks are more than outweighed by its advantages and by how easy it is to implement. You can draw an analogy with accident insurance: most likely you will never need it, but if an accident does happen, you will be glad you have it. With the proposed scheme you can be sure that a backup system is always ready: during short-term problems it keeps most of the service's methods available, and during long outages it can be fully promoted to the main system in a matter of minutes. Many will agree to pay this price for such confidence.

    Each system has its own load profile and availability requirements. That is why there is no single right or wrong answer to the question "Can Google Cloud or AWS be trusted completely?" - in each specific situation the answer will be different.
