Problems of ensuring 100% project availability
To argue that a site should always be available is both a truism and bad form; yet although 100% availability is treated as a mandatory requirement, it usually remains an unattainable ideal. There are now plenty of solutions on the market that promise maximum uptime or offer ways to increase it, but using them is not always enough: in some cases it does not help at all, and in others it even adds risk and reduces the project's availability. In this article we walk through the classic mistakes we run into constantly. Most of these problems are elementary, yet people keep making them.
A prerequisite: before trying to squeeze the maximum uptime out of a project, weigh the cost of the measures against the cost of downtime. This is usually critical for companies whose work other companies depend on: B2B solutions, API services, delivery services. Even a few minutes of unavailability will, at the very least, flood the call center with calls from dissatisfied customers. For companies of another type, say a small online store or a company whose customers work from 9 to 18, even a few hours of unavailability can be cheaper than a full-fledged standby site.
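As a rough illustration of that trade-off, here is a minimal back-of-the-envelope sketch; the hourly downtime cost, the expected downtime, and the standby price are made-up numbers, not data from any real project.

```python
# Rough break-even estimate: is a standby site worth it for this project?
# All figures below are illustrative assumptions; substitute your own.

downtime_cost_per_hour = 2_000        # lost revenue + support load
expected_downtime_hours_per_year = 8  # e.g. one large incident without a standby
standby_cost_per_year = 12_000        # extra servers, traffic, engineering time

expected_loss = downtime_cost_per_hour * expected_downtime_hours_per_year

if expected_loss > standby_cost_per_year:
    print(f"Standby pays off: expected loss {expected_loss} > cost {standby_cost_per_year}")
else:
    print(f"Standby is questionable: expected loss {expected_loss} <= cost {standby_cost_per_year}")
```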
1. Hosting the entire project in one data center / one cloud availability zone
Cloud hosting marketing has firmly planted a mistaken idea in people's heads: cloud hosting is not tied to specific hardware, therefore cloud infrastructure cannot go down. Three day-long Amazon Web Services outages, a recent cloud4y outage, and the loss of cloudmouse data have shown that keeping the data and the project itself in a single data center is a guaranteed way to get many hours of downtime with no easy way to bring the project up on another site. Personal data legislation adds further complications here. We believe that any cloud provider has to live through several major incidents (a lightning strike on an Amazon data center, network problems caused by human error, and so on) before it learns how to prevent them, and while Western cloud providers have already been through that series of disasters, many other providers have yet to accumulate such experience.
The situation is similar with physical ("bare-metal") data centers. We often see client configurations where several servers are reserved within the same facility in case one of them has a hardware failure; in our experience, however, network problems that make several racks, or the entire data center, unreachable happen far more often than failures of individual servers, and this must also be taken into account.
The deployment scheme recommended by AWS assumes the use of several availability zones by default in order to achieve maximum project uptime.
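As a minimal sketch of how such a spread can be checked, the snippet below counts running EC2 instances per availability zone; it assumes boto3 is installed and AWS credentials and a region are already configured, and nothing more.

```python
# Check that running EC2 instances are spread across more than one availability zone.
# Assumes boto3 is installed and AWS credentials/region are configured.
import boto3
from collections import Counter

ec2 = boto3.client("ec2")
zones = Counter()

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            zones[instance["Placement"]["AvailabilityZone"]] += 1

print(dict(zones))
if len(zones) < 2:
    print("WARNING: the whole project lives in a single availability zone")
```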
2. Lack of adequate duplication at the standby site
So we arrive at the banal conclusion that a standby site is needed to achieve maximum project uptime; to switch over to it, however, its data must match the production site. What matters here is not the initial creation of the standby, which is a fairly simple and well-understood procedure, but the synchronization of all further changes and the monitoring of that synchronization. First of all, this means:
- Synchronization of the cluster configuration and of the data within the cluster, when we are talking about a complex site
- Synchronization of the file structure and monitoring of the synchronization lag (one concrete lag-monitoring sketch follows this list)
- Tracking server configuration changes
- Well-established processes for controlling and adding new projects / services to the synchronization
- Tracking the addition of new secondary services (new queues, processing and interaction mechanisms, etc.).
- Adequate continuous monitoring of all these processes.
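As one concrete instance of lag monitoring (here, for database replication to the standby), a minimal sketch follows; the host, the credentials, and the 60-second threshold are assumptions, and the pymysql package is assumed to be installed.

```python
# Alert when the MySQL replica on the standby site lags too far behind production.
# Assumes pymysql and a monitoring user exist; on MySQL 8.0.22+ the statement is
# SHOW REPLICA STATUS and the field is Seconds_Behind_Source.
import pymysql

LAG_THRESHOLD_SECONDS = 60  # assumption: tune to your recovery point objective

conn = pymysql.connect(
    host="standby-db.example.internal",  # hypothetical standby host
    user="monitor",
    password="secret",
    cursorclass=pymysql.cursors.DictCursor,
)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
        lag = status["Seconds_Behind_Master"] if status else None

    if lag is None:
        print("CRITICAL: replication is not running on the standby")
    elif lag > LAG_THRESHOLD_SECONDS:
        print(f"WARNING: standby is {lag} seconds behind production")
    else:
        print(f"OK: replication lag is {lag} seconds")
finally:
    conn.close()
```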
3. No switchover plan and no regular switchovers to the standby site
Even the best monitoring cannot guarantee that the standby site will be ready to take over when it is really needed. In our experience, the first failover inevitably turns into an incident, and so do the next few. Stack Overflow has said in its reports that it took roughly five switchovers to the standby before they were confident it was fully ready to accept traffic after an incident. The plan for improving the project's uptime must therefore include test switchovers to the standby, with the expectation that those switchovers will themselves cause incidents. Once the switchover procedure has been worked out and documented, keep switching to the standby regularly to make sure everything still works.
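Below is a minimal sketch of the kind of check that might trigger the documented switchover procedure; the health URL, the three-failure threshold, and the polling interval are assumptions, and the actual switching steps belong in your runbook, not in this script.

```python
# Poll the production health endpoint and raise the alarm only after several
# consecutive failures, so a single flaky request does not start a failover.
import time
import requests

HEALTH_URL = "https://www.example.com/health"  # hypothetical endpoint
FAILURES_BEFORE_FAILOVER = 3
CHECK_INTERVAL_SECONDS = 10

consecutive_failures = 0
while True:
    try:
        response = requests.get(HEALTH_URL, timeout=5)
        healthy = response.status_code == 200
    except requests.RequestException:
        healthy = False

    consecutive_failures = 0 if healthy else consecutive_failures + 1

    if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
        # In a real setup this is where the documented switchover plan starts:
        # page the on-call engineer and/or flip DNS / VRRP priority.
        print("Production looks down: start the switchover procedure")
        break

    time.sleep(CHECK_INTERVAL_SECONDS)
```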
4. Hosting the standby site on the same network channel / in the same cloud region
If the production and standby sites are hosted by the same provider, it is quite possible that an incident will take both of them down at once. Several major AWS outages have affected all availability zones of a region at the same time; Selectel went down simultaneously in its St. Petersburg and Moscow data centers; and although providers may claim complete isolation, the cloud4y incident that made Bitrix24 completely unavailable shows that serious risks remain even then. In our view, the ideal configuration is one standby located at the same hosting provider (so that standard switchover tools such as VRRP can be used) and a secondary standby site at a different hosting provider.
5. Running identical versions on the primary and standby sites
Even a tested standby site, and even a secondary site in another data center, does not guarantee that the standby can quickly take over the production load. This follows from the very nature of a reserve: a new code version that creates a fatal load on production will create exactly the same load on the standby, and the project will become completely unavailable. The simple remedy is a rollback mechanism to the previous version; in the business race for releases this is not always possible, and then it is worth thinking about yet another standby platform running the previous version. Backups deserve a separate mention: accidental deletion of data on the primary site will be replicated to the standby as well, so consider delayed replication (by 15 minutes, or an hour) so that you can switch to a database on which the fatal operation has not yet been applied.
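One way to get such a delayed copy on MySQL is its built-in replication delay; the sketch below is a minimal illustration, assuming pymysql and a replica set aside specifically for this role (on MySQL 8.0.23+ the equivalent statements are STOP/START REPLICA and CHANGE REPLICATION SOURCE TO SOURCE_DELAY = ...).

```python
# Configure a dedicated MySQL replica to apply changes 15 minutes after production,
# so a destructive statement can be caught before it reaches this copy.
import pymysql

DELAY_SECONDS = 900  # 15 minutes, as discussed above

conn = pymysql.connect(
    host="delayed-replica.example.internal",  # hypothetical host
    user="admin",
    password="secret",
)
try:
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")
        cur.execute(f"CHANGE MASTER TO MASTER_DELAY = {DELAY_SECONDS}")
        cur.execute("START SLAVE")
finally:
    conn.close()
```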
6. Dependence on external services used by the project
But even that is not enough. A huge number of projects now rely on external services to deliver their own. Most use SMS for two-factor authentication, online stores calculate delivery times via delivery services, payments are accepted through third-party payment gateways, and if one of these services goes down, it no longer matters whether you have a standby or not: the project will still be unavailable. We rarely see external services being duplicated, yet these are exactly the same kinds of projects that may have problems with their own standby site, or have no reserve at all, and while an external service is unavailable, you cannot serve your customers either. We recommend duplicating all critical external systems, monitoring their availability, and having a plan for switching them over in case of an incident.
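A minimal sketch of what such a switchover could look like for an SMS dependency is shown below; both gateway URLs, the payload format, and the API itself are purely hypothetical placeholders for whatever providers you actually use.

```python
# Try to send an SMS through the primary gateway, fall back to the secondary one
# if the primary is unreachable or unhealthy. URLs and payload are hypothetical.
import requests

PRIMARY_SMS_GATEWAY = "https://sms-primary.example.com/send"      # hypothetical
SECONDARY_SMS_GATEWAY = "https://sms-secondary.example.com/send"  # hypothetical


def send_sms(phone: str, text: str) -> bool:
    payload = {"phone": phone, "text": text}
    for gateway in (PRIMARY_SMS_GATEWAY, SECONDARY_SMS_GATEWAY):
        try:
            response = requests.post(gateway, json=payload, timeout=5)
            if response.status_code == 200:
                return True
            print(f"{gateway} returned {response.status_code}, trying next provider")
        except requests.RequestException as exc:
            print(f"{gateway} unavailable ({exc}), trying next provider")
    return False


if __name__ == "__main__":
    if not send_sms("+10000000000", "Your login code is 123456"):
        print("CRITICAL: no SMS provider available, alert the on-call engineer")
```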
This is not everything, but it covers the basics. We discuss these topics in more detail at uptime.community meetups; the next one will be in October, and in the meantime you can join the discussion in the Telegram group.