Vasiliskov October 6, 2014 at 12:19

A few words about planning a recovery strategy

Take a break from reading for a minute and answer one question for yourself: how critical is one minute of downtime for your service, really? Have you answered? I think most readers, if not all, thought: “We will survive.” Now answer: how critical is five minutes of downtime? What about 30 minutes, an hour, a day? At one of these steps a voice in your head will say: “No, that is already too much.” You have just defined one of the key parameters needed to draw up a continuity plan for an IT service. Read on to find out what it is and how to put it to use.

Everything fails sooner or later. As a provider of dedicated server rentals, we regularly observe how different users deal with keeping their services running and restoring them. And we have come to a sad conclusion: despite how much has been written and said about backing up data and equipment, some projects still have no well-developed recovery strategy at all. When something happens, they simply start to panic, pull in employees at random, and sometimes blame everyone and everything.

“Business continuity planning (also sometimes referred to as business continuity and resiliency planning) determines how much the organization is exposed to internal and external threats, and sets the necessary hardware and software tools to ensure effective counteraction and restoration of the normal functioning of the organization while maintaining competitive advantage and system integrity” (Elliot et al. 1999).

This term was originally introduced for more “severe” cases: disruptions to offices or data centers caused by fires, natural disasters, criminal acts by third parties, and other events that usually occur far less often than, say, a hard disk failure. The British Standards Institution has even issued a dedicated standard for business continuity management, BS 25999. We will not go that deep, though; we will simply try to help you work out for yourself how, and how thoroughly, you should prepare for possible interruptions.

What are you willing to lose?

Any business involves certain risks. For the business to succeed, risks must not be left to live a life of their own; they must be managed. IT projects and services hosted on the network face a characteristic set of risks that lead to temporary unavailability. Each of them can be described numerically by parameters such as the probability of occurrence, the duration of the impact, and the cost of fully or partially mitigating or eliminating it.
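As a rough illustration of how such parameters combine, here is a minimal sketch that ranks risks by expected yearly loss. The risk names, frequencies, outage durations, and the hourly cost of downtime are all made-up numbers for illustration, and expressing probability as "incidents per year" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    incidents_per_year: float  # expected frequency (assumed value)
    outage_hours: float        # expected downtime per incident (assumed value)


def expected_downtime_hours(risk: Risk) -> float:
    """Expected downtime per year caused by one risk."""
    return risk.incidents_per_year * risk.outage_hours


def expected_loss(risk: Risk, cost_per_hour: float) -> float:
    """Expected yearly loss: frequency x duration x cost of one hour of downtime."""
    return expected_downtime_hours(risk) * cost_per_hour


# Hypothetical figures, for illustration only.
RISKS = [
    Risk("disk failure", incidents_per_year=0.5, outage_hours=4),
    Risk("DDoS attack", incidents_per_year=2, outage_hours=3),
    Risk("data center fire", incidents_per_year=0.01, outage_hours=72),
]
COST_PER_HOUR = 200.0  # assumed cost of one hour of downtime

if __name__ == "__main__":
    # Print the risks from most to least expensive.
    ranked = sorted(RISKS, key=lambda r: expected_loss(r, COST_PER_HOUR), reverse=True)
    for risk in ranked:
        print(f"{risk.name}: {expected_downtime_hours(risk):.2f} h/year, "
              f"expected loss {expected_loss(risk, COST_PER_HOUR):.0f} per year")
```

Even with crude estimates, a ranking like this shows which risks deserve money first.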

In an emergency there are three main things that can be “lost”: data, time, and money. Related problems such as reputational damage, lost profits, and so on can ultimately be reduced to these three.

These parameters are closely linked. For example, the less time and data you are willing to lose, the more money you need to invest in redundant capacity and backups. Conversely, if you cut costs while keeping the same recovery time, data loss may increase. And so on.

Even before reading past the cut, I hope, you decided what maximum downtime is acceptable for your project. In fault tolerance planning, this parameter is called the recovery time objective (RTO). It is the time within which normal operation of the service or business process must be restored to prevent serious consequences. Naturally, what counts as serious consequences for you is also something you must determine for yourself.

The second important parameter to evaluate when planning is the recovery point objective (RPO). This is another time interval. It describes the maximum period for which the data of the IT service may acceptably be lost. This parameter is somewhat harder to describe. You cannot simply say that it is the allowable amount of data loss, although to a first approximation it is treated that way. Roughly speaking, it is the maximum allowed time from the start of the last available backup to the moment of the accident.

There are two more parameters, the actual recovery time and the actual recovery point, but these can only be found out either during a simulated failure or in a real accident.
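To make the difference between objectives and actual values concrete, here is a minimal sketch that compares what happened in an incident (or a drill) against the RTO and RPO. The function names, the backup schedule, and the timestamps are hypothetical, and the sketch assumes every listed backup completed successfully and is restorable.

```python
from datetime import datetime, timedelta
from typing import Sequence


def data_loss_window(backup_starts: Sequence[datetime], incident: datetime) -> timedelta:
    """Actual recovery point: time from the start of the last backup
    taken before the incident to the incident itself."""
    earlier = [b for b in backup_starts if b <= incident]
    if not earlier:
        raise ValueError("no backup exists before the incident")
    return incident - max(earlier)


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True if the actual value stays within the objective (RTO or RPO)."""
    return actual <= objective


if __name__ == "__main__":
    # Hypothetical nightly backups at 03:00 and an incident at 17:30.
    backups = [datetime(2014, 10, 5, 3, 0), datetime(2014, 10, 6, 3, 0)]
    incident = datetime(2014, 10, 6, 17, 30)
    service_restored = datetime(2014, 10, 6, 19, 30)

    actual_rpo = data_loss_window(backups, incident)  # 14 h 30 min of data lost
    actual_rto = service_restored - incident          # 2 h of downtime

    print("RPO met:", meets_objective(actual_rpo, timedelta(hours=24)))
    print("RTO met:", meets_objective(actual_rto, timedelta(hours=1)))
```

With nightly backups, the worst-case data loss approaches 24 hours, so an RPO shorter than that calls for a more frequent backup schedule or replication.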

In large companies, the targets are set by dedicated analysts who deal with fault tolerance and who then hand the task over to a group of specialists responsible for meeting those targets technically. They, in turn, determine what should be stored, made redundant, and kept for a rainy day, where, and in what quantities.

But if your project consists of you and your programmer or system administrator, that is no reason to abandon such an analysis altogether and say that it does not apply to you. In our practice there has been more than one case where the complete absence of a well-thought-out monitoring and recovery strategy caused people problems ranging from dropping in search engine indexes to roughly half a day of unavailability of a financial tool or service, because all the data was stored on one server and no real-time replication was in place.

Who is to blame?

First of all, the project manager and the specialists responsible. Providers do everything in their power to ensure maximum uptime, but almost any public offer agreement will state (possibly in fine print) that the provider bears no responsibility for interruptions or data loss for any reason. Even if a drunk engineer accidentally formats the wrong server, you will most likely not be able to count on anything more substantial than sincere apologies and regrets. Besides, recall the thesis: everything fails sooner or later. Even services positioned as uninterrupted (just remember the massive outages of the Amazon cloud).

The safety of your data and the health of your services should concern you first and foremost. You must answer the following question:

What to do?

Learn from other people's mistakes as much as possible. Today's wealth of public information lets you analyze a great many failures and assess the potential weaknesses of your own project.

The first thing to do on the way to a business continuity plan is to get rid of illusions. We had a case where a user simply ignored the need to make backups. Automatic backups in the control panel were not working correctly? Oh well, never mind. He sincerely believed that RAID1 would save him. Imagine his surprise when the first disk in the array degraded badly and the second turned out to have many errors in the file table. As you might guess, an attempt to quickly replace the first disk and rebuild the array led to nothing good. Our administrators had to put back the drive that was on the verge of complete failure and painstakingly pull the data off it byte by byte. The user's explanation for not making backups surprised us: “This has never happened to me in 6 years of work.” Apparently

Second, identify potential threats, their likelihood, and the duration of their impact. How long does it take to switch to a DDoS filtering service? How long will it take to replace a disk or an entire server in your data center? How long will it take to deploy the project in another data center if yours suffers a fire or a flood, or if the provider simply ceases to exist? Where will you deploy it, how quickly will new equipment be provided, and so on. If the resulting figures do not fit your target RTO, look in advance for other providers whose infrastructure will help you recover. Also decide how much data you are willing to lose and choose a backup scheme to match.
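One way to carry out this check is to write down each scenario as a list of recovery steps with estimated durations and compare the totals against the RTO. The sketch below does exactly that; the scenarios, step names, durations, and the four-hour RTO are hypothetical placeholders to be replaced with figures from your own provider and team.

```python
from datetime import timedelta

RTO = timedelta(hours=4)  # assumed target, for illustration only

# Scenario -> list of (recovery step, estimated duration). Hypothetical values.
SCENARIOS = {
    "DDoS attack": [
        ("switch to a filtering service", timedelta(hours=2)),
    ],
    "disk failure": [
        ("disk replacement by the provider", timedelta(hours=1)),
        ("array rebuild", timedelta(hours=2)),
    ],
    "data center lost": [
        ("provision a server elsewhere", timedelta(hours=24)),
        ("restore from backup", timedelta(hours=6)),
        ("switch DNS", timedelta(hours=1)),
    ],
}


def total_recovery_time(steps):
    """Sum the estimated durations of all steps in one scenario."""
    return sum((duration for _, duration in steps), timedelta())


def scenarios_exceeding(scenarios, rto):
    """Return the scenarios whose estimated recovery time does not fit the RTO."""
    totals = {name: total_recovery_time(steps) for name, steps in scenarios.items()}
    return {name: total for name, total in totals.items() if total > rto}


if __name__ == "__main__":
    for name, total in scenarios_exceeding(SCENARIOS, RTO).items():
        print(f"{name}: estimated {total}, exceeds RTO of {RTO}")
```

Any scenario that shows up in the output is a reason to look for a faster recovery path before the failure happens.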

Third, do the math. As I wrote above, the less time and data you are willing to lose, the more it will cost you. Estimate the one-time and recurring expenses needed to reach the required continuity targets. Are you ready to pay the resulting amount? If not, then the data is not as important to you as you thought. Re-evaluate, this time with your recovery budget in mind.
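The calculation can be sketched as follows: add the one-time and recurring costs of each protection option over a planning period and see which ones fit the budget. The option names, prices, RTO/RPO figures, and the 12-month horizon are invented for illustration and do not reflect any real price list.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    one_time_cost: float  # setup, hardware, migration (assumed value)
    monthly_cost: float   # rent, storage, services (assumed value)
    rto_hours: float      # recovery time this option is expected to give
    rpo_hours: float      # data loss window this option is expected to give


def total_cost(option: Option, months: int) -> float:
    """One-time expenses plus recurring expenses over the planning period."""
    return option.one_time_cost + option.monthly_cost * months


def affordable_options(options, budget: float, months: int):
    """Options that fit the budget, best recovery figures first."""
    fits = [o for o in options if total_cost(o, months) <= budget]
    return sorted(fits, key=lambda o: (o.rto_hours, o.rpo_hours))


# Hypothetical options, for illustration only.
OPTIONS = [
    Option("weekly backups", 0, 10, rto_hours=48, rpo_hours=168),
    Option("daily backups + spare server", 100, 60, rto_hours=8, rpo_hours=24),
    Option("hot standby with replication", 500, 200, rto_hours=0.5, rpo_hours=0.1),
]

if __name__ == "__main__":
    for option in affordable_options(OPTIONS, budget=1000, months=12):
        print(f"{option.name}: {total_cost(option, 12):.0f} per year, "
              f"RTO {option.rto_hours} h, RPO {option.rpo_hours} h")
```

If the option you need is not in the affordable list, either the budget or the targets have to change.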

Fourth, implement. Counting and evaluating is not enough; the measures have to be applied in practice. Order the backup equipment and services you need, sign the necessary contracts, turn on monitoring. Write down in a text document which service to contact in which case and what procedure to follow in each situation. You can even simulate a failure or two. You will thank yourself for clear, consistent instructions when something happens. A written recovery plan will save you a lot of time and nerves. The situation simply moves from the unforeseen category into the category of anticipated emergencies. You will no longer wander in the dark like a blind kitten.

The value of anything in our lives is determined by how much we are willing to give to keep it. If you truly value the results of your work, do not forget to take care of their safety. Who will, if not you?


Do you have a well-developed action plan in case of failure?

8.1% Yes, strictly documented 17
37.5% Yes, in general 78
54.3% No, I will improvise, if 113
