Disaster Recovery Planning. Part Three - Final

Matching the needs of the business to its capabilities

The previous articles (1, 2) on disaster recovery planning described procedures for collecting and processing information about the organization’s IT infrastructure. These procedures yield accurate information about:

IT services critical for the company’s business,
The current recovery time in the event of a failure,
Minimum achievable disaster recovery times,
The necessary resources to achieve them.

Everything would be fine if it were not for the organization's limited finances, which do not allow it to acquire all the reserves needed for rapid recovery. For this reason, the final goal of disaster recovery planning is to find a balance between the needs and the financial capabilities of the business, and to formalize it as a Service Level Agreement (SLA) for incident resolution.

This stage consists entirely of agreeing the following points with the company's management:

1. Support hours the internal IT service provides to the business

The main factor determining support hours is the readiness of technical specialists to begin disaster recovery as soon as they learn of a failure. An eight-hour working day, holidays, illness, and days off naturally limit this readiness. If you have no specialists with the competencies needed for restoration work, or if the engineers do not overlap sufficiently, both in working hours and in skills when one of them is absent, then the business should not count on 24/7 support. If the current overlap between specialists cannot guarantee a response even on a 9x5 schedule, the following options are possible:

Measure the recovery time not from the moment the incident occurred, but from the moment the responsible specialist starts working on it,
Prepare in advance so that the user service can be restored by less qualified specialists,
Train a backup specialist in the necessary skills,
Hand over the point of failure, or the entire user service, to an external contractor whose SLA meets the required parameters.
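
The overlap reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the engineer names, the skill sets, and the `min_engineers` threshold are hypothetical, and the sketch assumes a service counts as covered only when at least two engineers can restore it, so that the absence of one of them does not leave it unsupported.

```python
from dataclasses import dataclass, field


@dataclass
class Engineer:
    name: str
    # Services this engineer is able to restore (hypothetical data).
    skills: set[str] = field(default_factory=set)


def uncovered_services(services, engineers, min_engineers=2):
    """Return the services that fewer than `min_engineers` people can restore.

    With min_engineers=2 a service stays covered when one engineer is ill
    or on vacation. The threshold is an assumption made for this sketch.
    """
    gaps = []
    for service in services:
        capable = [e.name for e in engineers if service in e.skills]
        if len(capable) < min_engineers:
            gaps.append(service)
    return gaps


# Hypothetical roster and service list.
team = [
    Engineer("anna", {"mail", "erp"}),
    Engineer("boris", {"mail"}),
]
gaps = uncovered_services(["mail", "erp"], team)
print(gaps)
```

Every service the function returns is a candidate for one of the options listed above: training a backup specialist, preparing simplified recovery instructions, or handing the service to a contractor.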

However, with external contractors things are not so straightforward:

2. SLA with external contractors

Seemingly smooth cooperation with an external contractor may conceal the contractor's inability to resolve incidents within the time frame the business requires. Convenience and efficiency can turn into a headache at the first sign of trouble if the external provider does not understand the level of service you need.

If the external supplier's existing service level agreement is unsatisfactory for your business (or is simply absent), the following options are possible:

Agree on new terms with the existing contractor. Reserve the right to run several random SLA checks,
Switch to a contractor whose standard SLA meets your requirements. Again, verify that it is actually honored,
Connect a backup service operator to quickly switch to it in case of problems with the main one,
Accept the situation and leave everything unchanged if the contractor is a monopolist. Bring this state of affairs to the attention of the company's management and have them formally acknowledge it,
Organize this service on your own.

Once you have decided which people and/or companies will carry out restoration work, you can state the support hours for user services that can be written into the service level agreement between the IT department and the business. All that remains is to agree on the deadlines for restoring those services, and for this you need to discuss:

3. Obtaining the reserves necessary for disaster recovery

The availability of the necessary equipment reserves directly affects how quickly a service can be restored. If your company has a single physical server, then when it fails there will simply be nothing to restore the service on (for more details on determining the necessary reserves, see the previous article). If your company does not currently have all the equipment reserves needed for restoration work, the following options are possible:

Purchase equipment in advance if the cost of downtime clearly exceeds its price. For example, a spare switch costs significantly less than the downtime incurred while a replacement is being procured,
Sign a service contract for the replacement of failed equipment if "next business day replacement" is acceptable to the business,
Agree on the prompt allocation of funds to purchase the necessary element in the event of a failure, if the cost of downtime is comparable to the price of the reserve element,
Agree on degraded system quality in the event of a failure and/or on shutting down secondary services so that business-critical systems can be started,
Agree on the prompt allocation of funds to purchase less powerful equipment for temporarily running the failed service with reduced quality.
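
The comparison between the cost of downtime and the price of a reserve can be written down as a simple rule of thumb. The sketch below is an assumption-laden illustration, not a costing method: the function name, the two threshold factors (what counts as "clearly exceeds" and as "comparable"), and the sample figures are all hypothetical and should be replaced with values agreed with management.

```python
def reserve_decision(spare_price, downtime_cost_per_hour, procurement_hours,
                     obvious_factor=2.0, comparable_factor=0.5):
    """Suggest how to handle one reserve element.

    The expected loss is the hourly cost of downtime multiplied by the time
    needed to procure a replacement after a failure. Both factors are
    arbitrary assumptions chosen for this sketch.
    """
    downtime_loss = downtime_cost_per_hour * procurement_hours
    if downtime_loss >= obvious_factor * spare_price:
        # Downtime clearly costs more than the spare.
        return "buy in advance"
    if downtime_loss >= comparable_factor * spare_price:
        # Downtime and spare price are of the same order.
        return "pre-approve emergency funds"
    return "no advance action"


# Hypothetical figures: a cheap switch and an expensive server.
print(reserve_decision(spare_price=1_000, downtime_cost_per_hour=500,
                       procurement_hours=48))
print(reserve_decision(spare_price=20_000, downtime_cost_per_hour=300,
                       procurement_hours=72))
```

The value of such a rule lies less in the exact thresholds than in forcing both numbers, the downtime cost and the procurement time, to be stated explicitly during the discussion with management.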

In principle, at this stage you can already state the time frames within which particular user services can be restored after any failure. If management is not satisfied with these time frames even with all the necessary reserves in place, this is a reason to discuss:

4. Advance preparation to accelerate disaster recovery

This can be an additional monitoring or backup system, or an additional server or piece of network equipment that is already configured and running in hot standby mode. You may need them to localize the failure and restore the user service a little faster.

Once management has approved all the necessary investments in people, service contracts, equipment, and software, you can agree on the deadlines for restoring user services in addition to the support hours. But to guarantee that these deadlines are met, one more small detail is needed:

5. Scope of routine tasks

To guarantee recovery, you must be sure that in an emergency all the resources needed for recovery will actually be available. This requires constantly checking that they are present and in working order. Using the information on the previously agreed reserves and resources, you can draw up an accurate list of the necessary routine maintenance tasks; performing them regularly may require additional technical specialists. This is the necessary price of reliability, but unfortunately even it is sometimes not enough:

6. Situations beyond the SLA

There are situations in which recovery time is difficult to predict and which fall outside planning. These include not only force majeure, but also the simultaneous failure of two or more elements of the same type, which probability theory does allow for.
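
To get a feel for how likely such a simultaneous failure is, a rough estimate can be made with the binomial formula. The sketch below rests on strong assumptions that are not part of this article: failures are taken to be independent, every element has the same steady-state unavailability, and that unavailability is estimated as MTTR / (MTBF + MTTR). The function names and the sample figures are hypothetical.

```python
def unavailability(mtbf_hours, mttr_hours):
    """Fraction of time one element is down, estimated as MTTR / (MTBF + MTTR)."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def p_two_or_more_down(n, q):
    """Probability that at least two of n identical elements are down at once.

    Assumes independent failures and the same unavailability q for each
    element (binomial model).
    """
    p_none = (1 - q) ** n
    p_exactly_one = n * q * (1 - q) ** (n - 1)
    return 1 - p_none - p_exactly_one


# Hypothetical example: 4 identical disks, MTBF 50,000 h, 24 h to replace one.
q = unavailability(mtbf_hours=50_000, mttr_hours=24)
print(p_two_or_more_down(4, q))
```

In practice failures of same-type elements are often correlated (same batch, same power feed, same firmware), so an estimate from the independent model should be read as a lower bound rather than a forecast.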

Often it does not make economic sense to prepare the IT infrastructure and IT specialists for the prompt resolution of every possible incident. In some cases it is much cheaper and more effective to prepare the business itself to act when such incidents occur. For example, prepare paper invoice forms so that goods can be released manually if the computer systems fail completely, or keep strict records of primary documents so that business operations performed since the last database backup before the incident can be restored without difficulty. Possible technical measures to reduce the negative impact of such situations on the business have been described previously.

At this point the coordination stage can be considered complete; only minor formalities remain:

We record the agreed parameters and act

The results of your negotiations with management should be recorded in a document that covers:

Support hours for user services agreed with the business,
Guaranteed time frames for restoring them after failures,
The money (including the timing of its allocation) and the activities needed to achieve these goals,
Situations that fall outside planning, and a list of measures to reduce damage if they occur.
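
The four items above can also be kept in a structured form alongside the paper document, which makes it easy to check real incidents against the agreement. The sketch below is one hypothetical way to model such a record in Python; the class name, field names, and sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class ServiceAgreement:
    service: str
    support_hours: str            # agreed support schedule, e.g. "9x5" or "24x7"
    max_recovery: timedelta       # guaranteed restoration time
    budget: float                 # funds agreed for reserves and contracts
    # Situations that fall outside planning, with no guaranteed deadline.
    out_of_scope: list[str] = field(default_factory=list)

    def is_met(self, actual_recovery: timedelta) -> bool:
        """True if an incident was resolved within the guaranteed time."""
        return actual_recovery <= self.max_recovery


# Hypothetical agreement for a mail service.
mail = ServiceAgreement(
    service="mail",
    support_hours="9x5",
    max_recovery=timedelta(hours=4),
    budget=5_000.0,
    out_of_scope=["simultaneous failure of both mail servers"],
)
print(mail.is_met(timedelta(hours=3)))
```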

The agreements recorded in the document will let you move from a situation where “the IT infrastructure pretends to work, and the business pretends to invest in it” to one where the business understands what level of service it can count on for a given level of IT investment.

With this, disaster recovery planning can be considered successfully completed. Admittedly, after evaluating all the necessary changes and their cost, it sometimes becomes clear that it is cheaper to fundamentally change the existing IT infrastructure. But that is a completely different story.

Good luck!

Ivan Kormachev
IT Department Company
www.depit.ru
