Instructions for Business Impact Analysis
Not everyone knows where and when to start implementing business continuity plans. My usual answer is this: when the possible losses are higher than the cost of countering the threat, it is time to act, because the spending will be justified. And vice versa. While the cost of countermeasures is more or less clear, assessing the losses is not a trivial task. I invite you backstage on a Business Impact Analysis (BIA) project, in which we assessed the impact of emergencies and developed an IT continuity strategy, using a large retailer as the example. So let's go.
We took part in a project for X5 Retail Group, the largest retailer in Russia. The company operates the Pyaterochka, Karusel and Perekrestok chains.
The company already had its own business interruption risk management policy, which covered:
- risk insurance;
- the formation of crisis management;
- minimization of risks to human life and health;
- business risk management;
- preparing IT recovery plans for emergency situations.
In accordance with this policy, it was necessary to develop IT service continuity and disaster recovery plans for a centralized IT infrastructure. The ideal solution would be to build a backup data center, set up synchronous replication, automate the emergency failover and watch on the screen as the IT systems move to the backup data center and the green lights come on, signaling that there is no danger to the business.
But given the economics, the company assumed that an adequate measure for an emergency would be to provide backup only for the most critical IT systems, without which the stores could not operate and the company would start to incur significant financial losses. An important question arises: which systems should be restored, and within what time?
The customer's IT department had already defined a classification of IT systems and an allowable recovery time for each system. However, it was later decided to conduct a full assessment of the impact of emergencies on the company's business (BIA) in line with ISO 22301 and best practices.
Scope and boundaries
As the saying goes, the theater starts with the coat check, and a BIA starts with defining the scope of work. To do this, you need to examine the company's business processes, its services, financial statements, and relationships with partners, customers and contractors. Then you identify and agree on the key business processes and services that fall within the project boundary. The duration and cost of a BIA depend on this scope. Our experience suggests that you should not stretch the project beyond 9 months.
In our case, the customer had already determined the boundaries by selecting the business processes most important for trading activities.
Once the boundaries and scope of the BIA are fixed, we determine the list of stakeholders from the business and other departments who need to be interviewed. It is very important to collect information from different departments in order to get an objective picture of the company's processes, to understand how they work, and to get an assessment of "what will happen if ...". At this stage, we find out exactly how business processes depend on IT and build a matrix of these dependencies. Business representatives and other parties with a stake in the business process also assess the consequences, the probable damage and the possible scenarios. For this, we developed a special questionnaire and interviewed about 50 respondents (we gave 50 presentations about the project itself, conducted the interviews, and collected and processed all the completed questionnaires).
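A minimal sketch of what such a dependency matrix might look like in code. The process and system names here are illustrative assumptions, not the actual matrix from the project; the point is only that once the matrix exists, it can be read in both directions.

```python
from collections import defaultdict

# Hypothetical dependency matrix: business process -> IT systems it relies on.
# All names are made up for illustration.
DEPENDENCIES = {
    "pricing": {"sap_pricing", "cash_register_software"},
    "replenishment": {"sap_logistics", "warehouse_system"},
    "sales": {"cash_register_software"},
}


def processes_by_system(dependencies):
    """Invert the matrix: IT system -> set of business processes that depend on it."""
    inverted = defaultdict(set)
    for process, systems in dependencies.items():
        for system in systems:
            inverted[system].add(process)
    return dict(inverted)


def affected_processes(dependencies, failed_systems):
    """Return the processes that use at least one of the failed systems."""
    failed = set(failed_systems)
    return {p for p, systems in dependencies.items() if systems & failed}
```

For example, `affected_processes(DEPENDENCIES, ["cash_register_software"])` lists every process that stops when the cash register software is down.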
In parallel with the interviews, we described the business processes, recording the time taken by individual operations and going into enough depth for the subsequent analysis. Breaking a process down into smaller components and specific operations is necessary in order to understand how an IT system affects a particular process at different times of the day and different times of the year. At this stage it is important to understand that we do not describe the business processes according to GOST or any other methodology. We do not optimize business processes and, in general, we do not give recommendations on how to improve them, at least not within the BIA. We describe business processes in just enough detail to justify the method of calculating losses and to estimate losses by several criteria. For the graphical description we used EPC notation, with ARIS and MS Visio as tools.
In order to determine an objective target recovery time, it is necessary to agree up front on the criteria by which we will assess the damage, and on their threshold values. If these thresholds are exceeded, we consider the damage critical, and the point in time at which the threshold value is reached becomes the target recovery time (RTO). Two options were suggested:
- determine the RTO by one criterion - financial loss;
- determine the RTO by three criteria - financial losses, loss of reputation, loss of controllability of business processes.
The first option, with one criterion, seems preferable, since any loss can in principle be converted into money; the main thing is to agree on a conversion formula. But, as practice shows, nobody actually converts reputational losses into financial ones, and agreeing on such a formula may take an indefinite amount of time. It was decided to calculate the recovery time for both options and, at the stage of presenting the results, let the customer decide which of them reflects reality more objectively.
Looking ahead, I will say that with the first option, using one criterion, the RTO for the “pricing” process, for example, could reach 10 days. With the second option, the RTO did not exceed 24 hours. In any case, the management decision on which losses to take into account remains with the customer.
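The threshold logic can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual method: damage per criterion is given as a made-up hourly series, it accumulates over the outage, and the target recovery time is the first hour at which any criterion reaches its threshold. All numbers and names are hypothetical.

```python
def time_to_threshold(damage_by_hour, threshold):
    """Return the first hour at which cumulative damage reaches the threshold, or None."""
    total = 0.0
    for hour, damage in enumerate(damage_by_hour, start=1):
        total += damage
        if total >= threshold:
            return hour
    return None


def target_recovery_time(damage_curves, thresholds):
    """Earliest hour at which any listed criterion reaches its threshold (None if none does)."""
    hours = [time_to_threshold(damage_curves[c], thresholds[c]) for c in thresholds]
    reached = [h for h in hours if h is not None]
    return min(reached) if reached else None


# Hypothetical hourly damage over a 48-hour outage, in arbitrary units.
EXAMPLE_CURVES = {
    "financial": [10] * 48,
    "reputation": [1] * 48,
    "controllability": [0.5] * 48,
}
EXAMPLE_THRESHOLDS = {"financial": 300, "reputation": 12, "controllability": 20}

# Option 1: financial loss only.
rto_one_criterion = target_recovery_time(EXAMPLE_CURVES, {"financial": 300})
# Option 2: all three criteria; the strictest one wins.
rto_three_criteria = target_recovery_time(EXAMPLE_CURVES, EXAMPLE_THRESHOLDS)
```

With these invented numbers the three-criteria option gives the shorter target, because the reputation threshold is reached before the financial one. That is the same direction of effect as in the project, though the real figures came from the agreed methodology.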
Together with the customer, we determined the list of operational risks. That is, those that affect IT, which in turn affects business processes that ... well, you understand. This stage is important so that an emergency is not treated as a spherical horse in a vacuum, along the lines of "what will happen to the Motherland and to us if we lose IT". The risks were divided into global and local. For each of them, we determined the development scenario and the impact on the company's processes, taking the interview results into account. Obviously, the same IT system can affect several business processes when it fails, but within this project we were concerned with only two processes. Then we evaluated the risks against the following parameters:
- spread of threat;
- possibility of notification;
- duration of exposure;
- probability of occurrence;
- estimated damage.
Based on the results, we drew a heat map, where each application received a rating of how badly the business could get burned while that application is down. For example, 4 hours of downtime for individual SAP modules will not yet cause the company serious problems, but even the first hours of downtime for the cash register software are marked fiery red on the heat map.
It should be clarified that the risk assessment and subsequent ranking are produced by a group of experts and are needed to identify the situations most critical for the customer.
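One simplified way to picture the expert ranking: each risk gets a score from 1 to 5 on each of the five parameters, the scores are added up, and the total is mapped to a heat-map color. The additive scale, the color boundaries and the sample risks are all assumptions made for this sketch; the real assessment was done by an expert group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskAssessment:
    """Expert scores from 1 (mild) to 5 (severe) for one risk. Hypothetical scale."""
    name: str
    spread: int
    notification: int  # higher means less chance of advance warning
    duration: int
    probability: int
    damage: int

    def score(self):
        """Simple additive score; a real methodology may weight the parameters."""
        return (self.spread + self.notification + self.duration
                + self.probability + self.damage)


def heat_color(score, amber_from=12, red_from=18):
    """Map a total score to a heat-map color using assumed boundaries."""
    if score >= red_from:
        return "red"
    if score >= amber_from:
        return "amber"
    return "green"


def rank(risks):
    """Sort risks from most to least critical."""
    return sorted(risks, key=lambda r: r.score(), reverse=True)


fire = RiskAssessment("data center fire", 5, 5, 5, 1, 5)
disk = RiskAssessment("single disk failure", 1, 2, 1, 4, 1)
ranked = rank([disk, fire])
```

Here `ranked[0]` is the data center fire, and `heat_color(fire.score())` marks it red.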
A hypothetical risk and scenario. Fire in the data center: the server room has burned down completely, and the SAP module involved in the “Deposit” process is unavailable. This means that every day, until the burnt SAP module is restored, the product range in the stores shrinks. First of all, this concerns perishable products; secondly, products that are in high demand (for example, cereals and bread); thirdly, household chemicals. Obviously, this situation will lead to a decrease in store revenue. But what is not quite so obvious: a buyer who has come for beer and cigarettes may very well buy nothing at all if one of the two is missing. The same goes for the “Pricing” process. If a hypothetical buyer who learned about discounts on Wednesday at 12:00 comes to the store in the afternoon, and the “Pricing” process is not working (that is, prices are shown without discounts), then he:
A) will not buy anything (= financial loss);
B) will accuse the store of fraud (= loss of reputation);
C) will complain to the regulator (= fine for unfair advertising).
Method of estimating losses
As you have probably understood from the above, in order to calculate even the financial losses, it is necessary to develop methods and formulas that take into account discounts, promotions, time of day and high season (for example, the rush at the end of December). The method should contain a descriptive part (what each value is derived from and why it is multiplied by weights), as well as tables and graphs for clarity.
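To give a feel for what such a formula might look like, here is a rough sketch. It assumes a base hourly revenue, a weight for each hour of the day, a seasonal multiplier and the share of revenue lost while the system is down. All of these parameters and numbers are invented for illustration; the real method was agreed with the customer.

```python
def estimate_loss(start_hour, outage_hours, hourly_revenue, hour_weights,
                  season_weight=1.0, lost_share=1.0):
    """Estimate revenue lost during an outage.

    start_hour    -- hour of day (0-23) when the outage begins
    outage_hours  -- outage duration in whole hours
    hourly_revenue -- average revenue per hour at weight 1.0
    hour_weights  -- list of 24 weights, one per hour of the day
    season_weight -- multiplier for high season (e.g. late December)
    lost_share    -- fraction of revenue actually lost (0.0 to 1.0)
    """
    loss = 0.0
    for offset in range(outage_hours):
        hour_of_day = (start_hour + offset) % 24
        loss += hourly_revenue * hour_weights[hour_of_day] * season_weight * lost_share
    return loss


# Hypothetical store that sells only between 08:00 and 20:00.
WEIGHTS = [0.0] * 8 + [1.0] * 12 + [0.0] * 4

midday_loss = estimate_loss(10, 4, 1000, WEIGHTS)
december_loss = estimate_loss(10, 4, 1000, WEIGHTS, season_weight=1.5)
evening_loss = estimate_loss(18, 4, 1000, WEIGHTS)
```

The same four-hour outage costs different amounts depending on when it starts and in which season, which is exactly why the weights have to be written down and justified.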
The method also describes:
- how the recovery time for a business process is determined;
- how the recovery time for a business process is translated into RTO / RPO for IT systems;
- why criticality classes and recovery classes are needed.
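The translation from process recovery time to system RTO can be sketched like this. One assumption is mine and not stated in the article: when several processes share an IT system, the system takes the strictest (shortest) recovery time among them. RPO is left out of the sketch, and the names and hours are hypothetical.

```python
def system_rto(process_recovery_hours, dependencies):
    """Assign each IT system the shortest recovery time of the processes that use it.

    process_recovery_hours -- process name -> target recovery time in hours
    dependencies           -- process name -> iterable of IT systems it relies on
    """
    rto = {}
    for process, systems in dependencies.items():
        hours = process_recovery_hours[process]
        for system in systems:
            # Keep the strictest requirement seen so far for this system.
            rto[system] = min(hours, rto.get(system, hours))
    return rto


# Hypothetical inputs.
example_rto = system_rto(
    {"pricing": 24, "sales": 2},
    {
        "pricing": ["sap_pricing", "cash_register_software"],
        "sales": ["cash_register_software"],
    },
)
```

In this example the cash register software inherits the 2-hour target from the sales process, even though pricing alone would tolerate 24 hours.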
Let's move on.
Once all the interviews have been conducted, the business processes described, the risks assessed and the methodology defined and approved, the losses are calculated. Since the businesses of Pyaterochka, Karusel and Perekrestok differ at least in scale, we developed separate tables, charts and loss calculations for each chain.
For the business process as a whole, the recovery time is determined (see methodology) as the point at which losses exceed the threshold value (see thresholds). This recovery time is assigned to the IT systems involved in the business process (see interviews and dependency matrix). It would seem that the continuity parameters are defined and the project is complete (see scope and boundaries). But it is not enough to say “the process should be restored in 12 hours”. It is important to determine how things work now. How many hours does it take to revive an IT system today? And what if the current recovery time is longer, or significantly longer, than the target? For those who still have their sanity and concentration, welcome to GAP analysis!
GAP analysis and follow-up plan
As a result of the previous steps, we determined the “TO BE” state for the processes and systems, that is, how things should ideally be. At the current stage, we determine the “AS IS” state. Here we deal less with business processes and focus on the IT component. We evaluated the customer's current solutions in terms of recovery from an emergency. In this case there was no need to carry out a real recovery against a timer: digging into the details and running tabletop tests was enough to understand that the target RTO was unattainable.
After that, we developed a number of recommendations, both general ones (on ensuring IT continuity) and ones relating directly to the IT systems and their architecture. These are sketches of technical solutions and a rough estimate of the cost of implementing them. In effect, there is now a basis for decision making: on one side of the scale are the losses, and on the other is the cost of the measures.
If some IT systems do not pass the GAP analysis, that is, their recovery time exceeds the target, we draw up a program of projects to reach the target state or, if you will, a roadmap that justifies the order in which projects are implemented and includes an interim assessment of the organization's resilience.
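At its core, the comparison of “AS IS” against “TO BE” is simple arithmetic, and a sketch might look like this. The system names and hours are invented, and ordering by the size of the gap is only one possible assumption for prioritizing a roadmap; a real one would also weigh cost and losses.

```python
def gap_analysis(target_rto, current_recovery):
    """Return (system, gap_hours) for systems that miss their target, largest gap first.

    target_rto       -- system name -> target recovery time in hours ("TO BE")
    current_recovery -- system name -> estimated current recovery time ("AS IS")
    """
    gaps = []
    for system, target in target_rto.items():
        current = current_recovery.get(system)
        if current is None:
            continue  # no estimate available; such systems need a separate review
        if current > target:
            gaps.append((system, current - target))
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Hypothetical figures, in hours.
example_gaps = gap_analysis(
    {"cash_register_software": 2, "sap_pricing": 24, "intranet_portal": 48},
    {"cash_register_software": 12, "sap_pricing": 72, "intranet_portal": 24},
)
```

Systems that already recover faster than their target, like the portal in this example, simply drop out of the list.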
In addition, we developed training materials and templates for the customer to use when creating continuity plans and disaster recovery plans.
Wait, wait, I'm almost done.
Following the BIA results, we developed an IT continuity strategy. The strategy covers two key points:
- Which IT risks affecting the company's activities are taken into account and which are not (that is, what we fear and will address within continuity management, and what we do not fear because incident management handles it).
- Which organizational, architectural, infrastructural and other solutions we will use to defend against threats.
With the strategy, we kill two birds with one stone. First, everyone in the company understands how and against what we will defend ourselves. Second, for non-IT specialists (for example, finance staff), budgeting for IT disaster recovery solutions becomes more transparent. And no matter how grandiose it may sound, the strategy helps to make the right management decisions (there is always the option not to spend money on DR, and now we know exactly how that will affect the company in the event of an accident).
What's next? Implementing the continuity strategy and running business impact analysis for other business processes and IT systems, then developing continuity plans and testing them periodically. But that is a completely different story.
Igor Tukachev, Consultant, Jet Info Systems Design Center