Disaster Recovery Planning. Part one

    Deciding where to lay down the straw

    Failures in information systems can never be completely ruled out. Whatever the cause of a failure, when it happens the system administrator bears the responsibility for promptly restoring not only the IT systems but also the operation of the business as a whole.

    In a series of three short articles, I will try to describe in an accessible way how to create a disaster recovery plan. Such a plan turns system recovery into an activity agreed with management in advance, with its own schedule, resources and budget.

    The first article focuses on defining the scope of planning, that is, on finding the infrastructure elements whose failure adversely affects the system administrator's heart rate. So, in order:

    1. Make a list of critical user IT services

    The purpose of disaster recovery planning is to promptly restore the end service that the user receives, not a particular piece of hardware or software. Users do not care whether their printer is working or broken; what matters to them is whether they can print documents. A user will complain not that a hard drive failed in the server, but that “1C-ka” or “mail” is not working.

    For this reason, the first step is to define the list of critical user IT services for which we will plan disaster recovery. Usually these are:

    • Email
    • Telephone communications
    • Enterprise management system
    • Collaboration on documents
    • Printing documents
    • Access to the Internet
    • Etc.

    In essence, user services are the working tools that a business buys by investing in hardware, software and specialists' salaries, and that are critical to its functioning. For example, a Counter Strike server is certainly important for employee morale, but it is not critical for the business.

    2. Determine the points of failure of user services

    Even when a user complains about a problem with an end service, it is a specific element of the IT infrastructure that will have to be repaired. Therefore, at this stage you need to identify all the systems, applications and IT services whose failure will inevitably stop critical user services or degrade their quality. Simply put, your task is to find all the points of failure.

    By a point of failure we mean an infrastructure unit about which we can say nothing more than “it does not work”. For example, if your router is modular, then both the chassis itself and the modules inserted into it may fail. If you have the expertise to locate and replace a failed module, the device contains several points of failure; if not, it is a single point of failure.

    So, the “Email” service may have the following points of failure (including, but not limited to):

    • Server OS
    • Server mail application
    • Core switch
    • Power supply
    • External DNS Zone
    • Blacklisting
    • Server room air conditioning
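
    As a rough illustration, such an inventory can be kept as a simple mapping from each user service to its points of failure and then inverted to see which points are shared by several services. The sketch below is only an assumption about how one might record it: the “Email” entries are a subset of the list above, while the “Printing documents” entries and the function name services_by_failure_point are hypothetical and invented for the example.

```python
# Hypothetical inventory of user services and their points of failure.
# The "Email" entries are a subset of the list in the article; the
# "Printing documents" entries are invented purely for illustration.
SERVICE_FAILURE_POINTS = {
    "Email": [
        "Server OS",
        "Server mail application",
        "Power supply",
        "External DNS Zone",
    ],
    "Printing documents": [
        "Print server OS",  # assumed, not from the article
        "Power supply",
    ],
}


def services_by_failure_point(inventory):
    """Invert service -> points into point -> sorted list of services."""
    index = {}
    for service, points in inventory.items():
        for point in points:
            index.setdefault(point, set()).add(service)
    return {point: sorted(services) for point, services in index.items()}


if __name__ == "__main__":
    index = services_by_failure_point(SERVICE_FAILURE_POINTS)
    for point in sorted(index):
        print(f"{point}: {', '.join(index[point])}")
```

    A point that appears under several services, such as the power supply here, is a natural candidate for the top of the planning list.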

    Important! Do not exclude from the points of failure ultra-reliable equipment with which “nothing will ever happen”. When (precisely when, not if) your highly reliable storage loses all its data, whether you go on laughing at the circus or not will depend only on how prepared you are for that situation.

    3. Determine the dependencies between the points of failure

    A malfunction at one point of failure can cause malfunctions at others. For example, a UPS failure will shut down the servers and, as a result, something else may fail to start once power is restored. Likewise, stopping the hypervisor can cause errors in the virtual servers hosted on it. By contrast, the failure of a client switch does not affect the operation of other equipment or services, and once it is correctly replaced everything will work as before.

    For the "Email" user service, the dependencies of the points of failure can look like this:

    Figure 1. The dependencies of the points of failure.

    Other critical user services and their points of failure should be added to this diagram.

    A clear understanding of how the points of failure affect each other and the user services will help with further planning, namely when drawing up procedures for locating failed points and when determining recovery conditions and risk factors. But more on that in the next article.
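
    One possible way to work with such dependencies is to store them as a directed graph and walk it to find everything a given failure can reach. The sketch below is a minimal example under stated assumptions: the edges are illustrative and are not a transcription of Figure 1, and the names AFFECTS and impacted_by are hypothetical.

```python
from collections import deque

# Each key affects the items listed for it when it fails.
# These edges are an illustrative assumption, not the content of Figure 1.
AFFECTS = {
    "UPS": ["Hypervisor"],
    "Hypervisor": ["Server OS"],
    "Server OS": ["Server mail application"],
    "Server mail application": ["Email"],
    "External DNS Zone": ["Email"],
    "Client switch": [],  # per the text, it affects no other equipment
}


def impacted_by(graph, failed):
    """Return everything reachable from `failed`, excluding `failed` itself."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        # Nodes without an entry (such as end services) affect nothing further.
        for dependent in graph.get(node, []):
            if dependent not in seen and dependent != failed:
                seen.add(dependent)
                queue.append(dependent)
    return seen


if __name__ == "__main__":
    for point in AFFECTS:
        affected = sorted(impacted_by(AFFECTS, point))
        print(f"{point} -> {', '.join(affected) if affected else 'nothing else'}")
```

    Walking the graph breadth-first keeps the example short; the same structure can later hold the other services and their points of failure.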

    Part 2: habrahabr.ru/post/226681
    Part 3: habrahabr.ru/post/228115


    Ivan Kormachev
    IT Department Company
