So the carriage does not turn into a pumpkin, or why we need test restores from backups



    In an earlier post, I promised to look more closely at testing backups. Today's post is about exactly that. To avoid unpleasant surprises in the already stressful moments of data loss, backups need to be tested. Below we will talk not about checking the integrity of backup files (verifying the checksums of data blocks in a backup file), but about a full-fledged test recovery, where we check that what has been restored actually works.

    What could be wrong with the backup file


    Apart from cases where the backup file itself is damaged, there are many technical and organizational reasons why restoring from a backup may fail. I will focus on the ones I have run into myself.

    Already damaged data or files get backed up. A disk in the server started failing. Monitoring did not catch it. Some of the files were corrupted, yet they were dutifully backed up. Such a problem can go unnoticed for weeks, until someone needs to open the affected file. During recovery it then turns out that the copies in the backup are unusable too.
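
    One way to catch this during a test restore is to actually open the restored files instead of only trusting backup checksums. The sketch below is an illustration, not a complete tool: the function name `find_unreadable_files` and the choice of JSON and ZIP as the checked formats are assumptions made for the example; in practice you would plug in validators for the file types you care about.

```python
import json
import zipfile
from pathlib import Path


def _check_json(path: Path) -> None:
    # Raises if the file is not valid UTF-8 JSON.
    json.loads(path.read_text(encoding="utf-8"))


def _check_zip(path: Path) -> None:
    # Raises if the archive cannot be opened or a member fails its CRC check.
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member: {bad_member}")


# Hypothetical mapping; extend it with validators for your own formats.
VALIDATORS = {".json": _check_json, ".zip": _check_zip}


def find_unreadable_files(restore_root: Path) -> list[tuple[Path, str]]:
    """Return (path, error) for every restored file that fails its validator."""
    failures = []
    for path in sorted(restore_root.rglob("*")):
        validator = VALIDATORS.get(path.suffix.lower())
        if validator is None or not path.is_file():
            continue  # no validator for this type, or not a regular file
        try:
            validator(path)
        except Exception as exc:
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures
```

    Run it against the sandbox directory where the test restore landed; an empty list means every file with a known validator could be opened.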

    Inconsistent backup. This can happen when you choose the wrong backup tool. For example, a database runs on a virtual machine, and the administrator decides to back it up with a VM-level backup that has no support for application-consistent backups.

    The point is that a running database actively uses a cache in RAM, and part of the data lives there. The DBMS writes data to disk in a way that keeps it consistent at any moment, so that a sudden server shutdown does not turn the database into a useless set of bytes. The backup system, however, does not capture the data instantly and knows nothing about how the cache is synchronized with the file system, so parts of the data may end up in the backup in the wrong order. Then, after restoring the VM, we get a damaged database whose parts do not match each other.

    With application-aware backup agents, this does not happen.
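
    As a small-scale illustration of the difference, the sketch below uses SQLite as a stand-in for the DBMS (the post does not name a specific database or backup product, so this choice is an assumption). `sqlite3.Connection.backup` lets the database engine produce the copy itself, and `PRAGMA integrity_check` is one way to test that a restored database is usable. The function names are hypothetical.

```python
import sqlite3


def consistent_copy(source_path: str, backup_path: str) -> None:
    """Copy a live SQLite database through the engine's own backup API."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        # The engine copies pages in a way that yields a consistent snapshot.
        source.backup(target)
    finally:
        target.close()
        source.close()


def restored_db_is_healthy(db_path: str) -> bool:
    """Open the restored database and ask SQLite to verify its structure."""
    conn = sqlite3.connect(db_path)
    try:
        (result,) = conn.execute("PRAGMA integrity_check").fetchone()
        return result == "ok"
    except sqlite3.DatabaseError:
        # Raised, for example, when the file is not a valid database at all.
        return False
    finally:
        conn.close()
```

    An integrity check only proves the file is structurally sound; a real test restore would also run a few application-level queries against the restored copy.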

    There is a working backup, but not of what you need. This is quite common, because a system's life cycle looks roughly like this: the system is built and put under backup. Then, sooner or later, the architecture changes: servers and disks are added or removed, things are renamed, something is restored alongside the original, and nobody reflects the changes in the backup policy. So the backup turns out not to contain what you need.
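
    This kind of drift can be spotted by regularly comparing what exists with what is actually covered by the backup policy. A minimal sketch, assuming you can export both as sets of names (the function name and the host names are made up for illustration):

```python
def backup_coverage_gaps(inventory: set[str], policy: set[str]) -> dict[str, set[str]]:
    """Compare the current inventory with the items listed in the backup policy."""
    return {
        # Exists in the infrastructure but is not backed up.
        "unprotected": inventory - policy,
        # Still in the policy but no longer exists (renamed or decommissioned).
        "stale": policy - inventory,
    }


# Hypothetical example: db-01 was renamed to db-main and a new server was added.
inventory = {"web-01", "web-02", "db-main"}
policy = {"web-01", "db-01"}
gaps = backup_coverage_gaps(inventory, policy)
print(sorted(gaps["unprotected"]), sorted(gaps["stale"]))
```

    A stale entry next to an unprotected one is often a sign of a rename that never made it into the policy.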

    Why test


    The answer seems simple: to make sure that you can recover from a backup. But there are also a couple of important organizational questions that are worth clarifying for yourself.

    Understanding the real RTO. Estimates made on paper will differ from reality, especially if the recovery process involves more than deploying data or applications from a backup. Before you recover, you need to work out what is being restored and where. After recovery, the system is not always ready for use right away; sometimes manual configuration is required. After that, you need to check that the restored systems work. If backups are stored on tapes outside the office, you need to know how quickly they can be delivered to your office. All of this adds minutes or hours to the recovery time.

    So if we look at the entire recovery path from end to end, the real RTO is likely to be longer than the "pure" data restore time alone.
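
    The arithmetic is trivial, but writing it down makes the gap visible. In the sketch below the stage names follow the steps listed above, while the durations are invented purely for illustration; the real values are exactly what a test restore is supposed to measure.

```python
from datetime import timedelta


def end_to_end_recovery_time(stages: dict[str, timedelta]) -> timedelta:
    """Add up every stage between the incident and a verified, working system."""
    return sum(stages.values(), timedelta())


# Illustrative numbers only; measure your own during a test restore.
stages = {
    "decide what to restore and where": timedelta(minutes=30),
    "deliver tapes from off-site storage": timedelta(hours=3),
    "restore data from backup": timedelta(hours=2),
    "manual configuration": timedelta(minutes=45),
    "verify the restored system works": timedelta(minutes=30),
}
total = end_to_end_recovery_time(stages)
print(f"pure restore: {stages['restore data from backup']}, end to end: {total}")
```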

    Who does what. Test restorations exercise not only the equipment but also the people, the processes, and the regulations, if there are any ;) This is an opportunity to find weak spots and to think about what you will do if the right person is not available.

    The more people are involved in the restoration, the more such drills are needed.

    How to test


    Testing frequency. After setting up the backup system, check at least once that the backup contains what it should, and try to restore from it.

    After that, the testing schedule is set by the service owner (for example, the developer), based on how often the applications and data change, how important the data is, and what resources are available for testing.

    Different disaster and recovery scenarios. Use your imagination and think of the various reasons you might need to restore from a backup. That way you test the equipment, processes, and people under realistic conditions instead of performing an idealized recovery in a vacuum. Sketching out threat models is a convenient way to do this. For example:

    • hardware failure: a failed drive or a failed server holding the source data;
    • software failure: an unsuccessful update, a virus;
    • human factor: an administrator deleted a file that was needed.

    Each of these cases calls for a different scope of recovery: sometimes individual files, sometimes a full redeployment.

    Be sure to try to recover remotely from your home computer. After all, failures occur not only during business hours.

    Also think a couple of steps ahead: what will you do if, during recovery, the backup turns out to be useless or the restore fails? If tests show that the latest backup does not work, then, if possible, make an unscheduled backup or warn colleagues to handle the data as carefully as possible until the next backup cycle.

    Recover from different points in time. You cannot know in advance which backup you will need, so when testing, try restoring from different recovery points. That way you confirm that, for example, not only Friday's backup is fine but also the one made on Wednesday. The larger the sample, the less reason to worry about whether the backups work.
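
    One simple way to avoid always testing the same backup is to pick recovery points so that every weekday's job gets exercised. The sketch below assumes you can list the dates of the available recovery points; the function name and the one-pick-per-weekday strategy are illustrative choices, not a prescribed method.

```python
import random
from collections import defaultdict
from datetime import date


def pick_restore_points(points: list[date], rng: random.Random) -> list[date]:
    """Pick one random recovery point for each weekday that has a backup."""
    by_weekday = defaultdict(list)
    for point in points:
        by_weekday[point.weekday()].append(point)
    return sorted(rng.choice(group) for group in by_weekday.values())


# Hypothetical catalogue: daily recovery points for two weeks.
available = [date(2024, 3, day) for day in range(1, 15)]
to_test = pick_restore_points(available, random.Random())
print(to_test)
```

    Passing the random generator in explicitly makes it easy to reproduce a particular selection by seeding it.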

    Documenting recovery procedures. I once read that one company tests recovery from backup like this: a person who knows nothing about the system is asked to perform the entire recovery using only the documentation, with no hints from colleagues. Afterwards they check whether the recovery succeeded and draw conclusions about how up to date the instructions are. You do not have to go to such extremes in your drills, but it is a good idea to record in the procedures and other documentation every step needed to restore a particular system.

    This makes it possible to start the recovery process even if the person responsible for the system is temporarily unavailable.

    You also need to make sure that all the necessary information for system recovery (configuration settings, license keys, passwords) is not only in the head of the absent administrator, but is duplicated in electronic form and is stored securely away from prying eyes.

    Just in case: test recovery in a separate sandbox, without putting production at risk.

    And one more thing: I wonder how often you test recovery from backups, and whether you do it at all?

    • I test regularly: 16.6% (14 votes)
    • Tested once or twice: 39.2% (33 votes)
    • Not testing: 45.2% (38 votes)
