How to find out whether your backup was successful


    Everyone knows that admins are divided into those who do not yet make backups and those who already do. However, there is an opinion that there is also a third kind: those who are firmly convinced that backups are being made, when in reality they are not. In this post I would like to tell a couple of true stories and, if possible, take stock and draw some conclusions.

    Disclaimer: all the stories are true, but some rough edges have been smoothed off; the company and the admin are collective images, all names have been changed, faces distorted beyond recognition, this is my first post, blah blah blah.

    The setup: imagine a classic software development company. It actively uses a version control system (Subversion, which matters in our case), a build system, and a pile of issue trackers and wikis. The volumes are large, data loss costs serious money, everything should work "like clockwork", and "what if there is suddenly a fire" should worry no one: the data simply has to be kept safe! We assume that backups, once made, automatically end up on magnetic tape / a DVD in the CEO's safe / a vault in a Swiss bank, so the availability of the latest backup is not our problem here.

    Story number one

    The Admin writes a script that dumps the databases and writes a line about it to a log.


    - Chief, it's all gone, chief!
    - Not a problem, we have backups! Where is that bearded admin of ours?

    The Admin picks up the dumps from the backups, neatly arranged into date_time folders, and sees that the dump files, starting from "half_a_year_ago", all have zero size.

    After the thunderstorm

    The error was at least funny. Instead of

    mysqldump db > db.sql 2>> log.txt

    it was actually written

    mysqldump db > db.sql &>> log.txt

    In bash, &>> appends both stdout and stderr to log.txt and overrides the earlier > db.sql, so db.sql was created as an empty file while the dump itself was appended to log.txt. The fact that the log entry was written precisely with the appending >> saved the situation and helped avoid the worst, but that, of course, was sheer luck. Once the error was found, with a log.txt about ten gigabytes in size it was a matter of technique to find the necessary lines near the end of the file and restore the dump from them.
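    In hindsight, a dump step that checks the exit status and the size of the resulting file would have caught the problem on day one. Here is a minimal sketch; the file names are the story's placeholders, and the demo call at the bottom uses echo instead of mysqldump so the snippet runs anywhere:

```shell
#!/bin/sh
# Run a dump command, append its stderr to the log, and refuse to
# accept an empty result.
# Usage: backup_db <dump-file> <log-file> <command...>
backup_db() {
    dump="$1"; log="$2"; shift 2
    if "$@" > "$dump" 2>> "$log" && [ -s "$dump" ]; then
        echo "$(date): OK, $(wc -c < "$dump") bytes in $dump" >> "$log"
    else
        echo "$(date): FAILED for $dump, see errors above" >> "$log"
        return 1
    fi
}

# The real call from the story would be:
#   backup_db db.sql log.txt mysqldump db
# Demo stand-in so the sketch is self-contained:
backup_db /tmp/demo.sql /tmp/demo.log echo "fake dump data"
```

    The point is not the exact shape but that success is decided by the exit code plus a non-empty file, not by the mere presence of a log line.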

    Story number two

    The Admin writes a script that dumps the entire repository with svnadmin and copies the dump to the backup server. "And what if something goes wrong somewhere?" Having drawn the right conclusions from story number one, the Admin adds logging: on such and such a day, the repository was backed up, so many bytes.

    Actually, drama was avoided, but, again, only by luck; everything could have been significantly worse. The Admin wanted to set up a second svn server, a kind of sandbox, onto which, a little later, the freshest dump would be rolled once a day. While solving this problem, the Admin found out that the repository dump file had been broken since some particular day. At the same time, the size check was passing successfully: all the revisions up to the critical one were being backed up.

    After the thunderstorm

    This time svnadmin was to blame, which makes a full backup iteratively, starting from the very first revision. Some revision in the middle was corrupted; svnadmin reached it, broke, honestly reported this and exited. From that point on I, unfortunately, do not know the details, but they are not very important to us. There was no way to fix the revision, and no way to remove it either (again, I do not know how things stand with this in the latest versions of Subversion). So an executive decision was made: port the giant repository to the sandbox daily using rsync.
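    The general lesson here: a size (or revision-count) check can pass while the payload is broken; only a content-level check catches this (Subversion, for its part, provides svnadmin verify for exactly this job). A toy illustration with throwaway files, assuming a known-good copy is available to compare against:

```shell
#!/bin/sh
# Two files of identical size, one with a corrupted middle:
printf 'rev1\nrev2\nrev3\n' > /tmp/source.dump
printf 'rev1\nXXXX\nrev3\n' > /tmp/backup.dump

# A pure size check passes, wrongly:
[ "$(wc -c < /tmp/source.dump)" -eq "$(wc -c < /tmp/backup.dump)" ] \
    && echo "size check: OK"

# A checksum comparison fails, correctly:
if [ "$(cksum < /tmp/source.dump)" = "$(cksum < /tmp/backup.dump)" ]; then
    echo "checksum: OK"
else
    echo "checksum: MISMATCH"
fi
```

    In real life you rarely have a known-good copy to diff against, which is exactly why tool-level verification (or a trial restore) is worth so much.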

    Now it is time to take stock

    What did I want to say with all of this? That, as it seems to me personally, it is very difficult to automate the decision that a backup was successful. That is, suppose the backup completed successfully; how do I verify that the data in it is correct and up to date without actually restoring it? And how do I know that new data will not, after some time, start breaking the backup process itself? Moreover, the errors leading to this can be as old as the world:

    1. The human factor
    2. Unreliable tools
    3. Errors in the logic of the checks
    4. Everything else

    Personally, I do not know the answers to the questions voiced above. If respected Habr readers are in the know, please share your experience.
    In the meantime, I have long believed that the "set it and forget it" principle does not work, at least in the case of backups. I advise you to restore the entire backup onto a separate test server if there is such an opportunity (here we kill two birds with one stone, at the cost of the time spent restoring the backup). Or write a separate backup integrity check script. You need to check:
    1. the backup creation date
    2. the file size
    3. the change in file size compared to the previous backup
    4. the list of files in the backup (or at least the number of unique files)
    5. ...and send all this information to your mail once a day, accumulated into a single letter.
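    The checklist above can be sketched as a script. Everything here is made up for illustration (the paths, and a tar archive standing in for "the backup"); a real version would run from cron and pipe the accumulated report to mail once a day:

```shell
#!/bin/sh
# Demo data so the sketch runs end to end (a tar archive plays
# the role of the backup; all paths are invented):
mkdir -p /tmp/bkp-demo
echo "data" > /tmp/bkp-demo/file1
tar -cf /tmp/backup-yesterday.tar -C /tmp bkp-demo
echo "more" > /tmp/bkp-demo/file2
tar -cf /tmp/backup-today.tar -C /tmp bkp-demo

BACKUP=/tmp/backup-today.tar
PREV=/tmp/backup-yesterday.tar
REPORT=/tmp/backup-report.txt

{
    echo "1. created: $(date -r "$BACKUP")"   # GNU date: file mtime
    size=$(wc -c < "$BACKUP")
    echo "2. size:    $size bytes"
    prev=$(wc -c < "$PREV")
    echo "3. delta:   $((size - prev)) bytes vs previous backup"
    echo "4. files:   $(tar -tf "$BACKUP" | wc -l)"
} >> "$REPORT"    # >> accumulates, as in the stories above

cat "$REPORT"
```

    None of these checks proves the backup is restorable, as the stories show; they only make silent failures much harder.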

    Thanks for your attention.

    UPD: Thanks for the karma => moved to the "system administration" blog.
