Backup of a large number of small files

    Sooner or later, any self-respecting IT specialist ends up having to set up backups of the working files. After yet another round of programmer slip-ups, I finally found the time for it as well.
    The specifics of the web application are such that the working directory occupies more than 50 GB of disk space and contains about 900 thousand small files (pictures, previews, and so on), so tackling the problem head-on with tar and its analogues simply did not work. I also wanted some history of the stored data, and with full backups that would mean spending a lot of space on what is essentially the same data with minor changes. On top of that, it would be nice to duplicate the copies on a remote backup server to reduce the risk of losing critical information in a hardware failure. After a meticulous analysis of search results and the rejection of methods that were obviously unsuitable for me, I settled on a couple of options, the ones most often recommended in the comments to enthusiastic posts about home-grown shell scripts.


    rdiff-backup

    rdiff-backup looked like the more suitable and convenient option. Written in Python, it stores data incrementally, letting you retrieve the state of a file or directory at any moment in the past (within the configured retention period). Flexible control from the console gives complete freedom of action and full control over the situation. Automating the backups only requires adding a couple of commands to the scheduler (the second one removes old increments that no longer have any value because the changes are too old).
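    For reference, a minimal crontab sketch of what those two commands might look like; the paths and the two-week retention period are my own placeholders, not the setup used here:

        # nightly incremental backup of the working directory
        30 2 * * *  rdiff-backup /var/www/project /backup/project
        # afterwards, drop increments older than two weeks
        30 4 * * *  rdiff-backup --remove-older-than 2W /backup/project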
    But testing showed that the utility is very resource-hungry and copes with my task extremely reluctantly. The point is that only a small amount of data (about 300 MB) changes daily, yet the changes touch roughly 30 thousand files, and the program apparently spends most of its time just identifying the modified ones. After an hour of watching iowait climb to an indecent 20% during the next run of the script, I decided to try another tool and compare the two.

    rsnapshot

    rsnapshot, written in Perl, is based on rsync. In the program's working directory (let's call it the place where the backups are put), it creates a number of indexed folders; the index grows on each run up to the limit specified in the configuration, and the oldest copy is then deleted. If you enter any of the created folders, you will find a full copy of the backed-up data inside. The total size of the folder seems to confirm this (when viewed with standard tools such as Midnight Commander): it appears equal to the sum of all the folders. In reality this is not the case. The program creates hard links between identical data within the working directory, so the latest, current copy is the "heaviest" one, and the size of each of the others amounts only to the data that actually changed.
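    A minimal sketch of what such a setup might look like; the paths, retention count, and schedule below are my assumptions rather than the configuration used here (note that rsnapshot.conf requires tabs, not spaces, between fields):

        # /etc/rsnapshot.conf (fragment)
        snapshot_root   /backup/snapshots/
        retain          daily   7        # "interval" in older rsnapshot versions
        backup          /var/www/project/       localhost/

        # crontab entry: rotate the "daily" snapshots every night
        15 3 * * *  /usr/bin/rsnapshot daily

    The hard-link trick is easy to verify on illustrative paths like these: unchanged files in neighbouring snapshots share an inode, and du counts the shared data only once:

        ls -li /backup/snapshots/daily.0/localhost/var/www/project/logo.png \
               /backup/snapshots/daily.1/localhost/var/www/project/logo.png
        du -csh /backup/snapshots/daily.*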

    Testing

    Since both options use roughly the same amount of storage, it is time to compare how quickly they complete the backup tasks.

    For the tests, a random project folder of 11 GB was taken, containing 593 subdirectories of varying depth and 230,911 files. File sizes range from 4 KB to 800 KB; as noted above, this is mostly graphic material. The utilities were tested one at a time, with external factors almost completely excluded (no other users, workloads, or heavy processes). The execution time of each test task was measured with the time utility, and copying the whole directory with cp was measured as well for comparison.
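    Roughly speaking, the measurements boil down to invocations of this kind; the paths are my own placeholders, and the exact commands used in the tests are not given in the article:

        time cp -a /srv/project /backup/cp-copy
        time rsnapshot daily
        time rdiff-backup /srv/project /backup/rdiff

    The results of all the runs are summarized in the tables below.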

    First backup - a full copy (11090 MB) at the backup location

                     real          user          sys
    cp               6m30.885s     0m1.068s      0m24.554s
    rsnapshot        7m53.879s     1m57.299s     1m22.441s
    rdiff-backup     10m50.314s    3m26.073s     1m0.928s


    Restart (no changes in the folder)

                     real          user          sys
    rsnapshot        0m10.129s     0m4.936s      0m6.708s
    rdiff-backup     1m3.969s      1m0.616s      0m2.048s


    One random folder inside the directory is duplicated (the total size grows to 13267 MB)

                     real          user          sys
    rsnapshot        0m31.175s     0m22.001s     0m17.365s
    rdiff-backup     27m53.517s    1m58.819s     0m19.005s


    Restart after the directory has grown (no changes since the previous run)

                     real          user          sys
    rsnapshot        0m11.477s     0m5.748s      0m7.368s
    rdiff-backup     1m16.366s     1m13.713s     0m1.912s


    The duplicated folder is deleted, shrinking the directory back to its original size

                     real          user          sys
    rsnapshot        0m13.885s     0m6.388s      0m9.077s
    rdiff-backup     52m55.794s    2m1.560s      0m21.941s


    Test restart with no modifications

                     real          user          sys
    rsnapshot        0m11.250s     0m5.132s      0m7.068s
    rdiff-backup     1m2.380s      1m0.088s      0m1.792s


    Summary

    As the comparison tables show, rdiff-backup has a hard time digesting changes to a large number of small files, so it is more economical to use rsnapshot and not spend most of the server's time idly crawling the file system.
    Perhaps these test results will be useful to someone and will save them the time I spent looking for the optimal way to back up files in the cases described in the article.
