Backup, Part 1: Purpose, Overview of Techniques and Technologies

Why make backups at all? After all, hardware is very, very reliable, and on top of that there are "clouds" that beat physical servers in reliability: with proper configuration, a "cloud" server easily survives the failure of an underlying physical host, and from the users' point of view there is only a small, barely noticeable blip in service. Besides, duplicating information often means paying for "extra" CPU time, disk load, and network traffic.
An ideal program runs fast, never leaks memory, has no security holes, and does not exist.
—Unknown
Since programs are still written by flesh-and-blood developers, testing is often skipped, and programs are rarely delivered using "best practices" (which are themselves programs, and therefore imperfect), system administrators regularly have to solve problems that sound short but loaded: "put it back the way it was", "get the database working again", "it is slow - roll it back", and my personal favourite, "I don't know what is wrong, but fix it".
Besides the logical errors that result from careless development, from unlucky combinations of circumstances, or from incomplete knowledge of the finer points of how programs are built - including the glue and system layers such as operating systems, drivers and firmware - there are errors of another kind. For example, most developers rely on the runtime and completely forget about the physical laws that no program can bypass: the supposedly infinite reliability of the disk subsystem and of any data storage in general (including RAM and the CPU cache!), zero processing time on the CPU, no errors during network transmission or processing, and network latency equal to zero. And let's not forget the notorious deadline.

So what do we do about the problems that loom large over valuable data? There is nothing to replace living developers with, and it is far from certain there will be any time soon. On the other hand, only a handful of projects have managed to fully prove that their programs work as intended, and that proof cannot simply be taken and applied to other, similar projects. Such proofs also take a lot of time and require special skills and knowledge, which, given deadlines, makes applying them practically impossible. On top of that, we still have no ultrafast, cheap and infinitely reliable technology for storing, processing and transmitting information. If such technologies exist at all, they exist as concepts or - most often - only in science fiction books and films.
Good artists copy, great artists steal.
—Pablo Picasso
The most successful solutions and the surprisingly simple things usually appear where concepts, technologies, knowledge, and fields of science that look completely incompatible at first glance meet.
For example, birds and planes both have wings, and despite the functional similarity - the principle of operation in some flight modes is the same, and the engineering problems are solved in similar ways: hollow bones versus strong and lightweight materials, etc. - the results are quite different, although very much alike. The best examples we see in our technology are also mostly borrowed from nature: airtight compartments in ships and submarines are a direct analogy with annelid worms; RAID arrays and data integrity checks mirror the duplication of the DNA chain; and paired organs, the independence of various organs from the central nervous system (the heart's automatism), and reflexes resemble autonomous systems on the Internet. Of course, applying ready-made solutions "head on" is fraught with problems, but who knows, maybe there are no other solutions.
If I knew where I would fall, I would have laid down some straw.
—Belarusian proverb
So, backups are vital for anyone who wants:
- To be able to restore their systems with minimal downtime, or even with none at all
- To act boldly, because in case of a mistake there is always the possibility of a rollback
- To minimize the consequences of deliberate data corruption
A bit of theory
Any classification is arbitrary. Nature does not classify. We classify because it is more convenient for us. And we classify according to criteria that we also pick arbitrarily.
—Jan Bruler
Regardless of the physical storage method, the logical storage of data can be divided into two ways of accessing that data: block and file. This division has become very blurry lately, because purely block, just like purely file, logical storage does not really exist. However, for simplicity we will assume that it does.
Block data storage implies that there is a physical device onto which data is written in fixed-size portions, blocks. Blocks are accessed by address; each block has its own address within the device.
A backup is usually made by copying blocks of data. To guarantee data integrity at the moment of copying, the writing of new blocks and the modification of existing ones are suspended. If we look for an analogy from the everyday world, the closest one is a left-luggage cabinet with identical numbered cells.
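As a minimal sketch of block-level copying (the device /dev/sdb1 and the path /backup are placeholders, and the device is assumed to be unmounted or frozen while the copy runs), the whole device can be read block by block into an image file:
# read the source device in 4 MiB chunks and write it to an image file
dd if=/dev/sdb1 of=/backup/sdb1.img bs=4M conv=fsync status=progress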

File data storage is close to block storage in terms of the logical device and is often organized on top of it. The important differences are the presence of a storage hierarchy and human-readable names. The abstraction exposed is the file - a named area of data - and the directory - a special file that stores descriptions of, and access to, other files. Files can carry additional metadata: creation time, access flags, etc. A backup is usually made like this: the changed files are located and then copied to another file storage with the same structure. Data integrity is usually ensured by excluding files that are currently being written to. File metadata is backed up in the same way. The closest analogy is a library, which has sections with different books as well as a catalogue with human-readable book titles. A small file-level sketch follows.
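A file-level sketch of "find what changed, copy it with the same structure" (the paths and the timestamp file are invented for illustration): files modified since the last run are copied into the backup tree, recreating directories and preserving modification times.
# copy every file under /data changed since the previous run into /backup/data
find /data -type f -newer /backup/last-run | cpio -pdm /backup/data
# remember when this run happened
touch /backup/last-run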

Recently another option is sometimes described - the one with which, in principle, file storage began and which shares the same archaic traits: object storage.
It differs from file storage in that it has no nesting deeper than one level (a flat layout), and although object names are human-readable, they are better suited for processing by machines. When backing up, object stores are most often treated like file storage, but occasionally there are other options.
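For example, an object store that speaks the S3 protocol can be backed up in a file-like way (the bucket name and local path below are placeholders):
# mirror the contents of a bucket into a local directory
aws s3 sync s3://example-bucket /backup/example-bucket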
- There are two types of system administrators: those who do not make backups and those who already do.
- Actually, there are three: there are also those who verify that their backups can be restored.
—Unknown
It is also worth understanding that backing up data is done by programs, so it carries all the same shortcomings as any other program. To reduce (not eliminate!) dependence on the human factor, as well as on individual quirks - which do not matter much on their own but together can add up to a tangible effect - the so-called 3-2-1 rule is applied. There are many ways to read it, but I prefer the following interpretation: 3 sets of the same data must be stored, 2 sets must be stored in different formats, and 1 set must be kept in geographically remote storage.
Here "storage format" should be understood as follows:
- If there is a dependence on the physical storage method, we change the physical method.
- If there is a dependence on the logical storage method, we change the logical method.
To get the maximum effect from the 3-2-1 rule, it is recommended to vary the storage format in both senses; a possible layout is sketched below.
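A rough sketch of such a layout (all hosts and paths here are invented for illustration): one copy is kept as an archive on a second local disk, a second copy is kept as a plain file tree on a NAS, and a third copy goes to a geographically remote host.
# copy 1: a tar archive on a second local disk (one logical format)
tar -czf /mnt/backup/data.tar.gz /data
# copy 2: a plain file-tree copy on a NAS (a different logical format)
rsync -a /data/ nas.local:/backups/data/
# copy 3: the same file tree pushed to a geographically remote host
rsync -a /data/ backup@remote.example.com:/backups/data/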
From the point of view of how ready a backup is for its intended purpose - restoring operation - backups are divided into "hot" and "cold". Hot differs from cold in only one thing: hot backups are ready to be used immediately, while cold ones require some additional steps before recovery: decryption, extraction from an archive, etc.
Do not confuse hot and cold copies with online and offline copies; the latter imply physical isolation of the data and are, in fact, another axis along which backup methods are classified. So an offline copy - one not connected directly to the system where it is to be restored - can be either hot or cold (in terms of readiness for recovery). An online copy is available directly where it needs to be restored, and is most often hot, but cold ones exist as well.
Also keep in mind that the backup process usually does not stop at a single backup copy; there can be quite a few of them. Therefore one has to distinguish full backups - those that can be restored independently of other backups - from partial ones (incremental, differential, decremental, etc.), which cannot be restored on their own and require one or more other backups to be restored first.
Incremental backups are an attempt to save the space needed to store backups: only the data changed since the previous backup is written to the new copy.
Decremental backups are created for the same purpose, but in a slightly different way: a full backup is made each time, yet only the difference between the fresh copy and the previous one is actually stored.
The backup process on top of storage that supports deduplication deserves a separate mention. If full backups are written to such storage, in reality only the differences between the backups are recorded, yet restoring a backup works just like restoring from a full copy and is completely transparent.
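One possible sketch of incremental copies, using GNU tar with a snapshot file (the paths are placeholders): the first run produces a full archive, and later runs with the same snapshot file store only what has changed since then.
# first run: full backup; tar records file state in the snapshot file
tar --create --file=/backup/data-full.tar --listed-incremental=/backup/data.snar /data
# later runs: only files changed since the snapshot file was updated are stored
tar --create --file=/backup/data-inc1.tar --listed-incremental=/backup/data.snar /data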
Quis custodiet ipsos custodes?
(Who will guard the watchmen themselves? - lat.)
It is very unpleasant when there are no backups, but it is far worse when a backup seems to have been made, yet during restoration it turns out that it cannot be restored, because:
- The integrity of the source data has been violated.
- The backup storage is corrupt.
- Recovery is very slow, and partially restored data cannot be used.
A properly built backup process must take these points into account, especially the first two.
The integrity of the source data can be guaranteed in several ways. The most common are: a) block-level snapshots of the file system, b) freezing the state of the file system, c) a special block device with versioned storage, d) sequential writing of files or blocks. Checksums are also used so that the data can be verified during recovery.
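A minimal sketch of option (a) using an LVM snapshot (the volume group vg0, the volume data and the mount points are placeholders): the backup is taken from the frozen snapshot rather than from the live volume.
# freeze a point-in-time view of the volume as a snapshot
lvcreate --size 2G --snapshot --name data_snap /dev/vg0/data
# back up the snapshot instead of the live volume, then drop it
mount -o ro /dev/vg0/data_snap /mnt/snap
tar -czf /backup/data.tar.gz -C /mnt/snap .
umount /mnt/snap
lvremove -y /dev/vg0/data_snap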
Damage to the backup storage itself can also be detected with checksums. An additional technique is the use of specialized devices or file systems where already written data cannot be modified, only appended to.
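For instance, a checksum can be recorded next to the backup and verified before restoring (the file names are illustrative):
# record a checksum when the backup is created...
sha256sum /backup/data.tar.gz > /backup/data.tar.gz.sha256
# ...and verify it before attempting a restore
sha256sum --check /backup/data.tar.gz.sha256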
To speed up recovery, several recovery processes are run in parallel - provided there is no bottleneck in the form of a slow network or a slow disk subsystem. To get around the problem of partially restored data, the backup process can be split into relatively small subtasks, each of which is performed separately. This makes it possible to restore operation step by step with a predictable recovery time. This problem mostly lives in the organizational plane (SLA), so we will not dwell on it in detail.
He who knows a lot about spices is not the one who adds them to every dish, but the one who never adds anything superfluous.
—V. Sinyavsky
The practice regarding the software used may vary from one system administrator to another, but the general principles are the same one way or another, in particular:
- Ready-made solutions are highly recommended.
- Programs should work predictably, i.e. there should be no undocumented features or bottlenecks.
- Setting up each program should be simple enough that you do not have to read the manual or a cheat sheet every time.
- The solution should be as universal as possible, since servers can vary very widely in their hardware specifications.
The following common programs are available for taking backups of block devices (a short example follows the list):
- dd, familiar to veterans of system administration; similar programs (dd_rescue, for example) also belong here.
- Utilities built into some file systems that create a dump of the file system.
- Omnivorous utilities, for example partclone.
- Vendors' own, often proprietary, solutions; for example Norton Ghost and its successors.
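As an illustration (the device and output paths are placeholders), partclone copies only the used blocks of a file system, which is usually much faster than a raw dd image:
# back up only the allocated blocks of an ext4 partition
partclone.ext4 -c -s /dev/sdb1 -o /backup/sdb1.pcl
# restore it later onto a partition of at least the same size
partclone.ext4 -r -s /backup/sdb1.pcl -o /dev/sdb1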
For file systems, the backup task is partially solved by the methods applicable to block devices, but it can be solved more efficiently using, for example (a sketch follows the list):
- rsync, a universal program and protocol for synchronizing the state of file systems.
- Archiving tools built into the file system itself (ZFS).
- Third-party archiving tools; the most popular representative is tar. There are others, for example dar, a replacement for tar aimed at modern systems.
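A minimal rsync sketch (the destination host and paths are placeholders); repeated runs transfer only the files that have changed:
# archive mode preserves permissions, owners and timestamps;
# --delete removes files on the destination that no longer exist at the source
rsync -a --delete /data/ backup@backup.example.com:/backups/data/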
Separately, it is worth mentioning the tools that ensure data consistency when creating backups. The most commonly used options are (an example follows the list):
- Mounting the file system read-only, or freezing it (freeze) - a method of limited use.
- Creating snapshots of the state of a file system or a block device (LVM, ZFS).
- Using third-party tools to produce consistent copies even when the previous points cannot be applied for some reason (programs like hotcopy).
- The copy-on-write technique (CoW); however, it is most often tied to the file system used (BTRFS, ZFS).
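A consistency sketch using a ZFS snapshot (the pool and dataset names are placeholders): the snapshot is an instant, read-only view that can be backed up or sent elsewhere while the dataset stays in use.
# take an instant, consistent snapshot of the dataset
zfs snapshot tank/data@nightly
# stream the snapshot into a file on backup storage (or pipe it over ssh)
zfs send tank/data@nightly > /backup/data-nightly.zfs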
So, for a small server, you need to provide a backup scheme that meets the following requirements:
- Easy to use: no special extra steps during day-to-day operation, minimal steps for creating and restoring copies.
- Universal: works on both large and small servers; this matters when the number of servers grows or when scaling.
- Installed by the package manager, or with one or two commands of the "download and unpack" kind.
- Stable: uses a standard or long-established storage format.
- Fast.
Applicants from those who more or less meet the requirements:
- rdiff-backup
- rsnapshot
- burp
- duplicati
- duplicity
- deja dup
- dar
- zbackup
- restic
- borgbackup

A virtual machine (based on XenServer) with the following characteristics will be used as a test bench:
- 4 cores at 2.5 GHz,
- 16 GB of RAM,
- 50 GB of hybrid storage (storage with SSD caching sized at 20% of the virtual disk) as a separate virtual disk without partitioning,
- a 200 Mbps Internet channel.
Almost the same machine will be used as the backup destination server, only with a 500 GB hard drive.
Operating system - CentOS 7 x64; the partitioning is standard, and an additional partition will be used as the data source.
As the source data, let's take a WordPress site with 40 GB of media files and a MySQL database. Since virtual servers vary greatly in characteristics, and for better reproducibility, here are the server test results obtained with sysbench.
sysbench --threads=4 --time=30 --cpu-max-prime=20000 cpu run
sysbench 1.1.0-18a9f86 (using bundled LuaJIT 2.1.0-beta3)
Running the test with the following options:
Number of threads: 4
Initializing random number generator from current time
Prime numbers limit: 20000
Initializing worker threads...
Threads started!
CPU speed:
    events per second: 836.69
Throughput:
    events/s (eps): 836.6908
    time elapsed: 30.0039s
    total number of events: 25104
Latency (ms):
    min: 2.38
    avg: 4.78
    max: 22.39
    95th percentile: 10.46
    sum: 119923.64
Threads fairness:
    events (avg/stddev): 6276.0000/13.91
    execution time (avg/stddev): 29.9809/0.01
sysbench --threads=4 --time=30 --memory-block-size=1K --memory-scope=global --memory-total-size=100G --memory-oper=read memory run
sysbench 1.1.0-18a9f86 (using bundled LuaJIT 2.1.0-beta3)
Running the test with the following options:
Number of threads: 4
Initializing random number generator from current time
Running memory speed test with the following options:
    block size: 1KiB
    total size: 102400MiB
    operation: read
    scope: global
Initializing worker threads...
Threads started!
Total operations: 50900446 (1696677.10 per second)
49707.47 MiB transferred (1656.91 MiB/sec)
Throughput:
    events/s (eps): 1696677.1017
    time elapsed: 30.0001s
    total number of events: 50900446
Latency (ms):
    min: 0.00
    avg: 0.00
    max: 24.01
    95th percentile: 0.00
    sum: 39106.74
Threads fairness:
    events (avg/stddev): 12725111.5000/137775.15
    execution time (avg/stddev): 9.7767/0.10
sysbench --threads=4 --time=30 --memory-block-size=1K --memory-scope=global --memory-total-size=100G --memory-oper=write memory run
sysbench 1.1.0-18a9f86 (using bundled LuaJIT 2.1.0-beta3)
Running the test with the following options:
Number of threads: 4
Initializing random number generator from current time
Running memory speed test with the following options:
    block size: 1KiB
    total size: 102400MiB
    operation: write
    scope: global
Initializing worker threads...
Threads started!
Total operations: 35910413 (1197008.62 per second)
35068.76 MiB transferred (1168.95 MiB/sec)
Throughput:
    events/s (eps): 1197008.6179
    time elapsed: 30.0001s
    total number of events: 35910413
Latency (ms):
    min: 0.00
    avg: 0.00
    max: 16.90
    95th percentile: 0.00
    sum: 43604.83
Threads fairness:
    events (avg/stddev): 8977603.2500/233905.84
    execution time (avg/stddev): 10.9012/0.41
sysbench --threads=4 --file-test-mode=rndrw --time=60 --file-block-size=4K --file-total-size=1G fileio run
sysbench 1.1.0-18a9f86 (using bundled LuaJIT 2.1.0-beta3)
Running the test with the following options:
Number of threads: 4
Initializing random number generator from current time
Extra file open flags: (none)
128 files, 8MiB each
1GiB total file size
Block size 4KiB
Number of IO requests: 0
Read/Write ratio for combined random IO test: 1.50
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing random r/w test
Initializing worker threads...
Threads started!
Throughput:
    read: IOPS=3868.21 15.11 MiB/s (15.84 MB/s)
    write: IOPS=2578.83 10.07 MiB/s (10.56 MB/s)
    fsync: IOPS=8226.98
Latency (ms):
    min: 0.00
    avg: 0.27
    max: 18.01
    95th percentile: 1.08
    sum: 238469.45
This note opens a large series of articles about backup:
- Backup, Part 1: Why you need backups, an overview of methods and technologies
- Backup, Part 2: Overview and Testing rsync-based backup tools
- Backup, Part 3: Overview and Testing duplicity, duplicati
- Backup, Part 4: Overview and Testing zbackup, restic, borgbackup
- Backup, Part 5: Testing Bacula and Veeam Backup for Linux
- Backup: the part requested by readers: a review of AMANDA, UrBackup, BackupPC
- Backup, Part 6: Comparing Backup Tools
- Backup Part 7: Conclusions
Author: Finnix