How we did backups at ISPsystem. Part two
The story of the adventures of a single task at ISPsystem continues, told by development manager Alexander Bryukhanov. The first part is here.
The best is the enemy of the good.
Writing backup code, like installing and configuring software, has always been a firing-squad kind of task for us. When you install anything from a repository, you can never be completely sure of the result. And even if you do everything perfectly, sooner or later the maintainers will break something. As for backups: people remember them only when trouble has already struck. They are on edge as it is, and if something then does not go the way they expected... well, you get the idea.
There are plenty of approaches to backups, but they all share one goal: make the process as fast as possible and, at the same time, as cheap as possible.
Trying to please everyone
It was 2011. Total backups of entire servers had sunk into oblivion years before. Not that nobody backed up virtual servers: they did then and they still do. At WHD.moscow, for example, I was told about a truly elegant way of backing up virtual servers via live migration. Still, it no longer happens as massively as it did 10-15 years ago.
We had started developing the fifth version of our products on top of our own framework, which implemented a powerful system of events and internal calls.
We decided to build a truly flexible and universal approach to configuring backups: let users set the schedule, choose the type and contents of a backup, and spread the results across different storages. On top of that, we decided to stretch this one solution across several products.
Backup goals also vary widely: some people back up to protect themselves against hardware failures, others to insure against data lost through an administrator's mistake. Naive as we were, we wanted to please everyone.
From the outside, our attempt to build a flexible system looked like this:
With the toe of your right foot, you stub out a cigarette butt. We added custom storages. What could be the problem with uploading ready-made archives to two places? In fact, there is one: if the archive could not be uploaded to one of the storages, can the backup be considered successful?
With the toe of your left foot, you stub out a second butt. We broke spears over archive encryption. It all looks simple until you ask yourself what should happen when the user wants to change the password.
And now both butts at the same time!
Why am I telling you this? The insane flexibility spawned an endless number of use cases, and testing them all became practically impossible. So we decided to take the path of simplification. Why ask the user whether he wants to save metadata when it only takes a few kilobytes? And are you really curious which archiver we use?
Another amusing bug: one user restricted the backup window to between 4:00 and 8:00. The trouble was that the backup process itself was launched by the scheduler every day at 3:00 (the standard @daily setting). The process started, determined that it was not allowed to run at that hour, and exited. No backups were ever made.
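A toy reconstruction of that failure mode (my own sketch, not the real ISPmanager code): the job fires at 3:00, sees that it is outside the 4:00-8:00 window and gives up, so two perfectly reasonable settings combine into "no backups at all".

/* Toy reconstruction of the bug described above; not the real code.
 * Cron starts this at 3:00 (@daily), the user allowed 4:00-8:00. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    const int window_start = 4, window_end = 8;   /* user setting */
    time_t now = time(NULL);
    struct tm *t = localtime(&now);

    if (t->tm_hour < window_start || t->tm_hour >= window_end) {
        /* The buggy behaviour: give up immediately instead of waiting
         * for the window to open (or rescheduling the cron job). */
        fprintf(stderr, "outside backup window, exiting\n");
        return 0;
    }

    printf("running backup...\n");
    return 0;
}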
Writing our own bicycle to replace dar
In the mid-2010s, the hype around clusters began to grow, followed by clouds. The trend was: let's manage not just one server but a whole group of servers and call it a cloud :) This affected ISPmanager as well.
And since we now had many servers, the old idea of moving data compression to a separate server was revived. As many years before, we first tried to find a ready-made solution. Oddly enough, bacula turned out to be still alive, but just as complex as ever: to manage it, we would practically have to write a separate panel. And then I came across dar, which implemented many of the ideas that had once gone into ispbackup. It seemed like happiness: an ideal solution that would let us manage the backup process exactly the way we wanted. But no, it turned out to be experience!
In 2014 we wrote a solution based on dar. But it had two serious problems: first, dar archives can only be unpacked by the original archiver (that is, by dar itself); second, dar builds the file listing in memory in, damn it, XML.
Thanks to this utility I learned that if a C program allocates memory in small blocks (on CentOS 7, blocks smaller than about 120 bytes), it cannot return that memory to the system without terminating the process. But otherwise dar treated me well.
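A minimal illustration of that memory behaviour, assuming Linux with glibc (my own sketch, unrelated to dar's code): memory freed from a huge number of small allocations usually stays inside the allocator, and the process RSS does not shrink until you explicitly ask with malloc_trim() or the process exits. Exact numbers and thresholds differ between systems.

#include <malloc.h>   /* malloc_trim() is a glibc extension */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the resident set size reported by the kernel. */
static void print_rss(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%-20s %s", label, line);
    fclose(f);
}

int main(void)
{
    enum { N = 1000000, SZ = 100 };        /* a million ~100-byte blocks */
    static char *blocks[N];

    print_rss("start:");

    for (size_t i = 0; i < N; i++) {
        blocks[i] = malloc(SZ);
        memset(blocks[i], 1, SZ);          /* touch the pages */
    }
    print_rss("after malloc:");

    for (size_t i = 0; i < N; i++)
        free(blocks[i]);
    print_rss("after free:");              /* RSS typically stays high */

    malloc_trim(0);                        /* explicitly hand pages back */
    print_rss("after malloc_trim:");
    return 0;
}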
So in 2015 we decided to write our own bicycle to replace dar: isptar. As you have probably guessed, I chose the tar.gz format, since it is fairly easy to implement; I had already figured out all those PAX headers back when I wrote ispbackup.
I must say there is not much documentation on the subject, so at the time I had to spend a while studying how tar handles long file names and large files, the limits on which were baked into the tar format from the start: 100 bytes for the file name, 155 for the directory prefix, 12 bytes for the octal record of the file size, and so on. Well, 640 kilobytes is enough for everyone! Ha! Ha! Ha!
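For reference, this is the classic 512-byte ustar header with the field widths just mentioned (taken from the published tar/pax format description, not from isptar); PAX extended headers get around these limits by carrying the long values in a separate record that precedes the file.

/* Classic 512-byte ustar header block.  All numeric fields are
 * NUL/space-terminated octal ASCII strings. */
struct ustar_header {
    char name[100];    /* file name: only 100 bytes                */
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];     /* file size as octal text: about 8 GiB max */
    char mtime[12];
    char chksum[8];
    char typeflag;
    char linkname[100];
    char magic[6];     /* "ustar\0"                                */
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char prefix[155];  /* directory prefix, glued in front of name */
    char pad[12];      /* padding up to 512 bytes                  */
};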
That left several problems to solve. The first: quickly obtain a listing of the files without having to unpack the whole archive. The second: extract an arbitrary file, again without unpacking everything. The third: keep the result a genuine tgz that any archiver can unpack. We solved each of them!
How to start unpacking an archive from a specific offset?
It turns out that gz streams can simply be concatenated! A simple command will prove it to you:
cat 1.gz 2.gz | gunzip -
You get the concatenated contents of both files without a single error. So if each file is written into the archive as its own gzip stream, the problem is solved. Of course, this reduces the compression ratio, but not very significantly.
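As a sketch of the idea (my own code, not isptar's), here is how one file's data can be compressed as a standalone gzip member with zlib. Writing every file this way and recording where each member starts lets decompression begin right at that offset, while simple concatenation of the members still yields a valid .gz, exactly as the gunzip experiment above shows.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Compress `len` bytes from `in` into `out` as one complete gzip member.
 * Returns the number of compressed bytes written, or 0 on error. */
static size_t gzip_member(const unsigned char *in, size_t len,
                          unsigned char *out, size_t out_cap)
{
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    /* windowBits = 15 + 16 asks zlib for a gzip (not raw/zlib) wrapper. */
    if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return 0;

    zs.next_in   = (unsigned char *)in;
    zs.avail_in  = (uInt)len;
    zs.next_out  = out;
    zs.avail_out = (uInt)out_cap;

    /* Z_FINISH closes this member; the next file starts a fresh stream,
     * which costs a little compression because the dictionary is reset. */
    if (deflate(&zs, Z_FINISH) != Z_STREAM_END) {
        deflateEnd(&zs);
        return 0;
    }

    size_t written = zs.total_out;
    deflateEnd(&zs);
    return written;
}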
The listing is even easier.
We put the listing at the end of the archive as an ordinary file, and for every file in the listing we also record its offset within the archive (by the way, dar stores its listing at the end of the archive too).
Why at the end? When a backup runs to hundreds of gigabytes, you may simply not have enough space to hold the entire archive, so you upload it to the storage in parts as it is being created. The great thing is that if you later need just one file, all you need are the listing and the part that contains that file's data.
There is only one problem left: how to get the offset of the listing itself?
To do that, at the end of the listing itself I appended service information about the archive, including the packed size of the listing, and at the very end of that service information, as a separate gz stream, the packed size of the service information itself (just a couple of digits). To quickly get the listing, you read the last few bytes of the archive and unpack them, then read the service information (we now know its offset relative to the end of the file), and then the listing itself (whose offset we took from the service information).
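Here is a rough sketch of that lookup (my reading of the scheme, not the actual isptar code; the key names and the tail-scanning heuristic are assumptions based on the example listing below). Each of the three tail pieces is its own gzip member, so it can be inflated on its own once its offset from the end of the file is known.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Inflate one gzip member starting at absolute offset `off` in `f`.
 * Single-read sketch: a real reader would loop over the input. */
static size_t inflate_member(FILE *f, long off, char *out, size_t cap)
{
    unsigned char in[64 * 1024];
    z_stream zs;
    memset(&zs, 0, sizeof(zs));

    fseek(f, off, SEEK_SET);
    size_t n = fread(in, 1, sizeof(in), f);

    if (inflateInit2(&zs, 15 + 16) != Z_OK)      /* 15 + 16: gzip wrapper */
        return 0;
    zs.next_in   = in;                   zs.avail_in  = (uInt)n;
    zs.next_out  = (unsigned char *)out; zs.avail_out = (uInt)cap;
    inflate(&zs, Z_FINISH);              /* stops at the end of this member */
    inflateEnd(&zs);
    return cap - zs.avail_out;
}

/* Find the last gzip member by scanning the final bytes for the gzip magic. */
static long find_tail_member(FILE *f)
{
    unsigned char tail[48];
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    long n = size < 48 ? size : 48;
    fseek(f, size - n, SEEK_SET);
    fread(tail, 1, (size_t)n, f);
    for (long i = n - 3; i >= 0; i--)
        if (tail[i] == 0x1f && tail[i + 1] == 0x8b && tail[i + 2] == 0x08)
            return size - n + i;
    return -1;
}

int main(int argc, char **argv)
{
    FILE *f = fopen(argv[1], "rb");
    char tiny[64], info[4096];
    static char listing[4 << 20];

    /* 1. The last member holds only the packed size of the service info. */
    long tail_off = find_tail_member(f);
    tiny[inflate_member(f, tail_off, tiny, sizeof(tiny) - 1)] = '\0';
    long header_size = atol(strstr(tiny, "header_size=") + 12);

    /* 2. The service info sits right before it and holds the listing size. */
    info[inflate_member(f, tail_off - header_size, info, sizeof(info) - 1)] = '\0';
    long listing_size = atol(strstr(info, "listing_size=") + 13);

    /* 3. The listing itself precedes the service info. */
    size_t got = inflate_member(f, tail_off - header_size - listing_size,
                                listing, sizeof(listing) - 1);
    fwrite(listing, 1, got, stdout);

    fclose(f);
    return 0;
}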
Here is a simple example of such a listing; the individual gz streams are highlighted in different colours. First we unpack the red one (simply by analysing the last 20-40 bytes of the file). Then we unpack the 68 bytes containing the packed short service information (highlighted in blue). And finally we unpack another 6247 bytes to read the listing itself, whose real unpacked size is 33522 bytes:

etc/.billmgr-backup root#0 root#0 488 dir
etc/.billmgr-backup/.backups_cleancache root#0 root#0 420 file 1487234390 0
etc/.billmgr-backup/.backups_imported root#0 root#0 420 file 1488512406 92 0:1:165:0
etc/.billmgr-backup/backups root#0 root#0 488 dir
etc/.billmgr-backup/plans root#0 root#0 488 dir
…
listing_header=512
listing_real_size=33522
listing_size=6247
header_size=68

It sounds a little confusing; I even had to look into the source code to remember how I do it. You can look at the isptar sources yourself, which, like the ispbackup sources, I have posted on github. Well, the story does not end there, of course. You can watch endlessly as a fire burns, as a woman parks a car, and as people try to defeat one set of crutches with the help of another.