Trying deduplication and compression on backups


Backup vendors like to claim that deduplication can dramatically reduce the space needed to store backups: that some customers fit a whole year's worth of backup copies into the same volume the data occupies on their production servers. Supposedly, you copy 10 terabytes of data every day for an entire year, and the backup storage still holds only those same 10 terabytes. Sounds like a tall tale.

However, there is a good way to check how well the backups of our own servers, specifically, would pack into storage with deduplication and compression. There is no need to deploy an entire backup system: a single small (4 MB) utility will show not only how tightly the data can be squeezed right now, but also forecast how much storage we will need in the future.



To get started, download the utility from here:

http://downloads.arcserve.com/tools/RPSPlanning/unsupported/DeduplicationPlanningTool/V1.0/ArcserveUDPDataStoreCapacityPlanningTool.zip

Although the archive is small, the utility is demanding:
  • it needs a 64-bit Windows system (preferably a server edition; it worked fine for me on Windows 7, scanned and plotted everything, but crashed on exit);
  • every 100 gigabytes of scanned data may require up to 1 gigabyte of RAM on the computer running the utility while it processes statistics (this can be worked around by using an SSD instead of RAM);
  • ports 10000 and 135 must be open (the protocol is not specified; I will assume TCP);
  • it must be run as administrator.
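Before running the tool, the port and RAM requirements above can be sanity-checked with a short script. This is just a sketch in Python: the assumption that the ports are TCP is mine, and the 1 GB per 100 GB rule is the worst case from the list above.

```python
import socket

# Rule of thumb from the requirements above: up to 1 GB of RAM
# per 100 GB of scanned data (unless an SSD stands in for RAM).
def ram_needed_gb(scanned_gb: float) -> float:
    """Worst-case RAM needed while the utility processes statistics."""
    return scanned_gb / 100.0

# The required ports are assumed to be TCP (the docs do not say).
def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    for port in (10000, 135):
        print(f"port {port} reachable:", port_open("127.0.0.1", port))
    print("RAM needed to process a 500 GB scan:", ram_needed_gb(500), "GB")
```

Replace `127.0.0.1` with the name of the server you intend to scan.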


If we have everything we need, unpack the archive anywhere and run ArcserveDeduplicationAssessment.exe.

Then we add the servers we are interested in to the list of test subjects by clicking the “Add Node” button:



After that, a small agent program is installed remotely on our server; it can be seen in the list of services:



By the way, when we finish working with the utility, it will offer to remove the agent:



In the meantime, we start collecting statistics by clicking the “Scan Nodes” button.

By the way, how much of a production server's resources does statistics collection consume?
The documentation gives an example in which a server with an i7-4790 processor (3601 MHz, 4 cores) was loaded at 25-30% for 22 minutes while processing data from a 199-gigabyte disk.

By default, the statistics-collection task runs at low priority, yielding CPU time to higher-priority tasks.

This can be changed if collection takes too long.


The percentage of completed work on each of the tested servers is displayed on the screen:



Once statistics collection is complete, go to tab 2 and build a report. It makes sense to tick all the dates on which statistics were collected; this lets you see how the data changes over time:



Now, on tab 3, we can use the collected data and, by experimenting with the parameters, determine how much backup storage we need and how to configure the Arcserve UDP backup storage server.

In the example below, we see the following:
  • Full backups of the two machines under study occupy 35.54 gigabytes
  • We want to keep a history of 31 backups
  • Each new backup differs from the previous one by 17%
  • The deduplication block size is 4 kilobytes
  • We use standard compression (nothing fanatical, to keep CPU load down)


The output shows that storing 31 backup copies of these machines requires 76.85 gigabytes of disk space, a savings of 94%:

(You can also see the RAM requirements of the Arcserve UDP backup storage server: in this case, either 1.19 GB of RAM, or 0.06 GB of RAM combined with 1.19 GB of space on an SSD.)



Clicking “Show Details” reveals more detailed information.

If we only ever make full backups (“Full Always”), deduplication reduces their total volume (1282.99 gigabytes) by 91%, to 118.90 gigabytes.

Compression reduces this volume by a further 35%, down to 78.85 gigabytes.
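The percentages are easy to verify from the raw numbers in the report. A quick check in Python, computing each savings figure relative to the undeduplicated total:

```python
# Figures from the "Full Always" report above:
raw_gb   = 1282.99   # 31 full backups, no deduplication or compression
dedup_gb = 118.90    # after deduplication
final_gb = 78.85     # after deduplication and compression

dedup_savings   = 1 - dedup_gb / raw_gb   # deduplication alone
overall_savings = 1 - final_gb / raw_gb   # deduplication + compression
print(f"dedup alone saves:      {dedup_savings:.0%}")
print(f"dedup plus compression: {overall_savings:.0%}")
```

Deduplication alone comes out to about 91%, and deduplication with compression to about 94%, matching the report.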



If we back up in “Incremental Forever” mode (only incremental backups after a single full backup), the storage required for the backups does not change: it is still 78.85 gigabytes. We simply perform fewer deduplication calculations, so the production servers are loaded less:
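The effect is easy to see in a toy model. The sketch below uses made-up data and a 4 KB block, not Arcserve's actual algorithm: both modes end up storing the same set of unique blocks, but “Incremental Forever” hashes far fewer of them.

```python
import hashlib
import random

BLOCK = 4096    # 4 KB deduplication block, as in the example above
NBLOCKS = 256   # toy "disk" of 1 MB

def store_block(data: bytes, store: dict) -> None:
    """Deduplicating store: keep each unique block once, keyed by its hash."""
    store.setdefault(hashlib.sha256(data).hexdigest(), data)

random.seed(0)
disk = [random.randbytes(BLOCK) for _ in range(NBLOCKS)]

full_store, inc_store = {}, {}
full_hashed = inc_hashed = 0

# Both modes start with one full backup.
for b in disk:
    store_block(b, full_store)
    store_block(b, inc_store)
full_hashed += NBLOCKS
inc_hashed += NBLOCKS

for day in range(5):
    changed = set()
    for _ in range(int(NBLOCKS * 0.17)):   # ~17% daily change, as above
        i = random.randrange(NBLOCKS)
        disk[i] = random.randbytes(BLOCK)
        changed.add(i)
    # Full Always: hash every block of the disk again.
    for b in disk:
        store_block(b, full_store)
    full_hashed += NBLOCKS
    # Incremental Forever: hash only the changed blocks.
    for i in changed:
        store_block(disk[i], inc_store)
    inc_hashed += len(changed)

print("unique blocks stored:", len(full_store), "vs", len(inc_store))
print("blocks hashed:       ", full_hashed, "vs", inc_hashed)
```

The stores end up the same size, while the incremental run hashes only a fraction of the blocks, which is exactly the reduced load on the production servers described above.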



Now let's look at the tab with the graphs.

Select the type of graph “Disk and Memory Usage Trend”.

It is clear that adding a second 35-gigabyte backup to the first 35-gigabyte backup without deduplication requires 70 gigabytes of storage, as the blue line on the left-hand graph shows.

However, with deduplication the storage requirements drop significantly. The green, orange and purple lines show the required volumes for different compression levels used together with deduplication.

The right-hand graph shows how the demand for RAM (or RAM combined with an SSD) grows on the Arcserve UDP backup storage server.



If we select the “Disk and Memory Usage” graph type, we can see how the deduplication block size affects storage requirements. Increasing the block size slightly reduces deduplication efficiency, but also reduces the amount of fast memory (RAM or SSD) needed on the Arcserve UDP backup storage server:
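The fast-memory side of that trade-off follows from simple arithmetic: the deduplication index needs roughly one entry per unique block, so quadrupling the block size cuts the index to a quarter. A back-of-the-envelope sketch, where the 48 bytes per index entry is my assumption, not an Arcserve figure:

```python
ENTRY_BYTES = 48   # assumed size of one hash-index entry (not an Arcserve spec)
GB = 1024 ** 3

def index_gb(unique_data_gb: float, block_kb: int) -> float:
    """Estimated fast-memory (RAM/SSD) footprint of the dedup index."""
    entries = unique_data_gb * GB / (block_kb * 1024)
    return entries * ENTRY_BYTES / GB

# 118.90 GB of unique data, as in the "Full Always" example above:
for block_kb in (4, 16, 64):
    print(f"{block_kb:>2} KB blocks -> index \u2248 {index_gb(118.90, block_kb):.2f} GB")
```

The index shrinks in direct proportion to the block size; what this sketch does not model is the slight loss of deduplication efficiency that larger blocks bring.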



The statistics are not deleted when you exit the program, even if you remove the agents from the production servers. They can be used later to build graphs showing how storage requirements change over time.

The utility ships with the Arcserve UDP product and is installed with it into the “...\Program Files\Arcserve\Unified Data Protection\Engine\BIN\Tools\RPS Planning” directory, but it can also be downloaded on its own, as described above.

The utility is not a supported product, meaning you cannot officially contact technical support about it. But that is offset by its remarkable simplicity and the fact that it is free.

You can learn more about Arcserve products by reading our blog and following the links in the right-hand column.
