Extending Git and Mercurial Repositories with Amazon S3

    Many of you have surely heard, or know from your own experience, that version control systems do not get along well with binary files, large files, and especially large binary files. Here we are talking about modern, popular distributed version control systems such as Mercurial and Git.

    Often this does not matter. I don't know whether it is the cause or the consequence, but version control systems are mostly used to store relatively small text files, and occasionally a few images or libraries.

    If a project uses a large number of high-resolution images, sound files, or source files for graphics, 3D, video or other editors, this becomes a problem. All these files are usually large and binary, which means that all the advantages and conveniences of version control systems, repository hosting and their associated services become unavailable.

    Next, we'll look at an example of integrating a version control system with Amazon S3 (cloud file storage) to take advantage of both solutions and compensate for their disadvantages.

    The solution is written in C#, uses the Amazon Web Services API, and shows an example setup for a Mercurial repository. The code is open source; the link is at the end of the article. Everything is written more or less modularly, so adding support for something other than Amazon S3 should not be difficult. I would assume that setting it up for Git is just as easy.

    It all started with an idea: we need a program that, once integrated with the version control system and the repository itself, works completely unnoticed, without requiring any additional action from the user. Like magic.

    Integration with the version control system can be implemented using so-called hooks - events to which you can assign your own handlers. We are interested in those that fire when data is received from or sent to another repository. Mercurial has suitable hooks called incoming and outgoing. Accordingly, we need to implement one command for each event: one for uploading updated data from the working folder to the cloud, and another for the reverse process - downloading updates from the cloud into the working folder.

    Integration with the repository itself is carried out through a metadata file - an index file, call it what you like. This file describes all the tracked files, at minimum the paths to them, and it is this file that is placed under version control. The tracked files themselves go into .hgignore, the list of ignored files; otherwise the whole point of the venture disappears.

    Repository Integration

    The metadata file looks something like this:

    <locations>
        <location>Content\Textures</location>
        <location>Content\Sounds</location>
        <location>Docs</location>
        <location>Reference Libraries</location>
    </locations>
    <amazonS3>
        <accessKey>********************</accessKey>
        <secretKey>****************************************</secretKey>
        <bucketName>mybucket</bucketName>
    </amazonS3>
    <files />

    This file has three sections: locations, amazonS3 and files. The first two are configured by the user at the very beginning; the last is maintained by the program itself as it works with the repository.

    Locations are the paths in which tracked files will be searched for; they are either absolute or relative to this xml settings file. The same paths must be added to the version control system's ignore file so that it does not try to track them itself.
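    For the locations in the example configuration, the ignore file might look something like this (glob syntax assumed; adjust the patterns to your own layout):

```
syntax: glob
Content/Textures/**
Content/Sounds/**
Docs/**
Reference Libraries/**
```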

    AmazonS3 holds, as you might guess, the cloud storage settings. The first two keys are access keys, which can be generated for any AWS user; they are used to cryptographically sign Amazon API requests. BucketName is the name of the bucket, an entity within Amazon S3 that can contain files and folders and will be used to store all versions of the tracked files.

    Files does not need to be configured, since this section is edited by the program itself while it works with the repository. It contains a list of all files in the current version, with their paths and hashes. Thus, when a pull brings in a new version of this xml file, comparing the contents of the files section with the contents of the tracked folders themselves tells us which files were added, which were changed, and which were simply moved or renamed. During a push, the comparison is performed in the opposite direction.

    Integration with the Version Control System

    Now for the commands themselves. The program supports three: push, pull and status. The first two are meant to be wired into the corresponding hooks. Status displays information about the tracked files; its output is similar to that of hg status - it shows which files in the working folder have been added, changed or moved, and which are missing.

    The push command works as follows. First, we read the list of tracked files from the xml file: their paths and hashes. This is the last state recorded in the repository. Next, information about the current state of the working folder is collected - the paths and hashes of all tracked files. Then the two lists are compared.

    There may be four different situations:

    1. The working folder contains a new file. This happens when there is no match by either path or hash. As a result, the xml file is updated, a record for the new file is added to it, and the file itself is uploaded to S3.
    2. The working folder contains a modified file. This happens when there is a path match but no hash match. As a result, the xml file is updated, the hash of the corresponding record is changed, and the updated version of the file is uploaded to S3.
    3. The working folder contains a moved or renamed file. This happens when there is a hash match but no path match. As a result, the xml file is updated, the path of the corresponding record is changed, and nothing needs to be uploaded to S3. The point is that the storage key for files in S3 is the hash, and the path information is recorded only in the xml file. In this case the hash has not changed, so re-uploading the same file to S3 makes no sense.
    4. A tracked file has been deleted from the working folder. This happens when one of the records in the xml file does not match any of the local files. As a result, that record is deleted from the xml file. Nothing is ever deleted from S3, since its main purpose is to store all versions of files so that you can roll back to any revision.

    There is also a fifth possible situation - the file has not been modified. This happens when there is a match by both path and hash, and no action is required.
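    A minimal sketch of this push-side comparison, written in Python for illustration (the actual tool is C#; the function and variable names here are mine):

```python
def classify_push(recorded, local):
    """Compare the last recorded state (path -> hash, from the xml file)
    with the current working-folder state, and decide what push must do.
    Returns (to_upload, new_records), where new_records becomes the new
    contents of the files section."""
    recorded_hashes = set(recorded.values())
    to_upload = []
    for path, digest in local.items():
        if path not in recorded and digest not in recorded_hashes:
            to_upload.append(path)   # 1. new file: record it and upload
        elif path in recorded and recorded[path] != digest:
            to_upload.append(path)   # 2. modified: update the hash and upload
        elif path not in recorded and digest in recorded_hashes:
            pass                     # 3. moved/renamed: only the path changes
        # 5. unchanged: path and hash both match, nothing to do
    # 4. deleted: recorded entries with no local counterpart simply drop out,
    #    because the new files section is rebuilt from the local state
    return to_upload, dict(local)
```

    The same two maps drive the pull command, just compared in the opposite direction.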

    The pull command likewise compares the list of files from the xml with the list of local files and works in exactly the same way, only in the opposite direction. For example, when the xml contains a record for a new file, that is, a record with no match by either path or hash, that file is downloaded from S3 and written locally at the specified path.

    Example hgrc with configured hooks:

    [hooks]
    post-update = \path\to\assets.exe pull \path\to\assets.config \path\to\checksum.cache
    pre-push = \path\to\assets.exe push \path\to\assets.config \path\to\checksum.cache
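    For Git, which has no single incoming/outgoing pair, the closest equivalents would presumably be the pre-push hook and a post-merge (or post-checkout) hook; a hypothetical hook script mirroring the hgrc above:

```shell
#!/bin/sh
# Save as .git/hooks/post-merge and mark it executable: it runs after a
# successful merge, e.g. at the end of git pull.
/path/to/assets.exe pull /path/to/assets.config /path/to/checksum.cache
```

    An analogous one-line script saved as .git/hooks/pre-push, calling assets.exe push, would cover the outgoing direction.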


    Calls to S3 are kept to a minimum. Only two commands are used: GetObject and PutObject. A file is uploaded to or downloaded from S3 only if it is new or modified. This is made possible by using the file's hash as the key. That is, physically all versions of all files live in the S3 bucket without any hierarchy, without folders at all. There is an obvious downside - collisions. If two files happen to have the same hash, one of them simply will not make it into S3.
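    To make the hash-keyed layout concrete, here is a Python sketch of the storage interface backed by a local folder - roughly what the FileSystemRemoteStorage analogue mentioned later does; the class and method names are mine, with put_object/get_object standing in for the two S3 calls:

```python
import os
import shutil


class HashKeyedStore:
    """Minimal stand-in for the S3 bucket: files are stored flat,
    keyed by their content hash, with no folder hierarchy at all."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put_object(self, digest, local_path):
        # Skip the upload when this content is already stored: the same
        # hash means the same bytes, so there is nothing new to send.
        target = os.path.join(self.root, digest)
        if not os.path.exists(target):
            shutil.copyfile(local_path, target)

    def get_object(self, digest, local_path):
        # The xml file supplies the path to write the content back to.
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        shutil.copyfile(os.path.join(self.root, digest), local_path)
```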

    Still, the convenience of using hashes as keys outweighs the potential danger, so I did not want to abandon them. It is only necessary to take the likelihood of collisions into account, reduce it where possible, and make the consequences less fatal.

    Reducing the probability is very simple - use a hash function with a longer digest. In my implementation I used SHA-256, which is more than enough. However, this still does not rule out collisions, so you need to be able to detect them before any changes are committed.

    That is also not hard to do. All local files are hashed anyway before the push and pull commands run; you just need to check whether any of the hashes coincide. It is enough to perform the check during push, so that collisions never get recorded in the repository. If a collision is detected, the user is notified of the trouble and asked to change one of the two files and push again. Given the low likelihood of such situations, this solution is satisfactory.
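    The check itself is a simple grouping of paths by hash; a Python sketch (names are mine), where any group larger than one aborts the push - note that two genuinely identical files would trigger it too:

```python
from collections import defaultdict


def find_hash_matches(local):
    """local maps path -> hash. Return the groups of paths that share
    a hash; push refuses to proceed while any such group exists and
    asks the user to change one of the offending files."""
    groups = defaultdict(list)
    for path, digest in local.items():
        groups[digest].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```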


    There are no strict performance requirements for such a program - whether it runs for one second or five is not that important. However, there are obvious places that can and should be optimized, and probably the most obvious one is hashing.

    The chosen approach assumes that whenever any of the commands runs, it needs to compute the hashes of all the tracked files. This operation can easily take a minute or more if there are several thousand files or if their total size exceeds a gigabyte. Spending a whole minute computing hashes is unforgivably long.

    If you note that typical use of a repository does not involve changing all the files right before every push, the solution becomes obvious - caching. In my implementation I settled on a pipe-delimited file that lives next to the program and holds information about all previously computed hashes:

    file path|file hash|date the hash was computed

    This file is loaded before a command is executed, used during it, then updated and saved after the command finishes. Thus, if the hash of logo.jpg was last computed one day ago, and the file itself was last modified three days ago, there is no point in recomputing its hash.
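    The caching logic might look like this Python sketch (the real tool is C#; here the timestamp is stored as a raw number rather than a formatted date, and the names are mine):

```python
import os


def load_cache(cache_path):
    """Parse the pipe-delimited cache file: path|hash|time hash was computed."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            for line in f:
                path, digest, stamp = line.rstrip("\n").split("|")
                cache[path] = (digest, float(stamp))
    return cache


def cached_hash(path, cache, compute):
    """Recompute the hash only if the file changed after the cached
    hash was taken; otherwise reuse the stored value."""
    entry = cache.get(path)
    if entry is not None and os.path.getmtime(path) <= entry[1]:
        return entry[0]
    digest = compute(path)
    cache[path] = (digest, os.path.getmtime(path))
    return digest
```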

    Using a BufferedStream instead of a plain FileStream for reading files, including reads done for hashing, can also be called an optimization, if something of a stretch. Tests showed that using a BufferedStream with a 1-megabyte buffer (instead of FileStream's standard 8 kilobytes) to hash 10 thousand files with a total size of over a gigabyte speeds the process up four times compared to FileStream on a standard HDD. If there are fewer files and they are each larger than a megabyte, the difference is not as significant - around 5-10 percent.
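    The same chunked-read idea, sketched in Python with hashlib (a rough equivalent of the C# approach; the 1 MB read size plays the role of the enlarged BufferedStream buffer):

```python
import hashlib


def sha256_of(path, buffer_size=1024 * 1024):
    """Compute a file's SHA-256 by reading it in 1 MB chunks, so large
    files are hashed without loading them into memory whole."""
    h = hashlib.sha256()
    with open(path, "rb", buffering=buffer_size) as f:
        for chunk in iter(lambda: f.read(buffer_size), b""):
            h.update(chunk)
    return h.hexdigest()
```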

    Amazon S3

    Two points are worth clarifying here. The most important is probably pricing. As you know, for new users the first year of use is free if you stay within the limits: 5 gigabytes of storage, 20,000 GetObject requests per month and 2,000 PutObject requests per month. At full price, a month costs about $1. For that you get replication across several data centers within a region and good speeds.

    I also dare to suggest that one question has tormented the reader from the very beginning - why reinvent the wheel when there is Dropbox? The fact is that using Dropbox directly for collaboration is fraught: it does not handle conflicts at all.

    But what if it is not used directly? In fact, in the described solution Amazon S3 can easily be replaced with Dropbox, SkyDrive, BitTorrent Sync or other analogues. In that case they act as the storage for all versions of files, and hashes are used as file names. In my solution this is implemented through FileSystemRemoteStorage, an analogue of AmazonS3RemoteStorage.

    Promised source link: bitbucket.org/openminded/assetsmanager
