1cloud August 19, 2016 at 10:49

“In one basket”: A bit about storing code

Efficient data storage matters to just about everyone who works in IT. At 1cloud, an IaaS provider, we keep a close eye on how our colleagues in the industry do things; just recently we discussed how large companies store their data.

Today we continue the topic and discuss how best to store code: in one repository or in several. We will also look at two examples that show how each approach works in practice.

Photo: Dennis Skley, CC

Should you keep your sources in a single, monolithic repository, or split the code into parts and keep them in several repositories? As a rule, it depends on the team and on the project it is working on. To begin with, let's consider the advantages and disadvantages of both approaches.

Monolithic repository

Usually the first thing that comes to mind is to put all the code into one repository, at least at the start: most projects begin that way. A repository is called monolithic if it stores two or more separate projects that are loosely related or not related at all, so that the repository itself accumulates a very large number of files, commits and other objects.

The main advantage of keeping code in a single repository is that collaboration is much easier to organize. You can create one common project made up of several subprojects and then link those subprojects in whatever way is convenient.

If developers need to change the code, or the way parts of the project talk to each other, it is easier when they have access to the code of the entire project. Suppose we are writing an online store built on a microservice architecture. When we are working on the basket service and need to look at or change the shared library, we can go straight to it without opening another project or repository. Since we can edit dependencies directly, we can make global changes faster without having to coordinate versions between repositories.

When all the code is stored in one place, we can simply run everything and, for example, watch how changes in the shared library affect the basket service. Everything is accessible at any time from anywhere, and changes are quick and painless. But it is not all smooth sailing.

Often, managers choose a single repository simply because it seems easier, on the assumption that they know what they are doing. Such decisions make it more common for developers to change parts of the code they should not touch. That is easy to do when you have access to all of the code and the project has no clearly defined boundaries.

Many problems also show up during deployment and scaling, and the integrity of the system can suffer as a result. The larger the repository, the slower its checks run. If the code is stored in several repositories, those checks can run in parallel, and errors in one part of the project cannot bring down all the services.

Conclusion: If you have a small team or you are not planning to grow, it makes more sense to keep all the code in one place. A single repository is also convenient if you are developing a monolithic application rather than working with microservices.

A Habr user has shared some tips for mitigating the shortcomings of monolithic repositories in Git (large files and large numbers of commits and pointers).

Storing code in multiple repositories

Some of the problems of a single repository are solved by introducing several. With microservices, ideally each service has its own repository. This makes versioning more straightforward: change the library and bump its version; tweak the service code and bump its version.
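As a rough sketch of that "change it, then bump its version" routine, assume the shared library lives in its own Git repository and releases are marked with annotated tags. The helper names `release` and `latest_release` and the `v` tag prefix are illustrative assumptions, not something the original post prescribes.

```shell
# Hypothetical helper: mark the current commit of a repository as a release.
# Usage: release <repo-dir> <version>
release() {
    repo_dir=$1
    version=$2
    git -C "$repo_dir" tag -a "v$version" -m "Release $version"
}

# Hypothetical helper: print the nearest release tag reachable from HEAD.
# Usage: latest_release <repo-dir>
latest_release() {
    git -C "$1" describe --tags --abbrev=0
}
```

A service repository can then record which tag of the library it was built against, so that a library change reaches a service only when that service deliberately moves to the new tag.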

Having several repositories forces you to write code as if third-party developers were going to read it (which, by the way, is quite likely). Instead of treating a code change as a sweeping change to the whole program, the developer starts thinking about how to change one module without affecting the rest of the system. As a result, coupling between modules weakens.

This lets you deploy modules independently of each other. If our checkout service works with both versions of a protocol, we can deploy it before the basket code is fixed. This approach does require a high level of discipline.

Conclusion: If your team is experienced enough to keep up with regular version updates and to work with microservices, or you have many people organized into small groups, then it is better to store the code in several repositories. The approach also helps when onboarding new employees, who become more disciplined by following the versioning rules and respecting the boundaries between services.

How Google and Kiln store their code

Judging by these conclusions, you would expect most companies, especially large ones, to prefer several repositories. Yet there is at least one big exception. Oddly enough, tens of thousands of Google developers use a monolithic repository that holds about two billion lines of code. To operate at this scale, Google had to develop its own version control system, known as Piper.

Developers access Piper through a system called Clients in the Cloud (CitC), which consists of cloud storage and a FUSE file system for Linux. Each developer has a workspace that stores the files they have modified. All writes are stored in CitC as snapshots, which makes it possible to roll the work back several steps if necessary.

The CodeSearch code search tool built into CitC lets you make minor corrections to the code and send the changed code for review with an auto-commit option: once the review is approved, tests are run, and then the system commits the change itself.

The monolithic repository model is built on trunk-based development. The trunk is the latest version of the code, and changes are committed to it one at a time, in a single sequence. Immediately after a commit, the new version is available to all Piper users, so developers always have an up-to-date version of the code in front of them.

When functionality is added, the old and the new code exist side by side, and configuration flags control which one is used. This approach avoids the problems that arise from merging changes.
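To make the flag idea concrete, here is a minimal sketch in shell. It is not Google's mechanism: the flag name `USE_NEW_CHECKOUT` and the two `checkout_total_*` functions are made up for illustration, and an environment variable stands in for a real configuration system.

```shell
# Both implementations live on the main line at the same time.
# The function names and the flag name are hypothetical.
checkout_total_old() { echo "old:$1"; }
checkout_total_new() { echo "new:$1"; }

# The flag decides which code path is used; the default is the old path.
checkout_total() {
    if [ "${USE_NEW_CHECKOUT:-0}" = "1" ]; then
        checkout_total_new "$1"
    else
        checkout_total_old "$1"
    fi
}
```

Turning the feature on or off is then a configuration change rather than a merge, and once the new path has proven itself the old function and the flag can be deleted in an ordinary commit.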

Stack Overflow users advise keeping code in a single repository even when it could be split into several. Tools exist that help with this, such as submodules in Git, externals in Subversion, and subrepositories in Mercurial.

All of them are designed to build the internal hierarchy of a large project, and they can be used to separate out individual modules: it is enough to put each project in its own repository and then use submodules to include the necessary projects at the right level of the hierarchy.
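For the Git case, including one repository inside another might look like the sketch below. The helper name `add_module` and its arguments are assumptions made for illustration; `git submodule add` and `git commit` are the underlying commands.

```shell
# Hypothetical helper: include another repository at a given path inside the
# parent repository and record that in a commit.
# Usage: add_module <parent-dir> <module-url> <path-in-parent>
add_module() {
    parent=$1
    url=$2
    dest=$3
    git -C "$parent" submodule add "$url" "$dest" &&
    git -C "$parent" commit -m "Add $dest as a submodule"
}
```

The parent repository stores only the submodule's URL (in `.gitmodules`) and the exact commit it points to, so each module keeps its own history while the parent fixes which revision of it belongs at that level of the hierarchy.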

In addition, Git can create independent branches, called orphan branches. They share nothing with each other, and each keeps its own history. This command creates a new orphan branch:

git checkout --orphan BRANCHNAME

Each individual project can be kept as a separate orphan branch. A new orphan branch starts out with the index and working tree of the branch you were on, so after creating it you need to clean up:

# Remove the files inherited from the previous branch from the index and the working tree
git rm -rf .

Before cleaning up, make sure your earlier work is committed, because the cleanup deletes files from the working tree. After that, the branch is safe to use.

Another option is to keep several repositories and push the main branch of each into a shared remote under its own branch name (the target names must not collide):

# repo 1
git push origin master:master-1
# repo 2
git push origin master:master-2
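The orphan-branch steps described above can be combined into one helper. This is a minimal sketch under a few assumptions: the name `new_orphan_branch` is hypothetical, the cleanup is done with `git rm -rf .` (which clears both the index and the tracked files), and an empty first commit is added so that the branch actually exists.

```shell
# Hypothetical helper: create an empty orphan branch in an existing repository.
# Commit your current work first: the cleanup deletes tracked files.
# Usage: new_orphan_branch <repo-dir> <branch-name>
new_orphan_branch() {
    repo=$1
    branch=$2
    # 1. start a branch with no parent commit
    # 2. drop the files inherited from the previous branch
    # 3. create the root commit of the new, unrelated history
    git -C "$repo" checkout --orphan "$branch" &&
    git -C "$repo" rm -rf --quiet . &&
    git -C "$repo" commit --allow-empty -m "Start $branch"
}
```

Because the new branch has no common ancestor with the others, `git merge-base` between it and any older branch finds nothing, which is what keeps the projects' histories separate inside one repository.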

The Kiln developers, who at one point moved from a monolithic Subversion repository to multiple Mercurial repositories, take a different view of code storage. Their project is divided into five parts: the exe clients, a server the clients talk to (Reflector), the website, the billing system and the Aadvark library.

For each part they created two repositories, devel and stable. New features go into devel first and after a while move into stable; bug fixes go the other way, landing in stable first and then being carried back into devel just as new features would be. Tags are used for synchronization; in Mercurial, tags are repository metadata.

For example, to deploy a new version of the site, the website-stable and aadvark-stable repositories are used. Each is given a tag, for example Website-000123. Then the build process starts: it clones both repositories from the server into the build directory and runs hg up -C Website-000123 to switch the local copy to that tag. Once the build is assembled, it is deployed.
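The clone-and-switch step of such a build could be scripted roughly as follows. This is a sketch rather than Kiln's actual build script: the `run` and `fetch_at_tag` helpers, the `DRY_RUN` switch and the server URL in the usage line are all assumptions; only the repository names, the tag and the `hg` commands come from the description above.

```shell
# Run a command, or only print it when DRY_RUN=1 (a hypothetical convenience).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clone each repository from the server and switch the copy to the given tag.
# Usage: fetch_at_tag <server-url> <build-dir> <tag> <repo>...
fetch_at_tag() {
    server=$1
    build_dir=$2
    tag=$3
    shift 3
    for repo in "$@"; do
        run hg clone "$server/$repo" "$build_dir/$repo" &&
        run hg --cwd "$build_dir/$repo" update -C "$tag" || return 1
    done
}
```

With a placeholder server, `DRY_RUN=1 fetch_at_tag https://hg.example.com ./build Website-000123 website-stable aadvark-stable` prints the four `hg` commands without executing them, which is a cheap way to check the sequence before pointing it at a real server.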

Conclusion

The choice of where and how to store code deserves deliberate thought, and that takes some effort. Neither approach is clearly better than the other. You need to weigh the makeup of the team, your experience and your goals, and decide on that basis. Besides, if you want, you can always move from one repository to several, or back.

One way or another, understanding comes with experience. Sometimes it is useful to take a few knocks to learn what to watch out for and which methods are likely to work. So what really helps you understand what suits your team is time, along with everyone's desire to contribute as much as they can to the product.

