Mono-repositories: please do not

Original author: Matt Klein
From the translator: Hi, Habr! Yes, this is yet another article about the advantages and disadvantages of mono-repositories. I was going to write my own article about how we use a mono-repository, how we switched from Maven to Bazel, and what came of it. But while I was thinking it over, a developer at Lyft published a great article, which I decided to translate for you. I promise to publish my additions to the article, along with our experience with Bazel, as a follow-up.
It is the start of 2019, and I find myself in yet another discussion about the benefits (or lack thereof) of storing an organization's entire source code in a mono-repository. For those of you who are not familiar with this approach, the idea is to store all source code in a single version control repository. The alternative, of course, is to store the source code in several independent repositories, usually split along service / application / library boundaries.

In this post I will call this alternative a “poly-repository”.

Some of the IT giants use mono-repositories, including Google, Facebook, Twitter, and others. Surely, if such reputable companies use mono-repositories, the benefits of this approach must be enormous, and we should all do the same, right? No! As the title of the article says: “Please do not use a mono-repository!” Why? Because at scale, a mono-repository has to solve all the same problems that a poly-repository has to solve, while also encouraging tight coupling of your code and requiring an incredible effort to scale your version control system.

Thus, in the medium and long term, a mono-repository provides no organizational advantages, while leaving the company's best engineers with post-traumatic stress disorder (manifested as drooling and incoherent muttering about git performance).


A short digression: what do I mean by "on a large scale"? There is no definitive answer to this question, but since I am sure I will be asked, let's say it means about 100 developers writing code full-time.

The theoretical advantages of a mono-repository, and why they cannot be achieved without the same tools that are used for a poly-repository (or are simply false)


Theoretical advantage 1: easier collaboration and code sharing


Mono-repository supporters say that when all code lives in the same repository, code duplication is less likely, and different teams are more likely to work together on common infrastructure.

Here is the bitter truth about mono-repositories, even medium-sized ones (and it will come up repeatedly in this section): it quickly becomes impractical for a developer to keep the entire repository on their workstation or to search the whole code base with tools like grep. Therefore, any mono-repository that wants to scale has to provide 2 things:

1) something like a virtual file system that lets you keep only part of the code locally. This can be achieved with a proprietary version control system like Perforce, which supports this mode natively, with Google’s internal G3 tool, or with Microsoft’s GVFS.

2) sophisticated tooling, offered as a service, for indexing / searching / viewing source code. Since no developer is going to keep all the source code on their workstation in a searchable state, being able to run such a search across the entire code base becomes critical.

Given that a developer will have access to only a small portion of the source code at any one time, is there any difference between checking out a part of a mono-repository and checking out several independent repositories? There is no difference.

As for indexing / searching / viewing code and the like, such a hypothetical tool can just as easily search across multiple repositories and combine the results. In fact, this is exactly how search on GitHub works, as do more sophisticated search and indexing tools such as Sourcegraph.
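To make the point about combining results concrete, here is a minimal Python sketch of a search that walks several independent repository checkouts and merges the hits into one list. It is only an illustration: the function name search_repos and the plain substring match are assumptions, and real services such as GitHub search or Sourcegraph rely on prebuilt indexes rather than a file walk.

```python
from pathlib import Path


def search_repos(repo_roots, needle):
    """Search several repository checkouts and merge the hits.

    Returns a sorted list of (repo_name, relative_path, line_number, line).
    """
    hits = []
    for root in map(Path, repo_roots):
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            # Skip directories and version control metadata.
            if not path.is_file() or ".git" in relative.parts:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for number, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    hits.append((root.name, relative.as_posix(), number, line.strip()))
    return sorted(hits)
```

The caller sees one merged result list and does not need to know how many repositories were searched, which is the whole point of the argument above.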

Thus, when it comes to collaborating on code at scale, developers are in any case forced to work with only part of the code base and to use higher-level tools. Whether the code is stored in a mono-repository or in several independent repositories, the problem is solved in the same way, and how effectively people collaborate on the code depends only on the engineering culture, not on how the source code is stored.

Theoretical advantage 2: one build / no dependency management


The next argument usually made by mono-repository supporters is that storing all the code in a single mono-repository removes the need to manage dependencies, because all the code is built at the same time. This is false! At scale, there is simply no way to rebuild all the source code and run all the automated tests every time someone commits a change to the version control system (or, more importantly and more often, on the CI server when a new branch or pull request is created). To solve this problem, all large mono-repositories use a sophisticated build system of their own (for example, Bazel / Blaze from Google or Buck from Facebook), which is designed to track changes and the blocks that depend on them, and which builds a dependency graph of the source code. This graph makes it possible to cache build and test results efficiently, so that only the changes and the code that depends on them need to be rebuilt and retested.
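The idea of rebuilding only what a change affects can be sketched in a few lines of Python. This is a simplified illustration, not how Bazel or Buck are implemented: the target names and the affected_targets function are hypothetical, and real build systems also hash inputs and cache outputs.

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return the changed targets plus everything that depends on them.

    deps maps each target to the list of targets it depends on.
    """
    # Invert the graph: for each target, who depends on it?
    dependents = defaultdict(set)
    for target, requirements in deps.items():
        for requirement in requirements:
            dependents[requirement].add(target)

    # Breadth-first walk from the changed targets up through their dependents.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical dependency graph for illustration only.
DEPS = {
    "app": ["auth", "http"],
    "auth": ["crypto"],
    "http": [],
    "crypto": [],
    "docs": [],
}
```

With this graph, a change to "crypto" requires rebuilding "crypto", "auth" and "app", while "http" and "docs" can be served from cache. Nothing in the computation depends on whether the targets live in one repository or in several.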

Moreover, since the code that is built must eventually be deployed, and, as everyone knows, all software cannot be deployed simultaneously, it is important that all build artifacts are tracked so that they can be redeployed as necessary. In essence, this means that even in the world of mono-repositories, multiple versions of the code can exist in the wild at the same time, and they have to be carefully tracked and coordinated.

Proponents of mono-repositories will also argue that even with the need to track builds / dependencies, there is still a distinct advantage, because a single commit describes the complete state of the whole world. I would say that this advantage is rather debatable, given that the dependency graph already exists, and including a commit identifier for each independent repository as part of this graph looks like a fairly trivial task. In fact, Bazel can work with several independent repositories just as easily as with one mono-repository, abstracting the underlying layout away from the developer. Moreover, it is easy to build automated refactoring tools that update the versions of dependent libraries in several independent repositories at once.
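One way to picture a single identifier for "the state of the whole world" across several repositories is to hash a manifest that pins each repository to a commit. The Python sketch below is an assumption made for illustration: the snapshot_id function, the repository names, and the shortened commit identifiers are all hypothetical, and this is not how Bazel records external repositories.

```python
import hashlib
import json


def snapshot_id(pinned_commits):
    """Derive one identifier from a mapping of repository name -> commit id."""
    # Serialize with sorted keys so the result does not depend on dict order.
    canonical = json.dumps(pinned_commits, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest: each independent repository pinned to a commit.
MANIFEST = {
    "service-api": "a1b2c3d",
    "shared-lib": "e4f5a6b",
    "deploy-config": "0c9d8e7",
}
```

Calling snapshot_id(MANIFEST) yields one value that changes whenever any pinned commit changes, which plays the same role as a single mono-repository commit for build and deployment tracking.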

The end result is that the realities of build / deployment at scale are for the most part the same for mono-repositories and poly-repositories. It makes no difference to the tools, and it should make none to the developers writing code.

Theoretical advantage 3: code refactoring is a simple atomic commit


Finally, the last advantage that mono-repository advocates mention is that a single repository makes refactoring simpler, thanks to easy searching and the idea that a single commit can span the entire repository. This is incorrect for several reasons:

1) as described above, at scale a developer will not be able to edit or search the entire code base on a local machine. Thus, the idea that anyone can simply clone the entire repository and run grep / replace is not so easy to put into practice.

2) even if we assume that, with a sophisticated virtual file system, a developer can clone and edit the entire code base, how often will this actually happen? I am not talking about fixing a bug in the implementation of a shared library, since that situation is handled the same way in a mono-repository and in a poly-repository (assuming a similar build / deployment system, as described above). I am talking about changing a library API, which causes many compilation errors at the places where the library is called. In a very large code base, it is almost impossible to make a change to a fundamental API that gets reviewed by all the teams involved before merge conflicts force you to start the process over. The developer has 2 realistic options: give up and come up with a workaround for the API problem (in practice this happens more often than any of us would like), or deprecate the existing API, write a new API, and then embark on the long and painful process of updating all calls to the old API throughout the code base. Either way, it is exactly the same process as in a poly-repository.

3) in a service-oriented world, applications consist of many loosely coupled components that interact with each other through some kind of well-specified API. Large organizations sooner or later adopt an IDL (Interface Description Language), such as Thrift or Protobuf, which allows them to define type-safe APIs and make backward-compatible changes. As described in the previous section on build / deployment, the code cannot all be deployed simultaneously. It may be deployed over a period of time: hours, days, or even months. Therefore, developers are required to think about the backward compatibility of their changes. This is the reality of modern software development, which many would like to ignore but cannot. So, when it comes to services (as opposed to library APIs), developers have to use one of the two approaches described above (do not change the API, or go through a deprecation cycle), and this is exactly the same for a mono-repository and a poly-repository.
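The deprecation cycle mentioned in the points above can be sketched as follows. This is a minimal Python illustration under stated assumptions: fetch_user and fetch_user_v2 are hypothetical names, and the old function is kept as a thin wrapper so that existing callers continue to work while they migrate at their own pace.

```python
import warnings


def fetch_user_v2(user_id, include_profile=False):
    """New API: the extra parameter is optional, so callers can adopt it gradually."""
    record = {"id": user_id}
    if include_profile:
        record["profile"] = {}
    return record


def fetch_user(user_id):
    """Old API: still works, but warns callers to move to fetch_user_v2."""
    warnings.warn(
        "fetch_user is deprecated; call fetch_user_v2 instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return fetch_user_v2(user_id)
```

Once no callers of fetch_user remain, the wrapper can be deleted. The sequence of steps is the same whether the callers live in one repository or in many.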

Speaking of refactoring a large code base, many large organizations end up building their own tools for automated refactoring, such as fastmod, recently released by Facebook. As always, such a tool could just as easily work with one repository or with several independent ones. Lyft has a tool called the “refactorator” that does just that. It works like fastmod, but automates changes across several of our repositories, including creating pull requests, tracking revisions, etc.
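As a rough picture of what such a multi-repository refactoring tool does at its core, here is a Python sketch that applies one regular-expression rewrite across several repository checkouts and reports which files changed in each. It is not fastmod and not Lyft's tool: the rewrite_repos function and its parameters are assumptions, and a real tool would also create branches and pull requests per repository.

```python
import re
from pathlib import Path


def rewrite_repos(repo_roots, pattern, replacement, suffix=".py"):
    """Apply a regex replacement in every matching file of every repository.

    Returns a mapping of repository name -> list of changed relative paths.
    """
    regex = re.compile(pattern)
    changed = {}
    for root in map(Path, repo_roots):
        touched = []
        for path in sorted(root.rglob("*" + suffix)):
            if not path.is_file():
                continue
            original = path.read_text(encoding="utf-8")
            updated = regex.sub(replacement, original)
            if updated != original:
                path.write_text(updated, encoding="utf-8")
                touched.append(path.relative_to(root).as_posix())
        if touched:
            # Only repositories with real changes would need a pull request.
            changed[root.name] = touched
    return changed
```

The returned mapping is a natural unit for the follow-up automation: one pull request per key, listing the touched files.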

The unique shortcomings of mono-repositories


In the previous section, I listed all the theoretical advantages that a mono-repository provides, and noted that to make use of them you have to build an incredibly complex toolkit that is no different from the tooling for poly-repositories. In this section I will describe 2 shortcomings that are unique to mono-repositories.

Disadvantage 1: tight coupling and open source software


Organizationally, a mono-repository encourages the creation of tightly coupled and fragile software. It gives developers the feeling that they can easily fix mistakes in abstractions, although in reality they cannot, because of the staggered build / deployment process and the human / organizational / cultural factors that arise when trying to make changes across the entire code base.

The code structure in poly-repositories embodies clear and transparent boundaries between teams / projects / abstractions / code owners, and forces the developer to think carefully about the interfaces between them. This is a subtle but very important advantage: it makes developers think on a larger scale and over a longer term. Moreover, using poly-repositories does not mean that developers cannot reach beyond the boundaries of their repository. Whether this happens or not depends only on the development culture, not on whether a mono-repository or a poly-repository is used.

Tight coupling also has serious implications for open-sourcing code. If a company wants to create or consume open source software, using poly-repositories is a must. The contortions that occur when a company tries to open-source a project out of its mono-repository (import / export of source code, public / private bug trackers, extra layers to abstract away differences in standard libraries, etc.) do not lead to productive collaboration and community building, and they also create significant overhead.

Disadvantage 2: version control system scalability



Scaling a version control system to hundreds of developers, hundreds of millions of lines of code, and a huge stream of commits is a monumental task. The Twitter mono-repository, created 5 years ago (based on git), was one of the most useless projects I have witnessed in my career. Running the simplest command, git status, took minutes. If the local copy of the repository was too old, an update could take hours (at that time there was even a practice of mailing hard drives with a fresh copy of the repository to remote employees). I am not bringing this up to make fun of the Twitter developers, but to illustrate how hard this problem is. I can say that 5 years later, the performance of the Twitter mono-repository is still far from what the developers on the tooling team would like to see, and not for lack of trying.

Of course, there has been some progress in this area over these 5 years. Git VFS from Microsoft, which is used to develop Windows, has produced a real virtual file system for git, which I described above as a prerequisite for scaling a version control system (and with Microsoft's acquisition of GitHub, it looks like this level of scaling will find its way into the features that GitHub offers its enterprise customers). And of course, Google and Facebook continue to invest huge resources in their internal systems to keep them functioning, although almost none of this is publicly available.

So why solve these version control scaling problems at all if, as described in the previous section, the tooling required is exactly the same as for a poly-repository? There is no good reason to.

Conclusion


As often happens in software development, we look to the most successful software companies as examples and try to borrow their best practices without understanding what exactly led these companies to success. Mono-repositories, in my opinion, are a typical example of this. Google, Facebook, and Twitter have invested enormous resources in their code storage systems, only to arrive at a solution that is essentially no different from what a poly-repository requires, yet encourages tight coupling and demands a huge investment in scaling the version control system.

In fact, at scale, how a company handles code sharing, collaboration, tight coupling, and so on depends directly on engineering culture and leadership, and has nothing to do with whether a mono-repository or a poly-repository is used. Both solutions look the same to the developer. So why use a mono-repository? Please do not!
