Monorepositories: please do not (part 2)

Hello!

So, here is the next installment of the promised holy war about monorepositories. In the first part, we discussed a translation of an article by a respected engineer from Lyft (and formerly Twitter) about the shortcomings of monorepositories and why they negate almost all the advantages of the approach. Personally, I largely agree with the arguments in the original article. But, as promised, to wrap up this discussion I would like to raise a few more points that are, in my opinion, even more important and more practical.

I'll tell you a little about myself: I have worked on small projects and on relatively large ones, including a poly-repository setup with more than 100 microservices (and a 99.999% SLA). At the moment I am migrating a small mono-repository (in fact not really one, just a JS front end plus a Java backend) from Maven to Bazel. I have not worked at Google, Facebook, or Twitter, i.e. I have not had the pleasure of using a mono-repository that was properly tuned and surrounded by tooling.

So, for starters, what is a mono-repository? The comments on the translation of the original article showed that many people believe a mono-repository is when all 5 developers in a company work in one repository and keep the front end and backend in it. Of course, it is not. A mono-repository is a way to store all of a company's projects, libraries, build tools, IDE plug-ins, deployment scripts, and everything else in one large repository. Details here: trunkbaseddevelopment.com.

What, then, do you call the approach when a company is small and simply does not have that many projects, modules, and components? That is also a mono-repository, just a small one.
Naturally, the original article states that all the problems it describes begin to appear at a certain scale. So those who write that their mono-repository for a team of one and a half people works fine are, of course, absolutely right.

So, the first fact I would like to establish: a mono-repository is a great start for your new project. By putting all the code in one pile, at first you get nothing but advantages, because supporting multiple repositories will certainly add some overhead.

What is the problem, then? The problem, as noted in the original article, begins at a certain scale. And the main thing is not to miss the moment when you have already reached that scale.

Therefore, I am inclined to say that, in essence, the problems that arise are not problems of the “put all your code in one pile” approach; they are simply problems of large source code repositories. That is, if you had used poly-repositories for different services/components and one of those services grew that large (how large, we will discuss just below), you would most likely get exactly the same problems, but without the advantages of monorepositories (if there are any, of course).

So, how big must a repository be before it starts to be considered problematic?
There are definitely 2 indicators it depends on: the amount of code and the number of developers working with that code. If your project has terabytes of code but only 1-2 people work with it, they will most likely hardly notice any problems (or at least it will be easier to do nothing even if they do notice :)

How do you determine that it is time to think about improving your repository? Of course, this is a subjective indicator; most likely your developers will begin to complain that something does not suit them. But the problem is that by then it may be too late to change anything. I'll give some numbers of my own as warning signs: cloning your repository takes more than 10 minutes, project builds take more than 20-30 minutes, the number of developers exceeds 50, and so on.
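The rule of thumb above can be written down as a small check. This is only a sketch: the thresholds are the personal numbers from the paragraph above (taking 20 minutes as the lower bound for builds), and the names `RepoStats` and `warning_signs` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    clone_minutes: float   # time for a full clone
    build_minutes: float   # time for a full project build
    developers: int        # people committing to the repository


def warning_signs(stats, max_clone=10, max_build=20, max_devs=50):
    """Return the names of the thresholds the repository has crossed."""
    signs = []
    if stats.clone_minutes > max_clone:
        signs.append("clone")
    if stats.build_minutes > max_build:
        signs.append("build")
    if stats.developers > max_devs:
        signs.append("developers")
    return signs


# Hypothetical numbers for a repository that has outgrown all three limits.
print(warning_signs(RepoStats(clone_minutes=15, build_minutes=25, developers=60)))
```

Tracking these numbers regularly, rather than waiting for complaints, is one way not to miss the moment when the scale has already arrived.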

An interesting fact from personal experience:

I worked on a rather large monolith in a team of about 50 developers divided into several small teams. Development was done in feature branches, and the merge took place just before the feature freeze. Once I spent 3 days merging our team's branch after 6 other teams had merged ahead of me.

Now let's go through the list of problems that arise in large repositories (some of them were touched upon in the original article, some were not).

1) Repository download time

On the one hand, one can say that this is a one-time operation that a developer performs during the initial setup of a workstation. Personally, though, I often want to clone a project into a neighboring folder, dig around in it, and then delete it. If cloning takes more than 10-20 minutes, that is no longer so comfortable.

Besides, do not forget that before a project can be built on a CI server, the repository has to be cloned on each build agent. This is where you start figuring out how to save that time, because if each build takes 10-20 minutes longer and its result appears 10-20 minutes later, that will not suit anyone. So the repository starts to get baked into the virtual machine images from which the agents are deployed, which adds complexity and extra cost to maintain this solution.

2) Build time

This is a fairly obvious point that has been discussed many times. If you have a lot of source code, the build will take a long time no matter what. It is a familiar situation: after changing one line of code you have to wait half an hour while the changes are rebuilt and tested. In fact, there is only one way out: use a build system designed around result caching and incremental builds.

There are not many options here. Although caching features have been added to Gradle (unfortunately, I have not used them in practice), they bring little practical benefit, because traditional build systems do not produce repeatable results (reproducible builds). That is, because of side effects left by a previous build, at some point you will still have to flush the cache (the standard Maven clean build approach). Therefore, the only remaining option is to use Bazel / Buck / Pants and others like them. Why this is not very good, we will discuss a little below.
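To illustrate what "built around caching" means, here is a toy sketch of a cache keyed by the content of a build step's inputs. It is not how Bazel, Buck, or Pants are actually implemented; the `BuildCache` class and the `compile_stub` action are hypothetical. The sketch is only correct under the assumption that an action depends on nothing except its declared inputs, which is exactly the reproducibility property discussed above.

```python
import hashlib


class BuildCache:
    """Toy content-addressed cache: results are keyed by a hash of the inputs."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times an action really ran

    def build(self, target, sources, action):
        """Run action(sources) only if this exact set of inputs is new."""
        key = self._key(target, sources)
        if key not in self._results:
            self.executions += 1
            self._results[key] = action(sources)
        return self._results[key]

    @staticmethod
    def _key(target, sources):
        # Hash the target name plus every file name and its content,
        # in a stable order, so identical inputs give an identical key.
        digest = hashlib.sha256(target.encode())
        for name in sorted(sources):
            digest.update(b"\0" + name.encode() + b"\0")
            digest.update(sources[name])
        return digest.hexdigest()


def compile_stub(sources):
    """Stand-in for a real compiler: just concatenates the inputs."""
    return b"".join(sources[name] for name in sorted(sources))


cache = BuildCache()
sources = {"A.java": b"class A {}", "B.java": b"class B {}"}
cache.build("//app", sources, compile_stub)
cache.build("//app", sources, compile_stub)   # unchanged inputs: served from cache
sources["B.java"] = b"class B { int x; }"
cache.build("//app", sources, compile_stub)   # changed input: the action runs again
print(cache.executions)
```

If an action also reads something that is not part of the key (leftover files from a previous build, for example), the cache returns stale results, and a full clean rebuild becomes unavoidable.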

3) IDE Indexing

My current project takes 30 to 40 minutes to index in IntelliJ IDEA. And yours? Of course, you can open only part of the project or exclude all unnecessary modules from indexing, but... The problem is that reindexing happens every time you switch from one branch to another. That is why I like to clone the project into a neighboring directory. Some people go as far as caching the IDE cache :)
<Picture: DiCaprio with narrowed eyes>

4) Build logs

Which CI server do you use? Does it provide a convenient interface for viewing and navigating build logs that are several gigabytes in size? Unfortunately, mine does not :(

5) Commit history

Do you like browsing commit history? I do, especially in a GUI tool (I take in information better visually, don't judge me :).

This is the history of commits in my repository.

Do you like it? Is it convenient? Personally, I do not think so!

6) Broken tests

What happens if someone manages to get broken tests or non-compiling code into master? You will of course say that your CI does not allow this. But what about flaky tests that pass for the author and for no one else? Now imagine that this code has spread to the machines of 300 developers and none of them can build the project. What do you do in this situation? Wait until the author notices and fixes it? Fix it for them? Roll back the changes? Of course, ideally you should commit only good code and write it without bugs right away. Then this problem will not arise.
(for those who missed the hint: the point is that the negative effect when this happens in a repository with 10 developers and in a repository with 300 will be slightly different)

7) Merge bot

Ever heard of such a thing? Do you know why it is needed? You will laugh, but this is another tool that should not have to exist :) Imagine that your project's build takes 30 minutes and that 100 developers are working on it. Suppose each of them pushes 1 commit per day. Now imagine an honest CI that allows changes to be merged into master only after they have been applied on top of the most recent commit in master (rebase).

Attention, question: how many hours must there be in a day for such an honest CI server to merge the changes from all the developers? The correct answer is 50. Whoever answered correctly may take a cookie from the shelf. Or imagine that you have just rebased your commit onto the most recent commit in master and started the build, and by the time it finished, master had already moved 20 commits ahead. Start all over again?
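The arithmetic behind that answer, as a tiny sketch (the function name is made up; it assumes builds run strictly one after another, since each change must be tested on top of the previous one):

```python
def serial_ci_hours(developers, commits_per_dev_per_day, build_minutes):
    """Hours of CI time needed per day when every commit is built in turn."""
    total_builds = developers * commits_per_dev_per_day
    return total_builds * build_minutes / 60


print(serial_ci_hours(100, 1, 30))  # 100 builds * 30 min = 3000 min = 50.0 hours
```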

So, a merge bot or merge queue is a service that automates the process of rebasing all merge requests onto a fresh master, running the tests, and performing the merge itself; it can also combine commits into batches and test them together. A very handy thing. See mergify.io, k8s test-infra Prow from Google, bors-ng, et al. (I promise to write more about this in the future)
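Here is a rough sketch of why batching helps. It is not the algorithm of any of the tools named above, just one simple strategy: test a whole batch on top of master, merge it if the build is green, and otherwise split the batch in half and retry each half. The `is_green` callback is a hypothetical stand-in for a full CI run.

```python
def merge_queue(changes, is_green, batch_size):
    """Merge changes in batches; return (merged, rejected, builds_used).

    is_green(candidate) stands in for a CI run on master plus the given
    list of changes and returns True if the build passes.
    """
    merged, rejected = [], []
    builds = 0

    def test_batch(batch):
        nonlocal builds
        builds += 1
        if is_green(merged + batch):
            merged.extend(batch)        # the whole batch lands at once
        elif len(batch) == 1:
            rejected.extend(batch)      # found a change that breaks the build
        else:
            mid = len(batch) // 2       # split and retry each half separately
            test_batch(batch[:mid])
            test_batch(batch[mid:])

    for start in range(0, len(changes), batch_size):
        test_batch(changes[start:start + batch_size])
    return merged, rejected, builds


# 100 changes per day, one of which (number 42) breaks the build.
broken = {42}
merged, rejected, builds = merge_queue(
    list(range(100)),
    is_green=lambda candidate: not (set(candidate) & broken),
    batch_size=10,
)
print(len(merged), rejected, builds)
```

In this made-up scenario the queue needs 16 builds instead of 100, which at 30 minutes per build is 8 hours instead of 50. The price is that a single bad change costs its whole batch several extra builds.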

Now to less technical problems:

8) Using a single build tool

Honestly, it is still a mystery to me why the entire mono-repository should be built with one common build system. Why not build JavaScript with Yarn, Java with Gradle, Scala with sbt, etc.? If someone knows the answer to this question (not guesses or assumes, but actually knows), please write in the comments.

Of course, it seems obvious that using one build system is better than several different ones. But everyone also understands that any universal tool is bound to be worse than a specialized one, since it most likely offers only a subset of the specialized features. Worse still, different programming languages may have different paradigms for building, dependency management, etc., which are very difficult to hide behind one common wrapper. I don't want to go into details, so I'll give just one example about Bazel (wait for details in a separate article): on GitHub we found 5 independent implementations of JavaScript build rules for Bazel from 5 different companies, alongside the official one from Google. It is worth thinking about.

9) Common approaches

In response to the original article, the CTO of Chef wrote his own reply: “Monorepo: please do!”. In it he argues that “the main thing about a monorepo is that it forces conversations and makes flaws visible.” He means that when you want to change your API, you will have to find all its usages and discuss your changes with the maintainers of those pieces of code.

My experience is exactly the opposite. Clearly this depends heavily on the engineering culture of the team, but I see nothing but downsides in this approach. Imagine that you have been using a certain approach that served you faithfully for a while. Then, while solving a similar problem, you decide for some reason to use a slightly different, perhaps more modern, method. What is the likelihood that adding the new approach will pass review?

In my recent past, I repeatedly received comments like “we already have a proven way, use it” and “if you want to introduce a new approach, update the code in all 120 places that use the old approach and get approval from all the teams responsible for those pieces of code.” Usually the “innovator's” enthusiasm ends right there.

And how much do you think it costs to write a new service in a new programming language? With polyrepositories, nothing at all. You create a new repository and start writing, and you also pick the most suitable build system. And now the same thing in a monorepository?

I understand perfectly well the arguments about “standardization, reuse, sharing of code”, but a project has to evolve. In my subjective opinion, a mono-repository rather gets in the way of that.

10) Open source

Recently I was asked: “Are there any open source tools for monorepositories?” I replied: “The problem is that tools for monorepositories, oddly enough, are developed inside the monorepository itself. That makes releasing them as open source quite difficult!”

For an example, look at the GitHub project with the Bazel plugin for IntelliJ IDEA. Google develops it in its internal repository and then “dumps” parts of it to GitHub, losing the commit history, with no real ability to send a pull request, and so on. I do not consider this open source (here is an example of my small PR, which was closed instead of being merged, and then the changes appeared in the next version). By the way, the original article also mentions that monorepositories get in the way of open-sourcing code and building a community around a project. I think many readers did not attach much importance to that argument.

Alternatives

So what should you do to avoid all these problems? There is exactly one piece of advice: strive to keep your repositories as small as possible.
And what does the mono-repository have to do with it? Simply that this approach deprives you of the ability to have small, light, and independent repositories.

What are the disadvantages of the polyrepository approach? I see exactly one: the inability to track who consumes your API. This is especially true of the “share nothing” approach to microservices, in which code is not shared between microservices. (By the way, do you think anyone uses this approach in mono-repositories?) Unfortunately, this problem has to be solved either by organizational means or by using code browsing tools that support independent repositories (for example, https://sourcegraph.com/).

What about comments like “we tried poly-repositories, but then we constantly had to implement features in several repositories at once, which was tiring, so we merged everything into one pot”? The answer is very simple: do not confuse problems of the approach with bad decomposition. Nobody claims that a repository must contain exactly one microservice and nothing else. Back when I used poly-repositories, we happily put a family of closely related microservices into one repository. Still, given that there were more than 100 services, there were more than 20 such repositories. The most important thing to think about when decomposing is how these services will be deployed.

But what about the argument about versions? After all, monorepositories let you have no versions at all and deploy everything from one commit! First, versioning is the simplest of all the problems raised here. Even something as old as Maven has the maven-version-plugin, which lets you release a version with just one click. And second, and most importantly, does your company have mobile apps? If so, then you already have versions, and there is no getting away from that!

Well, there remains the most important argument in favor of mono-repositories: they let you refactor the entire code base in one commit! Not really. As mentioned in the original article, there are limitations imposed by deployment. You should always keep in mind that for a fairly long time (how long depends on how your process is organized) you will have 2 versions of the same service running in parallel. For example, on my previous project our system was in this state for several hours with every deployment. As a result, global refactorings that affect the interfaces between services cannot be done in one commit even in a mono-repository.

Instead of a conclusion:

So, to those distinguished and few colleagues who work at Google, Facebook, etc. and will come here to defend their monorepositories, I would like to say: “Do not worry, you are doing everything right. Enjoy your tooling, on which hundreds of thousands or millions of man-hours have been spent. They have already been spent, so if you do not use it, no one will.”

And to all the rest: “You are not Google, do not use monorepositories!”

P.S. As the highly respected Bobuk noted on the Radio-T podcast when discussing the original article: “There are ~20 companies in the world that can do a mono-repository. The rest should not even try.”
