Why GitHub Cannot Host the Linux Kernel

Original author: Daniel Vetter
Some time ago, at the excellent maintainerati conference, I chatted with a few fellow maintainers about scaling really big open source projects and how GitHub forces projects to scale. The Linux kernel has a completely different model, which GitHub-hosted maintainers do not understand. I think it is worth explaining why and how it works, and how it differs.

Another reason for writing this text was the Hacker News discussion of my talk "Maintainers Don't Scale", where the most popular comment boiled down to the question: "Why don't these dinosaurs use modern development tools?" A few well-known kernel developers vigorously defended mailing lists and submitting patches through a mechanism resembling GitHub pull requests, but at least some of the graphics subsystem developers would love to use more modern tooling that is much easier to automate with scripts. The problem is that GitHub does not support the way the Linux kernel scales to a huge number of contributors, and therefore we simply cannot switch to it, not even for a few subsystems. And it is not about hosting the data in Git, that part is clearly fine; it is about how pull requests, issue discussion, and forks work on GitHub.

GitHub-style scaling


Git is cool because everyone can fork very easily, create a branch, and modify the code. And if you eventually produce something worthwhile, you create a pull request against the main repository, where it gets reviewed, tested, and merged. And GitHub is awesome because it came up with the right UI to make these complicated things pleasant and easy to discover and learn, so it is much easier for newcomers to get up to speed.
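
For concreteness, here is a minimal sketch of that flow as plain Git commands; the repository URL and branch name are placeholders, and the final step happens in the GitHub web UI.

```
# Clone your fork of the project ("you/project" is a placeholder).
git clone https://github.com/you/project.git
cd project

# Do the work on a topic branch.
git checkout -b my-feature
# ... edit, build, test ...
git commit -a -m "Add my feature"

# Publish the branch to your fork, then open a pull request
# against the upstream repository in the web UI.
git push origin my-feature
```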

But if the project becomes wildly successful and no amount of labels, tags, sorting, bots, and automation can cope with all the pull requests and issues in the repository, then it is time to split the project up again into more manageable pieces. More importantly, at a certain size and age different parts of the project need different rules and processes: a shiny new experimental library has different stability and CI criteria than the core code, and maybe you have a legacy dumping ground full of deprecated plugins that are no longer supported but cannot yet be deleted. One way or another, you have to divide the huge project into subprojects, each with its own process, its own criteria for merging patches, and its own repository with its own pull requests and issue tracking.

Almost all projects on GitHub solve this by splitting the single source tree into many different projects, each with its own separate functionality. Usually that results in a number of projects considered the core, plus piles of plugins, libraries, and extensions. Everything is tied together by some kind of plugin or package manager, which in some cases fetches code directly from GitHub repositories.
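
As a small illustration of that wiring (the package name and tag are made up): npm, for one, can resolve a dependency straight from a GitHub repository.

```
# Install a plugin directly from its GitHub repository;
# "someuser/someplugin" and the tag are placeholder names.
npm install someuser/someplugin
npm install someuser/someplugin#v1.2.0   # pin a specific tag
```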

Since almost every large project is structured this way, I don’t think we should delve into the benefits of this approach. But I would like to emphasize some of the problems that arise in this situation:

  • Your community becomes too fragmented. Most contributors will simply deal with the code and the corresponding repository they directly contribute to, ignoring everything else. That is really great for them, but the chances of noticing duplicated effort and parallel solutions across different plugins and libraries in good time drop significantly. And people who want to steward the community as a whole have to deal with a pile of repositories managed either through a script, or Git submodules, or something worse. On top of that, they drown in pull requests and issues if they subscribe to everything. Any topic (maybe you have shared build tooling, or documentation, or anything else) that does not fit neatly into a single repository becomes a headache for the maintainers.
  • Even when you notice the need for refactoring and code sharing, there is more bureaucracy in the way: first you need to release a new version of the core library, then go through all the plugins and update them, and only then can you perhaps delete the old code from the shared library. And since everything is scattered far and wide, the last step is easily forgotten.

    Of course, none of this takes all that much work, and many projects manage it just fine. But it is still more effort than a simple pull request to a single repository. Very simple refactorings (such as sharing a single new function) happen less often, and over time these costs add up. Except if you follow the example of node.js, create a repository per function, and essentially replace Git with npm as your source control system, which also seems strange.
  • A combinatorial explosion of version combinations that are theoretically supported but de facto untested, leaving users to do the integration testing. In practice the project ends up with approved ("blessed") version combinations, or at least that is what happens de facto, because developers simply close bug reports with "please update all modules first". And again, this means you really have a mono-repository, just maybe not in Git. Well, unless you use submodules, and I am not sure that counts as Git... (a minimal submodule sketch follows this list).
  • Reorganizing the split into subprojects is painful, because it means reorganizing the Git repositories and how they are divided. In a single repository, changing maintainers comes down to updating the OWNERS or MAINTAINERS file, and if your bots are any good, the new maintainers get tagged automatically. But if scaling means splitting into disparate sets of repositories, any reorganization is as painful as the initial step from one repository to a group of split-out repositories. Which means your project will be stuck with a bad organizational structure for far too long.
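
The submodule variant mentioned above, as a minimal sketch with placeholder URLs and paths: the superproject pins every component to an exact commit, which is what makes it a de facto mono-repository.

```
# Pin each component to an exact commit in the superproject.
git submodule add https://github.com/example/core-lib.git libs/core
git submodule add https://github.com/example/plugin-foo.git plugins/foo
git commit -m "Pin blessed versions of core-lib and plugin-foo"

# Consumers fetch exactly the blessed combination.
git clone --recurse-submodules https://github.com/example/superproject.git
```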

Interlude: why pull requests exist


The Linux kernel is one of the few projects I am aware of that is not split up like this. Before we look at how that works (the kernel is a gigantic project and it simply cannot function without some subproject structure), it is interesting to see why Git needs pull requests at all. On GitHub, pull requests are the only way for developers to get their patches into the shared code. But kernel changes arrive as patches on a mailing list, even long after Git was introduced and widely adopted.

Yet the very first version of Git already supported pull requests. The audience for those first, fairly raw releases was the kernel maintainers; Git was written to solve Linus Torvalds' maintainer problems. Obviously Git was needed and useful, but not for processing changes from individual developers: even today, and even more so back then, pull requests were used to forward the changes of an entire subsystem, or to synchronize refactored code or similar cross-cutting changes between different subprojects. As an example, Dave Miller's 4.12 networking pull request, sent to Linus, contains more than 2000 commits from 600 developers and a pile of merges of pull requests from sub-maintainers. But almost all the patches themselves are committed by maintainers after picking them up from mailing lists, not by the patch authors. It is a peculiarity of kernel development that authors generally do not commit their own patches into shared repositories, and that is why Git tracks the patch author and the committer separately.
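
This is also why Git ships the request-pull command: it turns a maintainer's branch into exactly the kind of plain-text pull request that gets mailed upstream. A rough sketch, with illustrative tag, URL, and branch names:

```
# Tag the work and publish it where upstream can fetch it.
git tag -s subsystem-for-4.12 -m "subsystem updates for 4.12"
git push origin subsystem-for-4.12

# Generate the pull request text: a summary, diffstat, and the
# "please pull" URL/ref line, ready to paste into an email upstream.
git request-pull v4.11 git://example.org/subsystem.git subsystem-for-4.12
```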

GitHub's innovation and improvement was using pull requests for everything, all the way down to individual patches. But that is not what pull requests were originally created for.

Linux kernel scaling


At first glance, the kernel looks like a single repository, where everything lives in one place in Linus' repository. But that is far from the case:

  • Almost no one runs Linus Torvalds' main repository. If they run something upstream-ish at all, it is usually one of the stable kernels. But far more likely they run a kernel from their distribution, which usually carries additional patches and backports and is often not even hosted on kernel.org, making it an entirely different organization. Or they run a kernel from their hardware vendor (for SoCs and almost everything Android-related), which often differs significantly from anything hosted in one of the "main" repositories.
  • No one (except Linus himself) develops anything based directly on Linus' repository. Every subsystem, and often even large drivers, has its own Git repository, with its own mailing lists for tracking patch submissions and discussing issues, completely isolated from everyone else.
  • Cross-subsystem work happens on top of the linux-next integration tree, which contains several hundred Git branches from roughly as many Git repositories.
  • All this madness is managed through the MAINTAINERS file and the get_maintainer.pl script, which for any given piece of code can tell you who the maintainer is, who should review the code, where the right Git repository lives, which mailing lists to use, and how and where to report bugs. The information is not just based on file locations; it also matches code patterns to make sure that cross-subsystem topics like device-tree handling or the kobject hierarchy are routed to the right experts. A short usage sketch follows this list.
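
Here is what that lookup looks like from a kernel checkout; the patch file name is a placeholder:

```
# Who has to review/ack this patch, and which lists to send it to.
./scripts/get_maintainer.pl my-fix.patch

# The same question for a path rather than a patch.
./scripts/get_maintainer.pl -f drivers/gpu/drm/

# The routing data comes from MAINTAINERS entries, whose fields
# include T: (git tree), L: (mailing list), and B: (bug reporting).
```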

At first glance, this just looks like an elaborate way to fill everyone's disk with a pile of code they do not care about, but there is a mass of small interlocking benefits that add up:

  • It is dead easy to reorganize and split things into a subproject: just update the MAINTAINERS file and you're done. The rest is a bit more involved than it should be, because you may need to create a new repository, new mailing lists, and a new bugzilla. But that is purely a UI problem, one that GitHub elegantly solved with its cute little fork button.
  • It is very, very easy to move the discussion of pull requests and issues between subprojects: you just change the Cc: field of your reply. It is also much easier to coordinate work across subsystems, since one pull request can be submitted to several subprojects while keeping a single shared discussion (the Message-IDs in the mailing list threads are the same for everyone), even though the mails themselves are archived in a pile of different mailing list archives, travel through different mailing lists, and land in thousands of different inboxes. Easy discussion of topics and code between subprojects avoids fragmentation and makes it easier to spot where shared code and refactoring would be useful.
  • Cross-subsystem work needs no release dance. You just change the code, which all lives in your single tree. Note that this is strictly more powerful than what is possible with split repositories: for really invasive refactorings you can still spread the work over several releases, for example when there are so many users that you cannot change them all in one go without causing too much coordination pain.

    A huge benefit of easier refactoring and code sharing is that you do not need to drag a pile of legacy garbage along with you. This is explained at length in the kernel's stable API nonsense document.
  • Nothing prevents you from doing your own experimental additions, which is one of the key benefits of the many-repositories model. Add your code in your own fork and leave it there; no one will ever force you to push the code back, or to push it into the one single repository, or even to move it into the main organization, simply because there is no central repository at all. This works really well, maybe too well, as evidenced by the millions of lines of out-of-tree code in the various Android hardware vendor repositories.

Overall, I think this is a strictly more powerful model, because you can always fall back to doing things exactly as you would with multiple disconnected repositories. Heck, there are even kernel drivers developed in their own repository apart from the main kernel, like the proprietary Nvidia driver. Granted, that is just a bit of source-code glue around a blob, but since for legal reasons it cannot contain any kernel parts, it is a great example.

This sounds like a mono-repository horror story!


Yes and no.

At first glance, the Linux kernel looks like a mono-repository, because it contains everything. And many people know from their own experience that mono-repositories cause plenty of problems, because beyond a certain size they simply cannot be scaled any further.

But if you look closer, this model is very, very far from a single Git repository. The upstream subsystem and driver repositories alone amount to several hundred. If you look at the entire ecosystem, including hardware vendors, distributions, other Linux-based operating systems, and individual products, you can easily count several thousand major repositories, and many, many more beyond that. And that is not counting the Git repositories kept purely for personal use by individual developers.

The key difference is that in Linux there is a single file hierarchy serving as a shared namespace for everything, but many different repositories for all the various needs and projects. It is a mono-tree with many repositories, not a mono-repository.
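
In Git terms: one checkout, many remotes. A sketch using real kernel.org tree names, though the exact URLs should be treated as illustrative:

```
# One working tree that tracks several subsystem repositories.
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git remote add sound https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
git remote add next  https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git fetch --all   # same file hierarchy, many independent sources of commits
```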

Examples please!


Before I start explaining why GitHub is currently unable to support this workflow (at least not while keeping the benefits of the GitHub UI and integration), we need to look at a few examples of how it works in practice. In short, everything runs on Git pull requests between maintainers.

The simple case is changes propagating through the maintainer hierarchy until they eventually land in the tree where they are needed. This is easy, because a pull request always goes from one repository to another, so it can be done with the current GitHub UI.
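
Sketched as Git commands (tree URLs and branch names are illustrative): a subsystem maintainer pulls a driver maintainer's branch, then forwards the result further up with another pull request.

```
# The subsystem maintainer merges the driver maintainer's work...
git checkout subsystem-next
git pull git://example.org/driver.git tags/driver-next-fixes

# ...and later asks the next level up to pull the combined result.
git request-pull v4.11 git://example.org/subsystem.git subsystem-next
```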

Cross-subsystem changes are a lot more fun, because then the flow of pull requests stops being an acyclic graph and becomes a mesh. The first step is to get the changes reviewed and tested by all the subsystems and maintainers involved. In a GitHub workflow this would mean pull requests against multiple repositories at once, with a single shared discussion thread between them. In kernel development, this step is done by submitting the patch to a bunch of different mailing lists, with the maintainers as recipients.
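
In practice that step looks roughly like this (the addresses are illustrative; scripts/get_maintainer.pl can generate the recipient list):

```
# Export the series, then mail it to every affected list at once.
git format-patch -o outgoing/ origin/master
git send-email \
    --to=dri-devel@lists.freedesktop.org \
    --to=alsa-devel@alsa-project.org \
    --cc=some-maintainer@example.org \
    outgoing/*.patch
```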

Merging is separate from reviewing a patch. Here one of the subsystems is chosen as the main one: it takes the pull request, and all the other maintainers agree to that path for the merge. Usually it is the subsystem most affected by the changes, but sometimes it is the one that already has work in flight that conflicts with the pull request. Sometimes an entirely new repository with its own team of maintainers gets created. This often happens for functionality that spans the whole tree and is not neatly contained in a few files and directories in one place. A recent example is the DMA mapping tree, which attempts to consolidate work that was so far spread among driver, platform, and architecture maintainers.

But sometimes numerous subsystems conflict with a set of changes, and a non-trivial merge conflict needs to be resolved by everyone involved. In that case the patches are not applied directly (the equivalent of a rebasing pull request on GitHub); instead a pull request containing only the necessary patches, based on a commit all the subsystems share, is merged into every affected subsystem tree. That common base is important to avoid polluting a subsystem tree with unrelated changes. Since subsequent pulls only concern a specific topic, these branches are usually called topic branches.
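
A sketch of such a topic branch, with illustrative branch, remote, and patch names: the branch starts at a commit every tree already contains, so merging it drags in nothing unrelated.

```
# Base the topic on a release tag all subsystem trees share.
git checkout -b topic/hdmi-audio v4.11
git am hdmi-audio/*.patch          # apply only the topic's patches
git push shared topic/hdmi-audio

# Each affected subsystem tree then merges the very same commits.
git checkout subsystem-a-next && git pull shared topic/hdmi-audio
git checkout subsystem-b-next && git pull shared topic/hdmi-audio
```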

As an example I can offer audio-over-HDMI support, since I was directly involved in the process. It touches both the graphics subsystem and the sound driver subsystem. The same commits from the same pull request were merged into the Intel graphics driver tree as well as into the audio subsystem tree.
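
From a checkout that tracks both trees as remotes this is easy to verify, since shared history means identical commit SHAs (the SHA below is a placeholder):

```
# Lists every remote branch containing the commit; with a topic
# branch merged into both trees, both subsystems show up.
git branch -r --contains 0123abc
```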

A completely different example, showing that this is not madness: the only other OS project in the world of comparable scale also chose a mono-tree, with a commit flow much like Linux's. I am talking about the folks with a tree so gigantic that they had to write GVFS, an entirely new virtual file system, to support it...

Dear GitHub


Unfortunately, GitHub does not support this workflow, at least not natively in the GitHub UI. It can of course be done with plain Git tooling, but then you are back to patches on mailing lists and pull requests sent over email, handled by hand. In my opinion this is the sole reason why the kernel community cannot benefit from a move to GitHub. There is also the small matter of a few top maintainers being categorically opposed to GitHub altogether, but that is not a technical question. And it is not just the Linux kernel: fundamentally, all giant projects on GitHub struggle to scale, because GitHub does not give them the option of scaling to many repositories tied together by a mono-tree.

So I have a request for just one feature on GitHub:

Please implement pull requests and issue tracking spanning different repositories of one mono-tree.

A simple idea, huge consequences.

Repositories and Organizations


First, it should be possible to have multiple forks of the same repository in one organization. Just look at git.kernel.org: most of the repositories there are not personal ones. And even if you support different organizations, for example one per subsystem, requiring an organization for every repository is silly and wasteful; it needlessly complicates access and user management. For example, in the graphics subsystem we would have one repository each for the userspace test suite, the shared userspace library, and the common set of tools and scripts used by maintainers and developers; this GitHub supports. But then you add the overall subsystem repository, plus a repository for the core subsystem functionality, plus additional repositories for each major driver. These are all forks, which GitHub does not support. And each of these repositories would have a bunch of branches: at least one for feature work and one for bugfixes for the current release.

Simply lumping all the branches into one repository will not do, since the whole point of splitting into repositories is to also split up the pull requests and issues.

A related point: you need to be able to establish fork relationships after the fact. For new projects that have always lived on GitHub this is not a problem, but Linux will be able to move at most one subsystem at a time, and there is already a ton of Linux repositories on GitHub that are not proper forks of each other.

Pull Requests


Pull requests need to be attachable to several repositories at the same time, while keeping a single shared discussion thread. You can already reassign a pull request to another branch of a repository, but not to several repositories at once. Reassigning pull requests is really important, because new contributors will simply create pull requests against whatever they consider to be the main repository. Bots can then shuffle them around, taking into account all the repositories listed in the MAINTAINERS file for the set of files the pull request changes. When I talked with GitHub folks, I first suggested that they implement this directly. But I think everything can be automated with scripts, so it would be better to leave this to individual projects, since there is no single standard.
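
A hypothetical sketch of such a bot's core logic, assuming a kernel checkout; the pull request number is made up, and this is just one way to wire it up:

```
# GitHub exposes pull requests as refs; fetch the PR head.
git fetch origin pull/1234/head

# Map the changed files to MAINTAINERS entries, printing the
# T: (git tree) info, i.e. the repositories to route the PR to.
git diff --name-only origin/master FETCH_HEAD |
    xargs ./scripts/get_maintainer.pl --scm --no-rolestats -f
```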

There is still a rather gnarly UI problem, because the list of patches may differ depending on the branch the pull request is aimed at. But that is not necessarily a user error, because some of the repositories may already have some of the patches applied.

In addition, the state of a pull request must be tracked per repository. One maintainer may close it without merging, having decided that another subsystem will take it, while another maintainer merges it and closes the request. Yet another tree may even close the pull request as invalid, because it does not apply to that older version or that vendor fork. Even more fun, a pull request can be merged several times, with a different commit in each subsystem.

Issues


Like pull requests, issues can relate to multiple repositories, and you need to be able to move them. As an example, take a bug report first filed against the kernel repository. After triage it becomes clear that it is a driver bug, one still present in the latest development branch, so it is relevant to that repository, plus the main upstream branch, and maybe a couple of others.

The states must again be per repository, because a bugfix landing in one repository does not instantly make it available everywhere else. It may even need backporting to older kernels and distributions, and someone may decide the bug is not worth the effort there and close it as WONTFIX, even though it is marked as resolved in the corresponding subsystem repository.

Conclusion: a mono-tree, not a mono-repository


The Linux kernel is not going to move to GitHub. But embracing the Linux scaling model of a mono-tree with many repositories would be a great fit for GitHub, and it would help all the really large projects already hosted there. I think it would give them a new and more effective option for solving their unique scaling problems.
