Git scaling (and some background)

Original author: Brian Harry

A few years ago, Microsoft decided to begin a long process of overhauling its development systems across the company. We are a big company with many teams - each with its own products, priorities, processes and tools. There are some "common" tools, but also a great deal of variety - and a VERY large number of single-purpose tools built in-house (by teams I mean larger units - thousands of engineers).

This has a number of downsides:

  1. A lot of redundant investment in teams building similar tools.
  2. The inability to fund any single toolset to "critical mass".
  3. Difficulty for employees moving around the company because of differing tools and processes.
  4. Difficulty sharing code between organizations.
  5. Friction for new hires getting started, caused by the overabundance of MS-only tools.
  6. Etc...

We came up with an initiative called the "One Engineering System", or 1ES for short. Just yesterday we had 1ES day, where thousands of engineers gathered to celebrate the progress made, learn about the current state of affairs and discuss plans. It was a surprisingly good event.

A small digression... You might ask: hey, you've told us for years that Microsoft uses Team Foundation Server - have you been lying to us? No, I haven't. More than 50 thousand people use TFS regularly, but they don't necessarily use it for all of their work. Some do use it for everything. Some only for work item tracking. Some only for version control. Some only for builds... We have internal versions (and in many cases more than one) of almost everything TFS does, and someone, somewhere, uses every one of them. It's a bit chaotic, to be completely honest. But taken all together, we can confidently say that TFS has more users than any other single toolset.

I also want to note that when I say "engineering system" I use the term VERY broadly. It includes, but is not limited to, the following:

  1. Source code management
  2. Work management
  3. Build
  4. Releases
  5. Testing
  6. Package management
  7. Telemetry
  8. Incident management
  9. Localization
  10. Security scanning
  11. Accessibility
  12. Compliance management
  13. Code signing
  14. Static analysis
  15. and many many others

So, back to the story. When we set out on this path, there were some fierce debates about where we were headed, what should come first, and so on. You know, developers never have opinions. :) There was no way to tackle everything at once without failing, so we agreed to start with three problems:

  • Work planning
  • Source control
  • Build

I won't go into all the reasons in detail, except to say that these are foundational pieces - a great deal is integrated with them and built on top of them, so it made sense to pick exactly these three problems. I'll also note that we had HUGE problems with build times and reliability because of the size of our products - some of them consist of hundreds of millions of lines of code.

Over time these three core areas have expanded, so the 1ES initiative has, to varying degrees, touched almost every aspect of our development process.

We made some interesting bets. Among them:

The future is in the cloud - Most of our infrastructure and tools were hosted on-premises (including TFS). We agreed that the future is in the cloud - mobility, management, evolution, elasticity, all the reasons you can think of. A few years ago this was very controversial. How could Microsoft move all of its intellectual property to the cloud? What about performance? What about security? Reliability? Regulatory compliance and governance? What about...? It took time, but eventually a critical mass of people got behind the idea. Over the years the decision has looked better and better, and now everyone is excited about the move to the cloud.

1st party == 3rd party - This is an expression we use internally. It means that we strive, as much as possible, to use our own commercial products - and, conversely, to sell the products we use ourselves. It doesn't always work out to 100% and it isn't always simultaneous, but that is the direction of travel - the default assumption unless there is a good reason to do otherwise.

Visual Studio Team Services as the foundation - We bet on Team Services as the backbone. We need a fabric that ties our entire development system together - a central hub where you learn everything and get everything done. The hub has to be modern, rich, extensible, etc. Every group should be able to contribute and share its particular pieces of the engineering system. Team Services are a great fit for this role. Over the past year their audience at Microsoft has grown from a couple of thousand people to more than 50,000 regular users. As with TFS, not every group uses them for everything yet, but the momentum in that direction is strong.

Work planning in Team Services - Having chosen Team Services, it was natural to adopt its work planning capabilities as well. We have onboarded groups like Windows, with many thousands of users and many millions of work items, into a single Team Services account. To make that work, we had to do a great deal of performance and scale work along the way. At this point almost every group at Microsoft has made the transition, and all of our development is managed through Team Services.

Team Services Build orchestration and CloudBuild - I won't dig too deeply into this topic, because it is gigantic in its own right. Suffice it to say that we chose Team Services Build as our build orchestration system and our build UI. We also developed a new "make engine" (which we have not released yet) for some of the largest codebases; it supports fine-grained caching at very large scale, parallel execution and incrementality, and we have seen builds that took hours drop to minutes. We will say more about this in a future article.

That's a lot of background - now to the main point.

Git for source code management


Probably the most controversial decision we made concerned the source code management system. We had an internal system called Source Depot, which absolutely everyone used in the early 2000s. Over time, TFS and its Team Foundation Version Control solution gained popularity within the company, but they were never able to win over the largest development groups - such as Windows and Office. I think there are many reasons. One of them is that for teams that large the cost of migrating was extremely high, and the two systems (Source Depot and TFS) were not different enough to justify it.

But version control systems inspire intense loyalty - more than any other developer tool. So the battle between supporters of TFVC, Source Depot, Git, Mercurial and others was fierce, and, to be honest, we made the choice without reaching a consensus - it simply had to be made. We decided to standardize on Git, for many reasons. Over time, the decision has gained more and more supporters.

There were many arguments against choosing Git, but the most ironclad was scaling. There are not many companies with a codebase of our size. Windows and Office, in particular (there are others), are massive. Thousands of developers, millions of files, thousands of build machines constantly building. Honestly, it's mind-boggling. To be clear, when I mention Windows here I mean all of its versions - Windows for PC, Mobile, Server, HoloLens, Xbox, IoT and so on. And Git is a distributed version control system (DVCS). It copies the entire repository and its entire history to your local machine. Doing that with Windows would be laughable (and we were laughed at plenty early on). Both TFVC and Source Depot had been carefully tuned and optimized for large codebases and our specific development teams. Git had never been used for a task like this (or even one within the same order of magnitude), and many argued that it would never work.

The first big debate was about how many repositories to have - one for the entire company, or one for each small component? A huge range. Git has proven to work exceptionally well for a very large number of modest repositories, so we spent a lot of time thinking about breaking our large codebases into a large number of moderately sized repositories. Hmm. Ever worked with a huge codebase for 20 years? Ever tried to go back afterwards and break it into small repositories? You can guess the answer we arrived at. That code is very hard to tease apart. The cost would be too high. The risks from that level of churn would be enormous. And we really do have scenarios where a single engineer needs to make sweeping changes across a very large amount of code.

After much hand-wringing, we decided that our strategy should be "the right number of repositories, based on the nature of the code". Some code is separable (like microservices) and is ideal for isolated repositories. Some code cannot be split apart (like the Windows kernel) and needs to be treated as a single repository. And I want to stress that it is not only about the difficulty of breaking the code into pieces. Sometimes, in large, interconnected codebases, it really is better to treat the codebase as a whole. Maybe someday I'll tell the story of the Bing team's attempt to split the components of Bing's core platform into separate packages - and the versioning problems they ran into. They are now moving away from that strategy.

So we had to start scaling Git to work with codebases of millions of files and hundreds of gigabytes, used by thousands of developers. Incidentally, even Source Depot never scaled to the entire Windows codebase. It had been split across more than 40 repositories so that it could scale, with a layer built on top so that, in most cases, the codebase could be treated as a whole. The abstraction was not perfect and definitely caused some friction.

We started down at least two failed paths to scaling Git. Probably the most significant was an attempt to use Git submodules to stitch many repositories together into a single "super-repository". I won't go into the details, but after six months of working on the project we realized it was not going to work - too many edge cases, too much complexity, too fragile. We needed a proven, reliable solution that was well supported by virtually all Git tooling.

Almost a year ago we came back around and focused on the question of how to actually scale Git to a single repository holding the entire Windows codebase (including estimates for growth and history), and how to support all the developers and build machines.

So we tried "virtualizing" Git. Normally Git downloads everything when you clone. But what if it didn't? What if we virtualized the storage underneath it so that it downloads only the parts you need? Then cloning a huge 300 GB repository becomes very fast. As I issue read/write commands in my working copy, the system quietly fetches the content from the cloud (and then stores it locally, so future accesses to that data are local). The one downside is the loss of offline support. To work offline you have to "touch" everything you need so that it is present locally; other than that, nothing changes - you still get a 100% faithful Git experience. And for our huge codebases that trade-off was acceptable.
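
To make the idea concrete, here is a minimal sketch of a read-through object store - purely illustrative Python under my own assumptions (the class name, the /objects/<id> endpoint and the on-disk layout are invented for this example and are not GVFS code): an object is downloaded only the first time something asks for it, and every later read is served from the local cache.

    import os
    import urllib.request

    class LazyObjectStore:
        """Toy read-through cache: an object is fetched the first time it is read."""

        def __init__(self, server_url, cache_dir):
            self.server_url = server_url    # hypothetical blob endpoint, not a real GVFS URL
            self.cache_dir = cache_dir
            os.makedirs(cache_dir, exist_ok=True)

        def _cache_path(self, object_id):
            # Shard by the first two characters, the way .git/objects does.
            return os.path.join(self.cache_dir, object_id[:2], object_id[2:])

        def read(self, object_id):
            path = self._cache_path(object_id)
            if os.path.exists(path):
                with open(path, "rb") as f:   # local hit: no network traffic at all
                    return f.read()

            # Local miss: download just this one object and cache it for next time.
            data = urllib.request.urlopen(f"{self.server_url}/objects/{object_id}").read()
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(data)
            return data

Cloning is cheap because nothing is downloaded up front; the cost is paid incrementally, per object, the first time a file is actually read - which is also why offline work requires touching everything you will need ahead of time.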

It was a promising approach, and we began to develop a prototype. We called the project the Git Virtual File System, or GVFS. We set a goal of making as few changes to git.exe as possible. We certainly did not want to fork Git - that would be a disaster. And we did not want to change it in ways the community would never accept. So we chose a middle path in which as many changes as possible are made "underneath" Git - in a virtual file system driver.

The virtual file system driver basically virtualizes two things:

  1. The .git folder, where pack files, history, etc. are stored. This is the default location for everything. We virtualized it so that only the files that are needed get pulled down, and only when they are needed.
  2. The “working directory” - the place where you actually go to edit your source code, compile it, etc. GVFS monitors the working directory and automatically “checks out” any file you touch, giving the impression that all the files are there while not paying the cost until you actually access a specific file. (A rough sketch of this idea follows the list.)
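
As a rough mental model of that second point - again just an illustrative Python sketch under my own assumptions; GVFS itself does this work inside a Windows file system filter driver, not in user-space code like this - you can think of the working directory as a projection: the directory listing comes from a lightweight manifest, and a file's contents are only materialized ("hydrated") the first time something opens it.

    class ProjectedWorkingDirectory:
        """Toy model of a projected working directory.

        `manifest` maps relative paths to object ids; `store` is a lazily
        fetching object store like the one sketched earlier.
        """

        def __init__(self, manifest, store):
            self.manifest = manifest
            self.store = store
            self.hydrated = {}    # paths whose contents have actually been materialized

        def listdir(self):
            # Enumerating the tree needs only the manifest, never the file contents,
            # so it stays fast no matter how big the repository is.
            return sorted(self.manifest)

        def open(self, path):
            # The first access "hydrates" the file; later accesses are purely local.
            if path not in self.hydrated:
                self.hydrated[path] = self.store.read(self.manifest[path])
            return self.hydrated[path]

The important property is that the cheap operation (enumerating names) never forces the expensive one (fetching contents), so a 300 GB tree can look complete while only the handful of files you edit ever hit the network.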

As we progressed, as you can imagine, we learned a lot. Among other things, we learned that the Git server has to be smart. It has to pack Git files in the most optimal way so that it doesn't send the client more than it really needs - think of it as optimizing for locality of reference. So we made many improvements to the Team Services/TFS Git server. We also found that Git has many scenarios in which it touches files it doesn't need to. That never mattered before, because everything was local and Git was used on medium-sized repositories, so it was very fast - but when touching everything means downloading it from the server, or scanning 6,000,000 files, it's no joke. So we spent a lot of time optimizing Git's performance. Many of the optimizations we made will benefit "normal" repositories to some degree, but they are critical for mega-repositories. We have submitted many of these improvements to the Git OSS project and have enjoyed a good collaboration with them.
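
To get a feel for why "touching everything" stops being a joke at this scale, here is a back-of-the-envelope sketch with toy numbers of my own (the per-file cost and the count of modified files are made up; only the 6,000,000-file figure comes from the text): an operation that examines every file pays a per-file cost six million times, while one driven by a tracked set of modified paths pays it only for the files that actually changed.

    TOTAL_FILES = 6_000_000       # rough file count quoted above for the Windows repository
    MODIFIED_FILES = 200          # made-up figure for a typical day's worth of edits
    COST_PER_CHECK_MS = 0.05      # assumed cost to stat / examine a single file

    def full_scan_ms():
        # A naive operation that inspects every file in the working directory.
        return TOTAL_FILES * COST_PER_CHECK_MS

    def tracked_set_ms():
        # The same operation driven by a watcher that already knows what changed.
        return MODIFIED_FILES * COST_PER_CHECK_MS

    print(f"full scan:   {full_scan_ms() / 1000:.0f} s")    # ~300 s with these numbers
    print(f"tracked set: {tracked_set_ms():.0f} ms")        # ~10 ms

With these invented numbers the difference is roughly five minutes versus ten milliseconds - which is the kind of gap the Git performance work described above had to close.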

So, fast forward to today. It works! We have all the code from more than 40 Windows Source Depot servers in a single Git repository hosted in VS Team Services - and it performs well. You can enlist in a couple of minutes and do all your normal Git operations in seconds. And it is, in every sense, a transparent experience. It's just Git. Your developers keep working the way they always have, using the tools they've always used. Your builds just work, and so on. It's simply amazing. Magic!

As a side benefit, this approach also handles large binary files well. It doesn't bolt a new mechanism onto Git the way LFS does - no separate pointer files and the like. You can work with large binary files like any other files, and only the blobs you have actually touched get downloaded.

Git Merge


At the Git Merge conference in Brussels, Saeed Noursalehi shared with the world what we are doing - including the excruciating details of the work and what we have learned. At the same time, we released all of our work as open source, including some additional server protocols we needed to introduce. You can find the GVFS project and all the changes we made to Git.exe in Microsoft's repositories on GitHub. GVFS relies on a new Windows filter driver (the moral equivalent of the FUSE driver on Linux), and we worked with the Windows team to release that driver early so you can try GVFS. See Saeed's post for more information and links to additional resources. You can study them, and you can even install GVFS and try it out.

While I'm excited about how GVFS performs, I want to emphasize that there is still a lot to do. We are not done. We think we have proven the concept, but much work remains to make it real. We are making the announcement and publishing the source code to engage the community in working on it together. Together, we can scale Git for the largest codebases.

Sorry for the long post; I hope it was interesting. I'm excited about the work we've done - both on the 1ES initiative at Microsoft and on scaling Git.
