
How Chromium Works
- Translation

Preface from the author of the original:
In March 2011 I wrote a draft of an article about how the team behind Google Chrome develops and ships its product, and then promptly forgot about it. Only a few days ago did I accidentally stumble upon it again. Even though it is outdated in places (Chrome forked WebKit into Blink in 2013, and I myself no longer work at Google), I am inclined to believe that the ideas presented in it are still valid.

Today I am going to tell you how Chromium works. No, this is not going to be about the Chrome browser, but rather about Chromium, the group of people who build the browser.
Hundreds of engineers work on the Chromium project. Together we land about 800 changes in the codebase every week. We also depend on many other large, rapidly evolving projects such as V8, Skia, and WebKit.
We ship a new stable release to hundreds of millions of users every six weeks, right on schedule. And we maintain several other early-access channels that update even faster; the fastest, canary, quietly auto-updates almost every weekday.
How does all this keep working? Why haven't the wheels fallen off this bus yet? Why haven't all the developers gone crazy?
From a technological point of view, the Chromium team's speed is made possible by reliable, efficient, and quiet auto-updates.
From a human point of view, it is the merit of the dedicated, hardworking, and smart QA teams and release engineers, without whom the whole project would fall apart within weeks; and also of the designers, product managers, writers, PR people, lawyers, security folks, and everyone else who works together on every stable release. I will not talk about all of them today, concentrating only on engineering topics, so as not to slide into a giant post in the spirit of Steve Yegge.
I am going to talk about the Chromium development process, which is built specifically to make fast releases possible. I will cover findings that may be useful to other projects regardless of what their release schedule looks like, and I will mention the difficulties we run into.
Without branches
In many projects it is common practice to create a branch for work on a new "major" feature, the idea being that destabilization from the new code will not affect other developers and users. Once the feature is complete, it is merged back into trunk, which is usually followed by a period of instability while integration issues are ironed out.
With Chrome this approach would not work, because we release every day. We cannot allow large chunks of new code to suddenly land in trunk, since that would likely break the canary or dev channels for a long time. Besides, Chrome's trunk moves forward at such a speed that it is impractical for developers to stay isolated on a branch for long: by the time they finished with their branch, trunk would look so different that integration would be time-consuming and error-prone.
We do create release branches before each of our beta releases, but they do not live long: at most six weeks, until the next beta. And we never develop directly on these branches; every late fix destined for a release is first landed in trunk and then cherry-picked into the branch.
A pleasant side effect of this process: the project has no "second-class" development team that works exclusively on the production branch. All developers always work with the latest version of the source code.
Runtime Switches
Although we do not create branches, we still need a way to hide unfinished features from users. The obvious way to do this is with compile-time checks; the problem is that this approach is not much different from keeping the code on a branch: in effect you still have two independent code paths that must eventually be merged. And since the new code is not compiled or tested by default, it is easy for other developers to accidentally break it.
Instead, the Chromium project uses runtime checks. Every feature under development is compiled and tested on all configurations from the very beginning. We have command-line flags that are checked in a few top-level places; elsewhere, the codebase is mostly unaware of which features are enabled. This strategy means that work on a new feature is integrated into the project's code as much as possible from day one. At a minimum the new code compiles, so all the changes to the main code that the new feature requires are exercised, while users see everything working as usual and notice no difference. And we can easily write automated tests for features that are not yet available by temporarily "overriding" the command line.
When a feature gets closer to completion, we expose it as an option in chrome://flags so that advanced users can start testing it and give us feedback. Finally, when we consider the feature ready to ship, we simply remove the command-line flag and enable it by default. By that time the code has usually been tested far and wide, and exercised by many users, so the potential damage from turning it on is minimal.
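To make this concrete, here is a minimal sketch in the spirit of Chromium's base::CommandLine API (the switch name and helper functions are hypothetical, and the exact namespaces have shifted over the years):

```cpp
#include "base/command_line.h"

namespace switches {
// Hypothetical switch name, for illustration only.
const char kEnableMyFeature[] = "enable-my-feature";
}  // namespace switches

// Checked in one top-level place; the rest of the codebase just calls
// this helper and stays unaware of which features exist.
bool IsMyFeatureEnabled() {
  return base::CommandLine::ForCurrentProcess()->HasSwitch(
      switches::kEnableMyFeature);
}

// A test can temporarily "override" the command line to exercise the
// feature before it ships.
void EnableMyFeatureForTesting() {
  base::CommandLine::ForCurrentProcess()->AppendSwitch(
      switches::kEnableMyFeature);
}
```

Shipping the feature then amounts to deleting the switch and the helper, rather than merging a long-lived branch.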
A huge amount of automated testing
To be able to release every day, we must be sure that our codebase is always in good shape. That requires automated tests, and a very large number of them. At the time of writing, Chrome has about 12k class-level unit tests, 2k automated integration tests, and a huge assortment of performance tests, bloat tests, thread-safety and memory-safety tests, and probably many others I cannot recall right now. And that is just Chrome itself; WebKit, V8, and the rest of our dependencies are tested independently. WebKit alone has roughly 27k tests that verify web pages are rendered and behave correctly. Our basic rule is that every change should come with tests.
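For a sense of what a class-level unit test looks like, here is a minimal sketch using googletest, the framework Chromium's unit tests are written in (the class under test is made up for illustration):

```cpp
#include <string>

#include "testing/gtest/include/gtest/gtest.h"

// Hypothetical class under test, for illustration only.
class UrlFixer {
 public:
  static std::string AddScheme(const std::string& input) {
    return input.find("://") == std::string::npos ? "http://" + input
                                                  : input;
  }
};

TEST(UrlFixerTest, AddsSchemeWhenMissing) {
  EXPECT_EQ("http://example.com", UrlFixer::AddScheme("example.com"));
}

TEST(UrlFixerTest, LeavesExistingSchemeAlone) {
  EXPECT_EQ("https://example.com",
            UrlFixer::AddScheme("https://example.com"));
}
```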
We use a public buildbot that continuously runs every new change to our code through the test suite. We adhere to a "green tree" policy: if a change breaks a test, it is reverted right away, and the developer must fix the change and land it again. We do not leave such breakage in the tree, because:
- It makes it easier to accidentally break things even further: if the tree is already red, nobody notices when it gets redder
- It slows development down, since everyone is forced to work around the breakage
- It encourages careless quick fixes just to get the tests passing
- It keeps us from releasing!
To help developers avoid breaking the tree, we have try bots: a way to run a change through all the tests on all configurations before landing it, with the results emailed to the developer. We also have a commit queue, which tests a change and lands it automatically if all the tests pass. I like to use it after a long night of hacking: I press the button, go to bed, and wake up hoping my change has landed.
Thanks to all this automated testing, we can get away with a minimal amount of manual testing on the dev channel, and none at all on the canaries.
Ruthless refactoring
Since we have fairly extensive test coverage, we can afford to be aggressive about refactoring. The Chrome project constantly has refactoring efforts under way in several major areas; as of 2013, for example, these were Carnitas and Aura.
At our scale and pace it is critical to keep the codebase clean and easy to understand; for us these qualities matter more than avoiding regressions. Engineers across the Chrome project are empowered to make improvements anywhere in the system (though we may require a review from the module's owner). If a refactoring ends up breaking something that the tests failed to catch, we consider it the fault not of the engineer who did the refactoring, but of the one whose feature was insufficiently covered by tests.
DEPS
WebKit is developing at no less rapid a pace. And just as we cannot afford feature branches that someday suddenly merge into trunk, we cannot afford to take a month's worth of WebKit changes all at once: that would destabilize the tree for days.
Instead, we keep Chrome building against a very fresh version of WebKit, almost always no more than half a day old. At the root of the Chrome source tree there is a file, DEPS, that records the WebKit revision Chrome currently builds against successfully. When you check out a working copy or update the Chrome source code, the gclient tool automatically fetches the WebKit version specified in that file.
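For illustration, here is roughly what a fragment of such a DEPS file looks like. DEPS files are written in Python syntax; the repository URL and revision number below are made up:

```python
# DEPS (fragment) -- pins the exact upstream revisions Chrome builds
# against. gclient reads this file on every checkout and update.
deps = {
    # Hypothetical pinned WebKit revision; an engineer bumps this
    # number several times a day as part of the "roll".
    "src/third_party/WebKit":
        "http://svn.webkit.org/repository/webkit/trunk@90000",
}
```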
Several times a day, every day, an engineer bumps that revision number, checks whether any new integration problems have appeared, and assigns bugs to the appropriate engineers. As a result, we only ever take small deltas of WebKit changes at a time, so the impact on our source tree is usually minimal. We have also added our bots to the WebKit buildbot, so when WebKit engineers land a change that breaks Chrome, they find out about it immediately.
The big advantage of the DEPS system is that we can pick up changes to our web platform very quickly: a feature landed in WebKit becomes available to Chrome users on the canary channel in just a few days. This pushes us to make improvements directly upstream in WebKit, where they benefit everyone who uses WebKit in their applications, rather than patching them locally in Chrome. In fact, our basic rule is that we make no local changes to WebKit at all (nor to the other projects Chrome depends on).
Problems
Thorough testing remains an unsolved problem for us. In particular, flaky integration tests are a constant headache. Chrome is large, complex, asynchronous, multi-process, and multi-threaded, so integration tests can easily fail because of subtle timing issues, and from time to time they do. At our scale, a test that fails 1% of the time is guaranteed to fail several times a day.
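A back-of-the-envelope calculation makes this concrete (the run count is my assumption, not a figure from the project): suppose such a test runs once for each of our ~800 weekly changes, i.e. roughly 160 runs per weekday. With an independent 1% failure rate it fails about 160 × 0.01 ≈ 1.6 times a day on average, and the probability of at least one failure in a day is 1 − 0.99^160 ≈ 80%. Try bots and the commit queue run the tests far more often than that, so several failures a day is entirely plausible.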
As soon as a test becomes flaky, the team quickly gets into the habit of ignoring it, which makes it easy to miss a genuine failure in the same part of the code. So we tend to disable flaky tests and lose their coverage, which makes it easier for basic regressions to reach users.
Another problem is that at this speed it becomes difficult to apply polish. It seems to me it is easier for a team to get every detail right in rare, high-profile releases than to sustain focus on every little thing for an indefinite time. And since small details, such as the spacing between toolbar buttons, are often hard to test automatically, errors easily creep into such places.
Finally, it seems to me that stress is a very real problem. With all this code constantly changing, even someone who tries to focus only on their own area of responsibility is not protected from something happening in another part of the project. If you are constantly trying to keep your part of Chrome working, sooner or later it starts to feel like you are living on a volcano and cannot afford a minute of peace.
We are tackling this last problem by splitting the codebase into major modules: the Carnitas engineers are working to establish clearer and stricter interfaces between some of our main components. A significant portion of the code has already become cleaner and clearer, but it is still too early to say how much this will reduce stress overall.
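As a sketch of what "clearer and stricter interfaces" can mean in practice (the component and its methods here are made up): other modules depend only on a narrow abstract interface, so the component's internals can be refactored without anyone outside noticing.

```cpp
#include <string>

// Hypothetical component boundary, for illustration only. Code outside
// the component includes this header and nothing else from it.
class SpellcheckService {
 public:
  virtual ~SpellcheckService() = default;

  // The only operations other modules may call.
  virtual bool IsMisspelled(const std::string& word) = 0;
  virtual void AddToDictionary(const std::string& word) = 0;
};
```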
In closing
So, thanks to what have the wheels not yet fallen off this bus? In short: thanks to living without branches, runtime switches, tons of automated tests, ruthless refactoring, and staying as close as possible to the HEAD of our dependencies.
These techniques will be most useful for large projects with fast-moving upstream dependencies, but some of them may well be applicable to smaller projects too.