
From 15 and up: how to make CI scale

Plenty of articles and talks these days cover specific DevOps technologies: Docker, Kubernetes, Ansible... I want to talk about building processes instead: how we at Wrike evolved, over two and a half years, from a release system for 15 front-end developers to one serving almost 60, with 2-3 deployments per day.
This article is about the lessons we learned along the way. It is based on my talk at the DevOps meetup at the Wrike Tech Club. If you have no time to read, there is a video recording of the presentation.
A brief introduction
The task of any product company in a highly competitive industry is to find the coolest solution, and to find it faster than the competitors. You need to sort through options, experiment, release quickly, recognize mistakes, invent faster still and release faster still. The speed at which you test your product ideas is your competitive advantage. All the continuous integration, continuous delivery and other techniques and tricks exist precisely to outpace the competition. That is what the DevOps department at Wrike does.
We started on this in 2015. The front end then consisted of 15 people, and a deployment was two zip files. They could only be rolled out together and were unpacked via bash by the warm, skillful hands of system administrators.
And now we have 12 Scrum teams, cross-functional and independent. Across those 12 teams there are already more than 60 front-end developers. And our product, those two zip files, is now 60 Git repositories. We deploy one to three times every day. And this is just the front end.
I'll mention right away that we write the front end in the Dart language. It has its own runtime, written in C. It is a strongly typed language whose syntax is closest to C# and Java, with its own compiler, its own dependency-resolution system and its own libraries. You can read about how we live with Dart in other articles, but let's return to our topic.
Stage 1
So, in 2015 we had no automation: one repository on Mercurial and no way to collaborate on the code base. There was no GitLab; you couldn't discuss a diff or a patch, couldn't raise an issue in a merge request, discuss it and resolve it. Our model of working with the VCS, the version control system, was not formalized.
That doesn't matter when you have a team of five and people simply agree informally: "Let's release together." Who merges into whose branch: agreed, go. The next day, new faces. Agree again. But such a model cannot scale, and it cannot evolve.
The first version was cheap and quick. We took the Wrike UI, although it could have been any issue tracker: Jira, YouTrack or something else. The main thing is that it lets you exchange comments and mark tasks with tags or folders.
Everything we wrote was a stateless Python application, with no persistent data layer at all.
In the end, we combined TeamCity and Wrike and wrote a tag-based solution. The developer imperatively set tags that the bot interpreted as commands: run automated tests, integrate branches, and so on.

Since our product is SaaS and we deploy a couple of times a day, we don't need long-lived release branches. No patches to backport, no hotfix branches to maintain. That means you can move from a complex Git-flow model to GitHub flow, and from there to something even simpler. We set up two views in the tracker and wrote out which features go into a release, with numbers, links and the people involved. That is the entire first version of the integration. In essence, we received data from two sources, GitLab and TeamCity, and pushed data to Wrike and TeamCity through their APIs. All work with the front-end code base went through TeamCity.
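To give a feel for it, here is a minimal sketch of such a tag-driven bot. The tracker call is simplified and the hosts, tokens, tag and build IDs are hypothetical; the build-queue endpoint is TeamCity's real REST API.

```python
import time
import requests

TEAMCITY = "https://teamcity.example.com"          # hypothetical host
TRACKER = "https://tracker.example.com/api/tasks"  # simplified stand-in for the tracker API
TC_TOKEN = "teamcity-token"
TRACKER_TOKEN = "tracker-token"

def tasks_with_tag(tag):
    """Ask the tracker for tasks carrying a command tag (schema simplified)."""
    resp = requests.get(TRACKER, params={"tag": tag},
                        headers={"Authorization": f"Bearer {TRACKER_TOKEN}"})
    resp.raise_for_status()
    return resp.json()["tasks"]

def queue_build(build_type_id, branch):
    """Queue a build through TeamCity's REST API: POST /app/rest/buildQueue."""
    body = f'<build branchName="{branch}"><buildType id="{build_type_id}"/></build>'
    resp = requests.post(f"{TEAMCITY}/app/rest/buildQueue", data=body,
                         headers={"Content-Type": "application/xml",
                                  "Authorization": f"Bearer {TC_TOKEN}"})
    resp.raise_for_status()

while True:                                   # the whole bot: one stateless loop
    for task in tasks_with_tag("run-tests"):  # a tag the developer set by hand
        queue_build("Frontend_Tests", task["branch"])
    time.sleep(60)
```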
Stage 1: lessons
We moved from Mercurial to Git, which helped us scale painlessly from 15 to 60 front-end developers. There are plenty of people on the market who at least roughly understand how Git works. You will definitely break something in Git, and you will definitely learn how to fix it. Hiring people who understand how to work with Git in your model is three hundred times easier than with any other VCS.
Visualizing work with the code base in GitLab proved very useful. People started working through merge requests, which gave us a better code base. The release visualization was done in Wrike, very cheaply and very quickly.
And the specifics of SaaS and frequent releases played into our hands when it came to branching.
Stage 2
Having rolled out the first version, we immediately found one important flaw: we used squash commits in Git. Git lets you group several commits into one. Suppose a developer made 5 commits, and we want to integrate them into master. We grouped those five commits into one with squash and thereby got atomicity: every feature was a single commit. You either have it or you don't, and it was immediately clear who had brought which code into the release. A beautiful story: each commit in the release branch is a whole feature. Easy to read. Atomic.
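Mechanically this is just Git's squash merge; a sketch of what the bot did, with subprocess standing in for our real plumbing:

```python
import subprocess

def integrate_feature(branch, target="master"):
    """Land a whole feature in the release branch as one atomic commit."""
    def run(*cmd):
        subprocess.run(cmd, check=True)
    run("git", "checkout", target)
    run("git", "merge", "--squash", branch)            # stage the feature as one diff
    run("git", "commit", "-m", f"feature: {branch}")   # the single commit you either have or don't
```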
But trouble came from where we least expected it. It turned out we have features that live for two or three weeks. What does that mean? While master moves forward every single day, someone is developing off to the side. When a conflict with upstream arises, the person faces not a diff of one or two commits but two huge walls of text: a squash commit against another squash commit. Blood flows from the eyes, and reviewing it is impossible, just a mountain of text.
Also, once we automated and formalized the process, the gears began to spin much faster, and master started to run away quickly from features still in development. When a person starts a new feature or continues an existing one, he doesn't have the fixes from upstream, so he can easily be working in the presence of regression defects: the bug is already fixed in master, but his branch doesn't have the patch.
As a result, we quickly had to build a mechanism that merged master into actively developed branches.
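A sketch of that back-merge mechanism, assuming the bot knows the list of active branches:

```python
import subprocess

def backmerge_master(active_branches):
    """Merge fresh master into every feature branch still in development,
    so nobody works on top of already-fixed regressions."""
    def run(*cmd):
        subprocess.run(cmd, check=True)
    run("git", "fetch", "origin", "master")
    for branch in active_branches:
        run("git", "checkout", branch)
        try:
            run("git", "merge", "origin/master")   # a real merge, not a squash
            run("git", "push", "origin", branch)
        except subprocess.CalledProcessError:
            run("git", "merge", "--abort")         # conflict: leave it to the author
```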
Stage 2: lessons
It would seem we did our colleagues a favor, yet in doing so we created two defects. In the end we had to sacrifice squash commits, although we really liked them. It was a very beautiful solution on paper, but if we had sat a little longer with a pencil and a notebook instead of rushing to code, we could have seen these risks and prevented them in advance.
Stage 3
The third version was provoked by challenges from the outside world. As I said, our product initiatives are about testing ideas quickly. One day the product team came and said: "We want to test ideas even faster." This is about MVPs (Minimum Viable Product). The point is that you build a small piece of product functionality to test a business idea and roll it out very quickly. Where before you tested ideas sequentially within one large product, now the task is to test many ideas at the same time. That is, we had two applications, but we needed ten, or more likely 20.
Let me remind you that we have a Single Page Application: bytes on a server that the client downloads. To version artifacts, we used an HTTP version parameter. And sometimes, when syncing to the servers via rsync, old versions of artifacts were left behind.
And that created wonderful bugs. A user downloaded half of Wrike, say, while the rest of the application was fetched on demand. Two weeks later he downloads the second half, but it is a different version. Debugging this was extremely difficult, and reproducing it almost impossible.
We decided to repackage everything in a completely different way, using RPM, which has a built-in system for resolving dependencies and checking hash sums. At any given moment you can verify, across many of your boxes, that no one has modified your files and they match the originals. Modern distributions also have a multithreaded incremental repository indexer. There are ready-made patterns and solutions for distributing artifacts around the world, indexing them and everything else. Plus signing with GPG keys.
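That integrity check is standard RPM machinery, `rpm -V`; a sketch of how a box can be audited (the package name is hypothetical):

```python
import subprocess

def package_untouched(name):
    """rpm -V re-checks installed files against the package database;
    empty output and a zero exit code mean nothing was modified on disk."""
    result = subprocess.run(["rpm", "-V", name],
                            capture_output=True, text=True)
    return result.returncode == 0 and result.stdout == ""

print(package_untouched("wrike-frontend"))  # hypothetical package name
```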
How do we solve the problem of the two halves of Wrike that people can download at any moment? Our Single Page Application is designed so that a person can work in an old version for a month, until he reloads the tab or opens a new one. The fix: we began referring to all artifacts by permalink, with the version baked directly into the link. You either download the correct artifact or you download nothing.
In addition, we decided to keep the 50 latest versions of Wrike on the file system and point to the current one with a symlink. Even while people are downloading assets, we can change the version right on the fly. For us, rolling back a version means moving a symlink: within seconds we can roll back everywhere to the version we need. It is fast and simple; the bytes are already everywhere they are needed.
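The switch itself is tiny. A minimal sketch, assuming versions live under a releases directory (the paths are hypothetical):

```python
import os

RELEASES = "/srv/wrike/releases"   # hypothetical: one directory per version
CURRENT = "/srv/wrike/current"     # the symlink the web server serves from

def switch_to(version):
    """Atomically repoint 'current' to any version we still keep."""
    target = os.path.join(RELEASES, version)
    if not os.path.isdir(target):
        raise ValueError(f"version {version} is not on this box")
    tmp = CURRENT + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)     # build the new link off to the side...
    os.replace(tmp, CURRENT)    # ...and swap it in with one atomic rename
```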
There is also a positive side effect: we can roll not only back but also forward. We can upload versions to prod that are not yet publicly available but can already be clicked through and tested. That is, we can get ahead of events.
We achieved this by storing many versions and manipulating the current one with a symlink. Naturally, RPM cannot do this out of the box, so we wrote a custom Ansible module in Python. For those who have never written Ansible modules, I highly recommend trying: the API is simple and compact. In one evening with the official documentation you will learn to write and debug them.
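For illustration, here is what a module in that spirit might look like. The argument names are made up, but AnsibleModule, check mode and exit_json are the real API:

```python
#!/usr/bin/python
# A sketch in the spirit of the module described above; arguments are illustrative.
import os
from ansible.module_utils.basic import AnsibleModule

def main():
    module = AnsibleModule(
        argument_spec=dict(
            version=dict(type="str", required=True),
            releases_dir=dict(type="str", default="/srv/wrike/releases"),
            link=dict(type="str", default="/srv/wrike/current"),
        ),
        supports_check_mode=True,
    )
    target = os.path.join(module.params["releases_dir"], module.params["version"])
    link = module.params["link"]
    if not os.path.isdir(target):
        module.fail_json(msg=f"version not present: {target}")
    changed = not (os.path.islink(link) and os.readlink(link) == target)
    if changed and not module.check_mode:
        tmp = link + ".tmp"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(target, tmp)   # the same atomic swap as above
        os.replace(tmp, link)
    module.exit_json(changed=changed, version=module.params["version"])

if __name__ == "__main__":
    main()
```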
A bit about Docker. Its role in 2015 was extremely modest. We had a huge, long-suffering builder that spawned processes, sometimes forgot about them, and they ate gigabytes of memory. We solved this with isolation in Docker, because we were very short on time and could not fix the root cause. When the container terminated, our resources were released, and everything was fine.
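The fix boiled down to running the builder in a throwaway container, roughly like this (the image name and build command are hypothetical):

```python
import subprocess

def run_build(repo_dir):
    """Run the leaky builder in a disposable container: --rm guarantees that
    every spawned process and its memory die together with the container."""
    subprocess.run([
        "docker", "run", "--rm",
        "--memory", "4g",                  # hard cap instead of hunting the leak
        "-v", f"{repo_dir}:/workspace",
        "-w", "/workspace",
        "builder-image",                   # hypothetical image with the toolchain
        "make", "build",
    ], check=True)
```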
Stage 3: lessons
What is special about our third version? Remember I was talking about a notebook and a pencil? We had one week, and we realized we could not possibly code all of this in a week, even without sleep or food. So we spent a third of the time thinking through the engineering solution with symlinks, RPM and atomicity. We agreed with the sysops, the people who operate our solution in production. We agreed with the developers. We wrote an actual engineering spec: how it would be assembled, how it would be laid out.
Our colleagues later wrote a Dart library, an abstract project builder; it now builds about 60 projects. The decision proved right: we have since rewritten the implementation while barely touching the spec. The main idea: if you are changing a lot at once, think through the specs and contracts with other departments. The implementation can be sacrificed at first.
An important point: don't be afraid to defend your point of view and to combine the solutions that give you the product properties you need. It's fashionable now: if you're not on Docker, you're nobody. And RPM seems like some archaic old system nobody needs. In fact, it has some very cool features that many other systems lack. I strongly urge you to think with your own head and always choose what you actually need, rather than follow someone else's opinion.
Stage 4
Remember I talked about Dart and libraries? We created dozens of applications, which are essentially consumers of library code. That triggered a wave of library creation. Say we have 10 front-end applications; every one of them needs a data layer. So a DAL (Data Access Layer) library appeared, an abstraction for working with data; then a component library; and a huge number of other libraries. Dozens. The task became composing all this code.
For example, a colleague rolls out a feature. He edits three libraries plus the consumer code, some application where it all gets embedded. What happens? He fixes the code of the first library and goes through the merge request procedure; moves to the second, edits it, goes through the merge request procedure; moves to the third, same thing. Then he edits the feature itself, updating the references to those libraries, and goes through a merge request once more. This is very long, redundant and heavy on communication.
On the one hand, we helped colleagues reuse code and test product ideas quickly. On the other, we placed a gigantic burden on their shoulders. The procedure had become modern (everything in GitLab, via merge requests), but it was wildly complicated, unusual and inconvenient. The fourth version arose because we had multiplied the number of libraries and the reuse of code, but we needed to do it efficiently. Some kind of answer to that need was required.
We also have MVPs. You test a business idea and then make a decision. Say we achieved the desired product properties; now we can slow down. We don't know what to do with this product part: maybe we will need it in a year, maybe never. We chose between a single repository and multiple repositories, and decided as follows: we gave each library and application its own repository, with its own readme, its own changelog, its own versioning and its own tests.
Thanks to this, our applications refer to specific, pinned versions of libraries. If a component is frozen, it naturally depends on old library versions, yet a year later it can still be patched up: a security fix can be applied without upgrading to new library versions. And if the team decides to keep developing it a year later, they simply bump to the new versions.
Had we stayed in a mono-repository and said, "Folks, we want to guarantee compatibility end to end," we would have dragged along a huge baggage of applications whose development had stalled or died, and we would have had to guarantee their compatibility. That would be very expensive. Besides, with the help of Git submodules and subtrees you can assemble a mono-repository out of multiple repositories if the task requires it.
To implement our decision, we needed smart dependency management. We wrote it in Python, with a persistent layer in Postgres and a UI in Angular.
The developer simply declared: I am writing a feature in which I edit three libraries and two consumers. Our application then allowed all of it to be rolled into RC as one atomic unit. The bot itself walked through the repositories, merged everything and updated the references. And if the feature was rolled back, everything returned to its original state. If the tester confirmed it worked, he gave the green light, and the feature went to RC. Now we want all of this to happen on green tests alone, without any human involvement.
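A sketch of the core of that integration, with hypothetical repository names; the real service also updated the pinned library references in the consumers:

```python
import subprocess

def run(repo, *cmd):
    subprocess.run(cmd, cwd=repo, check=True)

def integrate(branch, repos):
    """Merge one feature branch across every repository it touches, as one unit.
    If anything fails, every repo is rolled back to its original state."""
    merged = []
    try:
        for repo in repos:
            run(repo, "git", "fetch", "origin")
            run(repo, "git", "checkout", "master")
            run(repo, "git", "merge", "--no-ff", f"origin/{branch}")
            merged.append(repo)
    except subprocess.CalledProcessError:
        for repo in merged:   # undo the merges already made
            run(repo, "git", "reset", "--hard", "origin/master")
        raise

# a hypothetical feature touching three libraries and two consumers
integrate("feature/new-inbox",
          ["dal", "components", "charts", "inbox-app", "tasks-app"])
```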
At some point we outgrew Wrike's capabilities as an issue tracker. We needed our own UI, and the guys built it in Angular.

But this is not the end. Wrike keeps growing, both the team and the product, and we face new challenges. In the browser, Wrike is something of a monolith, which from the DevOps point of view is no joy at all. The next big step for us is to decompose our solution into many applications. This will speed up integration and let us roll out even more features. We have also invested heavily in automated testing: we moved our Selenium farm to Google Cloud, which lets us run the entire regression suite in 20 minutes and for $1.
General conclusions
I believe it is critically important to hire the right people, to invest in tools, and to work with feedback from developers and the business.
Many decisions about how to develop the system at a given stage, evolutionarily or through revolution, were made purely on feedback. A DevOps department in a product company is lucky: its consumers are its colleagues, engineers it can communicate with easily and whom it must be able to hear and listen to.
And the last point: you don't need to build a spaceship for every task right away. Solve problems iteratively. I strongly recommend against inventing very long-lived and expensive solutions up front. We all know about the internal services of large companies, or about some cool open-source projects, and at the start we can say: "I know how to solve this problem; we just need to write this, this and this." And that takes a year or a year and a half. The fact that we did not copy someone else's solutions but built tools that met our current needs let us keep up the pace and save money.