Citymobil — a manual for improving availability amid business growth for startups. Part 2

    This is the second article in the series «Citymobil — a manual for improving availability amid business growth for startups». You can read the first part here. Let’s continue talking about how we managed to improve the availability of Citymobil services. In the first article, we learned how to count lost trips. Okay, we are counting them. What now? Now that we are equipped with an understandable tool for measuring lost trips, we can move on to the most interesting part: how do we decrease the losses without slowing down our current growth? Since it seemed to us that the lion’s share of technical problems causing lost trips had something to do with the backend, we decided to turn our attention to the backend development process first. Jumping ahead, I’ll say that we were right: the backend became the main battlefield in the fight for lost trips.

    1. How the development process works

    Problems are usually caused by code deployment and other manual actions. Services that are never changed or touched by hand sometimes malfunction too; however, that’s the exception that proves the rule.

    In my experience, the most interesting and unusual exception was the following. Way back in 2006, when I worked at a small webmail service, there was a gateway that proxied all the traffic and made sure our IP addresses weren’t on any blacklists. The service ran on FreeBSD and it worked well. But one day it just stopped working. Guess why? The disk in that machine had failed (bad blocks had been forming for a while until the inevitable happened), and that was three years before the service failure. Everything kept running on the failed disk. And then FreeBSD, for reasons known only to itself, suddenly decided to access the failed disk and halted as a result.

    When I was a child of 10-12, I went hiking in the woods with my dad and heard a phrase from him that I never forgot: «all you need to do to keep the bonfire burning is not to touch it». I believe most of us can remember a situation when we fed some wood to an already burning fire and it went out for no apparent reason.

    The bottom line is that problems are created by manual human actions: for example, when you feed wood to an already well-burning bonfire, thus cutting off the oxygen and killing the fire, or when you deploy code with bugs into production. Therefore, in order to understand what causes service issues, we need to understand how the deployment and development processes work.

    In Citymobil the process was fully oriented toward fast development and organized in the following way:

    • 20-30 releases per day.
    • Developers perform deployment by themselves.
    • Quick testing in test environment by the developer.
    • Minimum automated/unit tests, minimum reviewing.

    The developers worked in rough conditions without QA support and with an enormous flow of very important tasks and experiments from the product team; they worked as intently and consistently as they could, they solved hard tasks in a simple way, they made sure the code didn’t turn into spaghetti, they understood the business domain, treated changes responsibly and quickly rolled back what didn’t work. There’s nothing new here. The situation was similar at Mail.Ru 8 years ago when I started working there. We got Mail.Ru Cloud up and running quickly and simply, with no prelude. We’d be changing our process down the road to achieve better availability.

    I bet you’ve noticed it yourself: when no holds are barred, when it’s just you and production, when you’re carrying a heavy burden of responsibility, you do wonders. I’ve had an experience like that. A long time ago I was pretty much the only developer at the Newmail.Ru webmail service (it was acquired a while ago and then shut down); I performed deployments by myself and conducted production testing on myself via if (!strcmp(username, "danikin")) { … some code… }. So, I was familiar with this situation.

    I wouldn’t be surprised to find out that such a «quick and dirty» approach has been utilized by many startups, both successful and not, all driven by one passion: the desire for rapid business growth and market share.

    Why did Citymobil have such a process? There were very few developers to begin with. They had been working for the company for a while and knew the code and the business very well. The process worked ideally under those conditions.

    2. Why did the availability threat emerge?

    Growing investment in product development made our product plans more aggressive, and we started to hire more developers. The number of deployments per day was increasing, but their quality naturally decreased, since the new hires had to dive into the system and the business in field conditions. The increase in the number of developers resulted in not just a linear but a quadratic drop in availability (the number of deployments was growing linearly, and the quality of an average deployment was dropping linearly, so «linear» * «linear» = «quadratic»).
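    The quadratic effect is easy to illustrate with a toy model. All numbers below are made-up assumptions for illustration, not real Citymobil figures:

```python
# Toy model of the «linear * linear = quadratic» argument:
# deployments per day grow linearly with team size, and the
# per-deployment failure rate also grows linearly (new hires,
# less context), so expected incidents grow quadratically.
def incidents_per_day(team_size, deploys_per_dev=3, base_fail_rate=0.01):
    deploys = deploys_per_dev * team_size    # grows linearly with the team
    fail_rate = base_fail_rate * team_size   # grows linearly with the team
    return deploys * fail_rate               # grows quadratically

# Doubling the team roughly quadruples the expected incidents per day.
growth = [incidents_per_day(n) for n in (5, 10, 20)]
```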

    Obviously, we couldn’t keep going that way. The process just wasn’t built for these new conditions. However, we had to modify it without compromising time-to-market; that is, keeping 20-30 releases per day (and expecting their number to grow along with the team). We were growing rapidly, we conducted many experiments, promptly evaluated the results and conducted new ones. We quickly tested product and business hypotheses, learned from them, made new hypotheses that we promptly tested again, and so on. Under no circumstances would we slow down. Moreover, we wanted to speed up and hire developers faster. So, our actions aimed at business growth created an availability threat, but we had absolutely no intention of modifying those actions.

    3. Ok, the task is set, the process is clear. What’s next?

    Having had the experience of working on the Mail.Ru email service and Mail.Ru Cloud, where at some point availability had been made the number one priority, where deployments took place once a week, where everything was covered by automated and unit tests and the code was reviewed at least once, and sometimes even three times, I faced a totally different situation.

    You’d think everything was quite simple: we could replicate the Mail.Ru email/cloud process at Citymobil and thus increase service availability. However, as they say, the devil is in the details:

    1. the deployments in Mail.Ru email/cloud are conducted once a week, not 30 times a day, and at Citymobil we didn’t want to sacrifice release quantity;
    2. in Mail.Ru email/cloud the code is covered by automated/unit tests, and at Citymobil we had neither the time nor the resources for that; we hurled all our backend development effort into testing hypotheses and improving the product.

    That said, we were short-handed in terms of backend developers, even though they were being hired promptly (a special thanks to Citymobil recruiters — the best recruiters in the world! I think there’s going to be a separate article about our recruitment process), so there was no way we could address testing and reviewing issues without slowing down.

    4. When you don’t know what to do, learn from mistakes

    So, what was the magical thing we did at Citymobil? We decided to learn from mistakes. The learn-from-mistakes method of service improvement is as old as time. If the system works well, that’s good. If the system doesn’t work well, that’s also good, since we can learn from the mistakes. Easier said than done… Actually, it can be done easily, too. The key is to set a goal.

    How did we learn? First, we started to religiously write down information on every single outage, big and small. To be honest, I really didn’t feel like doing that at first, as I was hoping for a miracle and thought that the outages would just stop by themselves. Obviously, nothing stopped. The new reality mercilessly demanded changes.

    We started logging all the outages in a Google Docs table. For every outage we recorded the following short information:

    • date, time, duration;
    • the root cause;
    • what was done to fix the problem;
    • business impact (number of lost trips, other outcomes);
    • takeaways.
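    Such a log is easy to keep machine-readable as well. Below is a minimal Python sketch of appending one outage record to a CSV file; the field names and example values are illustrative assumptions, not Citymobil’s actual schema:

```python
import csv
import io

# Hypothetical schema mirroring the fields of the outage table above.
OUTAGE_FIELDS = ["date", "time", "duration_min", "root_cause",
                 "fix", "lost_trips", "takeaways"]

def log_outage(stream, outage):
    """Append a single outage record; write the header only for a new file."""
    writer = csv.DictWriter(stream, fieldnames=OUTAGE_FIELDS)
    if stream.tell() == 0:
        writer.writeheader()
    writer.writerow(outage)

# Usage: in production this would be a real file opened with open(path, "a").
buf = io.StringIO()
log_outage(buf, {
    "date": "2018-09-14", "time": "12:03", "duration_min": 25,
    "root_cause": "debug print overloaded the database",
    "fix": "rolled the deployment back",
    "lost_trips": 120,  # made-up number for illustration
    "takeaways": "write debug output to a file, not the database",
})
```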

    For every big outage, we would create a separate big file with a detailed minute-by-minute description from the moment the outage began till the moment it ended: what we did, what decisions were made. This is usually called a post-mortem. We would add links to these post-mortems to the general table.

    There was one reason for creating such a file: to come up with conclusions that would help decrease the number of lost trips. It was very important to be very specific about what «the root cause» is and what «the takeaways» are. The meaning of these words seems clear; however, everyone can understand them differently.

    5. Example of an outage we’ve learned from

    The root cause is an issue that needs to be fixed in order to avoid such accidents in the future. The takeaways are the ways to eliminate the root cause or to reduce the likelihood of its recurrence.
    The root cause is always deeper than it seems. The takeaways are always more complicated than they seem. You should never be satisfied with a supposedly found root cause and never be satisfied with alleged conclusions, so that you don’t relax and stop at what seems to be right. This dissatisfaction creates a spark for further analysis.

    Let me give you a real-world example: we deployed code, everything went down, we rolled it back, everything worked again. What’s the root cause of the problem? You’d say: the deployment. If we hadn’t deployed the code, there wouldn’t have been an accident. So, what’s the takeaway: no more deployments? That’s not a very good takeaway. So, most likely, that wasn’t the root cause; we need to dig deeper. A deployment with a bug. Is that the root cause? Alright. How do we fix it? You’d say by testing. What kind of testing? For instance, a full regression test of all functionality. That’s a good takeaway; let’s remember it. But we need to increase availability here and now, before we implement full regression testing. We need to dig even deeper. A deployment with a bug that printed debug output into a database table; we overloaded the database, and it went down under the load. That sounds better. Now it becomes clear that even a full regression test won’t save us from this issue, since the test database won’t see a workload similar to the production one.

    What’s the root cause of this problem, if we dig even deeper? We had to talk to the engineers to find out. It turned out that the engineer was used to the database being able to handle any workload. However, due to the rapid growth of the workload, the database could no longer handle what it had handled the day before. Very few of us have had a chance to work on projects with a 50% monthly growth rate. For me, for instance, this was the first project like that. Having plunged into a project like that, you begin to comprehend new realities. You’ll never know it’s out there until you come across it.

    The engineer came up with the correct way to fix it: debug output should be written to a file, and a cron script should load that file into the database in a single thread. If there’s too much debug printing, the database won’t go down; the debug data will simply show up a bit later. This engineer has obviously learned from his mistake and won’t make it again. But other engineers should also know about it. How? They need to be told. How do we make them listen? By telling them the whole story from beginning to end, by laying out the consequences and proposing the correct way of doing it; and also by listening to and answering their questions.
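    The article doesn’t show the actual implementation, but the idea can be sketched in a few lines. Here is a minimal, hypothetical Python version, with SQLite standing in for the production database; the file name and table schema are assumptions:

```python
import json
import sqlite3
import tempfile

def debug_print(path, message):
    """Append a debug record to a local file. File appends are cheap
    and cannot overload the database, no matter how much is printed."""
    with open(path, "a") as f:
        f.write(json.dumps({"msg": message}) + "\n")

def flush_debug_log(path, conn):
    """Load accumulated records into the database in a single thread.
    Meant to be run periodically (e.g. from cron), so the database
    write rate stays bounded; debug data just shows up a bit later."""
    with open(path) as f:
        rows = [(json.loads(line)["msg"],) for line in f if line.strip()]
    conn.executemany("INSERT INTO debug_log (msg) VALUES (?)", rows)
    conn.commit()
    open(path, "w").close()  # truncate: the records are persisted now

# Usage with an in-memory database standing in for production:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debug_log (msg TEXT)")
with tempfile.NamedTemporaryFile(mode="w", delete=False) as tmp:
    log_path = tmp.name
debug_print(log_path, "driver matching took 3.2s")
debug_print(log_path, "retrying payment gateway")
flush_debug_log(log_path, conn)
count = conn.execute("SELECT COUNT(*) FROM debug_log").fetchone()[0]
```

    The point of the single-threaded loader is back-pressure: however fast debug records pile up in the file, the database only ever sees one bounded stream of inserts.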

    6. What else can we learn from this mistake, or «do’s & don’ts»

    Ok, let’s keep analyzing this outage. The company is growing rapidly and new engineers are coming in. How are they going to learn from this mistake? Should we tell every new engineer about it? Obviously, there’ll be more and more mistakes; how do we make everyone learn from them? The answer is almost obvious: create a do’s and don’ts file. We write all the takeaways into this file. We show it to all our new engineers, and we also announce it to all current engineers in the work group chat every time the do’s & don’ts file is updated, strongly urging everyone to read it again (to brush up on the old information and see the new).

    You might say that not everyone will read it carefully. You might say that the majority will forget it right after reading. And you’d be right on both counts. However, you can’t deny that something will stick in someone’s head. And that’s good enough. In Citymobil’s experience, the engineers take this file very seriously, and situations when some lesson was forgotten occur very rarely. The very fact that a lesson was forgotten can be seen as a problem in itself: we should draw a conclusion and analyze the details to figure out how to change something in the future. This kind of digging leads to more precise and accurate wording in the do’s and don’ts.

    The takeaway from the above-described outage: create a do’s and don’ts file; write everything we’ve learned into it, show the file to the whole team, ask every newcomer to study it and encourage people to ask questions.

    General advice that we derived from the outage review: we shouldn’t use the phrase «shit happens». As soon as you say it out loud, everyone decides that nothing needs to be done and no conclusions are necessary, since humans have always made mistakes, are making mistakes now and will keep making them in the future. Therefore, instead of saying that phrase, you should draw a specific conclusion. A conclusion may be just a small step, but it is still a step toward improving the development process, the monitoring systems and the automated tools. Such small steps add up to a more stable service!

    7. In lieu of an epilogue

    In the next parts, I’m going to talk about the types of outages in Citymobil’s experience and go into detail about every outage type; I’ll also tell you about the conclusions we drew from the outages, how we modified the development process and what automation we introduced. Stay tuned!
