Citymobil — a manual for improving availability amid business growth for startups. Part 3
This is the next article in the series describing how we're increasing service availability at Citymobil (you can read the previous parts here and here). In further parts, I'll talk about accidents and outages in detail. But first let me highlight something I should've covered in the first article but didn't. I learned about it from my readers' feedback. This article gives me a chance to fix that annoying shortcoming.
One reader asked me a very fair question: «What's so complicated about the backend of a ride-hailing service?» That's a good question. Last summer, before starting to work at Citymobil, I asked myself that very question. I was thinking: «that's just a taxi service with a three-button app». How hard could that be? It turned out to be a very high-tech product. To give some idea of what I'm talking about and what a huge technological thing it is, I'll describe a few product directions at Citymobil:
- Pricing. Our pricing team deals with the problem of determining the best ride price at every point and every moment in time. The price is determined by predicting the supply-and-demand balance based on statistics and other data. It's all done by a complex and constantly evolving machine-learning-based service. The pricing team also handles the implementation of various payment methods, extra charges upon trip completion, chargebacks, billing, and interaction with partners and drivers.
- Order dispatching. Which car gets the client's order? Simply choosing the closest vehicle, for example, isn't the best option for maximizing the number of trips. A better one is to match cars and clients so as to maximize the total number of trips, taking into account the probability that this specific client will cancel the order under these specific circumstances (because the wait is too long) and the probability that this specific driver will cancel or sabotage the order (e.g. because the distance is too long or the price is too low).
- Geo. Everything related to address search and suggestions, pickup points, adjusting the estimated time of arrival (our map suppliers don't always provide accurate, traffic-aware ETA information), improving the accuracy of forward and reverse geocoding, and improving the accuracy of the car arrival point. There's lots of data, lots of analytics, and lots of machine-learning-based services.
- Antifraud. The difference in trip cost for a passenger and a driver (for instance, on short trips) creates an economic incentive for intruders trying to steal our money. Dealing with fraud is somewhat similar to dealing with email spam: both precision and recall matter. We need to block the maximum number of fraudsters (recall) while not mistaking good users for fraudsters (precision).
- Driver incentives. This team develops everything that can increase drivers' usage of our platform and their loyalty through different kinds of incentives. For example: complete X trips and get extra Y money. Or buy a shift for Z and drive commission-free.
- Driver app backend. List of orders, demand map (it shows a driver where to go to maximize her profits), status changes, system of communication with the drivers and lots of other stuff.
- Client app backend (this is probably the most obvious part and what people usually mean by «taxi backend»): order placement, order status information, showing the little cars moving on the map, the tips backend, etc.
This is just the tip of the iceberg. There’s much more functionality. There’s a huge underwater part of the iceberg behind what seems to be a pretty simple interface.
And now let's go back to accidents. Six months of logging our accident history resulted in the following classification:
- bad release: 500 internal server errors;
- bad release: database overload;
- unfortunate manual system operation interaction;
- Easter eggs;
- external reasons;
- bad release: broken functionality.
Below I’ll go in detail about the conclusions we’ve drawn regarding our most common accident types.
2. Bad release: 500 internal server errors
Our backend is mostly written in PHP, a weakly typed interpreted language. We'd release code that crashed due to an error in a class or function name. And that's just one way a 500 error can occur. It can also be caused by a logical error in the code; releasing the wrong branch; deleting a folder with code by mistake; leaving temporary artifacts needed for testing in the code; not altering the table structure to match the code; or not restarting or stopping the necessary cron scripts.
We addressed this issue gradually, in stages. The trips lost due to a bad release are obviously proportional to the time it spends in production. Therefore, we should do our best to minimize a bad release's in-production time. Any change in the development process that reduces the average lifetime of a bad release even by 1 second is good for the business and must be implemented.
Bad release and, in fact, any accident in production has two states that we named «a passive stage» and «an active stage». During the passive stage we aren’t aware of an accident yet. The active stage means that we already know. An accident starts in the passive stage; in time it goes into the active stage — that’s when we find out about it and start to address it: first we diagnose it and then — fix it.
To reduce the duration of any outage, we need to reduce the duration of its active and passive stages. The same goes for a bad release, since it's a kind of outage.
We started analyzing our outage troubleshooting history. The bad releases we experienced when we first began analyzing accidents caused an average of 20-25 minutes of downtime (complete or partial). The passive stage would usually take 15 minutes, and the active one 10 minutes. During the passive stage we'd receive user complaints that were processed by our call center; after some threshold, the call center would complain in a Slack chat. Sometimes one of our colleagues would complain about not being able to get a taxi. A colleague's complaint signaled a serious problem. Once a bad release entered the active stage, we'd begin diagnostics, analyzing recent releases and various graphs and logs to find the cause of the accident. Having determined the cause, we'd roll back if the bad release was the latest one, or perform a new deployment with the offending commit reverted.
This is the bad release handling process we set out to improve.
Passive stage: 15 minutes.
Active stage: 10 minutes.
3. Passive stage reduction
First of all, we noticed that if a bad release was accompanied by 500 errors, we could tell a problem had occurred even without user complaints. Luckily, all 500 errors were logged in New Relic (one of the monitoring systems we use), so all we had to do was add SMS and IVR notifications that fired when the number of 500 errors exceeded a threshold. That threshold was continuously lowered as time went on.
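As a rough illustration, such an alert can be modeled as a sliding-window counter. This is a hypothetical sketch: `ErrorRateAlert`, the threshold, and the window size are all illustrative, not our actual New Relic configuration:

```python
import time
from collections import deque


class ErrorRateAlert:
    """Fires when the number of HTTP 500s observed within a sliding
    window exceeds a threshold -- a sketch of the alerting rule
    described above (names and numbers are illustrative)."""

    def __init__(self, threshold=50, window_seconds=60):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()  # timestamps of observed 500s

    def record_500(self, now=None):
        """Record one 500 error; return True if SMS/IVR should fire."""
        now = time.time() if now is None else now
        self.events.append(now)
        # Drop events that have fallen out of the sliding window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) > self.threshold
```

Lowering `threshold` over time, as we did, makes the pager progressively more sensitive to bad releases.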
During an accident, the process would look like this:
- An engineer deploys a release.
- The release leads to an accident (massive amount of 500s).
- Text message is received.
- Engineers and devops start looking into it. Sometimes not right away but within 2-3 minutes: the text message could be delayed, the phone might be on silent; and of course, the habit of reacting immediately to this text can't be formed overnight.
- The accident active stage begins and lasts the same 10 minutes as before.
As a result, the active stage of the «bad release: 500 internal server errors» accident type would begin 3 minutes after a release. So the passive stage was reduced from 15 minutes to 3.
Passive stage: 3 minutes.
Active stage: 10 minutes.
4. Further reduction of a passive stage
Even though the passive stage had been reduced to 3 minutes, it still bothered us more than the active one: during the active stage we were at least doing something to fix the problem, while during the passive stage the service was totally or partially down and we were absolutely clueless.
To further reduce the passive stage, we decided to sacrifice 3 minutes of our engineers' time after each release. The idea was very simple: we'd deploy code and then watch New Relic, Sentry, and Kibana for 500 errors for three minutes. As soon as we saw an issue there, we'd assume it was code-related and begin troubleshooting.
We chose this three-minute period based on statistics: sometimes the issues appeared in graphs within 1-2 minutes, but never later than in 3 minutes.
This rule was added to the do's and don'ts. At first it wasn't always followed, but over time our engineers got used to it like basic hygiene: brushing your teeth in the morning takes some time too, but it's still necessary.
As a result, the passive stage was reduced to 1 minute (the graphs still sometimes lagged). As a nice bonus, it also reduced the active stage, because now an engineer would face the problem prepared and be ready to roll her code back right away. That didn't always help, since the problem could have been caused by a release someone else deployed at the same time. Still, the active stage was reduced to five minutes on average.
Passive stage: 1 minute.
Active stage: 5 minutes.
5. Further reduction of an active stage
We became more or less satisfied with the 1-minute passive stage and started thinking about how to further reduce the active stage. First of all, we focused on the history of outages (which turned out to be a cornerstone of building our availability!) and found that in most cases we didn't roll a release back right away because we didn't know which version to roll back to: there were many parallel releases. To solve this problem, we introduced the following rule (and wrote it into the do's and don'ts): right before a release, you notify everyone in a Slack chat about what you're about to deploy and why; in case of an accident, you write: «Accident, don't deploy!» We also started notifying those who don't read the chat about releases via SMS.
This simple rule drastically lowered the number of releases during ongoing accidents, decreased the duration of troubleshooting, and reduced the active stage from 5 minutes to 3.
Passive stage: 1 minute.
Active stage: 3 minutes.
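The announcement rule above is easy to automate with a script that posts to a Slack incoming webhook. This is a hedged sketch: the webhook URL is a placeholder, and `build_release_message` / `announce_release` and the message format are my assumptions, not our actual tooling:

```python
import json
import urllib.request

# Hypothetical webhook URL -- each Slack workspace generates its own.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"


def build_release_message(author, description):
    """Payload for a Slack incoming webhook announcing a deploy.
    The format is illustrative; the rule only requires announcing
    who is deploying what and why."""
    return {"text": f":rocket: {author} is deploying: {description}"}


def announce_release(author, description, dry_run=True):
    """Post the announcement. dry_run=True keeps this sketch
    side-effect free and just returns the payload."""
    payload = build_release_message(author, description)
    if dry_run:
        return payload
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Wiring such a script into the CI/CD pipeline removes the human step of remembering to post before deploying.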
6. Even bigger reduction of an active stage
Even though we posted warnings in the chat about all releases and accidents, race conditions still sometimes occurred: someone posted about a release while another engineer was deploying at that very moment; or an accident occurred and we wrote about it in the chat, but someone had just deployed her code. Such situations prolonged troubleshooting. To solve this, we implemented an automatic ban on parallel releases. The idea was very simple: for 5 minutes after every release, the CI/CD system forbids deployment for anyone but the latest release's author (so she can roll back or deploy a hotfix if needed) and a few well-experienced developers (in case of emergency). Moreover, the CI/CD system blocks deployments during accidents, from the moment the accident-start notification arrives until the accident-end notification.
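The gating logic described above could look roughly like this. It's a sketch under assumptions: `DeployGate` and its API are illustrative, not our actual CI/CD code; only the 5-minute cool-down and the exception list mirror the text:

```python
import time


class DeployGate:
    """Sketch of the parallel-release ban: after a deploy, block
    everyone except the last release's author (and a few senior
    engineers) for a cool-down window, and block all deploys while
    an accident is open."""

    COOLDOWN = 5 * 60  # seconds after a deploy during which others wait

    def __init__(self, senior_engineers=()):
        self.senior = set(senior_engineers)
        self.last_author = None
        self.last_deploy_at = None
        self.accident_open = False  # toggled by accident notifications

    def may_deploy(self, engineer, now=None):
        now = time.time() if now is None else now
        if self.accident_open:
            return False  # «Accident, don't deploy!»
        if self.last_deploy_at is not None and now - self.last_deploy_at < self.COOLDOWN:
            # Cool-down: only the author (rollback/hotfix) or a senior.
            return engineer == self.last_author or engineer in self.senior
        return True

    def record_deploy(self, engineer, now=None):
        self.last_author = engineer
        self.last_deploy_at = time.time() if now is None else now
```

The CI/CD system would call `may_deploy` before starting any deployment and refuse the job when it returns `False`.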
So our process started looking like this: an engineer deploys a release and monitors the graphs for three minutes, and after that no one can deploy anything for another two minutes. If a problem occurs, the engineer rolls the release back. This rule drastically simplified troubleshooting, and the total duration of the active and passive stages dropped from 3+1=4 minutes to 1+1=2 minutes.
But even a two-minute accident was too much. That’s why we kept working on our process optimization.
Passive stage: 1 minute.
Active stage: 1 minute.
7. Automatic accident determination and rollback
We'd been thinking for a while about how to reduce the duration of accidents caused by bad releases. We even tried forcing ourselves to watch `tail -f error_log | grep 500`. But in the end, we opted for a drastic automatic solution.
In a nutshell, it's automatic rollback. We set up a separate web server that received 10 times less traffic from the load balancer than the rest of our web servers. Every release would first be automatically deployed by the CI/CD system onto this separate server (we called it preprod, although despite its name it received real load from real users). Then a script would run `tail -f error_log | grep 500`. If no 500 error appeared within a minute, CI/CD would deploy the new release to production on the other web servers. If there were errors, the system rolled everything back. At the load balancer level, any request that got a 500 error on preprod was re-sent to one of the production web servers.
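The watch-and-promote loop could be sketched like this. Everything here is hypothetical: `deploy`, `rollback`, and `count_500s` stand in for the real CI/CD hooks and the error-log grep; only the control flow mirrors what's described above:

```python
import time


def canary_release(version, deploy, rollback, count_500s,
                   watch_seconds=60, poll=5, sleep=time.sleep):
    """Sketch of the preprod auto-rollback: deploy to the lightly
    loaded preprod server, watch its error log for 500s for a minute,
    then either promote to production or roll back.
    Returns True if the release was promoted."""
    deploy("preprod", version)
    waited = 0
    while waited < watch_seconds:
        if count_500s() > 0:        # any 500 on preprod => bad release
            rollback("preprod")
            return False            # it never reaches production
        sleep(poll)
        waited += poll
    deploy("production", version)   # a clean minute => promote
    return True
```

In the real setup the balancer also retried preprod's failed requests against production servers, so even the canary traffic saw almost no user-visible errors.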
This measure reduced the impact of releases with 500 errors to zero. That said, to guard against bugs in the automation, we didn't abolish our three-minute graph watch rule. That's all for bad releases and 500 errors. Let's move on to the next type of accident.
Passive stage: 0 minutes.
Active stage: 0 minutes.
In further parts, I'm going to talk about other types of outages in Citymobil's experience, going into detail about each outage type; I'll also tell you about the conclusions we drew, how we modified the development process, and what automation we introduced. Stay tuned!