Citymobil - a guide for startups to increase stability amid growth. Part 2. What are the types of accidents?



    This is the second article in a series about how we at Citymobil increased the stability of the service (you can read the first here). In this article, I will go into the specifics of analyzing accidents. But before that, I will cover one point that I should have thought about in advance and covered in the first article, but did not. I learned about it from readers' feedback. This second article gives me a chance to fix that annoying omission.

    0. Prologue


    One of the readers asked a very fair question: “What is difficult about the backend of a taxi service?” The question is a good one. I asked myself the same thing last summer before starting to work at Citymobil. I thought: “Big deal, a taxi, an app with three buttons.” What could be complicated about it? But it turned out to be a very high-tech service and a complex product. To make it at least somewhat clear what it is about and why it really is a big technological colossus, I will describe several areas of Citymobil's product work:

    • Pricing. The pricing team works on the price at every point and at every moment in time. The price is determined by predicting the balance of supply and demand based on statistics and other data. All this forms a large, complex and constantly evolving service built on machine learning.
    • Payments. Implementation of various payment methods, the logic of surcharges after the trip, holding funds on bank cards, billing, and settlements with partners and drivers.
    • Order dispatching. Which car should an incoming order be assigned to? For example, assigning it to the closest car is not the best option for maximizing the number of trips. A better option is to match customers and cars so as to maximize the number of trips, taking into account the probability that this particular client will cancel under these conditions (because the wait is too long) and the probability that this particular driver will cancel or sabotage the order (because the pickup is too far away or the fare is too low).
    • Geo. Everything related to address search and suggestions, pickup points, ETA adjustment (our partners who supply maps and traffic data do not always give an accurate ETA that accounts for traffic), improving the accuracy of forward and reverse geocoding, and improving the accuracy of car arrival. There is a lot of work with data, a lot of analytics, and many services built on machine learning.
    • Antifraud. The difference between the trip price for the passenger and for the driver (for example, on short trips) creates an economic incentive for fraudsters who try to steal our money. Fighting fraud is somewhat similar to fighting spam in an email service: both recall and precision matter. We need to block as many fraudsters as possible (recall) without ever mistaking good users for fraudsters (precision).
    • Driver motivation. The driver motivation team develops everything related to increasing drivers' use of our platform and their loyalty through various kinds of incentives. For example, make X trips and get an extra Y rubles. Or buy a shift for Z rubles and drive without commission.
    • Driver app backend. The order list, a demand map (a hint for the driver on where to go to maximize revenue), propagation of status changes, a system for communicating with drivers, and much more.
    • Client app backend (this is probably the most obvious part, and what is usually meant by a taxi backend): placing orders, pushing order status changes, showing the movement of cars on the map while the car is arriving and during the trip, the tips backend, and so on.

    All this is just the tip of the iceberg. There is far more functionality. Behind the user-friendly interface lies a huge underwater part of the iceberg.

    And now back to the accidents. Over six months of keeping an accident history, we arrived at the following categorization:

    • bad release, 500 errors;
    • bad release, suboptimal code, load on the database;
    • unsuccessful manual intervention in the system;
    • Easter egg;
    • external causes;
    • bad release, broken functionality.

    Below I describe the conclusions we drew for the most common types of accidents.

    1. Bad release, 500 errors


    Almost all of our backend is written in PHP, an interpreted language with weak typing. Sometimes you roll out code and it crashes because of a typo in a class or function name. And that is just one way a 500 error can appear. It can also appear because of a logical error in the code; because the wrong branch was rolled out; because the folder with the code was accidentally deleted; because temporary artifacts needed for testing were left in the code; because the table structure was not changed to match the new code; because the necessary cron scripts were not restarted or stopped.

    We tackled this problem in several successive stages. Trips lost due to a bad release are obviously proportional to how long it was in production. So we must do everything we can to keep a bad release in production for as little time as possible. Any change to the development process that shortens the average lifetime of a bad release in production by even one second is good for the business and should be implemented.

    A bad release or any production accident generally goes through two states, which we called the “passive stage” and “active stage”. The passive stage is when we are not yet aware of the accident. The active stage is when we are already in the know. The accident begins in the passive stage, and over time, when we find out about it, the accident goes into the active stage - we begin to fight it: first we diagnose and then repair it.

    To reduce the duration of any accident in production, it is necessary to reduce the average duration of both the passive and active stages. The same goes for a bad release, because it is in itself a kind of accident.

    We began by analyzing our existing process for handling accidents. The bad releases we encountered at the start of this analysis caused, on average, 20-25 minutes of full or partial downtime. The passive stage usually took 15 minutes, the active stage 10 minutes. During the passive stage, user complaints started coming into the contact center, and once they crossed a certain threshold the contact center complained in the general Slack chat. Sometimes an employee complained after failing to order a taxi; an employee complaint was a signal of a serious problem for us. Once a bad release entered the active stage, we began diagnosing the problem, going through recent releases, various graphs and logs to find the cause of the accident. Having found the cause, we rolled back the code if the bad release had been the most recent one rolled out.

    This was the process for dealing with bad releases that we had to improve.

    1.1. Passive stage reduction


    First of all, we noticed that if a bad release is accompanied by 500 errors, we can tell that a problem has occurred without waiting for complaints. Fortunately, all 500 errors were recorded in New Relic (one of the monitoring systems we use), and all that remained was to add SMS and IVR notifications triggered when the rate of “five hundreds” exceeds a certain threshold (the threshold was lowered over time).
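    To make the idea concrete, here is a toy sketch of such a threshold check, not our actual New Relic integration: a cron job that runs once a minute, counts new HTTP 500 responses in the nginx access log since its previous run, and pages someone when the count exceeds a threshold. The log path, the threshold and the sendSmsAlert() helper are made up for the example.

        <?php
        // Count new HTTP 500 responses in the access log since the last run
        // and alert when they exceed a per-minute threshold.

        const ACCESS_LOG  = '/var/log/nginx/access.log';
        const OFFSET_FILE = '/tmp/http500_watch.offset';
        const THRESHOLD   = 50; // allowed 500s per minute before we page someone

        function sendSmsAlert(string $message): void
        {
            // Placeholder: call your SMS/IVR provider here.
            error_log('[ALERT] ' . $message);
        }

        $offset = is_file(OFFSET_FILE) ? (int)file_get_contents(OFFSET_FILE) : 0;
        if (filesize(ACCESS_LOG) < $offset) {
            $offset = 0; // the log was rotated, start from the beginning
        }

        $log = fopen(ACCESS_LOG, 'r');
        fseek($log, $offset);

        $count500 = 0;
        while (($line = fgets($log)) !== false) {
            // In the default "combined" format the status code follows the quoted request.
            if (preg_match('/" (\d{3}) /', $line, $m) && $m[1] === '500') {
                $count500++;
            }
        }

        file_put_contents(OFFSET_FILE, (string)ftell($log));
        fclose($log);

        if ($count500 > THRESHOLD) {
            sendSmsAlert("HTTP 500 spike: {$count500} errors in the last minute");
        }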

    With these notifications in place, the active stage of a “bad release, 500 errors” accident began almost immediately after the release. The process during an accident started to look like this:

    1. The programmer deploys the code.
    2. The release leads to an accident (massive 500s).
    3. SMS arrives.
    4. Programmers and admins start investigating (sometimes not immediately but after 2-3 minutes: an SMS can be delayed, a phone can be on silent, and a culture of reacting to an SMS instantly does not appear in one day).
    5. The active phase of the accident begins, which lasts the same 10 minutes as before.

    Thus, the passive stage was reduced from 15 minutes to 3.

    1.2. Further reduction of the passive stage


    Despite reducing the passive stage to 3 minutes, even such a short passive stage bothered us more than the active one, because during the active stage we are already doing something to solve the problem, while during the passive stage the service is fully or partially down and nobody even knows about it.

    To further reduce the passive stage, we decided to sacrifice three minutes of developer time after each release. The idea was very simple: you roll out the code and for three minutes you watch New Relic, Sentry and Kibana for 500 errors. As soon as you see a problem there, you assume by default that it is related to your code and start investigating.

    We chose three minutes based on statistics: problems sometimes appeared on the charts with a delay of 1-2 minutes, but never more than three.

    This rule was written down in do's & dont's. At first it was not always followed, but gradually developers got used to it as basic hygiene: brushing your teeth in the morning also takes time, but you still do it.

    As a result, the passive stage was reduced to 1 minute (the charts were still sometimes late). As a pleasant surprise, this also reduced the active stage: the developer meets the problem fully alert and is ready to roll back their code immediately. It does not always help, because the problem may have been caused by someone else's code rolled out in parallel, but on average the active stage shrank to 5 minutes.

    1.3. Further reduction of the active stage


    More or less satisfied with a one-minute passive stage, we started thinking about further reducing the active stage. First of all, we turned to the history of problems (it is the cornerstone of our stability!) and found that in many cases we do not roll back immediately because we do not know which version to roll back to: there are many parallel releases. To solve this problem, we introduced the following rule (and recorded it in do's & dont's): before a release, you write in the Slack chat what you are rolling out and why, and in the event of an accident you write “accident, do not roll out!” in the chat. In addition, we began to automatically send SMS notifications about each release, for those who do not read the chat.

    This simple rule sharply reduced the number of releases made during accidents and shortened the active stage from 5 minutes to 3.

    1.4. An even greater reduction of the active stage


    Even though we warned in the chat about all releases and accidents, race conditions sometimes occurred: one person wrote about a release while another was already rolling out at that moment; or an accident started, it was announced in the chat, and someone had just rolled out new code. These situations lengthened the diagnosis. To solve this problem, we implemented an automatic ban on parallel releases. The idea is very simple: after each release, the CI/CD system forbids everyone from rolling out for the next 5 minutes, except the author of the last release (so that they can roll back or roll out a hotfix if necessary) and a few particularly experienced developers (for emergencies). In addition, the CI/CD system forbids rolling out during an accident, that is, from the moment the notification about the start of the accident arrives until the moment the notification about its end arrives.

    Thus, the process became as follows: the developer rolls out, watches the charts for three minutes, and after that nobody else can roll anything out for two more minutes. If there is a problem, the developer rolls back the release. This rule radically simplified diagnosis, and the total duration of the active and passive stages dropped from 3 + 1 = 4 minutes to 1 + 1 = 2 minutes.
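    A rough sketch of such a deploy gate is below. This is not our actual CI/CD code; the Redis key names, the TRUSTED list and the five-minute window are illustrative assumptions.

        <?php
        // Deploy gate: after every release only its author (and a few trusted
        // people) may deploy for the next 5 minutes, and nobody may deploy
        // while an accident is open.

        const RELEASE_WINDOW_SEC = 300;
        const TRUSTED = ['alice', 'bob']; // the "special forces" who may bypass the window

        function canDeploy(Redis $redis, string $user): bool
        {
            // A non-empty 'accident:open' key means an accident is in progress.
            if ($redis->exists('accident:open')) {
                return false;
            }

            $lastAuthor = $redis->get('deploy:last_author');
            $lastTime   = (int)$redis->get('deploy:last_time');

            $windowStillOpen = (time() - $lastTime) < RELEASE_WINDOW_SEC;
            if ($windowStillOpen && $lastAuthor !== false
                && $lastAuthor !== $user && !in_array($user, TRUSTED, true)) {
                return false; // someone else released less than 5 minutes ago
            }
            return true;
        }

        function registerDeploy(Redis $redis, string $user): void
        {
            $redis->set('deploy:last_author', $user);
            $redis->set('deploy:last_time', (string)time());
        }

        // Usage in the deploy script:
        $redis = new Redis();
        $redis->connect('127.0.0.1', 6379);
        if (!canDeploy($redis, 'carol')) {
            fwrite(STDERR, "Deploy blocked: wait for the release window or the accident to end\n");
            exit(1);
        }
        registerDeploy($redis, 'carol');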

    But two minutes of an accident is still a lot, so we continued optimizing the process.

    1.5. Automatic crash detection and rollback


    We spent a long time thinking about how to further reduce the duration of accidents caused by bad releases. We even tried forcing ourselves to watch tail -f error_log | grep 500. But in the end we settled on a radical automatic solution.

    In short, this is an auto-rollback. We set up a separate web server to which the balancer sent 10 times less traffic than to the other web servers. Each release was automatically deployed by the CI/CD system to this separate server (we called it preprod, although, despite the name, real load from real users went there). Then the automation ran tail -f error_log | grep 500. If no 500 errors appeared within one minute, the CI/CD system deployed the new code to production. If errors did appear, the system immediately rolled everything back. At the same time, at the balancer level, any request that ended with a 500 error on preprod was retried on one of the production servers.
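    Here is a rough sketch of that pipeline step, under stated assumptions: deployTo(), rollback(), notifyChat() and count500sSince() are hypothetical helpers (stubbed below) that would wrap real deploy tooling and log scanning, and the one-minute window is the one described above.

        <?php
        // Canary-style auto-rollback: deploy to preprod, watch for 500s for a
        // minute under real (reduced) traffic, then either promote or roll back.

        const CANARY_WINDOW_SEC = 60;

        function releaseWithCanary(string $revision): bool
        {
            $start = time();
            deployTo('preprod', $revision);    // preprod takes ~1/10 of real traffic

            sleep(CANARY_WINDOW_SEC);          // let real users hit the new code

            if (count500sSince('preprod', $start) > 0) {
                rollback('preprod');           // a bad release never reaches production
                notifyChat("Release {$revision} rolled back: 500s on preprod");
                return false;
            }

            deployTo('production', $revision); // no 500s within a minute, promote
            return true;
        }

        // Placeholder implementations so the sketch is self-contained; replace
        // them with real deploy tooling and log scanning.
        function deployTo(string $env, string $rev): void { error_log("deploy {$rev} to {$env}"); }
        function rollback(string $env): void { error_log("rollback {$env}"); }
        function notifyChat(string $msg): void { error_log($msg); }
        function count500sSince(string $env, int $ts): int { return 0; }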

    This measure reduced the impact of releases with 500 errors to zero. At the same time, in case of bugs in the automation itself, we kept the rule of watching the charts for three minutes. That is all about bad releases and 500 errors; let's move on to the next type of accident.

    2. Bad release, suboptimal code, database load


    I will start right away with a concrete example of an accident of this type. We rolled out an optimization: we added USE INDEX to an SQL query; during testing, this sped up short queries, just as it did in production, but it slowed down long queries. The slowdown of long queries was only noticed in production. As a result, a stream of long queries brought down the entire master database for an hour. We thoroughly figured out how USE INDEX works, described it in the do's & dont's file, and warned developers against misusing it. We also analyzed the query and realized that it returns mostly historical data, which means it can be run on a separate replica dedicated to historical queries. Even if that replica goes down under load, the business will not stop.

    After this incident, we still ran into similar problems, and at some point decided to approach the issue systematically. We went through the entire codebase with a fine-toothed comb and moved to replicas every query that could run there without compromising the quality of the service. At the same time, we divided the replicas themselves into criticality levels, so that the failure of any one of them would not stop the service. As a result, we arrived at an architecture with the following databases (a routing sketch follows the list):

    • master database (for write operations and for queries that are supercritical to data freshness);
    • production replica (for short queries that are slightly less critical to data freshness);
    • replica for calculating price coefficients, the so-called surge pricing. This replica can lag by 30-60 seconds; that is not critical, the coefficients do not change that often, and if this replica goes down the service will not stop, the prices will just not quite match the balance of supply and demand;
    • replica for the admin panel of business users and the contact center (if it goes down, the main business will not stop, but support will not work and we will temporarily be unable to view and change settings);
    • many replicas for analytics;
    • MPP database for heavy analytics with full slices according to historical data.
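    As a rough illustration of how routing queries by criticality can look in application code, here is a minimal sketch. The hostnames, credentials and table names are made up; the real configuration is of course different.

        <?php
        // Route each query to a connection that matches its criticality level.

        const DB_HOSTS = [
            'master'        => 'db-master.internal',            // writes + freshness-critical reads
            'prod_replica'  => 'db-replica-prod.internal',      // short, slightly less critical reads
            'surge_replica' => 'db-replica-surge.internal',     // may lag 30-60 s, used for surge pricing
            'admin_replica' => 'db-replica-admin.internal',     // admin panel / contact center
            'analytics'     => 'db-replica-analytics.internal', // heavy analytical queries
        ];

        function db(string $role): PDO
        {
            $dsn = 'mysql:host=' . DB_HOSTS[$role] . ';dbname=taxi;charset=utf8mb4';
            return new PDO($dsn, 'app', 'secret', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
        }

        // Usage: the surge-pricing job tolerates replica lag, so it never touches the master.
        $recentOrders = db('surge_replica')
            ->query('SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL 5 MINUTE')
            ->fetchColumn();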

    This architecture gave us more room for growth and reduced the number of crashes caused by suboptimal SQL queries by an order of magnitude. But it is still far from ideal. We plan to introduce sharding so that we can scale updates and deletes, as well as the short queries that are supercritical to data freshness. MySQL's margin of safety is not infinite. Soon we will need heavy artillery in the form of Tarantool. More about that in the following articles!

    While dealing with suboptimal code and queries, we realized the following: any suboptimality is better eliminated before the release, not after it. This reduces the risk of an accident and the time developers spend on optimization, because once the code has been rolled out and new releases sit on top of it, optimizing it is much harder. As a result, we introduced a mandatory code review for optimality. It is conducted by our most experienced developers, in effect our special forces.

    In addition, we began collecting in do's & dont's the code optimization practices that work in our reality; they are listed below. Please do not take these practices as absolute truth and do not try to copy them blindly. Each one makes sense only for a specific situation and a specific business. They are given here only as examples, to make the specifics clear:

    • If an SQL query does not depend on the current user (for example, a demand map for drivers showing minimum fares and coefficients for polygons), then it should be run by cron at some interval (in our case, once a minute is enough), with the result written to a cache (Memcached or Redis) that the production code then reads from.
    • If an SQL query operates on data whose staleness is not critical for the business, then its result should be cached with some TTL (for example, 30 seconds) and read from the cache on subsequent requests (see the sketch after this list).
    • If, while handling a web request (in our case, while implementing a specific server method in PHP), you want to make an SQL query, make sure the same data has not already been fetched by another SQL query (and will not be fetched again further down the code). The same goes for the cache: it too can be flooded with requests, so if the data has already been read from the cache, do not go back to the cache for what you have already taken.
    • If, while handling a web request, you want to call some function, make sure no extra SQL queries or cache accesses happen in its guts. If calling such a function is unavoidable, make sure that it cannot be modified, or its logic split up, so as to avoid unnecessary queries to the databases and caches.
    • If you still have to go to SQL, check whether you can instead add the fields you need to queries that already exist higher or lower in the code.
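    To illustrate the second and third practices, here is a minimal sketch, assuming a Memcached instance on localhost; the table, key and function names are made up for the example.

        <?php
        // Practice 2: cache an SQL result with a TTL for data that tolerates
        // being slightly stale.
        function getActiveDriverCount(PDO $db, Memcached $cache): int
        {
            $key = 'active_driver_count';
            $cached = $cache->get($key);
            if ($cached !== false) {
                return (int)$cached;          // served from cache, no DB hit
            }

            $count = (int)$db->query('SELECT COUNT(*) FROM drivers WHERE is_online = 1')
                             ->fetchColumn();
            $cache->set($key, $count, 30);    // 30-second TTL
            return $count;
        }

        // Practice 3: request-scoped memoization, so that within one web request
        // the same row is never fetched twice, from the database or from the cache.
        function getOrder(PDO $db, int $orderId): array
        {
            static $perRequest = [];          // lives for the duration of the request
            if (!isset($perRequest[$orderId])) {
                $stmt = $db->prepare('SELECT * FROM orders WHERE id = ?');
                $stmt->execute([$orderId]);
                $perRequest[$orderId] = $stmt->fetch(PDO::FETCH_ASSOC) ?: [];
            }
            return $perRequest[$orderId];
        }

        // Wiring (the PDO connection would come from the application, e.g. the
        // db() helper from the earlier routing sketch):
        $cache = new Memcached();
        $cache->addServer('127.0.0.1', 11211);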

    3. Unsuccessful manual intervention in the system


    Examples of such accidents: an unsuccessful ALTER (which overloaded the database or caused replica lag); an unsuccessful DROP (we hit a MySQL bug that locked the database while dropping a freshly created table); a heavy query run by hand on the master by mistake; work performed on a server under load that we thought had been taken out of rotation.

    To minimize outages from these causes, you unfortunately have to understand the nature of the accident each time; we have not yet found a general rule. Again, let me illustrate with examples. At some point the surge coefficients stopped working (they multiply the trip price at places and times of increased demand). The reason was that a Python script was running on the database replica that supplies the data for calculating the coefficients; the script ate all the memory and the replica went down. The script had been running for a long time and lived on the replica simply for convenience. The problem was solved by restarting the script. The conclusions were: do not run third-party scripts on a machine with a database (we recorded this in do's & dont's, otherwise the lesson is wasted!), and monitor free memory on machines with replicas, alerting by SMS when memory is about to run out.
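    That second conclusion can be automated with something as small as the following sketch: a cron job on the replica host that reads /proc/meminfo and sends an SMS when available memory drops below a threshold. The threshold and the sendSmsAlert() helper are assumptions, as in the earlier sketches.

        <?php
        // Alert before the replica runs out of memory.

        const MIN_AVAILABLE_MB = 2048; // page someone if less than ~2 GB is left

        function sendSmsAlert(string $message): void
        {
            error_log('[ALERT] ' . $message); // placeholder for a real SMS/IVR gateway
        }

        $meminfo = file_get_contents('/proc/meminfo');
        if (preg_match('/^MemAvailable:\s+(\d+) kB/m', $meminfo, $m)) {
            $availableMb = intdiv((int)$m[1], 1024);
            if ($availableMb < MIN_AVAILABLE_MB) {
                sendSmsAlert(sprintf(
                    'Replica %s is low on memory: %d MB available',
                    gethostname(),
                    $availableMb
                ));
            }
        }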

    It is very important to always draw conclusions and not slip into the comfortable pattern of “saw a problem, fixed it, forgot it”. A quality service can only be built if conclusions are drawn. Besides, SMS alerts are very important: they fix the quality of the service at a higher level than before, keep it from falling, and let us improve reliability further. Like a climber, who pulls himself up from each stable position and secures himself in the next stable position, at a greater height.

    Monitoring and alerting are invisible but rigid iron hooks driven into the rock of uncertainty; they never let us fall below the level of stability we have set, a level we keep raising.

    4. Easter egg


    What we call an “Easter egg” is a time bomb that has existed for a long time but that we have not yet run into. Outside this article, the term means an undocumented feature created on purpose. In our case it is not a feature at all, but rather a bug that works like a time bomb and is a side effect of good intentions.

    For example: a 32-bit auto_increment overflow; suboptimal code or configuration that “fires” under load; a lagging replica (usually either because of a suboptimal query on the replica, triggered by a new usage pattern or by higher load, or because of a suboptimal UPDATE on the master, triggered by a new load pattern, that overloaded the replica).

    Another popular kind of Easter egg is suboptimal code, and more specifically a suboptimal SQL query. The table used to be smaller and the load lower, so the query worked fine. But with the table growing linearly in time and the load also growing linearly in time, the DBMS resource consumption grew quadratically. This usually has a sharply negative effect: everything was fine, and then bang.

    Rarer scenarios are a combination of a bug and an Easter egg: a release with a bug increases the size of a table or the number of records of a certain type, and an already existing Easter egg then causes excessive database load because of slower queries against that overgrown table.

    That said, we also had Easter eggs unrelated to load. For example, a 32-bit auto_increment: after a little more than two billion records in the table, inserts stop working. So in the modern world auto_increment fields must be made 64-bit. We learned that lesson well.
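    For illustration, here is a hedged sketch of both halves of this lesson: checking how close a (hypothetical) table's auto_increment is to the signed 32-bit ceiling, and widening the key to 64-bit. On a large production table the ALTER itself is heavy and would go through an online schema-change tool.

        <?php
        // Warn before the 32-bit auto_increment ceiling is reached, then widen it.

        const INT32_MAX = 2147483647;

        $db = new PDO('mysql:host=db-master.internal;dbname=taxi', 'app', 'secret',
                      [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

        $stmt = $db->prepare(
            'SELECT AUTO_INCREMENT FROM information_schema.TABLES
              WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = ?'
        );
        $stmt->execute(['orders']);
        $next = (int)$stmt->fetchColumn();

        if ($next > INT32_MAX * 0.8) { // alert well before the ceiling is hit
            error_log('orders.id has used over 80% of the 32-bit range, time to widen it');
        }

        // The actual fix: make the primary key 64-bit.
        $db->exec('ALTER TABLE orders MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT');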

    How do we deal with Easter eggs? The answer is simple: (a) look for old “eggs” and (b) prevent new ones from appearing. We try to do both. Searching for old “eggs” goes hand in hand with our constant code optimization: we assigned two of the most experienced developers to near-full-time optimization. They find the queries in slow.log that consume the most database resources and optimize those queries and the code around them. We reduce the likelihood of new eggs by having the aforementioned sensei developers review every commit for optimality. Their task is to point out mistakes that affect performance, suggest how to do things better, and pass the knowledge on to other developers.

    At some point, after yet another Easter egg we had found, we realized that looking for slow queries is good, but it is also worth looking for queries that look slow yet run fast: they are the next candidates to bring everything down if some table grows explosively.

    5. External causes


    These are causes that, as we see it, we can barely control. For instance:

    • Throttling by Google Maps. It can be mitigated by monitoring our usage of the service, keeping the load on it within a certain level, planning load growth in advance, and purchasing a larger quota.
    • A network outage in the data center. It can be mitigated by keeping a copy of the service in a backup data center.
    • An outage of the payment service. It can be mitigated by having backup payment providers.
    • Erroneous traffic blocking by the DDoS protection service. It can be mitigated by keeping DDoS protection disabled by default and enabling it only during a DDoS attack.

    Since eliminating an external cause is by definition a long and expensive undertaking, we simply started collecting statistics on accidents caused by external factors and waiting for a critical mass to accumulate. There is no recipe for determining the critical mass; intuition just does its job. For example, if we have had full downtime 5 times because of problems with, say, the DDoS protection service, then with every subsequent outage the question of an alternative is raised more and more sharply.

    On the other hand, if we can somehow keep working while an external service is unavailable, we definitely do it. The post-mortem analysis of each outage helps us here. There must always be a conclusion, which means that, like it or not, you always come up with a workaround.

    6. Bad release, broken functionality


    This is the most unpleasant type of accident. It is the only type with no visible symptoms other than user or business complaints. That is why such an accident, especially a small one, can live unnoticed in production for a long time.

    All other types of accidents are more or less similar to “bad release, 500 errors”; the trigger is just different: load, a manual operation, or a problem on the side of an external service.

    To describe the method of dealing with this type of accident, it is enough to recall an old joke:

    A mathematician and a physicist are given the same task: boil a kettle. They are given tools: a stove, a kettle, a tap with water, matches. Both fill the kettle with water, turn on the gas, light it and put the kettle on the flame. Then the task is simplified: they are given a kettle already filled with water and a stove with the gas already burning. The goal is the same: boil the water. The physicist puts the kettle on the flame. The mathematician pours the water out of the kettle, turns off the gas and says: “The problem has been reduced to the previous one.” (anekdotov.net)

    This type of accident must be reduced, by any means, to “bad release, 500 errors”. Ideally, bugs in the code would be written to the log as errors, or at least leave traces in the database. From those traces you can tell that a bug has occurred and alert immediately. How do we encourage this? We started analyzing every major bug and proposing what kind of monitoring or SMS alert could make that bug show itself immediately, just like a 500 error.

    6.1. Example


    There were massive complaints: orders paid via Apple Pay were not closing. We started investigating and reproduced the problem. We found the cause: we had refined the expiry-date format for bank cards in the interaction with acquiring, and as a result, specifically for Apple Pay payments, we began sending it in a format that the payment processing service did not expect (in effect, fixing one thing and breaking another), so all Apple Pay payments started being declined. We fixed it quickly, rolled it out, and the problem disappeared. But we had “lived” with the problem for 45 minutes.

    Following up on this problem, we set up monitoring of the number of failed Apple Pay payments, along with an SMS/IVR alert with some non-zero threshold (because failed payments are normal from the service's point of view: for example, the client has no money on the card or the card is blocked). From that moment on, as soon as the threshold is exceeded we learn about the problem instantly. If a new release introduces ANY problem into Apple Pay processing that makes the service inoperable, even partially, we will learn about it from monitoring immediately and roll the release back within three minutes (how the manual rollback process works is described above). It used to be 45 minutes of partial downtime; it became 3 minutes. Profit!
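    The check behind such an alert can be as simple as the following sketch, which is not our production code: a cron job that counts declined Apple Pay payments over the last few minutes and alerts once a non-zero threshold is exceeded. The table and column names are hypothetical, and sendSmsAlert() is the usual placeholder.

        <?php
        // Count recent Apple Pay declines and alert above a non-zero threshold.

        const WINDOW_MINUTES   = 5;
        const FAILED_THRESHOLD = 20; // some declines are normal (no funds, blocked card)

        function sendSmsAlert(string $message): void
        {
            error_log('[ALERT] ' . $message); // placeholder for a real SMS/IVR gateway
        }

        $db = new PDO(
            'mysql:host=db-replica-prod.internal;dbname=taxi;charset=utf8mb4',
            'monitoring', 'secret',
            [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
        );

        $cutoff = date('Y-m-d H:i:s', time() - WINDOW_MINUTES * 60);
        $stmt = $db->prepare(
            'SELECT COUNT(*) FROM payments
              WHERE method = :method AND status = :status AND created_at > :cutoff'
        );
        $stmt->execute([':method' => 'apple_pay', ':status' => 'declined', ':cutoff' => $cutoff]);
        $failed = (int)$stmt->fetchColumn();

        if ($failed > FAILED_THRESHOLD) {
            sendSmsAlert("Apple Pay declines: {$failed} in the last " . WINDOW_MINUTES . ' minutes');
        }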

    6.2. Other examples


    We rolled out an optimization of the list of orders offered to drivers. A bug crept into the code, and as a result drivers in some cases saw an empty order list. We found out about the bug by accident: one of our employees looked into the driver app. We quickly rolled back. As a conclusion from this accident, we built a graph of the average number of orders in drivers' lists based on the database, looked at it retroactively over a month, saw the dip there, and added an SMS alert on the SQL query that builds this graph, triggered when the average number of orders in the list falls below a threshold chosen from the historical minimum for the month.

    We changed the logic for giving users cashback for trips and, among other things, gave it to the wrong group of users. We fixed the problem, built a graph of the cashback handed out, saw a sharp spike there, noticed that there had never been such growth before, and added an SMS alert.

    A release broke the order-closing functionality (orders took forever to close, card payments did not work, drivers demanded cash from customers). The problem lasted 1.5 hours (passive and active stages combined). We learned about it from contact center complaints. We made a fix and set up monitoring and alerts on the order closing time, with thresholds derived from studying historical graphs.

    As you can see, the approach to this type of accident is always the same:

    1. Roll out a release.
    2. Learn about a problem.
    3. Fix it.
    4. Determine from which traces (in the database, logs, Kibana) the signs of the problem can be detected.
    5. Plot those signs on a graph.
    6. Rewind the graph into the past and look at spikes and dips.
    7. Choose the right threshold for the alert.
    8. When the problem arises again, learn about it immediately through the alert.

    What is nice about this method is that one graph and one alert immediately cover a huge class of problems (examples of problem classes: orders not closing, extra bonuses, Apple Pay payments failing, and so on).

    Over time, we made building an alert and monitoring for every major bug part of the development culture. To keep this culture from getting lost, we formalized it a little: for each accident we began to require a report from ourselves. A report is a filled-in form with answers to the following questions: root cause, how it was eliminated, impact on the business, conclusions. All fields are required, so whether you want to or not, you will write the conclusions. This process change was, of course, recorded in do's & dont's.

    7. Kotan


    The degree of process automation grew, and eventually we decided it was time to build a web interface showing the current state of the process. We called this web interface (in fact, already a product) “Kotan”, from the word “roll”, as in “roll out”. :-)

    In "Kotan" there is the following functionality:

    • List of incidents. This is a list of all alerts that have fired in the past, i.e. everything that required an immediate human response. For each incident we record its start time, its closing time (if it is already closed), a link to the report (if the incident has ended and the report has been written), and a link to the alert directory, to make clear what type of alert the incident belongs to, plus the history of similar incidents (to see how we actually resolved them before).

    • Directory of alerts. In essence, this is a list of all alerts. To make the difference between an alert and an incident clearer: an alert is like a class, and an incident is like an object. For example, “the share of 500s is greater than 1%” is an alert, while “the share of 500s exceeded 1% on such-and-such date, at such-and-such time, for such-and-such duration” is an incident. Each alert is added to the system after a specific problem is solved that the alert system had not previously detected. This iterative approach gives a high probability that there are no false alerts (alerts that require no action). The directory keeps the full history of reports for each alert type, which helps diagnose a problem faster: an alert arrives, you go to Kotan, click through to the directory, see the whole story, and you already roughly know where to dig. The key to successfully fixing an accident is having all the information at hand. The directory also stores a link to the alert's source code (to understand exactly which situation the alert detects) and a textual description of the current best practices for handling it.

    • Reports. All reports throughout our history. Each report links to all the incidents it is associated with (incidents sometimes come in a group with a single cause and a single report for the whole group) and records the date the report was written, a flag confirming that the problem is solved, and most importantly: the root cause, how it was eliminated, the impact on the business, and the conclusions.

    • List of conclusions. For each conclusion, we mark whether it has been implemented, is planned for implementation, or is not needed (with an explanation of why it is not needed).

    8. And what has changed in the process itself?


    A very important component of improving stability is the process itself. The process is constantly changing, and the purpose of the changes is to improve it so that accidents become less likely. Decisions to change the process should ideally be made not speculatively but on the basis of experience, facts and figures. The process should be built not top-down by directive but bottom-up, with the participation of all interested team members, because one leader's head is good, but the many heads of the whole team are better! The process must be strictly followed and monitored, otherwise it is pointless. Team members should correct each other when someone deviates from the process, because if not they, then who? And there should be as much automation as possible taking over the control functions, because people, especially in creative work, constantly make mistakes.

    To automatically enforce that conclusions are drawn from each accident, we did the following. For each alert, releases are automatically blocked. When the closing alert arrives (an SMS saying the incident is over), releases are not unblocked immediately; instead, the system expects a report to be entered with the following information: the cause of the accident, how it was fixed, how it affected the business, and what conclusions were drawn. The report is written by the people who took part in analyzing the accident, that is, those who have the most complete information about it. Until the report appears in the system and is approved, releases remain blocked by the automation. This motivates the team to get together quickly after fixing the accident and write the report. The report must be approved by someone who did not take part in writing it, so that there is a second opinion. Thus, on the one hand, we ensured that conclusions are actually drawn from every accident; on the other hand, the team is motivated to write the report quickly, so releases are not blocked for long.
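    A rough sketch of such a release gate (not Kotan's real code) might look like this; the table and column names are hypothetical.

        <?php
        // Deploys stay blocked while any incident is open, or while any closed
        // incident still lacks an approved report.

        function releasesAllowed(PDO $db): bool
        {
            // Any incident without a closing time keeps releases blocked.
            $open = (int)$db->query(
                'SELECT COUNT(*) FROM incidents WHERE closed_at IS NULL'
            )->fetchColumn();
            if ($open > 0) {
                return false;
            }

            // Closed incidents still block releases until their report is
            // approved by someone who did not write it.
            $unreported = (int)$db->query(
                'SELECT COUNT(*) FROM incidents i
                  LEFT JOIN reports r ON r.incident_id = i.id AND r.approved_by IS NOT NULL
                 WHERE i.closed_at IS NOT NULL AND r.id IS NULL'
            )->fetchColumn();

            return $unreported === 0;
        }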

    9. Instead of an epilogue


    Instead of an epilogue, I will briefly summarize what we changed in the process in order to reduce the number of lost trips, and why.

    • We started keeping an accident diary. Why: to draw conclusions and not run into the same accident again.
    • For large accidents (with many lost trips), we started writing post-mortems. Why: to learn to fix such accidents faster in the future.
    • We started keeping the do's & dont's file. Why: to build up knowledge of what is and is not allowed in development, and why.
    • We banned releases more often than once every 5 minutes. Why: to reduce the delay in diagnosing an accident.
    • We roll out first to one low-priority server and only then to everything. Why: to reduce the impact of a bad release.
    • We automatically roll back a bad release. Why: to reduce the impact of a bad release.
    • We forbid rolling out during an accident. Why: to speed up diagnosis.
    • We write about releases and accidents in the chat. Why: to speed up diagnosis.
    • After a release, we watch the charts for three minutes. Why: to speed up problem detection.
    • SMS/IVR alerts about problems. Why: to speed up problem detection.
    • Every bug (especially a large one) is covered by monitoring and an alert. Why: to speed up problem detection.
    • Code review for optimality. Why: to reduce the likelihood of accidents caused by new suboptimal code.
    • Periodic code optimization (with slow.log as input). Why: to reduce the number of accidents caused by “Easter eggs”.
    • For each accident, we draw conclusions. Why: it reduces the likelihood of the same accident in the future.
    • For each accident, we create an alert. Why: it reduces the time needed to eliminate the same accident in the future.
    • Releases are automatically banned after an accident until a report is written and approved. Why: it increases the likelihood that conclusions will be drawn after the accident, and therefore reduces the likelihood of the same accident in the future.
    • “Kotan”, an automated tool for improving the quality of the service. Why: it reduces the duration of accidents and their likelihood.
    • Directory of incidents. Why: it reduces the time needed to diagnose an accident.

    Thank you very much for reading to the end! Good luck to your business, and may you have fewer lost orders, transactions, purchases, trips and whatever else is critical for you!
