Silicon Chainsaw Massacre

Published on October 11, 2018

Silicon Chainsaw Massacre

    Duty is an important component of most modern organizations. After all, it often happens that the problem arrives at 3 in the morning. But who should be on duty? And how to organize this process so that it makes sense?

    Look under the cat, there Baruch Sadogursky ( @jbaruch ) and Leonid Igolnik ( @ligolnik ) will tell a horror story about one unlucky person on duty. Remember Vasya, who always had to fix bugs to Bukhom at three in the morning? This is just the beginning.



    The material is based on the speech of Baruch and Leonid at the autumn conference DevOops 2017.



    So duty. As an example, take the hypothetical Vasya, who has been working in a certain company not so long ago, for about a year. Today is Friday and it so happened that Vasya is on duty. Everything is going well: Vasya is sleeping, and his dreams are beautiful.

    While he does not suspect anything, his technical support colleague receives a call from a client. He says that on Monday he will have to show the CEO a very important demo, but something is not working. An urgent need to fix the problem. Support did everything it could (suggested turning it off and on), and escalated the problem to the duty officer.

    Vasya is on duty alone for the whole project, he will have to do this. A technical support engineer wakes him up and tries to explain the problem with very simple words: “Something is not working there, something is wrong.”

    Vasya listens to his colleague, thinks about what happened and takes the only right decision ...



    The support engineer takes offense, but he still asks the client to wait until Monday. The client does not agree with this state of affairs and escalates the problem to the support manager. Now he calls Vasya and says that it is not serious: “Still, the matter is important, the client is experiencing, it is necessary”. Vasya is a responsible person. It is necessary then it is necessary: ​​coffee and we will dance (we will look for something in logs).

    In the logs everything was not easy at all. At first, Vasya spent 15 minutes looking for a person with access to the correct log. He found it and got a log. But how? To the mail in .doc format, of course. Vasya is responsible, young and reading a log in the Word: nothing is clear at all. But there are keywords that can be searched in the Knowledge Base. Vasya hangs on interesting things for 20 minutes, learns a lot of interesting things, but nothing necessary at the moment.

    What does Vasya do next? We need to look for someone, but Vasya is a young engineer and it’s terrible to wake up a senior. Therefore, he decides to call a colleague, the same junior.



    A colleague is a good friend, with grief in half they understand what the problem is, and they begin to solve it. After two or three hours of work, they found a hot-fix. It must be given immediately to production. But how? There is a change control board, which meets every two weeks on Mondays. But this option is not suitable, it is necessary urgently.

    Naturally, it is impossible to deploy to production, there is no root. So the only option is to send a jar file with two classes and a paragraph of explanations by e-mail. Those guys will log in via ssh to production and there this jar-file in WebSphere is put in the correct directory. And so, at 6-7 am, you can finally go to sleep with a sense of accomplishment.



    On Monday, Vasya comes to work and sees an unusual picture. On VP of Engineering, the general director yelled, because the client had yelled at the general director, because his general director had yelled at him. It turns out that the problem did not work out.

    The bosses have four questions: “What happened on Friday?”; “Why did I find out about it from the boss on Monday?”; “Why did the fix take six hours?” And “Who is to blame?”. Opsies are called into the room, saying that this is not their problem. Vasya and the other maidens are called to the room, who also says "Nothing worked." This whole potato game continues until lunch.

    Since it’s time for lunch, the decision is made “Okay, it’s broken itself, no one is to blame.” The end of the story.



    Let's arrange a little debriefing: let's go through this nightmare and try to give answers to all the questions.

    Who should be on duty


    System administrators can be on duty (or the more buzzword is SRE). They are engaged in this not for the first year, everything is well supplied. DBA can be on duty, those who are engaged in messaging can. If lucky, the NOC (network operation center) is on duty when all these guys are put in one room. They have runbooks that say what to do in a difficult situation.  



    Everything is clear with these guys, they are pros on duty. But, unfortunately, DevOps has the second part of the equation, which is not very willing to be on duty. How to make the second part? Yes, and whether it is necessary to force, because the pros must be aware of the importance of duty.

    There are two reasons why people do not want to be on duty. One is when this is the case:



    No one wants to be on duty when everything is bad. The solution to this problem is quite simple:



    You need to write good code, everything is simple. But this also needs to be learned. A new question arises: “How to learn?”. You need to stick your fingers into the outlet - #painisinstructional. The pain helps.



    The duties themselves improve the quality of the code. The same Vasya on Monday, most likely, will correct his mistakes, so as not to sit one more night in search of mistakes.

    Barriers on duty


    When on duty there are some barriers that should not be there. Let's walk on Vasin Friday fiasco. Remember how he read logs in a Word document?



    We all love Microsoft products, but in the modern world it is impossible to work this way. There are obvious things about logs that everyone should understand. The main point is the number of bodies that solve these problems: Logstash, Loggly, Sumo logic, Kibana. They need to be used.

    Returning to the question of access: why not give access to the log? The answer is sensitive data. Clients were promised to protect data from leaks. This means that logs cannot be shown to anyone or you need to use systems with data masking functionality. He himself hides personal data and does not show it to a person without the required level of access. All today's log parsing tools have this functionality.

    Another advantage of these tools is that “monitoring for the poor” can be built on them. For example, in the logs, after seeing a certain number of errors (the queue is full, etc.), you can trigger an alert.



    Who determines the importance of the problem


    How to understand, go to bed on duty or urgently run to fix the problem? Who appoints severity? Who decides how critical the problem is? Solves support. And why? Because support has a vision problem. He knows how bad everything is.

    Moreover, today almost all companies operate on the principle of “Continuous Delivery”, so all customers receive all features at the same time (and at the same time bugs). Suppose there is a bug that looks to the client as “Something is not working there, nothing terrible”. At the same time, the support sees hundreds of alerts about the problem, and this is obviously serious.

    The client is also involved in determining the importance of the problem. But there is a problem - the client does not always believe that his small problem will be repaired and puts the highest importance. Severity begins to be used as a manipulation tool. But if the severity definition is built correctly and the client understands what an SLA is, this usually does not happen. This is a mutual learning process when customers begin to understand what is really very important and what is simply important.

    Customers need to be given the opportunity to set importance, because the manufacturer of the product does not always understand the context of the problem.

    SLA is a guarantee of response and decision within a day, but it can be faster. This, again, depends on the context.



    Organizational issues


    Vasya did not understand the problem to the end. Of course, he just woke up, and a colleague from technical support also poorly explained. It is treated in only one way: a ticket. A ticket is a reference number that tells you what it is all about. This is necessary for Vasya, because instead of a phone, a support could tell Vasya “Open our ticket management system and see ticket number 42”. This is necessary for several reasons. First, Vasya, instead of listening to a colleague in a sleepy state, wakes up, goes to read the ticket and understand how important this is. Secondly, the problem may not be one.

    There is another difficulty that affects the big picture: Vasya needs to be found. How does the support know what exactly Vasya is on duty today? And what if he doesn’t know Vasya either. It is important that it is easy to find the right person. The solution is Virtual Extension with different prefixes for the engineer, production and so on. Well, or other, more sophisticated systems. With this decision, you do not need to guess where to call in the middle of the night, everything switches automatically.

    Even more convenient - Escalation Chat in Slack, Telegram, Skype, anywhere. The chat title is the number of that same ticket. All communication on this ticket between the right people is in this chatika. One of the most useful features of this chat is that you do not need to tell anything to those who join the work after some time. About what decisions have already been made, you can just read.

    Well, the virtual phone bridge, which can be compared with a gathering place during a fire: everyone knows where to go in case of problems. By the way, in modern systems like Zoom or Bluejeans all the necessary functionality is already built in.



    Escalation path


    Vasya was afraid to call the lord, because they are terrible. What to do with it? Let's talk about the escalation path. We must always know who and when to wake up or take off from work. This should be perfectly clear to everyone: the one who wakes and the one who is woken. You also need to know where to call if the first path did not work. A good escalation path goes right up to the company's CEO.



    There are communications that should come from the CEO. He should be aware of critical issues.

    Errand manager


    The next interesting position is manager on call. It does not have to be a techie and take part in solving the problem. First, in the middle of the night, Vasya cannot tell the client anything useful. The duty manager can help in this situation. In addition, he can do coordination, project management in a difficult situation, resource management. After all, work must go smoothly.



    Access to production


    Should I give Vasya access to production? After all, not all are brilliant engineers, and there are certain things that we would not like to learn, customers will be offended. This is solved by the deployment process. If the process is set up correctly, then Vasya can press a button, which will end up with production code on his production. If you have a normal continuous delivery pipeline, most likely, this can be done automatically (all tests will pass). If not, in many companies the decision is made by a senior engineer or manager. He reviews, evaluates the risk of the code and makes a decision.

    And even before the hotfix appeared, documented tools for troubleshooting should be involved. One of the most frequent problems on duty: he begins to think about how to enable debug or change the level of logs. At the same time there is no problem with everything else. It is advisable to have documented solutions to problems.



    Change of duty


    Now let's talk about the process of duty, which usually takes a week. At some point, need to change. To change effectively, it must be done at a certain time. It is necessary to plan everything clearly and it is desirable to be able to effectively transfer problems to the next person on duty. In many companies this is done by a standard rally, or a page is created in which the duty officer keeps a small journal.



    Other problems


    There are still various other problems that need to be addressed. One of them is the certification of access to production. It is desirable to carry out an elementary certification, so that a person shows that he understands what is possible and what is not.



    How to implement it?


    But how does this all bring in the company? You need to understand that it will take some time.



    We need to start with senor guys on duty. They have experience and know what and how to repair. Senor has fewer problems: there is access to logs, etc.

    It is advisable to start in a small team. If the team is already large, you need to create a small one inside it. Then, when everything starts to work well, you can shame the rest.



    findings


    Go to the main thing. Despite the fact that we, techies, are in love with tech, the most important thing is not the products and technologies that are used in the company, but the feeling of the person on duty “I am involved in the process of improving the product” (or the duty itself). Many understand that it is necessary to be on duty, but I want to see improvements. People need to realize that thanks to them and their improvements, things are getting better.



    PS


    We want to recommend a book called Drive. This is a book about the motivation of people who work in professions like ours. This motivation consists of three main components and (spoiler) none of them is not money.



    Already this Sunday, Baruch and Leonid will give a talk "#DataDrivenDevOps" at DevOops 2018 in St. Petersburg. Come, there will be a lot of interesting things: here are the reports , here is John Willis , and here is the party with BoF and karaoke . Waiting for you!