Why SRE Documentation Matters. Part 1

Original authors: Shylaja Nukala and Vivek Rau

Translation

Good evening everyone!

The pace of our launches varies from month to month. The September students have barely finished the second month of the Devops - Practices and Tools course, and we are already opening the next cohort. So we are once again ready to share useful material on the subject, and we look forward to equally useful open lessons.

Today we look at the first part of the article on how documentation allows SRE teams to manage new and existing services.

SRE (site reliability engineering; specialists in this field go by the same abbreviation) is a discipline, a mindset, and a set of technical approaches aimed at keeping web products and services up and running. SREs work at the junction of software development and systems engineering: they solve operational problems and develop scalable, reliable, and efficient solutions for designing, building, and operating large-scale distributed systems.

The main tasks of SRE:

Monitoring and metrics collection - defining the desired behavior of the service, observing its actual behavior, and closing the gap between the two.
Incident response - detecting service failures and responding to them effectively, so that service availability stays within its SLA (service-level agreement).
Capacity planning - forecasting future demand and providing enough computing resources in the right locations to meet it.
Service scaling - predictably deploying and removing the service's computing capacity in data centers, often as a result of capacity planning.
Change management - changing the behavior of the service without sacrificing its reliability.
Performance - design, development, and engineering work related to scaling, isolation, latency, throughput, and efficiency.
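
As a rough illustration of the monitoring and incident-response tasks above, the sketch below compares measured availability with an SLA target. It is only a minimal example under stated assumptions: the Window type, the function names, the request counts, and the 99.9% target are all hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability(window: Window) -> float:
    """Fraction of requests served successfully during the window."""
    if window.total_requests == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return 1 - window.failed_requests / window.total_requests


def meets_sla(window: Window, sla_target: float) -> bool:
    """True if measured availability is at or above the SLA target."""
    return availability(window) >= sla_target


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    window = Window(total_requests=1_000_000, failed_requests=700)
    print(f"availability: {availability(window):.4%}")
    print("within SLA:", meets_sla(window, sla_target=0.999))
```

In practice the counts would come from a monitoring system, and the definition of a "failed" request is itself something the team has to agree on and document.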

SRE focuses on the life cycle of services: from idea and design to deployment, operation, improvement, and, ultimately, decommissioning.

Before a service launches, SREs support it by advising on system architecture, developing software platforms, frameworks, and capacity plans, and conducting a launch review.

Once the service is running, SREs support it as follows:

Measure and monitor availability, latency, and overall system health.
Review planned system changes.
Scale system health through mechanisms such as automation.
Evolve the system by pushing for changes that improve reliability and speed.
Handle incident response and conduct blameless postmortems.

When a service reaches the end of its life, SREs decommission it in a predictable way, with clear communication and complete documentation.

A mature SRE team always has complete documentation for each SRE function. If you manage an SRE team or plan to build one, this article will help you understand which types of documentation your team needs, so that you can plan and prioritize documentation work alongside the team's other tasks.

An SRE's story

Before discussing the nuances of SRE documentation, let's look at a day in the life of Zoe, a newly minted SRE.

This is Zoe's second shift as an SRE on AcmeSale, the flagship project at Acme Inc. Until now she has only been settling into the team, shadowing her colleagues and taking notes. But now she is the one carrying the pager.

As luck would have it, the pager goes off at 2:30 in the morning. The message says “Job Ragnarok leaned back”, and Zoe has no idea what that means. She scrolls through her notes and finds a link to the main dashboard. Everything looks OK. She searches the Acme intranet for any document that mentions Ragnarok, and after a few precious minutes she finds an outdated design document for the Ragnarok service, which turns out to be a critical dependency of AcmeSale.

Fortunately, the design doc links to a “Ragnarok Ops” page, which in turn links to dashboards with useful graphs. The page also mentions the ragtool script, which can probably help resolve the problem, but Zoe is hearing about it for the first time. So she pages another SRE who has many years of experience with this service and its tools. Unfortunately, there is no answer. Zoe checks her mail and sees a message saying that her colleague is offline for an hour for health reasons. After weighing the pros and cons, she calls her tech lead, but the call goes to voice mail. It looks like she will have to solve this problem on her own.

After spending some time searching for information about the mysterious ragtool script, she finds a document that briefly describes its command-line parameters and says where to find it. She runs ragtool --restart and crosses her fingers. Nothing changes; traffic drops even further. She desperately scans the rest of the command-line options, but she is not sure that they won't do even more harm. Finally, she decides to run ragtool --rebalance --dc=atlanta, because the graphs show that the problem is especially noticeable in the Atlanta data center. The traffic graph slowly begins to creep up, and Zoe celebrates her victory. MTTR (mean time to repair, the average time needed to restore a service to working condition) is 45 minutes.

The next day, Zoe runs a postmortem on the incident, because the outage was significant and led to lost revenue, and because her manager has been asking for more postmortems. She asks the team how the other members would have handled the problem and hears three different approaches. It turns out that there is simply no single troubleshooting process. Her colleagues also admit that “leaned back” is not the best name for the alert, and that the failure was caused by a known bug that simply had not been a priority.

Finally, Steve, her tech lead, asks: “Which version of ragtool did you use?” He then points out that the version she used is terribly old. A new version was released a week ago, along with completely new documentation that describes all its features and even explains how to solve the “Job Ragnarok leaned back” problem. That version would have cut MTTR to five minutes.

The existence of a new version of ragtool comes as a surprise to half the team, while the other half more or less knows about the new version and the guide. The latest version of the script lives in Steve's home directory, in the bin/ folder, naturally. Zoe adds this to her notes for future use, hoping to get through the rest of her shift quietly. She wonders whether the tech lead or anyone else on the team will address the problems raised in the postmortem, or whether every future SRE will have to go through the same painful experience.

Later that day, Zoe attends a meeting where the SRE team talks with the development team about a service handover. Steve leads the meeting, asks several questions about operational procedures and the service's current reliability problems, and asks the developers to make changes before the SRE team takes responsibility for the service. Zoe has already been to several such meetings led by Steve and other senior SREs. She has noticed that the questions and the tasks assigned to the developers vary greatly, depending on who leads the meeting and what problem the SRE team dealt with the week before.

Zoe secretly dreams of more consistent standards and procedures, but does not yet know how to get there. Later, she overhears two developers joking by the coffee machine that many of the questions seemed only loosely related to the pager, and that they have no idea where those questions came from. Zoe wishes the developers understood that SRE is about more than carrying a pager. Back at her desk, she finds several tickets that need to be triaged and puts the matter out of her mind.

Fortunately, all the characters and events in this story are fictional. Still, ask yourself: does any of it resemble something you have encountered in reality? The solution to this fictional team's problems is fairly obvious, and we discuss it in more detail in the next section.

The importance of documentation

In the early stages of an SRE team, the organization depends heavily on the work of a few highly skilled individuals on the team. The team keeps important concepts and operational principles as scraps of “tribal knowledge”, passed on orally to new team members. If these principles are not standardized and documented, at some point they will most likely have to be painfully relearned through trial and error. Sometimes team members perform operational procedures as a strict sequence of steps defined by their predecessors in the distant past, without understanding the reasons behind those steps. If this is not stopped, the processes become fragmented and degrade as soon as the team starts to grow and take on new problems.

An SRE team can prevent this decay by creating high-quality documentation that serves as the foundation for the team's growth and for a systematic approach to managing new and unfamiliar services. Such documents preserve tribal knowledge in a form that is easy to find, search, and maintain. New team members are trained through a systematic, well-designed program. These are the hallmarks of a mature SRE team.

The rest of the article describes the different types of documents that SREs create over the life cycle of a service they support.

THE END

In the next part, we will look at all of these types in detail. In the meantime, we look forward to your comments and questions, and we invite you to an open lesson.
