Why SRE Documentation Is Important. Part 3

https://queue.acm.org/detail.cfm?id=3283589

Translation

Good evening, everyone! We are glad to share the news that in February we are launching a new stream of the course “DevOps - Practices and Tools”, which means it is time to finish what we started and publish the third part of the article “Why SRE Documentation Is Important”. Let's go!

Documents for managing SRE teams

SRE teams need reliable and accessible documentation to work efficiently.

Team site

Note: Instead of a site, you can use a separate space or section in Confluence or a wiki.

The team site is important because it brings together the information and documentation related to the SRE team and its projects. For example, at Google many SRE teams use g3doc (Google's internal documentation platform, where docs live in the source tree alongside the associated code), and some teams use both g3doc and Google Sites: in this case, the g3doc pages are the ones closely tied to the code and implementation details.

Team charter

SRE teams must maintain a published charter that explains why the team exists and documents its current engagements. The charter establishes the team's identity, main goals, and purpose within the company.

The charter usually contains the following elements:

A high-level description of the team's responsibilities, including the types of services the team supports (and how), related systems, and examples.
A brief description of a few of the most important services the team supports. This section also highlights key technologies and the difficulties involved in using them, the benefits of involving SRE, and what SRE is responsible for.
Key principles and values of the team.
Links to the team site and documentation.

The team is also expected to have a vision statement (an inspiring description of the team's long-term goals) and a roadmap covering several quarters.

Documentation for integrating new SREs

Investing in training tools and materials for new employees speeds up their integration into the team's workflows. It is in the SRE team's interest to give newcomers all the skills they need for on-call shifts as soon as possible. Zoe's story clearly shows how the lack of comprehensive training for a new employee can turn a minor incident into a serious failure.

Many SRE teams prepare new employees for on-call shifts with the help of checklists. An on-call checklist usually covers the high-level areas (divided into subsections) that team members must understand. Examples of such areas include production concepts, front end and back end, automation and tools, and monitoring and logging. The checklist may also include instructions for preparing for a shift and the tasks to perform during it.

SRE teams also train new members with role-playing exercises (at Google these are called Wheel of Misfortune). Each exercise presents a failure scenario with a specific set of data and signals that an SRE might need in order to solve the problem during an on-call shift. Team members take turns playing the on-call engineer to hone their incident-response and debugging skills. Wheel of Misfortune tests whether every team member knows where to find the documentation needed to fix the problem and how to handle the failure.

Repository Management

An SRE team's information can end up scattered across multiple sites, local repositories, and Google Drive folders, which makes it very difficult to find the right document. In the example described earlier, a critical operational tool and the instructions for using it were unavailable to Zoe (the on-call SRE) because they were hidden in her technical lead's personal directory, and the inability to find them significantly prolonged the service outage. To avoid such problems, structure all the information and make sure team members know where to find it, where to store it, and how to maintain it. A well-designed structure helps the team find information faster: new team members get up to speed sooner, and on-call engineers resolve problems more quickly.

Here are some guidelines on how to create and maintain a documentation repository:

Identify key stakeholders and conduct brief interviews to identify all needs.
Find as much documentation as possible and analyze the gaps in the content.
Set up a basic site structure so that new documentation is created in the right places.
Move existing documentation to a new location.
Archive and delete obsolete documentation.
Perform regular checks to ensure the quality / consistency of supported documentation.
Make sure that standard search queries produce the necessary documents at the very top of the search results list.
Use signals, such as Google Analytics, to evaluate how the documentation is actually used.

Note on repository maintenance: it is important to check and update documentation regularly. The owner's name and the date of the last check should be visible, because this information helps readers judge how reliable a document is. In Zoe's story, she could find only outdated documentation for a critical tool and so lost the chance to solve the problem quickly. Unreliable and outdated documentation makes SREs less efficient, which in turn hurts the reliability of the services they manage.
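A regular check like this can be partly automated. The sketch below is only an illustration, not a description of any tool mentioned in the article: the DocRecord fields, the example documents, and the 180-day review threshold are all assumptions. It flags every document whose last check is older than the threshold so that the owner can be asked to review it.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DocRecord:
    """Hypothetical metadata kept for each document in the repository."""
    title: str
    owner: str
    last_reviewed: date


def find_stale_docs(docs, today, max_age_days=180):
    """Return the documents whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [doc for doc in docs if doc.last_reviewed < cutoff]


# Made-up example data; the 180-day threshold is an assumption, not a rule.
docs = [
    DocRecord("Failover tool guide", "tech-lead", date(2018, 1, 15)),
    DocRecord("On-call checklist", "sre-team", date(2018, 11, 1)),
]

stale = find_stale_docs(docs, today=date(2018, 12, 1))
for doc in stale:
    print(f"{doc.title} (owner: {doc.owner}) was last reviewed on {doc.last_reviewed}")
```

Passing the current date in as a parameter, rather than reading it inside the function, keeps the check easy to test and to run against historical snapshots.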

Repository availability

SRE teams must ensure that documentation remains available even when a failure makes the standard repository inaccessible. Each SRE at Google has a personal copy of critical documentation. This copy is kept on an encrypted compact storage device or a similar removable but secure physical medium that each SRE carries while on call.

Documentation for decommissioning a service

When a service reaches the end of its life cycle, SRE decommissions it in a predictable manner. This section provides recommendations for documenting the decommissioning of a service.

It is important to notify users in advance that the service is being decommissioned and to provide a schedule and the steps involved. Your announcement should explain when new user registration ends, how existing and future bugs will be handled, and when the service will finally stop working. Clearly mark all important dates and the points at which SRE support is reduced, and send interim announcements as the process moves forward.

An email announcement alone is not enough: you also need to update the main documentation page, the playbooks, and the codelabs. If possible, add comments to the header files as well. Describe the details of the announcement in a document (in addition to the email) that users can refer to. The email should be as short as possible while still being informative and covering all the main points. Put the additional details in the document: the business motivation for turning off the service, the tools users can use to migrate to another service, and the support available during migration. It is also worth creating a FAQ page and adding to it over time as users ask questions.

The Role of Technical Documentation Editors

Technical editors (or technical writers) provide services that make SRE more efficient and productive. Their work is not limited to writing individual documents to requirements specified by the SRE team.

Here are some practical recommendations for technical editors working with SRE teams.

Technical editors work with SRE to create documentation for operating the services SRE runs and product documentation for SRE products and tools.
They create and update documentation repositories, structure and reorganize them to match users' needs, and improve individual documents as part of overall repository management.
Editors advise on how to improve documentation and information management. This includes assessing documentation to gather requirements, improving documents and sites created by engineers, and advising teams on how to create, organize, redesign, search, and maintain documentation.
Editors should evaluate and improve documentation tools to provide the best solutions for SRE.

Templates

Technical editors also provide templates that simplify the creation and use of SRE documentation. Templates do the following:

Simplify the creation of documentation, giving engineers a clear structure for creating new documents.
Ensure completeness by providing a section for every piece of required information.
Help the reader to quickly understand the topic of the document, the type of information and how it is organized.

The book Site Reliability Engineering contains several sample documentation templates. In this section, we provide a few more examples to show how templates give engineers a structure and a guide to fill in with content.

Service

Overview

What is it? What does it do? Describe at a high level the functionality provided to clients (end users, components, and so on).

Architecture

Explain how the architecture works. Describe how data moves between components. Consider adding a system diagram that shows critical dependencies and the request and data flows.

Clients and Dependencies

List all clients (owned by other teams) that depend on this service and all services (owned by other teams) on which it depends. (This can also be shown as a system diagram.)

Code and Configuration

Explain the production setup. Where does the service run? List the binaries, jobs, data centers, and configuration files, or indicate where they can all be found. Also provide the location of the code and, if necessary, information about the build.

List and describe the configuration files, changes, and ports required to operate this product or service.

Describe the following: Which configuration files are modified for this product or service? How is it configured?

Processes

Describe the following: What daemons and other processes should be running for the service to work? What control scripts were created to control the service?

Output

List and describe the log files created by the component and the monitoring that is performed. Describe the following: What logs does this component generate? What is in each file? What are the recommendations for examining these files? Which aspects of the component should be monitored to keep the service reliable?

Dashboards and Tools

Insert links to relevant dashboards and tools.

Capacity

Specify the capacity of a single instance, of a data center, and of the service globally: QPS, bandwidth, and latency values.

SLA

Provide availability targets.

Standard Procedures

Add links to procedures such as load testing, updates / pushes / flag changes, and so on. Add links to the alert documentation in the alert playbook.

References

Add links to the design documentation of the component or related components, usually written by the development team, as well as other related information.

Playbook

Title

In the title, specify the name of the alert (for example, Generic_Alert_AlertTooGeneric).

Overview

Describe the following: What does this alert mean? Does it page someone, or does it only send email? What factors trigger the alert? Which parts of the service are affected? Which other alerts are associated with it? Who needs to be notified?

Alert Severity

Explain the severity of the alert and how the affected parts impact the overall health of the service.

Verification

Provide clear instructions on how to verify that the condition is still ongoing.

Troubleshooting

List and describe debugging methods and related sources of information. Do not forget to link to the relevant dashboards. Include any warnings. Describe the following: What appears in the logs when the alert fires? What debug handlers are available? Are there any useful scripts and commands? What output do they generate? Are there any additional tasks to complete after the alert has been resolved?

Solution

List and describe all possible solutions to the problem that causes the alert. Describe the following: How do you fix the problem and clear the alert? What commands should be run to reset the service? Who should be notified if the alert fired because of user actions? Who has experience debugging a similar problem?

Escalation

List and describe the escalation path. Indicate the person or team to notify and when to do it. If escalation is not necessary, state that explicitly.

Related Links

Provide links to related alerts, procedures, and overview documents.

Quarterly Service Report

Introduction

Describe the service for which the team is responsible.

Capacity Planning

Includes:

Actual demand for the service over the past 6-8 quarters, expressed in the metrics most relevant to the service (for example, QPS or DAU).
Demand forecast for the next 8 quarters.
A capacity plan that meets the forecast demand at the required level of redundancy. Call out any shortfalls and/or capacity planning risks.

We also recommend including the forecasts made for the past 2-4 quarters so that the reader can evaluate the stability and accuracy of the forecasts.
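One way to make that evaluation concrete is to compare each past forecast with the demand that was actually observed. The sketch below is an illustration under stated assumptions: the QPS figures are invented, and mean absolute percentage error is just one common accuracy measure, not one prescribed by the report template.

```python
def forecast_errors(forecast, actual):
    """Return the signed relative error of each forecast (fraction of actual demand)."""
    if len(forecast) != len(actual):
        raise ValueError("forecast and actual must cover the same quarters")
    return [(f - a) / a for f, a in zip(forecast, actual)]


def mean_absolute_percentage_error(forecast, actual):
    """Average size of the forecast error, as a percentage of actual demand."""
    errors = forecast_errors(forecast, actual)
    return 100 * sum(abs(e) for e in errors) / len(errors)


# Hypothetical peak QPS for the past four quarters (made-up numbers).
forecast_qps = [1000, 1200, 1500, 1800]
actual_qps = [1000, 1000, 1500, 2000]

mape = mean_absolute_percentage_error(forecast_qps, actual_qps)
print(f"Mean absolute percentage error: {mape:.1f}%")
```

The signed errors from forecast_errors are worth reporting too: a run of errors with the same sign suggests the forecasts are consistently too high or too low, which an averaged figure hides.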

Performance Against SLA / Availability

All services supported by SRE must have a written SLA against which performance is evaluated each quarter.

The SLA section should contain the metrics for the main service components that are used to measure quarterly compliance with the SLA, as well as a link to the team's written SLA.
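As a minimal sketch of how such a quarterly comparison might be computed, the example below assumes that availability is measured as the share of successful requests. The article does not specify a measurement method, and the request counts and the 99.9% target are invented for illustration.

```python
def availability(successful_requests, total_requests):
    """Request-based availability: the share of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def sla_summary(successful_requests, total_requests, target):
    """Compare measured quarterly availability with the SLA target."""
    if not 0 < target <= 1:
        raise ValueError("target must be a fraction in (0, 1]")
    measured = availability(successful_requests, total_requests)
    return {
        "measured": measured,
        "target": target,
        "met": measured >= target,
    }


# Hypothetical quarter: 10 million requests, 7,000 failures, 99.9% target.
summary = sla_summary(
    successful_requests=9_993_000,
    total_requests=10_000_000,
    target=0.999,
)
print(f"Measured {summary['measured']:.4%} against target {summary['target']:.1%}; "
      f"SLA met: {summary['met']}")
```

Whatever definition a team actually uses (request-based, time-based, or something else), stating it next to the numbers lets readers of the report reproduce the calculation.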

Related Incidents (Optional)

List the 3-5 major incidents or failures of the quarter.

Achievements (Optional)

List the main achievements for the quarter.

SLA Changes (Preferred)

Recent changes in SLA.

Service Details (Preferred)

This section may include growth figures, latency statistics, and so on.

Team Information (Optional)

May include information on team composition, status, projects, and on-call statistics.

Data Sources (Required)

Describe the sources used to obtain the availability values and the calculation methods, and provide links to the relevant dashboards.

Team Charter

Who We Are

In one sentence (about one line), describe the team's technical domain, customers, and offerings, as well as the degree of SRE involvement and any special expertise.

Supported Services

To further clarify the scope of work, describe the services (or their group) that the team supports.

How We Distribute Time

Defining the scope of work helps the team create a roadmap and achieve and sustain its long-term goals.

Team Values

Clearly describe the values. This affects how the team members interact with each other, and how your team is perceived by others.

Conclusion

Whether you are an SRE, an SRE manager, or a technical editor, you now understand how critical documentation is to the life of an effective SRE team. Good documentation allows the SRE team to grow and to follow a clear methodology for managing new and existing services.

This concludes the final part of the article. The first and second parts are available via the links, and you can find even more useful information in our open lesson, which will be held on February 19. We look forward to seeing everyone!
