Google and DevOps: two books about SRE

    For my first ten years at Google, I worked as an ordinary engineer: I launched public transit on Maps, improved Search and caught spam on YouTube. At some point, it turned out that next door to the SWE (Software Engineer) teams lived some mysterious SREs (Site Reliability Engineers), who spent their days in production and knew everything about infrastructure, configs and monitoring. Usually they came to us with incomprehensible graphs and strongly recommended that we rewrite something in our service, so that it would explode neatly and in pieces rather than all at once together with all its neighbors. Or they built some piece of infrastructure that magically solved all our problems once and for all. Or they reported that there would be no second release this week, because one data center had been washed away by a hurricane, and next to another one someone had been burying a horse and cut the trunk cable.

    As a result, I wondered what all this SRE business looks like from the inside, and I joined Mission Control, a rotation program that lets you spend six months in the SRE role, gain valuable production experience and, if you wish, return to your previous team to share what you have learned. I stayed instead, like two thirds of my current Video Processing SRE colleagues, who were also retrained from ordinary engineers. Now I am the one who scares SWEs with incomprehensible graphs and evacuates YouTube videos from burning data centers, with breaks for peaceful creative coding. It turned out that over fifteen years a healthy and effective SRE organization, with its own practices, principles and methods, had grown up within Google, but nobody knew about them, because of those who went there, no one had yet come back.

    The solution to this problem of information about on-call duty, SLOs and postmortems disappearing into the black hole of Google SRE was the book “Site Reliability Engineering”, which describes in detail how this SRE of ours actually works. In fact, this whole post was written for the sake of two pieces of news:

    1. Two weeks ago, a Russian translation of the aforementioned SRE book was released. If you are interested in how to build healthy DevOps practices in your company, this book is for you. If you suspect SRE inclinations in yourself, then this book is even more for you.
    2. Following the first book, “The Site Reliability Workbook” has just been released (so far only in English), with practical examples from the life of Google Cloud Platform. I strongly recommend it as well.
