Kubernetes success stories in production. Part 10: Reddit

    Last week it was announced that all new Reddit services are now launched in production on infrastructure built on Kubernetes clusters. This is a significant milestone in the migration to K8s of one of the most popular online resources, and here is how it came about...



    Background: today Reddit is among the top 20 sites in the world (and No. 6 in the USA) according to Alexa. This online community of American origin has more than 400 million monthly active users, 12 million posts, and 2 billion votes per day.

    Greg Taylor, who heads engineering in the Release Engineering group, explained why and how Reddit's engineers came to Kubernetes in a talk (presentation + video) at KubeCon 2018 last December.



    Why did they come to Kubernetes?


    At the beginning of 2016 the service, implemented as a monolithic application, was maintained by only about 20 engineers in 3 teams, one of which, the Infrastructure team, is something of a hero of this story. That year brought great changes, however: by its end more than 60 engineers worked at the company, and by the end of 2018 their number had grown to 200, a 10-fold increase in staff in just 3 years.

    Such rapid growth exposed the inefficiency of the monolithic architecture: it had become very difficult for different teams to make numerous changes to its various components. After meeting to address the problem and considering numerous options, the engineers chose the path of Service-Oriented Architecture (SOA).

    In moving from a big monolith to a service architecture, Reddit ran into a new problem. The Infrastructure team became a bottleneck for developers, who turned out to depend on it heavily at several stages: when services were first set up, during their ongoing operation, and when debugging and solving performance problems. As a quick fix, the company formed more self-sufficient teams, called “infrastructure-oriented” teams. Their members had the infrastructure operations skills needed to overcome many difficulties without waiting for the Infrastructure team, which was overloaded with an endless backlog of requests from numerous developers.

    However, this was still a temporary solution, and practice showed that not everyone wanted to operate the entire stack for their service.



    How was this situation resolved? The company introduced the concept of service owners: teams that develop their service from start to finish, deploy it early and often, and operate it (including its availability and performance). But how can this be achieved?

    Instead of expecting teams of engineers with impeccable skills to assemble services from dozens of building blocks, many of which they may know nothing about, you need to offer them a well-thought-out, predefined path for bringing services to production that touches a minimum of technologies. This saves engineers from having to learn numerous new technologies and tools, of which there can be a great many.



    “To put this idea into practice, we needed to ‘package’ our knowledge, processes, best practices, and much more into a more accessible form.”

    InfraRed - Kubernetes at Reddit


    This is how InfraRed, Reddit's internal infrastructure product based on Kubernetes, came about.

    How were the three needs named in the definition of service owners (develop, deploy, operate) met?

    1. Development


    The organization's development standard does not prescribe a specific language or framework. Instead, it sets the general “shape” that a service must conform to. The standard is a service specification independent of the programming language, and it covers the RPC protocol, handling of secrets, reporting of metrics, tracing, and the log output format. An example implementation of such a specification in Python can be found in the baseplate project, which is unlikely to be of practical use outside Reddit but can serve as inspiration.
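
    As a rough illustration of what such a language-independent “shape” could look like, here is a minimal Python sketch. It is not taken from baseplate or from Reddit's actual specification: the `ServiceShape` class, its field names, the `violations` check, and the JSON log layout are all assumptions made for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceShape:
    """Hypothetical declaration of the points the standard covers."""
    name: str
    rpc_protocol: str       # assumed value such as "thrift"; the article names no protocol
    secrets_path: str       # where the service reads its secrets from
    metrics_namespace: str  # prefix for every metric the service reports
    tracing_enabled: bool
    log_format: str         # this sketch assumes "json" is the agreed format


def violations(shape: ServiceShape) -> list:
    """Return human-readable reasons why a service does not fit the shape."""
    problems = []
    if not shape.rpc_protocol:
        problems.append("no RPC protocol declared")
    if not shape.secrets_path:
        problems.append("no secrets location declared")
    if not shape.metrics_namespace:
        problems.append("no metrics namespace declared")
    if not shape.tracing_enabled:
        problems.append("tracing is disabled")
    if shape.log_format != "json":
        problems.append("logs are not emitted as JSON")
    return problems


def log_line(shape: ServiceShape, level: str, message: str, **fields) -> str:
    """Render one log record in the (assumed) common JSON format."""
    record = {"service": shape.name, "level": level, "message": message}
    record.update(fields)
    return json.dumps(record, sort_keys=True)
```

    The point of the sketch is that the check looks only at what a service declares, not at the language it is written in, so the same rules could apply to Python, Go, or Node services alike.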

    In addition, quick-start materials were created for writing new services: code stubs for different languages (Python, Go, Node), as well as a Dockerfile, CI configs, and even Helm charts.

    To help with local development, Reddit's engineers chose a Google product, Skaffold, which offers developers an end-to-end edit → rebuild → refresh cycle that:

    • does not require in-depth knowledge of Kubernetes;
    • is as close as possible to production;
    • allows the use of standard charts / images;
    • unlike the previously used Minikube, does not demand huge resources from developers' laptops (because rollout is done to remote clusters).

    2. Deploy


    Reddit uses the continuous delivery platform Drone to run tests and build artifacts (usually Docker images).

    Deployment to Kubernetes originally used the Helm plugin for Drone, but the engineers quickly concluded that this setup did not suit them: they wanted a system that “better understands the state of created or updated objects.” Further automation of the deployment process also called for a solution that could query the tools in use and pause or roll back a deployment in case of failures or performance problems.
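
    The requirement described above, a deployment step that checks the result and reverts on failure, can be sketched in a few lines of Python. This is only an illustration of the idea, not Spinnaker's or Helm's API: the `Rollout` class, its `deploy` method, and the `healthy` callback are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rollout:
    """Hypothetical model of one service's deployment state."""
    current_version: str
    history: List[str] = field(default_factory=list)

    def deploy(self, new_version: str, healthy: Callable[[str], bool]) -> bool:
        """Roll out new_version, verify it, and roll back if the check fails."""
        previous = self.current_version
        self.history.append(f"deploy {new_version}")
        self.current_version = new_version

        # The health check stands in for whatever signal the real tooling
        # would query (object state, metrics, error rates).
        if healthy(new_version):
            self.history.append(f"verified {new_version}")
            return True

        # Verification failed: return to the last known good version.
        self.history.append(f"rollback to {previous}")
        self.current_version = previous
        return False
```

    For example, calling `Rollout("v1").deploy("v2", check)` with a `check` that returns `False` leaves the service on `v1` and records the rollback in `history`.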

    As a result, Spinnaker was chosen to orchestrate deployments to Kubernetes. Templates for typical pipelines were created for it (in Jsonnet). Helm charts are then rendered and rolled out to Kubernetes by Spinnaker. Users receive information on the progress of the deployment and help with diagnosis if any problems arise. Here is what a typical deployment to staging / production looks like, in very general terms:



    3. Operation


    First, how are responsibilities divided between service owners and the Infrastructure team?

    • Service owners: understand the basics of Kubernetes, and deploy and operate their own services;
    • Infrastructure team: keeps the Kubernetes clusters operational (rollout, maintenance, scaling), provides them with all the necessary resources, and advises the organization's engineers on designing reliable, performant, fault-tolerant services (in particular, training sessions are held regularly, and recordings are then distributed throughout the company).

    Service owners have limited permissions. However, to gain access to production (to diagnose a problem), they can request, through a special console utility, a temporary token that gives them full rights to their own namespaces.
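
    The idea of a time-limited grant scoped to a team's own namespaces can be sketched as follows. The article does not describe how Reddit's console utility works, so everything here is an assumption made for illustration: the `TemporaryGrant` type, the `issue_grant` and `is_allowed` functions, and the one-hour lifetime used in the example.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TemporaryGrant:
    """Hypothetical short-lived credential tied to specific namespaces."""
    token: str
    owner: str
    namespaces: frozenset
    expires_at: float  # Unix timestamp after which the grant is void


def issue_grant(owner: str, namespaces: Iterable[str], ttl_seconds: float,
                now: Optional[float] = None) -> TemporaryGrant:
    """Create a random token valid for ttl_seconds in the given namespaces."""
    issued = time.time() if now is None else now
    return TemporaryGrant(
        token=secrets.token_urlsafe(16),
        owner=owner,
        namespaces=frozenset(namespaces),
        expires_at=issued + ttl_seconds,
    )


def is_allowed(grant: TemporaryGrant, namespace: str,
               now: Optional[float] = None) -> bool:
    """Full rights apply only inside the owner's namespaces and before expiry."""
    current = time.time() if now is None else now
    return current < grant.expires_at and namespace in grant.namespaces
```

    A grant issued with `issue_grant("alice", ["search"], 3600)` would allow access to the `search` namespace for an hour and to no other namespace at any time. The `now` parameter exists only to make the expiry logic easy to test.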

    Another important aspect of operation is minimizing the potential damage that can come from various sources. Here is what Reddit does to that end:



    To make life easier for the engineers who operate services, the following are also employed:


    Kubernetes status at Reddit


    As of last December, the overall statistics for Reddit's Kubernetes infrastructure were as follows:

    • 7 clusters (3 to 6 more were to be added over the next few months);
    • between a third and a half of all engineering teams work with Kubernetes;
    • about 20 Reddit services run in production on K8s;
    • 10-20 deployments of these services to K8s take place on a working day.

    InfraRed with Kubernetes was planned to become available to the entire organization in the first quarter of 2019, meaning that any new service would be deployed to production running on Kubernetes. (At that time, this was already the case for about 3 out of 4 new services.)

    As mentioned at the beginning of the article, this milestone was successfully reached just last week.





