
Kubernetes & production - to be or not to be?
Hundreds of containers. Millions of external requests. Billions of internal transactions. Monitoring and problem notification. Simple scaling. 99% uptime. Deploying and rolling back releases.


Kubernetes as a solution to all problems! "To be or not to be?" - that's the question!
Disclaimer
Despite the public nature of this article, I am most likely writing it primarily for myself, as a conversation with a rubber duck. After more than two years of sailing with "hipster" technologies, I have to step aside and soberly assess how reasonable that choice was, and how adequate it will be for my next project.
Nevertheless, I really hope this publication finds its readers and helps many of them approach the choice, or rejection, of Kubernetes well prepared.
I will try to describe all the experience we gained at `Lazada Express Logistics`, a company that is part of `Lazada Group`, which in turn is part of `Alibaba Group`. We develop and support systems that automate, as much as possible, the entire operational cycle of delivery and fulfillment in the six largest countries of Southeast Asia.
Prerequisites for use
A representative of a company selling cloud solutions around the world once asked me: "So what is a 'cloud' for you?" After pausing for a couple of seconds (and thinking: "Hmm... our dialogue is clearly not about condensations of water vapor suspended in the atmosphere..."), I replied that, for me, it is like one super-reliable computer with unlimited resources and practically no overhead for moving data around (network, disk, memory, and so on). It is as if my laptop worked for the whole world, could hold that load, and I alone could manage it.

So why do we actually need this cloud miracle? It is all very simple! We strive to make life easier for developers, system administrators, devops engineers, and technical managers, and a properly prepared cloud makes life easier for all of them. And besides, monomorphic systems working for a business are always cheaper and carry fewer risks.
We set out to find a simple, convenient, and reliable private cloud platform for all our applications and for all of the team roles listed above. We conducted a small study: Docker, Puppet, Swarm, Mesos, Openshift + Kubernetes, Kubernetes vs Openshift... and settled on the latter: Kubernetes without any add-ons.
The functionality described on the very first page of its documentation fit perfectly and was suitable for our entire enterprise. A detailed study of that documentation, chats with colleagues, and a little quick-testing experience gave us confidence that the authors of the product were not lying, and that we could get our magnificent cloud!
We rolled up our sleeves. And away we went...
Problems and solutions

3-tier architecture
Everything starts from the basics. To create a system that can live well in a Kubernetes cluster, you will have to think through the architecture and the development processes, configure a whole bunch of delivery mechanisms and tools, and learn to live with the constraints and concepts of the Docker world and of isolated processes.

As a result, we came to the conclusion that the ideology of microservice and service-oriented architecture could not suit our tasks better. If you have read Martin Fowler's article on this topic, you should have a rough idea of the titanic work that must be done before the first service comes to life.
My checklist divides the infrastructure into three layers and roughly describes what you need to remember at each level when creating such systems. The three layers in question:
- Hardware - servers and physical networks
- Cluster - in our case, Kubernetes and the system services supporting it (flannel, etcd, confd, docker)
- Service - the process itself, packaged in Docker: a micro/macro service in your domain
In general, the idea of a 3-tier architecture and the tasks associated with it is a topic for a separate article. But it will not come out before this very checklist is impeccably complete. That may never happen :)
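To make the `Service` layer a little more concrete, a minimal Kubernetes Deployment for one Docker-packaged process might look like the sketch below. The service name, image, and registry are hypothetical, not taken from our actual systems:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: parcel-tracker            # hypothetical domain service
spec:
  replicas: 3                     # the cluster keeps three copies alive
  selector:
    matchLabels:
      app: parcel-tracker
  template:
    metadata:
      labels:
        app: parcel-tracker
    spec:
      containers:
      - name: parcel-tracker
        image: registry.example.com/parcel-tracker:1.0.0
        ports:
        - containerPort: 8080
```

Everything below this manifest - scheduling, restarts, networking - belongs to the `Cluster` and `Hardware` layers.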
Qualified specialists
As the topic of private clouds becomes relevant and interesting to more and more medium and large businesses, the question of qualified architects, devops engineers, developers, and database administrators capable of working with them becomes just as pressing.

The reason is that new technologies enter the market before they have time to acquire the necessary amount of documentation, training articles, and answers on `Stack Overflow`. Despite this, technologies such as, in our case, Kubernetes become very popular and create a shortage of personnel.
The solution is simple: you need to grow specialists inside the company! Fortunately, in our case, we already knew what Docker was and how to cook it; we only had to catch up on the rest.
Continuous Delivery / Integration
Despite all the charm of the "smart cloud cluster" technology, we still needed a means of communicating with Kubernetes and installing objects inside it. Having walked the path from a self-written bash script with hundreds of branches of logic, we ended up with quite understandable and readable Ansible recipes. To fully transform Docker files into live objects, we needed:
- A set of standard solutions:
  - Team City - for automated deployments
  - Ansible - for building templates and delivering/installing objects
  - Docker Registry - for storing and delivering Docker images
- images-builder - a script that recursively searches the repository for Dockerfiles, builds images from them, and pushes the images to the central registry
- Ansible Kubernetes Module - a module for installing objects with different strategies depending on the object (create or update / create or replace / create or skip)
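Roughly, a playbook step with such a module can look like the sketch below. The module name and its parameters here are illustrative, not the exact API of our module:

```yaml
# Hypothetical sketch: render a Jinja2 template into a Kubernetes
# manifest, then hand it to the module with an install strategy.
- name: Render the deployment manifest
  template:
    src: deployment.yml.j2
    dest: /tmp/parcel-tracker-deployment.yml

- name: Install the object into the cluster
  kubernetes:
    file: /tmp/parcel-tracker-deployment.yml
    strategy: create_or_update    # or: create_or_replace / create_or_skip
```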

Among other things, we studied Kubernetes Helm. But we could not find the killer feature that would make us abandon Ansible templating or replace it with Helm charts, and we could not find any other useful capabilities in that solution either.
For example, how do you verify that one of the objects has been installed successfully, so that you can continue rolling out the others? Or how do you fine-tune containers that are already running, when you just need to execute a couple of commands inside them?
These and many other questions force you to treat Helm as a simple template engine. But why bother, if Jinja2, which is already part of Ansible, outclasses any non-core solution?
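To illustrate the templating point: this is roughly what a simplified, hypothetical `deployment.yml.j2` looks like when Jinja2 does the job usually assigned to Helm charts (all variable names here are assumptions):

```yaml
{# deployment.yml.j2 - variables such as service_name, registry,
   version, and replicas come from Ansible group_vars/inventory #}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ service_name }}
spec:
  replicas: {{ replicas | default(2) }}
  selector:
    matchLabels:
      app: {{ service_name }}
  template:
    metadata:
      labels:
        app: {{ service_name }}
    spec:
      containers:
      - name: {{ service_name }}
        image: {{ registry }}/{{ service_name }}:{{ version }}
```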
Stateful Services
As a complete solution for any type of service, including stateful ones, Kubernetes comes with a set of drivers for working with network block devices. In the case of AWS, the only acceptable option is EBS.
As you can see, the k8s issue tracker is full of EBS-related bugs, and they are resolved rather slowly. Today we do not suffer from any serious problems, apart from the fact that it sometimes takes up to 20 minutes to create an object with persistent storage. The quality of the EBS-k8s integration is very, very, very dubious.
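For reference, claiming an EBS-backed volume goes through the standard PersistentVolumeClaim mechanism; a minimal sketch (the claim name and the `gp2` storage class are assumptions of a typical AWS setup):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data
spec:
  accessModes:
    - ReadWriteOnce      # an EBS volume can be attached to a single node only
  storageClassName: gp2  # assumed AWS EBS storage class
  resources:
    requests:
      storage: 20Gi
```

The waiting happens between this claim being created and the volume actually being provisioned and attached.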

However, even if you use other storage solutions and do not experience any particular problems, you will still need quality solutions for everything that can store data. We spent a lot of time filling in the gaps and providing a quality solution for each of these cases:
- PostgreSQL cluster (described in articles on Habr and on Medium)
- RabbitMQ cluster
- Redis cluster
- PostgreSQL backup script
Among other things, Kubernetes, and the Docker world in general, sometimes forces you into tricks and subtleties that seem obvious at first glance but require a solution of their own.
A small example.
Logs cannot be collected in files inside a running Docker container, yet many systems and frameworks are not ready to stream to `STDOUT`. You need patching and conscious development at the system level: writing to pipes, taking care of processes, and so on. A little time, and we had a Monolog handler for `php` that can emit logs in the form Docker/k8s understands.
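The idea itself is language-agnostic. As an illustration (in Python rather than our actual PHP/Monolog handler), a process living under Docker/k8s should write one structured log line per event to `STDOUT` and let the cluster's log collector do the rest:

```python
import json
import logging
import sys

class JsonLineFormatter(logging.Formatter):
    """Format each record as a single JSON line, which Docker's logging
    driver (and collectors such as FluentD) can pick up from STDOUT."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

def make_stdout_logger(name):
    # STDOUT, not a file inside the container
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

logger = make_stdout_logger("orders")
logger.info("parcel %s accepted", "LEX-1234")
```

One JSON object per line keeps the stream trivially parseable downstream, without multi-line stack-trace gymnastics.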
API Gateway
As part of any microservice or service-oriented architecture, you will most likely need a gateway. But that is true of the architecture in general; here I want to focus on why it is especially important for the cluster and the services embedded in it.
Everything is quite simple: you need a single point at which access to all of your services can be denied or allowed.

There are a number of tasks that we solved in the context of the Kubernetes cluster:
- Access control and limiting of requests from outside - as an example, a small Lua script sheds light on the problem
- A single point of user authentication/authorization for any service
- Avoiding a multitude of services that need HTTP access from the `world` - reserving a port on the servers for each service is harder to manage than routing in Nginx
- Kubernetes-AWS integration for the AWS Load Balancer
- A single point for monitoring HTTP statuses - convenient even for internal communication between services
- Dynamic routing of requests to services or service versions, A/B tests (alternatively, the problem can be solved by different pods behind a Kubernetes service)
An experienced Kubernetes user will hasten to ask about the Kubernetes Ingress Resource, which is designed specifically for solving such problems. All true! But, as you may have noticed, we required a few more `features` for our API Gateway than Ingress offers. Moreover, Ingress is just a wrapper around Nginx, which we already know how to work with.
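As a taste of what such a gateway gives you, here is a hypothetical Nginx fragment combining request limiting with routing to an in-cluster service. The zone size, rate, paths, and service name are illustrative, not our production values:

```nginx
# one shared zone: at most 10 req/s per client IP
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;

    # only this vhost is exposed to the `world`; everything behind it
    # is reached through cluster routing, not per-service host ports
    location /api/orders/ {
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://orders.default.svc.cluster.local:8080/;
    }
}
```

Authentication, status monitoring, and A/B routing hang off the same single entry point.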
Current state
Despite the myriad nuances and problems associated with installing, using, and supporting the solution described above, if you are stubborn enough you will most likely succeed and end up with roughly what we have today.

What the platform looks like in its current state - a few dry facts:
- Only 2-3 people are needed to support the entire platform
- One repository storing all the information about the entire infrastructure
- 10-50 independent automated releases per day - CI/CD mode
- Ansible as the cluster management tool
- A few hours to create an identical `live` environment - locally on minikube or on real servers
- AWS-based architecture built on EC2 + EBS, CentOS, and Flannel
- 500~1000 pods across the system
- The technology stack wrapped in Docker/k8s: Go, PHP, Java/Spring FW/Apache Camel, Postgres/Pgpool/Repmgr, RabbitMQ, Redis, Elasticsearch/Kibana, FluentD, Prometheus, etc.
- No infrastructure outside the cluster, with the exception of monitoring at the `Hardware` level
- Centralized log storage based on Elasticsearch within the Kubernetes cluster
- A single point for collecting metrics and alerting on problems, based on Prometheus
The list reflects many facts, but the obvious advantages and pleasant features of Kubernetes as a Docker process management system remain unmentioned. You can read more about them on the official Kubernetes website and in articles on the same Habr or Medium.
The list of our wishes that are still at the prototype stage or cover only a small part of the system is also very long:
- Profiling and tracing system - for example, zipkin
- Anomaly detection - machine-learned algorithms that analyze problems across hundreds of metrics, for cases when we cannot or do not want to understand what each metric or set of metrics means separately, but still want to know about the problem associated with them
- Automatic capacity planning and scaling of the number of pods in a service and servers in a cluster based on specific metrics
- Intelligent backup management system - for any stateful services, primarily databases
- Network monitoring and communication visualization system - inside a cluster, between services and pods, first of all ( interesting example )
- Federation mode - distributed and connected operation mode of several clusters
So to be or not to be?
An experienced reader has most likely already guessed that this article is unlikely to give an unambiguous answer to such a seemingly simple, short question. Many details and little things can make your system insanely cool and productive. Or another set of bugs and crooked implementations can turn your life into hell.
You decide! But my opinion on all of this is: "TO BE!.. but very carefully"