From clouds to earth: how to create a production-grade Kubernetes in any conditions

Original author: Steven Wong, Michael Gasch
  • Transfer
All good! Well, that time has come for our next Devops course . Probably, this is one of the most stable and standard courses, but at the same time it is the most diverse among students, since no group has yet looked like the other: in one developer, almost completely, then in the next engineers, then admins, and so on. And it also means that the time has come for interesting and useful materials, as well as online meetings.


This article contains recommendations for launching the production-grade Kubernetes cluster in on-premise data center or peripheral locations (edge ​​location).
What does production-grade mean?

  • Safe installation;
  • Deployment is managed through a repetitive and recorded process;
  • Work is predictable and consistent;
  • You can safely update and customize;
  • To detect and diagnose errors and lack of resources, there is logging and monitoring;
  • The service has sufficient “high availability” given the available resources, including restrictions in money, physical space, power, and so on.
  • The recovery process is available, documented and tested for use in case of failures.

In short, a production-grade means anticipating errors and preparing a recovery with a minimum of problems and delays.

This article focuses on the on-premise deployment of Kubernetes on a hypervisor or bare-metal platform, given the limited amount of support resources in comparison with the increase in basic public clouds. Nevertheless, a part of these recommendations may be useful for a public cloud if the budget limits the selected resources.

Deploying a single-metal bare-metal Minikube can be a simple and cheap process, but it is not a production-grade. Conversely, you will not be able to reach the level of Google with Borg in an offline store, branch or peripheral location, although it is unlikely that you need it.

This article outlines tips for achieving a production-level Kubernetes deployment, even in a resource limiting situation.

Important Components in the Kubernetes Cluster

Before delving into the details, it is important to understand the overall architecture of Kubernetes.
The Kubernetes cluster is a highly distributed system based on the control plane and the architecture of clustered worker-nodes, as shown below:

Typically, the components of the API Server, Controller Manager and Scheduler are located in several instances of the control level nodes (it is called the Master). Master nodes typically also include etcd, however there are large and highly available scripts that require running etcd on independent hosts. Components can be run as containers and, optionally, under the supervision of Kubernetes, that is, work as static pods.

For high availability, redundant instances of these components are used. Significance and required level of redundancy may vary.
ComponentRolesConsequences of lossRecommended Instances
etcdMaintains the state of all Kubernetes objectsCatastrophic loss of storage. Most of the loss = Kubernetes loses control level, API Server depends on etcd, read-only API calls that do not need quorum, as well as already created workloads, can continue to work.odd number, 3+
API ServerProvides API for external and internal use.It is impossible to stop, start, update new scams. Scheduler and Controller Manager depend on API Server. Loads continue if they are independent of API calls (operators, custom controllers, CRD, etc.)2+
kube-schedulerPlaces pods on nodes Pods cannot be placed, prioritized and moved between them.2+
kube-controller-managerControls many controllersThe main control loops responsible for the state stop working. The integration of the in-tree cloud provider is breaking.2+
cloud-controller-manager (CCM)Out-of-tree integration of cloud providersIntegration of cloud provider breaks downone
Additions (for example, DNS)VariousVariousDepends on the add-on (for example, 2+ for DNS)

The risks of these components include hardware failures, software bugs, bad updates, human errors, network interruptions and system overload leading to resource depletion. Excessiveness can reduce the impact of these hazards. In addition, thanks to the functions of the hypervisor platform (resource planning, high availability), you can multiply the results using the Linux operating system, Kubernetes, and the container runtime.

The API Server uses multiple load balancer instances to achieve scalability and availability. Load balancer is an important component for high availability. Several A-records of the DNS API server can serve as an alternative in the absence of a balancer.

The kube-scheduler and kube-controller-manager participate in the process of choosing a leader instead of using a load balancer. Since cloud-controller-manager is used for certain types of hosting infrastructure, the implementation of which may vary, we will not discuss them - just let us indicate that they are a component of the control layer.

Pods running on the Kubernetes worker are managed by a kubelet agent. Each worker instance runs a kubelet agent and a CRI- compatible container launch environment. Kubernetes itself is designed to monitor and recover from a crash node. But for critical load functions, managing hypervisor resources and isolating loads, it can be used to improve availability and increase the predictability of their work.


etcd is the persistent storage for all Kubernetes objects. The availability and recoverability of the etcd cluster should be a priority when deploying a production-grade Kubernetes.

An etcd cluster of five nodes is the best option if you can enable it. Why? Because you will be able to maintain one, and still endure failure. A cluster of three nodes is the minimum that we can recommend for a production-grade service, even if only one host hypervisor is available. More than seven nodes are also not recommended, with the exception of very large installations covering several availability zones.

The minimum recommendations for hosting nodes etcd cluster - 2GB RAM and 8GB SSD hard drive. Usually, 8GB of RAM and 20GB of hard disk space is sufficient. Disk performance affects the recovery time of the node after a failure. Take a look to find out the details.

In special cases, think about several clusters etcd.

For very large Kubernetes clusters, consider using a separate cluster etcd for Kubernetes events, so that too many events do not affect the main Kubernetes API service. When using the Flannel network, the configuration is saved in etcd, and version requirements may differ from Kubernetes. This can make it difficult to back up etcd, so we recommend using a separate etcd cluster specifically for the flannel.

Single Host Deployment

The list of accessibility risks includes hardware, software, and human factors. If you are limited to a single host, the use of redundant storage, error-correcting memory, and dual power supplies can improve security against hardware failures. Running a hypervisor on a physical host allows you to use redundant software components and adds operational benefits associated with deploying, updating, and controlling resource utilization. Even in stressful situations, the behavior remains repeatable and predictable. For example, even if you can only afford to launch singletones from master services, they should be protected from overload and resource depletion, competing with the workload of your application. The hypervisor can be more efficient and easy to use,

You can deploy three etc etc virtual machines if resources allow on the host. Each VM should be supported by a separate physical storage device or use separate parts of the storage using redundancy (mirroring, RAID, etc.).
Dual redundant instances of the server API, scheduler and controller manager are the next upgrade, if your only host has enough resources for this.

Single host deployment options, from least suitable for production to most
Type ofSpecificationsResult
Minimum equipmentSingleton etcd and master components.Home laboratory, not at all production-grade. Several single points of failure (Single Point of Failure, SPOF). Recovery is slow, and with the loss of storage is completely absent.
Improved storage redundancyetcd singleton and master components, etcd storage is redundant.At a minimum, you can recover from a storage failure.
Managed level redundancyNo hypervisor, multiple instances of managed-level components in static hearths.There was protection against software bugs, but the OS and container launch environment are still single points of failure with devastating updates.
Adding HypervisorRun three redundant managed level instances in the VM.There was protection against software bugs and human error and an operational advantage in installation, resource management, monitoring and security. OS upgrades and container launch environments are less destructive. The hypervisor is the only single point of failure.

Deploying to two hosts

With two hosts, storage problems etcd are similar to the single-host option — you need redundancy. It is preferable to run three encd instances. It may seem non-intuitive, but it is better to concentrate all etcd nodes on the same host. You do not increase reliability by dividing them by 2 + 1 between two hosts - losing nodes with most encd instances leads to interruption, regardless of whether they are 2 or 3. If the hosts are not identical, place the etcd cluster entirely on the more reliable one.

It is recommended to launch redundant API servers, kube-schedulers and kube-controller-managers. They should be shared between hosts to minimize the risk of failure of the container launch environment, operating system, and equipment.

Running a hypervisor layer on physical hosts will allow you to work with redundant program components, ensuring control of resource consumption. It also has the operational advantage of scheduled maintenance.

Deployment options for two hosts, from the least suitable for production to the most
Type ofSpecificationsResult
Minimum equipmentTwo hosts, without redundant storage. Singleton etcd and master components on the same host.etcd - a single point of failure, it makes no sense to run two on other master services. The separation between two hosts increases the risk of a managed level of failure. The potential advantage of resource isolation is by running a managed layer on one host and application workloads on another. If the storage is lost, there is no recovery.
Improved storage redundancySingleton etcd and master components on the same host, etcd storage is redundant.At a minimum, you can recover from a storage failure.
Managed level redundancyNo hypervisor, multiple instances of managed-level components in static hearths. etcd cluster on the same host, the other components of the managed level are separated.Hardware failure, firmware upgrades, operating systems, and container launch environments on a host without etcd are less destructive.
Adding a hypervisor to both hostsThe virtual machines run three redundant components of the managed level, etcd cluster on one host, the components of the managed level are separated. Application workloads can reside on both VM nodes.Improved isolation of application loads. Updates of the operating system and container launch environments are less destructive. Routine hardware / firmware maintenance becomes non-destructive if the hypervisor supports VM migration.

Deploy to three (or more) hosts

Transition to an uncompromising production-grade service. We recommend dividing etcd between the three hosts. One equipment failure will reduce the amount of possible application workloads, but will not result in a complete service outage.

Very large clusters will require more instances.

Launching a hypervisor layer provides operational advantages and improved isolation of application workloads. This is beyond the scope of the article, but at the level of three or more hosts, improved features may be available (clustered redundant shared storage, resource management with a dynamic load balancer, automated state monitoring with live migration and failover).

Deployment options for three (or more) hosts from least suitable for production to most
Type ofSpecificationsResult
MinimumThree hosts. Etcdd on each node Master components on each node.Loss of a node reduces performance, but does not lead to a drop in Kubernetes. The possibility of recovery remains.
Add hypervisor to hostsIn virtual machines on three hosts running etcd, API server, schedulers, controller manager. Workloads are running on the VM on each host.Protection against bugs of OS / container / software launch environment and human error has been added. Operational benefits of installation, upgrades, resource management, monitoring, and security.

Configuring Kubernetes Configuration

Master and Worker nodes should be protected from overload and resource depletion. Hypervisor functions can be used to isolate critical components and reserve resources. There are also Kubernetes configuration settings that can inhibit things like the speed of API calls. Some installation kits and commercial distributions take care of this, but if you are self-deploying Kubernetes, the default settings may not be appropriate, especially for small resources or a cluster that is too large.

The consumption of resources at a manageable level correlates with the number of hearths and the outflow rate of hearths. Very large and very small clusters will take advantage of the changed settings.slowdown requests kube-apiserver and memory. Node

Allocatable must be configured on the node of the worker based on a reasonable supported load density on each node. Namespaces can be created to divide the cluster of the worker-node into several virtual clusters with quotas for CPU and memory.


Each Kubernetes cluster has a root Certificate Authority (CA). The Controller Manager, API Server, Scheduler, kubelet client, kube-proxy and administrator certificates must be generated and installed. If you use the tool or distribution kit of installation, then probably you should not deal with it independently. The manual process is described here.. You should be ready to reinstall certificates in case of expansion or replacement of nodes.

Since Kubernetes is fully managed by the API, it is extremely important to control and limit the list of those who have access to the cluster. Encryption and authentication options are discussed in this documentation.

Kubernetes application workloads are based on container images. You need the source and contents of these images to be reliable. Almost always it means that you will host the container image in the local repository. Using images from the public Internet can cause problems of reliability and security. You must select a repository that has support for image signing, security scanning, access control for sending and downloading images, and activity logging.

Processes must be configured to support the application of host firmware updates, hypervisor, OS6, Kubernetes, and other dependencies. Versioning is required to support auditing.


  • Strengthen the default security settings for components of the managed level (for example, blocking a worker node );
  • Use the Podov Security Policy ;
  • Consider the integration of NetworkPolicy available to your network solution, including tracking, monitoring and troubleshooting;
  • Use RBAC to make authorization decisions;
  • Consider physical security, especially when deployed in peripheral or remote locations that may be left unattended. Add storage encryption to limit the effects of device theft, and protect against connecting malicious devices such as USB keys;
  • Protect the text credentials of the cloud provider (access keys, tokens, passwords, etc.).

Kubernetes secret objects are suitable for storing small amounts of sensitive data. They are stored in etcd. They can be safely used to store the Kubernetes API credentials, but there are times when a more comprehensive solution is required for the workload or expansion of the cluster itself. The HashiCorp Vault project is a popular solution if you need more than the built-in secret objects can provide.

Disaster Recovery and Backup

Implementing redundancy through the use of multiple hosts and VMs helps reduce the number of certain types of failures. But scenarios such as a natural disaster, a bad update, a hacker attack, software bugs, or a human error can still turn into crashes.

A critical part of the production deployment is waiting for the need for future recovery.

It is also worth noting that part of your investment in the design, documentation, and automation of the recovery process can be reused if large-scale replicated multi-site deployments are required.

Among the elements of disaster recovery, it is worth noting backups (and possibly replicas), replacements, a planned process, people who will perform this process, and regular training. Frequent test exercises and the principles of Chaos Engineering can be used to test your readiness.

Due to availability requirements, it may be necessary to store local copies of the OS, Kubernetes components, and container images to allow recovery even if the Internet fails. The ability to deploy replacement hosts and nodes in a “physical isolation” situation improves security and increases deployment speed.

All Kubernetes objects are stored in etcd. Periodic backup of data of an etcd-cluster is an important element in restoring Kubernetes clusters under abnormal scenarios, for example, if all master nodes are lost.

Cluster etcd backup can be performed using the built-inThe etcd mechanism for snapshots and copying the result to storage in another domain of failure. Snapshot files contain all Kubernetes states and critical information. Encrypt snapshot files to protect sensitive Kubernetes data.

Keep in mind that some Kubernetes extensions can store states in separate etcd clusters, persistent volumes, or through some other mechanism. If these conditions are critical, they should have a backup and recovery plan.

But some important states are stored outside etcd. Certificates, container images, other settings and states associated with operations can be controlled by an automatic install / update tool. Even if these files can be re-generated, a backup or replica will speed up disaster recovery. Consider the need for a backup and recovery plan for the following objects:

  • Certificate and key pairs: CA, API Server, Apiserver-kubelet-client, ServiceAccount authentication, “Front proxy”, Front proxy client;
  • Important DNS records;
  • Assign and reserve IP / Subnet;
  • External load balancer;
  • Kubeconfig files;
  • LDAP and other authentication details;
  • The account and configuration data of the cloud provider.

Suggestions for production workloads

Anti-affinity specifications can be used to divide clustered services between backup hosts. But in this case, the settings are used only if it is scheduled. This means that Kubernetes can restart the failed node of your clustered application, but does not have a built-in mechanism for rebalancing after a failure. This topic deserves a separate article, but additional logic can be useful to achieve optimal placement of workloads after a host or a worker node is restored or expanded. The Prioritization and Crowd Out function can be used to select the desired sorting in the event of a shortage of resources caused by a failure.

In the case of stateful services, external mounted volumes are the Kubernetes recommendation standard for nonclustered services (for example, for a typical SQL database). Currently, external volume snapshots managed by Kubernetes are in the roadmap feature request category , and are most likely related to the integration of the Container Storage Interface (CSI). Thus, the creation of backup copies of such a service will require in-box actions depending on the specific application, which is beyond the scope of this article. Therefore, while we are waiting for the improvement of Kubernetes support for workflow snapshots and backup, it is worth thinking about starting the database service not in a container, but in a virtual machine, and opening it to Kubernetes loads.

Clustered stateful services (for example, Cassandra) can take advantage of the separation between hosts using local persistent volumes if resources allow. This will require deploying multiple Kubernetes work nodes (there may be VMs on hypervisor hosts) to maintain a quorum at the point of failure.

Other suggestions

Logs and metrics (if you collect and store them) are useful for diagnosing failures, but given the variety of available technologies, we will not consider them in this article. If you have an Internet connection, it is advisable to keep logs and metrics outside, centrally.

Production deployments should use unattended installation, configuration, and upgrade tools (for example, Ansible, BOSH , Chef , Juju , kubeadm , Puppet , etc.). The manual process is too laborious and difficult to scale, it is easy to make mistakes and run into problems of repeatability. Certified distributions are likely to include a means to save the settings when upgrading, but if you use your own toolchain for installation and configuration, storing, backing up and restoring configuration artifacts is of paramount importance. It is worth thinking about using a version control system, like Git, to store components and deployment settings.


RanbukiIn which recovery processes are documented, they should be tested and stored offline — perhaps even printed. When an employee is called at 2 am Friday, improvisation is not the best option. It is better to complete all the items from the planned and tested checklist - which can be accessed by both the employees in the office and the remote ones.

Final Thoughts

Buying a commercial airline ticket is a simple and safe process. But if you are flying to a remote place with a short runway, a commercial flight on an Airbus A320 is not an option. This does not mean that airfare should not be considered at all. It only means that compromises are necessary.

In the context of aviation, engine failure in a single-engine vessel means a crash. With two engines, at a minimum, there will be more opportunities to decide where the crash will occur. Kubernetes on a small number of hosts is similar, and if your business case justifies, you should think about increasing the fleet, consisting of large and small machines (for example, FedEx, Amazon).

When designing Kubernetes production-grade solutions, you have many options and options. A simple article can not give all the answers and has no idea about your priorities. Nevertheless, we hope that she has provided a list of things to think about, as well as some useful recommendations. There are options that have not been reviewed (for example, launching Kubernetes components using self-hosting, not static pods). Perhaps they should be discussed in the following articles, if there is enough interest. In addition, due to the high rate of improvement of Kubernetes, if your search engine found this article after 2019, some of its materials might already be outdated.


As always, we are waiting for your questions and comments here, and you can go to the Open Day to Alexander Titov .

Also popular now: