Application resiliency during a Kubernetes cluster update

    Somewhere in the comments, someone asked how attending Slurm differs from reading the Kubernetes manuals. I asked Pavel Selivanov, a speaker at Slurm-2 and MegaSlurm, to give a small example of the kind of thing he would talk about at Slurm. I give him the floor.

    I administer a Kubernetes cluster. Recently I needed to update the k8s version, which included restarting all the machines in the cluster. I started the process at 12:00, and by the end of the working day everything was done. The first time, I watched the update process closely; the second time, I went out for a 1.5-hour lunch (taking the laptop along, to be fair). The cluster updated itself, without my involvement and unnoticed by the clients: the developers didn't notice anything, deployments continued, the service worked as usual.

    What it looked like.

    Possible problems

    When rebooting machines, there are two bad scenarios.

    1. A developer is running an application (or Redis) as a single instance. No matter how carefully you take the machine out of service, there will be downtime.
    2. The application has 2 replicas, and a deployment is in progress. One replica goes down for the rollout, a single live replica remains, and then the admin comes along and takes down that last one. Again, until a replica comes back up after the deployment, there will be downtime.

    I could coordinate the reboot with the developers: stop deployments, check the instances, and then I restart the machines. But I like the DevOps idea that human communication should be kept to a minimum. It is better to set up automation once than to coordinate your actions every time.

    Problem setup

    I use Amazon, with its convenience and stability. Everything is automated: you can create and terminate VMs, check their availability, and so on.

    The Kubernetes cluster is deployed, managed and updated via the kops utility, which I really love.
    During an update, kops automatically drains a node (kubectl drain node), waits until everything has been evacuated from it, deletes it, creates a new node in Amazon with the correct versions of the Kubernetes components, attaches it to the cluster, and checks that the node has joined the cluster properly. It repeats this for all the nodes in turn, until the correct Kubernetes version is running everywhere.
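The rotation described above can be sketched in a few lines. This is a minimal, self-contained simulation, not kops source code: `Node`, `drain` and the cluster list are hypothetical stand-ins for the real kubectl and cloud-API calls.

```python
# Toy model of the node rotation performed during a kops rolling update.
# Node and drain() are illustrative stand-ins, not real kops internals.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    version: str
    pods: list = field(default_factory=list)

def drain(node, cluster):
    """Evacuate all pods to another node (stand-in for `kubectl drain`)."""
    target = next(n for n in cluster if n is not node)
    target.pods.extend(node.pods)
    node.pods = []

def rolling_update(cluster, target_version):
    """Replace every outdated node with a fresh one, one at a time."""
    for node in list(cluster):
        if node.version == target_version:
            continue
        drain(node, cluster)                  # evacuate workloads first
        cluster.remove(node)                  # delete the old VM
        cluster.append(Node(node.name + "-new", target_version))
    return cluster

cluster = [Node("a", "1.9", ["web-1"]), Node("b", "1.9", ["web-2"])]
rolling_update(cluster, "1.10")
print(all(n.version == "1.10" for n in cluster))  # True
```

The key property is the ordering: a node is only deleted after its pods have been evacuated, so the workloads never disappear from the cluster as a whole.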


    In CI, I use kube-lint to check all the manifests that will run in Kubernetes. Helm Template renders everything that is going to run, and I feed that output to the linter, which evaluates everything according to the specified rules.

    For example, one of the rules says that for any application in the Kubernetes cluster, the number of replicas must be at least 2.
    If replicas is not set at all (the default is 1), or is set to 0 or 1, kube-lint blocks the deploy to the cluster to avoid problems in the future.
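A toy version of that replica rule might look like the following. This is an illustration of the idea, not kube-lint's actual implementation; the sample manifests are hypothetical.

```python
# Toy replica-count check in the spirit of the rule described above
# (illustrative only, not the real kube-lint code).
def check_replicas(manifest, minimum=2):
    """Fail any Deployment whose replica count is below `minimum`.
    An absent `replicas` field defaults to 1, as in Kubernetes."""
    if manifest.get("kind") != "Deployment":
        return True  # the rule only applies to Deployments here
    replicas = manifest.get("spec", {}).get("replicas", 1)
    return replicas >= minimum

ok = {"kind": "Deployment", "spec": {"replicas": 3}}
bad = {"kind": "Deployment", "spec": {}}        # defaults to 1 replica
print(check_replicas(ok), check_replicas(bad))  # True False
```

In CI, a failed check like this would simply fail the pipeline, so a single-replica application never reaches the cluster in the first place.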

    Now suppose that during a deployment the application, by design, is briefly left with a single replica. For this case there is the pod disruption budget, where maxUnavailable and minAvailable are set for an application running in Kubernetes. If you always need at least 1 replica available, set minAvailable: 1.
    There were 2 replicas, a deployment started, 1 replica went down, 1 remained. On the machine where the live replica sits, the admin runs kubectl drain node. In theory, Kubernetes should evict this live replica and reschedule it on another node. But here the pod disruption budget kicks in. Kubernetes tells the admin: sorry, the last replica lives here; evicting it would violate the pod disruption budget. So drain keeps retrying the eviction until its timeout expires. Once the deployment finishes and both replicas are available again, the replica on this node is evicted and the drain completes.
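The eviction decision described above reduces to a simple check. The real logic lives in the Kubernetes API server; this is a simplified model of the minAvailable case only.

```python
# Simplified model of the eviction check a PodDisruptionBudget enforces
# (minAvailable case only; the real logic is in the Kubernetes API server).
def eviction_allowed(healthy_replicas, min_available):
    """drain may evict a pod only if the remaining healthy replicas
    would still satisfy minAvailable."""
    return healthy_replicas - 1 >= min_available

# During the deployment: 1 of 2 replicas is alive, minAvailable = 1.
print(eviction_allowed(healthy_replicas=1, min_available=1))  # False: drain waits
# After the deployment: both replicas are healthy again.
print(eviction_allowed(healthy_replicas=2, min_available=1))  # True: drain proceeds
```

This is why the drain simply stalls rather than failing outright: it keeps re-running this check until the deployment restores the second replica.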

    At MegaSlurm, I will show the complete set of rules that lets me drink coffee in a cafe while the Kubernetes cluster updates itself with a restart of all the nodes.

    My topics at Slurm:

    • Familiarity with Kubernetes, the main components
    • Cluster device, main components, fault tolerance, k8s network
    • Kubernetes Advanced Abstractions
    • Logging and monitoring

    My topics at MegaSlurm:

    • The process of creating a failover cluster from within
    • Authorization in the cluster using an external provider
    • Secure and highly available applications in a cluster
    • Implementing deployment strategies other than RollingUpdate
    • Troubleshooting in Kubernetes
