Updating the kernel without rebooting

    Today I want to talk about my favorite feature in the latest Parallels Cloud Server release: rebootless update, that is, updating the kernel without a visible reboot.

    A reboot means server downtime and the loss of the state of everything currently running. That is undesirable for a server used by a large number of people. The popular approach at the moment is Ksplice, which applies changes to a running system. This is unreliable: not every update can be applied that way, and in general there is no guarantee that the faulty code has not already left traces in the running system. Another important problem is that developers are reluctant to investigate bugs reported after such updates. Who knows what has been cooked up in that hodgepodge.

    We at Parallels approached the problem from a different angle and decided to do everything honestly. Honestly means actually rebooting into the new kernel, but in such a way that no one notices. The fastest way to boot a new kernel is kexec. Now recall that both containers and virtual machines can save their state (suspend / resume, dump / restore, snapshot, etc.). So if we suspend all virtual environments, quickly reboot the kernel, and restore the environments, the user will notice only a short interruption in service, which will look like a network hiccup. To a first approximation, this is how rebootless update works.
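
    The suspend / kexec / resume sequence described above can be sketched as a dry-run plan. This is only an illustration, not the Parallels implementation: the function name, the container IDs and the kernel paths are hypothetical, and the OpenVZ-style `vzctl suspend` / `vzctl resume` commands are assumed as a stand-in for whatever the product uses internally. Nothing is executed; the script only prints the commands in order.

```python
import shlex


def rebootless_update_plan(container_ids, kernel, initrd):
    """Build the ordered commands for a suspend / kexec / resume cycle.

    Returns (before, after): commands to run under the old kernel and
    commands to run once the new kernel has booted (for example from a
    boot-time service). All names and paths here are illustrative.
    """
    before = [["vzctl", "suspend", str(ctid)] for ctid in container_ids]
    # Stage the new kernel in memory, reusing the current kernel command line.
    before.append(["kexec", "-l", kernel, f"--initrd={initrd}", "--reuse-cmdline"])
    # Jump into the staged kernel. A real setup would stop services and
    # sync disks first (or use `systemctl kexec`) rather than jumping directly.
    before.append(["kexec", "-e"])
    after = [["vzctl", "resume", str(ctid)] for ctid in container_ids]
    return before, after


if __name__ == "__main__":
    before, after = rebootless_update_plan(
        [101, 102], "/boot/vmlinuz-new", "/boot/initrd-new.img"
    )
    for cmd in before:
        print("old kernel:", shlex.join(cmd))
    for cmd in after:
        print("new kernel:", shlex.join(cmd))
```

    The plan is split in two because the resume commands cannot follow `kexec -e` in the same script: by then the old kernel, and every process running on it, is gone.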

    Parallels developers went a step further and significantly reduced the downtime of virtual environments. First, they created the PramFS file system. It is similar to tmpfs, but its contents persist across a kernel reboot performed through kexec. The saved states of virtual machines and containers are written to this file system. PramFS is several orders of magnitude faster than a disk, so the time needed to save and restore environments dropped significantly.

    Saving the state of a container means saving all of its objects (open files, sockets, pipes, timers, process state, etc.) as well as its user memory. The next optimization allowed us to leave user memory and file system caches in place, exactly where they were before the reboot, instead of copying them. This further shortened container save and restore times and therefore the downtime.

    As a result, after such an update the server runs a new kernel with no traces of the old one. All kernel objects are recreated and their state is restored. User memory and file system caches are left untouched. The server's downtime is several times shorter than with a regular reboot.

    This feature is currently available only to Parallels Cloud Server users, but we plan to offer this functionality to the Linux community. Saving and restoring containers will be implemented as part of the CRIU project.
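
    For a feel of what saving and restoring a process tree looks like with the CRIU command-line tool, here is a minimal sketch that only builds the `criu dump` and `criu restore` command lines. The helper names, the PID and the image directory are hypothetical. Pointing the image directory at a memory-backed mount mirrors the idea of keeping saved state off the disk, but note that an ordinary tmpfs, unlike the PramFS described above, does not survive a kexec.

```python
import shlex


def criu_dump_cmd(pid, images_dir, tcp_established=False):
    """Command line that checkpoints the process tree rooted at `pid`."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if tcp_established:
        # Ask CRIU to also checkpoint established TCP connections.
        cmd.append("--tcp-established")
    return cmd


def criu_restore_cmd(images_dir, tcp_established=False):
    """Command line that restores the tree saved in `images_dir`."""
    cmd = ["criu", "restore", "--images-dir", images_dir]
    if tcp_established:
        cmd.append("--tcp-established")
    return cmd


if __name__ == "__main__":
    # Hypothetical PID and a memory-backed directory for the images.
    images = "/run/ct101-images"
    print(shlex.join(criu_dump_cmd(4242, images, tcp_established=True)))
    print(shlex.join(criu_restore_cmd(images, tcp_established=True)))
```

    Both commands need root privileges and a kernel with checkpoint/restore support, which is why the sketch prints them rather than running them.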

