PID 1 zombie reaping problem in Docker

Original author: Hongli Lai
  • Translation
Hello, Habr!
At Hexlet we actively use Docker, both to run our application and related servers and to run user code in hands-on programming exercises. Without these lightweight containers, these tasks would be much harder for us. Docker is a wonderful technology, but sometimes unexpected problems arise. One such problem (and its solution) is described on the blog of Phusion, the creators of Phusion Passenger. Today we are publishing a translation of that post.


About a year ago, when Docker was at version 0.6, we first introduced Baseimage-docker. It is a minimal Ubuntu base image modified specifically for Docker. People can pull it from the Docker Registry and use it as the base for their own images.

We were early adopters of Docker, using it for continuous integration and for building working environments long before version 1.0 was released. We built the base image to solve problems that are inherent to the way Docker works. For example, Docker does not run processes under a special init process that properly reaps child processes, so zombie processes can accumulate and cause all sorts of trouble. Docker also does nothing with syslog, so important messages may be silently lost. And so on.

However, we found that many people do not understand the problems we are solving. Admittedly, these are fairly low-level Unix mechanisms that not everyone is familiar with. So in this post we describe the most important problem that we solve: the PID 1 zombie reaping problem.



We figured that:
  1. The problems that we solve apply to many people.
  2. Many people are not even aware of these problems, so sooner or later things break in unexpected ways (Murphy's law).
  3. It is inefficient for everybody to solve these problems on their own, over and over.

So we packaged our solution as a universal base image that everyone can use: Baseimage-docker. This image adds a number of tools that we believe a Docker image developer needs. We use Baseimage-docker as the base for all our own images.

The community likes what we do: our image is the third most popular in the Docker Registry after the official images of Ubuntu and CentOS.



The PID 1 problem: collecting zombies


All processes in Unix are organized as a tree. Each process can spawn child processes, and each process has a parent, except the topmost (root) one.

The root process is init. The kernel starts it when the system boots. init is responsible for starting the rest of the system: the SSH daemon, the Docker daemon, Apache or Nginx, the GUI, and so on. Each of these may in turn spawn child processes of its own.



Nothing unusual so far. But what happens when a process terminates? Let's say the bash process (PID 5) terminates. It turns into a so-called “defunct process”, also known as a “zombie process”.



Why does this happen? Unix is designed so that a parent process must explicitly wait for a child to terminate in order to collect its exit status. The zombie exists until the parent performs this action using the waitpid() family of system calls. Here is a quote from the man page:
A child that terminates, but has not been waited for becomes a "zombie". The kernel maintains a minimal set of information about the zombie process (PID, termination status, resource usage information) in order to allow the parent to later perform a wait to obtain information about the child.

People usually think of zombie processes as runaway processes that wreak havoc. But formally, from the Unix operating system's point of view, zombie processes have a precise definition: they are processes that have terminated but have not yet been waited for by their parent processes.

In most cases this is not a problem. Calling waitpid() on a child process to eliminate its zombie is called “reaping”. Many applications reap their child processes correctly. In the sshd example above, if bash terminates, the operating system sends a SIGCHLD signal to sshd to wake it up. sshd notices this and reaps the child process.
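
The life cycle described above can be observed directly. The following is a minimal sketch, assuming Linux (it reads /proc and uses os.waitid, which is not available everywhere); the function names process_state and zombie_demo are made up for this illustration.

```python
import os


def process_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat", errors="replace") as stat_file:
        # Format: "pid (comm) state ..."; comm may itself contain ")" or spaces.
        return stat_file.read().rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Create a zombie, inspect it, reap it, and report what was seen."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)  # child: terminate at once with exit status 7

    # Block until the child has terminated, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before_reaping = process_state(pid)  # the child is a zombie here

    # Reap the child: this is the waitpid() call the parent is expected to make.
    _, status = os.waitpid(pid, 0)
    still_listed = os.path.exists(f"/proc/{pid}")

    return state_before_reaping, os.WEXITSTATUS(status), still_listed
```

Calling zombie_demo() returns the state letter seen before reaping, the exit status collected by waitpid(), and whether the process entry still exists afterwards. The kernel keeps the exit status around only so that this one call can retrieve it.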



But there is a special case. Suppose the parent process terminates, either intentionally or because of a user action (for example, the user killed it). What happens to its child processes? They no longer have a parent, so they become “orphaned” (this is the actual technical term).

This is where the init process comes into play. The init process, PID 1, has a special task: to “adopt” orphaned processes (again, this is the actual technical term). This means that init becomes the parent of such processes, even though init never created them.

Consider Nginx, which daemonizes itself by default. It works as follows: Nginx first creates a child process. Then the original Nginx process exits. The Nginx child process is now adopted by init.
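
Adoption can be reproduced with a few lines of code. The sketch below is an illustration only, not how Nginx is implemented: an intermediate process creates a child and exits immediately, and the orphaned child reports who its parent was before and after. The name orphan_demo is invented for this example. Note that on some systems the adoptive parent is a "subreaper" process rather than PID 1, so the code makes no assumption about the new parent's PID.

```python
import os
import time


def orphan_demo():
    """Return (original_parent_pid, adoptive_parent_pid) as seen by an orphan."""
    read_fd, write_fd = os.pipe()
    middle_pid = os.fork()
    if middle_pid == 0:
        # Intermediate process: plays the role of the original foreground process.
        os.close(read_fd)
        original_parent = os.getpid()
        if os.fork() == 0:
            # Grandchild: plays the role of the daemonized process.
            while os.getppid() == original_parent:
                time.sleep(0.01)  # wait until our parent has exited
            report = f"{original_parent} {os.getppid()}"
            os.write(write_fd, report.encode())
            os._exit(0)
        os._exit(0)  # exit at once, which orphans the grandchild

    os.close(write_fd)
    os.waitpid(middle_pid, 0)  # reap the intermediate process, our own child
    with os.fdopen(read_fd) as pipe:
        old_parent, new_parent = map(int, pipe.read().split())
    return old_parent, new_parent
```

The orphan's parent PID changes as soon as the intermediate process exits. From that moment on, reaping the orphan is the adoptive parent's job, not the job of the process that originally started the chain.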



The operating system kernel expects special behavior from init: it expects init to reap adopted processes as well.

This is a very important responsibility in Unix systems. It is so fundamental that many programs rely on it to work correctly. Most daemons, for instance, are written on the assumption that their daemonized processes will be adopted by init and reaped by it once they become zombies.

I use daemons as an example, but the mechanism is not limited to them. Every time a process exits while it still has child processes, it expects init to clean up after it. This is described in detail in two very good books: Operating System Concepts and Advanced Programming in the UNIX Environment.

Why are zombie processes harmful?


Why are zombie processes harmful, even though they are just terminated processes? Surely the memory allocated to the process has already been freed, and the zombie is nothing more than an entry in ps?

Yes, the process's memory has already been freed. But the fact that the process is still visible in ps means that it still occupies kernel resources. Here is a quote from the waitpid man page:
As long as a zombie is not removed from the system via a wait, it will consume a slot in the kernel process table, and if this table fills, it will not be possible to create further processes.


How does this relate to Docker?


So how does this relate to Docker? Many people run only a single process in their container. But most likely this process is not written to behave like a proper init. That is, instead of reaping adopted processes itself, it expects some other init process to do that job, and rightly so.

Let's look at a concrete example. Suppose your container runs a web server that executes a CGI script written in bash. The script calls grep. Then the web server decides that the script is taking too long and kills it, but grep keeps running. Since its parent is gone, grep is adopted by the PID 1 process (the web server). When grep finishes, it turns into a zombie. The web server knows nothing about grep, so it never reaps it, and the grep zombie stays in the system.

The problem applies to other situations as well. Many people build containers for third-party applications, such as PostgreSQL, and run them as the only process inside the container. When you run someone else's code, can you be sure that it does not spawn child processes that later turn into zombies? If you run your own code and have audited it and every library it uses, then fine. But in the general case you need to run a proper init to solve the problem.

But doesn't running a full init system turn the container into something heavyweight, like a virtual machine?


An init system does not have to be heavyweight. You may be thinking of Upstart, Systemd, SysV init and the like, and assuming that you need to run a whole system inside the container. This is not true. A “full init system” is neither necessary nor desirable.

All we need is a small, simple program whose only task is to launch your application and reap adopted processes. Using such a simple init system is fully in line with the Docker philosophy.

Simple init system


Does a ready-made solution already exist? Almost: good old bash. Bash reaps adopted processes properly, and bash can run anything. So instead of this line in your Dockerfile...

CMD ["/path-to-your-app"]

you can write

CMD ["/bin/bash", "-c", "set -e && /path-to-your-app"]

(The set -e directive prevents bash from detecting the script as a simple command and exec()'ing it directly.)

The result is the following process hierarchy: bash runs as PID 1 and your application runs as its child.



Unfortunately, this approach has a problem: it does not handle signals properly! Suppose you use kill to send SIGTERM to the bash process. Bash terminates, but it does not forward SIGTERM to its child processes!



When bash terminates, the kernel terminates the entire container with all the processes inside it. These processes are killed with SIGKILL, which cannot be caught, so they get no chance to shut down cleanly. Suppose your application is in the middle of writing a file: the file may be corrupted if the application is killed mid-write. Unclean termination is bad. It is almost like pulling the power cord out of a server.

But why should we care whether the init process is terminated by SIGTERM? Because docker stop sends SIGTERM to the init process. docker stop should stop the container cleanly so that it can be started again later with docker start.

Bash experts will probably want to write a proper EXIT handler that sends signals to the child processes, like this:

#!/bin/bash
function cleanup()
{
    local pids=`jobs -p`
    if [[ "$pids" != "" ]]; then
        kill $pids >/dev/null 2>/dev/null
    fi
}

trap cleanup EXIT
/path-to-your-app

Unfortunately, this does not solve the problem. Sending signals to the child processes is not enough: init must also wait for them to terminate before it exits itself. If init exits too early, all remaining child processes are killed uncleanly by the kernel.

Clearly a more sophisticated solution is required, but a full init system such as Upstart, Systemd or SysV init is overkill for a lightweight Docker container. Fortunately, Baseimage-docker contains a solution. We wrote our own lightweight init system specifically for use inside Docker containers. For lack of a better name, we called it my_init. It is a 350-line Python program.

Key functions of my_init:
  • Reaps adopted child processes
  • Launches subprocesses
  • Waits for all subprocesses to terminate before exiting itself, with a maximum timeout
  • Logs its activity to docker logs
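
To make these responsibilities concrete, here is a heavily simplified sketch of such an init in Python. It is not the code of my_init: the names run_as_init, exit_code_from and FORWARDED_SIGNALS are invented for this illustration, there is no timeout and no logging, and corner cases (such as a signal arriving before the handlers are installed) are ignored.

```python
import os
import signal

# Signals that the sketch passes on to the application (an assumption).
FORWARDED_SIGNALS = (signal.SIGTERM, signal.SIGINT)


def exit_code_from(status):
    """Translate a wait status into a shell-style exit code."""
    if os.WIFSIGNALED(status):
        return 128 + os.WTERMSIG(status)
    return os.WEXITSTATUS(status)


def run_as_init(argv):
    """Start argv, forward termination signals to it, and reap every child."""
    main_pid = os.fork()
    if main_pid == 0:
        try:
            os.execvp(argv[0], argv)  # become the application
        finally:
            os._exit(127)  # only reached if exec failed

    main_running = True

    def forward(signum, _frame):
        # Pass the signal on instead of exiting and leaving the child to SIGKILL.
        if main_running:
            os.kill(main_pid, signum)

    for sig in FORWARDED_SIGNALS:
        signal.signal(sig, forward)

    exit_code = 0
    while True:
        try:
            # Returns for any child: the application or an adopted orphan.
            pid, status = os.wait()
        except ChildProcessError:
            return exit_code  # no children left, so it is safe to exit
        if pid == main_pid:
            main_running = False
            exit_code = exit_code_from(status)
```

A script built on this sketch would end with something like sys.exit(run_as_init(sys.argv[1:])) and be used as the container's entry point. The important design point is the loop: the init does not exit when it receives SIGTERM, it forwards the signal and keeps calling wait() until every child is gone.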


Will Docker solve this problem itself?


Ideally, the PID 1 problem would be solved natively by Docker itself. That would be great, but so far, as of January 2015, we have not heard anything of the sort from the Docker team. This is not criticism: Docker is a very ambitious project, and I am sure its team has bigger things to worry about. The PID 1 problem is easily solved at the user level. So until Docker officially solves it, we recommend that people solve it themselves by using a proper init system like the one described above.

Is this a problem at all?


The problem may seem hypothetical. If you have never seen zombies in your container, you may think that everything is fine. But the only way to be sure that there is no problem is to audit all your code, all your libraries and all the libraries those libraries depend on. If you have not done that, there may well be a line somewhere that spawns a child process, which later turns into a zombie.
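
Short of auditing everything, you can at least look for zombies in a running container. The following sketch assumes Linux with /proc mounted; list_zombies is a name made up for this example, and it reports only what exists at the moment of the scan.

```python
import os


def list_zombies(proc_root="/proc"):
    """Return (pid, ppid, name) for every process currently in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # the process disappeared while we were scanning
        # Format: "pid (comm) state ppid ..."; comm may contain ")" or spaces.
        head, tail = stat.rsplit(")", 1)
        fields = tail.split()
        if fields[0] == "Z":
            name = head.split("(", 1)[1]
            zombies.append((int(entry), int(fields[1]), name))
    return zombies
```

An empty result only means that no zombies exist right now. It does not prove that none will appear later, which is the point of the paragraph above.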

Do not forget about Murphy's law.

Besides clogging the kernel process table, zombies can also interfere with software that checks whether processes exist. For example, Phusion Passenger manages processes and restarts them when they crash. It checks for crashes by parsing the output of ps and by sending signal 0 to the process ID. A zombie is listed in ps and responds to signal 0, so Phusion Passenger thinks the process is still alive.
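
The signal 0 behavior is easy to verify. The sketch below is an illustration of the general technique, not Phusion Passenger's code; is_alive_naive and signal_zero_demo are hypothetical names, and os.waitid is assumed to be available (as it is on Linux).

```python
import os


def is_alive_naive(pid):
    """Naive liveness check: signal 0 only tests that the PID still exists."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def signal_zero_demo():
    """Return the naive check's verdict for a zombie, then for a reaped child."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: terminate at once

    # Wait until the child has terminated, but leave it as a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    verdict_while_zombie = is_alive_naive(pid)

    os.waitpid(pid, 0)  # reap the zombie
    verdict_after_reaping = is_alive_naive(pid)
    return verdict_while_zombie, verdict_after_reaping
```

While the child is a zombie, the check reports it as alive even though it terminated long ago. Only after reaping does the check give the answer a supervisor actually wants.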

All you need to do to protect yourself from the zombie problem is to spend five minutes either adopting Baseimage-docker or importing our 350-line my_init into your own image. The additional disk and memory overhead is minimal: only a couple of megabytes.

Conclusion


The PID 1 problem is real. One way to solve it is to use Baseimage-docker. Is it the only way? Of course not. The goals of Baseimage-docker are:

  1. To make people aware of several important caveats of working with Docker containers.
  2. To provide a ready-made solution so that people don't have to reinvent the wheel.


This means that multiple solutions are possible, as long as they solve the problem described here. You are free to write your own implementation in C, Go, Ruby or anything else.

You may not want to use an Ubuntu base image. Maybe you use CentOS. Baseimage-docker can still be useful to you. For example, our passenger_rpm_automation project uses CentOS containers. We simply extracted my_init from Baseimage-docker and added it there.

Happy Dockering!
