
Deep dive into Linux namespaces
In this series of posts, we will take a close look at one of the main ingredients of a container: namespaces. Along the way, we will build a simpler clone of the docker run command: our own program, which takes a command (together with its arguments, if any) as input and spins up a container to execute it, isolated from the rest of the system, much as you would use docker run to run a command from an image.
What is a namespace?
A Linux namespace is an abstraction over resources in the operating system. We can think of a namespace as a box. The box contains system resources, and which resources depends on the type of the box (namespace). At the time of writing there are seven types of namespaces: Cgroup, IPC, Network, Mount, PID, User, and UTS (kernels from Linux 5.6 onward add an eighth type, Time).
For example, a Network namespace contains the system resources associated with networking, such as network interfaces (e.g. wlan0, eth0), routing tables, and so on. A Mount namespace contains the files and directories in the system, a PID namespace contains process IDs, and so on. Thus two instances of a Network namespace, A and B (two boxes of the same type in our analogy), can contain different resources: perhaps A contains wlan0, while B contains eth0 and a separate copy of the routing table.
Namespaces are not an additional feature or library that you need to install, for example with the apt package manager. They are provided by the Linux kernel itself, and every process on the system already runs inside them. At any given point in time, any process P belongs to exactly one instance of each namespace type. So when it needs to, say, update the routing table of the system, Linux shows it the copy of the routing table that belongs to the namespace it is in at that moment.
What is it for?
Absolutely nothing... just kidding, of course. One of the great properties of boxes is that you can add things to a box and remove things from it without affecting the contents of other boxes. Namespaces follow the same idea: a process P can "go crazy" and execute sudo rm -rf /, but another process Q that belongs to a different Mount namespace will not be affected, since the two use separate copies of those files.
Note that a resource contained in a namespace is not necessarily a unique copy. In some cases, whether intentionally or because of a security breach, two or more namespaces contain the same copy of a resource, for example the same file. Changes made to that file in one Mount namespace are then visible in every other Mount namespace that also refers to it. For this reason we will drop the box analogy from here on, since an item cannot be in two different boxes at the same time.
Limiting is caring
We can see the namespaces that a process belongs to! In typical Linux fashion, they are exposed as files in the directory /proc/$pid/ns of the process with ID $pid:
$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 net -> net:[4026531957]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 pid -> pid:[4026531836]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 user -> user:[4026531837]
lrwxrwxrwx 1 iffy iffy 0 May 18 12:53 uts -> uts:[4026531838]
You can open another terminal, run the same command, and get the same result. This is because, as mentioned earlier, a process must belong to some namespace of each type, and unless we explicitly specify which one, Linux adds it to the default namespaces.
Let's meddle with this a little. In the second terminal, we can do something like this:
$ hostname
iffy
$ sudo unshare -u bash
$ ls -l /proc/$$/ns
lrwxrwxrwx 1 root root 0 May 18 13:04 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 May 18 13:04 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 May 18 13:04 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 May 18 13:04 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 May 18 13:04 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 May 18 13:04 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 May 18 13:04 uts -> uts:[4026532474]
$ hostname
iffy
$ hostname coke
$ hostname
coke
The unshare command runs a program (optionally) in new namespaces. The -u flag tells it to run bash in a new UTS namespace. Note that our new bash process points to a different uts file, while all the others remain the same.
Creating new namespaces usually requires superuser access. From here on, we will assume that both unshare and our implementation are run with sudo.
One consequence of what we just did is that we can now change the system hostname from our new bash process without affecting any other process in the system. You can verify this by running hostname in the first terminal and seeing that the hostname has not changed there.
But what, exactly, is a container?
Hopefully you now have some idea of what a namespace can do. You might guess that containers are essentially ordinary processes whose namespaces differ from those of other processes, and you would be right. Resource quotas aside, a container is not required to belong to a unique namespace of each type: it can share some of them with other processes.
For example, when you type docker run --net=host redis, all you are doing is telling Docker not to create a new Network namespace for the redis process. And, as we have seen, Linux will add this process as a member of the default Network namespace, like any other regular process. So, from a networking point of view, the redis process is exactly the same as everyone else. This kind of configuration is not limited to the network: docker run lets you make such changes for most of the existing namespaces. This raises the question: what is a container, then? Is a process that shares all but one namespace with the rest of the system still a container? ¯\_(ツ)_/¯ Containers usually come with the notion of isolation, achieved through namespaces: the fewer namespaces and resources a process shares with others, the more isolated it is, and that is all that really matters.
Isolation
In the remainder of this post, we will lay the foundation for our program, which we will call isolate. isolate takes a command as its arguments and runs it in a new process, isolated from the rest of the system and confined to its own namespaces. In the following posts, we will add support for individual namespaces to the command process that isolate launches.
We will focus on the User, Mount, PID, and Network namespaces. The rest will be relatively trivial to implement once we are done (in fact, we add UTS support here, in the initial implementation of the program). Covering the Cgroup namespace, for example, is beyond the scope of this series, as it belongs to the study of cgroups, the other component of containers, used to control how many resources a process can use.
Namespaces can get complicated very quickly, and there are many different paths you could take when exploring each one, but we cannot take them all at once. We will discuss only the paths that are relevant to the program we are developing. Each post will begin with some console experiments on the namespace in question, to understand the steps required to set it up. By then we will have an idea of what we want to achieve, and the corresponding implementation in isolate will follow.
To keep the posts from being overloaded with code, we will leave out things such as helper functions that are not needed to understand the implementation. You can find the full source code here on GitHub.
Implementation
The source code for this post can be found here. Our implementation of isolate will be a simple program that reads a command from its command-line arguments and clones a new process that executes it with the specified arguments. The cloned command process will run in its own UTS namespace, just as we did earlier with unshare. In the next posts we will see that namespaces do not necessarily work (or at least do not provide isolation) out of the box, and that we need to perform some configuration after creating them (but before actually running the command) so that the command really runs in isolation.
This create-then-configure sequence requires some interaction between the main isolate process and the child process of the command being launched. As a result, part of the core work here is to set up a communication channel between the two processes. In our case, we will use a Linux pipe because of its simplicity.
We need to do three things:
- Create the main isolate process, which reads the command from its arguments.
- Clone a new process that will run the command in a new UTS namespace.
- Set up a pipe so that the command process starts executing the command only after receiving a signal from the main process that the namespace configuration is complete.
Here is the main process:
// struct params, parse_args(), die(), cmd_stack and STACKSIZE come from the full source.
int main(int argc, char **argv)
{
    struct params params;
    memset(&params, 0, sizeof(struct params));
    parse_args(argc, argv, &params);

    // Create a pipe for communication between the main and command processes.
    if (pipe(params.fd) < 0)
        die("Failed to create pipe: %m");

    // Clone the command process.
    int clone_flags = SIGCHLD | CLONE_NEWUTS;
    int cmd_pid = clone(cmd_exec, cmd_stack + STACKSIZE, clone_flags, &params);
    if (cmd_pid < 0)
        die("Failed to clone: %m\n");

    // Get the writable end of the pipe.
    int pipe = params.fd[1];

    // Some namespace setup will take place here ...

    // Signal to the command process that we are done with the setup.
    if (write(pipe, "OK", 2) != 2)
        die("Failed to write to pipe: %m");
    if (close(pipe))
        die("Failed to close pipe: %m");

    if (waitpid(cmd_pid, NULL, 0) == -1)
        die("Failed to wait pid %d: %m\n", cmd_pid);

    return 0;
}
Note the clone_flags that we pass to our clone call. See how easy it is to create a process in its own namespace? All we need to do is set the flag for the namespace type (the CLONE_NEWUTS flag corresponds to the UTS namespace), and Linux takes care of the rest.
Next, the command process waits for that signal before it starts:
static int cmd_exec(void *arg)
{
    // Kill the cmd process if the isolate process dies.
    if (prctl(PR_SET_PDEATHSIG, SIGKILL))
        die("cannot PR_SET_PDEATHSIG for child process: %m\n");

    struct params *params = (struct params*) arg;

    // Wait for the 'setup done' signal from the main process.
    await_setup(params->fd[0]);

    char **argv = params->argv;
    char *cmd = argv[0];
    printf("===========%s============\n", cmd);

    if (execvp(cmd, argv) == -1)
        die("Failed to exec %s: %m\n", cmd);

    // Not reached: execvp() only returns on failure.
    die("¯\\_(ツ)_/¯");
    return 1;
}
Finally, we can try to run this:
$ ./isolate sh
===========sh============
$ ls
isolate isolate.c isolate.o Makefile
$ hostname
iffy
$ hostname coke
$ hostname
coke
# Check in a new terminal window that the hostname has not changed
Right now, isolate is little more than a program that simply forks a command (with a UTS namespace working for us). In the next post, we will take another step: after exploring User namespaces, we will make isolate run the command in its own User namespace. There we will see that we actually need to do some work to get a usable namespace in which the command can run.