New interface for getting process attributes in Linux
While developing CRIU, we realized that the current interface for obtaining process information is far from ideal. A similar problem had already been solved successfully for sockets, so we tried to carry those ideas over to processes and got quite good results, which you will learn about if you read this article to the end.
Disadvantages of the current interface
A fair question after reading that heading: what exactly was wrong with the old interface? As many of you know, process information is currently collected through the procfs file system. Each process corresponds to a directory there containing several dozen files:
$ ls /proc/self/
attr cwd loginuid numa_maps schedstat task
autogroup environ map_files oom_adj sessionid timers
auxv exe maps oom_score setgroups uid_map
cgroup fd mem oom_score_adj smaps wchan
clear_refs fdinfo mountinfo pagemap stack
cmdline gid_map mounts personality stat
comm io mountstats projid_map statm
coredump_filter latency net root status
cpuset limits ns sched syscall
Each file contains a number of attributes. The first problem: reading even one file per process costs three system calls (open, read, close). Collecting data on hundreds of thousands of processes can therefore take a long time, even on a powerful machine. You may remember how slowly ps or top runs on busy machines.
The second problem is how process properties are split across files, and we have a good example showing that the current split is poor. CRIU needs to obtain data on all memory regions of a process. If we look at /proc/PID/maps, we find that it lacks the flags needed to restore memory regions. Fortunately, there is another file, /proc/PID/smaps, which contains the required information along with statistics on physical memory consumption (which we do not need). A simple experiment shows that generating the first file takes an order of magnitude less time:
$ time cat /proc/*/maps > /dev/null
real 0m0.061s
user 0m0.002s
sys 0m0.059s
$ time cat /proc/*/smaps > /dev/null
real 0m0.253s
user 0m0.004s
sys 0m0.247s
You have probably guessed the culprit: collecting the memory-consumption statistics takes most of the time.
The third problem lies in the file formats. First, there is no single format. Second, the format of some files cannot be extended at all (which is why we cannot add a flags field to /proc/PID/maps). Third, many files use a text format that is easy for humans to read. That is convenient when you want to look at a single process; but when the task is to analyze thousands of processes, you will not look through them by eye, you will write code. Parsing files in a zoo of different formats is not a pleasant pastime. A binary format is usually more convenient to process in program code, and generating it often takes fewer resources.
The sock_diag socket interface
When we started working on CRIU, we had a problem getting socket information. For most socket types it was, as usual, served by files in /proc (/proc/net/unix, /proc/net/netlink, etc.) containing a fairly limited set of parameters. For INET sockets, however, there was a netlink-based interface that presented information in a binary and easily extensible format. It turned out that this interface could be generalized to all socket types.
It works as follows. First, a request is formed specifying the set of parameter groups and the set of sockets for which they are needed. The output is the requested data, split into messages: one message describes one socket. Parameters are divided into groups, which can be quite small, since a group only adds its own size to the message. Each group is described by a type and a size, which lets us extend existing groups or add new ones.
The new task_diag interface for retrieving process attributes
When we ran into these problems getting data about processes, the analogy with sockets came to mind immediately, and the idea arose to use the same kind of interface for processes.
All attributes must be divided into groups, with one important rule: no attribute should noticeably affect the time required to generate all the attributes in its group. Remember what I said about /proc/PID/smaps? In the new interface those statistics are moved into a separate group.
At the first stage we did not aim to cover all attributes; we wanted to understand how convenient the new interface is to use. So we decided to build an interface sufficient for the needs of CRIU, and ended up with the following set of attribute groups:
TASK_DIAG_BASE,     /* basic information: pid, tid, sig, pgid, comm */
TASK_DIAG_CRED,     /* credentials */
TASK_DIAG_STAT,     /* the same data the taskstats interface provides */
TASK_DIAG_VMA,      /* description of memory regions */
TASK_DIAG_VMA_STAT, /* extends memory region descriptions with resource-usage statistics */
TASK_DIAG_PID = 64, /* thread identifier */
TASK_DIAG_TGID,     /* process identifier */
This is in fact the current version of the grouping. In particular, TASK_DIAG_STAT appeared in the second version, as part of integrating the interface with the existing taskstats mechanism, which is built on netlink sockets. The latter uses the netlink protocol and has a number of known problems, which we will come back to later in this article.
And a few words about how to specify the set of processes for which information is needed:
#define TASK_DIAG_DUMP_ALL        0 /* all processes in the system */
#define TASK_DIAG_DUMP_ALL_THREAD 1 /* all threads in the system */
#define TASK_DIAG_DUMP_CHILDREN   2 /* all children of the given process */
#define TASK_DIAG_DUMP_THREAD     3 /* all threads of the given process */
#define TASK_DIAG_DUMP_ONE        4 /* a single given process */
Several questions arose during implementation. The interface should be accessible to ordinary users, so we needed to store access rights somewhere. The second question: where should the reference to the process PID namespace (pidns) come from?
Let's start with the second one. We use the netlink interface, which is based on sockets and is used mostly in the network subsystem, where the reference to the network namespace is taken from the socket. In our case we need a reference to the PID namespace. A little digging through kernel code revealed that each message carries information about its sender (SCM_CREDENTIALS), including the sender's process identifier, which lets us take the pidns reference from the sender. This differs from network namespaces, where a socket is bound to the namespace it was created in. Taking the pidns reference from the process that requested the information seems acceptable, and as a bonus we gain the ability to specify the namespace we need, because sender information can be set at send time.
The first problem turned out to be far more interesting, although for a long time we could not pin down its details. File descriptors in Linux have a peculiarity: we can open a file and then drop privileges, and the descriptor remains fully functional. The same holds for netlink sockets, but here lies a problem that Andy Lutomirski pointed out to me: there is no way to specify what exactly a particular socket will be used for. So if an application creates a netlink socket and then drops privileges, it can still use that socket for any functionality available to netlink sockets; dropping privileges has no effect on the socket. That means any new functionality we add to netlink sockets automatically becomes reachable through previously opened, potentially more privileged descriptors, which is a security concern.
There were other suggestions for the interface. In particular, there was an idea to add a new system call, but I did not much like it: there may be too much data to return in a single buffer, whereas a file descriptor implies reading data in portions, which looks more reasonable to me.
There was also a proposal to make a transactional file in procfs. The idea is similar to what we did with netlink sockets: open the file, write a request, read the response. This is the idea we settled on as the working approach for the next version.
A few words about performance
The first version did not spark much discussion, but it helped find another group of people interested in a new, faster interface for obtaining process properties. One evening I shared my work with Pavel Odintsov (@pavelodintsov), who said he had recently run into performance problems with perf that also came down to the speed of collecting process attributes. That is how he put me in touch with David Ahern, who went on to contribute considerably to the development of the interface and showed, with yet another example, that this work is useful beyond our own project.
We can start the performance comparison with a simple example. Suppose that for every process we need to get the session ID, process group ID, and a few other parameters from /proc/PID/stat.
For a fair comparison, we will write a small program that reads /proc/PID/stat for each process. As we will see below, it works faster than the ps utility.
DIR *d = opendir("/proc");
struct dirent *de;
char buf[4096];
int fd, tasks = 0;

while ((de = readdir(d))) {
	/* skip entries that are not PID directories */
	if (de->d_name[0] < '0' || de->d_name[0] > '9')
		continue;
	snprintf(buf, sizeof(buf), "/proc/%s/stat", de->d_name);
	fd = open(buf, O_RDONLY);
	if (fd < 0)
		continue; /* the process may have already exited */
	read(fd, buf, sizeof(buf));
	close(fd);
	tasks++;
}
The program for task_diag is more involved; you can find it in my repository under tools/testing/selftests/task_diag/.
$ ps a -o pid,ppid,pgid,sid,comm | wc -l
50006
$ time ps a -o pid,ppid,pgid,sid,comm > /dev/null
real 0m1.256s
user 0m0.367s
sys 0m0.871s
$ time ./task_proc_all a
tasks: 50085
real 0m0.279s
user 0m0.013s
sys 0m0.255s
$ time ./task_diag_all a
real 0m0.051s
user 0m0.001s
sys 0m0.049s
Even on such a simple example, task_diag is several times faster. The ps utility is slower still because it reads more than one file per process.
Let's see what perf trace --summary shows for both variants.
$ perf trace --summary ./task_proc_all a
tasks: 50086
Summary of events:
task_proc_all (72414), 300753 events, 100.0%, 0.000 msec
syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 50091 0.003 0.005 0.925 0.40%
write 1 0.011 0.011 0.011 0.00%
open 50092 0.003 0.004 0.992 0.49%
close 50093 0.002 0.002 0.061 0.15%
fstat 7 0.002 0.003 0.008 25.95%
mmap 18 0.002 0.006 0.026 19.70%
mprotect 10 0.006 0.010 0.020 13.28%
munmap 2 0.012 0.020 0.028 40.18%
brk 3 0.003 0.007 0.010 30.28%
rt_sigaction 2 0.003 0.003 0.004 18.81%
rt_sigprocmask 1 0.003 0.003 0.003 0.00%
access 1 0.005 0.005 0.005 0.00%
getdents 50 0.003 0.940 2.023 4.51%
getrlimit 1 0.003 0.003 0.003 0.00%
arch_prctl 1 0.002 0.002 0.002 0.00%
set_tid_address 1 0.003 0.003 0.003 0.00%
openat 1 0.022 0.022 0.022 0.00%
set_robust_list 1 0.003 0.003 0.003 0.00%
$ perf trace --summary ./task_diag_all a
Summary of events:
task_diag_all (72481), 183 events, 94.8%, 0.000 msec
syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 31 0.003 1.471 6.364 14.43%
write 1 0.003 0.003 0.003 0.00%
open 7 0.005 0.008 0.020 26.21%
close 6 0.002 0.002 0.003 3.96%
fstat 6 0.002 0.002 0.003 4.67%
mmap 17 0.002 0.006 0.030 25.38%
mprotect 10 0.005 0.007 0.010 6.33%
munmap 2 0.006 0.007 0.008 13.84%
brk 3 0.003 0.004 0.004 9.08%
rt_sigaction 2 0.002 0.002 0.002 9.57%
rt_sigprocmask 1 0.002 0.002 0.002 0.00%
access 1 0.006 0.006 0.006 0.00%
getrlimit 1 0.002 0.002 0.002 0.00%
arch_prctl 1 0.002 0.002 0.002 0.00%
set_tid_address 1 0.002 0.002 0.002 0.00%
set_robust_list 1 0.002 0.002 0.002 0.00%
The number of system calls in the task_diag case drops dramatically.
Results for the perf utility (quoted from a letter by David Ahern):
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
Here we see a performance improvement of an order of magnitude.
Conclusion
The project is still under development and may change many times over, but we already have two real projects demonstrating a serious performance gain. I am almost certain that, in one form or another, this work will sooner or later land in the mainline kernel.
References
github.com/avagin/linux-task-diag
lkml.org/lkml/2015/7/6/142
lwn.net/Articles/633622
www.slideshare.net/openvz/speeding-up-ps-and-top-57448025