New interface for getting process attributes in Linux
While developing CRIU, we realized that the current interface for obtaining process information is far from ideal. A similar problem had already been solved successfully for sockets, so we tried to carry those ideas over to processes and got quite good results, which you will learn about if you read this article to the end.
Disadvantages of the current interface
A fair question after reading that heading: what exactly was wrong with the old interface? As many of you know, process information is currently collected through the procfs file system. Each process corresponds to a directory there containing several dozen files:
$ ls /proc/self/
attr cwd loginuid numa_maps schedstat task
autogroup environ map_files oom_adj sessionid timers
auxv exe maps oom_score setgroups uid_map
cgroup fd mem oom_score_adj smaps wchan
clear_refs fdinfo mountinfo pagemap stack
cmdline gid_map mounts personality stat
comm io mountstats projid_map statm
coredump_filter latency net root status
cpuset limits ns sched syscall
Each file contains a number of attributes. The first problem: reading even one file per process costs three system calls (open, read, close). Collecting data on hundreds of thousands of processes can therefore take a long time, even on a powerful machine. You may remember how slowly ps or top runs on busy machines.
The second problem is how process properties are split across files, and we have a good example showing that the current split is poor. CRIU needs to obtain data on all memory regions of a process. If we look at /proc/PID/maps, we find that it lacks the flags needed to restore memory regions. Fortunately, there is another file, /proc/PID/smaps, which contains the required information along with statistics on physical memory consumption (which we do not need). A simple experiment shows that generating the first file takes an order of magnitude less time:
$ time cat /proc/*/maps > /dev/null
real 0m0.061s
user 0m0.002s
sys 0m0.059s
$ time cat /proc/*/smaps > /dev/null
real 0m0.253s
user 0m0.004s
sys 0m0.247s
You have probably guessed the culprit: collecting the memory-consumption statistics takes most of the time.
The third problem lies in the file formats. First, there is no single format. Second, the format of some files cannot be extended at all (which is why we cannot add a flags field to /proc/PID/maps). Third, many files use a text format that is easy for humans to read. That is convenient when you want to look at a single process; but when the task is to analyze thousands of processes, you will not look through them by eye, you will write code. Parsing files in a zoo of different formats is not a pleasant pastime. A binary format is usually more convenient to process in program code, and generating it often takes fewer resources.
The sock_diag socket interface
When we started working on CRIU, we had a problem getting socket information. For most socket types it was, as usual, served by files in /proc (/proc/net/unix, /proc/net/netlink, etc.) containing a fairly limited set of parameters. For INET sockets, however, there was a netlink-based interface that presented information in a binary and easily extensible format. It turned out that this interface could be generalized to all socket types.
It works as follows. First, a request is formed specifying the set of parameter groups and the set of sockets for which they are needed. The output is the requested data, split into messages: one message describes one socket. Parameters are divided into groups, which can be quite small, since a group only adds its own size to the message. Each group is described by a type and a size, which lets us extend existing groups or add new ones.
The new task_diag interface for retrieving process attributes
When we ran into these problems getting data about processes, the analogy with sockets came to mind immediately, and the idea arose to use the same kind of interface for processes.
All attributes must be divided into groups, with one important rule: no attribute should noticeably affect the time required to generate all the attributes in its group. Remember what I said about /proc/PID/smaps? In the new interface those statistics are moved into a separate group.
At the first stage we did not aim to cover all attributes; we wanted to understand how convenient the new interface is to use. So we decided to build an interface sufficient for the needs of CRIU, and ended up with the following set of attribute groups:
TASK_DIAG_BASE,     /* basic information: pid, tid, sig, pgid, comm */
TASK_DIAG_CRED,     /* credentials */
TASK_DIAG_STAT,     /* the same data the taskstats interface provides */
TASK_DIAG_VMA,      /* description of memory regions */
TASK_DIAG_VMA_STAT, /* extends memory region descriptions with resource-usage statistics */
TASK_DIAG_PID = 64, /* thread identifier */
TASK_DIAG_TGID,     /* process identifier */
This is in fact the current version of the grouping. In particular, TASK_DIAG_STAT appeared in the second version, as part of integrating the interface with the existing taskstats mechanism, which is built on netlink sockets. The latter uses the netlink protocol and has a number of known problems, which we will come back to later in this article.
And a few words about how to specify the set of processes for which information is needed:
#define TASK_DIAG_DUMP_ALL        0 /* all processes in the system */
#define TASK_DIAG_DUMP_ALL_THREAD 1 /* all threads in the system */
#define TASK_DIAG_DUMP_CHILDREN   2 /* all children of the given process */
#define TASK_DIAG_DUMP_THREAD     3 /* all threads of the given process */
#define TASK_DIAG_DUMP_ONE        4 /* a single given process */
Several questions arose during implementation. The interface should be accessible to ordinary users, so we needed to store access rights somewhere. The second question: where should the reference to the process PID namespace (pidns) come from?
Let's start with the second one. We use the netlink interface, which is based on sockets and is used mostly in the network subsystem, where the reference to the network namespace is taken from the socket. In our case we need a reference to the PID namespace. A little digging through kernel code revealed that each message carries information about its sender (SCM_CREDENTIALS), including the sender's process identifier, which lets us take the pidns reference from the sender. This differs from network namespaces, where a socket is bound to the namespace it was created in. Taking the pidns reference from the process that requested the information seems acceptable, and as a bonus we gain the ability to specify the namespace we need, because sender information can be set at send time.
The first problem turned out to be far more interesting, although for a long time we could not pin down its details. File descriptors in Linux have a peculiarity: we can open a file and then drop privileges, and the descriptor remains fully functional. The same holds for netlink sockets, but here lies a problem that Andy Lutomirski pointed out to me: there is no way to specify what exactly a particular socket will be used for. So if an application creates a netlink socket and then drops privileges, it can still use that socket for any functionality available to netlink sockets; dropping privileges has no effect on the socket. That means any new functionality we add to netlink sockets automatically becomes reachable through previously opened, potentially more privileged descriptors, which is a security concern.
There were other suggestions for the interface. In particular, there was an idea to add a new system call, but I did not much like it: there may be too much data to return in a single buffer, whereas a file descriptor implies reading data in portions, which looks more reasonable to me.
There was also a proposal to make a transactional file in procfs. The idea is similar to what we did with netlink sockets: open the file, write a request, read the response. This is the idea we settled on as the working approach for the next version.
A few words about performance
The first version did not spark much discussion, but it helped find another group of people interested in a new, faster interface for obtaining process properties. One evening I shared my work with Pavel Odintsov (@pavelodintsov), who said he had recently run into performance problems with perf that also came down to the speed of collecting process attributes. That is how he put me in touch with David Ahern, who went on to contribute considerably to the development of the interface and showed, with yet another example, that this work is useful beyond our own project.
We can start the performance comparison with a simple example. Suppose that for every process we need to get the session ID, process group ID, and a few other parameters from /proc/PID/stat.
For a fair comparison, we will write a small program that reads /proc/PID/stat for each process. As we will see below, it works faster than the ps utility.
DIR *d = opendir("/proc");
struct dirent *de;
char buf[4096];
int fd, tasks = 0;

while ((de = readdir(d))) {
	/* skip entries that are not PID directories */
	if (de->d_name[0] < '0' || de->d_name[0] > '9')
		continue;
	snprintf(buf, sizeof(buf), "/proc/%s/stat", de->d_name);
	fd = open(buf, O_RDONLY);
	if (fd < 0)
		continue; /* the process may have already exited */
	read(fd, buf, sizeof(buf));
	close(fd);
	tasks++;
}
The program for task_diag is more involved; you can find it in my repository under tools/testing/selftests/task_diag/.
$ ps a -o pid,ppid,pgid,sid,comm | wc -l
50006
$ time ps a -o pid,ppid,pgid,sid,comm > /dev/null
real 0m1.256s
user 0m0.367s
sys 0m0.871s
$ time ./task_proc_all a
tasks: 50085
real 0m0.279s
user 0m0.013s
sys 0m0.255s
$ time ./task_diag_all a
real 0m0.051s
user 0m0.001s
sys 0m0.049s
Even on such a simple example, task_diag is several times faster. The ps utility is slower still because it reads more than one file per process.
Let's see what perf trace --summary shows for both variants.
$ perf trace --summary ./task_proc_all a
tasks: 50086
Summary of events:
task_proc_all (72414), 300753 events, 100.0%, 0.000 msec
syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 50091 0.003 0.005 0.925 0.40%
write 1 0.011 0.011 0.011 0.00%
open 50092 0.003 0.004 0.992 0.49%
close 50093 0.002 0.002 0.061 0.15%
fstat 7 0.002 0.003 0.008 25.95%
mmap 18 0.002 0.006 0.026 19.70%
mprotect 10 0.006 0.010 0.020 13.28%
munmap 2 0.012 0.020 0.028 40.18%
brk 3 0.003 0.007 0.010 30.28%
rt_sigaction 2 0.003 0.003 0.004 18.81%
rt_sigprocmask 1 0.003 0.003 0.003 0.00%
access 1 0.005 0.005 0.005 0.00%
getdents 50 0.003 0.940 2.023 4.51%
getrlimit 1 0.003 0.003 0.003 0.00%
arch_prctl 1 0.002 0.002 0.002 0.00%
set_tid_address 1 0.003 0.003 0.003 0.00%
openat 1 0.022 0.022 0.022 0.00%
set_robust_list 1 0.003 0.003 0.003 0.00%
$ perf trace --summary ./task_diag_all a
Summary of events:
task_diag_all (72481), 183 events, 94.8%, 0.000 msec
syscall calls min avg max stddev
(msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- ------
read 31 0.003 1.471 6.364 14.43%
write 1 0.003 0.003 0.003 0.00%
open 7 0.005 0.008 0.020 26.21%
close 6 0.002 0.002 0.003 3.96%
fstat 6 0.002 0.002 0.003 4.67%
mmap 17 0.002 0.006 0.030 25.38%
mprotect 10 0.005 0.007 0.010 6.33%
munmap 2 0.006 0.007 0.008 13.84%
brk 3 0.003 0.004 0.004 9.08%
rt_sigaction 2 0.002 0.002 0.002 9.57%
rt_sigprocmask 1 0.002 0.002 0.002 0.00%
access 1 0.006 0.006 0.006 0.00%
getrlimit 1 0.002 0.002 0.002 0.00%
arch_prctl 1 0.002 0.002 0.002 0.00%
set_tid_address 1 0.002 0.002 0.002 0.00%
set_robust_list 1 0.002 0.002 0.002 0.00%
The number of system calls in the task_diag case drops dramatically.
Results for the perf utility (quoted from a letter by David Ahern):
> Using the fork test command:
> 10,000 processes; 10k proc with 5 threads = 50,000 tasks
> reading /proc: 11.3 sec
> task_diag: 2.2 sec
>
> @7,440 tasks, reading /proc is at 0.77 sec and task_diag at 0.096
>
> 128 instances of sepcjbb, 80,000+ tasks:
> reading /proc: 32.1 sec
> task_diag: 3.9 sec
>
> So overall much snappier startup times.
Here we see a performance improvement of an order of magnitude.
Conclusion
The project is still under development and may change many times over, but we already have two real projects demonstrating a serious performance gain. I am almost certain that, in one form or another, this work will sooner or later land in the mainline kernel.
References
github.com/avagin/linux-task-diag
lkml.org/lkml/2015/7/6/142
lwn.net/Articles/633622
www.slideshare.net/openvz/speeding-up-ps-and-top-57448025