How effective is the procfs virtual file system and can it be optimized?

The proc file system (hereinafter simply procfs) is a virtual file system that provides information about processes. It is an “excellent” example of an interface that follows the “everything is a file” paradigm. Procfs was designed long ago, at a time when a server ran a few dozen processes on average and opening a file to read information about a process was not a problem. Times have changed, and servers now run hundreds of thousands of processes, or even more, at the same time. In this context, the idea of “opening a file for each process to read the data of interest” no longer looks so attractive, and the first thing that comes to mind for speeding up reads is to fetch information about a whole group of processes in one request. In this article we will try to find the parts of procfs that can be optimized.

The idea of improving procfs arose when we discovered that CRIU spends a significant amount of time just reading procfs files. We had seen how a similar problem was solved for sockets and decided to build something like the sock-diag interface, but for procfs. Of course, we anticipated how hard it would be to change a long-established interface in the kernel and to convince the community that the result was worth the effort ... and were pleasantly surprised by the number of people who supported the creation of a new interface. Strictly speaking, no one knew what the new interface should look like, but there is no doubt that procfs does not meet current performance requirements. Consider this scenario: the server takes too long to respond to requests, vmstat shows that memory has spilled into swap, “ps ax” takes 10 seconds or more to run, and top shows nothing at all.

In procfs, each running process is represented by a /proc/<pid> directory.
Each such directory contains many files and subdirectories that provide access to specific information about the process. Subdirectories group data by attribute. For example ($$ is a special shell variable that expands to the pid, the identifier of the current process):

$ ls -F /proc/$$
attr/            exe@        mounts         projid_map    status
autogroup        fd/         mountstats     root@         syscall
auxv             fdinfo/     net/           sched         task/
cgroup           gid_map     ns/            schedstat     timers
clear_refs       io          numa_maps      sessionid     timerslack_ns
cmdline          limits      oom_adj        setgroups     uid_map
comm             loginuid    oom_score      smaps         wchan
coredump_filter  map_files/  oom_score_adj  smaps_rollup
cpuset           maps        pagemap        stack
cwd@             mem         patch_state    stat
environ          mountinfo   personality    statm

These files all present data in different formats. Most use ASCII text that a human can read easily. Well, almost easily:

$ cat /proc/$$/stat
24293 (bash) S 218112429324293348542487642106886325197020101573335200104789201613548748833881844674407370955161594447405350912944474064161321407297194868160006553636700201266777851100172000009444740851652894447408563556944474296770561407297194946551407297194946601407297194946601407297194966860

To understand what each element of this set means, the reader has to open man proc(5) or the kernel documentation. For example, the second element is the name of the executable file in parentheses, and the nineteenth element is the current nice value (execution priority).
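As an illustration of what a monitoring tool has to do with such a line, here is a minimal Python sketch that pulls a few fields out of it. The function name parse_stat_line and the sample line are made up for this example; the field positions follow the description in proc(5).

```python
def parse_stat_line(line: str) -> dict:
    """Extract a few named fields from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')' characters, so search for the LAST ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    rest = line[close_paren + 1:].split()
    # rest[0] is field 3 (state), so field N is found at rest[N - 3].
    return {
        "pid": pid,
        "comm": comm,
        "state": rest[0],
        "ppid": int(rest[4 - 3]),
        "nice": int(rest[19 - 3]),
    }


# Made-up sample line; the values are illustrative, not from a real system.
sample = "4242 (my (odd) name) S 1 4242 4242 0 -1 4194304 100 0 0 0 5 3 0 0 20 5 1 0 12345"
info = parse_stat_line(sample)
print(info)
```

Splitting the whole line on spaces would break as soon as a command name contains a space, which is why the sketch anchors on the parentheses first.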

Some files are quite readable by themselves:

$ cat /proc/$$/status | head -n 5
Name:   bash
Umask:  0002
State:  S (sleeping)
Tgid:   24293
Ngid:   0

But how often do users read information directly from procfs files? How long does the kernel need to translate binary data into text? What is the overhead of procfs? How convenient is this interface for monitoring programs, and how much time do they spend processing this text data? How critical is such a slow implementation in emergency situations?

It is probably safe to say that users prefer programs like top or ps to reading data from procfs directly.

To answer the remaining questions we will run several experiments. First, let's find out where the kernel spends its time when generating procfs files.

To obtain a given piece of information about every process in the system, we have to walk the /proc/ directory and select all the subdirectories whose names consist of decimal digits. Then, in each of them, we need to open the file, read it and close it.

In total we execute three system calls per process, and one of them creates a file descriptor (in the kernel, a file descriptor is associated with a set of internal objects for which additional memory is allocated). The open() and close() system calls themselves give us no information, so they can be counted as overhead of the procfs interface.
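The loop described above can be sketched in Python as follows. This is only an illustration of the access pattern, not the utility used for the measurements; the function name read_per_process_file and its parameters are assumptions made for this example.

```python
import os


def read_per_process_file(name: str, proc_root: str = "/proc",
                          do_read: bool = True) -> dict:
    """Open, optionally read, and close <proc_root>/<pid>/<name> per process."""
    results = {}
    for entry in os.listdir(proc_root):
        # Only directories named with decimal digits describe processes.
        if not (entry.isascii() and entry.isdigit()):
            continue
        path = os.path.join(proc_root, entry, name)
        try:
            fd = os.open(path, os.O_RDONLY)  # syscall 1: creates a descriptor
        except OSError:
            continue  # the process exited, or the file is not accessible
        try:
            # syscall 2: a single read; large files may need more than one
            data = os.read(fd, 65536) if do_read else b""
        except OSError:
            data = b""
        finally:
            os.close(fd)  # syscall 3: destroys the descriptor
        results[int(entry)] = data
    return results
```

On a Linux machine, read_per_process_file("status") returns the raw status text keyed by pid, while passing do_read=False leaves only the open() and close() overhead.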

Let's try simply calling open() and close() for each process in the system, without reading the contents of the files:

$ time ./task_proc_all --noread stat
tasks: 50290
real    0m0.177s
user    0m0.012s
sys     0m0.162s

$ time ./task_proc_all --noread loginuid
tasks: 50289
real    0m0.176s
user    0m0.026s
sys     0m0.145s

task_proc_all is a small utility; its source is available via the link below.

It does not matter which file is opened, since the actual data is generated only at read() time.

Now let's look at the output of the perf kernel profiler:

-   92.18%     0.00%  task_proc_all  [unknown]
   - 0x8000
      - 64.01% __GI___libc_open
         - 50.71% entry_SYSCALL_64_fastpath
            - do_sys_open
               - 48.63% do_filp_open
                  - path_openat
                     - 19.60% link_path_walk
                        - 14.23% walk_component
                           - 13.87% lookup_fast
                              - 7.55% pid_revalidate
                                   4.13% get_pid_task
                                 + 1.58% security_task_to_inode
                                   1.10% task_dump_owner
                                3.63% __d_lookup_rcu
                        + 3.42% security_inode_permission
                     + 14.76% proc_pident_lookup
                     + 4.39% d_alloc_parallel
                     + 2.93% get_empty_filp
                     + 2.43% lookup_fast
                     + 0.98% do_dentry_open
           2.07% syscall_return_via_sysret
           1.60% 0xfffffe000008a01b
           0.97% kmem_cache_alloc
           0.61% 0xfffffe000008a01e
      - 16.45% __getdents64
         - 15.11% entry_SYSCALL_64_fastpath
              sys_getdents
              iterate_dir
            - proc_pid_readdir
               - 7.18% proc_fill_cache
                  + 3.53% d_lookup
                    1.59% filldir
               + 6.82% next_tgid
               + 0.61% snprintf
      - 9.89% __close
         + 4.03% entry_SYSCALL_64_fastpath
           0.98% syscall_return_via_sysret
           0.85% 0xfffffe000008a01b
           0.61% 0xfffffe000008a01e
        1.10% syscall_return_via_sysret

The kernel spends almost 75% of the time just creating and destroying the file descriptor (open() plus close()), and about 16% listing the processes.

Although we now know how long open() and close() take for each process, we cannot yet judge how significant that is. We need something to compare these values with. Let's repeat the experiment with the most commonly used files. The usual tools for displaying a list of processes are ps and top. Both read /proc/<pid>/stat and /proc/<pid>/status for each process in the system.

Let's start with /proc/<pid>/status, a large file with a fixed number of fields:

$ time ./task_proc_all status
tasks: 50283
real    0m0.455s
user    0m0.033s
sys     0m0.417s

-   93.84%     0.00%  task_proc_all  [unknown]  [k] 0x0000000000008000
   - 0x8000
      - 61.20% read
         - 53.06% entry_SYSCALL_64_fastpath
            - sys_read
               - 52.80% vfs_read
                  - 52.22% __vfs_read
                     - seq_read
                        - 50.43% proc_single_show
                           - 50.38% proc_pid_status
                              - 11.34% task_mem
                                 + seq_printf
                              + 6.99% seq_printf
                              - 5.77% seq_put_decimal_ull
                                   1.94% strlen
                                 + 1.42% num_to_str
                              - 5.73% cpuset_task_status_allowed
                                 + seq_printf
                              - 5.37% render_cap_t
                                 + 5.31% seq_printf
                              - 5.25% render_sigset_t
                                   0.84% seq_putc
                                0.73% __task_pid_nr_ns
                              + 0.63% __lock_task_sighand
                                0.53% hugetlb_report_usage
                        + 0.68% _copy_to_user
           1.10% number
           1.05% seq_put_decimal_ull
           0.84% vsnprintf
           0.79% format_decode
           0.73% syscall_return_via_sysret
           0.52% 0xfffffe000003201b
      + 20.95% __GI___libc_open
      + 6.44% __getdents64
      + 4.10% __close

We can see that only about 60% of the time is spent inside the read() system call. A closer look at the profile shows that 45% of the time is spent inside the kernel functions seq_printf and seq_put_decimal_ull. In other words, converting from binary format to text is quite an expensive operation. This raises a reasonable question: do we really need a text interface to pull data out of the kernel? How often do users want to work with the raw data? And why should top and ps have to convert this text back into binary form?
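The reverse conversion that every reader of /proc/<pid>/status has to perform can be sketched like this. The helper name parse_status is an assumption made for this example; the sample text reuses the first lines of the status output shown earlier.

```python
def parse_status(text: str) -> dict:
    """Turn 'Key:<whitespace>value' lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key] = value.strip()
    return fields


sample = "Name:\tbash\nUmask:\t0002\nState:\tS (sleeping)\nTgid:\t24293\nNgid:\t0\n"
status = parse_status(sample)

# The kernel formatted these integers as text; the consumer now has to
# turn them back into integers before it can use them.
tgid = int(status["Tgid"])
umask = int(status["Umask"], 8)  # Umask is printed in octal
print(status["Name"], tgid, umask)
```

Every value makes a round trip from binary to text in the kernel and from text back to binary in user space.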

It would be interesting to know how much faster the output would be if binary data were used directly and three system calls per process were not required.

There have already been attempts to create such an interface. In 2004, there was an attempt to use the netlink engine.

[0/2][ANNOUNCE] nproc: netlink access to /proc information (https://lwn.net/Articles/99600/)
nproc is an attempt to address the current problems with /proc. In
short, it exposes the same information via netlink (implemented for a
small subset).

Unfortunately, the community did not show much interest in this work. One of the most recent attempts to rectify the situation was made two years ago.

[PATCH 0/15] task_diag: add a new interface to get information about processes (https://lwn.net/Articles/683371/)

The task-diag interface is based on the following principles:

Transactional nature: send a request, receive a response;
Message format follows netlink (the same as in the sock_diag interface: binary and extensible);
Ability to request information about multiple processes in one call;
Optimized attribute grouping (no attribute in a group should increase the response time).
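To show what “binary and extensible” means for a netlink-style message, here is a minimal Python sketch of the attribute layout netlink uses: a 16-bit length, a 16-bit type, then the payload padded to a 4-byte boundary. The attribute numbers ATTR_PID and ATTR_COMM are hypothetical and are not the constants defined by the task_diag patches.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type in native byte order


def nla_align(length: int) -> int:
    """Round a length up to the 4-byte alignment used by netlink."""
    return (length + 3) & ~3


def pack_attr(attr_type: int, payload: bytes) -> bytes:
    length = NLA_HDR.size + len(payload)  # nla_len covers header + payload
    padding = b"\0" * (nla_align(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def parse_attrs(buf: bytes) -> dict:
    attrs = {}
    offset = 0
    while offset + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, offset)
        if length < NLA_HDR.size or offset + length > len(buf):
            break  # malformed attribute, stop parsing
        attrs[attr_type] = buf[offset + NLA_HDR.size:offset + length]
        offset += nla_align(length)  # step over the padding as well
    return attrs


# Hypothetical attribute numbers, chosen only for this illustration.
ATTR_PID = 1
ATTR_COMM = 2

msg = pack_attr(ATTR_PID, struct.pack("=I", 2305)) + pack_attr(ATTR_COMM, b"bash\0")
attrs = parse_attrs(msg)
print(struct.unpack("=I", attrs[ATTR_PID])[0], attrs[ATTR_COMM])
```

Because every attribute carries its own type and length, a reader can skip attribute types it does not know, which is what makes the format extensible without breaking older programs.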

This interface has been presented at several conferences. It was integrated into pstools and CRIU, and David Ahern integrated task_diag into perf as an experiment.

The kernel developer community became interested in the task_diag interface. The main subject of discussion was the choice of transport between the kernel and user space. The initial idea of using netlink sockets was rejected, partly because of unresolved problems in the code of the netlink engine itself, and partly because many people think that the netlink interface was designed exclusively for the network subsystem. It was then proposed to use transactional files inside procfs: the user opens a file, writes a request to it, and then simply reads the answer. As usual, this approach had its opponents too. A solution that everyone likes has not yet been found.

Let's compare the performance of task_diag with procfs.

The task_diag engine comes with a test utility that is well suited to our experiments. Suppose we want to request process IDs and their credentials. Below is the output for one process:

$ ./task_diag_all one -c -p $$
pid  2305 tgid  2305 ppid  2299 sid  2305 pgid  2305 comm bash
uid: 1000 1000 1000 1000
gid: 1000 1000 1000 1000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff

And now for all processes in the system, which is the same thing we did in the procfs experiment when we read the /proc/<pid>/status file:

$ time ./task_diag_all all -c
real    0m0.048s
user    0m0.001s
sys     0m0.046s

It took only 0.05 seconds to get the data needed to build the process tree. With procfs, it took 0.177 seconds just to open one file per process, without reading any data.

The perf output for the task_diag interface:

-   82.24%     0.00%  task_diag_all  [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath
   - entry_SYSCALL_64_fastpath
      - 81.84% sys_read
           vfs_read
           __vfs_read
           proc_reg_read
           task_diag_read
         - taskdiag_dumpit
            + 33.84% next_tgid
              13.06% __task_pid_nr_ns
            + 6.63% ptrace_may_access
            + 5.68% from_kuid_munged
            - 4.19% __get_task_comm
                 2.90% strncpy
                 1.29% _raw_spin_lock
              3.03% __nla_reserve
              1.73% nla_reserve
            + 1.30% skb_copy_datagram_iter
            + 1.21% from_kgid_munged
              1.12% strncpy

The listing itself holds nothing remarkable, other than the absence of any obvious candidates for optimization.

Let's look at the perf trace output when reading information about all processes in the system:

 $ perf trace -s ./task_diag_all all -c -q
 Summary of events:
 task_diag_all (54326), 185 events, 95.4%
   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   read                  49    40.209     0.002     0.821     4.126      9.50%
   mmap                  11     0.051     0.003     0.005     0.007      9.94%
   mprotect               8     0.047     0.003     0.006     0.009     10.42%
   openat                 5     0.042     0.005     0.008     0.020     34.86%
   munmap                 1     0.014     0.014     0.014     0.014      0.00%
   fstat                  4     0.006     0.001     0.002     0.002     10.47%
   access                 1     0.006     0.006     0.006     0.006      0.00%
   close                  4     0.004     0.001     0.001     0.001      2.11%
   write                  1     0.003     0.003     0.003     0.003      0.00%
   rt_sigaction           2     0.003     0.001     0.001     0.002     15.43%
   brk                    1     0.002     0.002     0.002     0.002      0.00%
   prlimit64              1     0.001     0.001     0.001     0.001      0.00%
   arch_prctl             1     0.001     0.001     0.001     0.001      0.00%
   rt_sigprocmask         1     0.001     0.001     0.001     0.001      0.00%
   set_robust_list        1     0.001     0.001     0.001     0.001      0.00%
   set_tid_address        1     0.001     0.001     0.001     0.001      0.00%

With procfs we need to make more than 150,000 system calls to pull out information about all processes; with task_diag, a little more than 50.

Let's look at a real-life situation. Suppose we want to display the process tree along with the command line arguments of each process. To do this, we need the pid of each process, the pid of its parent, and the command line arguments themselves.

For the task_diag interface, the program sends one request to get all the parameters at once:

$ time ./task_diag_all all --cmdline -q
real    0m0.096s
user    0m0.006s
sys     0m0.090s

For the original procfs, we need to read /proc/<pid>/status and /proc/<pid>/cmdline for each process:

$ time ./task_proc_all status
tasks: 50278
real    0m0.463s
user    0m0.030s
sys     0m0.427s

$ time ./task_proc_all cmdline
tasks: 50281
real    0m0.270s
user    0m0.028s
sys     0m0.237s

It is easy to see that task_diag is about 7 times faster than procfs (0.096 seconds versus 0.27 + 0.46 seconds). A performance improvement of a few percent is usually already a good result, and here the speed has increased by almost an order of magnitude.
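For readers who want to check the arithmetic, the speedup follows directly from the wall-clock times reported above. The variable names in this small sketch are ours.

```python
# Wall-clock ("real") times in seconds, copied from the runs above.
procfs_status_s = 0.463
procfs_cmdline_s = 0.270
task_diag_s = 0.096

# procfs needs both passes; task_diag returns everything in one request.
procfs_total_s = procfs_status_s + procfs_cmdline_s
speedup = procfs_total_s / task_diag_s
print(f"procfs: {procfs_total_s:.3f} s, task_diag: {task_diag_s:.3f} s, "
      f"speedup: {speedup:.1f}x")
```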

Creating internal kernel objects also has a large effect on performance, especially when the memory subsystem is under heavy load. Let's compare the number of objects created for procfs and for task_diag:

$ perf trace --event 'kmem:*alloc*' ./task_proc_all status 2>&1 | grep kmem | wc -l
58184
$ perf trace --event 'kmem:*alloc*' ./task_diag_all all -q 2>&1 | grep kmem | wc -l
188

As a baseline, we also need to know how many objects are created just by starting a simple process, for example the true utility:

$ perf trace --event 'kmem:*alloc*' true 2>&1 | wc -l
94

Once the 94 allocations needed just to start a process are subtracted from both counts, procfs creates about 600 times more objects than task_diag. This is one of the reasons procfs performs so badly when memory is under pressure, and it is reason enough to optimize it.
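The ratio can be reproduced from the three counts above. This short sketch assumes that the count measured for true is a fair estimate of the startup cost of both test programs; the variable names are ours.

```python
# Allocation event counts copied from the perf trace runs above.
procfs_allocs = 58184
task_diag_allocs = 188
startup_allocs = 94  # measured for the `true` utility

# Ratio of the raw counts, startup cost included.
raw_ratio = procfs_allocs / task_diag_allocs

# Ratio after removing the (assumed identical) startup cost from both.
adjusted_ratio = (procfs_allocs - startup_allocs) / (task_diag_allocs - startup_allocs)

print(f"raw: {raw_ratio:.0f}x, adjusted: {adjusted_ratio:.0f}x")
```

The raw counts differ by a factor of about 300; the factor of about 600 appears only after the baseline is subtracted.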

We hope this article will attract more developers to optimizing the procfs kernel subsystem.

Many thanks to David Ahern, Andy Lutomirski, Stephen Hemminger, Oleg Nesterov, W. Trevor King, Arnd Bergmann, Eric W. Biederman, and many others who helped develop and improve the task_diag interface.

Thanks to cromer, k001 and Stanislav Kinsbursky for helping to write this article.

Links
