How effective is the procfs virtual file system and can it be optimized?
The proc file system (hereinafter simply procfs) is a virtual file system that provides information about processes. It is an “excellent” example of an interface that follows the “everything is a file” paradigm. Procfs was developed a long time ago, when servers ran a few dozen processes on average and opening a file to read information about a process was not a problem. Time does not stand still, however, and servers now run hundreds of thousands of processes at once, or even more. In this context, the idea of “opening a file for each process to read the data of interest” no longer looks so attractive, and the first thing that comes to mind to speed up reading is to fetch information about a group of processes in a single request. In this article we will try to find the parts of procfs that can be optimized.
The idea of improving procfs arose when we discovered that CRIU spends a significant amount of time just reading procfs files. We had seen how a similar problem was solved for sockets, and decided to build something like the sock-diag interface, but for procfs. Of course, we expected that it would be hard to change an old and well-established kernel interface and to convince the community that the effort was worthwhile ... and were pleasantly surprised by the number of people who supported the creation of a new interface. Strictly speaking, no one knew what the new interface should look like, but there is no doubt that procfs does not meet current performance requirements. Consider this scenario: the server takes too long to respond to requests, vmstat shows that memory has gone to swap, “ps ax” takes 10 seconds or more to run, and top shows nothing at all.
In procfs, each running process is represented by the /proc/<pid> directory.
In each such directory there are many files and subdirectories that provide access to specific information about the process. Subdirectories group data by attributes. For example ($$ is a special shell variable that expands to the pid, the identifier of the current process):
$ ls -F /proc/$$
attr/            exe@        mounts         projid_map    status
autogroup        fd/         mountstats     root@         syscall
auxv             fdinfo/     net/           sched         task/
cgroup           gid_map     ns/            schedstat     timers
clear_refs       io          numa_maps      sessionid     timerslack_ns
cmdline          limits      oom_adj        setgroups     uid_map
comm             loginuid    oom_score      smaps         wchan
coredump_filter  map_files/  oom_score_adj  smaps_rollup
cpuset           maps        pagemap        stack
cwd@             mem         patch_state    stat
environ          mountinfo   personality    statm
All these files present data in different formats. Most use ASCII text, which is easy for a human to read. Well, almost easy:
$ cat /proc/$$/stat
24293 (bash) S 21811 24293 24293 34854 24876 4210688 6325 19702 0 10 15 7 33 35 20 0 1 0 47892016 135487488 3388 18446744073709551615 94447405350912 94447406416132 140729719486816 0 0 0 65536 3670020 1266777851 1 0 0 17 2 0 0 0 0 0 94447408516528 94447408563556 94447429677056 140729719494655 140729719494660 140729719494660 140729719496686 0
To understand what each element of this set means, the reader has to open the proc(5) man page or the kernel documentation. For example, the second element is the name of the executable file in parentheses, and the nineteenth element is the current nice value (the execution priority).
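Splitting such a line is slightly trickier than it looks, because the name in parentheses may itself contain spaces or parentheses. Below is a minimal Python sketch of one way to do it; the function name `parse_stat` and the sample line are invented for illustration and are not taken from any existing tool.

```python
def parse_stat(line):
    """Split one /proc/<pid>/stat line into a list of string fields.

    The comm field (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so we cut at the LAST ')' instead of
    splitting the whole line on whitespace.
    """
    left, _, rest = line.rpartition(")")
    pid, _, comm = left.partition(" (")
    return [pid, comm] + rest.split()


# Hypothetical, shortened sample line (a real line has about 50 fields).
sample = ("4242 (my (odd) name) S 1 4242 4242 0 -1 4194304 "
          "10 0 0 0 1 2 0 0 25 5 1 0 12345")
fields = parse_stat(sample)

# proc(5) numbers fields from 1, Python lists from 0.
print("comm:", fields[1])          # field 2
print("nice:", int(fields[18]))    # field 19
```

A naive `line.split()` would break the name "my (odd) name" into three fields and shift every following index.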
Some files are quite readable by themselves:
$ cat /proc/$$/status | head -n 5
Name:   bash
Umask:  0002
State:  S (sleeping)
Tgid:   24293
Ngid:   0
But how often do users read information directly from procfs files? How long does the kernel need to translate binary data into text? What is the overhead of procfs? How convenient is this interface for monitoring programs, and how much time do they spend processing this text data? How critical is such a slow implementation in emergency situations?
It is probably safe to say that users prefer programs like top or ps to reading data from procfs directly.
To answer the remaining questions, we will conduct several experiments. First, let's find out where the kernel spends time when generating procfs files.
To obtain a particular piece of information about all processes in the system, we have to walk through the /proc directory and select all the subdirectories whose names consist of decimal digits. Then, in each of them, we need to open a file, read it and close it.
In total, we execute three system calls per process, and one of them creates a file descriptor (in the kernel, a file descriptor is associated with a set of internal objects for which additional memory is allocated). The open() and close() system calls themselves give us no information, so they can be counted as overhead of the procfs interface.
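The loop described above can be sketched in a few lines of Python. This is only an illustration of the access pattern, not the utility used for the measurements; the function name `scan_tasks` and its parameters are assumptions made for this sketch.

```python
import os


def scan_tasks(name, proc_root="/proc", read=True):
    """Open <proc_root>/<pid>/<name> for every numeric directory.

    Returns (tasks, nbytes): how many files were opened and how many
    bytes were read. With read=False only open() and close() are issued.
    """
    tasks = nbytes = 0
    for entry in os.listdir(proc_root):          # directory listing
        if not (entry.isascii() and entry.isdigit()):
            continue                             # keep only PID directories
        try:
            fd = os.open(os.path.join(proc_root, entry, name), os.O_RDONLY)
        except OSError:
            continue                             # process exited, or no access
        try:
            while read:
                chunk = os.read(fd, 4096)        # read() until EOF
                if not chunk:
                    break
                nbytes += len(chunk)
        except OSError:
            pass                                 # the task went away mid-read
        finally:
            os.close(fd)
        tasks += 1
    return tasks, nbytes
```

Calling `scan_tasks("status")` reads the file for every process, while `scan_tasks("stat", read=False)` only opens and closes it. The `proc_root` parameter exists so that the function can be tried against an ordinary directory tree.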
Let's try to call just open() and close() for each process in the system, without reading the contents of the files:
$ time ./task_proc_all --noread stat
tasks: 50290

real 0m0.177s
user 0m0.012s
sys  0m0.162s

$ time ./task_proc_all --noread loginuid
tasks: 50289

real 0m0.176s
user 0m0.026s
sys  0m0.145s
task_proc_all is a small utility whose code can be viewed through the link below.
It does not matter which file we open, since the actual data is generated only at the time of read().
And now let's look at the output of perf, the kernel profiler:
-   92.18%     0.00%  task_proc_all  [unknown]
   - 0x8000
      - 64.01% __GI___libc_open
         - 50.71% entry_SYSCALL_64_fastpath
            - do_sys_open
               - 48.63% do_filp_open
                  - path_openat
                     - 19.60% link_path_walk
                        - 14.23% walk_component
                           - 13.87% lookup_fast
                              - 7.55% pid_revalidate
                                   4.13% get_pid_task
                                 + 1.58% security_task_to_inode
                                   1.10% task_dump_owner
                                3.63% __d_lookup_rcu
                        + 3.42% security_inode_permission
                     + 14.76% proc_pident_lookup
                     + 4.39% d_alloc_parallel
                     + 2.93% get_empty_filp
                     + 2.43% lookup_fast
                     + 0.98% do_dentry_open
           2.07% syscall_return_via_sysret
           1.60% 0xfffffe000008a01b
           0.97% kmem_cache_alloc
           0.61% 0xfffffe000008a01e
      - 16.45% __getdents64
         - 15.11% entry_SYSCALL_64_fastpath
              sys_getdents
              iterate_dir
            - proc_pid_readdir
               - 7.18% proc_fill_cache
                  + 3.53% d_lookup
                    1.59% filldir
               + 6.82% next_tgid
               + 0.61% snprintf
      - 9.89% __close
         + 4.03% entry_SYSCALL_64_fastpath
           0.98% syscall_return_via_sysret
           0.85% 0xfffffe000008a01b
           0.61% 0xfffffe000008a01e
        1.10% syscall_return_via_sysret
The kernel spends almost 75% of the time just creating and destroying the file descriptor, and about 16% listing the processes.
Although we know how long the open() and close() calls take for each process, we cannot yet assess how significant that is. We need something to compare the obtained values with. Let's do the same with the best-known files. Usually, when you need to display a list of processes, you use the ps or top utility. They both read /proc/<pid>/stat and /proc/<pid>/status for each process in the system.
Let's start with /proc/<pid>/status. This is a massive file with a fixed number of fields:
$ time ./task_proc_all status
tasks: 50283

real 0m0.455s
user 0m0.033s
sys  0m0.417s
-   93.84%     0.00%  task_proc_all  [unknown]  [k] 0x0000000000008000
   - 0x8000
      - 61.20% read
         - 53.06% entry_SYSCALL_64_fastpath
            - sys_read
               - 52.80% vfs_read
                  - 52.22% __vfs_read
                     - seq_read
                        - 50.43% proc_single_show
                           - 50.38% proc_pid_status
                              - 11.34% task_mem
                                 + seq_printf
                              + 6.99% seq_printf
                              - 5.77% seq_put_decimal_ull
                                   1.94% strlen
                                 + 1.42% num_to_str
                              - 5.73% cpuset_task_status_allowed
                                 + seq_printf
                              - 5.37% render_cap_t
                                 + 5.31% seq_printf
                              - 5.25% render_sigset_t
                                   0.84% seq_putc
                                0.73% __task_pid_nr_ns
                              + 0.63% __lock_task_sighand
                                0.53% hugetlb_report_usage
                        + 0.68% _copy_to_user
           1.10% number
           1.05% seq_put_decimal_ull
           0.84% vsnprintf
           0.79% format_decode
           0.73% syscall_return_via_sysret
           0.52% 0xfffffe000003201b
      + 20.95% __GI___libc_open
      + 6.44% __getdents64
      + 4.10% __close
It can be seen that only about 60% of the time is spent inside the read() system call. A closer look at the profile shows that 45% of the time is spent inside the kernel functions seq_printf and seq_put_decimal_ull. So converting from binary format to text is quite an expensive operation. This raises a reasonable question: do we really need a text interface to pull data out of the kernel? How often do users want to work with raw data? And why do top and ps have to convert this text data back into binary form?
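The user-space half of that round trip can be illustrated with a short Python sketch. It is not how ps or top are implemented; `parse_status` is a name invented here, and the sample text reuses the first lines of the status file shown earlier.

```python
def parse_status(text):
    """Turn the text of /proc/<pid>/status into a dict of raw strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


# The first lines of a status file, as the kernel prints them.
sample = ("Name:\tbash\nUmask:\t0002\nState:\tS (sleeping)\n"
          "Tgid:\t24293\nNgid:\t0\n")
status = parse_status(sample)

# The kernel formatted these numbers as text; the reader converts them back.
tgid = int(status["Tgid"])
umask = int(status["Umask"], 8)    # Umask is printed in octal
print(status["Name"], tgid, umask)
```

Every numeric field goes through a number-to-string conversion in the kernel and a string-to-number conversion in the reader, for every process, on every refresh.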
It would probably be interesting to know how much faster the output would be if binary data were used directly, and if three system calls were not required.
There have already been attempts to create such an interface. In 2004, there was an attempt to use the netlink engine.
[0/2][ANNOUNCE] nproc: netlink access to /proc information (https://lwn.net/Articles/99600/)
nproc is an attempt to address the current problems with /proc. In short, it exposes the same information via netlink (implemented for a small subset).
Unfortunately, the community did not show much interest in this work. One of the latest attempts to remedy the situation took place two years ago.
task_diag: add a new interface to get information about processes (https://lwn.net/Articles/683371/)
The task-diag interface is based on the following principles:
- Transactional nature: send a request, receive a response;
- Messages use the netlink format (the same as in the sock_diag interface: binary and extensible);
- Ability to request information about multiple processes in one call;
- Optimized attribute grouping (any attribute in a group should not increase response time).
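To give a feel for what “binary and extensible” means, here is a small Python sketch of the generic netlink attribute encoding: a 4-byte header holding the length and type, followed by the payload padded to a 4-byte boundary. The attribute numbers below are hypothetical and invented for this sketch; they are not the actual task_diag message layout.

```python
import struct

NLA_HDR = struct.Struct("=HH")   # struct nlattr: u16 nla_len, u16 nla_type
NLA_ALIGNTO = 4

# Hypothetical attribute numbers, invented for this sketch only.
ATTR_PID, ATTR_PPID, ATTR_COMM = 1, 2, 3


def put_attr(attr_type, payload):
    """Encode one attribute: header, payload, zero padding to 4 bytes."""
    length = NLA_HDR.size + len(payload)     # nla_len excludes the padding
    pad = -length % NLA_ALIGNTO
    return NLA_HDR.pack(length, attr_type) + payload + b"\0" * pad


def parse_attrs(buf):
    """Decode a run of attributes into {type: payload bytes}."""
    attrs, off = {}, 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            break                            # malformed or truncated
        attrs[attr_type] = buf[off + NLA_HDR.size:off + length]
        off += (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)
    return attrs


msg = (put_attr(ATTR_PID, struct.pack("=I", 2305))
       + put_attr(ATTR_PPID, struct.pack("=I", 2299))
       + put_attr(ATTR_COMM, b"bash\0"))
attrs = parse_attrs(msg)
pid = struct.unpack("=I", attrs[ATTR_PID])[0]
print(pid, attrs[ATTR_COMM].rstrip(b"\0").decode())
```

Numbers travel as fixed-size integers, so nothing is formatted or parsed as text. A reader that does not know an attribute type can skip it using the length in the header, which is what makes the format extensible.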
This interface has been presented at several conferences. It was integrated into pstools and CRIU, and David Ahern integrated task_diag into perf as an experiment.
The kernel developer community became interested in the task_diag interface. The main subject of discussion was the choice of transport between the kernel and user space. The initial idea of using netlink sockets was rejected, partly because of unresolved problems in the netlink code itself, and partly because many people believe that the netlink interface was designed exclusively for the network subsystem. It was then proposed to use transactional files inside procfs: the user opens a file, writes a request to it, and then simply reads the answer. As usual, this approach had its opponents too. A solution that suits everyone has not yet been found.
Let's compare the performance of task_diag with procfs.
task_diag comes with a test utility that is well suited to our experiments. Suppose we want to request process identifiers and credentials. Below is the output for one process:
$ ./task_diag_all one -c -p $$
pid 2305 tgid 2305 ppid 2299 sid 2305 pgid 2305 comm bash
uid: 1000 1000 1000 1000
gid: 1000 1000 1000 1000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
And now for all processes in the system, that is, the same thing we did in the procfs experiment when we read the /proc/<pid>/status file:
$ time ./task_diag_all all -c

real 0m0.048s
user 0m0.001s
sys  0m0.046s
It took only 0.05 seconds to get the data needed to build the process tree. With procfs, it took 0.177 seconds just to open one file per process, without reading any data.
The perf output for the task_diag interface:
-   82.24%     0.00%  task_diag_all  [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath
   - entry_SYSCALL_64_fastpath
      - 81.84% sys_read
           vfs_read
           __vfs_read
           proc_reg_read
           task_diag_read
         - taskdiag_dumpit
            + 33.84% next_tgid
              13.06% __task_pid_nr_ns
            + 6.63% ptrace_may_access
            + 5.68% from_kuid_munged
            - 4.19% __get_task_comm
                 2.90% strncpy
                 1.29% _raw_spin_lock
              3.03% __nla_reserve
              1.73% nla_reserve
            + 1.30% skb_copy_datagram_iter
            + 1.21% from_kgid_munged
              1.12% strncpy
There is nothing interesting in the listing itself, except for the fact that there are no obvious functions suitable for optimization.
Let's look at the perf output when reading information about all processes in the system:
$ perf trace -s ./task_diag_all all -c -q

 Summary of events:

 task_diag_all (54326), 185 events, 95.4%

   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------
   read                  49    40.209     0.002     0.821     4.126      9.50%
   mmap                  11     0.051     0.003     0.005     0.007      9.94%
   mprotect               8     0.047     0.003     0.006     0.009     10.42%
   openat                 5     0.042     0.005     0.008     0.020     34.86%
   munmap                 1     0.014     0.014     0.014     0.014      0.00%
   fstat                  4     0.006     0.001     0.002     0.002     10.47%
   access                 1     0.006     0.006     0.006     0.006      0.00%
   close                  4     0.004     0.001     0.001     0.001      2.11%
   write                  1     0.003     0.003     0.003     0.003      0.00%
   rt_sigaction           2     0.003     0.001     0.001     0.002     15.43%
   brk                    1     0.002     0.002     0.002     0.002      0.00%
   prlimit64              1     0.001     0.001     0.001     0.001      0.00%
   arch_prctl             1     0.001     0.001     0.001     0.001      0.00%
   rt_sigprocmask         1     0.001     0.001     0.001     0.001      0.00%
   set_robust_list        1     0.001     0.001     0.001     0.001      0.00%
   set_tid_address        1     0.001     0.001     0.001     0.001      0.00%
With procfs, we need to make more than 150,000 system calls to pull out information about all processes; with task_diag, a little more than 50.
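The procfs figure follows directly from the three calls per process. A quick check in Python, using the task count from the status run and the number of read() calls from the perf trace summary (the variable names are only for illustration):

```python
tasks = 50283                      # "tasks:" value from the status run
per_task = 3                       # open() + read() + close()
procfs_calls = tasks * per_task    # directory listing calls come on top

task_diag_reads = 49               # read() calls in the perf trace summary

print(procfs_calls)                       # 150849
print(procfs_calls // task_diag_reads)    # 3078
```

So procfs needs roughly three thousand times as many calls as task_diag needs reads to deliver the data.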
Let's look at a real-life scenario. For example, we want to display a process tree along with the command line arguments of each process. To do this, we need to pull out the pid of the process, the pid of its parent, and the command line arguments themselves.
For the task_diag interface, the program sends one request to get all the parameters at once:
$ time ./task_diag_all all --cmdline -q

real 0m0.096s
user 0m0.006s
sys  0m0.090s
For the original procfs, we need to read /proc/<pid>/status and /proc/<pid>/cmdline for each process:
$ time ./task_proc_all status
tasks: 50278

real 0m0.463s
user 0m0.030s
sys  0m0.427s

$ time ./task_proc_all cmdline
tasks: 50281

real 0m0.270s
user 0m0.028s
sys  0m0.237s
It is easy to see that task_diag is more than 7 times faster than procfs (0.096 versus 0.27 + 0.46 seconds). Usually a performance improvement of a few percent is already a good result, and here the speed has increased by almost an order of magnitude.
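For comparison, the procfs side of this scenario can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code behind the measurements; the function names `read_file`, `collect` and `children_of` are invented for the sketch.

```python
import os
from collections import defaultdict


def read_file(path):
    """One open()/read()/close() round trip; b'' if the task is gone."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        return b""


def collect(proc_root="/proc"):
    """Return {pid: (ppid, argv)} built from status and cmdline files."""
    procs = {}
    for entry in os.listdir(proc_root):
        if not (entry.isascii() and entry.isdigit()):
            continue
        base = os.path.join(proc_root, entry)
        ppid = None
        for line in read_file(os.path.join(base, "status")).splitlines():
            if line.startswith(b"PPid:"):
                ppid = int(line.split()[1])
                break
        if ppid is None:
            continue                       # process exited while we looked
        # cmdline holds the arguments separated by NUL bytes; empty
        # arguments are dropped here for brevity.
        raw = read_file(os.path.join(base, "cmdline"))
        argv = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        procs[int(entry)] = (ppid, argv)
    return procs


def children_of(procs):
    """Group pids by parent pid so that a tree can be printed."""
    tree = defaultdict(list)
    for pid, (ppid, _argv) in sorted(procs.items()):
        tree[ppid].append(pid)
    return tree
```

Each process costs two files, and therefore two sets of open(), read() and close() calls, plus text parsing to find a single PPid line.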
It is also worth mentioning that the creation of internal kernel objects greatly affects performance, especially when the memory subsystem is under heavy load. Compare the number of objects created for procfs and task_diag:
$ perf trace --event 'kmem:*alloc*' ./task_proc_all status 2>&1 | grep kmem | wc -l
58184
$ perf trace --event 'kmem:*alloc*' ./task_diag_all all -q 2>&1 | grep kmem | wc -l
188
We also need to find out how many objects are created when a simple process, for example the true utility, is started:
$ perf trace --event 'kmem:*alloc*' true 2>&1 | wc -l
94
Procfs creates about 600 times more objects than task_diag. This is one of the reasons why procfs performs so badly under memory pressure. For this reason alone, it is worth optimizing.
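One reading that reproduces this figure is to treat the count for true as a baseline, subtract it from both measurements and compare the remainders. That this is how the number was derived is an assumption; the arithmetic itself is simple:

```python
procfs_events = 58184     # kmem:*alloc* events for ./task_proc_all status
task_diag_events = 188    # the same for ./task_diag_all all -q
baseline = 94             # lines counted for merely starting `true`

# Subtract what any process start costs, then compare the remainders.
ratio = (procfs_events - baseline) / (task_diag_events - baseline)
print(round(ratio))       # 618
```

Without the subtraction, the raw ratio of the two counts is only about 310.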
We hope that this article will attract more developers to improving the procfs kernel subsystem.
Many thanks to David Ahern, Andy Lutomirski, Stephen Hemming, Oleg Nesterov, W. Trevor King, Arnd Bergmann, Eric W. Biederman, and many others who helped develop and improve the task_diag interface.
Thanks to cromer, k001 and Stanislav Kinsbursky for helping to write this article.