Exploring the Inner Workings of the Linux Kernel with /proc for Quick Diagnostics and Troubleshooting

Original author: Tanel Poder
  • Translation
This article is about modern Linux. For example, RHEL6 with its 2.6.3x kernels will do, but RHEL5 with its 2.6.18 kernel (by the way, still the most popular in production) - alas, will not. And no - there will be no kernel debuggers or SystemTap scripts here; only good old simple commands like "cat /proc/PID/xyz" against some useful nodes of the /proc file system.

Diagnosing a "slow" process


Here is a good example of a frequently occurring problem, which I reproduced on my laptop: a user complains that the find command runs "much slower" and does not return any results. Knowing what was going on, we solved the problem. However, I was asked to lay out a systematic approach to troubleshooting problems like this.

Fortunately, the system runs OEL6, i.e. a fairly fresh kernel (namely 2.6.39 UEK2).

So, let's proceed to the diagnosis.

First, check if the find process is still alive:

[root@oel6 ~]# ps -ef | grep find
root     27288 27245  4 11:57 pts/0    00:00:01 find . -type f
root     27334 27315  0 11:57 pts/1    00:00:00 grep find

Yes, it is still there - PID 27288 (I will use this PID throughout the diagnostic example below).

Let's start with the basics and see what the bottleneck is for this process. If it is not blocked by anything (for example, it reads everything it needs from the cache), it should be using 100% of a CPU. If it is blocked on I/O or network problems, the CPU load should be lower or absent altogether.

[root@oel6 ~]# top -cbp 27288
top - 11:58:15 up 7 days, 3:38, 2 users, load average: 1.21, 0.65, 0.47
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.1%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2026460k total,  1935780k used,    90680k free,    64416k buffers
Swap:  4128764k total,   251004k used,  3877760k free,   662280k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27288 root      20   0  109m 1160  844 D  0.0  0.1   0:01.11 find . -type f

The output of top says that this process is not loading the CPU at all (or the load is so small it can be rounded to zero). There is, however, a very important difference between a process that is completely frozen, never even getting a chance at a CPU quantum, and a process that constantly wakes up and then immediately goes back to sleep (for example, a poll operation that keeps timing out, after which the process calls it again and goes back to sleep). Unfortunately, top's output alone is not enough to tell these two states apart. But at least we have established that this process is not devouring CPU time.
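
(A small aside of my own, not in the original text: one way to tell those two states apart using /proc alone is to sample the process's accumulated CPU ticks a few times. Fields 14 and 15 of /proc/PID/stat are utime and stime in clock ticks; if their sum never grows, the process is getting no CPU time at all. A minimal sketch, reusing the PID from above:)

# sample the total CPU ticks (utime + stime, fields 14 and 15 of /proc/PID/stat)
# once a second; an unchanging sum means the process gets no CPU time at all
for i in {1..5}; do awk '{ print $14 + $15 }' /proc/27288/stat; sleep 1; done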

Let's try other tools. Usually, if a process seems to hang like this (0% CPU typically means the process is sitting in some blocking system call, which makes the kernel put it to sleep), I run strace on that process to see which system call it is currently stuck in. And if the process is not in fact completely hung, but periodically returns from its system call and wakes up, that will also show in the strace output (the system call will periodically complete and then be called again):

[root@oel6 ~]# strace -cp 27288
Process 27288 attached - interrupt to quit

^C
^Z

[1]+  Stopped                 strace -cp 27288

[root@oel6 ~]# kill -9 %%
[1]+  Stopped                 strace -cp 27288
[root@oel6 ~]#
[1]+  Killed                  strace -cp 27288

Oops... it looks like the strace command itself hung too! It didn't print anything for quite a while and didn't even react to CTRL+C, so I had to first push it into the background with CTRL+Z and then kill it outright. So much for easy diagnosis!
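
(A defensive habit for next time - my addition, not the author's: if you suspect the target may not be traceable, wrap the tracer in timeout from GNU coreutils so a hung attach cannot take your terminal hostage.)

# give strace 5 seconds; if it prints nothing and hangs on the attach, SIGKILL it automatically
timeout -s KILL 5 strace -cp 27288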

Let's try pstack (on Linux, pstack is just a wrapper script around the GDB debugger). Pstack will tell us nothing about the kernel's internals, but at least it will show which system call we are in (it usually appears as a libc function call at the top of the user-space stack):

[root@oel6 ~]# pstack 27288

^C
^Z
[1]+  Stopped                 pstack 27288

[root@oel6 ~]# kill %%
[1]+  Stopped                 pstack 27288
[root@oel6 ~]#
[1]+  Terminated              pstack 27288

Pstack also hung without any explanation!

So, we still do not know whether our process is 100% stuck (alas) or only 99.99% so (waking up and instantly falling asleep again) - nor do we know where exactly it is stuck.

Where else can we look? There is another always-available place - the process state and WCHAN fields, which can be examined with the good old ps command (hmm... perhaps we should have run it right away to make sure we are not dealing with a zombie):

[root@oel6 ~]# ps -flp 27288
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 D root     27288 27245  0  80   0 - 28070 rpc_wa 11:57 pts/0    00:00:01 find . -type f

To make sure the process is still in the same state, ps should be run several times (you don't want to jump to a false conclusion based on a single sample taken at an unlucky moment, do you?). Here, however, I am bravely showing only one run.

The process is in state D ("uninterruptible sleep"), which is usually associated with disk I/O (as the ps man page also says). In addition, the WCHAN field (the name of the kernel function that put the process into its sleep/wait state) is slightly truncated. I could add an option to the ps call to make that column a bit wider, but since its contents come from the /proc file system anyway, let's look straight at the source (and again, it would be good to do this several times, to make sure whether the process is completely frozen or merely sleeping a lot, very often):

[root@oel6 ~]# cat /proc/27288/wchan
rpc_wait_bit_killable
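
(Here is a throwaway loop for taking those repeated samples - my own sketch, assuming procps ps; the wchan:32 format specifier widens the otherwise truncated column:)

# sample the state and wait channel of PID 27288 five times, one second apart
for i in {1..5}; do
    ps -o stat,wchan:32,cmd -p 27288
    sleep 1
done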


Hmm... It looks like the process is waiting on some kind of RPC call. That usually means the process is talking to other processes (either on the local machine or even on a remote server). But we still don't know why.

Is there any movement, or is the process completely stuck?


Before moving on to the "meat" of this article, let's determine whether the process is completely stuck or not. On modern kernels you can find this out from /proc/PID/status. The values of interest to us here are State and the two ctxt_switches counters at the bottom:

[root@oel6 ~]# cat /proc/27288/status
Name:   find
State:  D (disk sleep)
Tgid:   27288
Pid:    27288
PPid:   27245
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 256
Groups: 0 1 2 3 4 6 10
VmPeak:   112628 kB
VmSize:   112280 kB
VmLck:         0 kB
VmHWM:      1508 kB
VmRSS:      1160 kB
VmData:      260 kB
VmStk:       136 kB
VmExe:       224 kB
VmLib:      2468 kB
VmPTE:
VmSwap:        0 kB
Threads:        1
SigQ:   4/15831
SigPnd: 0000000000040000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000180000000
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: ffffffffffffffff
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffffff,ffffffff
Cpus_allowed_list:      0-63
Mems_allowed:   00000000,00000000,...,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        9950
nonvoluntary_ctxt_switches:     17104


The process is in state D - disk sleep (uninterruptible sleep). Also note the voluntary_ctxt_switches and nonvoluntary_ctxt_switches values - they tell us how many times the process was given CPU quanta (or gave them up). Then, a few seconds later, run the command again and check whether the values have grown. In my case the numbers did not increase, and so I would conclude that the process is hung solid (well, or at least that it never woke up during the few seconds between the two commands). So now I can be more confident that the process is completely frozen (and not just flying under the radar, constantly consuming something like 0.04% of CPU time).
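
(For the lazy, the re-check is trivial to script - a sketch of my own:)

# take two samples of the context-switch counters five seconds apart;
# identical numbers mean the process was never scheduled in between
grep ctxt_switches /proc/27288/status ; sleep 5 ; grep ctxt_switches /proc/27288/status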

By the way, there are two more places where you can see the number of context switches (and the second of them is available even on ancient kernels):

[root@oel6 ~]# cat /proc/27288/sched
find (27288, #threads: 1)
---------------------------------------------------------
se.exec_start                      :     617547410.689282
se.vruntime                        :       2471987.542895
se.sum_exec_runtime                :          1119.480311
se.statistics.wait_start           :             0.000000
se.statistics.sleep_start          :             0.000000
se.statistics.block_start          :     617547410.689282
se.statistics.sleep_max            :             0.089192
se.statistics.block_max            :         60082.951331
se.statistics.exec_max             :             1.110465
se.statistics.slice_max            :             0.334211
se.statistics.wait_max             :             0.812834
se.statistics.wait_sum             :           724.745506
se.statistics.wait_count           :                27211
se.statistics.iowait_sum           :             0.000000
se.statistics.iowait_count         :                    0
se.nr_migrations                   :                  312
se.statistics.nr_migrations_cold   :                    0
se.statistics.nr_failed_migrations_affine :             0
se.statistics.nr_failed_migrations_running :           96
se.statistics.nr_failed_migrations_hot :             1794
se.statistics.nr_forced_migrations :                  150
se.statistics.nr_wakeups           :                18507
se.statistics.nr_wakeups_sync      :                    1
se.statistics.nr_wakeups_migrate   :                  155
se.statistics.nr_wakeups_local     :                18504
se.statistics.nr_wakeups_remote    :                    3
se.statistics.nr_wakeups_affine    :                  155
se.statistics.nr_wakeups_affine_attempts :            158
se.statistics.nr_wakeups_passive   :                    0
se.statistics.nr_wakeups_idle      :                    0
avg_atom                           :             0.041379
avg_per_cpu                        :             3.588077
nr_switches                        :                27054
nr_voluntary_switches              :                 9950
nr_involuntary_switches            :                17104
se.load.weight                     :                 1024
policy                             :                    0
prio                               :                  120
clock-delta                        :                   72

Here you should look at the nr_switches number (which equals nr_voluntary_switches + nr_involuntary_switches).

The total nr_switches is 27054 in the output above; it is also the third field in the output of /proc/PID/schedstat:

[root@oel6 ~]# cat /proc/27288/schedstat
1119480311 724745506 27054


And it does not increase ...
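
(This, too, is easy to watch in a loop - a sketch, assuming the same PID:)

# print the third field of /proc/PID/schedstat once a second; stop with Ctrl+C
while true; do awk '{ print $3 }' /proc/27288/schedstat; sleep 1; done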

Exploring the kernel's inner workings through the /proc file system


It seems our process is frozen pretty hard :). Strace and pstack are useless here. They use the ptrace() system call to attach to the process and peek into its memory, but since the process is hopelessly hung, most likely inside some system call, I assume the ptrace() attach hangs as well (by the way, I once tried running strace on strace itself at the moment it attaches to its target process - and that crashed the target. You have been warned! :)

So how do we find out which system call we are stuck in, without strace or pstack? Fortunately, we are on a modern kernel - say hello to /proc/PID/syscall!

[root@oel6 ~]# cat /proc/27288/syscall
262 0xffffffffffffff9c 0x20cf6c8 0x7fff97c52710 0x100 0x100 0x676e776f645f616d 0x7fff97c52658 0x390e2da8ea

Whoa! And what am I supposed to do with this???
Well, these numbers are usually there for a reason. If we see something like "0xAVeryBigNumber", it is usually an address in memory (and can be used with utilities like pmap); if the number is small, it is most likely an index into some array - for example, the array of open file descriptors (which you can also see in /proc/PID/fd), or, in this case, since we are dealing with system calls, the number of the system call the process is currently in! So, we now know that the process is stuck in system call #262!

Note that system call numbers can differ between operating systems, OS versions and platforms, so you need the right header file from your OS. A good place to start is searching for "syscall*" under /usr/include. On my Linux version and platform (64-bit), the system calls are defined in /usr/include/asm/unistd_64.h:

[root@oel6 ~]# grep 262 /usr/include/asm/unistd_64.h
#define __NR_newfstatat                         262


We are almost there! System call 262 is something called newfstatat. All that remains is to run man and find out what it does. A quick tip on system call names: if the man page does not have the exact name, try dropping various suffixes and prefixes (for example, "man pread" instead of "man pread64") - and in this case, run man without the "new" prefix: man fstatat. Or just google it.
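
(By the way, if you end up translating syscall numbers often, the header lookup is easy to wrap into a tiny shell function - my own sketch, relying only on the "#define __NR_name number" layout of unistd_64.h:)

# map a system call number to its name using the local kernel headers
syscall_name() {
    awk -v nr="$1" '$1 == "#define" && $3 == nr { sub("__NR_", "", $2); print $2 }' \
        /usr/include/asm/unistd_64.h
}
syscall_name 262     # prints: newfstatat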

Be that as it may, this system call - "new-fstat-at" - reads file attributes, much like the ordinary stat system call does. So we are hung in an operation that reads file metadata. We have moved one step further. However, we still don't know why the hang happened!
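
(A quick aside of my own before moving on: the argument registers printed by /proc/PID/syscall can also be decoded by hand. The first argument of newfstatat is a directory file descriptor, and interpreted as a signed 64-bit number it turns out to be a familiar constant; the second value, 0x20cf6c8, would then be the userspace pointer to the pathname being examined.)

# interpret the first argument register as a signed 64-bit value (bash arithmetic is 64-bit signed)
echo $(( 0xffffffffffffff9c ))     # prints -100, i.e. AT_FDCWD - "relative to the current working directory"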

Well then, it's time to say hello to my little friend /proc/PID/stack, which lets you view a process's kernel stack chain simply by printing the contents of a proc file!!!

[root@oel6 ~]# cat /proc/27288/stack
[] rpc_wait_bit_killable+0x24/0x40 [sunrpc]
[] __rpc_execute+0xf5/0x1d0 [sunrpc]
[] rpc_execute+0x43/0x50 [sunrpc]
[] rpc_run_task+0x75/0x90 [sunrpc]
[] rpc_call_sync+0x42/0x70 [sunrpc]
[] nfs3_rpc_wrapper.clone.0+0x35/0x80 [nfs]
[] nfs3_proc_getattr+0x47/0x90 [nfs]
[] __nfs_revalidate_inode+0xcc/0x1f0 [nfs]
[] nfs_revalidate_inode+0x36/0x60 [nfs]
[] nfs_getattr+0x5f/0x110 [nfs]

[] vfs_getattr+0x4e/0x80
[] vfs_fstatat+0x70/0x90
[] sys_newfstatat+0x24/0x50
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

The topmost function is where we currently are in the kernel code - exactly what the WCHAN field already told us (note, though, that in reality there are a few more functions above it, such as the kernel scheduler function schedule(), which puts processes to sleep and wakes them up as needed; those are not shown here, probably because they are the consequence of the wait state rather than its cause).

With the complete kernel stack for this task in hand, we can study it from the bottom up and try to work out how the call to rpc_wait_bit_killable ended up in the scheduler and put us to sleep.

system_call_fastpath at the bottom is the standard kernel system call handler; it led to the code implementing the newfstatat call we are dealing with (sys_newfstatat). Moving further up into the "child" functions, we see several things related to NFS. This is already 100% irrefutable proof that we are somewhere in the NFS-handling code! I am not saying "in the NFS code", because, as we can see further up, the last of those NFS functions in turn calls an RPC function (rpc_call_sync) to talk to another process - in this case probably a [kworker/N:N], [nfsiod], [lockd] or [rpciod] kernel I/O thread. And for some reason one of those threads never responds (the usual suspects being a silently dropped network connection, packet loss, or simply some kind of network problem).

To figure out whether any of those helper threads are stuck in network-related code, you need to look at their stacks too - although kworkers, for example, do far more than just NFS RPC communication. During a separate experiment (simply copying a large file over NFS) I caught one of the kworkers waiting in network code:

[root@oel6 proc]# for i in `pgrep worker`; do ps -fp $i; cat /proc/$i/stack; done
UID        PID  PPID  C STIME TTY          TIME CMD
root        53     2  0 Feb14 ?        00:04:34 [kworker/1:1]

[] __cond_resched+0x2a/0x40
[] lock_sock_nested+0x35/0x70
[] tcp_sendmsg+0x29/0xbe0
[] inet_sendmsg+0x48/0xb0
[] sock_sendmsg+0xef/0x120
[] kernel_sendmsg+0x41/0x60
[] xs_send_kvec+0x8e/0xa0 [sunrpc]
[] xs_sendpages+0x173/0x220 [sunrpc]
[] xs_tcp_send_request+0x5d/0x160 [sunrpc]
[] xprt_transmit+0x83/0x2e0 [sunrpc]
[] call_transmit+0xa8/0x130 [sunrpc]
[] __rpc_execute+0x66/0x1d0 [sunrpc]
[] rpc_async_schedule+0x15/0x20 [sunrpc]
[] process_one_work+0x13e/0x460
[] worker_thread+0x17c/0x3b0
[] kthread+0x96/0xa0
[] kernel_thread_helper+0x4/0x10

It would probably not be hard to turn on kernel tracing and find out exactly which kernel threads are talking to each other, but I don't want to go there in this article. Let this remain a practical (and simple) diagnostic exercise!

Understanding and "fixing" the problem


In any case, thanks to how easy it is to get a kernel stack printout on modern Linux (I can't say exactly which kernel version introduced it), we were able to conclude, step by step, exactly where our find command hung - namely, in NFS code inside the Linux kernel. And when you are dealing with an NFS-related hang, the prime suspect is the network. If you are curious how I reproduced the problem, it was very simple: I mounted an NFS volume from a virtual machine, ran the find command, and then suspended the machine. This produces the same symptoms as a network (configuration, firewall) problem in which a connection is silently dropped without notifying the TCP endpoints, or where packets simply stop getting through for some reason.

Since one of the "killable" functions sits at the top of the stack (rpc_wait_bit_killable - a wait that may be safely interrupted), we can kill the process with kill -9:

[root@oel6 ~]# ps -fp 27288
UID        PID  PPID  C STIME TTY          TIME CMD
root     27288 27245  0 11:57 pts/0    00:00:01 find . -type f
[root@oel6 ~]# kill -9 27288

[root@oel6 ~]# ls -l /proc/27288/stack
ls: cannot access /proc/27288/stack: No such file or directory

[root@oel6 ~]# ps -fp 27288
UID        PID  PPID  C STIME TTY          TIME CMD
[root@oel6 ~]#

The process is gone.

Poor man's kernel thread profiling


Note that /proc/PID/stack looks like a regular text file, so you can easily collect a kernel thread profile! Here is a quick-and-dirty way to sample the current system call and the kernel stack walk (when we are inside a system call), and then aggregate it all into a semi-hierarchical profile:

[root@oel6 ~]# export LC_ALL=C ; for i in {1..100} ; do cat /proc/29797/syscall | awk '{ print $1 }' ; cat /proc/29797/stack | /home/oracle/os_explain -k ; usleep 100000 ; done | sort -r | uniq -c

     69 running
      1 ffffff81534c83
      2 ffffff81534820
      6 247
     25 180

    100 0xffffffffffffffff 
      1 thread_group_cputime 
     27 sysenter_dispatch 
      3 ia32_sysret 
      1 task_sched_runtime 
     27 sys32_pread 
      1 compat_sys_io_submit 
      2 compat_sys_io_getevents 
     27 sys_pread64 
      2 sys_io_getevents 
      1 do_io_submit 
     27 vfs_read 
      2 read_events 
      1 io_submit_one 
     27 do_sync_read 
      1 aio_run_iocb 
     27 generic_file_aio_read 
      1 aio_rw_vect_retry 
     27 generic_file_read_iter 
      1 generic_file_aio_read 
     27 mapping_direct_IO 
      1 generic_file_read_iter 
     27 blkdev_direct_IO 
     27 __blockdev_direct_IO 
     27 do_blockdev_direct_IO 
     27 dio_post_submission 
     27 dio_await_completion 
      6 blk_flush_plug_list


This gives a very rough idea of where in the kernel the process spends its time (if at all). The system call numbers are shown separately at the top. "running" means that at the moment of sampling the process was running in user space (not inside a system call). So, 69% of the time the process was in user code; 25% in system call #180 (nfsservctl on my system) and 6% in system call #247 (waitid).

The author uses his /home/oracle/os_explain script above; you can get it from the link - translator's note.
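
(If you don't have os_explain handy, a rough do-it-yourself variant of mine gets by with awk alone - it loses the nice indentation but keeps the counts; GNU sleep accepts fractional seconds and stands in for usleep:)

# sample the current syscall number plus the kernel stack functions
# (offsets stripped) 100 times and show the most frequent entries
for i in {1..100}; do
    awk '{ print $1 }' /proc/29797/syscall
    awk '{ print $2 }' /proc/29797/stack | sed 's/+.*//'
    sleep 0.1
done | sort | uniq -c | sort -rn | head -20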

The output also shows two "functions" that for some reason were not resolved to names. Their addresses are known, though, so we can look them up manually:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534c83
ffffffff81534c83 t ia32_sysret

It looks like this is the return path from a system call in the 32-bit compatibility subsystem; however, since this function is not a system call itself (it is just an internal helper function), the /proc/PID/stack handler apparently did not show its name. It is also possible that these raw addresses show up because the /proc state is not consistent for the reader: while one thread modifies these memory structures, the threads reading them occasionally see stale data.

Let's check the second address as well:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534820
[root@oel6 ~]#

Found nothing? Well, the debugging is not over yet! Let's look for something interesting around that address - for example, by dropping the last couple of digits:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff815348
ffffffff8153480d t sysenter_do_call
ffffffff81534819 t sysenter_dispatch
ffffffff81534847 t sysexit_from_sys_call
ffffffff8153487a t sysenter_auditsys

It looks like the sysenter_dispatch function starts just one byte before the original address reported in /proc/PID/stack, so we have most likely executed just one byte of it (perhaps a NOP placed there for dynamic tracing hooks). Either way, these stack samples appear to be inside the sysenter_dispatch function, which is not a system call itself but a helper function. (If you look a bit closer, the difference is actually not 1 but 7 bytes - the numbers are hexadecimal! - translator's note)

More on stack profiling


Note that the various stack-walking utilities, such as Perf and OProfile on Linux, or the DTrace profile provider on Solaris, capture the instruction pointer register (EIP on 32-bit Intel, RIP on x64) and the stack pointer (ESP on 32-bit, RSP on x64) of the thread currently on the CPU, and then walk back along the stack pointers. So these utilities can only show threads that are running on a CPU at the moment of sampling! That is, of course, wonderful when we are hunting for high-CPU problems, but completely useless for diagnosing stuck or long-waiting/sleeping processes.

Utilities like pstack on Linux, Solaris and HP-UX, procstack on AIX, ORADEBUG SHORT_STACK, and simply reading the /proc/PID/stack pseudo-file are a good complement (but not a replacement) to CPU profiling utilities, because they access the process memory regardless of the process's state in the scheduler and read the stack right from there. If the process is asleep and not touching the CPU, the top of its stack can be read from the saved context, which the OS scheduler stores in kernel memory during the context switch.

Of course, CPU event profiling utilities can often do far more than just walk stacks: OProfile, Perf, and even the cpc provider in DTrace (on Solaris 11) can program and read the CPU's internal counters to measure things like the number of cycles stalled waiting for memory access, the number of L1/L2 cache misses, and so on. But you'd better read what Kevin Closson writes about that: (Perf, Oprofile)

Good luck :-)
