Kdump - diagnostics and analysis of the causes of kernel failures


Although the kernel in modern Linux systems is highly stable, serious system errors remain possible. When an unrecoverable error occurs, the kernel enters a state called kernel panic: the standard handler prints information intended to help with troubleshooting and then enters an endless loop.

To diagnose and analyze the causes of kernel failures, Red Hat developers created a specialized tool, kdump. Its principle of operation is as follows. Two kernels are used: the main kernel and a crash kernel, which collects the memory dump. When the main kernel boots, a region of memory is reserved for the crash kernel. When the main kernel panics, kexec boots the crash kernel, which then collects the dump.

In this article, we describe in detail how to configure kdump and use it to analyze system errors. We cover kdump on Ubuntu; in other distributions, the installation and configuration procedures differ significantly.

Install and configure kdump


Install kdump with the following command:
$ sudo apt-get install linux-crashdump kdump-tools


Kdump settings are stored in the configuration file /etc/default/kdump-tools:

# kdump-tools configuration
# ---------------------------------------------------------------------------
# USE_KDUMP - controls kdump will be configured
#     0 - kdump kernel will not be loaded
#     1 - kdump kernel will be loaded and kdump is configured
# KDUMP_SYSCTL - controls when a panic occurs, using the sysctl
#     interface. The contents of this variable should be the
#     "variable=value ..." portion of the 'sysctl -w' command.
#     If not set, the default value "kernel.panic_on_oops=1" will
#     be used. Disable this feature by setting KDUMP_SYSCTL=""
#     Example - also panic on oom:
#         KDUMP_SYSCTL="kernel.panic_on_oops=1 vm.panic_on_oom=1"
#
USE_KDUMP=1
#KDUMP_SYSCTL="kernel.panic_on_oops=1"


To activate kdump, edit this file and set USE_KDUMP=1.
The configuration file also contains the following parameters:
  • KDUMP_KERNEL - full path to the crash kernel;
  • KDUMP_INITRD - full path to the initrd of the crash kernel;
  • KDUMP_CORE - the location where the kernel dump file is saved. By default, the dump is saved in the /var/crash directory (KDUMP_CORE=/var/crash);
  • KDUMP_FAIL_CMD - the action to take if saving the dump fails (by default, the reboot -f command is executed);
  • DEBUG_KERNEL - a debug version of the running kernel. By default, /usr/lib/debug/vmlinux-$ is used. If this parameter is not set, the makedumpfile utility can only dump all pages of memory;
  • MAKEDUMP_ARGS - additional arguments passed to the makedumpfile utility. By default, this parameter is set to '-c -d 31', which enables compression and includes only the kernel pages in use in the dump.
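
As a quick sanity check after editing the file, a small helper can confirm that kdump is actually switched on. This is only a sketch: the function name kdump_enabled is made up for illustration and is not part of kdump-tools, and it assumes the value is written without quotes, as in the listing above.

```shell
# Return success if USE_KDUMP is set to 1 in a kdump-tools style config file.
# kdump_enabled is a hypothetical helper name, not a kdump-tools command.
kdump_enabled() {
    conf="${1:-/etc/default/kdump-tools}"
    # Match an uncommented USE_KDUMP=1 assignment, tolerating surrounding spaces.
    grep -Eq '^[[:space:]]*USE_KDUMP=1[[:space:]]*$' "$conf" 2>/dev/null
}
```

Called without arguments, it reads /etc/default/kdump-tools, for example: kdump_enabled && echo "kdump is enabled".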


After setting all the necessary parameters, run the update-grub command; if prompted about the configuration file, choose to install the package maintainer's version.

Then reboot the system and make sure kdump is ready to go:
$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.8.0-35-generic root=UUID=bb2ba5e1-48e1-4829-b565-611542b96018 ro crashkernel=384M-:128M quiet splash vt.handoff=7


Pay special attention to the crashkernel=384M-:128M parameter. It means that 128 MB of memory is reserved for the crash kernel at boot time on systems with at least 384 MB of RAM. You can also specify crashkernel=auto in the GRUB configuration; in this case, memory for the crash kernel is allocated automatically.
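
To check the reservation without reading the whole command line by eye, the parameter can be extracted with standard tools. The helper below is a minimal sketch; the name crashkernel_value is an assumption chosen for illustration.

```shell
# Print the value of the crashkernel= parameter from a kernel command line
# supplied on standard input. crashkernel_value is a hypothetical helper name.
crashkernel_value() {
    # Put each parameter on its own line, then print only the crashkernel value.
    tr ' ' '\n' | sed -n 's/^crashkernel=//p'
}
```

On a running system it would be used as crashkernel_value < /proc/cmdline; empty output means that no crash kernel memory was requested at boot.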

To analyze the dump with the crash utility, we also need a vmlinux file containing debugging information:

$ sudo tee /etc/apt/sources.list.d/ddebs.list << EOF
deb http://ddebs.ubuntu.com/ $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-security main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com/ $(lsb_release -cs)-proposed main restricted universe multiverse
EOF
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys ECDCAD72428D7C01
$ sudo apt-get update
$ sudo apt-get install linux-image-$(uname -r)-dbgsym


After the installation is complete, check the status of kdump again:

$ kdump-config status


If kdump is operational, the following message will be displayed on the console:

current state: ready to kdump
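
Independently of kdump-config, the kernel itself reports whether a crash kernel is loaded through the file /sys/kernel/kexec_crash_loaded, which contains 1 when one is in place. The following sketch wraps that check; the function name and its optional path argument are illustrative assumptions added to make it easy to try out.

```shell
# Return success if the kernel reports that a crash kernel is loaded.
# crash_kernel_loaded is a hypothetical helper; the optional argument lets
# you point it at another file for experimentation.
crash_kernel_loaded() {
    flag="${1:-/sys/kernel/kexec_crash_loaded}"
    # The flag file contains "1" when a crash kernel has been loaded via kexec.
    [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]
}
```

For example: crash_kernel_loaded && echo "crash kernel is loaded".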


Testing kdump


Trigger a kernel panic with the following command:

echo c | sudo tee /proc/sysrq-trigger


As a result, the system “freezes”.

A dump is then created within a few minutes; it will be available in the /var/crash directory after the reboot.
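
When several dumps have accumulated, it helps to locate the newest one. The sketch below assumes the layout used in this article, where each dump lives in a timestamp-named subdirectory as dump.<timestamp>; the function name latest_dump is hypothetical.

```shell
# Print the path of the most recent dump file under a crash directory.
# Assumes timestamp-named subdirectories such as 201405051934/dump.201405051934.
latest_dump() {
    dir="${1:-/var/crash}"
    latest=""
    # The shell expands globs in sorted order, so the last match is the newest.
    for f in "$dir"/*/dump.*; do
        [ -e "$f" ] && latest="$f"
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

The function prints nothing and returns a non-zero status if no dump is found.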

Kernel crash information can be viewed using the crash utility:

$ sudo crash /usr/lib/debug/boot/vmlinux-3.13.0-24-generic /var/crash/201405051934/dump.201405051934
crash 7.0.3
Copyright (C) 2002-2013 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan KK
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu" ...
      KERNEL: /usr/lib/debug/boot/vmlinux-3.13.0-24-generic
    DUMPFILE: /var/crash/201405051934/dump.201405051934 [PARTIAL DUMP]
        CPUS: 4
        DATE: Mon May 5 19:34:38 2014
      UPTIME: 00:54:46
LOAD AVERAGE: 0.14, 0.07, 0.05
       TASKS: 495
    NODENAME: Ubuntu
     RELEASE: 3.13.0-24-generic
     VERSION: #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014
     MACHINE: x86_64 (2675 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0002 [#1] SMP" (check log for details)
         PID: 7826
     COMMAND: "tee"
        TASK: ffff8800a2ef8000 [THREAD_INFO: ffff8800a2e68000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)
crash>


From the output above, we can conclude that the system failure was preceded by the event “Oops: 0002 [#1] SMP”, which occurred on CPU 2 while the tee command was running.
The crash utility also offers a wide range of capabilities for diagnosing the causes of a kernel crash. Let's consider them in more detail.

Diagnosing the causes of a crash using the crash utility


A dump stores information about the state of the system and the events that preceded a kernel crash. With the crash utility, you can recreate the state of the system at the time of the failure: find out which processes were running, which files were open, and so on. This information helps to make an accurate diagnosis and prevent future kernel crashes.

The crash utility has its own set of commands:

crash> help
*              files          mach           repeat         timer
alias          foreach        mod            runq           tree
ascii          fuser          mount          search         union
bt             gdb            net            set            vm
btop           help           p              sig            vtop
dev            ipcs           ps             struct         waitq
dis            irq            pte            swap           whatis
eval           kmem           ptob           sym            wr
exit           list           ptov           sys            q
extend         log            rd             task
crash version: 7.0.3    gdb version: 7.6
For help on any command above, enter "help <command>".
For help on input options, enter "help input".
For help on output options, enter "help output".
crash>


For each command, you can call a brief manual, for example:

crash> help set


We will not describe all the commands (detailed information can be found in the official user guide from Red Hat); we will cover only the most important ones.

First of all, pay attention to the bt command (short for backtrace). It displays the kernel stack trace of the task that was active at the time of the crash (run help bt for details and usage examples). However, in many cases the log command, which displays the contents of the kernel message buffer in chronological order, is sufficient to determine the cause of a system failure.

Here is a fragment of its output:
[3288.251955] CPU: 2 PID: 7826 Comm: tee Tainted: PF O 3.13.0-24-generic #46-Ubuntu
[3288.251957] Hardware name: System manufacturer System Product Name/P7P55D LE, BIOS 2003 12/16/2010
[3288.251958] task: ffff8800a2ef8000 ti: ffff8800a2e68000 task.ti: ffff8800a2e68000
[3288.251960] RIP: 0010:[] [] sysrq_handle_crash+0x16/0x20
[3288.251963] RSP: 0018:ffff8800a2e69e88 EFLAGS: 00010082
[3288.251964] RAX: 000000000000000f RBX: ffffffff81c9f6a0 RCX: 0000000000000000
[3288.251965] RDX: ffff88021fc4ffe0 RSI: ffff88021fc4e3c8 RDI: 0000000000000063
[3288.251966] RBP: ffff8800a2e69e88 R08: 0000000000000096 R09: 0000000000000387
[3288.251968] R10: 0000000000000386 R11: 0000000000000003 R12: 0000000000000063
[3288.251969] R13: 0000000000000246 R14: 0000000000000004 R15: 0000000000000000
[3288.251971] FS: 00007fb0f665b740(0000) GS:ffff88021fc40000(0000) knlGS:0000000000000000
[3288.251972] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[3288.251973] CR2: 0000000000000000 CR3: 00000000368fd000 CR4: 00000000000007e0
[3288.251974] Stack:
[3288.251975] ffff8800a2e69ec0 ffffffff8144e5f2 0000000000000002 00007fff3cea3850
[3288.251978] ffff8800a2e69f50 0000000000000002 0000000000000008 ffff8800a2e69ed8
[3288.251980] ffffffff8144eaff ffff88021017a900 ffff8800a2e69ef8 ffffffff8121f52d
[3288.251983] Call Trace:
[3288.251986] [] __handle_sysrq+0xa2/0x170
[3288.251988] [] write_sysrq_trigger+0x2f/0x40
[3288.251992] [] proc_reg_write+0x3d/0x80
[3288.251996] [] vfs_write+0xb4/0x1f0
[3288.251998] [] SyS_write+0x49/0xa0
[3288.252001] [] tracesys+0xe1/0xe6
[3288.252002] Code: 65 34 75 e5 4c 89 ef e8 f9 f7 ff ff eb db 0f 1f 80 00 00 00 00 00 66 66 66 66 90 55 c7 05 94 68 a6 00 01 00 00 00 48 89 e5 0f ae f8  04 25 00 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 be 
[3288.252025] RIP [] sysrq_handle_crash+0x16/0x20
[3288.252028] RSP 
[3288.252029] CR2: 0000000000000000


One of the output lines will indicate the event that caused the system error:

[3288.251889] SysRq: Trigger a crash
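
In a long log it can be tedious to find the function in which the kernel actually failed. crash can redirect command output to a file (for example, log > /tmp/kernel.log), and the saved text can then be filtered with ordinary shell tools. The helper below is a rough sketch that assumes the "symbol+offset/size" notation seen in the RIP lines of this kind of log; the name rip_symbol is made up for illustration.

```shell
# Print the function name(s) found on the RIP lines of a saved kernel log
# read from standard input. rip_symbol is a hypothetical helper name.
rip_symbol() {
    grep 'RIP' |
        # Keep only tokens of the form symbol+0xOFFSET/0xSIZE ...
        grep -Eo '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' |
        # ... then drop the offset and remove duplicates.
        sed 's/+.*//' | sort -u
}
```

Usage: rip_symbol < /tmp/kernel.log, where /tmp/kernel.log is an example file name.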


Using the ps command, you can display a list of processes that were running at the time of the crash:

crash> ps
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
      0      0   0  ffffffff81a8d020  RU   0.0       0      0  [swapper]
      1      0   0  ffff88013e7db500  IN   0.0   19356   1544  init
      2      0   0  ffff88013e7daaa0  IN   0.0       0      0  [kthreadd]
      3      2   0  ffff88013e7da040  IN   0.0       0      0  [migration/0]
      4      2   0  ffff88013e7e9540  IN   0.0       0      0  [ksoftirqd/0]
      7      2   0  ffff88013dc19500  IN   0.0       0      0  [events/0]


To view information about virtual memory usage, use the vm command:

crash> vm
PID: 5210   TASK: ffff8801396f6aa0  CPU: 0   COMMAND: "bash"
       MM               PGD          RSS    TOTAL_VM
ffff88013975d880  ffff88013a0c5000  1808k   108340k
      VMA           START       END     FLAGS FILE
ffff88013a0c4ed0     400000     4d4000 8001875 /bin/bash
ffff88013cd63210 3804800000 3804820000 8000875 /lib64/ld-2.12.so
ffff880138cf8ed0 3804c00000 3804c02000 8000075 /lib64/libdl-2.12.so


The swap command will print information about swap space usage to the console:

crash> swap
FILENAME      TYPE          SIZE     USED   PCT  PRIORITY
/dm-1       PARTITION    2064376k      0k    0%     -1


CPU interrupt information can be viewed using the irq command:

crash> irq -s
           CPU0
  0:        149  IO-APIC-edge     timer
  1:        453  IO-APIC-edge     i8042
  7:          0  IO-APIC-edge     parport0
  8:          0  IO-APIC-edge     rtc0
  9:          0  IO-APIC-fasteoi  acpi
 12:        111  IO-APIC-edge     i8042
 14:        108  IO-APIC-edge     ata_piix


Using the files command, you can list the files that were open at the time of the failure:

crash> files
PID: 5210   TASK: ffff8801396f6aa0  CPU: 0   COMMAND: "bash"
ROOT: /    CWD: /root
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff88013cf76d40 ffff88013a836480 ffff880139b70d48 CHR  /tty1
  1 ffff88013c4a5d80 ffff88013c90a440 ffff880135992308 REG  /proc/sysrq-trigger
255 ffff88013cf76d40 ffff88013a836480 ffff880139b70d48 CHR  /tty1


Finally, you can get a summary of the general state of the system using the sys command:

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.5.1.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2014-03-26-12:24:39/vmcore [PARTIAL DUMP]
        CPUS: 1
        DATE: Wed Mar 26 12:24:36 2014
      UPTIME: 00:01:32
LOAD AVERAGE: 0.17, 0.09, 0.03
       TASKS: 159
    NODENAME: elserver1.abc.com
     RELEASE: 2.6.32-431.5.1.el6.x86_64
     VERSION: #1 SMP Fri Jan 10 14:46:43 EST 2014
     MACHINE: x86_64 (2132 Mhz)
      MEMORY: 4 GB
       PANIC: "Oops: 0002 [#1] SMP" (check log for details)
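
The commands covered in this section can also be run in one pass rather than typed interactively: crash accepts a file of commands through its -i option and executes them before reading any user input. The following is a minimal sketch of that approach; the function name write_crash_commands and the file paths in the usage comment are illustrative assumptions.

```shell
# Write a crash command file that collects the reports discussed in this article.
# write_crash_commands is a hypothetical helper name.
write_crash_commands() {
    out="$1"
    # The final "exit" makes crash quit once all reports have been printed.
    cat > "$out" <<'EOF'
sys
bt
log
ps
files
exit
EOF
}

# Example usage (paths are taken from the session shown earlier):
#   write_crash_commands /tmp/crash.cmd
#   sudo crash -i /tmp/crash.cmd /usr/lib/debug/boot/vmlinux-3.13.0-24-generic \
#       /var/crash/201405051934/dump.201405051934 > /tmp/crash-report.txt
```

Keeping the command list in a file makes it easy to produce the same report for every dump and to compare reports across crashes.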


Conclusion


Analyzing and diagnosing the causes of kernel crashes is a very specific and complex topic that cannot fit within a single article. We will return to it in future publications.

