Installing and Configuring KDump on SLES 10 for Kernel Crash Analysis
Although Linux continues to prove itself as one of the most stable platforms ever, there are times when something goes awry and bad things happen. A poorly written application performs some function that the kernel doesn’t like and a crash or panic occurs. As with other Unix-like operating systems, a simple reboot isn’t an acceptable remedy or resolution.
This is where core dumps can help you out, on SLES and on RHEL alike. Core dump files can be sent to Novell or Red Hat for further research and resolution.
In a production environment, time is critical to find the root cause, resolve it and get your systems back online and fully functional. In this article, the intention is to help you understand, troubleshoot and resolve issues surrounding kernel crashes.
There are a couple of utilities that can process core dump files. We will focus on KDump for this article.
lkcd is another utility included in the SLES 10 distro, but it has more limitations than kdump. For one, its latest version was released in 2006; it also has very limited 64-bit support and lacks support for a number of driver modules. KDump is more recent and scales better to fit a diverse environment.
KDump consists of two main components, Kexec and Kdump.
Kexec uses the UNIX exec system call image overlay philosophy to spawn a new Linux kernel image over a running Linux kernel image, without going through the BIOS. Kexec has different uses including fast reboot, but here we’ll discuss its main use, kdump.
Capturing a dump after a kernel crash is inherently unreliable, since the kernel code that accesses the dump device may be in an unstable state. Kdump gets around this problem by collecting the dump after booting into a healthy kernel via kexec.
You do not have to complete this section if you wish to use the kernel-kdump that is available through YaST, by searching on “kdump” in Software Management. This would not apply if you are using a VMI or more recent kernel.
Since kdump requires its own kernel, built from your existing one, we’ll need to run “make menuconfig” to edit some kernel parameters. Luckily, SLES 10 ships with a kernel that is 99% kdump ready. We just need to tweak a couple of things and build our new kdump crash kernel.
Change to the kernel sources directory (/usr/src/linux-2.6.xxx) depending on the version of your kernel.
Enter the command “make menuconfig” and, if you aren’t missing any libraries such as ncurses-devel-5.5-18.11, you should see the following screen.
Scroll down to “Processor type and features” and hit Enter.
At the bottom, highlight Kernel crash dumps (EXPERIMENTAL) and press the spacebar to mark it. Another option will appear below. Leave that at its default. Press ESC twice to go back.
Scroll to General Setup and press Enter.
Select the top item Local Version and press Enter.
Change the item listed from “-default” or “-bigsmp” or whatever, to “-kdump” and select OK.
Press ESC twice to go back to the main menu. And ESC to exit.
Select Yes to save the new configuration.
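Before compiling, it can be worth confirming that the two changes actually landed in the saved configuration. The following is a minimal sketch, assuming the menu entries map to the CONFIG_CRASH_DUMP and CONFIG_LOCALVERSION symbols in the .config file; the function name check_kdump_config is made up for this example.

```shell
# Check a kernel .config for the two settings changed in the menuconfig steps.
# Usage: check_kdump_config /path/to/.config
check_kdump_config() {
    config_file=$1
    status=0

    # "Kernel crash dumps" is assumed to correspond to CONFIG_CRASH_DUMP.
    if grep -q '^CONFIG_CRASH_DUMP=y$' "$config_file"; then
        echo "crash dump support: enabled"
    else
        echo "crash dump support: MISSING"
        status=1
    fi

    # "Local version" is assumed to correspond to CONFIG_LOCALVERSION.
    if grep -q '^CONFIG_LOCALVERSION="-kdump"$' "$config_file"; then
        echo "local version: -kdump"
    else
        echo "local version: not set to -kdump"
        status=1
    fi

    return $status
}
```

Run it from the kernel sources directory as check_kdump_config .config; a non-zero exit status means one of the two settings is not in place.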
Compile your new kernel. Kernel compiling is beyond the scope of this article. Please refer to your documentation or the many resources on the web for more information.
Kdump doesn’t support compressed kernel images, so we’ll use the image vmlinux-kdump instead of the compressed version vmlinuz-kdump.
The packages we’ll need for SLES 10 are
yast2-kdump – Configure kdump for you X-server folks. (optional)
You can install them through YaST.
Now that we have our kernel “kdump enabled”, we need to configure it for both a local dump and a network dump. Kdump can dump to an FTP server, NFS, CIFS and SSH.
Configure Local Dump
Edit the file /etc/sysconfig/kdump and make the following changes.
If you are using SMP then modify the following line:
KDUMP_COMMANDLINE_APPEND="maxcpus=1 "
KDUMP_OPTIONS="--args-linux "
It is recommended to have kdump save crash dumps to a local device by setting KDUMP_RUNLEVEL to 1. The value will be different for network dumps, and runlevel 5 is not recommended unless you have allocated enough memory.
Once the dump is completed, we’ll want the server to reboot back to the normal kernel, which is what KDUMP_IMMEDIATE_REBOOT is for. The reboot will not work unless the very next parameter, KDUMP_DUMPDEV, is empty.
KDUMP_DUMPDEV specifies the device to which the dump is written. Using it can cause issues with the underlying filesystem, so we will leave it empty and use KDUMP_SAVEDIR for the location of the dump file.
Change KDUMP_SAVEDIR to a location where you will have sufficient disk space to write the dump file. You can use NFS shares, as well as the other options.
You can reserve disk space for kdump with the KDUMP_FREE_DISK_SIZE directive. The default is 64MB.
Altogether, the settings look like this for our setup:
KDUMP_COMMANDLINE_APPEND="maxcpus=1 "
KDUMP_OPTIONS="--args-linux "
KDUMP_RUNLEVEL="1"
KDUMP_IMMEDIATE_REBOOT="yes"
KDUMP_DUMPDEV=""
KDUMP_SAVEDIR="file:///tmp/kerneldump"
KDUMP_FREE_DISK_SIZE=128MB
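Because the immediate reboot depends on KDUMP_DUMPDEV being empty, a quick sanity check of the file can catch a mistake before the first real crash. This is only a sketch: the function name check_kdump_sysconfig is hypothetical, and it checks just the two rules discussed above, not every directive.

```shell
# Sanity-check a kdump sysconfig file against the rules described above.
# Usage: check_kdump_sysconfig /etc/sysconfig/kdump   (use an absolute path)
check_kdump_sysconfig() {
    (
        # Source the file in a subshell so its variables do not leak out.
        . "$1" || exit 2

        if [ "$KDUMP_IMMEDIATE_REBOOT" = "yes" ] && [ -n "$KDUMP_DUMPDEV" ]; then
            echo "KDUMP_DUMPDEV must be empty for the immediate reboot to work"
            exit 1
        fi
        if [ -z "$KDUMP_SAVEDIR" ]; then
            echo "KDUMP_SAVEDIR is not set"
            exit 1
        fi
        echo "ok: dumps go to $KDUMP_SAVEDIR"
    )
}
```

The file is sourced like any other shell fragment, which is how sysconfig files are normally consumed, so the check sees the same values the init script would.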
Configure Network Dump
By default, runlevel 3 is the lowest runlevel at which network devices are loaded.
You can use a DNS name or an IP address. I prefer the IP address, because DNS is just something else that might be crashing the box.
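As an illustration only, a network variant of the settings might look like the fragment below. The server address 192.0.2.10 and the export path are placeholders, and the nfs:// URL form is an assumption made by analogy with the file:// form used for the local dump; check the comments in your own /etc/sysconfig/kdump for the exact syntax your version accepts.

```shell
# Hypothetical network-dump settings for /etc/sysconfig/kdump.
# The server address and export path are placeholders, and the nfs:// URL
# form is assumed by analogy with the file:// form used for local dumps.
KDUMP_RUNLEVEL="3"                                  # lowest runlevel with networking
KDUMP_IMMEDIATE_REBOOT="yes"
KDUMP_DUMPDEV=""
KDUMP_SAVEDIR="nfs://192.0.2.10/export/kerneldump"  # IP address rather than a DNS name
```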
Save the file and we’ll configure GRUB.
For kdump to function properly, we need to add a parameter to the kernel line to reserve a certain amount of RAM for the crash kernel. The default is 64MB, and some recommend 128MB. The parameter looks like this:
crashkernel=128M@16M
The 16M is the offset (the physical start address) that was set during the kernel configuration, and the 128M is the amount of memory we want to reserve for the crash kernel.
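To make the two numbers concrete, the small helper below works out which physical address range a size@offset pair describes. It is a sketch rather than a general parser: the name crashkernel_range is invented for this article, and only the M suffix used here is handled.

```shell
# Print the physical memory range described by a crashkernel=SIZE@OFFSET value.
# Only the "M" (megabyte) suffix used in this article is handled.
# Usage: crashkernel_range 128M@16M
crashkernel_range() {
    size_mb=${1%%M@*}           # text before "M@"  -> 128
    offset_part=${1#*@}         # text after "@"    -> 16M
    offset_mb=${offset_part%M}  # strip the suffix  -> 16

    start=$((offset_mb * 1024 * 1024))
    end=$(((offset_mb + size_mb) * 1024 * 1024))
    printf 'reserved: 0x%x - 0x%x (%d MB)\n' "$start" "$end" "$size_mb"
}

crashkernel_range 128M@16M   # prints: reserved: 0x1000000 - 0x9000000 (128 MB)
```

In other words, 128M@16M asks for the block of memory that starts at the 16MB mark and ends at the 144MB mark.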
Modify /boot/grub/menu.lst and append this parameter to the kernel directive.
kernel /boot/vmlinuz-184.108.40.206-0.39.3-vmi root=/dev/sda2 vga=0x32b resume=/dev/sda1 splash=silent showopts clock=pit crashkernel=128M@16M
Save the file and exit.
Now we need to “turn it on” so it will start on boot.
chkconfig kdump on
and then we’ll start it through its init script, /etc/init.d/kdump start.
If you get an error, check the entry in /boot/grub/menu.lst and make sure your syntax is correct.
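For readers curious about what starting the service amounts to, the crash kernel is loaded into the reserved memory with kexec. The sketch below only prints a command line in the form shown in the kernel’s kdump documentation; it does not run anything. The kernel and initrd paths are hypothetical, and the SLES init script may well assemble its command differently.

```shell
# Print (do not run) a kexec invocation that would load a crash kernel.
# Usage: show_kexec_load KERNEL INITRD "APPEND OPTIONS"
show_kexec_load() {
    printf 'kexec -p %s --initrd=%s --args-linux --append="%s"\n' "$1" "$2" "$3"
}

# Example with hypothetical paths; maxcpus=1 mirrors KDUMP_COMMANDLINE_APPEND
# and --args-linux mirrors KDUMP_OPTIONS from the configuration above.
show_kexec_load /boot/vmlinux-kdump /boot/initrd-kdump "root=/dev/sda2 maxcpus=1"
```

The -p option tells kexec to load the image for use on a panic rather than for an immediate reboot.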
At this point we want to reboot and load the kdump-enabled kernel with the crash parameters.
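Once the system is back up, one way to confirm that the parameter reached the running kernel is to look for it on the kernel command line. A minimal sketch, with a made-up function name, that reads /proc/cmdline by default:

```shell
# Report the crashkernel= reservation found on a kernel command line.
# Reads /proc/cmdline by default; pass another file to check it instead.
# Usage: show_crashkernel [FILE]
show_crashkernel() {
    cmdline_file=${1:-/proc/cmdline}
    for word in $(cat "$cmdline_file"); do
        case $word in
            crashkernel=*)
                echo "reserved for crash kernel: ${word#crashkernel=}"
                return 0
                ;;
        esac
    done
    echo "no crashkernel= parameter found"
    return 1
}
```

If it reports that no parameter was found, the GRUB entry that was booted is not the one that was edited.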
Now let’s simulate a crash. We can do this by turning on System Request (sysrq) functionality and then triggering a panic.
echo 1 > /proc/sys/kernel/sysrq
Now, send it into panic.
echo c > /proc/sysrq-trigger
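Since this really does crash the machine, it may be prudent to wrap the trigger in something that demands an explicit confirmation and flushes disk buffers first. The wrapper below is a sketch; the function name and the SYSRQ_TRIGGER override (handy for a dry run against an ordinary file) are inventions for this example.

```shell
# Flush disk buffers, then trigger a test panic through sysrq.
# SYSRQ_TRIGGER can point at an ordinary file for a harmless dry run.
# Usage: trigger_test_panic YES
trigger_test_panic() {
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    if [ "$1" != "YES" ]; then
        echo "refusing to crash the system; pass YES to confirm"
        return 1
    fi
    sync                    # write pending data to disk before crashing
    echo c > "$trigger"     # "c" asks the kernel to crash on purpose
}
```

Setting SYSRQ_TRIGGER to a scratch file before calling trigger_test_panic YES lets you rehearse the procedure without taking the box down.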
The console should start to boot the crash kernel. After a few moments you will see the dump file being created. It should look something like this.
Setting flag_elf64 to true [ 67% ]
You can check the progress of the dump by listing the files in the path set in the KDUMP_SAVEDIR directive.
# ll /tmp/kerneldump/
-rw------- 1 root root 392366080 May 16 11:05 vmcore
When it’s done, it will reboot (because we told it to) and return to the normal kernel.
If this had been an actual emergency, you would have been directed, by Novell Support, where to send the file for further analysis. Just Kidding.
There are analyzers like GDB that can provide a more human view of the core dump file, and perhaps you can find your problem without further support.
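GDB reads ELF-format core files, so a reasonable first step is to check that the saved dump is one before pointing GDB at it. The sketch below assumes nothing about how your dump was written; it simply inspects the first four bytes. The function names are hypothetical, and the kernel image passed in should be the uncompressed vmlinux of the kernel that crashed, ideally built with debug symbols.

```shell
# Succeed if FILE starts with the ELF magic bytes (0x7f 'E' 'L' 'F').
# Usage: is_elf_core /tmp/kerneldump/vmcore
is_elf_core() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "7f454c46" ]
}

# Print a suggested gdb command line for a dump, or a hint if it is not ELF.
# Usage: suggest_gdb VMLINUX VMCORE
suggest_gdb() {
    if is_elf_core "$2"; then
        echo "gdb $1 $2"
    else
        echo "$2 is not an ELF core; gdb cannot read it directly"
    fi
}
```

If the check fails, the dump was probably saved in another format and needs a tool that understands that format.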
This process is a bit complicated and it does require some kernel knowledge to configure. As stated before, SLES 10.x provides a kdump-enabled kernel that you can just install. If you looked closely at the config presented here, we’re using a VMI kernel, so we had to make the changes manually. There is much more functionality to kdump than was presented. If you wish to use this utility as your crash dump tool, get it working the way you like and then port it to production.