Generating a Kernel Core Dump on a Node Running Cluster HAE (High Availability Extension)

This document (7015029) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 11 Service Pack 3

Situation

The server cannot create a kernel core dump while HAE is running. A stonith action kills the node before the crash dump is complete.

Resolution

The only way to collect a kernel core without the node being fenced is to increase the watchdog timeout. Be careful, because the iTCO_wdt watchdog does not accept a value greater than 63 seconds. Please read TID 7011426.

Increasing the watchdog timeout lets sbd wait longer before detecting a failure, and this extra time is used to capture the kernel core file.

1 - Edit the sbd resource and add the crashdump parameter. The sbd resource should be similar to the following:

primitive stonith-sbd stonith:external/sbd \
    params sbd_device="/dev/mapper/SAN_Disk_001_part1" \
    crashdump="1" \
    meta target-role="Started"

2 - Stop openais on all nodes.

3 - Recreate the sbd devices (this only needs to be done on one node):

# sbd -d /dev/<sbd-path-device> -4 600 -1 300 create

4 - You can check the timeouts using:

# sbd -d /dev/<sbd-path-device> dump

You should see:

Timeout (watchdog) : 300

Timeout (msgwait) : 600

5 - Add -C 300 to SBD_OPTS in the /etc/sysconfig/sbd file and copy the file to all nodes:

SBD_DEVICE="/dev/mapper/<sbd-device-path>"

SBD_OPTS="-W -P -C 300"

6 - Now you can start openais again.

7 - After you finish troubleshooting, repeat the process to remove these parameters and restore the original timeouts.
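
The timeout check and the /etc/sysconfig/sbd change above can be verified with a small script. The following is only a minimal sketch: the function names check_sbd_timeouts and sbd_crashdump_timeout are hypothetical helpers, not part of sbd, and the sketch assumes that the dump output uses the "Timeout (watchdog) : N" line format shown above and that /etc/sysconfig/sbd can be sourced as a shell fragment.

```shell
#!/bin/sh
# Hypothetical helper (not part of sbd): reads "sbd ... dump" output on stdin
# and checks that msgwait is at least twice the watchdog timeout.
# Exit status: 0 = ok, 1 = msgwait too small, 2 = timeouts not found.
check_sbd_timeouts() {
    awk -F: '
        /^Timeout \(watchdog\)/ { wd = $2 + 0 }
        /^Timeout \(msgwait\)/  { mw = $2 + 0 }
        END {
            if (wd == 0 || mw == 0) { print "ERROR: watchdog or msgwait timeout not found"; exit 2 }
            if (mw < 2 * wd) { printf "WARN: msgwait=%d is less than twice watchdog=%d\n", mw, wd; exit 1 }
            printf "OK: watchdog=%d msgwait=%d\n", wd, mw
        }'
}

# Hypothetical helper: prints the value that follows -C in SBD_OPTS.
# Exit status: 0 = found, 1 = no -C option, 2 = file could not be sourced.
sbd_crashdump_timeout() {
    file=${1:-/etc/sysconfig/sbd}
    # Source the file in a subshell so its variables do not leak out.
    opts=$( . "$file" >/dev/null 2>&1 && printf '%s\n' "$SBD_OPTS" ) || return 2
    set -- $opts
    while [ $# -gt 0 ]; do
        if [ "$1" = "-C" ] && [ $# -ge 2 ]; then
            printf '%s\n' "$2"
            return 0
        fi
        shift
    done
    return 1
}
```

For example, the output of the dump command could be piped into check_sbd_timeouts, and sbd_crashdump_timeout /etc/sysconfig/sbd could be run on each node, before openais is started again. A file with an unbalanced quote fails to source and is reported with exit status 2.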

In this example the watchdog was configured for 300 seconds (5 minutes). If you need more time to complete the kernel core, increase the value as required. Just remember that the message wait (msgwait) timeout should be twice the watchdog timeout.
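
The relationship between the two timeouts can be made explicit with a small calculation. This is a minimal sketch, assuming the same -1 (watchdog) and -4 (msgwait) options used in step 3. The function name sbd_create_cmd is a hypothetical helper, and it only prints the command instead of running it, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the sbd create command for a given
# device and watchdog timeout, with msgwait set to twice the watchdog value.
sbd_create_cmd() {
    device=$1
    watchdog=$2
    # Accept only positive integers without leading zeros.
    case $watchdog in
        ''|0*|*[!0-9]*)
            echo "watchdog timeout must be a positive integer" >&2
            return 1
            ;;
    esac
    msgwait=$((watchdog * 2))
    echo "sbd -d $device -4 $msgwait -1 $watchdog create"
}
```

Calling sbd_create_cmd with the device path and 300 prints the same create command as step 3. A larger watchdog value, for example 600, yields a msgwait of 1200.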

If your watchdog does not allow a timeout above 63 seconds, you can switch to softdog and then execute the steps above.

Cause

Sometimes the kernel crashes or misbehaves and a kernel core dump needs to be captured for analysis. However, when you are using cluster software, HAE in this case, the kernel core dump will not complete because a stonith action kills the server in order to keep your resources up and running on the remaining available node in the cluster. You can therefore change some parameters of your stonith device, in this case SBD, to allow a core to be captured.

Please keep in mind that this configuration is ONLY required during the troubleshooting process and you have to revert your changes when you are done.

Additional Information

Please always perform these steps in a test environment before moving to production servers.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.