Generating a kernel core dump on a node running cluster HAE (High Availability Extension)

This document (7015029) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 11 Service Pack 3

Situation

The server cannot create a kernel core dump while HAE is running: a STONITH action kills the node before the crash dump is complete.

Resolution

The only way to collect a kernel core without the node being fenced is to increase the watchdog timeout. Be careful, however: the iTCO_wdt watchdog does not accept a value greater than 63 seconds. See TID 7011426 for details.

With a longer watchdog timeout, sbd waits longer before it treats the node as failed, and that extra time is used to capture the kernel core file.

1 - Edit the sbd resource and add the crashdump parameter. The sbd resource should look similar to the following:

    primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/mapper/SAN_Disk_001_part1" \
               crashdump="1" \
        meta target-role="Started"

2 - Stop openais on all nodes.

3 - Recreate the sbd devices (this only needs to be done on one node). Here -1 sets the watchdog timeout and -4 sets the msgwait timeout, both in seconds:

    # sbd -d /dev/<sbd-path-device> -4 600 -1 300 create
   
4 - Check the timeouts with:

    # sbd -d /dev/<sbd-path-device> dump

The output should include:

    Timeout (watchdog) : 300
    Timeout (msgwait)  : 600

5 - Add -C 300 to SBD_OPTS in the /etc/sysconfig/sbd file and copy the file to all nodes:

    SBD_DEVICE="/dev/mapper/<sbd-device-path>"
    SBD_OPTS="-W -P -C 300"

6 - Now you can start openais again.

7 - When you have finished troubleshooting, repeat the procedure to remove the added parameters and restore the original timeouts.
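
Before the cluster stack is started again, the timeouts written to the sbd device and the edited /etc/sysconfig/sbd file can be sanity-checked. The following is a minimal sketch, not part of sbd itself: the function names check_sbd_timeouts and check_sysconfig_syntax are illustrative, and the second function assumes that /etc/sysconfig/sbd contains only plain shell variable assignments.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative and not part of sbd.

# Reads "sbd -d <device> dump" output on stdin and checks that
# msgwait is at least twice the watchdog timeout.
check_sbd_timeouts() {
    awk -F: '
        /Timeout \(watchdog\)/ { wd = $2 + 0 }
        /Timeout \(msgwait\)/  { mw = $2 + 0 }
        END {
            ok = (wd > 0 && mw >= 2 * wd)
            printf "%s watchdog=%d msgwait=%d\n", (ok ? "OK" : "MISMATCH"), wd, mw
            exit (ok ? 0 : 1)
        }'
}

# Assumption: the sysconfig file uses plain shell variable syntax,
# so "sh -n" (parse without executing) catches an unbalanced quote.
check_sysconfig_syntax() {
    if sh -n "$1" 2>/dev/null; then
        echo "syntax OK: $1"
    else
        echo "syntax error: $1"
        return 1
    fi
}

# Demonstration with the timeout lines shown in the steps above.
printf 'Timeout (watchdog) : 300\nTimeout (msgwait)  :  600\n' | check_sbd_timeouts
```

On a cluster node the functions would be used as "sbd -d /dev/<sbd-path-device> dump | check_sbd_timeouts" and "check_sysconfig_syntax /etc/sysconfig/sbd". Saving the dump output before the device is recreated also keeps a record of the original timeouts for the revert at the end of the procedure.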

In this example the watchdog timeout is set to 300 seconds (5 minutes). If the kernel core needs more time to complete, increase the value as needed. Remember that the message wait timeout (msgwait) should be twice the watchdog timeout.
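
The relationship between the two timeouts can be written down as a small helper. This is a sketch under the assumption that msgwait is set to exactly twice the watchdog timeout, as recommended above; the function name sbd_create_command is illustrative. It only prints the command instead of running it, because recreating the device overwrites the sbd header.

```shell
#!/bin/sh
# Sketch only: derives msgwait from a chosen watchdog timeout.

sbd_create_command() {
    device=$1      # sbd device path (placeholder, as in the steps above)
    watchdog=$2    # watchdog timeout in seconds
    msgwait=$((watchdog * 2))   # assumption: msgwait = 2 x watchdog
    # Print the command for review rather than executing it.
    echo "sbd -d $device -4 $msgwait -1 $watchdog create"
}

sbd_create_command "/dev/<sbd-path-device>" 300
# prints: sbd -d /dev/<sbd-path-device> -4 600 -1 300 create
```

For a longer dump window, for example 450 seconds, the same helper would print the command with -4 900 -1 450.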

If your watchdog does not accept a value above 63 seconds, you can switch to the softdog software watchdog and then follow the steps above.
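
To see which watchdog driver is currently loaded before switching, the lsmod output can be filtered. The sketch below assumes that hardware watchdog drivers follow the *_wdt naming pattern, which holds for iTCO_wdt but not for every driver; the function name list_watchdog_modules and the module sizes in the sample input are illustrative.

```shell
#!/bin/sh
# Sketch only: reads lsmod-style output on stdin and prints the
# names of loaded watchdog modules.
# Assumption: hardware watchdog drivers are named *_wdt; softdog
# is matched by its exact name.
list_watchdog_modules() {
    awk 'NR > 1 && ($1 == "softdog" || $1 ~ /_wdt$/) { print $1 }'
}

# Demonstration with sample lsmod output (sizes are made up).
printf 'Module Size Used by\niTCO_wdt 17948 0\next3 148815 2\n' | list_watchdog_modules
# prints: iTCO_wdt
```

On a real node this would be run as "lsmod | list_watchdog_modules". If it reports iTCO_wdt, the module can be unloaded with "modprobe -r iTCO_wdt" and softdog loaded with "modprobe softdog" while the cluster stack is stopped. Making that change persistent across reboots depends on the system configuration and is not covered here.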
 
         

Cause

Sometimes the kernel crashes or misbehaves and a kernel core dump needs to be captured for analysis. With cluster software, HAE in this case, the kernel core dump does not complete because a STONITH action kills the server so that its resources can be kept running on the remaining node in the cluster. You can therefore change some parameters of the STONITH device, in this case SBD, to allow a core to be captured.

Please keep in mind that this configuration is ONLY required during troubleshooting, and you have to revert your changes when you are done.

Additional Information

Please always perform any steps in a test environment before moving to production servers. 

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7015029
  • Creation Date: 07-May-2014
  • Modified Date: 03-Mar-2020
  • SUSE Linux Enterprise High Availability Extension
