How to load the correct watchdog kernel module

This document (7016880) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications 11 SP3
SUSE Linux Enterprise Server for SAP Applications 11 SP4
SUSE Linux Enterprise Server 11 Service Pack 3 (SLES 11 SP3)
SUSE Linux Enterprise Server 11 Service Pack 4 (SLES 11 SP4)
SUSE Linux Enterprise High Availability Extension 11 Service Pack 3
SUSE Linux Enterprise High Availability Extension 11 Service Pack 4

Situation

A hardware watchdog should be used to reboot the system in case of unrecoverable software errors. This is common, for example, in the context of a high availability cluster.
Unfortunately, automatic probing for the right watchdog kernel module sometimes fails. In that case, the correct module has to be configured manually.

Resolution

Overview

Finding the right kernel module for a given system is not trivial, so automatic probing fails quite often. As a result, many modules may already be loaded before the right one gets a chance.
A proven solution is to load the proper watchdog driver very early during system boot, before auto-probing takes place.
To enable loading of the right watchdog kernel module at boot, perform the following steps:
  1. The right watchdog module has to be determined.
  2. Any wrong watchdog module has to be unloaded.
  3. The right watchdog module has to be loaded.
  4. In order to automate loading of the right watchdog module, it has to be included in the boot process.
  5. The watchdog has to be tested.

Please note that only one piece of software may access the watchdog timer. Some hardware vendors ship systems management software that uses the watchdog for system resets. Such software has to be disabled if the watchdog is to be used by SBD, which comes with SLE-HA.

Implementation

Step 1. The right watchdog module has to be determined.

Currently there is no reliable programmatic approach that determines the right watchdog kernel module in every case.

  • On HP hardware the "hpwdt" module should work.
  • For systems with an Intel TCO, "iTCO_wdt" can be used.
    Dell, Fujitsu, and Lenovo usually fall into this category.
  • Inside a VM on z/VM on an IBM mainframe, "vmwatchdog" might be used.
  • Inside a Xen VM (aka DomU), "xen_wdt" is a good choice.
  • "softdog" is the most generic driver, but it is recommended that you use one with actual hardware integration.

See /lib/modules/.../kernel/drivers/watchdog in the kernel package for a list of choices. Of course, the hardware vendor should be able to name the right watchdog. The test described in step 5 below shows whether a loaded module works.
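
To get an overview of the candidates, the driver directory can be listed and reduced to bare module names. The following is a minimal sketch, not a SUSE-provided tool: the function name list_watchdog_modules is hypothetical, and the directory is passed as an argument because the kernel-version part of the path differs from system to system.

```shell
# Hypothetical helper: print the module names found in a driver directory.
# $1 = directory that holds the watchdog driver files (*.ko, *.ko.gz, ...)
list_watchdog_modules() {
    dir=$1
    for f in "$dir"/*.ko*; do
        # Skip the unexpanded pattern if the directory holds no module files
        [ -e "$f" ] || continue
        name=${f##*/}                   # strip the directory part
        printf '%s\n' "${name%%.ko*}"   # strip .ko and any compression suffix
    done
}
```

Call the function with the watchdog driver directory named above as its only argument. For each printed name, "modinfo <module>" shows the description shipped with the driver.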

Step 2. Any wrong watchdog module has to be unloaded.

List the loaded watchdog modules:

       # lsmod | grep -e dog -e wdt

Note: A module whose name does not contain the string "dog" or "wdt" will not show up. See /lib/modules/.../kernel/drivers/watchdog for the modules shipped with the SLES operating system.
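
Because the grep above only matches names containing "dog" or "wdt", an alternative is to compare the loaded modules against the files in the watchdog driver directory. This is a minimal sketch under stated assumptions: the function name loaded_watchdog_modules is hypothetical, the first argument is a file in /proc/modules format, and the second argument is the driver directory.

```shell
# Hypothetical helper: print every loaded module that has a driver file
# in the given watchdog driver directory.
# $1 = file in /proc/modules format (module name is the first field)
# $2 = directory that holds the watchdog driver files
loaded_watchdog_modules() {
    proc_modules=$1
    driver_dir=$2
    for f in "$driver_dir"/*.ko*; do
        [ -e "$f" ] || continue
        name=${f##*/}
        name=${name%%.ko*}
        # Loaded modules are reported with underscores, even if the
        # file name on disk contains hyphens.
        name=$(printf '%s' "$name" | tr '-' '_')
        if grep -q "^$name " "$proc_modules"; then
            printf '%s\n' "$name"
        fi
    done
}
```

On a running system the first argument would typically be /proc/modules. The output does not depend on the module name containing a particular string.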

Unload the watchdog modules that are not needed:

       # rmmod <wrong_module>

Repeat the above for every module that is not needed.
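
When several modules have to be removed, it can help to generate the rmmod commands first and review them before running anything. The following is a minimal sketch: the function name print_unload_commands is hypothetical, it only prints commands and unloads nothing, and it reads the module list from standard input.

```shell
# Hypothetical helper: print an rmmod command for every module on stdin
# except the one that should stay loaded. Nothing is unloaded here.
# $1 = name of the module to keep
# stdin = one module per line; only the first field is used, so lsmod
#         output lines can be piped in unchanged
print_unload_commands() {
    keep=$1
    while read -r name _rest; do
        [ -n "$name" ] || continue          # ignore empty lines
        [ "$name" = "$keep" ] && continue   # never unload the wanted module
        printf 'rmmod %s\n' "$name"
    done
}
```

For example, the output of the lsmod command shown above can be piped into "print_unload_commands softdog". The printed commands can then be checked and executed by hand.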

Step 3. The right watchdog module has to be loaded.


Load the right module:

       # modprobe softdog

Note: We use the "softdog" module here as an example. The right module has to be determined following step 1.

Step 4. Automate module loading

In order to automate loading of the right watchdog module, it has to be included into the boot process.

Add the right watchdog kernel module to the initrd. This can be achieved by appending the basename of the module to the content of the INITRD_MODULES variable and then rebuilding the initrd:

       # vi /etc/sysconfig/kernel
       INITRD_MODULES=" ... softdog"

       # mkinitrd

Note: We use the "softdog" module here as an example. The right module has to be determined following step 1.
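
The edit of the INITRD_MODULES variable can also be scripted. The following is a minimal sketch under stated assumptions: the function name add_initrd_module is hypothetical, the variable is assumed to sit on a single line in the form INITRD_MODULES="...", and the file path is passed as an argument so the function can be tried on a copy first. It does not rebuild the initrd, so mkinitrd still has to be run afterwards.

```shell
# Hypothetical helper: append a module to INITRD_MODULES unless it is
# already listed. Assumes a single line of the form INITRD_MODULES="...".
# If no such line exists, the file is left unchanged.
# $1 = module name, for example softdog
# $2 = sysconfig file to edit
add_initrd_module() {
    module=$1
    file=$2
    # Nothing to do if the module is already one of the listed words
    if grep -Eq "^INITRD_MODULES=\"([^\"]* )?$module( [^\"]*)?\"" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Write the modified content back with cat so that the ownership
    # and permissions of the original file are kept.
    sed "s/^INITRD_MODULES=\"\(.*\)\"/INITRD_MODULES=\"\1 $module\"/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

A cautious way to use this is to copy /etc/sysconfig/kernel to a scratch location, run the function on the copy, and compare both files with diff before changing the real file.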

Step 5. The watchdog has to be tested.

Check whether the module was added to the initrd:

       # zcat /boot/initrd | cpio -it 2>/dev/null | grep -e wdt -e dog

Note: A module whose name does not contain the string "dog" or "wdt" will not show up.

Check whether the watchdog module was loaded:

       # lsmod | grep -e wdt -e dog

Note: A module whose name does not contain the string "dog" or "wdt" will not show up.

Check whether the watchdog module has created a device file for communication:

       # ls -l /dev/watchdog

Check whether the watchdog is already used by a process. Only one process should use the watchdog:

       # lsof /dev/watchdog
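
The two device checks above can be combined into one non-destructive helper. The following is a minimal sketch: the function name check_watchdog_device is hypothetical. It only inspects the device node and asks lsof for the processes that hold it open. It never opens the device itself, so it does not arm the watchdog.

```shell
# Hypothetical helper: report whether the watchdog device node exists
# and how many processes have it open. The device is never opened here.
# $1 = device path (defaults to /dev/watchdog)
check_watchdog_device() {
    dev=${1:-/dev/watchdog}
    if [ ! -c "$dev" ]; then
        echo "$dev: missing or not a character device"
        return 1
    fi
    echo "$dev: character device present"
    # lsof -t prints one PID per line; skip the check if lsof is absent
    if command -v lsof >/dev/null 2>&1; then
        users=$(lsof -t "$dev" 2>/dev/null | wc -l | tr -d ' ')
        echo "$dev: opened by $users process(es)"
    fi
    return 0
}
```

A count greater than one indicates that more than one process is using the watchdog, which conflicts with the rule stated above.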

Check whether the freshly loaded watchdog module works. This test forces a hard reboot of the system. It usually takes up to 60 seconds until the hard reboot happens. File systems will be left unclean. Do not do this on a production system:

       # echo "do not do this on a production system:"
       # echo "cat /dev/watchdog"


Online documentation

       https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
       http://linux-ha.org/wiki/SBD_Fencing
       http://www.clusterlabs.org/doc/crm_fencing.html
       http://code.metager.de/source/xref/linux/stable/drivers/s390/char/vmwatchdog.c

Kernel documentation

       /usr/src/linux/Documentation/watchdog/watchdog-api.txt


Technical Information Documents

       SBD Operation Guidelines for HAE Clusters
           https://www.suse.com/support/kb/doc.php?id=7011346
       SBD setup - debug and verify (OPENAIS)
           https://www.suse.com/support/kb/doc.php?id=7009485
       iTCO_wdt does not accept Watchdog Timeout bigger 63 seconds
           https://www.suse.com/support/kb/doc.php?id=7011426


Manual pages

       lsof(8)
       mkinitrd(8)
       modinfo(8)
       modprobe(8)
       SuSEconfig(8)
       sbd(8)

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7016880
  • Creation Date: 05-Oct-2015
  • Modified Date: 03-Mar-2020
    • SUSE Linux Enterprise High Availability Extension
    • SUSE Linux Enterprise Server
