Settings for long timeout in SBD_DELAY_START

This document (7023572) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12

Situation

With very high msgwait values for SBD, two issues can occur if the setting SBD_DELAY_START is used.

Issue number one: the SBD service can time out during start, because the delay introduced by SBD_DELAY_START may be longer than the default start timeout that systemd applies to services.

Issue number two: when the fenced node returns, it starts corosync before the SBD delay has passed and thereby blocks the cluster. From a cluster perspective everything appears to have worked, fencing for example, but the surviving node then waits until the fenced node has returned.

The logs show entries similar to

   Dec 03 15:29:25 [3533] animal    pengine:   notice: LogActions: Start   fs_mysap   (animal - blocked)
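
To look for this symptom on a node, the pengine messages can be filtered for blocked actions. The following is only a minimal sketch: the function name blocked_actions is made up for this example, and because the location of the cluster log depends on the logging configuration, the log text is read from standard input.

```shell
# Hypothetical helper (not part of any SUSE package): print pengine
# LogActions lines that report a blocked resource action.
# Reads log text from standard input.
blocked_actions() {
    grep -E 'pengine:.*LogActions:.*blocked\)'
}
```

Feed it whichever file the cluster logs to, for example by redirecting that file into blocked_actions. An empty result means no blocked action was logged in that input.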

Resolution

To fix this, copy sbd.service to /etc/systemd/system

   cp /usr/lib/systemd/system/sbd.service /etc/systemd/system/sbd.service

then edit the copy

   /etc/systemd/system/sbd.service

and add to the [Unit] section

   [Unit]
   Before=corosync.service

and add to the [Service] section

   [Service]
   TimeoutSec=600

so that the file looks like

   [Unit]
   Description=Shared-storage based fencing daemon
   Documentation=man:sbd(8)
   Before=pacemaker.service
   Before=dlm.service
   Before=corosync.service
   After=systemd-modules-load.service iscsi.service
   PartOf=corosync.service
   RefuseManualStop=true
   RefuseManualStart=true

   [Service]
   Type=forking
   PIDFile=/var/run/sbd.pid
   EnvironmentFile=-/etc/sysconfig/sbd
   ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch
   ExecStop=/usr/bin/kill -TERM $MAINPID
   TimeoutSec=600
   # Could this benefit from exit codes for restart?
   # Does this need to be set to msgwait * 1.2?
   # TimeoutSec=
   # If SBD crashes, it'll very likely suicide immediately due to the
   # hardware watchdog. But one can always try.
   Restart=on-abort

   [Install]
   RequiredBy=corosync.service
   RequiredBy=pacemaker.service
   RequiredBy=dlm.service

and then issue

   systemctl daemon-reload
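
After the reload it can be worth confirming that the edited copy really contains both changes. The sketch below is one way to do that; check_sbd_unit is a hypothetical helper written for this example, and it only inspects the text of the unit file, not the state systemd has loaded.

```shell
# Hypothetical helper: check that a unit file carries both changes.
# $1: path to the unit file (defaults to the copy under /etc).
check_sbd_unit() {
    unit=${1:-/etc/systemd/system/sbd.service}
    # An active (uncommented) Before= line that names corosync.service
    grep -Eq '^Before=.*corosync\.service' "$unit" \
        || { echo "missing Before=corosync.service"; return 1; }
    # An active TimeoutSec= line with a numeric value
    grep -Eq '^TimeoutSec=[0-9]+' "$unit" \
        || { echo "missing TimeoutSec"; return 1; }
    echo "ok"
}
```

To see what systemd itself has loaded, "systemctl cat sbd.service" prints the unit file in use, and "systemctl show sbd.service -p TimeoutStartUSec" should report the effective start timeout.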


Cause

Issue number one is caused by the SBD timeout exceeding the systemd service timeout.
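
Whether a given setup is exposed to issue number one can be estimated by comparing msgwait with the start timeout. The sketch below rests on assumptions: both function names are invented for this example, the start delay is taken to be roughly equal to msgwait, and the header dump is assumed to contain a line of the form "Timeout (msgwait)  : <seconds>".

```shell
# Hypothetical helper: extract the msgwait timeout (seconds) from the
# output of "sbd -d <device> dump", read on standard input.
# Assumes a line of the form "Timeout (msgwait)  : 120".
msgwait_from_dump() {
    awk -F: '/^Timeout \(msgwait\)/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Hypothetical helper: succeed if a start delay of $1 seconds is
# shorter than a service start timeout of $2 seconds.
delay_fits_timeout() {
    [ "$1" -lt "$2" ]
}

# Usage sketch (device path is a placeholder, not a real device):
#   msgwait=$(sbd -d /dev/disk/by-id/EXAMPLE dump | msgwait_from_dump)
#   delay_fits_timeout "$msgwait" 90 || echo "start timeout too short"
```

The value 90 in the usage sketch is the stock systemd default start timeout in seconds; replace it with the TimeoutSec value of the unit when checking the adjusted service.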

Issue number two is caused by starting the corosync service on the returning node before waiting for the SBD timeout.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7023572
  • Creation Date: 10-Dec-2018
  • Modified Date: 03-Mar-2020
    • SUSE Linux Enterprise High Availability Extension
