Settings for long timeout in SBD_DELAY_START

This document (7023572) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12

Situation

When SBD is configured with a very high msgwait value and the setting SBD_DELAY_START is used, two issues can occur.

Issue number one can be that the SBD service times out during start, because the delay introduced by SBD_DELAY_START can take longer than the default start timeout that systemd applies to system services.

Issue number two can be that the returning node starts corosync as soon as it comes back and thereby blocks the cluster. From the cluster's perspective everything appears to have worked, for example fencing, but the surviving node then waits until the fenced node has returned.

The logs show entries similar to:

Dec 03 15:29:25 [3533] animal pengine: notice: LogActions: Start fs_mysap (animal - blocked)

Resolution

To fix this, copy sbd.service to /etc/systemd/system

   cp /usr/lib/systemd/system/sbd.service /etc/systemd/system/sbd.service

and then edit

   /etc/systemd/system/sbd.service

and add to the section

   [Unit]
   Before=corosync.service

and add to the section

   [Service]
   TimeoutSec=600

so that the file looks like this:

[Unit]
Description=Shared-storage based fencing daemon
Documentation=man:sbd(8)
Before=pacemaker.service
Before=dlm.service
Before=corosync.service
After=systemd-modules-load.service iscsi.service
PartOf=corosync.service
RefuseManualStop=true
RefuseManualStart=true

[Service]
Type=forking
PIDFile=/var/run/sbd.pid
EnvironmentFile=-/etc/sysconfig/sbd
ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid watch
ExecStop=/usr/bin/kill -TERM $MAINPID
TimeoutSec=600
# Could this benefit from exit codes for restart?
# Does this need to be set to msgwait * 1.2?
# TimeoutSec=
# If SBD crashes, it'll very likely suicide immediately due to the
# hardware watchdog. But one can always try.
Restart=on-abort

[Install]
RequiredBy=corosync.service
RequiredBy=pacemaker.service
RequiredBy=dlm.service

Then reload the systemd configuration with

   systemctl daemon-reload
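
To confirm that the edited copy carries both changes, the file can be checked with a small script. The following is only a sketch: the function names unit_values and check_sbd_override are made up for this example and are not part of sbd or systemd, and the parsing assumes plain "Key=value" lines under "[Section]" headers as in the file shown above.

```shell
#!/bin/sh
# Print the value(s) of KEY inside SECTION of a systemd unit file.
# $1 = unit file path, $2 = section name without brackets, $3 = key.
# Commented-out lines such as "# TimeoutSec=" are ignored because they
# do not start with the key itself.
unit_values() {
    awk -v section="[$2]" -v key="$3" '
        /^\[/ { in_section = ($0 == section); next }
        in_section && index($0, key "=") == 1 { print substr($0, length(key) + 2) }
    ' "$1"
}

# Return 0 when the file orders sbd before corosync ([Unit] Before=)
# and sets a non-empty TimeoutSec in [Service]; return non-zero otherwise.
check_sbd_override() {
    unit_values "$1" Unit Before | grep -qwF 'corosync.service' &&
    [ -n "$(unit_values "$1" Service TimeoutSec)" ]
}
```

A possible call, assuming the path used in this document:

   check_sbd_override /etc/systemd/system/sbd.service && echo "override looks complete"

This only inspects the file on disk. To see the timeout that systemd has actually loaded after the reload, "systemctl show sbd.service -p TimeoutStartUSec" can be used.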

Cause

Issue number one is caused by the SBD start delay (SBD_DELAY_START, governed by msgwait) exceeding the systemd start timeout for the service.

Issue number two is caused by the corosync service starting on the returning node before the SBD start delay has elapsed.
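
Whether a system is exposed to issue number one can be estimated by comparing msgwait with the systemd start timeout. The sketch below assumes that "sbd -d <device> dump" prints a line of the form "Timeout (msgwait)  : 120"; the function names get_msgwait and delay_exceeds_timeout are hypothetical helpers for this example only.

```shell
#!/bin/sh
# Read `sbd -d <device> dump` output on stdin and print the msgwait
# timeout in seconds. Assumes a line like "Timeout (msgwait)  : 120".
get_msgwait() {
    awk -F: '/Timeout \(msgwait\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Return 0 when the start delay reaches or exceeds the start timeout.
# $1 = msgwait in seconds, $2 = systemd start timeout in seconds.
delay_exceeds_timeout() {
    [ "$1" -ge "$2" ]
}
```

A possible call, where the device path is a placeholder and 90 stands for the usual systemd default start timeout in seconds (an assumption, as the default can be changed system-wide):

   msgwait=$(sbd -d /dev/disk/by-id/EXAMPLE dump | get_msgwait)
   delay_exceeds_timeout "$msgwait" 90 && echo "TimeoutSec needs to be raised"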

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.