Pacemaker: the configured stop timeout is not respected when stopping an OCFS2 filesystem resource.

This document (000020860) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise Server for SAP Applications 12

Situation

The manually configured stop timeout is not respected when an OCFS2 filesystem fails to unmount during the resource's stop operation. The stop operation fails early, and the node is fenced as a result.

Here is an example of a Filesystem resource configuration:

primitive myfs Filesystem \
    params device="/dev/mapper/3600axxxxxxxxxxx" directory="/myfs" fstype=ocfs2 options=acl \
    op monitor interval=0 timeout=10 \
    op start interval=0 timeout=20 \
    op stop interval=0 timeout=240

And here is an example of the error while trying to stop the resource:

crmd[8888]: notice: Initiating stop operation myfs_stop_0 locally on fileserver2p
lrmd[8888]: notice: executing - rsc:myfs action:stop call_id:102
Filesystem(myfs)[99999]: INFO: Running stop for /dev/mapper/3600axxxxxxxxxxx on /myfs
Filesystem(myfs)[99999]: INFO: Trying to unmount /myfs
Filesystem(myfs)[99999]: ERROR: Couldn't unmount /myfs; trying cleanup with TERM
Filesystem(myfs)[99999]: INFO: No processes on /myfs were signalled. force_unmount is set to 'yes'
Filesystem(myfs)[99999]: ERROR: Couldn't unmount /myfs, giving up!
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ umount: /myfs: target is busy. ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ ocf-exit-reason:Couldn't unmount /myfs; trying cleanup with TERM ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ umount: /myfs: target is busy. ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ ocf-exit-reason:Couldn't unmount /myfs; trying cleanup with KILL ]
lrmd[8888]: notice: myfs_stop_0:99999:stderr [ ocf-exit-reason:Couldn't unmount /myfs, giving up! ]
lrmd[8888]: notice: finished - rsc:myfs action:stop call_id:102 pid:99999 exit-code:1 exec-time:7050ms queue-time:0ms
crmd[8888]: notice: Result of stop operation for myfs on node2: 1 (unknown error)

In the above example, the resource is configured with a stop timeout of 240 seconds, but the logs show the stop operation failing after only 7050 ms (about 7 seconds).
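To check for this symptom in your own logs, compare the "exec-time" value reported by lrmd with the configured stop timeout. The following is a minimal sketch only: the sample line is copied from the log excerpt above, and the variable names are illustrative assumptions, not part of Pacemaker.

```shell
# Sample line copied from the log excerpt above.
line='lrmd[8888]: notice: finished - rsc:myfs action:stop call_id:102 pid:99999 exit-code:1 exec-time:7050ms queue-time:0ms'

# Stop timeout from the example resource configuration, in seconds.
configured_timeout_s=240

# Extract the number of milliseconds that follows "exec-time:".
exec_ms=$(printf '%s\n' "$line" | sed -n 's/.*exec-time:\([0-9][0-9]*\)ms.*/\1/p')
exec_s=$((exec_ms / 1000))

# A stop that fails well before the configured timeout matches this issue.
if [ "$exec_s" -lt "$configured_timeout_s" ]; then
    echo "stop ended after ${exec_s}s, before the ${configured_timeout_s}s timeout"
fi
```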

Resolution

On SLES 12 based systems only, set the "fast_stop" parameter to "no" in the resource configuration. The resource agent's help describes the parameter as follows:

# crm ra info ocf:Filesystem

fast_stop (boolean, [yes]): fast stop
    Normally, we expect no users of the filesystem and the stop
    operation to finish quickly. If you cannot control the filesystem
    users easily and want to prevent the stop action from failing,
    then set this parameter to "no" and add an appropriate timeout
    for the stop operation.

Applied to the example from the "Situation" section above, the resource configuration should look similar to this:

primitive myfs Filesystem \
    params device="/dev/mapper/3600axxxxxxxxxxx" directory="/myfs" fstype=ocfs2 options=acl fast_stop=no \
    op monitor interval=0 timeout=10 \
    op start interval=0 timeout=20 \
    op stop interval=0 timeout=240

Cause

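In the Filesystem resource agent's script, the "fast_stop" parameter (the FAST_STOP variable in the code) defaults to "yes". When it is enabled, the agent uses a hard-coded unmount timeout of 6 seconds instead of the configured stop timeout. Here is the relevant snippet of the code: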

# /usr/lib/ocf/resource.d/heartbeat/Filesystem

                # Umount all sub-filesystems mounted under $MOUNTPOINT/ too.
                local timeout
                for SUB in `list_submounts $MOUNTPOINT` $MOUNTPOINT; do
                        ocf_log info "Trying to unmount $SUB"
                        if ocf_is_true "$FAST_STOP"; then
                                timeout=6
                        else
                                timeout=${OCF_RESKEY_CRM_meta_timeout:="20000"}
                                timeout=$((timeout/1000))
                        fi
                        fs_stop $SUB $timeout
                        rc=$?
                        if [ $rc -ne $OCF_SUCCESS ]; then
                                ocf_exit_reason "Couldn't unmount $SUB, giving up!"
                        fi
                done
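The timeout selection in the snippet above can be reproduced outside the cluster with a small stand-alone sketch. The function name pick_unmount_timeout is hypothetical, and the case pattern is only a simplified stand-in for the agent's ocf_is_true helper. It is meant to illustrate the logic, not to replace the agent's code.

```shell
# Hypothetical helper that mirrors the timeout selection shown above.
#   $1: value of fast_stop ("yes" or "no")
#   $2: stop operation timeout in milliseconds (defaults to 20000, as in the agent)
pick_unmount_timeout() {
    case "$1" in
        # Simplified stand-in for ocf_is_true; an assumption for this sketch.
        yes|YES|true|TRUE|1|on|ON)
            # fast_stop enabled: hard-coded 6 seconds, stop timeout ignored.
            echo 6
            ;;
        *)
            # fast_stop disabled: use the operation timeout, converted to seconds.
            echo $(( ${2:-20000} / 1000 ))
            ;;
    esac
}

pick_unmount_timeout yes 240000   # prints 6
pick_unmount_timeout no 240000    # prints 240
```

With fast_stop enabled, the 240-second stop timeout from the example configuration never comes into play, which matches the roughly 7-second failure seen in the logs.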

The default value of "fast_stop" was changed to "no" in the resource agents package v4.7.x. The following upstream commit explains the confusion that the old default value caused:

    https://github.com/ClusterLabs/resource-agents/commit/57b6019ffc141c803d879df2352e699fbb72f7dc

Set OCF_RESKEY_fast_stop_default="no" for RHEL and CentOS major releases
9 and above, and for all other distros.

In the past, this attribute has defaulted to "yes", which has caused a
lot of confusion for users. fast_stop preempts the resource's stop
timeout, causing the agent to give up on unmounting the filesystem after
six seconds and declare a stop failure. (The resource operation does not
time out.)

The existence of a stop operation timeout renders fast_stop unnecessary,
and users typically expect that the agent will keep trying to unmount
the filesystem until the full stop operation timeout expires.

The resource agents package v4.7.x is not available on SLES 12 based systems, so the "fast_stop" parameter must be set to "no" explicitly in the resource configuration, as described in the "Resolution" section above.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

Document ID: 000020860
Creation Date: 16-Nov-2022
Modified Date: 03-Mar-2023
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server for SAP Applications
