Possible timeout of NFSv4 Export during failover or restart

This document (7023732) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12

Situation

An NFSv4 server is set up as a Pacemaker cluster, fencing is in place, and most things seem to be working from a logical perspective.

An issue might be detected during a controlled failover. For example, when stopping the cluster on a node:

systemctl stop pacemaker

Each exportfs resource is seen to take 90 seconds to stop. This might be too long for a connected NFS client: a client process might report an error instead of experiencing a transparent failover.

This issue might be even more disruptive if there are multiple NFSv4 exports grouped together, as the sequential nature of the group acts as a multiplier on the 90-second timeout. For example, 3 exports stopping in sequence might take a total of 3 x 90 = 270 seconds to stop.

This issue can also be encountered when a resource reports a monitor error, as this will trigger the cluster to stop and start the resource.

Resolution

It may be helpful to read the "Cause" section below, before reading this "Resolution" section.

There are several options to partially or fully address these delays. Some options are much better than others, and not all will result in a correct and fully supportable NFS Server Cluster implementation.

1. The BEST SOLUTION is to design the cluster implementation correctly, whereby the option "wait_for_leasetime_on_stop=true" is not needed on exportfs primitives. For a discussion of correct implementation, see:
https://www.suse.com/support/kb/doc/?id=000020396

2. If (because of other preferences) an improper implementation is desired, which cannot fully support all the features of NFSv4, some alternatives exist to lessen the wait periods caused by "wait_for_leasetime_on_stop=true".

a. If multiple exports are being stopped sequentially, the total delay can be reduced by allowing those resources to stop in parallel. Therefore, only a delay of approximately 90 seconds (the lease time) would be present, instead of multiple delays, one after another. See the "Additional Information" section for an example of this.

b. The lease time can be lowered to shorten the delays. However, this can be dangerous. A lease time that is too low can result in loss of leases and locks even under normal operating conditions, when no cluster failure or restart event is in progress. The lease time can be set within /etc/sysconfig/nfs:

NFSV4LEASETIME="60"

Lowering the value from 90 (the default) to 60 is usually safe. Some people might choose to go lower, so it should be noted that SUSE Support strongly discourages going lower than 30, and cannot promise safety even at that value. If any problems are encountered while using low lease time values, SUSE Support may require that the lease time be raised again while troubleshooting.
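
As a rough illustration of options 2a and 2b, the following shell sketch reads the configured lease time from a sysconfig-style file and estimates the worst-case stop delay for a set of exports. The function names lease_time_from_sysconfig and stop_delay are hypothetical helpers, not part of any SUSE tool. The estimate assumes each export waits exactly one lease time on stop and ignores any other stop overhead.

```shell
# Hypothetical helper: print the NFSV4LEASETIME value from a sysconfig-style
# file, falling back to 90 (the default named in this document) if it is
# unset, empty, or the file cannot be read.
lease_time_from_sysconfig() {
    file=${1:-/etc/sysconfig/nfs}
    value=$(sed -n 's/^NFSV4LEASETIME="\{0,1\}\([0-9][0-9]*\)"\{0,1\}[[:space:]]*$/\1/p' "$file" 2>/dev/null | tail -n 1)
    echo "${value:-90}"
}

# Hypothetical helper: estimate the stop delay in seconds.
# $1 = number of exports, $2 = lease time, $3 = "sequential" or "parallel"
stop_delay() {
    count=$1
    lease=$2
    mode=${3:-sequential}
    if [ "$mode" = "parallel" ]; then
        # Exports stop side by side, so only one lease time elapses.
        echo "$lease"
    else
        # Exports in a group stop one after another.
        echo $((count * lease))
    fi
}

lease=$(lease_time_from_sysconfig /etc/sysconfig/nfs)
echo "sequential: $(stop_delay 3 "$lease" sequential) seconds"
echo "parallel:   $(stop_delay 3 "$lease" parallel) seconds"
```

On a running NFS server, the value the kernel is actually using can usually be read from /proc/fs/nfsd/nfsv4leasetime, which is worth comparing against the configured value after any change.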

Cause

This behavior is caused by a complicated set of misperceptions about how NFSv4 exports should be implemented in a cluster. NFSv4 uses a "Lease Time", which governs how often clients renew their locks and leases. The lease time defaults to 90 seconds. In certain inappropriate cluster configurations, this lease time can cause delays while bringing NFS resources down, and can even result in fencing. To compensate for those concerns, a parameter was created for exportfs primitives, known as:

wait_for_leasetime_on_stop=true
# the default is "false"

With this set to "true", when an export is stopped, a delay equal to the lease time is enforced. However, this wait would not be necessary in a properly implemented NFSv4 server cluster. The setting of "true" is just a band-aid for an incorrect implementation.
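
To see which primitives in an existing configuration carry this setting, a filter along the following lines may help. It is a minimal sketch: find_lease_wait_primitives is a hypothetical helper name, and it assumes the configuration text is laid out the way "crm configure show" normally prints it, with each definition starting on a line that begins with "primitive".

```shell
# Hypothetical helper: read crm configuration text on stdin and print the
# name of every primitive that sets wait_for_leasetime_on_stop to true.
find_lease_wait_primitives() {
    awk '
        # Remember the name of the primitive currently being defined.
        /^primitive[[:space:]]/ { name = $2 }
        # Report it when the parameter appears, quoted or unquoted.
        /wait_for_leasetime_on_stop="?true"?/ && name != "" { print name }
    '
}

# Typical use on a cluster node (not run here):
#   crm configure show | find_lease_wait_primitives
```

Each name printed is an export that will wait one full lease time whenever it is stopped.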

Additional Information

As mentioned in Resolution item #2a, a partial workaround to the delays may be accomplished by replacing a single group of exports with colocation and order constraints, so that the exportfs resources stop and start in parallel instead of sequentially.

For example, with the following set of exportfs primitives and IP primitive:

   primitive NFS_HA0_ERS20 exportfs \
   primitive NFS_HA0_ERS30 exportfs \
   primitive NFS_HA0_ERS40 exportfs \
   primitive NFS_HA0_ERS50 exportfs \
   primitive NFS_HA0_ERS60 exportfs \
   primitive IP_work IPaddr2

Instead of using a grouping such as:

group my-group NFS_HA0_ERS20 NFS_HA0_ERS30 NFS_HA0_ERS40 NFS_HA0_ERS50 NFS_HA0_ERS60 IP_work

use instead an approach such as:

    colocation ERS20_with inf: NFS_HA0_ERS20 IP_work
    order ERS20_after inf: IP_work NFS_HA0_ERS20
    colocation ...
    order ....

for each and every NFS export, which results in the following dependency layout:

            ->    NFS_HA0_ERS20
            ->    NFS_HA0_ERS30
    IP_work ->    NFS_HA0_ERS40
            ->    NFS_HA0_ERS50
            ->    NFS_HA0_ERS60

This setup would then provide a stop and start time that is essentially only the NFSV4LEASETIME plus a small margin.
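
Writing one colocation and one order line per export by hand is repetitive, so the lines can be generated. The sketch below prints constraint lines for the example resources above. The function name parallel_constraints is hypothetical, and the constraint IDs for the remaining exports (ERS30_with, ERS30_after, and so on) are an assumption that simply extends the ERS20 naming shown earlier. Review the output before loading anything into a cluster.

```shell
# Hypothetical helper: print crm configuration lines that tie each export
# to the IP resource individually, instead of chaining them in one group.
# $1 = IP resource name, remaining arguments = exportfs resource names.
parallel_constraints() {
    ip=$1
    shift
    for res in "$@"; do
        # Use the part after the last underscore as the ID prefix,
        # e.g. NFS_HA0_ERS20 -> ERS20 (assumed naming pattern).
        id=${res##*_}
        echo "colocation ${id}_with inf: ${res} ${ip}"
        echo "order ${id}_after inf: ${ip} ${res}"
    done
}

parallel_constraints IP_work NFS_HA0_ERS20 NFS_HA0_ERS30 NFS_HA0_ERS40 NFS_HA0_ERS50 NFS_HA0_ERS60
```

The generated lines only replace the group definition; the primitives themselves stay unchanged.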

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.