Necessary NFS Server Cluster Design for NFS Client Lock and File State Preservation

This document (000020396) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 15
SUSE Linux Enterprise Server 12

Situation

HA clusters intended to serve as NFS Servers have been designed in a variety of ways, and advice on the subject has been published across the internet in various blogs and other sources, including some at SUSE.  Depending on what is needed from the NFS Server cluster, many of the common recommendations may not work as well as expected.  While they may work very well for many scenarios, they are often lacking when it comes to fully supporting NFS v4.  The most common problems not properly addressed are:

(1)  Loss of client locks (and all other information held by the NFS Server about open files, file states, client states, etc.) during a cluster failover, or even during normal operations without any cluster failovers or migrations taking place.

(2)  Delays during and after failovers, including (but not limited to) the risk of fencing a node while resources are stopped.

Due to delays or loss of file state information, NFS client applications may receive IO errors (EIO), sometimes logged as error 5 or -5.  After this, the application may or may not handle the error well enough to recover gracefully and continue its work.

This document is primarily intended to discuss general design requirements (and reasons for them) to accomplish an NFS Server HA Cluster which can failover quickly and can also truly support preservation of file states on NFS v4.

This document is mostly conceptual and is not intended to be a step-by-step guide for creating such a cluster.  Step-by-step guides which follow these principles have been published in the Online Documentation for a few of the most recent SLES distributions:

https://documentation.suse.com/sle-ha/15-SP4/single-html/SLE-HA-nfs-storage/index.html

https://documentation.suse.com/sle-ha/15-SP3/single-html/SLE-HA-nfs-storage/index.html

https://documentation.suse.com/sle-ha/12-SP5/html/SLE-HA-all/art-ha-quick-nfs.html

Resolution

Note that the various factors below are very interrelated.  As such, it is difficult to decide on the best order to present this discussion.  Fully meshing these concepts together may require more than one reading.  Leaving out one piece of this "puzzle" may break others.
 

1.  Preserving knowledge of file states at the NFS Server.

      AKA Sharing the nfsdcltrack location.


Linux kernel-based NFS Servers track v4 client state and file state within the directory path /var/lib/nfs/nfsdcltrack/.  Therefore, that path should be a shared cluster resource which will move from one node to another along with the exported file systems.  Without this, file states (opens, locks, delegations, etc.) which were held before failover cannot be reclaimed after failover.  In such cases, applications will often receive IO errors (EIO) during or after failover.  The client application (through some intelligence and some luck) may be able to handle an IO error and then recover access to its files, but this can easily fail if another client accesses the file first, or if the client application was not written with good error-handling code.

Having /var/lib/nfs/nfsdcltrack/ as a shared resource is usually accomplished using DRBD.

FYI, the location for nfsdcltrack data is configurable within /etc/nfs.conf, in the [nfsdcltrack] section, with the value "storagedir".  If changed, this would need to be set identically on every node.  This allows some flexibility in how the sharing of this resource is handled.  For example, it could be located on a file system resource that is already shared and movable, instead of creating an entirely separate resource for it.  However, if a cluster is handling multiple NFS file system resources, having a separate resource for nfsdcltrack may have some long-term advantages.  For example, if the NFS file system resource which holds nfsdcltrack is decommissioned, all other NFS file system resources will be affected, and a new location for nfsdcltrack would have to be decided and configured.  Having a separate resource for nfsdcltrack avoids such future concerns.  There may also be performance advantages, or protection against high-load problems, in giving nfsdcltrack its own separate resource.
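For example, a minimal sketch of relocating this data, assuming a hypothetical shared mount point /mnt/nfsstate, might look like the following (set identically on every node, e.g. in /etc/nfs.conf.local, which is discussed further in #6 below):

[nfsdcltrack]
storagedir=/mnt/nfsstate/nfsdcltrack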
 

2.  Allowing NFS clients to reclaim the known locks and other file states.

      AKA Grouping exports together and starting nfsserver anew during resource migration, to initiate the NFS4 Grace Period.


When NFS v4 Server resources restart (or fail over), the clients' abilities to reclaim existing states depend upon a "grace period" during which those reclaims can succeed.  During the grace period, clients cannot open or lock new files, they can only reclaim old states.  (For information on setting the length of this grace period, see item #5, below.)

The grace period is offered only when the nfsserver.service is started fresh.  Furthermore, because of the way the NFS server tracks client state (#1 above), successful client state preservation requires that all NFS-exported file systems in the cluster be located on the same node, and migrate together with the nfsserver resource.

Many NFS clustering guides incorrectly recommend that the nfsserver.service be a cloned resource which runs on all nodes simultaneously.  Under that recommendation, nfsserver is not stopped and started during failovers.  No fresh start of the nfsserver service means no grace period is given, and all previous file and client states are lost.

In summary, nfsserver cannot be a cloned resource that runs on every node.  It must be an individual primitive running on one node, and all the NFS-related primitives (file systems, exportfs resources, and the virtual IP address) must be grouped with it.
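As a sanity check after a failover, the kernel log on the node which took over can confirm that a fresh start (and therefore a grace period) actually occurred.  The exact message wording varies by kernel version, but it is typically similar to the following:

# Run on the node now hosting the NFS Server resources:
dmesg | grep -i grace
# Expected output similar to:  NFSD: starting 90-second grace period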
 

3.  Ordering the starting and stopping of resources correctly.


The order of operations for starting and stopping all the NFS-related resources becomes extremely important for all these requirements to function together.  Typically, all the following resources must be grouped together and ordered as follows (a configuration sketch is shown after the notes below):

3.1  Start drbd for the various file system volumes (all those which will be NFS exported).
3.2  Mount those file systems.
3.3  Start drbd for the nfsdcltrack location.
3.4  Mount nfsdcltrack location.
3.5  Start nfsserver service.
3.6  Export the file systems, but do not set "wait_for_leasetime_on_stop=true".  This will be discussed in #4, below.
3.7  Add the virtual IP.

The above order will help ensure that file state can be reclaimed after start.  Since this order is also used in reverse while stopping the services at a node, it will also ensure that shutdown of resources occurs in a safe and timely manner, with fewer potential delays and fewer problems than some other cluster implementations experience.

NOTE:  In theory, items 3.1-3.4 have some flexibility in their order.  For example, steps 3.3 & 3.4 could come first, or both drbd steps could come first, followed by both mounting steps.  Overall, it is important that 3.1-3.4 be accomplished first, after which 3.5, 3.6, and 3.7 must proceed in that specific order.  For more notes about this order, see the "Additional Information" section, further below.
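To illustrate, below is a rough crmsh sketch which follows this order.  All resource names here are hypothetical, and the DRBD steps (3.1 and 3.3) are expressed as order and colocation constraints against promotable DRBD resources rather than as group members; the details must be adapted to the actual environment.

# Group members start left-to-right and stop right-to-left,
# covering steps 3.2 and 3.4 through 3.7:
group g_nfs fs_data fs_cltrack p_nfsserver exportfs_data vip_nfs
# DRBD promotion (steps 3.1 and 3.3) must complete before the group starts:
order o_drbd_data_first Mandatory: ms_drbd_data:promote g_nfs:start
order o_drbd_cltrack_first Mandatory: ms_drbd_cltrack:promote g_nfs:start
colocation c_nfs_with_data inf: g_nfs ms_drbd_data:Master
colocation c_nfs_with_cltrack inf: g_nfs ms_drbd_cltrack:Master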
 

4.  Unexporting file systems without delay.

      AKA Do not use "wait_for_leasetime_on_stop=true"


It has been commonly recommended (but misunderstood) that exportfs primitives for clustered, NFS-shared file systems should include the option:

wait_for_leasetime_on_stop=true

THE ABOVE IS NOT RECOMMENDED, and will lead to unnecessary delays and loss of locks and other file states.  FYI, Lease Time will be discussed in more detail in #5, below.

That option should be removed.  It will then default to false.  Alternatively, as a reminder of the correct configuration, this option can be kept in place but explicitly set to "false".
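For illustration, a hypothetical exportfs primitive with the option explicitly set to false might look like this (the directory, clientspec, fsid, and options values are placeholders):

primitive p_exportfs_data exportfs \
    params directory="/srv/nfs/data" clientspec="10.0.0.0/24" \
    options="rw,no_root_squash" fsid=1 wait_for_leasetime_on_stop=false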

For those who are curious about why this option exists (since it can cause delays and failures), the following explanation may prove interesting:

In implementations where nfsserver.service is not stopped as part of failover, there is another potential problem (besides the loss of file state, already mentioned).  If nfsserver is not stopped, then the file systems which were exported may still have open files on them.  If open files exist on a file system, it is considered "busy" and cannot be dismounted.  If a file system cannot be dismounted, then normal failover of the file system to a new node is impossible, and therefore the old node may be "fenced" and forced down.

The option "wait_for_leasetime_on_stop=true" was created to avoid that fencing scenario.  The theory behind it is:  By first removing the virtual IP from the NFS cluster resource and then waiting out the entire NFS lease time, all the open files (and locks, delegations, etc.) will timeout and be cleared.  Since they are cleared, the file system will no longer be "busy" and can then umount properly.  Fencing the node is avoided by allowing this "waiting period".

HOWEVER, as already stated, this causes 2 obvious problems:  (1) You must wait out the lease time, which delays the failover, and (2) all file states (locks, opens, delegations, etc.) expire and are lost.

Instead, if nfsserver.service is stopped and started as part of a failover (as stated in #2 above) there will be no need for "wait_for_leasetime_on_stop=true".  If the recommended order (above in #3) is used, nfsserver.service will be stopped before the attempt to umount the file systems.  As such, all file states (from the file system's perspective) will get cleared.  Therefore, umount can proceed without delay and without problem.  Fencing is avoided.

With this recommended setup, even though the file states are cleared from the file system's perspective, they are still known to the nfsserver service, through its records in /var/lib/nfs/nfsdcltrack/ (see #1 above).  This will allow NFS clients to reclaim the open states, locks, etc. after the NFS Server is started fresh at the new node.
 

5.  Choosing a reasonable NFS v4 Lease Time and Grace Time

      AKA Low lease times and grace times are dangerous and can cause loss of file state.


During normal NFS v4 operations (even without a failover occurring), NFS v4 clients need to be able to renew their state with the NFS Server.  The frequency of this renewal is determined by the NFS4 "Lease Time" which is set by the NFS Server.  NFS Clients learn this value from the NFS Server.  On Linux NFS Servers, this value is set in /etc/sysconfig/nfs as:

NFSV4LEASETIME="90"

90 seconds is the Linux default, even if this value is left blank.  Additionally, the default behavior on Linux is to set grace time equal to lease time, which protects client recovery after failover.
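The values currently in effect can be inspected on a running NFS Server through the nfsd proc interface, for example:

# With the nfsserver service running:
cat /proc/fs/nfsd/nfsv4leasetime
cat /proc/fs/nfsd/nfsv4gracetime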

To explain the danger of low lease times:  Linux NFS Clients will attempt to renew their state after 2/3 of the lease time expires.  Then, various events (to confirm that the renewal has been successful) must complete within the remaining 1/3 of the lease time.  For example, at a low lease time value of 15 seconds, Linux clients will attempt to renew state after 10 seconds.  Then that renewal must successfully complete within the remaining 5 seconds.  Even though the majority of renewals can complete in time, it is still common for minor OS or network delays to cause this timer to expire before the renewal is completed.  Loss of file state will occur, and IO errors (EIO) and subsequent application failures are the common outcome.

Moving on to the need for sufficient grace time:  Clients need reasonable time to reclaim old file states after a failover or restart of nfsserver.  This occurs during the grace time.  However, clients don't necessarily attempt recovery the very instant the NFS Server comes back up.  A client may be waiting on its renew timer, or waiting inside a timeout/retry loop, before it makes another attempt.  A bit of extra time is usually needed to ensure success.  No new resources can be claimed during this grace period because one process's desire for a new resource could conflict with another process's need to recover its old resource.

Low lease times are mistakenly recommended by those who do not understand their implications, or who are not following the earlier recommendations in this document.  For example, if someone is unwisely using "wait_for_leasetime_on_stop=true", they will notice delays during failover, specifically when stopping services on the old node.  Subsequently, they often discover that lowering the lease time will reduce those delays.  But as stated in #4 above, it is wrong to use "wait_for_leasetime_on_stop=true".  Therefore, this reason for lowering lease time is not valid.

Another reason often given to justify lowering lease time is to obtain a lower grace time.  Grace time is usually set based on lease time, and long grace times can make clients wait extra time before obtaining new resources after a failover.  This argument has some validity.  However, there are other ways to mitigate this without lowering lease time.

On Linux NFS Servers, NFS 4.1 and 4.2 mitigate the concern about longer grace times by automatically ending the grace period early if no further states are waiting to be reclaimed.  SUSE recommends using NFS 4.1 or 4.2 rather than 4.0.  The newer versions were designed and implemented with better concepts, making them easier to deploy, more intelligent, and better at recovery than 4.0.

If the correct NFS cluster method (as defined by this document) is used, and especially if NFS 4.1 or 4.2 are used, it is not usually necessary to reduce the lease time or grace time.  However, if legitimate reasons to lower the lease and grace time are found, it can be done, but must be done carefully to protect basic functionality of NFS v4.  Lowering these from 90 to 60 is often safe.  Even lower values could be tested, but the lower the value, the more danger is created.  Therefore, SUSE Support strongly discourages any value lower than 30.  Low values (even 30) may not be safe or supportable.  SUSE Customer Service will often require higher values be put in place before spending time troubleshooting problems.
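If lowering is justified, a conservative change (per the guidance above) would be made identically on every node, for example:

# In /etc/sysconfig/nfs on every node; takes effect at the next
# fresh start of the nfsserver service:
NFSV4LEASETIME="60"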
 

6.   Keeping the NFS Server scope consistent if NFS 4.1 or 4.2 are used.


This concern does not exist in NFS 4.0; it is new to NFS 4.1 and higher.

For lock and state preservation and recovery on NFS 4.1 and 4.2, the NFS Server needs to identify itself with the same scope name every time it starts, regardless of which node it is on.  The Linux NFS Server normally uses the system's hostname for the scope name, but this is not adequate in an HA cluster, because HA requires a different hostname for each node.  Therefore, if NFS clients will be mounting with NFS version 4.1 or 4.2, the HA NFS Server's scope must be configured manually.

In contrast, when using older versions of NFS (4.0 and lower), it is enough to have the same IP address present on each node, which is already accomplished by the Virtual IP address assigned to the clustered resources.

It may be useful to note that the scope name can be any arbitrary string.  The scope name does not need to match any other name in use (such as a hostname or domain name), but is not prohibited from coincidentally matching other names.  For purposes of the examples below, the name "SUSE" has been arbitrarily chosen.

Below are the simplest ways to set this on specific SLES releases.  Each is different because these abilities are new and depend not only on the kernel version, but also on nfs-utils version (nfs-client and nfs-kernel-server packages), or in some cases on resource-agents package version.

SLES 15 SP4 with maintenance updates current as of the end of March 2023:

Option 1:  Within the HA Cluster primitive for the nfsserver service, within the "params" section, include "nfs_server_scope=SUSE".  Since this is within the primitive, it only has to be set once.  For example, it might look like this:

primitive p_nfsserver nfsserver params nfs_server_scope=SUSE nfs_shared_infodir="/var/lib/nfs"

Option 2:  In /etc/nfs.conf.local, configure this section and value:
[nfsd]
scope=SUSE

Notes: 
/etc/nfs.conf.local is used to customize the configuration of /etc/nfs.conf, which should not be modified directly.
If using this option, every node of the cluster would need this file created, and the scope must be set identically on each one.

SLES 15 SP3 with maintenance updates current as of the end of December 2022:

On each node of the cluster, create /etc/systemd/system/nfs-server.service.d/scope.conf with the following contents.  This will set up a unique, isolated "hostname" which only rpc.nfsd (the Linux Kernel-based NFS Server daemon) knows about, and that isolated hostname will get used as the scope name.

[Service]
ExecStart=
ExecStart=/usr/bin/unshare -u /bin/sh -ec "hostname SUSE; /usr/sbin/rpc.nfsd"

To make this change known to currently-running systemd, execute:

systemctl daemon-reload

SLES 12 SP5 with maintenance updates current as of the end of December 2022:

On each node of the cluster, create /etc/systemd/system/nfs-server.service.d/scope.conf with the following contents.  This will override the way rpc.nfsd is launched, to include a setting for the scope.

[Service]
ExecStart=
ExecStart=/usr/sbin/rpc.nfsd --scope SUSE $RPCNFSDARGS

To make this change known to currently-running systemd, execute:

systemctl daemon-reload

Regardless of any of the above SLES versions:

Even after making the configuration changes discussed above, the new scope name will not take effect until the next start of the nfsserver service.  On a standalone system, this might be done with "systemctl restart nfsserver", but as part of a cluster it will involve resource migration or bringing resources down and back up.  Note, however, that setting this value for the first time represents a change to the scope, so NFS 4.1 and 4.2 clients will not be able to reclaim their resources on the NFS Server after that service is restarted.  Therefore, it would be best to stop the applications on NFS client machines which rely on the NFS mounts before putting this setting into effect for the first time.
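In a cluster, putting the new scope into effect might look roughly like the following, where g_nfs is a hypothetical name for the NFS resource group (and, as noted above, dependent client applications should be stopped first):

# On one cluster node, after the scope has been configured identically everywhere:
crm resource stop g_nfs
crm resource start g_nfs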

Older distributions (SLES 15 SP2 and older, SLES 12 SP4 and older)

These older distributions do not support manually setting the scope.  For such systems within HA clusters, one option is to have clients use NFS version 4.0 or 3, which do not rely on the scope concept.  If clients use NFS 4.1 or 4.2 without the ability for the HA NFS Server to override the scope, file states may be lost during failover.  This may be an acceptable risk.  Some applications won't care.  The real impact of losing file states will depend on the applications running at the time of failover, and how well they react to such an event.
 

In Conclusion:


Taken all together, the 6 recommendations above should allow very rapid failover of the NFS Server resources, true file state preservation and reclamation after failover, and timely recovery by NFS client processes.  Many of the 6 points are interdependent, so ignoring one requirement may cause others to become useless or even harmful.

Status

Top Issue

Additional Information

Questions have arisen about the above recommendation of binding the virtual IP last while starting a group of resources (which equates to unbinding it first while stopping).  Even in cases where the fully strict approach above is not being used, binding the IP last (and unbinding it first) is generally best.

In a variety of NFS Server cluster implementations, starting IP last (and stopping it first) has been seen to help avoid loss of information about file states, locks, and client sessions; node fencing; NFS stale file handles; and failover delays.

Without going into lengthy explanations for each of those variations, it is probably sufficient to say:  The general philosophy for bringing up NFS (and many other clustered services) is that all services should be in place and ready to be used before adding the IP address through which other systems will reach those services.  When shutting down, the reverse is also true.  It is best to prevent further client-to-server communication before trying to stop the various services.  This way, if a network client or peer notices a problem, it will be a low level IP failure, which can be retried and recovered smoothly when the services come up on a new node.  Without stopping IP first (and restarting it last), client systems using these resources are more likely to experience higher-level failures (such as at the NFS level, or in the application code).  In some cases, such higher-level failures could be fatal and nonrecoverable.

The original case which led to the investigation of these needs and the writing of this document involved software from TIBCO.  That software was very sensitive to any loss of locks, and required all the design choices discussed in this document.  Since then, various other programs have also been found to be sensitive to these factors, and need this configuration as well.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020396
  • Creation Date: 28-Sep-2023
  • Modified Date: 28-Sep-2023
    • SUSE Linux Enterprise High Availability Extension
    • SUSE Linux Enterprise Server
