Necessary NFS Server Cluster Design for NFS Client Lock Preservation
This document (000020396) is provided subject to the disclaimer at the end of this document.
SUSE Linux Enterprise Server 12
(1) Delays during and after failovers, including (but not limited to) the risk of fencing a node while resources are stopped.
(2) Loss of client locks during a cluster failover, or even during normal operations.
Due to delays or loss of locks, NFS client applications may receive IO errors (EIO), after which the application may or may not be skilled enough at error handling to gracefully recover, obtain a new lock, and continue its work.
This document is intended to give design recommendations for an NFS Server HA Cluster which can fail over quickly and can also truly support preservation of the lock / leasing functions of NFS v4. The requirements for full NFS v4 functionality are quite strict, but may not be necessary in cases where applications are not highly sensitive to loss of locks.
PLEASE NOTE: At this time, this document is only a high level discussion and is not attempting to be a step-by-step guide.
1. Preserving knowledge of locks at the NFS Server.
Linux kernel-based NFS Servers track v4 client state and locks within the directory path /var/lib/nfs/nfsdcltrack/. Therefore, that path should be a shared cluster resource which will move from one node to another along with the exported file systems. Without this, client locks which were held before failover cannot be reclaimed after failover. In such a case, an IO error (EIO) will likely occur. The client application (through some intelligence and some luck) may be able to handle an IO error and then obtain a NEW lock after the reclaim attempt for the old lock fails. But this can easily be thwarted if another client obtains a lock first or if the client application was not written with good error-handling code.
Having /var/lib/nfs/nfsdcltrack/ as a shared resource is usually accomplished using DRBD.
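As a quick sanity check, a node can verify that the client-tracking directory is backed by its own mounted file system rather than sitting on the local root disk. The following is a minimal Python sketch under stated assumptions: the function name is hypothetical, and the check only confirms that the path is a mount point. It does not confirm that the underlying device is DRBD or that the cluster manages it.

```python
import os

# Path named in this document for NFS v4 client state tracking.
NFSDCLTRACK_DIR = "/var/lib/nfs/nfsdcltrack"


def is_separately_mounted(path: str) -> bool:
    """Return True if `path` exists as a directory and is itself a mount point."""
    return os.path.isdir(path) and os.path.ismount(path)


if __name__ == "__main__":
    if is_separately_mounted(NFSDCLTRACK_DIR):
        print(f"{NFSDCLTRACK_DIR} is a mount point")
    else:
        # On a passive node this is expected; on the active node it suggests
        # that client state is being written to node-local storage.
        print(f"{NFSDCLTRACK_DIR} is not a separate mount on this node")
```

On the active node the check should report a mount point; on passive nodes the directory is expected to be unmounted.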
2. Allowing NFS clients to reclaim the known locks.
When an NFS v4 Server restarts (or fails over), the client's ability to reclaim existing locks depends upon a "grace period" during which those reclaims can succeed. During the grace period, clients cannot obtain new locks; they can only reclaim old locks. This grace period is only offered when the NFS Server service is freshly started. Therefore, to ensure client lock recovery after failover of an NFS Server cluster, it is mandatory that the NFS Server service (nfsserver.service) be started fresh on the new node. Conversely, the NFS Server service must already have been taken down at the old node.
This requirement might immediately cause concern for some NFS cluster administrators. Some NFS Server clusters are designed to keep the NFS Server service running on many nodes somewhat permanently and simultaneously, while individual file systems (to be exported via NFS) move around from node to node as needed. That approach allows different nodes to handle different file system exports. This allows a certain flexibility and allows a kind of "load balancing" as different nodes can simultaneously handle different NFS exports. While those goals are understandable, that approach does not support lock preservation during failovers.
This need for a fresh start of nfsserver on the new node also dictates that all NFS-exported file systems in the cluster must be co-located on one node, and must migrate together. This is because fully stopping and starting the nfsserver service is disruptive to all NFS exports that reside on a node.
3. Ordering the starting and stopping of resources correctly.
The order of operations for starting and stopping all the NFS-related resources becomes extremely important for all these requirements to function together. The necessary starting order is:
3.1 Start DRBD for the various file system volumes (all those which will be NFS exported).
3.2 Mount those file systems.
3.3 Start DRBD for the nfsdcltrack location.
3.4 Mount nfsdcltrack location.
3.5 Start nfsserver service
3.6 Export the file systems (but do not set "wait_for_leasetime_on_stop=true". This will be discussed in #4, below)
3.7 Add the virtual IP
The above order will help ensure that locks can be reclaimed after start. Since this order is also used in reverse while stopping the services at a node, it will also ensure that shutdown of resources occurs in a safe and timely manner, with fewer potential delays and fewer problems than some other cluster implementations experience.
NOTE: In theory, items 3.1-3.4 have some flexibility in their order. Steps 3.3 & 3.4 could come first, or an order such as 3.1, 3.3, 3.2, 3.4 or 3.3, 3.1, 3.4, 3.2 could be used. It is important that (overall) 3.1-3.4 be accomplished first, and then 3.5, 3.6, and 3.7 must proceed in that specific order. For some more notes about this order, see the "Additional Information" section, further below.
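The ordering rules in this section can be sketched as a small Python check. This is only an illustration: the resource names are hypothetical labels for steps 3.1 through 3.7, not real cluster resource IDs. The rule that each DRBD device must be started before its file system is mounted is an assumption inferred from the example orders given in the note above.

```python
# Hypothetical labels for steps 3.1-3.4 (flexible) and 3.5-3.7 (fixed order).
FLEXIBLE = ["drbd_exports", "mount_exports", "drbd_nfsdcltrack", "mount_nfsdcltrack"]
FIXED_TAIL = ["nfsserver", "exportfs", "virtual_ip"]
START_ORDER = FLEXIBLE + FIXED_TAIL


def is_valid_start_order(order):
    """Check a proposed start order against the constraints in section 3."""
    # Every step must appear exactly once.
    if sorted(order) != sorted(START_ORDER):
        return False
    # nfsserver, exportfs and the virtual IP must come last, in that order.
    if list(order[-3:]) != FIXED_TAIL:
        return False
    # Assumption: a DRBD device must be up before its file system is mounted.
    return (order.index("drbd_exports") < order.index("mount_exports")
            and order.index("drbd_nfsdcltrack") < order.index("mount_nfsdcltrack"))


def stop_order(start_order):
    """Stopping uses the start order in reverse (virtual IP goes first)."""
    return list(reversed(start_order))
```

For example, `stop_order(START_ORDER)` begins with `virtual_ip` and ends with `drbd_exports`, so nfsserver is stopped before any file system is unmounted.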
4. Unexporting file systems without delay.
It has been commonly recommended (but misunderstood) that exportfs primitives for clustered, NFS-shared file systems should include the option:
wait_for_leasetime_on_stop=true
THE ABOVE IS NOT RECOMMENDED, AND WILL LEAD TO UNNECESSARY DELAYS AND LOSS OF LOCKS! (Lease time will be discussed in more detail in #5, below).
This option should be removed. It will then default to false. Alternatively, as a reminder of the correct configuration, this option can be kept in place but explicitly set to "false".
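To find primitives that still enable this option, the cluster configuration text can be scanned for it. The sketch below is a hedged example: it assumes the configuration has already been saved as plain text (for instance from the output of crm configure show), the function name is hypothetical, and the sample primitive is invented purely for illustration.

```python
import re

# Matches the option set to true, with optional quotes and spacing.
_PATTERN = re.compile(r"""wait_for_leasetime_on_stop\s*=\s*["']?true\b""",
                      re.IGNORECASE)

# Invented sample configuration text, for illustration only.
SAMPLE_CONFIG = r"""primitive exportfs_work ocf:heartbeat:exportfs \
    params directory="/srv/nfs/work" clientspec="*" options="rw" fsid=100 wait_for_leasetime_on_stop=true
"""


def lines_enabling_wait(config_text):
    """Return (line_number, line) pairs where the option is set to true."""
    return [(number, line.strip())
            for number, line in enumerate(config_text.splitlines(), start=1)
            if _PATTERN.search(line)]


if __name__ == "__main__":
    for number, line in lines_enabling_wait(SAMPLE_CONFIG):
        print(f"line {number}: {line}")
```

An empty result means no line sets the option to true, which matches the recommendation in this section.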
For those who are curious about why this option exists (since it can cause delays and failures), the following explanation may prove interesting:
In implementations where nfsserver.service is not stopped as part of failover, there is another potential problem (besides the loss of locks, already mentioned). If nfsserver is not stopped, then the file systems which were exported may still have active locks on them. If locks are active on a file system, that file system is considered "busy" and cannot be dismounted. If a file system cannot be dismounted, then normal failover of the file system is impossible, and the node will be "fenced" and forced down.
The option "wait_for_leasetime_on_stop=true" was created to avoid this fencing. The theory behind it is: By first removing the virtual IP from the NFS cluster resource and then waiting out the entire NFS lease time, all the locks will time out and be cleared. Since they are cleared, the file system will no longer be "busy" and can then umount properly. Fencing the node is avoided.
HOWEVER, as already stated, this causes 2 obvious problems: (1) You must wait out the lease time, which delays the failover, and (2) all locks are lost.
Instead, if nfsserver.service is stopped and started as part of a failover, there should be no need for "wait_for_leasetime_on_stop=true". If the recommended order (above in #3) is used, nfsserver.service will be stopped before the attempt to umount the file systems. As such, all locks (from the file system's perspective) will already be cleared and umount can proceed without delay and without problem. Fencing is avoided.
With this recommended setup, even though the locks are cleared from the file system perspective, they are still known to the nfsserver service, through its records in /var/lib/nfs/nfsdcltrack/ (see #1 above). This will allow NFS clients to reclaim the locks after NFS Server is started fresh at the new node.
5. Choosing a reasonable NFS v4 Lease Time.
During normal NFS v4 operations (even without a failover occurring), NFS v4 clients need to be able to renew their state with the NFS Server. The frequency of this renewal is determined by the NFS4 "Lease Time" which is set by the NFS Server. NFS Clients learn this value from the NFS Server. On Linux NFS Servers, this value is set in /etc/sysconfig/nfs as:
90 seconds is the default, even if this value is left blank.
This is also the same NFS lease time referenced above in #4 by "wait_for_leasetime_on_stop". It is also used as the NFS v4 "Grace Time" (discussed above in #2).
Because this lease time setting can influence various timings, and because of the misunderstandings already discussed above, it has often been believed that the lease time on NFS v4 clusters should be reduced dramatically, often down to the absolute minimum allowed on Linux, which is 10 seconds. However, this is not healthy. Lease time serves an additional important function. Extremely low lease times can lead to loss of client state and locks.
To explain the danger in more detail: NFS Clients will attempt to renew their state after 66.7% of the lease time expires. Then, various events to confirm that the renewal has been successful must complete within the remaining 33.3% of the lease time. At a minimum value of 10 seconds, this means all clients will attempt to renew state every 6.67 seconds (which is somewhat chatty and uses extra bandwidth). Then that renewal must successfully complete within the remaining 3.33 seconds. Even though the majority of renewals can complete in time, it is still common for minor OS or network delays to cause this timer to expire before the renewal is completed. IO errors (EIO) and loss of locks are the common results.
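The arithmetic above can be reproduced for any candidate lease time. The following Python sketch simply applies the two-thirds / one-third split described in this section; the function name is hypothetical.

```python
def renewal_timing(lease_seconds: float):
    """Return (renew_after, completion_window) in seconds for a lease time."""
    renew_after = lease_seconds * 2 / 3    # client begins renewing at this point
    window = lease_seconds - renew_after   # time left for the renewal to complete
    return renew_after, window


if __name__ == "__main__":
    for lease in (10, 30, 60, 90):
        renew_after, window = renewal_timing(lease)
        print(f"lease={lease:>2}s  renew after {renew_after:5.2f}s  "
              f"window {window:5.2f}s")
```

For a 10-second lease this gives roughly 6.67 and 3.33 seconds, matching the figures above; for the default of 90 seconds it gives 60 and 30 seconds.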
As such, balance must be achieved between the need for rapid failover and the need to protect basic functionality of NFS v4. If lowering the lease time from its default of 90 is desired, SUSE Support recommends trying 60. Even lower values could be tested, but the lower the value, the more danger is created. Therefore, SUSE Support strongly discourages any value lower than 30. Low values may not be supportable, and SUSE may require higher values while troubleshooting problems.
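The guidance above can be expressed as a simple classification of candidate values. This is a sketch only: the thresholds are taken from this document, while the function name and the category labels are hypothetical and carry no official meaning.

```python
MIN_ALLOWED = 10         # absolute minimum allowed on Linux, per this document
DISCOURAGED_BELOW = 30   # values lower than this are strongly discouraged
SUGGESTED_REDUCED = 60   # suggested starting point when reducing the lease time
DEFAULT = 90             # default lease time in seconds


def assess_lease_time(seconds: int) -> str:
    """Classify a candidate lease time using the thresholds in this document."""
    if seconds < MIN_ALLOWED:
        return "invalid"
    if seconds < DISCOURAGED_BELOW:
        return "strongly discouraged"
    if seconds < SUGGESTED_REDUCED:
        return "test carefully"
    return "ok"


if __name__ == "__main__":
    for value in (5, 10, 30, 45, SUGGESTED_REDUCED, DEFAULT):
        print(value, "->", assess_lease_time(value))
```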
Taken together, the above recommendations should allow very rapid failover of the NFS Server resources, true lock preservation and reclamation after failover, and timely recovery by NFS Client processes.
Additional Information
In a variety of NFS Server cluster implementations, starting IP last (and stopping it first) has been seen to help avoid lock loss, node fencing, NFS stale file handles, and failover delays.
Without going into lengthy explanations for all of those variations, it is probably sufficient to say: The general philosophy for bringing up NFS (and many other clustered services) is that all services should be in place and ready to be used before adding the IP address through which other systems will reach those services. When shutting down, the reverse is also true. It is best to prevent further communication before trying to stop the various services. This way, if a network client or peer notices a problem, it will be a low level IP failure, which can be retried and recovered smoothly when the services come up on a new node. Without stopping IP first (and restarting it last), other systems using these resources are more likely to experience higher-level failures (such as at the NFS level, or in the application code) which, in some cases, could be fatal and nonrecoverable.
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020396
- Creation Date: 28-Sep-2021
- Modified Date: 28-Sep-2021
- SUSE Linux Enterprise High Availability Extension
- SUSE Linux Enterprise Server