NFS file system is hung. New mount attempts hang also.
This document (000019722) is provided subject to the disclaimer at the end of this document.
Network packet captures show the NFS client sending packets to port 2049 on the NFS server, but no responses are seen. Often (not always) this takes the form of a TCP SYN packet that gets no reply. In those (SYN) cases, evidence of the connection attempt may be seen on the NFS client machine with the command:
netstat -nt | grep :2049
If one or more connections show the status "SYN_SENT" for an extended period of time, then something is blocking SYN attempts.
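As a minimal sketch of the check described above, the filter below narrows the netstat output down to connections whose remote port is 2049 and whose state is SYN_SENT. The function name nfs_syn_sent is only illustrative, and the field positions assume the usual column layout of "netstat -nt" (local address in column 4, foreign address in column 5, state in column 6).

```shell
# Read `netstat -nt` output on stdin and print only the connections
# to remote port 2049 that are still in the SYN_SENT state.
# (nfs_syn_sent is a hypothetical helper name, not part of any package.)
nfs_syn_sent() {
    awk '$5 ~ /:2049$/ && $6 == "SYN_SENT" { print $4 " -> " $5 }'
}

# Example use on the NFS client:
#   netstat -nt | nfs_syn_sent
```

Running it twice, some time apart, and seeing the same client port listed both times suggests that the SYN attempt is not being answered.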
There may also be occurrences of the following message in the NFS client machine's /var/log/messages:
nfs: server <hostname or address> not responding
However, packets on OTHER connections (both new and old) between the same client machine and server machine may be having no problems. This is what sets this scenario apart from most other cases of "nfs: server not responding": There is no global problem affecting ALL communication between the two devices.
NOTE: There can be many different causes of a hanging NFS mount, or of the "not responding" logs. This document discusses only one type of scenario.
For more explanation and other options, see the discussion below:
Something is blocking the individual connection which NFS is attempting to use. It is not necessarily blocking other connections between the two devices. This blockage often comes from a smart router, frontend device, or some other kind of network security device or policy.
The connection being blocked is usually a repeat of a previous connection which the NFS mount was using before it ran into some temporary trouble. As such, this attempt is commonly called "connection reuse". NFS needs to do "connection reuse" fairly often. An NFS mount will initially get established successfully on a somewhat randomly chosen connection definition. For any number of reasons that connection may eventually be interrupted. When the NFS file system is used again, the connection will need to be re-established.
As background information: A unique TCP connection is defined by 4 factors: Client IP address, client port, server IP address, server port. Some protocols use a differently-defined connection each time they make a connection (usually by varying the client's source port number), but NFS sometimes needs to use the same connection it was using before a problem came up, because of the way NFS recovery works.
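To make the four factors concrete, here is a small sketch that splits the two "address:port" endpoints of a connection (as netstat prints them) into those four values. The function name conn_tuple and the label names in its output are assumptions made for illustration only.

```shell
# Split two "address:port" endpoint strings into the four values that
# define one TCP connection. The port is taken as everything after the
# last colon, so IPv6-style addresses printed by netstat also work.
# (conn_tuple is a hypothetical helper name.)
conn_tuple() {
    local_ep=$1
    remote_ep=$2
    printf 'client_ip=%s client_port=%s server_ip=%s server_port=%s\n' \
        "${local_ep%:*}" "${local_ep##*:}" \
        "${remote_ep%:*}" "${remote_ep##*:}"
}

# Example use:
#   conn_tuple 192.168.1.10:803 192.168.1.20:2049
```

Two connection attempts count as "connection reuse" when all four values are identical, most notably when the client port is the same as before.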
Many modern smart routers or other security conscious devices will have a "smart connection reuse" feature of some kind. Many such devices treat connection reuse (if it happens within a short time frame) with suspicion and may dynamically start manipulating or blocking connections which are being reused. While it is true that some instances of connection reuse may indicate that malicious activity is going on, connection reuse is not illegal and in fact is very necessary for some use cases, such as NFS. Thus, blocking connection reuse can lead to problems for NFS file systems.
The options available to get past a connection-reuse blockage are:
1. Reset the smart router (or other device) which is blocking the connection between the NFS client and NFS server. Better yet, completely turn off the "smart connection reuse" feature of that device, so this won't happen again in the future.
2. Reboot the NFS client. To explain this need in more detail:
The underlying need is to get the NFS/RPC layers at the NFS client machine to forget the connection definition it was using previously. To do that, it must "forget" the details of this NFS mount. This is extremely difficult, because:
a. Any existing mount that is already suffering from this problem will typically not umount, because the file system is considered "busy".
b. Attempts to use umount -l (lazy umount) or -f (forced umount) are unlikely to help. Even if umount -l -f appears to succeed, it may only superficially remove the mounts from the mount list. Data about the mounts will still be held in other locations.
c. Often, multiple different NFS mounts pointing to the same NFS server will be sharing a connection. There may be more than one mount to clear away before the connection definition will be forgotten.
So in the majority of cases, if option #1 above is not possible, a reboot of the client machine will be needed.
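To get a feel for how many mounts may be involved (point c above), the following sketch counts NFS mounts per server from a mount table in /proc/mounts format. The function name nfs_mounts_by_server is hypothetical. The count only shows which mounts point at the same server; whether they really share one connection depends on the mount options and kernel in use.

```shell
# Read a mount table in /proc/mounts format on stdin and print
# "<server> <number of NFS mounts>" for each NFS server found.
# (nfs_mounts_by_server is a hypothetical helper name.)
nfs_mounts_by_server() {
    awk '$3 ~ /^nfs4?$/ {
        server = $1
        sub(/:\/.*$/, "", server)   # drop the ":/export/path" part
        count[server]++
    }
    END { for (s in count) print s, count[s] }' | sort
}

# Example use on the NFS client:
#   nfs_mounts_by_server < /proc/mounts
```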
To check for current values of certain TCP tuning, give the command:
sysctl -a | less
and then search for the following settings. The values shown below are the defaults. If they have been altered, this could be contributing to the problem, and the settings should be reset to these defaults. Permanent configuration for these is usually controlled in /etc/sysctl.conf, though other utilities (sapconf, saptune, tuned, and others) may be altering them.
net.ipv4.tcp_tw_recycle = 0
# On newer kernels, this parameter no longer exists, which yields equivalent behavior to "0".
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
# NOTE: In other problem scenarios, for some reason it may have been decided to use a value lower than the default of 15. Lowering this setting can be risky. Changes that help with one problem could cause other problems. If you feel you must set this value lower than the default of 15, it is still recommended not to decrease this value below 8. RFC 1122 recommends the length of certain timeouts, and those recommendations cannot be met without a tcp_retries2 value of at least 8.
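Instead of paging through the full sysctl output, the comparison can be sketched as a small filter. The function name check_tcp_defaults is hypothetical, and the expected values are simply the ones listed above as defaults; it assumes the usual "name = value" output format of sysctl -a.

```shell
# Read `sysctl -a` output on stdin and report any of the four settings
# discussed above whose value differs from the default listed above.
# A setting that is absent (such as tcp_tw_recycle on newer kernels)
# produces no output. (check_tcp_defaults is a hypothetical helper name.)
check_tcp_defaults() {
    awk -F' *= *' '
        BEGIN {
            want["net.ipv4.tcp_tw_recycle"] = "0"
            want["net.ipv4.tcp_tw_reuse"]   = "0"
            want["net.ipv4.tcp_retries1"]   = "3"
            want["net.ipv4.tcp_retries2"]   = "15"
        }
        ($1 in want) && $2 != want[$1] {
            printf "%s is %s, default is %s\n", $1, $2, want[$1]
        }
    '
}

# Example use on the NFS client:
#   sysctl -a 2>/dev/null | check_tcp_defaults
```

No output means none of the four settings differs from the values listed above.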
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000019722
- Creation Date: 21-Oct-2021
- Modified Date: 21-Oct-2021
- SUSE Linux Enterprise Server