SLES 11 SP2, SP3, SP4 -- NFS client mounts hang
This document (7014392) is provided subject to the disclaimer at the end of this document.
SUSE Linux Enterprise Server 11 Service Pack 4 (SLES 11 SP4)
SUSE Linux Enterprise Server 11 Service Pack 3 (SLES 11 SP3)
SUSE Linux Enterprise Server 11 Service Pack 2 (SLES 11 SP2)
A SLES 11 SP2, SP3, or SP4 system (typically using kernel 3.0.80-x or newer) is acting as an NFS client. In other words, it has mounted one or more file systems from remote NFS servers (mount type nfs).
The NFS-mounted file system works for a while, but after some time, any process on the client machine that tries to access the NFS mount, or to get data or statistics from it, might stall. The process is waiting for a response from the NFS client layer, and that response never comes. The NFS server is still fully functional, and other NFS clients are not necessarily affected.
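To confirm which file systems on a client are mounted with type nfs, the kernel's mount table can be inspected. The following is a minimal sketch, assuming the usual /proc/mounts layout (device, mount point, file system type, options). The function name nfs_mounts and the sample data are hypothetical and only serve as illustration.

```python
def nfs_mounts(mounts_text):
    """Return (remote_source, mount_point) pairs for mounts of type nfs."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "nfs":
            found.append((fields[0], fields[1]))
    return found


# Hypothetical sample in /proc/mounts format, for illustration only.
SAMPLE = (
    "/dev/sda2 / ext3 rw,relatime 0 0\n"
    "nfs01:/export/home /home nfs rw,vers=3,proto=tcp 0 0\n"
)

for source, target in nfs_mounts(SAMPLE):
    print(source, "mounted on", target)
```

On a real client, the contents of /proc/mounts could be read and passed to the function instead of the sample string.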
Typically, when this occurs, the nfs client will log messages to /var/log/messages which state:
kernel: nfs: server <server-name> not responding, still trying ...
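To see how often this message is logged, and for which servers, the log can be scanned for it. This is a rough sketch only: the pattern is based on the message text quoted above, the function name stalled_servers is hypothetical, and the sample log lines are invented for illustration. Real lines may carry extra fields (such as a kernel timestamp), which the pattern tolerates.

```python
import re

# Matches the message quoted above and captures the server name.
# ".*" allows for an optional kernel timestamp between "kernel:" and "nfs:".
NOT_RESPONDING = re.compile(
    r"kernel:.*\bnfs: server (\S+) not responding, still trying"
)


def stalled_servers(log_text):
    """Count 'not responding' messages per NFS server name."""
    counts = {}
    for line in log_text.splitlines():
        match = NOT_RESPONDING.search(line)
        if match:
            server = match.group(1)
            counts[server] = counts.get(server, 0) + 1
    return counts


# Invented sample lines in syslog style, for illustration only.
SAMPLE_LOG = (
    "Jan 12 10:00:01 client1 kernel: nfs: server nfs01 not responding, still trying\n"
    "Jan 12 10:00:05 client1 sshd[999]: Accepted publickey for root\n"
    "Jan 12 10:03:01 client1 kernel: [ 1234.5] nfs: server nfs01 not responding, still trying\n"
)

print(stalled_servers(SAMPLE_LOG))
```

On an affected client, the text of /var/log/messages would be passed in instead of the sample.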
This message would normally imply that the NFS client is sending requests but the NFS server is not answering. Historically, when this error occurred, the first step was to examine the TCP communication between this NFS client and the NFS server, and see whether it was breaking down during the periods when the error was occurring. However, due to a bug in some 3.0.x kernels, this error can also be logged when the problem is actually internal to the NFS/RPC code on the NFS client system, and not due to TCP or network communication problems. With this bug, the NFS client layer *believes* it is sending requests to the NFS server, but those requests are not really reaching the TCP layer or going out onto the network.
More than one issue has been identified and corrected for this symptom. To resolve the known cases, the recommendations are:
1. For a host running SLES 11 SP4, and acting as an NFS client: The SP4 kernel already contained most of the fixes for this symptom when SP4 was first released. However, the most recent bug that leads to this symptom did affect SP4 kernels as well. Update the kernel to at least 3.0.101-68.1.
2. For a host running SLES 11 SP3, and acting as an NFS client, update the kernel to at least 3.0.101-0.47.71. (Most of the common causes of this problem were fixed by kernel 3.0.101-0.21, but an additional case was found later in which sunrpc was not recovering properly from a network error, requiring the additional fix.)
3. For a host running SLES 11 SP2 (which is now out of maintenance), and acting as an NFS client:
A. If a Long Term Service Pack Support (LTSS) contract is present, update the kernel to at least 3.0.101-0.7.19.1 from the SLES 11 SP2 LTSS maintenance channel.
B. If an LTSS contract is not present, update to kernel 3.0.101-0.7.17 from the regular maintenance channel. This contains all but two of the potential fixes for this symptom. If this kernel does not correct the problem being seen, the options are to upgrade to SP3 or SP4, or to obtain an LTSS contract for SP2.
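The version checks in the recommendations above can be sketched as a small comparison helper. This is a simplified numeric comparison under stated assumptions, not a full RPM version comparison. The names MIN_FIXED, version_key, and has_fix are hypothetical. The installed version is assumed to come from the kernel package (for example via `rpm -q kernel-default`), because the string reported by `uname -r` may omit the last release component.

```python
import re

# Minimum kernel versions, copied from the recommendations above.
# Note: the non-LTSS SP2 kernel contains all but two of the potential fixes.
MIN_FIXED = {
    "SP4": "3.0.101-68.1",
    "SP3": "3.0.101-0.47.71",
    "SP2-LTSS": "3.0.101-0.7.19.1",
    "SP2": "3.0.101-0.7.17",
}


def version_key(version):
    """Turn '3.0.101-0.47.71-default' into (3, 0, 101, 0, 47, 71)."""
    # Match only the numeric version-release part; a trailing flavor
    # such as '-default' or '-xen' is ignored.
    match = re.match(r"\d+(?:\.\d+)*-\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognised kernel version: %r" % version)
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


def has_fix(installed, service_pack):
    """True if the installed kernel is at least the listed minimum."""
    return version_key(installed) >= version_key(MIN_FIXED[service_pack])


print(has_fix("3.0.101-0.21.1", "SP3"))
print(has_fix("3.0.101-0.47.71", "SP3"))
```

A result of True only means the kernel is at or above the listed minimum. It does not prove that a hang seen on that system has a different cause.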
If the kernel is already as new as (or newer than) the fixed kernels listed in this TID, do not assume that the issue being encountered is the one described in this document. Instead, investigate whether TCP communication is failing between the NFS client and the NFS server. Communication failures can be temporary, and can even affect just one TCP connection at a time, so tests with "ping" or with other applications that use TCP connections may not be conclusive. Failure of all communication would explain the NFS failure as well, but the success of other communication does not prove that NFS's own TCP communication is succeeding. Often, the specific NFS connection activity must be investigated with tcpdump.
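As an illustration of capturing only the traffic between the client and one NFS server, the sketch below assembles a tcpdump command line. It assumes NFS traffic uses the standard TCP port 2049. The helper name tcpdump_command, the server name, the interface, and the output file name are all hypothetical. Other RPC services used with NFS v3 (such as the portmapper) use different ports and are not covered by this filter.

```python
def tcpdump_command(server, interface="any", outfile="nfs-trace.pcap", port=2049):
    """Build a tcpdump argument list limited to one NFS server."""
    return [
        "tcpdump",
        "-i", interface,   # capture interface ("any" = all interfaces on Linux)
        "-s", "0",         # capture full packets rather than truncating them
        "-w", outfile,     # write raw packets to a file for later analysis
        "host", server, "and", "port", str(port),  # capture filter expression
    ]


# "nfs01" is a placeholder server name, for illustration only.
print(" ".join(tcpdump_command("nfs01")))
```

The resulting command would need to be run as root on the NFS client while the problem is occurring, and the capture stopped once the "not responding" message has been logged.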
On SLES 11 SP2, it might also be possible to avoid this symptom by downgrading the kernel to 3.0.74. This might resolve some cases of this symptom, but not others.
To the author's knowledge, this issue has only been reported by users of NFS v3. However, this may be a misleading coincidence, as the percentage of NFS v4 users is small compared to NFS v3. The code fix was made in sunrpc code, which is used by both NFS v3 and v4.
This Support Knowledgebase provides a valuable tool for NetIQ/Novell/SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.