SUSE Linux Enterprise Server 12
SUSE Linux Enterprise Server 12 Service Pack 1 (SLES 12 SP1)
SUSE Linux Enterprise Server 12 Service Pack 2 (SLES 12 SP2)
SUSE Linux Enterprise Server 12 Service Pack 3 (SLES 12 SP3)
SUSE Linux Enterprise Server 15
The situation happens during a high-throughput network operation such as a database copy or network backup. It is also seen on networks that are experiencing a high rate of packet loss.
On the SLES 12 system the settings for SACK and/or TCP_TIMESTAMPS have been disabled, or alternatively, on the other end of the TCP conversation those settings are disabled.
Under high load the network encounters packet loss or out-of-order packets, which triggers the TCP congestion control algorithms. Because SACK and/or TCP_TIMESTAMPS are disabled, the SLES 12 server falls back to older congestion control algorithms. In certain situations the round trip time (RTT) cannot be calculated properly, which causes the sending machine to delay retransmissions. In some cases the application will time out and fail. In others it will recover and begin to work again, but only after a long delay.
Enable the TCP SACK and TCP Timestamps settings on both the server and the client. The fix is only effective when both ends of the connection have SACK and TCP Timestamps enabled.
Check the current values on the SLES side, for example with "sysctl net.ipv4.tcp_sack net.ipv4.tcp_timestamps".
Both should be set to 1.
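For scripted checks across several machines, the same test can be sketched as a small shell function. This is only an illustration: the function name check_tcp_settings and its optional directory argument are assumptions made for this example, and it reads the values from the standard procfs location rather than calling sysctl.

```shell
#!/bin/sh
# Report whether SACK and TCP timestamps are set to 1.
# The optional argument is the directory holding the tunables. It defaults to
# the standard procfs location and exists mainly to make the function testable.
check_tcp_settings() {
    dir="${1:-/proc/sys/net/ipv4}"
    status=0
    for name in tcp_sack tcp_timestamps; do
        if [ ! -r "$dir/$name" ]; then
            echo "net.ipv4.$name: cannot read $dir/$name"
            status=1
            continue
        fi
        value="$(cat "$dir/$name")"
        if [ "$value" = "1" ]; then
            echo "net.ipv4.$name = $value (ok)"
        else
            echo "net.ipv4.$name = $value (should be 1)"
            status=1
        fi
    done
    # Exit status 0 only when both settings are 1.
    return "$status"
}
```

Calling check_tcp_settings with no argument inspects the running system. A non-zero exit status means at least one of the two settings is not 1 or could not be read.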
The setting of "1" is actually the system default, but can be explicitly set in the /etc/sysctl.conf file:
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
After setting both values to 1, save the file and run "sysctl -p" as root to load it.
This puts the change into effect for TCP connections that are formed after the change, but existing TCP connections will continue to use the previous settings. For full implementation of the change, it may be necessary to tear down connections (for example, umount an NFS mount and remount it). In some cases, it may be faster and easier to reboot than to track down and clear all affected connections.
One more subtle note: These settings default to 1, so some may question why the above recommendation is to set them to 1 rather than just remove the previous settings of 0. This is because "sysctl -p" will only set things which are explicitly present in the sysctl.conf file. It will not return something to its default merely because the setting is no longer in the conf file. Certainly, if a full reboot is going to take place, mere removal of all mention of these settings will return the system to normal (i.e. return these settings to the default of "1") upon that reboot.
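Because the values must be explicitly present as 1 for "sysctl -p" to apply them, the edit to the conf file can be sketched as follows. This is a minimal illustration, not a SUSE-provided tool: the function name set_sysctl_to_one is hypothetical, and the optional second argument (the conf file path) is added here only so the function can be tried on a copy of the file first.

```shell
#!/bin/sh
# Ensure "key = 1" is explicitly present in a sysctl configuration file.
# Any existing assignment of the key (for example "key = 0") is dropped and a
# single "key = 1" line is appended. All other lines are kept unchanged.
# The file defaults to /etc/sysctl.conf, which requires root to modify.
set_sysctl_to_one() {
    key="$1"
    conf="${2:-/etc/sysctl.conf}"
    # Escape the dots so they match literally in the regular expression.
    pattern="$(printf '%s' "$key" | sed 's/\./\\./g')"
    tmp="$(mktemp)" || return 1
    [ -f "$conf" ] || : > "$conf"
    # Copy every line that does not assign this key.
    grep -v -E "^[[:space:]]*${pattern}[[:space:]]*=" "$conf" > "$tmp" || true
    echo "$key = 1" >> "$tmp"
    # Write back through the original file to keep its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

A typical use would be to call set_sysctl_to_one once for net.ipv4.tcp_sack and once for net.ipv4.tcp_timestamps, and then run "sysctl -p" as root so the new values take effect for new connections.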
Changes are being checked into kernel code to avoid this problem, even when SACK and TCP Timestamps are disabled. However, Linux development still strongly recommends that SACK and TCP Timestamps always be enabled.
SLE15 SP1 should already have the fix in its first shipping kernel.
SLE15 SP0 contains the fix in kernel 4.12.14-25.16 and higher.
SLE12 SP4 already contains the fix in its first shipping kernel.
SLE12 SP3 has the fix in kernel 4.4.143-94.47 and higher.
SLE12 SP2 LTSS has the fix in kernel 4.4.121-92.101 and higher. (Without LTSS entitlement, this fix is not available on SP2.)
SLE12 SP1 LTSS has the fix in kernel 3.12.74-60.64.110 and higher. (Without LTSS entitlement, this fix is not available on SP1.)
SLE12 SP0 LTSS has the fix in kernel 3.12.61-52.149 and higher. (Without LTSS entitlement, this fix is not available on SP0.)
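As a rough cross-check against the versions listed above, the running kernel version can be compared with the minimum fixed version for the installed service pack. The sketch below assumes that "uname -r" reports the version in the same form as the list, followed by a flavor suffix such as "-default", and that "sort -V" (GNU coreutils version sort) is available. The function names version_ge and strip_flavor are made up for this example.

```shell
#!/bin/sh
# True (exit status 0) when version $1 is greater than or equal to version $2.
# Relies on "sort -V", which orders version strings numerically per component.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip a trailing flavor suffix such as "-default" from "uname -r" output.
strip_flavor() {
    printf '%s\n' "$1" | sed 's/-[a-z][a-z0-9_]*$//'
}

# Example for SLE12 SP3 (use the minimum version that matches the service pack):
#   running="$(strip_flavor "$(uname -r)")"
#   if version_ge "$running" 4.4.143-94.47; then
#       echo "kernel $running should contain the fix"
#   else
#       echo "kernel $running is older than the fixed kernel"
#   fi
```

The minimum version must be chosen per service pack from the list, since a comparison across different service packs is not meaningful. The changelog check remains the more reliable confirmation.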
The phrase that should be present in an official maintenance kernel package changelog, if the fix is present, is:
prevent bogus FRTO undos with non-SACK flows
So, for example, the currently installed kernel can be checked with:
rpm -q --changelog kernel-default | grep "non-SACK"
(where "kernel-default" might have to be altered to reflect the actual kernel package in use.)
Alternatively, a kernel package file which is in the local directory, but which is not installed, could be checked with:
rpm -qp --changelog kernel-file.rpm | grep "non-SACK"
(where "kernel-file.rpm" must be replaced by the actual name of the kernel rpm file.)
This issue does not affect SLES 11, as the code involved was introduced in general Linux kernel 3.10.
This document is primarily meant to deal with one known bug, but there can be other reasons for communication delays. Even with the fixed kernel in place, communication problems may exist and it may improve recovery speed to have SACK and timestamps enabled. Alternatively, even with SACK and timestamps enabled, long delays may continue to exist. This document may not fit or resolve every similar condition. For example, SUSE has seen 3rd party servers or NIC driver flaws cause similar delays. SUSE does not plan to expand this TID for all possible causes of similar symptoms. If these types of symptoms continue even after taking the steps in this TID, then other TIDs may help, or opening a Service Request with SUSE Technical Support may be warranted.
2018-May-7 - Sascha Wehnert - Fixed formatting
2018-Nov-30 - Darcy Partridge - Expanded some topics, reorganized slightly. Added info about kernel fixes available or upcoming.
2018-Dec-4 - Darcy Partridge - Added more info about kernel versions with fix, including SLES 15.
2019-Apr-18 - Darcy Partridge - Added "Additional Information" about other causes.