Long Client hang to Cluster after failover of ERS Instance

This document (7023324) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 15
SUSE Linux Enterprise High Availability Extension 12
SUSE Linux Enterprise High Availability Extension 11

Situation

An ASCS/ERS cluster is set up, fencing is in place, and all other testing is done. Everything works as planned from an HA cluster perspective.

After one of the two nodes is shut down or destroyed (in most cases the node holding the ERS instance), the failover happens according to the configuration. From a cluster perspective everything appears to be correct and working, but the client connections to the SAP application that hold a lock hang for about 15 minutes.

This issue is not caused by the HA setup but by the SAP monitoring. However, it is most likely to become visible in an HA setup.

The attempts of the new ERS instance to connect to the ASCS can be seen with:
   tail -f /usr/sap/<SID>/ASCS<Instancenumber>/work/dev_enqrepl
on the server carrying the ASCS instance.

To our knowledge this issue only occurs on ENSA1 (ENSA = Standalone Enqueue Server) setups.

Resolution

The core issue is in SAP's handling of the situation, but as a workaround, the TCP level connection timeout can be decreased by lowering the value of:

    /proc/sys/net/ipv4/tcp_retries2

which can also be persistently stored in /etc/sysctl.conf as:

    net.ipv4.tcp_retries2=n

Where "n" should be replaced with a value lower than the default of 15.   It is suspected that a value of 8 or 9 would be sufficient to work around this issue.  It is not recommended to make this value any lower than is absolutely necessary to avoid the problem.

After altering that file, it can be activated with:

    sysctl -p
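
The persistent change described above could be scripted roughly as follows. This is only a sketch: the function name persist_sysctl is hypothetical, it assumes GNU sed (as shipped with SUSE Linux Enterprise), and the value 8 is just the example value discussed above, not a tested recommendation.

```shell
#!/bin/bash
# Hypothetical helper (not part of any SUSE tool): set KEY to VALUE in a
# sysctl-style configuration FILE, replacing an existing assignment of KEY
# or appending a new line if KEY is not yet present.
persist_sysctl() {
    local file=$1 key=$2 value=$3
    # Escape the dots so the key is matched literally in the regular expressions.
    local key_re=${key//./\\.}
    if grep -Eq "^[[:space:]]*${key_re}[[:space:]]*=" "$file" 2>/dev/null; then
        # An assignment already exists: rewrite it in place (GNU sed -i).
        sed -i -E "s/^[[:space:]]*${key_re}[[:space:]]*=.*/${key}=${value}/" "$file"
    else
        # No assignment yet: append one.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Intended use, as root (assumption: 8 is the value chosen after testing):
#   persist_sysctl /etc/sysctl.conf net.ipv4.tcp_retries2 8 && sysctl -p
# To try a value at runtime only, without persisting it:
#   sysctl -w net.ipv4.tcp_retries2=8
```

Replacing an existing line instead of always appending avoids ending up with two conflicting assignments of the same key in the file.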

Possible side effects:

1.  This is a global value, so it can affect timeout of all TCP connections.  Modifications do carry some risks.  RFC 1122 recommends at least 100 seconds for certain timeouts, which corresponds to a tcp_retries2 value of at least 8.  A lower value might be tried but would require careful testing and monitoring for unintended consequences which may not be noticed until much later.

2.  On certain public cloud environments, infrastructure maintenance procedures rely on the VM instance keeping TCP connections for 30 seconds. In those environments tcp_retries2 should not be set lower than 8.

3.  Lowering this value may cause NFS connections to time out earlier.  This can cause NFS clients to try to reestablish the connection with the same source and destination ports, a practice often referred to as "connection reuse".  Many security-conscious devices (such as smart routers, firewalls, frontends, etc.) may treat connection reuse with suspicion and may block such activity, leading to NFS client failures.  See https://www.suse.com/support/kb/doc/?id=000019722 for more details.
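
The relation between tcp_retries2 and the resulting timeout mentioned in item 1 can be estimated with a short calculation. The sketch below assumes the usual Linux behaviour: the retransmission timeout starts at 200 ms, doubles after every unanswered retransmission, and is capped at 120 seconds. These constants and the function name tcp_retries2_timeout_ms are assumptions for illustration. The real timeout also depends on the measured round-trip time, so the result is only a hypothetical lower bound.

```shell
#!/bin/bash
# Estimate the hypothetical minimum time (in milliseconds) before the kernel
# gives up on an established TCP connection for a given tcp_retries2 value.
# Assumptions: initial RTO of 200 ms, exponential backoff, RTO capped at 120 s.
tcp_retries2_timeout_ms() {
    local retries=$1 rto=200 rto_max=120000 total=0 i
    # The initial transmission plus "retries" retransmissions each wait one RTO.
    for ((i = 0; i <= retries; i++)); do
        total=$((total + rto))
        rto=$((rto * 2))
        ((rto > rto_max)) && rto=$rto_max
    done
    echo "$total"
}

for n in 8 9 15; do
    echo "tcp_retries2=$n: $(tcp_retries2_timeout_ms "$n") ms"
done
```

Under these assumptions the default of 15 gives 924600 ms, roughly the 15 minute hang described in the Situation section. A value of 8 gives 102200 ms, just above the 100 seconds from RFC 1122, and 9 gives 204600 ms.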

Cause

This seems to be hardware and setup related, as it only happens in some environments. The issue is that the "en" replication instance on the ASCS node can only have one active replication partner but does not use a keepalive. As a result, the Enqueue process does not notice that the ERS instance has been started again. This can be checked on the ASCS node by searching for the "en" process:

    brora:~ # ss -pt | grep HA1
    ESTAB      0      0          10.162.192.139:50016       10.162.192.213:20500    users:(("en.sapHA1_ASCS0",4526,41))

and then checking for a keepalive timer:

brora:~ # ss -o | grep keep
tcp    ESTAB      0      0       10.162.192.216:ssh      10.162.192.213:36686    timer:(keepalive,113min,0)
tcp    ESTAB      0      0       10.162.192.216:rpasswd  10.162.192.191:nfs      timer:(keepalive,9.776ms,0)


No keepalive timer is listed for the connection of the "en" process.
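
The two checks can be combined into one filter. The following is a minimal sketch, assuming that ss is called with the options -t, -p and -o so that the owning process and the timer information appear on the same output line. The function name no_keepalive is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "ss -tpo" output on stdin and print only the
# connections owned by a process whose name starts with the given prefix
# and which show no keepalive timer.
no_keepalive() {
    local proc=$1
    # Process names appear as users:(("name",...)), so match on the opening quote.
    grep -F -- "\"${proc}" | grep -v -F 'timer:(keepalive'
}

# Intended use on the ASCS node, as root so that process names are visible:
#   ss -tpo | no_keepalive en.sap
```

Any line printed for the "en" process is a connection without a keepalive timer, which matches the situation described above.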

To make the Enqueue replication process time out faster and accept the replication connection from the new ERS instance, the workaround is to lower tcp_retries2 as described above.

Additional Information

sysctl.conf(5)
SAPStartSrv_basic_cluster(7)
https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7023324
  • Creation Date: 03-Sep-2018
  • Modified Date: 08-Mar-2021
    • SUSE Linux Enterprise High Availability Extension
