Unexpected ERS Cluster Resource Failure post ASCS Failover in SAP NetWeaver ENSA1 Architecture

This document (000021333) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications 15 SP5
SUSE Linux Enterprise Server for SAP Applications 15 SP4
SUSE Linux Enterprise Server for SAP Applications 15 SP3
SUSE Linux Enterprise Server for SAP Applications 15 SP2
SUSE Linux Enterprise Server for SAP Applications 12 SP5

Situation

In an ENSA1 SAP NetWeaver cluster environment, a monitor operation failure was reported by the ERS (Enqueue Replication Server) cluster resource after a successful failover of the ASCS instance (Enqueue Server). Typically, the cluster initiates a failover of the ASCS resource automatically in response to a resource failure. To illustrate this specific case, however, a manual failover was executed, moving the ASCS resource group to the node where the ERS was running, using the command:

# sapcontrol -nr 10 -function HAFailoverToNode "node_name"

Following the manual relocation of the ASCS instance and the successful loading of the ENQ Lock Table from the ERS instance, the ERS cluster resource encountered a cluster monitor operation failure. This event is evidenced both in the system logs and in the output of the "crm_mon -r -1" command:

# /usr/sbin/crm_mon -r -1

Failed Resource Actions:
  * rsc_sap_SY1_ERS11_monitor_11000 on cl1-sap-02-vm 'not running' (7): call=128, status='complete', last-rc-change='Wed Nov 15 08:41:59 2023', queued=0ms, exec=0ms
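
To pick this pattern out of a longer status output, the failed-action lines can be filtered on the monitor operation and return code 7. The following is a minimal sketch only: the function name list_not_running_monitors is hypothetical, and it assumes the line layout shown above, which may differ between Pacemaker versions.

```shell
# Hypothetical helper (not part of any SUSE or SAP tool).
# Reads "crm_mon -r -1" output on stdin and prints "<resource> <node>"
# for every failed monitor action with return code 7 ('not running').
# Assumes the layout:
#   * <rsc>_monitor_<interval> on <node> 'not running' (7): ...
list_not_running_monitors() {
    sed -n "s/^[[:space:]]*\* \(.*\)_monitor_[0-9][0-9]* on \([^ ]*\) 'not running' (7):.*/\1 \2/p"
}
```

On a cluster node it could be used as: crm_mon -r -1 | list_not_running_monitors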

Resolution

This is not a cause for concern: the observed behavior is the expected behavior of Enqueue Replication Server version 1 in the ENSA1 architecture. In ENSA1, the ERS is designed to shut down (stop) once the Enqueue Server has attached the replicated ENQ Lock Table (using local shared memory mechanisms). When the next cluster monitor operation finds the ERS instance stopped, it reports a failure for the ERS resource. The cluster then automatically recovers the ERS resource group by moving it to the other node. After such an event, it is recommended to reset the ERS resource failcount by executing the following command:

# crm resource refresh rsc_sap_SY1_ERS11
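
If the refresh should only run when a failed monitor action is actually recorded, the check and the refresh can be combined. This is a sketch under assumptions: the wrapper name refresh_if_monitor_failed is hypothetical, the resource name is passed as an argument, and the match relies on the failed-action line layout shown in the Situation section.

```shell
# Hypothetical wrapper (an illustration, not a SUSE-provided tool).
# Refreshes the given resource only if crm_mon still lists a failed
# monitor action with 'not running' for it.
refresh_if_monitor_failed() {
    local rsc="$1"
    if crm_mon -r -1 | grep -q -- "\* ${rsc}_monitor_[0-9]* on .* 'not running'"; then
        crm resource refresh "$rsc"
    else
        echo "no failed monitor action recorded for $rsc"
    fi
}
```

Example call, using the resource name from this document: refresh_if_monitor_failed rsc_sap_SY1_ERS11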

Cause

In the ENSA1 approach, following an ASCS failover to the node where the ERS is running, the Enqueue Server (ASCS) locally attaches a copy of the ENQ Lock Table from the ERS instance using shared memory mechanisms. Subsequently, by design, the ERS instance shuts down. This sequence of actions is built into Enqueue Replication Server version 1. Because sap_suse_cluster_connector tracks only actions executed through the sapcontrol interface, the SAPInstance resource agent, which manages the ASCS and ERS instances, cannot recognize this as an intended shutdown, and the cluster monitor operation for the ERS fails. The following log trace patterns will be evident in the respective work directories of ASCS and ERS post-failover:

ASCS, after failover to the node where ERS is running: dev_enqsrv

[Thr 140394463781248] Mon Nov  7 12:16:56 2023
[Thr 140394463781248] LOG GEZ=> Server start [encllog.cpp  550]
[Thr 140394463781248] Enqueue server start with instance number 00

ERS, after ASCS migration to its node: dev_enrepsrv.old

[Thr 139839211409792] Mon Nov  7 12:17:00 2023
[Thr 139839211409792] EnEnqToRep::checkEnserverAttached: the enqueue server has attached to the replicationtable
[Thr 139839211409792] checkActive: Enqueue server has attached my repl. table
[Thr 139839211409792] checkActive: shutdown ...
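
To confirm that an ERS stop was this built-in shutdown rather than a crash, the ERS trace can be searched for the two checkActive messages shown above. The following is a minimal sketch: the function name ers_shutdown_was_intended is hypothetical, and the message texts are taken from the excerpt above and may differ between kernel releases.

```shell
# Hypothetical helper (not part of the SAP or SUSE tooling).
# Returns 0 if the given ERS trace file shows that the enqueue server
# attached the replication table and the ERS then shut down on its own.
# The matched strings are copied from the trace excerpt in this document.
ers_shutdown_was_intended() {
    local trace="$1"
    grep -qF "checkActive: Enqueue server has attached my repl. table" "$trace" &&
        grep -qF "checkActive: shutdown" "$trace"
}
```

The function takes the path to dev_enrepsrv.old in the ERS work directory as its only argument; the exact location of that directory depends on the installation.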

From the perspective of an ENSA1 NW cluster, this behavior can be regarded as a cosmetic issue. Once the ASCS failover has occurred and the lock table attachment is complete, the ERS instance stops and the cluster restarts it on the other node (the one that previously hosted the ASCS resource group). This is a standard operational sequence within the ENSA1 high-availability cluster approach.

Additional Information

In contrast to the ENSA1 behavior, the newer ENSA2 architecture, which employs both Enqueue Server version 2 and Enqueue Replicator version 2, offers greater resilience and flexibility. This is achieved by replicating the ENQ Lock Table over the network instead of through local shared memory. In the ENSA2 approach, the ERS instance does not stop after successful replication/export of the Lock Table. Moreover, the ASCS instance does not have to relocate to the ERS node to attach the Lock Table; it can recover on its original node and receive the ENQ Lock Table over the network.

The ENSA2 cluster approach is applicable to SAP NW 7.52 and later versions, which include support for Standalone Enqueue Server 2 with integrated replication support through Enqueue Replicator 2. Similarly, from S/4HANA ABAP Platform 1809 onwards, Standalone Enqueue Server 2 and Enqueue Replicator 2 are installed by default for high-availability scenarios.
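
When it is unclear which generation an installed instance actually runs, the process list reported by sapcontrol can give a hint. The sketch below rests on assumptions that should be verified on the system in question: that the first comma-separated field of the GetProcessList output is the process name, and that ENSA1 reports enserver or enrepserver while ENSA2 reports enq_server or enq_replicator. The function name detect_ensa_generation is hypothetical.

```shell
# Hypothetical helper (illustration only).
# Reads "sapcontrol -nr <NN> -function GetProcessList" output on stdin
# and prints ENSA1, ENSA2, or unknown.
# Assumption: field 1 is the process name; enserver/enrepserver indicate
# ENSA1, enq_server/enq_replicator indicate ENSA2.
detect_ensa_generation() {
    awk -F', *' '
        $1 == "enq_server" || $1 == "enq_replicator" { gen = "ENSA2" }
        $1 == "enserver"   || $1 == "enrepserver"    { if (gen == "") gen = "ENSA1" }
        END { print (gen == "" ? "unknown" : gen) }
    '
}
```

Example call for the ASCS instance used in this document: sapcontrol -nr 10 -function GetProcessList | detect_ensa_generation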

For guidance on implementing ENSA2 in high-availability setups with SAP NW 7.52 and later, the following Best Practice Guide is recommended. Although primarily focused on S/4HANA ABAP Platform 1809, the guide applies to any SAP product or stack that supports both Enqueue Server 2 and Enqueue Replicator 2, because the SUSE cluster stack manages the ASCS and ERS components independently of the specific stack they are used in:
SAP S/4 HANA - Enqueue Replication 2 High Availability Cluster

For SAP NW 7.51 and earlier versions, it's important to note that, according to "SAP Note 2630416 - Support for Standalone Enqueue Server 2", these versions do not support the new Standalone Enqueue Server 2 (ENSA2) in high-availability (HA) setups. This is because SAP NW 7.51 and its predecessors continue to operate with the older Enqueue Replication Server. Therefore, for SAP NW 7.51 and earlier versions, the ENSA1 cluster architecture should be used:
SAP NetWeaver Enqueue Replication 1 High Availability Cluster

An alternative, newer approach to managing a high-availability stack for SAP NW 7.52 (ENSA2) is described in the following guide:
SAP S/4 HANA - Enqueue Replication 2 High Availability Cluster With Simple Mount

Moreover, for scenarios involving Multi-SID setups in one cluster, the following guide offers examples for both ENSA1 and ENSA2 configurations:
SAP S/4HANA and SAP NetWeaver Multi-SID Cluster Guide

Related SAP Acronyms:

ABAP     Advanced Business Application Programming
ASCS     ABAP Central Services (Message server & Enqueue Server)
ENSA1    Standalone Enqueue Server 1
ENSA2    Standalone Enqueue Server 2
ERS      Enqueue Replication Server
ERS2     Enqueue Replication Server 2
NW       NetWeaver

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.