SAP NW cluster failover due to frequent sapstartsrv restarts

This document (000020517) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications 15
SUSE Linux Enterprise Server for SAP Applications 12
 

Situation

The node running the ASCS instance frequently self-fences, and the instance fails over to the other node, because a resource monitor operation times out:
node01 pacemaker-execd[5806]:  warning:  rsc_sapinst_HA1_SCS01_monitor_11000 process (PID 23112) timed out

The logs show many sapstartsrv restarts triggered by the cluster:
node01 SAPInstance(rsc_sapinst_HA1_SCS01)[1767]: WARNING: sapstartsrv is running for instance ASCS01, that service will be killed

node01 SAPHA1_01[1990]: SAP Service SAPHA1_01 successfully started.

It appears that sapstartsrv was restarted during an ongoing monitor operation, or that the restart took longer than usual, which could have led to the resource monitor timeout:
node01 SAPInstance(rsc_sapinst_HA1_SCS01)[26602]: WARNING: sapstartsrv is running for instance ASCS01, that service will be killed

node01 pacemaker-execd[5806]:  warning:  rsc_sapinst_HA1_SCS01_monitor_11000 process (PID 23112) timed out

While sapstartsrv is not yet fully running, the cluster tries to recover the ASCS resource, which is essentially a stop followed by a start (both using sapcontrol). Because sapstartsrv does not respond, the stop operation times out:
node01 SAPInstance(rsc_sapinst_HA1_SCS01)[29105]: WARNING: sapstartsrv is not running for instance HA1-SCS01 (no UDS), it will be started now

node01 pacemaker-execd[5806]:  warning: rsc_sapinst_HA1_SCS01_stop_0     process (PID 28238) timed out
node01 pacemaker-execd[5806]:  warning:rsc_sapinst_HA1_SCS01_stop_0[29105] timed out after 600000ms

A stop operation failure is critical for the cluster because it can no longer guarantee data integrity, so it issues a fence operation:
node01 pacemaker-controld[5809]: notice: Requesting fencing (reboot) of node node01
node01 pacemaker-fenced[5805]: notice: Requesting that node02 perform 'reboot' action targeting node01
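
To check whether a node shows the pattern described in this section, the relevant messages can be counted in the system log. The following is a minimal sketch: the function names are hypothetical, the search patterns are taken from the log excerpts above, and the log file location (for example /var/log/messages) is an assumption that depends on how logging is configured on the node.

```shell
# Count how often the resource agent reported that it will kill sapstartsrv.
# $1 = path to the log file to inspect
count_sapstartsrv_kills() {
    grep -c 'sapstartsrv is running for instance .*, that service will be killed' "$1"
}

# Count resource operations (monitor, stop, ...) reported as timed out
# by pacemaker-execd.
# $1 = path to the log file to inspect
count_operation_timeouts() {
    grep -c 'pacemaker-execd.*timed out' "$1"
}

# Example use on a cluster node (log path is an assumption):
#   count_sapstartsrv_kills /var/log/messages
#   count_operation_timeouts /var/log/messages
```

A high number of "will be killed" messages at roughly the monitor interval suggests that the restart is triggered on every monitor run rather than by an occasional failure.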

Resolution

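Make sure that the instance name part of "params InstanceName=..." on the SAPInstance cluster resource (the second underscore-separated field, for example ASCS00 in HA1_ASCS00_sapha1as) matches the instance name returned by the following sapcontrol command: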
sapcontrol -nr 01 -function ParameterValue INSTANCE_NAME -format script | grep '^0 : ' | cut -d' ' -f3
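
The result of this command can be used to derive the value the resource parameter should carry. The following is a sketch only: the helper names are hypothetical, the instance number 01 and the resource name rsc_sapinst_HA1_SCS01 come from the log excerpts in this document, host01 stands in for the real host name, and the SID_InstanceName_hostname layout of the parameter is assumed from the examples in this document.

```shell
# Filter: reads `sapcontrol ... -format script` output on stdin and prints
# the instance name, using the same grep/cut pipeline as the command above.
running_instance_name() {
    grep '^0 : ' | cut -d' ' -f3
}

# Hypothetical helper: builds the InstanceName value the resource should carry,
# assuming the SID_InstanceName_hostname pattern used in this document.
# $1 = SID, $2 = running instance name, $3 = host name used in the profile name
expected_instance_param() {
    printf '%s_%s_%s\n' "$1" "$2" "$3"
}

# Example use on a cluster node (not executed here):
#   running=$(sapcontrol -nr 01 -function ParameterValue INSTANCE_NAME -format script | running_instance_name)
#   expected_instance_param HA1 "$running" host01
#   crm_resource --resource rsc_sapinst_HA1_SCS01 --get-parameter InstanceName
```

If the value printed by expected_instance_param differs from the one returned by crm_resource, the resource parameter needs to be corrected.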

An example of a correct ASCS resource configuration would be:
primitive rsc_sap_HA1_ASCS00 SAPInstance \
    operations $id=rsc_sap_HA1_ASCS00-operations \
    op monitor interval=11 timeout=60 on-fail=restart \
    params InstanceName=HA1_ASCS00_sapha1as \
     START_PROFILE="/sapmnt/HA1/profile/HA1_ASCS00_sapha1as" \
     AUTOMATIC_RECOVER=false \
    meta resource-stickiness=5000 failure-timeout=60 migration-threshold=1 \
     priority=10 

For more details, refer to the SUSE best practice guides:
https://documentation.suse.com/sbp/all/html/SAP_NW740_SLE15_SetupGuide/index.html
https://documentation.suse.com/sbp/all/html/SAP_S4HA10_SetupGuide-SLE15/index.html

Cause

The warning:
WARNING: sapstartsrv is running for instance ASCS01, that service will be killed

and the restart both originate from the check_sapstartsrv() function of the SAPInstance resource agent:
# /usr/lib/ocf/resource.d/heartbeat/SAPInstance
---------------------------------------------------------------
check_sapstartsrv() {
  local restart=0
...
  if [ ! -S /tmp/.sapstream5${InstanceNr}13 ]; then
    ocf_log warn "sapstartsrv is not running for instance $SID-$InstanceName (no UDS), it will be started now"
    restart=1
  else
    output=`$SAPCONTROL -nr $InstanceNr -function ParameterValue INSTANCE_NAME -format script`
    if [ $? -eq 0 ]
    then
      runninginst=`echo "$output" | grep '^0 : ' | cut -d' ' -f3`
      if [ "$runninginst" != "$InstanceName" ]
      then
        ocf_log warn "sapstartsrv is running for instance $runninginst, that service will be killed"
        restart=1
...

More specifically, the following command:
 sapcontrol -nr 01 -function ParameterValue INSTANCE_NAME -format script | grep '^0 : ' | cut -d' ' -f3

returns ASCS01, which the agent stores in runninginst, while $InstanceName is SCS01. The agent derives $InstanceName from the second underscore-separated field of the resource parameter: echo "$OCF_RESKEY_InstanceName" | cut -d_ -f2, which for HA1_SCS01_host01 yields SCS01.

Because ASCS01 != SCS01, the agent logs the warning referenced above and restarts sapstartsrv on every monitor operation.
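
The mismatch can be reproduced outside the cluster. The following sketch mirrors the derivation and comparison quoted from the resource agent; compare_instance_names is a hypothetical name used only for this illustration, and HA1_ASCS01_host01 is an assumed corrected parameter value for this case.

```shell
# Mirrors the comparison made in check_sapstartsrv().
# $1 = value of the InstanceName resource parameter (OCF_RESKEY_InstanceName)
# $2 = instance name reported by the running sapstartsrv
compare_instance_names() {
    # Same derivation as in the agent: second "_"-separated field.
    InstanceName=$(echo "$1" | cut -d_ -f2)
    runninginst=$2
    if [ "$runninginst" != "$InstanceName" ]; then
        echo "mismatch: running=$runninginst configured=$InstanceName"
    else
        echo "match: $InstanceName"
    fi
}

# Failing configuration from this case:
compare_instance_names "HA1_SCS01_host01" "ASCS01"    # prints: mismatch: running=ASCS01 configured=SCS01
# Assumed corrected configuration:
compare_instance_names "HA1_ASCS01_host01" "ASCS01"   # prints: match: ASCS01
```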

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020517
  • Creation Date: 25-Nov-2021
  • Modified Date: 03-Dec-2021
    • SUSE Linux Enterprise High Availability Extension
    • SUSE Linux Enterprise Server for SAP Applications

For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback@suse.com
