SLES for SAP HANA Maintenance Procedures – Part 2 (Manual Administrative Tasks, OS Reboots and Updating the OS and HANA)

This is the second part of the blog on maintenance procedures for SLES for SAP Applications running an SAP HANA workload. The first part of the blog covered the pre-maintenance checks. In this blog I am going to discuss the actual maintenance procedures, the steps involved as per best practices, cleanup procedures, etc.

We’ll cover the following maintenance procedures.

  1. Manual take-over
  2. Manual start of primary when only one node is available
  3. OS Reboots
  4. SAP HANA update
  5. Patching of Cluster Software Stack
  6. Cleanup after manual administrative activities

Manual take-over

This section details the manual take-over of the SAP HANA database. The status of the SAP HANA databases, the system replication and the Linux cluster has to be checked first. The SAP HANA resources are set into maintenance mode, an sr_takeover is performed, and the old primary is registered as the new secondary; therefore the correct secondary site name has to be used. Finally, the SAP HANA resources are given back to the Linux cluster.
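As a quick reference, the command flow detailed in the steps below is roughly:


llhana1:~ # crm resource maintenance msl_SAPHana_TST_HDB00             (step 1, as root)
tstadm@llhana1> sapcontrol -nr 00 -function StopSystem HDB             (step 2, old primary)
tstadm@llhana2> hdbnsutil -sr_takeover                                 (step 3, old secondary)
tstadm@llhana1> hdbnsutil -sr_register ...                             (step 4, register old primary as new secondary)
tstadm@llhana1> sapcontrol -nr 00 -function StartSystem HDB            (step 4)
llhana2:~ # crm resource refresh msl_SAPHana_TST_HDB00                 (step 6)
llhana2:~ # crm resource maintenance msl_SAPHana_TST_HDB00 off         (step 6)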

1. First perform the checks as mentioned in the first part of the blog. If everything looks fine, proceed to put the msl resource into maintenance mode.


llhana1:~ # crm resource maintenance msl_SAPHana_TST_HDB00
llhana1:~ # crm_mon -1r
Cluster Summary:
* Stack: corosync
* Current DC: llhana2 (version 2.0.5+20201202.ba59be712-150300.4.16.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Tue Apr 19 23:21:54 2022
* Last change:  Tue Apr 19 23:21:39 2022 by root via cibadmin on llhana1
* 2 nodes configured
* 6 resource instances configured

Node List:
* Online: [ llhana1 llhana2 ]

Full List of Resources:
* stonith-sbd (stonith:external/sbd):  Started llhana1
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable) (unmanaged):
* rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Master llhana1 (unmanaged)
* rsc_SAPHana_TST_HDB00     (ocf::suse:SAPHana):     Slave llhana2 (unmanaged)
* rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana1
llhana1:~ #

2. Stop the SAP HANA primary site


llhana1:~ # su - tstadm
tstadm@llhana1:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function StopSystem HDB

19.04.2022 23:23:33
StopSystem
OK
tstadm@llhana1:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetSystemInstanceList

19.04.2022 23:23:56
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
llhana1, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
tstadm@llhana1:/usr/sap/TST/HDB00> sapcontrol -nr 00 -function GetSystemInstanceList

19.04.2022 23:24:23
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
llhana1, 0, 50013, 50014, 0.3, HDB|HDB_WORKER, GRAY
tstadm@llhana1:/usr/sap/TST/HDB00>

Proceed only after making sure the SAP HANA primary is down. This can be ensured by checking that the value of “dispstatus” is “GRAY” and no longer “GREEN”.
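If you prefer not to re-run the command manually, one option (a small sketch, not part of the original procedure) is to poll the same sapcontrol call until the dispstatus shows GRAY:


tstadm@llhana1:/usr/sap/TST/HDB00> watch -n 10 "sapcontrol -nr 00 -function GetSystemInstanceList"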

3. Initiate the takeover on the SAP HANA secondary site


llhana2:~ # su - tstadm
tstadm@llhana2:/usr/sap/TST/HDB00> hdbnsutil -sr_takeover
done.
tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh systemReplicationStatus.py; echo RC:$?
there are no secondary sites attached

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 2
site name: TWO
RC:10
tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?
| Host    | Host   | Host   | Failover | Remove | Storage   | Storage   | Failover | Failover | NameServer | NameServer | IndexServer | IndexServer | Host   | Host   | Worker  | Worker  |
|         | Active | Status | Status   | Status | Config    | Actual    | Config   | Actual   | Config     | Actual     | Config      | Actual      | Config | Actual | Config  | Actual  |
|         |        |        |          |        | Partition | Partition | Group    | Group    | Role       | Role       | Role        | Role        | Roles  | Roles  | Groups  | Groups  |
| ------- | ------ | ------ | -------- | ------ | --------- | --------- | -------- | -------- | ---------- | ---------- | ----------- | ----------- | ------ | ------ | ------- | ------- |
| llhana2 | yes    | ok     |          |        |         1 |         1 | default  | default  | master 1   | master     | worker      | master      | worker | worker | default | default |

overall host status: ok
RC:4
tstadm@llhana2:/usr/sap/TST/HDB00>

If everything looks fine, then proceed to the next step.

4. Register the former HANA primary site, now the future secondary site, to the new primary site


tstadm@llhana1:/usr/sap/TST/HDB00> hdbnsutil -sr_register --remoteHost=llhana2 --remoteInstance=00 --replicationMode=sync --name=ONE --operationMode=logreplay
adding site ...
nameserver llhana1:30001 not responding.
collecting information ...
updating local ini files ...
done.
tstadm@llhana1:/usr/sap/TST/HDB00>sapcontrol -nr 00 -function StartSystem HDB

19.04.2022 23:38:29
StartSystem
OK
tstadm@llhana1:/usr/sap/TST/HDB00>exit
logout
llhana1:~ #

5. Check the system replication status on the new HANA primary site



tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh systemReplicationStatus.py; echo RC:$?
| Database | Host    | Port  | Service Name | Volume ID | Site ID | Site Name | Secondary | Secondary | Secondary | Secondary | Secondary     | Replication | Replication | Replication    |
|          |         |       |              |           |         |           | Host      | Port      | Site ID   | Site Name | Active Status | Mode        | Status      | Status Details |
| -------- | ------- | ----- | ------------ | --------- | ------- | --------- | --------- | --------- | --------- | --------- | ------------- | ----------- | ----------- | -------------- |
| SYSTEMDB | llhana2 | 30001 | nameserver   |         1 |       2 | TWO       | llhana1   |     30001 |         1 | ONE       | YES           | SYNC        | ACTIVE      |                |
| TST      | llhana2 | 30007 | xsengine     |         2 |       2 | TWO       | llhana1   |     30007 |         1 | ONE       | YES           | SYNC        | ACTIVE      |                |
| TST      | llhana2 | 30003 | indexserver  |         3 |       2 | TWO       | llhana1   |     30003 |         1 | ONE       | YES           | SYNC        | ACTIVE      |                |

status system replication site "1": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

mode: PRIMARY
site id: 2
site name: TWO
RC:15
tstadm@llhana2:/usr/sap/TST/HDB00> HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?
| Host    | Host   | Host   | Failover | Remove | Storage   | Storage   | Failover | Failover | NameServer | NameServer | IndexServer | IndexServer | Host   | Host   | Worker  | Worker  |
|         | Active | Status | Status   | Status | Config    | Actual    | Config   | Actual   | Config     | Actual     | Config      | Actual      | Config | Actual | Config  | Actual  |
|         |        |        |          |        | Partition | Partition | Group    | Group    | Role       | Role       | Role        | Role        | Roles  | Roles  | Groups  | Groups  |
| ------- | ------ | ------ | -------- | ------ | --------- | --------- | -------- | -------- | ---------- | ---------- | ----------- | ----------- | ------ | ------ | ------- | ------- |
| llhana2 | yes    | ok     |          |        |         1 |         1 | default  | default  | master 1   | master     | worker      | master      | worker | worker | default | default |

overall host status: ok
RC:4
tstadm@llhana2:/usr/sap/TST/HDB00> exit
logout
llhana2:~ #

If everything looks fine, then perform the next set of steps.

6. On either of the nodes, check the cluster state, refresh the msl resource, set the maintenance attribute to “off” on the msl resource and check the system replication attributes.


llhana2:~ # cs_clusterstate -i
### llhana2.lab.sk - 2022-04-19 23:42:04 ###
Cluster state: S_IDLE
llhana2:~ # crm resource refresh msl_SAPHana_TST_HDB00
Cleaned up rsc_SAPHana_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHana_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana2:~ # crm resource maintenance msl_SAPHana_TST_HDB00 off
llhana2:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Tue Apr 19 23:42:50 2022

Resource              maintenance
----------------------------------
msl_SAPHana_TST_HDB00 false

Sites srHook
-------------
ONE   SOK
TWO   PRIM

Hosts   clone_state lpa_tst_lpt maintenance node_state op_mode   remoteHost roles                            score site srmode standby sync_state version                vhost
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 DEMOTED     30          off         online     logreplay llhana2    4:S:master1:master:worker:master 100   ONE  sync   off     SOK        2.00.052.00.1599235305 llhana1
llhana2 PROMOTED    1650404570              online     logreplay llhana1    4:P:master1:master:worker:master 150   TWO  sync   off     PRIM       2.00.052.00.1599235305 llhana2

llhana2:~ #

llhana2:~ # crm_mon -1r
Cluster Summary:
* Stack: corosync
* Current DC: llhana2 (version 2.0.5+20201202.ba59be712-150300.4.16.1-2.0.5+20201202.ba59be712) - partition with quorum
* Last updated: Tue Apr 19 23:43:50 2022
* Last change:  Tue Apr 19 23:42:50 2022 by root via crm_attribute on llhana2
* 2 nodes configured
* 6 resource instances configured

Node List:
* Online: [ llhana1 llhana2 ]

Full List of Resources:
* stonith-sbd (stonith:external/sbd):  Started llhana1
* Clone Set: cln_SAPHanaTopology_TST_HDB00 [rsc_SAPHanaTopology_TST_HDB00]:
* Started: [ llhana1 llhana2 ]
* Clone Set: msl_SAPHana_TST_HDB00 [rsc_SAPHana_TST_HDB00] (promotable):
* Masters: [ llhana2 ]
* Slaves: [ llhana1 ]
* rsc_ip_TST_HDB00    (ocf::heartbeat:IPaddr2):        Started llhana2
llhana2:~ # cs_clusterstate -i
### llhana2.lab.sk - 2022-04-19 23:43:54 ###
Cluster state: S_IDLE
llhana2:~ #

Manual start of primary when only one node is available

This might be necessary in case the cluster cannot detect the status of both sites. This is an advanced task. For this section I am not sharing the command outputs; instead, I indicate which commands need to be run for each step. The hostnames and cluster node names are assumed to be “llhana1” and “llhana2”, the SID to be TST and the instance number to be 00.

Before doing this, make sure SAP HANA is not primary on the other site!

1. Start the cluster on the remaining node(s).


systemctl start pacemaker

2. Wait and check that the cluster is running and in status idle.


watch cs_clusterstate -i

3. Become sidadm and start HANA manually.


# su - tstadm
~> HDB start

4. Wait and check that HANA is running. If the cluster does not start SAP HANA, refresh the msl resource.


# crm resource refresh msl_SAPHana_TST_HDB00 llhana1
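To verify the first part of this step, that HANA is actually running, you can for example reuse the landscape status check shown earlier in this blog (run as tstadm; RC 4 together with “overall host status: ok” indicates the host is up):


tstadm@llhana1:/usr/sap/TST/HDB00> HDBSettings.sh landscapeHostConfiguration.py; echo RC:$?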

5. In case the cluster does not promote the SAP HANA to primary, instruct the cluster to migrate the IP address to that node.


# crm resource move rsc_ip_TST_HDB00 llhana1

6. Wait and check that HANA has been promoted to primary by the cluster.
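One way to verify the promotion, using tools already shown in this blog, is to check the cluster state and the system replication attributes again:


llhana1:~ # crm_mon -1r
llhana1:~ # SAPHanaSR-showAttr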
7. Remove the migration rule from the IP address.


# crm resource clear rsc_ip_TST_HDB00

8. Check that the cluster is in status idle.


watch cs_clusterstate -i

9. You are done, for now.
10. Please bring back the other node and register that SAP HANA as soon as possible. If the SAP HANA primary stays alone for too long, the log area will fill up.
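Registering the returning SAP HANA follows the same pattern as step 4 of the manual take-over above. For example, assuming llhana2 now runs the primary and site name ONE belongs to llhana1 (adjust hosts, instance number and site names to your landscape):


tstadm@llhana1:/usr/sap/TST/HDB00> hdbnsutil -sr_register --remoteHost=llhana2 --remoteInstance=00 --replicationMode=sync --name=ONE --operationMode=logreplay

Depending on your configuration, the cluster (or a manual sapcontrol StartSystem) then starts the registered secondary.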

OS Reboots

Cluster pre-checks should be done as explained in the first part of the blog before performing the steps below, and the cluster idle state must be ensured between the intermediate steps.
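For example, the idle state can be verified at any point with the command already used in this blog:


llhana1:~ # cs_clusterstate -i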

1. Disabling pacemaker on SAP HANA primary


llhana1:~ # systemctl disable pacemaker
Removed /etc/systemd/system/multi-user.target.wants/pacemaker.service.
llhana1:~ #

2. Disabling and stopping pacemaker on SAP HANA secondary


llhana2:~ # systemctl disable pacemaker
Removed /etc/systemd/system/multi-user.target.wants/pacemaker.service.
llhana2:~ #

3. Stopping cluster on SAP HANA secondary


llhana2:~ # crm cluster stop
INFO: Cluster services stopped
llhana2:~ #

– SAP HANA secondary will be stopped, secondary shows OFFLINE in crm_mon
– system replication goes SFAIL


llhana1:~ # SAPHanaSR-showAttr
Global cib-time
--------------------------------
global Wed Apr 20 10:31:18 2022

Resource              maintenance
----------------------------------
msl_SAPHana_TST_HDB00 false

Sites srHook
-------------
ONE   PRIM
TWO   SFAIL

Hosts   clone_state lpa_tst_lpt maintenance node_state op_mode   remoteHost roles                            score site srmode standby sync_state version                vhost
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
llhana1 PROMOTED    1650443478  off         online     logreplay llhana2    4:P:master1:master:worker:master 150   ONE  sync   off     PRIM       2.00.052.00.1599235305 llhana1
llhana2             10                      offline    logreplay llhana1                                           TWO  sync   off                                       llhana2

llhana1:~ #

4. Stopping cluster on SAP HANA primary


llhana1:~ # crm cluster stop
INFO: Cluster services stopped
llhana1:~ #

– SAP HANA primary will be stopped
5. Performing the OS or hardware maintenance
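What exactly is done in this step depends on your maintenance task. As an illustration only (this is an assumption, not part of the original procedure), applying OS updates and rebooting could look like this, repeated on llhana2 as needed:


llhana1:~ # zypper patch
llhana1:~ # reboot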
6. Enabling pacemaker on SAP HANA primary


llhana1:~ # systemctl enable pacemaker
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker.service → /usr/lib/systemd/system/pacemaker.service.
llhana1:~ #

7. Enabling pacemaker on SAP HANA secondary


llhana2:~ # systemctl enable pacemaker
Created symlink /etc/systemd/system/multi-user.target.wants/pacemaker.service → /usr/lib/systemd/system/pacemaker.service.
llhana2:~ #

8. Starting cluster on SAP HANA primary


llhana1:~ # crm cluster start
INFO: Cluster services started
llhana1:~ #

Since the default corosync configuration for a two-node cluster enables wait_for_all, when we start pacemaker on the primary HANA node while pacemaker on the other node is stopped, we observe that pacemaker does not start any resources and waits for the other node to become available.
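The relevant setting lives in the quorum section of /etc/corosync/corosync.conf. A typical two-node excerpt looks roughly like this (illustrative; your generated file may differ). Note that two_node: 1 implies wait_for_all unless wait_for_all is explicitly set to 0:


quorum {
        provider: corosync_votequorum
        two_node: 1
}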

9. Starting cluster on SAP HANA secondary


llhana2:~ # crm cluster start
INFO: Cluster services started
llhana2:~ #

As soon as pacemaker on the secondary SAP HANA node is started, the existing primary node in the cluster sees the secondary node come online. In a two-node cluster it is impossible to ascertain that the other node was not running any resources while the existing node was down/offline; therefore, to ensure data integrity and to be on the safe side, the cluster fences the secondary node. Once the secondary node reboots and comes back online, it synchronizes with the primary SAP HANA and the system replication status changes to SOK.
It is also important to note that if the secondary node was rebooted only recently (less than 5 minutes before the fencing was triggered) during the maintenance in step 5, you will see a message that the secondary node has rebooted due to fencing, but no actual reboot takes place.
In case it is desired that there should be no fencing in this situation, you can either:
1. Temporarily set the cluster property “startup-fencing” to “false”. Note, however, that this is not a recommended setting and should only be used by advanced users.
Or,
2. Set the SBD configuration parameter SBD_DELAY_START to “no”.
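For reference, the two settings are changed as follows (shown as a sketch; review the pacemaker and SBD documentation before changing them on a production cluster):


llhana1:~ # crm configure property startup-fencing=false

and, for the second option, set the following in /etc/sysconfig/sbd on both nodes:

SBD_DELAY_START=no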

SAP HANA update

When we need to update the SAP HANA software, we have to ask the cluster to disable the management of the multi-state (msl) resource, which in turn disables the management of the SAP HANA resource agents. The cluster will then no longer start, stop or monitor the SAP HANA database. Admins will be able to manually start and stop the SAP HANA database and perform a system replication takeover. Since the virtual IP resource will still be running and managed by the cluster, in case of any takeover the IP will automatically move to the new primary node.

1. Pre-Update Task
Set the multi-state resource into maintenance mode:


llhana1:~ # crm resource maintenance msl_SAPHana_TST_HDB00
llhana1:~ #

2. Update
The SAP HANA update procedure itself has to be followed as described in the SAP documentation.
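As an illustration only (the authoritative procedure is the SAP documentation), the update itself is typically driven by SAP’s hdblcm tool started from the extracted new software medium; the path below is an assumption:


llhana1:~ # <path-to-extracted-medium>/SAP_HANA_DATABASE/hdblcm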

3. Post-Update Task
If the roles of the nodes were changed during the maintenance activity, a resource refresh helps the cluster learn the current status of the roles.


llhana1:~ # crm resource refresh msl_SAPHana_TST_HDB00
Cleaned up rsc_SAPHana_TST_HDB00:0 on llhana1
Cleaned up rsc_SAPHana_TST_HDB00:1 on llhana2
Waiting for 2 replies from the controller.. OK
llhana1:~ #

4. At the end of the maintenance, enable cluster control of the msl resource again.


llhana1:~ # crm resource maintenance msl_SAPHana_TST_HDB00 off
llhana1:~ #

Patching of Cluster Software Stack

Regular patching of cluster nodes is important for security, bug fixes and feature enhancements. Here are some recommendations on how to plan and execute the patching maintenance activity.

  1. If the filesystem of the root (/) partition is btrfs and snapshots are enabled for it, it is recommended to take a pre-snapshot and a post-snapshot around the patching activity (see the sketch after this list).
  2. If the cluster nodes are virtual machines, it is recommended to take a snapshot of the VM before the start of the patching. If the VM has to be stopped before the snapshot, follow the steps in section “OS Reboots” to shut the node down for the snapshot.
  3. If the cluster nodes are physical machines, or a filesystem or VM snapshot is not possible, then, where available, a backup of the OS partition should be taken with a backup tool.
  4. The patching procedure should first be tested on a test machine before attempting it on a production machine. It is highly recommended that the test environment is as similar to the production environment as possible; it has been observed many times that when the two differ, patching of production behaves very differently from the test.
  5. Finally, to patch the cluster, follow the exact same steps as in section “OS Reboots” and perform the patching in step 5.
  6. In case you need to update the cluster stack from a lower version/service pack to a higher one, please follow the SLE HA documentation section titled “Upgrading your cluster to the latest product version” at https://documentation.suse.com/sle-ha/15-SP3/html/SLE-HA-all/cha-ha-migration.html
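A pre/post snapshot pair with snapper could look like this (a sketch, assuming snapper is configured for the root filesystem; on SLES with btrfs, zypper usually creates such snapshots automatically during patching):


llhana1:~ # snapper create --type pre --print-number --description "before cluster patching"
(perform the patching)
llhana1:~ # snapper create --type post --pre-number <number-printed-by-the-pre-snapshot> --description "after cluster patching"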

Cleanup after manual administrative activities

Once the maintenance activity is complete, it is recommended to run the pre-maintenance checks procedure again to ensure that the status of the cluster is as expected. In particular, I want to emphasize the following checks after the maintenance:
1. Checking status of SUSE HA cluster and SAP HANA system replication


llhana1:~ # cs_clusterstate -i
### llhana1.lab.sk - 2022-04-12 18:44:12 ###
Cluster state: S_IDLE
llhana1:~ #

2. Check for any migration constraint


llhana1:~ # crm configure show | grep cli-
llhana1:~ #

It is important to note that location constraints starting with “cli-prefer” or “cli-ban” are created when resources are moved or migrated. When moving a resource manually, an expiry can be assigned to the migration constraint. The CLI syntax is “crm resource move <resource name> <node name> <expiry-for-constraint>”. The expiry is defined in ISO 8601 duration syntax; for example “PT5M” means the constraint expires 5 minutes after its creation.
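For example, using the virtual IP resource from this blog, a move with a 5-minute expiry would look like this:


llhana1:~ # crm resource move rsc_ip_TST_HDB00 llhana1 PT5M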

If you find a migration constraint, you can remove it using the command below (replace <resource-name> with the name of the resource whose migration constraint needs to be cleared).


llhana1:~ # crm resource clear <resource-name>
INFO: Removed migration constraints for <resource-name>
llhana1:~ #

3. Check for any “maintenance” meta attribute on resource.


llhana1:~ # crm configure show | grep -B2 maintenance
params pcmk_delay_max=15s
ms msl_SAPHana_TST_HDB00 rsc_SAPHana_TST_HDB00 \
meta clone-max=2 clone-node-max=1 interleave=true maintenance=false
--
stonith-timeout=150s \
last-lrm-refresh=1651438829 \
maintenance-mode=false
llhana1:~ #

In the above example there are two results of the “grep” command; however, we are focusing on the first one, which is for the resource, and ignoring the second one, which is for the cluster, for now. If you find an entry like the one in the above example, use the command below to remove the attribute from the CIB. This helps to tidy up the cluster configuration and is a good practice after maintenance.


llhana1:~ # crm resource meta msl_SAPHana_TST_HDB00 delete maintenance
Deleted 'msl_SAPHana_TST_HDB00' option: id=msl_SAPHana_TST_HDB00-meta_attributes-maintenance name=maintenance
llhana1:~ #

 

Please also read our other blogs about #TowardsZeroDowntime.

 


Comments

  • Ajay Sandal says:

    Thanks for the detailed information.
    I have a question for OS activities, do we need to stop pacemaker, services or simply putting the cluster in maintenance mode is sufficient?

    crm configure property maintenance-mode="true"

  • Sanjeet Kumar Jha says:

    For OS activities like patching, updating etc; I have written a dedicated blog which can be accessed using this link: https://www.suse.com/c/sles-for-sap-os-patching-procedure-for-scale-up-perf-opt-hana-cluster/

    To briefly answer your question:
    1) Stopping pacemaker separately is not required as when one wishes to stop the cluster one should always use “crm cluster stop” command instead, which stops all required services including pacemaker.
    2) For OS maintenance we highly recommend to put the multi-state resource into maintenance. This is further explained in Fabian’s blog: https://www.suse.com/c/sap-hana-maintenance-suse-clusters/

    Sanjeet Kumar Jha: I am an SAP Solution Architect for High Availability at SUSE. I have over a decade of experience with SUSE high availability technologies for SAP applications.