Discover your cluster logfiles – A journey into the past.

September 8, 2021 | By: Thomas Schlosser

Log files are very useful when the root cause of an event has to be investigated. But analyzing logs does not only mean looking for errors in the system. Most log files also contain a lot of other information. The pacemaker log file is a perfect example: besides warnings and errors, it also records all cluster changes. The trick is to know which keywords to search for.

Most Wanted – The main keyword

The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. Every change to the cluster has to be written to the CIB. As mentioned above, the good thing is that all information about these changes is also written to the pacemaker logs.

To get a list of all changes we can use the following keyword:

cib_perform_op

Choose your way – The path into more details

If we look for cib_perform_op in the pacemaker log we also see a strange path. But this is not a file system path:

Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/crm_config/cluster_property_set/ …..

The secret behind that strange path reveals itself when we take a closer look at the CIB XML file. Each part of the path corresponds to a nested XML element:

<cib ....>
  <configuration>
    <crm_config>
      <cluster_property_set ...
[...]

(the current CIB XML can be checked with the command: cibadmin -Q)

Taking a deeper look into the XML example we can also see the different properties within the cluster_property_set element:

        <nvpair name="have-watchdog" value="true" ...
        <nvpair name="cluster-name" value="hana_cluster" ...
        <nvpair name="stonith-enabled" value="true" ...
        <nvpair name="maintenance-mode" value="true" ...
        [...]

Now that we know where this path comes from, it is easy to read all this information from the logs. As an example, the line below shows that the cluster property maintenance-mode was set to true on Aug 26:

Aug 26 09:35:25 hana01 pacemaker-based [2510] (cib_perform_op) info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="maintenance-mode" value="true" id="cib-bootstrap-options-maintenance-mode"/>
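
The marker in front of the path tells us what kind of change was logged. In the sample lines of this article, "++" precedes a newly added element and "+" precedes a changed attribute. That "--" marks a removal is an assumption based on Pacemaker's diff notation, so please verify it against your own logs. The following is a minimal sketch that counts changes per marker. The function name cib_change_markers is made up for this example, and it assumes that the marker is the word directly after "info:", as in the lines above.

```shell
# Count logged CIB changes by marker (+, ++ or --).
# Assumption: the marker is the word directly after "info:" and is
# followed by a path below /cib/, as in the sample lines of this article.
# $1 = optional log file (default: the path used in this article).
cib_change_markers() {
    grep -F 'cib_perform_op' "${1:-/var/log/pacemaker/pacemaker.log}" |
        awk '{
            for (i = 1; i < NF; i++) {
                if ($i == "info:" && $(i + 2) ~ /^\/cib\//) {
                    count[$(i + 1)]++
                    break
                }
            }
        }
        END { for (m in count) print count[m], m }' |
        sort
}
```

Calling cib_change_markers without an argument reads the default log file and prints one line per marker with its count.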

Guideline into the past – A list of important sections

If we take a look at a valid CIB XML file, we can easily create an overview of the different paths.

/cib/configuration/resources

All resources and their groups, clones and master/slave sections.

primitive resources
group
clone
master/slave

/cib/configuration/constraints

The constraints section is all about locations, colocations and orders.

location
colocation
order

/cib/configuration/nodes/node

Information about a single node.

maintenance true/false
on an SAP HANA cluster, some important HANA status values for a single node

/cib/configuration/crm_config/cluster_property_set

Information about the cluster settings.

have-watchdog
cluster-name
stonith-enabled
maintenance-mode
[…]
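
To get a feeling for which of these sections changed most often, the paths can be counted directly from the log. This is only a sketch: the function name cib_changes_per_section is invented for this example, and it assumes that every change line contains its path in the form /cib/configuration/<section>.

```shell
# Count logged configuration changes per CIB section.
# $1 = optional log file (default: the path used in this article).
cib_changes_per_section() {
    grep -F 'cib_perform_op' "${1:-/var/log/pacemaker/pacemaker.log}" |
        grep -oE '/cib/configuration/[a-z_]+' |
        sort | uniq -c | sort -rn
}
```

Each output line shows a count followed by a section path, with the most frequently changed section first.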

A first look – The simple example

Let us try a first example. Maybe we are only interested in resources. To find only changes of “resources” we can grep for the following:

# grep cib_perform_op /var/log/pacemaker/pacemaker.log | grep "/cib/configuration/resources"
Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_ip_HA0_HDB00']/operations/op[@id='rsc_ip_HA0_HDB00-stop-0']: @timeout=10
Aug 26 09:53:06 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_exporter_HA0_HDB00']/meta_attributes[@id='rsc_exporter_HA0_HDB00-meta_attributes-0']/nvpair[@id='rsc_exporter_HA0_HDB00-meta_attributes-0-target-role']: @value=Stopped
Aug 26 09:53:56 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_exporter_HA0_HDB00']/meta_attributes[@id='rsc_exporter_HA0_HDB00-meta_attributes-0']/nvpair[@id='rsc_exporter_HA0_HDB00-meta_attributes-0-target-role']: @value=Started

The first line in this example shows a change of the stop timeout of the resource rsc_ip_HA0_HDB00 to 10.

The second and third lines show that the resource rsc_exporter_HA0_HDB00 was stopped and then started again 50 seconds later.
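
If we only care about one resource, we can filter on its id, because the id appears in the path as [@id='...']. The following is a small sketch with an invented function name, cib_resource_changes. Note that it only finds lines where the id is part of the path, so a line that adds a whole new resource element may not be matched.

```shell
# Show all logged CIB changes whose path contains the given id.
# $1 = resource id, $2 = optional log file
# (default: the path used in this article).
cib_resource_changes() {
    grep -F 'cib_perform_op' "${2:-/var/log/pacemaker/pacemaker.log}" |
        grep -F "[@id='$1']"
}
```

For the example above, the call would be: cib_resource_changes rsc_exporter_HA0_HDB00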

There is a story behind – A more fine tuned example

Let’s assume we have a two-node SAP HANA cluster and the failover did not work as expected. After some investigation we found out that the cluster property “stonith-enabled” is set to “false”. This would explain the weird behavior, because without STONITH enabled the cluster is not able to work correctly at all.

Let’s take a look into the log file. Maybe we can find out when this parameter was changed.

What do we need:

For changing the CIB: -> cib_perform_op
The section cluster property -> /cib/configuration/crm_config/cluster_property_set
We are searching for: -> stonith

The easiest way is to chain the keywords, piping each one into a new grep command.
(Of course you can do this smarter and shorter. But we want to keep it transparent.)

# grep cib_perform_op /var/log/pacemaker/pacemaker.log | grep /cib/configuration/crm_config/cluster_property_set | grep stonith
Sep 02 09:26:32 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/crm_config/cluster_property_set [@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-stonith-enabled']: @value=false

The grep command shows us that on Sep 2 at 09:26 the property stonith-enabled was set to false. Assuming the cluster was created before September, we now know that the change was most likely made manually afterwards.
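
For readers who prefer a shorter form, the three keywords can be combined into one extended regular expression. This is just one possible sketch. The function name cib_property_changes is invented for this example, and the pattern assumes that the keywords appear on the line in exactly this order.

```shell
# One grep instead of three: the keywords must appear on the same line
# in this order. $1 is used as part of the regular expression, so it
# should not contain regex metacharacters.
# $1 = property keyword, $2 = optional log file
# (default: the path used in this article).
cib_property_changes() {
    grep -E "cib_perform_op.*/cib/configuration/crm_config/cluster_property_set.*$1" \
        "${2:-/var/log/pacemaker/pacemaker.log}"
}
```

The search from our example then becomes: cib_property_changes stonith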

With some luck we can now search in the system logs for a user login at that time.

In our example we can see shortly before:

An ssh login of the user penguin:

2021-09-02T09:25:04.762826+02:00 hana01 sshd[31933]: Accepted keyboard-interactive/pam for penguin from 192.168.4.23 .....

and a sudo command to become root:

2021-09-02T09:25:42.958937+02:00 hana01 sudo: penguin : TTY=pts/1 ; PWD=/home/penguin ; USER=root ; COMMAND=/usr/bin/su

This is of course not sufficient evidence that the user really made the change, but it is very likely.
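
Such a search in the system log could look like the sketch below. Several things here are assumptions: that the system log is /var/log/messages, that its lines start with an ISO 8601 timestamp as in the two lines above, and that ssh logins and sudo calls are logged in the format shown. The function name logins_around is invented for this example.

```shell
# Show accepted ssh logins and sudo calls whose timestamp starts with
# the given prefix, e.g. "2021-09-02T09:2" for 09:20 to 09:29 on Sep 2.
# $1 = timestamp prefix (matched literally at the start of the line)
# $2 = optional log file (assumed default: /var/log/messages)
logins_around() {
    awk -v p="$1" '
        index($0, p) == 1 &&
        (/sshd\[[0-9]+\]: Accepted / || / sudo(\[[0-9]+\])?: /)
    ' "${2:-/var/log/messages}"
}
```

For the time window of our example, the call would be: logins_around 2021-09-02T09:2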

As our last example shows, there is a lot of potential in the logs if you know the right keywords. The journey into the past is not over here. Watch out for the next article in this series.
