Discover your cluster logfiles – A journey into the past.
Log files are very useful when the root cause of an event has to be investigated. But analyzing logs does not only mean looking for errors in the system. Most log files also contain a lot of other information. The pacemaker log file is a perfect example: besides warnings and errors, it also records all cluster changes. The trick is to know which keywords to search for.
Most Wanted – The main keyword
The cluster is defined by the Cluster Information Base (CIB), which uses XML notation. Every change to the cluster has to be written to the CIB. As mentioned above, the good thing is that all information about these changes is also written to the pacemaker logs.
To get a list of all changes we can use the following keyword:
cib_perform_op
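As a minimal sketch, the search can be wrapped in a small helper that is pointed at a log file. The function name cib_changes is hypothetical, and the second sample line is invented purely to show a non-matching record; on a real system you would pass the path of your own pacemaker log.

```shell
#!/bin/sh
# cib_changes: print every CIB change record found in a Pacemaker log.
# The function name is hypothetical; pass the log path used on your system.
cib_changes() {
    grep -F '(cib_perform_op)' "$1"
}

# Demo on a throwaway file instead of a real log.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/crm_config/cluster_property_set/ …..
Aug 26 09:47:41 hana01 placeholder [1] (some_other_function) info: invented line that is not a CIB change
EOF

cib_changes "$sample"   # prints only the first sample line
```

Using grep -F treats the keyword as a fixed string, so the parentheses around cib_perform_op need no escaping.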
Choose your way – The path into more details
If we look for cib_perform_op in the pacemaker log we also see a strange path. But this is not a file system path:
Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/crm_config/cluster_property_set/ …..
The secret behind that strange path reveals itself when we take a closer look at the CIB XML file. The path mirrors the nesting of the XML elements:
<cib ....>
  <configuration>
    <crm_config>
      <cluster_property_set ...
        [...]
(the current CIB XML can be checked with the command: cibadmin -Q)
Taking a deeper look into the XML example we can also see the different properties within the cluster_property_set element:
<nvpair name="have-watchdog" value="true" ...
<nvpair name="cluster-name" value="hana_cluster" ...
<nvpair name="stonith-enabled" value="true" ...
<nvpair name="maintenance-mode" value="true" ...
[...]
Now that we know where this path comes from, it is easy to read all this information from the logs. As an example, the line below shows that the cluster property maintenance-mode was set to true on Aug 26th:
Aug 26 09:35:25 hana01 pacemaker-based [2510] (cib_perform_op) info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="maintenance-mode" value="true" id="cib-bootstrap-options-maintenance-mode"/>
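To make the link between such a log line and the XML nesting visible, the sketch below reduces each change record to its timestamp, its leading marker and the plain element path. The helper name cib_path is hypothetical, and the code assumes the whitespace-separated field layout of the sample line above. The marker ("+" or "++" in the lines shown in this article) is passed through unchanged.

```shell
#!/bin/sh
# cib_path: print "timestamp | marker | element path" for each CIB change.
# Hypothetical helper; assumes the field layout of the sample line below.
cib_path() {
    awk '{
        # Locate the (cib_perform_op) field; skip lines without it.
        for (i = 1; i <= NF; i++) if ($i == "(cib_perform_op)") break
        if (i > NF) next
        marker = $(i + 2)              # "+" or "++" in the lines shown here
        path = $(i + 3)
        sub(/:$/, "", path)            # drop the trailing colon
        gsub(/\[[^]]*\]/, "", path)    # drop [@id=...] selectors
        print $1, $2, $3, "|", marker, "|", path
    }' "$1"
}

# Demo on the log line from the article, stored in a throwaway file.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
Aug 26 09:35:25 hana01 pacemaker-based [2510] (cib_perform_op) info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="maintenance-mode" value="true" id="cib-bootstrap-options-maintenance-mode"/>
EOF

cib_path "$sample"
# Aug 26 09:35:25 | ++ | /cib/configuration/crm_config/cluster_property_set
```

With the selectors removed, the remaining path reads exactly like the element nesting of the CIB XML.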
Guideline into the past – A list of important sections
If we take a look into a valid XML file, we can easily create an overview of the different paths.
/cib/configuration/resources
All resources and their groups, clones and master/slave sections.
- primitive resources
- group
- clone
- master/slave
/cib/configuration/constraints
The constraints section is all about locations, colocations and orders.
- location
- colocation
- order
/cib/configuration/nodes/node
Information about a single node.
- maintenance true/false
- on an SAP HANA cluster, some important HANA status values of a single node
/cib/configuration/crm_config/cluster_property_set
Information about the cluster settings.
- have-watchdog
- cluster-name
- stonith-enabled
- maintenance-mode
- […]
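To see at a glance which of these sections were touched, the change records can be counted per section. The following is only a sketch: the helper name cib_section_counts is hypothetical, and it assumes that the path is the first whitespace-separated field starting with /cib/, as in the log lines shown in this article.

```shell
#!/bin/sh
# cib_section_counts: count CIB change records per top-level section.
# Hypothetical helper; $1 = log file.
cib_section_counts() {
    grep -F '(cib_perform_op)' "$1" |
    awk '{
        for (i = 1; i <= NF; i++) {
            if (index($i, "/cib/") != 1) continue
            n = split($i, part, "/")      # part[1] is empty, part[2] is "cib"
            if (n >= 4) {
                sec = part[4]
                sub(/\[.*$/, "", sec)     # strip [@id=...] selectors
                sub(/:.*$/, "", sec)      # strip a trailing ":"
                count["/cib/" part[3] "/" sec]++
            }
            break
        }
    }
    END { for (s in count) print count[s], s }' | sort -k2
}

# Demo on three log lines taken from this article.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_ip_HA0_HDB00']/operations/op[@id='rsc_ip_HA0_HDB00-stop-0']: @timeout=10
Aug 26 09:53:06 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_exporter_HA0_HDB00']/meta_attributes[@id='rsc_exporter_HA0_HDB00-meta_attributes-0']/nvpair[@id='rsc_exporter_HA0_HDB00-meta_attributes-0-target-role']: @value=Stopped
Aug 26 09:35:25 hana01 pacemaker-based [2510] (cib_perform_op) info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="maintenance-mode" value="true" id="cib-bootstrap-options-maintenance-mode"/>
EOF

cib_section_counts "$sample"
# 1 /cib/configuration/crm_config
# 2 /cib/configuration/resources
```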
A first look – The simple example
Let us try a first example. Maybe we are only interested in resources. To find only changes to resources, we can grep for the following:
# grep cib_perform_op /var/log/pacemaker/pacemaker.log | grep "/cib/configuration/resources"
Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_ip_HA0_HDB00']/operations/op[@id='rsc_ip_HA0_HDB00-stop-0']: @timeout=10
Aug 26 09:53:06 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_exporter_HA0_HDB00']/meta_attributes[@id='rsc_exporter_HA0_HDB00-meta_attributes-0']/nvpair[@id='rsc_exporter_HA0_HDB00-meta_attributes-0-target-role']: @value=Stopped
Aug 26 09:53:56 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_exporter_HA0_HDB00']/meta_attributes[@id='rsc_exporter_HA0_HDB00-meta_attributes-0']/nvpair[@id='rsc_exporter_HA0_HDB00-meta_attributes-0-target-role']: @value=Started
The first line in this example shows a change of the stop timeout of the resource rsc_ip_HA0_HDB00 to 10.
The second and third lines show that the resource rsc_exporter_HA0_HDB00 was stopped and then started again 50 seconds later.
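If only one resource is of interest, the search can be narrowed further. The sketch below lists the target-role changes of a single resource as "timestamp value". The helper name target_role_changes is hypothetical, and the code assumes that the new value is the last field of the line, as in the output above.

```shell
#!/bin/sh
# target_role_changes: list target-role changes of one resource.
# Hypothetical helper; $1 = log file, $2 = resource id.
target_role_changes() {
    grep -F '(cib_perform_op)' "$1" |
    grep -F "primitive[@id='$2']" |
    grep -F 'target-role' |
    awk '{ v = $NF; sub(/^@value=/, "", v); print $1, $2, $3, v }'
}

# Demo on the three log lines from the example above.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_ip_HA0_HDB00']/operations/op[@id='rsc_ip_HA0_HDB00-stop-0']: @timeout=10
Aug 26 09:53:06 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_exporter_HA0_HDB00']/meta_attributes[@id='rsc_exporter_HA0_HDB00-meta_attributes-0']/nvpair[@id='rsc_exporter_HA0_HDB00-meta_attributes-0-target-role']: @value=Stopped
Aug 26 09:53:56 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_exporter_HA0_HDB00']/meta_attributes[@id='rsc_exporter_HA0_HDB00-meta_attributes-0']/nvpair[@id='rsc_exporter_HA0_HDB00-meta_attributes-0-target-role']: @value=Started
EOF

target_role_changes "$sample" rsc_exporter_HA0_HDB00
# Aug 26 09:53:06 Stopped
# Aug 26 09:53:56 Started
```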
There is a story behind – A more fine tuned example
Let’s assume we have a two-node SAP HANA cluster whose failover did not work as expected. After some investigation we found out that the cluster property “stonith-enabled” is set to “false”. This would explain the weird behavior, because without STONITH enabled the cluster is not able to work correctly at all.
Let’s take a look into the log file. Maybe we can find out when this parameter was changed.
What do we need:
- A change of the CIB -> cib_perform_op
- The cluster property section -> /cib/configuration/crm_config/cluster_property_set
- The property we are searching for -> stonith
The easiest way is to chain one grep command per keyword with pipes.
(Of course you can do this smarter and shorter, but we want to keep it transparent.)
# grep cib_perform_op /var/log/pacemaker/pacemaker.log | grep /cib/configuration/crm_config/cluster_property_set | grep stonith
Sep 02 09:26:32 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-stonith-enabled']: @value=false
The grep command shows us that on Sep 2nd at 09:26 the property stonith-enabled was set to false. Assuming the cluster was created before September, we now know that the change was most likely made manually afterwards.
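One possible shorter variant of the three chained greps is a single awk call that requires all three fixed strings on the same line. This is only a sketch: the helper name cib_grep is hypothetical, and the sample file is built from log lines shown in this article.

```shell
#!/bin/sh
# cib_grep: print CIB change records that contain both a path and a word.
# Hypothetical helper; $1 = log file, $2 = CIB path, $3 = search word.
cib_grep() {
    awk -v path="$2" -v word="$3" '
        index($0, "(cib_perform_op)") && index($0, path) && index($0, word)
    ' "$1"
}

# Demo on three log lines taken from this article.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
Aug 26 09:35:25 hana01 pacemaker-based [2510] (cib_perform_op) info: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair name="maintenance-mode" value="true" id="cib-bootstrap-options-maintenance-mode"/>
Aug 26 09:47:40 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/resources/primitive[@id='rsc_ip_HA0_HDB00']/operations/op[@id='rsc_ip_HA0_HDB00-stop-0']: @timeout=10
Sep 02 09:26:32 hana01 pacemaker-based [2510] (cib_perform_op) info: + /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-stonith-enabled']: @value=false
EOF

# Prints only the Sep 02 line with stonith-enabled.
cib_grep "$sample" /cib/configuration/crm_config/cluster_property_set stonith
```

index() compares fixed strings, so characters such as "[" or "(" in the path need no escaping.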
With some luck we can now search in the system logs for a user login at that time.
In our example we can see shortly before:
An ssh login of the user penguin:
2021-09-02T09:25:04.762826+02:00 hana01 sshd[31933]: Accepted keyboard-interactive/pam for penguin from 192.168.4.23 .....
and a sudo command to become root:
2021-09-02T09:25:42.958937+02:00 hana01 sudo: penguin : TTY=pts/1 ; PWD=/home/penguin ; USER=root ; COMMAND=/usr/bin/su
Of course this is not sufficient evidence that the user really made the change, but it is very likely.
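Searching the system log around a known point in time can be sketched as a simple time-window filter. The helper name log_window is hypothetical. The code assumes that every line starts with an ISO 8601 timestamp in the same format and UTC offset, as in the ssh line above, so that plain string comparison orders the lines correctly. The second sample line is invented to show a record outside the window.

```shell
#!/bin/sh
# log_window: print lines whose first field (an ISO 8601 timestamp) lies
# between two bounds. Hypothetical helper; $1 = file, $2 = start, $3 = end.
log_window() {
    awk -v start="$2" -v end="$3" '
        # Appending "" forces a string comparison of the timestamp.
        ($1 "") >= start && ($1 "") <= end
    ' "$1"
}

# Demo on a throwaway file instead of a real system log.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
2021-09-02T09:25:04.762826+02:00 hana01 sshd[31933]: Accepted keyboard-interactive/pam for penguin from 192.168.4.23 .....
2021-09-02T11:00:00.000000+02:00 hana01 placeholder[1]: invented line outside the time window
EOF

# Everything logged in the minutes around the CIB change at 09:26.
log_window "$sample" 2021-09-02T09:20 2021-09-02T09:27
```

Only the ssh login is printed here, because the invented second line falls outside the window.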
As our last example shows, there is really a lot of potential in the logs if you know the right keywords. The journey into the past is not over here. Watch out for the next article in this series.