Cluster node fence "Corosync main process was not scheduled for 11445.4121ms"

This document (000019795) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 12 SP4
SUSE Linux Enterprise Server 12 SP3
SUSE Linux Enterprise Server 11 SP4
 

Situation

A node fence operation was triggered because the corosync process was not scheduled in time; as a result, the cluster communication token was not answered:
2020-11-23T12:01:45.993848+08:00 HA-NODE1 corosync[5134]:   [MAIN  ] Corosync main process was not scheduled for 11445.4121ms (threshold is 4000.0000 ms). Consider token timeout increase.

The above error message indicates that the corosync process was not scheduled for around 11.4 seconds, while the maximum tolerated scheduling delay is 4 seconds, which is 80% of the token timeout (here, token: 5000 in corosync.conf). Consequently, corosync did not reply to the tokens sent by the other node. When corosync is not scheduled for longer than the token timeout, the other node (HA-NODE2) declares it "lost" and removes it from the cluster, and a new membership is formed, as reported in the following logs from the second node:
2020-08-23T12:01:45.754148+08:00 HA-NODE2 corosync[5129]:   [TOTEM ] A new membership (10.41.95.15:26816) was formed. Members left: 170483470
2020-08-23T12:01:45.754695+08:00 HA-NODE2 attrd[5286]:   notice: Node HA-NODE1 state is now lost.
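The arithmetic behind the message can be checked with a short script. The following is a minimal sketch, not a SUSE-provided tool: the function name parse_pause and the regular expression are illustrative assumptions, and the 80% ratio is taken from the explanation above.

```python
import re

# Matches the "not scheduled" warning quoted above (pattern is an assumption
# based on that single log line, not an official log format specification).
PATTERN = re.compile(
    r"not scheduled for (?P<paused>[\d.]+)\s*ms "
    r"\(threshold is (?P<threshold>[\d.]+)\s*ms\)"
)


def parse_pause(line):
    """Return (paused_ms, threshold_ms) from a corosync warning, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match.group("paused")), float(match.group("threshold"))


line = (
    "2020-11-23T12:01:45.993848+08:00 HA-NODE1 corosync[5134]:   [MAIN  ] "
    "Corosync main process was not scheduled for 11445.4121ms "
    "(threshold is 4000.0000 ms). Consider token timeout increase."
)

paused_ms, threshold_ms = parse_pause(line)

# Per the text above, the threshold is 80% of the token timeout,
# so the configured token can be derived from it.
token_ms = threshold_ms / 0.8

# If the pause is longer than the token timeout, the peer node is
# expected to declare this node lost.
exceeded_token = paused_ms > token_ms

print(f"paused {paused_ms:.0f} ms, token {token_ms:.0f} ms, "
      f"token exceeded: {exceeded_token}")
```

For the log line above, the derived token timeout matches the token: 5000 setting mentioned in the text, and the pause of about 11.4 seconds is more than twice that value.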

Resolution

The initial workaround, which can be applied immediately, is to increase the corosync token and consensus timeouts on both nodes:
#/etc/corosync/corosync.conf
----------------------------
 totem {
    ...
   token:              14000
   consensus:          16000

After making the changes on both nodes, corosync must be reloaded to pick up the new settings. This can be done by running the following command on both nodes; no downtime is needed:
# corosync-cfgtool -R
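Because the change has to be identical on both nodes, it can help to compare the two configuration files before reloading. The following is a minimal sketch under stated assumptions: read_totem_timeouts is a hypothetical helper, the SAMPLE text is an invented, complete stand-in for the fragment shown above, and the simple pattern does not check that the keys sit inside the totem section.

```python
import re

# Matches "token: <n>" and "consensus: <n>" lines. Keys such as
# token_retransmits_before_loss_const are not matched, because the key
# name must be followed directly by optional whitespace and a colon.
TIMEOUT_RE = re.compile(r"^\s*(token|consensus)\s*:\s*(\d+)\s*$", re.MULTILINE)


def read_totem_timeouts(conf_text):
    """Return the token and consensus values (in ms) found in conf_text."""
    return {key: int(value) for key, value in TIMEOUT_RE.findall(conf_text)}


# Hypothetical file contents; on a real node, read /etc/corosync/corosync.conf.
SAMPLE = """\
totem {
    version: 2
    token_retransmits_before_loss_const: 10
    token:              14000
    consensus:          16000
}
"""

node1 = read_totem_timeouts(SAMPLE)
# Simulate a second node that still carries the old token value.
node2 = read_totem_timeouts(SAMPLE.replace("14000", "5000"))

print("node1:", node1)
print("node2:", node2)
print("settings match:", node1 == node2)
```

In practice, the text of each node's corosync.conf would be passed to the helper, and the reload would only be triggered once both dictionaries are equal.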

 

Cause

The root cause of this behavior is CPU resource starvation on the system at the referenced time, which caused the corosync process to wait longer to be scheduled. In the case of a virtual machine, the underlying host (hypervisor) should be checked for resource overcommitment.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019795
  • Creation Date: 25-Nov-2020
  • Modified Date: 25-Nov-2020
    • SUSE Linux Enterprise High Availability Extension
    • SUSE Linux Enterprise Server
    • SUSE Linux Enterprise Server for SAP Applications
