Corosync Communication Failure

This document (000020407) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications
SUSE Linux Enterprise High Availability Extension
 

Situation

In the event of a corosync message exchange failure, the receiving node reports that an expected message was not received, while the sending node retries sending the message multiple times.

In the corosync logs, messages such as the following indicate that messages are being retransmitted:
2021-07-29 05:08:03 [node name] corosync: message repeated 156 times: [   [TOTEM ] Retransmit List: 1be7]
2021-07-29 05:09:07  [node name] corosync: [TOTEM ] Retransmit List: 1bef
2021-07-29 05:12:20  [node name] corosync: [TOTEM ] Retransmit List: 1c01
2021-07-29 05:12:20  [node name] corosync: [TOTEM ] Retransmit List: 1c01
If the token rotates fail_recv_const times (default 2500) without the missing messages being received, a FAILED TO RECEIVE error is logged and a new membership is formed, indicating that the other member left. There is no wait time between token rotations during message retransmission, so the limit of 2500 can be reached in 1-2 seconds. The corosync logs show this condition as follows:
2021-07-29 06:31:44 [node name] corosync: [TOTEM ] FAILED TO RECEIVE
2021-07-29 06:31:44 [node name] corosync: [TOTEM ] A new membership (10.52.34.138:1416) was formed. Members left: 1
2021-07-29 06:31:44 [node name] corosync: [TOTEM ] Failed to receive the leave message. failed: 1
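
To gauge how often retransmissions occur before a membership change, the log lines can be tallied with a short script. The following is a minimal sketch, not a SUSE-provided tool: the regular expressions are modelled only on the sample lines shown above, and the function name summarize is hypothetical. Real log prefixes may differ and the patterns may need adjusting.

```python
import re
from collections import Counter

# Patterns modelled on the sample log lines above (assumption: real logs match them).
RETRANSMIT = re.compile(r"\[TOTEM \] Retransmit List: (?P<seqs>[0-9a-f ]+)")
REPEATED = re.compile(r"message repeated (?P<count>\d+) times")
FAILED = re.compile(r"\[TOTEM \] FAILED TO RECEIVE")


def summarize(lines):
    """Tally retransmit entries per sequence number and collect FAILED TO RECEIVE lines."""
    retransmits = Counter()
    failures = []
    for line in lines:
        match = RETRANSMIT.search(line)
        if match:
            # Weight the entry by the "message repeated N times" count when present.
            repeat = REPEATED.search(line)
            weight = int(repeat.group("count")) if repeat else 1
            for seq in match.group("seqs").split():
                retransmits[seq] += weight
        elif FAILED.search(line):
            failures.append(line.strip())
    return retransmits, failures
```

Calling summarize(open(path)) on a log file returns the per-sequence counts and the list of FAILED TO RECEIVE lines. A sequence number with a rapidly growing count shortly before a FAILED TO RECEIVE entry matches the pattern described in this section.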

 

Resolution

A new corosync parameter, cancel_token_hold_on_retransmit (default value 0), has been introduced. With the default value, the token is held when the number of retransmissions is too high. This adds wait time to message retransmission and significantly reduces the likelihood of this situation occurring.
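
A rough piece of arithmetic illustrates why added wait time helps. The sketch below assumes, purely for illustration, that holding the token adds a fixed delay to every rotation. The 10 ms hold value and the function name seconds_to_failure are hypothetical and are not taken from corosync; only the default of 2500 and the 1-2 second figure come from this document.

```python
FAIL_RECV_CONST = 2500  # default stated in this document


def seconds_to_failure(rotation_s, hold_s=0.0, limit=FAIL_RECV_CONST):
    """Seconds until `limit` token rotations have elapsed, with an optional hold per rotation."""
    return limit * (rotation_s + hold_s)


# Rotation times implied by "2500 in 1-2 seconds": 1/2500 s and 2/2500 s.
fast = 1.0 / FAIL_RECV_CONST
slow = 2.0 / FAIL_RECV_CONST

HYPOTHETICAL_HOLD_S = 0.010  # assumed 10 ms hold per rotation, illustration only

for rotation in (fast, slow):
    without_hold = seconds_to_failure(rotation)
    with_hold = seconds_to_failure(rotation, HYPOTHETICAL_HOLD_S)
    print(
        f"rotation {rotation * 1000:.1f} ms: "
        f"{without_hold:.1f} s without hold, {with_hold:.1f} s with hold"
    )
```

Under these assumed numbers, the window before FAILED TO RECEIVE grows from 1-2 seconds to roughly 26-27 seconds, which gives a short network delay more time to clear. The actual hold time depends on the corosync configuration.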

A maintenance update has been released with a fix.  Update corosync to the latest version or ensure the corosync version is at the following version or higher:
 

SUSE version      Corosync version
SLE 12 all SPs    corosync-2.3.6-9.19.1
SLE 15            corosync-2.4.5-5.13.1
SLE 15 SP1        corosync-2.4.5-9.16.1
SLE 15 SP2        corosync-2.4.5-10.14.6.1
SLE 15 SP3        corosync-2.4.5-12.3.1
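
The comparison against the minimum versions listed above can be sketched in a few lines. This is a simplified illustration, not a replacement for the package manager: it assumes the installed package is given as a plain name-version-release string with numeric, dot-separated fields and no architecture suffix, and it compares those fields as integers. Real RPM version comparison has additional rules. The function names version_key and has_fix are hypothetical.

```python
import re

# Minimum fixed package versions, copied from the table in this document.
MINIMUM_FIXED = {
    "SLE 12 all SPs": "corosync-2.3.6-9.19.1",
    "SLE 15": "corosync-2.4.5-5.13.1",
    "SLE 15 SP1": "corosync-2.4.5-9.16.1",
    "SLE 15 SP2": "corosync-2.4.5-10.14.6.1",
    "SLE 15 SP3": "corosync-2.4.5-12.3.1",
}


def version_key(package):
    """Turn 'corosync-2.4.5-12.3.1' into ((2, 4, 5), (12, 3, 1)) for comparison."""
    match = re.fullmatch(r"corosync-([0-9.]+)-([0-9.]+)", package)
    if match is None:
        raise ValueError(f"unexpected package string: {package!r}")
    version, release = match.groups()
    return (
        tuple(int(part) for part in version.split(".")),
        tuple(int(part) for part in release.split(".")),
    )


def has_fix(installed, sle_release):
    """True if `installed` is at or above the minimum fixed version for `sle_release`."""
    return version_key(installed) >= version_key(MINIMUM_FIXED[sle_release])
```

For example, has_fix("corosync-2.4.5-12.3.1", "SLE 15 SP3") evaluates to True. Release numbers are only comparable within the same SUSE version, so the installed package should be checked against the row for its own release.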

Cause

The situation may occur during a short network glitch or an antivirus scan that delays the transmission of corosync message packets.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020407
  • Creation Date: 06-Oct-2021
  • Modified Date: 06-Oct-2021
    • SUSE Linux Enterprise High Availability Extension
    • SUSE Linux Enterprise Server for SAP Applications
