corosync[35436]: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22).

This document (7022316) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 11 Service Pack 4
SUSE Linux Enterprise High Availability Extension 12

Situation

The following message appears in /var/log/messages on a Pacemaker cluster node:

Nov 2 16:05:41 sapnode1 corosync[35436]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.

After enabling debug logging for corosync, the following messages appear:

Nov 2 16:17:16 sapnode1 corosync[35436]: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
Nov 2 16:17:16 sapnode1 corosync[35436]: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
Nov 2 16:17:16 sapnode1 corosync[35436]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
Nov 2 16:17:16 sapnode1 corosync[35436]: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
Nov 2 16:17:16 sapnode1 corosync[35436]: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)
Nov 2 16:17:16 sapnode1 corosync[35436]: [TOTEM ] sendmsg(mcast) failed (non-critical): Invalid argument (22)

The cluster then refuses to start.

Resolution

Once it is certain that no firewall is blocking the cluster traffic, check the bindnetaddr setting in

   /etc/corosync/corosync.conf

and verify that it is the network address that results from the IP address and netmask of the interface corosync should bind to.

An example of an invalid configuration:

corosync.conf :

    interface {
        bindnetaddr:    6.101.0.0

while the interface is configured with netmask 255.0.0.0 (/8):

/bin/ip a
[...]
14: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 38:63:bb:2b:7f:94 brd ff:ff:ff:ff:ff:ff
    inet 6.101.35.3/8 brd 6.255.255.255 scope global bond0

With a /8 netmask, the network address of 6.101.35.3 is 6.0.0.0, so the valid bindnetaddr in this case is 6.0.0.0.
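
The network address can be cross-checked with a short script. The following is a minimal sketch in Python that uses only the standard ipaddress module; the helper name expected_bindnetaddr is hypothetical and not part of corosync or any SUSE tool. It takes the address and prefix length as printed by ip a and returns the corresponding network address.

```python
import ipaddress


def expected_bindnetaddr(cidr: str) -> str:
    """Return the network address for an interface address such as '6.101.35.3/8'.

    Hypothetical helper for illustration only; not part of corosync.
    """
    # ip_interface accepts an address with host bits set and exposes its network.
    iface = ipaddress.ip_interface(cidr)
    return str(iface.network.network_address)


if __name__ == "__main__":
    # Address and prefix taken from the bond0 example in this document.
    print(expected_bindnetaddr("6.101.35.3/8"))  # prints 6.0.0.0
```

The same function also accepts a dotted netmask, for example "6.101.35.3/255.0.0.0".
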
After making the change, corosync needs to be restarted:

   on SLES11: rcopenais restart
   on SLES12: systemctl restart pacemaker

Cause

The bindnetaddr setting is configured incorrectly: it does not match the network address derived from the IP address and netmask of the interface.
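
A mismatch of this kind can be detected before starting the cluster. The sketch below, again in Python with only the standard library, extracts every bindnetaddr value from the text of a corosync.conf and reports those that differ from the network address of a given interface. The function name find_bindnetaddr_mismatches and the sample text are hypothetical; the sample is modeled on the invalid configuration shown in this document. The sketch assumes a single interface and, following this document, treats the network address as the expected value.

```python
import ipaddress
import re


def find_bindnetaddr_mismatches(conf_text: str, iface_cidr: str) -> list:
    """Return bindnetaddr values that differ from the interface's network address.

    Hypothetical helper for illustration only; not part of corosync.
    """
    network = ipaddress.ip_interface(iface_cidr).network.network_address
    # Match uncommented lines of the form "bindnetaddr: <address>".
    values = re.findall(r"^\s*bindnetaddr:\s*(\S+)", conf_text, flags=re.MULTILINE)
    return [v for v in values if ipaddress.ip_address(v) != network]


if __name__ == "__main__":
    # Hypothetical fragment modeled on the invalid example in this document.
    sample = """
    interface {
        bindnetaddr:    6.101.0.0
    }
    """
    print(find_bindnetaddr_mismatches(sample, "6.101.35.3/8"))  # prints ['6.101.0.0']
```

An empty list means that every bindnetaddr found equals the network address of the interface passed in.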

The problem usually occurs only after a migration from SLES 11 SP3 to SLES 11 SP4, which introduced new logic for selecting the interface.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.