Preventing a Fence Race in Split Brain (COROSYNC,PACEMAKER)

This document (7022467) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12

Situation

During a split brain in a two-node cluster, both nodes may fence each other. This is referred to as a "fence race". The logs of the two nodes might look like this:

Node1 sees Node2 as lost and fences it:

Nov 20 15:17:40 [117052] node1        cib:   notice: crm_update_peer_state_iter:    Node node2 state is now lost | nodeid=168364360 previous=member source=crm_update_peer_proc
Nov 20 15:17:41 [117056] node1    pengine:  warning: pe_fence_node:    Node node2 will be fenced because the node is no longer part of the cluster


Node2 sees Node1 as lost and fences it at the same time:

Nov 20 15:17:40 [16727] node2        cib:   notice: crm_update_peer_state_iter:    Node node1 state is now lost | nodeid=168364359 previous=member source=crm_update_peer_proc
Nov 20 15:17:41 [16731] node2    pengine:  warning: stage6:    Scheduling Node node1 for STONITH


As a result, both nodes fence each other. Data integrity is maintained, but all services are lost.

Resolution

The solution is to add the following parameter to one fencing device in the cluster configuration:

   pcmk_delay_max=<Seconds>

For an IPMI device, this could look like:

primitive brie_stonith_ducal stonith:external/ipmi \
        params pcmk_delay_max=20 hostname=ducal ipaddr=10.162.192.209 userid=admin passwd=xxxx interface=lanplus \
        op monitor interval=1800 timeout=20


This applies a random delay of up to the given number of seconds to that fencing device, which makes it more likely that one node fences before the other. At that moment it is irrelevant which node fences which, as a cluster without quorum has no way to determine the right node to fence.
 
This parameter can be applied to any fencing device. IPMI above is only an example.
 
This still does not guarantee that no fence race takes place. One node may have an inherent time advantage but draw a larger random delay, so that both nodes meet in the middle and fence each other.
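The remaining risk can be illustrated with a toy model. The sketch below is not how Pacemaker schedules fencing; it only assumes that one node fires its fence request after a uniformly random delay between 0 and pcmk_delay_max, that the other node fires after a fixed lag (head_start), and that a fence request takes kill_time seconds to actually power off its target. All function and parameter names are hypothetical.

```python
import random


def fence_race(head_start, delay, kill_time):
    """Return True if both nodes end up fenced in this toy model.

    Assumptions (illustrative only):
    - the node using the delayed device fires its request at `delay`
    - the other node fires its request at `head_start`
    - a request needs `kill_time` seconds to power off its target
    Both requests go out if neither node is dead before it fires,
    which is the case when the two firing times differ by less
    than `kill_time`.
    """
    return abs(delay - head_start) < kill_time


def estimate_race_probability(delay_max, head_start, kill_time,
                              trials=100_000, seed=0):
    """Estimate how often a fence race still happens with a random delay."""
    rng = random.Random(seed)
    races = sum(
        fence_race(head_start, rng.uniform(0, delay_max), kill_time)
        for _ in range(trials)
    )
    return races / trials


if __name__ == "__main__":
    # Hypothetical numbers: 20 s maximum delay, 5 s lag, 2 s to power off.
    print(estimate_race_probability(delay_max=20, head_start=5, kill_time=2))
```

With these hypothetical numbers the model leaves a 4 second window inside the 20 second delay range in which both nodes are fenced, so the random delay reduces the risk but does not remove it.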
 
SLES 12 SP2 introduces another parameter,

   pcmk_delay_base=<Seconds>

described under Additional Information below, which makes it possible in some scenarios to ensure that one node survives.

Additional Information

SLES 12 SP2 introduces the parameter

   pcmk_delay_base

which sets a fixed delay for a fencing device. In a setup with two different fence devices, this can be used to ensure that one specific node survives a split brain.
 
One primitive is not delayed
 
   primitive fencing-1 stonith:fence_ipmilan \
      params pcmk_delay_base=0 ...
 
while the other one gets, as its base delay, the value of consensus from the file

   /etc/corosync/corosync.conf

converted from milliseconds to seconds. If the file contains, for example,

   consensus:      36000

then the other fence device would read
 
   primitive fencing-2 stonith:fence_ipmilan \
      params pcmk_delay_base=36 ...
 
This does not compromise the general functionality of fencing.
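The conversion from the consensus value to a pcmk_delay_base value can be sketched as follows. This is a minimal illustration, assuming that consensus is given in milliseconds as a plain "consensus: <number>" line and that rounding up to whole seconds is acceptable; the function name is hypothetical and not part of any cluster tool.

```python
import math
import re


def delay_base_from_consensus(corosync_conf_text):
    """Return a pcmk_delay_base candidate in whole seconds, or None.

    Looks for a line of the form "consensus: <milliseconds>" in the
    given corosync.conf text and rounds the value up to full seconds
    (rounding up is an assumption made for this sketch).
    """
    match = re.search(r"^\s*consensus\s*:\s*(\d+)\s*$",
                      corosync_conf_text, re.MULTILINE)
    if match is None:
        return None
    consensus_ms = int(match.group(1))
    return math.ceil(consensus_ms / 1000)


if __name__ == "__main__":
    sample = "totem {\n    consensus:      36000\n}\n"
    print("pcmk_delay_base=%s" % delay_base_from_consensus(sample))
```

If no consensus line is set explicitly, the function returns None and the value has to be determined another way.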

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7022467
  • Creation Date: 18-Dec-2017
  • Modified Date: 23-Feb-2021
    • SUSE Linux Enterprise High Availability Extension
