Setting Primary for Stacked DRBD Backup Node Causes Server Crash

This document (7022556) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12 SP2
SUSE Linux Enterprise High Availability Extension 12 SP3

Situation

In a stacked DRBD configuration, the backup node crashes when replication is enabled between the stacked node and the backup node.

This has been seen when setting the stacked node to 'primary', or when bringing up the 'stacked-on-top-of' resource on the backup node while it is also active on the stacked node and the stacked node is already set to primary. In the latter scenario, the 'backup' node can get into a crash/reboot loop.

Resolution

For a permanent resolution, apply the appropriate update:-

SUSE Linux Enterprise High Availability Extension 12 SP2 - Fix released 2018-03-22

     drbd and drbd-utils 530:
          drbd-9.0.8+git.c8bc3670-10.8.1
          drbd-kmp-default-9.0.8+git.c8bc3670_k4.4.114_92.67-10.8.1
          drbd-utils-9.0.0-8.15.1

SUSE Linux Enterprise High Availability Extension 12 SP3 - Fix released 2018-03-26

     drbd and drbd-utils 551:
          drbd-9.0.8+git.c8bc3670-3.3.2
          drbd-kmp-default-9.0.8+git.c8bc3670_k4.4.114_94.14-3.3.2
          drbd-utils-9.0.0-2.8.1
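
To see whether a node already carries the fixed packages, the installed versions can be compared by eye with those listed above. The following is a minimal sketch, assuming the three package names from the lists above and that `rpm` is available on the node; the helper name `check_drbd_packages` is hypothetical and not part of any SUSE tooling.

```shell
#!/bin/sh
# Hypothetical helper (not a SUSE tool): print the installed version of
# each DRBD package named in this document, one line per package.
check_drbd_packages() {
    for pkg in drbd drbd-kmp-default drbd-utils; do
        if command -v rpm >/dev/null 2>&1 && rpm -q "$pkg" >/dev/null 2>&1; then
            # rpm -q prints name-version-release for an installed package
            rpm -q "$pkg"
        else
            echo "$pkg: not installed (or rpm unavailable)"
        fi
    done
}

check_drbd_packages
```

If the reported versions are older than those listed for the installed service pack, the update has not yet been applied on that node.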

Cause

Suspected bug in configuration file handling.

Additional Information

If updating is not an option, the following workaround is available.

This configuration syntax is known to trigger the issue (specifically in the r0-U resource section):-

resource r0 {
   disk {
      on-io-error    pass_on;
   }
   net {
      protocol    C;
   }
   connection-mesh {
      hosts    sles12sp3s1 sles12sp3s2;
   }
   on sles12sp3s1 {
      address    192.200.2.34:7788; # address of sles12sp3s1
      device    /dev/drbd0;
      disk    /dev/sdb;
      meta-disk    internal;
      node-id    0;
   }
   on sles12sp3s2 {
      address    192.200.2.35:7788; # address of sles12sp3s2
      device    /dev/drbd0;
      disk    /dev/sdb;
      meta-disk    internal;
      node-id    1;
   }
}

resource r0-U {
   disk {
      on-io-error       pass_on;
   }
   net {
      protocol A;
   }

   stacked-on-top-of r0 {
      address 192.200.3.34:7790; # address of drbd0 clustered resource
      device /dev/drbd10;
   }

   on sles12sp3s3 {
      device /dev/drbd10;
      disk /dev/sdc;
      address 192.200.3.36:7790; # address of sles12sp3s3
      meta-disk internal;
   }
}

By reorganizing the syntax in the 'upper' resource section, it is possible to avoid the issue:-

<snip>
resource r0-U {
   disk drbd0;
   protocol A;
   device /dev/drbd10;
   meta-disk internal;
   stacked-on-top-of r0 {
      address 192.200.3.34:7790;    # address of drbd0 clustered resource
   }
   on sles12sp3s3 {
      disk /dev/sdc;
      address 192.200.3.36:7790; # address of sles12sp3s3
   }
}
</snip>
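
After editing the configuration, it may be worth confirming that it still parses before bringing any resource up. The following is a minimal sketch, assuming `drbdadm` from drbd-utils is installed; the helper name `validate_drbd_config` is hypothetical. A successful parse only shows that the syntax is accepted; it does not by itself confirm that the crash is avoided.

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd-utils): parse-check the DRBD
# configuration without applying it to the running system.
# Exit status: 0 = parsed, 1 = parse failed, 2 = drbdadm not available.
validate_drbd_config() {
    if ! command -v drbdadm >/dev/null 2>&1; then
        echo "drbdadm not found; skipping check"
        return 2
    fi
    # 'drbdadm dump all' reads the configuration and prints it back;
    # a non-zero exit status means the parser rejected it.
    if drbdadm dump all >/dev/null; then
        echo "DRBD configuration parsed successfully"
    else
        echo "DRBD configuration failed to parse" >&2
        return 1
    fi
}
```

Running `validate_drbd_config` on each node after changing the 'upper' resource section gives a quick syntax check before the stacked resource is started.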

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.