SUSE Linux Enterprise High Availability Extension 12 SP1, SP2

Data Replication across Geo Clusters via DRBD

Included with SUSE Linux Enterprise High Availability Extension

This technical guide describes the setup of a geo cluster using DRBD as delivered with the SUSE Linux Enterprise High Availability Extension.

Author: Matt Kereczman, Cluster Engineer, LINBIT
Author: Philipp Marek, Senior Software Developer, LINBIT
Author: Kristoffer Grönlund, Architect High Availability, SUSE
Publication Date: November 7, 2016

1 Background

The following sections provide you with important background information about DRBD and the SUSE Linux Enterprise High Availability Extension.

1.1 About DRBD

DRBD is a Linux-kernel block-level replication facility that is widely used as a shared-nothing cluster building block. It has been included in the vanilla kernel since 2.6.33, and most distributions ship the necessary userspace utilities. Furthermore, many distributions provide, as extra packages, DRBD versions newer than the one included in the kernel.

DRBD can replicate across multiple network protocols and in (currently) three modes, from synchronous for local HA clusters to asynchronous for pushing data to a disaster recovery site.

DRBD is developed and supported worldwide by LINBIT (http://www.linbit.com). Support covers most distributions and architectures, with Service Level Agreement (SLA) levels up to 24/7 e-mail and phone availability.

1.2 About SUSE Linux Enterprise High Availability Extension

SUSE Linux Enterprise High Availability Extension is an integrated suite of open source clustering technologies that enables you to implement highly available physical and virtual Linux clusters, and to eliminate single points of failure. It ensures the availability and manageability of critical networked resources including data, applications, and services. Thus, it helps you maintain business continuity, protect data integrity, and reduce unplanned downtime for your mission-critical Linux workloads.

SUSE Linux Enterprise High Availability Extension ships with essential monitoring, messaging, and cluster resource management functionality (supporting failover, failback, and migration (load balancing) of individually managed cluster resources).

SUSE Linux Enterprise High Availability Extension includes DRBD.

2 Introduction

This technical guide describes a setup consisting of two highly available Pacemaker clusters at two sites, with a routed IPv4 or IPv6 connection in between. The connection can range from a few Mbit/s up to 10 Gbit/s, depending on the I/O load imposed on the cluster nodes.

Various services can be distributed across the clusters. Because of the latency between the data centers, you cannot run a cluster file system across them. However, simply replicating the data to allow a quick failover in case one site goes down is possible.

Failover between the two sites is managed using the Booth Cluster Ticket Manager, which is included in the Geo Clustering for SUSE Linux Enterprise High Availability Extension and the LINBIT HA Cluster Stack. In addition to the two cluster sites, a third site is needed to run a booth Arbitrator. Arbitrators are single machines that run a booth instance in a special mode. As all booth instances communicate with each other, arbitrators help to make more reliable decisions about granting or revoking tickets.

Many details are only skimmed over. For example, for this technical guide it does not matter whether the application driving I/O is SAP, an NFS server, a PostgreSQL instance, VMs via KVM, Apache, MySQL, or whatever else you may have in use.

3 Installation Requirements

You need to have the Geo Clustering for SUSE Linux Enterprise High Availability Extension installed on both pairs of nodes, and on the Arbitrator node. For a detailed description on how to configure the geo cluster nodes, see the Geo Clustering Quick Start document included in the Geo Clustering for SUSE Linux Enterprise High Availability Extension documentation at https://www.suse.com/documentation/sle-ha-geo-12/art_ha_geo_quick/data/art_ha_geo_quick.html.

It is good practice, and recommended, to use LVM as the storage layer below DRBD. This allows for quick and easy creation and online enlargement of DRBD devices.
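
As an illustration only (the volume group, logical volume name, and sizes are assumptions matching the example configuration later in this guide), the backing device for a DRBD resource could be created, and later grown online, like this:

root # lvcreate --name lv-nfs --size 10G volgroup        # create the backing LV for DRBD
root # lvextend --size +5G /dev/volgroup/lv-nfs          # later: grow the LV online
root # drbdadm resize nfs-lower-rz1                      # let DRBD adopt the new size

Note that in the stacked setup described later, each layer must be grown in turn: the logical volumes on all four nodes, both lower resources, and finally the upper resource.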

You also need to install, on all four nodes, the software (including dependencies) for the services you want to run. If you replicate Virtual Machines (VMs), these are typically self-contained, so you will only need the KVM environment (libvirt with virsh) on the cluster nodes.
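
For the NFS example used later in this guide, the installation on a SUSE Linux Enterprise node could look like the following (package names are assumptions and depend on your product version and on the service you actually run):

root # zypper install nfs-kernel-server
root # zypper install libvirt libvirt-client      # only when replicating VMs instead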

4 Cluster Architecture Overview

The following image depicts a two-site, four-node cluster with arrows showing the direction of replication, and blocks in orange showing that a DRBD device is Primary. There are two nodes in the local site, alice and bob, and two nodes in the remote site, charlie and daisy.

Each site will have its own DRBD resource stack and service IP to allow independent migration across the cluster nodes.

Figure 1: Stacked Four-Node Cluster

The most important details of the cluster architecture are the following:

  • The local high availability setup: DRBD running in Protocol C, using IP addresses in a LAN or across a cross-over connection.

  • Just after the lower DRBD device is activated on one node, the dedicated service IP address is started on that node.

    This address is not only used for the service as such, but also as a fixed endpoint that the upper DRBD device (in Secondary state on the peer site) can reach for replication.

  • The upper layer DRBD runs on one node per site, and is responsible for replicating the data to the other replication site. This connection runs in protocol A, and might have a DRBD Proxy setup in between.

    Note
    Note: Data Compression

    The DRBD Proxy buffers, and optionally compresses, data from one or multiple DRBD resources. Compression helps if the site interconnect is slower than the summed average I/O rate of the resources. The compression ratio is about 1:4 for zlib and up to 1:50 for lzma. See the DRBD Proxy 3 Compression Considerations whitepaper at https://links.linbit.com/tech_guides for more details.

  • On the site that should actually run the service, the upper layer DRBD gets set as Primary, so that the file system therein can be mounted and used by the application.

5 Example Configurations for a Single Service

The example configurations below are using the following premises:

  • Two sites, called RZ1 and RZ2, with two private networks 192.168.201.x and 192.168.202.x, routed to the other site

  • Four nodes, called geo-rzN-M, in four combinations: geo-rz1-a to geo-rz2-b

  • NFS is to be served; there is not much difference for other services

  • Nodes are using LVM; the volume group (VG) name is volgroup

  • The lower DRBD layer (for the HA-clusters) uses minor 0; minor 10 is used for DR replication

5.1 DRBD Configuration

The following snippets show a basic DRBD configuration. These are bare-bones; performance-tuning options are not included here.

All three snippets can be contained in a single resource file, for example /etc/drbd.d/nfs.res. This is the recommended configuration, because it is easier to keep a single file synchronized across the four cluster nodes, for example with csync2 (http://oss.linbit.com/csync2/).

If you have used the Geo Clustering Quick Start Guide at https://www.suse.com/documentation/sle-ha-geo-12/art_ha_geo_quick/data/art_ha_geo_quick.html to perform the basic configuration of the cluster nodes, the DRBD configuration files are already included in the list of files to be synchronized.

To synchronize any changes to the configuration files across the cluster nodes, use the following command:

root # csync2 -xv /etc/drbd.d/

If you do not have csync2, or if you do not want to use it, you will need to copy the DRBD configuration files manually to all the other nodes.
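
Without csync2, a plain copy from the node where you edited the file is sufficient, for example (file and host names as used in this guide, copied from geo-rz1-a):

root # scp /etc/drbd.d/nfs.res geo-rz1-b:/etc/drbd.d/
root # scp /etc/drbd.d/nfs.res geo-rz2-a:/etc/drbd.d/
root # scp /etc/drbd.d/nfs.res geo-rz2-b:/etc/drbd.d/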

5.1.1 DRBD on Site 1

To configure DRBD on Site 1, you should be aware of the following details:

  • The resource name has the site in it, so that the complete configuration can be kept in synchronization across both clusters without naming conflicts

  • The nodes' local per-node IP addresses are used

  • A shared-secret is used to avoid inadvertent wrong connections

    Note
    Note: uuid Program

    The uuid program is an easy way to get unique values.
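
    For example, the shared-secret used in the configuration below could be generated like this (the value shown is the one from this example; your output will differ):

    root # uuid
    2a9702a6-8747-11e3-9ebb-782bcbd0c11c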

resource nfs-lower-rz1 {
        disk            /dev/volgroup/lv-nfs;
        meta-disk       internal;
        device          /dev/drbd0;
        protocol        C;

        net {
                shared-secret   "2a9702a6-8747-11e3-9ebb-782bcbd0c11c";
        }
        
        on geo-rz1-a {
                address          192.168.201.111:7900;
        }
        on geo-rz1-b {
                address          192.168.201.112:7900;
        }
}

5.1.2 DRBD on Site 2

Even though Site 2 is nearly identical to Site 1, note the following differences:

  • The resource name has changed

  • The node names and IP addresses are different

  • Another shared-secret has been generated

  • The volume group and LV name can be kept in the resource section if identical on both nodes

resource nfs-lower-rz2 {
        disk            /dev/volgroup/lv-nfs;
        meta-disk       internal;
        device          /dev/drbd0;
        protocol        C;
        
        net {
                shared-secret   "cd9d857d-72ef-4d10-a1de-6450d1797a2c";
        }
                    
        on geo-rz2-a {
                address          192.168.202.111:7900;
        } 
        on geo-rz2-b {
                address          192.168.202.112:7900;
        } 
}

5.1.3 DRBD Connection across Sites

To configure a DRBD connection across sites, you should be aware of the following:

  • The storage disk is the HA-cluster DRBD device /dev/drbd0

    • You could also use /dev/drbd/by-res/nfs-lower-rzN/0, but that would be site-specific, and so would need to be moved into the per-site configuration (stacked on top of nfs-lower-rzN)

  • The DRBD device name drbd10 means that minor number 10 is used

  • Protocol A and a higher ping-timeout are needed because of the higher latency between the sites

  • A different shared-secret is used

  • Do not pass any host names; instead, tell DRBD to stack upon its lower device. This implies that the lower device must be Primary

  • To allow TCP/IP connections to the other site without knowing which cluster node currently holds the lower DRBD device in Primary state, the service IP address is used

resource nfs-upper {
        disk             /dev/drbd0;
        meta-disk        internal;
        device           /dev/drbd10;
        protocol         A;
        
        net {
                shared-secret    "e0fbd1fe-6b0b-47db-829a-2c4ba638bf1e";
                ping-timeout     20;
        }

        stacked-on-top-of nfs-lower-rz1 {
                address          192.168.201.151:7910;
        }
        stacked-on-top-of nfs-lower-rz2 {
                address          192.168.202.151:7910;
        }
}
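
For reference, the initial activation and sync of this stacked setup could be done by hand roughly as follows, before Pacemaker takes over (a sketch only; it assumes the metadata has not yet been created and that the per-site service IP addresses used in the stacked-on-top-of sections are already reachable):

# on both nodes of each site (use nfs-lower-rz2 on site 2):
root # drbdadm create-md nfs-lower-rz1
root # drbdadm up nfs-lower-rz1

# on one node per site, start the initial sync of the lower device:
root # drbdadm primary --force nfs-lower-rz1

# on the node of each site that is currently Primary for the lower device:
root # drbdadm --stacked create-md nfs-upper
root # drbdadm --stacked up nfs-upper

# only on the site that holds the valid data (here: site 1):
root # drbdadm --stacked primary --force nfs-upper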

Using a DRBD Proxy would involve inserting proxy on ... sections into the stacked-on-top-of sections above, and a proxy { ... } section inside the resource section. See LINBIT's DRBD Proxy guide at http://links.linbit.com/DRBD-Proxy-Conf for more details on configuring DRBD Proxy.

5.2 Pacemaker Resources (in crm-shell syntax)

For a more in-depth look at how to configure the NFS server, see the Highly Available NFS Storage with DRBD and Pacemaker document included in the SUSE Linux Enterprise High Availability Extension documentation. To configure the necessary resources, use the crm shell commands outlined in the following sections.

5.2.1 Basic Primitives

Setting up the basic primitives is fairly straightforward. You need a service IP, the file system, and the NFS server.

Note
Note: exportfs

It is also possible to use the exportfs resource agent instead and keep the NFS server running all the time. This is necessary if there are multiple NFS exports that must migrate independently; a sketch follows the code block below.

crm configure
primitive p-ip-nfs IPaddr2 ip=192.168.202.151 iflabel=nfs nic=eth1 cidr_netmask=24
primitive p-nfs-fs Filesystem device=/dev/drbd/by-res/nfs/0 directory=/mnt/nfs fstype=ext4
primitive p-nfs-service systemd:nfs-server
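
If you use the exportfs approach from the note above instead, an export primitive could look roughly like the following (the primitive name, directory, fsid, and client range are illustrative assumptions); the NFS server itself would then stay up permanently instead of being part of the failover group:

primitive p-nfs-export exportfs directory=/mnt/nfs fsid=1 clientspec="192.168.0.0/16" options="rw"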

5.2.2 DRBD Pacemaker Resources, Multi-State Resources

To configure the cluster resources for DRBD, it is possible to use the drbd cluster script. This script will create a base resource for DRBD, as well as a multi-state resource that ensures that DRBD only runs in Primary mode on a single node. Multi-state resources, previously called Master-Slave resources, allow the instances to be in one of two operating modes (called roles). The roles are called master and slave.

crm script run drbd id=drbd-nfs drbd_resource=nfs-upper
crm script run drbd id=drbd-nfs-lower drbd_resource=nfs-lower-rz2
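
For reference, a hand-written equivalent for the lower DRBD resource might look roughly like this (a sketch; resource names, monitor intervals, and meta attributes are illustrative, and the actual output of the cluster script may differ). The stacked upper resource is defined analogously, typically with clone-max=1, because only one instance runs per site:

crm configure
primitive p-drbd-nfs-lower ocf:linbit:drbd drbd_resource=nfs-lower-rz2 \
        op monitor interval=31s role=Master \
        op monitor interval=29s role=Slave
ms ms-drbd-nfs-lower p-drbd-nfs-lower \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true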

5.2.3 Group and Basic Primitives

This is mostly what you would expect from Figure 1: the multi-state equivalent of a group consisting of ms-drbd-nfs-lower:Master, p-ip-nfs, and ms-drbd-nfs:Master.

crm configure
    
group g-nfs p-nfs-fs p-nfs-service
    
colocation co-nfs-ip-with-lower inf: p-ip-nfs:Started ms-drbd-nfs-lower:Master
colocation co-nfs-g-with-upper inf: g-nfs:Started ms-drbd-nfs:Master
colocation co-nfs-upper-with-ip inf: ms-drbd-nfs:Master p-ip-nfs:Started
    
order o-lower-drbd-before-ip-nfs inf: ms-drbd-nfs-lower:promote p-ip-nfs:start
order o-ip-nfs-before-drbd inf: p-ip-nfs:start ms-drbd-nfs:promote
order o-drbd-nfs-before-svc inf: ms-drbd-nfs:promote g-nfs:start

6 Interoperability with Booth

In the following sections, you can find the example configurations for the booth cluster ticket manager. Use csync2 to synchronize the configurations for booth across all cluster nodes.

6.1 Booth Configuration

In the booth configuration, you need to specify the following components:

  • A UDP port to use

  • Three IP addresses

    Note
    Note

    You need one distinct service IP for each site, and a third one for the arbitrator. Separate addresses are preferred, so that booth can be managed independently.

  • The Pacemaker ticket name (here ticket-nfs)

The configuration file might be stored as /etc/booth/nfs.conf.

transport = udp   
port      = "9929"
    
site       = "192.168.201.100"
site       = "192.168.202.100"
arbitrator = "192.168.203.100"
    
ticket = "ticket-nfs"
        expire  = 600
        timeout =   5
        acquire-after = 60
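
The two cluster sites start booth through Pacemaker, as shown in the next section. On the arbitrator, which does not run Pacemaker, the booth instance for this configuration could be started directly via systemd (assuming the booth@ template unit shipped with the booth package; the instance name corresponds to the configuration file name):

root # systemctl enable booth@nfs
root # systemctl start booth@nfs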

6.2 Pacemaker Integration

Follow the recommendations below to integrate booth into Pacemaker:

  • As before, this example configuration is for site 2 (note the service IP address subnet).

  • booth automatically uses the default directory and suffix if the configuration name does not specify any.

  • As the DRBD replication should be running even if a site does not have the ticket, the correct loss-policy is demote. This will put DRBD into Secondary mode.

  • As described in the SUSE Multi-Site Cluster Documentation at https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_geo_setup.html, there should be an order constraint that makes sure that Booth can fetch the ticket before trying to start the service.

crm configure
    
primitive p-booth ocf:pacemaker:booth-site config=nfs
primitive p-ip-booth IPaddr2 ip=192.168.202.100 iflabel=ha nic=eth1 cidr_netmask=24
    
group g-booth p-ip-booth p-booth
    
rsc_ticket nfs-req-ticket ticket-nfs: ms-drbd-nfs:Master loss-policy=demote
    
order o-booth-before-nfs inf: g-booth ms-drbd-nfs:promote

7 Testing Failovers and Recovery

After you have configured everything as described, you should check that your setup is working as expected.

The following command from any node running the booth daemon will tell you which tickets it knows about, which site currently holds the ticket, and when that ticket will expire if it is not renewed:

# booth list
ticket: ticket-nfs, leader: 192.168.201.100, expires: 2016-04-14 07:50:48

7.1 Failover

There are many ways to test that failover works as expected. An easy way is to sever network communication between the sites using iptables. You only need to block communication on the booth port to see a ticket expire and services fail over.

On the node currently running the booth daemon, issue the following iptables commands to sever booth's communication with its peers:

/usr/sbin/iptables -I INPUT -p udp --dport 9929 -j DROP
/usr/sbin/iptables -I OUTPUT -p udp --dport 9929 -j DROP

After the ticket's expire time has elapsed, you should see services begin to demote at the current Primary site. After the expire plus acquire-after time has elapsed, you should see services begin to start at the Secondary site. With the example configuration above (expire = 600, acquire-after = 60), that is roughly 660 seconds, or 11 minutes, after communication was cut.

Remove the IPtables rules you inserted before continuing:

# /usr/sbin/iptables -D INPUT -p udp --dport 9929 -j DROP
# /usr/sbin/iptables -D OUTPUT -p udp --dport 9929 -j DROP

7.2 Failback

To fail back to the original Primary site, you need to manually revoke the ticket from the site currently running the services and grant the ticket to the original Primary site.

You can issue the following commands from any of the nodes currently running the Booth daemon:

# booth revoke -s 192.168.202.100 ticket-nfs
  booth[30268]: 2016/04/14_08:21:05 info: revoke request sent, waiting for the result ...
  booth[30268]: 2016/04/14_08:21:09 info: revoke succeeded!
# booth grant -s 192.168.201.100 ticket-nfs
  booth[30269]: 2016/04/14_08:21:17 info: grant request sent, waiting for the result ...
  booth[30269]: 2016/04/14_08:21:28 info: grant succeeded

8 Further Documentation

SUSE Linux Enterprise High Availability Extension Guide: Comprehensive documentation covering nearly every part of the Linux cluster stack. See https://www.suse.com/documentation/sle_ha/singlehtml/book_sleha/book_sleha.html.

DRBD Project Page: Located at https://www.drbd.org, it provides lots of information, including a detailed Users' Guide, one of the most extensive project documentations in the Open Source world.

LINBIT Home Page: At https://www.linbit.com, you can find answers to all questions about paid support from the developers. An overview of supported platforms, SLAs, and price quotes is available at https://www.linbit.com/en/p/services/support.

SUSE Geo Clustering Documentation: This describes the general challenges of Geo clustering and typical solutions. The HTML version is hosted at https://www.suse.com/documentation/sle-ha-geo-12/singlehtml/art_ha_geo_quick/art_ha_geo_quick.html#sec.ha.geo.challenges.

9 Legal Notice

Copyright © 2006–2016 SUSE LLC and contributors. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled GNU Free Documentation License.

SUSE, the SUSE logo and YaST are registered trademarks of SUSE LLC in the United States and other countries. For SUSE trademarks, see http://www.suse.com/company/legal/.

DRBD and LINBIT are trademarks or registered trademarks of LINBIT in Austria, the United States, and other countries.

Linux is a registered trademark of Linus Torvalds. Other names or trademarks mentioned in this document may be trademarks or registered trademarks of their respective owners.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.
