Building an HA cluster with Pacemaker, Corosync and DRBD

If you want to set up a highly available Linux cluster but would rather not use an "enterprise" solution like Red Hat Cluster, consider Pacemaker, Corosync and DRBD [1], [2], [3].

Pacemaker is a cluster resource manager. It maximizes the availability of your cluster services by detecting and recovering from node-level and resource-level failures. To do this it relies on the messaging and membership capabilities of your preferred cluster infrastructure, either Corosync or Heartbeat.

For this post we'll use Corosync and set up a two-node, highly available Apache web server as an Active/Passive cluster, using DRBD and Ext4 to store the data.

To install the software we'll be using a Fedora repository:

[root@node1 ~]# sed -i.bak "s/enabled=0/enabled=1/g" /etc/yum.repos.d/fedora.repo
[root@node1 ~]# sed -i.bak "s/enabled=0/enabled=1/g" /etc/yum.repos.d/fedora-updates.repo
[root@node1 ~]# yum install -y pacemaker corosync

To configure Corosync, we need to choose an unused multicast address and a port:

[root@node1 ~]# export ais_port=4000
[root@node1 ~]# export ais_mcast=226.94.1.1
[root@node1 ~]# export ais_addr=`ip addr | grep "inet " | tail -n 1 | awk '{print $4}' | sed s/255/0/`
[root@node1 ~]# cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
[root@node1 ~]# sed -i.bak "s/.*mcastaddr:.*/mcastaddr:\ $ais_mcast/g" /etc/corosync/corosync.conf
[root@node1 ~]# sed -i.bak "s/.*mcastport:.*/mcastport:\ $ais_port/g" /etc/corosync/corosync.conf
[root@node1 ~]# sed -i.bak "s/.*bindnetaddr:.*/bindnetaddr:\ $ais_addr/g" /etc/corosync/corosync.conf
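
To see what these substitutions do without touching the real file, the sketch below applies equivalent expressions to a scratch copy. The sample totem block is an assumption modelled on the stock example file, and the 192.168.122.0 network is a placeholder chosen to match the cluster IP used later in this post.

```shell
# Sketch: run the substitutions against a scratch file instead of
# /etc/corosync/corosync.conf. Sample content and network are assumptions.
ais_port=4000
ais_mcast=226.94.1.1
ais_addr=192.168.122.0

conf=$(mktemp)
cat > "$conf" <<'EOF'
totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.1
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
}
EOF

# Equivalent to the expressions above; note that they also strip the
# leading indentation of each matched line.
result=$(sed -e "s/.*mcastaddr:.*/mcastaddr: $ais_mcast/" \
             -e "s/.*mcastport:.*/mcastport: $ais_port/" \
             -e "s/.*bindnetaddr:.*/bindnetaddr: $ais_addr/" "$conf" \
         | grep -E '^(bindnetaddr|mcastaddr|mcastport):')

# Show only the three settings that were rewritten.
echo "$result"
rm -f "$conf"
```

Running a dry run like this first makes it easy to spot a wrong bindnetaddr before Corosync is started.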

We also need to tell Corosync to load the Pacemaker plugin:

[root@node1 ~]# cat <<-END >>/etc/corosync/service.d/pcmk
service {
# Load the Pacemaker Cluster Resource Manager
name: pacemaker
ver: 1
}
END

At this point we need to propagate the configuration changes we made to the second node:

[root@node1 ~]# for f in /etc/corosync/corosync.conf /etc/corosync/service.d/pcmk /etc/hosts; do scp $f node2:$f ; done

Now we can start Corosync on the first node and check /var/log/messages:

[root@node1 ~]# /etc/init.d/corosync start

If all looks good we can start Corosync on the second node as well, and check if the cluster was formed by tailing /var/log/messages.

The next step is to start Pacemaker on both nodes:

[root@node1 ~]# /etc/init.d/pacemaker start

To display the cluster status run:

[root@node1 ~]# crm_mon

Now that we have a working cluster, get familiar with the main cluster administration tool:

[root@node1 ~]# crm --help

Let's examine the current cluster configuration:

[root@node1 ~]# crm configure show

One thing to note is that Pacemaker ships with STONITH enabled. STONITH is a common node fencing mechanism that is used to ensure data integrity by powering off (or Shooting The Other Node In The Head) a problematic node.

For the purpose of this example let's simplify things and disable STONITH, at least for now:

[root@node1 ~]# crm configure property stonith-enabled=false
[root@node1 ~]# crm_verify -L

We should also tell the cluster to ignore loss of quorum, since a two-node cluster loses quorum as soon as one node fails:

[root@node1 ~]# crm configure property no-quorum-policy=ignore

Now it's time to add the first shared resource - an IP address, because regardless of where the cluster service(s) are running, we need a consistent address to contact them on:

[root@node1 ~]# crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 params ip=192.168.122.101 cidr_netmask=32 op monitor interval=30s

The important piece of information here is ocf:heartbeat:IPaddr2, which tells Pacemaker three things about the resource you want to add. The first field, ocf, is the standard the resource script conforms to and where to find it. The second field is specific to OCF resources and tells the cluster which namespace to find the resource script in, in this case heartbeat. The last field is the name of the resource script.
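
The three fields can be pulled apart with plain shell parameter expansion, as in the sketch below. The variable names are illustrative, and the script path assumes the usual OCF root of /usr/lib/ocf/resource.d, which may differ on your distribution.

```shell
# Sketch: split a resource agent spec into the three fields described above.
spec="ocf:heartbeat:IPaddr2"

class=${spec%%:*}       # text before the first ':' (the standard)
rest=${spec#*:}         # everything after the first ':'
provider=${rest%%:*}    # the OCF namespace
agent=${rest#*:}        # the name of the resource script

# Assumption: OCF agents live under /usr/lib/ocf/resource.d.
script="/usr/lib/ocf/resource.d/$provider/$agent"

echo "class=$class provider=$provider agent=$agent"
echo "script=$script"
```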

To obtain a list of the available resource classes, and of the agents in the ocf pacemaker and heartbeat namespaces, run:

[root@node1 ~]# crm ra classes
[root@node1 ~]# crm ra list ocf pacemaker
[root@node1 ~]# crm ra list ocf heartbeat

Let's test this by performing a failover. The IP should move from the node currently hosting it to the second, passive node.
First let's check which node the IP resource is currently running on:

[root@node1 ~]# crm resource status ClusterIP

On that node stop Pacemaker and Corosync, in that order:

[root@node1 ~]# /etc/init.d/pacemaker stop
[root@node1 ~]# /etc/init.d/corosync stop

Or put the node on stand-by:

[root@node1 ~]# crm node standby

Check the status of the cluster and observe where the IP resource has moved:

[root@node1 ~]# crm_mon

You can also check with:

[root@node1 ~]# ip addr show
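
If you want to script that check, a small filter over the one-line output of ip is enough. The sketch below is only an illustration: has_ip is a hypothetical helper, the sample text is an assumed capture of `ip -o -4 addr show`, and the node address 192.168.122.11 is made up.

```shell
# Sketch: report whether an address appears in `ip -o -4 addr show` output.
# has_ip is a hypothetical helper; it reads the command output from stdin
# so it can be tried against captured text as well as on a live node.
has_ip() {
    grep -qF "inet $1/"
}

# Assumed sample of the one-line output format on the active node.
sample='1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 192.168.122.11/24 brd 192.168.122.255 scope global eth0
2: eth0    inet 192.168.122.101/32 brd 192.168.122.255 scope global secondary eth0'

if printf '%s\n' "$sample" | has_ip 192.168.122.101; then
    holder="this node holds ClusterIP"
else
    holder="ClusterIP is elsewhere"
fi
echo "$holder"
```

On a real node you would pipe the live output instead, for example `ip -o -4 addr show | has_ip 192.168.122.101`.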

Now let's simulate node recovery by starting the services back in the following order:

[root@node1 ~]# /etc/init.d/corosync start
[root@node1 ~]# /etc/init.d/pacemaker start

Or put the node back online:

[root@node1 ~]# crm node online

It's time to add more services to the cluster. Let's install Apache:

[root@node1 ~]# yum install -y httpd

Create an index page on both nodes, displaying the name of the node:

[root@node1 ~]# cat <<-END >/var/www/html/index.html
<html>

<body>
HA Apache - node1
</body>

</html>
END

In order to monitor the health of your Apache instance, and recover it if it fails, the resource agent used by Pacemaker assumes the server-status URL is available.
Look for the following in /etc/httpd/conf/httpd.conf and make sure it is not disabled or commented out:

<Location /server-status>
SetHandler server-status
Order deny,allow
Deny from all
Allow from 127.0.0.1
</Location>
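
A quick way to confirm the block is active is to look for an uncommented opening line. The sketch below uses a hypothetical helper, status_enabled, and two scratch files standing in for httpd.conf. It only checks the opening Location line, so treat it as a rough sanity check rather than a full configuration test.

```shell
# Sketch: check whether a config file has an uncommented
# <Location /server-status> line. status_enabled is a hypothetical helper,
# not part of Apache or Pacemaker.
status_enabled() {
    grep -Eqi '^[[:space:]]*<Location[[:space:]]+/server-status>' "$1"
}

# Two scratch files stand in for /etc/httpd/conf/httpd.conf.
on=$(mktemp)
off=$(mktemp)
printf '%s\n' '<Location /server-status>' 'SetHandler server-status' '</Location>' > "$on"
printf '%s\n' '#<Location /server-status>' '#SetHandler server-status' '#</Location>' > "$off"

if status_enabled "$on"; then on_result=enabled; else on_result=disabled; fi
if status_enabled "$off"; then off_result=enabled; else off_result=disabled; fi

echo "active block: $on_result, commented block: $off_result"
rm -f "$on" "$off"
```

Against the real file the call would be `status_enabled /etc/httpd/conf/httpd.conf`.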

At this point Apache is ready to go; all that remains is to add it to the cluster. Let's call the resource WebSite. We need to use an OCF script called apache in the heartbeat namespace. The only required parameter is the path to the main Apache configuration file, and we'll tell the cluster to check once a minute that Apache is still running:

[root@node1 ~]# crm configure primitive WebSite ocf:heartbeat:apache params configfile=/etc/httpd/conf/httpd.conf op monitor interval=1min
[root@node1 ~]# crm configure show

Pacemaker will generally try to spread the configured resources across the cluster nodes. For Apache, we need to tell the cluster that the two resources are related and must run on the same host (or not at all). Here we instruct the cluster that WebSite can only run on the host that ClusterIP is active on:

[root@node1 ~]# crm configure colocation website-with-ip INFINITY: WebSite ClusterIP
[root@node1 ~]# crm configure show

When Apache starts, it binds to the available IP addresses. It doesn't know about any addresses we add afterwards, so not only do the two resources need to run on the same node, but ClusterIP must already be active before WebSite starts. We do this by adding an ordering constraint. We need to give it a name (choose something descriptive like apache-after-ip), indicate that it's mandatory (so that any recovery for ClusterIP will also trigger recovery of WebSite) and list the two resources in the order we need them to start:

[root@node1 ~]# crm configure order apache-after-ip mandatory: ClusterIP WebSite
[root@node1 ~]# crm configure show

We can also specify a preferred node on which the Apache server should run (for example, because it has better hardware):

[root@node1 ~]# crm configure location prefer-node1 WebSite 50: node1
[root@node1 ~]# crm configure show

To manually move a resource from one node to the other we need to run:

[root@node1 ~]# crm resource move WebSite node1
[root@node1 ~]# crm_mon

And to move it back:

[root@node1 ~]# crm resource unmove WebSite

Configuring DRBD as a cluster resource.

Think of DRBD as network-based RAID-1. Instead of manually syncing data between nodes, we can use block-level replication to do it for us.
For more information on how to set up DRBD refer to [3] and [4].

Run the cluster configuration utility:

[root@node1 ~]# crm

Next we must create a working copy of the current configuration. This is where all our changes will go. The cluster will not see any of them until we say it's OK.

crm(live)# cib new drbd

Now let's create the DRBD clone, display the revised configuration and commit the changes:

crm(drbd)# configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata op monitor interval=60s
crm(drbd)# configure ms WebDataClone WebData meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
crm(drbd)# configure show
crm(drbd)# cib commit drbd

Now that DRBD is functioning we can configure a Filesystem resource to use it. In addition to the filesystem’s definition, we also need to tell the cluster where it can be located (only on the DRBD Primary) and when it is allowed to start (after the Primary was promoted):

[root@node1 ~]# crm
crm(live)# cib new fs
crm(fs)# configure primitive WebFS ocf:heartbeat:Filesystem params device="/dev/drbd/by-res/wwwdata" directory="/var/www/html" fstype="ext4"
crm(fs)# configure colocation fs_on_drbd inf: WebFS WebDataClone:Master
crm(fs)# configure order WebFS-after-WebData inf: WebDataClone:promote WebFS:start

We also need to tell the cluster that Apache must run on the same machine as the filesystem and that the filesystem must be active before Apache can start. Then we commit the changes:

crm(fs)# configure colocation WebSite-with-WebFS inf: WebSite WebFS
crm(fs)# configure order WebSite-after-WebFS inf: WebFS WebSite
crm(fs)# cib commit fs

Now we have a fully functional two-node HA solution for Apache!
You can easily set up HA MySQL or NFS using the same method.

For more detailed information please read the main tutorial at [5].

Resources:

[1] http://www.clusterlabs.org/
[2] http://www.corosync.org/
[3] http://www.drbd.org/
[4] http://kaivanov.blogspot.com/2012/01/deploying-drbd-on-linux.html
[5] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/index.html

Deploying DRBD on Linux

DRBD stands for Distributed Replicated Block Device. It provides block devices designed as building blocks for high availability clusters, by mirroring a whole block device over an assigned network. DRBD can be understood as network-based RAID-1.

DRBD works on top of block devices, i.e., hard disk partitions or LVM logical volumes. It mirrors each data block that is written to disk to the peer node.

What follows is a detailed walkthrough of installing and configuring DRBD on two server nodes. For more information you can refer to [1].

1. Install the kernel headers, gcc and flex if not already present on the system, then download and extract the source code from linbit.com:

[root@drbd1 ~] cd /usr/src/
[root@drbd1 ~] yum install kernel-devel kernel-headers gcc flex
[root@drbd1 ~] wget http://oss.linbit.com/drbd/8.4/drbd-8.4.1.tar.gz
[root@drbd1 ~] tar zxfv drbd-8.4.1.tar.gz
[root@drbd1 ~] cd drbd-8.4.1

2. Configure and compile:

[root@drbd1 ~] ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var --with-km --with-udev
[root@drbd1 ~] make
[root@drbd1 ~] make install

The --with-km option will compile the kernel module as well.

3. Load the kernel module and make DRBD start at boot:

[root@drbd1 ~] modprobe drbd
[root@drbd1 ~] echo "/etc/init.d/drbd start" >> /etc/rc.local

4. You can use almost any block device with DRBD. For this demo we'll use LVM logical volumes for the data and metadata (this assumes the volume group vg_drbd already exists):

[root@drbd1 ~] lvcreate -n lv_drbd -L 10G vg_drbd
[root@drbd1 ~] lvcreate -n lv_drbdmeta -L 10G vg_drbd

5. Create the global configuration file /etc/drbd.conf with the following contents:

[root@drbd1 ~] cat /etc/drbd.conf

include "/etc/drbd.d/global_common.conf";
include "/etc/drbd.d/*.res";

6. Create the directory /etc/drbd.d/ and, inside it, global_common.conf, the global file that all resources will share:

[root@drbd1 ~] mkdir /etc/drbd.d/
[root@drbd1 ~] cat /etc/drbd.d/global_common.conf

global {
usage-count no;
}

common {
net {
protocol C;
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-2pri disconnect;
}
}

7. Create a resource named r0 that builds a DRBD device on top of the logical volume we created earlier:

[root@drbd1 ~] cat /etc/drbd.d/r0.res

resource r0 {
on drbd1 {
device /dev/drbd1;
disk /dev/vg_drbd/lv_drbd;
address 192.168.29.135:7789;
meta-disk /dev/vg_drbd/lv_drbdmeta;
}

on drbd2 {
device /dev/drbd1;
disk /dev/vg_drbd/lv_drbd;
address 192.168.29.136:7789;
meta-disk /dev/vg_drbd/lv_drbdmeta;
}
}

For more information on the options in this file refer to [1].
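
Because the file must be identical on both nodes, it can help to generate it rather than type it twice. The sketch below renders the same resource definition from a handful of variables. render_on and render_res are hypothetical helper names, and the values are simply the ones used in the example above.

```shell
# Sketch: render r0.res from variables so both nodes get the same file.
# render_on and render_res are hypothetical helpers.
node_a=drbd1; addr_a=192.168.29.135
node_b=drbd2; addr_b=192.168.29.136
disk=/dev/vg_drbd/lv_drbd
meta=/dev/vg_drbd/lv_drbdmeta

# Print one "on <host>" section; $1 = host name, $2 = replication address.
render_on() {
    cat <<EOF
  on $1 {
    device /dev/drbd1;
    disk $disk;
    address $2:7789;
    meta-disk $meta;
  }
EOF
}

# Print the whole resource; $1 = resource name.
render_res() {
    echo "resource $1 {"
    render_on "$node_a" "$addr_a"
    render_on "$node_b" "$addr_b"
    echo "}"
}

res=$(render_res r0)
echo "$res"
```

Redirecting the output, for example `render_res r0 > /etc/drbd.d/r0.res`, on each node keeps the two copies in step.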

8. Repeat the above steps on the second node, ensuring the configuration files are all identical.

9. Create device metadata - this step initializes DRBD's metadata and must be completed only on initial device creation. Run it on both nodes:

[root@drbd1 ~] drbdadm create-md r0

10. Enable the resource - this step associates the resource with its backing device, sets replication parameters, and connects the resource to its peer. Run this on both nodes:

[root@drbd1 ~] drbdadm up r0
[root@drbd1 ~] cat /proc/drbd

11. Initial device synchronization - select an initial sync source node, in our case drbd1, and run:

[root@drbd1 ~] drbdadm primary --force r0

This will sync all data from /dev/drbd1 on node drbd1 to /dev/drbd1 on node drbd2, overwriting any data already on drbd2.
At that point the DRBD device is fully operational, even before the initial synchronization has completed (albeit with slightly reduced performance). You may now create a filesystem on the device, use it as a raw block device, mount it, and perform any other operation you would with an accessible block device, but keep in mind that you can only do this on the Primary node.

12. Checking the status of DRBD:

[root@drbd1 ~] cat /proc/drbd

1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:8 nr:8 dw:16 dr:2001 al:1 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

[root@drbd1 ~] drbd-overview

1:r0/0 Connected Primary/Secondary UpToDate/UpToDate C r-----
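
The fields that matter most in that output are cs (connection state), ro (roles) and ds (disk states). The sketch below extracts them from a status line so they can be used in a script. drbd_field is a hypothetical helper, and treating Connected plus UpToDate/UpToDate as "healthy" is a simplification made for this example.

```shell
# Sketch: pull the cs/ro/ds fields out of a /proc/drbd status line.
# drbd_field is a hypothetical helper; it reads the status text from stdin.
drbd_field() {
    # $1 = field name (cs, ro or ds); prints the value that follows "name:"
    sed -n "s/.* $1:\([^ ]*\).*/\1/p"
}

# Status line copied from the example output above.
line=' 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----'

cs=$(printf '%s\n' "$line" | drbd_field cs)
ro=$(printf '%s\n' "$line" | drbd_field ro)
ds=$(printf '%s\n' "$line" | drbd_field ds)

# Simplified health rule, assumed for illustration only.
if [ "$cs" = "Connected" ] && [ "$ds" = "UpToDate/UpToDate" ]; then
    state=healthy
else
    state=degraded
fi
echo "cs=$cs ro=$ro ds=$ds state=$state"
```

On a live node the input would come from the real file, for example `drbd_field cs < /proc/drbd`.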

13. Switch Primary and Secondary nodes - you can make the Primary node Secondary and vice-versa with the following:

On node drbd1

[root@drbd1 ~] drbdadm secondary r0

On node drbd2

[root@drbd2 ~] drbdadm primary r0

14. Dealing with node failure - if a node that currently has a resource in the secondary role fails temporarily, no further intervention is necessary - the two nodes will simply re-establish connectivity upon system start-up.
After this, DRBD replicates to the secondary node all modifications made on the primary node in the meantime. When the failed node is repaired and returns to the cluster, it does so in the secondary role. If the failed node was the primary node, DRBD does not promote the surviving node to the primary role. It is the cluster management application's responsibility to do so, or it can be done manually as described in step 13.

15. Split brain recovery - if your nodes fail to reconnect to each other after a crash recovery, check the logs for the following signs of a split-brain condition:

[root@drbd2 ~] tail -100 /var/log/messages | grep drbd
...
Split-Brain detected, dropping connection!
...

After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).
At this point, unless you configured DRBD to automatically recover from split brain, you must manually intervene by selecting one node whose modifications will be discarded (this node is referred to as the split brain victim).

On the victim node run:

[root@drbd2 ~] drbdadm secondary r0
[root@drbd2 ~] drbdadm connect --discard-my-data r0

On the other node (the split brain survivor) run:

[root@drbd1 ~] drbdadm connect r0

This should reconnect your nodes and resolve the split-brain condition.

Now that DRBD is set up, you can integrate it with an HA solution like Red Hat Cluster or combine it with LVM. For more information refer to [2].

Resources:
[1] http://www.drbd.org/docs/about/
[2] http://www.drbd.org/docs/applications/