

SAN Storage Design for Xen Virtualization Based on Block Devices


By: variia

September 28, 2007 11:25 am


This document focuses solely on a design where virtual machines run on block devices (logically managed hard drive partitions) and are manually managed and migrated by the administrator via the xm interface.

Scenario: an HP EVA6000 FC SAN and two HP DL360G5 servers, with the requirement to create an environment where the administrator can manually (xm) migrate virtual machines between the two servers. We didn't really need very high availability, but we did need flexibility: for instance, if we wanted to take one server down for maintenance, we could do it without causing much of an outage to the services running on the virtual machines.

After browsing the Internet for a week I had to face the truth that there is no information or howto out there which would suit my needs. I was able to find some VMware ESX related papers and Novell's ultimate HA storage foundation. The problem with both is that they run virtual machines on file images, which we cannot do due to the nature of our virtual machines (high I/O).

The question I didn't have an answer for:

Q: What is really needed to be able to migrate a Xen virtual machine (running off a block device) from one host to another?
A: Nothing much; the block device just needs to be available on both servers.

Even with the "--live" option the migration can be done without any kind of cluster-aware file system on the two servers or on the (shared) block devices; I tested it thoroughly. However, since we want to use this in a production environment, for extra safety I do not recommend doing migrations with the "--live" option without a cluster-aware file system, even though Xen is very sophisticated software. There might be situations where the buffer is not yet flushed while the VM has already migrated over, causing a race condition in which two dom0s try to write to the same (shared) block device and corrupt your filesystem. According to my tests, a virtual machine with 512 MB of RAM and normal I/O load migrates over in about 6-8 seconds and only 8 ping packets get lost, which is fairly affordable.
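
For reference, this is what a manual migration looks like from the xm interface, first a plain migration and then a live one. The domain name vm1 here is just a placeholder:

host1:~ # xm migrate vm1 host2
host1:~ # xm migrate --live vm1 host2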

I have ended up with 2 possible solutions:

  1. Create a separate LUN for each virtual machine
    • The problem with this is that, due to multipathing, every LUN creates two devices. After a while
      it becomes unmanageable: the multipath configuration grows fairly big and complex, as does the SAN configuration, since each LUN needs to be presented to all dom0 hosts, not to mention that every time you need a LUN you have to nag the SAN administrator.

     

    • The advantage is that it doesn't need any further software (except multipathd), which makes this solution very feasible for small systems (5-10 VMs).

 

  2. Use some sort of cluster volume management
    • The issue with this is that it involves some level of complexity and therefore requires knowledge of several software products. It can also be overwhelming and unnecessary for small systems.

     

    • The advantage is that you need just one big LUN which is managed by the Xen administrators providing ultimate flexibility over your storage.

    We need the second option, so this document explains how to achieve that. I assume you already have the LUN created and presented to both (or all) dom0 hosts.

 

  • NTP Setup

 

 

The time on the two physical machines needs to be synchronized; several components in the HASF stack require this. I have configured both nodes to use our three internal NTP servers in addition to the other node, which gives us fairly decent redundancy.

host1:~ # vi /etc/sysconfig/ntp
NTPD_INITIAL_NTPDATE="ntp2.domain.co.nz ntp3.domain.co.nz ntp1.domain.co.nz"
NTPD_ADJUST_CMOS_CLOCK="no"
NTPD_OPTIONS="-u ntp"
NTPD_RUN_CHROOTED="yes"
NTPD_CHROOT_FILES=""
NTP_PARSE_LINK=""
NTP_PARSE_DEVICE=""

Remember that after making changes under the /etc/sysconfig directory you need to run SuSEconfig:

host1:~ # SuSEconfig

The server setup:

host1:~ # vi /etc/ntp.conf
server 127.127.1.0 
fudge 127.127.1.0  flag1 0 flag2 0 flag3 0 flag4 0 stratum 5
driftfile /var/lib/ntp/drift/ntp.drift     
logfile /var/log/ntp                
server ntp2.domain.co.nz 
server ntp3.domain.co.nz 
server ntp1.domain.co.nz 
server host2.domain.co.nz

This was generated by the YaST GUI module and mostly contains the defaults; I added the servers and changed the local clock source to stratum 5.

Ensure that both nodes can reach each other without DNS:

host1:~ # vi /etc/hosts
10.0.0.1  host1.domain.co.nz host1
10.0.0.2  host2.domain.co.nz host2

All of this needs to be done the same way on the other node as well.
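
Once ntpd has been restarted on both nodes, you can verify that the configured peers are reachable and that one of them gets selected for synchronization; for example (on SLES 10 the init script is called ntp):

host1:~ # rcntp restart
host1:~ # ntpq -p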

 

  • Multipathing

 

 

Multipathing has to be set up for proper redundancy; besides, without it the duplicate devices could confuse EVMS. There's a nice guide from HP, but it requires the HP drivers to be installed. I prefer using the SuSE stock kernel drivers because they are maintained with the distribution; using the HP drivers would require you to re-install or update them every time you receive a kernel update. The HP HBA drivers take more options; here I present a setup based on the HP guide, modified to suit the stock kernel drivers.

Tools we need:

host1:~ # rpm -qa | grep -E 'mapper|multi'
device-mapper-1.02.13-6.9
multipath-tools-0.4.7-34.18

Find out what the stock kernel driver supports:

host1:~ # modinfo qla2xxx

It shows that only one option from the HP guide is supported by the stock driver. It's probably not crucial, but since my driver supports it and HP recommends it, I might as well set it:

host1:~ # echo "options qla2xxx qlport_down_retry=1" >> /etc/modprobe.conf.local

Update the ramdisk image then reboot the server:

host1:~ # mkinitrd && reboot

After reboot ensure that modules for multipathing are loaded:

host1:~ # lsmod | grep 'dm'
dm_multipath           24456  0 
dm_mod                 66384  7 dm_multipath

Your SAN devices should be visible by now, in my case /dev/sda and /dev/sdc. Note: this may change when you add additional LUNs to the machine!
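
If you want to double-check which block devices the kernel currently sees, /proc/partitions lists them all (the sd* names mentioned above are from my system and will differ on yours):

host1:~ # cat /proc/partitions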

Find out your WWID number, it’s needed for multipath configuration:

host1:~ # scsi_id -g -s /block/sda
3600508b4001046490000700000360000

Configure multipathd according to your WWID:

host1:~ # vi /etc/multipath.conf
defaults {
        multipath_tool          "/sbin/multipath -v0"
        udev_dir                /dev
        polling_interval        5
        default_selector        "round-robin 0"
        default_path_grouping_policy    multibus
        default_getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        default_prio_callout    /bin/true
        default_features        "0"
        rr_min_io               100
        failback                immediate
}

multipaths {
        multipath {
                wwid                    3600508b4001046490000700000360000
                alias                   mpath2
                path_grouping_policy    multibus
                path_checker            readsector0
                path_selector           "round-robin 0"
        }
}

devices {
        device {
                vendor                  "HP"
                product                 "HSV200"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                path_checker            tur
                path_selector           "round-robin 0"
                prio_callout            "/sbin/mpath_prio_alua %n"
                failback                immediate
                rr_weight               uniform
                rr_min_io               100
                no_path_retry           60
        }
}

Enable services upon reboot:

host1:~ # insserv boot.device-mapper boot.multipath multipathd
host1:~ # reboot

After reboot, everything should be back. You can check your multipaths:

host1:~ # multipath -l
mpath2 (3600508b4001046490000700000360000) dm-0 HP,HSV200
[size=100G][features=0][hwhandler=0]
\_ round-robin 0 [prio=0][active]
 \_ 0:0:0:1 sda 8:0   [active][undef]
 \_ 0:0:1:1 sdc 8:32  [active][undef]

For further information please refer to the original HP guide:

http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00814876/c00814876.pdf?HPBCMETA::doctype=file

Do exactly the same on the other node as well. The only difference you may see is the /dev/sd* device names, but those don't matter. I copied multipath.conf over to the other host and then set up the services.
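
A minimal sketch of that last step, assuming root SSH access between the nodes:

host1:~ # scp /etc/multipath.conf host2:/etc/multipath.conf
host2:~ # insserv boot.device-mapper boot.multipath multipathd
host2:~ # reboot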

 

  • Heartbeat

 

 

EVMS, which we will configure in a minute, doesn't maintain cluster membership. We need heartbeat to maintain membership and to activate the EVMS volumes on every member node at startup.

Install heartbeat package first:

host1:~ # yast2 sw_single &

Select the "Patterns" filter and install the High Availability pattern, which contains heartbeat.

Configuration:

host1:~ # vi /etc/ha.d/ha.cf
autojoin any
crm true
auto_failback off
ucast eth0 10.0.0.2
node host1
node host2
respawn root /sbin/evmsd
apiauth evms uid=hacluster,root

I configured unicast simply because I prefer it over broadcast and multicast. We also have heartbeat start evmsd on the nodes (the respawn line).

Configure authentication:

host1:~ # sha1sum
yoursecretpassword
7769bf61f294d7bb91dd3583198d2e16acd8cd76  -
host1:~ # vi /etc/ha.d/authkeys
auth 1
1 sha1 7769bf61f294d7bb91dd3583198d2e16acd8cd76
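
Heartbeat refuses to start if the authkeys file is readable by anyone other than root, so tighten its permissions:

host1:~ # chmod 600 /etc/ha.d/authkeys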

Set logging:

host1:~ # vi /etc/ha.d/ha_logd.cf
logfacility     daemon

host1:~ # ln -s /etc/ha.d/ha_logd.cf /etc/logd.cf

Start it up:

host1:~ # rcheartbeat start

The other node (everything is the same except the IP of the other node):

host2:~ # vi /etc/ha.d/ha.cf
autojoin any
crm true
auto_failback off
ucast eth0 10.0.0.1
node host1
node host2
respawn root /sbin/evmsd
apiauth evms uid=hacluster,root

Configure logging and authentication as above, then:

host2:~ # rcheartbeat start

Ensure they see each other:

host1:~ # crmadmin -N
normal node: host1 (50dfbd69-7a40-484f-b548-4270b6e34251)
normal node: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48)

host2:~ # crmadmin -N
normal node: host1 (50dfbd69-7a40-484f-b548-4270b6e34251)
normal node: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48)

Guide used:

http://wiki.xensource.com/xenwiki/EVMS-HAwSAN-SLES10

 

  • Runlevels

 

 

We need to change the startup order of some services. In short: we don't want anything started automatically (by xendomains); we need full control over domain start and stop to ensure that only one VM is running at a time on any given shared block device.

This is the primary reason for not setting up a STONITH device for the Xen domUs: I can't trust HA here, simply because it could never know for sure whether the VM had really been shut down cleanly on the other node. We do need xendomains for migrating domUs over to the other node at shutdown, though. We achieve this with the following ordering:

Startup (runlevel 3-5):

    • xend starts before heartbeat (Xen changes the networking, which must be finished before heartbeat starts)
    • heartbeat starts next (ensures EVMS volume discovery)
    • xendomains starts last (does nothing at startup)

Shutdown (runlevel 6):

    • xendomains shuts down first because it was started last (we configure it to migrate running domains over to the other node; obviously xend must still be running at this stage)
    • heartbeat stops cleanly before we tear down the networking (remember that the eth0 you use for keepalive messages is a virtual interface!)
    • xend stops next, which shuts down the Xen networking, and so forth…

Service dependencies are set by comments in the init script headers. Remove xendomains from the "Should-Start" line of heartbeat:

host1:~ # vi /etc/init.d/heartbeat

-snip-
### BEGIN INIT INFO
# Provides: heartbeat
# Required-Start: $network $syslog $named
# Should-Start: drbd sshd o2cb evms ocfs2 xend
# Required-Stop:
# Default-Start:  3 5
# Default-Stop:   0 1 2 6
# Description:    Start heartbeat HA services
### END INIT INFO
-snip-

Add heartbeat to the "Required-Start" line in xendomains:

host1:~ # vi /etc/init.d/xendomains

-snip-
### BEGIN INIT INFO
# Provides:          xendomains
# Required-Start:    $syslog $remote_fs xend heartbeat
# Should-Start:
# Required-Stop:     $syslog $remote_fs xend
# Should-Stop:
# Default-Start:     3 5
# Default-Stop:      0 1 2 4 6
# Short-Description: Starts and stops Xen VMs
# Description:       Starts and stops Xen VMs automatically when the
#                    host starts and stops.
### END INIT INFO
-snip-

Remove these services from all runlevels in this order then re-activate them:

host1:~ # insserv -r heartbeat
host1:~ # insserv -r xendomains
host1:~ # insserv -r xend

host1:~ # insserv -d xend
host1:~ # insserv -d heartbeat
host1:~ # insserv -d xendomains

Ensure the right order:

host1:~ # ls -l /etc/init.d/rc3.d | grep -E 'xend|heartbeat|xendomains'
lrwxrwxrwx 1 root root 13 Aug 31 17:42 K09xendomains -> ../xendomains
lrwxrwxrwx 1 root root 12 Aug 31 17:42 K10heartbeat -> ../heartbeat
lrwxrwxrwx 1 root root  7 Aug 31 17:41 K12xend -> ../xend
lrwxrwxrwx 1 root root  7 Aug 31 17:41 S10xend -> ../xend
lrwxrwxrwx 1 root root 12 Aug 31 17:42 S12heartbeat -> ../heartbeat
lrwxrwxrwx 1 root root 13 Aug 31 17:42 S13xendomains -> ../xendomains

Do the same on the other node as well. Before we proceed with HA we need EVMS to be ready.

 

  • EVMS

 

 

EVMS is a great enterprise-class volume manager; it has a feature called CSM (Cluster Segment Manager). We will use this feature to share the block devices between the dom0 nodes. On top of CSM we use LVM2 volume management, which gives us the ultimate flexibility to create, resize and extend logical volumes.

I include only the device I want EVMS to manage at this stage; I don't want EVMS to discover other disks that I am not planning to use in this setup. The "multipath -l" output above tells you which device-mapper device you need:

host1:~ # grep . /etc/evms.conf | grep -v \#
-snip-
sysfs_devices {
        include = [ dm-0 ]
        exclude = [ iseries!vcd* ]
}
-snip-

I also disable LVM2 on the host machine to avoid it interfering with EVMS. I will use other LUNs later on that I plan to manage with LVM2, but from inside a particular VM:

host1:~ # grep . /etc/lvm/lvm.conf | grep -v \#
-snip-
devices {
    dir = "/dev"
    scan = [ "/dev" ]
    filter = [ "r|.*|" ]
    cache = "/etc/lvm/.cache"
    write_cache_state = 1
    sysfs_scan = 1
    md_component_detection = 1
}
-snip-
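
With that reject-all filter in place, the host's LVM2 tools should no longer report any physical volumes; a quick sanity check (assuming dom0 itself does not use LVM for its own disks):

host1:~ # pvscan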

Now we can create the volumes. Note: I am going to present my configuration here just for reference. If you need a step-by-step guide on how to do this, please read this document:

http://wiki.novell.com/images/0/01/CHASF_preview_Nov172006.pdf

I strongly recommend visiting the project's home page: http://evms.sourceforge.net

(Screenshots omitted: Disks, Segments, CSM container with LVM2 on top, Regions, Volumes.)

After you have created your EVMS volumes, save the configuration. To activate the changes (create the devices) on all nodes immediately, we need to run evms_activate on every other node, simply because the default behavior of EVMS is to apply changes to the local node only.

I have two nodes at this stage and only want to activate the other node:

host1:~ # evms_activate -n host2

What if I had 20 nodes? That would be a bit overwhelming, so here is one way to do it on all nodes (there are many other ways of doing this):

host1:~ # for node in `grep node /etc/ha.d/ha.cf | cut -d ' ' -f2`; do evms_activate -n $node; done

 

  • Heartbeat and EVMS

 

 

Now we will configure heartbeat to ensure that after a node failure (a reboot, a network issue, or any occasion when heartbeat stops receiving signals from the other node) EVMS volume discovery happens on the node when it re-joins the cluster (when HA starts receiving heartbeat signals again). Our heartbeat is already prepared for the new CRM-style configuration, so we need to create and load the following XML file:

host1:~ # vi evmscloneset.xml 
<clone id="evmscloneset" notify="true" globally_unique="false">
<instance_attributes id="evmscloneset">
 <attributes>
  <nvpair id="evmscloneset-01" name="clone_node_max" value="1"/>
 </attributes>
</instance_attributes>
<primitive id="evmsclone" class="ocf" type="EvmsSCC" provider="heartbeat">
</primitive>
</clone>

Load it in:

host1:~ # cibadmin -C -o resources -x evmscloneset.xml

Give it a bit of time, then check that it is operating properly:

host1:~ # crm_mon
============
Last updated: Fri Sep 14 09:20:15 2007
Current DC: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48)
2 Nodes configured.
1 Resources configured.
============

Node: host1 (50dfbd69-7a40-484f-b548-4270b6e34251): online
Node: host2 (8602848c-c8ff-4ee5-b66e-844e998dca48): online

Clone Set: evmscloneset
    evmsclone:0 (heartbeat::ocf:EvmsSCC):       Started host1
    evmsclone:1 (heartbeat::ocf:EvmsSCC):       Started host2

This has to be done on one node only, and only once. I usually make sure that HA configuration changes are done on the "DC" node, but it's not really essential.
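
To see which node currently holds the DC role, you can query crmadmin:

host1:~ # crmadmin -D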

The configuration is easier with XML files, I reckon; in fact, complex scenarios can only be done this way.
I use the HA GUI just to get an overview of the services. You either have to set a password for the "hacluster" user or add yourself to the "haclient" group to be able to authenticate:

host1:~ # hb_gui &
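
If you go with the password approach, it is as simple as:

host1:~ # passwd hacluster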

Reboot the machines several times. Ensure that one node is always up, that the DC (designated controller) role changes over in the HA cluster, and that the volumes get discovered and activated properly on all nodes.

Novell’s ultimate HA solution based on VMs running on file images and OCFS2:
http://wiki.novell.com/images/3/37/Exploring_HASF.pdf

Examples from Brainshare 2007 related to the topic above:
http://wiki.novell.com/images/c/c8/Tut323_bs2007.pdf

 

  • Xen configuration and xendomains

 

 

I am going to present here what is needed to migrate running domains over to the other node at shutdown; the complete configuration of Xen is beyond the scope of this document. It's quite straightforward:

host1:~ # grep . /etc/xen/xend-config.sxp | grep -v \#
(xen-api-server ((unix none)))
(xend-http-server yes)
(xend-unix-server yes)
(xend-relocation-server yes)
(xend-relocation-hosts-allow '^localhost$ ^localhost\\.localdomain$ ^host2$ ^host2\\.domain\\.co\\.nz$')
(network-script 'network-bridge netdev=eth0')
(vif-script vif-bridge)
(dom0-min-mem 196)
(dom0-cpus 0)
(vncpasswd '')

The other node is the same except for the relocation-hosts-allow bit; configure that as well.

Xendomains is configured in a different location:

host1:~ # grep . /etc/sysconfig/xendomains | grep -v \#
-snip-
XENDOMAINS_MIGRATE="10.0.0.2"
XENDOMAINS_SAVE=""
XENDOMAINS_AUTO_ONLY="false"
-snip-

The important parts: the IP of the other node, and forcing migration upon every domain, not just the ones specified in the /etc/xen/auto directory (which must be empty). I cleared the XENDOMAINS_SAVE setting as well.

Restart xend to apply the changes. To test the automatic migration (assuming you have a VM running), just stop xendomains, even if the service is not shown as running:

host1:~ # rcxend restart
Restarting xend (old pid 25531 25532 25560)                           done

host1:~ # rcxendomains stop
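
You can then check on the other node that the running domains have arrived:

host2:~ # xm list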

Now you are ready to rock. Create some volumes and VMs. Each of these block devices will carry its own partitioning and filesystem, according to the VM running on it.

 

  • Proof of concept

 

 

Here I explain how I tested this setup. I created a VM with 512 MB of memory which runs an NFS service. You may have wondered why the domU pool is called mpath2 and what mpath1 is: it's another LUN, but I am not using it with EVMS. Our VM is going to host user home directories, exported via NFS to store user data. The twist is that this LUN is not part of the EVMS configuration nor of the VM's own storage; I am simply exporting the device into the VM as it comes off multipathd. Here is my VM configuration:

host1:~ # cat /etc/xen/vm/nfs.xm 
ostype="sles10"
name="nfs1"
memory=512
vcpus=1
uuid="86636fde-1613-2e12-8f94-093d1e3f962e"
on_crash="destroy"
on_poweroff="destroy"
on_reboot="restart"
localtime=0
builder="linux"
bootloader="/usr/lib/xen/boot/domUloader.py"
bootargs="--entry=xvda1:/boot/vmlinuz-xenpae,/boot/initrd-xenpae"
extra="TERM=xterm "
disk=[ 'phy:/dev/evms/san2/vm2,xvda,w', 'phy:/dev/mapper/mpath1,xvdc,w' ]
vif=[ 'mac=00:16:3e:1e:11:87', ]
vfb=["type=vnc,vncunused=1"]

The VM is running on host1, everything is as presented earlier in this document.

Test 1: writing a 208 MB file to the NFS export from my workstation:

geeko@workstation:~> ls -lh /private/ISO/i386cd-3.1.iso 
-rw-r--r-- 1 geeko geeko 208M Nov  3  2006 /private/ISO/i386cd-3.1.iso 

geeko@workstation:~> md5sum /private/ISO/i386cd-3.1.iso 
b4d4bb353693e6008f2fc48cd25958ed  /private/ISO/i386cd-3.1.iso 

geeko@workstation:~> mount -t nfs -o rsize=8196,wsize=8196 nfs1:/home/geeko /mnt 

geeko@workstation:~> time cp /private/ISO/i386cd-3.1.iso /mnt

real    0m20.918s 
user    0m0.015s 
sys     0m0.737s

It wasn't very fast because my uplink was limited to 100 Mbit/s, but that's not what we are concerned about right now.
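
As a rough sanity check: 208 MB in about 21 seconds is roughly 10 MB/s, or around 80 Mbit/s, which is close to what a 100 Mbit/s link can realistically sustain.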

Now do the same thing but migrate the domain while writing to the NFS export:

geeko@workstation:~> time cp /private/ISO/i386cd-3.1.iso /mnt

Meanwhile on host1:

host1:~ # xm migrate nfs1 host2
host1:~ # xentop
xentop - 12:02:23   Xen 3.0.4_13138-0.47 
2 domains: 1 running, 0 blocked, 0 paused, 0 crashed, 0 dying, 1 shutdown 
Mem: 14677976k total, 1167488k used, 13510488k free    CPUs: 4 @ 3000MHz 
      NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR SSID 
  Domain-0 -----r       2754   47.6     524288    3.6   no limit n/a     4    4  1282795  4132024    0        0        0        0    0 
migrating-nfs1 -s----          8    0.0     524288    3.6     532480      3.6     1    1    17813   433132    3        0       99    10714 

real    0m41.221s 
user    0m0.020s 
sys     0m0.772s

As you can see, it took roughly twice as long, but:

nfs1:~ # md5sum /home/geeko/i386cd-3.1.iso 
b4d4bb353693e6008f2fc48cd25958ed  /home/geeko/i386cd-3.1.iso 

The md5sum matches, and that is what I wanted to see from the NFS VM. Check the file system as well, just in case (on the NFS VM I used LVM2 on top of xvdc (mpath1) with XFS):

nfs1:~ # umount /home
nfs1:~ # xfs_check /dev/mapper/san1-nfshome 
nfs1:~ #

No corruption found.

Ingredients used:

  • OS: SLES10SP1
  • HW: HP DL360G5
  • SAN: HP EVA6000
  • HBA: QLA2432
  • Heartbeat 2.0.8
  • EVMS: 2.5.5
