SUSE Conversations


Implementing MPIO over iSCSI: Considerations, Common Issues and Clustering Concerns



By: utlemming

November 7, 2008 12:12 pm


Scope

This paper seeks to demystify multipath over the software iSCSI initiator and to address common concerns. Most notably, it addresses questions about fail-over, OCFS2 and Linux Heartbeat.

Why?

While it is true that the TCP/IP stack has built-in redundancy, the idea of highly available systems is to reduce single points of failure. There are scenarios when multipath on software iSCSI devices is desirable. However, when introducing redundancy there are several issues that need to be addressed in order to make it work properly.

These recommendations come from countless people asking about how to test it and set it up.

General Recommendation

If you require even greater reliability, consider an iSCSI HBA card. QLogic and other vendors have iSCSI NICs that present the iSCSI devices as if they were native hardware devices. These cards are a little on the expensive side; however, they work very well. If you use an iSCSI HBA, then set up multipath according to the vendor recommendations.

Operating Axiom

This paper assumes that you have set up the open-iscsi software initiator and that the targets are connected; there are countless articles about getting iSCSI working, including several good Cool Solutions articles.

Supportability

While I cannot speak officially for Novell, this is believed to be supported. It is, however, rarely done using the iSCSI software initiator. As such, should you contact Novell about this setup, realize that this work is my own and Novell may not support it.

With that said, Novell supports the individual pieces covered in this article fully. To date, I have not seen anything indicating that Novell has tested this configuration.

NIC-based configuration

  1. Dedicated network hardware should be used for iSCSI environments. The entire network stack should be dedicated to the use of iSCSI, from the cables to the switches and NICs. Many people try to get by using VLANs, but high switch utilization can introduce performance issues. I strongly discourage the use of VLANs simply because the infrastructure can be shared. Also, in some VLAN environments the iSCSI traffic is switched down the trunk, resulting in lower bandwidth. Depending on the application (GroupWise, clustering and other high data-transfer applications, for example), you will need the full 1 Gb of bandwidth to and from the target. Using VLAN trunks, you may not be able to get the full bandwidth.

  2. At least two NICs. In almost every case of iSCSI, I recommend at least two NICs dedicated to iSCSI.

  3. Bonding of NICs: In the event that you bond the NICs, I suggest the use of “dynamic link aggregation” (IEEE 802.3ad), or bonding mode 4. The main advantage of mode 4 is that it is switch managed and will allow more connections at higher speeds. Other modes either use one active link at a time (like mode 1) or play an ARP switching game (like mode 6). These other modes can provide higher reliability, but you can saturate one link and underutilize the other. Dynamic link aggregation allows the switch to manage the connection and allows for more efficient management of the data. Further, you can in theory carry more connections at full 1 Gb speed: two NICs in dynamic link aggregation allow two 1 Gb connections, and three allow three. I would also recommend that you have multiple bonded pairs on separate networks.

  4. Unbonded NICs: If you choose to have unbonded NICs, these NICs need to be on separate physical hardware. See item 1.

  5. Use higher quality NICs: Don’t buy consumer-level Ethernet cards. To some this may sound like a real no-brainer, but some people don’t see the value in purchasing a high quality enterprise-class NIC. If the NIC is low quality, you will have problems.

  6. Do not route traffic: Again, this may seem obvious. However, do not plug your iSCSI SAN into the regular network. If there is a failure of a LAN segment or NIC, you do not want to have the traffic route over your regular LAN.

 

iSCSI initiator settings

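The hardest part of the multipath configuration is setting up the iSCSI settings. The following settings need to be tweaked in the open-iscsi initiator configuration file, iscsid.conf (typically /etc/iscsid.conf or /etc/iscsi/iscsid.conf, depending on the distribution):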

Specify the length of time to wait for a session to re-establish before failing the SCSI command. Default is 120 seconds, which is too long when you want to continue writing data. In cases of clustering this is way too long.

	node.session.timeo.replacement_timeout

 

How often to send an iSCSI Nop-out ping to the target. The default is every 10 seconds. Again, for clustering this is way too long.

	node.conn[0].timeo.noop_out_interval

 

How long to wait for a reply from the Nop-out ping. The default is 15 seconds. Again, bad for clustering.

	node.conn[0].timeo.noop_out_timeout

 

As an example:

	node.session.timeo.replacement_timeout=5
	node.conn[0].timeo.noop_out_interval=2
	node.conn[0].timeo.noop_out_timeout=2

 

(I recommend that you experiment with these settings. Blindly implementing these values could be extraordinarily bad. Adjust them as needed and seek a good balance. The goal is to make sure that the timeout happens in a timely fashion.)
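As a rough illustration of applying the example values, here is a minimal Python sketch that rewrites the three timeout settings in the text of an initiator configuration file. The function name apply_timeouts and the sample text are my own inventions for illustration, and the values are only the example values from above, not recommendations. Also keep in mind that, as far as I know, open-iscsi copies these defaults into each node record at discovery time, so already-discovered targets may need their records updated separately (for instance with iscsiadm in node mode).

```python
# Example values from the article; tune them for your own environment.
EXAMPLE_TIMEOUTS = {
    "node.session.timeo.replacement_timeout": 5,
    "node.conn[0].timeo.noop_out_interval": 2,
    "node.conn[0].timeo.noop_out_timeout": 2,
}


def apply_timeouts(conf_text, timeouts):
    """Return conf_text with every key in `timeouts` set to its new value.

    Uncommented "key = value" lines are rewritten in place, commented
    lines are left alone, and settings that are missing are appended.
    """
    seen = set()
    out = []
    for line in conf_text.splitlines():
        key, sep, _ = line.partition("=")
        key = key.strip()
        if sep and key in timeouts:
            out.append(f"{key} = {timeouts[key]}")
            seen.add(key)
        else:
            out.append(line)
    # Append any setting that the file did not already contain.
    for key, value in timeouts.items():
        if key not in seen:
            out.append(f"{key} = {value}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        "node.session.timeo.replacement_timeout = 120\n"
        "# a comment that should survive\n"
        "node.conn[0].timeo.noop_out_interval = 10\n"
    )
    print(apply_timeouts(sample, EXAMPLE_TIMEOUTS), end="")
```

The sketch works on text rather than on a file path so that you can review the result before writing anything back to disk.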

Multipath Configuration

The default multipath configuration should work just fine with your iSCSI connection. Work with your iSCSI vendor if you would like to do some tweaking. The only change that you may need to make is to the polling interval (how often multipath checks that the device is still there).

If you don’t have an /etc/multipath.conf, populate it with:

	defaults {
		polling_interval "5"
	}

 

If you want to make life a little easier, you may want to put in an alias for the WWID. Basically, this maps the WWID to a much more manageable name. After all your other options, you define the aliases by putting something similar to the following in your /etc/multipath.conf:

	multipaths {
		multipath {
			wwid 	1494554000000000031000000000000000000000000000000
			alias	mpio-disk1
		}
	}
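
If you have more than a handful of LUNs, typing these stanzas by hand gets error-prone. The following is a small Python sketch that renders the two sections shown above from a polling interval and a WWID-to-alias mapping. The function name render_multipath_conf is hypothetical, and the output only covers the options discussed in this article; any vendor-specific device settings would still have to be added by hand.

```python
def render_multipath_conf(polling_interval, aliases):
    """Build multipath.conf text from a polling interval and {wwid: alias}."""
    if len(set(aliases.values())) != len(aliases):
        raise ValueError("each WWID needs a unique alias")

    lines = [
        "defaults {",
        f'\tpolling_interval "{polling_interval}"',
        "}",
    ]
    if aliases:
        lines.append("multipaths {")
        for wwid, alias in aliases.items():
            lines += [
                "\tmultipath {",
                f"\t\twwid\t{wwid}",
                f"\t\talias\t{alias}",
                "\t}",
            ]
        lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The WWID and alias are the ones used in the example above.
    print(render_multipath_conf(
        5, {"1494554000000000031000000000000000000000000000000": "mpio-disk1"}
    ), end="")
```

The duplicate-alias check is there because two maps sharing one alias is an easy mistake to make when copying stanzas around.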

 

Testing the fail over

The great thing about testing the fail-over is that it is incredibly easy to test the most likely failures. With iSCSI there are a couple of useful scenarios, such as NIC failures and cabling or transport failures. Fortunately, there are two commands that enable this testing.

The Linux TCP/IP stack supports a routing feature called “blackholes.” Essentially any traffic destined for a blackholed IP address or subnet is sent to a bit-bucket.
To add a blackhole:

	ip route add blackhole 192.168.100.10

 

To remove a blackhole:

	ip route del blackhole 192.168.100.10

 

When you test the fail-over you’ll see something like:

	Oct 14 14:24:39 bhoward3 multipathd: sdb: readsector0 checker reports path is down
	Oct 14 14:24:39 bhoward3 multipathd: checker failed path 8:16 in map mpio-disk1
	Oct 14 14:24:39 bhoward3 kernel: device-mapper: dm-multipath: Failing path 8:16.
	Oct 14 14:24:40 bhoward3 multipathd: mpio-disk1: remaining active paths: 3
	Oct 14 14:24:40 bhoward3 multipathd: sdd: readsector0 checker reports path is down
	Oct 14 14:24:40 bhoward3 multipathd: checker failed path 8:48 in map mpio-disk1
	Oct 14 14:24:40 bhoward3 multipathd: mpio-disk1: remaining active paths: 2
	Oct 14 14:24:40 bhoward3 kernel: device-mapper: dm-multipath: Failing path 8:48.

 

When you remove the blackhole, you’ll see something like the following:

	Oct 14 14:25:21 bhoward3 iscsid: cannot make a connection to 151.155.241.72:3260 (101)
	Oct 14 14:25:22 bhoward3 iscsid: connection3:0 is operational after recovery (19 attempts)
	Oct 14 14:25:23 bhoward3 multipathd: sdb: readsector0 checker reports path is up
	Oct 14 14:25:23 bhoward3 multipathd: 8:16: reinstated
	Oct 14 14:25:23 bhoward3 multipathd: mpio-disk1: remaining active paths: 3
	Oct 14 14:25:23 bhoward3 multipathd: sdd: readsector0 checker reports path is up
	Oct 14 14:25:23 bhoward3 multipathd: 8:48: reinstated
	Oct 14 14:25:23 bhoward3 multipathd: mpio-disk1: remaining active paths: 4
	Oct 14 14:25:23 bhoward3 iscsid: connection1:0 is operational after recovery (20 attempts)

 

Obviously this procedure only tests the transport capabilities and does not address NIC failures in a bonding situation. In those cases, you can test by simply unplugging the NICs.
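
When you run these tests repeatedly, it can help to pull the path counts out of the log instead of reading it by eye. Below is a minimal Python sketch that scans syslog lines for the multipathd “remaining active paths” messages shown above and records the count history per map. The regular expression is based only on the log format in this article, so treat it as an assumption; other multipath-tools versions may word the message differently.

```python
import re

# Matches e.g. "multipathd: mpio-disk1: remaining active paths: 3"
PATH_COUNT = re.compile(
    r"multipathd: (?P<map>\S+): remaining active paths: (?P<count>\d+)"
)


def active_path_history(log_lines):
    """Return {map_name: [path counts in the order they were logged]}."""
    history = {}
    for line in log_lines:
        match = PATH_COUNT.search(line)
        if match:
            history.setdefault(match.group("map"), []).append(
                int(match.group("count"))
            )
    return history


def lowest_path_count(log_lines, map_name):
    """Return the smallest logged path count for map_name, or None."""
    counts = active_path_history(log_lines).get(map_name)
    return min(counts) if counts else None


if __name__ == "__main__":
    sample = [
        "Oct 14 14:24:40 bhoward3 multipathd: mpio-disk1: remaining active paths: 3",
        "Oct 14 14:24:40 bhoward3 multipathd: mpio-disk1: remaining active paths: 2",
        "Oct 14 14:25:23 bhoward3 multipathd: sdb: readsector0 checker reports path is up",
        "Oct 14 14:25:23 bhoward3 multipathd: mpio-disk1: remaining active paths: 4",
    ]
    print(active_path_history(sample))
    print(lowest_path_count(sample, "mpio-disk1"))
```

If the lowest count ever reaches zero during a test, the map lost all of its paths at once and the fail-over did not actually protect you.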

Software iSCSI-Target and Multipath

The software target iSCSI device will not show when you type “multipath -ll.” However, the targets will be multipathed. During the testing done while writing this article, the paths were successfully detected and failed over; they just were not shown. You can make sure that the target is being seen by going to /dev/disk/by-name and finding the multipathed LUN.

NSS and other Heartbeating Clusters

You WILL need to adjust any disk-based heartbeating technologies appropriately based on the underlying layers. In cluster scenarios the heartbeat threshold must be higher than the multipath polling interval, and the polling interval must be higher than the iSCSI failure timeout. If the iSCSI device will fail after 5 seconds, then the polling interval must be at least 6 seconds, but more likely 12 seconds, and the heartbeat threshold should then be set to at least 24 seconds, or 36 seconds for more reliability.

The reason you need to tune the heartbeat threshold and polling interval is the time that it takes to detect a failure. If a failure happens, it can take up to the sum of the “noop_out_interval” (NOI) and the “noop_out_timeout” (NOT) to propagate the error up to the multipath layer. Then, depending on what the “polling_interval” (PI) for MPIO is set to, it could take up to NOI+NOT+PI before a heartbeat can be written. If the heartbeat threshold is set too low, i.e. less than NOI+NOT+PI, then you could end up with a split-brain condition in a two-node cluster, or with fencing or poison-pill operations. In short, this could result in the cluster being unstable in a fail-over situation. In a cluster you must plan for worst-case scenarios, and in this case you need to plan for a device failure that happens immediately after MPIO has polled.
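
The arithmetic is simple, but it is easy to get wrong when several people maintain different layers. Here is a minimal Python sketch of the NOI+NOT+PI check described above. The function names are my own, and all values are assumed to be in seconds; the example numbers are the sample iSCSI and multipath settings from earlier in this article.

```python
def worst_case_detection(noop_out_interval, noop_out_timeout, polling_interval):
    """Worst-case seconds before a path failure reaches the heartbeat layer.

    This is the NOI + NOT + PI sum from the text: the failure happens
    right after a Nop-out was sent and right after MPIO polled.
    """
    return noop_out_interval + noop_out_timeout + polling_interval


def heartbeat_is_safe(heartbeat_seconds, noop_out_interval,
                      noop_out_timeout, polling_interval):
    """True if the heartbeat timeout exceeds the worst-case detection time."""
    return heartbeat_seconds > worst_case_detection(
        noop_out_interval, noop_out_timeout, polling_interval
    )


if __name__ == "__main__":
    # NOI=2, NOT=2 and PI=5 are the example values used in this article.
    print(worst_case_detection(2, 2, 5))
    print(heartbeat_is_safe(24, 2, 2, 5))
    print(heartbeat_is_safe(8, 2, 2, 5))
```

Passing this check is a minimum, not a target; as noted above, leaving generous headroom on top of the sum is the safer choice.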

OCFS2 Concerns

Loading OCFS2 on an iSCSI device can be problematic. Very problematic. OCFS2 can use either disk-based heartbeat or user-space heartbeat. If you are using disk heartbeating, you may find that the server randomly reboots, freezes or crashes for no apparent reason. In fact, most of the time this is caused by an OCFS2 operation called “fencing.”

Fencing happens when the ability to read or write the heartbeat to disk is unstable or delayed for too long. The fencing operation can be issued by another node in the cluster or by the node itself. A fence is manifested as a “kernel panic.” The logic of OCFS2 is that the data is more important than the server; in order to minimize the risk of data corruption, the node will be fenced or will fence itself.

The problem is that under SLES 10 SP1 the default heartbeat threshold is set to 7, and under SLES 10 SP2 it is set to 31. The threshold counts two-second heartbeat iterations rather than seconds, so these defaults correspond to timeouts of roughly 12 and 60 seconds. Under high I/O scenarios on iSCSI this number can be painfully low. To increase the likelihood of success, this number should be increased incrementally. It is not uncommon to find OCFS2 heartbeat settings between 60 and 90, and even as high as 120. You will want to do some performance testing to make sure that the number you settle on is correct.
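
One thing that trips people up is that the OCFS2 heartbeat threshold is a count, not a number of seconds. The Python sketch below converts between the two, assuming the relationship given in the OCFS2 documentation (threshold = timeout in seconds / 2 + 1, with a two-second heartbeat interval). The function names are hypothetical; check the documentation for your OCFS2 version before relying on the formula.

```python
HEARTBEAT_INTERVAL = 2  # seconds between disk heartbeats (assumed default)


def threshold_to_seconds(threshold, interval=HEARTBEAT_INTERVAL):
    """Seconds a node may miss heartbeats before it is fenced."""
    return (threshold - 1) * interval


def seconds_to_threshold(seconds, interval=HEARTBEAT_INTERVAL):
    """Threshold value that gives (at least roughly) the requested timeout."""
    return seconds // interval + 1


if __name__ == "__main__":
    for threshold in (7, 31):
        print(threshold, "->", threshold_to_seconds(threshold), "seconds")
    print("120 seconds ->", seconds_to_threshold(120))
```

Whatever value you pick, compare the resulting number of seconds against the worst-case failure detection time of the iSCSI and multipath layers underneath it.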

NSS and traditional filesystems

Single-system file systems should work just fine on top of a multipath iSCSI configuration.

Revisiting the NIC Question

This may seem rather redundant, but it cannot be overstated. In every case where I have worked on an OCFS2 or Linux Heartbeat cluster that was having stability problems, putting OCFS2 and Linux Heartbeat on a separate network was the difference between stability and an angry customer. If OCFS2, Linux Heartbeat, and any other system service share the same physical connection, then the FIRST TROUBLESHOOTING step should be to move the clustering network and the OCFS2 network onto a separate, private network.

Of course, when I mention this, the first thing that people say is that they want nodes to fail if clients are unable to get to the service in question (such as a data volume that is served out via Samba, or a print server, etc.). In Linux Heartbeat, the “pingd” resource agent allows you to define where a resource runs based on network connectivity. Under NCS clustering in OES 2, there is some monitoring for an IP resource that will cause a fail-over if the IP resource disappears. Regardless of this concern, the most important concern is that the cluster and OCFS2 stay up. If the clustering technology requires a stable link, then the primary concern should be a stable link, followed by whether or not the service is accessible. As counter-intuitive as this may seem, failure to make sure that the cluster and the data source are stable can and will result in data corruption. And in computers, the data is more important than the services themselves.

Categories: SUSE Linux Enterprise Server, Technical Solutions

Disclaimer: As with everything else at SUSE Conversations, this content is definitely not supported by SUSE (so don't even think of calling Support if you try something and it blows up).  It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.

3 Comments

  1. By:Anonymous

    It’s especially important to make sure you show iSCSI the proper respect. Many networks treat iSCSI with a very casual attitude: “it’s just Ethernet.” That’s a gross miscalculation and a recipe for disaster. Looks like you figured that out. Bravo.

  2. By:ngc-dna

    I was wondering what version of SUSE you were using to do this. I have heard certain versions might not support doing MPIO with software initiators. I was wondering if you had any knowledge of this.

  3. By:DaveHowe

    Yup, this is true.
    The SLES 9 build comes with an initiator that can do multipath.
    The SLES 10 build comes with one that doesn’t – if you want it, you would have to add it yourself.
