LJDT: iSCSI for Shared Storage
So first let’s cover the basics. Shared storage, as its name implies, is storage shared among multiple machines. You can probably lump Network-Attached Storage (NAS) into this but the typical high-performance acronym is SAN which stands for Storage Area Network. A SAN typically works at the block level (lower) while a NAS environment works at the file level (higher) which also contributes to the performance benefits for a SAN. Tied to a SAN are mystical terms like Fibre Channel which is essentially one way to attach machines to high-speed storage.
A SAN is usually a hardware-based solution where machineA and machineB have a dedicated network to attach to sanC and share the storage on there in the form of LUNs (Logical Unit Numbers, and a fairly common SCSI term). SANs, because of their performance and ability to do things quickly and well, are typically expensive. Surprisingly enough I don’t have one next to my desk here at work, or even one at home (for some reason). I have decent-sized non-SCSI non-RAID hard drives in my machines but that doesn’t exactly help me with shared storage, or at least that was my mistaken impression until several days ago. iSCSI is actually one implementation of SAN technology with Fibre Channel being the other popular implementation I know about though one that requires proprietary hardware including cabling.
iSCSI stands for Internet SCSI (Small Computer Systems Interface) and basically means a way of doing shared storage without proprietary hardware. Why would you want to do this? Well, cost is the big reason. Many will likely tell you that iSCSI is nothing more than fecal remains from cheap non-SCSI implementers, and while it is inexpensive I do not believe that means it does not have its place. The performance of Fibre Channel is made to be top of the line and so is the price. iSCSI may not be able to perform with the best of them but so far my experience with it has been just fine for my purposes and many companies and organizations implement it in their largest production environments. I’ll leave the performance debate for those who have more time and expertise in the area and for now let’s just leave it as, it works well enough for almost anything and the rest is outside my budget to even consider anyway.
So what is Shared Storage used for? High Availability (HA) systems need this in many cases so that when one server explodes another server can pick up with the exact same dataset and continue processing. If you need five-nines (99.999%) uptime the leeway for failure is too small to wait for somebody to actually go and turn the computer back on, much less replace it if hardware is bad or identify the problem if it is not known. Shared storage means you can have all of your user data in one location (RAID array on a dedicated system, for example) and then have access to it from anywhere else simultaneously. Reading the shared data simultaneously is usually not a big problem though shared storage systems must address the potential for multiple writers. I’ll probably save most of that for another day but suffice it to say that iSCSI allows multiple writers of data simultaneously and leaves it up to the filesystem and other components to implement locking properly.
One example of shared storage that is in my area product-wise at Novell is for eDirectory and Identity Manager. Historically (for about seventeen years now) eDirectory has natively supported fail-over and other aspects of high availability through replication of data to multiple locations. “Partitioning” the tree and placing replicas at various locations is trivial and even done automatically by eDirectory so if one server explodes it is still possible for clients to hit another server and just keep on running. eDirectory is, in LDAP terms, a multi-master setup which means each replica can be written to simultaneously, but this is not done via shared storage. Adding Novell Identity Manager (IDM) to the mix means quite a bit more complexity.
IDM is typically replica-specific in its operations so it doesn’t care about what other replicas hold for its short-term use (though that’s simplifying things for this article; see the documentation for details) and a given driver config will typically run on one system. When the system running IDM fails, the various applications provisioning identities to and from the Identity Vault suddenly stop provisioning. This can be painful for environments where users are actively making changes; for example if I change my password in one environment and IDM is supposed to change my password in various other systems automatically (saving me changing passwords in a dozen systems all on my own) then I cannot log in to those other systems with my new password until IDM does its job. If it is down then I call the helpdesk which is not what the helpdesk wants. Enter shared storage with eDirectory.
With shared storage and a properly-configured eDirectory/IDM environment it is possible to have two physical machines (node0 and node1) both running the same eDirectory Data InfoBase (DIB) in an active/passive configuration. node0 runs and runs until it suddenly gets kicked by the janitor and then node1, detecting its partner’s demise, fires up the same eDirectory “instance” in its own circuitry within a negligible timeframe. Both servers are essentially acting as the same eDirectory server and to do that shared storage is utilized. In this example a Fibre Channel or iSCSI SAN environment could be used to keep the data stored in a safe environment (clear of cleaning crews’ cleats) so when node0 dies and node1 picks up it basically looks like a quick eDirectory restart instead of a complete rebuild of a server that was broken. This also means that the data are stored on, presumably, better storage devices that are faster and dedicated to the task of data storage and delivery. Since IDM is particular about its eDirectory replicas and is only working within the one eDirectory server it can keep on functioning after one of the nodes running it has been drowned in the ocean, as long as another node can replace it.
So with all of this prelude how does it actually work? iSCSI (or any other shared storage really) needs to make the Operating System (OS) think it has a local disk when, in fact, it does not. This lets all of the other layers above the hardware just work as if nothing was different; there is no need for funky network protocols, strange file transfer rules, nighttime synchronization schedules, etc. In the example above eDirectory literally believes that /dev/sdb1 is local and an independent disk dedicated to holding eDirectory information.
The “client” side of the iSCSI interaction is referred to as the iSCSI initiator. It is the one that initiates the connection to the server side of the connection referred to as the iSCSI target. The target is the one targeted by the client and which actually has the data accessible locally somehow. A target can be accessed by multiple initiators at the same time for the storage to be shared.
The actual client/server (hereafter initiator/target) connection is done over a typical TCP/IP network. For best performance the network used should be one dedicated to the task using the best hardware possible (gigabit, or ten-gigabit ethernet). On the security side of it adding encryption to a SAN usually means a lot of overhead that you do not want, and so iSCSI does not typically implement it. This means on the network side it is important to have the data secured between the initiator and the target as it is possible to see the traffic on the wire with a network capturing utility, or even possible to inject new data into the transfer corrupting data on the target without the initiator knowing immediately. Authentication is possible between the initiator and target to keep somebody from connecting to a target that is not their own.
So we have the basics… a client, a server, optional authentication, and data moving between the two over the (preferably) dedicated network. The case that led me to wanting to do this in the first place revolved around setting up Oracle Real Application Clusters (RAC) which requires shared storage. In this case Oracle has their own filesystem (Oracle Cluster File System 2 (OCFS2)) which controls the multiple-writers issue along with the Oracle Database (DB) software so I’ll show how I managed to set that up. To get the Oracle side going I used the following link which is a bit dated but gets the point across, though with a distinct lack of brevity:
To accomplish this on SLES 10 (SP2) I had to add a few packages that let OCFS2 and iSCSI work nicely, but they all come with SLES so that was fairly trivial. The resulting packages follow:
ab@mybox0:~> rpm -qa | grep -i ocfs
ocfs2console-1.4.0-0.3
ocfs2-tools-1.4.0-0.3
ocfs2-tools-devel-1.4.0-0.3
ab@mybox0:~> rpm -qa | grep -i iscsi
yast2-iscsi-server-2.13.26-0.3
iscsitarget-0.4.15-9.4
yast2-iscsi-client-2.14.42-0.3
open-iscsi-2.0.707-0.44
As you can see I have packages for OCFS2 to work properly as well as some iSCSI packages which give me the tools I needed there. As a note the iscsitarget package is specific to the target machine. In my setup I wanted to limit machines as much as possible while still having a true “cluster” so I am using one physical machine to host two Virtual Machines (VMs). mybox0 is the iSCSI target and also runs an iSCSI initiator on the same box so it can be a client to itself. mybox1 is simply an iSCSI initiator. Both machines in my setup are running Oracle using the OCFS2 filesystem on the iSCSI-accessible partitions though that will not matter much to most. The virtual machines, in my case, are Xen VMs on a SLES 11 x86_64 host though using another virtualization technology like KVM/QEMU or VMware would also be acceptable and does not really affect the outcome of this exercise except perhaps in terms of performance.
The next question that came up was how in the world to get a disk that I wanted to share. Unfortunately I built these VMs for generic use and did not have any extra space on the disk for things like shared storage. I could have added a second “hard drive” to the VM without too much effort but an easier solution came to mind and can be implemented instantly. Using the ‘dd’ command I created a sparse file of the size I wanted (less than the free space on the VM’s hard drive, of course), placed it in /var/iscsidata for access later and then pointed to that with my iSCSI target configuration. For those not aware ‘dd’ is a great tool, and one you should learn about immediately. For brevity I will just tell you that the command below creates a forty-gigabyte sparse file meaning it is a file that natively takes up very little space but can grow to forty gigabytes:
ab@mybox:~> sudo dd if=/dev/zero of=/var/iscsidata/data0 bs=1 count=1 seek=40G
The command takes almost no time at all to run (less than one second) and you end up with one file located where you specified in the ‘of’ (outfile) parameter. To anything that wants to use this we have just created forty gigabytes of nothing. The filesystem where the file is located on the host (/var/iscsidata) will handle giving it more blocks as it needs to grow and in the meantime the internals do not matter to the host, which is to our benefit. Now we will create the rest of the iSCSI target setup.
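If you want to convince yourself that a file created this way really is sparse, comparing its apparent size with the space it occupies on disk does the trick. The sketch below is an illustration rather than part of the setup above: the temporary directory and the 64 MiB size are assumptions picked so it is safe to run anywhere, and it relies on GNU dd, stat and du as found on a typical Linux system.

```shell
#!/bin/sh
# Sketch: create a small sparse file and compare apparent vs. allocated size.
# The temporary directory and 64 MiB size are illustrative stand-ins for
# the article's /var/iscsidata/data0 and forty gigabytes.
workdir=$(mktemp -d)
img="$workdir/sparse.img"

# count=0 writes no data at all; seek=64 (with bs=1M) just moves the end
# of the file out to 64 MiB, leaving a hole instead of real blocks.
dd if=/dev/zero of="$img" bs=1M count=0 seek=64 2>/dev/null

apparent=$(stat -c %s "$img")            # size in bytes as applications see it
allocated_kb=$(du -k "$img" | cut -f1)   # space actually used on disk, in KiB

echo "apparent bytes: $apparent"
echo "allocated KiB:  $allocated_kb"

rm -rf "$workdir"
```

On a filesystem that supports sparse files the allocated figure should come out far smaller than the apparent one, which is exactly why the forty-gigabyte file appears instantly.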
ab@mybox:~> sudo /sbin/yast iscsi-server
Going into YaST directly into the iSCSI “server” (target) configuration we have options to control the server, other global settings, and finally the various targets.
Under the Service section it makes sense to have this service start on system startup so it is always available. Also it makes sense to have the system open the firewall port for us for iSCSI traffic so we do not need to do that later after hours of hair-tugging troubleshooting.
Under the Global section we can set up authentication if desired, though for my test I abstained from this (it is an isolated VM network so I’m not concerned about intruders at this point).
Finally under Targets is where the real meat resides. The default target itself is named with a big long value of ‘iqn.2009-06.novell.lab:d7998b3b-6622-4c90-9128-0f8d65d1’ by default in my system though within that target I can define the LUNs which will be mounted on the initiator when a connection is made to the target. Selecting the ‘Edit’ option I specify a new LUN (number 0) and point it to the file I created with the ‘dd’ command above by either entering, or browsing to, the path. It is also possible to create additional targets with their own LUNs for a lot of configuration options out of the box.
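For reference, the iscsitarget (IET) daemon reads its targets from a plain-text configuration file, normally /etc/ietd.conf. The following is a hand-written sketch of an entry equivalent to the YaST steps above, not a capture of what YaST actually generated, so treat the exact layout as an assumption. It writes to a scratch file rather than touching the real configuration.

```shell
#!/bin/sh
# Sketch only: write an IET-style target definition to a scratch file so it
# can be compared by eye with the real /etc/ietd.conf. Nothing here changes
# the running target.
scratch=$(mktemp)

cat > "$scratch" <<'EOF'
# Target name as shown by YaST in the article. LUN 0 is backed by the
# sparse file created with dd; Type=fileio selects file-backed I/O.
Target iqn.2009-06.novell.lab:d7998b3b-6622-4c90-9128-0f8d65d1
    Lun 0 Path=/var/iscsidata/data0,Type=fileio
EOF

# Count the LUN lines in the sketch and show the result.
lun_count=$(grep -c '^[[:space:]]*Lun ' "$scratch")
echo "LUN entries in sketch: $lun_count"
cat "$scratch"

rm -f "$scratch"
```

Additional LUNs would be further Lun lines under the same Target, and additional targets would be further Target stanzas.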
When the LUNs are configured I ‘Finish’ to exit out of YaST and should have a service running and listening as shown below:
ab@mybox0:~> ps aux | grep ietd
root 9184 0.0 0.1 6200 800 ? Ss 23:44 0:00 /usr/sbin/ietd
ab@mybox0:~> netstat -anp | grep ietd
tcp 0 0 0.0.0.0:3260 0.0.0.0:* LISTEN 9184/ietd
Because I asked YaST to open the firewall for me, port 3260 is also open in the firewall as well as being bound to the listening ‘ietd’ process which, not coincidentally, is the one from my iscsitarget package:
ab@mybox0:~> sudo /usr/sbin/iptables-save | grep 3260
-A input_ext -p tcp -m limit --limit 3/min -m tcp --dport 3260 --tcp-flags FIN,SYN,RST,ACK SYN -j LOG --log-prefix "SFW2-INext-ACC-TCP " --log-tcp-options --log-ip-options
-A input_ext -p tcp -m tcp --dport 3260 -j ACCEPT
ab@mybox0:~> rpm -qf /usr/sbin/ietd
iscsitarget-0.4.15-9.4
Going to my client machine I want to ensure everything is working correctly and I can reach my server. As I mentioned earlier there are two machines that will be initiators or clients and one of them is also the target machine. This is a bit non-standard but it works well for my get-it-going-quickly test. The same test is applicable for both machines though testing locally doesn’t get affected by the default firewall if it did happen to still be in the way:
ab@mybox0:~> netcat -zv mybox0 3260
mybox0 [188.8.131.52] 3260 (iscsi-target) open
So as we can see the TCP port 3260 is open and reachable from the local system. Running the same command from all other initiators also works so I know the firewall is not in the way of traffic on this port currently. Configuring the initiator is as easy as configuring the target. Let’s start by going into YaST:
ab@mybox0:~> sudo /sbin/yast iscsi-client
Once inside YaST I am again given options for Service and this time I also have options for Connected Targets as well as Discovered Targets.
Under Service I go ahead and set it to start ‘When Booting’ as that makes sense to me.
Under Connected Targets I currently have nothing though that will change soon as I proceed to Discovered Targets. Selecting the option for ‘Discovery’ in this section I enter the IP address of my target system along with the port which is already defaulted to 3260. Setting up authentication as configured on the Target would be trivial here as well but for this example I did not do it so I select ‘Next’ which takes me to the list of what it discovered. I can now click ‘Log In’ and log in to the target to get access to its LUNs. At this point if I go to Connected Targets I see that I am logged into my target properly. The only thing left is to set this to be ‘Automatic’ instead of ‘Manual’ so on a reboot the service not only starts but this connection is re-established. Once this is done and I use ‘Finish’ to exit YaST I should have everything set up and running on the iSCSI side.
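For those who prefer the command line, the same discovery, login and automatic-startup steps can be sketched with iscsiadm from the open-iscsi package. This is an approximation of what YaST does rather than a record of it: the IP address is a placeholder from the documentation range, and option names such as node.startup may differ between open-iscsi versions, so check the iscsiadm man page on your system. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: command-line equivalent of the YaST iscsi-client steps.
# Prints the commands by default; set DRY_RUN=0 (as root, on a real
# initiator) to execute them.
TARGET_IP=${TARGET_IP:-192.0.2.10}   # hypothetical address; use your target's IP
TARGET_IQN=${TARGET_IQN:-iqn.2009-06.novell.lab:d7998b3b-6622-4c90-9128-0f8d65d1}
PORTAL="$TARGET_IP:3260"

# Either echo the command (dry run) or execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# 1. Discovery: ask the portal which targets it offers.
run iscsiadm -m discovery -t sendtargets -p "$PORTAL"

# 2. Log in to the target so its LUNs show up as local SCSI disks.
run iscsiadm -m node -T "$TARGET_IQN" -p "$PORTAL" --login

# 3. Re-establish the session at boot instead of requiring a manual login.
run iscsiadm -m node -T "$TARGET_IQN" -p "$PORTAL" --op update -n node.startup -v automatic
```

Keeping the dry-run default means the script can be read and rehearsed on any machine before it is pointed at a real target.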
A few things are left to do, such as making sure it is actually working. Looking for established TCP connections to the system is an easy place to start:
ab@mybox0:~> netstat -anp | grep 'ESTAB' | grep 3260
tcp 0 0 184.108.40.206:1740 220.127.116.11:3260 ESTABLISHED 2142/iscsid
tcp 0 0 18.104.22.168:3260 22.214.171.124:1740 ESTABLISHED 9184/ietd
Also I can use the ‘fdisk’ command to list all the drives and partitions available to my system:
ab@mybox0:~> sudo /sbin/fdisk -l
Disk /dev/xvda: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 16450560 bytes
    Device Boot  Start    End    Blocks  Id  System
/dev/xvda1           1    131   1052226  82  Linux swap / Solaris
/dev/xvda2         132   5221  81770850  83  Linux
Disk /dev/sda: 42.8 GB, 42949672960 bytes
64 heads, 32 sectors/track, 40960 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes
Disk /dev/sda doesn't contain a valid partition table
From the text above we can see that we have two disks essentially. /dev/xvda is the device for my virtual disk for my Xen VM’s storage (it, the VM, is paravirtualized, in case that matters). The other disk now shows up as /dev/sda or as the first SCSI disk. fdisk properly reports that it is forty gigabytes and does not have a valid partition table. That’s easy to take care of, though:
ab@mybox0:~> sudo /sbin/fdisk /dev/sda
root's password:
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that, of course, the previous
content won't be recoverable.
The number of cylinders for this disk is set to 40960.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-40960, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-40960, default 40960):
Using default value 40960
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
So after loading fdisk I selected ‘n’ (new), specified a primary partition with ‘p’, made it number ‘1’, and then took all of the space (the defaults). Once completed I used ‘w’ to write the partition table and update the system. Now I should have a bit more useful data from fdisk:
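The same answers can be fed to fdisk on standard input when the step needs to be repeatable. The sketch below is one way to do that under some assumptions: FDISK_TARGET is a made-up variable name for this example, and the script only prints the keystrokes unless that variable is set, because running it for real rewrites the partition table of whatever disk you name.

```shell
#!/bin/sh
# Sketch: the fdisk answers from the interactive session, fed on stdin.
# n = new partition, p = primary, 1 = partition number,
# two empty lines accept the default first and last cylinder, w = write.
answers=$(printf 'n\np\n1\n\n\nw')

# FDISK_TARGET is a hypothetical variable for this sketch. Leave it unset
# to just print the keystrokes; set it (for example to /dev/sda) to run
# fdisk for real, which is destructive for that disk.
if [ -n "${FDISK_TARGET:-}" ]; then
    printf '%s\n' "$answers" | sudo /sbin/fdisk "$FDISK_TARGET"
else
    printf '%s\n' "$answers" | sed 's/^$/<Enter>/'
fi

# Number of answer lines sent to fdisk.
line_count=$(printf '%s\n' "$answers" | wc -l)
```

Newer fdisk versions ask for sectors rather than cylinders, but the sequence of answers (new, primary, 1, two defaults, write) stays the same.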
ab@mybox0:~> sudo /sbin/fdisk -l /dev/sda
Disk /dev/sda: 42.8 GB, 42949672960 bytes
64 heads, 32 sectors/track, 20480 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes
   Device Boot  Start    End    Blocks  Id  System
/dev/sda1           1  40960  41943008  83  Linux
The text above is truncated for brevity but it gives you the idea. Now I have /dev/sda1 which is ready for a filesystem. In this case I am going to set up an OCFS2 filesystem in that space for use by Oracle. Doing this, along with the partitioning, only needs to be done by one of the two nodes so I might as well do it here now before configuring the other node’s iSCSI initiator:
ab@mybox0:~> sudo /sbin/mkfs.ocfs2 /dev/sda1
mkfs.ocfs2 1.4.0
Filesystem label=
Block size=4096 (bits=12)
Cluster size=4096 (bits=12)
Volume size=42949672960 (1052226 clusters) (1052226 blocks)
163 cluster groups (tail covers 17404 clusters, rest cover 32256 clusters)
Journal size=134213632
Initial number of node slots: 4
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 3 block(s)
Formatting Journals: done
Writing lost+found: done
mkfs.ocfs2 successful
After about five seconds I have my volume set up and all that is left to do is to create a mount point and then mount it:
ab@mybox0:~> sudo mkdir /mnt/oraracdata
ab@mybox0:~> sudo mount -t ocfs2 /dev/sda1 /mnt/oraracdata
As this completes let’s check to see if everything shows up as it should:
ab@mybox0:~> ls -la /mnt/oraracdata/
total 8
drwxr-xr-x 3 root root 4096 2009-06-10 00:26 .
drwxr-xr-x 3 root root   23 2009-06-05 00:49 ..
drwxr-xr-x 2 root root 4096 2009-06-10 00:26 lost+found
ab@mybox0:~> df -h
Filesystem  Size  Used Avail Use% Mounted on
/dev/xvda2   39G  1.9G   38G   5% /
udev        257M  112K  256M   1% /dev
/dev/sda1    40G  519M   40G   3% /mnt/oraracdata
So at this point we have everything mounted on the box running both the initiator and the target. With the target completely set up, partitioned, and with the filesystem of my choice on there I am going to do the same initiator setup ONLY on the second client machine (no fdisk, no mkfs, etc.) followed by the same mkdir and mount commands to see if I get the same output. Once I do I can create a file on one side and make sure it shows up on the other as follows (notice the different hosts in the two tests):
ab@mybox0:~> sudo touch /mnt/oraracdata/test0
ab@mybox1:~> ls -l /mnt/oraracdata
total 4
drwxr-xr-x 2 root root 4096 2009-06-10 00:26 lost+found
-rw-r--r-- 1 root root    0 2009-06-10 00:37 test0
In these tests each host is mounting the shared storage to /mnt/oraracdata in its respective filesystem so the exact same ‘ls -l’ command shows the same output on each system. Creating files on each node works and we are up and running. Restarting the system brings the iSCSI initiators up again but one additional step is required to make the mountpoint automatic which is trivial. Adding the following to the /etc/fstab on each system takes care of it since each is mounting the iSCSI location as /dev/sda1:
/dev/sda1 /mnt/oraracdata ocfs2 _netdev,datavolume,nointr 0 0
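Since a malformed fstab line tends to show itself only at the next reboot, a quick sanity check of the entry can save some grief. The sketch below is an illustration, not part of the original setup: it splits the line above into its six fields and confirms that the _netdev option is present, which is what tells the boot scripts to wait for networking before attempting the mount.

```shell
#!/bin/sh
# Sketch: sanity-check an fstab entry before relying on it at boot.
entry='/dev/sda1 /mnt/oraracdata ocfs2 _netdev,datavolume,nointr 0 0'

# Split the entry on whitespace into the six fstab fields:
# device, mount point, filesystem type, options, dump flag, fsck order.
set -- $entry
field_count=$#
fstype=$3
options=$4

# _netdev marks the filesystem as needing the network, so the mount is
# deferred until networking (and with it the iSCSI session) is available.
case ",$options," in
    *,_netdev,*) has_netdev=yes ;;
    *)           has_netdev=no ;;
esac

echo "fields=$field_count type=$fstype _netdev=$has_netdev"
```

For the entry above this prints fields=6 type=ocfs2 _netdev=yes. One caveat worth keeping in mind: the /dev/sda1 name depends on the order in which disks are detected, so on a machine with other SCSI disks a persistent name may be the safer choice.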
The mount options specified are unique to OCFS2 so with other filesystems they would not necessarily be the same but in this case they are there. Finally we have shared storage working. The joy of this system is that it is all within one physical machine (a SLES 11 x86_64 Xen host), two VMs (SLES 10 SP2 x86_64 Xen Paravirtualized VMs) and without any shared disks due to the “disk” created in the filesystem with the ‘dd’ command. For a test system this is a quick and dirty way to get things going. For production adding some real hardware into the mix and a dedicated network can give you great performance overall with minimal work and cost. The demonstration above was done with SUSE-based distributions but iSCSI is one of those things that every Linux distribution can, and probably does, do out of the box. In SLES’s case the setup is really easy (taking out all the fluff, verification, etc. we’re talking about a dozen steps for all of it including both machines) and easy to manage via YaST. In my case the next steps are to set up Oracle and have RAC work which, unfortunately, is much more involved than just getting the disks working via iSCSI.