I have often been asked if you can roll back a SUSE Linux Enterprise Server (SLES) system after applying an online update. The answer, effectively, is no. SLES builds its distribution from installed RPM packages. When you update the server, you simply install new RPM packages to replace the outdated ones. Many of the RPM packages have pre-install and post-install scripts that make additional modifications to the system. So to roll back, you would need to forcefully reinstall the older RPM packages. Reverting a system update is difficult, and the usual workaround is to roll back only key RPM packages, like the kernel.
Nothing beats trying out a system update on a test system before applying it to a production system, and some higher-end solutions provide snapshot capabilities too. However, one possible way to roll back an unwanted update is to use a redundant array of independent disks (RAID). You should be able to use hardware or software RAID to accomplish the rollback, but the technique depends on how the RAID meta data is affected when the mirror is broken.
The basic principle is to break the mirror prior to the change. If you like the change, add the disk back into the RAID1 mirror array. If you don't like the update, use the disk taken out of the mirror as your primary disk. Finally, sync from the primary disk to the disk with the unwanted changes, forcing that disk to roll back its changes.
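Assuming the layout used later in this article (the root file system on /dev/md1, mirrored across /dev/sda2 and /dev/sdb2), the whole cycle boils down to three mdadm commands, each covered in detail below:

raid1:~ # mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2
raid1:~ # mdadm /dev/md1 --add /dev/sdb2
Rescue:~ # mdadm --zero-superblock /dev/sda2

The first breaks the mirror before the change, the second accepts the change by re-adding the disk, and the third rejects it by erasing the updated disk's RAID meta data before the disks are swapped.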
A word of caution: the one thing this procedure does not roll back is certain kinds of application meta data changes. If an application upgrades its meta data format as a result of the update process and you decide to roll back the updates, the older application will most likely not cope with the new meta data format. That said, I have never run into an issue like this, because most meta data lives on the disk that is being rolled back. The only application I think comes close to this scenario is Novell Storage Services (NSS) on Open Enterprise Server (OES) for Linux. I have not seen this happen with NSS, but it has the potential.
First off, I will assume the server has been installed as described in Installing SLES on Software RAID1. If you have already installed your server and need to migrate to a software RAID1 configuration, use Migrating SLES to Software RAID1. Hardware RAID1 should work and is the preferable solution, but I will use software RAID for this article. I have only tested the i386 and x86_64 architectures, using SLES10 SP1.
The kernel and disk configuration looks like this.
Figure A – Kernel and disk configuration
The RAID configuration looks like this.
Figure B – RAID configuration
- Prepare the server before breaking the mirror.
- Backup the server. Any time you are going to perform major changes on the server, you must back it up along with all your data volumes.
- For online updates, register the server with Novell Customer Center, if you haven’t already.
- Shut down application services on the server, like apache.
- Determine which disk is the secondary disk. Servers will usually only boot from the primary disk, as determined by the system BIOS. The /boot/grub/device.map shows which disk(s) the BIOS presented to GRUB. In this case the secondary disk is (hd1) or /dev/sdb.
raid1:~ # cat /boot/grub/device.map
(fd0) /dev/fd0
(hd0) /dev/sda
(hd1) /dev/sdb
- Determine which device partition /dev/md1 uses on the secondary (hd1) disk. The secondary disk device for the root partition is /dev/sdb2 or RaidDevice 1.
Figure 1 – Determine secondary disk device
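If you cannot see the figure, the same information is available from /proc/mdstat. The output below is illustrative of my test layout, not a capture from your system:

raid1:~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]

Here sdb2[1] indicates that /dev/sdb2 is RaidDevice 1 in /dev/md1.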
- Remove the secondary disk from the RAID1 array. Remember, you must first mark the device faulty, and then remove it. This can be done in one step, just make sure you do it in the right order: mdadm /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2.
Figure 2 – Break the mirror
- Confirm the disk has been removed from the mirror. You should now have a degraded array. Run mdadm --detail /dev/md1.
Figure 3 – Confirm broken mirror
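If you prefer /proc/mdstat, the degraded state is visible there as well; a healthy two-disk mirror shows [UU], while the broken one shows [U_] (the exact layout of the output varies):

raid1:~ # cat /proc/mdstat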
This section represents any major change to the server. For this article, we’ll use a system update as the major change taking place.
- Update the server
- Run yast online_update to update the server with the latest patches.
- Reboot the server.
- Test the updated system.
- Determine if you will Accept the Update or Reject the Update
Figure 4 – Server state after SLES system update
This is the most common outcome. We would expect the update or major changes to go smoothly, and therefore would wish to move forward. We simply need to add the secondary disk back into the mirror.
- Confirm the current state of the mirrored array. See Figure 4.
- Add the secondary disk back into the mirrored array.
Figure 5 – Adding secondary disk back into the mirror
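If you are working from the command line rather than the figure, re-adding the disk is a single command (assuming /dev/sdb2 is the partition removed earlier):

raid1:~ # mdadm /dev/md1 --add /dev/sdb2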
- Wait for the array to synchronize.
Figure 6 – Array currently syncing
Figure 7 – Array finished syncing
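You can follow the resync progress from /proc/mdstat, for example by rechecking it every few seconds until the progress indicator disappears:

raid1:~ # watch -n 5 cat /proc/mdstat

Running mdadm --detail /dev/md1 will also report the resync status.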
- Reinstall GRUB on both disks. The secondary disk already has GRUB installed on it, so this step is really only necessary if the GRUB RPM package was updated. However, I think it is a good idea regardless.
Figure 8 – Reinstalling GRUB on both disks
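For reference, the commands behind Figure 8 follow the standard GRUB shell sequence. The (hd0,1) and (hd1,1) values assume the boot files live on the second partition of each disk, as in my layout; adjust them to match yours:

raid1:~ # grub
grub> root (hd0,1)
grub> setup (hd0)
grub> root (hd1,1)
grub> setup (hd1)
grub> quit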
- Reboot the server to make sure everything works as expected. You should boot to the updated kernel, and the mirrored array should be intact. See Figure 8.
After updating the server, I have simulated a corrupted RAM disk file so that the server gets a kernel panic when attempting to reboot. A kernel panic after a system update would be a good reason to roll the system back.
Figure 9 – Kernel panic
- Boot into rescue mode.
- Stop the /dev/md1 mirror.
Rescue:~ # mdadm --stop /dev/md1
mdadm: stopped /dev/md1
- Make sure you are deleting the RAID superblock from the correct device. Observe the meta data in Figure 10 for the secondary disk we want to keep.
Figure 10 – Secondary disk RAID meta data
- Delete the RAID superblock on /dev/sda2. Observe the meta data in Figure 11 for the disk we are going to zero out. This is the updated primary disk that we are rejecting.
Figure 11 – Deleting software RAID superblock
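The command itself is short, but double-check the device name first, because zeroing the wrong superblock destroys the copy you want to keep:

Rescue:~ # mdadm --zero-superblock /dev/sda2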
- Halt the server.
- Swap disk locations. Make sure the disk that was not updated is now in the primary position, so the BIOS will boot from it. This means that /dev/sdb2 will become /dev/sda2 after the server boots up.
- Boot into the normal degraded array, with the system rolled back. (See Why this Works)
Figure 12 – Degraded mirror after rollback
- Add the failed secondary disk partition back into the RAID1 array. This secondary disk used to be the primary, but we swapped it in a previous step.
Figure 13 – Adding failed secondary back into the array
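Because of the swap, the rejected disk now appears as /dev/sdb2, so the command is the same one used when accepting an update:

raid1:~ # mdadm /dev/md1 --add /dev/sdb2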
- Wait for the array to synchronize.
Figure 14 – Array currently syncing
Figure 15 – Array finished syncing
- Reinstall GRUB on both disks (See Figure 8.)
- Reboot the server to make sure everything is back to the way it was prior to the update.
The reason the rollback works is the way the software RAID meta data is written when the mirror is broken. Compare the meta data changes on the primary disk before and after the secondary disk has been removed from the array.
Figure 16 – Primary disk RAID meta data comparison
Now compare the meta data changes to the secondary disk before and after it was removed from the RAID array. Other than recognizing that something has changed, the active disks in the mirror are the same.
Figure 17 – Secondary disk RAID meta data comparison
When you boot off of the secondary disk after zeroing out the primary disk's RAID meta data, the secondary disk thinks all disks will be added to the mirror set. However, when the server boots, the primary disk will not be found, since its RAID meta data was erased. So the server will boot with a degraded array. You can then add the primary disk back into the mirror set as a secondary disk, which forces the data to be synced and the update to be rolled back on both disks.
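The meta data shown in Figures 16 and 17 comes from examining the member partitions directly; you can inspect it on your own system the same way:

raid1:~ # mdadm --examine /dev/sda2
raid1:~ # mdadm --examine /dev/sdb2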
WARNING: Software RAID devices on NSS volumes do NOT work this way. In fact, if you attempt to break an NSS mirror and try to use the mirror object that you deleted to replace the primary mirror disk, you will most likely lose your data and volume. So don't use this update procedure on NSS volumes; it won't work.
- I tried the rollback technique with a VMware virtual machine and a Dell Precision 340 (x86), with success on both.
- I tested the rollback procedure on an HP Workstation xw6200 (x86_64) with integrated IDE and SATA controllers. The BIOS has an option to pick which controller to boot from, which is nice because you don't have to physically change the cable on the disks. However, when I switched controllers after rejecting the update, I got a GRUB "file not found" error. The BIOS was still reporting the (hd0) and (hd1) order as before, so I just had to change root from (hd1,1) to (hd0,1). When you change the physical cable, you don't have to worry about the GRUB change, because the BIOS reports the change as well, so (hd0,1) is correct before and after a physical cable change.
- The split kernel scenario. If you choose to reject an update, but you do not zero out the updated disk's RAID meta data, you will create a split kernel scenario when you boot off of the secondary disk. GRUB will see the old kernel, but the system will only allow the updated primary disk to be in the RAID mirror. You will see this when uname -r reports the old kernel, but the new kernel is the only one in the RPM database and on the mounted file system. Since we broke the mirror, the primary disk still knows that the secondary disk was marked as failed and removed. You must make sure you zero out the primary disk's RAID meta data if you are in fact going to boot from the secondary disk. To resolve the problem, just repeat the steps to reject the update.
- I have only tested with the default RAID meta data version 00.90.03.
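A quick way to spot the split kernel scenario described above is to compare the running kernel with the one in the RPM database. On my SLES10 test systems the package was kernel-default, though the name depends on the kernel flavor installed:

raid1:~ # uname -r
raid1:~ # rpm -q kernel-default

If the two versions disagree, you booted one kernel while the mounted file system holds another, and you should repeat the rejection steps.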
I have not been a fan of software RAID, and still think hardware RAID is a better solution. I did not really see what benefits software RAID offered that outweighed the kernel overhead and added layer of complexity to the system disk. However, the ability to rollback a failed system update is a compelling reason to consider a RAID mirror. If you can accomplish the same thing with hardware RAID, I still think it’s better, but if you cannot or if you are on a tight budget, software RAID is worth considering.