Problems with /boot mounting or a split kernel after patching
Beginning with SLES10 SP3, many of us started to experience problems getting to /boot on a running system. This can be a major problem if you apply a kernel patch while /boot is not mounted, because the patch engine will update the libraries and kernel files in /. When you reboot, the kernel in the real /boot is read but is then mismatched to the libraries in /, causing what is often referred to as a split kernel.
The symptoms of a split kernel can be:
- Slow boot
- System freezing after boot completion
- Logon not completing logon processes
- Modules failing to load during boot – ACPI is the most common
If you don’t have a split kernel go ahead and skip forward to the section headed “Mounting /boot when /boot won’t mount”
Repairing a split kernel (we will check if you have a split kernel first)
Boot to a rescue DVD/CD.
At the login prompt login as root.
First – mount your /boot to /media/floppy with the command:
mount /dev/cciss/c0d0p1 /media/floppy
Now that we have the real /boot mounted we want to peek into your grub menu to find the path to your / mount point:
cat /media/floppy/grub/menu.lst
Look for the “root=” entry on the kernel line. I put the rest of my filesystem in LVM, so mine is “root=/dev/system/slash”, where system is the volume group name and slash is the logical volume name for /. Take the path that follows root= and mount it to /media/cdrom, like:
mount /dev/system/slash /media/cdrom
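If you would rather not pick the value out of menu.lst by eye, something like the following can pull it out for you. This is only a sketch: the function name grub_root_devices is my own invention, and it assumes the root= value sits on the kernel lines, as it does in my menu.lst.

```shell
# Print each distinct root= value found on the kernel lines of a GRUB menu.lst.
# grub_root_devices is a hypothetical helper name, not part of any SLES tool.
grub_root_devices() {
    menu_file=$1
    # Only "kernel" lines carry root=/dev/...; the "root (hd0,0)" lines are skipped.
    grep -E '^[[:space:]]*kernel[[:space:]]' "$menu_file" \
        | grep -o 'root=[^[:space:]]*' \
        | sed 's/^root=//' \
        | sort -u
}
```

With the real /boot mounted as above, grub_root_devices /media/floppy/grub/menu.lst should print the device to mount on /media/cdrom.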
With both mounted we need to make comparisons:
ls /media/floppy # this is the real /boot partition, mounted directly
ls /media/cdrom/boot # this is / mounted, so boot is a sub-directory intended to be a mount point
If the second is empty or matches the files in /media/floppy then you do not have a split kernel. If they are mismatched we’ll recover from that problem with these steps.
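The comparison can also be scripted. The sketch below is my own helper (check_split_kernel is a made-up name) and it only compares file names, not file contents, so treat a MISMATCH as a prompt to look at both listings yourself.

```shell
# Compare the real /boot partition with the boot directory inside /.
# check_split_kernel is a hypothetical helper; it compares names only.
check_split_kernel() {
    real_boot=$1      # where the real /boot partition is mounted
    shadow_boot=$2    # the boot sub-directory of the mounted / filesystem
    shadow_list=$(ls -A "$shadow_boot" 2>/dev/null)
    # Nothing was written behind the mount point: no split kernel.
    if [ -z "$shadow_list" ]; then
        echo "OK: $shadow_boot is empty"
        return 0
    fi
    # Same file names in both places: no split kernel either.
    if [ "$shadow_list" = "$(ls -A "$real_boot")" ]; then
        echo "OK: file names match"
        return 0
    fi
    echo "MISMATCH: possible split kernel"
    return 1
}
```

With the mounts used in this article the call would be check_split_kernel /media/floppy /media/cdrom/boot.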
Make a backup copy of the new files in case an accident happens:
mkdir /media/cdrom/boot.new
cp /media/cdrom/boot/* /media/cdrom/boot.new
Make a backup of the current files in the real /boot in case something goes wrong:
mkdir /media/floppy/boot.old
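find /media/floppy -maxdepth 1 -type f -exec mv {} /media/floppy/boot.old/ \;
This moves only the regular files at the top level of the real /boot and leaves the grub directory in place (a plain mv /media/floppy/* would also sweep up grub and try to move boot.old into itself). We’re moving these because the new files will be different versions and we don’t normally want different versions of the kernel sitting directly in /boot.
cp /media/floppy/grub/menu.lst /media/floppy/boot.old
Use cp, not mv, for menu.lst.
Put the correct (new) files in place: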
mv /media/cdrom/boot/* /media/floppy
Make sure you have softlinks for your initrd and vmlinuz files:
ls -l /media/floppy
Look for files with the short names initrd and vmlinuz.
If the short name versions (softlinks to the long name versions) do not exist, or still point at the old version, create them with ln:
ln -sf vmlinuz-{your version of kernel here} /media/floppy/vmlinuz
ln -sf initrd-{your version of kernel here} /media/floppy/initrd
The link target is given without a directory so that the link still resolves once the partition is mounted on /boot.
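A quick way to confirm that both softlinks exist and actually resolve to a file is sketched below. check_boot_links is a name I made up for illustration, and it assumes readlink is available in your rescue environment.

```shell
# Check that the vmlinuz and initrd softlinks exist and are not dangling.
# check_boot_links is a hypothetical helper name.
check_boot_links() {
    boot_dir=$1
    rc=0
    for name in vmlinuz initrd; do
        link="$boot_dir/$name"
        # -L: it is a symlink; -e: the file it points at exists.
        if [ -L "$link" ] && [ -e "$link" ]; then
            echo "$name -> $(readlink "$link")"
        else
            echo "$name: missing or dangling"
            rc=1
        fi
    done
    return $rc
}
```

Run it as check_boot_links /media/floppy from the rescue system, or check_boot_links /boot on a running system.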
Update the GRUB menu. Make a note of the kernel version of the files you just moved, such as “vmlinux-2.6.16.60-0.54.5-bigsmp.gz”. You’ll need those numbers inside the GRUB menu:
vi /media/floppy/grub/menu.lst
Mine looks like this:
# Modified by YaST2. Last modification on Wed Jun 23 08:08:38 UTC 2010
default 0
timeout 8
##YaST - generic_mbr
gfxmenu (hd0,0)/message
##YaST - activate

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 10 SP3
    root (hd0,0)
    kernel /vmlinuz-2.6.16.60-0.54.5-bigsmp root=/dev/system/slash vga=0x317 resume=/dev/cciss/c0d0p2 splash=silent showopts
    initrd /initrd-2.6.16.60-0.54.5-bigsmp

###Don't change this comment - YaST2 identifier: Original name: failsafe###
title Failsafe -- SUSE Linux Enterprise Server 10 SP3
    root (hd0,0)
    kernel /vmlinuz-2.6.16.60-0.54.5-bigsmp root=/dev/system/slash vga=normal showopts ide=nodma apm=off acpi=off noresume nosmp noapic maxcpus=0 edd=off 3
    initrd /initrd-2.6.16.60-0.54.5-bigsmp
Edit the version numbers on every line to match the kernel files you just put in place. Use caution here as a mistyped version will cause a grub failure and you’ll be back here in rescue mode to fix it.
Once complete exit and save forcefully (:wq!) as menu.lst is usually read only.
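If you would rather not retype the version on every line in vi, a search and replace is one alternative. The sketch below uses a helper name of my own (update_menu_version), keeps a .bak copy next to the file, and assumes the version strings contain nothing more exotic than letters, digits, dots and dashes.

```shell
# Replace every occurrence of one kernel version string with another in a
# GRUB menu.lst, keeping the original as menu.lst.bak.
# update_menu_version is a hypothetical helper name.
update_menu_version() {
    menu_file=$1
    old_version=$2
    new_version=$3
    # Escape the dots so sed matches the old version as literal text.
    old_pattern=$(printf '%s' "$old_version" | sed 's/\./\\./g')
    cp -p "$menu_file" "$menu_file.bak" || return 1
    sed "s/$old_pattern/$new_version/g" "$menu_file.bak" > "$menu_file"
}
```

The call takes the file, the old version and the new version, for example update_menu_version /media/floppy/grub/menu.lst 2.6.16.60-0.54.5-bigsmp {your version of kernel here}. Check the result with cat before rebooting.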
Reboot and you should be back to normal operating status (mostly) 🙂
Mounting /boot when /boot won’t mount
If your system is running normally but you cannot see any files in /boot, or you just repaired a split kernel, we need to fix the mount problem. This is most often caused by multipath trying to control the device where /boot resides.
Let’s check:
ls /boot
Is the directory empty? If so try to mount /boot
mount /boot
Do you get an error like “already mounted” or “/boot busy”? If so, let’s check whether it is really mounted:
umount /boot
If umount complains that /boot is not mounted, even though mount just claimed it was, you have the problem described here. If you do not get an error that /boot is not mounted, then this is not the solution you are looking for and you should search for other solutions.
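The kernel’s own view of what is mounted can be read from /proc/mounts, which makes for a check that does not depend on the mount and umount messages. A minimal sketch follows; is_mounted is my own helper name, and the optional second argument only exists so the check can be pointed at a saved copy of the table.

```shell
# Succeed if the given mount point appears in a mounts table.
# is_mounted is a hypothetical helper; the table defaults to /proc/mounts.
is_mounted() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each line is the mount point; exit 0 only if one matches.
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}
```

For example: is_mounted /boot && echo "/boot is mounted" || echo "/boot is NOT mounted".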
Next steps – first, check where /boot is supposed to be by looking in fstab:
/dev/system/slash / ext3 defaults 1 1
/dev/cciss/c0d0p1 /boot ext3 acl,user_xattr 1 2
/dev/system/tmp /tmp ext3 defaults 1 2
/dev/system/usr-novell /usr/novell ext3 defaults 1 2
/dev/system/var /var ext3 defaults 1 2
/dev/cciss/c0d0p2 swap swap defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs noauto 0 0
debugfs /sys/kernel/debug debugfs noauto 0 0
usbfs /proc/bus/usb usbfs noauto 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0
DIST /media/nss/DIST nssvol noauto,rw,name=DIST,norename 0 0
VOL1 /media/nss/VOL1 nssvol noauto,rw,name=VOL1,norename 0 0
VOL2 /media/nss/VOL2 nssvol noauto,rw,name=VOL2,norename 0 0
This shows that /boot is on my first disk’s first partition. Remember the disk and partition number for the next steps.
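On a long fstab it can be easier to ask for the /boot entry directly. The following is a small sketch with a helper name of my own (fstab_device_for); it simply prints the first field of the line whose mount point matches.

```shell
# Print the device that fstab assigns to a mount point.
# fstab_device_for is a hypothetical helper; the file defaults to /etc/fstab.
fstab_device_for() {
    mount_point=$1
    fstab_file=${2:-/etc/fstab}
    # Skip comment lines, then print the device field of the matching entry.
    awk -v mp="$mount_point" '$1 !~ /^#/ && $2 == mp { print $1 }' "$fstab_file"
}
```

On my server fstab_device_for /boot would print /dev/cciss/c0d0p1.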
Let’s mount /boot by device ID:
Type mount /dev/disk/by-id/ and then press Tab three times. This will show all possible completions, such as:
server1:~ # mount /dev/disk/by-id/cciss-3600508b1001030393146353436306
cciss-3600508b1001030393146353436306400
cciss-3600508b1001030393146353436306400-part1
cciss-3600508b1001030393146353436306400-part2
cciss-3600508b1001030393146353436306400-part3
cciss-3600508b10010303931463534363064001
cciss-3600508b10010303931463534363064002
cciss-3600508b10010303931463534363064003
cciss-3600508b1001030393146353436306500
cciss-3600508b1001030393146353436306500-part1
cciss-3600508b10010303931463534363065001
cciss-3600508b1001030393146353436306600
cciss-3600508b1001030393146353436306600-part1
cciss-3600508b10010303931463534363066001
server1:~ # mount /dev/disk/by-id/cciss-3600508b1001030393146353436306
For my system (and I am guessing most systems) we are looking for the first device with a -part1 suffix. Complete the line entered so far with the correct device ID (in this case the one ending in 400-part1), then add the target mount point of /boot. The complete command looks like:
mount /dev/disk/by-id/cciss-3600508b1001030393146353436306400-part1 /boot
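Rather than guessing which -part1 entry is the right one, you can ask which by-id names resolve to the device that fstab lists for /boot. This is a sketch only: by_id_names_for is my own helper name, it assumes readlink -f is available, and on an affected system the by-id links may resolve to something other than the cciss device, in which case it prints nothing and the Tab completion approach above still applies.

```shell
# List the names in a by-id style directory whose softlinks resolve to the
# given device node. by_id_names_for is a hypothetical helper name.
by_id_names_for() {
    device=$1
    id_dir=${2:-/dev/disk/by-id}
    target=$(readlink -f "$device") || return 1
    for link in "$id_dir"/*; do
        # Ignore anything that is not a softlink.
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$target" ]; then
            basename "$link"
        fi
    done
}
```

For the fstab shown earlier the call would be by_id_names_for /dev/cciss/c0d0p1.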
After the mount command is executed verify /boot mounted by checking for contents again.
ls /boot
Once boot is confirmed mounted we need to make sure the device is blacklisted from multipath.
Create (or edit) /etc/multipath.conf, but first let’s check for SAN devices that we do not want to blacklist (if the server is not SAN connected this is not needed):
multipath -ll
Ensure any SAN devices listed do not match any of the patterns in your multipath blacklist. Then create your multipath.conf:
vi /etc/multipath.conf
It needs to include a section for blacklisting such as:
blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z]"
    devnode "^cciss!c[0-9]d[0-9]*"
}
This should cover most normal non-SAN devices.
Exit and save
:wq
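Before rebuilding the initrd, it is worth sanity-checking the blacklist patterns against the device names you care about. The sketch below uses grep -E as a stand-in for the way multipath evaluates devnode patterns, which is an approximation on my part, and matches_blacklist is a made-up helper name.

```shell
# Succeed if a device node name matches any of the blacklist patterns above.
# matches_blacklist is a hypothetical helper; grep -E only approximates
# how multipath itself evaluates devnode expressions.
matches_blacklist() {
    devnode=$1
    printf '%s\n' "$devnode" | grep -Eq \
        -e '^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*' \
        -e '^hd[a-z]' \
        -e '^cciss!c[0-9]d[0-9]*'
}
```

A local device such as 'cciss!c0d0' should match, while a SAN device name reported by multipath -ll (typically an sd device) should not. Quote the name when calling it, for example matches_blacklist 'cciss!c0d0' && echo blacklisted.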
Now rebuild the initrd:
mkinitrd
This will generate a new /boot/initrd-{kernel version and platform} that includes the multipath blacklist, which prevents multipath from trying to manage the local CCISS devices.
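One rough way to confirm that the initrd was rebuilt after the blacklist was written is to compare modification times. This is a sketch with a helper name of my own (initrd_is_current); a newer timestamp only suggests the initrd was regenerated, it does not prove the blacklist made it in.

```shell
# Succeed if the initrd exists and is newer than the multipath configuration.
# initrd_is_current is a hypothetical helper; the config path defaults to
# /etc/multipath.conf.
initrd_is_current() {
    initrd_file=$1
    conf_file=${2:-/etc/multipath.conf}
    # -nt compares modification times and follows softlinks such as /boot/initrd.
    [ -e "$initrd_file" ] && [ "$initrd_file" -nt "$conf_file" ]
}
```

For example: initrd_is_current /boot/initrd || echo "initrd is older than multipath.conf, run mkinitrd again".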
Once these steps are completed you should be able to reboot and see the boot files in /boot without taking extra steps. This is pretty important when patching your kernel or you will end up with the split kernel problem outlined above (but you now know how to fix that too!)
Comments
In TID 7005808, the multipath.conf is discussed but it is not always read until after the /boot device is grabbed by multipath – a little too late!
TID 7005808 also does not help in recovering from a split kernel leaving you (possibly) with the cause of the broken server after a kernel patch but not knowing how to repair the problem.
This problem is not unique to SuSE so other distributions can use the same or similar steps to what is outlined here.