HPC Update for Azure…Finally
A long time ago… in Upgrading your running on demand instances in the Public Cloud I promised an upgrade process for SLES 12 HPC instances. This has, to a certain extent, been obsoleted by changes within Azure, and by the fact that any instances that are still running unchanged, i.e. without a newer kernel driver installed, no longer work. However, just updating the kernel driver does not a distribution upgrade make, and thus there is still quite a lot to talk about.
First, let's take a brief look at why it took us so long to get new images out the door, and then explore these new SLES 12 SP1 HPC images a bit. Finally, we'll look at getting from point A, a running SLES 12 HPC instance, to point B, an upgraded instance that is running SLES 12 SP1.
How did we get here?
Well, at some point after the initial SLES 12 HPC images were released we all had a big "but, what if" moment. This moment should have been obvious from the get-go, but hindsight is 20/20, and we cruised through all the gates to release the SLES 12 HPC images without considering what would happen if the underlying hardware and firmware for RDMA were to change. The answer, again rather obvious in hindsight, is of course that there would be multiple drivers floating around and we'd somehow have to match up the version of the driver with information we get from the Azure framework. However, in the initial implementation neither the kernel, nor the repositories that provide the RDMA driver, nor the initialization code (provided by the WALinuxAgent package at the time) were prepared for this multi-version approach, and thus there was work to be done. Then, when we thought we were ready, the kernel had the necessary patches, the repository stream was set up, and the WALinuxAgent had code to handle multiple versions of the driver, it turned out that things didn't work as they were supposed to. The problem was an incompatibility, like a tunnel bored from both sides of a mountain that does not meet in the middle, with both boring teams breaking through somewhere else. Not really all that funny, but as they say … happens. At about the same time the unification of the initialization code between Azure and Azure Stack was happening, and thus it was decided to address the incompatibility in the integrated agent code base. All the work is done, with one caveat, and thus we released HPC images based on SLES 12 SP1 on August 24th, 2016. HURRAY!
What’s new in the SLES 12 SP1 HPC image?
As compared to the original SUSE Linux Enterprise Server 12 HPC images, you are no longer on the hook to download the Intel MPI binaries yourself. Rather, the packages are now provided as part of the image and are located in /opt/intelMPI/intel_mpi_packages. The packages represent the Intel MPI runtime environment, which provides everything necessary to use MPI with your application and which we are able to distribute. They are not yet pre-installed, as we still have some things to work out for pre-installation. However, you can install the runtime environment packages, as the necessary lower-level complications have been resolved within the image. A simple
rpm -v -i --nodeps /opt/intelMPI/intel_mpi_packages/*.rpm
will do the trick. This of course raises the question why in the world we are not using zypper, as it is, after all, the better tool for package management. The answer is quite simple: the Intel MPI packages have a dependency on lsb >= 3.0. However, LSB prior to 5.0 was a monolithic standard, and it pulls in a lot of GUI packages you certainly do not need or want on your HPC system(s). If we were to use zypper to install the MPI packages
zypper --no-remote --non-interactive in --force --no-recommends /opt/intelMPI/intel_mpi_packages/*.rpm
you'd have to tell zypper to "break" the install of every package in the directory, although in reality nothing actually gets broken, as all the underlying dependencies for Intel MPI are fulfilled. Using the rpm command makes skipping the dependency check for the packages easier.
A quick word about the LSB requirement of the MPI packages. This is actually a false dependency, as the MPI packages require interfaces that are not part of the LSB standard; the requirement thus creates the misleading impression that the MPI code (packages) is LSB compliant, when it is not.
With the Intel MPI packages installed you are ready to go. The SLES 12 SP1 HPC images, as well as the latest SLES 12 SP1 images, contain the 2.1.5 version of the initialization code, and thus the appropriate kernel driver is installed and initialized at boot time. You will have an IP address on the second network interface, which is the one to use for MPI. Configuring MPI and setting up your cluster is described in Set up a Linux RDMA cluster to run MPI applications. As we do not know the user or group as which your application runs, this cannot be pre-configured in the image. We will look at the driver a bit more later, during the instance upgrade discussion. Rinse and repeat for every instance in your cluster.
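A quick way to verify that the RDMA interface came up is to check for an address on the second NIC. A minimal sketch, assuming eth1 is the second interface as on the standard A8/A9 instance layout:

```shell
# Print the IPv4 address assigned to the second NIC (assumed to be eth1);
# an empty result means the RDMA interface did not get an address.
ip -4 -o addr show eth1 | awk '{print $4}'
```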
And now back to the caveat I mentioned earlier. The 2.1.5 version of the initialization code contains a bug, #994592 "IP address not assigned on A8 instances" (login with SUSE account required), that shows up in about 10% of launches on the A8 instance type. This issue has not been observed on A9, which, based on the data available to us, is the preferred instance type for HPC workloads. Thus, we have decided to release the images with the 2.1.5 initialization code rather than wait for the fix, which will be available in the 2.1.6 version. The updated version will show up in the update repositories within the next 6 weeks or so.
Upgrading running SLES 12 HPC instances
The more interesting part is the migration of running instances from SLES 12 to SLES 12 SP1, especially since we are dealing with a driver that is not delivered as part of the SUSE kernel. The precondition is that the instance must have access to the Internet: at least the SUSE update infrastructure (you can use "pint" to obtain the IP addresses of the SUSE update infrastructure servers in Azure) and drivers.suse.com must be reachable. Before we get started with the upgrade, let's get rid of the package msft-lis-rdma-kmp-default-20150707_k3.12.28_4-3.1.x86_64, which currently provides the RDMA driver and more. Keeping the package would create file conflicts during the SLES 12 to SLES 12 SP1 migration process, as we had to include the vmbus kernel module in the package to get RDMA working in SLES 12; at the time, in July 2015, the necessary changes were not yet upstream. By the way, all commands need to be executed as root.
zypper rm msft-lis-rdma-kmp-default
This will take a short while as the initrd will be rebuilt.
So, RDMA is now temporarily broken, sorry. While we are in the process of getting rid of things, let's also get rid of the repository that would have provided updates for this driver, had there been any.
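The repository can be removed by deleting its repo file directly. A sketch follows; the "msft-rdma*" file name pattern is hypothetical, so list /etc/zypp/repos.d/ first and adjust the pattern to match the file that actually points at the old driver repository:

```shell
# List the configured repo files to find the old RDMA driver repo;
# "msft-rdma*" below is a hypothetical pattern, adjust to the real file name.
ls /etc/zypp/repos.d/ 2>/dev/null || true
# The trailing "*" also removes the backup copy of the repo file.
rm -f /etc/zypp/repos.d/msft-rdma*.repo*
```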
The "*" cleans up a backup file that was in the original image. Since we have a driver module that expects the SLES 12 SP1 kernel, and the SLES 12 GA kernel was still in active maintenance until very recently, we are in a situation where the SLES 12 GA kernel may have a higher version number than the SLES 12 SP1 kernel. This is an oddity introduced by the release and development model. Ordinarily, during an upgrade where no external kernel modules are involved, this is immaterial. However, when driver modules are involved, as in this case, we have to consider this condition. Thus, rather than the usual "update, then fiddle with the repos", we will use a different process, and the steps outlined here differ from those in Upgrading your running on demand instances in the Public Cloud. In addition, since that blog post our update infrastructure has been enhanced such that the simple migration process that is new for SLES 12 now works. Unfortunately, the SLES 12 HPC images contain a bug in the registration client, such that an extra step is necessary.
1.) zypper in zypper-migration-plugin
2.) for prod in sle-module-toolchain/12/x86_64 sle-module-containers/12/x86_64 sle-sdk/12.1/x86_64 sle-module-public-cloud/12/x86_64 sle-module-legacy/12/x86_64 sle-module-web-scripting/12/x86_64 sle-module-adv-systems-management/12/x86_64; do SUSEConnect --url https://smt-azure.susecloud.net -p $prod; done
This will take a little while, but for every repository that gets added you will get a message, thus you can see the progress.
3.) zypper migration
Select 1 to migrate to SLES 12 SP1
Then resolve the upgrade changes,
a.) Allow the vendor change for the affected packages by choosing solution 1
b.) Allow the deinstall of dapl-debug-2.0.42 by choosing solution 2
Then let zypper proceed with the distribution upgrade.
This will take a while, as you should expect somewhere between 500 and 600 packages to be installed during the migration process. Once zypper is done with the installation you have all the packages from SLES 12 SP1, but we are not quite ready for a reboot yet; a few more steps are required. First, let's make sure we have the kernel from SLES 12 SP1:
zypper info kernel-default | grep Repository
Before we go on, a quick explanation why step 2 in the above procedure was necessary. The original SLES 12 HPC images were afflicted by a bug during registration with the SUSE update infrastructure that stopped the registration short, due to the built-in repository for the RDMA driver. This repository only exists on the suse.com domain, and when attempting to register with the update infrastructure this mismatch caused the registration procedure to exit without setting up the repositories for the modules. The bug has since been fixed in a new version of the registration client.
Back to our migration. During the upgrade process the initialization code configuration file for the new package, python-azure-agent, was saved to /etc/waagent.conf.rpmnew. If you have not made any modifications to the original configuration you can simply clobber the old file:
cp /etc/waagent.conf.rpmnew /etc/waagent.conf
If you did make modifications you will want to merge the files. After the new or merged config is in place, edit the file and make three changes:
1.) Uncomment the “OS.EnableRDMA=y” entry by removing “#” from the beginning of the line
+ This enables the new initialization code to pull in the appropriate driver
2.) Set “AutoUpdate.Enabled” to “n”
+ You may leave the auto update feature enabled; however, if there are problems it will be difficult to help debug them. We generally provide updated packages as needed, and thus it is better to leave version management to the package manager rather than let individual applications manage their own updates.
3.) Set the filesystem to ext4 by replacing “ext3” with “ext4”
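The three edits can also be made with sed. The sketch below applies the expressions to a sample snippet so the effect is visible; on the actual instance you would point the same sed command at /etc/waagent.conf (check the option names in your file first, as they are taken from the stock configuration and your copy may differ):

```shell
# Apply the three waagent.conf changes to a sample snippet;
# run the same sed expressions against /etc/waagent.conf on the VM.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# OS.EnableRDMA=y
AutoUpdate.Enabled=y
ResourceDisk.Filesystem=ext3
EOF
sed -i -e 's/^# *\(OS.EnableRDMA=y\)/\1/' \
       -e 's/^\(AutoUpdate.Enabled\)=y/\1=n/' \
       -e 's/ext3/ext4/' "$conf"
cat "$conf"
```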
Last but not least the initialization code service needs to be enabled:
systemctl enable waagent.service
This completes the migration to SLES 12 SP1 which included the switch to a new initialization code base.
Since you already had the HPC instance running, and presumably in use, the Intel MPI libraries should already be installed. To double-check that everything is in order, run through the following set of commands (the symlinks need to be created in the directory containing the libraries, typically /usr/lib64).
zypper in libopenssl0_9_8
zypper in libstdc++6-32bit
ln -s libssl.so.0.9.8 libssl.so.6
ln -s libcrypto.so.0.9.8 libcrypto.so.6
ln -s libreadline.so.6 libreadline.so.5
This leaves us with the RDMA driver setup. First, let's add the new repositories as follows:
zypper ar -c -n 'Microsoft-RDMA-Driver-Pack-Pool' https://drivers.suse.com/microsoft/Microsoft-RDMA/sle-12-sp1/install msft-rdma-pack-pool
zypper ar -c -n 'Microsoft-RDMA-Driver-Pack-Updates' https://drivers.suse.com/microsoft/Microsoft-RDMA/sle-12-sp1/updates/ msft-rdma-pack-updates
The repositories provide access to drivers that provide only the RDMA functionality and that work with the SLES 12 SP1 kernel. With the update to the initialization agent, the installation of the driver that matches the hardware underlying your instance is now automatic. HURRAY!
As a final step before reboot you may choose to switch the debug dapl libraries to the non-debug versions.
zypper in dapl-devel dapl
This will trigger zypper to ask you if you wish to uninstall the -debug packages. Answer yes to both questions to proceed.
What is left to do is to reboot the VM and rinse and repeat for all the instances in your HPC cluster.
After reboot, if everything worked as planned, you should have a package named "msft-rdma-kmp-default" on your system:
rpm -qa msft-rdma-kmp-default
and the proper network driver will be loaded; running (as root)
lsmod | grep hv_network_direct
should produce output, and you should have an IP address on eth1.
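The post-reboot checks can be bundled into a small script. This is just a convenience sketch around the commands above; the check() helper is our own, not part of any tool:

```shell
# Print OK or MISSING for each post-reboot check; run as root.
check() {
    # $1 is a label, $2 is the command to probe; output is suppressed
    if eval "$2" >/dev/null 2>&1; then
        echo "$1: OK"
    else
        echo "$1: MISSING"
    fi
}
check "msft-rdma-kmp-default package" "rpm -q msft-rdma-kmp-default"
check "hv_network_direct module"      "lsmod | grep -q hv_network_direct"
check "eth1 IPv4 address"             "ip -4 addr show eth1 | grep -q inet"
```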
From a development perspective it was not easy to get here, and we are glad to have finally managed to sort everything out and provide new images as well as an upgrade process. Enjoy.