Or, under the motto: no good deed shall go unpunished. Well, let’s get the problem definition out of the way, then talk about the fix for running instances, followed by how we got there and what we’ll do to prevent such snafus in the future.

First, if you started a SUSE Linux Enterprise Server 12 SP3 image from the quick launcher since November 7th, 2017, you may be afflicted by this condition. To check, run the following commands; unfortunately there is no direct way…

-> ec2metadata --api latest --ami-id
-> pint amazon images --filter 'id=ID_FROM_PREVIOUS_COMMAND' | grep 20171107

where ID_FROM_PREVIOUS_COMMAND is the ami-id returned by the ec2metadata command. If the pint command with the grep has any output, then you may have the issue and definitely want to continue reading. If not, you can skip the rest of this, unless of course you are interested in the details.
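If you would rather not copy the ID around by hand, the two commands can be chained. This is only a sketch: the function names has_datestamp and check_instance_image are made up for illustration, and it assumes ec2metadata and pint are installed and on the PATH, exactly as in the commands above.

```shell
#!/bin/bash
# Sketch: decide whether this instance was launched from a 20171107 image.

# Hypothetical helper: succeeds if the text on stdin contains the given stamp.
has_datestamp() {
    grep -q -- "$1"
}

# Hypothetical wrapper around the two commands from the post.
check_instance_image() {
    local ami_id
    # Ask the metadata service for the AMI ID of the running instance.
    ami_id=$(ec2metadata --api latest --ami-id) || return 2
    # Look the ID up in the pint data and search for the release stamp.
    if pint amazon images --filter "id=${ami_id}" | has_datestamp 20171107; then
        echo "possibly affected: ${ami_id}"
    else
        echo "not affected: ${ami_id}"
    fi
}
```

On an instance, load the functions into your shell and run check_instance_image. Nothing is contacted until the function is actually called.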

Here you are: you have an instance launched from an image with the potential condition, or you are just interested in what’s going on. The condition is that if you have multiple network interfaces attached, or your instance runs in a VPC with no public IP or EIP attached, then you will lose network connectivity to your instance upon reboot. That, of course, is kind of a bummer in the Cloud.

With the embarrassment out of the way, let’s first fix the running instance so that you will not be affected by the issue. You’ll have to do the following operations as root (sudo -i).

 

First, we want to disable network configuration handled by cloud-init.

-> mkdir -p /etc/cloud/cloud.cfg.d
-> echo "network: {config: disabled}" >> /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg

Second, we want to fix up ifcfg-eth0. The content should be as follows:

BOOTPROTO='dhcp'
MTU=''
REMOTE_IPADDR=''
STARTMODE='onboot'
CLOUD_NETCONFIG_MANAGE='yes'

Edit /etc/sysconfig/network/ifcfg-eth0 with your favorite editor, erase what’s there and put the above information in place. And that’s it, instance fixed.
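If you have more than a handful of instances to repair, the two steps can be scripted. What follows is a minimal sketch and not an official SUSE tool: the function names and the optional root-prefix argument are my own invention, added so the functions can be tried against a scratch directory first. One deliberate difference from the commands above is that the cloud-init line is only appended when it is not already present, so running the function twice does not duplicate it.

```shell
#!/bin/bash
# Sketch of the two repair steps. The optional first argument is a
# hypothetical root prefix (for example a scratch directory); leave it
# empty, and run as root, to change the real system.

# Step 1: tell cloud-init to leave network configuration alone.
disable_cloud_init_network() {
    local root="${1:-}"
    local dir="${root}/etc/cloud/cloud.cfg.d"
    local cfg="${dir}/99-disable-network-config.cfg"
    local line='network: {config: disabled}'
    mkdir -p "$dir" || return 1
    # Append only if the exact line is missing, so reruns do not duplicate it.
    grep -qxF -- "$line" "$cfg" 2>/dev/null || echo "$line" >> "$cfg"
}

# Step 2: overwrite ifcfg-eth0 with the default content from the post.
write_ifcfg_eth0() {
    local root="${1:-}"
    local dir="${root}/etc/sysconfig/network"
    mkdir -p "$dir" || return 1
    cat > "${dir}/ifcfg-eth0" <<'EOF'
BOOTPROTO='dhcp'
MTU=''
REMOTE_IPADDR=''
STARTMODE='onboot'
CLOUD_NETCONFIG_MANAGE='yes'
EOF
}
```

Calling both functions with a scratch directory as argument lets you inspect the resulting files before calling them again, as root and without an argument, on the live system.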

How did we get here? Well, the 20171107 images (on-demand and BYOS for SLES and BYOS for SLES For SAP) were released to address a boot issue with C5. In addition, this was the first release with a version of cloud-init (17.1) where the core of the SLES and openSUSE implementation came from the upstream code rather than from patches in the package. Unfortunately, I had forgotten that between version 0.7.8 and version 0.7.9 some major changes to the network handling had taken place in cloud-init, which is why there was never a cloud-init-0.7.9 package. Anyway, the excitement of finally having the core SLES and openSUSE implementation merged upstream in 17.1 overshadowed the memory of the potential network pitfalls, and thus cloud-init 17.1 ended up in the released images, only to, of course, come back and bite me in the rear end.

The root cause of the network trouble on reboot is that on first boot, under the conditions outlined above, the interface is not marked as “auto”, meaning always start on boot. Rather, the interface is set to “manual”, which is of course completely useless in a cloud environment. How can you start the network interface manually when you cannot log in to begin with?
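To see which state an instance is in before you reboot it, you can look at the STARTMODE setting in the interface file. The sketch below uses two made-up helper names, startmode_of and starts_on_boot. It rests on my understanding of the SUSE ifcfg documentation that “onboot”, “on” and “boot” are aliases for “auto”; treat that list as an assumption and check the ifcfg man page on your system.

```shell
#!/bin/bash
# Hypothetical helper: print the STARTMODE value from an ifcfg file,
# with surrounding single or double quotes stripped.
startmode_of() {
    local file="$1"
    # If the variable is set more than once, the last assignment wins.
    sed -n "s/^STARTMODE=['\"]\{0,1\}\([^'\"]*\)['\"]\{0,1\}.*$/\1/p" "$file" | tail -n 1
}

# Hypothetical helper: succeeds when the interface is set to start at boot.
# Assumption: onboot, on and boot are treated as aliases for auto.
starts_on_boot() {
    case "$(startmode_of "$1")" in
        auto|onboot|on|boot) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, starts_on_boot /etc/sysconfig/network/ifcfg-eth0 || echo "eth0 will not come up after a reboot" gives a quick warning on an instance that still carries the “manual” setting.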

Yes, of course this should have been caught during testing. The problem with testing, of course, is that when one is unaware of a specific failure mode, no test exists for it. Anyway, our IPA test suite will be enhanced to cover the conditions that triggered this issue.

Oh, and you might have one lingering question about this: why disable cloud-init network configuration? Well, the answer is simple. When we build our images we include a default configuration file for the “eth0” network interface. That’s the content you restored earlier. We do not want cloud-init to overwrite this file, as it also enables multi-nic support on SLES and SLES For SAP, a long-standing issue we also intended to address with the 20171107 release. A separate blog about how to retrofit running instances for multi-nic support is forthcoming.

Last but not least, the v20171107 images have been deprecated and replaced with images with a timestamp of v20171121. Yes, it took about 2 weeks for us to find out about the ugly network snafu. The Quick Launcher has been updated to point to the new images, and the pint data has been updated as well.

Sorry for the hiccup, it’s all better now.

Category: Cloud Computing, Enterprise Linux, SUSE in the Cloud, SUSE Linux Enterprise Server, SUSE Linux Enterprise Server for SAP Applications
This entry was posted Wednesday, 22 November, 2017 at 9:46 am