A New Update Infrastructure For The Public Cloud
Our update infrastructure that provides updates for on-demand instances has been running with more or less no changes for more than 5 years and has shown great reliability over that time. Along the way new requirements have arisen, and some bugs could only be worked around because of fundamental limitations of the implementation. The time has come for a change and a wholesale upgrade to something new and shiny.
What’s happening with our update infrastructure?
Let me start with the things that stay the same. As with the previous incarnation there is an HA setup; it has been improved, and more on this later. Updates will remain region local for maximum performance.
The major improvements being implemented with the new update infrastructure are:
- Consolidation of update servers to serve all products from the same set of servers
- Improved HA setup
- Support of traffic routing through the DC
We worked really hard to avoid any hiccups but didn’t quite manage; more on this later. For running instances there is no service disruption during the transition, “zypper up” works.
What are the practical implications for you?
1. Consolidation of update servers to serve all products from the same set of servers
In the previous implementation we had to run update servers on a per-product basis. This resulted in interesting setup challenges in network constructs that by default deny egress to the Internet at large. As a user one either had to have a subnet with a routing table that allowed access to all update servers for all products, or one had to segregate subnets and run instances with SLES For SAP in one subnet and instances with SLES in another to get to a minimum set of IP addresses to allow egress to. As we begin to roll out SLES For HPC as a product, one would have needed yet another subnet or would have had to open egress to even more IP addresses. With the new update server technology this problem is resolved: all update servers are capable of serving updates for all products and do proper authentication, such that registration does not cross the channels between the various products. Practically this means there is/will be 1 less IP address that has to be white-listed for egress. The new update infrastructure consists of at least 3 update servers per region, which leads me to the next topic.
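For building such an egress allow-list, the addresses of the update servers in a region can be pulled out of the output of the “pint” tool discussed further down in this post. The following is a minimal sketch, not an official tool: the function name region_server_ips is made up for illustration, and it assumes that pint prints one <server .../> element per line with ip and region attributes in straight double quotes.

```shell
#!/bin/bash
# region_server_ips REGION
# Hypothetical helper: reads "pint <provider> servers" output on stdin and
# prints the unique IP addresses of the update servers in REGION.
# Assumes one <server .../> element per line with ip="..." and region="..."
# attributes.
region_server_ips() {
    grep -F -- "region=\"$1\"" \
        | sed -n 's/.*ip="\([^"]*\)".*/\1/p' \
        | sort -u
}

# Example (run where pint is installed):
#   pint amazon servers | region_server_ips ap-south-1
```

The resulting list is what would go into the routing table or firewall rule for that region; it needs to be refreshed whenever the server list changes.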
2. Improved HA setup
While we have not experienced a total outage of the update infrastructure as a whole or in any given region, there was always the nagging monkey on our back that we only had 2 systems in an HA configuration, i.e. minimal redundancy. With the new update servers, we have 3 servers in the HA configuration, providing greater resiliency. The servers are spread across availability zones wherever possible to isolate us from zone-based issues. With the consolidation of all products onto one set of servers we reduce the total number of systems we operate (yay for us) and at the same time improve our resilience. Less is more.
3. Support of traffic routing through the DC
With the new update infrastructure, we are in a position to eventually allow traffic that flows from your Public Cloud Network construct through your data center and then back to our update infrastructure. This type of traffic flow is not supportable with the previous technology and we know this is a concern for many of our users. Supporting this data flow comes with a data verification caveat that has a side effect in AWS and I’ll get to this in a minute. Due to this side effect we will not support the Cloud -> DC -> Cloud data flow immediately.
Update 2019-12-16: The cut-over date and transition period have been set. Details can be found in “Step 2 Toward Enhanced Update Infrastructure Access”.
End Update 2019-12-16
The “Grace Period”/Delay Of Cloud -> DC -> Cloud Data-flow
In order to support the Cloud -> DC -> Cloud traffic flow we needed a reliable way to look for the marker that makes SLES and SLES For SAP on-demand instances exactly that, on-demand instances. This implies that we need to look for this marker every time a system comes to visit the update servers. This checking process has two components, one server side and one client side (your instances). The client-side changes are in version 8.1.2 or later of the cloud-regionsrv-client package, and the server-side changes are in the new update infrastructure implementation. I know what you are thinking: both are available, let’s go. Well, that’s where the caveat comes into play.
In AWS EC2 a condition existed where it was possible to accidentally lose the marker that identifies an instance as a SUSE Linux Enterprise Server on-demand instance. If we enabled the Cloud -> DC -> Cloud traffic flow immediately, all instances in this condition would lose access to the update infrastructure right away. Not good. Therefore there is a transition period, exact dates to be determined, that will allow those that lost the marker to re-initiate their instances to get the marker back. Another blog will follow on this topic soon. Once the end of the transition period has been reached there will be an announcement specific to the Cloud -> DC -> Cloud traffic flow.
Update 2019-12-16: The cut-over date and transition period have been set. Details can be found in “Step 2 Toward Enhanced Update Infrastructure Access”. Also, while version 8.1.2 of cloud-regionsrv-client, as stated above, provides the necessary bits, additional enhancements have since been made to address package update concerns and the repository duplication we saw after updates to the package along certain paths. Therefore it is recommended to pull the latest version of the package:
zypper up cloud-regionsrv-client
End Update 2019-12-16
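Whether an instance already carries the client-side changes can be checked by comparing the installed package version against 8.1.2. The sketch below is one way to do that; version_at_least is a hypothetical helper name, and it relies on the version sort of GNU sort (-V) rather than on anything specific to the update infrastructure.

```shell
#!/bin/bash
# version_at_least HAVE WANT
# Succeeds when version HAVE is equal to or newer than WANT.
# Uses GNU sort's version ordering: if WANT sorts first (or ties),
# HAVE is at least WANT.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example (on an instance, as root):
#   installed=$(rpm -q --queryformat '%{VERSION}' cloud-regionsrv-client)
#   version_at_least "$installed" 8.1.2 || zypper up cloud-regionsrv-client
```

Since later versions carry additional fixes, updating to the latest package remains the recommendation even when this check passes.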
A Hiccup and a Caveat related to the transition
As indicated earlier, we didn’t quite manage to avoid all potential hiccups. There is a registration issue with SUSE Linux Enterprise Server For SAP instances created from AWS Marketplace images with a date stamp prior to 20181212. These images have a bug that was immaterial in the previous incarnation of the update infrastructure but rears its ugly head with the new one. The bug has been fixed for a while, but fixed images never made their way into the Marketplace. The images that are currently on their way to the AWS Marketplace address this issue and also contain fixes for SACK and MDS. We are working very closely with AWS to get these out into the Marketplace as quickly as possible.
The good news is that despite the automatic registration failing there is a pretty easy fix.
First check whether or not your instance is affected:
zypper lr
If this doesn’t produce any repositories and you launched a SLES For SAP instance from the AWS Marketplace, run the following commands as root:
cd /etc/products.d
ln -sf SLES_SAP.prod baseproduct
systemctl start guestregister.service
After this, “zypper lr” is expected to list the repositories just as it would on a correctly registered instance.
Before moving on, a quick explanation of the bug and what just happened. With SUSE Linux Enterprise 15 inter-module dependencies are supported. This means one module may depend on another. Naturally, dependencies require ordering, and therefore modules have to be registered in the expected order. The new update infrastructure enforces the proper module registration order, while the old update infrastructure basically accepted registration in any order and thus other weird issues could arise. The bug in the images with date stamps prior to 20181212 is that the “baseproduct” link, which indicates the so-called base product, points to the incorrect product definition file, and this breaks the registration order. The above set of commands fixes the problem and allows registration to take place as expected.
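The ordering requirement can be illustrated with tsort from coreutils, which turns “X must come before Y” pairs into a valid sequence. The product and module names below are placeholders chosen for illustration, not the real dependency data of any SUSE product, and this is not a description of how the update servers implement the check.

```shell
#!/bin/bash
# registration_order: reads "prerequisite dependent" pairs on stdin and
# prints one order in which every prerequisite precedes its dependents.
registration_order() {
    tsort
}

# Placeholder dependency chain (hypothetical names):
#   base-product must be registered before module-a,
#   module-a must be registered before module-b.
example_pairs() {
    printf '%s\n' \
        'base-product module-a' \
        'module-a module-b'
}

# Example:
#   example_pairs | registration_order
```

If the base product is wrong, everything that hangs off it is registered against the wrong starting point, which is why a single misdirected link is enough to break the whole sequence.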
Once the new images are in the AWS Marketplace the issue will simply go away.
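To see what the “baseproduct” link points to before or after applying the fix, a small check along the following lines may help. This is a sketch under assumptions: the helper names are invented, and the default directory /etc/products.d is where the product definition files usually live on SUSE systems, which this post itself does not spell out. The check only makes sense on SLES For SAP instances.

```shell
#!/bin/bash
# baseproduct_target [DIR]
# Prints the file the "baseproduct" link in DIR points to.
# DIR defaults to /etc/products.d (assumption, see the text above).
baseproduct_target() {
    local dir="${1:-/etc/products.d}"
    readlink "$dir/baseproduct"
}

# needs_baseproduct_fix [DIR]
# Succeeds when the link is missing or does not point at SLES_SAP.prod,
# i.e. when the fix shown earlier would change something.
needs_baseproduct_fix() {
    [ "$(basename "$(baseproduct_target "$1")")" != "SLES_SAP.prod" ]
}

# Example (on a SLES For SAP instance):
#   needs_baseproduct_fix && echo "baseproduct link needs fixing"
```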
An expected but previously not communicated side effect of the update infrastructure update is that it is no longer possible to register newly created SLES 11 SPx instances. The new update infrastructure servers do not support the SLES 11 registration protocol. Following our life-cycle, the SUSE Linux Enterprise Server 11 release series reached the end of general support on March 31st, 2019. This implies that launching on-demand images for SUSE Linux Enterprise Server 11 (any service pack) is no longer supported and over time all images will disappear. LTSS is available for BYOS instances or via SUSE Manager for on-demand instances.
The new update infrastructure servers do have the SLES 11 repositories and therefore no apparent service interruption to running instances is occurring. However, the SLES 11 repositories no longer receive any updates and therefore the connection to the update infrastructure as such does not deliver anything new anymore. It is really time to upgrade to a newer version of SLES. For the transition to SLES 12 we have a documented but unfortunately tedious process. This is expected to work for the most part, but you have to be careful and you have to know what you are doing. A major distribution upgrade gets easier from SLES 12 SP4 to SLES 15. That newer process is fully supported, while the SLES 11 to SLES 12 upgrade is more of a “do at your own risk” process. By the end of this year, 2019, the SLES 11 repositories will disappear from the update infrastructure.
The transition to the new update infrastructure is in full swing. Therefore the natural questions are: how does one know if a particular region has already been transitioned, and when is this going to be done? The answer to the first question can be obtained using “pint”:
pint $PROVIDER servers
produces a list of all the update servers we run in AWS, Azure, and GCE. If a region has 3 entries then that region has been switched to the new update infrastructure. For example:
pint amazon servers
<server ip="18.104.22.168" name="smt-ec2.susecloud.net" region="ap-south-1" type="smt-sles"/>
<server ip="22.214.171.124" name="smt-ec2.susecloud.net" region="ap-south-1" type="smt-sles"/>
<server ip="126.96.36.199" name="smt-ec2.susecloud.net" region="ap-south-1" type="smt-sles"/>
There are 3 servers all designated as “smt-sles” and therefore the new update infrastructure is up and running in the “ap-south-1” region.
<server ip="188.8.131.52" name="smt-ec2.susecloud.net" region="cn-north-1" type="smt-sles"/>
<server ip="184.108.40.206" name="smt-ec2.susecloud.net" region="cn-north-1" type="smt-sles"/>
There are only 2 servers in the “cn-north-1” region designated as “smt-sles” and therefore this region has not yet transitioned. The same applies to the other frameworks:
pint google servers
<server ip="220.127.116.11" name="smt-gce.susecloud.net" region="europe-west6" type="smt-sles"/>
<server ip="18.104.22.168" name="smt-gce.susecloud.net" region="europe-west6" type="smt-sles"/>
<server ip="22.214.171.124" name="smt-gce.susecloud.net" region="europe-west6" type="smt-sles"/>
pint microsoft servers
<server ip="126.96.36.199" name="smt-azure.susecloud.net" region="southafricanorth" type="smt-sles"/>
<server ip="188.8.131.52" name="smt-azure.susecloud.net" region="southafricanorth" type="smt-sles"/>
<server ip="184.108.40.206" name="smt-azure.susecloud.net" region="southafricanorth" type="smt-sles"/>
You will eventually also see a change in the “type” designation in the data. With the split infrastructure we had to introduce the “smt-sles” and “smt-sap” designations. These will go away and the type will simply be “smt”.
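The “3 entries per region” rule can be scripted rather than checked by eye. The sketch below counts servers per region regardless of the “type” value, so it should keep working once the designation changes to “smt”. The function names are hypothetical, and the parsing assumes one <server .../> element per line with straight double quotes around attribute values.

```shell
#!/bin/bash
# servers_per_region: reads "pint <provider> servers" output on stdin and
# prints "<region> <count>" lines, sorted by region name.
servers_per_region() {
    sed -n 's/.*region="\([^"]*\)".*/\1/p' | sort | uniq -c | awk '{print $2, $1}'
}

# region_transitioned REGION: reads the same output on stdin and succeeds
# when REGION lists at least 3 servers (the rule of thumb from this post).
region_transitioned() {
    local count
    # grep -c prints 0 but exits non-zero when nothing matches.
    count=$(grep -cF -- "region=\"$1\"" || true)
    [ "$count" -ge 3 ]
}

# Example:
#   pint amazon servers | servers_per_region
#   pint amazon servers | region_transitioned cn-north-1 || echo "not yet"
```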
We almost pulled it off. We are going through the switch with no downtime in update server availability, but we stumbled over an issue that is not completely in our control. The SLES 11 caveat should have been pre-announced. We apologize for the inconvenience caused.