A day long to be remembered 2018-01-03
When I wrote this it was January 9, 2018, which was supposed to be the day of a major earthquake in the computing world. However, due to various circumstances it all happened on January 3rd, 2018 at 2 PM PST, and the aftershocks, as well as the queasy feelings that come along with such an event, will be with us for a long time to come. It will probably be a few years before we can evaluate how this event changes the direction of computing as a whole. Opinions will vary widely, and no doubt there will be much speculation and finger-pointing. Anyway, I do not want to reminisce about Meltdown and Spectre, nor do I want to speculate. What I do want to do is talk specifically about the effects of Meltdown and Spectre on our Public Cloud efforts. Everything in the announcement blog applies.
Let me start by talking about the update infrastructure that supports all on-demand images. The reported issues affected us on two levels: one that we can control, just like everyone else running in the Public Cloud, and one that is not under our control. With respect to what we can control, i.e. our Virtual Machines, all of those have been updated with the kernel that supports PTI and have been restarted across all frameworks where our update infrastructure runs. On the second level we are of course dependent on the framework and on the speed at which the framework changes were rolled out. We also had a few cases where our systems were force restarted at an inopportune time for us, which added a bit of extra fun. Anyway, everything got done and the updated systems were up and running in a reasonable amount of time.
Unfortunately, over the weekend (January 6-7) an issue developed that triggered registration failures for on-demand instances. The registration issue was unrelated to Meltdown and Spectre, and it was fixed by 5:06 P.M. EST on Monday, January 8, 2018 in all frameworks and all regions. As of the time of this writing the ultimate root cause of the failure is still under investigation. The immediate problem was that product availability data disappeared from the DB on the update servers. This data is populated by a cron job and we are investigating the data provider. We are in the process of designing and implementing new monitoring controls to ensure we get alerted immediately if such an event should ever occur again. Once the ultimate root cause is understood, measures will be put in place to prevent the issue from recurring.
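To give a rough idea of what such a monitoring control could look like, here is a minimal sketch in Python. It is not our actual monitoring: the SQLite database, the table name "products", and the function name are all assumptions made purely for illustration. The idea is simply to check that the product availability table is not empty and to raise a flag as soon as it is.

```python
import sqlite3


def product_data_missing(conn, min_rows=1):
    """Return True when the (hypothetical) products table has too few rows.

    The table name "products" is an assumption for this sketch; the real
    schema on the update servers is not described in this post.
    """
    (count,) = conn.execute("SELECT COUNT(*) FROM products").fetchone()
    return count < min_rows


if __name__ == "__main__":
    # Demo against an in-memory database standing in for the real DB.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (name TEXT)")
    if product_data_missing(conn):
        print("ALERT: product availability data is missing")
```

A check like this would be run on a short interval, independent of the cron job that populates the data, so that an empty table is noticed within minutes instead of through failed registrations.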
As far as update availability is concerned, we actively pulled the updated kernel packages into the update infrastructure as soon as they were available, which was on the afternoon of January 4th EST. Usually updates are pulled by a cron job, so a released update may take up to 24 hours to show up in the update infrastructure. In this case that was a delay we did not want to incur, and thus we pulled the updated packages manually.
Finally I’d like to talk about our images. Given the severity of the issues we refreshed not only the on-demand images, as is customary for critical security issues, but also the BYOS images for everything that is considered maintained. All images that are under our direct release control have been released, and when you start new instances you want to start with images that have a date stamp greater than 20180104, i.e. January 4th 2018. If you are looking for specific images please use the “pint” tool.
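If you have a list of image names and want to check the date stamps yourself, a small Python sketch along these lines may help. It assumes the date stamp appears in the image name as "v" followed by eight digits (YYYYMMDD); that naming pattern and the example names are assumptions for illustration, so adjust the pattern to what your framework actually shows. The comparison is strictly "greater than", following the wording above.

```python
import re
from datetime import date, datetime

# Assumed naming scheme: "v" followed by an 8-digit YYYYMMDD date stamp.
_STAMP = re.compile(r"v(\d{8})(?!\d)")

# Cutoff taken from the text above: 20180104.
CUTOFF = date(2018, 1, 4)


def image_date(name):
    """Extract the date stamp from an image name, or None if absent/invalid."""
    match = _STAMP.search(name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        return None


def is_refreshed(name, cutoff=CUTOFF):
    """True when the image's date stamp is strictly later than the cutoff."""
    stamp = image_date(name)
    return stamp is not None and stamp > cutoff


if __name__ == "__main__":
    # Hypothetical image names, for illustration only.
    for name in ["example-image-v20171212-x86_64", "example-image-v20180106-x86_64"]:
        print(name, "refreshed" if is_refreshed(name) else "older or unknown")
```

Names without a recognizable date stamp are treated as not refreshed, which errs on the side of caution.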
For images that are not directly under our release control we are working with the respective framework provider to make those images available as quickly as possible. Update 2018-01-13: All images have now been published.
Well, 2018 certainly started out with a bang; let's hope the rest of the year will be a little less exciting.
Last but not least, keep an eye out for an update to Firefox and install it on your own systems as soon as possible. The next version of Firefox will have changes to the timer implementation to mitigate Spectre attacks. If you are using Chromium or Chrome, make sure you enable “Strict Site Isolation”.