Vojtěch’s announcement about kGraft started a number of interesting discussions, e.g. at LWN, Reddit, and Germany’s Heise. All of those discussions share some common topics. One of these is the question of whether a live kernel patching solution such as kGraft is necessary at all. Typical comments sound like this:
- What’s so bad about rebooting?
- We already have high availability – Why this?
- Regular reboots are healthy!
So, why are we going this way?
Welcome to the world of Service Level Agreements. Those typically exist between companies or IT departments and lines of business within one company.
SLAs ensure that you can get money from an ATM at any point in time or that your salary is paid. Sounds promising, doesn’t it?
Well, things tend to become complex when SLAs depend on each other, get mixed, or even contradict each other. Administrators’ lives become painful when the “planned downtime” is simply too short to implement required functionality while remaining stable and secure at the same time, even though either one alone (functionality, or stability/security) would be achievable.
The “magic nines” measure the availability of a service: “4 nines” (99.99%) means that your service must not be down for more than about 52 minutes a year. Let’s keep that in mind.
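The downtime budget behind those “magic nines” is simple arithmetic, sketched here in Python for illustration:

```python
# Downtime budget implied by an availability SLA ("magic nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year

def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of,
    e.g., 99.99% (4 nines)."""
    availability = 1 - 10 ** (-nines)
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.1f} minutes/year")
# 3 nines: 525.6 minutes/year
# 4 nines:  52.6 minutes/year
# 5 nines:   5.3 minutes/year
```

With 4 nines, one 15-minute reboot eats more than a quarter of the entire yearly budget.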
Yes, those SLAs can very often be met using existing technology such as load balancers, high availability, or virtual machine failover. However, any reduction in the required frequency of reboots saves time and money. For services to stay available during scheduled downtimes of components, a minimum of triple redundancy is required, driving up costs.
Yet there is more. Let’s have a look at three typical use cases, where requirements go beyond the comfort zone of existing technology, and customers or partners frequently and repeatedly requested us to consider kernel patching without a reboot, as otherwise their needs would not be met:
Use Case 1: In-Memory Databases
In-memory databases are fast at calculating results from huge amounts of data.
Systems running in-memory databases are pretty slow to boot, though, as they typically have high numbers of CPU cores (100+), big storage backends (you do want to store your data permanently, don’t you?), and especially huge amounts of RAM. We are talking 4 TiB and beyond here. All this hardware needs time to come up in a sane state, and you need to load the existing data into memory before you can proceed.
Bootup times of 15 minutes or more are not the exception. Remember those “4 nines” above? This means that, with a single machine, you can boot your in-memory database system 3 times a year and have another 7 minutes left to do service on it, …
Use Case 2: Simulations
A lot of technology in our world needs simulation before a product can be built. Think of cars and ships, chip design, or pharmaceuticals, to name a few.
Some of these simulations need weeks or even months to produce a result, yet not all simulation frameworks or tasks actually allow you to properly checkpoint and restart a running simulation.
Use Case 3: Infrastructure Building Blocks
I don’t know about you, but I personally prefer phone calls to have clean sound and to happen without any interruption, so you can concentrate on the topic and come to quick and positive results. It’s similar for other infrastructure devices, such as network appliances, storage engines, or even hosts for consolidating virtual machines.
The common denominator among those systems is that they are responsible for providing a continuous service to a large number of consumers (machines or indeed human consumers), and it’s hard to find a time when you can stop the service, because there is always somebody accessing it.
In all three use cases described, the key is to keep a specific system running until a predefined maintenance window is reached, while still being able to fix stability or security issues along that long way.
There certainly are more examples, and I am happy to learn about those.