Now that everything has been set to go as described in Part 1 (Key technical obstacles for Live Patching Meltdown), Part 2 (Virtual address mappings and the Meltdown vulnerability) and Part 3 (Changes needed for Translation Lookaside Buffer (TLB) flushing primitives), the last missing piece is to actually replace the entry code and make it switch back and forth between the two page table variants at entry to kernel and exit to user space respectively.
As initially described, the idea was to replace all of the CPU’s pointers to entry code with addresses into the live patch’s replacement. However, it should be possible to undo that operation and eventually unload the live patch module. Before the module is unmapped from memory, it must be ensured that no task is executing, or will ever execute again, anywhere within the entry code replacement.
Consider, for example, a thread issuing some syscall and entering the kernel through the replaced entry code. Assume that the entry code replacement gets unmapped while the syscall is executing. When the syscall completes, a return address pointing to the now removed entry code will be found on the stack, and the inevitable attempt to return to that location will cause bad things to happen. In fact, any task not currently running on some CPU is sleeping somewhere in kernel space and thus has a return address pointing into entry code stored on its stack.
In order to handle this, all tasks currently either executing in or having a return address to the entry code replacement on their stack are tracked by means of a reference count. It gets incremented at entry to kernel space and decremented again right before the exit to user space. A small but important detail is that the increment must have happened before reaching any point where a task might get scheduled out again. Otherwise some on-stack references to entry code might get missed. Similarly, the decrement must be made only after the very last possibility to get scheduled has been left behind.
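The counting scheme described above can be illustrated with a small user-space sketch. It is not taken from the proof of concept: the names `entry_refs`, `entry_ref_get()`, `entry_ref_put()` and `entry_ref_count()` are hypothetical, and C11 atomics stand in for whatever primitives the real entry code replacement uses.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical counter: number of tasks that are executing in the entry
 * code replacement or hold a return address into it on their stack.
 */
static atomic_long entry_refs;

/*
 * To be called at kernel entry, before any point at which the task
 * could get scheduled out.
 */
static void entry_ref_get(void)
{
    atomic_fetch_add(&entry_refs, 1);
}

/*
 * To be called right before the exit to user space, after the last
 * point at which the task could still get scheduled out.
 */
static void entry_ref_put(void)
{
    long prev = atomic_fetch_sub(&entry_refs, 1);

    /* A drop below zero would indicate an unbalanced entry/exit pair. */
    assert(prev > 0);
}

/* Current number of tracked references. */
static long entry_ref_count(void)
{
    return atomic_load(&entry_refs);
}
```

In this sketch, a task that sleeps in the middle of a syscall simply keeps its reference until it eventually reaches `entry_ref_put()`, so the count stays above zero for as long as its stack points into the replacement.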
With this reference counting in place, the live patch module’s cleanup handler would restore the CPUs’ pointers to the original, unpatched entry code and block until the reference count has drained to zero. A final schedule_on_each_cpu() run afterwards would then ensure that no task is still executing within the small window between the counter decrement and actual return to user space.
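The order of the cleanup steps matters, so here is a rough user-space sketch of it. Everything in it is an assumption made for illustration: `replacement_installed` stands in for the CPUs' entry code pointers, `entry_refs` for the reference count, and `sched_yield()` for blocking. The real handler runs in kernel context and finishes with schedule_on_each_cpu(), which has no user-space counterpart here and appears only as a comment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real kernel state. */
static atomic_long entry_refs;            /* tasks referencing the replacement */
static atomic_bool replacement_installed; /* models the CPUs' entry pointers   */

/* Point new kernel entries at the replacement entry code. */
static void livepatch_install(void)
{
    atomic_store(&replacement_installed, true);
}

static void livepatch_cleanup(void)
{
    /* Cleanup only makes sense while the replacement is installed. */
    assert(atomic_load(&replacement_installed));

    /*
     * Step 1: restore the original entry code first, so that no new
     * references to the replacement can be taken from here on.
     */
    atomic_store(&replacement_installed, false);

    /* Step 2: wait until all existing references have been dropped. */
    while (atomic_load(&entry_refs) != 0)
        sched_yield();

    /*
     * Step 3 (kernel only, not modelled): a final
     * schedule_on_each_cpu() run would cover tasks that have already
     * decremented the counter but not yet returned to user space.
     */
}
```

Restoring the pointers before waiting is what makes the wait terminate: once no new entries go through the replacement, the count can only fall.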
Finally, note that task forks and exits need special treatment: a freshly forked task leaves the kernel without ever having entered it, while an exiting task enters the kernel but never leaves it again. In order to account for those reference counter imbalances, the kernel’s fork and exit code is kGraft-patched and thus made to compensate for them.
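To make the imbalance concrete, the following sketch models the counter across one fork and one exit. The hook names `fork_child_hook()` and `task_exit_hook()` are hypothetical, and the way the patched fork and exit code actually compensates in the proof of concept may differ. The sketch only shows that one extra increment per forked child and one extra decrement per exiting task keep the count balanced.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical reference counter, as in the scheme described above. */
static atomic_long entry_refs;

/* Regular paths through the entry code replacement. */
static void kernel_entry(void)
{
    atomic_fetch_add(&entry_refs, 1);
}

static void exit_to_user(void)
{
    long prev = atomic_fetch_sub(&entry_refs, 1);

    assert(prev > 0); /* catches a decrement without a matching increment */
}

/*
 * Assumed compensation in the patched fork code: the child will pass
 * through exit_to_user() without ever having gone through kernel_entry(),
 * so a reference is taken on its behalf.
 */
static void fork_child_hook(void)
{
    atomic_fetch_add(&entry_refs, 1);
}

/*
 * Assumed compensation in the patched exit code: the dying task went
 * through kernel_entry() but will never reach exit_to_user(), so its
 * reference is dropped here.
 */
static void task_exit_hook(void)
{
    long prev = atomic_fetch_sub(&entry_refs, 1);

    assert(prev > 0);
}

/* Current number of tracked references. */
static long refs_now(void)
{
    return atomic_load(&entry_refs);
}
```

Without the two hooks, every fork would leave the count one too low and every exit one too high, and the cleanup handler could either return too early or wait forever.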
As demonstrated, even the complex Meltdown vulnerability can in principle still be live patched using kGraft. I hope you enjoyed reading this as much as I enjoyed doing the research, and I would like to point you to the proof of concept sources for further reference: https://github.com/nicstange/meltdown-livepatch
Special thanks to my colleagues Miroslav Benes, Jiri Kosina and Libor Pechacek for their encouragement and support!
Here is the set of 4 blogs on this topic for your quick reference: