CPU Isolation – Nohz_full troubleshooting: broken TSC/clocksource – by SUSE Labs (part 6)

Last Updated On: December 14, 2022 | By: Frederic Weisbecker

This blog post is the sixth in a technical series by the SUSE Labs team exploring Kernel CPU Isolation along with one of its core components: Full Dynticks (or Nohz Full). Here is the list of the articles in the series so far:

CPU Isolation – Introduction
CPU Isolation – Full dynticks internals
CPU Isolation – Nohz_full
CPU Isolation – Housekeeping and tradeoffs
CPU Isolation – A practical example
CPU Isolation – Nohz_full troubleshooting: broken TSC/clocksource


The previous blog post presented a practical example of how to isolate a CPU and launch a dedicated process on it. Now we are going to explore CPU isolation troubleshooting, starting with a strong focus on a common issue met by nohz_full users on x86.

1) What is a clocksource? And a TSC?

A clocksource is a kernel object implementing a hardware or partly software resource that can provide a timestamp of the current time. This is a clock the kernel relies on to do its work. Note this is a passive clock, not to be confused with a clockevent object that can issue an interrupt once a given deadline has elapsed. A clocksource is a simple clock that can be read from. A clockevent, on the other hand, is a programmable alarm.

Clocksources each have different properties:

Read latency: How fast is the timestamp to fetch? Since it can be read on fast-path, this has a significant impact on performance.
Granularity: How many nanoseconds elapse between two increments? This has an impact on precision.
Capacity: How much can it count before wrapping around? This has an impact on powersaving and even precision/stability if the clock eventually manages to wrap.
Per-CPU: Is the clocksource global or can it be found on each CPU?
Reliability/Stability: Is the clocksource trustworthy enough for serious use? This has an impact on performance and usability.
- Every clocksource has an uncertainty margin, but it is no longer considered stable as soon as it crosses that margin.
- Does the clocksource accelerate, slow down, move backwards or stop sometimes? It must advance constantly.
- In case it is per-CPU, is it well synchronized between them?

The following is a typical example of a list of clocksources that can be found on an x86 machine:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm

TSC: The timestamp counter is a clock implemented within the processor on x86 architectures. As such it is very fast and has a high granularity (GHz). It also has a good capacity. Due to its nature, it is a per-CPU clock with a reliability/stability that may vary across processor models.
HPET: The high precision event timer is a timer and clock traditionally available on x86 platforms. It has a good granularity (MHz) but is not as fast to read as the TSC. It is therefore a fallback in case the TSC is too unstable to use. The HPET has a reasonable capacity (wraps after ~300 seconds).
ACPI_PM: The ACPI Power Management Timer is a simple timer and clock. It is usually as slow to read as the HPET, with a slightly lower granularity (still MHz). It also has a smaller capacity (wraps after ~4 seconds), making it less than ideal for powersaving during long idle periods.
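The wrap times quoted above follow directly from the counter width and its frequency. Here is a small sketch of that arithmetic. The widths and frequencies are assumptions (commonly seen values for these devices, not taken from this article): a 32-bit HPET counter ticking at 14.31818 MHz and a 24-bit ACPI PM timer ticking at 3.579545 MHz. The wrap_seconds helper name is made up for illustration.

```shell
# wrap_seconds BITS HZ: whole seconds before a BITS-wide counter ticking at
# HZ wraps around (integer division, so the result is rounded down).
wrap_seconds() {
    echo $(( (1 << $1) / $2 ))
}

# Assumed typical values, not taken from the article:
wrap_seconds 32 14318180   # 32-bit HPET counter at ~14.3 MHz: prints 299
wrap_seconds 24 3579545    # 24-bit ACPI PM timer at ~3.58 MHz: prints 4
```

Under those assumptions the results line up with the ~300 seconds and ~4 seconds mentioned above.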

Some clocks are also hidden so as not to expose irrelevant sources in userspace that may otherwise jeopardize hrtimer/nohz if they were set as the current clocksource. The most notable one is:

JIFFIES: This clocksource doesn’t rely on a direct timestamp read from hardware but instead uses a clockevent as a backend. A clockevent is programmed to tick at a constant rate (CONFIG_HZ ranges from 24 to 1200) and a variable named jiffies is incremented at every interrupt fired by the clockevent. If the previous clocksources can be compared to actual clocks that a user can visually check at any time, the jiffies clocksource looks more like a tally counter incrementing every time a bell rings. This clocksource has a very low resolution (kHz at best) and stands as the basis for traditional imprecise timers.

The kernel relies heavily on clocksources, and it can use different ones for different purposes, sometimes bypassing the actual clocksource framework and referring directly to the source, as is the case for the scheduler clock or jiffies. The high resolution timers and the dynticks subsystem (on which nohz_full depends) both specifically need the most precise, the most reliable, the highest capacity and, if possible, the fastest clocksource. This is what a stable TSC looks like and, as we’ll explain in the next part, getting one is no piece of cake.

2) TSC reliability

2.1) Common issues

2.1.1) Variable rate

On old Intel processors, the TSC used to count at the actual CPU frequency. If the CPU frequency sped up or slowed down, the clock did the same, losing its synchronization with the course of time. This made the TSC unusable as a clocksource. Modern processors shouldn’t have that issue. Unaffected CPUs have the “constant_tsc” flag set:

$ grep constant_tsc /proc/cpuinfo

2.1.2) Stop

Some processors stop the TSC when entering a low power mode (deep idle state), producing clock gaps. Unaffected CPUs have the “nonstop_tsc” flag set:

$ grep nonstop_tsc /proc/cpuinfo

2.1.3) Overwrite abuses and recovery

SMM code may interrupt the kernel via SMIs. Those are unmaskable interrupts that are mostly unnoticeable, and some broken BIOSes may try to overwrite the TSC in order to hide the CPU time spent in that system interrupt. For example the TSC may be saved on interrupt entry and restored on exit. This breaks the actual time accounting and the timekeeping synchronization between CPUs.

Modern processors that support the IA32_TSC_ADJUST MSR are able to keep track of any modification of the TSC after boot and provide a way to cancel it. This helps the kernel protect against abuses by periodically checking for modifications and fixing them. Protected CPUs have the “tsc_adjust” flag:

$ grep tsc_adjust /proc/cpuinfo

2.1.4) Multi-socket systems

The TSC can be hard to synchronize between sockets (see this document and especially the “TSC and multi-socket” part) and in practice there is no architectural guarantee that the TSC is well synchronized beyond 4 sockets. For that reason and to take an even more defensive approach, a TSC can’t be considered reliable beyond 2 sockets.

2.2) Trustworthy architectures

Some processor models are known not to carry the issues described above and are therefore assumed to provide a reliable TSC. In this case the “tsc_reliable” flag will be set:

$ grep tsc_reliable /proc/cpuinfo

2.3) Expanded trust

The Linux kernel is very careful in determining whether a TSC is reliable or not. Until recently, only a few trusted processor models qualified: those that the kernel tagged with the “tsc_reliable” flag described above. But the situation has changed with an upstream commit released in the 5.16 Linux kernel. The range of trusted models has since widened to those meeting all of the following criteria:

The CPU has the constant_tsc flag (clock rate doesn’t vary with CPU frequency).
The CPU has the nonstop_tsc flag (clock doesn’t stop when the CPU goes idle).
The CPU has the tsc_adjust flag (clock abusive rewrites can be detected and reverted).
The hardware doesn’t have more than two sockets (beyond that, clocks aren’t guaranteed to be synchronized between sockets).

To summarize, if the CPU has the tsc_reliable flag, it is considered reliable. If it doesn’t have the tsc_reliable flag, then it has to meet the four requirements above as a last chance to be considered reliable, and that only applies to kernels >= 5.16. SUSE Linux Enterprise Server 15 SP4 (including its Real Time counterpart) supports this pretty well.
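The decision summarized above can be approximated from userspace. The following is only a sketch under stated assumptions, not the kernel’s actual check: it reads the flags from a cpuinfo-style file and counts sockets as the number of distinct “physical id” values, which may differ from what the kernel counts internally. It does not check the kernel version either. The has_flag, count_sockets and tsc_trust names are made up for illustration.

```shell
# has_flag FLAG [FILE]: succeed if the first "flags" line of FILE lists FLAG.
has_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" | grep -qw "$1"
}

# count_sockets [FILE]: number of distinct "physical id" values.
# Assumption: one socket when the file exposes no such line.
count_sockets() {
    awk -F: '/^physical id/ { seen[$2] = 1 }
             END { n = 0; for (k in seen) n++; print (n > 0 ? n : 1) }' \
        "${1:-/proc/cpuinfo}"
}

# tsc_trust [FILE]: print a rough verdict following the rules listed above.
tsc_trust() {
    f="${1:-/proc/cpuinfo}"
    if has_flag tsc_reliable "$f"; then
        echo "reliable (tsc_reliable flag)"
    elif has_flag constant_tsc "$f" && has_flag nonstop_tsc "$f" &&
         has_flag tsc_adjust "$f" && [ "$(count_sockets "$f")" -le 2 ]; then
        echo "reliable on kernels >= 5.16 (expanded trust)"
    else
        echo "unreliable (clocksource watchdog expected)"
    fi
}
```

Calling tsc_trust without an argument inspects /proc/cpuinfo on the running machine. The authoritative answer remains whether the clocksource_watchdog timer actually fires.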

2.4) Unreliable TSC implications

Now what does an unreliable TSC imply? Does it mean the TSC is unusable and the system needs another main clocksource? Not really, or rather not yet. An unreliable TSC has the potential to become unstable, and if that happens, the kernel does need to switch to a more trustworthy clocksource. Until then, however, an unreliable TSC has yet to be observed as unstable. A watchdog implements that vigilance by periodically comparing the delta between two TSC snapshots against the delta between two equivalent snapshots of a clocksource considered more reliable. If the difference between the TSC delta and the other clocksource’s delta exceeds the accepted margin, the TSC is tagged as unstable and the kernel switches to another clocksource.

This watchdog runs in a timer that executes every 0.5 seconds, cycling through each CPU in turn. For example it runs on CPU 0, waits 0.5 seconds, then executes on CPU 1, waits 0.5 seconds, then executes on CPU 2, and so on. So on any given CPU, the watchdog runs every num_online_cpus/2 seconds.

Here is a bash command providing an estimation of how often the timer runs on any given CPU:

$ echo $(($(getconf _NPROCESSORS_ONLN)/2))

For example, on an 8-CPU system, the watchdog runs every 4 seconds on each CPU.
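Note that the integer division in the command above rounds down when the number of CPUs is odd. A small variant, sketched here with a made-up watchdog_period_ms helper, works in milliseconds to avoid that. It assumes, as described above, that the timer hops to the next online CPU every 0.5 seconds.

```shell
# watchdog_period_ms [NCPUS]: milliseconds between two watchdog runs on the
# same CPU, assuming one hop to the next online CPU every 500 ms.
# NCPUS defaults to the number of online CPUs.
watchdog_period_ms() {
    ncpus="${1:-$(getconf _NPROCESSORS_ONLN)}"
    echo $(( ncpus * 500 ))
}
```

For instance, `watchdog_period_ms 8` prints 4000 and `watchdog_period_ms 3` prints 1500.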

Systems carrying a reliable TSC don’t run this watchdog timer.

2.5) Unreliable TSC VS Nohz_full

On most workloads, the aforementioned watchdog is barely noticeable. But users of CPU isolation and nohz_full who rely on undisturbed CPUs to perform critical jobs may see this periodic timer callback as an unbearable nuisance even if it triggers only once every few seconds. Here is a trace excerpt of an empty loop running on an isolated CPU with an unreliable TSC:

user_loop-1412 [007] ..s. 241.700979: timer_expire_entry: timer=00000000621eeca1 function=clocksource_watchdog now=4294952320 baseclk=4294952320
user_loop-1412 [007] ..s. 245.700972: timer_expire_entry: timer=00000000621eeca1 function=clocksource_watchdog now=4294953320 baseclk=4294953320
user_loop-1412 [007] ..s. 249.700966: timer_expire_entry: timer=00000000621eeca1 function=clocksource_watchdog now=4294954320 baseclk=4294954320

This is a machine with 8 CPUs, so the timer runs every 4 seconds on the isolated CPU. The name of the clocksource watchdog’s timer callback is clocksource_watchdog and it is easily spotted when the timer_expire_entry trace event is enabled:

# echo 1 > /sys/kernel/debug/tracing/events/timer/timer_expire_entry/enable
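Once the event is enabled, the gaps between watchdog expirations can be measured from the trace itself. This is a minimal sketch, assuming the trace line layout shown in the excerpt above (timestamp in the fourth whitespace-separated field). The watchdog_gaps helper name is made up for illustration.

```shell
# watchdog_gaps: read trace lines on stdin and print the number of seconds
# elapsed between consecutive clocksource_watchdog expirations.
# Assumes the timestamp is the 4th field; a task name containing spaces
# would shift the fields and break this.
watchdog_gaps() {
    awk '/function=clocksource_watchdog/ {
        t = $4
        sub(/:$/, "", t)                    # strip the trailing colon
        if (seen) printf "%.3f\n", t - prev
        prev = t
        seen = 1
    }'
}
```

The per-CPU trace file can be piped into it, for example `cat /sys/kernel/debug/tracing/per_cpu/cpu7/trace | watchdog_gaps`. Fed with the three lines of the excerpt above, it prints 4.000 twice.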

2.6) Overcome an unreliable TSC

2.6.1) Do I have an unreliable TSC? Is it a problem?

If you observe the clocksource_watchdog timer callback after enabling the timer_expire_entry timer trace event, as explained in the previous chapter, it means your TSC is unreliable. Is it a problem? It depends on the requirements of your workload. If it really can’t suffer any disturbance at all, here are a few possible solutions/workarounds.

2.6.2) Upgrade kernel version

The first thing to do is to make sure you’re running a kernel >= 5.16, or if you run a SUSE product, make sure it is at least SUSE Linux Enterprise Server 15 SP4. This is especially helpful if your hardware meets the requirements for TSC reliability (see the “Expanded trust” chapter above).

2.6.3) tsc=nowatchdog

If upgrading the kernel version isn’t enough and your TSC is still not considered reliable by the kernel, it means the hardware doesn’t meet the requirements. Unfortunately there is no sane solution from there. The only remaining option is to force the kernel to shut down the clocksource watchdog, though users are warned that they do so at their own risk. The TSC may skew or drift at any time and there will be no way for the kernel to fix it. This can result in any kind of kernel malfunction.

For those ready to take the risk, simply pass the tsc=nowatchdog kernel boot parameter (on kernel >= 5.1 or SUSE Linux Enterprise Server 15 SP4).

Note though that another timer (tsc_sync_check_time) may still stay around to check whether the TSC has been abusively rewritten by SMM code. This timer cycles through all CPUs in turn every 10 minutes. In order to know how often it runs on any given CPU, run the following command:

$ echo $(($(getconf _NPROCESSORS_ONLN)*10))

This will give you the result in minutes.

2.6.4) tsc=reliable

The previous solution doesn’t work on kernels < 5.1, and it doesn’t deactivate the tsc_sync_check_time timer either. For those who want to push further, it is still possible to pass the tsc=reliable kernel boot parameter. This has been supported for over a decade (since 2.6.28) and it also deactivates the rare tsc_sync_check_time timer, removing the protection against abusive TSC rewrites from the BIOS. As with the previous solution, use it at your own risk: this can result in any kind of kernel malfunction.

3) TSC stability

From the kernel point of view:

An unreliable TSC means there is a suspicion that something bad can happen with the clocksource.
An unstable TSC means that something bad has actually happened with the clocksource.

When the clocksource watchdog detects too much drift or skew from the unreliable TSC, it tags it as unstable. When that happens the generic clocksource layer arranges for switching the current clocksource from the unstable TSC to a more stable clocksource (hpet, acpi_pm, etc.).

Unfortunately even if the current clocksource has been switched over from the bad TSC to another source, the scheduler clock is an exception as it continues to rely on the TSC, even if unstable, because it still needs the fastest source available. It’s a requirement due to it being called very often and on fast paths. In order to cope with the instability, sched_clock() then resorts to the tick to take a regular snapshot from a more stable clocksource and uses it as a reference to filter and stabilize the timestamps fetched from the TSC.

Symptoms of an unstable TSC can be grepped from dmesg:

# dmesg | grep -i tsc
[ 0.243540] tsc: Marking TSC unstable due to check_tsc_sync_source failed
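The clocksource currently in use gives another hint: once the TSC has been marked unstable, the kernel is expected to have switched away from it. Here is a minimal sketch reading the same sysfs directory as earlier in this article. The report_clocksource helper name is made up for illustration, and the optional directory argument only exists to ease testing.

```shell
CS_DIR=/sys/devices/system/clocksource/clocksource0

# report_clocksource [DIR]: print the clocksource in use and point out when
# it is not the TSC. DIR defaults to the sysfs clocksource directory.
report_clocksource() {
    dir="${1:-$CS_DIR}"
    cur=$(cat "$dir/current_clocksource")
    if [ "$cur" = "tsc" ]; then
        echo "current clocksource: tsc"
    else
        echo "current clocksource: $cur (TSC not in use)"
    fi
}
```

A result other than tsc is not a proof of instability by itself (the clocksource can also be chosen manually), so it should be read together with the dmesg output.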

A trace issued with the tick_stop trace event enabled on a nohz_full CPU would also show that the tick is prevented from being stopped due to the unstable TSC:

# echo 1 > /sys/kernel/debug/tracing/events/timer/tick_stop/enable
# cat /sys/kernel/debug/tracing/per_cpu/cpu7/trace
user_loop-1305 [007] d.... 1200.855901: tick_stop: success=0 dependency=CLOCK_UNSTABLE
user_loop-1305 [007] d.... 1200.859901: tick_stop: success=0 dependency=CLOCK_UNSTABLE
user_loop-1305 [007] d.... 1200.863906: tick_stop: success=0 dependency=CLOCK_UNSTABLE
[...]

Indeed the tick can’t be stopped if the TSC is unstable because the scheduler clock relies on the timer interrupt to stabilize it. And unfortunately there is no way to overcome this problem. Nohz_full is simply unusable if the TSC is unstable.
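To see at a glance what keeps the tick alive, the failed tick_stop events can be tallied per dependency. This is a minimal sketch assuming the trace format shown above. The tick_stop_blockers helper name is made up for illustration.

```shell
# tick_stop_blockers: read trace lines on stdin and count the failed
# tick_stop attempts for each reported dependency.
tick_stop_blockers() {
    awk '/tick_stop: success=0/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^dependency=/) {
                dep = $i
                sub(/^dependency=/, "", dep)   # keep only the dependency name
                count[dep]++
            }
    }
    END { for (dep in count) print count[dep], dep }'
}
```

For example, `cat /sys/kernel/debug/tracing/per_cpu/cpu7/trace | tick_stop_blockers` prints one line per dependency. With the three trace lines shown above, the output is "3 CLOCK_UNSTABLE".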

4) Other architectures

It seems that only x86 has a clocksource officially considered as unreliable (so far). This doesn’t mean that other architectures have fully accurate clocksources but merely that the problems there are waiting to be exposed someday. As for possibly unstable clocksources, another occurrence can be found with the MIPS R4K timers that seem to misbehave when the CPU frequency changes. Patience will certainly bring more candidates to grow the list in the future.

5) Sources, bibliography and further reading


Jul 03rd, 2024

Frederic Weisbecker Linux Kernel Engineer at SUSE Labs.