CPU Isolation – Nohz_full – by SUSE Labs (part 3)

March 7, 2021 | By: Frederic Weisbecker

This blog post is the third in a technical series by the SUSE Labs team exploring Kernel CPU Isolation along with one of its core components: Full Dynticks (or Nohz Full). Here is the list of the articles in the series so far:

CPU Isolation – Introduction
CPU Isolation – Full dynticks internals
CPU Isolation – Nohz_full
CPU Isolation – Housekeeping and tradeoffs
CPU Isolation – A practical example
CPU Isolation – Nohz_full troubleshooting: broken TSC/clocksource

Undisturbed

Now that we have drowned ourselves in theory and full dynticks internals, it’s time to dive into the feature in practice.

NOHZ_FULL

The “nohz_full=” kernel boot parameter is the current main interface to configure full dynticks along with CPU Isolation.

A cpu-list argument is passed to define the set of CPUs to isolate. Assuming you have 8 CPUs for example and you want to isolate CPUs 4, 5, 6, 7:

nohz_full=4-7

Some more details on how to format a cpu-list can be found here.
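As an illustration, here is a minimal Python sketch that expands a simple cpu-list and compares it with what the kernel reports. It assumes the kernel exposes the active set in /sys/devices/system/cpu/nohz_full, and it only handles comma-separated numbers and ranges, not the full cpu-list syntax. The function names are hypothetical.

```python
import os

def parse_cpu_list(text):
    """Expand a simple cpu-list such as "4-7" or "0,2-3" into a set of CPU numbers.

    Only single numbers and N-M ranges are handled here; this is an
    assumption of the sketch, the kernel accepts a richer syntax.
    """
    cpus = set()
    text = text.strip()
    # An empty file (or "(null)") is treated defensively as "no CPU".
    if not text or text == "(null)":
        return cpus
    for chunk in text.split(","):
        first, sep, last = chunk.partition("-")
        if sep:
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(first))
    return cpus

def read_nohz_full(path="/sys/devices/system/cpu/nohz_full"):
    """Return the set of nohz_full CPUs reported by sysfs, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return parse_cpu_list(f.read())
```

After booting with nohz_full=4-7, read_nohz_full() would be expected to match parse_cpu_list("4-7"). An empty result suggests the parameter was not taken into account or the kernel lacks full dynticks support.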

What does nohz_full do exactly

When a CPU is included in the cpu-list of the nohz_full boot parameter, the kernel tries to move as much kernel noise as it can away from that CPU. The previous article explained what can and needs to be done in theory in order to shut down the timer tick. Here is what is eventually performed:

The timer tick

The timer tick is stopped whenever possible, assuming some conditions are met:

The task that runs on the CPU can’t be preempted by another. This means you can’t have more than one task with the following policies: SCHED_OTHER, SCHED_BATCH, SCHED_IDLE. The same applies to SCHED_RR if the highest priority is shared by two or more tasks. The least error-prone setting is to run a single task on an isolated CPU.
The task doesn’t use posix-cpu-timers.
The task doesn’t use perf events.
If you run on x86, your machine must have a reliable timestamp counter (TSC). We’ll describe that later.

A residual 1 Hz tick (an interrupt every second) remains in order to maintain scheduler internal statistics. It used to execute on the isolated CPUs but nowadays this event is offloaded to the CPUs outside the nohz_full range using an unbound workqueue. This means that a clean setup can afford to run 100% tick-free on a CPU.
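One way to observe whether the tick is really stopped is to watch the local timer interrupt counters. The sketch below assumes an x86 machine, where /proc/interrupts carries a "LOC:" line with one counter per CPU. Other architectures label the timer differently. The function names are hypothetical.

```python
import time

def local_timer_counts(interrupts_text):
    """Map CPU number -> local timer interrupt count from /proc/interrupts text."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    # The header row lists the online CPUs: "CPU0 CPU1 ..."
    cpus = [int(name[3:]) for name in lines[0].split()]
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "LOC:":
            counts = [int(value) for value in fields[1:1 + len(cpus)]]
            return dict(zip(cpus, counts))
    return {}

def tick_rate(before, after, seconds):
    """Interrupts per second on each CPU between two samples."""
    return {cpu: (after[cpu] - before[cpu]) / seconds
            for cpu in before if cpu in after}

def measure(seconds=5.0, path="/proc/interrupts"):
    """Sample the counters twice and return the per-CPU tick rate."""
    with open(path) as f:
        before = local_timer_counts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = local_timer_counts(f.read())
    return tick_rate(before, after, seconds)
```

On a clean setup, the rate reported by measure() for an isolated CPU running a single user task should stay close to zero, while housekeeping CPUs keep ticking at the configured rate whenever they are busy.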

Timer callbacks

The execution of unbound timer callbacks is moved to any CPU outside the nohz_full range, so they won’t trigger timer ticks in the wrong place to serve them. Meanwhile, pinned timer callbacks can’t be moved elsewhere. We’ll see later how to cope with them.

Workqueues and other kernel threads

In a similar fashion to the timer callbacks, unbound kernel workqueues and kthreads are moved to any CPU outside the nohz_full range. Pinned workqueues and kthreads, however, can’t be moved elsewhere. Again, we’ll see later how to cope with them.

RCU

Most of RCU processing is offloaded to the CPUs outside the isolated range. The CPUs set as nohz_full run in NOCB mode, which means the RCU callbacks queued on these CPUs are executed from unbound kthreads running on non-isolated CPUs. There is no need to pass the “rcu_nocbs=” kernel parameter, as that is automatically taken care of when passing the “nohz_full=” parameter.

The CPU also doesn’t need to actively report quiescent states through the tick because it enters an RCU extended quiescent state upon return to userspace (see the previous article at “3.2 RCU quiescent states reporting”).

Cputime accounting

The CPU switches to full dynticks cputime accounting (see the previous article at “3.1 Cputime accounting”) so that it doesn’t rely on a periodic event anymore.

Other isolation settings

Even though nohz_full is a significant part of the whole isolation setting, you’ll need to take care of other details separately, two of which are particularly significant:

User tasks affinity

If you wish to run a task undisturbed, you may not want other threads or processes to share the CPU with it. And full dynticks only works on single tasks in the end. It is therefore necessary to:

Affine each of your isolated tasks to one CPU within the range of nohz_full. There must be only one isolated task per CPU.
Affine all other tasks outside the nohz_full range.

There are several ways to affine your tasks to a set of CPUs, from the low level sched_setaffinity() API to tools like taskset. Powerful interfaces such as cpusets are also recommended.
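As a minimal sketch of the low level approach, Python exposes the same system call through os.sched_setaffinity() on Linux. The helper name and the choice of CPU are illustrative assumptions. In the example from this article the isolated task would be pinned to one CPU in the 4-7 range.

```python
import os

def pin_to_cpu(cpu, pid=0):
    """Restrict a task to a single CPU and return its resulting affinity.

    pid=0 means the calling process. The CPU must be one the task is
    allowed to run on, otherwise the kernel rejects the request.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

Calling pin_to_cpu(4) early in the isolated program, before it spawns its work, confines it to CPU 4. Moving every other task away still has to be done separately, for instance with cpusets.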

IRQs affinity

Hardware IRQs (other than the timer and some other specific interrupts) may run on any CPU and disturb your isolated set. The resulting noise is not just about interrupts stealing CPU time and trashing the CPU cache: IRQs may also launch further asynchronous work on the CPU, such as softirqs, timers or workqueues. So it is usually a good idea to affine the IRQs to the CPUs outside the range of nohz_full. This affinity can be overridden through the file:

/proc/irq/$IRQ/smp_affinity

with $IRQ being the IRQ number. More details can be found in the kernel documentation.
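The smp_affinity file takes a hexadecimal CPU bitmask, written as comma-separated 32-bit groups on large machines. The following sketch builds such a mask from a set of housekeeping CPUs. The function names and the proc_root parameter are hypothetical, writing the file requires root, and some IRQs refuse to be moved, in which case the write fails with an error.

```python
def cpus_to_affinity_mask(cpus):
    """Build the hexadecimal bitmask string expected by smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    # Split into 32-bit groups, least significant group first.
    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if not mask:
            break
    groups.reverse()
    # The leading group is printed as-is, the following ones zero-padded.
    head = format(groups[0], "x")
    tail = [format(group, "08x") for group in groups[1:]]
    return ",".join([head] + tail)

def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask for the given IRQ number (needs root privileges)."""
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as f:
        f.write(cpus_to_affinity_mask(cpus))
```

With nohz_full=4-7 on an 8 CPU machine, the housekeeping CPUs 0-3 give the mask "f". The sibling file smp_affinity_list accepts a cpu-list such as 0-3 instead of a bitmask, which is often easier to write by hand.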

All these CPU isolation settings, though, involve tradeoffs and pitfalls that we’ll explore in the next article.

Comments

skyryu says:

May 3, 2022 at 4:03 am

Hi Frederic,
I still get one tick per second after setting isolcpus=1,2 nohz_full=1,2 in GRUB_CMDLINE_LINUX and mapping IRQs to CPUs 0 and 3 in smp_affinity_list. I expected it to be offloaded to the CPUs outside the nohz_full range, as you mentioned. Does it only work on the SUSE enterprise server version?

My version is openSUSE Leap 15.3, and the kernel is 5.3.18-59.27-default.
Could you please help check what might be wrong?
Thanks so much, your article is really inspiring!

Thanks,
skyryu