
CPU Isolation – A practical example – by SUSE Labs (part 5)


This blog post is the fifth in a technical series by the SUSE Labs team exploring kernel CPU isolation along with one of its core components: Full Dynticks (or Nohz Full). Here is the list of the articles in the series so far:

  1. CPU Isolation – Introduction
  2. CPU Isolation – Full dynticks internals
  3. CPU Isolation – Nohz_full
  4. CPU Isolation – Housekeeping and tradeoffs
  5. CPU Isolation – A practical example
  6. CPU Isolation – Nohz_full troubleshooting: broken TSC/clocksource


Undisturbed

Four blog posts are enough for a shallow exploration of the theoretical concepts. Now it’s time for practice. In this example, our setup has 8 CPUs. We are going to run a dummy userspace loop on the 8th CPU in a fully isolated fashion, i.e. without any disturbance.

1) Kernel config requirements

If you’re running SUSE Linux Enterprise Server 15 SP3 or higher, you don’t need to worry about this part. Otherwise, make sure you have:

CONFIG_NO_HZ_FULL=y
CONFIG_CPUSETS=y
CONFIG_TRACING=y

The first option provides support for stopping the tick while a single task is running. The second makes task affinity settings easier to manage. Finally, the third option enables the tracing we will use to debug the CPU isolation.
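To double-check these options on a running kernel, you can grep its config file (a sketch: the config location varies by distribution, and /proc/config.gz is another common place to look):

grep -E "CONFIG_NO_HZ_FULL=|CONFIG_CPUSETS=|CONFIG_TRACING=" /boot/config-$(uname -r)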

2) Boot requirements

Using the “nohz_full=” boot parameter, the timer tick can be shut down while running a single task, and most of the kernel’s unbound load is also migrated to the CPUs outside the passed range. As we plan to isolate the 8th CPU, we need to boot the kernel passing the following:

nohz_full=7

Since CPU numbers start from 0, the 8th CPU is number 7. Also, there is no need to set the “rcu_nocbs=” boot parameter, as is commonly shown in examples: nohz_full handles that automatically.
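Once rebooted, you can verify that the parameter was taken into account. Beyond the boot command line, the kernel also exposes the effective nohz_full CPU set through sysfs:

$ grep -o "nohz_full=[0-9,-]*" /proc/cmdline
nohz_full=7
$ cat /sys/devices/system/cpu/nohz_full
7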

3) Task affinity

There are several ways to partition the CPUs between your isolated task and the rest of the system. The preferred way here is cpuset. However, other solutions are presented below for those who may have special needs.

3.1) Cpuset

Once the kernel has booted, and in order to make sure that undesired tasks won’t disturb CPU 7, we create two cpuset partitions. A directory called “isolated” contains our isolated CPU, which will later run our isolated task. Another directory called “housekeeping” takes the regular load. We force-disable load balancing on the “isolated” partition to make sure that no task can migrate from/to CPU 7 unless it is manually moved.

In this example we use the default cpuset mount point on SUSE Linux Enterprise Server 15 SP3.

cd /sys/fs/cgroup/cpuset
mkdir housekeeping
mkdir isolated
# Housekeeping gets CPUs 0-6, the isolated partition gets CPU 7
echo 0-6 > housekeeping/cpuset.cpus
echo 0 > housekeeping/cpuset.mems
echo 7 > isolated/cpuset.cpus
echo 0 > isolated/cpuset.mems
# Disable load balancing at the root and in the isolated partition
echo 0 > cpuset.sched_load_balance
echo 0 > isolated/cpuset.sched_load_balance
# Move every task from the root partition to the housekeeping one
while read P
do
  echo $P > housekeeping/cgroup.procs
done < cgroup.procs

Some of the writes to housekeeping/cgroup.procs may fail because kernel thread PIDs can’t be moved out of the root cpuset partition. However, unbound kernel threads have their affinity automatically forced to the CPUs outside the nohz_full range, so these failures can safely be ignored.
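A quick way to verify the result is to check the CPU affinity of the current shell, which the loop above should have moved to the housekeeping partition:

grep Cpus_allowed_list /proc/self/status
# Expected output: Cpus_allowed_list:  0-6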

3.2) Isolcpus

You can also achieve the same as the above cpuset setting using the “isolcpus=” kernel boot parameter. However, this solution is not advised because the isolation configuration can’t be changed later at runtime. This is why “isolcpus” tends to be considered “deprecated”, despite still being in use. It may remain useful with specialized or embedded kernels that haven’t been built with cpusets/cgroups support.
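For reference, the boot command line equivalent of the above cpuset setup would look like the following (a sketch matching our 8 CPU example):

# CPU 7 is removed from the scheduler domains at boot, with no runtime opt-out
isolcpus=7 nohz_full=7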

3.3) Taskset, sched_setaffinity(), …

At a lower level, it is also possible to affine each individual task to the desired set of CPUs using tools like taskset or APIs like sched_setaffinity(). On a setup without cpusets support, this has the advantage of allowing affinity changes at runtime, unlike “isolcpus”. The drawback is that it requires more fine-grained work.
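For example, with taskset (the workload name and PID below are hypothetical):

# Launch a new task pinned to the isolated CPU 7
taskset -c 7 ./my_workload
# Retarget an already running task (hypothetical PID 1234) to the housekeeping CPUs
taskset -pc 0-6 1234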

4) IRQ affinity

We have dealt with task affinity, but hardware interrupts can still fire on the isolated CPU and disturb its exclusive load. Fortunately, we can arrange for them to fire on the housekeeping set instead, through the procfs interface:

# Migrate irqs to CPU 0-6 (exclude CPU 7)
for I in $(ls /proc/irq)
do
    if [[ -d "/proc/irq/$I" ]]
    then
        echo "Affining vector $I to CPUs 0-6"
        echo 0-6 > /proc/irq/$I/smp_affinity_list
    fi
done

You’ll likely meet an I/O error on one of these interrupt vectors (number 0 on x86-64 machines, for example) because this is the per-CPU timer vector, which cannot be moved away due to its local nature. This issue can safely be ignored though, because “nohz_full” is purposely designed to address it.
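As a complementary measure, you can also make the housekeeping CPUs the default target for interrupt vectors requested in the future. This is a sketch assuming our 8 CPU layout, where CPUs 0-6 translate to the hexadecimal bitmask 7f:

# Newly requested IRQs will default to CPUs 0-6
echo 7f > /proc/irq/default_smp_affinity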

5) Preventing other disturbances

In this example we are dealing with straightforward scheduler and interrupt based disturbances. More advanced topics, such as preventing exceptions like page faults, will be covered in subsequent articles.

6) The actual testing

Now most of the housekeeping load should be running on CPUs 0 to 6, and CPU 7 is expected to be ready to run userspace code without interruption. So let’s cook a dummy loop with a launcher.

6.1) The dummy userspace loop

The following code binds the current task to the isolated cpuset (i.e. CPU 7) and executes a dummy loop forever. It is intended to be started and then eventually killed by the separate launcher after running for 10 seconds.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(void)
{
    // Move the current task to the isolated cpuset (bind to CPU 7).
    // Writing "0" to cgroup.procs means "the writing task itself".
    int fd = open("/sys/fs/cgroup/cpuset/isolated/cgroup.procs", O_WRONLY);
    if (fd < 0) {
        perror("Can't open cpuset file");
        return 1;
    }

    if (write(fd, "0\n", 2) < 0) {
        perror("Can't move the task to the isolated cpuset");
        close(fd);
        return 1;
    }
    close(fd);

    // Run an endless dummy loop until the launcher kills us
    while (1)
        ;

    return 0;
}

Save this code in a file named “user_loop.c” and build it:

$ gcc user_loop.c -o user_loop

6.2) The launcher

The purpose of the launcher, beyond running the dummy user loop on the isolated CPU 7 for 10 seconds, is to trace the events that may show potential disturbances of our sensitive workload. In this example we use the default tracing debugfs mount point on SUSE Linux Enterprise Server 15 SP3:

TRACING=/sys/kernel/debug/tracing
# Make sure tracing is off for now
echo 0 > $TRACING/tracing_on
# Flush previous traces
echo > $TRACING/trace
# Record disturbance from other tasks
echo 1 > $TRACING/events/sched/sched_switch/enable
# Record disturbance from interrupts
echo 1 > $TRACING/events/irq_vectors/enable
# Now we can start tracing
echo 1 > $TRACING/tracing_on
# Run the dummy user_loop for 10 seconds on CPU 7
./user_loop &
USER_LOOP_PID=$!
sleep 10
kill $USER_LOOP_PID
# Disable tracing and save traces from CPU 7 in a file
echo 0 > $TRACING/tracing_on
cat $TRACING/per_cpu/cpu7/trace > trace.7

Two interesting low level events are traced here:

  • Scheduler context switches: report any task preempting our “user_loop”. This includes workqueues and kernel threads.
  • IRQ vectors: report most IRQs interrupting our “user_loop”, including timer interrupts (these events are listed below).
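As a side note, the irq_vectors trace events are x86-specific. You can list those available on your machine with:

ls /sys/kernel/debug/tracing/events/irq_vectors/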

The above code can be written into a file named “launch” in the same directory as “user_loop”.
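Then make it executable and run it as root:

$ chmod +x launch
$ sudo ./launch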

7) The result in a perfect world

After running the above “launch” script as root, if everything went well, content similar to the following can be found in the “trace.7” file:

<idle>-0 [007] d..2. 1980.976624: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=user_loop next_pid=1553 next_prio=120
user_loop-1553 [007] d.h.. 1990.946593: reschedule_entry: vector=253
user_loop-1553 [007] d.h.. 1990.946593: reschedule_exit: vector=253

Here the user_loop task starts in the first trace entry at timestamp 1980 (seconds), switching in from swapper (the idle task). Then for 10 seconds (1990 − 1980) nothing happens, until the task is finally interrupted by a reschedule IPI (vector 253) in order to handle the kill signal sent by the launcher. This shows that no task or interrupt disturbed our user_loop during that time.

Note that this desired result only happens with ideal settings on a perfect machine in a perfect world. Random noise is likely to be found in the trace, so the next article will be about chasing and troubleshooting disturbances.

Frederic Weisbecker, Linux Kernel Engineer at SUSE Labs.