SUSE Linux Enterprise Real Time 15

Virtualization Guide

SUSE Linux Enterprise Real Time 15 supports virtualization and Docker containers. This guide describes how to run real-time workloads alongside them.

Publication Date: 03/04/2019

1 Running RT Applications with non-RT KVM Guests

Real-time workloads running alongside KVM can be isolated by using standard methods, such as cpusets and routing IRQs to dedicated CPUs, all of which can be achieved with the cset utility. Both libvirtd and KVM work fine in such configurations. System configurations that share CPUs between RT and KVM workloads are not supported; proper isolation of workloads is imperative for meeting RT deadline constraints. None of the observations and recommendations below are specific to virtualization. Nevertheless, they can be considered best-effort guidance for isolating RT and KVM workloads. The following sections describe the test setup, the observations made, and the resulting recommendations.

1.1 Setup

All examples were carried out on a 48-core Xeon machine with 2 NUMA nodes and 64GB of RAM running SLE12 RT and the 3.12.49-rt kernel. The virtual machine was installed with vm-install, running SLE12 SP3 on 4 CPUs and 2GB of memory. The disk used was physical disk /dev/sdb as recommended by the SUSE virtualization documentation.

The cpuset utility was used to shield the RT workload from KVM as described in the SLE RT Shielding Guide (see Book “Shielding Linux Resources”):

cset shield --kthread=on -c 8-47

Affinity for the KVM vCPU tasks was modified via the virsh vcpupin command with a 1-1 mapping, that is, vCPU 0 pinned to CPU 0, vCPU 1 to CPU 1, and so on.
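
The 1-1 pinning described above can be sketched as a small loop. The domain name rtguest, the helper name, and the DRY_RUN switch are placeholders that do not appear in the original setup; by default the helper only prints the virsh commands so that they can be reviewed first.

```shell
# Pin vCPU N of a guest to host CPU N (1-1 mapping).
# The domain name "rtguest" and the DRY_RUN switch are illustrative assumptions.
pin_vcpus_one_to_one() {
  domain=$1
  nvcpus=$2
  vcpu=0
  while [ "$vcpu" -lt "$nvcpus" ]; do
    if [ "${DRY_RUN:-1}" -eq 1 ]; then
      # Default: only print what would be run.
      echo "virsh vcpupin $domain $vcpu $vcpu"
    else
      virsh vcpupin "$domain" "$vcpu" "$vcpu"
    fi
    vcpu=$((vcpu + 1))
  done
}

# The guest in this section has 4 vCPUs.
pin_vcpus_one_to_one rtguest 4
```

Setting DRY_RUN=0 would execute the commands instead of printing them.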

The CPUs were split into two groups: CPUs 0-7 were allocated to the system cpuset and CPUs 8-47 to the user cpuset. CPUs on the same socket were intentionally placed in both groups to monitor the effects on shared CPU resources, such as the last-level cache (LLC).
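
Before choosing such a split, it helps to know which CPUs sit on which socket, because CPUs on one socket usually share the LLC. A minimal sketch, assuming the kernel exposes topology information under /sys/devices/system/cpu; the helper name and its optional directory argument are illustrative and not part of the original setup.

```shell
# Print "cpuN socket=M" for every CPU that exposes topology information.
# The optional argument (sysfs CPU directory) is an illustrative assumption.
list_cpu_sockets() {
  root=${1:-/sys/devices/system/cpu}
  for dir in "$root"/cpu[0-9]*; do
    f=$dir/topology/physical_package_id
    # Skip entries without readable topology data (or an unmatched glob).
    [ -r "$f" ] || continue
    echo "${dir##*/} socket=$(cat "$f")"
  done
}

list_cpu_sockets
```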

The RT workload used throughout is cyclictest, executed like so:

cset shield --exec cyclictest -- -a 8-47 -t 40 -n -m -p99 -d 0 -D 120 --quiet

1.2 Observations

The following observations were made:

  1. VM Heavy I/O

    The test for this was to do the following in a VM:

    dd if=/dev/zero of=empty bs=4096 count=$(((80*1024*1024)/4096))

    Doing large amounts of disk I/O in the VM guests has a noticeable impact on the latency of RT tasks. This is because of the constant eviction of LLC data, resulting in more cache misses.

    The maximum latencies for the real-time workload are seen on CPUs on the same socket as the CPUs available to the KVM workload, that is, where the LLC is shared between the system and user cpusets.

  2. cpufreq drivers incur timer latency

    Drivers like intel_pstate set up a timer on each CPU to periodically sample and adjust the CPU's current P-state. If this timer fires at an inopportune time, it can delay the scheduling of RT tasks, particularly because many of the IRQ/timer code paths run with interrupts disabled.

  3. Interrupt handling introduces delays

    The handling of interrupts can result in latencies that affect RT workloads. Interrupts should be routed to housekeeping CPUs that are not running RT applications.

  4. Some kernel threads cannot be controlled with cpuset

    Performing heavy I/O in the VM may cause kthreads to be scheduled on the CPUs dedicated for RT. This can occur, for example, when a kthread is flushing dirty pages to disk.

    While it is impossible to move some kworker threads into the system cpuset, the above issue can be mitigated by setting the CPU affinity for those threads via:

    /sys/devices/virtual/workqueue/writeback/cpumask
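
As a rough sketch of that mitigation, the helper below writes a hexadecimal CPU mask to a workqueue cpumask file. The helper name is illustrative, and the mask ff (CPUs 0-7, the system cpuset in this setup) is an assumption derived from the CPU split described in Section 1.1; writing to /sys requires root privileges.

```shell
# Write a hex CPU mask to a workqueue cpumask control file.
# The helper name and the optional path argument are illustrative assumptions.
set_wq_cpumask() {
  mask=$1
  file=${2:-/sys/devices/virtual/workqueue/writeback/cpumask}
  if [ ! -w "$file" ]; then
    echo "cannot write $file (missing, or not root?)" >&2
    return 1
  fi
  echo "$mask" > "$file"
}

# Usage (as root): keep writeback kworkers on CPUs 0-7.
#   set_wq_cpumask ff
```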

1.3 Recommendations

Suggestions for tuning machines running both RT and KVM workloads are as follows:

  1. Affinitize RT tasks to their own CPUs and, if possible, to CPUs on their own dedicated socket. Using a dedicated socket avoids the issue from Section 1.2, “Observations”, where the LLC occupancy is churned by VMs doing lots of I/O operations. If that is not an option, consider Intel's Cache Allocation Technology to further enforce cache allocation policies.

  2. Disable drivers that arm per-CPU timers such as cpufreq drivers, for example, intel_pstate=disable.

  3. Set IRQ affinity to CPUs that are not running RT workloads and disable irqbalance.

  4. If irqbalance is kept running instead, restrict it to CPUs that are not running RT workloads. This can be achieved by setting the IRQBALANCE_BANNED_CPUS environment variable used by irqbalance(1) to a bitmask of banned CPUs. For the examples used throughout this document, the following setting was used:

    IRQBALANCE_BANNED_CPUS="ffff,ffffff00"

  5. Search for cpumask control files in /sys and set them appropriately for those cases that cannot be controlled via cpuset. The following command will list those files:

    find /sys -name cpumask
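
The banned-CPU bitmask used for irqbalance has one bit per CPU, written as comma-separated 32-bit hexadecimal groups with the most significant group first. The following sketch derives such a mask from a contiguous CPU range; the function name is illustrative and not part of any tool.

```shell
# Build a kernel-style hex CPU mask (comma-separated 32-bit groups,
# most significant first) for the CPU range first..last.
# The function name is an illustrative assumption.
cpu_range_to_mask() {
  first=$1
  last=$2
  word=$((last / 32))       # index of the highest 32-bit group needed
  mask=""
  while [ "$word" -ge 0 ]; do
    base=$((word * 32))
    lo=$(( first > base ? first : base ))
    hi=$(( last < base + 31 ? last : base + 31 ))
    if [ "$lo" -gt "$hi" ]; then
      bits=0
    else
      bits=$(( ((1 << (hi - lo + 1)) - 1) << (lo - base) ))
    fi
    if [ -z "$mask" ]; then
      mask=$(printf '%x' "$bits")            # top group: no zero padding
    else
      mask="$mask,$(printf '%08x' "$bits")"  # lower groups: 8 hex digits
    fi
    word=$((word - 1))
  done
  echo "$mask"
}

# CPUs 8-47 run the RT workload in this document.
echo "IRQBALANCE_BANNED_CPUS=\"$(cpu_range_to_mask 8 47)\""
# prints: IRQBALANCE_BANNED_CPUS="ffff,ffffff00"
```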

2 RT Applications within Docker Feasibility

It is important to note that real-time processes will be affected by container activity, as there is insufficient isolation to guarantee zero cross-talk. There are no special settings or container-specific interactions to consider: from an RT perspective, nothing changes due to containers. Whether a noise source runs inside a container is irrelevant. Interference may be considerably higher if multiple RT applications are executed in separate containers. Also bear in mind that while worst-case latency may be better than on SLE, an RT kernel will not necessarily perform better than a NOPREEMPT kernel, due to the overhead required for RT.

Some shielding is possible, but there is no tool-based support for it. A generic shield script is attached (see Section 2.3, “Scripts”) that can move Docker contents onto shielded cores once the container is running. Launching either KVM or Docker directly into a shielded home did not appear to be possible, but the Docker or virtualization team may be able to do better.

2.1 Running Real-Time Applications in a Virtualized Environment

Standard real-time dangers apply: if the intention is to run a compute-intensive application with real-time priority, the user must ensure that kernel threads cannot starve. A simple precaution is to use the rtkthread=prio and rtworkqueues=prio kernel boot parameters, with the priority set higher than anything that may dominate a CPU. This is not strictly real-time capable, but it is safer overall.

Docker Prerequisites
  • The kernel must be booted with the nortsched command line parameter.

    This hides cgroup scheduling from Docker. If cgroup scheduling is required, isolating Docker is very problematic.

  • docker run must be passed --privileged=true.

    This is required for using the RT classes.

  • The container must be equipped with the chrt system tool.

If no isolation is required for your use case, the setup is complete. Start your container with docker run, using chrt to set the RT class and priority of the command executed at container startup. Example:

docker run --privileged=true ... /usr/bin/chrt -f 1 /usr/sbin/sshd -D

The above (with additional arguments, of course) will start sshd within the container as a SCHED_FIFO task of priority 1. Log in to the container via ssh, and whatever you run there will inherit the RT scheduler class and priority.
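
To check that the class and priority were actually inherited, chrt can also query a running process. The sketch below wraps that query in a small helper (the name show_sched is illustrative). Run from a shell inside the container started as shown above, it would be expected to report SCHED_FIFO with priority 1.

```shell
# Print the scheduling policy and priority of a PID (default: the current shell).
# The helper name is an illustrative assumption; chrt -p is the actual query.
show_sched() {
  chrt -p "${1:-$$}"
}

# Usage:
#   show_sched         # the shell you are logged in to
#   show_sched 1234    # any other PID
```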

2.2 Docker Shielding

There is currently no facility within Docker to launch a container directly into an isolated cpuset; this must be done manually.

Example 1: Pseudo Script
# note cpuset mount point
cpuset_mnt=$(mount|grep cpuset|cut -d' ' -f3)

# create an isolated cpuset for your container
cset shield --userset=rtcpus --cpu=4-7 --kthread=on

# note path and id of your container
docker_path=$(docker run...)
docker_id=$(docker ps -q)

# move container content into the isolated cpuset
for i in $(cat ${cpuset_mnt}/system/docker/${docker_path}/tasks);
do
  echo $i > ${cpuset_mnt}/rtcpus/tasks;
done

# stop/destroy the container
docker stop ${docker_id}
docker rm ${docker_id}

# remove dir docker abandons in the shield system directory
rmdir ${cpuset_mnt}/system/docker

# tear down the shield, and you're done
cset shield --userset=rtcpus --cpu=4-7 --reset
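
The move step in the pseudo script can fail for individual tasks, for example when a task exits between reading and writing its PID. A slightly more defensive variant, sketched here under that assumption, writes each PID separately and counts failures instead of stopping; the function name is illustrative.

```shell
# Move every PID listed in one cpuset "tasks" file into another cpuset.
# Prints "moved=<n> failed=<n>". The function name is an illustrative assumption.
move_tasks() {
  src=$1
  dst=$2
  moved=0
  failed=0
  # Each PID is written on its own; a failed write is counted, not fatal.
  for pid in $(cat "$src"); do
    if echo "$pid" > "$dst"; then
      moved=$((moved + 1))
    else
      failed=$((failed + 1))
    fi
  done
  echo "moved=$moved failed=$failed"
}

# Usage with the variables from the pseudo script above:
#   move_tasks "${cpuset_mnt}/system/docker/${docker_path}/tasks" \
#              "${cpuset_mnt}/rtcpus/tasks"
```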

2.3 Scripts

Example 2: Sample Shield Script
#!/bin/bash

let START_CPU=4
let END_CPU=63
let ONLINE=1
let SHIELD_UP=0
GOVERNOR="performance"

DEFAULT_MASK=ffffffff,ffffffff
SHIELD_MASK=00000000,0000000f

if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
  RT_RUNTIME=$(cat /proc/sys/kernel/sched_rt_runtime_us)
fi
if [ -f /proc/sys/kernel/nmi_watchdog ]; then
  NMI_WATCHDOG=$(cat /proc/sys/kernel/nmi_watchdog)
fi

CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
if [ -n "$CPUSET_ROOT" ]; then
  if [ -d ${CPUSET_ROOT}/rtcpus ]; then
    let SHIELD_UP=1
  fi
  if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
    CPUSET_PREFIX=cpuset.
  fi
fi

if [ $SHIELD_UP -eq 1 ]; then
  # take it down
  echo 1 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
  cset shield --userset=rtcpus --reset

  # restore default irq affinity
  echo ${DEFAULT_MASK} > /proc/irq/default_smp_affinity
  for irqlist in $(ls /proc/irq/*/smp_affinity); do
    echo ${DEFAULT_MASK} > $irqlist 2>/dev/null
  done

  if [ -f /proc/sys/kernel/timer_migration ]; then
    echo 1 > /proc/sys/kernel/timer_migration
  fi
  if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
    echo ${RT_RUNTIME} > /proc/sys/kernel/sched_rt_runtime_us
  fi
  if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
    echo 1 > /sys/kernel/debug/tracing/tracing_on
  fi
  if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
  fi
  if [ -f /proc/sys/kernel/nmi_watchdog ]; then
   echo ${NMI_WATCHDOG} > /proc/sys/kernel/nmi_watchdog
  fi
  if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
   echo 300 > /sys/devices/system/machinecheck/machinecheck0/check_interval
  fi
  if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
   echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
  fi
  if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
    echo ${DEFAULT_MASK} > /sys/devices/virtual/workqueue/cpumask
  fi
  if [ -f /proc/sys/vm/stat_interval ]; then
    echo 1 > /proc/sys/vm/stat_interval
  fi
  if [ -f /sys/module/processor/parameters/latency_factor ]; then
   echo 2 > /sys/module/processor/parameters/latency_factor
  fi
  if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
   echo 0 > /sys/module/processor/parameters/ignore_ppc
  fi
  if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
   echo 0 > /sys/module/processor/parameters/ignore_tpc
  fi
  if [ -f /etc/init.d/sgi_irqbalance ]; then
   /etc/init.d/sgi_irqbalance start
  fi
else
  # route irqs away from shielded cpus
  if [ -f /etc/init.d/sgi_irqbalance ]; then
    /etc/init.d/sgi_irqbalance stop
  fi
  echo $SHIELD_MASK > /proc/irq/default_smp_affinity
  for irqlist in $(ls /proc/irq/*/smp_affinity); do
    echo $SHIELD_MASK > $irqlist 2>/dev/null
  done

  # poke some buttons..
  if [ -f /proc/sys/kernel/sched_rt_runtime_us ]; then
    echo -1 > /proc/sys/kernel/sched_rt_runtime_us
  fi
  if [ -f /sys/kernel/debug/tracing/tracing_on ]; then
    echo 0 > /sys/kernel/debug/tracing/tracing_on
  fi
  if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
  fi
  if [ -f /proc/sys/kernel/nmi_watchdog ]; then
    echo 0 > /proc/sys/kernel/nmi_watchdog
  fi
  if [ -f /sys/devices/system/machinecheck/machinecheck0/check_interval ]; then
    echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval
  fi
  if [ -f /sys/devices/virtual/workqueue/writeback/cpumask ]; then
    echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/writeback/cpumask
  fi
  if [ -f /sys/devices/virtual/workqueue/cpumask ]; then
    echo ${SHIELD_MASK} > /sys/devices/virtual/workqueue/cpumask
  fi
  if [ -f /proc/sys/vm/stat_interval ]; then
    echo 999999 > /proc/sys/vm/stat_interval
  fi
  if [ -f /sys/module/processor/parameters/latency_factor ]; then
    echo 1 > /sys/module/processor/parameters/latency_factor
  fi
  if [ -f /sys/module/processor/parameters/ignore_ppc ]; then
    echo 1 > /sys/module/processor/parameters/ignore_ppc
  fi
  if [ -f /sys/module/processor/parameters/ignore_tpc ]; then
    echo 1 > /sys/module/processor/parameters/ignore_tpc
  fi

  # ...and fire up the shield
  cset shield --userset=rtcpus --cpu=${START_CPU}-${END_CPU} --kthread=on

  # If cpuset wasn't previously mounted (systemd will, like it or not),
  # we just mounted it.  Find the mount point.
  if [ -z "$CPUSET_ROOT" ]; then
   CPUSET_ROOT=$(grep cpuset /proc/mounts|cut -d ' ' -f2)
   if [ -z "$CPUSET_ROOT" ]; then
     # If it's not mounted now, bail.
     echo "EEK, cpuset is not mounted!" >&2
     exit 1
   else
     # ok, check for cgroup mount
     if [ -f ${CPUSET_ROOT}/cpuset.cpus ]; then
      CPUSET_PREFIX=cpuset.
     fi
   fi
  fi

  echo 0 > ${CPUSET_ROOT}/${CPUSET_PREFIX}sched_load_balance
  echo 1 > ${CPUSET_ROOT}/system/${CPUSET_PREFIX}sched_load_balance
  echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_relax_domain_level
  # this ain't gonna happen in -rt kernels, but...
  if [ -f ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us ]; then
    echo 300000 > ${CPUSET_ROOT}/system/cpu.rt_runtime_us
    echo 300000 > ${CPUSET_ROOT}/rtcpus/cpu.rt_runtime_us
  fi
  echo 0 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_load_balance

  # wait a bit for sched_domain rebuild
  sleep 1

  # now we can go to hpc
  if [ -f ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt ]; then
    echo 1 > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}sched_hpc_rt
  fi

  # offline/online to migrate timers and whatnot
  if [ $ONLINE -eq 1 ]; then
    for i in `seq ${START_CPU} ${END_CPU}`; do
      echo 0 > /sys/devices/system/cpu/cpu$i/online
    done
    for i in `seq ${START_CPU} ${END_CPU}`; do
      echo 1 > /sys/devices/system/cpu/cpu$i/online
    done

    # re-add CPUs the kernel removed on offline
    echo ${START_CPU}-${END_CPU} > ${CPUSET_ROOT}/rtcpus/${CPUSET_PREFIX}cpus

    # and prioritize re-initialized kthreads
    systemctl restart set_kthread_prio
  fi
  if [ -f /proc/sys/kernel/timer_migration ]; then
    echo 0 > /proc/sys/kernel/timer_migration
  fi
  GOVERNOR="performance"
fi

if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ]; then
  CURRENT_GOVERNOR=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)
  if [ "$GOVERNOR" != "$CURRENT_GOVERNOR" ]; then
    for i in $(ls /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor); do
     echo $GOVERNOR > $i;
    done
  fi
fi

Example 3: Patch to sysjitter to Use the User Affinity Instead of Whole Box
sysjitter.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/sysjitter.c
+++ b/sysjitter.c
@@ -412,7 +412,7 @@ static void write_raw(struct thread *thr
 	FILE *f;
 	int i;
 	for (i = 0; i < g.n_threads; ++i) {
-		sprintf(fname, "%s.%d", outf, i);
+		sprintf(fname, "%s.%d", outf, threads[i].core_i);
 		if ((f = fopen(fname, "w")) == NULL) {
 			fprintf(stderr, "ERROR: Could not open '%s' for writing\n", fname);
 			fprintf(stderr, "ERROR: %s\n", strerror(errno));
@@ -578,6 +578,7 @@ int main(int argc, char *argv[])
 	const char *outf = NULL;
 	char dummy;
 	int i, n_cores, runtime = 70;
+	cpu_set_t cpus;
 
 	g.max_interruptions = 1000000;
 
@@ -609,10 +610,13 @@ int main(int argc, char *argv[])
 	    sscanf(argv[0], "%u%c", &g.threshold_nsec, &dummy) != 1)
 		usage(app);
 
+	CPU_ZERO(&cpus);
+	sched_getaffinity(0, sizeof(cpus), &cpus);
+
 	n_cores = sysconf(_SC_NPROCESSORS_ONLN);
-	TEST(threads = malloc(n_cores * sizeof(threads[0])));
+	TEST(threads = malloc(CPU_COUNT(&cpus) * sizeof(threads[0])));
 	for (i = 0; i < n_cores; ++i)
-		if (move_to_core(i) == 0)
+		if (CPU_ISSET(i, &cpus) && move_to_core(i) == 0)
 			threads[g.n_threads++].core_i = i;
 
 	signal(SIGALRM, handle_alarm);

3 Running RT Applications with RT KVM Guests

Section 1, “Running RT Applications with non-RT KVM Guests”, shows that it is possible to isolate real-time workloads running alongside KVM by using standard methods. In SLE12 RT SP3 this can be done in user space using libvirt and QEMU.

Applications and guest operating systems run inside KVM guests similarly to how they run on bare metal. The guest interfaces with emulated hardware presented by QEMU, which submits I/O requests to the host on behalf of the guest. The host kernel then treats guest I/O like that of any user-space application.

In SLE12 SP3, both QEMU and libvirt support isolating the CPUs, partitioning the memory for guests, and setting the vCPU/iothread scheduler policy and priority for running both non-RT KVM and RT KVM.

3.1 Support of QEMU/libvirt

  1. QEMU includes the -realtime mlock=on|off option. Locking of QEMU and guest memory (mlock) is enabled with mlock=on, which is the default.

  2. libvirt supports CPU Allocation, CPU Tuning, and Memory Backing, which allow you to control RT parameters. See Section 3.2, “Sample of libvirt.xml”.

    CPU Allocation

    We can define the maximum number of virtual CPUs allocated for the guest OS.

    CPU Tuning
    • Pinning is a tuning option for the virtual CPUs in KVM guests. With pinning we can control where the guest runs in order to reduce the overhead of scheduler switches, pin vCPUs to physical CPUs that have low utilization, and improve the data cache performance. Overall performance is improved when the memory that an application uses is local to the physical CPU, and the guest vCPU is pinned to this physical CPU.

    • We can specify the vCPU scheduler type (batch, idle, fifo, or rr) and the priority for particular vCPU threads. Priority 99 is too high and will massively interfere with the host's ability to function properly: there are host-side per-CPU threads that must always be able to preempt, for example, timer sirq threads.

    Memory Backing

    Use memory backing to allocate enough memory in the guest to avoid memory overcommit, and to lock the guest page memory in host memory to prevent it from being swapped out. This will show a performance improvement in some workloads.

3.2 Sample of libvirt.xml

<domain>
   …
   <vcpu placement='static' cpuset="1-4,^3,6" current="1">4</vcpu>
   …
   <cputune>
       <vcpupin vcpu="0" cpuset="1-5,^2"/>
       <vcpupin vcpu="1" cpuset="0,1"/>
       <vcpupin vcpu="2" cpuset="2,3"/>
       <vcpupin vcpu="3" cpuset="0,4"/>
       <vcpusched vcpus='0-4,^3' scheduler='fifo' priority='1'/>
   </cputune>
   …
   <memoryBacking>
       <locked/>
   </memoryBacking>
   …
</domain>
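
The cpuset strings in the sample, such as 1-4,^3,6, combine single CPUs, ranges, and exclusions marked with a caret. To make the notation concrete, the following sketch expands such a string into an explicit CPU list; the function name is illustrative and the parser only covers the three forms used in the sample.

```shell
# Expand a libvirt-style cpuset string into an explicit CPU list.
# Entries are comma-separated: "N" adds a CPU, "A-B" adds a range,
# "^N" removes a previously listed CPU. The function name is illustrative.
expand_cpuset() {
  spec=$1
  cpus=" "
  old_ifs=$IFS
  IFS=,
  set -- $spec          # split the string at commas
  IFS=$old_ifs
  for entry in "$@"; do
    case $entry in
      \^*)
        x=${entry#?}    # CPU number after the caret
        case $cpus in
          *" $x "*) cpus="${cpus%% $x *} ${cpus#* $x }" ;;
        esac
        ;;
      *-*)
        n=${entry%-*}
        end=${entry#*-}
        while [ "$n" -le "$end" ]; do
          cpus="$cpus$n "
          n=$((n + 1))
        done
        ;;
      *)
        cpus="$cpus$entry "
        ;;
    esac
  done
  echo $cpus            # unquoted on purpose: trims the padding spaces
}

expand_cpuset "1-4,^3,6"    # prints: 1 2 4 6
expand_cpuset "1-5,^2"      # prints: 1 3 4 5
```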

3.3 Other Host Settings

  1. Power Management. Intel processors have a power management feature that puts the system into power-saving mode when the system is under-utilized. The system should be configured for maximum performance, rather than allowing power-saving mode.

  2. Turboboost and Speedstep. Turboboost overclocks a core when CPU demand is high, whereas Speedstep dynamically adjusts the frequency of the processor to meet processing needs. Turboboost requires Speedstep to be enabled, as it is an extension of Speedstep. For maximum performance, enable both Turboboost and Speedstep in the BIOS. The host OS may also need configuration to support running at higher clock speeds. For example:

    cpupower -c all frequency-set -g performance

  3. Disable Interrupt Balancing (irqbalance). The irqbalance daemon is enabled by default. It distributes hardware interrupts across CPUs in a multi-core system to increase performance. When irqbalance is disabled, all interrupts will be handled by cpu0, and therefore the guest should NOT run on cpu0.

  4. RT Throttling. The default values for the real-time throttling mechanism allocate 95% of the CPU time to real-time tasks and the remaining 5% to non-real-time tasks. If RT throttling is disabled, real-time tasks may use up to 100% of CPU time. Hence, programming failures in real-time applications can cause the entire system to hang, because no other task can preempt the real-time tasks.
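
The 95% figure for RT throttling follows from two kernel settings, /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us, whose defaults are 950000 and 1000000 microseconds. A runtime of -1 disables throttling. The sketch below computes the resulting share; the function name is illustrative, and without arguments it would read the live values from /proc.

```shell
# Report the share of each period that real-time tasks may consume.
# Arguments default to the live kernel values; the function name is illustrative.
rt_budget() {
  runtime=${1:-$(cat /proc/sys/kernel/sched_rt_runtime_us)}
  period=${2:-$(cat /proc/sys/kernel/sched_rt_period_us)}
  if [ "$runtime" -lt 0 ]; then
    echo "RT throttling disabled: real-time tasks may use 100% of CPU time"
  else
    echo "real-time tasks may use $((runtime * 100 / period))% of CPU time"
  fi
}

# Kernel defaults: 950000 us of runtime per 1000000 us period.
rt_budget 950000 1000000
# Value written by the sample shield script in Section 2.3.
rt_budget -1 1000000
```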

The above settings are only part of the configuration needed for RT KVM to deliver best-effort performance. Other factors, such as storage and network, must also be considered. Overall KVM performance depends on the host hardware, firmware, BIOS settings, and the characteristics of the guest OS and application.
