What are all these "Bug: soft lockup" messages about?

This document (7017652) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 15
SUSE Linux Enterprise Server 12
SUSE Linux Enterprise Server 11

Situation

In the system log (/var/log/messages or the output of journalctl), many messages like the following are printed:

May 25 07:23:59 XXXXXXX kernel: [13445315.881356] BUG: soft lockup - CPU#16 stuck for 23s! [yyyyyyy:81602]

These are followed by various stack traces. This document tries to explain what the soft lockup messages mean.
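To get an overview of how often the message occurs and which CPUs are affected, the log can be summarized per CPU. The following is a minimal sketch, not an official SUSE tool: the function name count_soft_lockups is hypothetical, and it assumes the messages have the format shown above.

```shell
# Hypothetical helper: count "soft lockup" messages per CPU.
# Reads log text on standard input and prints one line per CPU.
count_soft_lockups() {
    grep -o 'BUG: soft lockup - CPU#[0-9]*' \
        | sed 's/.*CPU#//' \
        | sort -n \
        | uniq -c \
        | awk '{ print "CPU" $2 ": " $1 }'
}

# Example usage (adjust the log source to your system):
#   count_soft_lockups < /var/log/messages
#   journalctl -k | count_soft_lockups
```

Many messages for a single CPU may point to one stuck task, while messages spread across many CPUs are more likely to indicate a system-wide condition such as high load.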

Resolution

Under normal circumstances, these messages may go away if the load decreases.

A 'soft lockup' watchdog timeout can happen if the kernel is busy working on a huge number of objects which need to be scanned, freed, or allocated.

The stack traces of those tasks can give an initial idea of what the tasks were doing. However, to examine the cause behind the messages, a kernel dump is needed.

While these messages cannot be disabled entirely, in some situations increasing the time before a soft lockup is reported can ease the situation.

To do so, increase the following sysctl parameter: kernel.watchdog_thresh

The default value for this parameter is 10; doubling it is a reasonable starting point.

e.g.

server1:~ # echo 20 > /proc/sys/kernel/watchdog_thresh

or

server1:~ # echo "kernel.watchdog_thresh=20" > /etc/sysctl.d/99-watchdog_thresh.conf
server1:~ # sysctl -p /etc/sysctl.d/99-watchdog_thresh.conf

For more information on how to configure and capture a kernel dump, see: Configure crashkernel memory for kernel core dump analysis

Cause

A 'soft lockup' is defined as the kernel running in kernel mode for more than 20 seconds (twice the default kernel.watchdog_thresh value of 10) without giving other tasks a chance to run. Unless this situation is caused by external factors (see the Additional Information section below), it usually indicates a bug.
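The soft lockup threshold is twice the value of kernel.watchdog_thresh, which is why the default of 10 gives 20 seconds. As a small sketch, the effective threshold can be computed as follows; the function name soft_lockup_threshold is hypothetical and only serves as an illustration.

```shell
# Hypothetical helper: print the soft lockup threshold in seconds.
# Uses the value given as first argument, or the value currently
# set on the running system if no argument is given.
soft_lockup_threshold() {
    thresh="${1:-$(cat /proc/sys/kernel/watchdog_thresh)}"
    # The soft lockup threshold is 2 * watchdog_thresh.
    echo $(( thresh * 2 ))
}

# Example usage:
#   soft_lockup_threshold       # threshold for the current setting
#   soft_lockup_threshold 20    # threshold after doubling the default
```

With kernel.watchdog_thresh set to 20, a task would therefore have to occupy a CPU in kernel mode for more than 40 seconds before the message is printed.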

The watchdog daemon will send a non-maskable interrupt (NMI) to all CPUs in the system, which in turn print the stack traces of their currently running tasks.

Additional Information

A soft lockup (or, rarely, even a hard lockup) timeout can also occur if the Linux system is running in a virtual machine and the hypervisor does not schedule the guest for a prolonged time. As a guest OS has no way to know if this has happened, the host's logs should be checked to determine whether the hypervisor suspended the guest, for example due to slow memory paging to or from disk, or 'stealing' pages from guests. In that case, the available host memory has been overcommitted to the virtual machines.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.