System hangs while [sisips] kernel module functions are spinning on a write_lock

This document (000019787) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 11 SP4
 

Situation

The system experiences soft lockups and hangs (freezes) under high load. Soft lockup messages like the following may appear in the kernel log:
[508821.753092] BUG: soft lockup - CPU#0 stuck for 23s! [oraagent.bin:17436]
[508821.753220] Pid: 17436, comm: oraagent.bin Tainted: PF          ENX 3.0.101-108.111-default #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[508821.753227] RIP: 0010:[<ffffffff81272a09>]  [<ffffffff81272a09>] __write_lock_failed+0x9/0x20
[508821.753304] Call Trace:
[508821.753321]  [<ffffffff8147d42e>] _raw_write_lock+0xe/0x10
[508821.753343]  [<ffffffffa030342c>] ExAcquireResourceExclusiveLite+0xc/0x40 [sisips]
[508821.753380]  [<ffffffffa030be08>] hook_open+0x198/0x290 [sisips]
[508821.753402]  [<ffffffff81485abe>] system_call_fastpath+0x22/0x27
[508821.753409]  [<00007f315905709d>] 0x7f315905709c

Analysis of the crash dump shows that almost all of the active tasks (15 of 16), all running in [sisips] module context, are spinning on the same rwlock_t:
crash> rwlock_t 0xffff8809fed74460
struct rwlock_t {
  raw_lock = {
    lock = 0x0
  }
}

Each of these tasks has a stack trace similar to the following:
PID: 43447  TASK: ffff8802fd8f0480  CPU: 1   COMMAND: "sqlplus"
 #0 [ffff880a3ee2be40] crash_nmi_callback at ffffffff81027895
 #1 [ffff880a3ee2be50] notifier_call_chain at ffffffff81481882
 #2 [ffff880a3ee2be80] __atomic_notifier_call_chain at ffffffff814818cd
 #3 [ffff880a3ee2be90] notify_die at ffffffff8148191d
 #4 [ffff880a3ee2bec0] default_do_nmi at ffffffff8147ecd3
 #5 [ffff880a3ee2bee0] do_nmi at ffffffff8147edf8
 #6 [ffff880a3ee2bef0] restart_nmi at ffffffff8147e166
    [exception RIP: __write_lock_failed+9]
    RIP: ffffffff81272a09  RSP: ffff8803c70d9e48  RFLAGS: 00000287
    RAX: 000000001011d178  RBX: ffff8809fed74460  RCX: 000000001011d177
    RDX: ffffffffa0341ce0  RSI: 0000000000000001  RDI: ffff8809fed74460
    RBP: ffff8809fbd86800   R8: 0000000000004000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: ffff8803c70d9ee8  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #7 [ffff8803c70d9e48] __write_lock_failed at ffffffff81272a09
 #8 [ffff8803c70d9e48] _raw_write_lock at ffffffff8147d42e
 #9 [ffff8803c70d9e50] ExAcquireResourceExclusiveLite at ffffffffa030342c [sisips]
#10 [ffff8803c70d9e60] hook_stat at ffffffffa03114c7 [sisips]
#11 [ffff8803c70d9f80] system_call_fastpath at ffffffff81485abe
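On x86 kernels of this generation (assumed here to apply to the SLES 11 SP4 3.0-based kernel), a rwlock_t is a counter that holds RW_LOCK_BIAS while the lock is free. Each reader subtracts 1 and a writer subtracts the whole bias, so lock = 0x0 in the rwlock_t dump above indicates that the lock is held for writing. A writer whose attempt fails adds the bias back and spins in __write_lock_failed until the counter returns to RW_LOCK_BIAS, which is consistent with the RIP (__write_lock_failed+9) of the spinning tasks. The following Python sketch is a simplified, single-threaded model of that counter arithmetic, not kernel code. decode_rwlock and ModelRwlock are hypothetical names, and the bias is a parameter because its value differs between kernel versions and configurations.

```python
# Simplified, single-threaded model of a bias-based x86 rwlock counter.
# ASSUMPTION: a free lock holds `bias`, each reader subtracts 1 and a
# writer subtracts the whole bias. Illustration only, not the kernel code.


def decode_rwlock(value: int, bias: int) -> str:
    """Describe a raw rwlock counter value under the bias model."""
    if value == bias:
        return "unlocked"
    if value == 0:
        return "write-locked"
    if 0 < value < bias:
        return f"read-locked by {bias - value} reader(s)"
    # Any other value (e.g. negative) only appears transiently, while a CPU
    # that subtracted its share has not yet given it back.
    return "transient (contention in flight)"


class ModelRwlock:
    """Counter arithmetic of the lock fast paths, with no real atomics."""

    def __init__(self, bias: int) -> None:
        self.bias = bias
        self.value = bias  # a free lock holds the full bias

    def try_write_lock(self) -> bool:
        # The writer takes the whole bias; it owns the lock only if the
        # counter drops to exactly 0 (no readers and no other writer).
        self.value -= self.bias
        if self.value == 0:
            return True
        # Failure: hand the bias back. In the kernel, __write_lock_failed
        # then spins until the counter equals the bias again and retries.
        self.value += self.bias
        return False

    def write_unlock(self) -> None:
        self.value += self.bias

    def try_read_lock(self) -> bool:
        # A reader takes 1 and succeeds while the result is not negative.
        self.value -= 1
        if self.value >= 0:
            return True
        self.value += 1  # a writer holds the lock: back off
        return False

    def read_unlock(self) -> None:
        self.value += 1
```

For example, decode_rwlock(0x0, bias) returns "write-locked" for any positive bias, which matches the lock = 0x0 value dumped above: every further try_write_lock() fails until the current holder calls write_unlock().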

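The one remaining active task is also running in [sisips] module context, but it is not in __write_lock_failed; it is in _ZN7Process12LockHashLineEPv, reached from the hook_exit_group path: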
PID: 43611  TASK: ffff8804c249c340  CPU: 9   COMMAND: "perl"
 #0 [ffff880a3ef2be40] crash_nmi_callback at ffffffff81027895
 #1 [ffff880a3ef2be50] notifier_call_chain at ffffffff81481882
 #2 [ffff880a3ef2be80] __atomic_notifier_call_chain at ffffffff814818cd
 #3 [ffff880a3ef2be90] notify_die at ffffffff8148191d
 #4 [ffff880a3ef2bec0] default_do_nmi at ffffffff8147ecd3
 #5 [ffff880a3ef2bee0] do_nmi at ffffffff8147edf8
 #6 [ffff880a3ef2bef0] restart_nmi at ffffffff8147e166
    [exception RIP: _ZN7Process12LockHashLineEPv+0x24]
    RIP: ffffffffa032a2b4  RSP: ffff880407fffe28  RFLAGS: 00000246
    RAX: 000000000000002a  RBX: ffff8809fbc909e0  RCX: 0000000000000040
    RDX: 0000000000000358  RSI: ffff880407fffe3c  RDI: ffff8809fade4380
    RBP: ffff880407fffe3c   R8: f600000000000000   R9: 000000000000aa5b
    R10: ffffffff81a27ea0  R11: ffffffffa0328580  R12: ffff8804c249c340
    R13: ffff880369a144c0  R14: ffff8809fb817a00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #7 [ffff880407fffe28] _ZN7Process12LockHashLineEPv at ffffffffa032a2b4 [sisips]
 #8 [ffff880407fffe30] _ZN7Process10findLockedEi at ffffffffa032a347 [sisips]
 #9 [ffff880407fffe50] _ZN13ProcessCommon21CreateMissingChildrenEP7ProcessP15LIST_ENTRY_LINK at ffffffffa032d6c6 [sisips]
#10 [ffff880407fffeb0] _ZN13ProcessCommon14CreateChildrenEP7Process at ffffffffa032d82e [sisips]
#11 [ffff880407ffff00] AppfireDestroyProcess at ffffffffa030aa5b [sisips]
#12 [ffff880407ffff30] hook_exit_group at ffffffffa02fd388 [sisips]
#13 [ffff880407ffff80] system_call_fastpath at ffffffff81485abe

On the spinning tasks, which were stopped by the crash NMI, _raw_write_lock() is always called from ExAcquireResourceExclusiveLite():
crash> bt -a | grep 'NMI exception stack' -A3
--- <NMI exception stack> ---
 #7 [ffff8803c70d9e48] __write_lock_failed at ffffffff81272a09
 #8 [ffff8803c70d9e48] _raw_write_lock at ffffffff8147d42e
 #9 [ffff8803c70d9e50] ExAcquireResourceExclusiveLite at ffffffffa030342c [sisips]
--
--- <NMI exception stack> ---
 #7 [ffff8804c2311e48] __write_lock_failed at ffffffff81272a09
 #8 [ffff8804c2311e48] _raw_write_lock at ffffffff8147d42e
 #9 [ffff8804c2311e50] ExAcquireResourceExclusiveLite at ffffffffa030342c [sisips]
--
--- <NMI exception stack> ---
 #7 [ffff880446dc3e38] __write_lock_failed at ffffffff81272a09
 #8 [ffff880446dc3e38] _raw_write_lock at ffffffff8147d42e
 #9 [ffff880446dc3e40] ExAcquireResourceExclusiveLite at ffffffffa030342c [sisips]
--
--- <NMI exception stack> ---
 #7 [ffff8804baa93e48] __write_lock_failed at ffffffff81272a09
 #8 [ffff8804baa93e48] _raw_write_lock at ffffffff8147d42e
 #9 [ffff8804baa93e50] ExAcquireResourceExclusiveLite at ffffffffa030342c [sisips]
--
--- <NMI exception stack> ---
 #7 [ffff880446fede38] __write_lock_failed at ffffffff81272a09
 #8 [ffff880446fede38] _raw_write_lock at ffffffff8147d42e
 #9 [ffff880446fede40] ExAcquireResourceExclusiveLite at ffffffffa030342c [sisips]
--
--- <NMI exception stack> ---
 #7 [ffff8804baa99e48] __write_lock_failed at ffffffff81272a09
 #8 [ffff8804baa99e48] _raw_write_lock at ffffffff8147d42e
 #9 [ffff8804baa99e50] ExAcquireResourceExclusiveLite at ffffffffa030342c [sisips]

[sisips] is a third-party proprietary kernel module from Symantec Critical System Protection (SCSP):
crash> mod -t
NAME         TAINTS
sisips       PFEN
sisfim       PFEN

 TAINT: (P) Proprietary module has been loaded
 TAINT: (F) Module was forcibly loaded
 TAINT: (N) Unsupported modules loaded

Resolution

As [sisips] is a third-party proprietary module, we recommend involving Symantec Support in this issue. As a temporary workaround, you may unload the sisips module or stop the related SCSP services.
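Before trying the unload workaround, it can help to confirm whether sisips (and sisfim, also listed by mod -t) is loaded and still in use, because rmmod, without forcing, refuses to unload a module that is still referenced. The Python sketch below parses /proc/modules, whose lines have the form name size refcount users state address, optionally followed by taint flags in parentheses. parse_proc_modules, can_unload and load_proc_modules are hypothetical helpers, and the "in use" rule is a simplification of the kernel's own checks.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ModuleInfo:
    name: str
    size: int
    refcount: int  # -1 if the kernel does not report a count
    users: list = field(default_factory=list)
    state: str = ""
    taints: str = ""


def parse_proc_modules(text: str) -> dict:
    """Parse lines of the form: name size refcount users state address [(taints)]."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        name, size, refcount, users, state = fields[:5]
        taints = ""
        if len(fields) > 6 and fields[6].startswith("(") and fields[6].endswith(")"):
            taints = fields[6][1:-1]
        modules[name] = ModuleInfo(
            name=name,
            size=int(size),
            refcount=int(refcount) if refcount.isdigit() else -1,
            # the users column is "-" or a comma-terminated list such as "a,b,"
            users=[] if users == "-" else [u for u in users.split(",") if u],
            state=state,
            taints=taints,
        )
    return modules


def can_unload(info: ModuleInfo) -> bool:
    """Simplified check: no references held and no dependent modules."""
    return info.refcount == 0 and not info.users


def load_proc_modules(path: str = "/proc/modules") -> dict:
    """Read and parse the live module list (Linux only)."""
    return parse_proc_modules(Path(path).read_text())
```

For example, load_proc_modules().get("sisips") shows whether the module is loaded; if can_unload() returns False for it, something still holds the module and unloading it is expected to fail until that user is stopped.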

Cause

This issue is caused by the sisips kernel module, which leaves almost all of the active tasks spinning on the same write lock (rwlock_t) in ExAcquireResourceExclusiveLite() while waiting for it to be released.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019787
  • Creation Date: 16-Nov-2020
  • Modified Date: 18-Nov-2020
    • SUSE Linux Enterprise Server
