System hang/freeze with a high number of tasks waiting for McAfee userspace services to write the fanotify responses

This document (000020332) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications 12 SP5
SUSE Linux Enterprise Server for SAP Applications 15 SP4
SUSE Linux Enterprise Server for SAP Applications 12 SP3

SUSE Linux Enterprise Server 12 SP5
SUSE Linux Enterprise Server 12 SP4 LTSS
SUSE Linux Enterprise Server 12 SP3 LTSS

SUSE Linux Enterprise Server 11 SP4 LTSS

Situation

A system with 2 CPU cores running the McAfee userspace client (ISecESP, ISecRt, MFEcma, McAfeeESP, McAfeeESPAac, McAfeeESPFileAccess, McAfeeRt, McAfeeTP...) hangs or freezes under high load, with a considerable number of tasks in uninterruptible sleep waiting for fanotify event responses. The tasks are blocked in fanotify_get_response_from_access(), waiting for responses that the McAfee userspace services are expected to write.
LOAD AVERAGE: 13.41, 7.57, 3.55
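
On Linux, tasks in uninterruptible sleep count toward the load average, which is consistent with the high load reported above. As a rough sketch that is not part of the original analysis, the following Python lists such tasks on a system that is still responsive enough to run it. The function names are illustrative assumptions; state "D" in /proc corresponds to the UN state shown by crash.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of a /proc/<pid>/stat file.

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the LAST ')' instead of splitting on spaces.
    """
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return (tid, comm) for every thread currently in state 'D'."""
    blocked = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    tid_num, comm, state = parse_stat(f.read())
            except (OSError, ValueError, IndexError):
                continue  # thread exited or the file could not be parsed
            if state == "D":
                blocked.append((tid_num, comm))
    return blocked
```

Calling uninterruptible_tasks() repeatedly and watching the list grow would match the pile-up described in this document. Note that on a host already affected by this problem, starting a new process may itself block on a monitored filesystem.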

The tasks stay blocked for as long as the fanotify response is not received. Normally, every file access on a filesystem monitored by McAfee triggers a fanotify event. If the main McAfee services stop responding, the tasks blocked (UN state) waiting for fanotify responses pile up, eventually leading to a system hang/freeze.
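
To make "writing the fanotify response" concrete, here is a minimal sketch of the data exchanged over a fanotify file descriptor. The struct layouts and constants follow <linux/fanotify.h>; the function name answer_events and the decide callback are hypothetical, and this is in no way McAfee's implementation. A real listener would read() the events from its fanotify descriptor, write() each reply back to it, and close the per-event file descriptor.

```python
import struct

# Constants from <linux/fanotify.h>.
FAN_OPEN_PERM = 0x00010000
FAN_ACCESS_PERM = 0x00020000
FAN_ALLOW = 0x01
FAN_DENY = 0x02

# struct fanotify_event_metadata:
#   __u32 event_len; __u8 vers; __u8 reserved; __u16 metadata_len;
#   __aligned_u64 mask; __s32 fd; __s32 pid;
EVENT_METADATA = struct.Struct("=IBBHQii")
# struct fanotify_response: __s32 fd; __u32 response;
RESPONSE = struct.Struct("=iI")


def answer_events(buffer, decide):
    """Build one packed fanotify_response per permission event in 'buffer'.

    'decide(fd, pid)' is a hypothetical policy callback returning True to
    allow the access. Until such a reply is written, the task that triggered
    the event stays blocked in the kernel.
    """
    replies = []
    offset = 0
    while offset + EVENT_METADATA.size <= len(buffer):
        event_len, _vers, _reserved, _meta_len, mask, fd, pid = (
            EVENT_METADATA.unpack_from(buffer, offset)
        )
        if event_len < EVENT_METADATA.size:
            break  # malformed record, stop parsing
        if mask & (FAN_OPEN_PERM | FAN_ACCESS_PERM):
            verdict = FAN_ALLOW if decide(fd, pid) else FAN_DENY
            replies.append(RESPONSE.pack(fd, verdict))
        offset += event_len
    return replies
```

Events without a permission bit in the mask need no reply, which is why only permission events can block the accessing task.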

In the crash dump, it is notable that the oldest task in UN state is itself a McAfee service (Global Threat Intelligence Manager). Its stack shows it waiting in __request_module() for a usermode helper to complete:
PID: 7947   TASK: ffff880041216300  CPU: 1   COMMAND: "GTI LookUp Mgr"
 #0 [ffff880041219b78] schedule at ffffffff8147b6b9
 #1 [ffff880041219cd0] schedule_timeout at ffffffff8147be9d
 #2 [ffff880041219d60] wait_for_common at ffffffff8147aee5
 #3 [ffff880041219de0] wait_for_completion_killable at ffffffff8147afd8
 #4 [ffff880041219df0] call_usermodehelper_exec at ffffffff8108359d
 #5 [ffff880041219e40] __request_module at ffffffff81083993
 #6 [ffff880041219f00] __sock_create at ffffffff813a5758
 #7 [ffff880041219f50] sys_socket at ffffffff813a59ed
 #8 [ffff880041219f80] system_call_fastpath at ffffffff8148633e

The remaining UN-state tasks are ordinary tasks such as sshd, modprobe, cron, and TaniumClient, all with stack traces like those below, waiting for a fanotify response:
PID: 7995   TASK: ffff880040e261c0  CPU: 0   COMMAND: "modprobe"
 #0 [ffff880041797928] schedule at ffffffff8147b6b9
 #1 [ffff880041797a80] fanotify_get_response_from_access at ffffffff811a823d
 #2 [ffff880041797ae0] fanotify_handle_event at ffffffff811a8362
 #3 [ffff880041797b10] fsnotify at ffffffff811a49f1
 #4 [ffff880041797bc0] __dentry_open at ffffffff81167e1b
 #5 [ffff880041797c10] do_last at ffffffff81177e32
 #6 [ffff880041797c70] path_openat at ffffffff8117931e
 #7 [ffff880041797d00] do_filp_open at ffffffff8117954c
 #8 [ffff880041797dd0] do_sys_open at ffffffff8116984f
 #9 [ffff880041797e30] mfe_aac_sys_openat at ffffffffa028423c [mfe_aac_100613122]
#10 [ffff880041797f80] system_call_fastpath at ffffffff8148633e

PID: 8028   TASK: ffff880136e74280  CPU: 1   COMMAND: "sshd"
 #0 [ffff8801372f19e8] schedule at ffffffff8147b6b9
 #1 [ffff8801372f1b40] fanotify_get_response_from_access at ffffffff811a823d
 #2 [ffff8801372f1ba0] fanotify_handle_event at ffffffff811a8362
 #3 [ffff8801372f1bd0] fsnotify at ffffffff811a49f1
 #4 [ffff8801372f1c80] __dentry_open at ffffffff81167e1b
 #5 [ffff8801372f1cd0] do_last at ffffffff81177e32
 #6 [ffff8801372f1d30] path_openat at ffffffff811790c7
 #7 [ffff8801372f1dc0] do_filp_open at ffffffff8117954c
 #8 [ffff8801372f1e90] open_exec at ffffffff811719bf
 #9 [ffff8801372f1eb0] do_execve_common at ffffffff811727c9
#10 [ffff8801372f1f20] sys_execve at ffffffff8100b1e9
#11 [ffff8801372f1f50] stub_execve at ffffffff814867fc
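
With many blocked tasks, sorting the backtraces by hand is tedious. The sketch below assumes the backtraces were saved as text (for example from crash's "foreach UN bt") and separates tasks waiting in fanotify_get_response_from_access() from the rest. The function names are illustrative, and the regular expressions only cover the output format shown in this document.

```python
import re

# 'PID: 7995   TASK: ffff880040e261c0  CPU: 0   COMMAND: "modprobe"'
PID_LINE = re.compile(
    r'^PID:\s*(\d+)\s+TASK:\s*\S+\s+CPU:\s*\d+\s+COMMAND:\s*"(.*)"'
)
# ' #1 [ffff880041797a80] fanotify_get_response_from_access at ffffffff811a823d'
FRAME_LINE = re.compile(r'^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s')


def parse_backtraces(text):
    """Return a list of (pid, command, [function names]) from crash 'bt' output."""
    tasks = []
    current = None
    for line in text.splitlines():
        m = PID_LINE.match(line)
        if m:
            current = (int(m.group(1)), m.group(2), [])
            tasks.append(current)
            continue
        m = FRAME_LINE.match(line)
        if m and current is not None:
            current[2].append(m.group(1))
    return tasks


def split_by_marker(tasks, marker="fanotify_get_response_from_access"):
    """Split tasks into those with 'marker' in their stack and all others."""
    waiting, other = [], []
    for pid, command, frames in tasks:
        (waiting if marker in frames else other).append((pid, command))
    return waiting, other
```

Tasks that end up in the second list, such as the "GTI LookUp Mgr" thread above, are the ones worth inspecting individually.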

The only tasks holding an open fanotify file descriptor are McAfee-related tasks, for example:
crash> foreach files -R ffff880135bacb80

PID: 7756   TASK: ffff880126d94480  CPU: 1   COMMAND: "mfetpd"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7757   TASK: ffff880135672040  CPU: 0   COMMAND: "mfetpd"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7758   TASK: ffff880135732240  CPU: 0   COMMAND: "Req Thread->SAM"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7759   TASK: ffff880137346580  CPU: 0   COMMAND: "Rsp Thread<-SAM"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7761   TASK: ffff8801348b2440  CPU: 0   COMMAND: "mfetpd"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7975   TASK: ffff880040e86400  CPU: 0   COMMAND: "Collect FA Evnt"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]
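
A comparable check can be made on a live system without a crash dump, since a fanotify descriptor appears in /proc/<pid>/fd as a link to "anon_inode:[fanotify]". The following is a minimal sketch under that assumption; the function names are illustrative, and it needs sufficient privileges to read other processes' fd directories.

```python
import os

# Link target the kernel reports for fanotify file descriptors.
FANOTIFY_LINK = "anon_inode:[fanotify]"


def read_comm(proc_root, pid):
    """Return the command name of 'pid', or '?' if it cannot be read."""
    try:
        with open(os.path.join(proc_root, pid, "comm")) as f:
            return f.read().strip()
    except OSError:
        return "?"


def fanotify_holders(proc_root="/proc"):
    """Return sorted (pid, comm, fd) for every open fanotify descriptor found."""
    holders = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # no permission, or the process has exited
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            if target == FANOTIFY_LINK:
                holders.append((int(pid), read_comm(proc_root, pid), int(fd)))
    return sorted(holders)
```

Threads of one process share the same descriptor table, so this lists each process once, whereas the crash output above lists every thread.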

Checking the fanotify structures shows that the notification list is empty. notification_list holds events generated by the kernel but not yet read by userspace, so McAfee has already read all the events (whether the AV has also written the responses is checked further below):
 
crash> struct -x file.private_data ffff880135bacb80
  private_data = 0xffff880009872a80

crash> struct -x fsnotify_group.notification_list 0xffff880009872a80
  notification_list = {
    next = 0xffff880009872ab0,
    prev = 0xffff880009872ab0
  }

crash> list 0xffff880009872ab0 -H
(empty)

Listing the elements that still need to be dequeued (access_list holds events already read by userspace that are still awaiting a response):
crash> struct -x fsnotify_group.fanotify_data.access_list 0xffff880009872a80
  fanotify_data.access_list = {
    next = 0xffff880040e813d0,
    prev = 0xffff880040e81a30
  },

crash> list -H 0xffff880040e813d0 |wc -l
16

crash> list -H 0xffff880040e813d0
ffff880040e813b0
ffff880040e81390
ffff880040e81630
ffff880040e81610
ffff880040e81b90
ffff880040e81bb0
ffff880040e81c70
ffff880040e81c10
ffff880040e81b70
ffff880040e818d0
ffff880040e81770
ffff880040e81a90
ffff880040e81a70
ffff880040e81a50
ffff880040e81a30

crash> p "*((struct fanotify_response_event*)0xffff880040e813d0)->event"
$7 = {
  holder = {
    event = 0x0,
    event_list = {
      next = 0xffff880138695dd8,
      prev = 0xffff880138695dd8
    }
  },
  lock = {
    {
      rlock = {
        raw_lock = {
          slock = 131074
        }
...
  response = 0,

crash> p "*((struct fanotify_response_event*)0xffff880040e81630)->event"
...
  response = 0,

The combination of "response = 0", an empty notification_list, and a non-empty access_list indicates that the AV has read the requests but has not written the responses. This does not look like a kernel issue, because at least one of the fsnotify_group lists (access_list) is not empty. Had tasks been blocked in fanotify_get_response_from_access() with both notification_list and access_list empty, that would have pointed to a kernel issue, but that is not the case here.
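
The reasoning above can be summarized as a small decision function. This is only an illustrative sketch: the function name and return strings are made up, and the branch for a non-empty notification_list is an assumption derived from that list's definition rather than a case observed in this dump.

```python
def classify_fanotify_hang(blocked_waiters, unread_events, unanswered_events):
    """Classify a fanotify hang from list lengths taken from a crash dump.

    blocked_waiters:   tasks blocked in fanotify_get_response_from_access()
    unread_events:     length of fsnotify_group.notification_list
    unanswered_events: length of fsnotify_group.fanotify_data.access_list
    """
    if blocked_waiters == 0:
        return "no tasks are waiting for fanotify responses"
    if unanswered_events > 0:
        # Events were read by the listener but no response was written.
        return "userspace: listener read events but did not respond"
    if unread_events > 0:
        # Assumption: events are queued but the listener never read them.
        return "userspace: listener did not read queued events"
    # Waiters exist although nothing is queued or pending on either list.
    return "possible kernel issue: waiters blocked with both lists empty"


# The situation in this document: empty notification_list, non-empty
# access_list. The counts are placeholders, only zero vs non-zero matters.
print(classify_fanotify_hang(blocked_waiters=10, unread_events=0, unanswered_events=10))
```

For the dump analyzed here, the function lands in the second branch, matching the conclusion that the listener side is responsible.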

Resolution

This issue is very likely caused either by a misconfiguration of the McAfee userspace services or by a bug in the McAfee stack. It is therefore highly recommended to follow up with McAfee support.
 

Status

Reported to Engineering

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020332
  • Creation Date: 14-Jul-2021
  • Modified Date: 16-Jul-2021
    • SUSE Linux Enterprise Server
    • SUSE Linux Enterprise Server for SAP Applications
