System hang/freeze with a high number of tasks waiting for McAfee userspace services to write the fanotify responses
This document (000020332) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server for SAP Applications 15 SP4
SUSE Linux Enterprise Server for SAP Applications 12 SP3
SUSE Linux Enterprise Server 12 SP5
SUSE Linux Enterprise Server 12 SP4 LTSS
SUSE Linux Enterprise Server 12 SP3 LTSS
SUSE Linux Enterprise Server 11 SP4 LTSS
Situation
LOAD AVERAGE: 13.41, 7.57, 3.55
The tasks stay blocked until the fanotify response is received. Normally, every file access on a filesystem monitored by McAfee triggers a fanotify event. If the main McAfee services stop responding, tasks waiting for a fanotify response pile up in the uninterruptible (UN) state, which eventually leads to a system hang/freeze.
In the crash dump, the oldest task in UN state is itself a McAfee service (Global Threat Intelligence Manager):
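On a system that is still responsive, the pile-up described above can be observed before it becomes a full hang. The following is a minimal sketch, not part of the original analysis: it scans /proc for tasks in uninterruptible sleep (state D in /proc/<pid>/stat, shown as UN by crash). The function names parse_stat and blocked_tasks are illustrative, and only the main thread of each process is inspected.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split at the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process whose main thread is in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # task exited while scanning, or unreadable entry
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling blocked_tasks() repeatedly and watching the list grow alongside the load average gives an early indication, since Linux counts uninterruptible tasks in the load average.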
PID: 7947  TASK: ffff880041216300  CPU: 1  COMMAND: "GTI LookUp Mgr"
 #0 [ffff880041219b78] schedule at ffffffff8147b6b9
 #1 [ffff880041219cd0] schedule_timeout at ffffffff8147be9d
 #2 [ffff880041219d60] wait_for_common at ffffffff8147aee5
 #3 [ffff880041219de0] wait_for_completion_killable at ffffffff8147afd8
 #4 [ffff880041219df0] call_usermodehelper_exec at ffffffff8108359d
 #5 [ffff880041219e40] __request_module at ffffffff81083993
 #6 [ffff880041219f00] __sock_create at ffffffff813a5758
 #7 [ffff880041219f50] sys_socket at ffffffff813a59ed
 #8 [ffff880041219f80] system_call_fastpath at ffffffff8148633e
The remaining tasks in UN state are ordinary tasks such as sshd, modprobe, cron and TaniumClient. They all show the same kind of stack trace as below, waiting for the fanotify response:
PID: 7995  TASK: ffff880040e261c0  CPU: 0  COMMAND: "modprobe"
 #0 [ffff880041797928] schedule at ffffffff8147b6b9
 #1 [ffff880041797a80] fanotify_get_response_from_access at ffffffff811a823d
 #2 [ffff880041797ae0] fanotify_handle_event at ffffffff811a8362
 #3 [ffff880041797b10] fsnotify at ffffffff811a49f1
 #4 [ffff880041797bc0] __dentry_open at ffffffff81167e1b
 #5 [ffff880041797c10] do_last at ffffffff81177e32
 #6 [ffff880041797c70] path_openat at ffffffff8117931e
 #7 [ffff880041797d00] do_filp_open at ffffffff8117954c
 #8 [ffff880041797dd0] do_sys_open at ffffffff8116984f
 #9 [ffff880041797e30] mfe_aac_sys_openat at ffffffffa028423c [mfe_aac_100613122]
#10 [ffff880041797f80] system_call_fastpath at ffffffff8148633e

PID: 8028  TASK: ffff880136e74280  CPU: 1  COMMAND: "sshd"
 #0 [ffff8801372f19e8] schedule at ffffffff8147b6b9
 #1 [ffff8801372f1b40] fanotify_get_response_from_access at ffffffff811a823d
 #2 [ffff8801372f1ba0] fanotify_handle_event at ffffffff811a8362
 #3 [ffff8801372f1bd0] fsnotify at ffffffff811a49f1
 #4 [ffff8801372f1c80] __dentry_open at ffffffff81167e1b
 #5 [ffff8801372f1cd0] do_last at ffffffff81177e32
 #6 [ffff8801372f1d30] path_openat at ffffffff811790c7
 #7 [ffff8801372f1dc0] do_filp_open at ffffffff8117954c
 #8 [ffff8801372f1e90] open_exec at ffffffff811719bf
 #9 [ffff8801372f1eb0] do_execve_common at ffffffff811727c9
#10 [ffff8801372f1f20] sys_execve at ffffffff8100b1e9
#11 [ffff8801372f1f50] stub_execve at ffffffff814867fc
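When many tasks are blocked, it helps to confirm mechanically that they share the same wait point. The following is a rough sketch, assuming the backtraces of the UN tasks were saved as text (for example from a crash session using something like "foreach UN bt"). The helper name tasks_waiting_in and the regular expressions are illustrative and only cover the frame layout shown above.

```python
import re

# Task header line as printed by crash for each backtrace.
HEADER = re.compile(
    r'PID:\s*(\d+)\s+TASK:\s*[0-9a-f]+\s+CPU:\s*\d+\s+COMMAND:\s*"([^"]*)"'
)
# One stack frame: "#N [stack address] function at text address".
FRAME = re.compile(r'#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+')

def tasks_waiting_in(bt_text, function="fanotify_get_response_from_access"):
    """Return (pid, command) for each task whose stack contains `function`."""
    headers = list(HEADER.finditer(bt_text))
    waiting = []
    for i, match in enumerate(headers):
        # A task's frames run from its header to the next header (or the end).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(bt_text)
        frames = FRAME.findall(bt_text[match.end():end])
        if function in frames:
            waiting.append((int(match.group(1)), match.group(2)))
    return waiting
```

Applied to the traces in this document, the modprobe and sshd tasks would be reported, while "GTI LookUp Mgr" would not, because it is blocked in call_usermodehelper_exec rather than in the fanotify wait.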
The only tasks that hold the fanotify file open are McAfee-related tasks, for example:
crash> foreach files -R ffff880135bacb80
PID: 7756  TASK: ffff880126d94480  CPU: 1  COMMAND: "mfetpd"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7757  TASK: ffff880135672040  CPU: 0  COMMAND: "mfetpd"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7758  TASK: ffff880135732240  CPU: 0  COMMAND: "Req Thread->SAM"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7759  TASK: ffff880137346580  CPU: 0  COMMAND: "Rsp Thread<-SAM"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7761  TASK: ffff8801348b2440  CPU: 0  COMMAND: "mfetpd"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]

PID: 7975  TASK: ffff880040e86400  CPU: 0  COMMAND: "Collect FA Evnt"
ROOT: /    CWD: /
 FD       FILE            DENTRY           INODE       TYPE PATH
 24 ffff880135bacb80 ffff88005b9a0b00 ffff88013b0f3c98 UNKN [fanotify]
Checking the fanotify structures shows that the notification list is empty. Since notification_list holds events generated by the kernel but not yet read by userspace, McAfee has already read all the events. Whether the AV has also written the responses is checked further below:
crash> struct -x file.private_data ffff880135bacb80
  private_data = 0xffff880009872a80
crash> struct -x fsnotify_group.notification_list 0xffff880009872a80
  notification_list = {
    next = 0xffff880009872ab0,
    prev = 0xffff880009872ab0
  }
crash> list 0xffff880009872ab0 -H
(empty)
Listing the elements that still need to be dequeued:
crash> struct -x fsnotify_group.fanotify_data.access_list 0xffff880009872a80
  fanotify_data.access_list = {
    next = 0xffff880040e813d0,
    prev = 0xffff880040e81a30
  },
crash> list -H 0xffff880040e813d0 |wc -l
16
crash> list -H 0xffff880040e813d0
ffff880040e813b0
ffff880040e81390
ffff880040e81630
ffff880040e81610
ffff880040e81b90
ffff880040e81bb0
ffff880040e81c70
ffff880040e81c10
ffff880040e81b70
ffff880040e818d0
ffff880040e81770
ffff880040e81a90
ffff880040e81a70
ffff880040e81a50
ffff880040e81a30
crash> p "*((struct fanotify_response_event*)0xffff880040e813d0)->event"
$7 = {
  holder = {
    event = 0x0,
    event_list = {
      next = 0xffff880138695dd8,
      prev = 0xffff880138695dd8
    }
  },
  lock = {
    {
      rlock = {
        raw_lock = {
          slock = 131074
        }
...
  response = 0,
crash> p "*((struct fanotify_response_event*)0xffff880040e81630)->event"
...
  response = 0,
The combination of "response = 0", an empty notification_list and a non-empty access_list indicates that the AV has read the requests but has not written the responses. This does not look like a kernel issue, because at least one of the fsnotify_group lists (access_list) is not empty. If tasks had been blocked in fanotify_get_response_from_access() while both notification_list and access_list were empty, that would have pointed to a kernel issue, but this is not the case here.
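The reasoning above can be summarized as a small decision function. This is only an illustrative sketch: the function name, parameters and return strings are made up for this example, and the branch for a non-empty notification_list is an inference from the description of that list rather than a case analyzed in this document.

```python
def classify_fanotify_hang(unread_events, unanswered_events, blocked_waiters):
    """Interpret fsnotify_group list lengths taken from a crash dump.

    unread_events     -- entries on notification_list (not yet read by userspace)
    unanswered_events -- entries on fanotify_data.access_list (read, no response yet)
    blocked_waiters   -- tasks blocked in fanotify_get_response_from_access()
    """
    if blocked_waiters == 0:
        return "no tasks are blocked on fanotify"
    if unread_events > 0:
        # Inferred case: the listener has not even picked up the events.
        return "listener has not read all events"
    if unanswered_events > 0:
        # The case in this document: events read, responses never written.
        return "listener read the events but has not written responses"
    # Waiters exist although neither list holds an event for them.
    return "waiters blocked with both lists empty: possible kernel issue"
```

For the dump analyzed here, notification_list is empty and access_list is not, so any positive number of blocked waiters falls into the third branch.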
Resolution
Status
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020332
- Creation Date: 14-Jul-2021
- Modified Date: 16-Jul-2021
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com