SUSE Support

mce EDAC memory scrubbing error

This document (000020932) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 15
SUSE Linux Enterprise Server 12

Situation

On various Cisco hardware systems, memory errors can appear in the system logs.
Cisco hardware examples from supportconfig's basic-environment.txt:
Manufacturer:  Cisco Systems Inc
Hardware:      UCSC-C460-M4

Manufacturer:  Cisco Systems Inc
Hardware:      UCSB-EX-M4-3

Memory error examples from /var/log/messages or /var/log/warn:
kernel: [780347.201907] mce: [Hardware Error]: Machine check events logged
kernel: [780347.201913] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR
kernel: [780347.201915] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 13: 8c00004e000800c0
kernel: [780347.201916] EDAC sbridge MC3: TSC 320fbc8c89c9a4
kernel: [780347.201918] EDAC sbridge MC3: ADDR 52baf54000
kernel: [780347.201918] EDAC sbridge MC3: MISC 900020002001c8c
kernel: [780347.201920] EDAC sbridge MC3: PROCESSOR 0:406f1 TIME 1672341137 SOCKET 0 APIC 0
kernel: [780347.201936] EDAC MC3: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x52baf54 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:1 rank:0)
 
kernel: [5336480.342062] EDAC sbridge MC2: HANDLING MCE MEMORY ERROR
kernel: [5336480.342067] EDAC sbridge MC2: CPU 130: Machine Check Event: 0 Bank 8: cc00038000010091
kernel: [5336480.342070] EDAC sbridge MC2: TSC 0
kernel: [5336480.342071] EDAC sbridge MC2: ADDR b686270ec0
kernel: [5336480.342072] EDAC sbridge MC2: MISC 15646d086
kernel: [5336480.342074] EDAC sbridge MC2: PROCESSOR 0:406f1 TIME 1667973782 SOCKET 1 APIC 59
kernel: [5336480.342090] mce: [Hardware Error]: Machine check events logged
kernel: [5336480.342106] EDAC MC6: 14 CE memory read error on CPU_SrcID#1_Ha#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xb686270 offset:0xec0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:1 ha:1 channel_mask:2 rank:5)
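
To check whether a system is logging messages like the ones above, the corrected-error (CE) lines can be summarized per reported DIMM label. The following is a minimal sketch, not a SUSE or Cisco tool. The function name summarize_ce is hypothetical, and the pattern assumes the "EDAC MC<n>: <count> CE ... on <label>" layout shown in the examples. The label is only what the kernel reports and, per the Cisco note quoted under Additional Information, may not identify the DIMM that is actually faulty.

```shell
#!/bin/sh
# summarize_ce (hypothetical helper): total the CE counts per DIMM label
# found in EDAC summary lines. Reads the given files, or stdin if none.
summarize_ce() {
    awk '
    / EDAC MC[0-9]+: [0-9]+ CE / {
        for (i = 1; i <= NF; i++) {
            if ($i == "CE") count = $(i - 1)              # number before "CE"
            if ($i == "on") { label = $(i + 1); break }   # label after "on"
        }
        total[label] += count
    }
    END { for (l in total) print total[l], l }
    ' "$@" | sort -rn    # highest count first
}

# Usage (log path taken from this article):
#   summarize_ce /var/log/messages
```

Each output line contains the summed count followed by the label. The "EDAC sbridge" detail lines are skipped on purpose, since only the summary lines carry a count.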

 

Resolution

Because of a firmware bug, the memory errors shown above are not written to the hardware logs.
Cisco recommends disabling the sb_edac kernel module
by adding "blacklist sb_edac" to /etc/modprobe.d/50-blacklist.conf.

To do that, run the following command as root:
echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf
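
Because ">>" appends unconditionally, running the command more than once leaves duplicate entries in the file. A minimal sketch of an idempotent variant follows. The function name add_blacklist and its optional file argument are hypothetical and exist only so the function can be tried on a scratch file first.

```shell
#!/bin/sh
# add_blacklist (hypothetical helper): append the blacklist entry only if
# the exact line is not already present in the configuration file.
add_blacklist() {
    conf="${1:-/etc/modprobe.d/50-blacklist.conf}"
    entry="blacklist sb_edac"
    # -q: no output, -x: match the whole line, -F: fixed string, not a regex
    if grep -qxF "$entry" "$conf" 2>/dev/null; then
        echo "already present in $conf"
    else
        echo "$entry" >> "$conf"
        echo "added to $conf"
    fi
}

# Usage (as root, on the file named in this article):
#   add_blacklist
```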

Then reboot for the setting to take effect.
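
After the reboot, it may be worth confirming that the module is no longer loaded. The sketch below reads /proc/modules, which lists each loaded module with its name as the first field. The function name is_module_loaded and its optional second argument are hypothetical and only there for illustration.

```shell
#!/bin/sh
# is_module_loaded (hypothetical helper): succeed if the named module
# appears as the first field of a line in the modules list.
is_module_loaded() {
    module="$1"
    modules_file="${2:-/proc/modules}"
    grep -q "^${module} " "$modules_file" 2>/dev/null
}

# Usage:
#   if is_module_loaded sb_edac; then
#       echo "sb_edac is still loaded"
#   else
#       echo "sb_edac is not loaded"
#   fi
```

The same check can be done interactively with "lsmod | grep sb_edac", which prints nothing when the module is not loaded.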

Cause

The Linux kernel module sb_edac and the hardware EDAC conflict with each other, which prevents the hardware errors from being written to the hardware logs.

Additional Information

https://community.cisco.com/t5/unified-computing-system-discussions/memory-scrubbing-error/td-p/4071639
"If you have EDAC modules enabled in your Linux OS, then you really want to disable/black list those. They are notorious for not correctly identifying the actual DIMM that's triggering ECCs, and you really want to let the hardware do that (which it won't if you have EDAC active in os). Then you should be able to see in your SEL log, the DIMM slot this is triggering ECCs."

https://quickview.cloudapps.cisco.com/quickview/bug/CSCvf14908
"Symptom: When this issue occurs, the following two error/stack lines are often observed:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: [] sbridge_mce_output_error+0x36a/0xdf0 [sb_edac]
Conditions: Cisco UCS B or C series Servers running SLES12 SP1. The EDAC module and UCS error detection conflicts with each other and can cause system crashes. EDAC module should be blacklisted."

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020932
  • Creation Date: 17-Jan-2023
  • Modified Date: 18-Jan-2023
    • SUSE Linux Enterprise Server
    • SUSE Linux Enterprise Server for SAP Applications

For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com
