Considerations for dealing with correctable memory error messages

This document (7022118) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 11 (all versions)
SUSE Linux Enterprise Server 12 (all versions)
x86_64
mcelog

Situation

Error-correcting code memory (ECC memory) can detect and correct most of the errors it encounters. The kernel logs these errors, which often raises the question of how to interpret and handle this information.
Generally, the kernel detects and reports such errors, but systems with sophisticated firmware can sometimes do this better because of their intimate knowledge of the platform and their ability to perform the proper recovery actions.

However, the kernel is not the only source: the system service controller may also detect this kind of issue, or it may not. In particular, if the service controller does not show the events seen in the OS, the administrator may wonder whether there is an issue at all.

Resolution

Whether correctable memory errors should be logged is a matter of company or IT department policy; no general rule fits every IT department's goals.

The operating system (in this case the kernel) is as verbose as possible and logs those events by default, which may lead to false-positive alerts if the hardware management board reports no errors.
The kernel-source.rpm contains the file

    /usr/src/linux/Documentation/x86/x86_64/boot-options.txt

which documents a number of kernel options that influence the logging behavior of the kernel. The main question is: should the administrator worry about corrected ECC errors at all?

From a technical point of view, a corrected memory error message should be considered informational only: the error has been corrected by the built-in hardware error correction mechanisms and has had no effect on system execution. However, today's hardware management boards may define thresholds for how many errors may occur before a warning or action is triggered.
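The threshold idea described above can also be sketched on the operating system side. The following is a minimal illustration, not a vendor-recommended procedure. It assumes an EDAC driver is loaded and exposes per-controller counters under /sys/devices/system/edac/mc. The threshold value of 100 and the function names are hypothetical and chosen only for this example; real thresholds are a matter of local policy or hardware vendor guidance.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs.

    Returns an empty dict if the directory is missing, for example
    when no EDAC driver is loaded.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters cannot be read or parsed.
            continue
        counts[mc.name] = (ce, ue)
    return counts


def classify(ce, ue, ce_threshold=100):
    """Classify counters; ce_threshold is an assumed, site-specific value."""
    if ue > 0:
        return "critical"   # uncorrected errors are the ones to worry about
    if ce >= ce_threshold:
        return "warning"    # many corrected errors may justify a closer look
    return "ok"             # corrected errors below threshold: informational


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(name, "corrected:", ce, "uncorrected:", ue, classify(ce, ue))
```

Run on a system without EDAC counters, the script prints nothing, which matches the case where the hardware management board is the only monitoring source.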

Uncorrected errors, on the other hand, are the ones to worry about. When such an event occurs, the kernel panics automatically to prevent data corruption (see option mce=tolerancelevel# in /usr/src/linux/Documentation/x86/x86_64/boot-options.txt).

One kernel option that may influence the behaviour of ECC RAM error logging is the following (taken from /usr/src/linux/Documentation/x86/x86_64/boot-options.txt):
               
mce=ignore_ce
             Disable features for corrected errors, e.g. polling timer
             and CMCI.  All events reported as corrected are not cleared
             by OS and remained in its error banks.
             [...]

This option instructs the kernel to ignore correctable errors. It is intended for systems where a hardware management board monitors such events instead.
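To verify whether a running system was booted with this option, the kernel command line in /proc/cmdline can be inspected. The following is a small sketch under the assumption that the option appears as a separate mce=ignore_ce token; the function names are made up for this example.

```python
def mce_options(cmdline):
    """Return the value of every mce= parameter on a kernel command line."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("mce=")
    ]


def ignores_corrected_errors(cmdline):
    """True if mce=ignore_ce is present on the given command line."""
    return "ignore_ce" in mce_options(cmdline)


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print("mce=ignore_ce active:", ignores_corrected_errors(f.read()))
```

The check only reflects how the kernel was booted. It says nothing about whether the hardware management board is actually configured to monitor corrected errors.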

Additional Information

For HPE servers specifically, a document discussing this topic is available at:

http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-a00016026en_us&sp4ts.oid=3884323


Please note that disabling EDAC as discussed in this article will not affect the kernel's ability to react to uncorrectable memory error events. In this case, a machine check exception will be raised and the system will crash to prevent data corruption.
In case of any questions, please open a support request with the respective hardware vendor to discuss recommended settings for the hardware platform in use.

Disclaimer

This Support Knowledgebase provides a valuable tool for NetIQ/Novell/SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID:7022118
  • Creation Date:17-OCT-17
  • Modified Date:19-OCT-17
    • SUSE Linux Enterprise Server
