Linux system hangs or is unstable

This document (3301593) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 15
SUSE Linux Enterprise Server 12
SUSE Linux Enterprise Server 11
SUSE Linux Enterprise Server 10
 

Situation

System hangs
System is unstable
System oops or panic

Resolution

Follow a structured troubleshooting process covering the following areas discussed in detail below:
  • Problem characterization
  • Hardware layer
  • BIOS / firmware layer
  • Storage layer
  • Software layer

Additional Information

Introduction

Because they have so many potential causes, system hangs are among the most difficult problems to troubleshoot, and effective troubleshooting requires a systematic approach. This document describes such an approach in general terms.


Problem characterization

First of all, establish a detailed characterization of the problem which answers at a minimum the following questions:
  • What is meant by a hang or instability? Is the system no longer providing a particular service (reliably), has the system as a whole become completely inaccessible (both via network and via console), or is it still responsive to some forms of connection (e.g. SSH, VNC or ping) or commands?
  • For a hang, is it a single occurrence or has the hang occurred multiple times?
  • For a recurring hang, is there a pattern to the hangs? E.g. can the hang be triggered by a particular sequence of operations, or does it always occur around a particular time of day, after a particular period of system uptime, or when particular cron jobs are executed?
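
One way to look for a time-of-day pattern is to tally the recorded hang times by hour. The following is a minimal sketch only: it assumes the hang times have been noted by hand, one per line, in the form "YYYY-MM-DD HH:MM", and the function name hangs_per_hour is hypothetical.

```shell
# Reads lines of the form "YYYY-MM-DD HH:MM" on stdin and prints
# "count hour" pairs, with the most frequent hour first.
hangs_per_hour() {
    awk '{ split($2, t, ":"); print t[1] }' |   # keep only the hour
        sort | uniq -c |                        # count occurrences per hour
        sort -k1,1nr -k2,2 |                    # highest count first
        awk '{ print $1, $2 }'                  # normalize spacing
}
```

For example, feeding it a file of recorded hang times (here a hypothetical hangs.txt) with "hangs_per_hour < hangs.txt" shows whether the hangs cluster around a particular hour, which can then be compared against cron schedules.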


Hardware layer

System hangs or instabilities can be caused by hardware that is defective or improperly configured. Unfortunately, this happens more than most people realize, for two main reasons:
  • A ground rule with hardware is "Cheap, reliable, fast. Pick any two". Hardware that is cheap and reliable is not fast; hardware that is fast and cheap is not reliable; hardware that is reliable and fast is not cheap.
  • Proper hardware configuration is difficult. Most hardware has many settings which can be tweaked, but knowing when and what to tweak can be something of a black art.

Use diagnostics software

Fortunately, reputable hardware vendors offer diagnostics software that can and should be used to detect hardware problems. If hardware problems are incorrectly disregarded as a problem source, much time will be wasted on analysing the software level.

Aside from vendor hardware diagnostics software, for x86 and x86_64 systems there are very thorough diagnostic tools for the memory subsystem: Memtest86 and Memtest86+. These tools are often better at identifying memory subsystem issues than vendor hardware diagnostics software. A version of them is included on the boot CD of SUSE Linux products and these tools can also be obtained from the www.memtest86.org and www.memtest86.com web sites.


Consult vendor configuration guides

As for hardware configuration, some vendors (e.g. IBM) provide detailed configuration guides for SUSE Linux products on specific hardware models on their support sites. When available, this type of guide should be followed, preferably from the initial installation onwards. Even when such a guide has not been followed during initial installation, it should be consulted later on to check the system configuration and bring it in line with the hardware vendor's recommendations.


Consult certification documentation

Additionally, for SUSE YES CERTIFIED configurations, consult the YES CERTIFIED Bulletin Search. Where applicable, the certification bulletins contain configuration details such as Linux kernel parameters.

Address power supply issues

In some regions or at some locations, power from the regular electrical grid may be too variable in voltage, frequency or current for hardware to operate reliably. In such locations, appropriate electrical hardware like surge protectors, voltage regulators, uninterruptible power supplies and/or generators should be used to provide reliable power for computer systems operation.

Isolate components

In some cases, stability issues and hangs are caused by specific extension cards. Remove all non-essential extension cards and test the system, then put them back one by one, testing the system after each added card.

Best practice: "burn in" testing

In light of these considerations, it is considered best practice for hardware intended for production services to undergo thorough "burn in" testing, covering diagnostics, stress testing and load testing, before it is put into production use.


BIOS / firmware layer

On PC-based systems, the BIOS (Basic Input/Output System) is responsible for the initial setup of the system and devices up to the point where a boot loader can be started to boot the system. On other architectures, the term "BIOS" is not used, but equivalent embedded software exists, e.g. "Open Firmware" or "Extensible Firmware Interface".

The BIOS and its equivalents on non-PC architectures may also be involved in power management, hardware monitoring and hotplugging of extension cards.

A BIOS, like any other software, may contain general programming defects (bugs) and may not fully follow or support relevant standards such as ACPI. Vendors regularly release updated BIOS versions to correct such defects. Given the central role of the BIOS, it is important to track such version updates and to ensure the most recent non-development version of the BIOS is installed.

Most reputable vendors provide a search interface on their support sites that makes it easy to find the current BIOS revision for a particular hardware model as well as update instructions.
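
To compare against the vendor's current revision, the installed BIOS version has to be known first. The following is a minimal sketch, assuming an x86 system whose kernel exposes DMI data under /sys/class/dmi/id (older kernels may not); the function name show_bios_info and its optional directory argument are hypothetical. Where the dmidecode tool is installed, "dmidecode -s bios-version" run as root reports the same information.

```shell
# Prints the BIOS vendor, version and release date as exposed by the
# kernel's DMI sysfs interface. The optional first argument overrides
# the directory to read from (assumed default: /sys/class/dmi/id).
show_bios_info() {
    dmi_dir=${1:-/sys/class/dmi/id}
    for f in bios_vendor bios_version bios_date; do
        if [ -r "$dmi_dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dmi_dir/$f")"
        else
            # File missing or unreadable on this kernel or architecture
            printf '%s: unavailable\n' "$f"
        fi
    done
}
```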


Other Firmware

On modern hardware, many components (for instance NICs, HBAs and storage controllers) include embedded software or firmware of their own. This firmware should be brought up to date as well.
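
For network cards, the firmware version currently in use can usually be read from the driver information reported by ethtool. The following is a minimal sketch: the function name nic_firmware_version and the interface name eth0 are assumptions, and some drivers report an empty or "N/A" firmware version.

```shell
# Reads the output of "ethtool -i <interface>" on stdin and prints
# the value of its "firmware-version" line.
# Example (eth0 is a placeholder interface name):
#   ethtool -i eth0 | nic_firmware_version
nic_firmware_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}
```

HBAs and storage controllers generally need the vendor's own management tools for the equivalent query.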


Storage layer

Ensure that your storage is consistent by performing filesystem checks (and recovery) on all storage areas, including the root filesystem. To check the root filesystem, use the rescue environment from the service pack or installation CDs or DVDs.


Software layer

Check for corrupted data

Even when the filesystems check out cleanly, data contained in them may be corrupted, including code and data vital to proper operation of the operating system. The package management system stores checksums of data under its control. Run
rpm -Vva
to verify the contents of your file system against those checksums.
Check the output of this command for signs of changes in files that are not configuration files, like binaries and libraries.
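
Because the verbose output lists every file, including unchanged ones, it can help to filter it. The following is a minimal sketch, assuming the usual rpm verify output layout (an attribute string or "missing", an optional single-letter file marker such as "c" for configuration files, then the path); the function name filter_changed_non_config is hypothetical.

```shell
# Reads "rpm -Va" or "rpm -Vva" output on stdin and prints only the
# lines for files that differ from the package database and are not
# marked as configuration files.
# Example:
#   rpm -Vva | filter_changed_non_config
filter_changed_non_config() {
    awk '
        $1 ~ /^\.+$/ { next }   # all dots: file passed every check
        $2 == "c"    { next }   # marked as a configuration file
        { print }               # changed or missing non-config file
    '
}
```

Any binary or library that remains in the output, in particular with a "5" (digest mismatch) in the attribute string, deserves a closer look.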

Keep the software installation up to date


SUSE actively maintains released products for long periods of time. This maintenance includes fixes for software defects in particular as well as the addition of drivers for newer hardware models. Use the tools supplied by SUSE, in particular the SPident tool, the SUSE Customer Center and the online update facilities of your product to check whether your software installation is up to date and to bring it up to date if it isn't.

Check recent updates

Unfortunately, updated packages can occasionally introduce new defects. You can use the package management system of your SUSE Linux Enterprise product to determine what updates have been installed recently, e.g. through:
rpm -qa --last
This may help isolate which software update introduced a defect. When an updated package breaks a previously functioning system, please inform SUSE Support through a service request or a bug report.
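
To narrow the list to a specific time window, such as the days before the first hang, the install time can be queried as a numeric timestamp and filtered. The following is a minimal sketch; the function name recent_packages and the seven-day window in the example are assumptions.

```shell
# Reads "EPOCH PACKAGE" lines on stdin and prints those installed at
# or after the cutoff (seconds since the epoch), newest first.
# Example: packages installed during the last 7 days
#   rpm -qa --qf '%{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' |
#       recent_packages "$(( $(date +%s) - 7 * 86400 ))"
recent_packages() {
    cutoff=$1
    awk -v cutoff="$cutoff" '$1 >= cutoff' | sort -rn
}
```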


Support from SUSE Technical Services

Basic information

When opening a service request with SUSE Support for a server hang or instability issue, the following information may be vital to an efficient resolution process:

Crash dumps

During the handling of your service request, you may be asked to provide a system crash dump for analysis, which may require substantial setup (e.g. of a serial console and/or second server to receive dumps). You can prepare for this by consulting the relevant TID or SLES Documentation for details e.g.:

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 3301593
  • Creation Date: 29-Oct-2007
  • Modified Date: 27-May-2021
    • SUSE Linux Enterprise Desktop
    • SUSE Linux Enterprise Server
