SUSE Conversations


A Basic Server Health Check with Supportconfig



By: jrecord

March 4, 2008 4:08 pm


Overview

Methodology

Health Check Overview

Basic Environment Check

Basic Health Check

Basic Problem Solving

Table Summary

Conclusion

Overview

When problems arise with a server, the simple things are often overlooked. Is the monitor plugged in? Is the install media in the DVD drive? Was the service started? Supportconfig is a tool designed to gather system information in a way that promotes resolving problems as quickly as possible. The goal of this article is to show the administrator how to use supportconfig to check the basic health of the server. A test case of a server with high CPU utilization will illustrate the process. Once you have created a supportconfig tar ball, you should perform a server health check. Checking the basics begins with the basic-environment.txt and basic-health-check.txt files.

Methodology

Supportconfig has three primary purposes: 1) gather important system information, 2) reduce problem resolution time, and 3) teach useful system commands. Information is critical to any problem solving scenario. If there’s a problem, the basic supportconfig philosophy is to gather as much information as possible, so we only have to ask for it once. Once the information is gathered, it should be organized so that problems can be solved quickly and efficiently. As a result, several pieces of information are replicated to create a kind of one-stop-shop environment. For example, all services and their current run level states are recorded in the chkconfig.txt file. However, the current state of services specific to Logical Volume Management (LVM) is also recorded in the lvm.txt file, so the lvm.txt file is one location to review much of the LVM information. All files end with a ‘.txt’ extension so they are easily recognized and opened with default editors across platforms. As a teaching tool, supportconfig first logs each command used to gather information to the appropriate file, and then records the output. This way, if supportconfig ever “hangs,” you know which command it hung on. You can also quickly reproduce any piece of information you want, because the exact command with its path and options was recorded in the text file.

To get good at reading a supportconfig, you need experience. After you look at 100 supportconfig tar balls, you will learn what is normal. The abnormal will then stand out. How can you cut the learning curve if you don’t have 100 supportconfigs or the time to look at them? After each of your systems is running smoothly and tuned to your needs, get a supportconfig. Copy the tar ball off the server for comparison when and if the server experiences a problem that needs troubleshooting. Compare the supportconfig tar ball taken when the problem occurred with the saved good copy for that server. Make sure you submit both tar balls to Novell Technical Services if you need to open a service request.
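
Comparing a good copy with a problem copy can be partly automated by diffing the same report file from both tar balls. The following is only a minimal Python sketch, not part of supportconfig. The function name and the sample text are made up for illustration, and it assumes both tar balls have already been extracted and the file contents read into strings.

```python
import difflib


def changed_lines(baseline_text, problem_text):
    """Return only the removed (-) and added (+) lines between two report files."""
    diff = difflib.unified_diff(
        baseline_text.splitlines(),
        problem_text.splitlines(),
        fromfile="baseline",
        tofile="problem",
        lineterm="",
        n=0,  # no context lines; only show what differs
    )
    lines = list(diff)[2:]  # drop the "---"/"+++" file header pair
    return [line for line in lines if not line.startswith("@@")]


# Hypothetical excerpts of the same file from a healthy run and a problem run.
baseline = "loop  off\nsshd  on\n"
problem = "loop  on\nsshd  on\n"
for line in changed_lines(baseline, problem):
    print(line)
```

Expect some noise from such a comparison: values like dates, process IDs and counters will differ between any two runs, so the output is a starting point for reading, not a verdict.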

Health Check Overview

Every time I get a supportconfig, I check the basics for obvious problems. The goal in checking the health of the server is to note red and yellow flags. Red Flags are issues that must be explained before moving forward. They directly relate to the reported problem or may affect the overall server performance. Red flags are not necessarily bad, but they need to be explained. Yellow Flags are issues that probably should be addressed or at least monitored, but are not directly related to the problem. A basic server health check in its simplest form confirms that the server is patched and up-to-date, that the CPU utilization, memory usage and disk space are within normal limits, and that the kernel and running services are healthy.

Basic Environment Check

Look at the basic-environment.txt file first. The goal for reviewing this file is to confirm that the server is patched, to see which major packages are installed, and to find out whether there may be any firewall concerns.

 

  1. Download and use the latest supportconfig. We want the best diagnostic information possible.

 

  2. Verify that the script execution date is relevant to the problem time frame, to make sure we aren’t looking at obsolete information.

Figure 1 – Script version and run time

Test Case: As of this writing, version 2.18-11 is the current supportconfig, and the run date matches the time frame of the issue.

 

  • Check the hostname to confirm the supportconfig was run on the host with the problem.

 

  • Compare the running kernel version with TID3594951: Table of Kernel Versions for SUSE Linux Enterprise Server to ensure it’s current. The kernel is the heart of the SUSE distribution. As such, security vulnerabilities and bugs are corrected to keep the distribution safe and effective. If you are running an older kernel, be sure the reasons for doing so outweigh the security and bug fixes you are giving up. From a technical support perspective, it’s also important to know the type of kernel running (default, smp, bigsmp, etc.) and the system architecture.

    Red Flag: The running kernel according to uname is not the latest kernel or differs from the installed kernel RPM package.

 

Figure 2 – Host name, running kernel and architecture

Test Case: The kernel is not up to date. Since the problem is CPU utilization, updating the kernel may be a valid troubleshooting step. However, we would want to first know how the CPU is being utilized; keep reading.

 

  • Make sure the kernel version, SuSE-release and SPident all show the same patch level. The kernel, SuSE-release and SPident all come from different RPM packages. So if the server has been installed and patched correctly, all of these packages will agree with one another. If they don’t, you need to explain why.

    Red Flag: There is a mismatch among the three.

 

Figure 3 – uname, release and SPident

Test Case: SPident and SuSE-release say the server is at SLES10 SP1. The running kernel is 2.6.16.53-0.16, which is newer than the SLES10 SP1 kernel 2.6.16.46-0.12. Since the server has been patched, all three are consistent.
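
The "newer than" comparison in this test case can be sketched in a few lines of Python. This is an illustration only: the function name is made up, the "-smp" flavor suffix on the running kernel is an assumption (the figure is not reproduced here), and the numeric comparison is a simplification that does not implement RPM's own version ordering rules.

```python
import re


def kernel_version_key(version):
    """Turn '2.6.16.53-0.16-smp' into (2, 6, 16, 53, 0, 16) for comparison."""
    # Keep only the leading numeric part; a trailing flavor such as
    # '-default', '-smp' or '-bigsmp' is ignored.
    match = re.match(r"\d+(?:[.-]\d+)*", version)
    if match is None:
        raise ValueError(f"no version number in {version!r}")
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


running = "2.6.16.53-0.16-smp"  # uname -r style; the flavor suffix is assumed
sp1_kernel = "2.6.16.46-0.12"   # SLES10 SP1 kernel named in the test case
print(kernel_version_key(running) > kernel_version_key(sp1_kernel))
```

Tuples compare element by element, so 53 versus 46 in the fourth position decides the result here.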

 

  • Did SPident pass verification? If not, either reinstall the SPident RPM package so you can rely on its output, or simply don’t trust the output.

 

  • Verify that all RPM packages are current according to SPident. SPident compares the version of each RPM package installed on the system with a list of known versions for the shipping distribution and each service pack.

    Red Flag: An RPM package relating to the problem is outdated, or any other package is very old.

 

Figure 4 – More SPident

Test Case: The SPident RPM package is fine, and there are no conflicting packages. Updates have been applied, but since the kernel is outdated, other package updates are probably available as well. It may be worth patching the server, rebooting and retesting the CPU utilization issue.

 

  • Are there any unsupported RPM package distributions installed that may be related to the issue? Normally, third party packages simply add functionality to the server, but don’t replace the packages distributed with SUSE.

    Red Flag: A third party package replaces a SUSE Linux distributed package.

 

Figure 5 – Third party replacements

Test Case: The only third party package installed is from the “Novell NTS” distribution. It provides supportconfig itself and does not replace a distributed package, like apache or LVM. Sometimes third party packages don’t list a distribution, so they show up as “(none).” SUSE also distributes some packages with a “(none)” distribution. You can see these packages by searching for “(none)” in the rpm.txt file.

 

  • Will the firewall come on after a reboot, and are there currently active rules? If there are and the problem is related to networking, maybe the firewall is interrupting service.

 

Figure 6 – Firewall status

Test Case: The firewall services are turned off and there are no currently active firewall rules. The firewall won’t play a part in the problem.

Basic Health Check

Look at the basic-health-check.txt file next. The goal for reviewing this file is to check CPU utilization, memory, disk utilization, kernel taint status, and the health of running processes.

 

  1. Check the load averages. The load average is the average number of processes that were running, waiting to run, or in uninterruptible sleep over the past one, five and fifteen minutes. It is a good indication of how busy the kernel is. A high load average may not be bad, but it should be explainable and should not impact the overall server performance.

    Red Flag: Load averages greater than 20.

Figure 7 – Uptime and load average

Test Case: The uptime is 112 days. This is good to know since the first line of vmstat is an average over the uptime of the server. Subsequent vmstat lines are current snapshots in time. This allows us to observe which of the values have changed over time. The load averages show the kernel is busy, but probably nothing to worry about. It is consistent with the reported CPU utilization concern.
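
If you review many servers, the load average check can be scripted. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the sample uptime line uses invented load values (only the 112 days comes from the test case), and the limits of 5 and 20 are the yellow and red flag values suggested in this article.

```python
import re


def load_averages(uptime_line):
    """Extract the 1, 5 and 15 minute load averages from an uptime line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found")
    return tuple(float(value) for value in match.groups())


def load_flag(averages, yellow=5.0, red=20.0):
    """Classify by the highest of the three averages."""
    worst = max(averages)
    if worst > red:
        return "red"
    if worst > yellow:
        return "yellow"
    return "ok"


# Invented sample line; the load values are not taken from the figure.
sample = " 4:08pm  up 112 days  3:02,  2 users,  load average: 5.11, 5.07, 5.02"
averages = load_averages(sample)
print(averages, load_flag(averages))
```

A flag from a script like this is only a prompt to look closer. As noted above, a high load average may be fine as long as it can be explained.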

 

  • CPU utilization. I am more concerned about how the CPU is being utilized than how much it’s being utilized. To better understand how the CPU is being used, the vmstat and mpstat commands are helpful. The mpstat averages are over the mpstat samples, not the server uptime. If there is high user space CPU activity, check the “Top 10 CPU Processes” to find the offending binaries. If it’s system space, look at the vmstat “system” columns. High interrupts (in) may indicate misbehaving hardware or an impending hardware failure. Look at procinfo in hardware.txt to track down which interrupt is causing the problem. A high number of context switches (cs) may indicate an application bug.

    Red Flag: Values for “in” or “cs” greater than 10,000. NOTE: A high number of context switches is normal for the SLERT kernel.

 

Figure 8 – CPU Utilization

Figure 9 – Top 10 CPU

Test Case: Comparing the first vmstat line with the other lines shows on average the CPU has been idle, but recently it has spiked to 100%. Each CPU is topped at 100% according to mpstat. The load averages are about the same for the past 1, 5 and 15 minutes. Notice that the user space is consuming the CPU. The interrupts (in) and context switches (cs) are not a concern. Since the problem seems isolated to user space, look at the top ten CPU processes. A program called “loop” is the major offender here. There are five of them, and they are all consuming heavy CPU time.
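
The same reading of vmstat can be sketched in Python: compare the first line (average since boot) with the later samples, and check the "in" and "cs" columns against the red flag limit of 10,000. This is a rough illustration. The function names are hypothetical, the sample numbers are invented rather than copied from the figure, and the column layout is assumed to be the usual vmstat one with a row of column names.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0      0  39936  81200 612004    0    0     3     5   12   30  2  1 97  0  0
 6  0      0  39920  81200 612004    0    0     0     0  262  115 99  1  0  0  0
"""


def parse_vmstat(text):
    """Return one dict per numeric vmstat line, keyed by column name."""
    rows = [line.split() for line in text.strip().splitlines()]
    # The column-name row is the one that contains both "in" and "cs".
    header = next(fields for fields in rows if "in" in fields and "cs" in fields)
    return [
        dict(zip(header, map(int, fields)))
        for fields in rows
        if len(fields) == len(header) and all(f.isdigit() for f in fields)
    ]


def system_red_flags(sample, limit=10000):
    """Return the names of the 'in'/'cs' columns that exceed the limit."""
    return [name for name in ("in", "cs") if sample[name] > limit]


samples = parse_vmstat(SAMPLE)
since_boot, latest = samples[0], samples[-1]
print("user CPU since boot:", since_boot["us"], "latest:", latest["us"])
print("red flags:", system_red_flags(latest))
```

Looking columns up by name rather than by position keeps the sketch working when a vmstat version adds or drops a column.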

 

  • Memory utilization. A small amount of free memory does not necessarily mean the server is running out of memory. Linux is efficient with memory usage and caches as much as it can. You can also look at the “Top 10 Memory Processes” to find out which applications are using the most memory.

    Red Flag: The server is frequently swapping to disk, and free memory is below 2MB.

 

Figure 10 – swap, free and cache

Figure 11 – Top 10 Memory

Test Case: The server is not currently swapping to disk, there is a lot of cached memory and free memory is 39MB. I wouldn’t worry about memory.
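
To see why a small free value is not alarming on its own, it helps to add the buffers and cache back in, since the kernel can give up most of that memory when applications need it. The Python sketch below assumes output in the older free -k layout with separate "buffers" and "cached" columns. The function names are hypothetical, and all numbers are invented except the 39MB of free memory (39936 KB) from the test case.

```python
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:       1027204     987268      39936          0      81200     612004
-/+ buffers/cache:     294064     733140
Swap:      2097144          0    2097144
"""


def parse_free_kb(text):
    """Return the 'Mem:' row of free -k output as a dict keyed by column name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "Mem:":
            return dict(zip(header, map(int, fields[1:])))
    raise ValueError("no 'Mem:' row found")


def reclaimable_kb(mem):
    """Free memory plus buffers and page cache (largely reclaimable)."""
    return mem["free"] + mem["buffers"] + mem["cached"]


mem = parse_free_kb(SAMPLE)
print("free KB:", mem["free"])
print("free + buffers + cached KB:", reclaimable_kb(mem))
```

In this layout the second figure matches the "free" column of the "-/+ buffers/cache" row, which is the number worth watching.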

 

  • Disk utilization. It is bad to run out of disk space. Linux uses files as a way to write to memory, disk and all sorts of other things. If a temporary file or named pipe cannot be created, the system will be unreliable. This is particularly problematic if root, /tmp or /home get full.

    Red Flag: Running out of disk space on the root “/”, /tmp or /home partitions. Make sure you are not running out of inodes on these file systems either.

 

Figure 12 – Disk space and inodes

Test Case: There is plenty of disk space and free inodes.
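
The disk check is easy to script because both the space report and the inode report put the percentage in the second-to-last column. The following Python sketch is an illustration with a hypothetical function name and an invented df listing. It uses the 80% and 90% limits suggested in this article and assumes mount points contain no spaces.

```python
SAMPLE = """\
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              20G  4.1G   15G  22% /
/dev/sda3              10G  8.6G  1.4G  86% /home
/dev/sda5             2.0G  1.9G  100M  95% /tmp
"""


def usage_flags(df_text, yellow=80, red=90):
    """Map each mount point to 'yellow' or 'red' when its Use%/IUse% is high."""
    flags = {}
    for line in df_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue  # e.g. a long device name wrapped onto its own line
        percent = fields[-2]
        if not (percent.endswith("%") and percent[:-1].isdigit()):
            continue  # e.g. "-" where the file system reports no inode count
        value = int(percent[:-1])
        if value > red:
            flags[fields[-1]] = "red"
        elif value > yellow:
            flags[fields[-1]] = "yellow"
    return flags


print(usage_flags(SAMPLE))
```

Run it once on the disk space listing and once on the inode listing, since a file system can run out of either one first.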

 

  • Kernel taint status. Will Novell support a tainted kernel? Yes and No. If the kernel is tainted with third party drivers, then the kernel development teams will have a difficult time providing a patch for the kernel, since the kernel has changed from anything Novell provides. However, the support teams will do their best to help, regardless. If at all possible, reboot your server and duplicate the problem on an untainted kernel.

    Red Flag: A tainted kernel.

 

Figure 13 – Kernel taint status

Test Case: The kernel is tainted, but since the problem is a third party application, “loop”, the taint status will not affect supportability. If it is ever determined that one of these drivers is part of the problem, then the taint status would affect supportability.

 

  • Health of system processes. Look for processes in a “D” (uninterruptible sleep) or “Z” (zombie/defunct) state. If you have several processes in a D state, this usually means they are waiting on disk I/O. Any command that accesses the same disk I/O path may appear to be hung while it waits on that particular disk I/O. This may explain why the server appears to “hang” at times. A process in a Z state has exited or segfaulted, but its parent process has not yet collected its exit status. This may indicate an unhealthy parent process. A high number of D state processes puts the server performance at risk, whereas Z state processes put running applications at risk.

 

Figure 14 – Health of processes

Test Case: One of the “loop” processes is in a “D” state. Since this is one of the applications causing the utilization, this is a red flag. However, since there is only one and the others are working, I suspect this condition is normal and temporary.
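
Counting D and Z state processes by hand gets tedious on a busy server. The Python sketch below tallies the first letter of the state column in a ps listing and applies the limits suggested in this article (more than 3 or 5 for D, more than 5 or 10 for Z). The function names, the column layout and the sample listing, including the "helper" process, are assumptions made for illustration. The state column is assumed to be headed STAT or S and to appear before the command.

```python
from collections import Counter

SAMPLE = """\
USER       PID  PPID %CPU STAT COMMAND
root         1     0  0.0 Ss   init [3]
root      4120     1 19.9 R    ./loop
root      4121     1 19.8 D    ./loop
root      4300  4100  0.0 Z    [helper] <defunct>
"""


def state_counts(ps_text):
    """Count processes by the first letter of their STAT (or S) column."""
    lines = ps_text.strip().splitlines()
    header = lines[0].split()
    column = header.index("STAT") if "STAT" in header else header.index("S")
    counts = Counter()
    for line in lines[1:]:
        # Limit the split so a command containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        if len(fields) > column:
            counts[fields[column][0]] += 1
    return counts


def flag(count, yellow, red):
    """Classify a count against the yellow and red limits."""
    if count > red:
        return "red"
    if count > yellow:
        return "yellow"
    return "ok"


counts = state_counts(SAMPLE)
print("D:", counts["D"], flag(counts["D"], 3, 5))
print("Z:", counts["Z"], flag(counts["Z"], 5, 10))
```

Only the first letter of the state is used because ps may append modifiers such as "s" or "+" to it.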

 

  • AppArmor reject messages. If you have any AppArmor reject messages, try turning off AppArmor, rebooting the server and retesting the problem. AppArmor is powerful and can even prevent the root account from performing operations.

 

Figure 15 – AppArmor

Test Case: The only thing I’m really concerned about is the presence of a REJECT message, not what kind of message it is. Since this server has one, you might consider turning off AppArmor, rebooting the server and retesting. Maybe the “loop” application will run properly without any AppArmor constraints.

 

  • List of running processes. It’s good to know what is running on the system. Make sure you are only running necessary applications and daemons.

 

 

Test Case Summary: A basic server health check has revealed the root cause of the high CPU utilization issue. The “loop” processes are the cause. However, even if the CPU utilization is high, 100% in this case, I would still ask the question, “Is the server experiencing performance degradation?” If not, then I probably wouldn’t worry about the high CPU utilization. Otherwise, consider offloading or splitting up those applications among other servers. The point is, you were able to create an action plan to address the problem from a basic server health check.

Basic Problem Solving

When troubleshooting a specific problem, I like to eliminate the obvious first. A general approach is to verify the associated RPM packages, make sure the service is configured to start at boot time and is currently running, validate the configuration files, and check the service’s log file for obvious errors.

Table Summary

The following is a summary of the red and yellow flags with their suggested limits.

Category   Description                  Yellow Flag         Red Flag
RPM        Outdated Packages            Unrelated Packages  Related Packages
RPM        Third Party Packages         N/A                 Replacements
Kernel     Running Version              N/A                 Old Kernel
Kernel     Load Averages                > 5                 > 20
CPU        Utilization – %Busy (%Idle)  > 80% (< 20%)       > 90% (< 10%)
CPU        Interrupts/sec (in)          > 8000              > 10000
CPU        Context Switches/sec (cs)    > 8000              > 10000
Memory     Free                         < 4MB               < 2MB
Disk       Percent Space Used           > 80%               > 90%
Disk       Percent Inodes Used          > 80%               > 90%
Kernel     Taint Status                 N/A                 tainted
Processes  Number in D State            > 3                 > 5
Processes  Number in Z State            > 5                 > 10

Conclusion

When a problem occurs on your server, you should first get a supportconfig tar ball. Perform a basic server health check using the basic-environment.txt and basic-health-check.txt files. Some problems can be corrected or minimized by simply checking the basic health of the server. The first basic action item is to ensure the server is patched and up-to-date. Next make sure the server is not overloaded, running out of memory, or out of disk space. Identify any Red Flags you might have, and make sure you can explain why they exist. If you open a service request with Novell Technical Services, include your health check results along with the supportconfig tar ball of the healthy server and a supportconfig taken when the problem occurs.

Categories: Enterprise Linux, SUSE Linux Enterprise Server, Technical Solutions

Disclaimer: As with everything else at SUSE Conversations, this content is definitely not supported by SUSE (so don't even think of calling Support if you try something and it blows up).  It was contributed by a community member and is published "as is." It seems to have worked for at least one person, and might work for you. But please be sure to test, test, test before you do anything drastic with it.

1 Comment

  1. By:konecnya

    Some suggestions

    Do something to make the file names stand out from the text they are in. It all blends together, and my first read of the first one was “Look at the basic environment file first.”
    I guess that is just part of the fuzzy nature of the old wetware, and I am also dealing with a (hopefully temporary) vision problem that makes precision reading a challenge.

    In “Basic Environment Check” item 2, second bullet point on kernel version, I think you mean ‘outweigh’, vs ‘out way’.
    I see that I have a newer kernel than the one showing on the TID, but then I see the document was last updated a whole 40 days ago, so you are ‘only’ every couple of months, not with every patch. It might be worth a comment to that effect in that TID and/or this document, as I doubt you’ll be able to keep that TID 100% up to date all the time.

    A very nice rundown of the basics. Now to get my Linux skills up to the point where I can fix the things I am finding. Thank you.
