Streamline Your Root Cause Analysis
Hey everyone, it’s Colin Hamilton here again. I’m one of the Linux Support Engineers on the SUSE team. A really common request that comes into our queue is an RCA (Root Cause Analysis) for an unexpected reboot of a server. I’ve noticed it can be frustrating for customers when the root cause takes longer than hoped, or when we aren’t able to provide a root cause for that particular reboot.
I’m going to provide the steps I go through in an RCA in order to help you streamline your root cause analysis and also to see where we sometimes reach the end of our analysis without a root cause. I’m focusing on an unexpected reboot, but these steps are by and large the same for a system hang as well. To better help you know what we’re going to need and want, here are the steps I take whenever I get one of those cases.
Prologue – Supportconfig
Once upon a time there was a SUSE support engineer. This engineer was very sad because he was tasked to give a root cause analysis without a supportconfig. He was unable to give the root cause analysis and died of sadness. The end.
I hope you were charmed by my very intricate story about the sad engineer. The moral of the story is that without a supportconfig you’ll be leaving us to our divination in order to figure out the root cause. I highly recommend that you don’t rely on our divination which is fueled purely by junk food and video games with a very small percentage of accuracy.
For those who may not have heard of the supportconfig utility before, this is a tool we provide on our SLE OS in the supportutils package that pulls server data for us to use in our analysis. It’s as simple as running the command supportconfig. There are flags you can use with that command to have it automatically upload to a service request, if you have one open with us, but the command by itself will still create the tarball we need in your /var/log/ directory.
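As a rough sketch of that workflow, you run supportconfig as root and then pick up the newest archive from /var/log/. The helper name latest_supportconfig below is hypothetical, and the nts_*/scc_* archive prefixes and the .tbz/.txz/.tgz extensions are assumptions about how the archive is named on your system; adjust the pattern if yours differs.

```shell
# Hypothetical helper: print the newest supportconfig archive in a directory.
# Assumption: archives are named nts_* or scc_* and end in .tbz, .txz, or .tgz.
latest_supportconfig() {
    local dir newest f
    dir="${1:-/var/log}"
    newest=""
    for f in "$dir"/nts_* "$dir"/scc_*; do
        [ -f "$f" ] || continue                  # skip unmatched globs and directories
        case "$f" in
            *.tbz|*.txz|*.tgz) ;;                # keep archives only
            *) continue ;;                       # skip checksum files and the like
        esac
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"                          # remember the most recently modified archive
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Typical use, as root:
#   supportconfig
#   latest_supportconfig /var/log
```

The function prints nothing and returns non-zero when no archive is found, which is a quick hint that supportconfig has not been run yet.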
Yes, support engineers have questions that need answering! In other news water is wet and the sky is blue. Here are the questions I’ll likely ask for any reboot RCA:
- What is the approximate date and time this last occurred?
- Did the server start running any services/processes in particular just before or when the reboot took place?
- How long has this issue been going on? Or is this the first time?
- Have you made any recent changes to the system?
The answers to these questions provide the necessary context for us to search more efficiently and effectively throughout the supportconfig for the root cause.
2. Patch Level
This step is the nasty medicine nobody wants to drink. I’ll preface this with a small spoonful of sugar: we will always do what we can to provide a best-effort analysis, even if we’re dealing with unsupported code. The catch is that what we can do is limited when we’re dealing with unsupported code. The more likely it is that the issue is a bug and not a configuration issue, the more likely it is we’ll ask you to patch.
If the server is not on a supported version of SLE, my main recommendation is always going to be to upgrade. This is a necessary prerequisite to our full support. We can only support old code for so long; after that, the solution is to upgrade. Here’s a link you can use to check out the life cycle of our various versions of SLE:
The next thing that I’m going to check is the patch level. Even if the server is on a supported version of SLE, if it’s on a kernel that is a year old we’re going to have the same problem until it’s been patched. The reason for this is that our engineering team can’t continue to debug a kernel we’ve already debugged and patched. The chances are high that the issue is already fixed in a patch.
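If you want to see what we’ll see, here is a minimal sketch that prints the release, the running kernel, and whether patches are pending. The function name patch_level_report is hypothetical, and it assumes a system that ships /etc/os-release and has zypper on the PATH; where either is missing, that part is skipped.

```shell
# Hypothetical helper: summarize release, kernel, and pending patches.
patch_level_report() {
    if [ -r /etc/os-release ]; then
        # PRETTY_NAME holds the human-readable release name where os-release exists
        grep '^PRETTY_NAME=' /etc/os-release || true
    fi
    echo "Running kernel: $(uname -r)"
    if command -v zypper >/dev/null 2>&1; then
        # patch-check exits non-zero when patches are pending, so that is not treated as an error
        zypper --non-interactive patch-check || true
    else
        echo "zypper not found, skipping patch check"
    fi
}

# Typical use, as root:
#   patch_level_report
```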
The messages log (/var/log/messages) is one of the most important files for a root cause analysis of an unexpected reboot. If some process failed, if the kernel locked up, if the server went OOM, if someone ran a reboot command and didn’t tell anyone (yes, I’ve actually seen that, and more than once!), we should be able to see it in this log.
If the reboot was a long time ago, the messages log may have rotated. If that’s the case, the supportconfig utility won’t pull what we need by default, so please make sure to include, with the supportconfig, the rotated messages log that covers that time period. To quickly get to the right place in the messages log, I like to use the search in vim or less to skip right to where the server first booted up.
Here’s the first log a SLE11 machine will show in messages, by default, when booting up:
syslog-ng: syslog-ng starting up; version=”
I recommend searching for this log using “syslog-ng start”.
Here’s the first log a SLE12 machine will show in messages, by default, when booting up:
rsyslogd: [origin software="rsyslogd" swVersion="" x-pid="" x-info="http://www.rsyslog.com"] start
For this I’ve found the best search term to quickly find this log is “rsyslogd.*start”.
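The same two search terms work outside of vim or less as well. Here is a minimal sketch using grep; the function names find_boot_markers and lines_before_boot are hypothetical, and the default of 50 lines of context is an arbitrary choice.

```shell
# Hypothetical helper: list boot markers in a messages log with their line numbers.
find_boot_markers() {
    local log
    log="${1:-/var/log/messages}"
    # First alternative matches SLE11 (syslog-ng), second matches SLE12 (rsyslogd)
    grep -nE 'syslog-ng starting up|rsyslogd.*start' "$log"
}

# Hypothetical helper: show the lines logged just before each boot marker,
# which is where evidence of what led up to the reboot usually sits.
lines_before_boot() {
    local log count
    log="${1:-/var/log/messages}"
    count="${2:-50}"
    grep -E -B "$count" 'syslog-ng starting up|rsyslogd.*start' "$log"
}

# Typical use:
#   find_boot_markers /var/log/messages
#   lines_before_boot /var/log/messages 100 | less
```

If the rotated log you need is compressed, decompress a copy first and point these at the copy.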
The mcelog log is for “machine check exceptions” logged by the CPU. These are errors that can indicate a hardware problem. For a better understanding of this log I’d recommend a run through the mcelog man page. It’s important to note, though, that if this log is empty it doesn’t mean there are no hardware problems; it just means that the CPU didn’t catch any.
SAR data is system data collected throughout the day by a daemon in the background. There’s plenty of information out on the web dedicated to teaching how to analyze and use SAR data to your advantage, so I won’t go into specifics here. Suffice it to say that you can check memory usage, load average, disk I/O, context switches, network utilization, etc., all around the time of the unexpected reboot using this data. We can use it to inform us as to what might have caused the reboot.
I’ve sometimes seen that SAR has not been set up on the server to collect data. On SLE11 you can run through setup like so:
- # zypper install sysstat
- # /etc/init.d/boot.sysstat start
That’s it for SLE11. After that cron jobs are created that run a sadc command which builds and collects the data on a regular basis. On SLE12 you run through the setup this way:
- # zypper install sysstat
- # systemctl enable sysstat
- # systemctl start sysstat
- # systemctl status sysstat
From then on you’ll be collecting that invaluable SAR data. If it wasn’t set up at the time of the reboot it won’t have the data we need, but at least now it’ll be set up in case it happens again.
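Once data is being collected, pulling the numbers around a reboot could look something like the sketch below. The function names are hypothetical, and the /var/log/sa/saDD file layout is an assumption; some sysstat configurations name the daily files saYYYYMMDD instead, so check what is in that directory first.

```shell
# Hypothetical helper: build the daily SAR data file path for a date (YYYY-MM-DD).
# Assumption: data lives in /var/log/sa/saDD, the traditional sysstat layout.
sa_file_for_date() {
    local day dir
    day="${1##*-}"                      # keep only the DD part of YYYY-MM-DD
    dir="${2:-/var/log/sa}"
    printf '%s/sa%s\n' "$dir" "$day"
}

# Hypothetical helper: print memory, load, and context-switch data for a time window.
# Usage: sar_window 2016-03-07 09:30:00 10:05:00
sar_window() {
    local file
    file="$(sa_file_for_date "$1")"
    sar -r -f "$file" -s "$2" -e "$3"   # memory utilization
    sar -q -f "$file" -s "$2" -e "$3"   # run queue length and load average
    sar -w -f "$file" -s "$2" -e "$3"   # task creation and context switches
}
```

Picking a window that ends at the last entry before the reboot keeps the output short enough to read in one pass.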
6. Kernel Core
If the above steps still haven’t been enough to complete a root cause analysis then it’s time to pull out the big guns. In an ideal world kdump will already have been configured on the server, and when the server unexpectedly rebooted a core was created in the /var/crash/ directory with no problems. In my experience though, unless someone has already run into this, kdump usually hasn’t been set up.
If kdump has not been configured, and the previous steps weren’t enough, then a root cause analysis won’t be possible this time around. However, you’ll want to set up kdump so that next time it happens a core is created and we can take the analysis further using that core. Here’s a SUSE TID on setting that up:
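A quick way to tell which world you’re in is to check whether memory is reserved for a crash kernel and whether one is actually loaded. This is only a sketch: the function name kdump_ready is hypothetical, and it looks at /proc/cmdline and /sys/kernel/kexec_crash_loaded rather than at the kdump service itself, so treat a passing result as a good sign and not a guarantee.

```shell
# Hypothetical helper: report whether a crash kernel is reserved and loaded.
# The two arguments default to the real kernel interfaces; they are parameters
# so the check can also be pointed at copies of those files.
kdump_ready() {
    local cmdline loaded
    cmdline="${1:-/proc/cmdline}"
    loaded="${2:-/sys/kernel/kexec_crash_loaded}"
    if ! grep -q 'crashkernel=' "$cmdline" 2>/dev/null; then
        echo "no crashkernel= reservation on the kernel command line"
        return 1
    fi
    # The kernel writes 1 to this file once a crash kernel has been loaded
    if [ "$(cat "$loaded" 2>/dev/null)" != "1" ]; then
        echo "crashkernel reserved, but no crash kernel is loaded"
        return 1
    fi
    echo "kdump looks ready"
}

# Typical use:
#   kdump_ready
#   ls -l /var/crash/
```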
7. Calling on the Devs
Alright, so let’s say we finally got that core. It may be that the issue is in a complex area of code or just that us lowly support engineers aren’t able to “crack the code”. (Embrace the punny.) Now it’s time to get our developers involved. This step is only possible with supported machines. Going back to that medicine in step 2, the server will have to be on a fully patched supported SLE OS for us to take this to the developers. From there the issue can be patched if it is an unresolved bug.
As always, I hope I’ve changed all of your lives forever and that this paradigm shift in your lives will mean less work for me. 😉