The following article has been contributed by Matt Fleming, Linux kernel EFI co-maintainer and performance engineer at SUSE.
I’ve been on the kernel performance team at SUSE for a little over a year now, and in that time I’ve had the opportunity to work on a whole bunch of cool and challenging tasks, from analysing Linux kernel performance to sending patches upstream for the scheduler.
But last week I got the chance to do something I’ve never done before: investigate a performance issue on the machines powering the openSUSE Build Service.
Initially, I was CC'd on a bug report which said that the virtual machines running the backend services for OBS were not being properly load balanced on the hypervisor. Guests were seeing up to 75% steal time even though around half of the hypervisor's CPUs were idle, as seen in the output of
top(1) on the hypervisor:
Usually steal time only becomes an issue if the hypervisor is running near maximum utilisation, i.e. all hypervisor CPUs are busy. Clearly this wasn’t the case here.
A key clue in the bug report was that the issue only appeared when automatic NUMA balancing was enabled. The ops team had disabled NUMA balancing as a workaround to restore performance to OBS.
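For reference, automatic NUMA balancing can be toggled at runtime through a sysctl, so the ops team's workaround would have looked something like this (a sketch; run as root):

```shell
# Check whether automatic NUMA balancing is enabled (1) or disabled (0)
cat /proc/sys/kernel/numa_balancing

# Disable it at runtime -- the workaround applied to restore OBS performance
echo 0 > /proc/sys/kernel/numa_balancing

# Equivalent sysctl invocation
sysctl -w kernel.numa_balancing=0
```

The setting is not persistent across reboots unless it is also added to sysctl.conf.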
NUMA balancing mainly involves the kernel memory management subsystem, but it also interacts with the kernel scheduler, which controls the load of the entire system. So straight away I had two areas to investigate. Having read the entire bug report, I mailed my SSH key to the ops team lead and got access to the OBS virtual machines and hypervisor. Since I had been granted a window of time within which I could poke around, I wanted to take advantage of it immediately and avoid any disruption to OBS, so I logged into the hypervisor.
There’s nothing like the pressure of working on a live, production system to make you think twice about every keystroke you make.
I verified that NUMA balancing had indeed been turned off by the ops team (it had), then logged into one of the problematic guests to check the general state of the system. To my surprise, steal time was fluctuating a lot, peaking at around 25% on some of the guest's CPUs, even though NUMA balancing was disabled on the hypervisor!
Turning NUMA balancing on did show a stable 75% steal time in the guest, so things were definitely worse with it enabled.
But it was starting to look like NUMA balancing wasn't the culprit after all, and that it only exacerbated an existing, underlying issue. Having your guest wait for the CPU 75% of the time is painful and sets all kinds of alarm bells ringing, but it's quite easy to miss if you're only seeing random periods of 25% steal time interspersed with periods when everything is fine.
I went back to the hypervisor to try and find a reason that the guest was seeing scheduler latency. Lots of performance work is spent correlating two metrics, and in this case I wanted to correlate an increase in steal time in the guest with an increase in some other metric on the hypervisor.
My go-to tool for these initial steps is always
mpstat, since it gives a fantastic high level overview of the health of the system, and it’s easy to pipe the output to a log file for later analysis. I ran
mpstat in one window and simultaneously ran
top on the guest in another alongside it. Here’s what I saw on the hypervisor:
And on the guest:
Those entries marked in red show the correlation of time spent handling softirqs on the hypervisor and spikes in steal time in the guest. Even at this early stage, I had a good idea what the problem was – the hypervisor was servicing softirqs and causing the guest to wait for the CPU to become free.
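The %soft column in mpstat's per-CPU output is what carries this signal. As a rough sketch of how you might pull it out of a captured log for correlation (the sample line and its numbers below are hypothetical, but the column layout matches mpstat's standard output):

```shell
# Hypothetical sample line from `mpstat -P ALL 1`; columns are:
# time CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
line="14:32:01  all  2.1  0.0  1.3  0.0  0.0  9.8  0.0  0.0  0.0  86.8"

# Extract the %soft column (field 8) -- this is the value to line up,
# timestamp by timestamp, against steal-time samples from the guest
soft=$(echo "$line" | awk '{print $8}')
echo "softirq time: ${soft}%"
```

In practice I simply ran `mpstat` into a log file on the hypervisor and eyeballed it next to `top` output from the guest, but the same awk trick works for scripting the comparison over longer captures.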
But I needed to figure out exactly which softirq was being serviced so much. Luckily,
mpstat also provides this data via the
-I SCPU argument:
Again, I used the side-by-side window trick and saw that spikes in steal time seemed to coincide with an increase in the NET_RX/s softirq rate. The red item in the above figure shows the softirq rate hitting 24.5k/s, which seemed like a seriously large number compared with the average rate over a 5-minute period.
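Quantifying "seriously large" is just a peak-versus-mean comparison over the capture window. A minimal sketch, using made-up NET_RX/s samples (the 24.5k spike is from the investigation; the baseline figures here are invented for illustration):

```shell
# Hypothetical NET_RX/s samples from a `mpstat -I SCPU 1` capture
samples="3200 2900 24500 3100 3000"

# Compare the peak against the mean to quantify the spike
echo "$samples" | awk '{
    max = 0; sum = 0
    for (i = 1; i <= NF; i++) { sum += $i; if ($i > max) max = $i }
    printf "peak %.0f/s vs mean %.0f/s\n", max, sum / NF
}'
```

A peak several times the mean, as here, is the kind of outlier worth chasing; a peak close to the mean would just be steady load.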
Unfortunately, the softirq NET_RX/s value aggregates across all network interfaces, and I needed to see if a single interface was to blame for the rate spikes. You can get the individual packets per second rate of received and transmitted packets for all network interfaces using the
sar -n DEV 1 command. I ran it, but it didn’t show any one interface receiving more traffic than the rest – there was no smoking gun.
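Checking for a single hot interface amounts to sorting sar's per-interface receive rates and looking for an outlier. A sketch with fabricated numbers (interface names and rates are hypothetical; the real output showed all interfaces at comparable rates, which is exactly the "no smoking gun" result):

```shell
# Hypothetical per-interface snippet distilled from `sar -n DEV 1`
# output; columns: IFACE rxpck/s txpck/s
cat > sar-dev.sample <<'EOF'
eth0 1200.00 800.00
eth1 1150.00 790.00
fcoe0 1180.00 810.00
EOF

# Sort interfaces by receive packet rate; an outlier would float to the top
sort -k2 -rn sar-dev.sample | head -1
```

When the top entry is only marginally above the rest, as in this sample, no single interface stands out and the aggregate spike must be coming from somewhere the per-interface counters don't see.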
At this point, I didn’t think I could dig any further without speaking to the ops staff since I didn’t know the layout of the network. Luckily, SUSE runs an internal IRC server where almost everyone hangs out during the work day.
It turned out that the guest’s virtual disks were serviced by a Fibre Channel Over Ethernet link which was known to make up the majority of the network traffic. Unfortunately, verifying this would be next to impossible because all packet manipulation is done in hardware and no software introspection is provided.
Having reached the limit of my FCoE knowledge, I asked the SUSE Storage and Networking team lead to join the ops IRC channel, where we explained the situation. He replied with this seemingly simple but incredibly important statement:
“We have seen some cross-speak between FCoE and Ethernet on those cards, though; disabling the ethernet port cause the FCoE traffic to go down, too; but that should’ve been fixed with later firmware releases.”
Sure enough, after upgrading the firmware on the FCoE cards the softirq rate spikes disappeared, and enabling auto NUMA balancing no longer caused high steal time in the guests.
Bug report closed, high fives all around!
One thing that dawned on me while writing this was that no single person was responsible for solving the problem. Everyone involved provided some crucial piece of information that allowed the investigation to move forward. When you’re debugging live platforms in production that people rely on, you need to be sure that your teams can find each other, communicate effectively, and work together to fix things.
At SUSE, I get to team up with domain experts to work on problems that matter. I think it’s a fantastic place to work. If it sounds that way to you too, why not come and join us?