Case files of a TSE: CIFS mounts
October 8, 2021 | By: Anthony Stalker
Problems mounting a CIFS filesystem
This is part of a series that attempts to showcase the kind of work that SUSE Support does and how we help customers resolve issues when running SUSE Products. The cases that are selected will be based on real cases. However, all details will be fully anonymized and stripped of identifying marks.
This case came in on my first day at SUSE and I took it soon after joining. When I took this case, I didn’t know much about CIFS and SAMBA. During my time with SUSE I’ve become somewhat of a minor expert; at that point, however, I had about the knowledge of any sysadmin who had worked in a mixed environment.
Before beginning work at SUSE, I had been a systems engineer in Linux operations at a major logistics company, doing all sorts of system administration tasks. What attracted me to SUSE was solving tough problems with no obvious answer, and I got one right away. This might be a long story, but I promise it has a happy ending.
Defining the problem
The customer’s description of their environment was superb. It was detailed, but the customer included only relevant details, and it was clear that they had monitored the issue and done extensive troubleshooting. I was immediately quite impressed with their knowledge and professionalism.
This is a textbook perfect description, and this greatly helped us find a quick resolution. They had already sent a supportconfig file [a collection of system configurations and logs] as well.
It really helps us give quicker and better answers when a customer describes the issue up front and asks good questions.
This was their description:
We have two systems with SLES12 SP5 – one for production and one for testing. Both systems are smbclients and have a configured CIFS-fileshare. The patchlevel of my two SLES12 systems is almost identical, as is the line with the mount definition in /etc/fstab. The fileserver is a Nutanix cluster (Version 184.108.40.206 LTS). Our systems are integrated in our active directory via kerberos, winbind and samba. The share permission is based on a computer account (kerberos). The problem: On the production system the CIFS share randomly drops out after an undefined time span (never at the same time or day). On the test system it seems to be stable. If I run mount -a -vvv on the system with problems, the command will hang and never come back until I press CTRL+C. It is configured as DFS [Distributed File System] – with nslookup I will get 8 network addresses for the cifs-server. On the nutanix side (fileserver side) there are no logs. At the moment I don’t know where to look.
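For context, an fstab entry for a Kerberos-authenticated CIFS/DFS mount of this kind might look roughly like the following. This is a sketch only – the share path and mount point are hypothetical, with the options mirroring what the customer describes:

```
//somecifshare.subdomain.domain.tld/share  /mnt/share  cifs  sec=krb5,vers=3.0,_netdev  0  0
```

Here sec=krb5 requests Kerberos authentication and vers= pins the SMB dialect – the very parameter the customer later experimented with.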
As I analyzed the problem, the customer continuously updated us with some troubleshooting they were doing:
I have found out that the share contains many, many files and folders (thousands up to millions). Maybe this could be useful information for support.
This morning the share was unmounted again. I tried something with the mount parameters in fstab and switched back to ver=2.0. The parameter ver=3.0 has worked before – also currently on the test system.
and they managed to find a potentially useful log item (all sensitive information anonymized in this article, of course):
2020-08-05T20:12:05.173167+01:00 geekohost kernel: [369820.765647] CIFS VFS: \\somecifshare.subdomain.domain.tld cifs_reconnect: no target servers for DFS failover <-- can't fail over to a different share
2020-08-05T20:12:05.173186+01:00 geekohost kernel: [369820.766661] BUG: unable to handle kernel paging request at fffffffffffffff8 <-- unhandled exception
2020-08-05T20:12:05.173187+01:00 geekohost kernel: [369820.767167] IP: cifs_reconnect+0x57e/0xdd0 [cifs] <-- information from the stack
2020-08-05T20:12:05.173188+01:00 geekohost kernel: [369820.767615] PGD 25700e067 P4D 25700e067 PUD 257010067 PMD 0
2020-08-05T20:12:05.173188+01:00 geekohost kernel: [369820.768051] Oops: 0000 [#3] SMP PTI

Practice point: To many people this is a giant bowl of alphabet soup, but to a Linux systems engineer this is very exciting, because it is the kernel itself talking to us through the kernel ring buffer about the cifs kernel module. And it’s telling us that something is not right. It simply can’t reconnect, and it “oopses” [believe it or not, this is actually the technical term]. But it’s not giving us the full picture. It’s not telling us why it can’t reconnect.
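On a live system, messages like these can be pulled straight from the ring buffer with dmesg or journalctl -k. A minimal sketch – here run against a saved copy of the log so it is self-contained, using the anonymized lines from above:

```shell
#!/bin/sh
# On a live system you would pipe `dmesg` or `journalctl -k` into grep;
# here we filter a saved copy of the kernel log for CIFS-related lines.
cat > /tmp/kernel.log <<'EOF'
[369820.765647] CIFS VFS: \\somecifshare.subdomain.domain.tld cifs_reconnect: no target servers for DFS failover
[369820.766661] BUG: unable to handle kernel paging request at fffffffffffffff8
[369820.767167] IP: cifs_reconnect+0x57e/0xdd0 [cifs]
EOF
# Case-insensitive match catches both "CIFS VFS" and the [cifs] module tag.
grep -i 'cifs' /tmp/kernel.log
```

This prints only the CIFS lines; note that the “BUG: unable to handle kernel paging request” line itself doesn’t mention cifs, which is why reading the surrounding context in the ring buffer matters too.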
The customer has provided us an update:
My problem has expanded. On the test system I now have the same kernel, and the cifs network share can no longer be mounted on this test system either. On the file server side (Nutanix) we have upgraded to the latest stable version (3.6.5).
Well, this definitely looks like a problem with the kernel.
It’s my first week on the job and I already have to deal with kernel debugging! I must say that I wasn’t expecting that. Fortunately, TSEs don’t face problems like this alone. Upon seeing this message and hearing that the problem seemed to be isolated to a kernel version, I immediately sought the collaboration of a senior engineer. I brought it up at the weekly technical team meeting, where we discuss tough cases. If this was a bug affecting customers, we needed all hands on deck. I had already eliminated most of the low-hanging fruit by troubleshooting the mount options and collecting information.
When we run into a problem that we can’t solve at L1, we can rely on the support of a backline engineer. These engineers have a wealth of experience and insight and are often subject specialists in one or several areas. They do more advanced things to troubleshoot, like kernel crash dump analyses, reading stack traces, etc. We actually successfully resolve over 80% of cases at L1, but there are cases where we are out of our depth, and this happened to be one of them. A big part of being a good Technical Support Engineer is identifying which cases are genuinely difficult and need to be moved up, and which just seem difficult at first glance but can be solved relatively simply.
Every so often, it can take a while for Backline Engineers to respond. Their knowledge and experience are in extreme demand and they often have to dive deep into a problem. While the Backline Engineer was reviewing the debug logs, I looked in depth into a seemingly unrelated configuration file.
Usually in a case like this, the /etc/samba/smb.conf file is irrelevant. The cifs kernel module handles the mount. The command-line parameters or the contents of /etc/fstab define the mount options. /etc/samba/smb.conf doesn’t even come into it. But in this case, I thought that it might. Why?
Remember how the customer described the authentication for these shares? They used winbind and Kerberos authentication to a KDC [Key Distribution Center, the Kerberos service that issues tickets] for authentication to the shares. Now that’s interesting, because winbind takes its configuration from /etc/samba/smb.conf.
I spent hours carefully reviewing the samba and winbind documentation and the available messages.
I noticed a potentially important difference between the production setup and the test setup in the samba configuration. The production system had these additional lines in /etc/samba/smb.conf versus the test system:
wins server = 10.10.10.10
name resolve order = lmhosts host wins bcast
and since we’re using ADS for security, the following configuration would be advised by the documentation:
name resolve order = wins bcast
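Put together, the relevant part of a corrected [global] section might look something like this – a sketch only, with the anonymized WINS address from above and the rest of the customer’s configuration omitted:

```
[global]
        security = ads
        wins server = 10.10.10.10
        name resolve order = wins bcast
```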
It was quite possible that winbind couldn’t find the domains for the other file servers from WINS, and/or that it was an AD issue. Here is a basic jumping-off point for troubleshooting those:

https://documentation.suse.com/sles/12-SP5/html/SLES-all/cha-security-ad.html#sec-security-ad-config

I also noticed that the two servers have different /etc/nsswitch.conf files:

PROD:
hosts: files wins dns

TEST:
hosts: files wins dns mdns4_minimal [NOTFOUND=return] dns myhostname
Here are some general samba docs:
It’s good practice to run the testparm utility on changed smb.conf files before deploying them into production, to catch configuration file inconsistencies early.
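As a sketch of that practice: write the candidate config to a temporary file and run testparm against it before it ever reaches production. testparm ships with samba, so the check below skips gracefully when it is not installed:

```shell
#!/bin/sh
# A throwaway smb.conf containing the misspelled value from this case.
cat > /tmp/smb-candidate.conf <<'EOF'
[global]
        security = ads
        name resolve order = lmhosts host wins bcast
EOF
# Validate it; testparm warns about the invalid 'hosts' list entry.
if command -v testparm >/dev/null 2>&1; then
        testparm -s /tmp/smb-candidate.conf
else
        echo "testparm not installed; skipping validation"
fi
```

Had this been run before deployment, the same warning that later appeared in log.smbd would have shown up immediately.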
The Backline engineer got back to me, saying that the key missing piece was a kernel dump, which would be necessary to get the full stack trace of the bug. So I sent the customer instructions on how to configure kdump.
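For reference, kdump setup on SLES boils down to reserving memory for the capture kernel and enabling the service. The pieces involved look roughly like this (the reservation size is illustrative – YaST can calculate a suitable value):

```
# Kernel command line (e.g. GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub):
#   crashkernel=256M
#
# After installing the kexec-tools and kdump packages:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
#   systemctl enable --now kdump
#
# Dumps land under /var/crash by default (KDUMP_SAVEDIR in /etc/sysconfig/kdump).
```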
The backline engineer also noticed this in /var/log/samba/log.smbd:

[2020/07/30 18:40:07.349935, 0] ../../lib/param/loadparm.c:1751(handle_name_resolve_order)
handle_name_resolve_order: WARNING: Ignoring invalid list value 'hosts' for parameter 'name resolve order'

To me, this was the “Eureka” moment. Now I was more confident in the solution I had provided. This was the smoking gun (or so I thought)! It would confirm my original suspicion that the samba config line:
name resolve order = lmhosts host wins bcast

isn’t being parsed sanely, which means that failovers to different DFS hosts could fail because the client can’t find where to authenticate the next host, as recorded in those kernel ring buffer messages [the alphabet soup]. The message says that it is ignoring the list value, though, so it’s just a warning; it’s not actually fatal. Programs should have ways to handle exceptions and be able to tolerate some types of configuration mistakes.
The real fix
The customer implemented the recommendations and the issue seemed to stop. They appeared satisfied with the solution and asked me to close the case, as they were going on a three-week vacation. Of course, they were free to open a new case if they encountered the issue again.
It turns out that the solution that seemed to work wasn’t the real answer, and behind the scenes the story didn’t end there. That kdump [kernel dump] was very useful. The backline engineer analyzed it, of course, and it really did turn out to be a bug – identifying number 1172428. You can even see the bug ID referenced in this update announcement:
By the time the customer had gotten back from vacation and opened a new case, the bug had been replicated, identified and fixed. All that was necessary was for the colleague acting as their backup to update to the latest maintenance patches. We monitored the solution, and it resolved the issue for good.
The things I learned from this case
- Don’t take ‘yes’ for an answer. Many people’s instinct would be to think: it looks like it works, the customer is happy with the solution, so we can wash our hands of the whole deal. A SUSE Technical Support Engineer’s instinct is often to keep digging, because we have to think about all of our customers at once. We want to be extra sure that the product is actually working as intended. We have a responsibility to the product’s quality and to reporting its defects.
- Only very few of the cases we get end up in a bug report. Most often there is an issue with the configuration, the customer needs help interpreting some message, or the customer needs a specific technical question answered. Not everything is a bug. An even smaller number of bug reports turn out to be valid bugs, but small bugs hide big ones, so our bugfixers don’t treat reports lightly. This just happened to be a valid bug, which we caught relatively soon.
- Debugging the kernel is hard, but definitely not impossible, and there are many tools to help narrow an issue down. If you’re interested in learning more, here are references to some documentation:
man 7 kdump
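For a taste of what the backline engineer does with such a dump: the crash utility loads the captured vmcore together with a matching debuginfo kernel, after which the stack trace of the oops can be inspected. A sketch of a typical session – the paths and kernel version are illustrative and depend on the installed kernel:

```
# Requires the matching kernel-default-debuginfo package to be installed.
#   crash /usr/lib/debug/boot/vmlinux-4.12.14-122.x-default.debug \
#         /var/crash/2020-08-05-20:12/vmcore
#
# Inside crash: `bt` shows the backtrace, `log` the ring buffer at crash time.
```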
- Unfortunately, debugging CIFS/SMB is unnecessarily difficult, since it is in essence based on a proprietary protocol. SUSE champions open source software and standard protocols so that any machine can talk to any machine freely. Compatibility with the SMB family of dialects is the result of the tireless work of many very skilled software engineers (including some contributing engineers from SUSE) who work as part of the upstream SAMBA Project www.samba.org
I hope that this deep dive was of interest to you. I promise that next time we’ll cover something a little bit less arcane.
Have you ever provided a solution that a customer found satisfactory, but kept working on the problem because you believed it needed a better fix? Have you ever used debugging tools to analyze the kernel? Share your experiences below…