Troubleshooting Calamari
SUSE Enterprise Storage versions 1 through 4 include the Calamari web GUI for monitoring and some basic management tasks. As mentioned in the SUSE Enterprise Storage 4 release notes, Calamari is being deprecated in favor of openATTIC, but right now people are still using Calamari, and every so often they run into kinks getting it talking to their clusters. If that sounds like you, this post should help.
(Strictly speaking Calamari is the name of the backend REST API, and the GUI you see is named Romana, but everyone still calls the whole thing Calamari, so that’s what I’ve called it here. But I digress…)
I’ll assume you’ve got a Ceph cluster running and have installed Calamari, either with ceph-deploy as outlined in the docs, or, if you’re using DeepSea, by running `zypper in romana ; calamari-ctl initialize` on your Salt master node.
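Spelled out, the DeepSea variant is just the following two steps (and, if memory serves, `calamari-ctl initialize` will prompt you to create an admin user):

```sh
# On the Salt master:
zypper in romana         # the Romana GUI (front end for Calamari)
calamari-ctl initialize  # set up Calamari; prompts for admin credentials
```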
(Another digression: if you’re not deploying with DeepSea yet, please give it a try, as ceph-deploy is also being deprecated.)
The most common Calamari problems are:
- You’re stuck on the “New Calamari Installation” screen.
- Calamari sees the cluster, but the status isn’t updating.
- The graphs are broken.
First, if the cluster status isn’t updating in Calamari (or hasn’t updated for more than about a minute), immediately check that the cluster really is up (ssh to one of the Ceph nodes and run `ceph status`). If the cluster isn’t quorate (i.e. if half or more of the MONs are down), Calamari won’t have any idea what’s going on. Also, your users will probably be screaming at you, because the cluster is broken.
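For example (the hostname here is a stand-in for one of your own nodes):

```sh
# From any node that has an admin keyring:
ssh ses4-1.example.com 'ceph status'

# If the MONs are suspect, check who's actually in quorum:
ssh ses4-1.example.com 'ceph quorum_status --format json-pretty'
```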
Assuming the cluster is running, the first things to check are:
- Make sure host name resolution works properly. For example, from each node, you should be able to ping each other node by name (see the sketch after this list). There should be no weird junk in the `/etc/hosts` files. All nodes should have FQDNs, not short names (although short names will work, just don’t mix FQDNs and short names).
- If host name resolution was broken and you’ve since fixed it, make sure the salt minion IDs actually match the system hostnames (check `/etc/salt/minion_id`, and if necessary delete that file, restart the salt minion and re-accept its key on the salt master).
- Make sure you’ve got the same version of salt installed on all nodes (this can get out of sync if you happen to have applied the latest updates to, say, the salt master, but not any of the other nodes).
- Make sure the salt master and minions are all actually running (run `systemctl status salt-master` on the salt master, and `systemctl status salt-minion` on the minions) and the master can talk to the minions (run `salt '*' test.ping` on the salt master).
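Strung together, those checks look something like this (a minimal sketch – the host and minion names are examples, so substitute your own):

```sh
# On each node: can every other node be reached by name?
for host in ses4-1.example.com ses4-2.example.com ses4-3.example.com; do
    ping -c 1 "$host"
done

# On a node whose hostname has changed: reset the minion ID.
cat /etc/salt/minion_id         # should match `hostname -f`
rm /etc/salt/minion_id
systemctl restart salt-minion

# On the salt master: drop the stale key and accept the new one.
salt-key -d ses4-2              # old minion ID
salt-key -a ses4-2.example.com  # new minion ID

# Everywhere: same salt version? Services running? Minions reachable?
rpm -q salt salt-minion
systemctl status salt-master    # on the master
systemctl status salt-minion    # on each minion
salt '*' test.ping              # on the master
```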
If you’ve verified all the above and the graphs still aren’t working, there’s a small chance that the diamond metrics collector wasn’t installed automatically when Calamari was first deployed. To fix this (and also to fix any other possible odd salt state synchronization issues), run `salt '*' state.highstate` on the master.
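For example (the `diamond` service name below is my assumption based on the package name – adjust if yours differs):

```sh
# Re-sync and apply all salt states; this (re)installs diamond if it's missing:
salt '*' state.highstate

# Then check the collector is actually running everywhere
# (assumes the systemd unit is named "diamond"):
salt '*' service.status diamond
```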
If you’re stuck on the “New Calamari Installation” screen, or Calamari sees the cluster but the status isn’t updating, make sure that all MON nodes have the Ceph client admin keyring file (`/etc/ceph/ceph.client.admin.keyring`). Without this, the cluster heartbeats internal to Calamari won’t work properly. Also, make sure no custom salt configuration has been added that might have stomped on what’s shipped with Calamari (try `rpm -q --verify calamari-server` and see if any of the `/srv/salt/...` files have been changed).
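Concretely (the minion targeting here is an example – use whatever pattern matches your MON nodes):

```sh
# On the salt master: do all MON nodes have the admin keyring?
salt 'ses4-*' cmd.run 'ls -l /etc/ceph/ceph.client.admin.keyring'

# Has anything shipped by calamari-server been changed?
# No output mentioning /srv/salt means the states are pristine.
rpm -q --verify calamari-server | grep /srv/salt
```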
If in doubt, from the salt master, try running `salt '*' ceph.heartbeat` – if all the right pieces are in place, this should print “None” for each node. Next, try running `salt '*' ceph.get_heartbeats --out=json`. This should print a large blob of status information. There’ll be one JSON object per node; for example, here’s an object representing a MON node that also happens to have one OSD (I’ve shortened this slightly for clarity):
{ "ses4-2.example.com": [ { "services": { "ceph-osd.1": { "status": null, "cluster": "ceph", "version": "10.2.4-211-g12b091b", "type": "osd", "id": "1", "fsid": "9b259825-0af1-36a9-863a-e058e4b0706b" }, "ceph-mon.ses4-2": { "status": { "election_epoch": 80, "name": "ses4-2", "monmap": { "modified": "2016-11-01 23:16:37.042293", "epoch": 1, "mons": [ # ...more stuff here... ], "fsid": "9b259825-0af1-36a9-863a-e058e4b0706b", "created": "2016-11-01 23:16:37.042293" }, }, # ...more stuff here... } }, "boot_time": 1490166740, "ceph_version": "10.2.4+git.1481215985.12b091b-1.1" }, { "9b259825-0af1-36a9-863a-e058e4b0706b": { "name": "ceph", "fsid": "9b259825-0af1-36a9-863a-e058e4b0706b", "versions": { "osd_map": 137, "pg_summary": "6c934c46daa20e77545f408b70021d60", "mds_map": 1, "mon_status": 80, "health": "3015cd769b41ee2b55c678d2e38dfed7", "mon_map": 1, "config": "bf1e66c0d3df8d48024fb019c944eef2" } } } ] }
The last section – the part that starts with `"9b259825-0af1-36a9-863a-e058e4b0706b": {` – is the piece that will be missing if the Ceph client admin keyring is not present. In that case, you’d just see:
```
                    # ...more stuff here...
                }
            },
            "boot_time": 1490166740,
            "ceph_version": "10.2.4+git.1481215985.12b091b-1.1"
        },
        {}  # See? It's missing! (except this helpful
            # comment won't be present in real life)
    ]
}
```
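With many nodes, eyeballing that blob gets tedious. Assuming you have `jq` handy, something like this strips each node’s heartbeat down to just the second element (the cluster status object), so any node showing a bare `{}` is missing the keyring:

```sh
# Per node, keep only the cluster-status half of each heartbeat;
# nodes that print {} are the ones without the admin keyring.
salt '*' ceph.get_heartbeats --out=json | jq 'map_values(.[1])'
```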
That’s it for the common problems. If you’ve run into any of these and the above instructions didn’t fix it (or if you’ve hit an uncommon problem ;-)), let me know in the comments and I’ll try to help.