salt-run state.orch ceph.stage.3 fails with: "Module function osd.deploy threw an exception. Exception: Missing device /dev/sdm in the Salt mine for cephdisks.list. Try updating the mine with salt \* mine.update" when attempting to deploy or redeploy OSDs
This document (000019626) is provided subject to the disclaimer at the end of this document.
Environment
Situation
salt-run state.orch ceph.stage.3 fails with the following error:
"Module function osd.deploy threw an exception. Exception: Missing device /dev/sdm in the Salt mine for cephdisks.list. Try updating the mine with salt \* mine.update" when attempting to deploy/redeploy osd's. Symptom: Name: sysctl - Function: salt.state - Result: Changed Started: - 18:26:40.221263 Duration: 679.899 ms ---------- ID: storage Function: salt.state Result: False Comment: Run failed on minions: osd06-cephstorage1-gsc.webair.com Failures: osd06.ceph.example.com: Name: /var/lib/ceph/bootstrap-osd/ceph.keyring - Function: file.managed - Result: Clean Started: - 18:26:41.463884 Duration: 18.313 ms Name: /etc/ceph/ceph.client.storage.keyring - Function: file.managed - Result: Clean Started: - 18:26:41.482327 Duration: 6.693 ms ---------- ID: deploy OSDs Function: module.run Name: osd.deploy Result: False Comment: Module function osd.deploy threw an exception. Exception: Missing device /dev/sdm in the Salt mine for cephdisks.list. Try updating the mine with salt \* mine.update Started: 18:26:41.489492 Duration: 1704.425 ms Changes: Summary for osd06.ceph.example.com ------------ Succeeded: 2 Failed: 1 ------------ Total states run: 3 Total run time: 1.729 s Started: 18:26:40.901326 Duration: 12229.855 ms Changes: osd03.ceph.example.com:
Resolution
Compare the output between a direct call and a mine call:
From the minion run:
salt-call cephdisks.list
From the Admin node run:
salt-call mine.get '*ses3*' cephdisks.list
(where ses3 is some part of the minion name)
The two outputs should be identical. Look for the device in question (/dev/sdm as per the message above from stage.3) and check whether it is missing from the mine.get output. If so, restart the salt-minion and try again.
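For example, assuming the affected minion is osd06 and the missing device is /dev/sdm (both taken from the error output above), a quick check is to grep for the device in each output.
On the minion:
salt-call cephdisks.list | grep sdm
On the Admin node:
salt-call mine.get 'osd06*' cephdisks.list | grep sdm
If the first command returns the device but the second does not, the mine data for that minion is stale.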
On the OSD node in question (osd06):
Restart the minion.
systemctl restart salt-minion
Then on the Admin node:
salt \* mine.update
salt-run state.orch ceph.stage.3
If the issue is not resolved after running stage.3 a second time:
On the OSD node in question (osd06):
Reboot the host.
reboot
Make sure the OSD node comes back up and the OSDs rejoin the cluster.
Then on the Admin node:
Validate that the OSDs rejoined the cluster:
ceph -s
ceph osd tree
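A quick way to spot OSDs that did not rejoin is to filter the tree for entries reported as down (an empty result means no OSDs are down; the grep filter is only an illustration):
ceph osd tree | grep down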
The reboot should have addressed the following:
Make sure there are no orphaned "ceph" mount points:
mount | grep ceph
Compare the output to "ceph osd tree".
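For example, compare each /var/lib/ceph/osd/ceph-<ID> mount point against the corresponding osd.<ID> entry in the tree (the OSD ID 12 below is only an illustration):
mount | grep ceph-12
ceph osd tree | grep osd.12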
Normally the mines are updated as part of stage.0. Update the mines by running:
salt \* mine.update
or
salt 'osd06*' mine.update
Run stage.1:
salt-run state.orch ceph.stage.1
Run stage.2:
salt-run state.orch ceph.stage.2
After stage.2 runs, DeepSea will display the devices that it intends to deploy.
Validate here that the correct number of devices are displayed.
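One way to cross-check the count is to list the whole disks on the OSD node and compare against what stage.2 reports (run on the minion; the -d option limits lsblk to whole disks):
lsblk -d -o NAME,SIZE,TYPE | grep disk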
If an incorrect number of devices is displayed, update the grains to the current devices. Either of the following commands will work:
salt 'MinionName*' state.apply ceph.osd.grains
or
salt 'MinionName*' osd.retain
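To confirm the grains were refreshed, the minion's grains can be inspected with grains.get (this assumes DeepSea stores the OSD information under a grain named "ceph", which may vary by version):
salt 'MinionName*' grains.get ceph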
Run stage.2:
salt-run state.orch ceph.stage.2
Again, validate that the correct number of devices are displayed.
If the correct number of devices are displayed, run:
salt-run state.orch ceph.stage.3
"ceph -s" should show osd count increment at osd's are deployed, until all osd's are deployed. "ceph osd tree" will do the same.
The cluster will start rebalancing when OSDs are added to the cluster, unless "osd crush initial weight = 0" has been added to the ceph.conf prior to deploying the OSDs.
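A minimal ceph.conf fragment for this (placing the option in the [osd] section is one common choice; the new OSDs can later be brought into service gradually with "ceph osd crush reweight"):
[osd]
osd crush initial weight = 0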
"ceph osd df tree" will show pg's being added to the new osd's.
Stage.3 will run for a period of time after all of the OSDs have been deployed. This is normal.
Cause
Additional Information
Make sure the intended devices that will be deployed as OSDs have been wiped.
See section 4.3 step 12 in the deployment guide: https://documentation.suse.com/ses/5.5/single-html/ses-deployment/index.html#ceph-install-stack
It is the administrator's responsibility to wipe the correct devices. Wiping the wrong devices can lead to data corruption on the cluster.
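As an illustration only (the deployment guide linked below describes the supported procedure; /dev/sdX is a placeholder and must be verified before running, since these commands destroy all data on the device):
wipefs --all /dev/sdX
sgdisk --zap-all /dev/sdX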
Tools like "lsblk" should display the devices with out partitions, If "lsblk" display the devices with partitions after wiping the devices, then a reboot is recommended.
"ceph-disk list" is also a good tool to validate correct devices.
The following command can help validate that a device has only a GPT partition table and no partitions:
parted -s /dev/devXX unit GB print free
Check the OSD node's .yml file to ensure the correct devices are configured:
Example:
cd /srv/pillar/ceph/proposals/profile-default/stack/default/ceph/minions/
Note: "profile-default" is the default path. Check the policy.cfg for correct path.
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000019626
- Creation Date: 14-May-2020
- Modified Date: 23-Oct-2020
- SUSE Enterprise Storage
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com