salt-run state.orch ceph.stage.3 fails with: "Module function osd.deploy threw an exception. Exception: Missing device /dev/sdm in the Salt mine for cephdisks.list. Try updating the mine with salt \* mine.update" when attempting to deploy/redeploy OSDs

This document (000019626) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 5

Situation

Customer is attempting to deploy or redeploy OSDs on a SES 5.5 cluster.


salt-run state.orch ceph.stage.3 fails with the following error when attempting to deploy/redeploy OSDs:
"Module function osd.deploy threw an exception. Exception: Missing device /dev/sdm in the Salt mine for cephdisks.list. Try updating the mine with salt \* mine.update"

Symptom:
Name: sysctl - Function: salt.state - Result: Changed Started: - 18:26:40.221263 Duration: 679.899 ms
----------
          ID: storage
    Function: salt.state
      Result: False
     Comment: Run failed on minions: osd06.ceph.example.com
              Failures:
                  osd06.ceph.example.com:
                    Name: /var/lib/ceph/bootstrap-osd/ceph.keyring - Function: file.managed - Result: Clean Started: - 18:26:41.463884 Duration: 18.313 ms
                    Name: /etc/ceph/ceph.client.storage.keyring - Function: file.managed - Result: Clean Started: - 18:26:41.482327 Duration: 6.693 ms
                  ----------
                            ID: deploy OSDs
                      Function: module.run
                          Name: osd.deploy
                        Result: False
                       Comment: Module function osd.deploy threw an exception. Exception: Missing device /dev/sdm in the Salt mine for cephdisks.list.             Try updating the mine with salt \* mine.update
                       Started: 18:26:41.489492
                      Duration: 1704.425 ms
                       Changes:   

                  Summary for osd06.ceph.example.com
                  ------------
                  Succeeded: 2
                  Failed:    1
                  ------------
                  Total states run:     3
                  Total run time:   1.729 s
     Started: 18:26:40.901326
    Duration: 12229.855 ms
     Changes:   
              osd03.ceph.example.com:

 

Resolution

Try the following:
Compare the output between a direct call and a mine call.

From the minion run:
salt-call cephdisks.list

From the Admin node run:
salt-call mine.get '*ses3*' cephdisks.list


(where ses3 is part of the minion name; substitute a pattern that matches the OSD node in question, e.g. osd06)

The two outputs should be identical. Look for the device in question (/dev/sdm, per the stage.3 message above) and check whether it is missing from the mine.get output, as in the example below. If it is missing, restart the salt-minion and try again.
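For example, the specific device can be checked for in both outputs (illustrative commands; the target pattern osd06* and the device name sdm follow the example above):

From the minion:
  salt-call cephdisks.list | grep -A3 sdm

From the Admin node:
  salt-call mine.get 'osd06*' cephdisks.list | grep -A3 sdm

If the device appears in the direct call but not in the mine.get output, the mine data for that minion is stale.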

On the OSD node in question (osd06):

Restart the minion. 

systemctl restart salt-minion
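Optionally, verify that the service restarted without errors before continuing:

systemctl status salt-minion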


Then on the Admin node:

  salt \* mine.update
  salt-run state.orch ceph.stage.3


If the issue is not resolved after running stage.3 a second time:

On the OSD node in question (osd06):
Reboot the host.

reboot

Make sure the OSD node comes back up and the OSDs rejoin the cluster.
  
Then on the Admin node:
Validate that the OSDs rejoined the cluster:

ceph -s
ceph osd tree
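On the OSD node itself, the running OSD services can also be checked (an optional cross-check; this lists all ceph-osd unit instances):

systemctl list-units 'ceph-osd@*'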


The reboot should have addressed the following: make sure there are no orphaned "ceph" mount points:

mount | grep ceph


Compare to "ceph osd tree"

Normally mines are updated as part of stage.0. Update the mines by running either of the following:

salt \* mine.update 

or

salt 'osd06*' mine.update



Run stage.1:  

salt-run state.orch ceph.stage.1


Run stage.2:

salt-run state.orch ceph.stage.2

  
After stage.2 runs, DeepSea will display the devices that it intends to deploy.
Validate here that the correct number of devices is displayed; the proposed devices can also be cross-checked in the pillar data, as shown below.
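An optional cross-check of the proposed devices from the pillar data (the exact pillar key can vary by DeepSea version; ceph:storage is assumed here):

salt 'MinionName*' pillar.get ceph:storage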

If an incorrect number of devices is displayed, update the grains to the current devices; either of the following will work:

salt 'MinionName*' state.apply ceph.osd.grains

or
salt 'MinionName*' osd.retain
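To see what the grains currently hold for the node, list the minion's grains and look for the disk entries written by ceph.osd.grains / osd.retain (an optional check):

salt 'MinionName*' grains.items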


Run stage.2:

salt-run state.orch ceph.stage.2



Again, validate that the correct number of devices is displayed.
If the correct number of devices is displayed, run:

salt-run state.orch ceph.stage.3


"ceph -s" should show osd count increment at osd's are deployed, until all osd's are deployed. "ceph osd tree" will do the same. 
The cluster will start rebalancing when osd's are added to the cluster, unless "osd crush initial weight = 0" has been added to the ceph.conf prior to deploying the osd's. 
"ceph osd df tree" will show pg's being added to the new osd's.

Stage.3 will run for a period of time after all of the OSDs have been deployed. This is normal.

Cause

Salt does not generally detect hardware changes dynamically, and hot-plug devices are particularly susceptible to these symptoms. The Salt mine needs to be updated as indicated in the error message.

Additional Information

Double check the following:
Make sure the intended devices that will be deployed as OSDs have been wiped.
See section 4.3 step 12 in the deployment guide: https://documentation.suse.com/ses/5.5/single-html/ses-deployment/index.html#ceph-install-stack

It is the administrator's responsibility to wipe the correct devices. Wiping the wrong devices can lead to data corruption on the cluster.
Tools like "lsblk" should display the devices without partitions. If "lsblk" displays the devices with partitions after wiping them, a reboot is recommended.
"ceph-disk list" is also a good tool to validate the correct devices.

The following command can help validate that the devices only have a GPT partition table and no partitions:
  parted -s /dev/devXX unit GB print free

Check the OSD node's .yml profile file to ensure the correct devices are configured:
Example:
 cd /srv/pillar/ceph/proposals/profile-default/stack/default/ceph/minions/

Note: "profile-default" is the default path. Check policy.cfg for the correct path.
 

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019626
  • Creation Date: 23-Oct-2020
  • Modified Date: 23-Oct-2020
    • SUSE Enterprise Storage
