Replacing an OSD disk fails validation during stage 3 because the removal procedure did not properly clean up the DB/WAL partitions

This document (000019749) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 5.5

Situation

A disk failed and was replaced by a new disk.

The customer removed this disk from the OSD node instead of replacing it, and the removal procedure failed to properly clean up the DB/WAL partitions. Deploying the new OSD in stage 3 then fails:
# salt-run state.orch ceph.stage.3 
...  
            OSD-node:
                            ID: deploy OSDs
                      Function: module.run
                          Name: osd.deploy
                        Result: False
                       Comment: Module function osd.deploy threw an exception. Exception: /usr/sbin/sgdisk -n 4:0:+500M -t 4:30CD0809-C2B2-499C-8879-2D6B78529876 /dev/sdk failed
                       Started: 14:27:37.072998
                      Duration: 5606.105 ms
...

Resolution

In this example, each DB/WAL partition was configured with a size of 149G, and 3 x 149G = 447G. The device is therefore full, which is why the redeploy fails: sgdisk has no room left to create the new 500M partition.
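The arithmetic can be checked quickly; the 149G partition size and the three co-located DB/WALs are taken from the example above:

```shell
# Three OSDs share this DB/WAL device, each with a 149 GiB partition:
per_partition_gib=149
partitions=3
echo "$((per_partition_gib * partitions)) GiB consumed"
# -> 447 GiB consumed, leaving no room for sgdisk's new 500M partition
```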

To determine which of the partitions on the device are stale, run the following command on the OSD node:
# readlink -f /var/lib/ceph/osd/ceph-*/{block.db,block.wal} | sort

This shows which partitions on the DB/WAL disks are in use by the active (currently running) OSDs.

Compare the above output for the sdk device with the partitions currently present on it. Any partition that resides on the sdk device but is NOT listed in the readlink output is the stale partition.
This partition then needs to be deleted manually, for example with the 'parted' tool.
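The comparison can be sketched with comm; the partition names below are illustrative examples, not output from the affected system:

```shell
# On a real node, /tmp/in-use comes from:
#   readlink -f /var/lib/ceph/osd/ceph-*/{block.db,block.wal} | sort
# and /tmp/on-disk from the device's partition table (e.g. 'lsblk /dev/sdk').
printf '%s\n' /dev/sdk1 /dev/sdk2 /dev/sdk3            | sort > /tmp/in-use
printf '%s\n' /dev/sdk1 /dev/sdk2 /dev/sdk3 /dev/sdk4  | sort > /tmp/on-disk
# A partition present on disk but not referenced by any OSD is stale:
comm -13 /tmp/in-use /tmp/on-disk
# prints /dev/sdk4 - on the real node, that stale partition would then be
# deleted, e.g.:  parted /dev/sdk rm 4   (double-check before deleting!)
```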

Once the stale partition has been removed, run the deployment again; it should now succeed.

Cause

The OSD removal procedure failed to properly clean up the DB/WAL partitions.
 

Status

Top Issue

Additional Information

Differences between the replace.osd and remove.osd commands:

Salt's replace.osd and remove.osd commands are identical, except that replace.osd leaves the OSD marked as 'destroyed' in the CRUSH Map (so its ID can be reused by the replacement disk), while remove.osd removes all traces of the OSD from the CRUSH Map.
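For reference, both commands are run from the Salt master and take the OSD ID as an argument; a minimal sketch, assuming the failed disk hosted OSD 23 (the ID is an example):

```
# Keep OSD 23 marked as 'destroyed' in the CRUSH Map so the ID can be
# reused by the replacement disk:
salt-run replace.osd 23

# Or remove all traces of OSD 23 from the CRUSH Map:
salt-run remove.osd 23
```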

See also:
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#ds-osd-replace
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#salt-removing-osd

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019749
  • Creation Date: 22-Oct-2020
  • Modified Date: 27-Oct-2020
    • SUSE Enterprise Storage
