SES5.5 How to remove/replace an osd

This document (000019687) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 5.5

Situation

Customer needs to remove an osd from the cluster.
Customer needs to redeploy osd's with different configuration.
Customer needs to remove failed hdd, ssd, or nvme device(s).

Resolution

SES5.5 How to remove/replace an osd.
Choose remove.osd vs replace.osd
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#salt-removing-osd
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#ds-osd-replace

This document covers remove.osd, and not replace.osd. However, replace.osd is very similar.

There are three reasons to remove an osd from the cluster:
-osd device failed. In this case there would be one osd marked down.
-journaling device failed. In this case all associated osd's would be marked down.
-last, reconfigure osd's with different journaling partition sizes.

The first two reasons require replacing hardware and updating the configuration to address the issue. The last is only a reconfiguration, with no replacement of hardware.

If an OSD node needs to be shut down, consider setting the “noout” flag before shutting down the OSD node.
   "ceph osd set noout"

Removing an osd can be intimidating.
Customer should validate and revalidate each step to ensure that the correct osd, devices, and partitions are being removed.
Removing the incorrect osd, devices, or partitions can harm the cluster and can cause data loss.
Please use caution!

Kernel names /dev/sd? are not persistent. If the OSD node is rebooted/restarted, there is potential that the kernel names have changed.
Take precautions and revalidate kernel names if the node is rebooted.

This document is an example and does NOT cover all conditions.
It is the customer's responsibility to validate/verify steps.
SUSE is not responsible for data loss.

As a rule, osd's in the same failure domain can be removed at the same time. osd's in different failure domains should be removed one at a time, ensuring the cluster is healthy before removing the next osd.

Removing an osd from a cluster:
1 -On the Admin node, identify the osd and its node/host name.
   "ceph osd tree"

   Example: osd.63 on OSD node ceph01
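
Step 1 can also be scripted. The following is a minimal sketch, assuming the usual "ceph osd tree" layout in which each host row is followed by the rows of its osd's. The function name osd_host is hypothetical and is not part of Ceph.

```shell
# Print the CRUSH host that contains a given osd, reading "ceph osd tree"
# output on stdin. osd_host is a hypothetical helper, not a Ceph command.
osd_host() {
    awk -v osd="osd.$1" '
        {
            for (i = 1; i <= NF; i++) {
                # remember the most recent host row
                if ($i == "host" && i < NF) host = $(i + 1)
                # exact field match, so osd.6 does not match osd.63
                if ($i == osd) { print host; exit }
            }
        }'
}

# Usage on the Admin node:
#   ceph osd tree | osd_host 63
```

With the example above, this should print "ceph01". Always compare the result against the full "ceph osd tree" output before acting on it.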

Preparatory information:
2 -On the OSD node, identify the devices the osd is using:
   "ssh $OSD-Node"
   "ceph-disk list"
   Example for osd.63:
       /dev/sdl :
       /dev/sdl1 ceph data, active, cluster ceph, osd.63, block /dev/sdl2, block.db /dev/sdm4, block.wal /dev/sdm3
       /dev/sdl2 ceph block, for /dev/sdl1

   osd.63 is using device /dev/sdl for data, and partitions /dev/sdm4 & /dev/sdm3 for journaling.

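The data, block.db and block.wal devices from Step 2 can be pulled out of the "ceph-disk list" output with a small filter. This is only a sketch, assuming the comma-separated line layout shown in the example for osd.63. The function name osd_devices is hypothetical.

```shell
# Print the data partition, block.db and block.wal devices for one osd,
# reading "ceph-disk list" output on stdin.
# osd_devices is a hypothetical helper; it assumes the line layout shown above.
osd_devices() {
    awk -v osd="osd.$1" '
        index($0, "ceph data") {
            n = split($0, part, /, */)
            found = 0
            for (i = 1; i <= n; i++) {
                gsub(/[ \t]+$/, "", part[i])          # trim trailing blanks
                if (part[i] == osd) found = 1          # exact osd name only
            }
            if (!found) next
            print "data", $1
            for (i = 1; i <= n; i++) {
                if (part[i] ~ /^block\.db /)  { sub(/^block\.db /, "", part[i]);  print "db", part[i] }
                if (part[i] ~ /^block\.wal /) { sub(/^block\.wal /, "", part[i]); print "wal", part[i] }
            }
        }'
}

# Usage on the OSD node:
#   ceph-disk list 2>/dev/null | osd_devices 63
```

Record the result; the partitions listed here are the ones that must be gone after the osd is removed.
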
3 -On the OSD node, record the "/dev/disk/by-id/" label that /dev/sdl is using.
   Example with information provided above:
   cd /dev/disk/by-id/
   ll
   --cut here--
   lrwxrwxrwx 1 root root 9 2020-08-06 12:20 scsi-350000399a8c8f172 -> ../../sdl
   lrwxrwxrwx 1 root root 10 2020-08-06 12:20 scsi-350000399a8c8f172-part1 -> ../../sdl1
   lrwxrwxrwx 1 root root 10 2020-08-06 12:42 scsi-350000399a8c8f172-part2 -> ../../sdl2
   ---and---
   lrwxrwxrwx 1 root root 9 2020-08-06 12:20 scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG -> ../../sdl
   lrwxrwxrwx 1 root root 10 2020-08-06 12:20 scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG-part1 -> ../../sdl1
   lrwxrwxrwx 1 root root 10 2020-08-06 12:42 scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG-part2 -> ../../sdl2

   This information will be used when editing "yml" file.
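
Instead of reading the "ll" listing by eye, the symlinks that point to one kernel name can be listed directly. A minimal sketch, assuming GNU "readlink -f" is available; the function name by_id_links is hypothetical.

```shell
# List the /dev/disk/by-id names that resolve to a given kernel name.
# by_id_links is a hypothetical helper. $1 is the kernel name (for example sdl),
# $2 is the directory to scan and defaults to /dev/disk/by-id.
by_id_links() {
    for link in "${2:-/dev/disk/by-id}"/*; do
        [ -L "$link" ] || continue
        target=$(readlink -f "$link")
        if [ "$(basename "$target")" = "$1" ]; then
            basename "$link"
        fi
    done
    return 0
}

# Usage on the OSD node:
#   by_id_links sdl
```

Only whole-disk links are printed for "sdl"; pass "sdl1" or "sdl2" to see the partition links.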

4 -On the OSD node, record journaling partitions.
   Example from the information provided above (the journaling partitions are on /dev/sdm):
   "parted /dev/sdm print free"

5 -On the OSD node, record the Serial Number of the hard drive so that it can be identified physically when the drive is removed from the OSD node.
   Examples:
   hdparm -I /dev/sdl | egrep -i 'Model\ Number|Serial\ Number'
   smartctl --xall /dev/sdl | egrep -i 'Model\ Family|Device\ Model|Serial\ Number'
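
To keep the serial number for later comparison, the "Serial Number" line can be extracted from either command's output. This is a sketch only; the function name serial_from_report is hypothetical, and it assumes the report contains a "Serial Number: value" line.

```shell
# Print the first serial number found in hdparm or smartctl output on stdin.
# serial_from_report is a hypothetical helper.
serial_from_report() {
    awk -F: 'tolower($1) ~ /serial number/ {
        value = $2
        gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding blanks
        print value
        exit
    }'
}

# Usage on the OSD node:
#   smartctl --xall /dev/sdl | serial_from_report
```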

6 -On the Admin node, record the policy.cfg storage profile information. The default is "profile-default".
   Example
   On admin node:
   cd /srv/pillar/ceph/proposals/
   cat policy.cfg | grep profile

   profile-custom-hdd/cluster/ceph0[012345].ses5.com.sls
   profile-custom-hdd/stack/default/ceph/minions/ceph0[012345].ses5.com.yml
   #profile-custom-ssd/cluster/ceph0[012345].ses5.com.sls
   #profile-custom-ssd/stack/default/ceph/minions/ceph0[012345].ses5.com.yml

7 -Make a backup of the yml file for the host the osd is located on.
   Example:
   cp profile-custom-ssd/stack/default/ceph/minions/$OSD-Node-Name.ses5.com.yml \
   profile-custom-ssd/stack/default/ceph/minions/$OSD-Node-Name.ses5.com.yml.bck

   Where $OSD-Node-Name is OSD host name. In this example "ceph01"

   Sometimes Deepsea will remove the device entry from the yml file, which can be desired or undesired, but having a backup can make the process simpler.
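
The backup in Step 7 can be wrapped so that an earlier backup is never overwritten by accident. A minimal sketch; the function name backup_yml is hypothetical and the ".bck" suffix follows the example above.

```shell
# Copy a yml file to <file>.bck, refusing to overwrite an existing backup.
# backup_yml is a hypothetical helper.
backup_yml() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    [ ! -e "$1.bck" ] || { echo "backup already exists: $1.bck" >&2; return 1; }
    cp -p "$1" "$1.bck"
}

# Usage on the Admin node, from /srv/pillar/ceph/proposals/:
#   backup_yml <profile>/stack/default/ceph/minions/<host>.yml
```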

8 -As a precaution, drain the osd:
       ceph osd reweight $OSD_ID 0

   Allow the cluster to get healthy.
   Monitor with "ceph -s" and "ceph osd df tree". pg's will be migrated away from the osd (in this case osd.63).
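
Waiting for the cluster to get healthy can be automated with a polling loop. The sketch below assumes "ceph health" prints a line starting with HEALTH_OK when the cluster is healthy. The function name wait_for_health_ok and its defaults (30 second interval, 120 attempts) are hypothetical.

```shell
# Poll a health command until it reports HEALTH_OK or the attempts run out.
# wait_for_health_ok is a hypothetical helper.
#   $1 = command to run (default: ceph health)
#   $2 = seconds between attempts (default: 30)
#   $3 = number of attempts (default: 120)
wait_for_health_ok() {
    health_cmd=${1:-ceph health}
    interval=${2:-30}
    tries=${3:-120}
    while [ "$tries" -gt 0 ]; do
        case "$($health_cmd)" in
            HEALTH_OK*) return 0 ;;
        esac
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}

# Usage on the Admin node:
#   wait_for_health_ok && echo "cluster is healthy"
```

Note that the cluster reports HEALTH_WARN for as long as a flag such as "noout" is set, so this loop will not succeed in that situation.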

9 -Remove the osd:
   Generally, the work above is done as a precaution. Now it’s time to allow Deepsea to do its job.
   https://documentation.suse.com/ses/5.5/single-html/ses-admin/#salt-removing-osd
   Run the following command on the admin node:
       salt-run disengage.safety
       salt-run remove.osd OSD_ID
       or
       salt-run disengage.safety; salt-run remove.osd OSD_ID

       (Where OSD_ID is the # only.)

   Using the information provided above, the command is as follows:
       salt-run remove.osd 63

   "salt-run remove.osd" can be run multiple times if there is a failure.

   If the command is successful, the osd will NOT be listed by the following command.
   Using the information provided above:
       ceph osd tree | grep osd.63

   Sometimes force is required to remove an osd:
   https://documentation.suse.com/ses/5.5/single-html/ses-admin/#osd-forced-removal
   salt target osd.remove OSD_ID force=True

   Using the information provided above, the command is as follows:
   salt 'ceph01*' osd.remove 63 force=True

   In extreme circumstances it may be necessary to remove the osd with:
   "ceph osd purge"

   Example using the osd identified in Step #1 (the command requires explicit confirmation):
       ceph osd purge 63 --yes-i-really-mean-it

   After "salt-run remove.osd OSD_ID" is run, it is good practice to verify the partitions have also been deleted.
   On the OSD node run:

       ceph-disk list

   "ceph-disk list" should no longer associate the device with any osd. Using the information provided above, "/dev/sdl" will not be associated with osd.63.
   Partitions /dev/sdm4 & /dev/sdm3 will not be associated with any osd either.

   Validate using the devices identified in Step #2:
       lsblk
       and
       parted /dev/sdl print free
       parted /dev/sdm print free

       Note: journaling partitions /dev/sdm4 and /dev/sdm3 should no longer be listed.

   If Deepsea did not remove the journaling partitions, then it will be necessary to remove the journaling partitions manually.
   Example from information above, Step #2:
   To remove partitions /dev/sdm4 and /dev/sdm3 run the following command:
       parted -s /dev/sdm rm 4
       parted -s /dev/sdm rm 3

   Caution! Deleting the wrong partitions can harm the cluster and cause data loss.

   The mount point for the osd should no longer exist:
   Example from information above, Step #2:
       mount | grep "ceph-63"

   If the mount point still exists, use umount.
   Example from information above:
       umount /var/lib/ceph/osd/ceph-63

   The osd daemon should no longer be running.
   Example from information above, Step #2:
       systemctl status ceph-osd@63.service

   If the osd daemon is still running, stop and disable:
   Example from information above:
       systemctl stop ceph-osd@63.service
       systemctl disable ceph-osd@63.service

   The auth key for the osd should also have been removed:
   Example from information above, Step #2:
       ceph auth get osd.63

   If the auth key for the osd is still in the keyring, remove it:
   Example from information above:
       ceph auth rm osd.63
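
One pitfall in the "ceph osd tree | grep osd.63" check from Step 9: a plain grep also matches osd.630 or osd.631, and the dot matches any character. The sketch below compares whole fields instead. The function name osd_in_tree is hypothetical.

```shell
# Succeed only if the exact osd name appears in "ceph osd tree" output on stdin.
# osd_in_tree is a hypothetical helper.
osd_in_tree() {
    awk -v osd="osd.$1" '
        { for (i = 1; i <= NF; i++) if ($i == osd) found = 1 }
        END { exit !found }'
}

# Usage on the Admin node:
#   if ceph osd tree | osd_in_tree 63; then
#       echo "osd.63 is still in the CRUSH tree"
#   fi
```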

10- After the osd has been removed from the cluster, it is safe to remove the hard drive from the system.
   Verify the cluster gets healthy. "ceph -s"

   Identify the device to be removed with "ledctl".
   If the drive in question is dead, then this step may not work.
   On the OSD node with drive in question:

       Install "ledmon"
       zypper in ledmon

       To turn drive light on:
       ledctl locate=/dev/sd?

       To turn drive light off:
       ledctl locate_off=/dev/sd?

   For devices on HPE SmartArray use "hpssacli".
   For devices on LSI MegaRAID use "storcli".

   Some hardware does not provide a means to view drive lights. In this case, manually check each drive for the correct serial number.

   Validate the correct hdd drive was removed with the serial number recorded in Step #5.
   Examples:
   # hdparm -I /dev/sdl | egrep -i 'Model\ Number|Serial\ Number'
   # smartctl --xall /dev/sdl | egrep -i 'Model\ Family|Device\ Model|Serial\ Number'

   If the wrong device is removed, put the device back into the same drive bay it was removed from.
   It is best to shut down the OSD node when removing the device.

Adding an osd to the cluster.
https://documentation.suse.com/ses/5.5/single-html/ses-admin/#salt-node-add-disk

Requirements:
- osd disks must not have partition tables or partitions.
- Journaling devices must have enough free space to create new journaling partitions.
- yml file needs to be correct for desired osd disks and journaling devices.
Before installing the new hdd device in the OSD node, write down the Serial Number recorded on the hdd device label.

A -On the OSD node, install the new drive.
   It may be necessary to shut down the OSD node to do this task properly.
   If an OSD node needs to be shut down, consider setting the “noout” flag before shutting down the OSD node.
   "ceph osd set noout"

   Discover which device is the new device "/dev/sd???":
       ceph-disk list
       lsblk

   Typically, the new device will be recognizable by the lack of partition table, or partitions.

   Also validate the correct hdd drive was installed by locating the device with the correct serial number.
   Examples:
   hdparm -I /dev/sdl | egrep -i 'Model\ Number|Serial\ Number'
   smartctl --xall /dev/sdl | egrep -i 'Model\ Family|Device\ Model|Serial\ Number'

   Make note of the "/dev/disk/by-id/???"
       cd /dev/disk/by-id/
       ll
   See Step #3 above.
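
The check in Step A that the new device carries no partitions can be scripted from "lsblk" output. A minimal sketch, assuming "lsblk -ln -o NAME,TYPE" prints one "name type" pair per line; the function name has_partitions is hypothetical. An empty partition table without any partitions is not detected by this check.

```shell
# Succeed if the lsblk listing on stdin contains at least one partition row.
# has_partitions is a hypothetical helper; it expects NAME and TYPE columns.
has_partitions() {
    awk '$2 == "part" { found = 1 } END { exit !found }'
}

# Usage on the OSD node (replace sdX with the new device):
#   if lsblk -ln -o NAME,TYPE /dev/sdX | has_partitions; then
#       echo "/dev/sdX has partitions and is not ready for deployment"
#   fi
```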

B -On the Admin node edit the yml file for this node:
   If using multiple journaling devices, ensure that each journaling device has an equal number of osd's assigned to it.
   See Step #6:
   Example:
   "ceph01.ses5.com.yml"
   locate the entry for the removed osd "scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG":
       /dev/disk/by-id/scsi-SATA_TOSHIBA_MG07ACA1_9980A0C2F9SG:
           db: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
           db_size: 81920m
           format: bluestore
           wal: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
           wal_size: 2048m

   and replace the entry with the new /dev/disk/by-id/??
       /dev/disk/by-id/??:
           db: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
           db_size: 81920m
           format: bluestore
           wal: /dev/disk/by-id/scsi-SATA_MTFDDAK480TDN_191821E49174
           wal_size: 2048m

   If Deepsea removed the entry from the yml file, it is possible to add the entry manually.
   Note the yml file is space sensitive. Make sure the yml file has correct syntax. See Step #7.
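
Replacing only the data device key, while leaving the indentation and the db/wal lines untouched, can be done with sed instead of a manual edit. This is a sketch under assumptions: GNU "sed -i" is available, the key sits alone on its line ending in a colon, and the device path contains no regular expression metacharacters that matter. The function name replace_osd_entry is hypothetical. Make the backup from Step #7 first.

```shell
# Replace one data device key in a yml file, keeping its indentation.
# replace_osd_entry is a hypothetical helper.
#   $1 = yml file, $2 = old /dev/disk/by-id path, $3 = new /dev/disk/by-id path
replace_osd_entry() {
    grep -q "^ *$2: *\$" "$1" || { echo "entry not found: $2" >&2; return 1; }
    # Only lines that consist of the old path followed by a colon are changed,
    # so "db:" and "wal:" lines are left alone.
    sed -i "s|^\( *\)$2: *\$|\1$3:|" "$1"
}

# Usage on the Admin node (paths are placeholders):
#   replace_osd_entry <host>.yml /dev/disk/by-id/<old-id> /dev/disk/by-id/<new-id>
```

Afterwards, compare the file with its backup using "diff" to confirm that exactly one line changed.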

   If the new drive is new from the factory, then it should have no partition table. If the replacement drive is a repurposed drive, it may have a partition table. The partition table and partitions need to be removed as per the documentation. See:
   https://documentation.suse.com/ses/5.5/single-html/ses-deployment/#ceph-install-stack
   Step: 12e

   If this is *not* a new node, but the admin wants to proceed as if it was, then remove destroyedOSDs.yml on the target OSD node.
       mv /etc/ceph/destroyedOSDs.yml /etc/ceph/destroyedOSDs.yml.old
   Or
       rm /etc/ceph/destroyedOSDs.yml

If everything was done correctly, run:
   salt-run state.orch ceph.stage.1
   salt-run state.orch ceph.stage.2

To summarize the steps that will be taken when the actual replacement is deployed, you can run the following command:
   salt-run advise.osds

   Example:
   salt-run advise.osds
   These devices will be deployed

   data1.ceph:
   /dev/disk/by-id/cciss-3600508b1001c7c24c537bdec8f3a698f:

   Run 'salt-run state.orch ceph.stage.3'

Note stage.2 should see the new device. If not, something is wrong. Review steps above.
If all is good, run stage.3 to deploy the osd.

salt-run state.orch ceph.stage.3

If flags were set, then remove flags.
Use “ceph -s” to see if flags were set.
   ceph osd unset noout
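
Whether a flag is still set can also be checked from the "flags" line of "ceph osd dump". A minimal sketch, assuming that line has the form "flags noout,sortbitwise,..."; the function name osd_flag_set is hypothetical.

```shell
# Succeed if the named flag appears on the "flags" line of "ceph osd dump"
# output read from stdin. osd_flag_set is a hypothetical helper.
osd_flag_set() {
    awk -v flag="$1" '
        $1 == "flags" {
            n = split($2, f, ",")
            for (i = 1; i <= n; i++) if (f[i] == flag) found = 1
        }
        END { exit !found }'
}

# Usage on the Admin node:
#   if ceph osd dump | osd_flag_set noout; then
#       ceph osd unset noout
#   fi
```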

Repeat the steps for each osd that needs to be replaced.

Additional steps for debugging the osd deployment process:
If the osd did not deploy in stage.3, below are additional steps to help troubleshoot where the issue may be. Start with "Adding an osd to the cluster". Instead of running stage.1-2, run stage.1, then validate with the steps below.

Does the pillar reflect the correct devices?
salt 'MinionName*' pillar.get ceph

Are the grains correct?
salt 'MinionName*' grains.get ceph

MinionName # cat /etc/salt/grains

To update those grains to the current devices, either of these will work
salt 'MinionName*' state.apply ceph.osd.grains
or
salt 'MinionName*' osd.retain

After that, the osd.report may behave itself.
salt 'MinionName*' osd.report

Example:
master:~ # salt 'MinionName*' osd.report
MinionName.gtslab.prv.suse.com:
No OSD configured for
/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_08080808

Does the following file exist on the minion:
/etc/ceph/destroyedOSDs.yml
If so, was the customer intending to destroy an osd and deploy a new one in its place?

Try deploying the osd from the minion.
On the minion run:
salt-call -l debug osd.deploy >/tmp/osd.deploy.log

If you think everything is as it should be, then run:
salt 'MinionName*' state.apply ceph.osd
or
salt 'MinionName*' osd.deploy
or
salt-run state.orch ceph.stage.3

Example:
master:~ # salt 'MinionName*' osd.deploy
MinionName.example.com:
None

Cause

Failed hdd, ssd, or nvme device.

Status

Top Issue

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.