How to resolve "1 scrub errors/Possible data damage: 1 pg inconsistent"

This document (000019694) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 5.5
SUSE Enterprise Storage 6

Situation

"ceph -s" reports: 1 scrub errors, Possible data damage: 1 pg inconsistent

#==[ Command ]======================================#
# /usr/bin/ceph --id=storage --connect-timeout=5 -s
cluster:
id: 0260f99a-117e-4c7e-8fbe-86c483bcd7e9
health: HEALTH_ERR
1 scrub errors
Possible data damage: 1 pg inconsistent

services:
mon: 3 daemons, quorum mon01,mon02,mon03 (age 10w)
mgr: mon01(active, since 7w), standbys: mon02, mon03
mds: cephfs:1 {0=mds01=up:active} 2 up:standby
osd: 285 osds: 285 up (since 43h), 285 in (since 2w)
rgw: 3 daemons active (cephrgw01, cephrgw02, cephrgw03)

data:
pools: 8 pools, 4328 pgs
objects: 294.96M objects, 463 TiB
usage: 694 TiB used, 1.3 PiB / 2.0 PiB avail
pgs: 4320 active+clean
7 active+clean+scrubbing+deep
1 active+clean+scrubbing+deep+inconsistent

io:
client: 3.8 MiB/s rd, 188 MiB/s wr, 11 op/s rd, 732 op/s wr

Resolution

Run "ceph health detail" to find the pg ID for the inconsistent pg:

#==[ Command ]======================================#
# /usr/bin/ceph --id=storage --connect-timeout=5 health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 5.6f1 is active+clean+scrubbing+deep+inconsistent, acting [7,141,208,199,70,37,182,131,120,259]
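
On a cluster with many pgs it can help to pull the pg ID out of the health output automatically. The following is a minimal sketch, not part of the original procedure: the function name inconsistent_pgs is hypothetical, and it assumes the "pg <id> is <state>, acting [...]" line layout shown above. It reads the text on stdin, so it can be tried against saved output without touching the cluster.

```shell
# Print the ID of every pg that "ceph health detail" reports as inconsistent.
# Assumption: detail lines look like "pg 5.6f1 is <state>, acting [...]".
# Input is read from stdin so the filter can be tested on saved output.
inconsistent_pgs() {
    awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }'
}

# Usage on a live cluster (commented out so the sketch has no side effects):
# ceph health detail | inconsistent_pgs
```

Summary lines such as "PG_DAMAGED Possible data damage: 1 pg inconsistent" are skipped because their first field is not "pg".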

To repair the pg, run:

ceph pg repair $pgid

Example:

ceph pg repair 5.6f1
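
When the repair command is scripted, it may be worth checking that the value passed really looks like a pg ID before running it. The sketch below is an illustration only: the helper names is_pgid and repair_cmd are hypothetical, and the check assumes the usual "<pool number>.<hex id>" form seen in the example above. It prints the command instead of executing it, so the output can be reviewed first.

```shell
# Succeed if the argument looks like a pg ID: <decimal pool>.<hex id>.
# Assumption: pg IDs follow the form shown above, e.g. 5.6f1.
is_pgid() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9a-f]+$'
}

# Print (do not run) the repair command for a validated pg ID.
repair_cmd() {
    if is_pgid "$1"; then
        printf 'ceph pg repair %s\n' "$1"
    else
        printf 'not a pg ID: %s\n' "$1" >&2
        return 1
    fi
}

# Usage:
# repair_cmd 5.6f1
```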

Additional information about the inconsistent objects can be obtained with:

rados list-inconsistent-obj $pgid --format=json-pretty

Cause

Possible causes of an inconsistent pg include failing osd hard drives. Check /var/log/messages for medium errors, I/O errors, sector errors, or smartctl Prefailure messages. Prefailure messages on their own are not a clear indication that a drive is failing, but the other messages are good indicators.

Example command to be run on each osd node:
egrep -i 'medium|i\/o error|sector|Prefailure' /var/log/messages

Example command to be run on the admin node:
salt '*' cmd.shell 'egrep -i "medium|i\/o error|sector|Prefailure" /var/log/messages'
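
If the search returns many lines, a per-device count makes the failing drive easier to spot. The following is a rough sketch under stated assumptions: the function name errors_per_device is hypothetical, and it assumes the kernel messages contain a "dev <name>" token (as in lines of the form "I/O error, dev sdb, sector ..."), which can vary between kernel versions. Prefailure messages from smartd are left out here because they name the device differently.

```shell
# Summarise kernel error lines per block device.
# Reads syslog text on stdin; prints "<count> <device>", highest count first.
# Assumption: matching lines contain a "dev <name>" token.
errors_per_device() {
    grep -Ei 'medium|i/o error|sector' |
        grep -Eo 'dev [[:alnum:]]+' |
        awk '{ count[$2]++ } END { for (d in count) print count[d], d }' |
        sort -rn
}

# Usage on an osd node (commented out so the sketch has no side effects):
# errors_per_device < /var/log/messages
```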

The output of these commands will provide the kernel names of devices. For SES5.5 use "ceph-disk list" to correlate with osds. For SES6 use "ceph-volume lvm list" to correlate with osds.

If hard drives are failing, the affected osds will need to be removed from the cluster and replaced with new devices.

Status

Top Issue

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.