How to resolve "1 scrub errors/Possible data damage: 1 pg inconsistent"

This document (000019694) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 5.5
SUSE Enterprise Storage 6

Situation

"ceph -s" reports:  1 scrub errors, Possible data damage: 1 pg inconsistent

#==[ Command ]======================================#
# /usr/bin/ceph --id=storage --connect-timeout=5 -s
  cluster:
    id:     0260f99a-117e-4c7e-8fbe-86c483bcd7e9
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum mon01,mon02,mon03 (age 10w)
    mgr: mon01(active, since 7w), standbys: mon02, mon03
    mds: cephfs:1 {0=mds01=up:active} 2 up:standby
    osd: 285 osds: 285 up (since 43h), 285 in (since 2w)
    rgw: 3 daemons active (cephrgw01, cephrgw02, cephrgw03)

  data:
    pools:   8 pools, 4328 pgs
    objects: 294.96M objects, 463 TiB
    usage:   694 TiB used, 1.3 PiB / 2.0 PiB avail
    pgs:     4320 active+clean
             7    active+clean+scrubbing+deep
             1    active+clean+scrubbing+deep+inconsistent

  io:
    client:   3.8 MiB/s rd, 188 MiB/s wr, 11 op/s rd, 732 op/s wr

 

Resolution

Run "ceph health detail" to find the pg ID for the inconsistent pg:

#==[ Command ]======================================#
# /usr/bin/ceph --id=storage --connect-timeout=5 health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 5.6f1 is active+clean+scrubbing+deep+inconsistent, acting [7,141,208,199,70,37,182,131,120,259]
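
When more than one pg is reported, the pg IDs can be pulled out of the "ceph health detail" text with a small filter. The following is a minimal sketch only; the function name list_inconsistent_pgs is hypothetical, and it assumes the line layout shown in the output above ("pg <id> is <state>, acting [...]").

```shell
# Print the ID of every pg that "ceph health detail" flags as inconsistent.
# Reads the health detail text on stdin; does not contact the cluster itself.
list_inconsistent_pgs() {
    # Match lines of the form: pg <id> is <state containing "inconsistent">, ...
    awk '$1 == "pg" && $3 == "is" && $4 ~ /inconsistent/ { print $2 }'
}

# Usage on a live cluster (not run here):
#   ceph health detail | list_inconsistent_pgs
```

For the output shown above, this would print 5.6f1.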

To repair the pg, run:
ceph pg repair $pgid

Example:
ceph pg repair 5.6f1
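
If the repair command is wrapped in a script, it may be worth checking that the argument looks like a pg ID (pool number, a dot, then a hexadecimal pg number) before passing it to ceph. The following is a sketch under that assumption; the function names and the CEPH_CMD override are hypothetical and exist only so the wrapper can be dry-run without a cluster.

```shell
# Return success if the argument looks like a pg ID such as 5.6f1.
is_valid_pgid() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9a-f]+$'
}

# Ask Ceph to repair one pg.
# CEPH_CMD is a hypothetical override; set CEPH_CMD=echo for a dry run.
repair_pg() {
    pgid="$1"
    if ! is_valid_pgid "$pgid"; then
        echo "refusing to repair: '$pgid' does not look like a pg ID" >&2
        return 1
    fi
    ${CEPH_CMD:-ceph} pg repair "$pgid"
}

# Usage on a live cluster (not run here):
#   repair_pg 5.6f1
```

The repair itself is carried out by the OSDs after the command returns, so "ceph health detail" should be checked again afterwards to confirm that the pg is no longer inconsistent.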

For more detail on the inconsistent objects in the pg, run:
rados list-inconsistent-obj $pgid --format=json-pretty
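
The JSON output can be long. As a rough sketch, it can be reduced to one line per shard that reported errors, which helps to see which OSD is involved. This assumes that jq is installed (it may not be on every node) and that the output has a top-level "inconsistents" array whose entries carry "object.name" and a "shards" array with "osd" and "errors" fields. The function name summarize_inconsistent_shards is hypothetical.

```shell
# Summarize "rados list-inconsistent-obj --format=json" output read on stdin.
# Prints one line per shard that reported errors: object name, OSD, error list.
# Assumes jq is available and the JSON layout described above.
summarize_inconsistent_shards() {
    jq -r '
        .inconsistents[]
        | .object.name as $name
        | .shards[]
        | select((.errors | length) > 0)
        | "\($name) osd.\(.osd) \(.errors | join(","))"
    '
}

# Usage on a live cluster (not run here):
#   rados list-inconsistent-obj 5.6f1 --format=json | summarize_inconsistent_shards
```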

Cause

Possible causes of an inconsistent pg include failing osd hard drives. Check /var/log/messages for medium errors, I/O errors, sector errors, or smartctl Prefailure messages. Prefailure messages on their own are not a clear indication that a drive is failing, but the other messages are good indicators.

Example command to be run on each osd node:
egrep -i 'medium|i\/o error|sector|Prefailure' /var/log/messages

Example command to be run on the admin node:
salt '*' cmd.shell 'egrep -i "medium|i\/o error|sector|Prefailure" /var/log/messages'

The output of these commands includes the kernel names of the affected devices. On SES 5.5, use "ceph-disk list" to correlate the devices with osds. On SES 6, use "ceph-volume lvm list" instead.
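
On a node with many matching lines, it can help to count the errors per device before correlating devices with osds. The following is a minimal sketch; the function name count_device_errors is hypothetical, and it assumes the kernel messages name the device as "dev <name>" (for example "I/O error, dev sdb, sector 123456"). Lines that use a different wording are not counted.

```shell
# Count error lines per block device in a syslog-style file.
# Argument 1: log file to scan (defaults to /var/log/messages).
# Output: "<count> <device>" per line, highest count first.
count_device_errors() {
    log_file="${1:-/var/log/messages}"
    grep -Ei 'medium|i/o error|sector' "$log_file" \
        | grep -Eo 'dev [a-z0-9]+' \
        | awk '{ count[$2]++ } END { for (d in count) print count[d], d }' \
        | sort -rn
}

# Usage on an osd node (not run here):
#   count_device_errors /var/log/messages
```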

If hard drives are failing, the affected osds need to be removed from the cluster and replaced with new devices.

Status

Top Issue

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019694
  • Creation Date: 03-Sep-2020
  • Modified Date: 03-Sep-2020
  • SUSE Enterprise Storage

