Cluster Pools got marked read only, OSDs are near full.

This document (000019724) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 7.1
SUSE Enterprise Storage 6
SUSE Enterprise Storage 5.5

Situation

OSDs (Object Storage Daemons) were allowed to fill up, resulting in the pools or the cluster being marked "ReadOnly", and/or OSDs are near full.

Resolution

IMPORTANT NOTE: With SES 6 and later, it is recommended to activate the balancer module instead of making manual OSD weight changes. For details, see the SES 7.1 online documentation.
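As a rough sketch of what activating the balancer can look like, the function below wraps the upstream Ceph commands for upmap mode. The function name "enable_balancer" is illustrative only, and the choice of upmap mode is an assumption: it requires every client to be Luminous or newer. The SES documentation remains the authoritative procedure.

```shell
# Sketch only: enable the balancer module in upmap mode.
# ASSUMPTION: all clients are Luminous or newer, which upmap mode requires.
# The function name is hypothetical; the ceph subcommands are upstream Ceph.
enable_balancer() {
    ceph osd set-require-min-compat-client luminous &&
    ceph balancer mode upmap &&
    ceph balancer on &&
    ceph balancer status
}
```

Defining the function changes nothing on the cluster; it only acts when "enable_balancer" is run on a node with an admin keyring.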

In theory, OSDs should never become full, and administrators should monitor how full OSDs are with "ceph osd df tree". If OSDs are approaching 80% full, it is time for the administrator to take action to prevent them from filling up. Actions can include re-weighting the OSDs in question and/or adding more OSDs to the cluster.

Ceph has several parameters to help notify the administrator when OSDs are filling up:

# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

By default, when OSDs reach 85% capacity, the nearfull_ratio warning is triggered.

By default, when OSDs reach 90% capacity, the backfillfull_ratio warning is triggered. At this point the cluster will deny backfilling to the OSD in question.

By default, when OSDs reach 95% capacity, the full_ratio is triggered. All PGs (Placement Groups) on the OSDs in question are marked Read Only, as are all pools associated with the PGs on those OSDs. The cluster is marked Read Only to prevent corruption from occurring.
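The three thresholds can be read as a simple classification of an OSD's %USE value. The helper below is a minimal sketch of that logic using the default ratios quoted above. The function name and argument order are illustrative and not part of Ceph, and treating each threshold as inclusive is an assumption.

```shell
# Classify an OSD's utilisation against the nearfull/backfillfull/full ratios.
# Usage: classify_usage <percent_used> [nearfull] [backfillfull] [full]
# ASSUMPTION: thresholds are treated as inclusive (>=); name is hypothetical.
classify_usage() {
    awk -v use="$1" -v near="${2:-0.85}" -v back="${3:-0.90}" -v full="${4:-0.95}" '
        BEGIN {
            frac = use / 100            # convert percent to a ratio
            if (frac >= full)      print "full"
            else if (frac >= back) print "backfillfull"
            else if (frac >= near) print "nearfull"
            else                   print "ok"
        }'
}
```

For example, "classify_usage 81.27" reports "ok" with the default ratios, while "classify_usage 91" reports "backfillfull".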

It is good practice to set the "noout" flag when in this situation to avoid re-balancing if OSDs go down.

How to set/unset noout:

ceph osd set noout
ceph osd unset noout

To get the cluster out of this state, data needs to be pushed away or removed from the OSDs in question. In the below example it is a single OSD in question (osd.52), but there could be many OSDs that are marked full.

To push data away from the osd, run:

ceph osd reweight $osdID $Weight

Where $osdID is the OSD number "52" and $Weight is a value below 1, for example:

ceph osd reweight 52 .85

The first objective is to get the full OSDs below 95% capacity, so that the OSDs, pools and cluster are no longer marked Read Only. In some cases a $Weight of .95 is enough to achieve this; in other cases a lower value is required (.90, .85, .80, etc.).

The second objective is to get the OSDs in question first below 90%, then below 85% of capacity. This is achieved by continuing to monitor the cluster and re-weighting down the OSDs in question. This process can take a few hours to complete.
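Re-weighting down in small steps can be sketched as a helper that proposes the next lower value. The function name, the step of 0.05 and the lower bound of 0.50 are all assumptions made for illustration; the right step depends on how quickly data moves in the cluster.

```shell
# Propose the next lower override weight for "ceph osd reweight".
# Usage: next_reweight <current_weight> [step] [lowest_allowed]
# ASSUMPTION: step 0.05 and lower bound 0.50 are illustrative defaults.
next_reweight() {
    awk -v cur="$1" -v step="${2:-0.05}" -v lowest="${3:-0.50}" '
        BEGIN {
            nw = cur - step
            if (nw < lowest) nw = lowest   # never go below the lower bound
            printf "%.2f\n", nw
        }'
}
```

The result can be passed straight to the command shown earlier, for example: ceph osd reweight 52 "$(next_reweight 1)". Wait for data movement to settle and check "ceph osd df tree" before taking the next step.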

Keep in mind that when data is being pushed away or removed from an OSD, that data is being placed on another OSD in the cluster. This action can in turn cause other OSDs to fill up as well. The administrator will need to monitor ALL OSDs with "ceph osd df tree" to ensure that proper action is taken.
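To keep an eye on ALL OSDs while data moves, the output of "ceph osd df tree" can be filtered for OSDs at or above a chosen percentage. The sketch below assumes the column layout shown in this document's samples, where %USE is the 8th field on OSD rows; newer releases print additional columns, so the field number may need adjusting. The function name is illustrative.

```shell
# Print "<osd name> <%USE>" for every OSD at or above a threshold.
# Reads "ceph osd df tree" output on stdin.
# ASSUMPTION: %USE is field 8 on OSD rows, as in this document's samples.
osds_over() {
    awk -v limit="${1:-80}" '
        # OSD rows end in "osd.<id>"; header and bucket rows do not.
        $NF ~ /^osd\.[0-9]+$/ && $8 + 0 >= limit + 0 { print $NF, $8 }
    '
}
```

Typical use is: ceph osd df tree | osds_over 80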

In some cases it will be necessary to change the following settings temporarily:

ceph osd set-nearfull-ratio <ratio>
ceph osd set-backfillfull-ratio <ratio>
ceph osd set-full-ratio <ratio>

"ceph osd set-full-ratio .96" will change the "full_ratio" to 96% and remove the Read Only flag on OSDs which are 95-96% full. If OSDs are 96% full, it is possible to set "ceph osd set-full-ratio .97"; however, do NOT set this value too high.

"ceph osd set-backfillfull-ratio .91" will change the "backfillfull_ratio" to 91% and allow backfill to occur on OSDs which are 90-91% full. This setting is helpful when there are multiple OSDs which are full.

In some cases, it will appear that the cluster is trying to add data to the OSDs before the cluster will start pushing data away from the OSDs in question.

Once OSDs are below 95% capacity (full-ratio) or below 90% capacity (backfillfull-ratio), set the settings back to default:

ceph osd set-nearfull-ratio .85
ceph osd set-backfillfull-ratio .90
ceph osd set-full-ratio .95

This will ensure that there is breathing room should any OSDs get marked full again at some point in time.
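Whenever the ratios are raised or restored, they should remain in ascending order (nearfull below backfillfull below full); Ceph raises a health warning when they are not. A minimal sketch of a check on the "ceph osd dump | grep ratio" output is shown below. The function name is illustrative.

```shell
# Succeed only if nearfull_ratio < backfillfull_ratio < full_ratio.
# Reads "ceph osd dump" output (or just its ratio lines) on stdin.
# The function name is hypothetical, not a Ceph command.
ratios_ascending() {
    awk '
        $1 == "nearfull_ratio"     { near = $2 }
        $1 == "backfillfull_ratio" { back = $2 }
        $1 == "full_ratio"         { full = $2 }
        END { exit !(near + 0 > 0 && near + 0 < back + 0 && back + 0 < full + 0) }
    '
}
```

For example: ceph osd dump | ratios_ascending || echo "ratios are out of order"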

If the administrator is confident the issue is addressed and it is safe to re-weight OSDs back up, it can be done in the same way:

ceph osd reweight $osdID $Weight

Where $osdID is the OSD number "52" and $Weight is a value up to 1:

ceph osd reweight 52 .90
    or
ceph osd reweight 52 .95
    or
ceph osd reweight 52 1

Monitor with:

ceph -s
ceph health detail
ceph osd df tree

Should the OSDs start to fill up again, re-weight the OSDs back down again.

Note: osd.52 is now 81% full and has a re-weight value of "0.84999"

#==[ Command ]======================================#
# /usr/bin/ceph --connect-timeout=5 osd df tree
ID CLASS WEIGHT    REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
-1       335.08347        -  335TiB  168TiB  167TiB 50.25 1.00   - root default
52   ssd   0.72769  0.84999  745GiB  606GiB  140GiB 81.27 1.41  18         osd.52
57   ssd   0.72769  1.00000  745GiB  436GiB  310GiB 58.46 1.01  19         osd.57

When the administrator is confident the issue is resolved, remove the "noout" flag:

ceph osd unset noout

Cause

The OSDs had filled up to 95%.

Other contributors:

Cluster was filled up beyond the failure domain capacity, then a failure happened filling up the remaining OSDs.
Cluster is configured with OSDs of various sizes. It is recommended that all OSDs be of the same size for even distribution of data.
If the cluster is 70% full, it is time to add more OSDs to the cluster or to remove unwanted data.
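The first contributor above can be checked with simple arithmetic: divide the used capacity by the capacity that remains after the largest failure domain is lost. The sketch below does this for values in any consistent unit. The function name is illustrative, and the 40 TiB host used in the example is a hypothetical figure, not taken from the cluster shown in this document.

```shell
# Print cluster utilisation (percent) after losing part of the raw capacity.
# Usage: usage_after_loss <used> <total> [lost]   (all in the same unit)
# The function name is hypothetical; it only does arithmetic.
usage_after_loss() {
    awk -v used="$1" -v total="$2" -v lost="${3:-0}" '
        BEGIN {
            remaining = total - lost
            if (remaining <= 0) { print "no capacity left" > "/dev/stderr"; exit 1 }
            printf "%.2f\n", used / remaining * 100
        }'
}
```

With 168 TiB used of 335 TiB, "usage_after_loss 168 335 40" prints 56.95, so this example cluster would stay well below the nearfull ratio if a 40 TiB host failed. A result approaching 85 means a single failure could push OSDs into the nearfull range.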

Status

Top Issue

Additional Information

"ceph -s" will report the number of full OSDs and the number of pools affected:

#==[ Command ]======================================#
# /usr/bin/ceph --connect-timeout=5 -s
  cluster:
    id:     8007d21c-6c85-3f03-85df-f56fc7cf85eb
    health: HEALTH_ERR
            1 full osd(s)
            20 pool(s) full
 
  services:
    mon: 3 daemons, quorum mon-03,mon-02,mon-01
    mgr: mon-03(active), standbys: mon-02, mon-01
    mds: cephfs-1/1/1 up  {0=mon-03=up:active}
    osd: 62 osds: 62 up, 62 in
    rgw: 1 daemon active
 
  data:
    pools:   20 pools, 2456 pgs
    objects: 25.07M objects, 93.0TiB
    usage:   168TiB used, 167TiB / 335TiB avail
    pgs:     2451 active+clean
             5    active+clean+scrubbing+deep

#==[ Command ]======================================#
# /usr/bin/ceph --connect-timeout=5 health detail
HEALTH_ERR 1 full osd(s); 20 pool(s) full; clock skew detected on mon.mon-02, mon.mon-01
osd.52 is full
pool 'cephfs_data' is full (no space)
pool 'cephfs_metadata' is full (no space)
pool '.rgw.root' is full (no space)
pool 'default.rgw.control' is full (no space)
pool 'default.rgw.meta' is full (no space)
pool 'default.rgw.log' is full (no space)
pool 'default.rgw.buckets.index' is full (no space)
--[cut here]--
pool 'default.rgw.buckets.data' is full (no space)
pool 'default.rgw.buckets.non-ec' is full (no space)
--[cut here]--
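The affected OSDs and pools can be pulled out of the "ceph health detail" output for use in scripts. The two helpers below are a sketch that matches only the message formats shown in the sample above ("osd.N is full" and "pool 'name' is full"); the function names are illustrative, and other releases may word these messages differently.

```shell
# List OSDs reported as full. Reads "ceph health detail" output on stdin.
# ASSUMPTION: message format matches the sample output in this document.
full_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && $2 == "is" && $3 == "full" { print $1 }'
}

# List pools reported as full, with the surrounding quotes removed.
full_pools() {
    awk '$1 == "pool" && /is full/ { gsub(/\047/, "", $2); print $2 }'
}
```

For example: ceph health detail | full_osds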

"ceph report" will also provide similar output:

#==[ Command ]======================================#
# /usr/bin/ceph --connect-timeout=5 report
report 4224157997
{
    "cluster_fingerprint": "1bf16fc2-bff3-4a0d-b2ff-28fd11d349d8",
    "version": "12.2.12-594-g02236657ca",
    "commit": "02236657ca915367985ddf280fed3699124fa76d",
    "timestamp": "2020-09-21 14:41:27.871676",
    "tag": "",
    "health": {
        "checks": {
            "OSD_FULL": {
                "severity": "HEALTH_ERR",
                "summary": {
                    "message": "1 full osd(s)"
                },
                "detail": [
                    {
                        "message": "osd.52 is full"
                    }
                ]
            },

"ceph osd df tree" will provide detail about OSDs and full capacity:

#==[ Command ]======================================#
# /usr/bin/ceph --connect-timeout=5 osd df tree
ID CLASS WEIGHT    REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME                    
-1       335.08347        -  335TiB  168TiB  167TiB 50.25 1.00   - root default
---[cut here]--- 
52   ssd   0.72769  1.00000  745GiB  708GiB 37.2GiB 95.00 1.89  24         osd.52               
57   ssd   0.72769  1.00000  745GiB  436GiB  309GiB 58.54 1.17  19         osd.57

NOTE: osd.52 is 95% full.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.