cephfs: When deleting files, get: rm: cannot remove 'file-name': No space left on device

This document (000020569) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 6

Situation

Users mounting the CephFS file system report "No space left on device" errors when trying to remove/delete files.

ses-master:~ # ceph health detail
HEALTH_WARN 1 MDSs report oversized cache
MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
mds.ses-mds-2(mds.0): MDS cache is too large (79GB/31GB); 31162038 inodes in use by clients, 996594 stray files

/cases/00327226/scc_SR00327226_ses-master_220113_1249_ea9ccd74-0155-4587-b2a6-bd6a1717dc1f/ceph> cat ceph-status
#==[ Command ]======================================#
# /usr/bin/ceph --connect-timeout=5 -s
  cluster:
    id:     7c9dc5a7-373d-4203-ad19-1a8d24c208d0
    health: HEALTH_WARN
            1 MDSs report oversized cache

  services:
    mon: 3 daemons, quorum ses-mon-1,ses-mon-2,ses-mon-3 (age 9d)
    mgr: ses-mon-2(active, since 9d), standbys: ses-mon-3, ses-mon-1
    mds: cephfs:1 {0=ses-mds-2=up:active} 1 up:standby
    osd: 247 osds: 247 up (since 23m), 246 in (since 9d)

  task status:

  data:
    pools:   14 pools, 3688 pgs
    objects: 538.54M objects, 744 TiB
    usage:   1.1 PiB used, 1.1 PiB / 2.3 PiB avail
    pgs:     2081 active+clean+snaptrim_wait
             1173 active+clean
             427  active+clean+snaptrim
             3    active+clean+scrubbing+deep+snaptrim_wait
             2    active+clean+scrubbing+deep
             2    active+clean+scrubbing+snaptrim_wait

  io:
    client:   69 MiB/s rd, 6.1 MiB/s wr, 4.65k op/s rd, 457 op/s wr

After 30 minutes the cluster health returned to "HEALTH_OK", but reports of "No space left on device" errors when deleting some files/directories continued.

Resolution

The error is observed when "num_strays" is near 1000000 (1 million), which is 10x the default value of the "mds_bal_fragment_size_max" setting; the MDS distributes stray (deleted but not yet purged) entries across ten internal stray directories, each limited by this setting. To view "num_strays", run the following command on the active MDS node:
"ceph daemon mds.`hostname -s` perf dump | grep strays"

"mds_bal_fragment_size_max" is configured with default value "100000" (100 thousand).  The value of "mds_bal_fragment_size_max" can be viewed by running the following command on the mds node:
"ceph daemon mds.`hostname -s` config get mds_bal_fragment_size_max"

Example:
ses-mds-2:~ # ceph daemon mds.`hostname -s` config get mds_bal_fragment_size_max
{
    "mds_bal_fragment_size_max": "100000"
}

ses-mds-2:~ # ceph daemon mds.`hostname -s` perf dump | grep strays
        "num_strays": 996646,
        "num_strays_delayed": 0,
        "num_strays_enqueuing": 0,
        "strays_created": 10101709,
        "strays_enqueued": 9532957,
        "strays_reintegrated": 555,
        "strays_migrated": 0,

Increase "mds_bal_fragment_size_max = 200000" on the mds nodes:
  ceph daemon mds.$HOSTNAME config set mds_bal_fragment_size_max 200000
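
To confirm the change took effect, the setting can be queried again on the active MDS node, as in the earlier example (the output below is illustrative):
ses-mds-2:~ # ceph daemon mds.`hostname -s` config get mds_bal_fragment_size_max
{
    "mds_bal_fragment_size_max": "200000"
}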

Then observe "ceph daemon mds.`hostname -s` perf dump | grep num_strays" to ensure the value remains under the new limit of 2000000. If the value approaches 2000000, increase "mds_bal_fragment_size_max" further, for example to 400000.
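
For periodic monitoring, a simple loop such as the following can be used on the active MDS node (a minimal sketch; the 60-second interval is arbitrary and the watch utility is assumed to be installed):
  watch -n 60 'ceph daemon mds.$(hostname -s) perf dump | grep num_strays'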

Also add the desired configuration to ceph.conf, as the change made above with "ceph daemon ... config set" is not persistent and will be lost when the MDS daemon restarts.
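
For example, a minimal ceph.conf entry on the MDS nodes (a sketch, assuming the value of 200000 chosen above) would be:
  [mds]
  mds_bal_fragment_size_max = 200000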

Cause

Note: The customer had increased "mds cache memory limit = 34359738367".
There is a correlation between the "mds_cache_memory_limit" and "mds_bal_fragment_size_max" settings. When increasing "mds_cache_memory_limit", "mds_bal_fragment_size_max" should also be increased if the "num_strays" value is approaching its limit.
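
The currently configured cache limit can be checked on the active MDS node in the same way as the other settings above (the output shown reflects the value noted in this case):
ses-mds-2:~ # ceph daemon mds.`hostname -s` config get mds_cache_memory_limit
{
    "mds_cache_memory_limit": "34359738367"
}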

Status

Top Issue

Additional Information

See:
https://docs.ceph.com/en/latest/cephfs/dirfrags/#size-thresholds
https://docs.ceph.com/en/latest/cephfs/mds-config-ref/

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020569
  • Creation Date: 31-Jan-2022
  • Modified Date: 31-Jan-2022
  • SUSE Enterprise Storage
