ceph -s reports: 1 clients failing to respond to capability release, 1 clients failing to advance oldest client/flush tid, 1 MDSs report slow requests

This document (000019628) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 6 (SES6)

Situation

ses-master:~ # ceph -s
  cluster:
    id:     7c9dc5a7-373d-4203-ad19-1a8d24c208d0
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 clients failing to advance oldest client/flush tid
            1 MDSs report slow requests
            54 pgs not deep-scrubbed in time
 
  services:
    mon: 3 daemons, quorum ses-mon-1,ses-mon-2,ses-mon-3 (age 13d)
    mgr: ses-mon-3(active, since 13d), standbys: ses-mon-1, ses-mon-2
    mds: cephfs:1 {0=ses-mds-1=up:active} 1 up:standby
    osd: 206 osds: 206 up (since 3d), 206 in (since 6d)
 
  data:
    pools:   10 pools, 3016 pgs
    objects: 129.47M objects, 410 TiB
    usage:   618 TiB used, 1.2 PiB / 1.9 PiB avail
    pgs:     3008 active+clean
             8    active+clean+scrubbing+deep
 
  io:
    client:   0 B/s rd, 44 MiB/s wr, 2 op/s rd, 50 op/s wr

#==[ Command ]======================================#
# /usr/bin/ceph --connect-timeout=5 health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 clients failing to advance oldest client/flush tid; 1 MDSs report slow requests; 74 pgs not deep-scrubbed in time
MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release
    mdsses-mds-1(mds.0): Client cephfs-client1 failing to respond to capability release client_id: 15004271
MDS_CLIENT_OLDEST_TID 1 clients failing to advance oldest client/flush tid
    mdsses-mds-1(mds.0): Client cephfs-client2 failing to advance its oldest client/flush tid.  client_id: 13400693
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsses-mds-1(mds.0): 4 slow requests are blocked > 30 secs
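
The health detail output names clients only by client_id (15004271, 13400693). To map a client_id to a hostname and mount point, the session list of the active MDS can be queried; a sketch, assuming the active MDS is ses-mds-1 as shown above and an admin keyring is available:

```shell
# List the client sessions known to the active MDS. Each entry contains
# the client id and a client_metadata block with hostname, mount point
# and kernel version.
ceph tell mds.ses-mds-1 session ls

# The JSON output is verbose; filtering for the id and hostname fields
# is usually enough to identify the offending machine:
ceph tell mds.ses-mds-1 session ls | grep -E '"(id|hostname)":'
```

The same session list also shows `num_caps` per client, which helps confirm which session is holding capabilities it should have released.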
    
The client machines are responding fine, except that cephfs-client2 is stuck on one directory where a du command was hanging. After the du command was killed, the process released fine.

Socket errors were observed for days before the event.
cephfs-client1 shows the following in dmesg:
[Thu May  7 09:53:42 2020] libceph: osd205 up
[Thu May  7 15:54:18 2020] libceph: mon2 172.21.99.206:6789 session established
[Thu May  7 15:54:18 2020] libceph: client15004271 fsid 7c9dc5a7-373d-4203-ad19-1a8d24c208d0
[Fri May  8 09:35:02 2020] libceph: osd17 weight 0xd999
[Fri May  8 09:35:02 2020] libceph: osd17 weight 0x10000 (in)
[Sat May  9 09:30:42 2020] libceph: osd172 down
[Sat May  9 09:30:50 2020] libceph: osd172 up
[Sat May  9 14:35:21 2020] libceph: osd71 172.20.09.214:6832 socket error on write
[Sat May  9 14:35:21 2020] libceph: osd169 172.20.09.209:6876 socket closed (con state OPEN)
[Sat May  9 14:35:21 2020] libceph: osd42 172.20.09.211:6848 socket error on write
[Sat May  9 14:35:21 2020] libceph: osd25 172.20.09.208:6812 socket error on write

cephfs-client2 shows:
May  7 00:06:56 cephfs-client2 kernel: libceph: osd33 172.20.09.208:6816 socket error on write
May  7 09:57:52 cephfs-client2 kernel: libceph: osd166 172.20.09.209:6864 socket closed (con state OPEN)
May  7 10:01:51 cephfs-client2 kernel: libceph: osd35 172.20.09.207:6820 socket closed (con state OPEN)
May  7 10:01:52 cephfs-client2 kernel: libceph: osd24 172.20.09.207:6812 socket closed (con state OPEN)
May  7 10:01:53 cephfs-client2 kernel: libceph: osd0 172.20.09.207:6800 socket error on write
May  7 10:01:55 cephfs-client2 kernel: libceph: osd157 172.20.09.207:6864 socket closed (con state OPEN)
May  7 16:14:13 cephfs-client2 kernel: libceph: osd17 weight 0xd999
May  7 17:02:50 cephfs-client2 kernel: libceph: osd17 weight 0x10000 (in)
May  9 09:31:57 cephfs-client2 kernel: libceph: osd172 down
May  9 09:32:02 cephfs-client2 kernel: libceph: osd172 up
May  9 14:36:35 cephfs-client2 kernel: libceph: osd157 172.20.09.207:6864 socket error on write
May  9 14:36:50 cephfs-client2 kernel: libceph: osd27 172.20.09.215:6814 socket error on write
May  9 17:50:29 cephfs-client2 kernel: libceph: osd79 172.20.09.214:6836 socket error on write
May  9 19:41:56 cephfs-client2 kernel: libceph: osd160 172.20.09.207:6872 socket closed (con state OPEN)
May  9 21:16:17 cephfs-client2 kernel: libceph: osd110 172.20.09.209:6828 socket closed (con state OPEN)
May 10 05:49:36 cephfs-client2 kernel: libceph: osd174 172.20.09.210:6872 socket error on write
May 10 06:52:36 cephfs-client2 kernel: libceph: osd160 172.20.09.207:6872 socket closed (con state OPEN)
May 10 07:42:18 cephfs-client2 kernel: libceph: osd174 172.20.09.210:6872 socket closed (con state OPEN)
May 10 08:51:48 cephfs-client2 kernel: libceph: osd18 172.20.09.215:6842 socket closed (con state OPEN)
May 10 13:54:32 cephfs-client2 kernel: libceph: osd91 172.20.09.215:6808 socket closed (con state OPEN)
May 10 17:05:47 cephfs-client2 kernel: libceph: osd33 172.20.09.208:6816 socket closed (con state OPEN)
May 10 18:32:18 cephfs-client2 kernel: libceph: osd174 172.20.09.210:6872 socket closed (con state OPEN)
May 11 06:06:41 cephfs-client2 kernel: libceph: osd114 172.20.09.209:6840 socket closed (con state OPEN)
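
A quick way to gauge how widespread these errors are is to tally them per OSD from the client's kernel log; a sketch (run on the affected client; `journalctl -k` can be substituted where the dmesg buffer has rotated):

```shell
# Count libceph socket errors/closures per OSD, most affected first.
dmesg | grep -oE 'osd[0-9]+ [0-9.:]+ socket (error on write|closed)' \
  | awk '{print $1}' | sort | uniq -c | sort -rn
```

Errors concentrated on OSDs of a single host point at that host or its network path; errors spread evenly across many hosts suggest a problem on the client side or in the shared network.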

Resolution

The customer rebooted "cephfs-client1", after which the following messages were no longer displayed:
            1 clients failing to respond to capability release
            1 clients failing to advance oldest client/flush tid
            1 MDSs report slow requests
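
A reboot clears the stale client state, but a less disruptive option worth trying first is to evict the stuck session on the MDS side and then remount on the client; a sketch, using the client_id reported by `ceph health detail` (MDS name, mount point and mon address below are assumed values):

```shell
# Evict the stuck client session from the active MDS. Note that eviction
# blacklists the client by default, so the filesystem must be remounted
# on that machine afterwards.
ceph tell mds.ses-mds-1 client evict id=15004271

# On the client, force-unmount and remount the filesystem:
umount -f /mnt/cephfs
mount -t ceph ses-mon-1:6789:/ /mnt/cephfs -o name=admin
```

If the unmount hangs because of processes still blocked in the mount, a reboot of the client remains the fallback, as was done here.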
    

Cause

It appears that at some point a network issue occurred and the clients lost their connections to some (possibly all) OSDs. The clients should, however, be able to recover from this condition on their own.
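
To confirm or rule out an ongoing network problem, basic reachability from the client to the OSD addresses named in the log can be checked; a sketch (the address and port below are taken from the messages above):

```shell
# From the affected client, probe one of the OSD addresses that logged
# socket errors. A timeout or refusal here points at the network path
# rather than at the Ceph daemons themselves.
nc -z -w 2 172.20.09.214 6832 && echo reachable || echo unreachable

# Interface-level error/drop counters can also reveal a flaky link:
ip -s link show
```
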

Status

Top Issue

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019628
  • Creation Date: 15-May-2020
  • Modified Date: 15-May-2020
    • SUSE Enterprise Storage
