ceph daemons will not start due to: Error: readlink /var/lib/containers/storage/overlay/l/CXMD7IEI4LUKBJKX5BPVGZLY3Y: no such file or directory

This document (000019888) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 7

Situation

The cluster had a power outage, resulting in ceph daemons not starting.
Below are some of the symptoms observed at the time the cluster was started:

saltmaster:~ # ceph -s
  cluster:
    id:     c064a3f0-de87-4721-bf4d-f44d39cee754
    health: HEALTH_WARN
            failed to probe daemons or devices
            2 osds down
            2 hosts (12 osds) down
            Reduced data availability: 6 pgs inactive
            Degraded data redundancy: 1664/25452 objects degraded (6.538%), 78 pgs degraded, 139 pgs undersized

  services:
    mon: 2 daemons, quorum mon6,mon7 (age 4m)
    mgr: mon6(active, since 3h), standbys: mon7, mon8.zndnvk
    mds: cephfs:1 {0=cephfs.ceph9.ucrcbl=up:active} 4 up:standby
    osd: 36 osds: 24 up (since 3h), 26 in (since 3h)

  data:
    pools:   12 pools, 674 pgs
    objects: 8.48k objects, 10 GiB
    usage:   215 GiB used, 785 GiB / 1000 GiB avail
    pgs:     0.890% pgs not active
             1664/25452 objects degraded (6.538%)
             535 active+clean
             76  active+undersized+degraded
             57  active+undersized
             4   undersized+peered
             2   undersized+degraded+peered

  io:
    client:   1.7 KiB/s rd, 1 op/s rd, 0 op/s wr

saltmaster:~ # ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
 -1         1.46393  root default
 -9         0.24399      host osd10
  3    hdd  0.04880          osd.3        down         0  1.00000
  9    hdd  0.04880          osd.9        down         0  1.00000
 15    hdd  0.04880          osd.15       down         0  1.00000
 21    hdd  0.04880          osd.21       down         0  1.00000
 28    ssd  0.02440          osd.28       down         0  1.00000
 33    ssd  0.02440          osd.33       down   1.00000  1.00000
---[cut here]---
 -7         0.24399      host osd15
  2    hdd  0.04880          osd.2        down         0  1.00000
  8    hdd  0.04880          osd.8        down         0  1.00000
 14    hdd  0.04880          osd.14       down         0  1.00000
 19    hdd  0.04880          osd.19       down         0  1.00000
 26    ssd  0.02440          osd.26       down         0  1.00000
 34    ssd  0.02440          osd.34       down   1.00000  1.00000

saltmaster:~ # ceph orch ps | grep error 
NAME                                HOST      STATUS         REFRESHED  AGE  VERSION    IMAGE NAME                                                    IMAGE ID      CONTAINER ID
crash.osd10                      osd10  error          16s ago    19h  <unknown>  registry.suse.com/ses/7/ceph/ceph:latest                      <unknown>     <unknown>   
crash.osd15                      osd15  error          16s ago    19h  <unknown>  registry.suse.com/ses/7/ceph/ceph:latest                      <unknown>     <unknown>   
crash.mon5                       mon5   error          0s ago     19h  <unknown>  registry.suse.com/ses/7/ceph/ceph:latest                      <unknown>     <unknown>   
mgr.mon5                         mon5   error          0s ago     18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
node-exporter.osd10              osd10  error          16s ago    19h  <unknown>  registry.suse.com/caasp/v4.5/prometheus-node-exporter:0.18.1  <unknown>     <unknown>   
node-exporter.osd15              osd15  error          16s ago    19h  <unknown>  registry.suse.com/caasp/v4.5/prometheus-node-exporter:0.18.1  <unknown>     <unknown>   
node-exporter.mon5               mon5   error          0s ago     19h  <unknown>  registry.suse.com/caasp/v4.5/prometheus-node-exporter:0.18.1  <unknown>     <unknown>   
osd.14                              osd15  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.15                              osd10  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.19                              osd15  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.2                               osd15  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.21                              osd10  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.26                              osd15  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.28                              osd10  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.3                               osd10  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.33                              osd10  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.34                              osd15  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.8                               osd15  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
osd.9                               osd10  error          16s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph                             <unknown>     <unknown>   
rgw.default.default.ceph9.mxytwy  ceph9   error          13s ago    18h  <unknown>  registry.suse.com/ses/7/ceph/ceph:latest                      <unknown>     <unknown>   
rgw.default.default.ceph9.pubxcy  ceph9   error          13s ago    4h   <unknown>  registry.suse.com/ses/7/ceph/ceph:latest                      <unknown>     <unknown>   

mon5:~ # ceph health detail
HEALTH_WARN failed to probe daemons or devices; 1 stray daemons(s) not managed by cephadm
[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host mon8 ceph-volume inventory failed: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /usr/bin/podman run --rm --ipc=host --net=host --entrypoint stat -e CONTAINER_IMAGE=registry.suse.com/ses/7/ceph/ceph:latest -e NODE_NAME=mon8 registry.suse.com/ses/7/ceph/ceph:latest -c %u %g /var/lib/ceph
stat:stderr Error: readlink /var/lib/containers/storage/overlay/l/2X52XHV2MZM4L33XEWGHQJ7XNZ: no such file or directory
Traceback (most recent call last):
  File "<stdin>", line 6115, in <module>
  File "<stdin>", line 1299, in _infer_fsid
  File "<stdin>", line 1382, in _infer_image
  File "<stdin>", line 3583, in command_ceph_volume
  File "<stdin>", line 1477, in make_log_dir
  File "<stdin>", line 2086, in extract_uid_gid
RuntimeError: uid/gid not found
[WRN] CEPHADM_STRAY_DAEMON: 1 stray daemons(s) not managed by cephadm
    stray daemon mgr.mon8.zndnvk on host mon8 not managed by cephadm

mon5:~ # cephadm shell
Inferring fsid c064a3f0-de87-4721-bf4d-f44d39cee754
Using recent ceph image registry.suse.com/ses/7/ceph/ceph:latest
Non-zero exit code 125 from /usr/bin/podman run --rm --ipc=host --net=host --entrypoint stat -e CONTAINER_IMAGE=registry.suse.com/ses/7/ceph/ceph:latest -e NODE_NAME=mon5 registry.suse.com/ses/7/ceph/ceph:latest -c %u %g /var/lib/ceph
stat:stderr Error: readlink /var/lib/containers/storage/overlay/l/CXMD7IEI4LUKBJKX5BPVGZLY3Y: no such file or directory
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 6114, in <module>
    r = args.func()
  File "/usr/sbin/cephadm", line 1322, in _infer_fsid
    return func()
  File "/usr/sbin/cephadm", line 1353, in _infer_config
    return func()
  File "/usr/sbin/cephadm", line 1381, in _infer_image
    return func()
  File "/usr/sbin/cephadm", line 3474, in command_shell
    make_log_dir(args.fsid)
  File "/usr/sbin/cephadm", line 1476, in make_log_dir
    uid, gid = extract_uid_gid()
  File "/usr/sbin/cephadm", line 2085, in extract_uid_gid
    raise RuntimeError('uid/gid not found')
RuntimeError: uid/gid not found

In all cases the common denominator was this message: 
stat:stderr Error: readlink /var/lib/containers/storage/overlay/l/CXMD7IEI4LUKBJKX5BPVGZLY3Y: no such file or directory

Resolution

On the affected nodes, remove and re-pull the appropriate container images.
mon5:~ # podman image rm registry.suse.com/ses/7/ceph/ceph
Untagged: registry.suse.com/ses/7/ceph/ceph:latest
Deleted: f1a7d8e63a7eb956904027325e1924fc6d187994fce646a040f2ea8c7b2cec7d

mon5:~ # podman pull registry.suse.com/ses/7/ceph/ceph
Trying to pull registry.suse.com/ses/7/ceph/ceph...
Getting image source signatures
Copying blob 20dcc9d2116b done
Copying blob 19daf7f5570e done
Copying config f1a7d8e63a done
Writing manifest to image destination
Storing signatures
f1a7d8e63a7eb956904027325e1924fc6d187994fce646a040f2ea8c7b2cec7d

mon5:~ # podman ps
CONTAINER ID  IMAGE                                     COMMAND               CREATED         STATUS             PORTS   NAMES
e191ce4a16bf  registry.suse.com/ses/7/ceph/ceph:latest  -n mon.mon5 -f...  13 minutes ago  Up 13 minutes ago          ceph-c064a3f0-de87-4721-bf4d-f44d39cee754-mon.mon5

mon5:~ # cephadm shell
Inferring fsid c064a3f0-de87-4721-bf4d-f44d39cee754
Inferring config /var/lib/ceph/c064a3f0-de87-4721-bf4d-f44d39cee754/mon.mon5/config
Using recent ceph image registry.suse.com/ses/7/ceph/ceph:latest

After the container images have been pulled and validated, restart the appropriate services.
saltmaster:~ # ceph orch restart osd
saltmaster:~ # ceph orch restart mds

Use "ceph orch ps | grep error" to look for processes that could be affected.
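Daemons in "error" state can also be restarted individually rather than per service type. A minimal sketch (the `errored_daemons` helper is not part of the product; the column positions are taken from the "ceph orch ps" output shown above):

```shell
# Extract the NAME column for every daemon whose STATUS column reads "error",
# skipping the header line. Column positions (NAME, HOST, STATUS, ...) match
# the "ceph orch ps" output shown earlier in this document.
errored_daemons() {
    awk 'NR > 1 && $3 == "error" {print $1}'
}

# On a live cluster, each errored daemon could then be restarted one by one:
#   ceph orch ps | errored_daemons | xargs -n1 ceph orch daemon restart
```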

saltmaster:~ # ceph -s
  cluster:
    id:     c064a3f0-de87-4721-bf4d-f44d39cee754
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum mon6,mon7,mon5 (age 17m)
    mgr: mon6(active, since 4h), standbys: mon7, mon8.zndnvk
    mds: cephfs:1 {0=cephfs.ceph9.szvwmo=up:active} 3 up:standby
    osd: 36 osds: 36 up (since 13m), 36 in (since 13m)

  data:
    pools:   12 pools, 674 pgs
    objects: 8.48k objects, 10 GiB
    usage:   310 GiB used, 1.2 TiB / 1.5 TiB avail
    pgs:     674 active+clean

  io:
    client:   1.7 KiB/s rd, 1 op/s rd, 0 op/s wr

saltmaster:~ # ceph osd tree down
ID  CLASS  WEIGHT  TYPE NAME  STATUS  REWEIGHT  PRI-AFF
saltmaster:~ #

Cause

The container images were corrupted, most likely as a result of the power outage.

Status

Top Issue

Additional Information

saltmaster:~ # podman images
REPOSITORY                                             TAG     IMAGE ID      CREATED       SIZE
registry.suse.com/ses/7/ceph/ceph                      latest  f1a7d8e63a7e  2 months ago  835 MB
registry.suse.com/caasp/v4.5/prometheus-server         2.18.0  848b38cc04c2  2 months ago  297 MB
registry.suse.com/caasp/v4.5/prometheus-alertmanager   0.16.2  4683615b36cb  2 months ago  193 MB
registry.suse.com/ses/7/ceph/grafana                   7.0.3   8807a216c843  3 months ago  298 MB
registry.suse.com/caasp/v4.5/prometheus-node-exporter  0.18.1  a149a78bcd37  6 months ago  189 MB

It may be necessary to remove and re-pull other images on each node as well.
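If several images need to be refreshed, the remove/pull pair from the Resolution section can be scripted. A minimal sketch (the `repull_images` helper is an assumption for illustration, not part of the product; it prints the commands as a dry run, so pipe the output to "sh" or remove the echo to execute them on a node):

```shell
# For each image passed in, print the "podman image rm" / "podman pull" pair
# used in the Resolution section. Dry run only: the commands are echoed,
# not executed.
repull_images() {
    for img in "$@"; do
        echo "podman image rm $img"
        echo "podman pull $img"
    done
}

# Image list taken from the "podman images" output above; on a live node it
# could be generated with: podman images --format '{{.Repository}}:{{.Tag}}'
repull_images \
    registry.suse.com/ses/7/ceph/ceph:latest \
    registry.suse.com/caasp/v4.5/prometheus-node-exporter:0.18.1
```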

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019888
  • Creation Date: 07-Apr-2021
  • Modified Date: 07-Apr-2021
    • SUSE Enterprise Storage
