How to recover an RKE v0.2.x, v0.3.x or v1.x.x cluster after restoration with an incorrect or missing rkestate file

This document (000020165) is provided subject to the disclaimer at the end of this document.

Situation

Issue

When using RKE (Rancher Kubernetes Engine) v0.2.x, v0.3.x, v1.0.x or v1.1.0, if you have restored a cluster with an incorrect or missing rkestate file, you will end up in a state where your infrastructure pods will not start. This includes all pods in the kube-system, cattle-system and ingress-nginx namespaces. Because these infrastructure pods are not running, workload pods will not function correctly. If you find yourself in this situation, use the steps below to recover the cluster. For more information about the cluster state file, please see the RKE documentation on Kubernetes Cluster State.
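
A quick way to confirm the symptom is to list the pods in the affected namespaces. The check below is a minimal sketch, assuming kubectl is already configured against the restored cluster:

    # List infrastructure pods; pods stuck in CrashLoopBackOff or Error in these namespaces point to this issue
    kubectl get po --all-namespaces | grep -i "cattle-system\|kube-system\|ingress-nginx"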

Pre-requisites

  • RKE v0.2.x, v0.3.x, v1.0.x or v1.1.0
  • A cluster restoration performed with the incorrect or missing rkestate file

Workaround

  1. Delete all service-account-token secrets in the kube-system, cattle-system, ingress-nginx and (if installed) cert-manager namespaces (an optional field-selector equivalent is sketched after these commands):

    kubectl get secret -n cattle-system | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n cattle-system delete secret " $1) }'
    kubectl get secret -n kube-system | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n kube-system delete secret " $1) }'
    kubectl get secret -n ingress-nginx | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n ingress-nginx delete secret " $1) }'
    kubectl get secret -n cert-manager | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n cert-manager delete secret " $1) }'
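
    Depending on your kubectl version, a field selector can achieve the same result without awk. The commands below are an optional equivalent, assuming your kubectl supports --field-selector for the secret type:

    # Optional equivalent using a field selector instead of awk
    kubectl -n cattle-system delete secret --field-selector type=kubernetes.io/service-account-token
    kubectl -n kube-system delete secret --field-selector type=kubernetes.io/service-account-token
    kubectl -n ingress-nginx delete secret --field-selector type=kubernetes.io/service-account-token
    kubectl -n cert-manager delete secret --field-selector type=kubernetes.io/service-account-token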
  2. Restart Docker on all nodes in the cluster:

    systemctl restart docker
  3. Force delete all pods stuck in the CrashLoopBackOff, Terminating, Error or Evicted state (a combined loop is sketched after these commands):

    kubectl get po --all-namespaces | awk '{ if ($4 == "CrashLoopBackOff") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
    kubectl get po --all-namespaces | awk '{ if ($4 == "Terminating") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
    kubectl get po --all-namespaces | awk '{ if ($4 == "Error") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
    kubectl get po --all-namespaces | awk '{ if ($4 == "Evicted") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
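
    If you prefer a single pass, the four states can be looped over in one shell command. This is a convenience sketch that wraps the same awk filter used above:

    # Loop over all four problem states in one pass (same filter as the commands above)
    for state in CrashLoopBackOff Terminating Error Evicted; do
      kubectl get po --all-namespaces | awk -v s="$state" '{ if ($4 == s) system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
    done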
  4. Once the force delete has finished, restart Docker on all nodes again to clear out any stale containers left behind by the force delete commands above:

    systemctl restart docker
  5. You may need to delete the service account tokens or the pods more than once. After working through the steps above, monitor pod statuses with a watch command in one terminal, as shown below.

    watch -n1 'kubectl get po --all-namespaces | grep -i "cattle-system\|kube-system\|ingress-nginx\|cert-manager"'

    If you see any pods still in an error state, you can describe them to get an idea of what is wrong. Most likely you will see events like the following, which indicate that you need to delete the service account tokens again.

    Warning  FailedMount  7m23s (x126 over 4h7m)  kubelet, 18.219.82.148  MountVolume.SetUp failed for volume "rancher-token-tksxr" : secret "rancher-token-tksxr" not found
    Warning  FailedMount  114s (x119 over 4h5m)   kubelet, 18.219.82.148  Unable to attach or mount volumes: unmounted volumes=[rancher-token-tksxr], unattached volumes=[rancher-token-tksxr]: timed out waiting for the condition

    Delete the service account tokens again for just that namespace, so that healthy pods in other namespaces are not disturbed. Once those service account tokens are deleted, run a delete pod command for only the namespace with pods still in an error state. The cattle-node-agent and cattle-cluster-agent pods depend on the Rancher pods being online, so you can ignore them until the very end. Once the Rancher pods are stable, delete all the agent pods again so that they restart more quickly.
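
    As an illustration, a minimal per-namespace pass could look like the sketch below, using cattle-system as an example; adjust the namespace to whichever one still has failing pods. Note that without --all-namespaces the pod name is in column 1 and the status in column 3:

    # Delete the service account tokens for one namespace only (cattle-system shown as an example)
    kubectl get secret -n cattle-system | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n cattle-system delete secret " $1) }'
    # Force delete pods still in an error state in that namespace (column positions shift without --all-namespaces)
    kubectl get po -n cattle-system | awk '{ if ($3 == "CrashLoopBackOff" || $3 == "Error") system("kubectl delete po --force --grace-period=0 -n cattle-system " $1) }'
    # Once the Rancher pods are stable, delete the agent pods so they restart more quickly
    kubectl get po -n cattle-system | awk '/cattle-node-agent|cattle-cluster-agent/ { system("kubectl -n cattle-system delete po " $1) }'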

Resolution

An update to enable successful restoration of an RKE-provisioned cluster without the correct rkestate file is targeted for an RKE v1.1.x patch release. For more information, please see RKE GitHub issue #1336.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020165
  • Creation Date: 06-May-2021
  • Modified Date: 06-May-2021
    • SUSE Rancher
