How to recover an RKE v0.2.x, v0.3.x or v1.x.x cluster after restoration with an incorrect or missing rkestate file
This document (000020165) is provided subject to the disclaimer at the end of this document.
Situation
Issue
When using RKE (Rancher Kubernetes Engine) v0.2.x, v0.3.x, v1.0.x or v1.1.0, restoring a cluster with an incorrect or missing rkestate file leaves the cluster in a state where the infrastructure pods will not start. This includes all pods in the kube-system, cattle-system and ingress-nginx namespaces, and because these infrastructure pods are down, workload pods will not function correctly. If you find yourself in this situation, you can use the directions below to fix the cluster. For more information about the cluster state file, please see the RKE documentation on Kubernetes Cluster State.
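To confirm the cluster is in this state, you can list the pods in each affected namespace and look for pods stuck in a non-Running state. This is a minimal check assuming the namespaces named above; adjust the list for your cluster:
# Pods stuck in CrashLoopBackOff or Error across these namespaces
# are the typical symptom of a restore with a bad rkestate file.
for ns in kube-system cattle-system ingress-nginx; do
  kubectl get pods -n "$ns"
done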
Pre-requisites
- RKE v0.2.x, v0.3.x, v1.0.x or v1.1.0
- A cluster restoration performed with an incorrect or missing rkestate file
Workaround
- Delete all service-account-token secrets in the kube-system, cattle-system, ingress-nginx and cert-manager namespaces:
kubectl get secret -n cattle-system | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n cattle-system delete secret " $1) }'
kubectl get secret -n kube-system | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n kube-system delete secret " $1) }'
kubectl get secret -n ingress-nginx | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n ingress-nginx delete secret " $1) }'
kubectl get secret -n cert-manager | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n cert-manager delete secret " $1) }'
- Restart Docker on all nodes currently in the cluster:
systemctl restart docker
- Force delete all pods stuck in a CrashLoopBackOff, Terminating, Error or Evicted state:
kubectl get po --all-namespaces | awk '{ if ($4 == "CrashLoopBackOff") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
kubectl get po --all-namespaces | awk '{ if ($4 == "Terminating") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
kubectl get po --all-namespaces | awk '{ if ($4 == "Error") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
kubectl get po --all-namespaces | awk '{ if ($4 == "Evicted") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
- Once the force delete has finished, restart Docker again to clear out any stale containers left over from the force delete commands above:
systemctl restart docker
- You may have to delete the service account tokens or the pods more than once. After you have gone through the steps above once, monitor pod statuses with a watch command in a separate terminal, as shown below:
watch -n1 'kubectl get po --all-namespaces | grep -i "cattle-system\|kube-system\|ingress-nginx\|cert-manager"'
If you see any pods still in an error state, you can describe them to get an idea of what is wrong. Most likely you will see an error like the following, which indicates that you need to delete the pod's service account tokens again:
Warning  FailedMount  7m23s (x126 over 4h7m)  kubelet, 18.219.82.148  MountVolume.SetUp failed for volume "rancher-token-tksxr" : secret "rancher-token-tksxr" not found
Warning  FailedMount  114s (x119 over 4h5m)   kubelet, 18.219.82.148  Unable to attach or mount volumes: unmounted volumes=[rancher-token-tksxr], unattached volumes=[rancher-token-tksxr]: timed out waiting for the condition
Delete the service account tokens again for just that namespace, so that healthy pods in other namespaces are not disturbed. Once the service account tokens are deleted, run a delete pod command for only the namespace with pods still in an error state. The cattle-node-agent and cattle-cluster-agent pods depend on the Rancher pod being online, so you can ignore them until the very end. Once the Rancher pods are stable, delete all the agent pods again to get them to restart more quickly. A consolidated sketch of this per-namespace recovery is shown below.
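The following is a minimal sketch of the per-namespace recovery described above, using cattle-system as the example. The agent pod labels (app=cattle-agent for cattle-node-agent, app=cattle-cluster-agent for cattle-cluster-agent) are assumptions based on the default Rancher agent deployments; verify them with kubectl get pods -n cattle-system --show-labels before relying on them.
# Recover a single namespace without disturbing healthy namespaces.
NS=cattle-system

# Delete the stale service-account-token secrets in this namespace only.
kubectl get secret -n "$NS" | awk -v ns="$NS" '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n " ns " delete secret " $1) }'

# Force delete the pods still in an error state in this namespace only
# (STATUS is column 3 when --all-namespaces is not used).
for state in CrashLoopBackOff Terminating Error Evicted; do
  kubectl get po -n "$NS" | awk -v ns="$NS" -v s="$state" '{ if ($3 == s) system("kubectl delete po --force --grace-period=0 -n " ns " " $1) }'
done

# Once the Rancher pods are stable, delete the agent pods so they restart
# quickly (labels assumed from the default Rancher agent deployments).
kubectl delete pod -n cattle-system -l app=cattle-agent
kubectl delete pod -n cattle-system -l app=cattle-cluster-agent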
Resolution
An update to enable successful restoration of an RKE-provisioned cluster without the correct rkestate file is targeted for an RKE v1.1.x patch release. For more information, please see RKE GitHub issue #1336.
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020165
- Creation Date: 06-May-2021
- Modified Date: 06-May-2021
- SUSE Rancher
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com