How to recover a cluster when all control plane nodes have failed

This document (000020695) is provided subject to the disclaimer at the end of this document.

Environment

In a disaster recovery scenario, the control plane and etcd nodes managed by Rancher in a downstream cluster may no longer be available or functioning. The cluster can be rebuilt by adding control plane and etcd nodes again, followed by restoring from an available snapshot.

Resolution

Pre-requisites

  • A cluster built by Rancher v2.x or the Rancher Kubernetes Engine CLI (RKE)
  • Nodes with adequate resources to add to the cluster with the control plane and etcd roles
  • An offline copy of a snapshot to be used as the recovery point, often stored in S3 or copied off node filesystems to a backup location

Note: This article assumes that all control plane and etcd nodes are no longer functional and/or cannot be repaired via any other means, like a VM snapshot restore.

Steps

To recover the downstream cluster, any existing nodes with the control plane and/or etcd roles must be removed. Worker nodes can remain in the cluster, and these may continue to operate with running workloads.

Please use the following steps as a guideline to recover the cluster. From this point on, the cluster that has experienced the disaster is referred to as the downstream cluster.

  1. As a precaution, it's recommended to take a snapshot of the Rancher local cluster. Please see the documentation (RKE, RKE2) for the appropriate way to take a snapshot for the Rancher installation.

    Alternatively, the rancher-backup operator can be used to back up all of the related objects for restoration.
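
    For a Rancher server installed with the RKE CLI, a one-off snapshot of the local cluster can be taken from the workstation holding the cluster.yml used to build it. A sketch; the snapshot name is an arbitrary example:

    ```shell
    # Take a one-off etcd snapshot of the RKE-built Rancher local cluster.
    # Run from the directory containing the cluster.yml used to build it.
    # "pre-recovery-snapshot" is an example name; choose any identifier.
    rke etcd snapshot-save --config cluster.yml --name pre-recovery-snapshot
    ```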

  2. Delete all nodes with the control plane and/or etcd roles from the downstream cluster in the Rancher UI.

    The delete action can fail when the downstream cluster is in this condition. If a node does not get removed, follow the steps below to remove it from the cluster:

    1. Click on the node, select View in API, and click the delete button for the object.
    2. If this does not succeed, use kubectl or the Cluster Explorer for the Rancher local cluster to edit the corresponding nodes.management.cattle.io object in the namespace that matches the downstream cluster ID, and remove the finalizers field.
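
    With kubectl and a kubeconfig for the Rancher local cluster, the finalizers can be cleared with a merge patch against the stuck object. A sketch using placeholder IDs (c-abc12 for the downstream cluster ID, m-xyz98 for the node object name):

    ```shell
    # List the node objects for the downstream cluster; the namespace
    # is the cluster ID (c-abc12 is a placeholder).
    kubectl get nodes.management.cattle.io -n c-abc12

    # Clear the finalizers on the stuck node object so the delete can
    # complete (m-xyz98 is a placeholder node object name).
    kubectl patch nodes.management.cattle.io m-xyz98 -n c-abc12 \
      --type=merge -p '{"metadata":{"finalizers":[]}}'
    ```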
  3. Add a clean node back to the cluster with all three roles (control plane, etcd, and worker). The IP address does not have to match any of the previous nodes. If the node has previously been used in a cluster, use the extended cleanup script steps to remove any previous configuration.

    The newly added node will fail to register with the downstream cluster and will not proceed past "Waiting to register with Kubernetes"; this is normal.

  4. Copy the snapshot into place on the new node, under the /opt/rke/etcd-snapshots directory structure.

    The filename must match a snapshot name in the list of snapshots shown in the Rancher UI for the downstream cluster. Any snapshot should be usable; if the name is different, rename the file to match one of the known snapshots in the list.
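
    A sketch of staging the snapshot on the new node. On a real node the target directory is /opt/rke/etcd-snapshots; a relative path is used below so the commands can be tried safely, and the backup copy and snapshot name are placeholder examples:

    ```shell
    # On the node, SNAPSHOT_DIR would be /opt/rke/etcd-snapshots.
    SNAPSHOT_DIR="./etcd-snapshots"
    # The offline copy of the snapshot (placeholder; normally fetched
    # from S3 or another backup location).
    BACKUP_COPY="./recovery_snapshot.zip"
    # A snapshot name taken from the Rancher UI list (placeholder example).
    KNOWN_NAME="c-abc12-rl-xyz9_2022-08-10T01:02:03Z.zip"

    mkdir -p "$SNAPSHOT_DIR"
    # Stand-in for copying the real backup onto the node:
    touch "$BACKUP_COPY"
    # Place the snapshot under the expected directory with a known name:
    cp "$BACKUP_COPY" "$SNAPSHOT_DIR/$KNOWN_NAME"
    ls -l "$SNAPSHOT_DIR"
    ```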

  5. Initiate a snapshot restore from Rancher UI using the same snapshot name used in the previous step.

  6. Monitor the Rancher pod logs for progress.

    To follow all pod logs at once, a kubeconfig for the Rancher local cluster can be used with this kubectl command:

    • kubectl logs -n cattle-system -l app=rancher -f -c rancher
  7. Once the new node reaches the active state, check the cluster and add additional nodes by repeating step 3 when ready. The additional nodes can be added with only the control plane and etcd roles if desired.
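
    With a kubeconfig for the downstream cluster, basic node and workload health can be verified with kubectl, for example:

    ```shell
    # Confirm all nodes report Ready and check core system workloads.
    kubectl get nodes -o wide
    kubectl get pods -n kube-system
    ```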


As a follow-up, once all desired nodes are added and the cluster is healthy, the control plane and etcd node roles can be configured as needed. For example, if all three roles are not needed on a node, update the nodes by removing and re-adding each one in a rolling fashion.

Additional Information

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020695
  • Creation Date: 10-Aug-2022
  • Modified Date: 11-Aug-2022
