How to remove and replace an unresponsive control plane / etcd node in the local Rancher server cluster, provisioned by the Rancher Kubernetes Engine (RKE) CLI

This document (000020033) is provided subject to the disclaimer at the end of this document.

Situation

Task

This article details how to remove and replace an unresponsive control plane / etcd node from a local Rancher server cluster, provisioned via the Rancher Kubernetes Engine (RKE) CLI.

Pre-requisites

Resolution

The procedure is relatively simple. The steps below use the following example cluster.yaml for demonstration purposes.

N.B. Be sure to use your cluster.yaml and matching cluster.rkestate for the relevant cluster.
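Before changing anything, it can help to confirm that the state file matching your configuration is actually present. The following is a minimal sketch, assuming the default layout in which the state file sits next to the configuration file and shares its base name (cluster.yaml and cluster.rkestate). The function names rke_state_path and check_rke_files are hypothetical helpers, not part of the RKE CLI.

```shell
# Hypothetical helpers (not part of RKE): locate and check the state file
# that belongs to a given cluster configuration file.
# Assumes the state file lives beside the config and shares its base name.

rke_state_path() {
    config="$1"
    # Strip the final extension (.yaml or .yml) and append .rkestate
    printf '%s.rkestate\n' "${config%.*}"
}

check_rke_files() {
    config="$1"
    state="$(rke_state_path "$config")"
    if [ ! -f "$config" ]; then
        echo "missing config: $config" >&2
        return 1
    fi
    if [ ! -f "$state" ]; then
        echo "missing state file: $state" >&2
        return 1
    fi
    echo "found $config and $state"
}
```

For example, check_rke_files cluster.yaml returns a non-zero status if either file is missing, so it can gate a later rke up invocation. The sketch expects a config file name with an extension.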

In this demonstration example, the node that is failing has the address 1.2.3.3:

nodes:
    - address: 1.2.3.1
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.2
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.3
      user: ubuntu
      role:
        - controlplane
        - etcd
[...] # rest of cluster.yaml redacted; only the control plane / etcd nodes are shown
Step 1. Validate the cluster is quorate and confirm the unresponsive node

On each control plane / etcd node, run the following command, taken from the Rancher troubleshooting documentation, to determine etcd endpoint health:

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health

On the unresponsive node the command may fail to execute. On the healthy nodes, you should see output in the following format, indicating the health status of each endpoint:

{"level":"warn","ts":"2020-12-31T12:11:41.840Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c65a15b4-9646-4c71-914d-f3c892c04c2f/1.2.3.3:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 1.2.3.3:2379: connect: connection refused\""}
https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is unhealthy: failed to commit proposal: context deadline exceeded
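To judge from this output whether the cluster is still quorate, count the healthy endpoints and compare the count with the majority etcd needs, which is floor(n/2) + 1 for n members (2 of 3 in this example). The following is a minimal sketch that reads the health output on standard input. The function name etcd_quorum_check is a hypothetical helper, and the parsing assumes the "is healthy:" / "is unhealthy:" line format shown above.

```shell
# Hypothetical helper: summarise `etcdctl endpoint health` output read from stdin.
# Prints each unhealthy endpoint, then a summary line, and returns 0 only when
# a majority of the reported endpoints is healthy.
etcd_quorum_check() {
    awk '
        / is healthy:/   { healthy++ }
        / is unhealthy:/ { unhealthy++; print "unhealthy: " $1 }
        END {
            total = healthy + unhealthy
            needed = int(total / 2) + 1   # majority required for quorum
            printf "healthy=%d total=%d quorum=%d\n", healthy, total, needed
            if (total > 0 && healthy >= needed) exit 0
            exit 1
        }
    '
}
```

Pipe the health command into it with standard error merged in (append 2>&1 | etcd_quorum_check to the command), since some etcdctl versions may write the unhealthy lines to standard error. For the example output above the sketch prints the 1.2.3.3 endpoint as unhealthy, followed by healthy=2 total=3 quorum=2.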
Step 2. Remove the unresponsive node

Having confirmed which node is unresponsive, remove it from the nodes block in the cluster configuration file (cluster.yaml). The example below shows the configuration with 1.2.3.3 removed:

nodes:
    - address: 1.2.3.1
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.2
      user: ubuntu
      role:
        - controlplane
        - etcd

[...] # rest of cluster.yaml redacted; only the control plane / etcd nodes are shown
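As a quick sanity check on the edited file, you can list the node addresses it still contains. The following is a minimal sketch, assuming addresses are written unquoted on "- address:" lines as in the example. The function name rke_node_addresses is a hypothetical helper, and it performs a plain text match rather than parsing the YAML.

```shell
# Hypothetical helper: print the node addresses listed in an RKE cluster
# configuration file, one per line. Matches "- address: <value>" lines only.
rke_node_addresses() {
    sed -n 's/^[[:space:]]*-[[:space:]]*address:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1"
}
```

Running rke_node_addresses cluster.yaml against the example above would print 1.2.3.1 and 1.2.3.2, confirming that 1.2.3.3 is no longer listed.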

After updating the cluster.yaml file, run rke up to remove the node:

rke up --config cluster.yaml

This removes the unresponsive control plane / etcd node from the cluster.

Step 3. Clean and add the removed node back to the cluster

Once the rke up invocation has completed without errors, and the node no longer appears in the Rancher UI or in the kubectl get nodes output, it is safe to move on to adding the node back in.
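The removal check can also be scripted. The following is a minimal sketch that reads the output of kubectl get nodes -o name on standard input. It assumes the Kubernetes node name equals the address in cluster.yaml, which may not hold if a different node name is configured, and the function name node_removed is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the given node name does not appear in
# `kubectl get nodes -o name` output read from stdin.
# Assumes the node name matches the address used in cluster.yaml.
node_removed() {
    node="$1"
    # -F: fixed string, -x: whole-line match, -q: no output
    if grep -Fxq "node/$node"; then
        echo "$node is still registered" >&2
        return 1
    fi
    echo "$node is no longer registered"
}
```

For the example cluster: kubectl get nodes -o name | node_removed 1.2.3.3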

First, clean the removed node (1.2.3.3 in our example) using the Extended Rancher 2 Cleanup script.

After cleaning the node, add it back to the cluster configuration file (cluster.yaml):

nodes:
    - address: 1.2.3.1
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.2
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.3
      user: ubuntu
      role:
        - controlplane
        - etcd
[...] # rest of cluster.yaml redacted; only the control plane / etcd nodes are shown

Then run the rke up command again:

rke up --config cluster.yaml
Step 4. Validate final cluster state

Once the rke up command has completed without errors, verify that the node is visible and ready via kubectl get nodes and the Rancher UI.
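The readiness check can be sketched in the same way. The following minimal example reads the output of kubectl get nodes --no-headers on standard input and inspects the STATUS column for one node. It assumes the node name equals the address in cluster.yaml, and the function name node_ready is a hypothetical helper.

```shell
# Hypothetical helper: succeed when the named node reports Ready in
# `kubectl get nodes --no-headers` output read from stdin.
# Columns are expected in the default order: NAME STATUS ROLES AGE VERSION.
node_ready() {
    node="$1"
    awk -v node="$node" '
        $1 == node { found = 1; status = $2 }
        END {
            if (!found) { print node " not found"; exit 1 }
            print node " status: " status
            # STATUS may carry suffixes such as Ready,SchedulingDisabled
            split(status, parts, ",")
            if (parts[1] == "Ready") exit 0
            exit 1
        }
    '
}
```

For the example cluster: kubectl get nodes --no-headers | node_ready 1.2.3.3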

The etcd endpoint health command from Step 1, run on the control plane / etcd nodes, should now show each endpoint as healthy, per the following example output:

https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is healthy: successfully committed proposal: took = 22.065616ms

Further reading

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020033
  • Creation Date: 06-May-2021
  • Modified Date: 06-May-2021
    • SUSE Rancher
