How to remove and replace an unresponsive control plane / etcd node in the local Rancher server cluster, provisioned by the Rancher Kubernetes Engine (RKE) CLI

This document (000020033) is provided subject to the disclaimer at the end of this document.

Environment

A Rancher Kubernetes Engine (RKE) CLI provisioned cluster

A Highly Available control plane / etcd configuration, with an odd number of mixed role control plane / etcd nodes, commonly 3 or 5
The cluster is quorate, i.e. with 3 control plane / etcd nodes at most one node is unresponsive, or with 5 control plane / etcd nodes up to two nodes are unresponsive
The cluster configuration file (e.g. cluster.yml) and .rkestate file (e.g. cluster.rkestate)
The RKE binary and SSH access to the nodes
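
The quorate condition above follows from etcd's majority rule: a cluster of N members needs floor(N/2) + 1 of them available. The following is a minimal sketch of that arithmetic only. The function names are illustrative and are not part of RKE or etcd.

```shell
# Quorum for an etcd cluster of N members is floor(N/2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can be unresponsive while the cluster stays quorate.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the figures for the two common cluster sizes.
for members in 3 5; do
    echo "members=$members quorum=$(quorum_size "$members") tolerated_failures=$(tolerated_failures "$members")"
done
```

For 3 members this gives a quorum of 2 with one tolerated failure, and for 5 members a quorum of 3 with two tolerated failures, matching the conditions listed above.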

Situation

This article details how to remove and replace an unresponsive control plane / etcd node from a local Rancher server cluster, provisioned via the Rancher Kubernetes Engine (RKE) CLI.

Resolution

This operation is relatively simple. The steps below use the following example `cluster.yaml` for demonstration purposes.

N.B. Be sure to use your cluster.yaml and matching cluster.rkestate for the relevant cluster.

In this demonstration example, the node that is failing has the address 1.2.3.3:

nodes:
    - address: 1.2.3.1
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.2
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.3
      user: ubuntu
      role:
        - controlplane
        - etcd
[...] # rest of cluster.yaml (everything except the control plane / etcd nodes) redacted

Step 1. Validate the cluster is quorate and confirm the unresponsive node

On each control plane / etcd node, run the following command, per the Rancher Troubleshooting Documentation, to determine etcd endpoint health:

docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health

On the unresponsive node the command may fail to execute. On the healthy nodes you should see output in the following format, indicating the health status of each endpoint:

{"level":"warn","ts":"2020-12-31T12:11:41.840Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c65a15b4-9646-4c71-914d-f3c892c04c2f/1.2.3.3:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 1.2.3.3:2379: connect: connection refused\""}
https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is unhealthy: failed to commit proposal: context deadline exceeded
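
To pick the failing member out of longer output, the health lines can be filtered. The following is a minimal sketch and is not taken from the Rancher documentation. The helper name unhealthy_endpoints is hypothetical, and it assumes the "is unhealthy:" line format shown above.

```shell
# Print the endpoint URL of every line that etcdctl reports as unhealthy.
# Reads `etcdctl endpoint health` output on stdin; other lines (healthy
# endpoints, JSON warnings) are ignored.
unhealthy_endpoints() {
    awk '/ is unhealthy: / { print $1 }'
}
```

To use it, pipe the health command above into the function, for example by appending `2>&1 | unhealthy_endpoints`. The `2>&1` is there in case the unhealthy lines are written to stderr.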

Step 2. Remove the unresponsive node

Having confirmed which node in the cluster is unresponsive, remove it from the nodes block in the cluster configuration file (cluster.yaml), as in the example below, where 1.2.3.3 has been removed:

nodes:
    - address: 1.2.3.1
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.2
      user: ubuntu
      role:
        - controlplane
        - etcd

[...] # rest of cluster.yaml (everything except the control plane / etcd nodes) redacted
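
Because rke up updates the .rkestate file, it can be prudent to keep a copy of the configuration file and its matching state file before running it. This is a precaution suggested here, not a step from the original procedure. The sketch below is one way to do it, and the function name backup_rke_files and the backup directory layout are assumptions.

```shell
# Copy the cluster config and its matching state file into a timestamped
# backup directory, and print that directory's path.
# Arguments: config file, state file, backup root directory.
backup_rke_files() {
    rke_config=$1
    rke_state=$2
    rke_dest="$3/rke-backup-$(date +%Y%m%d-%H%M%S)"
    # Refuse to continue if either file is missing, since the config and
    # state files must be kept as a matching pair.
    for rke_file in "$rke_config" "$rke_state"; do
        if [ ! -f "$rke_file" ]; then
            echo "missing file: $rke_file" >&2
            return 1
        fi
    done
    mkdir -p "$rke_dest" && cp -p "$rke_config" "$rke_state" "$rke_dest/" && echo "$rke_dest"
}
```

For example: `backup_rke_files cluster.yaml cluster.rkestate ./backups`.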

After updating the cluster.yaml file, run rke up to remove the node:

rke up --config cluster.yaml

The above action will remove the problematic and unresponsive control plane / etcd node.

Step 3. Clean and add the removed node back to the cluster

Once the rke up invocation has completed without any errors, and the node no longer appears in the Rancher UI or in the kubectl get nodes output, it is safe to move on to adding the node back in.
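
One way to confirm the removal from the command line is to check the kubectl get nodes output for the node name. The following is a minimal sketch. It assumes the node was registered under its address (which is how RKE names nodes when hostname_override is not set), and node_listed is a hypothetical helper name.

```shell
# Succeeds (exit 0) if the given node name appears in the first column of
# `kubectl get nodes --no-headers` output read from stdin, and fails otherwise.
node_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Example usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | node_listed 1.2.3.3 \
#       || echo "1.2.3.3 is no longer listed"
```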

First, clean the removed node (1.2.3.3 in our example) using the Extended Rancher 2 Cleanup script.

After cleaning the node, add it back into the cluster configuration file (cluster.yaml):

nodes:
    - address: 1.2.3.1
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.2
      user: ubuntu
      role:
        - controlplane
        - etcd
    - address: 1.2.3.3
      user: ubuntu
      role:
        - controlplane
        - etcd
[...] # rest of cluster.yaml (everything except the control plane / etcd nodes) redacted

Then run the rke up command again:

rke up --config cluster.yaml

Step 4. Validate final cluster state

Once the rke up command has completed without errors, verify that the node is visible and ready via kubectl get nodes and the Rancher UI.

The etcd endpoint health command from Step 1, run on the control plane / etcd nodes, should also show each endpoint as healthy, per the following example output:

https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
https://1.2.3.3:2379 is healthy: successfully committed proposal: took = 22.065616ms
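
This final check can also be scripted. The sketch below succeeds only when the expected number of endpoints report healthy and none report unhealthy. The helper name all_endpoints_healthy is hypothetical, and it assumes the line format shown in the example output above.

```shell
# Succeeds (exit 0) only if stdin contains exactly $1 "is healthy" lines
# and no "is unhealthy" lines. Reads `etcdctl endpoint health` output.
all_endpoints_healthy() {
    awk -v expected="$1" '
        / is healthy: /   { healthy++ }
        / is unhealthy: / { unhealthy++ }
        END { exit !(healthy + 0 == expected + 0 && unhealthy + 0 == 0) }
    '
}
```

To use it with the three-node example, append `2>&1 | all_endpoints_healthy 3 && echo "etcd is fully healthy"` to the endpoint health command.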

Status

Top Issue

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

Document ID: 000020033
Creation Date: 06-May-2021
Modified Date: 22-Jul-2024
- SUSE Rancher
