
How to remove and replace an unresponsive control plane / etcd node in the local Rancher server cluster, provisioned by the Rancher Kubernetes Engine (RKE) CLI

This document (000020033) is provided subject to the disclaimer at the end of this document.

Environment

A Rancher Kubernetes Engine (RKE) CLI provisioned cluster


Situation

This article details how to remove and replace an unresponsive control plane / etcd node from a local Rancher server cluster, provisioned via the Rancher Kubernetes Engine (RKE) CLI.

     

      Resolution

      This operation is relatively straightforward; the example cluster.yaml below is used for demonstration purposes.

      N.B. Be sure to use your cluster.yaml and matching cluster.rkestate for the relevant cluster.

      In this example, the failing node has the address 1.2.3.3:

      nodes:
          - address: 1.2.3.1
            user: ubuntu
            role:
              - controlplane
              - etcd
          - address: 1.2.3.2
            user: ubuntu
            role:
              - controlplane
              - etcd
          - address: 1.2.3.3
            user: ubuntu
            role:
              - controlplane
              - etcd 
      [...] # rest of cluster.yaml, other than the control plane / etcd nodes, redacted

      Step 1. Validate the cluster is quorate and confirm the unresponsive node

      On the control plane / etcd nodes, run the following command (per the Rancher Troubleshooting Documentation) to determine etcd endpoint health:

      docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health

      On the unresponsive node the command may fail to execute; on the healthy nodes you should see output in the following format, indicating the health status of each endpoint:

      {"level":"warn","ts":"2020-12-31T12:11:41.840Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c65a15b4-9646-4c71-914d-f3c892c04c2f/1.2.3.3:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 1.2.3.3:2379: connect: connection refused\""}
      https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
      https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
      https://1.2.3.3:2379 is unhealthy: failed to commit proposal: context deadline exceeded
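
      If it is unclear which etcd member corresponds to the failing endpoint, listing the members from a healthy node can confirm its address before removal. A minimal example follows; the member IDs and names shown are illustrative only, and the exact columns depend on the etcd version in use:

      docker exec etcd etcdctl member list

      1b2c3d4e5f6a7b8c, started, etcd-1.2.3.1, https://1.2.3.1:2380, https://1.2.3.1:2379, false
      2c3d4e5f6a7b8c9d, started, etcd-1.2.3.2, https://1.2.3.2:2380, https://1.2.3.2:2379, false
      3d4e5f6a7b8c9d0e, started, etcd-1.2.3.3, https://1.2.3.3:2380, https://1.2.3.3:2379, false
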
      Step 2. Remove the unresponsive node

      Having confirmed which node is unresponsive, remove it from the nodes block in the cluster configuration file (cluster.yaml), per the example below with 1.2.3.3 removed:

      nodes:
          - address: 1.2.3.1
            user: ubuntu
            role:
              - controlplane
              - etcd
          - address: 1.2.3.2
            user: ubuntu
            role:
              - controlplane
              - etcd
      
      [...] # rest of cluster.yaml, other than the control plane / etcd nodes, redacted

      After updating the cluster.yaml file, execute an rke up run to remove the node:

      rke up --config cluster.yaml

      This removes the unresponsive control plane / etcd node from the cluster.

      Step 3. Clean and add the removed node back to the cluster

      Once the rke up invocation has completed without errors, and you can see the node removed from the Rancher UI or kubectl get nodes output (see the example below), it is safe to move on to adding the node back in.
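
      For example, using the kubeconfig generated by RKE alongside the cluster configuration (typically kube_config_cluster.yaml when the configuration file is named cluster.yaml), only the remaining nodes should be listed; the ages and versions shown are illustrative, and any worker nodes are omitted:

      kubectl --kubeconfig kube_config_cluster.yaml get nodes

      NAME      STATUS   ROLES               AGE   VERSION
      1.2.3.1   Ready    controlplane,etcd   30d   v1.20.6
      1.2.3.2   Ready    controlplane,etcd   30d   v1.20.6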

      First, clean the removed node (1.2.3.3 in our example) using the Extended Rancher 2 Cleanup script.
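
      If the script cannot be used, its effect can be approximated manually. The following is only a rough sketch of the typical steps (remove the cluster containers and volumes, unmount kubelet mounts, and delete the Kubernetes, etcd and Rancher state directories); verify each path and command against your environment before running it:

      # Remove all containers and unused volumes (assumes the node runs only cluster workloads)
      docker rm -f $(docker ps -qa)
      docker volume prune -f

      # Unmount kubelet-related mounts before deleting state directories
      for m in $(mount | grep '/var/lib/kubelet' | awk '{ print $3 }'); do umount "$m"; done

      # Delete Kubernetes, etcd, CNI and Rancher state directories
      rm -rf /etc/cni /etc/kubernetes /opt/cni /opt/rke /run/calico /run/flannel \
        /var/lib/calico /var/lib/cni /var/lib/etcd /var/lib/kubelet /var/lib/rancher /var/run/calico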

      After cleaning the node, add it back into the cluster configuration file (cluster.yaml):

      nodes:
          - address: 1.2.3.1
            user: ubuntu
            role:
              - controlplane
              - etcd
          - address: 1.2.3.2
            user: ubuntu
            role:
              - controlplane
              - etcd
          - address: 1.2.3.3
            user: ubuntu
            role:
              - controlplane
              - etcd 
      [...] # rest of cluster.yaml, other than the control plane / etcd nodes, redacted

      Then run the rke up command again:

      rke up --config cluster.yaml

      Step 4. Validate final cluster state

      Once the rke up command has completed without errors, you can verify that the node is visible and Ready via kubectl get nodes and the Rancher UI, per the example below.
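
      A minimal illustration, again using the RKE-generated kubeconfig; node ages and versions are illustrative, and any worker nodes are omitted:

      kubectl --kubeconfig kube_config_cluster.yaml get nodes

      NAME      STATUS   ROLES               AGE   VERSION
      1.2.3.1   Ready    controlplane,etcd   30d   v1.20.6
      1.2.3.2   Ready    controlplane,etcd   30d   v1.20.6
      1.2.3.3   Ready    controlplane,etcd   5m    v1.20.6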

      The etcd endpoint health command from Step 1, run on the control plane / etcd nodes, should also show each endpoint as healthy, per the following example output:

      https://1.2.3.1:2379 is healthy: successfully committed proposal: took = 13.442336ms
      https://1.2.3.2:2379 is healthy: successfully committed proposal: took = 18.227226ms
      https://1.2.3.3:2379 is healthy: successfully committed proposal: took = 22.065616ms


      Status

      Top Issue

      Disclaimer

      This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

      • Document ID: 000020033
      • Creation Date: 06-May-2021
      • Modified Date: 22-Jul-2024
      • SUSE Rancher
