Troubleshooting - Nodes won't join the cluster or show as unavailable
This document (000020081) is provided subject to the disclaimer at the end of this document.
Situation
Issue - Nodes are not added to Rancher or are not provisioned correctly
The following article should help Rancher administrators diagnose and troubleshoot cases where a node is not added to Rancher or is not provisioned correctly. We'll outline the process nodes undergo when they are added to a cluster.
Scope
We'll kick off by scoping which cluster types this document pertains to: custom clusters and clusters launched with a node driver. Any mention of a node driver is synonymous with the 'With RKE and new nodes in an infrastructure provider' option in the Rancher UI.
Tracing the steps during the bootstrapping of a node
Whether you're using a custom cluster or a cluster launched with a node driver, nodes are added to the cluster by executing a docker run command generated for that cluster. For a custom cluster, the command is generated and displayed in the final step of cluster creation. For a cluster launched with a node driver, the command is generated and executed as the final step after the node is created and Docker is installed.
Note: not all roles may be present in the generated command, depending on which role(s) were selected for the node.
sudo docker run -d \
--privileged \
--restart=unless-stopped \
--net=host \
-v /etc/kubernetes:/etc/kubernetes \
-v /var/run:/var/run \
rancher/rancher-agent:<version> \
--server https://<server_url> \
--token <token> \
--ca-checksum <checksum_value> \
--etcd \
--controlplane \
--worker
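If the node never appears in the Rancher UI after running this command, a good first check on the node itself is whether the bootstrap agent container actually started and what its logs report. Below is a minimal sketch; the bootstrap agent's name is randomly generated, so it is located here by its image name, and <container_id> is a placeholder for the ID or name returned by the first command:
# Find the bootstrap agent container created from the rancher-agent image
docker ps -a | grep rancher-agent
# Follow its logs to watch registration progress
docker logs -f <container_id>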
What happens next:
1. The docker run command launches a bootstrap agent container. It is identified by a randomly generated name.
- The entrypoint is a shell script that parses the flags and runs validation tests on them and their provided values.
- The token is then used to authenticate against your Rancher server in order to interact with it.
- The agent retrieves the CA certificate from the Rancher server and places it in /etc/kubernetes/ssl/certs/serverca, then the --ca-checksum value is used to validate that the retrieved CA certificate matches. This only applies when a self-signed certificate is in use (see the checksum example after this list).
- The agent binary then runs and connects to Rancher using a WebSocket connection.
- The agent checks in with the Rancher server to verify the node is unique, and gets a node plan.
- The agent executes the node plan provided by the Rancher server.
- The docker run command will create the path /etc/kubernetes if it doesn't exist.
- Rancher will run cluster provisioning/reconcile based on the desired role for the node being added (etcd and control plane nodes only). This process copies certificates down from the server via the built-in RKE cluster provisioning.
- On worker nodes, the process is slightly different. The agent requests a node plan from the Rancher server; the Rancher server generates the node config and sends it back down to the agent; the agent then executes the plan contained in the node config. This involves certificate generation for the Kubernetes components and the container create commands for the following services: kubelet, kube-proxy, and nginx-proxy (see the container check example after this list).
- The Rancher agent uses the node plan to write out a cloud-config to configure cloud provider settings.
- If provisioning of the node succeeds, the node registers with the Kubernetes cluster, cattle-node-agent DaemonSet pods are scheduled to the node, and that pod removes and replaces the agent container that was created via the docker run command.
- The share-mnt binary (aka bootstrap phase 2): the share-mnt container runs share-root.sh, which creates filesystem resources that other containers end up using (certificate folders, configuration files, etc.). It spins up another container that runs a share mount binary, which makes sure /var/lib/kubelet and /var/lib/rancher have the right share permissions for systems like boot2docker.
Note: On control plane nodes, all Kubernetes control plane components talk directly with the Kubernetes API server housed on the same node. On worker nodes, the nginx-proxy is configured to front all Kubernetes API servers within the cluster; its nginx.conf should reflect that.
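To confirm this on a worker node, you can print the configuration the nginx-proxy container is running with. This is a hedged example; /etc/nginx/nginx.conf is the usual in-container location, but the exact path may differ between versions:
# Show the nginx-proxy configuration; the control plane node addresses
# should appear as upstream servers
docker exec nginx-proxy cat /etc/nginx/nginx.conf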
- If all goes well, the share-mnt bootstrap and share-root containers exit and the share-root container is removed. The kubelet starts, registers with Kubernetes, and the cattle-node-agent DaemonSet schedules a pod on the node. That pod then takes over the WebSocket connection to the Rancher server (see the kubectl example below). This ends our provisioning journey and hopefully leads to a functional, happy cluster.
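If registration fails at the CA validation step described above, you can compare the checksum of the CA certificate served by Rancher against the --ca-checksum value in the registration command. This is a sketch assuming jq is installed and a self-signed certificate is in use; exact checksum handling can vary slightly between Rancher versions:
# Download the CA certificate Rancher serves and compute its sha256 sum;
# the result should match the --ca-checksum value in the docker run command
curl -k -s https://<server_url>/v3/settings/cacerts | jq -r .value > /tmp/rancher-ca.pem
sha256sum /tmp/rancher-ca.pem | awk '{print $1}'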
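To verify the worker services described above were created, you can list the Kubernetes component containers on the node and check their logs. A short example, run on the node itself:
# Confirm the kubelet, kube-proxy, and nginx-proxy containers are running
docker ps -a | egrep "kubelet|kube-proxy|nginx-proxy"
# Check the most recent kubelet logs for registration or certificate errors
docker logs --tail 50 kubelet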
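Finally, to confirm that the node registered and that a cattle-node-agent pod took over the connection, you can check from a host with a working kubeconfig for the cluster. This assumes the app=cattle-agent label commonly used by the cattle-node-agent DaemonSet:
# Confirm the node registered and is Ready
kubectl get nodes -o wide
# Confirm a cattle-node-agent pod is running on the new node
kubectl -n cattle-system get pods -l app=cattle-agent -o wide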
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000020081
- Creation Date: 06-May-2021
- Modified Date: 06-May-2021
- SUSE Rancher
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com