The system-upgrade-controller is not updating tainted nodes
This document (000021800) is provided subject to the disclaimer at the end of this document.
Environment
- Product: Standalone RKE2 / K3s clusters (not Rancher-deployed clusters)
- Component: system-upgrade-controller
- Versions Affected: All versions of system-upgrade-controller
- Kubernetes Versions: 1.21.x and above
- Operating Systems: All supported Linux distributions
Situation
When performing an automated upgrade of a standalone RKE2 or K3s cluster with the system-upgrade-controller, tainted nodes may be skipped. The upgrade jobs are created, but their pods remain in a Pending state indefinitely on tainted nodes. The system-upgrade-controller logs may show that the plans were applied even though some nodes never receive the upgrade.
Users observe that:
- Non-tainted nodes upgrade successfully
- Tainted nodes (such as Longhorn storage nodes with node.longhorn.io/create-default-disk:NoSchedule) remain on the old version
- The upgrade plan shows as applied, but doesn't complete for all nodes
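To confirm the symptom, it can help to list only the upgrade pods that are stuck. The following is a minimal sketch, assuming the controller and its jobs run in the system-upgrade namespace used by the Plans in this article; the function name pending_upgrade_pods is hypothetical.

```shell
# Hypothetical helper: list upgrade pods that are stuck in Pending.
# Assumes the "system-upgrade" namespace, as used by the Plans in this article.
pending_upgrade_pods() {
  kubectl get pods -n system-upgrade \
    --field-selector=status.phase=Pending \
    -o wide
}
```

Run pending_upgrade_pods after applying a Plan. Pods that the scheduler could not place normally show no assigned node in the NODE column.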
Resolution
To resolve this issue, add tolerations that match the node taints to the spec of your Plan resource. Use one of the following options:
Option 1: Add specific tolerations for Longhorn nodes
If you know the specific taints on your nodes (such as Longhorn storage nodes), you can add those specific tolerations:
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1
  # The version must belong to the same distribution as the upgrade image (K3s here)
  version: v1.32.3+k3s1
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values:
          - linux
  serviceAccountName: system-upgrade
  cordon: true
  upgrade:
    image: rancher/k3s-upgrade
  # Add tolerations for Longhorn tainted nodes
  tolerations:
    - key: "node.longhorn.io/create-default-disk"
      operator: "Exists"
      effect: "NoSchedule"
    # Add any other specific tolerations your nodes might have
Option 2: Ignore all taints (recommended for mixed workload clusters)
For clusters with multiple types of taints, you can use a blanket toleration that ignores all taints:
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 1
  cordon: true
  drain:
    force: true
  nodeSelector:
    matchExpressions:
      - key: beta.kubernetes.io/os
        operator: In
        values:
          - linux
      - key: node-role.kubernetes.io/control-plane
        operator: NotIn
        values:
          - "true"
  prepare:
    args:
      - prepare
      - server-plan
    image: rancher/rke2-upgrade
  serviceAccountName: system-upgrade
  # This single toleration will ignore ALL taints
  tolerations:
    - operator: Exists
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.32.3+rke2r1
Apply the updated Plan:
kubectl apply -f your-plan.yaml
Verify that the upgrade jobs are now being scheduled on the tainted nodes:
kubectl get pods -n system-upgrade -w
If needed, you can check node versions after the upgrade completes:
kubectl get nodes -o wide
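The verification steps above could also be wrapped in two small helpers. This is only a sketch: the function names are hypothetical, and it assumes the Plan lives in the system-upgrade namespace and that the Plan custom resource is registered as plans.upgrade.cattle.io, matching the apiVersion used in the Plans above.

```shell
# Hypothetical helper: print the tolerations stored on a Plan.
# $1 is the Plan name, for example agent-plan.
show_plan_tolerations() {
  kubectl get plans.upgrade.cattle.io "${1:?usage: show_plan_tolerations PLAN}" \
    -n system-upgrade \
    -o jsonpath='{.spec.tolerations}'
}

# Hypothetical helper: show which node each upgrade pod landed on.
upgrade_pod_placement() {
  kubectl get pods -n system-upgrade \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,PHASE:.status.phase
}
```

For example, show_plan_tolerations agent-plan should print the tolerations you added. If the output is empty, the updated Plan was probably not applied.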
Cause
This issue occurs because the pods created by the upgrade jobs do not tolerate arbitrary node taints by default. Unless the Plan specifies matching tolerations, the scheduler will not place an upgrade pod on a tainted node.
The system-upgrade-controller creates upgrade jobs whose pods are standard Kubernetes pods, so they follow the normal scheduling rules. A NoSchedule or NoExecute taint keeps any pod without a matching toleration off the node. As a result, the upgrade pod for a tainted node stays Pending and the node is never upgraded.
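One way to see this scheduling behavior directly is to look at the scheduler events for the upgrade pods. The sketch below assumes the system-upgrade namespace; the function name scheduling_failures is hypothetical. The exact event wording varies between Kubernetes versions, but the message usually names the taint that the pod did not tolerate.

```shell
# Hypothetical helper: show FailedScheduling events for upgrade pods,
# oldest first. Assumes the "system-upgrade" namespace.
scheduling_failures() {
  kubectl get events -n system-upgrade \
    --field-selector=reason=FailedScheduling \
    --sort-by=.lastTimestamp
}
```

The taint key reported in these events is the one to add as a toleration in the Plan.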
Additional Information
- This issue applies only to standalone RKE2/K3s clusters, not to clusters deployed through Rancher
- You can identify taints on your nodes with: kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
- Common taints to be aware of include:
  - Longhorn: node.longhorn.io/create-default-disk:NoSchedule
  - Storage nodes: Various CSI provider taints
  - Control-plane: node-role.kubernetes.io/control-plane:NoSchedule
- A toleration with operator: Exists and no key or effect specified is a wildcard that tolerates all taints
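When writing specific tolerations, it may be easier to look at the raw taints of a single node, because the key, value, and effect fields map directly onto the toleration fields. A minimal sketch, where the function name node_taints and the node name worker-1 are hypothetical:

```shell
# Hypothetical helper: print the taints of one node as raw JSON.
# $1 is the node name as shown by "kubectl get nodes".
node_taints() {
  kubectl get node "${1:?usage: node_taints NODE}" \
    -o jsonpath='{.spec.taints}'
}
```

For example, node_taints worker-1 prints the taint list for that node, or nothing if the node has no taints.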
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021800
- Creation Date: 22-Apr-2025
- Modified Date: 22-Apr-2025
- SUSE Rancher
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com