The system-upgrade-controller is not updating tainted nodes

This document (000021800) is provided subject to the disclaimer at the end of this document.

Environment

  • Product: Standalone Rancher Kubernetes Engine (RKE2) / K3s (not Rancher-deployed clusters)
  • Component: system-upgrade-controller
  • Versions Affected: All versions of system-upgrade-controller
  • Kubernetes Versions: 1.21.x and above
  • Operating Systems: All supported Linux distributions

Situation

When performing an automated upgrade of a standalone RKE2 or K3s cluster with the system-upgrade-controller, nodes with taints may be skipped. The upgrade jobs are created, but the job pods targeting tainted nodes remain in a Pending state indefinitely. The system-upgrade-controller logs may show that the plan was applied, yet some nodes never receive the upgrade.

Users observe that:

  • Non-tainted nodes upgrade successfully
  • Tainted nodes (such as Longhorn storage nodes with node.longhorn.io/create-default-disk:NoSchedule) remain on the old version
  • The upgrade plan shows as applied, but does not complete for all nodes (see the diagnostic check below)
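
To confirm that scheduling is the problem, describe one of the Pending upgrade pods (the pod name below is a placeholder):

kubectl get pods -n system-upgrade
kubectl describe pod <pending-upgrade-pod> -n system-upgrade

The pod events typically contain a FailedScheduling warning that references an untolerated taint; the exact wording varies by Kubernetes version.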

Resolution

To resolve this issue, modify your Plan resource to include the appropriate tolerations for the tainted nodes. Here's how to do it:

Option 1: Add specific tolerations for Longhorn nodes
If you know the taints applied to your nodes (for example, Longhorn storage nodes), add matching tolerations to the Plan:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1
  version: v1.32.3+k3s1
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/os
        operator: In
        values:
          - linux
  serviceAccountName: system-upgrade
  cordon: true
  upgrade:
    image: rancher/k3s-upgrade
  # Add tolerations for Longhorn tainted nodes
  tolerations:
    - key: "node.longhorn.io/create-default-disk"
      operator: "Exists"
      effect: "NoSchedule"
    # Add any other specific tolerations your nodes might have
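
If the taint on your nodes carries a value (for example node.longhorn.io/create-default-disk=true:NoSchedule), you can also match it exactly with operator Equal instead of Exists. The value "true" below is only an example and must match the taint actually set on your nodes:

  tolerations:
    - key: "node.longhorn.io/create-default-disk"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"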

Option 2: Ignore all taints (recommended for mixed workload clusters)
For clusters with multiple types of taints, you can use a blanket toleration that ignores all taints:

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 1
  cordon: true
  drain:
    force: true
  nodeSelector:
    matchExpressions:
    - key: kubernetes.io/os
      operator: In
      values:
      - linux
    - key: node-role.kubernetes.io/control-plane
      operator: NotIn
      values:
      - "true"
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/rke2-upgrade
  serviceAccountName: system-upgrade
  # This single toleration will ignore ALL taints
  tolerations:
  - operator: Exists
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.32.3+rke2r1
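
The prepare step above waits for a companion Plan named server-plan to finish on the control-plane nodes before the agents are upgraded. That Plan is not shown in this article; a minimal sketch, assuming the same target version, could look like the following (adjust the tolerations to match the taints actually present on your control-plane and etcd nodes):

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
  labels:
    rke2-upgrade: server
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/control-plane
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/etcd
    operator: Exists
    effect: NoExecute
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.32.3+rke2r1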

Apply the updated Plan:

kubectl apply -f your-plan.yaml

Verify that the upgrade jobs are now being scheduled on the tainted nodes:

kubectl get pods -n system-upgrade -w

If needed, you can check node versions after the upgrade completes:

kubectl get nodes -o wide
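
The controller also tracks progress by stamping each node with a plan.upgrade.cattle.io/<plan-name> label containing the resolved hash of the plan. Assuming the agent-plan example above, you can display that label per node:

kubectl get nodes -L plan.upgrade.cattle.io/agent-plan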

Cause

This issue occurs because the upgrade job pods created by the system-upgrade-controller do not, by default, tolerate custom node taints. The pods respect node taints unless the Plan explicitly defines matching tolerations.

The system-upgrade-controller creates upgrade jobs as standard Kubernetes pods, and these pods follow the normal scheduling rules, including respecting node taints. Since taints are specifically designed to prevent pods from being scheduled on nodes, the upgrade jobs can't be scheduled on tainted nodes.
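
You can verify this by comparing the tolerations of a generated upgrade pod with the taints on the node it should run on (the pod and node names below are placeholders):

kubectl get pod <upgrade-pod> -n system-upgrade -o jsonpath='{.spec.tolerations}'
kubectl get node <node-name> -o jsonpath='{.spec.taints}'

If a taint on the node has no matching toleration in the pod spec, the pod stays Pending.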

Additional Information

  • This issue applies only to standalone RKE2/K3s clusters, not to clusters deployed through Rancher
  • You can identify taints on your nodes with: kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints (example output is shown after this list)
  • Common taints to be aware of include:
    • Longhorn: node.longhorn.io/create-default-disk:NoSchedule
    • Storage nodes: Various CSI provider taints
    • Control-plane: node-role.kubernetes.io/control-plane:NoSchedule
  • The operator: Exists toleration with no key specified is a wildcard that tolerates all taints
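
For reference, the taint listing command mentioned above produces output similar to the following (node names are examples only):

NAME       TAINTS
master-1   [map[effect:NoSchedule key:node-role.kubernetes.io/control-plane]]
storage-1  [map[effect:NoSchedule key:node.longhorn.io/create-default-disk]]
worker-1   <none>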

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000021800
  • Creation Date: 22-Apr-2025
  • Modified Date: 22-Apr-2025
  • SUSE Rancher