SUSE AI: A new foundation for your private AI implementation (part 3)
Last time…
This is a series of articles about SUSE AI; please take a look at the first article here.
In the last article, we really got things going! We ended up with a working Kubernetes cluster using RKE2 on all nodes, plus a working Rancher Manager. In this article, we’ll focus on configuring a few aspects of our cluster that will be needed by our SUSE AI stack further on.
About Application Collection
The Application Collection is a collection of artifacts curated by SUSE that follows best practices in supply chain security, and is the preferred source for all components used in SUSE AI and Rancher Prime. In order to use it, you need to create a Secret object containing your Application Collection service account credentials in the namespace you’ll deploy to. Without it, you’ll get errors whenever a referenced image needs to be pulled.
IMPORTANT: Secret objects need to be defined in the namespace where they will be used, for each deployment. You cannot refer to a Secret object from another namespace. This is a Kubernetes limitation.
So, for example, if you are going to deploy applications in the “suseai” namespace, you need to do the following.
First, create the namespace you’ll use for the deployment if it’s not already been created:
# kubectl create namespace suseai
Then, create the shared secret for accessing the Application Collection. It will be used by all subsequent component installations in that namespace:
# kubectl create secret docker-registry application-collection --docker-server=dp.apps.rancher.io --docker-username=<USERNAME FROM APP COLLECTION SERVICE ACCOUNT> --docker-password=<SERVICE ACCOUNT TOKEN FROM APP COLLECTION> -n suseai
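If you want to double-check that the Secret landed in the right namespace:
# kubectl get secret application-collection -n suseai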
Log in to the Helm registry:
# helm registry login dp.apps.rancher.io -u <USERNAME FROM APP COLLECTION SERVICE ACCOUNT> -p <SERVICE ACCOUNT TOKEN FROM APP COLLECTION>
And you should be ready to start deploying Helm charts from Application Collection!
IMPORTANT: ALL Helm charts contained in the Application Collection expect a Secret object named “application-collection”, containing these credentials, in the same namespace.
If you’re getting errors like “ErrImagePull” or “ImagePullBackOff”, you can indicate which Secret to use in the Helm command line with an extra parameter:
--set "global.imagePullSecrets[0].name=application-collection"
Or, if you’re editing a YAML file, add this section:
global:
  imagePullSecrets:
    - application-collection
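For example, assuming the snippet above was saved to a file called custom-values.yaml (a name picked just for this example) and using placeholder release and chart names, an install would look like this:
# helm upgrade --install <RELEASE NAME> oci://dp.apps.rancher.io/charts/<CHART NAME> -n suseai -f custom-values.yaml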
About Storage Classes
StorageClass objects are the preferred way for Kubernetes deployments to allocate and use persistent storage. A deployment requests a PersistentVolumeClaim (PVC) for each area of persistent storage it needs. These PVCs are handled by the CSI drivers configured on your cluster. A CSI driver is responsible for implementing a special interface that Kubernetes understands and for providing PersistentVolume objects (PVs) back to the deployments.
For some of the SUSE AI components, you’ll need a StorageClass with the ALLOWVOLUMEEXPANSION feature enabled. Without it, some deployments will fail silently, especially Milvus.
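You can check which of your existing StorageClasses already allow expansion by looking at the ALLOWVOLUMEEXPANSION column:
# kubectl get storageclass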
One option is to use SUSE Storage (née Longhorn), which provides such a StorageClass by leveraging each cluster node’s disks as a storage pool, while also providing redundancy in the form of volume replicas.
Another option, if you own a storage array (like the ones sold by Dell, Pure Storage, NetApp, Huawei, etc.), is to install its compatible CSI driver. If you do not have such a CSI driver available or are installing SUSE AI in a lab, consider using the NFS CSI storage class. It’s simple to set up, and will use an NFS share from one of your nodes as storage.
Installing the NFS CSI provider
To install the NFS CSI provider, run the following commands on your cluster:
# kubectl create secret docker-registry application-collection --docker-server=dp.apps.rancher.io --docker-username=<USERNAME FROM APP COLLECTION SERVICE ACCOUNT> --docker-password=<SERVICE ACCOUNT TOKEN FROM APP COLLECTION> -n kube-system
# helm registry login dp.apps.rancher.io -u <USERNAME FROM APP COLLECTION SERVICE ACCOUNT> -p <SERVICE ACCOUNT TOKEN FROM APP COLLECTION>
# helm upgrade --install csi-driver-nfs oci://dp.apps.rancher.io/charts/kubernetes-csi-driver-nfs -n kube-system --set "global.imagePullSecrets[0].name=application-collection"
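Once the chart is deployed, the controller and node pods should come up in the kube-system namespace (the exact pod names may vary between chart versions):
# kubectl get pods -n kube-system | grep csi-nfs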
Now you need to create an NFS server somewhere that is reachable by DNS name from the whole cluster.
If you’re running a SLE Micro-based system:
# transactional-update pkg install nfs-kernel-server
A reboot is needed to apply the changes.
If you’re running a SLE-based system:
# zypper install nfs-kernel-server
Create a mount point and an NFS share:
# mkdir /data
# vim /etc/exports.d/data.exports
/data *(rw,sync,no_root_squash)
Enable NFS on the server:
# systemctl enable --now nfs-server
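To confirm that the share is actually being exported, you can run the following (the second command from any machine that has the NFS client tools installed):
# exportfs -v
# showmount -e <NFS hostname>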
Create a StorageClass:
# vim nfs-csi.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: <NFS hostname>
  share: <NFS share name>
  # csi.storage.k8s.io/provisioner-secret is only needed for providing mountOptions in DeleteVolume
  # csi.storage.k8s.io/provisioner-secret-name: "mount-options"
  # csi.storage.k8s.io/provisioner-secret-namespace: "default"
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1

# kubectl apply -f nfs-csi.yaml
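If you also want this class to allow volume expansion (see the ALLOWVOLUMEEXPANSION note above), adding allowVolumeExpansion: true to the manifest should do it, assuming your csi-driver-nfs release ships the resizer sidecar. Either way, you can confirm the class was registered:
# kubectl get storageclass nfs-csi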
Then just use “nfs-csi” as the storage class name for your upcoming PVCs and you’re done!
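As an illustration, here’s a minimal PVC using that class (the name, namespace and size are just placeholders for this example):
# vim example-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
  namespace: suseai
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 10Gi
# kubectl apply -f example-pvc.yaml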
Installing the NVIDIA GPU Operator
The NVIDIA GPU Operator acts as a middleman between the native GPU drivers (which we previously installed on our nodes with GPUs) and the container/Kubernetes world. It’s responsible for correctly identifying which nodes have working GPUs and labeling them accordingly.
These labels are used by deployments that need GPUs, telling Kubernetes which nodes they can be scheduled and run on. The GPU Operator also provides the internal Kubernetes objects for vGPUs, which can be requested by your deployments.
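Once the operator is installed, a workload asks for a GPU through the nvidia.com/gpu resource. Here’s a minimal sketch (the pod name and image are placeholders, not part of this series):
# vim gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: cuda
      image: <CUDA-capable image>
      resources:
        limits:
          nvidia.com/gpu: 1
# kubectl apply -f gpu-test.yaml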
First, add the helm repository for NVIDIA:
# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
# helm repo update
Create a values.yaml configuration file with the following content, paying attention to the settings of these environment variables:
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
Be VERY attentive to the spaces; they’re essential in YAML. These values tell the GPU Operator how to communicate with the containerd daemon provided by RKE2.
Next, install the GPU operator itself:
# helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace --set driver.enabled=false -f values.yaml
You should see the new deployments appearing in the “gpu-operator” namespace. Be patient: it can take a few minutes until all nodes are properly validated and show the Active status.
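You can follow the progress from the command line as well:
# kubectl get pods -n gpu-operator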
To check whether your GPU nodes are properly recognized, go to the Rancher UI, select your downstream cluster, then click on Cluster -> Nodes. Inspect the labels appearing below each node entry. You should see labels like “nvidia.com/gpu=1” on the GPU nodes that were correctly validated.
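The same check can be done from the command line (the exact set of labels depends on the GPU Feature Discovery version):
# kubectl get nodes --show-labels | grep nvidia.com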
You’re now set to run AI applications!
What’s next?
In the next article, we’ll take a look at installing more AI components: Milvus, Ollama and OpenWebUI. We’re getting close!