SUSE AI: A new foundation for your private AI implementation (part 3)


Last time…

This is a series of articles about SUSE AI; please take a look at the first article here.

In the last article, we really got things going! We ended up with a working Kubernetes cluster running RKE2 on all nodes, and with a working Rancher Manager. In this article, we’ll focus on configuring some aspects of our cluster that will be useful for the SUSE AI stack further on.

About Application Collection

The Application Collection is a collection of artifacts curated by SUSE that follows best practices in supply chain security, and it is the preferred source for all components used in SUSE AI and Rancher Prime. In order to use it, you need to create a Secret object containing your Application Collection service account credentials in the namespace you will deploy into. Without it, you’ll get errors whenever a referenced image is pulled.

IMPORTANT: Secret objects must be defined in the same namespace as the deployment that uses them. You cannot reference a Secret object from another namespace; this is a Kubernetes limitation.

So, for example, if you are going to deploy applications in the “suseai” namespace, you need to do the following.

First, create the namespace you’ll use for the deployment, if it hasn’t been created already:

# kubectl create namespace suseai

Then, create the shared secret for accessing the Application Collection. It will be used by all subsequent component installations in that namespace:

# kubectl create secret docker-registry application-collection --docker-server=dp.apps.rancher.io --docker-username=<USERNAME FROM APP COLLECTION SERVICE ACCOUNT> --docker-password=<SERVICE ACCOUNT TOKEN FROM APP COLLECTION> -n suseai

Log in to the Helm registry:

# helm registry login dp.apps.rancher.io -u <USERNAME FROM APP COLLECTION SERVICE ACCOUNT> -p <SERVICE ACCOUNT TOKEN FROM APP COLLECTION>

And you should be ready to start deploying Helm charts from Application Collection!
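
For example, once the Secret and the registry login are in place, any chart you are entitled to can be pulled or installed straight from the OCI registry. The chart and release names below are placeholders, not a specific recommendation:

# helm pull oci://dp.apps.rancher.io/charts/<CHART NAME>
# helm upgrade --install <RELEASE NAME> oci://dp.apps.rancher.io/charts/<CHART NAME> -n suseai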


IMPORTANT: ALL Helm charts in the Application Collection expect these credentials in a Secret object named “application-collection” in the same namespace.

If you’re getting errors like “ErrImagePull” or “ImagePullBackOff”, you can indicate which Secret to use on the Helm command line with an extra parameter:

--set "global.imagePullSecrets[0].name=application-collection"


Or, if you’re editing a YAML file, add this section:

global:
  imagePullSecrets:
  - application-collection
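
As a quick sanity check (using the “suseai” namespace from the example above), you can confirm that the Secret exists before deploying, and inspect a failing pod’s events if images still don’t pull:

# kubectl get secret application-collection -n suseai
# kubectl describe pod <FAILING POD NAME> -n suseai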


About Storage Classes

StorageClass objects are the preferred way for Kubernetes deployments to allocate and use persistent storage. A deployment requests a Persistent Volume Claim (PVC) for each persistent area of storage it needs. These PVCs are served by the CSI drivers configured on your cluster. A CSI driver implements a standard interface that Kubernetes understands and provides Persistent Volume objects (PVs) back to the deployments.

For some of the SUSE AI components, you’ll need a StorageClass with the ALLOWVOLUMEEXPANSION feature enabled. Without it, some of the deployments will silently fail, in particular Milvus.
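
You can check which StorageClass objects on your cluster already have that feature enabled; the ALLOWVOLUMEEXPANSION column should show “true” for the class you intend to use:

# kubectl get storageclass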

One option is to use SUSE Storage (née Longhorn), which provides such a StorageClass by leveraging each cluster node’s disks as a storage pool, while also providing redundancy in the form of volume replicas.

Another option, if you own a storage array (like the ones sold by Dell, Pure Storage, NetApp, Huawei, etc.), is to install its compatible CSI driver. If you don’t have such a CSI driver available or are installing SUSE AI in a lab, consider using the NFS CSI storage class. It’s simple to set up and will use an NFS share from one of your nodes as storage.


Installing the NFS CSI provider

To install the NFS CSI provider, run the following commands against your cluster:

# kubectl create secret docker-registry application-collection --docker-server=dp.apps.rancher.io --docker-username=<USERNAME FROM APP COLLECTION SERVICE ACCOUNT> --docker-password=<SERVICE ACCOUNT TOKEN FROM APP COLLECTION> -n kube-system
# helm registry login dp.apps.rancher.io -u <USERNAME FROM APP COLLECTION SERVICE ACCOUNT> -p <SERVICE ACCOUNT TOKEN FROM APP COLLECTION>
# helm upgrade --install csi-driver-nfs oci://dp.apps.rancher.io/charts/kubernetes-csi-driver-nfs -n kube-system --set "global.imagePullSecrets[0].name=application-collection"
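
To confirm that the driver was registered correctly (pod names may vary slightly between chart versions), check for the CSIDriver object and the driver pods in the kube-system namespace:

# kubectl get csidrivers
# kubectl get pods -n kube-system | grep -i nfs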

Now you need an NFS server that is reachable by its DNS name from the whole cluster.

If you’re running a SLE Micro-based system:

# transactional-update pkg install nfs-kernel-server

A reboot is needed to apply the changes.


If you’re running a SLE-based system:

# zypper install nfs-kernel-server


Create a mount point and an NFS share:

# mkdir -p /data
# vim /etc/exports.d/data.exports
/data *(rw,sync,no_root_squash)

Enable NFS on the server:

# systemctl enable --now nfs-server
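
As a quick check (assuming the standard NFS utilities are installed alongside the server), verify that the share is actually being exported:

# exportfs -v
# showmount -e localhost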


Create a StorageClass:

# vim nfs-csi.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: <NFS hostname>
  share: <NFS share name>
  # csi.storage.k8s.io/provisioner-secret is only needed for providing mountOptions in DeleteVolume
  # csi.storage.k8s.io/provisioner-secret-name: "mount-options"
  # csi.storage.k8s.io/provisioner-secret-namespace: "default"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
mountOptions:
  - nfsvers=4.1

# kubectl apply -f nfs-csi.yaml

Then just use “nfs-csi” as the storage class name for the next PVCs and you’re done!
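
As an illustration (the name, namespace and size below are arbitrary), a PVC that consumes the new class follows the same pattern as the other YAML files in this series:

# vim test-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: suseai
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 5Gi

# kubectl apply -f test-pvc.yaml
# kubectl get pvc -n suseai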


Installing the NVIDIA GPU Operator


The NVIDIA GPU Operator acts as a middleman between the native GPU drivers (which we installed previously on our GPU nodes) and the container/Kubernetes world. It is responsible for correctly identifying which nodes have working GPUs and for labeling them accordingly.

Kubernetes uses these labels to figure out on which nodes your GPU-dependent deployments can be scheduled and run. The GPU Operator also provides the internal Kubernetes objects for GPUs and vGPUs, which your deployments can request.
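
To illustrate how a workload consumes this later on (this sketch is not part of the installation; the pod name and image are placeholders), a GPU is requested through the container’s resource limits, optionally pinning the NVIDIA runtime class:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia   # optional when the NVIDIA runtime is set as containerd's default
  containers:
  - name: cuda
    image: <CUDA-ENABLED IMAGE>
    resources:
      limits:
        nvidia.com/gpu: 1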

First, add the NVIDIA Helm repository:

# helm repo add nvidia https://helm.ngc.nvidia.com/nvidia

# helm repo update

Create a values.yaml configuration file and check the settings of the following environment variables:

# vim values.yaml
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

Be VERY attentive to the spaces; indentation is essential in YAML. These values tell the GPU Operator how to communicate with the containerd daemon provided by RKE2 (the namespace where its deployments are created is set on the helm command line below).

Next, install the GPU operator itself:

# helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace --set driver.enabled=false -f values.yaml

You should see new deployments appearing in the “gpu-operator” namespace. Be patient; it can take a few minutes until all nodes are properly validated and show the Active status.

To check whether your GPU nodes were properly recognized, go to the Rancher UI, select your downstream cluster, then click on Cluster -> Nodes. Inspect the labels appearing below each node entry. You should see labels like “nvidia.com/gpu=1” on the GPU nodes that were correctly validated.
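
If you prefer the command line (the exact label names can differ between GPU Operator versions), you can inspect a GPU node directly and look for the nvidia.com labels and the nvidia.com/gpu resource:

# kubectl get nodes --show-labels | grep nvidia.com
# kubectl describe node <GPU NODE NAME> | grep -i nvidia.com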

You’re now set to run AI applications!

What’s next?

In the next article, we’ll take a look at installing more AI components: Milvus, Ollama and OpenWebUI. We’re getting close!


Erico Mendonca, Solution Architect @ SUSE