SUSE AI: A new foundation for your private AI implementation (part 7)
Last time…
In the last article, we installed SUSE Observability, a powerful tool that gives us insights into all aspects of our cluster and workloads. Today, we’ll add the metrics module for our LLM and GPU workloads and explore a bit of the web UI!
Adding GenAI Observability metrics
To collect GenAI-related metrics (GPU, LLM and vector database workloads) in SUSE Observability, you’ll need to install the relevant add-on from the Application Collection.
First, we’ll need to retrieve both the API and CLI keys from SUSE Observability.
- Open the SUSE Observability web UI
- Open the menu in the top left corner, scroll down and click on “CLI”
- Make note of the value under “Your API token”. This is the CLI API Key.
- Open the side menu again, scroll down and click on “StackPacks”
- Click on “Kubernetes”, then click on the active instance (with a green checkmark)
- In the first “helm” command example, you will see the parameters whose values we’ll need to collect:
- stackstate.apiKey – this is the main API Key.
- stackstate.cluster.name – this is our observed cluster name.
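If you would rather stay on the command line, and you still have the values file that was generated when we installed SUSE Observability in the previous article, the API key and cluster name can also be read directly from it. The file name below is an assumption; adjust it to whatever you used:

# grep -iE 'apikey|cluster' baseConfig_values.yaml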
Now, we’ll need to prepare a “values” file with the information we collected.
# vim genai-values.yaml
serverUrl: http://suse-observability-router.suse-observability.svc.cluster.local:8080
apiKey: <value taken from stackstate.apiKey>
apiToken: <value taken from the CLI key>
clusterName: <value taken from stackstate.cluster.name>
The next step is to install the GenAI Observability helm chart:
# helm upgrade --install --namespace suse-observability -f genai-values.yaml genai-observability oci://dp.apps.rancher.io/charts/suse-ai-observability-extension --version 1.0.1
Just wait for the deployment to finish, and you should now have the extension installed.
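If you want to confirm the extension finished deploying before moving on, you can watch the pods in the namespace. The filter below is just an illustration; the exact pod names depend on the release name you chose:

# kubectl get pods -n suse-observability | grep -i genai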
Installing the OpenTelemetry Collector
The OpenTelemetry Collector is needed to collect the GPU metrics and forward them to SUSE Observability.
First, we need to create the Secret object used by the collector:
# kubectl create secret generic open-telemetry-collector \
    --namespace suse-observability \
    --from-literal=API_KEY='<value taken from stackstate.apiKey>'
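You can confirm the Secret was created and that the key decodes back to the value you pasted:

# kubectl -n suse-observability get secret open-telemetry-collector -o jsonpath='{.data.API_KEY}' | base64 -d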
Create a new file named otel-values.yaml with the following content.
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector
mode: deployment
image:
  repository: "otel/opentelemetry-collector-k8s"
ports:
  metrics:
    enabled: true
presets:
  kubernetesAttributes:
    enabled: true
    extractAllPodLabels: true
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: 'gpu-metrics'
            scrape_interval: 10s
            scheme: http
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - gpu-operator
  exporters:
    otlp:
      endpoint: http://suse-observability-otel-collector.suse-observability.svc.cluster.local:4317
      headers:
        Authorization: "SUSEObservability ${env:API_KEY}"
      tls:
        insecure: true
  processors:
    tail_sampling:
      decision_wait: 10s
      policies:
        - name: rate-limited-composite
          type: composite
          composite:
            max_total_spans_per_second: 500
            policy_order: [errors, slow-traces, rest]
            composite_sub_policy:
              - name: errors
                type: status_code
                status_code:
                  status_codes: [ ERROR ]
              - name: slow-traces
                type: latency
                latency:
                  threshold_ms: 1000
              - name: rest
                type: always_sample
            rate_allocation:
              - policy: errors
                percent: 33
              - policy: slow-traces
                percent: 33
              - policy: rest
                percent: 34
    resource:
      attributes:
        - key: k8s.cluster.name
          action: upsert
          value: CLUSTER_NAME  # replace with your observed cluster name (stackstate.cluster.name)
        - key: service.instance.id
          from_attribute: k8s.pod.uid
          action: insert
    filter/dropMissingK8sAttributes:
      error_mode: ignore
      traces:
        span:
          - resource.attributes["k8s.node.name"] == nil
          - resource.attributes["k8s.pod.uid"] == nil
          - resource.attributes["k8s.namespace.name"] == nil
          - resource.attributes["k8s.pod.name"] == nil
  connectors:
    spanmetrics:
      metrics_expiration: 5m
      namespace: otel_span
    routing/traces:
      error_mode: ignore
      table:
        - statement: route()
          pipelines: [traces/sampling, traces/spanmetrics]
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        receivers: [otlp, jaeger]
        processors: [filter/dropMissingK8sAttributes, memory_limiter, resource]
        exporters: [routing/traces]
      traces/spanmetrics:
        receivers: [routing/traces]
        processors: []
        exporters: [spanmetrics]
      traces/sampling:
        receivers: [routing/traces]
        processors: [tail_sampling, batch]
        exporters: [debug, otlp]
      metrics:
        receivers: [otlp, spanmetrics, prometheus]
        processors: [memory_limiter, resource, batch]
        exporters: [debug, otlp]
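The prometheus receiver above discovers its scrape targets through the Kubernetes API, limited to endpoints in the gpu-operator namespace. Assuming the NVIDIA GPU Operator installed earlier in this series exposes the DCGM exporter there (the service name may differ in your setup), you can verify that there is actually something to scrape:

# kubectl -n gpu-operator get endpoints
# kubectl -n gpu-operator get svc | grep -i dcgm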
Also create a file for the necessary RBAC bindings, called otel-rbac.yaml:
# vim otel-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: suse-observability-otel-scraper
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: suse-observability-otel-scraper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: suse-observability-otel-scraper
subjects:
  - kind: ServiceAccount
    name: opentelemetry-collector
    namespace: suse-observability
Next, install the collector:
# helm upgrade --install opentelemetry-collector oci://dp.apps.rancher.io/charts/opentelemetry-collector --version 0.126.0 -f otel-values.yaml -n suse-observability
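If the chart installed cleanly, the collector comes up as a single Deployment. A quick sanity check, plus a peek at its logs for scrape or export errors (the deployment name matches the release name used above, so adjust it if you picked a different one; permission errors on the gpu-operator endpoints are expected until the RBAC bindings from the next step are applied):

# kubectl -n suse-observability get pods -l app.kubernetes.io/name=opentelemetry-collector
# kubectl -n suse-observability logs deploy/opentelemetry-collector --tail=50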
Apply the RBAC bindings in the gpu-operator namespace, so the collector is allowed to discover the GPU metrics endpoints there:
# kubectl apply -n gpu-operator -f otel-rbac.yaml
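You can verify the permission took effect by impersonating the collector’s ServiceAccount; this should now answer “yes”:

# kubectl auth can-i list endpoints -n gpu-operator --as=system:serviceaccount:suse-observability:opentelemetry-collector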
What’s Next?
In the next article, we’ll wrap up the series and get some insights on the future of SUSE AI. See you then!