SUSE AI: A new foundation for your private AI implementation (part 7)
Last time…
In the last article, we installed SUSE Observability, a powerful tool for getting insights into all aspects of our cluster and workloads. Today, we’ll add the metrics module for our LLM and GPU workloads and explore a bit of the WebUI!
Adding GenAI Observability metrics
To collect GenAI-related metrics (GPU, LLM, vector databases) in SUSE Observability, you’ll need to install the relevant add-on from Application Collection.
First, we’ll need to retrieve both the API and CLI keys from SUSE Observability.
- Open the SUSE Observability web UI
- Open the menu in the top-left corner, scroll down, and click on “CLI”
- Make note of the value under “Your API token”. This is the CLI API Key.
- Open the side menu again, scroll down and click on “StackPacks”
- Click on “Kubernetes”, then click on the active instance (with a green checkmark)
- In the first “helm” command example, you will see parameters with the values we’ll need to collect:
  - stackstate.apiKey – this is the main API Key.
  - stackstate.cluster.name – this is our observed cluster name.
Now, we’ll need to prepare a “values” file with the information we collected.
# vim genai-values.yaml
serverUrl: http://suse-observability-router.suse-observability.svc.cluster.local:8080
apiKey: <value taken from stackstate.apiKey>
apiToken: <value taken from the CLI key>
clusterName: <value taken from stackstate.cluster.name>
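If you would rather not edit the file by hand, the same content can be generated from shell variables. This is only a sketch: the function name and the variable names SO_API_KEY, SO_CLI_TOKEN and SO_CLUSTER_NAME are made up for illustration, and only the four keys and the server URL come from the file above. It also assumes the values contain no characters that need YAML quoting.

```shell
# Hypothetical helper: prints the content of genai-values.yaml to stdout.
# SO_API_KEY, SO_CLI_TOKEN and SO_CLUSTER_NAME are illustrative variable
# names; the chart itself does not read them.
render_genai_values() {
  # Refuse to render a file with empty values.
  if [ -z "${SO_API_KEY:-}" ] || [ -z "${SO_CLI_TOKEN:-}" ] || [ -z "${SO_CLUSTER_NAME:-}" ]; then
    echo "SO_API_KEY, SO_CLI_TOKEN and SO_CLUSTER_NAME must all be set" >&2
    return 1
  fi
  cat <<EOF
serverUrl: http://suse-observability-router.suse-observability.svc.cluster.local:8080
apiKey: ${SO_API_KEY}
apiToken: ${SO_CLI_TOKEN}
clusterName: ${SO_CLUSTER_NAME}
EOF
}
```

With the three variables set, `render_genai_values > genai-values.yaml` writes the file.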
The next step is to install the GenAI Observability helm chart:
# helm upgrade --install --namespace suse-observability -f genai-values.yaml genai-observability oci://dp.apps.rancher.io/charts/suse-ai-observability-extension --version 1.0.1
Just wait for the deployment to finish, and you should now have the extension installed.
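One way to check on the deployment is to ask Helm for the release status and list the pods in the namespace. The sketch below wraps both commands in a small function; the function name is hypothetical, and the release name and namespace are simply the ones from the install command above.

```shell
# Sketch: report on the extension after "helm upgrade --install".
check_genai_release() {
  release="${1:-genai-observability}"
  namespace="${2:-suse-observability}"
  # "helm status" prints the state of the release (for example "deployed").
  helm status "$release" --namespace "$namespace" || return 1
  # List the pods so you can see whether they have started.
  kubectl get pods --namespace "$namespace"
}
```

Calling `check_genai_release` without arguments uses the defaults shown.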
Installing the OpenTelemetry Collector
The collector is needed to gather the GPU metrics and forward them to SUSE Observability.
First, we need to create the Secret object used by the collector:
# kubectl create secret generic open-telemetry-collector \
    --namespace suse-observability \
    --from-literal=API_KEY='<value taken from stackstate.apiKey>'
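To confirm that the Secret exists and carries a non-empty API_KEY entry without printing the key, something like the following could be used. The function name is hypothetical; the Secret name, key name and namespace are the ones from the command above.

```shell
# Sketch: succeed only if the Secret has a non-empty API_KEY entry.
secret_has_api_key() {
  namespace="${1:-suse-observability}"
  # jsonpath prints the base64-encoded value of API_KEY, or nothing at all.
  encoded="$(kubectl get secret open-telemetry-collector \
    --namespace "$namespace" -o jsonpath='{.data.API_KEY}')" || return 1
  [ -n "$encoded" ]
}
```

For example: `secret_has_api_key && echo "secret looks fine"`.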
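Create a new file named otel-values.yaml with the following content. CLUSTER_NAME is a placeholder; replace it with the observed cluster name (the stackstate.cluster.name value noted earlier):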
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector
mode: deployment
image:
  repository: "otel/opentelemetry-collector-k8s"
ports:
  metrics:
    enabled: true
presets:
  kubernetesAttributes:
    enabled: true
    extractAllPodLabels: true
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
          - job_name: 'gpu-metrics'
            scrape_interval: 10s
            scheme: http
            kubernetes_sd_configs:
              - role: endpoints
                namespaces:
                  names:
                    - gpu-operator
  exporters:
    otlp:
      endpoint: http://suse-observability-otel-collector.suse-observability.svc.cluster.local:4317
      headers:
        Authorization: "SUSEObservability ${env:API_KEY}"
      tls:
        insecure: true
  processors:
    tail_sampling:
      decision_wait: 10s
      policies:
        - name: rate-limited-composite
          type: composite
          composite:
            max_total_spans_per_second: 500
            policy_order: [errors, slow-traces, rest]
            composite_sub_policy:
              - name: errors
                type: status_code
                status_code:
                  status_codes: [ ERROR ]
              - name: slow-traces
                type: latency
                latency:
                  threshold_ms: 1000
              - name: rest
                type: always_sample
            rate_allocation:
              - policy: errors
                percent: 33
              - policy: slow-traces
                percent: 33
              - policy: rest
                percent: 34
    resource:
      attributes:
        - key: k8s.cluster.name
          action: upsert
          value: CLUSTER_NAME
        - key: service.instance.id
          from_attribute: k8s.pod.uid
          action: insert
    filter/dropMissingK8sAttributes:
      error_mode: ignore
      traces:
        span:
          - resource.attributes["k8s.node.name"] == nil
          - resource.attributes["k8s.pod.uid"] == nil
          - resource.attributes["k8s.namespace.name"] == nil
          - resource.attributes["k8s.pod.name"] == nil
  connectors:
    spanmetrics:
      metrics_expiration: 5m
      namespace: otel_span
    routing/traces:
      error_mode: ignore
      table:
        - statement: route()
          pipelines: [traces/sampling, traces/spanmetrics]
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        receivers: [otlp, jaeger]
        processors: [filter/dropMissingK8sAttributes, memory_limiter, resource]
        exporters: [routing/traces]
      traces/spanmetrics:
        receivers: [routing/traces]
        processors: []
        exporters: [spanmetrics]
      traces/sampling:
        receivers: [routing/traces]
        processors: [tail_sampling, batch]
        exporters: [debug, otlp]
      metrics:
        receivers: [otlp, spanmetrics, prometheus]
        processors: [memory_limiter, resource, batch]
        exporters: [debug, otlp]
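The CLUSTER_NAME entry under resource.attributes can be filled in with a one-line substitution instead of a manual edit. The helper below is a sketch with a hypothetical name; it assumes the cluster name contains no "|" or "&" characters, which would confuse the sed expression.

```shell
# Sketch: replace the CLUSTER_NAME placeholder in a values file.
set_cluster_name() {
  values_file="$1"
  cluster_name="$2"
  # Only a line ending in "value: CLUSTER_NAME" is rewritten. A temporary
  # file is used instead of "sed -i", whose syntax differs between GNU and
  # BSD sed.
  sed "s|value: CLUSTER_NAME\$|value: ${cluster_name}|" "$values_file" > "${values_file}.tmp" &&
    mv "${values_file}.tmp" "$values_file"
}
```

For example: `set_cluster_name otel-values.yaml my-cluster`, where my-cluster stands for your own cluster name.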
Also create a file for the necessary RBAC bindings, called otel-rbac.yaml:
# vim otel-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: suse-observability-otel-scraper
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: suse-observability-otel-scraper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: suse-observability-otel-scraper
subjects:
  - kind: ServiceAccount
    name: opentelemetry-collector
    # Must match the namespace the collector is installed into (suse-observability in this article).
    namespace: suse-observability
Next, install the collector:
# helm upgrade --install opentelemetry-collector oci://dp.apps.rancher.io/charts/opentelemetry-collector --version 0.126.0 -f otel-values.yaml -n suse-observability
Apply the RBAC bindings:
# kubectl apply -n gpu-operator -f otel-rbac.yaml
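Once the Role and RoleBinding are in place, `kubectl auth can-i` can confirm that the collector is allowed to discover the GPU metrics endpoints. The sketch below assumes the collector's ServiceAccount is called opentelemetry-collector and lives in the namespace the chart was installed into (suse-observability in the install command above); the function name is hypothetical, and your own user needs permission to impersonate ServiceAccounts for `--as` to work.

```shell
# Sketch: ask the API server whether the collector's ServiceAccount may
# list endpoints in the gpu-operator namespace. Prints "yes" or "no".
can_scrape_gpu_endpoints() {
  sa_namespace="${1:-suse-observability}"   # namespace the collector runs in (assumed)
  sa_name="${2:-opentelemetry-collector}"   # ServiceAccount named in the RoleBinding
  kubectl auth can-i list endpoints \
    --namespace gpu-operator \
    --as "system:serviceaccount:${sa_namespace}:${sa_name}"
}
```

If this prints "no", compare the namespace and name under subjects in otel-rbac.yaml with the ServiceAccount the collector actually uses.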
What’s Next?
In the next article, we’ll wrap up the series and get some insights on the future of SUSE AI. See you then!