SUSE AI: A new foundation for your private AI implementation (part 7)


Last time…

In the last article, we installed SUSE Observability, a powerful tool for getting insights into all aspects of our cluster and workloads. Today, we’ll add the metrics module for our LLM and GPU workloads and explore the web UI a bit!

Adding GenAI Observability metrics

To collect GenAI-related metrics (GPU, LLM, and vector database workloads) in SUSE Observability, you’ll need to install the relevant add-on from the Application Collection.

First, we’ll need to retrieve both the API and CLI keys from SUSE Observability.

  • Open the SUSE Observability web UI.
  • Open the menu in the top left corner, scroll down, and click on “CLI”.
  • Make note of the value under “Your API token”. This is the CLI API key.
  • Open the side menu again, scroll down, and click on “StackPacks”.
  • Click on “Kubernetes”, then click on the active instance (the one with a green checkmark).
  • In the first “helm” command example, you will see the parameters whose values we need to collect:
    1. stackstate.apiKey – this is the main API key.
    2. stackstate.cluster.name – this is the name of our observed cluster.

Now, we’ll need to prepare a “values” file with the information we collected.

# vim genai-values.yaml
serverUrl: http://suse-observability-router.suse-observability.svc.cluster.local:8080
apiKey: <value taken from stackstate.apiKey>
apiToken: <value taken from the CLI key>
clusterName: <value taken from stackstate.cluster.name>
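If you prefer to script this step, the same file can be generated from environment variables. This is a convenience sketch, not part of the official procedure; the placeholder values stand in for the keys collected above:

```shell
# Placeholders: replace with the values collected from SUSE Observability.
export OBS_API_KEY='<stackstate.apiKey>'
export OBS_CLI_TOKEN='<CLI API token>'
export OBS_CLUSTER_NAME='<stackstate.cluster.name>'

# Generate genai-values.yaml; the serverUrl is the in-cluster router service.
cat > genai-values.yaml <<EOF
serverUrl: http://suse-observability-router.suse-observability.svc.cluster.local:8080
apiKey: ${OBS_API_KEY}
apiToken: ${OBS_CLI_TOKEN}
clusterName: ${OBS_CLUSTER_NAME}
EOF
```

This keeps the keys out of your shell history for subsequent commands, since you can reuse the exported variables.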

The next step is to install the GenAI Observability helm chart:

# helm upgrade --install --namespace suse-observability -f genai-values.yaml genai-observability oci://dp.apps.rancher.io/charts/suse-ai-observability-extension --version 1.0.1   

Wait for the deployment to finish, and you should have the extension installed.

Installing the OpenTelemetry Collector

This is necessary to collect the GPU metrics and forward them to SUSE Observability. 

First, we need to create the Secret object used by the collector:

# kubectl create secret generic open-telemetry-collector \
  --namespace suse-observability \
  --from-literal=API_KEY='<value taken from stackstate.apiKey>'
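As a side note, Kubernetes stores Secret values base64-encoded, and the collector receives the decoded value through the `API_KEY` environment variable (wired up via `extraEnvsFrom` in the next step). A quick local illustration of that encoding round-trip, using a placeholder key:

```shell
# Placeholder key, standing in for the real stackstate.apiKey value.
API_KEY='example-api-key-123'

# This is roughly how the value appears under .data.API_KEY in the Secret...
ENCODED=$(printf '%s' "$API_KEY" | base64)

# ...and decoding it returns the original key the collector will use.
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
echo "$DECODED"
```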

Create a new file named otel-values.yaml with the following content:

# vim otel-values.yaml
extraEnvsFrom:
  - secretRef:
      name: open-telemetry-collector
mode: deployment
image:
  repository: "otel/opentelemetry-collector-k8s"
ports:
  metrics:
    enabled: true
presets:
  kubernetesAttributes:
    enabled: true
    extractAllPodLabels: true
config:
  receivers:
    prometheus:
      config:
        scrape_configs:
        - job_name: 'gpu-metrics'
          scrape_interval: 10s
          scheme: http
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names:
                - gpu-operator
  exporters:
    otlp:
      endpoint: http://suse-observability-otel-collector.suse-observability.svc.cluster.local:4317 
      headers:
        Authorization: "SUSEObservability ${env:API_KEY}"
      tls:
        insecure: true
  processors:
    tail_sampling:
      decision_wait: 10s
      policies:
      - name: rate-limited-composite
        type: composite
        composite:
          max_total_spans_per_second: 500
          policy_order: [errors, slow-traces, rest]
          composite_sub_policy:
          - name: errors
            type: status_code
            status_code:
              status_codes: [ ERROR ]
          - name: slow-traces
            type: latency
            latency:
              threshold_ms: 1000
          - name: rest
            type: always_sample
          rate_allocation:
          - policy: errors
            percent: 33
          - policy: slow-traces
            percent: 33
          - policy: rest
            percent: 34
    resource:
      attributes:
      - key: k8s.cluster.name
        action: upsert
        value: CLUSTER_NAME # replace with the observed cluster name
      - key: service.instance.id
        from_attribute: k8s.pod.uid
        action: insert
    filter/dropMissingK8sAttributes:
      error_mode: ignore
      traces:
        span:
          - resource.attributes["k8s.node.name"] == nil
          - resource.attributes["k8s.pod.uid"] == nil
          - resource.attributes["k8s.namespace.name"] == nil
          - resource.attributes["k8s.pod.name"] == nil
  connectors:
    spanmetrics:
      metrics_expiration: 5m
      namespace: otel_span
    routing/traces:
      error_mode: ignore
      table:
      - statement: route()
        pipelines: [traces/sampling, traces/spanmetrics]
  service:
    extensions:
      - health_check
    pipelines:
      traces:
        receivers: [otlp, jaeger]
        processors: [filter/dropMissingK8sAttributes, memory_limiter, resource]
        exporters: [routing/traces]
      traces/spanmetrics:
        receivers: [routing/traces]
        processors: []
        exporters: [spanmetrics]
      traces/sampling:
        receivers: [routing/traces]
        processors: [tail_sampling, batch]
        exporters: [debug, otlp]
      metrics:
        receivers: [otlp, spanmetrics, prometheus]
        processors: [memory_limiter, resource, batch]
        exporters: [debug, otlp]
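The `CLUSTER_NAME` placeholder in the `resource` processor must be replaced with the same cluster name collected earlier (`stackstate.cluster.name`). One way to substitute it, sketched here against a small stand-in fragment (run the same `sed` against your real otel-values.yaml):

```shell
# Assumption: 'my-ai-cluster' stands in for your stackstate.cluster.name value.
CLUSTER_NAME='my-ai-cluster'

# For illustration, the relevant fragment of otel-values.yaml:
cat > /tmp/resource-fragment.yaml <<'EOF'
resource:
  attributes:
  - key: k8s.cluster.name
    action: upsert
    value: CLUSTER_NAME
EOF

# Substitute the placeholder in place.
sed -i "s/value: CLUSTER_NAME/value: ${CLUSTER_NAME}/" /tmp/resource-fragment.yaml

# Confirm the substitution took effect.
grep 'value: my-ai-cluster' /tmp/resource-fragment.yaml
```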

Also create a file for the necessary RBAC bindings, called otel-rbac.yaml:

# vim otel-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: suse-observability-otel-scraper
rules:
  - apiGroups:
      - ""
    resources:
      - services
      - endpoints
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: suse-observability-otel-scraper
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: suse-observability-otel-scraper
subjects:
  - kind: ServiceAccount
    name: opentelemetry-collector
    namespace: suse-observability

Next, install the collector:

# helm upgrade --install opentelemetry-collector oci://dp.apps.rancher.io/charts/opentelemetry-collector --version 0.126.0 -f otel-values.yaml -n suse-observability

Apply the RBAC bindings in the gpu-operator namespace, where the GPU metrics endpoints the collector scrapes live:

# kubectl apply -n gpu-operator -f otel-rbac.yaml

What’s Next?

In the next article, we’ll wrap up the series and get some insights on the future of SUSE AI. See you then!

Erico Mendonca, Solution Architect @ SUSE