SUSE AI Announcement: KubeCon ’25


SUSE AI Unveils Universal Proxy, vLLM Acceleration, and Enhancements to Private AI Governance

KubeCon + CloudNativeCon ’25: At this year’s event, SUSE is announcing new capabilities for the SUSE AI offering. Among them are several genuinely useful additions: foundational enhancements that simplify agent-based workflows, expand LLM choice, and strengthen governance for AI assets.

1. The SUSE AI Universal Proxy Technical Preview

The shift toward autonomous AI Smart Agents—systems that can reason, plan, and act by invoking specialized tools—introduces significant challenges in security, network management, cost control, and, above all, data management. The SUSE AI Universal Proxy is designed to provide the enterprise control plane needed to manage, and intelligently use, data in these agent-based workflows.

SUSE AI Universal Proxy’s Smart Agents: Cost Control and Sovereignty

A SUSE Smart Agent decides what to do with data “on behalf of” your LLMs. SUSE AI uses this agent-managed approach to make data use and LLM access more efficient, with the primary benefits being:

  • Multiple models can communicate with each other and identify the best way to perform the task
  • Low development effort: the smart agent is actually a pre-existing model that acts as a worker for the other LLMs
  • LLM cost management: model-managed communication provides cost controls for LLM usage
    • Budget cap on reasoning (stop thinking at a certain threshold)
    • Budget cap “per user”: responses are served until the user’s cap is reached
    • Budget cap “per group”
  • Maintained data sovereignty: remote LLMs do not have access to sensitive and private local data

SUSE AI Universal Proxy Example

Bob and Alice make the same request:

  • Bob is allowed by SUSE AI
  • Alice is NOT

The SUSE AI Universal Proxy routes the query and applies rules:

  • Budget cap on tokens
  • Permissions for access to specific models
  • Number of “turns” allowed before the models agree on a final result

A minimal sketch of how rules like these might be enforced follows below.
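The sketch below illustrates, in plain Python, how per-user model permissions, token budget caps, and turn limits of the kind listed above could be enforced. It is illustrative only: the class, field names, budget numbers, and model names are hypothetical and are not the SUSE AI Universal Proxy API.

```python
# Hypothetical policy check -- not the SUSE AI Universal Proxy API.
from dataclasses import dataclass


@dataclass
class Policy:
    allowed_models: set        # models this user may call
    token_budget: int          # per-user token cap
    max_turns: int             # "turns" allowed before a final result
    tokens_used: int = 0

    def check(self, model: str, requested_tokens: int, turn: int) -> bool:
        """Return True if the request may be routed, False if it is blocked."""
        if model not in self.allowed_models:
            return False       # user is not permitted to use this model
        if self.tokens_used + requested_tokens > self.token_budget:
            return False       # per-user token budget cap reached
        if turn > self.max_turns:
            return False       # too many agent "turns" without a final result
        self.tokens_used += requested_tokens
        return True


# Bob is allowed; Alice is not permitted to use any model through the proxy.
policies = {
    "bob":   Policy(allowed_models={"local-llama", "remote-model"}, token_budget=10_000, max_turns=5),
    "alice": Policy(allowed_models=set(), token_budget=0, max_turns=0),
}

print(policies["bob"].check("remote-model", 500, turn=1))    # True  -- request is routed
print(policies["alice"].check("remote-model", 500, turn=1))  # False -- request is blocked
```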

The SUSE AI Universal Proxy: Granular Access and Simplification

The Universal Proxy goes far beyond simplified single-endpoint access; it functions as an intelligent governance and routing layer:

  • Granular Access Control: The Proxy uses underlying MCPs and APIs to create a “virtual” MCP endpoint that is exposed to end-users. This virtual layer enables highly granular access to data and capabilities.

    • Using an RBAC model, the Proxy ensures two users accessing the same “base” service endpoint will access different capabilities based on their roles and permissions.

  • Management Complexity Removed: The Proxy acts as a single point of access for all agent interactions, which drastically reduces management overhead. This removes the need for agents to directly manage connections to numerous disparate MCPs or services, centralizing traffic, security, and telemetry.

  • Core Technical Benefits: The Proxy maintains core functions like Stateful Routing (maintaining conversational context) and Simplified Observability (providing a single point for telemetry collection).
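To make the RBAC idea above concrete, the fragment below sketches how a role-based filter over a “virtual” endpoint’s capabilities might look. The role names, tool names, and function are hypothetical illustrations, not the Universal Proxy’s actual configuration model.

```python
# Hypothetical RBAC filter over a virtual endpoint's capabilities.
ROLE_TOOL_GRANTS = {
    "analyst":  {"search_documents", "summarize"},
    "engineer": {"search_documents", "summarize", "run_sql", "deploy_workflow"},
}

ALL_TOOLS = ["search_documents", "summarize", "run_sql", "deploy_workflow"]


def visible_tools(role: str) -> list:
    """Return only the capabilities of the base endpoint this role may use."""
    granted = ROLE_TOOL_GRANTS.get(role, set())
    return [tool for tool in ALL_TOOLS if tool in granted]


# Two users hit the same "base" endpoint but see different capabilities.
print(visible_tools("analyst"))   # ['search_documents', 'summarize']
print(visible_tools("engineer"))  # all four tools
```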

2. Accelerating Inference with vLLM Integration

Enterprises need to deploy Large Language Models (LLMs) on private infrastructure to meet compliance demands, but high latency and massive GPU costs often hinder scaling. The SUSE AI platform is directly addressing this by enhancing LLM choice and inference access with deep vLLM integration.

vLLM is a high-performance serving library that significantly optimizes LLM inference, making it a cornerstone for efficient enterprise AI deployment. Its core innovation, PagedAttention, dramatically improves GPU memory utilization by managing the Key-Value (KV) cache with a virtual memory-like approach. This dynamic allocation reduces wasted memory by up to 80% and allows for continuous batching, enabling new requests to join a processing batch without delay. Studies frequently show that vLLM can deliver throughput gains of up to 24x compared to standard serving frameworks under high-concurrency loads.
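As a point of reference, the snippet below uses vLLM’s standard offline-inference API (not anything SUSE AI specific) to batch two prompts on a single GPU; the model name is only an example.

```python
# Standard vLLM offline-inference example; the model name is illustrative.
from vllm import LLM, SamplingParams

# vLLM manages the KV cache in fixed-size blocks (PagedAttention), so many
# concurrent requests can share GPU memory efficiently.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", gpu_memory_utilization=0.90)

params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the benefits of continuous batching in one sentence.",
    "Explain PagedAttention to a platform engineer in two sentences.",
]

# generate() uses continuous batching: new requests can join a running batch
# instead of waiting for the current batch to drain.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```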

For enterprise use cases, vLLM is critical for:

  • Real-time Customer Service: Powering low-latency, high-volume chatbots and virtual assistants that can handle thousands of concurrent queries without performance degradation.

  • Developer Copilots: Accelerating code suggestion and documentation generation, where milliseconds saved translate directly into developer productivity.

  • High-Throughput Back-Office Automation: Enabling cost-effective processing of large documents for summarization, risk assessment, and legal review on private GPUs.

By integrating vLLM, SUSE AI maximizes the utility of existing GPU investments, transforming costly proof-of-concepts into scalable, production-ready services.
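For the chat-style use cases above, vLLM is commonly consumed through its OpenAI-compatible HTTP server, so existing OpenAI-client code keeps working against private GPUs. The port, model name, and prompt below are illustrative only.

```python
# Client for a locally running vLLM OpenAI-compatible server.
# Start the server first, for example:
#   vllm serve mistralai/Mistral-7B-Instruct-v0.2
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server; no real key needed

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Where can I track my order?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```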

3. Trusted Assets: The Private Registry for Frameworks and Models

The central use case for the Private Registry is to mirror critical upstream projects (such as PyTorch, Hugging Face models, and MLOps tools like MLflow) into a secure, enterprise-governed environment. This solves several governance challenges:

  • Supply Chain Integrity: By mirroring external projects, the enterprise IT team can scan, validate, sign, and version control the content before making it available to developers, effectively mitigating risks associated with transient upstream repositories.

  • Air-Gapped and Compliant Operations: For environments that must remain disconnected from the public internet (air-gapped), the registry guarantees secure and audited access to essential Models (LLMs), infrastructure, and application components.

  • Consistency and Reproducibility: It ensures every team uses the same, verified versions of frameworks and models, which is essential for auditability, reproducing experimental results, and maintaining stability in production environments.
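As one illustration of the mirroring workflow, the snippet below points the standard Hugging Face client at an internal mirror so model downloads go through a governed registry. The endpoint URL and repository name are hypothetical, and this post does not specify the Private Registry’s actual interface.

```python
# Minimal sketch, assuming the private registry exposes a Hugging Face-compatible
# mirror endpoint; the URL and repo id are hypothetical.
import os

# Set the mirror endpoint before importing huggingface_hub so the client picks it up.
os.environ["HF_ENDPOINT"] = "https://registry.internal.example.com"

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",  # the mirrored, scanned, signed copy
    revision="main",                             # pin a reviewed revision for reproducibility
)
print(f"Model files available at: {local_path}")
```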

4. Enhanced Observability and Strategic Partnerships

To complement the platform advancements, SUSE AI is delivering new observability metrics tailored explicitly for AI workloads. Upcoming is the ability to tailor and extend these visualizations for customized views of critical AI infrastructure, spanning deep visibility into GPU utilization, token throughput and latency, and vector database access and performance. This transforms complex AI metrics into actionable insights within the Rancher observability control plane, without losing the capabilities core to SUSE Observability: Time Machine (the ability to look back at visualizations over time to determine what happened) and contextually aware remediations.
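To give a feel for the kinds of AI metrics described above, the sketch below exposes token throughput, request latency, and GPU utilization in Prometheus format using the standard prometheus_client library. The metric names and simulated values are illustrative; SUSE Observability’s actual collectors and dashboards are not shown here.

```python
# Illustrative AI-workload metrics in Prometheus format; metric names are hypothetical.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

TOKENS_GENERATED = Counter("llm_tokens_generated_total", "Total tokens generated")
REQUEST_LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency")
GPU_UTILIZATION = Gauge("gpu_utilization_ratio", "GPU utilization (0.0 to 1.0)")


def record_request(num_tokens: int, latency_s: float, gpu_util: float) -> None:
    """Record the metrics for one inference request."""
    TOKENS_GENERATED.inc(num_tokens)
    REQUEST_LATENCY.observe(latency_s)
    GPU_UTILIZATION.set(gpu_util)


if __name__ == "__main__":
    start_http_server(9100)  # scrape target for the observability stack
    while True:
        # Simulated values, purely for demonstration.
        record_request(random.randint(50, 500), random.uniform(0.1, 2.0), random.uniform(0.3, 0.95))
        time.sleep(5)
```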

For more information on the SUSE AI Universal Proxy and vLLM integration, please visit the SUSE booth at KubeCon.
