SUSE Rancher for AWS and Amazon Q: Governed SRE Assistance for EKS Operations
Multi-cluster EKS operations can generate an all-too-familiar drag. Signals scatter across dashboards, runbooks live in wikis that nobody updates and troubleshooting pulls senior engineers away from planned work. This toil can compound quickly as clusters multiply across regions and accounts. For operations leaders responsible for reliability at scale, the pattern is frustrating in part because it feels like it should be preventable.
SUSE and Amazon Web Services (AWS) have been co-building for nearly 15 years, and the partnership has produced tangible results. Phillips 66, for example, migrated its entire SAP landscape to AWS in 16 weeks and achieved a roughly 80% reduction in storage costs using SLES for SAP. The same co-development model now underpins a newer integration: an AI SRE assistant built on Amazon Q and Amazon Bedrock, delivered through SUSE Rancher for AWS.
After convening for a dedicated workshop on customer problems, the two product teams shipped a working demo at AWS re:Invent in a matter of days. On the latest episode of The Future Is Open podcast, the builders behind this work discuss how they approached the design and where they see the technology heading next.
Key takeaways
- In the latest episode of The Future Is Open, SUSE and AWS unpack how they built an SRE assistant using Amazon Q and Amazon Bedrock inside SUSE Rancher for AWS.
- Fundamentally, the AI assistant streamlines common tasks and helps SREs spend less time searching and more time deciding.
- Because modern operations are already overloaded, SUSE and AWS are helping teams move faster while keeping access scoped to roles and permissions.
- The conversation also reflects a regulated-world reality, including how European Sovereign Cloud requirements are shaping expectations for control and data governance.
- Through their 15 years of partnership, SUSE and AWS have prioritized predictable integration, clear ownership and enterprise-ready delivery over pure experimentation.
What is an agentic SRE assistant?
An AI SRE assistant is a generative AI tool that helps operations teams work through tasks like troubleshooting incidents, validating configurations, planning upgrades and navigating documentation. The term “agentic” signals that the assistant can retrieve context, synthesize information and recommend actions. It has the potential to serve as an on-demand resource that compresses the evidence-gathering phase of incident response. In other words, an agentic SRE assistant does more than simply answer one-off questions.
In practice, this kind of assistant can help you search across multiple clusters, surface relevant runbook sections and generate YAML for common operations. When a team faces a failing deployment or an unexplained latency spike, the assistant can help you correlate signals that would otherwise require manual investigation across several tools. Ideally, such assistants surface recommendations rather than immediately executing changes, maintaining appropriate accountability measures.
For the specific AI SRE assistant in SUSE Rancher for AWS, AWS contributes the AI stack. This includes Amazon Q for the conversational interface and Amazon Bedrock for the underlying foundation models. SUSE contributes the operational guardrails, such as centralized EKS management, unified identity through single sign-on (SSO) and role-based access control (RBAC), and integrated observability through SUSE Observability. Because the assistant lives inside SUSE Rancher for AWS, its recommendations are scoped by the same permissions that govern other kinds of access to clusters.
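The scoping idea can be illustrated with a minimal Python sketch. It is not the SUSE Rancher or Amazon Q implementation. The `Grant` and `Resource` types, the `visible_resources` helper and the sample cluster names are all hypothetical, used only to show how an assistant's context could be filtered through the same permissions as the requesting user.

```python
from dataclasses import dataclass


# Hypothetical model for illustration only; not a Rancher or Amazon Q API.
@dataclass(frozen=True)
class Grant:
    cluster: str
    namespace: str
    verbs: frozenset  # e.g. frozenset({"get", "list"})


@dataclass(frozen=True)
class Resource:
    cluster: str
    namespace: str
    name: str


def visible_resources(grants, resources, verb="get"):
    """Return only the resources the user may access with `verb`."""
    allowed = {(g.cluster, g.namespace) for g in grants if verb in g.verbs}
    return [r for r in resources if (r.cluster, r.namespace) in allowed]


# Assumed example: an engineer with read-only access to one namespace.
grants = [Grant("prod-us-east-1", "payments", frozenset({"get", "list"}))]
resources = [
    Resource("prod-us-east-1", "payments", "api-deployment"),
    Resource("prod-us-east-1", "billing", "ledger-deployment"),
    Resource("prod-eu-west-1", "payments", "api-deployment"),
]

# The assistant would build its answer from this filtered context only.
context = visible_resources(grants, resources)
print([f"{r.cluster}/{r.namespace}/{r.name}" for r in context])
```

In this sketch the filter runs before any context reaches the model, so a recommendation cannot draw on a namespace that the requesting user could not inspect directly.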
A toil-reducing workflow
When you encounter a failing deployment today, the traditional path involves checking pod status, pulling logs, searching documentation, comparing configurations and correlating events across multiple clusters. While each step is reasonable, the aggregate cost is untenable.
With the AI SRE assistant in SUSE Rancher for AWS, you can instead describe the problem in natural language and pull relevant context into one place. The assistant can surface applicable guidance and recommend next steps based on the documentation and operational knowledge you provide. It can help validate YAML files before they reach production, surface troubleshooting guidance tailored to the issue, and support upgrade planning for clusters approaching end-of-support. Because the assistant includes built-in SUSE Observability, it can draw on metrics, logs and traces that are already flowing through your platform. As a result, you reduce the overhead of context-switching between dashboards and documentation.
SUSE and AWS designed an assistant that provides acceleration, not autopilot. It can help SREs get to an informed decision faster, while keeping ownership and accountability with the team. It can also support versioning and upgrade decisions by surfacing relevant guidance alongside your observability signals.
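As a rough illustration of what validating a manifest before production can mean, the Python sketch below runs a few checks against a Deployment that has already been parsed into a dictionary. The `validate_deployment` function and the three rules it applies (replica count, pinned image tag, resource limits) are assumptions chosen for the example. They are not the checks the assistant itself performs.

```python
def validate_deployment(manifest):
    """Return a list of findings; an empty list means no issues were found."""
    findings = []
    if manifest.get("kind") != "Deployment":
        return ["kind is not Deployment"]

    spec = manifest.get("spec", {})
    if spec.get("replicas", 1) < 2:
        findings.append("replicas < 2: no redundancy if a node fails")

    containers = spec.get("template", {}).get("spec", {}).get("containers", [])
    if not containers:
        findings.append("no containers defined")

    for container in containers:
        name = container.get("name", "<unnamed>")
        # Look only at the last path segment so a registry port is not
        # mistaken for an image tag.
        image_ref = container.get("image", "").rsplit("/", 1)[-1]
        if ":" not in image_ref or image_ref.endswith(":latest"):
            findings.append(f"{name}: image tag is missing or 'latest'")
        if "limits" not in container.get("resources", {}):
            findings.append(f"{name}: no resource limits set")
    return findings


# Hypothetical manifest, as it might look after YAML parsing.
manifest = {
    "kind": "Deployment",
    "spec": {
        "replicas": 1,
        "template": {
            "spec": {
                "containers": [
                    {"name": "api", "image": "registry.example.com/shop/api:latest"}
                ]
            }
        },
    },
}

findings = validate_deployment(manifest)
for finding in findings:
    print(finding)
```

The function only reports findings and never modifies the manifest, which mirrors the acceleration-not-autopilot stance: the engineer decides what to change.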
An ops-ready checklist for Amazon Q + Bedrock assistants
The value of an AI SRE assistant will vary by team. The following evaluation criteria can help operations leaders assess whether a given assistant fits their unique governance requirements.
- Identity and access controls. Verify that the assistant respects your existing identity infrastructure. SUSE Rancher for AWS integrates with SSO, RBAC and directory services like LDAP and Active Directory. This means its AI SRE assistant will view your environment using the same kinds of permissions that govern human users. Just as a given engineer might have read-only access to a specific namespace, the assistant provides recommendations appropriate to a defined scope.
- Human-in-the-loop governance. Confirm that the technology operates in an assist-first mode, surfacing recommendations rather than executing changes unilaterally. Look for clear boundaries between what the assistant proposes and what requires human approval. Among other implications, this distinction matters for audit trails and your change management processes.
- Operational scope and capability. Understand which tasks the assistant can effectively support. SUSE Rancher for AWS provides guidance on YAML validation, troubleshooting workflows, upgrade planning and GitOps patterns. As a result, its assistant can help you close knowledge gaps and work more confidently across multi-cluster environments. If these capabilities don’t align with your day-2 challenges, the assistant will make less of an impact on operations.
- Observability and context. Assess how the assistant accesses operational data. Integrated observability can make it possible for an assistant to draw on the same metrics, logs and traces that your team uses for incident response. This context can notably improve the quality of recommendations and reduce manual correlating of signals across tools. Managing clusters across multiple AWS regions becomes much more tractable if an assistant can synthesize information for you at scale.
- Procurement and support clarity. Review how the solution is delivered and how the platform, including its AI capabilities, is supported. SUSE Rancher for AWS is available through the AWS Marketplace as a fully managed SaaS offering. This kind of model can help simplify budget conversations and align decision-making with existing cloud commitments.
- Portability and lock-in risk. While governance is an essential part of strategic control, portability also plays an important role. When evaluating any AI-assisted tooling, consider how the implementation might promote or limit your ability to adapt, migrate or exit. In many cases, it is important to avoid new dependencies that constrain your future options.
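The human-in-the-loop criterion above can be made concrete with a small Python sketch of an approval gate. Everything here is hypothetical: the `Proposal` class, its method names and the example change are assumptions for illustration and do not describe how SUSE Rancher for AWS or Amazon Q implement approvals. The point is the boundary between what an assistant proposes and what a person must approve, plus the audit trail that boundary produces.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Proposal:
    """A change suggested by an assistant that cannot run without approval."""
    summary: str
    proposed_by: str = "assistant"
    approved_by: Optional[str] = None
    audit_log: List[str] = field(default_factory=list)

    def _record(self, event):
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append(f"{stamp} {event}")

    def approve(self, engineer):
        self.approved_by = engineer
        self._record(f"approved by {engineer}")

    def apply(self, action):
        # Refuse to run the change until a human has signed off.
        if self.approved_by is None:
            self._record("apply blocked: no human approval")
            raise PermissionError("proposal requires human approval")
        result = action()
        self._record(f"applied: {self.summary}")
        return result


proposal = Proposal("scale payments/api-deployment from 2 to 4 replicas")

try:
    proposal.apply(lambda: "scaled")  # attempted before approval
except PermissionError as err:
    print(f"blocked: {err}")

proposal.approve("on-call engineer")
print(proposal.apply(lambda: "scaled"))
print(f"{len(proposal.audit_log)} audit entries recorded")
```

Recording the blocked attempt as well as the approval and the applied change gives change management a complete sequence of events to review.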
Listen in to learn more
On The Future Is Open, you can hear directly from the teams who built this integration. Cameron Seader hosts SUSE’s Christine Puccio and AWS’s Manasi Jagannatha in a discussion about multi-cluster EKS environments and recurring concerns like incident triage and configuration validation. Even organizations with a small cluster footprint or minimal operational complexity may benefit, especially if they expect to need a more sophisticated EKS management layer in the future.
The podcast’s insights may prove especially valuable if you are navigating data residency requirements or heightened audit expectations, where governance visibility is a prerequisite for adoption. You can further deepen your understanding of this topic with SUSE’s cloud sovereignty self-assessment, which helps evaluate where your organization lands on that spectrum.
Ultimately, an AI-powered assistant becomes interesting when it reduces repetitive work. It becomes valuable when it aligns with your governance and operating model. In this episode, you’ll hear how SUSE and AWS designed this solution with both objectives in mind.