Based in Lugano, Switzerland, CSCS develops and operates cutting-edge HPC systems as an essential service for Swiss researchers. Scientists use the organization’s HPC resources for a diverse range of purposes, from high-resolution simulations to the analysis of complex data in fields such as climate science, materials engineering and life sciences. Its core HPC systems are managed with an HPE control plane, and all compute nodes run SUSE Linux Enterprise Server (SLES) for stability and performance.
At-a-Glance
The Swiss National Supercomputing Center (CSCS) provides world-class high-performance computing (HPC) resources to researchers. To manage its complex HPC and Kubernetes infrastructure with a lean team of just two platform engineers, CSCS uses SUSE Rancher Prime and SUSE Virtualization. SUSE Rancher Prime provides a single point of control and supports infrastructure-as-code automation. This has reduced the time spent on infrastructure management by around 80% and accelerated application deployments by around 70%, allowing researchers to focus on science, not system administration.
Supporting HPC workloads with a lean team
Alongside theory and experimentation, computer simulation has become an essential element of modern science, enabling researchers to uncover new insights and develop groundbreaking hypotheses. At CSCS, HPC resources play a key role in supporting scientists and researchers in driving cutting-edge work across a wide range of fields including life sciences and medicine, climate research, astronomy and artificial intelligence (AI).
The organization’s lean IT team strives to ensure that HPC and data resources are always quickly available to researchers. To help it achieve this goal, CSCS uses a Kubernetes-based containerized infrastructure to streamline and automate its management processes.
Dino Conciatore, Systems Engineer at CSCS, confirms: “A team of just two platform engineers supports around 80-90 engineers that use the Kubernetes platform. For that reason, we aim to make deploying and managing our virtual machines [VMs] and Kubernetes clusters as simple as possible.”
Leveraging SUSE solutions
SUSE Rancher Prime
To help it cut through the complexity of container management alongside its HPC infrastructure, CSCS uses SUSE Rancher Prime to manage its large Kubernetes environment. An enterprise-class Kubernetes management platform backed by responsive expert support, SUSE Rancher Prime offers the organization a single point of control for all its deployments.
SUSE Rancher Prime unifies management across more than 50 Kubernetes clusters spanning 20 virtual LANs (VLANs). This gives the team of two platform engineers the ability to ensure security, efficiency and scalability across both HPC and service environments.
“SUSE Rancher Prime is the key to simplifying our deployment and management processes,” says Dino. “We get a central view of all our clusters, which makes it easy to identify and resolve issues. We use a DevOps infrastructure-as-code approach to automate the deployment process for new clusters, and SUSE Rancher Prime helps us manage those activities.”
SUSE Virtualization
To simplify the management of its secure multi-VLAN network environment, CSCS uses SUSE Virtualization. Built on SUSE's leadership in open source innovation, SUSE Virtualization provides a hyperconverged infrastructure (HCI) stack that unifies VM and container management.
The complete system comprises roughly 500 nodes: about 300 VMs provisioned through SUSE Virtualization and about 200 bare-metal servers. Most of the bare-metal servers are dedicated to HPC workloads, while the rest support the supercomputer as part of service clusters.
“We have 16 SUSE Virtualization nodes, most with 768 GB of RAM and 128 cores each, running around 300 VMs provisioned through SUSE Virtualization,” says Dino. “The SUSE solution simplifies managing our secure multi-VLAN networks, allowing us to effectively segment internal and external-facing services.”
The impact of SUSE Rancher Prime
Reduces infrastructure management effort by 80%
For CSCS, SUSE Rancher Prime offers a single point of control for multi-cluster management. As well as providing security and observability capabilities, the platform supports the organization’s lean engineering team with infrastructure-as-code automation.
“We’ve automated everything from cluster creation to application deployment,” explains Dino. “We use Argo CD to underpin our GitOps processes and OpenTofu for infrastructure-as-code deployments. When a new cluster is needed, we can spin it up rapidly and provide our developers with all the information they need. Taken together, this approach reduces the time and effort required to maintain our infrastructure by around 80%.”
Accelerates application deployments by 70%
SUSE Rancher Prime enables effective GitOps workflows by providing centralized visibility and control across all Kubernetes clusters, enhancing the capabilities of tools like Argo CD. This seamless integration significantly improves the efficiency of application deployment.
“We rely on SUSE Rancher Prime to ensure our clusters are always properly configured and available, which enables our GitOps tool, Argo CD, to efficiently manage the deployment lifecycle,” says Dino.
“Today, we’re using this integrated approach to support around 800 applications. By leveraging SUSE Rancher Prime's centralized management and Argo CD’s automation capabilities, our teams no longer need to manually spin up services, which speeds up application deployments by around 70%.”
Provides round-the-clock support
Initially, CSCS operated Rancher without official support. After it signed a support contract for SUSE Rancher Prime, it noticed a significant difference.
“After we got our SUSE Rancher Prime support contract, we saw a significant improvement in our operations because we learned something new every day,” says Dino. “Staying strictly within the SUSE support matrix has been crucial for smooth operations. Whenever we experience an issue, we know that support teams from SUSE are on hand to help us 24/7.”
The impact of SUSE Virtualization
Delivers high reliability and availability
With an infrastructure-as-code approach underpinned by SUSE Rancher Prime and SUSE Virtualization, CSCS can maintain high levels of reliability and availability for vital HPC resources.
“In all the years we’ve been using SUSE solutions, we’ve never experienced significant downtime,” comments Dino. “Because we have adopted a GitOps methodology, we could redeploy our entire infrastructure in less than a day if we needed to. SUSE helps us ensure that our researchers can always access the cluster to do their work.”
What's next for CSCS?
CSCS plans to build on its collaboration with SUSE to find new opportunities for automation and standardization, delivering greater autonomy and efficiency for users while maintaining strong governance and security controls.
Dino concludes, “When our users are happy, we’re happy, and we look forward to continuing to work with SUSE to help us find even more effective ways to deliver, manage and upgrade our virtualized and containerized infrastructure for cutting-edge research.”