Committing to Change: Next Steps in Your Move to SAP S/4HANA

Thursday, 3 May, 2018

No doubt you’re aware of SAP’s announcement about moving its applications solely to SAP S/4HANA by 2025. One of our recent blog posts discussed how to assess whether your business needs to make a change—and hopefully by now you’ve seen the benefits of moving to SAP S/4HANA.

It’s a smart decision. With SAP S/4HANA, you get exciting new capabilities, from the ability to implement real-time data analysis and machine learning to getting better results from SAP ERP workloads to enjoying a more user-friendly interface with SAP Fiori. Now is the time to commit to your decision, which means developing a strategy and figuring out the timing for your move.

This blog doesn’t have enough space to detail an SAP migration step by step—it’s obviously not a trivial undertaking—but there’s already a wealth of helpful resources out there to rely on. We simply want to lay out a big picture here of things to consider when getting started.

Develop a Solid Migration Strategy

Your SAP migration strategy will depend greatly on understanding your current business and IT environment, as well as knowing the value of migration to your business. Consulting firm KPMG states:

A migration to S/4HANA is … mainly a business-driven project that has many technological components. Therefore, while determining the strategy, the business and IT both at C-level and operational, have to be involved at every step. [1]

Assemble Your A-Team

The right technical and strategic consultants can have an enormous impact on your migration strategy. If you don’t have SAP expertise in-house, get certified SAP migration consultants or systems integrators on board. Look to organizations such as Accenture, Capgemini, Infosys, TCS or Wipro for knowledgeable guidance.

Your go-to people need deep expertise and understanding of the HANA platform, operating system options, hosting environment and migration methodology—not to mention a high level of familiarity with the way you do business. They need to be well-informed and have solid strategies for keeping costs down and minimizing disruption.

Understand Your Options

Once you have an expert team assembled, the action can begin. Your technical experts will analyze your current SAP landscape and identify potential obstacles, such as custom code or modules that might need to be transferred. Your existing SAP systems might need to be upgraded to more recent versions in preparation for migration.

On-Premises or In the Cloud?

One of the big decisions you’ll make is whether to run SAP S/4HANA on premises or in the cloud. There are benefits to both: some organizations prefer the maximum control and reduced risk of on-premises deployment, while others choose the cloud for easy scalability, increased flexibility and faster time to value. Once you make your choice, you’ll have a variety of options when it comes to SAP-certified hardware, cloud instances, and storage solutions.

Sizing Up Your Landscape

Right-sizing your SAP HANA landscape is a critical task for your technical team. Oversizing can lead to wasted capacity and inflated hardware costs, while undersizing can create performance bottlenecks that drive up operational costs.

Prepare the Migration and Give It a Test Run

Your business team can play an important role in helping your technical team decide which data to move and when. Maybe you have older data that you can move to SAP HANA before your more critical data. Or perhaps you’ll decide that some of your data can go into archival storage and doesn’t need to be migrated to SAP HANA at all. You have a lot of options—entire posts have been written about this topic.

You should then test the new landscape using an anonymized copy of your live SAP data to see if everything goes as planned. This step is valuable because it will give you an idea of how long the migration will take and it helps you identify any possible issues so you can resolve them before they affect the actual migration.

Timing Matters

Delaying your move to SAP S/4HANA could put you in a tight spot later on, but you shouldn’t rush it either. Plan and prepare first. Identify slower business cycles where downtime wouldn’t be as disruptive.

Ready to get started? Moving to SAP S/4HANA isn’t a minor move, but with the right preparation it can be a smooth one. With an expert team and a clear understanding of the options for your business, you can start down the path today.

[1] KPMG, Should We Migrate to S/4HANA?, March 2017.

How Public Cloud Adoption Enables Increased IT Automation

Tuesday, 26 March, 2024


In today’s fast-paced digital landscape, businesses are increasingly turning to public cloud services to drive their digital transformation efforts. This shift is propelled by the public cloud’s ability to offer scalable, flexible, and cost-efficient IT resources on demand. As organizations strive to remain competitive, the agility provided by cloud services becomes not just an asset but a necessity. This transition is fundamentally changing how companies manage their IT infrastructure, making the journey towards public cloud adoption a key strategic move.

Amidst this shift, IT automation is a critical component of modern business operations. By automating routine and complex tasks, businesses can achieve greater efficiency, reduce human error, and free up valuable resources for strategic initiatives. Automation in cloud environments streamlines operations, from deploying servers to scaling applications, ensuring that IT infrastructures can rapidly adapt to changing business needs.

The convergence of public cloud adoption and IT automation opens a new realm of possibilities for business innovation and agility. This article explores how embracing public cloud services not only facilitates but also amplifies IT automation capabilities. Through real-world examples and expert insights, we’ll delve into the mechanisms by which public cloud platforms empower organizations to automate their IT operations more extensively, driving significant gains in operational efficiency, cost savings, and competitive advantage.

The Evolution of Cloud Computing

Cloud computing has undergone a remarkable evolution since its inception, transforming the way businesses deploy and manage IT resources. Initially, the concept of cloud computing emerged as a dynamic means to share computing power and data storage, eliminating the need for extensive on-premise hardware. This era saw the rise of private clouds, which offered organizations the ability to harness cloud capabilities while maintaining control over their IT environment. However, the scalability and cost-effectiveness of these private clouds were often limited by the need for substantial upfront investment and ongoing maintenance.

The advent of public cloud services marked a pivotal shift in this landscape. Giants like Amazon Web Services, Microsoft Azure, and Google Cloud Platform began offering computing resources as a service, accessible over the internet. This model democratized access to high-powered computing resources, making them available on a pay-as-you-go basis. The transition from private to public cloud services heralded a new era of IT flexibility, scalability, and efficiency.

The impact of cloud computing on IT operations has been profound. Traditional IT tasks, such as provisioning servers, scaling applications, and managing data storage, have been simplified and automated. The public cloud has introduced a level of agility previously unattainable, enabling businesses to respond to market demands and innovate at an unprecedented pace. This shift has not only reduced operational costs but also allowed IT teams to focus on strategic initiatives that drive business growth. As cloud computing continues to evolve, its role as a catalyst for IT automation and business innovation becomes increasingly evident.

Understanding IT Automation

IT automation is the use of software to create repeatable instructions and processes to replace or reduce human interaction with IT systems. It’s a cornerstone of modern IT operations, enabling businesses to streamline operations, reduce manual errors, and scale efficiently. Automation is crucial for managing complex, dynamic environments, especially in the context of cloud computing where resources can be adjusted with demand.

There are several types of IT automation, each addressing different aspects of IT operations. Infrastructure as Code (IaC) allows teams to manage and provision IT infrastructure through code, rather than manual processes, enhancing speed and consistency. Continuous Integration/Continuous Deployment (CI/CD) automates the software release process, from code update to deployment, ensuring that applications are efficiently updated and maintained. Automated monitoring tools proactively track system health, performance, and security, alerting teams to issues before they impact operations.

The benefits of IT automation are multifaceted. It significantly reduces the time and cost associated with manual IT management, increases operational efficiency, and minimizes the risk of human error. For businesses, this means faster time-to-market for new features or products, improved service reliability, and the ability to allocate more resources towards innovation rather than maintenance. As such, IT automation is not just a technical improvement but a strategic asset that drives competitive advantage.

How Public Cloud Services Facilitate IT Automation

Public cloud services have emerged as a catalyst for IT automation, offering tools and features that significantly enhance the efficiency and agility of IT operations.

Scalability and Flexibility

One of the most compelling attributes of public cloud platforms is their automated scaling features. These platforms can automatically adjust computing resources based on real-time demand, ensuring that applications always have the necessary resources without manual intervention. This scalability not only optimizes cost but also supports uninterrupted service delivery.

The flexibility in resource allocation provided by public clouds further supports automation. IT teams can dynamically provision and decommission resources through automated scripts or templates, significantly reducing the time and complexity involved in managing IT infrastructure.

Advanced Tools and Services

Public cloud providers offer a suite of advanced tools for automation, such as AWS CloudFormation, Azure Resource Manager, and Google Cloud Deployment Manager. These tools allow organizations to define and deploy IaC, automating the setup and management of cloud environments.
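
To make the IaC concept concrete, here is a minimal sketch of an AWS CloudFormation template that declares a single server as code. The resource name, instance type, and AMI ID are illustrative placeholders, not recommendations:

AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal IaC sketch declaring one EC2 instance
Resources:
  WebServer:                                # hypothetical resource name
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.micro                # illustrative size
      ImageId: ami-0123456789abcdef0        # placeholder; substitute a valid AMI for your region
      Tags:
        - Key: Name
          Value: automated-web-server

Deploying such a template (for example with the aws cloudformation deploy CLI command) creates, updates, or rolls back the declared resources as a single unit, which is what makes environments repeatable.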

Moreover, public clouds feature robust integration capabilities with third-party automation tools. Whether it’s integrating with CI/CD pipelines for software deployment or leveraging specialized monitoring and management tools, the public cloud ecosystem is designed to support extensive automation strategies.

Public cloud services enable businesses to significantly enhance their IT automation capabilities through these mechanisms. By leveraging scalable resources, flexible management options, and comprehensive toolsets, organizations can automate a wide range of IT operations, from infrastructure provisioning to application deployment and monitoring, driving greater operational efficiency and innovation.

Cost Efficiency and Optimization

Public cloud services inherently promote cost efficiency by reducing the need for manual intervention in IT operations. Automation capabilities built into these platforms allow for the dynamic allocation and scaling of resources based on demand, eliminating overspending on underutilized resources. Through automated resource management, businesses can optimize their spending by ensuring that they only pay for the resources they use.

Examples of cost optimization include automated scaling during peak usage times to maintain performance without permanent investment in high-capacity infrastructure, and automated shutdown of resources during off-peak hours to save costs. Additionally, automated backup and data lifecycle policies help in managing storage costs efficiently. These automated processes ensure that businesses can maintain optimal service levels while minimizing expenses, showcasing the financial advantage of leveraging public cloud services for IT automation.
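
As an illustration of the off-peak shutdown pattern, the following CloudFormation fragment sketches AWS Auto Scaling scheduled actions; the group name, capacities, and schedules are assumptions to adapt to your own workload:

ScaleDownAtNight:
  Type: AWS::AutoScaling::ScheduledAction
  Properties:
    AutoScalingGroupName: !Ref AppServerGroup   # hypothetical Auto Scaling group
    MinSize: 0
    MaxSize: 0
    DesiredCapacity: 0
    Recurrence: "0 20 * * *"                    # scale to zero at 20:00 UTC daily
ScaleUpInMorning:
  Type: AWS::AutoScaling::ScheduledAction
  Properties:
    AutoScalingGroupName: !Ref AppServerGroup
    MinSize: 2
    MaxSize: 6
    DesiredCapacity: 2
    Recurrence: "0 6 * * *"                     # restore capacity at 06:00 UTC daily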

Overcoming Challenges in Cloud-Based IT Automation

While cloud-based IT automation offers myriad benefits, it also presents specific challenges that businesses must navigate. Two of the most significant hurdles are ensuring security and compliance and managing the complexity of automation workflows. By addressing these challenges effectively, organizations can harness the full potential of cloud-based IT automation.

Security and Compliance

Addressing Security Concerns with Automated Policies: Security in the cloud is paramount, especially when automation tools are implemented. Automated security policies enable organizations to consistently enforce security standards across their cloud environments. These policies can automatically detect and remediate non-compliant configurations or suspicious activities, ensuring a proactive approach to cloud security.

Ensuring Compliance in an Automated Public Cloud Environment: Compliance in an automated setting requires a structured approach to manage and monitor the cloud infrastructure. Utilizing cloud management platforms that offer built-in compliance frameworks can significantly ease this burden. These tools not only automate compliance checks but also provide detailed reports for auditing purposes, ensuring that businesses meet regulatory standards effortlessly.

Managing Complexity: Strategies for Simplifying Automation Workflows

As IT environments become increasingly complex, simplifying automation workflows is essential. One effective strategy is adopting a modular approach to automation, where workflows are broken down into smaller, manageable components. This not only makes the automation process more manageable but also enhances flexibility and scalability.

Tools and Best Practices for Managing Automated Systems

Leveraging the right tools is crucial for managing automated systems efficiently. Tools that offer visual workflow designers, integration capabilities, and scalable architectures can significantly reduce the complexity of automation. Additionally, adhering to best practices such as continuous monitoring, regular updates, and thorough testing of automation scripts ensures the smooth functioning of automated systems.

By tackling these challenges head-on, businesses can secure and streamline their cloud-based IT automation efforts, leading to enhanced operational efficiency and agility.

Conclusion

The exploration of cloud computing’s evolution and the strategic integration of IT automation has underscored the immense benefits that public cloud services offer to today’s enterprises. By harnessing the scalability, cost-effectiveness, and rapid innovation that public cloud platforms provide, organizations can significantly enhance their IT automation efforts. This leads to remarkable improvements in operational efficiency and business agility. Looking ahead, the synergy between IT automation and cloud computing is poised to be a cornerstone of business innovation, unlocking new avenues for growth and competitiveness.

Despite the challenges that may arise, the path to adopting public cloud services has been made smoother by the availability of robust strategies and tools. We are at the cusp of a technological transformation that will redefine the paradigms of IT operations and infrastructure management. In this pivotal moment, SUSE stands ready to guide businesses through their cloud journey with cutting-edge Linux products and open source solutions designed for seamless public cloud integration and efficient IT automation.

SUSE encourages businesses to leverage public cloud solutions to bolster their IT automation capabilities. With our expertise and innovative solutions, companies can not only navigate the complexities of cloud adoption but also harness the full potential of cloud computing and automation. Partner with SUSE to future-proof your business, ensuring you are well-equipped to thrive in the ever-evolving digital landscape.

Frequently Asked Questions (FAQ)

What Is Public Cloud Adoption?

Public cloud adoption refers to the process by which organizations transition their IT resources, applications, and operational processes to cloud services that are managed and provided by third-party companies. This move is driven by the desire to enhance flexibility, scalability, and cost-efficiency. Unlike private clouds, which are dedicated to a single organization, public clouds serve a multitude of clients, offering resources like servers and storage over the Internet. This model allows businesses to avoid the upfront cost and complexity of owning and maintaining their own IT infrastructure.

How Does Public Cloud Adoption Enhance IT Automation?

Public cloud adoption significantly enhances IT automation by providing scalable resources, advanced toolsets, and comprehensive managed services. These features facilitate the automatic scaling of resources to meet demand, streamline software deployment processes, and manage routine tasks such as backups, updates, and security checks with minimal human intervention. The inherent flexibility and breadth of services offered by public clouds enable organizations to automate their IT operations more effectively, leading to increased efficiency and reduced operational costs.

What Are the Key Benefits of IT Automation for Businesses?

The key benefits of IT automation for businesses include enhanced efficiency, reduced operational costs, improved reliability, and the ability to deploy services and applications faster. Automation reduces the need for manual intervention in routine tasks, thereby minimizing the risk of human error and ensuring operations run smoothly and consistently. It also enables organizations to respond more quickly to market changes and customer needs by facilitating rapid deployment of resources and applications.

Can Small Businesses Benefit from Public Cloud and IT Automation?

Absolutely. Small businesses stand to gain significantly from public cloud and IT automation. The scalability of cloud solutions means that businesses only pay for the resources they use, which can be scaled up or down based on demand. This flexibility makes cloud services and automation highly cost-effective, even for small enterprises, allowing them to leverage advanced technologies that were previously accessible only to larger organizations. Automation can further reduce operational costs by minimizing manual tasks, allowing small business owners to focus more on strategic growth areas.

How Do Public Cloud Services Ensure Security and Compliance in Automated Environments?

Public cloud providers invest heavily in security measures and compliance standards to protect data and ensure privacy in automated workflows. These measures include physical security controls at data centers, encryption of data in transit and at rest, and sophisticated access control mechanisms. Additionally, public clouds often comply with a broad range of international and industry-specific regulations, offering businesses peace of mind that their data handling practices are in line with legal requirements.

What Are Some Common Challenges in Implementing IT Automation via Public Cloud?

Common challenges in implementing IT automation via the public cloud include navigating the complexity of cloud services, bridging skill gaps within the organization, and addressing security concerns. Organizations may struggle with selecting the right tools and services that match their specific needs or integrating new cloud services with existing infrastructure. To overcome these challenges, businesses can invest in training for their staff, seek guidance from cloud consultants, and implement robust security practices and tools designed for cloud environments.

How Can Companies Get Started with Public Cloud Adoption and IT Automation?

Companies can start with public cloud adoption and IT automation by first assessing their business needs and identifying which processes and workloads could benefit most from moving to the cloud. The next step involves selecting the right cloud provider that aligns with their requirements in terms of services, security, and compliance. Businesses should then start small, moving a single workload or process to the cloud to gain familiarity with the environment before gradually implementing automation tools and practices across their operations.

Are There Any Industry-Specific Considerations for Public Cloud Adoption and IT Automation?

Yes, there are industry-specific considerations for public cloud adoption and IT automation. Regulatory compliance, data sensitivity, and specific operational needs vary significantly across sectors. For instance, healthcare organizations must ensure their cloud services comply with HIPAA regulations, while financial services firms have to meet strict data security and privacy standards. Understanding these nuances and selecting cloud services that offer the necessary controls and compliance certifications is crucial for successful adoption in any industry.

What Is the Future of Public Cloud and IT Automation?

The future of public cloud and IT automation is likely to be shaped by the integration of artificial intelligence (AI) and machine learning, the rise of serverless computing, and an increased focus on sustainability. AI and machine learning are set to automate even more complex tasks and decision-making processes, while serverless computing will allow businesses to run applications without managing the underlying servers, further reducing costs and operational overhead. Additionally, the cloud industry is moving towards greener practices, with providers focusing on reducing energy consumption and utilizing renewable energy sources.

How Does SUSE Support Public Cloud Adoption and IT Automation?

SUSE offers a range of solutions and services designed to facilitate easy and secure public cloud adoption and enhance IT automation capabilities for businesses. These include scalable Linux operating systems, Kubernetes management platforms such as Rancher for container orchestration, and tools for cloud-native application development. SUSE’s solutions are designed to be open and interoperable, supporting a variety of cloud providers and ensuring that businesses can leverage the full benefits of public cloud and automation without being locked into a single vendor.

Emerging Trends Impacting Today’s Infrastructure and Operations

Thursday, 14 October, 2021

COVID-19 has transformed how Infrastructure & Operations (I&O) teams must operate. The Gartner® report, “Top Trends Impacting Infrastructure and Operations for 2021,” notes “The rapid and forced enablement of remote working due to COVID-19 exposed vulnerabilities in existing and entrenched processes and workstreams.” That shift to increased remote work means I&O needs to support a more distributed workforce with enterprise-grade solutions.

With IT staff stretched for time thanks to COVID-19-induced infrastructure changes, Gartner has identified top trends that are having an impact on I&O teams. Three of the trends they cover are “Anywhere Operations”, “Optimal Infrastructure” and “Operational Continuity”.

“Anywhere Operations” leverage the benefits of flexible operations

As work has moved out of offices, enterprises have been forced to support increasingly decentralized operations. Moving forward, organizations will want to ensure they can effectively support operations at any location, with an ability to operate effectively on-premises, in the public cloud and at the edge.

Consequently, the goal of I&O leaders, Gartner says, should be to “make digital the default experience and remote the default delivery model by prioritizing five remote-work enablement technologies: collaboration and productivity, secure access, cloud and edge infrastructure, quantifying digital experience and automation of endpoint operations.” I&O teams should also implement workflows that are effective regardless of a user’s location.

“Optimal Infrastructure” means no more one-size-fits-all

Gartner says, “I&O teams can no longer expect to build one-size-fits-all infrastructure solutions.” “Different infrastructure choices — such as cloud, edge or newer technologies like computational storage — may apply to different locations and workloads,” Gartner further notes.

I&O needs to determine what infrastructure works best for each of their workloads, and ensure features such as identity management and network access work effectively across a range of solutions. For each infrastructure type, organizations should perform cost-benefit analyses that take into account “specific use-case applicability and available team skills to manage specific infrastructure types,” according to Gartner.

“Operational Continuity” considers all external factors

“IT services must be continuous, regardless of external factors,” according to Gartner. For instance, I&O teams should implement automation and zero-touch technologies, so data centers can be managed remotely in the event staff are unable to enter them.

There are other benefits to boosting operational resilience, including greater efficiency and lower costs. However, Gartner notes, there are also challenges to operational continuity, “as it requires introducing new tools and processes, it may increase complexity, and it can be difficult to justify to the business.”

Gartner recommends that I&O teams should assess their continuity plans and requirements based on business priorities, plan for worst-case scenarios, and ensure operations can continue if those scenarios become reality. They also recommend looking into service-based disaster recovery (DR) options if the team does not have the time or skills to build their own DR solution.

These trends and others are detailed in the full Gartner report

The aftermath of COVID-19 and the proliferation of cloud and edge technologies have resulted in employees and customers expecting a seamless digital experience wherever they are. Tracking new trends and assessing emerging technologies have become increasingly important for infrastructure and operations leaders seeking to optimize their infrastructure. Advice from experts like Gartner can help I&O teams stay on top of the rapidly evolving IT infrastructure landscape.

Read the full Gartner report for a comprehensive description of the top I&O trends, including “Core Modernization”, “Distributed Cloud”, and “Critical Skills vs. Critical Roles”.

Download “Top Trends Impacting Infrastructure and Operations for 2021”.

 

Gartner, Top Trends Impacting Infrastructure and Operations for 2021, Jeffrey Hewitt, 30 April 2021

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

Tapping Native Controls in Kubernetes to Protect Your Cloud-Native Apps

Tuesday, 15 December, 2020
Declarative Security with Rancher, KubeLinter and StackRox on Jan. 26

As companies adopt container technologies, they face a significant challenge – how do we secure this new attack surface? It’s an issue that you often see backlogged in favor of solving storage, networking and monitoring issues. Add on the challenge of educating the workforce on one of the fastest-growing open source projects to date, and it’s no wonder security has lagged as the primary focus for teams. In fact, The New Stack published a survey that shows that almost 50 percent of Kubernetes users say security is their top unresolved issue.

In this blog, we aim to demystify the Kubernetes security threats, showcase best practices for securing your cluster, and provide useful tools to enable your developers. These tools include:

  • Rancher Kubernetes Engine (RKE) for declarative deployments
  • KubeLinter for developer-focused security checks
  • StackRox for enforcing security policies across build, deploy, and runtime

We encourage you to join the Kubernetes Security Master Class: Tapping Native Controls in Kubernetes to Protect your Cloud-Native Apps. This online Master Class will thoroughly flesh out the topics covered below and demonstrate how you can better secure your cloud-native workloads.

Background

Rancher Kubernetes Engine (RKE)

RKE is a CNCF-certified Kubernetes distribution that runs entirely within Docker containers. It addresses the installation complexity of Kubernetes by removing most host dependencies and presenting a stable path for deployment, upgrades, and rollbacks. RKE uses a declarative YAML file to configure and create Kubernetes environments. This enables reproducible on-prem or remote environments.

KubeLinter

KubeLinter is a static analysis tool that checks Kubernetes YAML files to ensure that the declared application configuration adheres to best practices. KubeLinter is StackRox’s first open source tool, designed for implementing security checks from the command line as well as part of the CI process. KubeLinter is a binary that takes in paths to YAML files and runs a list of checks against them. Admins and developers can create their own policies to enforce, enabling quicker and more automated deployments.

StackRox

StackRox Kubernetes Security Platform protects vital applications across build, deploy and runtime. StackRox deploys in your infrastructure and integrates with your DevOps tooling and workflows to deliver frictionless security and compliance. The StackRox Policy Engine includes hundreds of built-in controls to enforce DevOps and security best practices, industry standards such as the CIS Benchmarks and NIST, configuration management of both containers and Kubernetes, and runtime security. StackRox profiles your workloads to enable you to make informed decisions about the workloads’ security.

Together

RKE, KubeLinter, and StackRox enable you to deploy repeatable, secure clusters; visualize, profile, and assess security vulnerabilities; and create declarative security policies. Let’s talk about the threats these applications can tackle together.

Assessing the Threat

Let’s start by focusing on the Kubernetes vectors of attack. Microsoft recently published an attack matrix for Kubernetes based on the MITRE ATT&CK framework.


The framework is adapted for Kubernetes and based on real-world observations and cases. Luckily, there are strategies to mitigate all of the various issues. First, we can start by hardening our Kubernetes control plane. After that, we will shift focus to securing our running container workloads.

Control-Plane Hardening

The Kubernetes control plane includes the following components:

  • Kubernetes API Server
  • kube-scheduler
  • kube-controller-manager
  • etcd (if applicable)
  • cloud-controller-manager (if applicable)

etcd typically runs on the control plane nodes, though it can be hosted remotely for high availability use cases. The cloud-controller-manager is present only in cloud-provider deployments.

Kubernetes API Server

The Kubernetes REST API server is the core component of the control-plane. The server handles REST API calls, which include all communication between the various components and the user. This dependency makes securing the API server a top concern. There are some specific vulnerabilities in previous versions that are fixed with a simple upgrade to a newer version. However, the following hardening tasks are also within your control.

  • Enabling Role-based Access Control (RBAC)
  • Ensuring all API traffic is TLS-encrypted
  • Enabling audit logging
  • Setting up authentication for all K8s API clients

A deployment tool such as RKE makes these hardening tasks easy to express declaratively. Below is a snippet of a default RKE config.yml file, showing where audit logging, TLS (between Kubernetes components), and RBAC are configured.

  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
…
authentication:
  strategy: x509
  sans: []
  webhook: null
…
authorization:
  mode: rbac
  options: {}

Setting up authentication for all K8s API clients is an ongoing challenge. We need to apply a zero-trust model to any workloads that will run in our cluster.

kube-scheduler

Kubernetes’ default scheduler, kube-scheduler, is designed to be pluggable. You can build your own scheduler or run multiple schedulers for different workloads. Regardless of the implementation, the scheduler needs to be secure, and there are a few tasks to ensure that it is.

  • Set a secure port for communication to the API server
  • Ensure the scheduler runs with minimum required permissions (RBAC)
  • Restrict file permissions on kube-scheduler pod specification and kubeconfig files

With RKE, we can secure the scheduler’s connection to the API server by verifying that its bind address is set to the default of 127.0.0.1. Also restrict file permissions by making sure the root user owns the scheduler YAML file.

stat -c %U:%G /etc/kubernetes/manifests/kube-scheduler.yaml

kube-controller-manager

The Kubernetes systems regulator, the kube-controller-manager, is a daemon that regulates the system using core control loops. Securing the controller requires a similar strategy as the scheduler.

  • Set a secure port for communication to the API server
  • Ensure the controller manager runs with minimum required permissions (RBAC)
  • Restrict file permissions on kube-controller-manager pod specification and kubeconfig files

Like the scheduler, we can ensure that the controller manager binds only to the local loopback address rather than an externally reachable interface, and ensure the root user owns the controller YAML file.

stat -c %U:%G /etc/kubernetes/manifests/kube-controller-manager.yaml

etcd

The last core component of the control plane is its key-value store, etcd. All Kubernetes objects are located in etcd, which means all of your configuration files and secrets are stored as well. The best practice is to encrypt secrets or manage secret information with a separate secrets management solution such as Hashicorp Vault or a cloud provider secrets management service. Some key factors to remember when you are managing the database are:

  • Limit read/write access to the datastore
  • Encryption

We want to limit any updates or changes to the manifests to the services that are allowed access. Using RBAC controls partnered with a zero-trust model will get you started. Lastly, encryption with etcd can be cumbersome. Rancher has a unique approach where the keys are generated as part of the initial cluster configuration. Upstream Kubernetes has a similar mechanism, although the file holding the keys must itself be kept secure. Your organization’s security requirements will dictate where and how you secure your sensitive information.
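
For reference, here is a minimal sketch of the upstream Kubernetes encryption-at-rest configuration, passed to the API server via its --encryption-provider-config flag; the key name and value are placeholders:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1                            # placeholder key name
              secret: <base64-encoded-32-byte-key>  # e.g. generated with: head -c 32 /dev/urandom | base64
      - identity: {}                                # fallback so existing unencrypted data stays readable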

Cloud-controller-manager

The cloud-controller-manager is unique to cloud providers, or to any distribution that requires the cluster to communicate with a provider’s API. When working with cloud providers, admins will not have access to the master nodes of their cluster and, therefore, will not have the ability to run the hardening steps previously outlined.

Securing Workloads using Kubernetes-Native Controls

Now that our control plane is secure, it is time to work on our applications running in Kubernetes. Similar to the previous section, let’s break down the various layers of security.

  • Container Image Security
  • Runtime
  • Persistence
  • Network
  • Role-Based Access Control (RBAC)

In the section below, we will dive deep into the various considerations for each section. In addition, the Master Class will give us more time to demonstrate the Kubernetes functionality in a comprehensive way.

Container Image Security

Managing your containers before they are used is the first hurdle in container adoption. In the beginning, we need to consider:

  • Selection of base images
  • Update frequency
  • Non-essential software
  • Accessibility to build/CI tools

The salient points are to select secure base images, limit unnecessary packages, and secure your registry. Nowadays, most registries have image scanning built in to make your life easier. The StackRox Kubernetes Security Platform can automatically enforce policies on what images can be used to launch containers and identify security issues, including vulnerabilities and problematic packages, in image layers separate from the underlying base operating system (OS) image.


If you want to learn more, read this in-depth article about container image security.

Runtime

Runtime security spans several areas of Kubernetes functionality, with the core goal of ensuring that our workloads are secure. Pod Security Policies are a great place to start securing your containers, with the ability to control the following (a minimal policy sketch appears after the list):

  • Linux capabilities
  • The SELinux context of the container
  • Usage of host networking and ports
  • Use of the host filesystem
  • The user and group IDs of the container
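
As a concrete illustration, here is a minimal sketch of a restrictive PodSecurityPolicy using the policy/v1beta1 API current at the time of writing; the policy name and volume list are illustrative choices, not a complete hardening baseline:

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-sketch          # hypothetical policy name
spec:
  privileged: false                # no privileged containers
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL                          # drop all Linux capabilities by default
  hostNetwork: false               # no host networking or ports
  hostPID: false
  hostIPC: false
  runAsUser:
    rule: MustRunAsNonRoot         # containers may not run as root
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:                         # restrict use of the host filesystem
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim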

Keep in mind the zero-trust approach to systems: capabilities should be set so the container has only the minimum functionality required at runtime. For better visualization, StackRox’s risk profiling automatically identifies containers with tools potentially useful to attackers, including bash. It also monitors, detects, and alerts on concerning runtime activity, such as the execution of abnormal or unexpected processes within containers.


Persistence

Running stateful workloads in Kubernetes can open a backdoor into your containers. Attaching storage increases the attack surface and may hand a container executables or information it should not access. Best practice in Kubernetes is to ensure that stateful workloads run with the least privilege required. Other considerations include:

  • Use Namespaces as natural boundaries for storage
  • Do not run privileged containers
  • Use a Pod Security Policy to restrict pod volume access

StackRox helps mitigate these threats by delivering dynamic policy-driven admission control as part of the StackRox platform. This enables organizations to enforce security policies automatically, including limitations on host mounts and their writability before containers are ever deployed into Kubernetes clusters.

Network Access

Network access is a tough challenge in Kubernetes due to the lack of visibility into your containers. By default, no network policies are defined, and every pod can reach every other pod on the Kubernetes network. Without this permissive default, newcomers would have a tough time getting started. As your organization matures, you should strive to lock down all traffic except the traffic you deem necessary. This can be done using network policies scoped by namespace (a default-deny example follows the list). It is also important to focus on the following:

  • Using Namespaces as natural boundaries for network policies
  • Enabling a default to deny policy in each namespace
  • Using network policies that are specific to the traffic required by each pod
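
A minimal default-deny sketch, where the namespace name is a placeholder:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app        # hypothetical namespace
spec:
  podSelector: {}          # an empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress

With this policy in place, additional per-pod policies can then allow only the traffic each application component actually requires.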

One of the significant challenges with network policies is visualization. StackRox helps protect against network mapping by monitoring active network traffic between pods and automatically generating and configuring Network Policies to restrict communications to only what is necessary for application components to operate.


Role-Based Access Control (RBAC)

RBAC is central to securing your cluster. Kubernetes RBAC permissions are additive, so the main risk is an administrator or user granting exploitable permissions. The most common problem we encounter is users having cluster-admin access when they shouldn’t. Fortunately, a few RBAC best practices minimize these issues (a minimal example follows the list):

  • Use different service accounts for different types of workloads and apply the principle of least privilege
  • Regularly audit your clusters’ RBAC configurations
  • Avoid cluster-admin overuse
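
A minimal sketch of least-privilege RBAC, binding a namespaced read-only role to a dedicated service account; all names here are placeholders:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-app                   # hypothetical namespace
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]   # read-only access, nothing more
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: my-app
subjects:
  - kind: ServiceAccount
    name: app-sa                      # hypothetical workload service account
    namespace: my-app
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io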

An RKE cluster uses RBAC as the default authorization option during cluster setup. StackRox expands on this default by helping organizations limit Kubernetes RBAC permissions according to the least-privilege principle. We monitor the cluster RBAC settings for users and service accounts and identify those with excessive privileges on clusters.

Conclusion

It is challenging to take on Kubernetes security on your own. In organizations, security can get in the way of DevOps, creating hurdles that lead to abandoned security principles in the pursuit of delivery. It doesn’t have to be this way.

By proactively identifying threats and crafting reasonable policies, we further shift security left. We can assess where our time needs to be spent and avoid bogging down DevOps teams with additional responsibilities.

If you want to learn more about protecting your Kubernetes clusters and applications, sign up for the free Master Class: Tapping Native Controls in Kubernetes to Protect your Cloud-Native Apps with StackRox on January 26, 2021. We’ll use Rancher for declarative Kubernetes deployments and explore various ways to access, implement, and streamline your security code, including using KubeLinter and StackRox.


Deploying SLURM using SLE HPC patterns

Monday, 16 July, 2018

The expansion of High Performance Computing (HPC) beyond the niches of higher education and government realms to corporate and business computing use cases has been on the rise. One catalyst of this trend is ongoing innovation in hardware platforms and software development, each of which drives down the cost of deploying supercomputing services with every iteration of advancement. Consider the commonality in science and business-based innovation spaces like Artificial Intelligence (AI) and machine learning. Access to affordable supercomputing services benefits higher education and business-based stakeholders alike. Thank you, Alexa?

Hardware overview

Making the first half of this point requires an introduction to the hardware platform used for this example: a collection of six ARMv8 CPU based Raspberry Pi 3 systems. Before scoffing at the example, understand that these Pis are really being used to demonstrate the latter, yet-to-be-made point about simplified HPC cluster software deployment. But the hardware platform is still an important aspect.

ARM based Raspberry Pi cluster

Advanced RISC Machine (ARM) began working with Cray, a dominant force in the supercomputing space, in 2014, initially collaborating with U.S. DoE and European Union based research projects interested in assessing the ARM CPU platform for scientific use. One motivator, aside from the possibility of a lower cost hardware platform, was the lessening of developer angst when porting scientific software to ARM based systems. Most community authored scientific software does not fare well when ported to a different hardware platform (think x86 to GPU), but the move from x86 to ARMv8 is a far less troubled path. Often the original programming languages can be maintained, and existing code requires little change (and sometimes none).

Cray unveiled the first ARMv8 based supercomputer, named “Isambard” and sporting 10,000 high performance cores, in 2017. The debut comparison involved performance tests using common HPC code running on the most heavily utilised supercomputer in the U.K., the University of Edinburgh’s “ARCHER”. The results demonstrated that the performance of the ARM based Isambard was comparable to the x86 Skylake processors used in ARCHER, but at a remarkably lower cost point.

Software overview

The Simple Linux Utility for Resource Management (SLURM), now known as the SLURM Workload Manager, is becoming the standard in many environments for HPC cluster use. SLURM is free to use, actively developed, and unifies some tasks previously distributed across discrete HPC software stacks.

  • Cluster Manager: Organising management and compute nodes into clusters that distribute computational work.
  • Job Scheduler: Computational work is submitted as jobs that utilise system resources such as CPU cores, memory, and time.
  • Cluster Workload Manager: A service that manages access to resources, starts, executes, and monitors work, and manages a pending queue of work.

 

Software packages

SLURM makes use of several software packages to provide the described facilities.

On workload manager server(s)

  • slurm: Provides the “slurmctld” service and is the SLURM central management daemon. It monitors all other SLURM daemons and resources, accepts work (jobs), and allocates resources to those jobs.
  • slurm-slurmdbd: Provides the “slurmdbd” service, an enterprise-wide interface to a database for SLURM. The slurmdbd service uses a database to record job, user, and group accounting information. The daemon can do so for multiple clusters using a single database.
  • mariadb: A MySQL compatible database that can be used for SLURM, locally or remotely.
  • munge: A program that obfuscates credentials containing the UID and GID of calling processes. Returned credentials can be passed to another process which can validate them using the unmunge program. This allows an unrelated and potentially remote process to ascertain the identity of the calling process. Munge is used to encode all inter-daemon authentications amongst SLURM daemons.

 

Recommendations:

  • Install multiple slurmctld instances for resiliency.
  • Install the database used by slurmdbd on a very fast disk/partition (SSD is recommended) and a very fast network link if a remote server is used.

 

On compute node servers

  • slurm-node: Provides the “slurmd” service and is the compute node daemon for SLURM. It monitors all tasks running on the compute node, accepts work (tasks), launches tasks, and kills running tasks upon request.
  • munge: A program that obfuscates credentials containing the UID and GID of calling processes. Returned credentials can be passed to another process which can validate them using the unmunge program. This allows an unrelated and potentially remote process to ascertain the identity of the calling process. Munge is used to encode all inter-daemon authentications amongst SLURM daemons.

 

Recommendations:

  • Install and configure the slurm-pam_slurm package to prevent users from logging into compute nodes not assigned to them, or where they do not have active jobs running.

 

Deployment

Identify the systems that will serve as workload manager hosts, database hosts, and compute nodes, and install the minimal operating system components required. This example uses the openSUSE Leap 15 distribution. Because Leap 15 is built from the same code base as SLES 15, this tutorial should apply equally to both.

Fortunately, installing the packages required by the workload manager and compute node systems can be performed using existing installation patterns. Specifically, using the “HPC Workload Manager” and “HPC Basic Compute Node” patterns.
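
If you prefer the command line to YaST, the same patterns can be installed with zypper. The pattern identifiers below are deliberately left as placeholders; list the patterns available on your system first to find the exact names:

~# zypper search -t pattern | grep -i hpc                  # find the HPC pattern names
~# zypper install -t pattern <workload_manager_pattern>    # on the workload manager host
~# zypper install -t pattern <compute_node_pattern>        # on each compute node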

[Screenshot: YaST HPC installation patterns]

Note: The mariadb and slurm-pam_slurm packages are optional installations that can be selected when their respective patterns are selected.

Configuration

Following the software installations, some base configuration should be completed before implementing the SLURM control, database, or compute node daemons.

Workload manager and compute node systems

  • NTP services must be configured across all systems, ensuring all participate in the same time service and time zone.
  • DNS services are configured, and all cluster systems can resolve each other.
  • SLURM users and groups in the /etc/passwd and /etc/group files should have the same UID and GID values across systems. Adjust ownership of file system components if necessary.
    • /etc/slurm
    • /run/slurm
    • /var/spool/slurm
    • /var/log/slurm
  • Munge users and groups in the /etc/passwd and /etc/group files should have the same UID and GID values across systems. Adjust ownership of file system components if necessary.
    • /etc/munge
    • /run/munge
    • /var/lib/munge
    • /var/log/munge
  • The same munge secret key must be used across all systems.

 

By default, the munge secret key resides in /etc/munge/munge.key.

The munge.key file is created from /dev/urandom at installation time via a command equivalent to:

~# dd if=/dev/urandom bs=1 count=1024 >/etc/munge/munge.key

Consequently, it will differ from host to host. One option to ensure consistency is to pick the key from any one host and copy it to all other hosts.
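
For example, a minimal sketch of distributing the key from the management host; the hostname is a placeholder, and the copy must be repeated for every node in the cluster:

~# scp -p /etc/munge/munge.key root@node1:/etc/munge/munge.key
~# ssh root@node1 "chown munge:munge /etc/munge/munge.key; chmod 400 /etc/munge/munge.key; systemctl restart munge"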

You can also create a new, arguably more secure, secret key using the following method:

~# dd if=/dev/random bs=1 count=1024 >/etc/munge/munge.key

The following tasks verify that the munge software has been properly configured.

Generate a credential package for the current user on stdout:

~# munge -n

Check if a credential package for the current user can be locally decoded:

~# munge -n | unmunge

Check if a credential package for the current user can be remotely decoded:

~# munge -n | ssh <somehost> unmunge

Workload manager and database systems

  • Open any required ports for the local firewall(s) as determined by daemon placement.

 

slurmctld port: 6817
slurmdbd port: 6819
scheduler port: 7321
mariadb port: 3306

Compute nodes must be able to communicate with the hosts running slurmctld.

For example, if the slurmctld, slurmdbd, and database are running on the same host:

~# firewall-cmd --permanent --zone=<cluster_network_zone> --add-port=6817/tcp
~# firewall-cmd --permanent --zone=<cluster_network_zone> --add-port=7321/tcp
~# firewall-cmd --reload

  • Configure the default database used by SLURM, “slurm_acct_db”, and the database user and password.

Assuming the local database was not configured during the pattern-based installation, use the following commands to configure the “slurm_acct_db” database and the “slurmdb” user post installation.

Ensure the database is running.

~# systemctl start mariadb

~# mysql_secure_installation
~# mysql -u root -p

Provide the root password.

At the “MariaDB [(none)]>” prompt, issue the following commands:

Create the database access user and set the user password.

MariaDB [(none)]> CREATE USER 'slurmdb'@'localhost' IDENTIFIED BY '<user_password>';

Grant rights for the user to the target database.

MariaDB [(none)]> GRANT ALL ON slurm_acct_db.* TO 'slurmdb'@'localhost';

Note: Replace 'localhost' with an actual FQDN if required.

Create the SLURM database.

MariaDB [(none)]> CREATE DATABASE slurm_acct_db;

Validate the user and database exist.

MariaDB [(none)]> SELECT User,Host FROM mysql.user;
MariaDB [(none)]> SHOW DATABASES;
MariaDB [(none)]> exit

Ensure the database is enabled at system startup.

~# systemctl enable mariadb

  • Configure the database for real world use.

 

The default buffer size, log size, and lock wait timeouts for the database should be adjusted before slurmdbd is started for the first time. Doing so prevents potential issues with database table and schema updates, and record purging operations.

Consider setting the buffer and log sizes to 50% or 75% of the host system memory and doubling the default timeout settings.

Modify the settings in the /etc/my.cnf.d/innodb.cnf file:

[mysqld]
innodb_buffer_pool_size=256M
innodb_log_file_size=256M
innodb_lock_wait_timeout=1800

Note: The default buffer size is 128M.

To implement this change you must shut down the database and move/remove the log files:

~# systemctl stop mariadb
~# rm /var/lib/mysql/ib_logfile?
~# systemctl start mariadb

Verify the new buffer setting using the following command in the MariaDB shell:

MariaDB [(none)]> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

  • Configure the slurmdbd.conf file.

 

Ensure the /etc/slurm/slurmdbd.conf file contains the following directives with valid values:

AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurm/slurmdbd.pid
PluginDir=/usr/lib64/slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=<user_password>
StorageUser=slurmdb
StorageLoc=slurm_acct_db

Consider adding directives and values to enforce life-cycles across job related database records:

PurgeEventAfter=12months
PurgeJobAfter=12months
PurgeResvAfter=2months
PurgeStepAfter=2months
PurgeSuspendAfter=1month
PurgeTXNAfter=12months
PurgeUsageAfter=12months

  • Configure the slurm.conf file.

 

The /etc/slurm/slurm.conf file is used by the slurmctld and slurmd daemons. There are configuration file forms available online at the slurm.schedmd.com site for the latest SLURM version to assist you in generating a slurm.conf file. Additionally, if the workload manager server also provides a web server, the “/usr/share/doc/slurm-<version>/html” directory can be served locally to provide the SLURM documentation and configuration forms specific to the SLURM version installed.

For a feature complete configuration file:

https://slurm.schedmd.com/configurator.html

For a feature minimal configuration file:

https://slurm.schedmd.com/configurator.easy.html

Using the configurator.easy.html form, the following initial slurm.conf file was created:

ControlMachine=darkvixen102
AuthType=auth/munge
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmUser=slurm
SwitchType=switch/none
TaskPlugin=task/none
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmdPidFile=/run/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm
StateSaveLocation=/var/spool/slurm
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
JobAcctGatherFrequency=30
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=node[1-4] CPUs=4 RealMemory=950 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=normal_q Nodes=node[1-4] Default=YES MaxTime=480 State=UP

Add the following directives and values to the slurm.conf file to complete the database configuration and name the cluster. The cluster name will also be added to the database when all services are running.

ClusterName=hangar_hpc
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
JobAcctGatherType=jobacct_gather/linux

Copy the completed /etc/slurm/slurm.conf file to all compute nodes.

Note: The “scontrol” utility is used to view and modify the running SLURM configuration and state across a cluster. Most changes in modified slurm.conf files distributed to cluster nodes can be implemented using the scontrol utility. Using the “reconfigure” argument the utility can force all daemons to re-read updated configuration files and modify runtime settings without requiring daemon restarts. Some configuration file changes, such as authentication, system roles, or ports, will require all daemons to be restarted.

Issue the following command on a system running slurmctld to reconfigure a cluster:

~# scontrol reconfigure

  • Modify service systemd configuration files to honour daemon dependencies.

 

SLURM requires munge to be running before any SLURM daemon loads, the database to be up before slurmdbd loads, and slurmdbd to be running before slurmctld loads. Modify the systemd service files for SLURM daemons to ensure these dependencies are met.

Locally customized systemd files must be placed in the /etc/systemd/system directory.

~# cp /usr/lib/systemd/system/slurmctld.service /usr/lib/systemd/system/slurmdbd.service /etc/systemd/system/

Add the prerequisite “After= services” to the file /etc/systemd/system/slurmdbd.service:

[Unit]
Description=Slurm DBD accounting daemon
After=network.target mariadb.service munge.service
ConditionPathExists=/etc/slurm/slurm.conf

Add the prerequisite “After= services” to the file /etc/systemd/system/slurmctld.service:

[Unit]
Description=Slurm controller daemon
After=network.target slurmdbd.service munge.service
ConditionPathExists=/etc/slurm/slurm.conf

  • Enable the slurmdbd and slurmctld daemons to load at system start up, and then start them.

 

~# systemctl enable slurmdbd
~# systemctl enable slurmctld
~# systemctl start slurmdbd
~# systemctl start slurmctld

  • Name the cluster within the SLURM account database.

 

Use the SLURM account information utility to write to, and read from the database.

~# sacctmgr add cluster hangar_hpc
~# sacctmgr list cluster
~# sacctmgr list configuration
~# sacctmgr list stats

Compute node systems

SLURM compute nodes are assigned to a job queue, in SLURM parlance called a partition, enabling them to receive work. Compute nodes ideally belong to partitions that align hardware with the type of compute work to be performed. The software required by a compute job can also dictate which partition in the cluster should be used for work.

Basic compute node deployment from a SLURM perspective is a straightforward task. Once the OS (a minimal pattern is again recommended) and the “HPC Basic Compute Node” pattern are deployed, it becomes a matter of completing the following tasks.

  • Open any required ports for the local firewall(s) as determined by daemon placement.

 

slurmd port: 6818

 

Note: It is recommended that local firewalls not be implemented on compute nodes. Compute nodes should rely on the host infrastructure to provide the security required.

  • Distribute the cluster specific /etc/munge/munge.key file to the node.
  • Distribute the cluster specific /etc/slurm/slurm.conf file to the node.

 

Note: The slurm.conf file specifies the partition each compute node belongs to.

  • Modify service systemd configuration files to honour daemon dependencies.

 

Again, SLURM requires munge to be running before any daemon loads. Specifically, munge needs to be running before slurmd loads. Modify the systemd service files for SLURM daemons to ensure these dependencies are met.

Locally customized systemd files must be placed in the /etc/systemd/system directory.

~# cp /usr/lib/systemd/system/slurmd.service /etc/systemd/system/

Add the prerequisite “After= services” to the file /etc/systemd/system/slurmd.service:

[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

  • Enable the slurmd daemon to load at system start up, and then start it.

 

~# systemctl enable slurmd
~# systemctl start slurmd

Taking the new cluster for a walk

A basic assessment of the state of the cluster is now possible because all daemons are configured and running. The “sinfo” utility is used to view information about SLURM nodes and partitions, and again the “scontrol” command is used to view and modify the SLURM configuration and state across a cluster.

The following commands are issued from the management node running slurmctld:

[Screenshot: SLURM node info, assessing node states and information]

[Screenshot: SLURM partition info, assessing partition states and information]

[Screenshot: SLURM configuration info, assessing cluster configuration information]

[Screenshot: SLURM maintenance commands, changing compute node states]
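
For readers following along without the screenshots, here is a sketch of equivalent commands; the node name and maintenance reason are examples matching the slurm.conf above:

~# sinfo                                   # summarise partition and node states
~# scontrol show nodes                     # detailed per-node information
~# scontrol show config                    # the running cluster configuration
~# scontrol update NodeName=node1 State=DRAIN Reason="maintenance"   # take a node out of service
~# srun --nodes=2 hostname                 # quick smoke test across two compute nodes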


Summary

What is detailed here could easily be applied to other open source distributions of both Linux and SLURM. It should also be said that this example is not intended to oversimplify what a proper production HPC cluster entails. Setting aside data and workflow design considerations, many standard HPC cluster system roles are not discussed. The short list would include high performance parallel file systems used for compute work operating over high speed interconnects, high capacity storage used as longer-term storage for completed compute work, applications (delivered traditionally or using containers), and job submission and data transfer nodes, also using high speed interconnects. Hopefully this example serves as a basic SLURM tutorial and demonstrates how the SLE 15 based openSUSE distribution unifies software components into an easily deployable HPC cluster stack that will scale and run on existing x86 and emerging ARM based hardware platforms.