Security in High Performance Computing (HPC) environments has always been an exercise in compromise: striking a balance between implementing an adequate level of security and not interfering with “getting the science done”. In the simplest terms, the primary security realms HPC systems and clusters are exposed to are environmental and user facing; more specifically, the security posture of the data centers hosting HPC systems and how users are permitted to access them. Security at the cluster edges is certainly a worthwhile discussion. However, this article focuses on user access security within HPC clusters, where the science is getting done and security is at its most minimal by design. Implementing user access restrictions on cluster compute nodes may seem unnecessary at first blush.
Why would there be a need to restrict user access to compute nodes? Two primary reasons merit mention.
Maintaining fairness in cluster resource allocation and access.
The high CPU, GPU, and memory density of modern HPC compute nodes provides sufficient resources for concurrent distributed workloads. Workloads on a compute node will usually belong to different users, and those workloads are understandably important to their respective owners. Moreover, research workloads may have normal runtimes measured in seconds, weeks, or even months. If a user were to access a node and initiate work or processes not managed by the cluster scheduler or resource management facilities, and cause the node to crash, that would certainly not be fair to the owners of the other workloads on that node.
Maintaining accuracy in cluster metrics and trends
Real-time metrics are important for active visibility into cluster health and utilisation. Historical health and utilisation data can also be useful for computational capacity analysis and even insight into future cluster design needs. If users access nodes where they have workloads running and augment or adjust them with additional processes and resource consumption, again not managed by the cluster scheduler or resource management facilities, it is desirable to capture those metrics as well.
The Simple Linux Utility for Resource Management (SLURM) software stack includes Pluggable Authentication Modules (PAM) that can be used to manage user access to compute nodes in clusters it manages.
Installation of the “PAM module for restricting access to compute nodes via SLURM” package (slurm-pam_slurm) on a node where the “Minimal SLURM node” package (slurm-node) is installed provides the following two PAM modules.
/lib64/security/pam_slurm.so: Considered the legacy implementation of the two modules. This module’s functionality is limited to preventing users from logging into nodes where they do not own compute jobs.
/lib64/security/pam_slurm_adopt.so: The preferred and most capable module. In addition to preventing users from logging into nodes where they do not own compute jobs, it tracks other processes spawned by a user’s SSH connection to that node. These processes are adopted as external steps to the user’s job. Those external steps are not only integrated with SLURM’s accounting facilities, but also its control group facilities (cpuset, memory, etc.) to ensure the adopted processes are contained and even terminated properly.
Disable systemd session management on HPC compute nodes
Systemd utilises the “pam_systemd.so” module, in the session module interface of the system PAM stack, to register user sessions with the systemd login manager, systemd-logind.service. Because the systemd module and the login manager service manage the default user control group (cgroup) hierarchy, they conflict with the cgroup facilities of the pam_slurm_adopt.so module. Both systemd components can therefore be disabled for login services on HPC compute nodes.
More on how this is accomplished later in the article.
SLURM must load the plugin that supports cgroups and requires the addition of the “Contain” value to the PrologFlags directive.
Modify the /etc/slurm/slurm.conf file:
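A minimal fragment satisfying these requirements is sketched below. The choice of proctrack/cgroup and task/cgroup is a common pairing for pam_slurm_adopt, but the exact plugin selection depends on the site's existing configuration.

```
# Track and contain job processes with cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
# Create the "extern" step at job start so adopted processes have a home
PrologFlags=Contain
```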
* Not required by the pam_slurm.so module.
Best practice recommendations
The implications of the “UsePAM=1” directive in the slurm.conf file should be understood before it is implemented. It is not required by either the “pam_slurm.so” or “pam_slurm_adopt.so” module. Rather, it is used to provision a user’s environment on a compute node instead of using the standard user profile captured from an origin login or submission node.
The use of the “UsePAM=1” directive and value also requires the use of a custom PAM file to implement the desired environment. An example is provided below.
auth required pam_localuser.so
account required pam_unix.so
session required pam_limits.so
Using the “UsePAM=1” directive in the slurm.conf file together with the custom SLURM PAM file provides an alternate method of enforcing resource limits in environments where the pam_slurm.so module is used. In most environments it should not be required.
The SSH daemon on compute nodes should support the use of PAM services to authenticate users.
Modify the /etc/ssh/sshd_config file:
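The essential directive is UsePAM; the remaining lines below reflect the best-practice recommendations that follow. Note that newer OpenSSH releases rename ChallengeResponseAuthentication to KbdInteractiveAuthentication, so adjust for your version.

```
# Hand authentication, account, and session handling to the PAM stack
UsePAM yes
# Best practice: keys for root, no simple passwords
PermitRootLogin prohibit-password
PasswordAuthentication no
ChallengeResponseAuthentication yes
```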
Best practice recommendations
Require key-based authentication for root user logins and disable simple password-based authentication.
* Challenge Response Authentication supports modern forms of authentication in addition to prompting for, accepting, and validating passwords.
This deployment example demonstrates the configuration of the “pam_slurm_adopt.so” module.
On the target compute nodes.
- Ensure the /etc/slurm/slurm.conf file requirements are met.
- Ensure the /etc/ssh/sshd_config file requirements are met.
- Install the required packages.
~# zypper install slurm-pam_slurm
- Create local groups on compute nodes that will be used to permit access for administrative users irrespective of compute job ownership.
~# groupadd hpc_admin_g
- Modify the configuration file used by the “pam_access.so” module (to be implemented in a later step) to support the administrator user access.
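The entries in /etc/security/access.conf might look like the following sketch; the group name and network shown are the examples used in this article, and pam_access evaluates the list top down, first match wins.

```
# Members of hpc_admin_g may connect from the management network
+ : (hpc_admin_g) : 192.168.0.0/24
# root may connect from anywhere
+ : root : ALL
# Everyone else is denied
- : ALL : ALL
```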
This configuration permits access to members of the “hpc_admin_g” group from the “192.168.0.0/24” network, to user “root” from all networks, and denies access to all others.
* Reference the man file for the pam_access.so module for additional information on this configuration.
- Create custom PAM files for SLURM module implementation
Create new files in the /etc/pam.d directory with the “-pc” suffix. Then create symbolic links to those files (conforming to the standards in the /etc/pam.d directory). These new files will be referenced in the yet to be modified /etc/pam.d/sshd file.
The use of custom files retains as much default content in the PAM service files as possible. This is useful in the event the system needs to be rolled back to the default authentication services and to ensure system patches do not modify customised content.
~# cd /etc/pam.d
~# cp common-account-pc slurm-common-account-pc
~# ln -s slurm-common-account-pc slurm-common-account
Modify the new file.
account required pam_unix.so try_first_pass
account optional pam_sss.so use_first_pass
account sufficient pam_access.so
account required pam_slurm_adopt.so
* This configuration uses the SSSD for Active Directory authentication in addition to local UNIX authentication. The “pam_sss.so” module may not be present in other configurations.
~# cp common-session-pc slurm-common-session-pc
~# ln -s slurm-common-session-pc slurm-common-session
Modify the new file.
# session optional pam_systemd.so
session required pam_limits.so
session required pam_unix.so try_first_pass
session optional pam_sss.so
session optional pam_umask.so
session optional pam_env.so
* The pam_systemd.so module is removed from the customised configuration.
Modify the PAM configuration file used by the SSH service to reference the customised files.
auth requisite pam_nologin.so
auth include common-auth
account requisite pam_nologin.so
account include slurm-common-account
password include common-password
session required pam_loginuid.so
session include slurm-common-session
session optional pam_lastlog.so silent noupdate showfailed
- Disable and mask the systemd login service daemon.
~# systemctl stop systemd-logind
~# systemctl mask systemd-logind
Taking the new configuration for a walk
The required components are now configured, and the new user access model can be tested against a compute node.
The following commands are issued from the job submission node that has access to compute nodes in the cluster.
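A session similar to the following illustrates the expected behaviour; the node name “node01” is a placeholder, and the exact denial message printed by pam_slurm_adopt may vary between SLURM versions.

```
# Without an active job on the node, SSH access is refused
~$ ssh node01
Access denied by pam_slurm_adopt: you have no active jobs on this node

# Allocate resources on the node, after which SSH access succeeds
~$ salloc -w node01
~$ ssh node01
```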
Figure 2: Users without active jobs should not be able to log in to compute nodes.
Implementing the SLURM PAM modules can certainly improve the inward-facing security, node reliability, and accounting facets of a cluster. The configuration footprint is relatively lightweight and easy to deploy. The availability of these modules is one of the many benefits of using the SLURM cluster management stack. Enjoy!