Recommended update for slurm

Announcement ID: SUSE-RU-2018:2706-1
Rating: moderate
References:
Affected Products:
  • HPC Module 15
  • SUSE Linux Enterprise High Performance Computing 15

An update that has two fixes can now be installed.

Description:

This update for slurm provides version 17.11.9 and fixes the following issues:

  • When using a remote shared StateSaveLocation, slurmctld needs to be started after remote filesystems have become available. (bsc#1103561)
  • Fix race in the slurmctld backup controller which prevents it from cleaning up allocations on nodes properly after failing over. (bsc#1084917)
  • Fix segfault in slurmctld when a job's node bitmap is NULL during a scheduling cycle.
  • Remove erroneous unlock in acct_gather_energy/ipmi.
  • Enable support for hwloc version 2.0.1.
  • Fix 'srun -q' (--qos) option handling.
  • Fix socket communication issue that can lead to lost task completion messages, which will cause a permanently stuck srun process.
  • Avoid node layout fragmentation if running with a fixed CPU count but without Sockets and CoresPerSocket defined.
  • burst_buffer/cray: Fix datawarp swap default pool overriding jobdw.
  • Fix incorrect job priority assignment for a multi-partition job with different PriorityTier settings on the partitions.
  • Fix sinfo to print correct node state.
  • Do not allocate nodes that were marked down because they did not respond within ResumeTimeout.
  • task/cray plugin: Search for "mems" cgroup information in the file "cpuset.mems" then fall back to the file "mems".
  • Fix ipmi profile debug uninitialized variable.
  • PMIx: Fix the direct connect inline msg sending.
  • MySQL: Fix issue of not handling all fields when loading an archive dump.
  • Allow a job_submit plugin to change the admin_comment field during job_submit_plugin_modify().
  • job_submit/lua: Fix access into reservation table.
  • MySQL: Prevent deadlock caused by archive logic locking reads.
  • Don't enforce MaxQueryTimeRange when requesting specific jobs.
  • Modify --test-only logic to properly support jobs submitted to more than one partition.
  • Prevent slurmctld from aborting when attempting to set a non-existent QOS as def_qos_id.
  • Add new job dependency type of "afterburstbuffer". The pending job will be delayed until the first job completes execution and its burst buffer stage-out is completed.
  • Reorder proctrack/task plugin loading in slurmstepd to match that of slurmd, avoiding the race condition that calling task before proctrack can introduce.
  • Prevent reboot of a busy KNL node when requesting inactive features.
  • Reinitialize previously adjusted job members to their original values when validating the job memory in multi-partition requests.
  • Fix _step_signal() so that it no longer always returns SLURM_SUCCESS.
  • Combine active and available node feature change logs on one line rather than one line per node for performance reasons.
  • Prevent occasionally leaking freezer cgroups.
  • Fix potential segfault when closing the mpi/pmi2 plugin.
  • Fix --exclusive=[user|mcs] so that it works correctly with preemption or when a job requests a specific list of hosts.
  • mpi/pmix: Fix the collectives canceling.
  • SlurmDBD: Improve error message handling on archive load failure.
  • Fix incorrect locking when deleting reservations.
  • Fix incorrect locking when setting up the power save module.
  • Fix setting format output length for squeue when showing array jobs.
  • Add xstrstr function.
  • Fix printing of --hint options in the --help output of sbatch and salloc.
  • Prevent possible divide by zero in _validate_time_limit().
  • Add Delegate=yes to the slurmd.service file to prevent systemd from interfering with the jobs' cgroup hierarchies.
  • Change the backlog argument to the listen() syscall within srun to 4096 to match the value used elsewhere in the code, and avoid communication problems at scale.
  • Recommend slurm-munge for slurm-slurmdbd.
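
Sites that override or replace the packaged slurmd.service may want to confirm that the Delegate=yes setting mentioned above is actually in effect on a compute node. The following is a minimal sketch, not part of the update itself: the helper name delegate_enabled is hypothetical, and it assumes the unit is named slurmd.service as in the list above. It reads the "Key=Value" lines that "systemctl show" prints and succeeds only if delegation is reported as enabled.

```shell
#!/bin/sh
# Sketch only: "delegate_enabled" is a hypothetical helper name.
# Reads "Key=Value" lines on stdin (the format printed by `systemctl show`)
# and succeeds only if one line is exactly "Delegate=yes".
delegate_enabled() {
    grep -qx 'Delegate=yes'
}

# Example invocation on a compute node (commented out because it needs
# systemd and an installed slurmd.service unit):
# systemctl show -p Delegate slurmd.service | delegate_enabled \
#     && echo "slurmd.service: cgroup delegation is enabled"
```

Querying the effective property with "systemctl show" rather than reading the unit file directly means that local drop-in overrides are taken into account.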

Patch Instructions:

To install this SUSE update, use the SUSE recommended installation methods such as YaST online_update or "zypper patch".
Alternatively you can run the command listed for your product:

  • HPC Module 15
    zypper in -t patch SUSE-SLE-Module-HPC-15-2018-1898=1
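
After patching, one way to confirm that the update was applied is to compare the installed version and release of the slurm package with the value shown in the package list of this announcement. The sketch below is illustrative only: the helper name slurm_is_updated is hypothetical, and the comparison is an exact string match, so a later update would also be reported as a mismatch.

```shell
#!/bin/sh
# Version-release string taken from the package list in this announcement.
EXPECTED="17.11.9-6.9.1"

# Hypothetical helper: succeeds only if the given version-release string
# is exactly the one shipped by this update.
slurm_is_updated() {
    [ "$1" = "$EXPECTED" ]
}

# Example invocation (commented out because it needs rpm and an installed
# slurm package):
# installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' slurm)
# slurm_is_updated "$installed" && echo "slurm is at $EXPECTED"
```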

Package List:

  • HPC Module 15 (aarch64 x86_64)
    • slurm-auth-none-17.11.9-6.9.1
    • slurm-lua-debuginfo-17.11.9-6.9.1
    • slurm-slurmdbd-17.11.9-6.9.1
    • perl-slurm-17.11.9-6.9.1
    • slurm-debugsource-17.11.9-6.9.1
    • slurm-debuginfo-17.11.9-6.9.1
    • libpmi0-17.11.9-6.9.1
    • slurm-sql-17.11.9-6.9.1
    • slurm-sql-debuginfo-17.11.9-6.9.1
    • slurm-lua-17.11.9-6.9.1
    • libslurm32-17.11.9-6.9.1
    • libpmi0-debuginfo-17.11.9-6.9.1
    • slurm-pam_slurm-17.11.9-6.9.1
    • slurm-node-debuginfo-17.11.9-6.9.1
    • slurm-auth-none-debuginfo-17.11.9-6.9.1
    • slurm-pam_slurm-debuginfo-17.11.9-6.9.1
    • slurm-torque-debuginfo-17.11.9-6.9.1
    • slurm-doc-17.11.9-6.9.1
    • slurm-config-17.11.9-6.9.1
    • perl-slurm-debuginfo-17.11.9-6.9.1
    • slurm-munge-debuginfo-17.11.9-6.9.1
    • slurm-plugins-17.11.9-6.9.1
    • slurm-slurmdbd-debuginfo-17.11.9-6.9.1
    • slurm-devel-17.11.9-6.9.1
    • slurm-plugins-debuginfo-17.11.9-6.9.1
    • slurm-torque-17.11.9-6.9.1
    • slurm-node-17.11.9-6.9.1
    • libslurm32-debuginfo-17.11.9-6.9.1
    • slurm-17.11.9-6.9.1
    • slurm-munge-17.11.9-6.9.1
