Recommended update for slurm
| Announcement ID: | SUSE-RU-2023:3759-1 |
|---|---|
| Rating: | moderate |
| References: | |
| Affected Products: | |
An update that has one fix can now be installed.
Description:
This update for slurm fixes the following issues:
- Updated to 23.02.4 with the following changes:
  - Bug Fixes:
    - Fix main scheduler loop not starting after a failover to backup controller.
    - Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
    - Fix sbatch return code when --wait is requested on a job array.
    - Fix collected GPUUtilization values for acct_gather_profile plugins.
    - Fix slurmrestd handling of job hold/release operations.
    - Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now, slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
    - Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
    - Fix handling of ArrayTaskThrottle in backfill.
    - Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
    - Fix potential double count of gres when dealing with limits.
    - Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
    - Fixed an issue where jobs requesting licenses were incorrectly rejected.
    - scrontab - Fix cutting off the final character of quoted variables.
    - smail - Fix issues where e-mails at job completion were not being sent.
    - scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
    - Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus having more tasks than they should and other gpus being unused.
    - Fix regression in 23.02 that causes slurmstepd to crash when srun requests more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin.
    - job_container/tmpfs - Fix %h and %n substitution in BasePath where %h was substituted as the NodeName instead of the hostname, and %n was substituted as an empty string.
    - Fix regression where --cpu-bind=verbose would override TaskPluginParam.
    - scancel - Fix --clusters/-M for federations. Only filtered jobs (e.g. -A, -u, -p, etc.) from the specified clusters will be canceled, rather than all jobs in the federation. Specific jobids will still be routed to the origin cluster for cancellation.
  - Other changes:
    - Make spank S_JOB_ARGV item value hold the requested command argv instead of the srun --bcast value when --bcast requested (only in local context).
    - scontrol - Permit changes to StdErr and StdIn for pending jobs.
    - scontrol - Reset std{err,in,out} when set to empty string.
    - slurmrestd - mark environment as a required field for job submission descriptions.
    - slurmrestd - avoid dumping null in OpenAPI schema required fields.
    - data_parser/v0.0.39 - avoid rejecting valid memory_per_node formatted as dictionary provided with a job description.
    - data_parser/v0.0.39 - avoid rejecting valid memory_per_cpu formatted as dictionary provided with a job description.
    - slurmrestd - Return HTTP error code 404 when job query fails.
    - slurmrestd - Add return schema to error response to job and license query.
    - Change the log message warning for rate limited users from debug to verbose.
    - cgroup/v2 - Avoid capturing log output for ebpf when constraining devices, as this can lead to inadvertent failure if the log buffer is too small.
    - Added error message when attempting to use sattach on batch or extern steps.
    - Reject job ArrayTaskThrottle update requests from unprivileged users.
    - data_parser/v0.0.39 - populate description fields of property objects in generated OpenAPI specifications where defined.
    - slurmstepd - Avoid segfault caused by ContainerPath not being terminated by / in oci.conf.
    - data_parser/v0.0.39 - Change v0.0.39_job_info response to tag exit_code field as being complex instead of only an unsigned integer.
- Updated to 23.02.3 with the following changes:
  - Bug Fixes:
    - slurmctld - Fix backup slurmctld crash when it takes control multiple times.
    - Fix regression in 23.02.2 that ignored the partition DefCpuPerGPU setting on the first pass of scheduling a job requesting --gpus --ntasks.
    - srun - fix issue creating regular and interactive steps because environment variables were incorrectly set on non-HetSteps.
    - Fix dynamic nodes getting stuck in allocated states when reconfiguring.
    - Fix regression in 23.02.2 that set the SLURM_NTASKS environment variable in sbatch jobs from --ntasks-per-node when --ntasks was not requested.
    - Fix regression in 23.02 that caused sbatch jobs to set the wrong number of tasks when requesting --ntasks-per-node without --ntasks, and also requesting one of the following options: --sockets-per-node, --cores-per-socket, --threads-per-core (or --hint=nomultithread), or -B, --extra-node-info.
    - Fix double counting suspended job counts on nodes when reconfiguring, which prevented nodes with suspended jobs from being powered down or rebooted once the jobs completed.
    - Fix backfill not scheduling jobs submitted with --prefer and --constraint properly.
    - mpi/pmix - fix regression introduced in 23.02.2 which caused PMIx shmem backed files permissions to be incorrect.
    - api/submit - fix memory leaks when submission of batch regular jobs or batch HetJobs fails (response data is a return code).
    - Fix regression in 23.02 leading to error() messages being sent at INFO instead of ERR in syslog.
    - Fix TresUsageIn[Tot|Ave] calculation for gres/gpumem and gres/gpuutil.
    - Fix issue in the gpu plugins where gpu frequencies would only be set if both gpu memory and gpu frequencies were set, while one or the other suffices.
    - Fix reservations group ACL's not working with the root group.
    - Fix updating a job with a ReqNodeList greater than the job's node count.
    - Fix inadvertent permission denied error for --task-prolog and --task-epilog with filesystems mounted with root_squash.
    - Fix missing detailed cpu and gres information in json/yaml output from scontrol, squeue and sinfo.
    - Fix regression in 23.02 that causes a failure to allocate job steps that request --cpus-per-gpu and gpus with types.
    - Fix potentially waiting indefinitely for a defunct process to finish, which affects various scripts including Prolog and Epilog. This could have various symptoms, such as jobs getting stuck in a completing state.
    - Fix losing list of reservations on job when updating job with list of reservations and restarting the controller.
    - Fix nodes resuming after down and drain state update requests from clients older than 23.02.
    - Fix advanced reservation creation/update when an association that should have access to it is composed with partition(s).
    - Fix job layout calculations with --ntasks-per-gpu, especially when --nodes has not been explicitly provided.
    - Fix X11 forwarding for jobs submitted from the slurmctld host.
    - When a job requests --no-kill and one or more nodes fail during the job, fix subsequent job steps unable to use some of the remaining resources allocated to the job.
    - Fix shared gres allocation when using --tres-per-task with tasks that span multiple sockets.
    - auth/jwt - Fix memory leak.
  - Other changes:
    - openapi/dbv0.0.39/users - If a default account update failed, resulting in a no-op, the query returned success without any warning. Now a warning is sent back to the client that the default account wasn't modified.
    - Avoid job write lock when nodes are dynamically added/removed.
    - burst_buffer/lua - allow jobs to get scheduled sooner after slurm_bb_data_in completes.
    - openapi/v0.0.39 - fix memory leak in _job_post_het_submit().
    - Avoid possible slurmctld segfault caused by race condition with already completed slurmdbd_conn connections.
    - Slurmdbd.conf checks included conf files for 0600 permissions.
    - slurmrestd - fix regression where "oversubscribe" fields were removed from job descriptions and submissions from v0.0.39 end points.
    - accounting_storage/mysql - Query for individual QOS correctly when you have more than 10.
    - Add warning message about ignoring --tres-per-tasks=license when used on a step.
    - sshare - Fix command to work when using priority/basic.
    - Avoid loading cli_filter plugins outside of salloc/sbatch/scron/srun. This fixes a number of missing symbol problems that can manifest for executables linked against libslurm (and not libslurmfull).
    - Allow cloud_reg_addrs to update dynamically registered node's addrs on subsequent registrations.
    - Revert a change in 22.05.5 that prevented tasks from sharing a core if --cpus-per-task > threads per core, but caused incorrect accounting and cpu binding. Instead, --ntasks-per-core=1 may be requested to prevent tasks from sharing a core.
    - Correctly send assoc_mgr lock to mcs plugin.
    - Avoid unnecessary gres/gpumem and gres/gpuutil TRES position lookups.
    - sacct - when printing PLANNED time, use end time instead of start time for jobs cancelled before they started.
    - Hold the job with "(Reservation ... invalid)" state reason if the reservation is not usable by the job.
    - sbatch - Added new --export=NIL option.
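
The 23.02.3 changes above note that slurmdbd now checks files included from slurmdbd.conf for 0600 permissions. As a rough pre-upgrade aid, the sketch below reports whether a given file has exactly that mode. The helper name check_conf_mode is hypothetical, the use of `stat -c` assumes GNU coreutils, and which files your slurmdbd.conf includes is site-specific; this is not how slurmdbd itself performs the check.

```shell
#!/bin/sh
# check_conf_mode FILE
# Hypothetical helper: prints a one-line verdict and returns
#   0 if FILE has mode 600, 1 if it has another mode, 2 if it is missing.
check_conf_mode() {
    file=$1
    if [ ! -e "$file" ]; then
        echo "missing: $file"
        return 2
    fi
    # stat -c %a prints the octal permission bits (GNU coreutils assumed).
    mode=$(stat -c %a "$file")
    if [ "$mode" = "600" ]; then
        echo "ok: $file"
        return 0
    fi
    echo "mode $mode, expected 600: $file"
    return 1
}
```

It could be called once per included file, for example `check_conf_mode /etc/slurm/slurmdbd.d/db.conf`, where that path is only a placeholder for whatever your configuration actually includes.
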
Patch Instructions:
To install this SUSE update use the SUSE recommended
installation methods like YaST online_update or "zypper patch".
Alternatively you can run the command listed for your product:
- openSUSE Leap 15.5
  zypper in -t patch SUSE-2023-3759=1 openSUSE-SLE-15.5-2023-3759=1
- HPC Module 15-SP5
  zypper in -t patch SUSE-SLE-Module-HPC-15-SP5-2023-3759=1
- SUSE Package Hub 15 15-SP5
  zypper in -t patch SUSE-SLE-Module-Packagehub-Subpackages-15-SP5-2023-3759=1
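
After patching, it can be useful to confirm that the installed slurm package is at least 23.02.4. The following is a minimal sketch, not part of the advisory: the function names version_ge and slurm_is_patched are made up for illustration, `sort -V` assumes GNU coreutils, and the rpm query assumes the package is named slurm as in the package list.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to B.
# Relies on sort -V (version sort), a GNU coreutils extension.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# slurm_is_patched: succeeds when the installed slurm RPM is >= 23.02.4.
# Returns 2 when rpm is unavailable or the package is not installed.
slurm_is_patched() {
    command -v rpm >/dev/null 2>&1 || return 2
    # %{VERSION} is the upstream version without the release suffix.
    installed=$(rpm -q --queryformat '%{VERSION}' slurm 2>/dev/null) || return 2
    version_ge "$installed" "23.02.4"
}
```

A typical use would be `slurm_is_patched && echo "slurm is at 23.02.4 or newer"`, run on each node that received the update.
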
Package List:
- openSUSE Leap 15.5 (aarch64 ppc64le s390x x86_64)
  - slurm-node-debuginfo-23.02.4-150500.5.6.1
  - slurm-plugin-ext-sensors-rrd-23.02.4-150500.5.6.1
  - slurm-cray-23.02.4-150500.5.6.1
  - slurm-torque-debuginfo-23.02.4-150500.5.6.1
  - slurm-plugin-ext-sensors-rrd-debuginfo-23.02.4-150500.5.6.1
  - slurm-debugsource-23.02.4-150500.5.6.1
  - slurm-lua-23.02.4-150500.5.6.1
  - slurm-lua-debuginfo-23.02.4-150500.5.6.1
  - slurm-rest-debuginfo-23.02.4-150500.5.6.1
  - slurm-testsuite-23.02.4-150500.5.6.1
  - slurm-cray-debuginfo-23.02.4-150500.5.6.1
  - perl-slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-auth-none-23.02.4-150500.5.6.1
  - slurm-plugins-23.02.4-150500.5.6.1
  - slurm-torque-23.02.4-150500.5.6.1
  - libslurm39-debuginfo-23.02.4-150500.5.6.1
  - slurm-hdf5-debuginfo-23.02.4-150500.5.6.1
  - perl-slurm-23.02.4-150500.5.6.1
  - libpmi0-23.02.4-150500.5.6.1
  - slurm-pam_slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-hdf5-23.02.4-150500.5.6.1
  - slurm-devel-23.02.4-150500.5.6.1
  - slurm-munge-23.02.4-150500.5.6.1
  - slurm-slurmdbd-debuginfo-23.02.4-150500.5.6.1
  - slurm-sql-23.02.4-150500.5.6.1
  - slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-auth-none-debuginfo-23.02.4-150500.5.6.1
  - slurm-munge-debuginfo-23.02.4-150500.5.6.1
  - libpmi0-debuginfo-23.02.4-150500.5.6.1
  - libslurm39-23.02.4-150500.5.6.1
  - libnss_slurm2-debuginfo-23.02.4-150500.5.6.1
  - slurm-node-23.02.4-150500.5.6.1
  - slurm-23.02.4-150500.5.6.1
  - slurm-rest-23.02.4-150500.5.6.1
  - slurm-sql-debuginfo-23.02.4-150500.5.6.1
  - slurm-pam_slurm-23.02.4-150500.5.6.1
  - libnss_slurm2-23.02.4-150500.5.6.1
  - slurm-sview-debuginfo-23.02.4-150500.5.6.1
  - slurm-slurmdbd-23.02.4-150500.5.6.1
  - slurm-plugins-debuginfo-23.02.4-150500.5.6.1
  - slurm-sview-23.02.4-150500.5.6.1
- openSUSE Leap 15.5 (noarch)
  - slurm-doc-23.02.4-150500.5.6.1
  - slurm-seff-23.02.4-150500.5.6.1
  - slurm-sjstat-23.02.4-150500.5.6.1
  - slurm-config-23.02.4-150500.5.6.1
  - slurm-webdoc-23.02.4-150500.5.6.1
  - slurm-openlava-23.02.4-150500.5.6.1
  - slurm-config-man-23.02.4-150500.5.6.1
- HPC Module 15-SP5 (aarch64 x86_64)
  - slurm-node-debuginfo-23.02.4-150500.5.6.1
  - slurm-plugin-ext-sensors-rrd-23.02.4-150500.5.6.1
  - slurm-cray-23.02.4-150500.5.6.1
  - slurm-torque-debuginfo-23.02.4-150500.5.6.1
  - slurm-plugin-ext-sensors-rrd-debuginfo-23.02.4-150500.5.6.1
  - slurm-debugsource-23.02.4-150500.5.6.1
  - slurm-lua-23.02.4-150500.5.6.1
  - slurm-lua-debuginfo-23.02.4-150500.5.6.1
  - slurm-rest-debuginfo-23.02.4-150500.5.6.1
  - slurm-cray-debuginfo-23.02.4-150500.5.6.1
  - perl-slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-auth-none-23.02.4-150500.5.6.1
  - slurm-plugins-23.02.4-150500.5.6.1
  - slurm-torque-23.02.4-150500.5.6.1
  - libslurm39-debuginfo-23.02.4-150500.5.6.1
  - perl-slurm-23.02.4-150500.5.6.1
  - libpmi0-23.02.4-150500.5.6.1
  - slurm-pam_slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-devel-23.02.4-150500.5.6.1
  - slurm-munge-23.02.4-150500.5.6.1
  - slurm-slurmdbd-debuginfo-23.02.4-150500.5.6.1
  - slurm-sql-23.02.4-150500.5.6.1
  - slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-auth-none-debuginfo-23.02.4-150500.5.6.1
  - slurm-munge-debuginfo-23.02.4-150500.5.6.1
  - libpmi0-debuginfo-23.02.4-150500.5.6.1
  - libslurm39-23.02.4-150500.5.6.1
  - slurm-node-23.02.4-150500.5.6.1
  - libnss_slurm2-debuginfo-23.02.4-150500.5.6.1
  - slurm-23.02.4-150500.5.6.1
  - slurm-rest-23.02.4-150500.5.6.1
  - slurm-sql-debuginfo-23.02.4-150500.5.6.1
  - slurm-pam_slurm-23.02.4-150500.5.6.1
  - libnss_slurm2-23.02.4-150500.5.6.1
  - slurm-sview-debuginfo-23.02.4-150500.5.6.1
  - slurm-slurmdbd-23.02.4-150500.5.6.1
  - slurm-plugins-debuginfo-23.02.4-150500.5.6.1
  - slurm-sview-23.02.4-150500.5.6.1
- HPC Module 15-SP5 (noarch)
  - slurm-config-man-23.02.4-150500.5.6.1
  - slurm-doc-23.02.4-150500.5.6.1
  - slurm-config-23.02.4-150500.5.6.1
  - slurm-webdoc-23.02.4-150500.5.6.1
- SUSE Package Hub 15 15-SP5 (ppc64le s390x)
  - slurm-node-debuginfo-23.02.4-150500.5.6.1
  - slurm-cray-23.02.4-150500.5.6.1
  - slurm-torque-debuginfo-23.02.4-150500.5.6.1
  - slurm-debugsource-23.02.4-150500.5.6.1
  - slurm-lua-23.02.4-150500.5.6.1
  - slurm-lua-debuginfo-23.02.4-150500.5.6.1
  - slurm-rest-debuginfo-23.02.4-150500.5.6.1
  - slurm-cray-debuginfo-23.02.4-150500.5.6.1
  - perl-slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-auth-none-23.02.4-150500.5.6.1
  - slurm-plugins-23.02.4-150500.5.6.1
  - slurm-torque-23.02.4-150500.5.6.1
  - slurm-hdf5-debuginfo-23.02.4-150500.5.6.1
  - perl-slurm-23.02.4-150500.5.6.1
  - libpmi0-23.02.4-150500.5.6.1
  - slurm-pam_slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-hdf5-23.02.4-150500.5.6.1
  - slurm-devel-23.02.4-150500.5.6.1
  - slurm-munge-23.02.4-150500.5.6.1
  - slurm-slurmdbd-debuginfo-23.02.4-150500.5.6.1
  - slurm-sql-23.02.4-150500.5.6.1
  - slurm-debuginfo-23.02.4-150500.5.6.1
  - slurm-auth-none-debuginfo-23.02.4-150500.5.6.1
  - slurm-munge-debuginfo-23.02.4-150500.5.6.1
  - libpmi0-debuginfo-23.02.4-150500.5.6.1
  - slurm-node-23.02.4-150500.5.6.1
  - libnss_slurm2-debuginfo-23.02.4-150500.5.6.1
  - slurm-23.02.4-150500.5.6.1
  - slurm-rest-23.02.4-150500.5.6.1
  - slurm-sql-debuginfo-23.02.4-150500.5.6.1
  - slurm-pam_slurm-23.02.4-150500.5.6.1
  - libnss_slurm2-23.02.4-150500.5.6.1
  - slurm-sview-debuginfo-23.02.4-150500.5.6.1
  - slurm-slurmdbd-23.02.4-150500.5.6.1
  - slurm-plugins-debuginfo-23.02.4-150500.5.6.1
  - slurm-sview-23.02.4-150500.5.6.1
- SUSE Package Hub 15 15-SP5 (noarch)
  - slurm-doc-23.02.4-150500.5.6.1
  - slurm-seff-23.02.4-150500.5.6.1
  - slurm-sjstat-23.02.4-150500.5.6.1
  - slurm-config-23.02.4-150500.5.6.1
  - slurm-webdoc-23.02.4-150500.5.6.1
  - slurm-openlava-23.02.4-150500.5.6.1
  - slurm-config-man-23.02.4-150500.5.6.1