Recommended update for slurm
Announcement ID: | SUSE-RU-2023:3759-1 |
---|---|
Rating: | moderate |
References: | |
Affected Products: | |
An update that has one fix can now be installed.
Description:
This update for slurm fixes the following issues:
- Updated to 23.02.4 with the following changes:
  - Bug Fixes:
    - Fix main scheduler loop not starting after a failover to backup controller.
    - Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
    - Fix sbatch return code when --wait is requested on a job array.
    - Fix collected GPUUtilization values for acct_gather_profile plugins.
    - Fix slurmrestd handling of job hold/release operations.
    - Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now, slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
    - Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
    - Fix handling of ArrayTaskThrottle in backfill.
    - Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
    - Fix potential double count of gres when dealing with limits.
    - Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
    - Fixed an issue where jobs requesting licenses were incorrectly rejected.
    - scrontab - Fix cutting off the final character of quoted variables.
    - smail - Fix issues where e-mails at job completion were not being sent.
    - scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
    - Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus having more tasks than they should and other gpus being unused.
    - Fix regression in 23.02 that causes slurmstepd to crash when srun requests more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin.
    - job_container/tmpfs - Fix %h and %n substitution in BasePath where %h was substituted as the NodeName instead of the hostname, and %n was substituted as an empty string.
    - Fix regression where --cpu-bind=verbose would override TaskPluginParam.
    - scancel - Fix --clusters/-M for federations. Only filtered jobs (e.g. -A, -u, -p, etc.) from the specified clusters will be canceled, rather than all jobs in the federation. Specific jobids will still be routed to the origin cluster for cancellation.
  - Other changes:
    - Make spank S_JOB_ARGV item value hold the requested command argv instead of the srun --bcast value when --bcast requested (only in local context).
    - scontrol - Permit changes to StdErr and StdIn for pending jobs.
    - scontrol - Reset std{err,in,out} when set to empty string.
    - slurmrestd - mark environment as a required field for job submission descriptions.
    - slurmrestd - avoid dumping null in OpenAPI schema required fields.
    - data_parser/v0.0.39 - avoid rejecting valid memory_per_node formatted as dictionary provided with a job description.
    - data_parser/v0.0.39 - avoid rejecting valid memory_per_cpu formatted as dictionary provided with a job description.
    - slurmrestd - Return HTTP error code 404 when job query fails.
    - slurmrestd - Add return schema to error response to job and license query.
    - Change the log message warning for rate limited users from debug to verbose.
    - cgroup/v2 - Avoid capturing log output for ebpf when constraining devices, as this can lead to inadvertent failure if the log buffer is too small.
    - Added error message when attempting to use sattach on batch or extern steps.
    - Reject job ArrayTaskThrottle update requests from unprivileged users.
    - data_parser/v0.0.39 - populate description fields of property objects in generated OpenAPI specifications where defined.
    - slurmstepd - Avoid segfault caused by ContainerPath not being terminated by '/' in oci.conf.
    - data_parser/v0.0.39 - Change v0.0.39_job_info response to tag exit_code field as being complex instead of only an unsigned integer.
- Updated to 23.02.3 with the following changes:
  - Bug Fixes:
    - slurmctld - Fix backup slurmctld crash when it takes control multiple times.
    - Fix regression in 23.02.2 that ignored the partition DefCpuPerGPU setting on the first pass of scheduling a job requesting --gpus --ntasks.
    - srun - fix issue creating regular and interactive steps because environment variables were incorrectly set on non-HetSteps.
    - Fix dynamic nodes getting stuck in allocated states when reconfiguring.
    - Fix regression in 23.02.2 that set the SLURM_NTASKS environment variable in sbatch jobs from --ntasks-per-node when --ntasks was not requested.
    - Fix regression in 23.02 that caused sbatch jobs to set the wrong number of tasks when requesting --ntasks-per-node without --ntasks, and also requesting one of the following options: --sockets-per-node, --cores-per-socket, --threads-per-core (or --hint=nomultithread), or -B,--extra-node-info.
    - Fix double counting suspended job counts on nodes when reconfiguring, which prevented nodes with suspended jobs from being powered down or rebooted once the jobs completed.
    - Fix backfill not scheduling jobs submitted with --prefer and --constraint properly.
    - mpi/pmix - fix regression introduced in 23.02.2 which caused PMIx shmem backed files permissions to be incorrect.
    - api/submit - fix memory leaks when submission of batch regular jobs or batch HetJobs fails (response data is a return code).
    - Fix regression in 23.02 leading to error() messages being sent at INFO instead of ERR in syslog.
    - Fix TresUsageIn[Tot|Ave] calculation for gres/gpumem and gres/gpuutil.
    - Fix issue in the gpu plugins where gpu frequencies would only be set if both gpu memory and gpu frequencies were set, while one or the other suffices.
    - Fix reservations group ACL's not working with the root group.
    - Fix updating a job with a ReqNodeList greater than the job's node count.
    - Fix inadvertent permission denied error for --task-prolog and --task-epilog with filesystems mounted with root_squash.
    - Fix missing detailed cpu and gres information in json/yaml output from scontrol, squeue and sinfo.
    - Fix regression in 23.02 that causes a failure to allocate job steps that request --cpus-per-gpu and gpus with types.
    - Fix potentially waiting indefinitely for a defunct process to finish, which affects various scripts including Prolog and Epilog. This could have various symptoms, such as jobs getting stuck in a completing state.
    - Fix losing list of reservations on job when updating job with list of reservations and restarting the controller.
    - Fix nodes resuming after down and drain state update requests from clients older than 23.02.
    - Fix advanced reservation creation/update when an association that should have access to it is composed with partition(s).
    - Fix job layout calculations with --ntasks-per-gpu, especially when --nodes has not been explicitly provided.
    - Fix X11 forwarding for jobs submitted from the slurmctld host.
    - When a job requests --no-kill and one or more nodes fail during the job, fix subsequent job steps unable to use some of the remaining resources allocated to the job.
    - Fix shared gres allocation when using --tres-per-task with tasks that span multiple sockets.
    - auth/jwt - Fix memory leak.
  - Other changes:
    - openapi/dbv0.0.39/users - If a default account update failed, resulting in a no-op, the query returned success without any warning. Now a warning is sent back to the client that the default account wasn't modified.
    - Avoid job write lock when nodes are dynamically added/removed.
    - burst_buffer/lua - allow jobs to get scheduled sooner after slurm_bb_data_in completes.
    - openapi/v0.0.39 - fix memory leak in _job_post_het_submit().
    - Avoid possible slurmctld segfault caused by race condition with already completed slurmdbd_conn connections.
    - Slurmdbd.conf checks included conf files for 0600 permissions.
    - slurmrestd - fix regression where "oversubscribe" fields were removed from job descriptions and submissions from v0.0.39 endpoints.
    - accounting_storage/mysql - Query for individual QOS correctly when you have more than 10.
    - Add warning message about ignoring --tres-per-tasks=license when used on a step.
    - sshare - Fix command to work when using priority/basic.
    - Avoid loading cli_filter plugins outside of salloc/sbatch/scron/srun. This fixes a number of missing symbol problems that can manifest for executables linked against libslurm (and not libslurmfull).
    - Allow cloud_reg_addrs to update dynamically registered node's addrs on subsequent registrations.
    - Revert a change in 22.05.5 that prevented tasks from sharing a core if --cpus-per-task > threads per core, but caused incorrect accounting and cpu binding. Instead, --ntasks-per-core=1 may be requested to prevent tasks from sharing a core.
    - Correctly send assoc_mgr lock to mcs plugin.
    - Avoid unnecessary gres/gpumem and gres/gpuutil TRES position lookups.
    - sacct - when printing PLANNED time, use end time instead of start time for jobs cancelled before they started.
    - Hold the job with "(Reservation ... invalid)" state reason if the reservation is not usable by the job.
    - sbatch - Added new --export=NIL option.
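Because the changelog notes that conf files included from slurmdbd.conf are now checked for 0600 permissions, it may be worth confirming file modes before restarting slurmdbd. The following is only a minimal sketch, not Slurm's own check: the helper name check_mode_0600 and the example path are hypothetical, and it assumes GNU stat is available.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): succeed only if the file's
# permission bits are exactly 0600, the mode named in the changelog.
check_mode_0600() {
    file=$1
    # GNU stat prints the octal permission bits with -c '%a' (e.g. 600).
    mode=$(stat -c '%a' "$file") || return 2
    if [ "$mode" = "600" ]; then
        echo "ok: $file"
    else
        echo "bad mode $mode: $file"
        return 1
    fi
}

# Example usage. The path is an assumption; substitute your own include file.
# check_mode_0600 /etc/slurm/slurmdbd.d/storage.conf
```

A file that fails the check can be corrected with chmod 600 before the daemon is restarted.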
Patch Instructions:
To install this SUSE update use the SUSE recommended
installation methods like YaST online_update or "zypper patch".
Alternatively you can run the command listed for your product:
- openSUSE Leap 15.5:
  zypper in -t patch SUSE-2023-3759=1 openSUSE-SLE-15.5-2023-3759=1
- HPC Module 15-SP5:
  zypper in -t patch SUSE-SLE-Module-HPC-15-SP5-2023-3759=1
- SUSE Package Hub 15 15-SP5:
  zypper in -t patch SUSE-SLE-Module-Packagehub-Subpackages-15-SP5-2023-3759=1
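After installing the patch, one way to confirm that the updated package is in place is to compare the installed version-release string against 23.02.4-150500.5.6.1, the value given in this announcement's package list. The sketch below is an illustration under assumptions: version_at_least is a hypothetical helper, and it uses GNU sort -V as an approximation of rpm's own version comparison, which can differ in corner cases.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is at least version $2.
# GNU sort -V orders the two strings; the lower one comes out first.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Version-release string taken from the package list in this announcement.
fixed="23.02.4-150500.5.6.1"

# Only query rpm where it exists and the slurm package is installed.
if command -v rpm >/dev/null 2>&1 && rpm -q slurm >/dev/null 2>&1; then
    installed=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}' slurm)
    if version_at_least "$installed" "$fixed"; then
        echo "slurm $installed includes this update"
    else
        echo "slurm $installed is older than $fixed"
    fi
fi
```

On hosts without rpm or without the slurm package, the script prints nothing and exits successfully.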
Package List:
openSUSE Leap 15.5 (aarch64 ppc64le s390x x86_64):
- slurm-node-debuginfo-23.02.4-150500.5.6.1
- slurm-plugin-ext-sensors-rrd-23.02.4-150500.5.6.1
- slurm-cray-23.02.4-150500.5.6.1
- slurm-torque-debuginfo-23.02.4-150500.5.6.1
- slurm-plugin-ext-sensors-rrd-debuginfo-23.02.4-150500.5.6.1
- slurm-debugsource-23.02.4-150500.5.6.1
- slurm-lua-23.02.4-150500.5.6.1
- slurm-lua-debuginfo-23.02.4-150500.5.6.1
- slurm-rest-debuginfo-23.02.4-150500.5.6.1
- slurm-testsuite-23.02.4-150500.5.6.1
- slurm-cray-debuginfo-23.02.4-150500.5.6.1
- perl-slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-auth-none-23.02.4-150500.5.6.1
- slurm-plugins-23.02.4-150500.5.6.1
- slurm-torque-23.02.4-150500.5.6.1
- libslurm39-debuginfo-23.02.4-150500.5.6.1
- slurm-hdf5-debuginfo-23.02.4-150500.5.6.1
- perl-slurm-23.02.4-150500.5.6.1
- libpmi0-23.02.4-150500.5.6.1
- slurm-pam_slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-hdf5-23.02.4-150500.5.6.1
- slurm-devel-23.02.4-150500.5.6.1
- slurm-munge-23.02.4-150500.5.6.1
- slurm-slurmdbd-debuginfo-23.02.4-150500.5.6.1
- slurm-sql-23.02.4-150500.5.6.1
- slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-auth-none-debuginfo-23.02.4-150500.5.6.1
- slurm-munge-debuginfo-23.02.4-150500.5.6.1
- libpmi0-debuginfo-23.02.4-150500.5.6.1
- libslurm39-23.02.4-150500.5.6.1
- libnss_slurm2-debuginfo-23.02.4-150500.5.6.1
- slurm-node-23.02.4-150500.5.6.1
- slurm-23.02.4-150500.5.6.1
- slurm-rest-23.02.4-150500.5.6.1
- slurm-sql-debuginfo-23.02.4-150500.5.6.1
- slurm-pam_slurm-23.02.4-150500.5.6.1
- libnss_slurm2-23.02.4-150500.5.6.1
- slurm-sview-debuginfo-23.02.4-150500.5.6.1
- slurm-slurmdbd-23.02.4-150500.5.6.1
- slurm-plugins-debuginfo-23.02.4-150500.5.6.1
- slurm-sview-23.02.4-150500.5.6.1
openSUSE Leap 15.5 (noarch):
- slurm-doc-23.02.4-150500.5.6.1
- slurm-seff-23.02.4-150500.5.6.1
- slurm-sjstat-23.02.4-150500.5.6.1
- slurm-config-23.02.4-150500.5.6.1
- slurm-webdoc-23.02.4-150500.5.6.1
- slurm-openlava-23.02.4-150500.5.6.1
- slurm-config-man-23.02.4-150500.5.6.1
HPC Module 15-SP5 (aarch64 x86_64):
- slurm-node-debuginfo-23.02.4-150500.5.6.1
- slurm-plugin-ext-sensors-rrd-23.02.4-150500.5.6.1
- slurm-cray-23.02.4-150500.5.6.1
- slurm-torque-debuginfo-23.02.4-150500.5.6.1
- slurm-plugin-ext-sensors-rrd-debuginfo-23.02.4-150500.5.6.1
- slurm-debugsource-23.02.4-150500.5.6.1
- slurm-lua-23.02.4-150500.5.6.1
- slurm-lua-debuginfo-23.02.4-150500.5.6.1
- slurm-rest-debuginfo-23.02.4-150500.5.6.1
- slurm-cray-debuginfo-23.02.4-150500.5.6.1
- perl-slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-auth-none-23.02.4-150500.5.6.1
- slurm-plugins-23.02.4-150500.5.6.1
- slurm-torque-23.02.4-150500.5.6.1
- libslurm39-debuginfo-23.02.4-150500.5.6.1
- perl-slurm-23.02.4-150500.5.6.1
- libpmi0-23.02.4-150500.5.6.1
- slurm-pam_slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-devel-23.02.4-150500.5.6.1
- slurm-munge-23.02.4-150500.5.6.1
- slurm-slurmdbd-debuginfo-23.02.4-150500.5.6.1
- slurm-sql-23.02.4-150500.5.6.1
- slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-auth-none-debuginfo-23.02.4-150500.5.6.1
- slurm-munge-debuginfo-23.02.4-150500.5.6.1
- libpmi0-debuginfo-23.02.4-150500.5.6.1
- libslurm39-23.02.4-150500.5.6.1
- slurm-node-23.02.4-150500.5.6.1
- libnss_slurm2-debuginfo-23.02.4-150500.5.6.1
- slurm-23.02.4-150500.5.6.1
- slurm-rest-23.02.4-150500.5.6.1
- slurm-sql-debuginfo-23.02.4-150500.5.6.1
- slurm-pam_slurm-23.02.4-150500.5.6.1
- libnss_slurm2-23.02.4-150500.5.6.1
- slurm-sview-debuginfo-23.02.4-150500.5.6.1
- slurm-slurmdbd-23.02.4-150500.5.6.1
- slurm-plugins-debuginfo-23.02.4-150500.5.6.1
- slurm-sview-23.02.4-150500.5.6.1
HPC Module 15-SP5 (noarch):
- slurm-config-man-23.02.4-150500.5.6.1
- slurm-doc-23.02.4-150500.5.6.1
- slurm-config-23.02.4-150500.5.6.1
- slurm-webdoc-23.02.4-150500.5.6.1
SUSE Package Hub 15 15-SP5 (ppc64le s390x):
- slurm-node-debuginfo-23.02.4-150500.5.6.1
- slurm-cray-23.02.4-150500.5.6.1
- slurm-torque-debuginfo-23.02.4-150500.5.6.1
- slurm-debugsource-23.02.4-150500.5.6.1
- slurm-lua-23.02.4-150500.5.6.1
- slurm-lua-debuginfo-23.02.4-150500.5.6.1
- slurm-rest-debuginfo-23.02.4-150500.5.6.1
- slurm-cray-debuginfo-23.02.4-150500.5.6.1
- perl-slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-auth-none-23.02.4-150500.5.6.1
- slurm-plugins-23.02.4-150500.5.6.1
- slurm-torque-23.02.4-150500.5.6.1
- slurm-hdf5-debuginfo-23.02.4-150500.5.6.1
- perl-slurm-23.02.4-150500.5.6.1
- libpmi0-23.02.4-150500.5.6.1
- slurm-pam_slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-hdf5-23.02.4-150500.5.6.1
- slurm-devel-23.02.4-150500.5.6.1
- slurm-munge-23.02.4-150500.5.6.1
- slurm-slurmdbd-debuginfo-23.02.4-150500.5.6.1
- slurm-sql-23.02.4-150500.5.6.1
- slurm-debuginfo-23.02.4-150500.5.6.1
- slurm-auth-none-debuginfo-23.02.4-150500.5.6.1
- slurm-munge-debuginfo-23.02.4-150500.5.6.1
- libpmi0-debuginfo-23.02.4-150500.5.6.1
- slurm-node-23.02.4-150500.5.6.1
- libnss_slurm2-debuginfo-23.02.4-150500.5.6.1
- slurm-23.02.4-150500.5.6.1
- slurm-rest-23.02.4-150500.5.6.1
- slurm-sql-debuginfo-23.02.4-150500.5.6.1
- slurm-pam_slurm-23.02.4-150500.5.6.1
- libnss_slurm2-23.02.4-150500.5.6.1
- slurm-sview-debuginfo-23.02.4-150500.5.6.1
- slurm-slurmdbd-23.02.4-150500.5.6.1
- slurm-plugins-debuginfo-23.02.4-150500.5.6.1
- slurm-sview-23.02.4-150500.5.6.1
SUSE Package Hub 15 15-SP5 (noarch):
- slurm-doc-23.02.4-150500.5.6.1
- slurm-seff-23.02.4-150500.5.6.1
- slurm-sjstat-23.02.4-150500.5.6.1
- slurm-config-23.02.4-150500.5.6.1
- slurm-webdoc-23.02.4-150500.5.6.1
- slurm-openlava-23.02.4-150500.5.6.1
- slurm-config-man-23.02.4-150500.5.6.1