Recurring backup jobs with short intervals for large volumes in Longhorn fail with file lock contention

This document (000021960) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Rancher Longhorn - All versions


Situation

Recurring backup jobs for volumes—especially large ones—in Longhorn sometimes fail due to a lock acquisition error. Frequent creation of backups and deletion of old backups can lead to conflicts when acquiring a file lock necessary to complete the backup operation on a volume. A systematic approach to resolving this conflict involves understanding the volume size, the frequency of backup jobs, and the storage I/O speed.

 

Resolution

  • Backup creation or deletion fails for a large volume. The example below shows the error reported in the UI during an old backup deletion operation:
    messages:
        error: |-
          Failed to delete the backup s3://example-backup/?backup=backup-xxxxxxxxxxxxxxx&volume=pvc-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx in the backupstore, err Failed to delete the backup s3://example-backup/?backup=backup-xxxxxxxxxxxxxxx&volume=pvc-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx in the backupstore, err error deleting backup s3://example-backup/?backup=backup-xxxxxxxxxxxxxxx&volume=pvc-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx: failed to execute: /var/lib/longhorn/engine-binaries/rancher-mirrored-longhornio-longhorn-engine-v1.8.1/longhorn [backup rm s3://example-backup/?backup=backup-xxxxxxxxxxxxxxx&volume=pvc-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx], output failed to acquire lock backupstore/volumes/97/d4/pvc-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx/locks/lock-xxxxxxxx.lck when performing backup delete, please try again later.
  • Upon further investigation, the instance-manager pod log shows a large number of file lock operations within a short time span:
    2025-07-06T19:10:10.829872895+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:56" filepath=backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck kind=s3 object=config pkg=backupstore reason=start
    2025-07-06T19:10:10.837085198+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:67" filepath=backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck kind=s3 object=config pkg=backupstore reason=complete
    2025-07-06T19:10:10.837143736+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Loaded lock backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck type 2 on backupstore" func=backupstore.loadLock file="lock.go:185" pkg=backupstore
    2025-07-06T19:10:10.843997593+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Loading config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:56" filepath=backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck kind=s3 object=config pkg=backupstore reason=start
    2025-07-06T19:10:10.845949676+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Loaded config in backupstore" func=backupstore.LoadConfigInBackupStore file="config.go:67" filepath=backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck kind=s3 object=config pkg=backupstore reason=complete
    2025-07-06T19:10:10.848350469+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Loaded lock backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck type 2 on backupstore" func=backupstore.loadLock file="lock.go:185" pkg=backupstore
     
    2025-07-06T19:10:10.850004342+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Trying to acquire lock backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck" func="backupstore.(*FileLock).canAcquire" file="lock.go:71" lock="{ volume: pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx, name: lock-xxxxxxxxxxxxx, type: 1, acquired: false, serverTime: 2025-07-06 17:10:06 +0000 UTC }" pkg=backupstore
    2025-07-06T19:10:10.850137030+02:00 time="2025-07-06T17:10:10Z" level=info msg="Acquired lock backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck for backup create/restore on backupstore" func="backupstore.(*FileLock).Lock" file="lock.go:128" pkg=backupstore
    2025-07-06T19:10:10.850146813+02:00 time="2025-07-06T17:10:10Z" level=info msg="Saving config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:81" filepath=backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck kind=s3 object=config pkg=backupstore reason=start
    2025-07-06T19:10:10.854400788+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Saved config in backupstore" func=backupstore.SaveConfigInBackupStore file="config.go:92" filepath=backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck kind=s3 object=config pkg=backupstore reason=complete
    2025-07-06T19:10:10.856706716+02:00 [pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx-r-xxxxxxxx] time="2025-07-06T17:10:10Z" level=info msg="Stored lock backupstore/volumes/97/d4/pvc-xxxxx-xxxx-xxxx-xxxx-xxxxxxx/locks/lock-xxxxxxxxxxxxxxx.lck type 1 on backupstore" func=backupstore.saveLock file="lock.go:204" pkg=backupstore
  • In this example, a recurring backup job was created to back up the volume every 3 hours. The sequence of events in this scenario is:
    • The first job starts creating a backup.

    • After 3 hours, the second job starts and tries to create another backup, but fails and retries.

    • Meanwhile, the first job finishes its backup and begins deleting old backups.

    • The deletion process from the first job and the backup creation from the second job both try to access the same lock files, causing file lock conflicts.

  • To fix the issue, implement one or both of the solutions below:
    • Modify the recurring backup job to use a larger interval for large volumes, as shown in the example after this list.
    • Review storage I/O throughput and investigate any latency issues on the backup target. However, increasing the interval as described above should already provide some relief.
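As a minimal sketch, a Longhorn RecurringJob can be adjusted to run less frequently by widening its cron schedule. The job name, group name, and values below are examples only and should be adapted to the environment; the interval is widened from every 3 hours to every 6 hours so that a backup of a large volume, including the subsequent deletion of old backups, can finish before the next run starts:

    apiVersion: longhorn.io/v1beta2
    kind: RecurringJob
    metadata:
      # Example name; use any descriptive name.
      name: backup-large-volumes
      namespace: longhorn-system
    spec:
      # Widened interval: run at minute 0 every 6 hours instead of every 3 hours.
      cron: "0 */6 * * *"
      task: "backup"
      # Example group; volumes pick up this job through the group or a per-volume recurring-job label.
      groups:
      - large-volumes
      # Number of backups to keep; older backups beyond this count are deleted after each run.
      retain: 2
      # Limit how many volumes this job backs up in parallel.
      concurrency: 1
      labels: {}

The updated job can be applied with kubectl apply -f. The key point is that the chosen interval must comfortably exceed the time a full backup plus old-backup deletion takes for the largest affected volume.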

Cause

  • During a backup, a file lock is placed on the backupstore to prevent other jobs (especially deletions) from interfering. Once the backup completes successfully, the lock is released. For large volumes, however, the creation job needs more time to finish before old-backup deletion can start, so overlapping jobs end up competing for the same lock files.
  • There has been a recent improvement to clarify the error message reported during such backup failures: 8713

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000021960
  • Creation Date: 05-Aug-2025
  • Modified Date: 12-Aug-2025
    • SUSE Rancher Longhorn
