System hang caused by vm.pagecache_limit_mb

This document (000020418) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications 12
SUSE Linux Enterprise Server for SAP Applications 11

Situation

The system is unresponsive and appears to be hung or frozen. A kernel crash dump was triggered manually, and the crash dump analysis reveals that the system is not actually hanging: it is very busy trying to shrink the page cache down to the limit specified by vm.pagecache_limit_mb:
crash> sys|grep LOAD
LOAD AVERAGE: 89.98, 82.02, 82.12

The number of tasks currently waiting for the page cache shrink:
crash> foreach bt | grep __shrink_page_cache | wc -l
68

Comparing the current page cache size with the configured limit:
crash> kmem -i|grep CACHE
       CACHED  42364004     161.6 GB   10% of TOTAL MEM

crash>  p vm_pagecache_limit_mb
vm_pagecache_limit_mb = $1 = 31458   #30GB
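
The gap between the two figures can be made explicit with a little arithmetic. The following Python sketch is only an illustration: it assumes a 4 KiB page size (typical for x86_64, but not stated in the dump above), and the helper names are hypothetical.

```python
# Illustrative arithmetic only; assumes 4 KiB pages (not confirmed by the dump).
PAGE_SIZE_KB = 4

def pages_to_mb(pages, page_size_kb=PAGE_SIZE_KB):
    """Convert a page count (as printed by 'kmem -i') to MiB."""
    return pages * page_size_kb / 1024

def overshoot(cached_pages, limit_mb):
    """Return (cached MiB, excess MiB over the limit, cached/limit ratio)."""
    cached_mb = pages_to_mb(cached_pages)
    return cached_mb, cached_mb - limit_mb, cached_mb / limit_mb

# Values taken from the crash output above.
cached_mb, excess_mb, ratio = overshoot(42364004, 31458)
print(f"cached: {cached_mb / 1024:.1f} GiB")
print(f"excess: {excess_mb / 1024:.1f} GiB")
print(f"ratio:  {ratio:.1f}x the configured limit")
```

With the values from this dump, the page cache is roughly five times the configured limit, so about 130 GiB would have to be reclaimed to get back under it.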

The stack trace of the tasks that are reclaiming the page cache looks similar to the following:
#14 [ffffb441bbacbbb8] __shrink_page_cache at ffffffffad1cddad
#15 [ffffb441bbacbc38] add_to_page_cache_lru at ffffffffad1b39a5
#16 [ffffb441bbacbc68] pagecache_get_page at ffffffffad1b518b
#17 [ffffb441bbacbca0] grab_cache_page_write_begin at ffffffffad1b58fc
#18 [ffffb441bbacbcb0] ext4_da_write_begin at ffffffffc096670d [ext4]
#19 [ffffb441bbacbd28] generic_perform_write at ffffffffad1b2f72
#20 [ffffb441bbacbda0] __generic_file_write_iter at ffffffffad1b6454
#21 [ffffb441bbacbde8] ext4_file_write_iter at ffffffffc095427e [ext4]
#22 [ffffb441bbacbe48] __vfs_write at ffffffffad249e2c
#23 [ffffb441bbacbec0] vfs_write at ffffffffad24ae4d

Resolution

vm.pagecache_limit_mb is not a hard limit. The page cache can grow larger than the configured value while the system has plenty of free memory, so by the time the system comes under memory pressure, a large amount of page cache may already have accumulated. Shrinking that accumulated page cache down to the limit is then an aggressive and very expensive task.

We strongly recommend not using the page cache limit option unless it is really necessary (only in rare cases where the systems are continuously swapping excessively, and only if the aggressive swap-out is indeed causing performance issues). The page cache limit can be disabled by setting:
# /etc/sysctl.conf
-------------------------------
vm.pagecache_limit_mb = 0
vm.pagecache_limit_ignore_dirty = 1
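
Whether the limit is currently active can be checked by reading the sysctl back. The sketch below is a hypothetical helper, not a SUSE tool: it relies on the standard mapping of vm.* sysctls to files under /proc/sys/vm, and it treats a missing file as "option not available on this kernel".

```python
from pathlib import Path

def read_pagecache_limit_mb(proc_vm="/proc/sys/vm"):
    """Return the active limit in MB, or None if the sysctl does not exist."""
    path = Path(proc_vm) / "pagecache_limit_mb"
    try:
        return int(path.read_text().split()[0])
    except (OSError, IndexError, ValueError):
        # Missing or unreadable file, or unexpected content.
        return None

def describe(limit_mb):
    """Turn the raw value into a short human-readable status."""
    if limit_mb is None:
        return "pagecache limit not available on this kernel"
    if limit_mb == 0:
        return "pagecache limit disabled"
    return f"pagecache limit active: {limit_mb} MB"

print(describe(read_pagecache_limit_mb()))
```

A value of 0 means the limit is disabled, which matches the setting shown above.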

Cause

The page cache limit is not a hard limit: the page cache is allowed to grow beyond the specified value. On large systems with a lot of free memory, the page cache accumulates and grows continuously. Under memory pressure, the kernel will (as expected) try to shrink the reclaimable memory resources (page cache, slab cache). If vm.pagecache_limit_mb has been set, the __shrink_page_cache() function kicks in and tries to shrink the page cache down to the configured limit. From the source:
# mm/vmscan.c
--------------------------------------------
/*
 * Function to shrink the page cache
 *
 * This function calculates the number of pages (nr_pages) the page
 * cache is over its limit and shrinks the page cache accordingly.
 *
 * The maximum number of pages, the page cache shrinks in one call of
 * this function is limited to SWAP_CLUSTER_MAX pages. Therefore it may
 * require a number of calls to actually reach the vm_pagecache_limit_kb.
 *
 * This function is similar to shrink_all_memory, except that it may never
 * swap out mapped pages and only does two passes.
 */
static void __shrink_page_cache(gfp_t mask)
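
To get a feeling for why this is expensive, the source comment can be turned into a rough estimate. The sketch below assumes that SWAP_CLUSTER_MAX is 32 pages (its value in mainline Linux; the SUSE kernel in question may differ) and that pages are 4 KiB, and it takes the per-call maximum from the comment literally. It ignores everything else the reclaim path does, so treat the result as an order of magnitude only.

```python
SWAP_CLUSTER_MAX = 32   # assumption: mainline value, pages reclaimed per call at most
PAGE_SIZE_KB = 4        # assumption: 4 KiB pages

def min_shrink_calls(cached_pages, limit_mb):
    """Lower bound on __shrink_page_cache() calls needed to reach the limit."""
    limit_pages = limit_mb * 1024 // PAGE_SIZE_KB
    excess_pages = max(0, cached_pages - limit_pages)
    # Integer ceiling division: every call removes at most SWAP_CLUSTER_MAX pages.
    return -(-excess_pages // SWAP_CLUSTER_MAX)

# Values from the crash dump in the Situation section.
print(min_shrink_calls(42364004, 31458))
```

Under these assumptions, the dump shown in the Situation section corresponds to more than a million calls, before counting any contention between the tasks doing the reclaim.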

Shrinking a large page cache is an expensive task, and to make it worse, many CPUs can hit the page cache reclaim at the same time and compete for resources. This leads to a non-responsive system in which most of the CPUs are busy reclaiming the page cache, while all other tasks sleep until this job completes.

Status

Reported to Engineering

Additional Information

If there is an absolute need to set the page cache limit (only in very rare cases where the systems are continuously swapping excessively, and only if the aggressive swap-out is indeed causing performance issues), please make sure to set the values properly, as referenced in SAP Note 1557506:

For systems with up to 64 GB of memory, the recommended value of vm.pagecache_limit_mb is 1/16 (~6%) of the amount of RAM, but not less than 512 MB:
 < 8 GB:  512  (recommended min. limit)
   8 GB:  512  (=  8 * 1024 MB / 16)
  16 GB: 1024  (= 16 * 1024 MB / 16)
  32 GB: 2048  (= 32 * 1024 MB / 16)
  64 GB: 4096  (= 64 * 1024 MB / 16)

For large systems with more than 64 GB of memory, the recommended value is 2% of the amount of RAM, but not less than 4096 MB, for example:
256 GB: 5243 (=2% of 256 * 1024 MB)
512 GB: 10486 (=2% of 512 * 1024 MB)
1024 GB: 20972 (=2% of 1024 * 1024 MB)
2048 GB: 41943 (=2% of 2048 * 1024 MB)
...

Please be aware that a value smaller than the recommendations can easily lead to a non-responsive system.
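
The sizing rules above can be sketched as a small Python function. This is an illustration of the arithmetic, not an official sizing tool: the function name is hypothetical, and rounding the 2% figure to the nearest MB is an assumption inferred from the example values listed above.

```python
def recommended_pagecache_limit_mb(ram_gb):
    """Recommended vm.pagecache_limit_mb for a system with ram_gb GB of RAM."""
    ram_mb = ram_gb * 1024
    if ram_gb <= 64:
        # 1/16 of RAM, but never below 512 MB.
        return max(512, ram_mb // 16)
    # 2% of RAM rounded to the nearest MB, but never below 4096 MB.
    return max(4096, (ram_mb * 2 + 50) // 100)

for ram_gb in (8, 16, 32, 64, 256, 512, 1024, 2048):
    print(f"{ram_gb:>5} GB: {recommended_pagecache_limit_mb(ram_gb)}")
```

The loop reproduces the values from both lists above, which makes it easy to check an existing configuration against the recommendation.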

If the page cache limit is used, it should always be set to a value well above the 'dirty' limit (vm.dirty_ratio / vm.dirty_bytes), as referenced in TID#000019008. A good practice would be to set:
# /etc/sysctl.conf
-------------------------------
vm.dirty_bytes = 629145600                     
vm.dirty_background_bytes = 314572800 
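
For orientation, the two byte values above correspond to 600 MiB and 300 MiB. The sketch below converts them and reports how far a given page cache limit sits above vm.dirty_bytes. The source does not quantify "well above", so the hypothetical helper only reports the ratio and leaves the judgement to the reader; the 4096 MB limit is just an example value.

```python
MIB = 1024 * 1024

def dirty_headroom(pagecache_limit_mb, dirty_bytes):
    """Return (dirty limit in MiB, pagecache limit as a multiple of it)."""
    dirty_mb = dirty_bytes / MIB
    return dirty_mb, pagecache_limit_mb / dirty_mb

# Example: a 4096 MB page cache limit against the dirty settings above.
dirty_mb, ratio = dirty_headroom(4096, 629145600)
print(f"vm.dirty_bytes            = {dirty_mb:.0f} MiB")
print(f"vm.dirty_background_bytes = {314572800 / MIB:.0f} MiB")
print(f"page cache limit is {ratio:.1f}x vm.dirty_bytes")
```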

The page cache limit option has been dropped in SLES 15.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020418
  • Creation Date: 27-Oct-2021
  • Modified Date: 27-Oct-2021
    • SUSE Linux Enterprise Server for SAP Applications

For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback@suse.com
