Change in memory reclaim behaviour in SLES 15 SP4
This document (000021196) is provided subject to the disclaimer at the end of this document.
SUSE Linux Enterprise Server 15 Service Pack 4
SLE15-SP3 is based on upstream kernel version 5.3 and SLE15-SP4 on version 5.14. Historically, the memory reclaim behaviour has been biased to scan the file LRU (Least Recently Used list, i.e. mostly disk-backed page cache) more intensively than the anonymous LRU (MAP_PRIVATE, malloc or shmem), because swapping out has been considered more expensive and disruptive.
The vm.swappiness sysctl tunable has also been available to influence this balance, but its effect was at times very limited. In particular, starting with upstream kernel version 3.8 (changes that the SLE11 SP3 and SP4 kernels also adopted), the reclaim implementation effectively did not consider anonymous pages at all (thus bypassing the sysctl) if there was enough inactive page cache. While that code still exists in some form in the current reclaim implementation, the details of what "enough inactive page cache" means have changed a lot, so the cache trim mode, as it is currently called, is not triggered nearly as often. This example illustrates that a vm.swappiness value tuned on one major kernel version may behave differently, and unpredictably so, on the next major kernel version.
In the SLE15-SP4 kernel, the reclaim behaviour has changed substantially. The vm.swappiness sysctl can now be set to a value between 0 and 200 (unlike 0-100 in SLE15-SP3), which allows prioritizing reclaim of anonymous pages beyond the 1:1 balance. More importantly, the kernel is able to track reclaim performance more accurately, so it can identify situations where swapping out anonymous pages to make space for more page cache is beneficial.
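The value range described above can be illustrated with a small Python sketch. It assumes that 100 is the 1:1 balance point the text refers to. The function names and the returned labels are hypothetical and chosen only for this illustration. The procfs path is the standard location of the sysctl on Linux.

```python
def describe_swappiness(value: int, max_value: int = 200) -> str:
    """Classify a vm.swappiness value relative to the assumed 1:1 point (100).

    max_value is 200 for the SLE15-SP4 range and 100 for the SLE15-SP3 range.
    The returned labels are illustrative, not kernel terminology.
    """
    if not 0 <= value <= max_value:
        raise ValueError(f"vm.swappiness must be within 0-{max_value}, got {value}")
    if value < 100:
        return "file-biased"   # reclaim prefers page cache over anonymous pages
    if value == 100:
        return "balanced"      # assumed 1:1 balance point
    return "anon-biased"       # only reachable with the extended 0-200 range


def read_swappiness(path: str = "/proc/sys/vm/swappiness") -> int:
    """Read the current sysctl value from procfs (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return int(handle.read().strip())
```

On a running system, describe_swappiness(read_swappiness()) would report on which side of the balance point the current setting lies.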
Instead of balancing anonymous and file-backed memory based on their recent reclaim effectiveness (the approach of older kernels), the recent "cost of reclaiming" is considered: the type that is more costly to reclaim gets relatively lower priority as a result. The cost includes reclaim-related events such as pageout, which means the content of a dirty page has to be written (either to swap or to the file it maps) before the page can be reclaimed. The other event is a workingset refault, which means that a page that was present in the past but had to be reclaimed is accessed again, a strong indication that it is part of a workload's working set. This is a novel concept and it can change the observed reclaim behaviour considerably. For example, a system may swap out even in the presence of a lot of page cache if the page cache is refaulted often, because doing so reduces overall IO.
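The idea of cost-based balancing can be sketched with a toy model in Python. This is not the kernel's implementation: the formula, the function names and the cost units are assumptions made for illustration. It only demonstrates the principle stated above, that the type with the higher recent reclaim cost receives a relatively lower scan weight, with vm.swappiness acting as a static bias.

```python
from typing import Tuple


def scan_weights(swappiness: int, anon_cost: float, file_cost: float) -> Tuple[float, float]:
    """Return (anon_weight, file_weight); a higher weight means scanned more.

    anon_cost and file_cost stand for recently accumulated reclaim cost
    (e.g. pageouts and workingset refaults) in arbitrary, hypothetical units.
    """
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be within 0-200")
    total = anon_cost + file_cost
    # Each weight is inversely proportional to that type's own cost;
    # the +1 avoids division by zero when no cost has been recorded yet.
    anon_weight = swappiness * (total + 1) / (anon_cost + 1)
    file_weight = (200 - swappiness) * (total + 1) / (file_cost + 1)
    return anon_weight, file_weight


def anon_share(swappiness: int, anon_cost: float, file_cost: float) -> float:
    """Fraction of scan pressure directed at anonymous memory in this model."""
    anon_weight, file_weight = scan_weights(swappiness, anon_cost, file_cost)
    return anon_weight / (anon_weight + file_weight)
```

In this model, a page cache that refaults often (high file_cost) shifts pressure towards anonymous memory even at a moderate swappiness value, which corresponds to the swapping-despite-page-cache example in the text.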
Thanks to the new algorithm, reclaim is able to detect and react faster to working set transitions, where the workload keeps frequently accessing one group of pages for a time and then switches to another one.
Please also note that with contemporary NVMe drives the relative cost of swapping is lower than in the past, so there might be a benefit in swapping out relatively unused anonymous memory to make more space for more frequently accessed page cache.
The changes to the reclaim implementation outlined above are intended to improve performance across a wide range of workloads, but as with any such change, they can lead to unexpected differences in monitored system metrics that may appear undesirable when the same workload runs on SLE15-SP3 and SLE15-SP4. Such differences may include increased swap usage on a system running SLE15-SP4 compared to lower or no swap usage on SLE15-SP3 with an identical or similar workload and hardware. That might appear counterintuitive at first: if a workload fits in the system memory on SLE15-SP3 and needs little or no swap, it should fit in the same amount of memory on SLE15-SP4, and increased swap usage could be interpreted as a sign that it does not fit.
Many of the reclaim events described above have counters maintained by the kernel and exported via the /proc/vmstat virtual file, which allows observing their growth over a relevant time period (such as during the nightly virus scan activity) and verifying that e.g. the increased swap usage does not cause excessive swap activity, because the pages swapped out are indeed relatively cold. Or, in situations where reclaim activity is intensive (thrashing), the counters allow confirming that the workload's working set does not fit in the available system memory.
pswpout - incremented for each anonymous page written out to swap as part of reclaim. In case of the nightly virus scan activity, this counter would increase as the swap usage grows.
pswpin - incremented when a swapped out page needs to be swapped back into memory because it is accessed (or speculatively due to swap read-ahead when a neighbouring page is swapped in). In case of the nightly virus scan, if this counter does not increase (or increases only minimally) during the scan, it confirms that the pages swapped out are cold. A slow gradual increase outside of the virus scan suggests they are not completely cold (e.g. an unused part of a Java heap might be occasionally accessed during garbage collection).
pgmajfault - incremented whenever a process accessing its memory is stalled because either an anonymous page has to be swapped in (which also increments pswpin) or a file page has to be read into the page cache.
workingset_refault_anon/_file - incremented whenever an anonymous or file page respectively is read from swap or disk and the kernel tracks it as having been reclaimed previously. This is an indication that the memory might be part of the working set.
workingset_activate_anon/_file - incremented for a subset of the workingset_refault_* events where the refault distance is low enough that the pages are put immediately on the active list and the event contributes to the tracked reclaim cost. Temporary bursts of these events may indicate that the running workload is changing its working set and the kernel is adapting. Prolonged high rates of these events indicate that the workload cannot fit well in the system memory and that thrashing is imminent.
workingset_restore_anon/_file - incremented for a subset of the workingset_activate_* events where the refaulted pages had been on the active list before they were reclaimed. High rates of these events are an even stronger indication that the workload does not fit into the system memory, and they indicate thrashing behaviour.
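Observing the growth of the counters listed above over a time period can be sketched as follows in Python. This is a minimal illustration, assuming the usual "name value" line format of /proc/vmstat. The helper names and the WATCHED tuple are hypothetical and not part of any SUSE tooling.

```python
from typing import Dict, Iterable

# Counters discussed in this document.
WATCHED = (
    "pswpout", "pswpin", "pgmajfault",
    "workingset_refault_anon", "workingset_refault_file",
    "workingset_activate_anon", "workingset_activate_file",
    "workingset_restore_anon", "workingset_restore_file",
)


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse '/proc/vmstat'-style text into a {counter_name: value} dict."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_vmstat(path: str = "/proc/vmstat") -> Dict[str, int]:
    """Take one snapshot of the kernel counters (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return parse_vmstat(handle.read())


def counter_deltas(before: Dict[str, int], after: Dict[str, int],
                   names: Iterable[str] = WATCHED) -> Dict[str, int]:
    """Growth of each watched counter present in both snapshots."""
    return {name: after[name] - before[name]
            for name in names if name in before and name in after}
```

A typical use would be to call read_vmstat() before and after the period of interest (such as the nightly virus scan) and pass both snapshots to counter_deltas(). A large pswpout delta combined with a near-zero pswpin delta would match the "cold pages" case described above.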
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021196
- Creation Date: 07-Sep-2023
- Modified Date: 07-Sep-2023
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com