Memory, I/O and DefaultTasksMax related considerations for SLES for SAP servers with huge memory
This document (7021211) is provided subject to the disclaimer at the end of this document.
SUSE Linux Enterprise for SAP 12 SP1 and later Service Packs
SUSE Linux Enterprise for SAP 11 SP4
SUSE Linux Enterprise Server 12 SP1 and later Service Packs
SUSE Linux Enterprise Server 11 SP4
Servers may become unresponsive, suffer out-of-memory issues, or appear to stall for a while and then resume normal function minutes or many hours later.
This document also provides general tuning guidance for customers running SAP on SUSE Linux Enterprise.
Some general guidelines about using pagecache_limit and optimizing some of the I/O related settings:
If the server in question is *not* simultaneously running a heavy file I/O workload and a memory-intensive application workload, then this setting (pagecache_limit) will probably cause more harm than good. However, most SAP environments have both high I/O and memory-intensive workloads.
Ideally, vm.pagecache_limit_mb should be zero until such time that pagecache is seen to exhaust memory. If it does exhaust memory, then trial-and-error tuning must be used to find values that work for the specific server and workload in question.
As regards the type of settings that have both a fixed value and a 'ratio' setting option, keep in mind that ratio settings will be more and more inaccurate as the amount of memory in the server grows. Therefore, specific 'byte' settings should be used as opposed to 'ratio' type settings. The 'ratio' settings can allow too much accumulation of dirty memory which has been proven to lead to processing stalls during heavy fsync or sync write loads. Setting dirty_bytes to a reasonable value (which depends on the storage performance) leads to much less unexpected behavior.
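To make the scaling problem concrete, the following sketch converts a 'ratio' setting into an approximate absolute amount. It assumes the ratio is applied to all of RAM, which gives an upper bound, since the kernel applies it to available memory rather than total memory. The function name ratio_to_mb is only illustrative.

```sh
# ratio_to_mb TOTAL_MB RATIO
# Print approximately how many MB of dirty memory a 'ratio' setting permits.
# Assumption: the ratio is applied to all of RAM (an upper bound).
ratio_to_mb() {
    echo $(( $1 * $2 / 100 ))
}

# The same 20% setting on 16 GB and on 1 TB of RAM:
ratio_to_mb 16384 20      # 3276
ratio_to_mb 1048576 20    # 209715
```

On the 1 TB machine a single percentage point corresponds to roughly 10 GB, which is why a fixed byte value is easier to match to what the storage can actually flush.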
Setting, say, a 4 GB pagecache limit on a 142 GB machine is asking for trouble, especially when you consider that this would be much smaller than a default dirty ratio limit (which is by default 40% of available pages).
If the pagecache_limit is used, it should always be set to a value well above the 'dirty' limit, be it a fixed value or a percentage.
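A quick sanity check along these lines could look like the sketch below. The function name limit_above_dirty and the default safety factor of 2 are assumptions made for illustration; this document does not define how far "well above" is.

```sh
# limit_above_dirty LIMIT_MB DIRTY_BYTES [FACTOR]
# Succeed when the pagecache limit is at least FACTOR times the dirty limit.
# FACTOR defaults to 2; that default is an arbitrary illustration of
# "well above", not a documented threshold.
limit_above_dirty() {
    limit_bytes=$(( $1 * 1024 * 1024 ))
    factor=${3:-2}
    [ "$limit_bytes" -ge $(( $2 * factor )) ]
}

# A 20972 MB limit against 600 MB of dirty_bytes passes.
limit_above_dirty 20972 629145600 && echo "limit is well above the dirty limit"

# A 4 GB limit against roughly 40% of 142 GB (about 58163 MB) does not.
limit_above_dirty 4096 $(( 58163 * 1024 * 1024 )) || echo "limit is below the dirty limit"
```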
There are no universally 'correct' values for these settings. You are always balancing throughput against sync latency. If the kernel auto-tuned these values based on the amount of RAM in the server, it would be very prone to regressions, because the right values depend on server-specific load. So, necessarily, it falls to the server admins to find the best values for these settings (via trial and error).
*If* we know for a fact that the server does encounter issues with pagecache_limit set to 0 (not active), then choose a pagecache_limit that is suitable in relation to how much memory is in the server.
Let's assume that you have a server with 1 TB of RAM. These are *suggested* values which could be used as a starting point:
vm.pagecache_limit_mb = 20972 # 20 GB; different values between 20 GB and 64 GB could be tried
vm.pagecache_limit_ignore_dirty = 1 # see the section below on this variable to decide what it should be set to
vm.dirty_ratio = 0
vm.dirty_bytes = 629145600 # This could be reduced or increased based on actual hardware performance, but keep vm.dirty_background_bytes at approximately 50% of this setting
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 314572800 # Set this value to approximately 50% of vm.dirty_bytes
NOTE: If it is decided to try setting pagecache_limit to 0 (not active) then it's still a good idea to test different values for dirty_bytes and dirty_background_bytes in an I/O intensive environment to arrive at the best performance.
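When testing different values, it can help to derive the background threshold from dirty_bytes rather than editing both by hand. The following is a minimal sketch, with suggest_dirty_settings as a hypothetical helper name. It only prints the lines and does not change any running setting.

```sh
# suggest_dirty_settings DIRTY_BYTES
# Print sysctl lines for a byte-based dirty limit, with the background
# threshold at 50% of DIRTY_BYTES and both ratio settings set to 0.
suggest_dirty_settings() {
    echo "vm.dirty_ratio = 0"
    echo "vm.dirty_bytes = $1"
    echo "vm.dirty_background_ratio = 0"
    echo "vm.dirty_background_bytes = $(( $1 / 2 ))"
}

# 600 MB (629145600 bytes) as the dirty limit:
suggest_dirty_settings $(( 600 * 1024 * 1024 ))
```

The printed lines can be reviewed and then copied into /etc/sysctl.conf once a value has proven itself under load.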
How pagecache_limit works:
The heart of this patch is a function called shrink_page_cache(). It is called from balance_pgdat (which is the worker for kswapd) if the pagecache is above the limit. The function is also called in __alloc_pages_slowpath.
shrink_page_cache() calculates the number of pages the cache is over its limit. It reduces this number by a factor (so you have to call it several times to get down to the target) then shrinks the pagecache (using the Kernel LRUs).
shrink_page_cache does several passes:
- The first pass reclaims only from inactive pagecache memory. This is fast -- but it might not find enough free pages; if that happens, the second pass will happen.
- In the second pass, pages from the active list will also be considered.
- The third pass will only happen if pagecache_limit_ignore_dirty is not 1. In that case, the third pass is a repetition of the second pass, but this time we allow pages to be written out.
In all passes, only unmapped pages will be considered.
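The pass sequence can be summarized with a small model. This sketch only mirrors the description above and is not the kernel code. It lists the maximum set of passes for a given vm.pagecache_limit_ignore_dirty value, whereas in practice later passes run only if earlier ones did not free enough pages. The function name shrink_passes is made up for this example.

```sh
# shrink_passes IGNORE_DIRTY
# Print the passes that may run, one per line, for a given
# vm.pagecache_limit_ignore_dirty value. Models the description only.
shrink_passes() {
    echo "pass 1: inactive list, no write-out"
    echo "pass 2: inactive and active lists, no write-out"
    if [ "$1" -ne 1 ]; then
        echo "pass 3: inactive and active lists, write-out allowed"
    fi
}

shrink_passes 1   # two passes
shrink_passes 0   # three passes
```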
How it changes memory management:
If the vm.pagecache_limit_mb is set to zero (default), nothing changes.
If set to a positive value, there will be three different operating modes:
(1) If we still have plenty of free pages, the pagecache limit will NOT be enforced. Memory management decisions are taken as normal.
(2) However, as soon as someone consumes those free pages, we'll start freeing pagecache -- as those are returned to the free page pool, freeing a few pages from pagecache will return us to state (1) -- if however someone consumes these free pages quickly, we'll continue freeing up pages from the pagecache until we reach vm.pagecache_limit_mb.
(3) Once we are at or below the low watermark, vm.pagecache_limit_mb, the pages in the page cache will be governed by normal paging memory management decisions; if it starts growing above the limit (corrected by the free pages), we'll free some up again.
This feature is useful for machines that have large workloads, carefully sized to eat most of the memory. Depending on the application's page access pattern, the kernel may too easily swap the application memory out in favor of pagecache. This can happen even for low values of swappiness. With this feature, the admin can tell the kernel that only a certain amount of pagecache is really considered useful and that it otherwise should favor the application's memory.
The default for vm.pagecache_limit_ignore_dirty is 1; this means that we don't consider dirty memory to be part of the limited pagecache, as we cannot easily free up dirty memory (we'd need to do writes for this). By setting this to 0, we actually consider dirty (unmapped) memory to be freeable and do a third pass in shrink_page_cache() where we schedule the pages for write-out. Values larger than 1 are also possible and result in a fraction of the dirty pages being considered non-freeable.
From SAP on the subject:
If there are a lot of local writes and it is OK to throttle them by limiting the writeback caching, we recommend that you set the value to 0. If writing mainly happens to NFS filesystems, the default 1 should be left untouched. A value of 2 would be a middle ground, not limiting local write back caching as much, but potentially resulting in some paging.
Many customers are not tuning I/O settings for large memory systems, and some SAP customers are setting a pagecache limit unnecessarily, or setting a limit which is much too low for the amount of memory present and the workload pattern of the server.
Also, since the advent of systemd v228 (included in SLES12 SP2) and later versions, SAP systems will likely need to adjust the DefaultTasksMax setting, which was introduced by Linux upstream as a security feature to prevent any one service from spawning too many threads and consuming all server resources.
Please note that the usage of the pagecache feature is only supported on SLES for SAP.
Please note that the pagecache feature is not used on SUSE Linux Enterprise Server 15 and above. Instead, control groups (cgroups) are used.
To review the current pagecache settings on your environment run the command:
sysctl -a | egrep "pagecache|dirty"
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.pagecache_limit_ignore_dirty = 1
vm.pagecache_limit_mb = 0
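When reading this output, it is worth checking whether the dirty limits are currently byte-based or ratio-based, since only one of each pair is active at a time and the inactive one reads as 0. The sketch below parses lines in the "key = value" format that sysctl prints. The function name dirty_mode is an assumption for this example.

```sh
# dirty_mode reads "key = value" lines (the format sysctl prints) on stdin
# and reports whether the dirty limits are byte-based or ratio-based.
dirty_mode() {
    awk -F' *= *' '
        $1 == "vm.dirty_bytes"            { db = $2 }
        $1 == "vm.dirty_background_bytes" { dbb = $2 }
        END {
            print "dirty limit: "      ((db  + 0) > 0 ? "bytes" : "ratio")
            print "background limit: " ((dbb + 0) > 0 ? "bytes" : "ratio")
        }'
}

# Usage on a live system (shown as a comment, not run here):
#   sysctl -a 2>/dev/null | dirty_mode
```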
To change the pagecache settings, add or edit the required entries in /etc/sysctl.conf, save the file, then apply the changes with the command:
sysctl -p
vm.pagecache_limit_ignore_dirty = 1
vm.pagecache_limit_mb = 20972
To view current effective DefaultTasksMax, view the contents of /etc/systemd/system.conf, or use the command syntax shown here:
systemctl show --property DefaultTasksMax
To change the global value for DefaultTasksMax, uncomment the line in /etc/systemd/system.conf and set to the desired value.
If you wish to change the DefaultTasksMax on a per-service basis, then the TasksMax setting can be added to the appropriate systemd unit file.
To enable the new settings, you can use 'systemctl daemon-reload' or just reboot the server.
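For the per-service case, one common approach is a systemd drop-in file instead of editing the unit file itself. The following is a minimal sketch under stated assumptions: the unit name example.service, the file name tasksmax.conf and the value 8192 are placeholders, and tasksmax_dropin is a hypothetical helper that only prints the file content.

```sh
# tasksmax_dropin VALUE
# Print the body of a systemd drop-in that overrides TasksMax for one unit.
tasksmax_dropin() {
    printf '[Service]\nTasksMax=%s\n' "$1"
}

# Example installation (placeholder unit name and value, shown as comments only):
#   mkdir -p /etc/systemd/system/example.service.d
#   tasksmax_dropin 8192 > /etc/systemd/system/example.service.d/tasksmax.conf
#   systemctl daemon-reload
#   systemctl show --property TasksMax example.service
```

A drop-in keeps the override separate from the packaged unit file, so it is not lost when the package is updated.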
More about DefaultTasksMax / TasksMax / UserTasksMax
SAP note 1557506 - Linux paging improvements
SAP note 2456149 - Diagnostics Agents fails to start with error OutOfMemoryError on Linux X86 SLES12 SPS2
SLES12 SP2 Release Notes - 2.3.2 Support for PIDs cgroup Controller (DefaultTasksMax)
SUSE Linux Enterprise Server for SAP Applications 12 SP2 - Guide: 7.1 Kernel: Page-Cache Limit
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 7021211
- Creation Date: 19-Aug-2017
- Modified Date: 03-Jun-2021
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com