Kernel makes very large "emergency reserve" of memory when using swap on NFS

This document (7007405) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 11 Service Pack 1

Situation

As of SLES 11 SP1, placing swap in a swap file on an NFS mount is supported. (However, there are many cases in which it would not be recommended.)

When swap on NFS is enabled, the kernel logs a line like the following to /var/log/messages:

Emergency reserve: 3356248

or another large number. This number is a memory reserve in KB (kilobytes), so the value above is approximately 3.3 GB (gigabytes). Sacrificing 3.3 GB of real memory for an emergency reserve, just to enable a swap file, is counterproductive.

Resolution

Use the following to reduce the memory reserve:

echo 100000 > /proc/sys/net/ipv4/route/max_size

That should immediately produce a new log entry in /var/log/messages showing a lower reserve. If it does, the setting can be made persistent with an entry in /etc/sysctl.conf, which is applied at boot:

net.ipv4.route.max_size = 100000

Explanation:

For swap on NFS, the kernel wants to make sure there is enough memory to support all IP communication, since swap will be relying on that. Part of that effort is to reserve enough memory to support the IP route cache at the maximum size it can reach. Each entry in the cache takes only 384 bytes, but the cache is typically allowed to grow to millions of entries -- a typical default will be over 8 million entries. At 384 bytes per entry, 8 million entries alone account for roughly 3 GB, which is why such a large amount of RAM is reserved. Lowering max_size to 100,000 will typically still be generous, but will only cause a reserve of about 40 MB.
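The arithmetic above can be sketched in shell. This is only a rough estimate of the route-cache share of the reserve, assuming the 384-byte entry size quoted above; the kernel's actual figure may include other components, so the logged value can be higher. The function name reserve_kb and the value 8388608 (standing in for "over 8 million") are illustrative assumptions, not taken from the kernel.

```shell
# Rough estimate of the route-cache share of the emergency reserve.
# ENTRY_BYTES is the per-entry size quoted in this document; the real
# kernel calculation may add other components on top of this.
ENTRY_BYTES=384

reserve_kb() {
    # $1 = candidate value for net.ipv4.route.max_size
    echo $(( $1 * ENTRY_BYTES / 1024 ))
}

reserve_kb 100000     # prints 37500   (about 40 MB)
reserve_kb 8388608    # prints 3145728 (about 3 GB; hypothetical default)
```

Running the function with a few candidate values gives a quick feel for how much memory each setting would tie up before changing the live system.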

If administrators wish to determine a number more suitable to a specific system, they can monitor the usage of that cache over time and choose a lower or higher setting as desired. Usage is shown with:

# grep ip_dst_cache /proc/slabinfo

ip_dst_cache 106 120 384 10 1 : tunables 54 27 8 : slabdata 12 12 0

The first number in that output is the number of entries currently used. The second number is a recent high water mark, but *not* necessarily the highest that the number has been since boot. Checking this at regular intervals throughout the day for several days should give an indication of the normal peak. Then add a safety margin on top of that (maybe 50% extra) just to be safe.
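As a minimal sketch of that procedure, the helpers below extract the first number from the ip_dst_cache line and add the 50% margin suggested above. The function names active_entries and suggest_max_size are hypothetical, and the example feeds in the sample line shown earlier rather than reading the live system.

```shell
# Print the number of entries currently used (first number on the
# ip_dst_cache line) from slabinfo-style input on stdin.
active_entries() {
    awk '$1 == "ip_dst_cache" { print $2 }'
}

# Add 50% headroom to an observed peak (integer arithmetic).
suggest_max_size() {
    echo $(( $1 + $1 / 2 ))
}

# Example using the sample output shown above.
sample='ip_dst_cache 106 120 384 10 1 : tunables 54 27 8 : slabdata 12 12 0'
peak=$(printf '%s\n' "$sample" | active_entries)
suggest_max_size "$peak"    # prints 159
```

On a live system the same helper could be fed from the real file (active_entries < /proc/slabinfo, as root) at regular intervals, keeping the largest value seen as the peak before applying the margin.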

Development is considering whether the activation of swap on NFS should automatically cause this max_size number to be reduced.

Additional Information

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.