SLES 11/12: Memory, Disk/Storage IO Tuning and Optimization Part 1
This document is a basic SLES guide to memory and disk I/O tuning and optimization. Many of the parameters and settings discussed are generic to Linux and can be applied to other distributions as well. Refer to the IHV/ISV application tuning guides or documentation before you implement the tuning parameters.
Before you start tuning the server, create a backup of the current kernel settings using “sysctl -A”:
sysctl -A > /root/sysctl.settings.backup
Note: Some of the tuning parameters are configured aggressively to improve performance. Hence the settings should never be applied in production environments without proper testing in designated test environments.
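The backup is most useful when it can be compared against the running system after tuning. The following is a minimal sketch of such a comparison, assuming both files contain "key = value" lines as printed by sysctl -A; the function name sysctl_changes is only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print every key whose value differs between two
# "sysctl -A" dumps.
# $1 = backup file, $2 = file holding the current settings.
sysctl_changes() {
    awk -F' = ' '
        NR == FNR { old[$1] = $2; next }      # first file: remember backup values
        ($1 in old) && old[$1] != $2 {        # second file: report differences
            printf "%s: %s -> %s\n", $1, old[$1], $2
        }
    ' "$1" "$2"
}
```

For example, save the current state with "sysctl -A > /tmp/sysctl.now" and then run "sysctl_changes /root/sysctl.settings.backup /tmp/sysctl.now" to list what has changed since the backup.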
This is a two part document.
SLES Memory Tuning and Optimization
- Disable Transparent Huge Pages (THP): On systems with large memory, frequent access to the Translation Lookaside Buffer (TLB) may slow down the system significantly. Although THP can improve performance for a large number of workloads, it might regress performance for workloads that rarely reference large amounts of memory. To disable THP permanently, boot the system with the kernel parameter:
transparent_hugepage=never
or, to deactivate THP temporarily on a running system, use:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
To confirm that THP is deactivated, read the same file. The active value is shown in square brackets:
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
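When this check is scripted, the active value has to be extracted from the bracketed entry. A small sketch, assuming the sysfs file keeps the "choice [active] choice" layout shown above; active_choice is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the active value from a sysfs "choice" line
# such as "always madvise [never]" (the active entry is the bracketed one).
active_choice() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}
```

Called as active_choice "$(cat /sys/kernel/mm/transparent_hugepage/enabled)", it should print never once THP has been deactivated.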
- Huge Pages: A heavily used application server, e.g. a database server, can benefit significantly from Huge Pages. The default Huge Page size in SLES is 2 MB. Enabling Huge Pages can bring significant improvements for memory-intensive applications, databases and HPC machines, but only if the applications support Huge Pages. If they do not, configuring Huge Pages wastes memory, because memory reserved for Huge Pages cannot be used for anything else by the OS. By default no Huge Pages are allocated. Verify that the server has not allocated any Huge Pages via:
grep -i hugepages /proc/meminfo
To allocate, e.g., 128 Huge Pages of size 2 MB (256 MB in total), pass the following parameter to the kernel via grub:
hugepages=128
After the reboot, verify that 128 Huge Pages are allocated to the server via:
grep -i hugepages /proc/meminfo
Another recommended method to configure Huge Pages in SLES is to install the oracleasm rpm. In the file /etc/sysconfig/oracle, change the parameter NR_HUGE_PAGES=0 to, e.g., NR_HUGE_PAGES=128 if you want to add 128 Huge Pages, then restart the oracle service.
Refer to the ISV documentation for best practice procedures of allocating Huge Pages. Allocating too many Huge Pages may result in performance regressions under load.
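The relation between the amount of memory to reserve and the number of Huge Pages is simple arithmetic. The sketch below rounds up to whole pages; the function name hugepages_needed is illustrative, and the 2048 kB default is the 2 MB page size mentioned above (the actual size is reported as Hugepagesize in /proc/meminfo).

```shell
#!/bin/sh
# Hypothetical helper: number of huge pages needed to back $1 MB of memory,
# given a huge page size of $2 kB (default 2048 kB = 2 MB). Rounds up.
hugepages_needed() {
    size_mb=$1
    page_kb=${2:-2048}
    echo $(( (size_mb * 1024 + page_kb - 1) / page_kb ))
}
```

For the example above, "hugepages_needed 256" prints 128. How much memory the application really needs is something only the ISV documentation can answer.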
- Swap Improvements: If swap space is used, look at the /proc/meminfo file to correlate the use of swap with the amount of inactive anonymous (anon) memory pages. If the amount of swap in use is smaller than the amount of inactive anon memory pages in /proc/meminfo, only memory that is not actively used is being swapped, which indicates good performance. If the amount of swap in use is larger than the amount of inactive anon memory pages, active memory is being swapped. This causes excessive I/O traffic, degrades performance, and may need to be addressed by installing more RAM.
Swappiness: Reducing the swappiness value from the default of 60 to 25 makes the kernel less inclined to swap out memory and maximizes the use of RAM on your server:
echo 25 > /proc/sys/vm/swappiness
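The comparison between used swap and inactive anon pages can be automated. The following is a rough sketch, assuming the SwapTotal, SwapFree and Inactive(anon) fields of /proc/meminfo; the function name swap_check and the wording of its verdict are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: compare swap in use with inactive anonymous memory.
# $1 = meminfo-style file (defaults to /proc/meminfo).
swap_check() {
    awk '
        $1 == "SwapTotal:"      { st = $2 }
        $1 == "SwapFree:"       { sf = $2 }
        $1 == "Inactive(anon):" { ia = $2 }
        END {
            used = st - sf
            verdict = (used > ia) ? "active memory is likely being swapped" : "swap use looks healthy"
            printf "swap used: %d kB, inactive anon: %d kB, %s\n", used, ia, verdict
        }
    ' "${1:-/proc/meminfo}"
}
```

Running swap_check without arguments evaluates the live system. The result is only a snapshot, so it is worth repeating the check under typical load.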
- VFS caches: The vfs_cache_pressure variable controls the tendency of the kernel to reclaim the memory used for VFS caches (dentries and inodes) relative to page cache and swap. To reduce the rate at which VFS caches are reclaimed, lower the value from the default of 100 to 50:
echo 50 > /proc/sys/vm/vfs_cache_pressure
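Values written with echo to /proc/sys are lost at the next reboot. To keep a tested setting, it can be added to /etc/sysctl.conf using the dotted key that corresponds to the /proc/sys path. A minimal sketch of that mapping follows; both function names are hypothetical, and the conversion assumes path components that contain no dots themselves, which holds for the vm settings in this document.

```shell
#!/bin/sh
# Hypothetical helper: convert a /proc/sys path into the dotted key used by
# sysctl, e.g. /proc/sys/vm/swappiness -> vm.swappiness.
proc_to_sysctl_key() {
    printf '%s\n' "${1#/proc/sys/}" | tr '/' '.'
}

# Hypothetical helper: emit a sysctl.conf line for a path and a value.
sysctl_conf_line() {
    printf '%s = %s\n' "$(proc_to_sysctl_key "$1")" "$2"
}
```

For example, "sysctl_conf_line /proc/sys/vm/vfs_cache_pressure 50 >> /etc/sysctl.conf" appends the line, and "sysctl -p" loads the file without a reboot.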
- KSM: Kernel 2.6.32 introduced Kernel Samepage Merging (KSM). KSM allows an application to register with the kernel to have its memory pages merged with identical pages of other processes that have also registered. For KVM, the KSM mechanism allows guest virtual machines to share pages with each other. In today's environments, where many similar guest operating systems run on the same host under a hypervisor such as KVM, this can result in significant memory savings. The default value is 0; change it to 1 to enable KSM:
echo 1 > /sys/kernel/mm/ksm/run
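Whether KSM pays off can be estimated from the pages_sharing counter in /sys/kernel/mm/ksm, which the kernel documentation describes as a measure of how much is saved. The sketch below converts that counter to megabytes; the function name ksm_saved_mb is illustrative and the result is an approximation.

```shell
#!/bin/sh
# Hypothetical helper: approximate memory saved by KSM, in MB.
# $1 = value of pages_sharing, $2 = page size in bytes (default 4096).
ksm_saved_mb() {
    echo $(( $1 * ${2:-4096} / 1048576 ))
}
```

On a running host it can be called as ksm_saved_mb "$(cat /sys/kernel/mm/ksm/pages_sharing)" "$(getconf PAGESIZE)". A value that stays close to zero suggests the guests have little in common and the scanning overhead may not be worth it.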
- Memory Overcommit: Linux processes generally request more memory than they actually use. The goal is to make the process faster: a process that does not have this excess memory has to keep asking the kernel for more, which slows it down when the requests happen repeatedly under memory pressure. overcommit_memory is a parameter that controls how the kernel handles such requests. The default value in SLES is 0, which means the kernel uses a heuristic to check whether enough memory is available before granting a request. The other two possible values are 1 and 2. A value of 1 makes the system behave as if it has all the memory it needs, without any check. A value of 2 makes the kernel decline a request once the total committed memory would exceed a fixed limit. Some applications perform better if the system behaves as if it has all the memory the application process needs, but this can also lead to out-of-memory situations in which the kernel OOM killer gets invoked. overcommit_memory can be changed via:
echo 1 > /proc/sys/vm/overcommit_memory
The overcommit_ratio parameter applies when overcommit_memory is set to 2; it has no effect in modes 0 and 1. In mode 2 the kernel allows total committed memory up to all of swap plus overcommit_ratio percent of RAM. With the default overcommit_ratio of 50, a system with 8 GB RAM and 2 GB swap therefore has a commit limit of 2 GB + 4 GB = 6 GB. Changing overcommit_ratio to 70 raises the limit to 2 GB + 5.6 GB = 7.6 GB, so more memory can be committed to processes:
echo 70 > /proc/sys/vm/overcommit_ratio
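According to the kernel's overcommit accounting documentation, the limit enforced with overcommit_memory=2 is swap plus overcommit_ratio percent of RAM, and the kernel reports it as CommitLimit in /proc/meminfo. The sketch below reproduces that calculation with integer arithmetic; the function name commit_limit_mb is illustrative, and memory reserved for Huge Pages is ignored for simplicity.

```shell
#!/bin/sh
# Hypothetical helper: commit limit in MB for overcommit_memory=2:
#   swap + RAM * overcommit_ratio / 100   (huge pages ignored)
# $1 = RAM in MB, $2 = swap in MB, $3 = overcommit_ratio in percent.
commit_limit_mb() {
    echo $(( $2 + $1 * $3 / 100 ))
}
```

For a system with 8 GB RAM and 2 GB swap, "commit_limit_mb 8192 2048 50" prints 6144 and "commit_limit_mb 8192 2048 70" prints 7782.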
- drop_caches: On systems with huge amounts of RAM, a server that ends up utilising a large amount of RAM may start swapping even though the application itself is not using that memory. Linux caches aggressively, and when the application needs memory the kernel may start swapping instead of freeing some of this cache.
To deal with such issues, kernel 2.6.16 and later releases provide a non-destructive mechanism for dropping the page cache, inode and dentry caches via the drop_caches parameter. This can release large amounts of memory that remain unused but are not freed by the kernel, without resorting to a server reboot.
To free pagecache:
echo 1 > /proc/sys/vm/drop_caches
To free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
To free pagecache, dentries and inodes:
echo 3 > /proc/sys/vm/drop_caches
As mentioned, this process is non-destructive and dirty objects are not freed, so it is desirable to run the “sync” command first, before freeing pagecache, dentries and inodes.
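The sync-then-drop sequence can be wrapped in a small function so that the two steps are never run out of order. This is a sketch only; the name drop_caches_safely and the optional second argument (a different target file, useful for dry runs) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: flush dirty data, then ask the kernel to drop caches.
# $1 = 1 (pagecache), 2 (dentries and inodes) or 3 (both).
# $2 = control file; defaults to the real one, overridable for dry runs.
drop_caches_safely() {
    case "$1" in
        1|2|3) ;;
        *) echo "usage: drop_caches_safely 1|2|3 [file]" >&2; return 1 ;;
    esac
    sync                                   # write dirty pages out first
    echo "$1" > "${2:-/proc/sys/vm/drop_caches}"
}
```

Run as root, "drop_caches_safely 3" frees pagecache, dentries and inodes. Expect reads to be slower for a while afterwards, because the caches have to be repopulated.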
SLES Disk I/O & Storage Tuning Optimization
- Dirty Ratio: If there are performance issues observed with write performance on systems with large memory (128GB+), change the memory percentage settings for dirty_ratio and dirty_background_ratio as documented in TID# 7010287.
echo 10 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
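To see why these percentages matter on large-memory systems, it helps to translate them into absolute amounts. The sketch below does so with integer arithmetic; the function name dirty_threshold_mb is illustrative. The kernel applies the ratios to available (free plus reclaimable) memory rather than to total RAM, so the figures are upper bounds.

```shell
#!/bin/sh
# Hypothetical helper: approximate dirty-page threshold in MB.
# $1 = memory in MB, $2 = ratio in percent.
dirty_threshold_mb() {
    echo $(( $1 * $2 / 100 ))
}
```

On a 128 GB server, "dirty_threshold_mb 131072 10" prints 13107 and "dirty_threshold_mb 131072 5" prints 6553, so even with the reduced values several gigabytes of dirty data can accumulate before writeback is forced.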
- IO Scheduler: The default I/O scheduler for SLES is CFQ. It gives good performance for a wide range of I/O tasks, but some I/O tasks can perform much better with another scheduler on certain types of hardware or for certain applications, such as databases. For such workloads the noop or deadline scheduler may give better results.
CFQ: CFQ is a fairness-oriented scheduler and is used by default on SUSE Linux Enterprise. The algorithm assigns each thread a time slice in which it is allowed to submit I/O to disk. This way each thread gets a fair share of I/O throughput. It also allows tasks to be assigned I/O priorities, which are taken into account during scheduling decisions.
NOOP: The NOOP scheduler performs only minimal merging of requests. There is no sorting, and therefore this scheduler has minimal overhead. It was developed for non-disk-based block devices, such as memory devices and SSDs. It also does well on storage media that have extensive caching. It can be helpful for devices that do I/O scheduling themselves, such as intelligent storage, and for devices that do not depend on mechanical movement, like SSDs. Because the NOOP scheduler has less overhead, it may produce better performance for certain workloads.
Deadline: The deadline scheduler works with five different I/O queues and is therefore very capable of distinguishing between read requests and write requests. When using this scheduler, read requests get a higher priority. Write requests have a much longer deadline than reads (by default 5 seconds versus 500 ms), so data to be written can remain in cache for a longer period. This scheduler does well in environments in which both good read performance and good write performance are required, with some priority given to reads. It does particularly well in database environments.
Only one scheduler can be used at a time as the system-wide default. Check with your hardware/storage vendor whether their storage system can manage I/O scheduling itself before you activate the noop scheduler.
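The I/O scheduler can be enabled in two different ways.
a) To enable an I/O scheduler system wide at boot time, add the following parameter to the kernel line in /boot/grub/menu.lst (on SLES 12, which uses GRUB 2, add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and regenerate the configuration with grub2-mkconfig):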
elevator=noop or elevator=deadline
b) To enable a specific scheduler for a certain block device, echo the new value to the scheduler file of that device in sysfs:
echo deadline > /sys/block/<device name>/queue/scheduler
Enabling a specific scheduler per device allows you to optimize I/O for individual block devices on your server depending on the kind of workload each one is running. E.g. if your database is located on block device sdg, you can enable the deadline scheduler for sdg while the rest of the OS continues to use the default CFQ or the NOOP I/O scheduler.
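Before and after changing schedulers it is useful to list what each device currently uses. A minimal sketch, assuming the sysfs layout used in the command above, where the active scheduler is the bracketed entry in the scheduler file; the function name list_schedulers and its optional directory argument are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print "<device> <active scheduler>" for every block
# device. $1 = sysfs block directory (defaults to /sys/block).
list_schedulers() {
    for f in "${1:-/sys/block}"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f%/queue/scheduler}
        dev=${dev##*/}
        # The active scheduler is the entry in square brackets.
        printf '%s %s\n' "$dev" "$(sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$f")"
    done
}
```

Calling list_schedulers without arguments reports all devices on the running system, which makes it easy to confirm that only the intended device was switched.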
- Improving I/O Reads: Although the deadline scheduler balances reads and writes with a slight bias towards reads, the OS can be further optimized for read requests for certain types of applications on a per-disk basis using the read_ahead_kb and nr_requests parameters.
The kernel can detect when an application is reading data sequentially from a file or from disk. In such a scenario it performs an intelligent read-ahead, whereby more data than the user requested is read from disk. When the user next attempts to read data, the kernel does not have to fetch it from disk because it is already available in the page cache, which improves read performance. On a default Linux installation the read_ahead_kb value is set to 128 or 512. On servers with fast disks, read performance can be improved by setting it to 1024 or 2048. For device mapper devices the value can be set as high as 8192, because device mapper devices have multiple underlying devices; 1024 is a good starting point for tuning, though.
Similarly, the default value of nr_requests is 128. Every request queue has a limit on the total number of request descriptors that can be allocated for each of read and write I/Os. This means that with the default value of 128, only 128 read and 128 write requests can be queued at a time before the submitting process is put to sleep. To get better read performance you can set nr_requests to 1024, but increasing the value too much might introduce latency and degrade write performance. For latency-sensitive applications the converse is also true: nr_requests must be set lower than the default of 128, in some cases as low as 1, so that writeback I/O cannot allocate all of the available request descriptors and fill up the device queue with write I/O. To change the values of read_ahead_kb and nr_requests, try:
echo 1024 > /sys/block/<device name>/queue/read_ahead_kb
echo 1024 > /sys/block/<device name>/queue/nr_requests
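Both values can be set and read back in one step, which helps to catch cases where the kernel rejects a value. The following sketch assumes the sysfs paths shown above; the function name tune_queue and its optional fourth argument (an alternative sysfs directory for dry runs) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set read_ahead_kb and nr_requests for one device and
# print what the queue reports afterwards.
# $1 = device name, $2 = read_ahead_kb, $3 = nr_requests,
# $4 = sysfs block directory (defaults to /sys/block).
tune_queue() {
    q="${4:-/sys/block}/$1/queue"
    [ -d "$q" ] || { echo "no such device: $1" >&2; return 1; }
    echo "$2" > "$q/read_ahead_kb" || return 1
    echo "$3" > "$q/nr_requests"   || return 1
    printf '%s read_ahead_kb=%s nr_requests=%s\n' "$1" \
        "$(cat "$q/read_ahead_kb")" "$(cat "$q/nr_requests")"
}
```

For example, "tune_queue sdg 1024 1024" applies the starting values suggested above to the device sdg. Like the echo commands, the change does not persist across reboots.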
- File System Optimization: All modern filesystems on Linux use some kind of journaling mechanism to get the right balance between data safety and performance. The default journal mode for ext3 on SLES is data=ordered, which ensures that data is written to disk before the associated metadata is committed to the journal. To trade some data safety for speed, data=writeback can be used. This option preserves internal filesystem integrity, but it does not guarantee that the contents of newly written files have reached the disk. Another option is to disable barrier support. In case of a power failure or crash the filesystem would then have to run a filesystem check to repair its structure. Even with barriers enabled, most filesystems would still run an fsck after a power failure or crash. Mounting the filesystem with data=writeback,barrier=0 improves filesystem performance at the cost of some reliability. Generally data=writeback alone should be enough to gain some performance without disabling barriers. For example, in /etc/fstab:
/dev/sdc /SAP ext3 acl,user_xattr,data=writeback,barrier=0 1 1
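After the filesystem has been mounted with the new options, it is worth checking what the kernel actually applied. The sketch below looks up a mount point in /proc/mounts and tests for one option. Both function names are hypothetical, and the kernel may spell some options differently from the way they were written in /etc/fstab.

```shell
#!/bin/sh
# Hypothetical helper: print the mount options of a mount point.
# $1 = mount point, $2 = mounts file (defaults to /proc/mounts).
mount_opts() {
    awk -v mp="$1" '$2 == mp { print $4 }' "${2:-/proc/mounts}"
}

# Hypothetical helper: succeed if the mount point carries the given option.
# $1 = mount point, $2 = option, $3 = mounts file (optional).
has_mount_opt() {
    mount_opts "$1" "$3" | tr ',' '\n' | grep -qxF "$2"
}
```

For the example above, "has_mount_opt /SAP data=writeback && echo ok" confirms the journal mode. Note that ext3 does not allow the data mode to be changed on a remount, so the filesystem has to be unmounted and mounted again.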
- Kdump: Configure Kdump on your server and test that it works using the keyboard trigger documented in TID 3374462. Although Kdump is not directly related to optimization of the server, it is useful in a server crash or hang situation, where a kernel memory dump is needed to analyse the root cause of the crash or hang. For servers with a large amount of RAM (512 GB to 1 TB and above) and sufficient disk space available to collect a memory dump, a reasonable value for KDUMP_DUMPLEVEL in the /etc/sysconfig/kdump file is 15. This balances the need to capture as much data as possible against keeping the dump size reasonable when used with the default compression enabled in kdump.
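KDUMP_DUMPLEVEL is passed to makedumpfile as its dump level, a bit mask in which each bit excludes one type of page from the dump (1 zero pages, 2 cache pages, 4 private cache pages, 8 user data pages, 16 free pages). The sketch below decodes a level into the excluded page types; the function name dumplevel_excludes and the short labels it prints are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: decode a KDUMP_DUMPLEVEL value into the page types
# that makedumpfile leaves out of the dump (each bit excludes one type).
dumplevel_excludes() {
    level=$1
    out=""
    for pair in 1:zero 2:cache 4:cache-private 8:user 16:free; do
        bit=${pair%%:*}
        if [ $(( level & bit )) -ne 0 ]; then
            out="$out ${pair#*:}"
        fi
    done
    echo "${out# }"
}
```

"dumplevel_excludes 15" prints "zero cache cache-private user", which shows that level 15 still keeps free pages in the dump. Level 31 would exclude those as well.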