SLES 11/12: Network, CPU Tuning and Optimization – Part 2
This document is a basic SLES tuning guide for network and CPU tuning and optimization. Many of the parameters and settings discussed are generic to Linux and can be applied to other distributions as well. Refer to the IHV/ISV application tuning guides or documentation before you implement the tuning parameters. Read Part 1 here: SLES 11/12 OS Tuning & Optimization Guide – Part 1
Before you start with tuning on the server make sure you create a backup of the current kernel settings using “sysctl -A”:
sysctl -A > /root/sysctl.settings.backup
Note: Some of the tuning parameters are configured somewhat aggressively to improve performance. Hence the settings should never be applied in production environments without proper testing in designated test environments.
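Once a backup exists, it can be compared against the running configuration to see what a tuning session actually changed. The following is a minimal sketch; the function name sysctl_changes and the temporary file path in the usage note are illustrative assumptions, not part of any SLES tool.

```shell
# Print every line of the current settings file ($2) that does not appear
# verbatim in the backup file ($1), i.e. settings that are new or changed.
# Both files are expected to be in "sysctl -A" output format.
sysctl_changes() {
    grep -vxFf "$1" "$2"
}
```

A possible use: run sysctl -A > /tmp/sysctl.now and then sysctl_changes /root/sysctl.settings.backup /tmp/sysctl.now. The function prints nothing when the two files contain the same lines.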
SLES11/12 Network Tuning & Optimization
Tuning the network in the protocol stack involves three areas:
a) the packets themselves, b) the TCP/IP stack and finally c) the application stack in use on your server. Tuning the application stack is beyond the scope of this article; check the ISV/application vendor documentation for tuning and optimizing the application stack to improve application-specific performance. Depending on your network connection and its maximum bandwidth and latency, some of these values can be fine-tuned accordingly.
SLES Kernel Packet Optimization:
The Linux kernel allocates a socket for every network connection. Sockets are the two ends of a communication channel, and every socket has a receive buffer and a send buffer (also known as the read and write buffers). Once these buffers are full, the socket does not accept any more data. As a result no new data can be processed and packets tend to get dropped.
This affects the server's network performance negatively, as dropped packets have to be resent and processed again.
For all sockets on the network, the maximum buffer sizes come from two proc tunables:
/proc/sys/net/core/rmem_max
/proc/sys/net/core/wmem_max
If you see packet drops in your network, tuning these two parameters is very useful.
The values for these two parameters are automatically calculated and set when the SLES server boots, but often these are not the optimum values and can be fine-tuned further. On a server with around 6 GB of RAM these parameters are set to around 4 MB by default. In such a situation you may benefit from changing the value to either 8 MB or 12 MB for servers with large RAM. For example, if the values are changed from the default of 4194304 bytes (4 MB) to a more appropriate 12 MB (12582912 bytes), the commands would be:
echo 12582912 > /proc/sys/net/core/rmem_max
echo 12582912 > /proc/sys/net/core/wmem_max
The same values can be set with sysctl. Both echo and sysctl -w change only the running kernel; to keep the values after a reboot, also add the lines net.core.rmem_max = 12582912 and net.core.wmem_max = 12582912 to /etc/sysctl.conf.
sysctl -w net.core.rmem_max=12582912
sysctl -w net.core.wmem_max=12582912
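The byte values used above are simply the size in MB multiplied by 1024 x 1024. The sketch below shows that arithmetic and prints matching sysctl.conf-style lines; the function names mb_to_bytes and socket_buffer_conf are hypothetical helpers introduced only for illustration.

```shell
# Convert a size in MB to bytes (1 MB = 1024 * 1024 bytes).
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print sysctl.conf-style lines for both socket buffer limits.
# The size argument (in MB) is an illustrative input, not a recommendation.
socket_buffer_conf() {
    bytes=$(mb_to_bytes "$1")
    printf 'net.core.rmem_max = %s\n' "$bytes"
    printf 'net.core.wmem_max = %s\n' "$bytes"
}
```

For example, socket_buffer_conf 12 prints the two 12582912 lines; they could be appended to /etc/sysctl.conf and loaded with sysctl -p after testing.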
Increasing the queue size to accommodate the maximum number of incoming packets is also critical when increasing the read and write buffer sizes. This can be accomplished by tuning the netdev_max_backlog parameter. The default value in SLES for this parameter is 1000; the queue size can be tuned further by raising it to a value consistent with the performance of your network. A much higher value (e.g. 9000) is desirable especially on servers where you expect latency or a higher amount of packet drops.
echo 9000 > /proc/sys/net/core/netdev_max_backlog
sysctl -w net.core.netdev_max_backlog=9000
Increasing the queue size for incoming connections along with the read/write buffers can improve network performance. On SLES this value comes from the somaxconn parameter, whose default value is 128. Doubling it to 256, or increasing it to 512, is an acceptable starting point, e.g.
echo 512 > /proc/sys/net/core/somaxconn
sysctl -w net.core.somaxconn=512
All these parameters on the kernel will help in reducing the packet drop on the server and improve network performance.
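To judge whether the backlog queue is actually overflowing, the per-CPU counters in /proc/net/softnet_stat can be summed. This is a rough sketch that assumes the usual layout of that file: one line per CPU, hexadecimal columns, with the second column counting packets dropped because the backlog queue was full. The function name softnet_drops is hypothetical.

```shell
# Sum the "dropped" column (2nd field, hexadecimal) of softnet_stat-style
# input read from stdin and print the total as a decimal number.
softnet_drops() {
    total=0
    while read -r processed dropped rest; do
        [ -n "$dropped" ] || continue          # skip empty or short lines
        total=$(( total + 0x$dropped ))        # hex field to decimal
    done
    echo "$total"
}
```

Typical use would be softnet_drops < /proc/net/softnet_stat, run before and after a load test; a growing total suggests that netdev_max_backlog is worth raising.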
Basic TCP/IP Optimization for SLES
TCP/IP has its own read and write buffers, and their sizes are exposed in the files tcp_rmem and tcp_wmem. Each file holds three values: the minimum, default and maximum buffer size in bytes. By default the kernel sets these values automatically during boot, but they can be optimized further to improve performance. On a system with 6 GB of RAM it is set to 6 MB, which can be raised to 9 MB.
The read and write buffers can be set to the same size.
Make sure you have an adequate amount of RAM on your server before tuning these values, as the RAM allocated to the TCP read and write buffers cannot be used for other purposes by the OS.
echo "4096 87380 9437184" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 9437184" > /proc/sys/net/ipv4/tcp_wmem
sysctl -w net.ipv4.tcp_rmem="4096 87380 9437184"
sysctl -w net.ipv4.tcp_wmem="4096 87380 9437184"
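The three numbers in tcp_rmem and tcp_wmem are the minimum, default and maximum buffer size in bytes. A small sketch for labelling such a triple is shown here; describe_tcp_mem is a hypothetical helper name, and the MB figure is rounded down by integer division.

```shell
# Label the three fields of a tcp_rmem/tcp_wmem value and show the
# maximum in MB (integer division, so fractions are dropped).
describe_tcp_mem() {
    # Word splitting of the single argument is intentional here.
    set -- $1
    printf 'min=%s default=%s max=%s (%s MB)\n' "$1" "$2" "$3" $(( $3 / 1024 / 1024 ))
}
```

For instance, describe_tcp_mem "$(cat /proc/sys/net/ipv4/tcp_rmem)" labels the values currently in effect.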
Disabling TCP Timestamps
Turning off TCP timestamps can reduce the performance overhead related to timestamp generation. The timestamp is an optional element of the TCP header, and disabling it is also desirable from a security standpoint.
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
sysctl -w net.ipv4.tcp_timestamps=0
Disabling TCP SACK, DSACK, FACK
If your network has high bandwidth, you can try disabling the tcp_sack parameter, which is enabled by default. Selective acknowledgment (SACK) helps to handle network traffic efficiently: only the packet that was dropped is sent again, not the entire packet stream. This is good for most situations, but on a high bandwidth network (10G) it also means that the selectively retransmitted packets have to be reassembled and put in the correct order, which needs large TCP receive buffers and uses more memory.
For low bandwidth networks it is good to keep the tcp_sack parameter on (1), but for high bandwidth networks it can be disabled to improve network performance. This can be done via:
echo 0 > /proc/sys/net/ipv4/tcp_sack
sysctl -w net.ipv4.tcp_sack=0
Disabling selective TCP acknowledgments also requires disabling the related parameters tcp_dsack and tcp_fack.
echo 0 > /proc/sys/net/ipv4/tcp_fack
echo 0 > /proc/sys/net/ipv4/tcp_dsack
sysctl -w net.ipv4.tcp_dsack=0
sysctl -w net.ipv4.tcp_fack=0
In SLES selective packet acknowledgment is enabled by default. As mentioned above, it can be disabled under certain conditions to gain performance, but if it remains enabled (i.e. the tcp_sack parameter is set to 1), the memory reserved for the incoming packet buffer should be set to a high enough value. There are two parameters to tune: ipfrag_low_thresh and ipfrag_high_thresh.
The ipfrag_high_thresh parameter tells the kernel the maximum amount of memory to use to reassemble IP fragments. When the high threshold is reached, the fragment handler drops all packets until the memory usage falls to ipfrag_low_thresh. This means that all fragments that arrive during this time have to be retransmitted.
Packets are fragmented if they are too large to pass through a certain link. In that case, the host that is trying to transmit them breaks them down into smaller pieces and sends each piece one by one. When these fragments reach their destination, they need to be defragmented (i.e. put together again) to be read properly.
echo 393216 > /proc/sys/net/ipv4/ipfrag_low_thresh
echo 544288 > /proc/sys/net/ipv4/ipfrag_high_thresh
sysctl -w net.ipv4.ipfrag_low_thresh=393216
sysctl -w net.ipv4.ipfrag_high_thresh=544288
TCP SYN QUEUE
All incoming TCP connections are queued until they are serviced. A TCP SYN queue is created for each port. Once the queue is filled, new connections start getting dropped.
The default value per port is very low: either 1024 or 2048 pending connection requests are held for each port. The queue size can be increased for better performance. It is managed by the tcp_max_syn_backlog parameter, and for improved performance it can be changed to:
echo 8192 > /proc/sys/net/ipv4/tcp_max_syn_backlog
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
The tcp_synack_retries parameter defines the number of times the kernel will retransmit its response (SYN-ACK) to a new incoming connection request. By default the value in SLES is 5. For customer networks with 1G to 10G connections this can be reduced to 3: the servers are generally busy, and fewer retries help improve network performance.
echo 3 > /proc/sys/net/ipv4/tcp_synack_retries
sysctl -w net.ipv4.tcp_synack_retries=3
Another way to improve server performance is to reduce the value of tcp_retries2: reducing the number of times data is resent on an established connection to a remote host improves the server's ability to use its resources. The default value for tcp_retries2 in SLES is 15; this can be reduced to 6.
echo 6 > /proc/sys/net/ipv4/tcp_retries2
sysctl -w net.ipv4.tcp_retries2=6
TCP Keep Alive Time
The tcp_keepalive_time option determines how long an inactive established connection is maintained before keepalive probes are sent. The default value is 7200 seconds, which is quite large, and the server might end up running out of resources if there are too many requests coming in. It would be good to reduce it to a much lower value.
echo 1000 > /proc/sys/net/ipv4/tcp_keepalive_time
sysctl -w net.ipv4.tcp_keepalive_time=1000
The tcp_keepalive_probes option determines how many keepalive probes are sent before the connection is declared dead. The default value is 9, which can be reduced to 4 to terminate dead connections faster.
echo 4 > /proc/sys/net/ipv4/tcp_keepalive_probes
sysctl -w net.ipv4.tcp_keepalive_probes=4
TCP Keep Alive Probe
The tcp_keepalive_intvl parameter controls the interval between keepalive probes. By default in SLES this interval is 75 seconds, which is very high: even with only 4 probes it would take 5 minutes (4 x 75 seconds) for your server to see that the connection has failed. A reasonable value for this parameter would be 20.
echo 20 > /proc/sys/net/ipv4/tcp_keepalive_intvl
sysctl -w net.ipv4.tcp_keepalive_intvl=20
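The three keepalive parameters combine into the approximate time it takes to detect a dead idle connection: the idle time, plus the number of probes multiplied by the probe interval. The sketch below only does this arithmetic; keepalive_detect_seconds is a hypothetical helper name and the result is an approximation, not an exact kernel timing.

```shell
# Approximate seconds until a dead idle connection is dropped:
#   $1 = tcp_keepalive_time, $2 = tcp_keepalive_probes, $3 = tcp_keepalive_intvl
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}
```

With the defaults named above, keepalive_detect_seconds 7200 9 75 gives 7875 seconds; with the tuned values, keepalive_detect_seconds 1000 4 20 gives 1080 seconds.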
The tcp_tw_recycle parameter enables fast recycling of TIME_WAIT sockets. The default value is 0 (disabled); it can be enabled. Be aware that tcp_tw_recycle is known to cause connection failures for clients behind NAT, and that the parameter was removed in Linux kernel 4.12.
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
sysctl -w net.ipv4.tcp_tw_recycle=1
The tcp_tw_reuse parameter allows reusing sockets in TIME_WAIT state for new connections. The SLES default value is 0 (disabled). It can be enabled.
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
sysctl -w net.ipv4.tcp_tw_reuse=1
The kernel tends to wait a little before reusing a socket. Enabling both the tcp_tw_reuse and tcp_tw_recycle parameters allows sockets to be reused faster.
The tcp_fin_timeout setting determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources. While the connection is held in this wait state, reopening the connection to the client is less costly than establishing a new connection. If the value is reduced, TCP/IP can release closed connections much faster, making more resources available for new connections. The default value in SLES for tcp_fin_timeout is 60; this can be reduced to 30.
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
sysctl -w net.ipv4.tcp_fin_timeout=30
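Before changing these timers it can help to see how many sockets are actually sitting in a given state. The following sketch counts entries in /proc/net/tcp-style input; it assumes the standard layout of that file, where the fourth column is the hexadecimal state code (06 is TIME_WAIT, 05 is FIN_WAIT2, 03 is SYN_RECV). The function name count_tcp_state is hypothetical.

```shell
# Count sockets in a given TCP state from /proc/net/tcp-style input on stdin.
#   $1 = two-digit hexadecimal state code, e.g. 06 for TIME_WAIT
count_tcp_state() {
    # NR > 1 skips the header line; ($4 "") forces a string comparison.
    awk -v st="$1" 'NR > 1 && ($4 "") == st { n++ } END { print n + 0 }'
}
```

For example, count_tcp_state 06 < /proc/net/tcp prints the number of IPv4 sockets currently in TIME_WAIT.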
The maximum transmission unit (MTU) of a network interface is the size of the largest block of data that can be transmitted as a single unit. Anything larger than the MTU must be broken into smaller units prior to transmission.
Increasing the MTU size improves the throughput of high speed Ethernet. The standard Ethernet MTU remains at 1500 bytes; modern networking equipment is capable of handling larger frames but must be configured accordingly. Frames which take advantage of this ability are known as ‘jumbo frames’, and 9000 bytes is a popular choice for the MTU.
The MTU can be increased persistently on SLES for the eth0 interface by editing the configuration file /etc/sysconfig/network/ifcfg-eth0 and setting the parameter MTU='9000'.
Along with this, enabling tcp_mtu_probing is recommended for hosts with jumbo frames enabled.
echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing
sysctl -w net.ipv4.tcp_mtu_probing=1
To enable jumbo frames non-persistently (until the next reboot or interface restart):
ifconfig eth0 mtu 9000
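Whether every device on the path really passes jumbo frames can be checked with a ping that forbids fragmentation. The largest ICMP payload that fits into one frame is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header. The sketch below computes that size; ping_payload_for_mtu is a hypothetical helper name, and the target host in the usage note is a placeholder.

```shell
# Largest ICMP echo payload that fits into one unfragmented IPv4 packet:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
ping_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}
```

A possible check, with <target-host> standing in for a machine on the same jumbo-frame network: ping -M do -s "$(ping_payload_for_mtu 9000)" -c 3 <target-host>. If the ping fails with a "message too long" style error, some device on the path is not configured for the larger MTU.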
SLES Network/Kernel Security
Some of the best practices recommended for SLES 11/12 are listed below. Some of these parameters are enabled by default in SLES while others have to be enabled. To get the current status of any of these settings, you can query the kernel (/proc/sys/net/ipv4/conf/all/). It is also a good idea to store the output of the default kernel settings (sysctl -A > /root/sysctl.settings.store) before making any changes.
1) Enable TCP SYN Cookie Protection
echo 1 > /proc/sys/net/ipv4/tcp_syncookies
sysctl -w net.ipv4.tcp_syncookies=1
2) Disable IP Source Routing
echo 0 > /proc/sys/net/ipv4/conf/all/accept_source_route
sysctl -w net.ipv4.conf.all.accept_source_route=0
3) Disable ICMP Redirect Acceptance
echo 0 > /proc/sys/net/ipv4/conf/all/accept_redirects
sysctl -w net.ipv4.conf.all.accept_redirects=0
4) Enable IP Spoofing Protection
echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter
sysctl -w net.ipv4.conf.all.rp_filter=1
5) Enable Ignoring of ICMP Echo Requests
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
sysctl -w net.ipv4.icmp_echo_ignore_all=1
6) Enable Ignoring of Broadcast Requests
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=1
7) Enable Bad Error Message Protection
echo 1 > /proc/sys/net/ipv4/icmp_ignore_bogus_error_responses
sysctl -w net.ipv4.icmp_ignore_bogus_error_responses=1
8) Enable Logging of Spoofed Packets, Source Routed Packets, Redirect Packets
echo 1 > /proc/sys/net/ipv4/conf/all/log_martians
sysctl -w net.ipv4.conf.all.log_martians=1
9) Virtual Address Space Randomization
echo 2 > /proc/sys/kernel/randomize_va_space ; sysctl -w kernel.randomize_va_space=2
10) Enable kptr_restrict
echo 1 > /proc/sys/kernel/kptr_restrict ; sysctl -w kernel.kptr_restrict=1
11) File System Hardening
echo 1 > /proc/sys/fs/protected_hardlinks
echo 1 > /proc/sys/fs/protected_symlinks
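Each sysctl key used above maps to a file under /proc/sys by replacing the dots with slashes, which makes it easy to audit the settings from a script. The following is a minimal sketch under that assumption; the names PROC_ROOT, sysctl_key_to_path and check_setting are hypothetical, and keys that contain an interface name with a dot in it are not handled.

```shell
# Root of the sysctl tree; overridable so the functions can be tried
# against a copy of the tree instead of the live system.
PROC_ROOT="${PROC_ROOT:-/proc/sys}"

# Map a sysctl key such as net.ipv4.tcp_syncookies to its /proc/sys path.
sysctl_key_to_path() {
    printf '%s/%s\n' "$PROC_ROOT" "$(printf '%s' "$1" | tr '.' '/')"
}

# Compare the current value of a key ($1) with the expected value ($2).
check_setting() {
    key=$1
    want=$2
    path=$(sysctl_key_to_path "$key")
    if [ ! -r "$path" ]; then
        echo "$key: not available"
        return 2
    fi
    have=$(cat "$path")
    if [ "$have" = "$want" ]; then
        echo "$key: OK ($have)"
    else
        echo "$key: is $have, expected $want"
        return 1
    fi
}
```

For example, check_setting net.ipv4.tcp_syncookies 1 and check_setting kernel.randomize_va_space 2 report whether the running kernel matches the values recommended in this section.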
SLES General Network Best Practices
- Make sure you have the latest network driver modules on SLES, which are provided with the latest kernel updates.
- Use network card link aggregation to increase the network throughput of your server. (802.3ad bonding provides dynamic link aggregation if supported by the connected switch.)
- Check Ethernet configuration settings such as frame size, MTU, speed, and duplex mode on your network. For optimum performance, ensure all devices involved in network communications use similar or identical settings.
SLES CPU Tuning & Optimization
The Linux kernel ensures that all processes get a fair share of CPU time, so that no single process is starved of CPU time and all process threads are handled evenly. The CPU scheduler tries to keep latency to a minimum and to share CPU cycles equally among all processes.
Many processes are involved in I/O operations at some point. The kernel puts such a process on hold while it waits for its I/O, so the CPU cycles it does not need during that time are available to other processes.
In a multi-CPU/multi-core environment the scheduling policy is more complex than on a single-CPU system. On a single-CPU system the kernel ensures that every process gets a fair amount of time on the one CPU; in a multi-CPU environment it also has to balance the load, so the kernel tries to distribute process threads evenly among the available CPUs. For this purpose Linux uses an SMP kernel.
A multi-CPU system also allows the processes and threads of an application to move to other CPUs or cores. While this might seem good, moving a process or thread to another CPU means its cached data has to be moved along with it, which is expensive in CPU cycles and can slow down the process on a multi-CPU system.
While many applications are designed to make use of multiple CPUs and can use them efficiently, others rely on the kernel to let their process threads make the best use of a multi-CPU environment.
Since multi-CPU systems have now become the norm, performance improvement mainly deals with the priority of a process and with optimization in an SMP environment.
Linux runs normal processes and real-time processes. The kernel scheduler always runs real-time processes with high priority, and the other processes have to share the remaining CPU cycles. taskset is one command that can be used to bind a process to one or more CPUs, i.e. to set the CPU affinity of the process.
CPU affinity is represented as a bitmask, where each bit in the mask represents a CPU core. If the bit is set to 1, the thread may run on that core; if the bit is set to 0, the thread is excluded from running on that core. The default value for an affinity bitmask is all ones, which allows the thread/interrupt to run on any core. A process can be instructed to run on selected CPUs by changing the affinity of the process. Child processes inherit the CPU affinity of their parents.
In the bitmask the value 0x00000001 (or 0x1) refers to processor 0 (CPU0), the value 0x00000002 (or 0x2) to processor 1 (CPU1), and 0x00000003 (or 0x3) to processors 0 and 1. Similarly, 0x00000004 (or 0x4) refers to processor 2 (CPU2), 0x00000008 (or 0x8) to CPU3, and so on.
For example, to set the affinity of the process with PID 5384 to processor 0:
taskset -p 0x1 5384
or
taskset -p 0x00000001 5384
Similarly, to set the affinity to processors 0 and 1:
taskset -p 0x3 5384
or
taskset -p 0x00000003 5384
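The masks above follow one rule: CPU n corresponds to bit n, i.e. the value 1 shifted left by n, and several CPUs are combined with a bitwise OR. A short sketch of that calculation follows; cpu_mask is a hypothetical helper name.

```shell
# Build an affinity bitmask from a list of CPU numbers.
# Each CPU n sets bit n (1 << n); the bits are OR-ed together.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '0x%x\n' "$mask"
}
```

For example, cpu_mask 0 1 prints 0x3 and cpu_mask 3 prints 0x8, so taskset -p "$(cpu_mask 0 1)" 5384 is equivalent to the 0x3 example above.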
taskset can be used along with cset to assign processes to multiple CPUs. See TID#7009596, Isolating CPUs From The General Scheduler.
Symmetric multiprocessing is a kernel feature that allows programs to be handled by multiple CPUs. The smp_affinity file (/proc/irq/<irq_number>/smp_affinity) holds the interrupt affinity value for an IRQ number. You can use the same bitmask that is used for taskset to set the CPU affinity of an IRQ.
For example, the smp_affinity entry for the Ethernet driver of eth0 (eth0 and eth1 form the bonded network bond0):
grep eth0 /proc/interrupts
19: 6675 IO-APIC-fasteoi eth0,eth1
The IRQ number for eth0 is 19, and the content of the corresponding smp_affinity file is:
cat /proc/irq/19/smp_affinity
1
In this case CPU0 services all IRQ requests for the Ethernet driver. If you want that IRQ to be handled by CPU1, use the following command:
echo 2 > /proc/irq/19/smp_affinity
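Looking up the IRQ number by hand becomes tedious when a device has several interrupt lines. The sketch below extracts the IRQ numbers for a device name from /proc/interrupts-style input; it assumes the device name appears in the last column, as in the output shown above, and irqs_for_device is a hypothetical helper name. The match is a simple pattern match, so eth0 would also match a device called eth01.

```shell
# Print the IRQ numbers whose last column mentions the given device name.
# Reads /proc/interrupts-style input from stdin; $1 = device name.
irqs_for_device() {
    awk -v dev="$1" '$NF ~ dev { sub(":", "", $1); print $1 }'
}
```

A possible use is to loop over the result: for irq in $(irqs_for_device eth0 < /proc/interrupts); do cat /proc/irq/$irq/smp_affinity; done.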
In Linux every process starts with the same priority. The kernel determines which processes get more CPU time than others by the process nice level, also called niceness. The higher the nice level of a process, the less CPU time it takes from other processes. With the nice command, niceness can be set from -20, which grants the process the highest scheduling priority, to 19, which gives it the lowest.
The root user can raise or lower the nice value of any user's processes; a regular user can only lower the priority of a process, i.e. set a higher nice value. The default niceness of every newly started process is 0.
Running the nice command without a level increments the current nice level for the given command by 10. Using nice -n level command lets you specify a new niceness relative to the current one. You can also use the renice command to adjust the niceness of a running process. renice is used with the PID of the process whose priority you want to adjust; you can use the top or ps command to find the PID.
To renice all processes owned by a specific user, use the option -u user. Process groups are reniced with the option -g process group id.
E.g. to launch gimp at higher priority
nice -n -8 gimp
For a process that is already running you can change the nice value using the renice command:
renice -19 4358    # set the niceness of process 4358 to -19
renice -15 `pgrep -u sascha spamd`    # set the niceness of user sascha's spamd processes to -15
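That nice -n works relative to the current niceness can be seen with a small experiment. The sketch below assumes the GNU coreutils version of nice, which prints the current niceness when it is run without a command; relative_nice is a hypothetical helper name. Only positive increments are used, since lowering the nice value requires root.

```shell
# Print the niceness a command would get when started with "nice -n $1"
# from the current shell. Assumes GNU coreutils nice, which prints the
# current niceness when called without a command.
relative_nice() {
    nice -n "$1" nice
}
```

Running nice first and relative_nice 5 afterwards shows the second value to be 5 higher than the first, capped at the maximum of 19.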
TUNING the SLES CPU Scheduler
The Completely Fair Scheduler (CFS) has been the default scheduler of the Linux kernel since 2.6.23. It ensures that all processes get a fair share of CPU time with minimum latency.
Tuning the Linux scheduler needs to be done carefully, as the scheduler tunes itself for optimum performance. To get a list of CPU scheduler related variables, run the command:
sysctl -A | grep "sched" | grep -v "domain"
Note that variables ending with _ns and _us accept values in nanoseconds and microseconds, respectively. As mentioned, the CPU scheduler variables normally need no modification, but some tuning can be done if you have an application that uses Java, or one where the forked child needs to run ahead of its parent.
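Because the raw nanosecond and microsecond values are hard to read, the listing can be annotated with the equivalent in milliseconds. The following sketch does this for input in the "key = value" format printed by sysctl; annotate_sched_units is a hypothetical helper name, and the variable names in the usage note are only examples of the naming pattern.

```shell
# Append the value in milliseconds to sysctl lines whose key ends in
# _ns (nanoseconds) or _us (microseconds); other lines pass through.
annotate_sched_units() {
    awk -F' = ' '
        $1 ~ /_ns$/ { printf "%s = %s (%.3f ms)\n", $1, $2, $2 / 1000000; next }
        $1 ~ /_us$/ { printf "%s = %s (%.3f ms)\n", $1, $2, $2 / 1000; next }
        { print }
    '
}
```

It can be placed at the end of the pipeline above: sysctl -A | grep "sched" | grep -v "domain" | annotate_sched_units. A line such as kernel.sched_latency_ns = 6000000 would then be shown with (6.000 ms) appended.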
Optimizing CPU for Java Applications
- sched_compat_yield: Enables the aggressive yield behavior of the old O(1) scheduler, which was used prior to the 2.6.23 kernel. Java applications that use synchronization extensively perform better with this value set to 1, as do applications that depend on the behavior of the sched_yield() system call. Only use it when you see a drop in performance. The default value is 0. To set the value to 1:
echo "1" > /proc/sys/kernel/sched_compat_yield
sysctl -w kernel.sched_compat_yield=1
- sched_child_runs_first: If you would like the child process to run before the parent after a fork, changing the sched_child_runs_first parameter is useful.
Some applications perform better if the child process runs first after forking. On servers with multiple CPUs the time slice received by parent and child might be the same.
Check your application's ISV documentation before changing the value. To change the default value from 0 to 1:
echo "1" > /proc/sys/kernel/sched_child_runs_first
sysctl -w kernel.sched_child_runs_first=1
Additional CPU topics, i.e. cgroups, CFS scheduler features and tuning etc., will be covered in the days ahead.
Document Links for SLES Tuning and Optimization: