SUSE Linux Enterprise

Performance Analysis, Tuning and Tools on SUSE Linux Enterprise Products

This document describes how to configure and tune a SUSE Linux Enterprise-based system to get the best possible performance out of it. It covers different layers, from BIOS settings to kernel parameters, to show you what can be changed and how.

On the other hand, this document does not describe the networking solutions to reach very high throughput on Linux systems (for example Data Plane Development Kit, in short DPDK), by-passing the Linux kernel stack entirely. This document focuses solely on the standard Linux kernel infrastructure.

Author: Marco Varlese, Software Engineer, SUSE
Publication Date: November 07, 2017

1 Introduction

With the evolution of computer architecture, performance has reached levels which were unimaginable a few years ago. However, the complexity of modern computer architectures requires end users and developers to know not only how to write code, but also how to configure and deploy software for a specific architecture to get the most out of it.

This document focuses on fine-tuning a SUSE Linux Enterprise system. It covers settings and parameters configurable on SUSE Linux Enterprise software offerings, Network Interface Card (NIC) settings and some BIOS settings which are common to most hardware vendors.

Performance tuning is hard and general recommendations are tricky. This document tries to provide an insight into the Linux kernel configurations which have an impact on overall system performance (throughput versus latency). Where settings are described, some examples of potential values are provided. However, those values should be read relative to one another across the different profiles and not necessarily as absolute values to be used.

This document does not intend to provide a generic rule of thumb (or values) to be used for performance tuning. Fine-tuning the parameters described here still requires a thorough understanding of the workloads and the hardware they run on.

2 BIOS Setup

The BIOS is the foundation and the first level of tuning which can have an impact on the performance of your overall system.

The BIOS controls the voltage (and hence frequency) which components like CPU and RAM run at. In addition, it allows you to enable or disable specific CPU features which can have a profound impact not only on system performance, but also on power usage.

The first thing to know is which states a CPU can be in when performing its duties.

There are two sets of states (or modes): the C-states and P-states. C-states are idle states while P-states are operational states.

Aside from the C0 state, which is the only one where the CPU is actually busy doing work, all other C-states are idle states. The basic idea behind C-states is that when a CPU is not doing any useful work it is better to shut it down. This helps reduce power usage, which for an electrical component like the CPU also means extending its life expectancy.

P-states control the operational state of the CPU when it is doing some useful work. For instance, even if the CPU/core is in C0 state that does not mean it needs to run at its maximum speed. A very basic example is a laptop running in battery mode: the CPU enters a higher P-state, reducing the frequency at which the CPU/core runs to minimize power consumption.

This document does not go into the details of each C/P-state. The following links provide detailed references:

Whether to enable or disable C/P-states for greater throughput or lower latency depends a lot on the use case. For instance, in some ultra-low latency applications it is beneficial to disable the CPU C-states, because when the CPU is always in C0 state, there is no overhead to resume working.

Similarly, for certain use cases where you want to predict the amount of work performed by the CPU in a given amount of time, it is beneficial to set the CPU frequency to always run at a certain speed (for example 3 GHz) and still allow Turbo Boost.

2.1 Cpupower Tool

Use the cpupower tool to read your supported CPU frequencies, and to set them. To install the tool run the command zypper in cpupower.

As an example, if you run the command # cpupower frequency-info, you can read some important information from the output:

hardware limits: 1.20 GHz - 2.20 GHz

This represents the frequency range supported by the CPU.

available frequency steps:  2.20 GHz, 2.10 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 
1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz

This represents the values which the frequency can be set to if manually set.

available cpufreq governors: userspace ondemand performance

This represents the governors supported by the kernel:

  • userspace allows the frequency to be set manually.

  • ondemand allows the CPU to run at different speeds depending on the workload.

  • performance sets the CPU frequency to the maximum allowed.

current CPU frequency: 1.70 GHz (asserted by call to hardware)

This shows the frequency at which the CPU is currently running.

boost state support:
       Supported: yes
       Active: no

This shows whether Turbo Boost is supported by your CPU and if it is enabled or disabled.

Note: Disabled P-states

If P-states are disabled, Turbo Boost is automatically not supported. This means the row Supported: above will always show no and consequently it will not be possible to enable it.

Similarly, when P-states are enabled and managed by the intel_pstate driver (Intel CPUs), the userspace governor is not supported. This means it is not possible to manually set a specific frequency. Currently, the only two governors supported by the intel_pstate driver are performance and powersave.

To disable P-states on Intel platforms, it is sufficient to append intel_pstate=disable to the kernel boot parameters.

Where Turbo Boost is supported, you can enable it by running this command:

# echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo

or disable it by running this command:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

To set a specific governor run one of these commands:

# cpupower frequency-set -g userspace

(set the governor to userspace)

# cpupower frequency-set -g ondemand

(set the governor to ondemand)

# cpupower frequency-set -g performance

(set the governor to performance)

If you want to set the CPU frequency to a particular speed, run the command:

# cpupower frequency-set -f [FREQUENCY]

Replace [FREQUENCY] with one of the values returned by cpupower frequency-info in the row available frequency steps.
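The governor currently in use can also be read per CPU from sysfs, which is convenient when checking many cores at once. The following is a minimal sketch, assuming the cpufreq interface is exposed under /sys/devices/system/cpu; the function name show_governors and its optional root argument are illustrative only and not part of cpupower.

```shell
#!/bin/sh
# Print "cpuN GOVERNOR" for every CPU that exposes a cpufreq policy.
# The optional first argument overrides the sysfs root (an assumption made
# here so the function can be tried against a test directory).
show_governors() (
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without cpufreq support (or an unmatched glob).
        [ -r "$f" ] || continue
        cpu=${f%/cpufreq/scaling_governor}
        printf '%s %s\n' "${cpu##*/}" "$(cat "$f")"
    done
)

show_governors
```

On systems where no cpufreq driver is loaded, the function prints nothing.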

3 Kernel Tuning

The Linux kernel provides many parameters that can be tuned via the sysctl interface or the proc file system. The following sections describe the settings which can have a direct impact on overall system performance, along with values which can be used for specific profiles (for example high-throughput versus low-latency).

3.1 I/O Scheduler Tuning

The first setting which has a direct impact on I/O performance is the I/O scheduler chosen for your device. The I/O scheduler can be defined for each device. This means the Linux kernel allows you to use different scheduling policies for different devices. This can be very convenient on systems where different hard-drives perform different duties, thus different policies among them may make sense.

To retrieve or change the I/O scheduler of a device, read or write the file /sys/block/DEVICE/queue/scheduler (for example /sys/block/sda/queue/scheduler for the device sda).

On SUSE Linux Enterprise-based distributions you can choose among three different scheduling algorithms to be assigned to each device: noop, cfq and deadline.

The Completely Fair Queuing (CFQ) scheduler is fairness-oriented and is the default algorithm used by the kernel. The algorithm gives each process a time slice in which it is allowed to perform I/O on the disk.

To enable the CFQ scheduler, run the command:

# echo cfq > /sys/block/sda/queue/scheduler

The DEADLINE algorithm is a latency-oriented I/O scheduler where each request is assigned a target deadline. In all those cases where several threads are performing reads or writes this algorithm offers greater throughput as long as fairness is not a requirement.

To enable the DEADLINE scheduler, run the command:

# echo deadline > /sys/block/sda/queue/scheduler

The NOOP algorithm is the simplest of the three. It performs any I/O which is sent to the scheduler without any complex scheduling. We recommend using it on systems where the storage devices can perform scheduling themselves, because it avoids competition between the storage device and the CPU both trying to schedule I/O. It is also recommended in virtual machines, which do not have direct access to the storage device because it is virtualized by the hypervisor.

To enable the NOOP scheduler, run the command:

# echo noop > /sys/block/sda/queue/scheduler
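When reading the scheduler file, the kernel lists all available schedulers and marks the active one with square brackets (for example noop deadline [cfq]). The following sketch extracts the active scheduler for every block device; the helper name active_scheduler is illustrative only.

```shell
#!/bin/sh
# Print the scheduler enclosed in square brackets in the given file,
# for example "cfq" for the content "noop deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Report the active scheduler of each block device known to the system.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$f" "$(active_scheduler "$f")"
done
```

Devices whose scheduler file contains no bracketed entry are printed with an empty value.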

3.2 Task Scheduler Tuning

Basic aspects and configuration of the Linux kernel task scheduler are performed during the kernel configuration and compilation. This document does not cover those details. It rather covers some sysctl settings which can have an impact on throughput or latency of the system involved with packet processing.

The default Linux kernel scheduler is the Completely Fair Scheduler (CFS), which tracks the virtual runtime (vruntime) accumulated by each task. When a new task needs to be selected, the scheduler always picks the task with the minimum accumulated vruntime.

There are a few scheduling policies which can be assigned to running processes:

  • SCHED_OTHER is the default Linux scheduling policy.

  • SCHED_FIFO uses the First In First Out algorithm and is usually used for some time-critical applications.

  • SCHED_RR is similar to the SCHED_FIFO policy but it is implemented using a Round Robin algorithm.

  • SCHED_BATCH is designed for CPU-intensive applications which may need to hold the CPU for a longer time to complete.

  • SCHED_IDLE is designed for low priority tasks which may run seldom or that are not time-critical.

  • SCHED_DEADLINE is designed to make a task complete within a given deadline very similarly to the I/O deadline scheduler.

It is possible to assign different policies to processes using the tool chrt (shipped with the util-linux package). The same tool can be used to retrieve information about running processes and the priorities supported by each policy.

In the example below, you can retrieve valid priorities for the various scheduling policies:

# chrt -m
SCHED_SCHED_OTHER min/max priority	: 0/0
SCHED_SCHED_FIFO min/max priority	: 1/99
SCHED_SCHED_RR min/max priority	: 1/99
SCHED_SCHED_BATCH min/max priority	: 0/0
SCHED_SCHED_IDLE min/max priority	: 0/0
SCHED_SCHED_DEADLINE min/max priority	: 0/0

Based on the above priorities you can set – for example – a process with the SCHED_FIFO policy and a priority of 1:

# chrt -f -p 1 <PID>

Or you can set a SCHED_BATCH policy with a priority of 0:

# chrt -b -p 0 <PID>

The following sysctl settings can have a direct impact on throughput and latency:

  • kernel.sched_min_granularity_ns represents the minimal preemption granularity for CPU bound tasks. See sched_latency_ns for details. The default value is 4000000 nanoseconds.

  • kernel.sched_wakeup_granularity_ns represents the wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute bound tasks. Lowering it improves wake-up latency and throughput for latency critical tasks, particularly when a short duty cycle load component must compete with CPU bound components. The default value is 2500000 nanoseconds.

  • kernel.sched_migration_cost_ns is the amount of time after the last execution that a task is considered to be cache hot in migration decisions. A hot task is less likely to be migrated to another CPU, so increasing this variable reduces task migrations. The default value is 500000 nanoseconds. If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.

  • kernel.numa_balancing is a boolean flag which enables or disables automatic NUMA balancing of processes and threads. Automatic NUMA balancing uses several algorithms and data structures, which are only active and allocated if automatic NUMA balancing is active on the system.

The table below compares example values for these parameters across different performance profiles.

Table 1: Kernel Tuning - Comparison

Parameter                          | Balanced  | Higher Throughput     | Lower Latency
-----------------------------------+-----------+-----------------------+--------------
kernel.sched_min_granularity_ns    | 2,250,000 | 10,000,000            | 10,000,000
kernel.sched_wakeup_granularity_ns | 3,000,000 | 15,000,000            | 1,000,000
kernel.sched_migration_cost_ns     | 500,000   | 250,000               | 5,000,000
kernel.numa_balancing              | 1         | 0                     | 0
kernel.pid_max                     | 32,768    | 1024 * NUMBER_OF_CPUS | 32,768
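The values in the table are written with thousands separators for readability, while sysctl expects plain integers. As a minimal sketch, the helper below (to_sysctl_conf, a name chosen for illustration) turns "parameter value" lines into the key = value format used by files in /etc/sysctl.d/, stripping the separators. The values shown are the Lower Latency examples from the table and are not a recommendation; the target file name mentioned in the comment is hypothetical.

```shell
#!/bin/sh
# Read "PARAMETER VALUE" lines from standard input and print them in
# sysctl.d format. Commas used as thousands separators are removed;
# empty lines and lines starting with "#" are skipped.
to_sysctl_conf() {
    while read -r key value; do
        case "$key" in
            ''|'#'*) continue ;;
        esac
        printf '%s = %s\n' "$key" "$(printf '%s' "$value" | tr -d ',')"
    done
}

# Example: Lower Latency column of the comparison table.
# To persist the result, redirect the output to a file such as
# /etc/sysctl.d/90-low-latency.conf (hypothetical name) and load it
# with "sysctl --system".
to_sysctl_conf <<'EOF'
kernel.sched_min_granularity_ns 10,000,000
kernel.sched_wakeup_granularity_ns 1,000,000
kernel.sched_migration_cost_ns 5,000,000
kernel.numa_balancing 0
EOF
```

Whether a given parameter is available depends on the kernel version, so it is worth checking with sysctl PARAMETER before persisting a value.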

3.3 Memory Manager Tuning

The Linux kernel stages disk writes into cache, and over time asynchronously flushes them to disk. In addition, there is the chance that a lot of I/O will overwhelm the cache. The sysctl command allows you to tune how much dirty data to keep in RAM before writing it out to disk. It also allows you to tune various other settings as described below.

  • vm.dirty_ratio is the absolute maximum amount of system memory (here expressed as a percentage) that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point, all new I/O operations are blocked until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory. (vm.dirty_bytes is preferable).

  • vm.dirty_bytes is the amount of dirty memory at which a process generating disk writes will itself start write-back.

    Note: dirty_bytes and dirty_ratio

    dirty_bytes is the counterpart of dirty_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read. The minimum value allowed for dirty_bytes is two pages (in bytes). Any value lower than this limit will be ignored and the old configuration will be retained.

  • vm.dirty_background_ratio is the percentage of system memory that can be filled with dirty pages before the pdflush/flush/kdmflush background processes kick in to write it to disk. Dirty pages are memory pages that still need to be written to disk. As an example, if you set this value to 10 (it means 10%), and your server has 256 GB of memory, then 25.6 GB of data could be sitting in RAM before something is done (vm.dirty_background_bytes is preferable).

  • vm.dirty_background_bytes is the amount of dirty memory at which the background kernel flusher threads will start write-back. This setting is the counterpart of dirty_background_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read. In some scenarios this is a better and safer setting to be used since it provides a finer tuning on the amount of memory (for example, 1% of 256 GB = 2.56 GB might already be too much for some scenarios).

  • vm.swappiness: The kernel buffers always stay in main memory, because they have to. Applications and cache however do not need to stay in RAM. The cache can be dropped, and the applications can be paged out to the swap file. Dropping cache means a potential performance hit. Likewise with paging applications out. This parameter helps the kernel decide what to do. By setting it to the maximum of 100 the kernel will swap very aggressively. By setting it to 0 the kernel will only swap to protect against an out-of-memory condition. The default is 60 which means that some swapping will occur.
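To see why the byte-based settings allow finer control than the ratio-based ones, it helps to convert a percentage into bytes for a given amount of RAM. The sketch below does this arithmetic for the 256 GB example above (treated here as 256 GiB); ratio_to_bytes is an illustrative helper, not an existing tool, and it takes the memory size in kB as reported by the MemTotal line in /proc/meminfo.

```shell
#!/bin/sh
# Convert a percentage of total memory into bytes.
# $1: total memory in kB, $2: percentage (integer)
ratio_to_bytes() (
    mem_total_kb=$1
    percent=$2
    echo $(( mem_total_kb * 1024 * percent / 100 ))
)

# 256 GiB expressed in kB
ratio_to_bytes 268435456 10   # 10% of RAM, about 25.6 GiB
ratio_to_bytes 268435456 1    # 1% of RAM, about 2.56 GiB

# The same calculation for the running system, if /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    ratio_to_bytes "$total_kb" 10
fi
```

Since even 1% is a large amount of dirty data on such a machine, vm.dirty_bytes and vm.dirty_background_bytes can be used to set a smaller, explicit limit.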

The table below compares example values for these parameters across different performance profiles.

Table 2: Memory Manager Tuning - Comparison

Parameter                 | Balanced   | Higher Throughput | Lower Latency
--------------------------+------------+-------------------+--------------
vm.dirty_ratio            | 20         | 40                | 10
vm.dirty_background_ratio | 10         | 10                | 3
vm.dirty_bytes            | 16,384     | 32,768            | 8,192
vm.dirty_background_bytes | 78,643,200 | 104,857,600       | 52,428,800
vm.swappiness             | 60         | 10                | 10

3.4 Networking Stack Tuning

The Linux kernel allows the modification of several parameters affecting the networking stack. Since kernel 2.6.17 the networking stack supports full TCP auto-tuning, allowing the resizing of buffers automatically between a minimum and maximum value.

This chapter goes through some settings which can enhance throughput and latency of the Linux kernel networking stack. These settings are configurable via the sysctl interface.

3.4.1 net.ipv4.

  • tcp_fastopen enables or disables TCP Fast Open (RFC 7413), which allows sending and receiving data in the opening SYN packet. With this option enabled, the initial handshake packets can already carry payload instead of being lost for data transmission. Thus it maximizes network bandwidth usage.

  • tcp_lowlatency when enabled (value set to 1) instructs the Linux kernel to make decisions that prefer low-latency to high-throughput. By default this setting is disabled (value set to 0). It is recommended to enable this option in profiles preferring lower latency to higher throughput.

  • tcp_sack enables TCP selective acknowledgments (SACK) when set to 1. The mainline Linux kernel enables it by default (value set to 1). It is recommended to keep this option enabled to enhance performance.

  • tcp_rmem is a tuple of three values, representing the minimum, the default and the maximum size of the receive buffer used by the TCP sockets. The minimum is guaranteed to each TCP socket, even under moderate memory pressure. The default value in this tuple overrides the value set by the parameter net.core.rmem_default.

  • tcp_wmem is a tuple of three values, representing the minimum, the default and the maximum size of the send buffer used by the TCP sockets. The minimum is the amount of memory each TCP socket has the right to use. The default value in this tuple overrides the value set by the parameter net.core.wmem_default.

  • ip_local_port_range defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first local port number, and the second the last local port number.

  • tcp_max_syn_backlog represents the maximum number of remembered connection requests, which have not received an acknowledgment from the connecting client. The minimal value is 128 for low memory machines, and it will increase in proportion to the memory of machine. If the server suffers from overload, try increasing this number.

  • tcp_syn_retries is the number of times a SYN is retried if no response is received. A lower value means less memory usage and reduces the impact of SYN flood attacks but on lossy networks a 5+ value might be worthwhile.

  • tcp_tw_reuse allows reusing sockets in the TIME_WAIT state for new connections when it is safe from the protocol viewpoint. It is generally a safer alternative to tcp_tw_recycle, however it is disabled by default (value set to 0). It is an interesting setting for servers running services like Web servers or Database servers (for example MySQL), because it allows the servers to scale faster on accepting new connections (for example TCP SOCKET ACCEPT). Reusing the sockets can be very effective in reducing server load. Because this setting is very use case centric it should be used (enabled) with caution.

  • tcp_tw_recycle enables the fast recycling of TIME_WAIT sockets. The default value is 0 (disabled). Some sysctl documentation incorrectly states the default as enabled. It is known to cause issues in load balancing and failover scenarios when enabled (value set to 1). The problem mostly affects scenarios where the machine configured with this setting enabled is a server behind a device performing network address translation (NAT). When recycle is enabled, the server cannot distinguish new incoming connections from different clients behind the same NAT device. Because this setting is very use case centric it should be used (enabled) with caution.

  • tcp_timestamps enables time stamps as defined in RFC 1323. It is enabled by default (value set to 1). With this value, the kernel uses a random offset for each connection rather than only using the current time.

3.4.2 net.core.

  • netdev_max_backlog sets the maximum number of packets queued on the INPUT side when the interface receives packets faster than the kernel can process them.

  • netdev_budget: if SoftIRQs do not run for long enough, the rate of incoming data could exceed the kernel's capability to consume the buffer fast enough. As a result, the NIC buffers will overflow and traffic will be lost. Occasionally, it is necessary to increase the time that SoftIRQs are allowed to run on the CPU, and this parameter allows that. The default value of the budget is 300. This will cause the SoftIRQ process to consume up to 300 messages from the NIC before getting off the CPU.

  • somaxconn describes the limits of socket listen() backlog, known in userspace as SOMAXCONN. The default value is set to 128. See also tcp_max_syn_backlog for additional tuning for TCP sockets.

  • busy_poll represents the low latency busy poll timeout for poll and select, that is, the approximate time in microseconds to busy loop waiting for events. The recommended value depends on the number of sockets you poll on. For several sockets use the value 50, for several hundreds use 100. For more than that you probably want to use epoll.

    Note: Sockets

    Only sockets with SO_BUSY_POLL set will be busy polled. This means you can either selectively set SO_BUSY_POLL on those sockets or set net.busy_read globally. This will increase power usage. It is disabled by default (value set to 0).

  • busy_read represents the low latency busy poll timeout for socket reads, that is, the approximate time in microseconds to busy loop waiting for packets on the device queue. This sets the default value of the SO_BUSY_POLL socket option. The value can be set or overridden per socket with the socket option SO_BUSY_POLL, which is the preferred method of enabling the feature. If you need to enable the feature globally via sysctl, a value of 50 is recommended. This will increase power usage. It is disabled by default (value set to 0).

  • rmem_max represents the maximum receive socket buffer size in bytes.

  • wmem_max represents the maximum transmit socket buffer size in bytes.

  • rmem_default represents the default setting of the socket receive buffer in bytes.

  • wmem_default represents the default setting of the socket transmit buffer in bytes.

Find below a possible configuration comparison for the above parameters across different performance profiles.

Table 3: Networking Stack Tuning - Comparison

Parameter                    | Balanced           | Higher Throughput                       | Lower Latency
-----------------------------+--------------------+-----------------------------------------+----------------------------------------
net.core.netdev_max_backlog  | 1000               | 250,000                                 | 1000
net.core.netdev_budget       | 300                | 600                                     | 300
net.core.somaxconn           | 128                | 4,096                                   | 128
net.core.busy_poll           | 0                  | 0                                       | 50
net.core.busy_read           | 0                  | 0                                       | 50
net.core.rmem_max            | 212992             | TOTAL_RAM_MEMORY                        | TOTAL_RAM_MEMORY
net.core.wmem_max            | 212992             | TOTAL_RAM_MEMORY                        | TOTAL_RAM_MEMORY
net.core.rmem_default        | 212992             | 67108864                                | 67108864
net.core.wmem_default        | 212992             | 67108864                                | 67108864
net.ipv4.tcp_fastopen        | 1                  | 1                                       | 1
net.ipv4.tcp_lowlatency      | 0                  | 0                                       | 1
net.ipv4.tcp_sack            | 1                  | 1                                       | 1
net.ipv4.tcp_rmem            | 4096 87380 6291456 | 10240 87380 67108864                    | 10240 87380 67108864
net.ipv4.tcp_wmem            | 4096 87380 6291456 | 10240 87380 67108864                    | 10240 87380 67108864
net.ipv4.ip_local_port_range | 32768 60999        | 1024 64999                              | 32768 60999
net.ipv4.tcp_max_syn_backlog | 256                | 8192                                    | 1024
net.ipv4.tcp_tw_reuse        | 0                  | 0 (1 is better but depends on use case) | 0 (1 is better but depends on use case)
net.ipv4.tcp_tw_recycle      | 0                  | 0 (1 is better but depends on use case) | 0 (1 is better but depends on use case)
net.ipv4.tcp_timestamps      | 1                  | 0                                       | 0
net.ipv4.tcp_syn_retries     | 6                  | 8                                       | 5

4 IRQ Configuration

A correct IRQ configuration – above all in multi-core architecture and multi-thread applications – can have a profound impact on throughput and latency performance.

To verify the IRQ affinitization, read the output of /proc/interrupts. You can identify the hardware you are interested in, all its interrupts and which CPU is handling them.

Different hardware vendors provide their own supported scripts to configure IRQ affinitization efficiently, taking into account also NUMA architectures.

Whether you use a vendor script or configure the IRQ-to-core affinity manually, the first step on Linux is to stop and disable the irqbalance service by running these commands:

# systemctl disable irqbalance

# systemctl stop irqbalance

Using the scripts provided by the NIC vendor is recommended. However, if you cannot use them or want to proceed manually, then perform the following steps:

  1. Find the processors attached to your port:

    # numactl --cpubind netdev:eth1 -s

    In this example, it is:

    physcpubind: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
    cpubind: 2
    nodebind: 2

    These values tell you that the port is managed by the node #2 in your NUMA architecture and the physical cores involved are numbers 48 to 71.

  2. Find the bitmask for each processor:

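    The bitmask for a processor is 2^CORE_ID, converted to hexadecimal. For example, core 48 has the bitmask 2^48 = 0x1000000000000. The smp_affinity file takes masks wider than 32 bits as comma-separated groups of 32 bits (eight hexadecimal digits), most significant group first, so this value is written as 10000,00000000.

  3. Find the IRQs assigned to the port:

    # grep eth1 /proc/interrupts

    In this case – for the 64 queues available – the interrupt range is 52 to 115.

  4. Echo the SMP affinity (calculated at step 2) value into the corresponding IRQ entry by:

    # echo 10000,00000000 > /proc/irq/52/smp_affinity
    # echo 20000,00000000 > /proc/irq/53/smp_affinity
    […]
    # echo 40000000,00000000 > /proc/irq/114/smp_affinity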
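Steps 2 and 4 can be sketched as a small script. The example below computes the mask for a core number and prints (without executing) the echo commands for a range of IRQs. It assumes the IRQs are distributed round-robin over the cores of the NUMA node, which is one possible policy and not the only one. The function names are illustrative. For cores numbered 32 and above, the mask is written as comma-separated 32-bit groups, which is the format the kernel uses for smp_affinity.

```shell
#!/bin/sh
# Convert a core number into an smp_affinity mask: hexadecimal,
# comma-separated 32-bit groups, most significant group first.
cpu_to_smp_affinity() (
    cpu=$1
    word=$(( cpu / 32 ))
    mask=$(printf '%x' $(( 1 << (cpu % 32) )))
    # Append one all-zero group for every lower 32-bit word.
    while [ "$word" -gt 0 ]; do
        mask="${mask},00000000"
        word=$(( word - 1 ))
    done
    printf '%s\n' "$mask"
)

# Dry run: print one command per IRQ, assigning cores round-robin.
# $1: first IRQ, $2: last IRQ, $3: first core, $4: number of cores
print_irq_affinity_cmds() (
    first_irq=$1; last_irq=$2; first_cpu=$3; ncpus=$4
    irq=$first_irq
    while [ "$irq" -le "$last_irq" ]; do
        cpu=$(( first_cpu + (irq - first_irq) % ncpus ))
        printf 'echo %s > /proc/irq/%s/smp_affinity\n' \
            "$(cpu_to_smp_affinity "$cpu")" "$irq"
        irq=$(( irq + 1 ))
    done
)

# First three IRQs of the example: cores 48 to 71 (24 cores), IRQs from 52.
print_irq_affinity_cmds 52 54 48 24
```

Review the printed commands before running them as root; the resulting assignment can then be verified in /proc/interrupts.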

5 NIC settings setup

Different Network Interface Cards (NIC) provide different features which can enhance the throughput and reduce the latency of the networking traffic handled by a compute host.

Use the ethtool utility to enable and configure these offload capabilities.

Together with more advanced features, there are other capabilities which are common to all NICs and which – for example – allow a bigger or smaller buffer to store packets received by or transmitted from the NIC itself.

The next paragraphs will go through the common parameters first, and then cover the more advanced features.

5.1 Ring buffers

Each NIC is equipped with memory to store network packets received or to be transmitted.

A bigger buffer allows the NIC to store more packets, before issuing an interrupt, thus reducing the number of packets dropped at a specific rate.

It is possible to tune the number of packets the NIC queues (either received from the network or waiting to be transmitted to the network) before triggering an interrupt. You can also control how long the NIC should wait after the configured amount is reached before triggering the interrupt.

For the example at hand, the following values are set:

Ethernet Link @ 10Gb/s
Minimum frame size: 84 bytes (worst case scenario)
Packet rate:
       10,000,000,000 b/s / (84 B * 8 b/B) = ~14,880,952 packets/s (maximum rate)
       ~14,880 packets/ms (millisecond)
       ~14.9 packets/us (microsecond)
Interrupt interval: 100us (microseconds)
Receive buffer size required: ~1,500 entries (14.9 packets/us * 100us)
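The calculation above can be repeated for other link speeds, frame sizes and interrupt intervals with a few lines of shell arithmetic. This is a minimal sketch; the function name ring_entries_needed is illustrative only. It works with the unrounded packet rate and rounds the result up, so for the example it yields a slightly higher figure than an estimate based on 14 packets per microsecond.

```shell
#!/bin/sh
# Estimate how many ring entries are needed to hold the packets that can
# arrive between two interrupts.
# $1: link speed in bit/s, $2: frame size in bytes, $3: interval in microseconds
ring_entries_needed() (
    link_bps=$1
    frame_bytes=$2
    interval_us=$3
    # Maximum packet rate in packets per second (integer division).
    pps=$(( link_bps / (frame_bytes * 8) ))
    # Packets arriving within the interval, rounded up.
    echo $(( (pps * interval_us + 999999) / 1000000 ))
)

# 10 Gb/s link, 84-byte frames, one interrupt every 100 microseconds
ring_entries_needed 10000000000 84 100
```

The result is a lower bound for the worst case; it still needs to fit within the maximum ring size supported by the NIC.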

The ethtool option to query the value set for the RX (receive) and TX (transmit) ring buffer is -g. This option will show the current configured values and the maximum ones allowed by the NIC.

Example:

# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096
Current hardware settings:
RX:		64
RX Mini:	0
RX Jumbo:	0
TX:		64

In the example above you can see that the NIC supports up to 4096 entries for both the RX and TX rings, but both settings are currently set to 64.

To modify the values used by the system, use ethtool with the -G option.

Example:

# ethtool -G eth1 rx 4096
# ethtool -G eth1 tx 4096
# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096
Current hardware settings:
RX:		4096
RX Mini:	0
RX Jumbo:	0
TX:		4096

Increasing the ring buffer sizes allows the NIC to receive or send more packets at a given rate, thus increasing networking throughput. While increasing the ring buffer size has a positive effect on throughput, it has a negative effect on packet latency. This is because a packet will stay longer in the NIC memory before being processed by the networking stack.

To not sacrifice latency too much while still increasing your network throughput you can use the statistics provided by ethtool (option -S) to balance throughput and latency.

To accomplish this task, start with the default ring size for both receive and transmit rings while handling your target packet rate (for example 10Gb/s). Then look at the rx_dropped/tx_dropped counters provided by the ethtool -S command and increase the ring buffers (by a power of 2) until the rx_dropped/tx_dropped counters stop increasing or reach a value which is considered acceptable for your use case. Note that not all scenarios impose a 0-packet-drop requirement.
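The procedure above can be sketched as a loop that moves from one ring size to the next. The example interprets "increase by a power of 2" as doubling the current size, which is an assumption, and only prints the ethtool commands instead of running them. The helper name next_ring_size is illustrative; the start value of 64 and the maximum of 4096 are taken from the ethtool -g example output.

```shell
#!/bin/sh
# Return the next ring size to try: double the current size, capped at
# the maximum reported under "Pre-set maximums" by "ethtool -g".
next_ring_size() (
    current=$1
    max=$2
    next=$(( current * 2 ))
    if [ "$next" -gt "$max" ]; then
        next=$max
    fi
    echo "$next"
)

# Dry run: list the settings to try one after the other. After applying
# each step, check the drop counters with "ethtool -S eth1" and stop as
# soon as they are acceptable for your use case.
size=64
while [ "$size" -lt 4096 ]; do
    size=$(next_ring_size "$size" 4096)
    echo "ethtool -G eth1 rx $size tx $size"
done
```

Stopping at the smallest size that meets the drop requirement keeps the added latency as low as possible.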

5.2 Interrupt Coalescing

As mentioned before, NICs also allow configuring:

  • how many packets to be queued in the receive (rx-frames) or transmit (tx-frames) ring before triggering an interrupt

  • how long to wait after the value of rx-frames / tx-frames has been reached before triggering an interrupt (rx-usecs/tx-usecs)

To fine-tune these parameters you can still use the statistics provided by the ethtool -S command.

However, when higher throughput is required and NAPI is being used by the NIC driver, a value of 64 for the rx-frames parameter can help to boost throughput, because at each poll the driver consumes a maximum of 64 packets anyway.

To configure the above settings use the following commands:

# ethtool -C eth1 rx-frames 64

# ethtool -C eth1 tx-frames 64

# ethtool -C eth1 tx-usecs 8

# ethtool -C eth1 rx-usecs 8

To verify that the new values have been set use the following command:

# ethtool -c eth1

To use custom values for rx-frames/tx-frames and rx-usecs/tx-usecs, Dynamic Interrupt Adaptation (DIA) needs to be turned off. DIA is the feature allowing the NIC to auto-tune these settings based on network load. Not all NICs implement such a feature; some require specific kernel and driver versions to support it.

To turn DIA on for both RX and TX, use the two following commands (replace on with off to turn it off):

# ethtool -C eth1 adaptive-rx on

# ethtool -C eth1 adaptive-tx on

5.3 Offload Capabilities

Various NIC vendors offer different offload capabilities. To check which features your NIC supports, use the command ethtool -k DEVICE. The features which are marked with [fixed] cannot be changed, either because your NIC (or driver) does not implement the feature (for example off [fixed]) or because the feature is required for the NIC to work correctly (for example on [fixed]).

Example output:

# ethtool -k eth1
Features for eth1:
rx-checksumming: on
tx-checksumming: on
       tx-checksum-ipv4: on
       tx-checksum-ip-generic: off [fixed]
       tx-checksum-ipv6: on
       tx-checksum-fcoe-crc: on [fixed]
       tx-checksum-sctp: on
scatter-gather: on
       tx-scatter-gather: on
       tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
       tx-tcp-segmentation: on
       tx-tcp-ecn-segmentation: off [fixed]
       tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: on [fixed]
tx-gre-segmentation: off [fixed]
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off
busy-poll: on [fixed]
hw-tc-offload: off
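
Since the full list is long, it can be convenient to show only the features that can actually be changed. The following sketch filters out every line marked [fixed]. The helper name changeable_features is hypothetical, and the sketch assumes the "name: state" layout shown in the example output above.

```shell
# Hypothetical helper: reads `ethtool -k DEVICE` output on stdin and prints
# only the features that are not marked [fixed], as "name state" pairs.
changeable_features() {
    awk -F': *' '
        /^Features for/ { next }      # skip the header line
        /\[fixed\]/     { next }      # skip features that cannot be changed
        NF >= 2 {
            gsub(/^[ \t]+/, "", $1)   # drop the indentation of sub-features
            print $1, $2
        }'
}

# Usage on a live system:
#   ethtool -k eth1 | changeable_features
```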

5.3.1 Checksum offload

The Linux kernel allows configuring the receive and transmit checksum offload on NICs.

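The receive checksum offload is identified by the rx-checksumming parameter, which can be set to either on or off.

The transmit checksum offload is identified by the tx-checksumming parameter, which groups several sub-features. The full list of sub-features is shown below: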

tx-checksumming: on
       tx-checksum-ipv4: on
       tx-checksum-ip-generic: off
       tx-checksum-ipv6: on
       tx-checksum-fcoe-crc: on 
       tx-checksum-sctp: on

To enable or disable any of the changeable sub-features, pass its name to the -K option, as in the following command:

# ethtool -K eth1 tx-checksum-ipv4 off
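
After changing a sub-feature, you may want to confirm that the driver accepted the new value. The following is a small sketch, assuming the "name: state" layout of ethtool -k shown earlier. The helper name feature_state is hypothetical.

```shell
# Hypothetical helper: prints the state ("on" or "off") of one feature,
# reading `ethtool -k DEVICE` output on stdin.
feature_state() {
    awk -v name="$1" -F': *' '
        {
            key = $1
            sub(/^[ \t]+/, "", key)        # sub-features are indented
            if (key == name) {
                split($2, parts, " ")      # drop a trailing "[fixed]" marker
                print parts[1]
            }
        }'
}

# Usage on a live system (as root):
#   ethtool -K eth1 tx-checksum-ipv4 off
#   ethtool -k eth1 | feature_state tx-checksum-ipv4
```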

5.3.2 Segmentation Offload

To send a packet over a specific network, the packet must comply with the MSS (Maximum Segment Size) and MTU (Maximum Transmission Unit) of that network. Applications should be abstracted from the actual network they run on, which increases portability and eases maintenance. For this reason, the kernel takes care of segmenting data into multiple packets before sending it over the network.

To free up CPU cycles from this duty and allow the kernel to use buffers as big as possible, most NICs implement what is called GSO (Generic Segmentation Offload) and TSO (TCP Segmentation Offload), performing the resizing and repackaging themselves.

To enable or disable GSO or TSO, use the following commands:

# ethtool -K eth1 gso on

# ethtool -K eth1 gso off

# ethtool -K eth1 tso on

# ethtool -K eth1 tso off

To disable TCP Segmentation Offload, you also need to disable Generic Segmentation Offload. Otherwise, any TCP traffic will be treated as generic.

On the other hand, you can have TSO enabled while GSO is disabled. In this case, only TCP traffic will be offloaded to the NIC for segmentation. Any other protocol will be segmented by the Linux kernel networking stack.
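
The combinations described above can be summarized as a small decision table. This is only an illustrative sketch of the rules stated in this section, not a description of the kernel internals. The helper name tcp_segmentation_path is hypothetical.

```shell
# Hypothetical helper: given the TSO and GSO states ("on" or "off"), print
# where TCP segmentation takes place according to the rules above.
tcp_segmentation_path() {
    if [ "$1" = "on" ]; then
        # TSO enabled: TCP traffic is offloaded to the NIC.
        echo "NIC (TSO)"
    elif [ "$2" = "on" ]; then
        # TSO disabled but GSO enabled: TCP is treated as generic traffic.
        echo "generic segmentation (GSO)"
    else
        # Both disabled: segmentation is not offloaded.
        echo "kernel networking stack"
    fi
}

# Usage: tcp_segmentation_path TSO_STATE GSO_STATE
#   tcp_segmentation_path off on
```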

5.3.3 Receive Offload

To minimize the per-packet overhead, the Linux kernel implements Large Receive Offload (LRO) and Generic Receive Offload (GRO). Unfortunately, LRO has proven to be broken in some use cases (for example, when the system forwards or bridges the received packets), so it is recommended to disable it.

GRO, however, merges received packets under stricter criteria: the MAC headers must be identical and only a few TCP or IP header fields can differ. The set of fields which can differ is severely restricted: checksums are necessarily different, and the IP ID field is allowed to increment. Even the TCP time stamps must be identical, which is less of a restriction than it may seem: the time stamp is a relatively low-resolution field, so it is not uncommon for many packets to carry the same time stamp. Because of these restrictions, merged packets can be resegmented losslessly. As an added benefit, the GSO code can be used to perform the resegmentation. Another important aspect of GRO is that it is not limited to TCP/IPv4. GRO has been part of the kernel since version 2.6.29 and is supported by a variety of 10G drivers (see also https://lwn.net/Articles/358910/).

To enable or disable GRO, use the following commands:

# ethtool -K eth1 gro on

# ethtool -K eth1 gro off
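
To follow the recommendation above, you can check whether LRO is still enabled next to GRO. The following sketch reads the feature list and prints a hint if LRO is on. The helper name receive_offload_check is hypothetical, and the sketch assumes the feature names generic-receive-offload and large-receive-offload as shown in the example output of ethtool -k.

```shell
# Hypothetical helper: reads `ethtool -k DEVICE` output on stdin, prints the
# GRO and LRO states, and adds a hint when LRO is still enabled.
receive_offload_check() {
    awk -F': *' '
        $1 == "large-receive-offload"   { lro = $2 }
        $1 == "generic-receive-offload" { gro = $2 }
        END {
            print "gro=" gro, "lro=" lro
            if (lro ~ /^on/)
                print "hint: consider running ethtool -K DEVICE lro off"
        }'
}

# Usage on a live system:
#   ethtool -k eth1 | receive_offload_check
```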

5.3.4 VLAN Offload

Most NICs these days support VLAN offload for both the receive and the transmit path. This feature lets the NIC strip the VLAN tag from packets as they are received and insert it into packets as they are transmitted.

Most drivers enable this feature by default. If it needs to be disabled, use the following commands:

# ethtool -K [DEVICE] rxvlan off

# ethtool -K [DEVICE] txvlan off

5.3.5 Tunnels (Stateless) Offload

Each of the tunneling protocols for virtual networks wraps a UDP header around the original packet (for example, a VXLAN packet), hence adding an additional layer. Because this extra layer needs to be added and removed for each packet, the CPU must perform more work to receive and send each packet. As a result, throughput and latency on overlay networks are worse than on flat networks.

Newer NICs implement tunnel segmentation offload, which applies to overlay networks the same concept available for TCP (TCP Segmentation Offload).

This feature offloads the segmentation of large transmit packets to the NIC hardware. For instance, you may have an inner payload of 9000 bytes while you still need to comply with an MTU of 1500 bytes. The operation of segmenting the payload into multiple packets (VXLAN-encapsulated, for instance) is performed by the NIC before the packets are transmitted to the network.

To enable or disable this feature, run one of the following commands:

# ethtool -K [DEVICE] tx-udp_tnl-segmentation off

# ethtool -K [DEVICE] tx-udp_tnl-segmentation on
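
To make the 9000-byte example more concrete, the following sketch estimates how many VXLAN-encapsulated packets the NIC has to produce. The helper name vxlan_segments is hypothetical. The header sizes are assumptions: IPv4 without options for both the outer and the inner packet, and the 9000 bytes are treated as inner TCP payload.

```shell
# Hypothetical helper: number of VXLAN-encapsulated packets needed to carry a
# TCP payload of $1 bytes over a network with an MTU of $2 bytes.
# Assumed header sizes (IPv4, no options):
#   outer IP 20 + UDP 8 + VXLAN 8 + inner Ethernet 14 = 50 bytes of overhead
#   inner IP 20 + inner TCP 20                        = 40 bytes
vxlan_segments() {
    payload=$1
    mtu=$2
    per_packet=$(( mtu - 50 - 40 ))     # TCP payload bytes per packet
    if [ "$per_packet" -le 0 ]; then
        echo "MTU too small for the assumed headers" >&2
        return 1
    fi
    # Integer ceiling of payload / per_packet
    echo $(( (payload + per_packet - 1) / per_packet ))
}

# Usage: vxlan_segments PAYLOAD_BYTES MTU_BYTES
#   vxlan_segments 9000 1500
```

Under these assumptions, each packet carries 1410 bytes of TCP payload, so 9000 bytes are split into 7 packets. With the offload enabled, this split is done by the NIC instead of the CPU.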

Another feature which can be found in some NICs is the inner packet checksum offload. When this feature is enabled, it is possible to offload to the hardware the computation of the checksum for the encapsulated packet.

To enable or disable this feature, run one of the following commands:

# ethtool -K [DEVICE] tx-udp_tnl-csum-segmentation off

# ethtool -K [DEVICE] tx-udp_tnl-csum-segmentation on

5.3.6 Hashing and Packet Steering Offload

An important aspect of modern NICs is having multiple hardware queues where packets can be placed either on the receive side or on the transmit side.

This hardware capability has many advantages in multicore architectures since each queue is also assigned a specific IRQ (see IRQ configuration). As a consequence, each interrupt can be pinned to and handled by a specific core.

Similarly, the NIC allows you to steer flows matching certain criteria to a particular hardware queue and hence, potentially, to a particular CPU core. This is not too far from what RPS (Receive Packet Steering) does in software, but it has the extra advantage of being performed by the hardware: the CPU is freed from hashing the packets, classifying them and steering them to the right software queue.

The two ethtool parameters involved in hashing and steering flows are rxhash and ntuple.

The rxhash parameter is a simple switch which can be enabled or disabled with the following commands:

# ethtool -K [DEVICE] rxhash on

# ethtool -K [DEVICE] rxhash off

The ntuple parameter is more complex: it allows you to specify the flow you are interested in by configuring match conditions on various fields of the packet.

As an example, to steer the TCP flow from 192.168.10.10 to 192.168.10.20 to queue number 3, run the following command:

# ethtool -N [DEVICE] flow-type tcp4 src-ip 192.168.10.10 dst-ip 192.168.10.20 action 3

If the value of the action parameter is -1, the NIC drops the matching packets.

It is possible to match on various protocols. Some parameters only apply to some protocols (for example, proto only applies to flow-type ether whilst l4proto only applies to flow-type ip4). To see a full list of supported parameters and valid values, refer to the ethtool manual page (see http://man.he.net/man8/ethtool) and to the NIC vendor documentation.

To show the filters currently applied to an interface, use the command ethtool --show-ntuple DEVICE.
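
The steering rule above can be wrapped in a small helper that validates the action value before anything is applied. The following is a sketch only: the helper name ntuple_rule_cmds is hypothetical, and it prints the commands instead of running them. It also assumes that the ntuple feature has to be switched on first (the example output earlier shows ntuple-filters: off) and that the NIC supports it.

```shell
# Hypothetical helper: prints (does not run) the commands to steer one
# TCP/IPv4 flow to a receive queue.
# Arguments: DEVICE SRC_IP DST_IP ACTION
# ACTION is a queue number, or -1 to drop matching packets.
ntuple_rule_cmds() {
    case $4 in
        -1) ;;                                  # drop action
        ''|*[!0-9]*)                            # empty or not a number
            echo "invalid action: $4" >&2
            return 1 ;;
    esac
    printf '%s\n' "ethtool -K $1 ntuple on"
    printf '%s\n' "ethtool -N $1 flow-type tcp4 src-ip $2 dst-ip $3 action $4"
}

# Usage:
#   ntuple_rule_cmds eth1 192.168.10.10 192.168.10.20 3
```

Printing the commands first makes it possible to review them before running them as root. The resulting filters can then be checked with ethtool --show-ntuple.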

7 Legal Notice

Copyright © 2006–2017 SUSE LLC and contributors. All rights reserved.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or (at your option) version 1.3; with the Invariant Section being this copyright notice and license. A copy of the license version 1.2 is included in the section entitled GNU Free Documentation License.

SUSE, the SUSE logo and YaST are registered trademarks of SUSE LLC in the United States and other countries. For SUSE trademarks, see http://www.suse.com/company/legal/. Linux is a registered trademark of Linus Torvalds. All other names or trademarks mentioned in this document may be trademarks or registered trademarks of their respective owners.

This article is part of a series of documents called "SUSE Best Practices". The individual documents in the series were contributed voluntarily by SUSE's employees and by third parties.

The articles are intended only to be one example of how a particular action could be taken. They should not be understood to be the only action and certainly not to be the action recommended by SUSE. Also, SUSE cannot verify either that the actions described in the articles do what they claim to do or that they don't have unintended consequences.

Therefore, we need to specifically state that neither SUSE LLC, its affiliates, the authors, nor the translators may be held liable for possible errors or the consequences thereof. Below we draw your attention to the license under which the articles are published.
