Performance Degradation Observed After SUSE Enterprise Storage 6 Patch Cycle

This document (000019829) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Enterprise Storage 6

Situation

Performance issues arise after patching the SUSE Enterprise Storage 6 environment.

Resolution

Change the 'bluefs_buffered_io' setting back to 'true'.

This can be accomplished via these methods:

-------------------------------------------------------------------------------------------------
Without restarting OSDs (temporary setting until OSD is restarted):
-------------------------------------------------------------------------------------------------
On the OSD node:   # ceph daemon osd.<id> config set bluefs_buffered_io true

From the monitor:    # ceph tell osd.<id> injectargs '--bluefs_buffered_io=true'    # single OSD
                       e.g. # ceph tell osd.56 injectargs '--bluefs_buffered_io=true'

                            # ceph tell 'osd.*' injectargs '--bluefs_buffered_io=true'    # all OSDs in the cluster
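
To verify the running value, the OSD's admin socket can be queried; the output should look similar to the following (osd.56 is just an example):

On the OSD node:   # ceph daemon osd.<id> config get bluefs_buffered_io
                     e.g. # ceph daemon osd.56 config get bluefs_buffered_io
                          {
                              "bluefs_buffered_io": "true"
                          }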


---------------------------------------------------------------------------------------------------------------------------------------
Permanent setting for the entire cluster (without running the more invasive 'stage' commands):
---------------------------------------------------------------------------------------------------------------------------------------
On salt master, create/edit: /srv/salt/ceph/configuration/files/ceph.conf.d/osd.conf

Add the following line to osd.conf:
bluefs_buffered_io=true
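
DeepSea merges the contents of osd.conf into the [osd] section of the generated ceph.conf, so the resulting file should contain a fragment similar to the following (shown as an illustration):

        [osd]
        bluefs_buffered_io = true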

The following command builds the ceph.conf to be pushed out, incorporating osd.conf:

        # salt '<salt_master_node_name>' state.apply ceph.configuration.create

e.g.   # salt 'mon1.suse.com' state.apply ceph.configuration.create

------------------------
mon1:/srv/salt/ceph/configuration/files/ceph.conf.d # salt 'mon1.suse.com' state.apply ceph.configuration.create
mon1.suse.com:
  Name: /var/cache/salt/minion/files/base/ceph/configuration - Function: file.absent - Result: Changed Started: - 18:13:46.846499 Duration: 11.657 ms
  Name: /srv/salt/ceph/configuration/cache/ceph.conf - Function: file.managed - Result: Changed Started: - 18:13:46.858342 Duration: 5640.245 ms
  Name: find /var/cache/salt/master/jobs -user root -exec chown salt:salt {} ';' - Function: cmd.run - Result: Changed Started: - 18:13:52.525985 Duration: 43.848 ms

Summary for mon1.suse.com
------------
Succeeded: 3 (changed=3)
Failed:    0
------------
Total states run:     3
Total run time:   5.696 s
------------------------


The next command pushes out the ceph.conf to all nodes in the cluster:

        # salt '*' state.apply ceph.configuration

e.g.
mon1:/srv/salt/ceph/configuration/files/ceph.conf.d # salt '*' state.apply ceph.configuration
osdnode3.suse.com:
  Name: /etc/ceph/ceph.conf - Function: file.managed - Result: Changed Started: - 18:16:47.268270 Duration: 101.657 ms

Summary for osdnode3.suse.com
------------
Succeeded: 1 (changed=1)
Failed:    0
------------
Total states run:     1
Total run time: 101.657 ms

.......    [text removed to shorten example - there should be one 'entry' for each node in the cluster]

Summary for osdnode2.suse.com
------------
Succeeded: 1 (changed=1)
Failed:    0
------------
Total states run:     1
Total run time: 131.195 ms
mon1:/srv/salt/ceph/configuration/files/ceph.conf.d #


Check the /etc/ceph/ceph.conf on some of the nodes to make sure the change has been made.
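
For example, a simple spot check on each node:

        # grep bluefs_buffered_io /etc/ceph/ceph.conf

This should return the line added to osd.conf above (it will appear under the [osd] section).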
----------------------------------------------------------------

NOTE: If only the 'permanent' change is made, each OSD must be restarted in order to pick up the parameter change.
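
On SUSE Enterprise Storage 6, OSDs run as systemd services, so a rolling restart can be performed one OSD at a time (a sketch; wait for the cluster to return to HEALTH_OK before restarting the next OSD):

        # systemctl restart ceph-osd@<id>.service     # run on the node hosting that OSD
        # ceph -s                                     # confirm cluster health before continuing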

Cause

Prior to the 'Nautilus' release of Ceph, the bluefs_buffered_io setting defaulted to 'false'. With the upstream 'Nautilus' release, the default changed to 'true'.

Since SUSE Enterprise Storage 6 is based on the 'Nautilus' release, the default setting for 'bluefs_buffered_io' was also 'true' (enabled) prior to version 14.2.9.969.

In version 14.2.9.969 and later, the default was changed back to 'false' (disabled).

It was decided that the parameter defaulting to 'true' was essentially a regression in the Nautilus release, and that it should once again default to 'false'. Having 'bluefs_buffered_io' set to 'true' has also been linked to performance issues with files over 2 GB.
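
To check which default the installed binaries use, the option's help text can be consulted from any node with the ceph CLI (available in Nautilus-based releases):

        # ceph config help bluefs_buffered_io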

Status

Top Issue

Additional Information

When 'bluefs_buffered_io' is enabled, BlueFS will in some cases perform buffered reads. This allows the kernel page cache to act as a secondary cache for things like RocksDB compaction. For example, if the RocksDB block cache is not large enough to hold blocks from the compressed SST files, they can be read from the page cache instead of from disk. This option was previously enabled by default, however in some test cases it appears to cause excessive swap utilization by the Linux kernel and a large negative performance impact after several hours of run time.

The current recommendation is that if you have not seen a problem with bluefs_buffered_io enabled, you should be safe to continue to use it, but kernel swap usage should be regularly monitored for any sign of thrashing.
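
For example, swap activity can be watched with standard Linux tools; sustained non-zero values in the vmstat 'si' (swap-in) and 'so' (swap-out) columns are a sign of thrashing:

        # vmstat 5
        # free -m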

See also: https://www.spinics.net/lists/ceph-users/msg64164.html

Evidence suggests that this issue occurs only where most or all of the OSDs are on slower hardware (spinning drives), as opposed to environments based mostly on SSD / NVMe.

NOTE: This setting never defaulted to 'true' in SUSE Enterprise Storage 5.x.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000019829
  • Creation Date: 15-Jan-2021
  • Modified Date: 20-Jan-2021
    • SUSE Enterprise Storage
