SUSE and the IO500 @ SC19
Back in June we announced our first set of IO500 benchmarks. Since then, we’ve upgraded SUSE Enterprise Storage to version 6, which is based on Ceph Nautilus. This alone brought some performance improvements over the prior release, especially in metadata and small I/O performance. Add the move of the underlying SUSE Linux Enterprise Server to version 15 SP1, which brings a newer kernel, and the outcome is noticeably improved performance. With just these two changes, we were able to increase our score by a few points.
We also tweaked things to follow what we see our customers doing. One of these tweaks was changing from a 3x replica set to a 2x set, which is not uncommon in all-flash environments. Doing this gained us about half a point. At this point, our throughput was in line with systems scoring quite a bit above ours, but our metadata performance was not.
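For readers who want to try the same change, the replica count is a per-pool setting. The sketch below wraps the standard `ceph osd pool set <pool> size <n>` command. The helper name `set_replica_size`, the `DRY_RUN` switch, and the pool names `cephfs_data` and `cephfs_metadata` are assumptions for illustration and are not taken from the TigerShark configuration.

```shell
#!/bin/bash
# Hypothetical helper: set the replica count on one or more pools.
# With DRY_RUN set to a non-empty value, commands are printed instead of run.
set_replica_size() {
    local size="$1"
    shift
    local pool
    for pool in "$@"; do
        ${DRY_RUN:+echo} ceph osd pool set "$pool" size "$size"
    done
}

# Dry run against assumed pool names; drop DRY_RUN=1 to apply for real.
DRY_RUN=1 set_replica_size 2 cephfs_data cephfs_metadata
```

The pool's `min_size` is a separate setting and is worth reviewing when dropping to two copies.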
Some of the systems in the list seemed to have an MDS on each of their storage nodes, and I can only assume that they take advantage of the local MDS. This is not something that Ceph supports completely, but we thought it worth a shot to place an MDS on each of our load generation nodes. This provides some amount of local hits on the MDS and far more metadata server capacity than the cluster needs, and the result was an additional three and a half points.
The net change in score for this IO500 run was good. June was 8.38 and our November score is 14.9067, which would have put the SUSE TigerShark cluster at number 28 on the June list. On this, the November 2019 list, we are ranked number 17.
The 10-node challenge list had a similar increase with the score going from 7.32 to 12.434, which puts our result at number 18 on the list.
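To put the two score changes in relative terms, here is a small sketch that computes the percentage gain from the June and November figures quoted above. The function name `pct_gain` is just an illustrative choice.

```shell
# Print the percentage gain from an earlier score to a later one, to one decimal.
pct_gain() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

pct_gain 8.38 14.9067   # full list, June vs. November
pct_gain 7.32 12.434    # 10-node challenge, June vs. November
```

That works out to roughly a 78% gain on the full list and about 70% on the 10-node challenge, keeping in mind that the benchmark itself changed between the two lists.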
It is important to note that the committee identified some issues with the metadata tests in the last several months, and thus fixed the benchmark to eliminate client-side caching advantages. Due to this change, they have started a fresh set of lists as of this submission.
As always, we are interested in sharing the configuration details with you, our customers, to help you achieve your performance goals. The hardware is still the same as it was in the June article, so everything is a software change.
The ceph.conf is below and is the same as the one in the Media and Entertainment paper that I authored and published a short time ago. The biggest change over the June results was a slight increase in the memory target per OSD and the cache.
[global]
fsid = 6f46b935-4c30-41fb-8d2e-339df3cf5c5c
mon_initial_members = sr630-2, sr630-3, sr630-4
mon_host = 172.16.227.14, 172.16.227.13, 172.16.227.12
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 172.16.227.0/24
cluster_network = 172.16.227.0/24
ms_bind_msgr2 = false

# enable old ceph health format in the json output. This fixes the
# ceph_exporter. This option will only stay until the prometheus plugin takes
# over
mon_health_preluminous_compat = true
mon health preluminous compat warning = false

rbd default features = 3

debug ms=0
debug mds=0
debug osd=0
debug optracker=0
debug auth=0
debug asok=0
debug bluestore=0
debug bluefs=0
debug bdev=0
debug kstore=0
debug rocksdb=0
debug eventtrace=0
debug default=0
debug rados=0
debug client=0
debug perfcounter=0
debug finisher=0

[osd]
osd_memory_target = 6442450944
osd_memory_cache_min = 4294967296
bluestore_min_alloc_size = 4096
osd_op_num_threads_per_shard = 5

[mds]
mds_cache_memory_limit=17179869184

[client]
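The memory settings in this file are raw byte counts, which are hard to read at a glance. As a quick sanity check, this sketch converts them to GiB with shell integer arithmetic. The helper name `to_gib` is an assumption for illustration.

```shell
# Convert a byte count to whole GiB using shell integer arithmetic.
to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

to_gib 6442450944    # osd_memory_target
to_gib 4294967296    # osd_memory_cache_min
to_gib 17179869184   # mds_cache_memory_limit
```

So each OSD targets 6 GiB of memory with a 4 GiB cache floor, and each MDS is allowed a 16 GiB cache.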
The system tuning is similar, with a few small tweaks. Because these are enterprise-class SSDs, they will commit anything left in their on-drive cache when the supercap discharges, so we’ve enabled the on-drive cache. That being said, the results didn’t change appreciably when enabling the cache, so this decision is up to you. The settings are:
ip link set bond0 mtu 9000
for j in `cat /sys/class/net/bond0/bonding/slaves`;do LOCAL_CPUS=`cat /sys/class/net/$j/device/local_cpus`;echo $LOCAL_CPUS > /sys/class/net/$j/queues/rx-0/rps_cpus;done
ethtool -G eth4 rx 8192 tx 8192
ethtool -G eth5 rx 8192 tx 8192
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.core.rmem_max=2147483647
sysctl -w net.core.wmem_max=2147483647
sysctl -w net.core.somaxconn=2048
sysctl -w net.ipv4.tcp_low_latency=1
sysctl -w net.ipv4.tcp_fastopen=1
sysctl -w net.ipv4.tcp_rmem="10240 87380 2147483647"
sysctl -w net.ipv4.tcp_wmem="10240 87380 2147483647"
sysctl -w net.core.netdev_max_backlog=250000
echo never > /sys/kernel/mm/transparent_hugepage/enabled

[for every storage node]
setpci -s 5b:00.0 68.w=5936
setpci -s 5b:00.1 68.w=5936
for i in `lsblk -oname`;do hdparm -W 1 /dev/$i;done
for i in `lsblk -oname`;do hdparm -A 1 /dev/$i;done
for i in `cat osdnodes.lst`;do ssh root@$i 'for i in `ls /sys/block|grep -v "dm-"`;do echo "none" >/sys/block/$i/queue/scheduler;done';done

[loadgen nodes only]
ethtool -A eth0 rx on;ethtool -A eth1 rx on
ethtool -G eth0 rx 4078 tx 4078;ethtool -G eth1 rx 4078 tx 4078
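Values set with `sysctl -w` do not survive a reboot. One way to keep them, sketched here under the assumption that you collect the `sysctl -w` lines in a file or here-document, is to convert them to the `key = value` form read from `/etc/sysctl.d/`. The helper name `to_sysctl_conf` and the target file name in the comment are hypothetical.

```shell
# Turn "sysctl -w key=value" lines on stdin into "key = value" lines.
# Lines that do not start with "sysctl -w " are dropped.
to_sysctl_conf() {
    sed -nE 's/^sysctl -w ([^=]+)="?([^"]*)"?$/\1 = \2/p'
}

# Example with three of the settings above; to persist the result, redirect
# the output to a file such as /etc/sysctl.d/90-ceph-tuning.conf (assumed name).
to_sysctl_conf <<'EOF'
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.core.somaxconn=2048
sysctl -w net.ipv4.tcp_rmem="10240 87380 2147483647"
EOF
```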
From a kernel tuning perspective, this cluster does have all of the mitigations disabled, power options are set for max performance, and multi-queue block I/O is enabled. The kernel command line parameters used are these:
intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll scsi_mod.use_blk_mq=1 mitigations=off
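After a reboot it is easy to confirm that these parameters actually took effect by comparing them against `/proc/cmdline`. The following is a minimal sketch. The helper name `missing_params` is an assumption, and it only checks for exact `name=value` matches.

```shell
# Print every expected parameter that is absent from the given command line.
# Usage: missing_params "<cmdline string>" param...
missing_params() {
    local cmdline="$1"
    shift
    local param
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;          # present, nothing to report
            *) echo "$param" ;;       # absent, report it
        esac
    done
}

# Check the running kernel; no output means every parameter is present.
missing_params "$(cat /proc/cmdline 2>/dev/null)" \
    intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll \
    scsi_mod.use_blk_mq=1 mitigations=off
```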
There are some definite bottlenecks in the test environment that aren’t related to Ceph. Those include the test client nodes being older Nehalem systems using bonded 10GbE. I hope to upgrade these to more modern CPUs by the next test. I will also be upgrading the CPUs, and hopefully the storage devices, in the storage nodes for the next iteration.
It is also important to note that this cluster is NOT using RDMA technologies for Ceph. While there is some primitive support in place for RDMA, it currently provides no advantage and is not commercially supported. Our engineering and product teams will continue to monitor this particular item, but I don’t expect to change this in the near future.
If I were to ask the Ceph community (and our Ceph developers) for my wish-list, it would include these items to help make Ceph for HPC even better:
- Real RDMA support
- MDS affinity setting (either ceph.conf or in the mount command?)
- Continued work on OSD performance
A few other thoughts about HPC and SUSE Enterprise Storage are probably in order as well.
The first is that our storage is able to service both your high-performance AND your high-capacity needs, with support for erasure coded pools and protocols like NFS, S3, Samba, etc. Key partners in supporting secondary storage uses include iRODS and HPE with DMF, though any number of solutions exist that can leverage the S3 interface for data tiering strategies.
The second is that the uses of SUSE Enterprise Storage are certainly not limited to HPC environments. There are significant areas of use in virtualization, containers, media & entertainment, and more recently analytics. In the analytics space, SUSE Enterprise Storage makes the perfect solution for your data lake. Whether you are using an object-based strategy with S3 or a file-based strategy using CephFS, Samba, or NFS, we have the solution for your environment.
No matter what your use case, our current IO500 score and fantastic TCO make SUSE Enterprise Storage a great choice for your departmental or workgroup sized HPC cluster.