
If you haven’t been hanging around the Ceph world for a while, you may not realize that Ceph was originally intended to provide a distributed file system to service HPC clusters.  While this was the original intent, Ceph has taken a roundabout path to relevance in this space, especially given that multiple active MDS servers have only been supported since the Luminous release.  The result is that we are only now really starting to see adoption in the HPC space, and mostly for second-tier storage needs.

Enter the science project.  Given an all-flash environment on SATA SSDs with a fast storage pool on Intel Optane for the metadata, would it be possible to provide a reasonable storage environment for HPC clusters?

The all-flash direction mirrors what we are seeing in the industry where adoption is ramping up with the cost of those Enterprise SATA SSDs dropping under $500 per TiB and capacities climbing quite dramatically.  For example, you can look on the open market and find a number of 1TB Enterprise SSDs well under $500 and the largest devices, like the Seagate Nytro 15.36TB SSD, at or under the $500/TiB price point.

Armed with this knowledge and some collaboration with our friends at Intel, I launched into an effort of iterative testing and tuning over the last six months.  The tuning includes testing kernel boot parameters, sysctl.conf tuning for network and various system level tunings, and tuning the Ceph cluster itself.

Most of the work is focused on Media & Entertainment use cases, but, being a long-term advocate and supporter of the HPC business at SUSE, I knew the same would apply to compute clusters.  After all, that’s really what a render farm is, right?

The IO500 Project

This is where the IO500 (https://www.vi4io.org/io500/start) comes in.  The IO500 is an industry group that is trying to highlight storage design and tuning that benefit the performance of HPC workloads through a competitive list.  The lists are released during a Birds of a Feather session at ISC (https://isc-hpc.com), held annually in Frankfurt, Germany, and at SC (https://sc19.supercomputing.org), held annually in the US the week before Thanksgiving.  Submissions are ranked based on a score composed of both metadata and data performance metrics (basically latency and throughput).  Submissions for the list include a number of parallel and distributed file systems representing systems at universities, labs, etc.
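To make the scoring concrete, here is a minimal sketch of how a composite score of this kind can be computed.  It assumes the IO500 convention of taking the geometric mean of the bandwidth results (GiB/s), the geometric mean of the metadata results (kIOP/s), and then the geometric mean of those two values.  The phase numbers are made up for illustration and do not come from any submission, and geomean is a hypothetical helper name.

```shell
#!/bin/bash
# Geometric mean of the numbers passed as arguments.
# awk is used because the shell itself has no floating point arithmetic.
geomean() {
    printf '%s\n' "$@" | awk '{ s += log($1); n++ } END { printf "%.2f\n", exp(s / n) }'
}

# Hypothetical phase results, NOT taken from a real IO500 run.
bw_score=$(geomean 4 1 9 4)       # bandwidth phases, GiB/s
md_score=$(geomean 20 5 80 20)    # metadata phases, kIOP/s

# The overall score is the geometric mean of the two sub-scores.
total=$(geomean "$bw_score" "$md_score")
echo "bandwidth=$bw_score metadata=$md_score total=$total"
```

Because the score is a geometric mean, one weak phase drags the total down more than it would with a simple average, which is why low metadata numbers hurt so much.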

There are two lists.  The ranked list is composed of systems with many different node counts (both client and server) and is really there to highlight the biggest and best-tuned systems.  The second list is the 10-Node challenge.  This list highlights the best storage performance for a 10 client node environment.  This test is more oriented to clusters that can provide the best throughput and metadata performance for the smaller number of nodes.

We made submissions for both lists.

The Cluster Environment

I’m sure everyone wants to know what the hardware environment looks like, right?  You’re probably expecting a crazy expensive bit of hardware that is hard to procure and configure, right?  If that’s your assumption, you’d be wrong.  This hardware is very standard, readily available from Lenovo, and is what is used for this reference design, with the addition of one Intel Optane DC P4800X device in each storage (OSD) node.

Role: Admin, Monitor, Gateway, MDS Nodes
Qty: 6
Component: Lenovo ThinkSystem SR630
Configuration:
  • 1x Intel Xeon Silver 4116
  • 32GB RAM
  • 2x 480GB M.2 in RAID-1 (OS)
  • 1x Mellanox ConnectX-4 Dual Port 100GbE

Role: OSD Nodes
Qty: 10
Component: Lenovo ThinkSystem SR650
Configuration:
  • 1x Intel Xeon Gold 6142
  • 96GB RAM
  • 2x 120GB M.2 in RAID-1 (OS)
  • 1x Mellanox ConnectX-4 Dual Port 100GbE
  • 12x 960GB SATA SSD
  • 1x Intel Optane DC P4800X

Role: Network Switch
Qty: 2
Component: Lenovo ThinkSystem NE10032 Switch
Notes: Updated with latest OS image

Role: Client Nodes
Qty: 10
Component: HPE BL460G6
Configuration:
  • 2x Intel X5570
  • 48GB RAM
  • 2x 10GbE via HPE Virtual Connect w/160Gb/s Uplink

Software Environment

SUSE Linux Enterprise Server 12 SP3 (https://www.suse.com/products/server/)

SUSE Enterprise Storage (https://www.suse.com/products/suse-enterprise-storage/)

We Shall Overcome!

SUSE TigerShark Cluster


Armed with the testing software available from the IO500 GitHub site, I made a quick test run using just two nodes.  The results were impressive, meaning not dead last on the list.  This is where my rather competitive internal drive kicked in.  I thought, “What if I can score higher than the only other CephFS results on the list, posted by CERN?  That would be cool!”  This became my goal.

At this point, I went back and carefully read through the information on the IO500 site and decided we could probably be competitive on both the full list AND the ten node challenge list.  To start working towards this goal, I expanded my client footprint to the ten older blade servers connected with 10Gb connections and began pursuit of the goal.  One of the first things I encountered was that my metadata performance was low.  That was no surprise, as I only had one metadata server deployed, so I scaled to three.  This immediately put me within striking distance of the CERN results.

The next step was to start tuning CephFS metadata server parameters.  The first tuning I did was the only thing I really needed.  Increasing the cache to 16GB of RAM on each of the three metadata servers enabled the cluster to forge ahead and provide results that came very close to the CERN numbers on the existing list without doing ridiculous things like disabling cluster authentication or using a pool with no replication.  From my perspective, that would be cheating.  I mean this seriously: what good is storage if there is no data protection in place?

After running the ten node results, I expanded to my full set of client nodes, only 14 total.  With that, I was able to post a result that would be the new number 40 on the prior overall list and slightly ahead of CERN’s numbers.  Not too shabby, especially given the rather small cluster and node count.

For everyone’s reference, I’ll put the tuning I did at the end.

Results

So where did we end up?

On the ten node challenge list just released at the BoF here at ISC, the SUSE TigerShark cluster is number 27.

On the ranked list, we have come in at number 34, two above the previous results from CERN.

On the full list, which includes all results posted to either list, SUSE is well within the top 100 at positions 69 and 73.

The relative rankings on these lists indicate a pretty fair number of new or updated entries, especially as the number I posted would have put me in the top 10 of the prior 10 node challenge list, and around number 39 in the ranked list.  Overall, I’m pretty happy to have made a respectable showing given the small cluster and limited horsepower of the load generation systems.

Maybe I’ll see if I can get more clients with better connectivity for the next list as there were several instances where the load generation nodes were the bottleneck.  Look for updates on future lists.

Tuning for HPC

For the CephFS configuration, I created the metadata pool on the Intel Optane DC P4800X devices.  This ensures a very low latency pool for metadata operations.  The data pool is a replicated pool on the SATA SSDs.
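One way to get this layout is with CRUSH device classes, sketched below as a dry run.  This is not necessarily how the TigerShark cluster was built: it assumes the Optane OSDs carry the device class nvme and the SATA SSDs the class ssd, and the rule names, pool names and PG counts are hypothetical placeholders.

```shell
#!/bin/bash
# Print each command; only execute it when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

# $1: device class for the metadata pool, $2: device class for the data pool.
cephfs_pool_plan() {
    local meta_class="$1" data_class="$2"
    # One replicated CRUSH rule per device class, host as the failure domain.
    run ceph osd crush rule create-replicated meta_rule default host "$meta_class"
    run ceph osd crush rule create-replicated data_rule default host "$data_class"
    # Pools bound to those rules (names and PG counts are assumptions).
    run ceph osd pool create cephfs_metadata 64 64 replicated meta_rule
    run ceph osd pool create cephfs_data 1024 1024 replicated data_rule
    # Create the file system on top of the two pools.
    run ceph fs new cephfs cephfs_metadata cephfs_data
}

cephfs_pool_plan nvme ssd
```

Run as is, the script only prints the plan.  Setting APPLY=1 would execute the commands, so review the device classes and PG counts against your own cluster first.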

Now for the tuning.  First off, I disabled all the side-channel attack mitigations, maxed out CPU performance, and enabled multi-queue block I/O, all on the kernel command line.

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll scsi_mod.use_blk_mq=1 nospec spectre_v2=off pti=off spec_store_bypass_disable=off l1tf=off
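A typo in a boot parameter fails silently, so it is worth confirming that every parameter actually reached the kernel.  The following is a minimal sketch of such a check; missing_boot_params is a hypothetical helper, and the sample command line passed at the end is invented for illustration.

```shell
#!/bin/bash
# List the expected boot parameters that are absent from a kernel command line.
# $1: command line to inspect (defaults to the running kernel's /proc/cmdline).
missing_boot_params() {
    local cmdline="${1:-$(cat /proc/cmdline)}"
    local expected="intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll scsi_mod.use_blk_mq=1 nospec spectre_v2=off pti=off spec_store_bypass_disable=off l1tf=off"
    local param
    for param in $expected; do
        # Pad with spaces so only whole parameters match.
        case " $cmdline " in
            *" $param "*) ;;
            *) echo "$param" ;;
        esac
    done
}

# Invented example: only two of the nine parameters are present.
missing_boot_params "quiet idle=poll pti=off"
```

Called without an argument on a node, the function reads /proc/cmdline; no output means all parameters are in place.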

Next, I tuned the network and system a bit with sysctl changes and some adapter tuning.  Some of this is very specific to the Mellanox adapters I have in the cluster, so YMMV.  These tunings are all in a bash script and are applied to all nodes, including the clients, using Salt.

#!/bin/bash
# Network and system tuning, pushed to every node via Salt.
# PCI addresses and interface names are specific to this cluster.

# Raise the PCIe Max Read Request size to 4096 bytes on both ConnectX-4 ports
salt '*' cmd.run 'setpci -s 5b:00.0 68.w=5936'
salt '*' cmd.run 'setpci -s 5b:00.1 68.w=5936'
# Jumbo frames on the bonded interface
salt '*' cmd.run 'ip link set bond0 mtu 9000'
# Steer receive packet processing to the CPUs local to each bond member
salt '*' cmd.run 'for j in `cat /sys/class/net/bond0/bonding/slaves`;do LOCAL_CPUS=`cat /sys/class/net/$j/device/local_cpus`;echo $LOCAL_CPUS > /sys/class/net/$j/queues/rx-0/rps_cpus;done'
# Larger NIC ring buffers
salt '*' cmd.run 'ethtool -G eth4 rx 8192 tx 8192'
salt '*' cmd.run 'ethtool -G eth5 rx 8192 tx 8192'
# TCP and socket buffer tuning
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_sack=0'
salt '*' cmd.run 'sysctl -w net.core.rmem_max=2147483647'
salt '*' cmd.run 'sysctl -w net.core.wmem_max=2147483647'
salt '*' cmd.run 'sysctl -w net.core.somaxconn=2048'
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_low_latency=1'
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_rmem="10240 87380 2147483647"'
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_wmem="10240 87380 2147483647"'
salt '*' cmd.run 'sysctl -w net.ipv4.tcp_timestamps=0'
salt '*' cmd.run 'sysctl -w net.core.netdev_max_backlog=250000'
# Mellanox-provided tuning profile
salt '*' cmd.run 'mlnx_tune -p HIGH_THROUGHPUT'
# Disable transparent huge pages and keep more memory free for the kernel
salt '*' cmd.run 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
salt '*' cmd.run 'sysctl -w vm.min_free_kbytes=1000000'
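Settings applied with sysctl -w last only until the next reboot.  If you want them to persist, one option is a sysctl.d drop-in file; the sketch below just prints the same values in that format.  The function name and the target file name mentioned afterwards are assumptions, not something used on the cluster described here.

```shell
#!/bin/bash
# Emit the sysctl values from the tuning script in sysctl.d file format.
hpc_sysctl_settings() {
    cat <<'EOF'
net.ipv4.tcp_sack = 0
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.core.somaxconn = 2048
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_rmem = 10240 87380 2147483647
net.ipv4.tcp_wmem = 10240 87380 2147483647
net.ipv4.tcp_timestamps = 0
net.core.netdev_max_backlog = 250000
vm.min_free_kbytes = 1000000
EOF
}

# Print to stdout so the result can be reviewed before it is installed.
hpc_sysctl_settings
```

Redirecting the output to a file such as /etc/sysctl.d/90-hpc-tuning.conf (a hypothetical name) and running sysctl --system should load it on systems that use sysctl.d.  The setpci, ethtool and transparent hugepage settings are not sysctls and would still need to be reapplied at boot by other means.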

And finally some ceph.conf tunings.  Basically, turn off all process logging, both on disk and in memory and tune up the MDS cache memory.

[global]
debug ms=0
debug mds=0
debug osd=0
debug optracker=0
debug auth=0
debug asok=0
debug bluestore=0
debug bluefs=0
debug bdev=0
debug kstore=0
debug rocksdb=0
debug eventtrace=0
debug default=0
debug rados=0
debug client=0
debug perfcounter=0
debug finisher=0

[mds]
mds_cache_memory_limit=17179869184
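The mds_cache_memory_limit value is given in bytes, so 16GB here means 16 GiB.  A quick sketch of the conversion follows, together with the command that raises the number of active MDS daemons.  It assumes all three metadata servers run as active daemons; the file system name cephfs is a placeholder, and the command is printed rather than executed.

```shell
#!/bin/bash
# Convert a size in GiB to bytes for use in ceph.conf.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

cache_bytes=$(gib_to_bytes 16)
echo "mds_cache_memory_limit=$cache_bytes"   # prints mds_cache_memory_limit=17179869184

# Allow three active MDS daemons on a file system named "cephfs" (assumed name).
echo "ceph fs set cephfs max_mds 3"
```

The cache limit is a target rather than a hard cap on the daemon's memory use, so leave headroom in RAM on the MDS nodes.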

Category: Ceph, High Performance Computing, Software-defined Storage, SUSE Enterprise Storage, SUSE Storage
This entry was posted Wednesday, 19 June, 2019 at 4:45 am