SUSE Enterprise Storage: Sizing
This is the second article in the series. It covers the basic architecture of SUSE Enterprise Storage, which is powered by Ceph, and provides some of the sizing recommendations and rationale behind them.
Having been involved in the storage business for many years, I have seen a number of different types and implementations of storage devices: from single systems and SCSI-based arrays, to proprietary systems with a hub-and-spoke design, to high-end mesh systems based on custom ASICs and optimal pathing logic. The importance of these various designs is that at some level, the folks engineering the solution made a decision on what the bottleneck would be. This allowed them to specify the right amount of resources to ensure a level of consistency and function under a wide range of use cases. What I will attempt to do in this article is guide the reader through my recommendations for bottleneck placement and provide some rules of thumb for sizing, along with the rationale behind them.
Object Storage Daemon (OSD)
The Object Storage Daemon is a process that is usually attached to a single disk device and is responsible for reading and writing objects to disk, keeping replicas or erasure-coded chunks up to date, and managing failure remediation. Due to the way Ceph has been designed, this daemon is intelligent and understands how to calculate where data should be and what to do in the event of a failure.
The Object Storage Daemon is also made aware of the architecture of the cluster via the CRUSH map. This map defines the failure domains and helps ensure that the cluster places data intelligently, so that node, rack or even data center failures can be survived.
An OSD node will typically have 12-16 disk devices. The number of drives can be as low as two (one for OS and one for storage), which is great for lab environments, or as many drives as you can attach to a single system. While these limits seem very broad, I generally recommend not exceeding 24-36 spindles in a single chassis. Exceeding that increases risk due to the number of spindles in the failure domain, and thus the likelihood of a data loss event. You also run the risk of resource starvation. SUSE recommends a minimum of 4 OSD nodes in a cluster.
If possible, we recommend using a controller with battery backed cache for the spindles. By placing the controller in write-back mode and presenting each individual drive via that controller, you are able to make use of that cache to enhance performance. We also recommend that you utilize SSD drives as journals for the spindles and generally not to exceed a ratio of 6 spindles per physical SSD, though this may change with advances in solid state storage technology.
Ceph Monitor (MON)
The Ceph Monitor maintains the master copy of the various cluster maps. The monitor daemon ensures this mission-critical data is consistent at all times. Multiple monitors are recommended to ensure the ability to survive failure. SUSE recommends 3 monitors, which allows for 1 failure, or 5 monitors, which allows for 2 failures.
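The quorum arithmetic behind those numbers is simple to sketch: a set of n monitors keeps quorum as long as a strict majority survives. The function names below are illustrative, not anything from Ceph itself.

```python
# Quorum rule of thumb: n monitors tolerate (n - 1) // 2 failures,
# and tolerating f failures requires 2f + 1 monitors.
def failures_tolerated(monitors: int) -> int:
    return (monitors - 1) // 2

def monitors_needed(failures: int) -> int:
    return 2 * failures + 1

print(failures_tolerated(3))  # 1
print(failures_tolerated(5))  # 2
```

This is also why monitor counts are kept odd: a fourth monitor adds no additional failure tolerance over three.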
When planning for a Ceph cluster, there are a few configuration rules of thumb to keep in mind. One of the most important concerns is determining where you want the bottleneck. “What!?” you may say. The truth of the matter is that there is ALWAYS a bottleneck in a storage solution. It may be placed at any one of the following points:
- System bus
- Disk devices
- Network (front end or back end)
It would be ideal if the software understood the physical components well enough to limit activity just shy of the hardware’s ability to perform, but in most cases, including Ceph, this isn’t so. In situations like this, the architect has the ability to decide.
My preference in most cases is to make disk I/O the limiting factor on a single node, because it is the most predictable when it comes to performance impact and is the easiest to address when dealing with issues of scale. Placing the bottleneck with the disk devices is achieved by ensuring sufficient resources exist for all other aspects of the system. You can check whether the disk devices are the bottleneck by looking at I/O wait on the nodes. It is important to note that you should not constantly be experiencing I/O wait. If you are experiencing it consistently, you know two things: 1) your system is spindle bound, and 2) you should really add more spindles to the cluster.
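As a quick check, I/O wait can be sampled from /proc/stat. The sketch below assumes a Linux host; the sampling interval is an arbitrary choice, and tools like iostat or sar report the same figure.

```python
# Sketch: sample the aggregate I/O-wait percentage from /proc/stat
# (Linux only). Field 5 of the "cpu" line counts iowait jiffies.
import time

def cpu_times():
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    return sum(values), values[4]  # (total jiffies, iowait jiffies)

def iowait_percent(interval: float = 1.0) -> float:
    total1, wait1 = cpu_times()
    time.sleep(interval)
    total2, wait2 = cpu_times()
    elapsed = total2 - total1
    return 100.0 * (wait2 - wait1) / elapsed if elapsed else 0.0

print(f"iowait: {iowait_percent(0.5):.1f}%")
```

A persistently high reading here, across the OSD nodes, is the spindle-bound signal described above.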
With Ceph, it is also possible to make the front end network the limiting factor. You could do this by exposing only 10GbE network interfaces to the clients while utilizing bonded 10GbE or faster network technologies, such as 100GbE or InfiniBand, on the cluster back end. This would limit individual nodes to 10Gb/s of client traffic while providing sufficient resources for the cluster back end. This may be more desirable when erasure coding is in use, as it limits the amount of I/O that user activity can consume and ensures that the cluster has a near guarantee on minimum rebuild performance.
Your choices may differ, but keep in mind the requirements of your workload and end users when making the selection.
Expressing the Total Storage in a Cluster
There are several ways to express the storage a cluster provides. These are “Raw”, “Right Sized” and “Usable”. Raw is simply the advertised capacity of the drives multiplied by the number of drives, e.g. 4TB x 20 drives = 80TB.
Right sizing involves calculating the actual size of the drive using 1024 bytes per kilobyte instead of the 1,000 that most manufacturers advertise. For example, an advertised 4TB is 4,000,000,000,000 bytes; properly converted using a base of 1024 bytes per KB, you get 4,000,000,000,000 / (1024 x 1024 x 1024 x 1024) = 3.637978807091713, or roughly 3.63TB.
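The conversion above can be sketched in a couple of lines:

```python
# Sketch: convert an advertised (base-1000) drive capacity to its
# right-sized (base-1024) capacity.
def right_size_tb(advertised_tb: float) -> float:
    bytes_total = advertised_tb * 1000**4  # manufacturers count in base 1000
    return bytes_total / 1024**4           # right-sized, base 1024

print(right_size_tb(4))  # 3.637978807091713
```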
Usable space takes into account right sizing, formatted capacity (file system overhead) and data protection scheme. I’ll give two examples below, one for erasure coding and one for replication.
Erasure coding example: 20 drives @ 4TB each using k=8, m=2
Right size the drives to 3.63TB, assume 2% overhead for the file system.
20 x 3.63TB = 72.6TB x 98% = 71.148TB (after FS overhead)
Now apply the erasure coding overhead: 8 data chunks / 10 total chunks (k + m) = 0.8; 0.8 x 71.148TB = 56.92TB of usable space.
Replication is easier: 20 drives @ 4TB each with a replica count of 3.
Right size the drives to 3.63TB, assume 2% overhead for the file system.
20 x 3.63TB = 72.6TB x 98% = 71.148TB (after FS overhead)
Now apply replication: 71.148TB / 3 = 23.716TB, roughly 23.7TB of usable space.
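Both examples follow the same pattern, so they can be sketched as a single helper (the function name and argument layout are mine, not anything from Ceph or SUSE):

```python
# Sketch: usable capacity from right-sized drive capacity, file system
# overhead and the data protection scheme used in the two examples above.
def usable_tb(drives, right_sized_tb, fs_overhead=0.02,
              k=None, m=None, replicas=None):
    after_fs = drives * right_sized_tb * (1 - fs_overhead)
    if k is not None and m is not None:  # erasure coding: k data + m coding chunks
        return after_fs * k / (k + m)
    return after_fs / replicas           # replication

print(round(usable_tb(20, 3.63, k=8, m=2), 2))    # 56.92
print(round(usable_tb(20, 3.63, replicas=3), 1))  # 23.7
```

Note how much usable space the k=8, m=2 scheme yields compared to 3x replication on the same 20 drives; that efficiency is the usual motivation for erasure coding.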
Resource Rules of Thumb
Upstream documentation and various articles written in the past include a number of recommendations in relation to the basic sizing guidelines. These are all just that: basic. I hope to provide a little more thought below about hardware sizing for the various daemons.
First, for replication based systems, I recommend a minimum of 1GHz of CPU per spindle. At this time, because there’s no easily accessible article on the performance metrics for this use case, I don’t recommend enabling hyper-threading. This is something I hope to be able to test in the near future.
For systems utilizing erasure coding, I recommend more CPU per spindle. This helps ensure sufficient processor resources are present when an OSD needs to be rebuilt. The recommendation I follow is 1.5GHz per spindle, although I may float a little lower in a system that is very core and spindle dense. This is an area that I suspect will benefit from hyper-threading, but that has yet to be verified through testing.
For both environments, I recommend the same rule of thumb: 1GB of RAM for each TB of usable storage.
To figure the throughput needs for the network, look to the device specifications for large sequential read throughput. In most cases, a 7200 RPM drive will average about 70 MB/s. If you have a server running 16 of these spindles, you are theoretically going to be pushing towards the top of a 10Gb interface. In general, I don’t recommend using anything slower than 10Gb Ethernet. A best practice would involve bonding multiple connections across redundant switches as described later.
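To see why 16 spindles approach a 10Gb link, convert the aggregate MB/s to Gb/s (the 70 MB/s figure is the rough average quoted above, not a guaranteed spec):

```python
# Sketch: aggregate sequential throughput per node versus a 10Gb link.
SPINDLE_MB_S = 70  # rough sequential average for a 7200 RPM drive

def node_gbps(spindles: int, mb_per_s: float = SPINDLE_MB_S) -> float:
    return spindles * mb_per_s * 8 / 1000  # MB/s -> Gb/s (decimal units)

print(node_gbps(16))  # 8.96 Gb/s, close to a 10Gb interface's ceiling
```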
Monitor daemons are a little easier. They need a reasonable amount of CPU, very little RAM and fast drives. Generally speaking, if you have 12 cores, 8 GB of RAM and a mirrored pair of SSD drives in a monitor node, you are well configured. You should match the network speed of your OSD nodes as well.
RADOS gateway node sizing will depend on the workload placed on the nodes. In general I would configure more RAM than for a monitor node, but otherwise assume they are similar. Also, expect to utilize SUSE High Availability Extensions or a hardware load balancer to make these nodes fault tolerant.
As mentioned above, no less than 10Gb Ethernet for the connections. An ideal configuration would actually span across two dual port 10Gb NICs (network interface controllers) and either two blades on a switch or a pair of stacked switches. In this configuration, the bottom ports of the NICs would be bonded together and the top ports would be bonded together, both using 802.3ad (LACP). On the switch side, the LACP bond would be created across the blades/switches. The result is a network configuration that will survive NIC failure, cable failure and switch failure.
- 7200 RPM SATA drive = ~70 IOPS
- 10K RPM SAS drive = ~150 IOPS
- Jumbo frame Ethernet packets generally provide a significant enhancement to throughput
- For bonding, 802.3ad (LACP) is the most desirable if using stacked switches
- Just because you can, doesn’t mean you should. Build systems that balance failure domains with expense. A three node cluster with 80 drives per node might provide a lower cost of acquisition, but its risks are high.
- The storage truism is “If you build it, they will fill it.” Always architect your system with significant capacity growth in mind.
I would also like to credit those who assisted in reviewing the content of this rather lengthy post:
- Joao Luis
- Tim Serong
- Owen Synge
Thanks for the assistance and for helping to ensure that these topics are explained clearly.
I hope the information above provides some help and insight into sizing your storage infrastructure. As always, if you have questions, reach out to your SUSE contact or feel free to contact me directly.