SUSE Enterprise Storage 4

Administration and Deployment Guide

Publication Date: 02/28/2017
About This Guide
Available Documentation
Feedback
Documentation Conventions
About the Making of This Manual
I SUSE Enterprise Storage
1 About SUSE Enterprise Storage
1.1 Introduction
1.2 Additional Information
2 System Requirements
2.1 Minimal Recommendations per Storage Node
2.2 Minimal Recommendations per Monitor Node
2.3 Minimal Recommendations for RADOS Gateway Nodes
2.4 Minimal Recommendations for iSCSI Nodes
2.5 Naming Limitations
II Cluster Deployment and Upgrade
3 Introduction
4 Deploying with DeepSea and Salt
4.1 Introduction to DeepSea
4.2 Deploying with DeepSea and Salt
4.3 Configuration and Customization
5 Deploying with ceph-deploy
5.1 Ceph Layout
5.2 Network Recommendations
5.3 Preparing Each Ceph Node
5.4 Cleaning Previous Ceph Environment
5.5 Running ceph-deploy
6 Deploying with Crowbar
6.1 Installing and Setting Up the Crowbar Admin Server
6.2 Deploying the Ceph Nodes
7 Upgrading from Previous Releases
7.1 General Upgrade Procedure
7.2 Upgrade from SUSE Enterprise Storage 2.1 to 4
7.3 Upgrade from SUSE Enterprise Storage 3 to 4
III Operating a Cluster
8 Introduction
9 Operating Ceph Services
9.1 Starting, Stopping, and Restarting Services using Targets
9.2 Starting, Stopping, and Restarting Individual Services
9.3 Identifying Individual Services
9.4 Service Status
10 Determining Cluster State
10.1 Checking Cluster Health
10.2 Watching a Cluster
10.3 Checking a Cluster’s Usage Stats
10.4 Checking a Cluster’s Status
10.5 Checking OSD Status
10.6 Checking Monitor Status
10.7 Checking Placement Group States
10.8 Using the Admin Socket
11 Authentication with cephx
11.1 Authentication Architecture
11.2 Key Management
12 Stored Data Management
12.1 Devices
12.2 Buckets
12.3 Rule Sets
12.4 CRUSH Map Manipulation
12.5 Mixed SSDs and HDDs on the Same Node
13 Managing Storage Pools
13.1 Operating Pools
14 Snapshots
14.1 RBD Snapshots
14.2 Pool Snapshots
15 Erasure Coded Pools
15.1 Creating a Sample Erasure Coded Pool
15.2 Erasure Code Profiles
15.3 Erasure Coded Pool And Cache Tiering
16 Cache Tiering
16.1 Tiered Storage Terminology
16.2 Points to Consider
16.3 When to Use Cache Tiering
16.4 Cache Modes
16.5 Setting Up an Example Tiered Storage
IV Accessing Cluster Data
17 Ceph RADOS Gateway
17.1 Managing RADOS Gateway with ceph-deploy
17.2 Managing RADOS Gateway Manually
17.3 Operating the RADOS Gateway Service
17.4 Managing RADOS Gateway Access
17.5 Multi-site Object Storage Gateways
18 Ceph iSCSI Gateway
18.1 iSCSI Block Storage
18.2 General Information about lrbd
18.3 Deployment Considerations
18.4 Installation and Configuration
18.5 Connecting to lrbd-managed Targets
18.6 Conclusion
19 Clustered File System
19.1 Ceph Metadata Server
19.2 CephFS
19.3 Managing Failover
20 NFS-Ganesha: Export Ceph Data via NFS
20.1 Installation
20.2 Configuration
20.3 Starting NFS-Ganesha Related Services
20.4 Verifying the Exported NFS Share
20.5 Mounting the Exported NFS Share
V Managing Cluster with GUI Tools
21 openATTIC
21.1 Installing openATTIC
21.2 openATTIC Web User Interface
21.3 Dashboard
21.4 Ceph Related Tasks
22 Calamari
22.1 Installing Calamari with ceph-deploy
22.2 Installing Calamari Using Crowbar
VI Integration with Virtualization Tools
23 Using libvirt with Ceph
23.1 Configuring Ceph
23.2 Preparing the VM Manager
23.3 Creating a VM
23.4 Configuring the VM
23.5 Summary
24 Ceph as a Back-end for QEMU KVM Instance
24.1 Installation
24.2 Usage
24.3 Creating Images with QEMU
24.4 Resizing Images with QEMU
24.5 Retrieving Image Info with QEMU
24.6 Running QEMU with RBD
24.7 Enabling Discard/TRIM
24.8 QEMU Cache Options
VII Best Practices
25 Introduction
25.1 Reporting Software Problems
26 Hardware Recommendations
26.1 Can I Reduce Data Replication
26.2 Can I Reduce Redundancy Similar to RAID 6 Arrays?
26.3 What is the Minimum Disk Size for an OSD node?
26.4 How Much RAM Do I Need in a Storage Server?
26.5 OSD and Monitor Sharing One Server
26.6 How Many Disks Can I Have in a Server
26.7 How Many OSDs Can Share a Single SSD Journal
27 Cluster Administration
27.1 Using ceph-deploy on an Already Setup Server
27.2 Adding OSDs with ceph-disk
27.3 Adding OSDs with ceph-deploy
27.4 Adding and Removing Monitors
27.5 Usage of ceph-deploy rgw
27.6 RADOS Gateway Client Usage
27.7 Automated Installation via Salt
27.8 Restarting Ceph services using DeepSea
27.9 Node Management
28 Monitoring
28.1 Usage Graphs on Calamari
28.2 Checking for Full OSDs
28.3 Checking if OSD Daemons are Running on a Node
28.4 Checking if Monitor Daemons are Running on a Node
28.5 What Happens When a Disk Fails?
28.6 What Happens When a Journal Disk Fails?
29 Disk Management
29.1 Adding Disks
29.2 Deleting disks
29.3 How to Use Existing Partitions for OSDs Including OSD Journals
30 Recovery
30.1 'Too Many PGs per OSD' Status Message
30.2 Calamari Has a Stale Cluster
30.3 'nn pg stuck inactive' Status Message
30.4 OSD Weight is 0
30.5 OSD is Down
30.6 Fixing Clock Skew Warnings
31 Accountancy
31.1 Adding S3 Users
31.2 Removing S3 Users
31.3 User Quota Management
31.4 Adding Swift Users
31.5 Removing Swift Users
31.6 Changing S3 and Swift User Access and Secret Keys
32 Tune-ups
32.1 How Does the Number of Placement Groups Affect the Cluster Performance?
32.2 Can I Use SSDs and Hard Disks on the Same Cluster?
32.3 What are the Trade-offs of Using a Journal on SSD?
33 Integration
33.1 Storing KVM Disks in Ceph Cluster
33.2 Storing libvirt Disks in Ceph Cluster
33.3 Storing Xen Disks in Ceph Cluster
33.4 Mounting and Unmounting an RBD Image
34 Cluster Maintenance and Troubleshooting
34.1 Creating and Deleting Pools from Calamari
34.2 Managing Keyring Files
34.3 Creating Client Keys
34.4 Revoking Client Keys
34.5 Checking for Unbalanced Data Writing
34.6 Time Synchronization of Nodes
34.7 Upgrading Software
34.8 Increasing the Number of Placement Groups
34.9 Adding a Pool
34.10 Deleting a Pool
34.11 Troubleshooting
35 Performance Diagnosis
35.1 Finding Slow OSDs
35.2 Is My Network Causing Issues?
36 Server Maintenance
36.1 Adding a Server to a Cluster
36.2 Removing a Server from a Cluster
36.3 Increasing File Descriptors
37 Networking
37.1 Setting NTP to a Ceph Cluster
37.2 Firewall Settings for Ceph
37.3 Adding a Private Network to a Running Cluster
Glossary
A Example Procedure of Manual Ceph Installation
B Documentation Updates
B.1 February, 2017 (Release of SUSE Enterprise Storage 4 Maintenance Update 1)
B.2 December, 2016 (Release of SUSE Enterprise Storage 4)
B.3 June, 2016 (Release of SUSE Enterprise Storage 3)
B.4 January, 2016 (Release of SUSE Enterprise Storage 2.1)
B.5 October, 2015 (Release of SUSE Enterprise Storage 2)

Copyright © 2017 SUSE LLC

Copyright © 2010-2014, Inktank Storage, Inc. and contributors.

The text of and illustrations in this document are licensed by Inktank Storage under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

This document is an adaptation of original works found at http://ceph.com/docs/master/ (2015-01-30).

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries. All other trademarks are the property of their respective owners.

For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.

About This Guide

SUSE Enterprise Storage is an extension to SUSE Linux Enterprise. It combines the capabilities from the Ceph (http://ceph.com/) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms.

This guide helps you understand SUSE Enterprise Storage, with the main focus on managing and administering the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM.

Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.

For an overview of the documentation available for your product and the latest documentation updates, refer to http://www.suse.com/documentation.

1 Available Documentation

The following manuals are available for this product:

Administration and Deployment Guide

Guides you through Ceph installation steps and cluster management tasks, including description of basic Ceph cluster structure and related terminology. The guide also introduces steps to integrate Ceph with virtualization solutions such as libvirt, Xen, or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.

The Best Practice chapter (see Chapter 25, Introduction) includes selected practical topics sorted by category, so that you can easily find a solution or more information for a specific problem.

HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual. Additionally, you can access the product-specific manuals as well as upstream documentation from the Help links in the graphical Web interfaces. Find the latest documentation updates at http://www.suse.com/documentation where you can download the manuals for your product in multiple formats.

2 Feedback

Several feedback channels are available:

User Comments

We want to hear your comments about and suggestions for this manual and the other documentation included with this product. Use the User Comments feature at the bottom of each page in the online documentation or go to http://www.suse.com/documentation/feedback.html and enter your comments there.

Mail

For feedback on the documentation of this product, you can also send a mail to doc-team@suse.de. Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).

3 Documentation Conventions

The following typographical conventions are used in this manual:

  • /etc/passwd: directory names and file names

  • placeholder: replace placeholder with the actual value

  • PATH: the environment variable PATH

  • ls, --help: commands, options, and parameters

  • user: users or groups

  • Alt, AltF1: a key to press or a key combination; keys are shown in uppercase as on a keyboard

  • File, File › Save As: menu items, buttons

  • Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.

4 About the Making of This Manual

This book is written in Geekodoc, a subset of DocBook (see http://www.docbook.org). The XML source files were validated by xmllint, processed by xsltproc, and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps. The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/.

Part I SUSE Enterprise Storage

1 About SUSE Enterprise Storage

1.1 Introduction

SUSE Enterprise Storage is a distributed storage system designed for scalability, reliability and performance, based on the Ceph technology. As opposed to conventional systems, which use allocation tables to store and fetch data, Ceph uses a pseudo-random data distribution function to store data, which reduces the number of look-ups required in storage. Data is stored on intelligent object storage devices (OSDs) by using daemons, which automate data management tasks such as data distribution, data replication, failure detection and recovery. Ceph is both self-healing and self-managing, which reduces administrative and budget overhead.

The Ceph storage cluster uses two mandatory types of nodes—monitors and OSD daemons:

Monitor

Monitoring nodes maintain information about cluster health state, a map of other monitoring nodes and a CRUSH map.

Monitor nodes also keep history of changes performed to the cluster.

OSD Daemon

An OSD daemon stores data and manages the data replication and rebalancing processes. Each OSD daemon handles one or more OSDs, which can be physical disks/partitions or logical volumes.

OSD daemons also communicate with monitor nodes and provide them with the state of the other OSD daemons.

The Ceph storage cluster can use the following optional node types:

Metadata Server (MDS)

The metadata servers store metadata for the Ceph file system. By using MDS you can execute basic file system commands such as ls without overloading the cluster.

RADOS Gateway

RADOS Gateway is an HTTP REST gateway for the RADOS object store. You can also use this node type when using the Ceph file system.

Note
Note: Each Node Type on a Separate Server

We strongly recommend installing only one node type on a single server.

The Ceph environment has the following features:

Controlled, Scalable, Decentralized Placement of replicated Data using CRUSH

The Ceph system uses a unique map called CRUSH (Controlled Replication Under Scalable Hashing) to assign data to OSDs in an efficient manner. Data assignment offsets are generated instead of being looked up in tables. This does away with the disk look-ups that come with conventional allocation-table-based systems, reducing the communication between the storage and the client. A client armed with the CRUSH map and metadata such as the object name and byte offset knows where it can find the data, or on which OSD it needs to place the data.

CRUSH maintains a hierarchy of devices and the replica placement policy. As new devices are added, data from existing nodes is moved to the new device to improve distribution with regard to workload and resilience. As part of the replica placement policy, it can assign weights to devices so that some devices are favored over others. This can be used to give higher weights to solid state drives (SSDs) and lower weights to conventional rotational hard disks for better overall performance.

CRUSH is designed to optimally distribute data to make use of available devices efficiently. CRUSH supports different ways of data distribution such as the following:

  • n-way replication (mirroring)

  • RAID parity schemes

  • Erasure Coding

  • Hybrid approaches such as RAID-10

Reliable Autonomic Distributed Object Storage (RADOS)

The intelligence in the OSD daemons allows tasks such as data replication and migration to be handled automatically for self-management and self-healing. By default, data written to Ceph storage is replicated within the OSDs. The level and type of replication are configurable. In case of failures, the CRUSH map is updated and data is written to new (replicated) OSDs.

The intelligence of OSD daemons enables them to handle data replication, data migration, failure detection and recovery. These tasks are managed automatically and autonomously. This also allows the creation of various pools for different sorts of I/O.

Replicated Monitor Servers

The monitor servers keep track of all the devices in the system. They manage the CRUSH map, which is used to determine where the data needs to be placed. In case any of the OSDs fail, the CRUSH map is re-generated and re-distributed to the rest of the system. It is recommended that a system contain multiple monitor servers at any given time to add redundancy and improve resilience.

Configuration and management framework for your cluster - DeepSea

DeepSea is a collection of Salt states, runners and modules for deploying and managing Ceph.

Currently the Ceph storage cluster can provide the following services:

  • Ceph object storage

  • Ceph file system

  • RADOS block device

1.2 Additional Information

Ceph as a community project has its own extensive online documentation. For topics not found in this manual refer to http://ceph.com/docs/master/.

2 System Requirements

2.1 Minimal Recommendations per Storage Node

  • 2 GB of RAM per terabyte of OSD capacity (2‰ of the total raw capacity).

  • 1.5 GHz of a CPU core per OSD.

  • Bonded or redundant 10GbE networks.

  • OSD disks in JBOD configurations.

  • OSD disks should be exclusively used by SUSE Enterprise Storage.

  • Dedicated disk/SSD for the operating system, preferably in a RAID1 configuration.

  • Additional 4 GB of RAM if cache tiering is used.
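
For example, under these recommendations a storage node with eight 4 TB OSD disks (32 TB of raw capacity) needs roughly 64 GB of RAM, plus another 4 GB if cache tiering is used, and about 12 GHz of aggregate CPU capacity (8 × 1.5 GHz).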

2.2 Minimal Recommendations per Monitor Node

  • 3 SUSE Enterprise Storage monitor nodes recommended.

  • 2 GB of RAM per monitor.

  • SSD or fast hard disk in a RAID1 configuration

  • On installations with fewer than seven nodes, these can be hosted on the system disk of the OSD nodes.

  • Nodes should be bare metal, not virtualized, for performance reasons.

  • Mixing OSD, monitor, or RADOS Gateway nodes with the actual workload is not supported. No load-generating services other than OSD, monitor, or RADOS Gateway daemons are supported on the same host.

  • Configurations may vary from, and frequently exceed, these recommendations depending on individual sizing and performance needs.

  • Bonded network interfaces for redundancy.

2.3 Minimal Recommendations for RADOS Gateway Nodes

RADOS Gateway nodes should have 6-8 CPU cores and 32 GB of RAM (64 GB recommended).

2.4 Minimal Recommendations for iSCSI Nodes

iSCSI nodes should have 6-8 CPU cores and 16 GB of RAM.

2.5 Naming Limitations

Ceph does not generally support non-ASCII characters in configuration files, pool names, user names and so forth. When configuring a Ceph cluster we recommend using only simple alphanumeric characters (A-Z, a-z, 0-9) and minimal punctuation ('.', '-', '_') in all Ceph object/configuration names.

Part II Cluster Deployment and Upgrade

3 Introduction

This chapter outlines procedures to deploy the Ceph cluster. Currently we support the following methods of deployment:

Important
Important: Do Not Mix Installation Methods

You cannot mix the supported installation methods. For example, if you decide to deploy the cluster with Crowbar, you cannot later make changes to its settings with ceph-deploy, and vice versa.

4 Deploying with DeepSea and Salt

Salt, along with DeepSea, is a stack of components that helps you deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running. Read the following considerations before you start deploying the cluster with Salt:

  • The Salt master is the host that controls the whole cluster deployment. Dedicate all of the host's resources to the Salt master services. Do not install Ceph on the host where you want to run the Salt master.

  • Salt minions are the nodes controlled by the Salt master. In the Ceph environment, a Salt minion is typically an OSD or monitor node.

  • Salt minions need to correctly resolve the Salt master's host name over the network. By default, they look for the salt host name. Therefore, we recommend setting the Salt master's host name to salt.
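
A quick way to verify the name resolution from a minion is the getent command (shown here only as a check; adjust it to your environment):

getent hosts salt

If the command prints no address, add the Salt master's IP address together with the salt alias to /etc/hosts on the minions, or create a corresponding DNS record.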

4.1 Introduction to DeepSea

The goal of DeepSea is to save the administrator's time and to perform complex operations on a Ceph cluster with confidence. This idea has driven a few design choices. Before presenting those choices, some observations are necessary.

All software has configuration. Sometimes the default is sufficient. This is not the case with Ceph. Ceph is flexible almost to a fault. Reducing this complexity would force administrators into preconceived configurations. Several of the existing Ceph solutions for an installation create a demonstration cluster of three nodes. However, the most interesting features of Ceph require more.

One aspect of configuration management tools is accessing data such as the addresses and device names of the individual servers. For a distributed storage system such as Ceph, that aggregate can run into the hundreds. Collecting the information and entering the data manually into a configuration management tool is prohibitive and error-prone.

The steps necessary to provision the servers, collect the configuration, configure and deploy Ceph are mostly the same. However, this does not address managing the separate functions. For day to day operations, the ability to trivially add hardware to a given function and remove it gracefully is a requirement.

With these observations in mind, DeepSea addresses them with the following strategy: DeepSea consolidates the administrator's decisions in a single location. The decisions revolve around cluster assignment, role assignment and profile assignment. DeepSea also collects each set of tasks into a simple goal. Each goal is a stage:

  • Stage 0—the provisioning—this stage is optional, as many sites provide their own provisioning of servers. If you do not have your own provisioning tool, you should run this stage. During this stage, all required updates are applied and your system may be rebooted.

  • Stage 1—the discovery—here you detect all hardware in your cluster and collect the information necessary for the Ceph configuration. For details about the configuration, refer to Section 4.3, “Configuration and Customization”.

  • Stage 2—the configuration—you need to prepare the configuration data in a particular format.

  • Stage 3—the deployment—creates a basic Ceph cluster with OSDs and monitors.

  • Stage 4—the services—additional features of Ceph such as iSCSI, RADOS Gateway and CephFS can be installed in this stage. Each is optional.

  • Stage 5—the removal stage. This stage is not mandatory, and during the initial setup it is usually not needed. In this stage, the roles of minions and the cluster configuration are removed. Run this stage when you need to remove a storage node from your cluster; for details, refer to Section 27.9.3, “Removing and Reinstalling Salt Cluster Nodes”.

4.1.1 Organization and Important Locations

Salt has several standard locations and several naming conventions used on your master node:

/srv/pillar

The directory stores configuration data for your cluster minions. Pillar is an interface for providing global configuration values to all your cluster minions.

/srv/salt/

The directory stores Salt state files (also called sls files). State files are formatted descriptions of the states in which the cluster should be. For details, refer to the Salt documentation (https://docs.saltstack.com/en/latest/topics/tutorials/starting_states.html).

/srv/module/runners

The directory stores Python scripts known as runners. Runners are executed on the master node.

/srv/salt/_modules

The directory stores Python scripts called modules. The modules are applied to all minions in your cluster.

/srv/pillar/ceph

The directory is used by DeepSea. Collected configuration data are stored there.

/srv/salt/ceph

Directory used by DeepSea. The directory stores sls files that can be in different formats, but each subdirectory contains only one type of sls file. For example, /srv/salt/ceph/stage contains orchestration files that are executed by salt-run state.orchestrate.

4.2 Deploying with DeepSea and Salt

The cluster deployment process by using Salt has several phases. First, you need to prepare all nodes of the cluster by configuring Salt and then you deploy and configure Ceph.

The following procedure describes the cluster preparation in detail.

  1. Install and register SUSE Linux Enterprise Server 12 SP2 together with SUSE Enterprise Storage 4 extension on each node of the cluster.

  2. Install DeepSea on the Salt master:

    root@master > sudo zypper in deepsea

    The command installs the salt-master package as a dependency.

    Check that the salt-master service is enabled and started, and enable and start it if needed:

    root@master > sudo systemctl enable salt-master.service
    root@master > sudo systemctl start salt-master.service
  3. Install the package salt-minion on all cluster nodes including the Salt master node.

    root@minion > sudo zypper in salt-minion

    Check that the salt-minion service is enabled and started, and enable and start it if needed:

    root@minion > sudo systemctl enable salt-minion.service
    root@minion > sudo systemctl start salt-minion.service
  4. Configure all minions to connect to the master. If your Salt master is not reachable by the DNS name salt, edit the file /etc/salt/minion or create a new file /etc/salt/minion.d/master.conf with the following content:

    master: DNS_name_of_your_salt_master

    If you performed any changes to the configuration files mentioned above, restart the Salt service on all Salt minions:

    root@minion > sudo systemctl restart salt-minion.service
  5. Check that the Salt state file /srv/pillar/ceph/master_minion.sls points to your Salt master. If you used the default host name salt for your Salt master in the ses domain, the file looks as follows:

    master_minion: salt.ses
  6. Restart the Salt service on the master node:

    root@master > sudo systemctl restart salt-master.service
  7. Accept all salt keys on the Salt master:

    root@master > salt-key --accept-all
  8. Verify that the keys have been accepted:

    root@master > salt-key --list-all
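
    The output should list all your cluster nodes under Accepted Keys, similar to the following sketch (the minion names are examples only):

    Accepted Keys:
    admin.ceph
    mon1.ceph
    mon2.ceph
    mon3.ceph
    data1.ceph
    Denied Keys:
    Unaccepted Keys:
    Rejected Keys: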
  9. Ensure that you have access to Ceph Jewel repositories.

  10. Prior to deploying SUSE Enterprise Storage, make sure that all disks that were used as OSDs by previous clusters are empty and contain no partitions. To ensure this, you have to manually zap all such disks:

    sudo ceph-disk zap device_name
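
    If a node contains several previously used disks, you can zap them in a loop, for example (a sketch only; the device names are hypothetical, and zapping destroys all data on the disks):

    for device in /dev/sdb /dev/sdc /dev/sdd; do
      sudo ceph-disk zap $device
    done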

Now you deploy and configure Ceph.

Note
Note: Salt Command Conventions

There are two ways to run salt-run state.orch: one with stage.<stage number>, the other with the name of the stage. Both notations have the same effect, and which one you use is entirely up to your preference.

Unless specified otherwise, all steps are mandatory.

  1. Now prepare your cluster. You can omit this step if you have your own provisioning server.

    root@master > salt-run state.orch ceph.stage.0

    or

    root@master > salt-run state.orch ceph.stage.prep
  2. The discovery stage collects data from all minions and creates configuration fragments that are stored in the directory /srv/pillar/ceph/proposals. The data are stored in the YAML format in sls or yml files.

    root@master > salt-run state.orch ceph.stage.1

    or

    root@master > salt-run state.orch ceph.stage.discovery
  3. After the previous command finishes successfully, create a policy.cfg file in /srv/pillar/ceph/proposals. For details refer to Section 4.3.1, “The policy.cfg File”.

  4. The configuration stage parses the policy.cfg file and merges the included files into their final form. Cluster and role related contents are placed in /srv/pillar/ceph/cluster, while Ceph specific content is placed in /srv/pillar/ceph/stack/default.

    Run the following command to trigger the configuration stage:

    root@master > salt-run state.orch ceph.stage.2

    or

    root@master > salt-run state.orch ceph.stage.configure

    The configuration step may take several seconds. After the command finishes, you can view the pillar data for all minions by running:

    root@master > salt '*' pillar.items
    Note
    Note: Overwriting Defaults

    As soon as the command finishes, you can view the default configuration and change it to suit your needs. For details refer to Section 4.3.2, “Customizing the Default Configuration”.

  5. Now you run the deployment stage. In this stage, the pillar is validated, and monitors and OSD daemons are created on the storage nodes. Run the following to start the stage:

    root@master > salt-run state.orch ceph.stage.3

    or

    root@master > salt-run state.orch ceph.stage.deploy

    The command may take several minutes. If it fails, you have to fix the issue. Then run the previous stages again. After the command succeeds, run the following to check the status:

    ceph -s
  6. The last step of the Ceph cluster deployment is the services stage. Here you instantiate any of the currently supported services: iSCSI, CephFS, RADOS Gateway and openATTIC. In this stage, the necessary pools are created, keyrings are authorized, and services are started. To start the stage, run the following:

    root@master > salt-run state.orch ceph.stage.4

    or

    root@master > salt-run state.orch ceph.stage.services

    Depending on the setup, the command may run for several minutes. By default, openATTIC is installed on the master node. If you need to install openATTIC on a different node, refer to Section 4.3, “Configuration and Customization”. openATTIC is not a mandatory part of SUSE Enterprise Storage; if you do not want to install it at all, refer to Chapter 21, openATTIC.

4.3 Configuration and Customization

4.3.1 The policy.cfg File

The /srv/pillar/ceph/proposals/policy.cfg configuration file is used to determine the functions of individual cluster nodes (which node acts as an OSD, which is a monitoring node, and so on). The file also includes configuration for individual nodes.

Currently the only way to configure the policy is by manually editing the /srv/pillar/ceph/proposals/policy.cfg configuration file. The file is divided into four sections: cluster assignment, role assignment, common configuration, and profile assignment.

The order of the sections is arbitrary, but the content of included lines overwrites matching keys from the contents of previous lines.

4.3.1.1 Cluster Assignment

In the cluster section, you select minions for your cluster. You can select all minions, or you can blacklist or whitelist minions. Examples for a cluster called ceph follow.

To include all minions, add the following lines:

cluster-ceph/cluster/*.sls

To whitelist a particular minion:

cluster-ceph/cluster/abc.domain.sls

or a group of minions—you can use shell glob matching:

cluster-ceph/cluster/mon*.sls

To blacklist one or more minions, set them to unassigned:

cluster-unassigned/cluster/client*.sls

4.3.1.2 Role Assignment

In this section, you assign roles to your cluster nodes. The general pattern is the following:

role-<role name>/<path>/<included files>

Where the items have the following meaning and values:

  • <role name> is any of the following: master, admin, mon, mds, igw or rgw. A detailed description of each role follows below.

  • <path> is a relative path to sls or yml files. In the case of sls files it is usually cluster, while yml files are located under stack/default/ceph/minions.

  • <included files> are the Salt state files or YAML configuration files. Shell globbing can be used for more specific matching.

An example for each role follows:

  • master - the node has admin keyrings to all Ceph clusters. Currently, only a single Ceph cluster is supported. The master role is mandatory; always add a line similar to the following:

    role-master/cluster/master*.sls
  • admin - the minion will have an admin keyring. You define the role as follows:

    role-admin/cluster/abc*.sls
  • mon - the minion will provide the monitoring service to the Ceph cluster. This role requires the addresses of the assigned minions, so you need to include the files from the stack directory in addition to the sls files:

    role-mon/stack/default/ceph/minions/mon*.yml
    role-mon/cluster/mon*.sls

    The example assigns the monitoring role to a group of minions.

  • mds - the minion will provide the metadata service to support CephFS.

    role-mds/cluster/mds*.sls
  • igw - the minion will act as an iSCSI gateway. This role requires the addresses of the assigned minions, so you also need to include the files from the stack directory:

    role-igw/stack/default/ceph/minions/xyz.domain.yml
    role-igw/cluster/*.sls
  • rgw - the minion will act as a RADOS Gateway:

    role-rgw/cluster/rgw*.sls
Note
Note: Multiple Roles of Cluster Nodes

You can assign several roles to a single node. For instance, you can also assign the mds role to two monitor nodes:

role-mds/cluster/mon[12]*.sls

4.3.1.3 Common Configuration

The common configuration section includes configuration files generated during the discovery (stage 1). These configuration files store parameters like fsid or public_network. To include the required Ceph common configuration, add the following lines:

config/stack/default/global.yml
config/stack/default/ceph/cluster.yml

4.3.1.4 Profile Assignment

In Ceph, a single storage role would be insufficient to describe the many disk configurations available with the same hardware. Therefore, stage 1 generates multiple profiles for the same storage node when possible. The administrator adds the cluster and stack related lines in the same way as for the mon and igw roles.

The profile names begin with profile and end with a single digit. The format is the following:

profile-<label>-<single digit>/<path to sls or yml files>

where the items have the following meaning and values:

  • <label> is dynamically generated based on quantity, model and size of the media, e.g. 2Disk2GB.

    To view the value, you can run:

    salt '*' cephdisks.list

    If the command outputs results, inspect the files in /srv/pillar/ceph/proposals/. The file names contain the <label> values.

  • <single digit> - defines the type of profile and reflects the types of media attached to the minion. When 1 is specified, the media are treated as individual OSDs. When you specify 2, the node has solid state media (SSD or NVMe), and the solid state media are used as separate journals. Depending on the number of models and the ratio of drives, additional profiles may be created by incrementing the digit.

  • <path to sls or yml files> - replace it with a proper path to cluster sls files or to stack yml configuration files.

Now check the content of the yml files in stack/default/ceph/minions for the specific configuration. Then configure the profiles according to the following examples:

A minion with a single disk called 3HP5588GB:

profile-3HP5588-1/cluster/*.sls
profile-3HP5588-1/stack/default/ceph/minions/*.yml

A minion with two disk types, 2Intel745GB and 6INTEL372GB:

profile-2Intel745GB-6INTEL372GB-2/cluster/*.sls
profile-2Intel745GB-6INTEL372GB-2/stack/default/ceph/minions/*.yml

You can add as many lines as needed to define a profile for each storage node:

profile-24HP5588-1/cluster/cold*.sls
profile-24HP5588-1/stack/default/ceph/minions/cold*.yml
profile-18HP5588-6INTEL372GB-2/cluster/data*.sls
profile-18HP5588-6INTEL372GB-2/stack/default/ceph/minions/data*.yml
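
Putting the pieces together, a complete policy.cfg may look similar to the following sketch. The host name globs and the profile label are examples only and must match your own minions and the proposals generated during stage 1:

cluster-ceph/cluster/*.sls
role-master/cluster/admin*.sls
role-admin/cluster/admin*.sls
role-mon/stack/default/ceph/minions/mon*.yml
role-mon/cluster/mon*.sls
role-mds/cluster/mds*.sls
role-rgw/cluster/rgw*.sls
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
profile-3HP5588-1/cluster/data*.sls
profile-3HP5588-1/stack/default/ceph/minions/data*.yml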

4.3.2 Customizing the Default Configuration

You can change the default configuration generated in stage 2. For example, you might need to change network settings, or the software that is installed on your Salt master by default. The first task is performed by modifying the pillar updated after stage 2, while the second is usually done by creating a custom sls file and adding it to the pillar. Details are described in the following sections.

4.3.2.1 Using Customized Configuration Files

This section describes the steps required to apply your own sls files. Such a procedure is typically used when you need to change the default deployment process, for example to install additional software on a different node. Let's take the installation of openATTIC, which is installed on the Salt master by default, on a Salt minion node as an example:

Procedure 4.1: Installing openATTIC on a Salt minion
  1. Navigate to the openATTIC directory:

    cd /srv/salt/ceph/stage/openattic
  2. Create your custom.sls, e.g. by using the default.sls as a template:

    cp default.sls custom.sls
  3. Edit the custom.sls file. Change the value of

    I@roles:master

    to the desired Salt minion, oa.minion in our case:

    I@roles:oa.minion
    Note
    Note

    Bear in mind that the Salt minion target must exactly match the Salt minion name. If you use values like oa*, make sure to remove the line tgt_type: compound, as Salt defaults to globbing.

  4. Edit the file /srv/pillar/ceph/stack/ceph/cluster.yml by adding the following line:

    stage_openattic: custom
  5. Now run the stage 2 again and then run stages 3 and 4.

4.3.2.2 Modifying Discovered Configuration

After you completed stage 2, you may want to change the discovered configuration. To view the current settings, run:

root@master > salt '*' pillar.items

The output of the default configuration for a single minion is usually similar to the following:

----------
    available_roles:
        - admin
        - mon
        - storage
        - mds
        - igw
        - rgw
        - client-cephfs
        - client-radosgw
        - client-iscsi
        - mds-nfs
        - rgw-nfs
        - master
    cluster:
        ceph
    cluster_network:
        172.16.22.0/24
    fsid:
        e08ec63c-8268-3f04-bcdb-614921e94342
    master_minion:
        admin.ceph
    mon_host:
        - 172.16.21.13
        - 172.16.21.11
        - 172.16.21.12
    mon_initial_members:
        - mon3
        - mon1
        - mon2
    public_address:
        172.16.21.11
    public_network:
        172.16.21.0/24
    roles:
        - admin
        - mon
        - mds
    time_server:
        admin.ceph
    time_service:
        ntp

The above-mentioned settings are distributed across several configuration files. The directory structure containing these files is defined in the /srv/pillar/ceph/stack/stack.cfg file. The following files usually describe your cluster:

  • /srv/pillar/ceph/stack/global.yml - the file affects all minions in the Salt cluster.

  • /srv/pillar/ceph/stack/ceph/cluster.yml - the file affects all minions in the Ceph cluster called ceph.

  • /srv/pillar/ceph/stack/ceph/roles/role.yml - affects all minions that are assigned the specific role in the ceph cluster.

  • /srv/pillar/ceph/stack/ceph/minions/minion ID.yml - affects the individual minion.

Note
Note: Overwriting Directories with Default Values

There is a parallel directory tree that stores the default configuration setup in /srv/pillar/ceph/stack/default. Do not change values there, as they would be overwritten.

The typical procedure of changing the collected configuration is the following:

  1. Find the location of the configuration item you need to change. For example, if you need to change a cluster-related setting such as the cluster network, edit the file /srv/pillar/ceph/stack/ceph/cluster.yml.

  2. Save the file.

  3. Verify the changes by running:

    root@master > salt '*' saltutil.refresh_pillar

    and then

    root@master > salt '*' pillar.items
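
For example, to override the cluster network that was detected during discovery, set the corresponding key in /srv/pillar/ceph/stack/ceph/cluster.yml (the value shown is only an example):

cluster_network: 172.16.22.0/24

Then refresh and inspect the pillar as shown in the last step above.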

5 Deploying with ceph-deploy

ceph-deploy is a command line utility that simplifies the way you deploy a Ceph cluster in small-scale setups.

5.1 Ceph Layout

The minimal recommended setup uses at least five cluster nodes: one admin node and four OSD nodes sharing resources with at least three monitor nodes.

Minimal Ceph Setup
Figure 5.1: Minimal Ceph Setup

5.2 Network Recommendations

The network environment where you intend to run Ceph should ideally be a bonded set of at least two network interfaces that is logically split into a public part and a trusted internal part using VLANs. The recommended bonding mode is 802.3ad, when possible, to provide maximum bandwidth and resiliency.

The public VLAN serves to provide the service to the customers, while the internal part provides the authenticated Ceph network communication. The main reason is that although Ceph provides authentication and protection against attacks once secret keys are in place, the messages used to set up these keys may be transferred openly and are vulnerable.

Tip
Tip: Nodes Configured via DHCP

If your storage nodes are configured via DHCP, the default timeouts may not be sufficient for the network to be configured correctly before the various Ceph daemons start. If this happens, the Ceph MONs and OSDs will not start correctly (running systemctl status ceph\* will result in "unable to bind" errors), and Calamari may be unable to display graphs. To avoid this issue, we recommend increasing the DHCP client timeout to at least 30 seconds on each node in your storage cluster. This can be done by changing the following settings on each node:

In /etc/sysconfig/network/dhcp set

DHCLIENT_WAIT_AT_BOOT="30"

In /etc/sysconfig/network/config set

WAIT_FOR_INTERFACES="60"

5.3 Preparing Each Ceph Node

Before deploying the Ceph cluster, apply the following steps for each Ceph node as root:

  1. Install SUSE Linux Enterprise 12 SP2 and add the SUSE Enterprise Storage extension.

    SUSE Enterprise Storage Extension Selection
    Figure 5.2: SUSE Enterprise Storage Extension Selection

    On the Installation Settings screen, click Software. On the Software Selection and System Tasks screen, there are several tasks related to SUSE Enterprise Storage. For OSDs, monitors, or the admin server, be sure to choose SUSE Enterprise Storage server packages and confirm with OK.

    SUSE Enterprise Storage Related Installation Tasks
    Figure 5.3: SUSE Enterprise Storage Related Installation Tasks

    For more information on the extension installation, see https://www.suse.com/documentation/sles-12/book_sle_deployment/data/cha_add-ons.html.

  2. Check the firewall status

    sudo /sbin/SuSEfirewall2 status

    and if it is on, either turn it off with

    sudo /sbin/SuSEfirewall2 off

    or, if you want to keep it on, enable the appropriate set of ports. You can find detailed information in Section 37.2, “Firewall Settings for Ceph”.

  3. Make sure that network settings are correct: each Ceph node needs to route to all other Ceph nodes, and each Ceph node needs to resolve all other Ceph nodes by their short host names (without the domain suffix). If these two conditions are not met, Ceph fails.

    Tip
    Tip: Calamari Node

    If you plan to deploy the Calamari monitoring and management environment (refer to Chapter 22, Calamari for more information), each Ceph node needs to reach the Calamari node as well.

  4. Install and set up NTP—the time synchronization tool. We strongly recommend using NTP within the Ceph cluster. The reason is that Ceph daemons pass critical messages to each other, which must be processed before daemons reach a timeout threshold. If the clocks in Ceph monitors are not synchronized, it can lead to a number of anomalies, such as daemons ignoring received messages.

    Even though clock drift may still be noticeable with NTP, it is not yet harmful.

    To install NTP, run the following:

    sudo zypper in ntp yast2-ntp-client

    To configure NTP, go to YaST › Network Services › NTP Configuration. Make sure to enable the NTP service (systemctl enable ntpd.service && systemctl start ntpd.service). Find more detailed information on NTP in the SLES Administration Guide (https://www.suse.com/documentation/sles-12/book_sle_admin/data/cha_netz_xntp.html).

  5. Install SSH server. Ceph uses SSH to log in to all cluster nodes. Make sure SSH is installed (zypper in openssh) and enabled (systemctl enable sshd.service && systemctl start sshd.service).

  6. Add a cephadm user account, and set a password for it. The admin node will log in to Ceph nodes as this particular cephadm user.

    sudo useradd -m cephadm && passwd cephadm
  7. The admin node needs to have password-less SSH access to all Ceph nodes. When ceph-deploy logs in to a Ceph node as a cephadm user, this user must have password-less sudo privileges.

    Edit the /etc/sudoers file (with visudo) and add the following line to grant password-less sudo privileges to the cephadm user:

    cephadm ALL = (root) NOPASSWD:ALL
    Tip
    Tip: Disable requiretty

    You may receive an error when trying to execute ceph-deploy commands. If requiretty is set by default, disable it: run sudo visudo, locate the Defaults requiretty setting, and change it to Defaults:cephadm !requiretty to ensure that ceph-deploy can connect as the cephadm user and execute commands with sudo.

  8. On the admin node, become the cephadm user, and enable password-less SSH access to all other Ceph nodes:

    su - cephadm
    cephadm > ssh-keygen

    You will be asked several questions. Leave the values at their defaults, and the passphrase empty.

    Copy the key to each Ceph node:

    ssh-copy-id cephadm@node1
    ssh-copy-id cephadm@node2
    ssh-copy-id cephadm@node3
    Tip
    Tip: Running ceph-deploy from a Different User Account Than cephadm

    It is possible to run the ceph-deploy command even if you are logged in as a different user than cephadm. For this purpose, you need to set up an SSH alias in your ~/.ssh/config file:

    [...]
    Host ceph-node1
      Hostname ceph-node1
      User cephadm

    After this change, ssh ceph-node1 automatically uses the cephadm user to log in.
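
To confirm that both the password-less SSH access and the password-less sudo privileges work as expected, you can run a quick check from the admin node as the cephadm user, using the node names from the examples above:

cephadm > ssh node1 sudo whoami

The command should print root without asking for any password.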

5.4 Cleaning Previous Ceph Environment

If at any point during the Ceph deployment you run into trouble and need to start over, or if you want to make sure that any previous Ceph configuration is removed, execute the following commands as the cephadm user to purge the previous Ceph configuration.

Warning
Warning

Be aware that purging a previous Ceph installation destroys stored data and access settings.

cephadm > ceph-deploy purge node1 node2 node3
cephadm > ceph-deploy purgedata node1 node2 node3
cephadm > ceph-deploy forgetkeys

5.5 Running ceph-deploy

After you prepared each Ceph node as described in Section 5.3, “Preparing Each Ceph Node”, you are ready to deploy Ceph from the admin node with ceph-deploy. Note that ceph-deploy will not successfully install an OSD on disks that have been previously used, unless you first 'zap' them. Be aware that 'zapping' erases the entire disk content:

cephadm > ceph-deploy disk zap node:vdb
  1. Install ceph and ceph-deploy:

    sudo zypper in ceph ceph-deploy
  2. Disable IPv6. Open /etc/sysctl.conf, edit the following lines, and reboot the admin node:

    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
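
    If you prefer to apply the change without waiting for a reboot, the values from /etc/sysctl.conf can usually also be loaded on the fly; a reboot remains the safest way to make sure that all services pick up the change:

    sudo sysctl -p /etc/sysctl.conf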
  3. Because it is not recommended to run ceph-deploy as root, become the cephadm user:

    su - cephadm
  4. Run ceph-deploy to install Ceph on each node:

    cephadm > ceph-deploy install node1 node2 node3
    Tip
    Tip

    ceph-deploy creates important files in the directory where you run it from. It is best to run ceph-deploy in an empty directory.

  5. Set up the monitor nodes. Create keys and local configuration. The keys are used to authenticate and protect the communication between Ceph nodes.

    cephadm > ceph-deploy new node1 node2 node3

    During this step, ceph-deploy creates local configuration files. It is recommended to inspect the configuration files in the current directory.

    Tip
    Tip: Monitor Nodes on Different Subnets

    If the monitor nodes are not in the same subnet, you need to modify the ceph.conf in the current directory. For example, if the nodes have IP addresses

    10.121.9.186
    10.121.10.186
    10.121.11.186

    add the following line to the global section of ceph.conf:

    public network = 10.121.0.0/16

    Since you are likely to experience problems with IPv6 networking, consider modifying the IPv6 mon_host settings, as in the following example:

    mon_host = [2620:...10:121:9:186,2620:...10:121:10:186,2620:...10:121:11:186]

    into its IPv4 equivalent:

    mon_host = 10.121.9.186, 10.121.10.186, 10.121.11.186
  6. Create the initial monitor service on already created monitor nodes:

    cephadm > ceph-deploy mon create-initial
  7. Any node from which you need to run Ceph command line tools needs a copy of the admin keyring. To copy the admin keyring to a node or set of nodes, run

    cephadm > ceph-deploy admin node1 node2 node3
    Important
    Important

    Because the client.admin keyring file is readable only by root, you need to use sudo when running the ceph command.

  8. Check the firewall status

    sudo /sbin/SuSEfirewall2 status

    and if it is off, check its configuration and turn it on with

    sudo /sbin/SuSEfirewall2 on

    You can find detailed information in Section 37.2, “Firewall Settings for Ceph”.

  9. Create OSD daemons. Although you can use a directory as storage, we recommend creating a separate disk dedicated to a Ceph node. To find out the name of the disk device, run

    cat /proc/partitions
    major minor  #blocks  name
    
     254        0   12582912 vda
     254        1    1532928 vda1
     254        2   11048960 vda2
      11        0    2831360 sr0
     254       16    4194304 vdb

    In our case the vdb disk has no partitions, so it is most likely our newly created disk.

    Now set up the disk for Ceph:

    cephadm > ceph-deploy osd prepare node:vdb
    Tip
    Tip: Using Existing Partitions

    If you need to create OSDs on already existing partitions, you need to set their GUIDs correctly. See Section 29.3, “How to Use Existing Partitions for OSDs Including OSD Journals” for more details.

    Tip
    Tip

    If the disk was already used before, add the --zap option.

    cephadm > ceph-deploy osd prepare --zap node:vdb

    Be aware that 'zapping' erases the entire disk content.

    Note
    Note: Default File System for OSDs

    The default and only supported file system for OSDs is xfs.

    Optionally, activate the OSD:

    cephadm > ceph-deploy osd activate node:vdb1
    Tip
    Tip

    To join the functionality of ceph-deploy osd prepare and ceph-deploy osd activate, use ceph-deploy osd create.
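
    For example, for the disk prepared above (a sketch following the earlier node and device names):

    cephadm > ceph-deploy osd create node:vdb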

  10. To test the status of the cluster, run

    sudo ceph -k ceph.client.admin.keyring health
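
    On a healthy cluster, the command typically reports HEALTH_OK; states such as HEALTH_WARN or HEALTH_ERR indicate that the cluster needs attention (see Chapter 10, Determining Cluster State).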
Tip
Tip: Non-default Cluster Name

If you need to install the cluster with ceph-deploy using a name other than the default cluster name, you need to initially specify it with --cluster, and then specify it in each ceph-deploy command related to that cluster:

ceph-deploy --cluster my_cluster new [...]
ceph-deploy --ceph-conf my_cluster.conf mon create-initial
ceph-deploy --ceph-conf my_cluster.conf osd prepare [...]
ceph-deploy --ceph-conf my_cluster.conf osd activate [...]

Note that using a name other than the default cluster name is not supported by SUSE.

6 Deploying with Crowbar

Crowbar (http://crowbar.github.io/) is a framework to build complete deployments. It helps you transform groups of bare-metal nodes into an operational cluster within a relatively short time.

The deployment process consists of two basic steps: first you need to install and set up the Crowbar admin server, then use it to deploy the available OSD/monitor nodes.

6.1 Installing and Setting Up the Crowbar Admin Server

The Crowbar admin server is a stand-alone host with SUSE Linux Enterprise Server 12 SP2 installed, operating in the same network as the Ceph OSD/MON nodes to be deployed. You need to configure the Crowbar admin server so that it provides the software repositories required for Ceph deployment via the TFTP protocol and PXE network boot.

  1. Install and register SUSE Linux Enterprise Server 12 SP2 on the Crowbar admin server. Optionally, you can install and register the SUSE Enterprise Storage 4 extension at the same time. For more information on SUSE Linux Enterprise Server installation, see https://www.suse.com/documentation/sles-12/book_sle_deployment/data/cha_inst.html. For more information on the extensions installation, see https://www.suse.com/documentation/sles-12/book_sle_deployment/data/cha_add-ons.html.

    Tip
    Tip

    The Crowbar admin server does not require any graphical interface. To save system resources and disk space, it is enough to install the Base System, Minimal System and, if you chose to install the SUSE Enterprise Storage 4 extension, the SUSE Enterprise Storage Crowbar patterns from the Software Selection and System Tasks window. If you plan to synchronize repositories (see Section 6.1.1, “Prepare Software Repositories”) with SMT, add the Subscription Management Tool pattern as well.

  2. Configure network settings for the Crowbar admin server. The server needs to have a static IP address assigned, and the full host name including the domain name specified (for example crowbar-admin.example.com). Check with hostname -f if the host name resolves correctly. The local network where you deploy the cluster needs to have the DHCP server disabled as the Crowbar admin server runs its own.

    Tip
    Tip

    The Crowbar admin server's default IP address is 192.168.124.10. If it is possible to keep that IP address in your network environment, you can save some time by not having to reconfigure the Crowbar network settings.

  3. Configure NTP to keep the server's time synchronized. See https://www.suse.com/documentation/sles-12/book_sle_admin/data/cha_netz_xntp.html for more information on the NTP protocol.

  4. Make sure that SSH is enabled and started on the server.

  5. Install and register the SUSE Enterprise Storage 4 extension if you did not install it in step 1. For more information on extension installation, see https://www.suse.com/documentation/sles-12/book_sle_deployment/data/cha_add-ons.html. Then install the SUSE Enterprise Storage Crowbar pattern in YaST. If you prefer the command line, run sudo zypper in -t pattern ses_admin.

  6. Mount software repositories required for Ceph nodes deployment with Crowbar. See Section 6.1.1, “Prepare Software Repositories” for more detailed information.

  7. If you need to further customize the Crowbar admin server settings, refer to the Crowbar Setup chapter of the current SUSE OpenStack Cloud Deployment Guide at https://www.suse.com/documentation.

  8. Run the following commands to complete the Crowbar admin server setup. The install-ses-admin script outputs a lot of information to the /var/log/crowbar/install.log log file, which can be examined in case of failure. Run it in a screen environment for safety reasons, as the network will be reconfigured during its run and interruptions may occur.

    sudo systemctl start crowbar-init
    sudo crowbarctl database create
    screen install-ses-admin

    Be patient as the script takes several minutes to finish.

  9. After the script successfully finishes, you can view the Crowbar admin server Web UI by pointing your Web browser to the Crowbar admin server IP address (http://192.168.124.10 by default).

6.1.1 Prepare Software Repositories

The Crowbar admin server needs to provide several software repositories so that the Ceph nodes can install the required packages from them on PXE boot. These repositories need to be mounted or synchronized under /srv/tftpboot/suse-12.2. The following description is based on the AMD64/Intel 64 architecture.

Tip
Tip: Synchronizing Repositories

There are several ways to provide the content in the repository directories. You can, for example, run your local SMT instance, synchronize the repositories, and then export them via NFS and mount them on the Crowbar admin server.

/srv/tftpboot/suse-12.2/x86_64/install

This directory needs to contain the contents of the SUSE Linux Enterprise Server 12 SP2 DVD disc #1. Ceph nodes need it for the base SUSE Linux Enterprise Server 12 SP2 installation. You can either mount the downloaded .iso image as a loop device, or copy its content with rsync.
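
For example, a sketch of both approaches; the .iso file name is only a placeholder:

sudo mount -o loop SLE-12-SP2-Server-DVD-x86_64-GM-DVD1.iso \
  /srv/tftpboot/suse-12.2/x86_64/install

or, to copy the content instead of keeping the image mounted:

sudo mount -o loop SLE-12-SP2-Server-DVD-x86_64-GM-DVD1.iso /mnt
sudo rsync -avP /mnt/ /srv/tftpboot/suse-12.2/x86_64/install/
sudo umount /mnt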

/srv/tftpboot/suse-12.2/x86_64/repos/SLES12-SP2-Pool

Base software repository for SUSE Linux Enterprise Server 12 SP2.

/srv/tftpboot/suse-12.2/x86_64/repos/SLES12-SP2-Updates

Repository containing updates for SUSE Linux Enterprise Server 12 SP2.

/srv/tftpboot/suse-12.2/x86_64/repos/SUSE-Enterprise-Storage-4-Pool

Base software repository for SES 4.

/srv/tftpboot/suse-12.2/x86_64/repos/SUSE-Enterprise-Storage-4-Updates

Repository containing updates for SES 4.

6.2 Deploying the Ceph Nodes

The Crowbar Web interface runs on the Administration Server. It provides an overview of the most important deployment details, including a view on the nodes and which roles are deployed on which nodes, and on the barclamp proposals that can be edited and deployed. In addition, the Crowbar Web interface shows details about the networks and switches in your cluster. It also provides graphical access to some tools with which you can manage your repositories, back up or restore the Administration Server, export the Chef configuration, or generate a supportconfig TAR archive with the most important log files.

6.2.1 Logging In

The Crowbar Web interface uses the HTTP protocol and port 80.

Procedure 6.1: Logging In to the Crowbar Web Interface
  1. On any machine, start a Web browser and make sure that JavaScript and cookies are enabled.

  2. As URL, enter the IP address of the Administration Server, for example:

    http://192.168.124.10/
  3. Log in as user crowbar. If you have not changed the password, it is crowbar by default.

Procedure 6.2: Changing the Password for the Crowbar Web Interface
  1. After logging in to the Crowbar Web interface, select Barclamps › Crowbar.

  2. Select the Crowbar barclamp entry and Edit the proposal.

  3. In the Attributes section, click Raw to edit the configuration file.

  4. Search for the following entry:

    "crowbar": {
         "password": "crowbar"
  5. Change the password.

  6. Confirm your change by clicking Save and Apply.

6.2.2 Node Installation

The Ceph nodes represent the actual cluster infrastructure. Node installation and service deployment is done automatically from the Administration Server. Before deploying the Ceph service, SUSE Linux Enterprise Server 12 SP2 will be installed on all nodes.

To install a node, you need to boot it first using PXE. It will be booted with an image that enables the Administration Server to discover the node and make it available for installation. When you have allocated the node, it will boot using PXE again and the automatic installation will start.

  1. Boot all nodes that you want to deploy using PXE. The nodes will boot into the SLEShammer image, which performs the initial hardware discovery.

    Important
    Important: Limit the Number of Concurrent Boots using PXE

    Booting many nodes using PXE at the same time will cause heavy load on the TFTP server, because all nodes will request the boot image at the same time. It is recommended to boot the nodes with a time delay.

  2. Open a browser and point it to the Crowbar Web interface on the Administration Server, for example http://192.168.124.10/. Log in as user crowbar. The password is crowbar by default, if you have not changed it.

    Click Nodes › Dashboard to open the Node Dashboard.

  3. Each node that has successfully booted will be listed as being in state Discovered, indicated by a yellow bullet. The nodes will be listed with their MAC address as a name. Wait until all nodes are listed as being Discovered before proceeding. In case a node does not report as being Discovered, it may need to be rebooted manually.

    Figure 6.1: Discovered Nodes
  4. Although this step is optional, it is recommended to group your nodes at this stage, because grouping gives you a clear arrangement of all nodes. Grouping the nodes by role would be one option, for example monitor nodes and OSD nodes.

    1. Enter the name of a new group into the New Group text box and click Add Group.

    2. Drag and drop a node onto the title of the newly created group. Repeat this step for each node you want to put into the group.

      Figure 6.2: Grouping Nodes
  5. To allocate all nodes, click Nodes › Bulk Edit. To allocate a single node, click the name of a node, then click Edit.

    Figure 6.3: Editing a Single Node
    Important
    Important: Limit the Number of Concurrent Node Deployments

    Deploying many nodes in bulk mode will cause heavy load on the Administration Server. The subsequent concurrent Chef client runs triggered by the nodes will require a lot of RAM on the Administration Server.

    Therefore it is recommended to limit the number of concurrent Allocations in bulk mode. The maximum number depends on the amount of RAM on the Administration Server; limiting concurrent deployments to five to ten is recommended.

  6. In single node editing mode, you can also specify the Filesystem Type for the node. By default, it is set to ext4 for all nodes. It is recommended to keep this default.

  7. Provide a meaningful Alias, Public Name and a Description for each node and check the Allocate box. You can also specify the Intended Role for the node. This optional setting is used to make reasonable proposals for the barclamps.

    By default Target Platform is set to SLES 12 SP2.

    Tip
    Tip: Alias Names

    Providing an alias name will change the default node names (MAC address) to the name you provided, making it easier to identify the node. Furthermore, this alias will also be used as a DNS CNAME for the node in the admin network. As a result, you can access the node via this alias when, for example, logging in via SSH.

    Figure 6.4: Bulk Editing Nodes
  8. When you have filled in the data for all nodes, click Save. The nodes will reboot and commence the AutoYaST-based SUSE Linux Enterprise Server installation (or installation of other target platforms, if selected) via a second boot using PXE. Click Nodes › Dashboard to return to the Node Dashboard.

  9. Nodes that are being installed are listed with the status Installing (yellow/green bullet). When the installation of a node has finished, it is listed as being Ready, indicated by a green bullet. Wait until all nodes are listed as being Ready before proceeding.

    Figure 6.5: All Nodes Have Been Installed

6.2.3 Barclamps

The Ceph service is automatically installed on the nodes by using so-called barclamps—a set of recipes, templates, and installation instructions. A barclamp is configured via a so-called proposal. A proposal contains the configuration of the service(s) associated with the barclamp and a list of machines onto which the barclamp should be deployed.

All existing barclamps can be accessed from the Crowbar Web interface by clicking Barclamps. To create or edit barclamp proposals and deploy them, proceed as follows:

  1. Open a browser and point it to the Crowbar Web interface available on the Administration Server, for example http://192.168.124.10/. Log in as user crowbar. The password is crowbar by default, if you have not changed it.

    Click Barclamps to open the All Barclamps menu. Alternatively you may filter the list to Crowbar or SUSE Enterprise Storage barclamps by choosing the respective option from Barclamps. The Crowbar barclamps contain general recipes for setting up and configuring all nodes, while the SUSE Enterprise Storage barclamps are dedicated to Ceph service deployment and configuration.

  2. You can either Create a proposal or Edit an existing one.

    Most Ceph barclamps consist of two sections: the Attributes section lets you change the configuration, and the Node Deployment section lets you choose onto which nodes to deploy the barclamp.

  3. To edit the Attributes section, change the values via the Web form. Alternatively you can directly edit the configuration file by clicking Raw.

    Warning
    Warning: Raw Mode

    If you switch between Raw mode and Web form (Custom mode), make sure to Save your changes before switching, otherwise they will be lost.

  4. To assign nodes to a role, use the Deployment section of the barclamp. It shows the Available Nodes that you can assign to the roles belonging to the barclamp.

    One or more nodes are usually automatically pre-selected for available roles. If this pre-selection does not meet your requirements, click the Remove icon next to the role to remove the assignment. Assign a node or cluster of your choice by selecting the respective entry from the list of Available Nodes, Available Clusters, or Available Clusters with Remote Nodes. Drag it to the desired role and drop it onto the role name. Do not drop a node or cluster onto the text box—this is used to filter the list of available nodes or clusters!

  5. To save and deploy your edits, click Apply. To save your changes without deploying them, click Save. To remove the complete proposal, click Delete. A proposal that already has been deployed can only be deleted manually, see Section 6.2.3.1, “Delete a Proposal That Already Has Been Deployed” for details.

    If you deploy a proposal onto a node where a previous one is still active, the new proposal will overwrite the old one.

    Note
    Note: Wait Until a Proposal has been Deployed

    Deploying a proposal might take some time (up to several minutes). It is strongly recommended to always wait until you see the note Successfully applied the proposal before proceeding on to the next proposal.

6.2.3.1 Delete a Proposal That Already Has Been Deployed

To delete a proposal that already has been deployed, you first need to Deactivate it in the Crowbar Web interface. Deactivating a proposal removes the chef role from the nodes, so the routine that installed and set up the services is not executed anymore. After a proposal has been deactivated, you can Delete it in the Crowbar Web interface to remove the barclamp configuration data from the server.

Deactivating and deleting a barclamp that has already been deployed does not remove packages installed when the barclamp was deployed, nor does it stop any services that were started during the barclamp deployment. To undo the deployment on the affected node, you need to stop (systemctl stop service) the respective services and disable (systemctl disable service) them. Uninstalling packages should not be necessary.
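
For example, to undo a deployment that started the osd.1 daemon on a node, a sketch could look like this (see Section 9.3, “Identifying Individual Services” for how to find the exact service names):

sudo systemctl stop ceph-osd@1.service
sudo systemctl disable ceph-osd@1.service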

6.2.3.2 Queuing/Dequeuing Proposals

When a proposal is applied to one or more nodes that are not yet available for deployment (for example because they are rebooting or have not been fully installed yet), the proposal will be put in a queue. A message like

Successfully queued the proposal until the following become ready: d52-54-00-6c-25-44

is shown after the proposal has been applied. A new Dequeue button will also become available. Use it to cancel the deployment of the proposal by removing it from the queue.

6.2.4 Deploying Ceph

For Ceph at least four nodes are required. If deploying the optional Calamari server for Ceph management and monitoring, an additional node is required.

The Ceph barclamp has the following configuration options:

Disk Selection Method

Choose whether to only use the first available disk or all available disks. Available disks are all disks currently not used by the system. Note that one disk (usually /dev/sda) of every block storage node is already used for the operating system and is not available for Ceph.

Number of Replicas of an Object

For data security, stored objects are not only stored once, but redundantly. Specify the number of copies that should be stored for each object with this setting. The number includes the object itself. If, for example, you want the object plus two copies, specify 3.

SSL Support for RadosGW

Choose whether to encrypt public communication (HTTPS) or not (HTTP). If choosing HTTPS, you need to specify the locations for the certificate key pair files.

Calamari Credentials

Calamari is a Web front-end for managing and analyzing the Ceph cluster. Provide administrator credentials (user name, password, e-mail address) in this section. When Ceph has been deployed, you can log in to Calamari with these credentials. Deploying Calamari is optional; leave these text boxes empty when not deploying Calamari.

Figure 6.6: The Ceph Barclamp

The Ceph service consists of the following different roles:

ceph-osd

The virtual block storage service. Install this role on all dedicated Ceph Storage Nodes (at least three), but not on any other node.

ceph-mon

Cluster monitor daemon for the Ceph distributed file system. ceph-mon needs to be installed on three or five dedicated nodes.

ceph-calamari

Sets up the Calamari Web interface which lets you manage the Ceph cluster. Deploying it is optional. The Web interface can be accessed via http://IP-ADDRESS/ (where IP-ADDRESS is the address of the machine where ceph-calamari is deployed). ceph-calamari needs to be installed on a dedicated node; it is not possible to install it on a node running other services.

ceph-radosgw

The HTTP REST gateway for Ceph. Install it on a dedicated node.

ceph-mds

The metadata server daemon for the CephFS. Install it on a dedicated node. For more information on CephFS refer to Chapter 19, Clustered File System.

Figure 6.7: The Ceph Barclamp: Node Deployment Example

7 Upgrading from Previous Releases

This chapter introduces steps to upgrade SUSE Enterprise Storage from the previous release(s) to the current one.

7.1 General Upgrade Procedure

Before upgrading the Ceph cluster, you need to have both the underlying SUSE Linux Enterprise Server and SUSE Enterprise Storage correctly registered against SCC or SMT. You can upgrade daemons in your cluster while the cluster is online and in service. Certain types of daemons depend upon others. For example Ceph RADOS Gateways depend upon Ceph monitors and Ceph OSD daemons. We recommend upgrading in this order:

  1. Admin node (if you deployed the cluster using the admin node).

  2. Ceph monitors.

  3. Ceph OSD daemons.

  4. Ceph RADOS Gateways.

  5. iSCSI gateways (refer to https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_iscsi_up for more details).

Tip

We recommend upgrading all the daemons of a specific type—for example all monitor daemons or all OSD daemons—one by one to ensure that they are all on the same release. We also recommend that you upgrade all the daemons in your cluster before you try to exercise new functionality in a release.

After all the daemons of a specific type are upgraded, check their status.

Ensure each monitor has rejoined the quorum after all monitors are upgraded:

ceph mon stat

Ensure each Ceph OSD daemon has rejoined the cluster after all OSDs are upgraded:

ceph osd stat

7.2 Upgrade from SUSE Enterprise Storage 2.1 to 4

Tip

The SUSE Enterprise Storage 2.1 to 4 upgrade procedure includes all the steps valid for upgrading SUSE Enterprise Storage 2.1 to 3 (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html). Therefore, this procedure often references them.

Important
Important: Software Requirements

You need to have the following software installed and updated to the latest package versions on all the Ceph nodes you want to upgrade before you can start with the upgrade procedure:

  • SUSE Linux Enterprise Server 12 SP1

  • SUSE Enterprise Storage 2.1

To upgrade the SUSE Enterprise Storage 2.1 cluster to version 4, follow these steps on each cluster node:

  1. Warning
    Warning: Do Not Run zypper dup or Reboot the Node

    After you prepare for the upgrade to SUSE Linux Enterprise Server 12 SP2 as suggested later in this step, do not run zypper dup or reboot the node as its Ceph related services may not start correctly.

    Upgrade the current SUSE Linux Enterprise Server to version 12 SP2. Refer to https://www.suse.com/documentation/sles-12/book_sle_deployment/data/cha_update_sle.html for more information on supported upgrade methods.

  2. List all the configured repository services with zypper ls.

    zypper ls
    #| Alias                                      | Name | Enabled | Refresh | Type
    -+--------------------------------------------+------+---------+---------+------
    1| SUSE_Enterprise_Storage_3_x86_64           | ...  | Yes     | Yes     | ris
    2| SUSE_Linux_Enterprise_Server_12_SP2_x86_64 | ...  | Yes     | Yes     | ris
    [...]

    Verify that services related to SUSE Linux Enterprise Server 12 SP2 are present and enabled.

  3. Remove the current SUSE Enterprise Storage service. You can do it as follows:

    sudo zypper rs ID
  4. Activate the SUSE Enterprise Storage 4 service, for example with yast2 add-on.

  5. Refresh new software repositories:

    sudo zypper ref
  6. Install the upgrade helper package:

    sudo zypper in ses-upgrade-helper
  7. Run the upgrade script:

    sudo upgrade-ses.sh

    The script does the distribution upgrade of the node. After reboot, the node comes up with SUSE Linux Enterprise Server 12 SP2 and SUSE Enterprise Storage 4 running.

  8. Check the ownership of /var/lib/ceph (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_varlibceph).

  9. Check the RADOS Gateway instance name (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_rgw).

  10. Check the RADOS Gateway log file (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_rgwlog).

  11. Check CRUSH Tunables (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_tunables).

  12. Check OSD partition GUID codes (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_guid).

  13. Upgrade the admin node (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_admin_node).

  14. Upgrade the Calamari server (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2_1to3_calamari).

  15. Set the require_jewel_osds osdmap flag (https://www.suse.com/documentation/ses-3/book_storage_admin/data/ceph_upgrade_2_1to3.html#ceph_upgrade_2to3_jewel).

7.3 Upgrade from SUSE Enterprise Storage 3 to 4

Important
Important: Software Requirements

You need to have the following software installed and updated to the latest package versions on all the Ceph nodes you want to upgrade before you can start with the upgrade procedure:

  • SUSE Linux Enterprise Server 12 SP1

  • SUSE Enterprise Storage 3

We recommend upgrading SUSE Linux Enterprise Server and SUSE Enterprise Storage in one step using the zypper migration command. Find more detailed information in the Service Pack Migration (https://www.suse.com/documentation/sles-12/book_sle_deployment/data/cha_update_spmigration.html) section, more specifically in the Migrating with Zypper (https://www.suse.com/documentation/sles-12/book_sle_deployment/data/sec_update_migr_zypper_onlinemigr.html) subsection and the zypper migration manual page (man zypper-migration or zypper help migration).

zypper migration upgrades both the underlying SUSE Linux Enterprise Server and the SUSE Enterprise Storage product. After it successfully finishes, reboot the node manually as soon as possible. After reboot, the node comes up with SUSE Linux Enterprise Server 12 SP2 and SUSE Enterprise Storage 4 running.
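
A minimal sketch of this procedure on a single node:

sudo zypper migration
sudo reboot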

Part III Operating a Cluster

8 Introduction

In this part of the manual you will learn how to start or stop Ceph services, how to monitor a cluster's state, how to use and modify CRUSH maps, and how to manage storage pools.

It also includes advanced topics, for example how to manage users and authentication in general, how to manage pool and RADOS device snapshots, how to set up erasure coded pools, or how to increase the cluster performance with cache tiering.

9 Operating Ceph Services

Ceph related services are operated with the systemctl command. The operation takes place on the node you are currently logged in to. You need to have root privileges to be able to operate on Ceph services.

9.1 Starting, Stopping, and Restarting Services using Targets

To facilitate starting, stopping, and restarting all the services of a particular type (for example all Ceph services, or all MONs, or all OSDs) on a node, Ceph provides the following systemd unit files:

ceph.target
ceph-osd.target
ceph-mon.target
ceph-mds.target
ceph-radosgw.target
ceph-rbd-mirror.target

To start/stop/restart all Ceph services on the node, run:

systemctl stop ceph.target
systemctl start ceph.target
systemctl restart ceph.target

To start/stop/restart all OSDs on the node, run:

systemctl stop ceph-osd.target
systemctl start ceph-osd.target
systemctl restart ceph-osd.target

Commands for the other targets are analogous.

9.2 Starting, Stopping, and Restarting Individual Services

You can operate individual services using the following parameterized systemd unit files:

ceph-osd@.service
ceph-mon@.service
ceph-mds@.service
ceph-radosgw@.service
ceph-rbd-mirror@.service

To use these commands, you first need to identify the name of the service you want to operate. See Section 9.3, “Identifying Individual Services” to learn more about services identification.

To start/stop/restart the osd.1 service, run:

systemctl stop ceph-osd@1.service
systemctl start ceph-osd@1.service
systemctl restart ceph-osd@1.service

Commands for the other service types are analogous.

9.3 Identifying Individual Services

You can find out the names/numbers of a particular type of service by running systemctl and filtering the results with the grep command. For example:

systemctl | grep -i 'ceph-osd.*service'
systemctl | grep -i 'ceph-mon.*service'
[...]

9.4 Service Status

You can query systemd for the status of services. For example:

systemctl status ceph-osd@1.service
systemctl status ceph-mon@vanguard2.service

If you do not know the exact name/number of the service, see Section 9.3, “Identifying Individual Services”.

10 Determining Cluster State

Once you have a running cluster, you may use the ceph tool to monitor your cluster. Determining cluster state typically involves checking OSD status, monitor status, placement group status and metadata server status.

Tip
Tip: Interactive Mode

To run the ceph tool in an interactive mode, type ceph at the command line with no arguments. The interactive mode is more convenient if you are going to enter several ceph commands in a row. For example:

ceph
ceph> health
ceph> status
ceph> quorum_status
ceph> mon_status

10.1 Checking Cluster Health

After you start your cluster, and before you start reading and/or writing data, check your cluster’s health first. You can check on the health of your Ceph cluster with the following:

ceph health
HEALTH_WARN 10 pgs degraded; 100 pgs stuck unclean; 1 mons down, quorum 0,2 \
node-1,node-2,node-3

If you specified non-default locations for your configuration or keyring, you may specify their locations:

ceph -c /path/to/conf -k /path/to/keyring health

Upon starting the Ceph cluster, you will likely encounter a health warning such as HEALTH_WARN XXX num placement groups stale. Wait a few moments and check it again. When your cluster is ready, ceph health should return a message such as HEALTH_OK. At that point, it is okay to begin using the cluster.

10.2 Watching a Cluster

To watch the cluster’s ongoing events, open a new terminal and enter:

ceph -w

Ceph will print each event. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:

cluster b370a29d-9287-4ca3-ab57-3d824f65e339
 health HEALTH_OK
 monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
 osdmap e63: 2 osds: 2 up, 2 in
  pgmap v41338: 952 pgs, 20 pools, 17130 MB data, 2199 objects
        115 GB used, 167 GB / 297 GB avail
             952 active+clean

2014-06-02 15:45:21.655871 osd.0 [INF] 17.71 deep-scrub ok
2014-06-02 15:45:47.880608 osd.1 [INF] 1.0 scrub ok
2014-06-02 15:45:48.865375 osd.1 [INF] 1.3 scrub ok
2014-06-02 15:45:50.866479 osd.1 [INF] 1.4 scrub ok
[...]
2014-06-02 15:45:55.720929 mon.0 [INF] pgmap v41343: 952 pgs: \
 1 active+clean+scrubbing+deep, 951 active+clean; 17130 MB data, 115 GB used, \
 167 GB / 297 GB avail

The output provides the following information:

  • Cluster ID

  • Cluster health status

  • The monitor map epoch and the status of the monitor quorum

  • The OSD map epoch and the status of OSDs

  • The placement group map version

  • The number of placement groups and pools

  • The notional amount of data stored and the number of objects stored

  • The total amount of data stored.

Tip
Tip: How Ceph Calculates Data Usage

The used value reflects the actual amount of raw storage used. The xxx GB / xxx GB value means the amount available (the lesser number) out of the overall storage capacity of the cluster. The notional number reflects the size of the stored data before it is replicated, cloned, or snapshotted. Therefore, the amount of data actually stored typically exceeds the notional amount, because Ceph creates replicas of the data and may also use storage capacity for cloning and snapshotting.

10.3 Checking a Cluster’s Usage Stats

To check a cluster’s data usage and data distribution among pools, you can use the df option. It is similar to Linux df. Execute the following:

ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    27570M     27304M         266M          0.97
POOLS:
    NAME             ID     USED     %USED     MAX AVAIL     OBJECTS
    data             0       120         0         5064M           4
    metadata         1         0         0         5064M           0
    rbd              2         0         0         5064M           0
    hot-storage      4       134         0         4033M           2
    cold-storage     5      227k         0         5064M           1
    pool1            6         0         0         5064M           0

The GLOBAL section of the output provides an overview of the amount of storage your cluster uses for your data.

The POOLS section of the output provides a list of pools and the notional usage of each pool. The output from this section does not reflect replicas, clones or snapshots. For example, if you store an object with 1MB of data, the notional usage will be 1MB, but the actual usage may be 2MB or more depending on the number of replicas, clones and snapshots.

  • NAME: The name of the pool.

  • ID: The pool ID.

  • USED: The notional amount of data stored in kilobytes, unless the value is suffixed with M for megabytes or G for gigabytes.

  • %USED: The notional percentage of storage used per pool.

  • OBJECTS: The notional number of objects stored per pool.

Note

The numbers in the POOLS section are notional. They are not inclusive of the number of replicas, snapshots or clones. As a result, the sum of the USED and %USED amounts will not add up to the RAW USED and %RAW USED amounts in the GLOBAL section of the output.

10.4 Checking a Cluster’s Status

To check a cluster’s status, execute the following:

ceph status

or

ceph -s

In interactive mode, type status and press Enter.

ceph> status

Ceph will print the cluster status. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:

cluster b370a29d-9287-4ca3-ab57-3d824f65e339
 health HEALTH_OK
 monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
 osdmap e63: 2 osds: 2 up, 2 in
  pgmap v41332: 952 pgs, 20 pools, 17130 MB data, 2199 objects
        115 GB used, 167 GB / 297 GB avail
               1 active+clean+scrubbing+deep
             951 active+clean

10.5 Checking OSD Status

You can check OSDs to ensure they are up and in by executing:

ceph osd stat

or

ceph osd dump

You can also view OSDs according to their position in the CRUSH map.

ceph osd tree

Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up and their weight.

# id    weight  type name       up/down reweight
-1      3       pool default
-3      3               rack mainrack
-2      3                       host osd-host
0       1                               osd.0   up      1
1       1                               osd.1   up      1
2       1                               osd.2   up      1

10.6 Checking Monitor Status

If your cluster has multiple monitors (likely), you should check the monitor quorum status after you start the cluster before reading and/or writing data. A quorum must be present when multiple monitors are running. You should also check monitor status periodically to ensure that they are running.

To display the monitor map, execute the following:

ceph mon stat

or

ceph mon dump

To check the quorum status for the monitor cluster, execute the following:

ceph quorum_status

Ceph will return the quorum status. For example, a Ceph cluster consisting of three monitors may return the following:

{ "election_epoch": 10,
  "quorum": [
        0,
        1,
        2],
  "monmap": { "epoch": 1,
      "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
      "modified": "2011-12-12 13:28:27.505520",
      "created": "2011-12-12 13:28:27.505520",
      "mons": [
            { "rank": 0,
              "name": "a",
              "addr": "127.0.0.1:6789\/0"},
            { "rank": 1,
              "name": "b",
              "addr": "127.0.0.1:6790\/0"},
            { "rank": 2,
              "name": "c",
              "addr": "127.0.0.1:6791\/0"}
           ]
    }
}

10.7 Checking Placement Group States

Placement groups map objects to OSDs. When you monitor your placement groups, you will want them to be active and clean. For a detailed discussion, refer to Monitoring OSDs and Placement Groups (http://docs.ceph.com/docs/master/rados/operations/monitoring-osd-pg).

10.8 Using the Admin Socket

The Ceph admin socket allows you to query a daemon via a socket interface. By default, Ceph sockets reside under /var/run/ceph. To access a daemon via the admin socket, log in to the host running the daemon and use the following command:

ceph --admin-daemon /var/run/ceph/socket-name

To view the available admin socket commands, execute the following command:

ceph --admin-daemon /var/run/ceph/socket-name help

The admin socket command enables you to show and set your configuration at runtime. Refer to Viewing a Configuration at Runtime (http://docs.ceph.com/docs/master/rados/configuration/ceph-conf#ceph-runtime-config) for details.

Additionally, you can set configuration values at runtime directly (the admin socket bypasses the monitor, unlike ceph tell daemon-type.id injectargs, which relies on the monitor but does not require you to log in directly to the host in question).
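
For example, a sketch of both approaches, assuming an OSD daemon with ID 1 and the default socket location; the debug_osd option is used only as an illustration:

ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config set debug_osd 20
ceph tell osd.1 injectargs '--debug-osd 20'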

11 Authentication with cephx

To identify clients and protect against man-in-the-middle attacks, Ceph provides its cephx authentication system. Clients in this context are either human users—such as the admin user—or Ceph-related services/daemons, for example OSDs, monitors, or RADOS Gateways.

Note

The cephx protocol does not address data encryption in transport, such as TLS/SSL.

11.1 Authentication Architecture

cephx uses shared secret keys for authentication, meaning both the client and the monitor cluster have a copy of the client’s secret key. The authentication protocol enables both parties to prove to each other that they have a copy of the key without actually revealing it. This provides mutual authentication, which means the cluster is sure the user possesses the secret key, and the user is sure that the cluster has a copy of the secret key as well.

A key scalability feature of Ceph is to avoid a centralized interface to the Ceph object store. This means that Ceph clients can interact with OSDs directly. To protect data, Ceph provides its cephx authentication system, which authenticates Ceph clients.

Each monitor can authenticate clients and distribute keys, so there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure that contains a session key for use in obtaining Ceph services. This session key is itself encrypted with the client’s permanent secret key, so that only the client can request services from the Ceph monitors. The client then uses the session key to request its desired services from the monitor, and the monitor provides the client with a ticket that will authenticate the client to the OSDs that actually handle data. Ceph monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. cephx tickets expire, so an attacker cannot use an expired ticket or session key obtained wrongfully. This form of authentication will prevent attackers with access to the communications medium from either creating bogus messages under another client’s identity or altering another client’s legitimate messages, as long as the client secret key is not revealed before it expires.

To use cephx, an administrator must set up clients/users first. In the following diagram, the client.admin user invokes ceph auth get-or-create-key from the command line to generate a user name and secret key. Ceph’s auth subsystem generates the user name and key, stores a copy with the monitor(s) and transmits the user’s secret back to the client.admin user. This means that the client and the monitor share a secret key.

Figure 11.1: Basic cephx Authentication

To authenticate with the monitor, the client passes the user name to the monitor. The monitor generates a session key and encrypts it with the secret key associated with the user name and transmits the encrypted ticket back to the client. The client then decrypts the data with the shared secret key to retrieve the session key. The session key identifies the user for the current session. The client then requests a ticket related to the user, which is signed by the session key. The monitor generates a ticket, encrypts it with the user’s secret key and transmits it back to the client. The client decrypts the ticket and uses it to sign requests to OSDs and metadata servers throughout the cluster.

Figure 11.2: cephx Authentication

The cephx protocol authenticates ongoing communications between the client machine and the Ceph servers. Each message sent between a client and a server after the initial authentication is signed using a ticket that the monitors, OSDs, and metadata servers can verify with their shared secret.

Figure 11.3: cephx Authentication - MDS and OSD
Important

The protection offered by this authentication is between the Ceph client and the Ceph cluster hosts. The authentication is not extended beyond the Ceph client. If the user accesses the Ceph client from a remote host, Ceph authentication is not applied to the connection between the user’s host and the client host.

11.2 Key Management

This section describes Ceph client users and their authentication and authorization with the Ceph storage cluster. Users are either individuals or system actors such as applications, which use Ceph clients to interact with the Ceph storage cluster daemons.

When Ceph runs with authentication and authorization enabled (enabled by default), you must specify a user name and a keyring containing the secret key of the specified user (usually via the command line). If you do not specify a user name, Ceph will use client.admin as the default user name. If you do not specify a keyring, Ceph will look for a keyring via the keyring setting in the Ceph configuration file. For example, if you execute the ceph health command without specifying a user name or keyring, Ceph interprets the command like this:

ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health

Alternatively, you may use the CEPH_ARGS environment variable to avoid re-entering the user name and secret.
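
For example, a sketch assuming the client.admin user and the default keyring location:

export CEPH_ARGS="--id admin --keyring /etc/ceph/ceph.client.admin.keyring"
ceph health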

11.2.1 Background Information

Regardless of the type of Ceph client (for example, block device, object storage, file system, native API), Ceph stores all data as objects within pools. Ceph users need to have access to pools in order to read and write data. Additionally, Ceph users must have execute permissions to use Ceph's administrative commands. The following concepts will help you understand Ceph user management.

11.2.1.1 User

A user is either an individual or a system actor such as an application. Creating users allows you to control who (or what) can access your Ceph storage cluster, its pools, and the data within pools.

Ceph users have a type. For the purposes of user management, the type is always client. Ceph identifies users in a period (.) delimited form consisting of the user type and the user ID, for example TYPE.ID, such as client.admin or client.user1. The reason for user typing is that Ceph monitors, OSDs, and metadata servers also use the cephx protocol, but they are not clients. Distinguishing the user type helps to distinguish between client users and other users, streamlining access control, user monitoring, and traceability.

Note

A Ceph storage cluster user is not the same as a Ceph object storage user or a Ceph file system user. The Ceph RADOS Gateway uses a Ceph storage cluster user to communicate between the gateway daemon and the storage cluster, but the gateway has its own user management functionality for end users. The Ceph file system uses POSIX semantics. The user space associated with it is not the same as a Ceph storage cluster user.

11.2.1.2 Authorization and Capabilities

Ceph uses the term 'capabilities' (caps) to describe authorizing an authenticated user to exercise the functionality of the monitors, OSDs, and metadata servers. Capabilities can also restrict access to data within a pool or a namespace within a pool. A Ceph administrative user sets a user's capabilities when creating or updating a user.

Capability syntax follows the form:

daemon-type 'allow capability' [...]

Following is a list of capabilities for each service type:

Monitor capabilities

include r, w, x and allow profile cap.

mon 'allow rwx'
mon 'allow profile osd'
OSD capabilities

include r, w, x, class-read, class-write and profile osd. Additionally, OSD capabilities also allow for pool and namespace settings.

osd 'allow capability' [pool=poolname] [namespace=namespace-name]
MDS capability

simply requires allow, or blank.

mds 'allow'

The following entries describe each capability:

allow

Precedes access settings for a daemon. Implies rw for MDS only.

r

Gives the user read access. Required with monitors to retrieve the CRUSH map.

w

Gives the user write access to objects.

x

Gives the user the capability to call class methods (both read and write) and to conduct auth operations on monitors.

class-read

Gives the user the capability to call class read methods. Subset of x.

class-write

Gives the user the capability to call class write methods. Subset of x.

*

Gives the user read, write, and execute permissions for a particular daemon/pool, and the ability to execute admin commands.

profile osd

Gives a user permissions to connect as an OSD to other OSDs or monitors. Conferred on OSDs to enable OSDs to handle replication heartbeat traffic and status reporting.

profile mds

Gives a user permissions to connect as an MDS to other MDSs or monitors.

profile bootstrap-osd

Gives a user permissions to bootstrap an OSD. Delegated to deployment tools such as ceph-disk, ceph-deploy so that they have permissions to add keys when bootstrapping an OSD.

profile bootstrap-mds

Gives a user permissions to bootstrap a metadata server. Delegated to deployment tools such as ceph-deploy so they have permissions to add keys when bootstrapping a metadata server.

11.2.1.3 Pools

A pool is a logical partition where users store data. In Ceph deployments, it is common to create a pool as a logical partition for similar types of data. For example, when deploying Ceph as a back-end for OpenStack, a typical deployment would have pools for volumes, images, backups and virtual machines, and users such as client.glance or client.cinder.

11.2.2 Managing Users

User management functionality provides Ceph cluster administrators with the ability to create, update, and delete users directly in the Ceph cluster.

When you create or delete users in the Ceph cluster, you may need to distribute keys to clients so that they can be added to keyrings. See Section 11.2.3, “Keyring Management” for details.

11.2.2.1 Listing Users

To list the users in your cluster, execute the following:

ceph auth list

Ceph will list all users in your cluster. For example, in a cluster with two nodes, ceph auth list output looks similar to this:

installed auth entries:

osd.0
        key: AQCvCbtToC6MDhAATtuT70Sl+DymPCfDSsyV4w==
        caps: [mon] allow profile osd
        caps: [osd] allow *
osd.1
        key: AQC4CbtTCFJBChAAVq5spj0ff4eHZICxIOVZeA==
        caps: [mon] allow profile osd
        caps: [osd] allow *
client.admin
        key: AQBHCbtT6APDHhAA5W00cBchwkQjh3dkKsyPjw==
        caps: [mds] allow
        caps: [mon] allow *
        caps: [osd] allow *
client.bootstrap-mds
        key: AQBICbtTOK9uGBAAdbe5zcIGHZL3T/u2g6EBww==
        caps: [mon] allow profile bootstrap-mds
client.bootstrap-osd
        key: AQBHCbtT4GxqORAADE5u7RkpCN/oo4e5W0uBtw==
        caps: [mon] allow profile bootstrap-osd
Note
Note: TYPE.ID Notation

Note that the TYPE.ID notation for users applies such that osd.0 specifies a user of type osd and its ID is 0. client.admin is a user of type client and its ID is admin. Note also that each entry has a key: value entry, and one or more caps: entries.

You may use the -o filename option with ceph auth list to save the output to a file.
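
For example (the output file name is arbitrary):

ceph auth list -o /tmp/auth-list.keyring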

11.2.2.2 Getting Information about Users

To retrieve a specific user, key, and capabilities, execute the following:

ceph auth get TYPE.ID

For example:

ceph auth get client.admin
exported keyring for client.admin
[client.admin]
	key = AQA19uZUqIwkHxAAFuUwvq0eJD4S173oFRxe0g==
	caps mds = "allow"
	caps mon = "allow *"
 caps osd = "allow *"

Developers may also execute the following:

ceph auth export TYPE.ID

The auth export command is identical to auth get, but also prints the internal authentication ID.

11.2.2.3 Adding Users

Adding a user creates a user name (TYPE.ID), a secret key, and any capabilities included in the command you use to create the user.

A user's key enables the user to authenticate with the Ceph storage cluster. The user's capabilities authorize the user to read, write, or execute on Ceph monitors (mon), Ceph OSDs (osd), or Ceph metadata servers (mds).

There are a few commands available to add a user:

ceph auth add

This command is the canonical way to add a user. It will create the user, generate a key, and add any specified capabilities.

ceph auth get-or-create

This command is often the most convenient way to create a user, because it returns a keyfile format with the user name (in brackets) and the key. If the user already exists, this command simply returns the user name and key in the keyfile format. You may use the -o filename option to save the output to a file.

ceph auth get-or-create-key

This command is a convenient way to create a user and return the user's key (only). This is useful for clients that need the key only (for example libvirt). If the user already exists, this command simply returns the key. You may use the -o filename option to save the output to a file.

When creating client users, you may create a user with no capabilities. A user with no capabilities can authenticate but nothing more. Such a client cannot retrieve the cluster map from the monitor. However, you can create a user with no capabilities if you want to defer adding capabilities until later, using the ceph auth caps command.
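
For example, a sketch that creates a user with no capabilities first and assigns capabilities later; the user name client.fred is only an example:

ceph auth get-or-create client.fred
ceph auth caps client.fred mon 'allow r' osd 'allow rw pool=liverpool'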

A typical user has at least read capabilities on the Ceph monitor and read and write capabilities on Ceph OSDs. Additionally, a user's OSD permissions are often restricted to accessing a particular pool.

ceph auth add client.john mon 'allow r' osd \
 'allow rw pool=liverpool'
ceph auth get-or-create client.paul mon 'allow r' osd \
 'allow rw pool=liverpool'
ceph auth get-or-create client.george mon 'allow r' osd \
 'allow rw pool=liverpool' -o george.keyring
ceph auth get-or-create-key client.ringo mon 'allow r' osd \
 'allow rw pool=liverpool' -o ringo.key
Important

If you provide a user with capabilities to OSDs, but you do not restrict access to particular pools, the user will have access to all pools in the cluster.

11.2.2.4 Modifying User Capabilities

The ceph auth caps command allows you to specify a user and change the user's capabilities. Setting new capabilities will overwrite current ones. To view current capabilities run ceph auth get USERTYPE.USERID. To add capabilities, you also need to specify the existing capabilities when using the following form:

ceph auth caps USERTYPE.USERID daemon 'allow [r|w|x|*|...] \
     [pool=pool-name] [namespace=namespace-name]' [daemon 'allow [r|w|x|*|...] \
     [pool=pool-name] [namespace=namespace-name]']

For example:

ceph auth get client.john
ceph auth caps client.john mon 'allow r' osd 'allow rw pool=prague'
ceph auth caps client.paul mon 'allow rw' osd 'allow rwx pool=prague'
ceph auth caps client.brian-manager mon 'allow *' osd 'allow *'

To remove a capability, you may reset the capability. If you want the user to have no access to a particular daemon that was previously set, specify an empty string:

ceph auth caps client.ringo mon ' ' osd ' '

11.2.2.5 Deleting Users

To delete a user, use ceph auth del:

ceph auth del TYPE.ID

where TYPE is one of client, osd, mon, or mds, and ID is the user name or ID of the daemon.

11.2.2.6 Printing a User's Key

To print a user’s authentication key to standard output, execute the following:

ceph auth print-key TYPE.ID

where TYPE is one of client, osd, mon, or mds, and ID is the user name or ID of the daemon.

Printing a user's key is useful when you need to populate client software with a user's key (such as libvirt), as in the following example:

mount -t ceph host:/ mount_point \
-o name=client.user,secret=`ceph auth print-key client.user`

11.2.2.7 Importing Users

To import one or more users, use ceph auth import and specify a keyring:

sudo ceph auth import -i /etc/ceph/ceph.keyring
Note

The Ceph storage cluster will add new users, their keys and their capabilities and will update existing users, their keys and their capabilities.

11.2.3 Keyring Management

When you access Ceph via a Ceph client, the client will look for a local keyring. Ceph presets the keyring setting with the following four keyring names by default so you do not need to set them in your Ceph configuration file unless you want to override the defaults:

/etc/ceph/cluster.name.keyring
/etc/ceph/cluster.keyring
/etc/ceph/keyring
/etc/ceph/keyring.bin

The cluster metavariable is your Ceph cluster name as defined by the name of the Ceph configuration file. ceph.conf means that the cluster name is ceph, thus ceph.keyring. The name metavariable is the user type and user ID, for example client.admin, thus ceph.client.admin.keyring.

After you create a user (for example client.ringo), you must get the key and add it to a keyring on a Ceph client so that the user can access the Ceph storage cluster.

Section 11.2, “Key Management” details how to list, get, add, modify and delete users directly in the Ceph storage cluster. However, Ceph also provides the ceph-authtool utility to allow you to manage keyrings from a Ceph client.

11.2.3.1 Creating a Keyring

When you use the procedures in Section 11.2, “Key Management” to create users, you need to provide user keys to the Ceph client(s) so that the client can retrieve the key for the specified user and authenticate with the Ceph storage cluster. Ceph clients access keyrings to look up a user name and retrieve the user's key:

sudo ceph-authtool --create-keyring /path/to/keyring

When creating a keyring with multiple users, we recommend using the cluster name (for example cluster.keyring) for the keyring file name and saving it in the /etc/ceph directory so that the keyring configuration default setting will pick up the file name without requiring you to specify it in the local copy of your Ceph configuration file. For example, create ceph.keyring by executing the following:

sudo ceph-authtool -C /etc/ceph/ceph.keyring

When creating a keyring with a single user, we recommend using the cluster name, the user type and the user name and saving it in the /etc/ceph directory. For example, ceph.client.admin.keyring for the client.admin user.

11.2.3.2 Adding a User to a Keyring

When you add a user to the Ceph storage cluster (see Section 11.2.2.3, “Adding Users”), you can retrieve the user, key and capabilities, and save the user to a keyring.

If you only want to use one user per keyring, the ceph auth get command with the -o option will save the output in the keyring file format. For example, to create a keyring for the client.admin user, execute the following:

ceph auth get client.admin -o /etc/ceph/ceph.client.admin.keyring

When you want to import users to a keyring, you can use ceph-authtool to specify the destination keyring and the source keyring:

sudo ceph-authtool /etc/ceph/ceph.keyring \
  --import-keyring /etc/ceph/ceph.client.admin.keyring

11.2.3.3 Creating a User

Ceph provides the ceph auth add command to create a user directly in the Ceph storage cluster. However, you can also create a user, keys and capabilities directly on a Ceph client keyring. Then, you can import the user to the Ceph storage cluster:

sudo ceph-authtool -n client.ringo --cap osd 'allow rwx' \
  --cap mon 'allow rwx' /etc/ceph/ceph.keyring

You can also create a keyring and add a new user to the keyring simultaneously:

sudo ceph-authtool -C /etc/ceph/ceph.keyring -n client.ringo \
  --cap osd 'allow rwx' --cap mon 'allow rwx' --gen-key

In the previous scenarios, the new user client.ringo exists only in the keyring. To make the new user known to the Ceph storage cluster, you must still add it to the cluster:

sudo ceph auth add client.ringo -i /etc/ceph/ceph.keyring

11.2.3.4 Modifying Users

To modify the capabilities of a user record in a keyring, specify the keyring and the user followed by the capabilities:

sudo ceph-authtool /etc/ceph/ceph.keyring -n client.ringo \
  --cap osd 'allow rwx' --cap mon 'allow rwx'

To update the modified user within the Ceph cluster environment, you must import the changes from the keyring to the user entry in the Ceph cluster:

ceph auth import -i /etc/ceph/ceph.keyring

See Section 11.2.2.7, “Importing Users” for details on updating a Ceph storage cluster user from a keyring.

11.2.4 Command Line Usage

The ceph command supports the following options related to the user name and secret manipulation:

--id or --user

Ceph identifies users with a type and an ID (TYPE.ID, such as client.admin or client.user1). The --id and --user options enable you to specify the ID portion of the user name (for example admin or user1) while omitting the type. For example, to specify user client.foo enter the following:

ceph --id foo --keyring /path/to/keyring health
ceph --user foo --keyring /path/to/keyring health
--name or -n

Ceph identifies users with a type and an ID (TYPE.ID, such as client.admin or client.user1). The --name and -n options enable you to specify the fully qualified user name. You must specify the user type (typically client) with the user ID:

ceph --name client.foo --keyring /path/to/keyring health
ceph -n client.foo --keyring /path/to/keyring health
--keyring

The path to the keyring containing one or more user names and secrets. The --secret option provides the same functionality, but it does not work with RADOS Gateway, which uses --secret for another purpose. You may retrieve a keyring with ceph auth get-or-create and store it locally. This is the preferred approach, because you can switch user names without switching the keyring path:

sudo rbd map --id foo --keyring /path/to/keyring mypool/myimage

12 Stored Data Management

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

CRUSH maps contain a list of OSDs, a list of ‘buckets’ for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.

When you create a configuration file and deploy Ceph with ceph-deploy, Ceph generates a default CRUSH map for your configuration. The default CRUSH map is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.
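
For example, a sketch of extracting and decompiling the current CRUSH map for inspection before customizing it (the file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt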

For example, if an OSD goes down, a CRUSH map can help you locate the physical data center, room, row and rack of the host with the failed OSD in the event you need to use on-site support or replace hardware.

Similarly, CRUSH may help you identify faults more quickly. For example, if all OSDs in a particular rack go down simultaneously, the fault may lie with a network switch or the power to the rack rather than with the OSDs themselves.

A custom CRUSH map can also help you identify the physical locations where Ceph stores redundant copies of data when the placement group(s) associated with a failed host are in a degraded state.

There are three main sections to a CRUSH Map.

  • Devices consist of any object storage device, that is, the hard disk corresponding to a ceph-osd daemon.

  • Buckets consist of a hierarchical aggregation of storage locations (for example rows, racks, hosts, etc.) and their assigned weights.

  • Rule Sets consist of the manner of selecting buckets.

12.1 Devices

To map placement groups to OSDs, a CRUSH Map requires a list of OSD devices (the name of the OSD daemon). The list of devices appears first in the CRUSH Map.

#devices
device num osd.name

For example:

#devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

As a general rule, an OSD daemon maps to a single disk.

12.2 Buckets

CRUSH maps contain a list of OSDs, which can be organized into 'buckets' for aggregating the devices into physical locations. The following bucket types are defined, listed with their numeric type IDs:

  • 0 OSD: An OSD daemon (osd.1, osd.2, etc.).

  • 1 Host: A host name containing one or more OSDs.

  • 2 Chassis: The chassis of which the rack is composed.

  • 3 Rack: A computer rack. The default is unknownrack.

  • 4 Row: A row in a series of racks.

  • 5 PDU: A power distribution unit.

  • 6 Pod: A group of racks or rows, typically within a room.

  • 7 Room: A room containing racks and rows of hosts.

  • 8 Data Center: A physical data center containing rooms.

  • 9 Region: A geographical region containing one or more data centers.

  • 10 Root: The root node of the bucket hierarchy.

Tip
Tip

You can remove these types and create your own bucket types.

Ceph’s deployment tools generate a CRUSH map that contains a bucket for each host, and a root bucket named 'default', which is useful for the default rbd pool. The remaining bucket types provide a means for storing information about the physical location of nodes/buckets, which makes cluster administration much easier when OSDs, hosts, or network hardware malfunction and the administrator needs access to physical hardware.

A bucket has a type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity/capability of its item(s), the bucket algorithm (straw by default), and the hash (0 by default, reflecting the CRUSH hash rjenkins1). A bucket may have one or more items. The items may consist of other buckets or OSDs. Items may have a weight that reflects the relative weight of the item.

[bucket-type] [bucket-name] {
  id [a unique negative numeric ID]
  weight [the relative capacity/capability of the item(s)]
  alg [the bucket type: uniform | list | tree | straw ]
  hash [the hash type: 0 by default]
  item [item-name] weight [weight]
}

The following example illustrates how you can use buckets to aggregate a pool and physical locations like a data center, a room, a rack and a row.

host ceph-osd-server-1 {
        id -17
        alg straw
        hash 0
        item osd.0 weight 1.00
        item osd.1 weight 1.00
}

row rack-1-row-1 {
        id -16
        alg straw
        hash 0
        item ceph-osd-server-1 weight 2.00
}

rack rack-3 {
        id -15
        alg straw
        hash 0
        item rack-3-row-1 weight 2.00
        item rack-3-row-2 weight 2.00
        item rack-3-row-3 weight 2.00
        item rack-3-row-4 weight 2.00
        item rack-3-row-5 weight 2.00
}

rack rack-2 {
        id -14
        alg straw
        hash 0
        item rack-2-row-1 weight 2.00
        item rack-2-row-2 weight 2.00
        item rack-2-row-3 weight 2.00
        item rack-2-row-4 weight 2.00
        item rack-2-row-5 weight 2.00
}

rack rack-1 {
        id -13
        alg straw
        hash 0
        item rack-1-row-1 weight 2.00
        item rack-1-row-2 weight 2.00
        item rack-1-row-3 weight 2.00
        item rack-1-row-4 weight 2.00
        item rack-1-row-5 weight 2.00
}

room server-room-1 {
        id -12
        alg straw
        hash 0
        item rack-1 weight 10.00
        item rack-2 weight 10.00
        item rack-3 weight 10.00
}

datacenter dc-1 {
        id -11
        alg straw
        hash 0
        item server-room-1 weight 30.00
        item server-room-2 weight 30.00
}

pool data {
        id -10
        alg straw
        hash 0
        item dc-1 weight 60.00
        item dc-2 weight 60.00
}

12.3 Rule Sets

CRUSH maps support the notion of 'CRUSH rules', which are the rules that determine data placement for a pool. For large clusters, you will likely create many pools where each pool may have its own CRUSH ruleset and rules. The default CRUSH map has a rule for each pool, and one ruleset assigned to each of the default pools.

Note
Note

In most cases, you will not need to modify the default rules. When you create a new pool, its default ruleset is 0.

A rule takes the following form:

rule rulename {

        ruleset ruleset
        type type
        min_size min-size
        max_size max-size
        step step

}
ruleset

An integer. Classifies a rule as belonging to a set of rules. Activated by setting the ruleset in a pool. This option is required. Default is 0.

Important
Important

Ruleset numbers must be assigned consecutively, starting from the default 0; otherwise the related monitor may crash.

type

A string. Describes a rule for either a hard disk (replicated) or a RAID. This option is required. Default is replicated.

min_size

An integer. If a placement group makes fewer replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 2.

max_size

An integer. If a placement group makes more replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 10.

step take bucket

Takes a bucket name, and begins iterating down the tree. This option is required.

step choose firstn num type bucket-type

Selects the given number of buckets of the given type. Where N is the number of available options: if num > 0 and num < N, choose that many buckets; if num < 0, it means N - num; and if num == 0, choose all N buckets. Follows step take or step choose.

step emit

Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to form different trees in the same rule. Follows step choose.
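
For illustration, a complete replicated rule similar to the default one generated by Ceph's deployment tools may look as follows. This is a sketch that assumes the default root bucket named default and host as the failure domain:

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

The step chooseleaf variant used here first selects buckets of the given type (host) and then descends to a leaf OSD within each selected bucket.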

Important
Important

To activate one or more rules with a common ruleset number for a pool, set that ruleset number on the pool.

12.4 CRUSH Map Manipulation

This section introduces ways to perform basic CRUSH Map manipulation, such as editing a CRUSH Map, changing CRUSH Map parameters, and adding/moving/removing an OSD.

12.4.1 Editing a CRUSH Map

To edit an existing CRUSH map, do the following:

  1. Get a CRUSH Map. To get the CRUSH Map for your cluster, execute the following:

    ceph osd getcrushmap -o compiled-crushmap-filename

    Ceph will output (-o) a compiled CRUSH Map to the file name you specified. Since the CRUSH Map is in a compiled form, you must decompile it first before you can edit it.

  2. Decompile a CRUSH Map. To decompile a CRUSH Map, execute the following:

    crushtool -d compiled-crushmap-filename \
     -o decompiled-crushmap-filename

    Ceph will decompile (-d) the compiled CRUSH map and output (-o) it to the file name you specified.

  3. Edit at least one of Devices, Buckets and Rules parameters.

  4. Compile a CRUSH Map. To compile a CRUSH Map, execute the following:

    crushtool -c decompiled-crush-map-filename \
     -o compiled-crush-map-filename

    Ceph will store a compiled CRUSH map to the file name you specified.

  5. Set a CRUSH Map. To set the CRUSH Map for your cluster, execute the following:

    ceph osd setcrushmap -i compiled-crushmap-filename

    Ceph will input the compiled CRUSH Map of the file name you specified as the CRUSH Map for the cluster.

12.4.2 Add/Move an OSD

To add or move an OSD in the CRUSH map of a running cluster, execute the following:

ceph osd crush set id_or_name weight root=pool-name \
 bucket-type=bucket-name ...
id

An integer. The numeric ID of the OSD. This option is required.

name

A string. The full name of the OSD. This option is required.

weight

A double. The CRUSH weight for the OSD. This option is required.

pool

A key/value pair. By default, the CRUSH hierarchy contains the pool default as its root. This option is required.

bucket-type

Key/value pairs. You may specify the OSD’s location in the CRUSH hierarchy.

The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location.

ceph osd crush set osd.0 1.0 root=data datacenter=dc1 room=room1 \
row=foo rack=bar host=foo-bar-1

12.4.3 Adjust an OSD’s CRUSH Weight

To adjust an OSD’s crush weight in the CRUSH map of a running cluster, execute the following:

ceph osd crush reweight name weight
name

A string. The full name of the OSD. This option is required.

weight

A double. The CRUSH weight for the OSD. This option is required.
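
For example, to set the CRUSH weight of osd.0 to 1.5 (the OSD name and weight here are purely illustrative):

ceph osd crush reweight osd.0 1.5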

12.4.4 Remove an OSD

To remove an OSD from the CRUSH map of a running cluster, execute the following:

ceph osd crush remove name
name

A string. The full name of the OSD. This option is required.
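
For example, to remove the hypothetical osd.0 from the CRUSH map:

ceph osd crush remove osd.0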

12.4.5 Move a Bucket

To move a bucket to a different location or position in the CRUSH map hierarchy, execute the following:

ceph osd crush move bucket-name bucket-type=bucket-name ...
bucket-name

A string. The name of the bucket to move/reposition. This option is required.

bucket-type

Key/value pairs. You may specify the bucket’s location in the CRUSH hierarchy.
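
For example, to move the rack-1 bucket from the earlier example under a different room (the bucket names are illustrative):

ceph osd crush move rack-1 room=server-room-2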

12.5 Mixed SSDs and HDDs on the Same Node

It can be desirable to configure a Ceph cluster such that each node has a mix of SSDs and HDDs, with one storage pool on the fast SSDs and one storage pool on the slower HDDs. To do this, the CRUSH Map needs to be edited.

The default CRUSH Map will have a simple hierarchy, where the default root contains hosts, and the hosts contain OSDs, for example:

cephadm > ceph osd tree
ID  WEIGHT  TYPE NAME         UP/DOWN REWEIGHT
 -1 0.18494 root default
 -2 0.05548     host node1
  0 0.01849         osd.0          up 1.00000
  3 0.01849         osd.3          up 1.00000
  6 0.01849         osd.6          up 1.00000
 -3 0.05548     host node2
  1 0.01849         osd.1          up 1.00000
  4 0.01849         osd.4          up 1.00000
  7 0.01849         osd.7          up 1.00000
 -4 0.05548     host node3
  2 0.01849         osd.2          up 1.00000
  5 0.01849         osd.5          up 1.00000
  8 0.01849         osd.8          up 1.00000

This provides no distinction between disk types. In order to split the OSDs into SSDs and HDDs, we need to create a second hierarchy in the CRUSH Map:

cephadm > ceph osd crush add-bucket ssd root

Having created the new root for SSDs, we need to add hosts to it. This means creating new host entries. Because the same host name cannot appear more than once in a CRUSH Map, these entries use fake host names. The fake host names do not need to be resolvable by DNS; CRUSH does not care what the host names are, they only need to form the correct hierarchy. The one thing that does need to be changed in order to support fake host names is that you must set

osd crush update on start = false

in /etc/ceph/ceph.conf. Otherwise the OSDs you move will be reset later to their original location in the default root, and the cluster will not behave as expected.

Once that setting is changed, add the new fake hosts to the SSD root:

cephadm > ceph osd crush add-bucket node1-ssd host
cephadm > ceph osd crush move node1-ssd root=ssd
cephadm > ceph osd crush add-bucket node2-ssd host
cephadm > ceph osd crush move node2-ssd root=ssd
cephadm > ceph osd crush add-bucket node3-ssd host
cephadm > ceph osd crush move node3-ssd root=ssd

Finally, for each SSD OSD, move the OSD to the SSD root. In this example, we assume that osd.0, osd.1 and osd.2 are physically hosted on SSDs:

cephadm > ceph osd crush add osd.0 1 root=ssd
cephadm > ceph osd crush set osd.0 1 root=ssd host=node1-ssd
cephadm > ceph osd crush add osd.1 1 root=ssd
cephadm > ceph osd crush set osd.1 1 root=ssd host=node2-ssd
cephadm > ceph osd crush add osd.2 1 root=ssd
cephadm > ceph osd crush set osd.2 1 root=ssd host=node3-ssd

The CRUSH hierarchy should now look like this:

cephadm > ceph osd tree
ID WEIGHT  TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-5 3.00000 root ssd
-6 1.00000     host node1-ssd
 0 1.00000         osd.0                    up  1.00000          1.00000
-7 1.00000     host node2-ssd
 1 1.00000         osd.1                    up  1.00000          1.00000
-8 1.00000     host node3-ssd
 2 1.00000         osd.2                    up  1.00000          1.00000
-1 0.11096 root default
-2 0.03699     host node1
 3 0.01849         osd.3                    up  1.00000          1.00000
 6 0.01849         osd.6                    up  1.00000          1.00000
-3 0.03699     host node2
 4 0.01849         osd.4                    up  1.00000          1.00000
 7 0.01849         osd.7                    up  1.00000          1.00000
-4 0.03699     host node3
 5 0.01849         osd.5                    up  1.00000          1.00000
 8 0.01849         osd.8                    up  1.00000          1.00000

Now, create a CRUSH rule that targets the SSD root:

cephadm > ceph osd crush rule create-simple ssd_replicated_ruleset ssd host

The original default replicated_ruleset (with ID 0) will target the HDDs. The new ssd_replicated_ruleset (with ID 1) will target the SSDs.

Any existing pools will still be using the HDDs, because they are in the default hierarchy in the CRUSH map. A new pool can be created to use SSDs only:

cephadm > ceph osd pool create ssd-pool 64 64
cephadm > ceph osd pool set ssd-pool crush_ruleset 1

The ID "1" in the above command needs to match the ID of the new CRUSH role which targets the SSDs.

13 Managing Storage Pools

When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. A pool provides you with:

  • Resilience: You can set how many OSDs are allowed to fail without losing data. For replicated pools, it is the desired number of copies/replicas of an object. A typical configuration stores an object and one additional copy (that is size=2), but you can determine the number of copies/replicas. For erasure coded pools, it is the number of coding chunks (that is m=2 in the erasure code profile).

  • Placement Groups: You can set the number of placement groups for the pool. A typical configuration uses approximately 100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as a whole.

  • CRUSH Rules: When you store data in a pool, a CRUSH ruleset mapped to the pool enables CRUSH to identify a rule for the placement of the object and its replicas (or chunks for erasure coded pools) in your cluster. You can create a custom CRUSH rule for your pool.

  • Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.

  • Set Ownership: You can set a user ID as the owner of a pool.

To organize data into pools, you can list, create, and remove pools. You can also view the usage statistics for each pool.

13.1 Operating Pools

This section introduces practical information to perform basic tasks with pools. You can find out how to list, create, and delete pools, as well as show pool statistics or manage snapshots of a pool.

13.1.1 List Pools

To list your cluster’s pools, execute:

ceph osd lspools
0 rbd, 1 photo_collection, 2 foo_pool,

13.1.2 Create a Pool

To create a replicated pool, execute:

ceph osd pool create pool_name pg_num pgp_num pgp_type crush_ruleset_name expected_num_objects

To create an erasure pool, execute:

ceph osd pool create pool_name pg_num pgp_num pgp_type erasure_code_profile \
 crush_ruleset_name expected_num_objects
pool_name

The name of the pool. It must be unique. This option is required.

pg_num

The total number of placement groups for the pool. This option is required. Default value is 8.

pgp_num

The total number of placement groups for placement purposes. This should be equal to the total number of placement groups, except for placement group splitting scenarios. This option is required. Default value is 8.

pgp_type

The pool type which may either be replicated to recover from lost OSDs by keeping multiple copies of the objects or erasure to get a kind of generalized RAID5 capability. The replicated pools require more raw storage but implement all Ceph operations. The erasure pools require less raw storage but only implement a subset of the available operations. Default is 'replicated'.

crush_ruleset_name

The name of the crush ruleset for this pool. If the specified ruleset does not exist, the creation of a replicated pool fails with -ENOENT, whereas an erasure pool creates a new ruleset with the specified name. The default value is 'erasure-code' for erasure pools. For replicated pools, the default is taken from the Ceph configuration variable osd_pool_default_crush_replicated_ruleset.

erasure_code_profile=profile

For erasure pools only. Use the erasure code profile. It must be an existing profile as defined by osd erasure-code-profile set.

When you create a pool, set the number of placement groups to a reasonable value (for example 100). Consider the total number of placement groups per OSD too. Placement groups are computationally expensive, so performance will degrade when you have many pools with many placement groups (for example 50 pools with 100 placement groups each). The point of diminishing returns depends upon the power of the OSD host.

See Placement Groups (http://docs.ceph.com/docs/master/rados/operations/placement-groups/) for details on calculating an appropriate number of placement groups for your pool.

expected_num_objects

The expected number of objects for this pool. By setting this value, the PG folder splitting happens at pool creation time. This avoids the latency impact of splitting folders at runtime.
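
For example, a minimal command that creates a replicated pool named mypool (a hypothetical name) with 256 placement groups and the default ruleset could look like this:

ceph osd pool create mypool 256 256 replicated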

13.1.3 Set Pool Quotas

You can set pool quotas for the maximum number of bytes and/or the maximum number of objects per pool.

ceph osd pool set-quota pool-name max_objects obj-count max_bytes bytes

For example:

ceph osd pool set-quota data max_objects 10000

To remove a quota, set its value to 0.
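
For example, to remove the object quota set above:

ceph osd pool set-quota data max_objects 0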

13.1.4 Delete a Pool

To delete a pool, execute:

ceph osd pool delete pool-name pool-name --yes-i-really-really-mean-it
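
For example, assuming a hypothetical pool named testpool:

ceph osd pool delete testpool testpool --yes-i-really-really-mean-it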

If you created your own rulesets and rules for a pool you created, you should consider removing them when you no longer need your pool. If you created users with permissions strictly for a pool that no longer exists, you should consider deleting those users too.

13.1.5 Rename a Pool

To rename a pool, execute:

ceph osd pool rename current-pool-name new-pool-name
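
For example (the pool names are hypothetical):

ceph osd pool rename testpool testpool-archive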

If you rename a pool and you have per-pool capabilities for an authenticated user, you must update the user’s capabilities with the new pool name.

13.1.6 Show Pool Statistics

To show a pool’s usage statistics, execute:

rados df
pool name  category  KB  objects   clones  degraded  unfound  rd  rd KB  wr  wr KB
cold-storage    -   228   1         0      0          0       0   0      1   228
data            -    1    4         0      0          0       0   0      4    4
hot-storage     -    1    2         0      0          0       15  10     5   231
metadata        -    0    0         0      0          0       0   0      0    0
pool1           -    0    0         0      0          0       0   0      0    0
rbd             -    0    0         0      0          0       0   0      0    0
total used          266268          7
total avail       27966296
total space       28232564

13.1.7 Set Pool Values

To set a value to a pool, execute:

ceph osd pool set pool-name key value
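
For example, to prevent the data pool from being deleted accidentally, set its nodelete flag:

ceph osd pool set data nodelete 1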

You may set values for the following keys:

size

Sets the number of replicas for objects in the pool. See Section 13.1.9, “Set the Number of Object Replicas” for further details. Replicated pools only.

min_size

Sets the minimum number of replicas required for I/O. See Section 13.1.9, “Set the Number of Object Replicas” for further details. Replicated pools only.

crash_replay_interval

The number of seconds to allow clients to replay acknowledged, but uncommitted requests.

pg_num

The number of placement groups for the pool.

pgp_num

The effective number of placement groups to use when calculating data placement.

crush_ruleset

The ruleset to use for mapping object placement in the cluster.

hashpspool

Set (1) or unset (0) the HASHPSPOOL flag on a given pool. Enabling this flag changes the algorithm to better distribute PGs to OSDs. After enabling this flag on a pool whose HASHPSPOOL flag was set to 0, the cluster starts backfilling to achieve a correct placement of all PGs again. Be aware that this can create a substantial I/O load on the cluster, so plan carefully before enabling it on a highly loaded production cluster.

nodelete

Prevents the pool from being removed.

nopgchange

Prevents the pool's pg_num and pgp_num from being changed.

nosizechange

Prevents the pool's size from being changed.

write_fadvise_dontneed

Set/Unset the WRITE_FADVISE_DONTNEED flag on a given pool.

noscrub,nodeep-scrub

Disables (deep)-scrubbing of the data for the specific pool to resolve temporary high I/O load.

hit_set_type

Enables hit set tracking for cache pools. See Bloom Filter (http://en.wikipedia.org/wiki/Bloom_filter) for additional information. This option can have the following values: bloom, explicit_hash, explicit_object. Default is bloom.

hit_set_count

The number of hit sets to store for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon. Default is 0.

hit_set_period

The duration of a hit set period in seconds for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon.

hit_set_fpp

The false positive probability for the bloom hit set type. See Bloom Filter (http://en.wikipedia.org/wiki/Bloom_filter) for additional information. Valid range is 0.0 - 1.0. Default is 0.05.

cache_target_dirty_ratio

The percentage of the cache pool containing modified (dirty) objects before the cache tiering agent will flush them to the backing storage pool. Default is .4

cache_target_dirty_high_ratio

The percentage of the cache pool containing modified (dirty) objects before the cache tiering agent will flush them to the backing storage pool with a higher speed. Default is .6.

cache_target_full_ratio

The percentage of the cache pool containing unmodified (clean) objects before the cache tiering agent will evict them from the cache pool. Default is .8

target_max_bytes

Ceph will begin flushing or evicting objects when the max_bytes threshold is triggered.

target_max_objects

Ceph will begin flushing or evicting objects when the max_objects threshold is triggered.

hit_set_grade_decay_rate

Temperature decay rate between two successive hit_sets. Default is 20.

hit_set_grade_search_last_n

Count at most N appearances in hit_sets for temperature calculation. Default is 1.

cache_min_flush_age

The time (in seconds) before the cache tiering agent will flush an object from the cache pool to the storage pool.

cache_min_evict_age

The time (in seconds) before the cache tiering agent will evict an object from the cache pool.

fast_read

If this flag is enabled on erasure coding pools, then the read request issues sub-reads to all shards, and waits until it receives enough shards to decode to serve the client. In the case of jerasure and isa erasure plug-ins, when the first K replies return, then the client’s request is served immediately using the data decoded from these replies. This helps to gain some resources for better performance. Currently, this flag is only supported for erasure coding pools. Default is 0.

scrub_min_interval

The minimum interval in seconds for pool scrubbing when the cluster load is low. The default 0 means that the osd_scrub_min_interval value from the Ceph configuration file is used.

scrub_max_interval

The maximum interval in seconds for pool scrubbing, regardless of the cluster load. The default 0 means that the osd_scrub_max_interval value from the Ceph configuration file is used.

deep_scrub_interval

The interval in seconds for the pool deep scrubbing. The default 0 means that the osd_deep_scrub_interval value from the Ceph configuration file is used.

13.1.8 Get Pool Values

To get a value from a pool, execute:

ceph osd pool get pool-name key
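
For example, to display the replica count of the data pool:

ceph osd pool get data size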

You can get values for keys listed in Section 13.1.7, “Set Pool Values” plus the following keys:

pg_num

The number of placement groups for the pool.

pgp_num

The effective number of placement groups to use when calculating data placement. Valid range is equal to or less than pg_num.

13.1.9 Set the Number of Object Replicas

To set the number of object replicas on a replicated pool, execute the following:

ceph osd pool set poolname size num-replicas
Important
Important

The num-replicas value includes the object itself. If, for example, you want the object and two copies of the object for a total of three instances of the object, specify 3.

For example:

ceph osd pool set data size 3

You may execute this command for each pool.

Note
Note

An object might accept I/Os in degraded mode with fewer than pool size replicas. To set a minimum number of required replicas for I/O, you should use the min_size setting. For example:

ceph osd pool set data min_size 2

This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.

13.1.10 Get the Number of Object Replicas

To get the number of object replicas, execute the following:

ceph osd dump | grep 'replicated size'

Ceph will list the pools, with the replicated size attribute highlighted. By default, Ceph creates two replicas of an object (a total of three copies, or a size of 3).
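
The output resembles the following line per pool (the values shown here are illustrative only):

pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0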

14 Snapshots

A snapshot is a read-only copy of the state of an object—a pool or an image—at a particular point in time. This way you can retain a history of its state. There are two types of snapshots in Ceph—RBD snapshots and pool snapshots.

14.1 RBD Snapshots

An RBD snapshot is a snapshot of a RADOS block device image. With snapshots, you retain a history of the image’s state. Ceph also supports snapshot layering, which allows you to clone VM images quickly and easily. Ceph supports block device snapshots using the rbd command and many higher-level interfaces, including QEMU, libvirt, OpenStack, and CloudStack.

Note
Note

Stop input/output operations before snapshotting an image. If the image contains a file system, the file system must be in a consistent state before snapshotting.

14.1.1 Cephx Notes

When cephx is enabled (see http://ceph.com/docs/master/rados/configuration/auth-config-ref/ for more information), you must specify a user name or ID and a path to the keyring containing the corresponding key for the user. See User Management (http://ceph.com/docs/master/rados/operations/user-management/) for more details. You may also add the CEPH_ARGS environment variable to avoid re-entry of the following parameters.

rbd --id user-ID --keyring=/path/to/secret commands
rbd --name username --keyring=/path/to/secret commands

For example:

rbd --id admin --keyring=/etc/ceph/ceph.keyring commands
rbd --name client.admin --keyring=/etc/ceph/ceph.keyring commands
Tip
Tip

Add the user and secret to the CEPH_ARGS environment variable so that you do not need to enter them each time.
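
A minimal sketch, assuming the client.admin user and the keyring path used above:

export CEPH_ARGS="--id admin --keyring=/etc/ceph/ceph.keyring"
rbd snap ls rbd/image1

With CEPH_ARGS exported, subsequent rbd and ceph commands in the same shell pick up the user name and keyring automatically.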

14.1.2 Snapshot Basics

The following procedures demonstrate how to create, list, and remove snapshots using the rbd command on the command line.

14.1.2.1 Create Snapshot

To create a snapshot with rbd, specify the snap create option, the pool name and the image name.

rbd --pool pool-name snap create --snap snap-name image-name
rbd snap create pool-name/image-name@snap-name

For example:

rbd --pool rbd snap create --snap snapshot1 image1
rbd snap create rbd/image1@snapshot1

14.1.2.2 List Snapshots

To list snapshots of an image, specify the pool name and the image name.

rbd --pool pool-name snap ls image-name
rbd snap ls pool-name/image-name

For example:

rbd --pool rbd snap ls image1
rbd snap ls rbd/image1

14.1.2.3 Rollback Snapshot

To roll back to a snapshot with rbd, specify the snap rollback option, the pool name, the image name, and the snapshot name.

rbd --pool pool-name snap rollback --snap snap-name image-name
rbd snap rollback pool-name/image-name@snap-name

For example:

rbd --pool pool1 snap rollback --snap snapshot1 image1
rbd snap rollback pool1/image1@snapshot1
Note
Note

Rolling back an image to a snapshot means overwriting the current version of the image with data from the snapshot. The time it takes to execute a rollback increases with the size of the image. It is faster to clone from a snapshot than to roll back an image to a snapshot, and cloning is the preferred method of returning to a pre-existing state.

14.1.2.4 Delete a Snapshot

To delete a snapshot with rbd, specify the snap rm option, the pool name, the image name, and the snapshot name.

rbd --pool pool-name snap rm --snap snap-name image-name
rbd snap rm pool-name/image-name@snap-name

For example:

rbd --pool pool1 snap rm --snap snapshot1 image1
rbd snap rm pool1/image1@snapshot1
Note
Note

Ceph OSDs delete data asynchronously, so deleting a snapshot does not free up the disk space immediately.

14.1.2.5 Purge Snapshots

To delete all snapshots for an image with rbd, specify the snap purge option and the image name.

rbd --pool pool-name snap purge image-name
rbd snap purge pool-name/image-name

For example:

rbd --pool pool1 snap purge image1
rbd snap purge pool1/image1

14.1.3 Layering

Ceph supports the ability to create many copy-on-write (COW) clones of a block device snapshot. Snapshot layering enables Ceph block device clients to create images very quickly. For example, you might create a block device image with a Linux VM written to it; then, snapshot the image, protect the snapshot, and create as many copy-on-write clones as you like. A snapshot is read-only, so cloning a snapshot simplifies semantics—making it possible to create clones rapidly.

Note
Note

The terms “parent” and “child” mentioned in the command line examples below mean a Ceph block device snapshot (parent), and the corresponding image cloned from the snapshot (child).

Each cloned image (child) stores a reference to its parent image, which enables the cloned image to open the parent snapshot and read it.

A COW clone of a snapshot behaves exactly like any other Ceph block device image. You can read from, write to, clone, and resize cloned images. There are no special restrictions with cloned images. However, the copy-on-write clone of a snapshot refers to the snapshot, so you must protect the snapshot before you clone it.

Note
Note

Ceph only supports cloning of format 2 images (that is, images created with rbd create --image-format 2).

14.1.3.1 Getting Started with Layering

Ceph block device layering is a simple process. You must have an image. You must create a snapshot of the image. You must protect the snapshot. Once you have performed these steps, you can begin cloning the snapshot.

The cloned image has a reference to the parent snapshot, and includes the pool ID, image ID and snapshot ID. The inclusion of the pool ID means that you may clone snapshots from one pool to images in another pool.

  • Image Template: A common use case for block device layering is to create a master image and a snapshot that serves as a template for clones. For example, a user may create an image for a Linux distribution (for example SUSE Linux Enterprise Server), and create a snapshot for it. Periodically, the user may update the image and create a new snapshot (for example zypper ref && zypper patch followed by rbd snap create). As the image matures, the user can clone any one of the snapshots.

  • Extended Template: A more advanced use case includes extending a template image that provides more information than a base image. For example, a user may clone an image (a VM template) and install other software (for example a database, a content management system, an analytics system, etc.) and then snapshot the extended image, which itself may be updated in the same way as the base image.

  • Template Pool: One way to use block device layering is to create a pool that contains master images that act as templates, and snapshots of those templates. You may then extend read-only privileges to users so that they may clone the snapshots without the ability to write or execute within the pool.

  • Image Migration/Recovery: One way to use block device layering is to migrate or recover data from one pool into another pool.

14.1.3.2 Protecting a Snapshot

Clones access the parent snapshots. All clones would break if a user inadvertently deleted the parent snapshot. To prevent data loss, you need to protect the snapshot before you can clone it.

rbd --pool pool-name snap protect \
 --image image-name --snap snapshot-name
rbd snap protect pool-name/image-name@snapshot-name

For example:

rbd --pool pool1 snap protect --image image1 --snap snapshot1
rbd snap protect pool1/image1@snapshot1
Note
Note

You cannot delete a protected snapshot.

14.1.3.3 Cloning a Snapshot

To clone a snapshot, you need to specify the parent pool, image and snapshot, the child pool and image name. You must protect the snapshot before you can clone it.

rbd clone --pool pool-name --image parent-image \
 --snap snap-name --dest-pool pool-name \
 --dest child-image
rbd clone pool-name/parent-image@snap-name \
 pool-name/child-image-name

For example:

rbd clone pool1/image1@snapshot1 pool1/image2
Note
Note

You may clone a snapshot from one pool to an image in another pool. For example, you may maintain read-only images and snapshots as templates in one pool, and writable clones in another pool.

14.1.3.4 Unprotecting a Snapshot

Before you can delete a snapshot, you must unprotect it first. Additionally, you may not delete snapshots that have references from clones. You must flatten each clone of a snapshot, before you can delete the snapshot.

rbd --pool pool-name snap unprotect --image image-name \
 --snap snapshot-name
rbd snap unprotect pool-name/image-name@snapshot-name

For example:

rbd --pool pool1 snap unprotect --image image1 --snap snapshot1
rbd snap unprotect pool1/image1@snapshot1

14.1.3.5 Listing Children of a Snapshot

To list the children of a snapshot, execute the following:

rbd --pool pool-name children --image image-name --snap snap-name
rbd children pool-name/image-name@snapshot-name

For example:

rbd --pool pool1 children --image image1 --snap snapshot1
rbd children pool1/image1@snapshot1

14.1.3.6 Flattening a Cloned Image

Cloned images retain a reference to the parent snapshot. When you remove the reference from the child clone to the parent snapshot, you effectively “flatten” the image by copying the information from the snapshot to the clone. The time it takes to flatten a clone increases with the size of the snapshot. To delete a snapshot, you must flatten the child images first.

rbd --pool pool-name flatten --image image-name
rbd flatten pool-name/image-name

For example:

rbd --pool pool1 flatten --image image1
rbd flatten pool1/image1
Note
Note

Since a flattened image contains all the information from the snapshot, a flattened image will take up more storage space than a layered clone.

14.2 Pool Snapshots

Pool snapshots are snapshots of the state of the whole Ceph pool. With pool snapshots, you can retain the history of the pool's state. Depending on the pool's size, creating pool snapshots may require a lot of storage space. Always check the related storage for enough disk space before creating a snapshot of a pool.

14.2.1 Make a Snapshot of a Pool

To make a snapshot of a pool, execute:

ceph osd pool mksnap pool-name snap-name

For example:

ceph osd pool mksnap pool1 snapshot1
created pool pool1 snap snapshot1

14.2.2 Remove a Snapshot of a Pool

To remove a snapshot of a pool, execute:

ceph osd pool rmsnap pool-name snap-name
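
For example, to remove the snapshot created in the previous section:

ceph osd pool rmsnap pool1 snapshot1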

15 Erasure Coded Pools

A Ceph pool is associated with a type that determines how the pool sustains the loss of an OSD (that is, a disk, since most of the time there is one OSD per disk). The default choice when creating a pool is replicated, meaning every object is copied on multiple disks. The Erasure Code pool type can be used instead to save space.

For background information on Erasure Code, see https://en.wikipedia.org/wiki/Erasure_code.

Note
Note

You cannot access erasure coded pools with the rbd interface unless you have a cache tier configured. Refer to Section 15.3, “Erasure Coded Pool And Cache Tiering” for more details.

15.1 Creating a Sample Erasure Coded Pool

The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts:

> ceph osd pool create ecpool 12 12 erasure
pool 'ecpool' created
> echo ABCDEFGHI | rados --pool ecpool put NYAN -
> rados --pool ecpool get NYAN -
ABCDEFGHI

The 12 in the pool create command stands for the number of placement groups.

15.2 Erasure Code Profiles

Some terminology hints:

chunk

when the encoding function is called, it returns chunks of the same size: data chunks which can be concatenated to reconstruct the original object and coding chunks which can be used to rebuild a lost chunk.

k

the number of data chunks, that is the number of chunks into which the original object is divided. For example, if k = 2, a 10KB object is divided into k chunks of 5KB each.

m

the number of coding chunks, that is the number of additional chunks computed by the encoding functions. If there are 2 coding chunks, it means 2 OSDs can be out without losing data.

The default erasure code profile sustains the loss of a single OSD. It is equivalent to a replicated pool of size two but requires 1.5TB instead of 2TB to store 1TB of data. The default profile can be displayed with:

> ceph osd erasure-code-profile get default
directory=.libs
k=2
m=1
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van

Choosing the right profile is important because it cannot be modified after the pool is created: a new pool with a different profile needs to be created, and all objects from the previous pool must be moved to the new one.

The most important parameters of the profile are k, m and ruleset-failure-domain because they define the storage overhead and the data durability. For example, if the desired architecture must sustain the loss of two racks with a storage overhead of 40%, the following profile can be defined:

> ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   ruleset-failure-domain=rack
> ceph osd pool create ecpool 12 12 erasure myprofile
> echo ABCDEFGHI | rados --pool ecpool put NYAN -
> rados --pool ecpool get NYAN -
ABCDEFGHI

The NYAN object will be divided into three data chunks (k=3) and two additional coding chunks will be created (m=2). The value of m defines how many OSDs can be lost simultaneously without losing any data. The ruleset-failure-domain=rack will create a CRUSH ruleset that ensures no two chunks are stored in the same rack.

For more information about the erasure code profiles, see http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile.

15.3 Erasure Coded Pool And Cache Tiering

Erasure coded pools require more resources than replicated pools and lack some functionalities such as partial writes. To overcome these limitations, it is recommended to set a cache tier before the erasure coded pool.

For example, if the hot-storage pool is made of fast storage:

> ceph osd tier add ecpool hot-storage
> ceph osd tier cache-mode hot-storage writeback
> ceph osd tier set-overlay ecpool hot-storage

This will place the hot-storage pool as tier of ecpool in write-back mode so that every write and read to the ecpool is actually using the hot-storage and benefits from its flexibility and speed.

It is not possible to create an RBD image on an erasure coded pool directly, because it requires partial writes. It is, however, possible to create an RBD image on an erasure coded pool when a replicated pool is set as its cache tier:

> rbd --pool ecpool create --size 10 myvolume

For more information about cache tiering, see Chapter 16, Cache Tiering.

16 Cache Tiering

A cache tier is an additional storage layer implemented between the client and the standard storage. It is designed to speed up the access to pools stored on slow hard disks and erasure coded pools.

Typically cache tiering involves creating a pool of relatively fast/expensive storage devices (for example SSD drives) configured to act as a cache tier, and a backing pool of slower and cheaper devices configured to act as a storage tier.

16.1 Tiered Storage Terminology

Cache tiering recognizes two types of pools: a cache pool and a storage pool.

Tip
Tip

For general information on pools, see Chapter 13, Managing Storage Pools.

storage pool

Either a standard replicated pool that stores several copies of an object in the Ceph storage cluster, or an erasure coded pool (see Chapter 15, Erasure Coded Pools).

The storage pool is sometimes referred to as a 'backing' or 'cold' storage.

cache pool

A standard replicated pool stored on relatively small but fast storage devices, with its own ruleset in the CRUSH map.

The cache pool is also referred to as a 'hot' storage.

16.2 Points to Consider

Cache tiering may degrade the cluster performance for specific workloads. The following points show some of its aspects you need to consider:

  • Workload dependent: Whether a cache will improve performance is dependent on the workload. Because there is a cost associated with moving objects into or out of the cache, it can be more effective when most of the requests touch a small number of objects. The cache pool should be large enough to capture the working set for your workload to avoid thrashing.

  • Difficult to benchmark: Most performance benchmarks may show low performance with cache tiering. The reason is that they request a big set of objects, and it takes a long time for the cache to 'warm up'.

  • Possibly low performance: For workloads that are not suitable for cache tiering, performance is often slower than a normal replicated pool without cache tiering enabled.

  • librados object enumeration: If your application is using librados directly and relies on object enumeration, cache tiering may not work as expected. (This is not a problem for RADOS Gateway, RBD, or CephFS.)

16.3 When to Use Cache Tiering

Consider using cache tiering in the following cases:

  • You need to access erasure coded pools via RADOS block device (RBD).

  • You need to access erasure coded pools via iSCSI as it inherits the limitations of RBD. For more information on iSCSI, refer to Chapter 18, Ceph iSCSI Gateway.

  • You have a limited amount of high-performance storage and a large amount of low-performance storage, and need to access the stored data faster.

16.4 Cache Modes

The cache tiering agent handles the migration of data between the cache tier and the backing storage tier. Administrators have the ability to configure how this migration takes place. There are two main scenarios:

write-back mode

When administrators configure tiers with write-back mode, Ceph clients write data to the cache tier and receive an ACK from the cache tier. In time, the data written to the cache tier migrates to the storage tier and gets flushed from the cache tier. Conceptually, the cache tier is overlaid in front of the backing storage tier. When a Ceph client needs data that resides in the storage tier, the cache tiering agent migrates the data to the cache tier on read, then it is sent to the Ceph client. Thereafter, the Ceph client can perform I/O using the cache tier, until the data becomes inactive. This is ideal for mutable data such as photo or video editing, transactional data, etc.

read-only mode

When administrators configure tiers with read-only mode, Ceph clients write data to the backing tier. On read, Ceph copies the requested objects from the backing tier to the cache tier. Stale objects get removed from the cache tier based on the defined policy. This approach is ideal for immutable data such as presenting pictures or videos on a social network, DNA data, X-ray imaging, etc., because reading data from a cache pool that might contain out-of-date data provides weak consistency. Do not use read-only mode for mutable data.

16.5 Setting Up an Example Tiered Storage

This section illustrates how to set up a fast SSD cache tier (hot-storage) in front of a standard hard disk (cold-storage).

Tip
Tip

The following example is for illustration purposes only and includes a setup with one root and one rule for the SSD part residing on a single Ceph node.

In the production environment, the cluster setup typically includes more root and rule entries for the hot storage, and also mixed nodes with both SSDs and SATA disks.

  1. Prepare a host machine with fast drives, such as SSDs. This cluster node will act as a fast cache tier.

  2. Turn the machine into a Ceph node. Install the software and configure the host machine as described in Section 5.3, “Preparing Each Ceph Node”. Let us assume that its name is node-4.

  3. You need to create 4 OSDs on this node. For this purpose, run ceph-deploy from the admin server (refer to Section 5.5, “Running ceph-deploy”). Remember to replace node-4 with the actual node name and device with the actual device name:

    cephadm > for d in a b c d; do
      ceph-deploy osd create node-4:device${d}
    done

    This may result in an entry like this in the CRUSH map:

    [...]
    host node-4 {
            id -5  # do not change unnecessarily
            # weight 0.012
            alg straw
            hash 0  # rjenkins1
            item osd.6 weight 0.003
            item osd.7 weight 0.003
            item osd.8 weight 0.003
            item osd.9 weight 0.003
    }
    [...]
  4. Edit the CRUSH map for the hot-storage pool mapped to the OSDs backed by the fast SSD drives. Define a second hierarchy with a root node for the SSDs (as root ssd). Additionally, change the weight and a CRUSH rule for the SSDs. For more information on CRUSH map, see http://docs.ceph.com/docs/master/rados/operations/crush-map/.

    Edit the CRUSH map directly with command line tools such as getcrushmap and crushtool:

    1. Retrieve the current map and save it as c.map:

      sudo ceph osd getcrushmap -o c.map
    2. Decompile c.map and save it as c.txt:

      cephadm > crushtool -d c.map -o c.txt
    3. Edit c.txt:

      [...]
      host node-4 {
              id -5  # do not change unnecessarily
              # weight 4.000
              alg straw
              hash 0  # rjenkins1
              item osd.6 weight 1.000
              item osd.7 weight 1.000
              item osd.8 weight 1.000
              item osd.9 weight 1.000
      }
      root ssd {    # newly added root for the SSD hot-storage
              id -6
              alg straw
              hash 0
              item node-4 weight 4.00
      }
      rule ssd {
              ruleset 4
              type replicated
              min_size 0
              max_size 4
              step take ssd
              step chooseleaf firstn 0 type host
              step emit
      }
      [...]
    4. Compile the edited c.txt file and save it as ssd.map:

      cephadm > crushtool -c c.txt -o ssd.map
    5. Finally install ssd.map as the new CRUSH map:

      sudo ceph osd setcrushmap -i ssd.map
  5. Create the hot-storage pool to be used for cache tiering. Use the new 'ssd' rule for it:

    sudo ceph osd pool create hot-storage 100 100 replicated ssd
  6. Create the cold-storage pool using the default 'replicated_ruleset' rule:

    sudo ceph osd pool create cold-storage 100 100 replicated replicated_ruleset
  7. Set up the cache tier by associating a backing storage pool with a cache pool, in this case cold-storage (= storage pool) with hot-storage (= cache pool):

    sudo ceph osd tier add cold-storage hot-storage
  8. To set the cache mode to writeback, execute the following:

    sudo ceph osd tier cache-mode hot-storage writeback

    For more information about cache modes, see Section 16.4, “Cache Modes”.

    Writeback cache tiers overlay the backing storage tier, so they require one additional step: you must direct all client traffic from the storage pool to the cache pool. To do so, execute the following:

    sudo ceph osd tier set-overlay cold-storage hot-storage

16.5.1 Configuring a Cache Tier

There are several options you can use to configure cache tiers. Use the following syntax:

sudo ceph osd pool set cachepool key value

16.5.1.1 Target Size and Type

Ceph's production cache tiers use a Bloom Filter for the hit_set_type:

sudo ceph osd pool set cachepool hit_set_type bloom

The hit_set_count and hit_set_period define how much time each HitSet should cover, and how many such HitSets to store.

sudo ceph osd pool set cachepool hit_set_count 12
sudo ceph osd pool set cachepool hit_set_period 14400
sudo ceph osd pool set cachepool target_max_bytes 1000000000000
Note
Note

A larger hit_set_count results in more RAM consumed by the ceph-osd process.

The min_read_recency_for_promote defines how many HitSets to check for the existence of an object when handling a read operation. The checking result is used to decide whether to promote the object asynchronously. Its value should be between 0 and hit_set_count. If set to 0, the object is always promoted. If set to 1, the current HitSet is checked. And if this object is in the current HitSet, it is promoted, otherwise not. For the other values, the exact number of archive HitSets are checked. The object is promoted if the object is found in any of the most recent min_read_recency_for_promote HitSets.

You can set a similar parameter min_write_recency_for_promote for the write operation:

sudo ceph osd pool set cachepool min_read_recency_for_promote 2
sudo ceph osd pool set cachepool min_write_recency_for_promote 2
Note
Note

The longer the period and the higher the min_read_recency_for_promote and min_write_recency_for_promote values, the more RAM the ceph-osd daemon consumes. In particular, when the agent is active to flush or evict cache objects, all hit_set_count HitSets are loaded into RAM.

16.5.1.2 Cache Sizing

The cache tiering agent performs two main functions:

Flushing

The agent identifies modified (or dirty) objects and forwards them to the storage pool for long-term storage.

Evicting

The agent identifies objects that have not been modified (or clean) and evicts the least recently used among them from the cache.

16.5.1.2.1 Absolute Sizing

The cache tiering agent can flush or evict objects based upon the total number of bytes or the total number of objects. To specify a maximum number of bytes, execute the following:

sudo ceph osd pool set cachepool target_max_bytes num_of_bytes

To specify the maximum number of objects, execute the following:

sudo ceph osd pool set cachepool target_max_objects num_of_objects
Note
Note

Ceph is not able to determine the size of a cache pool automatically, so configuring an absolute size is required here; otherwise flushing and evicting will not work. If you specify both limits, the cache tiering agent will begin flushing or evicting when either threshold is triggered.

Note
Note

Ceph blocks all client requests only when the target_max_bytes or target_max_objects limit is reached.

16.5.1.2.2 Relative Sizing

The cache tiering agent can flush or evict objects relative to the size of the cache pool (specified by target_max_bytes / target_max_objects in Section 16.5.1.2.1, “Absolute Sizing”). When the cache pool consists of a certain percentage of modified (or dirty) objects, the cache tiering agent will flush them to the storage pool. To set the cache_target_dirty_ratio, execute the following:

sudo ceph osd pool set cachepool cache_target_dirty_ratio 0.0...1.0

For example, setting the value to 0.4 will begin flushing modified (dirty) objects when they reach 40% of the cache pool's capacity:

sudo ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

When the dirty objects reach a certain percentage of the cache pool's capacity, the cache tiering agent flushes them to the storage pool at a higher speed. Use cache_target_dirty_high_ratio:

sudo ceph osd pool set cachepool cache_target_dirty_high_ratio 0.0..1.0

When the cache pool reaches a certain percentage of its capacity, the cache tiering agent will evict objects to maintain free capacity. To set the cache_target_full_ratio, execute the following:

sudo ceph osd pool set cachepool cache_target_full_ratio 0.0..1.0

16.5.1.3 Cache Age

You can specify the minimum age of an object before the cache tiering agent flushes a recently modified (or dirty) object to the backing storage pool:

sudo ceph osd pool set cachepool cache_min_flush_age num_of_seconds

You can specify the minimum age of an object before it will be evicted from the cache tier:

sudo ceph osd pool set cachepool cache_min_evict_age num_of_seconds

Part IV Accessing Cluster Data


17 Ceph RADOS Gateway

Ceph RADOS Gateway is an object storage interface built on top of librgw to provide applications with a RESTful gateway to Ceph Storage Clusters. Ceph Object Storage supports two interfaces:

  • S3-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.

  • Swift-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.

Ceph Object Storage uses the Ceph RADOS Gateway daemon (radosgw), which uses an embedded HTTP server (CivetWeb) for interacting with a Ceph Storage Cluster. Since it provides interfaces compatible with OpenStack Swift and Amazon S3, the Ceph RADOS Gateway has its own user management. Ceph RADOS Gateway can store data in the same Ceph Storage Cluster used to store data from Ceph File System clients or Ceph Block Device clients. The S3 and Swift APIs share a common name space, so you may write data with one API and retrieve it with the other.

This section helps you install and manage the Ceph RADOS Gateway (RADOS Gateway). You can either choose to use the ceph-deploy tool, or do the installation and management manually.

Important
Important

Before installing RADOS Gateway, you need to have the Ceph cluster installed first (see Chapter 3, Introduction for more information).

17.1 Managing RADOS Gateway with ceph-deploy

This section describes how to install and configure RADOS Gateway with ceph-deploy.

17.1.1 Installation

The ceph-deploy script includes the rgw component that helps you manage the RADOS Gateway creation and operation.

Important
Important: Install Ceph

Before running ceph-deploy rgw as suggested in the following step, make sure that Ceph together with the object gateway package is correctly installed on the node where you want to set up RADOS Gateway:

ceph-deploy install --rgw short_rgw_hostname

Prepare and activate the nodes in one step. You can specify several pairs of short_hostname:gateway_name to install RADOS Gateway on a required number of nodes.

ceph-deploy --overwrite-conf rgw create \
 short_hostname:gateway_name ...

For example:

ceph-deploy --overwrite-conf rgw create ceph-node1:rgw.gateway1

You now have a working RADOS Gateway on the specified nodes, and you need to give access to a client. For more information, see Section 17.4, “Managing RADOS Gateway Access”.

17.1.2 Listing RADOS Gateway Installations

To list all RADOS Gateway instances within the Ceph cluster, run:

ceph-deploy rgw list

17.1.3 Removing RADOS Gateway from a Node

To remove a RADOS Gateway installation from the node where it was previously installed, run:

ceph-deploy --overwrite-conf rgw delete  \
  short_hostname:gatewayname ...

For example:

ceph-deploy --overwrite-conf rgw delete ceph-node1:rgw.gateway1
Tip

You need a copy of the cluster's ceph.conf file in your current working directory. If you do not have one, copy it from your cluster.
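
For example, assuming the configuration lives on a host reachable as admin-node (a placeholder name for your admin or monitor node), you could fetch the file with scp:

scp admin-node:/etc/ceph/ceph.conf .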

17.2 Managing RADOS Gateway Manually

This section describes how to install and configure RADOS Gateway manually.

17.2.1 Installation

  1. Install RADOS Gateway. The following command installs all required components:

    sudo zypper ref && sudo zypper in ceph-radosgw
  2. If the Apache server from the previous RADOS Gateway instance is running, stop it and disable the relevant service:

    sudo systemctl stop apache2.service
    sudo systemctl disable apache2.service
  3. Edit /etc/ceph/ceph.conf and add the following lines:

    [client.rgw.gateway]
     rgw frontends = "civetweb port=80"
    Tip

    If you want to configure RADOS Gateway/CivetWeb for use with SSL encryption, modify the line accordingly:

    rgw frontends = civetweb port=7480s ssl_certificate=path_to_certificate.pem
  4. Restart the RADOS Gateway service. See Section 17.3, “Operating the RADOS Gateway Service” for more information.

17.2.2 Configuring RADOS Gateway

Several steps are required to configure a RADOS Gateway.

17.2.2.1 Basic Configuration

Configuring a Ceph RADOS Gateway requires a running Ceph Storage Cluster. The Ceph RADOS Gateway is a client of the Ceph Storage Cluster. As a Ceph Storage Cluster client, it requires:

  • A host name for the gateway instance, for example gateway.

  • A storage cluster user name with appropriate permissions and a keyring.

  • Pools to store its data.

  • A data directory for the gateway instance.

  • An instance entry in the Ceph Configuration file.

Each instance must have a user name and key to communicate with a Ceph storage cluster. In the following steps, we use a monitor node to create a bootstrap keyring, then create the RADOS Gateway instance user keyring based on the bootstrap one. Then, we create a client user name and key. Next, we add the key to the Ceph Storage Cluster. Finally, we distribute the keyring to the node containing the gateway instance.

  1. Create a keyring for the gateway:

    sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.rgw.keyring
    sudo chmod +r /etc/ceph/ceph.client.rgw.keyring
  2. Generate a Ceph RADOS Gateway user name and key for each instance. As an example, we will use the name gateway after client.rgw:

    sudo ceph-authtool /etc/ceph/ceph.client.rgw.keyring \
      -n client.rgw.gateway --gen-key
  3. Add capabilities to the key:

    sudo ceph-authtool -n client.rgw.gateway --cap osd 'allow rwx' \
      --cap mon 'allow rwx' /etc/ceph/ceph.client.rgw.keyring
  4. Once you have created a keyring and key to enable the Ceph Object Gateway with access to the Ceph Storage Cluster, add the key to your Ceph Storage Cluster. For example:

    sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.rgw.gateway \
      -i /etc/ceph/ceph.client.rgw.keyring
  5. Distribute the keyring to the node with the gateway instance:

    sudo scp /etc/ceph/ceph.client.rgw.keyring  ceph@hostname:/home/ceph
    ssh hostname
    sudo mv ceph.client.rgw.keyring /etc/ceph/ceph.client.rgw.keyring
Tip: Use Bootstrap Keyring

An alternative way is to create the RADOS Gateway bootstrap keyring, and then create the RADOS Gateway keyring from it:

  1. Create a RADOS Gateway bootstrap keyring on one of the monitor nodes:

    sudo ceph \
     auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
     --connect-timeout=25 \
     --cluster=ceph \
     --name mon. \
     --keyring=/var/lib/ceph/mon/ceph-node_host/keyring \
     -o /var/lib/ceph/bootstrap-rgw/keyring
  2. Create the /var/lib/ceph/radosgw/ceph-rgw_name directory for storing the bootstrap keyring:

    sudo mkdir \
    /var/lib/ceph/radosgw/ceph-rgw_name
  3. Create a RADOS Gateway keyring from the newly created bootstrap keyring:

    sudo ceph \
     auth get-or-create client.rgw.rgw_name osd 'allow rwx' mon 'allow rw' \
     --connect-timeout=25 \
     --cluster=ceph \
     --name client.bootstrap-rgw \
     --keyring=/var/lib/ceph/bootstrap-rgw/keyring \
     -o /var/lib/ceph/radosgw/ceph-rgw_name/keyring
  4. Copy the RADOS Gateway keyring to the RADOS Gateway host:

    sudo scp \
    /var/lib/ceph/radosgw/ceph-rgw_name/keyring \
    rgw_host:/var/lib/ceph/radosgw/ceph-rgw_name/keyring

17.2.2.2 Create Pools (Optional)

Ceph RADOS Gateways require Ceph Storage Cluster pools to store specific gateway data. If the user you created has proper permissions, the gateway will create the pools automatically. However, ensure that you have set an appropriate default number of placement groups per pool in the Ceph configuration file.
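
A minimal sketch of such a setting in ceph.conf follows; the values are illustrative only and should be chosen according to the number of OSDs in your cluster:

[global]
osd pool default pg num = 32
osd pool default pgp num = 32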

When configuring a gateway with the default region and zone, the naming convention for pools typically uses 'default' for region and zone naming, but you can use any naming convention you prefer:

.rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
default.rgw.users.uid
default.rgw.users.email
default.rgw.users.keys
default.rgw.meta
default.rgw.users.swift

To create the pools manually, see Section 13.1.2, “Create a Pool”.

17.2.2.3 Adding Gateway Configuration to Ceph

Add the Ceph RADOS Gateway configuration to the Ceph Configuration file. The Ceph RADOS Gateway configuration requires you to identify the Ceph RADOS Gateway instance. Then, specify the host name where you installed the Ceph RADOS Gateway daemon, a keyring (for use with cephx), and optionally a log file. For example:

[client.rgw.instance-name]
host = hostname
keyring = /etc/ceph/ceph.client.rgw.keyring
Tip: RADOS Gateway Log File

To override the default RADOS Gateway log file, include the following:

log file = /var/log/radosgw/client.rgw.instance-name.log

The [client.rgw.*] portion of the gateway instance identifies this portion of the Ceph configuration file as configuring a Ceph Storage Cluster client where the client type is a Ceph RADOS Gateway (radosgw). The instance name follows. For example:

[client.rgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.rgw.keyring
Note

The host must be your machine host name, excluding the domain name.

Then turn off print continue. If you have it set to true, you may encounter problems with PUT operations:

rgw print continue = false

To use a Ceph RADOS Gateway with subdomain S3 calls (for example http://bucketname.hostname), you must add the Ceph RADOS Gateway DNS name under the [client.rgw.gateway] section of the Ceph configuration file:

[client.rgw.gateway]
...
rgw dns name = hostname

You should also consider installing a DNS server such as Dnsmasq on your client machine(s) when using the http://bucketname.hostname syntax. The dnsmasq.conf file should include the following settings:

address=/hostname/host-ip-address
listen-address=client-loopback-ip

Then, add the client-loopback-ip IP address as the first DNS server on the client machine(s).
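
For example, on a Linux client where Dnsmasq listens on the loopback address 127.0.0.1, the first name server entry in /etc/resolv.conf would look like this:

nameserver 127.0.0.1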

17.2.2.4 Redeploy Ceph Configuration

Use ceph-deploy to push a new copy of the configuration to the hosts in your cluster:

ceph-deploy config push host-name [host-name]...

17.2.2.5 Create Data Directory

Deployment scripts may not create the default Ceph RADOS Gateway data directory. Create data directories for each instance of a radosgw daemon if not already done. The host variables in the Ceph configuration file determine which host runs each instance of a radosgw daemon. The typical form specifies the radosgw daemon, the cluster name and the daemon ID.

sudo mkdir -p /var/lib/ceph/radosgw/cluster-id

Using the exemplary ceph.conf settings above, you would execute the following:

sudo mkdir -p /var/lib/ceph/radosgw/ceph-rgw.gateway

17.2.2.6 Restart Services and Start the Gateway

To ensure that all components have reloaded their configurations, we recommend restarting your Ceph Storage Cluster service. Then, start up the radosgw service. For more information, see Chapter 8, Introduction and Section 17.3, “Operating the RADOS Gateway Service”.

After the service is up and running, you can make an anonymous GET request to see if the gateway returns a response. A simple HTTP request to the domain name should return the following:

<ListAllMyBucketsResult>
      <Owner>
              <ID>anonymous</ID>
              <DisplayName/>
      </Owner>
      <Buckets/>
</ListAllMyBucketsResult>
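
For example, assuming the gateway instance runs on a (hypothetical) host named gateway.example.com and listens on port 80 as configured above, you can issue the request with curl:

curl http://gateway.example.com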

17.3 Operating the RADOS Gateway Service

RADOS Gateway service is operated with the systemctl command. You need to have root privileges to operate the RADOS Gateway service. Note that gateway_host is the host name of the server whose RADOS Gateway instance you need to operate.

The following subcommands are supported for the RADOS Gateway service:

systemctl status ceph-radosgw@rgw.gateway_host

Prints the status information of the service.

systemctl start ceph-radosgw@rgw.gateway_host

Starts the service if it is not already running.

systemctl restart ceph-radosgw@rgw.gateway_host

Restarts the service.

systemctl stop ceph-radosgw@rgw.gateway_host

Stops the running service.

systemctl enable ceph-radosgw@rgw.gateway_host

Enables the service so that it is automatically started on system start-up.

systemctl disable ceph-radosgw@rgw.gateway_host

Disables the service so that it is not automatically started on system start-up.
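
For example, to check the status of the RADOS Gateway instance on a (hypothetical) host named gateway1 and enable it on system start-up:

sudo systemctl status ceph-radosgw@rgw.gateway1
sudo systemctl enable ceph-radosgw@rgw.gateway1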

17.4 Managing RADOS Gateway Access

You can communicate with RADOS Gateway using either S3- or Swift-compatible interface. Both interfaces require you to create a specific user, and install the relevant client software to communicate with the gateway using the user's secret key.

For an introduction and a few practical examples on RADOS Gateway access, see Section 27.6, “RADOS Gateway Client Usage”.

17.4.1 Managing S3 Access

S3 interface is compatible with a large subset of the Amazon S3 RESTful API.

Tip

S3cmd is a command line S3 client. You can find it in the openSUSE Build Service (https://build.opensuse.org/package/show/Cloud:Tools/s3cmd). The repository contains versions for both SUSE Linux Enterprise and openSUSE based distributions.

17.4.1.1 Adding Users

See Section 31.1, “Adding S3 Users”.

17.4.1.2 Removing Users

See Section 31.2, “Removing S3 Users”.

17.4.1.4 Setting Quotas

See Section 31.3, “User Quota Management”.

17.4.2 Managing Swift Access

Swift interface is compatible with a large subset of the OpenStack Swift API.

17.4.2.1 Adding Users

See Section 31.4, “Adding Swift Users”.

17.4.2.2 Removing Users

See Section 31.5, “Removing Swift Users”.

17.5 Multi-site Object Storage Gateways

You can configure each RADOS Gateway to participate in a federated architecture, working in an active-active zone configuration while allowing for writes to non-master zones.

17.5.1 Terminology

A description of terms specific to a federated architecture follows:

Zone

A logical grouping of one or more RADOS Gateway instances. There must be one zone designated as the master zone in a zonegroup, which handles all bucket and user creation.

Zonegroup

A zonegroup consists of multiple zones. There should be a master zonegroup that will handle changes to the system configuration.

Zonegroup map

A configuration structure that holds the map of the entire system, for example which zonegroup is the master, relationships between different zonegroups, and certain configuration options such as storage policies.

Realm

A container for zonegroups. This allows for separation of zonegroups between clusters. It is possible to create multiple realms, making it easier to run completely different configurations in the same cluster.

Period

A period holds the configuration structure for the current state of the realm. Every period contains a unique ID and an epoch. Every realm has an associated current period, holding the current state of configuration of the zonegroups and storage policies. Any configuration change for a non-master zone will increment the period's epoch. Changing the master zone to a different zone will trigger the following changes:

  • A new period is generated with a new period ID and epoch of 1.

  • Realm's current period is updated to point to the newly generated period ID.

  • Realm's epoch is incremented.

17.5.2 Example Cluster Setup

In this example, we will focus on creating a single zone group with three separate zones, which actively synchronize their data. Two zones belong to the same cluster, while the third belongs to a different one. There is no synchronization agent involved in mirroring data changes between the RADOS Gateways. This allows for a much simpler configuration scheme and active-active configurations. Note that metadata operations—such as creating a new user—still need to go through the master zone. However, data operations—such as creation of buckets and objects—can be handled by any of the zones.

17.5.3 System Keys

While configuring zones, RADOS Gateway expects creation of an S3-compatible system user together with their access and secret keys. This allows another RADOS Gateway instance to pull the configuration remotely with the access and secret keys. For more information on creating S3 users, see Section 31.1, “Adding S3 Users”.

Tip

It is useful to generate the access and secret keys before the zone creation itself because it makes scripting and use of configuration management tools easier later on.

For the purpose of this example, let us assume that the access and secret keys are set in the environment variables:

# SYSTEM_ACCESS_KEY=1555b35654ad1656d805
# SYSTEM_SECRET_KEY=h7GhxuBLTrlhVUyxSPUKUV8r/2EI4ngqJxD7iBdBYLhwluN30JaT3Q==

Generally, access keys consist of 20 alphanumeric characters, while secret keys consist of 40 alphanumeric characters (they can contain +/= characters as well). You can generate these keys in the command line:

# SYSTEM_ACCESS_KEY=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 20 | head -n 1)
# SYSTEM_SECRET_KEY=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 40 | head -n 1)

17.5.4 Naming Conventions

This example describes the process of setting up a master zone. We will assume a zonegroup called us spanning the United States, which will be our master zonegroup. This will contain three zones written in a zonegroup-zone format. This is our convention only and you can choose a format that you prefer. In summary:

  • Master zonegroup: United States us

  • Master zone: United States, East Region 1: us-east-1

  • Secondary zone: United States, East Region 2: us-east-2

  • Secondary zone: United States, West Region: us-west

This will be a part of a larger realm named gold. The us-east-1 and us-east-2 zones are part of the same Ceph cluster, us-east-1 being the primary one. us-west is in a different Ceph cluster.

17.5.5 Default Pools

When configured with the appropriate permissions, RADOS Gateway creates default pools on its own. The pg_num and pgp_num values are taken from the ceph.conf configuration file. Pools related to a zone by default follow the convention of zone-name.pool-name. For example for the us-east-1 zone, it will be the following pools:

.rgw.root
us-east-1.rgw.control
us-east-1.rgw.data.root
us-east-1.rgw.gc
us-east-1.rgw.log
us-east-1.rgw.intent-log
us-east-1.rgw.usage
us-east-1.rgw.users.keys
us-east-1.rgw.users.email
us-east-1.rgw.users.swift
us-east-1.rgw.users.uid
us-east-1.rgw.buckets.index
us-east-1.rgw.buckets.data
us-east-1.rgw.meta

These pools can be created in other zones as well, by replacing us-east-1 with the appropriate zone name.
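
If you prefer to create the pools manually (see Section 13.1.2, “Create a Pool”), substitute the zone name accordingly. A sketch for the us-east-2 zone follows; the placement group counts are illustrative only:

sudo ceph osd pool create us-east-2.rgw.buckets.index 32 32
sudo ceph osd pool create us-east-2.rgw.buckets.data 32 32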

17.5.6 Creating a Realm

Configure a realm called gold and make it the default realm:

cephadm > radosgw-admin realm create --rgw-realm=gold --default
{
  "id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
  "name": "gold",
  "current_period": "09559832-67a4-4101-8b3f-10dfcd6b2707",
  "epoch": 1
}

Note that every realm has an ID, which allows for flexibility such as renaming the realm later if needed. The current_period changes whenever we change anything in the master zone. The epoch is incremented when there is a change in the master zone's configuration which results in a change of the current period.

17.5.7 Deleting the Default Zonegroup

The default installation of RADOS Gateway creates the default zonegroup called default. Because we no longer need the default zonegroup, remove it.

cephadm > radosgw-admin zonegroup delete --rgw-zonegroup=default

17.5.8 Creating a Master Zonegroup

Create a master zonegroup called us. The zonegroup will manage the zonegroup map and propagate changes to the rest of the system. Marking the zonegroup as default allows you to omit the --rgw-zonegroup switch in later commands.

cephadm > radosgw-admin zonegroup create --rgw-zonegroup=us \
--endpoints=http://rgw1:80 --master --default
{
  "id": "d4018b8d-8c0d-4072-8919-608726fa369e",
  "name": "us",
  "api_name": "us",
  "is_master": "true",
  "endpoints": [
      "http:\/\/rgw1:80"
  ],
  "hostnames": [],
  "hostnames_s3website": [],
  "master_zone": "",
  "zones": [],
  "placement_targets": [],
  "default_placement": "",
  "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}

Alternatively, you can mark a zonegroup as default with the following command:

cephadm > radosgw-admin zonegroup default --rgw-zonegroup=us

17.5.9 Creating a Master Zone

Now create a default zone and add it to the default zonegroup. Note that you will use this zone for metadata operations such as user creation:

cephadm > radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-east-1 \
--endpoints=http://rgw1:80 --access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
{
  "id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
  "name": "us-east-1",
  "domain_root": "us-east-1/gc.rgw.data.root",
  "control_pool": "us-east-1/gc.rgw.control",
  "gc_pool": "us-east-1/gc.rgw.gc",
  "log_pool": "us-east-1/gc.rgw.log",
  "intent_log_pool": "us-east-1/gc.rgw.intent-log",
  "usage_log_pool": "us-east-1/gc.rgw.usage",
  "user_keys_pool": "us-east-1/gc.rgw.users.keys",
  "user_email_pool": "us-east-1/gc.rgw.users.email",
  "user_swift_pool": "us-east-1/gc.rgw.users.swift",
  "user_uid_pool": "us-east-1/gc.rgw.users.uid",
  "system_key": {
      "access_key": "1555b35654ad1656d804",
      "secret_key": "h7GhxuBLTrlhVUyxSPUKUV8r\/2EI4ngqJxD7iBdBYLhwluN30JaT3Q=="
  },
  "placement_pools": [
      {
          "key": "default-placement",
          "val": {
              "index_pool": "us-east-1/gc.rgw.buckets.index",
              "data_pool": "us-east-1/gc.rgw.buckets.data",
              "data_extra_pool": "us-east-1/gc.rgw.buckets.non-ec",
              "index_type": 0
          }
      }
  ],
  "metadata_heap": "us-east-1/gc.rgw.meta",
  "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}

Note that the --rgw-zonegroup switch adds the zone to a zonegroup, and the --default switch makes it the default zone. Alternatively, the same can also be done with the following commands:

cephadm > radosgw-admin zone default --rgw-zone=us-east-1
cephadm > radosgw-admin zonegroup add --rgw-zonegroup=us --rgw-zone=us-east-1

17.5.9.1 Creating System Users

To access zone pools, you need to create a system user. Note that you will need these keys when configuring the secondary zone as well.

cephadm > radosgw-admin user create --uid=zone.user \
--display-name="Zone User" --access-key=$SYSTEM_ACCESS_KEY \
--secret=$SYSTEM_SECRET_KEY --system

17.5.9.2 Update the Period

Because you changed the master zone configuration, you need to commit the changes for them to take effect in the realm configuration structure. Initially, the period looks like this:

cephadm > radosgw-admin period get
{
  "id": "09559832-67a4-4101-8b3f-10dfcd6b2707", "epoch": 1, "predecessor_uuid": "", "sync_status": [], "period_map":
  {
    "id": "09559832-67a4-4101-8b3f-10dfcd6b2707", "zonegroups": [], "short_zone_ids": []
  }, "master_zonegroup": "", "master_zone": "", "period_config":
  {
     "bucket_quota": {
     "enabled": false, "max_size_kb": -1, "max_objects": -1
     }, "user_quota": {
       "enabled": false, "max_size_kb": -1, "max_objects": -1
     }
  }, "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7", "realm_name": "gold", "realm_epoch": 1
}

Update the period and commit the changes:

cephadm > radosgw-admin period update --commit
{
  "id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
  "epoch": 1,
  "predecessor_uuid": "09559832-67a4-4101-8b3f-10dfcd6b2707",
  "sync_status": [ "[...]"
  ],
  "period_map": {
      "id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
      "zonegroups": [
          {
              "id": "d4018b8d-8c0d-4072-8919-608726fa369e",
              "name": "us",
              "api_name": "us",
              "is_master": "true",
              "endpoints": [
                  "http:\/\/rgw1:80"
              ],
              "hostnames": [],
              "hostnames_s3website": [],
              "master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
              "zones": [
                  {
                      "id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
                      "name": "us-east-1",
                      "endpoints": [
                          "http:\/\/rgw1:80"
                      ],
                      "log_meta": "true",
                      "log_data": "false",
                      "bucket_index_max_shards": 0,
                      "read_only": "false"
                  }
              ],
              "placement_targets": [
                  {
                      "name": "default-placement",
                      "tags": []
                  }
              ],
              "default_placement": "default-placement",
              "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
          }
      ],
      "short_zone_ids": [
          {
              "key": "83859a9a-9901-4f00-aa6d-285c777e10f0",
              "val": 630926044
          }
      ]
  },
  "master_zonegroup": "d4018b8d-8c0d-4072-8919-608726fa369e",
  "master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
  "period_config": {
      "bucket_quota": {
          "enabled": false,
          "max_size_kb": -1,
          "max_objects": -1
      },
      "user_quota": {
          "enabled": false,
          "max_size_kb": -1,
          "max_objects": -1
      }
  },
  "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
  "realm_name": "gold",
  "realm_epoch": 2
}

17.5.9.3 Start the RADOS Gateway

You need to specify the RADOS Gateway zone and port options in the configuration file before starting the RADOS Gateway. For more information on RADOS Gateway and its configuration, see Chapter 17, Ceph RADOS Gateway. The configuration section of RADOS Gateway should look similar to this:

[client.rgw.us-east-1]
rgw_frontends="civetweb port=80"
rgw_zone=us-east-1

Start the RADOS Gateway:

sudo systemctl start ceph-radosgw@rgw.us-east-1

17.5.10 Creating a Secondary Zone

In the same cluster, create and configure the secondary zone named us-east-2. You can execute all the following commands on the node hosting the master zone itself.

To create the secondary zone, use the same command as when you created the primary zone, except dropping the master flag:

cephadm > radosgw-admin zone create --rgw-zonegroup=us --endpoints=http://rgw2:80 \
--rgw-zone=us-east-2 --access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
{
  "id": "950c1a43-6836-41a2-a161-64777e07e8b8",
  "name": "us-east-2",
  "domain_root": "us-east-2.rgw.data.root",
  "control_pool": "us-east-2.rgw.control",
  "gc_pool": "us-east-2.rgw.gc",
  "log_pool": "us-east-2.rgw.log",
  "intent_log_pool": "us-east-2.rgw.intent-log",
  "usage_log_pool": "us-east-2.rgw.usage",
  "user_keys_pool": "us-east-2.rgw.users.keys",
  "user_email_pool": "us-east-2.rgw.users.email",
  "user_swift_pool": "us-east-2.rgw.users.swift",
  "user_uid_pool": "us-east-2.rgw.users.uid",
  "system_key": {
      "access_key": "1555b35654ad1656d804",
      "secret_key": "h7GhxuBLTrlhVUyxSPUKUV8r\/2EI4ngqJxD7iBdBYLhwluN30JaT3Q=="
  },
  "placement_pools": [
      {
          "key": "default-placement",
          "val": {
              "index_pool": "us-east-2.rgw.buckets.index",
              "data_pool": "us-east-2.rgw.buckets.data",
              "data_extra_pool": "us-east-2.rgw.buckets.non-ec",
              "index_type": 0
          }
      }
  ],
  "metadata_heap": "us-east-2.rgw.meta",
  "realm_id": "815d74c2-80d6-4e63-8cfc-232037f7ff5c"
}

17.5.10.1 Update the Period

Inform all the gateways of the new change in the system map by doing a period update and committing the changes:

cephadm > radosgw-admin period update --commit
{
  "id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
  "epoch": 2,
  "predecessor_uuid": "09559832-67a4-4101-8b3f-10dfcd6b2707",
  "sync_status": [ "[...]"
  ],
  "period_map": {
      "id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
      "zonegroups": [
          {
              "id": "d4018b8d-8c0d-4072-8919-608726fa369e",
              "name": "us",
              "api_name": "us",
              "is_master": "true",
              "endpoints": [
                  "http:\/\/rgw1:80"
              ],
              "hostnames": [],
              "hostnames_s3website": [],
              "master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
              "zones": [
                  {
                      "id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
                      "name": "us-east-1",
                      "endpoints": [
                          "http:\/\/rgw1:80"
                      ],
                      "log_meta": "true",
                      "log_data": "false",
                      "bucket_index_max_shards": 0,
                      "read_only": "false"
                  },
                  {
                      "id": "950c1a43-6836-41a2-a161-64777e07e8b8",
                      "name": "us-east-2",
                      "endpoints": [
                          "http:\/\/rgw2:80"
                      ],
                      "log_meta": "false",
                      "log_data": "true",
                      "bucket_index_max_shards": 0,
                      "read_only": "false"
                  }

              ],
              "placement_targets": [
                  {
                      "name": "default-placement",
                      "tags": []
                  }
              ],
              "default_placement": "default-placement",
              "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
          }
      ],
      "short_zone_ids": [
          {
              "key": "83859a9a-9901-4f00-aa6d-285c777e10f0",
              "val": 630926044
          },
          {
              "key": "950c1a43-6836-41a2-a161-64777e07e8b8",
              "val": 4276257543
          }

      ]
  },
  "master_zonegroup": "d4018b8d-8c0d-4072-8919-608726fa369e",
  "master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
  "period_config": {
      "bucket_quota": {
          "enabled": false,
          "max_size_kb": -1,
          "max_objects": -1
      },
      "user_quota": {
          "enabled": false,
          "max_size_kb": -1,
          "max_objects": -1
      }
  },
  "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
  "realm_name": "gold",
  "realm_epoch": 2
}

17.5.10.2 Start the RADOS Gateway

Adjust the configuration of the RADOS Gateway for the secondary zone, and start it:

[client.rgw.us-east-2]
rgw_frontends="civetweb port=80"
rgw_zone=us-east-2
cephadm > sudo systemctl start ceph-radosgw@rgw.us-east-2

17.5.11 Adding RADOS Gateway to the Second Cluster

The second Ceph cluster belongs to the same zonegroup as the initial one, but may be geographically located elsewhere.

17.5.11.1 Default Realm and Zonegroup

Since you already created the realm for the first gateway, pull the realm to this cluster and make it the default here:

cephadm > radosgw-admin realm pull --url=http://rgw1:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY
{
  "id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
  "name": "gold",
  "current_period": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
  "epoch": 2
}
cephadm > radosgw-admin realm default --rgw-realm=gold

Get the configuration from the master zone by pulling the period:

cephadm > radosgw-admin period pull --url=http://rgw1:80 \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY

Set the default zonegroup to the already created us zonegroup:

cephadm > radosgw-admin zonegroup default --rgw-zonegroup=us

17.5.11.2 Secondary Zone Configuration

Create a new zone named us-west with the same system keys:

cephadm > radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-west \
--access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY \
--endpoints=http://rgw3:80 --default
{
  "id": "950c1a43-6836-41a2-a161-64777e07e8b8",
  "name": "us-west",
  "domain_root": "us-west.rgw.data.root",
  "control_pool": "us-west.rgw.control",
  "gc_pool": "us-west.rgw.gc",
  "log_pool": "us-west.rgw.log",
  "intent_log_pool": "us-west.rgw.intent-log",
  "usage_log_pool": "us-west.rgw.usage",
  "user_keys_pool": "us-west.rgw.users.keys",
  "user_email_pool": "us-west.rgw.users.email",
  "user_swift_pool": "us-west.rgw.users.swift",
  "user_uid_pool": "us-west.rgw.users.uid",
  "system_key": {
      "access_key": "1555b35654ad1656d804",
      "secret_key": "h7GhxuBLTrlhVUyxSPUKUV8r\/2EI4ngqJxD7iBdBYLhwluN30JaT3Q=="
  },
  "placement_pools": [
      {
          "key": "default-placement",
          "val": {
              "index_pool": "us-west.rgw.buckets.index",
              "data_pool": "us-west.rgw.buckets.data",
              "data_extra_pool": "us-west.rgw.buckets.non-ec",
              "index_type": 0
          }
      }
  ],
  "metadata_heap": "us-west.rgw.meta",
  "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
}

17.5.11.3 Update the Period

To propagate the zonegroup map changes, we update and commit the period:

cephadm > radosgw-admin period update --commit --rgw-zone=us-west
{
  "id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
  "epoch": 3,
  "predecessor_uuid": "09559832-67a4-4101-8b3f-10dfcd6b2707",
  "sync_status": [
      "", # truncated
  ],
  "period_map": {
      "id": "b5e4d3ec-2a62-4746-b479-4b2bc14b27d1",
      "zonegroups": [
          {
              "id": "d4018b8d-8c0d-4072-8919-608726fa369e",
              "name": "us",
              "api_name": "us",
              "is_master": "true",
              "endpoints": [
                  "http:\/\/rgw1:80"
              ],
              "hostnames": [],
              "hostnames_s3website": [],
              "master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
              "zones": [
                  {
                      "id": "83859a9a-9901-4f00-aa6d-285c777e10f0",
                      "name": "us-east-1",
                      "endpoints": [
                          "http:\/\/rgw1:80"
                      ],
                      "log_meta": "true",
                      "log_data": "true",
                      "bucket_index_max_shards": 0,
                      "read_only": "false"
                  },
                                  {
                      "id": "950c1a43-6836-41a2-a161-64777e07e8b8",
                      "name": "us-east-2",
                      "endpoints": [
                          "http:\/\/rgw2:80"
                      ],
                      "log_meta": "false",
                      "log_data": "true",
                      "bucket_index_max_shards": 0,
                      "read_only": "false"
                  },
                  {
                      "id": "d9522067-cb7b-4129-8751-591e45815b16",
                      "name": "us-west",
                      "endpoints": [
                          "http:\/\/rgw3:80"
                      ],
                      "log_meta": "false",
                      "log_data": "true",
                      "bucket_index_max_shards": 0,
                      "read_only": "false"
                  }
              ],
              "placement_targets": [
                  {
                      "name": "default-placement",
                      "tags": []
                  }
              ],
              "default_placement": "default-placement",
              "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7"
          }
      ],
      "short_zone_ids": [
          {
              "key": "83859a9a-9901-4f00-aa6d-285c777e10f0",
              "val": 630926044
          },
          {
              "key": "950c1a43-6836-41a2-a161-64777e07e8b8",
              "val": 4276257543
          },
          {
              "key": "d9522067-cb7b-4129-8751-591e45815b16",
              "val": 329470157
          }
      ]
  },
  "master_zonegroup": "d4018b8d-8c0d-4072-8919-608726fa369e",
  "master_zone": "83859a9a-9901-4f00-aa6d-285c777e10f0",
  "period_config": {
      "bucket_quota": {
          "enabled": false,
          "max_size_kb": -1,
          "max_objects": -1
      },
      "user_quota": {
          "enabled": false,
          "max_size_kb": -1,
          "max_objects": -1
      }
  },
  "realm_id": "4a367026-bd8f-40ee-b486-8212482ddcd7",
  "realm_name": "gold",
  "realm_epoch": 2
}

Note that the period epoch number has incremented, indicating a change in the configuration.

17.5.11.4 Start the RADOS Gateway

This is similar to starting the RADOS Gateway in the first zone. The only difference is that the RADOS Gateway zone configuration should reflect the us-west zone name:

[client.rgw.us-west]
rgw_frontends="civetweb port=80"
rgw_zone=us-west

Start the second RADOS Gateway:

sudo systemctl start ceph-radosgw@rgw.us-west

18 Ceph iSCSI Gateway

iSCSI is a storage area network (SAN) protocol that allows clients (called initiators) to send SCSI commands to SCSI storage devices (targets) on remote servers. SUSE Enterprise Storage includes a facility that opens Ceph storage management to heterogeneous clients, such as Microsoft Windows* and VMware* vSphere, through the iSCSI protocol. Multipath iSCSI access enables availability and scalability for these clients, and the standardized iSCSI protocol also provides an additional layer of security isolation between clients and the SUSE Enterprise Storage cluster. The configuration facility is named lrbd. Using lrbd, Ceph storage administrators can define thin-provisioned, replicated, highly-available volumes supporting read-only snapshots, read-write clones, and automatic resizing with Ceph RADOS Block Device (RBD). Administrators can then export volumes either via a single lrbd gateway host, or via multiple gateway hosts supporting multipath failover. Linux, Microsoft Windows, and VMware hosts can connect to volumes using the iSCSI protocol, which makes them available like any other SCSI block device. This means SUSE Enterprise Storage customers can effectively run a complete block-storage infrastructure subsystem on Ceph that provides all features and benefits of a conventional SAN enabling future growth.

This chapter provides detailed information on how to set up a Ceph cluster infrastructure together with an iSCSI gateway so that the client hosts can use remotely stored data as local storage devices using the iSCSI protocol.

18.1 iSCSI Block Storage

iSCSI is an implementation of the Small Computer System Interface (SCSI) command set using the Internet Protocol (IP), specified in RFC 3720. iSCSI is implemented as a service where a client (the initiator) talks to a server (the target) via a session on TCP port 3260. An iSCSI target's IP address and port are called an iSCSI portal, where a target can be exposed through one or more portals. The combination of a target and one or more portals is called the target portal group (TPG).

The underlying data link layer protocol for iSCSI is commonly Ethernet. More specifically, modern iSCSI infrastructures use 10 Gigabit Ethernet or faster networks for optimal throughput. 10 Gigabit Ethernet connectivity between the iSCSI gateway and the back-end Ceph cluster is strongly recommended.

18.1.1 The Linux Kernel iSCSI Target

The Linux kernel iSCSI target was originally named LIO for linux-iscsi.org, the project's original domain and Web site. For some time, no fewer than 4 competing iSCSI target implementations were available for the Linux platform, but LIO ultimately prevailed as the single iSCSI reference target. The mainline kernel code for LIO uses the simple, but somewhat ambiguous name "target", distinguishing between "target core" and a variety of front-end and back-end target modules.

The most commonly used front-end module is arguably iSCSI. However, LIO also supports Fibre Channel (FC), Fibre Channel over Ethernet (FCoE) and several other front-end protocols. At this time, only the iSCSI protocol is supported by SUSE Enterprise Storage.

The most frequently used target back-end module is one that is capable of simply re-exporting any available block device on the target host. This module is named iblock. However, LIO also has an RBD-specific back-end module supporting parallelized multipath I/O access to RBD images.

18.1.2 iSCSI Initiators

This section briefly introduces the iSCSI initiators used on Linux, Microsoft Windows, and VMware platforms.

18.1.2.1 Linux

The standard initiator for the Linux platform is open-iscsi. open-iscsi launches a daemon, iscsid, which the user can then use to discover iSCSI targets on any given portal, log in to targets, and map iSCSI volumes. iscsid communicates with the SCSI mid layer to create in-kernel block devices that the kernel can then treat like any other SCSI block device on the system. The open-iscsi initiator can be deployed in conjunction with the Device Mapper Multipath (dm-multipath) facility to provide a highly available iSCSI block device.

18.1.2.2 Microsoft Windows and Hyper-V

The default iSCSI initiator for the Microsoft Windows operating system is the Microsoft iSCSI initiator. The iSCSI service can be configured via a graphical user interface (GUI), and supports multipath I/O for high availability.

18.1.2.3 VMware

The default iSCSI initiator for VMware vSphere and ESX is the VMware ESX software iSCSI initiator, vmkiscsi. When enabled, it can be configured either from the vSphere client, or using the vmkiscsi-tool command. You can then format storage volumes connected through the vSphere iSCSI storage adapter with VMFS, and use them like any other VM storage device. The VMware initiator also supports multipath I/O for high availability.

18.2 General Information about lrbd

lrbd combines the benefits of RADOS Block Devices with the ubiquitous versatility of iSCSI. By employing lrbd on an iSCSI target host (known as the lrbd gateway), any application that needs to make use of block storage can benefit from Ceph, even if it does not speak any Ceph client protocol. Instead, users can use iSCSI or any other target front-end protocol to connect to an LIO target, which translates all target I/O to RBD storage operations.

Figure 18.1: Ceph Cluster with a Single iSCSI Gateway

lrbd is inherently highly-available and supports multipath operations. Thus, downstream initiator hosts can use multiple iSCSI gateways for both high availability and scalability. When communicating with an iSCSI configuration with more than one gateway, initiators may load-balance iSCSI requests across multiple gateways. In the event of a gateway failing, being temporarily unreachable, or being disabled for maintenance, I/O will transparently continue via another gateway.

Figure 18.2: Ceph Cluster with Multiple iSCSI Gateways

18.3 Deployment Considerations

A minimum configuration of SUSE Enterprise Storage with lrbd consists of the following components:

  • A Ceph storage cluster. The Ceph cluster consists of a minimum of four physical servers hosting at least eight object storage daemons (OSDs) each. In such a configuration, three OSD nodes also double as a monitor (MON) host.

  • An iSCSI target server running the LIO iSCSI target, configured via lrbd.

  • An iSCSI initiator host, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

A recommended production configuration of SUSE Enterprise Storage with lrbd consists of:

  • A Ceph storage cluster. A production Ceph cluster consists of any number of (typically more than 10) OSD nodes, each typically running 10-12 object storage daemons (OSDs), with no fewer than three dedicated MON hosts.

  • Several iSCSI target servers running the LIO iSCSI target, configured via lrbd. For iSCSI fail-over and load-balancing, these servers must run a kernel supporting the target_core_rbd module. Update packages are available from the SUSE Linux Enterprise Server maintenance channel.

  • Any number of iSCSI initiator hosts, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

18.4 Installation and Configuration

This section describes steps to install and configure iSCSI gateway on top of SUSE Enterprise Storage.

18.4.1 Install SUSE Enterprise Storage and Deploy a Ceph Cluster

Before you start installing and configuring an iSCSI gateway, you need to install SUSE Enterprise Storage and deploy a Ceph cluster as described in Chapter 3, Introduction.

18.4.2 Installing the ceph_iscsi Pattern

On your designated iSCSI target server nodes, install the ceph_iscsi pattern. Doing so will automatically install lrbd, the necessary Ceph binaries and libraries, and the targetcli command line tool:

sudo zypper in -t pattern ceph_iscsi

Repeat this step on any node that you want to act as a fail-over or load-balancing target server node.

18.4.3 Create RBD Images

RBD images are created in the Ceph store and subsequently exported to iSCSI. We recommend that you use a dedicated RADOS pool for this purpose. You can create a volume from any host that is able to connect to your storage cluster using the Ceph rbd command line utility. This requires the client to have at least a minimal ceph.conf configuration file, and appropriate CephX authentication credentials.
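
A minimal client configuration might look like the following sketch; the monitor addresses are placeholders for your cluster's monitors, and the referenced keyring must contain credentials authorized for the target pool:

[global]
mon host = 192.168.124.10, 192.168.124.11, 192.168.124.12
keyring = /etc/ceph/ceph.client.admin.keyring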

To create a new volume for subsequent export via iSCSI, use the rbd create command, specifying the volume size in megabytes. For example, in order to create a 100GB volume named testvol in the pool named iscsi, run:

rbd --pool iscsi create --size=102400 testvol

The above command creates an RBD volume in the default format 2.

Note

Since SUSE Enterprise Storage 3, the default volume format is 2, and format 1 is deprecated. However, you can still create the deprecated format 1 volumes with the --image-format 1 option.

18.4.4 Export RBD Images via iSCSI

To export RBD images via iSCSI, use the lrbd utility. lrbd allows you to create, review, and modify the iSCSI target configuration, which uses a JSON format.

In order to edit the configuration, use lrbd -e or lrbd --edit. This command will invoke the default editor, as defined by the EDITOR environment variable. You may override this behavior by setting the -E option in addition to -e.

Below is an example configuration for

  • two iSCSI gateway hosts named iscsi1.example.com and iscsi2.example.com,

  • defining a single iSCSI target with an iSCSI Qualified Name (IQN) of iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol,

  • with a single iSCSI Logical Unit (LU),

  • backed by an RBD image named testvol in the RADOS pool rbd,

  • and exporting the target via two portals named "east" and "west":

{
    "auth": [
        {
            "target": "iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol",
            "authentication": "none"
        }
    ],
    "targets": [
        {
            "target": "iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol",
            "hosts": [
                {
                    "host": "iscsi1.example.com",
                    "portal": "east"
                },
                {
                    "host": "iscsi2.example.com",
                    "portal": "west"
                }
            ]
        }
    ],
    "portals": [
        {
            "name": "east",
            "addresses": [
                "192.168.124.104"
            ]
        },
        {
            "name": "west",
            "addresses": [
                "192.168.124.105"
            ]
        }
    ],
    "pools": [
        {
            "pool": "rbd",
            "gateways": [
                {
                    "target": "iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol",
                    "tpg": [
                        {
                            "image": "testvol"
                        }
                    ]
                }
            ]
        }
    ]
}

Note that whenever you refer to a host name in the configuration, this host name must match the iSCSI gateway's uname -n command output.

The edited JSON is stored in the extended attributes (xattrs) of a single RADOS object per pool. This object is available to the gateway hosts where the JSON is edited, and all gateway hosts connected to the same Ceph cluster. No configuration information is stored locally on the lrbd gateway.

To activate the configuration, store it in the Ceph cluster, and do one of the following things (as root):

  • Run the lrbd command (without additional options) from the command line,

or

  • Restart the lrbd service with service lrbd restart.

The lrbd "service" does not operate any background daemon. Instead, it simply invokes the lrbd command. This type of service is known as a "one-shot" service.

You should also enable lrbd to auto-configure on system start-up. To do so, run the systemctl enable lrbd command.
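
In summary, to activate the stored configuration and have it re-applied on every boot, you would typically run the following as root (these are the commands named above):

lrbd
systemctl enable lrbd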

The configuration above reflects a simple, one-gateway setup. lrbd configuration can be much more complex and powerful. The lrbd RPM package comes with an extensive set of configuration examples, which you may refer to by checking the contents of the /usr/share/doc/packages/lrbd/samples directory after installation. The samples are also available from https://github.com/SUSE/lrbd/tree/master/samples.

18.4.5 Optional Settings

The following settings may be useful for some environments. For images, there are uuid, lun, retries, sleep, and retry_errors attributes. The first two—uuid and lun—allow hardcoding of the 'uuid' or 'lun' for a specific image. You can specify either of them for an image. The retries, sleep and retry_errors affect the attempts to map an rbd image.

"pools": [
    {
        "pool": "rbd",
        "gateways": [
        {
        "host": "igw1",
        "tpg": [
                    {
                        "image": "archive",
                        "uuid": "12345678-abcd-9012-efab-345678901234",
                        "lun": "2",
                        "retries": "3",
                        "sleep": "4",
                        "retry_errors": [ 95 ],
                        [...]
                    }
                ]
            }
        ]
    }
]

18.4.6 Advanced Settings

lrbd can be configured with advanced parameters which are subsequently passed on to the LIO I/O target. The parameters are divided up into iSCSI and backing store components, which can then be specified in the "targets" and "tpg" sections, respectively, of the lrbd configuration.

Warning

Changing these parameters from the default setting is not recommended.

"targets": [
    {
        [...]
        "tpg_default_cmdsn_depth": "64",
        "tpg_default_erl": "0",
        "tpg_login_timeout": "10",
        "tpg_netif_timeout": "2",
        "tpg_prod_mode_write_protect": "0",
    }
]

Description of the options follows:

tpg_default_cmdsn_depth

Default CmdSN (Command Sequence Number) depth. Limits the amount of requests that an iSCSI initiator can have outstanding at any moment.

tpg_default_erl

Default error recovery level.

tpg_login_timeout

Login timeout value in seconds.

tpg_netif_timeout

NIC failure timeout in seconds.

tpg_prod_mode_write_protect

If set to 1, prevent writes to LUNs.

"pools": [
    {
        "pool": "rbd",
        "gateways": [
        {
        "host": "igw1",
        "tpg": [
                    {
                        "image": "archive",
                        "backstore_block_size": "512",
                        "backstore_emulate_3pc": "1",
                        "backstore_emulate_caw": "1",
                        "backstore_emulate_dpo": "0",
                        "backstore_emulate_fua_read": "0",
                        "backstore_emulate_fua_write": "1",
                        "backstore_emulate_model_alias": "0",
                        "backstore_emulate_rest_reord": "0",
                        "backstore_emulate_tas": "1",
                        "backstore_emulate_tpu": "0",
                        "backstore_emulate_tpws": "0",
                        "backstore_emulate_ua_intlck_ctrl": "0",
                        "backstore_emulate_write_cache": "0",
                        "backstore_enforce_pr_isids": "1",
                        "backstore_fabric_max_sectors": "8192",
                        "backstore_hw_block_size": "512",
                        "backstore_hw_max_sectors": "8192",
                        "backstore_hw_pi_prot_type": "0",
                        "backstore_hw_queue_depth": "128",
                        "backstore_is_nonrot": "1",
                        "backstore_max_unmap_block_desc_count": "1",
                        "backstore_max_unmap_lba_count": "8192",
                        "backstore_max_write_same_len": "65535",
                        "backstore_optimal_sectors": "8192",
                        "backstore_pi_prot_format": "0",
                        "backstore_pi_prot_type": "0",
                        "backstore_queue_depth": "128",
                        "backstore_unmap_granularity": "8192",
                        "backstore_unmap_granularity_alignment": "4194304"
                    }
                ]
            }
        ]
    }
]

Description of the options follows:

backstore_block_size

Block size of the underlying device.

backstore_emulate_3pc

If set to 1, enable Third Party Copy.

backstore_emulate_caw

If set to 1, enable Compare and Write.

backstore_emulate_dpo

If set to 1, turn on Disable Page Out.

backstore_emulate_fua_read

If set to 1, enable Force Unit Access read.

backstore_emulate_fua_write

If set to 1, enable Force Unit Access write.

backstore_emulate_model_alias

If set to 1, use the back-end device name for the model alias.

backstore_emulate_rest_reord

If set to 0, the Queue Algorithm Modifier is Restricted Reordering.

backstore_emulate_tas

If set to 1, enable Task Aborted Status.

backstore_emulate_tpu

If set to 1, enable Thin Provisioning Unmap.

backstore_emulate_tpws

If set to 1, enable Thin Provisioning Write Same.

backstore_emulate_ua_intlck_ctrl

If set to 1, enable Unit Attention Interlock.

backstore_emulate_write_cache

If set to 1, turn on Write Cache Enable.

backstore_enforce_pr_isids

If set to 1, enforce persistent reservation ISIDs.

backstore_fabric_max_sectors

Maximum number of sectors the fabric can transfer at once.

backstore_hw_block_size

Hardware block size in bytes.

backstore_hw_max_sectors

Maximum number of sectors the hardware can transfer at once.

backstore_hw_pi_prot_type

If non-zero, DIF protection is enabled on the underlying hardware.

backstore_hw_queue_depth

Hardware queue depth.

backstore_is_nonrot

If set to 1, the backstore is a non rotational device.

backstore_max_unmap_block_desc_count

Maximum number of block descriptors for UNMAP.

"backstore_max_unmap_lba_count":

Maximum number of LBA for UNMAP.

backstore_max_write_same_len

Maximum length for WRITE_SAME.

backstore_optimal_sectors

Optimal request size in sectors.

backstore_pi_prot_format

DIF protection format.

backstore_pi_prot_type

DIF protection type.

backstore_queue_depth

Queue depth.

backstore_unmap_granularity

UNMAP granularity.

backstore_unmap_granularity_alignment

UNMAP granularity alignment.

For targets, the tpg attributes allow tuning of kernel parameters. Use with caution.

"targets": [
{
    "host": "igw1",
    "target": "iqn.2003-01.org.linux-iscsi.generic.x86:sn.abcdefghijk",
    "tpg_login_timeout": "10",
    "tpg_default_cmdsn_depth": "64",
    "tpg_default_erl": "0",
    "tpg_login_timeout": "10",
    "tpg_netif_timeout": "2",
    "tpg_prod_mode_write_protect": "0",
    "tpg_t10_pi": "0"
}
Tip

If a site needs statically assigned LUNs, then assign numbers to each LUN.
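
A sketch of such a static assignment in the lrbd configuration follows, reusing the lun attribute described in Section 18.4.5; the image names and LUN numbers are illustrative only:

"pools": [
    {
        "pool": "rbd",
        "gateways": [
            {
                "host": "igw1",
                "tpg": [
                    { "image": "testvol", "lun": "0" },
                    { "image": "archive", "lun": "1" }
                ]
            }
        ]
    }
]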

18.5 Connecting to lrbd-managed Targets

This section describes how to connect to lrbd-managed targets from clients running Linux, Microsoft Windows, or VMware.

18.5.1 Linux (open-iscsi)

Connecting to lrbd-backed iSCSI targets with open-iscsi is a two-step process. First the initiator must discover the iSCSI targets available on the gateway host, then it must log in and map the available Logical Units (LUs).

Both steps require that the open-iscsi daemon is running. The way you start the open-iscsi daemon is dependent on your Linux distribution:

  • On SUSE Linux Enterprise Server (SLES) and Red Hat Enterprise Linux (RHEL) hosts, run systemctl start iscsid (or service iscsid start if systemctl is not available).

  • On Debian and Ubuntu hosts, run systemctl start open-iscsi (or service open-iscsi start).

If your initiator host runs SUSE Linux Enterprise Server, refer to https://www.suse.com/documentation/sles-12/stor_admin/data/sec_iscsi_initiator.html or https://www.suse.com/documentation/sles11/stor_admin/data/sec_inst_system_iscsi_initiator.html for details on how to connect to an iSCSI target.

For any other Linux distribution supporting open-iscsi, proceed to discover targets on your lrbd gateway (this example uses iscsi1.example.com as the portal address; for multipath access repeat these steps with iscsi2.example.com):

iscsiadm -m discovery -t sendtargets -p iscsi1.example.com
192.168.124.104:3260,1 iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol

Then, log in to the portal. If the login completes successfully, any RBD-backed logical units on the portal will immediately become available on the system SCSI bus:

iscsiadm -m node -p iscsi1.example.com --login
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260] (multiple)
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260] successful.

Repeat this process for other portal IP addresses or hosts.

If your system has the lsscsi utility installed, you can use it to enumerate available SCSI devices on your system:

lsscsi
[8:0:0:0]    disk    SUSE     RBD              4.0   /dev/sde
[9:0:0:0]    disk    SUSE     RBD              4.0   /dev/sdf

In a multipath configuration (where two connected iSCSI devices represent one and the same LU), you can also examine the multipath device state with the multipath utility:

multipath -ll
360014050cf9dcfcb2603933ac3298dca dm-9 SUSE,RBD
size=49G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 8:0:0:0 sde 8:64 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
`- 9:0:0:0 sdf 8:80 active ready running

You can now use this multipath device as you would any block device. For example, you can use the device as a Physical Volume for Linux Logical Volume Management (LVM), or you can simply create a file system on it. The example below demonstrates how to create an XFS file system on the newly connected multipath iSCSI volume:

mkfs -t xfs /dev/mapper/360014050cf9dcfcb2603933ac3298dca
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/mapper/360014050cf9dcfcb2603933ac3298dca isize=256    agcount=17, agsize=799744 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=12800000, imaxpct=25
         =                       sunit=1024   swidth=1024 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=6256, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Note that because XFS is a non-clustered file system, you may only mount it on a single iSCSI initiator node at any given time.

If at any time you want to discontinue using the iSCSI LUs associated with a particular target, run the following command:

iscsiadm -m node -p iscsi1.example.com --logout
Logging out of session [sid: 18, iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260]
Logout of [sid: 18, target: iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol, portal: 192.168.124.104,3260] successful.

As with discovery and login, you must repeat the logout steps for all portal IP addresses or host names.
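
For example, to log out of the second portal used in the discovery example above, you would run:

iscsiadm -m node -p iscsi2.example.com --logout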

18.5.1.1 Multipath Configuration

The multipath configuration is maintained on the clients or initiators and is independent of any lrbd configuration. Select a strategy prior to using block storage. After editing /etc/multipath.conf, restart multipathd with

sudo systemctl restart multipathd

For an active-passive configuration with friendly names, add

defaults {
  user_friendly_names yes
}

to your /etc/multipath.conf. After connecting to your targets successfully, run

multipath -ll
mpathd (36001405dbb561b2b5e439f0aed2f8e1e) dm-0 SUSE,RBD
size=2.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 2:0:0:3 sdl 8:176 active ready running
|-+- policy='service-time 0' prio=1 status=enabled
| `- 3:0:0:3 sdj 8:144 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 4:0:0:3 sdk 8:160 active ready running

Note the status of each link. For an active-active configuration, add

defaults {
  user_friendly_names yes
}

devices {
  device {
    vendor "(LIO-ORG|SUSE)"
    product "RBD"
    path_grouping_policy "multibus"
    path_checker "tur"
    features "0"
    hardware_handler "1 alua"
    prio "alua"
    failback "immediate"
    rr_weight "uniform"
    no_path_retry 12
    rr_min_io 100
  }
}

to your /etc/multipath.conf. Restart multipathd and run

multipath -ll
mpathd (36001405dbb561b2b5e439f0aed2f8e1e) dm-3 SUSE,RBD
size=2.0G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 4:0:0:3 sdj 8:144 active ready running
  |- 3:0:0:3 sdk 8:160 active ready running
  `- 2:0:0:3 sdl 8:176 active ready running

18.5.2 Microsoft Windows (Microsoft iSCSI initiator)

To connect to a SUSE Enterprise Storage iSCSI target from a Windows 2012 server, follow these steps:

  1. Open Windows Server Manager. From the Dashboard, select Tools › iSCSI Initiator. The iSCSI Initiator Properties dialog appears. Select the Discovery tab:

    iSCSI Initiator Properties
    Figure 18.3: iSCSI Initiator Properties
  2. In the Discover Target Portal dialog, enter the target's host name or IP address in the Target field and click OK:

    Discover Target Portal
    Figure 18.4: Discover Target Portal
  3. Repeat this process for all other gateway host names or IP addresses. When completed, review the Target Portals list:

    Target Portals
    Figure 18.5: Target Portals
  4. Next, switch to the Targets tab and review your discovered target(s).

    Targets
    Figure 18.6: Targets
  5. Click Connect in the Targets tab. The Connect To Target dialog appears. Select the Enable Multi-path check box to enable multipath I/O (MPIO), then click OK:

  6. When the Connect to Target dialog closes, select Properties to review the target's properties:

    iSCSI Target Properties
    Figure 18.7: iSCSI Target Properties
  7. Select Devices, and click MPIO to review the multipath I/O configuration:

    Device Details
    Figure 18.8: Device Details

    The default Load Balance policy is Round Robin With Subset. If you prefer a pure fail-over configuration, change it to Fail Over Only.

This concludes the iSCSI initiator configuration. The iSCSI volumes are now available like any other SCSI devices, and may be initialized for use as volumes and drives. Click OK to close the iSCSI Initiator Properties dialog, and proceed with the File and Storage Services role from the Server Manager dashboard.

Observe the newly connected volume. It identifies as SUSE RBD SCSI Multi-Path Drive on the iSCSI bus, and is initially marked with an Offline status and a partition table type of Unknown. If the new volume does not appear immediately, select Rescan Storage from the Tasks drop-down box to rescan the iSCSI bus.

  1. Right-click on the iSCSI volume and select New Volume from the context menu. The New Volume Wizard appears. Click Next, highlight the newly connected iSCSI volume and click Next to begin.

    New Volume Wizard
    Figure 18.9: New Volume Wizard
  2. Initially, the device is empty and does not contain a partition table. When prompted, confirm the dialog indicating that the volume will be initialized with a GPT partition table:

    Offline Disk Prompt
    Figure 18.10: Offline Disk Prompt
  3. Select the volume size. Typically, you would use the device's full capacity. Then assign a drive letter or folder name where the newly created volume will become available. Then select a file system to create on the new volume, and finally confirm your selections with Create to finish creating the volume:

    Confirm Volume Selections
    Figure 18.11: Confirm Volume Selections

    When the process finishes, review the results, then Close to conclude the drive initialization. Once initialization completes, the volume (and its NTFS file system) becomes available like a newly initialized local drive.

18.5.3 VMware

  1. To connect to lrbd-managed iSCSI volumes you need a configured iSCSI software adapter. If no such adapter is available in your vSphere configuration, create one by selecting Configuration › Storage Adapters › Add › iSCSI Software initiator.

  2. When available, select the adapter's properties by right-clicking the adapter and selecting Properties from the context menu:

    iSCSI Initiator Properties
    Figure 18.12: iSCSI Initiator Properties
  3. In the iSCSI Software Initiator dialog, click the Configure button. Then go to the Dynamic Discovery tab and select Add.

  4. Enter the IP address or host name of your lrbd iSCSI gateway. If you run multiple iSCSI gateways in a failover configuration, repeat this step for as many gateways as you operate.

    Add Target Server
    Figure 18.13: Add Target Server

    When you have entered all iSCSI gateways, click OK in the dialog to initiate a rescan of the iSCSI adapter.

  5. When the rescan completes, the new iSCSI device appears below the Storage Adapters list in the Details pane. For multipath devices, you can now right-click on the adapter and select Manage Paths from the context menu:

    Manage Multipath Devices
    Figure 18.14: Manage Multipath Devices

    You should now see all paths with a green light under Status. One of your paths should be marked Active (I/O) and all others simply Active:

    Paths Listing for Multipath
    Figure 18.15: Paths Listing for Multipath
  6. You can now switch from Storage Adapters to the item labeled Storage. Select Add Storage... in the top-right corner of the pane to bring up the Add Storage dialog. Then, select Disk/LUN and click Next. The newly added iSCSI device appears in the Select Disk/LUN list. Select it, then click Next to proceed:

    Add Storage Dialog
    Figure 18.16: Add Storage Dialog

    Click Next to accept the default disk layout.

  7. In the Properties pane, assign a name to the new datastore, and click Next. Accept the default setting to use the volume's entire space for the datastore, or select Custom Space Setting for a smaller datastore:

    Custom Space Setting
    Figure 18.17: Custom Space Setting

    Click Finish to complete the datastore creation.

    The new datastore now appears in the datastore list and you can select it to retrieve details. You are now able to use the lrbd-backed iSCSI volume like any other vSphere datastore.

    iSCSI Datastore Overview
    Figure 18.18: iSCSI Datastore Overview

18.6 Conclusion

lrbd is a key component of SUSE Enterprise Storage that enables access to distributed, highly available block storage from any server or client capable of speaking the iSCSI protocol. By using lrbd on one or more iSCSI gateway hosts, Ceph RBD images become available as Logical Units (LUs) associated with iSCSI targets, which can be accessed in an optionally load-balanced, highly available fashion.

Since all of lrbd's configuration is stored in the Ceph RADOS object store, lrbd gateway hosts are inherently without persistent state and thus can be replaced, augmented, or reduced at will. As a result, SUSE Enterprise Storage enables SUSE customers to run a truly distributed, highly-available, resilient, and self-healing enterprise storage technology on commodity hardware and an entirely open source platform.

19 Clustered File System

The Ceph file system (CephFS) is a POSIX-compliant file system that uses a Ceph storage cluster to store its data. CephFS uses the same cluster system as Ceph block devices, Ceph object storage with its S3 and Swift APIs, or native bindings (librados).

To use CephFS, you need to have a running Ceph storage cluster, and at least one running Ceph metadata server.

Warning
Warning

CephFS file layout changes can be performed as documented in http://docs.ceph.com/docs/jewel/cephfs/file-layouts/. However, a data pool must not be added to an existing CephFS file system (via ceph mds add_data_pool) while the file system is mounted by any clients.

19.1 Ceph Metadata Server

The Ceph metadata server (MDS) stores metadata for CephFS. Ceph block devices and Ceph object storage do not use MDS. MDSs make it possible for POSIX file system users to execute basic commands, such as ls or find, without placing an enormous burden on the Ceph storage cluster.

19.1.1 Adding a Metadata Server

After you deploy OSDs and monitors, you can deploy metadata servers. Although the MDS service can share a node with an OSD and/or monitor service, we encourage you to deploy it on a separate cluster node for performance reasons.

cephadm > ceph-deploy install mds-host-name
cephadm > ceph-deploy mds create host-name:daemon-name

You may optionally specify a daemon instance name if you need to run multiple daemons on a single server.
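
For example, to create an MDS instance named mds-daemon-1 on a host named mds1 (both names are only illustrative), you might run:

cephadm > ceph-deploy mds create mds1:mds-daemon-1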

After you deploy your MDS, allow the Ceph OSD/MDS service in the firewall settings of the server where MDS is deployed. Start yast, navigate to Security and Users › Firewall › Allowed Services and in the Service to Allow drop-down menu select Ceph OSD/MDS. If the Ceph MDS node is not allowed full traffic, mounting a file system fails, even though other operations may work properly.

19.1.2 Configuring a Metadata Server

You can fine-tune the MDS behavior by inserting relevant options in the ceph.conf configuration file. For a detailed list of MDS related configuration options, see http://docs.ceph.com/docs/master/cephfs/mds-config-ref/.

For a detailed list of MDS journaler configuration options, see http://docs.ceph.com/docs/master/cephfs/journaler/.

19.2 CephFS

When you have a healthy Ceph storage cluster with at least one Ceph metadata server, you may create and mount your Ceph file system. Ensure that your client has network connectivity and a proper authentication keyring.

19.2.1 Creating CephFS

A CephFS requires at least two RADOS pools: one for data and one for metadata. When configuring these pools, you might consider:

  • Using a higher replication level for the metadata pool, as any data loss in this pool can render the whole file system inaccessible.

  • Using lower-latency storage such as SSDs for the metadata pool, as this will improve the observed latency of file system operations on clients.

For more information on managing pools, see Chapter 13, Managing Storage Pools.

To create the two required pools—for example 'cephfs_data' and 'cephfs_metadata'—with default settings for use with CephFS, run the following commands:

ceph osd pool create cephfs_data pg_num
ceph osd pool create cephfs_metadata pg_num
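
For example, to create both pools with 64 placement groups each (a value you should adapt to the size of your cluster):

ceph osd pool create cephfs_data 64
ceph osd pool create cephfs_metadata 64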

When the pools are created, you may enable the file system with the ceph fs new command:

ceph fs new fs_name metadata_pool_name data_pool_name

For example:

ceph fs new cephfs cephfs_metadata cephfs_data

You can check that the file system was created by listing all available CephFS's:

ceph fs ls
 name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

When the file system has been created, your MDS will be able to enter an active state. For example, in a single MDS system:

ceph mds stat
 e5: 1/1/1 up

19.2.2 Mounting CephFS

Once the file system is created and the MDS is active, you are ready to mount the file system from a client host.

19.2.2.1 Client Preparation

If the client host is running SUSE Linux Enterprise 12 SP2, you can skip this section as the system is ready to mount CephFS 'out-of-the-box'.

If the client host is running SUSE Linux Enterprise 12 SP1, you need to apply all the latest patches before mounting CephFS.

In either case (the client is SUSE Linux Enterprise 12 SP2 or SUSE Linux Enterprise 12 SP1), neither the SUSE Enterprise Storage product itself nor the ceph-common package need be installed on the client host, and it is always recommended to run fully updated systems.

19.2.2.2 Create a Secret File

The Ceph cluster runs with authentication turned on by default. You should create a file that stores your secret key (not the keyring itself). To obtain the secret key for a particular user and then create the file, do the following:

Procedure 19.1: Creating a Secret Key
  1. View the key for the particular user in a keyring file:

    cat /etc/ceph/ceph.client.admin.keyring
  2. Copy the key of the user who will be using the mounted CephFS file system. Usually, the key looks similar to the following:

    [client.admin]
       key = AQCj2YpRiAe6CxAA7/ETt7Hcl9IyxyYciVs47w==
  3. Create a file with the user name as part of the file name, for example /etc/ceph/admin.secret for the user admin.

  4. Paste the key value to the file created in the previous step.

  5. Set proper access rights to the file. The user should be the only one who can read the file; others should not have any access rights.
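
As a sketch that covers the steps above for the admin user, you can extract the secret key directly and restrict the file permissions (adapt the user name as needed):

ceph auth get-key client.admin | sudo tee /etc/ceph/admin.secret
sudo chmod 600 /etc/ceph/admin.secret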

19.2.2.3 Mount CephFS with the Kernel Driver

You can mount CephFS with the mount command. You need to specify the monitor host name or IP address.

Tip
Tip: Specify Multiple Monitors

It is a good idea to specify multiple monitors separated by commas on the mount command line in case one monitor happens to be down at the time of mount. Each monitor address takes the form host[:port]. If the port is not specified, it defaults to 6789.

Create the mount point on the local host:

sudo mkdir /mnt/cephfs

Mount the CephFS:

sudo mount -t ceph ceph_mon1:6789:/ /mnt/cephfs

A subdirectory subdir may be specified if a subset of the file system is to be mounted:

sudo mount -t ceph ceph_mon1:6789:/subdir /mnt/cephfs

You can specify more than one monitor host in the mount command:

sudo mount -t ceph ceph_mon1,ceph_mon2,ceph_mon3:6789:/ /mnt/cephfs
Tip
Tip: CephFS and cephx Authentication

To mount CephFS with cephx authentication enabled, you need to specify a user name and a secret:

sudo mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secret=AQATSKdNGBnwLhAAnNDKnH65FmVKpXZJVasUeQ==

As the previous command remains in the shell history, a more secure approach is to read the secret from a file:

sudo mount -t ceph ceph_mon1:6789:/ /mnt/cephfs \
 -o name=admin,secretfile=/etc/ceph/admin.secret
Important
Important: Read Access to the Root Directory

If clients with path restriction are used, the mds capabilities need to include read access to the root directory. For example a keyring may look as follows:

client.bar
 key: supersecretkey
 caps: [mds] allow rw path=/barjail, allow r path=/
 caps: [mon] allow r
 caps: [osd] allow rwx

The allow r path=/ part means that path-restricted clients are able to see the root volume, but cannot write to it. This may be an issue for use cases where complete isolation is a requirement.

19.2.3 Unmounting CephFS

To unmount the CephFS, use the umount command:

sudo umount /mnt/cephfs

19.2.4 CephFS in /etc/fstab

To mount CephFS automatically on the client start-up, insert the corresponding line in its file systems table /etc/fstab:

mon1:6790,mon2:/subdir /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/secret.key,noatime 0 2
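
After adding the entry, you can verify it by mounting the share with only the mount point from the example above:

sudo mount /mnt/cephfs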

19.3 Managing Failover

If an MDS daemon stops communicating with the monitor, the monitor will wait mds_beacon_grace seconds (default 15 seconds) before marking the daemon as laggy. You can configure one or more 'standby' daemons that will take over during the MDS daemon failover.

19.3.1 Configuring Standby Daemons

There are several configuration settings that control how a daemon will behave while in standby. You can specify them in the ceph.conf on the host where the MDS daemon runs. The daemon loads these settings when it starts, and sends them to the monitor.

By default, if none of these settings are used, all MDS daemons which do not hold a rank will be used as 'standbys' for any rank.

The settings which associate a standby daemon with a particular name or rank do not guarantee that the daemon will only be used for that rank. They mean that when several standbys are available, the associated standby daemon will be used. If a rank is failed, and a standby is available, it will be used even if it is associated with a different rank or named daemon.

mds_standby_replay

If set to true, the standby daemon will continuously read the metadata journal of an up rank. This gives it a warm metadata cache and speeds up the process of failing over if the daemon serving the rank fails.

An up rank may only have one standby replay daemon assigned to it. If two daemons are both set to be standby replay then one of them will arbitrarily win, and the other will become a normal non-replay standby.

Once a daemon has entered the standby replay state, it will only be used as a standby for the rank that it is following. If another rank fails, this standby replay daemon will not be used as a replacement, even if no other standbys are available.

mds_standby_for_name

Set this to make the standby daemon only take over a failed rank if the last daemon to hold it matches this name.

mds_standby_for_rank

Set this to make the standby daemon only take over the specified rank. If another rank fails, this daemon will not be used to replace it.

Use it in conjunction with mds_standby_for_fscid to be specific about which file system's rank you are targeting in case of multiple file systems.

mds_standby_for_fscid

If mds_standby_for_rank is set, this is simply a qualifier to say which file system's rank is referred to.

If mds_standby_for_rank is not set, then setting FSCID will cause this daemon to target any rank in the specified FSCID. Use this if you have a daemon that you want to use for any rank, but only within a particular file system.

mon_force_standby_active

This setting is used on monitor hosts. It defaults to true.

If it is false, then daemons configured with standby_replay=true will only become active if the rank/name that they have been configured to follow fails. On the other hand, if this setting is true, then a daemon configured with standby_replay=true may be assigned some other rank.

19.3.2 Examples

Several example ceph.conf configurations follow. You can either copy a ceph.conf with the configuration of all daemons to all your servers, or you can have a different file on each server that contains just that server's daemons configuration.

19.3.2.1 Simple Pair

Two MDS daemons 'a' and 'b' acting as a pair. Whichever one is not currently assigned a rank will be the standby replay follower of the other.

[mds.a]
mds standby replay = true
mds standby for rank = 0

[mds.b]
mds standby replay = true
mds standby for rank = 0
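
After both daemons have started, you can check which one currently holds the rank and whether the other is up as a standby with the ceph mds stat command shown earlier; the exact output depends on your cluster state.

ceph mds stat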

20 NFS-Ganesha: Export Ceph Data via NFS

Warning
Warning: Technology Preview

As of SUSE Enterprise Storage 4, NFS-Ganesha is considered a technology preview and is not supported.

NFS-Ganesha is an NFS server (refer to Sharing File Systems with NFS (https://www.suse.com/documentation/sles-12/book_sle_admin/data/cha_nfs.html) ) that runs in a user address space instead of as part of the operating system kernel. With NFS-Ganesha you can plug in your own storage mechanism—such as Ceph—and access it from any NFS client.

20.1 Installation

Although it is possible to install and run the NFS-Ganesha server on an already existing Ceph node, we recommend running it on a dedicated host with access to the Ceph cluster. The client hosts are typically not part of the cluster, but they need to have network access to the NFS-Ganesha server.

To run the NFS-Ganesha server, you need to install the nfs-ganesha package:

sudo zypper in nfs-ganesha

If you plan to export CephFS via NFS, install the nfs-ganesha-ceph package:

sudo zypper in nfs-ganesha-ceph

If you plan to export RADOS Gateway buckets via NFS, install the nfs-ganesha-rgw package:

sudo zypper in nfs-ganesha-rgw

20.2 Configuration

This section includes information to help you configure the NFS-Ganesha server to export the cluster data accessible via RADOS Gateway and CephFS.

20.2.1 NFS-Ganesha Common Configuration

NFS-Ganesha configuration is controlled by /etc/ganesha/ganesha.conf. Edit it and change as required:

EXPORT
{
  Export_Id = 1; # Each export needs to have a unique 'Export_Id' (mandatory)
  Path = "/"; # Export path in the related CephFS pool (mandatory)
  Pseudo = "/"; # Target NFS export path (mandatory for NFSv4)
  Access_Type = RW; # 'RO' for read-only access, default is 'None'
  Squash = No_Root_Squash; # NFS squash option
  [...]
  FSAL { # Exporting 'File System Abstraction Layer'
    Name = CEPH; # Ganesha backend, 'CEPH' for CephFS or 'RGW' for RADOS Gateway
  }
}

20.2.2 NFS Access to CephFS Data

Tip
Tip

Before you start configuring NFS-Ganesha for CephFS, an instance of CephFS needs to be already configured and running. Refer to Chapter 19, Clustered File System for more information on CephFS.

To be able to export data stored on the active CephFS, edit /etc/ganesha/ganesha.conf and check that it includes the following snippets (configured to your needs):

EXPORT
{
[...]
  FSAL {
    Name = CEPH;
  }
}

20.2.3 NFS Access to RADOS Gateway Buckets

Tip
Tip

Before you start configuring NFS-Ganesha for RADOS Gateway, an instance of RADOS Gateway needs to be already configured and running. Refer to Chapter 17, Ceph RADOS Gateway for more information on RADOS Gateway.

To export data stored in RADOS Gateway buckets via NFS, do the following:

  1. Add a keyring entry to the [global] section of /etc/ceph/ceph.conf. NFS-Ganesha needs it to communicate with librados:

    keyring = /etc/ceph/ceph.client.admin.keyring
  2. NFS-Ganesha is controlled by /etc/ganesha/ganesha.conf by default. The package nfs-ganesha-rgw installs a new file /etc/ganesha/rgw.conf. You can either merge its content to /etc/ganesha/ganesha.conf, or make a backup of both files and create the following symbolic link:

    sudo cp /etc/ganesha/ganesha.conf /etc/ganesha/ganesha.conf.orig
    sudo cp /etc/ganesha/rgw.conf /etc/ganesha/rgw.conf.orig
    sudo ln -sf /etc/ganesha/rgw.conf /etc/ganesha/ganesha.conf
  3. Edit /etc/ganesha/ganesha.conf and check that it includes the following snippets (configured to your needs):

    EXPORT
    {
    [...]
      FSAL {
        Name = RGW;
        User_Id = "s3_user_id";
        Access_Key_Id = "s3_access_key";
        Secret_Access_Key = "s3_secret_key";
      }
      RGW {
       ceph_conf = "/etc/ceph/ceph.conf";
      }
    }
Important
Important: Limitations for Access to RADOS Gateway Buckets

This technology preview feature has several limitations in usage. Refer to the relevant Release Notes (https://www.suse.com/releasenotes/x86_64/SUSE-Enterprise-Storage/4/#fate-321078) section to get more detailed information.

20.3 Starting NFS-Ganesha Related Services

Enable and start the RPC service required by NFS-Ganesha:

sudo systemctl enable rpcbind rpc-statd
sudo systemctl start rpcbind rpc-statd

Enable and start the NFS-Ganesha service:

sudo systemctl enable nfs-ganesha
sudo systemctl start nfs-ganesha
Tip
Tip: Starting the NFS-Ganesha Service Manually

Instead of using systemd targets, you can start NFS-Ganesha manually. The following command starts NFS-Ganesha with the debugging information enabled:

sudo /usr/bin/ganesha.nfsd -L /var/log/ganesha.log \
 -f /etc/ganesha/ganesha.conf -N NIV_DEBUG

20.4 Verifying the Exported NFS Share

After you configure and start NFS-Ganesha, you can verify whether the NFS shares are exported on the NFS-Ganesha server node:

sudo showmount -e
/ (everything)

20.5 Mounting the Exported NFS Share

To mount the exported NFS share (as configured in Section 20.2, “Configuration”) on a client host, run:

sudo mount -t nfs -o rw,noatime,sync \
 nfs_ganesha_server_hostname:/ /path/to/local/mountpoint
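
For example, assuming a gateway host named ganesha.example.com and a local mount point /mnt/ganesha (both names are only illustrative):

sudo mkdir -p /mnt/ganesha
sudo mount -t nfs -o rw,noatime,sync ganesha.example.com:/ /mnt/ganesha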

Part V Managing Cluster with GUI Tools

21 openATTIC

openATTIC is a central storage management system which supports Ceph storage cluster. With openATTIC you can control everything from a central management interface. It is no longer necessary to be familiar with the inner workings of the Ceph storage tools. Cluster management tasks can be carried out…

22 Calamari

Calamari is a management and monitoring system for Ceph storage cluster. It provides a Web user interface that makes Ceph cluster monitoring very simple and handy.

21 openATTIC

openATTIC is a central storage management system which supports Ceph storage clusters. With openATTIC you can control everything from a central management interface. It is no longer necessary to be familiar with the inner workings of the Ceph storage tools. Cluster management tasks can be carried out either by using openATTIC's intuitive Web interface, or via its REST API.

21.1 Installing openATTIC

The openATTIC installation procedures described in this section apply only if you deployed your Ceph cluster by using ceph-deploy or Crowbar. If you used DeepSea to deploy your Ceph cluster, openATTIC is automatically installed and started on the Salt master node. For the procedure to remove it, refer to Section 21.1.3, “Removing openATTIC from the Salt master Node”. If you need to install openATTIC on a different node via DeepSea, refer to Section 4.3.2.1, “Using Customized Configuration Files”.

21.1.1 Installing Required Packages

While you can install and run openATTIC on any existing Ceph cluster node, we recommend installing it on the admin node. openATTIC is included in the SUSE Enterprise Storage extension. To install the required packages, run

sudo zypper in openattic

The openATTIC GUI requires access to a running Ceph cluster. Access is granted by a keyring. You can either reuse the admin keyring and adjust its permissions:

sudo scp admin_node:/etc/ceph/ceph.client.admin.keyring /etc/ceph
sudo chgrp openattic /etc/ceph/ceph.client.admin.keyring
sudo chmod g+r /etc/ceph/ceph.client.admin.keyring

or you can create a new keyring dedicated for openATTIC by using ceph-authtool:

sudo ceph-authtool -C path_to_keyring
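
As an alternative sketch using ceph auth get-or-create rather than ceph-authtool, the following creates a dedicated client.openattic user and keyring and makes it readable by the openattic group. The capabilities and file name shown here are only an illustration; adjust them to your security policy:

sudo ceph auth get-or-create client.openattic mon 'allow r' osd 'allow rwx' \
 -o /etc/ceph/ceph.client.openattic.keyring
sudo chgrp openattic /etc/ceph/ceph.client.openattic.keyring
sudo chmod g+r /etc/ceph/ceph.client.openattic.keyring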
Tip
Tip

openATTIC will work correctly only if it is the only Web-based application on the specific host. Do not share the host with another Web application such as Calamari.

21.1.2 openATTIC Initial Setup

After the packages are installed, run the actual openATTIC setup:

sudo oaconfig install

oaconfig install will start a number of services, initialize the openATTIC database, and scan the system for pools and volumes to include.

By default, oaconfig creates an administrative user account openattic, with the same password as the user name. As a security precaution, we strongly recommend changing this password immediately:

sudo oaconfig changepassword openattic
Changing password for user 'openattic'
Password: <enter password>
Password (again): <re-enter password>
Password changed successfully for user 'openattic'

Now your openATTIC storage system can be managed by the Web user interface.

21.1.3 Removing openATTIC from the Salt master Node

If you deployed Ceph by using DeepSea, you have a running instance of openATTIC on the Salt master node. If you want to remove openATTIC from the Salt master node and block it from being installed there again, perform the following steps:

  1. Remove the running instance of openATTIC:

    root@master > zypper remove openattic*
  2. On the Salt master create a file: /srv/salt/ceph/stage/openattic/disable.sls

  3. To the file /srv/salt/ceph/stage/openattic/disable.sls add the following line:

    disable openattic: Salt master hostname
  4. To the /srv/pillar/ceph/stack/ceph/cluster.yml file add the following:

    stage_openattic: disable
  5. Run stages 2, 3 and 4 again as described in Section 4.2, “Deploying with DeepSea and Salt”.

  6. To verify that openATTIC has been removed and will not be installed again, run:

    salt '*' saltutil.pillar_refresh
    salt-run state.orch ceph.stage.openattic

21.2 openATTIC Web User Interface

openATTIC can be managed using a Web user interface. Open a Web browser and navigate to http://www.example.org/openattic. To log in, use the default user name openattic and the corresponding password.

openATTIC Login Screen
Figure 21.1: openATTIC Login Screen

The openATTIC user interface is graphically divided into a top menu pane and a content pane.

The right part of the top pane includes a link to the current user settings, and a Logout link. The rest of the top pane includes the main openATTIC menu.

The content pane changes depending on which menu item is selected. By default, a Dashboard is displayed showing general Ceph cluster statistics.

openATTIC Dashboard
Figure 21.2: openATTIC Dashboard

21.3 Dashboard

Dashboard shows the overall statistics of the running Ceph cluster. By default it shows the following widgets: Ceph Status, Utilization, OSD Status, and Throughput.

The Ceph Status widget tells whether the cluster is operating correctly. In case a problem is detected, you can view the detailed error message by clicking the subtitle inside the widget.

Ceph Status
Figure 21.3: Ceph Status

The OSD Status widget shows the total number of OSD nodes and the number of online OSD nodes in the cluster over time.

OSDs Status
Figure 21.4: OSDs Status

The Utilization widget shows the storage usage over time. You can activate or deactivate the following charts:

  • Bytes total - shows the total storage size.

  • Bytes available - shows the remaining available space.

  • Bytes used - shows the occupied space.

Cluster Utilization
Figure 21.5: Cluster Utilization

The Throughput widget shows the read/write statistics per second over time.

Throughput
Figure 21.6: Throughput
Tip
Tip: More Details on Mouse Over

If you move the mouse pointer over any of the displayed charts, a pop-up window shows more details related to the date and time you point at.

21.4 Ceph Related Tasks

openATTIC's main menu lists Ceph related tasks. Currently, the following tasks are relevant: OSDs, RBDs, Pools, Nodes and CRUSH Map.

21.4.1 Common Web UI Features

In openATTIC you often work with lists, for example lists of pools, OSD nodes, or RBD devices. The following common widgets help you manage or adjust these lists:

  • Click to refresh the list of items.

  • Click to display or hide individual table columns.

  • Click and select how many rows to display on a single page.

  • Click inside and filter the rows by typing the string to search for.

  • Use to change the currently displayed page if the list spans multiple pages.

21.4.2 Listing OSD Nodes

To list all available OSD nodes, click OSDs from the main menu.

The list shows each OSD's name, host name, status, and weight.

List of OSD nodes
Figure 21.7: List of OSD nodes

21.4.3 Managing RADOS Block Devices (RBDs)

To list all available RADOS block devices, click RBDs from the main menu.

The list shows each device's name, the related pool name, the size of the device, and the percentage of space already used.

List of RBDs
Figure 21.8: List of RBDs

To view more detailed information about a device, activate its check box in the very left column:

RBD Details
Figure 21.9: RBD Details

21.4.3.1 Deleting RBDs

To delete a device or a group of devices, activate their check boxes in the very left column and click Delete in the top left of the RBDs table:

Deleting RBD
Figure 21.10: Deleting RBD

21.4.3.2 Adding RBDs

To add a new device, click Add in the top left of the RBDs table and do the following on the Create RBD screen:

Adding a New RBD
Figure 21.11: Adding a New RBD
  1. Enter the name of the new device. Refer to Section 2.5, “Naming Limitations” for naming limitations.

  2. Select the cluster that will store the new pool.

  3. Select the pool from which the new RBD device will be created.

  4. Specify the size of the new device. If you click the use max link above, the maximum pool size is populated.

  5. To fine tune the device parameters, click Expert settings and activate or deactivate displayed options.

  6. Confirm with Create.

21.4.4 Managing Pools

Tip
Tip: More Information on Pools

For more general information about Ceph pools, refer to Chapter 13, Managing Storage Pools. For information specific to erasure coded pools, refer to Chapter 15, Erasure Coded Pools.

To list all available pools, click Pools from the main menu.

The list shows each pool's name, ID, the percentage of used space, the number of placement groups, replica size, type ('replicated' or 'erasure'), erasure code profile, and the CRUSH ruleset.

List of Pools
Figure 21.12: List of Pools

To view more detailed information about a pool, activate its check box in the very left column:

Pool Details
Figure 21.13: Pool Details

21.4.4.1 Deleting Pools

To delete a pool or a group of pools, activate their check boxes in the very left column and click Delete in the top left of the pools table:

Deleting Pools
Figure 21.14: Deleting Pools

21.4.4.2 Adding Pools

To add a new pool, click Add in the top left of the pools table and do the following on the Create Ceph pool screen:

Adding a New Pool
Figure 21.15: Adding a New Pool
  1. Enter the name of the new pool. Refer to Section 2.5, “Naming Limitations” for naming limitations.

  2. Select the cluster that will store the new pool.

  3. Select the pool type. Pools can be either replicated or erasure coded.

  4. Specify the number of the pool's placement groups.

  5. For a replicated pool, specify the replica size.

  6. Confirm with Create.

21.4.5 Listing Nodes

Important
Important: Salt Only Deployment

The Nodes tab is only available when the cluster is deployed via Salt. Refer to Chapter 4, Deploying with DeepSea and Salt for more information on Salt.

Click Nodes from the main menu to view the list of nodes available on the cluster.

List of Nodes
Figure 21.16: List of Nodes

Each node is represented by its host name, public IP address, cluster ID it belongs to, node role (for example 'admin', 'storage', or 'master'), and key acceptance status.

21.4.6 Viewing the Cluster CRUSH Map

Click CRUSH Map from the main menu to view the cluster's CRUSH Map.

CRUSH Map
Figure 21.17: CRUSH Map

In the Physical setup pane, you can see the structure of the cluster as described by the CRUSH Map.

In the Replication rules pane, you can view individual rulesets after selecting one of them from the Content from drop-down box.

Replication rules
Figure 21.18: Replication rules

22 Calamari

Calamari is a management and monitoring system for Ceph storage clusters. It provides a Web user interface that makes Ceph cluster monitoring simple and handy.

The Calamari installation procedure differs depending on the deployment method used. If you deployed your Ceph cluster by using ceph-deploy, refer to Section 22.1, “Installing Calamari with ceph-deploy”. If you deployed your cluster by using Crowbar, refer to Section 22.2, “Installing Calamari Using Crowbar”.

22.1 Installing Calamari with ceph-deploy

To install Calamari, do the following:

  1. Install the client part of Calamari:

    sudo zypper in romana
  2. Initialize the Calamari installation. You will be asked for a superuser user name and password. These will be needed when logging in to the Web interface after the setup is complete.

    sudo calamari-ctl initialize
    [INFO] Loading configuration..
    [INFO] Starting/enabling salt...
    [INFO] Starting/enabling postgres...
    [INFO] Initializing database...
    [INFO] Initializing web interface...
    [INFO] You will now be prompted for login details for the administrative user
    account.  This is the account you will use to log into the web interface once
    setup is complete.
    Username (leave blank to use 'root'):
    Email address:
    Password:
    Password (again):
    Superuser created successfully.
    [INFO] Starting/enabling services...
    [INFO] Restarting services...
    [INFO] Complete.
  3. Check the firewall status

    sudo /sbin/SuSEfirewall2 status

    and if it is off, check its configuration and turn it on with

    sudo /sbin/SuSEfirewall2 on

    You can find detailed information in Section 37.2, “Firewall Settings for Ceph”.

  4. Tip
    Tip

    In order for Calamari to work correctly, the admin keyring needs to be installed on each monitor node:

    cephadm > ceph-deploy admin mon1 mon2 mon3

    where mon1, mon2, or mon3 are the host names of all the monitors.

    Now open your Web browser and point it to the host name/IP address of the server where you installed Calamari. Log in with the credentials you entered when installing the Calamari client. A welcome screen appears, instructing you to enter the ceph-deploy calamari connect command. Switch to the terminal on the Calamari host and enter the following command. Note that the --master option specifies the host name of the Calamari server to which all the cluster nodes connect, and that you need to list all the relevant OSD and monitor nodes as well:

    cephadm > ceph-deploy calamari connect --master master_host osd1 osd2 ... mon1 mon2 ...

    After the command is successfully finished, reload the Web browser. Now you can monitor your Ceph cluster, OSDs, pools, etc.

    Tip
    Tip: Empty Usage Graphs

    If, after having installed Calamari initially, the usage graphs are empty/blank, it is possible that the diamond metrics collector was not automatically installed. To fix this, run salt '*' state.highstate on the Calamari host.

    Important
    Important

    The Calamari dashboard screen shows the current status of the cluster. This updates regularly, so any change to the cluster state (for example if a node goes offline) should be reflected in Calamari within a few seconds. The Health panel includes a timer to indicate how long it has been since Calamari last saw heartbeat information from the cluster. Normally, this will not be more than one minute old, but in certain failure cases, for example when a network outage occurs or if the cluster loses quorum (that is, if more than half of the monitor nodes are down), Calamari will no longer be able to determine the cluster state. In this case, the Health panel will indicate that the last update was more than one minute ago. After too long with no updates, Calamari displays a warning at the top of the screen: "Cluster Updates Are Stale. The Cluster is not updating Calamari." If this occurs, the other status information Calamari presents will not be correct, so you should investigate further to check the status of your storage nodes and network.

    Note
    Note: Salt Installed by Default with Calamari

    Even if you deployed your Ceph cluster by using ceph-deploy, Salt is installed along with Calamari. The salt command is therefore available even though you did not install Salt manually.

    Tip
    Tip

    There may be leftovers of a previous Calamari setup on the system. If, after logging in to the Calamari application, some nodes are already joined or registered, run the following on the Calamari host to trigger a re-run of Salt on all Ceph nodes, which should clear up any odd state or missing bits and pieces.

    salt '*' state.highstate

    We also recommend removing files from the previous Calamari setup, such as state files, configuration files, or PostgreSQL database files. At minimum, remove the files in the following directories:

    • /etc/calamari/

    • /etc/salt/

    • /etc/graphite/

    • /var/*/salt/

    • /var/lib/graphite/

    • /var/lib/pgsql/

22.2 Installing Calamari Using Crowbar

Note
Note: Conflicts in Combination with Deployment using Crowbar

If you used Crowbar to install SUSE Enterprise Storage, install Calamari on a different server than Crowbar as Crowbar uses the same port as Calamari (port 80).

Use the Crowbar UI to deploy Calamari as described in Section 6.2.4, “Deploying Ceph”.

Part VI Integration with Virtualization Tools

23 Using libvirt with Ceph

The libvirt library creates a virtual machine abstraction layer between hypervisor interfaces and the software applications that use them. With libvirt, developers and system administrators can focus on a common management framework, common API, and common shell interface (virsh) to many different h…

24 Ceph as a Back-end for QEMU KVM Instance

The most frequent Ceph use case involves providing block device images to virtual machines. For example, a user may create a 'golden' image with an OS and any relevant software in an ideal configuration. Then, the user takes a snapshot of the image. Finally, the user clones the snapshot (usually man…

23 Using libvirt with Ceph

The libvirt library creates a virtual machine abstraction layer between hypervisor interfaces and the software applications that use them. With libvirt, developers and system administrators can focus on a common management framework, common API, and common shell interface (virsh) to many different hypervisors, including QEMU/KVM, Xen, LXC, or VirtualBox.

Ceph block devices support QEMU/KVM. You can use Ceph block devices with software that interfaces with libvirt. Cloud solutions use libvirt to interact with QEMU/KVM, and QEMU/KVM interacts with Ceph block devices via librbd.

To create VMs that use Ceph block devices, use the procedures in the following sections. In the examples, we have used libvirt-pool for the pool name, client.libvirt for the user name, and new-libvirt-image for the image name. You may use any value you like, but ensure you replace those values when executing commands in the subsequent procedures.

23.1 Configuring Ceph

To configure Ceph for use with libvirt, perform the following steps:

  1. Create a pool. The following example uses the pool name libvirt-pool with 128 placement groups.

    ceph osd pool create libvirt-pool 128 128

    Verify that the pool exists.

    ceph osd lspools
  2. Create a Ceph User. The following example uses the Ceph user name client.libvirt and references libvirt-pool.

    ceph auth get-or-create client.libvirt mon 'allow r' osd \
     'allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool'

    Verify the name exists.

    ceph auth list
    Note
    Note

    libvirt will access Ceph using the ID libvirt, not the Ceph name client.libvirt. See http://ceph.com/docs/master/rados/operations/user-management#user for a detailed explanation of the difference between ID and name.

  3. Use QEMU to create an image in your RBD pool. The following example uses the image name new-libvirt-image and references libvirt-pool.

    qemu-img create -f rbd rbd:libvirt-pool/new-libvirt-image 2G

    Verify the image exists.

    rbd -p libvirt-pool ls

23.2 Preparing the VM Manager

You may use libvirt without a VM manager, but you may find it simpler to create your first domain with virt-manager.

  1. Install a virtual machine manager.

    sudo zypper in virt-manager
  2. Prepare/download an OS image of the system you want to run virtualized.

  3. Launch the virtual machine manager.

    virt-manager

23.3 Creating a VM

To create a VM with virt-manager, perform the following steps:

  1. Choose the connection from the list, right-click it, and select New.

  2. Import the existing disk image by providing the path to the existing storage. Specify the OS type and memory settings, and name the virtual machine, for example libvirt-virtual-machine.

  3. Finish the configuration and start the VM.

  4. Verify that the newly created domain exists with sudo virsh list. If needed, specify the connection string, such as

    virsh -c qemu+ssh://root@vm_host_hostname/system list
    Id    Name                           State
    -----------------------------------------------
    [...]
     9     libvirt-virtual-machine       running
  5. Log in to the VM and stop it before configuring it for use with Ceph.

23.4 Configuring the VM

When configuring the VM for use with Ceph, it is important to use virsh where appropriate. Additionally, virsh commands often require root privileges (sudo) and will not return appropriate results or notify you that root privileges are required. For a reference of virsh commands, refer to Virsh Command Reference (http://www.libvirt.org/virshcmdref.html).

  1. Open the configuration file with virsh edit vm-domain-name.

    sudo virsh edit libvirt-virtual-machine
  2. Under <devices> there should be a <disk> entry.

    <devices>
        <emulator>/usr/bin/qemu-system-x86_64</emulator>
        <disk type='file' device='disk'>
          <driver name='qemu' type='raw'/>
          <source file='/path/to/image/recent-linux.img'/>
          <target dev='vda' bus='virtio'/>
          <address type='drive' controller='0' bus='0' unit='0'/>
        </disk>

    Replace /path/to/image/recent-linux.img with the path to the OS image.

    Important
    Important

    Use sudo virsh edit instead of a text editor. If you edit the configuration file under /etc/libvirt/qemu with a text editor, libvirt may not recognize the change. If there is a discrepancy between the contents of the XML file under /etc/libvirt/qemu and the result of sudo virsh dumpxml vm-domain-name, then your VM may not work properly.

  3. Add the Ceph RBD image you previously created as a <disk> entry.

    <disk type='network' device='disk'>
            <source protocol='rbd' name='libvirt-pool/new-libvirt-image'>
                    <host name='monitor-host' port='6789'/>
            </source>
            <target dev='vda' bus='virtio'/>
    </disk>

    Replace monitor-host with the name of your host, and replace the pool and/or image name as necessary. You may add multiple <host> entries for your Ceph monitors. The dev attribute is the logical device name that will appear under the /dev directory of your VM. The optional bus attribute indicates the type of disk device to emulate. The valid settings are driver specific (for example ide, scsi, virtio, xen, usb or sata). See Disks (http://www.libvirt.org/formatdomain.html#elementsDisks) for details of the <disk> element, and its child elements and attributes.

  4. Save the file.

  5. If your Ceph cluster has authentication enabled (it does by default), you must generate a secret.

    cat > secret.xml <<EOF
    <secret ephemeral='no' private='no'>
            <usage type='ceph'>
                    <name>client.libvirt secret</name>
            </usage>
    </secret>
    EOF
  6. Define the secret.

    sudo virsh secret-define --file secret.xml
    <uuid of secret is output here>
  7. Get the client.libvirt key and save the key string to a file.

    ceph auth get-key client.libvirt | sudo tee client.libvirt.key
  8. Set the UUID of the secret.

    sudo virsh secret-set-value --secret uuid of secret \
    --base64 $(cat client.libvirt.key) && rm client.libvirt.key secret.xml

    You must also set the secret manually by adding the following <auth> entry to the <disk> element you entered earlier (replacing the uuid value with the result from the command line example above).

    sudo virsh edit libvirt-virtual-machine

    Then, add the <auth></auth> element to the domain configuration file:

    ...
    </source>
    <auth username='libvirt'>
            <secret type='ceph' uuid='9ec59067-fdbc-a6c0-03ff-df165c0587b8'/>
    </auth>
    <target ...
    Note
    Note

    The example ID is libvirt, not the Ceph name client.libvirt as generated in step 2 of Section 23.1, “Configuring Ceph”. Ensure you use the ID component of the Ceph name you generated. If for some reason you need to regenerate the secret, you will need to execute sudo virsh secret-undefine uuid before executing sudo virsh secret-set-value again.

23.5 Summary

Once you have configured the VM for use with Ceph, you can start the VM. To verify that the VM and Ceph are communicating, you may perform the following procedures.

  1. Check to see if Ceph is running:

    ceph health
  2. Check to see if the VM is running:

    sudo virsh list
  3. Check to see if the VM is communicating with Ceph. Replace vm-domain-name with the name of your VM domain:

    sudo virsh qemu-monitor-command --hmp vm-domain-name 'info block'
  4. Check to see if the device from <target dev='hdb' bus='ide'/> appears under /dev or under /proc/partitions:

    ls /dev
    cat /proc/partitions

24 Ceph as a Back-end for QEMU KVM Instance

The most frequent Ceph use case involves providing block device images to virtual machines. For example, a user may create a 'golden' image with an OS and any relevant software in an ideal configuration. Then, the user takes a snapshot of the image. Finally, the user clones the snapshot (usually many times, see Chapter 14, Snapshots for details). The ability to make copy-on-write clones of a snapshot means that Ceph can provision block device images to virtual machines quickly, because the client does not need to download an entire image each time it spins up a new virtual machine.

Ceph block devices can integrate with the QEMU virtual machines. For more information on QEMU KVM, see https://www.suse.com/documentation/sles-12/book_virt/data/part_virt_qemu.html.

24.1 Installation

In order to use Ceph block devices, QEMU needs to have the appropriate driver installed. Check whether the qemu-block-rbd package is installed, and install it if needed:

sudo zypper install qemu-block-rbd

24.2 Usage

The QEMU command line expects you to specify the pool name and image name. You may also specify a snapshot name.

qemu-img command options \
rbd:pool-name/image-name@snapshot-name:option1=value1:option2=value2...

For example, specifying the id and conf options might look like the following:

qemu-img command options \
rbd:pool_name/image_name:id=glance:conf=/etc/ceph/ceph.conf

24.3 Creating Images with QEMU

You can create a block device image from QEMU. You must specify rbd, the pool name, and the name of the image you want to create. You must also specify the size of the image.

qemu-img create -f raw rbd:pool-name/image-name size

For example:

qemu-img create -f raw rbd:pool1/image1 10G
Formatting 'rbd:pool1/image1', fmt=raw size=10737418240 nocow=off cluster_size=0
Important
Important

The raw data format is really the only sensible format option to use with RBD. Technically, you could use other QEMU-supported formats such as qcow2, but doing so would add additional overhead, and would also render the volume unsafe for virtual machine live migration when caching is enabled.

24.4 Resizing Images with QEMU

You can resize a block device image from QEMU. You must specify rbd, the pool name, and the name of the image you want to resize. You must also specify the size of the image.

qemu-img resize rbd:pool-name/image-name size

For example:

qemu-img resize rbd:pool1/image1 9G
Image resized.

24.5 Retrieving Image Info with QEMU

You can retrieve block device image information from QEMU. You must specify rbd, the pool name, and the name of the image.

qemu-img info rbd:pool-name/image-name

For example:

qemu-img info rbd:pool1/image1
image: rbd:pool1/image1
file format: raw
virtual size: 9.0G (9663676416 bytes)
disk size: unavailable
cluster_size: 4194304

24.6 Running QEMU with RBD

QEMU can access an image as a virtual block device directly via librbd. This avoids an additional context switch, and can take advantage of RBD caching.

You can use qemu-img to convert existing virtual machine images to Ceph block device images. For example, if you have a qcow2 image, you could run:

qemu-img convert -f qcow2 -O raw sles12.qcow2 rbd:pool1/sles12

To run a virtual machine booting from that image, you could run:

qemu -m 1024 -drive format=raw,file=rbd:pool1/sles12

RBD caching (http://ceph.com/docs/master/rbd/rbd-config-ref/#cache-settings) can significantly improve performance. QEMU’s cache options control librbd caching:

qemu -m 1024 -drive format=rbd,file=rbd:pool1/sles12,cache=writeback

24.7 Enabling Discard/TRIM

Ceph block devices support the discard operation. This means that a guest can send TRIM requests to let a Ceph block device reclaim unused space. This can be enabled in the guest by mounting XFS with the discard option.
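
For example, inside the guest you can mount an XFS file system on the RBD-backed disk with the discard option (the device name /dev/sdb is only an example; use the device that corresponds to your drive):

sudo mount -t xfs -o discard /dev/sdb /mnt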

For this to be available to the guest, it must be explicitly enabled for the block device. To do this, you must specify a discard_granularity associated with the drive:

qemu -m 1024 -drive format=raw,file=rbd:pool1/sles12,id=drive1,if=none \
-device driver=ide-hd,drive=drive1,discard_granularity=512
Note
Note

The above example uses the IDE driver. The virtio driver does not support discard.

If using libvirt, edit your libvirt domain’s configuration file using virsh edit to include the xmlns:qemu value. Then, add a qemu:commandline block as a child of that domain. The following example shows how to set two devices with qemu id= to different discard_granularity values.

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
 <qemu:commandline>
  <qemu:arg value='-set'/>
  <qemu:arg value='block.scsi0-0-0.discard_granularity=4096'/>
  <qemu:arg value='-set'/>
  <qemu:arg value='block.scsi0-0-1.discard_granularity=65536'/>
 </qemu:commandline>
</domain>

24.8 QEMU Cache Options

QEMU’s cache options correspond to the following Ceph RBD Cache settings.

Writeback:

rbd_cache = true

Writethrough:

rbd_cache = true
rbd_cache_max_dirty = 0

None:

rbd_cache = false

QEMU’s cache settings override Ceph’s default settings (settings that are not explicitly set in the Ceph configuration file). If you explicitly set RBD Cache (http://ceph.com/docs/master/rbd/rbd-config-ref/#cache-settings) settings in your Ceph configuration file, your Ceph settings override the QEMU cache settings. If you set cache settings on the QEMU command line, the QEMU command line settings override the Ceph configuration file settings.

Part VII Best Practices

25 Introduction

This chapter introduces selected topics that you may encounter when managing a Ceph environment. Each topic comes with a recommended solution that helps you understand or fix the existing problem. The topics are sorted into relevant categories.

26 Hardware Recommendations
27 Cluster Administration

This chapter describes some useful operations that can be performed after the cluster is completely deployed and running, such as adding nodes or disks.

28 Monitoring

29 Disk Management

30 Recovery

31 Accountancy

32 Tune-ups

33 Integration
34 Cluster Maintenance and Troubleshooting

35 Performance Diagnosis

36 Server Maintenance

37 Networking

25 Introduction

This chapter introduces selected topics that you may encounter when managing a Ceph environment. Each topic comes with a recommended solution that helps you understand or fix the existing problem. The topics are sorted into relevant categories.

25.1 Reporting Software Problems

If you come across a problem when running SUSE Enterprise Storage related to some of its components, such as Ceph, RADOS Gateway, or Calamari, report the problem to SUSE Technical Support. The recommended way is with the supportconfig utility.

Tip
Tip

Because supportconfig is modular software, make sure that the supportutils-plugin-ses package is installed.

rpm -q supportutils-plugin-ses

If it is missing on the Ceph server, install it with

zypper ref && zypper in supportutils-plugin-ses

Although you can use supportconfig on the command line, we recommend using the related YaST module. Find more information about supportconfig in https://www.suse.com/documentation/sles-12/singlehtml/book_sle_admin/book_sle_admin.html#sec.admsupport.supportconfig.

26 Hardware Recommendations

26.1 Can I Reduce Data Replication?

Ceph stores data within pools. Pools are logical groups for storing objects. Data objects within a pool are replicated so that they can be recovered when OSDs fail. New pools are created with the default of three replicas. This number includes the 'original' data object itself. Three replicas therefore mean the data object and two of its copies, for a total of three instances.

You can manually change the number of pool replicas (see Section 13.1.9, “Set the Number of Object Replicas”). Setting a pool to two replicas means that there is only one copy of the data object besides the object itself. If you lose one object instance, you need to trust during recovery that the other copy has not been corrupted, for example since the last scrubbing (http://ceph.com/docs/master/rados/configuration/osd-config-ref/#scrubbing).
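For example, to set an existing pool to two replicas (the pool name mypool is a placeholder):

ceph osd pool set mypool size 2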

Setting a pool to one replica means that there is exactly one instance of the data object in the pool. If the OSD fails, you lose the data. A possible usage for a pool with one replica is storing temporary data for a short time.

Setting more than three replicas for a pool brings only a small increase in reliability, but may be suitable in rare cases. Remember that the more replicas you set, the more disk space is needed for storing the object copies. If you need the ultimate data security, we recommend using erasure coded pools. For more information, see Chapter 15, Erasure Coded Pools.

Warning
Warning

We strongly encourage you to either leave the number of replicas for a pool at the default value of 3, or use a higher value if suitable. Setting the number of replicas to a smaller value is dangerous and may cause the loss of data stored in the cluster.

26.2 Can I Reduce Redundancy Similar to RAID 6 Arrays?

When creating a new pool, Ceph uses the replica type by default, which replicates objects across multiple disks to be able to recover from an OSD failure. While this type of pool is safe, it uses a lot of disk space to store objects.

To reduce the disk space needed, Ceph implements erasure coded pools. This method adds extra chunks of data to detect errors in a data stream. Erasure coded pools exhibit performance, reliability, and storage savings similar to RAID 6 arrays.

As erasure coding is a complex topic, you need to study it properly to be able to deploy it for optimum performance. For more information, see Chapter 15, Erasure Coded Pools.

26.3 What is the Minimum Disk Size for an OSD Node?

There are two types of disk space needed to run an OSD: the space for the disk journal, and the space for the stored data. The minimum (and default) value for the journal is 6GB. The minimum space for data is 5GB, as partitions smaller than 5GB are automatically assigned a weight of 0.

So although the minimum disk space for an OSD is 11GB, we do not recommend a disk smaller than 20GB, even for testing purposes.

26.4 How Much RAM Do I Need in a Storage Server?

The recommended minimum is 2GB per OSD. Note that during recovery, 1 or even 2GB of RAM per Terabyte of OSD disk space is optimal.

26.5 OSD and Monitor Sharing One Server

Although it is technically possible to run OSDs and monitor nodes on the same server in test environments, we strongly recommend having a separate server for each monitor node in production. The main reason is performance—the more OSDs the cluster has, the more I/O operations the monitor nodes need to perform. And when one server is shared between a monitor node and OSD(s), the OSD I/O operations are a limiting factor for the monitor node.

Another aspect is whether to share disks between an OSD, a monitor node, and the operating system on the server. The answer is simple: if possible, dedicate a separate disk to the OSD, and a separate server to the monitor node.

Although Ceph supports directory-based OSDs, an OSD should always have a dedicated disk other than the operating system one.

Tip
Tip

If it is really necessary to run OSD and monitor node on the same server, run the monitor on a separate disk by mounting the disk to the /var/lib/ceph/mon directory for slightly better performance.
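A minimal sketch of such a setup, assuming a spare partition /dev/sdb1 is available for the monitor data (the device name and file system are examples only):

sudo mkfs.xfs /dev/sdb1
sudo mount /dev/sdb1 /var/lib/ceph/mon

Add a corresponding entry to /etc/fstab so that the mount persists across reboots.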

26.6 How Many Disks Can I Have in a Server

You can have as many disks in one server as the server allows. There are a few things to consider when planning the number of disks per server:

  • Network bandwidth. The more disks you have in a server, the more data must be transferred via the network card(s) for the disk write operations.

  • Memory. For optimum performance, reserve at least 2GB of RAM per Terabyte of disk space installed.

  • Fault tolerance. If the complete server fails, the more disks it has, the more OSDs the cluster temporarily loses. Moreover, to keep the replication rules running, you need to copy all the data from the failed server between the other nodes in the cluster.

26.7 How Many OSDs Can Share a Single SSD Journal

Solid-state drives (SSD) have no moving parts. This reduces random access time and read latency while accelerating data throughput. Because their price per 1MB is significantly higher than the price of spinning hard disks, SSDs are only suitable for smaller storage.

OSDs may see a significant performance improvement by storing their journal on an SSD and the object data on a separate hard disk. The osd journal configuration setting defaults to /var/lib/ceph/osd/cluster-id/journal. You can mount this path to an SSD or to an SSD partition so that it is not merely a file on the same disk as the object data.
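For example, a minimal sketch of overriding the journal location in ceph.conf, assuming an SSD is mounted at /ssd (the path is an example only; $cluster and $id are expanded by Ceph for each OSD):

[osd]
osd journal = /ssd/$cluster-$id/journal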

Tip
Tip: Sharing an SSD for Multiple Journals

As journal data occupies relatively little space, you can mount several journal directories to a single SSD disk. Keep in mind that with each shared journal, the performance of the SSD disk degrades. We do not recommend sharing more than 6 journals on the same SSD disk.

27 Cluster Administration

This chapter describes some useful operations that can be performed after the cluster is completely deployed and running, such as adding nodes or disks.

27.1 Using ceph-deploy on an Already Setup Server

ceph-deploy is a command line utility to easily deploy a Ceph cluster (see Chapter 5, Deploying with ceph-deploy). After the cluster is deployed, you can use ceph-deploy to administer the cluster's nodes. You can add OSD nodes, monitor nodes, gather authentication keys, or purge a running cluster. ceph-deploy has the following general syntax:

ceph-deploy subcommands options

A list of selected ceph-deploy subcommands with short descriptions follows.

Tip
Tip

Administer Ceph nodes with ceph-deploy from the admin node. Before administering them, always create a new temporary directory and cd into it. Then choose one monitor node and gather the authentication keys with the gatherkeys subcommand from it, and copy the /etc/ceph/ceph.conf file from the monitor node into the current local directory.

cephadm >  mkdir ceph_tmp
cephadm >  cd ceph_tmp
cephadm >  ceph-deploy gatherkeys ceph_mon_host
cephadm >  scp ceph_mon_host:/etc/ceph/ceph.conf .
gatherkeys

Gathers authentication keys for provisioning new nodes. It takes host names as arguments. It checks for and fetches the client.admin keyring, the monitor keyring, and the bootstrap-mds/bootstrap-osd keyrings from the monitor host. These authentication keys are used when new monitors/OSDs/MDSs are added to the cluster.

Usage:

ceph-deploy gatherkeys hostname

hostname is the host name of the monitor from where keys are to be pulled.

mon add

Adds a monitor to an existing cluster. It first detects the platform and distribution of the target host, and checks whether the host name is compatible for deployment. It then uses the monitor keyring, ensures the configuration for the new monitor host, and adds the monitor to the cluster. If a section for the monitor exists in the configuration, it can define the monitor address with the mon addr option; otherwise it falls back to resolving the host name to an IP address. If --address is used, it overrides all other options. After adding the monitor to the cluster, it gives the monitor some time to start. It then looks for any monitor errors and checks the monitor status. Monitor errors arise if the monitor is not listed in the mon initial members option, if it does not exist in the monmap, or if neither the public_addr nor the public_network keys were defined for monitors. Under such conditions, monitors may not be able to form quorum. The monitor status tells whether the monitor is up and running normally. The status is checked by running ceph daemon mon.hostname mon_status on the remote end, which provides the output and returns a Boolean status. False means the monitor is not healthy even if it is up and running, while True means the monitor is up and running correctly.

Usage:

ceph-deploy mon add host
ceph-deploy mon add host --address IP

host is the host name and IP is the IP address of the desired monitor node.

osd prepare

Prepares a directory, disk, or drive for a Ceph OSD. It first checks whether multiple OSDs are being created and warns if more than the recommended number would be created, which could cause issues with the maximum allowed PIDs on a system. It then reads the bootstrap-osd key for the cluster, or writes the bootstrap key if it is not found. It then uses the ceph-disk utility's prepare subcommand to prepare the disk and journal and deploy the OSD on the desired host. It gives the OSD some time to settle, checks for any possible errors and, if found, reports them to the user.

Usage:

ceph-deploy osd prepare host:disk[:journal] ...
osd activate

Activates the OSD prepared using the prepare subcommand. It uses the ceph-disk utility's activate subcommand to activate the OSD with the appropriate initialization type based on the distribution. When activated, it gives the OSD some time to start, checks for any possible errors and, if found, reports them to the user. It checks the status of the prepared OSD, checks the OSD tree, and makes sure the OSDs are up and in.

Tip
Tip

osd activate is usually not needed, as udev rules automatically trigger activation after a disk has been prepared with osd prepare.

Usage:

ceph-deploy osd activate host:disk[:journal] ...
Tip
Tip

You can use ceph-deploy osd create to join prepare and activate functionality into one command.
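For example, using the same host:disk[:journal] syntax as osd prepare (the host and device names are illustrative only):

ceph-deploy osd create node2:vdc:/dev/ssd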

rgw prepare/activate/create

Find more information in Section 27.5, “Usage of ceph-deploy rgw”.

purge, purgedata, forgetkeys

You can use the subcommands to completely purge the Ceph cluster (or some of its nodes) as if Ceph was never installed on the cluster servers. They are typically used when Ceph installation fails and you want to start with a clean environment. Or, you can purge one or more nodes because you want to remove them from the cluster as their life-cycle ends.
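For example, to remove Ceph and its data from the nodes node1 and node2 and discard the locally stored keys (the host names are examples only), you could run:

ceph-deploy purge node1 node2
ceph-deploy purgedata node1 node2
ceph-deploy forgetkeys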

For more information on purging the cluster or its nodes, see Section 5.4, “Cleaning Previous Ceph Environment”.

Tip
Tip

If you do not intend to purge the whole cluster, do not use the forgetkeys subcommand, as the keys will remain in place for the remaining cluster infrastructure.

27.2 Adding OSDs with ceph-disk

ceph-disk is a utility that can prepare and activate a disk, partition or directory as a Ceph OSD. It automates the multiple steps involved in manual creation and start of an OSD into two steps of preparing and activating the OSD by using the subcommands prepare and activate.

prepare

Prepares a directory, disk, or drive for a Ceph OSD. It creates a GPT partition, marks the partition with the Ceph type UUID, creates a file system, marks the file system as ready for Ceph consumption, uses the entire partition, and adds a new partition to the journal disk.

activate

Activates the Ceph OSD. It mounts the volume in a temporary location, allocates an OSD ID (if needed), remounts in the correct location /var/lib/ceph/osd/cluster-id and starts ceph-osd.

The following example shows steps for adding an OSD with ceph-disk.

  1. Make sure a new disk is physically present on the node where you want to add the OSD. In our example, it is node1 belonging to cluster ceph.

  2. ssh to node1.

  3. Generate a unique identification for the new OSD:

    uuidgen
    c70c032a-6e88-4962-8376-4aa119cb52ee
  4. Prepare the disk:

    sudo ceph-disk prepare --cluster ceph \
    --cluster-uuid c70c032a-6e88-4962-8376-4aa119cb52ee --fs-type xfs /dev/hdd1
  5. Activate the OSD:

    sudo ceph-disk activate /dev/hdd1

27.3 Adding OSDs with ceph-deploy

ceph-deploy is a command line utility to simplify the installation and configuration of a Ceph cluster. It can be used to add or remove OSDs as well. To add a new OSD to a node node2 with ceph-deploy, follow these steps:

Tip
Tip

ceph-deploy is usually run from the administration node, from which you installed the cluster.

  1. List available disks on a node:

    ceph-deploy disk list node2
    [...]
    [node2][DEBUG ] /dev/sr0 other, unknown
    [node2][DEBUG ] /dev/vda :
    [node2][DEBUG ]  /dev/vda1 swap, swap
    [node2][DEBUG ]  /dev/vda2 other, btrfs, mounted on /
    [node2][DEBUG ] /dev/vdb :
    [node2][DEBUG ]  /dev/vdb1 ceph data, active, cluster ceph, osd.1, journal /dev/vdb2
    [node2][DEBUG ]  /dev/vdb2 ceph journal, for /dev/vdb1
    [node2][DEBUG ] /dev/vdc other, unknown

    /dev/vdc seems to be unused, so let us focus on adding it as an OSD.

  2. Zap the disk. Zapping deletes the disk's partition table.

    ceph-deploy disk zap node2:vdc
    Warning
    Warning

    Zapping deletes all data from the disk.

  3. Prepare the OSD. The prepare command expects you to specify the disk for data, and optionally the disk for its journal. We recommend storing the journal on a separate drive to maximize throughput.

    ceph-deploy osd prepare node2:vdc:/dev/ssd

27.4 Adding and Removing Monitors

With ceph-deploy, adding and removing monitors is a simple task. Also, take into account the following restrictions and recommendations.

Important
Important
  • ceph-deploy restricts you to installing only one monitor per host.

  • We do not recommend mixing monitors and OSDs on the same host.

  • For high availability, you should run a production Ceph cluster with at least three monitors.

27.4.1 Adding a Monitor

After you create a cluster and install Ceph packages on the monitor host(s) (see Chapter 5, Deploying with ceph-deploy for more information), you may deploy the monitors to the monitor hosts. You may specify multiple monitor host names in the same command.

ceph-deploy mon create host-name
Note
Note

When adding a monitor on a host that was not among the hosts initially defined with the ceph-deploy new command, a public network statement needs to be added to the ceph.conf file.
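For example (the subnet below is an example only; use your cluster's public subnet):

[global]
public network = 192.168.121.0/24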

27.4.2 Removing a Monitor

If you have a monitor in your cluster that you want to remove, you may use the destroy option. You may specify multiple monitor host names in the same command.

ceph-deploy mon destroy host-name
Note
Note

Ensure that if you remove a monitor, the remaining monitors will be able to establish a consensus. If that is not possible, consider adding a monitor before removing the monitor you want to take offline.

27.5 Usage of ceph-deploy rgw

The ceph-deploy script includes the rgw component that helps you manage RADOS Gateway instances. Its general form follows this pattern:

ceph-deploy rgw subcommand rgw-host:rgw-instance:fqdn:port:redirect
subcommand

One of list, prepare, activate, create (= prepare + activate), or delete.

rgw-host

Host name where you want to operate the RADOS Gateway.

rgw-instance

Ceph instance name. Default is 'rgw-host'.

fqdn

Virtual host to listen to. Default is 'None'.

port

Port to listen to. Default is 80.

redirect

The URL redirect. Default is '^/(.*)'.

For example:

ceph-deploy rgw prepare example_host2:gateway1

or

ceph-deploy rgw activate example_host1:gateway1:virtual_srv2:81
Tip
Tip: Specifying Multiple RADOS Gateway Instances

You can specify more rgw_hostname:rgw_instance pairs on the same command line if you separate them with a comma:

ceph-deploy rgw create hostname1:rgw,hostname2:rgw,hostname3:rgw

For a practical example of setting up RADOS Gateway with ceph-deploy, see Section 17.2.2, “Configuring RADOS Gateway”.

27.6 RADOS Gateway Client Usage

To use RADOS Gateway REST interfaces, you need to create a user for the S3 interface, then a subuser for the Swift interface. Find more information on creating RADOS Gateway users in Section 31.4, “Adding Swift Users” and Section 31.1, “Adding S3 Users”.

27.6.1 S3 Interface Access

To access the S3 interface, you need to write a Python script. The script will connect to RADOS Gateway, create a new bucket, and list all buckets. The values for aws_access_key_id and aws_secret_access_key are taken from the values of access_key and secret_key returned by the radosgw-admin command from Section 31.1, “Adding S3 Users”.

  1. Install the python-boto package:

    sudo zypper in python-boto
  2. Create a new Python script called s3test.py with the following content:

    import boto
    import boto.s3.connection
    access_key = '11BS02LGFB6AL6H1ADMW'
    secret_key = 'vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY'
    conn = boto.connect_s3(
        aws_access_key_id = access_key,
        aws_secret_access_key = secret_key,
        host = '{hostname}',
        is_secure = False,
        calling_format = boto.s3.connection.OrdinaryCallingFormat(),
        )
    bucket = conn.create_bucket('my-new-bucket')
    for bucket in conn.get_all_buckets():
        print "{name}\t{created}".format(
            name = bucket.name,
            created = bucket.creation_date,
            )

    Replace {hostname} with the host name of the host where you configured RADOS Gateway service, for example gateway_host.

  3. Run the script:

    python s3test.py

    The script outputs something like the following:

    my-new-bucket 2015-07-22T15:37:42.000Z

27.6.2 Swift Interface Access

To access RADOS Gateway via the Swift interface, you need the swift command line client. Its manual page man 1 swift tells you more about its command line options.

To install swift, run the following:

sudo zypper in python-swiftclient

The swift access uses the following syntax:

swift -A http://IP_ADDRESS/auth/1.0 \
-U example_user:swift -K 'swift_secret_key' list

Replace IP_ADDRESS with the IP address of the gateway server, and swift_secret_key with its value from the output of the radosgw-admin key create command executed for the swift user in Section 31.4, “Adding Swift Users”.

For example:

swift -A http://gateway.example.com/auth/1.0 -U example_user:swift \
-K 'r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h' list

The output is:

my-new-bucket

27.7 Automated Installation via Salt

The installation can be automated by using the Salt reactor. For virtual environments or consistent hardware environments, this configuration will allow the creation of a Ceph cluster with the specified behavior.

Warning
Warning

Salt cannot perform dependency checks based on reactor events. Putting your Salt master into a death spiral is a real risk.

The automated installation requires the following:

  • A properly created /srv/pillar/ceph/proposals/policy.cfg.

  • Prepared custom configuration, placed in the /srv/pillar/ceph/stack directory.

  • The example reactor file /usr/share/doc/packages/deepsea/reactor.conf must be copied to /etc/salt/master.d/reactor.conf.

The default reactor configuration will only run Stages 0 and 1. This allows testing of the reactor without waiting for subsequent stages to complete.

When the first salt-minion starts, Stage 0 will begin. A lock prevents multiple instances. When all minions complete Stage 0, Stage 1 will begin.

If the operation is performed properly, change the last line in the /etc/salt/master.d/reactor.conf:

- /srv/salt/ceph/reactor/discovery.sls

to

- /srv/salt/ceph/reactor/all_stages.sls

27.8 Restarting Ceph Services Using DeepSea

When you install updates, specifically to the Ceph packages (ceph-mon, ceph-osd, and so on), you need to restart the services to make use of the recently installed version. To do so, run:

salt-run state.orch ceph.restart

The script iterates over all roles you have configured in the following order: MON, OSD, MDS, RGW, IGW. To keep the downtime low and to find potential issues as early as possible, nodes are restarted sequentially. For example, only one monitoring node is restarted at a time. The command also waits for the cluster to recover if the cluster is in a degraded unhealthy state.

Note
Note: Watching the Restarting

The process of restarting the cluster may take some time. You can watch the events by using the Salt event bus by running:

salt-run state.event pretty=True

27.9 Node Management

After you set up a complete cluster, you may need to perform additional changes to the cluster, such as adding or removing monitor nodes or adding/removing Ceph OSD nodes. Adding and removing cluster nodes can be done without shutting down the whole cluster, but it might increase replication traffic.

Note
Note: Limitations

The procedures described in Section 27.9.1, “Adding Ceph OSD Nodes” and Section 27.9.2, “Removing Ceph OSD Nodes” can be performed only with the default CRUSH map. The default CRUSH map must have been created by using ceph-deploy or DeepSea.

27.9.1 Adding Ceph OSD Nodes

The procedure below describes adding a Ceph OSD node to your cluster.

Procedure 27.1: Adding a Ceph OSD Node
  1. List all Ceph OSD nodes, and then choose a proper name for the new node(s):

    ceph osd tree
  2. Inspect your CRUSH map to find out the bucket type; for the procedure, refer to Section 12.4, “CRUSH Map Manipulation”. Typically, the bucket type is host.

  3. Create a record for the new node in your CRUSH map.

    ceph osd crush add-bucket {bucket name} {bucket type}

    for example:

    ceph osd crush add-bucket ses4-4 host
  4. Add all OSDs that the new node should use. For the procedure, refer to Section 27.3, “Adding OSDs with ceph-deploy”.

27.9.2 Removing Ceph OSD Nodes

To remove a Ceph OSD node, follow this procedure:

Procedure 27.2: Removing a Ceph OSD Node
  1. Remove all OSDs on the node you want to delete, as described in Section 29.2, “Deleting Disks”.

  2. Verify that all OSDs have been removed:

    ceph osd tree

    The node to be removed must not have any OSDs.

  3. Remove the node from the cluster:

    ceph osd crush remove {bucket name}

27.9.3 Removing and Reinstalling Salt Cluster Nodes

You may want to remove a role from your minion. To do so, use the Stage 5 command:

root # salt-run state.orch ceph.stage.5

When a role is removed from a minion, the objective is to undo all changes related to that role. For most of the roles, the task is simple, but there may be problems with package dependencies. If a package is uninstalled, its dependencies are not.

Removed OSDs appear as blank drives. The related tasks overwrite the beginning of the file systems and remove backup partitions in addition to wiping the partition tables.

Note
Note: Preserving Partitions Created by Other Methods

Disk drives previously configured by other methods, such as ceph-deploy, may still contain partitions. DeepSea will not automatically destroy these. Currently, the administrator must reclaim these drives.

28 Monitoring

28.1 Usage Graphs on Calamari

Calamari—Ceph's Web front-end for managing and monitoring the cluster—includes several graphs of the cluster's usage.

At the bottom of the Dashboard—the home page of Calamari—there are two usage-related boxes. While IOPS shows the cluster's overall number of input/output operations per second, the Usage graph shows the cluster's total and used disk space.

IOPS and Usage Graphs on Calamari Dashboard
Figure 28.1: IOPS and Usage Graphs on Calamari Dashboard

You can find more detailed and interactive graphs after clicking the Charts menu item. It shows the cluster's overall input/output operations per second and free disk space by default. Select Pool IOPS from the top drop-down box to detail the view by existing pools.

Pool IOPS Detailed View
Figure 28.2: Pool IOPS Detailed View

By moving the slider in the Time Axis pane, you can change the displayed time interval in the graph. By moving the mouse over the graph, the time/read/write information changes accordingly. By clicking and dragging the mouse horizontally across the graph, the specified time interval gets zoomed. You can see more help by moving the mouse over the little question mark in the top right corner of the graph.

If you select the host name of a specific Ceph server, Calamari displays detailed information about CPU, average load, and memory related to the specified host.

Ceph Host Average Load
Figure 28.3: Ceph Host Average Load

28.2 Checking for Full OSDs

Ceph prevents you from writing to a full OSD so that you do not lose data. In an operational cluster, you should receive a warning when your cluster is getting near its full ratio. The mon osd full ratio defaults to 0.95, or 95% of capacity before it stops clients from writing data. The mon osd nearfull ratio defaults to 0.85, or 85% of capacity, when it generates a health warning.
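These thresholds correspond to settings in the Ceph configuration file. The following sketch only shows the default values mentioned above; change them only if you fully understand the consequences:

[global]
mon osd full ratio = 0.95
mon osd nearfull ratio = 0.85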

Full OSD nodes will be reported by ceph health:

ceph health
  HEALTH_WARN 1 nearfull osds
  osd.2 is near full at 85%

or

ceph health
  HEALTH_ERR 1 nearfull osds, 1 full osds
  osd.2 is near full at 85%
  osd.3 is full at 97%

The best way to deal with a full cluster is to add new OSD nodes allowing the cluster to redistribute data to the newly available storage.

If you cannot start an OSD because it is full, you may delete some data by deleting some placement group directories in the full OSD.

Tip
Tip: Preventing Full OSDs

After an OSD becomes full—that is, it uses 100% of its disk space—it will normally crash quickly without warning. Following are a few tips to remember when administering OSD nodes.

  • Each OSD's disk space (usually mounted under /var/lib/ceph/osd/osd-{1,2..}) needs to be placed on a dedicated underlying disk or partition.

  • Check the Ceph configuration files and make sure that Ceph does not store its log file to the disks/partitions dedicated for use by OSDs.

  • Make sure that no other process writes to the disks/partitions dedicated for use by OSDs.

28.3 Checking if OSD Daemons are Running on a Node

To check the status of OSD services on a specific node, log in to the node, and run the following:

sudo systemctl status ceph-osd*
 ceph-osd@0.service - Ceph object storage daemon
    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled)
    Active: active (running) since Fri 2015-02-20 11:13:18 CET; 2 days ago
  Main PID: 1822 (ceph-osd)
    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
            └─1822 /usr/bin/ceph-osd -f --cluster ceph --id 0

For more information, see Chapter 9, Operating Ceph Services.

28.4 Checking if Monitor Daemons are Running on a Node

To check the status of monitor services on a specific node, log in to the node, and run the following:

sudo systemctl status ceph-mon*
 ceph-mon@doc-ceph1.service - Ceph cluster monitor daemon
    Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled)
    Active: active (running) since Wed 2015-02-18 16:57:17 CET; 4 days ago
  Main PID: 1203 (ceph-mon)
    CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@doc-ceph1.service
            └─1203 /usr/bin/ceph-mon -f --cluster ceph --id doc-ceph1

For more information, see Chapter 9, Operating Ceph Services.

28.5 What Happens When a Disk Fails?

When a disk storing cluster data has a hardware problem and fails to operate, here is what happens:

  • The related OSD crashes and is automatically removed from the cluster.

  • The failed disk's data is replicated to another OSD in the cluster from other copies of the same data stored in other OSDs.

  • You should then remove the disk from the cluster's CRUSH Map, and physically from the host hardware.

28.6 What Happens When a Journal Disk Fails?

Ceph OSDs use journaling file systems (see http://en.wikipedia.org/wiki/Journaling_file_system for more information) to store data. When a disk dedicated to a journal fails, the related OSD(s) fail as well (see Section 28.5, “What Happens When a Disk Fails?”).

Warning
Warning: Hosting Multiple Journals on One Disk

For a performance boost, you can use a fast disk (such as an SSD) to store journal partitions for several OSDs. We do not recommend hosting journals for more than 4 OSDs on one disk, because if the journal disk fails, you risk losing stored data for all the related OSDs' disks.

29 Disk Management

29.1 Adding Disks

Important
Important
  • This can be done on a live cluster without downtime.

  • This will cause increased replication traffic between servers.

  • Doing this operation repeatedly before the last operation has completed replication can save overall cluster rebuild time.

To add a disk (/dev/sdd in our example) to a Ceph cluster, follow these steps:

  1. Create a partition sdd1 on the disk:

    sudo parted /dev/sdd mkpart primary 0.0 -1s
  2. Format the partition with XFS file system:

    sudo mkfs.xfs -f /dev/sdd1
  3. Find out the UUID (Universally Unique Identifier) of the disk:

    ls -l /dev/disk/by-uuid | grep sdd1
     [...] 04bb24f1-d631-47ff-a2ee-22d94ad4f80c -> ../../sdd1
  4. Add the corresponding line to /etc/fstab for the example disk osd.12:

    [...]
     UUID=04bb24f1-d631-47ff-a2ee-22d94ad4f80c /mnt/osd.12 xfs \
     defaults,errors=remount-ro 0 1
     [...]
  5. Mount the disk:

    sudo mount /mnt/osd.12
  6. Add the new disk to /etc/ceph/ceph.conf and copy the updated configuration file to all other nodes in the cluster.

  7. Create the OSD:

    ceph osd create 04bb24f1-d631-47ff-a2ee-22d94ad4f80c
  8. Make sure that the new OSD is accepted into the cluster:

    sudo mkdir /srv/ceph/04bb24f1-d631-47ff-a2ee-22d94ad4f80c
     ceph-osd -i 12 --mkfs --mkkey
     ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.12
  9. Start the newly added OSD:

    sudo systemctl start ceph-osd@12.service
  10. Add it to the cluster and allow replication based on CRUSH Map:

    ceph osd crush set 12 osd.12 1.0 \
     pool=pool_name rack=rack_name host=host_name-osd
  11. Check that the new OSD is in the right place within the cluster:

    ceph osd tree
Tip
Tip

The process of preparing/adding a disk can be simplified with the ceph-disk command. See http://ceph.com/docs/master/man/8/ceph-disk/ for more information on ceph-disk.

29.2 Deleting Disks

Important
Important
  • This can be done on a live cluster without downtime.

  • This will cause increased replication traffic between servers.

  • Be sure not to remove too many disks from your cluster, so that the replication rules can still be fulfilled. See Section 12.3, “Rule Sets” for more information.

To delete a disk (for example osd.12) from a Ceph cluster, follow these steps:

  1. Make sure you have the right disk:

    ceph osd tree
  2. If the disk is a member of a pool and/or active:

    1. Drain the OSD by setting its weight to zero:

      ceph osd crush reweight osd.12 0

      Then wait for all placement groups to be moved to other OSDs; you can watch the progress with ceph -w. Optionally, you can check whether the OSD is emptying with df -h.

    2. Mark the disk out:

      ceph osd out 12
    3. Stop the related OSD service:

      sudo systemctl stop ceph-osd@12.service
  3. Remove the disk from CRUSH Map:

    ceph osd crush remove osd.12
  4. Remove authentication information for the disk:

    ceph auth del osd.12
  5. Remove the disk from the cluster:

    ceph osd rm 12
  6. Wipe the disk to remove all the data:

    sudo sgdisk --zap-all -- disk_device_name
    sudo sgdisk --clear --mbrtogpt -- disk_device_name

29.3 How to Use Existing Partitions for OSDs Including OSD Journals

Important
Important

This section describes an advanced topic that only storage experts and developers should examine. It is mostly needed when using non-standard OSD journal sizes. If the OSD partition's size is less than 10GB, its initial weight is rounded to 0 and, because no data is then placed on it, you should increase its weight. We take no responsibility for overfilled journals.

If you need to use existing disk partitions as an OSD node, the OSD journal and data partitions need to be in a GPT partition table.

You need to set the correct partition types to the OSD partitions so that udev recognizes them correctly and sets their ownership to ceph:ceph.

For example, to set the partition type for the journal partition /dev/vdb1 and data partition /dev/vdb2, run the following:

sudo sgdisk --typecode=1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/vdb
sudo sgdisk --typecode=2:4fbd7e29-9d25-41b8-afd0-062c0ceff05d /dev/vdb
Tip
Tip

The Ceph partition table types are listed in /usr/lib/udev/rules.d/95-ceph-osd.rules:

cat /usr/lib/udev/rules.d/95-ceph-osd.rules
# OSD_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  OWNER="ceph", GROUP="ceph", MODE="660"

# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER="ceph", GROUP="ceph", MODE="660"
[...]

30 Recovery

30.1 'Too Many PGs per OSD' Status Message

If you receive a Too Many PGs per OSD message after running ceph status, it means that the mon_pg_warn_max_per_osd value (300 by default) was exceeded. This value is compared to the ratio of PGs per OSD. This means that the cluster setup is not optimal.
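For example (the numbers are purely illustrative): a cluster with 4 OSDs and a single replicated pool created with 512 placement groups and 3 replicas stores roughly 512 × 3 / 4 = 384 placement group copies per OSD, which exceeds the default limit of 300 and triggers the warning.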

As the number of PGs cannot be reduced after the pool is created, the only solution is to add OSDs to the cluster so that the ratio of PGs per OSD becomes lower.

30.2 Calamari Has a Stale Cluster

The Calamari back-end supports operating multiple clusters, while its front-end does not yet. This means that if you point Calamari at one cluster, then destroy that cluster and create a new one, and then point the same Calamari instance at the new cluster, it will still remember the old cluster and possibly/probably try to display the old cluster state by default.

To make Calamari 'forget' the old cluster, run:

sudo systemctl stop cthulhu.service
sudo calamari-ctl clear --yes-i-am-sure
sudo calamari-ctl initialize

This will make Calamari forget all the old clusters it knows about. It will, however, not clear out the salt minion keys from the master. This is fine if you are reusing the same nodes for the new cluster.

30.3 'nn pg stuck inactive' Status Message

If you receive a stuck inactive status message after running ceph status, it means that Ceph does not know where to replicate the stored data to fulfill the replication rules. It can happen shortly after the initial Ceph setup and fix itself automatically. In other cases, this may require a manual interaction, such as bringing up a broken OSD, or adding a new OSD to the cluster. In very rare cases, reducing the replication level may help.

If the placement groups are stuck perpetually, you need to check the output of ceph osd tree. The output should look tree-structured, similar to the example in Section 30.5, “OSD is Down”.

If the output of ceph osd tree is rather flat as in the following example

ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1      0 root default
 0      0 osd.0             up  1.00000          1.00000
 1      0 osd.1             up  1.00000          1.00000
 2      0 osd.2             up  1.00000          1.00000

you should check that the related CRUSH map has a tree structure. If it is also flat, or with no hosts as in the above example, it may mean that host name resolution is not working correctly across the cluster.

30.4 OSD Weight is 0

When OSD starts, it is assigned a weight. The higher the weight, the bigger the chance that the cluster writes data to the OSD. The weight is either specified in a cluster CRUSH Map, or calculated by the OSDs' start-up script.

In some cases, the calculated value for OSDs' weight may be rounded down to zero. It means that the OSD is not scheduled to store data, and no data is written to it. The reason is usually that the disk is too small (smaller than 15GB) and should be replaced with a bigger one.
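If replacing the disk is not possible immediately, you can check the assigned weight with ceph osd tree and set it manually, for example (the OSD ID and weight below are illustrative only):

ceph osd crush reweight osd.2 1.0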

30.5 OSD is Down

An OSD daemon is either running or stopped/down. There are 3 general reasons why an OSD is down:

  • Hard disk failure.

  • The OSD crashed.

  • The server crashed.

You can see the detailed status of OSDs by running

ceph osd tree
# id  weight  type name up/down reweight
 -1    0.02998  root default
 -2    0.009995   host doc-ceph1
 0     0.009995      osd.0 up  1
 -3    0.009995   host doc-ceph2
 1     0.009995      osd.1 up  1
 -4    0.009995   host doc-ceph3
 2     0.009995      osd.2 down  1

The example listing shows that the osd.2 is down. Then you may check if the disk where the OSD is located is mounted:

lsblk -f
 [...]
 vdb
 ├─vdb1               /var/lib/ceph/osd/ceph-2
 └─vdb2

You can track the reason why the OSD is down by inspecting its log file /var/log/ceph/ceph-osd.2.log. After you find and fix the reason why the OSD is not running, start it with

sudo systemctl start ceph-osd@2.service

Do not forget to replace 2 with the actual number of your stopped OSD.

30.6 Fixing Clock Skew Warnings

The time information in all cluster nodes must be synchronized. If a node's time is not fully synchronized, you may get clock skew warnings when checking the state of the cluster.

Time synchronization is managed with NTP (see http://en.wikipedia.org/wiki/Network_Time_Protocol). Set each node to synchronize its time with one or more NTP servers, preferably to the same group of NTP servers. If the time skew still occurs on a node, follow these steps to fix it:

systemctl stop ntpd.service
systemctl stop ceph-mon.target
systemctl start ntpd.service
systemctl start ceph-mon.target

You can then query the NTP peers and check the time offset with sudo ntpq -p.

The Ceph monitors need to have their clocks synchronized to within 0.05 seconds of each other. In a typical ntpd configuration with remote NTP servers, it may be impossible for ntpd to reliably maintain this degree of accuracy. In such cases, the Ceph developers recommend running an NTP server in the local network.

31 Accountancy

31.1 Adding S3 Users

S3 (Simple Storage Service) is an online file storage Web service offered by Amazon. Besides the Swift interface, you can also use the S3 interface to interact with the Ceph RADOS Gateway. You need to create a user to interact with the gateway.

To create a user for the S3 interface, run the following command:

sudo radosgw-admin user create --uid=username \
 --display-name="display-name" --email=email

For example:

sudo radosgw-admin user create \
   --uid=example_user \
   --display-name="Example User" \
   --email=penguin@example.com

The command also creates the user's access and secret key. Check its output for access_key and secret_key keywords and their values:

[...]
 "keys": [
       { "user": "example_user",
         "access_key": "11BS02LGFB6AL6H1ADMW",
         "secret_key": "vzCEkuryfn060dfee4fgQPqFrncKEIkh3ZcdOANY"}],
 [...]

31.2 Removing S3 Users

To remove a user previously created to interact with the S3 interface, use the following command:

sudo radosgw-admin user rm --uid=example_user

For more information on the command's options, see Section 31.5, “Removing Swift Users”.

31.3 User Quota Management

The Ceph RADOS Gateway enables you to set quotas on users and buckets owned by users. Quotas include the maximum number of objects in a bucket and the maximum storage size in megabytes.

Before you enable a user quota, you first need to set its parameters:

radosgw-admin quota set --quota-scope=user --uid=example_user \
 --max-objects=1024 --max-size=1024
--max-objects

Specifies the maximum number of objects. A negative value disables the check.

--max-size

Specifies the maximum number of bytes. A negative value disables the check.

--quota-scope

Sets the scope for the quota. The options are bucket and user. Bucket quotas apply to buckets a user owns. User quotas apply to a user.

Once you set a user quota, you may enable it:

radosgw-admin quota enable --quota-scope=user --uid=example_user

To disable a quota:

radosgw-admin quota disable --quota-scope=user --uid=example_user

To list quota settings:

radosgw-admin user info --uid=example_user

To update quota statistics:

radosgw-admin user stats --uid=example_user --sync-stats

31.4 Adding Swift Users

Swift is a standard for stored data access compatible with OpenStack. It is used to interact with the Ceph RADOS Gateway. You need to create a Swift user, access key, and secret to enable end users to interact with the gateway. There are two types of users: a user and a subuser. While users are used when interacting with the S3 interface, subusers are users of the Swift interface. Each subuser is associated with a user.

  1. To create a Swift user—which is a subuser in our terminology—you need to create the associated user first.

    sudo radosgw-admin user create --uid=username \
     --display-name="display-name" --email=email

    For example:

    sudo radosgw-admin user create \
       --uid=example_user \
       --display-name="Example User" \
       --email=penguin@example.com
  2. To create a subuser (Swift interface) for the user, you must specify the user ID (--uid=username), a subuser ID, and the access level for the subuser.

    radosgw-admin subuser create --uid=uid \
     --subuser=uid \
     --access=[ read | write | readwrite | full ]

    For example:

    radosgw-admin subuser create --uid=example_user \
     --subuser=example_user:swift --access=full
  3. Generate a secret key for the user.

    sudo radosgw-admin key create \
       --gen-secret \
       --subuser=example_user:swift \
       --key-type=swift
  4. Both commands will output JSON-formatted data showing the user state. Notice the following lines, and remember the secret_key value:

    "swift_keys": [
       { "user": "example_user:swift",
         "secret_key": "r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h"}],

For more information on using Swift client, see Section 17.4, “Managing RADOS Gateway Access”.

31.5 Removing Swift Users

When you remove a user, the user and subuser are removed from the system. However, you may remove only the subuser if you want. To remove a user (and subuser), specify user rm and the user ID.

radosgw-admin user rm --uid=example_user

To remove the subuser only, specify subuser rm and the subuser ID.

radosgw-admin subuser rm --uid=example_user:swift

You can make use of the following options:

--purge-data

Purges all data associated to the user ID.

--purge-keys

Purges all keys associated to the user ID.

Tip
Tip: Removing a Subuser

When you remove a subuser, you are removing access to the Swift interface. The user will remain in the system. To remove the subuser, specify subuser rm and the subuser ID.

sudo radosgw-admin subuser rm --uid=example_user:swift

You can make use of the following option:

--purge-keys

Purges all keys associated to the user ID.

31.6 Changing S3 and Swift User Access and Secret Keys

The access_key and secret_key parameters identify the RADOS Gateway user when accessing the gateway. Changing the existing user keys is the same as creating new ones, as the old keys get overwritten.

For S3 users, run the following:

radosgw-admin key create --uid=example_user --key-type=s3 --gen-access-key --gen-secret

For Swift users, run the following:

radosgw-admin key create --subuser=example_user:swift --key-type=swift --gen-secret
--key-type=type

Specifies the type of key. Either swift or s3.

--gen-access-key

Generates a random access key (for S3 user by default).

--gen-secret

Generates a random secret key.

--secret=key

Specifies a secret key, for example manually generated.

32 Tune-ups

32.1 How Does the Number of Placement Groups Affect the Cluster Performance?

Placement groups (PGs) are internal data structures for storing data in a pool across OSDs. The way Ceph stores data into PGs is defined in a CRUSH Map, and you can override the default by editing it. When creating a new pool, you need to specify the initial number of PGs for the pool.

When your cluster is becoming 70% to 80% full, it is time to add more OSDs to it. When you increase the number of OSDs, you may consider increasing the number of PGs as well.

Warning
Warning

Changing the number of PGs causes a lot of data transfer within the cluster.

Calculating the optimal value for your newly resized cluster is a complex task.

A high number of PGs creates small chunks of data. This speeds up recovery after an OSD failure, but puts a lot of load on the monitor nodes as they are responsible for calculating the data location.

On the other hand, a low number of PGs takes more time and data transfer to recover from an OSD failure, but does not impose that much load on monitor nodes as they need to calculate locations for less (but larger) data chunks.

Find more information on the optimal number of PGs for your cluster using the online calculator (http://ceph.com/pgcalc/).
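As a rough starting point only (the calculator above takes more factors into account, such as the number of pools and the expected data distribution), a commonly used rule of thumb is:

total PGs ≈ (number of OSDs × 100) / number of replicas

rounded up to the nearest power of two. For example, with 40 OSDs and 3 replicas this gives 40 × 100 / 3 ≈ 1334, rounded up to 2048.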

32.2 Can I Use SSDs and Hard Disks on the Same Cluster?

Solid-state drives (SSD) are generally faster than hard disks. If you mix the two types of disks for the same write operation, the data writing to the SSD disk will be slowed down by the hard disk performance. Thus, you should never mix SSDs and hard disks for data writing following the same rule (see Section 12.3, “Rule Sets” for more information on rules for storing data).

There are generally 2 cases where using SSD and hard disk on the same cluster makes sense:

  1. Use each disk type for writing data following different rules. Then you need to have a separate rule for the SSD disk, and another rule for the hard disk.

  2. Use each disk type for a specific purpose. For example the SSD disk for journal, and the hard disk for storing data.

32.3 What are the Trade-offs of Using a Journal on SSD?

Using SSDs for OSD journal(s) is better for performance as the journal is usually the bottleneck of hard disk-only OSDs. SSDs are often used to share journals of several OSDs.

Following is a list of potential disadvantages of using SSDs for OSD journal:

  • SSD disks are more expensive than hard disks. But as one OSD journal requires only up to 6GB of disk space, the price may not be so crucial.

  • An SSD disk consumes a storage slot which could otherwise be used by a large hard disk to extend the cluster capacity.

  • SSD disks have reduced write cycles compared to hard disks, but modern technologies are beginning to eliminate the problem.

  • If you share more journals on the same SSD disk, you risk losing all the related OSDs after the SSD disk fails. This will require a lot of data to be moved to rebalance the cluster.

  • Hotplugging disks becomes more complex, as the mapping between the failed OSD and its journal disk is not 1:1.

33 Integration

33.1 Storing KVM Disks in Ceph Cluster

You can create a disk image for KVM-driven virtual machine, store it in a Ceph pool, optionally convert the content of an existing image to it, and then run the virtual machine with qemu-kvm making use of the disk image stored in the cluster. For more detailed information, see Chapter 24, Ceph as a Back-end for QEMU KVM Instance.

33.2 Storing libvirt Disks in Ceph Cluster

Similar to KVM (see Section 33.1, “Storing KVM Disks in Ceph Cluster”), you can use Ceph to store virtual machines driven by libvirt. The advantage is that you can run any libvirt-supported virtualization solution, such as KVM, Xen, or LXC. For more information, see Chapter 23, Using libvirt with Ceph.

33.3 Storing Xen Disks in Ceph Cluster

One way to use Ceph for storing Xen disks is to make use of libvirt as described in Chapter 23, Using libvirt with Ceph.

Another option is to make Xen talk to the rbd block device driver directly:

  1. If you have no disk image prepared for Xen, create a new one:

    rbd create myimage --size 8000 --pool mypool
  2. List images in the pool mypool and check if your new image is there:

    rbd list mypool
  3. Create a new block device by mapping the myimage image to the rbd kernel module:

    sudo rbd map --pool mypool myimage
    Tip
    Tip: User Name and Authentication

    To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:

    sudo rbd map --pool rbd myimage --id admin --keyring \
     /path/to/keyring

    or

    sudo rbd map --pool rbd myimage --id admin --keyfile /path/to/file
  4. List all mapped devices:

    rbd showmapped
     id pool   image   snap device
     0  mypool myimage -    /dev/rbd0
  5. Now you can configure Xen to use this device as a disk for running a virtual machine. You can for example add the following line to the xl-style domain configuration file:

    disk = [ '/dev/rbd0,,sda', '/dev/cdrom,,sdc,cdrom' ]

33.4 Mounting and Unmounting an RBD Image

Images stored inside a Ceph cluster pool can be mapped to a block device. You can then format such device, mount it to be able to exchange files, and unmount it when done.

  1. Make sure your Ceph cluster includes a pool with the disk image you want to mount. Assume the pool is called mypool and the image is myimage.

    rbd list mypool
  2. Map the image to a new block device.

    sudo rbd map --pool mypool myimage
    Tip
    Tip: User Name and Authentication

    To specify a user name, use --id user-name. Moreover, if you use cephx authentication, you must also specify a secret. It may come from a keyring or a file containing the secret:

    sudo rbd map --pool rbd myimage --id admin --keyring \
     /path/to/keyring

    or

    sudo rbd map --pool rbd myimage --id admin --keyfile /path/to/file
  3. List all mapped devices:

    rbd showmapped
     id pool   image   snap device
     0  mypool myimage -    /dev/rbd0

    The device we want to work on is /dev/rbd0.

  4. Make an XFS file system on the /dev/rbd0 device.

    sudo mkfs.xfs /dev/rbd0
     log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
     log stripe unit adjusted to 32KiB
     meta-data=/dev/rbd0              isize=256    agcount=9, agsize=261120 blks
              =                       sectsz=512   attr=2, projid32bit=1
              =                       crc=0        finobt=0
     data     =                       bsize=4096   blocks=2097152, imaxpct=25
              =                       sunit=1024   swidth=1024 blks
     naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
     log      =internal log           bsize=4096   blocks=2560, version=2
              =                       sectsz=512   sunit=8 blks, lazy-count=1
     realtime =none                   extsz=4096   blocks=0, rtextents=0
  5. Mount the device and check it is correctly mounted. Replace /mnt with your mount point.

    sudo mount /dev/rbd0 /mnt
     mount | grep rbd0
     /dev/rbd0 on /mnt type xfs (rw,relatime,attr2,inode64,sunit=8192,...

    Now you can move data from/to the device as if it was a local directory.

    Tip
    Tip: Increasing the Size of RBD Device

    If you find that the size of the RBD device is no longer enough, you can easily increase it.

    1. Increase the size of the RBD image, for example up to 10GB.

      rbd resize --size 10000  mypool/myimage
       Resizing image: 100% complete...done.
    2. Grow the file system to fill up the new size of the device.

      sudo xfs_growfs /mnt
       [...]
       data blocks changed from 2097152 to 2560000
  6. After you finish accessing the device, you can unmount it.

    sudo umount /mnt

34 Cluster Maintenance and Troubleshooting

34.1 Creating and Deleting Pools from Calamari

Apart from using the command line to create or delete pools (see Section 34.9, “Adding a Pool” and Section 34.10, “Deleting a Pool”), you can do the same from within Calamari in a more comfortable user interface.

To create a new pool using Calamari, follow these steps:

  1. Log in to a running instance of Calamari.

  2. Go to Manage › Pools. You can see a list of the cluster's existing pools.

  3. Click in the top right.

  4. Enter a name for the new pool, and either change the number of replicas, number of placement groups, and the CRUSH ruleset, or leave them at default values.

  5. Click to confirm, then Cancel the warning dialog.

  6. Now you can see the new pool in the list of all existing pools. You can verify the existence of the new pool on the command line with

    ceph osd lspools

To delete an existing pool using Calamari, follow these steps:

  1. Log in to a running instance of Calamari.

  2. Go to Manage › Pools. You can see a list of the cluster's existing pools.

  3. From the list of pools, choose the one to delete and click the related

  4. Confirm the deletion and Cancel the warning dialog.

  5. You can verify the deletion of the pool on the command line with

    ceph osd lspools

34.2 Managing Keyring Files

When Ceph runs with authentication and authorization enabled (enabled by default), you must specify a user name and a keyring containing the secret key of the specified user. If you do not specify a user name, Ceph will use client.admin as the default user name. If you do not specify a keyring, Ceph will look for a keyring via the keyring setting in the Ceph configuration. For example, if you execute the ceph health command without specifying a user or keyring:

ceph health

Ceph interprets the command like this:

ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring health

ceph-authtool is a utility to create, view, and modify a Ceph keyring file. A keyring file stores one or more Ceph authentication keys and possibly an associated capability specification. Each key is associated with an entity name, of the form {client,mon,mds,osd}.name.

To create a new keyring file in the current directory containing a key for client.example1:

ceph-authtool -C -n client.example1 --gen-key keyring

To add a new key for client.example2, omit the -C option:

ceph-authtool -n client.example2 --gen-key keyring

The keyring now has two entries:

ceph-authtool -l keyring
 [client.example1]
     key = AQCQ04NV8NE3JBAAHurrwc2BTVkMGybL1DYtng==
 [client.example2]
     key = AQBv2INVWMqFIBAAf/4/H3zxzAsPBTH4jsN80w==

For more information on ceph-authtool, see its manual page man 8 ceph-authtool.

34.3 Creating Client Keys

User management functionality provides Ceph cluster administrators with the ability to create, update and delete users directly in the cluster environment.

Tip
Tip

When you create or delete users in the Ceph cluster, you may need to distribute keys to clients so that they can be added to keyrings.

Adding a user creates a user name (TYPE.ID), a secret key and possibly capabilities included in the command you use to create the user. A user’s key enables the user to authenticate with the cluster. The user’s capabilities authorize the user to read, write, or execute on monitors, OSDs, or metadata servers.

Authentication key creation usually follows cluster user creation. There are several ways to add a user. The most convenient seems to be using

ceph auth get-or-create

It returns a keyfile format with the user name [in brackets] and the key. If the user already exists, this command simply returns the user name and key in the keyfile format. You may use the -o filename option to save the output to a file.

ceph auth get-or-create client.example1
 [client.example1]
    key = AQDs+odVODCGGxAAvmSnsNx3XYHJ7Ri6sZFfhw==

You can verify that the client key was added to the cluster keyring:

ceph auth list
    [...]
 client.example1
    key: AQDs+odVODCGGxAAvmSnsNx3XYHJ7Ri6sZFfhw==

When creating client users, you may create a user with no capabilities. A user with no capabilities is useless beyond mere authentication, because the client cannot retrieve the cluster map from the monitor. However, you can create a user with no capabilities if you want to defer adding capabilities later using the ceph auth caps command.
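For example, to add capabilities to such a user later (the capabilities and the pool name mypool below are illustrative only):

ceph auth caps client.example1 mon 'allow r' osd 'allow rw pool=mypool'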

Tip
Tip

After you add a key to the cluster keyring, go to the relevant client(s) and copy the keyring from the cluster host to the client(s).

Find more details in the related upstream documentation, see User Management (http://ceph.com/docs/master/rados/operations/user-management/).

34.4 Revoking Client Keys

If you need to remove an already generated client key from the keyring file, use the ceph auth del command. To remove the key for user client.example1 that we added in Section 34.3, “Creating Client Keys”:

ceph auth del client.example1

and check the deletion with ceph auth list.

Tip

After you remove a key from the cluster keyring, also remove it from the keyring files on the relevant clients.

34.5 Checking for Unbalanced Data Writing

When data is written to OSDs evenly, the cluster is considered balanced. Each OSD within a cluster is assigned its weight. The weight is a relative number and tells Ceph how much of the data should be written to the related OSD. The higher the weight, the more data will be written. If an OSD has zero weight, no data will be written to it. If the weight of an OSD is relatively high compared to other OSDs, a large portion of the data will be written there, which makes the cluster unbalanced.

Unbalanced clusters have poor performance, and in the case that an OSD with a high weight suddenly crashes, a lot of data needs to be moved to other OSDs, which slows down the cluster as well.

To avoid this, you should regularly check OSDs for the amount of data writing. If the amount is between 30% and 50% of the capacity of a group of OSDs specified by a given rule set, you need to reweight the OSDs. Check for individual disks and find out which of them fill up faster than the others (or are generally slower), and lower their weight. The same is valid for OSDs where not enough data is written—you can increase their weight to have Ceph write more data to them. In the following example, you will find out the weight of an OSD with ID 13, and reweight it from 3 to 3.05:

$ ceph osd tree | grep osd.13
 13  3                   osd.13  up  1

 $ ceph osd crush reweight osd.13 3.05
 reweighted item id 13 name 'osd.13' to 3.05 in crush map

 $ ceph osd tree | grep osd.13
 13  3.05                osd.13  up  1
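
To check how full individual OSDs currently are relative to their weights, you can also list per-OSD utilization with the ceph osd df command (or ceph osd df tree to include the CRUSH hierarchy):

ceph osd df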

Tip: OSD Reweight by Utilization

The ceph osd reweight-by-utilization threshold command automates the process of reducing the weight of OSDs which are heavily overused. By default it adjusts the weights downward on OSDs which have reached 120% of the average usage, but if you specify a threshold value, it uses that percentage instead.
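
For example, to lower the weights of OSDs that exceed 110% of the average utilization instead of the default 120%, run:

ceph osd reweight-by-utilization 110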

34.6 Time Synchronization of Nodes

Ceph requires precise time synchronization between particular nodes. You should set up a node with your own NTP server. Even though you can point all ntpd instances to a remote public time server, we do not recommend it with Ceph. With such a configuration, each node in the cluster has its own NTP daemon that communicates continually over the Internet with a set of three or four time servers, all of which are several network hops away. This solution introduces a large degree of latency variability that makes it difficult or impossible to keep the clock drift under 0.05 seconds (which is what the Ceph monitors require).

Therefore, use a single machine as the NTP server for the whole cluster. Your NTP server's ntpd instance may then point to the remote (public) NTP server, or it can have its own time source. The ntpd instances on all nodes are then pointed to this local server. Such a solution has several advantages: it eliminates unnecessary network traffic and clock skew, and it decreases the load on the public NTP servers. For details on how to set up the NTP server, refer to the SUSE Linux Enterprise Server Administration Guide (https://www.suse.com/documentation/sled11/book_sle_admin/data/cha_netz_xntp.html).
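
The /etc/ntp.conf file on each cluster node then needs to reference the local NTP server. Assuming the server is reachable under the hypothetical host name ntp1.example.local, a minimal configuration line looks like this:

server ntp1.example.local iburst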

Then to change the time on your cluster, do the following:

Important: Setting Time

You may face a situation when you need to set the time back, for example if the time changes from summer to standard time. We do not recommend moving the time backward by a longer period than the cluster is down. Moving the time forward does not cause any trouble.

Procedure 34.1: Time Synchronization on the Cluster
  1. Stop all clients accessing the Ceph cluster, especially those using iSCSI.

  2. Shut down your Ceph cluster. On each node run:

    rcceph stop
    Note

    If you use Ceph with SUSE OpenStack Cloud, also stop the SUSE OpenStack Cloud.

  3. Verify that your NTP server is set up correctly—all ntpd daemons get their time from a source or sources in the local network.

  4. Set the correct time on your NTP server.

  5. Verify that NTP is running and working properly, on all nodes run:

    systemctl status ntpd.service

    or

    ntpq -p
  6. Start all monitoring nodes and verify that there is no clock skew:

    systemctl start ceph-mon.target
  7. Start all OSD nodes.

  8. Start other Ceph services.

  9. Start the SUSE OpenStack Cloud if you have it.

34.7 Upgrading Software

Both SUSE Linux Enterprise Server and SUSE Enterprise Storage products are provided with regular package updates. To apply new updates to the whole cluster, you need to run

sudo zypper dup

on all cluster nodes. Remember to upgrade all the monitor nodes first, and then all the OSD nodes one by one.
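
To verify which Ceph version a node runs after the update, you can, for example, check the installed binaries and query a daemon directly (osd.0 is only an illustration):

ceph --version
 ceph tell osd.0 version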

34.8 Increasing the Number of Placement Groups

When creating a new pool, you specify the number of placement groups for the pool (see Section 13.1.2, “Create a Pool”). After adding more OSDs to the cluster, you usually need to increase the number of placement groups as well for performance and data durability reasons. For each placement group, OSD and monitor nodes need memory, network, and CPU at all times, and even more during recovery. It follows that minimizing the number of placement groups saves significant amounts of resources.

Warning: Too High Value of pg_num

When changing the pg_num value for a pool, it may happen that the new number of placement groups exceeds the allowed limit. For example

ceph osd pool set rbd pg_num 4096
 Error E2BIG: specified pg_num 3500 is too large (creating 4096 new PGs \
 on ~64 OSDs exceeds per-OSD max of 32)

The limit prevents extreme placement group splitting, and is derived from the mon_osd_max_split_count value.

Determining the right new number of placement groups for a resized cluster is a complex task. One approach is to continuously grow the number of placement groups up to the state when the cluster performance is optimal. To determine the new incremented number of placement groups, you need to get the value of the mon_osd_max_split_count parameter and add it to the current number of placement groups. To give you a basic idea, take a look at the following script:

max_inc=`ceph daemon mon.a config get mon_osd_max_split_count 2>&1 \
  | tr -d '\n ' | sed 's/.*"\([[:digit:]]\+\)".*/\1/'`
 pg_num=`ceph osd pool get rbd pg_num | cut -f2 -d: | tr -d ' '`
 echo "current pg_num value: $pg_num, max increment: $max_inc"
 next_pg_num="$(($pg_num+$max_inc))"
 echo "allowed increment of pg_num: $next_pg_num"

After finding out the next number of placement groups, increase it with

ceph osd pool set pool_name pg_num next_pg_num
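
Note that increasing pg_num alone only splits the placement groups. In upstream Ceph, the number of placement groups used for actual data placement (pgp_num) typically also needs to be raised to the same value before the cluster starts rebalancing data into the new placement groups, for example with:

ceph osd pool set pool_name pgp_num next_pg_num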

34.9 Adding a Pool

After you first deploy a cluster, Ceph uses the default pools to store data. You can later create a new pool with

ceph osd pool create

For more information on cluster pool creation, see Section 13.1.2, “Create a Pool”.
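
For example, to create a replicated pool with the hypothetical name mypool and 128 placement groups, you can run:

ceph osd pool create mypool 128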

34.10 Deleting a Pool

By deleting a pool, you permanently destroy all data stored in that pool. You can delete a previously created pool with

ceph osd pool delete

For more information on cluster pool deletion, see Section 13.1.4, “Delete a Pool”.
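
For example, deleting a hypothetical pool named mypool requires repeating the pool name and explicitly confirming the operation:

ceph osd pool delete mypool mypool --yes-i-really-really-mean-it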

34.11 Troubleshooting

This section describes several issues that you may face when you operate a Ceph cluster.

34.11.1 Sending Large Objects with rados Fails with Full OSD

rados is a command line utility to manage RADOS object storage. For more information, see man 8 rados.

If you send a large object to a Ceph cluster with the rados utility, such as

rados -p mypool put myobject /file/to/send

it can fill up all the related OSD space and cause serious trouble to the cluster performance. RADOS has a 'striper' API that enables applications to stripe large objects over multiple OSDs. If you turn the striping feature on with the --striper option, you can prevent the OSD from filling up.

rados --striper -p mypool put myobject /file/to/send

34.11.2 Corrupted XFS File System

In rare circumstances, such as a kernel bug or broken/misconfigured hardware, the underlying file system (XFS) in which an OSD stores its data might be damaged and unmountable.

If you are sure there is no problem with your hardware and the system is configured properly, raise a bug against the XFS subsystem of the SUSE Linux Enterprise Server kernel and mark the particular OSD as down:

ceph osd down OSD identification
Warning: Do Not Format or Otherwise Modify the Damaged Device

Even though using xfs_repair to fix the problem in the file system may seem reasonable, do not use it, because the command modifies the file system. The OSD may start, but it may not function correctly afterward.

Now zap the underlying disk and recreate the OSD by running:

ceph-disk prepare --zap $OSD_DISK_DEVICE $OSD_JOURNAL_DEVICE

for example:

ceph-disk prepare --zap /dev/sdb /dev/sdd2

35 Performance Diagnosis

35.1 Finding Slow OSDs

When tuning the cluster performance, it is very important to identify slow storage/OSDs within the cluster. The reason is that if data is written to the slow(est) disk, the complete write operation slows down, because it always waits until it is finished on all the related disks.

It is not trivial to locate the storage bottleneck. You need to examine each and every OSD to find out the ones slowing down the write process. To do a benchmark on a single OSD, run:

ceph tell osd.osd_id bench

For example:

cephadm > ceph tell osd.0 bench
 { "bytes_written": 1073741824,
   "blocksize": 4194304,
   "bytes_per_sec": "19377779.000000"}

Then you need to run this command on each OSD and compare the bytes_per_sec value to get the slow(est) OSDs.
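
One way to run the benchmark on every OSD in turn is a short shell loop over the OSD IDs reported by ceph osd ls, for example:

for i in $(ceph osd ls); do ceph tell osd.$i bench; done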

35.2 Is My Network Causing Issues?

There are several reasons why cluster performance may degrade. One of them can be network problems. In such a case, you may notice the cluster struggling to reach quorum, OSD and monitor nodes going offline, data transfers taking a long time, or a lot of reconnect attempts.

To check whether cluster performance is degraded by network problems, inspect the Ceph log files under the /var/log/ceph directory.

To fix network issues on the cluster, focus on the following points:

  • Basic network diagnostics. Try to ping between cluster nodes and pay attention to data loss and response times.

  • Network performance benchmark. Use tools such as Netperf to measure the performance of your network.

  • Check firewall settings on cluster nodes. Make sure they do not block ports/protocols required by Ceph operation. See Section 37.2, “Firewall Settings for Ceph” for more information on firewall settings.

  • Check the networking hardware, such as network cards, cables, or switches, for proper operation.

Tip: Separate Network

To ensure fast and safe network communication between cluster nodes, set up a separate network used exclusively by the cluster OSD and monitor nodes.

36 Server Maintenance

36.1 Adding a Server to a Cluster

Tip

When adding an OSD to an existing cluster, be aware that the cluster will be rebalancing for some time afterward. To minimize the rebalancing periods, it is best to add all the OSDs you intend to add at the same time.

If you are adding an OSD to a cluster, follow Section 27.3, “Adding OSDs with ceph-deploy”.

If you are adding a monitor to a cluster, follow Section 27.4, “Adding and Removing Monitors”.

Important

After adding a monitor, make sure that the /etc/ceph/ceph.conf file on each server points to the new monitor as well, so that it works after the next reboot.

Tip

Adding an OSD and a monitor on the same server is recommended only for small clusters. Although the monitor can share a disk with the operating system (preferably an SSD for performance reasons), it should never share a disk with an OSD.

36.2 Removing a Server from a Cluster

When removing an OSD from an existing cluster, make sure there are enough OSDs left in the cluster so that the replication rules can be followed. Also be aware that the cluster will be rebalancing for some time after removing the OSD.

If you are removing an OSD from a cluster, follow Section 29.2, “Deleting disks”.

If you are removing a monitor from a cluster, follow Section 27.4.2, “Removing a Monitor”.

36.3 Increasing File Descriptors

For OSD daemons, the read/write operations are critical to keep the Ceph cluster balanced. They often need to have many files open for reading and writing at the same time. On the OS level, the maximum number of simultaneously open files is called 'maximum number of file descriptors'.

To prevent OSDs from running out of file descriptors, you can override the OS default value and specify the number in /etc/ceph/ceph.conf, for example:

max_open_files = 131072

After you change max_open_files, you need to restart the OSD service on the relevant Ceph node.
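
For example, with the option placed in the [global] section of /etc/ceph/ceph.conf:

[global]
 max_open_files = 131072

Then restart the OSD service on that node; assuming the node runs the OSD with ID 0, the restart could look like this:

sudo systemctl restart ceph-osd@0.service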

37 Networking

37.1 Setting NTP to a Ceph Cluster

In a cluster environment, it is necessary to keep all cluster nodes' time synchronized. NTP—Network Time Protocol—is a network service commonly used for this purpose. NTP is well integrated in SUSE products, including SUSE Enterprise Storage. There are two ways to configure NTP—either using YaST, or setting it up manually. Find both methods described—and more information on NTP in general—in SUSE Linux Enterprise Server Administration Guide (https://www.suse.com/documentation/sles-12/book_sle_admin/data/cha_netz_xntp.html).

37.2 Firewall Settings for Ceph

We recommend protecting the network cluster communication with SUSE Firewall. You can edit its configuration by selecting YaST › Security and Users › Firewall › Allowed Services.

For Calamari, enable the "HTTP Server", "Carbon" and "SaltStack" services (ports 80, 2003, 2004, 4505 and 4506).

For Ceph monitor nodes, enable the "Ceph MON" service (port 6789).

For Ceph OSD (or MDS) nodes, enable the "Ceph OSD/MDS" service (ports 6800-7300).

37.3 Adding a Private Network to a Running Cluster

If you do not specify a cluster network during Ceph deployment, it assumes a single public network environment. While Ceph operates fine with a public network, its performance and security improve when you set up a second, private cluster network.

A general recommendation for a Ceph cluster is to have two networks: a public (front-side) and cluster (back-side) one. To support two networks, each Ceph node needs to have at least two network cards.

You need to apply the following changes to each Ceph node. This is manageable for a small cluster, but can be very time consuming if you have a cluster consisting of hundreds or thousands of nodes.

  1. Stop Ceph related services on each cluster node.

  2. Add a line defining the cluster network to the [global] section of /etc/ceph/ceph.conf on each cluster node:

    cluster network = 10.0.0.0/24

    Replace 10.0.0.0/24 with the IP address and netmask of the cluster network. You can specify more comma-delimited subnets. If you need to specifically assign static IP addresses or override cluster network settings, you can do so with the optional cluster addr.

  3. Check that the private cluster network works as expected on the OS level.

  4. Start Ceph related services on each cluster node.

    sudo rcceph start

Glossary

General

Admin node

The node from which you run the ceph-deploy utility to deploy Ceph on OSD nodes.

Bucket

A point which aggregates other nodes into a hierarchy of physical locations.

CRUSH, CRUSH Map

An algorithm that determines how to store and retrieve data by computing data storage locations. CRUSH requires a map of the cluster to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

Monitor node, MON

A cluster node that maintains maps of cluster state, including the monitor map and the OSD map.

Node

Any single machine or server in a Ceph cluster.

OSD node

A cluster node that stores data, handles data replication, recovery, backfilling, rebalancing, and provides some monitoring information to Ceph monitors by checking other Ceph OSD daemons.

Pool

Logical partitions for storing objects such as disk images.

Rule Set

Rules to determine data placement for a pool.

Ceph Specific Terms

Calamari

A management and monitoring system for a Ceph storage cluster. It provides a Web user interface that makes Ceph cluster monitoring simple.

Ceph Storage Cluster

The core set of storage software that stores the user’s data. Such a set consists of Ceph monitors and OSDs.

Also known as “Ceph Object Store”.

RADOS Gateway

The S3/Swift gateway component for Ceph Object Store.

A Example Procedure of Manual Ceph Installation

The following procedure shows the commands that you need to run to install a Ceph storage cluster manually.

  1. Generate the key secrets for the Ceph services you plan to run. You can use the following command to generate them:

    python -c "import os ; import struct ; import time; import base64 ; \
     key = os.urandom(16) ; header = struct.pack('<hiih',1,int(time.time()),0,len(key)) ; \
     print base64.b64encode(header + key)"
  2. Add the keys to the related keyrings. First for client.admin, then for monitors, and then other related services, such as OSD, RADOS Gateway, or MDS:

    ceph-authtool -n client.admin \
     --create-keyring /etc/ceph/ceph.client.admin.keyring \
     --cap mds 'allow *' --cap mon 'allow *' --cap osd 'allow *'
    ceph-authtool -n mon. \
     --create-keyring /var/lib/ceph/bootstrap-mon/ceph-osceph-03.keyring \
     --set-uid=0 --cap mon 'allow *'
    ceph-authtool -n client.bootstrap-osd \
     --create-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring \
     --cap mon 'allow profile bootstrap-osd'
    ceph-authtool -n client.bootstrap-rgw \
     --create-keyring /var/lib/ceph/bootstrap-rgw/ceph.keyring \
     --cap mon 'allow profile bootstrap-rgw'
    ceph-authtool -n client.bootstrap-mds \
     --create-keyring /var/lib/ceph/bootstrap-mds/ceph.keyring \
     --cap mon 'allow profile bootstrap-mds'
  3. Create a monmap—a database of all monitors in a cluster:

    monmaptool --create --fsid eaac9695-4265-4ca8-ac2a-f3a479c559b1 \
     /tmp/tmpuuhxm3/monmap
    monmaptool --add osceph-02 192.168.43.60 /tmp/tmpuuhxm3/monmap
    monmaptool --add osceph-03 192.168.43.96 /tmp/tmpuuhxm3/monmap
    monmaptool --add osceph-04 192.168.43.80 /tmp/tmpuuhxm3/monmap
  4. Create a new keyring and import keys from the admin and monitors' keyrings there. Then use them to start the monitors:

    ceph-authtool --create-keyring /tmp/tmpuuhxm3/keyring \
     --import-keyring /var/lib/ceph/bootstrap-mon/ceph-osceph-03.keyring
    ceph-authtool /tmp/tmpuuhxm3/keyring \
     --import-keyring /etc/ceph/ceph.client.admin.keyring
    sudo -u ceph ceph-mon --mkfs -i osceph-03 \
     --monmap /tmp/tmpuuhxm3/monmap --keyring /tmp/tmpuuhxm3/keyring
    systemctl restart ceph-mon@osceph-03
  5. Check the monitors state in systemd:

    systemctl show --property ActiveState ceph-mon@osceph-03
  6. Check if Ceph is running and reports the monitor status:

    ceph --cluster=ceph \
     --admin-daemon /var/run/ceph/ceph-mon.osceph-03.asok mon_status
  7. Check the specific services' status using the existing keys:

    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin -f json-pretty status
    [...]
    ceph --connect-timeout 5 \
     --keyring /var/lib/ceph/bootstrap-mon/ceph-osceph-03.keyring \
     --name mon. -f json-pretty status
  8. Import keyring from existing Ceph services and check the status:

    ceph auth import -i /var/lib/ceph/bootstrap-osd/ceph.keyring
    ceph auth import -i /var/lib/ceph/bootstrap-rgw/ceph.keyring
    ceph auth import -i /var/lib/ceph/bootstrap-mds/ceph.keyring
    ceph --cluster=ceph \
     --admin-daemon /var/run/ceph/ceph-mon.osceph-03.asok mon_status
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin -f json-pretty status
  9. Prepare disks/partitions for OSDs, using the XFS file system:

    ceph-disk -v prepare --fs-type xfs --data-dev --cluster ceph \
     --cluster-uuid eaac9695-4265-4ca8-ac2a-f3a479c559b1 /dev/vdb
    ceph-disk -v prepare --fs-type xfs --data-dev --cluster ceph \
     --cluster-uuid eaac9695-4265-4ca8-ac2a-f3a479c559b1 /dev/vdc
    [...]
  10. Activate the partitions:

    ceph-disk -v activate --mark-init systemd --mount /dev/vdb1
    ceph-disk -v activate --mark-init systemd --mount /dev/vdc1
  11. For SUSE Enterprise Storage version 2.1 and earlier, create the default pools:

    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .users.swift 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .intent-log 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .rgw.gc 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .users.uid 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .rgw.control 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .users 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .usage 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .log 16 16
    ceph --connect-timeout 5 --keyring /etc/ceph/ceph.client.admin.keyring \
     --name client.admin osd pool create .rgw 16 16
  12. Create the RADOS Gateway instance key from the bootstrap key:

    ceph --connect-timeout 5 --cluster ceph --name client.bootstrap-rgw \
     --keyring /var/lib/ceph/bootstrap-rgw/ceph.keyring auth get-or-create \
     client.rgw.0dc1e13033d2467eace46270f0048b39 osd 'allow rwx' mon 'allow rw' \
     -o /var/lib/ceph/radosgw/ceph-rgw.rgw_name/keyring
  13. Enable and start RADOS Gateway:

    systemctl enable ceph-radosgw@rgw.rgw_name
    systemctl start ceph-radosgw@rgw.rgw_name
  14. Optionally, create the MDS instance key from the bootstrap key, then enable and start it:

    ceph --connect-timeout 5 --cluster ceph --name client.bootstrap-mds \
     --keyring /var/lib/ceph/bootstrap-mds/ceph.keyring auth get-or-create \
     mds.mds.rgw_name osd 'allow rwx' mds allow mon \
     'allow profile mds' \
     -o /var/lib/ceph/mds/ceph-mds.rgw_name/keyring
    systemctl enable ceph-mds@mds.rgw_name
    systemctl start ceph-mds@mds.rgw_name

B Documentation Updates

This chapter lists content changes for this document since the initial release of SUSE Enterprise Storage 1.

The document was updated on the following dates:

B.1 February, 2017 (Release of SUSE Enterprise Storage 4 Maintenance Update 1)

General Updates

Bugfixes

B.3 June, 2016 (Release of SUSE Enterprise Storage 3)

Bugfixes

B.4 January, 2016 (Release of SUSE Enterprise Storage 2.1)

General Updates

Chapter 3, Introduction

Chapter 8, Introduction

Chapter 24, Ceph as a Back-end for QEMU KVM Instance

Bugfixes

B.5 October, 2015 (Release of SUSE Enterprise Storage 2)

Chapter 3, Introduction

Chapter 17, Ceph RADOS Gateway

Chapter 25, Introduction
