SUSE Enterprise Storage 1.0

Deployment and Administration Guide

Author:
Publication date: February 26, 2015

Copyright © 2015 SUSE LLC

Copyright © 2010-2014, Inktank Storage, Inc. and contributors.

The text of and illustrations in this document are licensed by Inktank Storage under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

This document is an adaptation of original works found at http://ceph.com/docs/master/ (2015-01-30).

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries. All other trademarks are the property of their respective owners.

For SUSE or Novell trademarks, see the Novell Trademark and Service Mark list http://www.novell.com/company/legal/trademarks/tmlist.html. Linux* is a registered trademark of Linus Torvalds. All other third party trademarks are the property of their respective owners. A trademark symbol (®, ™ etc.) denotes a SUSE or Novell trademark; an asterisk (*) denotes a third party trademark.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, the authors, nor the translators shall be held liable for possible errors or the consequences thereof.

About This Guide

SUSE Enterprise Storage is an extension to SUSE Linux Enterprise. It combines the capabilities of the Ceph storage project (http://ceph.com/) with the enterprise engineering and support of SUSE. SUSE Enterprise Storage provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases on commodity hardware platforms.

This guide helps you understand the concepts of SUSE Enterprise Storage, with the main focus on managing and administering the Ceph infrastructure. It also demonstrates how to use Ceph together with other related solutions, such as OpenStack or KVM.

Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.

For an overview of the documentation available for your product and the latest documentation updates, refer to http://www.suse.com/documentation.

1 Available Documentation

The following manuals are available for this product:

Deployment and Administration Guide

Guides you through the deployment of a basic Ceph cluster and the administration tasks that follow, such as operating Ceph services, monitoring the cluster, managing storage pools and stored data, and setting up cache tiering.

To complete these tasks, either use the Calamari graphical Web interface or the Ceph command line tools.

HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual. Additionally, you can access the product-specific manuals as well as upstream documentation from the Help links in the graphical Web interfaces. Find the latest documentation updates at http://www.suse.com/documentation where you can download the manuals for your product in multiple formats.

2 Feedback

Several feedback channels are available:

Bugs and Enhancement Requests

For services and support options available for your product, refer to http://www.suse.com/support/.

To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select My Support › Service Request.

User Comments

We want to hear your comments about and suggestions for this manual and the other documentation included with this product. Use the User Comments feature at the bottom of each page in the online documentation or go to http://www.suse.com/documentation/feedback.html and enter your comments there.

Mail

For feedback on the documentation of this product, you can also send a mail to doc-team@suse.de. Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).

3 Documentation Conventions

The following typographical conventions are used in this manual:

  • /etc/passwd: directory names and file names

  • placeholder: replace placeholder with the actual value

  • PATH: the environment variable PATH

  • ls, --help: commands, options, and parameters

  • user: users or groups

  • Alt, Alt+F1: a key to press or a key combination; keys are shown in uppercase as on a keyboard

  • File, File › Save As: menu items, buttons

  • Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.

4 About the Making of This Manual

This book is written in Novdoc, a subset of DocBook (see http://www.docbook.org). The XML source files were validated by xmllint, processed by xsltproc, and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps. The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/.

1 About SUSE Enterprise Storage

1.1 Introduction

Ceph is a distributed storage solution designed for scalability, reliability, and performance. In contrast to conventional systems, which use allocation tables to store and fetch data, Ceph uses a pseudo-random data distribution function to store data, which reduces the number of lookups required in storage. Data is stored on intelligent object storage devices (OSDs), which automate data management tasks such as data distribution, data replication, failure detection, and recovery. Ceph is both self-healing and self-managing, which reduces administrative and budget overhead.

The Ceph system has the following features:

Controlled, Scalable, Decentralized Placement of Replicated Data Using CRUSH

The Ceph system uses a unique map called CRUSH (Controlled Replication Under Scalable Hashing) to assign data to OSDs in an efficient manner. Data assignment offsets are computed rather than looked up in tables. This does away with the disk lookups that conventional allocation-table-based systems require, reducing the communication between the storage and the client. A client armed with the CRUSH map and metadata such as the object name and byte offset knows where to find the data or on which OSD to place it.

CRUSH maintains a hierarchy of devices and the replica placement policy. As new devices are added, data from existing nodes is moved to the new device to improve the distribution with respect to workload and resilience. As part of the replica placement policy, weights can be assigned to devices so that some devices are favored over others. This can be used, for example, to give higher weights to solid state drives (SSDs) and lower weights to conventional rotational hard disks to achieve better overall performance.
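
For example, assuming hypothetical OSD IDs where osd.10 is backed by an SSD and osd.11 by a rotational drive, the weights could be adjusted with the command described in Section 4.4.3, “Adjust an OSD’s CRUSH Weight”:

ceph osd crush reweight osd.10 2.0
ceph osd crush reweight osd.11 1.0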

CRUSH is designed to optimally distribute data to utilize available devices efficiently. CRUSH supports different ways of data distribution such as the following:

  • n-way replication (mirroring)

  • RAID parity schemes

  • Erasure Coding

  • Hybrid approaches such as RAID-10

Reliable Autonomic Distributed Object Storage (RADOS)

The intelligence in the OSDs allows tasks such as data replication and migration to be performed automatically for self-management and self-healing. By default, data written to Ceph storage is replicated across OSDs. The level and type of replication are configurable. In case of failures, the CRUSH map is updated and data is written to new (replicated) OSDs.

The intelligence of OSDs enables them to handle data replication, data migration, failure detection, and recovery. These tasks are managed automatically and autonomously. This also allows the creation of various pools for different sorts of I/O.

Replicated Monitor Servers

The monitor servers keep track of all the devices in the system. They manage the CRUSH map, which is used to determine where data needs to be placed. If any of the OSDs fail, the CRUSH map is re-generated and re-distributed to the rest of the system. It is recommended that a system contain multiple monitor servers at any given time to add redundancy and improve resilience.
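
For example, you can check the number of monitors and whether they form a quorum with the commands described in Section 3.2.6, “Checking Monitor Status”:

ceph mon stat
ceph quorum_status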

2 Installation of Basic Ceph Cluster

This chapter outlines procedures to install and deploy the Ceph cluster. The main focus is on deploying Ceph either with the ceph-deploy command line utility, or the Crowbar framework.

2.1 Deploying with ceph-deploy

ceph-deploy is a command line utility that eases the deployment of a Ceph cluster in small-scale setups.

2.1.1 Ceph Layout

Even though Ceph can run with a minimal setup of two nodes (mainly for testing purposes), we recommend a setup with at least three monitor nodes and four OSD nodes.

Tip
Tip

Although Ceph nodes can be virtual machines, real hardware is strongly recommended for a production environment.

2.1.2 Preparing Each Ceph Node

Before deploying the Ceph cluster, apply the following steps for each Ceph node as root:

  1. Install SUSE Linux Enterprise 12 and add the SUSE Enterprise Storage extension. It provides a software repository with the software needed to run Ceph.

  2. Install NTP. We recommend installing the NTP time synchronization service on Ceph nodes to prevent issues arising from clock drift:

    sudo zypper in ntp yast2-ntp-client

    To configure NTP, go to YaST › Network Services › NTP Configuration. Make sure to enable the NTP service (systemctl enable ntpd.service && systemctl start ntpd.service).
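
    To verify that the node synchronizes its clock against the configured time sources, you can query the NTP daemon, for example:

    ntpq -p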

  3. Install SSH server. Ceph uses SSH to log in to all cluster nodes. Make sure SSH is installed (zypper in openssh) and enabled (systemctl enable sshd.service && systemctl start sshd.service).

  4. Add a ceph user. The admin node will log in to the Ceph nodes as this particular user (ceph).

    useradd -m ceph
  5. The admin node needs passwordless SSH access to all Ceph nodes. When ceph-deploy logs in to a Ceph node as the ceph user, this user must have passwordless sudo privileges.

    Edit the /etc/sudoers file (with visudo) and add the following line to grant the ceph user passwordless sudo privileges:

    ceph ALL = (root) NOPASSWD:ALL
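
    To verify that the privileges were granted as intended, you can list the sudo rules that apply to the ceph user:

    sudo -l -U ceph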
  6. On the admin node, become the ceph user, and enable passwordless SSH access to all other Ceph nodes:

    su - ceph
    ssh-keygen

    You will be asked several questions. Leave the values at their defaults, and the passphrase empty.

    Copy the key to each Ceph node:

    ssh-copy-id ceph@node1
    ssh-copy-id ceph@node2
    ssh-copy-id ceph@node3
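
    To verify that passwordless login works, you can run a quick test from the admin node, for example:

    ssh ceph@node1 hostname

    The command should print the remote host name without asking for a password.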

2.1.3 Cleaning Previous Ceph Environment

If at any point during the Ceph deployment you run into trouble and need to start over, or you want to make sure that any previous Ceph configuration is removed, execute the following commands as the ceph user to purge the previous Ceph configuration:

ceph-deploy purge node1 node2 node3
ceph-deploy purgedata node1 node2 node3
ceph-deploy forgetkeys

As root, run the following:

zypper rm $(zypper --disable-system-resolvables -s 0 packages \
 -r SUSE-Storage | grep -v '^-' | tail -n +4 | cut -d'|' -f3 | sort -u)

2.1.4 Running ceph-deploy

After you prepared each Ceph node as described in Section 2.1.2, “Preparing Each Ceph Node”, you are ready to deploy Ceph with ceph-deploy. Note that the ceph-deploy utility is run from the admin node.

Tip
Tip: Non-default Cluster Name

If you need to install the cluster with ceph-deploy using a cluster name other than the default, you have to specify it initially with --cluster, and then reference it in each ceph-deploy command related to that cluster:

ceph-deploy --cluster my_cluster new [...]
ceph-deploy --ceph-conf my_cluster.conf mon create-initial
ceph-deploy --ceph-conf my_cluster.conf osd prepare [...]
ceph-deploy --ceph-conf my_cluster.conf osd activate [...]

Note that using a cluster name other than the default is not supported by SUSE.

  1. Install ceph-deploy:

    sudo zypper in ceph-deploy
  2. Disable IPv6. Open /etc/sysctl.conf, add or adjust the following lines, and reboot the admin node:

    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
  3. Because it is not recommended to run ceph-deploy as root, become the ceph user:

    su - ceph
  4. Run ceph-deploy to install Ceph on each node:

    ceph-deploy install node1 node2 node3
    Tip
    Tip

    ceph-deploy creates important files in the directory where you run it from. It is best to run ceph-deploy in an empty directory.

  5. Set up the monitor nodes. Create keys and local configuration. The keys are used to authenticate and protect the communication between Ceph nodes.

    ceph-deploy new node1 node2 node3

    During this step, ceph-deploy creates local configuration files. It is recommended to inspect the configuration files in the current directory.

    Tip
    Tip: Monitor Nodes on Different Subnets

    If the monitor nodes are not in the same subnet, you need to modify the ceph.conf in the current directory. For example, if the nodes have IP addresses

    10.121.9.186
    10.121.10.186
    10.121.11.186

    add the following line to the global section of ceph.conf:

    public network = 10.121.0.0/16

    Since you are likely to experience problems with IPv6 networking, please consider modifying the IPv6 mon_host settings, as in the following example:

    mon_host = [2620:...10:121:9:186,2620:...10:121:10:186,2620:...10:121:11:186]

    into its IPv4 equivalent:

    mon_host = 10.121.9.186, 10.121.10.186, 10.121.11.186
  6. Create the initial monitor service on already created monitor nodes:

    ceph-deploy mon create-initial
  7. To test the status of the cluster, run

    ceph -k ceph.client.admin.keyring health
    Tip
    Tip

    If you specify which node is the admin node, then you need not specify the keyring file each time you run the ceph command:

    ceph-deploy admin node
  8. Check the firewall status

    sudo /sbin/SuSEfirewall2 status

    and if it is on, either turn it off with

    sudo /sbin/SuSEfirewall2 off

    or, if you want to keep it on, enable the appropriate set of ports. Ceph needs the following incoming TCP ports: 6789, 6800-7100, 8140.
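
    A minimal sketch of opening these ports, assuming the external firewall zone is used, is to add them to the FW_SERVICES_EXT_TCP variable in /etc/sysconfig/SuSEfirewall2 and reload the firewall:

    FW_SERVICES_EXT_TCP="6789 6800:7100 8140"
    sudo /sbin/SuSEfirewall2 start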

  9. Create OSD daemons. Although you can use a directory as storage, we recommend creating a separate disk dedicated to the Ceph node. To find out the name of the disk device, run

    cat /proc/partitions
    major minor  #blocks  name
    
     254        0   12582912 vda
     254        1    1532928 vda1
     254        2   11048960 vda2
      11        0    2831360 sr0
     254       16    4194304 vdb

    In our case the vdb disk has no partitions, so it is most likely our newly created disk.

    Now set up the disk for Ceph:

    ceph-deploy osd prepare node:vdb
    Tip
    Tip

    If there are already one or more partitions on the disk you want to use, you need to run

    ceph-deploy disk zap node:vdb

    before running the ceph-deploy osd prepare command.

    Note
    Note: Default Filesystem for OSDs

    The default and recommended filesystem for OSDs is xfs. The only other supported filesystem is btrfs. Use the --fs-type option to specify a filesystem other than the default:

    ceph-deploy osd prepare --fs-type btrfs node:vdb

    Finally activate the OSD:

    ceph-deploy osd activate node:vdb1
    Tip
    Tip

    To join the functionality of ceph-deploy osd prepare and ceph-deploy osd activate, use ceph-deploy osd create.
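
    For example, assuming the same node and disk as above:

    ceph-deploy osd create node:vdb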

2.1.5 Install Calamari

Calamari is a management and monitoring system for a Ceph storage cluster. It provides a Web user interface that makes Ceph cluster monitoring simple and convenient.

To install Calamari, run the following commands as root:

  1. Install the client part of Calamari:

    # zypper in calamari-clients
  2. Initialize the Calamari installation. You will be asked for a superuser user name and password. These will be needed when logging in to the Web interface after the setup is complete.

    # calamari-ctl initialize
    [INFO] Loading configuration..
    [INFO] Starting/enabling salt...
    [INFO] Starting/enabling postgres...
    [INFO] Initializing database...
    [INFO] Initializing web interface...
    [INFO] You will now be prompted for login details for the administrative user
    account.  This is the account you will use to log into the web interface once
    setup is complete.
    Username (leave blank to use 'root'):  
    Email address: 
    Password: 
    Password (again): 
    Superuser created successfully.
    [INFO] Starting/enabling services...
    [INFO] Restarting services...
    [INFO] Complete.
  3. Check the firewall status

    sudo /sbin/SuSEfirewall2 status

    and if it is on, either turn it off with

    sudo /sbin/SuSEfirewall2 off

    or, if you want to keep it on, enable the appropriate set of ports. Calamari needs the following incoming TCP ports: 80, 4505, 4506, 2003, 2004.

  4. Now open your Web browser and point it to the host name/IP address of the server where you installed Calamari. Log in with the credentials you entered when initializing Calamari. A welcome screen appears, instructing you to enter the ceph-deploy calamari connect command. Switch to the terminal on the Calamari host and enter the following command:

    ceph-deploy calamari --master master_host connect node1 node2 ...

    After the command is successfully finished, reload the Web browser. Now you can monitor your Ceph cluster, OSDs, pools, etc.

    Tip
    Tip

    There may be leftovers of a previous Calamari setup on the system. If, after logging in to the Calamari application, some nodes appear as already joined or registered, run the following command on the Calamari host to trigger a re-run of Salt on all Ceph nodes, which should clear up any odd state or missing bits and pieces:

    salt '*' state.highstate

3 Operating a Ceph Cluster

This chapter introduces basic operation of cluster services. You will learn how to start and stop Ceph services, how to monitor a cluster's state, how to use and modify CRUSH Maps, and how to manage storage pools.

3.1 Operating Ceph Services

Ceph-related services are operated with the systemctl command. The operation takes place on the node you are currently logged in to. You need root privileges to operate on Ceph services.

The following subcommands are supported for all Ceph services:

systemctl status target

Prints the status information of the service(s) specified by target.

systemctl start target

Starts the service(s) specified by target if they are not already running.

systemctl stop target

Stops the running service(s) specified by target.

systemctl enable target

Enables the service(s) specified by target so that they are automatically started on system start-up.

systemctl disable target

Disables the service(s) specified by target so that they are not automatically started on system start-up. You have to start them manually with systemctl start.

The following subcommand is supported for OSD related targets:

systemctl mask target

Prevents the service from being started manually or automatically. systemctl creates a symbolic link for the specific target file in /etc/systemd/system/ that points to /dev/null. Targets in /etc/systemd override those provided by packages in /usr/lib/systemd. systemd recognizes the symbolic link and will not start the service.

systemctl unmask target

The opposite of systemctl mask. Removes the symbolic link in /etc/systemd/system/ that points to /dev/null, so that the targets located in /usr/lib/systemd are valid again.

You can operate either on all Ceph services at once, on all Ceph services of a certain type (such as OSD or monitor), or on a specific service instance identified by its instance name.

3.1.1 Operating on All Ceph Services

To operate on all Ceph services at once, run:

systemctl subcommand ceph*

For example:

systemctl status ceph*
ceph-osd@0.service - Ceph object storage daemon
    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled)
    Active: active (running) since Fri 2015-01-23 09:09:12 EST; 4min 52s ago
  Main PID: 2788 (ceph-osd)
    CGroup:
/system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
            └─2788 /usr/bin/ceph-osd -f --cluster ceph --id 0
Jan 23 09:09:12 ceph-node2 ceph-osd-prestart.sh[2737]:
create-or-move updated item name 'osd.0' weight 0.01 at location
{host=ceph-node2,root=default} to crush map
 Jan 23 09:09:12 ceph-node2 ceph-osd[2788]: starting osd.0 at
:/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
 Jan 23 09:09:12 ceph-node2 ceph-osd[2788]:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 Jan 23 09:09:12 ceph-node2 ceph-osd[2788]:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device

ceph-radosgw@ceph4.service - Ceph rados gateway
    Loaded: loaded
(/usr/lib/systemd/system/ceph-radosgw@.service; enabled)
    Active: active (running) since Fri 2015-01-23 09:11:29
EST; 2min 35s ago
  Main PID: 3741 (radosgw)
    CGroup:
/system.slice/system-ceph\x2dradosgw.slice/ceph-radosgw@ceph4.service
            └─3741 /usr/bin/radosgw -f --conf /etc/ceph/ceph.conf --name client.radosgw.ceph4
 Jan 23 09:11:31 ceph-node2 radosgw[3741]: 2015-01-23
09:11:31.206129 7f60e1df9700 -1 failed to list objects pool_iterate returned r=-2
Tip
Tip

systemctl commands with the ceph* wild card cannot operate on services that are not enabled, are masked, or have never been started. To include all Ceph services present on a node, use the rcceph wrapper. For example:

rcceph status

3.1.2 Operating on All Ceph Services of a Specific Type

It is possible to address services of a specific type only — such as OSDs, monitors, or RADOS Gateway:

systemctl subcommand ceph-service_type*

A few practical examples follow:

Print the status of all monitor services on a node:

systemctl status ceph-mon*

Start all OSD services on a node:

systemctl start ceph-osd*

Stop all RADOS Gateway services on a node:

systemctl stop ceph-radosgw*

3.1.3 Operating on a Specific Ceph Service

If you want to operate on a single service on a node, you need to know its instance name. It is available in the output of systemctl status ceph*.

systemctl subcommand ceph-service_type@instance_name.service

Print the status information of the OSD service with instance name 0:

systemctl status ceph-osd@0.service

3.2 Monitoring a Cluster

Once you have a running cluster, you may use the ceph tool to monitor your cluster. Monitoring a cluster typically involves checking OSD status, monitor status, placement group status and metadata server status.

Tip
Tip: Interactive Mode

To run the ceph tool in an interactive mode, type ceph at the command line with no arguments. The interactive mode is more convenient if you are going to enter several ceph commands in a row. For example:

ceph
ceph> health
ceph> status
ceph> quorum_status
ceph> mon_status

3.2.1 Checking Cluster Health

After you start your cluster, and before you start reading and/or writing data, check your cluster’s health first. You can check on the health of your Ceph cluster with the following:

ceph health
HEALTH_WARN 10 pgs degraded; 100 pgs stuck unclean; 1 mons down, quorum 0,2 \
node-1,node-2,node-3

If you specified non-default locations for your configuration or keyring, you may specify their locations:

ceph -c /path/to/conf -k /path/to/keyring health

Upon starting the Ceph cluster, you will likely encounter a health warning such as HEALTH_WARN XXX num placement groups stale. Wait a few moments and check it again. When your cluster is ready, ceph health should return a message such as HEALTH_OK. At that point, it is okay to begin using the cluster.

3.2.2 Watching a Cluster

To watch the cluster’s ongoing events, open a new terminal and enter:

ceph -w

Ceph will print each event. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:

cluster b370a29d-9287-4ca3-ab57-3d824f65e339
 health HEALTH_OK
 monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
 osdmap e63: 2 osds: 2 up, 2 in
  pgmap v41338: 952 pgs, 20 pools, 17130 MB data, 2199 objects
        115 GB used, 167 GB / 297 GB avail
             952 active+clean

2014-06-02 15:45:21.655871 osd.0 [INF] 17.71 deep-scrub ok
2014-06-02 15:45:47.880608 osd.1 [INF] 1.0 scrub ok
2014-06-02 15:45:48.865375 osd.1 [INF] 1.3 scrub ok
2014-06-02 15:45:50.866479 osd.1 [INF] 1.4 scrub ok
[...]
2014-06-02 15:45:55.720929 mon.0 [INF] pgmap v41343: 952 pgs: \
 1 active+clean+scrubbing+deep, 951 active+clean; 17130 MB data, 115 GB used, \
 167 GB / 297 GB avail

The output provides the following information:

  • Cluster ID

  • Cluster health status

  • The monitor map epoch and the status of the monitor quorum

  • The OSD map epoch and the status of OSDs

  • The placement group map version

  • The number of placement groups and pools

  • The notional amount of data stored and the number of objects stored

  • The total amount of data stored.

Tip
Tip: How Ceph Calculates Data Usage

The used value reflects the actual amount of raw storage used. The xxx GB / xxx GB value means the amount available (the lesser number) of the overall storage capacity of the cluster. The notional number reflects the size of the stored data before it is replicated, cloned or snapshotted. Therefore, the amount of data actually stored typically exceeds the notional amount stored, because Ceph creates replicas of the data and may also use storage capacity for cloning and snapshotting.

3.2.3 Checking a Cluster’s Usage Stats

To check a cluster’s data usage and data distribution among pools, you can use the ceph df command. It is similar to the Linux df command. Execute the following:

ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED 
    27570M     27304M         266M          0.97 
POOLS:
    NAME             ID     USED     %USED     MAX AVAIL     OBJECTS 
    data             0       120         0         5064M           4 
    metadata         1         0         0         5064M           0 
    rbd              2         0         0         5064M           0 
    hot-storage      4       134         0         4033M           2 
    cold-storage     5      227k         0         5064M           1 
    pool1            6         0         0         5064M           0

The GLOBAL section of the output provides an overview of the amount of storage your cluster uses for your data.

  • SIZE: The overall storage capacity of the cluster.

  • AVAIL: The amount of free space available in the cluster.

  • RAW USED: The amount of raw storage used.

  • % RAW USED: The percentage of raw storage used. Use this number in conjunction with the full ratio and near full ratio to ensure that you are not reaching your cluster’s capacity. See Storage Capacity for additional details.

The POOLS section of the output provides a list of pools and the notional usage of each pool. The output from this section does not reflect replicas, clones or snapshots. For example, if you store an object with 1MB of data, the notional usage will be 1MB, but the actual usage may be 2MB or more depending on the number of replicas, clones and snapshots.

  • NAME: The name of the pool.

  • ID: The pool ID.

  • USED: The notional amount of data stored in kilobytes, unless the number appends M for megabytes or G for gigabytes.

  • %USED: The notional percentage of storage used per pool.

  • OBJECTS: The notional number of objects stored per pool.

Note
Note

The numbers in the POOLS section are notional. They are not inclusive of the number of replicas, snapshots or clones. As a result, the sum of the USED and %USED amounts will not add up to the RAW USED and %RAW USED amounts in the GLOBAL section of the output.

3.2.4 Checking a Cluster’s Status

To check a cluster’s status, execute the following:

ceph status

or

ceph -s

In interactive mode, type status and press Enter.

ceph> status

Ceph will print the cluster status. For example, a tiny Ceph cluster consisting of one monitor and two OSDs may print the following:

cluster b370a29d-9287-4ca3-ab57-3d824f65e339
 health HEALTH_OK
 monmap e1: 1 mons at {ceph1=10.0.0.8:6789/0}, election epoch 2, quorum 0 ceph1
 osdmap e63: 2 osds: 2 up, 2 in
  pgmap v41332: 952 pgs, 20 pools, 17130 MB data, 2199 objects
        115 GB used, 167 GB / 297 GB avail
               1 active+clean+scrubbing+deep
             951 active+clean

3.2.5 Checking OSD Status

You can check OSDs to ensure they are up and in by executing:

ceph osd stat

or

ceph osd dump

You can also view OSDs according to their position in the CRUSH map.

ceph osd tree

Ceph will print out a CRUSH tree with a host, its OSDs, whether they are up, and their weights.

# id    weight  type name       up/down reweight
-1      3       pool default
-3      3               rack mainrack
-2      3                       host osd-host
0       1                               osd.0   up      1
1       1                               osd.1   up      1
2       1                               osd.2   up      1

3.2.6 Checking Monitor Status

If your cluster has multiple monitors (likely), you should check the monitor quorum status after you start the cluster and before reading and/or writing data. A quorum must be present when multiple monitors are running. You should also check monitor status periodically to ensure that they are running.

To display the monitor map, execute the following:

ceph mon stat

or

ceph mon dump

To check the quorum status for the monitor cluster, execute the following:

ceph quorum_status

Ceph will return the quorum status. For example, a Ceph cluster consisting of three monitors may return the following:

{ "election_epoch": 10,
  "quorum": [
        0,
        1,
        2],
  "monmap": { "epoch": 1,
      "fsid": "444b489c-4f16-4b75-83f0-cb8097468898",
      "modified": "2011-12-12 13:28:27.505520",
      "created": "2011-12-12 13:28:27.505520",
      "mons": [
            { "rank": 0,
              "name": "a",
              "addr": "127.0.0.1:6789\/0"},
            { "rank": 1,
              "name": "b",
              "addr": "127.0.0.1:6790\/0"},
            { "rank": 2,
              "name": "c",
              "addr": "127.0.0.1:6791\/0"}
           ]
    }
}

3.2.7 Checking MDS Status

Metadata servers provide metadata services for Ceph FS. Metadata servers have two sets of states: up | down and active | inactive. To ensure your metadata servers are up and active, execute the following:

ceph mds stat

To display details of the metadata cluster, execute the following:

ceph mds dump

3.2.8 Checking Placement Group States

Placement groups map objects to OSDs. When you monitor your placement groups, you will want them to be active and clean. For a detailed discussion, refer to Monitoring OSDs and Placement Groups.

3.2.9 Using the Admin Socket

The Ceph admin socket allows you to query a daemon via a socket interface. By default, Ceph sockets reside under /var/run/ceph. To access a daemon via the admin socket, log in to the host running the daemon and use the following command:

ceph --admin-daemon /var/run/ceph/socket-name

To view the available admin socket commands, execute the following command:

ceph --admin-daemon /var/run/ceph/socket-name help

The admin socket command enables you to show and set your configuration at runtime. See Viewing a Configuration at Runtime for details.

Additionally, you can set configuration values at runtime directly. The admin socket bypasses the monitor, unlike ceph tell daemon-type.id injectargs, which relies on the monitor but does not require you to log in directly to the host in question.
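
For example, the following sketch, assuming an OSD with the ID 0 and the default socket name, shows the daemon's runtime configuration and then raises its debug level, first via the admin socket and then via the monitor with ceph tell:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set debug_osd 20
ceph tell osd.0 injectargs '--debug-osd 20'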

3.3 Managing Storage Pools

When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. A pool provides you with:

  • Resilience: You can set how many OSDs are allowed to fail without losing data. For replicated pools, it is the desired number of copies/replicas of an object. A typical configuration stores an object and one additional copy (i.e., size = 2), but you can determine the number of copies/replicas. For erasure coded pools, it is the number of coding chunks (i.e. m=2 in the erasure code profile).

  • Placement Groups: You can set the number of placement groups for the pool. A typical configuration uses approximately 100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as a whole.

  • CRUSH Rules: When you store data in a pool, a CRUSH ruleset mapped to the pool enables CRUSH to identify a rule for the placement of the object and its replicas (or chunks for erasure coded pools) in your cluster. You can create a custom CRUSH rule for your pool.

  • Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.

  • Set Ownership: You can set a user ID as the owner of a pool.

To organize data into pools, you can list, create, and remove pools. You can also view the utilization statistics for each pool.

3.3.1 Operating Pools

This section introduces practical information for performing basic tasks with pools. You will find out how to list, create, and delete pools, as well as how to show pool statistics and manage pool snapshots.

3.3.1.1 List Pools

To list your cluster’s pools, execute:

ceph osd lspools
0 data,1 metadata,2 rbd,

The output lists the default pools: data, metadata, and rbd (RADOS Block Device).

3.3.1.2 Create a Pool

To create a replicated pool, execute:

ceph osd pool create pool-name pg-num pgp-num pool-type crush-ruleset-name

To create an erasure pool, execute:

ceph osd pool create pool-name pg-num pgp-num pool-type erasure-code-profile \
crush-ruleset-name
pool-name

The name of the pool. It must be unique. This option is required.

pg-num

The total number of placement groups for the pool. This option is required. Default value is 8.

pgp-num

The total number of placement groups for placement purposes. This should be equal to the total number of placement groups, except for placement group splitting scenarios. This option is required. Default value is 8.

pool-type

The pool type, which may be either replicated (to recover from lost OSDs by keeping multiple copies of the objects) or erasure (to get a kind of generalized RAID5 capability). The replicated pools require more raw storage but implement all Ceph operations. The erasure pools require less raw storage but only implement a subset of the available operations. Default is 'replicated'.

crush-ruleset-name

The name of the CRUSH ruleset for this pool. For a replicated pool, if the specified ruleset does not exist, the pool creation fails with -ENOENT; an erasure pool instead creates a new ruleset with the specified name. The default value is 'erasure-code' for erasure pools; replicated pools pick up the Ceph configuration variable osd_pool_default_crush_replicated_ruleset.

erasure-code-profile=profile

For erasure pools only. Use the erasure code profile. It must be an existing profile as defined by osd erasure-code-profile set.

When you create a pool, set the number of placement groups to a reasonable value (e.g., 100). Consider the total number of placement groups per OSD too. Placement groups are computationally expensive, so performance will degrade when you have many pools with many placement groups (e.g., 50 pools with 100 placement groups each). The point of diminishing returns depends upon the power of the OSD host.

See Placement Groups (http://docs.ceph.com/docs/master/rados/operations/placement-groups/) for details on calculating an appropriate number of placement groups for your pool.
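
For example, the following commands, using hypothetical pool names, create a replicated pool and an erasure coded pool that uses the default erasure code profile:

ceph osd pool create mypool 100 100 replicated
ceph osd pool create ecpool 100 100 erasure default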

3.3.1.3 Set Pool Quotas

You can set pool quotas for the maximum number of bytes and/or the maximum number of objects per pool.

ceph osd pool set-quota pool-name max_objects obj-count max_bytes bytes

For example:

ceph osd pool set-quota data max_objects 10000

To remove a quota, set its value to 0.
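
To review the quotas currently set on a pool, you can use the related get-quota subcommand, for example:

ceph osd pool get-quota data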

3.3.1.4 Delete a Pool

To delete a pool, execute:

ceph osd pool delete pool-name pool-name --yes-i-really-really-mean-it

If you created your own rulesets and rules for a pool you created, you should consider removing them when you no longer need your pool. If you created users with permissions strictly for a pool that no longer exists, you should consider deleting those users too.
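
A hypothetical clean-up sketch, assuming a pool named pool1 was deleted and a dedicated ruleset pool1_ruleset and a user client.pool1 had been created for it:

ceph osd crush rule rm pool1_ruleset
ceph auth del client.pool1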

3.3.1.5 Rename a Pool

To rename a pool, execute:

ceph osd pool rename current-pool-name new-pool-name

If you rename a pool and you have per-pool capabilities for an authenticated user, you must update the user’s capabilities (i.e., caps) with the new pool name.

3.3.1.6 Show Pool Statistics

To show a pool’s utilization statistics, execute:

rados df
pool name  category  KB  objects  clones  degraded  unfound  rd  rd KB  wr  wr KB
cold-storage    -   228   1         0      0          0       0   0      1   228
data            -    1    4         0      0          0       0   0      4    4
hot-storage     -    1    2         0      0          0       15  10     5   231
metadata        -    0    0         0      0          0       0   0      0    0
pool1           -    0    0         0      0          0       0   0      0    0
rbd             -    0    0         0      0          0       0   0      0    0
total used          266268          7
total avail       27966296
total space       28232564

3.3.1.7 Make a Snapshot of a Pool

To make a snapshot of a pool, execute:

ceph osd pool mksnap pool-name snap-name

For example:

ceph osd pool mksnap pool1 snapshot1
created pool pool1 snap snapshot1

3.3.1.8 Remove a Snapshot of a Pool

To remove a snapshot of a pool, execute:

ceph osd pool rmsnap pool-name snap-name

3.3.1.9 Set Pool Values

To set a value to a pool, execute:

ceph osd pool set pool-name key value

You may set values for the following keys:

size

Sets the number of replicas for objects in the pool. See Section 3.3.1.11, “Set the Number of Object Replicas” for further details. Replicated pools only.

min_size

Sets the minimum number of replicas required for I/O. See Section 3.3.1.11, “Set the Number of Object Replicas” for further details. Replicated pools only.

crash_replay_interval

The number of seconds to allow clients to replay acknowledged, but uncommitted requests.

pgp_num

The effective number of placement groups to use when calculating data placement.

crush_ruleset

The ruleset to use for mapping object placement in the cluster.

hashpspool

Set or unset the HASHPSPOOL flag on a given pool. 1 sets the flag, 0 unsets it.

hit_set_type

Enables hit set tracking for cache pools. See Bloom Filter (http://en.wikipedia.org/wiki/Bloom_filter) for additional information. This option can have the following values: bloom, explicit_hash, explicit_object. Default is bloom.

hit_set_count

The number of hit sets to store for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon. Default is 1.

hit_set_period

The duration of a hit set period in seconds for cache pools. The higher the number, the more RAM consumed by the ceph-osd daemon.

hit_set_fpp

The false positive probability for the bloom hit set type. See Bloom Filter (http://en.wikipedia.org/wiki/Bloom_filter) for additional information. Valid range is 0.0 - 1.0. Default is 0.05.

cache_target_dirty_ratio

The percentage of the cache pool containing modified (dirty) objects before the cache tiering agent will flush them to the backing storage pool. Default is .4

cache_target_full_ratio

The percentage of the cache pool containing unmodified (clean) objects before the cache tiering agent will evict them from the cache pool. Default is .8

target_max_bytes

Ceph will begin flushing or evicting objects when the max_bytes threshold is triggered.

target_max_objects

Ceph will begin flushing or evicting objects when the max_objects threshold is triggered.

cache_min_flush_age

The time (in seconds) before the cache tiering agent will flush an object from the cache pool to the storage pool.

cache_min_evict_age

The time (in seconds) before the cache tiering agent will evict an object from the cache pool.
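
For example, to enable hit set tracking on a cache pool such as the hot-storage pool used in Chapter 5, Cache Tiering (the values shown are only illustrative):

ceph osd pool set hot-storage hit_set_type bloom
ceph osd pool set hot-storage hit_set_count 1
ceph osd pool set hot-storage hit_set_period 3600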

3.3.1.10 Get Pool Values

To get a value from a pool, execute:

ceph osd pool get pool-name key

You can get values for keys listed in Section 3.3.1.9, “Set Pool Values” plus the following keys:

pg_num

The number of placement groups for the pool.

pgp_num

The effective number of placement groups to use when calculating data placement. Valid range is equal to or less than pg_num.

3.3.1.11 Set the Number of Object Replicas

To set the number of object replicas on a replicated pool, execute the following:

ceph osd pool set poolname size num-replicas
Important
Important

The num-replicas value includes the object itself. If, for example, you want the object and two copies of the object for a total of three instances of the object, specify 3.

For example:

ceph osd pool set data size 3

You may execute this command for each pool.

Note
Note

An object might accept I/Os in degraded mode with fewer than pool size replicas. To set a minimum number of required replicas for I/O, you should use the min_size setting. For example:

ceph osd pool set data min_size 2

This ensures that no object in the data pool will receive I/O with fewer than min_size replicas.

3.3.1.12 Get the Number of Object Replicas

To get the number of object replicas, execute the following:

ceph osd dump | grep 'replicated size'

Ceph will list the pools, with the replicated size attribute highlighted. By default, Ceph creates two replicas of an object (a total of three copies, or a size of 3).

4 Stored Data Management

The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.

CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.

CRUSH maps contain a list of OSDs, a list of ‘buckets’ for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.

When you create a configuration file and deploy Ceph with ceph-deploy, Ceph generates a default CRUSH map for your configuration. The default CRUSH map is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.

For example, if an OSD goes down, a CRUSH map can help you locate the physical data center, room, row, and rack of the host with the failed OSD in the event you need to use on-site support or replace hardware.

Similarly, CRUSH may help you identify faults more quickly. For example, if all OSDs in a particular rack go down simultaneously, the fault may lie with a network switch or with the power to the rack rather than with the OSDs themselves.

A custom CRUSH map can also help you identify the physical locations where Ceph stores redundant copies of data when the placement group(s) associated with a failed host are in a degraded state.

There are three main sections to a CRUSH Map.

  • Devices consist of any object storage device, that is, the hard disk corresponding to a ceph-osd daemon.

  • Buckets consist of a hierarchical aggregation of storage locations (e.g., rows, racks, hosts, etc.) and their assigned weights.

  • Rule Sets consist of the manner of selecting buckets.

4.1 Devices

To map placement groups to OSDs, a CRUSH Map requires a list of OSD devices (i.e., the name of the OSD daemon). The list of devices appears first in the CRUSH Map.

#devices
device num osd.name

For example:

#devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

As a general rule, an OSD daemon maps to a single disk or to a RAID.

4.2 Buckets

CRUSH maps support the notion of ‘buckets’, which may be thought of as nodes that aggregate other buckets into a hierarchy of physical locations, where OSD devices are the leaves of the hierarchy. The following list shows the default types.

  • 0: OSD. An OSD daemon (e.g., osd.1, osd.2, etc.).

  • 1: Host. A host name containing one or more OSDs.

  • 2: Rack. A computer rack. The default is unknownrack.

  • 3: Row. A row in a series of racks.

  • 4: Room. A room containing racks and rows of hosts.

  • 5: Data Center. A physical data center containing rooms.

  • 6: Pool. A data storage pool for storing objects.

Tip
Tip

You can remove these types and create your own bucket types.

Ceph’s deployment tools generate a CRUSH map that contains a bucket for each host, and a pool named 'default', which is useful for the default data, metadata and rbd pools. The remaining bucket types provide a means for storing information about the physical location of nodes/buckets, which makes cluster administration much easier when OSDs, hosts, or network hardware malfunction and the administrator needs access to physical hardware.

A bucket has a type, a unique name (string), a unique ID expressed as a negative integer, a weight relative to the total capacity/capability of its item(s), the bucket algorithm (straw by default), and the hash (0 by default, reflecting the CRUSH hash rjenkins1). A bucket may have one or more items. The items may consist of other buckets or OSDs. Items may have a weight that reflects the relative weight of the item.

[bucket-type] [bucket-name] {
  id [a unique negative numeric ID]
  weight [the relative capacity/capability of the item(s)]
  alg [the bucket type: uniform | list | tree | straw ]
  hash [the hash type: 0 by default]
  item [item-name] weight [weight]
}

The following example illustrates how you can use buckets to aggregate a pool and physical locations like a datacenter, a room, a rack and a row.

host ceph-osd-server-1 {
        id -17
        alg straw
        hash 0
        item osd.0 weight 1.00
        item osd.1 weight 1.00
}

row rack-1-row-1 {
        id -16
        alg straw
        hash 0
        item ceph-osd-server-1 weight 2.00
}

rack rack-3 {
        id -15
        alg straw
        hash 0
        item rack-3-row-1 weight 2.00
        item rack-3-row-2 weight 2.00
        item rack-3-row-3 weight 2.00
        item rack-3-row-4 weight 2.00
        item rack-3-row-5 weight 2.00
}

rack rack-2 {
        id -14
        alg straw
        hash 0
        item rack-2-row-1 weight 2.00
        item rack-2-row-2 weight 2.00
        item rack-2-row-3 weight 2.00
        item rack-2-row-4 weight 2.00
        item rack-2-row-5 weight 2.00
}

rack rack-1 {
        id -13
        alg straw
        hash 0
        item rack-1-row-1 weight 2.00
        item rack-1-row-2 weight 2.00
        item rack-1-row-3 weight 2.00
        item rack-1-row-4 weight 2.00
        item rack-1-row-5 weight 2.00
}

room server-room-1 {
        id -12
        alg straw
        hash 0
        item rack-1 weight 10.00
        item rack-2 weight 10.00
        item rack-3 weight 10.00
}

datacenter dc-1 {
        id -11
        alg straw
        hash 0
        item server-room-1 weight 30.00
        item server-room-2 weight 30.00
}

pool data {
        id -10
        alg straw
        hash 0
        item dc-1 weight 60.00
        item dc-2 weight 60.00
}

4.3 Rule Sets

CRUSH maps support the notion of 'CRUSH rules', which are the rules that determine data placement for a pool. For large clusters, you will likely create many pools where each pool may have its own CRUSH ruleset and rules. The default CRUSH map has a rule for each pool, and one ruleset assigned to each of the default pools, which include:

  • data

  • metadata

  • rbd

Note
Note

In most cases, you will not need to modify the default rules. When you create a new pool, its default ruleset is 0.

A rule takes the following form:

rule rulename {

        ruleset ruleset
        type type
        min_size min-size
        max_size max-size
        step step

}
ruleset

An integer. Classifies a rule as belonging to a set of rules. Activated by setting the ruleset in a pool. This option is required. Default is 0.

type

A string. Describes a rule for either a hard disk (replicated) or a RAID. This option is required. Default is replicated.

min_size

An integer. If a placement group makes fewer replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 1.

max_size

An integer. If a placement group makes more replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 10.

step take bucket

Takes a bucket name, and begins iterating down the tree. This option is required.

step choose firstn num type bucket-type

Selects the number of buckets of the given type. Where N is the number of options available, if num > 0 && < N, choose that many buckets; if num < 0, it means N - num; and, if num == 0, choose N buckets (all available). Follows step take or step choose.

step emit

Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to pick from different trees in the same rule. Follows step choose.

Important
Important

To activate one or more rules with a common ruleset number to a pool, set the ruleset number to the pool.
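
The following is a minimal sketch of a replicated rule, modeled on the default rule and on the example hierarchy from Section 4.2; step chooseleaf is a variant of step choose that descends to leaf OSDs under the chosen buckets (it is also used in the example in Section 5.3):

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take data
        step chooseleaf firstn 0 type host
        step emit
}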

4.4 CRUSH Map Manipulation

This section introduces basic CRUSH Map manipulation, such as editing a CRUSH Map, changing CRUSH Map parameters, and adding, moving, or removing an OSD.

4.4.1 Editing a CRUSH Map

To edit an existing CRUSH map, do the following:

  1. Get a CRUSH Map. To get the CRUSH Map for your cluster, execute the following:

    ceph osd getcrushmap -o compiled-crushmap-filename

    Ceph will output (-o) a compiled CRUSH Map to the filename you specified. Since the CRUSH Map is in a compiled form, you must decompile it first before you can edit it.

  2. Decompile a CRUSH Map. To decompile a CRUSH Map, execute the following:

    crushtool -d compiled-crushmap-filename \
     -o decompiled-crushmap-filename

    Ceph will decompile (-d) the compiled CRUSH map and output (-o) it to the filename you specified.

  3. Edit at least one of the Devices, Buckets, and Rules parameters.

  4. Compile a CRUSH Map. To compile a CRUSH Map, execute the following:

    crushtool -c decompiled-crush-map-filename \
     -o compiled-crush-map-filename

    Ceph will store a compiled CRUSH map to the filename you specified.

  5. Set a CRUSH Map. To set the CRUSH Map for your cluster, execute the following:

    ceph osd setcrushmap -i compiled-crushmap-filename

    Ceph will input the compiled CRUSH Map of the filename you specified as the CRUSH Map for the cluster.

4.4.2 Add/Move an OSD

To add or move an OSD in the CRUSH map of a running cluster, execute the following:

ceph osd crush set id name weight pool=pool-name
bucket-type=bucket-name ...
id

An integer. The numeric ID of the OSD. This option is required.

name

A string. The full name of the OSD. This option is required.

weight

A double. The CRUSH weight for the OSD. This option is required.

pool

A key/value pair. By default, the CRUSH hierarchy contains the pool default as its root. This option is required.

bucket-type

Key/value pairs. You may specify the OSD’s location in the CRUSH hierarchy.

The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location.

ceph osd crush set 0 osd.0 1.0 pool=data datacenter=dc1 room=room1 \
row=foo rack=bar host=foo-bar-1

4.4.3 Adjust an OSD’s CRUSH Weight

To adjust an OSD’s crush weight in the CRUSH map of a running cluster, execute the following:

ceph osd crush reweight name weight
name

A string. The full name of the OSD. This option is required.

weight

A double. The CRUSH weight for the OSD. This option is required.

4.4.4 Remove an OSD

To remove an OSD from the CRUSH map of a running cluster, execute the following:

ceph osd crush remove name
name

A string. The full name of the OSD. This option is required.

4.4.5 Move a Bucket

To move a bucket to a different location or position in the CRUSH map hierarchy, execute the following:

ceph osd crush move bucket-name bucket-type=bucket-name, ...
bucket-name

A string. The name of the bucket to move/reposition. This option is required.

bucket-type

Key/value pairs. You may specify the bucket’s location in the CRUSH hierarchy.
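
For example, using the bucket names from the example in Section 4.2, the following command moves the rack-1 bucket under the server-room-1 room:

ceph osd crush move rack-1 room=server-room-1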

5 Cache Tiering

In the context of Ceph, tiered storage is implemented as a cache tier.

A cache tier provides Ceph Clients with better I/O performance for a subset of the data stored in a backing storage tier. Cache tiering involves creating a pool of relatively fast/expensive storage devices (e.g., solid state drives) configured to act as a cache tier, and a backing pool of either erasure-coded or relatively slower/cheaper devices configured to act as an economical storage tier. The Ceph objecter handles where to place the objects and the tiering agent determines when to flush objects from the cache to the backing storage tier. So the cache tier and the backing storage tier are completely transparent to Ceph clients.

5.1 Cache Modes

The cache tiering agent handles the migration of data between the cache tier and the backing storage tier. Administrators have the ability to configure how this migration takes place. There are two main scenarios:

Write-back Mode

When administrators configure tiers with write-back mode, Ceph clients write data to the cache tier and receive an ACK from the cache tier. In time, the data written to the cache tier migrates to the storage tier and gets flushed from the cache tier. Conceptually, the cache tier is overlaid in front of the backing storage tier. When a Ceph client needs data that resides in the storage tier, the cache tiering agent migrates the data to the cache tier on read, then it is sent to the Ceph client. Thereafter, the Ceph client can perform I/O using the cache tier, until the data becomes inactive. This is ideal for mutable data such as photo or video editing, transactional data, etc.

Read-only Mode

When administrators configure tiers with read-only mode, Ceph clients write data to the backing tier. On read, Ceph copies the requested objects from the backing tier to the cache tier. Stale objects get removed from the cache tier based on the defined policy. This approach is ideal for immutable data such as presenting pictures or videos on a social network, DNA data, X-Ray imaging, etc., because reading data from a cache pool that might contain out-of-date data provides weak consistency. Do not use read-only mode for mutable data.
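
The cache mode is set per cache pool. As a sketch, assuming a cache pool named hot-storage as created in Section 5.3, one of the following commands would select the mode:

ceph osd tier cache-mode hot-storage writeback
ceph osd tier cache-mode hot-storage readonly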

5.2 Ceph Pools in the Context of Tiered Storage

For general information on pools, see Section 3.3, “Managing Storage Pools”.

5.2.1 The Backing Storage Pool

Either standard storage, which stores several copies of an object in the Ceph storage cluster, or storage with erasure coding, which is a way to store data much more efficiently with a small performance trade-off. For erasure coding, see Chapter 6, Erasure Code.

In the context of tiered storage, this backing storage pool is also referred to as cold-storage.

5.2.2 The Cache Pool

For the cache pool you configure standard storage, but you use high performance drives in dedicated servers with their own ruleset. See http://docs.ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds.

In the context of tiered storage, this cache pool is also referred to as hot-storage.

5.3 Setting Up Tiered Storage

For hot-storage (the cache pool), create OSDs backed by fast disks, then associate a Ceph storage pool with them. The following procedure shows how to do this.

  1. Prepare a host machine with fast drives (SSDs).

  2. Turn the machine into a Ceph node. Install the software and configure the host machine as described in Section 2.1.2, “Preparing Each Ceph Node”. Let's assume that its name is node-4.

  3. To create four OSDs, call the ceph-deploy commands on the administration node as follows, replacing node-4 with the actual node name and device with the actual device name:

    for d in a b c d; do
      ceph-deploy osd prepare node-4:device${d}
      ceph-deploy osd activate node-4:device${d}1
    done

    Alternatively, call ceph-deploy osd create to prepare and activate each OSD in one step.

    This may result in an entry like this in the CRUSH map:

    host node-4 {
            id -5           # do not change unnecessarily
            # weight 0.000
            alg straw
            hash 0  # rjenkins1
            item osd.6 weight 0.000
            item osd.7 weight 0.000
            item osd.8 weight 0.000
            item osd.9 weight 0.000
    }
  4. Edit the CRUSH map to assign the hot-storage pool to the OSDs backed by the fast solid-state disks (SSDs). Define a second hierarchy with a root node for the SSDs (as root ssd). Additionally, change the weights and add a CRUSH rule for the SSDs. For more information on CRUSH maps, see http://docs.ceph.com/docs/master/rados/operations/crush-map/.

    Edit the CRUSH map directly with command line tools such as getcrushmap and crushtool:

    1. Retrieve the current map and save it as c.map:

      ceph osd getcrushmap -o c.map
    2. Decompile c.map and save it as c.txt:

      crushtool -d c.map -o c.txt
    3. Edit c.txt:

      host node-4 {
              id -5           # do not change unnecessarily
              # weight 4.000
              alg straw
              hash 0  # rjenkins1
              item osd.6 weight 1.000
              item osd.7 weight 1.000
              item osd.8 weight 1.000
              item osd.9 weight 1.000
      }
      root ssd {
              id -6
              alg straw
              hash 0
              item node-4 weight 4.00
      }
      rule ssd {
              ruleset 4
              type replicated
              min_size 0
              max_size 4
              step take ssd
              step chooseleaf firstn 0 type host
              step emit
      }
      
    4. Compile the edited c.txt file and save it as ssd.map:

      crushtool -c c.txt -o ssd.map
    5. Finally install ssd.map as the new CRUSH map:

      ceph osd setcrushmap -i ssd.map
  5. Create the hot-storage pool (to be used for cache tiering) and the cold-storage backing pool:

    ceph osd pool create hot-storage 100 100 replicated replicated_ruleset
    ceph osd pool create cold-storage 100 100 replicated replicated_ruleset
  6. Set the above defined hot-storage pool to use the SSD rule:

    ceph osd pool set hot-storage crush_ruleset 4

    For now, the cold-storage pool will stick with default replicated_ruleset.

  7. Set up the cache tier by associating a backing storage pool with a cache pool, in this case cold-storage (the storage pool) with hot-storage (the cache pool):

    ceph osd tier add cold-storage hot-storage
  8. To set the cache mode to writeback, execute the following:

    ceph osd tier cache-mode hot-storage writeback

    For more information about cache modes, see Section 5.1, “Cache Modes”.

    Writeback cache tiers overlay the backing storage tier, so they require one additional step: you must direct all client traffic from the storage pool to the cache pool. To do so, execute the following:

    ceph osd tier set-overlay cold-storage hot-storage

For detailed information about configuring a cache tier, see http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#configuring-a-cache-tier.

5.4 Erasure Coded Pool And Cache Tiering

Erasure coded pools require more resources than replicated pools and lack some functionality such as partial writes. To overcome these limitations, it is recommended to set up a cache tier in front of the erasure coded pool.

For instance, if the pool hot-storage is made of fast storage:

> ceph osd tier add ecpool hot-storage
> ceph osd tier cache-mode hot-storage writeback
> ceph osd tier set-overlay ecpool hot-storage

This places the hot-storage pool as a tier of ecpool in write-back mode, so that every write to and read from ecpool actually uses hot-storage and benefits from its flexibility and speed.

It is not possible to create an RBD image on an erasure coded pool directly, because RBD requires partial writes. It is, however, possible to create an RBD image on an erasure coded pool when a replicated pool is set up as its cache tier:

> rbd --pool ecpool create --size 10 myvolume

For more information about cache tiering, see Chapter 5, Cache Tiering.

6 Erasure Code

A Ceph pool is associated with a type that determines how it sustains the loss of an OSD (that is, a disk, since in most cases there is one OSD per disk). The default choice when creating a pool is replicated, meaning every object is copied to multiple disks. The Erasure Code pool type can be used instead to save space.

For background information on Erasure Code, see https://en.wikipedia.org/wiki/Erasure_code.

Note

You cannot access erasure coded pools with the rbd interface unless you have a cache tier set up. Please refer to Section 5.4, “Erasure Coded Pool And Cache Tiering” for more details.

6.1 Creating a Sample Erasure Coded Pool

The simplest erasure coded pool is equivalent to RAID5 and requires at least three hosts:

> ceph osd pool create ecpool 12 12 erasure
pool 'ecpool' created
> echo ABCDEFGHI | rados --pool ecpool put NYAN -
> rados --pool ecpool get NYAN -
ABCDEFGHI

The two values of 12 in the pool create command stand for the number of placement groups (pg_num) and the number of placement groups for placement (pgp_num).

6.2 Erasure Code Profiles

Some terminology hints:

chunk

When the encoding function is called, it returns chunks of the same size: data chunks, which can be concatenated to reconstruct the original object, and coding chunks, which can be used to rebuild a lost chunk.

k

The number of data chunks, that is, the number of chunks into which the original object is divided. For instance, if k = 2, a 10KB object is divided into two chunks of 5KB each.

m

The number of coding chunks, that is, the number of additional chunks computed by the encoding functions. If there are two coding chunks, two OSDs can be unavailable without losing data.

The default erasure code profile sustains the loss of a single OSD. It is equivalent to a replicated pool of size two but requires 1.5TB instead of 2TB to store 1TB of data. The default profile can be displayed with:

> ceph osd erasure-code-profile get default
directory=.libs
k=2
m=1
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van

Choosing the right profile is important because it cannot be modified after the pool is created: a new pool with a different profile needs to be created and all objects from the previous pool moved to the new.

The most important parameters of the profile are k, m and ruleset-failure-domain, because they define the storage overhead and the data durability. For instance, if the desired architecture must sustain the loss of two racks with a storage overhead of 40%, the following profile can be defined:

> ceph osd erasure-code-profile set myprofile \
   k=3 \
   m=2 \
   ruleset-failure-domain=rack
> ceph osd pool create ecpool 12 12 erasure myprofile
> echo ABCDEFGHI | rados --pool ecpool put NYAN -
> rados --pool ecpool get NYAN -
ABCDEFGHI

The NYAN object will be divided into three data chunks (k=3) and two additional coding chunks will be created (m=2). The value of m defines how many OSDs can be lost simultaneously without losing any data. The ruleset-failure-domain=rack setting creates a CRUSH ruleset that ensures no two chunks are stored in the same rack.

For more information about the erasure code profiles, see http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile.

7 Ceph as a Backend for QEMU KVM Instance

The most frequent Ceph use case involves providing block device images to virtual machines. For example, a user may create a 'golden' image with an OS and any relevant software in an ideal configuration. Then, the user takes a snapshot of the image. Finally, the user clones the snapshot (usually many times, see Chapter 8, Snapshots for details). The ability to make copy-on-write clones of a snapshot means that Ceph can provision block device images to virtual machines quickly, because the client does not have to download an entire image each time it spins up a new virtual machine.

Ceph block devices can integrate with the QEMU virtual machines. For more information on QEMU KVM, see https://www.suse.com/documentation/sles-12/book_virt/data/part_virt_qemu.html.

7.1 Usage

The QEMU command line expects you to specify the pool name and image name. You may also specify a snapshot name.

qemu-img command options \
rbd:pool-name/image-name@snapshot-name:option1=value1:option2=value2...

For example, specifying the id and conf options might look like the following:

qemu-img command options \
rbd:pool_name/image_name:id=glance:conf=/etc/ceph/ceph.conf
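
The snapshot name is optional. To reference a specific snapshot of an image (assuming, for illustration, a snapshot called snapshot1 of image1 in pool1), the specification might look like the following:

qemu-img command options \
rbd:pool1/image1@snapshot1:conf=/etc/ceph/ceph.conf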

7.2 Creating Images with QEMU

You can create a block device image from QEMU. You must specify rbd, the pool name, and the name of the image you wish to create. You must also specify the size of the image.

qemu-img create -f raw rbd:pool-name/image-name size

For example:

qemu-img create -f raw rbd:pool1/image1 10G
Formatting 'rbd:pool1/image1', fmt=raw size=10737418240 nocow=off cluster_size=0
Important

The raw data format is really the only sensible format option to use with RBD. Technically, you could use other QEMU-supported formats (such as qcow2 or vmdk), but doing so would add overhead and would also render the volume unsafe for virtual machine live migration when caching is enabled.

7.3 Resizing Images with QEMU

You can resize a block device image from QEMU. You must specify rbd, the pool name, and the name of the image you wish to resize. You must also specify the size of the image.

qemu-img resize rbd:pool-name/image-name size

For example:

qemu-img resize rbd:pool1/image1 9G
Image resized.

7.4 Retrieving Image Info with QEMU

You can retrieve block device image information from QEMU. You must specify rbd, the pool name, and the name of the image.

qemu-img info rbd:pool-name/image-name

For example:

qemu-img info rbd:pool1/image1
image: rbd:pool1/image1
file format: raw
virtual size: 9.0G (9663676416 bytes)
disk size: unavailable
cluster_size: 4194304

7.5 Running QEMU with RBD

QEMU can access an image as a virtual block device directly via librbd. This avoids an additional context switch, and can take advantage of RBD caching.

You can use qemu-img to convert existing virtual machine images to Ceph block device images. For example, if you have a qcow2 image, you could run:

qemu-img convert -f qcow2 -O raw sles12.qcow2 rbd:pool1/sles12

To run a virtual machine booting from that image, you could run:

qemu -m 1024 -drive format=raw,file=rbd:pool1/sles12

RBD caching (http://ceph.com/docs/master/rbd/rbd-config-ref/#cache-settings) can significantly improve performance. QEMU’s cache options control librbd caching:

qemu -m 1024 -drive format=raw,file=rbd:pool1/sles12,cache=writeback

7.6 Enabling Discard/TRIM

Ceph block devices support the discard operation. This means that a guest can send TRIM requests to let a Ceph block device reclaim unused space. This can be enabled in the guest by mounting ext4 or XFS with the discard option.
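
For example, inside the guest, a filesystem on the Ceph-backed disk (assuming it appears as /dev/sda1; the actual device name depends on your setup) could be mounted with discard enabled as follows:

sudo mount -o discard /dev/sda1 /mnt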

For this to be available to the guest, it must be explicitly enabled for the block device. To do this, you must specify a discard_granularity associated with the drive:

qemu -m 1024 -drive format=raw,file=rbd:pool1/sles12,id=drive1,if=none \
-device driver=ide-hd,drive=drive1,discard_granularity=512
Note

The above example uses the IDE driver. The virtio driver does not support discard.

If using libvirt, edit your libvirt domain’s configuration file using virsh edit to include the xmlns:qemu value. Then, add a qemu:commandline block as a child of that domain. The following example shows how to set two devices with qemu id= to different discard_granularity values.

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
 <qemu:commandline>
  <qemu:arg value='-set'/>
  <qemu:arg value='block.scsi0-0-0.discard_granularity=4096'/>
  <qemu:arg value='-set'/>
  <qemu:arg value='block.scsi0-0-1.discard_granularity=65536'/>
 </qemu:commandline>
</domain>

7.7 QEMU Cache Options

QEMU’s cache options correspond to the following Ceph RBD Cache settings.

Writeback:

rbd_cache = true

Writethrough:

rbd_cache = true
rbd_cache_max_dirty = 0

None:

rbd_cache = false

QEMU’s cache settings override Ceph’s default settings (i.e., settings that are not explicitly set in the Ceph configuration file). If you explicitly set RBD Cache (http://ceph.com/docs/master/rbd/rbd-config-ref/#cache-settings) settings in your Ceph configuration file, your Ceph settings override the QEMU cache settings. If you set cache settings on the QEMU command line, the QEMU command line settings override the Ceph configuration file settings.
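
For example, to enforce write-through behavior for all clients regardless of the QEMU command line, you could set the corresponding options in the [client] section of the Ceph configuration file (a minimal sketch, adjust to your needs):

[client]
rbd cache = true
rbd cache max dirty = 0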

8 Snapshots

A snapshot is a read-only copy of the state of an image at a particular point in time. One of the advanced features of Ceph block devices is that you can create snapshots of the images to retain a history of an image’s state. Ceph also supports snapshot layering, which allows you to clone images (e.g., a VM image) quickly and easily. Ceph supports block device snapshots using the rbd command and many higher level interfaces, including QEMU, libvirt, OpenStack and CloudStack.

Note

Stop input/output operations before snapshotting an image. If the image contains a filesystem, the filesystem must be in a consistent state before snapshotting.

8.1 Cephx Notes

When cephx is enabled (see http://ceph.com/docs/master/rados/configuration/auth-config-ref/ for more information), you must specify a user name or ID and a path to the keyring containing the corresponding key for the user. See User Management (http://ceph.com/docs/master/rados/operations/user-management/) for more details. You may also add the CEPH_ARGS environment variable to avoid re-entry of the following parameters.

rbd --id user-ID --keyring=/path/to/secret commands
rbd --name username --keyring=/path/to/secret commands

For example:

rbd --id admin --keyring=/etc/ceph/ceph.keyring commands
rbd --name client.admin --keyring=/etc/ceph/ceph.keyring commands
Tip

Add the user and secret to the CEPH_ARGS environment variable so that you do not need to enter them each time.
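
For example, assuming the admin user and keyring path from above:

export CEPH_ARGS="--id admin --keyring=/etc/ceph/ceph.keyring"
rbd snap ls rbd/image1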

8.2 Snapshot Basics

The following procedures demonstrate how to create, list, and remove snapshots using the rbd command on the command line.

8.2.1 Create Snapshot

To create a snapshot with rbd, specify the snap create option, the pool name and the image name.

rbd --pool pool-name snap create --snap snap-name image-name
rbd snap create pool-name/image-name@snap-name

For example:

rbd --pool rbd snap create --snap snapshot1 image1
rbd snap create rbd/image1@snapshot1

8.2.2 List Snapshots

To list snapshots of an image, specify the pool name and the image name.

rbd --pool pool-name snap ls image-name
rbd snap ls pool-name/image-name

For example:

rbd --pool rbd snap ls image1
rbd snap ls rbd/image1

8.2.3 Rollback Snapshot

To rollback to a snapshot with rbd, specify the snap rollback option, the pool name, the image name and the snap name.

rbd --pool pool-name snap rollback --snap snap-name image-name
rbd snap rollback pool-name/image-name@snap-name

For example:

rbd --pool pool1 snap rollback --snap snapshot1 image1
rbd snap rollback pool1/image1@snapshot1
Note

Rolling back an image to a snapshot means overwriting the current version of the image with data from a snapshot. The time it takes to execute a rollback increases with the size of the image. It is faster to clone from a snapshot than to rollback an image to a snapshot, and it is the preferred method of returning to a pre-existing state.

8.2.4 Delete a Snapshot

To delete a snapshot with rbd, specify the snap rm option, the pool name, the image name and the snapshot name.

rbd --pool pool-name snap rm --snap snap-name image-name
rbd snap rm pool-name/image-name@snap-name

For example:

rbd --pool pool1 snap rm --snap snapshot1 image1
rbd snap rm pool1/image1@snapshot1
Note

Ceph OSDs delete data asynchronously, so deleting a snapshot does not free up the disk space immediately.

8.2.5 Purge Snapshots

To delete all snapshots for an image with rbd, specify the snap purge option and the image name.

rbd --pool pool-name snap purge image-name
rbd snap purge pool-name/image-name

For example:

rbd --pool pool1 snap purge image1
rbd snap purge pool1/image1

8.3 Layering

Ceph supports the ability to create many copy-on-write (COW) clones of a block device snapshot. Snapshot layering enables Ceph block device clients to create images very quickly. For example, you might create a block device image with a Linux VM written to it; then, snapshot the image, protect the snapshot, and create as many copy-on-write clones as you like. A snapshot is read-only, so cloning a snapshot simplifies semantics, making it possible to create clones rapidly.

Note

The terms “parent” and “child” mentioned in the command line examples below mean a Ceph block device snapshot (parent), and the corresponding image cloned from the snapshot (child).

Each cloned image (child) stores a reference to its parent image, which enables the cloned image to open the parent snapshot and read it.

A COW clone of a snapshot behaves exactly like any other Ceph block device image. You can read from, write to, clone, and resize cloned images. There are no special restrictions with cloned images. However, the copy-on-write clone of a snapshot refers to the snapshot, so you must protect the snapshot before you clone it.

Note

Ceph only supports cloning for format 2 images (that is, images created with rbd create --image-format 2). Cloning is not yet supported by the kernel rbd module, so you must use QEMU/KVM or librbd directly to access clones in the current release.

8.3.1 Getting Started with Layering

Ceph block device layering is a simple process: you need an existing image, you create a snapshot of the image, and you protect the snapshot. Once you have performed these steps, you can begin cloning the snapshot.

The cloned image has a reference to the parent snapshot, and includes the pool ID, image ID and snapshot ID. The inclusion of the pool ID means that you may clone snapshots from one pool to images in another pool.

  • Image Template: A common use case for block device layering is to create a master image and a snapshot that serves as a template for clones. For example, a user may create an image for a Linux distribution (e.g., SUSE Linux Enterprise Server), and create a snapshot for it. Periodically, the user may update the image and create a new snapshot (e.g., zypper ref && zypper patch followed by rbd snap create). As the image matures, the user can clone any one of the snapshots.

  • Extended Template: A more advanced use case includes extending a template image that provides more information than a base image. For example, a user may clone an image (e.g., a VM template) and install other software (e.g., a database, a content management system, an analytics system, etc.) and then snapshot the extended image, which itself may be updated just like the base image.

  • Template Pool: One way to use block device layering is to create a pool that contains master images that act as templates, and snapshots of those templates. You may then extend read-only privileges to users so that they may clone the snapshots without the ability to write or execute within the pool.

  • Image Migration/Recovery: One way to use block device layering is to migrate or recover data from one pool into another pool.

8.3.2 Protecting a Snapshot

Clones access the parent snapshots. All clones would break if a user inadvertently deleted the parent snapshot. To prevent data loss, you need to protect the snapshot before you can clone it.

rbd --pool pool-name snap protect \
 --image image-name --snap snapshot-name
rbd snap protect pool-name/image-name@snapshot-name

For example:

rbd --pool pool1 snap protect --image image1 --snap snapshot1
rbd snap protect pool1/image1@snapshot1
Note

You cannot delete a protected snapshot.

8.3.3 Cloning a Snapshot

To clone a snapshot, you need to specify the parent pool, image and snapshot, the child pool and image name. You must protect the snapshot before you can clone it.

rbd --pool pool-name --image parent-image \
 --snap snap-name --dest-pool pool-name \
 --dest child-image
rbd clone
pool-name/parent-image@snap-name \
 pool-name/child-image-name

For example:

rbd clone pool1/image1@snapshot1 pool1/image2
Note

You may clone a snapshot from one pool to an image in another pool. For example, you may maintain read-only images and snapshots as templates in one pool, and writeable clones in another pool.

8.3.4 Unprotecting a Snapshot

Before you can delete a snapshot, you must unprotect it. Additionally, you may not delete snapshots that have references from clones. You must flatten each clone of a snapshot before you can delete the snapshot.

rbd --pool pool-name snap unprotect --image image-name \
 --snap snapshot-name
rbd snap unprotect pool-name/image-name@snapshot-name

For example:

rbd --pool pool1 snap unprotect --image image1 --snap snapshot1
rbd snap unprotect pool1/image1@snapshot1

8.3.5 Listing Children of a Snapshot

To list the children of a snapshot, execute the following:

rbd --pool pool-name children --image image-name --snap snap-name
rbd children pool-name/image-name@snapshot-name

For example:

rbd --pool pool1 children --image image1 --snap snapshot1
rbd children pool1/image1@snapshot1

8.3.6 Flattening a Cloned Image

Cloned images retain a reference to the parent snapshot. When you remove the reference from the child clone to the parent snapshot, you effectively “flatten” the image by copying the information from the snapshot to the clone. The time it takes to flatten a clone increases with the size of the snapshot. To delete a snapshot, you must flatten the child images first.

rbd --pool pool-name flatten --image image-name
rbd flatten pool-name/image-name

For example:

rbd --pool pool1 flatten --image image1
rbd flatten pool1/image1
Note

Since a flattened image contains all the information from the snapshot, a flattened image will take up more storage space than a layered clone.

9 Ceph Object Gateway

Ceph Object Gateway is an object storage interface built on top of librados to provide applications with a RESTful gateway to Ceph Storage Clusters. Ceph Object Storage supports two interfaces:

  • S3-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.

  • Swift-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.

Ceph Object Storage uses the Ceph Object Gateway daemon (radosgw), which is a FastCGI module for interacting with a Ceph Storage Cluster. Since it provides interfaces compatible with OpenStack Swift and Amazon S3, the Ceph Object Gateway has its own user management. Ceph Object Gateway can store data in the same Ceph Storage Cluster used to store data from Ceph Filesystem clients or Ceph Block Device clients. The S3 and Swift APIs share a common namespace, so you may write data with one API and retrieve it with the other.

9.1 Manual Installation and Service Activation

The Ceph Object Gateway daemon runs on Apache and FastCGI. An installed and configured Ceph cluster is a prerequisite.

To run a Ceph Object Storage service, install Apache and FastCGI. Then, you must install the Ceph Object Gateway daemon. The Ceph Object Gateway supports 100-continue, but you must install Ceph builds of Apache and FastCGI for 100-continue support. If you plan to run a Ceph Object Storage service with a federated architecture (multiple regions and zones), you must also install the synchronization agent.

  1. Install Apache, FastCGI, and RADOS Gateway. The following command installs all these components:

    sudo zypper ref && sudo zypper in ceph-radosgw
  2. Enable the URL rewrite modules for Apache and FastCGI:

    sudo a2enmod rewrite fastcgi

    In fastcgi.conf, comment out the lines Order allow,deny and Deny from all, and add the line Require all granted. This is needed because Apache version 2.4 or greater uses the Require directive instead of the Order and Deny directives.

  3. Create a configuration file for a new Apache virtual host. Then add a line for the ServerName directive and provide the fully qualified domain name of the host where you will install the RADOS Gateway.

    cd /etc/apache2/vhosts.d
    sudo cp vhost.template radosgw.conf
    sudo vim radosgw.conf
  4. Restart Apache so that the foregoing changes take effect.

    sudo systemctl restart apache2.service

If you enable SSL, set the port to your SSL port (usually 443) and in your rgw.conf file include the following:

SSLEngine On
SSLCertificateFile /etc/apache2/ssl.crt/server.crt
SSLCertificateKeyFile /etc/apache2/ssl.key/server.key
SetEnv SERVER_PORT_SECURE 443

For background information about enabling SSL, see the SLES Admin Guide: https://www.suse.com/documentation/sles-12/book_sle_admin/data/sec_apache2_ssl.html.

9.2 Configuring the RADOS Gateway

Several steps are required to configure a RADOS Gateway.

9.2.1 Basic Configuration

Configuring a Ceph Object Gateway requires a running Ceph Storage Cluster, and an Apache web server with the FastCGI module. The Ceph Object Gateway is a client of the Ceph Storage Cluster. As a Ceph Storage Cluster client, it requires:

  • A host name for the gateway instance, for example gateway.

  • A storage cluster user name with appropriate permissions in a keyring.

  • Pools to store its data.

  • A data directory for the gateway instance.

  • An instance entry in the Ceph Configuration file.

  • A configuration file for the web server to interact with FastCGI.

Each instance must have a user name and key to communicate with a Ceph Storage Cluster. In the following steps, we use an admin node to create a keyring. Then, we create a client user name and key. Next, we add the key to the Ceph Storage Cluster. Finally, we distribute the keyring to the node containing the gateway instance.

  1. Create a keyring for the gateway:

    sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.radosgw.keyring
    sudo chmod +r /etc/ceph/ceph.client.radosgw.keyring
  2. Generate a Ceph Object Gateway user name and key for each instance. As an example, we will use the name gateway after client.radosgw:

    sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring \
      -n client.radosgw.gateway --gen-key
  3. Add capabilities to the key:

    sudo ceph-authtool -n client.radosgw.gateway --cap osd 'allow rwx' \
      --cap mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring
  4. Once you have created a keyring and key to enable the Ceph Object Gateway with access to the Ceph Storage Cluster, add the key to your Ceph Storage Cluster. For example:

    sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.radosgw.gateway \
      -i /etc/ceph/ceph.client.radosgw.keyring
  5. Distribute the keyring to the node with the gateway instance:

    sudo scp /etc/ceph/ceph.client.radosgw.keyring  ceph@hostname:/home/ceph
    ssh hostname
    sudo mv ceph.client.radosgw.keyring /etc/ceph/ceph.client.radosgw.keyring

9.2.2 Create Pools

Ceph Object Gateways require Ceph Storage Cluster pools to store specific gateway data. If the user you created has proper permissions, the gateway will create the pools automatically. However, ensure that you have set an appropriate default number of placement groups per pool in the Ceph configuration file.

When configuring a gateway with the default region and zone, the naming convention for pools typically omits region and zone naming, but you can use any naming convention you prefer. For example:

.rgw
.rgw.root
.rgw.control
.rgw.gc
.rgw.buckets
.rgw.buckets.index
.log
.intent-log
.usage
.users
.users.email
.users.swift
.users.uid
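
If you prefer to create the pools manually, or the gateway user lacks the permissions to create them, you can create each pool with ceph osd pool create. The placement group count of 64 below is only an illustrative value; choose it according to your cluster size. For example:

ceph osd pool create .rgw 64 64
ceph osd pool create .rgw.root 64 64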

9.2.3 Adding Gateway Configuration to Ceph

Add the Ceph Object Gateway configuration to the Ceph Configuration file. The Ceph Object Gateway configuration requires you to identify the Ceph Object Gateway instance. Then, specify the host name where you installed the Ceph Object Gateway daemon, a keyring (for use with cephx), the socket path for FastCGI and a log file. For example:

[client.radosgw.instance-name]
host = hostname
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.instance-name.fastcgi.sock
log file = /var/log/radosgw/client.radosgw.instance-name.log

The [client.radosgw.*] portion of the gateway instance identifies this portion of the Ceph configuration file as configuring a Ceph Storage Cluster client where the client type is a Ceph Object Gateway (i.e., radosgw). The instance name follows. For example:

[client.radosgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
log file = /var/log/radosgw/client.radosgw.gateway.log
Note

The host must be your machine hostname, not the FQDN. The name you use for the FastCGI socket is not the same as the one used for the object gateway, which is ceph-client.radosgw.instance-name.asok by default. Use the same name in your S3 FastCGI file, too.

Then turn off print continue. If you have it set to true, you may encounter problems with PUT operations:

rgw print continue = false

To use a Ceph Object Gateway with subdomain S3 calls (e.g., http://bucketname.hostname), you must add the Ceph Object Gateway DNS name under the [client.radosgw.gateway] section of the Ceph configuration file:

[client.radosgw.gateway]
...
rgw dns name = hostname

You should also consider installing a DNS server such as Dnsmasq on your client machine(s) when using the http://bucketname.hostname syntax. The dnsmasq.conf file should include the following settings:

address=/hostname/host-ip-address
listen-address=client-loopback-ip

Then, add the client-loopback-ip IP address as the first DNS nameserver on the client machine(s).
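
For example, assuming the gateway host is gateway.example.com with the IP address 192.168.100.1, and the client's loopback address 127.0.0.1 is used as the nameserver (hypothetical values), dnsmasq.conf would contain:

address=/gateway.example.com/192.168.100.1
listen-address=127.0.0.1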

9.2.4 Redeploy Ceph Configuration

Use ceph-deploy to push a new copy of the configuration to the hosts in your cluster:

ceph-deploy config push host-name [host-name]...
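
For example, with hypothetical host names ceph-node1, ceph-node2 and ceph-node3:

ceph-deploy config push ceph-node1 ceph-node2 ceph-node3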

9.2.5 Add a Ceph Object Gateway Script

Add an s3gw.fcgi file (use the same name referenced in the first line of rgw.conf). Save the file to the /srv/www/fastcgi directory. Assuming a cluster named ceph (the default) and the user created in the previous steps, the contents of the file should include:

#!/bin/sh
exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway

Apply execute permissions to s3gw.fcgi:

sudo chmod +x s3gw.fcgi

Also change the ownership to the Apache user:

sudo chown wwwrun:www s3gw.fcgi

9.2.6 Create Data Directory

Deployment scripts may not create the default Ceph Object Gateway data directory. Create data directories for each instance of a radosgw daemon if not already done. The host variables in the Ceph configuration file determine which host runs each instance of a radosgw daemon. The typical form specifies the radosgw daemon, the cluster name and the daemon ID.

sudo mkdir -p /var/lib/ceph/radosgw/cluster-id

Using the exemplary ceph.conf settings above, you would execute the following:

sudo mkdir -p /var/lib/ceph/radosgw/ceph-radosgw.gateway

9.2.7 Create a Gateway Configuration

On the host where you installed the Ceph Object Gateway, create an rgw.conf file. Place the file in the /etc/apache2/conf.d directory. Configure Apache to allow encoded slashes, provide paths for log files, and turn off server signatures. Here is an example gateway configuration:

FastCgiExternalServer /srv/www/fcgi-bin/s3gw.fcgi -socket /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
<VirtualHost *:80>
ServerName fqdn
ServerAdmin email.address
DocumentRoot /srv/www/fcgi-bin
RewriteEngine On
RewriteRule  ^/(.*) /s3gw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
<IfModule mod_fastcgi.c>
<Directory /srv/www/fcgi-bin>
Options +ExecCGI
AllowOverride All
SetHandler fastcgi-script
Require all granted
AuthBasicAuthoritative Off 
</Directory>
</IfModule>
AllowEncodedSlashes On
ErrorLog /var/log/httpd/error.log
CustomLog /var/log/httpd/access.log combined
ServerSignature Off
</VirtualHost>

Replace the fqdn entry with the fully-qualified domain name of the server. Replace the email.address entry with the email address for the server administrator. Add a ServerAlias if you wish to use S3-style subdomains. Save the configuration to a file (for example, rgw.conf).

Finally, if you enabled SSL, make sure that you set the port to your SSL port (usually 443) and your configuration file includes the following:

SSLEngine on
SSLCertificateFile /etc/apache2/ssl/apache.crt
SSLCertificateKeyFile /etc/apache2/ssl/apache.key
SetEnv SERVER_PORT_SECURE 443

Enable the fastcgi, rewrite, and ssl modules:

sudo a2enmod fastcgi
sudo a2enmod rewrite
sudo a2enmod ssl

Change the ownership for /var/log/apache2 and /var/run/ceph to ensure that Apache has permissions to create a socket or log file:

sudo chown wwwrun:www /path/to/file

9.2.8 Restart Services and Start the Gateway

To ensure that all components have reloaded their configurations, we recommend restarting your Ceph Storage Cluster and Apache services. Then, start up the radosgw service.

For the Ceph Storage Cluster, see Chapter 3, Operating a Ceph Cluster.

sudo systemctl restart apache2
sudo systemctl restart ceph-radosgw

Once the service is up and running, you can make an anonymous GET request to see if the gateway returns a response. A simple HTTP request to the domain name should return the following:

<ListAllMyBucketsResult>
      <Owner>
              <ID>anonymous</ID>
              <DisplayName/>
      </Owner>
      <Buckets/>
</ListAllMyBucketsResult>

9.3 Managing RADOS Gateway with ceph-deploy

The ceph-deploy script includes the rgw component that helps you manage the RADOS Gateway activation and operation.

Before running ceph-deploy rgw, you need to have the Ceph cluster installed (see Chapter 2, Installation of Basic Ceph Cluster for more information).

  1. Prepare node(s). You can specify several pairs of short_hostname:gateway_name to install RADOS Gateway on a required number of nodes.

    ceph-deploy --overwrite-conf rgw prepare \
     short_hostname:gateway_name ...

    For example:

    ceph-deploy --overwrite-conf rgw prepare ceph-node1:gateway1
    [ceph_deploy.cli][INFO  ] Invoked (1.5.19): /usr/bin/ceph-deploy \
      --overwrite-conf rgw prepare ceph-node1:gateway1
    [ceph-node1][INFO  ] Running command: sudo ceph -f json auth list
    [ceph-node1][INFO  ] Running command: sudo ceph -f json osd lspools
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .users.email 3
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .intent-log 7
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .usage 8
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .users.uid 9
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .rgw.control 10
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .users 11
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .log 12
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .rgw.gc 13
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .users.swift 14
    [ceph-node1][INFO  ] Running command: sudo ceph osd pool create .rgw 15
    [ceph-node1][INFO  ] Running command: sudo zypper --non-interactive \
      --gpg-auto-import-keys refresh
    [ceph-node1][INFO  ] Running command: sudo zypper --non-interactive --quiet \
      install ceph-radosgw
    [...]
    [ceph_deploy.rgw][INFO  ] Writing:/etc/apache2/conf.d/rgw_80.conf
    [ceph_deploy.rgw][INFO  ] Writing:/srv/www/radosgw/s3gw_80.fcgi
  2. Activate node(s). Specify the same pairs of short_hostname:gateway_name that you used when preparing the nodes.

    ceph-deploy --overwrite-conf rgw activate \
     short_hostname:gateway_name ...

    For example:

    ceph-deploy --overwrite-conf rgw activate ceph-node1:gateway1
    [ceph_deploy.cli][INFO  ] Invoked (1.5.19): /usr/bin/ceph-deploy \
      --overwrite-conf rgw activate ceph-node1:gateway1
    [ceph-node1][INFO  ] Running command: sudo systemctl start apache2
    [ceph-node1][INFO  ] Running command: sudo systemctl enable apache2
    [ceph-node1][INFO  ] Running command: sudo systemctl start ceph-radosgw@gateway1
    [ceph-node1][INFO  ] Running command: sudo systemctl status ceph-radosgw@gateway1 \
      --output json
    [ceph-node1][INFO  ] Running command: sudo systemctl enable ceph-radosgw@gateway1
    Tip

    To combine the prepare and activate steps into one, you can use the following command:

    ceph-deploy --overwrite-conf rgw create \
     short_hostname:gateway_name ...

You now have a working RADOS Gateway on the specified nodes. The next step is to give a client access to it.

  1. Create a new user.

    sudo radosgw-admin user create \
      --uid=example_user \
      --display-name="Example User" \
      --email=penguin@example.com
  2. Generate a secret key for the user.

    sudo radosgw-admin key create \
      --gen-secret \
      --subuser=example_user:swift \
      --key-type=swift
  3. Both commands will output JSON-formatted data showing the user state. Notice the following lines, and remember the secret_key value:

    "swift_keys": [
      { "user": "example_user:swift",
        "secret_key": "r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h"}],
  4. On a client node (any host, typically outside the Ceph cluster), install the swift client that understands the RADOS Gateway API.

    sudo zypper install python-swiftclient
  5. After the swift client is installed, you can run

    swift -v -A http://radosgw_host/auth/v1.0 \
      -U username:swift \
      --key='secret_key' \
      stat

    For example:

    swift -v -A http://ceph_node1.example.com/auth/v1.0 \
      -U example_user:swift \
      --key='r5wWIxjOCeEO7DixD1FjTLmNYIViaC6JVhi3013h' \
      stat

9.3.1 Removing RADOS Gateway from a Node

To remove a RADOS Gateway installation from the node where it was previously installed, run:

ceph-deploy --overwrite-conf rgw delete  \
  short_hostname:gatewayname ...

For example:

ceph-deploy --overwrite-conf rgw delete ceph-node1:gateway1
[ceph_deploy.cli][INFO] Invoked (1.5.19): /usr/bin/ceph-deploy \
  --overwrite-conf rgw delete ceph-node1:gateway1
[...]
[ceph-node1][INFO] Running command: sudo systemctl stop ceph-radosgw@gateway1
[ceph-node1][INFO] Running command: sudo systemctl disable ceph-radosgw@gateway1
[ceph-node1][INFO] Running command: sudo systemctl stop apache2
[ceph-node1][INFO] Running command: sudo systemctl disable apache2
[ceph-node1][INFO] Running command: sudo ceph -f json auth list
[ceph-node1][INFO] Running command: sudo ceph auth del client.radosgw.gateway1
Tip

You need a copy of the local ceph.conf file in your current working directory. If you do not have a copy of it, copy it from your cluster.

9.3.2 Listing RADOS Gateway Installations

To list the RADOS Gateway installations in the Ceph cluster, run:

ceph-deploy rgw list

10 Best Practice

This chapter introduces a list of selected issues that you may encounter when managing a Ceph environment. Each topic comes with a recommended solution that helps you understand or fix the problem. The topics are sorted into relevant categories.

10.1 Monitoring

10.1.1 Checking if OSD Daemons are Running on a Node

To check the status of OSD services on a specific node, log in to the node, and run the following:

sudo systemctl status ceph-osd*
ceph-osd@0.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled)
   Active: active (running) since Fri 2015-02-20 11:13:18 CET; 2 days ago
 Main PID: 1822 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
           └─1822 /usr/bin/ceph-osd -f --cluster ceph --id 0

For more information, see Section 3.1, “Operating Ceph Services”.

10.1.2 Checking if Monitor Daemons are Running on a Node

To check the status of monitor services on a specific node, log in to the node, and run the following:

sudo systemctl status ceph-mon*
ceph-mon@doc-ceph1.service - Ceph cluster monitor daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled)
   Active: active (running) since Wed 2015-02-18 16:57:17 CET; 4 days ago
 Main PID: 1203 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@doc-ceph1.service
           └─1203 /usr/bin/ceph-mon -f --cluster ceph --id doc-ceph1

For more information, see Section 3.1, “Operating Ceph Services”.

10.2 Disk Management

10.2.1 Adding Disks

Important
  • This can be done on a live cluster without downtime.

  • This will cause increased replication traffic between servers.

  • Adding several disks in quick succession, without waiting for the previous operation to complete replication, can reduce the overall cluster rebuild time.

To add a disk (/dev/sdd in our example) to a Ceph cluster, follow these steps:

  1. Create a partition sdd1 on the disk:

    sudo parted /dev/sdd mkpart primary 0.0 -1s
  2. Format the partition with either the XFS (strongly recommended) or Btrfs filesystem:

    sudo mkfs.xfs -f /dev/sdd1
  3. Find out the UUID (Universally Unique Identifier) of the disk:

    ls -l /dev/disk/by-uuid | grep sdd1
    [...] 04bb24f1-d631-47ff-a2ee-22d94ad4f80c -> ../../sdd1
  4. Add the corresponding line to /etc/fstab for the example disk osd.12:

    [...]
    UUID=04bb24f1-d631-47ff-a2ee-22d94ad4f80c /mnt/osd.12 xfs defaults,errors=remount-ro 0 1
    [...]
  5. Mount the disk:

    sudo mount /mnt/osd.12
  6. Add the new disk to /etc/ceph/ceph.conf and copy the updated configuration file to all other nodes in the cluster.

  7. Create the OSD.

    ceph osd create 04bb24f1-d631-47ff-a2ee-22d94ad4f80c
  8. Initialize the new OSD and register its authentication key so that it is accepted into the cluster:

    sudo mkdir /srv/ceph/04bb24f1-d631-47ff-a2ee-22d94ad4f80c
    ceph-osd -i 12 --mkfs --mkkey
    ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.12
  9. Start the newly added OSD:

    sudo systemctl start ceph-osd@12.service
  10. Add it to the cluster and allow replication based on CRUSH Map:

    ceph osd crush set 12 osd.12 1.0 \
    pool=pool_name rack=rack_name host=host_name-osd
  11. Check that the new OSD is in the right place within the cluster:

    ceph osd tree

10.2.2 Deleting disks

Important
  • This can be done on a live cluster without downtime.

  • This will cause increased replication traffic between servers.

  • Be sure not to remove too many disks from your cluster, otherwise the replication rules can no longer be fulfilled. See Section 4.3, “Rules Sets” for more information.

To delete a disk (for example osd.12) from a Ceph cluster, follow these steps:

  1. Make sure you have the right disk:

    ceph osd tree
  2. If the disk is a member of a pool and/or active:

    1. Mark the disk out:

      ceph osd out 12
    2. Wait for data migration to complete with ceph -w, then stop the OSD daemon:

      sudo systemctl stop ceph-osd@12.service
  3. Remove the disk from CRUSH Map:

    ceph osd crush remove osd.12
  4. Remove authentication information for the disk:

    ceph auth del osd.12
  5. Remove the disk from the cluster:

    ceph osd rm 12
Tip

The process of preparing/adding a disk can be simplified with the ceph-disk command. See http://ceph.com/docs/master/man/8/ceph-disk/ for more information on ceph-disk.

10.3 Recovery

10.3.1 'nn pg stuck inactive' Status Message

If you receive a 'stuck inactive' status message after running ceph status, it means that Ceph does not know where to replicate the stored data to fulfill the replication rules. This can happen shortly after the initial Ceph setup and fixes itself automatically. In other cases, it may require manual intervention, such as bringing up a broken OSD or adding a new OSD to the cluster. In very rare cases, reducing the replication level may help.
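
To find out which placement groups are affected, you can, for example, run the following commands:

ceph health detail
ceph pg dump_stuck inactive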

10.3.2 OSD Weight is 0

When an OSD starts, it is assigned a weight. The higher the weight, the bigger the chance that the cluster writes data to the OSD. The weight is either specified in the cluster CRUSH Map, or calculated by the OSD's start-up script.

In some cases, the calculated value for the OSD's weight may be rounded down to zero. This means that the OSD is not scheduled to store data, and no data is written to it. The reason is usually that the disk is too small (smaller than 15GB) and should be replaced with a bigger one.
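
If replacing the disk is not an option right away, you can assign a non-zero weight manually as described in Section 4.4.3, “Adjust an OSD’s CRUSH Weight” (osd.12 below is only an example name):

ceph osd crush reweight osd.12 1.0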

Glossary

General

admin node

The node from which you run the ceph-deploy utility to deploy Ceph on OSD nodes.

CRUSH

Controlled Replication Under Scalable Hashing: the algorithm Ceph uses to compute object storage locations, that is, to determine how to store and retrieve data in the cluster.

monitor node

A cluster node that maintains maps of the cluster state, including the monitor map and the OSD map.

OSD node

A cluster node that stores data, handles data replication, recovery, backfilling, rebalancing, and provides some monitoring information to Ceph monitors by checking other Ceph OSD daemons.

Ceph Specific Terms

Ceph Storage Cluster

The core set of storage software which stores the user’s data. Such a set consists of Ceph monitors and OSDs.

Also known as Ceph Object Store.
