SUSE Enterprise Storage 5

Deployment Guide

Authors: Tomáš Bažant, Jana Haláčková, and Sven Seeberg
Publication Date: 05/03/2018
About This Guide
Available Documentation
Feedback
Documentation Conventions
About the Making of This Manual
I SUSE Enterprise Storage
1 SUSE Enterprise Storage and Ceph
1.1 Ceph Features
1.2 Core Components
1.3 Storage Structure
1.4 Additional Information
2 Hardware Requirements and Recommendations
2.1 Object Storage Nodes
2.2 Monitor Nodes
2.3 Object Gateway Nodes
2.4 Salt Master
2.5 iSCSI Nodes
2.6 Network Recommendations
2.7 Naming Limitations
2.8 OSD and Monitor Sharing One Server
2.9 Minimum Cluster Configuration
2.10 Recommended Production Cluster Configuration
2.11 SUSE Enterprise Storage and Other SUSE Products
II Cluster Deployment and Upgrade
3 Deploying with DeepSea/Salt
3.1 Read the Release Notes
3.2 Introduction to DeepSea
3.3 Cluster Deployment
3.4 DeepSea CLI
3.5 Configuration and Customization
3.6 Custom ceph.conf File
4 Upgrading from Previous Releases
4.1 Read the Release Notes
4.2 General Upgrade Procedure
4.3 Upgrade from SUSE Enterprise Storage 4 (DeepSea Deployment) to 5
4.4 Upgrade from SUSE Enterprise Storage 4 (ceph-deploy Deployment) to 5
4.5 Upgrade from SUSE Enterprise Storage 4 (Crowbar Deployment) to 5
4.6 Upgrade from SUSE Enterprise Storage 3 to 5
5 Backing Up the Cluster Configuration
5.1 Back Up Salt Configuration
5.2 Back Up DeepSea Configuration
6 Customizing the Default Configuration
6.1 Using Customized Configuration Files
6.2 Modifying Discovered Configuration
III Installation of Additional Services
7 Installation of Services to Access your Data
8 Ceph Object Gateway
8.1 Object Gateway Manual Installation
9 Installation of iSCSI Gateway
9.1 iSCSI Block Storage
9.2 General Information about lrbd
9.3 Deployment Considerations
9.4 Installation and Configuration
9.5 Exporting RADOS Block Device Images using tcmu-runner
10 Installation of CephFS
10.1 Supported CephFS Scenarios and Guidance
10.2 Ceph Metadata Server
10.3 CephFS
11 Installation of NFS Ganesha
11.1 Preparation
11.2 Example Installation
11.3 High Availability Active-Passive Configuration
11.4 More Information
12 Export CephFS via Samba
12.1 Example Installation
12.2 High Availability Configuration
A Documentation Updates
A.1 Documentation Maintenance Update
A.2 Documentation Maintenance Update
A.3 October, 2017 (Release of SUSE Enterprise Storage 5)

Copyright © 2018 SUSE LLC

Copyright © 2010-2014, Inktank Storage, Inc. and contributors.

The text of and illustrations in this document are licensed under a Creative Commons Attribution-Share Alike 4.0 International ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/4.0/legalcode. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, JBoss, MetaMatrix, Fedora, the Infinity Logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the United States and other countries. Java® is a registered trademark of Oracle and/or its affiliates. XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries. MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries. All other trademarks are the property of their respective owners.

For SUSE trademarks, see http://www.suse.com/company/legal/. All other third-party trademarks are the property of their respective owners. Trademark symbols (®, ™ etc.) denote trademarks of SUSE and its affiliates. Asterisks (*) denote third-party trademarks.

All information found in this book has been compiled with utmost attention to detail. However, this does not guarantee complete accuracy. Neither SUSE LLC, its affiliates, the authors nor the translators shall be held liable for possible errors or the consequences thereof.

About This Guide

SUSE Enterprise Storage is an extension to SUSE Linux Enterprise. It combines the capabilities of the Ceph (http://ceph.com/) storage project with the enterprise engineering and support of SUSE. SUSE Enterprise Storage provides IT organizations with the ability to deploy a distributed storage architecture that can support a number of use cases using commodity hardware platforms.

This guide helps you understand the concept of the SUSE Enterprise Storage with the main focus on managing and administrating the Ceph infrastructure. It also demonstrates how to use Ceph with other related solutions, such as OpenStack or KVM.

Many chapters in this manual contain links to additional documentation resources. These include additional documentation that is available on the system as well as documentation available on the Internet.

For an overview of the documentation available for your product and the latest documentation updates, refer to http://www.suse.com/documentation.

1 Available Documentation

The following manuals are available for this product:

Book “Administration Guide”

The guide describes various administration tasks that are typically performed after the installation. The guide also introduces steps to integrate Ceph with virtualization solutions such as libvirt, Xen, or KVM, and ways to access objects stored in the cluster via iSCSI and RADOS gateways.

Deployment Guide

Guides you through the installation steps of the Ceph cluster and all services related to Ceph. The guide also illustrates a basic Ceph cluster structure and provides you with related terminology.

HTML versions of the product manuals can be found in the installed system under /usr/share/doc/manual. Find the latest documentation updates at http://www.suse.com/documentation where you can download the manuals for your product in multiple formats.

2 Feedback

Several feedback channels are available:

Bugs and Enhancement Requests

For services and support options available for your product, refer to http://www.suse.com/support/.

To report bugs for a product component, log in to the Novell Customer Center from http://www.suse.com/support/ and select My Support › Service Request.

User Comments

We want to hear your comments about and suggestions for this manual and the other documentation included with this product. Use the User Comments feature at the bottom of each page in the online documentation or go to http://www.suse.com/documentation/feedback.html and enter your comments there.

Mail

For feedback on the documentation of this product, you can also send a mail to doc-team@suse.de. Make sure to include the document title, the product version, and the publication date of the documentation. To report errors or suggest enhancements, provide a concise description of the problem and refer to the respective section number and page (or URL).

3 Documentation Conventions

The following typographical conventions are used in this manual:

  • /etc/passwd: directory names and file names

  • placeholder: replace placeholder with the actual value

  • PATH: the environment variable PATH

  • ls, --help: commands, options, and parameters

  • user: users or groups

  • Alt, AltF1: a key to press or a key combination; keys are shown in uppercase as on a keyboard

  • File, File › Save As: menu items, buttons

  • Dancing Penguins (Chapter Penguins, ↑Another Manual): This is a reference to a chapter in another manual.

4 About the Making of This Manual

This book is written in GeekoDoc, a subset of DocBook (see http://www.docbook.org). The XML source files were validated by xmllint, processed by xsltproc, and converted into XSL-FO using a customized version of Norman Walsh's stylesheets. The final PDF can be formatted through FOP from Apache or through XEP from RenderX. The authoring and publishing tools used to produce this manual are available in the package daps. The DocBook Authoring and Publishing Suite (DAPS) is developed as open source software. For more information, see http://daps.sf.net/.

Part I SUSE Enterprise Storage

1 SUSE Enterprise Storage and Ceph

SUSE Enterprise Storage is a distributed storage system designed for scalability, reliability and performance which is based on the Ceph technology. A Ceph cluster can be run on commodity servers in a common network like Ethernet. The cluster scales up well to thousands of servers (later on referred…

2 Hardware Requirements and Recommendations

The hardware requirements of Ceph are heavily dependent on the IO workload. The following hardware requirements and recommendations should be considered as a starting point for detailed planning.

1 SUSE Enterprise Storage and Ceph

SUSE Enterprise Storage is a distributed storage system designed for scalability, reliability and performance which is based on the Ceph technology. A Ceph cluster can be run on commodity servers in a common network like Ethernet. The cluster scales up well to thousands of servers (later on referred to as nodes) and into the petabyte range. As opposed to conventional systems which have allocation tables to store and fetch data, Ceph uses a deterministic algorithm to allocate storage for data and has no centralized information structure. Ceph assumes that in storage clusters the addition or removal of hardware is the rule, not the exception. The Ceph cluster automates management tasks such as data distribution and redistribution, data replication, failure detection and recovery. Ceph is both self-healing and self-managing which results in a reduction of administrative and budget overhead.

This chapter provides a high level overview of SUSE Enterprise Storage and briefly describes the most important components.

Tip
Tip

Since SUSE Enterprise Storage 5, the only cluster deployment method is DeepSea. Refer to Chapter 3, Deploying with DeepSea/Salt for details about the deployment process.

1.1 Ceph Features

The Ceph environment has the following features:

Scalability

Ceph can scale to thousands of nodes and manage storage in the range of petabytes.

Commodity Hardware

No special hardware is required to run a Ceph cluster. For details, see Chapter 2, Hardware Requirements and Recommendations.

Self-managing

The Ceph cluster is self-managing. When nodes are added, removed or fail, the cluster automatically redistributes the data. It is also aware of overloaded disks.

No Single Point of Failure

No node in a cluster stores important information alone. The number of redundancies can be configured.

Open Source Software

Ceph is an open source software solution and independent of specific hardware or vendors.

1.2 Core Components

To make full use of Ceph's power, it is necessary to understand some of the basic components and concepts. This section introduces some parts of Ceph that are often referenced in other chapters.

1.2.1 RADOS

The basic component of Ceph is called RADOS (Reliable Autonomic Distributed Object Store). It is responsible for managing the data stored in the cluster. Data in Ceph is usually stored as objects. Each object consists of an identifier and the data.

RADOS provides the following access methods to the stored objects that cover many use cases:

Object Gateway

Object Gateway is an HTTP REST gateway for the RADOS object store. It enables direct access to objects stored in the Ceph cluster.

RADOS Block Device

RADOS Block Devices (RBD) can be accessed like any other block device. These can be used for example in combination with libvirt for virtualization purposes.

CephFS

The Ceph File System is a POSIX-compliant file system.

librados

librados is a library that can be used with many programming languages to create an application capable of directly interacting with the storage cluster.

librados is used by the Object Gateway and RBD, while CephFS directly interfaces with RADOS (see Figure 1.1, “Interfaces to the Ceph Object Store”).

Interfaces to the Ceph Object Store
Figure 1.1: Interfaces to the Ceph Object Store
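
As a simple illustration of direct object access, the rados command line utility shipped with Ceph stores and retrieves objects in a pool. This is a minimal sketch; the pool name example-pool and the object name example-object are placeholders, and the pool must already exist:

rados -p example-pool put example-object ./local-file
rados -p example-pool ls
rados -p example-pool get example-object ./retrieved-file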

1.2.2 CRUSH

At the core of a Ceph cluster is the CRUSH algorithm. CRUSH is the acronym for Controlled Replication Under Scalable Hashing. CRUSH is a function that handles the storage allocation and needs comparably few parameters. That means only a small amount of information is necessary to calculate the storage position of an object. The parameters are a current map of the cluster including the health state, some administrator-defined placement rules and the name of the object that needs to be stored or retrieved. With this information, all nodes in the Ceph cluster are able to calculate where an object and its replicas are stored. This makes writing or reading data very efficient. CRUSH tries to evenly distribute data over all nodes in the cluster.

The CRUSH map contains all storage nodes and administrator-defined placement rules for storing objects in the cluster. It defines a hierarchical structure that usually corresponds to the physical structure of the cluster. For example, the data-containing disks are in hosts, hosts are in racks, racks in rows and rows in data centers. This structure can be used to define failure domains. Ceph then ensures that replications are stored on different branches of a specific failure domain.

If the failure domain is set to rack, replications of objects are distributed over different racks. This can mitigate outages caused by a failed switch in a rack. If one power distribution unit supplies a row of racks, the failure domain can be set to row. When the power distribution unit fails, the replicated data is still available on other rows.
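
The failure domain is expressed in CRUSH rules. As a rough sketch, a rule that places replicas in different racks might look as follows in the decompiled CRUSH map syntax; the rule name, id, and the root bucket 'default' are illustrative:

rule replicated_racks {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}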

1.2.3 Ceph Nodes and Daemons

In Ceph, nodes are servers working for the cluster. They can run several different types of daemons. It is recommended to run only one type of daemon on each node, except for MGR daemons which can be collocated with MONs. Each cluster requires at least MON, MGR and OSD daemons:

Ceph Monitor

Ceph Monitor (often abbreviated as MON) nodes maintain information about the cluster health state, a map of all nodes and data distribution rules (see Section 1.2.2, “CRUSH”).

If failures or conflicts occur, the Ceph Monitor nodes in the cluster decide by majority which information is correct. To form a qualified majority, it is recommended to have an odd number of Ceph Monitor nodes, and at least three of them.

If more than one site is used, the Ceph Monitor nodes should be distributed over an odd number of sites. The number of Ceph Monitor nodes per site should be such that more than 50% of the Ceph Monitor nodes remain functional if one site fails.

Ceph Manager

The Ceph manager (MGR) collects the state information from the whole cluster. The Ceph manager daemon runs alongside the monitor daemons. It provides additional monitoring and interfaces with external monitoring and management systems.

The Ceph manager requires no additional configuration, beyond ensuring it is running. You can deploy it as a separate role using DeepSea.

Ceph OSD

A Ceph OSD is a daemon handling an Object Storage Device, which is a physical or logical storage unit (a hard disk, a partition, or a logical volume). The daemon additionally takes care of data replication and rebalancing when nodes are added or removed.

Ceph OSD daemons communicate with monitor daemons and provide them with the state of the other OSD daemons.

To use CephFS, Object Gateway, NFS Ganesha, or iSCSI Gateway, additional nodes are required:

Metadata Server (MDS)

The metadata servers store metadata for the CephFS. By using an MDS you can execute basic file system commands such as ls without overloading the cluster.

Object Gateway

The Ceph Object Gateway is an HTTP REST gateway for the RADOS object store. It is compatible with OpenStack Swift and Amazon S3 and has its own user management.

NFS Ganesha

NFS Ganesha provides NFS access to either the Object Gateway or CephFS. It runs in user space instead of kernel space and interacts directly with the Object Gateway or CephFS.

iSCSI Gateway

iSCSI is a storage network protocol that allows clients to send SCSI commands to SCSI storage devices (targets) on remote servers.

1.3 Storage Structure

1.3.1 Pool

Objects that are stored in a Ceph cluster are put into pools. Pools represent logical partitions of the cluster to the outside world. For each pool a set of rules can be defined, for example, how many replications of each object must exist. The standard configuration of pools is called replicated pool.

Pools usually contain objects but can also be configured to act similarly to a RAID 5. In this configuration, objects are stored in chunks along with additional coding chunks. The coding chunks contain the redundant information. The number of data and coding chunks can be defined by the administrator. In this configuration, pools are referred to as erasure coded pools.
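
For illustration, a replicated pool and an erasure coded pool can be created with commands similar to the following. The pool names, placement group counts, and the erasure code profile values are example values only:

root@master # ceph osd pool create example-replicated 128 128 replicated
root@master # ceph osd erasure-code-profile set example-profile k=4 m=2
root@master # ceph osd pool create example-ec 128 128 erasure example-profile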

1.3.2 Placement Group

Placement Groups (PGs) are used for the distribution of data within a pool. When creating a pool, a certain number of placement groups is set. The placement groups are used internally to group objects and are an important factor for the performance of a Ceph cluster. The PG for an object is determined by the object's name.
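
As a general rule of thumb (a community guideline, not a SUSE-specific requirement), aim for roughly 100 placement groups per OSD: total PGs ≈ (number of OSDs × 100) / replica count, rounded up to the nearest power of two. For example, 9 OSDs with 3 replicas give 9 × 100 / 3 = 300, rounded up to 512 placement groups across all pools.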

1.3.3 Example

This section provides a simplified example of how Ceph manages data (see Figure 1.2, “Small Scale Ceph Example”). This example does not represent a recommended configuration for a Ceph cluster. The hardware setup consists of three storage nodes or Ceph OSDs (Host 1, Host 2, Host 3). Each node has three hard disks which are used as OSDs (osd.1 to osd.9). The Ceph Monitor nodes are not considered in this example.

Note
Note: Difference between Ceph OSD and OSD

While Ceph OSD or Ceph OSD daemon refers to a daemon that is run on a node, the word OSD refers to the logical disk that the daemon interacts with.

The cluster has two pools, Pool A and Pool B. While Pool A replicates objects only two times, resilience for Pool B is more important and it has three replications for each object.

When an application puts an object into a pool, for example via the REST API, a Placement Group (PG1 to PG4) is selected based on the pool and the object name. The CRUSH algorithm then calculates on which OSDs the object is stored, based on the Placement Group that contains the object.

In this example the failure domain is set to host. This ensures that replications of objects are stored on different hosts. Depending on the replication level set for a pool, the object is stored on two or three OSDs that are used by the Placement Group.

An application that writes an object only interacts with one Ceph OSD, the primary Ceph OSD. The primary Ceph OSD takes care of replication and confirms the completion of the write process after all other OSDs have stored the object.

If osd.5 fails, all objects in PG1 are still available on osd.1. As soon as the cluster recognizes that an OSD has failed, another OSD takes over. In this example osd.4 is used as a replacement for osd.5. The objects stored on osd.1 are then replicated to osd.4 to restore the replication level.

Small Scale Ceph Example
Figure 1.2: Small Scale Ceph Example

If a new node with new OSDs is added to the cluster, the cluster map is going to change. The CRUSH function then returns different locations for objects. Objects that receive new locations will be relocated. This process results in a balanced usage of all OSDs.

1.4 Additional Information

  • Ceph as a community project has its own extensive online documentation. For topics not found in this manual, refer to http://ceph.com/docs/luminous/.

  • The original publication CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data by S.A. Weil, S.A. Brandt, E.L. Miller, C. Maltzahn provides helpful insight into the inner workings of Ceph. Especially when deploying large scale clusters it is a recommended reading. The publication can be found at http://www.ssrc.ucsc.edu/papers/weil-sc06.pdf.

2 Hardware Requirements and Recommendations

The hardware requirements of Ceph are heavily dependent on the IO workload. The following hardware requirements and recommendations should be considered as a starting point for detailed planning.

In general, the recommendations given in this section are on a per-process basis. If several processes are located on the same machine, the CPU, RAM, disk and network requirements have to be added up.

2.1 Object Storage Nodes

2.1.1 Minimum Requirements

  • At least 4 OSD nodes, with 8 OSD disks each, are required.

  • For OSDs that do not use BlueStore, 1 GB of RAM per terabyte of raw OSD capacity is minimally required for each OSD storage node. 1.5 GB of RAM per terabyte of raw OSD capacity is recommended. During recovery, 2 GB of RAM per terabyte of raw OSD capacity may be optimal.

    For OSDs that use BlueStore, first calculate the RAM size recommended for OSDs that do not use BlueStore, then calculate 2 GB plus the size of the BlueStore cache for each OSD process, and choose the bigger of the two results. Note that the default BlueStore cache is 1 GB for HDD and 3 GB for SSD drives. A worked example follows this list. In summary, per node pick the greater of:

    [1 GB * OSD size in TB * OSD count]

    or

    [(2 GB + BlueStore cache size) * OSD count]
  • 1.5 GHz of a logical CPU core per OSD is minimally required for each OSD daemon process. 2 GHz per OSD daemon process is recommended. Note that Ceph runs one OSD daemon process per storage disk; do not count disks reserved solely for use as OSD journals, WAL journals, omap metadata, or any combination of these three cases.

  • 10 Gb Ethernet (two network interfaces bonded to multiple switches).

  • OSD disks in JBOD configurations.

  • OSD disks should be exclusively used by SUSE Enterprise Storage.

  • Dedicated disk/SSD for the operating system, preferably in a RAID 1 configuration.

  • If this OSD host will host part of a cache pool used for cache tiering, allocate at least an additional 4 GB of RAM.

  • OSD nodes should be bare metal, not virtualized, for disk performance reasons.
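
As a worked example of the RAM sizing rule above, assume a node with ten BlueStore OSDs of 4 TB each on HDDs (1 GB default BlueStore cache). The first formula gives 1 GB × 4 × 10 = 40 GB, the second gives (2 GB + 1 GB) × 10 = 30 GB, so plan for at least 40 GB of RAM for the OSD processes on that node, in addition to the RAM needed by the operating system and any co-located services.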

2.1.2 Minimum Disk Size

There are two types of disk space needed to run an OSD: the space for the disk journal (for FileStore) or WAL/DB device (for BlueStore), and the primary space for the stored data. The minimum (and default) value for the journal/WAL/DB is 6 GB. The minimum space for data is 5 GB, as partitions smaller than 5 GB are automatically assigned the weight of 0.

So although the minimum disk space for an OSD is 11 GB, we do not recommend a disk smaller than 20 GB, even for testing purposes.

2.1.3 Using SSD for OSD Journals

Solid-state drives (SSD) have no moving parts. This reduces random access time and read latency while accelerating data throughput. Because their price per 1MB is significantly higher than the price of spinning hard disks, SSDs are only suitable for smaller storage.

OSDs may see a significant performance improvement by storing their journal on an SSD and the object data on a separate hard disk.

Tip
Tip: Sharing an SSD for Multiple Journals

As journal data occupies relatively little space, you can mount several journal directories to a single SSD disk. Keep in mind that with each shared journal, the performance of the SSD disk degrades. We do not recommend sharing more than six journals on the same SSD disk and 12 on NVMe disks.

2.1.4 Maximum Recommended Number of Disks

You can have as many disks in one server as the server allows. There are a few things to consider when planning the number of disks per server:

  • Network bandwidth. The more disks you have in a server, the more data must be transferred via the network card(s) for the disk write operations.

  • Memory. For optimum performance, reserve at least 2 GB of RAM per terabyte of disk space installed.

  • Fault tolerance. If the complete server fails, the more disks it has, the more OSDs the cluster temporarily loses. Moreover, to keep the replication rules running, you need to copy all the data from the failed server among the other nodes in the cluster.

2.2 Monitor Nodes

  • At least three Ceph Monitor nodes are required. The number of monitors should always be odd (1+2n).

  • 4 GB of RAM.

  • Processor with four logical cores.

  • An SSD or other sufficiently fast storage type is highly recommended for monitors, as quorum may be unstable with high disk latencies. Two disks in a RAID 1 configuration are recommended for redundancy. It is recommended that separate disks or at least separate disk partitions are used for the monitor processes to protect the monitor's available disk space from things like log file creep.

  • There must only be one monitor process per node.

  • Mixing OSD, monitor, or Object Gateway nodes is only supported if sufficient hardware resources are available. That means that the requirements for all services need to be added up.

  • Two network interfaces bonded to multiple switches.

2.3 Object Gateway Nodes

Object Gateway nodes should have six to eight CPU cores and 32 GB of RAM (64 GB recommended). When other processes are co-located on the same machine, their requirements need to be added up.

2.4 Salt Master

At least 4 GB of RAM and a quad-core CPU are required. This includes running openATTIC on the Salt master. For large clusters with hundreds of nodes, 6 GB of RAM is suggested.

2.5 iSCSI Nodes

iSCSI nodes should have six to eight CPU cores and 16 GB of RAM.

2.6 Network Recommendations

The network environment where you intend to run Ceph should ideally be a bonded set of at least two network interfaces that is logically split into a public part and a trusted internal part using VLANs. The bonding mode is recommended to be 802.3ad if possible to provide maximum bandwidth and resiliency.

The public VLAN serves to provide the service to the customers, while the internal part provides for the authenticated Ceph network communication. The main reason for this is that although Ceph provides authentication and protection against attacks once secret keys are in place, the messages used to configure these keys may be transferred openly and are vulnerable.
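
As a rough sketch, an 802.3ad (LACP) bond on SUSE Linux Enterprise Server can be defined in /etc/sysconfig/network/ifcfg-bond0 as shown below. The interface names, IP address, and module options are example values, and the connected switch ports must be configured for LACP as well:

STARTMODE='auto'
BOOTPROTO='static'
IPADDR='192.168.100.11/24'
BONDING_MASTER='yes'
BONDING_SLAVE_0='eth0'
BONDING_SLAVE_1='eth1'
BONDING_MODULE_OPTS='mode=802.3ad miimon=100 xmit_hash_policy=layer3+4'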

Tip
Tip: Nodes Configured via DHCP

If your storage nodes are configured via DHCP, the default timeouts may not be sufficient for the network to be configured correctly before the various Ceph daemons start. If this happens, the Ceph MONs and OSDs will not start correctly (running systemctl status ceph\* will result in "unable to bind" errors). To avoid this issue, we recommend increasing the DHCP client timeout to at least 30 seconds on each node in your storage cluster. This can be done by changing the following settings on each node:

In /etc/sysconfig/network/dhcp, set

DHCLIENT_WAIT_AT_BOOT="30"

In /etc/sysconfig/network/config, set

WAIT_FOR_INTERFACES="60"

2.6.1 Adding a Private Network to a Running Cluster

If you do not specify a cluster network during Ceph deployment, it assumes a single public network environment. While Ceph operates fine with a public network, its performance and security improve when you set a second private cluster network. To support two networks, each Ceph node needs to have at least two network cards.

You need to apply the following changes to each Ceph node. It is relatively quick to do for a small cluster, but can be very time consuming if you have a cluster consisting of hundreds or thousands of nodes.

  1. Stop Ceph related services on each cluster node.

    Add a line to /etc/ceph/ceph.conf to define the cluster network, for example:

    cluster network = 10.0.0.0/24

    If you need to specifically assign static IP addresses or override cluster network settings, you can do so with the optional cluster addr setting. A sample [global] section combining the public and cluster networks is shown after this procedure.

  2. Check that the private cluster network works as expected on the OS level.

  3. Start Ceph related services on each cluster node.

    sudo systemctl start ceph.target
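
For reference, after this change the [global] section of /etc/ceph/ceph.conf contains both networks. The subnets below are examples only:

[global]
[...]
public network = 172.16.1.0/24
cluster network = 10.0.0.0/24
[...]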

2.6.2 Monitor Nodes on Different Subnets

If the monitor nodes are on multiple subnets, for example they are located in different rooms and served by different switches, you need to adjust the ceph.conf file accordingly. For example if the nodes have IP addresses 192.168.123.12, 1.2.3.4, and 242.12.33.12, add the following lines to its global section:

[global]
[...]
mon host = 192.168.123.12, 1.2.3.4, 242.12.33.12
mon initial members = MON1, MON2, MON3
[...]

Additionally, if for some reason you have to specify a per-monitor public address or network, you need to add a [mon.X] section per each monitor:

[mon.MON1]
public network = 192.168.123.0/24

[mon.MON2]
public network = 1.2.3.0/24

[mon.MON3]
public network = 242.12.33.0/24

2.7 Naming Limitations

Ceph does not generally support non-ASCII characters in configuration files, pool names, user names and so forth. When configuring a Ceph cluster we recommend using only simple alphanumeric characters (A-Z, a-z, 0-9) and minimal punctuation ('.', '-', '_') in all Ceph object/configuration names.

2.8 OSD and Monitor Sharing One Server

Although it is technically possible to run Ceph OSDs and Monitors on the same server in test environments, we strongly recommend having a separate server for each monitor node in production. The main reason is performance—the more OSDs the cluster has, the more I/O operations the monitor nodes need to perform. And when one server is shared between a monitor node and OSD(s), the OSD I/O operations are a limiting factor for the monitor node.

Another consideration is whether to share disks between an OSD, a monitor node, and the operating system on the server. The answer is simple: if possible, dedicate a separate disk to OSD, and a separate server to a monitor node.

Although Ceph supports directory-based OSDs, an OSD should always have a dedicated disk other than the operating system one.

Tip
Tip

If it is really necessary to run OSD and monitor node on the same server, run the monitor on a separate disk by mounting the disk to the /var/lib/ceph/mon directory for slightly better performance.
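
A minimal sketch of such a setup, assuming the spare disk is /dev/sdb, the monitor has not been deployed yet, and XFS is used as the file system (adjust the device name and file system to your environment):

sudo mkfs.xfs /dev/sdb
sudo mkdir -p /var/lib/ceph/mon
sudo mount /dev/sdb /var/lib/ceph/mon
echo '/dev/sdb /var/lib/ceph/mon xfs defaults 0 2' | sudo tee -a /etc/fstab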

2.9 Minimum Cluster Configuration

  • Four Object Storage Nodes

    • 10 Gb Ethernet (two networks bonded to multiple switches)

    • 32 OSDs per storage cluster

    • OSD journal can reside on OSD disk

    • Dedicated OS disk for each Object Storage Node

    • 1 GB of RAM per TB of raw OSD capacity for each Object Storage Node

    • 1.5 GHz per OSD for each Object Storage Node

    • Ceph Monitors, gateway and Metadata Servers can reside on Object Storage Nodes

      • Three Ceph Monitor nodes (requires SSD for dedicated OS drive)

      • Ceph Monitors, Object Gateways and Metadata Servers nodes require redundant deployment

      • iSCSI Gateways, Object Gateways and Metadata Servers require incremental 4 GB RAM and four cores

  • Separate management node with 4 GB RAM, four cores, 1 TB capacity

2.10 Recommended Production Cluster Configuration

  • Seven Object Storage Nodes

    • No single node exceeds ~15% of total storage

    • 10 Gb Ethernet (four physical networks bonded to multiple switches)

    • 56+ OSDs per storage cluster

    • RAID 1 OS disks for each OSD storage node

    • SSDs for journals, with a 6:1 ratio of OSDs to journal SSD

    • 1.5 GB of RAM per TB of raw OSD capacity for each Object Storage Node

    • 2 GHz per OSD for each Object Storage Node

  • Dedicated physical infrastructure nodes

    • Three Ceph Monitor nodes: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk

    • One SES management node: 4 GB RAM, 4 core processor, RAID 1 SSDs for disk

    • Redundant physical deployment of gateway or Metadata Server nodes:

      • Object Gateway nodes: 32 GB RAM, 8 core processor, RAID 1 SSDs for disk

      • iSCSI Gateway nodes: 16 GB RAM, 4 core processor, RAID 1 SSDs for disk

      • Metadata Server nodes (one active/one hot standby): 32 GB RAM, 8 core processor, RAID 1 SSDs for disk

2.11 SUSE Enterprise Storage and Other SUSE Products

This section contains important information about integrating SUSE Enterprise Storage with other SUSE products.

2.11.1 SUSE Manager

SUSE Manager and SUSE Enterprise Storage are not integrated, therefore SUSE Manager cannot currently manage a SUSE Enterprise Storage cluster.

Part II Cluster Deployment and Upgrade

3 Deploying with DeepSea/Salt

Salt along with DeepSea is a stack of components that help you deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running. Read the following considerations before you start deploying the cluster with Salt:

4 Upgrading from Previous Releases

This chapter introduces steps to upgrade SUSE Enterprise Storage from the previous release(s) to the current one.

5 Backing Up the Cluster Configuration

This chapter explains which files on the admin node should be backed up. As soon as you are finished with your cluster deployment or migration, create a backup of these directories.

6 Customizing the Default Configuration

You can change the default cluster configuration generated in Stage 2 (refer to DeepSea Stages Description). For example, you may need to change network settings, or software that is installed on the Salt master by default. You can perform the former by modifying the pillar updated after S…

3 Deploying with DeepSea/Salt

Note
Note: ceph-deploy Removed in SUSE Enterprise Storage 5

The ceph-deploy cluster deployment tool was deprecated in SUSE Enterprise Storage 4 and is completely removed in favor of DeepSea as of SUSE Enterprise Storage 5.

Salt along with DeepSea is a stack of components that help you deploy and manage server infrastructure. It is very scalable, fast, and relatively easy to get running. Read the following considerations before you start deploying the cluster with Salt:

  • Salt minions are the nodes controlled by a dedicated node called Salt master. Salt minions have roles, for example Ceph OSD, Ceph Monitor, Ceph Manager, Object Gateway, iSCSI Gateway, or NFS Ganesha.

  • A Salt master runs its own Salt minion. It is required for running privileged tasks—for example creating, authorizing, and copying keys to minions—so that remote minions never need to run privileged tasks. Do not run Ceph related services on the Salt master.

  • Salt minions need to correctly resolve the Salt master's host name over the network. By default, they look for the salt host name, but you can specify any other network-reachable host name in the /etc/salt/minion file, see Section 3.3, “Cluster Deployment”.

3.1 Read the Release Notes

In the release notes you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:

  • your hardware needs special considerations.

  • any used software packages have changed significantly.

  • special precautions are necessary for your installation.

The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.

After having installed the package release-notes-ses, find the release notes locally in the directory /usr/share/doc/release-notes or online at https://www.suse.com/releasenotes/.

3.2 Introduction to DeepSea

The goal of DeepSea is to save the administrator's time and to perform complex operations on a Ceph cluster with confidence.

Ceph is a very configurable software solution. It increases both the freedom and responsibility of system administrators.

The minimal Ceph setup is good for demonstration purposes, but does not show the interesting features of Ceph that become visible with a large number of nodes.

DeepSea collects and stores data about individual servers, such as addresses and device names. For a distributed storage system such as Ceph, there can be hundreds of such items to collect and store. Collecting the information and entering the data manually into a configuration management tool is exhausting and error prone.

The steps necessary to prepare the servers, collect the configuration, and configure and deploy Ceph are mostly the same. However, this does not address managing the separate functions. For day to day operations, the ability to trivially add hardware to a given function and remove it gracefully is a requirement.

DeepSea addresses these observations with the following strategy: DeepSea consolidates the administrator's decisions in a single file. The decisions include cluster assignment, role assignment and profile assignment. And DeepSea collects each set of tasks into a simple goal. Each goal is a stage:

DeepSea Stages Description
  • Stage 0—the preparation—during this stage, all required updates are applied and your system may be rebooted.

    Important
    Important: Re-run Stage 0 after Salt master Reboot

    If, during Stage 0, the Salt master reboots to load the new kernel version, you need to run Stage 0 again, otherwise minions will not be targeted.

  • Stage 1—the discovery—here you detect all hardware in your cluster and collect necessary information for the Ceph configuration. For details about configuration, refer to Section 3.5, “Configuration and Customization”.

  • Stage 2—the configuration—you need to prepare configuration data in a particular format.

  • Stage 3—the deployment—creates a basic Ceph cluster with mandatory Ceph services. See Section 1.2.3, “Ceph Nodes and Daemons” for their list.

  • Stage 4—the services—additional features of Ceph like iSCSI, Object Gateway and CephFS can be installed in this stage. Each is optional.

  • Stage 5—the removal stage. This stage is not mandatory and during the initial setup it is usually not needed. In this stage the roles of minions and also the cluster configuration are removed. You need to run this stage when you need to remove a storage node from your cluster. For details refer to Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.3 “Removing and Reinstalling Cluster Nodes”.

You can find a more detailed introduction into DeepSea at https://github.com/suse/deepsea/wiki.

3.2.1 Organization and Important Locations

Salt has several standard locations and several naming conventions used on your master node:

/srv/pillar

The directory stores configuration data for your cluster minions. Pillar is an interface for providing global configuration values to all your cluster minions.

/srv/salt/

The directory stores Salt state files (also called sls files). State files are formatted descriptions of states in which the cluster should be. For more information, refer to the Salt documentation.

/srv/module/runners

The directory stores Python scripts known as runners. Runners are executed on the master node.

/srv/salt/_modules

The directory stores Python scripts that are called modules. The modules are applied to all minions in your cluster.

/srv/pillar/ceph

The directory is used by DeepSea. Collected configuration data are stored here.

/srv/salt/ceph

A directory used by DeepSea. It stores sls files that can be in different formats, but each subdirectory contains only one type of sls file. For example, /srv/salt/ceph/stage contains orchestration files that are executed by salt-run state.orchestrate.

3.2.2 Targeting the Minions

DeepSea commands are executed via the Salt infrastructure. When using the salt command, you need to specify a set of Salt minions that the command will affect. We describe the set of the minions as a target for the salt command. The following sections describe possible methods to target the minions.

3.2.2.1 Matching the Minion Name

You can target a minion or a group of minions by matching their names. A minion's name is usually the short host name of the node where the minion runs. This is a general Salt targeting method, not related to DeepSea. You can use globbing, regular expressions, or lists to limit the range of minion names. The general syntax follows:

root@master # salt target example.module
Tip
Tip: Ceph-only Cluster

If all Salt minions in your environment belong to your Ceph cluster, you can safely substitute target with '*' to include all registered minions.

Match all minions in the example.net domain (assuming the minion names are identical to their "full" host names):

root@master # salt '*.example.net' test.ping

Match the 'web1' to 'web5' minions:

root@master # salt 'web[1-5]' test.ping

Match both 'web1-prod' and 'web1-devel' minions using a regular expression:

root@master # salt -E 'web1-(prod|devel)' test.ping

Match a simple list of minions:

root@master # salt -L 'web1,web2,web3' test.ping

Match all minions in the cluster:

root@master # salt '*' test.ping

3.2.2.2 Targeting with a 'deepsea' Grain

In a heterogeneous Salt-managed environment where SUSE Enterprise Storage is deployed on a subset of nodes alongside other cluster solution(s), it is a good idea to 'mark' the relevant minions by applying a 'deepsea' grain to them. This way you can easily target DeepSea minions in environments where matching by the minion name is problematic.

To apply the 'deepsea' grain to a group of minions, run:

root@master # salt target grains.append deepsea default

To remove the 'deepsea' grain from a group of minions, run:

root@master # salt target grains.delval deepsea destructive=True

After applying the 'deepsea' grain to the relevant minions, you can target them as follows:

root@master # salt -G 'deepsea:*' test.ping

The following command is an equivalent:

root@master # salt -C 'G@deepsea:*' test.ping

3.2.2.3 Set the deepsea_minions Option

Setting the deepsea_minions option's target is a requirement for DeepSea deployments. DeepSea uses it to instruct minions during stage execution (refer to DeepSea Stages Description for details).

To set or change the deepsea_minions option, edit the /srv/pillar/ceph/deepsea_minions.sls file on the Salt master and add or replace the following line:

deepsea_minions: target
Tip
Tip: deepsea_minions Target

As the target for the deepsea_minions option, you can use any targeting method: both Matching the Minion Name and Targeting with a 'deepsea' Grain.

Match all Salt minions in the cluster:

deepsea_minions: '*'

Match all minions with the 'deepsea' grain:

deepsea_minions: 'G@deepsea:*'

3.2.2.4 For More Information

You can use more advanced ways to target minions using the Salt infrastructure. Refer to https://docs.saltstack.com/en/latest/topics/targeting/ for a description of all targeting techniques.

Also, the deepsea_minions manual page gives you more detail about DeepSea targeting (man 7 deepsea_minions).

3.3 Cluster Deployment

The cluster deployment process has several phases. First, you need to prepare all nodes of the cluster by configuring Salt and then deploy and configure Ceph.

Tip
Tip: Deploying Monitor Nodes without Defining OSD Profiles

If you need to skip defining OSD profiles and deploy the monitor nodes first, you can do so by setting the DEV_ENV variable. It allows deploying monitors without the presence of the profile/ directory, as well as deploying a cluster with at least one storage, monitor, and manager node.

To set the environment variable, either enable it globally by setting it in the /srv/pillar/ceph/stack/global.yml file, or set it for the current shell session only:

root@master # export DEV_ENV=true

The following procedure describes the cluster preparation in detail.

  1. Install and register SUSE Linux Enterprise Server 12 SP3 together with SUSE Enterprise Storage extension on each node of the cluster.

  2. Verify that proper products are installed and registered by listing existing software repositories. The list will be similar to this output:

     root@minion > zypper lr -E
    #  | Alias   | Name                              | Enabled | GPG Check | Refresh
    ---+---------+-----------------------------------+---------+-----------+--------
     4 | [...]   | SUSE-Enterprise-Storage-5-Pool    | Yes     | (r ) Yes  | No
     6 | [...]   | SUSE-Enterprise-Storage-5-Updates | Yes     | (r ) Yes  | Yes
     9 | [...]   | SLES12-SP3-Pool                   | Yes     | (r ) Yes  | No
    11 | [...]   | SLES12-SP3-Updates                | Yes     | (r ) Yes  | Yes
  3. Configure network settings including proper DNS name resolution on each node. The Salt master and all the Salt minions need to resolve each other by their host names. For more information on configuring a network, see https://www.suse.com/documentation/sles-12/book_sle_admin/data/sec_basicnet_yast.html. For more information on configuring a DNS server, see https://www.suse.com/documentation/sles-12/book_sle_admin/data/cha_dns.html.

  4. Configure, enable, and start the NTP time synchronization server:

    root@master # systemctl enable ntpd.service
    root@master # systemctl start ntpd.service

    Find more information on setting up NTP in https://www.suse.com/documentation/sles-12/book_sle_admin/data/sec_netz_xntp_yast.html.

  5. Check whether the AppArmor service is running and disable it on each cluster node. Start the YaST AppArmor module, select Settings, and then deactivate the Enable AppArmor check box. Confirm with Done.

    Note that SUSE Enterprise Storage will not work with AppArmor enabled.

  6. Install the salt-master and salt-minion packages on the Salt master node:

    root@master # zypper in salt-master salt-minion

    Check that the salt-master service is enabled and started, and enable and start it if needed:

    root@master # systemctl enable salt-master.service
    root@master # systemctl start salt-master.service
  7. Check that the Salt master node has ports 4505 and 4506 open to all Salt minion nodes on the firewall. If not, you can open them using the yast2 firewall command and allowing the SaltStack service.

  8. Install the package salt-minion on all minion nodes.

    root@minion > zypper in salt-minion

    Make sure that the fully qualified domain name of each node can be resolved to the public network IP address by all other nodes.

  9. Configure all minions (including the master minion) to connect to the master. If your Salt master is not reachable by the host name salt, edit the file /etc/salt/minion or create a new file /etc/salt/minion.d/master.conf with the following content:

    master: host_name_of_salt_master

    If you performed any changes to the configuration files mentioned above, restart the Salt service on all Salt minions:

    root@minion > systemctl restart salt-minion.service
  10. Check that the salt-minion service is enabled and started on all nodes. Enable and start it if needed:

    root@minion > systemctl enable salt-minion.service
    root@minion > systemctl start salt-minion.service
  11. Verify each Salt minion's fingerprint and accept all salt keys on the Salt master if the fingerprints match.

    View each minion's fingerprint:

    root@minion > salt-call --local key.finger
    local:
    3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...

    After gathering fingerprints of all the Salt minions, list fingerprints of all unaccepted minion keys on the Salt master:

    root@master # salt-key -F
    [...]
    Unaccepted Keys:
    minion1:
    3f:a3:2f:3f:b4:d3:d9:24:49:ca:6b:2c:e1:6c:3f:c3:83:37:f0:aa:87:42:e8:ff...

    If the minions' fingerprints match, accept them:

    root@master # salt-key --accept-all
  12. Verify that the keys have been accepted:

    root@master # salt-key --list-all
  13. Prior to deploying SUSE Enterprise Storage, make sure that all disks that were used as OSDs by previous clusters are empty and contain no partitions. To ensure this, you need to manually zap all the disks. Remember to replace 'X' with the correct disk letter:

    1. Stop all processes that are using the specific disk.

    2. Verify whether any partition on the disk is mounted, and unmount if needed.

    3. If the disk is managed by LVM, deactivate and delete the whole LVM infrastructure. Refer to the LVM chapter of the SUSE Linux Enterprise Server Storage Administration Guide for more details.

    4. If the disk is part of an MD RAID, deactivate the RAID. Refer to the software RAID chapter of the SUSE Linux Enterprise Server Storage Administration Guide for more details.

    5. Tip
      Tip: Rebooting the Server

      If you get error messages such as 'partition in use' or 'kernel can not be updated with the new partition table' during the following steps, reboot the server.

      Wipe the beginning of each partition:

      for partition in /dev/sdX[0-9]*
      do
        dd if=/dev/zero of=$partition bs=4096 count=1 oflag=direct
      done
    6. Wipe the partition table:

      sgdisk -Z --clear -g /dev/sdX
    7. Wipe the backup partition tables:

      size=`blockdev --getsz /dev/sdX`
      position=$((size/4096 - 33))
      dd if=/dev/zero of=/dev/sdX bs=4096 count=33 seek=$position oflag=direct
  14. Install DeepSea on the Salt master node:

    root@master # zypper in deepsea
  15. Check that the file /srv/pillar/ceph/master_minion.sls on the Salt master points to your Salt master. If your Salt master is reachable via multiple host names, use the one suitable for the storage cluster. If you used the default host name for your Salt master—salt—in the ses domain, then the file looks as follows:

    master_minion: salt.ses

Now you deploy and configure Ceph. Unless specified otherwise, all steps are mandatory.

Note
Note: Salt Command Conventions

There are two ways to run salt-run state.orch: one with stage.<stage number>, the other with the name of the stage. Both notations have the same effect, and it is entirely your preference which one you use.

Procedure 3.1: Running Deployment Stages
  1. Include the Salt minions belonging to the Ceph cluster that you are currently deploying. Refer to Section 3.2.2.1, “Matching the Minion Name” for more information on targeting the minions.

  2. Prepare your cluster. Refer to DeepSea Stages Description for more details.

    root@master # salt-run state.orch ceph.stage.0

    or

    root@master # salt-run state.orch ceph.stage.prep
    Note
    Note: Run or Monitor Stages using DeepSea CLI

    Using the DeepSea CLI, you can follow the stage execution progress in real-time, either by running the DeepSea CLI in the monitoring mode, or by running the stage directly through DeepSea CLI. For details refer to Section 3.4, “DeepSea CLI”.

  3. Optional: create Btrfs subvolumes for /var/lib/ceph/. This step should only be executed before the next stages of DeepSea have been executed. To migrate existing directories or for more details, see Book “Administration Guide”, Chapter 18 “Hints and Tips”, Section 18.4 “Btrfs Subvolume for /var/lib/ceph”.

    root@master # salt-run state.orch ceph.migrate.subvolume
  4. The discovery stage collects data from all minions and creates configuration fragments that are stored in the directory /srv/pillar/ceph/proposals. The data are stored in the YAML format in *.sls or *.yml files.

    root@master # salt-run state.orch ceph.stage.1

    or

    root@master # salt-run state.orch ceph.stage.discovery
  5. After the previous command finishes successfully, create a policy.cfg file in /srv/pillar/ceph/proposals. For details refer to Section 3.5.1, “The policy.cfg File”.

    Tip
    Tip

    If you need to change the cluster's network setting, edit /srv/pillar/ceph/stack/ceph/cluster.yml and adjust the lines starting with cluster_network: and public_network:.

  6. The configuration stage parses the policy.cfg file and merges the included files into their final form. Cluster and role related content are placed in /srv/pillar/ceph/cluster, while Ceph specific content is placed in /srv/pillar/ceph/stack/default.

    Run the following command to trigger the configuration stage:

    root@master # salt-run state.orch ceph.stage.2

    or

    root@master # salt-run state.orch ceph.stage.configure

    The configuration step may take several seconds. After the command finishes, you can view the pillar data for the specified minions (for example, named ceph_minion1, ceph_minion2, etc.) by running:

    root@master # salt 'ceph_minion*' pillar.items
    Note
    Note: Overwriting Defaults

    As soon as the command finishes, you can view the default configuration and change it to suit your needs. For details refer to Chapter 6, Customizing the Default Configuration.

  7. Now you run the deployment stage. In this stage, the pillar is validated, and monitors and OSD daemons are started on the storage nodes. Run the following to start the stage:

    root@master # salt-run state.orch ceph.stage.3

    or

    root@master # salt-run state.orch ceph.stage.deploy

    The command may take several minutes. If it fails, you need to fix the issue and run the previous stages again. After the command succeeds, run the following to check the status:

    root@master # ceph -s
  8. The last step of the Ceph cluster deployment is the services stage. Here you instantiate any of the currently supported services: iSCSI Gateway, CephFS, Object Gateway, openATTIC, and NFS Ganesha. In this stage, the necessary pools and authorization keyrings are created, and the services are started. To start the stage, run the following:

    root@master # salt-run state.orch ceph.stage.4

    or

    root@master # salt-run state.orch ceph.stage.services

    Depending on the setup, the command may run for several minutes.

3.4 DeepSea CLI

DeepSea also provides a CLI tool that allows the user to monitor or run stages while visualizing the execution progress in real-time.

Two modes are supported for visualizing a stage's execution progress:

DeepSea CLI Modes
  • Monitoring mode: visualizes the execution progress of a DeepSea stage triggered by the salt-run command issued in another terminal session.

  • Stand-alone mode: runs a DeepSea stage while providing real-time visualization of its component steps as they are executed.

Important
Important: DeepSea CLI Commands

The DeepSea CLI commands can only be run on the Salt master node, with root privileges.

3.4.1 DeepSea CLI: Monitor Mode

The progress monitor provides a detailed, real-time visualization of what is happening during execution of stages using salt-run state.orch commands in other terminal sessions.

You need to start the monitor before running any salt-run state.orch so that the monitor can detect the start of the stage's execution.

If you start the monitor after issuing the salt-run state.orch command, then no execution progress will be shown.

You can start the monitor mode by running the following command:

root@master # deepsea monitor

For more information about the available command-line options of the deepsea monitor command, please check its manual page:

root@master # man deepsea-monitor

3.4.2 DeepSea CLI: Stand-alone Mode

In the stand-alone mode, DeepSea CLI can be used to run a DeepSea stage, showing its execution in real-time.

The command to run a DeepSea stage from the DeepSea CLI has the following form:

root@master # deepsea stage run stage-name

where stage-name corresponds to the way Salt orchestration state files are referenced. For example, stage deploy, which corresponds to the directory located in /srv/salt/ceph/stage/deploy, is referenced as ceph.stage.deploy.

This command is an alternative to the Salt-based commands for running DeepSea stages (or any DeepSea orchestration state file).

The command deepsea stage run ceph.stage.0 is equivalent to salt-run state.orch ceph.stage.0.

For more information about the available command-line options accepted by the deepsea stage run command, please check its manual page:

root@master # man deepsea-stage run

The following figure shows an example of the output of the DeepSea CLI when running Stage 2:

DeepSea CLI stage execution progress output
Figure 3.1: DeepSea CLI stage execution progress output

3.4.2.1 DeepSea CLI stage run Alias

For advanced users of Salt, we also support an alias for running a DeepSea stage that takes the Salt command used to run a stage, for example, salt-run state.orch stage-name, as a command of the DeepSea CLI.

Example:

root@master # deepsea salt-run state.orch stage-name

3.5 Configuration and Customization

3.5.1 The policy.cfg File

The /srv/pillar/ceph/proposals/policy.cfg configuration file is used to determine roles of individual cluster nodes. For example, which node acts as an OSD or which as a monitor node. Edit policy.cfg in order to reflect your desired cluster setup. The order of the sections is arbitrary, but the content of included lines overwrites matching keys from the content of previous lines.

Tip
Tip: Examples of policy.cfg

You can find several examples of complete policy files in the /usr/share/doc/packages/deepsea/examples/ directory.

3.5.1.1 Cluster Assignment

In the cluster section you select minions for your cluster. You can select all minions, or you can blacklist or whitelist minions. Examples for a cluster called ceph follow.

To include all minions, add the following lines:

cluster-ceph/cluster/*.sls

To whitelist a particular minion:

cluster-ceph/cluster/abc.domain.sls

or a group of minions—you can use shell glob matching:

cluster-ceph/cluster/mon*.sls

To blacklist minions, set them to unassigned:

cluster-unassigned/cluster/client*.sls

3.5.1.2 Role Assignment

This section provides you with details on assigning 'roles' to your cluster nodes. A 'role' in this context means the service you need to run on the node, such as Ceph Monitor, Object Gateway, iSCSI Gateway, or openATTIC. No role is assigned automatically; only roles added to policy.cfg will be deployed.

The assignment follows this pattern:

role-ROLE_NAME/PATH/FILES_TO_INCLUDE

Where the items have the following meaning and values:

  • ROLE_NAME is any of the following: 'master', 'admin', 'mon', 'mgr', 'mds', 'igw', 'rgw', 'ganesha', or 'openattic'.

  • PATH is a relative directory path to .sls or .yml files. In case of .sls files, it usually is cluster, while .yml files are located at stack/default/ceph/minions.

  • FILES_TO_INCLUDE are the Salt state files or YAML configuration files. They normally consist of Salt minion host names, for example ses5min2.yml. Shell globbing can be used for more specific matching.

An example for each role follows:

  • master - the node has admin keyrings to all Ceph clusters. Currently, only a single Ceph cluster is supported. As the master role is mandatory, always add a line similar to the following:

    role-master/cluster/master*.sls
  • admin - the minion will have an admin keyring. You define the role as follows:

    role-admin/cluster/abc*.sls
  • mon - the minion will provide the monitoring service to the Ceph cluster. This role requires addresses of the assigned minions. As of SUSE Enterprise Storage 5, the public addresses are calculated dynamically and are no longer needed in the Salt pillar.

    role-mon/cluster/mon*.sls

    The example assigns the monitoring role to a group of minions.

  • mgr - the Ceph manager daemon which collects all the state information from the whole cluster. Deploy it on all minions where you plan to deploy the Ceph monitor role.

    role-mgr/cluster/mgr*.sls
  • mds - the minion will provide the metadata service to support CephFS.

    role-mds/cluster/mds*.sls
  • igw - the minion will act as an iSCSI Gateway. This role requires addresses of the assigned minions, thus you need to also include the files from the stack directory:

    role-igw/stack/default/ceph/minions/xyz.domain.yml
    role-igw/cluster/*.sls
  • rgw - the minion will act as an Object Gateway:

    role-rgw/cluster/rgw*.sls
  • openattic - the minion will act as an openATTIC server:

    role-openattic/cluster/openattic*.sls

    For more information, see Book “Administration Guide”, Chapter 15 “openATTIC”.

  • ganesha - the minion will act as an NFS Ganesha server. The 'ganesha' role requires either an 'rgw' or an 'mds' role in the cluster, otherwise the validation will fail in Stage 3.

    To successfully install NFS Ganesha, additional configuration is required. If you want to use NFS Ganesha, read Chapter 11, Installation of NFS Ganesha before executing stages 2 and 4. However, it is possible to install NFS Ganesha later.

    In some cases it can be useful to define custom roles for NFS Ganesha nodes. For details, see Book “Administration Guide”, Chapter 14 “NFS Ganesha: Export Ceph Data via NFS”, Section 14.3 “Custom NFS Ganesha Roles”.

Note
Note: Multiple Roles of Cluster Nodes

You can assign several roles to a single node. For example, you can assign the mds roles to the monitor nodes:

role-mds/cluster/mon[1,2]*.sls

3.5.1.3 Common Configuration

The common configuration section includes configuration files generated during the discovery (stage 1). These configuration files store parameters like fsid or public_network. To include the required Ceph common configuration, add the following lines:

config/stack/default/global.yml
config/stack/default/ceph/cluster.yml

3.5.1.4 Profile Assignment

In Ceph, a single storage role would be insufficient to describe the many disk configurations available with the same hardware. DeepSea stage 1 will generate a default storage profile proposal. By default this proposal will be a bluestore profile and will try to propose the highest performing configuration for the given hardware setup. For example, external journals will be preferred over a single disk containing objects and metadata. Solid state storage will be prioritized over spinning disks. Profiles are assigned in the policy.cfg similar to roles.

The default proposal can be found in the profile-default directory tree. To include it, add the following two lines to your policy.cfg:

profile-default/cluster/*.sls
profile-default/stack/default/ceph/minions/*.yml

You can also create a customized storage profile to your liking by using the proposal runner. This runner offers three methods: help, peek, and populate.

salt-run proposal.help prints the runner help text about the various arguments it accepts.

salt-run proposal.peek shows the generated proposal according to the arguments passed.

salt-run proposal.populate writes the proposal to the /srv/pillar/ceph/proposals subdirectory. Pass name=myprofile to name the storage profile. This will result in a profile-myprofile subdirectory.

For all other arguments, consult the output of salt-run proposal.help.
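
For example, a minimal sketch of generating and assigning a custom profile follows. The name myprofile is the example name used above; the generated proposal content depends entirely on your hardware:

root@master # salt-run proposal.peek
root@master # salt-run proposal.populate name=myprofile

Then reference the resulting profile-myprofile subdirectory in policy.cfg instead of (or in addition to) profile-default:

profile-myprofile/cluster/*.sls
profile-myprofile/stack/default/ceph/minions/*.yml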

3.5.1.4.1 Deploying Encrypted OSDs

By default, OSDs are deployed unencrypted. If you need to use encrypted OSDs for your deployment, use the proposal.populate runner with the encryption=dmcrypt argument:

root # salt-run proposal.populate encryption=dmcrypt

3.5.1.5 Item Filtering

Sometimes it is not practical to include all files from a given directory with *.sls globbing. The policy.cfg file parser understands the following filters:

Warning
Warning: Advanced Techniques

This section describes filtering techniques for advanced users. When not used correctly, filtering can cause problems, for example if your node numbering changes.

slice=[start:end]

Use the slice filter to include only items start through end-1. Note that items in the given directory are sorted alphanumerically. The following line includes the third to fifth files from the role-mon/cluster/ subdirectory:

role-mon/cluster/*.sls slice[3:6]

re=regexp

Use the regular expression filter to include only items matching the given expressions. For example:

role-mon/cluster/mon*.sls re=.*1[135]\.subdomainX\.sls$

3.5.1.6 Example policy.cfg File

Following is an example of a basic policy.cfg file:

## Cluster Assignment
cluster-ceph/cluster/*.sls 1

## Roles
# ADMIN
role-master/cluster/examplesesadmin.sls 2
role-admin/cluster/sesclient*.sls 3

# MON
role-mon/cluster/ses-example-[123].sls 4

# MGR
role-mgr/cluster/ses-example-[123].sls 5

# MDS
role-mds/cluster/ses-example-4.sls 6

# IGW
role-igw/stack/default/ceph/minions/ses-example-4.yml 7
role-igw/cluster/ses-example-4.sls 8

# RGW
role-rgw/cluster/ses-example-4.sls 9

# openATTIC
role-openattic/cluster/openattic*.sls 10

# COMMON
config/stack/default/global.yml 11
config/stack/default/ceph/cluster.yml 12

## Profiles
profile-default/cluster/*.sls 13
profile-default/stack/default/ceph/minions/*.yml 14

1

Indicates that all minions are included in the Ceph cluster. If you have minions you do not want to include in the Ceph cluster, use:

cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/ses-example-*.sls

The first line marks all minions as unassigned. The second line overrides minions matching 'ses-example-*.sls', and assigns them to the Ceph cluster.

2

The minion called 'examplesesadmin' has the 'master' role. This also means it will get admin keys to the cluster.

3

All minions matching 'sesclient*' will get admin keys as well.

4

All minions matching 'ses-example-[123]' (presumably three minions: ses-example-1, ses-example-2, and ses-example-3) will be set up as MON nodes.

5

All minions matching 'ses-example-[123]' (all MON nodes in the example) will be set up as MGR nodes.

6

Minion 'ses-example-4' will have the MDS role.

7

Ensures that DeepSea knows the IP address of the IGW node.

8

Minion 'ses-example-4' will have the IGW role.

9

Minion 'ses-example-4' will have the RGW role.

10

Specifies to deploy the openATTIC user interface to administer the Ceph cluster. See Book “Administration Guide”, Chapter 15 “openATTIC” for more details.

11

Means that we accept the default values for common configuration parameters such as fsid and public_network.

12

Means that we accept the default values for common configuration parameters such as fsid and public_network.

13

We are telling DeepSea to use the default hardware profile for each minion. Choosing the default hardware profile means that we want all additional disks (other than the root disk) as OSDs.

14

We are telling DeepSea to use the default hardware profile for each minion. Choosing the default hardware profile means that we want all additional disks (other than the root disk) as OSDs.

3.6 Custom ceph.conf File

If you need to put custom settings into the ceph.conf configuration file, see Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.10 “Custom ceph.conf File” for more details.

4 Upgrading from Previous Releases

This chapter introduces steps to upgrade SUSE Enterprise Storage from the previous release(s) to the current one.

4.1 Read the Release Notes

In the release notes you can find additional information on changes since the previous release of SUSE Enterprise Storage. Check the release notes to see whether:

  • your hardware needs special considerations.

  • any used software packages have changed significantly.

  • special precautions are necessary for your installation.

The release notes also provide information that could not make it into the manual on time. They also contain notes about known issues.

After installing the package release-notes-ses, find the release notes locally in the directory /usr/share/doc/release-notes or online at https://www.suse.com/releasenotes/.

4.2 General Upgrade Procedure

Consider the following items before starting the upgrade procedure:

Upgrade Order

Before upgrading the Ceph cluster, you need to have both the underlying SUSE Linux Enterprise Server and SUSE Enterprise Storage correctly registered against SCC or SMT. You can upgrade daemons in your cluster while the cluster is online and in service. Certain types of daemons depend upon others. For example Ceph Object Gateways depend upon Ceph monitors and Ceph OSD daemons. We recommend upgrading in this order:

  1. Ceph Monitors

  2. Ceph Managers

  3. Ceph OSDs

  4. Metadata Servers

  5. Object Gateways

  6. iSCSI Gateways

  7. NFS Ganesha

Delete Unnecessary Operating System Snapshots

Remove unneeded file system snapshots on the operating system partitions of the nodes. This ensures that there is enough free disk space during the upgrade.
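
For example, on nodes that use the default Btrfs root file system of SUSE Linux Enterprise Server, old Snapper snapshots can be listed and removed (the snapshot number 42 is only a placeholder):

root # snapper list
root # snapper delete 42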

Check Cluster Health

We recommend checking the cluster health before starting the upgrade procedure.
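
For example, you can check the overall cluster state with:

root # ceph status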

Upgrade One by One

We recommend upgrading all the daemons of a specific type—for example all monitor daemons or all OSD daemons—one by one to ensure that they are all on the same release. We also recommend that you upgrade all the daemons in your cluster before you try to exercise new functionality in a release.

After all the daemons of a specific type are upgraded, check their status.

Ensure each monitor has rejoined the quorum after all monitors are upgraded:

root # ceph mon stat

Ensure each Ceph OSD daemon has rejoined the cluster after all OSDs are upgraded:

root # ceph osd stat

Set require-osd-release luminous Flag

When the last OSD is upgraded to SUSE Enterprise Storage 5, the monitor nodes will detect that all OSDs are running the 'luminous' version of Ceph and they may complain that the require-osd-release luminous osdmap flag is not set. In that case, you need to set this flag manually to acknowledge that—now that the cluster has been upgraded to 'luminous'—it cannot be downgraded back to Ceph 'jewel'. Set the flag by running the following command:

root@minion > sudo ceph osd require-osd-release luminous

After the command completes, the warning disappears.

On fresh installs of SUSE Enterprise Storage 5, this flag is set automatically when the Ceph monitors create the initial osdmap, so no end user action is needed.

4.3 Upgrade from SUSE Enterprise Storage 4 (DeepSea Deployment) to 5

Important
Important: Software Requirements

You need to have the following software installed and updated to the latest package versions on all the Ceph nodes you want to upgrade before you can start with the upgrade procedure:

  • SUSE Linux Enterprise Server 12 SP2

  • SUSE Enterprise Storage 4

In addition, before starting the upgrade, you need to upgrade the Salt master node to SUSE Linux Enterprise Server 12 SP3 and SUSE Enterprise Storage 5 by running zypper migration (or your preferred way of upgrading).

Warning
Warning: Points to Consider before the Upgrade
  • Check whether the AppArmor service is running and disable it on each cluster node. Start the YaST AppArmor module, select Settings, and then deactivate the Enable Apparmor check box. Confirm with Done.

    Note that SUSE Enterprise Storage will not work with AppArmor enabled.

  • Although the cluster is fully functional during the upgrade, DeepSea sets the 'noout' flag which prevents Ceph from rebalancing data during downtime and therefore avoids unnecessary data transfers.

  • To optimize the upgrade process, DeepSea upgrades your nodes in order based on their assigned role, as recommended by Ceph upstream: MONs, MGRs, OSDs, MDS, RGW, IGW, and NFS Ganesha.

    Note that DeepSea cannot prevent the prescribed order from being violated if a node runs multiple services.

  • Although the Ceph cluster is operational during the upgrade, nodes may get rebooted in order to apply, for example, new kernel versions. To reduce waiting I/O operations, we recommend declining incoming requests for the duration of the upgrade process.

  • The cluster upgrade may take a very long time—approximately the time it takes to upgrade one machine multiplied by the number of cluster nodes.

  • Since Ceph Luminous, the osd crush location config option is no longer supported. Please update your DeepSea configuration files to use crush location before upgrading.

To upgrade the SUSE Enterprise Storage 4 cluster to version 5, follow these steps:

  1. Set the new internal object sort order by running:

    root # ceph osd set sortbitwise
    Tip
    Tip

    To verify that the command was successful, we recommend running

    root # ceph osd dump --format json-pretty | grep sortbitwise
     "flags": "sortbitwise,recovery_deletes,purged_snapdirs",
  2. Using rpm -q deepsea, verify that the version of the DeepSea package on the Salt master node starts with at least 0.7. For example:

    root # rpm -q deepsea
    deepsea-0.7.27+git.0.274c55d-5.1

    If the DeepSea package version number starts with 0.6, double check whether you successfully migrated the Salt master node to SUSE Linux Enterprise Server 12 SP3 and SUSE Enterprise Storage 5 (refer to Important: Software Requirements at the beginning of this section). This is a prerequisite that must be completed before starting the upgrade procedure.

    1. If you registered your systems with SUSEConnect and use SCC/SMT, no further actions need to be taken. Continue with Step 4.

    2. If you are not using SCC/SMT but a Media-ISO or other package source, add the following repositories manually: SLE12-SP3 Base, SLE12-SP3 Update, SES5 Base, and SES5 Update. You can do so using the zypper command. First remove all existing software repositories, then add the required new ones, and finally refresh the repository sources:

      root # zypper sd {0..99}
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Products/Storage/5/x86_64/product/ SES5-POOL
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Updates/Storage/5/x86_64/update/ SES5-UPDATES
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Products/SLE-SERVER/12-SP3/x86_64/product/ SLES12-SP3-POOL
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Updates/SLE-SERVER/12-SP3/x86_64/update/ SLES12-SP3-UPDATES
      root # zypper ref

      Then change your Pillar data in order to use a different strategy. Edit

      /srv/pillar/ceph/stack/name_of_cluster/cluster.yml

      and add the following line:

      upgrade_init: zypper-dup
      Tip
      Tip

      The zypper-dup strategy requires you to manually add the latest software repositories, while the default zypper-migration relies on the repositories provided by SCC/SMT.

  3. Update your Pillar:

    root@master # salt target saltutil.sync_all

    See Section 3.2.2, “Targeting the Minions” for details about Salt minions targeting.

  4. Verify that you successfully wrote to the Pillar:

    root@master # salt target pillar.get upgrade_init

    The command's output should mirror the entry you added.

  5. Upgrade Salt minions:

    root@master # salt target state.apply ceph.updates.salt
  6. Verify that all Salt minions are upgraded:

    root@master # salt target test.version
  7. Include the cluster's Salt minions. Refer to Section 3.2.2, “Targeting the Minions” of Procedure 3.1, “Running Deployment Stages” for more details.

  8. Start the upgrade of SUSE Linux Enterprise Server and Ceph:

    root@master # salt-run state.orch ceph.maintenance.upgrade
    Tip
    Tip: Re-run on Reboot

    If the process results in a reboot of the Salt master, re-run the command to start the upgrade process for the Salt minions again.

  9. Check that AppArmor is disabled and stopped on all nodes after the upgrade:

    root # systemctl disable apparmor.service
    systemctl stop apparmor.service
  10. After the upgrade, the Ceph Managers are not installed yet. To reach a healthy cluster state, do the following:

    1. Run Stage 0 to enable the Salt REST API:

      root@master # salt-run state.orch ceph.stage.0
    2. Run Stage 1 to create the role-mgr/ subdirectory:

      root@master # salt-run state.orch ceph.stage.1
    3. Edit policy.cfg as described in Section 3.5.1, “The policy.cfg File” and add a Ceph Manager role to the nodes where Ceph Monitors are deployed. Also, add the openATTIC role to one of the cluster nodes. Refer to Book “Administration Guide”, Chapter 15 “openATTIC” for more details.

    4. Run Stage 2 to update the Pillar:

      root@master # salt-run state.orch ceph.stage.2
    5. DeepSea uses a different approach to generate the ceph.conf configuration file now, refer to Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.10 “Custom ceph.conf File” for more details.

    6. Run Stage 3 to deploy Ceph Managers:

      root@master # salt-run state.orch ceph.stage.3
    7. Run Stage 4 to configure openATTIC properly:

      root@master # salt-run state.orch ceph.stage.4
    Note
    Note: Ceph Key Caps Mismatch

    If ceph.stage.3 fails with "Error EINVAL: entity client.bootstrap-osd exists but caps do not match", it means the key capabilities (caps) for the existing cluster's client.bootstrap-osd key do not match the caps that DeepSea is trying to set. Above the error message, in red text, you can see a dump of the ceph auth command that failed. Look at this command to check the key ID and file being used. In the case of client.bootstrap-osd, the command will be

    root # ceph auth add client.bootstrap-osd \
     -i /srv/salt/ceph/osd/cache/bootstrap.keyring

    To fix mismatched key caps, check the content of the keyring file DeepSea is trying to deploy, for example:

    cephadm > cat /srv/salt/ceph/osd/cache/bootstrap.keyring
    [client.bootstrap-osd]
         key = AQD6BpVZgqVwHBAAQerW3atANeQhia8m5xaigw==
         caps mgr = "allow r"
         caps mon = "allow profile bootstrap-osd"

    Compare this with the output of ceph auth get client.bootstrap-osd:

    root # ceph auth get client.bootstrap-osd
    exported keyring for client.bootstrap-osd
    [client.bootstrap-osd]
         key = AQD6BpVZgqVwHBAAQerW3atANeQhia8m5xaigw==
         caps mon = "allow profile bootstrap-osd"

    Note how the latter key is missing caps mgr = "allow r". To fix this, run:

    root # ceph auth caps client.bootstrap-osd mgr \
     "allow r" mon "allow profile bootstrap-osd"

    Running ceph.stage.3 should now succeed.

    The same issue can occur with Metadata Server and Object Gateway keyrings when running ceph.stage.4. The same procedure as above applies: check the command that failed, the keyring file being deployed, and the caps of the existing key. Then run ceph auth caps to update the existing key caps to match what is being deployed by DeepSea.

Important
Important: Upgrade Failure

If the cluster is in 'HEALTH_ERR' state for more than 300 seconds, or one of the services for each assigned role is down for more than 900 seconds, the upgrade failed. In that case, try to find the problem, resolve it, and re-run the upgrade procedure. Note that in virtualized environments, the timeouts are shorter.

Important
Important: Rebooting OSDs

After upgrading to SUSE Enterprise Storage 5, FileStore OSDs need approximately five minutes longer to start as the OSD will do a one-off conversion of its on-disk files.

Tip
Tip: Check for the Version of Cluster Components/Nodes

When you need to find out the versions of individual cluster components and nodes—for example to find out if all your nodes are actually on the same patch level after the upgrade—you can run

root@master # salt-run status.report

The command goes through the connected Salt minions and scans for the version numbers of Ceph, Salt, and SUSE Linux Enterprise Server, and gives you a report displaying the version that the majority of nodes have and showing nodes whose version is different from the majority.

4.3.1 OSD Migration to BlueStore

OSD BlueStore is a new back end for the OSD daemons. It is the default option since SUSE Enterprise Storage 5. Compared to FileStore, which stores objects as files in an XFS file system, BlueStore can deliver increased performance because it stores objects directly on the underlying block device. BlueStore also enables other features, such as built-in compression and EC overwrites, that are unavailable with FileStore.

Specifically for BlueStore, an OSD has a 'wal' (Write Ahead Log) device and a 'db' (RocksDB database) device. The RocksDB database holds the metadata for a BlueStore OSD. These two devices will reside on the same device as an OSD by default, but either can be placed on faster/different media.

In SES5, both FileStore and BlueStore are supported and it is possible for FileStore and BlueStore OSDs to co-exist in a single cluster. During the SUSE Enterprise Storage upgrade procedure, FileStore OSDs are not automatically converted to BlueStore. Be aware that the BlueStore-specific features will not be available on OSDs that have not been migrated to BlueStore.

Before converting to BlueStore, the OSDs need to be running SUSE Enterprise Storage 5. The conversion is a slow process as all data gets re-written twice. Though the migration process can take a long time to complete, there is no cluster outage and all clients can continue accessing the cluster during this period. However, do expect lower performance for the duration of the migration. This is caused by rebalancing and backfilling of cluster data.

Use the following procedure to migrate FileStore OSDs to BlueStore:

Tip
Tip: Turn Off Safety Measures

Salt commands needed for running the migration are blocked by safety measures. In order to turn these precautions off, run the following command:

root@master # salt-run disengage.safety
  1. Migrate hardware profiles:

    root@master # salt-run state.orch ceph.migrate.policy

    This runner migrates any hardware profiles currently in use by the policy.cfg file. It processes policy.cfg, finds any hardware profile using the original data structure, and converts it to the new data structure. The result is a new hardware profile named 'migrated-original_name'. policy.cfg is updated as well.

    If the original configuration had separate journals, the BlueStore configuration will use the same device for the 'wal' and 'db' for that OSD.

  2. DeepSea migrates OSDs by setting their weight to 0 which 'vacuums' the data until the OSD is empty. You can either migrate OSDs one by one, or all OSDs at once. In either case, when the OSD is empty, the orchestration removes it and then re-creates it with the new configuration.

    Tip
    Tip: Recommended Method

    Use ceph.migrate.nodes if you have a large number of physical storage nodes or almost no data. If one node represents less than 10% of your capacity, then ceph.migrate.nodes may be marginally faster, moving all the data from those OSDs in parallel.

    If you are not sure which method to use, or if the site has few storage nodes (for example, each node holds more than 10% of the cluster data), then select ceph.migrate.osds.

    1. To migrate OSDs one at a time, run:

      root@master # salt-run state.orch ceph.migrate.osds
    2. To migrate all OSDs on each node in parallel, run:

      root@master # salt-run state.orch ceph.migrate.nodes
    Tip
    Tip

    As the orchestration gives no feedback about the migration progress, use

    root # ceph osd tree

    to periodically check which OSDs have a weight of zero.

After the migration to BlueStore, the object count will remain the same and disk usage will be nearly the same.

4.4 Upgrade from SUSE Enterprise Storage 4 (ceph-deploy Deployment) to 5

Important
Important: Software Requirements

You need to have the following software installed and updated to the latest package versions on all the Ceph nodes you want to upgrade before you can start with the upgrade procedure:

  • SUSE Linux Enterprise Server 12 SP2

  • SUSE Enterprise Storage 4

Choose the Salt master for your cluster. If your cluster has Calamari deployed, then the Calamari node is already the Salt master. Alternatively, the admin node from which you ran the ceph-deploy command will become the Salt master.

Before starting the procedure below, you need to upgrade the Salt master node to SUSE Linux Enterprise Server 12 SP3 and SUSE Enterprise Storage 5 by running zypper migration (or your preferred way of upgrading).

To upgrade the SUSE Enterprise Storage 4 cluster which was deployed with ceph-deploy to version 5, follow these steps:

Procedure 4.1: Steps to Apply to All Cluster Nodes (including the Calamari Node)
  1. Install the salt package from SLE-12-SP2/SES4:

    root # zypper install salt
  2. Install the salt-minion package from SLE-12-SP2/SES4, then enable and start the related service:

    root # zypper install salt-minion
    root # systemctl enable salt-minion
    root # systemctl start salt-minion
  3. Ensure that the host name 'salt' resolves to the IP address of the Salt master node. If your Salt master is not reachable by the host name salt, edit the file /etc/salt/minion or create a new file /etc/salt/minion.d/master.conf with the following content:

    master: host_name_of_salt_master

    If you performed any changes to the configuration files mentioned above, restart the Salt service on all Salt minions:

    root@minion > systemctl restart salt-minion.service
    1. If you registered your systems with SUSEConnect and use SCC/SMT, no further actions need to be taken.

    2. If you are not using SCC/SMT but a Media-ISO or other package source, add the following repositories manually: SLE12-SP3 Base, SLE12-SP3 Update, SES5 Base, and SES5 Update. You can do so using the zypper command. First remove all existing software repositories, then add the required new ones, and finally refresh the repository sources:

      root # zypper sd {0..99}
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Products/Storage/5/x86_64/product/ SES5-POOL
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Updates/Storage/5/x86_64/update/ SES5-UPDATES
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Products/SLE-SERVER/12-SP3/x86_64/product/ SLES12-SP3-POOL
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Updates/SLE-SERVER/12-SP3/x86_64/update/ SLES12-SP3-UPDATES
      root # zypper ref
Procedure 4.2: Steps to Apply to the Salt master Node
  1. Set the new internal object sort order by running:

    root@master # ceph osd set sortbitwise
    Tip
    Tip

    To verify that the command was successful, we recommend running

    root@master # ceph osd dump --format json-pretty | grep sortbitwise
     "flags": "sortbitwise,recovery_deletes,purged_snapdirs",
  2. Upgrade the Salt master node to SUSE Linux Enterprise Server 12 SP3 and SUSE Enterprise Storage 5. For SCC-registered systems, use zypper migration. If you provide the required software repositories manually, use zypper dup. After the upgrade, ensure that only repositories for SUSE Linux Enterprise Server 12 SP3 and SUSE Enterprise Storage 5 are active (and refreshed) on the Salt master node before proceeding.

  3. If not already present, install the salt-master package, then enable and start the related service:

    root@master # zypper install salt-master
    root@master # systemctl enable salt-master
    root@master # systemctl start salt-master
  4. Verify the presence of all Salt minions by listing their keys:

    root@master # salt-key -L
  5. Add all Salt minion keys to the Salt master, including the key of the minion running on the Salt master itself:

    root@master # salt-key -A -y
  6. Ensure that all Salt minions' keys were accepted:

    root@master # salt-key -L
  7. Make sure that the software on your Salt master node is up to date:

    root@master # zypper migration
  8. Install the deepsea package:

    root@master # zypper install deepsea
  9. Include the cluster's Salt minions. Refer to Section 3.2.2, “Targeting the Minions” of Procedure 3.1, “Running Deployment Stages” for more details.

  10. Import the existing ceph-deploy installed cluster:

    root@master # salt-run populate.engulf_existing_cluster

    The command will do the following:

    • Distribute all the required Salt and DeepSea modules to all the Salt minions.

    • Inspect the running Ceph cluster and populate /srv/pillar/ceph/proposals with a layout of the cluster.

      /srv/pillar/ceph/proposals/policy.cfg will be created with roles matching all detected running Ceph services. View this file to verify that each of your existing MON, OSD, RGW, and MDS nodes has the appropriate roles. OSD nodes will be imported into the profile-import/ subdirectory, so you can examine the files in /srv/pillar/ceph/proposals/profile-import/cluster/ and /srv/pillar/ceph/proposals/profile-import/stack/default/ceph/minions/ to confirm that the OSDs were correctly picked up.

      Note
      Note

      The generated policy.cfg will only apply roles for detected Ceph services 'role-mon', 'role-mgr', 'role-mds', 'role-rgw', 'role-admin', and 'role-master' for the Salt master node. Any other desired roles will need to be added to the file manually (see Section 3.5.1.2, “Role Assignment”).

    • The existing cluster's ceph.conf will be saved to /srv/salt/ceph/configuration/files/ceph.conf.import.

    • /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml will include the cluster's fsid, cluster and public networks, and also specifies the configuration_init: default-import option, which makes DeepSea use the ceph.conf.import configuration file mentioned previously, rather than using DeepSea's default /srv/salt/ceph/configuration/files/ceph.conf.j2 template.

      Note
      Note: Custom ceph.conf

      If you need to integrate the ceph.conf file with custom changes, wait until the engulf/upgrade process successfully finishes. Then edit the /srv/pillar/ceph/proposals/config/stack/default/ceph/cluster.yml file and comment the following line:

      configuration_init: default-import

      Save the file and follow the information in Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.10 “Custom ceph.conf File”.

    • The cluster's various keyrings will be saved to the following directories:

      /srv/salt/ceph/admin/cache/
      /srv/salt/ceph/mon/cache/
      /srv/salt/ceph/osd/cache/
      /srv/salt/ceph/mds/cache/
      /srv/salt/ceph/rgw/cache/

      Verify that the keyring files exist, and that there is no keyring file in the following directory (the Ceph Manager did not exist before SUSE Enterprise Storage 5):

      /srv/salt/ceph/mgr/cache/
  11. The salt-run populate.engulf_existing_cluster command does not handle importing the iSCSI Gateway configurations. If your cluster includes iSCSI Gateways, import their configurations manually:

    1. On one of the iSCSI Gateway nodes, export the current lrbd.conf and copy it to the Salt master node:

      root@minion > lrbd -o >/tmp/lrbd.conf
      root@minion > scp /tmp/lrbd.conf admin:/srv/salt/ceph/igw/cache/lrbd.conf
    2. On the Salt master node, add the default iSCSI Gateway configuration to the DeepSea setup:

      root@master # mkdir -p /srv/pillar/ceph/stack/ceph/
      root@master # echo 'igw_config: default-ui' >> /srv/pillar/ceph/stack/ceph/cluster.yml
      root@master # chown salt:salt /srv/pillar/ceph/stack/ceph/cluster.yml
    3. Add the iSCSI Gateway roles to policy.cfg and save the file:

      role-igw/stack/default/ceph/minions/ses-1.ses.suse.yml
      role-igw/cluster/ses-1.ses.suse.sls
      [...]
  12. Run Stage 1 to create all possible roles:

    root@master # salt-run state.orch ceph.stage.1
  13. Generate required subdirectories under /srv/pillar/ceph/stack:

    root@master # salt-run push.proposal
  14. Verify that there is a working DeepSea-managed cluster with correctly assigned roles:

    root@master # salt target pillar.get roles

    Compare the output with the actual layout of the cluster.

  15. Calamari leaves a scheduled Salt job running to check the cluster status. Remove the job:

    root@minion > salt target schedule.delete ceph.heartbeat
  16. From this point on, follow the procedure described in Section 4.3, “Upgrade from SUSE Enterprise Storage 4 (DeepSea Deployment) to 5”.

4.5 Upgrade from SUSE Enterprise Storage 4 (Crowbar Deployment) to 5

Important
Important: Software Requirements

You need to have the following software installed and updated to the latest package versions on all the Ceph nodes you want to upgrade before you can start with the upgrade procedure:

  • SUSE Linux Enterprise Server 12 SP2

  • SUSE Enterprise Storage 4

To upgrade SUSE Enterprise Storage 4 deployed using Crowbar to version 5, follow these steps:

  1. For each Ceph node (including the Calamari node), stop and disable all Crowbar-related services:

    root@minion > sudo systemctl stop chef-client
    root@minion > sudo systemctl disable chef-client
    root@minion > sudo systemctl disable crowbar_join
    root@minion > sudo systemctl disable crowbar_notify_shutdown
  2. For each Ceph node (including the Calamari node), verify that the software repositories point to SUSE Enterprise Storage 5 and SUSE Linux Enterprise Server 12 SP3 products. If repositories pointing to older product versions are still present, disable them.

  3. For each Ceph node (including the Calamari node), verify that the salt-minion is installed. If not, install it:

    root@minion > sudo zypper in salt salt-minion
  4. For the Ceph nodes that did not have the salt-minion package installed, create the file /etc/salt/minion.d/master.conf with the master option pointing to the full Calamari node hostname:

    master: full_calamari_hostname
    Tip
    Tip

    The existing Salt minions have the master: option already set in /etc/salt/minion.d/calamari.conf. The configuration file name does not matter; what matters is that the file resides in the /etc/salt/minion.d/ directory.

    Enable and start the salt-minion service:

    root@minion > sudo systemctl enable salt-minion
    root@minion > sudo systemctl start salt-minion
  5. On the Calamari node, accept any remaining salt minion keys:

    root@master # salt-key -L
    [...]
    Unaccepted Keys:
    d52-54-00-16-45-0a.example.com
    d52-54-00-70-ac-30.example.com
    [...]
    
    root@master # salt-key -A
    The following keys are going to be accepted:
    Unaccepted Keys:
    d52-54-00-16-45-0a.example.com
    d52-54-00-70-ac-30.example.com
    Proceed? [n/Y] y
    Key for minion d52-54-00-16-45-0a.example.com accepted.
    Key for minion d52-54-00-70-ac-30.example.com accepted.
  6. If Ceph was deployed on the public network and no VLAN interface is present, add a VLAN interface on Crowbar's public network to the Calamari node.

  7. Upgrade the Calamari node to SUSE Linux Enterprise Server 12 SP3 and SUSE Enterprise Storage 5, either by using zypper migration or your favorite method. From here onwards, the Calamari node becomes the Salt master. After the upgrade, reboot the Salt master.

  8. Install DeepSea on the Salt master:

    root@master # zypper in deepsea
  9. Specify the deepsea_minions option to include the correct group of Salt minions into deployment stages. Refer to Section 3.2.2.3, “Set the deepsea_minions Option” for more details.

  10. DeepSea expects all Ceph nodes to have an identical /etc/ceph/ceph.conf. Crowbar deploys a slightly different ceph.conf to each node, so you need to consolidate them:

    • Remove the osd crush location hook option; it was included by Calamari.

    • Remove the public addr option from the [mon] section.

    • Remove the port numbers from the mon host option.

  11. If you were running the Object Gateway, Crowbar deployed a separate /etc/ceph/ceph.conf.radosgw file to keep the keystone secrets separated from the regular ceph.conf file. Crowbar also added a custom /etc/systemd/system/ceph-radosgw@.service file. Because DeepSea does not support it, you need to remove it:

    • Append all [client.rgw....] sections from the ceph.conf.radosgw file to /etc/ceph/ceph.conf on all nodes.

    • On the Object Gateway node, run the following:

      root@minion > rm /etc/systemd/system/ceph-radosgw@.service
      systemctl reenable ceph-radosgw@rgw.public.$hostname
  12. Double check that ceph status works when run from the Salt master:

    root@master # ceph status
    cluster a705580c-a7ae-4fae-815c-5cb9c1ded6c2
    health HEALTH_OK
    [...]
  13. Import the existing cluster:

    root@master # salt-run populate.engulf_existing_cluster
    root@master # salt-run state.orch ceph.stage.1
    root@master # salt-run push.proposal
  14. The salt-run populate.engulf_existing_cluster command does not handle importing the iSCSI Gateway configurations. If your cluster includes iSCSI Gateways, import their configurations manually:

    1. On one of the iSCSI Gateway nodes, export the current lrbd.conf and copy it to the Salt master node:

      root@minion > lrbd -o > /tmp/lrbd.conf
      root@minion > scp /tmp/lrbd.conf admin:/srv/salt/ceph/igw/cache/lrbd.conf
    2. On the Salt master node, add the default iSCSI Gateway configuration to the DeepSea setup:

      root@master # mkdir -p /srv/pillar/ceph/stack/ceph/
      root@master # echo 'igw_config: default-ui' >> /srv/pillar/ceph/stack/ceph/cluster.yml
      root@master # chown salt:salt /srv/pillar/ceph/stack/ceph/cluster.yml
    3. Add the iSCSI Gateway roles to policy.cfg and save the file:

      role-igw/stack/default/ceph/minions/ses-1.ses.suse.yml
      role-igw/cluster/ses-1.ses.suse.sls
      [...]
    1. If you registered your systems with SUSEConnect and use SCC/SMT, no further actions need to be taken.

    2. If you are not using SCC/SMT but a Media-ISO or other package source, add the following repositories manually: SLE12-SP3 Base, SLE12-SP3 Update, SES5 Base, and SES5 Update. You can do so using the zypper command. First remove all existing software repositories, then add the required new ones, and finally refresh the repository sources:

      root # zypper sd {0..99}
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Products/Storage/5/x86_64/product/ SES5-POOL
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Updates/Storage/5/x86_64/update/ SES5-UPDATES
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Products/SLE-SERVER/12-SP3/x86_64/product/ SLES12-SP3-POOL
      root # zypper ar \
       http://172.17.2.210:82/repo/SUSE/Updates/SLE-SERVER/12-SP3/x86_64/update/ SLES12-SP3-UPDATES
      root # zypper ref

      Then change your Pillar data in order to use a different strategy. Edit

      /srv/pillar/ceph/stack/name_of_cluster/cluster.yml

      and add the following line:

      upgrade_init: zypper-dup
      Tip
      Tip

      The zypper-dup strategy requires you to manually add the latest software repositories, while the default zypper-migration relies on the repositories provided by SCC/SMT.

  15. Fix host grains to make DeepSea use short host names on the public network for the Ceph daemon instance IDs. For each node, you need to run grains.set with the new (short) host name. Before running grains.set, verify the current monitor instances by running ceph status. A before and after example follows:

    root@master # salt target grains.get host
    d52-54-00-16-45-0a.example.com:
        d52-54-00-16-45-0a
    d52-54-00-49-17-2a.example.com:
        d52-54-00-49-17-2a
    d52-54-00-76-21-bc.example.com:
        d52-54-00-76-21-bc
    d52-54-00-70-ac-30.example.com:
        d52-54-00-70-ac-30
    root@master # salt d52-54-00-16-45-0a.example.com grains.set \
     host public.d52-54-00-16-45-0a
    root@master # salt d52-54-00-49-17-2a.example.com grains.set \
     host public.d52-54-00-49-17-2a
    root@master # salt d52-54-00-76-21-bc.example.com grains.set \
     host public.d52-54-00-76-21-bc
    root@master # salt d52-54-00-70-ac-30.example.com grains.set \
     host public.d52-54-00-70-ac-30
    root@master # salt target grains.get host
    d52-54-00-76-21-bc.example.com:
        public.d52-54-00-76-21-bc
    d52-54-00-16-45-0a.example.com:
        public.d52-54-00-16-45-0a
    d52-54-00-49-17-2a.example.com:
        public.d52-54-00-49-17-2a
    d52-54-00-70-ac-30.example.com:
        public.d52-54-00-70-ac-30
  16. Run the upgrade:

    root@master # salt target state.apply ceph.updates
    root@master # salt target test.version
    root@master # salt-run state.orch ceph.maintenance.upgrade

    Every node will reboot. The cluster will come back up complaining that there is no active Ceph Manager instance. This is normal. Calamari should not be installed/running anymore at this point.

  17. Run all the required deployment stages to get the cluster to a healthy state:

    root@master # salt-run state.orch ceph.stage.0
    root@master # salt-run state.orch ceph.stage.1
    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.3
  18. To deploy openATTIC (see Book “Administration Guide”, Chapter 15 “openATTIC”), add an appropriate role-openattic (see Section 3.5.1.2, “Role Assignment”) line to /srv/pillar/ceph/proposals/policy.cfg, then run:

    root@master # salt-run state.orch ceph.stage.2
    root@master # salt-run state.orch ceph.stage.4
  19. During the upgrade, you may receive "Error EINVAL: entity [...] exists but caps do not match" errors. To fix them, refer to Section 4.3, “Upgrade from SUSE Enterprise Storage 4 (DeepSea Deployment) to 5”.

  20. Do the remaining cleanup:

    • Crowbar creates entries in /etc/fstab for each OSD. They are not necessary, so delete them.

    • Calamari leaves a scheduled Salt job running to check the cluster status. Remove the job:

      root@master # salt target schedule.delete ceph.heartbeat
    • There are still some unnecessary packages installed, mostly rubygem- and chef-related ones. Their removal is not required, but you may want to delete them by running zypper rm pkg_name.

4.6 Upgrade from SUSE Enterprise Storage 3 to 5

Important
Important: Software Requirements

You need to have the following software installed and updated to the latest package versions on all the Ceph nodes you want to upgrade before you can start with the upgrade procedure:

  • SUSE Linux Enterprise Server 12 SP1

  • SUSE Enterprise Storage 3

To upgrade the SUSE Enterprise Storage 3 cluster to version 5, follow the steps described in Procedure 4.1, “Steps to Apply to All Cluster Nodes (including the Calamari Node)” and then Procedure 4.2, “Steps to Apply to the Salt master Node”.

5 Backing Up the Cluster Configuration

This chapter explains which files on the admin node should be backed up. As soon as you are finished with your cluster deployment or migration, create a backup of these directories.

5.1 Back Up Salt Configuration

The /etc/salt/ directory can be backed up. It contains the Salt configuration files, for example the Salt master key and accepted client keys.

Backing up the Salt files is not strictly required, but it makes redeploying the Salt cluster easier. If there is no backup of these files, the Salt minions need to be registered again at the new admin node.
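
A minimal backup sketch (the /backup target directory is an assumption; use your preferred backup location):

root@master # tar czf /backup/salt-config.tar.gz /etc/salt/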

Note
Note: Security of the Salt Master Private Key

Make sure that the backup of the Salt master private key is stored in a safe location. The Salt master key can be used to manipulate all cluster nodes.

After restoring the /etc/salt directory from a backup, restart the Salt services:

root@master # systemctl restart salt-master
root@master # systemctl restart salt-minion

5.2 Back Up DeepSea Configuration

All files required by DeepSea are stored in /srv/pillar/, /srv/salt/ and /etc/salt/master.d.

If you need to redeploy the admin node, install the DeepSea package on the new node and move the backed up data back into the directories. DeepSea can then be used again without any further changes being required. Before using DeepSea again, make sure that all Salt minions are correctly registered on the admin node.
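
A minimal backup sketch covering these locations (again, the /backup target directory is an assumption):

root@master # tar czf /backup/deepsea-config.tar.gz \
 /srv/pillar/ /srv/salt/ /etc/salt/master.d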

6 Customizing the Default Configuration

You can change the default cluster configuration generated in Stage 2 (refer to DeepSea Stages Description). For example, you may need to change the network settings, or the software that is installed on the Salt master by default. You can perform the former by modifying the pillar updated after Stage 2, while the latter is usually done by creating a custom sls file and adding it to the pillar. Details are described in the following sections.

6.1 Using Customized Configuration Files

This section lists several tasks that require adding/changing your own sls files. Such a procedure is typically used when you need to change the default deployment process.

Tip
Tip: Prefix Custom .sls Files

Your custom .sls files belong in the same subdirectory as DeepSea's .sls files. To prevent your .sls files from being overwritten by files newly added to the DeepSea package, prefix their names with the custom- string.

6.1.1 Disabling a Deployment Step

If you address a specific task outside of the DeepSea deployment process and therefore need to skip it, create a 'no-operation' file following this example:

Procedure 6.1: Disabling Time Synchronization
  1. Create /srv/salt/ceph/time/disabled.sls with the following content and save it:

    disable time setting:
      test.nop
  2. Edit /srv/pillar/ceph/stack/global.yml, add the following line, and save it:

    time_init: disabled
  3. Verify by refreshing the pillar and running the step:

    root@master # salt target saltutil.pillar_refresh
    root@master # salt 'admin.ceph' state.apply ceph.time
    admin.ceph:
      Name: disable time setting - Function: test.nop - Result: Clean
    
    Summary for admin.ceph
    ------------
    Succeeded: 1
    Failed:    0
    ------------
    Total states run:     1
    Note
    Note: Unique ID

    The task ID 'disable time setting' may be any message unique within an sls file. Prevent ID collisions by specifying unique descriptions.

6.1.2 Replacing a Deployment Step

If you need to replace the default behavior of a specific step with a custom one, create a custom sls file with replacement content.

By default /srv/salt/ceph/pool/default.sls creates an rbd image called 'demo'. In our example, we do not want this image to be created, but we need two images: 'archive1' and 'archive2'.

Procedure 6.2: Replacing the demo rbd Image with Two Custom rbd Images
  1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

    wait:
      module.run:
        - name: wait.out
        - kwargs:
            'status': "HEALTH_ERR"1
        - fire_event: True
    
    archive1:
      cmd.run:
        - name: "rbd -p rbd create archive1 --size=1024"2
        - unless: "rbd -p rbd ls | grep -q archive1$"
        - fire_event: True
    
    archive2:
      cmd.run:
        - name: "rbd -p rbd create archive2 --size=768"
        - unless: "rbd -p rbd ls | grep -q archive2$"
        - fire_event: True

    1

    The wait module will pause until the Ceph cluster does not have a status of HEALTH_ERR. In fresh installations, a Ceph cluster may have this status until a sufficient number of OSDs become available and the creation of pools has completed.

    2

    The rbd command is not idempotent. If the same creation command is re-run after the image exists, the Salt state will fail. The unless statement prevents this.

  2. To call the newly created custom file instead of the default, you need to edit /srv/pillar/ceph/stack/ceph/cluster.yml, add the following line, and save it:

    pool_init: custom
  3. Verify by refreshing the pillar and running the step:

    root@master # salt target saltutil.pillar_refresh
    root@master # salt 'admin.ceph' state.apply ceph.pool
Note
Note: Authorization

The creation of pools or images requires sufficient authorization. The admin.ceph minion has an admin keyring.

Tip
Tip: Alternative Way

Another option is to change the variable in /srv/pillar/ceph/stack/ceph/roles/master.yml instead. Using this file will reduce the clutter of pillar data for other minions.

6.1.3 Modifying a Deployment Step

Sometimes you may need a specific step to do some additional tasks. We do not recommend modifying the related state file as it may complicate a future upgrade. Instead, create a separate file to carry out the additional tasks identical to what was described in Section 6.1.2, “Replacing a Deployment Step”.

Name the new sls file descriptively. For example, if you need to create two rbd images in addition to the demo image, name the file archive.sls.

Procedure 6.3: Creating Two Additional rbd Images
  1. Create /srv/salt/ceph/pool/custom.sls with the following content and save it:

    include:
     - .archive
     - .default
    Tip
    Tip: Include Precedence

    In this example, Salt will create the archive images and then create the demo image. The order does not matter in this example. To change the order, reverse the lines after the include: directive.

    You can add the include line directly to archive.sls and all the images will get created as well. However, regardless of where the include line is placed, Salt processes the steps in the included file first. Although this behavior can be overridden with requires and order statements, a separate file that includes the others guarantees the order and reduces the chances of confusion.

  2. Edit /srv/pillar/ceph/stack/ceph/cluster.yml, add the following line, and save it:

    pool_init: custom
  3. Verify by refreshing the pillar and running the step:

    root@master # salt target saltutil.pillar_refresh
    root@master # salt 'admin.ceph' state.apply ceph.pool

6.1.4 Modifying a Deployment Stage

If you need to add a completely separate deployment step, create three new files—an sls file that performs the command, an orchestration file, and a custom file which aligns the new step with the original deployment steps.

For example, if you need to run logrotate on all minions as part of the preparation stage:

First create an sls file and include the logrotate command.

Procedure 6.4: Running logrotate on all Salt minions
  1. Create a directory such as /srv/salt/ceph/logrotate.

  2. Create /srv/salt/ceph/logrotate/init.sls with the following content and save it:

    rotate logs:
      cmd.run:
        - name: "/usr/sbin/logrotate /etc/logrotate.conf"
  3. Verify that the command works on a minion:

    root@master # salt 'admin.ceph' state.apply ceph.logrotate

Because the orchestration file needs to run before all other preparation steps, add it to the Prep stage 0:

  1. Create /srv/salt/ceph/stage/prep/logrotate.sls with the following content and save it:

    logrotate:
      salt.state:
        - tgt: '*'
        - sls: ceph.logrotate
  2. Verify that the orchestration file works:

    root@master # salt-run state.orch ceph.stage.prep.logrotate

The last file is the custom one which includes the additional step with the original steps:

  1. Create /srv/salt/ceph/stage/prep/custom.sls with the following content and save it:

    include:
      - .logrotate
      - .master
      - .minion
  2. Override the default behavior. Edit /srv/pillar/ceph/stack/global.yml, add the following line, and save the file:

    stage_prep: custom
  3. Verify that Stage 0 works:

    root@master # salt-run state.orch ceph.stage.0
Note
Note: Why global.yml?

The global.yml file is chosen over cluster.yml because during the prep stage no minion belongs to the Ceph cluster yet, and therefore no minion has access to any settings in cluster.yml.

6.1.5 Disabling Updates and Reboots during Stage 0

During Stage 0 (refer to DeepSea Stages Description for more information on DeepSea stages), the Salt master and Salt minions may reboot because newly updated packages, for example the kernel, require rebooting the system.

To prevent updating or rebooting cluster nodes during Stage 0, edit /srv/pillar/ceph/stack/ceph/cluster.yml and add the stage_prep_master or stage_prep_minion options, depending on whether you need to modify the behavior of the Salt master, of all Salt minions, or of all nodes.

Both options accept the following values:

default-no-update-no-reboot

Prevents the nodes from both updating their packages and rebooting.

default-no-update-reboot

Prevents the nodes from updating their packages, but reboots are allowed.

default-update-no-reboot

Prevents the nodes from rebooting, but allows updating their packages.

default

Allows both updating the nodes' packages, and rebooting them as well.
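
For example, a sketch of /srv/pillar/ceph/stack/ceph/cluster.yml that keeps the Salt master untouched during Stage 0 while leaving the default behavior for the Salt minions (this particular combination is only an illustration):

stage_prep_master: default-no-update-no-reboot
stage_prep_minion: default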

6.2 Modifying Discovered Configuration

After you completed Stage 2, you may want to change the discovered configuration. To view the current settings, run:

root@master # salt target pillar.items

The output of the default configuration for a single minion is usually similar to the following:

----------
    available_roles:
        - admin
        - mon
        - storage
        - mds
        - igw
        - rgw
        - client-cephfs
        - client-radosgw
        - client-iscsi
        - mds-nfs
        - rgw-nfs
        - master
    cluster:
        ceph
    cluster_network:
        172.16.22.0/24
    fsid:
        e08ec63c-8268-3f04-bcdb-614921e94342
    master_minion:
        admin.ceph
    mon_host:
        - 172.16.21.13
        - 172.16.21.11
        - 172.16.21.12
    mon_initial_members:
        - mon3
        - mon1
        - mon2
    public_address:
        172.16.21.11
    public_network:
        172.16.21.0/24
    roles:
        - admin
        - mon
        - mds
    time_server:
        admin.ceph
    time_service:
        ntp

The above mentioned settings are distributed across several configuration files. The directory structure with these files is defined in the /srv/pillar/ceph/stack/stack.cfg file. The following files usually describe your cluster:

  • /srv/pillar/ceph/stack/global.yml - the file affects all minions in the Salt cluster.

  • /srv/pillar/ceph/stack/ceph/cluster.yml - the file affects all minions in the Ceph cluster called ceph.

  • /srv/pillar/ceph/stack/ceph/roles/role.yml - affects all minions that are assigned the specific role in the ceph cluster.

  • /srv/pillar/ceph/stack/ceph/minions/minion ID.yml - affects the individual minion.

Note
Note: Overwriting Directories with Default Values

There is a parallel directory tree that stores the default configuration setup in /srv/pillar/ceph/stack/default. Do not change values here, as they are overwritten.

The typical procedure for changing the collected configuration is the following:

  1. Find the location of the configuration item you need to change. For example, if you need to change cluster related setting such as cluster network, edit the file /srv/pillar/ceph/stack/ceph/cluster.yml.

  2. Save the file.

  3. Verify the changes by running:

    root@master # salt target saltutil.pillar_refresh

    and then

    root@master # salt target pillar.items
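
For example, a minimal sketch of changing the cluster network to a different subnet (the value is purely illustrative) is a single adjusted line in /srv/pillar/ceph/stack/ceph/cluster.yml:

cluster_network: 172.16.23.0/24

After refreshing the pillar as shown above, salt target pillar.items should report the new value.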

Part III Installation of Additional Services

7 Installation of Services to Access your Data

After you deploy your SUSE Enterprise Storage cluster you may need to install additional software for accessing your data, such as the Object Gateway or the iSCSI Gateway, or you can deploy a clustered file system on top of the Ceph cluster. This chapter mainly focuses on manual installation. If you have a cluster deployed using Salt, refer to Chapter 3, Deploying with DeepSea/Salt for a procedure on installing particular gateways or the CephFS.

8 Ceph Object Gateway

Ceph Object Gateway is an object storage interface built on top of librgw to provide applications with a RESTful gateway to Ceph clusters. It supports two interfaces:

  • S3-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the Amazon S3 RESTful API.

  • Swift-compatible: Provides object storage functionality with an interface that is compatible with a large subset of the OpenStack Swift API.

The Object Gateway daemon uses an embedded HTTP server (CivetWeb) for interacting with the Ceph cluster. Since it provides interfaces compatible with OpenStack Swift and Amazon S3, the Object Gateway has its own user management. Object Gateway can store data in the same cluster that is used to store data from CephFS clients or RADOS Block Device clients. The S3 and Swift APIs share a common name space, so you may write data with one API and retrieve it with the other.

Important
Important: Object Gateway Deployed by DeepSea

Since SUSE Enterprise Storage 5, the Object Gateway is installed as a DeepSea role; therefore, you do not need to install it manually.

To install the Object Gateway during the cluster deployment, see Section 3.3, “Cluster Deployment”.

To add a new node with Object Gateway to the cluster, see Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.2 “Adding New Roles to Nodes”.

8.1 Object Gateway Manual Installation

  1. Install Object Gateway on a node that is not using port 80. For example, a node already running openATTIC occupies port 80. The following command installs all required components:

    cephadm > sudo zypper ref && sudo zypper in ceph-radosgw
  2. If the Apache server from the previous Object Gateway instance is running, stop it and disable the relevant service:

    cephadm > sudo systemctl stop apache2.service
    cephadm > sudo systemctl disable apache2.service
  3. Edit /etc/ceph/ceph.conf and add the following lines:

    [client.rgw.gateway_host]
     rgw frontends = "civetweb port=80"
    Tip
    Tip

    If you want to configure Object Gateway/CivetWeb for use with SSL encryption, modify the line accordingly:

    rgw frontends = civetweb port=7480s ssl_certificate=path_to_certificate.pem
  4. Restart the Object Gateway service.

    cephadm > sudo systemctl restart ceph-radosgw@rgw.gateway_host

8.1.1 Object Gateway Configuration

Several steps are required to configure an Object Gateway.

8.1.1.1 Basic Configuration

Configuring a Ceph Object Gateway requires a running Ceph Storage Cluster. The Ceph Object Gateway is a client of the Ceph Storage Cluster. As a Ceph Storage Cluster client, it requires:

  • A host name for the gateway instance, for example gateway.

  • A storage cluster user name with appropriate permissions and a keyring.

  • Pools to store its data.

  • A data directory for the gateway instance.

  • An instance entry in the Ceph configuration file.

Each instance must have a user name and key to communicate with a Ceph storage cluster. In the following steps, we use a monitor node to create a bootstrap keyring, then create the Object Gateway instance user keyring based on the bootstrap one. Then, we create a client user name and key. Next, we add the key to the Ceph Storage Cluster. Finally, we distribute the keyring to the node containing the gateway instance.

  1. Create a keyring for the gateway:

    cephadm > sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.rgw.keyring
    cephadm > sudo chmod +r /etc/ceph/ceph.client.rgw.keyring
  2. Generate a Ceph Object Gateway user name and key for each instance. As an example, we will use the name gateway after client.rgw:

    cephadm > sudo ceph-authtool /etc/ceph/ceph.client.rgw.keyring \
      -n client.rgw.gateway --gen-key
  3. Add capabilities to the key:

    cephadm > sudo ceph-authtool -n client.rgw.gateway --cap osd 'allow rwx' \
      --cap mon 'allow rwx' /etc/ceph/ceph.client.rgw.keyring
  4. Once you have created a keyring and key to enable the Ceph Object Gateway with access to the Ceph Storage Cluster, add the key to your Ceph Storage Cluster. For example:

    cephadm > sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.rgw.gateway \
      -i /etc/ceph/ceph.client.rgw.keyring
  5. Distribute the keyring to the node with the gateway instance:

    cephadm > sudo scp /etc/ceph/ceph.client.rgw.keyring  ceph@hostname:/home/ceph
    cephadm > ssh hostname
    cephadm > sudo mv ceph.client.rgw.keyring /etc/ceph/ceph.client.rgw.keyring
Tip
Tip: Use Bootstrap Keyring

An alternative way is to create the Object Gateway bootstrap keyring, and then create the Object Gateway keyring from it:

  1. Create an Object Gateway bootstrap keyring on one of the monitor nodes:

    cephadm > sudo ceph \
     auth get-or-create client.bootstrap-rgw mon 'allow profile bootstrap-rgw' \
     --connect-timeout=25 \
     --cluster=ceph \
     --name mon. \
     --keyring=/var/lib/ceph/mon/ceph-node_host/keyring \
     -o /var/lib/ceph/bootstrap-rgw/keyring
  2. Create the /var/lib/ceph/radosgw/ceph-rgw_name directory for storing the bootstrap keyring:

    cephadm > sudo mkdir \
    /var/lib/ceph/radosgw/ceph-rgw_name
  3. Create an Object Gateway keyring from the newly created bootstrap keyring:

    cephadm > sudo ceph \
     auth get-or-create client.rgw.rgw_name osd 'allow rwx' mon 'allow rw' \
     --connect-timeout=25 \
     --cluster=ceph \
     --name client.bootstrap-rgw \
     --keyring=/var/lib/ceph/bootstrap-rgw/keyring \
     -o /var/lib/ceph/radosgw/ceph-rgw_name/keyring
  4. Copy the Object Gateway keyring to the Object Gateway host:

    cephadm > sudo scp \
    /var/lib/ceph/radosgw/ceph-rgw_name/keyring \
    rgw_host:/var/lib/ceph/radosgw/ceph-rgw_name/keyring

8.1.1.2 Create Pools (Optional)

Ceph Object Gateways require Ceph Storage Cluster pools to store specific gateway data. If the user you created has proper permissions, the gateway will create the pools automatically. However, ensure that you have set an appropriate default number of placement groups per pool in the Ceph configuration file.
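
For example, a sketch of setting cluster-wide placement group defaults in /etc/ceph/ceph.conf could look as follows; the osd pool default pg num and osd pool default pgp num options are standard Ceph settings, and the value 128 is illustrative only:

[global]
osd pool default pg num = 128
osd pool default pgp num = 128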

The pool names follow the {zone-name}.{pool-name} syntax. When configuring a gateway with the default region and zone, the default zone name is 'default' as in our example:

.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.buckets.index
default.rgw.buckets.data

To create the pools manually, see Book “Administration Guide”, Chapter 7 “Managing Storage Pools”, Section 7.2.2 “Create a Pool”.

8.1.1.3 Adding Gateway Configuration to Ceph

Add the Ceph Object Gateway configuration to the Ceph Configuration file. The Ceph Object Gateway configuration requires you to identify the Ceph Object Gateway instance. Then, specify the host name where you installed the Ceph Object Gateway daemon, a keyring (for use with cephx), and optionally a log file. For example:

[client.rgw.instance-name]
host = hostname
keyring = /etc/ceph/ceph.client.rgw.keyring
Tip
Tip: Object Gateway Log File

To override the default Object Gateway log file, include the following:

log file = /var/log/radosgw/client.rgw.instance-name.log

The [client.rgw.*] part of the instance name identifies this section of the Ceph configuration file as configuring a Ceph Storage Cluster client whose type is a Ceph Object Gateway (radosgw). The instance name follows. For example:

[client.rgw.gateway]
host = ceph-gateway
keyring = /etc/ceph/ceph.client.rgw.keyring
Note
Note

The host must be your machine host name, excluding the domain name.

Then turn off print continue. If you have it set to true, you may encounter problems with PUT operations:

rgw print continue = false

To use a Ceph Object Gateway with subdomain S3 calls (for example http://bucketname.hostname), you must add the Ceph Object Gateway DNS name under the [client.rgw.gateway] section of the Ceph configuration file:

[client.rgw.gateway]
...
rgw dns name = hostname

You should also consider installing a DNS server such as Dnsmasq on your client machine(s) when using the http://bucketname.hostname syntax. The dnsmasq.conf file should include the following settings:

address=/hostname/host-ip-address
listen-address=client-loopback-ip

Then, add the client-loopback-ip IP address as the first DNS server on the client machine(s).

8.1.1.4 Create Data Directory

Deployment scripts may not create the default Ceph Object Gateway data directory. Create data directories for each instance of a radosgw daemon if not already done. The host variables in the Ceph configuration file determine which host runs each instance of a radosgw daemon. The typical form specifies the radosgw daemon, the cluster name, and the daemon ID.

cephadm > sudo mkdir -p /var/lib/ceph/radosgw/cluster-id

Using the example ceph.conf settings above, you would execute the following:

cephadm > sudo mkdir -p /var/lib/ceph/radosgw/ceph-radosgw.gateway

8.1.1.5 Restart Services and Start the Gateway

To ensure that all components have reloaded their configurations, we recommend restarting your Ceph Storage Cluster service. Then, start up the radosgw service. For more information, see Book “Administration Guide”, Chapter 2 “Introduction” and Book “Administration Guide”, Chapter 11 “Ceph Object Gateway”, Section 11.3 “Operating the Object Gateway Service”.

When the service is up and running, you can make an anonymous GET request to see if the gateway returns a response. A simple HTTP request to the domain name should return the following:

<ListAllMyBucketsResult>
      <Owner>
              <ID>anonymous</ID>
              <DisplayName/>
      </Owner>
      <Buckets/>
</ListAllMyBucketsResult>
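
For example, assuming the gateway answers on the host name gateway.example.com (a placeholder), such an anonymous request can be issued with curl:

cephadm > curl http://gateway.example.com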

9 Installation of iSCSI Gateway

iSCSI is a storage area network (SAN) protocol that allows clients (called initiators) to send SCSI commands to SCSI storage devices (targets) on remote servers. SUSE Enterprise Storage includes a facility that opens Ceph storage management to heterogeneous clients, such as Microsoft Windows* and VMware* vSphere, through the iSCSI protocol. Multipath iSCSI access enables availability and scalability for these clients, and the standardized iSCSI protocol also provides an additional layer of security isolation between clients and the SUSE Enterprise Storage cluster. The configuration facility is named lrbd. Using lrbd, Ceph storage administrators can define thin-provisioned, replicated, highly-available volumes supporting read-only snapshots, read-write clones, and automatic resizing with Ceph RADOS Block Device (RBD). Administrators can then export volumes either via a single lrbd gateway host, or via multiple gateway hosts supporting multipath failover. Linux, Microsoft Windows, and VMware hosts can connect to volumes using the iSCSI protocol, which makes them available like any other SCSI block device. This means SUSE Enterprise Storage customers can effectively run a complete block-storage infrastructure subsystem on Ceph that provides all the features and benefits of a conventional SAN, enabling future growth.

This chapter introduces detailed information to set up a Ceph cluster infrastructure together with an iSCSI gateway so that the client hosts can use remotely stored data as local storage devices using the iSCSI protocol.

9.1 iSCSI Block Storage

iSCSI is an implementation of the Small Computer System Interface (SCSI) command set using the Internet Protocol (IP), specified in RFC 3720. iSCSI is implemented as a service where a client (the initiator) talks to a server (the target) via a session on TCP port 3260. An iSCSI target's IP address and port are called an iSCSI portal, where a target can be exposed through one or more portals. The combination of a target and one or more portals is called the target portal group (TPG).

The underlying data link layer protocol for iSCSI is commonly Ethernet. More specifically, modern iSCSI infrastructures use 10 Gigabit Ethernet or faster networks for optimal throughput. 10 Gigabit Ethernet connectivity between the iSCSI gateway and the back-end Ceph cluster is strongly recommended.

9.1.1 The Linux Kernel iSCSI Target

The Linux kernel iSCSI target was originally named LIO for linux-iscsi.org, the project's original domain and Web site. For some time, no fewer than four competing iSCSI target implementations were available for the Linux platform, but LIO ultimately prevailed as the single iSCSI reference target. The mainline kernel code for LIO uses the simple, but somewhat ambiguous name "target", distinguishing between "target core" and a variety of front-end and back-end target modules.

The most commonly used front-end module is arguably iSCSI. However, LIO also supports Fibre Channel (FC), Fibre Channel over Ethernet (FCoE) and several other front-end protocols. At this time, only the iSCSI protocol is supported by SUSE Enterprise Storage.

The most frequently used target back-end module is one that is capable of simply re-exporting any available block device on the target host. This module is named iblock. However, LIO also has an RBD-specific back-end module supporting parallelized multipath I/O access to RBD images.

9.1.2 iSCSI Initiators

This section introduces brief information on iSCSI initiators used on Linux, Microsoft Windows, and VMware platforms.

9.1.2.1 Linux

The standard initiator for the Linux platform is open-iscsi. open-iscsi launches a daemon, iscsid, which the user can then use to discover iSCSI targets on any given portal, log in to targets, and map iSCSI volumes. iscsid communicates with the SCSI mid layer to create in-kernel block devices that the kernel can then treat like any other SCSI block device on the system. The open-iscsi initiator can be deployed in conjunction with the Device Mapper Multipath (dm-multipath) facility to provide a highly available iSCSI block device.
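
As a minimal sketch of this workflow (the portal address and target IQN are example values), discovery and login with iscsiadm look like this:

root # iscsiadm -m discovery -t sendtargets -p 192.168.124.104
root # iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol \
  -p 192.168.124.104 --login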

9.1.2.2 Microsoft Windows and Hyper-V

The default iSCSI initiator for the Microsoft Windows operating system is the Microsoft iSCSI initiator. The iSCSI service can be configured via a graphical user interface (GUI), and supports multipath I/O for high availability.

9.1.2.3 VMware

The default iSCSI initiator for VMware vSphere and ESX is the VMware ESX software iSCSI initiator, vmkiscsi. When enabled, it can be configured either from the vSphere client, or using the vmkiscsi-tool command. You can then format storage volumes connected through the vSphere iSCSI storage adapter with VMFS, and use them like any other VM storage device. The VMware initiator also supports multipath I/O for high availability.

9.2 General Information about lrbd

lrbd combines the benefits of RADOS Block Devices with the ubiquitous versatility of iSCSI. By employing lrbd on an iSCSI target host (known as the lrbd gateway), any application that needs to make use of block storage can benefit from Ceph, even if it does not speak any Ceph client protocol. Instead, users can use iSCSI or any other target front-end protocol to connect to an LIO target, which translates all target I/O to RBD storage operations.

Ceph Cluster with a Single iSCSI Gateway
Figure 9.1: Ceph Cluster with a Single iSCSI Gateway

lrbd is inherently highly-available and supports multipath operations. Thus, downstream initiator hosts can use multiple iSCSI gateways for both high availability and scalability. When communicating with an iSCSI configuration with more than one gateway, initiators may load-balance iSCSI requests across multiple gateways. In the event of a gateway failing, being temporarily unreachable, or being disabled for maintenance, I/O will transparently continue via another gateway.

Ceph Cluster with Multiple iSCSI Gateways
Figure 9.2: Ceph Cluster with Multiple iSCSI Gateways

9.3 Deployment Considerations

A minimum configuration of SUSE Enterprise Storage with lrbd consists of the following components:

  • A Ceph storage cluster. The Ceph cluster consists of a minimum of four physical servers hosting at least eight object storage daemons (OSDs) each. In such a configuration, three OSD nodes also double as monitor (MON) hosts.

  • An iSCSI target server running the LIO iSCSI target, configured via lrbd.

  • An iSCSI initiator host, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

A recommended production configuration of SUSE Enterprise Storage with lrbd consists of:

  • A Ceph storage cluster. A production Ceph cluster consists of any number of (typically more than 10) OSD nodes, each typically running 10-12 object storage daemons (OSDs), with no fewer than three dedicated MON hosts.

  • Several iSCSI target servers running the LIO iSCSI target, configured via lrbd. For iSCSI fail-over and load-balancing, these servers must run a kernel supporting the target_core_rbd module. Update packages are available from the SUSE Linux Enterprise Server maintenance channel.

  • Any number of iSCSI initiator hosts, running open-iscsi (Linux), the Microsoft iSCSI Initiator (Microsoft Windows), or any other compatible iSCSI initiator implementation.

9.4 Installation and Configuration

This section describes steps to install and configure an iSCSI Gateway on top of SUSE Enterprise Storage.

9.4.1 Deploy the iSCSI Gateway to a Ceph Cluster

You can deploy the iSCSI Gateway either during Ceph cluster deployment process, or add it to an existing cluster using DeepSea.

To include the iSCSI Gateway during the cluster deployment process, refer to Section 3.5.1.2, “Role Assignment”.

To add the iSCSI Gateway to an existing cluster, refer to Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.2 “Adding New Roles to Nodes”.

9.4.2 Create RBD Images

RBD images are created in the Ceph store and subsequently exported to iSCSI. We recommend that you use a dedicated RADOS pool for this purpose. You can create a volume from any host that is able to connect to your storage cluster using the Ceph rbd command line utility. This requires the client to have at least a minimal ceph.conf configuration file, and appropriate CephX authentication credentials.

To create a new volume for subsequent export via iSCSI, use the rbd create command, specifying the volume size in megabytes. For example, in order to create a 100 GB volume named testvol in the pool named iscsi, run:

root # rbd --pool iscsi create --size=102400 testvol

The above command creates an RBD volume in the default format 2.

Note
Note

Since SUSE Enterprise Storage 3, the default volume format is 2, and format 1 is deprecated. However, you can still create the deprecated format 1 volumes with the --image-format 1 option.

9.4.3 Export RBD Images via iSCSI

To export RBD images via iSCSI, use the lrbd utility. lrbd allows you to create, review, and modify the iSCSI target configuration, which uses a JSON format.

Tip
Tip: Import Changes into openATTIC

Any changes made to the iSCSI Gateway configuration using the lrbd command are not visible to DeepSea and openATTIC. To import your manual changes, you need to export the iSCSI Gateway configuration to a file:

root@minion > lrbd -o /tmp/lrbd.conf

Then copy it to the Salt master so that DeepSea and openATTIC can see it:

root@minion > scp /tmp/lrbd.conf ses5master:/srv/salt/ceph/igw/cache/lrbd.conf

Finally, edit /srv/pillar/ceph/stack/global.yml and set:

igw_config: default-ui

In order to edit the configuration, use lrbd -e or lrbd --edit. This command will invoke the default editor, as defined by the EDITOR environment variable. You may override this behavior by setting the -E option in addition to -e.

Below is an example configuration for

  • two iSCSI gateway hosts named iscsi1.example.com and iscsi2.example.com,

  • defining a single iSCSI target with an iSCSI Qualified Name (IQN) of iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol,

  • with a single iSCSI Logical Unit (LU),

  • backed by an RBD image named testvol in the RADOS pool rbd,

  • and exporting the target via two portals named "east" and "west":

{
    "auth": [
        {
            "target": "iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol",
            "authentication": "none"
        }
    ],
    "targets": [
        {
            "target": "iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol",
            "hosts": [
                {
                    "host": "iscsi1.example.com",
                    "portal": "east"
                },
                {
                    "host": "iscsi2.example.com",
                    "portal": "west"
                }
            ]
        }
    ],
    "portals": [
        {
            "name": "east",
            "addresses": [
                "192.168.124.104"
            ]
        },
        {
            "name": "west",
            "addresses": [
                "192.168.124.105"
            ]
        }
    ],
    "pools": [
        {
            "pool": "rbd",
            "gateways": [
                {
                    "target": "iqn.2003-01.org.linux-iscsi.iscsi.x86:testvol",
                    "tpg": [
                        {
                            "image": "testvol"
                        }
                    ]
                }
            ]
        }
    ]
}

Note that whenever you refer to a host name in the configuration, this host name must match the iSCSI gateway's uname -n command output.

The edited JSON is stored in the extended attributes (xattrs) of a single RADOS object per pool. This object is available to the gateway hosts where the JSON is edited, as well as to all gateway hosts connected to the same Ceph cluster. No configuration information is stored locally on the lrbd gateway.

To activate the configuration, store it in the Ceph cluster, and do one of the following things (as root):

  • Run the lrbd command (without additional options) from the command line,

or

  • Restart the lrbd service with service lrbd restart.

The lrbd "service" does not operate any background daemon. Instead, it simply invokes the lrbd command. This type of service is known as a "one-shot" service.

You should also enable lrbd to auto-configure on system start-up. To do so, run the systemctl enable lrbd command.
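
For example, to apply the configuration stored in the cluster immediately and have it re-applied automatically at boot, you would typically run the following two commands on the gateway host:

root # lrbd
root # systemctl enable lrbd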

The configuration above reflects a simple, one-gateway setup. lrbd configuration can be much more complex and powerful. The lrbd RPM package comes with an extensive set of configuration examples, which you may refer to by checking the content of the /usr/share/doc/packages/lrbd/samples directory after installation. The samples are also available from https://github.com/SUSE/lrbd/tree/master/samples.

9.4.4 Optional Settings

The following settings may be useful for some environments. For images, there are uuid, lun, retries, sleep, and retry_errors attributes. The first two—uuid and lun—allow hardcoding of the 'uuid' or 'lun' for a specific image. You can specify either of them for an image. The retries, sleep, and retry_errors options affect attempts to map an RBD image.

"pools": [
    {
        "pool": "rbd",
        "gateways": [
        {
        "host": "igw1",
        "tpg": [
                    {
                        "image": "archive",
                        "uuid": "12345678-abcd-9012-efab-345678901234",
                        "lun": "2",
                        "retries": "3",
                        "sleep": "4",
                        "retry_errors": [ 95 ],
                        [...]
                    }
                ]
            }
        ]
    }
]

9.4.5 Advanced Settings

lrbd can be configured with advanced parameters which are subsequently passed on to the LIO I/O target. The parameters are divided up into iSCSI and backing store components, which can then be specified in the "targets" and "tpg" sections, respectively, of the lrbd configuration.

Warning
Warning

Changing these parameters from the default setting is not recommended.

"targets": [
    {
        [...]
        "tpg_default_cmdsn_depth": "64",
        "tpg_default_erl": "0",
        "tpg_login_timeout": "10",
        "tpg_netif_timeout": "2",
        "tpg_prod_mode_write_protect": "0",
    }
]

A description of the options follows:

tpg_default_cmdsn_depth

Default CmdSN (Command Sequence Number) depth. Limits the number of requests that an iSCSI initiator can have outstanding at any moment.

tpg_default_erl

Default error recovery level.

tpg_login_timeout

Login timeout value in seconds.

tpg_netif_timeout

NIC failure timeout in seconds.

tpg_prod_mode_write_protect

If set to 1, prevents writes to LUNs.

"pools": [
    {
        "pool": "rbd",
        "gateways": [
        {
        "host": "igw1",
        "tpg": [
                    {
                        "image": "archive",
                        "backstore_block_size": "512",
                        "backstore_emulate_3pc": "1",
                        "backstore_emulate_caw": "1",
                        "backstore_emulate_dpo": "0",
                        "backstore_emulate_fua_read": "0",
                        "backstore_emulate_fua_write": "1",
                        "backstore_emulate_model_alias": "0",
                        "backstore_emulate_rest_reord": "0",
                        "backstore_emulate_tas": "1",
                        "backstore_emulate_tpu": "0",
                        "backstore_emulate_tpws": "0",
                        "backstore_emulate_ua_intlck_ctrl": "0",
                        "backstore_emulate_write_cache": "0",
                        "backstore_enforce_pr_isids": "1",
                        "backstore_fabric_max_sectors": "8192",
                        "backstore_hw_block_size": "512",
                        "backstore_hw_max_sectors": "8192",
                        "backstore_hw_pi_prot_type": "0",
                        "backstore_hw_queue_depth": "128",
                        "backstore_is_nonrot": "1",
                        "backstore_max_unmap_block_desc_count": "1",
                        "backstore_max_unmap_lba_count": "8192",
                        "backstore_max_write_same_len": "65535",
                        "backstore_optimal_sectors": "8192",
                        "backstore_pi_prot_format": "0",
                        "backstore_pi_prot_type": "0",
                        "backstore_queue_depth": "128",
                        "backstore_unmap_granularity": "8192",
                        "backstore_unmap_granularity_alignment": "4194304"
                    }
                ]
            }
        ]
    }
]

A description of the options follows:

backstore_block_size

Block size of the underlying device.

backstore_emulate_3pc

If set to 1, enables Third Party Copy.

backstore_emulate_caw

If set to 1, enables Compare and Write.

backstore_emulate_dpo

If set to 1, turns on Disable Page Out.

backstore_emulate_fua_read

If set to 1, enables Force Unit Access read.

backstore_emulate_fua_write

If set to 1, enables Force Unit Access write.

backstore_emulate_model_alias

If set to 1, uses the back-end device name for the model alias.

backstore_emulate_rest_reord

If set to 0, the Queue Algorithm Modifier has Restricted Reordering.

backstore_emulate_tas

If set to 1, enables Task Aborted Status.

backstore_emulate_tpu

If set to 1, enables Thin Provisioning Unmap.

backstore_emulate_tpws

If set to 1, enables Thin Provisioning Write Same.

backstore_emulate_ua_intlck_ctrl

If set to 1, enables Unit Attention Interlock.

backstore_emulate_write_cache

If set to 1, turns on Write Cache Enable.

backstore_enforce_pr_isids

If set to 1, enforces persistent reservation ISIDs.

backstore_fabric_max_sectors

Maximum number of sectors the fabric can transfer at once.

backstore_hw_block_size

Hardware block size in bytes.

backstore_hw_max_sectors

Maximum number of sectors the hardware can transfer at once.

backstore_hw_pi_prot_type

If non-zero, DIF protection is enabled on the underlying hardware.

backstore_hw_queue_depth

Hardware queue depth.

backstore_is_nonrot

If set to 1, the backstore is a non-rotational device.

backstore_max_unmap_block_desc_count

Maximum number of block descriptors for UNMAP.

backstore_max_unmap_lba_count

Maximum number of LBAs for UNMAP.

backstore_max_write_same_len

Maximum length for WRITE_SAME.

backstore_optimal_sectors

Optimal request size in sectors.

backstore_pi_prot_format

DIF protection format.

backstore_pi_prot_type

DIF protection type.

backstore_queue_depth

Queue depth.

backstore_unmap_granularity

UNMAP granularity.

backstore_unmap_granularity_alignment

UNMAP granularity alignment.

For targets, the tpg attributes allow tuning of kernel parameters. Use with caution.

"targets": [
{
    "host": "igw1",
    "target": "iqn.2003-01.org.linux-iscsi.generic.x86:sn.abcdefghijk",
    "tpg_default_cmdsn_depth": "64",
    "tpg_default_erl": "0",
    "tpg_login_timeout": "10",
    "tpg_netif_timeout": "2",
    "tpg_prod_mode_write_protect": "0",
    "tpg_t10_pi": "0"
}
]
Tip
Tip

If a site needs statically assigned LUNs, then assign numbers to each LUN.

9.5 Exporting RADOS Block Device Images using tcmu-runner

Since version 5, SUSE Enterprise Storage ships a user-space RBD back-end for tcmu-runner (see man 8 tcmu-runner for details).

Warning
Warning: Technology Preview

tcmu-runner based iSCSI Gateway deployments are currently a technology preview. See Chapter 9, Installation of iSCSI Gateway for instructions on kernel-based iSCSI Gateway deployment with lrbd.

Unlike kernel-based lrbd iSCSI Gateway deployments, tcmu-runner based iSCSI Gateways do not offer support for multipath I/O or SCSI Persistent Reservations.

As DeepSea and openATTIC do not currently support tcmu-runner deployments, you need to manage the installation, deployment, and monitoring manually.

9.5.1 Installation

On your iSCSI Gateway node, install the tcmu-runner-handler-rbd package from the SUSE Enterprise Storage 5 media, together with the libtcmu1 and tcmu-runner package dependencies. Install the targetcli-fb package for configuration purposes. Note that the targetcli-fb package is incompatible with the 'non-fb' version of the targetcli package.

Confirm that the tcmu-runner systemd service is running:

root # systemctl enable tcmu-runner
root # systemctl status tcmu-runner
● tcmu-runner.service - LIO Userspace-passthrough daemon
  Loaded: loaded (/usr/lib/systemd/system/tcmu-runner.service; static; vendor
  preset: disabled)
    Active: active (running) since ...

9.5.2 Configuration and Deployment

Create a RADOS Block Device image on your existing Ceph cluster. In the following example, we will use a 10G image called 'tcmu-lu' located in the 'rbd' pool.
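
A minimal sketch of creating such an image with the rbd command (pool and image name taken from this example, size given in megabytes) is:

root # rbd create --size 10240 rbd/tcmu-lu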

Following RADOS Block Device image creation, run targetcli, and ensure that the tcmu-runner RBD handler (plug-in) is available:

root # targetcli
targetcli shell version 2.1.fb46
Copyright 2011-2013 by Datera, Inc and others.
For help on commands, type 'help'.

/> ls
o- / ................................... [...]
  o- backstores ........................ [...]
...
  | o- user:rbd ......... [Storage Objects: 0]

Create a backstore configuration entry for the RBD image:

/> cd backstores/user:rbd
/backstores/user:rbd> create tcmu-lu 10G /rbd/tcmu-lu
Created user-backed storage object tcmu-lu size 10737418240.

Create an iSCSI transport configuration entry. In the following example, the target IQN "iqn.2003-01.org.linux-iscsi.tcmu-gw.x8664:sn.cb3d2a3a" is automatically generated by targetcli for use as a unique iSCSI target identifier:

/backstores/user:rbd> cd /iscsi
/iscsi> create
Created target iqn.2003-01.org.linux-iscsi.tcmu-gw.x8664:sn.cb3d2a3a.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.

Create an ACL entry for the iSCSI initiator(s) that you wish to connect to the target. In the following example, an initiator IQN of "iqn.1998-01.com.vmware:esxi-872c4888" is used:

/iscsi> cd
iqn.2003-01.org.linux-iscsi.tcmu-gw.x8664:sn.cb3d2a3a/tpg1/acls/
/iscsi/iqn.20...a3a/tpg1/acls> create iqn.1998-01.com.vmware:esxi-872c4888

Finally, link the previously created RBD backstore configuration to the iSCSI target:

/iscsi/iqn.20...a3a/tpg1/acls> cd ../luns
/iscsi/iqn.20...a3a/tpg1/luns> create /backstores/user:rbd/tcmu-lu
Created LUN 0.
Created LUN 0->0 mapping in node ACL iqn.1998-01.com.vmware:esxi-872c4888

Exit the shell to save the existing configuration:

/iscsi/iqn.20...a3a/tpg1/luns> exit
Global pref auto_save_on_exit=true
Last 10 configs saved in /etc/target/backup.
Configuration saved to /etc/target/saveconfig.json

9.5.3 Usage

From your iSCSI initiator (client) node, connect to your newly provisioned iSCSI target using the IQN and host name configured above.
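
On a Linux client, for example, this again follows the open-iscsi workflow described in Section 9.1.2.1 (the gateway host name is a placeholder; the IQN is the one generated in the example above):

root # iscsiadm -m discovery -t sendtargets -p tcmu-gw.example.com
root # iscsiadm -m node -T iqn.2003-01.org.linux-iscsi.tcmu-gw.x8664:sn.cb3d2a3a --login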

10 Installation of CephFS

The Ceph file system (CephFS) is a POSIX-compliant file system that uses a Ceph storage cluster to store its data. CephFS uses the same cluster system as Ceph block devices, Ceph object storage with its S3 and Swift APIs, or native bindings (librados).

To use CephFS, you need to have a running Ceph storage cluster, and at least one running Ceph metadata server.

10.1 Supported CephFS Scenarios and Guidance

With SUSE Enterprise Storage, SUSE introduces official support for many scenarios in which the scale-out and distributed component CephFS is used. This section describes hard limits and provides guidance for the suggested use cases.

A supported CephFS deployment must meet these requirements:

  • A minimum of one Metadata Server. SUSE recommends deploying several nodes with the MDS role. Only one will be 'active' and the rest will be 'passive'. Remember to specify all the MON nodes in the mount command when mounting the CephFS from a client (a minimal mount sketch follows this list).

  • CephFS snapshots are disabled (default) and not supported in this version.

  • Clients are SUSE Linux Enterprise Server 12 SP2 or SP3 based, using the cephfs kernel module driver. The FUSE module is not supported.

  • CephFS quotas are not supported in SUSE Enterprise Storage, as support for quotas is implemented in the FUSE client only.

  • CephFS supports file layout changes as documented in http://docs.ceph.com/docs/jewel/cephfs/file-layouts/. However, while the file system is mounted by any client, new data pools may not be added to an existing CephFS file system (ceph mds add_data_pool). They may only be added while the file system is unmounted.
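
A minimal sketch of such a mount from a client using the cephfs kernel module driver, assuming three MON nodes and an admin secret stored in /etc/ceph/admin.secret (host names and paths are placeholders), looks like this:

root # mount -t ceph mon1:6789,mon2:6789,mon3:6789:/ /mnt/cephfs \
  -o name=admin,secretfile=/etc/ceph/admin.secret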

10.2 Ceph Metadata Server

Ceph metadata server (MDS) stores metadata for the CephFS. Ceph block devices and Ceph object storage do not use MDS. MDSs make it possible for POSIX file system users to execute basic commands—such as ls or find—without placing an enormous burden on the Ceph storage cluster.

10.2.1 Adding a Metadata Server

You can deploy MDS either during the initial cluster deployment process as described in Section 3.3, “Cluster Deployment”, or add it to an already deployed cluster as described in Book “Administration Guide”, Chapter 1 “Salt Cluster Administration”, Section 1.1 “Adding New Cluster Nodes”.

After you deploy your MDS, allow the Ceph OSD/MDS service in the firewall setting of the server where MDS is deployed: Start yast, navigate to Security and Users › Firewall › Allowed Services and in the Service to Allow drop-down menu select Ceph OSD/MDS. If the Ceph MDS node is not allowed full traffic, mounting of a file system fails, even though other operations may work properly.

10.2.2 Configuring a Metadata Server

You can fine-tune the MDS behavior by inserting relevant options in the ceph.conf configuration file.

MDS Cache Size
mds cache memory limit

The soft memory limit (in bytes) that the MDS will enforce for its cache. Administrators should use this instead of the old mds cache size setting. Defaults to 1GB.

mds cache reservation

The cache reservation (memory or inodes) for the MDS cache to maintain. Once the MDS begins touching its reservation, it will recall client state until its cache size shrinks to restore the reservation. Defaults to 0.05.
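
For example, a sketch of raising the cache limit to 2 GB in ceph.conf (the values are illustrative) could look like this:

[mds]
mds cache memory limit = 2147483648
mds cache reservation = 0.05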

For a detailed list of MDS related configuration options, see http://docs.ceph.com/docs/master/cephfs/mds-config-ref/.

For a detailed list of MDS journaler configuration options, see http://docs.ceph.com/docs/master/cephfs/journaler/.

10.3 CephFS

When you have a healthy Ceph storage cluster with at least one Ceph metadata server, you can create and mount your Ceph file system. Ensure that your client has network connectivity and a proper authentication keyring.

10.3.1 Creating CephFS

A CephFS requires at least two RADOS pools: one for data and one for metadata. When configuring these pools, you might consider:

  • Using a higher replication level for the metadata pool, as any data loss in this pool can render the whole file system inaccessible.

  • Using lower-latency storage such as SSDs for the metadata pool, as this will improve the observed latency of file system operations on clients.

When you assign the role-mds in the policy.cfg, the required pools are created automatically. You can create the pools cephfs_data and cephfs_metadata manually for performance tuning before setting up the Metadata Server. DeepSea will not create these pools if they already exist.

For more information on managing pools, see Book “Administration Guide”, Chapter 7 “Managing Storage Pools”.

To create the two required pools—for example, 'cephfs_data' and 'cephfs_metadata'—with default settings for use with CephFS, run the following commands:

root # ceph osd pool create cephfs_data pg_num
root # ceph osd pool create cephfs_metadata pg_num

It is possible to use EC pools instead of replicated pools. We recommend using EC pools only for low performance requirements and infrequent random access, for example cold storage, backups, or archiving. CephFS on EC pools requires BlueStore to be enabled, and the pool must have the allow_ec_overwrites option set. You can set this option by running ceph osd pool set ec_pool allow_ec_overwrites true.

Erasure coding adds significant overhead to file system operations, especially for small updates. This overhead is inherent to using erasure coding as a fault tolerance mechanism. It is the trade-off for significantly reduced storage space overhead.
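
A sketch of the commands involved, assuming a hypothetical erasure coded data pool named cephfs_data_ec and an illustrative placement group count, is:

root # ceph osd pool create cephfs_data_ec 128 128 erasure
root # ceph osd pool set cephfs_data_ec allow_ec_overwrites true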

When the pools are created, you may enable the file system with the ceph fs new command:

root # ceph fs new fs_name metadata_pool_name data_pool_name

For example:

root # ceph fs new cephfs cephfs_metadata cephfs_data

You can check that the file system was created by listing all available CephFSs:

root # ceph fs ls
 name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

When the file system has been created, your MDS will be able to enter an active state. For example, in a single MDS system:

root # ceph mds stat
e5: 1/1/1 up
Tip
Tip: More Topics

You can find more information of specific tasks—for example mounting, unmounting, and advanced CephFS setup—in Book “Administration Guide”, Chapter 13 “Clustered File System”.

11 Installation of NFS Ganesha

NFS Ganesha provides NFS access to either the Object Gateway or the CephFS. In SUSE Enterprise Storage 5, NFS versions 3 and 4 are supported. NFS Ganesha runs in the user space instead of the kernel space and directly interacts with the Object Gateway or CephFS.

11.1 Preparation

11.1.1 General Information

To successfully deploy NFS Ganesha, you need to add a role-ganesha to your /srv/pillar/ceph/proposals/policy.cfg. For details, see Section 3.5.1, “The policy.cfg File”. NFS Ganesha also needs either a role-rgw or a role-mds present in the policy.cfg.

Although it is possible to install and run the NFS Ganesha server on an already existing Ceph node, we recommend running it on a dedicated host with access to the Ceph cluster. The client hosts are typically not part of the cluster, but they need to have network access to the NFS Ganesha server.

To enable the NFS Ganesha server at any point after the initial installation, add the role-ganesha to the policy.cfg and re-run at least DeepSea stages 2 and 4. For details, see Section 3.3, “Cluster Deployment”.

NFS Ganesha is configured via the file /etc/ganesha/ganesha.conf that exists on the NFS Ganesha node. However, this file is overwritten each time DeepSea stage 4 is executed. Therefore, we recommend editing the template used by Salt, which is the file /srv/salt/ceph/ganesha/files/ganesha.conf.j2 on the Salt master. For details about the configuration file, see Book “Administration Guide”, Chapter 14 “NFS Ganesha: Export Ceph Data via NFS”, Section 14.2 “Configuration”.

11.1.2 Summary of Requirements

The following requirements need to be met before DeepSea stages 2 and 4 can be executed to install NFS Ganesha:

  • At least one node needs to be assigned the role-ganesha.

  • You can define only one role-ganesha per minion.

  • NFS Ganesha needs either an Object Gateway or CephFS to work.

  • If NFS Ganesha is supposed to use the Object Gateway to interface with the cluster, the /srv/pillar/ceph/rgw.sls file on the Salt master needs to be populated.

11.2 Example Installation

This procedure provides an example installation that uses both the Object Gateway and CephFS File System Abstraction Layers (FSAL) of NFS Ganesha.

  1. If you have not done so, execute DeepSea stages 0 and 1 before continuing with this procedure.

    root # salt-run state.orch ceph.stage.0
    root # salt-run state.orch ceph.stage.1
  2. After having executed stage 1 of DeepSea, edit the /srv/pillar/ceph/proposals/policy.cfg and add the line

    role-ganesha/cluster/NODENAME

    Replace NODENAME with the name of a node in your cluster.

    Also make sure that a role-mds and a role-rgw are assigned.

  3. Create the file /srv/pillar/ceph/rgw.sls and insert the following content:

    rgw_configurations:
      rgw:
        users:
          - { uid: "demo", name: "Demo", email: "demo@demo.nil" }
          - { uid: "demo1", name: "Demo1", email: "demo1@demo.nil" }

    These users are later created as Object Gateway users, and API keys are generated. On the Object Gateway node, you can later run radosgw-admin user list to list all created users and radosgw-admin user info --uid=demo to obtain details about single users.

    DeepSea makes sure that Object Gateway and NFS Ganesha both receive the credentials of all users listed in the rgw section of the rgw.sls.

    The exported NFS uses these user names on the first level of the file system; in this example, the paths /demo and /demo1 would be exported.

  4. Execute at least stages 2 and 4 of DeepSea. Running stage 3 in between is recommended.

    root # salt-run state.orch ceph.stage.2
    root # salt-run state.orch ceph.stage.3 # optional but recommended
    root # salt-run state.orch ceph.stage.4
  5. Verify that NFS Ganesha is working by mounting the NFS share from a client node:

    root # mount -o sync -t nfs GANESHA_NODE:/ /mnt
    root # ls /mnt
    cephfs  demo  demo1

    /mnt should contain all exported paths. Directories for CephFS and both Object Gateway users should exist. For each bucket a user owns, a path /mnt/USERNAME/BUCKETNAME would be exported.

11.3 High Availability Active-Passive Configuration

This section provides an example of how to set up a two-node active-passive configuration of NFS Ganesha servers. The setup requires the SUSE Linux Enterprise High Availability Extension. The two nodes are called earth and mars.

For details about SUSE Linux Enterprise High Availability Extension, see https://www.suse.com/documentation/sle-ha-12/.

11.3.1 Basic Installation

In this setup earth has the IP address 192.168.1.1 and mars has the address 192.168.1.2.

Additionally, two floating virtual IP addresses are used, allowing clients to connect to the service independent of which physical node it is running on. 192.168.1.10 is used for cluster administration with Hawk2 and 192.168.2.1 is used exclusively for the NFS exports. This makes it easier to apply security restrictions later.

The following procedure describes the example installation. More details can be found at https://www.suse.com/documentation/sle-ha-12/install-quick/data/install-quick.html.

  1. Prepare the NFS Ganesha nodes on the Salt master:

    1. Run DeepSea stages 0 and 1 on the Salt master.

      root@master # salt-run state.orch ceph.stage.0
      root@master # salt-run state.orch ceph.stage.1
    2. Assign the nodes earth and mars the role-ganesha in the /srv/pillar/ceph/proposals/policy.cfg:

      role-ganesha/cluster/earth*.sls
      role-ganesha/cluster/mars*.sls
    3. Run DeepSea stages 3 and 4 on the Salt master.

      root@master # salt-run state.orch ceph.stage.3
      root@master # salt-run state.orch ceph.stage.4
  2. Register the SUSE Linux Enterprise High Availability Extension on earth and mars.

    root # SUSEConnect -r ACTIVATION_CODE -e E_MAIL
  3. Install ha-cluster-bootstrap on both nodes:

    root # zypper in ha-cluster-bootstrap
    1. Initialize the cluster on earth:

      root@earth # ha-cluster-init
    2. Let mars join the cluster:

      root@mars # ha-cluster-join -c earth
  4. Check the status of the cluster. You should see two nodes added to the cluster:

    root@earth # crm status
  5. On both nodes, disable the automatic start of the NFS Ganesha service at boot time:

    root # systemctl disable nfs-ganesha
  6. Start the crm shell on earth:

    root@earth # crm configure

    The next commands are executed in the crm shell.

  7. On earth, run the crm shell and execute the following commands to configure the resource for the NFS Ganesha daemons as a clone of the systemd resource type:

    crm(live)configure# primitive nfs-ganesha-server systemd:nfs-ganesha \
    op monitor interval=30s
    crm(live)configure# clone nfs-ganesha-clone nfs-ganesha-server meta interleave=true
    crm(live)configure# commit
    crm(live)configure# status
        2 nodes configured
        2 resources configured
    
        Online: [ earth mars ]
    
        Full list of resources:
             Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
             Started:  [ earth mars ]
  8. Create a primitive IPAddr2 with the crm shell:

    crm(live)configure# primitive ganesha-ip IPaddr2 \
    params ip=192.168.2.1 cidr_netmask=24 nic=eth0 \
    op monitor interval=10 timeout=20
    
    crm(live)# status
    Online: [ earth mars  ]
    Full list of resources:
     Clone Set: nfs-ganesha-clone [nfs-ganesha-server]
         Started: [ earth mars ]
     ganesha-ip    (ocf::heartbeat:IPaddr2):    Started earth
  9. To set up a relationship between the NFS Ganesha server and the floating virtual IP, we use colocation and ordering constraints.

    crm(live)configure# colocation ganesha-ip-with-nfs-ganesha-server inf: ganesha-ip nfs-ganesha-clone
    crm(live)configure# order ganesha-ip-after-nfs-ganesha-server Mandatory: nfs-ganesha-clone ganesha-ip
  10. Use the mount command from the client to ensure that cluster setup is complete:

    root # mount -t nfs -v -o sync,nfsvers=4 192.168.2.1:/ /mnt

11.3.2 Clean Up Resources

In the event of an NFS Ganesha failure at one of the nodes, for example earth, fix the issue and clean up the resource. Only after the resource is cleaned up can it fail back to earth in case NFS Ganesha fails on mars.

To clean up the resource:

root@earth # crm resource cleanup nfs-ganesha-clone earth
root@earth # crm resource cleanup ganesha-ip earth

11.3.3 Setting Up Ping Resource

It may happen that the server is unable to reach the client due to a network issue. A ping resource can detect and mitigate this problem. Configuring this resource is optional.

  1. Define the ping resource:

    crm(live)configure# primitive ganesha-ping ocf:pacemaker:ping \
            params name=ping dampen=3s multiplier=100 host_list="CLIENT1 CLIENT2" \
            op monitor interval=60 timeout=60 \
            op start interval=0 timeout=60 \
            op stop interval=0 timeout=60

    host_list is a list of IP addresses separated by space characters. The IP addresses will be pinged regularly to check for network outages. If a client must always have access to the NFS server, add it to host_list.

  2. Create a clone:

    crm(live)configure# clone ganesha-ping-clone ganesha-ping \
            meta interleave=true
  3. The following command creates a constraint for the NFS Ganesha service. It forces the service to move to another node when host_list is unreachable.

    crm(live)configure# location nfs-ganesha-server-with-ganesha-ping \
            nfs-ganesha-clone \
            rule -inf: not_defined ping or ping lte 0

11.3.4 NFS Ganesha HA and DeepSea

DeepSea does not support configuring NFS Ganesha HA. To prevent DeepSea from failing after NFS Ganesha HA was configured, exclude starting and stopping the NFS Ganesha service from DeepSea Stage 4:

  1. Copy /srv/salt/ceph/ganesha/default.sls to /srv/salt/ceph/ganesha/ha.sls.

  2. Remove the .service entry from /srv/salt/ceph/ganesha/ha.sls so that it looks as follows:

    include:
    - .keyring
    - .install
    - .configure
  3. Add the following line to /srv/pillar/ceph/stack/global.yml:

    ganesha_init: ha

11.4 More Information

More information can be found in Book “Administration Guide”, Chapter 14 “NFS Ganesha: Export Ceph Data via NFS”.

12 Export CephFS via Samba

This section describes how to export CephFS via a Samba/CIFS share. Samba shares can be used with Windows* clients.

Warning
Warning: Technology Preview

As of SUSE Enterprise Storage 5, exporting Samba shares is considered a technology preview and is not supported.

12.1 Example Installation

Exporting CephFS is a technology preview and not supported. To export a Samba share, you have to manually install Samba on one cluster node and configure it. Failover functionality can be provided with CTDB and the SUSE Linux Enterprise High Availability Extension.

  1. Make sure that a working CephFS already exists in your cluster. For details, see Chapter 10, Installation of CephFS.

  2. Create a Samba Gateway specific keyring on the Salt master and copy it to the Samba gateway node:

    root@master # ceph auth get-or-create client.samba.gw mon 'allow r' \
        osd 'allow *' mds 'allow *' -o ceph.client.samba.gw.keyring
    root@master # scp ceph.client.samba.gw.keyring SAMBA_NODE:/etc/ceph/

    Replace SAMBA_NODE with the name of the Samba gateway node.

  3. The following steps are executed on the Samba gateway node. Install the Samba daemon on the Samba gateway node:

    root # zypper in samba samba-ceph
  4. Edit the /etc/samba/smb.conf and add the following section:

    [SHARE_NAME]
            path = /
            vfs objects = ceph
            ceph:config_file = /etc/ceph/ceph.conf
            ceph:user_id = samba.gw
            read only = no
  5. Start and enable the Samba daemon:

    root # systemctl start smb.service
    root # systemctl enable smb.service
    root # systemctl start nmb.service
    root # systemctl enable nmb.service

12.2 High Availability Configuration

This section provides an example of how to set up a two-node high availability configuration of Samba servers. The setup requires the SUSE Linux Enterprise High Availability Extension. The two nodes are called earth (192.168.1.1) and mars (192.168.1.2).

For details about SUSE Linux Enterprise High Availability Extension, see https://www.suse.com/documentation/sle-ha-12/.

Additionally, two floating virtual IP addresses allow clients to connect to the service no matter which physical node it is running on. 192.168.1.10 is used for cluster administration with Hawk2 and 192.168.2.1 is used exclusively for the CIFS exports. This makes it easier to apply security restrictions later.

The following procedure describes the example installation. More details can be found at https://www.suse.com/documentation/sle-ha-12/install-quick/data/install-quick.html.

  1. Create a Samba Gateway specific keyring on the Salt master and copy it to both nodes:

    root@master # ceph auth get-or-create client.samba.gw mon 'allow r' \
        osd 'allow *' mds 'allow *' -o ceph.client.samba.gw.keyring
    root@master # scp ceph.client.samba.gw.keyring earth:/etc/ceph/
    root@master # scp ceph.client.samba.gw.keyring mars:/etc/ceph/
  2. Prepare earth and mars to host the Samba service:

    1. Make sure the following packages are installed before you proceed: ctdb, tdb-tools, and samba (needed for smb and nmb resources).

      root # zypper in ctdb tdb-tools samba samba-ceph
    2. Make sure the services ctdb, smb, and nmb are stopped and disabled:

      root # systemctl disable ctdb
      root # systemctl disable smb
      root # systemctl disable nmb
      root # systemctl stop smb
      root # systemctl stop nmb
    3. Open port 4379 of your firewall on all nodes. This is needed for CTDB to communicate with other cluster nodes.

    4. Create a directory for the CTDB lock on the shared file system:

      root # mkdir -p /srv/samba/
  3. On earth, create the configuration files for Samba. They will later be synchronized automatically to mars.

    1. In /etc/ctdb/nodes, list all nodes in the cluster by their private IP addresses, one address per line:

      192.168.1.1
      192.168.1.2
    2. Configure Samba. Add the following lines in the [global] section of /etc/samba/smb.conf. Use the host name of your choice in place of "CTDB-SERVER" (all nodes in the cluster will appear as one big node with this name, effectively):

      [global]
          netbios name = CTDB-SERVER
          clustering = yes
          idmap config * : backend = tdb2
          passdb backend = tdbsam
          ctdbd socket = /var/lib/ctdb/ctdb.socket

      For details about csync2, see https://www.suse.com/documentation/sle-ha-12/singlehtml/book_sleha/book_sleha.html#pro.ha.installation.setup.csync2.start.

  4. Install and bootstrap the SUSE Linux Enterprise High Availability cluster.

    1. Register the SUSE Linux Enterprise High Availability Extension on earth and mars.

      root@earth # SUSEConnect -r ACTIVATION_CODE -e E_MAIL
      root@mars # SUSEConnect -r ACTIVATION_CODE -e E_MAIL
    2. Install ha-cluster-bootstrap on both nodes:

      root@earth # zypper in ha-cluster-bootstrap
      root@mars # zypper in ha-cluster-bootstrap
    3. Initialize the cluster on earth:

      root@earth # ha-cluster-init
    4. Let mars join the cluster:

      root@mars # ha-cluster-join -c earth
  5. Check the status of the cluster. You should see two nodes added to the cluster:

    root@earth # crm status
    2 nodes configured
    1 resource configured
    
    Online: [ earth mars ]
    
    Full list of resources:
    
     admin-ip       (ocf::heartbeat:IPaddr2):       Started earth
  6. Execute the following commands on earth to configure the CTDB resource:

    root@earth # crm configure
    crm(live)configure# primitive ctdb ocf:heartbeat:CTDB params \
        ctdb_manages_winbind="false" \
        ctdb_manages_samba="false" \
        ctdb_recovery_lock="!/usr/lib64/ctdb/ctdb_mutex_ceph_rados_helper
            ceph client.samba.gw cephfs_metadata ctdb-mutex" \
        ctdb_socket="/var/lib/ctdb/ctdb.socket" \
            op monitor interval="10" timeout="20" \
            op start interval="0" timeout="90" \
            op stop interval="0" timeout="100"
    crm(live)configure# primitive nmb systemd:nmb \
        op start timeout="60" interval="0" \
        op stop timeout="60" interval="0" \
        op monitor interval="60" timeout="60"
    crm(live)configure# primitive smb systemd:smb \
        op start timeout="60" interval="0" \
        op stop timeout="60" interval="0" \
        op monitor interval="60" timeout="60"
    crm(live)configure# group g-ctdb ctdb nmb smb
    crm(live)configure# clone cl-ctdb g-ctdb meta interleave="true"
    crm(live)configure# commit

    The binary /usr/lib64/ctdb/ctdb_mutex_ceph_rados_helper in the configuration option ctdb_recovery_lock takes the parameters CLUSTER_NAME CEPHX_USER RADOS_POOL RADOS_OBJECT, in this order.

  7. Add a clustered IP address:

    crm(live)configure# primitive ip ocf:heartbeat:IPaddr2 params ip=192.168.2.1 \
        unique_clone_address="true" \
        op monitor interval="60" \
        meta resource-stickiness="0"
    crm(live)configure# clone cl-ip ip \
        meta interleave="true" clone-node-max="2" globally-unique="true"
    crm(live)configure# colocation col-with-ctdb 0: cl-ip cl-ctdb
    crm(live)configure# order o-with-ctdb 0: cl-ip cl-ctdb
    crm(live)configure# commit

    If unique_clone_address is set to true, the IPaddr2 resource agent adds a clone ID to the specified address, leading to three different IP addresses. These are usually not needed, but help with load balancing. For further information about this topic, see https://www.suse.com/documentation/sle-ha-12/book_sleha/data/cha_ha_lb.html.

  8. Check the result:

    root@earth # crm status
    Clone Set: base-clone [dlm]
         Started: [ factory-1 ]
         Stopped: [ factory-0 ]
     Clone Set: cl-ctdb [g-ctdb]
         Started: [ factory-1 ]
         Started: [ factory-0 ]
     Clone Set: cl-ip [ip] (unique)
         ip:0       (ocf:heartbeat:IPaddr2):       Started factory-0
         ip:1       (ocf:heartbeat:IPaddr2):       Started factory-1
  9. Test from a client machine. On a Linux client, run the following command to see if you can copy files from and to the system:

    root # smbclient //192.168.2.1/myshare

A Documentation Updates

This chapter lists content changes for this document since the initial release of SUSE Enterprise Storage 4. You can find changes related to the cluster deployment that apply to previous versions in https://www.suse.com/documentation/ses-4/book_storage_admin/data/ap_adm_docupdate.html.

The document was updated on the following dates:

A.1 Documentation Maintenance Update

Bugfixes
General Updates

A.2 Documentation Maintenance Update

Bugfixes

A.3 October, 2017 (Release of SUSE Enterprise Storage 5)

Bugfixes