Guest blog post by intel-ross-zwisler

Ross Zwisler
Linux Kernel Developer, Intel

Part 2: Software Configuration

SUSE Linux Enterprise Server 12, Service Pack 2 (SLES 12 SP 2) includes early support for NVDIMMs. This support includes kernel enabling as well as userspace libraries and configuration utilities available in the standard SUSE Linux Enterprise package repositories. This is the second of two blog posts providing an overview of this enabling.

The first post provided a high-level overview of the concepts involved in persistent memory enabling.

Now we can examine some examples of how to configure NVDIMMs using the tools provided in SLES 12 SP 2. We will configure our NVDIMMs so we can access their content in a variety of ways, and with each configuration we will review how our persistent memory is actually being used and what software techniques are available to us.

Tool Installation:

For the examples in this post we use a freshly installed instance of SLES 12 SP 2 on a platform that has three NVDIMMs. These NVDIMMs are configured so that their persistent memory is available both via PMEM and BLK. Refer to part one of this blog series if you need a refresher on these terms and concepts.

The first thing we need to do to use our NVDIMMs is to install the ndctl packages from the standard SLES 12 SP 2 repos. YaST2 can be used for this, as shown:

[Image: YaST2 showing the ndctl and libndctl packages]

The libndctl package shown above provides a set of userspace libraries to configure NVDIMMs. It is a dependency of the ndctl command line utility.

The ndctl utility has a very helpful set of man pages, accessible via:

# ndctl help COMMAND

You can get a list of the ndctl commands via:

# ndctl --list-cmds
version
enable-namespace
disable-namespace
create-namespace
destroy-namespace
enable-region
disable-region
zero-labels
read-labels
list
help

We can view the three NVDIMMs in our system like this:

# ndctl list --dimms
[
  {
    "dev":"nmem2",
    "id":"8089-00-0000-12325476"
  },
  {
    "dev":"nmem1",
    "id":"8089-00-0000-11325476"
  },
  {
    "dev":"nmem0",
    "id":"8089-00-0000-10325476"
  }
]

These NVDIMMs make up four Regions:

# ndctl list --regions
[
  {
    "dev":"region1",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region3",
    "size":202937204736,
    "available_size":202937204736,
    "type":"pmem",
    "iset_id":5903239628671731251
  },
  {
    "dev":"region0",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region2",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  }
]

Conceptually, our configuration looks like this:

[Diagram: three NVDIMMs, each exposed as a BLK Region and together interleaved into one PMEM Region]

We have three unused NVDIMMs, each of which is a BLK Region with around 64 GiB of available space. These three NVDIMMs are also interleaved together into a 3-way interleave set which gives us a PMEM Region with around 189 GiB of available space.
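As a quick sanity check on those figures, the byte counts reported by ndctl can be converted to GiB. This is a minimal sketch; bytes_to_gib is a helper name made up for this post and is not part of ndctl:

```shell
# Convert a byte count, as reported by "ndctl list", to GiB with one decimal place.
bytes_to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.1f\n", bytes / (1024 * 1024 * 1024) }'
}

bytes_to_gib 68182605824     # "size" of each BLK Region
bytes_to_gib 202937204736    # "size" of the interleaved PMEM Region
```

For the sizes listed above this prints 63.5 and 189.0, in line with the "around 64 GiB" and "around 189 GiB" figures.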

PMEM Namespace + DAX:

First, we will configure our three NVDIMMs into a single PMEM Namespace.

# ndctl create-namespace --type=pmem --mode=memory
{
  "dev":"namespace3.0",
  "mode":"memory",
  "size":199764213760,
  "uuid":"dc8ebb84-c564-4248-9e8d-e18543c39b69",
  "blockdev":"pmem3"
}

This creates a block device, /dev/pmem3, that supports DAX. The “3” in the /dev/pmem3 device name is inherited from the parent Region number, in this case region3. The --mode=memory option sets aside part of the PMEM storage space on the NVDIMMs so that it can be used to allocate internal kernel data structures called struct pages. This allows our PMEM Namespace to be used with features like O_DIRECT I/O and RDMA. The fact that we reserve some persistent memory for kernel data structures is the reason why the resulting PMEM Namespace has a smaller capacity than the parent PMEM Region, as shown below:

[Diagram: a single DAX-capable PMEM Namespace spanning the three NVDIMMs]

The green dotted lines in these diagrams outline the interfaces that we use with the NVDIMMs. In this case we directly access the persistent memory of our DAX-capable PMEM Namespace. The kernel uses the Label Space to record that we have our NVDIMMs configured in a single PMEM Namespace, along with the Mode, size, UUID and other metadata about the Namespace. In this configuration we do not use the NVDIMM Block Windows.
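To see roughly where the missing capacity goes, we can compare the Region and Namespace sizes reported by ndctl. The sketch below assumes one 64-byte struct page for every 4 KiB page of persistent memory. That is the usual size on x86_64, but it is an assumption here and not something ndctl reports:

```shell
region_bytes=202937204736      # "size" of region3 from "ndctl list --regions"
namespace_bytes=199764213760   # "size" of namespace3.0 in memory mode

# Space set aside when the Namespace was created.
overhead=$((region_bytes - namespace_bytes))

# Assumption: one 64-byte struct page per 4 KiB page, i.e. 1/64 of the capacity.
estimate=$((region_bytes / 4096 * 64))

echo "reserved:             $overhead bytes"
echo "struct page estimate: $estimate bytes"
echo "unaccounted:          $((overhead - estimate)) bytes"
```

With the sizes from this system the reserved space is 3172990976 bytes, and the struct page estimate accounts for all but 2097152 bytes (2 MiB) of it. What the remaining 2 MiB holds is not covered in this post; it is presumably alignment padding and Namespace metadata.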

Next we will make a DAX-capable filesystem on our new /dev/pmem3 block device, and we will mount it with the DAX mount option. The XFS, ext4 and ext2 filesystems all support DAX.

# fdisk -l /dev/pmem3
Disk /dev/pmem3: 186 GiB, 199764213760 bytes, 390164480 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

# mkfs.xfs /dev/pmem3
meta-data=/dev/pmem3             isize=256    agcount=4, agsize=12192640 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=48770560, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=23813, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o dax /dev/pmem3 /mnt

# mount | grep dax
/dev/pmem3 on /mnt type xfs (rw,relatime,attr2,dax,inode64,noquota)

We now have a PMEM Namespace formatted with the XFS filesystem and mounted with DAX. Any mmap calls to files in that file system will return virtual addresses that directly map to the persistent memory on our NVDIMMs, completely bypassing the page cache. Any fsync or msync calls on files in that file system will still ensure that modified data has been fully written to the NVDIMMs. These calls flush the processor cache lines associated with any pages that have been modified in userspace via mmap mappings.
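Before relying on this behavior in a script, it can be worth confirming that the file system really was mounted with DAX. Here is a minimal sketch that inspects one line of mount output; has_dax_option is a hypothetical helper name, and the check only looks for the plain "dax" option shown in this post:

```shell
# Succeeds if a line of "mount" output lists "dax" among its options.
has_dax_option() {
    opts=${1##*\(}     # keep the text after the last "("
    opts=${opts%\)*}   # drop the closing ")"
    case ",$opts," in
        *,dax,*) return 0 ;;
        *)       return 1 ;;
    esac
}

line='/dev/pmem3 on /mnt type xfs (rw,relatime,attr2,dax,inode64,noquota)'
if has_dax_option "$line"; then
    echo "DAX enabled"
else
    echo "DAX not enabled"
fi
```

On a live system the line could come from something like mount | grep /mnt.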

PMEM Namespace + BTT:

Next we will unmount our DAX filesystem, delete our DAX-capable PMEM Namespace, and then create a PMEM Namespace that uses the BTT instead.

# umount /mnt

# ndctl disable-namespace namespace3.0
disabled 1 namespace

# ndctl destroy-namespace namespace3.0
destroyed 1 namespace

# ndctl create-namespace --type=pmem --mode=sector
{
  "dev":"namespace3.0",
  "mode":"sector",
  "uuid":"51ab652d-7f20-44ea-b51d-5670454f8b9b",
  "sector_size":4096,
  "blockdev":"pmem3s"
}

# fdisk -l /dev/pmem3s
Disk /dev/pmem3s: 188.8 GiB, 202738135040 bytes, 49496615 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
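The disable, destroy and create steps always follow the same order, so they are easy to wrap in a small script. The sketch below only prints the commands unless DRY_RUN is set to 0. The run and recreate_namespace names are made up for this example, and any filesystem on the Namespace still has to be unmounted first:

```shell
# Print the commands (DRY_RUN=1, the default here) or run them (DRY_RUN=0).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# usage: recreate_namespace NAMESPACE TYPE MODE
recreate_namespace() {
    run ndctl disable-namespace "$1" &&
    run ndctl destroy-namespace "$1" &&
    run ndctl create-namespace --type="$2" --mode="$3"
}

recreate_namespace namespace3.0 pmem sector
```

In dry-run mode this prints the same three ndctl commands that we typed by hand above.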

Like the DAX-capable PMEM Namespace we previously configured, this BTT-capable PMEM Namespace consumes all the available storage on the NVDIMMs. The trailing “s” in the /dev/pmem3s device name stands for “sector” or “safe” and can be used to easily distinguish PMEM and BLK Namespaces that are configured to use the BTT.

[Diagram: a single PMEM Namespace using the BTT across the three NVDIMMs]

The PMEM Namespace shown here cannot use DAX. Instead it uses the BTT to provide sector write atomicity. On each sector write through the PMEM block driver, the BTT allocates a new sector to receive the new data. Only after the new data is fully written does the BTT atomically update its internal mapping structures, which makes the newly written data visible to applications. If power fails before that mapping update, the write is lost in its entirety and the application still sees its old data, intact. This prevents the condition known as torn sectors.

This BTT-enabled PMEM Namespace can be formatted and used with a filesystem just like any other standard block device. Because it cannot be used with DAX, mmap mappings for files on this block device will use the page cache.
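The write path described above can be illustrated with a toy model. This is only a sketch of the idea using ordinary files in a temporary directory, with a single logical sector and two physical blocks; it is not how the kernel implements the BTT, and all names in it are invented for the example:

```shell
# Toy model of an atomic sector write: one logical sector, two physical blocks.
dir=$(mktemp -d)
printf 'old data' > "$dir/block0"   # physical block currently backing the sector
printf '0' > "$dir/map"             # map: logical sector -> physical block 0
free=1                              # index of the free physical block

# usage: btt_write DATA [crash]
btt_write() {
    printf '%s' "$1" > "$dir/block$free"          # 1. new data goes to the free block
    if [ "${2:-}" = "crash" ]; then return 0; fi  # simulated power loss before step 2
    old=$(cat "$dir/map")
    printf '%s' "$free" > "$dir/map"              # 2. one map update publishes the data
    free=$old                                     # the old block becomes the free block
}

btt_read() { cat "$dir/block$(cat "$dir/map")"; }

btt_write 'new data' crash
echo "after interrupted write: $(btt_read)"   # still "old data"
btt_write 'new data'
echo "after completed write:   $(btt_read)"   # now "new data"
```

The reader only ever sees a complete old sector or a complete new sector, never a mixture of the two.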

BLK Namespaces:

We will now reconfigure our three NVDIMMs so they each have a single BLK Namespace that contains all of their available space. Our configuration looks like this:

[Diagram: one BLK Namespace per NVDIMM]

Unlike the previous PMEM configurations where all three NVDIMMs contributed capacity to the Namespace, for BLK Namespaces each NVDIMM stands on its own. One of the consequences of this approach is that with this configuration each NVDIMM is a separate failure domain. This means that if one NVDIMM fails it would only affect the BLK Namespaces located on that NVDIMM. With PMEM the data is interleaved equally across all NVDIMMs in the interleave set so the failure of a single NVDIMM causes data loss for the entire PMEM Namespace.

The following commands disable and delete the old BTT-enabled PMEM Namespace and create the three new BLK Namespaces. We will configure the BLK Namespaces so that they too use the BTT.

# ndctl disable-namespace namespace3.0
disabled 1 namespace

# ndctl destroy-namespace namespace3.0
destroyed 1 namespace

# ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace1.0",
  "mode":"sector",
  "uuid":"fed466bd-90f6-460b-ac81-ad1f08716602",
  "sector_size":4096,
  "blockdev":"ndblk1.0s"
}

# ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace0.0",
  "mode":"sector",
  "uuid":"12a29b6f-b951-4d08-8dbc-8dea1a2bb32d",
  "sector_size":4096,
  "blockdev":"ndblk0.0s"
}

# ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace2.0",
  "mode":"sector",
  "uuid":"7c84dab5-cc08-452a-b18d-53e430bf8833",
  "sector_size":4096,
  "blockdev":"ndblk2.0s"
}

# fdisk -l /dev/ndblk*
Disk /dev/ndblk0.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/ndblk1.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/ndblk2.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Because all I/O to BLK Namespaces happens through their constantly moving Block Windows, BLK Namespaces cannot use DAX. The block devices generated for BLK Namespaces are named /dev/ndblkX.Y, where X is the parent Region number and Y is a unique Namespace number within that Region. So, /dev/ndblk2.0s is child Namespace number 0 of region2. The trailing “s” means that this Namespace is configured to use the BTT.
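The naming rules for the block devices we have seen so far can be captured in a few lines of shell. This is a sketch that handles only the name forms that appear in this post (pmemX, pmemXs and ndblkX.Y with an optional trailing "s"); describe_nvdimm_dev is a hypothetical helper, not an ndctl feature:

```shell
# usage: describe_nvdimm_dev /dev/NAME
describe_nvdimm_dev() {
    name=${1#/dev/}
    case "$name" in
        *s) btt=yes; name=${name%s} ;;   # trailing "s": Namespace uses the BTT
        *)  btt=no ;;
    esac
    case "$name" in
        ndblk*.*)
            name=${name#ndblk}
            echo "type=blk region=${name%%.*} namespace=${name#*.} btt=$btt" ;;
        pmem*)
            echo "type=pmem region=${name#pmem} btt=$btt" ;;
        *)
            echo "unrecognized device name: $1" >&2
            return 1 ;;
    esac
}

describe_nvdimm_dev /dev/ndblk2.0s
describe_nvdimm_dev /dev/pmem3
```

The first call reports a BLK Namespace, number 0 in region2, with the BTT; the second reports a PMEM Namespace in region3 without it.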

Combining PMEM and BLK:

Since our NVDIMMs are capable of both PMEM and BLK we can set up a mixed configuration that utilizes both Types of Namespaces. The following diagram shows our target configuration:

[Diagram: one BLK Namespace per NVDIMM plus a PMEM Namespace in the remaining space]

The commands below remove our previous three BLK Namespaces and create this new mixed configuration:

# ndctl disable-namespace all
disabled 7 namespaces

# ndctl destroy-namespace all
destroyed 7 namespaces

# ndctl create-namespace --type=blk --size=32g
{
  "dev":"namespace2.1",
  "mode":"sector",
  "uuid":"37c254cd-b123-4b13-b5b0-cd06c30e4efb",
  "sector_size":4096,
  "blockdev":"ndblk2.1s"
}

# ndctl create-namespace --type=blk --size=32g
{
  "dev":"namespace1.1",
  "mode":"sector",
  "uuid":"e1f5fa9f-4820-42f4-b8a3-be90fa00fe79",
  "sector_size":4096,
  "blockdev":"ndblk1.1s"
}

# ndctl create-namespace --type=blk --size=32g
{
  "dev":"namespace0.1",
  "mode":"sector",
  "uuid":"1f84a98c-8dac-4a29-966a-42a5ac78d78f",
  "sector_size":4096,
  "blockdev":"ndblk0.1s"
}

# ndctl create-namespace --type=pmem --mode=memory
{
  "dev":"namespace3.0",
  "mode":"memory",
  "size":99881058304,
  "uuid":"33311d73-487d-4d27-8f2a-9d682570e312",
  "blockdev":"pmem3"
}

# fdisk -l /dev/pmem3 /dev/ndblk*
Disk /dev/pmem3: 93 GiB, 99881058304 bytes, 195080192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/ndblk0.1s: 32 GiB, 34325135360 bytes, 8380160 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/ndblk1.1s: 32 GiB, 34325135360 bytes, 8380160 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Disk /dev/ndblk2.1s: 32 GiB, 34325135360 bytes, 8380160 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

 

Simulated NVDIMMs:

 

If you do not yet have a platform that contains NVDIMMs but you would like to begin developing software that uses PMEM and DAX, you can create a PMEM Namespace out of normal RAM. This is done via the memmap kernel command line parameter, which reserves a range of memory early in boot so that the rest of the system does not use it. When the PMEM driver later loads, it claims this reserved memory and uses it for a PMEM Namespace.

The memmap kernel command line parameter works in terms of physical addresses. To determine which physical addresses are available to reserve for our PMEM Namespace we can look at the e820 table output in dmesg.

Here is an example setup using a virtual machine with 20 GiB of memory:

# dmesg | grep BIOS-e820
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000053fffffff] usable

In this output the regions marked as usable are fair game to be reserved for the PMEM driver, while the reserved regions are not. The last usable region represents the bulk of our available space, so we’ll use that.

[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000053fffffff] usable

Plugging these physical addresses into our hex calculator, the region starts at 0x0000000100000000 (4 GiB) and ends at 0x000000053fffffff (21 GiB). Say we want to reserve 16 GiB to be used by PMEM. We can start this reservation at 4 GiB, and given a size of 16 GiB it will end at 20 GiB which is still within this usable range. The syntax for this reservation is shown below:

memmap=16G!4G
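The same range check can be done in the shell instead of a hex calculator. Here is a minimal sketch, assuming sizes and offsets in whole GiB; memmap_param is a helper name invented for this post, and the two hex bounds are copied from the usable BIOS-e820 line:

```shell
GIB=$((1024 * 1024 * 1024))

# usage: memmap_param SIZE_GIB START_GIB USABLE_START USABLE_END
# USABLE_START and USABLE_END are the inclusive bounds from the BIOS-e820 line.
memmap_param() {
    start=$(($2 * GIB))
    end=$((start + $1 * GIB - 1))    # last byte of the reservation
    if [ "$start" -lt "$(($3))" ] || [ "$end" -gt "$(($4))" ]; then
        echo "reservation does not fit in the usable range" >&2
        return 1
    fi
    printf 'memmap=%sG!%sG\n' "$1" "$2"
}

memmap_param 16 4 0x0000000100000000 0x000000053fffffff
```

For our 16 GiB reservation starting at 4 GiB this prints memmap=16G!4G. Asking for 18 GiB at the same offset would run past the end of the usable range and is rejected.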

After rebooting with our new kernel parameter, we can see our new user-defined e820 table via dmesg as well (the old table is still present, in case you want to compare):

# dmesg | grep user:
[    0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] user: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] user: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] user: [mem 0x0000000100000000-0x00000004ffffffff] persistent (type 12)
[    0.000000] user: [mem 0x0000000500000000-0x000000053fffffff] usable

We can see that our new persistent memory range does indeed start at 4 GiB and end at 20 GiB, and that it lies entirely within the last usable range from the original e820 table. The remaining 1 GiB of that range, from 20 GiB to 21 GiB, is still marked usable.

If we have the PMEM driver loaded, we will see this reserved memory range as /dev/pmem0:

# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
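We can cross-check that device size against the persistent range in the user-defined e820 table. This is a small sketch; e820_range_bytes is a made-up helper that extracts the two addresses from a dmesg line and computes the size of the range:

```shell
# Print the size in bytes of the address range in an e820 line from dmesg.
e820_range_bytes() {
    range=${1#*\[mem }       # drop everything up to and including "[mem "
    range=${range%%\]*}      # drop the closing "]" and what follows
    start=${range%-*}
    end=${range#*-}
    echo $(($end - $start + 1))   # the end address is inclusive
}

e820_range_bytes '[    0.000000] user: [mem 0x0000000100000000-0x00000004ffffffff] persistent (type 12)'
```

This prints 17179869184, the same byte count that fdisk reports for /dev/pmem0.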

 

Further Reading:

In this section we will enumerate several other sources of code and information related to NVDIMM enabling in SLES 12 SP 2 and in the wider Linux community.

https://nvdimm.wiki.kernel.org/

This wiki contains instructions for configuring your NVDIMM system as well as various pieces of information about testing. It also contains a ton of links to specifications related to NVDIMM enabling. The intent is to keep this wiki up to date as NVDIMM enabling in Linux develops, so if you run into a problem this is probably a good place to start.

http://pmem.io/

This site contains tons of information related to PMEM enabling in Linux as well as other operating systems. This is also where you can learn more about the NVM Library (NVML), which aims to provide useful APIs for programming with persistent memory in userspace. The Documents section also contains links to the specifications that were needed for the kernel enabling.

https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt

Part of the Documentation folder in the current kernel tree, this page is squarely aimed at kernel developers. It talks about the different kernel modules involved in NVDIMM enabling, lays out some technical details of the kernel implementation, and talks about the sysfs interface to the kernel that is used by the ndctl tool.

https://github.com/pmem/ndctl

This GitHub repository holds the source for the latest version of the ndctl utility and userspace libraries, as well as unit tests and documentation.

Category: Enterprise Linux, Server, SUSE Linux Enterprise, SUSE Linux Enterprise Server, Technical Solutions
This entry was posted Monday, 5 December, 2016 at 1:04 pm