NVDIMM Enabling – Part 2 by Intel
Guest Blog post by
Ross Zwisler
Linux Kernel Developer, Intel
Part 2: Software Configuration
SUSE Linux Enterprise Server 12, Service Pack 2 (SLES 12 SP 2) includes early support for NVDIMMs. This support includes kernel enabling as well as userspace libraries and configuration utilities available in the standard SUSE Linux Enterprise package repositories. This is the second of two blog posts providing an overview of this enabling.
The first post provided a high level overview of the concepts involved in persistent memory enabling.
Now we can examine some examples of how to configure NVDIMMs using the tools provided in SLES 12 SP 2. We will configure our NVDIMMs so we can access their content in a variety of ways, and with each configuration we will review how our persistent memory is actually being used and what software techniques are available to us.
Tool Installation:
For the examples in this post we use a freshly installed instance of SLES 12 SP 2 on a platform that has three NVDIMMs. These NVDIMMs are configured so that their persistent memory is available both via PMEM and BLK. Refer to part one of this blog series if you need a refresher on these terms and concepts.
The first thing we need to do to use our NVDIMMs is to install the ndctl packages from the standard SLES 12 SP 2 repos. YaST2 can be used for this, as shown:
The libndctl package shown above provides a set of userspace libraries to configure NVDIMMs. It is a dependency of the ndctl command line utility.
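If you prefer the command line to YaST2, the same software can be installed with zypper. A minimal sketch, assuming the package is simply named ndctl in your configured repositories (libndctl is pulled in as a dependency):

# zypper install ndctl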
The ndctl utility has a very helpful set of man pages, accessible via:

# man ndctl
You can get a list of the ndctl commands via:
# ndctl --list-cmds
version
enable-namespace
disable-namespace
create-namespace
destroy-namespace
enable-region
disable-region
zero-labels
read-labels
list
help
We can view the three NVDIMMs in our system like this:
# ndctl list --dimms
[
  {
    "dev":"nmem2",
    "id":"8089-00-0000-12325476"
  },
  {
    "dev":"nmem1",
    "id":"8089-00-0000-11325476"
  },
  {
    "dev":"nmem0",
    "id":"8089-00-0000-10325476"
  }
]
These NVDIMMs make up four Regions:
# ndctl list --regions
[
  {
    "dev":"region1",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region3",
    "size":202937204736,
    "available_size":202937204736,
    "type":"pmem",
    "iset_id":5903239628671731251
  },
  {
    "dev":"region0",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region2",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  }
]
Conceptually, our configuration looks like this:
We have three unused NVDIMMs, each of which is a BLK Region with around 64 GiB of available space. These three NVDIMMs are also interleaved together into a 3-way interleave set which gives us a PMEM Region with around 189 GiB of available space.
PMEM Namespace + DAX:
First, we will configure our three NVDIMMs into a single PMEM Namespace.
# ndctl create-namespace --type=pmem --mode=memory
{
  "dev":"namespace3.0",
  "mode":"memory",
  "size":199764213760,
  "uuid":"dc8ebb84-c564-4248-9e8d-e18543c39b69",
  "blockdev":"pmem3"
}
This creates a block device, /dev/pmem3, that supports DAX. The “3” in the /dev/pmem3 device name is inherited from the parent Region number, in this case region3. The --mode=memory option sets aside part of the PMEM storage space on the NVDIMMs so that it can be used to allocate internal kernel data structures called struct pages. This allows our PMEM Namespace to be used with features like O_DIRECT I/O and RDMA. Reserving some persistent memory for these kernel data structures is why the resulting PMEM Namespace is smaller than its parent PMEM Region: here the roughly 189 GiB Region yields a 186 GiB Namespace, with about 3 GiB (roughly 1.6%) consumed by this metadata, as shown below:
The green dotted lines in these diagrams outline the interfaces that we use with the NVDIMMs. In this case we directly access the persistent memory of our DAX-capable PMEM Namespace. The kernel uses the Label Space to record that we have our NVDIMMs configured in a single PMEM Namespace, along with the Mode, size, UUID and other metadata about the Namespace. In this configuration we do not use the NVDIMM Block Windows.
Next we will make a DAX-capable filesystem on our new /dev/pmem3 block device, and we will mount it with the DAX mount option. The XFS, ext4 and ext2 filesystems all support DAX.
# fdisk -l /dev/pmem3
Disk /dev/pmem3: 186 GiB, 199764213760 bytes, 390164480 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

# mkfs.xfs /dev/pmem3
meta-data=/dev/pmem3             isize=256    agcount=4, agsize=12192640 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0, sparse=0
data     =                       bsize=4096   blocks=48770560, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=23813, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o dax /dev/pmem3 /mnt
# mount | grep dax
/dev/pmem3 on /mnt type xfs (rw,relatime,attr2,dax,inode64,noquota)
We now have a PMEM Namespace formatted with the XFS filesystem and mounted with DAX. Any mmap calls to files in that file system will return virtual addresses that directly map to the persistent memory on our NVDIMMs, completely bypassing the page cache. Any fsync or msync calls on files in that file system will still ensure that modified data has been fully written to the NVDIMMs. These calls flush the processor cache lines associated with any pages that have been modified in userspace via mmap mappings.
PMEM Namespace + BTT:
Next we will unmount our DAX filesystem, delete our DAX-capable PMEM Namespace, and then create a PMEM Namespace that uses the BTT instead.
# umount /mnt
# ndctl disable-namespace namespace3.0
disabled 1 namespace
# ndctl destroy-namespace namespace3.0
destroyed 1 namespace
# ndctl create-namespace --type=pmem --mode=sector
{
  "dev":"namespace3.0",
  "mode":"sector",
  "uuid":"51ab652d-7f20-44ea-b51d-5670454f8b9b",
  "sector_size":4096,
  "blockdev":"pmem3s"
}
# fdisk -l /dev/pmem3s
Disk /dev/pmem3s: 188.8 GiB, 202738135040 bytes, 49496615 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Like the DAX-capable PMEM Namespace we previously configured, this BTT-capable PMEM Namespace consumes all the available storage on the NVDIMMs. The trailing “s” in the /dev/pmem3s device name stands for “sector” or “safe” and can be used to easily distinguish PMEM and BLK Namespaces that are configured to use the BTT.
The PMEM Namespace shown here cannot use DAX. Instead it uses the BTT to provide sector write atomicity. On each sector write through the PMEM block driver, the BTT allocates a new sector to receive the data. Only after the new data has been fully written does the BTT atomically update its internal mapping structures, making the new data visible to applications. If power fails at any point during this process, the incomplete write is simply discarded and the application still sees its old data, intact. This prevents the condition known as torn sectors.
This BTT-enabled PMEM Namespace can be formatted and used with a filesystem just like any other standard block device. Because it cannot be used with DAX, mmap mappings for files on this block device will use the page cache.
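For example, a minimal sequence along the lines of the DAX setup above, just without the dax mount option (the /mnt mount point is only an assumption here):

# mkfs.xfs -f /dev/pmem3s
# mount /dev/pmem3s /mnt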
BLK Namespaces:
We will now reconfigure our three NVDIMMs so they each have a single BLK Namespace that contains all of their available space. Our configuration looks like this:
Unlike the previous PMEM configurations where all three NVDIMMs contributed capacity to the Namespace, for BLK Namespaces each NVDIMM stands on its own. One consequence of this approach is that each NVDIMM is a separate failure domain: if one NVDIMM fails, only the BLK Namespaces located on that NVDIMM are affected. With PMEM, the data is interleaved equally across all NVDIMMs in the interleave set, so the failure of a single NVDIMM causes data loss for the entire PMEM Namespace.
The following commands disable and delete the old BTT-enabled PMEM Namespaces and create the three new BLK Namespaces. We will configure the BLK Namespaces so that they too use the BTT.
# ndctl disable-namespace namespace3.0
disabled 1 namespace
# ndctl destroy-namespace namespace3.0
destroyed 1 namespace
# ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace1.0",
  "mode":"sector",
  "uuid":"fed466bd-90f6-460b-ac81-ad1f08716602",
  "sector_size":4096,
  "blockdev":"ndblk1.0s"
}
# ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace0.0",
  "mode":"sector",
  "uuid":"12a29b6f-b951-4d08-8dbc-8dea1a2bb32d",
  "sector_size":4096,
  "blockdev":"ndblk0.0s"
}
# ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace2.0",
  "mode":"sector",
  "uuid":"7c84dab5-cc08-452a-b18d-53e430bf8833",
  "sector_size":4096,
  "blockdev":"ndblk2.0s"
}
# fdisk -l /dev/ndblk*
Disk /dev/ndblk0.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk1.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk2.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Because all I/O to BLK Namespaces happens through their constantly moving Block Windows, BLK Namespaces cannot use DAX. The block devices generated for BLK Namespaces are named /dev/ndblkX.Y, where X is the parent Region number and Y is a unique Namespace number within that Region. So, /dev/ndblk2.0s is child Namespace number 0 of region2. The trailing “s” means that this Namespace is configured to use the BTT.
Combining PMEM and BLK:
Since our NVDIMMs are capable of both PMEM and BLK we can set up a mixed configuration that utilizes both Types of Namespaces. The following diagram shows our target configuration:
The commands below remove our previous three BLK Namespaces and create this new mixed configuration:
# ndctl disable-namespace all
disabled 7 namespaces
# ndctl destroy-namespace all
destroyed 7 namespaces
# ndctl create-namespace --type=blk --size=32g
{
  "dev":"namespace2.1",
  "mode":"sector",
  "uuid":"37c254cd-b123-4b13-b5b0-cd06c30e4efb",
  "sector_size":4096,
  "blockdev":"ndblk2.1s"
}
# ndctl create-namespace --type=blk --size=32g
{
  "dev":"namespace1.1",
  "mode":"sector",
  "uuid":"e1f5fa9f-4820-42f4-b8a3-be90fa00fe79",
  "sector_size":4096,
  "blockdev":"ndblk1.1s"
}
# ndctl create-namespace --type=blk --size=32g
{
  "dev":"namespace0.1",
  "mode":"sector",
  "uuid":"1f84a98c-8dac-4a29-966a-42a5ac78d78f",
  "sector_size":4096,
  "blockdev":"ndblk0.1s"
}
# ndctl create-namespace --type=pmem --mode=memory
{
  "dev":"namespace3.0",
  "mode":"memory",
  "size":99881058304,
  "uuid":"33311d73-487d-4d27-8f2a-9d682570e312",
  "blockdev":"pmem3"
}
# fdisk -l /dev/pmem3 /dev/ndblk*
Disk /dev/pmem3: 93 GiB, 99881058304 bytes, 195080192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk0.1s: 32 GiB, 34325135360 bytes, 8380160 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk1.1s: 32 GiB, 34325135360 bytes, 8380160 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk2.1s: 32 GiB, 34325135360 bytes, 8380160 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
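To double-check the resulting layout, you can list the Namespaces and see how much space remains available in each Region (output will vary with your configuration):

# ndctl list --namespaces
# ndctl list --regions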
Simulated NVDIMMs:
If you do not yet have a platform that contains NVDIMMs but you would like to begin developing software that uses PMEM and DAX, you can create a PMEM Namespace out of normal RAM. This is done via the memmap kernel command line parameter, which tells the kernel at boot time to set aside a range of physical memory so it is not used as regular system RAM. When the PMEM driver later loads, it claims this reserved range and uses it for a PMEM Namespace. Keep in mind that because this Namespace is backed by ordinary RAM, its contents will not survive a power cycle.
The memmap kernel command line parameter works in terms of physical addresses. To determine which physical addresses are available to reserve for our PMEM Namespace we can look at the e820 table output in dmesg.
Here is an example setup using a virtual machine with 20 GiB of memory:
# dmesg | grep BIOS-e820
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000053fffffff] usable
In this output the regions marked as usable are fair game to be reserved for the PMEM driver, while the reserved regions are not. The last usable region represents the bulk of our available space, so we’ll use that.
Plugging these physical addresses into our hex calculator, the region starts at 0x0000000100000000 (4 GiB) and ends at 0x000000053fffffff (21 GiB). Say we want to reserve 16 GiB to be used by PMEM. We can start this reservation at 4 GiB, and given a size of 16 GiB it will end at 20 GiB which is still within this usable range. The syntax for this reservation is shown below:
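memmap=16G!4G

Here the memmap=<size>!<start> notation marks the given physical range as persistent memory, which will appear as type 12 in the e820 table. On SLES this parameter is typically added to the kernel command line via the boot loader configuration, for example the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, after which the GRUB configuration is regenerated and the system rebooted.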
After rebooting with our new kernel parameter, we can see our new user-defined e820 table via dmesg as well (the old table is still present, in case you want to compare):
# dmesg | grep user:
[    0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] user: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] user: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] user: [mem 0x0000000100000000-0x00000004ffffffff] persistent (type 12)
[    0.000000] user: [mem 0x0000000500000000-0x000000053fffffff] usable
We can see that our new persistent memory range does indeed start at 4 GiB and end at 20 GiB, falling entirely within the usable memory range defined in the original e820 table.
If we have the PMEM driver loaded, we will see this reserved memory range as /dev/pmem0:
# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
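From here this simulated device behaves like the NVDIMM-backed PMEM Namespace from earlier, so you can put a DAX-capable filesystem on it for development and testing. A minimal sketch, again assuming /mnt as the mount point:

# mkfs.xfs /dev/pmem0
# mount -o dax /dev/pmem0 /mnt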
Further Reading:
In this section we will enumerate several other sources of code and information related to NVDIMM enabling in SLES 12 SP 2 and in the wider Linux community.
https://nvdimm.wiki.kernel.org/
https://www.kernel.org/doc/Documentation/nvdimm/nvdimm.txt