NVDIMM Enabling in SUSE Linux Enterprise 12, Service Pack 2
Linux Kernel Developer, Intel
Part 1: High Level Overview
Persistent memory is an emerging technology in which Non-Volatile DIMMs (NVDIMMs) are placed on the memory bus. These NVDIMMs can exist either alongside or in place of traditional volatile memory DIMMs. NVDIMMs have the property that data stored on them will be retained even when the system is shut down or loses power. This allows NVDIMMs to be used either as normal memory or as a form of permanent storage like Hard Disk Drives (HDDs) or Solid State Drives (SSDs).
SUSE Linux Enterprise 12, Service Pack 2 (SLE 12 SP 2) includes early support for NVDIMMs. This support includes kernel enabling as well as userspace libraries and configuration utilities available in the standard SUSE Linux Enterprise package repositories. This is the first of two blog posts that will give an overview of this enabling.
This post will provide a high level overview of the concepts involved in persistent memory enabling. We will talk about NVDIMMs and the various ways in which they can be configured. We will also talk about related software techniques such as memory Direct Access (DAX) and the Block Translation Table (BTT).
In the next post we will give examples of how to configure NVDIMMs using the tools provided in SLE 12 SP 2. We will configure our NVDIMMs so we can access their persistent memory in a variety of ways. With each configuration we will review how our persistent memory is actually being used and what software techniques are available.
Persistent memory DIMMs can be implemented in a number of different ways. NVDIMMs can be a combination of DRAM, an existing nonvolatile media like flash, a power source such as a supercapacitor or a battery, and enough logic so that when a power failure happens the contents of the DRAM are moved to the nonvolatile media before the backup power runs out. NVDIMMs can also be made from new types of media that are both byte addressable and nonvolatile, such as 3D XPoint, memristor, or spin transfer torque (STT) memory, to name just a few.
For the purposes of our discussion the actual hardware implementation is not important. What matters is that the memory sits on the memory bus, is byte addressable, has read and write latencies that are roughly within an order of magnitude of normal volatile memory, and that it conforms to the nonvolatile memory interfaces defined in ACPI 6.0 and beyond.
The NVDIMM enabling in SLE 12 SP 2 is based around the “NVDIMM Firmware Interface Table (NFIT)”, first defined in ACPI 6.0. Out of band communication with these NVDIMMs happens via Device Specific Methods (DSMs). Links to all the relevant specs and documents are included at the end of the second post for reference. The important thing to know is that any NVDIMM that conforms to these standards will work with SLE 12 SP 2, regardless of manufacturer or the underlying memory technology.
Before we dive into the different ways that NVDIMMs can be used we need to cover a few common definitions.
Namespace: An NVDIMM Namespace is the unit of persistent memory storage that appears in /dev as a device that you can use for I/O. A given Namespace has a block size, associated persistent memory, a kernel driver, and other related metadata. The devices in /dev associated with Namespaces are usually block devices. NVM Express SSDs also have Namespaces which partition the SSD’s storage into logical devices. NVDIMM Namespaces are modeled after this idea and provide the same function. Like NVM Express Namespaces and disk devices, NVDIMM Namespaces can be further subdivided by a partition table.
Region: A Region is a grouping of persistent memory storage that can be divided up into one or more Namespaces. You cannot access the persistent memory of a Region without first allocating it to a Namespace.
Type: Each Namespace and Region has a Type that defines the way in which the persistent memory associated with that Namespace or Region can be accessed. Valid Types are PMEM and BLK, which we will talk about in more detail later. A Namespace always has the same Type as its parent Region.
Mode: Each Namespace also has a Mode that defines which NVDIMM software features are enabled for that Namespace. Namespace Modes include raw, sector, memory and dax. Sibling Namespaces of the same parent Region will always have the same Type, but might be configured to have different Modes.
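To make the relationships between these four terms concrete, here is a minimal Python sketch of them as a data model. It is only an illustration: the class names, field names and the `create_namespace` helper are invented for this example and do not correspond to any kernel or library interface. It encodes two rules from the definitions above: a Namespace takes its Type from its parent Region, while sibling Namespaces may have different Modes.

```python
from dataclasses import dataclass, field
from enum import Enum


class NsType(Enum):
    PMEM = "pmem"
    BLK = "blk"


class Mode(Enum):
    RAW = "raw"
    SECTOR = "sector"
    MEMORY = "memory"
    DAX = "dax"


@dataclass
class Namespace:
    name: str
    ns_type: NsType
    mode: Mode
    size: int  # bytes


@dataclass
class Region:
    ns_type: NsType
    size: int  # bytes
    namespaces: list = field(default_factory=list)

    def create_namespace(self, name, mode, size):
        """Carve a Namespace out of this Region's unallocated space."""
        used = sum(ns.size for ns in self.namespaces)
        if used + size > self.size:
            raise ValueError("not enough free space in Region")
        # The Type is never chosen by the caller: it is inherited from the Region.
        ns = Namespace(name, self.ns_type, mode, size)
        self.namespaces.append(ns)
        return ns


region = Region(NsType.PMEM, size=1 << 30)
ns_a = region.create_namespace("ns_a", Mode.RAW, 1 << 29)
ns_b = region.create_namespace("ns_b", Mode.SECTOR, 1 << 29)
```

Here `ns_a` and `ns_b` are siblings with the same Type but different Modes, and a third allocation would fail because the Region has no space left.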
Normal volatile DRAM DIMMs in a system are usually configured so that multiple DIMMs are interleaved together into what is called an interleave set. For example, in a system with a single processor and four normal memory DIMMs, those four DIMMs are likely interleaved together so that a write to a given page in memory will end up writing 1/4 of its data to DIMM 0, 1/4 to DIMM 1, etc. Interleave sets are commonly called N-way, where N is the number of DIMMs involved in the interleave. So, in our example we would have a 4-way interleave set.
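The address arithmetic behind interleaving can be sketched in a few lines of Python. This is a simplified model, not a description of any particular memory controller: the `locate` function is made up for this example, and the 1 KiB interleave granularity is an assumption chosen so that a 4 KiB page splits evenly across four DIMMs. Real platforms choose their own granularity.

```python
from collections import Counter


def locate(offset, ways=4, granularity=1024):
    """Map a byte offset in an interleave set to (dimm_index, dimm_offset)."""
    chunk = offset // granularity        # which interleave chunk the byte falls in
    dimm = chunk % ways                  # chunks rotate round-robin across the DIMMs
    dimm_offset = (chunk // ways) * granularity + offset % granularity
    return dimm, dimm_offset


# Count how many bytes of the first 4 KiB page land on each DIMM.
bytes_per_dimm = Counter(locate(off)[0] for off in range(4096))
```

Under these assumptions `bytes_per_dimm` reports 1024 bytes for each of the four DIMMs, which is the 1/4-per-DIMM split described above.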
The simplest way to use NVDIMMs is to configure them using interleave sets, much in the same way that you would if they were normal memory. Here is an example configuration with four NVDIMMs combined into a 4-way interleave set:
In this configuration our 4-way interleave set defines our Region, which has Type “PMEM”. We also have a single Namespace, also with Type PMEM, in our Region. In most cases this single PMEM Namespace will be surfaced as a block device named something like /dev/pmem0.
The ACPI specification allows DIMMs to have space for additional metadata called Labels, which enable the system to keep track of which parts of the NVDIMM Regions are allocated to Namespaces. These Labels will be shown in future diagrams, but as this diagram shows, Labels are not strictly necessary for PMEM Namespaces. Legacy NVDIMMs, those which were created before the NFIT was added to the ACPI specification, do not have additional space for Labels. The Linux PMEM driver also allows users to create PMEM Namespaces from reserved DRAM via the memmap kernel parameter. For both Legacy NVDIMMs and for PMEM Namespaces created using the memmap kernel parameter, the system defaults to a single PMEM Region and a single PMEM Namespace that takes up all the available space in the Region.
One last thing to note about this diagram is the dotted green line surrounding the PMEM Region. In the diagrams presented in this series the green line identifies the parts of the NVDIMMs that are directly accessible as an interface to the kernel or user. In the case of a PMEM Region users have the ability to read and write directly to all the persistent memory in the Region.
DAX, which stands for memory Direct Access, is one of the persistent memory features enabled in the SLE 12 SP 2 Linux kernel. DAX allows userspace applications to directly access their persistent memory via mmap.
With normal storage, or with NVDIMMs that are used without the DAX feature, mmap gives userspace load/store access to page cache pages, which are written back to media later, either due to page cache pressure or in response to a user-initiated fsync, msync or other related synchronization system call. With DAX, however, the virtual address returned by mmap to the userspace application points to the persistent memory and not the page cache. This allows the userspace application to perform loads, stores and flushes directly to the persistent memory addresses, completely bypassing the page cache.
The fsync, msync and related system calls are still relevant for DAX. Instead of causing the kernel to write back dirty page cache pages to media, for DAX these system calls cause the kernel to flush modified data cache lines in the processor caches out to the NVDIMMs.
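From the application's point of view the programming model is the same with or without DAX: map the file, store to it, then synchronize. The Python sketch below shows that pattern using the standard `mmap` module, whose `flush()` method issues msync on Linux. The helper name `store_and_flush` is made up for this example, and the demo runs against an ordinary temporary file, so here the stores go through the page cache. The assumption is that the same code pointed at a file on a DAX-enabled filesystem would store directly to persistent memory.

```python
import mmap
import os
import tempfile


def store_and_flush(path, payload, offset=0):
    """Store bytes through a shared mapping, then synchronize with msync."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        with mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as mapping:
            # Plain memory stores into the mapping, no write() system call.
            mapping[offset:offset + len(payload)] = payload
            # msync: write back dirty pages, or with DAX flush processor caches.
            mapping.flush()
    finally:
        os.close(fd)


# Demo against a regular temporary file, one page long.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * mmap.PAGESIZE)
demo_path = tmp.name

store_and_flush(demo_path, b"hello")
with open(demo_path, "rb") as check:
    first_bytes = check.read(5)
os.unlink(demo_path)
```

The application code does not change between the two cases. What changes is what the mapping points at and what the synchronization call has to do underneath.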
DAX requires support from the underlying filesystem. It is currently supported by XFS, ext4 and ext2.
Besides PMEM, the other Type that can be given to Regions and Namespaces is BLK. Here is a diagram with our same four NVDIMMs, this time populated with BLK Namespaces:
The fundamental difference between BLK and PMEM is that with PMEM you are able to directly access the media on the NVDIMM via loads and stores to persistent memory virtual addresses. With BLK, though, you instead access the media through Apertures. Apertures are essentially sliding memory windows, usually a few pages in size, that you can move to any place on the NVDIMM. Each Aperture is controlled by a Control Register that is used to adjust the offset of the Aperture. There is also a Status Register that can be used to check for I/O errors after the I/O is complete. An Aperture, Control Register and Status Register group is collectively called a Block Window, hence the name BLK for this Type of access.
This is the reason why the dotted green line in the diagram only surrounds the NVDIMM Block Windows and does not surround the actual Namespace media. For BLK Namespaces neither userspace nor the kernel has virtual addresses that correspond directly to the persistent memory. The basic flow for a write operation using a BLK Namespace is described below:
1. Locate a Block Window on the DIMM that holds your data.
2. Write the DIMM-relative offset into the Block Window’s Control Register.
3. Write the new data into the Block Window’s Aperture.
4. Read the Block Window’s Status Register to see if there was an error.
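The four steps above can be sketched as a toy simulation in Python. Everything here is a stand-in: the `BlockWindow` class, its fields and the `blk_write` helper are invented for illustration, the "registers" are plain attributes, and the media is a `bytearray`. Real Block Window registers carry more state than a bare offset, and a real driver must also arbitrate between concurrent users of a window, which this sketch ignores.

```python
class BlockWindow:
    """Toy model of an Aperture plus its Control and Status Registers."""

    def __init__(self, media, aperture_size=8192):
        self.media = media                  # bytearray standing in for the NVDIMM
        self.aperture_size = aperture_size  # "a few pages" in this model
        self.control = 0                    # DIMM-relative offset of the Aperture
        self.status = 0                     # 0 means the last I/O succeeded

    def write_aperture(self, data):
        """Store data at the offset currently programmed in the Control Register."""
        end = self.control + len(data)
        if len(data) > self.aperture_size or end > len(self.media):
            self.status = 1                 # record the error, leave media untouched
            return
        self.media[self.control:end] = data
        self.status = 0


def blk_write(windows, dimm_offset, data):
    window = windows[0]                     # 1. locate a Block Window on the DIMM
    window.control = dimm_offset            # 2. write the offset to the Control Register
    window.write_aperture(data)             # 3. write the data into the Aperture
    if window.status != 0:                  # 4. read the Status Register
        raise OSError("BLK write failed")


dimm = bytearray(64 * 1024)
windows = [BlockWindow(dimm)]
blk_write(windows, 4096, b"new sector data")
```

Note that the caller never holds an address inside `dimm`. It only ever talks to the window, which is the property that distinguishes BLK from PMEM access.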
There is also a Label Space for each NVDIMM that is accessible by the kernel, allowing it to store Labels. Labels are used to store relevant metadata about how the DIMM is divided up into BLK Namespaces and what properties those BLK Namespaces have.
Because BLK Namespaces do not have virtual addresses that correspond directly to the persistent memory media, BLK Namespaces do not support the DAX feature. Filesystems built upon BLK Namespaces use the page cache just like traditional storage.
Combining PMEM and BLK:
Some NVDIMMs only support PMEM, some NVDIMMs only support BLK, and some NVDIMMs support both access methods. The diagram below shows an example configuration that contains four NVDIMMs that support both PMEM and BLK:
In this configuration we have an interleave set that covers the entirety of our four NVDIMMs, making one big PMEM Region. Each NVDIMM’s persistent memory is also available via the NVDIMM’s Block Windows.
The configuration for each NVDIMM is stored in the Labels held in the Label Space of that NVDIMM. These Labels help us keep track of which part of the NVDIMM is allocated for the PMEM Namespace and which parts of the NVDIMM are allocated to BLK Namespaces. The Labels help us ensure that we never have the same part of a given NVDIMM accessible via both PMEM and BLK. Aliasing between PMEM and BLK would quickly lead to data corruption.
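As a rough illustration of the bookkeeping that Labels make possible, the Python sketch below checks a set of label-like records for one NVDIMM and reports any PMEM and BLK allocations that cover the same bytes. The record layout (a dict with `name`, `type`, `start` and `size`) and the `find_aliasing` helper are hypothetical and much simpler than the real on-media Label format.

```python
def find_aliasing(labels):
    """Return (name, name) pairs whose ranges overlap across different Types."""
    clashes = []
    ordered = sorted(labels, key=lambda label: label["start"])
    for i, first in enumerate(ordered):
        first_end = first["start"] + first["size"]
        for second in ordered[i + 1:]:
            if second["start"] >= first_end:
                break  # sorted by start, so no later record can overlap `first`
            if first["type"] != second["type"]:
                clashes.append((first["name"], second["name"]))
    return clashes


good_layout = [
    {"name": "pmem0", "type": "pmem", "start": 0, "size": 1 << 30},
    {"name": "blk0", "type": "blk", "start": 1 << 30, "size": 1 << 29},
]
bad_layout = [
    {"name": "pmem0", "type": "pmem", "start": 0, "size": 1 << 30},
    {"name": "blk0", "type": "blk", "start": (1 << 30) - 4096, "size": 1 << 29},
]
good_result = find_aliasing(good_layout)
bad_result = find_aliasing(bad_layout)
```

In the second layout the BLK allocation starts 4 KiB before the PMEM allocation ends, so the check reports the pair.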
Many applications rely on the fact that when they write a given sector to media the write will either succeed atomically or the write will fail atomically and leave the old data intact. With persistent memory this is not always true. If we lose power during a sector write it is entirely possible that the failure occurs halfway through writing that sector. On the next boot the application will try to read the sector, which now contains half old data and half new data. This condition is known as a torn sector.
The kernel component called the Block Translation Table (BTT) addresses the issue of torn sectors. The goal of the BTT is to provide sector write atomicity in software. It does this by turning each write into an allocating write, where the new data is written into a previously unused sector. After the write is complete and the data is fully written, the sector is then activated and atomically swapped into place so that it can be read.
The BTT is implemented via an indirection table and a free list. The indirection table provides a mapping between user visible Logical Block Addresses (LBAs) and the actual LBAs on the media. The free list provides a small pool of unused sectors that can be used to handle incoming writes, allowing them to write into unallocated space. This process is shown below:
In this example BTT the sector at the user’s LBA 3 is actually stored at LBA 8 on the media. When the BTT receives an incoming write for the user’s LBA 3, it first writes the data to one of the sectors on the free list.
If we lose power while this write is in progress the old data at the user’s LBA 3 will still be intact at the device’s LBA 8. When power is restored the user will be able to read the old sector, and the half-written new data will be ready to be overwritten on the free list.
If the write fully succeeds we atomically update the indirection table and the free list, swapping the old device LBA 8 and the device LBA 0.
This makes the newly written data at the device’s LBA 0 visible via the user’s LBA 3, and frees the device’s LBA 8 so it can be used for a future write.
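The write path walked through above can be modeled in a short Python sketch. This is a toy, not the kernel's implementation: the `ToyBTT` class and the `fail_midway` switch are invented for this example, and every indirection table entry other than user LBA 3 mapping to device LBA 8 is made up. The swap is a single Python statement in this model, whereas a real implementation needs its own mechanism to make that update atomic on the media.

```python
class ToyBTT:
    """Indirection table plus free list, with allocating writes."""

    def __init__(self, table, free, sector_size=512):
        self.table = list(table)    # user LBA -> device LBA
        self.free = list(free)      # unused device LBAs
        self.sector_size = sector_size
        total = len(self.table) + len(self.free)
        self.media = [bytes(sector_size) for _ in range(total)]

    def read(self, user_lba):
        return self.media[self.table[user_lba]]

    def write(self, user_lba, data, fail_midway=False):
        if len(data) != self.sector_size:
            raise ValueError("writes must be exactly one sector")
        target = self.free[0]       # always write into an unused sector
        if fail_midway:
            half = self.sector_size // 2
            # Only the first half lands: a torn sector, but on the free list.
            self.media[target] = data[:half] + self.media[target][half:]
            raise RuntimeError("simulated power loss")
        self.media[target] = data
        # Publish the new sector and retire the old one in one step.
        self.table[user_lba], self.free[0] = target, self.table[user_lba]


btt = ToyBTT(table=[1, 2, 3, 8, 4, 5, 6, 7], free=[0], sector_size=8)
btt.media[8] = b"OLD-DATA"                  # current contents of user LBA 3

try:
    btt.write(3, b"NEW-DATA", fail_midway=True)
except RuntimeError:
    pass
after_failure = btt.read(3)                 # the old sector is still intact

btt.write(3, b"NEW-DATA")
after_success = btt.read(3)
```

After the failed write the user still reads the old sector from device LBA 8. After the successful write, user LBA 3 maps to device LBA 0 and device LBA 8 is on the free list.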
Because the actual location of the data at a given user’s LBA is constantly moving in response to incoming writes, the BTT is incompatible with DAX. DAX requires that the physical location of a given sector of data is constant over time so that the kernel can give userspace applications pointers to that data.
In this post we have covered some of the high level concepts involved in the persistent memory enabling found in SLE 12 SP 2. The storage provided by NVDIMMs is organized into Regions and Namespaces, and we have discussed some of the different ways those Regions and Namespaces can be configured. We have also talked about DAX and the BTT, which are two of the software techniques that make persistent memory even more useful to users.
In the next blog post we will discuss the details of how you can use the software tools and kernel enabling provided in SLE 12 SP 2 to configure your NVDIMM enabled platform.