persistent memory

Post on 14-Feb-2017


Persistent Memory

Dr. Benoit Hudzia

@blopeur

benoit@stratoscale.com

Agenda

NVM Evolution

Persistent Memory Linux Software Stack

Using and Emulating PMEM on Linux

Remote PMEM

Micro Storage Architecture

NVM Evolution

Persistent Memory

Yesterday: Battery-backed RAM

Today: NVDIMM with RAM + Flash

Power down: copy to Flash; power up: copy back to RAM

Emerging NVDIMM: PCM, 3D XPoint, Memristor, etc.

Offers 1000x speed vs NAND -> closer to RAM

Characteristics as seen by software : Synchronous Model

Load / Store memory instruction

No paging

Latency is low enough that stalling the CPU on an access is reasonable

New Generation HW NVM is no longer the bottleneck

But still limited by Block stack latency + Asynchronous Model

Asynchronous Model : NVMe

“When Poll is Better than Interrupt”, Yang et al., USENIX FAST 2012 https://www.usenix.org/legacy/events/fast12/tech/full_papers/Yang.pdf

● Active polling (SYNC): lower latency (at the expense of CPU) vs MSI-X interrupts (ASYNC)

● Used in Intel SPDK

Enter persistent Memory

[Figure: 4 KB read vs. 64 B read. Source: Intel]

Moving away from Block I/O

[Figure axes: latency, access]

Leads to a new tiered software stack

Challenge: Durability

PMEM Linux Software Stack

Linux kernel (>4.2) subsystem

NVDIMM Software Architecture

http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

BTT vs DAX

BTT: Block Translation Table

provides atomic sector update semantics for persistent memory devices

applications that rely on sector writes not being torn can continue to do so.

For legacy applications

DAX : stands for Direct Access

Allows mapping a pmem range directly into userspace via mmap

If the application is aware of persistent, byte-addressable memory, and can use it to an advantage, DAX is the best path for it

If the application relies on atomic sector update semantics, it must use the BTT

Note that PMEM pages are not backed by struct page, only by PFN (so far)
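The "writes are never torn" guarantee of the BTT can be pictured with a toy model: new data goes to a spare block first, and a single map update then makes it visible. This is only a sketch of the idea. The class name, the one-spare-block layout and the crash_before_commit flag are assumptions made for illustration and do not reflect the kernel's on-media BTT format.

```python
class ToyBTT:
    """Toy block translation table (illustrative only, not the kernel layout)."""

    def __init__(self, n_sectors, sector_size=512):
        self.sector_size = sector_size
        # One more physical block than logical sectors: the extra one is free.
        self.blocks = [bytes(sector_size) for _ in range(n_sectors + 1)]
        self.map = list(range(n_sectors))  # logical sector -> physical block
        self.free = n_sectors              # index of the currently free block

    def read(self, lba):
        return self.blocks[self.map[lba]]

    def write(self, lba, data, crash_before_commit=False):
        if len(data) != self.sector_size:
            raise ValueError("writes must be exactly one sector")
        # Step 1: write the new contents out of place, into the free block.
        self.blocks[self.free] = bytes(data)
        if crash_before_commit:
            return  # simulated power loss: the map still points at the old block
        # Step 2: commit with one map update; the old block becomes the free one.
        self.map[lba], self.free = self.free, self.map[lba]
```

A reader of sector N sees either the old contents or the new contents, never a mix, because the only step that changes what is visible is the single map update.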

Using and Emulating PMEM on Linux

Kernel Config ( > 4.2 )

Enable NVDIMM dynamic debug before you start playing with NVDIMMs

Add to the kernel cmd line:

libnvdimm.dyndbg nfit.dyndbg nd_pmem.dyndbg nd_blk.dyndbg ignore_loglevel

Pick your PMEM

Use ACPI 6.0 compatible NVDIMM hardware or legacy NVDIMMs

Use virtual NVDIMMs provided by hypervisor

RAM as persistent memory

PCMSIM: NVM-disk Emulation

Emulation: RAM as PMEM

Bare metal:

Adding 'memmap=16G!16G' to the kernel boot parameters reserves 16G of memory, starting at 16G.

cat /proc/cmdline :

BOOT_IMAGE=/boot/vmlinuz-4.3.0-1-default root=UUID=39635fd6-64ee-4538-9964-7de6bb181181 resume=/dev/sda1 splash=silent quiet showopts memmap=1G!5G memmap=1G!7G
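To double-check which physical ranges a given command line reserves, the memmap=SIZE!START entries can be decoded with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and it assumes plain decimal numbers with an optional K, M or G suffix (the kernel's own parser accepts more forms).

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Turn '16G' or '512M' into a byte count (decimal digits plus K/M/G only)."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def pmem_regions(cmdline):
    """Return (start, end) byte ranges for every memmap=SIZE!START entry."""
    regions = []
    for token in cmdline.split():
        match = re.fullmatch(r"memmap=(\w+)!(\w+)", token)
        if match:
            size = parse_size(match.group(1))
            start = parse_size(match.group(2))
            regions.append((start, start + size))
    return regions
```

Calling pmem_regions on the command line shown above yields two 1G ranges, one starting at 5G and one starting at 7G.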

BTT works

QEMU NVDIMM

QEMU:

qemu-system-x86_64 -object memory-backend-file,share,id=mem1,mem-path=/dax/D1 -device nvdimm,memdev=mem1,reserve-label-data,id=nv1 -m 2048,maxmem=100G,slots=10 ….

Not yet in Upstream Qemu :

https://github.com/xiaogr/qemu/tree/nvdimm-v9

Seabios integration :

http://www.seabios.org/pipermail/seabios/2015-September/009770.html

Still missing some features + high overhead for some operations

Supports PMEM only -> Good for NFIT dev

Playing with DAX

Only ext2, ext4 and xfs currently support DAX

Note that block size should match page size

mkfs.ext4 -b 4096 /dev/pmem1

mount -t ext4 -o dax /dev/pmem1 /tmp/dax/

Playing with DAX - Cont

Then you just have to mmap it!

But remember: CLFLUSH, etc. for durability

NVML: Let somebody else do the heavy lifting
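A minimal sketch of the "mmap, store, then flush" pattern, written in Python so it runs anywhere. It uses an ordinary temporary file rather than a file on the DAX mount, and mmap.flush() (msync on POSIX) stands in for the cache-flush instructions a native pmem library would issue. The function names and the temporary file are assumptions made for illustration.

```python
import mmap
import os
import tempfile


def persist_store(path, offset, payload):
    """Store payload at offset through a shared mapping, then flush it."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as view:  # shared read/write map of the whole file
            view[offset:offset + len(payload)] = payload  # plain memory store
            view.flush()  # without this, the store is visible but not yet durable


def demo():
    """Write through the mapping, then read the bytes back with normal file I/O."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "data")  # hypothetical stand-in for a file under /tmp/dax/
        with open(path, "wb") as f:
            f.truncate(mmap.PAGESIZE)  # the file must be non-empty to be mapped
        persist_store(path, 64, b"hello pmem")
        with open(path, "rb") as f:
            f.seek(64)
            return f.read(10)
```

The point of the sketch is the ordering: the store itself is just a memory write, and durability only comes from the explicit flush that follows it.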

http://pmem.io/

libpmem – Basic persistency handling

libvmmalloc - Transparently converts all dynamic memory allocations into persistent memory allocations

libpmemblk – Block access to pmem

libpmemlog - Log file on pmem (append-mostly)

libpmemobj - Transactional Object Store on pmem

Many more… pynvm, C++ bindings, etc.
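To give a feel for the kind of work these libraries take over, here is a toy append-only log on a memory-mapped file. It is not the libpmemlog API: the ToyLog class, its header layout and the use of mmap.flush() as the persistence step are all assumptions made for this sketch.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<Q")  # 8-byte little-endian count of valid payload bytes


class ToyLog:
    """Toy append-only log on a mapped file (illustrative, not libpmemlog)."""

    def __init__(self, path, capacity=mmap.PAGESIZE):
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.truncate(capacity)  # zero-filled, so the tail starts at 0
        self._file = open(path, "r+b")
        self._view = mmap.mmap(self._file.fileno(), 0)

    def _tail(self):
        return HEADER.unpack_from(self._view, 0)[0]

    def append(self, record):
        tail = self._tail()
        start = HEADER.size + tail
        if start + len(record) > len(self._view):
            raise ValueError("log full")
        self._view[start:start + len(record)] = record
        self._view.flush()  # 1. flush the record itself
        HEADER.pack_into(self._view, 0, tail + len(record))
        self._view.flush()  # 2. only then publish the new tail

    def read_all(self):
        return bytes(self._view[HEADER.size:HEADER.size + self._tail()])

    def close(self):
        self._view.close()
        self._file.close()


def demo():
    """Append two records, reopen the log, and return what survived."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "toy.log")
        log = ToyLog(path)
        log.append(b"abc")
        log.append(b"def")
        log.close()
        reopened = ToyLog(path)
        data = reopened.read_all()
        reopened.close()
        return data
```

Flushing the record before updating the tail means a crash between the two steps leaves the old tail in place, so a reader never sees a tail that covers unflushed data.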

Remote PMEM

Remote NVMe : using RDMA to transfer NVMe commands & data

http://blog.pmcs.com/flash-memory-summit-2015-special-nvm-express-rdma-awesome/

Transitioning from Indirect to Direct Flow

● Project Donard (PMC - Microsemi)

● Page struct backed Pmem patch (I/O mem is normally accessed via PFN only)

Comes with Challenge : Durability vs Visibility

http://www.snia.org/sites/default/files/SDC15_presentations/persistant_mem/ChetDouglas_RDMA_with_PM.pdf

RDMA + DDIO

RDMA + Non Allocating write

Peer 2 Peer : Bypassing CPU + SW bottleneck

● NVM HW - Expose BAR address

● March 16 : RFC patchset for DAX allowing DMA to I/O mem

● CCIX fabric

● Use case:

○ Pre-process in data path

○ Avoid RAM buffer (HMM style)

○ SW only fetches what is necessary

Future Hyperscale Architecture

NVMe gravy train for 3-5 years

Transition to Pmem optimised apps and

Natural evolution of Ethernet Connected Drive => Fabric connected Pmem

Durable Array of Wimpy Nodes

Direct PMEM

Low power High perf K/V storage

Use pluggable front end

Rearranged based on needs

Links

Drivers specs: http://pmem.io/documents/

NVDIMM Namespace Specification: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf

NVDIMM Drivers Writers Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

NVDIMM DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf

Linux docs: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/nvdimm/nvdimm.txt

Qemu: https://github.com/xiaogr/qemu/tree/nvdimm-v9

Seabios: http://www.seabios.org/pipermail/seabios/2015-September/009770.html

Libraries:

https://github.com/pmem/nvml/

https://github.com/perone/pynvm

http://opennvm.github.io/index.html

https://github.com/spdk/spdk

Projects:

PMFS: https://github.com/linux-pmfs/pmfs

NOVA (NOn-Volatile memory Accelerated log-structured file system): https://github.com/NVSL/NOVA

PCMSIM: https://code.google.com/p/pcmsim/

Patches:

Donard, a PCIe Peer-2-Peer kernel patch: https://github.com/sbates130272/donard

Adds struct page backing for IO memory and as such allows IO memory to be used as a DMA target: http://www.spinics.net/lists/linux-mm/msg103990.html

Thank You!

Questions?

NVDIMM block I/O path
