The CS/ECE 4/599 Course Blog

An Empirical Guide to the Behavior and Use of Scalable Persistent Memory

by Nolan Cutler (Leader/Presenter), Soren Emmons (Leader/Presenter), Jared Ho (Blogger), Deptmer Ashley (Blogger), Allen Lee (Scribe)

Purpose

This paper studies how real Intel Optane Persistent Memory behaves and explains why earlier research based on emulation produced misleading conclusions.

Main goals:

Measure how real Optane DIMMs perform across access patterns, access sizes, and thread counts
Explain the internal device behavior behind those results
Derive practical guidelines for software targeting persistent memory

Key insight:

Persistent memory is not just slower DRAM: performance depends heavily on access patterns, access size, and concurrency.

What Scalable Persistent Memory Is

Optane DIMMs introduce a new tier between DRAM and storage.

Core properties

Byte-addressable, accessed with ordinary load/store instructions
Non-volatile: data survives power loss
Higher density and lower cost per byte than DRAM
Latency and bandwidth between DRAM and flash storage

Operating Modes

App-Direct Mode

Persistent memory visible to software
Applications use load/store instructions
Used for durable data structures

Memory Mode

DRAM acts as cache
Appears as a volatile memory extension
Persistence hidden from software

The paper focuses on App-Direct mode.


Internal Architecture Overview

The internal architecture of scalable persistent memory comprises several components that collectively shape its performance characteristics. The Integrated Memory Controller (iMC) manages memory traffic through dedicated read and write pending queues, while persistence guarantees are provided by the Asynchronous DRAM Refresh (ADR). Specifically, stores become durable once they reach the write pending queue (WPQ) within the iMC, even before data is committed to the underlying media. Access to the persistent memory device is coordinated by an on-DIMM controller (XPController), which performs address translation and mediates operations on the storage medium. Internally, the device operates at a 256-byte access granularity (XPLine), causing small writes to incur additional internal operations. To reduce this overhead, the controller employs a small write-combining buffer (XPBuffer, approximately 16 KB) that merges adjacent writes prior to media updates. Consequently, write operations may exhibit low apparent latency because completion is acknowledged at the ADR boundary. However, sustained bandwidth remains limited by the rate at which the XPBuffer drains data to the underlying 3D-XPoint media.

Experimental Methodology

To accurately characterize persistent memory behavior, the authors developed a custom microbenchmarking framework called LATTester. The methodology was designed to minimize software and operating-system interference in order to isolate hardware-level effects. Kernel threads were pinned to fixed CPU cores, interrupts and hardware prefetchers were disabled, and memory accesses were performed on pre-populated addresses to eliminate page-fault overhead. Measurements relied on precise cycle-level timing while systematically sweeping a large parameter space, including access patterns, access sizes, and concurrency levels. The primary objective of this approach was to capture intrinsic hardware behavior rather than performance artifacts introduced by software abstractions or system noise.


Key Hardware Findings

Latency

Tail Latency

Rare outliers exist where some writes take up to 50 µs (100× slower than normal). This is likely caused by internal remapping for wear-leveling or thermal management.

Bandwidth and Thread Scaling

DRAM: Scales predictably with thread count.

vs.

Optane: Performance is non-monotonic. It peaks at low thread counts (e.g., 1-4 threads for non-interleaved writes) and then drops due to contention in the XPBuffer.

Access Size Effects

4KB Performance Dip

Observed performance drop around 4 KB access sizes.

Behavior not captured by emulation.

Sequential vs Random Access

Optane strongly prefers:

Sequential access patterns
Access sizes of at least 256 bytes (the internal XPLine granularity)
A small number of concurrent threads per DIMM

Random small writes cause:

Read-modify-write amplification inside the device
Thrashing of the small XPBuffer
Sharply reduced effective bandwidth

Emulation Was Inaccurate

Common emulation techniques:

DRAM with software-injected latency
Throttled or remote-NUMA DRAM standing in for persistent memory

Problems:

Emulation assumes uniform, DRAM-like behavior
It misses the 256-byte internal granularity and write amplification
It misses contention effects and non-monotonic thread scaling

Case Study: RocksDB

Emulation result: Fine-grained persistence looked best.

Real Optane result: Write-ahead logging performs better.

Reason: Sequential logging matches hardware strengths.


Software Design Guidelines

Based on their empirical characterization, the authors derive a set of practical design guidelines for software targeting persistent memory systems. First, applications should avoid random accesses smaller than 256 bytes, as the device’s internal write granularity introduces read–modify–write amplification for fine-grained updates; preserving spatial locality improves efficiency. Second, large transfers should preferentially use non-temporal stores, which bypass the cache hierarchy and reduce unnecessary cache-line traffic, thereby improving sustained write bandwidth. Third, the number of concurrent threads targeting a single DIMM should be limited, since excessive concurrency increases contention within controller queues and write buffers, leading to performance collapse beyond a small optimal operating point. Finally, software should avoid NUMA accesses whenever possible, as remote persistent memory accesses incur significant latency and bandwidth penalties, particularly under mixed read–write workloads.

Case Study Insights

NOVA Filesystem

The paper demonstrates the practical impact of its design guidelines through optimizations applied to the NOVA persistent-memory file system. In its original design, frequent small metadata updates resulted in inefficient fine-grained writes that degraded performance on Optane hardware. To mitigate this issue, the authors restructured updates by embedding write data within larger sequential log entries, thereby increasing write locality and reducing internal write amplification. This modification significantly improved performance, yielding up to a sevenfold reduction in latency for small writes.

PMDK Micro-Buffering

The authors further evaluate micro-buffering techniques within PMDK-based transactional workloads. Their analysis shows that the choice of persistence mechanism should depend on object size: conventional cached stores combined with cache-line write-back instructions are more efficient for small objects, whereas non-temporal stores provide higher bandwidth for larger updates by bypassing the cache hierarchy. Experimental results identify a performance crossover point at approximately 1 KB, beyond which non-temporal stores become preferable.

Multi-DIMM Awareness

Another case study highlights the importance of hardware-aware thread placement. By balancing and pinning threads across multiple DIMMs rather than allowing uncontrolled sharing, the system reduces contention within memory controllers and improves overall bandwidth utilization. This optimization resulted in performance improvements of up to 34%, demonstrating the importance of aligning software parallelism with the underlying memory topology.


Class Discussion Notes Summary

Historical context

Architecture discussion

Memory bottleneck insight


Conclusion

This paper highlights how persistent memory introduces a new set of design challenges that cannot be understood by simply thinking of it as slower DRAM. Through real hardware measurements, the authors show that performance is highly sensitive to access size, locality, and concurrency, and that many assumptions made in earlier emulation-based studies do not hold in practice. The key takeaway is that software must be designed with clear awareness of how the hardware actually behaves. In practice, this means favoring sequential access patterns, avoiding small random writes, and carefully managing thread placement and memory topology. More broadly, the paper serves as a reminder that emerging hardware technologies often require rethinking established design intuition, and that meaningful system optimization ultimately depends on evaluating real devices rather than relying solely on simulation or emulation.

AI Disclosure

ChatGPT was used to summarize the paper, notes, and class discussion. The generative AI created a template, which was then reviewed, edited, and revised by the group.