An Empirical Guide to the Behavior and Use of Scalable Persistent Memory
Purpose
This paper studies how real Intel Optane Persistent Memory behaves and explains why earlier research based on emulation produced misleading conclusions.
Main goals:
- Characterize Optane on real hardware
- Compare real behavior vs emulation assumptions
- Extract practical software design guidelines
- Reevaluate prior persistent-memory systems
Key insight:
Persistent memory is not just slower DRAM: performance depends heavily on access patterns, access size, and concurrency.
What Scalable Persistent Memory Is
Optane DIMMs introduce a new tier between DRAM and storage.
Core properties
- Non-volatile (data survives power loss)
- Byte-addressable
- Installed on the memory bus
- Slower than DRAM, faster than SSDs
- Higher density than DRAM
Operating Modes
App-Direct Mode
- Persistent memory is visible to software
- Applications access it with load/store instructions
- Used for durable data structures
Memory Mode
- DRAM acts as a cache in front of Optane
- Optane appears as a volatile extension of main memory
- Persistence is hidden from software
The paper focuses on App-Direct mode.
Internal Architecture Overview
The internal architecture of scalable persistent memory comprises several components that collectively shape its performance characteristics. The Integrated Memory Controller (iMC) manages memory traffic through dedicated read and write pending queues, while persistence guarantees are provided by the Asynchronous DRAM Refresh (ADR) mechanism. Specifically, stores become durable once they reach the write pending queue (WPQ) within the iMC, even before data is committed to the underlying media. Access to the persistent memory device is coordinated by an on-DIMM controller (XPController), which performs address translation and mediates operations on the storage medium. Internally, the device operates at a 256-byte access granularity (XPLine), so a smaller write forces the controller to read, modify, and write back a full XPLine. To reduce this overhead, the controller employs a small write-combining buffer (XPBuffer, approximately 16 KB) that merges adjacent writes prior to media updates. Two consequences follow. Write operations can exhibit low apparent latency because completion is acknowledged at the ADR boundary rather than at the media. Sustained bandwidth, however, remains limited by the rate at which the XPBuffer drains data to the underlying 3D XPoint media.
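The effect of write combining can be illustrated with a toy model. The sketch below is not the device's actual algorithm: it assumes stores arrive as 64-byte cache lines, that the XPBuffer holds 16 KB / 256 B = 64 XPLines, that eviction is least-recently-used, and that each eviction costs one full 256-byte media write. The function names are hypothetical, and the read half of read-modify-write is not modeled.

```python
from collections import OrderedDict
import random

XPLINE = 256                          # media access granularity in bytes (from the text)
CACHE_LINE = 64                       # assumed size of each store reaching the DIMM
BUFFER_LINES = 16 * 1024 // XPLINE    # ~16 KB XPBuffer, i.e. 64 XPLines


def media_bytes_written(addresses):
    """Bytes written to media for a stream of 64 B stores.

    Assumptions (not from the paper): LRU eviction, and every evicted
    XPLine costs exactly one 256 B media write.
    """
    buffer = OrderedDict()            # XPLine index -> None, ordered by recency
    media_writes = 0
    for addr in addresses:
        line = addr // XPLINE
        if line in buffer:
            buffer.move_to_end(line)  # store is merged into a buffered XPLine
        else:
            if len(buffer) == BUFFER_LINES:
                buffer.popitem(last=False)   # evict the oldest XPLine
                media_writes += 1
            buffer[line] = None
    media_writes += len(buffer)       # drain whatever is still buffered
    return media_writes * XPLINE


def write_amplification(addresses):
    """Media bytes written divided by bytes the application stored."""
    addresses = list(addresses)
    return media_bytes_written(addresses) / (len(addresses) * CACHE_LINE)


# 1 MiB written front to back: four stores merge into each XPLine.
sequential = range(0, 1 << 20, CACHE_LINE)

# The same number of stores scattered over 1 GiB: almost nothing merges.
rng = random.Random(0)
scattered = [rng.randrange(0, 1 << 30, CACHE_LINE) for _ in range(len(sequential))]

print(write_amplification(sequential))   # 1.0
print(write_amplification(scattered))
```

Under these assumptions the sequential stream writes each XPLine to media once, while the scattered stream pays for a full XPLine on nearly every 64-byte store, which is the bandwidth gap the architecture description predicts.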
Experimental Methodology
To accurately characterize persistent memory behavior, the authors developed a custom microbenchmarking framework called LATTester. The methodology was designed to minimize software and operating-system interference in order to isolate hardware-level effects. Kernel threads were pinned to fixed CPU cores, interrupts and hardware prefetchers were disabled, and memory accesses were performed on pre-populated addresses to eliminate page-fault overhead. Measurements relied on precise cycle-level timing while systematically sweeping a large parameter space, including access patterns, access sizes, and concurrency levels. The primary objective of this approach was to capture intrinsic hardware behavior rather than performance artifacts introduced by software abstractions or system noise.
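The parameter sweep described above can be pictured as a Cartesian product over the measured dimensions. This is only a sketch of the idea, not LATTester's code: the axis values, the operation labels, and the `sweep` helper are illustrative assumptions.

```python
from itertools import product

# Illustrative axis values; the real tool's ranges are not given in these notes.
PATTERNS = ("sequential", "random")
OPERATIONS = ("load", "store+clwb", "ntstore")
ACCESS_SIZES = [64 * 2**i for i in range(8)]   # 64 B up to 8 KB
THREAD_COUNTS = [1, 2, 4, 8, 16]


def sweep():
    """Yield one benchmark configuration per point in the parameter space."""
    for pattern, op, size, threads in product(
        PATTERNS, OPERATIONS, ACCESS_SIZES, THREAD_COUNTS
    ):
        yield {"pattern": pattern, "op": op, "size": size, "threads": threads}


configs = list(sweep())
print(len(configs))   # 240
print(configs[0])
```

Enumerating every combination, rather than varying one axis at a time, is what lets interactions such as access size with thread count show up in the results.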
Key Hardware Findings
Latency
- Read latency: ~2-3× higher than DRAM
- Sequential reads are significantly faster than random reads
- Write latency appears similar to DRAM's because the write is acknowledged once it reaches the iMC (ADR domain), not when it reaches the media
Tail Latency
Rare outliers exist: a small fraction of writes take up to 50 µs, roughly 100× the typical write latency. The likely cause is internal remapping for wear-leveling or thermal management.
Bandwidth and Thread Scaling
- DRAM: bandwidth scales predictably with thread count.
- Optane: bandwidth is non-monotonic. It peaks at low thread counts (e.g., 1-4 threads for non-interleaved writes) and then drops due to contention in the XPBuffer.
Access Size Effects
- Internal access granularity = 256 bytes
- Writes smaller than 256B cause severe bandwidth loss
- Caused by internal read-modify-write operations
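The cost of a small write follows from simple arithmetic on the 256-byte granularity. The sketch below computes, for one isolated write that gets no help from write combining, how many XPLines it touches and what fraction of the media traffic is useful data. The helper names are hypothetical.

```python
XPLINE = 256  # internal access granularity in bytes (from the text)


def xplines_touched(offset, size):
    """Number of 256 B XPLines covered by a write of `size` bytes at `offset`."""
    if size <= 0:
        raise ValueError("size must be positive")
    first = offset // XPLINE
    last = (offset + size - 1) // XPLINE
    return last - first + 1


def write_efficiency(offset, size):
    """Useful bytes divided by media bytes written, assuming no combining."""
    return size / (xplines_touched(offset, size) * XPLINE)


print(write_efficiency(0, 64))      # 0.25: one cache line occupies a whole XPLine
print(write_efficiency(0, 256))     # 1.0:  aligned full-XPLine write
print(write_efficiency(192, 128))   # 0.25: 128 B straddling two XPLines
print(write_efficiency(0, 4096))    # 1.0:  large aligned write
```

The third case shows that alignment matters as well as size: a 128-byte write that crosses an XPLine boundary is no more efficient than a single 64-byte store.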
4KB Performance Dip
Performance drops for accesses around 4 KB. Likely causes:
- Memory controller contention
- DIMM interleaving imbalance
This behavior is not captured by emulation.
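One way to see how interleaving can produce an imbalance is to map addresses to DIMMs. The sketch below assumes a 4 KB interleave chunk striped round-robin across six DIMMs; neither number appears in these notes, so treat both as illustrative assumptions about the platform. The function names are hypothetical.

```python
from collections import Counter

INTERLEAVE = 4096   # assumed interleave chunk size in bytes
NUM_DIMMS = 6       # assumed number of DIMMs in the interleave set


def dimm_for(addr):
    """DIMM index that serves a given physical offset under round-robin striping."""
    return (addr // INTERLEAVE) % NUM_DIMMS


def load_per_dimm(starts, size):
    """Bytes landing on each DIMM when one access of `size` begins at each start."""
    load = Counter()
    for start in starts:
        first = start // INTERLEAVE
        last = (start + size - 1) // INTERLEAVE
        for chunk in range(first, last + 1):
            lo = max(start, chunk * INTERLEAVE)
            hi = min(start + size, (chunk + 1) * INTERLEAVE)
            load[chunk % NUM_DIMMS] += hi - lo
    return dict(load)


# Four threads, each issuing one aligned 4 KB access.
stride_24k = [i * INTERLEAVE * NUM_DIMMS for i in range(4)]   # 24 KB apart
stride_4k = [i * INTERLEAVE for i in range(4)]                # back to back

print(load_per_dimm(stride_24k, 4096))   # {0: 16384}
print(load_per_dimm(stride_4k, 4096))    # {0: 4096, 1: 4096, 2: 4096, 3: 4096}
```

With these assumed parameters, an aligned 4 KB access occupies exactly one DIMM, so several threads can pile onto the same DIMM while the others sit idle.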
Sequential vs Random Access
Optane strongly prefers:
- Sequential access
- Large contiguous writes
Random small writes cause:
- Bandwidth collapse
- Increased latency
- Poor scaling
Emulation Was Inaccurate
Common emulation techniques:
- DRAM with added delays
- NUMA DRAM emulation
- Software simulators
- Hardware emulators
Problems:
- Failed to model read/write asymmetry
- Ignored sequential preference
- Missed small-write penalties
- Produced misleading conclusions
Case Study: RocksDB
Emulation result: Fine-grained persistence looked best.
Real Optane result: Write-ahead logging performs better.
Reason: Sequential logging matches hardware strengths.
Software Design Guidelines
Based on their empirical characterization, the authors derive a set of practical design guidelines for software targeting persistent memory systems. First, applications should avoid random accesses smaller than 256 bytes, as the device's internal write granularity introduces read-modify-write amplification for fine-grained updates; preserving spatial locality improves efficiency. Second, large transfers should use non-temporal stores, which bypass the cache hierarchy and reduce unnecessary cache-line traffic, thereby improving sustained write bandwidth. Third, the number of concurrent threads targeting a single DIMM should be limited, since excessive concurrency increases contention within controller queues and write buffers, leading to performance collapse beyond a small optimal operating point. Finally, software should avoid accesses to persistent memory on a remote NUMA node whenever possible, as remote accesses incur significant latency and bandwidth penalties, particularly under mixed read-write workloads.
Case Study Insights
NOVA Filesystem
The paper demonstrates the practical impact of its design guidelines through optimizations applied to the NOVA persistent-memory file system. In its original design, frequent small metadata updates resulted in inefficient fine-grained writes that degraded performance on Optane hardware. To mitigate this issue, the authors restructured updates by embedding write data within larger sequential log entries, thereby increasing write locality and reducing internal write amplification. This modification significantly improved performance, yielding up to a sevenfold reduction in latency for small writes.
PMDK Micro-Buffering
The authors further evaluate micro-buffering techniques within PMDK-based transactional workloads. Their analysis shows that the choice of persistence mechanism should depend on object size: conventional cached stores combined with cache-line write-back instructions are more efficient for small objects, whereas non-temporal stores provide higher bandwidth for larger updates by bypassing the cache hierarchy. Experimental results identify a performance crossover point at approximately 1 KB, beyond which non-temporal stores become preferable.
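The size-based choice can be written as a small dispatch rule. This sketch only encodes the roughly 1 KB crossover reported above; it is not PMDK's API, the function names are hypothetical, and treating exactly 1 KB as the switch point is an arbitrary assumption since the crossover is approximate.

```python
NT_STORE_THRESHOLD = 1024   # bytes; approximate crossover from the text


def choose_persist_path(object_size):
    """Return which write path the guideline suggests for an object of this size."""
    if object_size < 0:
        raise ValueError("object_size must be non-negative")
    if object_size >= NT_STORE_THRESHOLD:
        return "ntstore"        # non-temporal stores, bypassing the cache
    return "store+clwb"         # cached stores followed by cache-line write-back


def bytes_per_path(object_sizes):
    """Total bytes routed to each path for a batch of objects."""
    totals = {"store+clwb": 0, "ntstore": 0}
    for size in object_sizes:
        totals[choose_persist_path(size)] += size
    return totals


print(choose_persist_path(64))              # store+clwb
print(choose_persist_path(4096))            # ntstore
print(bytes_per_path([64, 256, 2048, 8192]))   # {'store+clwb': 320, 'ntstore': 10240}
```

In a real system the threshold would be measured on the target hardware rather than hard-coded.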
Multi-DIMM Awareness
Another case study highlights the importance of hardware-aware thread placement. By balancing and pinning threads across multiple DIMMs rather than allowing uncontrolled sharing, the system reduces contention within memory controllers and improves overall bandwidth utilization. This optimization resulted in performance improvements of up to 34%, demonstrating the importance of aligning software parallelism with the underlying memory topology.
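Balanced placement can be sketched as a round-robin assignment with a per-DIMM cap. The helper below is hypothetical and covers only the bookkeeping: it assumes the caller already knows which address ranges live on which DIMM and has a way to pin threads, neither of which is shown.

```python
def assign_threads(thread_ids, num_dimms, max_per_dimm):
    """Spread threads across DIMMs round-robin, refusing to exceed the cap.

    Returns a dict mapping DIMM index -> list of thread ids.
    """
    if num_dimms <= 0 or max_per_dimm <= 0:
        raise ValueError("num_dimms and max_per_dimm must be positive")
    if len(thread_ids) > num_dimms * max_per_dimm:
        raise ValueError("more threads than the per-DIMM cap allows")
    placement = {dimm: [] for dimm in range(num_dimms)}
    for i, tid in enumerate(thread_ids):
        placement[i % num_dimms].append(tid)
    return placement


placement = assign_threads(list(range(8)), num_dimms=6, max_per_dimm=4)
print(placement)
# {0: [0, 6], 1: [1, 7], 2: [2], 3: [3], 4: [4], 5: [5]}
```

Round-robin keeps the difference between the busiest and idlest DIMM to at most one thread, and the cap reflects the guideline that bandwidth collapses once too many threads target one DIMM.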
Class Discussion Notes Summary
Historical context
- CPU clock scaling slowed around the early 2000s.
- Industry shifted toward thread-level parallelism.
- Adding cores became more effective than increasing per-core complexity.
Architecture discussion
- Simpler cores scale more easily.
- Memory bottlenecks dominate modern systems.
- Directory coherence helps multi-core scalability.
Memory bottleneck insight
- Reads can be shared between caches.
- Writes serialize through memory controllers.
- Memory bandwidth becomes the limiting factor.
Conclusion
This paper highlights how persistent memory introduces a new set of design challenges that cannot be understood by simply thinking of it as slower DRAM. Through real hardware measurements, the authors show that performance is highly sensitive to access size, locality, and concurrency, and that many assumptions made in earlier emulation-based studies do not hold in practice. The key takeaway is that software must be designed with clear awareness of how the hardware actually behaves. In practice, this means favoring sequential access patterns, avoiding small random writes, and carefully managing thread placement and memory topology. More broadly, the paper serves as a reminder that emerging hardware technologies often require rethinking established design intuition, and that meaningful system optimization ultimately depends on evaluating real devices rather than relying solely on simulation or emulation.
AI Disclosure
ChatGPT was used to summarize the paper, notes, and class discussion. The generative AI created a template which was then reviewed, edited and revised by the group.