The ECE 4/599 Course Blog

An Empirical Guide to the Behavior and Use of Scalable Persistent Memory

by Shuxian Qin (Blogger), Shuyi Zheng (Scribe), Noah Bean (Leader)

Introduction

This blog discusses “An Empirical Guide to the Behavior and Use of Scalable Persistent Memory.” This work stands out as the first real-world performance benchmark of Intel’s Optane DIMMs, a groundbreaking technology that brings scalable, byte-addressable persistent memory to the computing landscape. Designed to bridge the gap between the fast but volatile DRAM and the slower, non-volatile SSDs, Optane DIMMs promised a new era of memory systems after nearly a decade of anticipation. The paper dives into how these devices actually perform, revealing surprises that upend assumptions from years of research based on emulated setups like DRAM stand-ins. Far from being just “slower DRAM,” Optane exhibits unique quirks that demand fresh thinking. Its authors set out not only to measure Optane’s behavior but also to deliver practical guidelines for software developers to optimize their code for this novel hardware. This blog unpacks their findings, shedding light on a technology that’s as complex as it is promising.

Background and Context

Persistent memory represents a transformative leap in the memory hierarchy, blending traits that set it apart from traditional technologies. Unlike DRAM, which loses its contents when power is cut, persistent memory is non-volatile, retaining data even after shutdowns. It’s denser too, packing more storage capacity into less space at a lower cost per gigabyte than DRAM. This makes it a versatile player, sitting between DRAM’s speed and SSDs’ persistence. Its byte-addressable nature—allowing direct access to individual memory locations—unlocks a range of use cases: speeding up database transaction logging and in-memory databases, providing fast persistent caching for high-performance computing (HPC), boosting memory-intensive virtualization in cloud setups, and accelerating storage as a cache or tiered layer.

For over a decade, researchers have been gearing up for persistent memory’s arrival, but without real hardware like Intel’s Optane DIMMs, they relied on emulations—often using DRAM tweaked to mimic non-volatile memory (NVM). These setups, sometimes paired with artificial delays or NUMA effects, fueled speculation about how NVM would behave. However, powered by 3D XPoint technology (a joint Intel-Micron innovation), Optane does not behave as expected. Unlike DRAM’s straightforward capacitor-based design, 3D XPoint introduces “second-order changes”—subtle, cascading effects in latency, bandwidth, and concurrency that add a new layer of complexity. Where DRAM emulation suggested predictable performance, Optane’s real-world behavior defies those models, revealing quirks that prior speculations couldn’t anticipate. This shift challenges the research community to rethink how software interacts with memory, setting the stage for the detailed benchmarks and insights this paper delivers.

Summary of the Paper

1. Introduction

The paper discusses the decade-long wait for scalable non-volatile memory (NVM), a technology researchers have speculated about since the early 2000s. Intel’s Optane DIMMs finally arrived in 2019, offering a real-world testbed to challenge those expectations. The authors set out to measure how this persistent memory performs, aiming to replace guesswork from emulated setups with data.

2. Background and Methodology

This section explains how Optane works. It plugs into Intel's Cascade Lake processors (dual-socket, 24 cores per socket, six memory channels per socket, 3 TB of Optane in total). Unlike DRAM, Optane performs internal writes in 256B blocks, managed by an on-DIMM controller (the XPController) working with the CPU's integrated memory controller (iMC), which maintains write pending queues (WPQs). The Asynchronous DRAM Refresh (ADR) feature ensures that data reaching the WPQs persists through a power loss. The study focuses on App Direct mode—where Optane acts as a standalone persistent device—running Fedora 27 with a custom kernel on a platform pairing 384 GB of DRAM with one 256 GB Optane DIMM per memory channel.
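As a concrete (hedged) illustration of App Direct mode, the sketch below maps a file from a DAX-mounted filesystem and pushes a store into the ADR domain with an explicit flush. The path /mnt/pmem/log, the buffer size, and the use of CLFLUSH (rather than the lighter-weight CLWB that Optane platforms support) are our assumptions chosen for portability, not details from the paper.

```c
/* Minimal App Direct sketch: map a DAX file and persist one write.
 * Assumes a DAX-capable filesystem (ext4/xfs mounted with -o dax)
 * and a kernel/glibc recent enough for MAP_SYNC. Error handling is
 * trimmed for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define PMEM_LEN 4096

int main(void) {
    int fd = open("/mnt/pmem/log", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, PMEM_LEN) != 0) return 1;

    /* MAP_SYNC guarantees this is a true DAX mapping of the media. */
    char *pmem = mmap(NULL, PMEM_LEN, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (pmem == MAP_FAILED) return 1;

    strcpy(pmem, "hello, persistent world");   /* fits in one 64B line */

    /* Flush the dirty line out of the cache; once it reaches the
     * iMC's WPQ, ADR guarantees it is durable. */
    _mm_clflush(pmem);
    _mm_sfence();

    munmap(pmem, PMEM_LEN);
    close(fd);
    return 0;
}
```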

3. Performance Characterization

Read latency clocks in at 2x-3x higher than DRAM, with random reads hit harder than sequential ones because small, scattered accesses defeat the on-DIMM XPBuffer and Optane's 256B internal access granularity. Tail latency spikes—up to 50 µs, 100x the norm—pop up unpredictably, likely from wear leveling. Bandwidth tells a split story: reads scale well with threads (up to 16), but writes plateau early (around 4-12 threads) and dip with more concurrency, far less scalable than DRAM's steady climb. Small random writes tank performance, while sequential ones hold up better.
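A pointer-chasing loop of the following shape is the standard way to measure raw load latency, in the spirit of the paper's microbenchmarks. The buffer size and iteration count are illustrative, and a malloc'ed DRAM buffer stands in for a DAX-mapped Optane region.

```c
/* Pointer-chasing latency sketch: each load depends on the previous
 * one, so the prefetcher cannot hide the access time. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1UL << 24)   /* 16M pointers, ~128 MB */

int main(void) {
    uint64_t *chain = malloc(N * sizeof(uint64_t));
    if (!chain) return 1;

    /* Sattolo's shuffle: yields a single cycle covering every slot,
     * i.e. a fully random traversal order. */
    for (uint64_t i = 0; i < N; i++) chain[i] = i;
    for (uint64_t i = N - 1; i > 0; i--) {
        uint64_t j = rand() % i;
        uint64_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    uint64_t p = 0;
    for (uint64_t i = 0; i < N; i++) p = chain[p];
    clock_gettime(CLOCK_MONOTONIC, &b);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("avg load latency: %.1f ns (sink: %llu)\n",
           ns / N, (unsigned long long)p);
    return 0;
}
```

Running the same loop with a sequential (stride-1) traversal instead of the shuffled cycle shows the sequential-vs-random gap the paper reports.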

4. Comparison to Emulation

Past NVM studies used DRAM with delays or NUMA tricks, predicting Optane as “slower DRAM.” Real tests flip that script. In RocksDB, emulation favored fine-grained persistence (moving the memtable to NVM), but on Optane, the FLEX approach (sequential write-ahead logging) wins by 10%, thanks to Optane's dislike of small random writes. Emulated guesses missed this nuanced behavior.
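A rough sketch of the FLEX-style idea: instead of persisting each update at its (random) home address, append records to a sequential log and flush only the cache lines the record touches. The names log_append and LOG_CAP are ours, and the in-memory buffer stands in for a DAX-mapped region.

```c
/* Sequential write-ahead-log sketch. Appends keep the write stream
 * sequential and 256B-friendly; wraparound/checkpointing is elided. */
#include <string.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define LOG_CAP (1 << 20)
static char log_buf[LOG_CAP] __attribute__((aligned(256)));
static size_t log_tail = 0;

int log_append(const void *rec, size_t len) {
    if (log_tail + len > LOG_CAP) return -1;   /* wrap elided */
    memcpy(log_buf + log_tail, rec, len);
    /* Flush only the 64B lines this record touches. */
    for (size_t off = log_tail & ~63UL; off < log_tail + len; off += 64)
        _mm_clflush(log_buf + off);
    _mm_sfence();                              /* order before commit */
    log_tail += len;
    return 0;
}
```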

5. Best Practices for Optane DIMMs

The paper distills four guidelines from microbenchmarks: avoid small random writes under 256B (they amplify due to block size), use non-temporal stores (ntstores) for large sequential writes (bypassing cache boosts bandwidth), limit threads hitting one DIMM (more than 4-12 thrash the XPBuffer and iMC), and steer clear of NUMA accesses (remote hits crater performance, up to 30x worse than DRAM’s 3.3x).
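The second guideline is the easiest to show in code. Below is a hedged sketch of a non-temporal copy using SSE2 streaming stores; the destination is assumed 16B-aligned and the length a multiple of 16 (real code would handle unaligned tails).

```c
/* Non-temporal copy sketch: streaming stores bypass the cache, so
 * large sequential writes flow straight toward the WPQ, where
 * write-combining merges them into full-line (and ideally full
 * 256B-XPLine) writes. */
#include <stddef.h>
#include <emmintrin.h>   /* _mm_stream_si128, _mm_loadu_si128, _mm_sfence */

void nt_copy(void *dst, const void *src, size_t len) {
    __m128i *d = (__m128i *)dst;              /* must be 16B-aligned */
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
    _mm_sfence();   /* ensure streamed data is ordered/visible */
}
```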

6. Discussion

Looking ahead, the authors muse on how these tips might evolve. Extended ADR (eADR), which pulls CPU caches into the persistent domain, could eliminate the need for ntstores. CXL might shift persistent memory paradigms, while larger XPBuffers or internal blocks smaller than 256B could tweak the trade-offs—though power limits complicate both. The guidelines offer a roadmap for future NVM, even if Optane's specifics change.

7. Related Work

Before Optane, NVM research relied on emulation. These emulation-based studies shaped file systems, transactional models, and data structures, but lacked real-hardware validation. This paper marks a pivot to tangible benchmarks.

8. Conclusion

Optane DIMMs emerge slower (2x-3x DRAM latency) and more complex than expected, with thread scaling and write granularity tripping up performance. Coders bear the burden of optimization. As the first real benchmark of persistent memory, this study swaps speculation for data, spotlighting Optane’s promise and pitfalls.

Important Results and What They Mean

Emulation Errors: For years, researchers emulated non-volatile memory (NVM) with DRAM, tweaking it to guess how Optane might behave. Real measurements overturned those guesses, exposing behavior the emulations never captured.

Latency/Bandwidth: Read latency hits 2x-3x higher than DRAM, with random reads suffering more than sequential ones due to internal buffering limits. Write bandwidth lags too, far below DRAM's, and scalability takes a hit: while DRAM's bandwidth climbs monotonically with thread count, Optane peaks at 4-12 threads (reads at 16, writes at 4) and then drops off.
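Here is a minimal sketch of the kind of threaded bandwidth probe that exposes this peak-then-decline curve. The thread count, strip size, and malloc'ed DRAM buffer are illustrative stand-ins of ours; pointing the buffer at a DAX mapping would target Optane instead. Build with -pthread.

```c
/* Multi-threaded write-bandwidth sketch: each thread streams stores
 * into a private strip of one shared buffer. Vary NTHREADS and watch
 * aggregate GB/s peak and then decline on Optane. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NTHREADS 8
#define STRIP (64UL << 20)   /* 64 MB per thread */

static char *buf;

static void *writer(void *arg) {
    size_t id = (uintptr_t)arg;
    memset(buf + id * STRIP, 0xA5, STRIP);   /* sequential stores */
    return NULL;
}

int main(void) {
    buf = malloc(NTHREADS * STRIP);
    if (!buf) return 1;
    pthread_t t[NTHREADS];
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)(uintptr_t)i);
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    double s = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    printf("%d threads: %.2f GB/s\n", NTHREADS,
           NTHREADS * STRIP / s / 1e9);
    return 0;
}
```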

256B Granularity: Small random writes (under 256B) amplify—updating one byte means rewriting the whole block, slashing efficiency and bandwidth. Sequential writes, though, shine, aligning with the block size to minimize overhead. This split explains why random access patterns tank performance while sequential ones sustain it.
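To see why sub-256B writes hurt, here is a toy calculation—our own, derived from the paper's description of the 256B XPLine—of write amplification for aligned writes of various sizes:

```c
/* Toy 256B write-amplification illustration: the DIMM rewrites whole
 * XPLines internally, so device bytes written >= user bytes written.
 * Assumes writes are aligned to the XPLine. */
#include <stdio.h>

int main(void) {
    for (unsigned w = 64; w <= 1024; w *= 2) {
        unsigned xplines = (w + 255) / 256;          /* blocks touched */
        double amp = (double)(xplines * 256) / w;    /* device/user bytes */
        printf("%4u B write -> %u XPLine(s), amplification %.2fx\n",
               w, xplines, amp);
    }
    return 0;
}
```

A 64B store forces a full 256B internal write—4x amplification—while aligned writes of 256B or more incur none.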

Tail Latency: Rare tail-latency spikes—up to 50 µs, 100x the typical latency—undercut predictability. These outliers, likely from wear leveling or internal remapping, are sparse (0.006% of accesses) but serious. For systems needing predictability, like databases or real-time apps, this is a red flag.

Strengths and Weaknesses of the Paper

Strengths

First Real Optane Benchmark: This paper stands out as the first published characterization of Intel's Optane DIMMs on actual hardware rather than emulation.

Practical Guidelines with Case Studies: Beyond measurements, it offers actionable advice—avoid small writes, limit threads, use ntstores—backed by real-world tweaks.

Weaknesses

Limited Testing Scope: The paper’s focus narrows to RocksDB and NOVA, with testing confined to App Direct mode. This leaves Memory Mode and broader applications unexplored, raising questions about how well the findings generalize across diverse workloads or setups.

No Cost-Benefit Analysis or NVM Comparisons: It skips the economics—Optane costs dramatically more than SSDs (though less per gigabyte than DRAM), but there's no breakdown of trade-offs or benefits to justify it. Nor does it stack Optane against other NVM technologies, leaving a gap in context.

Class Discussion

Capacitors in 3D XPoint: The 3D XPoint explainer video sparked curiosity. Unlike DRAM's 1T1C (one transistor, one capacitor) design, 3D XPoint's crosspoint structure drops the transistor-capacitor pair entirely, storing bits as resistance changes in cells stacked at wire intersections—an arrangement that enables dense packing and fast bit-level access.

256B Block Size: Why 256B? Is it baked into Optane’s DNA or a deliberate pick? Eugene wondered if smaller blocks (say, 64B) could ease small-write woes—random writes under 256B tank performance due to whole-block updates. Kyle Hale guessed 256B strikes a “good medium ground,” balancing efficiency and complexity. Smaller blocks might cut amplification but could spike power demands or shrink density, a trade-off Optane’s designers might have dodged.

NUMA Puzzle: Optane's NUMA penalty puzzled the class—up to a 30x bandwidth drop for remote accesses, versus DRAM's 3.3x. The class eyed iMC contention: mixed read-write loads (like 3:1 ratios) clog short queues, and remote traffic makes it worse. Eugene suggested it's less the iMC and more NVDIMM thrashing; either way, DRAM weathers NUMA far better.
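For completeness, here is a minimal libnuma sketch of the "stay local" advice: pin the thread to the socket whose channels host the target DIMMs and allocate from that node. Node 0 is an assumption; build with -lnuma.

```c
/* NUMA-local placement sketch using libnuma. Keeping the thread and
 * its allocations on one node avoids the remote-access penalty. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    numa_run_on_node(0);                          /* pin thread to socket 0 */
    char *local = numa_alloc_onnode(1 << 20, 0);  /* 1 MB on node 0 */
    if (!local) return 1;
    memset(local, 0, 1 << 20);                    /* accesses stay local */
    numa_free(local, 1 << 20);
    return 0;
}
```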

Thread Scaling: Bandwidth peaks at 4-12 threads, then drops. Eugene blamed XPBuffer thrashing—multiple threads trash its 16KB write-merging space. The class added iMC queues to the picture: slow Optane drainage blocks them, slashing throughput. Is it buffer size, queue depth, or both? How do we keep threads from tripping over each other?

Interconnect Efficiency: Optane's interconnect got scrutiny. How does it handle mixed read-write access? The class asked about the paper's gigabytes-per-second graphs, wondering whether the interconnect holds up when reads and writes interleave unpredictably.

Commercial Fate: Intel's 2022 Optane exit, after a $7B loss, hit hard. This was referred to as being “swept under the rug.” Supply-chain difficulties and high costs tanked it. Micron also got out of persistent memory, but tinkering with the technology persists, and CXL is rising for persistent and disaggregated setups.

eADR Shift: eADR came up—extended, cache-inclusive ADR from Intel's 3rd-gen Xeons. It simplifies coding (no ntstores or flushes required) and boosts bandwidth by pulling the CPU caches into the persistent domain. The class saw it easing many of Optane's quirks.
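The coding difference is easy to sketch. Under plain ADR a store must be explicitly flushed out of the cache to become durable; under eADR the caches sit inside the persistent domain, so ordering alone suffices. A minimal illustration (dst assumed to be a DAX mapping):

```c
/* ADR vs. eADR persistence paths, side by side. */
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

void persist_adr(volatile uint64_t *dst, uint64_t v) {
    *dst = v;
    _mm_clflush((const void *)dst);  /* push line into the ADR domain */
    _mm_sfence();
}

void persist_eadr(volatile uint64_t *dst, uint64_t v) {
    *dst = v;        /* cache is already persistent under eADR */
    _mm_sfence();    /* ordering only; no flush needed */
}
```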

Security Risks: DIMM theft was mentioned. If someone takes an Optane DIMM, data’s readable. How big of a problem is this? For databases or HPC, it’s a dealbreaker; consumer use, maybe less so. Why no mitigation discussion in the paper?

Hybrid Potential: The class pitched pairing DRAM (1T1C speed) with Optane (persistence) and liked the split: DRAM for fast random access, Optane for big persistent stores.

Sources:

Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steven Swanson. “An Empirical Guide to the Behavior and Use of Scalable Persistent Memory.” 18th USENIX Conference on File and Storage Technologies (FAST '20), 2020.

Generative AI

| Metric | ChatGPT 4o (GPT-4o) | DeepSeek R1 | Gemini 2.0 Flash | Claude 3.7 Sonnet |
| --- | --- | --- | --- | --- |
| MMLU Accuracy | 88.7% | 85% | 80% | 88% |
| TruthfulQA Accuracy | 65% | 60% | 55% | 62% |
| HellaSwag Accuracy | 96% | 94% | 90% | 95% |
| GSM8K Accuracy | 95% | 92% | 85% | 94% |
| BBH Accuracy | 85% | 75% | 70% | 85% |
| HumanEval Pass Rate | 85% | 80% | 75% | 84% |