Managing Memory Tiers with CXL in Virtualized Environments
Introduction
This blog post discusses the paper “Managing Memory Tiers with CXL in Virtualized Environments”. The paper examines CXL (Compute Express Link), a high-speed interconnect standard designed to expand memory capacity and enable efficient memory sharing across devices. The key challenge lies in managing latency-sensitive workloads in virtualized environments, where traditional software-managed tiering incurs prohibitive overheads.
Executive Summary
Background:
- CPU core counts are scaling faster than memory capacity.
- CXL enables a second tier of memory to facilitate core scaling.
- But CXL adds latency that hurts performance if not mitigated.
- Software tiering helps somewhat but is not well suited to public clouds.

Contributions:
- Intel Flat Memory Mode: the first hardware-managed memory tiering for CXL. It still has limitations that degrade some workloads.
- Memstrata: a memory allocator for hardware tiering that mitigates outlier workloads.
- Together they reduce slowdown to ~5% vs. an unattainable one-tier (all-local) memory configuration.
Class Discussion:
So I think that, as of this month, you can buy Xeon 6 chips with flat memory mode. These chips also have P-cores (performance cores). Hybrid CPU architectures, which pair small, low-power cores (E-cores) with strong cores (P-cores), started in mobile chips, were later released for desktop chips, and have recently spread to other chip lines.
Is this related to dark silicon? Not quite; it differs from the dark silicon proposals that started around 2006. Those grew out of the concern that you could not power all parts of a chip at once, which led to proposals to turn off unused portions of the chip to save power. The current philosophy is instead to combine low-power parts with higher-power parts, and the trend of increasing power usage is plateauing.
Why is there a hump on the graph comparing core count to memory capacity per core? Memory capacity got more efficient around 2016. The counts could also be averaged across both AMD and Intel processors.
Intel Flat Memory Mode is essentially a direct-mapped, exclusive cache. How much of the slowdown is caused by conflict misses? You could still get significant conflict misses even with page coloring due to the direct-mapped nature of the cache, which effectively cuts the cacheable address space in half: because the cache is exclusive, two memory locations map to each cache line (a small sketch after this discussion illustrates the mapping). To have a system with no conflict misses, you would need a 2-way set-associative cache, and it is not clear why they did not try one. Tag comparisons are done in parallel in the DRAM with n-way associative caches, but that would be intrusive, and since this paper is supposed to integrate into a product, they were presumably conservative.
Did Intel model associativity with these workloads? How much is Memstrata compensating for limited associativity in the cache?
The paper does not discuss how tagging is done. Could they have presented how different associativities affect performance?
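Here is a minimal sketch of the conflict-miss question raised above (the sizes and the mapping are assumptions for illustration, not Intel's documented hardware mapping): in a direct-mapped, exclusive DRAM cache that covers half the physical address range, each local-DRAM set backs exactly two physical lines, and page coloring cannot separate two addresses that already share a set index.

```python
# Minimal sketch, not Intel's actual mapping: direct-mapped, exclusive DRAM
# cache indexing, and why page coloring cannot remove conflicts between two
# addresses that share a set.

LOCAL_DRAM_BYTES = 64 << 30        # assumed local DRAM size: 64 GiB
LINE_BYTES = 64                    # cache-line granularity
NUM_SETS = LOCAL_DRAM_BYTES // LINE_BYTES

def dram_cache_set(phys_addr: int) -> int:
    """Direct-mapped index: line number modulo the number of sets."""
    return (phys_addr // LINE_BYTES) % NUM_SETS

def page_color(phys_addr: int, page_bytes: int = 4096) -> int:
    """Page color: the set bits above the page offset. The hypervisor can
    spread a VM's pages across colors, but within one color the cache is
    still direct-mapped."""
    return dram_cache_set(phys_addr) // (page_bytes // LINE_BYTES)

a = 0x1000
b = a + LOCAL_DRAM_BYTES           # same set index, different tag
assert dram_cache_set(a) == dram_cache_set(b)   # a and b compete for one slot
assert page_color(a) == page_color(b)           # coloring cannot separate them
```

With a 2-way set-associative design, both a and b could be resident at once, which is why the discussion above asks how much slowdown comes purely from limited associativity.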
Noisy neighbor problem: another VM on the same chip is using lots of memory bandwidth. Hardware solutions to this already exist on both ARM and x86 (ARM: MPAM, x86: RDT, Resource Director Technology); they use a hardware identifier to distinguish VMs, support cache partitioning, and can give each VM a set amount of bandwidth. Why is the noisy neighbor problem still such a large problem if solutions already exist? Does Intel flat memory mode not implement these existing mechanisms? Maybe they do not translate to CXL very well: the existing technologies live on the core side, while the paper's hardware lives on the memory controller, so the mechanisms would have to be replicated on the memory controller as well. Or maybe, since this is a first-generation product, it is simply not as ambitious.
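For context on what the existing core-side mechanisms look like, here is a minimal sketch of capping a noisy VM's memory bandwidth with Intel RDT's Memory Bandwidth Allocation through the Linux resctrl interface. The group name, percentage, and the assumption that resctrl is already mounted at /sys/fs/resctrl with MBA support are assumptions about the host, not anything from the paper.

```python
# Minimal sketch: throttle a noisy VM's memory bandwidth with Intel RDT (MBA)
# via the Linux resctrl filesystem. Assumes resctrl is mounted at
# /sys/fs/resctrl and the CPU supports Memory Bandwidth Allocation.
import os

RESCTRL = "/sys/fs/resctrl"

def throttle_vm(vcpu_pids, group="noisy_vm", mb_percent=20, socket=0):
    group_dir = os.path.join(RESCTRL, group)
    os.makedirs(group_dir, exist_ok=True)
    # Limit the group to mb_percent of memory bandwidth on the given socket.
    with open(os.path.join(group_dir, "schemata"), "w") as f:
        f.write(f"MB:{socket}={mb_percent}\n")
    # Move the VM's vCPU threads into the group (one PID per write).
    for pid in vcpu_pids:
        with open(os.path.join(group_dir, "tasks"), "w") as f:
            f.write(f"{pid}\n")
```

This kind of control sits on the core/uncore side, which is the point raised above: it would have to be mirrored onto the memory controller to help the DRAM-cache contention the paper targets.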
Which VM is using more local memory vs. CXL memory? You could use this information to swap pages between VMs. Could this cause ping-ponging, where one VM wants to use a page shortly after it has been swapped away? That ping-ponging would cause a slowdown, and it could occur in a memory-hungry application. You would want a control algorithm that maintains the correct ratio of local to CXL pages for each VM. The paper uses an exponential moving average to do this, but it can still cause oscillations (ping-ponging).
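A minimal sketch of the kind of control loop being discussed (the smoothing factor, dead band, batch size, and pressure metric are assumptions, not the paper's exact algorithm): smooth each VM's measured pressure on local DRAM with an exponential moving average and only migrate pages when the smoothed gap between VMs exceeds a dead band.

```python
# Minimal sketch of an EMA-driven page rebalancer between VMs; the parameter
# values and the pressure metric are assumptions for illustration.

ALPHA = 0.2           # EMA smoothing factor (assumed)
DEAD_BAND = 0.05      # minimum smoothed gap before moving any pages (assumed)
PAGES_PER_STEP = 256  # migrate small batches per interval to limit oscillation

class VmStats:
    def __init__(self):
        self.ema_pressure = 0.0   # smoothed demand for local-DRAM pages

    def update(self, sample: float) -> float:
        # Exponential moving average: new = alpha * sample + (1 - alpha) * old
        self.ema_pressure = ALPHA * sample + (1 - ALPHA) * self.ema_pressure
        return self.ema_pressure

def rebalance(vms, samples):
    """Return (donor_vm, receiver_vm, npages) decisions for this interval."""
    for name, sample in samples.items():
        vms[name].update(sample)
    ranked = sorted(vms.items(), key=lambda kv: kv[1].ema_pressure)
    (cool_name, cool), (hot_name, hot) = ranked[0], ranked[-1]
    if hot.ema_pressure - cool.ema_pressure > DEAD_BAND:
        return [(cool_name, hot_name, PAGES_PER_STEP)]
    return []  # inside the dead band: do nothing, which damps ping-ponging
```

The dead band and the small per-step batch are the usual knobs for damping oscillation, but as the discussion notes, an EMA alone does not eliminate it.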
An interesting difference compared to other far-memory systems: we are not yet directly memory-mapping CXL memory into the physical address space, which means you can't map CXL memory directly to cache lines. Why is this the case? TPP (Transparent Page Placement) does this in software. Could they mark some pages as "don't pull into DRAM"? Yes, and in some cases you might want to. Non-temporal loads/stores already exist; do they affect the DRAM cache, i.e., whether CXL memory gets brought into local DRAM?
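For contrast, here is a minimal sketch of what software-managed placement looks like when CXL memory is exposed as a CPU-less NUMA node (the TPP-style setup). The node id and the workload binary are assumptions; check `numactl --hardware` on the actual machine.

```python
# Minimal sketch: run a workload with its allocations bound to a CXL expander
# that Linux exposes as a CPU-less NUMA node. Node id and binary are assumed.
import subprocess

CXL_NODE = 1  # assumption: the CXL memory shows up as NUMA node 1

def run_on_cxl(cmd):
    # numactl --membind places the process's allocations on CXL_NODE, which in
    # a software-tiered setup keeps those pages in far memory.
    return subprocess.run(["numactl", f"--membind={CXL_NODE}", *cmd], check=True)

run_on_cxl(["./my_workload"])   # hypothetical workload binary
```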
CXL memory looks desirable because normal DIMM connections do not scale, and it is much cheaper. It also allows recycling old DIMM chips as CXL memory, with a backwards-compatibility caveat: the older DDRx technology would need to be supported by the hardware that CXL uses for far memory. When decommissioning a datacenter, all of the old memory could be reused in these CXL memory pools. Another common proposal is to put persistent memory in these CXL memory pools.
When do you think we will have Thunderbolt memory extenders, so that you can run 10 more crontabs on a PCIe card connected via Thunderbolt? External GPUs can already be plugged in over Thunderbolt. GPU prices are high right now; what if your friend is not using their GPU at the moment? You could plug into their GPU. Cloud GPUs also support multiple tenants. Imagine LAN parties with multiple people plugging into the same GPU.