Lesson 4: NUMA
- discussion thread
^ What is NUMA?
  The Linux Kernel Docs
^ KSM overview
  LWN article on KSM
^ Kernel Same Page Merging
  The Linux Kernel Docs on KSM
* numactl
  numactl command on Linux (tasks due January 15)
On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.
Notes
I will cover some core concepts you’ll need to really get the most out of this paper. The slides can be found on Canvas.
NUMA Overview
- NUMA stands for non-uniform memory access.
- The basic idea is that we have multiple domains (NUMA nodes) of memory, and different CPU cores in the system will have different access latencies to the different domains. This allows for more scalable systems, but introduces the complication for system software of minimizing access latency. Namely, in general we want our threads running on our cores to be as close as possible to the memory they use. This is referred to as affinity.
- NUMA is implemented at the hardware level. There are some platform firmware controls on how the physical address space gets decomposed. There are two primary approaches here:
- Interleaving: If we want to pretend like we’re not running on a NUMA system, we can interleave physical addresses across the NUMA nodes. For example, if we have 2 NUMA nodes, the memory controller would be set up to route physical address 0 to NUMA node 0, address k to NUMA node 1 (where k is the word size), address 2k to NUMA node 0, 3k to node 1, and so on.
- Regions: the physical address space is subdivided into linear chunks. If there are n NUMA nodes, a simple approach would just divide the size of the address space (call it a) by n. Then addresses 0 to (a/n - 1) route to NUMA node 0, addresses a/n to (2a/n - 1) route to node 1, and so on.
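To make the two decompositions concrete, here is a small arithmetic sketch (purely illustrative; it is not how any particular memory controller is actually programmed) that maps a physical address to a node under each scheme. Here k is the interleave granularity in bytes (the word size in the example above), n the number of NUMA nodes, and a the size of the physical address space.

```c
#include <stdint.h>

/* Interleaving: consecutive k-byte chunks rotate round-robin across nodes. */
static int node_interleaved(uint64_t paddr, uint64_t k, int n)
{
    return (int)((paddr / k) % n);
}

/* Regions: the address space (a bytes total) is cut into n equal linear chunks. */
static int node_region(uint64_t paddr, uint64_t a, int n)
{
    return (int)(paddr / (a / n));
}
```

Roughly speaking, interleaving lets software ignore NUMA and spreads traffic across the memory controllers, while regions preserve locality so the OS has something to exploit with affinity.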
NUMA policies
- The two core issues that arise with NUMA are (1) mapping memory, and (2) mapping threads. When we (the OS) allocate memory, we have to decide which NUMA node to put it on. There are many policy options we could come up with for this.
- A typical data placement policy is the “first touch” approach. When a page first gets demand-allocated because a process touched unmapped memory, the OS has to pick a physical page, and it can decide which NUMA node that page comes from. Under first touch, it picks a page from the NUMA node closest to the CPU that the faulting thread is running on (the sketch after this list shows one way to observe this). This generally works well, but hopefully you can see that if a page is shared, it becomes less clear whether or not this is a good idea.
- These policies can be tweaked, for example when the programmer has better knowledge of access patterns than the OS can surmise; see numactl, and the sketch below, for examples.
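As a concrete (and purely illustrative) example of both points, the sketch below pins the calling thread to node 0, touches a freshly mapped page, and asks the kernel which node that page landed on; under a first-touch policy it should report node 0. It then overrides the default by allocating directly on a chosen node and by interleaving, the programmatic counterpart of wrapping a whole program with numactl (e.g., numactl --membind=0 ./app places all of app's memory on node 0, while numactl --interleave=all ./app spreads it round-robin). This assumes a Linux machine with libnuma installed; compile with -lnuma.

```c
#include <numa.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* Part 1: observe first touch. Run on node 0's CPUs, touch a page,
     * then ask which node the kernel actually placed it on. */
    numa_run_on_node(0);

    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    p[0] = 1;                                   /* first touch: demand-allocates the page */

    void *pages[1] = { p };
    int status[1];
    numa_move_pages(0, 1, pages, NULL, status, 0);  /* nodes == NULL => query only */
    printf("first-touched page landed on node %d\n", status[0]);
    munmap(p, len);

    /* Part 2: override the default policy. Bind one buffer to node 0 and
     * interleave another across all nodes. */
    size_t sz = 64UL * 1024 * 1024;
    void *near  = numa_alloc_onnode(sz, 0);
    void *split = numa_alloc_interleaved(sz);
    printf("nodes available: 0..%d\n", numa_max_node());

    numa_free(near, sz);
    numa_free(split, sz);
    return 0;
}
```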
Scale-out NUMA
Deduplication and KSM
Deduplication is a general technique used in systems to reduce memory usage. The basic observation (which can be leveraged in many domains) is that applications tend to have duplicate instances of data. When that occurs, instead of maintaining multiple copies, the system can create one copy of the data and store references to that data instead.
We can apply dedup to paging. Here, we’d like to examine all physical pages used on our system, and if we find that there are two (or more) pages that contain the same content, we can merge them into one physical page and update all of the PTEs that referenced the duplicates to point to this one page. We can even do this across processes, and it will often be useful to do so. The Linux kernel has implemented this in the form of kernel same page merging (KSM).
KSM runs as a low-priority background kernel thread (ksmd) that scans physical pages, merges duplicates, and updates PTEs to reflect the merge.
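One practical detail: ksmd does not scan every page on the system; a process has to opt regions in with madvise(MADV_MERGEABLE). A minimal sketch of that opt-in, assuming a kernel built with CONFIG_KSM and ksmd switched on (echo 1 > /sys/kernel/mm/ksm/run):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 4096;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memset(buf, 0xAB, len);   /* give every page identical content */

    /* Tell ksmd these pages are candidates for merging. */
    if (madvise(buf, len, MADV_MERGEABLE) != 0)
        perror("madvise");

    /* ksmd merges lazily in the background; progress shows up in
     * /sys/kernel/mm/ksm/pages_shared and pages_sharing. */
    getchar();                /* keep the mapping alive while you look */

    munmap(buf, len);
    return 0;
}
```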
Think about what would happen if duplicate pages across processes have been turned into a shared page, and then one of those processes goes to try to write its copy. What should the kernel do?
You should also think about this algorithmically. What’s the complexity of a KSM algorithm? Hint: you can identify duplicate pages by hashing their contents and comparing hashes (rather than a byte-by-byte comparison). How do you avoid merging pages that will soon be written to?
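One way to make the hint concrete: hash every page once (linear in the number of pages), group pages by hash, and fall back to a byte-wise comparison only when two hashes match. The user-space sketch below does exactly that over an array of page-sized buffers; it is just the idea behind the hint, not the kernel's actual KSM algorithm (which organizes candidate pages in red-black trees keyed by their contents).

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct entry {
    uint64_t hash;
    size_t   idx;
};

/* FNV-1a over one page; any reasonably fast hash would do. */
static uint64_t hash_page(const unsigned char *p)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static int cmp_entry(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    return (x->hash > y->hash) - (x->hash < y->hash);
}

/* Hash every page (one linear pass over memory), sort by hash
 * (O(n log n) in the page count), then memcmp only neighbours whose
 * hashes collide to rule out false positives. */
static void find_duplicates(unsigned char (*pages)[PAGE_SIZE], size_t n)
{
    struct entry *e = malloc(n * sizeof(*e));
    if (!e)
        return;
    for (size_t i = 0; i < n; i++)
        e[i] = (struct entry){ hash_page(pages[i]), i };

    qsort(e, n, sizeof(*e), cmp_entry);

    for (size_t i = 1; i < n; i++)
        if (e[i].hash == e[i - 1].hash &&
            memcmp(pages[e[i].idx], pages[e[i - 1].idx], PAGE_SIZE) == 0)
            printf("page %zu and page %zu could be merged\n",
                   e[i - 1].idx, e[i].idx);

    free(e);
}

int main(void)
{
    static unsigned char pages[4][PAGE_SIZE];
    memset(pages[0], 0xAA, PAGE_SIZE);
    memset(pages[1], 0xBB, PAGE_SIZE);
    memset(pages[2], 0xAA, PAGE_SIZE);   /* duplicate of page 0 */
    memset(pages[3], 0xCC, PAGE_SIZE);
    find_duplicates(pages, 4);
    return 0;
}
```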