Lesson 3: Virtual Memory
- discussion thread
- tasks due January 13

Readings:
^ OSTEP Textbook, Chapters 18-20. Remzi Arpaci-Dusseau.
* One-Level Storage System. Kilburn et al., IRE Transactions on Electronic Computers, 1962.
On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.
Notes
I will cover some core concepts you’ll need to really get the most out of this paper. The slides can be found on Canvas.
VM Basics
Basic techniques for virtual memory:
- Manual coordination
- Timesharing (dumping memory on context switch)
- Static relocation (using compiler)
- Programmable base
- Programmable base + bounds
- Segmentation
Most of these either put undue burden on the user or the toolchain, or impose strict contiguity restrictions on where a process's memory can live. (A base + bounds translation sketch follows this list.)
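To see why base + bounds is simple but restrictive, here is a minimal sketch of the check-and-add a CPU would perform on every memory access. The register names and fault handling are illustrative assumptions, not any particular ISA:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-process relocation registers, loaded by the OS on
 * every context switch. */
typedef struct {
    uint64_t base;   /* physical address where the process's memory starts */
    uint64_t bounds; /* size of its single contiguous region */
} relocation_regs;

/* Translate a virtual address to a physical one. The entire address
 * space must be one contiguous physical region, which is exactly the
 * restriction paging removes. */
bool translate(const relocation_regs *r, uint64_t va, uint64_t *pa) {
    if (va >= r->bounds)
        return false;        /* out of bounds: raise a protection fault */
    *pa = r->base + va;
    return true;
}
```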
Page Tables
- Translate an arbitrary virtual page number (VPN) into an arbitrary physical page number (PPN) or page frame (PFN).
- Typical page size is 4K
- Indexing into the translation table (page table) is done by breaking the virtual address into sub-fields: the offset within the page (lower bits) and the PT index (higher bits); see the sketch after this list.
- Page tables live in memory and need hardware support (at the very least, a page walker and a new base register to hold the starting (physical) address of the PT).
- Page walks increase cost of memory references (every VA needs to be translated to a PA)
- Page attributes can be stored in the lower bits of page table entries (PTEs), e.g., valid, dirty, and accessed bits.
- Advantage: no external fragmentation, allocation is fast, swapping is easy
- Disadvantages: page walk cost, internal fragmentation
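To make the address split concrete, here is a minimal sketch of a one-level lookup for 4K pages. The bit layout and the PTE_VALID bit position are illustrative assumptions, not any specific architecture's format:

```c
#include <stdint.h>

#define PAGE_SHIFT  12                        /* 4K pages */
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)
#define PTE_VALID   0x1                       /* assumed attribute bit in the PTE's low bits */

/* pt is the flat page table; in hardware its (physical) starting
 * address would live in the new base register. Returns -1 on fault. */
uint64_t walk_one_level(const uint64_t *pt, uint64_t va) {
    uint64_t vpn    = va >> PAGE_SHIFT;       /* higher bits index the PT */
    uint64_t offset = va & OFFSET_MASK;       /* lower bits pass through untranslated */
    uint64_t pte    = pt[vpn];
    if (!(pte & PTE_VALID))
        return (uint64_t)-1;                  /* page fault in a real system */
    uint64_t pfn = pte >> PAGE_SHIFT;         /* PFN sits above the attribute bits */
    return (pfn << PAGE_SHIFT) | offset;      /* physical address */
}
```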
Multi-Level Page Tables
- Key observation: most of the virtual address space is not actually used (i.e., it is sparse). This means for a flat PT, most of the PTEs will be empty, wasting space. Multi-level PTs allow us to overcome this.
- PTs are allocated on demand (policy)
- The virtual address breakdown is subdivided further (upper bits: level-one PT index; middle bits: level-two PT index; lower bits: offset within the page); see the sketch after this list.
- Page size stays the same. Base register setup also stays the same.
- The bad? A page walk now has k steps for a k-level page table. Each step is a memory reference. So page walk latency goes up.
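A minimal two-level walk, continuing the sketch above. The 10/10/12 split of a 32-bit address and the toy "pointer as physical address" model are assumptions made for illustration:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define INDEX_BITS 10                        /* assumed 10/10/12 split of a 32-bit VA */
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

/* Toy model: a level-1 entry is either NULL (sparse hole, no level-2
 * table allocated) or a pointer to a 1024-entry level-2 table whose
 * entries hold PFNs, with 0 meaning "not mapped". */
typedef struct {
    uint32_t *tables[1u << INDEX_BITS];
} level1_t;

int64_t walk_two_level(const level1_t *l1, uint32_t va) {
    uint32_t l1_idx = (va >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK;
    uint32_t l2_idx = (va >> PAGE_SHIFT) & INDEX_MASK;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);

    uint32_t *l2 = l1->tables[l1_idx];       /* memory reference #1 */
    if (l2 == NULL)
        return -1;                           /* sparse region: page fault */

    uint32_t pfn = l2[l2_idx];               /* memory reference #2 */
    if (pfn == 0)
        return -1;                           /* not mapped: page fault */

    return ((int64_t)pfn << PAGE_SHIFT) | offset;
}
```

Note how each level of the walk is its own memory reference, and how a single NULL level-1 entry lets a whole 4MB region (1024 pages) of sparse address space go without any level-2 table at all.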
TLBs
- Translation Lookaside Buffer: avoid page walks!
- Small, on-chip SRAM cache to store page walk outcomes. Maps VPN to PFN directly. Hit: no page walk. Miss: page walk + TLB fill. (A toy lookup is sketched after this list.)
- Internal organization is similar to other on-chip caches
- TLB reach is a big issue. Ideally, all the hot pages our program accesses frequently (i.e., its working set) would have translations stored in the TLB. But TLBs are small, so this will not always be the case. How big a working set can be captured by the TLB? This is its reach. A TLB with 4K entries, each storing a translation for one 4K page, would have a reach of 2^12*2^12 = 2^24 = 16MB. This is often not enough.
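A toy fully associative lookup showing what the hit/miss split means. The entry count and fields are illustrative assumptions (real designs also differ in associativity and replacement policy):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64     /* illustrative size */
#define PAGE_SHIFT  12

typedef struct {
    bool     valid;
    uint64_t vpn;
    uint64_t pfn;
} tlb_entry;

/* Fully associative lookup: compare the VPN against every entry.
 * Hit: return the cached translation, no page walk. Miss: the caller
 * performs the page walk and fills an entry. */
bool tlb_lookup(const tlb_entry tlb[TLB_ENTRIES], uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].pfn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return true;   /* hit */
        }
    }
    return false;          /* miss */
}
```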
Variable Page Sizes
- Allowing for bigger page sizes can extend the reach of the TLB by fitting more address space into one TLB entry. Suppose I have 1GB pages (instead of the usual 4KB). Now my TLB with 4K entries has a reach of 4TB. (The arithmetic is worked out in the sketch after this list.)
- Typical large page sizes are 16KB, 2MB, 1GB (and eventually 512GB).
- The problem with large pages is that internal fragmentation becomes more likely, so large pages are typically used sparingly.
- This means that our hardware and OS must have support for handling multiple page sizes concurrently. The biggest gotcha on the hardware side of this is that the page walk logic needs to know when to terminate.
- This sets up an inherent tension between TLB reach and internal fragmentation, which is one primary motivator for the Mosaic Pages paper.
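The reach arithmetic from the two examples above, generalized. The 4K-entry TLB is the assumption carried over from the earlier example:

```c
#include <stdint.h>
#include <stdio.h>

/* TLB reach = number of entries x bytes mapped per entry. */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes) {
    return entries * page_bytes;
}

int main(void) {
    uint64_t entries = 4096;  /* the 4K-entry TLB from the example above */
    printf("4KB pages: %llu MB\n",
           (unsigned long long)(tlb_reach(entries, 1ull << 12) >> 20)); /* 16 MB */
    printf("2MB pages: %llu GB\n",
           (unsigned long long)(tlb_reach(entries, 1ull << 21) >> 30)); /* 8 GB */
    printf("1GB pages: %llu TB\n",
           (unsigned long long)(tlb_reach(entries, 1ull << 30) >> 40)); /* 4 TB */
    return 0;
}
```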
Sharing
- It is often the case that we need two processes to be able to access the same physical page. The most common application of this is shared libraries (otherwise, multiple programs that use the same library would each need their own copy of it in memory). Essentially, each process will have some arbitrary VPN that maps to the same PFN; VPN A and VPN B can be the same or different.
- Another way to put it is that both processes' page tables will contain a PTE pointing to the same PFN.
- Sharing can also be useful for inter-process communication (IPC). (A minimal POSIX setup is sketched after this list.)
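A minimal POSIX sketch of setting up exactly this situation between cooperating processes. The shared object name is whatever the processes agree on, and error handling is abbreviated:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one shared 4K page. A second process calling this with the same
 * name gets a (possibly different) virtual address whose PTE points at
 * the same physical frame: VPN A and VPN B mapping one PFN. */
void *map_shared_page(const char *name) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600); /* named shared object */
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, 4096) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                                       /* mapping outlives the fd */
    return p == MAP_FAILED ? NULL : p;
}
```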
Context Switches
- Note that when we switch from one process to another, we have to flush the VPN->PFN translations in the TLB. Otherwise, we'd have a massive security vulnerability: one process could read physical pages belonging to another.
- Flushing the entire TLB can be expensive, so often TLB entries are tagged with address space identifiers (ASIDs), so that entries belonging to other processes simply never match and don't need to be flushed on a switch. This gets us back some performance. (See the sketch after this list.)
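Extending the toy TLB entry from earlier with an ASID tag. The hit condition now also compares the current process's ASID, so a context switch just changes the current ASID instead of flushing. Field widths are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    uint16_t asid;  /* address space identifier of the owning process */
    uint64_t vpn;
    uint64_t pfn;
} tagged_tlb_entry;

/* Hit only if both the VPN and the ASID match; entries installed by
 * other processes simply never match the current ASID. */
bool tagged_match(const tagged_tlb_entry *e, uint64_t vpn, uint16_t cur_asid) {
    return e->valid && e->vpn == vpn && e->asid == cur_asid;
}
```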
Page Walk Caching
- Page walks can be quite deep, especially when we're talking about virtualization. Server-grade chips will often incorporate microarchitectural caches for intermediate levels of the page walk to avoid these latencies. These caches are not always documented well.
- This is in addition to the TLB, and is meant to optimize the miss path.
- Not easy to measure.
Locality of Reference
- The TLB only gets us so far. Applications with a large working set will confound the TLB, because it always has a finite number of entries. The smaller the working set, the less of a problem the TLB's limited capacity will be. More and more applications these days have large, random access patterns that make this a problem (hence the Mosaic paper).
- Think about how you'd design a TLB-friendly workload. (One contrast is sketched below.)
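One way to start thinking about it: the two loops below read the same buffer, but the first reuses each translation 4096 times while the second needs a fresh translation on almost every access once the buffer exceeds TLB reach. A sketch assuming 4KB pages; the multiplier is an arbitrary scattering constant:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE 4096   /* assumed page size */

/* TLB-friendly: touch every byte of a page before moving on, so each
 * translation brought in by a miss is reused PAGE times. */
uint64_t sum_sequential(const uint8_t *buf, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}

/* TLB-hostile: one access per page, in a scattered order, so a working
 * set larger than the TLB's reach misses on almost every access. */
uint64_t sum_one_per_page(const uint8_t *buf, size_t n) {
    uint64_t s = 0;
    size_t pages = n / PAGE;
    if (pages == 0)
        return 0;
    for (size_t i = 0; i < pages; i++) {
        size_t page = (i * 2654435761u) % pages; /* scatter across pages */
        s += buf[page * PAGE];
    }
    return s;
}
```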