Memento: Architectural Support for Ephemeral Memory Management in Serverless Environments
Introduction
First, let us understand the problem this paper is trying to solve. Serverless functions are very short-lived, which prevents them from amortizing the high costs of memory allocation and deallocation. They are forced to pay the full critical-path latency of memory management in both userspace (executing malloc) and the OS kernel (executing mmap, syscalls, and page fault handlers). The paper mentions that for C++ applications, userspace operations account for 96% of the memory management overhead, and for Python applications, OS kernel operations are responsible for 52% of the overhead. Memento therefore targets functions that allocate small objects, show short-lived behaviour, and involve expensive kernel work.
What Memento adds
Two hardware pieces:
- Hardware Object Allocator: handles small-object malloc/free quickly
- Hardware Page Allocator: gives out arenas (VA ranges), backs pages with physical memory on demand, and avoids kernel page fault handling on the critical path
ISA integration
Two new instructions are added:
- obj-alloc size: returns pointer
- obj-free ptr: frees it
We will talk about these in detail below.
Baseline Malloc
Before we look at the new hardware, let us first see what happens when our program uses malloc normally.
When the program starts, the OS has already set up some initial virtual addresses, like a starter kit. Later on, the program may need more virtual address space, and malloc asks the OS for it for two key reasons:
- The heap must grow
- The allocator wants a new region for its own structures
To get more virtual address space, malloc can use brk/sbrk (grow the traditional heap) or mmap (map a new virtual region somewhere else).
When the program runs, it does a few kinds of memory-related tasks:
- Fetch instructions
- Read data
- Write data
- Ask for more memory (malloc)
- Give memory back (free/delete)
Note - malloc acts more like a manager; it calls the userspace allocator, which keeps metadata that helps it answer quickly:
- what memory it already has
- what parts are free vs used
- where to put a new allocation of a given size
Allocators round malloc requests up to size classes (e.g., 128-byte or 256-byte objects). For example, a request of around 100B is rounded up to the 128B size class. A pool is typically one 4KB page that the allocator dedicates to a single size class, for example a 4KB pool for 128B objects. A slot is one fixed-size piece inside that pool.
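As a rough illustration of this rounding, the sketch below assumes power-of-two size classes from 16B to 512B and a 4KB pool. The class table and the function names are hypothetical; real allocators pick their own classes.

```python
# Hypothetical size classes (powers of two); real allocators choose their own.
SIZE_CLASSES = [16, 32, 64, 128, 256, 512]
POOL_BYTES = 4096  # one 4KB page per pool, as in the text


def size_class_for(request: int) -> int:
    """Return the smallest size class that fits the request."""
    for cls in SIZE_CLASSES:
        if request <= cls:
            return cls
    raise ValueError("request is too large for the small-object classes")


def slots_per_pool(cls: int) -> int:
    """Number of fixed-size slots that fit in one pool."""
    return POOL_BYTES // cls


print(size_class_for(100))   # 128
print(slots_per_pool(128))   # 32
```

Under these assumptions, malloc(100) lands in the 128B class and wastes 28 bytes per object, which is the price of keeping every slot in a pool the same size.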
Now, let us see what happens when our program calls something like malloc(100).
Scenario A: When the allocator already has free space
- code calls malloc(100)
- Allocator checks if it already has a pool/page with a free slot for 128B objects.
- If yes, pick one free slot
- return pointer p immediately, no kernel call.
Scenario B: malloc(100), but the allocator has no free space
- malloc(100) happens
- allocator checks for a free slot, but none are available
- allocator asks the OS for more virtual memory (mmap or brk/sbrk)
- kernel runs and creates a mapping: it records that this VA range belongs to this process, with read/write permissions (but with no physical RAM behind it yet)
- Kernel returns to userspace
- allocator carves that new region into blocks and returns p
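The two scenarios can be sketched with a toy allocator for a single size class. This is a simplified model, not how any real malloc is implemented: the class name, the starting address, and the kernel_calls counter are all hypothetical, and the "kernel" here only hands out address ranges.

```python
POOL_BYTES = 4096
SLOT_BYTES = 128  # the size class that malloc(100) rounds up to


class ToyAllocator:
    """Toy model of one size class; it counts simulated kernel calls."""

    def __init__(self):
        self.free_slots = []          # addresses of free 128B slots
        self.next_region = 0x10000    # hypothetical next VA from the "kernel"
        self.kernel_calls = 0

    def _ask_kernel_for_region(self):
        # Stands in for mmap/brk: reserves VA only, no physical page yet.
        self.kernel_calls += 1
        base = self.next_region
        self.next_region += POOL_BYTES
        return base

    def malloc(self):
        if not self.free_slots:       # Scenario B: no free slot, slow path
            base = self._ask_kernel_for_region()
            self.free_slots = [base + i * SLOT_BYTES
                               for i in range(POOL_BYTES // SLOT_BYTES)]
        return self.free_slots.pop(0)  # Scenario A: hand out a free slot

    def free(self, addr):
        self.free_slots.append(addr)


alloc = ToyAllocator()
p = alloc.malloc()   # first call: Scenario B, one kernel call
q = alloc.malloc()   # second call: Scenario A, no kernel call
print(hex(p), hex(q), alloc.kernel_calls)   # 0x10000 0x10080 1
```

Only one call in every 32 reaches the kernel in this model, which is exactly the amortization that a very short function may never get to enjoy.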
The first time the program touches p, a page fault happens and a physical page gets allocated.
Let us say we read or write through p for the first time, with something like p[0] = x:
- CPU tries to execute the store to virtual address p
- CPU checks TLB for p’s page, and it will be a miss for the first touch
- CPU does a page table walk
- It finds that the PTE says: not present
- CPU raises a page fault exception and jumps into the kernel
Now the kernel will handle the page fault:
- confirms the address is valid
- allocates a physical 4KB page
- zeros the page so you can’t read someone else’s data or old data
- Updates the page table entry so that it points to that physical page and sets “present=1”
- Updates/invalidates TLB
- Returns from the fault back to the program
Now the VA to PA mapping exists, so the retried store succeeds.
This sequence shows the full overhead of memory management, which is problematic for serverless functions because they are very short.
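The first-touch sequence can be modelled in a few lines. This is only a sketch of demand paging: ToyProcess, its fields, and the fault counter are hypothetical names, and the page table is flattened into a single dictionary instead of a multi-level tree.

```python
PAGE = 4096


class ToyProcess:
    def __init__(self):
        self.valid_pages = set()   # virtual pages the kernel has mapped
        self.page_table = {}       # virtual page number -> physical page
        self.tlb = set()           # virtual pages with a cached translation
        self.faults = 0

    def mmap(self, base, length):
        # Records the VA range as valid; no physical memory is allocated.
        for vpn in range(base // PAGE, (base + length + PAGE - 1) // PAGE):
            self.valid_pages.add(vpn)

    def _page_fault(self, vpn):
        # Kernel side: validate the address, then allocate a zeroed page.
        self.faults += 1
        if vpn not in self.valid_pages:
            raise MemoryError("segmentation fault")
        self.page_table[vpn] = bytearray(PAGE)

    def store(self, va, value):
        vpn, offset = divmod(va, PAGE)
        if vpn not in self.tlb:              # TLB miss
            if vpn not in self.page_table:   # page walk finds "not present"
                self._page_fault(vpn)
            self.tlb.add(vpn)
        self.page_table[vpn][offset] = value


proc = ToyProcess()
proc.mmap(0x10000, PAGE)
proc.store(0x10000, 7)   # first touch: page fault, page gets allocated
proc.store(0x10008, 9)   # same page: no fault
print(proc.faults)       # 1
```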
The Memento way
Hardware Object Allocator
It solves the userspace allocator cost for small objects (<= 512B). This hardware operation is often as fast as an L1 cache access. The key idea here is that the hardware manages memory in arenas per size class. Arena = a contiguous VA range used only for one size class. It has i) an arena header with a bitmap and list pointers, ii) an arena body with a fixed number of objects.
Bitmap - A tiny structure with just enough bits to track all the slots in an arena body. 0 means the slot is free, and 1 means the slot is allocated. List pointer - A pointer used to link arenas together in a list. There are two lists: i) available list - arenas with at least one free slot, ii) full list - arenas with no free slots.
This hardware also has a tiny table called HOT (Hardware Object Table): HOT keeps the most recently used arena header for each size class.
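The bitmap bookkeeping can be sketched with plain integer bit operations. The 64-byte header size, the arena base address, and the helper names are assumptions made for illustration; the paper's actual arena layout may differ.

```python
HEADER_BYTES = 64  # hypothetical header size; the real layout may differ


def find_free_slot(bitmap: int, num_slots: int) -> int:
    """Index of the first 0 bit, or -1 if every slot is allocated."""
    for i in range(num_slots):
        if not (bitmap >> i) & 1:
            return i
    return -1


def slot_address(arena_base: int, slot: int, obj_size: int) -> int:
    """Objects sit back to back after the header."""
    return arena_base + HEADER_BYTES + slot * obj_size


bitmap = 0b0111                              # slots 0-2 allocated, slot 3 free
slot = find_free_slot(bitmap, num_slots=8)   # 3
bitmap |= 1 << slot                          # allocate: set the bit to 1
addr = slot_address(0x4000_0000, slot, obj_size=128)
bitmap &= ~(1 << 1)                          # free slot 1: clear its bit
print(slot, hex(addr), bin(bitmap))          # 3 0x400001c0 0b1101
```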
This is what happens step by step:
- App calls malloc(size) -> runtime routes small sizes to obj-alloc(size)
- Hardware object allocator rounds size to a size class
- It indexes HOT using that size class. If it finds an arena of that size class, it scans the bitmap; if it finds a 0 bit, it sets it to 1 (marks the slot allocated), computes the returned pointer, and returns the VA to the core
- If the arena is full (bitmap has no 0), hardware loads another arena header of the same size class from the available list into HOT, if one exists, and updates the lists, i.e., moves the old full arena to the full list
- If there is no available arena, it requests a new arena from the hardware page allocator
So this whole process is very fast, as it is done by the hardware and skips the software allocation.
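The steps above can be sketched as a small software model. Everything here is a simplification under stated assumptions: the size class table, the four-slot arenas, and the arena stride are hypothetical, the arena header is left out of the address math, and a simple counter stands in for the hardware page allocator. The obj_alloc method only mirrors the obj-alloc instruction by name.

```python
SIZE_CLASSES = [16, 32, 64, 128, 256, 512]   # hypothetical class table
SLOTS_PER_ARENA = 4                          # tiny on purpose, to show the lists
ARENA_STRIDE = 0x1000                        # hypothetical spacing between arenas


class Arena:
    def __init__(self, base, obj_size):
        self.base, self.obj_size, self.bitmap = base, obj_size, 0

    def take_slot(self):
        """Set the first 0 bit and return the slot address, or None if full."""
        for i in range(SLOTS_PER_ARENA):
            if not (self.bitmap >> i) & 1:
                self.bitmap |= 1 << i
                return self.base + i * self.obj_size   # header omitted
        return None


class ObjectAllocator:
    def __init__(self):
        self.hot = {}                                  # size class -> current arena
        self.available = {c: [] for c in SIZE_CLASSES}
        self.full = {c: [] for c in SIZE_CLASSES}
        self.next_base = 0x4000_0000                   # stands in for the page allocator
        self.arenas_requested = 0

    def _new_arena(self, cls):
        self.arenas_requested += 1
        base, self.next_base = self.next_base, self.next_base + ARENA_STRIDE
        return Arena(base, cls)

    def obj_alloc(self, size):
        cls = next(c for c in SIZE_CLASSES if size <= c)   # round to size class
        arena = self.hot.get(cls)                          # index HOT
        ptr = arena.take_slot() if arena else None
        if ptr is None:
            if arena:
                self.full[cls].append(arena)               # retire the full arena
            arena = (self.available[cls].pop() if self.available[cls]
                     else self._new_arena(cls))
            self.hot[cls] = arena
            ptr = arena.take_slot()
        return ptr


allocator = ObjectAllocator()
ptrs = [allocator.obj_alloc(100) for _ in range(5)]
print([hex(p) for p in ptrs], allocator.arenas_requested)
```

With four slots per arena, the fifth allocation finds the bitmap full, moves the first arena to the full list, and pulls in a second arena, so arenas_requested ends at 2.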
Hardware Page Allocator
It solves the kernel/OS cost of getting pages mapped. The page allocator lets the app get memory from a reserved region without OS costs like context switches and kernel code execution. In normal systems, when your program touches a new page, the CPU may trigger a page fault and the kernel handles it, but the kernel path is expensive, especially for short functions.
The hardware page allocator (HPA) does almost the same work as the kernel, but in hardware, so it is faster.
Note - there is a one-time OS setup before a program uses Memento. The OS:
- Reserves a special virtual address region for Memento (a range of VAs only used by Memento)
- Memento’s hardware interprets this region as 64 equal lanes via address math. For each size class/lane, the HPA keeps a bump pointer that tells where the next arena should start in that lane.
- Writes the region bounds into registers. MRS/MRE = start/end of the Memento region.
- Keeps a pool of free physical pages available
When the object allocator runs out of space for a size class, it asks the HPA for a new arena of that particular size class.
Step by step, the hardware then:
- Picks a new virtual address range for that arena using the simple per-size-class bump pointer
- Immediately allocates 1 physical page for the first page of the arena because that first page holds the arena header metadata that must be written immediately
- Returns to the object allocator i) the arena virtual base address and ii) the header page’s physical address
Memento deliberately does NOT allocate RAM for all pages of the arena up front. Only the first page is backed immediately; the rest of the pages get backed on first access.
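A minimal sketch of the bump-pointer arena allocation follows. The 16-page arena size, the class and field names, and treating MRE as an exclusive end are all assumptions for illustration; only the 64 lanes, the per-lane bump pointer, and the eagerly backed header page come from the description above.

```python
PAGE = 4096
NUM_LANES = 64               # the region is split into 64 equal lanes
ARENA_BYTES = 16 * PAGE      # hypothetical arena size


class PageAllocator:
    def __init__(self, mrs, mre, free_pages):
        self.mrs, self.mre = mrs, mre          # mre treated as exclusive here
        self.lane_bytes = (mre - mrs) // NUM_LANES
        # One bump pointer per lane, each starting at the lane's base.
        self.bump = [mrs + i * self.lane_bytes for i in range(NUM_LANES)]
        self.free_pages = list(free_pages)     # pool of physical page numbers
        self.backed = {}                       # virtual page -> physical page

    def new_arena(self, lane):
        base = self.bump[lane]
        lane_end = self.mrs + (lane + 1) * self.lane_bytes
        if base + ARENA_BYTES > lane_end:
            raise MemoryError("lane exhausted")
        self.bump[lane] += ARENA_BYTES         # bump pointer moves forward
        header_ppn = self.free_pages.pop()     # back only the header page
        self.backed[base // PAGE] = header_ppn
        return base, header_ppn


hpa = PageAllocator(mrs=0x4000_0000, mre=0x8000_0000, free_pages=range(100))
base, ppn = hpa.new_arena(lane=3)
print(hex(base), ppn, len(hpa.backed))   # 0x43000000 99 1
```

The arena spans 16 virtual pages in this model, but only one physical page has been taken from the pool; the other 15 stay unbacked until they are first touched.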
First access to an unbacked page (the page fault replacer)
As we have discussed, the normal path looks like: CPU -> page fault -> kernel allocates page -> kernel updates page table -> return. Memento changes it to: CPU -> page walk -> HPA allocates page and updates the Memento page table -> return mapping -> CPU continues.
So instead of "not present, then fault into the kernel", Memento does "not present, then pass it to the HPA", which allocates RAM and fills in the page tables, so there is no kernel trap.
- CPU tries to load/store to some virtual address
- TLB miss happens, and CPU starts a page walk
- MMU checks whether the VA is inside [MRS, MRE]
- If yes, it uses the Memento page table root instead of the normal page tables, i.e., it looks up mappings using Memento’s own page tables
- The MMU’s memory reads during the walk are tagged as a Memento walk, so the HPA knows i) these page table reads belong to Memento and ii) it is allowed to fix missing entries. The HPA then watches each entry that is read:
- If the page table already has a mapping, it returns the physical address
- If the entry is invalid, the HPA (the hardware) creates what is missing instead of the OS kernel handling it
- Missing leaf (actual data page): the HPA grabs a free 4KB physical page from its page pool, writes that physical page number into the leaf PTE, and marks it present
- Missing higher-level entry: multi-level page tables are like a tree (Root - Level 2 - Level 3 - Leaf), and sometimes the intermediate page-table pages themselves do not exist. In that case the HPA allocates a fresh 4KB physical page to hold the next-level page table, zeros it, and updates the current-level entry to point to this new page-table page.
So the whole thing is done in hardware without entering the kernel.
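The walk-and-fill behaviour can be modelled as below. This is a sketch, not the hardware design: it assumes a four-level tree with 9 index bits per level (x86-64 style), models each page table as a dictionary, treats MRE as an exclusive end, and uses hypothetical names such as MementoWalker and hpa_fills.

```python
PAGE_SHIFT = 12
LEVEL_BITS = 9
LEVELS = 4   # x86-64 style four-level tree, assumed for illustration


class MementoWalker:
    def __init__(self, mrs, mre, free_pages):
        self.mrs, self.mre = mrs, mre
        self.root = {}                       # Memento page table root
        self.free_pages = list(free_pages)   # physical pages owned by the HPA
        self.hpa_fills = 0                   # entries the HPA had to create

    def _indices(self, va):
        vpn = va >> PAGE_SHIFT
        mask = (1 << LEVEL_BITS) - 1
        return [(vpn >> (LEVEL_BITS * lvl)) & mask
                for lvl in reversed(range(LEVELS))]

    def translate(self, va):
        if not (self.mrs <= va < self.mre):
            raise LookupError("outside the Memento region: use the normal page tables")
        *upper, leaf = self._indices(va)
        table = self.root
        for idx in upper:
            if idx not in table:             # missing intermediate table
                self.free_pages.pop()        # HPA spends a page on the new table
                table[idx] = {}              # an empty dict models a zeroed table
                self.hpa_fills += 1
            table = table[idx]
        if leaf not in table:                # missing leaf: back the data page
            table[leaf] = self.free_pages.pop()
            self.hpa_fills += 1
        offset = va & ((1 << PAGE_SHIFT) - 1)
        return (table[leaf] << PAGE_SHIFT) | offset


walker = MementoWalker(0x4000_0000, 0x8000_0000, range(100))
pa = walker.translate(0x4000_0010)    # first touch: 3 tables + 1 leaf filled
print(hex(pa), walker.hpa_fills)      # 0x60010 4
walker.translate(0x4000_0020)         # same page: nothing left to fill
print(walker.hpa_fills)               # 4
```

Note that translate never raises for a missing entry inside the region; the fill happens inline during the walk, which is the software analogue of avoiding the kernel trap.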
Evaluation
Methodology
Memento is evaluated using a full-system simulation (QEMU, SST, and DRAMSim3) running Linux 5.18. Two new hardware structures are modeled:
- Hardware Object Table (HOT): 3.4 KB
- Arena Allocation Cache (AAC): 32 entries
CACTI analysis shows both incur minimal area and power overhead.
Workloads included are:
- Serverless functions (Python, C++)
- Serverless platform operations (OpenFaaS: up, deploy, invoke)
- Data processing systems (Redis, Memcached, Silo, SQLite3)
Performance
Memento achieves:
- 8-28% speedup for serverless functions
- 16% average improvement across functions
- 4-11% speedup for data processing workloads
Speedup comes from four sources:
- Hardware object allocation
- Hardware object free
- Hardware page management
- Main memory bypass
For functions, gains are split between object-level acceleration and page management, showing that optimizing only malloc is insufficient.
Python and Golang benefit strongly from page-level acceleration due to larger heaps, while C++ workloads benefit primarily from object-level acceleration.
Memory Impact
Memento reduces:
- DRAM bandwidth usage by 30% on average
- Total physical memory usage for functions by 15%
Class Discussion
During our class discussion, several notable questions came up:
Normalized bandwidth usage, and what is the alternative?
How much would the chip actually cost?
- This was not discussed in the paper, but it would need a significantly large cache and the ability to run through said cache
If you were Amazon and thinking of testing this, what would you test?
- Things that would cost the most money and ways to save money
- If workloads change over time, this could be a bad idea
- What is the baseline? Short-lived or long-lived? What percentage of the access needs to be short-lived to make this system work?
- Cloud provider wants to have as little waste as possible
Takeaway
Memento delivers consistent performance and bandwidth improvements across languages and workloads with minimal hardware overhead, and its benefits extend beyond short-lived functions to serverless platforms and data-processing systems.
AI Disclosure
ChatGPT was used to summarize the paper, notes, and class discussion. The generative AI created a template which was then reviewed, edited and revised by the group.