Memento: Architectural Support for Ephemeral Memory Management in Serverless Environments

by James Tappert (Lead Presenter), Kabir Vidyarthi (Presenter), Paul Suvrojyoti (Blogger), Carlos Alvarado-Lopez (Blogger), William Davis (Scribe)
February 23, 2026

Introduction

First, let us understand the problem this paper is trying to solve. Serverless functions are very short-lived, which prevents them from amortizing the high costs of memory allocation and deallocation. They are forced to pay the full critical-path latency of memory management in both userspace (executing malloc) and the OS kernel (executing mmap, syscalls, and page fault handlers). The paper mentions that for C++ applications, userspace operations account for 96% of the memory management overhead, while for Python applications, OS kernel operations are responsible for 52% of the overhead. Memento therefore targets functions that allocate small objects, show short-lived behaviour, and have expensive kernel involvement.

What Memento adds

Two hardware pieces:

Hardware Object Allocator: handles small-object malloc/free quickly
Hardware Page Allocator: hands out arenas (VA ranges), backs pages with physical memory on demand, and avoids kernel page fault handling on the critical path

ISA integration

Two new instructions are added:

obj-alloc size: returns pointer
obj-free ptr: frees it

We will talk about these in detail below.

Baseline Malloc

Before we look at the new hardware, let us first see what happens when our program uses malloc normally.

When the program starts, the OS has already set up some initial virtual addresses, like a starter kit. Later on, malloc may need more virtual address space from the OS for two key reasons:

The heap must grow
The allocator wants a new region for its own structures

To get it, malloc can use brk/sbrk (grow the traditional heap) or mmap (map a new virtual region somewhere else).

When the program runs, it does a few kinds of memory-related tasks:

Fetch instructions
Read data
Write data
Ask for more memory (malloc)
Give memory back (free/delete)

Note - malloc is more like a manager: it is the front end of the userspace allocator, which keeps metadata that helps it answer quickly:

what memory it already has
what parts are free vs used
where to put a new allocation of a given size

Allocators round malloc requests up into size classes (e.g., 256-byte objects). For example, if the request is around 100B, it will be rounded up to the 128B size class. A pool is typically one 4KB page that the allocator dedicates to only one size class, for example a 4KB pool for 128B objects. A slot is one fixed-size piece inside that pool.
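To make the size class, pool, and slot vocabulary concrete, here is a minimal sketch. It assumes power-of-two size classes and a 16B minimum class; both are my assumptions for illustration (real allocators usually use finer-grained classes), and the helper names are hypothetical. It also ignores any per-pool header metadata.

```python
POOL_SIZE = 4096  # one 4KB page per pool, as described above
MIN_CLASS = 16    # assumed smallest size class (not from the paper)


def size_class(request: int) -> int:
    """Round a request up to the next power-of-two size class (assumed scheme)."""
    if request <= 0:
        raise ValueError("request must be positive")
    cls = MIN_CLASS
    while cls < request:
        cls *= 2
    return cls


def slots_per_pool(cls: int) -> int:
    """Number of fixed-size slots that fit in one pool of this size class."""
    return POOL_SIZE // cls


# A 100B request lands in the 128B class, and a 4KB pool holds 32 such slots.
print(size_class(100), slots_per_pool(size_class(100)))
```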

Now, let us see what happens when our program calls something like malloc(100).

Scenario A: When the allocator already has free space

code calls malloc(100)
Allocator checks if it already has a pool/page with a free slot for 128B objects.
If yes, pick one free slot
return pointer p immediately, no kernel call.

Scenario B: malloc(100), but the allocator has no free space

malloc(100) happens
allocator checks for a free slot, but none are available
allocator asks OS for more virtual memory:
kernel runs and creates a mapping:
It records that this VA range belongs to this process, with read/write permissions (but no mapping to physical RAM space)
Kernel returns to userspace
allocator carves that new region into blocks and returns p
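The two scenarios can be sketched with a toy allocator. This is only an illustration under stated assumptions: ToyAllocator, _fake_mmap, and kernel_calls are hypothetical names, the power-of-two rounding is assumed, and _fake_mmap merely counts how often the kernel would be entered instead of calling the real mmap.

```python
POOL_SIZE = 4096  # one 4KB pool per kernel request


class ToyAllocator:
    def __init__(self):
        self.free_slots = {}         # size class -> list of free slot addresses
        self.next_va = 0x1000_0000   # hypothetical start of unmapped VA space
        self.kernel_calls = 0        # how many times we "entered the kernel"

    def _fake_mmap(self, length):
        """Stand-in for mmap: reserves a VA range, no physical memory yet."""
        self.kernel_calls += 1
        base = self.next_va
        self.next_va += length
        return base

    def malloc(self, size):
        cls = 16                     # assumed minimum size class
        while cls < size:
            cls *= 2
        if cls > POOL_SIZE:
            raise ValueError("this sketch only handles small objects")
        slots = self.free_slots.setdefault(cls, [])
        if not slots:
            # Scenario B: no free slot, so ask the OS and carve the new pool
            base = self._fake_mmap(POOL_SIZE)
            slots.extend(range(base + POOL_SIZE - cls, base - 1, -cls))
        # Scenario A: a free slot exists, return it without any kernel call
        return slots.pop()


alloc = ToyAllocator()
p = alloc.malloc(100)   # first call takes the Scenario B path
q = alloc.malloc(100)   # second call takes the Scenario A path
print(hex(p), hex(q), alloc.kernel_calls)
```

Only the first call pays for the kernel round trip; the next 31 requests for the 128B class are served from the same pool.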

Now, the first time the program touches p, a page fault happens and a physical page gets allocated.

Let us say we do read/write for the first time, something like p[0] = x

CPU tries to execute the store to virtual address p
CPU checks TLB for p’s page, and it will be a miss for the first touch
CPU does a page table walk
It finds the PTE says: not present
CPU raises a page fault exception and jumps into the kernel

Now the kernel will handle the page fault:

confirms the address is valid
allocates a physical 4KB page
zeros the page so you can’t read someone else’s data or old data
Updates the page table entry so that it points to that physical page and sets “present=1.”
Updates/invalidates TLB
Returns from the fault back to the program

Now, VA to PA mapping exists, therefore the store succeeds.
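The first-touch sequence above can be mimicked in a few lines. This is a simplified sketch, not how a real MMU or kernel is built: ToyMMU and its fields are hypothetical, the page table is a flat dictionary rather than a multi-level tree, and zeroing the page is only noted in a comment.

```python
PAGE_SIZE = 4096


class ToyMMU:
    def __init__(self, valid_ranges):
        self.valid_ranges = valid_ranges  # (start, end) VA ranges the kernel mapped
        self.page_table = {}              # virtual page number -> physical frame number
        self.tlb = {}                     # cached translations
        self.next_frame = 0               # next free physical frame (toy free list)
        self.faults = 0                   # number of kernel page faults taken

    def _page_fault(self, vpn):
        """Kernel side: validate the address, allocate a frame, fill the PTE."""
        va = vpn * PAGE_SIZE
        if not any(start <= va < end for start, end in self.valid_ranges):
            raise MemoryError("invalid address (segmentation fault)")
        self.faults += 1
        frame = self.next_frame           # a real kernel would also zero this page
        self.next_frame += 1
        self.page_table[vpn] = frame      # PTE now points at the frame, present=1

    def translate(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        if vpn not in self.tlb:                # TLB miss
            if vpn not in self.page_table:     # page walk finds "not present"
                self._page_fault(vpn)
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset


mmu = ToyMMU([(0x1000_0000, 0x1000_1000)])
mmu.translate(0x1000_0010)   # first touch: TLB miss, page walk, page fault
mmu.translate(0x1000_0020)   # same page again: TLB hit, no fault
print(mmu.faults)
```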

This shows the full overhead of the memory management, and this is problematic for serverless functions, which are very short.

The Memento way

Hardware Object Allocator

It removes the userspace allocator cost for small objects (<= 512B). A hardware allocation is often as fast as an L1 cache access. The key idea is that the hardware manages memory in arenas, one size class per arena. Arena = a contiguous VA range used only for one size class. It has i) an arena header with a bitmap and list pointers, and ii) an arena body with a fixed number of objects.

Bitmap - A tiny structure with just enough bits to track all the slots in an arena body. 0 means the slot is free, and 1 means the slot is allocated. List pointer - A pointer used to link arenas together in a list. There are two lists: i) the available list, holding arenas with at least one free slot, and ii) the full list, holding arenas with no free slots.

This hardware also has a tiny table called HOT (Hardware Object Table): HOT keeps the most recently used arena header for each size class.

This is what happens step by step:

App calls malloc(size) -> runtime routes small sizes to obj-alloc(size)
Hardware object allocator rounds size to a size class
It indexes HOT using that size class, and if it finds an arena of that size class:
- It scans the bitmap
- If it finds a 0 bit, it sets it to 1 (mark slot allocated)
- It computes the returned pointer and returns the VA to the core
If the arena is full (bitmap has no 0)
- Hardware loads another arena header of the same size class if available from the available list into HOT and updates lists, i.e., moves the old full arena to the full list.
If there is no available arena
- It requests a new arena from the hardware page allocator

So this whole process is very fast, as it is done by the hardware and skips the software allocation.
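The steps above can be sketched as a small software model. Everything here is a toy under stated assumptions: the class and method names are hypothetical, the power-of-two rounding is assumed, _new_arena stands in for the request to the hardware page allocator, and obj_free finds the owning arena with a linear search, which is only a placeholder for however the real hardware does it.

```python
def size_class(size):
    cls = 16                      # assumed minimum class and power-of-two rounding
    while cls < size:
        cls *= 2
    return cls


class Arena:
    def __init__(self, base, obj_size, num_objs):
        self.base, self.obj_size = base, obj_size
        self.bitmap = [0] * num_objs          # 0 = free slot, 1 = allocated slot

    def alloc(self):
        for i, bit in enumerate(self.bitmap):  # scan the bitmap for a 0 bit
            if bit == 0:
                self.bitmap[i] = 1
                return self.base + i * self.obj_size
        return None                            # arena is full


class ObjectAllocator:
    def __init__(self, objs_per_arena=4):
        self.hot = {}          # size class -> most recently used arena (the HOT)
        self.available = {}    # size class -> arenas with at least one free slot
        self.full = {}         # size class -> arenas with no free slots
        self.arenas = []       # every arena, only used by the toy obj_free lookup
        self.objs_per_arena = objs_per_arena
        self.next_base = 0x7000_0000   # hypothetical VA where arenas start

    def _new_arena(self, cls):
        """Stand-in for asking the hardware page allocator for a new arena."""
        arena = Arena(self.next_base, cls, self.objs_per_arena)
        self.next_base += cls * self.objs_per_arena
        self.arenas.append(arena)
        return arena

    def obj_alloc(self, size):
        cls = size_class(size)
        arena = self.hot.get(cls)
        if arena is not None:
            ptr = arena.alloc()
            if ptr is not None:
                return ptr
            self.full.setdefault(cls, []).append(arena)   # old arena is full
        avail = self.available.setdefault(cls, [])
        arena = avail.pop() if avail else self._new_arena(cls)
        self.hot[cls] = arena
        return arena.alloc()

    def obj_free(self, ptr):
        for arena in self.arenas:   # assumption: linear search replaces the hardware lookup
            if arena.base <= ptr < arena.base + arena.obj_size * len(arena.bitmap):
                arena.bitmap[(ptr - arena.base) // arena.obj_size] = 0
                full = self.full.get(arena.obj_size, [])
                if arena in full:   # a full arena regains a free slot
                    full.remove(arena)
                    self.available.setdefault(arena.obj_size, []).append(arena)
                return
        raise ValueError("pointer not owned by this allocator")
```

The common case touches only the HOT entry and its bitmap; the lists and the page allocator are involved only when the current arena fills up.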

Hardware Page Allocator

It removes the kernel/OS cost of getting pages mapped. The page allocator lets the app get memory from a reserved region without OS costs like context switches and kernel code execution. In normal systems, when your program touches a new page, the CPU may trigger a page fault, and the kernel handles it, but the kernel path is expensive, especially for short functions.

The hardware page allocator does almost the same work as the kernel's fault handler, but in hardware, so it is faster.

Note - there is a one-time OS setup before a program uses Memento, which includes

Reserves a special virtual address region for Memento (a range of VAs only used by Memento)
Memento’s hardware interprets it as 64 equal lanes via address math. For each lane (one per size class), the hardware page allocator (HPA) keeps a bump pointer, which tells where the next arena should start in this lane.
Writes that region bounds into registers. MRS/MRE = start/end of Memento region.
Keeps a pool of free physical pages available
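The lane address math can be sketched as follows. This is a guess at the arithmetic, not the paper's exact layout: LaneMap and its methods are hypothetical names, MRE is treated as an exclusive end, and the region size in the usage lines is made up.

```python
NUM_LANES = 64  # the Memento region is split into 64 equal lanes


class LaneMap:
    def __init__(self, mrs, mre):
        self.mrs, self.mre = mrs, mre              # region bounds (MRS/MRE registers)
        self.lane_size = (mre - mrs) // NUM_LANES
        # one bump pointer per lane, each starting at its lane's base address
        self.bump = [mrs + i * self.lane_size for i in range(NUM_LANES)]

    def lane_of(self, va):
        """Recover the lane (size class index) from an address by pure arithmetic."""
        if not (self.mrs <= va < self.mre):
            raise ValueError("address is outside the Memento region")
        return (va - self.mrs) // self.lane_size

    def next_arena(self, lane, arena_bytes):
        """Hand out the next arena base in this lane and advance the bump pointer."""
        base = self.bump[lane]
        lane_end = self.mrs + (lane + 1) * self.lane_size
        if base + arena_bytes > lane_end:
            raise MemoryError("lane exhausted")
        self.bump[lane] += arena_bytes
        return base


mrs = 0x4000_0000_0000                       # hypothetical region start
lanes = LaneMap(mrs, mrs + NUM_LANES * (1 << 30))
first = lanes.next_arena(3, 1 << 16)         # first arena in lane 3
second = lanes.next_arena(3, 1 << 16)        # placed right after the first
print(hex(first), hex(second), lanes.lane_of(second))
```

Because the lane is recoverable from the address alone, the size class of any pointer in the region can be found without a lookup table.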

When the object allocator runs out of space for a size class, it asks the HPA for a new arena of that particular size class.

The hardware then proceeds step by step:

Picks a new virtual address range for that arena using the simple per-size-class bump pointer
Immediately allocates 1 physical page for the first page of the arena because that first page holds the arena header metadata that must be written immediately
Returns to the object allocator i) the arena virtual base address and ii) the header page’s physical address

Memento deliberately does NOT allocate RAM for all pages of the arena up front. Only the first page is backed immediately; the rest of the pages get backed on first access.
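A minimal sketch of this lazy backing, assuming a single bump pointer and a simple list as the free page pool. ToyPageAllocator, new_arena, and touch are hypothetical names, and the dictionary stands in for Memento's page tables.

```python
PAGE_SIZE = 4096


class ToyPageAllocator:
    def __init__(self, free_frames):
        self.free_frames = list(free_frames)  # pool of free physical frame numbers
        self.backed = {}                      # virtual page number -> physical frame
        self.bump = 0x7000_0000               # hypothetical bump pointer (one lane only)

    def new_arena(self, arena_pages):
        """Reserve VA for the whole arena but back only the header page."""
        base = self.bump
        self.bump += arena_pages * PAGE_SIZE
        header_frame = self.free_frames.pop()
        self.backed[base // PAGE_SIZE] = header_frame
        return base, header_frame * PAGE_SIZE   # arena VA base, header page PA

    def touch(self, va):
        """First access to an unbacked page grabs a frame from the pool."""
        vpn = va // PAGE_SIZE
        if vpn not in self.backed:
            self.backed[vpn] = self.free_frames.pop()
        return self.backed[vpn] * PAGE_SIZE + va % PAGE_SIZE


hpa = ToyPageAllocator(range(100, 110))
base, header_pa = hpa.new_arena(arena_pages=4)
print(len(hpa.backed))              # only the header page is backed so far
hpa.touch(base + PAGE_SIZE + 8)     # first access to the second page backs it
print(len(hpa.backed))
```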

First access to an unbacked page (the page fault replacement)

As we have discussed, the normal path looks like:

CPU -> page fault -> kernel allocates page -> kernel updates page table -> return

Memento changes it to:

CPU -> page walk -> HPA allocates page and updates Memento page table -> return mapping -> CPU continues

So instead of "not present -> fault to the kernel", Memento does "not present -> pass it to the HPA", which allocates RAM and fills in the page tables, so there is no kernel trap.

Step by step:

CPU tries to load/store to some virtual address
TLB miss happens, and CPU starts a page walk
MMU checks if VA inside [MRS, MRE]
- If yes → it uses the Memento page table root instead of normal page tables. Basically looks up mappings using Memento’s own page tables
The MMU’s memory reads during the walk are tagged as a Memento walk, so the HPA knows i) that these page table reads belong to Memento and ii) that it is allowed to fix missing entries. The HPA watches these reads:
If the page table already has a mapping, it returns the physical address
If the entry is invalid -> the HPA (the hardware) creates what’s missing instead of the OS kernel handling it
- Missing leaf (actual data page) - the HPA grabs a free 4KB physical page from its page pool, writes that physical page number into the leaf PTE, and marks it present
- Missing higher-level entry - Multi-level page tables are like a tree: Root - Level 2 - Level 3 - Leaf. Sometimes the intermediate page-table pages themselves don’t exist, so the HPA allocates a fresh 4KB physical page to hold the next-level page table, zeros it, and updates the current-level entry to point to this new page-table page.

So the whole thing is done in hardware without entering the kernel.
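The walk with hardware fix-up can be sketched as below. This is a toy model under assumptions: a 4-level tree with 9 index bits per level (x86-64 style) is assumed, each page-table page is modelled as a dictionary, MRE is treated as exclusive, and MementoWalker and hpa_fixups are hypothetical names.

```python
PAGE_SIZE = 4096
LEVELS = 4           # assumed 4-level tree: Root - Level 2 - Level 3 - Leaf
BITS_PER_LEVEL = 9   # assumed 512 entries per page-table page


class MementoWalker:
    def __init__(self, mrs, mre, free_frames):
        self.mrs, self.mre = mrs, mre
        self.free_frames = list(free_frames)  # HPA's pool of free physical frames
        self.root = {}                        # Memento page table root
        self.hpa_fixups = 0                   # entries created by hardware, no kernel trap

    def _indices(self, va):
        vpn = va // PAGE_SIZE
        return [(vpn >> (BITS_PER_LEVEL * lvl)) & 0x1FF
                for lvl in reversed(range(LEVELS))]

    def walk(self, va):
        if not (self.mrs <= va < self.mre):
            return None   # outside the region: the normal page tables apply (not modelled)
        idx = self._indices(va)
        table = self.root
        for i in idx[:-1]:
            if i not in table:
                # missing higher-level entry: take a frame for a new, zeroed table page
                self.free_frames.pop()
                table[i] = {}
                self.hpa_fixups += 1
            table = table[i]
        leaf = idx[-1]
        if leaf not in table:
            # missing leaf: back the data page with a free frame and mark it present
            table[leaf] = self.free_frames.pop()
            self.hpa_fixups += 1
        return table[leaf] * PAGE_SIZE + va % PAGE_SIZE


walker = MementoWalker(0x7000_0000, 0x7100_0000, range(1000))
walker.walk(0x7000_0010)    # builds three table levels plus the leaf
walker.walk(0x7000_1000)    # neighbouring page: only the leaf is missing
print(walker.hpa_fixups)
```

The first access pays for building the intermediate levels; later accesses to neighbouring pages only need a new leaf entry.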

Evaluation

Methodology

Memento is evaluated using a full-system simulation (QEMU, SST, and DRAMSim3) running Linux 5.18. Two new hardware structures are modeled:

Hardware Object Table (HOT): 3.4 KB
Arena Allocation Cache (AAC): 32 entries

CACTI analysis shows both incur minimal area and power overhead.

Workloads included are:

Serverless functions (Python, C++)
Serverless platform operations (OpenFaaS: up, deploy, invoke)
Data processing systems (Redis, Memcached, Silo, SQLite3)

Performance

Memento achieves:

8-28% speedup for serverless functions
16% average improvement across functions
4-11% speedup for data processing workloads

Speedup comes from four sources:

Hardware object allocation
Hardware object free
Hardware page management
Main memory bypass

For functions, gains are split between object-level acceleration and page management, showing that optimizing only malloc is insufficient.

Python and Golang benefit strongly from page-level acceleration due to larger heaps, while C++ workloads benefit primarily from object-level acceleration.

Memory Impact

Memento reduces:

DRAM bandwidth usage by 30% on average
Total physical memory usage for functions by 15%

Class Discussion

During our class discussion, there were some notable questions discussed:

Normalized bandwidth usage, and what is the alternative?

How much would the chip actually cost?

This was not discussed in the paper, but it would need a significantly large cache and the ability to run through said cache

If you were Amazon and thinking of testing this, what would you test?

Things that would cost the most money and ways to save money
If workloads change over time, this could be a bad idea
What is the baseline? Short-lived or long-lived? What percentage of the access needs to be short-lived to make this system work?
Cloud provider wants to have as little waste as possible

Takeaway

Memento delivers consistent performance and bandwidth improvements across languages and workloads with minimal hardware overhead, and its benefits extend beyond short-lived functions to serverless platforms and data-processing systems.

AI Disclosure

ChatGPT was used to summarize the paper, notes, and class discussion. The generative AI created a template which was then reviewed, edited and revised by the group.

The CS/ECE 4/599 Course Blog