The CS/ECE 4/599 Course Blog

Memento: Architectural Support for Ephemeral Memory Management in Serverless Environments

by James Tappert (Leader Presenter), Kabir Vidyarthi (Presenter), Paul Suvrojyoti (Blogger), Carlos Alvarado-Lopez (Blogger), William Davis (Scribe)

Introduction

First, let us understand the problem this paper is trying to solve. Serverless functions are very short-lived, which prevents them from amortizing the high cost of memory allocation and deallocation. They pay the full critical-path latency of memory management in both userspace (executing malloc) and the OS kernel (executing mmap, syscalls, and fault handlers). The paper reports that for C++ applications, userspace operations accounted for 96% of the memory-management overhead, while for Python applications, OS kernel operations were responsible for 52% of the overhead. Memento therefore targets functions that allocate small objects, show short-lived behaviour, and incur expensive kernel involvement.

What Memento adds

Two hardware pieces: the Hardware Object Allocator and the Hardware Page Allocator, both described below.
ISA integration

Two new instructions are added:

We will talk about these in detail below.

Baseline Malloc

Before we look at the new hardware, let us first see what happens when our program uses malloc normally.

When the program starts, the OS already sets up some initial virtual addresses like a starter kit. Later on, the program may need more virtual addresses and may start calling malloc for two key reasons:

When the program runs, it does a few kinds of memory-related tasks:

Note - malloc is more like a manager: it calls into the userspace allocator, which keeps metadata that helps it answer quickly:

Allocators round malloc requests up into size classes (e.g., 128 B or 256 B objects). For example, a request of about 100 B is rounded up to the 128 B size class. A pool is typically one 4 KB page that the allocator dedicates to a single size class, for example a 4 KB pool of 128 B objects. A slot is one fixed-size piece inside that pool.
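The rounding described above can be sketched in a few lines. This is a minimal illustration, not any real allocator's code; the list of size classes below is an assumption chosen for the example.

```python
# Illustrative sketch of size-class rounding as a typical userspace
# allocator might do it. SIZE_CLASSES is an assumed, simplified list.
SIZE_CLASSES = [16, 32, 64, 128, 256, 512]  # bytes

POOL_SIZE = 4096  # a pool is one 4 KB page dedicated to one size class

def round_to_size_class(request: int) -> int:
    """Round a malloc request up to the smallest size class that fits."""
    for cls in SIZE_CLASSES:
        if request <= cls:
            return cls
    raise ValueError("request too large for the small-object path")

def slots_per_pool(size_class: int) -> int:
    """How many fixed-size slots fit in one 4 KB pool."""
    return POOL_SIZE // size_class

print(round_to_size_class(100))  # a ~100 B request lands in the 128 B class
print(slots_per_pool(128))       # a 4 KB pool holds 32 slots of 128 B
```

Note the trade-off this implies: a 100 B request in a 128 B slot wastes 28 B, which is the internal fragmentation allocators accept in exchange for fast, uniform bookkeeping.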

Now, let us see what happens if our program calls something like malloc(100).

Scenario A: When the allocator already has free space

Scenario B: malloc(100), but the allocator has no free space

Now, the first time p is touched, a page fault happens and a physical page gets allocated.

Let us say we do read/write for the first time, something like p[0] = x

Now the kernel will handle the page fault:

Now that a VA-to-PA mapping exists, the store succeeds.

This shows the full overhead of memory management, which is problematic for serverless functions because they are very short-lived.
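The baseline first-touch path above can be modeled as a toy simulation. Everything here is a simplification for illustration (a dictionary standing in for the page table, a counter standing in for the kernel's work); it only shows *when* the expensive kernel fault handler runs.

```python
# Toy model of the baseline path: malloc returns a virtual address, but
# no physical page is mapped until the first touch faults into the kernel.
PAGE = 4096
page_table = {}    # virtual page number -> physical frame number
next_frame = [0]
faults = [0]

def kernel_fault_handler(vpn: int) -> None:
    """Kernel allocates a physical frame and installs the VA->PA mapping."""
    faults[0] += 1
    page_table[vpn] = next_frame[0]
    next_frame[0] += 1

def store(va: int, value: int) -> None:
    vpn = va // PAGE
    if vpn not in page_table:      # translation not present -> page fault
        kernel_fault_handler(vpn)  # expensive: trap into the kernel
    # with a VA->PA mapping in place, the store succeeds

p = 0x1000          # pretend malloc returned this virtual address
store(p, 42)        # first touch of the page: faults into the "kernel"
store(p + 8, 43)    # second touch of the same page: no fault
print(faults[0])    # -> 1
```

The point of the model: the kernel cost is paid once per page, which is exactly the cost a short-lived function cannot amortize.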

The Memento way

Hardware Object Allocator

It removes the userspace allocator cost for small objects (<= 512 B). This hardware operation is often as fast as an L1 cache access. The key idea here is that hardware manages memory in arenas, one per size class. An arena is a contiguous VA range used for only one size class. It has i) an arena header with a bitmap and list pointers, and ii) an arena body with a fixed number of objects.

Bitmap - a tiny structure with just enough bits to track all the slots in an arena body: 0 means the slot is free, and 1 means the slot is allocated. List pointer - simply a pointer used to link arenas together in a list. There are two lists: i) the available list, arenas with at least one free slot; ii) the full list, arenas with no free slots.

This hardware also has a tiny table called the HOT (Hardware Object Table): the HOT keeps the most recently used arena header for each size class.
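The arena bitmap and the HOT can be sketched together in software. This is an illustrative model under assumed simplifications (a Python dict standing in for the HOT, no real available/full list management), not the paper's exact hardware design.

```python
# Sketch of an arena header bitmap plus a HOT-like "most recently used
# arena per size class" table. Names and sizes are illustrative.

class Arena:
    """Arena header: a bitmap tracking the slots in the arena body."""
    def __init__(self, size_class: int):
        self.size_class = size_class
        self.bitmap = [0] * (4096 // size_class)  # 0 = free, 1 = allocated

    def alloc_slot(self):
        """Find the first free slot, mark it allocated, return its index."""
        for i, bit in enumerate(self.bitmap):
            if bit == 0:
                self.bitmap[i] = 1
                return i
        return None  # arena full: it would move to the full list

class HOT:
    """Hardware Object Table: most recently used arena per size class."""
    def __init__(self):
        self.entries = {}  # size class -> cached arena header

    def malloc(self, size_class: int):
        arena = self.entries.get(size_class)
        if arena is None or all(arena.bitmap):
            # HOT miss or arena full: real hardware would pull the next
            # arena off the available list, or ask the HPA for a new one
            arena = Arena(size_class)
            self.entries[size_class] = arena
        return arena, arena.alloc_slot()

hot = HOT()
_, s0 = hot.malloc(128)
_, s1 = hot.malloc(128)
print(s0, s1)  # consecutive allocations take slots 0 and 1
```

In hardware, the bitmap scan and bit flip happen in a fixed small number of cycles, which is why the post compares the cost to an L1 access rather than a software allocator call.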

This is what happens step by step:

So this whole process is very fast, as it is done in hardware and skips the software allocator.

Hardware Page Allocator

It removes the kernel/OS cost of getting pages mapped. The page allocator lets the application get memory from a reserved region without OS costs like context switches and kernel code execution. In a normal system, when the program touches a new page, the CPU triggers a page fault and the kernel handles it, but this kernel path is expensive, especially for short functions.

This does almost the same thing, but in hardware, so it is faster.

Note - there is a one-time OS setup before a program uses Memento, which includes

When the object allocator runs out of space for a size class, it asks the HPA for a new arena of that particular size class.

Here is the step-by-step process of the hardware:

Memento deliberately does NOT allocate RAM for all pages of the arena up front. Only the first page is backed by physical memory immediately; the remaining pages get backed on first access.
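The lazy-backing policy just described can be sketched as a small function. The structures and sizes here are illustrative assumptions, not the hardware's actual data layout.

```python
# Sketch of lazy backing: when the HPA grants a new arena, only the
# arena's first page gets a physical frame up front; the rest stay
# unmapped and are backed on first touch.
PAGE = 4096

def grant_arena(base_va: int, num_pages: int,
                page_table: dict, next_frame: list) -> None:
    """Back only the first page of a newly granted arena."""
    first_vpn = base_va // PAGE
    page_table[first_vpn] = next_frame[0]  # eagerly back page 0
    next_frame[0] += 1
    # pages 1..num_pages-1 are left not-present until first access

page_table, next_frame = {}, [0]
grant_arena(0x10000, num_pages=8, page_table=page_table, next_frame=next_frame)
print(len(page_table))  # -> 1 (only the first page is backed)
```

The design choice mirrors the kernel's own demand paging: a short-lived function that only ever touches one page of the arena never pays for the other seven.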

First access to an unbacked page (the page-fault replacement path)

As we have discussed, the normal path looks like: CPU -> page fault -> kernel allocates page -> kernel updates page table -> return. Memento changes it to: CPU -> page walk -> HPA allocates page and updates the Memento page table -> return mapping -> CPU continues.

So instead of a not-present translation faulting into the kernel, Memento passes it to the HPA, which allocates RAM and fills in the page tables, so there is no kernel trap.

So the whole thing is done in hardware without entering the kernel.
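The contrast between the two paths can be made concrete with a toy model. Everything here is illustrative (dictionaries for page tables, a counter for kernel traps); the point is only *where* a not-present translation gets handled.

```python
# Sketch contrasting the baseline fault path (trap to the kernel) with
# Memento's HPA path (the page walker hands the miss to hardware, which
# allocates a frame and fills the page table, with no kernel trap).
PAGE = 4096

class Memory:
    def __init__(self, use_hpa: bool):
        self.use_hpa = use_hpa
        self.page_table = {}
        self.next_frame = 0
        self.kernel_traps = 0

    def touch(self, va: int) -> None:
        vpn = va // PAGE
        if vpn not in self.page_table:
            if self.use_hpa:
                self._fill(vpn)        # handled in hardware, no trap
            else:
                self.kernel_traps += 1  # trap into the kernel
                self._fill(vpn)         # kernel does the same work, slowly

    def _fill(self, vpn: int) -> None:
        self.page_table[vpn] = self.next_frame
        self.next_frame += 1

baseline, memento = Memory(use_hpa=False), Memory(use_hpa=True)
for m in (baseline, memento):
    for page in range(4):
        m.touch(page * PAGE)
print(baseline.kernel_traps, memento.kernel_traps)  # -> 4 0
```

Both systems end up with the same mappings; what Memento eliminates is the per-page trip through the kernel on the critical path.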

Evaluation

Methodology

Memento is evaluated using a full-system simulation (QEMU, SST, and DRAMSim3) running Linux 5.18. Two new hardware structures are modeled:

CACTI analysis shows both incur minimal area and power overhead.

Workloads included are:

Performance

Memento achieves:

Speedup comes from four sources:

Across functions, gains are split between object-level acceleration and page management, showing that optimizing only malloc is insufficient.

Python and Golang benefit strongly from page-level acceleration due to larger heaps, while C++ workloads benefit primarily from object-level acceleration.

Memory Impact

Memento reduces:

Class Discussion

During our class discussion, there were some notable questions discussed:

Normalized bandwidth usage, and what is the alternative?

How much would the chip actually cost?

If you were Amazon and thinking of testing this, what would you test?

Takeaway

Memento delivers consistent performance and bandwidth improvements across languages and workloads with minimal hardware overhead, and its benefits extend beyond short-lived functions to serverless platforms and data-processing systems.

AI Disclosure

ChatGPT was used to summarize the paper, notes, and class discussion. The generative AI created a template which was then reviewed, edited and revised by the group.