FMI: Fast and Cheap Message Passing for Serverless Functions

by Sami Aljabery (Scribe), Gabriel Rodgers (Blogger), Noah Bean (Leader) January 22, 2025

Introduction

This blog post covers FMI: Fast and Cheap Message Passing for Serverless Functions, a research paper submitted on May 15, 2023, and presented on January 22, 2025. The paper introduces the FaaS Message Interface (FMI), a high-performance communication framework for serverless computing. Traditional serverless architectures rely on storage-based communication (AWS S3, Redis, DynamoDB), introducing significant latency and cost overhead. FMI overcomes these challenges using direct TCP communication enabled through TCP NAT hole punching (TCPunch), reducing latency, cost, and complexity.

Background and Context

Function-as-a-Service (FaaS) is a widely used serverless cloud computing model, offering elastic scaling and fine-grained billing, making it ideal for machine learning, data analytics, and distributed applications. However, serverless functions lack efficient, low-latency communication mechanisms and often depend on cloud storage-based solutions such as AWS S3, Redis, and DynamoDB. These solutions increase latency and cost, making frequent inter-function communication inefficient.

Additionally, serverless functions operate behind Network Address Translation (NAT) gateways, preventing direct connections between functions. This introduces overhead and complexity, requiring functions to relay messages through cloud-based storage which further increases latency.

To solve these issues, FMI introduces TCP NAT hole punching (TCPunch) to establish direct, low-latency connections between functions.

KEYWORDS

High-Performance Computing (HPC) – Systems designed for maximum processing speed and efficiency, often used for computationally intensive tasks. Utilizes optimized communication protocols like Message Passing Interface (MPI).
I/O (Input/Output) – Data transfer mechanisms affecting performance in serverless platforms. High-latency I/O limits performance and scalability.
Serverless – A cloud model where functions execute without server management. Cost-effective, scalable, and eliminates infrastructure maintenance.
Function-as-a-Service (FaaS) – A subset of serverless computing focused on stateless function execution with elastic scaling and fine-grained billing.
Remote Memory Access (RMA) – A mechanism allowing a process to directly access the memory of a remote process, reducing access latency.
Stateless Functions – Functions without memory of prior invocations, simplifying scaling and deployment.
Elastic Scaling – Dynamic resource allocation based on workload demand.
Communication Bottleneck – Slow and costly inter-function messaging in serverless platforms.
Cloud Storage – Persistent storage solutions (AWS S3, Redis, DynamoDB) with high latency and additional cost.
Transmission Control Protocol (TCP) – A reliable communication protocol. FMI leverages TCP for message passing.
Network Address Translation (NAT) – A process that maps private IP addresses to public IPs, enabling internet communication but restricting direct function-to-function networking.
NAT Hole Punching – A technique to bypass NAT restrictions by using a relay server to share external ports, enabling direct connections.
TCP NAT Hole Punching (TCPunch) – The paper’s novel technique for direct TCP communication in serverless environments.
Message Passing Interface (MPI) – A standardized API for parallel computing that FMI is inspired by.
AWS Lambda – A widely used serverless computing platform.
Redis – An in-memory data store and fast message broker. Suitable for low-latency but requires user-managed scalability.
DynamoDB – A fully managed NoSQL database optimized for low-latency data retrieval.
AWS S3 – Scalable object storage with high durability and availability. Has higher latency and cost for frequent communication.

Detailed Summary of the Paper

Summary of 1. Introduction

The paper introduces the FaaS Message Interface (FMI), a modular and high-performance communication framework designed to address the inefficiencies of serverless communication. Inspired by the Message Passing Interface (MPI), FMI brings standardized abstractions for point-to-point and collective communication to Function-as-a-Service (FaaS) platforms.

Key contributions include:

A library for message passing that provides common, standardized abstractions for serverless point-to-point and group communication.
Analytical models for communication channels in FaaS and a discussion of the performance-price trade-offs of serverless communication.
A demonstration of the application of FMI to serverless machine learning, showing reduced communication overhead by up to 162x and cost savings up to 397x compared to existing solutions.

Summary of 2. Design of FMI

FMI’s design includes multiple communication channels, each tailored for specific use cases and trade-offs:

Direct Channels: Utilize TCP connections with NAT hole punching for low-latency, efficient communication.
Mediated Channels: Leverage cloud storage systems like AWS S3, Redis, and DynamoDB for scenarios requiring persistent data exchange.

Key features of FMI’s design:

Cloud-Agnostic Portability: Operates across various serverless platforms (AWS Lambda, Kubernetes) while integrating with MPI for hybrid HPC-serverless workflows.
Customizability: Developers can add new communication protocols (QUIC) and optimize collective algorithms for workload-specific needs.
Collective Operations: Implements MPI-inspired collective operations (broadcast, reduce, allreduce) optimized for serverless contexts using efficient algorithms.

Summary of 3. Implementation of FMI

The FMI framework is lightweight (~1,900 lines of C++ code) and provides:

TCPunch Library: A custom NAT hole-punching solution storing and sharing address translations to enable direct connections between serverless functions.
Multi-language Support: Primarily C++, but with Python bindings for accessibility.
Communicators: Organize serverless functions into groups, enabling scalable and independent communication patterns. Functions within a communicator synchronize with timers for consistency and fault isolation.

Summary of 4. Evaluation

FMI is evaluated for latency, bandwidth, cost, and scalability:

Latency: Direct TCP communication reaches microsecond-level latency, ~162x faster than AWS S3 or DynamoDB.
Cost Efficiency: FMI reduces costs by up to 397x, with as little as $0.02 per 1,000 epochs for distributed machine learning workloads.
Scalability: Scales efficiently to 256 functions, outperforming Redis and DynamoDB, which suffer bottlenecks and timeouts at scale.
Bandwidth: Delivers superior bandwidth across various message sizes, maintaining stability under high concurrency.

Important Results

Reduction in Communication Latency
- Direct TCP achieves microsecond-level latency, up to 162x faster than storage-based methods.
Cost Savings
- Up to 397x reduction in communication costs. Some ML workloads see costs under $0.02 per 1,000 epochs, versus $7.52 with DynamoDB.
Improved Scalability
- Efficient scaling to 256 serverless functions while maintaining low latency and high bandwidth.
Bandwidth Performance
- Superior bandwidth performance across message sizes, stable under high concurrency.
Optimized Collective Operations
- Implements broadcast, reduce, and allreduce with the lowest latency across all evaluated solutions.
Case Study in Distributed Machine Learning
- Replacing DynamoDB with FMI yields a 1224x improvement in communication speed, with no significant integration overhead.
Minimal Integration Overhead
- Only four lines of code were changed to replace DynamoDB with FMI in the machine learning example.

Strengths and Weaknesses of the Paper

Strengths

Innovative Solution
- Introduces a novel approach (TCP NAT hole punching) to solve communication bottlenecks in serverless computing.
Comprehensive Evaluation
- Benchmarks compare FMI against AWS S3, DynamoDB, and Redis for latency, cost, bandwidth, and scalability.
Scalability
- Maintains performance up to 256 serverless functions without significant degradation.
Low Cost and High Performance
- Achieves up to 397x cost savings and 162x faster communication.
Portability and Modularity
- Cloud-agnostic design compatible with AWS Lambda, Kubernetes, and MPI.
Ease of Integration
- Minimal code changes required, facilitating adoption in existing systems.

Weaknesses

Reliance on Assumptions
- Assumes all functions in a communication group are co-located and run concurrently, which may not always hold in practice.
Limited Fault Tolerance
- Lacks built-in mechanisms for handling individual function failures mid-communication.
Dependency on External Infrastructure
- Requires a hole-punching server for address translation.
Limited Real-World Testing
- Evaluated mainly in controlled benchmarks and case studies with broader real-world validations needed.

Class Discussion

Clarification on Table 1

Table 1 presents a general performance comparison of object storage, NoSQL databases, in-memory caches, and direct TCP.
Key takeaway: Direct TCP has zero cloud provider cost, but the computation burden may shift to the user, slightly reducing the impact of “no cost.”

UDP vs TCP?

Why UDP wasn’t tested:
- Cloud providers (like AWS) control UDP usage and do not currently allow it in serverless environments.
- Existing serverless frameworks already provide direct TCP communication.
- If AWS supported UDP for serverless computing, the authors might have tested it instead of TCP.

Open Source Cloud?

Are cloud providers open source?
- Typically, no. Cloud providers like AWS are highly opaque, preventing visibility into the underlying infrastructure.
- They can move serverless functions among different physical servers without user knowledge, complicating assumptions about network state.

Clarification on Figures

Violin plots and box plots: The paper uses them to display data distribution across multiple trials.
- Violin plots show data spread (with a median dot).
- Box plots depict the median, quartiles, and outliers more succinctly.
The authors use these to demonstrate performance variance and the stability of FMI versus other solutions.

OpenAI (and other cloud providers’) Use of FMI?

Could OpenAI use FMI?
- Potentially yes. OpenAI’s services could benefit from direct TCP for certain workloads.
- TCP tunables (like window size) impact performance, as shown in the paper’s figures.
- If an open-source serverless framework offered full control, UDP or RMA-based approaches might be tested as well.
- AWS’s closed nature limits certain optimizations.

RMA Discussion?

The paper references RMA (Remote Memory Access) as another model of communication.
- It’s mentioned to highlight potential benefits and trade-offs of different data-exchange paradigms in heterogeneous environments (including FPGAs).

Sources

FMI: Fast and Cheap Message Passing for Serverless Functions
OSTEP Textbook – Remzi Arpaci-Dusseau
Computer Organization and Design RISC-V Edition: The Hardware/Software Interface, 2nd Edition (2020)
- Authors: David A. Patterson, John L. Hennessy
Computer Architecture, Sixth Edition: A Quantitative Approach (2017)
- Authors: John L. Hennessy, David A. Patterson

Generative AI

AI Tools Used

ChatGPT: https://chatgpt.com/
DeepSeek: https://www.deepseek.com/

These tools aided in:

Generating ideas for the outline.
Explaining complex keywords.
Providing prose feedback.
Converting the document to Markdown

Example Contribution

Improving Clarity: Simplifying sentence structure and refining word choices for better readability.
Changed “The paper references RMA, which does shared memory over distributed memory - another way of doing send/receives. What was this reference for? The reference to RMA was to talk about heterogeneous environments in comparison with others such as FPGAs.” to “The paper references RMA (Remote Memory Access) as another model of communication. It’s mentioned to highlight potential benefits and trade-offs of different data-exchange paradigms in heterogeneous environments (including FPGAs).”

Limitations

Potential Inaccuracy: Generative AI content should be externally validated.
Brainstorming vs. Factual Source: Great for brainstorming and suggestions, but final facts must be verified.
Unavailable: Deepseek provides a free alternative to ChatGPT, but the server is frequently Unavailable.
Low Quality: Writing is often repetitive and uninspired with important details missing. Changed “Message Passing Interface (MPI) – A standardized API for parallel programming in distributed memory systems. MPI enables communication between processes on different nodes. FMI is modeled after MPI to provide similar abstractions for serverless platforms.” to “Message Passing Interface (MPI) – A standardized API for parallel computing, which FMI models for FaaS.”

Did you find this post insightful? Share your thoughts below!

The ECE 4/599 Course Blog