Lesson 5: Distributed Memory and Message Passing
- discussion thread
^ MPI Basics
Overview of the MPI interface
^ Cloud Programming Simplified: A Berkeley View on Serverless Computing
UC Berkeley's take on serverless - tasks due January 21
On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.
Notes
I will cover some core concepts you’ll need to really get the most out of this paper. The slides can be found on Canvas.
Distributed Memory Machines
- One of the earliest ways to do parallel processing was to subdivide a problem between multiple machines connected together over a network. The key here is that each of these machines has its own memory with its own address space, and no machine can directly read or write another machine’s memory.
- One typical style of parallel processing with distributed memory is the bulk-synchronous parallel (BSP) model: each processor works away at its own subproblem for a while, then all processors exchange intermediate results over the network, and then processing proceeds to the next phase. The communication step is synchronized across all processors, hence the name. This would be called a “regular” workload, in that the communication steps happen at regular intervals (see the sketch after this list).
- One gotcha with this style of parallel processing is that it is subject to stragglers: one slow processor can slow down the whole computation.
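To make the BSP structure concrete, here is a minimal sketch of a superstep loop using mpi4py, one common Python binding for MPI. The local computation, the number of supersteps, and the use of allgather for the exchange phase are placeholders chosen for illustration, not anything prescribed by the model itself.

    # BSP sketch: compute locally, exchange results, repeat.
    # Run with e.g.: mpirun -n 4 python bsp_sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local_state = rank  # placeholder for this processor's subproblem

    for step in range(10):                # hypothetical number of supersteps
        local_state = local_state * 2     # placeholder local computation
        # Communication phase: every processor sees every other's result.
        all_states = comm.allgather(local_state)
        local_state = sum(all_states) // comm.Get_size()
        # allgather already synchronizes the processors; the explicit
        # barrier is only here to mirror the BSP description above.
        comm.Barrier()

Note how a single slow rank delays everyone at the exchange step, which is exactly the straggler problem mentioned above.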
Message Passing
- To allow processors to communicate, we need a programming interface and communication abstractions. We could just use something like sockets, but that is a bit too low-level. Really what we need is a namespace of processors: give each processing element a unique ID (MPI calls this a rank), and allow processors to communicate with one another using those IDs.
- This is where the Message Passing Interface (MPI) comes in. It gives us a high-level interface for sending messages between processors. It has evolved a lot over the years, but at its core it is meant to let you say “send this blob of data to processor number 4” (see the sketch after this list).
- This interface isn’t quite as flexible as the network APIs you may be used to, but it has the benefit of being simpler in some ways, and easier for non-CS people to pick up quickly (e.g., scientists).
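As a rough illustration of that rank-based namespace, here is a minimal point-to-point example using mpi4py; the tag and payload are arbitrary and chosen only for demonstration.

    # Rank 0 sends a small Python object to rank 1.
    # Run with e.g.: mpirun -n 2 python send_recv.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        comm.send({"payload": 42}, dest=1, tag=0)   # send to processor 1
    elif rank == 1:
        msg = comm.recv(source=0, tag=0)            # receive from processor 0
        print("rank 1 got:", msg)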
Collectives
- In many cases we may need to send messages to more than one processor, or we may want to capture common communication patterns in the API. We refer to these as collective operations.
- Example: one processor wants to send a message to all other processors. This would be a broadcast message. We can do that with MPI_Bcast.
- Example: say each processor has an integer value, and we want to sum all the integers in parallel. This is an example of a reduction operation. We could achieve this with the MPI_Reduce function (both collectives are sketched after this list).
- MPI is still used today as the de facto message passing interface.
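As a rough sketch of how these two collectives look from Python, here is the mpi4py equivalent of MPI_Bcast and MPI_Reduce; the broadcast payload and the per-rank values are made up for illustration.

    # Run with e.g.: mpirun -n 4 python collectives.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Broadcast: rank 0's value is copied to every other rank (MPI_Bcast).
    config = {"n_iters": 100} if rank == 0 else None
    config = comm.bcast(config, root=0)

    # Reduction: each rank contributes one integer; rank 0 gets the sum
    # (MPI_Reduce with MPI_SUM).
    local_value = rank + 1
    total = comm.reduce(local_value, op=MPI.SUM, root=0)
    if rank == 0:
        print("sum over all ranks:", total)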
Serverless/FaaS
- The serverless paradigm takes advantage of virtualization capabilities present in most software stacks today. The basic premise is this: suppose I have some Python code I want to run in the cloud in reaction to a sensor reading on my Raspberry Pi. The traditional way to set this up would be to stand up a server in the cloud (e.g., on AWS EC2) with a webserver to listen for requests from the Pi, some server-side scripts to initiate computations, and possibly some kind of lightweight database to store the input events. I’d also have to worry about setting up the OS for the cloud VM, and so on.
- However, as the developer all I really care about is that the Python function foo runs somewhere in reaction to a sensor event. I don’t really care what OS it runs on, what hardware it runs on, how the data is stored, etc.
- This is where serverless comes in: the cloud provider handles all of the infrastructure provisioning for you: the VM, the OS running on it, the Python runtime, etc. You simply provide the code to run and tell them what event should trigger that code. The rest is taken care of for you. These functions are usually provisioned in the context of a container running in a VM in the cloud.
- Serverless functions are meant to be stateless: the machine won’t have any (visible) memory of the function having run. For example, a serverless function cannot write a local file and expect it to still be there later. If you really want to save state, it is typically done using network-connected object storage (e.g., Amazon S3), as in the sketch below.
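To make this concrete, here is a hypothetical AWS Lambda-style handler in Python that reacts to an event and persists its result to S3. The bucket name, the event field, and the alert logic are all invented for illustration; the only real APIs used are the standard boto3 calls.

    import json
    import boto3  # AWS SDK for Python; assumes the function may write to S3

    s3 = boto3.client("s3")

    def handler(event, context):
        # "event" carries the trigger payload, e.g. the sensor reading.
        reading = event.get("sensor_value")          # hypothetical field name
        result = {"alert": reading is not None and reading > 100}

        # No local state survives this invocation, so persist the result
        # to object storage instead of a local file.
        s3.put_object(
            Bucket="my-sensor-results",              # hypothetical bucket
            Key="readings/" + context.aws_request_id + ".json",
            Body=json.dumps(result),
        )
        return result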
The Communication Problem
- Because serverless functions are stateless and short-lived, their execution environment is ephemeral. One effect of this is that they aren’t (at least from the programmer’s perspective) attached to a stable network namespace: a serverless function doesn’t have an IP address that other functions can connect to. So if we want two functions to communicate with one another, how do we do it?
- A common but expensive way to solve this problem is a centralized rendezvous server in a long-running VM that acts as a message board. Serverless functions register themselves with this server, and when another function wants to communicate, it looks up the proper address (name) in this centralized directory. The performance issues are hopefully obvious: every connection setup goes through a central bottleneck, and that VM has to stay up (and be paid for) whether or not any functions are running.
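A rough sketch of the rendezvous pattern, with a hypothetical HTTP key-value service standing in for the long-running rendezvous server; the URL, endpoint names, and address format are all invented for illustration.

    import requests  # assumes the functions can make outbound HTTP requests

    RENDEZVOUS_URL = "http://rendezvous.example.com"  # hypothetical server

    def register(my_name: str, my_address: str) -> None:
        # A function announces where (or how) it can be reached.
        requests.post(RENDEZVOUS_URL + "/register",
                      json={"name": my_name, "address": my_address})

    def lookup(peer_name: str) -> str:
        # Another function consults the central directory for that address.
        resp = requests.get(RENDEZVOUS_URL + "/lookup",
                            params={"name": peer_name})
        return resp.json()["address"]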
Note on Plots
Two types of plots appear in this paper that you may not have seen: box plots (the rectangles with the ticks on either end) and violin plots (the violin-shaped markers). These are a useful way to present a comprehensive view of experimental data. A violin plot is basically a PDF (or PMF for discrete data) turned on its side and mirrored: you get to see the whole data distribution, the widest part shows where the data is most concentrated, and most plotting tools also mark the median. Box plots have the same intent, but only capture the quartiles (and potentially outlier dots), not the shape of the entire distribution.
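If these plot types are new to you, a quick way to build intuition is to draw both for the same data. This matplotlib snippet uses synthetic data purely for practice; it has nothing to do with the paper’s own measurements.

    import numpy as np
    import matplotlib.pyplot as plt

    # Three synthetic distributions with different centers.
    rng = np.random.default_rng(0)
    data = [rng.normal(loc=mu, scale=1.0, size=500) for mu in (0, 2, 4)]

    fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 3))
    ax_box.boxplot(data)                           # quartiles + outlier dots
    ax_box.set_title("Box plot")
    ax_violin.violinplot(data, showmedians=True)   # full distribution shape
    ax_violin.set_title("Violin plot")
    plt.show()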