Lesson 10: In-Memory KV Stores

On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.

Notes

The slides from lecture can be found on Canvas.

Basics of KV Stores

You can think of a key-value store (KV store) as a big hash table storing variable-length objects. These objects are looked up using keys. If you’re a Python programmer, you can think of a dict. However, unlike typical hash tables, KV stores may add persistence and/or distribution. One common use is as an alternative to a relational database (i.e., as a NoSQL database): here we need the values to be durable on disk, but we don’t need complex queries like joins or aggregations. Today’s paper is concerned with KV caches, where the values are not made durable and everything lives in memory. Even in such a system, though, KV caches are commonly distributed among many machines to increase availability and performance.
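To make the hash-table analogy concrete, here is a toy in-memory KV store in C. Everything about it is illustrative: the fixed-size chained table, the djb2 string hash, and the kv_put/kv_get names are all choices made for this sketch, and it deliberately omits the persistence and distribution layers that distinguish real KV stores from an ordinary hash table.

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256

struct kv_node {
    char *key;
    char *value;                  /* values are variable-length strings */
    struct kv_node *next;
};

static struct kv_node *buckets[NBUCKETS];

static unsigned bucket_of(const char *key) {
    unsigned h = 5381;            /* djb2 string hash */
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h % NBUCKETS;
}

void kv_put(const char *key, const char *value) {
    unsigned b = bucket_of(key);
    for (struct kv_node *n = buckets[b]; n; n = n->next)
        if (strcmp(n->key, key) == 0) {       /* key exists: replace value */
            free(n->value);
            n->value = strdup(value);
            return;
        }
    struct kv_node *n = malloc(sizeof *n);    /* new key: prepend to chain */
    n->key = strdup(key);
    n->value = strdup(value);
    n->next = buckets[b];
    buckets[b] = n;
}

const char *kv_get(const char *key) {
    for (struct kv_node *n = buckets[bucket_of(key)]; n; n = n->next)
        if (strcmp(n->key, key) == 0) return n->value;
    return NULL;                  /* key absent */
}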

Some KV stores can maintain consistent replicas; etcd, for example, does so using Raft.

Distributed hash tables (DHT)

How do we spread a KV cache across multiple nodes? The easy answer is to just partition your key space. Suppose we use English given names as keys. We might say node 0 handles names starting with “A-C”, node 1 handles names starting with “D-F”, and so on. There are two problems with this:

  1. there may be hot spots (the request stream may not be uniformly distributed across the keyspace), limiting performance.
  2. node additions are difficult (adding a new node means we have to repartition the keyspace).

The alternative is to use an implementation that evenly distributes the keyspace and handles dynamic node changes. Consistent hashing is a common approach.
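As a rough illustration (not the only formulation of consistent hashing), the sketch below places both server nodes and keys on the same 32-bit hash ring; a key is served by the first node at or clockwise after the key’s hash, wrapping around if needed. The node names and the FNV-1a hash are arbitrary choices for this sketch, and real systems add many “virtual nodes” per server to smooth out the load.

#include <stdint.h>
#include <stdio.h>

#define NNODES 4

/* FNV-1a: a simple, well-known 32-bit string hash. */
static uint32_t hash32(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

static const char *nodes[NNODES] = { "node0", "node1", "node2", "node3" };

/* Find the node whose ring position is first at or after hash(key);
 * if there is none, wrap around to the node with the smallest position. */
static const char *lookup(const char *key) {
    uint32_t k = hash32(key);
    const char *succ = NULL, *min = NULL;
    uint32_t succ_pos = 0, min_pos = 0;
    for (int i = 0; i < NNODES; i++) {
        uint32_t pos = hash32(nodes[i]);
        if (min == NULL || pos < min_pos) { min = nodes[i]; min_pos = pos; }
        if (pos >= k && (succ == NULL || pos < succ_pos)) { succ = nodes[i]; succ_pos = pos; }
    }
    return succ ? succ : min;
}

int main(void) {
    printf("Alice -> %s\n", lookup("Alice"));
    printf("Dana  -> %s\n", lookup("Dana"));
    return 0;
}

The payoff is that adding or removing a node only moves the keys in that node’s arc of the ring, rather than repartitioning the entire keyspace.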

Slab Allocators

External fragmentation (holes of unusable free space between allocations) wastes memory and makes allocation slower. Allocators have to handle this either by defragmenting (e.g., compacting) the free space, or by using fixed-length allocations that render such fragmentation impossible. A slab allocator is an instance of the latter approach. The basic observation is that some systems make many allocations of the same type of object: in an OS, for example, we might have many instances of a struct page, a struct mem_region, or a struct proc. Since all such structures are the same size, we can tailor our allocator to exploit that. This is what a slab allocator does: each object type gets its own cache. Within that cache, we keep several large slabs of memory, each of which has k slots for objects of the corresponding type. To allocate, we first check whether an existing slab has a free slot; if so, we take it. If not, we create a new slab (by requesting memory from the OS, e.g., via mmap()). Slabs themselves are fixed size, so their allocation is also relatively fast. This approach has caching benefits as well, since identical objects are packed densely together.
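Here is a minimal single-slab cache in C to illustrate the mechanics. It makes simplifying assumptions: one statically sized slab of 64 slots, and a free list threaded through the free slots themselves. A real slab allocator manages many slabs per object cache and grows by requesting fresh slabs from the OS.

#include <stddef.h>

#define SLOTS 64

struct obj {                      /* the fixed-size object type this cache serves */
    long payload[4];
};

union slot {                      /* a free slot reuses its own storage as a link */
    struct obj obj;
    union slot *next_free;
};

struct slab_cache {
    union slot slots[SLOTS];
    union slot *free_list;
};

void cache_init(struct slab_cache *c) {
    /* thread every slot onto the free list */
    for (int i = 0; i < SLOTS - 1; i++)
        c->slots[i].next_free = &c->slots[i + 1];
    c->slots[SLOTS - 1].next_free = NULL;
    c->free_list = &c->slots[0];
}

struct obj *cache_alloc(struct slab_cache *c) {
    union slot *s = c->free_list;
    if (s == NULL) return NULL;   /* a real allocator would grab a new slab here */
    c->free_list = s->next_free;  /* pop the first free slot */
    return &s->obj;
}

void cache_free(struct slab_cache *c, struct obj *o) {
    union slot *s = (union slot *)o;
    s->next_free = c->free_list;  /* push the slot back onto the free list */
    c->free_list = s;
}

Note that both allocation and free are O(1) pointer operations, which is the speed benefit described above.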

Log-structured Storage

In some systems, in-place updates introduce performance issues and poor locality. An example would be a storage medium with bad random-access performance (e.g., a spinning disk drive). This suggests implementation schemes that prefer sequential access. The basic idea of log-structured storage is that instead of doing in-place updates to existing data, we just append a new copy to the log: new items are added at the tail end, and updates also produce new log entries at the tail. Now, of course, we need to manage the stale copies toward the head (the older end) of the log that were stranded by updates. We can use garbage collection for this, periodically scanning the log from the head and clearing out overwritten items. This may produce holes, necessitating log compaction. See LFS (the log-structured file system) for an influential example.
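The sketch below shows the append-only discipline at toy scale, assuming an in-memory array stands in for the on-disk log and that keys and values are plain ints. Puts always append at the tail, gets scan backward so the newest copy wins, and compaction copies only the live entries into a fresh log.

#define LOG_CAP 1024
#define NO_ENTRY (-1)

struct entry { int key; int value; };

static struct entry log_[LOG_CAP];
static int tail = 0;                         /* next append position */

/* Updates never overwrite: they append a fresh entry at the tail. */
void put(int key, int value) {
    log_[tail].key = key;
    log_[tail].value = value;
    tail++;
}

/* The newest (rightmost) entry for a key wins; older copies are stale. */
int get(int key) {
    for (int i = tail - 1; i >= 0; i--)
        if (log_[i].key == key) return log_[i].value;
    return NO_ENTRY;
}

/* Compaction: keep only the newest entry per key, reclaiming the space
 * held by stale copies. (Quadratic here; real systems keep an index.) */
void compact(void) {
    struct entry fresh[LOG_CAP];
    int n = 0;
    for (int i = tail - 1; i >= 0; i--) {    /* scan newest to oldest */
        int seen = 0;
        for (int j = 0; j < n; j++)
            if (fresh[j].key == log_[i].key) { seen = 1; break; }
        if (!seen) fresh[n++] = log_[i];
    }
    for (int i = 0; i < n; i++)              /* restore oldest-first order */
        log_[i] = fresh[n - 1 - i];
    tail = n;
}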

Spin Locks

This is a simple lock implementation (to ensure mutual exclusion for critical sections). There is a flag in a shared memory location, and all threads race to atomically set it to obtain exclusive access. Here’s a sample implementation in C, using C11’s stdatomic.h:

#include <stdatomic.h>

typedef struct {
    atomic_flag flag;
} spinlock_t;

void spinlock_init(spinlock_t *l) {
    atomic_flag_clear(&l->flag);
}

void spinlock_lock(spinlock_t *l) {
    // atomic_flag_test_and_set() sets the flag and returns its previous
    // value; keep spinning until that value was clear, i.e., we got the lock
    while (atomic_flag_test_and_set(&l->flag))
        ;
}

void spinlock_unlock(spinlock_t *l) {
    // clearing the flag atomically releases the lock
    atomic_flag_clear(&l->flag);
}

Frequency Counting

The count-min sketch is one algorithm referred to in today’s paper. The basic idea is that, to save space, we don’t want to store exact counts for every item. This is useful for large-scale streaming systems, where we see a huge number of objects and some degree of inaccuracy is acceptable. We maintain a 2D matrix of dimensions $D \times W$, with $D$ rows and $W$ columns. When we see a request for a key $k$ (e.g., a read), we want to update its frequency count. We compute $D$ hashes of the key, $H_1(k), H_2(k), \ldots, H_D(k)$; hash $H_d$ produces an index into one of the $W$ counters in row $d$, and we increment each of those $D$ counters. You can hopefully see that, because these hashes can collide, two items might map to the same counter in any given row (hash). To minimize this error, on a lookup we take the minimum of the $D$ counts: collisions can only inflate a counter, never deflate it, so the minimum is the least-overestimated value. We also size our matrix sufficiently to keep collisions rare. The CM sketch is an example of a probabilistic data structure. Bloom filters are another one commonly used in systems.
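Here is a compact C sketch of the structure just described, assuming $D = 4$ rows, $W = 1024$ columns, and per-row hash functions made by mixing a row-specific seed into FNV-1a. (Real deployments derive $D$ and $W$ from the desired error and confidence bounds.)

#include <stdint.h>

#define D 4
#define W 1024

static uint32_t counts[D][W];

/* Seeding FNV-1a with the row number gives D different hash functions. */
static uint32_t row_hash(int row, const char *key) {
    uint32_t h = 2166136261u ^ ((uint32_t)row * 0x9e3779b9u);
    while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
    return h % W;
}

/* Update: bump one counter in each of the D rows. */
void cms_add(const char *key) {
    for (int d = 0; d < D; d++)
        counts[d][row_hash(d, key)]++;
}

/* Estimate: collisions only inflate counters, so the minimum across the
 * D rows is the least-overestimated count. */
uint32_t cms_estimate(const char *key) {
    uint32_t min = UINT32_MAX;
    for (int d = 0; d < D; d++) {
        uint32_t c = counts[d][row_hash(d, key)];
        if (c < min) min = c;
    }
    return min;
}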

Notes on methods

Trace-based Evaluation

Typically we can’t run experimental prototypes on production systems. Sometimes companies (like Twitter) will release traces (usually with proprietary and user info obfuscated) that allow researchers in the community to understand the behavior of real systems.

Evaluation Methods

The Segcache paper employed two types of evaluation methods you’ll commonly see. An ablation study is something you’ll see when a proposed system or method has several components, each of which may contribute to the figure of merit (e.g., raising throughput or lowering latency). In such cases, readers will want to know which component has the largest impact. That can be determined by selectively enabling or disabling the components (or optimizations). In general, a perfectly complete ablation study for $n$ components would have $2^n$ experimental configurations, so you’ll rarely see an exhaustive study once $n$ grows beyond two or three. Instead, what you’ll often see is the components applied successively (the system with no optimizations, then with optimization 1 only, with optimizations 1 and 2, with 1, 2, and 3, and so on).

A sensitivity study is similar in spirit, in that we want to see the degree to which design choices impact the system’s performance (or another metric). This type of study is used when the design has some parameter whose value could affect the evaluation metric. In such cases we can “vary the knob” for this parameter and see how the system responds. As an example, in a consensus protocol we might vary the leader timeout to see how it affects failure-recovery times; in an OS we might vary the timer-interrupt rate to see its impact on job throughput or performance interference.