Lesson 8: Fault Tolerance and Replication

discussion thread
* Free Distributed Systems textbook
Maarten van Steen and Andrew Tanenbaum
* Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport, 1978

On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.

Notes

The slides from lecture can be found on Canvas.

Distributed systems

So far, the systems we’ve looked at have been distributed in the sense that we have multiple elements working in parallel. However, what we’ll be talking about now is a little different, because these systems will be loosely coupled. We’ll talk about precisely what this means in lecture, but in short, the networks are unreliable, and nodes in the system are much more prone to failure. In loosely-coupled systems, we also tend to see more support for dynamic membership in the system, i.e. nodes can come and go at will. We’ll talk more about that next time.

What makes DS challenging?

unreliable communication links (WiFi, undersea cable links, IT guy unplugging the ethernet cable, etc.)
processes may crash due to bugs or node failures
events generally are non-deterministic

Despite all of this, we still want to be able to build systems for such environments that continue to work. This requires fault tolerance. A fault-tolerant system is one that can detect faults and recover from them (and/or keep working correctly in their presence).

In the slides, we’ll try to cover (at a surface level):

RPC and networking
Two-generals and byzantine generals problem
System models (network model, node/failure model, timing model)
Failure Detection
Lamport Clocks and Vector clocks
Broadcast Models
Basics of Replication
Idempotence
Quorums
State Machine Replication (SMR)