Lesson 8: Fault Tolerance and Replication
- discussion thread
-
* Free Distributed Systems textbook
Maarten van Steen and Andrew Tanenbaum -
* Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport, 1978
On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.
Notes
The slides from lecture can be found on Canvas.
Distributed systems
So far, the systems we’ve looked at have been distributed in the sense that we have multiple elements working in parallel. However, what we’ll be talking about now is a little different, because these systems will be loosely coupled. We’ll talk about precisely what this means in lecture, but in short, the networks are unreliable, and nodes in the system are much more prone to failure. In loosely-coupled systems, we also tend to see more support for dynamic membership in the system, i.e. nodes can come and go at will. We’ll talk more about that next time.
What makes DS challenging?
- unreliable communication links (WiFi, undersea cable links, IT guy unplugging the ethernet cable, etc.)
- processes may crash due to bugs or node failures
- events generally are non-deterministic
Despite all of this, we still want to be able to build systems for such environments that continue to work. This requires fault tolerance. A fault-tolerant system is one that can detect faults and recover from them (and/or keep working correctly in their presence).
In the slides, we’ll try to cover (at a surface level):
- RPC and networking
- Two-generals and byzantine generals problem
- System models (network model, node/failure model, timing model)
- Failure Detection
- Lamport Clocks and Vector clocks
- Broadcast Models
- Basics of Replication
- Idempotence
- Quorums
- State Machine Replication (SMR)