Lesson 8: Fault Tolerance and Replication

On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.

Notes

The slides from lecture can be found on Canvas.

Distributed systems

So far, the systems we’ve looked at have been distributed in the sense that we have multiple elements working in parallel. However, what we’ll be talking about now is a little different, because these systems will be loosely coupled. We’ll talk about precisely what this means in lecture, but in short, the networks are unreliable, and nodes in the system are much more prone to failure. In loosely-coupled systems, we also tend to see more support for dynamic membership in the system, i.e. nodes can come and go at will. We’ll talk more about that next time.

What makes DS challenging?

Despite all of this, we still want to be able to build systems for such environments that continue to work. This requires fault tolerance. A fault-tolerant system is one that can detect faults and recover from them (and/or keep working correctly in their presence).

In the slides, we’ll try to cover (at a surface level):