Lesson 9: Byzantine Fault Tolerance

discussion thread
* The Byzantine Generals Problem
Lamport, 1982
* Practical Byzantine Fault Tolerance
Castro and Liskov, OSDI '99
* Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults
Clement, Wong, Alvisi, and Dahlin, NSDI '09

On readings: Recommended background readings are marked with (^) above. Optional historical or fun readings are marked with (*). If you feel comfortable with the topic already, you may skip these readings.

Notes

The slides from lecture can be found on Canvas.

Extending system models

In particular, we are worried about the Byzantine failure model, which is stronger than fail-stop or fail-recover. We talked briefly last time about the two generals problem and Byzantine generals. The latter is our focus today. Why do we need such a model? Primarily to be able to handle realities in the world:

Falied nodes or processes may fail in arbitrary (or malicious) ways:
- One of the nodes may be deploying an older version of the protocol, leading to weird behavior
- There could be a bug in one or more of the processes
- An attacker could have compromised one of the nodes These are all realistic problems that we need to handle to properly handle faults. With the Byzantine failure model, any node behavior is possible, and nodes can fail in arbitrary ways. This is why this is the strongest failure model.
We may want to operate a distributed system in the presence of untrusted nodes. This was not true when using Raft or Paxos for consensus. A notable example of where we’d want this is some blockchain-based platforms.

In general, we may want to have a system with dynamic membership, in the sense that new nodes could join the system at any time. This is something we’d want for, e.g. cryptocurrency blockchains, but something we will ignore here, as they will require different forms of consensus algorithms.