From "Designing Data-Intensive Applications"
Fault Tolerance and Consensus in Distributed Systems
Key Insight
Many things can go wrong in a distributed system: packets can be lost, reordered, duplicated, or arbitrarily delayed; clocks are only approximate; and nodes can pause or crash at any time. When letting the whole service fail and showing an error is unacceptable, the system must tolerate these faults, that is, keep functioning correctly even when some internal component is faulty. The most effective way to build fault-tolerant systems is to find general-purpose abstractions that provide useful guarantees, implement them once, and let applications rely on those guarantees. Transactions work the same way: they hide crashes, race conditions, and disk failures, so applications do not have to handle those problems themselves.
One of the most important abstractions for distributed systems is consensus: getting all nodes to agree on something. Achieving consensus reliably in the presence of network faults and process failures is a surprisingly hard problem. Once a consensus implementation is available, applications can use it for many purposes. For example, in a database with single-leader replication, if the current leader fails, the remaining nodes can use consensus to reliably elect a new leader.
It is vital that there is only one leader and that all nodes agree on who that leader is. If two nodes each believe they are the leader at the same time, the result is a situation known as 'split brain', which often leads to data loss. A correct consensus implementation avoids split brain, which protects data integrity and keeps the system stable when nodes fail.
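To make the "only one leader" requirement concrete, here is a minimal sketch of one common idea: a candidate becomes leader only if it collects votes from a strict majority of the whole cluster, and each node votes at most once per election round. The `Node` class, the `try_elect` function, and the notion of a numbered `term` are hypothetical names chosen for this illustration; they are not taken from the summary, and this toy model is far from a complete consensus algorithm (it ignores message loss, retries, timeouts, and log replication).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A replica that grants at most one vote per term (hypothetical model)."""
    name: str
    voted_in: dict = field(default_factory=dict)  # term -> candidate voted for

    def request_vote(self, term: int, candidate: str) -> bool:
        # Record the first candidate seen for this term; later requests in
        # the same term succeed only if they come from that same candidate.
        previous = self.voted_in.setdefault(term, candidate)
        return previous == candidate


def try_elect(candidate: str, term: int, reachable: list, cluster_size: int) -> bool:
    """Return True if `candidate` wins a strict majority of the whole cluster."""
    votes = sum(node.request_vote(term, candidate) for node in reachable)
    return votes > cluster_size // 2


nodes = [Node(f"n{i}") for i in range(5)]

# Simulate a network partition: n0 and n1 on one side, n2..n4 on the other.
minority, majority = nodes[:2], nodes[2:]

# Both sides try to elect a leader in the same term.
elected_minority = try_elect("n0", term=1, reachable=minority, cluster_size=len(nodes))
elected_majority = try_elect("n2", term=1, reachable=majority, cluster_size=len(nodes))
```

Under these assumptions, the two-node side cannot reach three votes, so only the three-node side can elect a leader. Because any two majorities of the same cluster share at least one node, and that node votes only once per term, two candidates cannot both win the same term, which is the property that rules out split brain in this simplified model.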
Source: Designing Data-Intensive Applications by Martin Kleppmann.