Cover of Designing Data-Intensive Applications by Martin Kleppmann - Business and Economics Book

From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

🎧 Free Preview Complete

You've listened to your free 10-minute preview.
Sign up free to continue listening to the full summary.

Create Free Account Sign In

🎧 Listen to Summary

Free 10-min Preview

0:00

Speed:

10:00 free remaining

Chapter 8: The Trouble with Distributed Systems

Key Insight 7 from this chapter

Knowledge, Truth, and Lies in Distributed Systems (System Models)

Key Insight

In a distributed system, a node cannot be certain of the global state; it can only make probabilistic guesses based on messages received or not received over an unreliable network. This uncertainty is heightened by partial failures, where a remote node's state is indistinguishable from network issues. A single node cannot be relied upon to make critical decisions, as its own judgment might be flawed (e.g., being declared dead by others while still functioning, or mistakenly acting on an expired lease after a pause).

To overcome this, many distributed algorithms rely on quorums, which involve a minimum number of nodes voting to reach a decision, such as declaring a node dead or electing a leader. This ensures robustness against individual node failures and guarantees a consistent view. Fencing tokens, monotonically increasing numbers issued by a lock service, are used to prevent 'lying' nodes (those acting on stale information) from corrupting resources by rejecting requests with older tokens.

System models abstract the types of faults an algorithm must tolerate: synchronous (bounded delays/pauses/clock error), partially synchronous (realistic, mostly bounded but occasionally unbounded), and asynchronous (no timing assumptions). Node fault models include crash-stop (node permanently gone), crash-recovery (node restarts with stable storage), and Byzantine (nodes may maliciously 'lie'). Correctness is defined by safety properties (nothing bad happens, always holds) and liveness properties (something good eventually happens, holds under certain conditions like network recovery). While models simplify reality, real-world implementations still need to handle 'impossible' scenarios like disk corruption, underscoring the gap between theoretical proofs and practical engineering.

📚 Continue Your Learning Journey — No Payment Required

Access the complete Designing Data-Intensive Applications summary with audio narration, key takeaways, and actionable insights from Martin Kleppmann.

📖 Read Full Summary 🔍 Explore More Books