From "Designing Data-Intensive Applications"
🎧 Listen to Summary
Free 10-min PreviewUnreliable Networks
Key Insight
Shared-nothing distributed systems rely entirely on asynchronous packet networks like the internet or datacenter Ethernet. In such networks, a node sends a message without guarantees of timely arrival or even delivery. This introduces numerous failure modes where a request may be lost, delayed in a queue, or the remote node may have failed or temporarily paused, making it impossible for the sender to distinguish the cause if no response is received.
The standard method for handling non-responses is a timeout. However, a timeout provides no information about whether the request was processed, and if it was, whether the response was lost or delayed. Setting an appropriate timeout is difficult: a long timeout leads to prolonged waiting, while a short one risks prematurely declaring a node dead, especially if it's merely slow due to temporary overload.
Network delays are largely unbounded and variable, often due to queueing. This occurs when multiple nodes target the same destination, filling switch queues, or when CPU cores are busy, causing OS-level queueing. Virtual machine pauses, TCP flow control, and automatic retransmissions also contribute to delay variability. These factors mean there's no 'correct' timeout value, necessitating experimental determination and adaptive mechanisms like Phi Accrual failure detectors.
📚 Continue Your Learning Journey — No Payment Required
Access the complete Designing Data-Intensive Applications summary with audio narration, key takeaways, and actionable insights from Martin Kleppmann.