From "Designing Data-Intensive Applications"
Impact of Process Pauses
Key Insight
A significant challenge in distributed systems is that a process's execution can be unexpectedly paused for arbitrary lengths of time, without the process itself being aware of the interruption. This can lead to critical failures if the system assumes continuous execution. Such pauses are common and can stem from various sources within a non-real-time operating environment.
Key causes of these pauses include 'stop-the-world' garbage collection (GC) pauses in runtimes such as the JVM, which have occasionally been known to last for minutes. A virtual machine (VM) can be suspended and later resumed, for example during live migration, and when a hypervisor shares one CPU core among several VMs, each can be paused for tens of milliseconds or longer. Other causes are operating system context switches, synchronous disk I/O (including lazy class loading and paging or swapping), and signals such as SIGSTOP, which halts a process until it receives SIGCONT. Any of these can preempt a running thread at any point and resume it much later.
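A process can at least notice a pause after the fact by comparing how long it asked to sleep with how much time actually elapsed. The following is a minimal sketch of that idea, not a production tool: the function name `detect_pauses`, its parameters, and the thresholds are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be tested without real delays.

```python
import time
from typing import Callable, List


def detect_pauses(
    iterations: int,
    interval: float = 0.01,
    threshold: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Return the overshoot, in seconds, of every sleep that ran long.

    The thread asks to sleep for `interval` seconds. If much more time than
    that has passed when it wakes up, something paused it (GC, VM suspension,
    scheduling, paging, SIGSTOP, ...). This sketch cannot tell which one.
    """
    pauses: List[float] = []
    for _ in range(iterations):
        before = clock()
        sleep(interval)
        # Time that elapsed beyond what was requested.
        overshoot = clock() - before - interval
        if overshoot > threshold:
            pauses.append(overshoot)
    return pauses
```

Note the limitation: the pause is only discovered once the thread is running again, so this kind of measurement helps with monitoring but cannot stop a paused thread from acting on stale assumptions when it resumes.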
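The consequences are severe: a node might believe it still holds a valid lease or is still the leader, but while it is paused the lease could expire or another node could be elected leader. On resuming, the node has no indication that time has passed and may perform unsafe operations, such as processing requests under an expired lease, leading to data corruption or split-brain scenarios. Mitigations include treating GC pauses like brief planned outages of a node, so that other nodes serve requests while one node collects garbage, or using the garbage collector only for short-lived objects and restarting processes periodically before a full GC becomes necessary. These measures reduce the impact of pauses but do not eliminate them, so fundamental guarantees require a design that stays correct under arbitrary pauses.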
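The lease hazard can be made concrete with a small simulation. This is a sketch under stated assumptions: `FakeClock`, `Lease`, `Storage`, and `simulate` are hypothetical names invented for illustration, and the clock is advanced by hand so the pause is reproducible. The storage-side token check is one commonly described defence (often called a fencing token); it is an assumption added here and is not described in the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class FakeClock:
    """Hypothetical manually advanced clock, used to make the pause reproducible."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass
class Lease:
    holder: str
    expiry: float  # the lease lapses at this time
    token: int     # grows by one every time the lease is granted


@dataclass
class Storage:
    """Resource that refuses writes carrying an older token than one already seen."""

    highest_token: int = 0
    writes: List[str] = field(default_factory=list)

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # writer holds a superseded lease
        self.highest_token = token
        self.writes.append(value)
        return True


def simulate() -> Dict[str, object]:
    clock = FakeClock()
    storage = Storage()

    # Node A is granted a 10-second lease with token 1 and checks it.
    lease_a = Lease("A", expiry=clock.now + 10.0, token=1)
    checked_valid = clock.now < lease_a.expiry

    # A is paused for 15 seconds between the check and the write.
    # Nothing inside A observes this; only the clock moves.
    clock.advance(15.0)

    # Meanwhile A's lease expired, node B was granted token 2, and B wrote.
    lease_b = Lease("B", expiry=clock.now + 10.0, token=2)
    b_accepted = storage.write(lease_b.token, "from B")

    # A resumes and acts on its earlier check, which is no longer true.
    actually_valid = clock.now < lease_a.expiry
    a_accepted = storage.write(lease_a.token, "from A")

    return {
        "checked_valid": checked_valid,
        "actually_valid": actually_valid,
        "b_accepted": b_accepted,
        "a_accepted": a_accepted,
        "writes": storage.writes,
    }
```

In this sketch node A's check passes, yet by the time it writes the lease has lapsed. Re-checking the lease just before the write would only narrow the window, since a pause can also fall between the second check and the write. That is why the rejection here happens in the storage service rather than in the paused node.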
Summarized from Designing Data-Intensive Applications by Martin Kleppmann.