From "Designing Data-Intensive Applications"
Supercomputing vs. Cloud Computing Fault Handling
Key Insight
Large-scale computing systems adopt diverse philosophies for handling faults. High-Performance Computing (HPC), characterized by supercomputers with thousands of CPUs, is typically used for computationally intensive scientific tasks like weather forecasting. These systems often use specialized, reliable hardware and communicate through shared memory or remote direct memory access (RDMA).
In supercomputers, fault handling involves checkpointing the computation state to durable storage. If a node fails, the common solution is to halt the entire cluster workload. After repair, the computation restarts from the last checkpoint. This approach treats partial failure as a total system failure, much like a kernel panic on a single machine, prioritizing eventual correctness over continuous availability.
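The checkpoint-restart pattern described above can be sketched as a small loop that periodically persists its state and, on restart, resumes from the last saved point. This is a minimal illustration, not HPC checkpointing software; the file name `state.ckpt` and the toy computation are assumptions for the example.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name

def save_checkpoint(state):
    # Write atomically: dump to a temp file, then rename it over the old
    # checkpoint, so a crash mid-write never corrupts the last good state.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # After a failure, resume from the last durable checkpoint;
    # otherwise start the computation from scratch.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

def run(steps=100, checkpoint_every=10):
    state = load_checkpoint()
    while state["step"] < steps:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)  # durable progress marker
    return state["total"]
```

If the process dies between checkpoints, up to `checkpoint_every` steps of work are repeated on restart — the trade the text describes: correctness is preserved, but the whole job stalls until the restart completes.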
Conversely, cloud computing and internet services prioritize online availability with low latency, often leveraging multi-tenant data centers with commodity machines and IP networks. These systems experience higher failure rates but cannot tolerate service interruptions. Instead, they build fault-tolerance into software, enabling rolling upgrades or replacing poorly performing virtual machines without stopping the entire service, even across geographically distributed deployments.
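The software-level fault tolerance contrast can be sketched as a client that routes around failed replicas rather than halting: one machine's failure degrades capacity but never stops the service. The `Node` class, failure rates, and retry policy below are illustrative assumptions, not a real cloud API.

```python
import random

class Node:
    """A hypothetical service replica that may fail on any request."""
    def __init__(self, name, failure_rate):
        self.name = name
        self.failure_rate = failure_rate

    def handle(self, request):
        if random.random() < self.failure_rate:
            raise ConnectionError(f"{self.name} unavailable")
        return f"{self.name} served {request}"

def call_with_failover(nodes, request, retries=5):
    # Tolerate partial failure in software: try replicas in turn instead
    # of treating one node's failure as a total outage.
    last_error = None
    for attempt in range(retries):
        node = nodes[attempt % len(nodes)]
        try:
            return node.handle(request)
        except ConnectionError as e:
            last_error = e  # this replica is suspect; try the next one
    raise RuntimeError("all replicas failed") from last_error

# Node "a" is flaky; node "b" is healthy, so requests still succeed.
nodes = [Node("a", 0.9), Node("b", 0.0)]
print(call_with_failover(nodes, "req-1"))
```

The same routing idea underlies rolling upgrades: replicas are taken out of rotation one at a time, and traffic simply flows to the remaining healthy nodes.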