
From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 1: Reliable, Scalable, and Maintainable Applications

Key Insight 2: Reliability

Reliability is a system's ability to continue operating correctly and performing its expected functions even when adverse events, or 'faults', occur. A fault occurs when one component of a system deviates from its specification, whereas a 'failure' means the system as a whole stops providing its required service. Since it is impossible to eliminate all faults, fault-tolerance mechanisms are crucial for preventing faults from escalating into failures. Counterintuitively, deliberately introducing faults, such as randomly killing processes (as Netflix's Chaos Monkey does), can improve reliability: it continually exercises the fault-tolerance machinery and exposes weak error handling before it causes a real outage.
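The fault-versus-failure distinction can be illustrated with a toy sketch: a supervisor restarts a crashed worker so that an injected fault (a killed process) never becomes a failure (an unanswered request). All names here are illustrative, not from the book or from Chaos Monkey itself.

```python
import random

class Worker:
    """A simulated service process; a fault leaves it unable to serve."""
    def __init__(self):
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise RuntimeError("worker is down")
        return f"ok: {request}"

def inject_fault(worker, kill_probability, rng):
    """Chaos-style fault injection: randomly kill the worker."""
    if rng.random() < kill_probability:
        worker.alive = False

def serve(requests, kill_probability=0.2, rng=random):
    """Fault-tolerant loop: a crashed worker is restarted and the
    request retried, so every request is still answered."""
    worker = Worker()
    results, restarts = [], 0
    for req in requests:
        inject_fault(worker, kill_probability, rng)
        try:
            results.append(worker.handle(req))
        except RuntimeError:
            worker = Worker()   # supervisor restarts the crashed worker
            restarts += 1
            results.append(worker.handle(req))  # retry succeeds
    return results, restarts
```

Running this with fault injection enabled, every request still gets a response; the faults are visible only in the restart count, which is exactly the property chaos testing is meant to verify continuously.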

Hardware faults, such as hard disk crashes, RAM failures, or power outages, are common in large datacenters. For example, hard disks have a mean time to failure (MTTF) of 10 to 50 years, so a storage cluster with 10,000 disks should expect, on average, roughly one disk to fail per day. Historically, hardware redundancy (e.g., RAID, dual power supplies) was sufficient to mitigate such faults. However, as data volumes and computing demands have grown, applications use vastly more machines, and the rate of hardware faults rises proportionally. Cloud platforms such as AWS also prioritize flexibility and elasticity over single-machine reliability, so virtual machines can become unavailable without warning. This shift drives the adoption of software fault-tolerance techniques, which can tolerate the loss of entire machines and enable operational advantages such as rolling upgrades, patching one node at a time without system downtime.
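The one-failure-per-day figure follows from simple arithmetic, assuming failures are independent and occur at a constant rate (a simplification; real disk failure rates vary with age):

```python
def expected_failures_per_day(num_disks, mttf_years):
    """Expected daily failures: fleet size divided by per-disk MTTF in days."""
    return num_disks / (mttf_years * 365)

# With 10,000 disks, the book's MTTF range of 10-50 years gives:
low  = expected_failures_per_day(10_000, 50)  # ~0.55 failures/day
high = expected_failures_per_day(10_000, 10)  # ~2.7 failures/day
```

So "about one failure per day" sits comfortably inside the 10-to-50-year MTTF range, which is why large clusters must treat disk death as routine rather than exceptional.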

Software errors are systematic faults that are harder to anticipate and are often correlated across nodes, so they can cause widespread failures. Examples include bugs triggered by specific inputs (e.g., the leap second on June 30, 2012, which crashed many applications running on the Linux kernel), resource exhaustion by runaway processes, and cascading failures. Mitigations include careful design, thorough testing, process isolation, and continuous monitoring for discrepancies.

Human errors, notably configuration mistakes by operators, are a leading cause of outages, surpassing hardware faults (which account for only 10-25% of outages). Mitigating human error involves designing user-friendly abstractions and APIs, providing safe non-production sandbox environments, testing extensively with both automated and manual tests, enabling quick recovery (e.g., rolling back configuration changes, rolling out new code gradually), and setting up detailed telemetry to monitor performance metrics and error rates.
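One of the recovery mechanisms mentioned above, gradual code rollouts, is commonly implemented as a deterministic hash-based cohort check, sketched below. The function names and bucketing scheme are illustrative assumptions, not a specific system's API.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percentage: float) -> bool:
    """Deterministically place a user into one of 100 stable buckets and
    admit only the first `percentage` of them, so new code initially
    reaches a small slice of users and any bug affects few of them."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return bucket < percentage

# Ramping 1% -> 10% -> 100% only ever adds users: a user admitted at 1%
# stays admitted at every later stage, which keeps the rollout consistent.
```

Because the bucket depends only on the user and feature name, repeated checks give the same answer, and rolling back is as simple as setting the percentage back to zero.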
