From "Designing Data-Intensive Applications" by Martin Kleppmann
Evolution Beyond MapReduce
Key Insight
While MapReduce was a significant step, its design decisions, particularly materializing all intermediate state to HDFS, introduced inefficiencies for certain workloads. Every MapReduce job is independent: the output of one must be fully written to the distributed filesystem before the next can begin. This sequential execution, coupled with replication overhead and potential straggler tasks, slows down complex workflows, especially chains of 50-100 jobs, as are common in recommendation systems. Mappers also often perform redundant work, simply re-reading data that a reducer just wrote.
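A toy model can make the cost concrete. The sketch below (illustrative Python, not a real MapReduce API) chains two jobs: each one fully materializes its output, standing in for a complete write to HDFS, before the next job's mapper re-reads that same data.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one MapReduce job; the returned list stands in for a
    fully materialized (HDFS-written) output."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # The complete output exists before the next job can start reading it.
    return [reducer(key, values) for key, values in sorted(groups.items())]

words = ["spark", "flink", "spark", "tez", "spark"]

# Job 1: word count.
counts = run_job(
    words,
    mapper=lambda w: [(w, 1)],
    reducer=lambda k, vs: (k, sum(vs)),
)

# Job 2 can only begin once Job 1's output is complete; its mapper
# re-reads data a reducer just wrote -- the redundant work noted above.
top = run_job(
    counts,
    mapper=lambda kv: [("all", kv)],
    reducer=lambda _, vs: max(vs, key=lambda kv: kv[1]),
)
print(top)  # [('spark', 3)]
```

In a real 50-job workflow, that handoff through the filesystem (with replication) happens 49 times.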
To address these limitations, 'dataflow engines' like Spark, Tez, and Flink emerged, treating an entire workflow as a single job rather than a series of independent subjobs. These systems explicitly model the flow of data through multiple processing stages (operators), akin to Unix pipes: data streams incrementally between operators through in-memory buffers or local disks, minimizing costly HDFS writes. They chain operators flexibly, sort only when necessary, optimize data locality, and reuse JVM processes to reduce startup overhead. Computations written against MapReduce abstractions like Pig or Hive can often run significantly faster on these newer engines with minimal configuration changes.
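The Unix-pipe analogy can be sketched with Python generators (a toy illustration, not any engine's actual API): each operator pulls records incrementally from the previous one, so nothing is materialized to a distributed filesystem between stages.

```python
# Each stage is a generator: records flow one at a time, like a Unix pipe.
def read_input():
    for line in ["a b", "b c", "a a"]:  # stand-in for a real input source
        yield line

def tokenize(lines):
    for line in lines:
        yield from line.split()

def filter_op(tokens, keep):
    for tok in tokens:
        if tok == keep:
            yield tok

# The whole chain is one "job"; no intermediate stage is fully
# materialized before the next begins consuming it.
pipeline = filter_op(tokenize(read_input()), keep="a")
matches = sum(1 for _ in pipeline)
print(matches)  # 3
```

A real dataflow engine adds parallel partitions, shuffles, and buffering, but the incremental hand-off between operators is the core idea.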
Dataflow engines also employ different fault tolerance strategies. Instead of relying on HDFS for intermediate-state durability, they recompute lost data from available prior stages or from the original inputs (often on HDFS). Spark uses Resilient Distributed Datasets (RDDs) to track data lineage for recomputation, while Flink checkpoints operator state. Deterministic operators are crucial for reliable recovery, since non-deterministic operations (e.g., arbitrary hash table iteration order, reading the system clock, random numbers without fixed seeds) can make a recomputed result disagree with the original. MapReduce's choice to durably materialize intermediate state was made for Google's mixed-use datacenters, where tasks were frequently terminated by preemption; it contrasts with MPP databases, which typically abort an entire query on failure, and it makes less sense in environments where preemption is rare.
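Lineage-based recovery, and why it demands determinism, can be sketched as follows. These are toy classes in the spirit of Spark's RDDs, not Spark's API: each dataset records its parent and transformation, so a lost result is recovered by replaying the lineage from the input.

```python
import random

class Dataset:
    """Toy lineage-tracking dataset (illustrative, not Spark's RDD API)."""
    def __init__(self, parent=None, transform=None, source=None):
        self.parent, self.transform, self.source = parent, transform, source

    def map(self, fn):
        # Record the transformation lazily instead of producing output now.
        return Dataset(parent=self, transform=fn)

    def compute(self):
        # Walk the lineage back to the original input and replay it --
        # this is also how a lost partition would be recomputed.
        if self.parent is None:
            return list(self.source)
        return [self.transform(x) for x in self.parent.compute()]

base = Dataset(source=[1, 2, 3])

# Deterministic operator: replaying the lineage reproduces the same data.
scaled = base.map(lambda x: x * 10)
assert scaled.compute() == scaled.compute() == [10, 20, 30]

# Non-deterministic operator: recomputation after a failure would not,
# in general, reproduce the lost output.
noisy = base.map(lambda x: x + random.random())

# Fix from the text: derive the seed deterministically (here, from the
# record itself) so recomputation yields identical values.
seeded = base.map(lambda x: x + random.Random(x).random())
assert seeded.compute() == seeded.compute()
```

Flink's checkpointing takes the complementary approach: rather than replaying lineage from the input, it periodically snapshots operator state and restarts from the last snapshot.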