From "Designing Data-Intensive Applications" by Martin Kleppmann
Evolution Beyond MapReduce
Key Insight
While MapReduce was a significant step, its design decisions, particularly materializing all intermediate state to HDFS, introduced inefficiencies for certain workloads. Every MapReduce job is independent: the output of one must be fully written to the distributed filesystem before the next can begin. This sequential execution, coupled with replication overhead and potential straggler tasks, slows down complex workflows, especially chains of 50-100 jobs, as are common in recommendation systems. Mappers also often perform redundant work, simply re-reading data that a reducer just wrote.
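A toy model can make the cost concrete. The sketch below (illustrative Python, not a real MapReduce API) chains two jobs: each one fully materializes its output, standing in for a complete write to HDFS, before the next job's mapper re-reads that same data.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one MapReduce job; the returned list stands in for a
    fully materialized (HDFS-written) output."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # The complete output exists before the next job can start reading it.
    return [reducer(key, values) for key, values in sorted(groups.items())]

words = ["spark", "flink", "spark", "tez", "spark"]

# Job 1: word count.
counts = run_job(
    words,
    mapper=lambda w: [(w, 1)],
    reducer=lambda k, vs: (k, sum(vs)),
)

# Job 2 can only begin once Job 1's output is complete; its mapper
# re-reads data a reducer just wrote -- the redundant work noted above.
top = run_job(
    counts,
    mapper=lambda kv: [("all", kv)],
    reducer=lambda _, vs: max(vs, key=lambda kv: kv[1]),
)
print(top)  # [('spark', 3)]
```

In a real 50-job workflow, that handoff through the filesystem (with replication) happens 49 times.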
To address these limitations, 'dataflow engines' like Spark, Tez, and Flink emerged, treating an entire workflow as a single job rather than a series of independent subjobs. These systems explicitly model the flow of data through multiple processing stages (operators), akin to Unix pipes: data streams incrementally between operators through in-memory buffers or local disks, minimizing costly HDFS writes. They chain operators flexibly, sort only when necessary, optimize data locality, and reuse JVM processes to reduce startup overhead. Computations written against MapReduce abstractions like Pig or Hive can often run significantly faster on these newer engines with minimal configuration changes.
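The Unix-pipe analogy can be sketched with Python generators (a toy illustration, not any engine's actual API): each operator pulls records incrementally from the previous one, so nothing is materialized to a distributed filesystem between stages.

```python
# Each stage is a generator: records flow one at a time, like a Unix pipe.
def read_input():
    for line in ["a b", "b c", "a a"]:  # stand-in for a real input source
        yield line

def tokenize(lines):
    for line in lines:
        yield from line.split()

def filter_op(tokens, keep):
    for tok in tokens:
        if tok == keep:
            yield tok

# The whole chain is one "job"; no intermediate stage is fully
# materialized before the next begins consuming it.
pipeline = filter_op(tokenize(read_input()), keep="a")
matches = sum(1 for _ in pipeline)
print(matches)  # 3
```

A real dataflow engine adds parallel partitions, shuffles, and buffering, but the incremental hand-off between operators is the core idea.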
Dataflow engines also employ different fault tolerance strategies. Instead of relying on HDFS for intermediate-state durability, they recompute lost data from available prior stages or from the original inputs (often on HDFS). Spark uses Resilient Distributed Datasets (RDDs) to track data lineage for recomputation, while Flink checkpoints operator state. Deterministic operators are crucial for reliable recovery, since non-deterministic operations (e.g., arbitrary hash table iteration order, reading the system clock, random numbers without fixed seeds) can make a recomputed result disagree with the original. MapReduce's choice to durably materialize intermediate state was made for Google's mixed-use datacenters, where tasks were frequently terminated by preemption; it contrasts with MPP databases, which typically abort an entire query on failure, and it makes less sense in environments where preemption is rare.
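Lineage-based recovery, and why it demands determinism, can be sketched as follows. These are toy classes in the spirit of Spark's RDDs, not Spark's API: each dataset records its parent and transformation, so a lost result is recovered by replaying the lineage from the input.

```python
import random

class Dataset:
    """Toy lineage-tracking dataset (illustrative, not Spark's RDD API)."""
    def __init__(self, parent=None, transform=None, source=None):
        self.parent, self.transform, self.source = parent, transform, source

    def map(self, fn):
        # Record the transformation lazily instead of producing output now.
        return Dataset(parent=self, transform=fn)

    def compute(self):
        # Walk the lineage back to the original input and replay it --
        # this is also how a lost partition would be recomputed.
        if self.parent is None:
            return list(self.source)
        return [self.transform(x) for x in self.parent.compute()]

base = Dataset(source=[1, 2, 3])

# Deterministic operator: replaying the lineage reproduces the same data.
scaled = base.map(lambda x: x * 10)
assert scaled.compute() == scaled.compute() == [10, 20, 30]

# Non-deterministic operator: recomputation after a failure would not,
# in general, reproduce the lost output.
noisy = base.map(lambda x: x + random.random())

# Fix from the text: derive the seed deterministically (here, from the
# record itself) so recomputation yields identical values.
seeded = base.map(lambda x: x + random.Random(x).random())
assert seeded.compute() == seeded.compute()
```

Flink's checkpointing takes the complementary approach: rather than replaying lineage from the input, it periodically snapshots operator state and restarts from the last snapshot.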