
From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 10: Batch Processing
Key Insight 4 from this chapter

MapReduce Framework and Distributed Storage

Key Insight

MapReduce is a programming framework for distributed batch processing across potentially thousands of machines, conceptually similar to Unix tools scaled out over a cluster. A single MapReduce job behaves like a Unix process: it takes inputs, processes them without modifying the original data (inputs are immutable), and produces new outputs, typically written sequentially as files to a distributed filesystem. Because inputs are immutable and outputs are written once, a failed task can simply be re-run without risk of corrupting existing data, which simplifies fault recovery.

HDFS (Hadoop Distributed File System) is Hadoop's primary distributed filesystem, an open-source re-implementation of the Google File System (GFS). HDFS is based on a shared-nothing architecture, using commodity hardware and a conventional datacenter network, unlike shared-disk systems. A daemon process on each machine exposes a network service that lets other nodes access the file blocks stored on that machine, while a central NameNode tracks which blocks are stored where. Each file block is replicated across multiple machines (e.g., three copies) or protected by erasure coding to tolerate machine and disk failures, similar to RAID but distributed over a network.
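The split-and-replicate idea can be sketched in a few lines of Python. This is a hypothetical illustration, not HDFS's actual API: the block size, replication factor, node names, and round-robin placement are all simplifying assumptions (real HDFS uses 128 MB blocks by default and rack-aware placement).

```python
import itertools

# Illustrative constants, not HDFS defaults (HDFS uses 128 MB blocks, 3 replicas)
BLOCK_SIZE = 4          # bytes per block, kept tiny for demonstration
REPLICATION = 3         # copies of each block
DATANODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]

def split_into_blocks(data: bytes) -> list:
    """Split file contents into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place_blocks(blocks: list) -> dict:
    """NameNode-style metadata: block index -> DataNodes holding a copy."""
    placement = {}
    nodes = itertools.cycle(DATANODES)
    for idx, _ in enumerate(blocks):
        # round-robin placement for illustration; real HDFS is rack-aware
        placement[idx] = [next(nodes) for _ in range(REPLICATION)]
    return placement

blocks = split_into_blocks(b"hello distributed world")
metadata = place_blocks(blocks)
```

If any one node fails, every block it held still has copies on other nodes, which is the property the replication factor buys.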

HDFS has achieved significant scalability, with deployments on tens of thousands of machines and hundreds of petabytes of storage, making data storage and access cost-effective on commodity hardware. A MapReduce job's execution involves four main steps: breaking input files into records, calling a mapper function to extract key-value pairs from each record, sorting all key-value pairs by key, and calling a reducer function to iterate over the groups of values that share a key. The map and reduce steps are custom code, while input parsing and sorting are handled by the framework. Mappers process records independently, and reducers process sorted groups of values, producing output records.
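The four steps above can be sketched as a single-process word-count job in Python. This is a minimal in-memory illustration under stated assumptions: the `mapper`/`reducer` signatures and the `run_job` driver are invented for this sketch, whereas real Hadoop jobs implement framework interfaces and run distributed across many machines.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record: str):
    """Step 2: emit a key-value pair for every word in an input record."""
    for word in record.split():
        yield (word, 1)

def reducer(key: str, values):
    """Step 4: combine all values that share the same key."""
    return (key, sum(values))

def run_job(records):
    # Step 1: the framework reads input files and breaks them into
    # records (here, `records` is simply a list of lines).
    pairs = [kv for record in records for kv in mapper(record)]   # Step 2
    pairs.sort(key=itemgetter(0))                                 # Step 3: sort by key
    return [reducer(key, (v for _, v in group))                   # Step 4
            for key, group in groupby(pairs, key=itemgetter(0))]

output = run_job(["the quick brown fox", "the lazy dog"])
# output is a sorted list of (word, count) pairs; "the" maps to 2
```

Note that the sort in step 3 is what brings all values for the same key together, so the reducer can consume them as one contiguous group.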
