From "Designing Data-Intensive Applications"
🎧 Listen to Summary
Free 10-min PreviewMapReduce Framework and Distributed Storage
Key Insight
MapReduce is a programming framework enabling distributed batch processing across potentially thousands of machines, conceptually similar to a distributed Unix system. A single MapReduce job operates like a Unix process: it takes inputs, processes them without modifying the original data (immutable inputs), and produces new outputs, typically writing files sequentially to a distributed filesystem. This 'write-once' characteristic ensures consistency and simplifies recovery.
HDFS (Hadoop Distributed File System) is Hadoop's primary distributed filesystem, an open-source re-implementation of Google File System (GFS). HDFS is based on a 'shared-nothing' architecture, utilizing commodity hardware and a conventional datacenter network, unlike shared-disk systems. It consists of daemons on each machine exposing network services, with a central NameNode tracking file block locations. File blocks are replicated across multiple machines (e.g., three copies) or use erasure coding for fault tolerance against machine and disk failures, similar to RAID but distributed over a network.
HDFS has achieved significant scalability, with deployments on tens of thousands of machines and hundreds of petabytes of storage, making data storage and access cost-effective on commodity hardware. A MapReduce job's execution involves four main steps: reading input files into records, calling a 'mapper' function to extract key-value pairs from each record, sorting all key-value pairs by key, and then calling a 'reducer' function to process groups of values for the same key. The map and reduce steps are custom code, while input parsing and sorting are handled by the framework. Mappers process records independently, and reducers process sorted groups of values, producing output records.
📚 Continue Your Learning Journey — No Payment Required
Access the complete Designing Data-Intensive Applications summary with audio narration, key takeaways, and actionable insights from Martin Kleppmann.