From "Designing Data-Intensive Applications" by Martin Kleppmann
Introduction to Batch Processing
Key Insight
Data processing systems are commonly designed around requests and responses, where a user or service initiates a request and expects a timely answer. This 'online' style is prevalent in databases, caches, search indexes, and web servers, prioritizing low response times and high availability, as exemplified by web browsers requesting pages or services calling remote APIs.
Batch processing systems offer an alternative: they operate 'offline', taking large input datasets, running jobs for minutes to days, and producing output. Unlike online systems, there is no user waiting for an immediate result; jobs are often scheduled to run periodically. The primary performance metric for batch systems is throughput: the time it takes to process an input dataset of a given size. This approach is fundamental, with historical roots in the mechanical tabulating machines used for the 1890 US Census and the electromechanical IBM card-sorting machines of the 1940s and 1950s.
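The contrast with online systems can be sketched as a toy batch job: a hypothetical program (all names below are illustrative, not from the book) that reads an entire input dataset, processes every record with no user waiting, and reports throughput rather than per-request latency.

```python
import time

def batch_job(records):
    """A toy 'offline' batch job: it consumes the whole input dataset
    and produces a complete output, with no user awaiting a response."""
    return [r.upper() for r in records]

# Hypothetical input dataset standing in for a large input file.
dataset = ["alpha", "beta", "gamma"] * 100_000

start = time.perf_counter()
output = batch_job(dataset)
elapsed = time.perf_counter() - start

# The measure that matters here is throughput: the time taken to crunch
# through a dataset of this size, not the latency of any single request.
print(f"processed {len(output)} records in {elapsed:.3f}s")
```

In a real deployment the dataset would live in files or a distributed filesystem and the job would be triggered by a scheduler, but the performance question is the same: how long does one pass over the whole input take?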
Stream processing systems occupy a 'near-real-time' middle ground: like batch systems they consume inputs and produce outputs, but they operate on events shortly after they occur, yielding lower latency. MapReduce, a batch processing algorithm published in 2004, was initially hailed as a key to Google's scalability. Although its direct importance is now waning, MapReduce remains valuable for understanding the utility and mechanics of batch processing, and it has been implemented in open-source systems such as Hadoop, CouchDB, and MongoDB.
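The mechanics of the MapReduce model can be illustrated with a minimal in-memory word-count sketch (not a real Hadoop job; the function names are illustrative): a map function emits key-value pairs, the framework groups pairs by key (the "shuffle"), and a reduce function combines the values for each key.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Mapper: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reducer: combine all values for one key into a single result.
    return (key, sum(values))

def mapreduce(lines):
    pairs = chain.from_iterable(map_fn(line) for line in lines)
    return dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())

counts = mapreduce(["the quick fox", "the lazy dog", "the fox"])
print(counts)
```

A real MapReduce framework runs the map and reduce phases in parallel across many machines and performs the shuffle over the network, but the dataflow is the same as in this single-process sketch.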