From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 10: Batch Processing
Key Insight 6 from this chapter

Outputs and Benefits of Batch Workflows

Key Insight

The output of batch processing workflows extends beyond simple reports; it often involves constructing data structures for other systems. A prime example is building search indexes, such as those for Lucene/Solr. A workflow of 5 to 10 MapReduce jobs can partition the document collection, with each reducer building the index for its partition and writing it to the distributed filesystem. These index files are immutable; if documents change, the entire indexing workflow can be rerun periodically, replacing the old indexes with new ones. While computationally intensive, this approach makes the indexing process easy to reason about: documents in, indexes out.
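The partition-then-index pattern can be sketched in miniature. This is an illustrative toy, not Hadoop or Lucene code: `partition`, `build_index_segment`, and `index_workflow` are hypothetical names, and the "index segment" is just an in-memory inverted index (term → sorted document IDs) standing in for the immutable index files a real reducer would write to the distributed filesystem.

```python
from collections import defaultdict

def partition(doc_id, num_reducers):
    # Mapper side: route each document to a reducer by hashing its ID.
    return hash(doc_id) % num_reducers

def build_index_segment(docs):
    # Reducer side: build an inverted index for one partition,
    # mapping each term to the sorted list of documents containing it.
    segment = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            segment[term].add(doc_id)
    return {term: sorted(ids) for term, ids in segment.items()}

def index_workflow(docs, num_reducers=4):
    # "Map" phase: group documents by partition.
    partitions = [[] for _ in range(num_reducers)]
    for doc_id, text in docs:
        partitions[partition(doc_id, num_reducers)].append((doc_id, text))
    # "Reduce" phase: each partition becomes one immutable index segment.
    return [build_index_segment(p) for p in partitions]
```

Because a segment is rebuilt from scratch on every run, rerunning the workflow after documents change simply produces a fresh set of segments to swap in for the old ones.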

Batch processing is also extensively used to build machine learning systems, including classifiers and recommendation engines. The output is typically a database (e.g., a key-value store) that web applications can query by user ID for friend suggestions or by product ID for related items. Instead of writing directly to an external database (which is slow, can overwhelm the database, and complicates fault tolerance), a better practice is to build an entirely new database as files within the batch job's output directory. These immutable data files are then loaded in bulk into read-only query servers.

This philosophy aligns with the Unix approach of immutable inputs and complete replacement of outputs, avoiding side effects on external systems. This design provides 'human fault tolerance': if buggy code corrupts output, rolling back to an earlier code version and rerunning the job fixes the output without irreversible damage. The MapReduce framework's automatic retry of failed tasks is safe because inputs are immutable and partial outputs are discarded. This separation of logic from wiring, combined with structured file formats like Avro or Parquet, fosters experimentation, faster feature development, and robust data management within large organizations.
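One common way to realize this complete-replacement discipline is to have each run write a fresh versioned directory and then flip a "current" pointer atomically. The sketch below assumes a POSIX filesystem and uses a symlink swap; `publish_output` is a hypothetical name, not an API from any particular framework.

```python
import os
import tempfile

def publish_output(build_fn, base_dir):
    # Each run writes its output into a brand-new versioned directory;
    # earlier runs are never modified.
    version_dir = tempfile.mkdtemp(prefix="run-", dir=base_dir)
    build_fn(version_dir)  # the batch job writes its output files here

    # Atomically repoint the "current" symlink at the new version.
    # Rolling back a buggy run just means repointing it at an older,
    # still-intact directory -- nothing irreversible has happened.
    tmp_link = os.path.join(base_dir, ".current.tmp")
    current = os.path.join(base_dir, "current")
    os.symlink(version_dir, tmp_link)
    os.replace(tmp_link, current)  # rename is atomic on POSIX
    return version_dir
```

If buggy code corrupts a run's output, the fix the text describes is exactly this: roll back the code, run `publish_output` again, and the symlink flips to the corrected output while every previous version remains available.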
