From "Designing Data-Intensive Applications"
Outputs and Benefits of Batch Workflows
Key Insight
The output of a batch processing workflow is often not a report but a data structure for another system to query. Search indexes, such as those used by Lucene/Solr, are a prime example; Google's original use of MapReduce was to build its search index with a workflow of 5 to 10 MapReduce jobs. To index a fixed set of documents, the mappers partition the documents, each reducer builds the index for its partition, and the resulting index files are written to the distributed filesystem. These index files are immutable: if documents change, the entire indexing workflow is rerun periodically and the old index files are replaced wholesale with new ones. This is computationally expensive when only a few documents have changed, but it makes the indexing process easy to reason about: documents in, indexes out.
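The partition-then-index flow described above can be illustrated with a small single-process sketch. It is not Lucene/Solr or Hadoop code: the function names (`map_phase`, `reduce_phase`, `build_indexes`, `search`), the JSON file format, and the `part-NNNNN.json` naming are all assumptions made for illustration, and a local directory stands in for the distributed filesystem.

```python
import json
import zlib
from collections import defaultdict
from pathlib import Path


def partition_of(doc_id, num_partitions):
    # Stable hash, so a rerun assigns each document to the same partition.
    return zlib.crc32(doc_id.encode("utf-8")) % num_partitions


def map_phase(docs, num_partitions):
    """Mapper step: route every (doc_id, text) pair to a partition."""
    partitions = defaultdict(list)
    for doc_id, text in docs.items():
        partitions[partition_of(doc_id, num_partitions)].append((doc_id, text))
    return partitions


def reduce_phase(records):
    """Reducer step: build an inverted index (term -> sorted doc IDs) for one partition."""
    index = defaultdict(set)
    for doc_id, text in records:
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def build_indexes(docs, out_dir, num_partitions=2):
    """Run the whole workflow and write one index file per partition."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    partitions = map_phase(docs, num_partitions)
    written = []
    for p in range(num_partitions):
        path = out_dir / f"part-{p:05d}.json"
        path.write_text(json.dumps(reduce_phase(partitions.get(p, []))))
        written.append(path)
    return written


def search(out_dir, term):
    """Query every partition's index file and merge the matching doc IDs."""
    hits = []
    for path in sorted(Path(out_dir).glob("part-*.json")):
        hits.extend(json.loads(path.read_text()).get(term.lower(), []))
    return sorted(hits)
```

When documents change, this sketch would be rerun against the full document set into a fresh output directory rather than patching the existing files, which mirrors the "documents in, indexes out" model.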
Batch processing is also used extensively to build machine learning systems such as classifiers and recommendation engines. The output is typically a database (for example, a key-value store) that a web application can query by user ID for friend suggestions or by product ID for related items. Writing to an external database record by record from inside the batch job is a poor fit: a network request per record is far slower than normal batch throughput, many tasks writing in parallel can overwhelm the database, and a partially completed job leaves side effects that other systems can see, which complicates fault tolerance. A better practice is to build an entirely new database as files within the batch job's output directory. These immutable data files are then loaded in bulk into read-only query servers.
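As a rough sketch of that build-then-bulk-load pattern, the following uses a local directory, JSON Lines files, and a `CURRENT` pointer file. These names, along with `build_store`, `publish`, and `ReadOnlyStore`, are hypothetical and do not correspond to any particular key-value store's API; the point is only that each version is written in full before it becomes visible to readers.

```python
import json
import os
from pathlib import Path


def build_store(records, root, version):
    """Batch job output: write the full dataset as a new immutable version."""
    root = Path(root)
    tmp_dir = root / f".tmp-{version}"
    tmp_dir.mkdir(parents=True)
    with open(tmp_dir / "data.jsonl", "w", encoding="utf-8") as f:
        for key in sorted(records):
            f.write(json.dumps({"key": key, "value": records[key]}) + "\n")
    final_dir = root / version
    os.rename(tmp_dir, final_dir)  # The version appears only once it is complete.
    return final_dir


def publish(root, version):
    """Atomically repoint CURRENT at a finished version (also used for rollback)."""
    root = Path(root)
    tmp = root / "CURRENT.tmp"
    tmp.write_text(version)
    os.replace(tmp, root / "CURRENT")


class ReadOnlyStore:
    """Query server stand-in: bulk-loads one version and serves reads from memory."""

    def __init__(self, root):
        self.root = Path(root)
        self.reload()

    def reload(self):
        version = (self.root / "CURRENT").read_text().strip()
        data = {}
        with open(self.root / version / "data.jsonl", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                data[record["key"]] = record["value"]
        # Swap in the new data only after the whole file has loaded.
        self.version, self._data = version, data

    def get(self, key, default=None):
        return self._data.get(key, default)
```

Because old versions stay on disk untouched, calling `publish` with an earlier version name and reloading is enough to roll back a bad build in this sketch.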
This philosophy follows the Unix approach: inputs are treated as immutable, outputs are replaced completely, and jobs avoid side effects on external systems. The design provides "human fault tolerance": if buggy code produces wrong or corrupted output, you can roll back to an earlier version of the code and rerun the job, or simply keep the old output, and nothing is damaged irreversibly. The MapReduce framework's automatic retry of failed tasks is safe for the same reason: inputs are immutable, and the output of a failed task is discarded. Batch jobs also separate logic from wiring (the configuration of input and output directories), so the same code can be reused against different data. Combined with structured file formats such as Avro and Parquet, this makes experimentation easier, speeds up feature development, and keeps data manageable within large organizations.
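The retry behavior described above can be approximated in a few lines. This is a simplified sketch, not how Hadoop implements task commit: `run_task`, the `_attempt-N.tmp` naming, and the `part-00000` output name are assumptions for illustration. Each attempt writes to its own temporary file, a failed attempt's partial output is deleted, and only a successful attempt is renamed into place.

```python
import os
from pathlib import Path


def run_task(task, input_path, output_dir, max_attempts=3):
    """Run task(src, dst) with retries; commit output only if an attempt succeeds."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    final_path = output_dir / "part-00000"
    for attempt in range(1, max_attempts + 1):
        attempt_path = output_dir / f"_attempt-{attempt}.tmp"
        try:
            # The input is opened read-only, so a retry sees exactly the same data.
            with open(input_path, encoding="utf-8") as src, \
                    open(attempt_path, "w", encoding="utf-8") as dst:
                task(src, dst)
            os.replace(attempt_path, final_path)  # Atomic commit of complete output.
            return final_path
        except Exception:
            # Discard whatever the failed attempt managed to write.
            attempt_path.unlink(missing_ok=True)
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

Since the task never touches anything except its own attempt file, rerunning it any number of times yields the same final output, which is what makes both automatic retries and manual reruns after a bug fix safe.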
Source: Designing Data-Intensive Applications by Martin Kleppmann.