From "Designing Data-Intensive Applications" by Martin Kleppmann
Batch Processing with Unix Tools
Key Insight
Batch processing can be demonstrated effectively with standard Unix tools. For instance, finding the five most popular pages in a web server log involves chaining commands: `cat` to read the log, `awk` to extract the URL (the seventh whitespace-separated field), `sort` to bring identical URLs together, `uniq -c` to count occurrences, a second `sort -r -n` to rank by count, and `head -n 5` to keep the top five. This pipeline can process gigabytes of log files in seconds, and it is easily adapted to other analyses, such as filtering out CSS files or counting client IP addresses instead of URLs.
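A sketch of that pipeline, runnable as-is: the log path and sample lines below are hypothetical stand-ins for a real access log in the combined log format, where the requested URL is the seventh whitespace-separated field.

```shell
# Hypothetical sample access log (combined log format) so the pipeline runs.
cat > /tmp/access.log <<'EOF'
1.1.1.1 - - [01/Jan/2024:00:00:00 +0000] "GET /home HTTP/1.1" 200 123
1.1.1.1 - - [01/Jan/2024:00:00:01 +0000] "GET /about HTTP/1.1" 200 99
1.1.1.1 - - [01/Jan/2024:00:00:02 +0000] "GET /home HTTP/1.1" 200 123
EOF

cat /tmp/access.log |   # read the log file
  awk '{print $7}' |    # extract the URL (7th whitespace-separated field)
  sort |                # bring identical URLs together
  uniq -c |             # collapse each run into "count URL"
  sort -r -n |          # rank by count, highest first
  head -n 5             # keep the top five
```

Note that `uniq -c` only counts *adjacent* duplicates, which is why the first `sort` is required before it.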
Comparing this Unix pipeline to a custom Ruby program for the same task reveals a key difference in execution. The Ruby script typically builds an in-memory hash table mapping each URL to its count, which works well as long as the distinct URLs fit in memory (say, within 1 GB). Its working set depends on the number of distinct URLs, not the total number of log entries: a million log entries for the same URL still occupy only one hash table entry.
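A minimal sketch of such a Ruby script, assuming the same combined log format; the sample data written to a temp file is hypothetical, included only to make the example self-contained.

```ruby
require 'tempfile'

# Hypothetical sample log (combined log format; URL is the 7th field).
log = Tempfile.new('access.log')
log.puts '1.1.1.1 - - [01/Jan/2024:00:00:00 +0000] "GET /home HTTP/1.1" 200 123'
log.puts '1.1.1.1 - - [01/Jan/2024:00:00:01 +0000] "GET /about HTTP/1.1" 200 99'
log.puts '1.1.1.1 - - [01/Jan/2024:00:00:02 +0000] "GET /home HTTP/1.1" 200 123'
log.flush

counts = Hash.new(0)          # hash table: URL -> occurrence count
File.open(log.path) do |file|
  file.each do |line|
    url = line.split[6]       # 7th whitespace-separated field
    counts[url] += 1          # one entry per *distinct* URL, however many hits
  end
end

# Sort (count, url) pairs descending and keep the top five.
top5 = counts.map { |url, count| [count, url] }.sort.reverse.first(5)
top5.each { |count, url| puts "#{count} #{url}" }
```

The hash table is the crux: its size grows with the number of distinct URLs, so the script is fast while that set fits in memory but degrades once it does not.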
The Unix `sort` utility, in contrast, handles datasets larger than available memory by spilling sorted runs to disk and merging them, and it can parallelize sorting across multiple CPU cores. Because merge sort has sequential access patterns that perform well on disks, the simple Unix command chain scales to large datasets without exhausting memory; the bottleneck often becomes the rate at which the input file can be read from disk.
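With GNU coreutils (an assumption; flags vary on other systems), `sort` exposes this behavior directly: `-S` caps the in-memory buffer, `-T` sets the directory for spilled temporary files, and `--parallel` controls the number of sorting threads. The file paths below are hypothetical.

```shell
# Generate 10,000 shuffled numbers, then sort with a deliberately tiny
# buffer so sort must spill sorted runs to disk and merge them.
seq 10000 | shuf > /tmp/unsorted.txt
sort -n -S 1M -T /tmp --parallel=2 /tmp/unsorted.txt > /tmp/sorted.txt
head -n 3 /tmp/sorted.txt
```

On a real multi-gigabyte log, the same invocation works unchanged: only the buffer fills more often and more runs get merged.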