From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 10: Batch Processing
Key Insight 2 from this chapter

Batch Processing with Unix Tools

Key Insight

Simple batch processing can be done with standard Unix tools. For instance, finding the five most popular pages in a web server log takes one chain of commands: `cat` reads the log, `awk` extracts the URL (the seventh whitespace-separated field), `sort` orders the URLs so that repeats are adjacent, `uniq -c` counts each run of identical lines, a second `sort -r -n` ranks the results by count in descending order, and `head -n 5` keeps the top five. The pipeline can process gigabytes of log files in seconds and is easy to adapt to other analyses, such as excluding CSS files from the report or counting client IP addresses instead of URLs.
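The command chain described above can be sketched as a small shell function. The function name `top_pages` and its arguments are illustrative assumptions, not something the book defines, and the sketch assumes an access log whose request URL is the seventh whitespace-separated field.

```shell
# top_pages LOGFILE [N]: print the N most requested URLs with their counts.
# Hypothetical helper; assumes the request URL is field 7 of each log line.
top_pages() {
  cat "$1" |              # read the log
    awk '{print $7}' |    # keep only the URL field
    sort |                # put identical URLs next to each other
    uniq -c |             # collapse each run into "count URL"
    sort -r -n |          # rank by count, highest first
    head -n "${2:-5}"     # keep the top N (default 5)
}
```

Changing `$7` to `$1` in the `awk` step would count client IP addresses instead, assuming the IP address is the first field of each line.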

Comparing this pipeline with a custom Ruby program for the same task reveals a key difference in execution. The Ruby script keeps an in-memory hash table that maps each URL to its count, which works well for small to mid-sized workloads where all distinct URLs fit in memory (say, within 1 GB). Its working set depends on the number of distinct URLs, not on the total number of log entries: a million entries for a single URL still need only one hash table entry.
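To illustrate the hash-table approach without leaving the shell, the sketch below uses an `awk` associative array in place of the Ruby hash. This is a substitution made for illustration, not the Ruby script itself, and the name `top_pages_hash` is hypothetical. As before, it assumes the URL is the seventh field.

```shell
# top_pages_hash LOGFILE [N]: same report, but counts live in an in-memory
# associative array, so memory grows with the number of distinct URLs.
top_pages_hash() {
  awk '{ counts[$7]++ }   # one array entry per distinct URL
       END { for (url in counts) print counts[url], url }' "$1" |
    sort -r -n |          # only the distinct URLs are sorted here
    head -n "${2:-5}"
}
```

Here `sort` only sees one line per distinct URL, so the bulk of the memory use sits in the `counts` array rather than in sorting the full log.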

The Unix pipeline has no such hash table and relies on sorting instead. The `sort` utility in GNU Coreutils handles datasets larger than available memory by spilling sorted chunks to disk and merging them, a mergesort approach whose sequential access patterns perform well on disks. It also parallelizes sorting across multiple CPU cores, so the simple command chain scales to large datasets without running out of memory. The bottleneck is then likely to be the rate at which the input file can be read from disk.
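The spill-and-merge idea can be made visible with a hand-rolled sketch built from `split` and `sort -m`. This is only an illustration of the technique under simplifying assumptions (fixed-size chunks, a hypothetical function name `external_sort`, a non-exhaustive number of chunks); it is not how `sort` is implemented internally, and in practice plain `sort` already does this work for you.

```shell
# external_sort INPUT [CHUNK_LINES]: sort a file by writing sorted chunks
# ("runs") to disk and then merging them. Runs in a subshell so its
# variables do not leak into the caller.
external_sort() (
  input=$1
  chunk_lines=${2:-100000}
  [ -s "$input" ] || exit 0                  # nothing to sort
  tmpdir=$(mktemp -d) || exit 1
  split -l "$chunk_lines" "$input" "$tmpdir/run."   # spill chunks to disk
  for run in "$tmpdir"/run.*; do
    sort -o "$run" "$run"                    # each chunk is sorted on its own
  done
  sort -m "$tmpdir"/run.*                    # merge the already-sorted runs
  rm -rf "$tmpdir"
)
```

The merge step reads each run from start to end, which is the sequential access pattern that makes this approach work well on disks.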
