
From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 11: Stream Processing
Key Insight 3 from this chapter

Log-Based Message Brokers

Key Insight

Log-based message brokers represent a hybrid approach, combining the durable storage of databases with the low-latency notification of messaging systems. A 'log' is an append-only sequence of records on disk, similar to those used in log-structured storage engines and replication's write-ahead logs. In this model, a producer sends a message by appending it to the end of the log, and a consumer reads the log sequentially. When the consumer reaches the end, it waits to be notified as new messages are appended, much like the Unix 'tail -f' command.
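The append-and-read-sequentially idea can be captured in a few lines. This is a minimal, hypothetical in-memory stand-in for an on-disk append-only file (the class and method names are illustrative, not any real broker's API):

```python
class AppendOnlyLog:
    """Toy model of a log-based broker's core structure."""

    def __init__(self):
        self._records = []  # stands in for an append-only file on disk

    def append(self, message: bytes) -> int:
        """Producer side: append a message and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read_from(self, offset: int) -> list:
        """Consumer side: read everything from `offset` onward.
        A real consumer reaching the end would then block until
        notified of new appends, like `tail -f`."""
        return self._records[offset:]

log = AppendOnlyLog()
log.append(b"event-1")
log.append(b"event-2")
print(log.read_from(0))  # [b'event-1', b'event-2']
```

Because messages are never deleted on read, any number of consumers can call `read_from` with their own offsets without interfering with each other.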

To achieve high throughput, these logs are partitioned, allowing different partitions to be hosted on different machines, each operating as an independent log for reading and writing. A 'topic' then groups partitions carrying messages of the same type. Systems like Apache Kafka and Amazon Kinesis Streams exemplify this architecture, writing all messages to disk yet achieving millions of messages per second through partitioning and fault tolerance via replication. This design inherently supports 'fan-out' messaging, as multiple consumers can read the log independently without deleting messages. For 'load balancing,' entire partitions are assigned to consumer group nodes, with each node consuming all messages in its assigned partitions sequentially.

Consumers track their progress by periodically recording 'offsets,' which indicate the last processed message, significantly reducing bookkeeping overhead compared to individual message acknowledgments. This mechanism is akin to log sequence numbers in database replication, enabling seamless recovery: if a consumer fails, another node can resume processing from the last recorded offset. While logs are append-only, disk space is managed by dividing logs into segments and periodically deleting or archiving old ones, effectively creating a large, disk-backed circular buffer. For example, a 6 TB hard drive with 150 MB/s sequential write throughput can buffer about 11 hours of messages. This approach ensures consistent throughput, allowing days or weeks of message retention even for slow consumers, without impacting other active consumers.
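The offset mechanism described above means recovery needs only a single integer per partition, not per-message acknowledgments. A hypothetical sketch (the checkpoint dict stands in for durable offset storage in a real system):

```python
log = ["m0", "m1", "m2", "m3", "m4"]  # one partition's messages
checkpoint = {"offset": 0}            # durably stored in a real broker

def process(msg):
    print("processing", msg)

def consume(upto: int):
    """Process messages, then record one integer: the new offset."""
    for i in range(checkpoint["offset"], upto):
        process(log[i])
    checkpoint["offset"] = upto  # replaces per-message acknowledgments

consume(3)  # processes m0..m2, checkpoint advances to 3
# Suppose this consumer node now crashes. A replacement node reads the
# last recorded offset and resumes exactly where processing stopped:
consume(5)  # processes m3..m4 only, not the whole log again
```

The same arithmetic as in the text applies to retention: 6 TB written at 150 MB/s fills in 6,000,000 MB / 150 MB/s = 40,000 s, roughly 11 hours, and slower write rates or larger segments stretch that window to days or weeks.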
