
From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 11: Stream Processing
Key Insight 7 from this chapter

Stream Joins and Fault Tolerance

Key Insight

Joining datasets is as critical in stream processing as in batch processing, but the unbounded nature of streams introduces complexities. 'Stream-stream joins' combine events from two activity streams that share a key, within a time window (e.g., a search event and a click event from the same session within an hour); the stream processor must buffer recent events from both streams as state. 'Stream-table joins' (often called stream enrichment) augment activity events with profile information by joining them against a database changelog, with the stream processor maintaining a local, continuously updated copy of the database. 'Table-table joins' join two database changelogs to maintain a materialized view, such as updating a Twitter timeline as new tweets arrive and follow relationships change.
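The stream-stream join described above can be sketched as follows. This is a minimal illustration, not the book's implementation: it buffers recent search and click events per session in in-memory dictionaries (a real processor would keep this state fault-tolerantly) and emits a joined pair whenever events from both streams fall within a one-hour window. The class and method names are invented for this example.

```python
from collections import defaultdict

WINDOW_SECS = 3600  # join search and click events within one hour


class StreamStreamJoin:
    """Windowed stream-stream join: buffer recent events from both
    streams, keyed by session_id, and emit a (search, click) pair when
    events from the same session fall within the time window."""

    def __init__(self):
        self.searches = defaultdict(list)  # session_id -> [(ts, event)]
        self.clicks = defaultdict(list)

    def _expire(self, buf, now):
        # Drop buffered events older than the window to bound state size.
        for key in list(buf):
            buf[key] = [(ts, e) for ts, e in buf[key] if now - ts <= WINDOW_SECS]
            if not buf[key]:
                del buf[key]

    def on_search(self, session_id, ts, event):
        self._expire(self.clicks, ts)
        self.searches[session_id].append((ts, event))
        return [(event, c) for cts, c in self.clicks.get(session_id, [])
                if abs(ts - cts) <= WINDOW_SECS]

    def on_click(self, session_id, ts, event):
        self._expire(self.searches, ts)
        self.clicks[session_id].append((ts, event))
        return [(s, event) for sts, s in self.searches.get(session_id, [])
                if abs(ts - sts) <= WINDOW_SECS]
```

Note that the join must buffer events from *both* streams, because either the search or the matching click may arrive first.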

The time-dependence of joins is a significant challenge, as the order of events across different streams is typically not guaranteed, leading to potential nondeterminism. For instance, if a user profile update happens concurrently with user activity events, determining which profile version to join with becomes ambiguous, a problem akin to 'slowly changing dimensions' in data warehouses. Achieving deterministic joins often requires unique identifiers for different versions of joined records, which can prevent log compaction. Stream processors must maintain internal state for these joins, which also needs to be fault-tolerant.
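One way to make such a join deterministic, as the paragraph above suggests, is to give each version of the joined record a unique identifier and have every activity event reference the version that was current when it occurred. The sketch below (hypothetical names, assuming events carry a `profile_version` field) keeps every profile version in the local table, which is exactly why log compaction can no longer discard old versions.

```python
class VersionedEnricher:
    """Deterministic stream-table join (stream enrichment): the changelog
    stores every profile version under a unique (user_id, version) key,
    and each activity event records which version was current at event
    time, so replaying the join always yields the same result."""

    def __init__(self):
        self.profiles = {}  # (user_id, version) -> profile dict

    def on_profile_change(self, user_id, version, profile):
        # Keep all versions; compacting away old ones would break replays.
        self.profiles[(user_id, version)] = profile

    def on_activity(self, event):
        # Join against the version referenced by the event, not the latest.
        profile = self.profiles.get((event["user_id"], event["profile_version"]))
        return {**event, "profile": profile}
```

Joining against `profile_version` rather than the latest profile is what removes the nondeterminism: a replay after a crash sees the same version the original run did.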

Fault tolerance is more intricate in stream processing than in batch jobs, where a failed task can simply be restarted and its partial output discarded. Because a stream is unbounded, its output cannot be discarded and recomputed wholesale. Solutions include 'microbatching' (treating the stream as a sequence of small blocks, e.g., one second each, and processing each block like a mini-batch) and 'checkpointing' (periodically saving operator state to durable storage), either of which allows a restart from the last successful point. However, these alone are insufficient once output has caused external side effects (e.g., database writes, emails). To achieve 'effectively-once semantics' for such interactions, all outputs and state changes must take effect atomically, either through distributed transactions or by relying on 'idempotence': making operations safe to repeat without duplicate effects, often by storing metadata such as message offsets so that a prior application can be detected. State recovery after a failure can use replication to a remote datastore, replication of local state to a log-compacted topic, or rebuilding state from the input streams; the best strategy depends on the performance characteristics of the underlying infrastructure.
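The offset-based idempotence idea can be illustrated with a small sketch (invented names, standing in for a real datastore): the sink records, alongside each value, the offset of the message that produced it, and silently skips any message whose offset it has already applied. After a crash and replay from the last checkpoint, redelivered messages then cause no duplicate effects.

```python
class IdempotentSink:
    """Idempotent writes via message offsets: remember the highest offset
    applied per key and ignore redeliveries at or below it, so replaying
    messages after a failure does not duplicate their effects."""

    def __init__(self):
        self.store = {}    # key -> value (stands in for the external datastore)
        self.applied = {}  # key -> highest message offset applied

    def write(self, offset, key, value):
        if self.applied.get(key, -1) >= offset:
            return False  # already applied before the crash: skip
        # In a real system, the value and the offset must be written
        # atomically (e.g., in the same transaction) for this to be safe.
        self.store[key] = value
        self.applied[key] = offset
        return True
```

The comment about atomicity is the crux: if the value and the offset could be written separately, a crash between the two writes would reintroduce the duplicate-effect problem.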
