
From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: O'Reilly Media, Inc.
Year: 2017
Category: Computers

Chapter 11: Stream Processing
Key Insight 6 from this chapter

Stream Processing Applications and Time Semantics

Key Insight

Stream processing enables diverse applications beyond merely transmitting events from one place to another. Primarily, it processes one or more input streams to produce derived output streams, with 'operators' or 'jobs' consuming their inputs read-only and writing their outputs append-only. Key applications include monitoring systems (e.g., fraud detection, trading systems, manufacturing diagnostics) that require sophisticated pattern matching. 'Complex Event Processing' (CEP), an older discipline, detects specific patterns of events in a stream using declarative queries; the queries are long-term, persistent entities that events flow past, rather than transient requests against static data.
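The "queries flow past events" idea can be sketched as a long-lived operator that watches a stream for a pattern. This is a minimal, hypothetical example (the pattern, event fields, and threshold are illustrative assumptions, not from the book): flag a user whose three consecutive failed logins are followed by a success.

```python
from collections import defaultdict, deque

# Hypothetical CEP-style operator: the pattern ("3 failures, then a
# success, for the same user") is persistent, and events flow past it.
# Field names are illustrative assumptions.
def detect_suspicious_logins(events, window=3):
    recent = defaultdict(deque)  # per-user history of recent outcomes
    for event in events:  # read-only consumption of the input stream
        user, ok = event["user"], event["success"]
        if ok and len(recent[user]) == window and not any(recent[user]):
            # Emit to a derived output stream (here: a generator).
            yield {"alert": "suspicious login pattern", "user": user}
        recent[user].append(ok)
        if len(recent[user]) > window:
            recent[user].popleft()  # keep only the last `window` outcomes

stream = [
    {"user": "alice", "success": False},
    {"user": "alice", "success": False},
    {"user": "alice", "success": False},
    {"user": "alice", "success": True},  # completes the pattern
]
alerts = list(detect_suspicious_logins(stream))
# alerts == [{"alert": "suspicious login pattern", "user": "alice"}]
```

Real CEP engines express such patterns declaratively (e.g., SQL-like pattern syntax) rather than in imperative code, but the operator lifecycle is the same: the query persists while events stream through it.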

'Stream analytics' is another major application, focusing on aggregations and statistical metrics over many events, such as calculating rolling averages or event rates within defined 'windows.' These statistics are often computed over fixed time intervals, smoothing out fluctuations while providing timely insights. While some stream analytics systems use probabilistic algorithms (e.g., Bloom filters, HyperLogLog) for memory efficiency and approximate results, stream processing is not inherently inexact; these are merely optimizations. Furthermore, stream processors are crucial for 'maintaining materialized views,' keeping derived data systems like caches, search indexes, or event-sourced application states continuously updated by applying a log of changes, often requiring a 'window' that extends back to the beginning of time.
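A minimal sketch of the aggregation idea above, assuming events carry an epoch-seconds timestamp (the field name and window size are illustrative): count events per fixed one-minute tumbling window by assigning each event to the window containing its timestamp.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative fixed window size

def tumbling_counts(events):
    """Event rate per tumbling window, keyed by window start time."""
    counts = defaultdict(int)
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = event["timestamp"] - (event["timestamp"] % WINDOW_SECONDS)
        counts[window_start] += 1
    return dict(counts)

events = [{"timestamp": t} for t in (3, 45, 59, 61, 130)]
rates = tumbling_counts(events)
# rates == {0: 3, 60: 1, 120: 1}
```

A production system would hold only the currently open windows in memory and emit each window's count once it closes, rather than accumulating a full dictionary; approximate structures such as HyperLogLog would replace exact counters when distinct-count cardinality is large.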

Reasoning about 'time' in stream processing is surprisingly complex. Unlike batch processing, which uses timestamps embedded in the events and so produces deterministic results, stream processors often use 'processing time' (the system clock of the processing machine), which breaks down when there are significant delays or out-of-order message arrivals. The distinction between 'event time' (when the event actually occurred) and 'processing time' (when it was observed by the processor) is crucial to avoid bad data, such as false spikes in metrics while working through a backlog. Handling 'straggler' events (late arrivals) requires strategies such as ignoring them or publishing corrections. Defining windowsโ€”tumbling (fixed, non-overlapping), hopping (fixed, overlapping), sliding (events within some interval of each other), and session (based on periods of user activity)โ€”also relies on accurate time management, sometimes necessitating multiple timestamps to adjust for unreliable device clocks and network delays.
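The multiple-timestamp adjustment works by logging three timestamps: the event time and send time according to the device's (possibly skewed) clock, and the receive time according to the server's trusted clock. A sketch, with illustrative field names and ignoring network delay:

```python
# Three-timestamp clock correction: estimate the device clock's offset
# from the send/receive pair, then apply it to the event timestamp.
# Field names are assumptions; all times are epoch seconds.
def corrected_event_time(event):
    # Device clock minus server clock at (roughly) the same instant;
    # network transit delay is assumed negligible here.
    offset = event["device_send_time"] - event["server_receive_time"]
    return event["device_event_time"] - offset

event = {
    "device_event_time": 1000,    # device clock, skewed 1000s behind
    "device_send_time": 1005,     # device clock when message was sent
    "server_receive_time": 2005,  # trusted server clock on receipt
}
corrected = corrected_event_time(event)
# corrected == 2000: the 1000-second skew has been removed
```

The correction assumes the device clock's offset stayed constant between the event occurring and the message being sent, which is usually reasonable when events are buffered only briefly.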
