
From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 3: Storage and Retrieval
Key Insight 6 from this chapter

Column-Oriented Storage for Analytical Workloads

Key Insight

For analytical queries on fact tables containing trillions of rows and hundreds of columns, traditional row-oriented storage is inefficient. A typical OLAP query needs only a handful of columns, yet a row-oriented system must load every row in full from disk, parse it, and discard the fields the query does not use. 'Column-oriented storage' addresses this by storing all values from each column together, often in separate files, with every column holding its values in the same row order. A query then reads and processes only the columns it references, which greatly reduces the amount of data read from disk.
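The difference between the two layouts can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular database stores data: the sample rows, the `quantity` column, and the helper names `to_columns` and `total_quantity_for_product` are assumptions made for the example.

```python
# Hypothetical fact-table rows; the values are made up for illustration.
rows = [
    {"date_key": 140102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 140102, "product_sk": 69, "store_sk": 5, "quantity": 3},
    {"date_key": 140102, "product_sk": 74, "store_sk": 3, "quantity": 5},
    {"date_key": 140103, "product_sk": 31, "store_sk": 2, "quantity": 1},
]


def to_columns(rows):
    """Pivot row dicts into one list per column, all in the same row order."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def total_quantity_for_product(columns, product_sk):
    """Sum quantity for one product while touching only two columns."""
    pairs = zip(columns["product_sk"], columns["quantity"])
    return sum(quantity for product, quantity in pairs if product == product_sk)


columns = to_columns(rows)
total = total_quantity_for_product(columns, 69)
```

In a real column store each list would be a separate file or block on disk, so the query above would never read `date_key` or `store_sk` at all.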

Column-oriented storage also lends itself well to compression. 'Bitmap encoding' is particularly effective for columns in which the number of distinct values is small compared to the number of rows. Each distinct value gets its own bitmap with one bit per row, and because such bitmaps are mostly zeros they can be compressed further with run-length encoding. Common WHERE clauses then turn into cheap bitwise operations: filtering `product_sk IN (30, 68, 69)` becomes a bitwise OR of three bitmaps, and `product_sk = 31 AND store_sk = 3` becomes a bitwise AND of two bitmaps. The AND works across columns because every column stores its rows in the same order, so bit k in one column's bitmap refers to the same row as bit k in another's.
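A rough sketch of bitmap encoding follows, using Python integers as bitmaps (bit i stands for row i). The column values and the helper names `build_bitmaps`, `matching_rows`, and `run_length_encode` are illustrative assumptions. For simplicity the OR and AND here run on the uncompressed integers; a production system would use a compressed bitmap format instead.

```python
def build_bitmaps(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps


def matching_rows(bitmap):
    """Return the row indices whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def run_length_encode(bitmap, n_rows):
    """Encode a bitmap as alternating run lengths: zeros, ones, zeros, ..."""
    runs, current, length = [], 0, 0
    for i in range(n_rows):
        bit = (bitmap >> i) & 1
        if bit == current:
            length += 1
        else:
            runs.append(length)      # close the previous run
            current, length = bit, 1
    runs.append(length)
    return runs


# Two columns of a hypothetical fact table, stored in the same row order.
product_sk = [69, 69, 74, 31, 31, 30, 30, 68, 31]
store_sk = [4, 5, 3, 3, 2, 3, 1, 3, 3]

product_bitmaps = build_bitmaps(product_sk)
store_bitmaps = build_bitmaps(store_sk)

# product_sk IN (30, 68, 69): bitwise OR of three bitmaps.
in_filter = product_bitmaps[30] | product_bitmaps[68] | product_bitmaps[69]

# product_sk = 31 AND store_sk = 3: bitwise AND across two columns.
and_filter = product_bitmaps[31] & store_bitmaps[3]

rle_31 = run_length_encode(product_bitmaps[31], len(product_sk))
```

With this data, `matching_rows(in_filter)` gives rows 0, 1, 5, 6, 7, and `matching_rows(and_filter)` gives rows 3 and 8. The run-length form of the bitmap for `product_sk = 31` is `[3, 2, 3, 1]`: three zeros, two ones, three zeros, one one.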

Beyond reducing disk I/O, column-oriented storage also makes better use of the CPU. A chunk of compressed column data can fit in the CPU's cache, which enables 'vectorized processing': tight loops that iterate over the chunk without function calls or complex conditional logic, and that can take advantage of Single-Instruction-Multiple-Data (SIMD) instructions. The simplest layout stores rows in insertion order, so that a new row is just appended to each column file. Sorting the data by designated columns (e.g., `date_key` then `product_sk`) can further improve query performance and compression, provided that whole rows are sorted together so that the k-th entry of every column still belongs to the same row. Writes in column stores are handled much as in LSM-trees: they first accumulate in an in-memory store and are later merged in bulk into new on-disk column files. Queries combine the on-disk column data with the recent in-memory writes, so modifications are visible immediately while the on-disk storage stays read-optimized.
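The LSM-style write path and the sorted layout can be sketched together as a toy class. Everything here is an assumption for illustration: the class name `TinyColumnStore`, its methods, and the use of plain Python lists in place of on-disk column files. It shows only the shape of the idea, not a real storage engine.

```python
class TinyColumnStore:
    """Toy column store: sorted 'on-disk' columns plus an in-memory write buffer."""

    def __init__(self, column_names, sort_key):
        self.column_names = list(column_names)
        self.sort_key = list(sort_key)  # e.g. ["date_key", "product_sk"]
        # Plain lists stand in for the per-column files on disk.
        self.columns = {name: [] for name in self.column_names}
        self.buffer = []  # recent writes, kept as row tuples

    def insert(self, row):
        """Accept a write into the in-memory buffer only."""
        self.buffer.append(tuple(row[name] for name in self.column_names))

    def _disk_rows(self):
        """Reassemble whole rows from the column lists."""
        return list(zip(*(self.columns[name] for name in self.column_names)))

    def flush(self):
        """Merge buffered rows with stored rows, sort whole rows, rewrite all columns."""
        key_idx = [self.column_names.index(k) for k in self.sort_key]
        rows = sorted(
            self._disk_rows() + self.buffer,
            key=lambda r: tuple(r[i] for i in key_idx),
        )
        self.columns = {
            name: [r[i] for r in rows]
            for i, name in enumerate(self.column_names)
        }
        self.buffer = []

    def scan(self, name):
        """Read one column, combining flushed data with not-yet-flushed writes."""
        i = self.column_names.index(name)
        return self.columns[name] + [r[i] for r in self.buffer]


store = TinyColumnStore(
    ["date_key", "product_sk", "quantity"],
    sort_key=["date_key", "product_sk"],
)
store.insert({"date_key": 140103, "product_sk": 31, "quantity": 1})
store.insert({"date_key": 140102, "product_sk": 74, "quantity": 5})
before_flush = store.scan("quantity")  # buffered writes are already visible
store.flush()
after_flush = store.scan("quantity")   # rows are now sorted by the sort key
```

Sorting whole rows in `flush` keeps the columns aligned, and `scan` reads both the flushed columns and the buffer so that a query never misses a recent write.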
