From "Designing Data-Intensive Applications"
Column-Oriented Storage for Analytical Workloads
Key Insight
Fact tables in analytical workloads can hold trillions of rows and hundreds of columns, and row-oriented storage handles queries over them inefficiently. A typical OLAP query touches only a handful of columns, yet a row-oriented engine must load every full row from disk into memory, parse it, and filter out whatever the query does not need. 'Column-oriented storage' addresses this by storing all the values of each column together, often in a separate file per column. A query then reads and parses only the columns it references, which greatly reduces disk I/O. Because every column file keeps its values in the same row order, the k-th entry of each column belongs to the same row, so a full row can still be reassembled when needed.
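The difference between the two layouts can be sketched in a few lines of Python. This is a minimal illustration rather than how a real column store is implemented: the tiny fact table, its values, and the helper names `to_columns`, `total_quantity`, and `reassemble_row` are all assumptions made for the example, and in-memory lists stand in for per-column files.

```python
# Hypothetical fact table; column names and values are illustrative only.
rows = [
    {"date_key": 140102, "product_sk": 69, "store_sk": 4, "quantity": 1, "net_price": 13.99},
    {"date_key": 140102, "product_sk": 69, "store_sk": 5, "quantity": 3, "net_price": 14.99},
    {"date_key": 140102, "product_sk": 74, "store_sk": 3, "quantity": 1, "net_price": 0.99},
    {"date_key": 140103, "product_sk": 31, "store_sk": 3, "quantity": 5, "net_price": 2.49},
]


def to_columns(rows):
    """Pivot row records into one list per column, keeping row order aligned."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def total_quantity(columns, product_sk):
    """Sum quantity for one product, reading only two of the five columns."""
    return sum(
        qty
        for sk, qty in zip(columns["product_sk"], columns["quantity"])
        if sk == product_sk
    )


def reassemble_row(columns, k):
    """Rebuild row k by taking the k-th entry of every column."""
    return {name: values[k] for name, values in columns.items()}


# Each list stands in for a separate column file on disk (an assumption).
columns = to_columns(rows)
```

Here `total_quantity(columns, 69)` never touches `date_key`, `store_sk`, or `net_price`, which is the I/O saving described above, while `reassemble_row(columns, 2)` shows that a whole row remains recoverable because all columns share the same row order.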
Column-oriented storage also lends itself to compression, because the values within a single column are often repetitive. 'Bitmap encoding' is particularly effective when a column has few distinct values relative to the number of rows. Each distinct value gets its own bitmap with one bit per row, set to 1 where the row holds that value, and these bitmaps can be compressed further with run-length encoding. Bitmaps also make common WHERE clauses cheap to evaluate: filtering `product_sk IN (30, 68, 69)` becomes a bitwise OR of three bitmaps, and `product_sk = 31 AND store_sk = 3` becomes a bitwise AND of two, operating directly on the compressed data. The AND works because every column stores its rows in the same order, so bit k in one column's bitmap refers to the same row as bit k in another's.
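A small sketch of bitmap encoding follows, using Python integers as bitmaps (bit k corresponds to row k). The `store_sk` values and the function names are invented for illustration. For simplicity the bitwise operations here run on uncompressed integers; a real engine would work on a compressed representation instead.

```python
def build_bitmaps(column):
    """Map each distinct value to an int whose bit k is set if row k has that value."""
    bitmaps = {}
    for k, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << k)
    return bitmaps


def run_length_encode(bitmap, n_rows):
    """Encode a bitmap as alternating run lengths, starting with a run of zeros."""
    runs, current, length = [], 0, 0
    for k in range(n_rows):
        bit = (bitmap >> k) & 1
        if bit == current:
            length += 1
        else:
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return runs


def matching_rows(bitmap, n_rows):
    """List the row numbers whose bit is set."""
    return [k for k in range(n_rows) if (bitmap >> k) & 1]


# Illustrative column values (assumed data, 18 rows, same row order in both columns).
product_sk = [69, 69, 69, 69, 74, 31, 31, 31, 31, 29, 30, 30, 31, 31, 31, 68, 69, 69]
store_sk = [4, 5, 3, 3, 3, 3, 8, 3, 2, 3, 3, 1, 3, 3, 9, 3, 3, 2]
n_rows = len(product_sk)

product_bitmaps = build_bitmaps(product_sk)
store_bitmaps = build_bitmaps(store_sk)

# WHERE product_sk IN (30, 68, 69): bitwise OR of three bitmaps.
in_filter = product_bitmaps[30] | product_bitmaps[68] | product_bitmaps[69]

# WHERE product_sk = 31 AND store_sk = 3: bitwise AND of two bitmaps.
and_filter = product_bitmaps[31] & store_bitmaps[3]
```

With this data, `run_length_encode(product_bitmaps[31], n_rows)` yields `[5, 4, 3, 3, 3]` (five zeros, four ones, three zeros, three ones, three zeros), which is far shorter than the 18-bit bitmap would be at realistic table sizes.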
Beyond reducing disk I/O, column-oriented storage makes better use of the CPU. A chunk of compressed column data can fit comfortably in the CPU's L1 cache, which enables 'vectorized processing': tight loops that iterate over the chunk without frequent function calls or complex conditional logic, and that can take advantage of Single-Instruction-Multiple-Data (SIMD) instructions. By default rows are stored in insertion order, so a new row is simply appended to each column file. Sorting the data by chosen columns (e.g., `date_key`, then `product_sk`) speeds up queries that filter on those columns and improves compression; the sort must be applied to whole rows at once so that the k-th entry of every column still belongs to the same row. Because sorted, compressed column files are costly to update in place, writes are handled much as in LSM-trees: they first accumulate in an in-memory store and are later merged with the on-disk column files in bulk. Queries combine the on-disk column data with the recent in-memory writes, so modifications are visible immediately while the on-disk storage stays read-optimized.
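The write path and sort order described above can be sketched as a toy class. Everything here is hypothetical: `TinyColumnStore` and its methods are not any real system's API, plain lists stand in for on-disk column files, and compression is omitted. The sketch only shows buffered writes, a sort-preserving bulk merge of whole rows, and reads that combine both sources.

```python
import heapq


class TinyColumnStore:
    """Toy model: sorted column 'files' plus an in-memory write buffer."""

    def __init__(self, column_names, sort_key):
        self.column_names = list(column_names)
        self.sort_key = list(sort_key)  # e.g. ["date_key", "product_sk"]
        # Lists standing in for on-disk column files (an assumption).
        self.columns = {name: [] for name in self.column_names}
        self.buffer = []  # recent writes, kept as whole rows

    def _key(self, row):
        return tuple(row[name] for name in self.sort_key)

    def _stored_rows(self):
        """Reassemble stored rows from the k-th entry of each column."""
        n = len(self.columns[self.column_names[0]])
        for k in range(n):
            yield {name: self.columns[name][k] for name in self.column_names}

    def insert(self, row):
        """Writes go to the in-memory buffer first."""
        self.buffer.append(dict(row))

    def scan(self, name):
        """Return one column's values, combining stored data with unflushed writes."""
        return self.columns[name] + [row[name] for row in self.buffer]

    def flush(self):
        """Merge buffered rows with stored rows in sort order and rewrite every column."""
        merged = heapq.merge(
            self._stored_rows(),
            sorted(self.buffer, key=self._key),
            key=self._key,
        )
        new_columns = {name: [] for name in self.column_names}
        for row in merged:  # whole rows move together, so columns stay aligned
            for name in self.column_names:
                new_columns[name].append(row[name])
        self.columns, self.buffer = new_columns, []


store = TinyColumnStore(
    ["date_key", "product_sk", "quantity"],
    sort_key=["date_key", "product_sk"],
)
store.insert({"date_key": 140103, "product_sk": 31, "quantity": 5})
store.insert({"date_key": 140102, "product_sk": 74, "quantity": 1})
store.insert({"date_key": 140102, "product_sk": 69, "quantity": 3})
visible_before_flush = sum(store.scan("quantity"))  # buffered rows are already queryable
store.flush()
```

After `flush()`, each column list is ordered by `date_key` and then `product_sk`. The merge rewrites all columns in one pass rather than updating them in place, which mirrors the bulk-merge idea from the paragraph above.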
Summary of "Designing Data-Intensive Applications" by Martin Kleppmann.