From "Designing Data-Intensive Applications" by Martin Kleppmann
Log-Structured Merge-Trees (LSM-trees)
Key Insight
Log-structured storage engines, exemplified by Bitcask (the default storage engine in Riak), operate on an append-only data file complemented by an in-memory hash map. This hash map maps keys to byte offsets within the data file, enabling high-performance value lookups via a single disk seek. New or updated key-value pairs are appended to the file, and the hash map is updated to reflect the latest offset. This approach excels in both reads and writes, provided all keys fit within the available RAM. Values can exceed memory capacity, as they are loaded on demand from disk.
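The hash-index idea above can be sketched in a few lines. This is a minimal illustration, not Bitcask's actual API: an append-only file plus an in-memory dict mapping each key to the byte offset of its most recent record, so a read needs only one seek.

```python
class HashIndexStore:
    """Sketch of a Bitcask-style log-structured store (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.index = {}              # key -> byte offset of latest record
        open(path, "a").close()      # ensure the data file exists

    def put(self, key, value):
        # Append the record, then point the key at its new offset.
        with open(self.path, "ab") as f:
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)           # single seek straight to the value
            line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value
```

Note that an update simply appends a new record; the old one becomes dead data in the file, which is why the compaction step described next is needed.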
To prevent log files from growing indefinitely, they are broken into fixed-size segments. When a segment reaches its threshold, it's closed, and subsequent writes go to a new segment file. A background 'compaction' process merges these segments, discarding duplicate keys and retaining only the most recent value for each key, often resulting in much smaller merged segments. These merged segments are written to new files, and once complete, read requests switch to the new segments, allowing old files to be deleted. Each segment has its own in-memory hash table, and lookups sequentially check hash maps from the newest to oldest segment until the key is found, a process optimized by keeping the number of segments small through merging.
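The merge step can be sketched as follows, assuming each segment is simply a list of (key, value) records in write order; real engines operate on files, but the deduplication logic is the same: later segments win.

```python
def compact(segments):
    """Merge segments (ordered oldest to newest) into one segment that
    keeps only the most recent value per key. Illustrative sketch."""
    latest = {}
    for segment in segments:         # oldest first...
        for key, value in segment:
            latest[key] = value      # ...so newer writes overwrite older ones
    return list(latest.items())
```

Because duplicate keys collapse to a single entry, the merged output is often much smaller than the inputs, which is what keeps the segment count and total file size bounded.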
Practical LSM-tree implementations, such as LevelDB and RocksDB, utilize Sorted String Tables (SSTables) and a 'memtable' (an in-memory balanced tree). Incoming writes are first added to the memtable. When it exceeds a few megabytes, it's written to disk as a sorted SSTable. Reads first check the memtable, then progressively older on-disk SSTables. A separate write-ahead log (WAL) ensures durability by immediately logging every write, enabling recovery of the memtable after a crash. These engines achieve high write throughput through sequential disk writes, support efficient range queries because the data is sorted, use Bloom filters to avoid disk reads for keys that don't exist, and keep the number of segments in check with compaction strategies such as size-tiered or leveled compaction.
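The memtable-plus-SSTables read/write path can be sketched as below. This is a toy model, not LevelDB's or RocksDB's actual structure: the memtable is a plain dict sorted at flush time (real engines use a balanced tree or skip list), the flush threshold is an entry count rather than megabytes, and SSTable lookups scan linearly where a real engine would binary-search and consult a Bloom filter first.

```python
MEMTABLE_LIMIT = 4  # illustrative threshold; real engines flush at a few MB

class LSMTree:
    """Toy LSM-tree: memtable in front, sorted runs behind it."""

    def __init__(self):
        self.memtable = {}
        self.sstables = []           # newest segment last; each is sorted

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self._flush()

    def _flush(self):
        # Write the memtable out as a sorted run of (key, value) pairs.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:     # freshest data first
            return self.memtable[key]
        for table in reversed(self.sstables):   # then newest segment first
            for k, v in table:
                if k == key:
                    return v
        return None
```

A production engine would also append every `put` to a write-ahead log before touching the memtable, so the in-memory state can be rebuilt after a crash.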