From "Designing Data-Intensive Applications"
🎧 Listen to Summary
Free 10-min PreviewPartitioning Secondary Indexes
Key Insight
While primary keys uniquely identify individual records, secondary indexes enable searches based on attribute values that may not be unique, such as finding all cars of a specific color or all articles containing a certain word. These indexes are fundamental in relational and document databases and are the core functionality of search servers like Solr and Elasticsearch. However, incorporating secondary indexes into a partitioned database introduces complexity because secondary index entries do not naturally align with the primary key's partition, necessitating distinct strategies for their distribution.
One approach is document-partitioned indexing (also known as local indexing), where each partition maintains its own secondary indexes, covering only the documents stored within that partition. For instance, if a database is partitioned by a unique document ID, all secondary index entries for a document will reside alongside that document on its assigned partition. This method offers efficient writes, as only the specific partition containing the document needs to be updated. Conversely, reading from a document-partitioned index is less efficient; unless all relevant data for a query happens to be on a single partition, the query must be sent to all partitions (a 'scatter/gather' operation), with results combined client-side. This can be costly and exacerbate 'tail latency amplification.' Databases like MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB commonly utilize document-partitioned secondary indexes.
The alternative is term-partitioned indexing (also known as global indexing), where a single, global index covers data across all primary partitions. This global index is itself partitioned, not by document ID, but by the indexed term's value (e.g., 'color:red'). Its partitions might be defined by ranges or hashes of these terms. This strategy significantly improves read performance: a client can directly query the specific index partition containing the desired term, avoiding the scatter/gather overhead. However, writes become more complex and slower because a single document update might necessitate updating multiple partitions of the global index, potentially spread across different nodes. Due to this distributed write complexity, updates to global secondary indexes are often asynchronous, meaning there can be a delay between a write operation and its visibility in the index. Examples include Riak's search feature and Oracle data warehouses, which offer both local and global indexing options.
📚 Continue Your Learning Journey — No Payment Required
Access the complete Designing Data-Intensive Applications summary with audio narration, key takeaways, and actionable insights from Martin Kleppmann.