Cover of Designing Data-Intensive Applications by Martin Kleppmann - Business and Economics Book

From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

🎧 Free Preview Complete

You've listened to your free 10-minute preview.
Sign up free to continue listening to the full summary.

Create Free Account Sign In

🎧 Listen to Summary

Free 10-min Preview

0:00

Speed:

10:00 free remaining

Chapter 6: Partitioning

Key Insight 2 from this chapter

Partitioning Secondary Indexes

Key Insight

While primary keys uniquely identify individual records, secondary indexes enable searches based on attribute values that may not be unique, such as finding all cars of a specific color or all articles containing a certain word. These indexes are fundamental in relational and document databases and are the core functionality of search servers like Solr and Elasticsearch. However, incorporating secondary indexes into a partitioned database introduces complexity because secondary index entries do not naturally align with the primary key's partition, necessitating distinct strategies for their distribution.

One approach is document-partitioned indexing (also known as local indexing), where each partition maintains its own secondary indexes, covering only the documents stored within that partition. For instance, if a database is partitioned by a unique document ID, all secondary index entries for a document will reside alongside that document on its assigned partition. This method offers efficient writes, as only the specific partition containing the document needs to be updated. Conversely, reading from a document-partitioned index is less efficient; unless all relevant data for a query happens to be on a single partition, the query must be sent to all partitions (a 'scatter/gather' operation), with results combined client-side. This can be costly and exacerbate 'tail latency amplification.' Databases like MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB commonly utilize document-partitioned secondary indexes.

The alternative is term-partitioned indexing (also known as global indexing), where a single, global index covers data across all primary partitions. This global index is itself partitioned, not by document ID, but by the indexed term's value (e.g., 'color:red'). Its partitions might be defined by ranges or hashes of these terms. This strategy significantly improves read performance: a client can directly query the specific index partition containing the desired term, avoiding the scatter/gather overhead. However, writes become more complex and slower because a single document update might necessitate updating multiple partitions of the global index, potentially spread across different nodes. Due to this distributed write complexity, updates to global secondary indexes are often asynchronous, meaning there can be a delay between a write operation and its visibility in the index. Examples include Riak's search feature and Oracle data warehouses, which offer both local and global indexing options.

📚 Continue Your Learning Journey — No Payment Required

Access the complete Designing Data-Intensive Applications summary with audio narration, key takeaways, and actionable insights from Martin Kleppmann.

📖 Read Full Summary 🔍 Explore More Books