From "Designing Data-Intensive Applications"

Author: Martin Kleppmann
Publisher: "O'Reilly Media, Inc."
Year: 2017
Category: Computers

Chapter 6: Partitioning
Key Insight 1 from this chapter

Core Partitioning Strategies for Primary Keys

Partitioning, also known as sharding, is crucial for handling very large datasets or high query throughput, serving as a complement to replication. It involves dividing data into smaller, independent units called partitions, each effectively a self-contained database. This strategy, first adopted by products like Teradata and Tandem NonStop SQL in the 1980s and later by NoSQL and Hadoop systems, primarily addresses scalability by distributing data across many disks and query load across multiple processors in a shared-nothing cluster. Each record is assigned to exactly one partition, although these partitions themselves are typically replicated across several nodes for fault tolerance.

One common method is key-range partitioning, which assigns a continuous segment of keys (from a minimum to a maximum) to each partition, akin to volumes in a paper encyclopedia. Knowing the range boundaries allows direct routing to the correct partition. Within each partition, keys are often kept in sorted order, which greatly simplifies and speeds up range queries. For example, if a key includes a timestamp, all readings for a specific month can be efficiently retrieved. However, this approach risks creating 'hot spots' if write access patterns frequently target keys within the same range, such as all new sensor data being written to a single 'current day' partition. To mitigate this, key design might be adjusted, for instance, by prefixing timestamps with a sensor name to distribute write load across multiple sensor-specific ranges.
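The routing logic described above can be sketched with a binary search over sorted split points. This is an illustrative sketch, not code from the book; the boundary values, partition names, and helper are hypothetical.

```python
import bisect

# Hypothetical split points: partition i holds keys in
# [BOUNDARIES[i-1], BOUNDARIES[i]), like volume labels on an encyclopedia.
BOUNDARIES = ["g", "n", "t"]           # 3 split points -> 4 key ranges
PARTITIONS = ["p0", "p1", "p2", "p3"]  # illustrative partition names

def partition_for_key(key: str) -> str:
    """Route a key to its partition by binary search over the boundaries."""
    return PARTITIONS[bisect.bisect_right(BOUNDARIES, key)]

# Adjacent keys land on the same partition, so a range query only needs
# to touch the partitions whose ranges overlap the requested interval.
assert partition_for_key("apple") == "p0"
assert partition_for_key("mango") == "p1"
assert partition_for_key("zebra") == "p3"
```

Note that a key like `(sensor_name, timestamp)` would route by its prefix first, which is exactly the hot-spot mitigation the paragraph above describes: writes from different sensors fall into different ranges instead of all landing on the current-day partition.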

To counter the risk of hot spots from clustered key access, many distributed datastores employ hash partitioning, where a hash function determines a key's partition. A good hash function uniformly distributes even similar input keys across a wide range of numbers, allowing partitions to own specific hash ranges. This technique is effective at spreading data and load evenly, preventing hot spots if access is distributed. However, hashing destroys the original key sort order, making efficient range queries challenging or impossible, often necessitating a 'scatter/gather' approach across all partitions.

Cassandra offers a hybrid solution using compound primary keys: the first part is hashed for partitioning, while subsequent parts maintain a sorted order within that partition, enabling efficient range scans for specific primary key prefixes. Despite hashing, extreme workloads targeting a single, frequently accessed key can still create a hot spot, requiring application-level strategies like appending a random number to the hot key to distribute its load across multiple derived keys.
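Both ideas above can be sketched in a few lines: a hash function that maps keys to partitions, and a "salting" helper that fans a hot key out over several derived keys. This is a minimal illustration under assumed names (`N_PARTITIONS`, the `#` suffix convention, the fanout of 4); it is not any particular datastore's implementation.

```python
import hashlib
import random

N_PARTITIONS = 8  # assumed cluster size for illustration

def hash_partition(key: str) -> int:
    """Map a key to a partition via a stable hash of the key bytes.

    A deterministic hash (not Python's randomized hash()) is required so
    every node routes the same key to the same partition.
    """
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % N_PARTITIONS

# Adjacent keys scatter across partitions: load spreads evenly, but a
# range query must now scatter/gather across all partitions.
monday = hash_partition("sensor-a:2017-01-02")
tuesday = hash_partition("sensor-a:2017-01-03")

def salted_key(hot_key: str, fanout: int = 4) -> str:
    """Writes: append a random suffix so one logical hot key becomes
    several derived keys, likely landing on different partitions."""
    return f"{hot_key}#{random.randrange(fanout)}"

def all_salted_keys(hot_key: str, fanout: int = 4) -> list[str]:
    """Reads: the trade-off is that a reader must fetch and merge
    every derived key to reconstruct the logical key's data."""
    return [f"{hot_key}#{i}" for i in range(fanout)]
```

The salting trick only pays off for the small number of keys known to be hot, since it turns every read of those keys into a multi-key gather.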
