How Do I Choose a Good Shard Key and Avoid Hotspots? (Bucketing and Key Randomization)

A shard key is the attribute used to partition data in a sharded database, and a good shard key evenly distributes data and load across all shards to prevent any single shard from becoming a hotspot.

What is a Shard Key and Why It Matters

A shard key (sometimes called a partition key) determines how data is split among the shards (servers) in a distributed database. It’s essentially the “address” that the system uses to decide which shard stores a given piece of data.

Choosing the right shard key is critical because it affects performance, scalability, and even maintenance of the database cluster.

A well-chosen shard key keeps data evenly distributed across nodes, which balances the workload and avoids overloading any one shard.

Think of shards like checkout lines at a grocery store. If most customers (data requests) end up in one line because of how you directed them, that line becomes a hotspot. It’s overloaded while other lines are idle.

A poorly chosen shard key can cause exactly this scenario in a database: one node gets almost all the traffic while others sit mostly idle. This imbalance hurts performance (that busy shard becomes a bottleneck).

In contrast, a good shard key spreads customers evenly to all lines (shards), so no single shard handles a disproportionate load.


Characteristics of a Good Shard Key

To ensure your data is spread out and queries run efficiently, keep the following qualities in mind when selecting a shard key:

  • High Cardinality: The key should have a large number of distinct values. This allows data to split into many chunks and spread across all shards. For example, using a unique user_id yields high cardinality (millions of possible values) and helps distribute records evenly, whereas a low-cardinality field like country_code (few distinct values) would lump many records on the same shard. High cardinality avoids any shard getting a disproportionate amount of data or traffic.

  • Even Data Distribution: The shard key’s values should not naturally cluster on a single shard. If certain key values dominate (or the key space isn’t uniformly used), you risk an imbalanced cluster. For instance, sharding by a plain timestamp might send all recent data to one shard (since new timestamps are always higher and fall into the latest shard range), overloading that shard. A good shard key distributes inserts and reads uniformly so each shard handles a similar load.

  • Non-Monotonic (Not Sequential): Avoid shard keys that always increase or decrease (like auto-increment IDs or raw timestamps). Monotonically changing keys route all new writes to the “last” shard, making it hot while others handle only older, cold data. For example, if you shard by an incrementing order ID, shard #N (holding the highest ID range) will get all new orders, becoming a write hotspot. Instead, a good shard key changes in a less predictable way (or is hashed) so that inserts go to different shards over time.

  • Aligns with Query Patterns: Ideally, choose a shard key that your common queries can use to target data. This isn’t directly about hotspots, but it’s important: if most queries filter by customer_id, using customer_id as part of the shard key means those queries hit only the relevant shard. Misalignment (e.g. sharding by region when you mostly query by user) forces cross-shard scatter-gather work (a short routing sketch follows below). So, while ensuring even distribution, make sure the key also supports efficient query routing. (In practice, sometimes you use a composite shard key to balance both distribution and query needs, such as combining region and user_id.)

By satisfying the above (unique enough values, even spread, not sequential, and query-friendly), a shard key will prevent “hot shards” and keep the cluster performing well.
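
To make the query-pattern point concrete, here is a minimal Python sketch of shard routing. The customer_id field, the eight-shard cluster, and the MD5-based mapping are all illustrative assumptions, not any particular database’s API.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical cluster size

def shard_for(customer_id: str) -> int:
    # Stable hash of the shard key -> shard index (MD5 used only for illustration).
    digest = hashlib.md5(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS

# A query that filters by customer_id can be routed to exactly one shard:
print("customer-42 lives on shard", shard_for("customer-42"))

# A query on a field that is NOT part of the shard key (say, order status)
# cannot be routed this way; the application must fan out to every shard
# and merge the results (scatter-gather).
print("status query must scan shards:", list(range(NUM_SHARDS)))
```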

Avoiding Hotspots: Bucketing and Key Randomization

Hotspots (or “hot shards”) occur when one shard handles a disproportionate share of traffic or data, becoming a bottleneck while other shards are underutilized.

Common causes include using sequential keys, very skewed values (e.g. one popular item ID that everyone accesses), or low-cardinality keys that funnel many records to the same shard.

To avoid this, we can deliberately introduce more randomness or spread into the shard key.

Two effective strategies are bucketing and key randomization:

Bucketing (Salting the Key to Spread Load)

Bucketing involves adding a component to your shard key that creates multiple “sub-keys” (buckets) for what would otherwise be a single heavily-used key.

In practice, this often means prefixing or suffixing the key with a random or patterned value (sometimes called key salting).

By doing so, data that would have all gone to one shard gets split across several shards.

For example, imagine all user activity is keyed by user_id.

A single very active user could make that user’s shard a hotspot.

With bucketing, you could add a random digit (0–9) to the shard key, e.g. shard by (user_id + suffix) where suffix is 0 through 9.

Now user 1234’s data isn’t all under one shard key “1234”. It might be divided among keys “1234-0”, “1234-1”, …, “1234-9”. This effectively spreads that one user’s load across 10 logical partitions (buckets) instead of 1.
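
Here is a minimal sketch of that salting idea in Python, assuming a ten-way split and string keys; the "user_id-suffix" key format and the helper names are hypothetical.

```python
import random

NUM_BUCKETS = 10  # one extra digit, 0-9, appended to the key

def salted_key(user_id: str) -> str:
    # Each write picks a random bucket, so one hot user's traffic spreads
    # over ten sub-keys instead of a single shard key.
    return f"{user_id}-{random.randint(0, NUM_BUCKETS - 1)}"

def all_keys_for(user_id: str) -> list[str]:
    # Reads must enumerate every bucket for the user and merge the results.
    return [f"{user_id}-{bucket}" for bucket in range(NUM_BUCKETS)]

print(salted_key("1234"))      # e.g. "1234-7"
print(all_keys_for("1234"))    # ["1234-0", "1234-1", ..., "1234-9"]
```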

In a database like DynamoDB or Cassandra, this technique is known as write sharding: you add a random number to the partition key to distribute writes more evenly.

One AWS example suggests adding a shard number 1–200 to a date-based key; if your partition key was just date, all today’s writes hit one partition, but using (date, random_shard) with 200 possible shard values spreads today’s inserts across 200 partitions.

The result is much better parallelism and throughput, since no single partition for that date is overwhelmed.
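
The following rough Python sketch shows one way to build such a key. Instead of a purely random suffix, it derives the suffix from a hypothetical order_id attribute so a later read for that order can recompute the same bucket; the SHARD_COUNT of 200 simply mirrors the example above, and nothing here is DynamoDB’s or Cassandra’s actual API.

```python
import datetime
import zlib

SHARD_COUNT = 200  # mirrors the 1-200 spread described above

def partition_key(order_id: str, day: datetime.date) -> str:
    # Derive the suffix from a known attribute rather than random.randint,
    # so a point read for this order can rebuild the exact partition key
    # instead of scanning all 200 partitions for the day.
    suffix = zlib.crc32(order_id.encode("utf-8")) % SHARD_COUNT + 1
    return f"{day.isoformat()}_{suffix}"

print(partition_key("order-98765", datetime.date(2024, 5, 1)))  # e.g. "2024-05-01_138"
```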

Bucketing is especially useful for time-based or sequential keys.

Instead of all new data going into one bucket, you can direct writes into one of several buckets.

Another form of this is time bucketing: for instance, include the hour or day as a component of the shard key.

That way, the writes rotate into a new bucket each hour or day, rather than all piling on the latest timestamp value.

The older time buckets become read-only and no longer get writes, which relieves the hot shard issue on the single “latest” shard.
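
As a small illustration, here is a Python sketch of an hour-granularity time bucket baked into the key; the sensor_id field and the "#" separator are assumptions made for the example.

```python
import datetime

def time_bucketed_key(sensor_id: str, ts: datetime.datetime) -> str:
    # Bucket by hour: each hour's writes land in a fresh partition, and the
    # previous hour's bucket stops receiving writes once the clock rolls over.
    return f"{sensor_id}#{ts.strftime('%Y-%m-%dT%H')}"

print(time_bucketed_key("sensor-7", datetime.datetime(2024, 5, 1, 14, 30)))
# -> "sensor-7#2024-05-01T14"
```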

The trade-off with bucketing/salting is that reading all data for one entity or time period becomes a bit more complex. You might have to query multiple buckets and merge results.

For instance, to read all logs for a given day when you’ve spread that day into 200 buckets, the application may need to query each bucket (or know which buckets are active) and combine the data.
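
A minimal sketch of that read-side fan-out in Python; fetch_bucket is a stand-in for whatever single-partition read your datastore exposes, and the "day_bucket" key format is assumed from the example above.

```python
def read_day(day: str, num_buckets: int, fetch_bucket) -> list:
    # Fan out over every bucket for the day and merge (or re-sort) results;
    # fetch_bucket(partition_key) represents a single-partition read.
    results = []
    for bucket in range(1, num_buckets + 1):
        results.extend(fetch_bucket(f"{day}_{bucket}"))
    return results

# Toy usage against an in-memory stand-in for the datastore:
fake_store = {"2024-05-01_1": ["log-a"], "2024-05-01_2": ["log-b"]}
print(read_day("2024-05-01", 2, lambda key: fake_store.get(key, [])))
# -> ["log-a", "log-b"]
```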

However, this overhead is often worth the major improvement in write distribution. The key point is that bucketing prevents a single key or range from hogging one shard by artificially spreading it out.

Key Randomization (Hashed Shard Keys)

Key randomization refers to using a hash function or similar randomizing method on the shard key so that the resulting shard assignment is uniform.

Many distributed databases support hashed sharding natively – instead of partitioning by the raw key value, the system partitions by a hash of the key. This effectively scatters records randomly across shards, eliminating patterns that cause hotspots.

In MongoDB or DynamoDB, for example, you can declare a hashed key, and the database will ensure that inserts with sequential or similar keys end up distributed evenly in the cluster.

The classic case for hashing is a monotonically increasing key like an auto-increment ID or timestamp.

As noted earlier, a pure sequential key would send all new inserts to one shard (the “highest” range) until that shard splits.

By hashing the key, those sequential IDs get mapped to essentially random shard positions.

Even distribution is achieved because a good hash function outputs values that (statistically) spread out uniformly.

If you hash a timestamp or user ID before sharding, new entries no longer all land on the same shard. The hash results will point to different shards, so the writes and reads get balanced.

In other words, hash-based sharding breaks up sorted order and prevents any shard from consistently getting the “latest” or most active portion of the data.
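
A minimal Python sketch of this behavior; the four-shard cluster, the SHA-256 choice, and the modulo mapping are illustrative assumptions, not how MongoDB or DynamoDB hash keys internally.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def hashed_shard(key: int) -> int:
    # Hash the raw key first, then map it to a shard, so consecutive IDs
    # scatter across the cluster instead of piling onto the "last" shard.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Sequential order IDs land on unrelated shards:
for order_id in range(1001, 1006):
    print(order_id, "-> shard", hashed_shard(order_id))
```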

However, hashed keys come with a consideration: you lose the natural ordering of data on each shard.

Range queries on the original key (e.g. “give me all users between ID 1000 and 2000” or “all orders from Jan 1–5”) become less efficient because those records won’t be localized to one or a few shards. The hash spreads them around.

Thus, key randomization is best when your priority is uniform load and you don’t heavily need range queries on the shard key field. It’s a trade-off between even load vs. query locality.

Often, systems will choose a hashed shard key specifically to avoid hot partitions, accepting a bit of query complexity in return for scalability.
