What topics does "Data Partitioning, Sharding, and the Routing Problem" cover?

This article covers: data partitioning, sharding, vertical partitioning, horizontal partitioning, sharding key, routing problem.

Data Partitioning, Sharding, and the Routing Problem

Q: What is "Data Partitioning, Sharding, and the Routing Problem" about?

Why random routing breaks stateful systems, the two ways to split a database, and how sharding is just a side effect of where requests land - not a separate process.

The Load Balancer from the previous chapter hides 500 servers behind a single IP. But there is a question it left unanswered: how does it decide which of those 500 gets a given request? Picking at random sounds fine - until you think about what is actually stored on each server.

Data Lives on Specific Servers

Consider a stateful setup where each server stores its own data. James's bookmarks are on Server 3. His next request comes in, and the Load Balancer routes it to Server 7. Server 7 has no record of James - it returns an empty list. From James's perspective, the app just deleted everything he saved.

This is not a hypothetical edge case. It is the fundamental constraint that makes routing hard: a request must arrive at the server that holds the relevant data. You cannot route arbitrarily.
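The failure is easy to reproduce in a toy model. The sketch below is only an illustration: the cluster size, the in-memory dictionaries standing in for servers, and the function names are all assumptions made for the example, not part of any real load balancer.

```python
import random

NUM_SERVERS = 10  # hypothetical cluster size, kept small for the example

# Each server keeps its own private store: user -> list of bookmarks.
servers = [dict() for _ in range(NUM_SERVERS)]

def random_route(rng):
    """Pick any server, ignoring who the request is for."""
    return rng.randrange(NUM_SERVERS)

def save_bookmark(server_id, user, url):
    servers[server_id].setdefault(user, []).append(url)

def list_bookmarks(server_id, user):
    # A server can only answer from its own local data.
    return servers[server_id].get(user, [])

rng = random.Random(42)  # seeded so runs are repeatable

# One write lands on whichever server the router happened to pick.
write_server = random_route(rng)
save_bookmark(write_server, "james", "https://example.com/article")

# Read 1000 times through the same random router and count the empty answers.
misses = sum(
    1 for _ in range(1000)
    if not list_bookmarks(random_route(rng), "james")
)
print(f"written to server {write_server}; {misses} of 1000 reads came back empty")
```

With ten servers, roughly nine reads out of ten should land on a server that has never heard of James, even though his bookmark was stored successfully.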

Routing and Distribution Are the Same Problem

Here is the insight that reframes everything: routing and data distribution are not two separate problems. There is no background process deciding where to store data independently of where requests go. Data lands wherever the Load Balancer routes the write request. A user's bookmarks accumulate on whichever server always receives that user's requests.

The routing rule is the distribution rule. That means the logic used to route a read must be identical to the logic that routed the original write. A mismatch - write using one rule, read using another - sends the read to a server that never received the data, so retrieval fails even though the data exists.
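A minimal sketch of that mismatch, assuming two made-up routing rules (`route_v1` and `route_v2`) over a hypothetical four-server cluster. Neither rule is a recommendation. They only need to be deterministic and different from each other.

```python
NUM_SERVERS = 4  # hypothetical cluster size

# One dictionary per server: user_id -> list of bookmarks.
servers = [{} for _ in range(NUM_SERVERS)]

def route_v1(user_id: int) -> int:
    # Illustrative rule: the same user_id always maps to the same server.
    return user_id % NUM_SERVERS

def route_v2(user_id: int) -> int:
    # A different rule, just as deterministic, but it disagrees with route_v1.
    return (user_id // 10) % NUM_SERVERS

def write(rule, user_id, bookmark):
    # The data is stored on whichever server the rule selects.
    servers[rule(user_id)].setdefault(user_id, []).append(bookmark)

def read(rule, user_id):
    # The read only looks at the one server the rule selects.
    return servers[rule(user_id)].get(user_id, [])

write(route_v1, 42, "https://example.com/a")
write(route_v1, 42, "https://example.com/b")

same_rule = read(route_v1, 42)   # finds both bookmarks
other_rule = read(route_v2, 42)  # looks on a different server, finds nothing

print(same_rule)
print(other_rule)
```

Both rules are perfectly consistent on their own. The data only goes missing when the write path and the read path disagree.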

[!NOTE] This is why choosing a routing algorithm is a high-stakes decision. Get it wrong and you end up with hot servers, unequal load, or data spread in a way no query can reassemble efficiently.

Before designing that routing algorithm, we need to understand how data can be split across machines in the first place.

Vertical Partitioning

Take a users table: id, username, password, name, avatar, gender. The first three columns are authentication concerns - used for login. The last three are profile concerns - used for display pages. You could split this into a users table and a profiles table.

That is vertical partitioning - splitting by columns, keeping rows intact. It is closely related to normalization, which you likely already know. The benefit is the same: preventing update anomalies, delete anomalies, and the inconsistency that creeps in when the same fact is stored in two places.
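The split can be sketched with an in-memory SQLite database. The table name `users_wide`, the `user_id` column on `profiles`, and the sample rows are assumptions for the example. Only the six column names come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The original wide table with all six columns.
    CREATE TABLE users_wide (
        id INTEGER PRIMARY KEY, username TEXT, password TEXT,
        name TEXT, avatar TEXT, gender TEXT
    );
    INSERT INTO users_wide VALUES
        (1, 'james', 'hash-1', 'James', 'james.png', 'm'),
        (2, 'maria', 'hash-2', 'Maria', 'maria.png', 'f');

    -- Authentication columns stay in users.
    CREATE TABLE users AS SELECT id, username, password FROM users_wide;
    -- Display columns move to profiles, keyed by the same id.
    CREATE TABLE profiles AS
        SELECT id AS user_id, name, avatar, gender FROM users_wide;
""")

# Column names of each partition (index 1 of a table_info row is the name).
login_columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
profile_columns = [row[1] for row in conn.execute("PRAGMA table_info(profiles)")]

# Joining the two partitions on the shared id reconstructs the original rows.
rejoined = conn.execute("""
    SELECT u.id, u.username, u.password, p.name, p.avatar, p.gender
    FROM users u JOIN profiles p ON p.user_id = u.id
    ORDER BY u.id
""").fetchall()
original = conn.execute("SELECT * FROM users_wide ORDER BY id").fetchall()

print(login_columns)
print(profile_columns)
print(rejoined == original)
```

Every row still exists in both partitions. Only the columns were divided, and the shared id is what lets the two halves be joined back together.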

Horizontal Partitioning

Flip the axis. Keep all columns the same, but split by rows. Every partition has the same schema - the rows are just divided by some criterion.

A practical example: a train booking system might keep Tatkal (last-minute) tickets in their own table, because Tatkal queries are time-critical and benefit from smaller, isolated indexes. Another: storing Slack messages separately per workspace, so a buggy query for one tenant can never accidentally surface another company's data. Both are horizontal partitioning.
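As a rough sketch of the per-workspace case, the snippet below groups rows by one column while leaving every row's fields untouched. The `partition_by` helper, the field names, and the sample messages are hypothetical.

```python
from collections import defaultdict

# Sample rows. Every row has the same fields, i.e. the same schema.
messages = [
    {"id": 1, "workspace": "acme", "text": "standup at 10"},
    {"id": 2, "workspace": "globex", "text": "deploy is green"},
    {"id": 3, "workspace": "acme", "text": "lunch?"},
    {"id": 4, "workspace": "globex", "text": "retro notes posted"},
]

def partition_by(rows, column):
    """Split rows into groups by the value of one column; columns are unchanged."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)

by_workspace = partition_by(messages, "workspace")

for workspace, rows in by_workspace.items():
    print(workspace, [row["id"] for row in rows])
```

A query that only ever touches `by_workspace["acme"]` cannot return a Globex message, no matter how its filter is written.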

Figure: Data Partitioning: Vertical vs Horizontal

The Interesting Part: Across Servers

Both strategies can be applied within one server or across multiple servers. That is where the implications diverge:

Vertical partitioning across servers: the authentication database lives on one machine, the profile database on another. Each service owns a distinct domain - this is the foundation of microservices.
Horizontal partitioning across servers: rows are split across machines because one machine cannot hold all of them. This is sharding.

What Sharding Actually Is

Sharding is horizontal partitioning across servers. The dataset is too large for one machine, so rows are distributed: some live on Server A, some on Server B, and so on. Benefits go beyond raw storage capacity - multiple machines handling separate subsets means better read throughput and higher write concurrency.

Sharding is always based on a sharding key: the column (or columns) used to decide which server a row belongs to. Shard by user_id and all of James's rows go to one server. Shard by created_at and this month's data lands on one shard, last month's on another.

One rule: a single database must use one consistent sharding key throughout. If different parts of your cluster use different logic to decide where rows live, you will never be able to locate anything reliably.
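To see how much the choice of key changes the layout, the sketch below places the same four rows twice, once keyed on user_id and once keyed on the month of created_at. The modulo-based placement, the shard count, and the sample rows are assumptions chosen to keep the example short, not a recommended algorithm.

```python
from datetime import date

NUM_SHARDS = 3  # hypothetical shard count

rows = [
    {"user_id": 7, "created_at": date(2024, 5, 2), "url": "https://example.com/a"},
    {"user_id": 7, "created_at": date(2024, 6, 10), "url": "https://example.com/b"},
    {"user_id": 8, "created_at": date(2024, 5, 20), "url": "https://example.com/c"},
    {"user_id": 8, "created_at": date(2024, 6, 11), "url": "https://example.com/d"},
]

def shard_by_user_id(row):
    # Sharding key: user_id. All rows of one user share a shard.
    return row["user_id"] % NUM_SHARDS

def shard_by_month(row):
    # Sharding key: created_at, reduced to a running month number.
    created = row["created_at"]
    return (created.year * 12 + created.month) % NUM_SHARDS

def distribute(rows, shard_key):
    """Place every row on the shard that the given key function selects."""
    shards = {shard: [] for shard in range(NUM_SHARDS)}
    for row in rows:
        shards[shard_key(row)].append(row)
    return shards

by_user = distribute(rows, shard_by_user_id)
by_month = distribute(rows, shard_by_month)

for name, layout in (("user_id", by_user), ("created_at month", by_month)):
    for shard, members in layout.items():
        summary = [(r["user_id"], r["created_at"].isoformat()) for r in members]
        print(name, shard, summary)
```

Keyed on user_id, each user's rows stay together. Keyed on month, each shard mixes users but holds a single month. Whichever key is chosen, every read and write has to use it.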

There Is No Sharding Process

This is the part that took me a while to internalize: sharding happens as a side effect of routing. There is no background daemon moving data around. No migration job running nightly.

If the Load Balancer always routes writes from a specific user to Server A, that user's data naturally accumulates on Server A. Routing controls where writes land - which means routing also controls how data is distributed. The routing rule and the sharding rule are the same rule.
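The sketch below contains no sharding step at all, only a write handler that stores each row on the server the request was routed to. The routing rule, the server count, and the traffic pattern (one heavy user among several light ones) are invented for the illustration.

```python
NUM_SERVERS = 3  # hypothetical cluster size

# Each server is just a list of (user_id, bookmark) rows.
servers = [[] for _ in range(NUM_SERVERS)]

def route(user_id):
    # Illustrative routing rule: same user, same server, every time.
    return user_id % NUM_SERVERS

def handle_write(user_id, bookmark):
    # The only placement logic in the system: store the row where the request landed.
    servers[route(user_id)].append((user_id, bookmark))

# User 1 saves 50 bookmarks; users 2 to 5 save one each.
for n in range(50):
    handle_write(1, f"bookmark-{n}")
for user_id in (2, 3, 4, 5):
    handle_write(user_id, "bookmark-0")

rows_per_server = [len(rows) for rows in servers]
users_per_server = [sorted({user_id for user_id, _ in rows}) for rows in servers]

print(rows_per_server)
print(users_per_server)
```

The resulting layout, including the overloaded server that holds the heavy user, was decided entirely by `route`. Nothing else in the code ever chose where a row should live.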

Figure: Sharding via Routing: Data distribution as a routing side effect

Sharding vs Replication

These two concepts get conflated constantly in interviews. They are not the same:

Sharding: split data across servers. Different rows live in different places. Increases storage capacity and throughput.
Replication: copy data across servers. The same row exists in multiple places. Increases durability and availability.

In production, you need both - shard for scale, replicate to prevent data loss. But they solve different problems and are designed independently.
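The two ideas compose cleanly, as the toy cluster below tries to show: the shard decides which group of servers owns a row, and replication copies the row to every server in that group. The sizes, the names, and the choice to write to all copies synchronously are simplifying assumptions. Real systems vary widely in how replicas are kept in sync.

```python
NUM_SHARDS = 2          # hypothetical: how many ways the rows are split
REPLICAS_PER_SHARD = 2  # hypothetical: how many copies each shard keeps

# cluster[shard][replica] is one server's local store: user_id -> value.
cluster = [
    [{} for _ in range(REPLICAS_PER_SHARD)]
    for _ in range(NUM_SHARDS)
]

def shard_for(user_id):
    # Sharding: decides WHICH group of servers owns this row.
    return user_id % NUM_SHARDS

def write(user_id, value):
    # Replication: every copy inside the owning shard stores the same row.
    for copy in cluster[shard_for(user_id)]:
        copy[user_id] = value

def read(user_id, replica=0):
    # Any replica of the owning shard can answer.
    return cluster[shard_for(user_id)][replica].get(user_id)

write(10, "james")
write(11, "maria")

for shard_id, replicas in enumerate(cluster):
    print(shard_id, replicas)
```

Different shards hold different rows, while the replicas inside one shard hold identical rows. Losing one replica loses no data, and losing a whole shard loses only that shard's subset.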

The Essentials

Routing and distribution are the same decision - data lands wherever writes are routed, so the routing rule and the sharding rule must always match. A mismatch makes data unlocatable.
Vertical partitioning splits by columns; horizontal partitioning splits by rows - sharding is horizontal partitioning applied across multiple servers, always driven by a single consistent sharding key.
Sharding is a side effect, not a process - no background daemon redistributes data. The Load Balancer's routing behavior is what creates the distribution.