The Data Distribution Problem (Why Location Matters)

March 5, 2026 · 4 min read
system design, high level design, HLD, distributed systems, scalability, microservices, load balancing, caching, database design, API design, software architecture

The Critical Question

You have 100 servers. Josefa sends a request.

Which server should handle it?

The answer depends entirely on where Josefa's data lives.

Why Data Location Matters 🎯

Scenario:

  • Server A stores Josefa's bookmarks
  • Server B stores Abhishek's bookmarks

If the load balancer routes Josefa's request to Server B:

  1. Server B has no Josefa data
  2. Returns empty list
  3. Josefa thinks: "My bookmarks are deleted!"
  4. Catastrophic user experience

Critical principle: Route users to servers containing their data.

The Impossible Options

Option 1: Store Everything on One Server

Can we avoid distribution entirely?

No. The entire reason for 100 servers was running out of space on one server.

Option 2: Store Data Randomly

Can we just randomly place data?

No. To retrieve data, you'd need to check all 100 servers. Every. Single. Request.

Example query time:

  • Check Server 1: 10ms
  • Check Server 2: 10ms
  • ...
  • Check Server 100: 10ms
  • Total: 1,000ms (1 second!) per request

Unacceptable.
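The arithmetic above can be made explicit in a couple of lines (the 100-server and 10ms figures are taken straight from the example):

```python
# Illustrative cost model using the numbers from the text.
PER_SERVER_CHECK_MS = 10
NUM_SERVERS = 100

# Random placement: in the worst case every server must be checked in turn.
random_placement_ms = NUM_SERVERS * PER_SERVER_CHECK_MS

# Deterministic routing: exactly one server is checked.
deterministic_ms = 1 * PER_SERVER_CHECK_MS

print(random_placement_ms)  # 1000 ms (1 second)
print(deterministic_ms)     # 10 ms
```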

The Solution: Logical Distribution

Requirements:

  1. Repeatable logic (same input → same server)
  2. Fast to compute (routing happens per request)
  3. Deterministic (no randomness)

This process is called data sharding.
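One common way to satisfy all three requirements is hash-based routing. This sketch (not from the original) uses `hashlib.md5` rather than Python's built-in `hash()`, because `hash()` is salted per process and so would violate the repeatability requirement across restarts:

```python
import hashlib

NUM_SERVERS = 100

def route(user_id: str) -> int:
    """Deterministically map a user to one of NUM_SERVERS servers.

    Repeatable (same input -> same server), fast to compute, and free of
    randomness. md5 is used here as a stable hash for distribution, not
    for security.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SERVERS

# Same input always yields the same server:
assert route("josefa") == route("josefa")
```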


Understanding Partitioning

Before sharding, understand partitioning.

Partitioning = Splitting data into non-overlapping subsets

Vertical Partitioning (Split by Columns)

Original users table:

| ID | username | password | name | avatar | gender |

Split into two tables:

Users table (Authentication):

| ID | username | password |

Profiles table (User Info):

| user_id (FK) | name | avatar | gender |

Why? Prevents data anomalies:

  • Read anomalies
  • Write anomalies
  • Update anomalies
  • Delete anomalies

This is normalization.
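As a sketch, the vertical split above could look like this in SQLite (schema and sample values are illustrative, not from the original):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Authentication data and profile data live in separate tables,
# linked by a foreign key -- the vertical split shown above.
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,
        password TEXT NOT NULL
    );
    CREATE TABLE profiles (
        user_id INTEGER REFERENCES users(id),
        name TEXT,
        avatar TEXT,
        gender TEXT
    );
""")

conn.execute("INSERT INTO users VALUES (0, 'josefa', 'hashed-pw')")
conn.execute("INSERT INTO profiles VALUES (0, 'Josefa', 'a.png', 'female')")

# A join reassembles the full row when both halves are needed.
row = conn.execute("""
    SELECT u.username, p.name
    FROM users u JOIN profiles p ON p.user_id = u.id
""").fetchone()
print(row)  # ('josefa', 'Josefa')
```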

Horizontal Partitioning (Split by Rows)

Original users table:

User 0, User 1, User 2, ... User 999

Split by gender:

users_male:

User 0 (male), User 2 (male), User 4 (male)...

users_female:

User 1 (female), User 3 (female), User 5 (female)...

Same columns, different rows.
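The row split can be sketched in a few lines (toy data, with alternating genders as in the example):

```python
# Toy rows matching the example: even IDs male, odd IDs female.
users = [
    {"id": i, "gender": "male" if i % 2 == 0 else "female"}
    for i in range(6)
]

# Horizontal partitioning: same columns, different rows per partition.
users_male = [u for u in users if u["gender"] == "male"]
users_female = [u for u in users if u["gender"] == "female"]

print([u["id"] for u in users_male])    # [0, 2, 4]
print([u["id"] for u in users_female])  # [1, 3, 5]
```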


Use Cases for Horizontal Partitioning

1. Performance Optimization

IRCTC example:

  • Regular tickets: Normal table
  • Tatkal tickets: Separate table (time-critical queries need fast indexing)

Benefit: Smaller table = faster index lookups

2. Multi-Tenancy Isolation

Slack example:

  • Company A's messages: Table A
  • Company B's messages: Table B

Why separate?

  • Bug in filter code won't expose Company A data to Company B
  • Physical separation prevents data leaks
  • Compliance requirements

From Partitioning to Sharding

Vertical partitioning across servers = Microservices

  • Authentication service on Server A
  • Profile service on Server B
  • Separation of concerns

Horizontal partitioning across servers = Sharding 🎯

  • Users 0-99 on Server A
  • Users 100-199 on Server B
  • Scalability solution
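The range mapping above can be sketched as a tiny lookup (the server names are illustrative):

```python
USERS_PER_SHARD = 100
SERVERS = ["server_A", "server_B"]  # hypothetical server names

def shard_for(user_id: int) -> str:
    """Range-based sharding: users 0-99 -> server_A, 100-199 -> server_B."""
    return SERVERS[user_id // USERS_PER_SHARD]

print(shard_for(42))   # server_A
print(shard_for(150))  # server_B
```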

What is Sharding?

Definition: Horizontal partitioning distributed across multiple servers.

Why shard?

1. Storage capacity: a single server can't hold all the data.

2. Read performance: reads run in parallel across servers.

3. Write throughput: writes are distributed across servers.

Sharding vs. Replication

Critical distinction:

Sharding:

  • Different data on different servers
  • User 0 data → Server A only
  • User 1 data → Server B only
  • Data splitting

Replication:

  • Same data on multiple servers
  • User 0 data → Server A AND Server B
  • Backup copies
  • Data copying

Production systems use both:

  • Shard for capacity
  • Replicate for fault tolerance
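A minimal sketch of combining the two, assuming a hypothetical two-shard cluster where each shard has one replica:

```python
# Hypothetical cluster layout: each shard has a primary and a replica.
CLUSTER = {
    "shard_0": {"primary": "server_A", "replica": "server_C"},
    "shard_1": {"primary": "server_B", "replica": "server_D"},
}

def shard_of(user_id: int) -> str:
    # Sharding decides WHICH shard holds the data.
    return f"shard_{user_id % len(CLUSTER)}"

def write_target(user_id: int) -> str:
    # Writes go to the shard's primary; replication copies them to the replica.
    return CLUSTER[shard_of(user_id)]["primary"]

def read_targets(user_id: int) -> list:
    # Reads can be served by either copy of the same shard.
    nodes = CLUSTER[shard_of(user_id)]
    return [nodes["primary"], nodes["replica"]]

print(write_target(7))   # server_B (7 % 2 == 1 -> shard_1)
print(read_targets(7))   # ['server_B', 'server_D']
```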

The Sharding Key 🔑

Every sharding strategy needs a sharding key.

Example: Gender-based sharding

  • Sharding key: gender column
  • Males → Server A
  • Females → Server B

Rules:

1. One logic per database: you can't use different sharding keys for different server groups within the same database.

2. Composite keys allowed: a sharding key can span multiple columns, e.g. (region, user_id).

3. Consistency required: the same key must apply across all shards in the database.
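The composite-key rule can be sketched by feeding both columns into a single hash (the shard count and the use of `md5` are assumptions for illustration):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(region: str, user_id: int) -> int:
    """Composite sharding key (region, user_id): both columns feed one hash,
    and the same logic applies to every request in the database."""
    key = f"{region}:{user_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS

# Deterministic: identical composite keys always land on the same shard.
assert shard_for("apac", 42) == shard_for("apac", 42)
```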

The Routing Insight 💡

Critical realization: Sharding happens through request routing, not data movement.

How it works:

```python
# Load balancer routing: the user record (not the raw ID) carries the
# sharding key, here the gender column.
def route_request(user):
    if user.gender == "male":
        return server_A
    else:
        return server_B
```

Data distribution follows automatically:

  1. Male user creates bookmark → Routed to Server A → Data stored on Server A
  2. Female user creates bookmark → Routed to Server B → Data stored on Server B

No separate "sharding process" exists; the routing algorithm itself determines data placement.

Example Flow

Ravikant (male) sends request:

  1. Load balancer checks: gender = male
  2. Routes to Server A
  3. Request creates data on Server A
  4. Ravikant's data now lives on Server A

Trisha (female) sends request:

  1. Load balancer checks: gender = female
  2. Routes to Server B
  3. Request creates data on Server B
  4. Trisha's data now lives on Server B

Future reads follow same logic:

  • Ravikant's requests → Always Server A → Finds his data
  • Trisha's requests → Always Server B → Finds her data

Key Takeaways

  1. Data location determines routing (can't route without knowing distribution)
  2. Partitioning = splitting (vertical by columns, horizontal by rows)
  3. Sharding = horizontal partitioning across servers
  4. Sharding key determines distribution logic
  5. Routing algorithm creates sharding (not separate process)