The Data Distribution Problem (Why Location Matters)

March 5, 2026 · 4 min read
system design, high level design, HLD, distributed systems, scalability, microservices, load balancing, caching, database design, API design, software architecture

The Critical Question

You have 100 servers. Josefa sends a request.

Which server should handle it?

The answer depends entirely on where Josefa's data lives.

Why Data Location Matters 🎯

Scenario:

  • Server A stores Josefa's bookmarks
  • Server B stores Abhishek's bookmarks

If the load balancer routes Josefa's request to Server B:

  1. Server B has no Josefa data
  2. Returns empty list
  3. Josefa thinks: "My bookmarks are deleted!"
  4. Catastrophic user experience

Critical principle: Route users to servers containing their data.

The Impossible Options

Option 1: Store Everything on One Server

Can we avoid distribution entirely?

No. The entire reason for 100 servers was running out of space on one server.

Option 2: Store Data Randomly

Can we just randomly place data?

No. To retrieve data, you'd need to check all 100 servers. Every. Single. Request.

Example query time:

  • Check Server 1: 10ms
  • Check Server 2: 10ms
  • ...
  • Check Server 100: 10ms
  • Total: 1,000ms (1 second!) per request

Unacceptable.
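The arithmetic above can be made explicit in a couple of lines (the 100-server and 10ms figures are taken straight from the example):

```python
# Illustrative cost model using the numbers from the text.
PER_SERVER_CHECK_MS = 10
NUM_SERVERS = 100

# Random placement: in the worst case every server must be checked in turn.
random_placement_ms = NUM_SERVERS * PER_SERVER_CHECK_MS

# Deterministic routing: exactly one server is checked.
deterministic_ms = 1 * PER_SERVER_CHECK_MS

print(random_placement_ms)  # 1000 ms (1 second)
print(deterministic_ms)     # 10 ms
```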

The Solution: Logical Distribution

Requirements:

  1. Repeatable logic (same input → same server)
  2. Fast to compute (routing happens per request)
  3. Deterministic (no randomness)

This process is called data sharding.
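One common way to satisfy all three requirements is hash-based routing. This sketch (not from the original) uses `hashlib.md5` rather than Python's built-in `hash()`, because `hash()` is salted per process and so would violate the repeatability requirement across restarts:

```python
import hashlib

NUM_SERVERS = 100

def route(user_id: str) -> int:
    """Deterministically map a user to one of NUM_SERVERS servers.

    Repeatable (same input -> same server), fast to compute, and free of
    randomness. md5 is used here as a stable hash for distribution, not
    for security.
    """
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SERVERS

# Same input always yields the same server:
assert route("josefa") == route("josefa")
```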


Understanding Partitioning

Before sharding, understand partitioning.

Partitioning = Splitting data into non-overlapping subsets

Vertical Partitioning (Split by Columns)

Original users table:

| ID | username | password | name | avatar | gender |

Split into two tables:

Users table (Authentication):

| ID | username | password |

Profiles table (User Info):

| user_id (FK) | name | avatar | gender |

Why? Prevents data anomalies:

  • Read anomalies
  • Write anomalies
  • Update anomalies
  • Delete anomalies

This is normalization.
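As a sketch, the vertical split above could look like this in SQLite (schema and sample values are illustrative, not from the original):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Authentication data and profile data live in separate tables,
# linked by a foreign key -- the vertical split shown above.
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,
        password TEXT NOT NULL
    );
    CREATE TABLE profiles (
        user_id INTEGER REFERENCES users(id),
        name TEXT,
        avatar TEXT,
        gender TEXT
    );
""")

conn.execute("INSERT INTO users VALUES (0, 'josefa', 'hashed-pw')")
conn.execute("INSERT INTO profiles VALUES (0, 'Josefa', 'a.png', 'female')")

# A join reassembles the full row when both halves are needed.
row = conn.execute("""
    SELECT u.username, p.name
    FROM users u JOIN profiles p ON p.user_id = u.id
""").fetchone()
print(row)  # ('josefa', 'Josefa')
```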

Horizontal Partitioning (Split by Rows)

Original users table:

User 0, User 1, User 2, ... User 999

Split by gender:

users_male:

User 0 (male), User 2 (male), User 4 (male)...

users_female:

User 1 (female), User 3 (female), User 5 (female)...

Same columns, different rows.
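The row split can be sketched in a few lines (toy data, with alternating genders as in the example):

```python
# Toy rows matching the example: even IDs male, odd IDs female.
users = [
    {"id": i, "gender": "male" if i % 2 == 0 else "female"}
    for i in range(6)
]

# Horizontal partitioning: same columns, different rows per partition.
users_male = [u for u in users if u["gender"] == "male"]
users_female = [u for u in users if u["gender"] == "female"]

print([u["id"] for u in users_male])    # [0, 2, 4]
print([u["id"] for u in users_female])  # [1, 3, 5]
```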


Use Cases for Horizontal Partitioning

1. Performance Optimization

IRCTC example:

  • Regular tickets: Normal table
  • Tatkal tickets: Separate table (time-critical queries need fast indexing)

Benefit: Smaller table = faster index lookups

2. Multi-Tenancy Isolation

Slack example:

  • Company A's messages: Table A
  • Company B's messages: Table B

Why separate?

  • Bug in filter code won't expose Company A data to Company B
  • Physical separation prevents data leaks
  • Compliance requirements

From Partitioning to Sharding

Vertical partitioning across servers = Microservices

  • Authentication service on Server A
  • Profile service on Server B
  • Separation of concerns

Horizontal partitioning across servers = Sharding 🎯

  • Users 0-99 on Server A
  • Users 100-199 on Server B
  • Scalability solution
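The range mapping above can be sketched as a tiny lookup (the server names are illustrative):

```python
USERS_PER_SHARD = 100
SERVERS = ["server_A", "server_B"]  # hypothetical server names

def shard_for(user_id: int) -> str:
    """Range-based sharding: users 0-99 -> server_A, 100-199 -> server_B."""
    return SERVERS[user_id // USERS_PER_SHARD]

print(shard_for(42))   # server_A
print(shard_for(150))  # server_B
```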

What is Sharding?

Definition: Horizontal partitioning distributed across multiple servers.

Why shard?

1. Storage capacity: a single server can't hold all the data.

2. Read performance: reads run in parallel across servers.

3. Write throughput: writes are distributed across servers.

Sharding vs. Replication

Critical distinction:

Sharding:

  • Different data on different servers
  • User 0 data → Server A only
  • User 1 data → Server B only
  • Data splitting

Replication:

  • Same data on multiple servers
  • User 0 data → Server A AND Server B
  • Backup copies
  • Data copying

Production systems use both:

  • Shard for capacity
  • Replicate for fault tolerance
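A minimal sketch of combining the two, assuming a hypothetical two-shard cluster where each shard has one replica:

```python
# Hypothetical cluster layout: each shard has a primary and a replica.
CLUSTER = {
    "shard_0": {"primary": "server_A", "replica": "server_C"},
    "shard_1": {"primary": "server_B", "replica": "server_D"},
}

def shard_of(user_id: int) -> str:
    # Sharding decides WHICH shard holds the data.
    return f"shard_{user_id % len(CLUSTER)}"

def write_target(user_id: int) -> str:
    # Writes go to the shard's primary; replication copies them to the replica.
    return CLUSTER[shard_of(user_id)]["primary"]

def read_targets(user_id: int) -> list:
    # Reads can be served by either copy of the same shard.
    nodes = CLUSTER[shard_of(user_id)]
    return [nodes["primary"], nodes["replica"]]

print(write_target(7))   # server_B (7 % 2 == 1 -> shard_1)
print(read_targets(7))   # ['server_B', 'server_D']
```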

The Sharding Key 🔑

Every sharding strategy needs a sharding key.

Example: Gender-based sharding

  • Sharding key: gender column
  • Males → Server A
  • Females → Server B

Rules:

1. One logic per database: you can't use different sharding keys for different server groups within the same database.

2. Composite keys allowed: a sharding key can span multiple columns, e.g. (region, user_id).

3. Consistency required: the same key must apply across all shards in the database.
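The composite-key rule can be sketched by feeding both columns into a single hash (the shard count and the use of `md5` are assumptions for illustration):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(region: str, user_id: int) -> int:
    """Composite sharding key (region, user_id): both columns feed one hash,
    and the same logic applies to every request in the database."""
    key = f"{region}:{user_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % NUM_SHARDS

# Deterministic: identical composite keys always land on the same shard.
assert shard_for("apac", 42) == shard_for("apac", 42)
```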

The Routing Insight 💡

Critical realization: Sharding happens through request routing, not data movement.

How it works:

```python
# Load balancer routing: the user record (not the raw ID) carries the
# sharding key, here the gender column.
def route_request(user):
    if user.gender == "male":
        return server_A
    else:
        return server_B
```

Data distribution follows automatically:

  1. Male user creates bookmark → Routed to Server A → Data stored on Server A
  2. Female user creates bookmark → Routed to Server B → Data stored on Server B

No separate "sharding process" exists; the routing algorithm itself determines data placement.

Example Flow

Ravikant (male) sends request:

  1. Load balancer checks: gender = male
  2. Routes to Server A
  3. Request creates data on Server A
  4. Ravikant's data now lives on Server A

Trisha (female) sends request:

  1. Load balancer checks: gender = female
  2. Routes to Server B
  3. Request creates data on Server B
  4. Trisha's data now lives on Server B

Future reads follow same logic:

  • Ravikant's requests → Always Server A → Finds his data
  • Trisha's requests → Always Server B → Finds her data

Key Takeaways

  1. Data location determines routing (can't route without knowing distribution)
  2. Partitioning = splitting (vertical by columns, horizontal by rows)
  3. Sharding = horizontal partitioning across servers
  4. Sharding key determines distribution logic
  5. Routing algorithm creates sharding (not separate process)