Bucketing and Hash Maps (Still Not Good Enough)

March 5, 2026 · 3 min read
system design · high level design · HLD · distributed systems · scalability · microservices · load balancing · caching · database design · API design · software architecture

Bucketing Algorithm

Concept: Assign contiguous user ID ranges to servers.

Users 0-99   → Server A
Users 100-199 → Server B
Users 200-299 → Server C
Users 300-399 → Server D

Implementation

```python
servers = ["A", "B", "C", "D"]

def route_request(user_id):
    total_users = 400
    num_servers = len(servers)                 # 4
    bucket_size = total_users // num_servers   # 100
    server_index = user_id // bucket_size
    return servers[server_index]
```

Why Bucketing Fails

Problem 1: Server Crash

Initial: 400 users, 4 servers, bucket size = 100

Server B crashes: 400 users, 3 servers, bucket size = 133

New distribution:

Users 0-132   → Server A (grew from 100)
Users 133-265 → Server C (shifted)
Users 266-399 → Server D (shifted)

Result: Widespread redistribution, including users whose servers never failed (worse than round robin!)
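The damage can be measured directly. A minimal sketch with the toy numbers from above (`bucket_server` is a hypothetical helper, not production code) counts every user whose assignment changes when B dies:

```python
def bucket_server(user_id, servers, total_users=400):
    # Contiguous ID ranges; remainder users are clamped into the last bucket.
    bucket_size = total_users // len(servers)
    return servers[min(user_id // bucket_size, len(servers) - 1)]

before = ["A", "B", "C", "D"]   # 4 servers, bucket size 100
after = ["A", "C", "D"]         # B crashed, bucket size 133

moved = [u for u in range(400)
         if bucket_server(u, before) != bucket_server(u, after)]
# Users who moved even though their original server is still healthy:
collateral = [u for u in moved if bucket_server(u, before) != "B"]
print(len(moved), len(collateral))  # 134 34
```

B's 100 users had to move somewhere, but 34 more (users 266-299, previously on a healthy Server C) were relocated purely because the range boundaries shifted.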

Problem 2: New User Signups

Disaster scenario:

  • New user signs up → total_users changes
  • Bucket size recalculates → range boundaries shift, relocating users around every boundary
  • Signups never stop → data movement never stops

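Even with no server failing, growth alone reshuffles users. A sketch (same hypothetical bucketing scheme as above) comparing assignments at 400 vs. 500 total users on the same 4 servers:

```python
def bucket_index(user_id, total_users, num_servers=4):
    # Bucket size is recomputed from the live user count.
    bucket_size = total_users // num_servers
    return min(user_id // bucket_size, num_servers - 1)

# Same 4 servers; only the user count grew from 400 to 500.
moved = sum(bucket_index(u, 400) != bucket_index(u, 500) for u in range(400))
print(moved)  # 150
```

150 of the original 400 users change servers just because the user base grew, with zero changes to the server fleet.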
Verdict: Even worse than round robin. Never use in production.


Hash Map Approach 🗂️

Concept: Maintain explicit user-to-server mapping.

```python
user_to_server = {
    "sanjana": "server_0",
    "prem":    "server_1",
    "bathula": "server_0",
    "pallavi": "server_3",
}

def route_request(user_id):
    return user_to_server[user_id]
```

Hash Map: The Good Parts

✅ Zero Unnecessary Movement

Server B crashes:

  1. Iterate through hash map
  2. Find users mapped to Server B
  3. Reassign only those users to surviving servers
  4. No impact on users mapped to A, C, D

Perfect! Only affected users move.
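The recovery path above can be sketched as follows (hypothetical helper; random reassignment is just one policy, least-loaded would work too):

```python
import random

user_to_server = {"sanjana": "server_0", "prem": "server_1",
                  "bathula": "server_0", "pallavi": "server_3"}

def handle_crash(mapping, dead_server, survivors):
    # Touch only the entries that pointed at the dead server.
    for user, server in mapping.items():
        if server == dead_server:
            mapping[user] = random.choice(survivors)

handle_crash(user_to_server, "server_1", ["server_0", "server_2", "server_3"])
# Only "prem" was remapped; everyone else kept their server.
```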

✅ Controlled Load Balancing

Add Server E:

```python
import random

def add_server(mapping, new_server, total_servers_after):
    # Each server's fair share after the addition.
    users_per_server = len(mapping) // total_servers_after
    # Randomly select that many users and reassign them to the new server.
    for user in random.sample(list(mapping), users_per_server):
        mapping[user] = new_server
```

Benefit: Precise control over data movement.

Hash Map: The Fatal Flaw

Storage Requirements

Calculation for 5 billion users:

Entry format: user_id (8 bytes) + server_id (4 bytes) = 12 bytes
Total memory: 5 billion × 12 bytes = 60 GB
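The arithmetic, spelled out:

```python
users = 5_000_000_000
entry_bytes = 8 + 4                          # user_id + server_id
total_gb = users * entry_bytes / 1_000_000_000
print(total_gb)  # 60.0
```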

60 GB is large but manageable with modern hardware.

Not the real problem.

The Real Problem: Synchronization 🚨

Multiple load balancers scenario:

Load Balancer 1 → hash map in memory
Load Balancer 2 → hash map in memory
Load Balancer 3 → hash map in memory

Critical requirement: All maps must be identical.

Why?

  • User can connect to any load balancer
  • Load Balancer 1 maps User X → Server A
  • Load Balancer 2 maps User X → Server B (out of sync!)
  • User X gets routed inconsistently
  • Data appears "missing"
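What an out-of-sync pair of maps looks like (hypothetical snapshot; the server names are illustrative):

```python
# LB1 applied the update that moved user_x; LB2 never received it.
lb1_map = {"user_x": "server_a"}
lb2_map = {"user_x": "server_b"}   # stale entry

# The same user's requests land on different servers depending on
# which load balancer they hit; reads can miss earlier writes.
print(lb1_map["user_x"] == lb2_map["user_x"])  # False
```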

Synchronization is Impossible

CAP Theorem preview:

You cannot simultaneously achieve all three of:

  1. Consistency (all load balancers see the same mapping)
  2. Availability (every request gets a timely answer)
  3. Partition tolerance (the system survives network splits)

Pick any two. Not all three. And since network partitions are a fact of life, the real choice is consistency vs. availability.

For real-time load balancing: we need availability and low latency.

Therefore: Keeping every load balancer's map strongly consistent is off the table.

Verdict: Hash map approach requires information exchange, violating our fifth requirement.


Algorithm Comparison

| Algorithm   | Fast | Equal Distribution | Minimal Movement | No Sync |
| ----------- | ---- | ------------------ | ---------------- | ------- |
| Round Robin | ✅   | ✅                 | ❌               | ✅      |
| Bucketing   | ✅   | ✅                 | ❌               | ✅      |
| Hash Map    | ✅   | ✅                 | ✅               | ❌      |

No algorithm satisfies all requirements.

What We Need

The impossible dream:

  • Fast computation
  • Equal load distribution
  • Dynamic server management
  • Minimal data movement
  • Deterministic without synchronization

This seems impossible.

But there's a solution: Consistent Hashing.


Key Takeaways

  1. Bucketing is worse than round robin (new users trigger reshuffling)
  2. Hash maps solve data movement perfectly
  3. Hash maps fail at synchronization (CAP forces a consistency vs. availability trade-off)
  4. CAP theorem limits distributed systems
  5. We need a new approach that satisfies all five requirements