Bucketing and Hash Maps (Still Not Good Enough)
Bucketing Algorithm
Concept: Assign contiguous user ID ranges to servers.
Users 0-99 → Server A
Users 100-199 → Server B
Users 200-299 → Server C
Users 300-399 → Server D
Implementation
servers = ["server_A", "server_B", "server_C", "server_D"]

def route_request(user_id):
    total_users = 400
    num_servers = len(servers)
    bucket_size = total_users // num_servers  # 100
    server_index = user_id // bucket_size
    return servers[server_index]
Why Bucketing Fails
Problem 1: Server Crash
Initial: 400 users, 4 servers, bucket size = 100
Server B crashes: 400 users, 3 servers, bucket size = 133
New distribution:
Users 0-132 → Server A (grew from 100)
Users 133-265 → Server C (shifted)
Users 266-399 → Server D (shifted)
Result: Widespread redistribution; many users on healthy servers are remapped too.
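The redistribution above can be measured with a short sketch (the server labels are illustrative; only Server B's 100 users actually needed to move):

```python
# A minimal sketch counting how many users get remapped when Server B
# crashes under range-based bucketing.

def bucket(user_id, bucket_size, server_labels):
    # Contiguous ID ranges map to servers; clamp the tail into the last bucket.
    index = min(user_id // bucket_size, len(server_labels) - 1)
    return server_labels[index]

TOTAL = 400
before = [bucket(u, TOTAL // 4, "ABCD") for u in range(TOTAL)]
after = [bucket(u, TOTAL // 3, "ACD") for u in range(TOTAL)]  # B removed

moved = sum(b != a for b, a in zip(before, after))
print(f"{moved} of {TOTAL} users remapped")  # 134 remapped; only B's 100 had to move
```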
Problem 2: New User Signups
Disaster scenario:
- New users keep signing up → the total user count keeps growing
- Each time the count crosses the next multiple of the server count, bucket_size recalculates and every range boundary shifts
- Routine growth triggers data movement across the system, with no server failure at all
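The boundary drift can be sketched directly (the 4-server setup and the specific user are illustrative assumptions):

```python
# Sketch: bucket boundaries drift as the user base grows, so plain
# growth remaps users who did nothing.

def server_index(user_id, total_users, num_servers=4):
    bucket_size = total_users // num_servers
    return min(user_id // bucket_size, num_servers - 1)

# User 100 sits on server 1 with 400 total users, but drifts to
# server 0 once signups push the total to 404 (bucket size 100 -> 101).
print(server_index(100, 400), server_index(100, 404))  # 1 0
```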
Verdict: Even worse than round robin. Never use in production.
Hash Map Approach 🗂️
Concept: Maintain explicit user-to-server mapping.
user_to_server = {
    "sanjana": "server_0",
    "prem": "server_1",
    "bathula": "server_0",
    "pallavi": "server_3",
}

def route_request(user_id):
    return user_to_server[user_id]
Hash Map: The Good Parts
✅ Zero Unnecessary Movement
Server B crashes:
- Iterate through hash map
- Find users mapped to Server B
- Reassign only those users to surviving servers
- No impact on users mapped to A, C, D
Perfect! Only affected users move.
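The steps above can be sketched as follows (the server names and the random reassignment policy are assumptions for illustration):

```python
import random

user_to_server = {
    "sanjana": "server_0",
    "prem": "server_1",
    "bathula": "server_0",
    "pallavi": "server_3",
}
survivors = ["server_0", "server_2", "server_3"]  # server_1 (B) crashed

for user, server in user_to_server.items():
    if server == "server_1":
        # Only users on the dead server are remapped; everyone else is untouched.
        user_to_server[user] = random.choice(survivors)

assert "server_1" not in user_to_server.values()
```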
✅ Controlled Load Balancing
Add Server E:
total_users = len(user_to_server)
total_servers = 4 + 1  # adding one
users_per_server = total_users // total_servers
# Randomly select users_per_server users
# Reassign them to Server E
Benefit: Precise control over data movement.
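A runnable version of the fragment above (the 400-user map and the random.sample selection policy are illustrative assumptions):

```python
import random

# 400 existing users spread evenly across servers 0-3.
user_to_server = {f"user_{i}": f"server_{i % 4}" for i in range(400)}

total_servers = 4 + 1  # adding Server E ("server_4")
users_per_server = len(user_to_server) // total_servers  # 80

# Move exactly users_per_server randomly chosen users to the new server.
for user in random.sample(sorted(user_to_server), users_per_server):
    user_to_server[user] = "server_4"

moved = sum(1 for s in user_to_server.values() if s == "server_4")
print(moved)  # 80: the data movement is precisely controlled
```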
Hash Map: The Fatal Flaw
Storage Requirements
Calculation for 5 billion users:
Entry format: user_id (8 bytes) + server_id (4 bytes) = 12 bytes
Total memory: 5 billion × 12 bytes = 60 GB
60 GB is large but manageable with modern hardware.
Not the real problem.
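A quick back-of-envelope check of the estimate above:

```python
# Verify the 60 GB memory estimate for 5 billion map entries.
entry_bytes = 8 + 4                       # user_id + server_id per entry
total_bytes = 5_000_000_000 * entry_bytes
print(total_bytes / 10**9)                # 60.0 (GB)
```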
The Real Problem: Synchronization 🚨
Multiple load balancers scenario:
Load Balancer 1 → Hash map in memory
Load Balancer 2 → Hash map in memory
Load Balancer 3 → Hash map in memory
Critical requirement: All maps must be identical.
Why?
- User can connect to any load balancer
- Load Balancer 1 maps User X → Server A
- Load Balancer 2 maps User X → Server B (out of sync!)
- User X gets routed inconsistently
- Data appears "missing"
Synchronization is Impossible
CAP Theorem preview:
In a distributed system that must tolerate network partitions, you cannot simultaneously guarantee:
- Consistency (all load balancers see the same data)
- Availability (every request gets an answer, fast)
During a partition you must pick one, and even without partitions, keeping every copy consistent costs latency on every update.
For real-time load balancing: we need availability and low latency.
Therefore: Keeping every load balancer's hash map perfectly consistent is off the table.
Verdict: Hash map approach requires information exchange, violating our fifth requirement.
Algorithm Comparison
| Algorithm | Fast | Equal Distribution | Minimal Movement | No Sync |
|---|---|---|---|---|
| Round Robin | ✅ | ✅ | ❌ | ✅ |
| Bucketing | ✅ | ✅ | ❌ | ✅ |
| Hash Map | ✅ | ✅ | ✅ | ❌ |
No algorithm satisfies all requirements.
What We Need
The impossible dream:
- Fast computation
- Equal load distribution
- Dynamic server management
- Minimal data movement
- Deterministic without synchronization
This seems impossible.
But there's a solution: Consistent Hashing.
Key Takeaways
- Bucketing is worse than round robin (new users trigger reshuffling)
- Hash maps solve data movement perfectly
- Hash maps fail at synchronization (the CAP theorem forces the trade-off)
- CAP theorem limits distributed systems
- We need a new approach that satisfies all five requirements