Bucketing and Hash Maps (Still Not Good Enough)

March 5, 2026 · 3 min read
system design · high level design · HLD · distributed systems · scalability · microservices · load balancing · caching · database design · API design · software architecture

Bucketing Algorithm

Concept: Assign contiguous user ID ranges to servers.

Users 0-99   → Server A
Users 100-199 → Server B
Users 200-299 → Server C
Users 300-399 → Server D

Implementation

```python
servers = ["A", "B", "C", "D"]

def route_request(user_id):
    total_users = 400
    num_servers = len(servers)                 # 4
    bucket_size = total_users // num_servers   # 100
    server_index = user_id // bucket_size
    return servers[server_index]
```

Why Bucketing Fails

Problem 1: Server Crash

Initial: 400 users, 4 servers, bucket size = 100

Server B crashes: 400 users, 3 servers, bucket size = 133

New distribution:

Users 0-132   → Server A (grew from 100)
Users 133-265 → Server C (shifted)
Users 266-399 → Server D (shifted)

Result: Widespread redistribution, including users whose servers never failed (worse than round robin!)
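The damage can be measured directly. A minimal sketch with the toy numbers from above (`bucket_server` is a hypothetical helper, not production code) counts every user whose assignment changes when B dies:

```python
def bucket_server(user_id, servers, total_users=400):
    # Contiguous ID ranges; remainder users are clamped into the last bucket.
    bucket_size = total_users // len(servers)
    return servers[min(user_id // bucket_size, len(servers) - 1)]

before = ["A", "B", "C", "D"]   # 4 servers, bucket size 100
after = ["A", "C", "D"]         # B crashed, bucket size 133

moved = [u for u in range(400)
         if bucket_server(u, before) != bucket_server(u, after)]
# Users who moved even though their original server is still healthy:
collateral = [u for u in moved if bucket_server(u, before) != "B"]
print(len(moved), len(collateral))  # 134 34
```

B's 100 users had to move somewhere, but 34 more (users 266-299, previously on a healthy Server C) were relocated purely because the range boundaries shifted.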

Problem 2: New User Signups

Disaster scenario:

  • New user signs up → total_users changes
  • Bucket size recalculates → range boundaries shift, relocating users around every boundary
  • Signups never stop → data movement never stops

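Even with no server failing, growth alone reshuffles users. A sketch (same hypothetical bucketing scheme as above) comparing assignments at 400 vs. 500 total users on the same 4 servers:

```python
def bucket_index(user_id, total_users, num_servers=4):
    # Bucket size is recomputed from the live user count.
    bucket_size = total_users // num_servers
    return min(user_id // bucket_size, num_servers - 1)

# Same 4 servers; only the user count grew from 400 to 500.
moved = sum(bucket_index(u, 400) != bucket_index(u, 500) for u in range(400))
print(moved)  # 150
```

150 of the original 400 users change servers just because the user base grew, with zero changes to the server fleet.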
Verdict: Even worse than round robin. Never use in production.


Hash Map Approach 🗂️

Concept: Maintain explicit user-to-server mapping.

```python
user_to_server = {
    "sanjana": "server_0",
    "prem":    "server_1",
    "bathula": "server_0",
    "pallavi": "server_3",
}

def route_request(user_id):
    return user_to_server[user_id]
```

Hash Map: The Good Parts

✅ Zero Unnecessary Movement

Server B crashes:

  1. Iterate through hash map
  2. Find users mapped to Server B
  3. Reassign only those users to surviving servers
  4. No impact on users mapped to A, C, D

Perfect! Only affected users move.
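The recovery path above can be sketched as follows (hypothetical helper; random reassignment is just one policy, least-loaded would work too):

```python
import random

user_to_server = {"sanjana": "server_0", "prem": "server_1",
                  "bathula": "server_0", "pallavi": "server_3"}

def handle_crash(mapping, dead_server, survivors):
    # Touch only the entries that pointed at the dead server.
    for user, server in mapping.items():
        if server == dead_server:
            mapping[user] = random.choice(survivors)

handle_crash(user_to_server, "server_1", ["server_0", "server_2", "server_3"])
# Only "prem" was remapped; everyone else kept their server.
```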

✅ Controlled Load Balancing

Add Server E:

```python
import random

def add_server(mapping, new_server, total_servers_after):
    # Each server's fair share after the addition.
    users_per_server = len(mapping) // total_servers_after
    # Randomly select that many users and reassign them to the new server.
    for user in random.sample(list(mapping), users_per_server):
        mapping[user] = new_server
```

Benefit: Precise control over data movement.

Hash Map: The Fatal Flaw

Storage Requirements

Calculation for 5 billion users:

Entry format: user_id (8 bytes) + server_id (4 bytes) = 12 bytes
Total memory: 5 billion × 12 bytes = 60 GB
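The arithmetic, spelled out:

```python
users = 5_000_000_000
entry_bytes = 8 + 4                          # user_id + server_id
total_gb = users * entry_bytes / 1_000_000_000
print(total_gb)  # 60.0
```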

60 GB is large but manageable with modern hardware.

Not the real problem.

The Real Problem: Synchronization 🚨

Multiple load balancers scenario:

Load Balancer 1 → hash map in memory
Load Balancer 2 → hash map in memory
Load Balancer 3 → hash map in memory

Critical requirement: All maps must be identical.

Why?

  • User can connect to any load balancer
  • Load Balancer 1 maps User X → Server A
  • Load Balancer 2 maps User X → Server B (out of sync!)
  • User X gets routed inconsistently
  • Data appears "missing"
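What an out-of-sync pair of maps looks like (hypothetical snapshot; the server names are illustrative):

```python
# LB1 applied the update that moved user_x; LB2 never received it.
lb1_map = {"user_x": "server_a"}
lb2_map = {"user_x": "server_b"}   # stale entry

# The same user's requests land on different servers depending on
# which load balancer they hit; reads can miss earlier writes.
print(lb1_map["user_x"] == lb2_map["user_x"])  # False
```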

Synchronization is Impossible

CAP Theorem preview:

You cannot simultaneously achieve all three of:

  1. Consistency (all load balancers see the same mapping)
  2. Availability (every request gets a timely answer)
  3. Partition tolerance (the system survives network splits)

Pick any two. Not all three. And since network partitions are a fact of life, the real choice is consistency vs. availability.

For real-time load balancing: we need availability and low latency.

Therefore: Keeping every load balancer's map strongly consistent is off the table.

Verdict: Hash map approach requires information exchange, violating our fifth requirement.


Algorithm Comparison

| Algorithm   | Fast | Equal Distribution | Minimal Movement | No Sync |
| ----------- | ---- | ------------------ | ---------------- | ------- |
| Round Robin | ✅   | ✅                 | ❌               | ✅      |
| Bucketing   | ✅   | ✅                 | ❌               | ✅      |
| Hash Map    | ✅   | ✅                 | ✅               | ❌      |

No algorithm satisfies all requirements.

What We Need

The impossible dream:

  • Fast computation
  • Equal load distribution
  • Dynamic server management
  • Minimal data movement
  • Deterministic without synchronization

This seems impossible.

But there's a solution: Consistent Hashing.


Key Takeaways

  1. Bucketing is worse than round robin (new users trigger reshuffling)
  2. Hash maps solve data movement perfectly
  3. Hash maps fail at synchronization (CAP forces a consistency vs. availability trade-off)
  4. CAP theorem limits distributed systems
  5. We need a new approach that satisfies all five requirements