Round Robin Load Balancing (And Why It Breaks)
Requirements for Good Routing Algorithms
Before evaluating algorithms, establish criteria:
1. Fast Computation ⚡
Routing happens per request. Slow algorithm = system bottleneck.
2. Equal Load Distribution
No server should handle 90% of load while others sit idle.
3. Dynamic Server Management
Must handle server crashes and additions seamlessly.
4. Minimal Data Movement
When servers change, move only necessary data.
5. Deterministic Without Sync
Same input → Same output. No coordination between load balancers needed.
Round Robin Algorithm
Simplest possible approach: Distribute requests in circular order.
Pattern:
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server D
Request 5 → Server A (cycle repeats)Implementation
def route_request(user_id, servers):
n = len(servers)
server_index = user_id % n
return servers[server_index]Example with 4 servers:
User 0 → 0 % 4 = 0 → Server A
User 1 → 1 % 4 = 1 → Server B
User 2 → 2 % 4 = 2 → Server C
User 3 → 3 % 4 = 3 → Server D
User 4 → 4 % 4 = 0 → Server A
User 5 → 5 % 4 = 1 → Server BCreates a "ring" of servers.
Evaluating Round Robin
✅ Pros
1. Extremely Fast Modulo operation: O(1) time complexity
2. Equal Distribution Each server gets ~25% of load (with 4 servers)
3. No Synchronization Needed Each load balancer independently calculates same result
❌ The Fatal Flaw: Data Movement
Scenario: Server B crashes
Initial state (4 servers):
Servers: [A, B, C, D]
n = 4
User 0 → 0 % 4 = 0 → Server A
User 1 → 1 % 4 = 1 → Server B
User 2 → 2 % 4 = 2 → Server C
User 3 → 3 % 4 = 3 → Server D
User 4 → 4 % 4 = 0 → Server A
User 5 → 5 % 4 = 1 → Server BAfter Server B crashes:
Servers: [A, C, D] // B removed by health check
n = 3
User 0 → 0 % 3 = 0 → Server A ✓ (unchanged)
User 1 → 1 % 3 = 1 → Server C ✗ (moved from B - NECESSARY)
User 2 → 2 % 3 = 2 → Server D ✗ (moved from C - UNNECESSARY!)
User 3 → 3 % 3 = 0 → Server A ✗ (moved from D - UNNECESSARY!)
User 4 → 4 % 3 = 1 → Server C ✗ (moved from A - UNNECESSARY!)
User 5 → 5 % 3 = 2 → Server D ✗ (moved from B - NECESSARY)The Problem 🚨
Required movements: Only users on Server B (Users 1, 5, 9, 13...)
Actual movements: Nearly EVERY user gets reassigned
Why? Changing denominator in user_id % n causes complete redistribution.
The Mathematics of Failure
Pattern before crash: (1, 2, 3, 4, 1, 2, 3, 4, ...) Pattern after crash: (1, 2, 3, 1, 2, 3, 1, 2, 3, ...)
Impact:
- User 2 unnecessarily moves C → D
- User 3 unnecessarily moves D → A
- User 4 unnecessarily moves A → C
- And so on...
At scale:
- 1 billion users
- 100 servers
- 1 server crashes
- ~990 million unnecessary user data movements
Unacceptable.
When Round Robin Works
Only viable when: Server count is fixed and never changes.
Example: Elasticsearch
- Enforces fixed shard count
- Master-slave replication prevents shard changes
- Round robin becomes safe
For dynamic server environments: Don't use round robin.
Key Takeaways
- Round robin is simple and fast
- Equal distribution works well
- Fatal flaw: Massive data movement when
nchanges - Root cause: Modulo denominator change affects all users
- Use case: Only for fixed server counts
Next article: We'll examine the bucketing algorithm and hash map approach—both have their own critical flaws.