Round Robin Load Balancing (And Why It Breaks)

March 5, 2026 · 3 min read
system design · high level design · HLD · distributed systems · scalability · microservices · load balancing · caching · database design · API design · software architecture

Requirements for Good Routing Algorithms

Before evaluating algorithms, establish criteria:

1. Fast Computation ⚡

Routing happens per request. Slow algorithm = system bottleneck.

2. Equal Load Distribution

No server should handle 90% of load while others sit idle.

3. Dynamic Server Management

Must handle server crashes and additions seamlessly.

4. Minimal Data Movement

When servers change, move only necessary data.

5. Deterministic Without Sync

Same input → Same output. No coordination between load balancers needed.

Round Robin Algorithm

Simplest possible approach: Distribute requests in circular order.

Pattern:

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server D
Request 5 → Server A (cycle repeats)
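Assuming a single load balancer, the circular pattern above can be sketched with Python's `itertools.cycle` (the server names are the same placeholders used throughout):

```python
import itertools

servers = ["A", "B", "C", "D"]
ring = itertools.cycle(servers)  # endless A, B, C, D, A, B, ...

# The first five requests follow the circular pattern above.
assignments = [next(ring) for _ in range(5)]
print(assignments)  # ['A', 'B', 'C', 'D', 'A']
```

A stateful counter like this works for one balancer; the per-user modulo variant below removes the shared counter entirely.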

Implementation

```python
def route_request(user_id, servers):
    n = len(servers)
    server_index = user_id % n
    return servers[server_index]
```

Example with 4 servers:

User 0 → 0 % 4 = 0 → Server A
User 1 → 1 % 4 = 1 → Server B
User 2 → 2 % 4 = 2 → Server C
User 3 → 3 % 4 = 3 → Server D
User 4 → 4 % 4 = 0 → Server A
User 5 → 5 % 4 = 1 → Server B

Creates a "ring" of servers.
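As a quick sanity check of the distribution, the implementation above can be run over a batch of users and tallied (the 1,000-user range and the `Counter` tally are illustrative assumptions):

```python
from collections import Counter

def route_request(user_id, servers):
    n = len(servers)
    return servers[user_id % n]

servers = ["A", "B", "C", "D"]

# Tally which server each of 1,000 consecutive user IDs lands on.
counts = Counter(route_request(uid, servers) for uid in range(1000))
print(counts)  # every server receives exactly 250 of the 1000 users
```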

Evaluating Round Robin

✅ Pros

1. Extremely Fast

Modulo is a single O(1) operation.

2. Equal Distribution

Each server gets ~25% of the load (with 4 servers).

3. No Synchronization Needed

Each load balancer independently calculates the same result.

❌ The Fatal Flaw: Data Movement

Scenario: Server B crashes

Initial state (4 servers):

Servers: [A, B, C, D]
n = 4

User 0 → 0 % 4 = 0 → Server A
User 1 → 1 % 4 = 1 → Server B
User 2 → 2 % 4 = 2 → Server C
User 3 → 3 % 4 = 3 → Server D
User 4 → 4 % 4 = 0 → Server A
User 5 → 5 % 4 = 1 → Server B

After Server B crashes:

Servers: [A, C, D]  // B removed by health check
n = 3

User 0 → 0 % 3 = 0 → Server A ✓ (unchanged)
User 1 → 1 % 3 = 1 → Server C ✗ (moved from B - NECESSARY)
User 2 → 2 % 3 = 2 → Server D ✗ (moved from C - UNNECESSARY!)
User 3 → 3 % 3 = 0 → Server A ✗ (moved from D - UNNECESSARY!)
User 4 → 4 % 3 = 1 → Server C ✗ (moved from A - UNNECESSARY!)
User 5 → 5 % 3 = 2 → Server D ✗ (moved from B - NECESSARY)

The Problem 🚨

Required movements: Only users on Server B (Users 1, 5, 9, 13...)

Actual movements: Nearly EVERY user gets reassigned

Why? Changing the denominator in user_id % n reshuffles almost every assignment.
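The waste can be measured directly. A minimal sketch that replays the crash scenario above and counts necessary versus unnecessary moves (the one-million-user range is an illustrative assumption):

```python
def route_request(user_id, servers):
    return servers[user_id % len(servers)]

before = ["A", "B", "C", "D"]
after = ["A", "C", "D"]  # B removed by health check

necessary = unnecessary = 0
for uid in range(1_000_000):
    old, new = route_request(uid, before), route_request(uid, after)
    if old == new:
        continue  # assignment survived the crash
    if old == "B":
        necessary += 1    # had to move: B is gone
    else:
        unnecessary += 1  # moved even though its old server is healthy

print(necessary, unnecessary)  # 250000 necessary vs 500000 unnecessary moves
```

Two out of every three moves are avoidable: only the users stranded on B needed to go anywhere.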

The Mathematics of Failure

Pattern before crash: (1, 2, 3, 4, 1, 2, 3, 4, ...)
Pattern after crash:  (1, 2, 3, 1, 2, 3, 1, 2, 3, ...)

Impact:

  • User 2 unnecessarily moves C → D
  • User 3 unnecessarily moves D → A
  • User 4 unnecessarily moves A → C
  • And so on...

At scale:

  • 1 billion users
  • 100 servers
  • 1 server crashes
  • ~980 million unnecessary user data movements (nearly every user is remapped)

Unacceptable.
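The at-scale numbers follow from how rarely uid % n and uid % (n-1) agree. A sketch, with the 100-server cluster and a one-million-user sample as illustrative assumptions:

```python
def unchanged_fraction(n_users, n_before, n_after):
    """Fraction of users whose uid % n assignment survives a resize."""
    same = sum(uid % n_before == uid % n_after for uid in range(n_users))
    return same / n_users

# Going from 100 servers to 99 leaves only ~1% of assignments intact,
# so almost every user in a billion-user system would be remapped.
frac = unchanged_fraction(1_000_000, 100, 99)
print(f"{frac:.2%} of users keep their server")
```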

When Round Robin Works

Only viable when: Server count is fixed and never changes.

Example: Elasticsearch

  • The primary shard count of an index is fixed at creation time
  • Replicas can be added later, but primary shards cannot
  • With a constant denominator, modulo routing stays safe

For dynamic server environments: Don't use round robin.


Key Takeaways

  1. Round robin is simple and fast
  2. Equal distribution works well
  3. Fatal flaw: Massive data movement when n changes
  4. Root cause: Modulo denominator change affects all users
  5. Use case: Only for fixed server counts

Next article: We'll examine the bucketing algorithm and the hash map approach, both of which have their own critical flaws.