Round Robin Load Balancing (And Why It Breaks)

March 5, 2026 · 3 min read
system design · high level design · HLD · distributed systems · scalability · microservices · load balancing · caching · database design · API design · software architecture

Requirements for Good Routing Algorithms

Before evaluating algorithms, establish criteria:

1. Fast Computation ⚡

Routing happens per request. Slow algorithm = system bottleneck.

2. Equal Load Distribution

No server should handle 90% of load while others sit idle.

3. Dynamic Server Management

Must handle server crashes and additions seamlessly.

4. Minimal Data Movement

When servers change, move only necessary data.

5. Deterministic Without Sync

Same input → Same output. No coordination between load balancers needed.

Round Robin Algorithm

Simplest possible approach: Distribute requests in circular order.

Pattern:

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server D
Request 5 → Server A (cycle repeats)
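Assuming a single load balancer, the circular pattern above can be sketched with Python's `itertools.cycle` (the server names are the same placeholders used throughout):

```python
import itertools

servers = ["A", "B", "C", "D"]
ring = itertools.cycle(servers)  # endless A, B, C, D, A, B, ...

# The first five requests follow the circular pattern above.
assignments = [next(ring) for _ in range(5)]
print(assignments)  # ['A', 'B', 'C', 'D', 'A']
```

A stateful counter like this works for one balancer; the per-user modulo variant below removes the shared counter entirely.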

Implementation

```python
def route_request(user_id, servers):
    n = len(servers)
    server_index = user_id % n
    return servers[server_index]
```

Example with 4 servers:

User 0 → 0 % 4 = 0 → Server A
User 1 → 1 % 4 = 1 → Server B
User 2 → 2 % 4 = 2 → Server C
User 3 → 3 % 4 = 3 → Server D
User 4 → 4 % 4 = 0 → Server A
User 5 → 5 % 4 = 1 → Server B

Creates a "ring" of servers.
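As a quick sanity check of the distribution, the implementation above can be run over a batch of users and tallied (the 1,000-user range and the `Counter` tally are illustrative assumptions):

```python
from collections import Counter

def route_request(user_id, servers):
    n = len(servers)
    return servers[user_id % n]

servers = ["A", "B", "C", "D"]

# Tally which server each of 1,000 consecutive user IDs lands on.
counts = Counter(route_request(uid, servers) for uid in range(1000))
print(counts)  # every server receives exactly 250 of the 1000 users
```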

Evaluating Round Robin

✅ Pros

1. Extremely Fast

Modulo is a single O(1) operation.

2. Equal Distribution

Each server gets ~25% of the load (with 4 servers).

3. No Synchronization Needed

Each load balancer independently calculates the same result.

❌ The Fatal Flaw: Data Movement

Scenario: Server B crashes

Initial state (4 servers):

Servers: [A, B, C, D]
n = 4

User 0 → 0 % 4 = 0 → Server A
User 1 → 1 % 4 = 1 → Server B
User 2 → 2 % 4 = 2 → Server C
User 3 → 3 % 4 = 3 → Server D
User 4 → 4 % 4 = 0 → Server A
User 5 → 5 % 4 = 1 → Server B

After Server B crashes:

Servers: [A, C, D]  // B removed by health check
n = 3

User 0 → 0 % 3 = 0 → Server A ✓ (unchanged)
User 1 → 1 % 3 = 1 → Server C ✗ (moved from B - NECESSARY)
User 2 → 2 % 3 = 2 → Server D ✗ (moved from C - UNNECESSARY!)
User 3 → 3 % 3 = 0 → Server A ✗ (moved from D - UNNECESSARY!)
User 4 → 4 % 3 = 1 → Server C ✗ (moved from A - UNNECESSARY!)
User 5 → 5 % 3 = 2 → Server D ✗ (moved from B - NECESSARY)

The Problem 🚨

Required movements: Only users on Server B (Users 1, 5, 9, 13...)

Actual movements: Nearly EVERY user gets reassigned

Why? Changing the denominator in user_id % n reshuffles almost every assignment.
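The waste can be measured directly. A minimal sketch that replays the crash scenario above and counts necessary versus unnecessary moves (the one-million-user range is an illustrative assumption):

```python
def route_request(user_id, servers):
    return servers[user_id % len(servers)]

before = ["A", "B", "C", "D"]
after = ["A", "C", "D"]  # B removed by health check

necessary = unnecessary = 0
for uid in range(1_000_000):
    old, new = route_request(uid, before), route_request(uid, after)
    if old == new:
        continue  # assignment survived the crash
    if old == "B":
        necessary += 1    # had to move: B is gone
    else:
        unnecessary += 1  # moved even though its old server is healthy

print(necessary, unnecessary)  # 250000 necessary vs 500000 unnecessary moves
```

Two out of every three moves are avoidable: only the users stranded on B needed to go anywhere.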

The Mathematics of Failure

Pattern before crash: (1, 2, 3, 4, 1, 2, 3, 4, ...)
Pattern after crash:  (1, 2, 3, 1, 2, 3, 1, 2, 3, ...)

Impact:

  • User 2 unnecessarily moves C → D
  • User 3 unnecessarily moves D → A
  • User 4 unnecessarily moves A → C
  • And so on...

At scale:

  • 1 billion users
  • 100 servers
  • 1 server crashes
  • ~980 million unnecessary user data movements (nearly every user is remapped)

Unacceptable.
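The at-scale numbers follow from how rarely uid % n and uid % (n-1) agree. A sketch, with the 100-server cluster and a one-million-user sample as illustrative assumptions:

```python
def unchanged_fraction(n_users, n_before, n_after):
    """Fraction of users whose uid % n assignment survives a resize."""
    same = sum(uid % n_before == uid % n_after for uid in range(n_users))
    return same / n_users

# Going from 100 servers to 99 leaves only ~1% of assignments intact,
# so almost every user in a billion-user system would be remapped.
frac = unchanged_fraction(1_000_000, 100, 99)
print(f"{frac:.2%} of users keep their server")
```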

When Round Robin Works

Only viable when: Server count is fixed and never changes.

Example: Elasticsearch

  • The primary shard count of an index is fixed at creation time
  • Replicas can be added later, but primary shards cannot
  • With a constant denominator, modulo routing stays safe

For dynamic server environments: Don't use round robin.


Key Takeaways

  1. Round robin is simple and fast
  2. Equal distribution works well
  3. Fatal flaw: Massive data movement when n changes
  4. Root cause: Modulo denominator change affects all users
  5. Use case: Only for fixed server counts

Next article: We'll examine the bucketing algorithm and the hash map approach, both of which have their own critical flaws.