Scaling and Load Balancers

Vertical vs horizontal scaling, load balancer architecture, health checks, SSL/TLS termination, active-passive failover, and geo-DNS.

April 4, 2026 · 7 min read

Scaling: Vertical vs Horizontal

When one server is not enough, you have two options.

Vertical scaling means replacing the existing server with a more powerful one. More CPU cores, more RAM, faster storage. It is the simplest path and requires no code changes, but it has a hard ceiling -- machines only get so powerful -- and cost climbs far faster than linearly as you approach the high end.

Horizontal scaling means adding more servers and distributing the load across all of them. There is no ceiling. You can keep adding machines. The cost grows linearly with capacity, which makes it far more economical at scale.

[Figure: Vertical vs Horizontal Scaling]

The moment you go horizontal, though, a new problem appears: which server's IP address does the DNS point to? You cannot point the DNS at every server at once and leave it to the client to pick one. You need something in between.

The Load Balancer

The solution is a load balancer -- a dedicated server that sits in front of all your app servers and handles two jobs:

  1. Unified view -- from the client's perspective, the entire backend looks like one address. The client sends a request to one IP and does not know how many machines are behind it.
  2. Even load distribution -- the load balancer decides which backend server handles which request, keeping the traffic spread equally so no single server gets overwhelmed.
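The "spread equally" part can be as simple as round-robin: hand each new request to the next server in a fixed rotation. A minimal sketch of that idea (the backend URLs are made up, and a real load balancer does this in the packet-forwarding path, not in application code):

```typescript
// round-robin.ts — a toy round-robin backend picker
const BACKENDS = [
  "http://server-1:3000",
  "http://server-2:3000",
  "http://server-3:3000",
];

let next = 0;

// Each incoming request gets the next server in rotation,
// wrapping around so every server receives the same share.
function pickBackend(): string {
  const backend = BACKENDS[next];
  next = (next + 1) % BACKENDS.length;
  return backend;
}
```

Call pickBackend() once per request and forward there; over time the traffic splits evenly across all three servers.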

[Figure: Load Balancer Architecture]

A load balancer is simpler than an app server by design. An app server has to decrypt the request, deserialize it, check authorization, query the database, process data, serialize a response, and send it back. A load balancer only reads the incoming IP and forwards the packet to the right destination.

This simplicity is why the capacity numbers are so different:

Component        Typical capacity
App server       100 to 1,000 req/s
Load balancer    100,000+ req/s

Tracking Healthy Servers

Servers go down. New servers get added. The load balancer must know the current health of every server -- routing a request to a dead machine means that request fails for the user.

Two mechanisms handle this:

Heartbeat (push-based): Each app server sends a periodic signal to the load balancer: "I am alive." If the signal stops, the load balancer marks that server as down.

Health check (pull-based): The load balancer periodically sends a request to each app server and checks whether it responds correctly. No signal from the server needed.

Most production systems use health checks. The responsibility sits entirely with the load balancer -- app servers do not need any extra logic to ping anyone.

Setting up a heartbeat (push)

The app server owns the loop. It pings the load balancer on a fixed interval:

TypeScript
// app-server.ts — push a heartbeat every 5 seconds
const LOAD_BALANCER_URL = "http://lb-internal/heartbeat";
const SERVER_ID = process.env.SERVER_ID ?? "server-1";

async function sendHeartbeat() {
  try {
    await fetch(LOAD_BALANCER_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ id: SERVER_ID, ts: Date.now() }),
    });
  } catch {
    // network blip — next beat will retry
  }
}

setInterval(sendHeartbeat, 5_000);

The load balancer records the last-seen timestamp per server. If it goes stale (say, no beat for 15 s), the server is marked unhealthy.
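On the load balancer's side, the bookkeeping is a map of last-seen timestamps plus a staleness sweep. A sketch of that logic (the 15 s threshold mirrors the example above; the server ids are illustrative):

```typescript
// lb-heartbeat.ts — record beats, flag servers silent for too long
const STALE_AFTER_MS = 15_000;

const lastSeen = new Map<string, number>(); // server id -> last beat time

// Called by the POST /heartbeat handler for each incoming beat.
function recordHeartbeat(serverId: string, ts: number) {
  lastSeen.set(serverId, ts);
}

// Called periodically; returns ids whose last beat is too old.
function findStaleServers(now: number): string[] {
  const stale: string[] = [];
  for (const [id, ts] of lastSeen) {
    if (now - ts > STALE_AFTER_MS) stale.push(id);
  }
  return stale;
}
```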

Setting up a health check (pull)

The load balancer polls. The app server only needs to expose one endpoint:

TypeScript
// app-server.ts — expose GET /health
import express from "express";

const app = express();

app.get("/health", (_req, res) => {
  // Check anything real: DB reachable, memory not swapped, etc.
  const dbOk = true; // replace with an actual ping
  if (dbOk) {
    res.status(200).json({ status: "ok", ts: Date.now() });
  } else {
    res.status(503).json({ status: "degraded" });
  }
});

app.listen(3000);

The load balancer polls GET /health every 10 seconds. A 200 means healthy. A 5xx or timeout increments the failure counter.

TypeScript
// load-balancer-monitor.ts — pull-based health checker
interface ServerState {
  url: string;
  healthy: boolean;
  failures: number;
}

const SERVERS: ServerState[] = [
  { url: "http://server-1:3000", healthy: true, failures: 0 },
  { url: "http://server-2:3000", healthy: true, failures: 0 },
];

const POLL_INTERVAL_MS = 10_000;
const FAILURE_THRESHOLD = 3; // consecutive misses before marking down

async function checkServer(server: ServerState) {
  try {
    const res = await fetch(`${server.url}/health`, {
      signal: AbortSignal.timeout(2_000),
    });
    if (res.ok) {
      server.failures = 0;
      server.healthy = true;
    } else {
      throw new Error(`HTTP ${res.status}`);
    }
  } catch {
    server.failures += 1;
    if (server.failures >= FAILURE_THRESHOLD) {
      server.healthy = false;
      console.warn(`${server.url} marked unhealthy after ${server.failures} failures`);
    }
  }
}

setInterval(() => {
  SERVERS.forEach(checkServer);
}, POLL_INTERVAL_MS);

Why consecutive failures? A single missed check could be a slow GC pause, a dropped packet, or a brief load spike -- not a crashed server. Transient hiccups like these happen constantly, so the load balancer marks a server dead only after several consecutive timeouts (3 in the code above). This filters out noise while still detecting real failures quickly.

The Load Balancer as a Single Point of Failure

A load balancer can fail too. And at massive scale, even a healthy single load balancer becomes a bottleneck.

At Google's scale -- roughly 10 million requests per second -- one load balancer handling 100,000 req/s is not even close to enough. You need many.

The solution is to run multiple load balancers in parallel and put the DNS in front of them. The client asks the DNS for the server address, and the DNS returns the IP of one of the available load balancers -- either the nearest one geographically or a randomly selected one. This eliminates both problems:

  • No single point of failure: if one load balancer goes down, DNS routes traffic to the others.
  • No bottleneck: traffic is spread across all load balancers.

[Figure: Multiple Load Balancers with DNS]
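Conceptually, the DNS layer just hands out one of the load balancer IPs per query. A toy model of the random-selection variant (the IPs are made up; real geo- or latency-based selection lives in the DNS provider's configuration, not in code like this):

```typescript
// dns-pick.ts — toy model of DNS spreading clients across load balancers
const LB_IPS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"];

// Each DNS query receives one of the available load balancer IPs at
// random, so across many clients traffic spreads over all of them.
function resolveLoadBalancer(available: string[]): string {
  const i = Math.floor(Math.random() * available.length);
  return available[i];
}
```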

SSL/TLS Termination

HTTPS connections between the client and the server are encrypted. But where does that encryption end?

In most architectures, it ends at the load balancer. This boundary is called the SSL/TLS termination point. Everything from the client to the load balancer is encrypted. Traffic from the load balancer to the app servers usually is not, because inside your own data center you can trust the machines talking to each other. (Stricter zero-trust setups re-encrypt this internal hop, but termination at the load balancer is the common default.)

This design offloads the cryptographic work (which is CPU-intensive) from the app servers onto the load balancer layer, which is better positioned to handle it at scale.
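In Node terms, termination means the public listener is an https server while the hop to the backend is plain http. A stripped-down sketch of that shape (the cert paths and the server-1 upstream address are placeholders; a real deployment uses a hardened proxy, not app code):

```typescript
// tls-terminator.ts — accept HTTPS, forward plain HTTP internally
import https from "node:https";
import http from "node:http";
import fs from "node:fs";
import type { IncomingMessage } from "node:http";

// Build the plaintext upstream request from the decrypted client request.
function upstreamOptions(req: IncomingMessage) {
  return {
    host: "server-1", // placeholder internal hostname
    port: 3000,
    path: req.url,
    method: req.method,
    headers: req.headers,
  };
}

function startTerminator() {
  const tlsOptions = {
    key: fs.readFileSync("/etc/lb/tls/key.pem"), // placeholder cert paths
    cert: fs.readFileSync("/etc/lb/tls/cert.pem"),
  };
  https
    .createServer(tlsOptions, (clientReq, clientRes) => {
      // TLS is already stripped here; the internal hop is plain HTTP.
      const upstream = http.request(upstreamOptions(clientReq), (res) => {
        clientRes.writeHead(res.statusCode ?? 502, res.headers);
        res.pipe(clientRes);
      });
      clientReq.pipe(upstream);
    })
    .listen(443);
}
```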

Load Balancer Failure and Active-Passive Failover

If a load balancer crashes and DNS still points to its IP address, clients get a connection failure. Two standard ways to handle this:

First: DNS can return multiple IP addresses for the same hostname. Clients try each one until one succeeds. Most operating systems and HTTP clients implement this by default.
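Node exposes this directly: dns.promises.resolve4 can return several A records for one hostname, and the client walks the list until something answers. A sketch of that retry loop, with the probe injected so the logic stands alone (the IPs below are illustrative):

```typescript
// try-each-ip.ts — try each resolved IP until one connection succeeds
async function firstReachable(
  ips: string[],
  probe: (ip: string) => Promise<boolean>, // e.g. attempt an HTTP connect
): Promise<string | null> {
  for (const ip of ips) {
    try {
      if (await probe(ip)) return ip; // first IP that answers wins
    } catch {
      // connection refused or timed out — fall through to the next IP
    }
  }
  return null; // every address failed
}
```

In practice `ips` would come from `dns.promises.resolve4(hostname)` and `probe` would open a real connection.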

Second: The load balancer machine can be backed by an active-passive cluster behind the same IP address.

The key insight is that an IP address is not tied to a specific machine. It is tied to a network interface -- specifically, to the cable or virtual network link plugged in. If you replace the machine on the same link, the replacement inherits the IP address.

In practice: you run two (or more) load balancer machines. One is active, one is on standby. Both are configured behind the same IP. The moment the active machine becomes unresponsive, the networking layer (via a protocol like VRRP or a cloud provider's floating IP mechanism) automatically routes traffic to the standby. From the client's perspective, nothing changed.

This is called an active-passive failover. No client-side change is required and no DNS update is needed.
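The election behind the failover is simple state: the active node holds the virtual IP as long as it responds, and a standby promotes itself when it stops. A toy model of just that decision (real systems do this at the network layer with VRRP or floating IPs, not in application code):

```typescript
// failover.ts — toy active-passive election for one virtual IP
interface LbNode {
  id: string;
  responsive: boolean;
}

// Returns the id of the node that should hold the virtual IP: the
// active node while it responds, else the first responsive standby,
// else null (total outage).
function holderOfVirtualIp(active: LbNode, standbys: LbNode[]): string | null {
  if (active.responsive) return active.id;
  const promoted = standbys.find((n) => n.responsive);
  return promoted ? promoted.id : null;
}
```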

Geo-DNS

A standard DNS server resolves a hostname to the same IP address regardless of where the client is located. The DNS does not consider geography by default.

A geo-DNS is a DNS server configured to return different IP addresses based on the geographic location of the requesting client. If a user in India queries the DNS, they receive an IP pointing to a server in Mumbai. A user in Germany receives an IP pointing to Frankfurt. This reduces latency by directing users to the nearest data centre.
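Stripped to its core, the routing rule is a lookup table from client region to regional IP, with a fallback for unmapped regions. A toy version (the region keys and IPs are made up):

```typescript
// geo-dns.ts — toy geo-DNS: map the client's region to the nearest site
const REGION_TO_IP: Record<string, string> = {
  "ap-south": "203.0.113.50", // Mumbai
  "eu-central": "203.0.113.60", // Frankfurt
};

const DEFAULT_IP = "203.0.113.70"; // fallback for unmapped regions

function geoResolve(clientRegion: string): string {
  return REGION_TO_IP[clientRegion] ?? DEFAULT_IP;
}
```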

Geo-DNS is an optional configuration. It requires the DNS provider to support it and the infrastructure to have servers in multiple regions.
