CDN Infrastructure Deep Dive: ISP Partnerships and System Redundancy

March 5, 2026 · 4 min read

Tags: system design, high level design, HLD, distributed systems, scalability, microservices, load balancing, caching, database design, API design, software architecture

CDN Scale: Not Optimization, Infrastructure 🏗️

Question: How do CDNs handle 10 petabits per second of bandwidth?

Answer: It's not clever optimization — it's massive infrastructure investment.

CDNs spend enormous amounts of money on:

  • Hundreds of thousands of servers globally
  • High-speed network connections
  • Strategic placement in ISP facilities
  • Redundant load balancers and origin servers

Key insight: CDNs are an infrastructure problem, not an optimization problem.

CDN Redundancy: Multi-Layer Architecture 🔀

The M (Origin) Node is NOT a Single Server

Many developers assume the CDN origin server is a single point of failure. It's not.

Actual architecture:

DNS ─→ Load Balancer 1 (LB1) → App Server Cluster → Redirects to Edge Node
  └─→ Load Balancer 2 (LB2) → App Server Cluster → Redirects to Edge Node

Flow:

  1. DNS returns multiple IP addresses (LB1, LB2, ...)
  2. Client connects to LB1
  3. LB1 routes the request to one of its app servers (e.g., via round-robin)
  4. App server redirects client to nearest edge node (E2)
  5. E2 serves the file

If LB1 crashes: DNS returns LB2's IP address. No single point of failure.
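The failover flow above can be sketched in Python. The hostnames, IP addresses, and the `is_reachable` check below are illustrative stand-ins for real DNS resolution and health checking, not actual CDN APIs:

```python
import itertools

# Hypothetical DNS records for illustration: one hostname, two LB IPs.
DNS_RECORDS = {"origin.example-cdn.com": ["10.0.0.1", "10.0.0.2"]}  # LB1, LB2

def resolve(hostname):
    """Stand-in for a DNS lookup that returns multiple A records."""
    return DNS_RECORDS[hostname]

def connect_with_failover(hostname, is_reachable):
    """Try each load balancer IP in order; fall back if one is down."""
    for ip in resolve(hostname):
        if is_reachable(ip):
            return ip
    raise ConnectionError("all load balancers are down")

class RoundRobinBalancer:
    """Each LB cycles through its app servers in round-robin order."""
    def __init__(self, app_servers):
        self._cycle = itertools.cycle(app_servers)

    def pick(self):
        return next(self._cycle)
```

If LB1's IP is unreachable, the client simply connects to the next IP that DNS returned, which is exactly why there is no single point of failure at the origin.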

Internal Origin Server Structure

The "M" (main) node shown in diagrams is actually:

  • Multiple load balancers (LB1, LB2, LB3...)
  • Multiple app servers behind each LB
  • Each app server can redirect to appropriate edge nodes

Fault tolerance: Even if one load balancer fails, DNS routes traffic to backup load balancers.

CDN-ISP Partnerships: Why CDNs Are So Fast 🤝

Real-World Example: Jio + Akamai

If you use Jio internet in India:

  1. Visit your local Jio office building
  2. Look at the servers inside the facility
  3. You'll find an Akamai server physically installed there

What this means:

  • Users connected to that Jio building fetch content from the local Akamai server
  • No cross-country data transfer required
  • Latency reduced to near-zero for cached content

How ISP Partnerships Work

CDNs install edge servers inside ISP facilities:

  • Files are cached at the "last mile" (closest possible point to users)
  • ISP customers access content from the same building
  • LRU eviction ensures only popular content stays cached locally

Coverage: Major CDNs (Akamai, Cloudflare, Fastly) have partnerships with ISPs worldwide, placing edge servers in thousands of locations.

Is Redundant Caching Wasteful? 🔄

Question: If thousands of edge servers cache the same viral video, isn't that wasteful?

Answer: No, because proximity is worth the redundancy.

Scenario: 1 million users watch a viral video

  • Edge servers across the world cache the same file
  • Each server uses LRU to evict unpopular content
  • Result: Fast delivery for all users, acceptable storage cost

CDN principle: Bandwidth and latency optimization justify storage redundancy.
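The trade-off can be checked with rough arithmetic. Every number below (video size, server count, viewer count) is a made-up assumption for illustration, not real CDN data:

```python
# Back-of-envelope numbers — all assumed for illustration.
video_size_gb = 0.1          # a 100 MB viral video
edge_servers = 10_000        # assumed number of edge copies worldwide
viewers = 1_000_000          # users who each watch it once

total_storage_gb = video_size_gb * edge_servers   # duplicated storage cost
total_egress_gb = video_size_gb * viewers         # bytes actually served

# Storage duplicated across edges is a tiny fraction of the traffic served.
storage_to_traffic_ratio = total_storage_gb / total_egress_gb
```

Under these assumptions, caching the file 10,000 times costs about 1 TB of storage while serving roughly 100 TB of traffic: the redundant storage is about 1% of the bandwidth it saves from the origin.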

Efficient Cache Management

LRU eviction prevents bloat:

  • Viral video gets cached globally (thousands of edge servers)
  • After virality fades, video is accessed less frequently
  • LRU automatically evicts it from edge servers
  • Storage reclaimed for new popular content
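The eviction behavior above can be sketched with Python's `OrderedDict`. This is a minimal in-memory model of LRU, not a production edge-cache implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: least-recently-used entries are evicted first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used
```

A viral video stays cached as long as it keeps being requested; once requests stop, newly popular content pushes it out automatically, with no manual cleanup.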

The Famous Computer Science Joke 😄

There are only two hard problems in computer science:

  1. Naming variables
  2. Cache invalidation
  3. Off-by-one errors

(Notice: "only two hard problems," yet three are listed; the list itself demonstrates an off-by-one error.)

Modified version:

There are only three hard problems in computer science:

  1. Naming variables
  2. Thread synchronization
  3. Cache invalidation
  4. Off-by-one errors

Why cache invalidation is hard: Keeping cached data fresh while maintaining performance is one of the most challenging problems in distributed systems. It requires balancing consistency, availability, and performance — a notoriously difficult trade-off.

Looking Ahead: Upcoming Caching Topics 🚀

Topics NOT Covered in This Lecture

Write-through cache:

  • Writes go to both cache and database simultaneously
  • Ensures strong consistency but higher latency

Write-around cache:

  • Writes go to database only, bypass cache entirely
  • Cache is populated only on reads
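The two write strategies can be contrasted in a toy model. The `Store`, `write_through`, and `write_around` names below are illustrative sketches, not APIs from any real caching library:

```python
class Store:
    """Toy backing database plus cache, both modeled as dicts."""
    def __init__(self):
        self.db = {}
        self.cache = {}

def write_through(store, key, value):
    """Write-through: update cache and database together (strong consistency)."""
    store.cache[key] = value
    store.db[key] = value

def write_around(store, key, value):
    """Write-around: write to the database only; cache fills on the next read."""
    store.db[key] = value
    store.cache.pop(key, None)  # drop any stale cached copy

def read(store, key):
    """Read path shared by both strategies: cache hit, else fill from DB."""
    if key not in store.cache:
        store.cache[key] = store.db[key]
    return store.cache[key]
```

Write-through keeps the cache and database in sync at the cost of an extra write on every update; write-around avoids filling the cache with data that may never be read, at the cost of a miss on the first read.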

Case Study: Scaler Code Judge

  • How test case data is cached locally
  • Why round-robin works for identical cache data

Case Study: Scaler Contest Leaderboard

  • Caching expensive leaderboard calculations
  • Balancing freshness vs. computation cost

Case Study: Facebook Newsfeed

  • Multi-tier caching strategy
  • Handling billions of personalized feeds

Advanced Topics (Future Lectures)

Cache coherence:

  • Keeping multiple caches in sync
  • Event-based invalidation

Distributed cache consistency:

  • CAP theorem implications
  • Quorum-based caching

Cache warming strategies:

  • Pre-loading cache before traffic spikes
  • Predictive caching

Summary: Key Caching Principles 🎯

  1. Memory is always hierarchical — From tea-making to CPU architecture
  2. LRU wins 99% of the time — Don't overthink eviction policies
  3. TTL provides eventual consistency — Acceptable for most use cases
  4. Eviction ≠ Invalidation — They solve different problems
  5. Caches should be dumb — Business logic belongs in app servers
  6. Write-back for high throughput — When 1-2% data loss is acceptable
  7. CDNs are infrastructure investments — Not optimization tricks
  8. Proximity beats optimization — Place cache close to users

Remember: Cache invalidation is one of the hardest problems in computer science. There's no perfect solution — only trade-offs between consistency, availability, and performance.