Write-Back Cache Strategy: High-Throughput Systems with Acceptable Data Loss

March 5, 2026 · 4 min read
system design, high level design, HLD, distributed systems, scalability, microservices, load balancing, caching, database design, API design, software architecture

What is Write-Back Cache? 🔄

Definition: Writes go to the cache first, then asynchronously flush to the database later.

Flow:

  1. Client sends write request
  2. App server writes to cache (fast)
  3. App server returns success to client immediately
  4. Cache periodically dumps data to database (background process)

Key characteristic: Database writes are deferred (not immediate).
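The four steps above can be sketched as a minimal write-back cache. This is a toy Python sketch, not a production design: the `put(key, value)` database interface, the dirty-key tracking, and the fixed flush interval are all assumptions for illustration.

```python
import threading
import time

class WriteBackCache:
    """Minimal write-back cache: writes land in memory immediately
    and are flushed to the database later by a background process."""

    def __init__(self, db, flush_interval=30):
        self._data = {}            # in-memory store (the cache)
        self._dirty = set()        # keys changed since the last flush
        self._lock = threading.Lock()
        self._db = db              # assumed to expose put(key, value)
        self._interval = flush_interval

    def write(self, key, value):
        # Steps 2-3: write to cache, then the caller returns
        # success to the client immediately.
        with self._lock:
            self._data[key] = value
            self._dirty.add(key)

    def flush(self):
        # Step 4: dump only the keys that changed since the last flush.
        with self._lock:
            dirty, self._dirty = self._dirty, set()
            snapshot = {k: self._data[k] for k in dirty}
        for key, value in snapshot.items():
            self._db.put(key, value)

    def run_flusher(self):
        # Background loop; anything written between flushes is at risk
        # if the cache crashes before the next one.
        while True:
            time.sleep(self._interval)
            self.flush()
```

Note that between two flushes the database lags the cache; that window is exactly the data at risk on a crash.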

When to Use Write-Back Cache ✅

Ideal for: High-throughput systems where occasional data loss is acceptable.

Two Key Requirements

  1. Extremely high write volume — Millions of writes per second
  2. 1-2% data loss tolerance — Losing some data is not catastrophic

Use Case 1: View Counting on YouTube/Twitter 👁️

The Problem

Scenario: Billions of users watching videos on YouTube.

Challenge:

  • Millions of view requests per second
  • Writing each view to disk (database) is too slow
  • Need extremely high throughput

The Write-Back Solution

Implementation:

  1. User watches a video → Increment view counter in cache (RAM)
  2. Cache accumulates views in memory
  3. Every 30 seconds, dump aggregated count to database
  4. If cache crashes before dump, some views are lost

Example:

Cache:    Video X views = 1,523,891 (in memory)
Database: Video X views = 1,520,000 (last dump 30 seconds ago)

[Cache crashes before next dump]

Result: ~3,891 views lost
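The counting pattern could be sketched as follows. This is a simplified Python sketch: the `increment(video_id, n)` database method is a hypothetical interface, and a real system would also need locking and a crash-tolerant flush scheduler.

```python
from collections import Counter

class ViewCounter:
    """Accumulate view increments in RAM; periodically flush the
    deltas to the database. A crash loses at most one interval's
    worth of uncounted views."""

    def __init__(self, db):
        self._pending = Counter()   # video_id -> views since last flush
        self._db = db               # assumed to expose increment(id, n)

    def record_view(self, video_id):
        # Hot path: a pure in-memory increment, no disk I/O.
        self._pending[video_id] += 1

    def flush(self):
        # Called every ~30 seconds: one database write per video,
        # instead of one write per individual view.
        pending, self._pending = self._pending, Counter()
        for video_id, delta in pending.items():
            self._db.increment(video_id, delta)
```

The key win is aggregation: a million views of one video in a 30-second window collapse into a single database write.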

Why This Is Acceptable

Question: Is losing 1-2% of views catastrophic?

Answer: No, because we care about trends, not individual data points.

Rationale:

  • Overall trend remains accurate (millions of views)
  • Losing ~3,891 out of ~1.52 million views is roughly a 0.26% error
  • Nobody notices or cares about this margin
  • Benefit: Massively higher throughput (can handle millions of requests/second)

Server Crash Frequency

How often do servers crash? Approximately once per year (for well-maintained systems).

Impact: If cache dumps to database every 30 seconds, and the server crashes once per year, you lose at most 30 seconds of view data — negligible in the grand scheme.

Use Case 2: Multiplayer Gaming (PUBG, CS:GO, Dota) 🎮

The Problem

Scenario: 10-100 players in a live game match.

Challenge:

  • Real-time game state (player positions, health, ammo, kills)
  • Extremely high write frequency (every player action)
  • Writing every action to database is too slow

The Write-Back Solution

Architecture:

10 Players → App Server (local cache stores game state) → Database

Implementation:

  1. All players in a match connect to one app server
  2. App server stores entire game state in local cache (RAM)
  3. During the match: All updates stay in cache (no database writes)
  4. Match ends successfully: Dump final statistics to database
    • Winner
    • Kill count per player
    • Death count
    • MVP
    • Cheater detection results
  5. Server crashes mid-match: Game is lost (cannot resume)
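The match lifecycle above could be sketched like this. A Python toy, not a real game server: the `save_summary` database method and the kill/death-only state are illustrative assumptions.

```python
class MatchState:
    """All match state lives in the app server's RAM during play;
    only the final summary is persisted when the match ends."""

    def __init__(self, players):
        self.kills = {p: 0 for p in players}
        self.deaths = {p: 0 for p in players}

    def record_kill(self, killer, victim):
        # During the match: pure in-memory updates, no database writes.
        self.kills[killer] += 1
        self.deaths[victim] += 1

    def finish(self, db):
        # Match ended successfully: dump final statistics once.
        mvp = max(self.kills, key=self.kills.get)
        db.save_summary({
            "kills": self.kills,
            "deaths": self.deaths,
            "mvp": mvp,
        })
```

If the server dies before `finish` runs, the whole match is simply gone, which is exactly the trade-off described above.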

Why This Is Acceptable

Question: If the server crashes and all game data is lost, isn't that bad?

Answer: Yes, but it's acceptable for this use case.

Rationale:

  • Server crashes are rare (once per year)
  • If a crash happens, players simply start a new game
  • Users accept this risk in exchange for smooth, lag-free gameplay
  • Alternative (writing every action to database) would make the game unplayable due to latency

Not stored in database during match:

  • Every bullet fired
  • Every player movement
  • Every health/ammo change

Stored in database after match:

  • Final game metadata (winner, kills, deaths, MVP)
  • Game recording (for replay/review)

Write-Back Cache: Key Takeaways 🎯

When to Use

  • High-throughput systems (millions of writes/second)
  • Data loss of 1-2% is acceptable
  • Trend accuracy matters more than individual precision

Examples:

  • View counts (YouTube, Twitter, Instagram)
  • Like counts (social media)
  • Analytics dashboards
  • Gaming sessions
  • Leaderboards (with periodic updates)

When NOT to Use

  • Financial transactions (bank transfers, payments)
  • Critical user data (account passwords, personal information)
  • Legal records (contracts, compliance data)
  • Medical records (patient data)
  • Any system where data loss is unacceptable

Trade-off Summary

Gain:

  • Massive throughput increase (10-100x faster writes)
  • Near-zero write latency for users

Cost:

  • Risk of data loss on cache failure
  • Eventual consistency (data may be stale until next flush)

Advanced Topic: Bloom Filters for Non-Existent Keys 🔍

Problem: What if users frequently request keys that don't exist in the database?

Scenario:

  1. User requests non-existent key
  2. Cache miss → Check database
  3. Database doesn't have the key either
  4. Return empty result
  5. This happens repeatedly → Database gets overwhelmed with useless queries

Solution: Bloom filters provide a fast containment check before querying the database.

How Bloom filters work:

  • Probabilistic data structure
  • Quickly answers: "Is this key definitely NOT in the database?"
  • If Bloom filter says "not present" → Skip database query entirely
  • If Bloom filter says "maybe present" → Check database
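A minimal sketch of the idea in Python. Real deployments typically use a library or the datastore's built-in filter rather than hand-rolling one; the sizes and hash scheme here are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: a 'no' answer is definite, a 'maybe'
    answer can be a false positive, but never a false negative."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)   # bit array (one byte per bit here)

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # All bits set -> "maybe present" (check the database).
        # Any bit unset -> "definitely not present" (skip the query).
        return all(self.bits[pos] for pos in self._positions(key))
```

In the cache-miss path, a `might_contain` returning `False` lets the server skip the database round-trip for a key that cannot exist.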

Note: Bloom filters will be covered in detail during the NoSQL internals lecture. For now, understand they optimize cache misses for non-existent keys.


Next: CDN infrastructure deep dive — ISP partnerships and redundancy architecture.