CPU Cache Architecture: The Speed Hierarchy Inside Your Processor

March 5, 2026 · 3 min read
system design · high level design · HLD · distributed systems · scalability · microservices · load balancing · caching · database design · API design · software architecture

The Performance Gap: CPU vs RAM ⚡

Disk and RAM are extremely slow compared to the CPU. If the CPU fetched data directly from RAM on every access, memory latency would become an immediate bottleneck.

Solution: Multi-level caching inside the CPU itself.

CPU Memory Hierarchy

![image](/api/blog-assets/hld/.gitbook/assets/image%20(1).png)

CPU Registers (Fastest)

  • Size: Few bytes
  • Speed: 0.2 nanoseconds per operation
  • Throughput: 5 billion operations/second
  • Access: Programming language dependent

Language access:

  • High-level (JavaScript, Python, Java): No direct access
  • Low-level (C, C++, Rust): Can suggest register usage (compiler may override)
  • Assembly: Registers are manipulated directly

L1 Cache

  • Scope: Per-core (each CPU core has its own)
  • Size: Smallest cache level
  • Speed: Faster than L2

L2 Cache

  • Scope: Per-core (each CPU core has its own)
  • Size: Larger than L1
  • Speed: Slower than L1, faster than L3

L3 Cache

  • Scope: Shared across all CPU cores
  • Size: ~20 MB (typical modern processors)
  • Speed: Slower than L2, but still much faster than RAM

![image](/api/blog-assets/hld/.gitbook/assets/image%20(2).png)

Modern CPUs are multi-core chips (typically 1 inch × 1 inch). Each core contains its own L1 and L2 caches for parallel processing, while L3 is shared among all cores.


Memory Speed Comparison: The Real Numbers 📊

CPU Registers: 0.2 nanoseconds

L1/L2/L3 Cache: Varies by processor (progressively slower)

RAM (Main Memory):

  • Modern RAM: ~10 nanoseconds latency
  • Older RAM: ~100 nanoseconds latency
  • 50-500x slower than CPU registers (10 ns / 0.2 ns and 100 ns / 0.2 ns)

SSD: roughly 10x faster than a hard disk, but still extremely slow compared to RAM

Hard Disk: 100,000x slower than RAM

Why This Matters

If the CPU constantly fetched data from the hard disk, performance would collapse. Even reading from RAM creates significant bottlenecks. This is why multi-level caching is essential.

The Universal Pattern (Again) 🔁

As you move up the hierarchy:

  1. Size decreases
  2. Speed increases

This hierarchical access pattern exists in every computing system:

Hard Disk (slowest, largest)
↓
SSD
↓
RAM
↓
L3 Cache
↓
L2 Cache
↓
L1 Cache
↓
CPU Registers (fastest, smallest)

Why Multiple Cores Share L3 Cache 🔀

L1 and L2: Private to each core (optimized for single-threaded tasks)

L3: Shared across cores (enables efficient data sharing between threads)

Example: If Thread A (Core 1) caches data and Thread B (Core 2) needs it, L3 sharing prevents redundant memory fetches.

Real-World Analogy

Think of CPU caching like a library system:

  1. Hard disk = Main library warehouse (slow but has everything)
  2. RAM = Local library branch (faster, smaller collection)
  3. L3 cache = Shared reading room (quick access for multiple readers)
  4. L2 cache = Personal desk (your own workspace)
  5. L1 cache = Items in your hands (immediate access)
  6. Registers = Items you're actively reading (fastest)

Key takeaway: Every layer of caching exists to bridge the massive speed gap between the CPU and permanent storage. Without this hierarchy, modern computing would be impossibly slow.


How storage works

  1. Latency numbers: Latency Numbers Every Programmer Should Know
  2. Magnetic: How do Hard Disk Drives Work? 💻💿🛠
  3. Solid State: How do SSDs Work? | How does your Smartphone store data?
  4. RAM: How does Computer Memory Work? 💻🛠
  5. Optimizing code for efficient cache usage: CppCon 2016: Timur Doumler β€œWant fast C++? Know your hardware!"
  6. Brain:
    1. Information Storage and the Brain: Learning and Memory
    2. How We Make Memories: Crash Course Psychology #13
    3. Tools to Enhance Working Memory & Attention

Next: How web browsers implement caching at the client side.