Advanced Topics (LLMs, Cassandra, Performance Math)

March 5, 2026 · 4 min read
system design, high level design, HLD, distributed systems, scalability, microservices, load balancing, caching, database design, API design, software architecture

ChatGPT Architecture (Stateless Transformers)

Q: Does ChatGPT use stateful servers?

A: No. ChatGPT is completely stateless.

How it works:

Request 1:

User: "Hi" → Server A → Context: ["Hi"] → Response: "Hello! How can I help?"

Request 2:

User: "Bye" → Server B (different server!) → Context: ["Hi", "Hello! How can I help?", "Bye"] → Response: "Goodbye!"

Each request contains full conversation history.

Why stateless?

  • Different requests → Different servers
  • No server-side session state
  • Chat history in database
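The request flow above can be sketched in a few lines. `call_llm` is a hypothetical stand-in for an HTTP call to any model server; the point is that the full message history travels with every request, so any server can handle it.

```python
# Minimal sketch of a stateless chat flow. `call_llm` is a toy stand-in
# for a request to a model server -- it receives the FULL history each time.
def call_llm(messages):
    # toy response so the sketch runs without a real model
    return f"echo: {messages[-1]['content']}"

history = []

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)  # entire history travels with each request
    history.append({"role": "assistant", "content": reply})
    return reply

print(send("Hi"))   # → echo: Hi
print(send("Bye"))  # → echo: Bye; history now holds all four messages
```

Because the server keeps no session state, Request 2 can land on a completely different machine and still see the whole conversation.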

Transformer Architecture

How transformers work:

Step 1: Read full context → Predict next token
Step 2: Read (context + token1) → Predict next token
Step 3: Read (context + token1 + token2) → Predict next token

Transformers are next-token prediction machines:

  • Read entire input each time
  • No internal state maintained
  • Completely stateless

Contrast with LSTM:

  • LSTM: Stateful (maintains hidden state)
  • Transformer: Stateless (reads full context)
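The prediction loop above can be sketched as follows; `predict` is a toy stand-in for a transformer forward pass, not a real model:

```python
# Autoregressive (next-token) generation in miniature.
def predict(tokens):
    # toy rule so the sketch runs: "next token" is the context length
    return len(tokens)

def generate(context, max_new_tokens):
    tokens = list(context)
    for _ in range(max_new_tokens):
        # the model re-reads the FULL context on every step -- no hidden
        # state is carried over between iterations
        tokens.append(predict(tokens))
    return tokens

print(generate([1, 2], 3))  # → [1, 2, 2, 3, 4]
```

The statelessness is visible in the loop: `predict` takes the whole token list each time, whereas an LSTM would carry a hidden state from step to step.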

Context Limits

Q: Isn't sending full context expensive?

A: Yes, which is why context limits exist.

Limits:

  • GPT-4: 8,192 tokens (~6K words)
  • GPT-4-turbo: 128,000 tokens (~96K words)

Trade-off: Context size vs. computational cost
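A rough cost sketch shows why this trade-off bites: since the model re-reads the whole context on every turn, total tokens processed grows quadratically with conversation length. The `turn_tokens` figure below is an assumed number purely for illustration.

```python
# Why full-context is expensive: cumulative tokens processed over a
# conversation, assuming each turn adds `turn_tokens` tokens of context.
def total_tokens_processed(turns, turn_tokens):
    total = 0
    context = 0
    for _ in range(turns):
        context += turn_tokens
        total += context  # whole context re-processed this turn
    return total

print(total_tokens_processed(10, 100))   # → 5500
print(total_tokens_processed(100, 100))  # → 505000
```

A 10× longer conversation costs roughly 100× the total compute, which is why providers cap context size.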


Cassandra Multi-Master Architecture

Q: Do database clusters always need load balancers?

A: Usually yes, but Cassandra is different.

Standard architecture:

App Servers → DB Load Balancer → DB Cluster

Cassandra architecture:

App Servers → Any Cassandra Node → Gossip → Correct Node(s)

Every Node is a Load Balancer

Cassandra nodes:

  • Store data (database server)
  • Route queries (load balancer)
  • Coordinate responses (query coordinator)

How it works:

  1. Send query to any node
  2. That node becomes coordinator
  3. Uses gossip protocol to find data
  4. Retrieves from correct nodes
  5. Returns result

Multi-master = All nodes equal (no primary)
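The five steps above can be sketched as a toy cluster. This is not Cassandra's real token-ring or gossip implementation, just the routing idea: the client contacts any node, and that node forwards to the owner.

```python
# Toy sketch of Cassandra-style coordination: contact ANY node; it acts
# as coordinator and routes the key to the node that owns it.
NUM_NODES = 4

def owner(key):
    return key % NUM_NODES  # stand-in for the real token-ring lookup

class Node:
    def __init__(self, node_id, cluster):
        self.node_id = node_id
        self.cluster = cluster
        self.data = {}

    def write(self, key, value):
        # acting as coordinator: forward to the owning node
        self.cluster[owner(key)].data[key] = value

    def read(self, key):
        return self.cluster[owner(key)].data.get(key)

cluster = {}
for i in range(NUM_NODES):
    cluster[i] = Node(i, cluster)

cluster[3].write(5, "hello")  # contact node 3; data lands on node 1
print(cluster[0].read(5))     # → hello (any node can serve the read)
```

Every node exposes the same `read`/`write` interface, which is exactly what "all nodes equal, no primary" means.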

Contrast with MongoDB

MongoDB: Primary-secondary architecture (replica sets)

  • Primary handles writes
  • Secondaries handle reads
  • Explicit routing needed (drivers send writes to the primary)

Cassandra: Multi-master

  • Any node handles writes
  • Any node handles reads
  • Built-in routing

Performance Mathematics

Binary Search Performance

Ring size: n servers × k spots = n×k entries

Google scale:

  • 10 million servers
  • k = 64
  • Ring size = 640 million

Binary search complexity:

O(log₂(640,000,000))

Calculating log₂(640 million)

Step 1: Express in factors

640 million = 64 × 10 million = 2^6 × 10^7

Step 2: Apply logarithm

log₂(640M) = log₂(2^6 × 10^7) = log₂(2^6) + log₂(10^7) = 6 + 7×log₂(10)

Step 3: Calculate log₂(10)

We know: 2^10 ≈ 10^3
Therefore: 10 × log₂(2) ≈ 3 × log₂(10)
So: log₂(10) ≈ 10/3 ≈ 3.33

Step 4: Final calculation

log₂(640M) = 6 + 7×3.33 = 6 + 23.31 = 29.31 ≈ 30 operations

Result: ~30 binary search operations at Google scale
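The mental math checks out against an exact computation:

```python
import math

# Exact check of the estimate: 10M servers × 64 virtual spots each.
ring_size = 10_000_000 * 64  # 640 million ring entries
steps = math.ceil(math.log2(ring_size))
print(steps)  # → 30
```

`log₂(640M)` is 29.25 exactly, so a binary search over the ring finishes in at most 30 comparisons.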

RAM Performance Impact

Modern DDR5 RAM:

  • 6000 MT/s (DDR5-6000)
  • CAS Latency 30
  • Random access: ~10 nanoseconds

Binary search iteration:

  • 10 memory reads per iteration
  • 10 ns × 10 reads = 100 ns per iteration

Total routing time:

30 iterations × 100 ns = 3,000 ns = 3 microseconds

Ultra-fast even at massive scale.
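Putting the figures above together as arithmetic (the 10 memory reads per iteration is the assumption stated above):

```python
# Back-of-envelope routing latency from the figures above.
ns_per_read = 10          # DDR5 random access latency
reads_per_iteration = 10  # assumed memory touches per binary-search step
iterations = 30           # from the log2(640M) calculation
total_ns = iterations * reads_per_iteration * ns_per_read
print(total_ns)  # → 3000 ns = 3 microseconds
```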

Quick Mental Math Tricks

Powers of 2:

  • 2^10 ≈ 10^3 (thousand)
  • 2^20 ≈ 10^6 (million)
  • 2^30 ≈ 10^9 (billion)

Example: log₂(64 billion)

64 billion = 64 × 10^9 ≈ 2^6 × 2^30
log₂(64B) ≈ 6 + 30 = 36 operations

Interview tip: Practice these mental calculations.
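The approximations are easy to sanity-check against exact values:

```python
import math

# 2^10 vs 10^3: the trick is off by only ~2.4%
print(2**10, 10**3)  # → 1024 1000

# the worked example: log2(64 billion)
print(round(math.log2(64 * 10**9)))  # → 36
```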


Partitioning Terminology Clarified

Partition vs. Sharding

Partition (mathematical term):

  • Splitting a set
  • No overlap between subsets
  • Process, not solution

Types:

Partition Type   Location         Purpose
Vertical         Same server      Normalization
Vertical         Across servers   Microservices
Horizontal       Same server      Performance
Horizontal       Across servers   Sharding

Sharding = Horizontal partitioning + Distribution

Q: "What's the purpose of partitioning?"

A: Too broad. Ask instead:

  • "Purpose of normalization?" → Prevent anomalies
  • "Purpose of sharding?" → Scale beyond one server
  • "Purpose of multi-tenancy?" → Data isolation

Replication vs. Sharding

Sharding:

  • Different data on different servers
  • User 0 → Server A only
  • Splits data

Replication:

  • Same data on multiple servers
  • User 0 → Server A AND Server B
  • Copies data

Production: Use both

  • Shard for capacity
  • Replicate for fault tolerance
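The contrast can be sketched in a few lines. This is a toy placement scheme (integer keys, copies on the next servers around the ring), not any real database's algorithm:

```python
# Sharding maps each key to exactly ONE server;
# replication stores each key on SEVERAL servers.
def shard_for(key, num_shards):
    return key % num_shards  # toy hash: integer keys only

def replicas_for(key, num_shards, replication_factor):
    primary = shard_for(key, num_shards)
    # place copies on the next servers around the ring
    return [(primary + i) % num_shards for i in range(replication_factor)]

print(shard_for(10, 4))        # → 2  (one server owns the key)
print(replicas_for(10, 4, 2))  # → [2, 3]  (two servers hold copies)
```

With replication factor 2, losing server 2 still leaves a copy of key 10 on server 3, which is the fault-tolerance point above.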

Replication reduces data movement (doesn't eliminate it)

When server crashes:

  • With replication: Copy from backup
  • Without replication: Data lost

Detailed coverage: Future lectures on replication strategies


Key Takeaways

  1. ChatGPT/Transformers are stateless (full context per request)
  2. Cassandra has no load balancer (every node routes)
  3. Binary search at scale: ~30 operations for 640M entries
  4. RAM performance limits routing (3 microseconds)
  5. Mental math tricks essential for interviews
  6. Partitioning is the general process; sharding is one application of it
  7. Replication ≠ sharding (both needed in production)