Advanced Topics (LLMs, Cassandra, Performance Math)
ChatGPT Architecture (Stateless Transformers)
Q: Does ChatGPT use stateful servers?
A: No. ChatGPT is completely stateless.
How it works:
Request 1:
User: "Hi"
→ Server A
→ Context: ["Hi"]
→ Response: "Hello! How can I help?"

Request 2:
User: "Bye"
→ Server B (different server!)
→ Context: ["Hi", "Hello! How can I help?", "Bye"]
→ Response: "Goodbye!"

Each request contains the full conversation history.
Why stateless?
- Different requests → Different servers
- No server-side session state
- Chat history in database
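The flow above can be sketched in a few lines. This is a toy model, not the real ChatGPT API: `handle_request` stands in for any server, and the canned replies are placeholders. The point is that the full transcript travels with every call, so it does not matter which server answers.

```python
# Minimal sketch of a stateless chat exchange: the client (or gateway)
# sends the FULL conversation history with every request, so ANY server
# can handle it. Message format loosely mirrors role/content dicts.

def handle_request(history):
    """Any server can answer: all needed context arrives in the request."""
    last = history[-1]["content"]
    # Placeholder "model": canned reply keyed off the last user message.
    reply = "Hello! How can I help?" if last == "Hi" else "Goodbye!"
    return history + [{"role": "assistant", "content": reply}]

history = [{"role": "user", "content": "Hi"}]
history = handle_request(history)            # served by "Server A"
history.append({"role": "user", "content": "Bye"})
history = handle_request(history)            # served by "Server B"
print(len(history))  # 4 messages: the whole transcript rode along each time
```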
Transformer Architecture
How transformers work:
Step 1: Read full context → Predict next token
Step 2: Read (context + token1) → Predict next token
Step 3: Read (context + token1 + token2) → Predict next token

Transformers are next-token prediction machines:
- Read entire input each time
- No internal state maintained
- Completely stateless
Contrast with LSTM:
- LSTM: Stateful (maintains hidden state)
- Transformer: Stateless (reads full context)
Context Limits
Q: Isn't sending full context expensive?
A: Yes, which is why context limits exist.
Limits:
- GPT-4: 8,192 tokens (~6K words)
- GPT-4-turbo: 128,000 tokens (~96K words)
Trade-off: Context size vs. computational cost
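One common consequence of the trade-off: when a transcript outgrows the context window, old turns get dropped. A crude sketch (word count as a stand-in for a real tokenizer, which actual systems use):

```python
# Sketch of context-window trimming: keep the most recent turns that fit
# within the token budget, dropping the oldest first. Word count is a
# crude proxy for real tokenization.

def trim_to_budget(messages, max_tokens):
    def count(msg):
        return len(msg["content"].split())
    kept, total = [], 0
    for msg in reversed(messages):   # newest turns first
        if total + count(msg) > max_tokens:
            break
        kept.append(msg)
        total += count(msg)
    return list(reversed(kept))

msgs = [{"content": "a b c"}, {"content": "d e"}, {"content": "f"}]
print(trim_to_budget(msgs, 3))  # [{'content': 'd e'}, {'content': 'f'}]
```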
Cassandra Multi-Master Architecture
Q: Do database clusters always need load balancers?
A: Usually yes, but Cassandra is different.
Standard architecture:
App Servers → DB Load Balancer → DB Cluster

Cassandra architecture:
App Servers → Any Cassandra Node → Gossip → Correct Node(s)

Every Node is a Load Balancer
Cassandra nodes:
- Store data (database server)
- Route queries (load balancer)
- Coordinate responses (query coordinator)
How it works:
- Send query to any node
- That node becomes coordinator
- Uses ring metadata (shared via gossip) to locate the data
- Retrieves from correct nodes
- Returns result
Multi-master = All nodes equal (no primary)
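The coordinator flow above can be sketched with a consistent-hashing ring. This is a simplified single-token-per-node model (real Cassandra uses vnodes and replication), but it shows the key point: any entry node resolves the same owner, so no external load balancer is needed.

```python
# Sketch of "every node is a coordinator": a query can land on ANY node;
# that node hashes the key onto the token ring and finds the owner.
# Simplified: one token per node, no vnodes, no replication.

import bisect
import hashlib

class Cluster:
    def __init__(self, nodes):
        # Every node knows the full ring (as gossip would provide).
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        i = bisect.bisect(self.tokens, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

    def query(self, entry_node, key):
        # entry_node acts as coordinator, wherever the data actually lives.
        return {"coordinator": entry_node, "data_node": self.owner(key)}

cluster = Cluster(["node-a", "node-b", "node-c"])
r1 = cluster.query("node-a", "user:42")
r2 = cluster.query("node-c", "user:42")
print(r1["data_node"] == r2["data_node"])  # True: same owner either way
```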
Contrast with MongoDB
MongoDB: Primary-secondary architecture (replica sets)
- Primary handles writes
- Secondaries handle reads (when enabled)
- Explicit routing needed
Cassandra: Multi-master
- Any node handles writes
- Any node handles reads
- Built-in routing
Performance Mathematics
Binary Search Performance
Ring size: n servers × k spots = n×k entries
Google scale:
- 10 million servers
- k = 64
- Ring size = 640 million
Binary search complexity:
O(log₂(640,000,000))

Calculating log₂(640 million)

Step 1: Express in factors
640 million = 64 × 10 million
            = 2^6 × 10^7

Step 2: Apply logarithm
log₂(640M) = log₂(2^6 × 10^7)
           = log₂(2^6) + log₂(10^7)
           = 6 + 7×log₂(10)

Step 3: Calculate log₂(10)
We know: 2^10 ≈ 10^3
Taking log₂ of both sides: 10 ≈ 3×log₂(10)
So: log₂(10) ≈ 10/3 ≈ 3.33

Step 4: Final calculation
log₂(640M) = 6 + 7×3.33
           = 6 + 23.31
           ≈ 29.3 → ~30 operations

Result: ~30 binary search operations at Google scale
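The mental-math estimate can be checked directly:

```python
# Verify the estimate: log2 of 640 million should be just under 30.
import math

exact = math.log2(640_000_000)
print(round(exact, 2))  # 29.25 → ~30 comparisons, matching the estimate
```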
RAM Performance Impact
Modern DDR5 RAM:
- 6000 MHz
- CAS Latency 30
- Random access: ~10 nanoseconds
Binary search iteration:
- 10 memory reads per iteration
- 10 ns × 10 reads = 100 ns per iteration
Total routing time:
30 iterations × 100 ns = 3,000 ns = 3 microseconds

Ultra-fast even at massive scale.
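Plugging the figures above into one expression:

```python
# Per-lookup routing cost: iterations × reads per iteration × ns per read.
iterations = 30          # binary search steps at 640M ring entries
reads_per_iteration = 10 # memory reads per step (per the estimate above)
ns_per_read = 10         # DDR5 random-access latency

total_ns = iterations * reads_per_iteration * ns_per_read
print(total_ns, "ns =", total_ns / 1000, "microseconds")  # 3000 ns = 3.0 microseconds
```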
Quick Mental Math Tricks
Powers of 2:
- 2^10 ≈ 10^3 (thousand)
- 2^20 ≈ 10^6 (million)
- 2^30 ≈ 10^9 (billion)
Example: log₂(64 billion)
64 billion = 64 × 10^9
= 2^6 × 2^30
log₂(64B) = 6 + 30 = 36 operations

Interview tip: Practice these mental calculations.
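The powers-of-2 shortcuts can be sanity-checked against exact values:

```python
# The 2^10 ≈ 10^3 trick slightly overestimates; exact values stay close.
import math

print(round(math.log2(10**3), 1))       # 10.0 estimated vs ~9.97 exact
print(round(math.log2(64 * 10**9), 1))  # 36 estimated vs ~35.9 exact
```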
Partitioning Terminology Clarified
Partition vs. Sharding
Partition (mathematical term):
- Splitting a set
- No overlap between subsets
- Process, not solution
Types:
| Partition Type | Location | Purpose |
|---|---|---|
| Vertical | Same server | Normalization |
| Vertical | Across servers | Microservices |
| Horizontal | Same server | Performance |
| Horizontal | Across servers | Sharding |
Sharding = Horizontal partitioning + Distribution
Q: "What's the purpose of partitioning?"
A: Too broad. Ask instead:
- "Purpose of normalization?" → Prevent anomalies
- "Purpose of sharding?" → Scale beyond one server
- "Purpose of multi-tenancy?" → Data isolation
Replication vs. Sharding
Sharding:
- Different data on different servers
- User 0 → Server A only
- Splits data
Replication:
- Same data on multiple servers
- User 0 → Server A AND Server B
- Copies data
Production: Use both
- Shard for capacity
- Replicate for fault tolerance
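The "use both" pattern can be sketched as a placement function: each key gets one primary shard (splitting data) plus copies on neighboring shards (duplicating data). Shard count and replication factor here are illustrative, not from any particular database.

```python
# Sketch of sharding + replication combined: a key maps to a primary
# shard, and replicas land on the next shards around the ring.

def placement(key, num_shards=4, replicas=2):
    primary = hash(key) % num_shards
    # Replicas follow the primary, wrapping around the ring.
    return [(primary + i) % num_shards for i in range(replicas)]

shards = placement("user:0")
print(len(shards), len(set(shards)))  # 2 2 → two distinct shards hold user:0
```

Losing any one of those shards leaves a surviving copy, which is the fault-tolerance half of the trade.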
Replication reduces data movement (doesn't eliminate it)
When server crashes:
- With replication: Copy from backup
- Without replication: Data lost
Detailed coverage: Future lectures on replication strategies
Key Takeaways
- ChatGPT/Transformers are stateless (full context per request)
- Cassandra has no load balancer (every node routes)
- Binary search at scale: ~30 operations for 640M entries
- RAM performance limits routing (3 microseconds)
- Mental math tricks essential for interviews
- Partitioning is process; sharding is application
- Replication ≠ sharding (both needed in production)