Advanced Topics (LLMs, Cassandra, Performance Math)

March 5, 2026 · 4 min read
system design, high level design, HLD, distributed systems, scalability, microservices, load balancing, caching, database design, API design, software architecture

ChatGPT Architecture (Stateless Transformers)

Q: Does ChatGPT use stateful servers?

A: No. ChatGPT is completely stateless.

How it works:

Request 1:

User: "Hi" → Server A → Context: ["Hi"] → Response: "Hello! How can I help?"

Request 2:

User: "Bye" → Server B (different server!) → Context: ["Hi", "Hello! How can I help?", "Bye"] → Response: "Goodbye!"

Each request contains full conversation history.

Why stateless?

  • Different requests → Different servers
  • No server-side session state
  • Chat history in database
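The request flow above can be sketched in a few lines. `call_llm` is a hypothetical stand-in for an HTTP call to any model server; the point is that the full message history travels with every request, so any server can handle it.

```python
# Minimal sketch of a stateless chat flow. `call_llm` is a toy stand-in
# for a request to a model server -- it receives the FULL history each time.
def call_llm(messages):
    # toy response so the sketch runs without a real model
    return f"echo: {messages[-1]['content']}"

history = []

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)  # entire history travels with each request
    history.append({"role": "assistant", "content": reply})
    return reply

print(send("Hi"))   # → echo: Hi
print(send("Bye"))  # → echo: Bye; history now holds all four messages
```

Because the server keeps no session state, Request 2 can land on a completely different machine and still see the whole conversation.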

Transformer Architecture

How transformers work:

Step 1: Read full context → Predict next token
Step 2: Read (context + token1) → Predict next token
Step 3: Read (context + token1 + token2) → Predict next token

Transformers are next-token prediction machines:

  • Read entire input each time
  • No internal state maintained
  • Completely stateless

Contrast with LSTM:

  • LSTM: Stateful (maintains hidden state)
  • Transformer: Stateless (reads full context)
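The prediction loop above can be sketched as follows; `predict` is a toy stand-in for a transformer forward pass, not a real model:

```python
# Autoregressive (next-token) generation in miniature.
def predict(tokens):
    # toy rule so the sketch runs: "next token" is the context length
    return len(tokens)

def generate(context, max_new_tokens):
    tokens = list(context)
    for _ in range(max_new_tokens):
        # the model re-reads the FULL context on every step -- no hidden
        # state is carried over between iterations
        tokens.append(predict(tokens))
    return tokens

print(generate([1, 2], 3))  # → [1, 2, 2, 3, 4]
```

The statelessness is visible in the loop: `predict` takes the whole token list each time, whereas an LSTM would carry a hidden state from step to step.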

Context Limits

Q: Isn't sending full context expensive?

A: Yes, which is why context limits exist.

Limits:

  • GPT-4: 8,192 tokens (~6K words)
  • GPT-4-turbo: 128,000 tokens (~96K words)

Trade-off: Context size vs. computational cost
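A rough cost sketch shows why this trade-off bites: since the model re-reads the whole context on every turn, total tokens processed grows quadratically with conversation length. The `turn_tokens` figure below is an assumed number purely for illustration.

```python
# Why full-context is expensive: cumulative tokens processed over a
# conversation, assuming each turn adds `turn_tokens` tokens of context.
def total_tokens_processed(turns, turn_tokens):
    total = 0
    context = 0
    for _ in range(turns):
        context += turn_tokens
        total += context  # whole context re-processed this turn
    return total

print(total_tokens_processed(10, 100))   # → 5500
print(total_tokens_processed(100, 100))  # → 505000
```

A 10× longer conversation costs roughly 100× the total compute, which is why providers cap context size.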


Cassandra Multi-Master Architecture

Q: Do database clusters always need load balancers?

A: Usually yes, but Cassandra is different.

Standard architecture:

App Servers → DB Load Balancer → DB Cluster

Cassandra architecture:

App Servers → Any Cassandra Node → Gossip → Correct Node(s)

Every Node is a Load Balancer

Cassandra nodes:

  • Store data (database server)
  • Route queries (load balancer)
  • Coordinate responses (query coordinator)

How it works:

  1. Send query to any node
  2. That node becomes coordinator
  3. Uses gossip protocol to find data
  4. Retrieves from correct nodes
  5. Returns result

Multi-master = All nodes equal (no primary)
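The five steps above can be sketched as a toy cluster. This is not Cassandra's real token-ring or gossip implementation, just the routing idea: the client contacts any node, and that node forwards to the owner.

```python
# Toy sketch of Cassandra-style coordination: contact ANY node; it acts
# as coordinator and routes the key to the node that owns it.
NUM_NODES = 4

def owner(key):
    return key % NUM_NODES  # stand-in for the real token-ring lookup

class Node:
    def __init__(self, node_id, cluster):
        self.node_id = node_id
        self.cluster = cluster
        self.data = {}

    def write(self, key, value):
        # acting as coordinator: forward to the owning node
        self.cluster[owner(key)].data[key] = value

    def read(self, key):
        return self.cluster[owner(key)].data.get(key)

cluster = {}
for i in range(NUM_NODES):
    cluster[i] = Node(i, cluster)

cluster[3].write(5, "hello")  # contact node 3; data lands on node 1
print(cluster[0].read(5))     # → hello (any node can serve the read)
```

Every node exposes the same `read`/`write` interface, which is exactly what "all nodes equal, no primary" means.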

Contrast with MongoDB

MongoDB: Primary-secondary architecture (replica sets)

  • Primary handles writes
  • Secondaries handle reads
  • Explicit routing needed (drivers send writes to the primary)

Cassandra: Multi-master

  • Any node handles writes
  • Any node handles reads
  • Built-in routing

Performance Mathematics

Binary Search Performance

Ring size: n servers × k spots = n×k entries

Google scale:

  • 10 million servers
  • k = 64
  • Ring size = 640 million

Binary search complexity:

O(log₂(640,000,000))

Calculating log₂(640 million)

Step 1: Express in factors

640 million = 64 × 10 million = 2^6 × 10^7

Step 2: Apply logarithm

log₂(640M) = log₂(2^6 × 10^7) = log₂(2^6) + log₂(10^7) = 6 + 7×log₂(10)

Step 3: Calculate log₂(10)

We know: 2^10 ≈ 10^3
Therefore: 10 × log₂(2) ≈ 3 × log₂(10)
So: log₂(10) ≈ 10/3 ≈ 3.33

Step 4: Final calculation

log₂(640M) = 6 + 7×3.33 = 6 + 23.31 = 29.31 ≈ 30 operations

Result: ~30 binary search operations at Google scale
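The mental math checks out against an exact computation:

```python
import math

# Exact check of the estimate: 10M servers × 64 virtual spots each.
ring_size = 10_000_000 * 64  # 640 million ring entries
steps = math.ceil(math.log2(ring_size))
print(steps)  # → 30
```

`log₂(640M)` is 29.25 exactly, so a binary search over the ring finishes in at most 30 comparisons.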

RAM Performance Impact

Modern DDR5 RAM:

  • 6000 MT/s (DDR5-6000)
  • CAS Latency 30
  • Random access: ~10 nanoseconds

Binary search iteration:

  • 10 memory reads per iteration
  • 10 ns × 10 reads = 100 ns per iteration

Total routing time:

30 iterations × 100 ns = 3,000 ns = 3 microseconds

Ultra-fast even at massive scale.
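Putting the figures above together as arithmetic (the 10 memory reads per iteration is the assumption stated above):

```python
# Back-of-envelope routing latency from the figures above.
ns_per_read = 10          # DDR5 random access latency
reads_per_iteration = 10  # assumed memory touches per binary-search step
iterations = 30           # from the log2(640M) calculation
total_ns = iterations * reads_per_iteration * ns_per_read
print(total_ns)  # → 3000 ns = 3 microseconds
```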

Quick Mental Math Tricks

Powers of 2:

  • 2^10 ≈ 10^3 (thousand)
  • 2^20 ≈ 10^6 (million)
  • 2^30 ≈ 10^9 (billion)

Example: log₂(64 billion)

64 billion = 64 × 10^9 ≈ 2^6 × 2^30
log₂(64B) ≈ 6 + 30 = 36 operations

Interview tip: Practice these mental calculations.
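The approximations are easy to sanity-check against exact values:

```python
import math

# 2^10 vs 10^3: the trick is off by only ~2.4%
print(2**10, 10**3)  # → 1024 1000

# the worked example: log2(64 billion)
print(round(math.log2(64 * 10**9)))  # → 36
```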


Partitioning Terminology Clarified

Partition vs. Sharding

Partition (mathematical term):

  • Splitting a set
  • No overlap between subsets
  • Process, not solution

Types:

Partition Type   Location         Purpose
Vertical         Same server      Normalization
Vertical         Across servers   Microservices
Horizontal       Same server      Performance
Horizontal       Across servers   Sharding

Sharding = Horizontal partitioning + Distribution

Q: "What's the purpose of partitioning?"

A: Too broad. Ask instead:

  • "Purpose of normalization?" → Prevent anomalies
  • "Purpose of sharding?" → Scale beyond one server
  • "Purpose of multi-tenancy?" → Data isolation

Replication vs. Sharding

Sharding:

  • Different data on different servers
  • User 0 → Server A only
  • Splits data

Replication:

  • Same data on multiple servers
  • User 0 → Server A AND Server B
  • Copies data

Production: Use both

  • Shard for capacity
  • Replicate for fault tolerance
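The contrast can be sketched in a few lines. This is a toy placement scheme (integer keys, copies on the next servers around the ring), not any real database's algorithm:

```python
# Sharding maps each key to exactly ONE server;
# replication stores each key on SEVERAL servers.
def shard_for(key, num_shards):
    return key % num_shards  # toy hash: integer keys only

def replicas_for(key, num_shards, replication_factor):
    primary = shard_for(key, num_shards)
    # place copies on the next servers around the ring
    return [(primary + i) % num_shards for i in range(replication_factor)]

print(shard_for(10, 4))        # → 2  (one server owns the key)
print(replicas_for(10, 4, 2))  # → [2, 3]  (two servers hold copies)
```

With replication factor 2, losing server 2 still leaves a copy of key 10 on server 3, which is the fault-tolerance point above.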

Replication reduces data movement (doesn't eliminate it)

When server crashes:

  • With replication: Copy from backup
  • Without replication: Data lost

Detailed coverage: Future lectures on replication strategies


Key Takeaways

  1. ChatGPT/Transformers are stateless (full context per request)
  2. Cassandra has no load balancer (every node routes)
  3. Binary search at scale: ~30 operations for 640M entries
  4. RAM performance limits routing (3 microseconds)
  5. Mental math tricks essential for interviews
  6. Partitioning is the general process; sharding is one application of it
  7. Replication ≠ sharding (both needed in production)