Implementation Details (Your Questions Answered)
Hash Function Implementation
Q: Do we need 64 different hash functions?
A: No. Use one function with parameterization.
# Don't do this:
def hash_1(server): ...
def hash_2(server): ...
# ... 62 more functions

# Do this instead:
def hash_server(server, key):
    return SHA256(server + str(key))

# Generate 64 virtual spots:
for i in range(1, 65):
    hash_value = hash_server("ServerA", i)
    ring.append((hash_value, "ServerA"))

How it works:
SHA256("ServerA1") = 0x3f2a...
SHA256("ServerA2") = 0x8b1c...
SHA256("ServerA3") = 0x1d94...

Different keys → Completely different outputs
The same idea as a salt in cryptography.
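The pseudocode above can be made runnable with Python's hashlib. This sketch mirrors the names used above; folding the digest into a 64-bit integer is an assumption to match the 2^64 hash space discussed below:

```python
import hashlib

def hash_server(server: str, key: int) -> int:
    """One parameterized hash function instead of 64 separate ones."""
    digest = hashlib.sha256(f"{server}{key}".encode()).hexdigest()
    return int(digest, 16) % (2 ** 64)  # fold into a 64-bit ring position

# Generate 64 virtual spots for one server:
ring = [(hash_server("ServerA", i), "ServerA") for i in range(1, 65)]
ring.sort()  # keep the ring ordered by position

print(len(ring))  # 64
```

Because the spot index is part of the hashed string, each virtual spot lands at an independent position on the ring.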
Collision Probability
Q: With 64 virtual spots per server, won't collisions increase?
A: Yes, but they remain astronomically rare.
Hash space: 2^64 possible values = 18,446,744,073,709,551,616
Collision probability for two random hashes:
P(collision) = 1 / 2^64
             ≈ 0.00000000000000000005

Even with 640 hashes (10 servers × 64), a collision is still extremely unlikely.
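As a sanity check, the chance of any pair colliding among 640 hashes follows the birthday approximation P ≈ n(n−1)/2 · 1/2^64. A back-of-the-envelope sketch (not from the source):

```python
n = 640            # 10 servers × 64 virtual spots
space = 2 ** 64    # size of the hash space

# Birthday approximation: chance that ANY pair of the 640 hashes collides
pairs = n * (n - 1) // 2
p_any = pairs / space
print(f"{p_any:.1e}")  # ≈ 1.1e-14, still astronomically rare
```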
What if collision occurs?
Scenario: Server B and C both hash to position 50
Ring: [..., (50, B), (50, C), ...]

Impact:
- Tie-breaking: Use first occurrence
- User at position 50 → Goes to B
- Server C loses one virtual spot (63 instead of 64)
- Distribution remains excellent
One collision doesn't break the algorithm.
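The tie-breaking rule above can be sketched with a bisect-based clockwise lookup (positions and server names are illustrative):

```python
import bisect

# Sorted ring; B and C collide at position 50
ring = [(10, "A"), (50, "B"), (50, "C"), (80, "A")]
positions = [p for p, _ in ring]

def lookup(position: int) -> str:
    """Route to the first entry at or clockwise of `position`; ties go to the first occurrence."""
    i = bisect.bisect_left(positions, position)
    return ring[i % len(ring)][1]  # wrap around the ring

print(lookup(50))  # B: first occurrence wins, C effectively loses one spot
print(lookup(51))  # A: next position clockwise is 80
```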
Probability of ALL hashes colliding?
Setup: 10 servers, k=64, total=640 hashes
Hash 1: Any value (probability = 1)
Hash 2: Must equal Hash 1 (probability = 1/2^64)
...
Hash 640: Must equal Hash 1 (probability = 1/2^64)
P(all collide) = (1/2^64)^639
= 1 / 2^40,896

How small? The denominator has roughly 12,300 digits. For comparison, the number of atoms in the observable universe (≈ 10^80) has only 81 digits.
Verdict: Mathematically possible, practically impossible.
Database Sharding
Q: Does PostgreSQL automatically shard?
A: No. Traditional SQL databases don't ship with native sharding.
SQL databases without sharding:
- PostgreSQL
- MySQL
- Oracle
- SQLite
Why? Designed for single-server deployments.
Solutions:
- Managed services: AWS RDS, Google Cloud SQL
- Extensions: Citus Data for PostgreSQL
- Manual sharding: Complex, not recommended
Q: Which databases support native sharding?
NoSQL databases:
MongoDB:
// Configure sharding (mongo shell)
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", {"user_id": 1})
// MongoDB routes queries and balances chunks internally

Redis Cluster:
- Automatic sharding
- Client-side routing
- Hash-slot partitioning (16,384 fixed slots play the role of consistent hashing)
Cassandra:
- Multi-master architecture
- Every node acts as coordinator
- Gossip protocol for routing
Session Management
Q: In round robin, do sessions move between servers?
A: No. Sessions stored in database, not app servers.
Architecture:
User Request → App Server (Round Robin) → Database (Sessions)

Flow:
- User authenticates → Session in database
- Request 1 → App Server A → Fetch session from DB
- Request 2 → App Server B → Fetch same session from DB
- Request 3 → App Server C → Fetch same session from DB
Sessions live in database. App servers are stateless.
Session storage options:
- Redis (dedicated session store)
- Main database table
- Memcached
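The stateless flow can be sketched as follows; a dict stands in for the shared store (Redis or a DB table in practice), and all names are illustrative:

```python
import json

# Stand-in for a shared session store (Redis/DB in production);
# any app server can serve any request because state lives here, not on the server.
session_store = {}

def create_session(session_id: str, user: str) -> None:
    session_store[session_id] = json.dumps({"user": user})

def handle_request(app_server: str, session_id: str) -> str:
    """Any app server fetches the same session from the shared store."""
    session = json.loads(session_store[session_id])
    return f"{app_server} served {session['user']}"

create_session("abc123", "alice")
# Round robin sends each request to a different server; the session is the same either way:
print(handle_request("ServerA", "abc123"))  # ServerA served alice
print(handle_request("ServerB", "abc123"))  # ServerB served alice
```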
Server Capacity
Q: Can high-memory requests go to powerful servers?
A: Vanilla consistent hashing doesn't account for capacity.
Standard: All servers treated equally, same k value
Advanced: Weighted consistent hashing
def assign_virtual_spots(server):
    base_k = 32
    ram_multiplier = server.ram_gb / 8
    return int(base_k * ram_multiplier)

# Results:
# 32 GB server → 128 spots
# 16 GB server → 64 spots
#  8 GB server → 32 spots

More spots = More load
Trade-off: Adds complexity (monitoring, dynamic allocation)
Standard practice: Use homogeneous servers (same hardware)
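A self-contained sketch of weighted consistent hashing under these assumptions (the Server type and the 8 GB baseline are illustrative):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    ram_gb: int

def assign_virtual_spots(server: Server, base_k: int = 32) -> int:
    # More RAM → proportionally more virtual spots (8 GB baseline)
    return int(base_k * server.ram_gb / 8)

def build_ring(servers: list) -> list:
    ring = []
    for s in servers:
        for i in range(1, assign_virtual_spots(s) + 1):
            pos = int(hashlib.sha256(f"{s.name}{i}".encode()).hexdigest(), 16) % (2 ** 64)
            ring.append((pos, s.name))
    return sorted(ring)

ring = build_ring([Server("big", 32), Server("small", 8)])
print(len(ring))  # 160 spots total: "big" owns 128, roughly 4x the load of "small"
```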
Data Movement
Q: When adding servers, how does data transfer?
A: Automatically through routing, not manual transfer.
Server addition:
Before: User → Position 45 → Server B (position 50)
Add Server D at position 47
After: User → Position 45 → Server D (position 47)

No manual transfer needed:
- Routing algorithm changes
- User requests go to new server
- Data copied from backups/replicas
- "Sonpari" (replication system) handles it
Covered in detail: Future lectures on replication
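The routing change can be demonstrated directly (positions scaled down for readability; names are illustrative):

```python
import bisect

def route(ring: list, position: int) -> str:
    """First server at or clockwise of `position` on a sorted ring."""
    positions = [p for p, _ in ring]
    i = bisect.bisect_left(positions, position)
    return ring[i % len(ring)][1]  # wrap around the ring

ring = [(20, "A"), (50, "B"), (90, "C")]
print(route(ring, 45))          # B: before the change

bisect.insort(ring, (47, "D"))  # add Server D at position 47
print(route(ring, 45))          # D: same user, new server; only this slice of keys moves
```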
Fan-Out Queries
Q: How to query data across multiple users?
Problem:
- User A data → Server 1
- User B data → Server 2
- Query: "Get Users A, B, C together"
- Must hit 3 servers (expensive)
Solution: Choose better sharding key
Bad sharding key: user_id
Query: "Get all Group 5 users"
→ Hit all servers (scattered users)

Good sharding key: group_id
Query: "Get all Group 5 users"
→ Hit one server (co-located users)

Principle: Shard by access pattern, not just by primary key.
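A toy illustration of the principle; the md5-based shard router and the three-shard count are assumptions:

```python
import hashlib

def shard_for(key, num_shards: int = 3) -> int:
    """Stable hash of the sharding key → shard index."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

group5_users = [("u1", 5), ("u7", 5), ("u42", 5)]  # (user_id, group_id)

# Shard by user_id: the group's rows can scatter across shards
shards_by_user = {shard_for(uid) for uid, _ in group5_users}

# Shard by group_id: every row of Group 5 lands on the same shard
shards_by_group = {shard_for(gid) for _, gid in group5_users}
print(len(shards_by_group))  # 1: one shard answers the whole group query
```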
Key Takeaways
- Parameterized hash functions avoid creating 64 separate functions
- Collisions are extremely rare and don't break algorithm
- SQL databases need external sharding solutions
- NoSQL databases (MongoDB, Redis, Cassandra) have native sharding
- Sessions in database enable stateless app servers
- Weighted hashing possible but adds complexity
- Sharding key selection critical for query performance