Implementation Details (Your Questions Answered)
Hash Function Implementation
Q: Do we need 64 different hash functions?
A: No. Use one function with parameterization.
# Don't do this:
def hash_1(server): ...
def hash_2(server): ...
# ... 62 more functions

# Do this instead:
def hash_server(server, key):
    return SHA256(server + str(key))

# Generate 64 virtual spots:
for i in range(1, 65):
    hash_value = hash_server("ServerA", i)
    ring.append((hash_value, "ServerA"))

How it works:
SHA256("ServerA1") = 0x3f2a...
SHA256("ServerA2") = 0x8b1c...
SHA256("ServerA3") = 0x1d94...

Different keys → Completely different outputs
The same idea as a salt in cryptography.
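The pseudocode above can be made runnable with Python's hashlib. This sketch mirrors the names used above; folding the digest into a 64-bit integer is an assumption to match the 2^64 hash space discussed below:

```python
import hashlib

def hash_server(server: str, key: int) -> int:
    """One parameterized hash function instead of 64 separate ones."""
    digest = hashlib.sha256(f"{server}{key}".encode()).hexdigest()
    return int(digest, 16) % (2 ** 64)  # fold into a 64-bit ring position

# Generate 64 virtual spots for one server:
ring = [(hash_server("ServerA", i), "ServerA") for i in range(1, 65)]
ring.sort()  # keep the ring ordered by position

print(len(ring))  # 64
```

Because the spot index is part of the hashed string, each virtual spot lands at an independent position on the ring.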
Collision Probability
Q: With 64 virtual spots per server, won't collisions increase?
A: Yes, but they remain astronomically rare.
Hash space: 2^64 possible values = 18,446,744,073,709,551,616
Collision probability for two random hashes:
P(collision) = 1 / 2^64
             ≈ 0.00000000000000000005

Even with 640 hashes (10 servers × 64), a collision is still extremely unlikely.
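As a sanity check, the chance of any pair colliding among 640 hashes follows the birthday approximation P ≈ n(n−1)/2 · 1/2^64. A back-of-the-envelope sketch (not from the source):

```python
n = 640            # 10 servers × 64 virtual spots
space = 2 ** 64    # size of the hash space

# Birthday approximation: chance that ANY pair of the 640 hashes collides
pairs = n * (n - 1) // 2
p_any = pairs / space
print(f"{p_any:.1e}")  # ≈ 1.1e-14, still astronomically rare
```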
What if collision occurs?
Scenario: Server B and C both hash to position 50
Ring: [..., (50, B), (50, C), ...]

Impact:
- Tie-breaking: Use first occurrence
- User at position 50 → Goes to B
- Server C loses one virtual spot (63 instead of 64)
- Distribution remains excellent
One collision doesn't break the algorithm.
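The tie-breaking rule above can be sketched with a bisect-based clockwise lookup (positions and server names are illustrative):

```python
import bisect

# Sorted ring; B and C collide at position 50
ring = [(10, "A"), (50, "B"), (50, "C"), (80, "A")]
positions = [p for p, _ in ring]

def lookup(position: int) -> str:
    """Route to the first entry at or clockwise of `position`; ties go to the first occurrence."""
    i = bisect.bisect_left(positions, position)
    return ring[i % len(ring)][1]  # wrap around the ring

print(lookup(50))  # B: first occurrence wins, C effectively loses one spot
print(lookup(51))  # A: next position clockwise is 80
```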
Probability of ALL hashes colliding?
Setup: 10 servers, k=64, total=640 hashes
Hash 1: Any value (probability = 1)
Hash 2: Must equal Hash 1 (probability = 1/2^64)
...
Hash 640: Must equal Hash 1 (probability = 1/2^64)
P(all collide) = (1/2^64)^639
= 1 / 2^40,896

How small? The denominator has roughly 12,300 digits. For comparison, the number of atoms in the observable universe (≈ 10^80) has only 81 digits.
Verdict: Mathematically possible, practically impossible.
Database Sharding
Q: Does PostgreSQL automatically shard?
A: No. Traditional SQL databases don't ship with native sharding.
SQL databases without sharding:
- PostgreSQL
- MySQL
- Oracle
- SQLite
Why? Designed for single-server deployments.
Solutions:
- Managed services: AWS RDS, Google Cloud SQL
- Extensions: Citus Data for PostgreSQL
- Manual sharding: Complex, not recommended
Q: Which databases support native sharding?
NoSQL databases:
MongoDB:
// Configure sharding (mongo shell)
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", {"user_id": 1})
// MongoDB routes queries and balances chunks internally

Redis Cluster:
- Automatic sharding
- Client-side routing
- Hash-slot partitioning (16,384 fixed slots play the role of consistent hashing)
Cassandra:
- Multi-master architecture
- Every node acts as coordinator
- Gossip protocol for routing
Session Management
Q: In round robin, do sessions move between servers?
A: No. Sessions stored in database, not app servers.
Architecture:
User Request → App Server (Round Robin) → Database (Sessions)

Flow:
- User authenticates → Session in database
- Request 1 → App Server A → Fetch session from DB
- Request 2 → App Server B → Fetch same session from DB
- Request 3 → App Server C → Fetch same session from DB
Sessions live in database. App servers are stateless.
Session storage options:
- Redis (dedicated session store)
- Main database table
- Memcached
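The stateless flow can be sketched as follows; a dict stands in for the shared store (Redis or a DB table in practice), and all names are illustrative:

```python
import json

# Stand-in for a shared session store (Redis/DB in production);
# any app server can serve any request because state lives here, not on the server.
session_store = {}

def create_session(session_id: str, user: str) -> None:
    session_store[session_id] = json.dumps({"user": user})

def handle_request(app_server: str, session_id: str) -> str:
    """Any app server fetches the same session from the shared store."""
    session = json.loads(session_store[session_id])
    return f"{app_server} served {session['user']}"

create_session("abc123", "alice")
# Round robin sends each request to a different server; the session is the same either way:
print(handle_request("ServerA", "abc123"))  # ServerA served alice
print(handle_request("ServerB", "abc123"))  # ServerB served alice
```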
Server Capacity
Q: Can high-memory requests go to powerful servers?
A: Vanilla consistent hashing doesn't account for capacity.
Standard: All servers treated equally, same k value
Advanced: Weighted consistent hashing
def assign_virtual_spots(server):
    base_k = 32
    ram_multiplier = server.ram_gb / 8
    return int(base_k * ram_multiplier)

# Results:
# 32 GB server → 128 spots
# 16 GB server → 64 spots
#  8 GB server → 32 spots

More spots = More load
Trade-off: Adds complexity (monitoring, dynamic allocation)
Standard practice: Use homogeneous servers (same hardware)
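A self-contained sketch of weighted consistent hashing under these assumptions (the Server type and the 8 GB baseline are illustrative):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    ram_gb: int

def assign_virtual_spots(server: Server, base_k: int = 32) -> int:
    # More RAM → proportionally more virtual spots (8 GB baseline)
    return int(base_k * server.ram_gb / 8)

def build_ring(servers: list) -> list:
    ring = []
    for s in servers:
        for i in range(1, assign_virtual_spots(s) + 1):
            pos = int(hashlib.sha256(f"{s.name}{i}".encode()).hexdigest(), 16) % (2 ** 64)
            ring.append((pos, s.name))
    return sorted(ring)

ring = build_ring([Server("big", 32), Server("small", 8)])
print(len(ring))  # 160 spots total: "big" owns 128, roughly 4x the load of "small"
```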
Data Movement
Q: When adding servers, how does data transfer?
A: Automatically through routing, not manual transfer.
Server addition:
Before: User → Position 45 → Server B (position 50)
Add Server D at position 47
After: User → Position 45 → Server D (position 47)

No manual transfer needed:
- Routing algorithm changes
- User requests go to new server
- Data copied from backups/replicas
- "Sonpari" (replication system) handles it
Covered in detail: Future lectures on replication
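The routing change can be demonstrated directly (positions scaled down for readability; names are illustrative):

```python
import bisect

def route(ring: list, position: int) -> str:
    """First server at or clockwise of `position` on a sorted ring."""
    positions = [p for p, _ in ring]
    i = bisect.bisect_left(positions, position)
    return ring[i % len(ring)][1]  # wrap around the ring

ring = [(20, "A"), (50, "B"), (90, "C")]
print(route(ring, 45))          # B: before the change

bisect.insort(ring, (47, "D"))  # add Server D at position 47
print(route(ring, 45))          # D: same user, new server; only this slice of keys moves
```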
Fan-Out Queries
Q: How to query data across multiple users?
Problem:
- User A data → Server 1
- User B data → Server 2
- Query: "Get Users A, B, C together"
- Must hit 3 servers (expensive)
Solution: Choose better sharding key
Bad sharding key: user_id
Query: "Get all Group 5 users"
→ Hit all servers (scattered users)

Good sharding key: group_id
Query: "Get all Group 5 users"
→ Hit one server (co-located users)

Principle: Shard by access pattern, not just by primary key.
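A toy illustration of the principle; the md5-based shard router and the three-shard count are assumptions:

```python
import hashlib

def shard_for(key, num_shards: int = 3) -> int:
    """Stable hash of the sharding key → shard index."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_shards

group5_users = [("u1", 5), ("u7", 5), ("u42", 5)]  # (user_id, group_id)

# Shard by user_id: the group's rows can scatter across shards
shards_by_user = {shard_for(uid) for uid, _ in group5_users}

# Shard by group_id: every row of Group 5 lands on the same shard
shards_by_group = {shard_for(gid) for _, gid in group5_users}
print(len(shards_by_group))  # 1: one shard answers the whole group query
```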
Key Takeaways
- Parameterized hash functions avoid creating 64 separate functions
- Collisions are extremely rare and don't break algorithm
- SQL databases need external sharding solutions
- NoSQL databases (MongoDB, Redis, Cassandra) have native sharding
- Sessions in database enable stateless app servers
- Weighted hashing possible but adds complexity
- Sharding key selection critical for query performance