Implementation Details (Your Questions Answered)

March 5, 2026 · 4 min read

Hash Function Implementation

Q: Do we need 64 different hash functions?

A: No. Use one function with parameterization.

```python
import hashlib

# Don't do this:
# def hash_1(server): ...
# def hash_2(server): ...
# ... 62 more functions

# Do this instead: one function, parameterized by the spot index
def hash_server(server, key):
    digest = hashlib.sha256((server + str(key)).encode()).hexdigest()
    return int(digest, 16)

# Generate 64 virtual spots:
ring = []
for i in range(1, 65):
    hash_value = hash_server("ServerA", i)
    ring.append((hash_value, "ServerA"))
```

How it works:

```
SHA256("ServerA1") = 0x3f2a...
SHA256("ServerA2") = 0x8b1c...
SHA256("ServerA3") = 0x1d94...
```

Different keys → Completely different outputs

This is the same idea as salting in cryptography.
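Putting the pieces together, here is a minimal end-to-end sketch: build the ring for three servers, then route a user with a clockwise lookup via `bisect`. The mod-2^64 step is an assumption to match the 64-bit hash space discussed in the next section; server names are illustrative.

```python
import bisect
import hashlib

RING_SPACE = 2**64  # 64-bit ring space

def hash_server(server, key):
    """One parameterized hash function for every virtual spot."""
    digest = hashlib.sha256((server + str(key)).encode()).hexdigest()
    return int(digest, 16) % RING_SPACE

# Build the ring: 64 virtual spots per server, sorted by position
ring = sorted(
    (hash_server(server, i), server)
    for server in ("ServerA", "ServerB", "ServerC")
    for i in range(1, 65)
)

def lookup(user_key):
    """Route a key to the first server clockwise from its hash."""
    pos = int(hashlib.sha256(user_key.encode()).hexdigest(), 16) % RING_SPACE
    idx = bisect.bisect_left(ring, (pos, "")) % len(ring)
    return ring[idx][1]
```

Every request for the same key lands on the same server until the ring itself changes.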


Collision Probability

Q: With 64 virtual spots per server, won't collisions increase?

A: Yes, but they remain astronomically rare.

Hash space: 2^64 possible values = 18,446,744,073,709,551,616

Collision probability for two random hashes:

P(collision) = 1 / 2^64 ≈ 0.00000000000000000005

Even with 640 hashes (10 servers × 64), the birthday bound puts the chance of any pair colliding at roughly 1 in 10^14.

Still extremely unlikely.
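To put a number on "extremely unlikely", a birthday-bound estimate for the 640-hash case (a back-of-the-envelope approximation, not an exact probability):

```python
HASH_SPACE = 2**64
n = 640  # 10 servers × 64 virtual spots

# Birthday bound: with n hashes there are n*(n-1)/2 pairs,
# each colliding with probability 1/HASH_SPACE
p_any_collision = n * (n - 1) / (2 * HASH_SPACE)
print(p_any_collision)  # on the order of 1e-14
```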

What if collision occurs?

Scenario: Server B and C both hash to position 50

Ring: [..., (50, B), (50, C), ...]

Impact:

  • Tie-breaking: Use first occurrence
  • User at position 50 → Goes to B
  • Server C loses one virtual spot (63 instead of 64)
  • Distribution remains excellent

One collision doesn't break the algorithm.
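A sorted ring of (position, server) tuples gives this tie-break for free, because the first occurrence sorts first. A toy sketch using the colliding position from the scenario above (small integer positions are illustrative, not real hash values):

```python
import bisect

# Simulated collision: B and C both hash to position 50
ring = [(10, "A"), (50, "B"), (50, "C"), (90, "A")]

def lookup(position):
    # bisect_left finds the FIRST entry at or after `position`,
    # so the first occurrence wins the tie automatically
    idx = bisect.bisect_left(ring, (position, "")) % len(ring)
    return ring[idx][1]

print(lookup(50))  # "B": first occurrence wins; C keeps its other 63 spots
```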

Probability of ALL hashes colliding?

Setup: 10 servers, k=64, total=640 hashes

```
Hash 1:   any value          (probability = 1)
Hash 2:   must equal Hash 1  (probability = 1/2^64)
...
Hash 640: must equal Hash 1  (probability = 1/2^64)

P(all collide) = (1/2^64)^639 = 1 / 2^40,896
```

How small?

The denominator, 2^40,896, is a number with roughly 12,300 digits.

For comparison, the observable universe contains only about 10^80 atoms.

Verdict: Mathematically possible, practically impossible.


Database Sharding

Q: Does PostgreSQL automatically shard?

A: No. SQL databases don't support native sharding.

SQL databases without sharding:

  • PostgreSQL
  • MySQL
  • Oracle
  • SQLite

Why? Designed for single-server deployments.

Solutions:

  1. Managed services: AWS RDS, Google Cloud SQL
  2. Extensions: Citus Data for PostgreSQL
  3. Manual sharding: Complex, not recommended

Q: Which databases support native sharding?

NoSQL databases:

MongoDB:

```javascript
// Configure sharding
sh.enableSharding("mydb")

// A hashed shard key lets MongoDB distribute documents
// via its own hashing internally
sh.shardCollection("mydb.users", { user_id: "hashed" })
```

Redis Cluster:

  • Automatic sharding
  • Client-side routing
  • Built-in consistent hashing

Cassandra:

  • Multi-master architecture
  • Every node acts as coordinator
  • Gossip protocol for routing

Session Management

Q: In round robin, do sessions move between servers?

A: No. Sessions stored in database, not app servers.

Architecture:

User Request → App Server (Round Robin) → Database (Sessions)

Flow:

  1. User authenticates → Session in database
  2. Request 1 → App Server A → Fetch session from DB
  3. Request 2 → App Server B → Fetch same session from DB
  4. Request 3 → App Server C → Fetch same session from DB

Sessions live in database. App servers are stateless.

Session storage options:

  • Redis (dedicated session store)
  • Main database table
  • Memcached
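A minimal sketch of the flow above, with a plain dict standing in for Redis or a sessions table (the store shape and function names are illustrative assumptions):

```python
import uuid

# Shared store: in production this would be Redis, Memcached,
# or a sessions table, reachable from every app server
session_store = {}

def create_session(user_id):
    """Runs once at login; the session lives in the shared store."""
    session_id = str(uuid.uuid4())
    session_store[session_id] = {"user_id": user_id}
    return session_id

def handle_request(app_server, session_id):
    """Any app server can serve any request: it holds no session state."""
    session = session_store[session_id]
    return f"{app_server} served {session['user_id']}"
```

Round robin can send consecutive requests to A, B, then C; each server fetches the same session from the shared store.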

Server Capacity

Q: Can high-memory requests go to powerful servers?

A: Vanilla consistent hashing doesn't account for capacity.

Standard: All servers treated equally, same k value

Advanced: Weighted consistent hashing

```python
def assign_virtual_spots(server):
    base_k = 32
    ram_multiplier = server.ram_gb / 8
    return int(base_k * ram_multiplier)

# Results:
# 32 GB server → 128 spots
# 16 GB server → 64 spots
#  8 GB server → 32 spots
```

More spots = More load

Trade-off: Adds complexity (monitoring, dynamic allocation)

Standard practice: Use homogeneous servers (same hardware)
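As a sanity check on the weighted scheme, building a ring for a mixed fleet shows the spot counts scaling with RAM (server names, sizes, and the 8 GB baseline are illustrative):

```python
import hashlib
from collections import Counter

def virtual_spots(ram_gb, base_k=32):
    # base_k=32 with an 8 GB baseline mirrors the sketch above
    return int(base_k * ram_gb / 8)

servers = {"big": 32, "mid": 16, "small": 8}  # name -> RAM in GB
ring = sorted(
    (int(hashlib.sha256(f"{name}{i}".encode()).hexdigest(), 16), name)
    for name, ram in servers.items()
    for i in range(virtual_spots(ram))
)

spots = Counter(name for _, name in ring)
print(spots)  # big: 128, mid: 64, small: 32
```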


Data Movement

Q: When adding servers, how does data transfer?

A: Automatically through routing, not manual transfer.

Server addition:

```
Before: User → Position 45 → Server B (position 50)

Add Server D at position 47

After:  User → Position 45 → Server D (position 47)
```

No manual transfer needed:

  1. Routing algorithm changes
  2. User requests go to new server
  3. Data copied from backups/replicas
  4. "Sonpari" (replication system) handles it

Covered in detail: Future lectures on replication
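The routing change can be demonstrated on a toy ring with the positions from the example (small integer positions stand in for real hash values):

```python
import bisect

ring = [(10, "A"), (50, "B"), (90, "C")]

def lookup(pos):
    """First server clockwise from pos, wrapping past the end."""
    idx = bisect.bisect_left(ring, (pos, "")) % len(ring)
    return ring[idx][1]

routed_before = lookup(45)  # "B" (Server B sits at position 50)

# Add Server D at position 47: only the routing table changes;
# data for the affected range is backfilled from replicas
bisect.insort(ring, (47, "D"))
routed_after = lookup(45)   # "D" (Server D now sits at position 47)
```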


Fan-Out Queries

Q: How to query data across multiple users?

Problem:

  • User A data → Server 1
  • User B data → Server 2
  • Query: "Get Users A, B, C together"
  • Must hit 3 servers (expensive)

Solution: Choose better sharding key

Bad sharding key: user_id

Query: "Get all Group 5 users" → Hit all servers (scattered users)

Good sharding key: group_id

Query: "Get all Group 5 users" → Hit one server (co-located users)

Principle: Shard by access pattern, not just by primary key.
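A small sketch of the difference; the shard count and keys are illustrative, and simple modulo routing stands in for whatever placement scheme the database uses:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key):
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % NUM_SHARDS

users = [{"user_id": f"u{i}", "group_id": "group5"} for i in range(6)]

# Sharding by user_id tends to scatter one group across shards...
shards_by_user = {shard_for(u["user_id"]) for u in users}

# ...while sharding by group_id co-locates the whole group
shards_by_group = {shard_for(u["group_id"]) for u in users}
print(len(shards_by_group))  # 1: every "Group 5" read hits one shard
```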


Key Takeaways

  1. Parameterized hash functions avoid creating 64 separate functions
  2. Collisions are extremely rare and don't break algorithm
  3. SQL databases need external sharding solutions
  4. NoSQL databases (MongoDB, Redis, Cassandra) have native sharding
  5. Sessions in database enable stateless app servers
  6. Weighted hashing possible but adds complexity
  7. Sharding key selection critical for query performance