Replication & Sharding
Replication = multiple copies of the same data for availability/read scaling. Sharding = splitting data across nodes for write scaling. Both are essential at scale.
**Replication** = multiple identical copies of the same data on different nodes. Gives you (a) **availability**: if the primary dies, a replica takes over, and (b) **read scale**: read traffic spreads across replicas.
**Sharding** = split the data across nodes so that each node owns a slice. Gives you **write scale**: 10 shards can handle roughly 10× the writes of 1, provided the load spreads evenly. Each shard is itself replicated for availability.
**Leader-follower replication** (most common): one primary handles all writes, replicas copy the primary's log asynchronously. Readers can hit replicas for load distribution. Failover = promote a replica to primary when primary dies. Examples: Postgres streaming replication, MySQL replication, MongoDB replica sets.
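The write path, asynchronous log shipping, and failover described above can be sketched in a few lines. This is a toy in-memory model, not how Postgres, MySQL, or MongoDB implement it; `Node`, `Cluster`, `replicate`, and `failover` are hypothetical names chosen for illustration.

```python
class Node:
    """One database node: an ordered write log plus the state built from it."""

    def __init__(self, name):
        self.name = name
        self.log = []    # ordered (key, value) entries
        self.data = {}   # current key -> value state

    def apply(self, entry):
        key, value = entry
        self.log.append(entry)
        self.data[key] = value


class Cluster:
    """Hypothetical leader-follower cluster with asynchronous replication."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def write(self, key, value):
        # All writes go to the primary; replicas are not touched here.
        self.primary.apply((key, value))

    def replicate(self):
        # The asynchronous step: ship each replica the log entries it is missing.
        for replica in self.replicas:
            for entry in self.primary.log[len(replica.log):]:
                replica.apply(entry)

    def failover(self):
        # Promote the most up-to-date replica. Writes that were never
        # shipped to it are lost, which is the cost of async replication.
        new_primary = max(self.replicas, key=lambda r: len(r.log))
        self.replicas.remove(new_primary)
        self.primary = new_primary
        return new_primary
```

If a write lands on the primary and the primary dies before `replicate()` runs, the promoted replica never sees that write. That window is the trade-off asynchronous replication makes for fast writes.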
**Gotcha with async replication**: replica lag. User writes to primary, immediately reads from replica — might not see their own write yet (replica is a few hundred ms behind). Fix: **read-your-writes** — route that user's reads to primary for a short window; or **synchronous replication** (primary waits for replica to confirm before ack, slower writes).
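One way to picture the read-your-writes fix is a small router that remembers when each user last wrote. This is a minimal sketch under assumptions: the class name, the one-second default window, and the `"primary"` / `"replica"` return values are all illustrative, and a real system would size the window from its observed replica lag.

```python
import time


class ReadYourWritesRouter:
    """Sends a user's reads to the primary for a short window after they write."""

    def __init__(self, window_seconds=1.0, clock=time.monotonic):
        self.window_seconds = window_seconds   # assumed to exceed typical replica lag
        self.clock = clock                     # injectable so the logic is testable
        self.last_write_at = {}                # user_id -> time of most recent write

    def record_write(self, user_id):
        self.last_write_at[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self.last_write_at.get(user_id)
        if last is not None and self.clock() - last < self.window_seconds:
            return "primary"   # recent writer: the replica may still be behind
        return "replica"       # everyone else: spread the read load
```

Only users who wrote recently pay the cost of hitting the primary, so replicas still absorb most of the read traffic.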
Sharding keys: how do you decide which shard owns which data? Range sharding (IDs 0-999 on shard A, 1000-1999 on shard B: simple, but risks hot shards if data isn't uniform). Hash sharding (hash(id) % N: uniform distribution, but range queries get harder because adjacent keys land on different shards). Directory-based (a lookup table: flexible, but adds a dependency that every request must consult).
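The three strategies can be compared side by side. The sketch below is illustrative only: the function names, the shard labels, and the choice of SHA-256 as the hash are assumptions, not something the text prescribes.

```python
import bisect
import hashlib


def range_shard(key, boundaries, shards):
    """Range sharding. `boundaries` are sorted exclusive upper bounds,
    so [1000, 2000] means A: key < 1000, B: 1000-1999, C: key >= 2000.
    `shards` must have exactly one more entry than `boundaries`."""
    return shards[bisect.bisect_right(boundaries, key)]


def hash_shard(key, shards):
    """Hash sharding: hash(key) % N. A stable hash is used because
    Python's built-in hash() of strings is randomized per process."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def directory_shard(key, directory):
    """Directory-based sharding: an explicit lookup table decides placement."""
    return directory[key]
```

For example, `range_shard(1500, [1000, 2000], ["A", "B", "C"])` returns `"B"`, while `hash_shard` scatters neighbouring IDs across shards, which is why a range scan under hash sharding has to query every shard.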
**Sharding is painful**. Cross-shard joins/transactions are hard. Resharding (adding nodes) requires data movement. Consider it the **last** scaling resort, after exhausting vertical scaling + replication. NewSQL (Spanner, CockroachDB) hides sharding behind SQL, at an operational cost.
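To see why resharding means data movement, it helps to count how many keys change owner when a node is added under plain `id % N` hash sharding. A minimal sketch, assuming integer IDs and a hypothetical helper `moved_fraction`:

```python
def moved_fraction(keys, old_n, new_n):
    """Fraction of keys whose owning shard changes when the
    shard count goes from old_n to new_n under key % N placement."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_n != k % new_n)
    return moved / len(keys)


# Growing a 4-shard cluster to 5 shards: a key stays put only when
# k % 4 == k % 5, which holds for 4 out of every 20 consecutive IDs.
fraction = moved_fraction(range(10_000), 4, 5)
print(fraction)  # 0.8
```

Adding one node to four relocates about 80% of the data in this scheme, which is a large part of why resharding is treated as expensive.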
Grounded on https://martin.kleppmann.com/