Dualo
System Design Essentials

Replication & Sharding

Replication = multiple copies of the same data for availability/read scaling. Sharding = splitting data across nodes for write scaling. Both are essential at scale.

2 min read

**Replication topologies**: **single-leader** (the norm — Postgres, MySQL, MongoDB, Redis replica): one primary, N followers; writes go to primary, replicate to followers; failover via leader election. **Multi-leader** (geo-distributed writes — BDR, PostgreSQL pglogical, CRDB multi-region): any node accepts writes; conflict resolution required (last-write-wins, CRDTs). **Leaderless** (Dynamo-style: Cassandra, Riak): client writes to N nodes, reads from N nodes, with quorums; no single leader.

Replication methods: statement-based (replay SQL — easy but non-deterministic statements like NOW() or RAND() diverge); row-based (replay physical row changes — MySQL binlog, deterministic); WAL/physical (stream the write-ahead log — Postgres streaming replication, fastest, bit-identical replicas); logical (decoded WAL → typed events, cross-version, subscriber-friendly).

**Replication lag implications**: async replication → followers trail primary by ms-seconds under load. **Replication-lag-aware clients**: Postgres GUC `hot_standby_feedback`, client routing that tracks per-user 'last write LSN' and routes reads to a replica only if caught up past that LSN. **Consistency levels (reader's choice)**: strong (always primary), eventual (any replica), read-your-writes (primary for some window post-write), bounded-staleness (replica if lag < N seconds).

Failover semantics: automated (Patroni for Postgres, Orchestrator for MySQL, MongoDB replica set native) vs manual. Key risks: split-brain (two primaries after a network partition — data divergence, hard to reconcile; use fencing/STONITH), lost committed writes (async replication + failover to a follower that didn't receive the last commits). Synchronous replication eliminates lost-writes at the cost of write availability during follower outages.

**Sharding strategies**: (i) **range sharding** — keys bucketed into ranges; easy range queries; risk of hot spots if data skew; range rebalancing (splits) needed; (ii) **hash sharding** — hash(key) % N or ; uniform distribution; range queries expensive (fan-out); (iii) **consistent hashing** — each node owns a hash-ring slice; adding/removing nodes moves only 1/N of data (vs N% in modulo hashing); (iv) **directory-based** — explicit shard map in a lookup service; maximum flexibility, introduces SPOF.

Hot spots (skew): celebrity keys (Taylor Swift's profile, Bitcoin price ticker, /Users/admin), time-ordered keys with recent-hot pattern (IoT). Fix: add a random suffix to spread (+0..63 → 64 sub-shards), pre-split the known hot range, read through a hot-key cache.

Resharding: the feared operation. Add 2 more shards to an existing 10-shard cluster. Approaches: (i) double-up (pre-split for known growth — start with 32 logical shards on 8 physical nodes, scale to 32 physical later); (ii) online live migration (dual-write to old and new, backfill, cut over; hours-to-days for TB-scale); (iii) offline (downtime window — rare in practice). NewSQL platforms abstract this.

Cross-shard transactions: 2-phase commit (2PC) is possible but slow and deadlock-prone. Alternatives: saga pattern (compensating transactions), eventual consistency with idempotent operations, Spanner-style distributed transactions with TrueTime. Avoid cross-shard transactions in high-volume paths by schema design (co-locate frequently-joined data on the same shard).

Operational cost honesty: sharding adds complexity across the stack — data modeling, query routing, migrations, monitoring, disaster recovery. Estimate: 2-3 engineers of work to reach steady-state on a sharded Postgres, ongoing maintenance. Exhaust vertical + replicas before sharding; consider NewSQL or sharding service (Vitess for MySQL, Citus for Postgres) before rolling your own.

Diagram

Grounded on https://martin.kleppmann.com/

Next up

Message Queues & async processing

Decouple producers from consumers with a queue in between. Smooths traffic spikes, enables async work, and isolates failures.