Scalability — vertical vs horizontal, stateless services
Scaling a system means serving more load without falling over. The choice — bigger machines or more machines — defines the rest of the architecture.
Scalability axes: throughput (req/s), concurrency (simultaneous requests), data volume, latency-at-load (p50/p95/p99 under peak). Each axis has different bottlenecks: a system that sustains 10k req/s with p50 = 5 ms may still see p99 climb to 5000 ms when concurrency spikes, so the median alone can hide a collapse in the tail.
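A minimal sketch of why tail percentiles matter, assuming a made-up sample set in which 2% of requests are slow. The function name and the sample values are illustrative and not measurements from any real system.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p95, p99) for a list of latency samples in milliseconds."""
    # quantiles(n=100) yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100)
    return cuts[49], cuts[94], cuts[98]

# Hypothetical load test: 980 fast requests at 5 ms, 20 slow requests at 5000 ms.
samples = [5.0] * 980 + [5000.0] * 20
p50, p95, p99 = latency_percentiles(samples)
print(f"p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

With these samples p50 and p95 both stay at 5 ms while p99 is 5000 ms, which is the gap between "typical" and "at load" that the latency axis is meant to capture.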
Amdahl's Law: the speedup from parallelization is capped by the serial fraction. If 10% of a workload must run serially, max speedup = 10×. Applied to architecture: even fully-horizontal compute is gated by shared-state synchronization (DB writes, distributed locks).
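The 10× cap can be checked numerically. The sketch below uses the standard form of Amdahl's Law, speedup(N) = 1 / (s + (1 - s) / N), where s is the serial fraction and N the number of workers; the function names are chosen for illustration.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law for a serial fraction and worker count."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

def amdahl_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as the worker count grows without limit."""
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction

for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.10, n), 2))
print("limit:", amdahl_limit(0.10))
```

For a 10% serial fraction this gives roughly 1.0, 5.26, 9.17 and 9.91 for 1, 10, 100 and 1000 workers: the last 900 workers buy less than one extra unit of speedup.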
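Universal Scalability Law (Gunther): throughput(N) = N / (1 + α(N-1) + βN(N-1)), expressed relative to the throughput of a single node. α = contention (queuing for a shared resource), β = coherency delay (cross-node synchronization). With contention alone, throughput levels off; once β > 0 it peaks near N = sqrt((1-α)/β), and beyond that point adding nodes DECREASES throughput because the coherency term grows quadratically. Classic example: Postgres turning pathological when max_connections is pushed past roughly 500 without connection pooling.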
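A small sketch of the formula, useful for seeing where throughput turns over. The α and β values are invented for illustration; in practice they are fitted to measured throughput. The peak expression follows from setting the derivative of throughput(N) to zero.

```python
import math

def usl_throughput(n: float, alpha: float, beta: float) -> float:
    """Relative throughput at n nodes under the Universal Scalability Law."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

def usl_peak_nodes(alpha: float, beta: float) -> float:
    """Node count at which throughput is maximal (requires beta > 0)."""
    if beta <= 0:
        raise ValueError("beta must be > 0 for a finite peak")
    return math.sqrt((1 - alpha) / beta)

# Assumed coefficients, not taken from any real system.
ALPHA, BETA = 0.05, 0.0001

peak = usl_peak_nodes(ALPHA, BETA)
for n in (10, 50, round(peak), 200, 400):
    print(n, round(usl_throughput(n, ALPHA, BETA), 2))
```

With these assumed coefficients the peak sits just under 100 nodes; at 200 and 400 nodes the predicted throughput is lower than at the peak, which is the retrograde region the law describes.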
Stateless tier design: instance-local state restricted to per-request cache. Session state externalized — Redis, DynamoDB, Postgres with short TTL. Enables: (a) auto-scaling (add/remove instances freely), (b) rolling deploys, (c) failure tolerance (kill any instance, traffic re-routes).
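The stateless pattern can be sketched as follows. `FakeExternalStore` is a hypothetical in-process stand-in for an external store such as Redis, and `ApiInstance` and `handle_add_to_cart` are illustrative names, not part of any library. The point is only that instances keep no session data of their own.

```python
import time

class FakeExternalStore:
    """Hypothetical stand-in for an external session store with TTL support."""

    def __init__(self):
        self._data = {}  # key -> (expiry on the monotonic clock, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired: behave as if the key was never set
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (time.monotonic() + ttl_seconds, value)

class ApiInstance:
    """One stateless API instance: all session state lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def handle_add_to_cart(self, session_id, item):
        session = self._store.get(session_id) or {"cart": []}
        updated = {"cart": session["cart"] + [item]}
        self._store.set(session_id, updated, ttl_seconds=1800)
        return {"served_by": self.name, "cart": updated["cart"]}

store = FakeExternalStore()
api_1 = ApiInstance("api-1", store)
api_2 = ApiInstance("api-2", store)

api_1.handle_add_to_cart("sess-42", "book")
del api_1  # simulate killing an instance; the session survives in the store
response = api_2.handle_add_to_cart("sess-42", "pen")
print(response)
```

Because the second instance reads the cart from the shared store, the load balancer can send any request to any instance, which is what makes auto-scaling and rolling deploys safe. The read-then-write in the handler is not atomic; a real store would need a transaction or an atomic update for concurrent requests on the same session.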
Partitioning (data) and replication (compute) are the two horizontal levers. Partitioning splits data across nodes (each node owns a slice). Replication copies state across nodes (each node serves the same slice). Most systems do both: partition for capacity, replicate for availability.
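A minimal sketch of combining the two levers, assuming simple hash-modulo partitioning and replicas placed on consecutive nodes. The node names, function names and placement rule are illustrative; real systems often use consistent hashing or a placement service instead.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def replicas_for(partition: int, nodes: list, replication_factor: int) -> list:
    """Pick replication_factor distinct nodes for a partition (first is primary)."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    return [nodes[(partition + i) % len(nodes)] for i in range(replication_factor)]

# Hypothetical cluster: 4 nodes, 8 partitions, 3 copies of each partition.
NODES = ["node-a", "node-b", "node-c", "node-d"]
NUM_PARTITIONS = 8

for key in ("user:1", "user:2", "order:977"):
    p = partition_for(key, NUM_PARTITIONS)
    print(key, "-> partition", p, "on", replicas_for(p, NODES, 3))
```

Partitioning decides which slice a key belongs to; replication decides how many nodes hold that slice. Plain modulo hashing remaps most keys when the partition count changes, which is one reason the partition count is usually fixed up front or replaced by consistent hashing.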
Scaling phases (heuristic): (1) 1 server, 1 DB — works up to ~100 req/s; (2) split compute from DB, add CDN for static — ~1k req/s; (3) multi-instance stateless API behind LB, read replicas on DB — ~10k req/s; (4) cache tier, async queues, sharded DB — ~100k req/s; (5) service decomposition, per-service data store, multi-region — FAANG territory.
Anti-patterns: premature microservices (too many services before vertical exhaustion), cargo-cult Kubernetes (complexity cost > scaling gain for most apps), scaling before measuring (adding instances masks inefficient code — fix the code first).
Grounded on https://martinfowler.com/articles/patterns-of-distributed-systems/