Dualo
System Design Essentials

Load Balancing — L4 vs L7, algorithms, health checks

The traffic cop in front of your service pool. Decides which backend handles each request, detects dead ones, and keeps things flowing.

OSI layers for LB: an L4 load balancer operates on TCP/UDP. It forwards packets or TCP streams without parsing the payload, so it is fast, adds little latency, and is protocol-agnostic. Used for raw TCP services, MySQL proxies, gaming. An L7 load balancer terminates the client connection and parses HTTP(S), so it can route by method, path, header, cookie, or body, modify requests, and offload SSL/TLS. Most HTTP-based web traffic passes through an L7 load balancer at some point.
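
To make the difference concrete, here is a minimal sketch of the information each layer can use when picking a backend. The pool addresses, the `/api/` routing rule, and the function names `pick_l4` and `pick_l7` are hypothetical and chosen only for illustration; real load balancers do this in the kernel or in a proxy, not in application code like this.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical backend pools, for illustration only.
BACKENDS = ["10.0.0.1:8080", "10.0.0.2:8080"]
API_BACKENDS = ["10.0.1.1:8080", "10.0.1.2:8080"]


def _stable_index(key: str, n: int) -> int:
    """Map a string key to an index in [0, n) using a stable hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def pick_l4(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
            proto: str = "tcp") -> str:
    """L4 view: only the connection 5-tuple is visible, never the payload."""
    key = f"{proto}|{src_ip}:{src_port}|{dst_ip}:{dst_port}"
    return BACKENDS[_stable_index(key, len(BACKENDS))]


@dataclass
class HttpRequest:
    method: str
    path: str
    headers: dict = field(default_factory=dict)


def pick_l7(req: HttpRequest) -> str:
    """L7 view: the request is parsed, so path and headers can drive routing."""
    pool = API_BACKENDS if req.path.startswith("/api/") else BACKENDS
    # Prefer the cookie as the hashing key so one session sticks to one backend.
    key = req.headers.get("cookie", req.path)
    return pool[_stable_index(key, len(pool))]
```

The L4 function cannot tell an API call from a static asset request, because both look like the same kind of TCP connection. Only the L7 function can send them to different pools.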

Algorithms in depth: round-robin — simple, uneven load on heterogeneous backends. Least connections — better on long-lived connections (WebSockets). Least response time — rare, requires backend latency probe. Consistent hashing — hash the client key (IP, user id, session id) to a backend; adding/removing a node minimally re-shuffles keys. Used for cache-aware routing, sticky sessions, sharded services.
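
The "minimally re-shuffles keys" property of consistent hashing is easier to see in code. The following is a minimal sketch of a hash ring with virtual nodes. The class name `HashRing`, the `vnodes` parameter, and the choice of SHA-256 are assumptions made for this example, not something the source prescribes.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: each node owns many points on a circular hash space."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes   # virtual nodes per backend, smooths the distribution
        self._ring = []        # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key: str) -> str:
        """Return the node owning the first ring point at or after hash(key)."""
        if not self._ring:
            raise LookupError("no backends in the ring")
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):   # wrap around the end of the ring
            idx = 0
        return self._ring[idx][1]
```

With this structure, adding a backend only moves the keys that now fall on the new backend's points; every other key keeps its previous owner. A plain `hash(key) % n` scheme, by contrast, remaps most keys whenever `n` changes.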

Health check mechanics: passive (observe real-traffic failures: if a backend returns 5xx or times out, mark it unhealthy) and active (synthetic probes, e.g. GET /health every N seconds). Thresholds: mark unhealthy after K consecutive failures; mark healthy again after M consecutive successes (hysteresis to avoid flapping). Deep health checks (test DB reachability, downstream APIs) vs shallow ones (the service process is running and responding).
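
The K-failures / M-successes hysteresis can be sketched as a small state machine. The class name `HealthTracker` and the default thresholds are hypothetical; the same `record` method could be fed by active probe results or by passive observations of real traffic.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one backend's health with consecutive-result thresholds."""
    fail_threshold: int = 3   # K: consecutive failures before marking unhealthy
    rise_threshold: int = 2   # M: consecutive successes before marking healthy
    healthy: bool = True
    _fails: int = 0
    _successes: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe or request outcome and return the current state."""
        if ok:
            self._successes += 1
            self._fails = 0   # any success breaks the failure streak
            if not self.healthy and self._successes >= self.rise_threshold:
                self.healthy = True
        else:
            self._fails += 1
            self._successes = 0   # any failure breaks the success streak
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy
```

Because a single opposite result resets the streak, one slow probe does not eject a healthy backend, and one lucky success does not readmit a failing one.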

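SSL/TLS termination: L7 LBs decrypt HTTPS and forward plain HTTP to backends. Pro: offloads crypto from the app and centralizes certificate management. Con: traffic between LB and backend is unencrypted, which is often accepted inside a private network such as a VPC but not across untrusted networks. Many modern setups therefore keep TLS on both hops: the LB terminates the client connection and re-encrypts to the backend, optionally with mTLS.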

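Anycast + global LB: the same IP address is advertised via BGP from many locations, so the network routes each client to the topologically closest LB, which is usually also the geographically closest. Used by Cloudflare, GCP Global LB, and AWS Global Accelerator (CloudFront, by contrast, has traditionally steered clients via DNS). Reduces latency for global audiences.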

Graceful deploys (connection draining): when removing a backend (deploy, scale-down), stop sending it new connections and let existing ones complete within a drain window (typically 30-60s). Without draining, in-flight requests are cut off and users see connection resets or 5xx errors.
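
A minimal sketch of the draining logic, assuming the LB tracks in-flight requests per backend. The class name `DrainableBackend` and its methods are hypothetical; managed load balancers expose this as a configuration setting rather than as code.

```python
import threading


class DrainableBackend:
    """Refuses new requests once draining starts, then waits for in-flight ones."""

    def __init__(self, name: str):
        self.name = name
        self.draining = False
        self._in_flight = 0
        self._cond = threading.Condition()

    def try_acquire(self) -> bool:
        """Called before routing a request here; False means pick another backend."""
        with self._cond:
            if self.draining:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Called when a request routed to this backend has finished."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout: float = 30.0) -> bool:
        """Stop new traffic and wait up to `timeout` seconds for in-flight work.

        Returns True if the backend went idle, False if the window expired.
        """
        with self._cond:
            self.draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

If `drain` returns False, the remaining connections are the ones that would be forcibly closed when the backend is shut down, so the timeout is the trade-off between deploy speed and dropped long-lived requests.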

Rate limiting / throttling: often done at the LB (Nginx limit_req, ALB combined with AWS WAF rate-based rules, Cloudflare rate limiting rules). This centralizes rate logic before requests reach the backend. Implementations typically use token-bucket or leaky-bucket algorithms.
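
As an illustration of the token-bucket idea, here is a minimal single-process sketch. The class name `TokenBucket` and the injectable `clock` parameter are assumptions for this example; a real LB would keep one bucket per client key and often share state across instances.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock              # injectable so the bucket can be tested
        self._tokens = float(capacity)   # start full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request fits in the budget."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A rejected request would typically be answered with HTTP 429 at the LB, so it never consumes backend capacity.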

When you DON'T need an LB: serverless platforms (Cloud Run, Lambda, Vercel) hide the LB — it exists but you don't operate it. Single-instance apps don't need one. The decision is architectural: are you running >1 backend?

Grounded on https://www.nginx.com/resources/glossary/load-balancing/
