Dualo
Backend Architectures Deep Dive

Concurrency models — threads, events, goroutines

How a backend handles many simultaneous requests: thread-per-request, event loop, goroutines, or worker pools. Each model has a very different scaling ceiling.


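Thread-per-request reality: OS threads aren't free. Each thread reserves a stack (defaults are typically 1–8 MB of virtual address space), a context switch costs on the order of microseconds (often quoted as up to ~10 μs once cache effects are included), and scheduling contention grows non-linearly with thread count. At around 10k concurrent threads, a large share of CPU time can go to switching rather than useful work. This is the classic C10k problem: straightforward to solve with event-driven I/O, expensive to solve with threads alone. CPython adds the GIL on top: only one thread executes Python bytecode at a time, so threads help only while a thread is waiting on I/O (blocking calls release the GIL). For CPU-bound Python code, threads give no speedup and usually run slower than a single thread because of GIL contention. (Free-threaded builds, experimental since CPython 3.13, relax this.)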
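The GIL behaviour described above can be observed with a small timing experiment. This is a minimal sketch, assuming CPython with the GIL enabled; io_task, cpu_task, and timed are hypothetical names made up for the illustration, and time.sleep stands in for a real blocking I/O call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def io_task(delay: float = 0.2) -> None:
    """Stand-in for a blocking I/O call; time.sleep releases the GIL."""
    time.sleep(delay)


def cpu_task(n: int = 200_000) -> int:
    """Pure-Python arithmetic; the thread holds the GIL while it computes."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(fn, jobs: int, threads: int) -> float:
    """Run fn `jobs` times on a pool of `threads` threads; return wall-clock seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(fn) for _ in range(jobs)]
        for future in futures:
            future.result()  # re-raise any exception from the worker
    return time.perf_counter() - start


if __name__ == "__main__":
    for name, fn in (("io", io_task), ("cpu", cpu_task)):
        serial = timed(fn, jobs=4, threads=1)
        threaded = timed(fn, jobs=4, threads=4)
        print(f"{name}: 1 thread {serial:.2f}s, 4 threads {threaded:.2f}s")
```

On a GIL build, the I/O case should finish roughly four times faster with four threads, while the CPU case should show little or no improvement. Exact numbers depend on the machine and interpreter version.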

Event loop mechanics (reactor pattern): the runtime maintains a queue of ready coroutines and a set of file descriptors registered with epoll/kqueue. Each coroutine runs until it awaits I/O; the runtime then registers the FD with the kernel and picks the next ready coroutine. When the kernel signals readiness, the original coroutine goes back on the ready queue and resumes. One thread, massive concurrency while waiting on I/O. Two absolute rules: (a) never block the loop (synchronous time.sleep, requests.get, or fs.readFileSync stall every in-flight request); (b) CPU-bound work starves other requests, so move it off the loop. In Node, use worker_threads. In Python, asyncio.to_thread suits blocking calls, but under the GIL, CPU-heavy Python code needs a process pool (loop.run_in_executor with a ProcessPoolExecutor) to get real parallelism.
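A minimal asyncio sketch of these mechanics follows. It assumes Python 3.9+ (for asyncio.to_thread); fetch and legacy_blocking_call are hypothetical stand-ins, with asyncio.sleep playing the role of a socket wait and time.sleep the role of a synchronous library call that cannot be awaited.

```python
import asyncio
import time


async def fetch(name: str, delay: float) -> str:
    """Hypothetical I/O-bound handler; awaiting yields control to the loop."""
    await asyncio.sleep(delay)  # the loop runs other coroutines meanwhile
    return name


def legacy_blocking_call(delay: float) -> str:
    """Hypothetical synchronous call; it would stall the loop if called directly."""
    time.sleep(delay)
    return "legacy"


async def main() -> tuple[list[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        fetch("a", 0.2),
        fetch("b", 0.2),
        fetch("c", 0.2),
        # Blocking call pushed to a worker thread so the loop stays free.
        asyncio.to_thread(legacy_blocking_call, 0.2),
    )
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

All four waits overlap, so the total should be close to the longest single delay rather than the sum. Calling legacy_blocking_call(0.2) directly inside a coroutine would instead freeze every other coroutine for that time.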

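Goroutines / M:N scheduling: Go's runtime starts each goroutine with a tiny stack (2 KB, grown as needed) and multiplexes up to millions of goroutines onto a small set of OS threads; GOMAXPROCS caps how many of those threads execute Go code at once. On a channel wait or network I/O, the runtime parks the goroutine and runs another on the same thread; on a blocking syscall, it hands the remaining runnable goroutines to another thread. You write resp, err := http.Get(url): code that looks blocking but is served by non-blocking I/O under the hood. JVM virtual threads (final in JDK 21, Project Loom) bring the same model to Java. Erlang's BEAM VM has worked this way since the 1980s, with processes that start at roughly 300 machine words (a few KB), and Elixir inherits it.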

BEAM (Erlang/Elixir) specifics: a preemptive scheduler based on reduction counting (Go relied on cooperative preemption points until Go 1.14 added asynchronous preemption), a per-process heap (no shared mutable state between processes; they communicate by message passing), and supervisor trees for fault tolerance. WhatsApp reported more than 2 million concurrent TCP connections on a single server in 2012. Phoenix (the Elixir web framework) inherits this: a LiveView socket per user is cheap.

Pre-fork worker pool nuances: the process count is orthogonal to the concurrency model inside each worker. Common combinations: (a) Gunicorn sync workers = N processes × 1 thread each: simple, each worker blocks during I/O, good for CPU-bound work. (b) Gunicorn + gthread = N processes × M threads each: every process can overlap up to M I/O waits. (c) Gunicorn + uvicorn.workers.UvicornWorker = N processes, each running one event loop: fully async. (d) Puma cluster mode = N processes × M threads. Match the worker type to the workload: synchronous DB access with low concurrency → sync; many slow HTTP calls → event loop or threaded; heavy CPU with a need for parallelism → more processes.
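Combinations (a) to (c) map onto a handful of Gunicorn settings. The sketch below is one way a gunicorn.conf.py could switch between them, not a tuned production config. WORKER_PROFILE is a hypothetical environment variable invented for this example, and the thread count of 8 and the bind address are placeholder assumptions; bind, workers, worker_class, and threads are Gunicorn's own setting names.

```python
# gunicorn.conf.py: illustrative sketch only.
import multiprocessing
import os

# WORKER_PROFILE is a hypothetical variable used only in this sketch.
profile = os.environ.get("WORKER_PROFILE", "sync")

cores = multiprocessing.cpu_count()
bind = "127.0.0.1:8000"  # placeholder address

if profile == "sync":
    # (a) N processes x 1 thread each
    worker_class = "sync"
    workers = cores
elif profile == "threaded":
    # (b) N processes x M threads each
    worker_class = "gthread"
    workers = cores
    threads = 8  # assumed starting point; tune from measurements
elif profile == "async":
    # (c) N processes, one event loop each; needs uvicorn and an ASGI app
    worker_class = "uvicorn.workers.UvicornWorker"
    workers = cores
else:
    raise ValueError(f"unknown WORKER_PROFILE: {profile!r}")
```

With this file in place, something like WORKER_PROFILE=threaded gunicorn -c gunicorn.conf.py myapp:app would start the threaded variant, where myapp:app is a placeholder for the real application module.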

Choosing N (worker count) and M (threads per worker): start with N = CPU cores for CPU-bound work, or about 2 × cores for mixed I/O (Gunicorn's documentation suggests (2 × cores) + 1). For threaded workers, M depends on the fraction of each request spent blocked on I/O; 5–30 threads is common. Under the GIL, threads beyond what the I/O wait can absorb only cost memory. Measure p95 latency against concurrency; the knee of that curve is your real ceiling.
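These starting points can be written down as a small helper. This is a sketch under stated assumptions: suggest_pool_size and PoolSize are hypothetical names, and the thread formula (1 + wait time / compute time, capped at 30) is a common utilisation rule of thumb that is not taken from the text above. Treat the result as a first guess to validate against measured p95 latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSize:
    workers: int  # N: worker processes
    threads: int  # M: threads per worker


def suggest_pool_size(
    cores: int,
    workload: str,
    wait_ratio: float = 0.0,
    max_threads: int = 30,
) -> PoolSize:
    """Return a starting N and M.

    workload   -- "cpu" or "mixed"
    wait_ratio -- assumed time blocked on I/O divided by time on CPU, per request
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    if wait_ratio < 0:
        raise ValueError("wait_ratio must be >= 0")
    if workload == "cpu":
        # One process per core; extra threads would only contend for the GIL.
        return PoolSize(workers=cores, threads=1)
    if workload == "mixed":
        # Rule of thumb (an assumption): enough threads to cover the I/O wait.
        threads = min(max_threads, max(1, round(1 + wait_ratio)))
        return PoolSize(workers=2 * cores, threads=threads)
    raise ValueError(f"unknown workload: {workload!r}")


if __name__ == "__main__":
    print(suggest_pool_size(cores=4, workload="cpu"))
    print(suggest_pool_size(cores=4, workload="mixed", wait_ratio=9.0))
```

A request that spends nine times longer waiting on I/O than computing gets 10 threads per worker here, which sits inside the 5–30 range mentioned above.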

Grounded on https://en.wikipedia.org/wiki/C10k_problem
