Message Queues & async processing
Decouple producers from consumers with a queue in between. Smooths traffic spikes, enables async work, and isolates failures.
Pub/Sub vs Queue semantics: queue = competing consumers (one message → exactly one consumer in the group processes it); pub/sub (topics) = fan-out (one message → every subscriber receives a copy). Some brokers blend them (Kafka consumer groups = queue inside a topic; SNS+SQS = pub/sub feeding queues).
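To make the two delivery semantics concrete, here is a minimal in-memory sketch. The class names `QueueGroup` and `Topic` are hypothetical, and round-robin dispatch only stands in for "whichever consumer is free"; real brokers use their own assignment strategies.

```python
from itertools import cycle


class QueueGroup:
    """Competing consumers: each message is handled by exactly one consumer."""

    def __init__(self, consumers):
        self.received = {name: [] for name in consumers}
        self._next = cycle(consumers)  # round-robin as a stand-in for "next free worker"

    def publish(self, msg):
        self.received[next(self._next)].append(msg)


class Topic:
    """Fan-out: every subscriber receives its own copy of each message."""

    def __init__(self, subscribers):
        self.received = {name: [] for name in subscribers}

    def publish(self, msg):
        for inbox in self.received.values():
            inbox.append(msg)


queue = QueueGroup(["worker-a", "worker-b"])
topic = Topic(["billing", "analytics"])
for m in ["m1", "m2", "m3", "m4"]:
    queue.publish(m)
    topic.publish(m)

print(queue.received)  # work is split between the two workers
print(topic.received)  # both subscribers see all four messages
```

In the queue case adding a worker increases throughput; in the topic case adding a subscriber adds a new independent reader of the same stream.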
Kafka is a distributed, partitioned, replicated append-only log. Topics are split into partitions; each partition is ordered; consumers track their position as offsets. Messages are kept for a configurable retention window (hours to forever), which enables replay and multiple independent consumer groups at high throughput (millions of msg/s). Trade-off: operational complexity (ZooKeeper or KRaft, tuning, storage).
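The log-plus-offsets model can be sketched with a toy single-partition log. This is not the Kafka client API; `PartitionLog` and its methods are hypothetical names chosen for illustration, and retention and replication are left out.

```python
class PartitionLog:
    """Toy single-partition append-only log with per-group committed offsets."""

    def __init__(self):
        self._records = []
        self._next_offset = {}  # consumer group -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to the record

    def poll(self, group, max_records=10):
        start = self._next_offset.get(group, 0)
        batch = self._records[start:start + max_records]
        return list(enumerate(batch, start=start))  # (offset, record) pairs

    def commit(self, group, offset):
        # Committing offset N means "everything up to and including N is done".
        self._next_offset[group] = offset + 1

    def seek(self, group, offset):
        # Rewind (or skip ahead) to replay from a chosen offset.
        self._next_offset[group] = offset


log = PartitionLog()
for record in ["a", "b", "c"]:
    log.append(record)

billing_first = log.poll("billing")
log.commit("billing", billing_first[-1][0])

# A second group reads the same records independently of the first.
analytics_first = log.poll("analytics")

# Replay: records are not deleted on consumption, so a group can rewind.
log.seek("billing", 1)
billing_replay = log.poll("billing")
```

Because consuming does not remove records, adding a consumer group or replaying history is just a matter of choosing a different offset.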
RabbitMQ is a smart broker: exchanges (routing rules) → queues → consumers. Supports direct / topic / fanout / headers exchanges. Per-message ack, DLQ, priority queues, message TTL. Easier to operate than Kafka, lower max throughput.
Managed queues (Amazon SQS, Google Cloud Pub/Sub, Azure Service Bus): zero ops, pay-per-message, auto-scale. SQS specifics: standard queues are at-least-once with best-effort ordering; FIFO queues add exactly-once processing (via a 5-minute deduplication window) and strict ordering within a message group, at lower throughput. Google Cloud Pub/Sub: push or pull delivery, auto-scaling subscribers.
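Delivery guarantees in depth: at-least-once is the practical default. The producer retries when an ACK fails or is lost, so a consumer may see a message it has already processed. In practice, "exactly-once" means at-least-once delivery plus an idempotent consumer, so duplicates have no extra effect. The transactional outbox pattern complements this on the producer side: the message is written to an outbox table in the same DB transaction as the business change, so it is never lost, but it can still be published more than once. Kafka's exactly-once semantics combine an idempotent producer with transactions that atomically commit the consumed offsets together with the produced messages (consume-transform-produce).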
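A minimal sketch of an idempotent consumer, assuming every message carries a unique `id` field (an assumption, as are the `IdempotentConsumer` name and the message shape). In a real system the processed-ID record and the side effect would need to be committed atomically, for example in one DB transaction; here both live in memory.

```python
class IdempotentConsumer:
    """Applies each message at most once by remembering processed message IDs."""

    def __init__(self):
        self.seen = set()   # IDs already applied (would be a durable store in practice)
        self.balance = 0    # the side effect we must not apply twice

    def handle(self, message):
        if message["id"] in self.seen:
            return False    # duplicate delivery: acknowledge and do nothing
        self.balance += message["amount"]
        self.seen.add(message["id"])
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "tx-1", "amount": 50},
    {"id": "tx-2", "amount": 25},
    {"id": "tx-1", "amount": 50},  # redelivered after a lost ACK
]
results = [consumer.handle(m) for m in deliveries]
print(results, consumer.balance)
```

The redelivered `tx-1` is acknowledged but not applied again, which is what turns at-least-once delivery into effectively-once processing.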
Backpressure: when consumers are slower than producers, the queue grows. Mitigations: (i) scale consumers horizontally (if possible); (ii) reject at ingress or slow producers down via rate limits; (iii) drop low-priority messages (e.g. non-critical notifications); (iv) spill to cheaper storage (S3 archive).
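Mitigations (ii) and (iii) can be sketched with a bounded in-process queue. The `BoundedIngress` class, its `shed_threshold` parameter, and the returned status strings are all hypothetical; a real service would typically map "rejected" to something like an HTTP 429 or a producer-side retry with backoff.

```python
import queue


class BoundedIngress:
    """Bounded buffer that sheds non-critical work early and rejects when full."""

    def __init__(self, capacity, shed_threshold):
        self._q = queue.Queue(maxsize=capacity)
        self._shed_threshold = shed_threshold  # depth at which low-priority is dropped

    def offer(self, msg, critical=True):
        # (iii) Drop low-priority messages once the queue is getting deep.
        if not critical and self._q.qsize() >= self._shed_threshold:
            return "dropped"
        # (ii) Reject at ingress instead of growing without bound.
        try:
            self._q.put_nowait(msg)
        except queue.Full:
            return "rejected"
        return "accepted"


ingress = BoundedIngress(capacity=3, shed_threshold=2)
outcomes = [
    ingress.offer("order-1"),
    ingress.offer("order-2"),
    ingress.offer("promo-email", critical=False),  # depth is 2, so this is shed
    ingress.offer("order-3"),
    ingress.offer("order-4"),                      # queue is full
]
print(outcomes)
```

Keeping the shed threshold below capacity reserves the remaining headroom for critical messages.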
Ordering: strict order is expensive at scale — requires a single partition per ordering key. Kafka's partitions are ordered; across partitions, no ordering guarantee. Model carefully: 'events for user X must be in order' → partition by user_id; 'events across the system must be in order' → single partition (throughput bottleneck).
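A sketch of key-based partitioning, showing why "partition by user_id" preserves per-user order. `partition_for` is a hypothetical helper, and CRC32 is used only as a simple stable hash; Kafka's own default partitioner uses a different hash function, so the partition numbers here will not match a real cluster.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (same key, same partition)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 4
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add_to_cart"),
    ("user-1", "checkout"),
    ("user-2", "logout"),
]

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for user_id, action in events:
    partitions[partition_for(user_id, NUM_PARTITIONS)].append((user_id, action))

# All of user-1's events sit in one partition, in the order they were produced.
user1_partition = partition_for("user-1", NUM_PARTITIONS)
user1_actions = [a for u, a in partitions[user1_partition] if u == "user-1"]
print(user1_partition, user1_actions)
```

Note that changing the partition count changes the key-to-partition mapping, so per-key ordering across a repartition needs separate care.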
Transactional outbox pattern: to reliably emit a message AFTER a DB commit, write the message to an 'outbox' table in the same transaction. A separate process reads the outbox and publishes to the queue, marking as sent. Avoids the classic dual-write problem (DB committed but queue push failed → lost event).
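The pattern can be sketched with SQLite from the Python standard library. The table layout, `place_order`, and `relay_once` are illustrative assumptions, and `publish` stands in for whatever broker client is in use. The relay is at-least-once: if it crashes after publishing but before marking the row, the message is published again, so consumers should be idempotent.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    sent INTEGER NOT NULL DEFAULT 0
);
""")


def place_order(order_id, total_cents):
    """Write the business row and the outgoing event in ONE transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay_once(publish):
    """Publish unsent outbox rows in order, marking each as sent afterwards."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # if this raises, the row stays unsent and is retried later
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []  # stand-in for a real broker
place_order(1, 4200)
first_run = relay_once(published.append)
second_run = relay_once(published.append)  # nothing left to send
```

In production the relay is usually a polling worker or a change-data-capture stream reading the outbox table.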
Observability: queue depth (alert when it grows beyond baseline, which means consumers are lagging), consumer lag per partition (Kafka), DLQ size (should be 0; any non-zero value in prod warrants investigation), p99 processing time, retry rate. Build runbooks around these signals.
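Consumer lag per partition is the log-end offset minus the group's committed offset. A small sketch of that arithmetic follows; the function names, the sample offsets, and the alert threshold are illustrative assumptions, and in practice the offsets would come from the broker's admin tooling or a metrics exporter.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages written but not yet committed by the group."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


def lagging_partitions(lag, threshold):
    """Partitions whose lag exceeds the alert threshold, in sorted order."""
    return sorted(p for p, n in lag.items() if n > threshold)


log_end = {0: 1500, 1: 1510, 2: 9000}     # latest offset written per partition
committed = {0: 1495, 1: 1510, 2: 4000}   # group's committed offset per partition

lag = consumer_lag(log_end, committed)
alerts = lagging_partitions(lag, threshold=1000)
print(lag, alerts)
```

Lag concentrated in a single partition, as in this sample, often points to a hot key or a stuck consumer rather than a general capacity shortfall.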
Grounded on https://kafka.apache.org/intro