Async architecture — System Design Handbook

Some work doesn't have to happen now. The act of admitting that — and putting a queue between the request and the work — is the single biggest reliability upgrade most systems will ever get.

So far, every system we've designed follows a synchronous request → response pattern: the user waits while the server does work. This pattern breaks at scale. Sending a welcome email, encoding a video, generating a report, fanning out a notification to a million followers — none of these need to finish before the user gets their answer. This module covers the asynchronous patterns that decouple "the user pressed a button" from "the work that the button kicked off." Done well, async architecture turns brittle systems resilient and slow systems fast.

Producer writes, gets acknowledged immediately. Consumers process at their own pace. The queue absorbs traffic spikes and outlives crashes.

The four reasons to go async

Asynchronous architecture is a tool. Like any tool, it solves specific problems. The four reasons it earns its complexity:

Latency: Returning 202 Accepted in 50 ms is better than returning 200 OK in 5 seconds. Users don't care that the email hasn't been sent; they care that the page didn't hang.
Decoupling: The producer and consumer don't have to be online at the same time. The consumer can be redeployed, scaled, or even rewritten in a different language, and the producer never notices.
Smoothing: Bursty traffic gets absorbed by the queue. A flash sale that produces 50,000 orders in five seconds doesn't crash the order pipeline; it gets buffered and worked off over minutes.
Fan-out: One event triggers many actions. A new signup publishes one event, and the email service, the analytics service, the recommendation service, and the audit log each consume it independently, without the producer knowing they exist.

The patterns

Async architecture comes in four canonical patterns. Each maps to a different shape of work and a different failure model.

Task queue

One producer writes a job, one consumer pulls it, runs it once, acks. Workhorse for "do this now-ish but not synchronously." SQS, Sidekiq, Celery.
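A minimal in-process sketch of the task-queue shape, using only Python's standard library. The names `submit_job`, `worker`, and `results` are hypothetical, and an in-memory `queue.Queue` stands in for a real broker such as SQS, so nothing here is durable.

```python
import queue
import threading

jobs = queue.Queue()             # stand-in for a broker; in-memory, not durable
results = []                     # records side effects so the run can be inspected
results_lock = threading.Lock()


def submit_job(payload):
    """Producer side: enqueue and return immediately, like a 202 Accepted."""
    jobs.put(payload)
    return {"status": 202, "queued": payload}


def worker():
    """Consumer side: pull one job, run it, then ack."""
    while True:
        payload = jobs.get()
        if payload is None:      # sentinel: shut the worker down
            jobs.task_done()
            return
        with results_lock:
            results.append(f"sent welcome email to {payload}")
        jobs.task_done()         # the "ack": this job is finished


t = threading.Thread(target=worker)
t.start()
response = submit_job("ada@example.com")   # returns before the work runs
jobs.put(None)
jobs.join()                                # wait until every job is acked
t.join()
```

The producer's return value never depends on the worker. That separation is the point of the pattern. A real broker adds durability and redelivery on top of it.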

Pub/sub

One producer publishes, N subscribers each get a copy. Used for fan-out: notifications, cache invalidation, audit logging. Redis Pub/Sub, SNS, NATS.
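As a rough illustration of fan-out, here is a hypothetical in-memory `Broker` class. It delivers synchronously inside `publish`, which a real pub/sub system would not do, and the topic name `user.signup` is invented for the example.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-memory broker: every subscriber gets its own copy."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The publisher names a topic; it does not know who is listening.
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(dict(event))   # copy, so subscribers cannot affect each other
        return len(handlers)


broker = Broker()
emails, audit = [], []
broker.subscribe("user.signup", lambda e: emails.append(e["email"]))
broker.subscribe("user.signup", lambda e: audit.append(("signup", e["user_id"])))

delivered = broker.publish("user.signup", {"user_id": 7, "email": "ada@example.com"})
```

Adding a third subscriber requires no change to the publishing code, which is the decoupling the pattern buys.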

Event log

Append-only, durable, replayable. Multiple independent consumers track their own offset. Foundation for event-sourced systems and stream processing. Kafka, Kinesis, Pulsar.
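The offset idea can be sketched with a hypothetical in-memory `EventLog`. It is not durable and has no partitions; it only shows that consumers read independently and can rewind.

```python
class EventLog:
    """Hypothetical append-only log; each consumer owns its offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}   # consumer name -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def poll(self, consumer, max_records=10):
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        # Replay: rewind one consumer without affecting any other.
        self._offsets[consumer] = offset


log = EventLog()
for event in ("created", "paid", "shipped"):
    log.append(event)

first = log.poll("billing", max_records=2)   # billing reads two records
other = log.poll("analytics")                # analytics starts from offset 0
log.seek("billing", 0)
replayed = log.poll("billing")               # billing re-reads everything
```

Reading never removes a record, unlike a task queue where an ack deletes the message. That is what makes replay possible.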

Workflow

Long-lived state machines: orchestrated steps with retries, timers, branches. The pattern when "send email, wait 24h, if no reply send reminder, then mark stale" is one logical operation. Temporal, AWS Step Functions.
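The email example can be sketched as an explicit state machine. This is a toy under stated assumptions: the `FollowUp` and `step` names are hypothetical, time is passed in as a number rather than read from a clock, the wait after the reminder is assumed to be another 24 hours, and nothing is persisted the way a workflow engine would persist it.

```python
from dataclasses import dataclass, field

DAY = 24 * 60 * 60  # seconds


@dataclass
class FollowUp:
    """Hypothetical workflow state; a real engine would persist this."""
    state: str = "start"
    wake_at: float = 0.0
    actions: list = field(default_factory=list)


def step(wf, now, replied=False):
    """Advance the workflow by at most one transition."""
    if wf.state == "start":
        wf.actions.append("send email")
        wf.state, wf.wake_at = "waiting_reply", now + DAY
    elif wf.state in ("waiting_reply", "waiting_after_reminder") and replied:
        wf.state = "done"
    elif wf.state == "waiting_reply" and now >= wf.wake_at:
        wf.actions.append("send reminder")
        wf.state, wf.wake_at = "waiting_after_reminder", now + DAY
    elif wf.state == "waiting_after_reminder" and now >= wf.wake_at:
        wf.actions.append("mark stale")
        wf.state = "stale"
    return wf.state


wf = FollowUp()
step(wf, now=0)                 # sends the email and sets a 24h timer
step(wf, now=DAY / 2)           # timer has not fired, so nothing happens
step(wf, now=DAY)               # no reply after 24h: send the reminder
final = step(wf, now=2 * DAY)   # still no reply: mark stale
```

Because the state and the timer live in data rather than in a sleeping thread, the workflow can survive a process restart once that data is stored somewhere durable.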

Delivery semantics — the trade-off no one explains clearly

Every queue offers one of three guarantees, and a system that doesn't pick one explicitly has the worst of all three. The choice cascades through the rest of the design.

	At-most-once	At-least-once	Exactly-once
What can happen	Lost messages	Duplicate messages	Neither (in theory)
How	Producer doesn't retry; consumer ack is fire-and-forget	Producer retries on uncertainty; consumer must be idempotent	Producer dedupes by id; consumer is transactional
Cost	Cheap; lossy	Cheap; idempotency burden moves to the consumer	Expensive; coordination overhead
When to use	Metrics, telemetry, sampling	~99% of business workflows	Money. Only money.

The honest version of "exactly-once" is "at-least-once delivery + idempotent processing." Even Kafka's exactly-once semantics rest on idempotent producers (a producer id plus per-partition sequence numbers that let the broker discard duplicate writes) and transactions, and they hold only inside a single Kafka cluster. Once the data leaves the cluster, you're back to designing for at-least-once. Plan accordingly: every consumer should be idempotent, and every write should carry an idempotency key the system can dedupe on.

The moving parts of a production queue

Visibility / lease timeout: When a consumer pulls a message, the queue hides it for N seconds. If the consumer doesn't ack within N, the message becomes visible again. Set N to 2-3× the P99 processing time, not the average.
Retries with backoff: Failed messages should be retried with exponential backoff (1s → 2s → 4s → 8s) and jitter. Without backoff, a transient downstream failure becomes a thundering herd.
Dead-letter queue: After N retries the message goes to a DLQ for inspection. Critical: the DLQ must have alerting. An overflowing DLQ is the silent killer of async systems.
Ordering: Most queues offer per-partition or per-key ordering, not global. Choose a partition key (user_id, account_id) so related messages stay ordered without limiting parallelism.
Backpressure: What happens when consumers can't keep up? Queue depth grows. The producer either blocks (good — applies pressure upstream), drops (acceptable for telemetry), or paginates (good for batch).
Poison pill handling: One message that always crashes the consumer is a poison pill. It must move to the DLQ after a bounded number of retries; otherwise it stops processing for everyone.
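Several of the parts above (the lease, redelivery after a missed ack, and the DLQ as a backstop for poison pills) can be seen together in a small sketch. `LeaseQueue` is a hypothetical in-memory class, time is passed in explicitly as `now`, and the timeout and receive limit are arbitrary illustration values.

```python
import itertools


class LeaseQueue:
    """Hypothetical in-memory queue with visibility timeouts and a DLQ."""

    def __init__(self, visibility_timeout, max_receives):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self._messages = {}        # id -> [body, visible_at, receive_count]
        self._ids = itertools.count(1)
        self.dead_letters = []     # (id, body) pairs parked for inspection

    def send(self, body, now):
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, now, 0]
        return msg_id

    def receive(self, now):
        for msg_id, entry in list(self._messages.items()):
            body, visible_at, receives = entry
            if visible_at > now:
                continue                       # leased to another consumer
            if receives >= self.max_receives:
                # Poison pill: stop redelivering and park it in the DLQ.
                del self._messages[msg_id]
                self.dead_letters.append((msg_id, body))
                continue
            entry[1] = now + self.visibility_timeout   # hide while leased
            entry[2] = receives + 1
            return msg_id, body
        return None

    def ack(self, msg_id):
        self._messages.pop(msg_id, None)       # done: never redeliver


q = LeaseQueue(visibility_timeout=30, max_receives=2)
q.send("resize image", now=0)
q.send("corrupt payload", now=0)

first = q.receive(now=0)      # leases "resize image"
q.ack(first[0])               # processed in time, so it is gone
second = q.receive(now=0)     # leases the poison pill; the consumer crashes, no ack
hidden = q.receive(now=10)    # still inside the lease: nothing is visible
retry = q.receive(now=30)     # lease expired: redelivered (receive 2 of 2)
parked = q.receive(now=60)    # limit reached: moved to the DLQ instead
```

The healthy message is handled once and disappears, while the message that never gets acked is redelivered a bounded number of times and then parked, so it cannot block the queue indefinitely.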

The hard cases

The queue isn't free reliability. Adding a queue trades synchronous failure ("your request failed") for asynchronous failure ("the work silently never happened"). Without alerting on queue depth, DLQ count, and consumer lag, you have moved errors out of the user's view rather than fixed them. Wire up the dashboards before the queue.
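A sketch of the three checks as code. The `QueueStats` fields and the thresholds are placeholders rather than recommendations, and consumer lag is approximated here by the age of the oldest unprocessed message, which is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class QueueStats:
    """Hypothetical snapshot of queue metrics."""
    depth_now: int
    depth_5m_ago: int
    dlq_count: int
    oldest_message_age_s: float


def alerts(stats, max_growth_per_min=100, max_age_s=300):
    """Return the names of the alerts that should fire for this snapshot."""
    fired = []
    if stats.dlq_count > 0:
        fired.append("dlq_not_empty")
    growth_per_min = (stats.depth_now - stats.depth_5m_ago) / 5
    if growth_per_min > max_growth_per_min:
        fired.append("depth_growing")
    if stats.oldest_message_age_s > max_age_s:
        fired.append("consumer_lag")
    return fired


healthy = alerts(QueueStats(depth_now=40, depth_5m_ago=50, dlq_count=0,
                            oldest_message_age_s=2.0))
failing = alerts(QueueStats(depth_now=1200, depth_5m_ago=200, dlq_count=1,
                            oldest_message_age_s=30.0))
```

The depth check looks at growth rather than absolute depth: a deep queue that is draining is fine, while a shallow one that grows every minute is not.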

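Idempotency is not optional. At-least-once delivery means you will get duplicates. The consumer must be safe to call twice with the same input. The cleanest way: every message carries an idempotency key, and the consumer records "I've processed key X" in a database in the same transaction as the work itself, so a crash can never leave the key recorded without the work done, or the work done without the key. When the side effect lives outside that database (an email, a third-party API call), pass the key downstream so the receiving system can dedupe. Without this, retries cause double-charges, double-emails, double-everything.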
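One way to make a consumer safe against duplicates, sketched with Python's built-in `sqlite3`. The table names, the `handle_charge` function, and the message shape are hypothetical. The key is recorded in the same transaction as the balance update, so a crash between the two cannot leave one applied without the other.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER)")
db.execute("INSERT INTO balances VALUES ('acct-1', 0)")
db.commit()


def handle_charge(message):
    """Apply a charge at most once per idempotency key."""
    try:
        with db:  # commits on success, rolls back on any exception
            db.execute(
                "INSERT INTO processed (idempotency_key) VALUES (?)",
                (message["idempotency_key"],),
            )
            db.execute(
                "UPDATE balances SET cents = cents + ? WHERE account = ?",
                (message["cents"], message["account"]),
            )
        return "applied"
    except sqlite3.IntegrityError:
        return "duplicate"   # key already recorded: skip the side effect


msg = {"idempotency_key": "charge-42", "account": "acct-1", "cents": 500}
outcomes = [handle_charge(msg), handle_charge(msg)]   # simulated redelivery
balance = db.execute(
    "SELECT cents FROM balances WHERE account = 'acct-1'"
).fetchone()[0]
```

The primary-key constraint does the deduplication. This only covers side effects that live in the same database as the key table.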

Don't queue everything. Async adds operational complexity: a queue, a worker pool, a DLQ, retry semantics, monitoring, deployment coordination. If the work is <100 ms and not externally observable, do it inline. Reach for async when the work is slow, externally-facing, fan-out, or has its own SLO.

Choosing the right tool

Need	Pick
Background jobs in a single app	Sidekiq, Celery, BullMQ, RQ — language-native, fast to ship.
Fan-out across services	SNS + SQS, NATS, Redis Pub/Sub.
Event sourcing, replay, stream processing	Kafka, Pulsar, Kinesis.
Multi-step workflows with timers	Temporal, AWS Step Functions, Cadence.
Cron-like scheduled work	EventBridge Scheduler, Cloud Scheduler, or just Kubernetes CronJobs.
RPC-style with deferred response	RabbitMQ direct-reply, gRPC streaming, or just a polled job-id endpoint.

Practical defaults

Default to at-least-once delivery and idempotent consumers. Idempotency keys travel with every message.
Visibility timeout = 2-3× P99 processing time, with a hard maximum of 12 hours (the SQS ceiling; other brokers differ).
Retries: 5 retries with exponential backoff (delays of 1s, 2s, 4s, 8s, 16s), each with ±20% jitter, then to DLQ.
Alert on DLQ count > 0, queue depth growth rate, and consumer lag — these are the three signals that tell you async is failing silently.
Pick a partition key that reflects the natural ordering boundary (user_id, account_id) — global ordering is rarely needed and always expensive.
Replay matters. Pick a queue or log that lets you replay from a point in time, so you can backfill or recover when you need to.
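The numeric defaults above translate into a few small helpers. This is a sketch under assumptions: the function names are hypothetical, P99 is computed with a simple nearest-rank method over a list of samples, and the partitioner is one stable-hash choice among many.

```python
import hashlib
import random

MAX_VISIBILITY_S = 12 * 60 * 60   # the 12-hour ceiling from the defaults above


def backoff_delays(retries=5, base=1.0, jitter=0.2, rng=random):
    """Exponential delays (1s, 2s, 4s, ...), each scaled by +/- `jitter`."""
    return [
        base * (2 ** attempt) * rng.uniform(1 - jitter, 1 + jitter)
        for attempt in range(retries)
    ]


def visibility_timeout(latencies_s, factor=3.0):
    """`factor` times the P99 of observed processing times, capped at 12 hours."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100      # ceil(0.99 * n), nearest rank
    p99 = ordered[rank - 1]
    return min(factor * p99, MAX_VISIBILITY_S)


def partition_for(key, partitions):
    """Map a key such as a user_id to a stable partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


delays = backoff_delays(rng=random.Random(0))   # seeded for a repeatable run
timeout = visibility_timeout([0.2] * 98 + [1.5, 4.0])
same = partition_for("user-123", 8) == partition_for("user-123", 8)
```

Sizing the timeout from P99 rather than the mean keeps slow but legitimate jobs from being redelivered while they are still running. Hashing the key keeps all messages for one user on one partition, so their relative order is preserved.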
