Cloud Codex · AWS / 13

Messaging.

Three services do almost the same thing with different shapes. SQS is a queue: any number of producers send, and each message is processed by one consumer. SNS is a topic: one publish fans out to many subscribers. EventBridge is a smart bus: you publish events with attributes, routing rules send them to targets, and schedules and a schema registry are built in. Dead-letter queues (DLQs) round this out as the universal "we couldn't process this" sink.


1 · What event-driven actually means

The mental model that survives every conversation: an event is a fact-shaped record of something that already happened ("order placed", "image uploaded", "user signed up"). A request is an instruction asking for something to happen ("place this order", "process this image", "create this user"). The two shapes call for different infrastructure. Requests want synchronous APIs, immediate responses, and tight coupling between caller and callee. Events want asynchronous delivery, decoupled producers and consumers, and somebody — a queue, a topic, a bus — sitting between them to absorb the impedance mismatch.

What event-driven isn't: a magic decoupling spell. Producers and consumers still have to agree on the event schema; a renamed field still breaks every consumer that reads it. Adding a queue between two services doesn't make them resilient — it makes failures show up as queue backlog instead of timeouts, which is sometimes the right trade-off and sometimes just a slower way to fail.

What event-driven is: a few primitives, each with a different default. Three of them dominate AWS messaging; streams are listed as a fourth for comparison:

| Shape | Primitive | Default | Reach for it when |
| --- | --- | --- | --- |
| Queue (1→1) | SQS | Pull, at-least-once, durable, unordered | Buffer work; smooth bursty load; decouple producer rate from consumer rate |
| Topic (1→N) | SNS | Push, at-least-once, fan-out to many subscribers | One event triggers multiple independent consumers |
| Bus (N→N, with rules) | EventBridge | Push, content-based routing, schemas, partner sources | Cross-team event distribution with rich filtering |
| Stream (ordered partitions) | Kinesis / MSK (Kafka) | Pull, ordered per partition, replayable for days | Event sourcing, real-time analytics, anything needing strict order |

Push vs pull is the cleanest dividing line. SNS and EventBridge push — they call your subscriber's HTTP endpoint or invoke a Lambda directly; you don't control the rate. SQS and Kinesis are pulled — consumers poll at their own rate, which is why they double as rate-smoothing buffers. The pattern that survives slow consumers combines both: producer publishes once to SNS or EventBridge, which pushes to per-consumer SQS queues; each consumer drains its own queue at its own rate. The next section is about why this shape works.

2 · The canonical fan-out shape

Most teams arrive at the same architecture independently: a producer publishes an event once to SNS (or EventBridge); each consumer has its own SQS queue subscribed to the topic; consumers pull from their own queue at their own rate. This is the shape because it composes the three properties you actually want — fan-out, durability, and rate-smoothing — without coupling consumers to each other or to the producer.

The alternative — having the producer publish directly to each consumer's queue — looks simpler until the day you add the fourth consumer and have to change the producer's code and redeploy it. SNS in the middle absorbs that change. EventBridge does the same thing with richer routing (content-based rules) and a slightly higher per-event price.
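
The fan-out shape can be sketched without any AWS calls. The following is a minimal in-memory model, not SNS or SQS themselves: `Topic`, `subscribe`, `publish`, and `drain` are hypothetical names chosen for illustration, and the "queues" are plain Python deques.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic: one publish, one copy per subscriber."""

    def __init__(self):
        self.queues = {}  # consumer name -> that consumer's own queue

    def subscribe(self, consumer):
        # Adding a consumer touches only the topic, never the producer.
        self.queues[consumer] = deque()
        return self.queues[consumer]

    def publish(self, event):
        for queue in self.queues.values():
            queue.append(event)


def drain(queue, batch_size):
    """Pull up to batch_size events, as a consumer polling its own queue would."""
    return [queue.popleft() for _ in range(min(batch_size, len(queue)))]


topic = Topic()
billing = topic.subscribe("billing")
analytics = topic.subscribe("analytics")

# The producer publishes once per event and knows nothing about the consumers.
for order_id in range(5):
    topic.publish({"event": "order.placed", "id": order_id})

fast = drain(billing, batch_size=5)    # billing keeps up
slow = drain(analytics, batch_size=1)  # analytics lags; its backlog waits in its own queue
```

After this runs, the billing queue is empty while four events still wait for analytics, and neither consumer has affected the other or the producer.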

3 · SQS — the queue

A managed pull-based queue. Producers SendMessage; consumers ReceiveMessage, process, DeleteMessage. Two flavours:

| | Standard | FIFO |
| --- | --- | --- |
| Order | Best-effort (mostly in order, not guaranteed) | Strict per-group ordering |
| Delivery | At-least-once (occasional duplicates) | Exactly-once within a 5-minute deduplication window |
| Throughput | Nearly unlimited | 300 API calls/s per action without batching; 3,000 msg/s with batching (per queue; high-throughput mode raises this) |
| Price | $0.40 / million | $0.50 / million |
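
To make the FIFO deduplication window concrete, here is a small simulation of the idea, assuming a deduplication id supplied with each send. `FifoDeduplicator` and `accept` are hypothetical names, times are plain seconds, and the real service's bookkeeping may differ in detail.

```python
DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute deduplication window


class FifoDeduplicator:
    """Toy model: a repeated deduplication id is dropped while its window is open."""

    def __init__(self):
        self.seen = {}  # deduplication id -> time the id was last accepted

    def accept(self, dedup_id, now):
        first_seen = self.seen.get(dedup_id)
        if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
            return False  # duplicate inside the window: not enqueued again
        self.seen[dedup_id] = now
        return True


dedup = FifoDeduplicator()
first = dedup.accept("order-42", now=0)       # accepted
retry = dedup.accept("order-42", now=120)     # producer retry after 2 minutes: dropped
much_later = dedup.accept("order-42", now=400)  # window has closed: accepted again
```

The last call is the reason consumers still benefit from idempotency on FIFO queues: a retry that arrives after the window closes is treated as a new message.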

Key concepts: visibility timeout (after a receive, the message is invisible to other consumers for N seconds; if it has not been deleted by then, it becomes visible again); long polling (set WaitTimeSeconds=20 on receive so the call waits for a message instead of returning empty); batching (up to 10 messages per send or receive call, which cuts request count, and so cost, by up to 90% at high throughput).

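The visibility timeout is the single concept most teams get wrong, because it drives the duplicate-processing behaviour that makes idempotency mandatory. If a consumer takes longer than the timeout to process and delete a message, the message becomes visible again and another consumer receives it while the first is still working.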
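
A toy simulation shows how a slow consumer turns the visibility timeout into a duplicate delivery, and how an idempotent handler absorbs it. This is a sketch under stated assumptions: `SimQueue` and `handle` are hypothetical, time is passed in as plain seconds, and the `processed` set stands in for what would need to be a durable store in a real system.

```python
class SimQueue:
    """Toy model of visibility timeout; times are plain seconds, not wall-clock."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> (body, time it becomes visible again)

    def send(self, msg_id, body):
        self.messages[msg_id] = (body, 0)

    def receive(self, now):
        for msg_id, (body, visible_at) in self.messages.items():
            if now >= visible_at:
                # Hide the message from other consumers until the timeout expires.
                self.messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)


processed = set()  # idempotency keys already applied (durable in a real system)
charges = []


def handle(msg_id, body):
    """Idempotent handler: a redelivered message is not applied a second time."""
    if msg_id in processed:
        return
    charges.append(body["amount"])
    processed.add(msg_id)


queue = SimQueue(visibility_timeout=30)
queue.send("order-42", {"amount": 100})

first = queue.receive(now=0)    # worker A takes the message, then runs slow
hidden = queue.receive(now=10)  # still invisible: worker B sees nothing
second = queue.receive(now=31)  # timeout expired before A deleted: B gets it too

handle(*first)
handle(*second)                 # duplicate delivery, ignored by the handler
queue.delete("order-42")
```

Both workers received the same message, but the charge was applied once. Without the `processed` check the customer would have been charged twice.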

The pattern: rate-smoothing buffer. Lambda hooked to SQS scales its concurrency based on queue depth. A traffic burst goes into SQS; Lambda drains it at a controlled rate. This lets a 1M-msg/min spike land safely against a downstream that only handles 1k/s, provided the consumer's concurrency is capped at what the downstream can take.
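
The arithmetic behind that spike is worth working through once. The sketch below assumes messages arrive evenly over the burst and the consumer drains at a constant rate; `backlog_profile` is a hypothetical helper, not part of any AWS SDK.

```python
def backlog_profile(burst_messages, burst_seconds, drain_per_second):
    """Return (peak backlog, seconds from burst start until the queue is empty)."""
    drained_during_burst = drain_per_second * burst_seconds
    if drained_during_burst >= burst_messages:
        return 0, burst_seconds  # consumer keeps up; nothing accumulates
    peak_backlog = burst_messages - drained_during_burst
    seconds_to_empty = burst_messages / drain_per_second
    return peak_backlog, seconds_to_empty


# 1M messages in one minute against a consumer capped at 1k messages per second.
peak, total_seconds = backlog_profile(1_000_000, 60, 1_000)
```

Here `peak` is 940,000 messages and `total_seconds` is 1,000, so the queue takes roughly 17 minutes to drain. The buffer protects the downstream, but the last messages of the burst wait that long, which has to fit the latency budget.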

4 · SNS — pub/sub fan-out

SNS is a topic with subscribers. One Publish delivers to all subscriptions. Subscribers can be: SQS queues, Lambda functions, HTTPS endpoints, email, SMS, mobile push, Kinesis Data Firehose.

The canonical pattern is SNS → fan-out to SQS: one event needs to trigger N independent consumers. Each consumer has its own SQS queue subscribed to the SNS topic; each queue retains messages independently; a slow consumer doesn't slow others. Replaces the brittle "publish to multiple queues directly from the producer" anti-pattern.

  • SNS FIFO topics exist; require FIFO SQS subscribers. Order preserved per group.
  • Message filtering — subscriptions can specify a filter policy matching message attributes, so only relevant messages get delivered. Saves consumer CPU and message-delivery cost.
  • Cross-region delivery. A topic can deliver to SQS queues and Lambda functions in other regions, which covers multi-region event distribution. SNS topics cannot subscribe to other topics.
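
The message-filtering bullet can be illustrated with a simplified matcher. This sketch covers only exact string matching on message attributes; real SNS filter policies support more operators, and `matches_filter_policy` is a hypothetical name.

```python
def matches_filter_policy(policy, attributes):
    """True if every policy key is present and its value is one of the allowed values."""
    return all(
        key in attributes and attributes[key] in allowed
        for key, allowed in policy.items()
    )


policy = {"event": ["order.placed", "order.refunded"]}

delivered = matches_filter_policy(policy, {"event": "order.placed"})
skipped = matches_filter_policy(policy, {"event": "user.signed_up"})
no_attributes = matches_filter_policy(policy, {})
```

Only the first message would reach the subscriber. The third case is the common surprise: a message published without the attribute never matches, even if the same value appears in its body.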

5 · EventBridge — bus, rule, target

EventBridge is SNS with content-based routing, schema awareness, and partner integrations built in. The shape is bus → rule → target: events land on a bus, rules pattern-match on event content, matched events fan out to one or more targets. One bus can have dozens of rules; each rule can have up to five targets; a single event can match multiple rules.

Six EventBridge features distinguish it from SNS as the routing layer:

  • Pattern matching is rich. Match by source, detail-type, detail field values, prefix, numeric ranges, IP CIDRs, "anything but", existence checks. The patterns are evaluated server-side, so non-matching events don't cost you a target invocation. One rule, up to five targets.
  • Targets are basically anything in AWS — Lambda, SQS, SNS, Step Functions, Kinesis, Firehose, ECS task, EventBridge bus (cross-account / cross-region), API destinations (HTTP webhooks with retry + auth), and 60+ more. A target attached to a rule includes its own retry policy and dead-letter SQS.
  • Schedule rules (formerly CloudWatch Events). Cron expressions or fixed-rate; trigger any target on schedule. EventBridge Scheduler is the newer separate service designed for millions of one-time schedules (the "remind me in 47 minutes" pattern at scale) — flexible time windows, per-schedule retry, dead-letter destinations.
  • Schema registry + discoverer. Discovers event shapes automatically by sampling traffic on the bus; generates typed bindings for Java/Python/TypeScript so consumers know what they're receiving and tools catch breaking changes.
  • Archive + replay. Record events for N days (or indefinitely); replay them through any matching rule's targets. The canonical "we fixed a bug in our analytics consumer, now replay the last 6 hours of order.placed events" workflow.
  • Partner sources. Stripe, Datadog, Auth0, GitHub, Zendesk, and others publish directly to your event bus through a partner event source — no webhook plumbing.
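
Content-based matching is easier to reason about with a small model. The matcher below is a simplified sketch of EventBridge-style patterns, assuming only exact values, `prefix`, `anything-but`, `exists`, and nested objects; it is not the service's implementation, and `matches` and `_matches_one` are hypothetical names.

```python
def matches(pattern, event):
    """Simplified pattern matching: every field named in the pattern must match."""
    for field, condition in pattern.items():
        if isinstance(condition, dict):
            # Nested pattern: descend into the matching sub-object of the event.
            child = event.get(field)
            if not isinstance(child, dict) or not matches(condition, child):
                return False
        elif not isinstance(condition, list):
            raise ValueError(f"pattern value for {field!r} must be a list or a nested object")
        elif not any(_matches_one(candidate, event, field) for candidate in condition):
            return False
    return True  # fields the pattern does not mention match anything


def _matches_one(candidate, event, field):
    present = field in event
    if isinstance(candidate, dict):
        if "exists" in candidate:
            return present == candidate["exists"]
        if "prefix" in candidate:
            value = event.get(field)
            return isinstance(value, str) and value.startswith(candidate["prefix"])
        if "anything-but" in candidate:
            excluded = candidate["anything-but"]
            if not isinstance(excluded, list):
                excluded = [excluded]
            return present and event[field] not in excluded
        return False  # operators outside this sketch never match
    return present and event[field] == candidate


event = {
    "source": "shop.orders",
    "detail-type": "order.placed",
    "detail": {"total": 120, "region": "eu-west-1", "status": "paid"},
}

by_source = matches({"source": ["shop.orders"]}, event)
eu_only = matches({"source": ["shop.orders"], "detail": {"region": [{"prefix": "eu-"}]}}, event)
unpaid = matches({"detail": {"status": [{"anything-but": ["paid"]}]}}, event)
```

The first two patterns match and the third does not. Note that the first pattern names only `source`, so it matches every event from that source regardless of type or detail.
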

SNS vs EventBridge: when to pick which. For a small set of subscribers where throughput and latency matter, pick SNS (single-digit-ms publish latency, ~$0.50/M publishes). For multi-source events, content-based routing, schema awareness, third-party integrations, or replay, pick EventBridge (~50–200 ms publish, ~$1/M events). Many serverless teams default to EventBridge for new domains and reserve SNS for the high-throughput SNS-to-SQS fan-out pattern where you don't need EventBridge's routing layer.

6 · Dead-letter queues

A DLQ is "the queue for messages we couldn't process." Configure one on SQS (after N receive attempts, the message moves to the DLQ), on Lambda asynchronous invocations (after the maximum retries, the payload goes to the DLQ), on SNS subscriptions, and on EventBridge rule targets. Monitor DLQ depth as a top-level alarm: a growing DLQ is the universal "something is broken downstream" signal.
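
The "after N receive attempts" rule can be modelled in a few lines. This is a simulation of the outcome only, assuming a handler that raises on failure; `deliver_with_redrive`, `flaky`, and `poison` are hypothetical names, and the real service tracks receive counts itself.

```python
def deliver_with_redrive(handler, body, max_receive_count):
    """Retry until success or the receive limit; return where the message ended up."""
    for receive_count in range(1, max_receive_count + 1):
        try:
            handler(body)
            return "processed", receive_count
        except Exception:
            continue  # not deleted: the message becomes visible and is received again
    return "dead-letter", max_receive_count


attempts = []


def flaky(body):
    """Fails once, then succeeds: a transient downstream error."""
    attempts.append(body)
    if len(attempts) < 2:
        raise RuntimeError("downstream unavailable")


def poison(body):
    """Always fails: a message the consumer can never process."""
    raise ValueError("cannot parse")


transient = deliver_with_redrive(flaky, {"id": 1}, max_receive_count=3)
stuck = deliver_with_redrive(poison, {"id": 2}, max_receive_count=3)
```

The transient failure is processed on the second receive; the poison message exhausts its three receives and lands in the dead-letter queue instead of blocking the consumer forever.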

The replay pattern: build a small tool that reads from DLQ, fixes the message, sends to the main queue. Don't auto-replay — get a human eyeball on a sample before mass-replaying, because the same bug that put the messages in the DLQ might still be there.
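
A replay tool with the human check built in might look like the following sketch. It uses plain lists in place of queues, and `replay`, `fix_amount`, and the `approved` flag are hypothetical; a real tool would read from and write to SQS.

```python
def replay(dlq, main_queue, fix, approved, sample_size=5):
    """Return a repaired sample for review; move messages only once approved."""
    sample = [fix(message) for message in dlq[:sample_size]]
    if not approved:
        return sample  # nothing moved yet: a person inspects the sample first
    while dlq:
        main_queue.append(fix(dlq.pop(0)))
    return sample


def fix_amount(message):
    """Example repair: the producer sent the amount as a string."""
    return {**message, "amount": int(message["amount"])}


dlq = [{"id": 1, "amount": "12"}, {"id": 2, "amount": "7"}]
main_queue = []

preview = replay(dlq, main_queue, fix_amount, approved=False)
untouched = (len(dlq), len(main_queue))  # still (2, 0) after the dry run
replay(dlq, main_queue, fix_amount, approved=True)
```

The first call is a dry run that changes nothing; only the second call, made after someone has looked at `preview`, moves the repaired messages back to the main queue.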

7 · Real-world case studies

Three public stories show how these primitives compose into the asynchronous backbones of production systems.

Coca-Cola Freestyle — IoT vending at scale. Coca-Cola's Freestyle dispensers (the touchscreen ones with 100+ drink options) send telemetry — fills, faults, sales — back to AWS. Their publicly documented architecture uses SNS + SQS as the durable buffer between thousands of devices and the downstream analytics and operations pipelines. The Freestyle team has talked at re:Invent about the pattern: device publishes to IoT Core → routed to SNS topics by event type → SQS queues fan out to operational subscribers (replenishment alerts) and to Kinesis Firehose for the data warehouse. The takeaway is the durability: SNS+SQS guarantees no event lost between device and warehouse even when downstream consumers are deploying or briefly broken.

BMW — EventBridge as the cross-account event spine. BMW's connected-vehicle platform uses EventBridge to route vehicle events from one ingest account (which terminates the device TLS sessions and writes to a single bus) to dozens of consumer accounts (analytics, predictive maintenance, customer service). The pattern is interesting because it cleanly separates "the team that owns the data plane" from "the teams that consume the data" — each consumer creates an EventBridge rule in their own account, the central bus has no idea who is subscribing, and BMW's security team can audit cross-account event flow from one place. This is the EventBridge sweet spot: arbitrary consumer count, no point-to-point integrations, full IAM-bounded routing.

Capital One — SQS DLQs and the replay pattern at scale. Capital One has published several architecture posts describing how their event-driven pipelines (transaction enrichment, fraud feature extraction) treat DLQs as a first-class operational surface. Their pattern: every Lambda async target and SQS consumer has a DLQ; CloudWatch alarms fire on DLQ depth > 0; a dedicated replay tool reads from the DLQ, applies the deployed bug fix, and routes back to the main queue. The lesson buried in the public material is that not auto-replaying is the discipline — you need a human to look at a sample before mass replay, because the bug that filled the DLQ might still be in the consumer.

8 · Build it yourself — SNS → SQS fan-out

  1. Create a topic and two queues.
    TOPIC=$(aws sns create-topic --name lab-topic --query TopicArn --output text)
    Q1=$(aws sqs create-queue --queue-name lab-q1 --query QueueUrl --output text)
    Q2=$(aws sqs create-queue --queue-name lab-q2 --query QueueUrl --output text)
    Q1_ARN=$(aws sqs get-queue-attributes --queue-url "$Q1" --attribute-names QueueArn \
      --query 'Attributes.QueueArn' --output text)
    Q2_ARN=$(aws sqs get-queue-attributes --queue-url "$Q2" --attribute-names QueueArn \
      --query 'Attributes.QueueArn' --output text)
  2. Allow SNS to send to the queues.
    cat > /tmp/qp.json <<EOF
    {
      "Policy": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"sns.amazonaws.com\"},\"Action\":\"SQS:SendMessage\",\"Resource\":\"$Q1_ARN\",\"Condition\":{\"ArnEquals\":{\"aws:SourceArn\":\"$TOPIC\"}}}]}"
    }
    EOF
    aws sqs set-queue-attributes --queue-url "$Q1" --attributes file:///tmp/qp.json
    # repeat for Q2
  3. Subscribe both queues to the topic.
    aws sns subscribe --topic-arn "$TOPIC" --protocol sqs --notification-endpoint "$Q1_ARN"
    aws sns subscribe --topic-arn "$TOPIC" --protocol sqs --notification-endpoint "$Q2_ARN"
  4. Publish and observe both queues.
    aws sns publish --topic-arn "$TOPIC" --message '{"event":"order.placed","id":42}'
    # Receive from each queue:
    aws sqs receive-message --queue-url "$Q1" --max-number-of-messages 10
    aws sqs receive-message --queue-url "$Q2" --max-number-of-messages 10
    # Same message delivered to both: fan-out worked.
  5. Add a filter policy so Q2 only gets a subset.
    # Find the subscription ARN for Q2:
    SUB=$(aws sns list-subscriptions-by-topic --topic-arn "$TOPIC" \
      --query 'Subscriptions[?Endpoint==`'$Q2_ARN'`].SubscriptionArn' --output text)
    aws sns set-subscription-attributes --subscription-arn "$SUB" \
      --attribute-name FilterPolicy \
      --attribute-value '{"event":["order.placed"]}'
    # Now Q2 only receives messages whose event attribute is "order.placed".
    # Note: filter checks message attributes, not body; pass attributes when publishing:
    aws sns publish --topic-arn "$TOPIC" --message '...' \
      --message-attributes '{"event":{"DataType":"String","StringValue":"order.placed"}}'
  6. Set up a DLQ on Q1.
    DLQ=$(aws sqs create-queue --queue-name lab-dlq --query QueueUrl --output text)
    DLQ_ARN=$(aws sqs get-queue-attributes --queue-url "$DLQ" --attribute-names QueueArn \
      --query 'Attributes.QueueArn' --output text)
    aws sqs set-queue-attributes --queue-url "$Q1" \
      --attributes "{\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":3}\"}"
  7. Tear down.
    aws sns delete-topic --topic-arn "$TOPIC"
    aws sqs delete-queue --queue-url "$Q1"
    aws sqs delete-queue --queue-url "$Q2"
    aws sqs delete-queue --queue-url "$DLQ"

9 · What breaks

  • "Duplicate messages." SQS Standard is at-least-once. Make consumers idempotent (idempotency key in the message; deduplicate on the consumer side).
  • "Messages stuck invisible." Consumer crashed without deleting. After visibility timeout, message comes back. Tune timeout to "processing budget + slack."
  • DLQ filling silently. Always alarm on DLQ depth. ApproximateNumberOfMessagesVisible > 0 is usually the right alarm.
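  • EventBridge rule "fires for everything." Matching is exact on the fields you name, and any field you leave out matches anything, so a pattern that names only a broad field matches far more than intended. "Any event from source X" is {"source":["x"]}; leaf values must be arrays, and EventBridge rejects a pattern that omits the brackets rather than silently matching everything.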
  • SNS to email: confirmation required. Email subscriptions start unconfirmed; the recipient must click a confirmation link before anything is delivered. Easy to forget when troubleshooting.
