Messaging.
Three services do almost the same thing with different shapes: SQS is a queue (one producer, one consumer per message). SNS is a topic (one publish, fan-out to many subscribers). EventBridge is a smart bus (publish events with attributes; routing rules send them to targets, with schedules and schema registry baked in). Plus DLQs as the universal "we couldn't process this" sink.
1 · What event-driven actually means
The mental model that survives every conversation: an event is a fact-shaped record of something that already happened ("order placed", "image uploaded", "user signed up"). A request is an instruction asking for something to happen ("place this order", "process this image", "create this user"). The two shapes call for different infrastructure. Requests want synchronous APIs, immediate responses, and tight coupling between caller and callee. Events want asynchronous delivery, decoupled producers and consumers, and somebody — a queue, a topic, a bus — sitting between them to absorb the impedance mismatch.
What event-driven isn't: a magic decoupling spell. Producers and consumers still have to agree on the event schema; a renamed field still breaks every consumer that reads it. Adding a queue between two services doesn't make them resilient — it makes failures show up as queue backlog instead of timeouts, which is sometimes the right trade-off and sometimes just a slower way to fail.
What event-driven is: a few primitives, each with a different default. Four shapes cover most AWS designs; this section concentrates on the first three:
| Shape | Primitive | Default | Reach for it when |
|---|---|---|---|
| Queue (1→1) | SQS | Pull, at-least-once, durable, unordered | Buffer work; smooth bursty load; decouple producer rate from consumer rate |
| Topic (1→N) | SNS | Push, at-least-once, fan-out to many subscribers | One event triggers multiple independent consumers |
| Bus (N→N, with rules) | EventBridge | Push, content-based routing, schemas, partner sources | Cross-team event distribution with rich filtering |
| Stream (ordered partitions) | Kinesis / MSK (Kafka) | Pull, ordered per partition, replayable for days | Event sourcing, real-time analytics, anything needing strict order |
2 · The canonical fan-out shape
Most teams arrive at the same architecture independently: a producer publishes an event once to SNS (or EventBridge); each consumer has its own SQS queue subscribed to the topic; consumers pull from their own queue at their own rate. This is the shape because it composes the three properties you actually want — fan-out, durability, and rate-smoothing — without coupling consumers to each other or to the producer.
The alternative — having the producer publish directly to each consumer's queue — looks simpler until the day you add the fourth consumer and have to change the producer's code and redeploy it. SNS in the middle absorbs that change. EventBridge does the same thing with richer routing (content-based rules) and a slightly higher per-event price.
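The fan-out shape can be sketched as a toy in-memory model. This is not the SNS or SQS API: `Topic`, `subscribe`, and `publish` are hypothetical names chosen for illustration, and a `deque` stands in for each consumer's queue.

```python
from collections import deque


class Topic:
    """Toy stand-in for a topic: publishing copies the message to every subscribed queue."""

    def __init__(self):
        self.queues = {}  # subscriber name -> that subscriber's own queue

    def subscribe(self, name):
        # Each consumer gets a private queue, mirroring one queue per consumer.
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, message):
        # The producer publishes once; it never needs to know who is subscribed.
        for queue in self.queues.values():
            queue.append(message)


topic = Topic()
billing = topic.subscribe("billing")
shipping = topic.subscribe("shipping")

topic.publish({"event": "order.placed", "id": 42})

# A fast consumer drains its queue; a slow one simply accumulates backlog.
billing_msg = billing.popleft()
topic.publish({"event": "order.placed", "id": 43})

# Adding a consumer later changes nothing on the producer side.
analytics = topic.subscribe("analytics")
topic.publish({"event": "order.placed", "id": 44})

print(len(billing), len(shipping), len(analytics))
```

The point of the sketch is the last step: the new `analytics` subscriber appears without any change to the code that calls `publish`, and the slow `shipping` queue holds its backlog without affecting `billing`.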
3 · SQS — the queue
A managed pull-based queue. Producers SendMessage; consumers ReceiveMessage, process, DeleteMessage. Two flavours:
| | Standard | FIFO |
|---|---|---|
| Order | Best-effort (mostly in order, not guaranteed) | Strict per-group ordering |
| Delivery | At-least-once (occasional duplicates) | Exactly-once within a 5-minute deduplication window |
| Throughput | Nearly unlimited | 300 API calls/s per action without batching; 3,000 msg/s with batching (per queue; higher with high-throughput mode) |
| Price | $0.40 / million | $0.50 / million |
Key concepts: visibility timeout (after a receive, the message is invisible to other consumers for N seconds; if it is not deleted by then, it becomes visible again); long polling (set WaitTimeSeconds=20 on receive to wait for a message rather than repeatedly polling an empty queue); batching (up to 10 messages per send/receive request, which cuts request count, and so cost, by up to 90% at high throughput).
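The batching figure is simple arithmetic, sketched below. It assumes the Standard price from the table is charged per request and that a batch of up to 10 small messages counts as one request; `request_cost` is a hypothetical helper, and the message volume is made up.

```python
import math

PRICE_PER_MILLION_REQUESTS = 0.40  # Standard queue price from the table above
MAX_BATCH_SIZE = 10                # per-request batch limit quoted above


def request_cost(messages, batch_size):
    """Cost in dollars to send `messages` messages, `batch_size` per request."""
    requests = math.ceil(messages / batch_size)
    return requests * PRICE_PER_MILLION_REQUESTS / 1_000_000


unbatched = request_cost(100_000_000, 1)
batched = request_cost(100_000_000, MAX_BATCH_SIZE)
saving = 1 - batched / unbatched
print(f"unbatched ${unbatched:.2f}, batched ${batched:.2f}, saving {saving:.0%}")
```

The saving only reaches the full 90% when every batch is full, which is why the figure applies at high throughput rather than to a trickle of messages.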
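The visibility timeout is the single concept most teams get wrong, because it drives the duplicate-processing behaviour that makes idempotency mandatory. If a consumer takes longer than the timeout to process and delete a message, the message becomes visible again and another consumer receives it while the first is still working on it.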
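A minimal sketch of that behaviour, using an in-memory model rather than the SQS API. `ToyQueue` and its methods are hypothetical, time is passed in as an explicit number instead of a real clock, and receipt handles are left out for brevity.

```python
class ToyQueue:
    """In-memory model of visibility timeout. Time is an explicit integer, not a real clock."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> [body, time at which it becomes visible again]
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = [body, 0]
        self.next_id += 1

    def receive(self, now):
        # Hand out the first visible message and hide it for visibility_timeout seconds.
        for msg_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        # Only an explicit delete removes the message for good.
        self.messages.pop(msg_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("resize image 7")

first = queue.receive(now=0)      # worker A takes the message
hidden = queue.receive(now=10)    # worker B sees nothing: the message is still invisible
second = queue.receive(now=45)    # A is still busy, the timeout has expired, B gets a duplicate
queue.delete(first[0])            # A finally finishes and deletes
empty = queue.receive(now=100)
```

Here both workers end up processing "resize image 7" because A's work outlasted the 30-second timeout. Either the timeout has to cover the real processing time, or the work has to be safe to do twice.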
4 · SNS — pub/sub fan-out
SNS is a topic with subscribers. One Publish delivers to all subscriptions. Subscribers can be: SQS queues, Lambda functions, HTTPS endpoints, email, SMS, mobile push, Kinesis Data Firehose.
The canonical pattern is SNS → fan-out to SQS: one event needs to trigger N independent consumers. Each consumer has its own SQS queue subscribed to the SNS topic; each queue retains messages independently; a slow consumer doesn't slow others. Replaces the brittle "publish to multiple queues directly from the producer" anti-pattern.
- SNS FIFO topics exist; require FIFO SQS subscribers. Order preserved per group.
- Message filtering — subscriptions can specify a filter policy matching message attributes, so only relevant messages get delivered. Saves consumer CPU and message-delivery cost.
- Cross-region delivery. A topic can deliver to SQS queues and Lambda functions in other regions, which is the usual building block for multi-region event distribution.
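The filtering bullet can be illustrated with a deliberately simplified matcher. It covers only exact string matching (every policy key must be present, with any one of the listed values); `matches_filter_policy` is a hypothetical helper, not an SNS API, and the other filter operators SNS supports are not modelled.

```python
def matches_filter_policy(policy, attributes):
    """Return True if every policy key is present in attributes with one of the allowed values.

    Simplified: exact string matching only.
    """
    return all(attributes.get(name) in allowed for name, allowed in policy.items())


policy = {"event": ["order.placed", "order.cancelled"], "region": ["eu"]}

delivered = matches_filter_policy(policy, {"event": "order.placed", "region": "eu"})
wrong_event = matches_filter_policy(policy, {"event": "order.shipped", "region": "eu"})
missing_attr = matches_filter_policy(policy, {"event": "order.placed"})
no_policy = matches_filter_policy({}, {"event": "anything"})

print(delivered, wrong_event, missing_attr, no_policy)
```

Keys combine with AND and the values inside one key combine with OR, so an empty policy lets every message through.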
5 · EventBridge — bus, rule, target
EventBridge is SNS with content-based routing, schema awareness, and partner integrations built in. The shape is bus → rule → target: events land on a bus, rules pattern-match on event content, matched events fan out to one or more targets. One bus can have dozens of rules; each rule can have up to five targets; a single event can match multiple rules.
Six EventBridge features distinguish it from SNS as the routing layer:
- Pattern matching is rich. Match by source, detail-type, detail field values, prefix, numeric ranges, IP CIDRs, "anything but", existence checks. The patterns are evaluated server-side, so non-matching events don't cost you a target invocation. One rule, up to five targets.
- Targets are basically anything in AWS — Lambda, SQS, SNS, Step Functions, Kinesis, Firehose, ECS task, EventBridge bus (cross-account / cross-region), API destinations (HTTP webhooks with retry + auth), and 60+ more. A target attached to a rule includes its own retry policy and dead-letter SQS.
- Schedule rules (formerly CloudWatch Events). Cron expressions or fixed-rate; trigger any target on schedule. EventBridge Scheduler is the newer separate service designed for millions of one-time schedules (the "remind me in 47 minutes" pattern at scale) — flexible time windows, per-schedule retry, dead-letter destinations.
- Schema registry + discoverer. Discovers event shapes automatically by sampling traffic on the bus; generates typed bindings for Java/Python/TypeScript so consumers know what they're receiving and tools catch breaking changes.
- Archive + replay. Record events for N days (or indefinitely); replay them through any matching rule's targets. The canonical "we fixed a bug in our analytics consumer, now replay the last 6 hours of order.placed events" workflow.
- Partner sources. Stripe, Datadog, Auth0, GitHub, Zendesk, and others publish directly to your event bus through a partner event source — no webhook plumbing.
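To make the pattern-matching bullet concrete, here is a simplified matcher covering a small subset of the operators listed above: exact values, nested fields, prefix, "anything but", and existence checks. It is a sketch under stated assumptions, not EventBridge's implementation: `matches_pattern` and `_value_matches` are hypothetical names, numeric ranges, CIDRs, and array-valued event fields are not modelled, and raising `ValueError` on a bare (non-array) value is this sketch's own choice.

```python
_MISSING = object()  # sentinel for "field not present in the event"


def _value_matches(rule, value):
    """Check one event value against one entry of a pattern array."""
    if isinstance(rule, dict):
        if "prefix" in rule:
            return isinstance(value, str) and value.startswith(rule["prefix"])
        if "anything-but" in rule:
            excluded = rule["anything-but"]
            excluded = excluded if isinstance(excluded, list) else [excluded]
            return value is not _MISSING and value not in excluded
        if "exists" in rule:
            return (value is not _MISSING) == rule["exists"]
        raise ValueError(f"operator not modelled in this sketch: {rule}")
    return value is not _MISSING and value == rule


def matches_pattern(pattern, event):
    """Every field named in the pattern must match; fields not named are unconstrained."""
    for key, expected in pattern.items():
        actual = event.get(key, _MISSING)
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding part of the event.
            nested = {} if actual is _MISSING else actual
            if not isinstance(nested, dict) or not matches_pattern(expected, nested):
                return False
        elif not isinstance(expected, list):
            raise ValueError(f"pattern value for {key!r} must be an array")
        elif not any(_value_matches(rule, actual) for rule in expected):
            # Array of alternatives: any one of them may match.
            return False
    return True


event = {
    "source": "shop.orders",
    "detail-type": "order.placed",
    "detail": {"total": 120, "country": "DE", "coupon": "X"},
}
pattern = {
    "source": ["shop.orders"],
    "detail": {"country": [{"anything-but": ["US"]}], "coupon": [{"exists": True}]},
}

matched = matches_pattern(pattern, event)
by_prefix = matches_pattern({"source": [{"prefix": "shop."}]}, event)
other_source = matches_pattern({"source": ["billing"]}, event)
no_gift = matches_pattern({"detail": {"gift": [{"exists": False}]}}, event)
```

Note that `pattern` says nothing about `detail-type` or `total`, so those fields are unconstrained: a pattern only narrows on the fields it names.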
6 · Dead-letter queues
A DLQ is "the queue for messages we couldn't process." Configure on SQS (after N receive attempts, message moves to DLQ), Lambda async destinations (after max retries, payload goes to DLQ), SNS subscriptions, EventBridge rules. Monitor DLQ depth as a top-level alarm — a growing DLQ is the universal "something is broken downstream" signal.
The replay pattern: build a small tool that reads from DLQ, fixes the message, sends to the main queue. Don't auto-replay — get a human eyeball on a sample before mass-replaying, because the same bug that put the messages in the DLQ might still be there.
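The redrive and replay flow can be sketched with plain in-memory queues. Everything here is hypothetical and local: `consume` models "after N receive attempts, move to the DLQ", the handlers stand in for a buggy and a fixed consumer, and no AWS API is called.

```python
from collections import deque


def consume(main_queue, dlq, handler, max_receive_count=3):
    """Drain main_queue; a message that fails max_receive_count times is moved to dlq.

    Toy model of a redrive policy: each queue entry is (body, receive_count).
    """
    processed = []
    while main_queue:
        body, receives = main_queue.popleft()
        receives += 1
        try:
            processed.append(handler(body))
        except Exception:
            if receives >= max_receive_count:
                dlq.append(body)                     # give up: park it for a human
            else:
                main_queue.append((body, receives))  # becomes visible again later
    return processed


def handler(body):
    # Hypothetical consumer bug: negative amounts crash the handler.
    if body["amount"] < 0:
        raise ValueError("negative amount")
    return body["id"]


def fixed_handler(body):
    # Assumed fix: negative amounts are refunds and are processed normally.
    return body["id"]


main_queue = deque([({"id": i, "amount": a}, 0) for i, a in [(1, 10), (2, -5), (3, 7)]])
dlq = deque()
processed = consume(main_queue, dlq, handler)

sample = list(dlq)[:5]  # a human inspects a sample before any mass replay
replayed = consume(deque((body, 0) for body in dlq), deque(), fixed_handler)
```

The replay step deliberately runs only after the sample has been looked at and the fixed handler is in place. Replaying through the original `handler` would send message 2 straight back to the DLQ.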
7 · Real-world case studies
Three public stories show how these primitives compose into the asynchronous backbones of production systems.
Coca-Cola Freestyle — IoT vending at scale. Coca-Cola's Freestyle dispensers (the touchscreen ones with 100+ drink options) send telemetry — fills, faults, sales — back to AWS. Their publicly documented architecture uses SNS + SQS as the durable buffer between thousands of devices and the downstream analytics and operations pipelines. The Freestyle team has talked at re:Invent about the pattern: device publishes to IoT Core → routed to SNS topics by event type → SQS queues fan out to operational subscribers (replenishment alerts) and to Kinesis Firehose for the data warehouse. The takeaway is the durability: SNS+SQS guarantees no event lost between device and warehouse even when downstream consumers are deploying or briefly broken.
BMW — EventBridge as the cross-account event spine. BMW's connected-vehicle platform uses EventBridge to route vehicle events from one ingest account (which terminates the device TLS sessions and writes to a single bus) to dozens of consumer accounts (analytics, predictive maintenance, customer service). The pattern is interesting because it cleanly separates "the team that owns the data plane" from "the teams that consume the data" — each consumer creates an EventBridge rule in their own account, the central bus has no idea who is subscribing, and BMW's security team can audit cross-account event flow from one place. This is the EventBridge sweet spot: arbitrary consumer count, no point-to-point integrations, full IAM-bounded routing.
Capital One — SQS DLQs and the replay pattern at scale. Capital One has published several architecture posts describing how their event-driven pipelines (transaction enrichment, fraud feature extraction) treat DLQs as a first-class operational surface. Their pattern: every Lambda async target and SQS consumer has a DLQ; CloudWatch alarms fire on DLQ depth > 0; a dedicated replay tool reads from the DLQ, applies the deployed bug fix, and routes back to the main queue. The lesson buried in the public material is that not auto-replaying is the discipline — you need a human to look at a sample before mass replay, because the bug that filled the DLQ might still be in the consumer.
8 · Build it yourself — SNS → SQS fan-out
- Create a topic and two queues.

      TOPIC=$(aws sns create-topic --name lab-topic --query TopicArn --output text)
      Q1=$(aws sqs create-queue --queue-name lab-q1 --query QueueUrl --output text)
      Q2=$(aws sqs create-queue --queue-name lab-q2 --query QueueUrl --output text)
      Q1_ARN=$(aws sqs get-queue-attributes --queue-url $Q1 --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)
      Q2_ARN=$(aws sqs get-queue-attributes --queue-url $Q2 --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)

- Allow SNS to send to the queues.

      cat > /tmp/qp.json <<EOF
      {
        "Policy": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"sns.amazonaws.com\"},\"Action\":\"SQS:SendMessage\",\"Resource\":\"$Q1_ARN\",\"Condition\":{\"ArnEquals\":{\"aws:SourceArn\":\"$TOPIC\"}}}]}"
      }
      EOF
      aws sqs set-queue-attributes --queue-url $Q1 --attributes file:///tmp/qp.json
      # repeat for Q2

- Subscribe both queues to the topic.

      aws sns subscribe --topic-arn $TOPIC --protocol sqs --notification-endpoint $Q1_ARN
      aws sns subscribe --topic-arn $TOPIC --protocol sqs --notification-endpoint $Q2_ARN

- Publish and observe both queues.

      aws sns publish --topic-arn $TOPIC --message '{"event":"order.placed","id":42}'
      # Receive from each queue:
      aws sqs receive-message --queue-url $Q1 --max-number-of-messages 10
      aws sqs receive-message --queue-url $Q2 --max-number-of-messages 10
      # Same message delivered to both: fan-out worked.

- Add a filter policy so Q2 only gets a subset.

      # Find the subscription ARN for Q2:
      SUB=$(aws sns list-subscriptions-by-topic --topic-arn $TOPIC --query 'Subscriptions[?Endpoint==`'$Q2_ARN'`].SubscriptionArn' --output text)
      aws sns set-subscription-attributes --subscription-arn $SUB \
        --attribute-name FilterPolicy \
        --attribute-value '{"event":["order.placed"]}'
      # Now Q2 only receives messages whose event attribute is "order.placed".
      # Note: filter checks message attributes, not body, so pass attributes when publishing:
      aws sns publish --topic-arn $TOPIC --message '...' \
        --message-attributes '{"event":{"DataType":"String","StringValue":"order.placed"}}'

- Set up a DLQ on Q1.

      DLQ=$(aws sqs create-queue --queue-name lab-dlq --query QueueUrl --output text)
      DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $DLQ --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)
      aws sqs set-queue-attributes --queue-url $Q1 \
        --attributes "{\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":3}\"}"

- Tear down.

      aws sns delete-topic --topic-arn $TOPIC
      aws sqs delete-queue --queue-url $Q1
      aws sqs delete-queue --queue-url $Q2
      aws sqs delete-queue --queue-url $DLQ
9 · What breaks
- "Duplicate messages." SQS Standard is at-least-once. Make consumers idempotent (idempotency key in the message; deduplicate on the consumer side).
- "Messages stuck invisible." Consumer crashed without deleting. After visibility timeout, message comes back. Tune timeout to "processing budget + slack."
- DLQ filling silently. Always alarm on DLQ depth. ApproximateNumberOfMessagesVisible > 0 is usually the right alarm.
- EventBridge rule "fires for everything." Pattern matching is exact, so write the pattern carefully. "Any event from source X" is {"source":["x"]}. Leaf values must be wrapped in an array (a bare string is not a valid pattern), and any field the pattern omits is unconstrained, so a pattern that is too short matches far more than intended.
- SNS to email: confirmation required. Email subscriptions start unconfirmed; the recipient must click a confirmation link. Easy to forget when troubleshooting.
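The idempotency fix for duplicate deliveries can be sketched as follows. `IdempotentConsumer` is a hypothetical class, the `idempotency_key` field name is an assumption, and the in-memory set stands in for what would need to be a durable store shared by all consumer instances.

```python
class IdempotentConsumer:
    """Apply each message at most once by remembering idempotency keys already handled."""

    def __init__(self):
        self.seen_keys = set()  # in production this would be a durable store, not memory
        self.balance = 0

    def handle(self, message):
        key = message["idempotency_key"]
        if key in self.seen_keys:
            return False        # duplicate delivery: acknowledge and do nothing
        self.balance += message["amount"]
        self.seen_keys.add(key)
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"idempotency_key": "order-42", "amount": 30},
    {"idempotency_key": "order-43", "amount": 12},
    {"idempotency_key": "order-42", "amount": 30},  # redelivered after a visibility timeout
]
applied = [consumer.handle(m) for m in deliveries]
print(applied, consumer.balance)
```

The duplicate is still deleted from the queue like any other message; it just has no effect. In a real consumer the side effect and the key write would need to happen atomically, which this sketch does not attempt.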
10 · Further reading
- How message queues work. The protocol/semantics primer — at-least-once, exactly-once, dead-letter patterns.
- EventBridge user guide.
- Idempotence. The pattern you need on every consumer.