2021 · 7 hours degraded
Postmortem · cloud · cascading failure

AWS, US-EAST-1, December 7 2021.

An automated scaling action on AWS's internal monitoring fleet exceeded the capacity of the internal network. Congestion control kicked in. Services started retrying. The retries amplified the congestion. Within minutes, the control plane for EC2, ELB, IAM, and S3 metadata in US-EAST-1 was unreachable, and stayed that way for roughly seven hours.

Date 7 Dec 2021
Duration ~7 hours (10:30 – 17:30 UTC)
Region us-east-1 (N. Virginia)

TL;DR

AWS runs a separate internal network plane for service-to-service control traffic, alongside the public-facing main network. Around 07:30 PT on December 7 2021, an automated capacity expansion of the internal monitoring fleet caused many devices to connect to the internal network at once. The traffic exceeded the available capacity between the two networks. The network's congestion control responded by slowing connections. AWS services that depend on internal monitoring data retried more aggressively, deepening the congestion. The cascading effect made the control plane unreachable for EC2, ELB, IAM, and S3 metadata in US-EAST-1 for about seven hours, with downstream impact stretching another twelve.

Timeline

07:30 PTAutomated capacity scaling on the internal monitoring fleet begins. Many devices start connecting at once.
07:40 PTInternal network capacity exceeded between main and internal networks. Congestion control engages.
07:50 PTAWS services start retrying internal calls. Retry storm amplifies congestion.
08:00 PTCustomer-facing EC2 API failures begin. Public AWS Service Health Dashboard remains green (its own data plane depended on the affected services).
09:30 PTAWS engineering identifies the congestion source. Begins shutting down portions of internal monitoring traffic.
10:00 PT — 13:00 PTSlow, rate-limited recovery. Bringing all instances back at once would re-trigger the herd.
13:30 PTMost customer-facing services restored. Some downstream effects (Lambda warm pool, ECS task scheduling, EventBridge backlogs) continue for several more hours.
~Midnight PTLong-tail recovery complete.

What went wrong

AWS internally separates the production network (which carries customer traffic) from an internal control-plane network (which carries service-to-service control traffic, monitoring data, scaling events). The two networks are bridged by a finite-bandwidth set of devices. The monitoring fleet's autoscaler triggered a scale-up that brought many new monitoring devices online simultaneously. Each new device opens connections to many backend services on the production network. Multiplied across the fleet, this exceeded the available bridge capacity.

At this point the network's TCP congestion control did what it's supposed to do: it slowed down connections. Backend services that depended on monitoring data started seeing latencies grow. Their client libraries did what they were configured to do: they retried. The retries arrived on top of the existing congested traffic, deepening the saturation. Within minutes the bridge was unusable and the control plane lost the ability to coordinate.

Control plane vs data plane. Most cloud outage analysis focuses on data-plane failures (an EBS volume goes away; a DB primary fails over). Control-plane failures are different: the service is still running, but you can't change anything. You can't launch new instances, terminate old ones, modify ELB targets, or update IAM. For applications that don't need to change configuration, this can be invisible. For everyone running CI/CD, autoscaling, or recovering from any other issue, it's catastrophic.

What was affected

Customer-facing services: EC2 (RunInstances, TerminateInstances), ELB (CreateLoadBalancer, ModifyTargetGroup), Auto Scaling, IAM, Lambda invocation queues, S3 inter-region replication, Route 53 health checks, Connect, Fargate, ECS task scheduling, EventBridge, CloudWatch metrics ingestion, Workspaces, Chime, KMS. Roughly: anything that needed to call the EC2 or IAM control planes anywhere in US-EAST-1.

Customer-visible third-party impact: Slack (which routes through US-EAST-1) stopped delivering messages. Disney+ failed to stream new sessions. Roomba's IoT cloud went offline (vacuum cleaners couldn't connect). Amazon's own retail site degraded, with Prime Now and Whole Foods delivery slots disappearing. Coinbase, Robinhood, Venmo, several major US news sites, and Tinder all reported outages — most rooted in dependencies on AWS-hosted backends.

The recovery

AWS engineering located the congestion source within about 90 minutes — fast for an opaque internal problem. The fix was conceptually simple: shut down portions of internal monitoring traffic to relieve congestion, then bring devices back online in a rate-limited fashion. The "rate-limited" part is the hard one. Bringing everything back at once would re-trigger the same thundering herd that caused the outage. Recovery had to be slow and staged, which is why a 7-hour outage took 3+ hours to recover from once the root cause was understood.

One detail worth noting: the AWS Service Health Dashboard, which would normally show "us-east-1: degraded" in red, stayed green for hours. The dashboard's status updates depended on the affected services. The status page was lying because the system that updates it had no way to talk to the system that shows it. AWS acknowledged this in the post-event summary and committed to running the status page from infrastructure that doesn't share fate with what it reports on.

Lessons

Internal networks deserve the same blast-radius engineering as customer-facing ones. AWS had carefully partitioned customer workloads into AZs and regions to bound failures. The internal monitoring network was the single fate-shared dependency that hadn't been cellularised. Post-incident, AWS introduced finer cellular boundaries inside the internal network plane.

Retries amplify congestion. The textbook fix — exponential backoff with jitter — is in every AWS SDK. Whether internal services consistently applied it is unclear; the post-event summary implies some did not. After this outage, AWS rolled out additional backpressure mechanisms in the internal RPC fabric.

Status pages must not share fate with what they report. The 90 minutes during which the AWS Service Health Dashboard claimed everything was fine cost trust. The status page is now run on infrastructure outside the regions it reports on.

Multi-region for control plane operations. Customers running active-active across regions saw degradation but stayed available. Customers single-homed in US-EAST-1 (the default for too many startups) saw a full outage. The cost of multi-region active-active is significant; this outage made it a clearer line item for risk-averse teams.

Why US-EAST-1?

This is the third US-EAST-1 cascading outage in a five-year stretch (2017 S3 typo; 2020 Kinesis; 2021 monitoring fleet). The region is the oldest, largest, and most complex in AWS's fleet. Many AWS global control planes (IAM, CloudFront, Route 53) are physically rooted in US-EAST-1 — a US-EAST-1 outage cascades to other regions for control-plane operations even when the data plane is fine. AWS has been incrementally migrating global control planes out of US-EAST-1, but it's a multi-year project. Until it's done, US-EAST-1's blast radius is larger than its market share would suggest.

Further reading

More postmortems
Back to the postmortems index
Famous outages — Cloudflare, GitHub, AWS, GitLab — what broke and what was learned.
Found this useful?