AWS, US-EAST-1, December 7 2021.
An automated scaling action on AWS's internal monitoring fleet exceeded the capacity of the internal network. Congestion control kicked in. Services started retrying. The retries amplified the congestion. Within minutes, the control plane for EC2, ELB, IAM, and S3 metadata in US-EAST-1 was unreachable, and stayed that way for roughly seven hours.
TL;DR
AWS runs a separate internal network plane for service-to-service control traffic, alongside the public-facing main network. Around 07:30 PT on December 7 2021, an automated capacity expansion of the internal monitoring fleet caused many devices to connect to the internal network at once. The traffic exceeded the available capacity between the two networks. The network's congestion control responded by slowing connections. AWS services that depend on internal monitoring data retried more aggressively, deepening the congestion. The cascading effect made the control plane unreachable for EC2, ELB, IAM, and S3 metadata in US-EAST-1 for about seven hours, with downstream impact stretching another twelve.
Timeline
| 07:30 PT | Automated capacity scaling on the internal monitoring fleet begins. Many devices start connecting at once. |
| 07:40 PT | Internal network capacity exceeded between main and internal networks. Congestion control engages. |
| 07:50 PT | AWS services start retrying internal calls. Retry storm amplifies congestion. |
| 08:00 PT | Customer-facing EC2 API failures begin. Public AWS Service Health Dashboard remains green (its own data plane depended on the affected services). |
| 09:30 PT | AWS engineering identifies the congestion source. Begins shutting down portions of internal monitoring traffic. |
| 10:00 PT — 13:00 PT | Slow, rate-limited recovery. Bringing all instances back at once would re-trigger the herd. |
| 13:30 PT | Most customer-facing services restored. Some downstream effects (Lambda warm pool, ECS task scheduling, EventBridge backlogs) continue for several more hours. |
| ~Midnight PT | Long-tail recovery complete. |
What went wrong
AWS internally separates the production network (which carries customer traffic) from an internal control-plane network (which carries service-to-service control traffic, monitoring data, scaling events). The two networks are bridged by a finite-bandwidth set of devices. The monitoring fleet's autoscaler triggered a scale-up that brought many new monitoring devices online simultaneously. Each new device opens connections to many backend services on the production network. Multiplied across the fleet, this exceeded the available bridge capacity.
At this point the network's TCP congestion control did what it's supposed to do: it slowed down connections. Backend services that depended on monitoring data started seeing latencies grow. Their client libraries did what they were configured to do: they retried. The retries arrived on top of the existing congested traffic, deepening the saturation. Within minutes the bridge was unusable and the control plane lost the ability to coordinate.
What was affected
Customer-facing services: EC2 (RunInstances, TerminateInstances), ELB (CreateLoadBalancer, ModifyTargetGroup), Auto Scaling, IAM, Lambda invocation queues, S3 inter-region replication, Route 53 health checks, Connect, Fargate, ECS task scheduling, EventBridge, CloudWatch metrics ingestion, Workspaces, Chime, KMS. Roughly: anything that needed to call the EC2 or IAM control planes anywhere in US-EAST-1.
Customer-visible third-party impact: Slack (which routes through US-EAST-1) stopped delivering messages. Disney+ failed to stream new sessions. Roomba's IoT cloud went offline (vacuum cleaners couldn't connect). Amazon's own retail site degraded, with Prime Now and Whole Foods delivery slots disappearing. Coinbase, Robinhood, Venmo, several major US news sites, and Tinder all reported outages — most rooted in dependencies on AWS-hosted backends.
The recovery
AWS engineering located the congestion source within about 90 minutes — fast for an opaque internal problem. The fix was conceptually simple: shut down portions of internal monitoring traffic to relieve congestion, then bring devices back online in a rate-limited fashion. The "rate-limited" part is the hard one. Bringing everything back at once would re-trigger the same thundering herd that caused the outage. Recovery had to be slow and staged, which is why a 7-hour outage took 3+ hours to recover from once the root cause was understood.
One detail worth noting: the AWS Service Health Dashboard, which would normally show "us-east-1: degraded" in red, stayed green for hours. The dashboard's status updates depended on the affected services. The status page was lying because the system that updates it had no way to talk to the system that shows it. AWS acknowledged this in the post-event summary and committed to running the status page from infrastructure that doesn't share fate with what it reports on.
Lessons
Internal networks deserve the same blast-radius engineering as customer-facing ones. AWS had carefully partitioned customer workloads into AZs and regions to bound failures. The internal monitoring network was the single fate-shared dependency that hadn't been cellularised. Post-incident, AWS introduced finer cellular boundaries inside the internal network plane.
Retries amplify congestion. The textbook fix — exponential backoff with jitter — is in every AWS SDK. Whether internal services consistently applied it is unclear; the post-event summary implies some did not. After this outage, AWS rolled out additional backpressure mechanisms in the internal RPC fabric.
Status pages must not share fate with what they report. The 90 minutes during which the AWS Service Health Dashboard claimed everything was fine cost trust. The status page is now run on infrastructure outside the regions it reports on.
Multi-region for control plane operations. Customers running active-active across regions saw degradation but stayed available. Customers single-homed in US-EAST-1 (the default for too many startups) saw a full outage. The cost of multi-region active-active is significant; this outage made it a clearer line item for risk-averse teams.
Why US-EAST-1?
This is the third US-EAST-1 cascading outage in a five-year stretch (2017 S3 typo; 2020 Kinesis; 2021 monitoring fleet). The region is the oldest, largest, and most complex in AWS's fleet. Many AWS global control planes (IAM, CloudFront, Route 53) are physically rooted in US-EAST-1 — a US-EAST-1 outage cascades to other regions for control-plane operations even when the data plane is fine. AWS has been incrementally migrating global control planes out of US-EAST-1, but it's a multi-year project. Until it's done, US-EAST-1's blast radius is larger than its market share would suggest.
Further reading
- AWS Post-Event Summary — December 2021 — the official write-up. AWS post-events are unusually candid by industry standards.
- The Amazon Builders' Library — collection of AWS engineering essays on resilience patterns. The "Avoiding insurmountable queue backlogs" piece is the canonical reference for this incident's root cause class.
- The Tail at Scale — Dean and Barroso 2013. Background on why fan-out amplifies tail latency and why retries can do more harm than good.
- Borg, Omega, and Kubernetes — Google's perspective on cluster-management failures, which share a lot of structure with this incident.
- Back-pressure and retries — the patterns this incident pressure-tests.