Cloud Codex · AWS / 01

Foundations.

Before any service makes sense, the map: one AWS account owns resources; resources live in regions; regions contain Availability Zones; AZs contain physical data centres. A control plane creates and configures the resources; a data plane carries the actual traffic. Almost every operational story on AWS — outage, latency, billing — comes back to one of these axes.

1 · What an AWS account actually is

The mental model that survives every cross-account debugging session: an AWS account is the unit of isolation. It owns a flat namespace of resources, a single bill, and a tree of IAM principals — and almost everything that talks across that boundary needs explicit permission from both sides. The 12-digit account ID is the only thing that uniquely names it; the friendly name in the console is cosmetic.

Each account has exactly one root user. The root user is identified by the email address used to sign up and has irrevocable god-mode over the account (close it, change the payment method, delete IAM Identity Center, transfer ownership). No identity policy or permissions boundary can be attached to it, so nothing inside the account can restrict it. The one outside control is an SCP, which applies to the root user of a member account but never to anything in the management account. Everything else is an IAM principal (IAM users, IAM roles, federated sessions, service-linked roles) and lives in the IAM database scoped to that account. Operationally, you use the root user exactly once after sign-up (to enable MFA, set billing alarms, and create the first admin role), then never log in as it again.

Production AWS estates use many accounts grouped under AWS Organizations, not one. The shape that survives an audit: one management account that owns Organizations and consolidated billing; member accounts arranged into Organizational Units (OUs) like Workloads/Prod, Workloads/NonProd, Security, SharedServices, Sandbox. Each workload-and-environment pair gets its own account — e.g., payments-prod is a different account from payments-staging. That's the "blast radius is one account" property: a runaway IAM role in staging can't reach prod resources because the trust boundary isn't even configured.

Account-as-X	What it gives you	What it doesn't
Isolation boundary	Hard quota / IAM / network separation between workloads	Free networking — peer / Transit Gateway across accounts costs money
Billing unit	One invoice per account; consolidated under the org payer	Per-team showback — you still need tags + Cost Allocation reports
Quota scope	Service quotas (e.g., 5 Elastic IPs per region by default) apply per account, per region	Sharing quota: every new account starts at default limits
Policy scope	SCPs from the org apply transitively to the account	Hiding from the org — member accounts can't refuse SCPs from the management account
Tenant boundary for SaaS	Pattern: one account per large customer for hard isolation (used by Snowflake-on-AWS, parts of Atlassian)	Cheap scale — provisioning, deleting, and billing accounts has real overhead

Root user MFA isn't optional. A 2024 AWS change requires MFA on every root user in the management account, with the rest of the org following in 2025. Use a hardware key (YubiKey) or a TOTP app that isn't on the same device as the email account. AWS does not issue recovery codes for root MFA, so register a second MFA device and keep it (or the TOTP seed) somewhere offline. Losing root MFA means a multi-day support ticket and notarised identity verification to regain access.

2 · How AWS is physically organised

Underneath the account abstraction is a deliberately fractal physical layout. Every operational story about AWS — outage, latency, billing — comes back to where in the hierarchy the request lives.

Five layers, top to bottom: partition, region, Availability Zone, data centre, and cell.

Partition. One of three independent AWS universes: commercial aws, China aws-cn (operated by local partners), and US Government aws-us-gov. They have separate IAM, separate billing, and separate ARN namespaces. Cross-partition traffic isn't on the AWS backbone; it crosses the public internet.
Region. A geographic location with its own control plane and a published service catalogue (33+ commercial regions as of late 2025).
Availability Zone. One or more physically separated data centres inside a region with independent power, cooling, and networking, linked to the region's other AZs by low-latency private fibre.
Cell. An internal-to-AWS partition of a service's fleet within a region (often within a single AZ). Most large AWS services (S3, DynamoDB, Lambda, SQS) deploy as many independent cells per region so that a bad deploy or a runaway tenant hits one cell, not the whole region. Engineers don't see cells directly; you see them indirectly as the reason "S3 throttling hit 3% of buckets, not 100%."

This cellular shape is the topic of Amazon's "Avoiding overload" Builders' Library article and Peter Vosshall's re:Invent talk "How AWS Minimizes the Blast Radius of Failures". The pattern that recurs: route traffic to cells with a thin partitioner that itself runs in many cells, deploy to one cell at a time, and design every service so that the worst possible bug takes out one cell instead of the whole region.
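The "thin partitioner" idea can be sketched in a few lines. This is an illustrative model only, not how any AWS service actually routes traffic: the function names, the tenant naming scheme, and the cell count of 20 are all assumptions made for the example.

```python
import hashlib


def cell_for_tenant(tenant_id: str, cell_count: int) -> int:
    """Map a tenant to one cell with a stable hash (same input, same cell)."""
    if cell_count < 1:
        raise ValueError("cell_count must be at least 1")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def blast_radius(tenants: list[str], bad_cell: int, cell_count: int) -> list[str]:
    """Return only the tenants whose traffic lands in the failing cell."""
    return [t for t in tenants if cell_for_tenant(t, cell_count) == bad_cell]


if __name__ == "__main__":
    # Hypothetical tenant names; a bad deploy to cell 0 touches only its tenants.
    tenants = [f"tenant-{i}" for i in range(1000)]
    hit = blast_radius(tenants, bad_cell=0, cell_count=20)
    print(f"{len(hit)} of {len(tenants)} tenants affected")
```

Plain modulo hashing reshuffles most tenants whenever the cell count changes, so a real partitioner would more likely keep an explicit tenant-to-cell mapping; the sketch only shows why a fault in one cell stays in that cell.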

AZ names are randomised per account. Your us-east-1a is not the same physical AZ as another account's us-east-1a. AWS shuffles the mapping per account so that traffic balances naturally across the alphabetical AZs. To compare AZs across accounts use the AZ ID (use1-az1), which is consistent. aws ec2 describe-availability-zones --query 'AvailabilityZones[].[ZoneName,ZoneId]' shows the mapping.
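A small sketch of comparing AZs across two accounts by AZ ID rather than by name. It assumes you have saved the full JSON output of aws ec2 describe-availability-zones from each account; the two sample payloads below are invented for illustration and the helper names are hypothetical.

```python
import json


def zone_ids_by_name(describe_output: str) -> dict[str, str]:
    """Turn `aws ec2 describe-availability-zones` JSON into {ZoneName: ZoneId}."""
    zones = json.loads(describe_output)["AvailabilityZones"]
    return {zone["ZoneName"]: zone["ZoneId"] for zone in zones}


def same_physical_az(map_a: dict, name_a: str, map_b: dict, name_b: str) -> bool:
    """Two zone names refer to the same physical AZ only if their AZ IDs match."""
    return map_a[name_a] == map_b[name_b]


# Hypothetical outputs from two accounts (the mappings are made up).
ACCOUNT_A = json.dumps({"AvailabilityZones": [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az1"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az2"},
]})
ACCOUNT_B = json.dumps({"AvailabilityZones": [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az2"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az1"},
]})

if __name__ == "__main__":
    a = zone_ids_by_name(ACCOUNT_A)
    b = zone_ids_by_name(ACCOUNT_B)
    print(same_physical_az(a, "us-east-1a", b, "us-east-1a"))  # False: same name, different AZ
    print(same_physical_az(a, "us-east-1a", b, "us-east-1b"))  # True: different name, same AZ
```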

3 · The account is the security boundary

An AWS account is a billing unit and a security boundary in one. The 12-digit account ID identifies it, and most resource ARNs carry it (arn:aws:iam::123456789012:role/MyRole does; arn:aws:s3:::my-bucket is the notable exception). Cross-account role access requires a deliberate grant on both sides: a trust policy on the target role that names the caller's account or principal, and an identity policy on the caller that allows sts:AssumeRole on that role. There is no "I'm in the same org, let me in" shortcut.
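The two halves of a cross-account role grant might look like the sketch below, built as plain Python dictionaries so the shape is easy to inspect. The account IDs and role name are placeholders, and the helper names are hypothetical; real policies usually add conditions (for example an external ID) that are left out here.

```python
import json


def trust_policy(caller_account_id: str) -> dict:
    """Trust policy attached to the TARGET role: who is allowed to assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            # ":root" here delegates to the whole caller account, not its root user.
            "Principal": {"AWS": f"arn:aws:iam::{caller_account_id}:root"},
            "Action": "sts:AssumeRole",
        }],
    }


def caller_policy(target_role_arn: str) -> dict:
    """Identity policy attached to the CALLER: which role it may assume."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": target_role_arn,
        }],
    }


if __name__ == "__main__":
    # Placeholder accounts: 111111111111 calls into a role owned by 222222222222.
    print(json.dumps(trust_policy("111111111111"), indent=2))
    print(json.dumps(caller_policy("arn:aws:iam::222222222222:role/MyRole"), indent=2))
```

If either document is missing, the AssumeRole call is denied, which is the "both sides" rule in concrete form.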

Production AWS estates use many accounts, not one — typically one per environment per workload (prod, staging, dev × per service), grouped under an AWS Organization for billing and policy. The pattern is "separate accounts so blast radius is bounded; tied together with Organizations so finance can see one bill and security can apply org-wide controls." A single-account deployment is the dev shape, not the prod shape — covered properly in the IAM page.

Service quotas are scoped per account, per region. The default of 5 Elastic IPs applies separately in each region, so an increase granted in us-east-1 doesn't help you in eu-west-2. New AWS accounts start at default quotas in every region; raise them via Service Quotas well before a launch, not the morning of. aws service-quotas list-service-quotas --service-code ec2 shows what you've got.

4 · Regions and Availability Zones

A region is a geographic location (us-east-1 is Northern Virginia, eu-west-2 is London). As of late 2025, there are 33+ commercial regions plus the GovCloud / China partitions. Regions are isolated by design: an outage in us-east-1 doesn't take down eu-west-2, and resources don't replicate between regions automatically. The pricing, the service availability set, and the legal jurisdiction all differ by region.

An Availability Zone (AZ) is one or more physically separate data centres inside a region with independent power, cooling, and networking. AZs are connected by high-bandwidth, low-latency private fibre; the cross-AZ round trip within a region is typically in the low single-digit milliseconds. The point of AZs is to avoid fate-sharing: a fire in one AZ shouldn't take down resources in another.

Concept	Granularity	What's isolated	Example failure
Region	Geographic	Service, billing, jurisdiction	Whole region outage (rare, but us-east-1 has done it)
AZ	Data centre	Power, cooling, network	One AZ goes dark; other two carry the load
Rack	Physical row	Top-of-rack switch, PDU	One rack loses power; AWS reschedules instances
Host	Single server	Hardware fault	One EC2 dies; auto-recovery moves it to a new host

5 · Pick a region — the trade-offs

Three factors matter when picking a region:

Latency to users. Pick a region close to where most traffic lives. A 100 ms round-trip difference is the difference between "snappy" and "noticeably slow."
Data residency. EU users' data typically has to stay in an EU region. Public-sector and healthcare workloads have stricter rules (HIPAA-eligible services, GovCloud).
Service availability and price. us-east-1 launches new services first and is usually cheapest. ap-south-1, sa-east-1, etc., often run 10–30% more expensive and miss some services for months. AWS publishes per-region service availability.

us-east-1 is the control-plane home region. A handful of AWS services are global but run their control plane out of us-east-1: IAM, Route 53, CloudFront, S3 bucket naming, and Organizations. When us-east-1 has a bad day, changes to these services (new roles, DNS record updates, new distributions, new buckets) stall everywhere, even though their data planes keep answering. That's why multi-region strategies don't fully save you from us-east-1 outages.

6 · Control plane vs data plane

Every AWS service is built as two systems with very different reliability profiles. The control plane is what handles creating, configuring, and describing resources — running aws ec2 run-instances hits the EC2 control plane. The data plane is what handles the actual traffic to the resource — once an instance is running, SSH, HTTP, and EBS reads/writes hit the data plane.

Data planes are designed to keep working through control plane outages. During the December 2021 us-east-1 event, the IAM control plane was degraded for hours — meaning nobody could create new roles or rotate keys — but already-deployed Lambda functions kept executing because the Lambda data plane kept routing requests. The mental model: "if the worst happens, existing resources keep running, but I can't change them."

Service	Control-plane example	Data-plane example
EC2	`RunInstances`, `TerminateInstances`	Packet to the running instance
S3	`CreateBucket`, `PutBucketPolicy`	`GetObject`, `PutObject`
Lambda	`CreateFunction`, `UpdateFunctionCode`	`Invoke`
DynamoDB	`CreateTable`, `UpdateTable`	`GetItem`, `PutItem`, `Query`
IAM	`CreateRole`, `AttachRolePolicy`, and every other IAM API call	Auth checks against existing roles still resolve from a local cache

The two surfaces fail very differently — knowing which one you're hitting tells you which retry strategy to use, what alarm to set, and how scared to be about an outage.

Property	Control plane	Data plane
Rate limits	Strict, low (single- or double-digit per second per account)	High — designed for production traffic (thousands–millions / sec)
Latency	Hundreds of ms to seconds (creating things is slow)	Single-digit ms typical
How it fails	"Cannot create new resource right now"	"Existing requests start erroring or slowing down"
Operator dependency	You can't change anything during the outage	Already-deployed resources keep serving traffic
Cell isolation	Usually one fleet per region (sometimes global)	Many cells per region — partial failure is normal
Where to retry	Long exponential backoff (minutes); don't bombard it	Aggressive retry with jitter (ms–seconds)
SLA	Often not separately published	The published "99.99% available" number
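The "Where to retry" row can be sketched as one backoff function with two very different parameter sets. This is a minimal illustration of exponential backoff with full jitter; the base, cap, and attempt numbers are assumptions chosen to match the orders of magnitude in the table, not published AWS guidance.

```python
import random


def backoff_delays(base: float, cap: float, attempts: int, rng=None) -> list[float]:
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)] seconds."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * (2 ** n))) for n in range(attempts)]


# Illustrative settings only.
CONTROL_PLANE = {"base": 2.0, "cap": 300.0, "attempts": 6}   # seconds up to minutes
DATA_PLANE = {"base": 0.05, "cap": 2.0, "attempts": 4}       # milliseconds up to seconds

if __name__ == "__main__":
    for name, cfg in (("control plane", CONTROL_PLANE), ("data plane", DATA_PLANE)):
        delays = backoff_delays(**cfg)
        print(name, [round(d, 3) for d in delays])
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment.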

The Builders' Library article "Static stability using Availability Zones" codifies the principle: design every workload so the data plane keeps serving even when its own control plane is unreachable. The Auto Scaling group has already chosen the AZs and instance types; the load balancer already knows about the targets; DNS records are already cached. You can lose RunInstances for hours without users noticing, provided the running fleet already has enough spare capacity to absorb the failure.
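The "spare capacity" condition can be put in numbers. The sketch below assumes load spreads evenly across AZs and that surviving a failure must not require launching anything new; the function names and the example figures are illustrative.

```python
import math


def instances_per_az(peak_instances: int, az_count: int, az_failures: int = 1) -> int:
    """Instances to keep running in EACH AZ so the survivors can carry peak load alone."""
    surviving = az_count - az_failures
    if surviving < 1:
        raise ValueError("at least one AZ must survive")
    return math.ceil(peak_instances / surviving)


def overprovision_ratio(peak_instances: int, az_count: int, az_failures: int = 1) -> float:
    """Total running fleet divided by what peak load strictly needs."""
    total = instances_per_az(peak_instances, az_count, az_failures) * az_count
    return total / peak_instances


if __name__ == "__main__":
    # Peak load needs 12 instances.
    print(instances_per_az(12, az_count=3))     # 6 per AZ, 18 in total
    print(overprovision_ratio(12, az_count=3))  # 1.5
    print(overprovision_ratio(12, az_count=2))  # 2.0
```

Spreading across three AZs instead of two cuts the standing overhead from 100% to 50%, which is one reason three-AZ deployments are the usual default.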

7 · ARNs — the universal handle

An Amazon Resource Name (ARN) is the canonical identifier for a resource. The format looks regular until you look closely — different services drop or repurpose fields:

arn:<partition>:<service>:<region>:<account-id>:<resource-type>/<resource-id>
arn:<partition>:<service>:<region>:<account-id>:<resource-type>:<resource-id>
arn:<partition>:<service>:<region>:<account-id>:<resource-id>

The services you'll touch most have subtly different ARN shapes. Reading them fluently is how you'll write IAM policies, debug cross-account access, and trace CloudTrail events end-to-end.

Service	ARN shape	What's missing / odd
S3 bucket	`arn:aws:s3:::my-bucket`	No region, no account-id. Buckets are global names.
S3 object	`arn:aws:s3:::my-bucket/some/key.json`	Same — the key is appended after the bucket.
IAM role	`arn:aws:iam::123456789012:role/MyRole`	No region — IAM is global. Account ID present.
Lambda function	`arn:aws:lambda:us-east-1:123456789012:function:my-fn`	Standard shape. Add `:$LATEST` or `:5` for a version.
DynamoDB table	`arn:aws:dynamodb:us-east-1:123456789012:table/Orders`	Standard.
SQS queue	`arn:aws:sqs:us-east-1:123456789012:my-queue`	No resource-type prefix — the queue name is the resource-id directly.
SNS topic	`arn:aws:sns:us-east-1:123456789012:my-topic`	Same shape as SQS — topic name directly.
KMS key	`arn:aws:kms:us-east-1:123456789012:key/abcd1234-…`	The resource-id is a UUID, not a friendly name (use aliases).
API Gateway	`arn:aws:execute-api:us-east-1:123456789012:abc1234/prod/GET/users`	The "service" in the ARN (`execute-api`) doesn't match the service name in the console.
CloudWatch Logs	`arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/my-fn:*`	Trailing `:*` is conventional in IAM policy `Resource` fields.
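Because the first five fields are always colon-separated and only the resource part varies, a parser can split on the first five colons and leave the rest alone. A minimal sketch, with hypothetical names (Arn, parse_arn); it does not validate partitions, regions, or account IDs.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str      # empty for global services such as IAM and S3
    account_id: str  # empty for S3 bucket and object ARNs
    resource: str    # everything after the fifth colon, service-specific


def parse_arn(arn: str) -> Arn:
    """Split an ARN into its fixed fields; the resource part is kept verbatim."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])


if __name__ == "__main__":
    for example in (
        "arn:aws:s3:::my-bucket",
        "arn:aws:iam::123456789012:role/MyRole",
        "arn:aws:lambda:us-east-1:123456789012:function:my-fn",
        "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/my-fn:*",
    ):
        print(parse_arn(example))
```

Limiting the split to five colons is the important detail: Lambda and CloudWatch Logs resources contain colons of their own.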

Wildcards in ARNs are policy-evaluator wildcards. arn:aws:s3:::my-bucket/* means "every object in my-bucket"; arn:aws:s3:::my-bucket (no trailing slash) means "the bucket itself" — and bucket-level vs object-level actions need different ARNs in the policy. The number-one cause of "ListBucket works but GetObject doesn't" is mixing these up.
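The bucket-versus-object distinction is easy to reproduce with a toy matcher. This models only the * and ? wildcards in a Resource element; it is a simplified sketch, not the IAM policy evaluator, and the function names are hypothetical.

```python
import re


def iam_wildcard_match(pattern: str, value: str) -> bool:
    """Match where * is any run of characters and ? is exactly one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def allowed(resources: list[str], target_arn: str) -> bool:
    """True if any Resource pattern in the statement covers the target ARN."""
    return any(iam_wildcard_match(p, target_arn) for p in resources)


BUCKET = "arn:aws:s3:::my-bucket"                # what s3:ListBucket is checked against
OBJECT = "arn:aws:s3:::my-bucket/some/key.json"  # what s3:GetObject is checked against

if __name__ == "__main__":
    bucket_only = ["arn:aws:s3:::my-bucket"]
    both = ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]
    print(allowed(bucket_only, BUCKET))  # True:  ListBucket works
    print(allowed(bucket_only, OBJECT))  # False: GetObject is denied
    print(allowed(both, OBJECT))         # True:  both ARNs listed
```

The same matcher shows why arn:aws:s3:::my-bucket* is a risky shortcut: it also matches a different bucket named my-bucket-other.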

8 · Decoding a service name

AWS service names look chaotic but follow a pattern once you know it:

"Elastic X" — adjustable, pay-per-use. Elastic Compute Cloud (EC2), Elastic Block Store (EBS), Elastic Container Service (ECS), Elastic Load Balancing (ELB), Elastic File System (EFS).
"Amazon X" — usually an AWS-shaped take on a familiar primitive. Amazon S3 is simple storage, Amazon RDS is relational DB service, Amazon SQS is simple queue service, Amazon Aurora is the Postgres/MySQL-compatible managed DB.
"AWS X" — usually a service Amazon built from scratch or that doesn't have an obvious open-source equivalent. AWS Lambda, AWS Step Functions, AWS Config, AWS Organizations, AWS CloudFormation.
The naming inconsistency is real. Lambda is "AWS Lambda" but Aurora is "Amazon Aurora." CloudWatch is "Amazon CloudWatch" even though it was built in-house, just as S3 was. Read the prefix as a weak hint, nothing more.

The ARN format is more useful than the service name. Once you read ARNs fluently (see section 7), service-name prefixes barely matter — the ARN's service field is the canonical name (s3, lambda, execute-api) and that's what IAM cares about. The marketing-name prefix (Amazon vs AWS vs Elastic) is for the brochure.

9 · Real-world case studies

Three public stories give a sense of how the account / region / cell hierarchy actually shapes systems at scale.

Stripe — many accounts, one observability plane. Stripe's infrastructure is split across hundreds of AWS accounts under one Organization. The "Operating Kubernetes Clusters for over a Decade at Stripe" and "Canonical log lines" posts describe the shape: each compute platform team owns its own AWS accounts; a separate observability account aggregates logs and metrics across all of them via cross-account roles; SCPs at the org level enforce "no public S3, no IAM users in prod, all data must encrypt at rest." The blast-radius argument: a misconfigured deploy in one workload account can never touch the keys, logs, or buckets of another. The price: account creation and inventory itself becomes a platform — automated via AWS Control Tower with Service Catalog landing zones.

Netflix — region as the unit of failure. Netflix runs every service in three regions (typically us-east-1, us-west-2, eu-west-1) with Eureka / Zuul / Atlas configured to fail traffic between them. Their "Active-Active for Multi-Regional Resiliency" post describes the architecture; the Simian Army family of failure-injection tools includes Chaos Kong, which deliberately fails an entire AWS region in production to verify that traffic shifts cleanly. The lesson isn't "you must run active-active in three regions" — most workloads don't earn that complexity — it's that the region is the natural boundary of a correlated AWS failure, so any DR plan that doesn't cross one is mostly theatre.

AWS Builders' Library — cellular architecture. The "Avoiding overload" and "Workload isolation using shuffle sharding" articles describe how AWS itself builds nearly every regional service as N independent cells, with a thin partitioner deciding which cell a tenant's traffic goes to. Route 53 is the canonical example — a hostile customer flooding one cell affects ~1/N of other customers, and shuffle sharding makes the probability that any two customers share all their cells vanishingly small. The same shape shows up in Lambda's worker fleets, DynamoDB partitions, and S3's keymap shards (see the S3 page). When you read "this region had a partial outage affecting ~3% of customers," that's cells working as designed.
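The combinatorial argument behind shuffle sharding fits in a few lines. The sketch assumes each customer is assigned a uniformly random set of workers; the fleet size of 8 and shard size of 2 are small illustrative numbers, not the figures for any particular AWS service.

```python
import math


def shuffle_shard_count(workers: int, shard_size: int) -> int:
    """Number of distinct shards when each shard is any combination of workers."""
    return math.comb(workers, shard_size)


def fixed_shard_count(workers: int, shard_size: int) -> int:
    """Number of shards when workers are cut into fixed, non-overlapping groups."""
    return workers // shard_size


def full_overlap_probability(shard_count: int) -> float:
    """Chance that a second customer lands on exactly the same shard as the first."""
    return 1 / shard_count


if __name__ == "__main__":
    workers, shard_size = 8, 2
    fixed = fixed_shard_count(workers, shard_size)
    shuffled = shuffle_shard_count(workers, shard_size)
    print(fixed, full_overlap_probability(fixed))  # 4 0.25
    print(shuffled)                                # 28
    print(full_overlap_probability(shuffled))
```

With the same eight workers, the chance that a noisy neighbour shares every worker with you drops from 1 in 4 to 1 in 28, and the gap widens quickly as the fleet grows.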

The through-line: in 2026 production AWS, the question is rarely "how do I survive a server failure" (AWS handles that) and almost always "how do I cap the blast radius of the next bad deploy / bad query / bad tenant." Accounts, regions, AZs, and cells are the tools.

10 · Build it yourself — first-run AWS CLI

The 10-minute lab that pays off for every subsequent page: get the CLI working, list regions, list AZs, and inspect the account.

Install the CLI.
brew install awscli   # macOS
# or: see https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Create a sandbox IAM user (if you don't have SSO). In the AWS console: IAM → Users → Add user, leave "Provide user access to the AWS Management Console" unticked, and attach AdministratorAccess (it's a sandbox account; production never does this). Then generate an access key for the user and label it cli.
Configure the CLI.
aws configure
# AWS Access Key ID: AKIA...
# AWS Secret Access Key: ...
# Default region name: us-east-1
# Default output format: json
Verify with STS. This is the first command to run on any unfamiliar AWS environment: it tells you which account and which principal your credentials belong to. It does not report the region; check that separately with aws configure get region.
aws sts get-caller-identity
# {
#   "UserId": "AIDA...",
#   "Account": "123456789012",
#   "Arn": "arn:aws:iam::123456789012:user/cli"
# }
List regions.
aws ec2 describe-regions --query 'Regions[].RegionName' --output table
List AZs in your region — including the consistent-across-accounts AZ ID.
aws ec2 describe-availability-zones \
  --query 'AvailabilityZones[].[ZoneName,ZoneId,State]' \
  --output table
List services available in your region (the region name in the path is an example; substitute your own).
aws ssm get-parameters-by-path \
  --path /aws/service/global-infrastructure/regions/us-east-1/services \
  --query 'Parameters[].Value' --output table | head -40
Set a budget alarm. Do this now so runaway cost triggers an email once actual spend passes 80% of a $25 monthly budget ($20). A budget only alerts; it does not stop anything from running.
aws budgets create-budget \
  --account-id "$(aws sts get-caller-identity --query Account --output text)" \
  --budget '{"BudgetName":"lab","BudgetLimit":{"Amount":"25","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"you@example.com"}]}]'

Nothing to tear down here. No resources were created. Future labs end with explicit delete commands; this one is just setup. From here, every other page in this codex assumes you can run aws sts get-caller-identity and get a successful response.
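If you script around that check, one possible sketch is to parse the JSON from aws sts get-caller-identity and refuse to continue when the caller is the root user. The function name and the classification labels are hypothetical; the sample payload mirrors the output shown in the lab.

```python
import json


def describe_identity(sts_output: str) -> dict:
    """Summarise `aws sts get-caller-identity` JSON: account, principal, and its kind."""
    identity = json.loads(sts_output)
    resource = identity["Arn"].split(":", 5)[5]  # part after the account ID
    if resource == "root":
        kind = "root user"
    elif resource.startswith("user/"):
        kind = "IAM user"
    elif resource.startswith("assumed-role/"):
        kind = "assumed-role session"
    else:
        kind = "other"
    return {"account": identity["Account"], "principal": resource, "kind": kind}


if __name__ == "__main__":
    sample = json.dumps({
        "UserId": "AIDA...",
        "Account": "123456789012",
        "Arn": "arn:aws:iam::123456789012:user/cli",
    })
    info = describe_identity(sample)
    if info["kind"] == "root user":
        raise SystemExit("refusing to run as the root user")
    print(info)
```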

11 · What breaks

"Why doesn't my CLI command find the resource?" Wrong region. Either pass --region explicitly or set AWS_REGION. Otherwise the CLI uses the default region stored in the profile (us-east-1 if you followed the lab above) and never looks anywhere else, which trips up Europe-based teams constantly.
"My IAM role works for me but not for someone else." Either they're missing sts:AssumeRole permission on the source side, or the target role's trust policy doesn't allow their principal, or both. There is no "shared org, automatic trust" — every cross-account call needs both sides configured.
"us-east-1 is down — should I fail over to another region?" Mostly no, unless you've already designed for it. The existing resources in another region keep working, but anything that depends on us-east-1-hosted global services (IAM, Route 53 DNS changes, S3 bucket creation) is broken everywhere until us-east-1 recovers.
Opt-in regions return "UnauthorizedOperation." Several regions, including ap-east-1 (Hong Kong), me-south-1 (Bahrain), eu-south-1 (Milan), af-south-1 (Cape Town), and ap-southeast-3 (Jakarta), are opt-in. The account has to explicitly enable each one in Account → AWS Regions; until then every API call fails with a permission-style error that doesn't name the actual problem. Every region launched after March 2019 is disabled by default in new accounts; only the older commercial regions are enabled from the start.
Service quotas are per account and per region. The default of 5 Elastic IPs applies separately in every region, so an increase granted in us-east-1 doesn't help you in eu-west-2. Service Quotas requests can take hours to days for big jumps. File them before launch.
Root user MFA cannot be skipped. Since 2024 AWS requires MFA on the root user of every management account; the rest of the org's accounts followed in 2025. There's no opt-out. If MFA breaks (lost hardware key, dead phone), recovery is a multi-day support process requiring notarised identity verification.
Account closure isn't instant. Closing an account moves it to a 90-day "post-closure" period where resources still exist (and bill, in some edge cases) but you can't sign in. Real deletion happens after 90 days. Plan migrations accordingly — don't delete the source account the day after copying its data.
"My S3 ARN doesn't have an account ID." Correct — S3 bucket ARNs omit region and account because bucket names are globally unique. This trips up IAM policy authors who try to interpolate ${aws:accountId} into a bucket ARN.
Account suspended. If billing fails, AWS suspends the account — resources keep running for a few days, then start getting terminated. Set up multiple billing contacts, a backup payment method, and a Cost Anomaly Detection alert that emails an address that someone actually checks.

12 · Further reading

AWS Global Infrastructure. The interactive map of regions, AZs, edge POPs, and Local Zones.
Builders' Library — Avoiding overload. The canonical write-up on cellular architecture from AWS Principal Engineers.
Builders' Library — Static stability using AZs. Why the data plane keeps working when the control plane doesn't.
Builders' Library — Workload isolation using shuffle sharding. The combinatorial argument behind cellular AWS services.
Netflix — Active-Active multi-regional resiliency. The architecture that survives a region-scale outage.
Stripe — Operating Kubernetes over a decade. Multi-account organisation strategy in production.
AWS Overview whitepaper. 100-page reference; skim the table of contents to know what exists.
AWS Well-Architected Framework. Six "pillars" (operational excellence, security, reliability, performance, cost, sustainability). Cited in every interview question about cloud design.
Cloud Codex (topic shape). The other half of this material — same services, organised by problem.
Identity & IAM concepts. The conceptual companion to the IAM-advanced page that comes next.

IAM advanced →

STS, AssumeRole, federation, IRSA, SCPs vs permissions boundaries — the IAM machinery the simple "user + policy" model leaves out.
