VPC.
Every AWS workload, with vanishingly few exceptions, lives in a VPC. The model: pick a CIDR, carve it into subnets across AZs, decide which subnets get internet, attach the right route tables, and remember that security groups are stateful but NACLs are not. Most teams set this up once, mis-tag a subnet, and stop thinking about it. The cost mistakes hide in egress and NAT.
1 · What a VPC actually is (and isn't)
The mental model that survives every conversation: a VPC is software-defined networking layered on top of EC2's physical substrate. There is no rack switch you can see. Your 10.0.10.5 instance and the one next door at 10.0.10.6 may live on hypervisors in different rows, or even different buildings. The fact that they appear to share an L2 broadcast domain is a polite fiction maintained by a system AWS engineers call the Mapping Service. (Blackfoot, a name that comes up in the same talks, refers to the edge fleet that carries traffic between a VPC and outside networks, not to the mapping layer itself.)
Every packet your instance emits is encapsulated by the hypervisor (or, on Nitro, by the Nitro card sitting on the PCIe bus) and tunnelled to its destination's physical host. James Hamilton and Werner Vogels have walked through versions of this in re:Invent keynotes — the substrate translates the virtual destination IP into a physical underlay address by querying the mapping service, rewrites the headers, and forwards. The virtual network you see (subnets, route tables, SGs) is a set of policies the Mapping Service enforces by deciding which encapsulated packets it'll deliver and which it'll drop. There are no virtual cables.
What a VPC isn't: a real network in the way an on-prem network is real. There is no STP, no broadcast (ARP is intercepted and answered by the substrate), no multicast (you have to opt in to Transit Gateway Multicast). The MTU goes up to 9001 bytes inside the VPC and drops to 1500 over an IGW. You can't run a custom switch protocol. You can't span an L2 across regions. Treating a VPC like a colo network (assuming ARP, multicast, or promiscuous mode work) is the most common source of "why doesn't this clustered software work on EC2?" puzzlement.
| VPC is good for | VPC is bad for |
|---|---|
| Isolating workloads per environment / team / customer | L2 protocols (STP, GLBP, VRRP — none of them work) |
| Hub-and-spoke topology via Transit Gateway, with one VPN/DX into the hub | Multicast or broadcast applications (use TGW multicast or rewrite) |
| Exposing a single service across accounts via PrivateLink without full peering | "Just lift-and-shift the colo network as-is" — addresses, MTU, and STP assumptions don't survive |
| Private connections to AWS services via Gateway/Interface endpoints (no NAT, no internet) | Per-instance public IPs at scale (use NAT or a NAT-equivalent — public IPv4 now bills hourly per address) |
| Per-ENI security groups referenced by ID (target groups change membership without rewriting rules) | Cross-VPC mesh at large scale (peering is non-transitive; TGW costs scale linearly) |
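The 9001 and 1500 MTU figures can be checked from inside an instance by pinging with the don't-fragment bit set. The largest payload that fits is the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header. A minimal sketch: the function name is illustrative, the `-M do` flag is the Linux iputils form, and the commands are printed rather than executed.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header, no options) minus 8 bytes (ICMP header).
icmp_payload_for_mtu() {
  echo $(( $1 - 20 - 8 ))
}

for mtu in 9001 1500; do
  size=$(icmp_payload_for_mtu "$mtu")
  # -M do sets don't-fragment on Linux iputils ping; <target> is a placeholder.
  echo "MTU $mtu: ping -M do -s $size <target>"
done
```

If the larger ping works between two instances in the same VPC but fails toward an internet host, the path is clamping at 1500 as described.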
2 · How a packet actually moves — the public sketch
AWS doesn't fully publish the substrate, but several re:Invent talks (Hamilton's classic Tuesday Night Live sessions, Becky Weiss on Nitro, the AWS Hyperplane talks) give enough to draw the shape. A single packet from an internet client to an EC2 instance in a private subnet behind an ALB looks like this:
Three useful consequences fall out. First, every "device" in the diagram is software; IGWs, NATs, route tables, and SGs are policy bits the Mapping Service consults. Moving an EIP to a different instance is a 200-ms control-plane update, not a wiring change. Second, throughput is bounded by hypervisor encap; Nitro's offload card pushes this off the host CPU, which is why current generation instances hit 100+ Gbps. Third, SGs evaluate at the hypervisor, not on the wire, so rule changes apply to new flows right away without touching the instance. Connections that are already tracked can keep flowing until they close or time out, so removing a rule does not necessarily cut an established session.
3 · Subnets, gateways, and the route table model
A VPC is a logically isolated chunk of the AWS network identified by an IPv4 CIDR block (and optionally an IPv6 /56 prefix). The VPC is regional — it spans every AZ in its region but doesn't cross regions. Inside the VPC you carve subnets, each one scoped to a single AZ with a chunk of the VPC's CIDR. A subnet has exactly one route table; a route table can be associated to many subnets.
- Pick a CIDR you won't outgrow. 10.0.0.0/16 gives 65k addresses, plenty for almost everyone. Avoid overlap with other VPCs you'll eventually peer or with corporate IP ranges. 192.168.0.0/16 is common in home networks; reserving it in AWS makes future VPN messy.
- Three subnet tiers per AZ. The canonical layout is public / private / data per AZ × 2-3 AZs. Public subnets host load balancers and NAT gateways. Private subnets host app servers. Data subnets host RDS / ElastiCache and have no internet path.
- What makes a subnet "public" is the route table, not the name. A subnet is public if its route table has 0.0.0.0/0 → IGW. A "private" subnet has 0.0.0.0/0 → NAT (or no default route at all).
- AWS reserves 5 IPs per subnet. A /24 looks like 256 hosts but gives you 251. The reserved ones are .0 (network), .1 (VPC router), .2 (DNS; AmazonProvidedDNS lives here, two off the base), .3 (future), and .255 (broadcast).
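The layout and the reserved-address arithmetic above can be sketched in a few lines of shell. This is an illustration under assumptions: the function names are made up, and the third-octet offsets (0, 10, 20 for public, private, data) are just one convention for carving /24s out of 10.0.0.0/16.

```shell
# Usable addresses in a subnet: 2^(32 - prefix) minus the 5 that AWS reserves.
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

# Print one /24 per tier per AZ inside 10.0.0.0/16.
# Offsets 0 / 10 / 20 are an illustrative convention, not an AWS rule.
plan_subnets() {
  az_index=0
  for az in "$@"; do
    echo "public  $az 10.0.$(( 0 + $az_index )).0/24"
    echo "private $az 10.0.$(( 10 + $az_index )).0/24"
    echo "data    $az 10.0.$(( 20 + $az_index )).0/24"
    az_index=$(( $az_index + 1 ))
  done
}

plan_subnets us-east-1a us-east-1b us-east-1c
echo "usable addresses per /24: $(usable_ips 24)"
```

Keeping each tier in its own block of third octets leaves room to add AZs later without renumbering.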
| Gateway | Direction | Charge |
|---|---|---|
| Internet Gateway (IGW) | Bidirectional public internet. Attach 1 per VPC. | No hourly. Egress at standard data-transfer rates. |
| NAT Gateway | Egress only — private subnet → internet for updates and outbound API calls. | ~$0.045/hour + $0.045/GB. Surprise on the bill. |
| NAT instance | DIY EC2 doing NAT — cheaper for tiny workloads, fragile. | EC2 cost only. |
| Egress-only IGW | IPv6 outbound-only. | Free. Standard data-transfer charges. |
| VPN / Direct Connect | On-prem ↔ VPC. | Hourly + per-GB. DX is the heavy-duty version. |
4 · How NAT actually works (and why it costs money)
A NAT Gateway is a managed, horizontally-scaled implementation of source-NAT. Instances in a private subnet send egress traffic to the NAT's ENI; the NAT rewrites the source IP from 10.0.10.5:54321 to <NAT public IP>:<allocated port>, sends to the internet, and remembers the mapping so it can reverse the rewrite on the response.
Two facts about the pricing decide everything else. First, NAT Gateways bill an hourly idle charge (~$0.045/hr × 720 = $32/mo per AZ, before any traffic). Best practice is one NAT per AZ for HA; that's $96/month for a 3-AZ setup with zero packets through it. Second — and this is the part that surprises everyone — NAT charges a $0.045/GB data-processing fee on top of the standard egress fee. So 1 TB/month of NAT egress costs $45 in data processing plus $90 in regular EC2 internet egress = $135 just for the NAT line item, plus the idle fees.
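The arithmetic above is easy to rerun for your own numbers. A rough sketch, assuming the rates quoted in this section ($0.045/hr, $0.045/GB processing, and the $0.09/GB internet egress implied by "$90 per TB") and a 720-hour month; the function name is illustrative, and real rates vary by region and over time.

```shell
# Monthly NAT Gateway estimate. Args: number of NAT gateways, GB sent to the internet.
nat_monthly_cost() {
  awk -v nats="$1" -v gb="$2" 'BEGIN {
    hourly = 0.045; hours = 720        # idle charge per NAT, hours per month
    processing = 0.045; egress = 0.09  # per-GB fees as quoted in the text
    idle = nats * hourly * hours
    data = gb * (processing + egress)
    printf "idle=%.2f data=%.2f total=%.2f\n", idle, data, idle + data
  }'
}

nat_monthly_cost 3 0      # three idle NATs, no traffic
nat_monthly_cost 3 1000   # same, plus 1 TB (taken as 1000 GB) of egress
```

The unrounded idle figure for three NATs is $97.20; the text's $96 comes from rounding each NAT down to $32 first.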
The trap is same-region AWS traffic going through NAT. aws s3 cp from a private EC2, docker pull from ECR Public, an SDK call to STS: all of those, by default, traverse the NAT and bill the $0.045/GB processing fee, even though the data never actually leaves AWS's network. The fix is VPC endpoints (next section). A single S3 Gateway Endpoint, configured in five minutes, has measurably cut six-figure annual AWS bills for shops that didn't realise S3 was their largest NAT consumer.
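Port exhaustion. A NAT Gateway supports at most 55,000 simultaneous connections to any single destination; past that, new connections fail and are counted as ErrorPortAllocation. Either spread destinations (e.g., S3 VPC endpoint sidesteps this entirely), shard NATs (one per workload), or run multiple NATs and route different prefixes through different ones. CloudWatch ErrorPortAllocation is the metric to alarm on.
5 · VPC endpoints and PrivateLink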
A VPC endpoint connects your VPC to an AWS service privately — without traversing the public internet or a NAT Gateway. Two flavours:
- Gateway endpoints — S3 and DynamoDB only. Added as a target in a route table (the route table gets a managed prefix list as the destination). Free. Anyone setting up a new VPC should create the S3 gateway endpoint immediately; it cuts S3 egress through NAT to zero.
- Interface endpoints (PrivateLink) — most other AWS services. An ENI in your subnet that proxies to the service via AWS's Hyperplane fleet. ~$0.01/hr per endpoint per AZ + per-GB. Worth it when a service is talked to heavily; expensive if you set up endpoints for everything.
- PrivateLink for third parties. A SaaS vendor can expose their service as an endpoint service. You attach an interface endpoint and reach them privately, no public internet at all. The pattern for HIPAA-conscious vendor connections — and how Snowflake, Datadog, MongoDB Atlas, and Confluent Cloud expose their managed offerings to AWS customers.
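Whether an interface endpoint is "worth it" can be estimated as a break-even volume against the NAT processing fee. A hedged sketch: the $0.01/GB endpoint data fee is an assumption (the text above only says "per-GB"), the function name is illustrative, and the model assumes the NAT Gateway stays in place either way, so only the per-GB fee is saved.

```shell
# GB per month at which one interface endpoint (deployed in N AZs)
# costs the same as sending that traffic through NAT.
# Arg: number of AZs the endpoint is deployed in.
breakeven_gb() {
  awk -v azs="$1" 'BEGIN {
    endpoint_hourly = 0.01; hours = 720
    endpoint_per_gb = 0.01   # assumed rate; check current pricing
    nat_per_gb = 0.045       # NAT processing fee from the text
    fixed = azs * endpoint_hourly * hours
    printf "%.0f\n", fixed / (nat_per_gb - endpoint_per_gb)
  }'
}

echo "1 AZ:  break-even at about $(breakeven_gb 1) GB/month"
echo "3 AZs: break-even at about $(breakeven_gb 3) GB/month"
```

Below the break-even volume the endpoint's hourly charge outweighs the NAT fees it avoids, which is why creating endpoints for every service gets expensive.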
6 · Connecting VPCs — peering vs TGW vs PrivateLink
Three patterns dominate, and they're not substitutes for each other — they shine in different regimes.
| | VPC Peering | Transit Gateway | PrivateLink |
|---|---|---|---|
| Topology | 1-to-1, non-transitive (A↔B and A↔C does not imply B↔C) | Hub-and-spoke; every VPC attaches to one TGW | One-way exposure of a single service |
| Scaling | n × (n-1) / 2 peerings — quadratic mesh | n attachments — linear | Per-consumer endpoint, no full network reach |
| Throughput | Limited only by source/dest ENI (line-rate within region) | ~50 Gbps per attachment, ~100 Gbps aggregate | Service-specific; Hyperplane scales horizontally |
| Hourly cost | $0 (peering itself); only data transfer | $0.05/hr per attachment + $0.02/GB through hub | $0.01/hr per endpoint per AZ + per-GB |
| Cross-region | Yes, inter-region peering supported | TGW peering across regions (extra hop) | Cross-region endpoints supported per service |
| Use when | 2–3 VPCs, simple topology, cost-sensitive | 10+ VPCs, multi-account, hybrid (VPN/DX) connectivity, central inspection | Cross-account or cross-org single-service access without giving network reach |
| Watch out for | CIDR overlap; route table sprawl; non-transitive lookups | $0.02/GB hub fee; routes propagated vs static; quotas (5000 attachments) | DNS resolution flow (private DNS or use the regional name) |
A common shape at scale: TGW as the regional hub, every prod/staging/data VPC attached, on-prem reaches the TGW via one DX/VPN connection (instead of N connections to N VPCs), and PrivateLink layered on top for cross-org exposure of specific services. For 10 VPCs that's roughly $360/month in attachment fees before traffic — but pure peering at that scale ceases to be maintainable: 10 VPCs need 45 peerings, each with its own route-table entries.
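The quadratic-versus-linear comparison is quick to tabulate. A minimal sketch using the figures from this section ($0.05/hr per attachment, 720 hours per month, so $36 per attachment); the function names are illustrative and traffic charges are left out.

```shell
# Peering connections needed for a full mesh of n VPCs: n * (n - 1) / 2.
peerings_for_full_mesh() {
  echo $(( $1 * ($1 - 1) / 2 ))
}

# Monthly TGW attachment fees for n VPCs at 0.05 USD/hr * 720 hr = 36 USD each.
tgw_attachment_cost() {
  echo $(( $1 * 36 ))
}

for n in 3 10 20; do
  echo "$n VPCs: $(peerings_for_full_mesh "$n") peerings vs $n attachments, about $(tgw_attachment_cost "$n") USD/month"
done
```

At 20 VPCs the mesh needs 190 peerings, each with route-table entries on both sides, against 20 attachments for the hub.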
CIDR overlap bites when an acquired company's or partner's 10.0.0.0/16 meets your existing 10.0.0.0/16. Plan CIDRs centrally before any growth event; the alternative is re-IP'ing a fleet, which no team has ever enjoyed.
7 · Security groups vs NACLs: evaluation order
Both are firewall-style rule sets, but they sit at different layers and evaluate differently. SGs are stateful per-ENI rules; NACLs are stateless per-subnet rules. The packet path:
| | Security group | Network ACL |
|---|---|---|
| Scope | Attached to ENIs (instances, LBs, RDS, etc.) | Attached to subnets |
| State | Stateful (return traffic auto-allowed) | Stateless (must allow both directions) |
| Rules | Allow only | Allow and deny |
| Reference targets | Other security groups (by ID) — the killer feature | CIDRs only |
| Order | All rules evaluated; any allow wins | Numbered; lowest-number wins; first match decides |
| When to reach | Default. Use them. | Rarely. Blanket-deny of an IP range or a "no egress at all" subnet. |
The pattern for production: tier security groups by role. web-sg allows 443 from anywhere. app-sg allows 8080 from web-sg (referencing the SG ID, not a CIDR). db-sg allows 5432 from app-sg. This is more maintainable than IP-based rules because as instances come and go, the membership of each SG changes automatically.
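The NACL rule that "lowest number wins, first match decides" is easier to trust once you have stepped through it. Below is a toy evaluator that matches on destination port only; the rule format and function name are invented for illustration, and a real NACL also matches protocol and CIDR.

```shell
# Toy NACL evaluation: rules are checked in ascending rule-number order,
# the first rule whose port range matches decides, and no match means deny.
# Rule format (illustrative): "<rule_no> <allow|deny> <from_port> <to_port>"
nacl_decide() {
  port=$1
  rules=$2
  printf '%s\n' "$rules" | sort -n | awk -v port="$port" '
    port >= $3 && port <= $4 { print $2; found = 1; exit }
    END { if (!found) print "deny" }
  '
}

# Inbound rules on a subnet whose instances make outbound calls:
# replies come back on ephemeral ports.
INBOUND='100 allow 443 443
200 deny 32768 60999
300 allow 1024 65535'

nacl_decide 443 "$INBOUND"     # allow
nacl_decide 45000 "$INBOUND"   # deny: rule 200 matches before rule 300
nacl_decide 22 "$INBOUND"      # deny: no rule matches
```

A security group would treat the same three rules very differently: with allow-only rules and no ordering, any matching allow is enough, and reply traffic is admitted without a rule at all.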
8 · Real-world case studies
Three public stories give a sense of how VPC design plays out at scale.
Stripe: network isolation between merchants. Stripe processes payments for millions of businesses, and its compliance story (PCI DSS Level 1, plus per-region data-residency rules) leans heavily on VPC-style isolation. The infrastructure-engineering posts on stripe.com/blog/engineering describe how they decompose workloads into many small services each running in its own network boundary, with PrivateLink and explicit allowlists between them, so there is no flat network where one compromised service can talk to all the others. The pattern they describe, "isolation by default, connectivity by exception," is the inversion of how most teams start (open everything, restrict later) and is the strongest argument for taking role-based security groups seriously from day one.
Pinterest — Transit Gateway and multi-region. Pinterest has written about their infrastructure architecture several times, including the migration from a flat-peering model to a TGW hub-and-spoke topology as their VPC count climbed past a few dozen. The key lesson is the breakpoint: at small scale, peering is cheaper and simpler; somewhere around 10–15 VPCs the quadratic-mesh maintenance cost (and route-table size, and cross-account complexity) crosses the linear TGW cost. The transition is mostly mechanical — attach VPCs to TGW one at a time, update route tables to point default-route at TGW, decommission peerings.
AWS PrivateLink at financial institutions. Goldman Sachs, Capital One, and several other banks have documented their PrivateLink-heavy architectures in AWS case studies (the PrivateLink customer page aggregates many). The shape is identical across them: customer-facing VPCs in one account, internal services exposed via PrivateLink in other accounts, no NAT or IGW in the internal accounts at all. The combination of "no public network reach" plus "explicit service-by-service exposure" satisfies most financial-services audit regimes far more cleanly than IP-allowlist firewalls — there's literally no path for traffic to escape, because no route exists.
The through-line: VPC design at scale is less about how to connect things and more about how to not connect them. Most production incidents stem from connectivity that no one explicitly chose; isolation by default is the property worth optimising for.
9 · Build it yourself — VPC with public + private subnets
- Create the VPC.

    VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
      --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=lab}]' \
      --query Vpc.VpcId --output text)
    aws ec2 modify-vpc-attribute --vpc-id $VPC_ID --enable-dns-hostnames

- Two subnets, one public + one private, in us-east-1a.

    PUB_SUBNET=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.0.0/24 \
      --availability-zone us-east-1a \
      --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=public-a}]' \
      --query Subnet.SubnetId --output text)
    PRI_SUBNET=$(aws ec2 create-subnet --vpc-id $VPC_ID --cidr-block 10.0.10.0/24 \
      --availability-zone us-east-1a \
      --tag-specifications 'ResourceType=subnet,Tags=[{Key=Name,Value=private-a}]' \
      --query Subnet.SubnetId --output text)

- Attach an Internet Gateway, make the public subnet routable.

    IGW=$(aws ec2 create-internet-gateway --query InternetGateway.InternetGatewayId --output text)
    aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW
    PUB_RT=$(aws ec2 create-route-table --vpc-id $VPC_ID --query RouteTable.RouteTableId --output text)
    aws ec2 create-route --route-table-id $PUB_RT --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW
    aws ec2 associate-route-table --route-table-id $PUB_RT --subnet-id $PUB_SUBNET

- Add a free S3 gateway endpoint (the reflex move).

    aws ec2 create-vpc-endpoint --vpc-id $VPC_ID \
      --service-name com.amazonaws.us-east-1.s3 \
      --route-table-ids $PUB_RT

- Create security groups: web tier, app tier.

    WEB_SG=$(aws ec2 create-security-group --group-name web --description "web tier" \
      --vpc-id $VPC_ID --query GroupId --output text)
    aws ec2 authorize-security-group-ingress --group-id $WEB_SG --protocol tcp --port 443 --cidr 0.0.0.0/0
    APP_SG=$(aws ec2 create-security-group --group-name app --description "app tier" \
      --vpc-id $VPC_ID --query GroupId --output text)
    aws ec2 authorize-security-group-ingress --group-id $APP_SG --protocol tcp --port 8080 --source-group $WEB_SG

- Inspect the route table the private subnet uses (no default route, so the instance has no internet path). The private subnet has no explicit association, so it falls back to the VPC's main route table; filtering on association.subnet-id would return nothing.

    aws ec2 describe-route-tables \
      --filters Name=vpc-id,Values=$VPC_ID Name=association.main,Values=true

- Tear it down.

    aws ec2 delete-vpc-endpoints --vpc-endpoint-ids $(aws ec2 describe-vpc-endpoints \
      --filters Name=vpc-id,Values=$VPC_ID \
      --query 'VpcEndpoints[].VpcEndpointId' --output text)
    # Delete app before web: app's ingress rule references the web group.
    aws ec2 delete-security-group --group-id $APP_SG
    aws ec2 delete-security-group --group-id $WEB_SG
    aws ec2 disassociate-route-table --association-id $(aws ec2 describe-route-tables \
      --route-table-ids $PUB_RT \
      --query 'RouteTables[0].Associations[0].RouteTableAssociationId' --output text)
    aws ec2 delete-route-table --route-table-id $PUB_RT
    aws ec2 detach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW
    aws ec2 delete-internet-gateway --internet-gateway-id $IGW
    aws ec2 delete-subnet --subnet-id $PUB_SUBNET
    aws ec2 delete-subnet --subnet-id $PRI_SUBNET
    aws ec2 delete-vpc --vpc-id $VPC_ID
To give the private subnet outbound internet access, add a NAT Gateway: aws ec2 allocate-address, then aws ec2 create-nat-gateway, and don't forget to delete both at the end.
10 · What breaks
- "My EC2 in a private subnet can't reach the internet." Either the route table doesn't point at NAT, the NAT is in a different AZ (NATs are zonal), the SG denies egress (the default-allow is usually fine), or no NAT exists at all.
- Overlapping CIDRs after peering / TGW. Peering between VPCs with overlapping CIDRs is rejected outright. A TGW attachment can be created, but you can't usefully route the overlapping range: packets to that destination match the local route inside the same VPC. Surfaces only when an acquisition / merger / corporate-network onboarding happens. Pick CIDRs in a central registry before any growth event.
- "VPC peering lets A reach B but not C." Peering is non-transitive. A↔B and B↔C don't make A↔C work. Add a direct A↔C peering, or move to TGW once you need transitive routing.
- NAT Gateway bill exploded. Almost always one of: (a) S3 traffic going through NAT instead of via a gateway endpoint, (b) container image pulls from public registries instead of ECR with an endpoint, (c) per-AZ NAT (correct for HA) costing more than you noticed, (d) traffic crossing AZs to reach a NAT in a different AZ, which adds inter-AZ transfer charges on top of the NAT fees.
- The 55,000-connection NAT ceiling. High-RPS workloads hammering a single destination (one external API IP, one S3 endpoint IP) trip ErrorPortAllocation in CloudWatch. Spread destinations, shard NATs, or move the traffic onto a VPC endpoint where this limit doesn't apply.
- TGW route propagation surprises. Routes can be either propagated (auto-injected from attached VPC CIDRs) or static. Mixing them per attachment leads to "this route is here, why isn't traffic flowing?" Check the TGW route table's association vs propagation separately. They're orthogonal concepts and the UI doesn't make that obvious.
- NACL surprises. Because NACLs are stateless, response packets need an explicit egress rule. The classic case: someone tightens the NACL to deny ephemeral ports (32768–60999), and now every outbound connection times out on the reply. Either use SGs as your primary tool and keep NACLs default-allow, or allow ephemeral ports explicitly.
- CIDR overlap with corporate network. Surfaces only when you set up VPN/DX. Pick CIDRs in advance; 10.0.0.0/16 if greenfield, 10.16.0.0/16 onwards if your corp is already using 10.0/10.1.
- ENI exhaustion. An instance type has a max number of network interfaces. Lambda-in-VPC and EKS with the VPC CNI (without prefix delegation) both run into ENI limits at scale. Either choose a bigger instance or enable prefix delegation.
- Public IPv4 now bills hourly. Since Feb 2024 every public IPv4 address (EIP or auto-assigned) bills $0.005/hr — including IPs attached to running instances. A fleet of 100 instances with public IPs is ~$360/month just for the addresses. Audit and remove EIPs you don't need.
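For the 55,000-connection ceiling in the list above, a back-of-envelope check is often enough to tell whether a workload is at risk. The sketch below applies Little's law (connections in flight ≈ new connections per second × seconds each one holds a port). It is a rough model under stated assumptions, not how AWS accounts for ports, and the function name is illustrative.

```shell
# Rough headroom check against the per-destination NAT connection limit.
# Args: new connections per second to ONE destination,
#       average seconds each connection holds its port.
nat_port_headroom() {
  rate=$1
  hold=$2
  limit=55000                    # per-destination ceiling quoted in the text
  in_use=$(( $rate * $hold ))    # Little's law estimate of concurrent connections
  if [ "$in_use" -gt "$limit" ]; then
    echo "over: $in_use > $limit"
  else
    echo "ok: $in_use of $limit"
  fi
}

nat_port_headroom 500 60    # ok: 30000 of 55000
nat_port_headroom 2000 30   # over: 60000 > 55000
```

Connection reuse (keep-alive, pooling) lowers the new-connection rate, which is usually the cheapest way back under the limit before sharding NATs.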
11 · Further reading
- VPC user guide. Definitive, but read selectively.
- AWS networking blog. Best source for "how does this actually behave under load" answers.
- Becky Weiss — Nitro / Hyperplane re:Invent sessions. The closest AWS has come to publicly documenting the underlay and mapping service.
- Stripe engineering blog. Infrastructure posts that describe isolation-by-default networking at scale.
- PrivateLink customer case studies. Financial services and healthcare deployments; the public reference architectures.
- Cloud networking (concepts). The conceptual companion — VPC alongside GCP and Azure equivalents.
- The Networking Stack. If the IP / subnet / route table model isn't second nature, start here.