
A private network, in the cloud.

CIDR allocation. Route tables. Gateways. Stateful firewalls vs stateless ACLs. Peering, Transit Gateway, PrivateLink. The full mental model — without the AWS marketing.

Prerequisites: CIDR notation · L3 routing

What is a VPC?

Five layers, all glued by routing.

A VPC (Virtual Private Cloud) is a logically isolated network in a public cloud. AWS VPC launched in 2009; GCP VPC and Azure VNet followed. Its five layers (VPC CIDR, subnets, route tables, gateways, and security groups/ACLs) are all tied together by routing. CIDR choices are essentially permanent, so plan them carefully.

A VPC is a software-defined network you carve out of a cloud provider's substrate. From the outside it behaves like a private datacenter; from the inside it is a series of overlay constructs (VPC, subnets, route tables, gateways, firewalls) that together produce the illusion of a network.


Subnets carve a VPC by AZ.

A VPC is a private IPv4 (and optionally IPv6) range: say 10.0.0.0/16, which holds 65,536 addresses. You divide it into subnets, each pinned to one Availability Zone, each with a smaller CIDR (10.0.1.0/24 = 256 addresses). Public subnets have a route to the Internet Gateway; private subnets do not.


Plan your CIDR carefully, because you cannot resize it

The primary CIDR is permanent, so size it once and well.

The single most common VPC mistake is sizing the primary CIDR too small or overlapping it with on-prem ranges. AWS lets you attach secondary CIDRs (by default, five IPv4 blocks per VPC including the primary), but they cannot overlap with each other or with the primary. Pick a /16 you have not used elsewhere; carve a /20 per AZ; carve /24s inside that for app, db, and shared subnets.

Reserve five addresses in every subnet — AWS uses the network address, the VPC router (.1), DNS (.2), a future-use slot (.3), and the broadcast address. So a /24 has 251 usable IPs, not 256.
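The reserved-address arithmetic above can be checked with Python's standard `ipaddress` module. This is a minimal sketch; the helper names `usable_ips` and `reserved_addresses` are made up for illustration and are not part of any AWS tooling.

```python
import ipaddress

# AWS reserves five addresses per subnet: network, router (+1), DNS (+2),
# future use (+3), and broadcast (last address).
AWS_RESERVED = 5


def usable_ips(cidr: str) -> int:
    """Return the number of assignable IPv4 addresses in an AWS subnet."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - AWS_RESERVED


def reserved_addresses(cidr: str) -> dict:
    """Name the five reserved addresses for a given subnet CIDR."""
    net = ipaddress.ip_network(cidr)
    base = net.network_address
    return {
        "network": str(base),
        "router": str(base + 1),
        "dns": str(base + 2),
        "future": str(base + 3),
        "broadcast": str(net.broadcast_address),
    }


print(usable_ips("10.0.1.0/24"))   # 251
print(usable_ips("10.0.16.0/20"))  # 4091
print(reserved_addresses("10.0.1.0/24"))
```

The same subtraction explains why very small subnets are poor value: the five reserved addresses are a fixed cost regardless of prefix length.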

# A typical 3-AZ VPC layout
VPC           10.0.0.0/16    65,536 addresses
AZ-a public   10.0.0.0/24    251     + IGW route
AZ-a private  10.0.16.0/20   4,091   + NAT route
AZ-b public   10.0.1.0/24    251     + IGW route
AZ-b private  10.0.32.0/20   4,091   + NAT route
AZ-c public   10.0.2.0/24    251     + IGW route
AZ-c private  10.0.48.0/20   4,091   + NAT route
DB AZ-a/b/c   10.0.64.0/22   ...     no internet route at all
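A plan like this is worth validating before anything is created, since overlaps are painful to fix later. The sketch below uses the standard `ipaddress` module to confirm that every range in the layout sits inside the VPC and that no two ranges overlap. The dictionary keys and the `check_plan` helper are illustrative names, not anything AWS defines.

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")

# Ranges copied from the 3-AZ layout; the key names are illustrative.
PLAN = {
    "az-a-public": "10.0.0.0/24",
    "az-a-private": "10.0.16.0/20",
    "az-b-public": "10.0.1.0/24",
    "az-b-private": "10.0.32.0/20",
    "az-c-public": "10.0.2.0/24",
    "az-c-private": "10.0.48.0/20",
    "db": "10.0.64.0/22",
}


def check_plan(vpc, plan):
    """Return a list of problems: ranges outside the VPC or overlapping."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} {net} is outside {vpc}")
    for (a, na), (b, nb) in combinations(nets.items(), 2):
        if na.overlaps(nb):
            problems.append(f"{a} {na} overlaps {b} {nb}")
    return problems


print(check_plan(VPC, PLAN))  # []
```

The same check works for on-prem ranges: add them to the plan as extra entries and any clash shows up as an overlap.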

Route tables, where the most-specific match always wins

Most-specific match always wins.

Every packet leaving an ENI consults the route table attached to its subnet. Routes are evaluated longest-prefix-first: a /32 host route beats a /24 subnet route beats a 0.0.0.0/0 default. The local route covering the entire VPC CIDR is implicit and cannot be removed.
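Longest-prefix matching is easy to sketch in a few lines. The route table below is hypothetical: the `tgw-example`, `pcx-example`, and `eni-example` targets and the 192.168.x.x destinations are invented purely to show a /32 beating a /24 beating a /16 beating the default route.

```python
import ipaddress

# Hypothetical route table: (destination CIDR, target).
ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "igw-xyz"),
    ("192.168.0.0/16", "tgw-example"),
    ("192.168.5.0/24", "pcx-example"),
    ("192.168.5.7/32", "eni-example"),
]


def lookup(routes, dst):
    """Return the target of the most specific route containing dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no matching route: the packet has nowhere to go
    # The longest prefix (largest prefixlen) wins, regardless of list order.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(lookup(ROUTES, "192.168.5.7"))  # eni-example
print(lookup(ROUTES, "192.168.5.9"))  # pcx-example
print(lookup(ROUTES, "8.8.8.8"))      # igw-xyz
```

Note that the order of entries in the list does not matter; only prefix length decides.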

"Public" vs "private" is purely a route table distinction. Same underlying substrate, same hardware. A subnet whose default route points at an Internet Gateway is public; one whose default route points at a NAT GW, or that has no default route at all, is private. There is no firewall difference at the subnet level; that is what NACLs and SGs are for.

Public subnet route table

Two routes.

10.0.0.0/16 → local handles intra-VPC traffic. 0.0.0.0/0 → igw-xyz handles everything else. Instances need a public IP (or Elastic IP) to send or receive internet traffic through the IGW.

Private subnet route table

Two routes, different target.

10.0.0.0/16 → local is the same as before. 0.0.0.0/0 → nat-xyz sends egress to the NAT Gateway in a public subnet, which SNATs it to its own EIP.
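Because the public/private label is derived entirely from the default route, it can be computed from a route table. A minimal sketch, assuming routes are given as (destination, target) pairs and that Internet Gateway IDs start with `igw-` as in the examples here; `classify_subnet` is a made-up helper name.

```python
def classify_subnet(routes):
    """Classify a subnet as 'public' or 'private' from its route table.

    routes is a list of (destination CIDR, target) pairs. A subnet is
    public only if its default route targets an Internet Gateway.
    """
    default = next((t for cidr, t in routes if cidr == "0.0.0.0/0"), None)
    if default is not None and default.startswith("igw-"):
        return "public"
    return "private"


public_rt = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-xyz")]
private_rt = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-xyz")]
isolated_rt = [("10.0.0.0/16", "local")]

print(classify_subnet(public_rt))    # public
print(classify_subnet(private_rt))   # private
print(classify_subnet(isolated_rt))  # private
```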


Internet Gateway vs NAT Gateway: only one actually does NAT

One does NAT, the other does not.

An Internet Gateway is a logical construct, not a many-to-one NAT device. It performs only 1:1 translation between an instance's public IP (or EIP) and its private IP. If the instance has no public IP, the IGW has nothing to translate and the packet is dropped.

A NAT Gateway is a managed many-to-one source NAT. It SNATs all egress from many private instances to its own EIP, tracks each connection, and reverses the translation on return traffic. Connections initiated from outside cannot reach the instances behind it (much like a home router doing NAT). AWS bills NAT GW per hour AND per GB processed, which makes it one of the easiest unintended cost centers in a VPC.
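The many-to-one behaviour can be illustrated with a toy connection table. This is a teaching sketch only, not how AWS implements NAT Gateway: the `NatGateway` class, the port allocation starting at 1024, and the addresses are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NatGateway:
    """Toy many-to-one source NAT with a connection table."""
    eip: str
    next_port: int = 1024
    # (public port, remote ip, remote port) -> (private ip, private port)
    table: dict = field(default_factory=dict)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source to the EIP and remember the mapping."""
        public_port = self.next_port
        self.next_port += 1
        self.table[(public_port, dst_ip, dst_port)] = (src_ip, src_port)
        return (self.eip, public_port, dst_ip, dst_port)

    def inbound(self, remote_ip, remote_port, public_port):
        """Reverse the translation for a reply; None means dropped."""
        return self.table.get((public_port, remote_ip, remote_port))


nat = NatGateway(eip="203.0.113.10")
a = nat.outbound("10.0.16.5", 40000, "198.51.100.7", 443)
b = nat.outbound("10.0.32.9", 40000, "198.51.100.7", 443)
print(a[0] == b[0])                           # True: both leave as the EIP
print(nat.inbound("198.51.100.7", 443, a[1]))  # ('10.0.16.5', 40000)
print(nat.inbound("192.0.2.1", 22, 5555))      # None: unsolicited, dropped
```

The last call shows why nothing can initiate a connection inward: with no table entry, there is no private address to translate to.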

igw

Internet Gateway.

No NAT. Bidirectional. Free. Requires public IPs on instances. One per VPC.

nat-gw

NAT Gateway.

SNAT only. Outbound-initiated. Per-AZ for HA. Charged per-hour and per-GB.

vpce

VPC endpoint.

PrivateLink for interface endpoints; route table targets for gateway endpoints (S3, DynamoDB). Reach AWS services without IGW or NAT. Cheaper at scale; private DNS optional.


Security groups are stateful, allow-only firewalls on each ENI

Stateful, allow-list, attached to ENIs.

A security group is a stateful firewall pinned to one or more elastic network interfaces. You can specify only allow rules — there is no deny — and the source can be a CIDR or another security group. The latter matters: "allow port 5432 from sg-app" lets you wire database access without managing IP lists. The same model shows up in Kubernetes NetworkPolicy.

Stateful means: if you allow inbound, the return path is automatic; if you allow outbound, the inbound reply is automatic. You almost never need to think about return traffic with SGs. Default outbound is "all allowed" — most teams leave it alone, but tightening it is a real defense-in-depth move.

# Web -> App -> DB three-tier, by SG reference, not CIDR
sg-web in: 80, 443 from 0.0.0.0/0
sg-web out: 8080 to sg-app

sg-app in: 8080 from sg-web
sg-app out: 5432 to sg-db

sg-db in: 5432 from sg-app
sg-db out: (empty - no egress allowed)
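The three-tier rules above can be modelled as two small lookup tables to see which tiers can actually open connections to each other. This sketch is deliberately simplified: it only handles rules that reference another security group (the 0.0.0.0/0 rules are listed but not evaluated), and `can_connect` is a made-up helper.

```python
# Rules copied from the three-tier example: (port, peer).
INBOUND = {
    "sg-web": [(80, "0.0.0.0/0"), (443, "0.0.0.0/0")],
    "sg-app": [(8080, "sg-web")],
    "sg-db": [(5432, "sg-app")],
}
OUTBOUND = {
    "sg-web": [(8080, "sg-app")],
    "sg-app": [(5432, "sg-db")],
    "sg-db": [],  # no egress allowed
}


def can_connect(src_sg, dst_sg, port):
    """True if src may open a NEW connection to dst on port.

    Needs an outbound allow on the source and an inbound allow on the
    destination. Replies need no rule of their own because SGs are stateful.
    """
    out_ok = (port, dst_sg) in OUTBOUND.get(src_sg, [])
    in_ok = (port, src_sg) in INBOUND.get(dst_sg, [])
    return out_ok and in_ok


print(can_connect("sg-web", "sg-app", 8080))  # True
print(can_connect("sg-app", "sg-db", 5432))   # True
print(can_connect("sg-web", "sg-db", 5432))   # False: web cannot skip a tier
print(can_connect("sg-db", "sg-app", 8080))   # False: db cannot initiate
```

Note that sg-db can still answer queries from sg-app despite its empty egress list, because replies on an established connection are tracked rather than matched against rules.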

Network ACLs are stateless, ordered filters at the subnet edge

Stateless, ordered, subnet-wide.

A Network ACL is a stateless filter applied at the subnet boundary. Rules are numbered and evaluated lowest number first, and the first match wins; both allow and deny rules are valid; traffic that matches no rule hits a final catch-all deny. Stateless means you must explicitly allow return traffic in the opposite direction, including the ephemeral port range (1024–65535) that clients pick their source ports from.

Most teams leave NACLs at the default-allow-all and rely entirely on security groups. Reach for NACLs when you need to block something at a coarser granularity than per-instance — say, blackholing a CIDR across an entire subnet without touching every SG.
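The ordered, stateless evaluation described in this section can be sketched as follows. The model is simplified to port ranges only (real NACL rules also match protocol and CIDR), and the `Rule`, `evaluate`, and `round_trip_ok` names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int
    action: str      # "allow" or "deny"
    port_from: int
    port_to: int


def evaluate(rules, port):
    """Return the action of the lowest-numbered rule matching port.

    Falls through to "deny" when nothing matches, mirroring the final
    catch-all deny.
    """
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"


def round_trip_ok(outbound, inbound, dst_port, ephemeral_port):
    """A request needs outbound dst_port AND inbound ephemeral_port."""
    return (evaluate(outbound, dst_port) == "allow"
            and evaluate(inbound, ephemeral_port) == "allow")


out_rules = [Rule(100, "allow", 443, 443)]
print(round_trip_ok(out_rules, [], 443, 50123))  # False: reply is blocked
in_rules = [Rule(100, "allow", 1024, 65535)]
print(round_trip_ok(out_rules, in_rules, 443, 50123))  # True
```

The first call is the classic hand-tightened NACL failure: the request leaves, the reply is dropped, and the client sees a hang.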


VPC peering vs Transit Gateway: many links or one hub

N×N peerings, or one hub.

VPC peering is a direct, non-transitive link between two VPCs. CIDRs must not overlap. Each peering needs routes added to the route tables on both sides. With four VPCs you need six peerings; with ten you need 45. It does not scale; it is meant for occasional pairings.
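The peering counts come from the number of distinct pairs in a full mesh, n(n-1)/2. A short sketch makes the growth visible; the function names are illustrative, and the hub figure simply assumes one attachment per VPC.

```python
def peerings_needed(vpcs: int) -> int:
    """Full mesh: one peering per distinct pair of VPCs."""
    return vpcs * (vpcs - 1) // 2


def hub_attachments_needed(vpcs: int) -> int:
    """Hub and spoke: each VPC attaches to the hub once."""
    return vpcs


for n in (4, 10, 50):
    print(n, peerings_needed(n), hub_attachments_needed(n))
# 4 6 4
# 10 45 10
# 50 1225 50
```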

Transit Gateway replaces N×N peering with a single hub. Each VPC attaches once. Route tables on the TGW decide which spokes can talk to which. It also terminates Direct Connect (which exchanges routes over BGP), Site-to-Site VPN, and SD-WAN appliances. For multi-account, multi-region, multi-cloud topologies it is essentially mandatory.

PrivateLink is different in purpose: it exposes a single service from a producer VPC to many consumer VPCs without ever sharing CIDRs or routing. The consumers see an interface endpoint with a private IP in their own subnet; the producer sees a network load balancer. The two VPCs cannot otherwise talk.


VPC observability: you cannot debug what you cannot see

You cannot debug what you cannot see.

Enable VPC Flow Logs from day one. They record accepted and rejected traffic flows at the ENI, subnet, or VPC level, with the five-tuple, action, and packet and byte counts. Send them to S3 (cheap, archival) or CloudWatch Logs (queryable). The first time something mysteriously cannot reach something else, flow logs will tell you whether an SG or NACL rejected the traffic or whether the problem lies elsewhere, such as routing.
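Flow log records in the default (version 2) format are space-separated, which makes them easy to parse. The sketch below assumes that default field order; the Python key names use underscores where AWS uses hyphens, and the two sample records, including the account ID and ENI ID, are invented for illustration.

```python
# Field order of the default (version 2) flow log format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

# Made-up sample records for illustration only.
SAMPLE = [
    "2 123456789012 eni-0example1 10.0.16.5 198.51.100.7 40000 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0example1 192.0.2.1 10.0.16.5 51000 22 6 3 180 1700000000 1700000060 REJECT OK",
]


def parse_flow_log(line):
    """Split one default-format record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected(lines):
    """Return only the records whose action is REJECT."""
    return [r for r in map(parse_flow_log, lines) if r["action"] == "REJECT"]


for rec in rejected(SAMPLE):
    print(rec["srcaddr"], "->", rec["dstaddr"], "port", rec["dstport"])
# 192.0.2.1 -> 10.0.16.5 port 22
```

Custom log formats can add or reorder fields, so a parser like this only applies when the default format is in use.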

VPC DNS is enabled by default and served by the Route 53 Resolver at the VPC's base address plus two (10.0.0.2 in a 10.0.0.0/16 VPC). It resolves public DNS names and any private hosted zones associated with the VPC. Outbound resolver endpoints let you forward queries to on-prem; inbound endpoints let on-prem resolve VPC zones. This is the backbone of hybrid DNS.


Works from the bastion, times out from Lambda

The four-step walk that finds it every time.

A real shape of incident. You SSH to the bastion and curl a third-party API: 200 OK, instant. The same call from a Lambda attached to the VPC hangs for thirty seconds and times out. Not refused, not a 4xx — a hang. A hang means packets are leaving and nothing is coming back, or they are never leaving at all. That distinction is the whole investigation.

First suspect: the Lambda's security group. Its egress rule is the default allow-all, so outbound 443 is fine. And SGs are stateful — if the request gets out, the reply is implicitly allowed back in. No return rule to forget. Two minutes, and SGs are cleared.

Second: the subnet's NACL. This is where stateless matters. Even with outbound 443 allowed, the API's reply arrives addressed to whatever ephemeral source port the client picked, so the NACL needs an explicit inbound allow on ports 1024–65535. Teams that hand-tighten NACLs and forget that return rule produce exactly this symptom. Here, though, the subnet runs the default NACL — allow-all both directions. Cleared.

Third: the route table the Lambda's subnet actually uses — and there it is. The Lambda's ENIs landed in a subnet someone created for internal tooling, and its route table has only the local route. No 0.0.0.0/0 at all. The bastion never had this problem because it sits in the public subnet, the one part of the VPC that is deliberately configured differently, with a default route to the IGW. So "works from the bastion" proved nothing about the Lambda's subnet.

The fix is not pointing the subnet at the IGW, either. Lambda ENIs never get public IPs, and an IGW does 1:1 translation only for instances that have one — for a private ENI it drops the packet. The subnet needs 0.0.0.0/0 → nat-gw, and the punchline is that this VPC never had a NAT gateway built. The outage was an architecture gap, not a typo.

# The order to check, every time
1 Route table   which subnet holds the ENI; where does 0.0.0.0/0 point?
2 Security grp  outbound rule for the port (stateful: return is implicit)
3 NACL          outbound port AND inbound ephemeral 1024-65535 (stateless)
4 Flow logs     REJECT row = SG/NACL; no row at all = routing
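The checklist can be written down as a function that returns the first failing step. This is a sketch of the walk, not a real diagnostic tool: the inputs are facts you would gather by hand or from the AWS APIs, and every name here is hypothetical.

```python
def diagnose(default_route, has_public_ip, sg_egress_ok,
             nacl_egress_ok, nacl_ephemeral_in_ok):
    """Walk the checklist in order and return the first failing step.

    default_route is the target of 0.0.0.0/0 in the ENI's subnet route
    table, or None when there is no default route.
    """
    # 1. Route table: where does 0.0.0.0/0 point?
    if default_route is None:
        return "route: no 0.0.0.0/0 route in the subnet's route table"
    if default_route.startswith("igw-") and not has_public_ip:
        return "route: default route is an IGW but the ENI has no public IP"
    # 2. Security group: outbound rule (return traffic is implicit).
    if not sg_egress_ok:
        return "sg: outbound port is not allowed"
    # 3. NACL: outbound port and inbound ephemeral range.
    if not nacl_egress_ok:
        return "nacl: outbound port is not allowed"
    if not nacl_ephemeral_in_ok:
        return "nacl: inbound ephemeral ports 1024-65535 are not allowed"
    # 4. Nothing obvious in configuration: read the flow logs.
    return "flow-logs: configuration looks fine, read the flow logs"


# The Lambda incident: local route only, default SG and NACL.
print(diagnose(None, False, True, True, True))
# route: no 0.0.0.0/0 route in the subnet's route table
```

Feeding in the tempting but wrong fix, `diagnose("igw-xyz", False, True, True, True)`, lands on the second route check, which matches the incident's conclusion that the subnet needs a NAT gateway rather than an IGW route.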

What AWS, Azure, and GCP each call their private network

The same idea, three slightly different shapes.

AWS VPC
A VPC lives in exactly one region and spans that region's AZs. Subnets are AZ-scoped (one subnet, one AZ). Instance tenancy is shared by default. Mature: launched 2009, the foundational primitive of every AWS architecture.
GCP VPC
VPCs are global: one VPC can span all regions, with subnets scoped per region. Subnets can be expanded after creation (AWS subnets cannot be resized). Cleaner model for global-first architectures.
Azure VNet
VNet is regional. Different terminology: NSG (Network Security Group) plays the role of a security group; UDR (User Defined Route) is the route table. Subnets are not AZ-scoped (zone-redundant by default).
Cloudflare Magic WAN
Software-defined WAN that overlays Cloudflare's network on top of cloud VPCs and on-prem sites — abstracts away the cross-cloud peering complexity entirely. Newer player; gaining traction in multi-cloud shops.
Tailscale
Mesh VPN that creates a flat overlay across VPCs, on-prem, laptops. Skips the entire VPC-peering vs Transit Gateway question for internal-team connectivity.

Cost reality. AWS bills per-GB by distance: cross-region traffic runs roughly double cross-AZ, and internet egress close to ten times cross-AZ. At terabyte-per-day egress the network line item dominates many bills. Gateway VPC endpoints for S3 and DynamoDB carry no charge of their own and keep that traffic off the NAT Gateway, avoiding its per-GB processing fee: a cheap optimisation many teams miss.



A closing note

A VPC is mostly route tables. Once you internalise that "public" is a routing decision, "stateful" is an SG decision, and "egress only" is a NAT GW decision, the rest follows. Most VPC pain is caused by overlapping CIDRs nobody planned for; the second most by NAT GW bills nobody noticed.
