Sub-page 09 · for infra + CNI authors

Kubernetes internals · Networking

Four assumptions, one flat fabric.

Kubernetes does not implement a network. It assumes one (four flat assumptions about reachability and identity) and delegates every byte of packet plumbing to a plugin. The shape of every cluster's data plane is the consequence of which plugin you picked and how it satisfies those four assumptions.

This page is a tour of the assumptions, the CNI contract that lets you swap implementations, and the four kube-proxy modes that turn ClusterIP virtual addresses into real packet flows. Roughly 4,300 words. Pair it with the architecture sub-page for where these processes live, and the how-it-works guide for the narrative version.

The four assumptions of the Kubernetes networking model.

Kubernetes networking is famously underspecified. The official documentation is roughly five pages, half of which is a list of plugins. The reason is that Kubernetes does not own the network; it owns four assumptions about the network, and any plumbing that satisfies those assumptions is a valid Kubernetes network. Every constraint, every Service primitive, every NetworkPolicy is a direct consequence of these four. Internalise them and the rest of this page is bookkeeping.

The first assumption is that every pod has its own IP address. Not a port on the node, not a NAT'd address shared with siblings, but a routable IP that the pod can bind to and announce as its own. This is what lets Kubernetes treat a pod as a first-class network endpoint. A web server in a pod calls listen(0.0.0.0:8080) and the application does not need to know it is in a container; from inside the pod, the network looks identical to a dedicated VM. The pod's IP comes from a per-node CIDR (typically a /24 carved out of a larger cluster CIDR) and is allocated by the IPAM plugin on the node where the pod was scheduled.

The second assumption is that every pod can reach every other pod without NAT. Source and destination addresses are preserved end-to-end. A pod on node A sending to a pod on node B sees its own IP as the source, and the receiver sees that same IP. There is no MASQUERADE, no SNAT, no port-mapping translation between pods. This is what lets you treat the cluster's pod network as a flat L3 fabric: every pod IP is routable from every other pod IP, and the only thing needed is for the underlying network (or the CNI plugin's overlay) to know how to forward packets between node CIDRs. In a typical cluster, the 10.244.0.0/16 pod CIDR is split into per-node /24s, and the CNI plugin either pushes routes into the node's table (Calico's BGP) or encapsulates packets in VXLAN (Flannel's default) so the underlay does not need to know about pod IPs.
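The per-node split described above can be sketched with Python's standard ipaddress module. This is only an illustration of the arithmetic: the node names are hypothetical, and a real cluster gets its slices from the node IPAM controller rather than from a loop like this.

```python
import ipaddress
from typing import Dict, Optional


def carve_node_cidrs(cluster_cidr: str, node_prefix: int = 24):
    """Return an iterator over per-node subnets of the cluster pod CIDR."""
    return ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_prefix)


def node_for_pod_ip(pod_ip: str, node_cidrs: Dict[str, ipaddress.IPv4Network]) -> Optional[str]:
    """Find the node whose slice contains pod_ip, much as a route lookup would."""
    addr = ipaddress.ip_address(pod_ip)
    for node, cidr in node_cidrs.items():
        if addr in cidr:
            return node
    return None


# Hypothetical three-node cluster: each node takes the next free /24.
subnets = carve_node_cidrs("10.244.0.0/16")
node_cidrs = {f"node-{i}": next(subnets) for i in range(3)}
for node, cidr in node_cidrs.items():
    print(node, cidr)
print(node_for_pod_ip("10.244.2.4", node_cidrs))
```

Whether the mapping from pod IP to node is realised as BGP routes or as a VXLAN forwarding entry, the lookup it answers is the one node_for_pod_ip performs here.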

The third assumption is that agents on a node (the kubelet, system daemons, kube-proxy) can reach every pod on that node. This is what makes liveness and readiness probes work: the kubelet periodically issues GET /healthz against the IPs of the pods it runs. The agent does not have to be on the pod network; it has to have a route to it. In practice this means the node's network namespace can reach its local pod CIDR even though node IPs and pod IPs come from different ranges.

The fourth assumption is the subtle one: a pod sees its own IP as the same IP that other pods see for it. Concretely, socket.gethostbyname(socket.gethostname()) inside the pod returns the same address that another pod would get if it resolved the pod's DNS name. This sounds trivial but it is the assumption that breaks first when someone tries to put SNAT in the pod-to-pod path: the pod thinks it is one IP, every other pod thinks it is a different IP, and any application that announces itself in a registry (Eureka, Consul, the JVM's RMI export) immediately misbehaves. Preserving identity-of-self is what lets clustered software written for VMs run unchanged on Kubernetes.

These assumptions are about pod-to-pod, not pod-to-service. Service traffic is explicitly allowed to be NAT'd, and almost always is, because that is how kube-proxy implements virtual ClusterIPs. The assumptions describe the substrate; Services sit on top.

The CNI spec: ADD, DEL, CHECK, and a JSON contract.

If the four assumptions are the network's promise, the Container Network Interface (CNI) is the API by which Kubernetes asks something else to deliver on it. CNI is unusually small for a plug-in surface this consequential: a binary in /opt/cni/bin/, a JSON config in /etc/cni/net.d/, and three core verbs. ADD is called when a pod sandbox is created and needs an interface and an IP. DEL is called when the sandbox is torn down. CHECK is a read-only consistency probe (does the pod still have the configuration we previously assigned?) that a runtime can use to detect drift. The spec also defines a VERSION verb for version negotiation, and version 1.1 adds GC and STATUS, but the three above carry the pod lifecycle.

The protocol is even simpler than it sounds: the runtime executes the plugin binary with the verb in CNI_COMMAND and the per-call arguments (CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME, CNI_ARGS) in the environment. The plugin reads its config JSON from stdin, does whatever Linux network programming it needs to do (create a veth, assign an IP, install routes, attach an eBPF program), and prints a result JSON to stdout describing what it did. Exit code zero means success; non-zero plus a JSON error body means failure. That is the core of the spec. It fits on a postcard, which is why every serious cloud networking project (Calico, Cilium, Weave, Flannel, AWS VPC CNI, GKE Dataplane V2) implements it.
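To make the contract concrete, here is a minimal sketch of the plugin side in Python. It is not a working plugin: it touches no network state, the address it returns is hard-coded, and the function name handle and the error code 100 (taken from the range the spec leaves to plugin-specific errors) are choices made for this illustration. A real plugin would read os.environ and sys.stdin and exit with the returned code.

```python
import json


def handle(env, stdin_text):
    """Process one CNI invocation; return (exit_code, stdout_text)."""
    conf = json.loads(stdin_text)              # network config arrives on stdin
    version = conf.get("cniVersion", "1.0.0")
    command = env.get("CNI_COMMAND")           # the verb arrives in the environment

    if command == "ADD":
        # A real plugin would create the veth and ask its IPAM plugin for an
        # address here; this sketch returns a fixed, hypothetical lease.
        result = {
            "cniVersion": version,
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.244.1.7/24", "gateway": "10.244.1.1", "interface": 0}],
        }
        return 0, json.dumps(result)

    if command in ("DEL", "CHECK"):
        # Nothing to tear down or verify in this sketch; success is silent.
        return 0, ""

    error = {"cniVersion": version, "code": 100,
             "msg": f"unsupported CNI_COMMAND: {command}"}
    return 1, json.dumps(error)


demo_env = {
    "CNI_COMMAND": "ADD",
    "CNI_CONTAINERID": "example-sandbox",
    "CNI_NETNS": "/var/run/netns/example",
    "CNI_IFNAME": "eth0",
}
demo_conf = json.dumps({"cniVersion": "1.0.0", "name": "k8s-pod-network", "type": "sketch"})
exit_code, stdout = handle(demo_env, demo_conf)
print(exit_code, stdout)
```

Keeping the logic in a pure function of (environment, stdin) mirrors how small the interface is: everything the plugin knows about the call is in those two inputs.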

CNI plugins compose. Most real installations run a chain: a base plugin that creates the interface and assigns the IP, plus auxiliary plugins that layer policy and shaping on top. The canonical example is bridge + portmap + bandwidth + tuning. The bridge plugin creates a Linux bridge on the node and a veth pair from the pod's namespace into it. The portmap plugin programs iptables hostport rules so that hostPort: 8080 on a pod actually maps to a node port. The bandwidth plugin installs tc qdiscs to enforce kubernetes.io/ingress-bandwidth annotations. The tuning plugin sets per-interface sysctls (e.g. net.ipv4.conf.all.arp_ignore). Each plugin runs in turn, fed the output of the previous as prevResult in stdin. Failures unwind the chain by issuing DEL in reverse order.
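The chaining behaviour can be sketched as a small driver loop. The names run_chain and make_plugin are hypothetical, plugins are modelled as plain Python callables rather than executables, and the sketch assumes that on failure the runtime issues DEL for the whole chain in reverse order (so DEL must tolerate a plugin whose ADD never ran).

```python
from typing import Callable, List, Optional, Tuple

# (name, add_fn, del_fn): a stand-in for one plugin binary in the chain.
Plugin = Tuple[str, Callable[[dict], dict], Callable[[dict], None]]


def run_chain(plugins: List[Plugin], conf: dict) -> Optional[dict]:
    """Run ADD through the chain, feeding each result forward as prevResult."""
    prev: Optional[dict] = None
    try:
        for _name, add, _delete in plugins:
            prev = add(dict(conf, prevResult=prev))
    except Exception:
        # Unwind: DEL every plugin in reverse order, then report the failure.
        for _name, _add, delete in reversed(plugins):
            delete(conf)
        raise
    return prev


calls: List[str] = []


def make_plugin(name: str, fail: bool = False) -> Plugin:
    """Build a fake plugin that records its invocations in `calls`."""
    def add(conf: dict) -> dict:
        calls.append(f"ADD {name}")
        if fail:
            raise RuntimeError(f"{name} failed")
        applied = (conf["prevResult"] or {}).get("applied", [])
        return {"applied": applied + [name]}

    def delete(conf: dict) -> None:
        calls.append(f"DEL {name}")

    return name, add, delete


chain = [make_plugin(n) for n in ("bridge", "portmap", "bandwidth", "tuning")]
print(run_chain(chain, {"name": "k8s-pod-network"}))
```

Swapping one entry for make_plugin("bandwidth", fail=True) shows the unwind: the ADD calls stop at bandwidth and the DEL calls run from tuning back to bridge.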

The runtime that actually invokes CNI is, in modern installs, not kubelet directly but the container runtime (containerd or CRI-O) using the github.com/containernetworking/cni Go library. The kubelet calls RunPodSandbox over the CRI gRPC socket; the runtime opens a network namespace for the sandbox and execs the CNI chain. This indirection matters because it is why crashing kubelet does not strand pod networking: the plugin invocation is synchronous, and once the IP is assigned it persists in the runtime's sandbox state.

# What containerd writes to the bridge plugin's stdin during ADD
# env:  CNI_COMMAND=ADD  CNI_CONTAINERID=8e4b...  CNI_NETNS=/var/run/netns/cni-...
#       CNI_IFNAME=eth0  CNI_PATH=/opt/cni/bin  CNI_ARGS=K8S_POD_NAMESPACE=prod;K8S_POD_NAME=web-7d8

{
  "cniVersion": "1.0.0",
  "name":       "k8s-pod-network",
  "type":       "bridge",
  "bridge":     "cni0",
  "isGateway":  true,
  "ipMasq":     false,
  "ipam": {
    "type":    "host-local",
    "subnet":  "10.244.1.0/24",
    "gateway": "10.244.1.1",
    "dataDir": "/var/lib/cni/networks"
  },
  "dns": { "nameservers": ["10.96.0.10"] }
}

Operational note: the most common CNI failure mode in production is a stale lock-file in the IPAM dataDir after a node-level reboot or kubelet restart. Symptom: new pods stuck in ContainerCreating with failed to allocate for range 0: no IP addresses available even though only a fraction of the /24 is in use. Fix: clear /var/lib/cni/networks/k8s-pod-network/ entries that no longer correspond to running sandboxes; better fix: switch to a cluster-scope IPAM.

IPAM: host-local vs cluster-scope address allocation.

IPAM (IP Address Management) is the part of CNI that decides which address gets assigned to which pod, and it is one of the most under-appreciated joints in the cluster. The CNI spec defines IPAM as its own plugin sub-type; a CNI plugin like bridge or macvlan delegates the actual allocation to whichever IPAM plugin is named in its config. There are two families: host-local and cluster-scope, and the choice has large consequences for cluster scale and behaviour during node churn.

Host-local IPAM is the default for most installs. The cluster CIDR (say 10.244.0.0/16) is partitioned into per-node /24s by an external controller (the node-ipam-controller in kube-controller-manager, which writes the assigned CIDR onto each Node's spec.podCIDRs). On each node, the host-local plugin reads its slice and tracks allocations in a flat directory of lock-files, one per IP, in /var/lib/cni/networks/<net>/. To assign an IP, it scans for the first free address. To free an IP, it deletes the file. The design is simple, but it stays correct only as long as the runtime tells it which IPs to release on DEL. If the runtime never calls DEL, the lease leaks, and that happens on hard kubelet kills, on node reboots between sandbox creation and CNI ADD completion, and in race conditions during deletion. The classic symptom is the no IP addresses available error on a /24 that in reality still has 80% of its space free.
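A file-per-address allocator of this kind is small enough to sketch. The class below is a simplified stand-in, not the host-local plugin's code: the name FileIPAM is hypothetical, it follows the first-free scan described above, and it runs against a temporary directory instead of /var/lib/cni/networks.

```python
import ipaddress
import os
import tempfile


class FileIPAM:
    """Toy allocator: one file per leased IP, file content is the owner ID."""

    def __init__(self, data_dir, subnet, gateway):
        self.data_dir = data_dir
        self.subnet = ipaddress.ip_network(subnet)
        self.gateway = ipaddress.ip_address(gateway)

    def allocate(self, container_id):
        for addr in self.subnet.hosts():
            if addr == self.gateway:
                continue
            path = os.path.join(self.data_dir, str(addr))
            try:
                # O_EXCL makes creation atomic: whoever creates the file owns the IP.
                fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                continue
            with os.fdopen(fd, "w") as lease:
                lease.write(container_id)
            return str(addr)
        raise RuntimeError("no IP addresses available")

    def release(self, container_id):
        # Called on DEL. If DEL never arrives, the file (and the IP) leaks.
        for name in os.listdir(self.data_dir):
            path = os.path.join(self.data_dir, name)
            with open(path) as lease:
                owner = lease.read()
            if owner == container_id:
                os.remove(path)


with tempfile.TemporaryDirectory() as data_dir:
    ipam = FileIPAM(data_dir, "10.244.1.0/29", "10.244.1.1")
    first = ipam.allocate("sandbox-a")
    second = ipam.allocate("sandbox-b")
    ipam.release("sandbox-a")
    reused = ipam.allocate("sandbox-c")
    print(first, second, reused)
```

The leak is easy to reproduce with this sketch: allocate for sandboxes that are never released and the range fills up with files whose owners no longer exist, which is exactly the state behind the error message above.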

Cluster-scope IPAM solves this by maintaining the allocation state in the api-server rather than on each node. Calico's IPAM uses CRDs called IPAMBlock and IPAMHandle; Cilium's uses CiliumNode.spec.ipam. The controller assigns blocks of addresses to nodes on demand rather than statically, can reclaim a block when its node disappears, and can detect leaks by reconciling allocations against actually-running pods. The cost is that pod creation now requires a synchronous api-server call before the IP is known. In practice this adds tens of milliseconds to pod startup and is invisible against the CRI sandbox-creation latency, but it is one more dependency on a healthy api-server during scale-out.
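The leak detection mentioned above is, at its core, a set difference between recorded allocations and live sandboxes. A minimal sketch, with a hypothetical find_leaked helper and made-up sandbox IDs, and with no claim to match how Calico or Cilium structure their reconcilers:

```python
from typing import Dict, Set


def find_leaked(allocations: Dict[str, str], live_sandboxes: Set[str]) -> Dict[str, str]:
    """Return the allocations whose owner is no longer a running sandbox."""
    return {ip: owner for ip, owner in allocations.items()
            if owner not in live_sandboxes}


# Allocation state as a controller might read it from the api-server (hypothetical).
allocations = {
    "10.244.1.7": "sandbox-a",
    "10.244.1.10": "sandbox-b",
    "10.244.1.11": "sandbox-c",
}
live = {"sandbox-a", "sandbox-c"}

leaked = find_leaked(allocations, live)
print(leaked)
```

The point of centralising the state is that this comparison becomes possible at all: both sides of the difference are visible to one controller.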

A third class (routable cluster IPAM) is what the AWS VPC CNI does. Each pod gets a real VPC IP, allocated from the same address space as the EC2 instances, by attaching multiple ENIs to each instance and assigning their secondary IPs to pods. This makes the four assumptions trivially satisfiable (the underlay is the pod network) at the cost of being bound to a single cloud's instance-IP-density limits. Azure CNI works the same way; GKE Dataplane V2 (Cilium-based) does something similar with alias IP ranges.

The choice between host-local and cluster-scope is, at scale, a choice about failure modes. Host-local fails open and silent: leaks accumulate, and one day you cannot create pods on a node. Cluster-scope fails closed and noisy: the api-server is slow, all pod creations wait, but the system is consistent. Most production clusters past a few hundred nodes converge on cluster-scope precisely because the silent leak in host-local is impossible to alert on until it bites.

# host-local IPAM lock files — one per allocated address
$ ls /var/lib/cni/networks/k8s-pod-network/
10.244.1.10  10.244.1.11  10.244.1.7  last_reserved_ip.0  lock

$ cat /var/lib/cni/networks/k8s-pod-network/10.244.1.7
8e4b7f3a2c1d...   # the container ID that holds this lease
eth0
k8s-pod-network

# cluster-scope (Calico) — allocations tracked as CRD
$ kubectl get ipamblocks -A
NAME                  AGE
10-244-1-0-24         3d
10-244-2-0-24         3d

$ kubectl get ipamblock 10-244-1-0-24 -o jsonpath='{.spec.allocations}'
[null,7,10,11,null,null,...,null]   # dense bitmap of /24 by handle index

If you are building a controller that touches IP allocation (a custom CNI, a webhook that inspects pod IPs), be aware that the IP is not in the Pod object at admission time. It is set by kubelet's SyncPod after the runtime returns from RunPodSandbox, written into status.podIPs via the kubelet's status update path. Webhooks that try to mutate based on pod IP will see an empty field. The right pattern is to watch Pod status updates, not Pod creates.

kube-proxy modes: iptables, IPVS, nftables, eBPF.

kube-proxy's job is one sentence: turn a Service ClusterIP (a virtual address with no interface bound to it anywhere) into packets that arrive at one of the Service's backing pods. The implementation has gone through four generations, each addressing the scale ceiling of the one before. Modern clusters can pick any of the four; the choice is now a tuning decision rather than a default.

The iptables mode is the default in most installs. kube-proxy watches Services and EndpointSlices; for every Service it writes a chain KUBE-SVC-XXX with one rule per backend pod, each statistically matched (-m statistic --mode random --probability p) so that new connections are spread evenly. Because the rules are evaluated in order, p is 1/N for the first rule, 1/(N-1) for the second, and so on, with the last rule matching unconditionally. Each backend rule jumps to a per-endpoint KUBE-SEP-YYY chain that does the actual DNAT to the pod IP, with an optional MASQUERADE for flows (such as hairpin traffic) whose replies would otherwise skip the reverse translation. The model is correct and durable, but it is linear: the first packet of every connection to a Service walks the chains looking for a match (conntrack handles the rest of the flow), and a Service with 1,000 endpoints means evaluating up to 1,000 rules for that packet. On clusters with tens of thousands of services, kube-proxy's ruleset alone can exceed 100,000 lines, which makes its periodic full resync take seconds and the kernel's per-connection evaluation cost meaningful.
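Why a sequential chain of random matches still balances evenly is easiest to see by working the arithmetic. The sketch below assumes rule i (counting from zero) matches with probability 1/(N-i) and the last rule always matches; the helper names are made up for this example.

```python
from fractions import Fraction
from typing import List


def rule_probabilities(n: int) -> List[Fraction]:
    """Per-rule match probability for a chain of n backends, in rule order."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(probs: List[Fraction]) -> List[Fraction]:
    """Overall share of new connections that each rule ends up taking."""
    shares = []
    remaining = Fraction(1)          # probability that no earlier rule matched
    for p in probs:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                     # ['1/3', '1/2', '1']
print([str(s) for s in effective_share(probs)])    # ['1/3', '1/3', '1/3']
```

The second rule's 1/2 applies only to the 2/3 of connections that fell through the first rule, so every backend ends up with 1/3 overall.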

IPVS mode replaces the iptables chains with the kernel's IPVS subsystem, a dedicated in-kernel L4 load balancer with hashtable-based lookup. kube-proxy populates IPVS virtual services and real servers via netlink instead of writing iptables rules, ClusterIPs are bound to a dummy interface (kube-ipvs0) so the kernel routes them through IPVS, and the lookup is O(1) regardless of endpoint count. IPVS supports more scheduling algorithms than the random-statistics of iptables (round-robin, least-connections, source-hash) and scales to hundreds of thousands of services. The drawback is a larger operational surface: IPVS interacts with conntrack in non-obvious ways, and a wrong value for a sysctl such as net.ipv4.vs.conntrack can cause reset storms.

nftables mode, GA in Kubernetes 1.33, is the spiritual successor to iptables. It uses the same kube-proxy-as-controller model but writes nftables rules instead of iptables rules. The advantage over iptables is nftables' native support for sets and verdict-maps, which let kube-proxy express “DNAT to one of these N IPs” as a single map lookup rather than a linear chain of probabilistic jumps. Programming the dataplane is also faster, because nftables accepts incremental updates applied as one atomic transaction; iptables-restore, by contrast, has to replace large tables wholesale.

eBPF (most prominently as Cilium's kube-proxy replacement) moves the entire model out of netfilter. Cilium attaches a BPF program to the cgroup connect() hook so that when a process inside a pod calls connect(ClusterIP, port), the BPF program rewrites the destination address to a pod IP before the packet ever enters the network stack. There is no DNAT, no conntrack entry, no rule scan. For packets that originate outside the cluster, Cilium attaches a tc-bpf program at the host's external interface that does the same lookup at ingress. The BPF maps are populated from the same Service+EndpointSlice watches that kube-proxy uses; the difference is purely in where the rewrite happens.

Mode	Rule storage	Per-packet lookup	Throughput	Notes
iptables	O(services × endpoints)	O(services + endpoints) linear	~30k pps before contention	default in most installs; KUBE-SERVICES → KUBE-SVC-XXX → KUBE-SEP-YYY chain
IPVS	O(services × endpoints) hashed	O(1) kernel hashtable	~200k pps per node	in-kernel L4 LB; dummy iface kube-ipvs0 holds ClusterIPs
nftables	O(services × endpoints) sets	O(log n) named sets + maps	~100k pps per node	beta in 1.31, GA in 1.33; replaces iptables in modern kernels
eBPF (Cilium)	O(services × endpoints) bpf maps	O(1) socket-LB at connect()	~500k pps per node	kube-proxy replacement; cgroup/sock_addr hook rewrites before packet exists

# iptables mode — KUBE-SERVICES dump (abridged)
$ iptables -t nat -L KUBE-SERVICES -n --line-numbers
Chain KUBE-SERVICES (2 references)
1   KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0  10.96.0.1     /* default/kubernetes */
2   KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  0.0.0.0/0  10.96.0.10    /* kube-system/kube-dns:dns */
3   KUBE-SVC-J2DWGRZTH4C2LPA4  tcp  --  0.0.0.0/0  10.96.42.7    /* prod/web */
...

$ iptables -t nat -L KUBE-SVC-J2DWGRZTH4C2LPA4 -n
1  KUBE-SEP-AAA  --  prob 0.33333  # jump to endpoint 1 with 1/3 probability
2  KUBE-SEP-BBB  --  prob 0.50000  # of remaining, jump to endpoint 2 with 1/2
3  KUBE-SEP-CCC  --                # otherwise endpoint 3

# IPVS mode — same Service, viewed via ipvsadm
$ ipvsadm -L -n
TCP  10.96.42.7:80 rr
  -> 10.244.1.7:8080            Masq    1      0          0
  -> 10.244.2.4:8080            Masq    1      0          0
  -> 10.244.3.9:8080            Masq    1      0          0

A common gotcha: switching kube-proxy mode is not zero-downtime. iptables-mode rules and IPVS-mode rules cannot coexist coherently for the same Service, because both write conntrack state. The supported procedure is: drain a node, change its kube-proxy config, restart kube-proxy, uncordon the node. Cluster-wide it takes a rolling node restart.

ServiceCIDR and the ClusterIP allocator.

Service ClusterIPs come from a single contiguous range called the Service CIDR, by default 10.96.0.0/12 on kubeadm installs, but configurable on cluster bootstrap. Every Service that gets a ClusterIP gets one address from this range, allocated by the api-server itself the moment a Service is created. The allocation has to be globally unique across the cluster (two Services pointing at the same ClusterIP would be ambiguous to kube-proxy) and the allocator runs inside the api-server's storage path so it can use etcd's transactional semantics to enforce uniqueness.

Pre-1.27, the allocator was a single bitmap object stored under /registry/ranges/serviceips in etcd, mutated under a transaction every time a Service was created or deleted. This worked but had a 64K cap (the bitmap was a single etcd key under the value-size limit), and it was a hot key during cluster scale-out: every Service create contended on it. KEP-1880 (alpha in 1.27, beta in 1.31, GA in 1.33) replaced this with the ServiceCIDR and IPAddress API objects: the cluster can now have multiple ServiceCIDRs (you can add a new range without recreating the cluster), and each allocated ClusterIP becomes its own IPAddress object in the api-server, garbage-collected when the Service goes away. Allocation is just an api-server CREATE transaction with a deterministic name (the IPAddress object representing 10.96.42.7 is named 10.96.42.7); collisions return 409 AlreadyExists and the allocator picks another.
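The create-and-retry pattern can be sketched against an in-memory stand-in for the api-server. Everything here is hypothetical scaffolding: FakeStore plays the role of storage with create-if-absent semantics, AlreadyExists stands in for the 409 response, the object is keyed by the IP string, and picking a random offset is an assumption of this sketch rather than a description of the real allocator's strategy.

```python
import ipaddress
import random


class AlreadyExists(Exception):
    """Stand-in for the api-server's 409 AlreadyExists response."""


class FakeStore:
    """In-memory store whose create() fails if the name is already taken."""

    def __init__(self):
        self.objects = {}

    def create(self, name, obj):
        if name in self.objects:
            raise AlreadyExists(name)
        self.objects[name] = obj


def allocate_cluster_ip(store, cidr, service, rng, attempts=64):
    """Pick an address, try to create its object, retry on collision."""
    network = ipaddress.ip_network(cidr)
    for _ in range(attempts):
        # Skip offset 0 (network address) and the last address (broadcast).
        offset = rng.randrange(1, network.num_addresses - 1)
        ip = str(network[offset])
        try:
            store.create(ip, {"parentRef": service})
        except AlreadyExists:
            continue                  # someone else owns it; pick another
        return ip
    raise RuntimeError("could not allocate a ClusterIP")


store = FakeStore()
rng = random.Random(0)
ip = allocate_cluster_ip(store, "10.96.0.0/12", "services/prod/web", rng)
print(ip, store.objects[ip])
```

Uniqueness falls out of the store's create semantics rather than from any lock held by the allocator, which is the property that removes the single hot key.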

The ClusterIP itself is never bound to any interface anywhere in the cluster, and this is the most common conceptual stumble. If you arping a ClusterIP from inside a pod, no host will answer; if you packet-capture on every interface in the cluster you will not see it. The ClusterIP exists only as a destination in kube-proxy's rules. Packets sent to it are intercepted by netfilter (in iptables/nftables modes), or by IPVS (with the dummy kube-ipvs0 interface advertising the IP locally just so the routing table sends it to the right hook), or by an eBPF program at connect() time (in Cilium mode). The packet that leaves a node has the pod IP as its destination, never the ClusterIP.

A subtle implication: changing the Service CIDR after cluster install is brutal. Existing Services keep their old ClusterIPs (they are immutable on the Service object), kube-proxy is configured at startup with the old range, and certificates baked into the api-server include the apiserver Service IP (almost always the first IP in the range, 10.96.0.1) as a SAN. Migrating to a different CIDR requires a cluster recreate or a meticulous rolling cert rotation plus Service-by-Service recreation. Plan the range generously the first time. With KEP-1880 you can add a new range without breaking anything, which is the new escape hatch.

# The default ServiceCIDR object as of 1.31+
$ kubectl get servicecidr
NAME         CIDRS          AGE
kubernetes   10.96.0.0/12   12d

# Each allocated ClusterIP is its own object
$ kubectl get ipaddresses | head
NAME         PARENTREF
10.96.0.1    services/default/kubernetes
10.96.0.10   services/kube-system/kube-dns
10.96.42.7   services/prod/web

# Adding a second range is now a single CREATE
$ kubectl create -f - <<EOF
apiVersion: networking.k8s.io/v1beta1
kind: ServiceCIDR
metadata:
  name: extension
spec:
  cidrs: ["10.112.0.0/12"]
EOF
servicecidr.networking.k8s.io/extension created

Headless Services (Services with clusterIP: None) skip the allocator entirely. They have no ClusterIP and no kube-proxy rules; they exist purely as DNS records. The cluster DNS (CoreDNS) publishes one A/AAAA record per backing pod IP on the Service's DNS name, and clients do their own load balancing by picking one. This is the right default for stateful workloads where clients need stable, addressable identities (the pattern you get with StatefulSets and per-pod DNS names like web-0.web.prod.svc.cluster.local).

If you write a controller that allocates IPs from outside the ServiceCIDR (a load-balancer provider, a custom routing system) make sure your range does not overlap. Overlap is silent until a packet from a Service's ClusterIP collides with a packet from your range, at which point conntrack hashes them together and you get reset storms that are extremely hard to diagnose.
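An overlap check is cheap to run before a range is ever handed out. A minimal sketch using the standard ipaddress module; the function name overlapping and the example ranges are illustrative, and a real controller would read the reserved ranges from the cluster rather than from a literal list.

```python
import ipaddress
from typing import List


def overlapping(candidate: str, reserved: List[str]) -> List[str]:
    """Return the reserved ranges that share any address with candidate."""
    cand = ipaddress.ip_network(candidate)
    return [r for r in reserved if cand.overlaps(ipaddress.ip_network(r))]


# Ranges already in use in the examples on this page.
reserved = ["10.96.0.0/12", "10.244.0.0/16"]

print(overlapping("10.100.0.0/16", reserved))   # inside the Service CIDR
print(overlapping("10.112.0.0/12", reserved))   # adjacent to it, no overlap
```

Refusing to start when the first call returns a non-empty list turns a failure that would otherwise surface as reset storms into a configuration error at boot.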

The pod-to-service path: DNS, ClusterIP, DNAT.

Watching a packet travel from one pod to another via a Service is the single most informative exercise in Kubernetes networking. It pulls together every piece on this page: the four assumptions, CNI's interface assignment, the ClusterIP allocator, kube-proxy's rules, and DNS. The path has six steps, and each one can fail independently in a way the others will not catch.

Step one: DNS lookup. The pod's /etc/resolv.conf is generated by kubelet at sandbox creation time and points at the cluster DNS Service, usually 10.96.0.10 (by kubeadm convention the tenth address in the Service CIDR), owned by the kube-dns Service in kube-system. The pod's libc resolver sends a UDP/53 query to that address. The query is intercepted by kube-proxy's rules on the local node, DNAT'd to a CoreDNS pod, answered with the Service's ClusterIP, and returned. CoreDNS is itself just pods, scheduled across the cluster like any other workload. The DNS Service's ClusterIP is an entry in kube-proxy's tables like any other Service.

Step two: NodeLocal DNS. On clusters with the NodeLocal DNSCache addon deployed, the pod's resolver actually targets a different address (typically the link-local IP 169.254.20.10) served by a per-node DaemonSet pod in the host network namespace. NodeLocal DNS handles the query locally, only forwarding cache misses to the upstream CoreDNS. The benefit is escaping a notoriously lossy path (conntrack-based UDP DNAT to a remote pod) and turning DNS resolution into a same-node operation. On large clusters this can cut p99 latency by 100ms or more and eliminate most of the 5-second DNS pauses caused by conntrack races on UDP.

Step three: ClusterIP intercepted. With a resolved ClusterIP in hand, the pod dials the address. The packet exits the pod's veth into the node's root network namespace. There, in iptables mode, the PREROUTING chain jumps to KUBE-SERVICES; the rule for the destination ClusterIP matches; the packet is sent to KUBE-SVC-XXX; one of its endpoint chains is picked by random probability; the packet's destination is rewritten to a backing pod IP via DNAT; and conntrack remembers the rewrite so reply packets are SNAT'd back. In Cilium mode the rewrite happens inside the BPF connect() hook before the packet is ever constructed; the kernel sees only the pod-IP destination.

Step four: route to the pod. Now the packet has a pod IP destination (10.244.2.4, say). The node's routing table, populated either by the CNI plugin's daemon (Calico's BIRD, Cilium's agent), by a tunnel mesh (Flannel's VXLAN), or by the underlay (AWS VPC CNI's ENI routes), knows that this pod IP lives on node B and sends the packet there. If it is encapsulated, this is where the encap header goes on.

Step five: receive and demux. Node B receives the packet. Either via veth (direct routing) or after decapsulation (overlay), the packet enters the destination pod's network namespace through its veth peer. The pod's listening socket accepts the connection. The destination application sees the source IP as the originating pod's IP; the destination as its own pod IP. The Service ClusterIP is gone — it never appeared on the wire.

Step six: the reply. The reply goes the other way, with conntrack on node A rewriting the source from the backing pod IP back to the ClusterIP so the originating application sees the address it dialled. The conntrack entry has a timeout; if the flow idles too long (by default tens of seconds to a few minutes for UDP, hours to days for established TCP, depending on kernel and kube-proxy settings) the entry expires and the next packet is treated as a new flow, evaluated against kube-proxy's rules again and possibly sent to a different backend.
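Steps three and six, the DNAT on the way out and the reverse translation on the reply, can be modelled as a small connection-tracking table. This is a toy model, not netfilter: the class name ServiceNAT is hypothetical, packets are reduced to (source IP, source port, destination IP, destination port) tuples, and entries never expire.

```python
import random


class ServiceNAT:
    """Toy DNAT with connection tracking for ClusterIP traffic on one node."""

    def __init__(self, backends, rng=None):
        self.backends = backends      # {(cluster_ip, port): [(pod_ip, port), ...]}
        self.conntrack = {}           # original 4-tuple -> chosen (pod_ip, port)
        self.rng = rng or random.Random()

    def outbound(self, src, sport, dst, dport):
        """Rewrite the destination of a packet leaving the client pod."""
        key = (src, sport, dst, dport)
        if key not in self.conntrack:
            choices = self.backends.get((dst, dport))
            if choices is None:
                return key            # not a Service address: pass through
            self.conntrack[key] = self.rng.choice(choices)
        pod_ip, pod_port = self.conntrack[key]
        return (src, sport, pod_ip, pod_port)

    def reply(self, src, sport, dst, dport):
        """Restore the ClusterIP as source on a packet coming back from a pod."""
        for (c_ip, c_port, svc_ip, svc_port), backend in self.conntrack.items():
            if backend == (src, sport) and (c_ip, c_port) == (dst, dport):
                return (svc_ip, svc_port, dst, dport)
        return (src, sport, dst, dport)


nat = ServiceNAT({("10.96.42.7", 80): [("10.244.2.4", 8080)]})
print(nat.outbound("10.244.1.7", 40000, "10.96.42.7", 80))
print(nat.reply("10.244.2.4", 8080, "10.244.1.7", 40000))
```

The first print shows the packet as it appears on the wire (pod IP as destination); the second shows what the client application sees (ClusterIP as source). Deleting the table entry between the two calls reproduces what an expired conntrack entry does to an idle flow.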

Diagnostic discipline: when a Service is misbehaving, walk these six steps in order. Resolve the DNS name (does it return an IP at all?), check the IP is in the ServiceCIDR, dump kube-proxy's rules for it (iptables -t nat -L -n | grep <ClusterIP> or ipvsadm -L -n), confirm EndpointSlices are populated (kubectl get endpointslices -l kubernetes.io/service-name=<svc>), and finally tcpdump on the source and destination nodes. Most outages stop at step three or four; the rest point at the CNI.

Cross-link: the load balancing guide goes deeper on probability vs hash-based selection, and the consistent hashing simulator visualises why externalTrafficPolicy: Local + sessionAffinity: ClientIP sometimes does not balance the way you expect.

NetworkPolicy backends: Cilium, Calico, Antrea.

NetworkPolicy is the Kubernetes type that says “pods matching label X may only receive traffic from pods matching label Y on port Z.” The type is a built-in API resource (networking.k8s.io/v1), not a CRD, with a precise schema; what makes NetworkPolicy slippery is that the api-server does not enforce it. Enforcement is the responsibility of whatever CNI you have installed, and not all CNIs implement it. Flannel, famously, does not: pods on a Flannel cluster have unrestricted connectivity regardless of how many NetworkPolicies you write. Cilium, Calico, Antrea, and kube-router do, with three different mechanisms and surprisingly different operational profiles.

Cilium uses identity-based enforcement. Every pod is assigned a numeric security identity at sandbox creation time, derived from its label set: two pods with identical labels share an identity, two pods with different labels do not. NetworkPolicies are compiled into BPF programs that match on identity rather than IP. The enforcement happens in tc-bpf at the pod's veth and at the host's external interface. Because identity is an integer, the datapath cost is constant regardless of the pod count, and policies can be expressed at L7 (HTTP method, gRPC service) by transparently redirecting matching traffic to an Envoy proxy that Cilium runs on the node. The identity-vs-IP shift is visible in cilium endpoint list: each endpoint has a stable identity number that survives pod restarts (as long as labels do not change).
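The label-set-to-identity idea can be sketched in a few lines. This is a conceptual model only: IdentityAllocator and ingress_allowed are hypothetical names, the numbering starts at an arbitrary value, and real identities are allocated cluster-wide and enforced in BPF maps, not Python sets.

```python
from typing import Dict, FrozenSet, Set, Tuple


class IdentityAllocator:
    """Hand out one integer per distinct label set."""

    def __init__(self, first_id: int = 1000):
        self._ids: Dict[FrozenSet[Tuple[str, str]], int] = {}
        self._next = first_id

    def identity_for(self, labels: Dict[str, str]) -> int:
        key = frozenset(labels.items())      # order-independent label set
        if key not in self._ids:
            self._ids[key] = self._next
            self._next += 1
        return self._ids[key]


def ingress_allowed(allow: Set[Tuple[int, int]], src_identity: int, port: int) -> bool:
    """Policy check keyed by (source identity, port): one hash lookup."""
    return (src_identity, port) in allow


alloc = IdentityAllocator()
web_a = alloc.identity_for({"tier": "web"})
web_b = alloc.identity_for({"tier": "web"})      # same labels, same identity
batch = alloc.identity_for({"tier": "batch"})

allow = {(web_a, 8443)}                          # "web tier may reach :8443"
print(web_a == web_b,
      ingress_allowed(allow, web_b, 8443),
      ingress_allowed(allow, batch, 8443))
```

Because the policy is keyed by identity, a web pod that restarts with a new IP needs no policy update: its labels, and therefore its identity, are unchanged.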

Calico uses ipset-based enforcement. For every NetworkPolicy selector, Calico materialises an ipset containing the pod IPs that match. Policies are compiled to iptables (or nftables, or eBPF in newer versions) rules that consult those ipsets. Because pod IPs are ephemeral, the ipsets churn — a pod restart with a new IP forces a recompile of every ipset the pod was in — but iptables-with-ipsets is cheap to evaluate (O(1) hash lookup per set), and Calico's BIRD-based BGP daemon means the policy datapath is independent of any overlay encapsulation. Calico is the default in many on-prem clusters specifically for its policy maturity.
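The selector-to-ipset materialisation, and the churn it implies, can be sketched as follows. The helper names and the pod records are made up for this example; a real implementation updates sets incrementally from a watch instead of rebuilding them.

```python
from typing import Dict, List, Set


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every key/value in the selector is present in the labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def build_ipsets(selectors: Dict[str, Dict[str, str]], pods: List[dict]) -> Dict[str, Set[str]]:
    """Materialise one set of pod IPs per named selector."""
    return {
        name: {p["ip"] for p in pods if selector_matches(sel, p["labels"])}
        for name, sel in selectors.items()
    }


pods = [
    {"name": "web-1", "ip": "10.244.1.7", "labels": {"tier": "web"}},
    {"name": "web-2", "ip": "10.244.2.4", "labels": {"tier": "web"}},
    {"name": "pay-1", "ip": "10.244.3.9", "labels": {"app": "payments"}},
]
selectors = {"from-web": {"tier": "web"}, "payments": {"app": "payments"}}

before = build_ipsets(selectors, pods)
pods[0]["ip"] = "10.244.1.23"          # web-1 restarts and comes back with a new IP
after = build_ipsets(selectors, pods)

changed = [name for name in selectors if before[name] != after[name]]
print(changed)
```

Only the sets that contained the restarted pod change, but every one of them has to be updated on every node that enforces a policy referencing it, which is the churn cost that identity-based designs avoid.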

Antrea uses Open vSwitch flow tables. Each policy compiles to a set of OVS flow rules that match on conjunctive selectors — conj_id matches the cross-product of source-set and destination-set, scaling sub-linearly in the number of policies. Because OVS supports flow caching in the megaflow path, the per-packet cost in steady state is a single hashtable lookup. Antrea also supports L7 policies via OVS extensions and FQDN-based egress policies — the latter is difficult, because FQDN means snooping DNS responses, which Antrea does inside the OVS controller.

All three implement the standard networking.k8s.io/v1 NetworkPolicy, which is namespaced and supports ingress/egress with podSelector, namespaceSelector, and ipBlock. Each also defines its own CRD — CiliumNetworkPolicy, GlobalNetworkPolicy (Calico), ClusterNetworkPolicy (Antrea) — for cluster-wide rules and L7 features that the upstream type does not cover. The newer AdminNetworkPolicy (KEP-2091) is meant to standardise the cluster-wide layer; it is alpha as of 1.30.

# A standard NetworkPolicy: payments pods only receive from web-tier on :8443
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-ingress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: payments
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: web
    ports:
    - protocol: TCP
      port: 8443

# What this looks like in Cilium — identities are numeric, derived from labels
$ cilium endpoint list
ENDPOINT  IDENTITY  LABELS                                       STATUS
3941      54213     k8s:app=payments,k8s:io.cilium.k8s.policy... ready
8127      54219     k8s:tier=web,k8s:io.cilium.k8s.policy...     ready
9412      54213     k8s:app=payments,k8s:io.cilium.k8s.policy... ready

# Policies compile to maps keyed by (source-identity, dest-identity, port)
$ cilium bpf policy get 3941
DIRECTION  IDENTITY  PORT/PROTO   AUTH    BYTES   PACKETS
Ingress    54219     8443/TCP     none    1.2MB   17421
Ingress    *         *            DENY    0       0

CNI	Engine	Identity model	L7
Cilium	eBPF tc + bpf_sk_lookup	security identity (numeric, label-derived)	yes (Envoy proxy injection)
Calico	iptables / nftables / eBPF	IP-set per selector	limited (Application Layer Policy)
Antrea	OVS flow tables	AppliedToGroup conjunctive match	yes (FQDN, HTTP via OVS extensions)
kube-router	iptables + ipset	ipset per selector	no

Subtle policy gotcha: for a policy with Ingress in its policyTypes, an empty rule list (ingress: []) or a missing ingress field means “deny all ingress” to the selected pods, while a list containing a single empty rule (ingress: [{}]) means “allow all ingress”. The two YAML shapes look almost identical in a diff and behave oppositely. Pin a default-deny policy at namespace level and you can stop worrying about which is which.
