What is a CNI plugin?

CNI (Container Network Interface) is the spec for plugins that wire up pod networking. When a pod starts, the container runtime, acting on the kubelet's request, calls the CNI plugin to give the pod an IP and a route. Calico, Cilium, Flannel, and the AWS VPC CNI all implement the spec but make different choices about how pod-to-pod traffic crosses node boundaries (overlay tunnels, BGP routing, or native cloud routing).

How Kubernetes networking connects pods and services

A pod gets an IP. A Service gives it a stable name. A CNI moves the packets. Everything else is variations on those three.

How a packet travels from outside the cluster to a container

Four layers of translation between the internet and your pod.

Kubernetes networking is built on five concepts layered on each other: the pod network (every pod gets a routable IP), Services (stable virtual IPs in front of pod sets), CNI plugins (Calico, Cilium, Flannel — they route pod traffic), CoreDNS (service-name resolution), and Ingress (external HTTP entry).

Each of the five concerns comes down to a piece of YAML backed by a kernel feature, starting with the pod network.

Pod network

Every pod gets a unique cluster-internal IP. Containers in a pod share the network namespace — they reach each other via localhost. The IP is allocated by the CNI plugin from the cluster's pod CIDR.

The pod network: every pod gets its own real IP

Pods talk to each other directly, with no port mapping in between.

Unlike Docker's default bridge network, where containers are reached from other hosts through NAT and published ports, every Kubernetes pod gets a routable IP that other pods can reach directly. Pod-to-pod traffic does not traverse NAT. Whether the peer is on the same node or a different one, the procedure is the same: open a TCP connection to the pod IP and the packets arrive.

The four-rule network model. Kubernetes formalises the rules: (1) every pod can reach every other pod without NAT; (2) every node can reach every pod without NAT; (3) the IP a pod sees itself as is the same IP others see it as; (4) Services are an additional virtual layer on top. Any CNI plugin that breaks rule (1) breaks half the ecosystem: service meshes, network policies, and headless Services all assume direct pod IPs.

Pod CIDR sizing matters. A typical default is 10.244.0.0/16, which is 65,536 IPs in total, sliced into one /24 per node (256 IPs each). At 256 nodes the cluster is full. EKS with the VPC CNI takes pod IPs directly from the VPC CIDR, so the per-node pod count is capped by ENI count × IPs-per-ENI: roughly 30 on an m5.large and about 234 on an m5.4xlarge. More pods per node means a larger node type, regardless of how much CPU and RAM you actually need. EKS prefix delegation (introduced 2021) loosens this by letting each ENI address slot carry a /28 prefix (16 IPs) instead of a single IP.
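The sizing arithmetic is easy to check by hand or in a few lines. The sketch below is illustrative only: the function names are made up for this example, the ENI limits (3 ENIs × 10 IPv4 addresses for m5.large, 8 × 30 for m5.4xlarge) are assumed from AWS's published per-instance table, and the `+ 2` follows the max-pods formula commonly quoted for the VPC CNI.

```python
import ipaddress


def nodes_supported(cluster_cidr: str, per_node_prefix: int) -> int:
    """How many per-node blocks of the given prefix length fit in the cluster CIDR."""
    cluster = ipaddress.ip_network(cluster_cidr)
    return 2 ** (per_node_prefix - cluster.prefixlen)


def eni_max_pods(enis: int, ips_per_eni: int) -> int:
    """Max pods in secondary-IP mode (assumed formula).

    Each ENI keeps one address for itself; the +2 is usually explained as
    covering pods that run on the host network.
    """
    return enis * (ips_per_eni - 1) + 2


def prefix_delegation_ips(enis: int, ips_per_eni: int) -> int:
    """Pod addresses when every secondary slot holds a /28 (16 IPs) instead of one IP."""
    return enis * (ips_per_eni - 1) * 16


print(nodes_supported("10.244.0.0/16", 24))  # 256
print(eni_max_pods(3, 10))                   # 29
print(eni_max_pods(8, 30))                   # 234
print(prefix_delegation_ips(3, 10))          # 432
```

The last figure is raw address capacity only; it says nothing about how many pods the node's CPU and memory can actually host.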

The pause container. Inside a pod, every container shares the same network namespace: they all reach each other via localhost and share one port space, so two containers that try to bind the same port collide and the second one fails to start. The shared namespace is owned by a tiny pause container that the kubelet creates first; the application containers join its namespace. If the pause container dies the whole pod's networking is gone, which is why it does almost nothing: it just sleeps and reaps zombie processes.

A pod-to-pod packet across nodes. The application sees its own IP and the peer's IP unchanged on both sides, which is rule (3) of the network model. The CNI plugin's job is to make the IP routing work.

A Service gives moving pods one stable name

Pods come and go; the Service IP and DNS name stay put.

Pod IPs change — every restart, every reschedule, every rolling deploy. A Service is the indirection. It is a stable virtual IP (the ClusterIP) plus a label selector. When traffic hits the ClusterIP, kube-proxy redirects it to one of the matching pods, doing in-kernel load balancing. The pattern is also a built-in form of service discovery.

Four Service types. ClusterIP — the default, internal virtual IP, only reachable from inside the cluster. NodePort — opens a port (range 30000–32767) on every node, forwarded to the Service. LoadBalancer — provisions a cloud LB (AWS NLB, GCP Forwarding Rule, Azure LB) that fronts the NodePort. ExternalName — a CNAME-only Service that aliases an external DNS name (no proxying happens).

Endpoints vs EndpointSlices. Behind every Service sits a list of pod IPs. Before EndpointSlices became stable in 1.21, this lived in a single Endpoints object: one giant resource that the API server re-sent in full to every watching node each time any backing pod changed. At a few thousand replicas the constant churn measurably hit etcd. EndpointSlice shards the list into slices of up to 100 endpoints by default, so a single pod restart only touches one slice. Services with more than 1,000 endpoints effectively require it, because the legacy Endpoints object is truncated at 1,000 addresses.
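To see why sharding cuts the churn, here is a minimal sketch that packs a Service's pod IPs into fixed-size slices and counts how many slices change when one pod is replaced. The sequential packing and the helper names are assumptions for illustration; the real EndpointSlice controller makes its own placement decisions.

```python
def shard(endpoints: list[str], max_per_slice: int = 100) -> list[list[str]]:
    """Split a flat endpoint list into slices of at most max_per_slice entries."""
    return [
        endpoints[i:i + max_per_slice]
        for i in range(0, len(endpoints), max_per_slice)
    ]


def slices_touched(before: list[list[str]], after: list[list[str]]) -> int:
    """Count slices whose contents differ between two snapshots."""
    return sum(1 for old, new in zip(before, after) if old != new)


# 2,500 hypothetical pod IPs behind one Service.
pod_ips = [f"10.244.{i // 250}.{i % 250 + 1}" for i in range(2500)]
before = shard(pod_ips)

# One pod restarts and comes back with a new IP.
restarted = pod_ips.copy()
restarted[1234] = "10.244.99.99"
after = shard(restarted)

print(len(before))                    # 25 slices
print(slices_touched(before, after))  # 1 slice has to be re-sent
```

With a single Endpoints object the same restart would mean re-sending all 2,500 addresses to every watcher.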

Three kube-proxy modes, real numbers. iptables mode (default) walks a chain of rules per Service — rule lookup is O(n) and adding a rule rewrites the whole chain; at 5,000 Services adding one new Service can take 250 ms, blocking the kube-proxy update loop. IPVS mode uses kernel hash tables — lookup is O(1) and an add is ~50 µs regardless of Service count, scaling cleanly to 10,000+ Services. eBPF mode (Cilium without kube-proxy) bypasses Netfilter entirely — the BPF program at the socket layer does the DNAT before the packet ever hits the iptables chain. eBPF is fastest in microbenchmarks (~30% lower CPU than IPVS) but harder to debug because tcpdump no longer sees the original destination.

Approximate add-Service latency at 5,000 Services. iptables rule lookup is linear in the number of Services; IPVS and eBPF use hash tables and stay flat. Numbers from Cilium's published benchmarks (2022).
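The difference between a chain walk and a hash lookup can be shown with a toy model. This is not how netfilter or IPVS are implemented; it only counts comparisons for a linear rule list versus a dictionary, using made-up ClusterIPs and backend names.

```python
def chain_lookup(rules, vip):
    """Walk rules in order, as a linear chain would; return (backend, rules checked)."""
    for checked, (rule_vip, backend) in enumerate(rules, start=1):
        if rule_vip == vip:
            return backend, checked
    return None, len(rules)


# 5,000 hypothetical Services, one rule each.
services = [(f"10.96.{i // 256}.{i % 256}", f"backend-{i}") for i in range(5000)]
last_vip = services[-1][0]

backend, comparisons = chain_lookup(services, last_vip)
print(backend, comparisons)  # backend-4999 5000

# A hash table finds the same entry without scanning the other 4,999.
table = dict(services)
print(table[last_vip])       # backend-4999
```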

CNI: three ways to move packets between pods

Overlay, routing, or native cloud networking, depending on the plugin you pick.

How does a packet from a pod on Node A get to a pod on Node B? Three families:

overlay

Wrap pod packets inside another packet

Encapsulate pod-to-pod packets in VXLAN/Geneve and send them between nodes over the underlying network. Works on any underlay; pays a small MTU and CPU cost. Flannel, Calico in VXLAN mode.

routing

Announce pod routes to the network

Each node tells the network (via BGP) which pod IPs live there. Routers handle the rest natively. No encapsulation overhead. Calico in BGP mode, Cilium with native routing.

native

Use the cloud provider’s network

AWS VPC CNI, GKE VPC-native networking, Azure CNI: pod IPs are allocated from the cloud VPC and use the cloud's existing routing. Fastest and simplest, but locked to one cloud.
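The "small MTU cost" of an overlay can be made concrete. A sketch, assuming VXLAN over an IPv4 underlay with no extra options: the encapsulation adds an outer IP header, a UDP header, the VXLAN header, and the inner Ethernet frame header, and the pod-facing MTU has to shrink by that amount.

```python
# Header sizes in bytes for VXLAN over IPv4 (assumed: no IP options, no VLAN tag).
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14

VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def vxlan_pod_mtu(underlay_mtu: int) -> int:
    """Largest pod-side MTU that still fits in one underlay packet after encapsulation."""
    return underlay_mtu - VXLAN_OVERHEAD


print(VXLAN_OVERHEAD)       # 50
print(vxlan_pod_mtu(1500))  # 1450
print(vxlan_pod_mtu(9001))  # 8951
```

If the pod interface keeps the underlay's MTU instead, full-size packets get fragmented or dropped after encapsulation, which tends to show up as connections that stall only on large payloads.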

One packet, traced: pod A to pod B through a Service

veth, bridge, DNAT, wire — and the conntrack entry that has to hold.

The setup. Pod A (10.244.1.5, node 1) calls my-svc, which CoreDNS resolves to the ClusterIP 10.96.0.20. The only Ready backend is pod B (10.244.2.7, node 2). Pod A's eth0 is one half of a veth pair; the other half sits in the node's root namespace, plugged into the cni0 bridge. The packet leaves the pod addressed to 10.96.0.20:80 — an IP that no interface anywhere owns.

DNAT on the client's node. ClusterIPs are a fiction maintained by kube-proxy. As the packet crosses node 1's netfilter PREROUTING hook, the KUBE-SERVICES chain matches the destination, jumps to the per-Service KUBE-SVC chain, picks a backend with a random statistic rule, and the KUBE-SEP rule DNATs the destination to 10.244.2.7:8080. conntrack records the translation for this flow. IPVS mode reaches the same result with one hash lookup instead of a chain walk. Either way, the rewrite happens on the client's node — node 2 never sees the ClusterIP.
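When a Service has several backends, the "random statistic rule" is really a sequence of rules evaluated in order, and the per-rule probabilities have to compensate for that ordering. The sketch below assumes the commonly described scheme in which rule i of n matches with probability 1/(n - i), and checks with exact fractions that every backend ends up equally likely. The function names are illustrative.

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probability for n backends: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def selection_odds(n: int) -> list[Fraction]:
    """Overall chance that each backend is chosen when rules are tried in order."""
    odds = []
    still_unmatched = Fraction(1)
    for p in rule_probabilities(n):
        odds.append(still_unmatched * p)  # reached this rule AND it matched
        still_unmatched *= 1 - p          # fell through to the next rule
    return odds


print([str(p) for p in rule_probabilities(3)])  # ['1/3', '1/2', '1']
print([str(p) for p in selection_odds(3)])      # ['1/3', '1/3', '1/3']
```

The final rule always matches, which is why a Service with a single Ready backend, as in this trace, needs no randomness at all.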

Crossing the wire. With a real pod IP as destination, ordinary routing takes over. Flannel encapsulates the packet in VXLAN: outer header node 1 → node 2, UDP 8472, the pod packet riding inside. Calico in BGP mode skips the wrapping — node 2 already announced that 10.244.2.0/24 lives there. Node 2 hands the packet to its bridge, through pod B's veth pair, into B's namespace. Pod B sees source 10.244.1.5, untouched.

The return path runs on state. Pod B replies to 10.244.1.5 directly — it knows nothing about the Service. When the reply reaches node 1, conntrack matches the recorded flow and reverses the translation, rewriting the source from 10.244.2.7:8080 back to 10.96.0.20:80, so pod A's TCP stack sees a reply from the address it dialled. If that conntrack entry is gone — table full (nf_conntrack_max), entry evicted, kube-proxy restarted mid-flow — the un-NAT never happens, pod A receives a packet from an IP it never contacted, and its kernel answers with a RST.
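The dependence on conntrack state can be modelled in a few lines. This is a toy simulation, not kernel behaviour: `ClientNode`, its dictionaries, and the client source port 40000 are hypothetical, while the IPs and ports are the ones used in the trace above.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int


class ClientNode:
    """Toy model of node 1: a Service table plus a conntrack table."""

    def __init__(self, services):
        self.services = services  # (ClusterIP, port) -> (pod IP, pod port)
        self.conntrack = {}       # expected reply tuple -> original (ClusterIP, port)

    def outbound(self, pkt: Packet) -> Packet:
        """DNAT the ClusterIP to a backend and remember how to undo it."""
        backend = self.services.get((pkt.dst, pkt.dport))
        if backend is None:
            return pkt
        pod_ip, pod_port = backend
        self.conntrack[(pod_ip, pod_port, pkt.src, pkt.sport)] = (pkt.dst, pkt.dport)
        return pkt._replace(dst=pod_ip, dport=pod_port)

    def inbound_reply(self, pkt: Packet) -> Packet:
        """Reverse the translation if the flow is still tracked; otherwise pass through."""
        original = self.conntrack.get((pkt.src, pkt.sport, pkt.dst, pkt.dport))
        if original is None:
            return pkt
        vip, vip_port = original
        return pkt._replace(src=vip, sport=vip_port)


node1 = ClientNode({("10.96.0.20", 80): ("10.244.2.7", 8080)})
request = node1.outbound(Packet("10.244.1.5", 40000, "10.96.0.20", 80))
reply = Packet("10.244.2.7", 8080, "10.244.1.5", 40000)

good = node1.inbound_reply(reply)  # source rewritten back to the ClusterIP
node1.conntrack.clear()            # simulate an evicted or lost entry
bad = node1.inbound_reply(reply)   # source stays as the pod IP

print(request.dst, request.dport)  # 10.244.2.7 8080
print(good.src, good.sport)        # 10.96.0.20 80
print(bad.src, bad.sport)          # 10.244.2.7 8080
```

In the last case pod A's TCP stack has no connection to 10.244.2.7:8080, which is the situation that ends in a RST.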

Why the proverb says "it's always DNS or conntrack". Everything in this trace except two pieces is configured once and static: the veth pairs, the bridge, the routes. The static parts fail at pod creation, loudly. The two pieces with per-request state are name resolution (DNS via CoreDNS, plus the ndots:5 search-path expansion that turns one lookup into five) and conntrack (a finite table with a known insert race that occasionally drops one of two parallel UDP DNS packets — the classic five-second DNS timeout, which NodeLocal DNSCache exists to absorb). A cluster that has run clean for months and now hangs intermittently for exactly five seconds is almost never the CNI. It is one of the two stateful things.
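The ndots expansion is simple enough to reproduce. The sketch below follows the usual resolver rule (names with fewer dots than ndots are tried against the search list first, then as written) and assumes a pod in namespace my-ns whose search list also inherits one domain from the node; `corp.example` is a placeholder, and `candidate_queries` is an illustrative helper, not a library function.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Names a stub resolver would try, in order, for a single lookup."""
    if name.endswith("."):
        return [name]  # already fully qualified: no expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


search = [
    "my-ns.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "corp.example",  # hypothetical domain inherited from the node
]

for query in candidate_queries("api.example.com", search):
    print(query)
# api.example.com.my-ns.svc.cluster.local.
# api.example.com.svc.cluster.local.
# api.example.com.cluster.local.
# api.example.com.corp.example.
# api.example.com.

print(candidate_queries("api.example.com.", search))  # ['api.example.com.']
```

Each candidate is typically sent for both A and AAAA records, so an external name with a short search list still produces a burst of parallel UDP packets, which is exactly the traffic pattern the conntrack race affects. A trailing dot on external hostnames skips the expansion.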

CoreDNS resolves service names inside the cluster

It turns a service name into the right cluster IP for every pod.

Pods don't usually call Services by IP — they call them by name. The cluster's DNS server (CoreDNS) answers my-svc.my-ns.svc.cluster.local with the Service's ClusterIP. The kubelet writes /etc/resolv.conf in every pod to point at CoreDNS by default.

Headless Services (clusterIP: None) behave differently: DNS returns A records for every backing pod IP, not a single VIP. StatefulSets and primary-replica databases use this so clients can address individual pods.

Ingress and LoadBalancer: getting outside traffic in

NodePort, LoadBalancer, or Ingress: how clients beyond the cluster reach a Service.

Three ways for the outside world to reach pods. NodePort opens a port on every node and forwards to a Service, which is fine for small clusters and awkward at scale. LoadBalancer provisions a cloud load balancer (ELB, GLB) per Service: clean, but pricey, since every Service gets its own cloud LB. Ingress is the modern answer: one cloud LB fronts every HTTP service in the cluster, and an Ingress controller (NGINX, Traefik, Envoy, AWS ALB), effectively a reverse proxy, routes by Host and path.
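Routing "by Host and path" boils down to a small matching function. The sketch below assumes prefix-style rules matched per path segment with the longest matching prefix winning; the hostnames, backend names, and helper functions are invented for the example, and real controllers support more match types than this.

```python
def path_matches(prefix: str, path: str) -> bool:
    """Prefix match on whole path segments, so /api matches /api/v1 but not /apiary."""
    want = [seg for seg in prefix.split("/") if seg]
    have = [seg for seg in path.split("/") if seg]
    return have[:len(want)] == want


def route(rules, host: str, path: str):
    """Return the backend of the longest matching prefix rule for this host, or None."""
    best = None
    for rule_host, rule_prefix, backend in rules:
        if rule_host != host or not path_matches(rule_prefix, path):
            continue
        if best is None or len(rule_prefix) > len(best[0]):
            best = (rule_prefix, backend)
    return best[1] if best else None


# Hypothetical rules: (Host header, path prefix, backend Service).
rules = [
    ("shop.example.com", "/", "web"),
    ("shop.example.com", "/api", "api"),
    ("admin.example.com", "/", "admin"),
]

print(route(rules, "shop.example.com", "/api/v1/users"))  # api
print(route(rules, "shop.example.com", "/apiary"))        # web
print(route(rules, "admin.example.com", "/login"))        # admin
print(route(rules, "other.example.com", "/"))             # None
```

One load balancer in front of this function is what lets a single public IP serve every HTTP Service in the cluster.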

The newer Gateway API generalises Ingress for non-HTTP protocols and multi-tenancy. Same idea, more expressive.

NetworkPolicy controls which pods can talk

Declarative rules, often default-deny, written in YAML.

By default, Kubernetes is a flat network — every pod can reach every other pod. NetworkPolicy resources let you restrict that. A typical policy says "pods labelled tier=db only accept traffic from pods labelled tier=app in the same namespace." Policies stack additively; without any policy, all traffic is allowed.
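The additive semantics fit in a short evaluation function. This sketch is deliberately simplified: it covers ingress only, a single namespace, and label-equality selectors, and the dictionary shape for a policy is invented for the example rather than taken from the API.

```python
def selects(selector: dict, labels: dict) -> bool:
    """An empty selector matches every pod; otherwise all key/value pairs must match."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies: list[dict], src_labels: dict, dst_labels: dict) -> bool:
    """Apply additive NetworkPolicy semantics to one source/destination pair."""
    selecting = [p for p in policies if selects(p["pod_selector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the destination: everything is allowed
    # Once selected, traffic passes only if some rule in some policy allows it.
    return any(
        selects(peer, src_labels)
        for policy in selecting
        for peer in policy["ingress_from"]
    )


default_deny = {"pod_selector": {}, "ingress_from": []}
db_from_app = {"pod_selector": {"tier": "db"}, "ingress_from": [{"tier": "app"}]}
policies = [default_deny, db_from_app]

app, web, db = {"tier": "app"}, {"tier": "web"}, {"tier": "db"}

print(ingress_allowed(policies, app, db))  # True: explicitly allowed
print(ingress_allowed(policies, web, db))  # False: selected, but no rule matches
print(ingress_allowed(policies, web, app)) # False: caught by default-deny
print(ingress_allowed([], web, db))        # True: no policies at all
```

Because rules only ever add allowances, there is no way to write a policy that overrides another policy's allow.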

NetworkPolicy is enforced by the CNI. The API resource is part of core Kubernetes, but the enforcement isn't — if you run a CNI that doesn't support NetworkPolicy (like the original Flannel), the policies are silently ignored. Calico, Cilium, Weave, AWS VPC CNI, and Antrea all enforce; check before relying on it. The CNI translates the YAML into iptables rules (Calico, classic) or eBPF programs (Cilium).

L7 policies are a CNI extension. Vanilla NetworkPolicy is layer 3/4: this pod, this port, this protocol. Cilium extends it to layer 7: this pod can only call GET /api/v1/users on the user-service, not DELETE /api/v1/users. For L7 rules, Cilium's eBPF data plane redirects the matching flows to an embedded Envoy proxy, which parses HTTP/gRPC inline. CRDs: CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy. Calico has its own L7 extension via the GlobalNetworkPolicy CRD.

The default-deny pattern. Production clusters typically run a default-deny policy per namespace (no traffic in or out unless explicitly allowed) plus per-app policies that whitelist specific peers. A useful rule of thumb: any namespace with PII or production secrets must have NetworkPolicy enforced; the audit team will eventually ask, and "we trust the cluster boundary" stops being a sufficient answer.

eBPF and Cilium replace iptables in the data plane

Routing packets in the kernel instead of long iptables chains.

eBPF (extended Berkeley Packet Filter) lets you run small programs inside the Linux kernel at hook points — socket creation, packet egress, syscall entry — without writing a kernel module. Cilium (Isovalent, 2017) is the Kubernetes CNI built on eBPF; it programs the kernel directly to do everything kube-proxy does, plus L7 policies, observability, and service mesh, with no per-pod sidecar.

Where eBPF wins in production. At Datadog scale (2024 KubeCon talk: tens of thousands of nodes, hundreds of thousands of Services), iptables-based kube-proxy was the bottleneck — the rule-update loop hit 100% CPU and lagged behind the API server. Switching to Cilium's eBPF kube-proxy replacement dropped the CPU floor by 40% per node and eliminated the propagation lag. Bank of America, Adobe, and Sky each have public talks on similar moves.

Hubble: observability without sidecars. Cilium ships Hubble, a query interface to all the network flows the eBPF data plane sees. hubble observe --label tier=db streams every flow to or from the matching pods, with namespace, label, L7 verb, and policy verdict. The same flows feed Grafana dashboards. The traditional alternative, an istio-proxy sidecar everywhere, doubles your container count and adds milliseconds of proxy latency to every hop; Hubble adds nothing to the data plane it's already running.

Cilium service mesh. Since Cilium 1.12 (2022), the same eBPF programs that route Service traffic also do the work an Istio sidecar used to do — mTLS, retries, traffic splitting — without a sidecar. The mesh ships HTTP, gRPC, and Kafka L7 awareness at the kernel layer.

A runbook for debugging Kubernetes networking

What to type when packets don't flow.

The five most useful commands when something doesn't work:

1. Confirm the Service has endpoints. kubectl get endpointslices -l kubernetes.io/service-name=my-svc. If the slice is empty, the label selector matches no Ready pods — usually a readiness-probe misconfiguration. If the slice has IPs, the Service definition is fine; the problem is downstream.

2. Hit the Service from inside the cluster. kubectl run debug --rm -it --image=nicolaka/netshoot -- bash drops you into a pod with curl, dig, tcpdump, ss, ip, mtr. curl my-svc.my-ns.svc.cluster.local tests DNS + Service routing in one shot.

3. Inspect the iptables/IPVS rules. iptables-save | grep KUBE-SVC on a node shows the kube-proxy chain. ipvsadm -Ln for IPVS mode. If the rules don't list your pod IP, kube-proxy hasn't reconciled yet — check kube-proxy logs.

4. Inspect the pod network namespace. Find the container's PID with crictl inspect, then nsenter -t <pid> -n ip route shows the pod's view of routing. Useful when a pod can talk to some peers but not others — usually a misconfigured CNI subnet.

5. With Cilium, use Hubble. hubble observe --from-pod my-app --type drop shows every dropped packet with the policy that dropped it. With Calico, the equivalent is calicoctl get felixconfiguration default -o yaml plus the per-policy log.
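Step 1 is the check most worth scripting. The sketch below assumes the JSON that kubectl get endpointslices -l kubernetes.io/service-name=my-svc -o json prints (a list object with items, each holding endpoints with addresses and conditions) and extracts the addresses that are not marked unready. The sample document is hand-written for the example, and ready_addresses is an illustrative helper.

```python
import json


def ready_addresses(endpointslice_list_json: str) -> list[str]:
    """Collect addresses from an EndpointSlice list, skipping endpoints marked not ready."""
    data = json.loads(endpointslice_list_json)
    addresses = []
    for item in data.get("items", []):
        for endpoint in item.get("endpoints") or []:
            conditions = endpoint.get("conditions") or {}
            # Assumption: a missing "ready" condition is treated as ready.
            if conditions.get("ready") is False:
                continue
            addresses.extend(endpoint.get("addresses", []))
    return addresses


# Hand-written sample in the assumed shape of the kubectl output.
sample = json.dumps({
    "kind": "List",
    "items": [{
        "kind": "EndpointSlice",
        "endpoints": [
            {"addresses": ["10.244.2.7"], "conditions": {"ready": True}},
            {"addresses": ["10.244.3.9"], "conditions": {"ready": False}},
        ],
    }],
})

print(ready_addresses(sample))  # ['10.244.2.7']
```

An empty result points back at the selector or the readiness probe; a non-empty one means the next step in the runbook is the one to try.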

For the long version, the canonical references are Learnk8s's deployment troubleshooting flowchart and the official debugging guide.

Kubernetes networking at scale: three case studies

What runs at the limits.

Spotify (~2,500 nodes, GCP). Runs GKE in VPC-native mode: pod IPs come directly from the VPC, with no overlay encapsulation. Network policies are enforced by Calico in policy-only mode. CoreDNS runs at the cluster level, with NodeLocal DNSCache on every node to absorb the DNS query rate (~50k QPS during business hours). Single biggest reliability win: switching from kube-dns to CoreDNS in 2018 cut DNS-related on-call pages by ~80%.

Datadog (~50,000 nodes globally). Migrated from kube-proxy iptables to Cilium's eBPF kube-proxy replacement in 2022. Reasons: kube-proxy was burning a noticeable fraction of node CPU at ~10,000 Services per cluster, and Service rule propagation lag was visible in p99 deploy latency. Hubble observability replaced a custom flow logger. Public details: Datadog engineering blog (2024).

Adobe (~1,000 nodes per cluster, multi-cloud). Standardised on Calico for cross-cloud consistency — the same NetworkPolicy semantics in EKS, AKS, and on-prem clusters. Adobe runs ~25,000 NetworkPolicy resources cluster-wide for compliance with SOC 2 controls. Their public talks (KubeCon EU 2023) emphasise the operational benefit of one CNI vendor across clouds, not raw performance.

A closing note

Five layers, one pod. The CNI gives every pod an IP that other pods can reach directly. Services give those moving IPs a stable name. CoreDNS resolves the name. Ingress lets the outside world in. NetworkPolicy says no to traffic the rules do not allow.