Sub-page 11 · for storage operators + controller authors
Kubernetes internals · Storage

Six controllers,
one durable byte.

Kubernetes does not store anything on its own. It coordinates a small fleet of controllers that together turn the abstract promise of a PersistentVolumeClaim into a real block device, formatted, bind-mounted, and visible inside a pod: four out-of-tree sidecars (a provisioner, an attacher, a resizer, a snapshotter) plus two in-tree loops in the controller-manager (the binder and the attach/detach controller).

This page is a tour of the storage subsystem itself: the controllers, the CRDs, the gRPC contract between Kubernetes and the storage vendor, and the bits of the design you only notice when something goes wrong. Roughly 4,300 words. The mount-side view from inside the kubelet is in the kubelet sub-page; this page is about everything that happens before the mount.


Why storage is hard — the in-tree to CSI migration story.

Storage is the single subsystem in Kubernetes that has been rewritten end-to-end while in production. Compute scheduling has evolved; networking has been re-plumbed; the API surface has gone through API-version dances. But only storage went through a multi-year, deliberate replacement of every plugin in the codebase. Understanding why that happened is the precondition to understanding what the storage subsystem looks like today.

Until roughly Kubernetes 1.21, every cloud provider's volume type was a Go package compiled into the Kubernetes binaries themselves (the kubelet and the kube-controller-manager). AWS EBS lived in pkg/volume/awsebs/; GCE PD in pkg/volume/gcepd/; Azure Disk, Azure File, Cinder, vSphere, Ceph RBD, GlusterFS, Portworx, ScaleIO, StorageOS, Quobyte, Flocker, all of it lived inside the Kubernetes binary. To ship a new storage backend you had to convince the SIG, get code reviewed by maintainers who had no domain knowledge of your hardware, wait for a Kubernetes release cycle, and pray the test infrastructure had access to your gear. To fix a bug in your driver you had to ship a patched kubelet to every node. The cadence was glacial; the surface was enormous; the test matrix was unmaintainable.

The first attempt to fix this was Flexvolume in 2015: drop a binary in /usr/libexec/kubernetes/kubelet-plugins/volume/exec/, the kubelet execs it with the operation name and JSON options as command-line arguments, and the binary prints a JSON status on stdout. It worked for trivial drivers but had no model for cluster-wide operations: nothing in Flexvolume could provision a new volume in EBS or attach it from outside the node. Drivers worked around this by talking to cloud APIs directly from a privileged DaemonSet, which became its own security and lifecycle nightmare. Flexvolume is still in the tree but it is officially deprecated; do not start there.

The replacement is the Container Storage Interface, drafted in 2017 across SIG Storage and the CNCF Storage TAG, with parallel implementations for Kubernetes, Mesos, and Cloud Foundry. CSI is a specification (a gRPC service definition in container-storage-interface/spec) that describes everything Kubernetes needs to ask a storage system: create me a volume, delete it, attach it to this node, mount it at this path, snapshot it, expand it. KEP-178 stabilised CSI in Kubernetes 1.13, and KEP-625 (CSI Migration) took the core machinery for proxying the legacy in-tree plugins through CSI to GA in 1.25, with the last major cloud-provider plugins following by 1.26. As of 1.30, every cloud-provider in-tree volume plugin is either deleted or a thin shim around a CSI driver, and almost every storage vendor ships a CSI driver as a standalone Helm chart that you install separately from Kubernetes itself.

The result is a Kubernetes binary with no vendor storage code in it. The kubelet knows how to call CSI but knows nothing about EBS. The controller-manager knows how to drive VolumeAttachment objects but does not know what an attach is. The ecosystem has decoupled. New storage backends can ship as out-of-tree drivers, on their own release cadence, with their own bug fixes, signed by their own vendor, and Kubernetes itself does not have to change. This is what is meant by "storage is the most successful out-of-tree story in the Kubernetes project". It is also what makes the storage subsystem look so busy from outside: you are looking at six controllers across three or four pods, when in 2018 it was one Go function in the kubelet.

[Figure: storage plugin models on a timeline, 2014 → 2026 (ticks at 2014, 2017, 2021, 2026). In-tree plugins: AWS EBS, GCE PD, Azure Disk, Ceph RBD, and others under pkg/volume. Flexvolume: exec'd binary, deprecated. CSI: out-of-tree gRPC, controller plugin + node plugin + sidecars. CSI migration: in-tree plugins proxied to CSI drivers, GA by 1.26.]

One historical artefact you still see: pods that mount a volume of type awsElasticBlockStore in the spec. The kubelet quietly redirects that to the EBS CSI driver via the migration shim, so the on-disk path under /var/lib/kubelet says kubernetes.io~csi, not kubernetes.io~aws-ebs. If you have automation that parses mount paths, this trips it.
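
As an illustration of the artefact described above, a pod using the legacy in-tree field might look like the sketch below. The pod name, image, and volume ID are hypothetical; the awsElasticBlockStore field itself is part of the core v1 API, and on a cluster with CSI migration active it is served by ebs.csi.aws.com rather than by in-tree code.

```yaml
# Legacy in-tree volume source, still accepted by the API server.
apiVersion: v1
kind: Pod
metadata:
  name: legacy-ebs-pod            # hypothetical name
spec:
  containers:
    - name: app
      image: web:1.0              # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      awsElasticBlockStore:       # in-tree field; handled by the EBS CSI driver via the shim
        volumeID: vol-0a1b2c3d4e5f
        fsType: ext4
```

New manifests are better off referencing a PVC; the legacy field is shown only because older manifests still carry it.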

CSI architecture — controller plugin and node plugin.

A CSI driver is, by spec, two gRPC services that may be implemented in the same binary or in two different binaries depending on what the driver supports. The Controller service exposes operations that are global to the cluster — CreateVolume, DeleteVolume, ControllerPublishVolume, ControllerUnpublishVolume, CreateSnapshot, ControllerExpandVolume. The Node service exposes operations that have to be performed on the machine where the volume will be mounted — NodeStageVolume, NodePublishVolume, NodeUnpublishVolume, NodeUnstageVolume, NodeExpandVolume, NodeGetVolumeStats. There is also a tiny Identity service that returns the driver's name, version, and capability set; both controller and node implementations expose it.

The deployment shape is conventional but not enforced by the spec. A driver typically ships its controller plugin as a Deployment with one or two replicas, leader-elected, in kube-system or a vendor namespace. The Pod runs the driver's controller binary plus a stack of the four external sidecars (next section). The node plugin is a DaemonSet: one Pod per node, running the driver's node binary plus the node-driver-registrar sidecar, which advertises the driver's Unix socket to the kubelet. That socket lives at /var/lib/kubelet/plugins/<driver-name>/csi.sock, and the registration protocol creates a small handshake socket at /var/lib/kubelet/plugins_registry/<driver-name>-reg.sock, which the kubelet discovers by watching that directory.

The split exists because the two halves need radically different privilege. The controller plugin needs cloud credentials to call EC2 or GCE APIs, so it usually runs with an IRSA / Workload Identity binding to a cloud IAM role. It does not touch the host filesystem. The node plugin needs CAP_SYS_ADMIN to bind-mount and run mkfs, plus hostPath mounts to /var/lib/kubelet and /dev, plus the kubelet's plugin registration directory. It does not need cloud credentials. Splitting them means you do not have a privileged DaemonSet on every node holding a cloud admin key, and you do not have an unprivileged controller trying to bind-mount filesystems. Each side gets exactly the privilege it needs.
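
The node half of that split can be sketched as a DaemonSet. This is a minimal illustration under stated assumptions, not any vendor's manifest: the driver name foo.csi.vendor.io, the image tags, and the driver's --endpoint flag are hypothetical, while --csi-address and --kubelet-registration-path are the node-driver-registrar's own flags.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: csi-foo-node                     # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels: { app: csi-foo-node }
  template:
    metadata:
      labels: { app: csi-foo-node }
    spec:
      containers:
        - name: foo-driver
          image: vendor.io/foo-csi:v2.3.1
          args: ["--endpoint=unix:///csi/csi.sock"]   # flag name varies by driver
          securityContext:
            privileged: true                          # needed to mount and run mkfs
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: kubelet-dir
              mountPath: /var/lib/kubelet
              mountPropagation: Bidirectional         # mounts must be visible to the kubelet
            - name: device-dir
              mountPath: /dev
        - name: node-driver-registrar
          image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0
          args:
            - "--csi-address=/csi/csi.sock"
            - "--kubelet-registration-path=/var/lib/kubelet/plugins/foo.csi.vendor.io/csi.sock"
          volumeMounts:
            - name: plugin-dir
              mountPath: /csi
            - name: registration-dir
              mountPath: /registration
      volumes:
        - name: plugin-dir
          hostPath: { path: /var/lib/kubelet/plugins/foo.csi.vendor.io, type: DirectoryOrCreate }
        - name: registration-dir
          hostPath: { path: /var/lib/kubelet/plugins_registry, type: Directory }
        - name: kubelet-dir
          hostPath: { path: /var/lib/kubelet, type: Directory }
        - name: device-dir
          hostPath: { path: /dev, type: Directory }
```

Note what is absent: no cloud credentials and no service account with IAM bindings. Those belong to the controller Deployment only.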

Drivers also declare capability sets: plugin-level capabilities through the Identity service (GetPluginCapabilities) and controller-level capabilities through the Controller service (ControllerGetCapabilities). The most common plugin-level ones are CONTROLLER_SERVICE (the driver implements the Controller RPCs) and VOLUME_ACCESSIBILITY_CONSTRAINTS (volumes are pinned to topology domains; see Part 05). The most common controller-level ones are PUBLISH_UNPUBLISH_VOLUME (the driver does an explicit attach/detach step, as opposed to NFS-style network filesystems that do not), EXPAND_VOLUME (controller-side volume expansion), CLONE_VOLUME, and CREATE_DELETE_SNAPSHOT. Kubernetes' sidecars query the capability set at startup and only run the loops the driver supports. If the driver does not advertise CREATE_DELETE_SNAPSHOT, the external-snapshotter sidecar has nothing to drive, and snapshot requests for that driver fail with a clear error.

[Figure: CSI deployment architecture. The api-server holds PV, PVC, and VolumeAttachment objects. Controller plugin Pod (Deployment, leader-elected): the vendor's driver-controller binary plus external-provisioner, external-attacher, external-resizer, and external-snapshotter; cloud credentials via IRSA / Workload Identity; no host mounts; talks to the cloud API; watches PV/PVC/VolumeAttachment. Node plugin DaemonSet (one Pod per node): privileged driver-node binary (CAP_SYS_ADMIN) plus node-driver-registrar, which advertises the socket /var/lib/kubelet/plugins/csi-driver/csi.sock to the kubelet; hostPath mounts of /var/lib/kubelet, /dev, /sys; runs mkfs and bind-mounts; no cloud credentials. The kubelet talks only to the node plugin, over the Unix socket.]

A subtle property of this design: Kubernetes never speaks CSI to the controller plugin directly. The api-server has no CSI client. The controller plugin sits there as a private gRPC service inside the controller plugin Pod, and the only thing that calls it is a sidecar container in the same Pod, over localhost:9090 or a Unix socket on a shared emptyDir. The sidecars are the bridge between the Kubernetes API and the CSI gRPC. We get to them next.

Driver-level behaviour is queryable at runtime via the CSIDriver object: kubectl get csidriver lists every driver registered in the cluster, and -o yaml shows spec.attachRequired, spec.fsGroupPolicy, and spec.volumeLifecycleModes already resolved. If a workload's pod is stuck waiting on attach, this is your first read.
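
For orientation, a CSIDriver object for a block-storage driver might look roughly like this. The driver name is hypothetical and the values are illustrative; the field names are those of the storage.k8s.io/v1 API.

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: foo.csi.vendor.io            # hypothetical driver name
spec:
  attachRequired: true               # a VolumeAttachment round-trip is needed before mount
  podInfoOnMount: false              # do not pass pod metadata to NodePublishVolume
  fsGroupPolicy: File                # kubelet applies fsGroup ownership to the volume
  volumeLifecycleModes: [Persistent] # no inline ephemeral volumes
  storageCapacity: false             # driver does not publish CSIStorageCapacity objects
```

With attachRequired: false, no VolumeAttachment is ever created for the driver, which is worth checking before debugging an attach that was never going to happen.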

The four sidecars — provisioner, attacher, resizer, snapshotter.

The Kubernetes side of CSI is implemented as four out-of-tree controllers, each a separate container image maintained in github.com/kubernetes-csi and shipped alongside every CSI driver. They are deliberately small and single-purpose. Each watches one or two Kubernetes resources, translates events into CSI gRPC calls against the driver, and writes the result back to the api-server. They are not part of the Kubernetes binary; they are not part of any vendor's binary; they are a shared library of ready-made controllers that every CSI driver instantiates.

The reason this is a sidecar pattern and not a single big controller is leadership and lifecycle. Each sidecar leader-elects on its own Lease. Each can be upgraded independently: the snapshotter went GA in 1.20, several releases after the attacher, for example, and clusters that did not need snapshots ran an older version for a long time. The driver vendor decides which sidecars to ship, at what version, with what flags. The Kubernetes project decides what the CRDs and the gRPC contract look like. The interface is stable; the implementation is not.

Sidecar | Watches | Calls into driver | Result | Repo
external-provisioner | PersistentVolumeClaim | CreateVolume / DeleteVolume | Creates a PV that satisfies the PVC | kubernetes-csi/external-provisioner
external-attacher | VolumeAttachment | ControllerPublishVolume / ControllerUnpublishVolume | Marks VolumeAttachment.status.attached | kubernetes-csi/external-attacher
external-resizer | PersistentVolumeClaim (size deltas) | ControllerExpandVolume (the kubelet then calls NodeExpandVolume) | Grows the underlying volume + filesystem | kubernetes-csi/external-resizer
external-snapshotter | VolumeSnapshot, VolumeSnapshotContent | CreateSnapshot / DeleteSnapshot | Materialises VolumeSnapshotContent backed by a real snapshot | kubernetes-csi/external-snapshotter

Take external-provisioner first. It watches every PVC in the cluster. When a PVC appears with a storageClassName whose provisioner field matches the driver's name, the sidecar calls CreateVolume on the controller plugin with the PVC's parameters (size, accessModes, the StorageClass parameters). The driver does whatever it does — calls EC2 to create an EBS volume, allocates an LVM logical volume, talks to a Ceph cluster — and returns a volume handle. The sidecar then constructs a PV object and POSTs it to the api-server, with spec.claimRef already pointing at the originating PVC, so the binder controller (Part 04) wires them together immediately.

external-attacher watches a different resource entirely: VolumeAttachment, a built-in storage.k8s.io API rather than a CRD. The attach/detach controller in the kube-controller-manager creates one of these whenever a Pod is scheduled to a node and needs a volume that is not yet attached there. The attacher sees the new VolumeAttachment, calls the driver's ControllerPublishVolume RPC with the volume handle and the target node ID, waits for the cloud-side attach to finish (this can take 20 seconds for EBS, longer for some on-prem systems), and updates VolumeAttachment.status.attached to true. The kubelet's volumeManager is watching this status and now proceeds with NodeStageVolume.
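
A VolumeAttachment after a successful attach might look like the sketch below. The object name, node name, and device path are hypothetical (real names are generated hashes, and attachmentMetadata is whatever the driver chooses to return); the PV name and driver are the ones used elsewhere on this page.

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-example-attachment          # hypothetical; real names are generated
spec:
  attacher: ebs.csi.aws.com             # which driver's external-attacher should act
  nodeName: node-n3                     # hypothetical target node
  source:
    persistentVolumeName: pvc-7b4c1d2e
status:
  attached: true                        # set by external-attacher after ControllerPublishVolume
  attachmentMetadata:
    devicePath: /dev/xvdba              # illustrative driver-supplied metadata
```

When a pod is stuck in ContainerCreating, comparing spec.nodeName with the pod's node and checking status.attachError is usually quicker than reading sidecar logs.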

external-resizer watches PVC size deltas. When you kubectl edit pvc and bump spec.resources.requests.storage, the api-server records the request. The resizer sees it, calls ControllerExpandVolume on the driver to grow the underlying block device, then sets the FileSystemResizePending condition on the PVC, which signals the kubelet to call NodeExpandVolume on the node plugin and grow the filesystem, either on next mount or, for online resize, immediately. The resize is intentionally two-phase: the block device is grown from the control plane, but most filesystems have to be mounted on the node before they can be grown.
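
Between the two phases, the PVC is visibly in an intermediate state. A sketch of what that might look like, reusing the web-data claim from this page and assuming a resize from 20Gi to 40Gi; in current releases the flag is surfaced as the FileSystemResizePending condition:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-data
  namespace: prod
spec:
  storageClassName: gp3-ssd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 40Gi                 # the new size requested by the user
status:
  phase: Bound
  capacity:
    storage: 20Gi                   # still the old size until the filesystem is grown
  conditions:
    - type: FileSystemResizePending # controller-side expansion done; node-side pending
      status: "True"
```

Once the kubelet has run NodeExpandVolume, the condition disappears and status.capacity catches up with the request.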

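external-snapshotter is the most baroque. It works with three CRDs: VolumeSnapshot (the user-facing intent), VolumeSnapshotContent (the cluster-scoped binding to a real snapshot, analogous to a PV), and VolumeSnapshotClass (analogous to StorageClass), and translates them into CreateSnapshot / DeleteSnapshot RPCs against the driver. In current releases the work is split in two: a cluster-wide snapshot-controller, deployed once per cluster rather than per driver, binds VolumeSnapshot to VolumeSnapshotContent, and the per-driver csi-snapshotter sidecar watches VolumeSnapshotContent and issues the RPCs. Snapshots are a deliberately late addition to CSI; they did not GA until 1.20.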

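# What a CSI controller pod looks like, abridged: four sidecars plus the driver.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: csi-foo-controller
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels: { app: csi-foo-controller }
  template:
    metadata:
      labels: { app: csi-foo-controller }
    spec:
      serviceAccountName: csi-foo-controller-sa
      containers:
        - name: csi-provisioner
          image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.0
          args: ["--csi-address=/csi/csi.sock", "--leader-election"]
          volumeMounts: [{ name: socket-dir, mountPath: /csi }]
        - name: csi-attacher
          image: registry.k8s.io/sig-storage/csi-attacher:v4.5.0
          args: ["--csi-address=/csi/csi.sock", "--leader-election"]
          volumeMounts: [{ name: socket-dir, mountPath: /csi }]
        - name: csi-resizer
          image: registry.k8s.io/sig-storage/csi-resizer:v1.10.0
          args: ["--csi-address=/csi/csi.sock", "--leader-election"]
          volumeMounts: [{ name: socket-dir, mountPath: /csi }]
        - name: csi-snapshotter
          image: registry.k8s.io/sig-storage/csi-snapshotter:v7.0.0
          args: ["--csi-address=/csi/csi.sock", "--leader-election"]
          volumeMounts: [{ name: socket-dir, mountPath: /csi }]
        - name: foo-driver
          image: vendor.io/foo-csi:v2.3.1
          args: ["--endpoint=unix:///csi/csi.sock"]   # flag name varies by driver
          volumeMounts: [{ name: socket-dir, mountPath: /csi }]
      volumes:
        - name: socket-dir
          emptyDir: {}     # shared Unix socket for sidecar→driver gRPC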

A common operational mistake — running the snapshotter sidecar at a version newer than the snapshotter CRDs installed in the cluster. The CRDs are deliberately not bundled with the sidecar image; you install them separately. If the cluster is on the v1beta1 CRDs and you ship the v8 sidecar, every snapshot reconcile fails with no kind "VolumeSnapshotContent" is registered. Lock the CRD version to your sidecar release in CI.

PV / PVC binding — how a Pending PVC finds a PV.

The persistent-volume-binder controller, which runs as one of the loops inside kube-controller-manager, is the matchmaker between PersistentVolumeClaims and PersistentVolumes. Its job is one sentence long: for every PVC in Pending, find a PV that satisfies it, set PV.spec.claimRef to the PVC and PVC.spec.volumeName to the PV, and move both to Bound. The catch is the word "satisfies" — there are six dimensions of compatibility, and the binder runs them as a strict filter, not a fuzzy match.

A PV is eligible for a PVC if and only if it satisfies all of: same StorageClass (or both empty); a superset of the PVC's accessModes (a PVC asking for ReadWriteOnce can bind to a PV offering both ReadWriteOnce and ReadWriteMany, but a PV offering only ReadWriteOnce cannot satisfy a ReadWriteMany claim); at least the PVC's requested capacity; matching selector labels (PVCs can use spec.selector for label matching, which is rare in cloud-native installs); compatible volume mode (Filesystem vs Block); and not already bound to, or reserved for, a different claim. If multiple PVs match, the binder picks the smallest one that satisfies the request, a deliberate choice to avoid wasting capacity. If none match, the PVC stays Pending.

There are two ways a Pending PVC becomes Bound. The first is static binding: an operator pre-creates a PV (perhaps wrapping a hand-built EBS volume or an existing NFS share), and the binder finds it on its next sweep. The second is dynamic provisioning: the PVC has a storageClassName that points at a StorageClass with a CSI provisioner, the external-provisioner sidecar sees the unbound PVC, calls CreateVolume on the driver, and POSTs a freshly minted PV with claimRef already set. The binder sees the PV, matches it to the PVC, and moves both to Bound. In modern installs almost all PVCs are dynamically provisioned; static PVs survive in environments with hand-managed legacy storage.
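
The static path is easiest to see in a manifest. The sketch below assumes an existing NFS export; the server name, path, labels, and object names are all hypothetical. It exercises several of the binder's filters at once: empty StorageClass on both sides, a capacity that exceeds the request, matching access modes, and a label selector.

```yaml
# Operator-created PV wrapping an existing NFS share.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: legacy-nfs-share              # hypothetical
  labels: { tier: archive }
spec:
  storageClassName: ""                # no class: only claims with an empty class match
  capacity: { storage: 500Gi }
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.internal      # hypothetical server
    path: /exports/archive
---
# Claim that the binder will match to the PV above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: archive
  namespace: prod
spec:
  storageClassName: ""                # empty string opts out of dynamic provisioning
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 100Gi                  # the 500Gi PV satisfies this; the claim gets all 500Gi
  selector:
    matchLabels: { tier: archive }
```

Leaving storageClassName unset instead of empty would hand the claim to the cluster's default StorageClass, and the static PV would never be considered.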

Once bound, the relationship is sticky. The claimRef on the PV is a permanent reservation; even if the PVC is deleted, no other PVC can claim that PV until an operator clears the claimRef manually. This is what the Released phase means — the PV's claim is gone, but the PV is not yet recyclable. The reclaim policy on the PV decides what happens next: Delete calls the provisioner's DeleteVolume and removes the PV; Retain keeps the PV around in Released for an operator to handle. The default for dynamically provisioned volumes is Delete, which is sometimes a sharp edge — accidental kubectl delete pvc on a production database in a Delete-policy StorageClass deletes the data.
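
Where the Delete default is too sharp for a workload, one option is a dedicated class with Retain. A minimal sketch, reusing the EBS provisioner name from this page; the class name is hypothetical:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-ssd-retain               # hypothetical name
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
reclaimPolicy: Retain                # PVs from this class go to Released, not deleted
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```

The trade-off is operational: Released PVs and their cloud volumes accumulate until someone cleans them up, so this suits a handful of databases better than a fleet of scratch volumes.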

# StorageClass — a template the provisioner uses for new PVCs of this class.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
# PVC + the PV the provisioner generates for it (after binding).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-data
  namespace: prod
spec:
  storageClassName: gp3-ssd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-7b4c1d2e
spec:
  storageClassName: gp3-ssd
  capacity: { storage: 20Gi }
  accessModes: [ReadWriteOnce]
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0a1b2c3d4e5f
    fsType: ext4
  claimRef:                              # back-pointer to the PVC
    namespace: prod
    name: web-data
  persistentVolumeReclaimPolicy: Delete

[Figure: dynamic provisioning sequence across user / kubectl, api-server, external-provisioner, and driver / cloud. POST PVC (Pending) → watch ADDED PVC → CreateVolume (gRPC) → volume_id, capacity → POST PV (claimRef → PVC) → binder: PVC + PV → Bound.]

The PV/PVC dance has a hidden invariant: the PV's claimRef is the authoritative half of the binding. If you ever need to recover from a botched delete (PVC gone, PV stuck in Released), the fix is to edit the PV and clear spec.claimRef; the PV returns to Available and the next compatible PVC will pick it up. The reverse does not work: pointing a new PVC's volumeName at the Released PV leaves the stale claimRef in place, so the binder refuses the match.
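
To make the recovery concrete, this is roughly what a Released PV looks like before the fix, assuming the volume was kept by a Retain policy. The uid is a placeholder; the other values are the ones used earlier on this page.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-7b4c1d2e
spec:
  storageClassName: gp3-ssd
  capacity: { storage: 20Gi }
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0a1b2c3d4e5f
  claimRef:                          # stale reservation: remove this block to recover
    namespace: prod
    name: web-data
    uid: 00000000-0000-0000-0000-000000000000   # placeholder for the deleted PVC's UID
status:
  phase: Released                    # becomes Available once claimRef is cleared
```

The data on the volume is untouched throughout; only the reservation changes.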

Volume binding modes — Immediate vs WaitForFirstConsumer.

A StorageClass has a field that quietly controls one of the most important scheduling properties in the cluster: volumeBindingMode. It takes two values, Immediate and WaitForFirstConsumer, and the difference is when the external-provisioner calls CreateVolume. With Immediate (still the API default when the field is unset), the provisioner provisions as soon as the PVC appears, before any pod has been scheduled. With WaitForFirstConsumer, the provisioner waits until a pod that uses the PVC has been scheduled to a specific node, and then provisions a volume in that node's topology.

The reason this matters is topology. Cloud block storage is almost always pinned to a single availability zone: an EBS volume in us-east-1a cannot be attached to an instance in us-east-1b. With Immediate binding, the provisioner picks an AZ for the volume before the scheduler picks an AZ for the pod, and the two can disagree. If the pod's own constraints (node selectors, affinity, free capacity) rule out the volume's zone, the pod gets stuck Pending with the cryptic message node(s) had volume node affinity conflict. With WaitForFirstConsumer, the order is reversed: the scheduler picks a node first (using whatever constraints the pod has), and the provisioner uses the node's topology labels to provision a volume in the right AZ. The two never disagree, because the pod's choice is now an input to the volume's choice.

The mechanism by which this works is a CSI capability called VOLUME_ACCESSIBILITY_CONSTRAINTS. Drivers that support it return accessible topology for every volume they create, which the provisioner records as a VolumeNodeAffinity on the PV: a node selector that says "this PV is only attachable on nodes whose topology.ebs.csi.aws.com/zone matches us-east-1a". The scheduler reads this affinity in its VolumeBinding plugin and folds it into pod placement. With WaitForFirstConsumer, the plugin defers binding: while filtering it checks each candidate node against the StorageClass's allowedTopologies (and against reported capacity, if the driver publishes it), and only after the scheduler picks a node does it annotate the PVC with that node, at which point the provisioner actually creates the volume in the chosen AZ.
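
The affinity the scheduler reads is an ordinary field on the PV. A sketch of the PV from earlier on this page with its topology constraint filled in, assuming the volume was provisioned in us-east-1a:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-7b4c1d2e
spec:
  storageClassName: gp3-ssd
  capacity: { storage: 20Gi }
  accessModes: [ReadWriteOnce]
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0a1b2c3d4e5f
  nodeAffinity:                                  # written by the provisioner from the driver's reply
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone # label the node plugin sets on each node
              operator: In
              values: [us-east-1a]
```

A node affinity conflict message from the scheduler means no schedulable node carries a label matching this selector.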

The KEP-490 Topology-Aware Volume Provisioning enhancement made this possible, and WaitForFirstConsumer is now the norm for the CSI StorageClasses shipped with cloud-provider charts. EBS, GCE PD, Azure Disk, and Cinder all ship with WaitForFirstConsumer in their default StorageClass. If you are still seeing volumeBindingMode: Immediate on a zonal StorageClass in your cluster, it is almost certainly a leftover from a 1.12-era install or a hand-rolled StorageClass; replace it (the field is immutable, so this means creating a new StorageClass rather than editing the old one). The cost of WaitForFirstConsumer is a slightly longer pod-startup delay (provisioning happens during scheduling rather than before it), and the benefit is that pods never get stuck on topology mismatches.

[Figure: WaitForFirstConsumer flow, in which the scheduler decides the AZ and the provisioner provisions in that AZ. (1) PVC created, phase Pending; provisioner waits. (2) Pod scheduled: the VolumeBinding plugin picks node N3 (zone us-east-1b). (3) PVC annotated with the selected node, volume.kubernetes.io/selected-node=N3; provisioner unblocks and calls CreateVolume(zone=us-east-1b). (4) PV created with NodeAffinity topology.ebs.csi…/zone=us-east-1b. (5) PVC + PV bound; attach/detach controller fires. (6) Pod runs on N3, volume mounted, no zone mismatch. With Immediate, steps 3–4 happen first; if the scheduler later disagrees, the pod stays Pending forever.]

There is one corner case worth knowing. WaitForFirstConsumer requires the provisioner sidecar to be running when the first consumer pod is scheduled, because the sidecar is what reads the volume.kubernetes.io/selected-node annotation and triggers CreateVolume. If your CSI driver pod is down for an upgrade, every new PVC stays Pending until it comes back. Pods using existing volumes continue to work, because the kubelet's mount path goes through NodeStageVolume on the node plugin, not through the controller plugin, but new provisions are blocked. This is one reason most operators run two replicas of the controller plugin Pod even though only one is leader at a time.
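
While a claim is waiting on the provisioner, the hand-off is visible on the PVC itself. A sketch of that intermediate state, assuming the scheduler picked a node named N3; the storage-provisioner annotation is the one current releases add to name the responsible driver:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-data
  namespace: prod
  annotations:
    # Written by the scheduler once it has picked a node for the first consumer pod.
    volume.kubernetes.io/selected-node: N3
    # Names the provisioner expected to act on this claim.
    volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
spec:
  storageClassName: gp3-ssd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
status:
  phase: Pending                     # stays Pending until the provisioner creates the PV
```

A PVC that carries selected-node but stays Pending for minutes usually points at the controller plugin Pod rather than at the scheduler.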

If you have a stretched cluster across multiple zones and need cross-zone replication, WaitForFirstConsumer is not enough; you need a driver that supports VOLUME_ACCESSIBILITY_CONSTRAINTS with topology-spread or one of the replicated-volume drivers (Portworx, Rook-Ceph, Longhorn, OpenEBS). The scheduler can place pods across AZs, but a single EBS volume cannot follow.

Snapshots and clones — VolumeSnapshot and dataSource.

The volume-snapshot subsystem is structurally a parallel of the PV/PVC subsystem, with the same naming pattern: a namespaced user-facing CRD pointing at a cluster-scoped backing CRD pointing at a real underlying object, classed by a class CRD. VolumeSnapshot is the user intent ("take a snapshot of this PVC"); VolumeSnapshotContent is the cluster-scoped binding to a real cloud snapshot ID; VolumeSnapshotClass picks the driver and parameters. The CRDs live in the snapshot.storage.k8s.io API group and are deliberately not bundled with the Kubernetes binary; you install them as part of the snapshotter sidecar release.

The reconciliation loop is run by the external-snapshotter sidecar, the same way the provisioner runs the PVC loop. When a VolumeSnapshot appears, the sidecar resolves the VolumeSnapshotClass to find the driver, calls CreateSnapshot on the controller plugin with the source PV's volume handle, gets back a snapshot ID, and creates a VolumeSnapshotContent with a snapshotHandle pointing at it. From the user's perspective, the VolumeSnapshot transitions to readyToUse: true and reports restoreSize — the volume size needed to restore from this snapshot.
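
The cluster-scoped half of that binding might look like the sketch below once the snapshot is ready. The object name and snapshot handle are hypothetical; the snapshot, class, and volume handle are the ones used in this page's examples.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-example              # hypothetical; real names are generated
spec:
  driver: ebs.csi.aws.com
  deletionPolicy: Delete                 # inherited from the VolumeSnapshotClass
  volumeSnapshotClassName: ebs-snap
  source:
    volumeHandle: vol-0a1b2c3d4e5f       # the source PV's CSI volume handle
  volumeSnapshotRef:                     # back-pointer to the namespaced VolumeSnapshot
    namespace: prod
    name: web-data-snap-2026-05-03
status:
  snapshotHandle: snap-0123456789abcdef0 # hypothetical cloud-side snapshot ID
  readyToUse: true
  restoreSize: 21474836480               # bytes; 20Gi
```

The parallel with PV and claimRef is deliberate: volumeSnapshotRef plays the same role here.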

Restoring is done through the PVC's spec.dataSource field. Set dataSource to a VolumeSnapshot reference and the external-provisioner, instead of calling CreateVolume with no source, calls it with a volume_content_source.snapshot field set to the snapshot ID. The driver creates a new volume populated from the snapshot, returns its handle, and the new PVC binds to a PV containing the snapshot's data. The new volume is independent — modifications do not propagate back to the snapshot or to the source PVC.

Cloning is the same machinery with a different source. A dataSource pointing at another PVC instead of a VolumeSnapshot causes the provisioner to call CreateVolume with volume_content_source.volume. The driver duplicates the source volume — usually as a server-side copy on the storage backend, fast and cheap — and returns a new volume handle. Clones are useful for forking development databases or prepping multi-tenant test fixtures: spin up ten copies of a 100GB seed dataset in seconds, where a naive cp would take an hour.
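
A clone request is an ordinary PVC with a PVC as its dataSource. This sketch assumes a driver that advertises CLONE_VOLUME; the StorageClass name foo-block and the claim names are hypothetical. The source claim has to be in the same namespace as the clone.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: seed-data-clone              # hypothetical
  namespace: dev
spec:
  storageClassName: foo-block        # hypothetical class backed by a clone-capable driver
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi                 # must be at least the size of the source
  dataSource:
    kind: PersistentVolumeClaim      # core API group, so no apiGroup field is needed
    name: seed-data                  # hypothetical source claim in the same namespace
```

Not every driver supports cloning, so it is worth checking the driver's documented capabilities before building tooling on it.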

# Take a snapshot of the production PVC.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: web-data-snap-2026-05-03
  namespace: prod
spec:
  volumeSnapshotClassName: ebs-snap
  source:
    persistentVolumeClaimName: web-data
---
# Restore it into a new PVC. dataSource is namespace-local, so the PVC
# must live in the same namespace as the VolumeSnapshot.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: web-data-restored
  namespace: prod
spec:
  storageClassName: gp3-ssd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: web-data-snap-2026-05-03

Two operational subtleties bear flagging. First, snapshots are not backups in the disaster-recovery sense. Most cloud snapshots live in the same region as the source volume (EBS snapshots are regional, but not multi-region by default), and the VolumeSnapshot and VolumeSnapshotContent objects that index them live in the same Kubernetes cluster that holds the data. If you lose the cluster and the region, you lose both. A real backup strategy uses something like Velero, which can export data to object storage in another region. Second, snapshot deletion semantics follow the deletionPolicy on the VolumeSnapshotContent, which is copied from its VolumeSnapshotClass: Delete calls DeleteSnapshot on the driver, Retain leaves the underlying snapshot alone. The field is required on the class and has no API default, but most vendor-shipped classes set Delete, with the same accidental-deletion sharp edge as PV reclaim policy.
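
A class that keeps the cloud-side snapshot when the Kubernetes objects are deleted is a three-field manifest. This sketch reuses the ebs-snap class name referenced in the snapshot example on this page:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snap
driver: ebs.csi.aws.com
deletionPolicy: Retain               # deleting the VolumeSnapshot leaves the real snapshot in place
```

With Retain, cleaning up the underlying snapshots becomes a manual or scripted job, which is the same trade-off as Retain on a StorageClass.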

CSI snapshots are crash-consistent, not application-consistent. The driver freezes the block device and copies it; it does not flush your database's WAL or fsync your file handles. For application-consistent backups (Postgres, Mongo, MySQL), use the database's own backup tool, or use a pre-snapshot hook (spec.preSnapshotHook in the operator, or Velero's hook framework) to FREEZE the database first.

StatefulSet storage — sticky volumes and PVC retention.

A StatefulSet is, structurally, a Deployment with three extra rules: pods get stable, ordinal names (web-0, web-1, web-2); pods are created and deleted one at a time, in order; and each pod gets its own PVC, generated from a volumeClaimTemplates stanza, that follows the pod across reschedules and restarts. The third rule is the storage one, and it is the entire reason StatefulSet exists. Without it, every pod restart would lose its data.

The mechanism is mechanical. When the StatefulSet controller decides to create web-0, it first reconciles a PVC named data-web-0 (the template's name followed by the pod's name), generated from volumeClaimTemplates[0]. The PVC goes through the normal binder + provisioner flow and ends up Bound to a PV. The StatefulSet controller then creates the pod with a volumes entry that mounts the PVC by name. If the pod is later deleted (rolling update, eviction, node failure), the PVC is not deleted; it survives. When the StatefulSet controller recreates web-0 on a different node, it finds the existing PVC, reuses it, and the new pod inherits all of web-0's previous data.

This is what is meant by "StatefulSet pods stick to their volume". The pod is ephemeral, but the PVC survives at least as long as the StatefulSet itself. If you scale up from 3 to 5 replicas you get two new PVCs (data-web-3, data-web-4); if you scale down from 5 to 3, the PVCs for ordinals 3 and 4 stay around — by default — so that scaling back up reattaches the previous data. This is a deliberate safety property. A misconfigured HPA scaling a Postgres StatefulSet down to 1 replica should not destroy the data on replicas 2 and beyond.

The "by default" qualifier is doing real work. KEP-1847 (alpha in 1.23, beta and enabled by default from 1.27) added spec.persistentVolumeClaimRetentionPolicy on the StatefulSet, with two sub-fields: whenDeleted (what happens when the StatefulSet is deleted) and whenScaled (what happens on scale-down). Each takes Retain (default, keep the PVCs) or Delete (garbage-collect them with owner references back to the StatefulSet or the pod). With whenScaled: Delete, scaling down deletes both the pod and its PVC, which deletes the underlying PV (assuming the StorageClass has reclaim Delete), which deletes the cloud volume. This is occasionally what you want for stateless caches; it is rarely what you want for databases.

# StatefulSet with volumeClaimTemplates — every pod gets its own sticky PVC.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels: { app: web }
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain        # keep PVCs even if the StS is deleted
    whenScaled: Retain         # keep PVCs on scale-down (default)
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: app
          image: web:1.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/web
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        storageClassName: gp3-ssd
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 20Gi
# After apply, the cluster contains:
#   pvc/data-web-0  pv/pvc-aaa…   pod/web-0 (mounts data-web-0)
#   pvc/data-web-1  pv/pvc-bbb…   pod/web-1 (mounts data-web-1)
#   pvc/data-web-2  pv/pvc-ccc…   pod/web-2 (mounts data-web-2)

There is a subtle interaction with WaitForFirstConsumer worth holding. A StatefulSet with WaitForFirstConsumer-bound volumes will, on first creation of web-0, schedule the pod and provision the volume in whatever AZ the scheduler picked. On every subsequent rescheduling of web-0 — node failure, voluntary eviction, drain — the existing PVC has a node affinity locking it to that AZ. The scheduler will only place web-0 in that AZ from now on. If the AZ is unhealthy and you have no nodes there, web-0 stays Pending until a node comes back. The volume is sticky, and "sticky" sometimes means "stuck".

The StatefulSet controller is in some ways the most conservative controller in the cluster: it deliberately serialises operations (default podManagementPolicy: OrderedReady) so that web-1 never starts until web-0 is Ready. For storage this is the right default — bringing up a Postgres replica before its primary is bad — but for stateless-ish workloads where you just want stable names, set podManagementPolicy: Parallel.
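
For the stateless-ish case, the two knobs discussed in this section combine naturally. A sketch for a disposable cache, with hypothetical names and a placeholder image; it assumes a cluster recent enough to honour persistentVolumeClaimRetentionPolicy:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache                        # hypothetical workload
spec:
  serviceName: cache
  replicas: 3
  podManagementPolicy: Parallel      # create and delete pods without ordinal ordering
  selector:
    matchLabels: { app: cache }
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete              # PVCs go away with the StatefulSet
    whenScaled: Delete               # PVCs go away on scale-down
  template:
    metadata:
      labels: { app: cache }
    spec:
      containers:
        - name: cache
          image: cache:1.0           # placeholder image
          volumeMounts:
            - name: scratch
              mountPath: /var/cache/app
  volumeClaimTemplates:
    - metadata:
        name: scratch
      spec:
        storageClassName: gp3-ssd
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 10Gi
```

Neither setting is appropriate for a database: Parallel removes the ordering that replicas depend on, and Delete removes the safety net on scale-down.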

Anti-patterns and further reading.

Three anti-patterns recur often enough in production storage incidents that they are worth naming. First, treating PVs as pets. A PV is identified by a name the provisioner generates (pvc-7b4c1d2e…); the moment your runbook references that name, you have coupled an operational procedure to an opaque cloud-side identifier. The right reference is the PVC name in the namespace; the PV is a leaf in the binding chain and should not be hand-managed. The exception is recovering from a botched delete (clear claimRef, see Part 04), which is, by design, a manual procedure.

Second, using ReadWriteMany when you mean ReadWriteOnce. RWX volumes are NFS-shaped — eventually consistent, locking-aware, network-mediated — and almost no cloud-native database supports them safely. Postgres, MySQL, etcd, Mongo all expect a ReadWriteOnce block device. RWX exists for shared media (uploads, build artifacts, shared config) and even there it is usually the wrong primitive; an object store is almost always better. A surprising fraction of "Kubernetes is slow" incidents trace to NFS-backed RWX volumes saturating a single backing server.

Third, not pinning CSI driver versions across cluster upgrades. The CSI driver, the four sidecars, the snapshotter CRDs, and the kubelet's CSI client are four independent release vehicles, and combinations that worked in 1.27 may not work in 1.30. The snapshotter sidecar v8, for example, requires the v1 snapshot CRDs, which in turn require Kubernetes 1.20+, and every sidecar release states its own minimum Kubernetes and CSI spec versions. Pin every component, test the matrix in staging, and read the driver's release notes when you bump anything. The most common symptom of a mismatch is silent: dynamic provisioning fails, but pods using existing PVCs keep working, so you discover the breakage hours later when something tries to scale up.

A fourth, less common but worth a mention: treating storage capacity as homogeneous. Kubernetes exposes CSIStorageCapacity objects (GA in 1.24), which let drivers report per-topology capacity to the scheduler. Without them, the scheduler has no idea how much space is left in a given AZ, and over-provisioning a single AZ is invisible until pods start failing to bind. The fix is to pass --enable-capacity to the external-provisioner sidecar, set storageCapacity: true on the CSIDriver object, and let the scheduler's VolumeBinding plugin filter on capacity.
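
The objects the provisioner publishes are small. A sketch of one, assuming a hypothetical capacity-reporting driver with a StorageClass named foo-local and a topology label topology.foo.csi.vendor.io/zone; the field layout is that of the storage.k8s.io/v1 API, where the data fields sit at the top level rather than under spec.

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIStorageCapacity
metadata:
  name: csisc-example                # hypothetical; real names are generated
  namespace: kube-system             # created in the namespace of the driver's controller Pod
storageClassName: foo-local          # hypothetical class
nodeTopology:                        # which nodes this capacity figure applies to
  matchLabels:
    topology.foo.csi.vendor.io/zone: zone-a
capacity: 2Ti                        # space currently available in that topology segment
maximumVolumeSize: 500Gi             # largest single volume the driver can create there
```

The scheduler only consults these objects for WaitForFirstConsumer classes, since that is the only mode in which it chooses the node before the volume exists.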

And the rest of the Semicolony ladder: the kubelet sub-page covers the mount-side mechanics — NodeStageVolume, NodePublishVolume, the staging directory, the unmount on pod terminate; the architecture sub-page traces where the storage controllers fit inside the eight-process control plane; the controller pattern sub-page is the deep version of how the sidecars actually implement their reconcile loops. For the visceral side, the storage engine simulator lets you watch the binder, the provisioner, and the attacher run on a synthetic cluster.

One closing observation. Kubernetes' storage subsystem is what an out-of-tree story looks like when it succeeds. Six controllers, four CRD groups, two sidecar sockets, one well-defined gRPC contract — and the cluster does not know or care which storage vendor's driver is running. The kubelet binary has no EBS code. The controller-manager has no Ceph code. The scheduler has no AZ-affinity code; it has a generic VolumeBinding plugin that reads PV node affinities written by drivers it has never heard of. This is the architectural pattern that the rest of the Kubernetes ecosystem keeps trying to replicate, with mixed success — see networking (CNI is partly out-of-tree), authentication (webhooks are partly out-of-tree), devices (device plugins, half-baked). Storage is the gold standard, because the team accepted four years of migration pain to get there.
