What does maxSurge mean in Kubernetes?

maxSurge is the maximum number of pods that can be created above the desired replica count during a rolling update. With replicas=10 and maxSurge=25%, the percentage rounds up, so the rollout can run up to 3 extra pods (13 total) while old ones are still serving. That suits workloads where holding full capacity matters more than keeping the pod count fixed.

Why is my Kubernetes rollout stuck?

Three common causes: (1) new pods aren't reaching Ready (readiness probe failing; check the probe definition and recent logs); (2) new pods can't start at all (image pull errors or crash loops); (3) new pods can't be created or scheduled (resource quota, taints, or no spare capacity and a cluster autoscaler that can't add nodes). kubectl rollout status and kubectl describe deployment show the controller's view; kubectl get events shows the underlying complaint.

Kubernetes Rolling Update Simulator: step by step.

A rolling update swaps a Deployment's pods a few at a time so the app never goes fully down. Two knobs set the pace: maxSurge adds pods above the desired replica count, maxUnavailable lets some drop below it. Watch the orchestrator add and remove pods until the new generation has replaced the old.

[Interactive simulator. Controls: Replicas, maxSurge, maxUnavailable. Initial state: generation 1 (old), Ready 6/6, pods p1 to p6 all g1 READY.]

What you're looking at

The pod tiles are your fleet. Each shows its id, its generation (g1 is the old version, g2 the new), and its state: INIT for a pod still starting, READY once its probe would pass, TERM as it drains. The three controls up top set the desired replica count, maxSurge (how many extra pods may run), and maxUnavailable (how far ready can dip). Trigger rollout bumps the generation, and each Step advances the controller's reconcile loop by one move while the Ready counter tracks available capacity.

Set surge 1, unavailable 0 and step through: the controller always brings a new pod to READY before draining an old one, so the ready count never falls below your replicas. Then try surge 0, unavailable 1 and watch it drain first, dipping capacity instead of adding a pod. The case to try on purpose is surge 0 and unavailable 0. What should surprise you is that the rollout simply stops, because the controller is allowed neither to add a pod nor to remove one. A real API server rejects that combination outright, but the stalled state is what a real-world ProgressDeadlineExceeded looks like, reproduced in two clicks.

What is a Kubernetes rolling update?

Why you can't just restart everything at once.

A Kubernetes rolling update replaces pods one batch at a time — never killing all instances simultaneously, always keeping enough capacity to serve traffic. The Deployment controller drives the rollout via two knobs: maxSurge (how many extra pods to spin up) and maxUnavailable (how many existing pods can disappear). The defaults (25% each) work for most workloads; tune them when traffic shape demands.

Imagine you run a small web service. You have ten copies of it — called pods in Kubernetes — behind a load balancer, each handling its share of incoming traffic. You have a new version of the code ready to ship. The dumb option is to stop all ten pods, swap the image, and start them again. For about thirty seconds your site is down: every request that arrives while the pods are restarting fails with a 503. Your users see error pages; your monitoring fires alarms; your on-call gets paged. This is exactly what early container deployments looked like, and it is the problem rolling updates exist to solve.

The smarter option is to replace the pods one at a time. Stand up an eleventh pod with the new image; wait for it to start, finish its slow JVM warm-up, pass its health check, and start serving traffic; then stop one of the old pods. Repeat: bring up another new one, then drop another old one, until the old version has been fully replaced. The fleet never drops below ten healthy pods, and every request gets answered the whole time. This is a rolling update, and it is the default deployment strategy in Kubernetes.

Two numbers govern the choreography. maxSurge says how many extra pods are allowed during the rollout — the “eleventh pod” in the example above. maxUnavailable says how many of the ten can be missing or not yet ready at any moment. With surge=1, unavailable=0 you keep full capacity but pay briefly for an extra pod; with surge=0, unavailable=1 you never pay extra capacity but accept a brief 9-out-of-10 dip. Both extremes are correct in different situations; the simulator above lets you watch them play out step by step.

Behind the curtain there is a controller — a piece of code inside the cluster that watches the desired state in etcd and continuously reconciles it against the live pods. The Deployment object you create with kubectl apply is just a record. The controller is the part that makes the record true, by spawning new pods, terminating old ones, and waiting for readiness probes between each step. The next sections trace what that controller actually does, what failure modes the rollout can get stuck in, and the progressive-delivery tooling that takes the same ideas further with traffic-shifting and metric-driven rollback.

How a Deployment rollout works — controller loop, not one-shot

A controller loop, not a one-shot.

A Kubernetes Deployment is not a process; it is a record of desired state in etcd, reconciled by a controller. The Deployment object owns one or more ReplicaSet objects; each ReplicaSet owns a homogeneous set of Pod objects. When you change the Pod template — bumping the image tag, modifying an environment variable, adjusting a resource request — the Deployment controller computes a hash of the new template, compares it to the current ReplicaSet's hash, and on mismatch creates a fresh ReplicaSet with the new template plus a fresh hash. From that moment on, the controller increases replica count on the new ReplicaSet and decreases it on the old, observing surge and unavailability constraints, until the old ReplicaSet has zero replicas.

The mechanism has been stable since the GA of the apps/v1 Deployment API in Kubernetes 1.9 (December 2017). Before that, Deployments were served from extensions/v1beta1, apps/v1beta1 (from 1.6), and apps/v1beta2 (from 1.8); the rolling-update behaviour was essentially the same, though some defaults differed. The choice to keep the old ReplicaSet around, rather than deleting it once it scaled to zero, is what makes kubectl rollout undo trivial: the controller copies the old ReplicaSet's pod template back into the Deployment, scales that ReplicaSet up, and scales the current one down. The default revision history retained by a Deployment is ten ReplicaSets (spec.revisionHistoryLimit); teams that want a longer rollback window raise it, at the cost of more ReplicaSet objects kept in etcd.

A handful of commands form the operational vocabulary. kubectl rollout status deployment/app blocks until the rollout completes and returns a non-zero exit code if it fails; a Deployment that makes no progress within progressDeadlineSeconds (default 600, ten minutes) is marked Progressing=False with reason ProgressDeadlineExceeded. kubectl rollout history lists the revisions the Deployment has produced. kubectl rollout undo rewinds to the previous revision (or a named one via --to-revision). kubectl rollout pause and kubectl rollout resume halt and restart progress, the canonical mechanism for a poor-man's canary, where you pause after the first new pod is up, eyeball metrics, and resume.
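
For orientation, this is roughly what the stalled state looks like in the Deployment's status when read back with kubectl get deployment -o yaml. It is a sketch of the relevant excerpt only; the Deployment name web and the ReplicaSet hash are made up for illustration.

```yaml
# Excerpt of a Deployment's status after the progress deadline has passed.
# The name "web" and the ReplicaSet hash are hypothetical.
status:
  conditions:
    - type: Available
      status: "True"                    # old pods may still be serving traffic
      reason: MinimumReplicasAvailable
    - type: Progressing
      status: "False"
      reason: ProgressDeadlineExceeded
      message: ReplicaSet "web-6d4cf56db6" has timed out progressing.
```

Note that the controller only reports the condition; it does not roll back on its own, so something (a person, CI, or a GitOps tool) still has to run the undo.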

The Deployment's reconciliation runs every time an event hits the controller's watch: pod ready, pod terminated, pod evicted, ReplicaSet scaled. Brendan Burns and Joe Beda's Kubernetes Up & Running (O'Reilly, third edition 2022) and Brendan Burns, Brian Grant, David Oppenheimer, Eric Brewer, and John Wilkes' Borg, Omega, and Kubernetes retrospective (CACM, May 2016) describe this controller pattern in depth: declarative desired state, level-triggered reconciliation, monitor-the-cluster-not-the-event-stream. The pattern is what makes Deployments resilient to apiserver restarts, network blips, and out-of-order events, properties that earlier schedulers (Marathon, Chronos on Mesos) often lacked.

maxSurge and maxUnavailable — two numbers, one trade

Two numbers, one trade.

The RollingUpdate strategy exposes two knobs. maxSurge is the upper bound on how many pods may exist above spec.replicas during a rollout. maxUnavailable is the upper bound on how far the number of available pods may fall below spec.replicas. Both default to 25% and accept absolute counts or percentages. Percentages round in opposite directions: surge rounds up, so a rollout can always make progress, and unavailable rounds down, so availability is never cut by more than asked.

For a Deployment with replicas: 10, the default 25%/25% means up to 13 pods exist during the rollout (10 + ceil(2.5)) and at least 8 available pods are guaranteed (10 - floor(2.5)). For surge: 0, unavailable: 1 the controller never exceeds 10 pods but accepts a brief 9-ready dip; this is the “no extra capacity” setting common on cost-sensitive clusters with strict resource quotas. For surge: 1, unavailable: 0 the controller never drops below 10 ready, paying for an extra pod throughout; this is the “no capacity dip” setting common on user-facing workloads. For surge: 100%, unavailable: 0 the controller doubles the fleet briefly and scales the old pods down as the new ones become available; this is the fastest rollout that keeps full capacity, and the most expensive.
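
The rounding is easier to keep straight when written next to the fields themselves. The fragment below is a sketch of the default percentages for a 10-replica Deployment, with the arithmetic in comments; only the strategy stanza is shown.

```yaml
# Deployment spec fragment, assuming replicas: 10.
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%         # ceil(10 * 0.25) = 3  -> at most 13 pods in total
    maxUnavailable: 25%   # floor(10 * 0.25) = 2 -> at least 8 pods available
```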

surge: 0, unavailable: 0 is a configuration error: the rollout could not make any progress, because adding a new pod requires surge and removing an old pod requires unavailable. The API server rejects the combination at validation time (maxUnavailable may not be 0 when maxSurge is 0), so the mistake shows up as a failed apply rather than as a hung rollout. surge: 0, unavailable: 100% behaves much like strategy: Recreate: tear all old pods down, then bring all new ones up, accepting downtime. Only Recreate, though, waits for the old pods to terminate fully before creating new ones, so use it when two versions must never run simultaneously.

The interaction with readinessProbe is the load-bearing detail. A pod is counted “available” only after it has been ready for at least minReadySeconds (default 0, often raised to 10–30 for warm-up-sensitive services). Until then, the controller waits before terminating the next old pod. A buggy or absent readiness probe turns RollingUpdate into a downtime event because the controller cannot tell whether the new pod is actually serving traffic. The Kubernetes documentation's Configure Liveness, Readiness and Startup Probes page enumerates the probe mechanisms: httpGet (returns 200–399), tcpSocket (TCP connect succeeds), exec (command exits 0), and, in newer releases, grpc (the gRPC health check reports SERVING). The startup probe added in Kubernetes 1.16 (September 2019) decouples slow first-time initialisation from the steady-state readiness check, which matters for JVM applications, large ML model loads, and anything with multi-minute warm-up.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web             # required in apps/v1; must match the template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # never exceed 12 pods
      maxUnavailable: 1    # never drop below 9 ready
  minReadySeconds: 15      # buffer after readiness before counting as available
  progressDeadlineSeconds: 600
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: web
          image: example/web:1.4.2
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 5
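
For slow-starting services, the startup probe mentioned above can sit alongside the readiness probe. The following container fragment is a sketch only: the path, port, and thresholds are illustrative assumptions, not recommended values.

```yaml
# Container fragment (goes under spec.template.spec.containers).
# Path, port, and thresholds are illustrative assumptions.
- name: web
  image: example/web:1.4.2
  startupProbe:
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 10
    failureThreshold: 30     # tolerate up to 30 * 10s = 300s of start-up
  readinessProbe:            # only starts running once the startup probe has passed
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 5
```

The design point is that the generous budget applies to start-up only; once the pod is running, the readiness probe keeps its tight period.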

Pod termination — SIGTERM, drain, grace period, SIGKILL

SIGTERM, drain, SIGKILL.

When a rollout decides to terminate an old pod, nothing calls kill -9 immediately. The shutdown sequence is deliberately long-winded to allow in-flight requests to drain. Once the pod is marked for deletion, two things start in parallel. On the control-plane side, the pod is removed from the Service's endpoint list and kube-proxy reprograms iptables/IPVS on every node, which usually takes a few seconds but depends on the kube-proxy mode and cluster load. On the node, the kubelet runs the pod's preStop hook if one exists (an HTTP callback or a command that lets the application drain gracefully), then sends SIGTERM to PID 1 in each container. Finally, once terminationGracePeriodSeconds (default 30s, counted from the start of termination and so including preStop time) has elapsed, the kubelet sends SIGKILL to anything still running.

The race condition that bites every team at least once is between the SIGTERM and the kube-proxy reconfiguration. If your application stops accepting new connections immediately on SIGTERM but the load balancer is still routing requests to it, those requests fail. The standard mitigation is a preStop hook that sleeps for 10–15 seconds before SIGTERM, giving kube-proxy time to drain endpoints. preStop: { exec: { command: ["sleep", "15"] } } is the canonical incantation, and it needs a sleep binary in the container image. The same idea shows up in proxy sidecars: the Datawire / Ambassador docs spell it out for Envoy, and Istio's sidecar keeps draining for a default of 5 seconds after it is told to stop.
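
In a manifest, the hook sits under the container's lifecycle field. The fragment below is a minimal sketch; the 15-second sleep and 45-second grace period are illustrative values, chosen so that the grace period covers the sleep plus the application's own shutdown.

```yaml
# Pod template fragment (goes under spec.template.spec).
# The 15s sleep and 45s grace period are illustrative values.
terminationGracePeriodSeconds: 45   # must cover the preStop sleep plus app shutdown
containers:
  - name: web
    image: example/web:1.4.2
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "15"]  # requires a sleep binary in the image
```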

For long-running connections — gRPC streams, WebSockets, server-sent events — the 30-second default grace period is often too short. Raise it to 60–180 seconds for stateful workloads. For very-long-lived connections (file uploads, background jobs), consider draining via an explicit application protocol (refuse new work, wait for in-flight to complete) rather than relying solely on grace period.

The Pod Disruption Budget (PDB) is the operator's mechanism to constrain voluntary disruptions: node drains, cluster autoscaler scale-down, and other evictions through the eviction API. minAvailable: 80% on a Deployment with 10 replicas guarantees that evictions leave at least 8 pods ready even when a node is being drained. PDBs do not affect rolling updates, since the Deployment controller is itself responsible for maintaining availability during a rollout, but they do bound how aggressively a cluster operator can drain nodes underneath you. Without a PDB, kubectl drain can take all your pods down at once if you happen to have all replicas on one node. With a PDB, drain is forced to migrate them gradually.
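
A budget matching the 80% example might look like the sketch below. It assumes the Deployment's pods carry the label app: web, which is a hypothetical label chosen for illustration.

```yaml
# Assumes the protected pods are labelled app: web (hypothetical).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 80%        # with 10 replicas: evictions must leave 8 healthy pods
  selector:
    matchLabels:
      app: web
```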

preStop sleep, the most copy-pasted snippet in Kubernetes

When the kubelet sends SIGTERM, the pod has already been marked for removal from the Service endpoints, but kube-proxy hasn't necessarily reprogrammed iptables on every node yet. A 10–15s preStop: exec sleep hook gives the data plane time to catch up. Without it, some requests are still routed to a pod that is shutting down and fail during every rolling update.

StatefulSets and DaemonSets — other workload kinds, other rules

Other workload kinds, other rules.

Deployments are the right primitive for stateless services, but Kubernetes has two other workload kinds whose rollout semantics differ in important ways. A StatefulSet manages pods with stable identities (db-0, db-1, db-2) and stable persistent volume claims. Its rollout strategies are RollingUpdate and OnDelete. RollingUpdate on a StatefulSet replaces pods in reverse ordinal order — db-N first, working back to db-0 — one at a time, waiting for each to become ready before moving to the next. There is no surge: at most one pod is being updated at any moment.

The reverse-ordinal order matters because of the convention that db-0 is often the leader (in single-master databases) or the bootstrap node (in clusters where db-1 joins by talking to db-0). Updating leader-first risks all replicas joining a freshly upgraded leader that crashes; updating leader-last lets you observe the new image on followers before risking the leader. The partition field in the StatefulSet update strategy lets you cap how far the rollout goes: partition: 2 means only pods with ordinal >= 2 are updated, leaving 0 and 1 on the old image. This is the pattern for canary on stateful workloads: start with a high partition and lower it step by step as confidence grows.
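
As a sketch, a partitioned rollout on a four-replica StatefulSet could be configured like this. The replica count is an illustrative assumption, and the pod names in the comments follow the db-N convention used above.

```yaml
# StatefulSet spec fragment; the replica count is illustrative.
spec:
  replicas: 4
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2   # db-3 and db-2 move to the new template; db-1 and db-0 stay put
```

Lowering partition to 1 and then 0 lets the remaining pods follow, one ordinal at a time.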

The OnDelete strategy disables automatic rollout entirely; replacements happen only when an operator deletes a pod manually. This is useful for human-driven sequenced upgrades where automation cannot be trusted — database cluster upgrades, custom controllers, anything where the order of replacement matters and the controller's heuristics are insufficient.

A DaemonSet runs one pod per node (or per labelled node) and is used for log collectors (Fluent Bit, Vector), metrics agents (Datadog, Prometheus node-exporter), CSI drivers, CNI plugins, and security scanners. Its rolling update walks nodes in unspecified order, deleting and recreating pods on each. The controller honours maxUnavailable, which caps how many nodes' agents may be unavailable simultaneously (default 1). Like the Deployment controller, it deletes pods directly rather than through the eviction API, so PodDisruptionBudgets do not gate a DaemonSet rollout. OnDelete is also available for DaemonSets, with the same human-driven semantics.
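
For a DaemonSet the same knob lives under updateStrategy rather than strategy. The fragment is a sketch; 10% is an illustrative value for a large fleet where updating one node at a time would take too long.

```yaml
# DaemonSet spec fragment; 10% is an illustrative value.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%   # roughly a tenth of the nodes may lose their agent at once
```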

Jobs and CronJobs do not have rolling updates; they run to completion. A Job's pod template is immutable, so the replacement story for a Job is to delete and re-create it. The replacement story for a CronJob is simply to update spec.jobTemplate; the next scheduled run picks up the new image.

When a rollout refuses to finish — readiness, PDBs, stuck pods

When the rollout refuses to finish.

The most common failure is image pull failure. A typo in the image tag, a missing imagePullSecret for a private registry, a registry outage: the new pods stay in ImagePullBackOff indefinitely. kubectl describe pod shows the reason; kubectl get events --sort-by='.lastTimestamp' shows the timeline. The Deployment controller creates as many new pods as maxSurge and maxUnavailable allow and then stalls, leaving the rest of the old ReplicaSet running. kubectl rollout undo gets you back to known-good.

The second is readiness-probe regression. A new image's probe endpoint changed from /healthz to /health but the Deployment manifest was not updated; pods start, never become ready, and the rollout stalls. progressDeadlineSeconds eventually marks the Deployment Progressing=False and Argo CD or your CI surfaces the failure.

The third is the stuck-on-creating failure where ReplicaSet creation succeeds but pods never come up. Causes include resource quota exhaustion (the namespace has hit its CPU or memory quota), node taints with no matching tolerations on the new pod template, PodSecurityPolicy (removed in Kubernetes 1.25) or its successor Pod Security Admission rejecting the pod, or admission webhooks timing out. Each shows a distinctive failure mode in kubectl get events: FailedCreate from the ReplicaSet controller, FailedScheduling from the scheduler, or admission errors from the apiserver.

The fourth is etcd watch resync storms. When the apiserver restarts or etcd comes under memory pressure, every controller in the cluster re-lists and re-evaluates its state. Large clusters with thousands of pods can spend minutes catching up, during which rollouts pause. Newer Kubernetes releases have reduced this exposure, for example through watch bookmarks (on by default since 1.16) and better list pagination. Operators of large multi-tenant clusters tune --default-watch-cache-size and split etcd events across multiple instances to manage the resync cost.

The fifth is the zero-budget strategy, where surge: 0 meets unavailable: 0 and the controller has no legal move. The API server rejects that combination at validation time, so it surfaces as a failed kubectl apply or a failed CI deploy rather than as a hung rollout. The fix is to raise at least one of the knobs, or to switch to the Recreate strategy if overlap between versions is the real concern.

The sixth, less common but spectacular, is configmap or secret drift. The new pod template references a configmap that was deleted, renamed, or contains an incompatible value; pods crash-loop on startup, the rollout stalls, and rollback restores the prior pod template but not the prior configmap. Always version configmaps alongside Deployments: the standard pattern is suffixing configmap names with a content hash so that a configmap change forces a Deployment template change, which forces a rollout. Kustomize's configMapGenerator does this automatically, and Helm charts commonly get the same effect with a checksum annotation on the pod template; hand-written manifests miss it routinely.
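
With Kustomize, the content-hash suffix comes from a generator entry. The kustomization below is a minimal sketch; the file names and the ConfigMap name web-config are hypothetical.

```yaml
# kustomization.yaml; file names and the ConfigMap name are hypothetical.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml          # refers to the ConfigMap by its base name, web-config
configMapGenerator:
  - name: web-config
    files:
      - app.properties       # editing this file changes the generated name suffix
```

Kustomize rewrites the reference inside the Deployment to the suffixed name, so a config change alters the pod template and triggers an ordinary rollout, and a rollback points back at the old, still-existing ConfigMap.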

Argo Rollouts and Flagger — progressive delivery

Argo Rollouts, Flagger, analysis templates.

A native Deployment is dumb: it advances as long as readiness probes pass, but it has no concept of error rate, p99 latency, or business metric. Argo Rollouts (originally from Intuit) and Flagger (originally from Weaveworks, now part of the Flux project) extend the Deployment model with progressive delivery: shift a percentage of traffic to the new version, query a metrics backend for SLI compliance, advance or rollback based on the analysis, repeat.

Argo Rollouts replaces the Deployment kind with its own Rollout CRD, which supports both canary and blue-green strategies plus an AnalysisTemplate that defines what metric to query and the pass/fail thresholds. The metric providers include Prometheus, Datadog, New Relic, Wavefront, CloudWatch, Graphite, Apache SkyWalking, and arbitrary HTTP/Kayenta backends. A typical canary spec runs five steps — 5% / 25% / 50% / 75% / 100% — with a 5-minute analysis at each step querying success_rate > 99.5% and p99_latency < 250ms. If either fails at any step, the rollout aborts and reverses.
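
A canary along those lines might be sketched as follows, using a background analysis that runs while the steps advance. The names, image, Prometheus address, metric name, and query are all hypothetical; only the success-rate check is shown, and a latency check would be a second entry under metrics.

```yaml
# Names, image, Prometheus address, and query are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  replicas: 10
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: example/web:1.4.2
  strategy:
    canary:
      analysis:                      # background analysis; a failure aborts the rollout
        templates:
          - templateName: success-rate
        startingStep: 1              # begin once the first weight has been set
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 75
        - pause: { duration: 5m }    # after the last step the rollout goes to 100%
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.995
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{app="web",code!~"5.."}[1m]))
            /
            sum(rate(http_requests_total{app="web"}[1m]))
```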

Flagger takes a different shape: it sits on top of an unchanged Deployment plus a service mesh or ingress controller (Istio, Linkerd, AWS App Mesh, Open Service Mesh, Contour). Flagger creates a primary Deployment alongside the canary Deployment and shifts traffic via mesh weights or Gateway API HTTPRoute weights. The metrics integration is similar: Prometheus by default, with adapters for Datadog, New Relic, Dynatrace, Stackdriver. Flagger's webhooks let teams plug in load tests, smoke tests, and approval gates between steps.
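
The Flagger equivalent is a Canary resource that points at the existing Deployment. This is a sketch under assumptions: the names, port, and thresholds are hypothetical, and it presumes a traffic provider that Flagger supports is already installed.

```yaml
# Names, port, and thresholds are hypothetical.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  service:
    port: 8080
  analysis:
    interval: 1m          # how often to evaluate the metrics
    threshold: 5          # failed checks tolerated before rolling back
    stepWeight: 10        # traffic added to the canary per step
    maxWeight: 50         # weight at which the canary is promoted
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99         # percent of requests that must succeed
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 250        # milliseconds
        interval: 1m
```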

The 2018 paper Continuous Delivery 2.0 by Qian Liu et al, plus the Canarying Releases chapter in the SRE Workbook (O'Reilly 2018), shaped the analytic vocabulary now standard across both projects: baseline, canary, SLI (service-level indicator), SLO-based gating, error budget burn-rate alerting. The Cloud Native Computing Foundation's Progressive Delivery whitepaper (2020) catalogues the patterns; Spinnaker's Kayenta service (open-sourced by Netflix and Google in 2018), used internally at Netflix and externally via Spinnaker, was an early implementation that influenced both Argo and Flagger.

The decision between the two is largely a question of what already routes your traffic. If you already run Istio or Linkerd, Flagger fits naturally. If you don't run a mesh, Argo Rollouts is the lower-friction choice because its basic canary approximates traffic weights through replica counts with no router at all, and it can drive NGINX Ingress or AWS ALB target-group weights when finer control is needed. Both tools are mature; the GitOps integration with Argo CD or Flux closes the loop from commit to canary to promote without human intervention.

Production teams that ship many times per day pair progressive delivery with a manual kill switch: a button or one-line command that aborts the rollout in under five seconds. The Argo Rollouts kubectl plugin provides kubectl argo rollouts abort; Flagger has no CLI of its own, so a manual rollback is usually wired up through its rollback webhook. The on-call's first reflex during an incident is to look at the rollout state and abort if a recent deploy is in flight; making that abort fast is the difference between a five-minute partial-traffic incident and a thirty-minute full-traffic incident.

Strategy	Both versions live?	Capacity overhead	Rollback speed
RollingUpdate	yes — overlapping	maxSurge worth of pods	moderate — re-roll old image
Recreate	no — kill all, then start	none	slow — full restart
Blue/Green	yes — both fleets warm	2× replicas during cutover	instant — flip Service selector
Canary (Argo, Flagger)	yes — 1–10% canary	small — canary subset only	fast — abort on metric breach
Feature flag	single image, both paths	none	instant — flip flag

When RollingUpdate is the wrong shape — blue/green and canary

When RollingUpdate is the wrong shape.

Rolling updates are the right tool for stateless backwards-compatible services. They are the wrong tool when the new version cannot coexist with the old — database migrations that change column semantics, breaking API protocol changes, singleton background jobs, anything where two simultaneous versions produce data corruption.

For incompatible deploys, the safe pattern is blue/green. Run two complete fleets, blue (current) and green (new), behind a Service or load balancer. Once green is fully up and verified, flip the Service selector to point at green; once blue is no longer receiving traffic and a grace period has passed, scale it down. Rollback is a single selector flip back to blue. The cost is doubled infrastructure during the cutover window. Argo Rollouts implements blue/green natively via the blueGreen strategy in the Rollout CRD; the older alternative is a pair of Deployments and a Service whose selector your CI tooling rewrites.
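
In the hand-rolled variant, the cutover is a one-line change to the Service. The sketch below assumes two Deployments whose pods are labelled version: blue and version: green; the label names and ports are hypothetical.

```yaml
# Labels, names, and ports are hypothetical; blue and green are separate Deployments.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: green   # was "blue"; changing this line is the cutover, and the rollback
  ports:
    - port: 80
      targetPort: 8080
```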

For partial deploys where the new version is enabled per-user, per-account, or per-region, the pattern is feature flags. The pod runs both code paths; a runtime flag service (LaunchDarkly, Optimizely, Unleash, Flagsmith, ConfigCat, GrowthBook, or an open-source OpenFeature implementation) decides which path the request takes. Feature flags decouple deploy from release: the binary ships at one tempo, the user-facing change rolls out at another. Pete Hodgson's 2017 article Feature Toggles on martinfowler.com and the book Feature Management with LaunchDarkly are the standard references.

The trio — rolling updates, blue/green, feature flags — covers nearly every production deployment shape. Rolling for routine backwards-compatible changes; blue/green for incompatible cuts; feature flags for fine-grained or staged user-facing changes. Mature shops combine all three: ship a new image with a rolling update, gate the new behaviour behind a feature flag, blue/green when the database schema changes underneath. Each tool addresses a different risk; conflating them produces architectures that get the worst of every world. Burns and Beda's Kubernetes Up & Running, the SRE workbook's deployment chapter, and the Argo Rollouts and Flagger user guides remain the canonical references for sorting through the choice on any given service.