How autoscaling works across pods, nodes, and clusters.
Pods, nodes, clusters. Each scales on its own loop, against its own signal, with its own lag. Pretending it's one knob is the most expensive mistake.
What is autoscaling?
Match capacity to demand, automatically.
Autoscaling is the practice of adjusting capacity to match demand without manual intervention. In Kubernetes there are three layers: HPA (Horizontal Pod Autoscaler — adds/removes pods), VPA (Vertical Pod Autoscaler — resizes pod resources), and the Cluster Autoscaler (adds/removes nodes). Each operates on different signals and at different speeds.
Provision for peak and you pay for idle eight hours a day. Provision for average and you crash on the spike. Autoscaling is the third option: a control loop that watches a signal, compares to a target, and adjusts capacity. The cost is operational complexity — the loop has to be tuned, observed, and trusted.
Three loops to know. Pod-level (HPA): the Horizontal Pod Autoscaler adds or removes replicas of a pod. Workload-level (VPA): the Vertical Pod Autoscaler resizes CPU and memory requests. Cluster-level (Cluster Autoscaler): provisions new nodes when pods cannot schedule. Each runs independently; each has its own lag.
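The "watch a signal, compare to a target, adjust" loop can be made concrete with the replica formula from the Kubernetes HPA documentation: desired = ceil(current × metric / target), skipped when the ratio is within a tolerance of 1.0. The following is a minimal sketch of that arithmetic only; the function name is made up for illustration, and the 0.1 tolerance is the documented default.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count the documented HPA formula would ask for."""
    ratio = current_value / target_value
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

With 4 replicas at 90% CPU against a 60% target, `desired_replicas(4, 90, 60)` asks for 6. At 63% the ratio is inside the tolerance band and the count stays at 4.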
The three layers: pod, node, and cluster
Pick a layer. Each behaves differently.
The three loops solve different problems. Most production systems run all three.
Horizontal — more replicas of the same thing.
When CPU rises, add another pod or VM running the same image. Identical replicas; load balancer fans out. Linear scaling for stateless workloads. Slow path: image pull + warm-up. Fast path: pre-baked images, lazy init.
Why CPU is rarely the right autoscaling signal
It spikes after the queue has already backed up.
HPA defaults to CPU because it is universally available and cheap to compute. But CPU is a poor proxy for "is the service overloaded". A web server may pin CPU and still hold tail latency steady; a worker may sit at 30% CPU and be queueing minutes of work.
The right metric is the one that actually correlates with degradation. Queue depth for workers. p99 latency for user-facing services. Active connections for chat/streaming. Use a custom metrics adapter (Prometheus + KEDA, Datadog, etc.) to expose them to HPA.
# HPA targeting a custom queue-depth metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-worker              # hypothetical name
spec:
  scaleTargetRef:                  # hypothetical target; the API requires one
    apiVersion: apps/v1
    kind: Deployment
    name: orders-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: rabbitmq_queue_depth
          selector: { matchLabels: { queue: orders } }
        target: { type: AverageValue, averageValue: "30" }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid yo-yo
      policies: [{ type: Percent, value: 25, periodSeconds: 60 }]

Why scaling up takes a long minute
Metrics, pod startup, and node provisioning all add delay.
HPA reads metrics every 15 seconds and acts on a window. New pods take 10–60 seconds to be ready (container image pull, app warm-up, JIT, connection-pool init). Cluster autoscaler is much slower — 60–120 seconds to bring a new node online. End-to-end: a sudden 4× traffic spike will hurt for 90+ seconds before steady state returns. A load balancer in front of the pods determines how that ramp distributes.
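The delays in the paragraph above add up roughly as follows. This is a minimal sketch: the stage names are made up for illustration, and the 0 to 15 second wait for the next HPA sync is an assumption derived from the 15-second interval. The other ranges are the ones quoted above.

```python
# Per-stage delay ranges in seconds: (best case, worst case).
STAGE_DELAYS = {
    "wait_for_hpa_sync": (0, 15),   # assumption: the next sync is 0-15 s away
    "pod_ready": (10, 60),          # image pull, warm-up, connection-pool init
    "node_provision": (60, 120),    # only paid when no existing node has room
}


def scale_up_lag(needs_new_node):
    """Return (best, worst) seconds from spike to new capacity being ready."""
    stages = ["wait_for_hpa_sync", "pod_ready"]
    if needs_new_node:
        stages.append("node_provision")
    best = sum(STAGE_DELAYS[s][0] for s in stages)
    worst = sum(STAGE_DELAYS[s][1] for s in stages)
    return best, worst
```

`scale_up_lag(False)` returns `(10, 75)` and `scale_up_lag(True)` returns `(70, 195)`, which is why a spike that needs new nodes hurts for well over a minute.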
Three levers shrink it: headroom (always run more replicas than current demand needs), predictive scaling (schedule changes ahead of known patterns), and burstable capacity (Karpenter, AWS Fargate Spot). Don't expect autoscaling alone to absorb instant 5× spikes; pair it with pre-warmed capacity.
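Headroom can be sized with one line of arithmetic. The sketch below assumes you know the per-pod capacity from load testing and have picked a spike multiplier you want to survive without waiting for the autoscaler. All names are hypothetical.

```python
import math


def replicas_with_headroom(current_rps, rps_per_pod, spike_factor):
    """Replicas to keep running so a spike of spike_factor x current load still fits."""
    return math.ceil(current_rps * spike_factor / rps_per_pod)
```

At 1,000 RPS with pods that each handle 100 RPS, surviving a 1.5× jump means holding `replicas_with_headroom(1000, 100, 1.5)` = 15 replicas instead of 10. The extra five are the price of not waiting out the lag.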
A lunchtime spike, minute by minute
Where each second of lag actually goes.
Walk the clock on a real spike. At 11:58 office workers open the app and traffic starts climbing; by 12:04 it has doubled. The pods feel it right away — CPU rises, request queues build — but the autoscaler doesn't. Metrics-server scrapes kubelets on its own interval, and HPA reads the aggregated result on a 15-second sync loop. The number HPA acts on at 12:04 describes the cluster as it looked up to half a minute earlier.
At 12:05 the arithmetic finally crosses the threshold and HPA asks for six more replicas. Scale-up has no stabilization window by default, so the request goes out at once — but the scheduler can only place two of the six. The other four sit Pending: every node is full. HPA adds pods, not machines; for machines it can only wait.
The cluster autoscaler notices the Pending pods on its next scan (every ten seconds), picks a node group, and asks the cloud for a VM. The instance boots, joins the cluster, pulls images — 60 to 120 seconds, more if the image is heavy. The four pods schedule at 12:07, warm up, and pass readiness around 12:08. Lunch traffic peaked at 12:06. The capacity you paid for arrives in time to serve the decline.
Every knob you can turn edits this story in a different place, and none of them moves capacity earlier:
- Stabilization window: HPA's stabilizationWindowSeconds. On scale-up it defaults to zero; raise it and the 12:05 reaction waits out the window first, trading speed for smoothness. On scale-down it defaults to 300 seconds, which is why the extra replicas hold through the noisy 12:30 lull instead of collapsing on the first quiet scrape.
- Cooldown: the pause after one scaling action before the next (ASG cooldowns; HPA's periodSeconds policies). At 12:06, when six replicas turn out not to be enough, cooldown decides how long the second correction has to wait behind the first.
- Scale-down delay: Cluster autoscaler's scale-down-delay-after-add and scale-down-unneeded-time (about ten minutes by default). They decide how long the new node survives after traffic falls. Too short and the node is gone before the 13:00 stragglers; generous and it is still warm for them, at the price of idle metal in between.
The lag itself — scrape interval plus sync loop plus boot time — is fixed; the knobs only shape behavior around it. And lunch happens at the same hour every day. A scheduled scale-up at 11:45 beats every reactive tuning above, which is the honest argument for predictive scaling on any spike you can see coming.
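A scheduled scale-up amounts to a time-dependent floor on the replica count, with the reactive autoscaler free to go higher. The sketch below uses a hypothetical schedule and made-up replica numbers; only the 11:45 start comes from the text.

```python
from datetime import time

# Hypothetical schedule: (start, end, replica floor during that window).
SCHEDULE = [(time(11, 45), time(13, 15), 12)]
DEFAULT_FLOOR = 4  # hypothetical off-peak minimum


def replica_floor(now):
    """Minimum replicas to hold at wall-clock time `now`."""
    for start, end, floor in SCHEDULE:
        if start <= now < end:
            return floor
    return DEFAULT_FLOOR
```

On a real cluster the same idea is usually expressed declaratively, for example with KEDA's cron scaler or a scheduled job that raises the HPA's minReplicas before the spike and lowers it afterward.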
Horizontal pod autoscaling (HPA) and the scaling yo-yo
Add a pod, CPU drops, remove a pod, repeat.
The classic autoscaling pathology: scale up because CPU is high, the new pod absorbs load, CPU drops below threshold, scale down. Repeat. Pods churn; cold starts hurt; nothing steady. A warm cache on the new pod is what closes the gap.
Fixes: a wider band between scale-up and scale-down thresholds, a stabilization window (HPA stabilizationWindowSeconds), or a different metric that doesn't dip immediately on adding capacity (queue depth holds for a beat; CPU drops instantly).
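The stabilization-window fix can be sketched in a few lines. On scale-down, HPA does not act on the latest recommendation alone; it uses the highest recommendation seen within the window, so a brief dip cannot remove pods. This is a simplified model of that behavior, not the controller's actual code, and the class and method names are made up.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hold replicas at the highest recommendation seen in the last window."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, raw recommendation)

    def recommend(self, now, raw_desired, current):
        self._history.append((now, raw_desired))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        if raw_desired >= current:
            return raw_desired  # scale-up is not delayed in this sketch
        # Scale-down: only go as low as the window's highest recommendation.
        return min(current, max(d for _, d in self._history))
```

With a 300-second window, a drop from 10 to 6 recommended replicas is ignored until every recommendation of 10 has aged out. Only then does the count fall.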
Vertical pod autoscaling (VPA) and the restart it costs
Resizing memory requires restart.
VPA recommends new CPU and memory requests for your pods based on actual usage. Recommendation mode is safe; auto mode rolls pods to apply the change. That means restarts — VPA in auto on a stateful workload during peak is a recipe for sadness.
Most teams run VPA in recommendation mode and apply the suggestions in the next deploy. The wins are real (right-sizing avoids both OOM and waste), but the surprise restarts of auto mode usually outweigh them.
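To give a feel for what a usage-based recommendation looks like, here is a deliberately simplified stand-in: take a high percentile of observed usage and add a safety margin. This is not VPA's actual recommender, which works from decaying histograms; the 90th percentile and 15% margin are illustrative assumptions.

```python
import math
from typing import Sequence


def recommend_request(samples: Sequence[float], percentile: int = 90,
                      margin: float = 0.15) -> float:
    """Suggest a resource request from usage samples (same unit in and out)."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: the value at or below which `percentile`% of samples fall.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * (1.0 + margin)
```

Using a percentile rather than the maximum keeps one outlier from inflating the request. In recommendation mode a number like this is applied by hand in the next deploy, so nothing restarts unexpectedly.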
Cluster autoscaling: adding nodes just in time
Just-in-time node provisioning.
Cluster Autoscaler scales pre-defined node groups — fixed instance type, fixed shape. Karpenter (AWS, then ported elsewhere) flips the model: it watches pending pods, picks the best-fit instance type from a wide catalog, and provisions it directly. No node groups.
The result is faster (it skips the ASG step) and tighter-fitting (the right shape per workload). Many AWS Kubernetes shops have moved to Karpenter for this reason; the savings on right-sizing alone can pay for the migration within a quarter.
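The "pick the best-fit instance type from a catalog" idea can be illustrated with a toy selector. This is a sketch of the concept only, not Karpenter's algorithm, which also bin-packs many pods and honors scheduling constraints. The catalog entries, names, and prices below are invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu: float
    memory_gib: float
    hourly_cost: float


# Invented catalog for illustration; not real instance types or prices.
CATALOG = [
    InstanceType("small", 2, 8, 0.10),
    InstanceType("medium", 4, 16, 0.20),
    InstanceType("memory-heavy", 4, 32, 0.26),
    InstanceType("large", 8, 32, 0.40),
]


def cheapest_fit(pending_cpu: float, pending_mem_gib: float,
                 catalog: Sequence[InstanceType] = CATALOG) -> Optional[InstanceType]:
    """Cheapest instance type that can hold the pending pods' total requests."""
    candidates = [i for i in catalog
                  if i.cpu >= pending_cpu and i.memory_gib >= pending_mem_gib]
    return min(candidates, key=lambda i: i.hourly_cost, default=None)
```

A fixed node group would answer every request with the same shape. Here a memory-heavy backlog of 3 CPU and 20 GiB lands on "memory-heavy" rather than paying for "large".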
Better signals: RPS, queue depth, and p99 latency
What to scale on when CPU is the wrong answer.
CPU lags real load. Better signals, with the systems that expose them:
- Requests per second (RPS): the most accurate signal for stateless web services. Scale when RPS-per-pod exceeds the capacity you measured under load testing. Exposed by ingress controllers and proxies (NGINX, Envoy) and surfaced to HPA through the Kubernetes external metrics API. KEDA's prometheus scaler is the easiest way to wire this up.
- Queue depth: for workers consuming a queue (Kafka, SQS, RabbitMQ), the right signal is the lag (Kafka), the visible message count (SQS), or the queue depth (RabbitMQ). KEDA ships scalers for all three out of the box; scaling on queue depth gives you near-perfect tracking of producer rate.
- p99 latency: for when the application's perceived performance matters more than its CPU usage. Scale when p99 > target_latency. Trickier than RPS: it needs a steady-state RPS for the latency to mean anything, so it's typically combined with RPS as a secondary signal.
- Concurrent connections: for long-lived WebSocket / SSE servers, since RPS is meaningless. Pod memory ÷ memory per connection gives the saturation point; scale at 70% of saturation to leave room for new arrivals.
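Each of these signals turns into a replica count or a per-pod target with simple arithmetic. The sketch below assumes per-pod capacities that you would measure yourself; the function names and the numbers in the usage note are illustrative.

```python
import math


def replicas_for_rps(total_rps, rps_per_pod):
    """Replicas needed so no pod exceeds its load-tested RPS capacity."""
    return max(1, math.ceil(total_rps / rps_per_pod))


def replicas_for_queue(queue_depth, messages_per_replica):
    """Replicas needed to hold the average backlog per replica at the target."""
    return max(1, math.ceil(queue_depth / messages_per_replica))


def connection_target(pod_memory_mib, memory_per_conn_mib, utilization=0.7):
    """Per-pod connection count to scale at: a fraction of the saturation point."""
    saturation = pod_memory_mib / memory_per_conn_mib
    return math.floor(saturation * utilization)
```

For example, 2,500 RPS at 400 RPS per pod needs 7 replicas; a backlog of 300 messages with a target of 30 per replica needs 10; and a 2,048 MiB pod at 0.5 MiB per connection saturates at 4,096 connections, so the 70% target is 2,867.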
The KEDA model. KEDA (Kubernetes Event-driven Autoscaling) is the de facto standard for non-CPU autoscaling on Kubernetes. It runs as a separate operator that watches an external metric (Kafka lag, SQS depth, a Prometheus query, an AWS CloudWatch metric) and feeds it to the standard Kubernetes HPA via the external metrics API. More than sixty prebuilt scalers ship out of the box; the scalers catalog has the full list.
How big sites autoscale in production
Three case studies.
Pinterest — KEDA + Karpenter. Pinterest scales ~5,000 microservices, mostly via KEDA on RPS metrics from Envoy, with Karpenter (rather than the cluster autoscaler) provisioning nodes. Their public engineering writeups report ~30% cost savings from switching to Karpenter for spiky workloads, and ~40% faster scale-up than the legacy cluster autoscaler.
Stripe — predictive scaling. Stripe's API tier autoscales on a combination of inbound RPS and a 7-day rolling forecast. The forecast component handles predictable spikes (Black Friday, payday Mondays in major timezones) by pre-warming capacity. Stripe described the architecture in its engineering writeups on API traffic forecasting; the predictor uses gradient-boosted decision trees on hour-of-week and recent-trend features.
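AWS Lambda: invocation-driven, no autoscaler to configure. Lambda doesn't use HPA-style autoscaling. Each concurrent invocation runs in its own execution environment (a Firecracker microVM); a warm environment is reused for later invocations, and a new one is cold-started only when none is free. Scale is implicit: 1,000 simultaneous requests against a cold function means up to 1,000 cold starts (mitigated by provisioned concurrency for latency-sensitive paths). The architecture is the polar opposite of HPA but solves the same problem.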
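For the Lambda case, the number of execution environments in use follows from request rate and duration: concurrency is roughly requests per second times average duration in seconds. The sketch below shows that arithmetic; the function name is made up for illustration.

```python
import math


def required_concurrency(requests_per_second, avg_duration_seconds):
    """Approximate number of execution environments busy at steady state."""
    return math.ceil(requests_per_second * avg_duration_seconds)
```

At 200 requests per second and 0.5 seconds per invocation, about 100 environments are busy at once. That is the figure to compare against a provisioned-concurrency setting.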
Autoscaling is a control system, and like every control system, it loves to oscillate. Spend the time to pick a metric that correlates with what hurts users, set wide bands, and reserve headroom for spikes. Cheap to operate; expensive to ignore.
Further reading on Kubernetes autoscaling
Primary sources, in order.
- Kubernetes docs: HorizontalPodAutoscaler. The canonical reference. Behavior, metrics, stabilization windows.
- karpenter.sh: Karpenter docs. Modern node provisioning. Worth the read even if you stay on Cluster Autoscaler.
- Semicolony guide: Load balancing. The other half of "more capacity": once you have it, distribute the work.
- Semicolony guide: Kubernetes pod creation. What actually happens when HPA decides to add a pod.