Pod Eviction Visualizer: when something has to go.

Pod eviction is how Kubernetes reclaims a node under pressure or makes room for higher-priority work: the kubelet (or scheduler) picks a victim pod and terminates it. The rules of how it picks decide which workloads survive.

Simulator state: memory 70%, 6 pods running

Pods on this node · sorted by priority
pri=1000 critical-payment running
pri=500 web-frontend-a running
pri=500 web-frontend-b running
pri=100 batch-job-x running
pri=100 batch-job-y running
pri=0 best-effort-z running
Recent evictions: none yet

What you're looking at

One node, six pods, sorted by priority from critical-payment at pri=1000 down to best-effort-z at pri=0. The memory counter top right is the node's pressure gauge; the threshold sits at 90%. The two buttons are the two ways a pod dies: "+ Memory pressure" nudges the gauge up 8 points at a time, and "+ Schedule pri=800 pod" simulates an urgent workload arriving on a full node. The log underneath records each kill and why.

Click "+ Memory pressure" a few times. Nothing happens at 78%, nothing at 86%, then the gauge crosses 90 and best-effort-z is gone, reason NodePressure. Keep going and the kubelet works its way up the priority ladder: batch jobs next, web frontends after that. The part that should surprise you is what eviction never touches: critical-payment survives every round, and the pri=800 scheduling button refuses to preempt it either. In this simulator priority is not a tiebreaker; it is the whole decision.
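The click-through above can be sketched in a few lines of Python. This is a toy model of the simulator, not of the kubelet: the 8-point step, the 90% threshold and the pod list come from the page, while the `PROTECTED` cutoff, the name-based tie-break and the fact that the gauge does not drop after an eviction are assumptions made for illustration.

```python
from dataclasses import dataclass, field

THRESHOLD = 90    # eviction threshold on the gauge, in percent
STEP = 8          # points added per "+ Memory pressure" click
PROTECTED = 1000  # assumption: pods at or above this priority are never evicted


@dataclass
class Node:
    memory: int = 70
    pods: dict = field(default_factory=lambda: {
        "critical-payment": 1000,
        "web-frontend-a": 500,
        "web-frontend-b": 500,
        "batch-job-x": 100,
        "batch-job-y": 100,
        "best-effort-z": 0,
    })
    log: list = field(default_factory=list)

    def add_memory_pressure(self):
        """One click: raise the gauge, then evict at most one pod if over threshold."""
        self.memory = min(100, self.memory + STEP)
        if self.memory < THRESHOLD:
            return None
        candidates = [(pri, name) for name, pri in self.pods.items() if pri < PROTECTED]
        if not candidates:
            return None
        # Lowest priority goes first; ties break on name (assumption, for determinism).
        _, victim = min(candidates)
        del self.pods[victim]
        self.log.append((victim, "NodePressure"))
        return victim


node = Node()
print([node.add_memory_pressure() for _ in range(4)])
# -> [None, None, 'best-effort-z', 'batch-job-x']
```

Keep clicking and the list empties from the bottom until only critical-payment is left.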


What is Kubernetes pod eviction?

A node fills up. Someone has to die.

Kubernetes pod eviction is how the cluster decides which pods to terminate when a node runs out of resources or an operator drains it. Eviction interacts with QoS classes (Guaranteed, Burstable, BestEffort), PodDisruptionBudgets, priority and preemption, and the kernel's OOMKiller. Misconfigured eviction is one of the top causes of unscheduled production incidents.

Imagine you run a Kubernetes cluster — hundreds of containers spread across a few dozen worker nodes. Each node has a fixed budget of memory: say 32 GB. The scheduler tries hard to place pods so the requested memory fits, but pods don't always behave: a request says “I need 1 GB”, and the actual workload reaches 1.5 GB once it warms up. Multiply that by a few co-located pods on the same node and the node's total usage starts climbing past safe limits.

Once free memory drops below a critical line, something has to give. The naive answer, “just refuse new allocations”, doesn't work because Linux can't tell a process “you can't have more memory” without crashing it. The kernel's only tool when memory truly runs out is the OOM killer: pick a process and kill it. If Kubernetes does nothing, the kernel picks by its own scoring heuristic, which knows nothing about business importance, and the wrong pod might die: say, the one running the cluster's payment ingestion while a low-priority batch job sails on.

This is the problem kubelet eviction solves. The kubelet, the Kubernetes agent on each node, checks a handful of pressure signals (memory.available, nodefs.available, imagefs.available, pid.available) every ten seconds. When any signal crosses a threshold, the kubelet picks one or more pods to evict and terminates them, freeing resources before the kernel has to step in. Eviction is structured. The kubelet first targets pods whose usage exceeds their requests, ranking them by pod priority and then by how far over request they are; only after that does it turn to pods running within their requests. Each pod also carries a quality-of-service (QoS) class derived from its resource requests and limits, and the classes line up with that ranking: a BestEffort pod (no requests, no limits) is over its request the moment it uses anything, so it always lands in the first group; a Burstable pod joins that group once it grows past its request; Guaranteed pods go last.
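A minimal sketch of that ranking, assuming the three-step order described in the Kubernetes node-pressure eviction documentation (over request first, then priority, then size of the overage). The `PodStats` type and the pod names are hypothetical, and the real eviction manager works from cAdvisor stats rather than a hand-built list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodStats:
    name: str
    priority: int      # from the pod's PriorityClass
    request_mib: int   # memory request; 0 for a BestEffort pod
    usage_mib: int     # current memory working set


def eviction_order(pods):
    """Return pod names, first to be evicted first."""
    def rank(p):
        over = p.usage_mib - p.request_mib
        return (
            0 if over > 0 else 1,  # 1. pods over their request come first
            p.priority,            # 2. then lower priority first
            -over,                 # 3. then the largest overage first
        )
    return [p.name for p in sorted(pods, key=rank)]


pods = [
    PodStats("within-request", 0, 4096, 3072),
    PodStats("frontend", 500, 512, 600),
    PodStats("burst-small", 0, 100, 110),
    PodStats("best-effort", 0, 0, 300),
    PodStats("burst-big", 0, 1024, 1536),
]
print(eviction_order(pods))
# -> ['burst-big', 'best-effort', 'burst-small', 'frontend', 'within-request']
```

Note that the pod using the most memory in absolute terms (within-request) is the last to go, because it has stayed inside what it asked for.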

The simulator above shows that decision tree in motion. Pods on a node, sorted by priority. A memory-pressure event fires and the kubelet picks a pod. Under a soft threshold the pod gets a SIGTERM and a grace period before SIGKILL; under a hard threshold it is killed with no grace period. The pod's controller (Deployment, StatefulSet) notices the loss and creates a replacement, which the scheduler places on another node. From the application's perspective, an evicted pod looks much like a node failure, and that's deliberate. Kubernetes' eviction is the controlled, predictable cousin of the kernel's blunt OOM kill. The kubelet is one piece of a larger machine; Kubernetes internals covers how it fits alongside the API server, scheduler, and controllers.

Why this matters in numbers: a typical 32 GB node holds 30–50 small pods. A single misbehaving pod that allocates an extra 8 GB triggers eviction; the kubelet has to pick one or two pods to kick off. If you set requests and limits properly, that pod will be a low-importance batch job. If you don't, it'll be your payment ingestion. The single most impactful piece of Kubernetes hygiene is “set memory requests on every container”, because the request value is what determines who survives the squeeze.

[Figure: kubelet eviction decision for a node under memory pressure. When memory.available < 100Mi: evict BestEffort pods first; then Burstable pods, sorted by usage over request in descending order; Guaranteed pods as a last resort. The victim receives SIGTERM with a 30 s grace period.]

QoS classes — Guaranteed, Burstable, BestEffort — one eviction queue

Three classes, one queue.

Kubernetes assigns every pod a Quality-of-Service class derived from the relationship between its requests and limits across all containers. The class determines the pod's position in the kubelet's eviction queue when the node runs out of memory, disk, or PIDs. The kubelet kills the bottom of the queue first; QoS is the most influential single signal in that ordering.

Class | Requirements | Eviction order
Guaranteed | Every container has CPU and memory requests equal to its limits. | Last. Highest priority for resources.
Burstable | At least one container has a request or limit, but the pod does not meet the Guaranteed criteria. | Middle. Ordered by “memory used over request.”
BestEffort | No requests, no limits. | First. Any usage at all exceeds its (zero) request.
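The classification rule in the table is small enough to write out. The sketch below assumes containers are plain dicts with optional `requests` and `limits` maps, and it compares quantities as strings, so "1Gi" and "1024Mi" would wrongly count as different; the real API server parses quantities first.

```python
def qos_class(containers):
    """Classify a pod from its containers' requests and limits."""
    any_set = False
    guaranteed = True
    for c in containers:
        req = c.get("requests", {})
        lim = c.get("limits", {})
        if req or lim:
            any_set = True
        for res in ("cpu", "memory"):
            limit = lim.get(res)
            # Kubernetes defaults a missing request to the limit when a limit is set.
            request = req.get(res, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))                                          # BestEffort
print(qos_class([{"requests": {"memory": "50Mi"}}]))            # Burstable
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))   # Guaranteed
```

The second call is the "even a small request" case: one 50Mi memory request is enough to leave BestEffort.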

Internally, QoS class translates to the Linux kernel's oom_score_adj value, which the kubelet applies to each container's processes. Guaranteed pods get an oom_score_adj of −997 in current releases (older releases used −998), which makes them essentially exempt from the OOM killer. Burstable pods get a value computed from 1000 − (1000 × container_memory_request / node_capacity), clamped to the range 2 to 999, so small-request pods get higher scores. BestEffort pods get an oom_score_adj of 1000, the kernel's signal for “kill me first.” When the kernel OOM killer fires, it picks the highest-scored process, so QoS class effectively determines the kill order at kernel level too.
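As a rough illustration, the mapping can be written as one function. The Burstable formula and the 2 to 999 clamp follow the text; the Guaranteed constant is an assumption that depends on the Kubernetes version (−997 in current documentation, −998 in older releases), so treat it as a parameter rather than a fact about your cluster.

```python
GUARANTEED_ADJ = -997   # assumption: current releases; older ones used -998
BEST_EFFORT_ADJ = 1000


def oom_score_adj(qos, memory_request_bytes=0, node_capacity_bytes=1):
    """Sketch of the kubelet's QoS -> oom_score_adj mapping."""
    if qos == "Guaranteed":
        return GUARANTEED_ADJ
    if qos == "BestEffort":
        return BEST_EFFORT_ADJ
    # Burstable: the larger the request relative to the node, the lower the score.
    raw = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    return min(max(2, raw), 999)


GiB = 1024 ** 3
for request in (0.05, 1, 16):
    print(request, oom_score_adj("Burstable", int(request * GiB), 32 * GiB))
```

On a 32 GiB node a 1 GiB request scores 969 and a 16 GiB request scores 500, which is the "small requests score higher" effect in numbers.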

The Burstable middle is where most production pods sit, and where eviction order matters most. The kubelet sorts Burstable pods by memory usage relative to request: a pod requesting 1 GB and using 1.5 GB ranks ahead of a pod requesting 100 MB and using 110 MB, and both rank ahead of a pod requesting 4 GB and using 3 GB, even though that last pod uses the most memory in absolute terms. The reasoning: a pod within its request is “promised” that memory; a pod over its request has voluntarily expanded into shared headroom and is fair game when that headroom evaporates.

The actionable lesson: set requests for everything. A BestEffort pod is the cluster's sacrificial lamb, among the first to go the moment any neighbour misbehaves. Even a small request (50 MB memory, 50m CPU) lifts a pod out of BestEffort into Burstable, dramatically improving its survival odds. The Burns & Beda book Kubernetes Up & Running (O'Reilly, 3rd ed. 2022) makes this point bluntly: “every container should have a memory request, full stop.” Production clusters that follow this rule see far fewer surprise evictions.

A subtle point about CPU versus memory: CPU is compressible, memory is not. A pod exceeding its CPU limit is throttled (its cgroup CPU quota caps it) but not killed; a pod exceeding only its CPU request simply competes for spare cycles. A container exceeding its memory limit is killed, because the kernel cannot “throttle” memory back to fit. This is why memory limits are dangerous and CPU limits are merely annoying: a too-low memory limit means OOMKilled in production; a too-low CPU limit means slow but alive. The recurring advice is to set memory limits to a generous multiple of expected use and CPU limits sparingly or not at all (since CPU requests already determine the share under contention).


Node pressure — five pressures, two thresholds (soft and hard)

Five pressures, two thresholds.

The kubelet samples node-level signals every 10 seconds (the --housekeeping-interval default) and applies eviction when any of them crosses a threshold. The eviction manager loop in pkg/kubelet/eviction runs at the same cadence; in practice the lag from threshold-crossed to pod-evicted is 10–30 seconds. A soft threshold must stay crossed for a configurable per-signal grace period before the kubelet acts, and the evicted pod then gets a termination grace period; a hard threshold skips both and kills the pod immediately.
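The difference between the two threshold types is easiest to see on a timeline. The sketch below is a simplification under stated assumptions: one signal, samples arriving in time order, and hypothetical threshold values passed in by the caller. It returns the first moment the kubelet would act and which threshold caused it.

```python
def first_eviction(samples, hard_mib, soft_mib, soft_grace_s):
    """samples: iterable of (seconds, available_mib) in time order."""
    soft_since = None
    for t, avail in samples:
        if avail < hard_mib:
            return (t, "hard")          # hard threshold: act immediately
        if avail < soft_mib:
            if soft_since is None:
                soft_since = t          # soft threshold crossed: start the timer
            if t - soft_since >= soft_grace_s:
                return (t, "soft")      # crossed for the whole grace period
        else:
            soft_since = None           # pressure relieved: the timer resets
    return None


# Samples every 10 s; soft threshold 300 MiB with a 30 s grace period.
timeline = [(0, 500), (10, 250), (20, 250), (30, 250), (40, 250)]
print(first_eviction(timeline, hard_mib=100, soft_mib=300, soft_grace_s=30))
# -> (40, 'soft')
```

A dip that recovers before the grace period ends never triggers a soft eviction, which is the point of having the timer.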

  • memory.available
    Node capacity minus the working set, which roughly means free memory plus file cache the kernel considers inactive. Default hard threshold: 100Mi. Below this, the kubelet evicts at most one pod per loop iteration.
  • nodefs.available / nodefs.inodesFree
    Disk space and inode count on the root filesystem holding the kubelet state. Default 10% / 5% remaining. Container logs accumulating in /var/log/pods are usually the culprit.
  • imagefs.available / imagefs.inodesFree
    Disk for container images and layers (when separated from nodefs). Default hard threshold for imagefs.available: 15%. Image GC of unused images runs first, then pod eviction.
  • pid.available
    Process-ID exhaustion in the node's PID namespace. Less common, but a fork bomb in any pod can take the node down without this signal.

OOMKilled vs Evicted — different events

A pod over its own memory limit is killed by the kernel OOM killer with status OOMKilled, and the container restarts in place per its restart policy. A pod evicted because the node ran out of memory is not moved: the kubelet stops it and marks it Failed with reason Evicted and a message naming the resource under pressure, and its controller creates a replacement that the scheduler places elsewhere. Same outcome (the pod stops); different runbook. The metric to watch is kube_pod_status_reason in kube-state-metrics.

Memory pressure has a hidden second source: kernel cgroup memory accounting. cgroup v1 counts page-cache pages used by a container against its memory limit; cgroup v2 (default on most modern distributions since 2022) is more nuanced but can still surprise applications that depend on the OS page cache. A Postgres pod with limits.memory: 2Gi and an actively-read 4 GB working set can OOMKill repeatedly even though the resident set fits, because the kernel counts cached file pages against the limit and cannot always reclaim them fast enough. The usual fix is raising the limit to leave room for the cache the workload needs. Note that effective_cache_size is only a planner hint; tuning it does not switch Postgres to direct I/O or change how much page cache it uses.

One particularly devious failure mode: free pages versus reclaimable pages. A node can show 8 GB “available” in /proc/meminfo while the kubelet's eviction signal reads 100 MB available. The kubelet computes memory.available as node capacity minus the working set, and the working set (from the cAdvisor stats endpoint) subtracts only inactive file pages from usage. Active file-cache pages therefore count as in use, even though the kernel could reclaim many of them. The working set approximates “memory the kernel cannot quickly reclaim.” Operators monitoring node-level dashboards in Grafana sometimes report “eviction without pressure” incidents that turn out to be a misreading of which metric the kubelet uses.
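A small worked example makes the gap concrete. The kubelet-side formula (capacity minus working set, where working set is usage minus inactive file pages) follows the Kubernetes documentation; the node figures are hypothetical, and `dashboard_view` is only a crude stand-in for a dashboard that treats all file cache as reclaimable.

```python
def kubelet_memory_available(capacity_mib, usage_mib, inactive_file_mib):
    """memory.available as the kubelet sees it: capacity minus working set."""
    working_set = max(0, usage_mib - inactive_file_mib)
    return capacity_mib - working_set


# Hypothetical 32 GiB node whose page cache is mostly *active* file pages.
capacity = 32 * 1024
free = 168
active_file = 7800
inactive_file = 200
usage = capacity - free

kubelet_view = kubelet_memory_available(capacity, usage, inactive_file)
dashboard_view = free + active_file + inactive_file  # assumption: all cache reclaimable

print(kubelet_view, dashboard_view)
# -> 368 8168
```

The same node reads as roughly 8 GB free on one screen and under 400 MB on the other, and only the second number decides eviction.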

[Figure: kubelet eviction signals, measured every 10 s and compared to hard thresholds: memory.available 100Mi, nodefs.available 10%, imagefs.available 15%, pid.available 10%. Red zone = hard threshold crossed, evict immediately with no grace period.]

PodDisruptionBudget and priority — two ways to protect a pod

Two ways to protect a pod.

QoS class is computed from requests and limits. PriorityClass is an explicit override: assign a numeric priority to a pod, and the scheduler will preempt lower-priority pods to make room when no node has free capacity. PriorityClass was introduced in Kubernetes 1.8 (September 2017) and reached GA in 1.14 (March 2019). System pods (kube-proxy, CNI agents, CSI drivers) ship with priorities in the billions; your application defaults to zero unless you set otherwise.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-realtime
value: 100000
globalDefault: false
description: For payment ingestion. Always preempt low-priority workloads.

Preemption works in the kube-scheduler, not in the kubelet. When a high-priority pod cannot be scheduled, the scheduler runs the preemption algorithm: find a node where evicting some lower-priority pods would free enough resources, and emit DELETE calls for those victims. Victims are graceful by default — they receive SIGTERM, the grace period (default 30 s) elapses, then SIGKILL. The high-priority pod is then re-evaluated on the next scheduler cycle. The whole cycle takes 30–60 seconds in the typical case.
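The victim search can be sketched as follows. This is a deliberately simplified stand-in for the scheduler's preemption logic: it only looks at one resource, ignores PDBs and affinity, and picks the node whose victims have the lowest top priority (then the fewest victims). The node layout and pod names are hypothetical.

```python
def pick_preemption(nodes, new_priority, new_request):
    """nodes: {name: {"free": int, "pods": [(pod_name, priority, request), ...]}}

    Returns (node_name, [victim names]) or None if preemption cannot help.
    """
    best = None
    for name, node in nodes.items():
        free = node["free"]
        victims = []
        # Only strictly lower-priority pods are eligible; take the lowest first.
        eligible = sorted((p for p in node["pods"] if p[1] < new_priority),
                          key=lambda p: p[1])
        for pod in eligible:
            if free >= new_request:
                break
            victims.append(pod)
            free += pod[2]
        if free < new_request or not victims:
            continue  # either still does not fit, or fits without preemption
        score = (max(p[1] for p in victims), len(victims))
        if best is None or score < best[0]:
            best = (score, name, [p[0] for p in victims])
    return None if best is None else (best[1], best[2])


nodes = {
    "node-a": {"free": 0, "pods": [("critical-payment", 1000, 4), ("web", 500, 2), ("batch", 100, 2)]},
    "node-b": {"free": 1, "pods": [("web-b", 500, 2), ("best-effort", 0, 1)]},
}
print(pick_preemption(nodes, new_priority=800, new_request=2))
# -> ('node-b', ['best-effort'])
```

A pri=800 pod can never displace critical-payment here, because only pods with a strictly lower priority are considered as victims.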

PodDisruptionBudget goes the other way: it tells the eviction API how many replicas of a workload must remain ready when voluntary disruptions happen — node drains, autoscaler scale-downs, cluster upgrades. PDBs are honoured by the eviction API (/eviction subresource) and by kubectl drain; they are not honoured by the kubelet's node-pressure eviction. A node failing under memory pressure will kill whatever the eviction manager picks, PDB or no PDB.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2     # never drop below 2 ready replicas
  selector:
    matchLabels: { app: web }

A useful default for production: PriorityClass for tiered importance, PDB with minAvailable: N−1 for stateful workloads (so at most one replica is voluntarily disrupted at a time), and topology-spread constraints ensuring replicas land in different availability zones. The combination protects against rolling upgrades, autoscaler oscillations, and most node-level failures — though, again, not against a runaway memory leak that triggers node-pressure eviction.

One trap worth flagging: a PDB with minAvailable equal to the replica count is unsatisfiable, so no voluntary disruption can ever proceed. kubectl drain retries indefinitely unless given a --timeout; cluster upgrades stall. The error mode is silent unless you read the eviction API's response, which rejects each attempt with 429 Too Many Requests. The right value is minAvailable equal to the replica count minus one, or equivalently maxUnavailable: 1. Production teams have learned this rule the hard way often enough that some policy tools are configured to reject unsatisfiable PDBs.
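The arithmetic behind that trap is one subtraction. The helper names below are hypothetical; the real disruption controller also handles percentages and maxUnavailable, which this sketch leaves out.

```python
def disruptions_allowed(ready_replicas, min_available):
    """How many pods the eviction API may disrupt right now."""
    return max(0, ready_replicas - min_available)


def can_evict(ready_replicas, min_available):
    return disruptions_allowed(ready_replicas, min_available) > 0


print(disruptions_allowed(3, 2))  # 1: one replica at a time may be drained
print(disruptions_allowed(3, 3))  # 0: every eviction is refused, drains stall
```

With three replicas, minAvailable: 2 lets a drain proceed one pod at a time, while minAvailable: 3 blocks it completely.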

[Figure: priority preemption, scheduler decision flow (kube-scheduler). New pod with pri = 100k; no node fits; find a victim node; victims are the lowest-priority pods that free enough room for the new pod; SIGTERM the victims with a 30 s grace period; re-schedule on the next cycle. A PDB blocks API eviction but does NOT block preemption.]

OOMKill — when the kernel picks instead

When the kernel picks instead.

Below the Kubernetes layer sits the Linux Out-Of-Memory killer, a kernel mechanism that fires when a memory allocation cannot be satisfied even after page reclaim. The OOM killer scans every process in the OOM-eligible cgroup, computes an oom_score for each (based on RSS plus swap, then shifted up or down by oom_score_adj), and SIGKILLs the highest-scored. The decision happens in milliseconds; there is no SIGTERM, no grace period, no opportunity for the application to flush.
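To see why oom_score_adj dominates, here is a simplified sketch of the kernel's badness calculation. It is modelled on the heuristic in mm/oom_kill.c as I understand it: memory footprint in pages, plus oom_score_adj as an additive offset proportional to total memory. The process sizes are hypothetical and the adj values are example inputs.

```python
def oom_points(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Simplified badness score; higher means killed sooner."""
    if oom_score_adj == -1000:
        return None  # -1000 disables OOM killing for the process entirely
    points = rss_pages + swap_pages + pgtable_pages
    # Each adj point is worth one thousandth of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


TOTAL = 32 * 1024 * 1024 // 4          # 32 GiB node in 4 KiB pages
small_best_effort = oom_points(25_600, 0, 0, 1000, TOTAL)     # 100 MiB RSS
large_guaranteed = oom_points(2_097_152, 0, 0, -997, TOTAL)   # 8 GiB RSS

print(small_best_effort, large_guaranteed)
# -> 8413600 -6265684
```

The 100 MiB BestEffort process outscores the 8 GiB Guaranteed one by a wide margin, so the adjustment, not the footprint, decides who dies.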

The kubelet writes oom_score_adj for every container at start-up, with values determined by the QoS class as described earlier. Guaranteed containers get a near-immune score (−997 in current releases, −998 in older ones) so the kernel almost never kills them; BestEffort containers get +1000 so they are always the first victim. Burstable containers get a sliding score from 2 to 999, lower for pods with larger memory requests. The end result: the kernel's choice and the kubelet's choice usually agree, but the kernel acts faster (milliseconds rather than the kubelet's 10-second loop) and works even if the kubelet has crashed.

There are two cgroup-level OOM events: the per-container OOM (a container exceeded its own memory limit) and the per-cgroup-hierarchy OOM (the kubelet's parent cgroup or the system root ran out). The first kills only that container; the second can kill any container in the hierarchy. Kubernetes uses a per-pod cgroup hierarchy specifically so that container-level OOMs don't accidentally kill sibling pods. The detail is invisible most of the time but matters during incidents: if you see OOMKilled on a pod whose own container was nowhere near its limit, the cause is almost always a cgroup-hierarchy OOM upstream.

Tools like Brendan Gregg's oomkill (shipped as oomkill in bcc-tools and as oomkill.bt for bpftrace) trace every OOM kill as it happens, printing the triggering process, the victim, and the system load at that moment. Combined with the kernel log, the output is essential when debugging surprise pod restarts: the difference between “your container leaked” and “a noisy neighbour took your pod down” is one cgroup-hierarchy line in the OOM event log.

A modern complication: cgroup v2's memory.high mechanism enables a soft throttle, where a process exceeding the high watermark is slowed by reclaim pressure rather than killed. Kubernetes does not yet make full use of this. By default the kubelet maps memory limits to cgroup memory.max for hard kills, and support for memory.high sits behind the alpha MemoryQoS feature gate. If it becomes the default, the trade-off will be: pods that bump up against memory.high will run slowly but stay alive, at the cost of degrading neighbour latency through paging.

QoS class | Definition | oom_score_adj | Eviction order
BestEffort | no requests, no limits | +1000 | first
Burstable | requests < limits OR limits absent | +2 to +999 (sliding) | middle, by usage over request
Guaranteed | requests == limits for all containers | −997 (−998 in older releases) | last (almost never)
System pods | priorityClass system-cluster-critical or system-node-critical | node-critical pods get the Guaranteed value; the kubelet process itself runs at −999 | not evicted by the kubelet

API-driven eviction — drain, taints, the eviction API

Eviction from the API side.

Beyond kubelet-driven node-pressure eviction, the control plane has its own eviction paths. The node lifecycle controller in kube-controller-manager watches node status and applies NoExecute taints when a node becomes unhealthy: node.kubernetes.io/not-ready and node.kubernetes.io/unreachable. Per-condition taints for memory pressure, disk pressure, and PID pressure also exist, but they use the NoSchedule effect: they keep new pods off the node without evicting the ones already running.

A pod tolerates a NoExecute taint via its tolerations field, with an optional tolerationSeconds. If the pod doesn't tolerate, it is evicted immediately. If it tolerates with seconds, it has that long to finish before being evicted. Default tolerations for not-ready and unreachable are 300 seconds — the source of the famous “5-minute pod survival window” after a node disappears. You can override per-pod, but most workloads should not.
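The toleration rule reads naturally as a lookup. The sketch below is a simplification: it ignores taint values and the Equal operator's value match, and it returns the first matching toleration rather than the shortest one. The `DEFAULT_TOLERATIONS` list mirrors the two defaults named in the text.

```python
def eviction_delay(taint_key, tolerations):
    """Seconds until eviction for a NoExecute taint.

    Returns 0 for immediate eviction, None if the taint is tolerated forever.
    """
    for tol in tolerations:
        # A toleration with no effect set matches every effect.
        if tol.get("effect", "NoExecute") != "NoExecute":
            continue
        key = tol.get("key")
        if key == taint_key or (key is None and tol.get("operator") == "Exists"):
            return tol.get("tolerationSeconds")  # missing means "forever"
    return 0


DEFAULT_TOLERATIONS = [
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 300},
]

print(eviction_delay("node.kubernetes.io/unreachable", DEFAULT_TOLERATIONS))  # 300
print(eviction_delay("node.kubernetes.io/unreachable", []))                   # 0
```

The 300 returned for the default list is the five-minute survival window; a pod with no matching toleration gets no window at all.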

The Cluster Autoscaler evicts pods to scale down nodes; Karpenter (introduced by AWS in 2021, now CNCF) does similar work with a more aggressive algorithm that consolidates underutilised nodes. Both honour PDBs but ignore tolerations: a toleration only permits a pod to stay on a tainted node, so it cannot prevent an autoscaler-initiated drain. The descheduler (kubernetes-sigs/descheduler) is a separate controller that proactively evicts pods to rebalance the cluster: nodes that are over-utilised, pods violating affinity rules, pods running on tainted nodes. It typically runs as a CronJob (a Job or Deployment also works) and is opt-in.

Each eviction path uses a different mechanism. Kubelet eviction does not go through an API delete at all: the kubelet stops the pod's containers locally and sets the pod's status to Failed with reason Evicted. API-initiated eviction (drain, autoscaler) goes through the /eviction subresource, which checks PDBs and emits a graceful delete. Preemption-initiated eviction is a graceful delete with the deleting controller noted in the event. Reading kubectl get events after an incident reveals which path fired by the actor name on the event.

Karpenter deserves a closer look because it has gradually replaced Cluster Autoscaler in many production clusters. Rather than scaling node groups in response to unschedulable pods, it consolidates: it repeatedly asks “could this workload run on fewer nodes?” and proactively drains the loser. The default consolidation interval is 30 seconds; the default consolidation type evicts pods one node at a time honouring PDBs. The effect on a stable cluster is steady churn: pods get rescheduled every few hours as Karpenter shuffles them onto larger or cheaper instances. Workloads that can't tolerate frequent restarts (long-running batch jobs, simulators with warmup state) need the karpenter.sh/do-not-disrupt annotation as a kill-switch.


Eviction tuning — the flags that move the dial

The flags that move the dial.

The kubelet exposes several --eviction-* flags that operators rarely review until they have an incident. The defaults are conservative for a typical workload but wrong for several common shapes; production clusters should review them.

  • --eviction-hard
    Default: memory.available<100Mi,nodefs.available<10%,imagefs.available<15%,nodefs.inodesFree<5%. Raise the memory threshold on large nodes (the absolute 100Mi is meaningless on a 256 GB node).
  • --eviction-soft
    Off by default. Enables soft thresholds with grace periods. Useful when you want the kubelet to start evicting before hitting hard limits, giving SIGTERM a chance to save state.
  • --eviction-minimum-reclaim
    How much extra resource to reclaim once a threshold has been crossed. Default zero (the kubelet stops evicting as soon as the signal is back above the threshold). Setting memory.available=500Mi tells the kubelet to keep evicting until the signal is 500 Mi above the threshold, which is useful on memory-spiky clusters.
  • --eviction-pressure-transition-period
    How long the kubelet waits after pressure clears before it reports MemoryPressure=false again. Default 5 minutes. The delay stops the condition from flapping; lowering it to 30 s lets a recovered node accept new pods sooner, at the risk of oscillation.
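To make the flag syntax concrete, here is a small sketch that parses an --eviction-hard style string and checks it against observed values. It is an illustration only: it handles just the Ki/Mi/Gi suffixes and percentages, the `observed` figures are hypothetical, and `reclaim_target` assumes minimum reclaim is added on top of the threshold as the Kubernetes documentation describes.

```python
import re

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_thresholds(spec):
    """'memory.available<100Mi,nodefs.available<10%' -> {signal: (kind, value)}"""
    out = {}
    for item in spec.split(","):
        signal, raw = item.strip().split("<")
        if raw.endswith("%"):
            out[signal] = ("percent", float(raw[:-1]))
        else:
            m = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi)", raw)
            out[signal] = ("bytes", float(m.group(1)) * UNITS[m.group(2)])
    return out


def breached(thresholds, observed):
    """observed: {signal: (available, capacity)}; returns signals under threshold."""
    hits = []
    for signal, (kind, value) in thresholds.items():
        available, capacity = observed[signal]
        limit = value if kind == "bytes" else capacity * value / 100
        if available < limit:
            hits.append(signal)
    return hits


def reclaim_target(threshold_bytes, minimum_reclaim_bytes=0):
    # Eviction continues until the signal is back above threshold + minimum reclaim.
    return threshold_bytes + minimum_reclaim_bytes


DEFAULT_HARD = ("memory.available<100Mi,nodefs.available<10%,"
                "imagefs.available<15%,nodefs.inodesFree<5%")
observed = {
    "memory.available": (90 * 2**20, 256 * 2**30),
    "nodefs.available": (20e9, 100e9),
    "imagefs.available": (10e9, 100e9),
    "nodefs.inodesFree": (900_000, 1_000_000),
}
print(breached(parse_thresholds(DEFAULT_HARD), observed))
# -> ['memory.available', 'imagefs.available']
```

With a 100Mi threshold and a 500Mi minimum reclaim, `reclaim_target` gives 600Mi: the kubelet keeps evicting until memory.available is back above that figure.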

Two settings particularly worth knowing are --system-reserved and --kube-reserved: explicit reservations for system daemons and for the kubelet and container runtime. Without them, system processes (sshd, journald, cri-o) compete with pods for the same allocatable pool, and a runaway journald can take the kubelet down. Reserving 1 GB / 1 core for system + 0.5 GB / 0.5 core for kube on a 16 GB node leaves 14.5 GB; the kubelet also subtracts the hard eviction threshold, so pods see an allocatable of roughly 14.4 GB, which is what every requests calculation should use.
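The allocatable figure is plain subtraction, and it is worth noting that the hard eviction threshold is subtracted as well as the two reservations. The sketch below uses the 16 GiB example from the text and assumes the default 100Mi memory threshold.

```python
def allocatable_mib(capacity_mib, system_reserved_mib, kube_reserved_mib, eviction_hard_mib):
    """Node allocatable memory: capacity minus reservations minus eviction threshold."""
    return capacity_mib - system_reserved_mib - kube_reserved_mib - eviction_hard_mib


result = allocatable_mib(
    capacity_mib=16 * 1024,
    system_reserved_mib=1024,   # systemReserved memory: 1Gi
    kube_reserved_mib=512,      # kubeReserved memory: 512Mi (0.5 GiB)
    eviction_hard_mib=100,      # default memory.available<100Mi
)
print(result, round(result / 1024, 2))
# -> 14748 14.4
```

Raising the hard threshold to 500Mi, as large nodes often do, takes another 400 MiB out of what pods can request.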

Default tunings on managed services are a moving target. EKS, GKE, and AKS each ship slightly different kubelet configs out of the box. AWS reserves 255 MiB plus roughly 11 MiB per pod of the node's max-pods setting on EKS nodes by default; GKE reserves 25% of the first 4 GiB plus 20% of the next 4 GiB and so on (a sliding scale); AKS exposes most flags but defaults aggressively to leave 750 MiB free. Cross-cloud teams should read the per-vendor docs before assuming any setting; the gap between “allocatable I see in kubectl describe node” and “total node memory” is non-trivial and worth understanding when sizing requests.
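A sliding scale like GKE's is easier to reason about as code. Only the first two tiers (25% of the first 4 GiB, 20% of the next 4 GiB) come from the text above; the remaining tiers and the small-machine case are my reading of GKE's published documentation and should be checked against the current docs before you rely on them.

```python
GKE_TIERS = [  # (tier size in GiB, fraction reserved); later tiers are assumptions
    (4, 0.25),
    (4, 0.20),
    (8, 0.10),
    (112, 0.06),
    (float("inf"), 0.02),
]


def gke_reserved_gib(total_gib):
    """Memory reserved on a node of the given size, before the eviction threshold."""
    if total_gib < 1:
        return 255 / 1024  # assumption: flat 255 MiB on very small machines
    reserved, remaining = 0.0, total_gib
    for size, fraction in GKE_TIERS:
        chunk = min(remaining, size)
        reserved += chunk * fraction
        remaining -= chunk
        if remaining <= 0:
            break
    return reserved


for size in (8, 16, 32):
    print(size, round(gke_reserved_gib(size), 2))
```

On this scale a 16 GiB node gives up about 2.6 GiB, noticeably more than the 1.5 GiB reserved in the hand-tuned example earlier.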

# /var/lib/kubelet/config.yaml — production-grade tuning
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available:  "500Mi"      # tightened from default 100Mi
  nodefs.available:  "10%"
  imagefs.available: "15%"
  nodefs.inodesFree: "5%"
evictionSoft:
  memory.available:  "1Gi"        # gives SIGTERM a chance to flush state
evictionSoftGracePeriod:
  memory.available:  "1m30s"
evictionMinimumReclaim:
  memory.available:  "500Mi"      # free this much per eviction round
systemReserved:    { cpu: "1",  memory: "1Gi" }
kubeReserved:      { cpu: "500m", memory: "500Mi" }
evictionPressureTransitionPeriod: 30s

100Mi is meaningless on a 256GB node

The default memory.available<100Mi threshold was chosen for the 1 GB nodes of 2015. On a modern bare-metal 256 GB node, hitting 100Mi means the kernel is already thrashing on reclaim; production clusters should set this to 0.5–2 GiB depending on node size and the kernel's memory reclaim aggressiveness.


Pod eviction in a real incident — what survives

What survives a real incident.

A few patterns recur across production postmortems. The first is memory requests set far below limits: a developer sets limits.memory: 2Gi but leaves requests.memory at a token value such as 64Mi. (Omitting the request entirely is a different case: Kubernetes then defaults the request to the limit.) The pod becomes Burstable with a tiny request, gets scheduled densely, and is OOMKilled or evicted whenever a noisy neighbour shows up. The fix is requests ≈ expected steady-state, limits ≈ 1.5× that.

The second is missing PodDisruptionBudgets on stateful workloads. A node drain during a cluster upgrade can evict every replica of a stateful set in seconds; the application loses quorum; recovery takes hours. A simple minAvailable: N−1 PDB on every stateful workload prevents this. The cost is slightly slower drains.

The third is aggressive log volumes filling nodefs. Production pods that log verbosely to stdout fill /var/log/pods at multi-MB-per-second rates. Without log rotation or a shipper that drains in real time, a 50 GB ephemeral disk fills in hours (about seven at 2 MB/s) and triggers nodefs eviction. The fix is centralised log shipping (Fluent Bit, Vector, Promtail) plus aggressive logrotate and the --container-log-max-size kubelet flag.

The fourth is preemption cascades. A high-priority workload joins and preempts low-priority pods on Node A; those pods are immediately rescheduled on Node B; their requests cause Node B to push out pods of even lower priority; the cascade continues until the cluster stabilises. In a poorly-sized cluster, the cascade can take minutes and leave dozens of pods stuck in Pending. The fix is over-provisioning: keep 10–20% headroom by default, plus balloon pods at the lowest priority that can be preempted instantly to absorb scheduling pressure. Pinterest's engineering blog has a good write-up of this pattern under the name “low-priority placeholders.”

A fifth, easy-to-overlook pattern: liveness probes that lie. A liveness probe configured too aggressively (1-second timeout on a JVM that takes 90 seconds to warm up) restarts the pod every minute, looking like an eviction storm even though it is purely intra-pod. Production runbooks should treat “CrashLoopBackOff with no eviction events” as a probe-tuning issue, not a resource issue, until proven otherwise. The Kubernetes docs themselves advise using liveness probes with caution; readiness probes are usually the right tool, since a failing readiness probe just removes the pod from service rotation rather than killing it.

