Kubernetes resource sizing.
Set CPU and memory requests and limits, see the resulting QoS class, and read the cluster-wide totals at your replica count. Sizing is always an estimate; this is the place to start, not finish.
| Resource | Request total | Limit total |
|---|---|---|
| CPU | 0.30 cores | 1.50 cores |
| Memory | 0.38 GiB | 0.75 GiB |
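apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels;
  # "app: app" is a placeholder label.
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: ghcr.io/example/app:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi

Two numbers, two jobs.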
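A container's request is what the scheduler reserves on its behalf: the amount of CPU and memory subtracted from a node's allocatable capacity when the pod is placed there. The scheduler will not place a pod on a node whose unreserved allocatable capacity is smaller than the pod's total requests. For resource fit it looks only at requests, never at limits or live usage (affinity rules, taints and similar constraints are separate filters). Requests also act as a floor once the pod is running: under CPU contention the container still receives CPU time in proportion to its request, and a pod whose memory use stays within its request is among the last candidates for eviction when the node is under pressure.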
A container's limit is the cap on how much it can use, regardless of node availability. If the container tries to allocate beyond its memory limit, the Linux OOM killer fires and the container is terminated with exit code 137. If the container tries to use more CPU than its limit, the Completely Fair Scheduler (CFS) throttles it — the process keeps running but its threads are paused for slices of every 100ms scheduling window. Memory overruns are a hard failure; CPU overruns are a soft failure that hurts latency.
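To make the two enforcement paths concrete, here is the `resources` stanza from the manifest above, annotated with what each value typically becomes on a Linux node. Treat it as a sketch: it assumes the default 100 ms CFS period and a cgroup v2 node, and the file names in the comments differ on cgroup v1.

```yaml
resources:
  requests:
    cpu: 100m        # scheduling reservation; also a relative CPU weight under contention
    memory: 128Mi    # scheduling reservation; not enforced by the kernel as a cap
  limits:
    cpu: 500m        # CFS quota: 50 ms of CPU time per 100 ms period (cpu.max "50000 100000")
    memory: 256Mi    # hard cap: memory.max = 268435456 bytes; exceeding it ends in OOMKilled, exit 137
```

Exit code 137 is 128 + 9, meaning the process was ended by SIGKILL, which is how the OOM killer terminates it.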
The relationship between request and limit determines the QoS class. If every container in a pod sets requests equal to limits for both CPU and memory, Kubernetes labels the pod Guaranteed. If the pod falls short of that but at least one container sets a CPU or memory request or limit, the pod is Burstable. If no container sets any requests or limits, the pod is BestEffort. The QoS class is effectively the kubelet's tie-breaker when the node runs out of memory and has to evict pods to recover: BestEffort first, Burstable next, Guaranteed last (and only after everyone else has been evicted).
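The three classes can be illustrated with three minimal Pod manifests. The pod names are hypothetical and the sizes are reused from the example above purely for illustration.

```yaml
# Guaranteed: every container has requests == limits for CPU and memory
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
    - name: app
      image: ghcr.io/example/app:latest
      resources:
        requests: {cpu: 500m, memory: 256Mi}
        limits: {cpu: 500m, memory: 256Mi}
---
# Burstable: requests are set but sit below the limits
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
    - name: app
      image: ghcr.io/example/app:latest
      resources:
        requests: {cpu: 100m, memory: 128Mi}
        limits: {cpu: 500m, memory: 256Mi}
---
# BestEffort: no container sets any request or limit
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
    - name: app
      image: ghcr.io/example/app:latest
```

Once a pod is running, the assigned class can be read back with `kubectl get pod qos-burstable -o jsonpath='{.status.qosClass}'`.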
Compressible vs incompressible.
CPU is a compressible resource — when contended, the kernel scheduler shares it among requesting processes, and each just runs slightly slower. Memory is incompressible — once allocated, it can't be partially shared, and contention means somebody has to die. This asymmetry shapes everything about how to size requests and limits.
For CPU, many production Kubernetes operators have, since roughly 2019, moved away from setting CPU limits at all. The reason: CFS bandwidth throttling can pause threads even when the node has spare capacity, because the quota is enforced per 100 ms period, not averaged over longer intervals. A burst of work that would happily run flat-out on a node with idle cores gets throttled mid-burst because the quota for the current 100 ms window has been exhausted. The result is increased tail latency with no obvious benefit. Set CPU requests to ensure scheduling fairness; let the kernel sort out short-burst utilisation.
For memory, limits are the safety net: they prevent a runaway process from consuming all the node's memory and triggering a host-level OOM, which can take down other pods unrelated to the cause. The tradeoff is choosing limits that are high enough to absorb normal allocation patterns and low enough to catch actual leaks. Setting the memory limit equal to the memory request (one of the conditions for Guaranteed QoS) gives the strongest protection against eviction but the least flexibility under load.
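Put together, the CPU and memory advice above leads to a `resources` stanza along these lines. The values are illustrative only and should come from measurement.

```yaml
resources:
  requests:
    cpu: 100m        # reserves CPU for scheduling and fair sharing
    memory: 256Mi    # request == limit for memory
  limits:
    memory: 256Mi    # no cpu entry here: the container may burst into idle cores
```

A pod sized this way is classed Burstable, not Guaranteed, because Guaranteed requires CPU requests and limits to match as well. Note also that a LimitRange with a default CPU limit in the namespace would inject one into this container.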
Where the numbers come from.
The first source for sizing is measurement. Run the workload under representative load and capture the P50, P95 and P99 of CPU and memory utilisation. Then set the CPU request to roughly the P95 of CPU usage and the memory request to roughly the P99 of memory usage. Memory is sized at the higher percentile because it's incompressible: running close to the cliff is much riskier than running close to a CPU ceiling.
Vertical Pod Autoscaler (VPA) automates this for steady-state workloads. VPA observes actual usage over time, computes recommendations, and (in "Auto" mode) applies them by evicting the pod and recreating it with new requests. Most teams run VPA in recommendation-only mode (update mode "Off") and apply the suggestions during normal deployments. The trap: VPA's recommendations are based on historical traffic; if your traffic pattern changes (new feature, new customer onboarding), the recommendations lag behind reality.
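A recommendation-only setup might look like the sketch below. It assumes the VPA components are installed in the cluster (VPA is an add-on, not part of core Kubernetes); the object name and the `minAllowed`/`maxAllowed` bounds are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"        # compute recommendations only; never evict pods
  resourcePolicy:
    containerPolicies:
      - containerName: app
        controlledResources: ["memory"]
        minAllowed:
          memory: 128Mi      # illustrative lower bound
        maxAllowed:
          memory: 1Gi        # illustrative upper bound
```

The recommendations then appear in the object's status, for example via `kubectl describe vpa app-vpa`.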
Horizontal Pod Autoscaler (HPA) is the complement: instead of resizing each pod, it scales the number of pods up and down based on observed metrics. The classic HPA target is "CPU at 70% of request": when average CPU usage across all pods exceeds 70% of the request, HPA adds replicas. This works only if requests are sized correctly. Because utilisation is measured relative to the request, an oversized request makes HPA scale out too late and an undersized one makes it scale out too early. A common pattern is VPA for memory (because memory needs are more pod-intrinsic) and HPA for CPU (because CPU scales horizontally).
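The "70% of request" target can be written as follows. This is a sketch that assumes a metrics source such as metrics-server is running; the object name and `maxReplicas` are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 10            # illustrative ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, averaged over pods
```

With the 100m CPU request from the example Deployment, this target corresponds to an average of 70m of CPU per pod.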
Multi-tenant clusters need ResourceQuota and LimitRange. ResourceQuota caps the total requests and limits a namespace can consume; LimitRange enforces minimum, maximum and default per-container values. Both are essential for preventing one team from hoarding cluster capacity. Best practice: keep the ResourceQuotas across namespaces to about 80% of cluster capacity in total (leaving headroom for system pods), and give each namespace a LimitRange that forces every workload to carry requests, either its own or injected defaults (ruling out BestEffort).
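As a sketch, a namespace could be given both objects like this. The namespace `team-a`, the object names and every number are hypothetical and would need to be derived from the actual cluster size.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"        # total CPU requests allowed in the namespace
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:        # injected when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:               # injected when a container sets no limit
        cpu: 500m
        memory: 256Mi
      min:
        cpu: 50m
        memory: 64Mi
      max:
        cpu: "2"
        memory: 2Gi
```

One design point to watch: once a quota covers `limits.cpu`, every container in the namespace must end up with a CPU limit, either declared or injected by the LimitRange default. Teams that prefer to run without CPU limits would leave `limits.cpu` out of the quota and the `cpu` entry out of `default`.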