Sub-page 07 · for practitioners + controller authors

Kubernetes internals · api-server

Eleven stages, one request.

Every kubectl command, every controller reconcile, every kubelet status update is the same shape underneath: an HTTPS request to :6443 that walks eleven discrete stages inside the api-server before its bytes ever touch etcd. Authentication, authorisation, priority and fairness, two flavours of admission, schema, conversion, storage, watch fan-out, audit. The order is fixed. None of it can be skipped.

Roughly 4,300 words. Pair it with the architecture sub-page for the process boundaries, and the etcd sub-page for what happens after stage 10.

The eleven-stage pipeline.

The api-server is, to a first approximation, a tower of HTTP middleware. A request arrives at the TLS terminator, walks down a stack of named handlers, hits a storage call at the bottom, walks back up, and the response is written. Every Kubernetes request (a GET /api/v1/pods, a POST /apis/apps/v1/deployments, a PATCH /api/v1/nodes/n3/status) passes through the same eleven stages, in the same order. The handlers are wired together in staging/src/k8s.io/apiserver/pkg/server/config.go, and you can read them top-to-bottom; the binary is doing exactly what the source says it is.

The point of laying this out as a numbered list is that it gives you a place to put every symptom you have ever debugged. A 401 is stage 02 failing. A 403 is stage 06 failing. An incomprehensible "admission webhook denied" is stage 07 or stage 09. A 429 with no obvious cause is stage 05 — the API Priority and Fairness layer rejected you for being too noisy. A 409 Conflict on an Update is stage 10 noticing that resourceVersion moved under you. A 410 Gone on a watch is the cacher in stage 11 telling you the watch cache rotated past your cursor. None of these errors are mysterious; they are each exactly one stage of the pipeline misbehaving.

The other reason the order matters is that several stages are intentionally placed where they cannot be bypassed. Authentication is before authorisation, because you cannot make a policy decision about an unknown user. Mutating admission is before validating admission, because validation is the contract: whatever a mutating webhook produces must still satisfy the schema and the validating chain. Schema validation is between the two admission stages, not after, because a mutating webhook that emits a malformed object should be caught immediately. APF is before authorisation because rejecting an over-quota client is cheaper than running RBAC for them. Audit brackets the whole pipeline because you want a record even of the requests that were rejected at stage 02.

The handlers are also deliberately uniform in shape. Each one takes an http.Handler, wraps it, and returns a new http.Handler. The whole pipeline is a function composition; you can reorder it in source, you can disable individual stages with a --feature-gates flip or an --enable-admission-plugins argument, and the rest of the system keeps working as long as the contract (what each stage promises to the next) is preserved. This composability is what makes aggregated apiservers (covered in Part 08) cheap: a third-party server gets to reuse the same middleware library (k8s.io/apiserver) and inherit the same eleven stages for free.
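
The wrap-and-return shape is easier to see in miniature. The sketch below is a language-neutral model in Python, not the api-server's Go code: the handler names, the token check, and the ALLOWED_VERBS table are all hypothetical, and only two of the eleven stages are modelled.

```python
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, str]
Handler = Callable[[Dict], Response]

# Hypothetical policy table, for illustration only.
ALLOWED_VERBS = {"alice": {"get", "list"}}


def storage(request: Dict) -> Response:
    """Innermost handler: stands in for the storage call at the bottom."""
    return 200, f"{request['verb']} served for {request['user']}"


def with_authentication(inner: Handler) -> Handler:
    """Wrap `inner` so it only runs for callers with a known credential."""
    def handler(request: Dict) -> Response:
        if request.get("token") != "valid-token":   # hypothetical check
            return 401, "Unauthorized"
        request["user"] = "alice"                   # identity for later stages
        return inner(request)
    return handler


def with_authorization(inner: Handler) -> Handler:
    """Wrap `inner` so it only runs when the identity may use the verb."""
    def handler(request: Dict) -> Response:
        if request["verb"] not in ALLOWED_VERBS.get(request["user"], set()):
            return 403, "Forbidden"
        return inner(request)
    return handler


def build_chain(innermost: Handler,
                wrappers: List[Callable[[Handler], Handler]]) -> Handler:
    """Compose wrappers so the first one listed is the outermost stage."""
    handler = innermost
    for wrap in reversed(wrappers):
        handler = wrap(handler)
    return handler


api = build_chain(storage, [with_authentication, with_authorization])
```

Calling api({"token": "valid-token", "verb": "get"}) walks both wrappers and reaches storage; a bad token stops at the first wrapper, and a disallowed verb stops at the second. Dropping a wrapper from the list removes that stage without touching the others, which is the composability the paragraph describes.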

The table version, with the source-tree owner of each stage, is below. Bookmark this; it is the layout the rest of this page expands.

#	Stage	Owner	What it does
01	TLS / HTTP/2	go-restful + net/http	mTLS handshake terminates here; SNI demux for aggregated apiservers
02	Authentication	authentication.Request	cert → token → OIDC → webhook, in order; first hit wins
03	Audit (start)	audit.Backend	RequestReceived stage written to audit log
04	Impersonation	filters.WithImpersonation	Impersonate-User header swaps the user, re-authorised
05	Priority + Fairness	flowcontrol.Interface	FlowSchema match → PriorityLevel queue → admit or 429
06	Authorisation	authorizer.Authorizer	Node → RBAC → ABAC → Webhook, allow-on-first-yes
07	Mutating admission	MutatingAdmissionWebhook	webhook + policy plugins reorder, defaulting, sidecar injection
08	Schema validation	OpenAPIv3 validator	structural schema, default-pruning, x-kubernetes-* tags
09	Validating admission	ValidatingAdmissionWebhook	last refusal point; CEL ValidatingAdmissionPolicy lives here too
10	Storage / etcd	storage.Interface	protobuf encode → optimistic Txn → etcd PUT; resourceVersion bumps
11	Watch fan-out	cacher.Cacher	event broadcast to every subscribed watch; audit "ResponseComplete"

A useful debugging habit. When an api-server request misbehaves, name the stage first. "This is a stage 06 problem" is a precise statement that survives team handoffs; "my RBAC is broken" is not. The audit log includes per-stage timestamps in 1.27+; you can literally see which stage took how many milliseconds.

Authentication. Four sources, first match wins.

Authentication, stage 02, is the question "who is this caller?", and only that question. Authorisation is a separate stage. The api-server runs a chain of authenticators, configured at start-up via flags, and asks each one in turn whether it can identify the caller. The first authenticator that returns a positive result wins; the request continues with that identity attached. If every authenticator returns "I do not recognise this", the api-server returns 401 Unauthorized and the request never reaches stage 03.
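
As a rough model of the first-match-wins chain, here is a Python sketch. The two authenticators, the KNOWN_TOKENS table, and the request fields are hypothetical stand-ins; the real chain is Go code configured by flags.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An authenticator returns a username, or None for "I do not recognise this".
Authenticator = Callable[[Dict], Optional[str]]

# Hypothetical token table, for illustration only.
KNOWN_TOKENS = {"sa-token-123": "system:serviceaccount:prod:app-sa"}


def client_cert(request: Dict) -> Optional[str]:
    """Identity from a verified certificate's CN, if one was presented."""
    return request.get("cert_cn")


def bearer_token(request: Dict) -> Optional[str]:
    """Identity from a bearer token, if the token is known."""
    return KNOWN_TOKENS.get(request.get("token"))


def authenticate(request: Dict,
                 chain: List[Authenticator]) -> Tuple[int, Optional[str]]:
    """Ask each authenticator in turn; the first positive answer wins."""
    for authenticator in chain:
        user = authenticator(request)
        if user is not None:
            return 200, user
    return 401, None   # nobody recognised the caller


CHAIN = [client_cert, bearer_token]
```

A request carrying both a certificate and a token is identified by the certificate, because client_cert comes first in CHAIN; a request carrying neither gets a 401.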

There are exactly four authenticator types in mainline Kubernetes, and every cluster you will ever see uses some subset. Client certificates are the strongest: the caller presents an X.509 cert during the TLS handshake, the api-server verifies it against the cluster CA configured by --client-ca-file, and the identity is taken from the certificate's CN (username) and O (groups). This is how the kubelet authenticates to the api-server in most installs, how kubeadm-bootstrapped operators authenticate, and how kubectl authenticates when your kubeconfig has a client-certificate-data stanza. There is no callable network path for client-cert auth. It is purely TLS.

Bearer tokens are the most common in practice. A caller sends Authorization: Bearer ey...; the api-server validates the token. There are three flavours, served by separate authenticator implementations: static tokens loaded from a file (deprecated, never use), bootstrap tokens (only for kubeadm node-join), and ServiceAccount tokens. ServiceAccount tokens are the workhorse. Every Pod gets one mounted at /var/run/secrets/kubernetes.io/serviceaccount/token, and modern installs (1.20+) issue them as JWTs signed by the api-server's TokenRequest signing key, with a short audience-scoped lifetime and a binding to the Pod that received them. The legacy "ServiceAccount Secret" tokens (long-lived, never rotated) are no longer auto-generated in 1.24+; if you still see them in your cluster, audit and migrate.

OIDC is how humans usually authenticate in real organisations. The api-server is configured with an OIDC issuer URL (--oidc-issuer-url=https://login.microsoftonline.com/...) and a client ID. The caller obtains a JWT from the IdP (Okta, Azure AD, Google, Dex, whatever) and presents it as a bearer token; the api-server validates the signature against the issuer's JWKS, checks iss, aud, exp, and extracts the username and groups from configured claims. The IdP never sees the api-server; the api-server never sees the IdP's secret. This is the only sane way to give human engineers cluster access at scale.

Webhook authentication is the escape hatch. Configure the api-server with a kubeconfig pointing at a custom HTTPS service (--authentication-token-webhook-config-file), and the api-server will POST a TokenReview object to it for every otherwise-unrecognised bearer token, accepting whatever username and groups the webhook returns. This is how managed providers plumb their cloud IAM into the api-server (EKS does this for aws-iam-authenticator; GKE for its IAM proxy). Latency matters here: a slow webhook adds latency to every request, and the implementation is expected to cache aggressively.

Production rule. Disable anonymous auth (--anonymous-auth=false) unless you have a specific reason to allow it. Anonymous-on lets unauthenticated callers reach stage 06, where they get the system:unauthenticated group; if RBAC is misconfigured to grant that group anything beyond /healthz, it is a public-internet RCE.

Authorisation. RBAC, ABAC, Webhook, Node.

Authorisation, stage 06, runs after APF and answers a single question per request: given this verb on this resource by this user, is it allowed? Like authentication, it is a chain, but the semantics are different. Authentication is "first match wins". Authorisation is "first definite answer wins, no answer denies". The api-server consults the configured authorisers in order, and each returns DecisionAllow, DecisionDeny, or DecisionNoOpinion. The first DecisionAllow short-circuits the rest and the request continues; an explicit DecisionDeny also ends the chain, with a 403. DecisionNoOpinion hands the request to the next authoriser, and if the chain is exhausted without an allow, the request gets a 403 Forbidden.
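
The chain semantics can be modelled in a few lines of Python. This is a sketch under assumptions: the two authorisers and their rules are hypothetical, and it assumes an explicit deny ends the chain in the same way an allow does.

```python
from enum import Enum
from typing import Callable, Dict, List


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NO_OPINION = "no-opinion"


Authorizer = Callable[[Dict], Decision]


def node_authorizer(attrs: Dict) -> Decision:
    """Hypothetical: only has an opinion about kubelets touching nodes."""
    if attrs["user"].startswith("system:node:") and attrs["resource"] == "nodes":
        return Decision.ALLOW
    return Decision.NO_OPINION


def rbac_authorizer(attrs: Dict) -> Decision:
    """Hypothetical grant table; RBAC is additive, so it never says DENY."""
    granted = {("alice", "get", "pods")}
    key = (attrs["user"], attrs["verb"], attrs["resource"])
    return Decision.ALLOW if key in granted else Decision.NO_OPINION


def authorize(attrs: Dict, chain: List[Authorizer]) -> int:
    """Return the HTTP status the chain produces for this request."""
    for authorizer in chain:
        decision = authorizer(attrs)
        if decision is Decision.ALLOW:
            return 200            # first yes short-circuits the rest
        if decision is Decision.DENY:
            return 403            # assumed: an explicit no also ends the chain
    return 403                    # nobody allowed it


CHAIN = [node_authorizer, rbac_authorizer]
```

The cheap authoriser sits first, so a kubelet updating its Node never reaches the second one; everything the chain has no opinion about falls through to a 403.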

The four authorisers are, in canonical recommended order: Node, RBAC, ABAC, Webhook. Node and RBAC are on in practically every modern install (kubeadm defaults to Node,RBAC). ABAC is legacy. Webhook is the extension point. The order is set by a flag such as --authorization-mode=Node,RBAC,Webhook. Order matters because early-yes wins; you put the cheapest, most likely-to-allow authorisers first so you do not pay for the slow ones.

Node authoriser is a special-purpose authoriser scoped to one identity: the kubelet's. It checks that the caller's username is system:node:<nodeName> and is in the system:nodes group, then checks that the requested resource is one the kubelet is allowed to touch, and crucially, that the resource is about that kubelet's own node. A kubelet on node-3 can read the Secrets mounted into Pods scheduled on node-3, can update its own Node status, can patch Pod status for its own Pods, but cannot read a Secret on node-7 or update a Pod scheduled elsewhere. Without the Node authoriser, a single compromised kubelet becomes a cluster-wide credential-exfiltration vector. With it, the blast radius is one node.

RBAC is the workhorse. Four object kinds in two pairs (Role / ClusterRole, RoleBinding / ClusterRoleBinding) encode an additive policy: a Role lists [apiGroups, resources, verbs] tuples that a binding subject is allowed; the subject is a user, group, or ServiceAccount; the binding ties one or more subjects to one Role. There is no DENY; absence of an ALLOW is a deny. Roles are namespaced; ClusterRoles are cluster-scoped. The binding determines the scope: a RoleBinding can bind a ClusterRole into a single namespace, which is how you get reusable role definitions ("read all configmaps") without re-declaring the rules in every namespace.
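
To make "additive, no deny" concrete, here is a simplified Python model of rule evaluation. The data layout (plain dicts, a `namespace` of None standing for a cluster-wide binding) is an assumption made for the sketch and is not how the api-server stores or indexes RBAC objects.

```python
from typing import Dict, List, Optional


def rule_allows(rule: Dict, api_group: str, resource: str, verb: str) -> bool:
    """A rule matches when all three lists contain the value or a wildcard."""
    def matches(allowed: List[str], value: str) -> bool:
        return "*" in allowed or value in allowed
    return (matches(rule["apiGroups"], api_group)
            and matches(rule["resources"], resource)
            and matches(rule["verbs"], verb))


def rbac_allows(roles: Dict[str, List[Dict]], bindings: List[Dict],
                subject: str, namespace: Optional[str],
                api_group: str, resource: str, verb: str) -> bool:
    """Allowed if any binding for the subject reaches any matching rule."""
    for binding in bindings:
        if subject not in binding["subjects"]:
            continue
        # namespace None models a ClusterRoleBinding (applies everywhere).
        if binding["namespace"] is not None and binding["namespace"] != namespace:
            continue
        if any(rule_allows(rule, api_group, resource, verb)
               for rule in roles[binding["role"]]):
            return True
    return False   # no allow found; there is nothing to "deny" explicitly


ROLES = {"configmap-reader": [
    {"apiGroups": [""], "resources": ["configmaps"],
     "verbs": ["get", "list", "watch"]},
]}
BINDINGS = [{"role": "configmap-reader", "namespace": "prod",
             "subjects": ["system:serviceaccount:prod:app-sa"]}]
```

With these tables the ServiceAccount can list configmaps in prod and nothing else: the same request in another namespace, or a different verb in prod, finds no matching rule and is refused by default.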

ABAC is the predecessor to RBAC, configured via a policy file with per-line JSON rules. It still works, but it has been deprecated in spirit since 1.6 and every new install should use RBAC. The only reason to know it exists is that some very old clusters still have it enabled; if you see --authorization-policy-file=... in a kube-apiserver flag set, you have an ABAC config to migrate.

Webhook authorisation is the extension point: configure the api-server with --authorization-webhook-config-file and it will POST a SubjectAccessReview object to your service for every request that no earlier authoriser allowed, accepting whatever your webhook returns. Cloud-IAM integrations live here; OPA/Gatekeeper used to live here; many bespoke "this team can deploy to this namespace" implementations live here. The latency budget is small: timeoutSeconds: 5 default, --authorization-webhook-cache-authorized-ttl caches positive decisions, and a slow webhook shows up as cluster-wide tail latency.

# A canonical RBAC pair: a ClusterRole that lists configmaps, and a RoleBinding
# that grants it to a ServiceAccount in the prod namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: configmap-reader
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs:     ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: configmap-reader-prod
  namespace: prod
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind:     ClusterRole
  name:     configmap-reader

Three things to internalise about RBAC. First, every wildcard is a footgun; verbs: ["*"] includes impersonate, which is the privilege-escalation primitive — a subject with that verb can become any other identity in the cluster. Second, system:masters is hard-coded as cluster-admin and is not subject to RBAC; certificates whose O is system:masters bypass every rule. Treat that group like a root password. Third, RBAC has no negative rules; if you want "everyone except Bob can read Secrets", you do that by structuring your bindings, not by writing a deny. There is no deny.

Debugging tip. kubectl auth can-i <verb> <resource> --as=<user> is the official way to ask the chain "would you have allowed this?". It runs the entire stage 06 chain server-side and returns yes/no, exactly as the api-server would. Combine with --as-group to test bindings against groups.

Admission. Mutating, schema, validating, in that order.

Admission is three stages (07, 08, 09), and they are the extension point that can rewrite or veto any write. The first runs mutating admission: a chain of webhooks and built-in plugins that are allowed to modify the incoming object. The second is schema validation: the OpenAPI v3 schema the resource type was defined with, used to reject malformed objects and to apply defaults. The third runs validating admission: webhooks and CEL policies that get the final say but cannot mutate. After stage 09, the object is canonical and goes to etcd.

The split between mutating and validating is deliberate and worth dwelling on. Mutating webhooks can change the object (inject sidecars, set defaults, rewrite image references, add labels), and they run first because their changes have to be visible to validation. Validating webhooks cannot mutate; they only return allow or deny. The reason is determinism: if validation could also mutate, two webhooks each making "small fixes" could ping-pong indefinitely, and you would have no way to reason about the final shape. The api-server resolves this by enforcing the order: every change happens in stage 07, every check happens in stage 09, and any mutation in stage 09 is treated as a server-side error.

Built-in admission plugins live alongside webhooks in the same chain. The defaults are a long list: NamespaceLifecycle (rejects requests against deleting namespaces), LimitRanger (defaults pod resource requests), ServiceAccount (auto-mounts the SA token), DefaultStorageClass (sets the default StorageClass on PVCs), ResourceQuota (enforces quotas), PodSecurity (runs the Pod Security Standards). Each is a Go struct with the same Admit / Validate interface as a webhook, just compiled in. Knowing which plugins are on (with kube-apiserver --enable-admission-plugins=...) is part of any cluster forensic exercise.

Webhooks are the user-extensible flavour. You register a MutatingWebhookConfiguration or ValidatingWebhookConfiguration object that points the api-server at an HTTPS service inside the cluster (clientConfig.service) or at an external URL (clientConfig.url). For every matching request, the api-server POSTs an AdmissionReview to the webhook. The response is either an unmodified allowed: true, an allowed: false with a status message, or, for mutating webhooks only, an allowed: true carrying a JSONPatch that describes the changes to apply. The api-server applies the patch, then continues down the chain.
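
For a sense of what a mutating webhook sends back, here is a minimal Python sketch of the response body. The AdmissionReview envelope fields (uid, allowed, patchType, and a base64-encoded patch) follow the admission.k8s.io/v1 format; the label being injected (team=platform) and the function name are hypothetical, and a real webhook would also need an HTTPS server around this.

```python
import base64
import json
from typing import Dict, List


def mutate(review: Dict) -> Dict:
    """Build an AdmissionReview response that adds a label via JSONPatch."""
    request = review["request"]
    obj = request["object"]

    patch: List[Dict] = []
    if "labels" not in obj.get("metadata", {}):
        # JSON Patch cannot add a child to a missing parent, so create it.
        patch.append({"op": "add", "path": "/metadata/labels", "value": {}})
    patch.append({"op": "add", "path": "/metadata/labels/team",
                  "value": "platform"})          # hypothetical label

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],               # must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            # The patch travels as base64-encoded JSON.
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

A validating webhook returns the same envelope without patchType and patch, and sets allowed to false with a status message when it refuses.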

Schema validation, stage 08, is what catches a malformed object before it ever sees etcd. Every built-in resource has a hand-written validator under pkg/apis/<group>/validation/; every CRD has an OpenAPI v3 schema attached to its spec.versions[].schema.openAPIV3Schema stanza. Both are run here. The validator enforces structural constraints (required fields, enums, regex patterns, integer bounds), applies defaults (a CRD field with default: 3 in its schema gets that value if the caller omitted it), and prunes any field the schema does not declare, a behaviour controlled by x-kubernetes-preserve-unknown-fields, which is off by default for structural schemas.
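
Defaulting and pruning are simple enough to model directly. The following Python sketch handles only nested objects with `properties`, `default`, and `x-kubernetes-preserve-unknown-fields`; it is a simplification for illustration and ignores arrays, additionalProperties, and every validation rule.

```python
import copy
from typing import Any, Dict


def apply_schema(value: Any, schema: Dict) -> Any:
    """Prune undeclared fields and fill in defaults, recursively."""
    if schema.get("type") != "object" or not isinstance(value, dict):
        return value                      # scalars pass through unchanged

    properties = schema.get("properties", {})
    keep_unknown = schema.get("x-kubernetes-preserve-unknown-fields", False)

    result: Dict[str, Any] = {}
    for key, item in value.items():
        if key in properties:
            result[key] = apply_schema(item, properties[key])
        elif keep_unknown:
            result[key] = item            # opt-out: unknown fields survive
        # otherwise the field is pruned silently

    for key, sub_schema in properties.items():
        if key not in result and "default" in sub_schema:
            result[key] = copy.deepcopy(sub_schema["default"])
    return result


# Hypothetical schema for a Widget-like object.
WIDGET_SCHEMA = {
    "type": "object",
    "properties": {
        "replicas": {"type": "integer", "default": 3},
        "template": {"type": "object",
                     "properties": {"image": {"type": "string"}}},
    },
}
```

Passing {"template": {"image": "nginx", "debug": True}, "extra": 1} through apply_schema with WIDGET_SCHEMA drops both `debug` and `extra` and adds `replicas: 3`. The silent drop is why a typo in a CRD field name often looks like "my setting was ignored" rather than an error.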

# A ValidatingAdmissionWebhook that rejects Deployments with no resource limits.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: enforce-resource-limits.toolkit.dev
webhooks:
- name: limits.toolkit.dev
  clientConfig:
    service:
      namespace: policy-system
      name:      limit-enforcer
      path:      "/validate"
      port:      443
    caBundle:    <PEM-encoded CA>
  rules:
  - apiGroups:   ["apps"]
    apiVersions: ["v1"]
    operations:  ["CREATE", "UPDATE"]
    resources:   ["deployments"]
    scope:       "Namespaced"
  failurePolicy:  Fail          # reject on webhook unreachable
  sideEffects:    None
  admissionReviewVersions: ["v1"]
  timeoutSeconds: 5

A non-obvious detail: the order in which mutating webhooks run inside stage 07 is determined by the alphabetical sort of the name field on the webhook config. If two mutating webhooks both want to add the same annotation, the lexically-smaller one runs first, and the second one sees the first's output. This is deterministic but easy to get wrong; many production teams adopt a numeric prefix convention (10-istio-injector, 20-resource-defaulter) so order is explicit on the page. Validating webhooks in stage 09 are called in parallel, so their relative order does not change the outcome.

The other webhook lever that matters in production is failurePolicy. Fail means a webhook that does not respond rejects the request; Ignore means it silently passes through. Fail is correct for security webhooks (your policy MUST run); Ignore is correct for convenience webhooks (sidecar injection that you would rather skip than break the cluster). Set the wrong one and you either get a brittle cluster (Fail on a flaky webhook = no Pods schedule during the outage) or a security hole (Ignore on a policy webhook = attackers can DoS their way around it). The matchPolicy and objectSelector fields scope the webhook narrowly, which is the standard mitigation.
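
The two policies differ only in what happens when the call itself fails, which a short Python model makes plain. The function name and the choice of exceptions are assumptions for the sketch; the api-server's real error handling covers more cases.

```python
from typing import Callable


def admit(call_webhook: Callable[[], bool], failure_policy: str) -> bool:
    """Return True if the request is admitted.

    call_webhook returns the webhook's verdict, or raises if the webhook
    is unreachable or too slow.
    """
    try:
        return call_webhook()
    except (TimeoutError, ConnectionError):
        # Fail: an unanswered question is a "no".
        # Ignore: an unanswered question is skipped.
        return failure_policy == "Ignore"


def unreachable() -> bool:
    """Stand-in for a webhook that is down."""
    raise TimeoutError("webhook did not answer in time")
```

With an unreachable webhook, admit(unreachable, "Fail") refuses the write and admit(unreachable, "Ignore") lets it through; when the webhook does answer, the policy is irrelevant and the verdict stands.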

CEL policies (KEP-3488, GA in 1.30) are the new shape — a ValidatingAdmissionPolicy is a CRD-like object with inline CEL expressions that run inside the api-server, with no out-of-process webhook call. They are faster, cannot be a SPOF, and are increasingly the way new policy is written. They occupy stage 09 alongside webhooks and follow the same allow/deny semantics.

Conversion webhooks. Multi-version CRDs, the right way.

Built-in Kubernetes resources have multiple API versions: apps/v1beta1, apps/v1beta2, apps/v1 were three simultaneously-served versions of the same Deployment for years. The api-server transparently converts between them. A client requesting v1beta1 sees a v1beta1-shaped response even though etcd stores only one canonical version. For built-ins, the conversion functions are Go code under pkg/apis/<group>/<version>/: most of it is generated by conversion-gen into zz_generated.conversion.go, with hand-written functions alongside for the fields whose shape changed between versions.

CRDs are different. A CRD author cannot ship Go code into the api-server; their resource might evolve from v1alpha1 to v1beta1 to v1 over its lifetime, and they need a way to convert between them at runtime. Kubernetes provides two strategies, configured on the CRD's spec.conversion stanza: None (no conversion beyond rewriting apiVersion, so every version must be structurally identical) and Webhook (call out to a service to do it). In modern operators, almost every multi-version CRD uses Webhook.

A conversion webhook is structurally identical to an admission webhook: an HTTPS service that receives a ConversionReview and returns the converted objects. The api-server invokes it at two points. First, on every read of a CRD object whose stored version differs from the requested version: a client doing GET /apis/example.com/v1/foos/x when the object is stored as v1alpha1 triggers a webhook call to convert v1alpha1 → v1. Second, on every write: the api-server converts the incoming object to the storage version (declared with storage: true on one of the CRD versions) before persisting. A multi-version CRD with active clients on three versions therefore depends on the conversion webhook for every single read and write of that resource.
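
A conversion handler reduces to one pure function over the ConversionReview body. The Python sketch below uses the apiextensions.k8s.io/v1 envelope (request.uid, request.desiredAPIVersion, request.objects in; response.uid, response.convertedObjects, response.result out); the field rename between versions (spec.size in v1alpha1, spec.replicas in v1) is invented for the example.

```python
from typing import Dict


def convert_widget(obj: Dict, desired_api_version: str) -> Dict:
    """Convert one object; the size/replicas rename is hypothetical."""
    converted = dict(obj)
    converted["apiVersion"] = desired_api_version
    if "spec" in obj:
        spec = dict(obj["spec"])
        if desired_api_version.endswith("/v1") and "size" in spec:
            spec["replicas"] = spec.pop("size")
        elif desired_api_version.endswith("/v1alpha1") and "replicas" in spec:
            spec["size"] = spec.pop("replicas")
        converted["spec"] = spec
    return converted


def handle_conversion(review: Dict) -> Dict:
    """Turn a ConversionReview request into a ConversionReview response."""
    request = review["request"]
    converted = [convert_widget(obj, request["desiredAPIVersion"])
                 for obj in request["objects"]]
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "ConversionReview",
        "response": {
            "uid": request["uid"],           # must echo the request uid
            "convertedObjects": converted,   # same order as request.objects
            "result": {"status": "Success"},
        },
    }
```

Because the function is pure, it is easy to test a round trip (v1alpha1 to v1 and back) in CI without a cluster, which is the cheapest protection against the failure mode described above.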

This makes the webhook a critical-path SPOF in a way admission webhooks are not. An admission webhook with failurePolicy: Ignore can be down without breaking the cluster; a conversion webhook has no such option. If it is down, every read of every existing object of that resource type returns 500 Internal Server Error. Operators that ship multi-version CRDs run the conversion webhook with three replicas, on the same controller-manager deployment, with PodDisruptionBudget minAvailable: 2, and a preStop hook that drains gracefully.

The other subtlety is that the storage version, once set, is hard to change. The conventional path is "store as v1alpha1, declare v1beta1 also-served, when v1beta1 is stable flip storage: true to v1beta1 and run a one-time job that re-PUTs every existing object so etcd gets re-encoded under the new version". The kube-storage-version-migrator project (under kubernetes-sigs) provides a controller for this; you do not have to write the migration loop yourself, but you do have to install it and remember to run it. Forgetting leaves you with v1alpha1-encoded objects in etcd that the conversion webhook has to keep handling forever.

# A multi-version CRD with conversion webhook. v1 is the storage version;
# v1alpha1 is served for backwards compatibility and converted on the fly.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Widget
    plural: widgets
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          namespace: widget-operator
          name:      conversion-webhook
          path:      "/convert"
        caBundle:    <PEM>
  versions:
  - name: v1alpha1
    served: true
    storage: false
    schema: { openAPIV3Schema: { type: object } }
  - name: v1
    served: true
    storage: true
    schema: { openAPIV3Schema: { type: object } }

A conversion webhook is the only place you should write per-field migration code. It runs inside the api-server's admission pipeline before stage 08, so its output is what schema-validates, which means a buggy webhook produces a stage-08 rejection, surfaced to the user as a confusing "validation failed" rather than "your conversion crashed". Operators that ship CRDs invariably learn this the hard way once and then add comprehensive conversion-webhook tests to their CI.

Production rule. Never ship a CRD upgrade that simultaneously bumps the storage version and changes the schema. Do them in two releases: first add the new version as served-only and roll the conversion webhook; then, in a follow-up release, flip storage: true and run the storage migrator so existing objects are re-encoded. The CRD versioning guide is the canonical reference.

Storage. etcd, optimistic concurrency, resourceVersion.

Stage 10 is where bytes finally land in etcd. The api-server's storage layer (staging/src/k8s.io/apiserver/pkg/storage/etcd3) is a thin abstraction over etcd's gRPC client: it serialises the post-admission, post-conversion object to protobuf, computes its etcd key (/registry/<resource>/<namespace>/<name>), and issues an etcd transaction that performs the write conditionally on the current revision. The transaction is what makes optimistic concurrency work. For a Create, the condition is "no key at this path"; for an Update, "the existing key has mod_revision = N", where N is the resourceVersion the client sent. etcd accepts or rejects atomically. The api-server translates rejections to 409 Conflict for Update and 409 AlreadyExists for Create.
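
The conditional write can be modelled with an in-memory stand-in for etcd. This Python sketch is an illustration only: FakeEtcd, create, and update are hypothetical names, and the real storage layer issues a gRPC transaction rather than a method call.

```python
from typing import Dict, Tuple, Union


class FakeEtcd:
    """Toy key-value store with a global revision and per-key mod_revision."""

    def __init__(self) -> None:
        self.revision = 0
        self.data: Dict[str, Tuple[str, int]] = {}   # key -> (value, mod_rev)

    def txn_put(self, key: str, value: str, expected_mod_revision: int) -> bool:
        """Write only if the key's mod_revision matches (0 = key absent)."""
        current = self.data.get(key)
        current_rev = current[1] if current else 0
        if current_rev != expected_mod_revision:
            return False
        self.revision += 1                            # every write bumps it
        self.data[key] = (value, self.revision)
        return True


def create(store: FakeEtcd, key: str, value: str) -> Tuple[int, Union[int, str]]:
    """Create succeeds only when no key exists at this path."""
    if store.txn_put(key, value, expected_mod_revision=0):
        return 201, store.revision
    return 409, "AlreadyExists"


def update(store: FakeEtcd, key: str, value: str,
           resource_version: int) -> Tuple[int, Union[int, str]]:
    """Update succeeds only when nobody has written since resource_version."""
    if key not in store.data:
        return 404, "NotFound"
    if store.txn_put(key, value, expected_mod_revision=resource_version):
        return 200, store.revision
    return 409, "Conflict"
```

Two writers that both read at revision 1 and both try to update will see one 200 and one 409; the loser has to re-read and retry, which is the retry-on-conflict loop every controller carries.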

Every successful etcd transaction increments etcd's global revision counter, and that counter, surfaced as metadata.resourceVersion on every object, is the single thread that makes Kubernetes' watch model coherent. resourceVersion is cluster-wide, not per-resource. A Pod's resourceVersion of 487293 means "this is what the cluster looked like at global revision 487293"; if a ConfigMap also has resourceVersion 487293, the two states are consistent: they were both true at the same instant. This is what lets a controller open watches on multiple resources and reason about their joint state. (The API contract still asks clients to treat the value as an opaque string, so pass it back unchanged rather than doing arithmetic on it.)

The encoding format on the wire to etcd is protobuf, not JSON. The api-server converts JSON ↔ protobuf using generated code in generated.pb.go in each API package; the kubelet, the scheduler, every other component talks JSON-or-protobuf to the api-server, and the api-server normalises to protobuf for storage. CRDs are stored as JSON inside a runtime.Unknown protobuf wrapper because there is no generated PB code for user-defined types. This makes CRDs measurably more expensive to read and write than built-ins, a fact that matters at the scale of "1,000 CRD objects watched by 50 controllers".

etcd has a default per-object size limit of 1.5 MiB (the --max-request-bytes flag, default 1.5 MiB minus protocol overhead) and a database size limit of 2 GiB (--quota-backend-bytes). The first is what makes "we put a 4 MiB binary into a ConfigMap" fail with a confusing request entity too large. The second is what makes a cluster with 80,000 stale Events stop accepting writes: every write is rejected with "etcdserver: mvcc: database space exceeded" until you compact and defrag.

Kubernetes expires Events with a TTL (default --event-ttl=1h on the api-server) and runs a periodic etcd compaction (--etcd-compaction-interval=5m), so this rarely happens in well-managed clusters. When it does, the symptoms are cluster-wide write failures with healthy node state, and the remediation is invariably: etcdctl compact <rev>, then etcdctl defrag on every member, then etcdctl alarm disarm to clear the space alarm. The etcd sub-page goes much deeper on this.

# An update with optimistic concurrency: the api-server uses the client's
# resourceVersion as the etcd Txn's expected mod_revision.
$ kubectl get pod web-7d8 -n prod -o yaml | grep resourceVersion
  resourceVersion: "487292"

$ kubectl edit pod web-7d8 -n prod
# meanwhile, somewhere else, a controller updates the same Pod...
# ...then you save and quit the editor:
error: Operation cannot be fulfilled on pods "web-7d8":
       the object has been modified; please apply your changes
       to the latest version and try again
# That is a 409 Conflict, surfaced from the etcd Txn at stage 10.
# Every well-written controller retries on this error.

One subtle behaviour worth knowing: the api-server's storage layer is generic over resource type, but per-resource it can be configured to encrypt at rest (--encryption-provider-config) for sensitive types like Secrets. The encryption is symmetric (AES-CBC or AES-GCM with a KMS-derived key) and happens on the api-server side; etcd sees only the ciphertext. This is what lets you say "my Secrets are encrypted at rest" even when etcd's own disk is not. --encryption-provider-config-automatic-reload=true (1.27+) lets you rotate keys without an api-server restart.

The deepest mistake people make at stage 10 is treating UPDATE as if it commutes. It does not. UPDATE is a complete-object replace; if you read at resourceVersion 100, modify field X, and PUT, you are also writing back fields Y and Z as you saw them at version 100, even if a controller updated Y at version 105 in between. Use PATCH (strategic-merge or JSON Patch) when you want to modify a single field; use server-side apply for multi-author convergence.

List + watch and the Priority and Fairness layer.

List and watch are the two read primitives every Kubernetes client uses to maintain a consistent in-memory view of cluster state. List returns a snapshot: every object of a type, paginated, at a single etcd revision. Watch returns a stream of changes (ADDED, MODIFIED, DELETED, BOOKMARK) starting from a specified resourceVersion. The canonical client pattern, baked into client-go's Reflector, is: List to get a snapshot at revision N, then Watch with resourceVersion=N to receive every change after N, and never re-List unless the watch breaks. As long as the watch holds, the client's cache is exactly consistent with the api-server's view, with bounded latency.

The api-server serves watches from an in-memory watch cache (staging/src/k8s.io/apiserver/pkg/storage/cacher), not by re-querying etcd for every subscriber. On startup, the cacher does one big etcd Range to populate itself; after that, it follows etcd's own watch stream and broadcasts events to subscribers. The cache stores a sliding window of the last N events per resource (typically a few minutes' worth), so that a watcher reconnecting at a recent resourceVersion can resume from cache without etcd ever seeing the request. The cache is what makes 5,000 watchers cheap; etcd never has 5,000 watch streams open against it.
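
The list-then-watch pattern and the sliding window fit together as in the Python model below. Everything here is a simplification under stated assumptions: WatchCache, Gone, and resume are hypothetical names, objects are reduced to a name and a revision, and the window is counted in events rather than time.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Event = Tuple[int, str, str]          # (resourceVersion, type, object name)


class Gone(Exception):
    """Models HTTP 410: the requested revision fell out of the window."""


class WatchCache:
    def __init__(self, capacity: int) -> None:
        self.events: Deque[Event] = deque(maxlen=capacity)   # sliding window
        self.objects: Dict[str, int] = {}                    # name -> revision
        self.rv = 0

    def apply(self, event_type: str, name: str) -> None:
        """Record one change, as if it had arrived on etcd's watch stream."""
        self.rv += 1
        if event_type == "DELETED":
            self.objects.pop(name, None)
        else:
            self.objects[name] = self.rv
        self.events.append((self.rv, event_type, name))

    def list(self) -> Tuple[Dict[str, int], int]:
        """Snapshot of all objects plus the revision it is valid at."""
        return dict(self.objects), self.rv

    def watch(self, since_rv: int) -> List[Event]:
        """Every event after since_rv, or Gone if some were already evicted."""
        oldest = self.events[0][0] if self.events else self.rv + 1
        if since_rv + 1 < oldest:
            raise Gone(f"too old resource version: {since_rv}")
        return [event for event in self.events if event[0] > since_rv]


def resume(server: WatchCache, known: Dict[str, int],
           since_rv: int) -> Tuple[Dict[str, int], int]:
    """Client side: replay missed events, or re-List on 410."""
    try:
        events = server.watch(since_rv)
    except Gone:
        return server.list()          # start over from a fresh snapshot
    for rv, event_type, name in events:
        if event_type == "DELETED":
            known.pop(name, None)
        else:
            known[name] = rv
        since_rv = rv
    return known, since_rv
```

With a capacity of three and five events applied, a client that reconnects at revision 2 replays events 3 to 5 from the window, while a client that reconnects at revision 1 has missed an evicted event and must re-List.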

The "expensive list" problem is the canonical scaling failure of this model. A naive controller that Lists all Pods cluster-wide on every reconcile (instead of reading from its informer cache), multiplied across a thousand operators, can saturate the api-server's CPU on JSON serialisation alone. A particularly bad pattern is List with a label selector that the apiserver cannot index: historically the api-server walked every object, deserialised it, and only then applied the selector. Modern installs mitigate this with consistent reads from the watch cache (KEP-2340, enabled by default from 1.31), where Lists are served from the watch cache instead of going to etcd.

API Priority and Fairness (APF), stage 05, is the rate-limit / back-pressure mechanism that keeps the api-server from being DoS'd by its own clients. It replaces the older max-in-flight global limits. APF has two CRD-like primitives: FlowSchema classifies an incoming request into one of N priority levels by matching on user, group, namespace, resource, etc.; PriorityLevelConfiguration allocates a fraction of the api-server's total concurrency budget to a level, and within that level uses a fair-queuing algorithm to interleave flows. A flow is identified by user + namespace; one noisy ServiceAccount in one namespace cannot starve the others.

Default PriorityLevel	Concurrency shares	Typical flows
system	30	kubelets (the system:nodes group)
leader-election	10	Lease renewals from built-in controllers; never starved
node-high	40	Node status and health updates from kubelets
workload-high	40	built-in controllers (kube-controller-manager, kube-scheduler)
workload-low	100	every other ServiceAccount, typically in-cluster controllers and operators
global-default	20	everything else, such as humans on kubectl
catch-all	5	last-resort fallback when no other FlowSchema matches; landing here means a misconfigured client
exempt	∞	system:masters, health probes; never queued

# A FlowSchema + PriorityLevelConfiguration pair for an internal controller
# that should never starve, but should also never run unbounded.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata: { name: my-operator-high }
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 30
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        handSize: 6
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata: { name: my-operator-flows }
spec:
  matchingPrecedence: 800
  priorityLevelConfiguration: { name: my-operator-high }
  distinguisherMethod: { type: ByUser }
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount: { name: my-operator, namespace: my-operator-system }
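    resourceRules:
    - verbs:        ["*"]
      apiGroups:    ["*"]
      resources:    ["*"]
      namespaces:   ["*"]       # required to match namespaced requests
      clusterScope: true        # also match cluster-scoped requests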

When APF queues a request, the client sees latency added but no error. When APF rejects, the client gets 429 Too Many Requests with Retry-After. Modern client-go honours this header. The rate-limiter on the controller side backs off proportionally. If you write a custom client without using client-go, you have to implement this yourself; many "my operator DoSes the cluster" outages trace back to a client that does not.
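
For a client that does not use client-go, honouring the header comes down to a small loop. The Python sketch below is transport-agnostic: `send` is a hypothetical callable returning (status, headers, body), the one-second fallback delay is an assumption, and the header is assumed to carry a number of seconds.

```python
import time
from typing import Callable, Dict, Tuple

SendFn = Callable[[], Tuple[int, Dict[str, str], str]]


def call_with_retry(send: SendFn, max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Retry on 429, waiting as long as the server asked before each retry."""
    status, body = 429, ""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            break                                  # success or a real error
        if attempt < max_attempts - 1:
            # Assumed fallback of 1 second when the header is absent.
            sleep(float(headers.get("Retry-After", 1)))
    return status, body
```

The `sleep` parameter exists so the loop can be tested without waiting; production code would also add jitter and a cap on total wait time, which this sketch omits.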

# What a watch trace actually looks like — kubectl --watch is a thin wrapper
# around the same long-lived HTTP/2 stream every controller uses.
$ kubectl get pods -n prod --watch -v=8 2>&1 | head -20
GET https://api:6443/api/v1/namespaces/prod/pods?watch=true&resourceVersion=487292
HTTP/2.0 200 OK
{"type":"ADDED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487293"...}}}
{"type":"MODIFIED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487294"...}}}
# 90 seconds idle; api-server sends a bookmark so the cursor stays fresh
{"type":"BOOKMARK","object":{"kind":"Pod","metadata":{"resourceVersion":"487310"}}}
# MODIFIED for the next change, and so on, until the client disconnects.
# If we get a 410, the client redoes the LIST and re-establishes the watch.
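
The cursor handling in that trace can be sketched in a few lines of Python. This is a simplified illustration, not how client-go's informers are written: list_fn and watch_fn are hypothetical callables standing in for the LIST and WATCH calls, and a 410 is assumed to arrive as an ERROR event whose object carries code 410.

```python
class Gone(Exception):
    """The watch cache has rotated past our resourceVersion (HTTP 410)."""


def run_watch(list_fn, watch_fn, handle, max_relists=3):
    """list_fn() -> (items, resourceVersion); watch_fn(rv) yields event dicts.

    Returns the last resourceVersion seen when the stream ends cleanly.
    A real client would reconnect from that version instead of returning.
    """
    relists = 0
    items, rv = list_fn()
    for obj in items:
        handle("ADDED", obj)
    while True:
        try:
            for event in watch_fn(rv):
                etype, obj = event["type"], event["object"]
                if etype == "ERROR" and obj.get("code") == 410:
                    raise Gone()
                # Every event, bookmarks included, advances the cursor.
                rv = obj["metadata"]["resourceVersion"]
                if etype != "BOOKMARK":
                    handle(etype, obj)
            return rv
        except Gone:
            relists += 1
            if relists > max_relists:
                raise
            # Cursor is too old: start over from a fresh LIST.
            items, rv = list_fn()
            for obj in items:
                handle("ADDED", obj)
```

The bookmark branch is the reason an idle watcher can usually resume without a relist: its cursor keeps moving even when none of its objects change.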

Tuning rule. If you see apiserver_request_terminations_total incrementing for 429 responses in Prometheus, APF is probably shedding load. Look at apiserver_flowcontrol_rejected_requests_total to see who is being shed: its labels are priority_level, flow_schema, and reason (queue-full, concurrency-limit, time-out). Most production teams add a FlowSchema for their own controllers; not doing so means your operator competes in the workload-low bucket with every other ServiceAccount in the cluster.

Aggregated apiservers. Extension by composition.

CRDs are how 95% of Kubernetes extensions add their own resource types. They are simple, declarative, and require no extra binary in the cluster. But they have limits: a CRD cannot serve subresources beyond /status and /scale; it cannot do streaming responses; it cannot proxy outbound to a backing store other than etcd; it cannot serve a non-trivial connect verb (like /exec). When you need any of these, you reach for an aggregated apiserver.

An aggregated apiserver is a separate HTTPS service inside the cluster that serves a portion of the Kubernetes API surface. The kube-apiserver proxies requests for that portion to it. The mechanism is the APIService object, which says "for resources under API group metrics.k8s.io, version v1beta1, do not serve them locally; proxy to the Service kube-system/metrics-server:443". The kube-apiserver acts as a TLS-terminating, authenticating, authorising reverse proxy: clients still authenticate to the kube-apiserver as usual, the kube-apiserver runs stages 02–06 against the request as usual, and only then forwards the request, together with the established identity in HTTP headers, to the aggregated server.
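
The routing decision itself is a lookup keyed on group and version. The Python sketch below shows only that decision, under the assumption that registered APIServices are held in a dictionary; the route function and its return strings are hypothetical, and the real aggregator additionally resolves the Service to an endpoint and handles availability.

```python
from typing import Dict, Tuple

# (group, version) -> (namespace, service name, port)
APIServices = Dict[Tuple[str, str], Tuple[str, str, int]]


def route(path: str, api_services: APIServices) -> str:
    """Decide whether a request path is served locally or proxied."""
    parts = path.strip("/").split("/")
    # Group/version paths look like /apis/<group>/<version>/...
    if len(parts) >= 3 and parts[0] == "apis":
        backend = api_services.get((parts[1], parts[2]))
        if backend is not None:
            namespace, name, port = backend
            return f"proxy https://{name}.{namespace}.svc:{port}"
    return "local"


if __name__ == "__main__":
    registered = {("metrics.k8s.io", "v1beta1"): ("kube-system", "metrics-server", 443)}
    print(route("/apis/metrics.k8s.io/v1beta1/nodes", registered))
    print(route("/api/v1/pods", registered))
```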

A few aggregated apiservers are close to ubiquitous: metrics-server (the metrics.k8s.io group, which is what kubectl top talks to; it is maintained under kubernetes-sigs and installed as an add-on rather than shipped in the core binaries); the custom-metrics adapters that serve custom.metrics.k8s.io for the HorizontalPodAutoscaler; and historically service-catalog. kube-aggregator is not itself an aggregated server: it is the framework, living inside the kube-apiserver binary, that serves the apiregistration.k8s.io group and does the proxying. Third-party projects ship aggregated servers for the same reason. With CRD validation rules (KEP-2876) and the features around them, most new APIs are expected to be CRDs, but the aggregation path remains for high-throughput or stream-shaped APIs that CRDs cannot model.

The k8s.io/apiserver library, usually by way of kubernetes-sigs/apiserver-builder-alpha or a copy of kubernetes/sample-apiserver, is how you write one. The same eleven-stage pipeline runs in your aggregated apiserver, but with you owning stages 07–11. You register your own admission plugins, your own schema, your own storage backend (which is usually etcd, sometimes a different store entirely; metrics-server keeps only recent samples in memory because metrics are throw-away). The advantage is power; the cost is that you now run a stateful HTTPS service in the cluster, with all the operational baggage that implies.

# Registering an aggregated apiserver. The kube-apiserver will proxy every
# GET/PUT/POST/DELETE under /apis/metrics.k8s.io/v1beta1 to the named Service.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    namespace: kube-system
    name:      metrics-server
    port:      443
  caBundle:    <base64-encoded PEM CA bundle that signed the aggregated server's serving certificate>
  insecureSkipTLSVerify: false

The kube-apiserver authenticates to the aggregated server using a separate client certificate (--proxy-client-cert-file and --proxy-client-key-file), with the username and group information from the original request passed through as X-Remote-User / X-Remote-Group / X-Remote-Extra-* HTTP headers. The aggregated server validates the proxy-client cert against the CA in --requestheader-client-ca-file and, only if it verifies, trusts the headers. This is the only Kubernetes-native cross-server identity-passing mechanism, and getting the CA configuration wrong is the most common reason an aggregated server returns 401 to every request, even though the original kubectl request was perfectly authenticated.
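
The trust rule on the receiving side can be sketched as follows. This Python illustration assumes the TLS layer has already verified the client certificate against the request-header CA and hands over the result; identity_from_proxy and its parameters are hypothetical names, and allowed_names mirrors the role of --requestheader-allowed-names.

```python
from urllib.parse import unquote

EXTRA_PREFIX = "x-remote-extra-"


def identity_from_proxy(headers, client_cert_verified, client_cert_cn, allowed_names):
    """Return the forwarded identity, or None (meaning 401) if the proxy is not trusted.

    headers is a list of (name, value) pairs, because group headers repeat.
    """
    # Headers are only meaningful when the caller proved it is the front proxy.
    if not client_cert_verified:
        return None
    if allowed_names and client_cert_cn not in allowed_names:
        return None

    user, groups, extra = None, [], {}
    for name, value in headers:
        lname = name.lower()
        if lname == "x-remote-user":
            user = user or value  # first value wins
        elif lname == "x-remote-group":
            groups.append(value)
        elif lname.startswith(EXTRA_PREFIX):
            key = unquote(lname[len(EXTRA_PREFIX):])
            extra.setdefault(key, []).append(value)
    if not user:
        return None
    return {"user": user, "groups": groups, "extra": extra}
```

The ordering is the point: the certificate check comes first, and a failure there discards the headers entirely, which is why a CA mismatch shows up as a blanket 401 rather than as a wrong user.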

Operationally, the aggregated server is a SPOF for its API group in the same way a conversion webhook is. If metrics-server is down, kubectl top returns 503. The kube-apiserver cannot materialise PodMetrics from anywhere else. Aggregated APIs therefore require careful HA: at least two replicas, a PodDisruptionBudget, a leader election if the work is stateful, and ideally horizontal scaling on request rate. The apiserver-builder and the sample-apiserver are good starting points; both come with sane defaults baked in.

The reason aggregation exists at all is the same architectural discipline the architecture sub-page covers: one HTTPS endpoint, one authentication surface, one audit log. A user with a kubectl in their hand should never have to know whether they are talking to the core api-server or to an aggregated extension; the kube-apiserver makes the whole surface look uniform. This is why your operator's CRDs and your provider's aggregated APIs all show up in the same kubectl api-resources output, with the same RBAC, the same audit log, and the same APF queues. Extensibility, in Kubernetes, is composition behind a single front door.

And the rest of the Semicolony ladder. The architecture sub-page lays out the eight processes the api-server lives among; the apply lifecycle sub-page traces a single kubectl apply through every one of these eleven stages with timings; the etcd sub-page picks up at stage 10 and goes all the way down to disk fsyncs and Raft heartbeats; the controller pattern sub-page shows what the watcher on the other end of stage 11 actually does with the stream.

One last thing. The api-server is the densest, most-trafficked, most-extensible component in Kubernetes, and yet you can hold its entire request shape in your head. Eleven stages. Two chain primitives (authn-first-match, authz-allow-on-yes). Three admission stages with strict ordering. One storage call with optimistic concurrency. One watch fan-out. One priority-and-fairness queue. One aggregation surface. That is the system. Everything you will ever debug in a control-plane incident is one of those moving parts. Read source when the prose is ambiguous, write the stage number on every postmortem, and treat the api-server like the careful piece of plumbing it is.

Next in the internals series

Keep going.

etcd, deeply

Raft, MVCC, the database file, fsync latency, defrag, and how the api-server's storage layer drives it.

The lifecycle of kubectl apply

Twelve hops from keystroke to running pod, named, timed, explained — every one of them touches stages 02–11.

The controller pattern

Informers, listers, work queues, reconciliation. What the watcher on the other side of stage 11 is actually doing.

Back to the internals index

All twelve sub-pages — four live, eight planned, and the system on one canvas.
