Sub-page 04 · for controller authors + operators

Observe, diff, act.
Forever.

Every Kubernetes controller has the same shape: every built-in one in the controller-manager, every operator, every CRD reconciler, every line of controller-runtime code you have ever written or read. A watch on the api-server. A cache. A queue. A reconcile function. Some clever rate limiting. A lease for high availability. That is it. Once you see the shape, you can read the source for any controller in the ecosystem in an afternoon.

This page is a 4,000-word walk through the pattern, the libraries that implement it, and the conventions you have to obey for your controller to behave well in a real cluster. Pair it with the architecture sub-page for the system around it, and the apply lifecycle sub-page for the request-trace view of how a Pod gets reconciled.

The control loop in thirty lines.

A Kubernetes controller is a program that does three things in a tight loop forever: it observes the current state of the cluster, it diffs that state against the desired state, and it acts to close the gap. Observe, diff, act. The reason this pattern is everywhere in Kubernetes is that it is the only known shape that survives partial failures gracefully. If your controller crashes mid-act, the next iteration of the loop just observes again, computes a fresh diff, and acts on whatever is still wrong. There is no transaction, no rollback, no two-phase commit, no compensation logic. There is only eventual convergence, driven by repetition.

The Deployment controller is a working example you already know. Its desired state is the Deployment object's spec.replicas and spec.template. Its observed state is the set of ReplicaSets it owns and, transitively, the Pods those ReplicaSets own. Its diff is "is there a ReplicaSet whose pod template hash matches my current spec, and does that ReplicaSet have the right number of pods?" Its act is to create a new ReplicaSet, scale it up, and scale the old one down, each as a single api-server call. Then it returns. The loop runs again on the next watch event or the next periodic resync, observes the now-changed state, diffs again, and either acts or returns silently because the cluster converged.

Every other built-in controller has the same shape. The ReplicaSet controller diffs spec.replicas against the count of running pods labelled with its selector, and creates or deletes pods to close the gap. The Job controller diffs spec.completions against the number of pods that have reached Succeeded. The EndpointSlice controller diffs the set of pod IPs that match a Service's selector against the EndpointSlices currently published. None of them care how the cluster got into its current state; they only care about reading it freshly and writing the right next step. This is the property that makes Kubernetes safe to crash, safe to restart, safe to upgrade, safe to operate at all.

The pseudo-code for any controller is almost embarrassingly small. The library code that implements informers, queues, and listers is ten thousand lines of Go, but the user-facing shape is what you see below: read from a queue, fetch the current state from a local cache, do work, optionally re-enqueue, repeat.

// The control loop, in shape — controller-runtime style.
// In real code this is hidden behind manager.Start; the body is yours to write.

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. OBSERVE — fetch the current desired state from the local cache.
    var obj appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
        if apierrors.IsNotFound(err) { return ctrl.Result{}, nil }
        return ctrl.Result{}, err
    }

    // 2. DIFF — compute what should change.
    desired := buildReplicaSet(&obj)
    var current appsv1.ReplicaSet
    err := r.Get(ctx, client.ObjectKeyFromObject(desired), &current)

    // 3. ACT — create / update / delete to close the gap.
    switch {
    case apierrors.IsNotFound(err):
        return ctrl.Result{}, r.Create(ctx, desired)
    case err != nil:
        return ctrl.Result{}, err
    case !equality.Semantic.DeepEqual(desired.Spec, current.Spec):
        current.Spec = desired.Spec
        return ctrl.Result{}, r.Update(ctx, &current)
    }

    // 4. RETURN — the framework will call us again on the next watch event,
    // or after RequeueAfter if we ask for one.
    return ctrl.Result{Requeue: false}, nil
}

The diagram is the canonical reconcile loop the way client-go assembles it. The Reflector opens a List+Watch against the api-server, decodes JSON events into typed objects, and pushes deltas into a thread-safe FIFO. The Indexer pops deltas, applies them to its in-memory map, and notifies any registered event handlers. Those handlers compute a key (usually namespace/name) and push it onto a work queue. A pool of workers pop keys, look the object up via the Lister (the read-only view of the Indexer), call your Reconcile, and either drop the key on success, re-enqueue with rate limiting on transient error, or hard-fail on a permanent error. Every part of that pipeline is reusable. You write only the Reconcile body.

The most important property of this loop is that it is level-triggered, not edge-triggered. The work queue stores the key of a changed object, not the event that changed it. By the time your reconciler runs, the cache may have absorbed three more updates. You read the latest state and act on that. Edge-triggered controllers (the ones that try to react to a specific event) are universally a bug. They miss events when watches drop, and they double-act when watches replay.
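The difference between the two trigger styles can be shown with a toy model. This is a minimal sketch, not controller code: the function names and the "pod count as an integer" simplification are assumptions made for illustration.

```go
package main

import "fmt"

// edgeTriggered applies the delta carried by each delivered event.
// A dropped event is never recovered, because nothing re-reads the real state.
func edgeTriggered(running int, deltas []int) int {
	for _, d := range deltas {
		running += d
	}
	return running
}

// levelTriggered ignores history: it compares desired with observed state and
// returns how many pods to create (positive) or delete (negative).
func levelTriggered(desired, running int) int {
	return desired - running
}

func main() {
	// The user scales 1 -> 3 -> 5, but the watch drops the second event.
	delivered := []int{+2}

	edge := edgeTriggered(1, delivered)
	fmt.Println("edge-triggered ends with", edge, "pods")

	level := 1
	level += levelTriggered(5, level) // reads the latest desired state directly
	fmt.Println("level-triggered ends with", level, "pods")
}
```

The edge-triggered version stays wrong until another event happens to arrive; the level-triggered version converges on its next run regardless of which events were lost.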

Informers, listers, and the shared informer factory.

An informer is, roughly, "a watch on the api-server plus a thread-safe in-memory cache of every object in the watched resource". The first time it starts, it issues a List against the api-server (paginated, with a resourceVersion of zero, served from the api-server's watch cache when possible), populates its cache, and then opens a Watch from the latest revision. From that point on, every change is delivered as an ADDED, MODIFIED, or DELETED event, the cache is updated locally, and any registered event handlers fire. Reads against the cache go through a lister, which is a typed view of the same data; they are O(1) for Get and O(n) for List, and they never round-trip to the api-server.
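To make the cache half of that description concrete, here is a toy store that folds watch events into a map and serves reads from memory. It is a sketch under stated assumptions: Store, Event, and the string "object" are hypothetical stand-ins, and the real client-go Indexer has more machinery (indexes, delta FIFO, resync).

```go
package main

import (
	"fmt"
	"sync"
)

type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Event is a simplified watch event; Value stands in for the decoded object.
type Event struct {
	Type  EventType
	Key   string // namespace/name
	Value string
}

// Store is a toy thread-safe cache keyed by namespace/name.
type Store struct {
	mu    sync.RWMutex
	items map[string]string
}

func NewStore() *Store { return &Store{items: make(map[string]string)} }

// Apply folds one watch event into the cache.
func (s *Store) Apply(ev Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch ev.Type {
	case Added, Modified:
		s.items[ev.Key] = ev.Value
	case Deleted:
		delete(s.items, ev.Key)
	}
}

// Get is the lister-style O(1) read; it never contacts the api-server.
func (s *Store) Get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.items[key]
	return v, ok
}

// List is the O(n) read over everything in the cache.
func (s *Store) List() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]string, 0, len(s.items))
	for _, v := range s.items {
		out = append(out, v)
	}
	return out
}

func main() {
	s := NewStore()
	s.Apply(Event{Type: Added, Key: "prod/web", Value: "replicas=1"})
	s.Apply(Event{Type: Modified, Key: "prod/web", Value: "replicas=3"})
	s.Apply(Event{Type: Added, Key: "prod/api", Value: "replicas=2"})
	s.Apply(Event{Type: Deleted, Key: "prod/api"})

	v, ok := s.Get("prod/web")
	fmt.Println(v, ok, len(s.List()))
}
```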

The pattern matters because it solves a problem that would otherwise be ruinous. A naive controller that called client.Get(...) on every reconcile would generate one round-trip to the api-server per object per loop iteration. With thirty controllers in the controller-manager and tens of thousands of objects in a large cluster, that is millions of api-server reads per minute, every one of which becomes an etcd read. Informers turn the entire fleet of controllers into watchers. One watch per resource, shared, the cache populated once. The api-server load is bounded by the watch event rate, which is bounded by the actual write rate to etcd, which is bounded by the reality of how often things actually change. Without informers, Kubernetes would not scale past a few hundred nodes.

The shared informer factory is the further refinement: every controller that cares about Pods uses the same Pod informer, which means the Pod list is in memory exactly once. The factory hands out subscriptions. Each consumer registers AddEventHandler callbacks, and the underlying List+Watch is multiplexed. In controller-runtime the factory is hidden inside the Manager; in raw client-go you create a SharedInformerFactory explicitly and call factory.Apps().V1().Deployments().Informer() for each resource you watch. Either way the result is the same: one watch per resource, one cache per resource, N consumers per cache.

// Setting up a shared informer factory by hand.
// In controller-runtime this is automatic; this is what is happening underneath.

import (
    "context"
    "fmt"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/util/workqueue"
    "k8s.io/klog/v2"
)

func newController(ctx context.Context, cs kubernetes.Interface) error {
    factory := informers.NewSharedInformerFactory(cs, 30*time.Minute) // resync every 30m

    deploys := factory.Apps().V1().Deployments()
    pods    := factory.Core().V1().Pods()

    queue := workqueue.NewNamedRateLimitingQueue(
        workqueue.DefaultControllerRateLimiter(), "deployment-controller",
    )

    deploys.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(o interface{})       { enqueue(queue, o) },
        UpdateFunc: func(_, n interface{})    { enqueue(queue, n) },
        DeleteFunc: func(o interface{})       { enqueue(queue, o) },
    })

    // catch silent watch failures, otherwise the cache stops updating without warning.
    deploys.Informer().SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
        klog.Errorf("deployment watch failed: %v", err)
    })

    factory.Start(ctx.Done())
    if !cache.WaitForCacheSync(ctx.Done(), deploys.Informer().HasSynced, pods.Informer().HasSynced) {
        return fmt.Errorf("failed waiting for caches to sync")
    }

    return runWorkers(ctx, queue, deploys.Lister(), pods.Lister())
}

Two operational details in that snippet are easy to miss but matter in production. First, the 30*time.Minute resync is a periodic re-fire of the event handlers for every object in the cache, even if nothing has changed. It is the safety net against missed events: if your controller has a logic bug that drops a key from the queue incorrectly, the resync will eventually re-enqueue it. Set this to zero only if you are confident your enqueue logic is bulletproof; thirty minutes is the convention.

Second, SetWatchErrorHandler is the only programmatic signal you get when the watch breaks irrecoverably: for instance, your ServiceAccount lost a Role and the watch returns 403, or the api-server is rotating certificates and the connection fails to re-establish. Without it, the default handler only logs to klog while the reflector keeps retrying in the background and the cache silently stops updating; your controller then operates against a stale cache, sometimes for hours, and you only notice when reconciles start producing wrong answers. Always set it, and set it before the informer starts (the call returns an error once the informer is running). Always wire it to a metric or to your error reporter.

The lister is the read side. deploys.Lister().Deployments(ns).Get(name) returns the typed Deployment from the cache, with the same memory layout the api-server would have returned. It returns a pointer to the cache copy, which is critical: if you mutate it, you have just corrupted every other controller's view of the same object. Always DeepCopy() before mutating, then Update() the copy back through the client. The controller-runtime client wraps this for you; raw client-go does not.
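The aliasing hazard is easy to reproduce without any Kubernetes types. The sketch below assumes a hypothetical Deployment struct and a hand-written DeepCopy; real API types get their DeepCopy methods from code generation.

```go
package main

import "fmt"

// Deployment is a hypothetical stand-in for a cached API object.
type Deployment struct {
	Name     string
	Replicas int
	Labels   map[string]string
}

// DeepCopy returns a copy that shares no memory with d, including the map.
func (d *Deployment) DeepCopy() *Deployment {
	out := &Deployment{
		Name:     d.Name,
		Replicas: d.Replicas,
		Labels:   make(map[string]string, len(d.Labels)),
	}
	for k, v := range d.Labels {
		out.Labels[k] = v
	}
	return out
}

// lister hands out pointers into the shared cache, as a raw client-go lister does.
type lister struct{ cache map[string]*Deployment }

func (l lister) Get(name string) *Deployment { return l.cache[name] }

func main() {
	l := lister{cache: map[string]*Deployment{
		"web": {Name: "web", Replicas: 3, Labels: map[string]string{"app": "web"}},
	}}

	// Safe: mutate a private copy, then send the copy to the api-server.
	safe := l.Get("web").DeepCopy()
	safe.Replicas = 5
	safe.Labels["tier"] = "frontend"
	fmt.Println("cache after safe edit:", l.Get("web").Replicas, len(l.Get("web").Labels))

	// Unsafe: this pointer is the cache entry, so every other reader sees the edit.
	aliased := l.Get("web")
	aliased.Replicas = 5
	fmt.Println("cache after aliased edit:", l.Get("web").Replicas)
}
```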

Production gotcha. The lister returns objects with their cached resourceVersion. If you read, mutate, and Update on a hot object, you will frequently get 409 Conflict because another writer has bumped the resourceVersion since your cache last saw it. The fix is not to disable the cache; the fix is to retry the reconcile, which will re-read the now-fresher cache value. This is why every reconciler must be idempotent.

Work queues, rate limiting, and exponential backoff.

The work queue is the hinge between the informer (which produces events) and the reconciler (which consumes them). It is a strictly FIFO queue with three core properties layered on: deduplication, so that ten rapid updates to the same Pod produce only one queued item; per-item rate limiting, so that a Pod that fails to reconcile does not pin a worker forever; and graceful shutdown, so that a SIGTERM drains in-flight work before exiting. The implementation lives in k8s.io/client-go/util/workqueue and is roughly four hundred lines of Go that every Kubernetes controller in the world depends on.

Deduplication is the first non-obvious feature. When you push key prod/web-7d8 onto the queue, the queue first checks whether that key is already in the queue or being processed. If the former, the push is a no-op. If the latter, the queue marks the key as "dirty" and only re-queues it once the in-flight worker calls Done(key). This means a busy-looping informer that fires fifty events per second on the same Pod produces at most one reconcile in flight plus one queued. Without this, a pathological update storm could multiply into N parallel reconciles fighting each other for the same lock.
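The dirty/processing bookkeeping described above fits in a few lines. This is a simplified, single-goroutine model written for illustration; the type and method names mirror the concepts but the real workqueue adds locking, blocking Get, shutdown, and metrics.

```go
package main

import "fmt"

// Queue is a toy deduplicating work queue (not safe for concurrent use).
type Queue struct {
	order      []string        // keys waiting for a worker, in FIFO order
	dirty      map[string]bool // keys that need (another) reconcile
	processing map[string]bool // keys a worker currently holds
}

func NewQueue() *Queue {
	return &Queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add is a no-op if the key is already waiting; if the key is in flight,
// it is only marked dirty and re-queued later by Done.
func (q *Queue) Add(key string) {
	if q.dirty[key] {
		return
	}
	q.dirty[key] = true
	if q.processing[key] {
		return
	}
	q.order = append(q.order, key)
}

// Get hands the oldest key to a worker and marks it as in flight.
func (q *Queue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done releases the key; if it changed while in flight, it is queued once more.
func (q *Queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.order = append(q.order, key)
	}
}

func (q *Queue) Len() int { return len(q.order) }

func main() {
	q := NewQueue()
	for i := 0; i < 50; i++ {
		q.Add("prod/web-7d8")
	}
	fmt.Println("queued after 50 adds:", q.Len())

	key, _ := q.Get()
	for i := 0; i < 50; i++ {
		q.Add(key) // update storm while the worker is busy
	}
	fmt.Println("queued while in flight:", q.Len())

	q.Done(key)
	fmt.Println("queued after Done:", q.Len())
}
```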

Rate limiting is the second feature. The RateLimitingInterface wraps the queue with a per-item exponential-backoff calculator and a global token-bucket. When your reconciler returns an error, you call queue.AddRateLimited(key) instead of plain queue.Add(key); the queue computes the next retry delay from the per-item failure count, schedules the re-add, and increments the counter. On eventual success, you call queue.Forget(key) to reset the counter. The global token bucket then caps the aggregate retry rate at ten per second (with a burst of one hundred), which protects the api-server from a controller bug that is hot-looping on every key in the cache.

Default	Value	Notes
BaseDelay	5ms	first retry interval after failure
MaxDelay	1000s	cap on the exponential backoff
Bucket QPS	10/s	token-bucket steady-state
Bucket burst	100	short-burst allowance
Forget on success	true	reset the per-item failure counter

The defaults shipped by DefaultControllerRateLimiter() are the result of a decade of production tuning. The 5ms base delay is short enough that a transient blip (a webhook timing out, a brief api-server hiccup) gets retried fast enough not to stall a rollout. The 1000s cap is long enough that a permanently broken resource (a CRD whose webhook has been deleted, say) does not become a hot loop that saturates a worker. Between them, the geometric backoff is 5ms, 10ms, 20ms, 40ms, …, doubling on each failure, capped at 1000s. Once a key has failed eighteen times in a row, the next computed delay (5ms × 2^18 ≈ 1,311s) exceeds the cap, so from then on it is retried once every 1000s, a little under seventeen minutes; this is the right behaviour.
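The backoff arithmetic is worth checking by hand. The sketch below recomputes the per-item delay from the 5ms base and 1000s cap in the table; backoffDelay is a hypothetical helper written for this page, not the client-go function, and it ignores the global token bucket.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Millisecond
	maxDelay  = 1000 * time.Second
)

// backoffDelay returns the wait before the next retry of a key that has
// already failed `failures` times: base * 2^failures, capped at maxDelay.
func backoffDelay(failures int) time.Duration {
	d := baseDelay
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for _, n := range []int{0, 1, 2, 3, 10, 17, 18} {
		fmt.Printf("after %2d earlier failures: wait %v\n", n, backoffDelay(n))
	}
}
```

Doubling from 5ms crosses the 1000s cap between the seventeenth and eighteenth doubling (655.36s, then 1310.72s), which is where the per-key delay flattens out.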

// Building a rate-limited workqueue with the canonical defaults.

import "k8s.io/client-go/util/workqueue"

queue := workqueue.NewNamedRateLimitingQueue(
    workqueue.DefaultControllerRateLimiter(),  // 5ms→1000s exp + 10qps/100 burst
    "my-controller",                        // becomes the name label on the workqueue_* metrics
)

// Enqueue / dequeue pattern in a worker goroutine.
for processNextItem(ctx, queue, reconcile) {}

func processNextItem(ctx context.Context, q workqueue.RateLimitingInterface, fn ReconcileFunc) bool {
    key, shutdown := q.Get()
    if shutdown { return false }
    defer q.Done(key)                          // must always call, even on panic

    if err := fn(ctx, key.(string)); err != nil {
        if q.NumRequeues(key) < 15 {
            q.AddRateLimited(key)              // transient — back off and retry
        } else {
            klog.Errorf("giving up on %s: %v", key, err)
            q.Forget(key)                      // permanent — stop retrying
        }
        return true
    }

    q.Forget(key)                              // success — reset the failure counter
    return true
}

A few things in that loop are easy to get wrong. First, defer q.Done(key) is not optional; if you forget it, that key is "in flight" forever and will never be re-queued, even if it changes again. Wrap your reconciler so a panic still calls Done. Second, Forget is what resets the per-item failure counter. If your reconcile succeeds after several retries and you do not call Forget, the counter will keep growing and the next transient failure on that key will start at a much longer backoff than it should. Always Forget on success.

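Production tuning. The workqueue exposes Prometheus metrics including workqueue_depth, workqueue_adds_total, workqueue_unfinished_work_seconds, and workqueue_longest_running_processor_seconds. The last two are the ones that catch stuck reconcilers: unfinished_work_seconds is the total time in-flight items have been processing without completing, and longest_running_processor_seconds is the wall-clock age of the oldest in-flight item. If the latter climbs well above your reconcile latency p99, your controller has a stuck reconciler. Alert on it.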

The reconcile signature — Result, errors, idempotency.

The reconcile signature in controller-runtime has exactly two return values: a ctrl.Result struct and an error. Despite the surface simplicity, four distinct behaviours are encoded in those two values, and the framework's behaviour for each is the difference between a controller that converges quickly and one that flap-loops or stalls. Most controller bugs in the wild are misuse of this signature.

The four cases are as follows. Success: return ctrl.Result{}, nil; the framework drops the key from the queue and waits for the next watch event. Soft retry: return ctrl.Result{Requeue: true}, nil; the framework re-enqueues the key through the rate limiter; this is the right return when something is in flight (a Pod is still pulling its image) and you want to be called again soon. Scheduled retry: return ctrl.Result{RequeueAfter: 30*time.Second}, nil; the framework re-enqueues the key after thirty seconds, bypassing the rate limiter; this is for periodic checks (a certificate expiring in some hours, a backup job that runs once a day). Error retry: return ctrl.Result{}, err; the framework re-enqueues with rate limiting, increments the failure counter, and logs the error; this is for transient failures, not for permanent bugs.

The most common antipattern is treating any error as ctrl.Result{}, err and letting the framework re-enqueue. If the underlying object has been deleted while you were reconciling, you get 404 Not Found on the next read; that should not be re-queued as an error, it should be returned as a success because the desired state is "the object is gone, the cluster has converged". Always test for apierrors.IsNotFound at the top of your reconciler and return nil. The same applies to admission rejections that you cannot fix from your controller. Log, emit an Event, return nil, and let a human handle it.

Idempotency is the deeper requirement underneath the signature. Your Reconcile must produce the same result if it runs once or runs five times in a row against the same input state. This is not optional; the framework will call you again on every watch event, on every periodic resync, and on every retry, and the cluster's correctness depends on the operations you perform being safe to repeat. The standard trick is to make every write a Patch or an Apply with a stable name and selector, so that the api-server itself dedupes "create if not exists" and "update only the fields I own". Server-side apply (since 1.22) is the modern, ergonomic version of this; older code uses strategic merge patches.

// All four return shapes, in one place.

func (r *MyReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var obj appsv1.Deployment
    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
        if apierrors.IsNotFound(err) {
            // gone — that is fine; the cluster has converged.
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err   // transient — let the rate limiter retry
    }

    if obj.Status.ReadyReplicas < *obj.Spec.Replicas {
        // in flight — soft retry, rate-limited.
        return ctrl.Result{Requeue: true}, nil
    }

    // cert is a hypothetical parsed certificate this reconciler loaded earlier.
    if time.Until(cert.NotAfter) < 24*time.Hour {
        // scheduled work — bypass the rate limiter.
        return ctrl.Result{RequeueAfter: 30 * time.Minute}, nil
    }

    // nothing to do — converged.
    return ctrl.Result{}, nil
}

A subtle interaction: RequeueAfter and the rate limiter do not compose. If you return RequeueAfter: 30s the framework schedules an exact 30s timer; if you also return an error in the same call, the rate limiter wins and ignores the RequeueAfter. The intent is "errors imply backoff" but the surprise for new authors is that you cannot say "this is broken, but try again in exactly five minutes". You either have an error (rate-limited retry) or a scheduled re-check (RequeueAfter), not both.
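The precedence between the error, RequeueAfter, and Requeue can be written down as a small decision function. This is a simplified model of how the framework treats a reconcile return, assuming the behaviour described above; Result and queueAction are local stand-ins, not the controller-runtime types, and terminal errors are left out.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result mirrors the two fields of ctrl.Result for the purposes of this sketch.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

var errTransient = errors.New("webhook timed out")

// queueAction names the workqueue call the framework makes for a given return.
// The order of the cases is the whole point: an error always wins.
func queueAction(res Result, err error) string {
	switch {
	case err != nil:
		return "AddRateLimited" // RequeueAfter, if set, is ignored
	case res.RequeueAfter > 0:
		return "Forget+AddAfter" // exact timer, failure counter reset
	case res.Requeue:
		return "AddRateLimited"
	default:
		return "Forget"
	}
}

func main() {
	fmt.Println(queueAction(Result{}, nil))
	fmt.Println(queueAction(Result{Requeue: true}, nil))
	fmt.Println(queueAction(Result{RequeueAfter: 30 * time.Second}, nil))
	fmt.Println(queueAction(Result{RequeueAfter: 5 * time.Minute}, errTransient))
}
```

The last call is the surprising one: the five-minute request is dropped and the key goes through exponential backoff instead.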

A common bug. Returning ctrl.Result{}, err for every known-transient condition causes the rate limiter's per-key counter to climb fast, and a momentarily broken webhook can leave a key backed off for many minutes (up to the 1000s cap), long after the actual outage has ended. Use RequeueAfter for known-transient conditions and return err only for unexpected failures. The counter is cleared (via Forget) the next time the reconcile for that key succeeds; in controller-runtime the framework makes that call for you.

Finalizers — the deletion handshake.

A finalizer is a string in metadata.finalizers that tells the api-server "do not actually delete this object until this string is removed". When a user runs kubectl delete on an object that has finalizers, the api-server sets metadata.deletionTimestamp to now, but the object remains. Watchers see a MODIFIED event (not DELETED). Whoever owns the finalizer is expected to do whatever cleanup they need (release a cloud resource, drain a queue, revoke a credential) and then issue a Patch removing their finalizer string. Once finalizers is empty, the api-server actually deletes the object.
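The api-server's side of that handshake can be modelled in a few lines. This is a toy sketch: Server, Object, and the Deleting flag (standing in for a non-nil metadata.deletionTimestamp) are hypothetical names, and the finalizer strings are examples.

```go
package main

import "fmt"

// Object is a toy API object; Deleting stands in for deletionTimestamp being set.
type Object struct {
	Finalizers []string
	Deleting   bool
}

// Server is a toy api-server holding objects by name.
type Server struct{ objects map[string]*Object }

// Delete removes the object immediately only if it has no finalizers;
// otherwise it just marks the object as terminating.
func (s *Server) Delete(name string) {
	obj, ok := s.objects[name]
	if !ok {
		return
	}
	if len(obj.Finalizers) == 0 {
		delete(s.objects, name)
		return
	}
	obj.Deleting = true
}

// RemoveFinalizer drops one finalizer string; when a terminating object
// loses its last finalizer, the server completes the deletion.
func (s *Server) RemoveFinalizer(name, finalizer string) {
	obj, ok := s.objects[name]
	if !ok {
		return
	}
	kept := obj.Finalizers[:0]
	for _, f := range obj.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	obj.Finalizers = kept
	if obj.Deleting && len(obj.Finalizers) == 0 {
		delete(s.objects, name)
	}
}

func (s *Server) Exists(name string) bool {
	_, ok := s.objects[name]
	return ok
}

func main() {
	s := &Server{objects: map[string]*Object{
		"db": {Finalizers: []string{"databases.example.com/finalizer", "backup.example.com/finalizer"}},
	}}

	s.Delete("db")
	fmt.Println("after delete:", s.Exists("db")) // still there, terminating

	s.RemoveFinalizer("db", "databases.example.com/finalizer")
	fmt.Println("one finalizer left:", s.Exists("db"))

	s.RemoveFinalizer("db", "backup.example.com/finalizer")
	fmt.Println("no finalizers left:", s.Exists("db"))
}
```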

The pattern exists because deletion in Kubernetes is otherwise ungated: the moment you delete a custom resource instance, your controller stops seeing it, and any external resources it was managing become orphans. A controller that provisions an AWS RDS database from a custom Database resource needs a guarantee that the api-server will not vanish the resource until the actual RDS instance has been terminated; otherwise a stray AWS bill is the result. Adding a finalizer when the object is created, and removing it only after the external work is confirmed gone, is the guarantee.

Finalizers are also the source of half the "stuck deletion" stories you have heard. A finalizer that is never removed (because the controller crashed permanently, or was uninstalled, or the cluster lost network access to the cloud API) leaves the object in a perpetual Terminating state. kubectl delete --force --grace-period=0 does not remove finalizers; only a controller (or a brutal kubectl patch -p '{"metadata":{"finalizers":null}}' against the finalizers field) can. The most common operator bug is shipping a finalizer without shipping a controller-uninstall hook that clears it; users who uninstall the operator find themselves with undeletable objects until they manually patch them.

// The finalizer add/remove pattern, controller-runtime style.

const myFinalizer = "databases.example.com/finalizer"

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var db v1.Database
    if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    if db.DeletionTimestamp.IsZero() {
        // not being deleted — make sure our finalizer is present.
        if !controllerutil.ContainsFinalizer(&db, myFinalizer) {
            controllerutil.AddFinalizer(&db, myFinalizer)
            return ctrl.Result{}, r.Update(ctx, &db)
        }
        return r.reconcileNormal(ctx, &db)
    }

    // being deleted — run cleanup, then remove our finalizer.
    if controllerutil.ContainsFinalizer(&db, myFinalizer) {
        if err := r.deprovisionRDS(ctx, &db); err != nil {
            return ctrl.Result{}, err   // retry — RDS deletion may take minutes
        }
        controllerutil.RemoveFinalizer(&db, myFinalizer)
        if err := r.Update(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil   // all our finalizers are off; api-server will GC
}

The order in that snippet is exact: cleanup first, finalizer removal second. If you reverse them (remove the finalizer, then run cleanup), the api-server will delete the object the moment the finalizer is gone, your controller will lose the spec it needs to do the cleanup, and the cleanup will silently fail. The crash-safety property is also exact: if the controller dies between cleanup and finalizer removal, the next iteration will see the finalizer still present, run cleanup again (it must be idempotent, which is why deprovisionRDS has to handle "already gone"), and remove the finalizer.

Finalizer naming follows a convention: domain/name, where the domain is your controller's group. Multiple controllers can each register their own finalizer on the same object, and they all have to be removed before deletion proceeds. This is how the GarbageCollector and the per-controller cleanup interact safely. They do not coordinate; they each just remove their own string.

Recovery procedure for stuck Terminating objects. kubectl patch $obj -p '{"metadata":{"finalizers":null}}' --type=merge clears all finalizers. Use only after confirming no controller will ever process the object again, otherwise you orphan whatever the finalizer was protecting.

Owner references and the garbage collector.

A Kubernetes object can declare, in its metadata.ownerReferences array, one or more "owners". An owner reference points at another object by apiVersion, kind, name, and UID. A namespaced dependent may name an owner in its own namespace or a cluster-scoped owner; a cluster-scoped dependent may only name cluster-scoped owners, and cross-namespace references are not allowed. The owner is, by convention, the thing whose deletion should trigger the deletion of the owned object. When you delete a Deployment, the ReplicaSet it owns is deleted; when you delete the ReplicaSet, the Pods it owns are deleted; when you delete a Pod, the Pod-owned PVCs (if you set the policy that way) are deleted. The chain is followed by the GarbageCollector controller, which is one of the thirty controllers in the controller-manager.

The GarbageCollector watches every Kind it knows about and maintains an in-memory dependency graph: for every owner-owned edge, it stores both halves. When an owner is deleted, it walks the graph and queues each dependent for deletion. The default policy is Background cascade: the owner is deleted immediately, and the dependents are deleted asynchronously. There is also Foreground, where the owner is held in Terminating until all dependents are gone (the GarbageCollector adds its own foregroundDeletion finalizer to enforce this), and Orphan, where the owner is deleted and the dependents have their owner reference cleared but stay alive.
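The graph walk behind a background cascade can be sketched as a breadth-first traversal. This is a toy under loud assumptions: Graph, AddOwnerRef, and Cascade are hypothetical names, objects are identified by plain strings rather than UIDs, and the model treats every dependent as having a single owner (the real collector only deletes a dependent once none of its owners remain).

```go
package main

import "fmt"

// Graph stores owner -> dependents edges, keyed by object name.
type Graph struct{ dependents map[string][]string }

func NewGraph() *Graph { return &Graph{dependents: map[string][]string{}} }

// AddOwnerRef records that child lists owner in its ownerReferences.
func (g *Graph) AddOwnerRef(child, owner string) {
	g.dependents[owner] = append(g.dependents[owner], child)
}

// Cascade returns every object to delete after root is deleted,
// in breadth-first order (direct dependents before their own dependents).
func (g *Graph) Cascade(root string) []string {
	var out []string
	seen := map[string]bool{root: true}
	queue := []string{root}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, dep := range g.dependents[cur] {
			if !seen[dep] {
				seen[dep] = true
				out = append(out, dep)
				queue = append(queue, dep)
			}
		}
	}
	return out
}

func main() {
	g := NewGraph()
	g.AddOwnerRef("rs/web-7d8", "deploy/web")
	g.AddOwnerRef("pod/web-7d8-a", "rs/web-7d8")
	g.AddOwnerRef("pod/web-7d8-b", "rs/web-7d8")

	fmt.Println(g.Cascade("deploy/web"))
}
```

An object created without an owner reference never appears in this graph, which is why deleting its conceptual parent leaves it behind.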

The owner reference also has a controller: true flag, which is the difference between "this object is owned by …" and "this object is the responsibility of …". A Pod's controller: true owner is its ReplicaSet; the ReplicaSet's controller is its Deployment. Only one owner per object can have controller: true; this is what the ReplicaSet adoption logic checks when it claims orphan pods that match its selector. If a Pod has no controller, any matching ReplicaSet may adopt it; if it already has one, only that ReplicaSet is allowed to manage it.

Owner references are also how a controller "claims" the objects it manages. When the ReplicaSet controller creates a Pod, it stamps an ownerReference on the Pod with controller: true. When the controller is later listing Pods to count its replicas, it filters by ownerReference UID, not by selector. Selectors are the bootstrap mechanism, ownerReferences are the authoritative claim. This is also why kubectl delete rs --cascade=orphan exists: it strips the ownerReference, and the previously-owned pods become unmanaged but stay alive.

A subtle GC bug to know about: if you create a child without setting an ownerReference (because your controller forgot, or because you scripted it with kubectl), the child is permanently orphaned. The garbage collector cannot delete it when the conceptual parent goes away. Always call controllerutil.SetControllerReference(parent, child, scheme) in controller-runtime before creating the child.

The status subresource. Separating spec from status.

Every Kubernetes object splits its body into two top-level fields: spec, which is what the user wants, and status, which is what the controller observes. The split is not just convention; the api-server enforces it. Built-in resources (Deployment, StatefulSet, Pod) and CRDs that opt in via the /status subresource get two separate write endpoints: PUT /apis/.../deployments/foo updates the spec but the status field is silently dropped, and PUT /apis/.../deployments/foo/status updates only the status. Each is a distinct RBAC resource (deployments versus deployments/status), so write access to one can be granted without the other. Both endpoints still share the object's single resourceVersion for optimistic concurrency.

The reason this matters in practice is that controllers should never compete with users on writes. A user editing the spec (kubectl edit, an Argo sync, a Helm upgrade) should not race with the controller writing the status. Without the subresource split, both go through the same UPDATE; either the user's edit overwrites the controller's status (and now the cached status is wrong) or the controller's status update fails with 409 Conflict because the user beat them to it (and the controller has to retry, adding latency). With the split, a user's UPDATE leaves status untouched, and the controller's UPDATE leaves spec untouched. Neither writer can overwrite the other's half of the object; a full Update sent with a stale resourceVersion can still be rejected with 409, but it can no longer clobber the other side's fields.

For CRDs, opting in is one line in the schema:

# In your CRD definition — turn on the /status subresource.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  names: { kind: Database, plural: databases }
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}            # enables PUT /databases/foo/status
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:   { type: object, properties: { ... } }
            status: { type: object, properties: { ... } }

With this in place, your controller writes status via r.Status().Update(ctx, &obj) in controller-runtime (note the .Status() sub-client), and the change goes through the dedicated endpoint. RBAC for users grants verbs: [get, list, watch, update, patch] on databases; RBAC for the controller additionally grants the same on databases/status. Two distinct ClusterRoles, two distinct concerns.

The convention for the status structure is also worth knowing. The mature pattern is to expose status.observedGeneration (the spec.generation the controller last reconciled), status.conditions (an array of typed conditions like Ready, Progressing, Degraded, each with status, reason, message, and last transition time), and a small set of typed counters specific to the resource. The condition pattern is uniform across every Kubernetes API and lets generic UIs (Argo, Lens, kubectl describe) display state without knowing the resource's semantics.

A status update should always be done with a Patch or a server-side Apply, not a full Update. The reason is the same as elsewhere: optimistic concurrency. If two reconcile attempts race (one in flight, one re-enqueued from a watch event), the second's PUT .../status will fail with 409 if it has a stale resourceVersion. Patch, especially a JSON merge patch carrying a typed condition update, side-steps the conflict by saying "set this condition to this value, regardless of what else is in there".

Idempotency note. Never write your status more often than the conditions actually change. A reconciler that calls Status().Update on every loop iteration produces a write storm against etcd, even when the status is unchanged. Compare the proposed status to the current status with equality.Semantic.DeepEqual and skip the write if they match. This single check is one of the highest-leverage controller optimisations.
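The compare-then-skip check only works if recomputing the status does not itself change it, which is why the transition timestamp must move only when a condition's status flips. The sketch below shows both halves with local stand-in types; Condition, Status, setCondition, and needsWrite are hypothetical, and reflect.DeepEqual stands in for equality.Semantic.DeepEqual.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// Condition and Status are simplified stand-ins for the conventional status shape.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	LastTransitionTime time.Time
}

type Status struct {
	ObservedGeneration int64
	Conditions         []Condition
}

// setCondition returns a new slice with the condition set. The transition time
// is touched only when Status actually flips, so recomputing an unchanged
// condition yields an identical value.
func setCondition(conds []Condition, typ, status, reason string) []Condition {
	out := make([]Condition, 0, len(conds)+1)
	found := false
	for _, c := range conds {
		if c.Type == typ {
			found = true
			if c.Status != status {
				c.LastTransitionTime = time.Now()
			}
			c.Status, c.Reason = status, reason
		}
		out = append(out, c)
	}
	if !found {
		out = append(out, Condition{Type: typ, Status: status, Reason: reason, LastTransitionTime: time.Now()})
	}
	return out
}

// needsWrite reports whether the proposed status differs from what is stored.
func needsWrite(current, proposed Status) bool {
	return !reflect.DeepEqual(current, proposed)
}

func main() {
	current := Status{ObservedGeneration: 1, Conditions: setCondition(nil, "Ready", "True", "AllReplicasReady")}

	// A second reconcile computes the same answer: no write.
	proposed := Status{ObservedGeneration: 1, Conditions: setCondition(current.Conditions, "Ready", "True", "AllReplicasReady")}
	fmt.Println("write needed:", needsWrite(current, proposed))

	// The condition flips: now a write is justified.
	proposed.Conditions = setCondition(proposed.Conditions, "Ready", "False", "PodCrashLooping")
	fmt.Println("write needed:", needsWrite(current, proposed))
}
```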

Leader election — Lease objects and RunOrDie.

A controller that runs as a single replica is a single point of failure: if the pod dies, no reconciliation happens until the pod comes back. Running multiple replicas naively is worse: three replicas all watching the same Deployments, all running their reconciler in parallel, all racing each other on Patch calls. The Kubernetes-native solution is leader election: run N replicas, but only one of them is the active reconciler at any moment, with the rest in hot standby ready to take over within seconds if the leader dies. The mechanism is a Lease object in the coordination.k8s.io/v1 API group, claimed via optimistic-concurrency Update, refreshed every few seconds, and inherited by a challenger if the holder stops refreshing.

The library that implements this is k8s.io/client-go/tools/leaderelection, exposed in controller-runtime as manager.Options.LeaderElection = true. Under the hood it does what the architecture sub-page describes: every replica races to UPDATE a well-known Lease object with its own holderIdentity; etcd's transaction picks one winner; losers see 409 and back off; the winner updates renewTime every RetryPeriod to retain the lease, and steps down if no renewal succeeds within RenewDeadline. If the holder dies, after leaseDuration elapses without a renewal, the others race again. Failover is in the fifteen-to-forty-second range with default settings.

The diagram is the canonical timeline. Three replicas start. They each issue an UPDATE against the same Lease object, guarded by the resourceVersion they last read. etcd's transaction lets exactly one through; the other two see 409 Conflict and enter standby. Replica-A is now the leader; it renews the lease every RetryPeriod (two seconds with the defaults). The two standbys do not hold a watch on the Lease; they re-read it every RetryPeriod and see a fresh renewTime each time. When replica-A dies, its renewals stop; once the standbys have seen no change for LeaseDuration (fifteen seconds), replicas B and C race the same UPDATE; B happens to win; B is now the leader. Time-to-failover is the LeaseDuration, plus one or two RetryPeriods of jitter.
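The timeline can be replayed with a small simulation. Lease, LeaseStore, and TryAcquireOrRenew below are hypothetical toys that model only the compare-and-swap on resourceVersion and the expiry rule; they are not client-go's code. One simplification to be aware of: the sketch compares renewTime against a single simulated clock, whereas a real standby measures how long it has locally observed the record unchanged.

```go
package main

// Lease is a toy stand-in for a coordination.k8s.io/v1 Lease.
type Lease struct {
	Holder          string
	RenewTime       int64 // seconds on a simulated clock
	ResourceVersion int
}

// LeaseStore models the api-server's compare-and-swap on resourceVersion.
type LeaseStore struct{ lease Lease }

// Get returns the current lease by value, like a client read.
func (s *LeaseStore) Get() Lease { return s.lease }

// Update succeeds only if the caller read the latest version;
// a false return corresponds to a 409 Conflict.
func (s *LeaseStore) Update(l Lease) bool {
	if l.ResourceVersion != s.lease.ResourceVersion {
		return false
	}
	l.ResourceVersion++
	s.lease = l
	return true
}

// TryAcquireOrRenew is one attempt by replica id at time now, based on the
// snapshot it read earlier. The replica claims the lease if it is unheld,
// already held by id (a renewal), or not renewed for leaseDuration seconds.
func TryAcquireOrRenew(s *LeaseStore, seen Lease, id string, now, leaseDuration int64) bool {
	expired := now-seen.RenewTime >= leaseDuration
	if seen.Holder != "" && seen.Holder != id && !expired {
		return false // someone else holds a live lease: stay in standby
	}
	seen.Holder = id
	seen.RenewTime = now
	return s.Update(seen)
}
```

If three replicas read the same empty snapshot and then each call TryAcquireOrRenew, exactly one Update gets through, because the first success bumps the resourceVersion the other two are still holding.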

// The leader-election RunOrDie call — client-go style.
// In controller-runtime: manager.Options{LeaderElection: true, LeaderElectionID: "..."}.

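import (
    "context"
    "time"
    "github.com/google/uuid"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
    "k8s.io/klog/v2"
)

rl, err := resourcelock.New(
    resourcelock.LeasesResourceLock,
    "kube-system",                    // Lease namespace
    "my-controller",                   // Lease name
    cs.CoreV1(), cs.CoordinationV1(),
    resourcelock.ResourceLockConfig{
        Identity: hostname + "_" + uuid.New().String(),
    },
)
if err != nil {
    klog.Fatalf("creating resource lock: %v", err)
}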

leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
    Lock:          rl,
    LeaseDuration: 15 * time.Second,    // followers wait this long before challenging
    RenewDeadline: 10 * time.Second,    // leader must renew within this
    RetryPeriod:    2 * time.Second,    // frequency of acquire/renew attempts
    Callbacks: leaderelection.LeaderCallbacks{
        OnStartedLeading: func(ctx context.Context) {
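            // I am the leader. Run the manager; Start blocks until ctx is cancelled.
            if err := mgr.Start(ctx); err != nil {
                klog.Fatalf("manager exited: %v", err)
            }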
        },
        OnStoppedLeading: func() {
            // Lost the lease. Exit; the deployment will restart us.
            klog.Fatalf("lost leader lease")
        },
        OnNewLeader: func(id string) {
            klog.Infof("new leader: %s", id)
        },
    },
    ReleaseOnCancel: true,              // best-effort release on shutdown — speeds up failover
})

A few details to internalise. The Identity must be unique per process, typically hostname + "_" + uuid, so a restarted pod with the same hostname does not accidentally inherit a stale lease. The constraint RenewDeadline < LeaseDuration - RetryPeriod should hold; otherwise the leader can lose its lease while still believing it holds it (a "split brain" with the new leader). This is a conservative rule of thumb: client-go itself only rejects configurations where LeaseDuration <= RenewDeadline or RenewDeadline <= 1.2 × RetryPeriod. The default (15, 10, 2) satisfies all three; if you tune for faster failover, recompute the inequality. ReleaseOnCancel makes graceful shutdown release the lease immediately, cutting the failover from 15s to roughly the time a standby needs to notice the release and start its manager.
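When tuning the three timings, it helps to check them mechanically. ValidateTimings below is a hypothetical helper that takes the values in seconds and applies the rule of thumb from the paragraph above, plus the two checks client-go's leader-election config validation performs (LeaseDuration greater than RenewDeadline, and RenewDeadline greater than the jitter factor of 1.2 times RetryPeriod). Treat it as a sketch for reasoning about a configuration, not as a replacement for the library's own validation.

```go
package main

import "fmt"

// jitterFactor matches the 1.2 jitter client-go applies to retry intervals.
const jitterFactor = 1.2

// ValidateTimings checks a (LeaseDuration, RenewDeadline, RetryPeriod)
// triple, all in seconds. It returns nil when the triple is safe.
func ValidateTimings(leaseDuration, renewDeadline, retryPeriod float64) error {
	if leaseDuration <= renewDeadline {
		return fmt.Errorf("LeaseDuration (%vs) must be greater than RenewDeadline (%vs)",
			leaseDuration, renewDeadline)
	}
	if renewDeadline <= jitterFactor*retryPeriod {
		return fmt.Errorf("RenewDeadline (%vs) must be greater than %v x RetryPeriod (%vs)",
			renewDeadline, jitterFactor, retryPeriod)
	}
	// Conservative rule of thumb: leave at least one RetryPeriod of slack
	// between the leader giving up and a challenger being allowed to claim.
	if renewDeadline >= leaseDuration-retryPeriod {
		return fmt.Errorf("RenewDeadline (%vs) should be less than LeaseDuration - RetryPeriod (%vs)",
			renewDeadline, leaseDuration-retryPeriod)
	}
	return nil
}
```

With the defaults, ValidateTimings(15, 10, 2) returns nil. A faster-failover triple such as (6, 4, 1) also passes, while (15, 14, 2) is rejected by the rule of thumb because it leaves less than one RetryPeriod of slack.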

Operational note. Every leader-elected controller in the cluster shows up as a Lease. kubectl get leases -A lists them; the HOLDER column tells you which replica is currently leading. If two replicas of your operator both claim leadership, you have a configuration bug: most likely you are running the same operator twice with the same Lease name but in different namespaces, so each copy elects its own leader. The mirror-image bug is two different operators sharing a Lease name in the same namespace, in which case one of them never gets to lead.

controller-runtime + kubebuilder — a controller in ten minutes.

controller-runtime is the library; kubebuilder is the scaffolding tool that generates a project skeleton on top of it. Together they are the standard way to write a Kubernetes controller in 2026; raw client-go is still used inside the kubernetes/kubernetes tree but for everything outside it (every operator, every CRD reconciler, every internal platform tool) you reach for these. kubebuilder generates a Go module, a Dockerfile, a Makefile, the CRD manifests, the RBAC, the controller skeleton, and a webhook skeleton if you ask for one. From kubebuilder init to a controller running against a cluster is roughly ten minutes of typing.

The project flow is exact. kubebuilder init --domain example.com --repo example.com/myop bootstraps a Go module with all of the wiring. kubebuilder create api --group apps --version v1 --kind Database adds a CRD type (api/v1/database_types.go) and a controller (internal/controller/database_controller.go). The CRD types include both a DatabaseSpec (the user-edited half) and a DatabaseStatus (the controller-edited half). The kubebuilder markers on those struct fields drive the generation of the CRD's OpenAPI schema and its validation rules, while the +kubebuilder:rbac markers on the controller drive the RBAC ClusterRole. You edit the types and the Reconcile body; the rest is generated by make manifests and make generate.

// internal/controller/database_controller.go — the kubebuilder skeleton, fleshed out.

package controller

import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/equality"
    "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    "sigs.k8s.io/controller-runtime/pkg/log"
    examplev1 "example.com/myop/api/v1"
)

// DatabaseReconciler reconciles a Database object.
type DatabaseReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// kubebuilder annotations — these drive RBAC and CRD generation. Do not delete them.
// +kubebuilder:rbac:groups=example.com,resources=databases,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=example.com,resources=databases/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=example.com,resources=databases/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch
// +kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch;create;update;patch

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // 1. Fetch the Database from the cache.
    var db examplev1.Database
    if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Handle deletion via finalizer.
    const fin = "databases.example.com/finalizer"
    if !db.DeletionTimestamp.IsZero() {
        if controllerutil.ContainsFinalizer(&db, fin) {
            if err := r.deprovision(ctx, &db); err != nil {
                return ctrl.Result{}, err
            }
            controllerutil.RemoveFinalizer(&db, fin)
            return ctrl.Result{}, r.Update(ctx, &db)
        }
        return ctrl.Result{}, nil
    }
    if !controllerutil.ContainsFinalizer(&db, fin) {
        controllerutil.AddFinalizer(&db, fin)
        return ctrl.Result{}, r.Update(ctx, &db)
    }

    // 3. Reconcile children — Deployment + Secret. SetControllerReference
    // stamps an ownerRef so GC will clean them up when the Database is deleted.
    desiredDep := r.buildDeployment(&db)
    if err := controllerutil.SetControllerReference(&db, desiredDep, r.Scheme); err != nil {
        return ctrl.Result{}, err
    }
    var current appsv1.Deployment
    err := r.Get(ctx, client.ObjectKeyFromObject(desiredDep), &current)
    switch {
    case errors.IsNotFound(err):
        if err := r.Create(ctx, desiredDep); err != nil {
            return ctrl.Result{}, err
        }
    case err != nil:
        return ctrl.Result{}, err
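    default:
        // Write only on drift, so a converged Deployment costs no api-server call.
        // DeepDerivative ignores fields the desired spec leaves unset (server defaults).
        if !equality.Semantic.DeepDerivative(desiredDep.Spec, current.Spec) {
            current.Spec = desiredDep.Spec
            if err := r.Update(ctx, &current); err != nil {
                return ctrl.Result{}, err
            }
        }
    }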

    // 4. Patch status, and only if it changed (idempotency).
    desiredStatus := computeStatus(&db, &current)
    if !equality.Semantic.DeepEqual(db.Status, desiredStatus) {
        base := db.DeepCopy() // snapshot to diff against for the merge patch
        db.Status = desiredStatus
        if err := r.Status().Patch(ctx, &db, client.MergeFrom(base)); err != nil {
            return ctrl.Result{}, err
        }
    }

    logger.Info("reconciled", "database", db.Name, "ready", current.Status.ReadyReplicas)
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil   // periodic re-check
}

// SetupWithManager wires this reconciler into the controller-runtime manager.
// .Owns() registers a watch on Deployments owned by a Database, so any change
// to a child triggers a reconcile of its parent. This is the canonical way
// to react to child status changes without polling.
func (r *DatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&examplev1.Database{}).
        Owns(&appsv1.Deployment{}).
        Owns(&corev1.Secret{}).
        Complete(r)
}

The Reconcile body above is the full surface of a real-world operator: get, finalizer, child reconcile with owner reference, status update, scheduled re-check. It is roughly forty lines of meaningful code; the boilerplate (struct definition, kubebuilder annotations, SetupWithManager) is another twenty. Everything else (informers, listers, work queues, rate limiting, leader election, the manager that assembles them) is hidden behind the ctrl.NewControllerManagedBy(mgr) builder. You configure the manager once in main.go with manager.Options{LeaderElection: true, LeaderElectionID: "..."} and you are done.

The kubebuilder RBAC annotations are the other piece worth understanding. Each // +kubebuilder:rbac comment is parsed by the controller-gen tool and emitted as a Role or ClusterRole in config/rbac/role.yaml. The five lines in the example translate to: full CRUD on Database; get, update, and patch on Database/status; update on Database/finalizers; and everything except delete on Deployment and on Secret, because deleting children is left to the garbage collector via owner references. make manifests regenerates them. Forgetting an annotation is the most common cause of "Forbidden" errors when your operator runs in cluster, because the dev environment usually has cluster-admin and hides the gap.

The fastest dev loop. make install applies the CRDs to the current kube-context, make run runs your controller locally against that cluster (no in-cluster deployment needed), kubectl apply -f config/samples/example_v1_database.yaml creates a sample CR. You can iterate on the Reconcile body, kill the local process, re-run, and the controller picks up where it left off. From scaffold to first working reconcile is under ten minutes.

Keep going.

Architecture

Eight processes, one storage primitive. The control plane and the data plane, port by port.

The lifecycle of kubectl apply

Twelve hops from the keystroke to the running pod, named, timed, explained.

Pod scheduling, end to end

From pending Pod to running container, through the scheduler framework and the kubelet SyncLoop.

Rollout simulator

Watch a Deployment controller reconcile in real time. maxSurge / maxUnavailable, dialled live.
