Sub-page 13 · for controller + operator authors

Kubernetes internals · Informers

One watch per resource,
one cache per process.

Every Kubernetes controller you have ever read about sits on top of a single, unglamorous data structure: a thread-safe in-memory copy of every object it cares about, kept hot by a long-lived HTTP watch against the api-server. That data structure is the informer. It is also the entire reason the cluster scales beyond a few hundred objects.

This page is a vertical slice through client-go's cache package: Reflector, DeltaFIFO, Indexer, Lister, workqueue, resync, BOOKMARK, the watch wire format, custom indexers, and the memory cost of running them at scale. Roughly 4,400 words. Pair it with the controllers sub-page for the reconcile-loop view, and the api-server sub-page for the watch cache on the server side.

Why a local cache exists at all.

Imagine, briefly, a Kubernetes cluster without informers. Every controller (the Deployment controller, the ReplicaSet controller, the Endpoint controller, the GarbageCollector, your custom CRD operator, all forty-odd of them) is, fundamentally, a loop that wakes up every few seconds, asks the api-server for the current state of some set of objects, decides what to change, and writes it back. The naive implementation is to do a fresh List on every iteration: GET /api/v1/pods, get back fifty thousand pods, walk them, react. With one controller, this is fine. With forty controllers across a cluster of fifty thousand pods, you have just asked the api-server to serialise two million pod objects on every polling pass, every few seconds, forever, before anything else happens.

The api-server is the only process that talks to etcd, and etcd is the only process that persists state. Every byte read by every controller in the cluster has to come out of one api-server's HTTPS socket. With list-only controllers, the api-server's CPU profile becomes one giant JSON-encoding call; etcd's network becomes a flat line at line-rate; and the entire cluster collapses into a thrash where nothing else can get a request in edgewise. This is not theoretical. It is exactly what early Kubernetes prototypes looked like, and it is why client-go/tools/cache exists.

The fix has the obvious shape: list once, then subscribe to changes. The api-server's ?watch=true query param turns a List endpoint into a long-lived streaming HTTP response of typed events (ADDED, MODIFIED, DELETED), one per state change, each tagged with a revision called resourceVersion. A controller does one big List at startup to get a baseline (fifty thousand pods, once), then holds the watch open and consumes the trickle of changes (a few hundred per second on a busy cluster). The api-server's CPU drops by orders of magnitude. etcd's network drops with it. The math works.

But a raw watch is not enough. Watches break — TCP resets, api-server rolling restarts, idle-timeout expirations, the connection's resourceVersion falling out of the watch cache window. Watches deliver events in order, but they do not tell you "the current state of pod web-7d8"; they tell you "a sequence of changes". And every controller that cares about Pods would need its own copy of the bookkeeping to recover from a broken watch and reassemble current state from the event stream. Repeating this code forty times in forty controllers is a recipe for forty subtly-different bugs.

The informer is the package that solves all of this exactly once. It runs the list-then-watch dance, maintains an in-memory snapshot of the resource set, recovers from broken watches by relisting, deduplicates and orders deltas, and exposes two interfaces to the controller: a Lister for synchronous Get-and-List queries against the cached snapshot, and a stream of event handlers for reacting to changes. The Lister calls never hit the network. The event handler calls fire from a single goroutine per resource type, in order, with at-least-once semantics. This is the read path for every modern Kubernetes controller, full stop.

The trade is straightforward. You spend memory (typically tens to hundreds of megabytes per controller process, depending on how many resource types you watch) to save effectively all of the api-server's CPU and bandwidth. On a cluster with five thousand nodes and a quarter million Pods, this trade is the difference between a working cluster and one that catches fire every time you restart a controller. The informer is not a nice-to-have; it is the architectural mechanism by which Kubernetes's api-server-as-bottleneck model survives contact with reality.

The number to internalise. Every client-go process saves the api-server roughly its working-set worth of egress per second by running an informer instead of polling. Multiplied across ~30 controllers in kube-controller-manager alone, that is the difference between a working cluster and an unbootable one.

Anatomy. ListerWatcher, Reflector, DeltaFIFO, Indexer, Lister, queue.

A SharedInformer is six co-operating components in a precise pipeline. Going left to right on the data-flow diagram below: a ListerWatcher is the small interface (List(opts) (Object, error) + Watch(opts) (watch.Interface, error)) that knows how to call the api-server for one resource type. The Reflector is a goroutine that drives the ListerWatcher: it issues the initial List, walks every object into the FIFO as a Sync delta, then opens a Watch and walks every event into the FIFO as Added / Updated / Deleted deltas. If the watch breaks, the Reflector relists. It runs forever.

The DeltaFIFO is, despite the name, not a plain FIFO. It is a per-key queue of deltas with a narrow form of deduplication. For any given object key (namespace/name), the queue holds an ordered list of pending deltas; if a new delta arrives for a key already in the queue, it is appended to that key's list, and the only compression applied is that a Deleted directly following another Deleted collapses into one. The FIFO is the synchronisation point between the Reflector goroutine (producer) and the Indexer's process-loop goroutine (consumer). It is thread-safe and has explicit semantics for resyncs.

The Indexer is the actual cache: a thread-safe map from key to object, augmented by one or more secondary indexes (more on those in Part 07). When the FIFO emits a delta, the Indexer applies it to its store: Added inserts, Updated replaces, Deleted removes. After each delta is applied, the informer's internal processLoop distributes a corresponding event to every registered handler. The Lister is a small façade over the Indexer that exposes typed Get(name) / List(selector) methods to consumers — the controller never touches the Indexer directly; it goes through the Lister.

The workqueue is the controller's, not the informer's, but it sits at the end of the pipeline because event handlers almost always do exactly one thing: extract the object's cache key with cache.MetaNamespaceKeyFunc and push it onto the workqueue. The workqueue is rate-limited, deduplicating, and goroutine-safe; it is the thing that drives the controller's reconciler. Decoupling event distribution from reconciliation here is critical: a Reconcile that takes 200ms must not block the next watch event, and a flapping resource must not produce a flapping reconcile.
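The deduplicating behaviour described above can be sketched in a few lines. This is a toy, single-goroutine model written only for illustration, not client-go's workqueue: the type dedupQueue and its methods are hypothetical names, and rate limiting and locking are left out on purpose.

```go
package main

import "fmt"

// dedupQueue is a toy stand-in for a deduplicating work queue:
// a key that is already pending is not queued a second time.
type dedupQueue struct {
	order   []string            // pending keys in arrival order
	pending map[string]struct{} // set of keys currently in order
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]struct{})}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *dedupQueue) Add(key string) {
	if _, ok := q.pending[key]; ok {
		return
	}
	q.pending[key] = struct{}{}
	q.order = append(q.order, key)
}

// Get pops the oldest pending key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key, q.order = q.order[0], q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many distinct keys are pending.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	// A flapping pod fires five update events before the reconciler runs.
	for i := 0; i < 5; i++ {
		q.Add("prod/web-7d8")
	}
	q.Add("prod/api-5c1")
	fmt.Println("pending keys:", q.Len())
	for {
		key, ok := q.Get()
		if !ok {
			break
		}
		fmt.Println("reconcile", key)
	}
}
```

Because the queue carries keys rather than objects, the reconciler always reads the current object from the Lister when it finally runs, which is why a burst of events costs one reconcile instead of five.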

The whole pipeline is single-threaded per resource type on the consumer side. There is one Reflector goroutine, one processLoop goroutine, and one (or N, if you parallelise the workqueue) reconciler goroutine. The producer side (the api-server) is, of course, multi-threaded, but the FIFO serialises events into a strict per-key order before they reach your code. This is what gives you the property "for any given pod, my reconciler will see its lifecycle events in causal order, even if my reconciler is slow".

A small but important detail: the informer is constructed with a resync interval, typically zero (disabled) or in the range of ten minutes to an hour. When non-zero, the informer periodically re-emits an Update event for every object currently in the Indexer, without consulting the api-server. The point is not to refresh state (the watch already does that); the point is to give the controller a chance to re-converge on objects it might have failed to fully reconcile last time, defending against bugs in the reconciler. In modern controller-runtime code, resync is usually disabled and replaced with explicit RequeueAfter from inside Reconcile, which is more precise.

// The ListerWatcher interface — five lines, the entire api-server contract.
// staging/src/k8s.io/client-go/tools/cache/listwatch.go

type ListerWatcher interface {
    List(opts metav1.ListOptions) (runtime.Object, error)        // initial snapshot
    Watch(opts metav1.ListOptions) (watch.Interface, error)      // stream of events
}

// The default factory builds one for any typed client:
lw := cache.NewListWatchFromClient(
    clientset.CoreV1().RESTClient(),
    "pods", namespace, fields.Everything())

Implementation pointer. Every interesting thing in this pipeline lives in client-go/tools/cache. It is roughly 8,000 lines of Go and worth reading end-to-end at least once. Start at shared_informer.go, then reflector.go, then delta_fifo.go, then thread_safe_store.go.

The watch protocol. ADDED, MODIFIED, DELETED, BOOKMARK, ERROR.

The watch is the lifeline of every informer, and the wire format is unusually simple. A client issues an HTTP GET to a List endpoint with ?watch=true&resourceVersion=N. The api-server responds 200 OK and keeps the response open (Transfer-Encoding: chunked on HTTP/1.1, an open-ended stream of DATA frames on HTTP/2), writing a stream of newline-delimited JSON objects of the form {"type":"...","object":{...}}. There are five values for type: ADDED, MODIFIED, DELETED, BOOKMARK, and ERROR. The first three carry a full object; BOOKMARK carries an empty object whose only meaningful field is resourceVersion; ERROR carries a Status object describing why the stream is being terminated.

The semantics are precise. ADDED is emitted exactly once for each object that newly enters the watched scope, either created during the watch or already present when a watch is opened without a starting resourceVersion (the server then replays current state as synthetic ADDED events). MODIFIED is emitted on every successful update (anything that bumps the object's resourceVersion). DELETED is emitted exactly once when the object is removed, and the carried object is the last-observed state, not a tombstone. BOOKMARK is informational: it carries no payload but advances the consumer's resourceVersion cursor so a reconnect can resume from a recent point even if no real events have occurred. ERROR is terminal: the server is closing the stream, and the client must reconnect (with a new List+Watch if the error is "too old resourceVersion").

resourceVersion deserves its own paragraph. In practice it is etcd's global, monotonically increasing revision counter (encoded as a string for JSON-number-precision reasons), bumped by every successful write across the entire cluster; the API contract nonetheless asks clients to treat it as opaque and to compare it only for equality. Every object carries one. Every watch event carries one. A watch is logically "give me everything that happened after resourceVersion N", and the api-server guarantees that the events you receive are in resourceVersion order. The api-server does not guarantee that you see every resourceVersion, only that you see the events for resources you are watching, with their resourceVersions in order. Gaps are normal and meaningful: revision 487293 might be on a Pod you do not watch, and revision 487294 on the Deployment you do.

There is one important corner case: "too old resourceVersion", surfaced as 410 Gone on a watch attempt or as an ERROR event mid-stream. The api-server keeps a bounded watch cache, typically the last 1,000 to 5,000 events per resource type, and if you try to resume from a resourceVersion older than the cache's oldest entry, it cannot tell you what happened in between, so it gives up. Your only recourse is to start over: do a fresh List (which gives you a baseline at the current revision), then open a new Watch from there. This is exactly what the Reflector's ListAndWatch loop does whenever the watch expires.
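The resume-or-relist decision is easier to see with a toy model of the bounded window. The sketch below is not the api-server's watch cache; watchCache, record, since and errGone are hypothetical names, revisions are plain integers, and the window holds three events so the eviction is visible.

```go
package main

import (
	"errors"
	"fmt"
)

// errGone models the "too old resourceVersion" (410 Gone) answer.
var errGone = errors.New("410 Gone: too old resource version")

// event is one change recorded at a given revision.
type event struct {
	rv  int
	key string
}

// watchCache is a toy bounded history of recent events.
type watchCache struct {
	capacity  int
	events    []event // oldest first, at most capacity entries
	evictedRV int     // highest revision already dropped from the window
}

// record appends an event and evicts the oldest once the window is full.
func (c *watchCache) record(e event) {
	c.events = append(c.events, e)
	if len(c.events) > c.capacity {
		c.evictedRV = c.events[0].rv
		c.events = c.events[1:]
	}
}

// since returns every event after rv, or errGone if events newer than rv
// have already been evicted and the gap cannot be reconstructed.
func (c *watchCache) since(rv int) ([]event, error) {
	if rv < c.evictedRV {
		return nil, errGone
	}
	var out []event
	for _, e := range c.events {
		if e.rv > rv {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	c := &watchCache{capacity: 3}
	for rv := 1; rv <= 3; rv++ {
		c.record(event{rv: rv, key: "prod/web-7d8"})
	}
	cursor := 1 // the client saw revision 1, then lost its connection
	evs, err := c.since(cursor)
	fmt.Println("resume from", cursor, "->", len(evs), "events, err:", err)

	for rv := 4; rv <= 8; rv++ { // the cluster moves on while the client is away
		c.record(event{rv: rv, key: "prod/web-7d8"})
	}
	cursor = 3
	if _, err := c.since(cursor); errors.Is(err, errGone) {
		// No way to fill the gap: relist for a fresh baseline, then watch again.
		cursor = 8
		fmt.Println("expired; relisted, new cursor", cursor)
	}
}
```

The first reconnect is cheap because the cursor is still inside the window; the second is not, and the only safe answer is a fresh baseline.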

One subtlety that bites every controller author at least once: the object delivered in a MODIFIED event is the new state, not a diff against the old state. If you want to know what changed, you have to compare the new object to whatever you have in your cache (or in memory in your handler). The informer machinery does this for you — the UpdateFunc handler signature is func(oldObj, newObj interface{}), with oldObj populated from the Indexer's previous value. Every other language binding for the Kubernetes API gets this wrong at least once; client-go is the reference.

# A watch on the wire: the streamed response from /api/v1/pods?watch=true&resourceVersion=487291

{"type":"ADDED","object":{"kind":"Pod","apiVersion":"v1","metadata":{"name":"web-7d8","namespace":"prod","resourceVersion":"487292","uid":"a3..."},"spec":{...},"status":{...}}}
{"type":"MODIFIED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487310"},"status":{"phase":"Running"}}}
# 90s idle — server emits a BOOKMARK so the consumer's cursor advances
{"type":"BOOKMARK","object":{"kind":"Pod","metadata":{"resourceVersion":"487420"}}}
{"type":"DELETED","object":{"kind":"Pod","metadata":{"name":"web-7d8","resourceVersion":"487501"}}}
# Cache rotated past us — server terminates the stream
{"type":"ERROR","object":{"kind":"Status","status":"Failure","message":"too old resource version: 487501 (490012)","reason":"Expired","code":410}}

The Reflector reads this stream a line at a time, decodes each event, and pushes it into the DeltaFIFO. The single piece of state it tracks across reconnects is the lastSyncResourceVersion, the resourceVersion of the last event it successfully delivered. On a clean reconnect, it sends that value as ?resourceVersion=N; the api-server resumes from there. On a 410, it discards the value, issues a fresh List (which returns a new resourceVersion as the baseline), and watches forward from that. The DeltaFIFO is told about the relist via a Replace operation, which atomically swaps in the new full set and emits Sync events for every object, preserving the controller's invariant that it has seen something for every known key.
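A minimal sketch of that read loop, using only the standard library, might look like the following. It is not the Reflector's decoder: watchEvent, objectMeta and consume are hypothetical names, the stream is a string rather than an HTTP body, and bufio.Scanner's default line limit would be too small for large real objects.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// watchEvent mirrors the {"type":..., "object":...} envelope on the wire.
type watchEvent struct {
	Type   string          `json:"type"`
	Object json.RawMessage `json:"object"`
}

// objectMeta picks out only the fields this sketch needs.
type objectMeta struct {
	Metadata struct {
		Name            string `json:"name"`
		ResourceVersion string `json:"resourceVersion"`
	} `json:"metadata"`
	Code int `json:"code"` // set on the Status object carried by ERROR events
}

// consume reads newline-delimited events and returns the last cursor seen,
// plus whether the stream ended with a 410 that forces a relist.
func consume(stream string, cursor string) (string, bool, error) {
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var ev watchEvent
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return cursor, false, err
		}
		var obj objectMeta
		if err := json.Unmarshal(ev.Object, &obj); err != nil {
			return cursor, false, err
		}
		switch ev.Type {
		case "ADDED", "MODIFIED", "DELETED":
			fmt.Println(ev.Type, obj.Metadata.Name, "rv", obj.Metadata.ResourceVersion)
			cursor = obj.Metadata.ResourceVersion
		case "BOOKMARK":
			cursor = obj.Metadata.ResourceVersion // no payload, cursor only
		case "ERROR":
			return cursor, obj.Code == 410, nil
		}
	}
	return cursor, false, sc.Err()
}

func main() {
	stream := `{"type":"ADDED","object":{"metadata":{"name":"web-7d8","resourceVersion":"487292"}}}
{"type":"BOOKMARK","object":{"metadata":{"resourceVersion":"487420"}}}
{"type":"ERROR","object":{"kind":"Status","code":410,"reason":"Expired"}}`
	cursor, relist, err := consume(stream, "487291")
	fmt.Println("cursor:", cursor, "relist:", relist, "err:", err)
}
```

The cursor is treated as an opaque string throughout; the sketch never parses or compares it numerically.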

Wire-level debugging trick. kubectl get pods -w --output-watch-events -o json | jq -c . prints the same event stream client-go consumes, type envelope included. If your controller is misbehaving in production, this is sometimes the fastest way to confirm whether the events are actually reaching the client or being dropped somewhere in between.

list-then-watch. The canonical pattern, the relist storm, the BOOKMARK rescue.

The pattern at the heart of every informer is a four-step loop: (1) List the resource at some revision N to get a complete baseline; (2) Watch from N+1 onward, consuming events in order; (3) if the Watch errors, decide whether to resume or relist; (4) if relisting, go back to step (1). This dance is the only correct way to combine a snapshot read with an event stream and end up with a coherent, eventually-consistent local cache. It is what Kubernetes's informers do, and it is what every other system that has tried to do the same thing (etcd's own watcher API, Consul's blocking queries, ZooKeeper's ephemeral nodes) converges on by version 2.

The trap, and it is a famous one, is the relist storm. Imagine ten thousand kubelets, each running its own informer on Node + Pod, all maintaining a long-lived watch against the api-server. The api-server is rolled (a control-plane upgrade, a deploy, a crash). Every watch breaks at exactly the same instant. Every kubelet's Reflector sees the disconnect, and on its next iteration of ListAndWatch it issues a fresh List against the new api-server. The api-server is now serving ten thousand simultaneous List(50,000 pods) requests, which is exactly the load profile that informers were designed to avoid. The api-server's CPU spikes; etcd's Range scan saturates; the cluster is on its knees for two to ten minutes.

There are three mitigations, all of them now standard. First, the watch cache on the api-server: a per-resource ring buffer of recent events, so that most reconnects can resume from a remembered resourceVersion without going to etcd at all. Second, the BOOKMARK event: introduced in KEP-956, it is a no-op event the server emits periodically (default every 60 seconds) carrying a current resourceVersion. The point is to keep the consumer's cursor close to the live edge of the watch cache: even if the consumer has been idle for an hour because nothing in its scope changed, its lastSyncResourceVersion is no more than 60 seconds stale, so on reconnect the api-server can almost certainly resume from cache instead of forcing a relist.

Third, jittered backoff: the Reflector's reconnect logic does not retry immediately on failure. It uses a default of 1-second backoff with jitter, climbing on repeated failures. Combined with BOOKMARK, this means that even if all ten thousand kubelets disconnect simultaneously, they will reconnect over a 30-second window with mostly-warm cursors, and the api-server will service almost all of them from its watch cache. The relist storm is, in 2026, mostly a historical concern, but if you write a custom controller and use a custom Reflector (or, worse, your own watch loop), you have to put these mitigations back yourself, and people regularly forget. The default SharedInformerFactory handles this for you. Custom code does not.
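The shape of a jittered exponential backoff is simple enough to sketch. The numbers here are illustrative assumptions, not the Reflector's actual constants: a 1-second base, doubling per failure, a 30-second cap, and a jitter factor of 1.0. backoffDelay and newRand are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newRand returns a seeded source so that a run is reproducible.
func newRand(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// backoffDelay returns the wait before retry number attempt (0-based):
// base doubled per attempt, capped at max, then stretched by a random
// factor in [1, 1+jitter) so that clients do not retry in lockstep.
func backoffDelay(attempt int, base, max time.Duration, jitter float64, rnd *rand.Rand) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d + time.Duration(rnd.Float64()*jitter*float64(d))
}

func main() {
	rnd := newRand(1)
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println("attempt", attempt, "wait", backoffDelay(attempt, time.Second, 30*time.Second, 1.0, rnd))
	}
}
```

With ten thousand clients each drawing an independent random factor, reconnects that would have landed in the same millisecond are smeared across the whole delay window.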

The new frontier is WatchList, introduced in KEP-3157 and promoted to beta in 1.32. It folds steps (1) and (2) into a single watch stream: instead of issuing a List request and then opening a Watch, the client opens a Watch with sendInitialEvents=true, and the api-server streams every existing object as a sequence of ADDED events, terminated by a synthetic BOOKMARK whose annotations["k8s.io/initial-events-end"] = "true". This avoids the "huge JSON List then start watching" memory spike that plagued informer startup on large clusters; the cache is built incrementally. Recent client-go releases can opt into this and fall back to the classic List-then-Watch when the api-server does not support it.
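On the consumer side, the only new logic WatchList needs is recognising the end-of-snapshot marker. A toy version, with a simplified event struct and a hypothetical buildCache function standing in for the real decoding and storage:

```go
package main

import "fmt"

// event is a simplified watch event; Annotations is only set on bookmarks.
type event struct {
	Type        string
	Key         string
	Annotations map[string]string
}

// initialEventsEnd is the annotation key quoted in the text above.
const initialEventsEnd = "k8s.io/initial-events-end"

// buildCache folds a WatchList-style stream into a cache and reports
// whether the end-of-initial-state bookmark was seen.
func buildCache(stream []event) (cache map[string]bool, synced bool) {
	cache = make(map[string]bool)
	for _, ev := range stream {
		switch ev.Type {
		case "ADDED", "MODIFIED":
			cache[ev.Key] = true
		case "DELETED":
			delete(cache, ev.Key)
		case "BOOKMARK":
			if ev.Annotations[initialEventsEnd] == "true" {
				synced = true // everything before this point was the snapshot
			}
		}
	}
	return cache, synced
}

func main() {
	stream := []event{
		{Type: "ADDED", Key: "prod/web-7d8"},
		{Type: "ADDED", Key: "prod/api-5c1"},
		{Type: "BOOKMARK", Annotations: map[string]string{initialEventsEnd: "true"}},
		{Type: "DELETED", Key: "prod/api-5c1"}, // a live change after the snapshot
	}
	cache, synced := buildCache(stream)
	fmt.Println("objects:", len(cache), "synced:", synced)
}
```

The cache grows one object at a time, so peak memory tracks the size of the cache itself rather than the cache plus a fully buffered List response.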

// Reflector pseudocode — list, watch, relist on error.
// staging/src/k8s.io/client-go/tools/cache/reflector.go (abridged)

func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
    // step 1: snapshot at the current revision
    list, err := r.listerWatcher.List(metav1.ListOptions{ResourceVersion: "0"})
    if err != nil { return err }
    rv := list.(metav1.ListMetaAccessor).GetListMeta().GetResourceVersion()
    items, err := meta.ExtractList(list)    // runtime.Object has no Items field
    if err != nil { return err }
    r.syncWith(items, rv)                   // emit Sync deltas for the snapshot
    r.setLastSyncResourceVersion(rv)

    // step 2: watch from the snapshot's RV onwards, forever
    for {
        w, err := r.listerWatcher.Watch(metav1.ListOptions{
            ResourceVersion:     rv,
            AllowWatchBookmarks: true,    // please send BOOKMARKs
            TimeoutSeconds:      ptr.To(int64(5 * 60)),
        })
        if err != nil { return err }

        for ev := range w.ResultChan() {
            switch ev.Type {
            case watch.Added:
                r.store.Add(ev.Object)       // Added delta into the FIFO
            case watch.Modified:
                r.store.Update(ev.Object)    // Updated delta
            case watch.Deleted:
                r.store.Delete(ev.Object)    // Deleted delta
            case watch.Bookmark:
                // no payload; only the cursor below advances
            case watch.Error:
                return apierrors.FromObject(ev.Object)  // bubble up; outer loop relists
            }
            rv = ev.Object.(metav1.Object).GetResourceVersion()
            r.setLastSyncResourceVersion(rv)
        }
        // channel closed — server timed us out, reopen with current rv
    }
}

Operational rule — never write your own watch loop in a production controller. The four edge cases that ListAndWatch handles (410 Gone, idle timeout, BOOKMARK, error events) are exactly the ones every hand-rolled watcher gets wrong, usually in a way that does not surface until the cluster is under load. Use cache.NewSharedInformer or controller-runtime's manager.

DeltaFIFO and Indexer. Dedup, ordering, thread safety.

The DeltaFIFO is the most underrated component in the entire pipeline. Most readers bounce off the name: "FIFO" suggests something dumb, but the real data structure is map[key][]Delta with insertion-order tracking on the outer map. For each known key, the FIFO holds an ordered list of pending deltas (Added, Updated, Deleted, Sync, Replaced). When a new delta arrives for a key, the FIFO appends it to that key's slot, collapsing only a Deleted that directly follows another Deleted. When the consumer pops, it gets the key plus the entire delta list since the last pop, which is what allows the Indexer to apply state changes correctly even if the consumer was slow.

The deduplication semantics matter, and they are narrower than the name suggests. If a Pod is Updated five times in 100ms while the consumer is busy, the FIFO stores five Updated deltas under that one key and hands them over together on the next pop; the only pair it collapses is a Deleted directly followed by another Deleted. A sequence of Added, Updated, Updated, Deleted arrives at the consumer intact, so the Indexer and the event handlers see the full lifecycle in order. The coalescing that controllers rely on happens one stage later: five handler calls enqueue the same key, the workqueue keeps a single pending entry, and the reconciler reads the latest object from the Lister. This is what gives controllers their characteristic property: "if I am slow, I will see the most recent state, but I will not see every transient intermediate state".
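A toy per-key delta queue makes the data structure concrete. It is modelled on the upstream dedup rule as I understand it (only back-to-back Deleted deltas collapse), it is not thread-safe, and deltaFIFO, push and pop are hypothetical names rather than client-go's API.

```go
package main

import "fmt"

type deltaType string

const (
	added   deltaType = "Added"
	updated deltaType = "Updated"
	deleted deltaType = "Deleted"
)

type delta struct {
	Type deltaType
	Obj  string // stand-in for the object payload
}

// deltaFIFO keeps pending deltas per key plus the order keys were first queued.
type deltaFIFO struct {
	items map[string][]delta
	queue []string
}

func newDeltaFIFO() *deltaFIFO { return &deltaFIFO{items: make(map[string][]delta)} }

// push appends a delta for key, collapsing only back-to-back Deleted deltas.
func (f *deltaFIFO) push(key string, d delta) {
	ds := f.items[key]
	if n := len(ds); n > 0 && ds[n-1].Type == deleted && d.Type == deleted {
		return // duplicate deletion: this toy keeps the one already queued
	}
	if _, ok := f.items[key]; !ok {
		f.queue = append(f.queue, key) // first pending delta for this key
	}
	f.items[key] = append(ds, d)
}

// pop removes the oldest key and returns every delta accumulated for it.
func (f *deltaFIFO) pop() (string, []delta, bool) {
	if len(f.queue) == 0 {
		return "", nil, false
	}
	key := f.queue[0]
	f.queue = f.queue[1:]
	ds := f.items[key]
	delete(f.items, key)
	return key, ds, true
}

func main() {
	f := newDeltaFIFO()
	f.push("prod/web-7d8", delta{added, "v1"})
	f.push("prod/api-5c1", delta{added, "v1"})
	f.push("prod/web-7d8", delta{updated, "v2"})
	f.push("prod/web-7d8", delta{deleted, "v2"})
	f.push("prod/web-7d8", delta{deleted, "v2"}) // collapsed
	for {
		key, ds, ok := f.pop()
		if !ok {
			break
		}
		fmt.Println(key, ds)
	}
}
```

Keys come out in the order they first became pending, and each pop delivers that key's whole backlog at once, which is the per-key ordering guarantee in miniature.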

Thread-safety in client-go's cache is achieved with a single lock per data structure (a sync.RWMutex in both the FIFO and the ThreadSafeStore), held for the duration of any operation. This is unfashionable: modern Go idioms favour channel-based concurrency, but for the workload (small, frequent reads; less-frequent writes; no need for fine-grained concurrency) it is exactly right. Lock contention is measurable on very busy informers (think: an EndpointSlice informer in a 50,000-Service cluster), but in practice the cost is dominated by JSON deserialisation upstream, not by lock acquisition.

The Indexer is an extension of the FIFO's consumer side. Where the FIFO produces (key, []Delta) tuples, the Indexer applies them to a ThreadSafeStore: a primary map[key]Object plus zero or more secondary indexes. The primary store gives you O(1) Get-by-key. Secondary indexes give you O(1) lookup by other keys, most commonly by namespace, but optionally by label selector, owner-ref, or any custom key function you supply. The signature is exactly type IndexFunc func(obj interface{}) ([]string, error): given an object, return the set of index keys it should be filed under. The Indexer maintains map[indexName]map[indexKey]sets.String internally, and updates them all atomically with the primary store on every delta application.
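The nested-map bookkeeping is worth seeing once in code. This is a toy, unlocked store with a simplified indexFunc over a local pod struct; store, newStore and ByIndex here are hypothetical and only mirror the shape described above.

```go
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	Namespace, Name string
	Labels          map[string]string
}

// indexFunc returns the index values an object should be filed under.
type indexFunc func(p pod) []string

// store is a primary key->object map plus named secondary indexes.
type store struct {
	items    map[string]pod
	indexers map[string]indexFunc
	indices  map[string]map[string]map[string]struct{} // name -> value -> set of keys
}

func newStore(indexers map[string]indexFunc) *store {
	s := &store{items: map[string]pod{}, indexers: indexers,
		indices: map[string]map[string]map[string]struct{}{}}
	for name := range indexers {
		s.indices[name] = map[string]map[string]struct{}{}
	}
	return s
}

// Add inserts or replaces an object and refreshes every secondary index.
func (s *store) Add(p pod) {
	key := p.Namespace + "/" + p.Name
	if old, ok := s.items[key]; ok {
		s.unindex(key, old)
	}
	s.items[key] = p
	for name, fn := range s.indexers {
		for _, v := range fn(p) {
			if s.indices[name][v] == nil {
				s.indices[name][v] = map[string]struct{}{}
			}
			s.indices[name][v][key] = struct{}{}
		}
	}
}

// Delete removes an object from the primary map and from every index.
func (s *store) Delete(key string) {
	if old, ok := s.items[key]; ok {
		s.unindex(key, old)
		delete(s.items, key)
	}
}

// unindex drops key from every bucket the old object was filed under.
func (s *store) unindex(key string, old pod) {
	for name, fn := range s.indexers {
		for _, v := range fn(old) {
			delete(s.indices[name][v], key)
		}
	}
}

// ByIndex returns the sorted keys filed under value in the named index.
func (s *store) ByIndex(name, value string) []string {
	var keys []string
	for k := range s.indices[name][value] {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	s := newStore(map[string]indexFunc{
		"namespace": func(p pod) []string { return []string{p.Namespace} },
		"by-app": func(p pod) []string {
			if app, ok := p.Labels["app"]; ok {
				return []string{app}
			}
			return nil
		},
	})
	s.Add(pod{Namespace: "prod", Name: "web-1", Labels: map[string]string{"app": "frontend"}})
	s.Add(pod{Namespace: "prod", Name: "web-2", Labels: map[string]string{"app": "frontend"}})
	s.Add(pod{Namespace: "dev", Name: "db-1", Labels: map[string]string{"app": "db"}})
	fmt.Println(s.ByIndex("by-app", "frontend"))
	s.Add(pod{Namespace: "prod", Name: "web-2", Labels: map[string]string{"app": "backend"}})
	fmt.Println(s.ByIndex("by-app", "frontend"))
}
```

The important step is in Add: the old object's index entries are removed before the new ones are written, so a relabelled object never lingers in a stale bucket.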

One critical guarantee the Indexer provides: each delta is applied atomically. The store's write lock is held while the primary map and every secondary index are updated, so a Get or List never observes a half-applied object or an index that disagrees with the primary map. It is the property that makes Listers correct within one informer. The guarantee stops at the informer boundary, though. deploymentLister and replicaSetLister are fed by separate watches and advance independently, so when your Reconcile calls deploymentLister.Get("foo") and then replicaSetLister.List(selector), you might not have the latest-latest revision from either, and the two answers may reflect slightly different revisions. Write reconcilers that tolerate that skew.

// Pseudocode: DeltaFIFO.Add — insert with per-key dedup.
// staging/src/k8s.io/client-go/tools/cache/delta_fifo.go (simplified)

func (f *DeltaFIFO) queueActionLocked(actionType DeltaType, obj interface{}) error {
    id, err := f.KeyOf(obj)
    if err != nil { return err }

    newDeltas := append(f.items[id], Delta{actionType, obj})
    newDeltas = dedupDeltas(newDeltas)         // collapse last 2 only if both are Deleted

    if _, ok := f.items[id]; !ok {
        f.queue = append(f.queue, id)    // new key, push to tail
    }
    f.items[id] = newDeltas
    f.cond.Broadcast()                          // wake consumer
    return nil
}

// Indexer.Add — apply a delta to the ThreadSafeStore + secondary indexes.
func (s *threadSafeMap) Add(key string, obj interface{}) {
    s.lock.Lock(); defer s.lock.Unlock()
    oldObject := s.items[key]
    s.items[key] = obj
    s.updateIndices(oldObject, obj, key)        // recomputes every IndexFunc
}

There is one operational gotcha worth flagging. The cache holds full deserialised objects in memory. A Pod is, on average, 4-12 KB of Go struct (depending on labels, annotations, volumes, environment variables, status conditions). At 50,000 pods that is roughly 200-600 MB of pod objects per controller process that runs a Pod informer, and most controllers do, because Pods are how you reconcile almost anything. This is the dominant memory cost of a controller binary. Trimming it requires either field selectors (only watch Pods on a given Node), label selectors (only watch Pods with a given label), or a transform function passed to SharedInformerFactory that strips fields you do not care about before the object enters the cache. We will come back to the budgeting math in Part 08.
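As a back-of-the-envelope check, the budget is just object count times average object size, and a transform shrinks the second factor. The sketch below uses a toy object type; cacheMB and stripManagedFields are hypothetical helpers, and the 2 KB trimmed size is an assumption chosen for illustration, not a measurement.

```go
package main

import "fmt"

// cacheMB estimates the cache footprint in MB (taking 1 MB as 1,000 KB)
// for n cached objects averaging avgKB each.
func cacheMB(n int, avgKB float64) float64 { return float64(n) * avgKB / 1000 }

// object is a toy stand-in for a cached API object.
type object struct {
	Name          string
	Labels        map[string]string
	ManagedFields []string // bulky bookkeeping that most controllers never read
}

// stripManagedFields plays the role of a transform: it runs on each object
// before the object is stored and drops data the controller does not need.
func stripManagedFields(o *object) *object {
	o.ManagedFields = nil
	return o
}

func main() {
	const pods = 50000
	fmt.Printf("untrimmed: %.0f to %.0f MB\n", cacheMB(pods, 4), cacheMB(pods, 12))
	// Assumption for illustration: the transform trims the average pod to 2 KB.
	fmt.Printf("trimmed:   %.0f MB\n", cacheMB(pods, 2))

	o := stripManagedFields(&object{Name: "web-7d8", ManagedFields: []string{"kubectl", "kubelet"}})
	fmt.Println(o.Name, "managedFields entries:", len(o.ManagedFields))
}
```

A transform has to run before the object is stored, because anything that reaches the cache is shared with every other consumer of that informer.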

Subtle bug worth knowing. Never mutate an object you got from a Lister. The Lister returns a pointer into the cache; mutating it corrupts the cache for every other consumer in the process. Always obj.DeepCopy() before changing anything. client-go ships a cache mutation detector, switched on by setting the environment variable KUBE_CACHE_MUTATION_DETECTOR=true, which will catch this for you in tests.
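The failure mode is ordinary pointer aliasing, which a toy cache reproduces in a dozen lines. The pod and lister types and the hand-written deepCopy below are stand-ins for illustration; real API types come with a generated DeepCopy.

```go
package main

import "fmt"

type pod struct {
	Name   string
	Labels map[string]string
}

// deepCopy returns a pod that shares no memory with the original.
func (p *pod) deepCopy() *pod {
	out := &pod{Name: p.Name, Labels: make(map[string]string, len(p.Labels))}
	for k, v := range p.Labels {
		out.Labels[k] = v
	}
	return out
}

// lister hands out pointers straight into the shared cache.
type lister struct{ cache map[string]*pod }

func (l lister) Get(name string) *pod { return l.cache[name] }

func main() {
	l := lister{cache: map[string]*pod{
		"web-7d8": {Name: "web-7d8", Labels: map[string]string{"app": "frontend"}},
	}}

	// Safe: mutate a private copy.
	mine := l.Get("web-7d8").deepCopy()
	mine.Labels["app"] = "changed"
	fmt.Println("after copy+mutate, cache has:", l.Get("web-7d8").Labels["app"])

	// Unsafe: mutate the shared object; every other consumer now sees it.
	l.Get("web-7d8").Labels["app"] = "corrupted"
	fmt.Println("after in-place mutate, cache has:", l.Get("web-7d8").Labels["app"])
}
```

Note that a shallow struct copy would not be enough: the Labels map would still be shared, which is exactly why the copy has to be deep.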

SharedInformerFactory. One factory, many informers, one watch per type.

The naive way to use informers gives you one informer per controller. The Deployment controller has its own Deployment informer; the ReplicaSet controller has its own; the HPA controller has yet another for the Deployments it scales. In a binary that runs all thirty controllers (the controller-manager), you would have three Deployment informers, three watches against the api-server for Deployments, three full caches of every Deployment in the cluster — wasted memory, wasted bandwidth, wasted server CPU, and an explicit violation of the one watch per resource type design principle.

The SharedInformerFactory fixes this. It is a memoised constructor: the first call to factory.Apps().V1().Deployments().Informer() builds a real informer; every subsequent call in the same factory returns the same instance. Multiple controllers in the same binary share the same Reflector, the same DeltaFIFO, the same Indexer, and the same Lister. Each controller registers its own ResourceEventHandler, and they all see the same events fan-out from the same source. One watch, one cache, N consumers.
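The memoisation and fan-out can be modelled without any Kubernetes types. In this toy, factory, informerFor and dispatch are hypothetical names; a real informer would run a Reflector behind each entry, and the created counter stands in for the number of watches opened.

```go
package main

import "fmt"

// informer is a toy shared informer: one event source, many handlers.
type informer struct {
	resource string
	handlers []func(key string)
}

// AddEventHandler registers one more consumer on the shared informer.
func (i *informer) AddEventHandler(h func(key string)) { i.handlers = append(i.handlers, h) }

// dispatch fans one event out to every registered handler.
func (i *informer) dispatch(key string) {
	for _, h := range i.handlers {
		h(key)
	}
}

// factory memoises informers by resource name.
type factory struct {
	informers map[string]*informer
	created   int // how many informers (and therefore watches) were actually built
}

func newFactory() *factory { return &factory{informers: map[string]*informer{}} }

// informerFor returns the existing informer for resource, building it on first use.
func (f *factory) informerFor(resource string) *informer {
	if inf, ok := f.informers[resource]; ok {
		return inf
	}
	inf := &informer{resource: resource}
	f.informers[resource] = inf
	f.created++
	return inf
}

func main() {
	f := newFactory()
	calls := 0
	for _, name := range []string{"deployment", "replicaset", "hpa"} {
		name := name // capture a per-iteration copy for the closure
		f.informerFor("deployments").AddEventHandler(func(key string) {
			calls++
			fmt.Println(name, "controller saw", key)
		})
	}
	f.informerFor("deployments").dispatch("prod/web")
	fmt.Println("informers built:", f.created, "handler calls:", calls)
}
```

Three controllers asked for a Deployment informer, one was built, and a single event reached all three handlers.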

This is why every well-written Go controller binary follows the same skeleton: in main, construct one SharedInformerFactory per *kubernetes.Clientset (with a single resync interval); pass it to every controller's constructor; have each controller call .Informer() on the resources it cares about and register handlers; finally call factory.Start(stopCh) exactly once at the bottom of main to fire up the goroutines, then factory.WaitForCacheSync(stopCh) to block until every informer has finished its initial List. After that, your controllers run.

The factory takes a global resync interval at construction time, which becomes the default for every informer it builds. You can override per-informer with factory.InformerFor(...).AddEventHandlerWithResyncPeriod(...), but in practice nobody does: the resync is an old defensive mechanism, and the modern approach is to disable it (interval = 0) and rely on workqueue.AddAfter for explicit periodic re-reconciliation. The factory also accepts a list of SharedInformerOptions, most usefully WithNamespace (scope to one namespace, smaller cache), WithTweakListOptions (add a label / field selector to every List+Watch), and WithTransform (strip fields before caching).

A separate concern is custom resources. The typed SharedInformerFactory in client-go only knows about built-in types. For CRDs, you either use the generated factory from the project's typed client (externalversions.SharedInformerFactory generated by code-generator), or you use the dynamic factory, dynamicinformer.NewDynamicSharedInformerFactory, which works over unstructured.Unstructured and does not need codegen. controller-runtime hides this behind a single cache.Cache abstraction and a manager.Manager that owns it.

// Idiomatic factory setup with WatchErrorHandler and a custom resync period.
// import "k8s.io/client-go/informers"

cs, err := kubernetes.NewForConfig(cfg)
if err != nil { klog.Fatalf("building clientset: %v", err) }

// One factory, 30-minute resync, scoped to one namespace.
factory := informers.NewSharedInformerFactoryWithOptions(
    cs,
    30*time.Minute,
    informers.WithNamespace("prod"),
    informers.WithTweakListOptions(func(o *metav1.ListOptions) {
        o.LabelSelector = "app.kubernetes.io/managed-by=mycontroller"
    }),
)

// Pull out typed informers — same instance returned to every caller.
podInf := factory.Core().V1().Pods().Informer()
deployInf := factory.Apps().V1().Deployments().Informer()

// Set a custom WatchErrorHandler — fires on every watch failure.
_ = podInf.SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
    metrics.WatchErrors.WithLabelValues("pods").Inc()
    klog.Warningf("pod watch error: %v", err)
})

// Register handlers — multiple controllers can register on the same informer.
podInf.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    func(o interface{}) { queue.Add(keyOf(o)) },
    UpdateFunc: func(_, n interface{}) { queue.Add(keyOf(n)) },
    DeleteFunc: func(o interface{}) { queue.Add(keyOf(o)) },
})

// Fire goroutines, wait for the first List to complete, then run.
factory.Start(stopCh)
if !cache.WaitForCacheSync(stopCh, podInf.HasSynced, deployInf.HasSynced) {
    klog.Fatal("informer caches did not sync")
}
controller.Run(workers, stopCh)

The WatchErrorHandler is the single most important hook for operability. Most watch errors are invisible at default client-go log verbosity, so a controller whose watches are constantly relisting against a saturated api-server can look healthy from the outside while burning the cluster down. Wire a Prometheus counter into SetWatchErrorHandler for every informer in your binary; alert on any non-zero rate. For the Kubernetes controller-manager itself, the canonical signature of a relist storm shows up on the server side, as a spike in the api-server's apiserver_request_total{verb="LIST",resource="pods"}.

WaitForCacheSync is not optional. If your controller starts reconciling before its caches have synced, every Lister.Get returns "not found", every Lister.List returns empty, and you spend the first thirty seconds of your process incorrectly deleting things that exist. Always block on it.

Custom indexers. By label, by owner-ref, by arbitrary field.

The Indexer's secondary indexes are the most underused feature of client-go. Out of the box, every informer has exactly one secondary index: cache.NamespaceIndex, which maps a namespace name to the set of object keys in that namespace. That is what lets lister.Pods("prod").List(selector) run in O(pods-in-prod) instead of O(pods-in-cluster). For most controllers, this one index is enough. For some, it is not, and the canonical example is owner-ref-based reconciliation.

Imagine a Deployment controller that needs, on every reconcile of a Deployment, to find all the ReplicaSets owned by that Deployment. The naive code is to List(everything) and filter by ownerReferences[0].UID == deployment.UID. On a cluster with ten thousand ReplicaSets, this is a 10,000-element walk per Deployment reconcile. With ten Deployments reconciling per second, you have just spent 100,000 ReplicaSet comparisons per second on a problem that should be O(replicasets-of-this-deployment). The fix is a custom indexer keyed by owner-ref UID: register an IndexFunc that returns []string{ string(ownerRef.UID) } for every ReplicaSet, and the Indexer will maintain the inverse map for you. The Deployment controller's reconcile then becomes indexer.ByIndex("ownerUID", string(deployment.UID)), a single map lookup plus a walk over the matching ReplicaSets.

The mechanism is general. An IndexFunc takes an object and returns a list of strings (zero, one, or many) under which the object should be filed. An Indexers map names them. You pass them at informer construction time. After that, the Indexer transparently maintains map[indexName]map[indexKey]sets.String as deltas flow through. Queries are indexer.ByIndex(name, key) for object lookup or indexer.IndexKeys(name, key) for just keys. controller-runtime wraps this in manager.GetFieldIndexer().IndexField(...) with a more declarative shape.

The other classical example is per-label indexing. If your controller frequently lists Pods with app=frontend, you can register an IndexFunc that returns []string{ pod.Labels["app"] }, and a lookup by app becomes one map access plus a walk over only the matching pods. The trade-off is memory: each index adds roughly one more copy of the per-key bookkeeping (a key in a set per object per index), so adding ten custom indexes to a Pod informer can add 100-200 MB of overhead in a 50,000-pod cluster. Most controllers need at most one or two custom indexes; resist the temptation to index every field.

A subtle correctness point: index keys must be derivable from the object alone, with no external state. If your IndexFunc reads from a database, reads an environment variable, or depends on the current time, the values it computes when an object is removed can differ from the ones it computed when the object was filed. The index then desyncs from the cache, and lookups silently miss or return stale entries. This is a hard rule. Any IndexFunc that is not a pure function over obj is a bug.
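The failure mode is easier to believe once you watch it happen. The sketch below is a toy model, and it assumes the store unfiles an object by re-running the IndexFunc on the old copy. The names region, Pod, impureIndex, and index are hypothetical; region stands in for any external state.

```go
package main

import "fmt"

// region stands in for external state (an env var, a clock, a database row).
var region = "eu"

// Pod is a minimal stand-in for a cached object.
type Pod struct {
	Key string
	App string
}

// impureIndex depends on region as well as on the object: this is the bug.
func impureIndex(p Pod) []string { return []string{region + "/" + p.App} }

// index maps an index value to the set of object keys filed under it.
type index map[string]map[string]bool

// update unfiles the old object (if any) by recomputing its index values,
// then files the current object (if any) under freshly computed values.
func (ix index) update(fn func(Pod) []string, old, cur *Pod) {
	if old != nil {
		for _, v := range fn(*old) {
			delete(ix[v], old.Key)
		}
	}
	if cur != nil {
		for _, v := range fn(*cur) {
			if ix[v] == nil {
				ix[v] = map[string]bool{}
			}
			ix[v][cur.Key] = true
		}
	}
}

func main() {
	ix := index{}
	p := Pod{Key: "prod/a", App: "web"}

	ix.update(impureIndex, nil, &p) // filed under "eu/web"
	region = "us"                   // external state changes; the object does not
	ix.update(impureIndex, &p, nil) // delete: tries to unfile from "us/web"

	// The object has been deleted, yet "eu/web" still lists its key.
	fmt.Println(len(ix["eu/web"]), len(ix["us/web"]))
}
```

The delete looked in the wrong bucket, so the stale entry under eu/web survives indefinitely. Swap impureIndex for a function of p alone and the same sequence leaves the index empty.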

// Two custom IndexFuncs for a Pod informer: a per-label index and a per-owner index.
// import (
//     "fmt"
//     corev1 "k8s.io/api/core/v1"
//     "k8s.io/apimachinery/pkg/api/meta"
//     "k8s.io/client-go/tools/cache"
// )

// 1. by-app: group pods by the app label.
const ByAppIndex = "by-app"

func byAppIndexFunc(obj interface{}) ([]string, error) {
    pod, ok := obj.(*corev1.Pod)
    if !ok {
        return nil, fmt.Errorf("expected *corev1.Pod, got %T", obj)
    }
    app, ok := pod.Labels["app"]
    if !ok {
        return nil, nil // unlabelled pods are simply not filed in this index
    }
    return []string{app}, nil // pure: depends only on obj
}

// 2. by-owner: group children by the controller OwnerReference UID.
const ByOwnerIndex = "by-owner-uid"

func byOwnerIndexFunc(obj interface{}) ([]string, error) {
    m, err := meta.Accessor(obj)
    if err != nil {
        return nil, err
    }
    for _, ref := range m.GetOwnerReferences() {
        if ref.Controller != nil && *ref.Controller {
            return []string{string(ref.UID)}, nil
        }
    }
    return nil, nil // no controller owner: not filed in this index
}

// Register the custom indexes before the informer starts. A factory-built
// informer already carries cache.NamespaceIndex; listing it again here would
// make AddIndexers fail with an indexer conflict.
indexers := cache.Indexers{
    ByAppIndex:   byAppIndexFunc,
    ByOwnerIndex: byOwnerIndexFunc,
}
if err := podInf.AddIndexers(indexers); err != nil {
    return fmt.Errorf("adding indexers: %w", err)
}

// Query: one map lookup, returns only the objects filed under "web".
items, err := podInf.GetIndexer().ByIndex(ByAppIndex, "web")

Reconcile-by-owner is the pattern behind the built-in workload controllers. A ReplicaSet has to find its Pods, a Deployment its ReplicaSets, and StatefulSet, Job, and CronJob do the same for their children. Whether a given controller answers that with an owner-UID index or with a namespace-scoped List filtered by controller reference, the question it asks is the same. controller-runtime's SetControllerReference + EnqueueRequestForOwner is the declarative form; understanding the underlying indexer is what lets you debug it when it goes wrong.

Operating cost — what 200 informers actually weigh, and how to budget.

Every informer has three costs: memory for the cached objects plus bookkeeping; CPU for JSON deserialisation on the producer goroutine and for handler dispatch on the consumer goroutine; and network bandwidth for the watch stream itself, plus a one-time spike for the initial List. The relative magnitudes depend wildly on which resource you are watching, but there are reliable rules-of-thumb. A Pod informer in a 10,000-pod cluster runs ~80-150 MB of resident memory and ~0.1% CPU at steady state. A Node informer in a 1,000-node cluster runs ~20 MB and negligible CPU. An Event informer is the big surprise: short-lived but high-throughput, it can blow past 1 GB if you do not set a TTL or a field-selector limit.

The kube-controller-manager binary is the canonical case study. It runs roughly thirty controllers, each of which uses one to four informers. Many of these informers overlap (a dozen controllers all want a Pod informer), and the SharedInformerFactory dedupes them — a large cluster's controller-manager has roughly twenty distinct informers, not seventy. Steady-state memory on a 5,000-node, 100,000-pod cluster: 1.5 to 4 GB, dominated by the Pod, EndpointSlice, and Lease informers in roughly that order. CPU under nominal load: 0.5 to 1.5 cores. Watch fan-out from the api-server's perspective: about twenty long-lived HTTP/2 streams, each doing a few hundred KB/s of event traffic.

The pathological case is a controller that runs 200 informers in one binary — typically because it watches every CRD on the cluster (a generic webhook, a policy engine, a cross-cutting reconciler). Even if each informer is small, the aggregate is brutal: the Reflector's startup List for each one races against the others, the watch fan-out fills the api-server's per-client connection budget, and the memory footprint can reach 10-20 GB before steady state. The mitigations are the standard ones: scope every informer with field/label selectors, use WithTransform to drop unused fields before caching, and stagger informer startup if your controller runtime supports it. controller-runtime's cache.Options.ByObject per-type config exists for exactly this reason.

The other operational concern is watch fan-out on the api-server. Each long-lived watch costs roughly 50-100 KB of api-server memory (the watcher struct, the HTTP/2 stream state, the per-client filter chain), plus a slot in the watch cache's broadcaster. A cluster with 10,000 nodes and 50 controllers running 20 informers each is on the order of 60,000 active watches. The controllers account for only about 1,000 of those; the rest would have to come from the node agents (kubelet, kube-proxy), at roughly six watches per node. At 50-100 KB apiece that is roughly 3-6 GB of api-server memory just for watch state, plus the CPU to fan out every event to every watcher. This is why api-server pod sizing on large clusters starts at 8-16 GB; it is almost entirely driven by watch fan-out, not by the request rate.
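The arithmetic behind that figure fits in a few lines. This is a back-of-the-envelope sketch, not a sizing tool: estimateWatches is a hypothetical helper, the 50-100 KB per-watch range is the one quoted above, and six watches per node is an assumption chosen to reproduce the order of magnitude.

```go
package main

import "fmt"

// estimateWatches returns the total watch count and a low/high estimate, in
// decimal GB, of the api-server memory spent on watch state.
func estimateWatches(nodes, watchesPerNode, controllers, informersPerController int) (total int, lowGB, highGB float64) {
	total = nodes*watchesPerNode + controllers*informersPerController
	const lowKB, highKB = 50.0, 100.0 // per-watch cost range quoted above
	lowGB = float64(total) * lowKB / 1e6
	highGB = float64(total) * highKB / 1e6
	return total, lowGB, highGB
}

func main() {
	// watchesPerNode = 6 is an assumption (kubelet plus kube-proxy watches).
	total, low, high := estimateWatches(10000, 6, 50, 20)
	fmt.Printf("%d watches, %.2f to %.2f GB of watch state\n", total, low, high)
}
```

Note which term dominates: the node count, not the controller count. Adding a controller adds tens of watches; adding a thousand nodes adds thousands.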

A useful rule for budgeting: the api-server's apiserver_registered_watchers metric should stay flat once your controllers are warm. Sudden growth means a controller is leaking watches (creating new informers without stopping old ones, a common bug in dynamic-client code). The apiserver_storage_objects{resource="..."} metric tells you the stored object count per resource type, which is the upper bound on what any informer of that type will hold in memory. Together they are the two metrics that drive controller-binary memory sizing.

With those numbers in mind, an order-of-magnitude calculation for your own controller: (average object size) × (number of objects in scope) × (1 + number of custom indexes) × (1.3 overhead factor) is a decent starting estimate for cache memory. Add 50-100 MB for the Go runtime itself, plus whatever your workers hold in flight; goroutines themselves are cheap (their stacks start at a few kilobytes), so budget for their working set rather than their count. If the answer is more than a couple of gigabytes, narrow your scope or strip fields with a transform. If it is more than ten gigabytes, you are probably watching the wrong thing, and what you actually want is a metrics-server-style aggregator with its own filtered cache.
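As a worked instance of that rule of thumb, here is a small sketch. estimateCacheBytes is a hypothetical helper, and the 8 KiB average Pod size is an assumption for illustration; measure your own objects before trusting the result.

```go
package main

import "fmt"

// estimateCacheBytes applies the rule of thumb from the text:
// object size × objects in scope × (1 + custom indexes) × 1.3 overhead.
func estimateCacheBytes(avgObjectBytes, objects, customIndexes int) float64 {
	const overhead = 1.3
	return float64(avgObjectBytes) * float64(objects) * float64(1+customIndexes) * overhead
}

func main() {
	const mib = 1 << 20
	// Assumed inputs: 8 KiB average Pod, 10,000 Pods in scope.
	plain := estimateCacheBytes(8*1024, 10000, 0)
	indexed := estimateCacheBytes(8*1024, 10000, 1)
	fmt.Printf("no custom index: %.0f MiB, one custom index: %.0f MiB\n",
		plain/mib, indexed/mib)
}
```

With no custom index the estimate lands around 100 MiB, inside the 80-150 MB range quoted earlier for a 10,000-pod informer. The (1 + indexes) factor is deliberately pessimistic, which is what you want from a budget.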

The workqueue deserves a final mention because it is the place rate-limiting lives. DefaultControllerRateLimiter is a MaxOf combiner of two limiters: an item-level exponential backoff (5 ms doubling up to 1,000 s, per key, reset when you call Forget after a success), and a token-bucket overall limit (10 events per second, burst 100, across all keys). If your reconcile returns an error five times in a row, the offending key gets a 5 ms, 10 ms, 20 ms, 40 ms, 80 ms backoff before each retry; if your reconcile floods the queue, the bucket throttles you to 10/s globally. The library exposes this as workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()); newer client-go releases also offer generic equivalents (NewTypedRateLimitingQueue, DefaultTypedControllerRateLimiter) with the same behaviour. Do not invent your own limiter without an extremely clear reason.

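// A workqueue with the default rate-limiter: 5ms exponential up to 1000s.
// import (
//     "k8s.io/client-go/tools/cache"
//     "k8s.io/client-go/util/workqueue"
// )

queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

// enqueue turns an object into its namespace/name key and adds it to the queue.
// The deletion-handling variant also unwraps the tombstone a delete may carry.
enqueue := func(obj interface{}) {
    key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
    if err != nil {
        return // no usable key, so nothing to reconcile
    }
    queue.Add(key)
}

// Producer side: register handlers that push keys.
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc:    enqueue,
    UpdateFunc: func(_, newObj interface{}) { enqueue(newObj) },
    DeleteFunc: enqueue,
})

// Consumer goroutine: runWorker.
runWorker := func() {
    for {
        key, quit := queue.Get()
        if quit {
            return // quit is true only after queue.ShutDown()
        }
        if err := reconcile(key.(string)); err == nil {
            queue.Forget(key) // reset the per-key backoff
        } else {
            queue.AddRateLimited(key) // re-enqueue with exponential delay
        }
        queue.Done(key) // always mark the key as no longer in flight
    }
}
go runWorker()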

And the rest of the Semicolony ladder: the architecture sub-page is the bird's-eye view of where informers fit; the api-server sub-page covers the watch cache from the server side; the etcd sub-page explains the resourceVersion that flows through every event; the controllers sub-page assembles informer + workqueue + reconcile into a complete loop. For the lived experience, the rollout simulator shows informers absorbing a thundering herd of Pod events, and the Pod creation, end to end guide tells the same story without the source pointers.

The closing note is the same one as the architecture page. The informer pattern works because it refuses to grow new edges. One watch per resource type, one cache per process, one reconcile per key, one rate limiter per workqueue. Every time you find yourself wanting to bypass the cache for "just this one read", or open "just one more watch" for a custom query, you are about to lose the property that makes Kubernetes scale. Stay inside the lines. The pattern is well-trodden because it is correct.

Next in the internals series

Keep going.

The controller pattern

Informers + workqueue + reconcile, assembled into a complete loop with finalizers and leader election.

The api-server, deeply

The watch cache from the server side: per-resource broadcasters, the watch budget, KEP-3157 streaming list.

etcd, the only thing that persists

MVCC, Raft, the resourceVersion that flows through every watch event you will ever see.

Back to the internals index

All twelve sub-pages, and the system on one canvas.
