Sub-page 08 · for operator + CRD authors

Kubernetes internals · CRDs & operators

Extend the API.
Don't bypass it.

A CustomResourceDefinition is the cleanest extension point Kubernetes ships. You hand the api-server an OpenAPI schema, it gives you back a typed REST surface — listed, watched, validated, RBAC'd, audit-logged, and stored in etcd alongside Pods and Deployments. An operator is then just a controller that reconciles the resulting custom resources. The two together let you encode operational knowledge — how to upgrade Postgres, how to fail over Kafka, how to rotate certificates — as executable, declarative state.

This page is the design manual: the schema, the subresources, the conversion-webhook contract, the OperatorHub maturity model, and the scaffold you actually run on day one. Pair it with the controller pattern sub-page for the reconcile-loop mechanics and the api-server sub-page for what happens when your CR hits the request pipeline. Roughly 4,200 words.

Why CRDs exist at all.

Kubernetes ships with about forty built-in resource types — Pod, Deployment, Service, Ingress, Job, ConfigMap, and so on. They cover a generic compute substrate well, and they cover almost nothing else. The first time you try to model a Postgres cluster, a Kafka topic, a TLS certificate, an S3 bucket, or a tenant environment, you discover that none of the built-ins fit and you have to invent something. Before CRDs, the inventions were ugly: annotations on ConfigMaps, magic labels on Pods, side-channel databases that the controller read instead of etcd, REST services on top of Kubernetes that re-implemented authn, authz, audit, and watch from scratch. None of it composed. Each one was a tiny island with its own conventions.

A CustomResourceDefinition is the api-server saying: hand me a name, a schema, and a scope, and I will give you a first-class API kind. Once registered, your PostgresCluster or Certificate object behaves exactly like a Pod. It accepts kubectl get, kubectl apply, kubectl edit, kubectl describe. It is namespaced or cluster-scoped at your choice. RBAC works on it. Quotas can target it. Server-side apply tracks field ownership on it. Audit logs record every write. Admission webhooks see every mutation. Watch streams flow through it. You inherit the entire platform, including the parts you did not know you needed, on day one.

There are two alternatives, and both exist for narrow reasons. An aggregated apiserver is a separate process that registers itself with the main api-server via an APIService object; the main api-server proxies requests for that group/version to the aggregated server. You write all the storage, validation, and watch plumbing yourself. This is how metrics.k8s.io works, and it is the right answer when you need imperative endpoints, high-cardinality non-etcd backing storage, or resource semantics that do not fit a declarative spec/status. Part 08 covers the decision in detail. The second alternative is no API at all — a sidecar, a daemon, a Helm chart. Pick that when there is no state to reconcile, just configuration to template.

The cultural argument for CRDs is older than the technical one. CoreOS coined “the operator pattern” in 2016: the observation that the most reliable way to encode operational knowledge — how a human SRE would deploy, upgrade, back up, fail over a piece of stateful software — is as a controller, watching a custom resource that describes the desired state. The CR is the SRE's runbook, made declarative. The controller is the SRE's muscle memory, made executable. Every long-running platform team eventually reaches for this shape because it is the only one where the runbook stays in sync with reality. The architecture forces convergence.

The mental model worth carrying through the rest of the page: a CRD is a schema, a custom resource is an instance, and an operator is a controller that watches one or more CR kinds and reconciles them by creating or updating other Kubernetes resources (Pods, Services, PersistentVolumeClaims) and out-of-cluster resources (cloud APIs, load balancers, DNS entries). The CRD has nothing to do with the controller. They ship in the same operator bundle by convention, but the api-server happily serves a CRD that no controller is watching — you just get an inert object.

Mental model — a CRD is a Kubernetes API kind that you wrote. It is stored in etcd next to Pods. You did not write any storage code, any REST handler, any watch protocol. You wrote a YAML schema and a controller. Everything else is the api-server doing its job.

Designing a CRD: spec, status, names, scope.

A well-designed CRD has the same shape as every built-in: a top-level spec that the user owns and a top-level status that the controller owns. The split is not cosmetic. It is the contract that lets a user re-apply their YAML without stomping on whatever the controller has reported, and that lets the controller update its observations without inviting a flapping fight with whatever pipeline is applying the spec. The status subresource (Part 04) makes the split mechanically enforceable; the schema below expresses the intent.

Names matter more than they look. A CRD has a group (usually a domain you control, like db.example.com), a kind (PascalCase, singular, like PostgresCluster), a listKind (the same with List suffix), a singular and plural short name for the URL, and zero or more shortNames for kubectl convenience. Get any of these wrong and you cannot rename them later — the URL path is part of the public contract, watched by every client. Pick the group from a domain you actually own. Pick the kind so the type reads naturally as English when combined with a verb: “create a PostgresCluster” reads, “create a PgClstr” does not.
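To see why these names are locked in once published, it helps to look at how they flow into the CRD's own object name and into every URL a client will ever call. The sketch below is illustrative only: the Names struct and the two helper functions are hypothetical, not part of any Kubernetes library, and it assumes a namespaced resource.

```go
package main

import (
	"fmt"
	"strings"
)

// Names holds the two naming fields that end up in the REST path.
// Hypothetical helper type, not a Kubernetes API type.
type Names struct {
	Group  string // e.g. "db.example.com"
	Plural string // e.g. "postgresclusters"
}

// CRDName returns the metadata.name a CRD must carry: <plural>.<group>.
func CRDName(n Names) string {
	return n.Plural + "." + n.Group
}

// ResourcePath builds the REST path of one namespaced custom resource.
func ResourcePath(n Names, version, namespace, name string) string {
	return "/" + strings.Join([]string{
		"apis", n.Group, version, "namespaces", namespace, n.Plural, name,
	}, "/")
}

func main() {
	n := Names{Group: "db.example.com", Plural: "postgresclusters"}
	fmt.Println(CRDName(n))
	fmt.Println(ResourcePath(n, "v1", "orders", "prod-orders"))
}
```

Both the group and the plural appear verbatim in the path, which is why changing either one later amounts to publishing a different API.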

Scope is a one-time choice between Namespaced and Cluster. Default to Namespaced. Most things that look cluster-scoped (TLS issuers, storage classes, cluster policies) belong cluster-scoped because they describe shared infrastructure, but most workload-shaped CRDs (database clusters, deployments, tenant configurations) belong namespaced because they inherit the existing tenant boundary. Cluster scope means RBAC is global — anyone with list on the kind sees every instance — and that has a habit of biting multi-tenant clusters later. Cert-manager learned this the hard way and introduced both Issuer (namespaced) and ClusterIssuer (cluster-scoped) precisely so users could pick the right boundary.

The spec schema should describe the user's intent in the smallest number of fields that fully captures it. Fight the urge to expose every knob your controller could in principle support. Each new field is a versioning obligation forever — once shipped you cannot remove it without writing a conversion webhook (Part 05). Prefer composition: a top-level backup sub-object with its own schema is easier to evolve than seven flat backupX fields. Mark sensible defaults explicitly with default: in the schema so users do not have to write them. Use enums freely. Use x-kubernetes-preserve-unknown-fields sparingly and only at clearly-marked extension points.

The additionalPrinterColumns field controls what kubectl get shows. Always populate it. A CR that prints as just a name and an age is unhelpful; users will write describe wrappers or jsonpath queries that should not have been necessary. Show the most operationally relevant fields: replica count, current phase, the URL or endpoint the user will actually paste into another tool. Pull them from .status so they reflect reality, not requested intent.

# crd.yaml — a real-world CRD with the fields that matter
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.example.com
spec:
  group: db.example.com
  scope: Namespaced
  names:
    kind: PostgresCluster
    listKind: PostgresClusterList
    singular: postgrescluster
    plural: postgresclusters
    shortNames: [pgc, pg]
    categories: [databases, all]
  versions:
    - name: v1
      served: true
      storage: true
      # --- subresources: see Part 04 ---
      subresources:
        status: {}
        scale:
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.readyReplicas
          labelSelectorPath: .status.selector
      # --- printer columns: shown by kubectl get ---
      additionalPrinterColumns:
        - name: Replicas
          type: integer
          jsonPath: .status.readyReplicas
        - name: Phase
          type: string
          jsonPath: .status.phase
        - name: Version
          type: string
          jsonPath: .spec.version
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
      # --- OpenAPI v3 schema ---
      schema:
        openAPIV3Schema:
          type: object
          required: [spec]
          properties:
            spec:
              type: object
              required: [version, replicas]
              properties:
                version:
                  type: string
                  enum: ['15', '16', '17']
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 9
                  default: 3
                storage:
                  type: object
                  required: [size]
                  properties:
                    size:
                      type: string
                      pattern: '^[0-9]+(Mi|Gi|Ti)$'
                    storageClassName:
                      type: string
                backup:
                  type: object
                  properties:
                    schedule: { type: string }
                    retention: { type: integer, default: 7 }
            status:
              type: object
              properties:
                phase:
                  type: string
                  enum: [Pending, Provisioning, Ready, Failed]
                readyReplicas: { type: integer }
                selector: { type: string }
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type: { type: string }
                      status: { type: string }
                      reason: { type: string }
                      message: { type: string }
                      lastTransitionTime: { type: string, format: date-time }

Two stylistic conventions are worth copying from the built-ins. First, mirror the conditions array shape — an array of objects with type, status, reason, message, and lastTransitionTime — because every Kubernetes-aware tool (kubectl wait, dashboards, monitoring) knows how to read it. Second, name your phases the same way the built-ins do (Pending, Running or Ready, Succeeded, Failed). Operators that invent their own vocabulary make every dashboard custom.

Validation: required fields, CEL rules, defaulting.

Validation runs at the api-server, not in your controller, and that is the whole point. A bad CR should be rejected at kubectl apply time with a 400-class error pointing at the offending field, not surface fifteen seconds later as a cryptic event on a partly-created object. The OpenAPI v3 schema already gives you a lot: required arrays, enum values, minimum/maximum for numbers, pattern for strings, length and item counts for arrays. Use them aggressively. Every constraint expressed in the schema is one fewer bug class your reconcile loop has to defend against.
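As a rough illustration of what those schema constraints buy you, here is a plain-Go approximation of the checks the api-server performs for the enum, minimum/maximum, and pattern keywords used on this page. It is a sketch under assumptions: the Spec struct and Validate function are hypothetical stand-ins, and the real api-server produces differently worded errors.

```go
package main

import (
	"fmt"
	"regexp"
)

// sizePattern is the pattern the CRD schema declares for spec.storage.size.
var sizePattern = regexp.MustCompile(`^[0-9]+(Mi|Gi|Ti)$`)

// Spec is a hypothetical, trimmed-down view of the PostgresCluster spec.
type Spec struct {
	Version  string
	Replicas int
	Size     string
}

// Validate returns one message per violated constraint, prefixed with the
// field path, loosely imitating how schema violations are reported.
func Validate(s Spec) []string {
	var errs []string
	switch s.Version {
	case "15", "16", "17":
		// allowed by the enum
	default:
		errs = append(errs, fmt.Sprintf("spec.version: unsupported value %q", s.Version))
	}
	if s.Replicas < 1 || s.Replicas > 9 {
		errs = append(errs, fmt.Sprintf("spec.replicas: %d is outside [1, 9]", s.Replicas))
	}
	if !sizePattern.MatchString(s.Size) {
		errs = append(errs, fmt.Sprintf("spec.storage.size: %q does not match %s", s.Size, sizePattern))
	}
	return errs
}

func main() {
	good := Spec{Version: "16", Replicas: 3, Size: "100Gi"}
	bad := Spec{Version: "14", Replicas: 12, Size: "100GB"}
	fmt.Println(len(Validate(good)), "errors for the good spec")
	for _, e := range Validate(bad) {
		fmt.Println("rejected:", e)
	}
}
```

Every check in Validate is one the reconcile loop no longer has to repeat, because an object that fails it never reaches etcd.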

Defaulting in the schema lives next to validation. A field with default: 3 on it gets that value injected by the api-server if the user did not supply one. The defaulted object is what hits etcd, not the original user payload. This means defaults are visible to every subsequent reader (kubectl get -o yaml shows the defaulted value) and to your controller, which does not have to re-implement the logic. The corollary: changing a default later is a breaking change. Objects already in etcd keep the old value because it was written into them at creation, while the same YAML applied afresh picks up the new one, so identical manifests stop producing identical objects.
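The persistence behaviour of defaults can be sketched in a few lines of plain Go. This is a toy model, not api-server code: the Spec struct, applyDefaults, and the map standing in for etcd are all assumptions made for illustration.

```go
package main

import "fmt"

// Spec uses a pointer so "not supplied" is distinguishable from zero.
type Spec struct {
	Replicas *int
}

// applyDefaults fills unset fields before the object is persisted.
// An explicitly supplied value is never overwritten.
func applyDefaults(s Spec, defaultReplicas int) Spec {
	if s.Replicas == nil {
		v := defaultReplicas
		s.Replicas = &v
	}
	return s
}

func main() {
	store := map[string]Spec{} // stand-in for etcd

	// Same empty manifest, applied before and after the schema default changed.
	store["created-early"] = applyDefaults(Spec{}, 3)
	store["created-late"] = applyDefaults(Spec{}, 5)

	explicit := 7
	store["explicit"] = applyDefaults(Spec{Replicas: &explicit}, 5)

	for _, name := range []string{"created-early", "created-late", "explicit"} {
		fmt.Println(name, *store[name].Replicas)
	}
}
```

The first two objects came from identical input yet store different replica counts, which is the compatibility hazard of changing a default after release.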

The big modern unlock is CEL, the Common Expression Language, exposed via x-kubernetes-validations. CEL validation rules went GA for CRDs in Kubernetes 1.29 and cover most of the use cases that previously required a validating admission webhook. CEL gives you predicate logic at the api-server: cross-field validation, transition-rule validation (the new value compared to the old), and structured error messages. Where a webhook adds latency, an extra deployable, certificate management, and a cluster-wide failure mode if it goes down, CEL adds a string in your CRD YAML that the api-server compiles once and runs in-process. The win is enormous.

A CEL rule is a boolean expression. It must evaluate to true for the object to be accepted; if it returns false, the api-server emits the rule's message as the validation error. The expression sees self (the field the rule is attached to), and on update, oldSelf (the previous value of that field). You can attach rules at any level of the schema — top-level, on a sub-object, on a leaf field. Common idioms: enforce that .spec.replicas is odd for quorum-based systems; enforce that the storage class cannot change after creation (immutability); enforce that the backup schedule and the retention policy are both set or both unset (mutual presence).
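The three idioms are easier to reason about when written out as ordinary predicates. The Go functions below mirror the logic only; they are not CEL and are not how the api-server evaluates rules. The struct layout and function names are hypothetical, and an empty string stands in for "field not set".

```go
package main

import "fmt"

// Backup and Spec are hypothetical, simplified shapes for illustration.
type Backup struct {
	Schedule  string
	Retention *int
}

type Spec struct {
	Replicas         int
	StorageClassName string
	Backup           *Backup
}

// oddReplicas mirrors a rule that demands an odd replica count for quorum.
func oddReplicas(self Spec) bool {
	return self.Replicas%2 == 1
}

// storageClassImmutable mirrors a transition rule: once oldSelf has a
// storage class, self must carry the same one.
func storageClassImmutable(self, oldSelf Spec) bool {
	return oldSelf.StorageClassName == "" ||
		self.StorageClassName == oldSelf.StorageClassName
}

// backupMutualPresence mirrors "schedule and retention are both set or
// both unset" inside an optional backup block.
func backupMutualPresence(self Spec) bool {
	if self.Backup == nil {
		return true
	}
	return (self.Backup.Schedule != "") == (self.Backup.Retention != nil)
}

func main() {
	old := Spec{Replicas: 3, StorageClassName: "fast"}
	updated := Spec{Replicas: 4, StorageClassName: "slow"}
	fmt.Println("odd replicas:", oddReplicas(updated))
	fmt.Println("storage class unchanged:", storageClassImmutable(updated, old))
	fmt.Println("backup fields consistent:", backupMutualPresence(updated))
}
```

Each function returns true when the object is acceptable, matching the CEL convention that a rule must evaluate to true for the write to go through.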

CEL has limits. Rules cannot do I/O, cannot call out to other resources, and cannot fan out unbounded work: the api-server caps complexity by AST size and runtime cost. Anything that needs to look up another object (does this PVC's storage class exist? is this user permitted to reference this Secret?) still needs a validating admission webhook. A ValidatingAdmissionPolicy helps only partially here: it can read a bound parameter resource and run authorization checks, but it cannot list arbitrary objects. For the “does this CR object internally make sense” class of rule, though, CEL is the answer. Reach for a webhook only when CEL cannot express the rule.

# CEL rules attached at the spec level
schema:
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        x-kubernetes-validations:
          - rule: 'self.replicas % 2 == 1'
            message: 'replicas must be odd for quorum'
          - rule: '!has(self.backup) || has(self.backup.schedule)'
            message: 'backup.schedule is required when backup is set'
          - rule: 'self.version in ["15","16","17"]'
            message: 'unsupported postgres version'
        properties:
          version:
            type: string
            x-kubernetes-validations:
              - rule: 'self == oldSelf || int(self) >= int(oldSelf)'
                message: 'version downgrades are not allowed'
          storage:
            type: object
            x-kubernetes-validations:
              - rule: '!has(oldSelf.storageClassName) || (has(self.storageClassName) && self.storageClassName == oldSelf.storageClassName)'
                message: 'storageClassName is immutable after creation'
            properties:
              size: { type: string }
              storageClassName: { type: string }

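CEL has a hard cost ceiling. The api-server estimates each rule's worst-case cost when the CRD is created or updated and rejects the CRD if the estimate exceeds the limit (on the order of 10^7 cost units per rule). CEL has no recursion, so the usual offenders are comprehensions such as all() or exists() over lists, maps, or strings that carry no maxItems, maxProperties, or maxLength bound; simple field comparisons do not come close. Because the rejection happens at install time, running kubectl apply --dry-run=server on the CRD in CI surfaces it before release.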

Subresources: /status and /scale.

A subresource in Kubernetes is an alternate URL on the same object that exposes a narrower verb set. Pods have /exec, /log, and /portforward. Deployments have /scale. Almost everything has /status. CRDs can opt into the last two with a four-line declaration in the schema, and doing so changes how your controller and your users interact with the resource in important ways.

The /status subresource splits the object into two server-enforced halves. Updates to the main object URL (PUT /apis/db.example.com/v1/.../prod-orders) ignore any changes to .status, even if you sent them. Updates to /status (PUT .../prod-orders/status) ignore any changes to .spec, even if you sent them. Without the subresource, a single PUT covers both fields and you have to trust everyone with write access not to clobber each other's edits. With it, a misbehaving CI pipeline that re-applies the spec cannot accidentally erase the status that your controller just wrote — and your controller cannot accidentally rewrite the spec when it meant to update an observed value.
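The two-halves behaviour can be modelled with a toy server in plain Go. This is a sketch of the semantics described above, not of the api-server implementation; the Server type and its two methods are hypothetical names.

```go
package main

import "fmt"

type Spec struct{ Replicas int }
type Status struct{ Phase string }

type Object struct {
	Spec   Spec
	Status Status
}

// Server is a toy stand-in for an api-server with /status enabled.
type Server struct{ stored Object }

// UpdateMain models a PUT on the main URL: the spec is taken from the
// request, and whatever status the request carried is dropped.
func (s *Server) UpdateMain(req Object) Object {
	s.stored.Spec = req.Spec
	return s.stored
}

// UpdateStatus models a PUT on /status: the status is taken from the
// request, and whatever spec the request carried is dropped.
func (s *Server) UpdateStatus(req Object) Object {
	s.stored.Status = req.Status
	return s.stored
}

func main() {
	srv := &Server{stored: Object{Spec: Spec{Replicas: 3}, Status: Status{Phase: "Ready"}}}

	// A CI pipeline re-applies the spec and accidentally sends a blank status.
	got := srv.UpdateMain(Object{Spec: Spec{Replicas: 5}})
	fmt.Printf("after main update:   %+v\n", got)

	// The controller reports progress and accidentally sends a stale spec.
	got = srv.UpdateStatus(Object{Spec: Spec{Replicas: 1}, Status: Status{Phase: "Provisioning"}})
	fmt.Printf("after status update: %+v\n", got)
}
```

Neither caller can damage the half it does not own, which is the property the subresource exists to guarantee.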

Practically, every controller-runtime Reconcile function ends with a call to r.Status().Update(ctx, obj) rather than r.Update(ctx, obj). The first targets the /status URL, which is RBAC'd separately (update on postgresclusters/status) so users can be granted edit on the kind without being able to invent statuses. Almost every “why is the controller fighting itself” bug ten years into Kubernetes traces back to an operator that forgot to enable the subresource and ended up in a write race with whatever was applying the spec.

The /scale subresource is the magic that makes kubectl scale postgrescluster prod-orders --replicas=5 work, that lets the HorizontalPodAutoscaler target your CR, and that lets kubectl autoscale hand off to it. You declare three JSONPaths in the CRD: specReplicasPath, statusReplicasPath, and (optionally) labelSelectorPath. The api-server then exposes a virtual Scale object on /scale that reads and writes those paths. Your CR does not have to call its replica field replicas; it can be .spec.size or .spec.cluster.replicas — the JSONPath wires it up.
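How a path declaration lets the replica field live anywhere in the object can be sketched with two small helpers that read and write an integer at a dotted path. This is an illustration under assumptions: it handles only simple dotted paths, the helper names are hypothetical, and the api-server's real JSONPath handling is more involved.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPath turns ".spec.cluster.replicas" into ["spec" "cluster" "replicas"].
func splitPath(p string) []string {
	return strings.Split(strings.TrimPrefix(p, "."), ".")
}

// getInt reads an integer at a dotted path; ok is false if any segment
// is missing or has the wrong type.
func getInt(obj map[string]any, path string) (int, bool) {
	keys := splitPath(path)
	cur := obj
	for _, k := range keys[:len(keys)-1] {
		next, ok := cur[k].(map[string]any)
		if !ok {
			return 0, false
		}
		cur = next
	}
	v, ok := cur[keys[len(keys)-1]].(int)
	return v, ok
}

// setInt writes an integer at a dotted path, creating intermediate maps.
func setInt(obj map[string]any, path string, v int) {
	keys := splitPath(path)
	cur := obj
	for _, k := range keys[:len(keys)-1] {
		next, ok := cur[k].(map[string]any)
		if !ok {
			next = map[string]any{}
			cur[k] = next
		}
		cur = next
	}
	cur[keys[len(keys)-1]] = v
}

func main() {
	// A CR whose replica field is called "size", wired up by the path.
	cr := map[string]any{"spec": map[string]any{"size": 3}}
	const specReplicasPath = ".spec.size"

	setInt(cr, specReplicasPath, 5) // what a write to /scale amounts to
	n, ok := getInt(cr, specReplicasPath)
	fmt.Println(n, ok)
}
```

The caller of /scale only ever sees a replica count; the path decides which field of the CR that count is projected onto.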

The label-selector path is the bit that makes the HPA work. The HPA needs to know which Pods belong to your CR so it can read their CPU usage; the standard mechanism is a label selector that the controller sets on .status.selector as a serialised labels.Selector string (e.g. app=postgres,cluster=prod-orders). With the path declared, the HPA controller reads the selector from /scale, queries metrics.k8s.io for the matching Pods, computes a desired replica count from their aggregated CPU, and writes it back to /scale, which the api-server projects onto your spec. That is how a third-party autoscaler controls a custom workload without ever knowing what a PostgresCluster is.
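The serialised form the controller writes into .status.selector is simple enough to sketch by hand. The function below is a stdlib-only approximation for equality-based labels; in a real controller you would more likely build the string with the labels package from k8s.io/apimachinery rather than this hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selectorString renders equality-based labels as "k1=v1,k2=v2".
// Keys are sorted so the output is stable across reconciles, which avoids
// pointless status writes caused by Go's random map iteration order.
func selectorString(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+labels[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	podLabels := map[string]string{
		"cluster": "prod-orders",
		"app":     "postgres",
	}
	fmt.Println(selectorString(podLabels))
}
```

The labels passed in should be the same set the controller stamps onto the Pods it creates, otherwise the autoscaler will select the wrong Pods or none at all.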

# ascii: how /status changes the api shape

  PUT  /apis/db.example.com/v1/namespaces/orders/postgresclusters/prod-orders
       └── only spec, metadata.labels, metadata.annotations are accepted
           status is silently dropped
           required RBAC: update on postgresclusters

  PUT  /apis/db.example.com/v1/namespaces/orders/postgresclusters/prod-orders/status
       └── only status is accepted
           spec changes are silently dropped
           required RBAC: update on postgresclusters/status   # separate resource!

  PUT  /apis/db.example.com/v1/namespaces/orders/postgresclusters/prod-orders/scale
       └── virtual Scale object: { spec: { replicas }, status: { replicas, selector } }
           required RBAC: update on postgresclusters/scale
           used by: kubectl scale, kubectl autoscale, HPA

# controller-runtime, in Reconcile:
if err := r.Update(ctx, &cluster); err != nil { return ctrl.Result{}, err }            // spec changes
if err := r.Status().Update(ctx, &cluster); err != nil { return ctrl.Result{}, err }   // status changes, different URL

If you are seeing “Operation cannot be fulfilled... the object has been modified” conflicts on every reconcile, check whether you are calling Update with both spec and status changes set on the same object. Even with the subresource, sending a stale resourceVersion will conflict — read the object fresh, mutate only the half you intend, and then call the matching method.
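The read-fresh-then-write discipline is a form of optimistic concurrency, and a toy store makes the failure mode concrete. Everything here is a hypothetical model: the Store type, the integer resourceVersion, and the setPhase helper are assumptions for illustration. In a real controller, client-go's retry helpers serve the same purpose as the loop below.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the api-server's 409 Conflict response.
var ErrConflict = errors.New("the object has been modified")

type Object struct {
	ResourceVersion int
	Phase           string
}

// Store is a toy stand-in for the api-server's storage of one object.
type Store struct{ obj Object }

func (s *Store) Get() Object { return s.obj }

// Update succeeds only if the caller's copy carries the current
// resourceVersion; on success the version is bumped.
func (s *Store) Update(o Object) error {
	if o.ResourceVersion != s.obj.ResourceVersion {
		return ErrConflict
	}
	o.ResourceVersion++
	s.obj = o
	return nil
}

// setPhase re-reads the object before every attempt, mutates only the
// field it owns, and retries when it loses a race.
func setPhase(s *Store, phase string, attempts int) error {
	for i := 0; i < attempts; i++ {
		fresh := s.Get()
		fresh.Phase = phase
		err := s.Update(fresh)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return ErrConflict
}

func main() {
	s := &Store{obj: Object{ResourceVersion: 1, Phase: "Pending"}}

	stale := s.Get()                        // copy taken early in a reconcile
	_ = setPhase(s, "Provisioning", 3)      // someone else writes in between
	stale.Phase = "Ready"
	fmt.Println("stale write:", s.Update(stale))
	fmt.Println("fresh write:", setPhase(s, "Ready", 3))
}
```

The stale write fails no matter which half of the object it touches, because the version check happens before the spec/status split is applied.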

Multi-version CRDs and the conversion webhook.

Sooner or later your CRD outgrows its first schema and you need a v2. The Kubernetes answer is identical to how the built-ins evolve: declare multiple versions in the same CRD, mark exactly one as the storage version, and either rely on None conversion (the default; it only rewrites apiVersion, so it works when the schemas are identical apart from additive optional fields) or run a conversion webhook that translates between versions on the fly. The api-server stores objects in the storage version and converts on every read or write whose requested version differs from storage.
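The "exactly one storage version" rule and the "convert when the requested version differs" rule can be stated as two small functions. This is a sketch of the invariants only; the Version struct and helper names are hypothetical and do not correspond to any Kubernetes library type.

```go
package main

import "fmt"

// Version mirrors the three fields of a CRD versions entry that matter here.
type Version struct {
	Name    string
	Served  bool
	Storage bool
}

// storageVersion returns the single version marked storage: true, or an
// error if the list has none or more than one.
func storageVersion(vs []Version) (string, error) {
	var found []string
	for _, v := range vs {
		if v.Storage {
			found = append(found, v.Name)
		}
	}
	if len(found) != 1 {
		return "", fmt.Errorf("want exactly one storage version, got %d: %v", len(found), found)
	}
	return found[0], nil
}

// needsConversion reports whether a request for the given version has to
// go through conversion before it is answered or persisted.
func needsConversion(requested, storage string) bool {
	return requested != storage
}

func main() {
	versions := []Version{
		{Name: "v1", Served: true, Storage: false},
		{Name: "v2", Served: true, Storage: true},
	}
	sv, err := storageVersion(versions)
	fmt.Println(sv, err)
	fmt.Println("v1 request converts:", needsConversion("v1", sv))
	fmt.Println("v2 request converts:", needsConversion("v2", sv))
}
```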

The mechanics are subtle. When a user runs kubectl get postgresclusters.v1.db.example.com, the api-server fetches the etcd-stored object (in v2, say), invokes your conversion webhook to turn it into v1, and returns v1 to the user. When a controller writes a v1 object, the api-server invokes the webhook in the other direction to turn it into v2, validates against the v2 schema, and persists v2 to etcd. Same etcd row, different JSON shape on the wire. Old clients and new controllers can coexist for as long as you keep both versions served.
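The two directions of translation can be sketched as a pair of pure functions. The v2 shape used here is invented purely for illustration: it assumes v2 renames replicas to instances and adds a pooler field, neither of which comes from this page. A real webhook would wrap functions like these in the ConversionReview request and response handling.

```go
package main

import "fmt"

// V1Spec is the original shape served to old clients.
type V1Spec struct {
	Version  string
	Replicas int
}

// V2Spec is a hypothetical newer shape: Replicas renamed to Instances,
// plus a new Pooler field that v1 has no place for.
type V2Spec struct {
	Version   string
	Instances int
	Pooler    string
}

// ToV2 upgrades a v1 object. The new field gets a fixed value so the
// result is deterministic and the function stays stateless.
func ToV2(in V1Spec) V2Spec {
	return V2Spec{Version: in.Version, Instances: in.Replicas, Pooler: "none"}
}

// ToV1 downgrades a v2 object. Pooler is dropped, so this direction is
// lossy for objects that set it to something other than the fixed value.
func ToV1(in V2Spec) V1Spec {
	return V1Spec{Version: in.Version, Replicas: in.Instances}
}

func main() {
	original := V1Spec{Version: "16", Replicas: 3}
	stored := ToV2(original)  // what is persisted when v2 is the storage version
	served := ToV1(stored)    // what an old client reads back
	fmt.Printf("stored: %+v\n", stored)
	fmt.Printf("served: %+v\n", served)
	fmt.Println("round trip preserved v1:", served == original)
}
```

Because both functions depend only on their input, calling them twice yields the same answer, which is the idempotence and statelessness the api-server relies on.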

The conversion webhook is an HTTPS endpoint you implement and register in the CRD. It receives a ConversionReview request — a list of objects in one version — and must return the same list in the requested target version. The webhook is called by the api-server, on the request hot path, and so it must be fast, idempotent, and stateless. A conversion that takes 200ms adds 200ms to every list, watch-event delivery, and apply for that CRD. Cluster operators notice. The standard deployment is two replicas behind a Service, with a serving certificate provisioned by cert-manager and the public CA bundle fed back into the CRD's caBundle field via a controller (cert-manager's CAInjector automates this).

Two patterns make conversion sustainable. First, design your storage version to be a strict superset of every served version, so that conversion in one direction is just field renaming and conversion in the other direction is sometimes lossy but predictable. Second, run a one-time storage migration after you switch the storage version: list every existing object and re-PUT it, which forces the api-server to re-encode it into the new storage version. The storage-version-migrator controller automates this. Without a migration, etcd keeps objects encoded in the old version indefinitely, the old version stays listed in the CRD's status.storedVersions, and the api-server refuses to let you drop it from spec.versions.

The other big lever is the per-version deprecated: true flag (with an optional deprecationWarning message) in the CRD's versions list. It tells the api-server to emit a warning header on every request to the deprecated version, which kubectl prints to stderr. Combined with the storage-version migrator, this is the standard playbook for retiring v1: mark v1 deprecated for a full release cycle, watch warnings drop to zero on dashboards, then mark v1 served: false. The shape of the deprecation cycle is identical to how the built-ins retire APIs (e.g. the extensions/v1beta1 Deployment that stopped being served in 1.16).

# conversion strategy on the CRD
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: [v1]
      clientConfig:
        service:
          namespace: postgres-operator-system
          name: postgres-operator-webhook
          path: /convert
          port: 443
        caBundle: # injected by cert-manager's CAInjector
  versions:
    - name: v1                # served, NOT storage — wire-compatible with old clients
      served: true
      storage: false
      schema: { ... v1 schema, kept identical to the original release ... }
    - name: v2                # the new storage version
      served: true
      storage: true
      schema: { ... v2 schema, superset of v1 plus new fields ... }

A conversion webhook outage is a cluster-wide outage for that CRD. Every list, every watch event delivery, every apply that needs a conversion fails until the webhook returns. Unlike admission webhooks, conversion webhooks have no failurePolicy: there is no Ignore mode, and the api-server never falls back to returning the storage-version JSON un-converted. Availability is therefore the only mitigation. Switch the strategy to Webhook only after you are running at least two replicas, anti-affinity-scheduled, with a PodDisruptionBudget.

The operator maturity model — Levels I through V.

CoreOS originally articulated the operator maturity model in 2017, and OperatorHub adopted it as a five-level rubric for ranking operators in the catalogue. The progression is from “a fancier installer” to “a piece of software that operates itself.” It is useful as a planning ladder: each level is a discrete engineering project, the increments are roughly equal in cost, and building on a shaky lower level almost always backfires because each level depends on the ones below it being solid.

Level I (Basic install) is the minimum viable operator: install the operand from a CR, configure it, expose a Service, mount Secrets and ConfigMaps. A good Helm chart with a controller wrapper. The CR is essentially a typed input form, and a templating tool could stand in for the operator without losing much. Most operators ship at this level on day one and stay here for a long time. There is nothing wrong with stopping here if the operand really does not need lifecycle help, as with stateless services or ad-hoc dev tools.

Level II (Smooth upgrades) means the operator handles patch and minor version updates of the operand without human intervention. Bumping .spec.version from 15.4 to 15.5 rolls the underlying StatefulSet, observes pod readiness and replication lag, and either succeeds or surfaces a clear status. This is harder than it looks because most stateful systems have ordering constraints — upgrade replicas before the primary, run pre-flight checks, gate the cutover on successful failover. Level II is where the operator starts encoding real domain knowledge.

Level III (Full lifecycle) covers backups, restores, fail-overs, storage migrations, certificate rotations, scaling up and down. The CR grows new sub-objects: .spec.backup, .spec.replication.standby, .spec.tls. The operator may emit Job objects for one-shot operations and watch them for completion. It may create extra CRs of its own kinds (e.g. PostgresBackup, PostgresRestore) so users can express “restore this cluster from that backup” declaratively. This is where commercial database operators and mature open-source operators such as Strimzi (Kafka) and the Cassandra operators live.

Level IV (Deep insights) is observability: the operator emits operand-specific metrics in Prometheus format, exposes ServiceMonitor resources, ships Grafana dashboards, defines AlertManager rules, surfaces operand events. The principle is that anyone with kubectl and Prometheus should be able to diagnose a problem with the operand without reading operand-specific documentation. Level IV distinguishes operators that platform teams enjoy running from operators that platform teams begrudge running.

Level V (Auto-pilot) is the operator making decisions on its own: horizontal autoscaling based on operand metrics, vertical autoscaling based on resource pressure, anomaly detection that triggers preventive failover, automatic index tuning, automatic vacuum scheduling. The CR becomes a high-level intent (.spec.workload: oltp) and the operator chooses replicas, sizes, and configuration based on observed traffic. Almost no operator reaches Level V honestly. The operators that claim it usually have a few narrowly-scoped autopilot features bolted onto a Level III base.

A planning observation. Most teams underestimate I and IV and overestimate III and V. A clean Level I with a well-designed CRD schema (the bulk of this page) is more durable than a janky Level III with a CRD that has to be redesigned later. And Level IV is the level your platform team will actually thank you for. Pick your battles accordingly: ship a clean I, then jump straight to IV before you attempt III. Your future self at 3am will agree.

The Operator Capability Levels rubric also maps onto OperatorHub categories — “Database”, “Big Data”, “Logging & Tracing”, “Networking”, “Security”, “Streaming & Messaging”, “Storage”, “Cloud Provider”, “Modernisation & Migration”, “Application Runtime”, “Monitoring”, “OpenShift Optional”, “Developer Tools”, “AI/Machine Learning”. Picking your category early shapes which existing CRD conventions you should mirror.

kubebuilder and Operator SDK: a working scaffold in ten minutes.

There are two scaffolding tools the ecosystem has converged on, and they are siblings more than competitors: kubebuilder, maintained under Kubernetes SIG API Machinery, and Operator SDK, maintained by the Red Hat operator-framework team. Both wrap controller-runtime, both generate near-identical project layouts, and both produce a working reconciler that you can run against any cluster with make run. Operator SDK adds extra generators for OLM bundles (the OperatorHub packaging format), for Ansible-based operators, and for Helm-based operators; if you do not need those, kubebuilder is the lower-friction choice.

The kubebuilder workflow is three commands. kubebuilder init creates the Go module, the Makefile, the Dockerfile, the manager binary entry-point. kubebuilder create api scaffolds a CRD type and a controller stub for a given group/version/kind. make manifests && make install generates the CRD YAML from your Go struct's annotations and applies it to the cluster. From a fresh terminal to a CRD in etcd takes roughly five minutes; the next five are spent in internal/controller/postgrescluster_controller.go filling in the Reconcile body.

The annotations on your Go types are how kubebuilder generates the OpenAPI v3 schema. You write Go struct fields with json: tags and // +kubebuilder: comment markers; the controller-gen tool reads them and emits the CRD YAML at build time. This is the contract: your Go types are the source of truth, the YAML is generated. Editing the YAML by hand is an anti-pattern because the next make manifests will overwrite it. The markers cover all the schema knobs from Part 02 (default, enum, min/max, pattern), the subresource declarations from Part 04, the printer columns, and, via the +kubebuilder:validation:XValidation marker, the CEL rules from Part 03.

The Reconcile function gets a ctrl.Request identifying the CR by namespace and name, a typed client, and a logger. It returns a ctrl.Result that lets you request a requeue (immediately, after a delay, or never) and an error. The reconcile loop must be idempotent (it can be called repeatedly with the same input) and level-driven rather than edge-driven: it should not assume any particular previous state or triggering event, because the work queue may have coalesced multiple events into one call. Structure: fetch the CR, fetch the children, diff intent against reality, apply changes, update status, return.
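The fetch, diff, apply, update-status structure can be shown in miniature without any Kubernetes dependency. The sketch below is a toy: Cluster, World, and reconcile are hypothetical names, and the World struct stands in for the child resources a real controller would list from the api-server.

```go
package main

import "fmt"

// Cluster is a stand-in for the custom resource: intent plus reported state.
type Cluster struct {
	DesiredReplicas int    // plays the role of spec
	Phase           string // plays the role of status
}

// World is a stand-in for the child resources that actually exist.
type World struct{ Pods int }

// reconcile compares intent with reality and closes the gap. It never
// asks what event triggered it; it only looks at the current state.
// The return value is the number of create or delete actions it took.
func reconcile(c *Cluster, w *World) (actions int) {
	for w.Pods < c.DesiredReplicas {
		w.Pods++ // "create a Pod"
		actions++
	}
	for w.Pods > c.DesiredReplicas {
		w.Pods-- // "delete a Pod"
		actions++
	}
	c.Phase = "Ready"
	return actions
}

func main() {
	c := &Cluster{DesiredReplicas: 3}
	w := &World{}

	fmt.Println("first call:", reconcile(c, w), "actions")
	fmt.Println("second call:", reconcile(c, w), "actions") // nothing left to do

	w.Pods = 7 // reality drifted for reasons the controller never observed
	fmt.Println("after drift:", reconcile(c, w), "actions")
}
```

The second call doing nothing is the idempotence property; recovering from the unobserved drift is what working from current state rather than from events buys you.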

Owner references and finalizers get set by helper calls in controller-runtime. controllerutil.SetControllerReference marks a child resource as owned by the CR; deletion of the CR triggers cascading deletion of the child via Kubernetes' built-in garbage collector, with no controller code needed for the happy path. Finalizers handle the unhappy path: external state that has to be cleaned up before the CR is allowed to disappear from etcd. The pattern is documented in detail in the controllers sub-page; the kubebuilder scaffold below shows where the calls go.

// api/v1/postgrescluster_types.go — generated by kubebuilder, then edited
package v1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PostgresClusterSpec defines the desired state.
type PostgresClusterSpec struct {
    // +kubebuilder:validation:Enum="15";"16";"17"
    // +kubebuilder:validation:Required
    Version string `json:"version"`

    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=9
    // +kubebuilder:default=3
    Replicas int32 `json:"replicas"`

    // +kubebuilder:validation:Required
    Storage StorageSpec `json:"storage"`

    Backup *BackupSpec `json:"backup,omitempty"`
}

type StorageSpec struct {
    // +kubebuilder:validation:Pattern=`^[0-9]+(Mi|Gi|Ti)$`
    Size             string `json:"size"`
    StorageClassName string `json:"storageClassName,omitempty"`
}

type BackupSpec struct {
    Schedule  string `json:"schedule,omitempty"`
    // +kubebuilder:default=7
    Retention int    `json:"retention,omitempty"`
}

// PostgresClusterStatus is the observed state.
type PostgresClusterStatus struct {
    Phase         string             `json:"phase,omitempty"`
    ReadyReplicas int32              `json:"readyReplicas,omitempty"`
    Selector      string             `json:"selector,omitempty"`
    Conditions    []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.readyReplicas,selectorpath=.status.selector
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Phase",type=string,JSONPath=`.status.phase`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
// +kubebuilder:resource:shortName=pgc;pg,categories=databases
type PostgresCluster struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec   PostgresClusterSpec   `json:"spec,omitempty"`
    Status PostgresClusterStatus `json:"status,omitempty"`
}

// internal/controller/postgrescluster_controller.go — the reconcile body
package controller

import (
    "context"

    "k8s.io/apimachinery/pkg/runtime" // needed for runtime.Scheme below
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

    dbv1 "example.com/postgres-operator/api/v1"
)

const finalizerName = "db.example.com/finalizer"

type PostgresClusterReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=db.example.com,resources=postgresclusters,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=db.example.com,resources=postgresclusters/status,verbs=update;patch
// +kubebuilder:rbac:groups=db.example.com,resources=postgresclusters/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=statefulsets,verbs=get;list;watch;create;update;patch;delete
func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var pgc dbv1.PostgresCluster
    if err := r.Get(ctx, req.NamespacedName, &pgc); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // finalizer add/remove pattern — full discussion in the controllers sub-page
    if pgc.DeletionTimestamp.IsZero() {
        if !controllerutil.ContainsFinalizer(&pgc, finalizerName) {
            controllerutil.AddFinalizer(&pgc, finalizerName)
            return ctrl.Result{}, r.Update(ctx, &pgc)
        }
    } else {
        if controllerutil.ContainsFinalizer(&pgc, finalizerName) {
            if err := r.deleteExternalResources(ctx, &pgc); err != nil {
                return ctrl.Result{}, err
            }
            controllerutil.RemoveFinalizer(&pgc, finalizerName)
            return ctrl.Result{}, r.Update(ctx, &pgc)
        }
        return ctrl.Result{}, nil
    }

    // reconcile children — StatefulSet, Service, Secret, PDB
    sts := buildStatefulSet(&pgc)
    if err := controllerutil.SetControllerReference(&pgc, sts, r.Scheme); err != nil {
        return ctrl.Result{}, err
    }
    if err := r.applyChild(ctx, sts); err != nil {
        return ctrl.Result{}, err
    }

    // observe the world, write status
    // (assumes applyChild refreshes sts from the api-server, so sts.Status is populated)
    pgc.Status.Phase = derivePhase(sts)
    pgc.Status.ReadyReplicas = sts.Status.ReadyReplicas
    pgc.Status.Selector = "app=postgres,cluster=" + pgc.Name
    return ctrl.Result{}, r.Status().Update(ctx, &pgc)
}

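The RBAC markers above are not decorative: controller-gen reads them and emits the ClusterRole that the operator binds to its ServiceAccount. Forget the rule for the /status subresource and your reconciler will succeed at every step until it tries to write status, then 403. Forget the /finalizers rule and, on clusters that enable the OwnerReferencesPermissionEnforcement admission plugin, setting a controller owner reference with blockOwnerDeletion on a child is rejected. Adding the finalizer itself is an update to the main resource, so it is governed by the update verb on postgresclusters. Run make manifests after every marker change.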
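
The reason a missing /status rule fails so late is that RBAC treats a subresource as a separate resource name. The following is a minimal stdlib-only sketch of that matching, assuming literal comparison only; the rule type and the allowed function are hypothetical and leave out wildcards, resourceNames, and non-resource URLs.

```go
package main

import "fmt"

// rule is a simplified, hypothetical mirror of an RBAC policy rule.
// Wildcards ("*") are deliberately not modelled.
type rule struct {
	Group     string
	Resources []string
	Verbs     []string
}

// contains reports whether list holds s.
func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// allowed reports whether any rule grants verb on resource in group.
// Resource names are compared literally, so "postgresclusters/status"
// is a different resource from "postgresclusters".
func allowed(rules []rule, group, resource, verb string) bool {
	for _, r := range rules {
		if r.Group == group && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	// Only the first marker from the scaffold; the /status marker was forgotten.
	rules := []rule{{
		Group:     "db.example.com",
		Resources: []string{"postgresclusters"},
		Verbs:     []string{"get", "list", "watch", "update", "patch"},
	}}
	fmt.Println(allowed(rules, "db.example.com", "postgresclusters", "update"))        // true
	fmt.Println(allowed(rules, "db.example.com", "postgresclusters/status", "update")) // false
}
```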

CRD vs aggregated apiserver, anti-patterns, further reading.

Almost every API extension you will write should be a CRD. The aggregated apiserver is the right answer in three narrow situations and the wrong answer in every other one. First, you have a non-etcd backing store — you need to expose live metrics (metrics.k8s.io aggregates from kubelet /stats/summary), or to expose a database query result, or to expose a paginated stream from object storage. Storing those in etcd is infeasible (size, churn) or wrong (the data is not the source of truth). Second, you need imperative endpoints — verbs that are not CRUD on a resource, like /exec, /portforward, /proxy. Third, you need fully custom storage semantics: server-driven pagination over a non-list shape, watch resumability with non-etcd revision tokens, server-defined OPTIONS responses.

The aggregated apiserver costs you everything the CRD path gives you for free. You write the storage layer (or wrap an etcd client yourself). You implement watch. You wire up authentication delegation. You vend serving certificates, watch your CA bundle, deal with APIService availability conditions. The sample-apiserver repo and apiserver-builder exist precisely to reduce the cost, but it is still ten times the engineering of a CRD-plus-controller. Reach for it only when CRD cannot express the API.

Anti-patterns are the other half of the design lesson. The biggest one, by a wide margin, is using a CRD as a configuration database: an opaque blob inside .spec.config: string that no one validates, no one diffs, no one upgrades. CRDs reward a typed schema and punish a string blob. If you find yourself writing x-kubernetes-preserve-unknown-fields: true at the top of a CRD schema, stop and ask whether the right answer is a different shape. Two closely related anti-patterns are CRDs that store secrets in plaintext (use a Secret reference instead) and CRDs that store large blobs (etcd is not a blob store; the default request size limit is roughly 1.5 MiB).

The second anti-pattern is the operator that creates other CRs of its own kinds solely to orchestrate work: PostgresUpgradeJob, PostgresBackupJob, ten kinds for ten verbs. CRs are durable state, not commands. If you need a one-shot operation, use a Job (it already exists, it works) or expose a request field on the parent CR's spec (.spec.requestedBackup: latest) and reconcile it into a Job internally. Per-verb CRDs grow without bound, complicate RBAC, and end up as a parallel work-queue in etcd that no one knows how to drain.
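
One way to reconcile a request field into a Job is to compare the requested value with the last one the controller handled. The sketch below is a stdlib-only illustration under that assumption; backupState, its Completed field, and jobFor are hypothetical, and a real operator would record the handled request in status and create the Job through the api-server.

```go
package main

import "fmt"

// backupState is a hypothetical pairing of a spec request with a status record.
type backupState struct {
	Requested string // stands in for .spec.requestedBackup
	Completed string // stands in for a status field naming the last handled request
}

// jobFor returns the name of the Job to create, or "" when nothing is pending.
// The name is derived from the request, so a repeated reconcile asks for the
// same Job again instead of creating a second one.
func jobFor(cluster string, s backupState) string {
	if s.Requested == "" || s.Requested == s.Completed {
		return ""
	}
	return fmt.Sprintf("%s-backup-%s", cluster, s.Requested)
}

func main() {
	pending := backupState{Requested: "r1"}
	fmt.Println(jobFor("pg", pending)) // pg-backup-r1

	done := backupState{Requested: "r1", Completed: "r1"}
	fmt.Println(jobFor("pg", done) == "") // true: request already handled
}
```

The request stays durable state on the parent CR, and the Job remains an implementation detail of the controller.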

The third anti-pattern is the operator that reaches around the api-server: writes directly to etcd, reads pod stdout to derive state, runs without leader election in HA, mutates objects through admission webhooks instead of through reconcile loops. The whole CRD plus operator pattern works because the api-server is the single source of truth (see the api-server sub-page) and watches are the single mechanism for change propagation. Bypass either and you have built an island again — exactly the thing CRDs were invented to dissolve.

One last positive pattern: cross-link the controllers sub-page for the reconcile mechanics, the api-server sub-page for the request pipeline, and the etcd sub-page for what your CRs look like in storage. CRDs only make sense in the context of the system they extend.

Authoritative docs

Tooling + scaffolds

KEPs that shaped this

And the rest of the Semicolony ladder. The controller pattern sub-page is the deep version of what your operator's reconcile loop actually does; the api-server sub-page is what happens to your CR between kubectl apply and the etcd write; the architecture sub-page is the system around it. For the visceral side, the rollout simulator lets you watch a controller reconcile in motion. And the pod creation guide traces a single apply through every component named here.

One closing observation. CRDs and operators are the line where Kubernetes stops being a generic compute platform and starts being a domain-specific platform that your team owns. The discipline of the CRD schema — the spec/status split, the OpenAPI types, the CEL invariants, the subresources, the conversion contract — is what keeps that domain platform legible to everyone who arrives after you. Spend the time on the schema first. The reconcile loop is the easy part.

Next in the internals series

Keep going.

The controller pattern

Informers, listers, work queues, finalizers, reconciliation. Pseudocode you can ship.

The api-server pipeline

The six-step request pipeline every CR write traverses. Authn, authz, admission, conversion, storage.

Architecture, end to end

The eight processes and the one storage primitive. Where your operator fits in the larger picture.

Back to the internals index

All twelve sub-pages and the system on one canvas.