Secrets management

Every production system has credentials — database passwords, API keys, signing keys, OAuth client secrets, encryption keys. Where those credentials actually live is the question secrets management answers. Get it wrong and the API key ends up in a public git repo (this happens roughly 12 times per minute on GitHub, by the public secret-scanner feed). Get it right and rotating a leaked credential is a two-minute operation instead of a multi-day outage.


Where secrets leak — the working list

Start with what a secret actually is. A secret is any value that grants access or proves identity and that you would not want a stranger to read: a database password, an API key, a private signing key, a TLS private key, an OAuth client secret, a session-encryption key. The defining property is not that it is random — plenty of secrets are short or memorable — it is that possession equals authority. Whoever holds the value can do whatever the value permits. That is why a leaked secret is worse than a leaked config value: the leak is not information, it is access.

So the whole discipline reduces to one question. How many places does this value end up, and who can read each of those places? Every copy is a separate chance to leak, and copies multiply faster than anyone expects. The diagram below traces the paths a single credential travels from the developer's laptop to production, and marks every point where it tends to escape into a context wider than the process that needs it.

[Figure: one secret, many copies, many leak points. Intended path: laptop (.env file) → git repo (commit + history) → CI / build (logs + image) → registry (image layers) → production (running pod). Leak points marked along the way: screen share, public scrape, log retention, anonymous pull, env dump, stack trace, SSRF → metadata. Each red arrow is a place the value escapes the process that needs it; fewer copies mean fewer arrows and a smaller attack surface.]
The leak surface of a single credential. Solid arrows are intended movement; dashed red arrows are the leaks.

Before tools, the failure modes. Almost every secret leak in the last decade fits into one of these:

Git history. A developer commits a .env file once, deletes it in the next commit, ships. The secret is in the history forever. Public-repo leaks get scraped by scanners within minutes; private-repo leaks survive until someone audits. The 2016 Uber breach started this way.

CI/CD logs. A build script echoes an environment variable for debugging. The variable contains a token. The log is retained for 90 days, visible to anyone with repo access. Common at organizations that grant broad CI access for productivity.

Container image layers. COPY .env ./.env in a Dockerfile. The .env is deleted in a later RUN, but the layer is still in the image. The image is pushed to a public registry. The 2017 Cybereason research found thousands of these on Docker Hub.

Backups and snapshots. The production database with embedded secrets is snapshotted; the snapshot is shared with a dev account; the dev account has weaker access controls. The same widening of context happens in CI: the Codecov supply-chain attack (2021) exfiltrated secrets out of CI environments because they were sitting in environment variables that the compromised uploader could read.

Stack traces and error reports. A connection string with embedded password is included in a panic message sent to Sentry/Datadog/Rollbar. The error aggregator's storage is more permissive than the production database.

Cloud provider metadata. An SSRF in an app server reaches the EC2 metadata endpoint (169.254.169.254), which returns the instance role's temporary credentials. Those credentials have permissions the app server should never have had to use directly. The Capital One breach (2019) is the canonical example.

"Just one" hardcoded fallback. "If the env var is empty, use this default for local development." The default ends up in production because the env var was misnamed in deployment. Mitigation: hardcoded defaults should never be production-valid credentials.

The pattern. Almost all of these come from secrets being read into a context wider than the immediate process — a log, an image, a snapshot, an error report. Secrets management is the discipline of keeping that context narrow.
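The hardcoded-fallback failure in the list above is the easiest one to close in code. The sketch below is one way to do it, not a prescribed implementation: the variable names APP_ENV and DB_PASSWORD and the placeholder value are assumptions made for illustration. The idea is that a missing secret fails loudly unless the deployment explicitly declares itself local, and the local default is a value that is valid nowhere.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent outside local development."""


# Hypothetical placeholder: deliberately not a valid credential anywhere.
DEV_ONLY_PLACEHOLDER = "dev-only-not-a-real-password"


def load_db_password(environ=None) -> str:
    """Return the DB password, refusing to fall back silently.

    APP_ENV and DB_PASSWORD are assumed names for this sketch.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("DB_PASSWORD", "")
    if value:
        return value
    # Fall back only when the deployment explicitly says it is local.
    if environ.get("APP_ENV") == "local":
        return DEV_ONLY_PLACEHOLDER
    raise MissingSecretError("DB_PASSWORD is not set; refusing to use a default")
```

With this shape, a misnamed variable in a production deployment stops the process at startup instead of quietly running on the development default.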

The architectural goal

Three properties a good secrets-management setup has:

Secrets at rest are encrypted, with separate access controls from the data. Encrypted in transit by default. Encrypted at rest by something that is not the same key everyone in engineering has. Audit log of access.

Secrets in the process are short-lived. Pulled at start-up, refreshed on expiry, never written to disk. The process has the secret in memory only; if the process restarts, it re-fetches.

Rotation is a routine operation, not a crisis. A leaked credential should be revocable in minutes by a single person, with the new credential propagating to all workloads that need it automatically. If rotation requires a coordinated multi-team deployment, the system will not rotate until something breaks.

Every tool below is a different trade-off on how to achieve these three. The question is never "should we use Vault" but "what does our deployment model allow, and what is the simplest tool that satisfies the three properties given that model".

Envelope encryption — how the store keeps the secret

"Encrypted at rest" sounds like one operation, but in every serious secret store it is two keys wearing a trench coat. The technique is called envelope encryption, and once you see it you will recognise it in cloud KMS, in Vault's seal, in disk encryption, and in most database encryption features.

The problem it solves is real. You want every secret encrypted, but you do not want to send the actual secret bytes to a central key service on every read — that service becomes a bottleneck and sees all your plaintext. You also do not want to encrypt a thousand secrets directly with one master key, because then rotating the master key means re-encrypting everything, and a single key encrypting a huge volume of data is its own weakness.

Envelope encryption splits the job across two key tiers. A data encryption key (DEK) encrypts the actual secret. A separate key encryption key (KEK) encrypts the DEK. The KEK lives in a hardened key service (a KMS, an HSM) and never leaves it; the encrypted DEK is stored right next to the ciphertext it protects. To read a secret you ask the key service to unwrap the DEK, decrypt the secret with the unwrapped DEK in memory, and throw the DEK away.

[Figure: KEK wraps DEK, DEK encrypts secret. Inside the KMS / HSM boundary: the KEK, which never leaves this box, with the operations unwrap(DEK_enc) → DEK and wrap(DEK) → DEK_enc; rotating the KEK re-wraps DEKs only. Outside the boundary: the plaintext DEK (in memory only), DEK_enc (stored beside the secret), and the encrypted secret (ciphertext at rest). The plaintext DEK exists only for the moment of decrypt, then it is discarded.]
Envelope encryption. The master key (KEK) stays inside the key service; the per-secret DEK does the bulk work and is stored wrapped.

This structure earns its keep in three ways. Rotating the KEK is cheap: you re-wrap the small DEKs, not the bulk ciphertext, so a master-key rotation that would otherwise touch terabytes touches kilobytes instead. The KMS only ever handles tiny wrapped keys, never the plaintext secrets, so it stays fast and never sees your data. And the access boundary is clean — permission to call unwrap on the KEK is the only thing that grants the ability to read anything, so that single IAM permission is what you audit and scope.

When AWS KMS, GCP KMS, or Azure Key Vault back a secret manager, this is the machinery underneath. The same idea appears one level up: a secret manager's encrypted blob is itself often wrapped by a KMS key, so "who can decrypt this secret" becomes "who can use this KMS key," which is an ordinary IAM question rather than a key-handling one.
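The two-tier structure is easier to see in code. The sketch below is a toy, written only to show the shape: ToyKMS, toy_encrypt, and toy_decrypt are hypothetical names, and the cipher is a stand-in built from HMAC-SHA256 so the example runs on the standard library alone. It is not a vetted construction; a real system would use an authenticated cipher such as AES-GCM from a maintained library and a real KMS or HSM for the KEK.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key + nonce (toy stand-in)."""
    blocks = []
    for counter in range((length + 31) // 32):
        msg = b"stream" + nonce + counter.to_bytes(8, "big")
        blocks.append(hmac.new(key, msg, hashlib.sha256).digest())
    return b"".join(blocks)[:length]


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Illustration-only cipher: nonce || tag || ciphertext."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    ct = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(key, b"tag" + nonce + ct, hashlib.sha256).digest()
    return nonce + tag + ct


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, tag, ct = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(key, b"tag" + nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted data")
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct))))


class ToyKMS:
    """Stand-in for a KMS/HSM: the KEK never leaves this object."""

    def __init__(self) -> None:
        self._kek = secrets.token_bytes(32)

    def wrap(self, dek: bytes) -> bytes:
        return toy_encrypt(self._kek, dek)

    def unwrap(self, wrapped_dek: bytes) -> bytes:
        return toy_decrypt(self._kek, wrapped_dek)

    def rotate_kek(self, wrapped_deks: list) -> list:
        """Replace the KEK and re-wrap the DEKs; bulk ciphertext is untouched."""
        deks = [self.unwrap(w) for w in wrapped_deks]
        self._kek = secrets.token_bytes(32)
        return [self.wrap(d) for d in deks]


def encrypt_secret(kms: ToyKMS, plaintext: bytes):
    dek = secrets.token_bytes(32)            # fresh per-secret DEK
    ciphertext = toy_encrypt(dek, plaintext)
    return kms.wrap(dek), ciphertext         # store both, side by side


def decrypt_secret(kms: ToyKMS, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    dek = kms.unwrap(wrapped_dek)            # plaintext DEK lives only here
    return toy_decrypt(dek, ciphertext)
```

Note what rotate_kek does not take as an argument: the encrypted secrets themselves. Only the wrapped DEKs pass through the key service, which is the property that makes master-key rotation cheap.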

Cloud-native secret managers

AWS Secrets Manager, GCP Secret Manager, Azure Key Vault. If you are already on a cloud, these are almost always the right starting point. They handle encryption at rest (with a KMS-managed key), IAM-style access control, audit logging, and rotation hooks.

The pattern of use is straightforward:

# AWS, application code
import boto3, json

client = boto3.client('secretsmanager')
resp = client.get_secret_value(SecretId='prod/db/credentials')
creds = json.loads(resp['SecretString'])
# creds = {"username": "...", "password": "..."}

# IAM policy that grants this:
{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:us-east-1:123:secret:prod/db/credentials-*"
}

Cost is per-secret per month plus per API call; cheap until you have thousands of secrets or millions of pulls per minute. The friction points: pulling on every cold start of a Lambda multiplies latency (cache the secret in process memory); cross-region access requires explicit replication; cross-account access needs resource policies which most teams get wrong the first time.
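The "cache the secret in process memory" advice can be sketched as a small wrapper around whatever call fetches the secret. This is a minimal illustration under assumptions: CachedSecret is a hypothetical helper, not part of boto3 or any SDK, and the TTL is something you would choose to match your rotation window.

```python
import time
from typing import Callable, Optional


class CachedSecret:
    """Hold a secret in memory and re-fetch it only after the TTL expires."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # e.g. a function wrapping get_secret_value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache miss or stale entry: one call to the secret manager.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

In a Lambda, an instance of this would live at module scope so that warm invocations reuse it; the fetch argument would wrap the get_secret_value call shown above. The TTL bounds how long a rotated-away value can keep being served.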

GCP Secret Manager is almost identical in API. Azure Key Vault is more complex because it bundles secrets, keys, and certificates with slightly different access models — Key Vault is the closest of the three to Vault in feature set, and the most expensive to operate.

Rotation: AWS Secrets Manager has built-in rotation for RDS, Redshift, DocumentDB, and a Lambda-runs-your-script model for everything else. The rotation Lambda creates a new version, validates it (test the new credentials), and promotes it. Applications pull "the current version" and get the new one on next fetch. Done right, rotation has zero downtime; done wrong, the validation step is missing and the rotation breaks production.

HashiCorp Vault — the more capable option

Vault does what the cloud secret managers do, plus a different and more interesting thing: it can issue dynamic secrets — credentials that did not exist before the app asked for them, scoped to the app, with a TTL.

The dynamic-secrets flow for a database:

# Vault config: a "database secret engine" for Postgres
vault write database/config/orders-db \
    plugin_name=postgresql-database-plugin \
    connection_url="postgresql://{{username}}:{{password}}@orders.internal:5432/" \
    allowed_roles="orders-app" \
    username="vault-admin" password="..."

vault write database/roles/orders-app \
    db_name=orders-db \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA app TO \"{{name}}\";" \
    default_ttl="1h" max_ttl="24h"

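# Application code (Python with the hvac client; the address is a placeholder
# and authentication is assumed to be configured already):
import hvac

client = hvac.Client(url="https://vault.example.internal:8200")
creds = client.secrets.database.generate_credentials(name="orders-app")["data"]
# creds holds a fresh username + password good for 1 hour
# Vault has actually CREATEd a Postgres role with that name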

The application's database credential is now ephemeral — a new role per pod per hour. A leak of the credential is bounded by its TTL; revocation is "delete the lease in Vault"; audit shows exactly which Vault role created which Postgres role.

The same pattern works for AWS IAM credentials, RabbitMQ, MongoDB, SSH certificates, PKI-issued TLS certs. Anything with a programmatic provisioning API can become a Vault dynamic secret engine.

The cost: operating Vault. The cluster itself, the auto-unseal configuration, the auth backends (Kubernetes service-account auth, OIDC, AppRole), the policies, the disaster recovery story. Vault is capable and rewards investment; teams that do not have the capacity to operate a complex stateful cluster should reach for the cloud-native option first and consider Vault when the dynamic-secrets pattern is worth the operational bill.

The best secret is no static secret — workload identity

Step back from "how do we store this key safely" and ask "why does the workload have a long-lived key at all." A static API key is a bearer token with no expiry: it works forever, anywhere, for anyone who holds it. Most of the leaks above are bad precisely because the leaked value is long-lived. If the value expired in an hour and only worked from the workload that was meant to use it, a leak would be a non-event.

That is the goal of workload identity. Instead of giving a service a stored credential, you give it a verifiable identity, and it exchanges that identity for short-lived credentials on demand. The identity is something the platform can attest to — "this is the pod running as service account X in cluster Y" — and the exchange is an ordinary token swap. The workload never holds a static secret; it holds proof of who it is, and the proof is checked fresh each time.

Every cloud has a native version. On AWS, IAM roles for service accounts (IRSA) let an EKS pod assume an IAM role through its Kubernetes service-account token; the pod gets temporary STS credentials that rotate automatically and expire on their own. GKE Workload Identity and AKS Managed Identities do the same on their platforms. An EC2 instance role works the same way at the VM level — the instance metadata service hands the workload short-lived credentials it never had to store. The thing you ship to production in all of these is zero secrets; the platform vouches for the workload and the credentials are minted at runtime.

The vendor-neutral version of this idea is SPIFFE. It defines a stable identity for a workload (a SPIFFE ID, a URI like spiffe://example.org/ns/prod/sa/orders) and a way to deliver a short-lived, cryptographically verifiable document proving that identity (an SVID, usually an X.509 cert or a JWT). SPIRE, the reference implementation, attests workloads based on properties the platform can confirm — the kubelet, the process, the node — and rotates the SVID continuously. Two services that both trust the same SPIFFE trust domain can authenticate to each other with no shared static secret at all, which is the foundation most service-mesh mTLS is built on.

The connective tissue between all of these is OIDC federation. A platform issues a signed identity token (a JWT) describing the workload; a target system is configured to trust that issuer and to map claims in the token to a role or permission set. This is how a GitHub Actions job assumes an AWS role with no stored access key — GitHub signs a token saying "this run is for repo X on branch main," AWS trusts GitHub's OIDC issuer, and the job receives temporary STS credentials scoped to exactly that. The same federation pattern lets a GCP workload call an AWS API, or a Kubernetes pod authenticate to Vault. The deep version of this — how the token, the issuer, and the trust mapping fit together — is the subject of the authentication and cloud identity chapters.
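The "map claims in the token to a role" step is the part teams most often misconfigure, so a sketch may help. The code below assumes the token's signature and expiry have already been verified against the issuer's published keys, which is the step a real relying party must never skip; it shows only the claim-to-role mapping. TrustRule and role_for_claims are hypothetical names, the organisation, repository, and role ARN are made up, and the aud claim is treated as a single string for simplicity.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class TrustRule:
    issuer: str            # who signed the token
    audience: str          # who the token was minted for
    subject_pattern: str   # which workloads are allowed (glob)
    role: str              # what they get


def role_for_claims(claims: dict, rules: list) -> Optional[str]:
    """Return the role for already-verified claims, or None if nothing matches."""
    for rule in rules:
        if (
            claims.get("iss") == rule.issuer
            and claims.get("aud") == rule.audience
            and fnmatchcase(claims.get("sub", ""), rule.subject_pattern)
        ):
            return rule.role
    return None  # default deny


# Hypothetical trust configuration for a GitHub Actions deploy job.
RULES = [
    TrustRule(
        issuer="https://token.actions.githubusercontent.com",
        audience="sts.amazonaws.com",
        subject_pattern="repo:example-org/orders:ref:refs/heads/main",
        role="arn:aws:iam::123456789012:role/deploy-orders",
    )
]
```

The subject pattern is where scope is decided. A pattern such as repo:example-org/* would hand the same role to every repository and branch in the organisation, which is rarely what was intended.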

The order of preference. No secret (workload identity, federated tokens) beats a short-lived secret (Vault dynamic creds, STS), which beats a long-lived secret in a manager, which beats a long-lived secret in an env var, which beats a long-lived secret in git. Move every credential you can up that ladder; you cannot leak a secret that was never issued.

Sealed-secrets and sops — secrets in git, encrypted

Two tools for a different problem: how do you keep secrets in a GitOps workflow without leaking them through the git history.

Bitnami sealed-secrets is Kubernetes-specific. You install a controller in the cluster; the controller has a private key. To create a secret, you encrypt the value with the controller's public key using kubeseal, producing a SealedSecret manifest. The encrypted manifest is safe to commit to git. The controller decrypts it inside the cluster and creates the actual Secret resource.

# create a secret, encrypt with the cluster's public key
kubectl create secret generic db-pass --from-literal=password=hunter2 \
    --dry-run=client -o yaml | kubeseal -o yaml > db-pass.sealed.yaml

# db-pass.sealed.yaml is safe to commit:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-pass
spec:
  encryptedData:
    password: AgB7T9...long encrypted blob...

sops (originally a Mozilla project, now maintained under the CNCF) is more general. It encrypts only the values in a YAML/JSON file, leaving the keys readable so the file is still diff-friendly. Encryption uses whatever key backend you have: AWS KMS, GCP KMS, Azure Key Vault, age, or PGP. Files encrypted with sops are checked in; reading them requires access to the corresponding key.

The trade-off vs cloud secret managers: secrets in git are versioned, diffable, code-review-able. They live with the deployment manifest, so a rollback brings the right secret version with it. They do not require an extra API call at runtime, because the secret is materialised by the deploy process. With sops, anyone who needs to edit them needs access to the KMS key, which is its own access-control problem.

Sealed-secrets and sops shine for configuration-style secrets that change rarely and that naturally belong next to the manifest (TLS certs, signing keys, service account tokens). They are worse for fast-rotating secrets where the round-trip through git slows down rotation.

The .env-files argument

"Use environment variables" is one of the most cited and most misunderstood pieces of twelve-factor-app advice. The original Heroku argument was specifically against config files committed to the repo. Environment variables were the alternative that was always available in any deployment platform, and that Heroku itself could expose through its dashboard.

In 2026 the argument is more nuanced. Environment variables have real downsides for secrets:

They leak through child processes. Any subprocess your app spawns inherits its env. The npm install that runs at container start can see the database password. So can the cron job. So can the debugger that exec's a shell.

They appear in process listings and /proc/$pid/environ. Anyone with shell access to the container as the same user, or as root, can read them.

They are easy to dump. A single line of code (JSON.stringify(process.env), os.environ.items()) sends every secret to the logs the moment a developer adds it for debugging.

They cannot rotate without restart. Most app frameworks read env vars once at startup; changing the secret in the env requires the process to restart.
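The child-process point is easy to demonstrate, and so is the usual mitigation of passing an explicit, allow-listed environment to anything you spawn. The sketch below is illustrative only: DB_PASSWORD is an assumed variable name and the allow-list is a guess at what a child typically needs, not a recommendation for any particular platform.

```python
import os
import subprocess
import sys

# A child process that simply reports whether it can see the secret.
CHILD = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('DB_PASSWORD', '<unset>'))",
]


def run_child(env=None) -> str:
    """Run the child; env=None means 'inherit everything from the parent'."""
    result = subprocess.run(
        CHILD, env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def scrubbed_env(
    allowed=("PATH", "HOME", "LANG", "SYSTEMROOT", "LD_LIBRARY_PATH"),
) -> dict:
    """Build an environment containing only allow-listed variables."""
    return {k: v for k, v in os.environ.items() if k in allowed}
```

If the parent has DB_PASSWORD set, run_child() sees it by default, while run_child(env=scrubbed_env()) does not. An allow-list is used rather than a deny-list so that a newly added secret is excluded without anyone remembering to update the list.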

The practical middle ground: env vars for non-secret config (LOG_LEVEL, DB_HOST, FEATURE_FLAG_X), secrets fetched from a secret manager at startup and held in process memory. Some platforms (Kubernetes secrets, AWS Lambda, Cloud Run) blur the line by injecting secret-manager values as env vars; that is fine as a delivery mechanism, but the secret manager remains the source of truth, not the env var.

Files mounted from a secrets volume (Kubernetes Secret as a tmpfs mount, Vault Agent templated file) are usually preferable to env vars for related reasons: file permissions restrict who can read them, they are not copied into the environment of every child process, and Vault Agent or the cloud equivalents can refresh the file in place when the secret rotates.
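Picking up a refreshed file does require the application to re-read it. A minimal sketch of that, assuming the secret is a single value in a file at a path of your choosing (FileSecret is a hypothetical helper, and the mount path in the usage note is made up):

```python
from pathlib import Path
from typing import Optional


class FileSecret:
    """Read a secret from a mounted file, re-reading when the file changes."""

    def __init__(self, path) -> None:
        self._path = Path(path)
        self._mtime_ns: Optional[int] = None
        self._value: Optional[str] = None

    def get(self) -> str:
        # stat() follows symlinks, so an atomically swapped symlink target
        # shows up as a changed modification time.
        mtime_ns = self._path.stat().st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._value = self._path.read_text().strip()
            self._mtime_ns = mtime_ns
        return self._value
```

A caller would construct something like FileSecret("/var/run/secrets/app/db-password") once and call get() whenever it opens a new connection. The value stays in memory between changes, and a rotation is picked up without a restart.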

Rotation — the operation that matters

A secrets-management system that cannot rotate is a write-once vault. Rotation has three modes, in order of how cleanly they work:

Dual-credential rotation. The secret has two valid versions at the same time. New code is pushed using version N+1; once all workloads have pulled it, version N is revoked. Used by every cloud IAM access-key rotation flow. Works because most resource servers (databases, APIs) can have multiple credentials granting the same access.

Atomic rotation. The secret manager updates the value atomically; the application pulls "current" and gets the new one. Works for secrets the application re-fetches on every use (typically wrong — too slow) or that have a known refresh interval and a known stale-window the application tolerates. For database passwords this breaks every in-flight connection at the moment of rotation.

Coordinated rotation. Drain the app, change the secret, restart the app against the new secret. Downtime is the cost. Almost always avoidable with dual-credential rotation; sometimes used because nobody set up the rotation flow until after the leak already happened.
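The dual-credential mode is the one worth internalising, so here is its sequence as a toy model. ResourceServer and Workload are hypothetical stand-ins for a database (or API) and the services that call it; the only point of the sketch is the ordering of the three steps.

```python
import secrets


class ResourceServer:
    """Toy database/API that can hold several valid credentials at once."""

    def __init__(self) -> None:
        self._valid = set()

    def add_credential(self, cred: str) -> None:
        self._valid.add(cred)

    def revoke_credential(self, cred: str) -> None:
        self._valid.discard(cred)

    def authenticate(self, cred: str) -> bool:
        return cred in self._valid


class Workload:
    def __init__(self, name: str, credential: str) -> None:
        self.name = name
        self.credential = credential


def rotate(server: ResourceServer, workloads: list, old: str) -> str:
    """Rotate `old` out with no window in which a workload cannot log in."""
    new = secrets.token_urlsafe(32)

    # 1. Issue version N+1 while version N is still valid.
    server.add_credential(new)

    # 2. Move every workload over, verifying each one before continuing.
    for w in workloads:
        w.credential = new
        if not server.authenticate(w.credential):
            raise RuntimeError(f"{w.name} cannot authenticate with new credential")

    # 3. Only once nothing holds version N, revoke it.
    if any(w.credential == old for w in workloads):
        raise RuntimeError("a workload still holds the old credential")
    server.revoke_credential(old)
    return new
```

Between steps 1 and 3 both versions work, which is what removes the downtime. If step 2 fails partway, the old credential is still valid and the rotation can simply be retried.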

Vault's dynamic secrets sidestep most of this — every pod gets its own credential, and rotation is the natural expiry of the lease. The credential a pod is holding stays valid until its TTL; the next pod that starts gets a different credential. The "rotation operation" becomes "the lease TTL expired".

Set the rotation cadence by risk, not the calendar: weigh leak likelihood and blast radius against the chance that the rotation itself causes an outage. Database passwords used by 200 services and held in 5 different config systems should rotate slowly because every rotation risks an outage. Cloud root keys held by 2 people should rotate often because the blast radius is enormous. The 90-day rotation cadence that compliance frameworks ask for is a floor, not a strategy.

Detecting leaks — assume they happen

Even with good hygiene, secrets leak. The defense is detection.

Git pre-commit hooks. gitleaks, trufflehog, git-secrets. Run on every commit, block obvious patterns (AWS access keys, Stripe keys, JWT secrets). Cheap, prevents the most common class of leak. Add to CI as well; pre-commit can be disabled.

Repository scanning. GitHub's secret scanning runs on public repos automatically and is opt-in for private. It catches credentials that match known provider patterns and, for partnered providers, notifies the provider so the credential can be revoked automatically. AWS runs a similar scanner and attaches a quarantine policy to access keys it finds exposed publicly.

Honeytokens. Place fake credentials in places where a leak would expose them — a fake AWS key in your CI logs, a fake database password in a stack trace template. If anyone uses it, you know it leaked. The blast radius is zero because the credential never had access.

Audit logs on the secret store. Every access logged. Anomalous access patterns (a credential pulled from an IP that has never pulled it before, at a time the workload does not usually run) generate alerts. Most teams turn this on once after a scare and then ignore the alerts; the discipline of actually reading them is the rare one.
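To make the pattern-matching idea behind the pre-commit scanners concrete, here is a deliberately tiny version. It is a sketch, not a replacement for gitleaks or trufflehog: it knows one pattern (the AKIA/ASIA prefix plus 16 characters that AWS access key IDs use), and find_candidate_keys is a hypothetical helper name. The test string uses the example key ID from AWS's own documentation.

```python
import re

# AWS access key IDs: a 4-character prefix (AKIA for long-lived keys,
# ASIA for temporary STS keys) followed by 16 uppercase alphanumerics.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_candidate_keys(text: str) -> list:
    """Return (line_number, match) pairs for anything shaped like a key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "region=us-east-1\naws_access_key_id=AKIAIOSFODNN7EXAMPLE\n"
    for lineno, key in find_candidate_keys(sample):
        print(f"line {lineno}: possible AWS access key ID {key}")
```

Real scanners add many provider patterns, entropy checks, and allow-lists for known test values. The same matching logic is also what makes a honeytoken useful: a planted key ID that ever shows up in a scan or an access log is a leak signal by construction.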

When the leak happens — the playbook

You will get a leak. What to do in the first hour:

1. Revoke first, investigate second. If the credential is in a public git push, in a screenshot, in a CI log that was shared externally, revoke it before doing anything else. A leaked AWS key that has not been disabled is being used by a cryptominer within minutes.

2. Rotate every secret that was reachable from the compromised one. If the leaked key could pull from a secret manager, every secret in that manager is now suspect. The pivot is the most expensive part of a breach; rotation breaks it.

3. Check audit logs for use. Did the leaked credential get used between leak and revocation? By whom, from where, to do what. This tells you scope of damage and whether you have a follow-up incident.

4. Postmortem the leak path. How did the secret end up where it was leaked from? Was it the dev's local .env, the CI log, the container image, the error-reporter? Add a control there.

5. Update the threat model. The post-mortem ends with a line item in the threat model so the next team that builds something here knows the historical failure.

A practical default

For most teams in 2026 a working default looks like:

Use the cloud-native secret manager (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) for application credentials, fetched at startup and held in memory. Use IAM roles (IRSA on EKS, Workload Identity on GKE, Managed Identities on AKS) to grant the workload access; never ship long-lived cloud credentials to the workload itself. Use sops or sealed-secrets for GitOps configuration secrets that naturally live next to the manifest. Add gitleaks to CI. Turn on the platform's secret-scanning. Pick a rotation cadence per secret class and stick to it.

Add Vault when you need dynamic database credentials, PKI issuance for service mesh, or short-lived SSH certs at scale — and when you have someone whose job is to operate it. Vault is wonderful and not free.

Above all: the most important secrets-management practice is the one your team will actually follow. A working .env-in-AWS-Secrets-Manager setup with monthly rotation is better than a brilliant Vault deployment that nobody understands or updates.

Further reading

The Vault documentation is dense but authoritative; the "secrets engines" section is the most useful starting point. AWS, GCP, and Azure each have a "Secrets Manager best practices" guide that is worth an hour. The OWASP Secrets Management Cheat Sheet covers the patterns and pitfalls in compact form. For incident response, the Capital One breach post-mortem and the Codecov breach analyses are the two case studies every team should read.

Inside this section, the threat-modeling chapter explains how secret-related threats fit into a STRIDE pass; the planned authentication chapter covers the credentials those secrets typically issue or validate.
