Cloud Codex · AWS / 02

IAM, the hard parts.

The basic IAM model — users, policies, roles — gets a paragraph in every AWS tutorial. The model that actually runs production is bigger: STS-issued temporary credentials instead of static keys, AssumeRole as the bridge between workloads, federation for human identity, IRSA for Kubernetes, SCPs and permission boundaries as guardrails, and a policy evaluation order that explicit-denies, then explicit-allows, then implicit-denies. This page is that bigger model.


1 · What IAM actually evaluates

Every AWS API call walks the same evaluation pipeline before it's allowed to do anything. There are six policy layers, evaluated in a specific order, and the first hard rule of IAM is: an explicit DENY at any layer wins, full stop. The second rule: a request needs at least one explicit ALLOW and no DENY at any applicable layer. "No policy mentions me" defaults to deny.

Cross-account access has one extra wrinkle: both the resource policy in the target account and the identity policy in the caller's account must allow the call. Same-account calls only need one of them. This is why you can aws s3 ls s3://my-bucket with no bucket policy when the bucket is in your account, but a partner account can't see it until you add a "Principal": {"AWS": "arn:aws:iam::PARTNER:role/X"} grant to the bucket policy and the partner role has s3:ListBucket in its identity policy.
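The two rules above, plus the cross-account wrinkle, can be sketched as a toy evaluator. This is a minimal model covering only identity and resource policies; the Statement class and decide function are hypothetical names, principals and conditions are left out, and matching is case-sensitive, which real IAM action matching is not.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    effect: str    # "Allow" or "Deny"
    action: str    # may contain wildcards, e.g. "s3:*"
    resource: str  # may contain wildcards, e.g. "arn:aws:s3:::my-bucket/*"


def _matches(stmt, action, resource):
    # Simplified wildcard matching for one action and one resource ARN.
    return fnmatchcase(action, stmt.action) and fnmatchcase(resource, stmt.resource)


def _allows(statements, action, resource):
    return any(s.effect == "Allow" and _matches(s, action, resource) for s in statements)


def decide(identity, resource_policy, action, resource, cross_account):
    """Return "Allow" or "Deny" for a single request."""
    # Rule 1: an explicit Deny in any policy wins.
    for stmt in list(identity) + list(resource_policy):
        if stmt.effect == "Deny" and _matches(stmt, action, resource):
            return "Deny"
    id_ok = _allows(identity, action, resource)
    res_ok = _allows(resource_policy, action, resource)
    # Rule 2: same-account needs one Allow; cross-account needs both sides.
    allowed = (id_ok and res_ok) if cross_account else (id_ok or res_ok)
    # No matching Allow falls through to the implicit deny.
    return "Allow" if allowed else "Deny"
```

Calling decide with the same identity policy and an empty bucket policy returns "Allow" for a same-account request and "Deny" for a cross-account one, which is the s3 ls behaviour described above.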

If you see… | Look here
--- | ---
"AccessDenied" with no context | (1) SCP at the org/OU level (2) bucket / resource policy explicit deny (3) permission boundary missing the action
Works from my user, fails from the role | Trust policy on the role + identity policy on the user. The role's own identity policy is irrelevant to the AssumeRole call.
Works in console, fails from CLI | The console signs requests as your IAM/SSO principal; the CLI may be using a different profile. Run aws sts get-caller-identity first.
Works for actions, fails for resource-level | Resource ARN doesn't match. arn:aws:s3:::bucket and arn:aws:s3:::bucket/* are different resources.
Worked yesterday, denied today | (1) A new SCP was applied (2) a permission boundary was attached (3) condition key context changed (e.g., SourceVpc / SourceIp / time-of-day)
Allowed in one region, denied in another | Condition key aws:RequestedRegion somewhere in the chain, or the role's STS regional endpoint behaviour (see §12).

The IAM Policy Simulator and Access Analyzer are the diagnostic tools. The simulator walks the evaluation in detail and tells you which statement matched. Access Analyzer surfaces resource policies that grant external access, including the kinds of subtle SCP-bypass or trust-policy-too-broad findings that a manual review misses. Run both in CI on every policy change.
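The bucket-versus-object ARN mismatch in the table is easy to reproduce offline. The sketch below assumes simple wildcard matching as a stand-in for IAM's resource matching; resource_matches is a hypothetical helper, not an AWS API.

```python
from fnmatch import fnmatchcase

BUCKET_ARN = "arn:aws:s3:::my-bucket"      # what s3:ListBucket is checked against
OBJECTS_ARN = "arn:aws:s3:::my-bucket/*"   # what s3:GetObject is checked against


def resource_matches(policy_resource, request_arn):
    """True if the request's ARN matches the policy's Resource pattern."""
    return fnmatchcase(request_arn, policy_resource)


# A policy that lists only the object pattern does not cover the bucket itself.
list_bucket_covered = resource_matches(OBJECTS_ARN, BUCKET_ARN)
# The same pattern does cover any object key, including nested prefixes.
get_object_covered = resource_matches(OBJECTS_ARN, "arn:aws:s3:::my-bucket/logs/2024/a.gz")
print(list_bucket_covered, get_object_covered)
```

A policy meant to allow both listing and reading usually needs both ARNs in its Resource list.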

2 · The four kinds of principal

A principal is anything that makes an authenticated API call to AWS. There are four practical kinds, and modern AWS estates rely mostly on the last three:

Principal | How it authenticates | When to use
--- | --- | ---
IAM user | Long-lived access keys | Avoid. CI/CD tokens are the last remaining excuse, and most CI now supports OIDC instead.
Federated identity | SAML / OIDC token from an external IdP | Humans. Workforce SSO via Identity Center → Okta / Entra ID / Google.
Service role | AWS service is the trust principal | EC2 instance profile, Lambda execution role, ECS task role.
OIDC-federated workload | JWT from external OIDC provider | GitHub Actions, GitLab, Buildkite, EKS pods (IRSA), Pod Identity.

The good shape: no long-lived keys anywhere. Humans federate through Identity Center; CI federates through GitHub OIDC; EC2 / Lambda / ECS use instance profiles or execution roles; Kubernetes pods use IRSA. The only long-lived secret is the root credentials of the management account, locked in a safe.

3 · STS and AssumeRole — the bridge

AWS Security Token Service (STS) is the service that issues temporary credentials — an access key, secret key, and session token that expire (typically 1 hour, configurable up to 12). Every modern IAM pattern flows through STS:

  • sts:AssumeRole — caller has IAM credentials, wants to act as a role in this account or another.
  • sts:AssumeRoleWithSAML — caller has a SAML assertion from a corporate IdP.
  • sts:AssumeRoleWithWebIdentity — caller has a JWT from an OIDC provider (the GitHub Actions case, also IRSA).
  • sts:GetSessionToken — re-issue temporary credentials for an MFA-protected user.

A role has two policies attached. The trust policy answers "who can assume me?" — it lists the AWS principals or federated identities allowed to call AssumeRole on this role. The identity policy answers "what can the assumer do?" — the actual permissions granted once the role is assumed.

The most-confused thing about IAM: the trust policy and the identity policy live on the same role but mean different things. The trust policy is on the door. The identity policy is what you're allowed to do once inside. Both must allow the call.
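The door-versus-inside distinction shows up in the shape of the two documents. The sketch below builds both as plain dictionaries; the function names, account ID, and bucket name are illustrative assumptions.

```python
import json


def trust_policy(principal_arn):
    """Who may assume the role: has a Principal, no Resource."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": principal_arn},
            "Action": "sts:AssumeRole",
        }],
    }


def permissions_policy(bucket):
    """What the assumed session may do: has a Resource, no Principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }


# Hypothetical values, for illustration only.
print(json.dumps(trust_policy("arn:aws:iam::111122223333:role/caller"), indent=2))
print(json.dumps(permissions_policy("example-bucket"), indent=2))
```

The trust document is what the lab in §10 passes as the assume-role policy document; the permissions document is what gets attached to the role afterwards.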

4 · How AssumeRole actually works

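A successful AssumeRole is a four-actor dance: the caller, STS, the target role's trust policy, and (eventually) the AWS service receiving the API call from the assumed session. Walking through it once removes a lot of mystery from "why does my credential expire mid-request". The caller signs an sts:AssumeRole request with its own credentials. STS checks the target role's trust policy and, for cross-account calls, that the caller's own identity policy permits sts:AssumeRole on the target. STS then returns a temporary access key, secret key, and session token. Finally, the caller signs its real API calls with that triple, and the receiving service evaluates them against the role's identity policy.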

Three operational consequences fall out of this shape. First, the credentials returned are not the caller's: they're a fresh short-lived (15-minute to 12-hour) triple. Anyone holding them can act as the role until they expire, so scrub any logs or dumps that accidentally include them and treat the session as compromised. Second, the session principal is recorded in CloudTrail as arn:aws:sts::ACCT:assumed-role/ROLE/SESSION-NAME, so naming sessions matters for auditing. Third, SDK credential providers that assume the role on your behalf cache the session credentials and refresh them shortly before expiry (the exact window varies by SDK), so the typical pattern (one role assumed in a Lambda, used for the function's whole lifetime) does not need explicit refresh logic in your code.
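Because the session name is the only part of the principal the caller chooses, audit tooling often needs to pull it back out of the ARN. A minimal parsing sketch follows; parse_session_arn and the AssumedRoleSession type are hypothetical names.

```python
from typing import NamedTuple


class AssumedRoleSession(NamedTuple):
    account_id: str
    role_name: str
    session_name: str


def parse_session_arn(arn):
    """Split arn:aws:sts::ACCT:assumed-role/ROLE/SESSION-NAME into its parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sts":
        raise ValueError(f"not an STS ARN: {arn!r}")
    kind, _, rest = parts[5].partition("/")
    role_name, sep, session_name = rest.partition("/")
    if kind != "assumed-role" or not sep or not role_name or not session_name:
        raise ValueError(f"not an assumed-role session ARN: {arn!r}")
    return AssumedRoleSession(parts[4], role_name, session_name)


session = parse_session_arn("arn:aws:sts::123456789012:assumed-role/LabReadOnly/lab-1")
print(session.role_name, session.session_name)
```

A session name that encodes the build ID or the engineer's username makes the resulting CloudTrail records much easier to attribute.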

Role chaining. If session credentials AssumeRole into yet another role, the new session is limited to one hour regardless of the target role's MaxSessionDuration, and a request for a longer duration fails. This is a hard AWS limit to prevent indefinite chaining. Long-running cross-account agents (CI runners, data-platform syncs) that need more than 1 hour must obtain the target role's session directly, for example an IAM user calling AssumeRole or a workload exchanging its OIDC token for the target role via AssumeRoleWithWebIdentity, not from another assumed-role session.
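The duration rules can be written down as a small pre-flight check. This is a sketch under the assumption that a chained request above one hour is rejected rather than silently shortened; the function names are hypothetical and nothing here calls AWS.

```python
ROLE_CHAINING_MAX_SECONDS = 3600   # one hour, the chaining limit described above
MIN_SESSION_SECONDS = 900          # 15 minutes, the shortest session STS issues


def session_limit(role_max_session_duration, caller_is_role_session):
    """Longest duration the caller can request for the target role."""
    if caller_is_role_session:
        return min(role_max_session_duration, ROLE_CHAINING_MAX_SECONDS)
    return role_max_session_duration


def check_duration(requested, role_max_session_duration, caller_is_role_session):
    """Return the requested duration, or raise if STS would refuse it."""
    limit = session_limit(role_max_session_duration, caller_is_role_session)
    if requested < MIN_SESSION_SECONDS:
        raise ValueError("duration below the 15-minute minimum")
    if requested > limit:
        raise ValueError(f"duration exceeds the {limit}-second limit for this caller")
    return requested
```

With a 12-hour MaxSessionDuration, an IAM user may request 43200 seconds while a caller that is itself an assumed-role session is limited to 3600.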

5 · Workload identity — IRSA, Pod Identity, GitHub OIDC

The patterns below, from the long-standing compute roles to the three newer ones in this section's title, all do the same thing: let a workload assume an IAM role without being given a long-lived secret.

Pattern | Where it runs | How it works
--- | --- | ---
EC2 instance profile | EC2, ECS-on-EC2 | EC2 metadata service (IMDSv2) at 169.254.169.254 returns role credentials. The SDK auto-discovers.
Lambda execution role | Lambda functions | Lambda runtime injects creds as env vars (AWS_ACCESS_KEY_ID, etc.). Auto-discovered.
ECS task role | Fargate / ECS tasks | Each task gets a metadata endpoint that returns role creds.
IRSA (EKS) | EKS pods (mature pattern) | OIDC provider in front of EKS; pod's service-account token gets exchanged via AssumeRoleWithWebIdentity.
EKS Pod Identity | EKS pods (launched late 2023) | Like IRSA, but configured via EKS API instead of trust-policy JSON. Recommended for new clusters.
GitHub OIDC | GitHub Actions | GH issues a JWT per workflow; AWS role trusts token.actions.githubusercontent.com; the action calls AssumeRoleWithWebIdentity.

In each case the result is the same: short-lived STS credentials, no static AWS key checked into the workload's config, and a fine-grained trust policy that lets you say "only the my-service service account in the prod namespace can assume the my-service-role in account 1234." That last sentence is what production-grade IAM actually looks like.
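For the EC2 instance profile row, the IMDSv2 exchange is two HTTP requests: a PUT that obtains a session token, then a GET that presents it. The sketch below only constructs the requests and sends nothing; the helper names are hypothetical, and in practice the SDK performs this exchange for you.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254"


def token_request(ttl_seconds=21600):
    """Step 1: PUT for a session token (the step a simple SSRF GET cannot make)."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token, role_name):
    """Step 2: GET the role credentials, presenting the token from step 1."""
    return urllib.request.Request(
        f"{IMDS_BASE}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


# On an EC2 instance these would be passed to urllib.request.urlopen(...).
step1 = token_request()
print(step1.get_method(), step1.full_url)
```

The requirement for a PUT with a custom header is what separates IMDSv2 from IMDSv1, where a single unauthenticated GET returned the credentials.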

6 · IRSA — the OIDC token flow, end to end

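IRSA (IAM Roles for Service Accounts) is the original Kubernetes workload-identity pattern. It's worth tracing once because the same shape recurs in GitHub Actions, GitLab, Buildkite, and now EKS Pod Identity. The pod starts with a projected ServiceAccount token mounted as a file; the AWS SDK reads it and calls AssumeRoleWithWebIdentity; STS validates the token against the cluster's OIDC issuer and checks the role's trust policy conditions; the SDK receives short-lived role credentials.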

Four things make this pattern bulletproof for production. First, the pod never holds a long-lived AWS secret, only a projected ServiceAccount token that kubelet rotates before it expires (the Kubernetes default lifetime for projected tokens is 1 hour; the EKS webhook may request a longer one). Second, the trust policy on the AWS role specifies both the OIDC issuer URL and a StringEquals condition on system:serviceaccount:<namespace>:<sa-name>, so pods can only assume the role if their Kubernetes namespace and SA name match. Third, the OIDC issuer is per-cluster, so a stolen JWT from cluster A can't assume roles trusted to cluster B's issuer. Fourth, the AWS SDK auto-discovers the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables injected by the EKS pod-identity webhook; application code doesn't change.
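The trust policy described in the second point has a fixed shape that is easy to generate rather than hand-type. A minimal sketch follows; irsa_trust_policy is a hypothetical helper, and the issuer is passed without its https:// prefix.

```python
def irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Trust policy letting one ServiceAccount in one namespace assume the role.

    oidc_issuer is the cluster issuer without "https://", in the form
    oidc.eks.<region>.amazonaws.com/id/XXXX.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # The condition keys are prefixed with the issuer, not with sts.
                    f"{oidc_issuer}:aud": "sts.amazonaws.com",
                    f"{oidc_issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                }
            },
        }],
    }
```

Generating the document from the issuer string keeps the key prefix and the provider ARN in step, which avoids the aud/sub typo described in §12.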

EKS Pod Identity (launched late 2023) gives the same outcome with less plumbing: no OIDC provider to create, no trust-policy JSON to author. EKS holds the mapping (cluster + namespace + SA → role) as an API resource. Pod Identity is the recommended pattern for new clusters; IRSA remains supported and is fine to keep on existing ones.

Identity mechanism | Where it runs | Best for | Trap
--- | --- | --- | ---
IAM user + access key | Anywhere, but ideally nowhere | One-off CI before OIDC was supported; legacy scripts | Static secret that leaks. Rotate aggressively, or migrate to OIDC.
IAM role + AssumeRole | Cross-account from any AWS principal | Centralised platform / observability roles that read from many workload accounts | Role chaining = 1-hour cap on the new session
EC2 instance profile | EC2 / ECS-on-EC2 hosts | Anything that runs on a known instance | IMDSv2 must be required (HttpTokens=required); see Capital One, §11
Lambda execution role | Lambda functions | Default for serverless workloads | One role per function; concurrency means many sessions at once
IRSA | EKS pods (any version) | Mature clusters; per-namespace fine-grained mapping | Trust policy typos in aud / sub claims fail silently
EKS Pod Identity | EKS pods (1.27+) | New clusters; less YAML, no OIDC provider to manage | Requires the Pod Identity agent DaemonSet
Identity Center permission set | Humans accessing AWS | Workforce SSO with Okta / Entra / Google | Session times out (default 8h); CLI uses aws sso login
GitHub Actions OIDC | CI/CD from GitHub | All deployment automation | Trust policy must pin repo:<org>/<repo>:ref:refs/heads/<branch> or any repo can assume

7 · Guardrails — SCPs, permission boundaries, session policies

Beyond identity and resource policies, three other policy types exist to cap what a principal can do, regardless of what its identity policies grant. They never grant permissions, only restrict them.

  • SCPs (Service Control Policies) live on an AWS Organizations OU or account. Common SCP: "deny iam:CreateUser for everyone in this OU" so prod can't have any human-shaped IAM users at all.
  • Permission boundaries attach to an individual IAM role or user. "This role can have any policy attached but its effective permissions are capped at this set" — a delegation pattern: dev teams self-serve role creation within a guardrail.
  • Session policies are passed at AssumeRole time as an extra restriction on the assumed session. CI templates use this: a base "build" role attached, then per-build a session policy that scopes it to one S3 prefix.

The evaluation order: an explicit deny in any layer is checked first and ends the evaluation. After that, the request is allowed only if every applicable layer allows it. SCPs come first: if the SCPs don't allow the action, the request is denied. Then the permission boundary: if it doesn't allow, denied. Then the identity or resource policy must allow (either one within a single account, both if cross-account). Then the session policy, if present, must also allow. The mental model: each layer can only narrow what's allowed, never widen it.
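The "each layer can only narrow" model amounts to set intersection. The sketch below treats every layer as a plain set of allowed action names, which is a deliberate simplification (real policies carry resources, conditions, and wildcards); effective_actions is a hypothetical helper.

```python
def effective_actions(identity_allows, scp=None, boundary=None, session=None,
                      explicit_denies=()):
    """Intersect the identity policy's grants with every guardrail that is present.

    Each argument is an iterable of action names; None means the layer is absent.
    """
    allowed = set(identity_allows)
    for guardrail in (scp, boundary, session):
        if guardrail is not None:
            allowed &= set(guardrail)   # a guardrail can remove, never add
    return allowed - set(explicit_denies)  # an explicit deny removes regardless


identity = {"s3:GetObject", "s3:PutObject", "iam:CreateUser"}
scp = {"s3:GetObject", "s3:PutObject", "ec2:RunInstances"}
boundary = {"s3:GetObject", "s3:PutObject"}
session = {"s3:GetObject"}
print(sorted(effective_actions(identity, scp, boundary, session)))
```

Note that ec2:RunInstances never becomes available: the SCP lists it, but no identity policy grants it, and guardrails grant nothing on their own.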

8 · Condition keys — IAM beyond "service-action-resource"

Every IAM policy statement can include a Condition block that adds context-aware checks. The most useful ones for production:

Condition key | What it constrains | Example
--- | --- | ---
aws:SourceIp | Caller's source IP | "Only allow from corporate egress IP range."
aws:SourceVpc / aws:SourceVpce | VPC / VPC endpoint of the caller | "S3 bucket only accessible via this VPC endpoint." Stops data exfil.
aws:PrincipalOrgID | Caller's AWS org | "Only allow if the caller is in my organization."
aws:MultiFactorAuthPresent | MFA was used in the session | "Require MFA to delete IAM users."
aws:RequestedRegion | The region the call is destined for | "Deny all writes outside eu-west-2." Data residency.
aws:ResourceTag/<key> | A tag on the target resource | "Only let this role write to S3 buckets tagged env=prod."
aws:PrincipalTag/<key> | A tag on the caller | ABAC: "user with tag team=payments can access resources with the same tag."
kms:ViaService | Which AWS service is calling KMS on the caller's behalf | "Only S3 can use this KMS key." Prevents direct decrypt by users.

ABAC vs RBAC. Tag-based access control (ABAC) using PrincipalTag + ResourceTag lets you write one policy that scales: "any role tagged with team X can access any resource tagged team X." RBAC requires a new role per team. ABAC is harder to set up but scales better past a couple of dozen teams. AWS Identity Center supports passing attributes from the IdP into the session as principal tags.
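The one-policy-for-all-teams idea rests on a policy variable that substitutes the caller's tag into the condition. The sketch below builds such a statement and models the tag comparison locally; the function names and the EC2 actions are illustrative assumptions.

```python
def abac_statement(tag_key):
    """One statement that matches the caller's tag against the resource's tag."""
    return {
        "Effect": "Allow",
        "Action": ["ec2:StartInstances", "ec2:StopInstances"],
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                # ${aws:PrincipalTag/...} is resolved per request from the caller's tags.
                f"aws:ResourceTag/{tag_key}": f"${{aws:PrincipalTag/{tag_key}}}"
            }
        },
    }


def abac_allows(principal_tags, resource_tags, tag_key="team"):
    """Local model of the condition: both sides carry the tag with equal values."""
    value = principal_tags.get(tag_key)
    return value is not None and resource_tags.get(tag_key) == value


print(abac_statement("team")["Condition"])
print(abac_allows({"team": "payments"}, {"team": "payments"}))
```

Adding a new team then means tagging its principals and resources, not writing another policy.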

9 · Workforce identity — Identity Center

AWS IAM Identity Center (formerly AWS SSO) is the modern way humans access AWS:

  1. Identity Center is enabled at the AWS Organizations management account.
  2. An identity source is connected — Okta, Entra ID (Azure AD), Google Workspace, or Identity Center's own directory.
  3. Permission sets are defined — each permission set is "a name + a set of managed/inline IAM policies."
  4. Users / groups are assigned permission sets in particular AWS accounts.
  5. Users log in to the Identity Center portal, click the account / role they need, and get an STS session in the browser or via aws sso login.

Behind the scenes Identity Center creates an IAM role per permission set per account, with the permission set's policies attached and a trust policy pointing back at Identity Center's identity provider. The user-facing experience is "pick an account and a role." The underlying machinery is the same federated AssumeRole story, in this case via a SAML provider that Identity Center registers in each account.

10 · Build it yourself — cross-account AssumeRole lab

The fastest way to internalise AssumeRole is to do it. This lab uses one account but makes the pattern explicit; replace the account ID with a second sandbox account if you have one.

  1. Note your account ID.
    ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
    echo "Account: $ACCOUNT_ID"
  2. Create a target role with a trust policy that lets the current caller assume it.
    CURRENT_ARN=$(aws sts get-caller-identity --query Arn --output text)
    # No MFA condition here: a Bool test on aws:MultiFactorAuthPresent fails
    # when the key is absent, which is the case for plain access-key callers.
    cat > /tmp/trust.json <<EOF
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "AWS": "$CURRENT_ARN" },
        "Action": "sts:AssumeRole"
      }]
    }
    EOF
    aws iam create-role --role-name LabReadOnly \
      --assume-role-policy-document file:///tmp/trust.json
  3. Attach a read-only managed policy.
    aws iam attach-role-policy --role-name LabReadOnly \
      --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
  4. Assume the role and inspect the returned credentials.
    aws sts assume-role \
      --role-arn arn:aws:iam::${ACCOUNT_ID}:role/LabReadOnly \
      --role-session-name lab-1 \
      --duration-seconds 900
    # Returns an AccessKeyId / SecretAccessKey / SessionToken triple, valid 15 minutes.
  5. Use the temporary creds.
    export AWS_ACCESS_KEY_ID=<from above>
    export AWS_SECRET_ACCESS_KEY=<from above>
    export AWS_SESSION_TOKEN=<from above>
    aws sts get-caller-identity         # Should show the assumed role's session ARN, not your user.
    aws s3 ls                           # Allowed (ReadOnly).
    aws s3 mb s3://test-bucket-$RANDOM  # Denied: ReadOnly doesn't include CreateBucket.
  6. Reset and tear down.
    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
    aws iam detach-role-policy --role-name LabReadOnly \
      --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
    aws iam delete-role --role-name LabReadOnly

Variation: replace the trust policy Principal with "Federated": "arn:aws:iam::ACCT:oidc-provider/token.actions.githubusercontent.com", change the Action to sts:AssumeRoleWithWebIdentity, and add a condition matching token.actions.githubusercontent.com:sub to the repo you want. That's the GitHub Actions OIDC pattern, end-to-end.
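The variation can be sketched as a policy builder plus a local check of which sub claims a given pattern would admit. The helper names are hypothetical, the aud value assumes the audience commonly used with AWS, and the wildcard check only approximates how a StringLike condition behaves.

```python
from fnmatch import fnmatchcase

GITHUB_ISSUER = "token.actions.githubusercontent.com"


def github_trust_policy(account_id, org, repo, branch="main"):
    """Trust policy pinned to one repository and one branch."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_ISSUER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{GITHUB_ISSUER}:aud": "sts.amazonaws.com",
                    f"{GITHUB_ISSUER}:sub": f"repo:{org}/{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


def sub_matches(pattern, sub_claim):
    """Approximate a StringLike test of the token's sub claim against a pattern."""
    return fnmatchcase(sub_claim, pattern)


pinned = "repo:acme/app:ref:refs/heads/main"
print(sub_matches(pinned, "repo:acme/app:ref:refs/heads/feature"))
print(sub_matches("repo:acme/*", "repo:acme/other:ref:refs/heads/feature"))
```

The second check shows why a broad wildcard is risky: an organisation-wide pattern admits every repository and branch under that organisation.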

11 · Real-world case studies

Three publicly-documented stories make the IAM model concrete — one cautionary, two aspirational.

Capital One (2019): over-permissive IAM + SSRF. A misconfigured ModSecurity WAF on an EC2 instance was exploited via server-side request forgery to query the EC2 metadata endpoint (IMDSv1, before the metadata service required session tokens), which returned the instance's IAM role credentials. The role had broad S3 permissions, and the attacker used those creds to enumerate and exfiltrate ~106 million customer records. The chain of failures is documented in Krebs's writeup and in DOJ court filings. The lessons that landed across the industry: require IMDSv2 (which needs a session token before returning creds, breaking SSRF), scope EC2 roles to least privilege, never give a web-tier role wholesale S3 list/read across all buckets, and use VPC endpoint policies (aws:SourceVpce) so the buckets can only be read from inside the expected VPC. AWS introduced IMDSv2 in November 2019, shortly after this incident, and has since moved towards making it the default for new instances.

Mozilla: least-privilege as code. Mozilla publishes its IAM least-privilege guidelines and the tooling that enforces them. The pattern: every IAM role in their AWS estate is generated from a code repository, with a permission boundary attached automatically by the platform; policy changes are reviewed by humans and re-validated by Access Analyzer on every PR. The argument that travels well: in any sufficiently large org, no IAM role survives manual review unless creating the role itself goes through code review. The Mozilla docs walk through SAR-style (Service / Action / Resource) policies derived from CloudTrail-observed usage, which is the same idea behind AWS's own IAM Access Advisor and the "Generate policy" feature in the console.

Netflix: ConsoleMe and shared-account workflows. Netflix open-sourced ConsoleMe, a self-service portal that lets engineers request a temporary AssumeRole into any of the hundreds of AWS accounts Netflix operates. Netflix describes a one-click "raise to this role for 1 hour, here's why" workflow that issues a short-lived session, records the justification in an audit log, and revokes the session on time-out. The model that survived: humans don't get long-lived credentials, but they get a frictionless on-demand path to the access they need, with full provenance. Multiple companies have since built variants. Netflix's earlier tooling, Aardvark and Repokid, automatically reduces unused IAM permissions in production, using each role's Access Advisor data to remove permissions nobody has used in 90 days.

The through-line: in 2026 production AWS, the IAM that works is "no long-lived secrets, federation everywhere, automated guardrails, and reviewable provenance for every elevated session."

12 · What breaks

  • "User can't assume role" — almost always one of: (a) trust policy doesn't list them, (b) their identity policy doesn't have sts:AssumeRole on the target, (c) an SCP blocks it, (d) MFA condition requires MFA they don't have, (e) the role has an external-ID condition they're not passing.
  • "Access denied" with no clue. Use the IAM Policy Simulator or CloudTrail — the denied call shows up there with the matched-deny statement. AWS deliberately doesn't tell the caller why (information-leak risk), so the operator has to look on the AWS side.
  • Silent permission-boundary cut. When platform teams attach a boundary to a role, any action the role's identity policy grants but the boundary doesn't silently stops working. The role still exists, the identity policy still lists the action, the call still returns AccessDenied with no indication that a boundary is in play. Always check aws iam get-role --role-name X for PermissionsBoundary.
  • Trust-policy typos in aud/sub. IRSA's OIDC trust policy keys are <oidc-issuer>:aud and <oidc-issuer>:sub. A common copy-paste error puts sts.amazonaws.com:aud instead of oidc.eks.<region>.amazonaws.com/id/XXXX:aud; the policy looks correct but matches nothing. GitHub Actions has the same trap with token.actions.githubusercontent.com:sub — the value must be exactly repo:<org>/<repo>:ref:refs/heads/main or a wildcard you trust.
  • STS regional endpoints. Calls to global sts.amazonaws.com always route to us-east-1 — fast for North America, painful from Asia, and unavailable during a us-east-1 control-plane event. Use regional STS endpoints (sts.<region>.amazonaws.com) in the SDK config (AWS_STS_REGIONAL_ENDPOINTS=regional) so AssumeRole stays in-region. AWS's own SDKs default to regional in newer versions; older SDKs and many CI tools still default to global.
  • The 1-hour role-chaining limit. If your code path AssumeRoles from a session that was itself produced by AssumeRole, the new session's TTL is limited to 1 hour regardless of the target role's MaxSessionDuration, and a larger --duration-seconds is rejected. Long-running agents either re-authenticate every hour or AssumeRole directly from a non-chained principal.
  • Static keys leaked. If you see AKIA... in a commit, rotate immediately, scan the public web for the key (it's almost certainly already scraped), then audit CloudTrail for that key's API activity. AWS scans GitHub and emails you when it spots a key, but the bots get there first.
  • "My Lambda can't access S3 even though I gave it permission." Confirm (1) the Lambda's execution role has the S3 permission, (2) the S3 bucket policy doesn't deny the role, (3) the bucket isn't behind a KMS key the role can't decrypt with, (4) you're calling the right region.
  • IAM eventual consistency. Policy changes propagate over seconds, occasionally tens of seconds. CI scripts that create a role and immediately call it sometimes fail; retry with backoff.
  • Identity Center session expiry mid-CLI-command. The default Identity Center session is 8 hours; long-running CLI scripts can start seeing ExpiredToken partway through a run. Re-run aws sso login and resume, or wrap the script in a retry that detects the error and re-authenticates.
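The "retry with backoff" advice for IAM eventual consistency can be sketched as a small generic wrapper. The names retry_with_backoff and is_retryable are hypothetical, and deciding which errors count as retryable (for example an access-denied error immediately after creating a role) is left to the caller.

```python
import time


def retry_with_backoff(call, is_retryable, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in for "assume the role that was just created":
# it fails twice, as if the new policy had not propagated yet, then succeeds.
state = {"calls": 0}
delays = []


def flaky_assume():
    state["calls"] += 1
    if state["calls"] < 3:
        raise PermissionError("not propagated yet")
    return "session"


result = retry_with_backoff(
    flaky_assume,
    is_retryable=lambda exc: isinstance(exc, PermissionError),
    sleep=delays.append,  # record the waits instead of actually sleeping
)
print(result, delays)
```

Passing sleep as a parameter keeps the wrapper testable; in a real script the default time.sleep applies.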
