Cloud Codex · AWS / 02

IAM, the hard parts.

The basic IAM model — users, policies, roles — gets a paragraph in every AWS tutorial. The model that actually runs production is bigger: STS-issued temporary credentials instead of static keys, AssumeRole as the bridge between workloads, federation for human identity, IRSA for Kubernetes, SCPs and permission boundaries as guardrails, and a policy evaluation order that explicit-denies, then explicit-allows, then implicit-denies. This page is that bigger model.

1 · What IAM actually evaluates

Every AWS API call walks the same evaluation pipeline before it's allowed to do anything. There are six policy layers, evaluated in a specific order, and the first hard rule of IAM is: an explicit DENY at any layer wins, full stop. The second rule: a request needs at least one explicit ALLOW and no DENY at any applicable layer. "No policy mentions me" defaults to deny.

Cross-account access has one extra wrinkle: both the resource policy in the target account and the identity policy in the caller's account must allow the call. Same-account calls only need one of them. This is why you can aws s3 ls s3://my-bucket with no bucket policy when the bucket is in your account, but a partner account can't see it until you add a "Principal": "arn:aws:iam::PARTNER:role/X" grant to the bucket policy and the partner role has s3:ListBucket in its identity policy.
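As a rough illustration of the two halves of that cross-account grant, the sketch below builds both documents as Python dicts. The function name, bucket name, and the 111122223333 account ID are hypothetical placeholders, not values from this page.

```python
import json


def cross_account_list_policies(bucket: str, partner_role_arn: str):
    """Return (bucket_policy, partner_identity_policy) for s3:ListBucket."""
    # s3:ListBucket targets the bucket itself, not the objects under it.
    bucket_arn = f"arn:aws:s3:::{bucket}"
    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowPartnerList",
            "Effect": "Allow",
            "Principal": {"AWS": partner_role_arn},  # resource side names the caller
            "Action": "s3:ListBucket",
            "Resource": bucket_arn,
        }],
    }
    identity_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",  # identity side has no Principal element
            "Action": "s3:ListBucket",
            "Resource": bucket_arn,
        }],
    }
    return bucket_policy, identity_policy


# Hypothetical partner role ARN, for illustration only.
bp, ip = cross_account_list_policies("my-bucket", "arn:aws:iam::111122223333:role/X")
print(json.dumps(bp, indent=2))
print(json.dumps(ip, indent=2))
```

The first document would be attached to the bucket in the owning account, the second to the partner's role in the partner account; removing either one should leave the cross-account call denied.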

If you see…	Look here
"AccessDenied" with no context	(1) SCP at the org/OU level (2) bucket / resource policy explicit deny (3) permission boundary missing the action
Works from my user, fails from the role	Trust policy on the role + identity policy on the user. The role's own identity policy is irrelevant to the AssumeRole call.
Works in console, fails from CLI	The console signs requests as your IAM/SSO principal; the CLI may be using a different profile. `aws sts get-caller-identity` first.
Works for actions, fails for resource-level	Resource ARN doesn't match. `arn:aws:s3:::bucket` vs `arn:aws:s3:::bucket/*` are different resources.
Worked yesterday, denied today	(1) A new SCP was applied (2) permission boundary was attached (3) condition key context changed (e.g., SourceVpc / SourceIp / time-of-day)
Allowed in one region, denied in another	Condition key `aws:RequestedRegion` somewhere in the chain, or the role's STS regional endpoint behaviour (see §12).
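The bucket-versus-object ARN row in the table can be checked with a small wildcard matcher. This is a simplified sketch: it handles only `*` and `?`, and ignores policy variables and other details of real IAM matching. The function name is hypothetical.

```python
import re


def arn_matches(pattern: str, arn: str) -> bool:
    """Match an ARN against a policy Resource pattern using * and ? wildcards."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, arn, flags=re.DOTALL) is not None


# The object pattern does not cover the bucket itself, and vice versa.
print(arn_matches("arn:aws:s3:::bucket/*", "arn:aws:s3:::bucket"))
print(arn_matches("arn:aws:s3:::bucket", "arn:aws:s3:::bucket/report.csv"))
print(arn_matches("arn:aws:s3:::bucket/*", "arn:aws:s3:::bucket/a/b.txt"))
```

A policy that needs both listing and object reads therefore usually lists both ARNs in its Resource element.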

The IAM Policy Simulator and Access Analyzer are the diagnostic tools. The simulator walks the evaluation in detail and tells you which statement matched. Access Analyzer surfaces resource policies that grant external access — including the kinds of subtle SCP-bypass or trust-policy-too-broad findings that a manual review misses. Run both in CI on every policy change.

2 · The four kinds of principal

A principal is anything that makes an authenticated API call to AWS. There are four practical kinds, and modern AWS estates rely almost entirely on the last three:

Principal	How it authenticates	When to use
IAM user	Long-lived access keys	Avoid. CI/CD tokens are the last remaining excuse, and most CI now supports OIDC instead.
Federated identity	SAML / OIDC token from an external IdP	Humans. Workforce SSO via Identity Center → Okta / Entra ID / Google.
Service role	AWS service is the trusted principal	EC2 instance profile, Lambda execution role, ECS task role.
OIDC-federated workload	JWT from external OIDC provider	GitHub Actions, GitLab, Buildkite, EKS pods (IRSA), Pod Identity.

The good shape: no long-lived keys anywhere. Humans federate through Identity Center; CI federates through GitHub OIDC; EC2 / Lambda / ECS use instance profiles or execution roles; Kubernetes pods use IRSA. The only long-lived secret is the root credentials of the management account, locked in a safe.

3 · STS and AssumeRole — the bridge

AWS Security Token Service (STS) is the service that issues temporary credentials — an access key, secret key, and session token that expire (typically 1 hour, configurable up to 12). Every modern IAM pattern flows through STS:

sts:AssumeRole — caller has IAM credentials, wants to act as a role in this account or another.
sts:AssumeRoleWithSAML — caller has a SAML assertion from a corporate IdP.
sts:AssumeRoleWithWebIdentity — caller has a JWT from an OIDC provider (the GitHub Actions case, also IRSA).
sts:GetSessionToken — re-issue temporary credentials for an MFA-protected user.

A role has two policies attached. The trust policy answers "who can assume me?" — it lists the AWS principals or federated identities allowed to call AssumeRole on this role. The identity policy answers "what can the assumer do?" — the actual permissions granted once the role is assumed.

The most-confused thing about IAM: the trust policy and the identity policy live on the same role but mean different things. The trust policy is on the door. The identity policy is what you're allowed to do once inside. Both must allow the call.

4 · How AssumeRole actually works

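A successful AssumeRole is a four-actor dance: the caller, STS, the target role's trust policy, and (eventually) the AWS service receiving the API call from the assumed session. In outline: the caller signs an sts:AssumeRole request with its own credentials; STS checks that the caller is permitted to make the call and that the role's trust policy names the caller; STS returns a temporary credential triple; the caller then signs later API calls with that triple, and the receiving service evaluates them against the role's permissions. Walking through it once removes a lot of mystery from "why does my credential expire mid-request".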

Three operational consequences fall out of this shape. First, the credentials returned are not the caller's; they're a fresh short-lived (15-minute to 12-hour) triple. Anyone holding them can act as the role until they expire, so treat any logs or dumps that accidentally include them as leaked secrets and purge them. Second, the session principal is recorded in CloudTrail as arn:aws:sts::ACCT:assumed-role/ROLE/SESSION-NAME, so naming sessions matters for auditing. Third, when the role is configured through the SDK's credential provider, the SDK caches the session credentials and refreshes them shortly before expiry (the exact window varies by SDK, typically a few minutes), so the typical pattern (one role assumed in a Lambda, used for the function's whole lifetime) does not need explicit refresh logic in your code.
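For the auditing point, a minimal sketch of pulling the role and session name out of the session ARN format quoted above. The class and function names are hypothetical, and the account ID in the example is a placeholder.

```python
from typing import NamedTuple


class AssumedRoleSession(NamedTuple):
    account_id: str
    role_name: str
    session_name: str


def parse_assumed_role_arn(arn: str) -> AssumedRoleSession:
    """Split arn:aws:sts::ACCT:assumed-role/ROLE/SESSION-NAME into its parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sts":
        raise ValueError(f"not an STS ARN: {arn!r}")
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "assumed-role":
        raise ValueError(f"not an assumed-role session ARN: {arn!r}")
    return AssumedRoleSession(parts[4], resource[1], resource[2])


# Placeholder account ID; the role and session names are illustrative.
print(parse_assumed_role_arn("arn:aws:sts::111122223333:assumed-role/LabReadOnly/lab-1"))
```

Grouping CloudTrail events by `session_name` only helps if callers pick meaningful names, such as a build ID or a username, rather than a constant.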

Role chaining. If session credentials AssumeRole into yet another role, the new session is limited to one hour regardless of the target role's MaxSessionDuration, a hard AWS limit to prevent indefinite chaining. Long-running cross-account agents (CI runners, data-platform syncs) that need > 1 hour must reach the target role without a second hop: call AssumeRole from a non-session principal (an IAM user), or make the role the workload receives on its first hop (for example the IRSA role obtained via AssumeRoleWithWebIdentity) the one that carries the permissions, rather than assuming onward from another role session.
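The duration rules can be summarised as a small pre-flight check. This is a sketch under stated assumptions: the function and constant names are hypothetical, and it models an over-limit request as an error (to my understanding STS rejects such requests rather than silently shortening them).

```python
MIN_SESSION_SECONDS = 900        # 15 minutes, the shortest session STS issues
CHAINED_MAX_SECONDS = 3600       # one-hour limit when the caller is itself a role session


def validate_duration(requested: int, role_max_session_duration: int,
                      caller_is_role_session: bool) -> int:
    """Return `requested` if it fits the limits described above, else raise."""
    ceiling = role_max_session_duration
    if caller_is_role_session:
        # Role chaining: the target role's MaxSessionDuration no longer helps.
        ceiling = min(ceiling, CHAINED_MAX_SECONDS)
    if not MIN_SESSION_SECONDS <= requested <= ceiling:
        raise ValueError(
            f"duration {requested}s outside allowed range "
            f"{MIN_SESSION_SECONDS}..{ceiling}s"
        )
    return requested


# A 4-hour session is fine from an IAM user against a 12-hour role...
print(validate_duration(14400, 43200, caller_is_role_session=False))
# ...but the same request from a role session would raise ValueError.
```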

5 · Workload identity — IRSA, Pod Identity, GitHub OIDC

Three modern patterns, alongside the three older service-role mechanisms in the first rows below, all do the same thing: let a workload assume an IAM role without being given a long-lived secret.

Pattern	Where it runs	How it works
EC2 instance profile	EC2, ECS-on-EC2	EC2 metadata service (IMDSv2) at `169.254.169.254` returns role credentials. The SDK auto-discovers.
Lambda execution role	Lambda functions	Lambda runtime injects creds as env vars (`AWS_ACCESS_KEY_ID`, etc.). Auto-discovered.
ECS task role	Fargate / ECS tasks	Each task gets a metadata endpoint that returns role creds.
IRSA (EKS)	EKS pods (mature pattern)	OIDC provider in front of EKS; pod's service-account token gets exchanged via `AssumeRoleWithWebIdentity`.
EKS Pod Identity	EKS pods (2024+)	Like IRSA, but configured via EKS API instead of trust-policy JSON. Recommended for new clusters.
GitHub OIDC	GitHub Actions	GH issues a JWT per workflow; AWS role trusts `token.actions.githubusercontent.com`; the action calls `AssumeRoleWithWebIdentity`.

In each case the result is the same: short-lived STS credentials, no static AWS key checked into the workload's config, and a fine-grained trust policy that lets you say "only the my-service service account in the prod namespace can assume the my-service-role in account 1234." That last sentence is what production-grade IAM actually looks like.

6 · IRSA — the OIDC token flow, end to end

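IRSA (IAM Roles for Service Accounts) is the original Kubernetes workload-identity pattern. It's worth tracing once because the same shape recurs in GitHub Actions, GitLab, Buildkite, and now EKS Pod Identity. In outline: a mutating webhook injects a projected ServiceAccount token and the role ARN into the pod; the AWS SDK presents that token to STS via AssumeRoleWithWebIdentity; STS verifies the token against the cluster's OIDC issuer and checks the role's trust policy; STS returns short-lived credentials for the role.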

Four things make this pattern robust in production. First, the pod never holds a long-lived AWS secret, only a projected ServiceAccount token that kubelet refreshes well before it expires (the EKS webhook's default token lifetime is 24 hours, configurable per service account). Second, the trust policy on the AWS role specifies both the OIDC issuer URL and a StringEquals condition on system:serviceaccount:<namespace>:<sa-name>, so pods can only assume the role if their Kubernetes namespace and ServiceAccount name match. Third, the OIDC issuer is per-cluster, so a stolen JWT from cluster A can't assume roles trusted to cluster B's issuer. Fourth, the AWS SDK auto-discovers the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables injected by the EKS pod-identity webhook; application code doesn't change.
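A minimal sketch of the trust policy described in the second point, built as a Python dict. The function name and the example issuer, account, namespace, and service-account values are hypothetical; the `aud` value of `sts.amazonaws.com` is the audience commonly used for IRSA and is stated here as an assumption.

```python
import json


def irsa_trust_policy(account_id: str, oidc_issuer: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy letting one Kubernetes ServiceAccount assume a role via IRSA."""
    # Condition keys use the issuer host and path without the https:// scheme.
    issuer = oidc_issuer.removeprefix("https://")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{issuer}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder issuer ID and account; substitute your cluster's values.
policy = irsa_trust_policy(
    "111122223333",
    "https://oidc.eks.eu-west-2.amazonaws.com/id/EXAMPLE",
    "prod",
    "my-service",
)
print(json.dumps(policy, indent=2))
```

Generating the document from the issuer string, rather than hand-editing JSON, is one way to avoid mismatched condition-key prefixes.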

EKS Pod Identity (2024+) is the same outcome with less plumbing — no OIDC provider to create, no trust-policy JSON to author. EKS holds the mapping (cluster + namespace + SA → role) as an API resource. Pod Identity is the recommended pattern for new clusters; IRSA remains supported and is fine to keep on existing ones.

Identity mechanism	Where it runs	Best for	Trap
IAM user + access key	Anywhere — but ideally nowhere	One-off CI before OIDC was supported; legacy scripts	Static secret that leaks. Rotate aggressively, or migrate to OIDC.
IAM role + AssumeRole	Cross-account from any AWS principal	Centralised platform / observability roles that read from many workload accounts	Role chaining = 1-hour cap on the new session
EC2 instance profile	EC2 / ECS-on-EC2 hosts	Anything that runs on a known instance	IMDSv2 must be required (HttpTokens=required) — see Capital One §11
Lambda execution role	Lambda functions	Default for serverless workloads	One role per function; concurrency means many sessions at once
IRSA	EKS pods (any version)	Mature clusters; per-namespace fine-grained mapping	Trust policy typos in `aud` / `sub` claims fail silently
EKS Pod Identity	EKS pods (1.27+)	New clusters; less YAML, no OIDC provider to manage	Requires the Pod Identity agent DaemonSet
Identity Center permission set	Humans accessing AWS	Workforce SSO with Okta / Entra / Google	Session times out (default 8h); CLI uses `aws sso login`
GitHub Actions OIDC	CI/CD from GitHub	All deployment automation	Trust policy must pin `repo:<org>/<repo>:ref:refs/heads/<branch>` in the `sub` condition, or any repository on GitHub can assume the role

7 · Guardrails — SCPs, permission boundaries, session policies

Beyond identity and resource policies, three other policy types exist to cap what a principal can do, regardless of what its identity policies grant. They never grant permissions, only restrict them.

SCPs (Service Control Policies) live on an AWS Organizations OU or account. Common SCP: "deny iam:CreateUser for everyone in this OU" so prod can't have any human-shaped IAM users at all.
Permission boundaries attach to an individual IAM role or user. "This role can have any policy attached but its effective permissions are capped at this set" — a delegation pattern: dev teams self-serve role creation within a guardrail.
Session policies are passed at AssumeRole time as an extra restriction on the assumed session. CI templates use this: a base "build" role attached, then per-build a session policy that scopes it to one S3 prefix.

The evaluation order: an explicit deny in any layer wins, so that is checked first. Next come SCPs: if no SCP allows the action, the request is denied. Then identity and resource policies: within one account an allow from either is enough, while a cross-account call needs both to allow. Then the permission boundary: if one is attached and doesn't allow the action, denied. Then the session policy: if present, it must also allow. The request is allowed only if it passes every applicable layer. The mental model: each guardrail layer can only narrow what's allowed, never widen it.
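The layering can be sketched as a single decision function. This is a deliberately simplified model, not AWS's actual evaluator: each layer is reduced to a boolean, it applies the same-account versus cross-account rule from §1, and it ignores edge cases such as resource policies that name a session directly. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layers:
    explicit_deny: bool                      # a matching Deny exists in any layer
    scp_allows: bool                         # True if SCPs allow (or none apply)
    identity_allows: bool
    resource_allows: bool
    boundary_allows: Optional[bool] = None   # None means no boundary attached
    session_allows: Optional[bool] = None    # None means no session policy passed
    cross_account: bool = False


def is_allowed(layers: Layers) -> bool:
    """Return True only if the request survives every applicable layer."""
    if layers.explicit_deny:
        return False
    if not layers.scp_allows:
        return False
    if layers.cross_account:
        # Cross-account: both sides must grant.
        if not (layers.identity_allows and layers.resource_allows):
            return False
    elif not (layers.identity_allows or layers.resource_allows):
        # Same account: either side is enough; neither means implicit deny.
        return False
    if layers.boundary_allows is False:
        return False
    if layers.session_allows is False:
        return False
    return True


print(is_allowed(Layers(False, True, identity_allows=True, resource_allows=False)))
print(is_allowed(Layers(False, True, True, False, cross_account=True)))
```

Note that the guardrail fields can only turn an allow into a deny; nothing in the function lets a boundary or session policy grant access on its own.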

8 · Condition keys — IAM beyond "service-action-resource"

Every IAM policy statement can include a Condition block that adds context-aware checks. The most useful ones for production:

Condition key	What it constrains	Example
`aws:SourceIp`	Caller's source IP	"Only allow from corporate egress IP range."
`aws:SourceVpc` / `aws:SourceVpce`	VPC / VPC endpoint of the caller	"S3 bucket only accessible via this VPC endpoint." Stops data exfil.
`aws:PrincipalOrgID`	Caller's AWS org	"Only allow if the caller is in my organization."
`aws:MultiFactorAuthPresent`	MFA was used in the session	"Require MFA to delete IAM users."
`aws:RequestedRegion`	The region the call is destined for	"Deny all writes outside eu-west-2." Data residency.
`aws:ResourceTag/<key>`	A tag on the target resource	"Only let this role write to S3 buckets tagged `env=prod`."
`aws:PrincipalTag/<key>`	A tag on the caller	ABAC: "user with tag `team=payments` can access resources with the same tag."
`kms:ViaService`	Which AWS service is calling KMS on the caller's behalf	"Only S3 can use this KMS key" — prevents direct decrypt by users.

ABAC vs RBAC. Tag-based access control (ABAC) using PrincipalTag + ResourceTag lets you write one policy that scales: "any role tagged with team X can access any resource tagged team X." RBAC requires a new role per team. ABAC is harder to set up but scales better past a couple of dozen teams. AWS Identity Center supports passing attributes from the IdP into the session as principal tags.
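A minimal sketch of the "one policy that scales" idea, using the `${aws:PrincipalTag/...}` policy variable so the same statement serves every team. The function name and the EC2 actions are illustrative assumptions; check that the actions you choose actually support `aws:ResourceTag` before relying on this shape.

```python
import json


def abac_policy(tag_key: str, actions: list) -> dict:
    """Allow `actions` only where the resource's tag equals the caller's tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": actions,
            "Resource": "*",
            "Condition": {"StringEquals": {
                # The policy variable is resolved per request from the caller's tags.
                f"aws:ResourceTag/{tag_key}": f"${{aws:PrincipalTag/{tag_key}}}"
            }},
        }],
    }


# Illustrative actions; any tag-aware actions could be substituted.
print(json.dumps(abac_policy("team", ["ec2:StartInstances", "ec2:StopInstances"]), indent=2))
```

With this in place, onboarding a new team is a tagging exercise rather than a new role and policy.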

9 · Workforce identity — Identity Center

AWS IAM Identity Center (formerly AWS SSO) is the modern way humans access AWS:

Identity Center is enabled at the AWS Organizations management account.
An identity source is connected — Okta, Entra ID (Azure AD), Google Workspace, or Identity Center's own directory.
Permission sets are defined — each permission set is "a name + a set of managed/inline IAM policies."
Users / groups are assigned permission sets in particular AWS accounts.
Users log in to the Identity Center portal, click the account / role they need, and get an STS session in the browser or via aws sso login.

Behind the scenes Identity Center creates an IAM role per permission set per account, with the permission set's policies attached and a trust policy pointing back at Identity Center's identity provider. The user-facing experience is "pick an account and a role." The underlying machinery is the same federated AssumeRole story, here brokered through a SAML identity provider that Identity Center provisions in each account.

10 · Build it yourself — cross-account AssumeRole lab

The fastest way to internalise AssumeRole is to do it. This lab uses one account but makes the pattern explicit; replace the account ID with a second sandbox account if you have one.

Note your account ID.
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "Account: $ACCOUNT_ID"
Create a target role with a trust policy that lets the current caller assume it.
CURRENT_ARN=$(aws sts get-caller-identity --query Arn --output text)
# No MFA condition here: a Bool test on aws:MultiFactorAuthPresent never matches
# for plain access-key callers, because the key is absent from their requests.
cat > /tmp/trust.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "$CURRENT_ARN" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name LabReadOnly \
  --assume-role-policy-document file:///tmp/trust.json
Attach a read-only managed policy.
aws iam attach-role-policy --role-name LabReadOnly \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
Assume the role and inspect the returned credentials.
# Returns an AccessKeyId / SecretAccessKey / SessionToken triple, valid 15 minutes.
# If this fails right after create-role, wait a few seconds and retry (IAM is eventually consistent).
aws sts assume-role \
  --role-arn arn:aws:iam::${ACCOUNT_ID}:role/LabReadOnly \
  --role-session-name lab-1 \
  --duration-seconds 900
Use the temporary creds.
export AWS_ACCESS_KEY_ID=<from above>
export AWS_SECRET_ACCESS_KEY=<from above>
export AWS_SESSION_TOKEN=<from above>
aws sts get-caller-identity          # Should show the assumed role's session ARN, not your user.
aws s3 ls                            # Allowed (ReadOnly).
aws s3 mb s3://test-bucket-$RANDOM   # Denied: ReadOnly doesn't include CreateBucket.
Reset and tear down.
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
aws iam detach-role-policy --role-name LabReadOnly \
  --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
aws iam delete-role --role-name LabReadOnly

Variation: replace the trust policy Principal with "Federated": "arn:aws:iam::ACCT:oidc-provider/token.actions.githubusercontent.com" and a condition matching token.actions.githubusercontent.com:sub to the repo you want — that's the GitHub Actions OIDC pattern, end-to-end.
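The variation might look like the sketch below, which generates the trust policy for one repository and branch. The function name, account ID, and org/repo values are hypothetical; the `aud` value of `sts.amazonaws.com` is the audience commonly configured for GitHub's AWS integration and is stated here as an assumption.

```python
import json

GITHUB_ISSUER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, org: str, repo: str,
                             branch: str = "main") -> dict:
    """Trust policy allowing workflows on one branch of one repo to assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_ISSUER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{GITHUB_ISSUER}:aud": "sts.amazonaws.com",
                # Pin the exact repo and ref; a loose pattern widens who can assume.
                f"{GITHUB_ISSUER}:sub": f"repo:{org}/{repo}:ref:refs/heads/{branch}",
            }},
        }],
    }


# Placeholder account, org, and repo names.
print(json.dumps(github_oidc_trust_policy("111122223333", "my-org", "my-repo"), indent=2))
```

The output can be written to /tmp/trust.json and passed to `aws iam create-role` in place of the lab's trust policy, once the OIDC provider exists in the account.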

11 · Real-world case studies

Three publicly-documented stories make the IAM model concrete — one cautionary, two aspirational.

Capital One (2019): over-permissive IAM + SSRF. A misconfigured ModSecurity WAF on an EC2 instance was exploited via server-side request forgery to query the EC2 metadata endpoint (IMDSv1, which answers any plain GET request), which returned the instance's IAM role credentials. The role had broad S3 permissions, and the attacker used those creds to enumerate and exfiltrate ~106 million customer records. The chain of failures is documented in Krebs's writeup and in DOJ court filings. The lessons that landed across the industry: require IMDSv2 (which needs a session token before returning creds, breaking most SSRF), scope EC2 roles to least privilege, never give a web-tier role wholesale S3 list/read across all buckets, and condition bucket policies on aws:SourceVpce so the buckets can only be read through the expected VPC endpoint. AWS released IMDSv2 in November 2019, shortly after this incident, and has since been moving new instance launches toward IMDSv2 by default.
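To make the IMDSv2 point concrete, the sketch below constructs (but does not send) the two requests an IMDSv2 client makes. The helper names are hypothetical; the paths and header names are the ones IMDSv2 uses, to the best of my knowledge.

```python
import urllib.request

IMDS = "http://169.254.169.254"


def imdsv2_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1: a PUT with a TTL header, which asks IMDS for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def imdsv2_credentials_request(token: str, role_name: str) -> urllib.request.Request:
    """Step 2: a GET for the role credentials, carrying the token from step 1."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


req = imdsv2_token_request()
print(req.get_method(), req.full_url)
```

A typical SSRF bug lets the attacker choose a URL for a GET but not the HTTP method or custom headers, which is why requiring the PUT-then-header handshake blunts that class of attack.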

Mozilla — least-privilege as code. Mozilla publishes its IAM least-privilege guidelines and the tooling that enforces them. The pattern: every IAM role in their AWS estate is generated from a code repository, with a permission boundary attached automatically by the platform; policy changes are reviewed by humans and re-validated by Access Analyzer on every PR. The argument that travels well: in any sufficiently large org, no IAM role survives manual review unless creating the role itself goes through code review. The Mozilla docs walk through SAR-style (Service / Action / Resource) policies derived from CloudTrail-observed usage, which is the same idea behind AWS's own IAM Access Advisor and the "Generate policy" feature in the console.

Netflix: ConsoleMe and shared-account workflows. Netflix open-sourced ConsoleMe, a self-service portal that lets engineers request a temporary AssumeRole into any of the hundreds of AWS accounts Netflix operates. The post describes a one-click "raise to this role for 1 hour, here's why" workflow that issues a short-lived session, records the justification in an audit log, and revokes the session on time-out. The model that survived: humans don't get long-lived credentials, but they get a frictionless on-demand path to the access they need, with full provenance. Multiple companies have since built variants. Netflix's earlier tooling, Aardvark and Repokid, automatically reduces unused IAM permissions in production: Aardvark collects each role's Access Advisor (service last accessed) data, and Repokid uses it to remove permissions nobody has used in 90 days.

The through-line: in 2026 production AWS, the IAM that works is "no long-lived secrets, federation everywhere, automated guardrails, and reviewable provenance for every elevated session."

12 · What breaks

"User can't assume role" — almost always one of: (a) trust policy doesn't list them, (b) their identity policy doesn't have sts:AssumeRole on the target, (c) an SCP blocks it, (d) MFA condition requires MFA they don't have, (e) the role has an external-ID condition they're not passing.
"Access denied" with no clue. Use the IAM Policy Simulator or CloudTrail — the denied call shows up there with the matched-deny statement. AWS deliberately doesn't tell the caller why (information-leak risk), so the operator has to look on the AWS side.
Silent permission-boundary cut. When platform teams attach a boundary to a role, any action the role's identity policy grants but the boundary doesn't silently stops working. The role still exists, the identity policy still lists the action, the call still returns AccessDenied with no indication that a boundary is in play. Always check aws iam get-role --role-name X for PermissionsBoundary.
Trust-policy typos in aud/sub. IRSA's OIDC trust policy keys are <oidc-issuer>:aud and <oidc-issuer>:sub. A common copy-paste error puts sts.amazonaws.com:aud instead of oidc.eks.<region>.amazonaws.com/id/XXXX:aud; the policy looks correct but matches nothing. GitHub Actions has the same trap with token.actions.githubusercontent.com:sub — the value must be exactly repo:<org>/<repo>:ref:refs/heads/main or a wildcard you trust.
STS regional endpoints. Calls to the global sts.amazonaws.com endpoint have historically been served from us-east-1: fast for North America, painful from Asia, and exposed to a us-east-1 control-plane event. Use regional STS endpoints (sts.<region>.amazonaws.com) in the SDK config (AWS_STS_REGIONAL_ENDPOINTS=regional) so AssumeRole stays in-region. AWS's own SDKs default to regional in newer versions; older SDKs and many CI tools still default to global.
The 1-hour role-chaining limit. If your code path AssumeRoles from a session that was itself produced by AssumeRole, the new session's TTL is capped at 1 hour regardless of --duration-seconds or the target role's MaxSessionDuration. Long-running agents either re-authenticate every hour or AssumeRole directly from a non-chained principal.
Static keys leaked. If you see AKIA... in a commit, rotate immediately, scan the public web for the key (it's almost certainly already scraped), then audit CloudTrail for that key's API activity. AWS scans GitHub and emails you when it spots a key, but the bots get there first.
"My Lambda can't access S3 even though I gave it permission." Confirm (1) the Lambda's execution role has the S3 permission, (2) the S3 bucket policy doesn't deny the role, (3) the bucket isn't behind a KMS key the role can't decrypt with, (4) you're calling the right region.
IAM eventual consistency. Policy changes propagate over seconds, occasionally tens of seconds. CI scripts that create a role and immediately call it sometimes fail; retry with backoff.
Identity Center session expiry mid-CLI-command. Default Identity Center session is 8 hours; long-running CLI scripts will start seeing ExpiredToken mid-run. Re-run aws sso login and resume, or wrap the script in a retry that detects the error and re-authenticates.
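For the eventual-consistency item in the list above, a generic retry helper along these lines is usually enough. It is a sketch with hypothetical names: `call` stands in for whatever SDK call follows the role creation, and `is_retryable` is where you would recognise the access-denied error your SDK raises.

```python
import random
import time


def retry_with_backoff(call, is_retryable, attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Run call() until it succeeds, with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on errors retrying cannot fix.
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            delay = base_delay * (2 ** attempt)
            # Jitter keeps parallel CI jobs from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so tests can substitute a recorder instead of actually waiting. Keep `attempts` small: a denial that persists past a minute is more likely a real policy problem than propagation delay.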

13 · Further reading

IAM policy evaluation logic. The canonical flowchart. Read this once and refer back; it's where every "why was this denied?" question ends.
Permissions boundaries. The mechanism for delegated role creation. Worth knowing for platform-team interviews.
IRSA. The original EKS workload-identity pattern.
EKS Pod Identity. The 2024 replacement; less plumbing, recommended for new clusters.
Capital One breach analysis (Krebs). The IMDSv1 + over-permissive role chain that drove the IMDSv2 default.
Mozilla IAM Least Privilege Guidelines. A public, mature take on least-privilege-as-code.
Netflix ConsoleMe. The self-service AssumeRole pattern at scale.
Netflix Aardvark + Repokid. Automated permission-pruning tooling driven by Access Advisor usage data.
Identity & IAM concepts. The conceptual companion — the principal/policy/role model, JSON anatomy, and SCPs.
Authentication primitives. The broader auth landscape — JWTs, OIDC, OAuth — that this page assumes you know.
