Step Functions.

A managed workflow engine. You describe a state machine in JSON (Amazon States Language); Step Functions runs it, persists state at every step, retries failed steps with backoff, and gives you a visual graph of every execution. The right tool when "this is a Lambda that calls another Lambda that calls another Lambda" has gotten out of hand.


1 · What workflow orchestration actually is

The mental model: Step Functions is a durable execution engine. Every transition between states persists to AWS-managed storage before the next state runs. If the process running the workflow dies — host vanishes, network partitions, AZ goes dark — Step Functions resumes from the last committed state on a different host. The application code itself (your Lambdas, your service calls) doesn't need to know it crashed; the engine remembers where you were.

What it isn't: a low-latency request router. Each state transition adds tens to low hundreds of milliseconds of overhead. If your "workflow" is three Lambda calls inside one user request and finishes in 200ms, you don't need Step Functions; you need a function that calls three functions. Step Functions earns its keep when the workflow spans seconds to days, or when "what happens if step 4 crashes after step 3 succeeded?" is a question you'd otherwise answer with database tables and cron jobs.

It's worth seeing where it sits among the alternatives, because the durability and visibility profiles differ sharply:

| Approach | Durability | Visibility | Reach for it when |
| --- | --- | --- | --- |
| Lambda chain (Lambda A → invokes Lambda B → invokes Lambda C) | None. If Lambda B dies mid-flight, nothing remembers Lambda C was supposed to run. | CloudWatch logs per function. No end-to-end trace without X-Ray plumbing. | Sub-second flows you don't need to recover from a crash. |
| SQS-driven worker (queue → worker → next queue) | Per-message; at-least-once delivery. You write the idempotency. | Queue depth + per-worker logs. Stitching the trace is on you. | High-throughput, loosely-coupled pipelines where each stage is independently scalable. |
| Step Functions Standard | Per-transition checkpointed in AWS storage. Exactly-once within an execution. | Full visual replay; 90-day history; ListExecutions API. | Anything with branching, retries, human approval, or "this might run for 11 hours." |
| Apache Airflow | Per-task in a metadata DB you operate (or MWAA, which AWS operates). | Best-in-class DAG UI; scheduler view, log per task. | Scheduled data pipelines (ETL) authored in Python by data engineers. |
| Temporal / Cadence | Per-event in Temporal's event-sourced log. Replayable workflows; exactly-once side effects. | Web UI + per-workflow event history; can be deeper than Step Functions. | Long-running orchestration authored in code (TypeScript, Go, Java), when JSON ASL is too constraining. |

The litmus test. Ask: "if the host running my orchestration logic dies right now, what's the recovery procedure?" If the answer involves grep-ing logs and manually re-issuing API calls, you wanted Step Functions (or Temporal, or Airflow, though Step Functions is the AWS-native default). If the answer is "the work happens again because the message is still in the queue," you wanted SQS.

2 · How execution state is persisted

Each state transition in a Standard workflow writes a history event — an immutable record of "we left state X, entered state Y, with this input/output payload" — to AWS-managed storage before the workflow advances. If the worker thread crashes between two transitions, the next attempt reads the last committed event and resumes from there. This is the durability primitive that "Lambda calls Lambda" doesn't have.

Three properties fall out of this shape. First, you can resume long-running workflows (waiting for a human, waiting for a callback) without holding any process open — Step Functions parks the execution and rehydrates it when the wait completes. Second, the visible state (which state, what input/output, which retry attempt) is queryable by anyone with states:DescribeExecution — operators don't need to grep logs. Third, the engine charges per state transition, so every state you add is a recurring cost; this is why "loops with 10,000 iterations" need Distributed Map rather than a hand-rolled Map state.
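The checkpoint-then-advance idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Step Functions is implemented: `History`, `run`, and the step names are hypothetical, and an in-memory list stands in for AWS-managed storage.

```python
import json

class History:
    """Hypothetical stand-in for the durable execution history."""
    def __init__(self):
        self.events = []  # each event is a serialized JSON record

    def append(self, state, output):
        # Record the outcome of a state before the workflow advances.
        self.events.append(json.dumps({"state": state, "output": output}))

    def last(self):
        return json.loads(self.events[-1]) if self.events else None


def run(steps, payload, history):
    """Run steps in order, skipping any the history says already completed."""
    start = 0
    done = history.last()
    if done is not None:
        names = [name for name, _ in steps]
        start = names.index(done["state"]) + 1
        payload = done["output"]  # resume from the last committed output
    for name, fn in steps[start:]:
        payload = fn(payload)
        history.append(name, payload)
    return payload


calls = {"reserve": 0, "charge": 0}
crash_once = {"armed": True}

def reserve(order):
    calls["reserve"] += 1
    return {**order, "reserved": True}

def charge(order):
    calls["charge"] += 1
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("host died")  # simulated crash mid-workflow
    return {**order, "charged": True}

steps = [("Reserve", reserve), ("Charge", charge)]
history = History()
try:
    run(steps, {"orderId": "abc"}, history)
except RuntimeError:
    pass  # the "host" is gone; only the history survives

# A fresh run picks up after Reserve instead of repeating it.
result = run(steps, {"orderId": "abc"}, history)
print(result, calls)
```

The second call never re-runs `reserve`, because the history already holds its committed output. That skipped step is what a plain Lambda-to-Lambda chain cannot offer.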

3 · The state types

A state machine is a JSON document (Amazon States Language) describing nodes (states) and transitions. Step Functions executes it, persists the state of every transition, and resumes smoothly through failures.

| State type | Use |
| --- | --- |
| Task | Invoke a Lambda, an AWS service action, or a worker via activity ARN. The unit of work. |
| Choice | Branch based on the input data. Like an if. |
| Parallel | Execute branches concurrently; wait for all to finish. |
| Map | Like JavaScript's .map: execute the same sub-workflow for each item in an array. Up to 10k concurrent in Distributed Map. |
| Wait | Sleep for a fixed time or until a timestamp. |
| Pass | Forward input as output, optionally transformed. Useful for data shaping. |
| Succeed / Fail | Terminal states. |

4 · Standard vs Express — the internal model

Same JSON language, two fundamentally different execution engines underneath.

Standard is the durable engine described above: per-transition writes to the execution history, 90-day retention, exactly-once semantics within an execution, visual replay in the console. You pay per state transition (~$0.025 per 1,000), which sounds tiny until a workflow with 50 states runs a million times a day and costs $1,250 per day. The tradeoff is full auditability and the ability to resume after multi-day waits.

Express runs more like a stream processor. The interpreter holds state in memory; transitions are batched and flushed to CloudWatch Logs at the end of the execution. There's no per-transition durable write, which is why Express can hit 100,000 starts per second and why a 5-minute execution costs fractions of a cent. The catch: at-least-once semantics (a node failure can replay a sub-section), no console replay (you have only the log stream), and a hard 5-minute wall-clock ceiling.

|  | Standard | Express |
| --- | --- | --- |
| Duration | Up to 1 year | Up to 5 minutes |
| Execution rate | 2,000 / s | 100,000 / s |
| Pricing model | Per state transition (~$0.025 / 1k) | Per request + duration × memory (Lambda-shaped) |
| History | 90 days, fully replayable in console | CloudWatch logs only |
| Idempotency | Exactly-once within an execution | At-least-once |
| Synchronous mode | No; always asynchronous | Yes; StartSyncExecution blocks and returns the result; usable from API Gateway directly |
| Reach for it when | Long-running workflows; auditable; lower-throughput | High-throughput, short workflows (per-request orchestration) |

Express workflows are roughly 100× cheaper at high throughput. At ~$0.025 per 1,000 transitions, a per-request orchestration (e.g., "every API call triggers a 5-step approval flow") costs about $125 per million executions on Standard (5 transitions × 1M ÷ 1,000 × $0.025). Express bills per request plus duration × memory instead, which for a short flow typically lands near a dollar or two per million. Use Standard when you need long-running state, the visual replay history, or exactly-once semantics.
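To make the two pricing models concrete, here is a rough calculator. The Standard rate is the per-transition figure quoted above. The Express request and GB-second rates, and the 64 MB memory default, are assumptions for illustration, so check the current AWS price list before relying on them. The function names are hypothetical.

```python
import math

# Assumed rates in USD; verify against the current price sheet.
STANDARD_PER_1K_TRANSITIONS = 0.025   # figure used in the text above
EXPRESS_PER_MILLION_REQUESTS = 1.00   # assumption
EXPRESS_PER_GB_SECOND = 0.00001667    # assumption (first pricing tier)

def standard_cost(transitions_per_execution, executions):
    """Standard bills every state transition, regardless of duration."""
    total_transitions = transitions_per_execution * executions
    return total_transitions / 1000 * STANDARD_PER_1K_TRANSITIONS

def express_cost(executions, duration_seconds, memory_mb=64):
    """Express bills per request plus duration x memory, like Lambda."""
    request_charge = executions / 1_000_000 * EXPRESS_PER_MILLION_REQUESTS
    gb_seconds = executions * duration_seconds * (memory_mb / 1024)
    return request_charge + gb_seconds * EXPRESS_PER_GB_SECOND

# The 50-state, million-a-day workflow from the Standard paragraph.
daily = standard_cost(50, 1_000_000)

# A 5-transition per-request flow, one million executions, 200 ms each.
standard_5 = standard_cost(5, 1_000_000)
express_5 = express_cost(1_000_000, 0.2)

print(f"50-state Standard, 1M/day: ${daily:,.2f}")
print(f"5-step Standard, per 1M:   ${standard_5:,.2f}")
print(f"5-step Express, per 1M:    ${express_5:,.2f}")
print(f"ratio: {standard_5 / express_5:.0f}x")
```

The calculator ignores Express billing granularity (duration and memory rounding), so treat its output as an order-of-magnitude estimate.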

6 · Error handling and retries

Every Task state can declare Retry and Catch arrays. Retry: list of error patterns with backoff strategies; Step Functions retries automatically. Catch: list of patterns with target states to transition to on failure. The combination eliminates most of the "try/except/sleep/retry" boilerplate you'd otherwise write in your Lambdas.

    {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:CallStripe",
      "Retry": [
        {
          "ErrorEquals": ["States.Timeout", "Stripe.RateLimitError"],
          "IntervalSeconds": 2,
          "MaxAttempts": 5,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "CompensatingTransaction"
        }
      ],
      "Next": "MarkOrderPaid"
    }

This pattern — Retry on the retryable, Catch into a compensating transaction on the unrecoverable — is the saga pattern, expressed declaratively.
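To see what those Retry numbers mean in practice, the sketch below computes the wait before each retry and mimics the Retry-then-Catch routing in plain Python. It is an illustration of the semantics, not the engine's code. `run_task` and `RateLimitError` are hypothetical, and sleeping is injected so the example runs instantly.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait before retry n (0-based): interval * backoff_rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]

class RateLimitError(Exception):
    """Hypothetical stand-in for the Stripe.RateLimitError error name."""

def run_task(task, retryable, delays, next_state, catch_state, sleep):
    """Retry retryable errors; route exhaustion or any other error to Catch."""
    retries = 0
    while True:
        try:
            return next_state, task()
        except retryable as err:
            if retries >= len(delays):
                return catch_state, repr(err)  # retries exhausted
            sleep(delays[retries])
            retries += 1
        except Exception as err:
            return catch_state, repr(err)      # not retryable at all

delays = retry_delays(interval_seconds=2, max_attempts=5, backoff_rate=2.0)

# A task that is rate-limited twice, then succeeds.
attempts = {"n": 0}
def call_stripe():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimitError("slow down")
    return "paid"

waited = []
outcome = run_task(call_stripe, (RateLimitError, TimeoutError), delays,
                   "MarkOrderPaid", "CompensatingTransaction", waited.append)

# A task that fails with an error the Retry block does not list.
def bad_card():
    raise ValueError("card declined")

fallback = run_task(bad_card, (RateLimitError, TimeoutError), delays,
                    "MarkOrderPaid", "CompensatingTransaction", waited.append)

print(delays, outcome, fallback[0])
```

With `IntervalSeconds` 2, `BackoffRate` 2.0, and `MaxAttempts` 5, the waits are 2, 4, 8, 16, and 32 seconds, so a task that keeps failing spends about a minute in retries before Catch takes over.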

7 · AWS SDK integrations and "no Lambda needed"

Step Functions can call > 200 AWS services directly without an intermediary Lambda — arn:aws:states:::aws-sdk:dynamodb:getItem reads a DynamoDB item; arn:aws:states:::aws-sdk:s3:putObject writes to S3. This is the difference between "Step Functions calls a Lambda that calls DynamoDB" and "Step Functions calls DynamoDB directly." The latter is faster, cheaper, and one less Lambda to maintain.

There are three integration patterns, and they matter because they change billing and semantics. Request Response calls the service and moves on as soon as the API call returns, without waiting for the work it started to finish (cheapest, no waiting). Run a Job (the .sync suffix) blocks the state until the called service reports completion (used for EMR steps, Batch jobs, ECS tasks). Wait for a callback with a task token (the .waitForTaskToken suffix) parks the execution until external code calls SendTaskSuccess / SendTaskFailure.
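As a sketch of how the patterns show up in a definition, the Python below builds two Task states as dictionaries and prints them as ASL JSON. The helper `task_resource`, the table, job, and queue names, and the state names are all hypothetical. Only the ARN shapes and the suffixes come from the text above.

```python
import json

SUFFIXES = {
    "request_response": "",               # return once the API call returns
    "run_job": ".sync",                   # wait for the job to complete
    "callback": ".waitForTaskToken",      # wait for SendTaskSuccess/Failure
}

def task_resource(service_action, pattern="request_response"):
    """Build a service-integration Resource ARN with the pattern's suffix."""
    return f"arn:aws:states:::{service_action}{SUFFIXES[pattern]}"

# Direct SDK call: read one DynamoDB item with no Lambda in between.
get_order = {
    "Type": "Task",
    "Resource": task_resource("aws-sdk:dynamodb:getItem"),
    "Parameters": {
        "TableName": "orders",                     # hypothetical table
        "Key": {"orderId": {"S.$": "$.orderId"}},  # pulled from state input
    },
    "ResultPath": "$.order",  # keep the input, attach the item beside it
    "Next": "RunReport",
}

# Run a Job: the state does not finish until the Batch job does.
run_report = {
    "Type": "Task",
    "Resource": task_resource("batch:submitJob", "run_job"),
    "Parameters": {
        "JobName": "nightly-report",               # hypothetical names
        "JobQueue": "reports-queue",
        "JobDefinition": "report-job:1",
    },
    "End": True,
}

states = {"GetOrder": get_order, "RunReport": run_report}
print(json.dumps(states, indent=2))
```

Keeping the suffix in one helper makes the pattern choice explicit at each call site, which matters because the same service action behaves very differently with and without `.sync`.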

8 · Wait-for-task-token, end to end

The pattern for human approval flows ("wait for a human to click approve in Slack"), long-running external jobs, and any "another system has to call us back" workflow. The Task state pauses (consuming no compute, costing nothing per second), and only resumes when an external caller presents the token.

Two non-obvious properties: the token is generated by Step Functions and travels with the work, so the external system never authenticates against the workflow directly — possessing the token is the authorisation to resume. And the parked state costs nothing per second; you pay one state transition when it enters the wait and one when it leaves. A 30-day human approval flow is the same price as a 30-second one.
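A minimal sketch of both halves follows: the Task state that hands the token out, and a toy model of "the token is the authorisation". The queue URL, state names, and the `ParkedTasks` class are hypothetical. The class imitates the behaviour described above and is not the Step Functions API, although its method name mirrors the real SendTaskSuccess call.

```python
import secrets

THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds

# The state sends the token out with the work; "$$.Task.Token" reads it
# from the context object. An explicit timeout avoids waiting forever.
wait_for_approval = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        "MessageBody": {
            "orderId.$": "$.orderId",
            "taskToken.$": "$$.Task.Token",
        },
    },
    "TimeoutSeconds": THIRTY_DAYS,
    "Next": "MarkApproved",
}

class ParkedTasks:
    """Toy model of parked callback tasks, keyed by opaque token."""
    def __init__(self):
        self._parked = {}

    def park(self, execution_id):
        # The engine mints the token; the caller never chooses it.
        token = secrets.token_urlsafe(32)
        self._parked[token] = execution_id
        return token

    def send_task_success(self, token, output):
        # Holding a valid token is the only check; each token works once.
        execution_id = self._parked.pop(token)  # KeyError if unknown or used
        return execution_id, output

engine = ParkedTasks()
token = engine.park("exec-42")
resumed = engine.send_task_success(token, {"approved": True})
print(resumed)
```

Nothing runs between `park` and `send_task_success`, which is the sense in which a parked state costs nothing per second.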

9 · Distributed Map and the 25,000-history-event ceiling

Every Standard workflow execution has a hard ceiling of 25,000 history events. Each state emits at least two events (state entered, state exited), and Task states add more on top (task scheduled, task started, task succeeded), so a single Task iteration costs several events. A naive inline Map over an array of 10,000 items therefore needs well over 25,000 events, and the execution fails once it reaches the limit.
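A quick budget check makes the ceiling tangible. The events-per-item figure is an assumption you would measure from a real execution's history, and the helper names are hypothetical.

```python
HISTORY_EVENT_LIMIT = 25_000  # per Standard execution, as stated above

def events_needed(items, events_per_item, overhead=0):
    """Estimated history events for an inline Map over `items` items."""
    return items * events_per_item + overhead

def max_inline_items(events_per_item, overhead=0):
    """Largest item count that still fits in one execution's history."""
    return (HISTORY_EVENT_LIMIT - overhead) // events_per_item

# Even at an optimistic 3 events per item, 10,000 items do not fit.
needed = events_needed(10_000, events_per_item=3)
fits = needed <= HISTORY_EVENT_LIMIT

# At 5 events per item (an assumed figure for a Task iteration),
# the ceiling works out to a few thousand items.
ceiling = max_inline_items(events_per_item=5)

print(needed, fits, ceiling)
```

Real workflows also spend events on the states around the Map, so pass a nonzero `overhead` when the margin is tight.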

Distributed Map (launched 2022) is the fix. Instead of running the iteration inside the parent execution's history, each iteration becomes its own child execution (with its own 25k budget), and the parent tracks them via S3 manifests rather than per-event log entries. The parent can drive up to 10,000 concurrent child executions and iterate over the contents of an S3 bucket (every object in s3://bucket/prefix/), CSV rows, or JSON arrays loaded from S3.

|  | Inline Map | Distributed Map |
| --- | --- | --- |
| Items per iteration | ~5,000 max (constrained by 25k history) | Millions (limited by S3 manifest) |
| Concurrency | 40 by default | Up to 10,000 |
| State sharing | Parent's history sees every iteration | Each iteration is its own execution; parent gets aggregate result |
| Cost | Per-transition × N items in one execution | Per-transition × N items across N executions; same total, more visibility |
| Reach for it when | Small fan-out (~hundreds of items) | Batch processing every file in an S3 prefix; ETL fan-outs |

Distributed Map isn't free. Each child execution is billed at the normal per-transition rate and counts against your execution-rate quota. Doing 10 million items in one parent costs the same as 10 million separate executions plus a small parent overhead, but you get the orchestration, retry-per-item, and partial-failure reporting that hand-rolled fan-out doesn't.
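For orientation, here is roughly what a Distributed Map state over an S3 prefix looks like, built as a Python dictionary and printed as ASL JSON. Treat it as a sketch: the bucket, prefix, Lambda function name, and state names are hypothetical, and the concurrency guard simply encodes the 10,000 figure quoted above.

```python
import json

MAX_CHILD_CONCURRENCY = 10_000  # ceiling quoted in the text above

def distributed_map_state(bucket, prefix, max_concurrency=1_000):
    """Build a Map state that runs one child execution per S3 object."""
    if not 0 < max_concurrency <= MAX_CHILD_CONCURRENCY:
        raise ValueError("max_concurrency must be between 1 and 10,000")
    return {
        "Type": "Map",
        # ItemReader lists the objects instead of reading an input array.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": prefix},
        },
        # DISTRIBUTED mode is what moves iterations into child executions.
        "ItemProcessor": {
            "ProcessorConfig": {
                "Mode": "DISTRIBUTED",
                "ExecutionType": "STANDARD",
            },
            "StartAt": "ProcessObject",
            "States": {
                "ProcessObject": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {
                        "FunctionName": "process-object",  # hypothetical
                        "Payload.$": "$",
                    },
                    "End": True,
                }
            },
        },
        "MaxConcurrency": max_concurrency,
        "End": True,
    }

state = distributed_map_state("my-bucket", "incoming/")
print(json.dumps(state, indent=2))
```

Capping `MaxConcurrency` well below the ceiling is a common way to protect downstream services, since every child execution may call them at once.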

10 · Real-world case studies

Three public examples show where Step Functions earns its keep beyond toy tutorials.

Coca-Cola Freestyle — vending-machine orchestration. AWS's published Coca-Cola Freestyle case study describes the dispenser fleet (50,000+ machines worldwide) being driven through Step Functions–orchestrated flows for ordering, configuration updates, and telemetry processing. Each pour interaction kicks off short workflows that fan out to backend services; firmware-update campaigns use long-running Standard workflows that coordinate rollout across geographic groups. The interesting design choice is using Step Functions as the audit log itself — every state transition for every machine is queryable, so when a regulator asks "show me every dispense of product X in California last quarter," the answer comes from the execution history, not a custom event log.

Pokémon GO — Niantic's event-driven backend. Niantic's re:Invent 2018 talk on the Pokémon GO architecture (search "Niantic AWS re:Invent" — sessions ARC405 and similar) and their published case study describe Step Functions orchestrating raid events, Pokémon spawning logic, and player-progress flows across a multi-region deployment. The scale is the load-bearing detail: hundreds of millions of player events per day are routed through state machines that compose DynamoDB writes, ElastiCache invalidations, and downstream notifications. Express workflows handle per-request orchestration; Standard handles longer-running coordination like region-wide events.

Liberty Mutual — claims processing in regulated finance. Liberty Mutual's published architecture documents Step Functions as the backbone of their claims-intake and underwriting flows, where each step (document ingestion, OCR, fraud-rules engine, human review, settlement) needs durable state and a compliance-grade audit trail. The Standard workflow's 90-day execution history doubles as the audit log; regulators see exactly which path each claim took and which automated rules fired. This is the canonical "we'd otherwise be running Airflow plus a hand-rolled state table in Postgres" pattern — Step Functions collapses both into one managed service.

The through-line: Step Functions is most valuable when the orchestration itself is the audited, queryable system of record — not just plumbing between services. When that's true, the per-transition cost buys you a compliance posture you'd otherwise build by hand.

11 · Build it yourself — a 3-state workflow

  1. Create the role.
    SFN_ROLE=$(aws iam create-role --role-name lab-sfn \
      --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"states.amazonaws.com"},"Action":"sts:AssumeRole"}]}' \
      --query 'Role.Arn' --output text)
    aws iam attach-role-policy --role-name lab-sfn \
      --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaRole
    sleep 10
  2. Define the state machine.
    cat > /tmp/sm.json <<'EOF'
    {
      "Comment": "lab: validate, branch, finish",
      "StartAt": "Validate",
      "States": {
        "Validate": {
          "Type": "Pass",
          "Result": { "amount": 1500, "valid": true },
          "Next": "AmountCheck"
        },
        "AmountCheck": {
          "Type": "Choice",
          "Choices": [
            { "Variable": "$.amount", "NumericGreaterThan": 1000, "Next": "ManualReview" }
          ],
          "Default": "AutoApprove"
        },
        "ManualReview": { "Type": "Pass", "Result": "needs human", "End": true },
        "AutoApprove": { "Type": "Pass", "Result": "approved", "End": true }
      }
    }
    EOF
    SM_ARN=$(aws stepfunctions create-state-machine --name lab-sm \
      --definition file:///tmp/sm.json --role-arn "$SFN_ROLE" \
      --type STANDARD --query stateMachineArn --output text)
  3. Run it.
    EXEC_ARN=$(aws stepfunctions start-execution --state-machine-arn "$SM_ARN" \
      --input '{"orderId":"abc"}' --query executionArn --output text)
    sleep 2
    aws stepfunctions describe-execution --execution-arn "$EXEC_ARN" \
      --query '{status:status,output:output}'
  4. Inspect the history.
    aws stepfunctions get-execution-history --execution-arn "$EXEC_ARN" \
      --query 'events[].{id:id,type:type}' --output table
    # Every state transition is logged. The console graph view is built from this.
  5. Tear down.
    aws stepfunctions delete-state-machine --state-machine-arn "$SM_ARN"
    aws iam detach-role-policy --role-name lab-sfn \
      --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaRole
    aws iam delete-role --role-name lab-sfn
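If you want to predict the lab's result before running it, the definition is small enough to trace offline. The toy interpreter below handles only what the lab uses (Pass with a Result, and Choice with NumericGreaterThan on a top-level field). It is an illustrative sketch, not a substitute for the real engine, and `run` and `choose` are hypothetical names.

```python
import json

# The states from step 2 of the lab.
DEFINITION = json.loads("""
{
  "StartAt": "Validate",
  "States": {
    "Validate": {"Type": "Pass", "Result": {"amount": 1500, "valid": true},
                 "Next": "AmountCheck"},
    "AmountCheck": {"Type": "Choice",
                    "Choices": [{"Variable": "$.amount",
                                 "NumericGreaterThan": 1000,
                                 "Next": "ManualReview"}],
                    "Default": "AutoApprove"},
    "ManualReview": {"Type": "Pass", "Result": "needs human", "End": true},
    "AutoApprove": {"Type": "Pass", "Result": "approved", "End": true}
  }
}
""")

def choose(state, payload):
    """Return the Next of the first matching rule, else the Default."""
    for rule in state["Choices"]:
        field = rule["Variable"][2:]  # strip "$." (top-level fields only)
        if "NumericGreaterThan" in rule and payload[field] > rule["NumericGreaterThan"]:
            return rule["Next"]
    return state["Default"]

def run(definition, payload):
    """Walk the machine from StartAt; return (visited states, final output)."""
    name, visited = definition["StartAt"], []
    while True:
        state = definition["States"][name]
        visited.append(name)
        if state["Type"] == "Choice":
            name = choose(state, payload)
            continue
        if state["Type"] != "Pass":
            raise NotImplementedError(state["Type"])
        # With no ResultPath set, a Pass state's Result replaces the input.
        payload = state.get("Result", payload)
        if state.get("End"):
            return visited, payload
        name = state["Next"]

visited, output = run(DEFINITION, {"orderId": "abc"})
print(visited, json.dumps(output))
```

The trace visits Validate, AmountCheck, and ManualReview and ends with "needs human". Note that the `orderId` input disappears at Validate, because the Pass state's Result replaces the whole input.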

12 · What breaks

  • Quota: 25,000 history events per execution (Standard). Large Map states with thousands of items hit it. Distributed Map mode is the supported workaround — it checkpoints to S3 and runs each iteration as its own execution — but it's not a drop-in replacement (different IAM, different result shape, different debugging).
  • Inputs / outputs > 256 KB. Step Functions rejects payloads above 256 KB at any state boundary. The canonical fix is to store the bulk payload in S3 and pass only the S3 reference ({"bucket":"x","key":"y"}) between states; each state Lambda reads what it needs.
  • Express has no console replay. A failed Express execution leaves only the CloudWatch log stream — no visual graph, no time-travel. If you need to investigate "why did this fail at 3am two days ago," you're grep-ing JSON logs. Use Standard for anything you'll need to debug post-hoc.
  • "My state machine succeeded but the data is wrong." ResultPath, OutputPath, InputPath, and Parameters compose in subtle order. The default for ResultPath is $ (replace the whole state input with the task result), which silently drops anything you didn't pass through. Set ResultPath: "$.taskOutput" to merge instead of replace.
  • Cost spike from runaway retries. A buggy state with "MaxAttempts": 10, "BackoffRate": 2.0 calling Lambda 10 times per execution × millions of executions adds up fast. Set a low MaxAttempts for non-idempotent operations and let Catch route to a DLQ-equivalent state instead.
  • Task token never comes back. Wait-for-task-token states block forever if the external system forgets to call SendTaskSuccess/SendTaskFailure. There is no default timeout; you must set TimeoutSeconds on the Task state. A 1-year Standard execution will happily wait the full year for a token.
  • Standard executions count against in-flight quota. The default quota is 1,000,000 open executions per region. A workflow with a 30-day human-approval wait that runs at 1k/day will pile up 30k open executions, well under the limit, but a 100k/day workflow with the same wait would accumulate 3M and exceed it. Check your current limit in Service Quotas (the ListServiceQuotas API).
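  • Express + StartSyncExecution payload limit. Synchronous Express has a 256 KB response limit (same as state I/O). API Gateway sitting in front adds its own 10 MB limit. Consistent with the 256 KB rule above, an oversized result is rejected rather than returned in part, so keep large results in S3 and return a reference.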
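The ResultPath pitfall from the list above is easiest to see side by side. This is a simplified model of the replace-versus-merge behaviour: it covers only `$`, null, and dotted `$.a.b` paths, and `apply_result_path` is a hypothetical helper rather than anything the service exposes.

```python
import copy

def apply_result_path(state_input, result, result_path="$"):
    """Combine a task result with the state input the way ResultPath does."""
    if result_path is None:
        return state_input   # null: discard the result, keep the input
    if result_path == "$":
        return result        # default: the result replaces the input
    if not result_path.startswith("$."):
        raise ValueError("only '$', None, or '$.a.b' paths are modelled")
    output = copy.deepcopy(state_input)
    keys = result_path[2:].split(".")
    node = output
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result  # graft the result into the input
    return output

state_input = {"orderId": "abc", "amount": 1500}
task_result = {"paid": True}

replaced = apply_result_path(state_input, task_result)
merged = apply_result_path(state_input, task_result, "$.taskOutput")

print(replaced)
print(merged)
```

In `replaced`, `orderId` and `amount` are gone, which is how a state machine can succeed with wrong data. In `merged`, they survive alongside `taskOutput`.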
