Lambda execution.
A function-as-a-service that runs your code inside a Firecracker microVM allocated on demand. The interesting parts are the lifecycle (init → warm pool → reused → eventually retired), the cold-start phase that adds latency on first invoke, and the levers that pin down tail latency — provisioned concurrency, SnapStart, response streaming, the VPC-attach cost.
1 · What Lambda actually is (and isn't)
The mental model that survives every conversation: Lambda is a function-as-a-service that ships your handler bytes into a private microVM, runs it once per concurrent request, and tears the VM down later when nobody's calling. You don't pick a server; you don't pick a port; you don't keep state between invocations. The platform decides where, when, and how many environments to run. You write a single function and get billed per-millisecond of execution.
What Lambda isn't: a server. There is no long-running process you can attach a debugger to. There is no shared memory between concurrent invocations — two requests in flight at the same time are running on two separate microVMs, and a global variable mutated by one is invisible to the other. There is no shell, no cron daemon, no listen socket your code controls. Treating Lambda as "a tiny EC2 instance" is the most common source of "why doesn't my background thread keep working / why did my in-memory cache disappear" puzzlement when teams first reach for it.
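The "no shared memory" point can be illustrated locally. The sketch below is a toy model, not Lambda itself: `new_environment` is a hypothetical helper that loads the same handler source into two separate module namespaces, standing in for two execution environments that each hold their own copy of module-level state.

```python
import types

HANDLER_SOURCE = '''
request_count = 0  # module-level state: lives only as long as this environment

def handler(event, context):
    global request_count
    request_count += 1
    return {"seen_by_this_environment": request_count}
'''

def new_environment(name):
    """Hypothetical stand-in for a fresh execution environment with its own globals."""
    module = types.ModuleType(name)
    exec(HANDLER_SOURCE, module.__dict__)
    return module

env_a = new_environment("env_a")
env_b = new_environment("env_b")

first = env_a.handler({}, None)   # first request lands on environment A
second = env_a.handler({}, None)  # warm reuse of A: the counter carried over
other = env_b.handler({}, None)   # concurrent request landed on B: counter starts fresh

print(first, second, other)
```

Environment A reports 1 and then 2, while environment B reports 1: state survives warm reuse of one environment but is never visible to another.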
Lambda also has two distinct invocation modes with different durability and retry semantics. Synchronous invokes (API Gateway, ALB, direct SDK call) return the function's response to the caller; failures bubble back to the caller; no automatic retry. Asynchronous invokes (S3 events, SNS, EventBridge, SDK with InvocationType=Event) place the event onto an internal queue Lambda manages; Lambda retries failed invokes twice with backoff, then sends to a configured DLQ or on-failure destination. The function code is identical; the failure-handling story is completely different.
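The difference in failure handling can be sketched as a small simulation. This is a local model only, with no AWS calls: `deliver_sync` and `deliver_async` are hypothetical names, backoff delays are omitted, and the "dead-lettered" result stands in for a DLQ or on-failure destination.

```python
def deliver_sync(event, handler):
    """Sync model: one attempt; any exception goes straight back to the caller."""
    return handler(event)

def deliver_async(event, handler, max_retries=2):
    """Async model: one initial attempt plus max_retries retries, then dead-letter.

    Backoff between attempts is left out to keep the sketch short.
    """
    attempts = 0
    last_error = None
    for _ in range(1 + max_retries):
        attempts += 1
        try:
            result = handler(event)
            return {"status": "ok", "attempts": attempts, "result": result}
        except Exception as exc:
            last_error = exc
    return {"status": "dead-lettered", "attempts": attempts, "error": str(last_error)}

def always_fails(event):
    raise RuntimeError("downstream unavailable")

outcome = deliver_async({"id": 1}, always_fails)
print(outcome)
```

The same failing handler raises to the caller under `deliver_sync` but ends up dead-lettered after three attempts under `deliver_async`.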
| Lambda is good for | Lambda is bad for |
|---|---|
| Spiky, event-driven workloads (S3 thumbnail, webhook handler, scheduled job) | Sustained > 50% utilisation 24/7 (Fargate or EC2 is cheaper above some break-even) |
| Glue code that wires AWS services together | Anything that needs a TCP listen socket (no inbound networking; only request-driven) |
| Bursty traffic where you don't want to manage capacity | Workloads sensitive to single-digit-ms p99 latency on the first request (cold starts add 100ms–2s) |
| Per-request work bounded under 15 minutes | Long jobs (use Step Functions, ECS, or AWS Batch) |
| Stateless request/response handlers | Anything assuming shared in-memory state across invocations (use ElastiCache or DynamoDB) |
2 · How Lambda is built — Firecracker microVMs
Lambda launched in 2014 using Linux containers (cgroups and namespaces) to isolate functions from each other, with a separate EC2 virtual machine per customer account as the hard boundary between tenants, as the Firecracker paper recounts. Containers alone were not trusted as a multi-tenant boundary (a kernel exploit in one container could in principle reach the host), and one VM per account packed the fleet poorly, so around 2017 AWS started building a replacement. The result was Firecracker, open-sourced in 2018 and described in the NSDI 2020 paper "Firecracker: Lightweight Virtualization for Serverless Applications". Lambda transitioned all functions to Firecracker over 2018–2020; Fargate followed.
Firecracker is a KVM-based virtual machine monitor (VMM) written in Rust, deliberately minimal — it has no PCI, no USB, no graphics, no BIOS, no full device emulation. It exposes a guest with a serial console, a network tap, and a virtio block device, nothing else. The result: a microVM boots in ~125 ms, takes ~5 MB of memory overhead for the VMM, and gives you hardware-level isolation backed by Intel VT-x / AMD-V — the same boundary that separates EC2 instances from each other.
A single worker host can pack thousands of microVMs. The scheduler — Lambda's Worker Manager and Placement Service, described in the NSDI paper — bin-packs function executions across the fleet, isolating customers by both VMM and underlying KVM enforcement. When you invoke a function, the router finds (or creates) a worker host with capacity, asks Firecracker to launch a microVM, and hands the request to your runtime. When the function is idle long enough, the microVM is torn down and the slot is freed for the next workload.
Two practical consequences fall out of this shape: any concurrent invocation is its own VM (no shared memory, full isolation), and the per-VM overhead is the floor of the cold start — you can't get below microVM boot time without a snapshot trick, which is what SnapStart provides.
3 · The execution model
Every Lambda invocation runs inside an execution environment — a Firecracker microVM with your code, dependencies, and the AWS Lambda runtime. When a request arrives and no warm environment is available, Lambda creates a new one:
1. Download code from the function's location (S3 internally), unpack it.
2. Start the runtime: Node, Python, Java JVM, etc.
3. Run init code: anything outside your handler function. This is where SDK clients get created, DB connections opened, config parsed.
4. Run the handler with the event.
Steps 1–3 are the cold start. After the handler returns, the environment is typically kept warm for a while (5–15 minutes is commonly observed, but AWS makes no guarantee); subsequent invokes skip 1–3 entirely. Lambda will spin up multiple environments in parallel as concurrency rises: there is no "one Lambda function instance," there are however many warm environments are needed.
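The routing rule described above (reuse an idle warm environment if one exists, otherwise create a new one) can be sketched as a toy simulator. `WarmPool`, `Environment`, and the 600-second `idle_ttl` are illustrative assumptions, not Lambda's real scheduler or a documented timeout, and the time spent in the cold start itself is not modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Environment:
    env_id: int
    busy_until: float  # time at which the current (or last) invocation finishes

@dataclass
class WarmPool:
    idle_ttl: float = 600.0  # assumed keep-warm window in seconds
    environments: list = field(default_factory=list)
    cold_starts: int = 0
    next_id: int = 0

    def invoke(self, arrival: float, duration: float) -> int:
        """Route one request arriving at `arrival`; return the id of the environment used."""
        # Retire environments that have been idle longer than the keep-warm window.
        self.environments = [
            env for env in self.environments
            if arrival - env.busy_until <= self.idle_ttl
        ]
        # Reuse an idle warm environment if one exists: no cold start.
        for env in self.environments:
            if env.busy_until <= arrival:
                env.busy_until = arrival + duration
                return env.env_id
        # Every environment is busy (or none exist): create one, which is a cold start.
        env = Environment(env_id=self.next_id, busy_until=arrival + duration)
        self.next_id += 1
        self.cold_starts += 1
        self.environments.append(env)
        return env.env_id

pool = WarmPool()
burst = [pool.invoke(arrival=0.0, duration=1.0) for _ in range(3)]  # three concurrent requests
later = pool.invoke(arrival=5.0, duration=1.0)                      # pool is idle again
stale = pool.invoke(arrival=2000.0, duration=1.0)                   # everything idled out
print(burst, later, stale, pool.cold_starts)  # [0, 1, 2] 0 3 4
```

Three simultaneous requests cost three cold starts, the later request reuses environment 0, and a request after a long quiet period pays a fourth cold start.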
4 · Cold start anatomy
"Cold start" is one label for four sequential phases, each with its own contributors, plus two optional surcharges (VPC attach and container images). Following one request all the way through:
| Phase | Typical time | What's happening |
|---|---|---|
| Container provisioning | 50–200 ms | Firecracker microVM cloned from a snapshot |
| Code download & unpack | 10–100 ms (zip) · 200ms–2s (image) | Faster for small zips, slower for container images, very slow for > 50 MB |
| Runtime init | 50–500 ms | JVM start is the worst offender; Node and Python ~50 ms |
| Init code | 0–unbounded (10s hard cap) | Your code — SDK clients, secret fetch, DB connect |
| + VPC attach | + 0 ms (modern) — was 1-10s pre-2019 | ENI is now pre-attached; cost is the ENI itself |
| + Container image first invoke | + ~2x the zip equivalent | Subsequent cold starts faster — image layers cached on the worker fleet |
- Memory drives CPU. Lambda's CPU scales linearly with memory allocation. Setting memory to 512 MB gives ~30% of a vCPU; 1769 MB gives a full vCPU. Often paying for more memory makes cold start cheaper overall because init runs faster.
- Layers are extra zip files mounted at /opt. Use them for shared dependencies across functions. Layers are downloaded as part of cold start; large layers slow it down.
- Container image functions support images up to 10 GB; cold start is slower than zip (extra fetch + unpack), but the size limit unlocks ML workloads. AWS optimises hot-set caching of the image layers, so subsequent cold starts within the same account get faster. The USENIX ATC 2023 paper "On-demand Container Loading in AWS Lambda" describes the chunk-level deduplication that makes 10 GB images viable on serverless at all.
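The arithmetic behind the phase table and the "memory drives CPU" rule can be worked through in a few lines. This is a rough sketch: the phase ranges are copied from the table for a zip package, and `scaled_init_ms` assumes init is CPU-bound and single-threaded, which will not hold for init code that mostly waits on the network.

```python
FULL_VCPU_MB = 1769  # memory setting the text associates with one full vCPU

# Typical (low, high) millisecond ranges from the phase table, zip package.
PHASES_MS = {
    "microVM provisioning": (50, 200),
    "code download and unpack": (10, 100),
    "runtime init": (50, 500),
}

def vcpu_share(memory_mb: int) -> float:
    """Approximate vCPU share under the linear memory-to-CPU model."""
    return memory_mb / FULL_VCPU_MB

def cold_start_range_ms(init_code_ms: float) -> tuple:
    """Add your own init-code time to the platform phases to get a (low, high) estimate."""
    low = sum(lo for lo, _ in PHASES_MS.values()) + init_code_ms
    high = sum(hi for _, hi in PHASES_MS.values()) + init_code_ms
    return low, high

def scaled_init_ms(init_ms_at_full_vcpu: float, memory_mb: int) -> float:
    """Assumption: CPU-bound, single-threaded init slows in proportion to the CPU share."""
    return init_ms_at_full_vcpu / min(vcpu_share(memory_mb), 1.0)

print(round(vcpu_share(512), 2))         # 0.29
print(cold_start_range_ms(300))          # (410, 1100)
print(round(scaled_init_ms(300, 512)))   # 1037
```

Under these assumptions, init code that takes 300 ms on a full vCPU takes roughly a second at 512 MB, which is why raising memory can shorten cold starts.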
5 · SnapStart — boot from a memory snapshot
JVM cold starts hurt: a Spring Boot Lambda routinely takes 3–6 seconds from request arrival to first byte of response, because the JVM has to load and verify thousands of classes, warm up the JIT, and let your app run its constructors. SnapStart (announced at re:Invent 2022, originally Java-only; Python and .NET added in 2024) cuts that to ~100–300 ms for the common case.
The mechanism: when you publish a new function version with SnapStart enabled, Lambda pre-runs the init phase in a Firecracker microVM, then takes a snapshot of the whole microVM (memory and disk state, comparable in spirit to a CRIU checkpoint but at the VM level) that captures kernel pages, runtime heap, your initialised SDK clients, and JIT-compiled code. The snapshot is encrypted and cached on dedicated infrastructure. At cold-start time, Lambda restores the snapshot into a fresh microVM instead of repeating the init work. The microVM resumes execution at the point right after your init code returned, at the cost of one snapshot read plus a few hundred milliseconds of restore overhead.
| Behaviour | Without SnapStart | With SnapStart |
|---|---|---|
| Cold start (Spring Boot Java) | 3,000–6,000 ms | 200–400 ms |
| Cold start (Python with heavy deps like boto3 + pandas) | 800–1500 ms | ~150 ms |
| Init code runs | Every cold start | Once per version publish |
| Charge model | Compute during init + invoke | Java: no extra charge. Python/.NET: a charge for caching the snapshot plus a charge each time it is restored. |
| Versioning | Aliases point to live code | Snapshot taken when a version is published; aliases point at an already-snapshotted version |
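The "init code runs once per version publish" row has a side effect that a toy model makes visible. The sketch below uses `copy.deepcopy` as a stand-in for snapshot and restore; `InitState` and its `after_restore` method are hypothetical names for illustration and are not the SnapStart runtime hook API.

```python
import copy
import os
import random

class InitState:
    """State built during init; in the SnapStart model this ends up in the snapshot."""

    def __init__(self):
        self.rng = random.Random()  # seeded once, at init time

    def after_restore(self):
        # Hypothetical hook body: re-seed from the OS so each restored copy diverges.
        self.rng.seed(os.urandom(32))

    def new_id(self) -> str:
        return f"{self.rng.getrandbits(64):016x}"

snapshot = InitState()  # init runs once, at "publish" time

# Two restores from the same snapshot with no hook: both copies share RNG state.
a, b = copy.deepcopy(snapshot), copy.deepcopy(snapshot)
duplicate_ids = a.new_id() == b.new_id()

# Two restores that re-seed after restore: the generated IDs diverge.
c, d = copy.deepcopy(snapshot), copy.deepcopy(snapshot)
c.after_restore()
d.after_restore()
distinct_ids = c.new_id() != d.new_id()

print(duplicate_ids, distinct_ids)
```

Without the re-seed step, every restored copy hands out the same "random" ID first; re-seeding after restore is the kind of work the runtime hooks exist for.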
Anything initialised before the snapshot, such as a SecureRandom instance, is captured in the snapshot and reused across every restore. The classic failures: thousands of "concurrent" Lambda invocations all using the same RNG seed and producing identical "random" IDs; reused DB connections whose underlying TCP socket was torn down between snapshot and restore. AWS exposes beforeCheckpoint and afterRestore runtime hooks specifically so you can close-and-reopen these resources around the snapshot boundary. Read "SnapStart and uniqueness" before enabling it on anything that signs, encrypts, or talks to a stateful backend.
6 · Concurrency — reserved, provisioned, and the account quota
- Reserved concurrency. Caps the maximum parallel executions for one function. Protects downstream systems (databases, third-party APIs) from a fanout storm. Setting it to 0 effectively disables the function — useful in incidents ("kill switch" for a misbehaving consumer). Reserved capacity is carved out of the account quota; setting reserved=100 on one function reduces the available pool for every other function by 100.
- Provisioned concurrency. Pre-warm N execution environments and keep them alive 24/7. Cold start = 0 for those N. You pay an hourly rate per provisioned environment whether it's serving traffic or not (roughly $0.015 / GB-hour on x86 in us-east-1). Use for latency-sensitive endpoints with strict tail SLOs — synchronous APIs where p99 matters. The break-even versus on-demand depends on utilisation: if your provisioned slot is busy >50% of the time, it's also cheaper than on-demand.
- Account quota. Each region has a per-account concurrent execution quota (default 1,000). Beyond that, sync invokes return TooManyRequestsException (429), and throttled async events go back onto Lambda's internal queue and are retried for up to 6 hours by default before being dropped or sent to the failure destination. Raising the quota is a free support ticket or Service Quotas request; AWS approves into the tens of thousands for legitimate workloads without much friction.
- Burst quotas. Within the account quota, Lambda will scale up by up to 1,000 new execution environments every 10 seconds per function (region-dependent; some regions allow 500/10s) before throttling kicks in. The "Lambda concurrency scaling" change announced around re:Invent 2023 raised this from the older 500/min burst, but spikes faster than 1,000 new envs per 10 seconds still need warm pools.
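The quota and burst numbers above translate into two quick estimates. This is a back-of-the-envelope sketch: `required_concurrency` applies Little's law, and `ramp_seconds` treats the burst rate as whole 10-second steps, which is a simplification of however Lambda actually meters scaling.

```python
import math

def required_concurrency(requests_per_second: float, avg_duration_s: float) -> int:
    """Little's law: requests in flight = arrival rate x time each one spends running."""
    return math.ceil(requests_per_second * avg_duration_s)

def ramp_seconds(target: int, already_warm: int = 0, step: int = 1000, interval_s: int = 10) -> int:
    """Rough time to add the missing environments at `step` new environments per interval."""
    missing = max(target - already_warm, 0)
    return math.ceil(missing / step) * interval_s

def throttled_excess(required: int, account_quota: int = 1000) -> int:
    """Concurrency that would be throttled at the given account quota."""
    return max(required - account_quota, 0)

need = required_concurrency(requests_per_second=7000, avg_duration_s=0.5)
print(need)                                   # 3500
print(ramp_seconds(need))                     # 40
print(ramp_seconds(need, already_warm=1000))  # 30
print(throttled_excess(need))                 # 2500
```

A 7,000 rps spike at 500 ms per request needs about 3,500 environments, takes on the order of 40 seconds to reach from cold, and exceeds the default quota by 2,500 unless the quota has been raised.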
7 · Lambda vs Fargate vs ECS vs EC2
Lambda is one of four AWS compute primitives that ship roughly the same workload — your code, in a Linux process, somewhere. They differ in billing granularity, cold-start cost, max duration, and who manages the substrate.
| | Lambda | Fargate | ECS (EC2) | EC2 |
|---|---|---|---|---|
| Billing | Per-ms of execution + per-invoke | Per-second while task running | Per-second of the EC2 host (regardless of task density) | Per-second of the instance |
| Cold start | 100ms–2s (mitigatable) | ~30–60 s for first task | ~10–30 s for first task; instant if scheduler has space | Whatever your AMI takes to boot |
| Max duration | 15 minutes | Unbounded | Unbounded | Unbounded |
| Inbound networking | None (only request-driven) | Yes (own ENI) | Yes | Yes |
| Substrate | Firecracker microVMs (managed) | Firecracker microVMs (managed) | EC2 + Docker (you manage hosts) | Full VM (you manage everything) |
| Reach for it when | Sporadic, event-driven, short jobs | Long-running containers without cluster management | You want bin-packing, GPU, custom kernel | You need control of the host (specific kernel, hugepages, drivers, GPU, FPGA) |
| Break-even (rough) | < 50% sustained utilisation | 50–70% sustained utilisation | 70%+ sustained utilisation | Spot pricing or specialised hardware |
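The break-even row reduces to one division once you have two prices. The sketch below is generic: the dollar figures are placeholder values chosen for illustration, not quoted AWS prices, and per-request charges, data transfer, and idle over-provisioning are ignored.

```python
def break_even_utilisation(on_demand_per_busy_hour: float, always_on_per_hour: float) -> float:
    """Utilisation above which an always-on unit costs less than paying per use."""
    return always_on_per_hour / on_demand_per_busy_hour

def cheaper_option(utilisation: float, on_demand_per_busy_hour: float, always_on_per_hour: float) -> str:
    """Compare one hour of pay-per-use at the given utilisation with one always-on hour."""
    on_demand_cost = utilisation * on_demand_per_busy_hour
    return "on-demand" if on_demand_cost < always_on_per_hour else "always-on"

# Placeholder prices: $0.06 per fully busy hour on demand, $0.03 per hour always-on.
ON_DEMAND = 0.06
ALWAYS_ON = 0.03

print(break_even_utilisation(ON_DEMAND, ALWAYS_ON))  # 0.5
print(cheaper_option(0.2, ON_DEMAND, ALWAYS_ON))     # on-demand
print(cheaper_option(0.8, ON_DEMAND, ALWAYS_ON))     # always-on
```

Plugging in current prices for your region and memory size gives the utilisation at which moving off Lambda starts to pay.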
8 · Limits and gotchas
| Limit | Value | Notes |
|---|---|---|
| Max execution time | 15 minutes | Hard cap. Long jobs → Step Functions or ECS. |
| Memory | 128 MB – 10,240 MB | In 1 MB increments. |
| Ephemeral /tmp | 512 MB – 10 GB | Configurable; persists across invokes in the same environment. |
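| Payload sync invoke | 6 MB | Applies to the request and to the response separately (6 MB each). |
| Payload async invoke | 256 KB | S3 / SNS / EventBridge invokes are async. SQS goes through an event source mapping that invokes the function synchronously. |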
| Response streaming | 20 MB | Stream chunks instead of buffering — for long responses (LLM tokens). |
| Concurrent executions | 1,000 default, raisable | Account-wide unless reserved/provisioned set per function. |
| Function package (zip) | 50 MB direct, 250 MB via S3, unzipped 250 MB | Container image is 10 GB. |
| Layers per function | 5 | Layers count toward the 250 MB unzipped limit. |
| Environment variables | 4 KB total | Move large config to Parameter Store / Secrets Manager. |
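A cheap guard against the payload limits is to measure the serialized size before invoking. This is a minimal sketch that assumes the limits in the table are binary megabytes and kilobytes and that the payload is JSON; `fits` is a hypothetical helper name.

```python
import json

SYNC_PAYLOAD_LIMIT = 6 * 1024 * 1024  # 6 MB, from the limits table (assumed binary MB)
ASYNC_PAYLOAD_LIMIT = 256 * 1024      # 256 KB, from the limits table

def payload_bytes(obj) -> int:
    """Size of the payload as it would go over the wire, in bytes."""
    return len(json.dumps(obj).encode("utf-8"))

def fits(obj, invocation_type: str = "RequestResponse") -> bool:
    """True if one serialized payload is within the limit for the invocation type."""
    limit = SYNC_PAYLOAD_LIMIT if invocation_type == "RequestResponse" else ASYNC_PAYLOAD_LIMIT
    return payload_bytes(obj) <= limit

event = {"records": "x" * 300_000}
print(payload_bytes(event))   # 300015
print(fits(event))            # True
print(fits(event, "Event"))   # False
```

When the check fails, the usual move is to store the body in S3 and pass only the object key in the event.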
9 · Real-world case studies
Three publicly documented Lambda deployments give a sense of where the model fits and how teams structure code around it.
Coca-Cola — connected vending machines. Coca-Cola Freestyle and the company's newer connected dispensers publish telemetry to AWS and accept remote command from a Lambda-fronted backend. The published AWS case study describes the move from a traditional EC2-based dispenser backend to a Lambda + API Gateway + DynamoDB stack, with the headline number being a ~65% reduction in cost-per-transaction. The architectural shape that makes it work: each dispenser transaction is a short, stateless event — pour a drink, log the recipe, decrement inventory — exactly the request shape Lambda is built for. The company didn't move everything to Lambda; the analytics and recipe-generation pipelines stayed on bigger compute. The lesson is in the partition: edge interactions → Lambda; long-running data work → elsewhere.
iRobot — Roomba IoT backend. When iRobot launched the connected Roomba 980 in 2015, they built the entire cloud backend on Lambda + IoT Core + DynamoDB, covered in a re:Invent talk and the AWS case study. By 2017 they were processing tens of millions of events per day across millions of devices with a team of a few engineers — the explicit reason serverless was chosen over a Kubernetes or EC2 fleet was operational burden. Their architecture: each robot publishes MQTT messages to IoT Core; IoT rules invoke Lambdas; Lambdas write state to DynamoDB and publish to other Lambdas via SNS/SQS for downstream work. The shape — "tens of thousands of small functions wired together by events" — is what serverless at scale actually looks like in production. iRobot's CTO has repeatedly noted that they prioritised function-per-task granularity over the "lambdalith" anti-pattern of stuffing one giant Express app into a Lambda.
Bustle Digital Group — entire CMS on Lambda. Bustle (which owns Bustle, Elite Daily, Romper, and other titles) re-platformed its publishing pipeline onto a Lambda + DynamoDB + GraphQL stack in 2017, documented in their AWS case study and a re:Invent talk by then-CTO Tyler Love. The serving side runs entirely on Lambda behind API Gateway + CloudFront; editorial workflows publish to S3, which triggers Lambda functions that materialise GraphQL responses into DynamoDB. They reported running the whole site on a serverless stack at a fraction of their prior EC2 spend, and — more interestingly for the architecture argument — with no on-call rotation for scaling: traffic spikes during viral stories are handled by Lambda's autoscaler with no human in the loop. The piece worth stealing is the cache hierarchy: CloudFront does ~95% of the work; Lambda + DynamoDB only see the misses.
The through-line: Lambda is most useful when treated as the glue between event sources and storage, not as a general application platform. The Coca-Cola, iRobot, and Bustle architectures all push request volume to S3 / DynamoDB / CloudFront where possible and reserve Lambda for the small per-request bit of logic that has to run somewhere.
10 · Build it yourself — Lambda with cold-start observation
- Create the function.

      cat > /tmp/index.py <<'EOF'
      import time, os, json

      INIT_TS = time.time()

      def handler(event, context):
          return {
              "init_ts": INIT_TS,
              "now": time.time(),
              "container_id": os.environ.get("AWS_LAMBDA_LOG_STREAM_NAME")
          }
      EOF
      cd /tmp && zip fn.zip index.py
      ROLE_ARN=$(aws iam create-role --role-name lab-lambda \
        --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}' \
        --query 'Role.Arn' --output text)
      aws iam attach-role-policy --role-name lab-lambda \
        --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      sleep 10  # IAM eventual consistency
      aws lambda create-function --function-name lab-fn \
        --runtime python3.12 --role "$ROLE_ARN" \
        --handler index.handler --zip-file fileb:///tmp/fn.zip \
        --memory-size 512 --timeout 30

- Invoke once and observe the cold start.

      aws lambda invoke --function-name lab-fn --invocation-type RequestResponse /tmp/out.json
      cat /tmp/out.json
      # init_ts = container init time; now = invoke time. Difference ~milliseconds on first invoke.

- Invoke 10 times in parallel and observe multiple container_ids.

      for i in $(seq 1 10); do
        aws lambda invoke --function-name lab-fn /tmp/out-$i.json &
      done
      wait
      cat /tmp/out-*.json | jq -r '.container_id' | sort -u
      # Multiple log streams = multiple environments created in parallel; each was a cold start.

- Invoke serially and observe warm reuse.

      for i in $(seq 1 5); do
        aws lambda invoke --function-name lab-fn /tmp/warm-$i.json
        cat /tmp/warm-$i.json | jq -r '.container_id'
      done
      # Same container_id = same warm environment, no cold starts after the first.

- Enable provisioned concurrency.

      aws lambda publish-version --function-name lab-fn
      aws lambda put-provisioned-concurrency-config --function-name lab-fn \
        --qualifier 1 --provisioned-concurrent-executions 2

- Tear down.

      aws lambda delete-function --function-name lab-fn
      aws iam detach-role-policy --role-name lab-lambda \
        --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      aws iam delete-role --role-name lab-lambda
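Cold starts from the lab can also be read off the function's logs, since a REPORT line carries an Init Duration field only when the invocation was a cold start. The parser below is a small sketch; the two sample lines are made-up examples in the REPORT format, not captured output.

```python
import re

# Matches the millisecond fields of a REPORT line. The match is taken at the
# leftmost position, so "Billed Duration" and "Init Duration" are picked up whole.
REPORT_FIELD = re.compile(r"(Duration|Billed Duration|Init Duration): ([\d.]+) ms")

def parse_report(line: str) -> dict:
    """Return the duration fields of one REPORT line as floats (milliseconds)."""
    return {name: float(value) for name, value in REPORT_FIELD.findall(line)}

def is_cold_start(line: str) -> bool:
    """A REPORT line describes a cold start when it includes Init Duration."""
    return "Init Duration" in parse_report(line)

# Hypothetical sample lines for illustration.
cold = ("REPORT RequestId: 11111111-2222-3333-4444-555555555555\t"
        "Duration: 12.34 ms\tBilled Duration: 13 ms\tMemory Size: 512 MB\t"
        "Max Memory Used: 36 MB\tInit Duration: 162.48 ms")
warm = ("REPORT RequestId: 66666666-7777-8888-9999-000000000000\t"
        "Duration: 1.20 ms\tBilled Duration: 2 ms\tMemory Size: 512 MB\t"
        "Max Memory Used: 36 MB")

print(parse_report(cold))
print(is_cold_start(cold), is_cold_start(warm))  # True False
```

Counting lines where `is_cold_start` is true over a log export gives the same number as counting distinct log streams, with init time attached.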
The container_id field in the lab is AWS_LAMBDA_LOG_STREAM_NAME: each execution environment writes to its own log stream. Count distinct log stream names across N parallel invokes = number of cold starts. CloudWatch also exposes the Init Duration field on cold-start log entries (REPORT line).
11 · What breaks
- Cold-start spikes during traffic bursts. When traffic rises faster than the warm pool can absorb, every extra concurrent request lands on a brand-new environment and pays the full cold start. Use provisioned concurrency, SnapStart, or shape traffic via SQS to smooth it out.
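- Connection storms to RDS. Each cold environment opens its own DB connection. 1,000 concurrent invokes = 1,000 connections, and the database's connection ceiling depends on instance class (on the order of ~5,000 for large Aurora instances, far less for small ones). Use RDS Proxy, or use HTTP-shaped DynamoDB instead.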
- Timeout in init code. Init code has a 10-second deadline by default. SDK clients that hang on a misconfigured region or VPC route silently kill cold starts. Always set explicit timeouts on init-time HTTP calls.
- "My function ran 14:59 and then died." 15 minute hard limit. Long workflows belong in Step Functions; long batches in ECS / Batch.
- 6 MB payload limit on sync invokes. The limit applies to the request and to the response individually. The 5 MB JSON response that worked yesterday breaks the day a new field pushes it past 6 MB. For larger payloads: pre-sign an S3 URL and pass the key, or switch to response streaming (20 MB ceiling, no first-byte buffer).
- 250 MB unzipped deployment package. Includes function code and all layer contents combined. A pandas + numpy + scipy + matplotlib bundle blows through it. Either trim to what you need (no-cuda Torch, no-tests pandas), build a container image (10 GB ceiling), or split the function.
- Default 1,000 concurrent executions per region per account. A noisy neighbour function in the same account can eat the whole quota and throttle everything else. Carve reserved concurrency on critical functions; alarm on ConcurrentExecutions approaching the quota.
- Provisioned concurrency bills while idle. A team forgets they enabled 100 provisioned slots during a launch; the bill keeps coming after traffic dies down. Use Application Auto Scaling to scale provisioned concurrency on schedule (down at night, up during business hours).
- Container image cold start ~2x the zip equivalent on first invoke. AWS does aggressive chunk-level caching across the worker fleet, but the very first invocation after deploy pulls the unique image bytes. Stagger deploys; warm with synthetic invokes before traffic hits.
- /tmp default is 512 MB. Now configurable up to 10 GB, but the default still bites teams unpacking models or processing video. Set it explicitly in IaC; remember that /tmp persists across invokes in the same warm environment and the next caller can see your previous run's leftover files.
- X-Ray trace ID confusion. When SQS/EventBridge invokes Lambda, the trace ID can be lost across the boundary unless you propagate it explicitly. Wire trace context through the message body if you need end-to-end tracing.
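Carrying trace context in the message body, as the last item suggests, can be as simple as an envelope. This is one possible shape, not an AWS-defined format: the `trace_header` and `payload` field names are invented for the example, and the sketch assumes the current trace header is readable from the `_X_AMZN_TRACE_ID` environment variable inside the producing function.

```python
import json
import os

TRACE_ENV_VAR = "_X_AMZN_TRACE_ID"  # trace header for the current invocation

def wrap_with_trace(payload, environ=os.environ) -> str:
    """Producer side: put the current trace header next to the payload."""
    return json.dumps({"trace_header": environ.get(TRACE_ENV_VAR), "payload": payload})

def unwrap(message_body: str):
    """Consumer side: recover (trace_header, payload) from the envelope."""
    envelope = json.loads(message_body)
    return envelope.get("trace_header"), envelope["payload"]

# Simulated producer environment; the header value is a made-up example.
fake_env = {TRACE_ENV_VAR: "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"}
body = wrap_with_trace({"order_id": 42}, environ=fake_env)
trace_header, payload = unwrap(body)
print(trace_header)
print(payload)  # {'order_id': 42}
```

The consumer can then attach the recovered header to its own trace segment or log it alongside the request ID so the two halves can be joined later.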
12 · Further reading
- Lambda execution environments. The official reference on the lifecycle.
- "Firecracker: Lightweight Virtualization for Serverless Applications" (NSDI 2020). The AWS paper on Firecracker's design and Lambda's worker fleet.
- "On-demand Container Loading in AWS Lambda" (USENIX ATC 2023). How Lambda makes 10 GB container images cold-start-fast via chunk deduplication.
- Lambda SnapStart. The official guide; pay attention to the uniqueness-token caveats.
- iRobot case study. Roomba's IoT backend on Lambda from launch.
- Bustle on AWS. Entire publishing platform on serverless.
- Cloud compute (concepts). When serverless is the right shape vs containers vs VMs.