Containers.

Two orchestrators wearing two skins. ECS is AWS's own scheduler: simple, IAM-native, the path of least surprise inside one cloud. EKS is Kubernetes the way AWS runs it: portable, complex, with a whole control plane you don't manage. Underneath both: Fargate (you give it a task, it gives you a container) or EC2 (you give it nodes, it packs containers onto them).

1 · What container orchestration actually does

Strip the marketing away and a container orchestrator solves four problems.
The scheduler problem. Given a heap of containers (with CPU, memory, GPU, port requirements) and a heap of hosts (with capacity), decide which container goes where. This is bin-packing, NP-hard in the general case; production schedulers use online heuristics (best-fit, spread, anti-affinity rules).
The controller problem. The user says "I want 8 copies of this thing running"; the system observes "5 are running" and starts 3 more. Forever. This is the reconciliation loop, the heart of Kubernetes' design and ECS's services.
Service discovery. When container A wants to call container B, how does A find a healthy B? Answers range from DNS records to mesh sidecars to load balancers in front of stable VIPs.
The networking model. How do containers get IPs, route to each other, and reach the outside world without colliding on ports?
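To make the scheduler problem concrete, here is a minimal best-fit sketch in Python. It is not how ECS or kube-scheduler are implemented; the `Host` type, the `place_best_fit` name, and the choice to rank hosts by leftover memory are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    cpu_free: int  # CPU units still unallocated
    mem_free: int  # MiB still unallocated
    tasks: list = field(default_factory=list)


def place_best_fit(hosts, task_name, cpu, mem):
    """Place one task on the feasible host that would be left with the least free memory.

    Returns the chosen host name, or None if no host has room.
    """
    feasible = [h for h in hosts if h.cpu_free >= cpu and h.mem_free >= mem]
    if not feasible:
        return None
    best = min(feasible, key=lambda h: h.mem_free - mem)
    best.cpu_free -= cpu
    best.mem_free -= mem
    best.tasks.append(task_name)
    return best.name


hosts = [Host("a", 2048, 4096), Host("b", 1024, 1024)]
first = place_best_fit(hosts, "web-1", 256, 512)     # tightest fit is host b
second = place_best_fit(hosts, "web-2", 1024, 2048)  # b no longer has the CPU, so a
third = place_best_fit(hosts, "big", 4096, 8192)     # nothing fits
```

Best-fit packs tightly and keeps large gaps free for large tasks; a spread strategy would instead pick the emptiest host, trading density for blast-radius isolation.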

ECS and EKS solve all four; they solve them differently. ECS uses AWS primitives end-to-end — the scheduler is an AWS-managed control plane, services discover each other via ALB target groups or AWS Cloud Map, networking is VPC ENIs. EKS uses Kubernetes' abstractions — the scheduler is the kube-scheduler running in the EKS control plane, discovery is Kubernetes Services + kube-dns, networking is whatever CNI plugin you've installed (Amazon VPC CNI by default). Same problems, two layers of indirection.

Concern	ECS	EKS (Kubernetes)
Scheduling	AWS-managed; placement strategies are `binpack`, `spread`, `random`	kube-scheduler with pluggable predicates and priorities; affinity/anti-affinity, taints/tolerations
Desired-state controller	ECS Service maintains task count + integrates with ALB	Deployment → ReplicaSet → Pod; many other controller types (StatefulSet, DaemonSet, Job)
Service discovery	ALB target groups; AWS Cloud Map for service-to-service DNS	kube-dns (CoreDNS) — every Service gets `name.ns.svc.cluster.local`
Networking model	awsvpc (every task = ENI = real VPC IP) is required on Fargate and optional on EC2; bridge / host modes on EC2 only	VPC CNI: each pod gets a real VPC IP, assigned as a secondary IP on one of the node's ENIs
Config / secrets	Task definition env vars, Secrets Manager ARNs, SSM parameter ARNs	ConfigMaps, Secrets (base64, not encrypted by default — wire to KMS)
Operational footprint	You write task definitions in JSON; AWS runs everything else	You write YAML, install ingress controllers, manage CNI version, run cluster autoscaler / Karpenter

The honest framing. ECS is what AWS would have built if Kubernetes didn't exist; EKS is AWS shipping Kubernetes because customers asked for it. ECS is simpler, cheaper to operate, and locks you in. EKS is portable across clouds, has the Kubernetes ecosystem, and demands a platform team. Most "should we use ECS or EKS?" debates are really "do we already know Kubernetes?"

2 · The four primitives of ECS

Primitive	What it is
Cluster	A logical grouping. With Fargate, just a name. With EC2 launch type, it's also a pool of registered container instances.
Task definition	The blueprint. JSON spec of containers, image, CPU/memory, environment variables, IAM task role, log driver, port mappings. Immutable — every change is a new revision.
Task	A running instance of a task definition. May contain multiple containers (a pod, in K8s language).
Service	The desired-state controller. "Run N copies of this task definition; replace any that die; register them behind this load balancer." Without a Service you have a one-shot task.

An ECS Service does three things: keeps the right number of tasks running, talks to an ALB target group to register/deregister them, and drives rolling deployments (replacing tasks with the new task definition revision in batches bounded by the service's minimumHealthyPercent and maximumPercent settings).
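The "keep the right number of tasks running" part is a reconciliation loop. One pass of it can be sketched as below; the function name and the rule for choosing which surplus tasks to stop are assumptions, not ECS's actual policy.

```python
def reconcile(desired, running_task_ids):
    """One pass of a desired-state loop.

    Returns (number_of_tasks_to_start, list_of_task_ids_to_stop).
    """
    running = len(running_task_ids)
    if running < desired:
        return desired - running, []
    # Too many (or exactly enough): stop the surplus, here simply the last ones listed.
    return 0, list(running_task_ids[desired:])


to_start, to_stop = reconcile(8, ["t1", "t2", "t3", "t4", "t5"])
scale_in = reconcile(2, ["t1", "t2", "t3"])
```

A real controller runs this repeatedly against fresh observations, which is why a task that dies is replaced without anyone asking.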

3 · ECS scheduler architecture

Getting from "I declared desired count = 8" to "8 containers are serving traffic behind a load balancer" involves some parts AWS runs and some parts you provision.

Two things are worth noticing. First, the scheduler is stateless from your perspective — there's no "ECS master" you provision. AWS runs it, and ECS itself is free; you pay only for the underlying compute (Fargate vCPU-seconds or EC2 hours). Second, the deployment circuit breaker is the safety net that catches stuck deploys: if N consecutive new tasks fail to start or fail health checks, ECS halts the deploy and (optionally) rolls back. It catches consecutive failures only — a deploy that fails 5 times, succeeds once, then fails 5 more in a slow pattern can still drain your healthy capacity over hours.
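The consecutive-failure blind spot is easy to see in a toy model. The sketch below only models the behaviour described in the paragraph above, not ECS's real implementation, and the threshold of 10 is a hypothetical value.

```python
def breaker_trips(outcomes, threshold):
    """Return True if `threshold` failures occur in a row.

    outcomes: iterable of bools, True meaning the new task became healthy.
    """
    streak = 0
    for ok in outcomes:
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False


burst = [False] * 10                   # ten failures back to back
slow = ([False] * 5 + [True]) * 4      # twenty failures, never more than five in a row

burst_trips = breaker_trips(burst, threshold=10)
slow_trips = breaker_trips(slow, threshold=10)
```

The slow pattern fails twice as many tasks as the burst and still never trips, because every success resets the streak.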

4 · Fargate vs EC2 launch type

	Fargate	EC2
You manage	Task definition, service config. Nothing else.	Task definition + the EC2 fleet that runs them (AMIs, scaling, patching).
Billing	Per vCPU-second + per GB-second per task, metered per second with a 1-minute minimum.	Per instance-hour (metered per second on Linux). Tasks share the host, so you pay for capacity, not utilisation.
Cold start	15–60s to pull image and start the task.	~5s on a warm node (image already cached locally).
Networking	Each task gets its own ENI. Counts against subnet IP pool.	Tasks share the node's ENI by default (bridge mode) or get their own (awsvpc mode).
Best for	Bursty workloads, infrequent jobs, small teams.	Steady high-throughput workloads, GPU/large memory, daemon containers (CloudWatch agent on every host).

Fargate is ~3× more expensive than EC2 at full utilisation. Crossover point is usually around 60–70% steady utilisation: below that, Fargate wins on ops simplicity; above that, EC2 (or Fargate Spot) wins on cost. Fargate Spot is ~70% off but can be reclaimed with 2-minute notice — great for batch and CI workers.
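The crossover can be reasoned about with one line of arithmetic. The sketch below assumes a deliberately simple model: EC2 is paid for provisioned capacity whether used or not, Fargate is paid only for what runs, and `fargate_premium` is the Fargate price per unit of capacity divided by the EC2 price. Both function names are made up for illustration.

```python
def breakeven_utilisation(fargate_premium):
    """Utilisation below which Fargate is cheaper than a provisioned EC2 fleet.

    EC2 cost per useful unit is price / utilisation; Fargate cost per useful
    unit is price * premium. They are equal when utilisation = 1 / premium.
    """
    if fargate_premium <= 0:
        raise ValueError("premium must be positive")
    return 1.0 / fargate_premium


def cheaper_option(utilisation, fargate_premium):
    """Pick the cheaper side of the crossover under the simple model above."""
    return "fargate" if utilisation < breakeven_utilisation(fargate_premium) else "ec2"


at_1_5x = breakeven_utilisation(1.5)
at_3x = breakeven_utilisation(3.0)
```

In this model a crossover at 60–70% utilisation corresponds to an effective premium of roughly 1.4–1.7×, while a 3× premium pushes the crossover down to about 33%. Which premium applies depends on the EC2 price you compare against (on-demand, reserved, Spot), so plug in your own numbers.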

5 · Capacity providers

A capacity provider is a strategy for "where do tasks run?" — FARGATE, FARGATE_SPOT, or an EC2 Auto Scaling group. A service can split tasks across providers with weights: "run 1 task on regular Fargate (baseline), then 4-to-1 split between Spot and regular for the rest." This is how you get cost-optimised mixed deployments without writing custom scheduling logic.
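The "1 baseline task, then 4-to-1" example maps onto a strategy with a `base` and two `weight` values. The sketch below approximates how such a strategy divides a desired count; the dictionary keys mirror the ECS capacity provider strategy fields, but the rounding rule (largest fractional share gets the leftover) is an assumption and ECS may round differently.

```python
def split_tasks(desired, strategy):
    """Approximate how many tasks land on each capacity provider.

    strategy: list of dicts with 'capacityProvider', 'weight', optional 'base'.
    """
    counts = {s["capacityProvider"]: 0 for s in strategy}
    remaining = desired
    # Base tasks are satisfied first.
    for s in strategy:
        base = min(s.get("base", 0), remaining)
        counts[s["capacityProvider"]] += base
        remaining -= base
    total_weight = sum(s["weight"] for s in strategy)
    if remaining == 0 or total_weight == 0:
        return counts
    # The rest is divided in proportion to the weights.
    shares = [(s["capacityProvider"], remaining * s["weight"] / total_weight)
              for s in strategy]
    for name, share in shares:
        counts[name] += int(share)
    leftover = remaining - sum(int(share) for _, share in shares)
    by_fraction = sorted(shares, key=lambda x: x[1] - int(x[1]), reverse=True)
    for name, _ in by_fraction[:leftover]:
        counts[name] += 1
    return counts


strategy = [
    {"capacityProvider": "FARGATE", "base": 1, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "weight": 4},
]
eleven = split_tasks(11, strategy)
eight = split_tasks(8, strategy)
```

With 11 tasks, one goes to regular Fargate as the base and the remaining ten split 2 to 8, so a Spot reclaim can never take the service to zero.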

With EC2 capacity providers, ECS drives the Auto Scaling group automatically: when you have pending tasks and not enough capacity, the ASG scales out; when nodes are idle, it scales in. The managed scaling target capacity (100% by default) tells ECS how full to keep the instances; setting it lower, for example 80%, keeps headroom for new tasks.

6 · EKS — Kubernetes the way AWS runs it

EKS is a managed Kubernetes control plane. AWS runs etcd, the API server, the controller manager, and the scheduler across multiple AZs. You pay $0.10/hr per cluster regardless of size. You run the worker nodes (or use Fargate for them).

EKS choice	What it means
Self-managed nodes	You provision EC2 instances, run an AMI with kubelet, join them to the cluster. Most control, most work.
Managed node groups	AWS runs an Auto Scaling group for you; rolling upgrades when you change the AMI. Most teams' default.
Fargate profiles	Pods matching a selector run on Fargate instead of nodes. Per-pod billing, no node management. Limited (no DaemonSets, no privileged pods, no EBS).
EKS Auto Mode	(2024+) AWS provisions and manages compute, networking, storage add-ons. Closest to "Fargate but for whole nodes."

7 · IRSA — pod-level IAM

Without pod-level IAM, every pod on a node can reach the node's instance credentials, so all pods on that node share the node's IAM role and every pod can do everything the node can. That's broken security. IRSA (IAM Roles for Service Accounts) fixes it.

The mechanism: the EKS cluster has an OIDC provider; you create an IAM role with a trust policy that says "trust tokens issued by this OIDC provider with a subject of system:serviceaccount:my-ns:my-sa"; you annotate the Kubernetes ServiceAccount with the role ARN; pods using that SA get a projected JWT, and the AWS SDK exchanges it for STS credentials scoped to that role.
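The trust policy is the piece people most often get wrong, so here is a sketch that builds one. The account ID and OIDC provider ID below are placeholders, and `irsa_trust_policy` is a hypothetical helper, not part of any AWS SDK.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build an IAM trust policy that trusts one Kubernetes ServiceAccount.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{oidc_provider}:sub":
                    f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder values for illustration only.
policy = irsa_trust_policy(
    "111122223333",
    "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    "my-ns",
    "my-sa",
)
print(json.dumps(policy, indent=2))
```

The other half lives in Kubernetes: the ServiceAccount carries an `eks.amazonaws.com/role-arn` annotation pointing at the role that holds this trust policy.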

Result: each pod gets exactly the AWS permissions its workload needs. The newer flavour, EKS Pod Identity (2023+), does the same thing without OIDC — an agent runs on each node and brokers credentials via a local endpoint. Simpler trust model; same end result. Pod Identity is what AWS would build today if starting from scratch; IRSA is what they built in 2019 because OIDC was the standards-based path.

8 · Which compute model — a comparison

Four common combinations, with different operational and cost profiles. The summary:

	ECS Fargate	ECS EC2	EKS managed nodes	EKS Fargate
What you operate	Task defs, services	Task defs, services, EC2 ASG, AMI patching	Cluster, node groups, Helm releases, add-ons	Cluster, Fargate profiles, Helm releases
Cold start	15-60s	~5s on warm node	~5s on warm node	30-90s (pod-by-pod microVM boot)
Scaling unit	Per task	ASG instances, then tasks	ASG / Karpenter nodes, then pods	Per pod
Cost at 70% util	~3× EC2	Baseline	Baseline + $0.10/hr cluster	~3× EC2 + $0.10/hr cluster
DaemonSets / privileged	N/A (no daemonset concept)	Yes	Yes	No
Reach for it when	Bursty, infrequent, small team	Steady high-throughput, GPU, daemons	Kubernetes ecosystem matters; portable	Per-pod isolation; security-sensitive workloads

The cost crossover. Fargate billing is per vCPU-second and per GB-second, with a one-minute minimum per task. EC2 billing is for the instances you provisioned, used or not. The Fargate premium pays for "we run the OS, the patches, the security updates", which is exactly what teams without a dedicated platform engineer want to pay for. Past ~$30k/month of compute, EC2 + autoscaling math starts winning if you have someone to operate it.

9 · Real-world case studies

Four public stories give a sense of how teams actually pick between ECS and EKS at scale.

Lyft — ECS for fleet services, then a long Kubernetes journey. Lyft's engineering blog documents an early-era ECS deployment for the rideshare backend, then a long migration to Kubernetes once their service mesh (Envoy, which originated at Lyft) made the platform team Kubernetes-native anyway. The interesting decision history: ECS was the right call at small scale (one team, AWS-native, no platform engineers), and Kubernetes became the right call once the company had a dedicated platform group and Envoy's ecosystem demanded it. The "switch when your team grows" pattern is repeated across many of these case studies.

Robinhood — EKS migration for compliance and portability. Robinhood's engineering team has published several posts on their move to EKS for stockbroker-grade compliance workloads. The driver isn't cost — Fargate would be more expensive — but the combination of Kubernetes' RBAC model, network policies (Calico), and the ability to run the same manifests in dev / staging / prod via standard CI. For a regulated business, "we can prove this is the same software running here as in the last audit" is worth more than the savings of a less abstract platform.

Pinterest — Kubernetes for ML and data infrastructure. Pinterest's Kubernetes Engineering posts cover building a multi-tenant Kubernetes platform on EKS (and originally self-managed) for their ML training, recommendation serving, and data pipeline workloads. They highlight Karpenter for cost-efficient autoscaling — they replaced Cluster Autoscaler when latency to schedule a new GPU node dropped from minutes to ~30 seconds. The detail worth absorbing: Karpenter is not a small operational improvement; it changes the fundamental economics of GPU autoscaling because you can right-size aggressively without waiting for ASG groups to materialize.

CrowdStrike — ECS Fargate for the agent ingestion fleet. CrowdStrike's published architecture (see their security platform engineering posts and re:Invent talks) puts the agent telemetry ingestion path on ECS Fargate — trillions of events per day flow through Fargate tasks that decode, enrich, and route to downstream Kafka and S3. The reason for Fargate over EKS: every additional layer of abstraction is a potential attack surface in a security product, and Fargate's per-task microVM isolation is a cleaner story for customers than "we run multi-tenant Kubernetes nodes." For security companies, "explainable runtime" beats "operational efficiency" in trade-offs.

The through-line: ECS wins when AWS-native is fine and operational simplicity matters; EKS wins when your platform team is large enough to invest in the abstraction, or when portability across clouds is a hard requirement.

10 · Build it yourself — ECS Fargate from zero

Set up the cluster and roles.
aws ecs create-cluster --cluster-name lab-cluster
# Task execution role (pulls image, writes logs)
TER=$(aws iam create-role --role-name lab-task-exec \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ecs-tasks.amazonaws.com"},"Action":"sts:AssumeRole"}]}' \
  --query 'Role.Arn' --output text)
aws iam attach-role-policy --role-name lab-task-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
# Give IAM a moment to propagate the new role
sleep 10
Register a task definition.
# Create the log group up front: the managed execution-role policy does not grant logs:CreateLogGroup.
aws logs create-log-group --log-group-name /ecs/lab
# awslogs-region below must match the region you are working in.
cat > /tmp/td.json <<EOF
{
  "family": "lab-web",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "$TER",
  "containerDefinitions": [{
    "name": "web",
    "image": "nginx:alpine",
    "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/lab",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "web"
      }
    }
  }]
}
EOF
aws ecs register-task-definition --cli-input-json file:///tmp/td.json
Run it once as a standalone task.
# Use a public subnet and assign a public IP so it can pull from Docker Hub.
SUBNET=$(aws ec2 describe-subnets --filters Name=default-for-az,Values=true \
  --query 'Subnets[0].SubnetId' --output text)
SG=$(aws ec2 describe-security-groups --group-names default \
  --query 'SecurityGroups[0].GroupId' --output text)
aws ecs run-task --cluster lab-cluster --launch-type FARGATE \
  --task-definition lab-web \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[$SG],assignPublicIp=ENABLED}"
Inspect what happened.
aws ecs list-tasks --cluster lab-cluster
aws ecs describe-tasks --cluster lab-cluster \
  --tasks $(aws ecs list-tasks --cluster lab-cluster --query 'taskArns[0]' --output text) \
  --query 'tasks[0].{status:lastStatus,health:healthStatus,containers:containers[].{name:name,exit:exitCode}}'
# Logs land in /ecs/lab. Tail them:
aws logs tail /ecs/lab --follow --since 5m
Tear down.
TASK=$(aws ecs list-tasks --cluster lab-cluster --query 'taskArns[0]' --output text)
aws ecs stop-task --cluster lab-cluster --task "$TASK"
# Wait for the task to stop fully so the cluster has no active tasks when it is deleted.
aws ecs wait tasks-stopped --cluster lab-cluster --tasks "$TASK"
aws ecs delete-cluster --cluster lab-cluster
aws iam detach-role-policy --role-name lab-task-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
aws iam delete-role --role-name lab-task-exec
aws logs delete-log-group --log-group-name /ecs/lab

You've now seen the full Fargate task lifecycle without provisioning a single EC2 instance. To make it a long-running service, replace the run-task step with aws ecs create-service and attach an ALB target group.

11 · What breaks

"My task can't pull the image." Most common cause: private subnet without NAT or VPC endpoints for ECR — Fargate can't reach the internet to fetch the image. Either give the subnet a route, or set up ECR/S3 VPC endpoints (S3 because ECR images store their layers in S3).
ENI exhaustion in awsvpc mode. Each Fargate task = one ENI consuming one VPC IP. Subnets have finite IP pools (a /24 holds 251 usable IPs; a /22 holds 1,019). A service scaled to 2,000 tasks in a /22 will simply not schedule the last ~1,000 — they sit in PROVISIONING forever. Pre-allocate larger subnets, or shrink the task density per VPC.
Deployment circuit breaker only catches consecutive failures. If new tasks fail-succeed-fail-succeed in alternation, the circuit breaker never trips and you can drain healthy capacity over hours. Set CloudWatch alarms on task StoppedReason patterns as a secondary safety net.
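Rolling deploys stall. After ECS deregisters an old task from the ALB target group, the target keeps draining for the deregistration delay (deregistration_delay.timeout_seconds, default 300s) and ECS does not stop the task until that elapses. A 4-task service with 50% minimum healthy that has to replace tasks in two batches therefore takes 10+ minutes per deploy. Lower the deregistration delay for non-stateful HTTP services.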
EKS upgrades require strict kubectl version compatibility. kubectl supports +/- 1 minor version against the API server. EKS upgrades the control plane one minor version at a time, so going from 1.27 to 1.30 is three sequential upgrades, and it requires re-tooling every team's kubectl along the way. The control plane upgrade is also one-way; you cannot downgrade. Run kubectl version across your fleet before triggering an EKS upgrade.
EKS control-plane logs are off by default. When you actually need them, they're missing. Enable api, audit, authenticator logging up front; it's a small CloudWatch bill but priceless when debugging "why did the API server refuse this request."
IRSA "AccessDenied" with no detail. The trust policy's aud claim must include sts.amazonaws.com and the sub must exactly match system:serviceaccount:<ns>:<sa-name>. Typos here are silent — the SDK gets a generic AssumeRoleWithWebIdentity failure with no hint at which claim mismatched. Always copy/paste, never retype.
Fargate Spot's 2-minute reclaim notice. Spot tasks receive a SIGTERM, and a task state change event is published to EventBridge, 2 minutes before reclaim. If your app ignores SIGTERM (Node.js without a handler, Python with a long-running synchronous loop), it is killed mid-request once the container's stopTimeout (30 seconds by default) runs out, without draining. Wire a SIGTERM handler that fails health checks for 30 seconds before exiting, and raise stopTimeout if you need more of the 2 minutes.
Cross-region image pulls are punishingly slow. If your ECR is in us-east-1 and your cluster runs in eu-west-1, every cold task waits for the cross-region pull. Use ECR cross-region replication or push to a per-region repo as part of CI.
EKS Fargate pods can't use EBS or DaemonSets. If your "we'll migrate to Fargate" plan includes a DaemonSet for log shipping or anything stateful with PVCs, Fargate won't run it. Stick with managed node groups for those workloads.
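The arithmetic behind two of the failure modes above (ENI exhaustion and stalled rolling deploys) fits in a few lines. This is a sketch under stated assumptions: the helper names are made up, and the drain estimate assumes maximumPercent is 100, so old tasks must drain before replacements start.

```python
import ipaddress
import math

AWS_RESERVED_PER_SUBNET = 5  # AWS keeps the first four addresses and the last one


def usable_ips(cidr):
    """Addresses in a VPC subnet that tasks can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def unschedulable_tasks(desired, cidr):
    """How many awsvpc tasks cannot get an ENI in a subnet of this size."""
    return max(0, desired - usable_ips(cidr))


def min_drain_seconds(task_count, min_healthy_percent, deregistration_delay=300):
    """Lower bound on time spent draining during one rolling deploy."""
    batch = task_count - math.ceil(task_count * min_healthy_percent / 100)
    if batch <= 0:
        raise ValueError("no task may be stopped at this minimum healthy percent")
    batches = math.ceil(task_count / batch)
    return batches * deregistration_delay


slash_24 = usable_ips("10.0.0.0/24")
slash_22 = usable_ips("10.0.0.0/22")
stuck = unschedulable_tasks(2000, "10.0.0.0/22")
drain = min_drain_seconds(4, 50)
```

The subnet figure is optimistic: load balancers, VPC endpoints and anything else in the subnet consume addresses from the same pool.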
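For the Fargate Spot item above, a drain-on-SIGTERM handler can be sketched like this. It is a minimal illustration rather than a complete server: `health_status` stands in for whatever your health endpoint returns, and the 30-second drain is the figure from the text, not a recommendation from AWS.

```python
import signal
import threading
import time

draining = threading.Event()


def handle_sigterm(signum, frame):
    """Mark the process as draining; do not exit immediately."""
    draining.set()


# Must be registered from the main thread.
signal.signal(signal.SIGTERM, handle_sigterm)


def health_status():
    """HTTP status the health endpoint should return right now."""
    return 503 if draining.is_set() else 200


def drain_then_exit(drain_seconds=30, sleep=time.sleep):
    """Block until SIGTERM arrives, keep failing health checks, then return.

    While this sleeps, the load balancer sees 503s and stops sending traffic,
    and in-flight requests get time to finish.
    """
    draining.wait()
    sleep(drain_seconds)
```

A typical layout runs the server in worker threads and calls `drain_then_exit()` from the main thread, exiting the process when it returns. The container also needs enough time before the follow-up SIGKILL, which is what the task definition's stopTimeout controls.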

12 · Further reading

ECS developer guide. The canonical reference; the "Task definition parameters" and "Service auto scaling" pages are the must-reads.
EKS user guide. Start with "Best practices" — AWS distilled the operational lessons there.
IAM Roles for Service Accounts. The OIDC mechanics in detail.
Karpenter docs. The cluster-autoscaler replacement Pinterest et al. moved to; reads as a meditation on what Cluster Autoscaler got wrong.
Lyft Engineering. The blog with the long arc of ECS → Kubernetes choices.
Pinterest Engineering — Kubernetes. Multi-year posts on EKS, Karpenter, and ML infrastructure.
IAM, deeper. The role-and-trust-policy primitives IRSA builds on.
Cloud orchestration (concepts). Where Kubernetes sits in the broader scheduler taxonomy.

