Auto-Scaling Group Simulator: target tracking in slow motion.

A CloudWatch alarm fires. The ASG asks for two more instances. They take ten minutes to warm up, reporting zero CPU the whole time, which drags the fleet average down until the controller orders a scale-in. The instances you just paid to provision get terminated before they ever serve a request. Most ASG misconfigurations look like this. Drag the load below, change the mode, and watch the controller make decisions.

[Interactive simulator. Controls: mode, target CPU %, min instances, max instances, cooldown (s), and external load. Chart: average CPU (solid) vs external load (dashed) over the last 60 minutes, with the target line at 50%. Fleet view: one box per instance showing its state (serving or warming), with average reported and real CPU. Counters: instances, average CPU, scale-outs, scale-ins, alarms fired, saturation minutes. Event log. Initial state: 2 serving instances, external load 60%, clock at t+0 min. Click "Play" to start the clock, or "Step" for one minute at a time.]

What you're looking at

A clock-driven ASG. The chart plots the fleet's average reported CPU (solid) against the external load you control (dashed), with horizontal lines at your target and at the 80% saturation mark. Below it, each box is one instance: amber while it warms up reporting 0% CPU, plain once it's serving. The controls set the scaling mode, target CPU, min/max bounds, and cooldown; the load slider and +20/-20 buttons drive demand. Press Play to run the minute-by-minute clock, or Step to advance one minute, and the log narrates every alarm and scaling action.

Push the load to 90, press Play, and watch a scale-out fire. Here's the trap: the new instances boot at 0% CPU for ten ticks, which drags the fleet average below target, so the controller orders a scale-in and kills the instances you just paid to launch. Then the survivors saturate again. That oscillation is the surprise. Then raise the cooldown and switch to predictive mode and watch the same load curve stay stable, because the controller stops reacting to its own half-warmed fleet.

What an Auto-Scaling Group actually does

An Auto-Scaling Group is a desired-state controller for a fleet of EC2 instances. You declare a launch template, a min, a max, and a desired count. The ASG keeps the fleet at the desired count, replacing failed instances and re-balancing across Availability Zones. That's the static job.

The dynamic job is to change the desired count based on a metric. A scaling policy attached to the ASG watches a CloudWatch metric (CPU utilisation, ALB RequestCountPerTarget, or a custom metric) and adjusts the desired count up or down. The ASG then launches new instances or terminates existing ones to converge on it.

Target tracking vs step scaling vs simple scaling vs predictive

	Target tracking	Step scaling	Simple scaling	Predictive
How it decides	Like a thermostat — calculates capacity to keep metric near target	You define alarm thresholds and step adjustments	One alarm, one adjustment, then a cooldown	ML forecast 24-48h ahead, scales ahead of demand
Best for	Most workloads with a clear scaling metric	Workloads with known traffic curves and clear thresholds	Legacy; AWS recommends moving off this	Predictable diurnal or weekly cycles
Cooldown	Not used; instance warm-up paces scaling	Not used; instance warm-up paces scaling	Per-policy, falls back to the group default	N/A (schedules forward)
Alarm setup	AWS creates and manages alarms for you	You define alarms and step actions	You define one alarm + action	Auto-generated from historical data
Risk of flapping	Low — built-in damping	Medium — depends on step thresholds	High — only one threshold	Low — schedule-based

Default choice: target tracking on either CPU or ALB request count. Move to step scaling when you need finer control, or pair predictive with target tracking for the daily traffic shape plus the unexpected spike.
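
The thermostat behaviour in the table's first column can be approximated with a proportional rule. AWS does not publish the exact algorithm, so the formula and the function name `proportional_desired` below are assumptions for illustration, not the service's implementation.

```python
import math


def proportional_desired(current: int, metric: float, target: float,
                         min_size: int, max_size: int) -> int:
    """Hypothetical thermostat rule: size the fleet so the metric lands near target."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    # Total work is roughly current * metric; spread it so each instance sits at target.
    raw = math.ceil(current * metric / target)
    # Never leave the group's declared min/max bounds.
    return max(min_size, min(max_size, raw))


print(proportional_desired(4, 85.0, 50.0, 2, 16))  # 7
print(proportional_desired(4, 20.0, 50.0, 2, 16))  # 2
```

Rounding up biases the rule toward having slightly too much capacity rather than too little, which is one plausible reason a real controller would scale out eagerly and scale in cautiously.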

The control loop

Several hops, all asynchronous:

instance emits metric → CloudWatch agent → CloudWatch service (60s lag)
                                            ↓
                                       alarm evaluator
                                            ↓
                                  fires when threshold met
                                            ↓
                                   scaling policy invoked
                                            ↓
                                  ASG adjusts desired count
                                            ↓
                            EC2 fleet management launches/terminates
                                            ↓
                                instance boot + warm-up (60-600s)

From a CPU spike on an instance to traffic being served by a new instance, you're looking at 90 seconds to 12 minutes. EC2 publishes metrics every 5 minutes under basic monitoring and every minute with detailed monitoring (at extra cost). Alarm evaluation requires N consecutive breaching datapoints, where N is set on the alarm (the worked example at the end of this page uses 3), so a breach has to persist for N periods before anything happens. Then the ASG action runs, EC2 boots the instance, and the load balancer starts sending it traffic. None of this is instant.
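
The "N consecutive datapoints" rule can be sketched as a streak counter. This is a simplified model: real CloudWatch alarms also support M-out-of-N evaluation and missing-data handling, which are left out here, and the function name `first_alarm_index` is hypothetical.

```python
from typing import Optional, Sequence


def first_alarm_index(datapoints: Sequence[float], threshold: float,
                      periods: int) -> Optional[int]:
    """Return the index of the datapoint at which the alarm would fire, or None."""
    streak = 0
    for i, value in enumerate(datapoints):
        # A single datapoint at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= periods:
            return i
    return None


cpu = [65.0, 85.0, 85.0, 85.0]  # one datapoint per minute
print(first_alarm_index(cpu, 60.0, 3))                 # 2
print(first_alarm_index([65.0, 55.0, 85.0], 60.0, 3))  # None
```

With one datapoint per minute and three required periods, the earliest the alarm can fire is the third minute of a sustained breach, before any launch or boot time is added.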

CloudWatch metric publication lag

Default EC2 instance metrics publish at 5-minute intervals — that's basic monitoring, free with EC2. Enable detailed monitoring ($2.10 per instance per month at the time of writing) and you get 1-minute resolution. Custom metrics via the CloudWatch agent can go to 1-second resolution with high-resolution custom metrics, at extra cost.

Whatever resolution you choose, there's a publication delay on top — the metric covering minute t isn't available until minute t+1 or later. A target-tracking alarm evaluates the most recent N datapoints, so you're effectively making decisions on data that's 2–5 minutes old. If your cooldown is shorter than this lag, the next decision fires on the same stale data and you get oscillation.

Rule of thumb: cooldown should be at least 2× the metric publication interval, and ideally at least 1.5× the instance warm-up time. Defaults exist for a reason.
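
As a sketch, the rule of thumb reduces to taking the larger of the two bounds. The function name `suggested_cooldown` is hypothetical, and the multipliers are the ones stated above rather than anything AWS prescribes.

```python
import math


def suggested_cooldown(publication_interval_s: float, warmup_s: float) -> int:
    """Lower bound on cooldown: 2x the metric interval or 1.5x warm-up, whichever is larger."""
    return math.ceil(max(2 * publication_interval_s, 1.5 * warmup_s))


print(suggested_cooldown(60, 300))   # 450 (warm-up dominates)
print(suggested_cooldown(300, 120))  # 600 (5-minute basic monitoring dominates)
```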

Warm-up time and the dilution trap

When a new instance launches, it boots, runs user-data, pulls container images, registers with the load balancer, and starts taking traffic. During this time it reports CPU near 0% because it's not doing any work. The ASG's CPU metric averages across all instances in the group.

So: four existing instances at 80% CPU, the controller scales out, three new instances come up at 0%, and the fleet average drops to (80×4 + 0×3)/7 ≈ 46%. If your target is 50%, the controller now sees "below target" and orders a scale-in, terminating the brand-new instances. The original four are still carrying the whole load at 80%, so the controller scales out again and you're in a thrash loop.

The fix is the instance warm-up parameter on the scaling policy. New instances are excluded from the metric average for the warm-up duration (default 300s, often needs to be 600s or more for slow-booting apps). With proper warm-up configured, the controller ignores brand-new instances until they're ready, and the dilution trap disappears.
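
The dilution arithmetic is easy to reproduce. The sketch below assumes warming instances report exactly 0% and that serving instances keep their load until the new ones take traffic; `fleet_average` is a hypothetical helper, not an AWS call.

```python
from typing import List


def fleet_average(serving_cpu: List[float], warming: int,
                  exclude_warming: bool) -> float:
    """Average CPU across the group; warming instances are modelled as reporting 0%."""
    samples = list(serving_cpu)
    if not exclude_warming:
        # Without a warm-up setting, idle newcomers are averaged in.
        samples += [0.0] * warming
    return sum(samples) / len(samples)


serving = [80.0] * 4
for warming in (2, 3):
    diluted = fleet_average(serving, warming, exclude_warming=False)
    honest = fleet_average(serving, warming, exclude_warming=True)
    print(warming, round(diluted, 1), round(honest, 1))
# 2 53.3 80.0
# 3 45.7 80.0
```

With warm-up exclusion the controller keeps seeing 80%, which is the true state of the instances that are actually serving.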

Cooldown periods

Cooldown is the minimum time between scaling actions. After a scale-out, no further scale-out or scale-in happens until cooldown elapses. The intent is to give the previous scaling action time to take effect before reacting again.

Default is 300 seconds. Tune up if your application has a long warm-up time (Java apps with JIT compilation, large container images) or if your metric has high natural variance. Tune down only if you've measured that 300s is too conservative for your shape of traffic — and only after confirming there's no flapping at the shorter setting.

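Cooldowns in the strict sense belong to simple scaling: you set one per policy, or the policy falls back to the group's default cooldown. Target tracking and step scaling policies do not take a cooldown; AWS paces them with the instance warm-up setting instead, so instances that are still warming are not counted in the aggregated metric. Get the pacing wrong in either model and you'll either thrash (too short) or fail to respond to real spikes (too long).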
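
The gating behaviour described in this section can be sketched as a small state machine. `CooldownGate` is a hypothetical class for illustration; it models "no action until the cooldown elapses" and nothing else.

```python
from typing import Optional


class CooldownGate:
    """Hypothetical gate: allow a scaling action only if the cooldown has elapsed."""

    def __init__(self, cooldown_s: int) -> None:
        self.cooldown_s = cooldown_s
        self.last_action_at: Optional[int] = None  # seconds; None until the first action

    def try_act(self, now_s: int) -> bool:
        if (self.last_action_at is not None
                and now_s - self.last_action_at < self.cooldown_s):
            return False  # still cooling down, so the alarm is ignored
        self.last_action_at = now_s
        return True


gate = CooldownGate(cooldown_s=300)
print([gate.try_act(t) for t in (0, 120, 299, 300, 480, 600)])
# [True, False, False, True, False, True]
```

Note that each accepted action restarts the timer, so an alarm that keeps firing gets through at most once per cooldown window.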

Scaling on the right metric

CPU is the easy default, and it's wrong for many workloads. Web servers that block on database queries don't scale with CPU — they scale with concurrent connections. Background workers that pull from SQS don't scale with CPU — they scale with queue depth. Auth gateways that fan out to many downstreams don't scale with CPU — they scale with active outbound connections.

The right metric is the one that hits a wall first. For ALB-fronted services, RequestCountPerTarget usually beats CPU because it measures demand per instance directly instead of a side effect of it, provided requests cost roughly the same to serve. For queue-backed workers, use ApproximateNumberOfMessagesVisible divided by the current instance count (a metric math expression gives you backlog per instance). For services bound on something exotic (file descriptors, database connection pool saturation, GPU memory), emit a custom CloudWatch metric and target-track on that.

The principle is "scale on the bottleneck, not the symptom." CPU is the symptom in most overload scenarios; the bottleneck is connection pool exhaustion or queue depth or memory pressure. Find the bottleneck once, scale on it forever.
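
For the queue-backed case, the backlog-per-instance signal and the fleet size it implies can be sketched as follows. The helper names and the target of 100 messages per instance are assumptions chosen for the example; the right target depends on how long a message takes to process and how much latency you can tolerate.

```python
import math


def backlog_per_instance(visible_messages: int, in_service: int) -> float:
    """Messages waiting per instance; guards against an empty fleet."""
    return visible_messages / max(in_service, 1)


def desired_for_backlog(visible_messages: int, target_backlog: float,
                        min_size: int, max_size: int) -> int:
    """Fleet size that would bring the backlog per instance down to the target."""
    raw = math.ceil(visible_messages / target_backlog)
    return max(min_size, min(max_size, raw))


print(backlog_per_instance(1200, 4))          # 300.0
print(desired_for_backlog(1200, 100, 2, 16))  # 12
```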

Mixed instance policy and Spot

A modern ASG runs a mixed instances policy: a list of instance types (e.g., m5.large, m5a.large, m6i.large, c5.large) and a split between On-Demand and Spot. The ASG launches across the eligible types and capacity pools, falling back automatically when one pool is exhausted.

Target tracking still works the same — the metric is fleet-wide. Spot interruption is the wrinkle. When AWS reclaims a Spot instance you get a 2-minute warning, and a lifecycle hook can run drain logic before the instance dies. If the hook takes too long the instance is force-terminated; if it returns fast the ASG launches a replacement, which then has to warm up.

The recommended config: the capacity-optimized allocation strategy (AWS picks the deepest Spot pools to minimise interruption), 70–90% Spot for stateless workloads, and Capacity Rebalancing turned on so replacements launch before an interruption lands.
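
A configuration along these lines might look like the dictionary below, shaped like the keyword arguments to boto3's `create_auto_scaling_group`. The group name, launch template name, sizes, and the 80% Spot split are hypothetical, and the subnet setting is omitted, so treat this as a fragment to adapt rather than a complete request.

```python
# Keyword arguments in the shape accepted by create_auto_scaling_group.
# Names and numbers are illustrative; VPCZoneIdentifier (subnets) is left out.
mixed_instances_group = {
    "AutoScalingGroupName": "web-fleet",  # hypothetical
    "MinSize": 2,
    "MaxSize": 16,
    "CapacityRebalance": True,  # replace at-risk Spot instances proactively
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-fleet-template",  # hypothetical
                "Version": "$Default",
            },
            "Overrides": [
                {"InstanceType": t}
                for t in ("m5.large", "m5a.large", "m6i.large", "c5.large")
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 2,                  # always-on floor
            "OnDemandPercentageAboveBaseCapacity": 20,  # the rest is Spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
}


def spot_share_above_base(group: dict) -> int:
    """Percentage of capacity above the On-Demand base that runs on Spot."""
    dist = group["MixedInstancesPolicy"]["InstancesDistribution"]
    return 100 - dist["OnDemandPercentageAboveBaseCapacity"]


print(spot_share_above_base(mixed_instances_group))  # 80
```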

Predictive scaling

AWS trains a model on 14 days of your historical load and forecasts traffic 24–48 hours ahead. The ASG then pre-launches instances ahead of the predicted curve, so the fleet is already warm when traffic arrives. Combined with target tracking, you get the best of both — predictive handles the steady diurnal/weekly shape, target tracking handles deviation from it.

Where it helps: workloads with strong daily or weekly cycles (e-commerce, B2B SaaS, content platforms with regional peaks). Where it doesn't: launch traffic, viral events, anything where the past two weeks don't predict the next 24 hours. Predictive scaling will happily overprovision for the wrong curve if your traffic shape just changed.

Run it in "forecast only" mode for a week first; check the predicted curves against actual; then flip to "forecast and scale" if the predictions are sane.
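
One way to make "check the predicted curves against actual" concrete is a mean absolute percentage error over the forecast-only week. The metric choice and the sample numbers are assumptions; what counts as "sane" is a judgment call for your workload.

```python
from typing import Sequence


def mean_abs_pct_error(forecast: Sequence[float], actual: Sequence[float]) -> float:
    """Average of |forecast - actual| / actual, as a percentage."""
    if len(forecast) != len(actual):
        raise ValueError("series must be the same length")
    # Skip periods with zero actual load to avoid dividing by zero.
    errors = [abs(f - a) / a for f, a in zip(forecast, actual) if a > 0]
    if not errors:
        raise ValueError("no periods with non-zero actual load")
    return 100 * sum(errors) / len(errors)


forecast = [40.0, 60.0, 90.0, 50.0]   # illustrative hourly load forecast
actual = [50.0, 60.0, 100.0, 50.0]    # what really happened
print(round(mean_abs_pct_error(forecast, actual), 1))  # 7.5
```

A single number hides the direction of the error, so it is worth also plotting the two curves: consistent under-forecasting at the daily peak matters more than small misses overnight.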

Lifecycle hooks

Hooks pause the ASG transition for an instance and let you run setup or drain logic. Two hooks: autoscaling:EC2_INSTANCE_LAUNCHING runs before the instance is in service; autoscaling:EC2_INSTANCE_TERMINATING runs before termination. Each has a configurable timeout (default 1 hour, max 48 hours) and a default action (CONTINUE or ABANDON).

Common uses. Launching hook: pull config from Parameter Store, register with service discovery, run health checks before adding to the load balancer. Terminating hook: deregister from the load balancer (connection draining), flush in-memory state, upload final logs to S3. The hook signals "done" by calling CompleteLifecycleAction (complete-lifecycle-action in the CLI, with equivalents in each SDK); if it times out, the default action applies.

Most teams use EventBridge to trigger a Lambda from the hook event, which is cleaner than running anything on the instance itself for the termination case. For launching, user-data + cfn-signal is the older pattern; lifecycle hooks + Lambda is the modern one.
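
A minimal sketch of the Lambda side of that pattern follows. It assumes the function is triggered by the EventBridge lifecycle event and that the drain step is supplied by you; the `drain` callback and the injectable `autoscaling` client are hypothetical conveniences so the handler can be exercised without AWS.

```python
from typing import Any, Callable, Dict, Optional


def completion_args(event: Dict[str, Any], result: str = "CONTINUE") -> Dict[str, str]:
    """Map an EventBridge lifecycle event to CompleteLifecycleAction arguments."""
    detail = event["detail"]
    return {
        "LifecycleHookName": detail["LifecycleHookName"],
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
        "LifecycleActionResult": result,
    }


def handler(event: Dict[str, Any], context: Any,
            autoscaling: Optional[Any] = None,
            drain: Callable[[str], None] = lambda instance_id: None) -> Dict[str, str]:
    if autoscaling is None:
        import boto3  # imported lazily; present in the Lambda runtime
        autoscaling = boto3.client("autoscaling")
    args = completion_args(event)
    try:
        drain(args["InstanceId"])  # hypothetical: deregister, flush, upload logs
    except Exception:
        # Report failure so the hook does not sit waiting for its timeout.
        args["LifecycleActionResult"] = "ABANDON"
    autoscaling.complete_lifecycle_action(**args)
    return args
```

Completing the action explicitly, even on failure, matters because otherwise the instance waits out the full hook timeout before the default action applies.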

Common production failures

Flapping. Cooldown too short relative to metric lag and warm-up time. The fleet oscillates between 4 and 8 instances every 10 minutes, never settling. Fix: bump cooldown to 600s, set instance warm-up to match boot time, and confirm with the ASG's scaling activity history.
Runaway scale-out. A metric goes wrong (NaN, stuck at 100%) and the controller scales to max-instances. If the max is set far above anything you'd really need, you wake up to a very large EC2 bill. Always set a max that's no more than 3–5× expected steady-state, even if it means dropping requests during a real spike.
AZ-imbalanced fleet. An AZ has limited capacity for your instance type. ASG launches everything in the remaining AZs and you lose the resilience that the multi-AZ ASG was supposed to give you. Mitigation: mixed instance policy across instance types, so the ASG can pick a different type in the constrained AZ.
ASG fighting cluster-autoscaler. If you're running Kubernetes on top of an ASG with cluster-autoscaler, the cluster-autoscaler scales the node group based on pod resource requests. Don't also attach a target-tracking policy to the same ASG. The two controllers will disagree and one will keep undoing the other's actions.
Stale launch template. A new launch template version (new AMI) is published but the ASG points at "version $Latest" without instance refresh. New instances boot with the new AMI; old ones keep the stale one. Either explicitly bump the launch template version + run instance refresh, or accept that mixed-AMI fleets are temporary.
SNS subscription on every scaling event. If you wire scaling events to a Slack channel, expect 50+ notifications a day on an active fleet. Either filter to scale-failures only or send to a dashboard rather than a chat channel.
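
Flapping, the first failure above, is straightforward to detect once you have the sequence of desired capacities from the scaling history. The sketch below counts direction changes; the function name and the sample sequences are hypothetical, and any alerting threshold you put on the count is yours to choose.

```python
from typing import Sequence


def count_reversals(capacities: Sequence[int]) -> int:
    """Count how often the fleet size changes direction (out then in, or in then out)."""
    reversals = 0
    last_direction = 0
    for prev, cur in zip(capacities, capacities[1:]):
        direction = (cur > prev) - (cur < prev)  # +1 out, -1 in, 0 unchanged
        if direction == 0:
            continue
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


print(count_reversals([4, 8, 4, 8, 4, 8]))  # 4
print(count_reversals([4, 6, 7, 7, 7]))     # 0
```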

A worked target-tracking example

Imagine a web fleet of 4 c5.large instances behind an ALB. Target: 50% CPU. Min 2, max 16. Cooldown 300s. Warm-up 300s.

t+0:   load 60%, fleet at 65% CPU, no action yet (first datapoint above target + 10)
t+1:   load 80%, fleet at 85% CPU
t+2:   alarm fires (3 consecutive > target + 10)
t+2:   ASG orders +3 instances (proportional to overshoot)
t+2:   3 new instances launching, warmup_remaining=300s
t+3-7: new instances at 0% but excluded from metric average via warm-up
t+8:   warm-up expires, fleet now at 7 instances, average CPU drops to ~48%
t+10:  cooldown expires, controller checks: avg=48%, near target, no action
t+30:  load drops to 30%, fleet at 25% CPU
t+33:  alarm fires (under target - 15)
t+33:  ASG orders -2 instances (proportional)
t+33:  oldest 2 terminate after draining

Notice the gaps. From the alarm firing to the new instances being effective is 6 minutes (t+2 to t+8). From load dropping to instances being terminated is 3 minutes. The ASG is a slow controller; it suits steady-state workloads, not bursty ones. For bursty traffic, layer a CloudFront cache or an SQS queue in front and let that layer absorb the spike while the ASG catches up.