How to Rightsize EC2 Auto Scaling Groups
Rightsizing an ASG means moving the launch template to a smaller instance type, making every instance in the group smaller and cheaper. Here's how to evaluate whether your group can safely go down a size.

Introduction
Rightsizing a standalone EC2 instance is straightforward. You look at one set of metrics, decide if it's oversized, and change the instance type. With an Auto Scaling Group, the idea is the same but the execution is different. You're not looking at one instance. You're looking at every instance in the group, and they all need to pass the same checks.
This post covers how to evaluate whether an ASG's launch template instance type can safely go down one size. We're talking about single-instance-type ASGs backed by a launch template. Mixed instance type ASGs are a different situation. And we're not covering scaling policies here. Adjusting how many instances you run is a separate lever for cost savings, and it pairs well with rightsizing, but it's a different topic.
If the instance type in your launch template is bigger than it needs to be, you're overpaying on every instance in the group. Go down one size and you cut the per-instance cost roughly in half.
Why ASGs are harder than standalone instances
With a standalone EC2 instance, you have one machine running continuously. Its metrics tell a clear story over time.
ASG instances are ephemeral. They come and go as the group scales. An instance running right now might not have been running last week, and the instance that was running last week might already be terminated. To get a full picture of the group's utilization, you need to look at metrics from all instances that were part of the group during your observation window, not just the ones currently running.
This matters because if you only look at the current set of instances, you might miss periods where the group was under heavier load with different instances active. The same EC2 rightsizing principles apply here, but you need to apply them across the entire group.
Pre-flight checks
Before looking at any metrics, rule out groups that can't be downsized.
Skip any ASG where the launch template already specifies the smallest instance type in its family. There's nowhere to go.
Skip any ASG where the instances use instance store volumes. Instance store data doesn't survive the instance replacement a downsize triggers. This is rare for ASG-backed workloads since most use EBS, but check anyway.
How much data you need
Use at least 30 days of CloudWatch data at the current instance size. 60 days is better. You can look back up to 365 days to catch seasonal patterns, but anything older than a year is stale.
The 30-day minimum matters more for ASGs than standalone instances. Individual instances in the group may only run for hours or days before being replaced. You need enough calendar time for the group to have experienced its full range of normal load patterns.
Look at every instance, not just one
Don't pick one instance from the group, check its metrics, and assume the rest look the same. They should look similar if your load balancer is distributing traffic evenly, but that's exactly the kind of assumption that causes problems.
Check all instances in the group. If most instances are sitting at 15% CPU but one is consistently at 60%, that's probably a load balancing issue, not a rightsizing signal. Fixing the load distribution is a better move than keeping the whole group oversized to accommodate one hot instance.
Also watch out for warm pool instances. If your ASG uses a warm pool, those instances will show near-zero utilization because they're sitting idle waiting to be brought into service.
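A quick way to spot the "one hot instance" pattern is to compare each instance's P99.5 against the group median. The 2x-median factor here is an assumed heuristic, not a standard; tune it to your workload, and exclude warm pool instances before running it so their near-zero readings don't drag the median down.

```python
from statistics import median

# Sketch: flag instances whose P99.5 CPU sits well above the rest of
# the group -- usually a load balancing problem, not a rightsizing
# signal. The 2x-median factor is an assumption; adjust as needed.
def find_hot_instances(p995_by_instance: dict[str, float],
                       factor: float = 2.0) -> list[str]:
    med = median(p995_by_instance.values())
    return [iid for iid, cpu in p995_by_instance.items()
            if cpu > factor * med]
```

For a group sitting at roughly 15% CPU with one instance at 60%, this flags only the hot instance, pointing you at the load distribution rather than the instance type.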
CPU
Use P99.5 CPU utilization as your primary signal. Check it across all instances in the group over the full observation window. If P99.5 is below 40% for every instance, the group is a candidate.
Why P99.5? It captures nearly all real usage while filtering out the handful of one-off spikes from restarts or deploys that don't reflect actual load. Raw max overreacts to blips. Averages and even P95 can smooth over the peaks that actually cause problems after a downsize.
Going down one size within the same family roughly halves CPU capacity. An instance sitting at 40% P99.5 will peak around 80% on the smaller type. That's high but manageable since the average will be much lower, leaving room for normal variation.
Track the average alongside P99.5 for context. The average gives you a feel for typical load, but it shouldn't drive the decision.
Memory
The same threshold applies: P99.5 below 40% across all instances.
EC2 doesn't publish memory metrics by default. You need the CloudWatch Agent installed on your instances. For ASGs, make sure the agent is baked into your AMI or installed via user data so every new instance reports memory automatically. If you don't have memory data, you can still rightsize based on CPU alone, but you're flying partially blind. We cover the setup in our guide to enabling EC2 memory metrics.
Seasonal spike detection
CPU and memory thresholds tell you whether the group is oversized right now. But some workloads spike predictably: end of quarter, holidays, monthly batch runs.
If you have 12 months of data, check whether any recurring peak would push utilization past the threshold on a smaller instance. If it would, hold off. If you have less than 12 months, be aware of the blind spot and revisit after you've accumulated more history.
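A minimal version of that check: project each month's peak onto the smaller type (roughly doubled, per the capacity math above) and see whether it crosses a ceiling. The 85% ceiling is an assumption for illustration; pick whatever peak utilization you're comfortable running at.

```python
# Sketch: which months' recurring peaks would be a problem after a
# downsize? Peaks roughly double on the smaller type; the 85% ceiling
# is an assumed comfort threshold, not an AWS limit.
def seasonal_blockers(monthly_peak_cpu: dict[str, float],
                      ceiling: float = 85.0) -> list[str]:
    return [month for month, peak in monthly_peak_cpu.items()
            if peak * 2 > ceiling]
```

A March peak of 30% projects to 60% and passes; a December peak of 48% projects to 96% and should make you hold off.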
EBS and network limits
A smaller instance type can come with lower EBS throughput and network bandwidth ceilings. This is a bigger deal for ASGs than you might expect.
If your current instance type is close to its EBS or network limits (say 80% of max throughput), going down a size could eliminate your remaining headroom entirely. The smaller type has lower limits, and your workload stays the same.
This is a judgment call. Even if you lose some I/O headroom, going down a size roughly halves the per-instance price. If the group scales up to handle load, you might still come out ahead on cost. But if your workload is I/O-heavy and you're already pushing limits, a downsize could cause throttling that scaling won't fix. Check the specific limits for both the current and target instance types before deciding.
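The headroom check itself is simple arithmetic once you've looked up the real limits. The 80% utilization cap below mirrors the rule of thumb above; the throughput numbers you feed in must come from the EC2 documentation for your specific types, not from this sketch.

```python
# Sketch: does the observed peak EBS (or network) throughput still
# leave headroom on the smaller type? Pass in real limits from the
# EC2 docs -- no limit values are hardcoded here on purpose.
def has_io_headroom(observed_peak_mbps: float,
                    target_limit_mbps: float,
                    max_utilization: float = 0.8) -> bool:
    """True if the peak stays under 80% of the smaller type's limit."""
    return observed_peak_mbps < max_utilization * target_limit_mbps
```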
Desired capacity and scaling headroom
Here's a gotcha specific to ASGs. Going down a size means each instance handles less load, so the group will likely scale up to compensate. If the group's desired capacity is already close to its max (say 80% or above), it might not have room to add those instances.
This doesn't automatically disqualify a downsize. Going down one size roughly halves the per-instance price. Even if the group scales from 4 instances to 6, you're still paying less (6 instances at half price is 75% of the original cost). But if the group is already at or near max and can't add capacity, you'll hit a ceiling during load spikes.
Review your scaling configuration alongside the rightsizing decision. If max capacity needs to increase, that's a separate change to coordinate.
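The cost math above is worth making concrete. With the downsized type at roughly half the per-instance price, the new group's cost as a fraction of the old one is:

```python
# Worked example of the scale-out cost math: going down a size roughly
# halves the per-instance price, so a modest scale-out still saves money.
def relative_cost(old_count: int, new_count: int,
                  price_ratio: float = 0.5) -> float:
    """New group's cost as a fraction of the old group's cost."""
    return (new_count * price_ratio) / old_count

# 4 instances -> 6 smaller instances: 6 * 0.5 / 4 = 0.75,
# i.e. 75% of the original spend even after scaling out.
```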
Rollout and rollback
ASGs have a built-in mechanism for this: instance refresh. It gradually replaces instances with new ones matching the updated launch template. You can control the pace (minimum healthy percentage, warmup time) to avoid taking too much capacity offline at once.
Update the instance type in your launch template, then trigger an instance refresh. Monitor CPU and memory on the new instances as they come into service. If something looks wrong, update the launch template back to the original instance type and trigger another refresh.
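As a sketch, the refresh step might look like the following. The parameters are shown as a plain dict (with boto3 you'd pass them to `autoscaling.start_instance_refresh(**params)`); the 90% healthy floor and 300-second warmup are assumed values to illustrate pacing, not recommendations.

```python
# Sketch: StartInstanceRefresh parameters for a gradual rollout after
# updating the launch template. Values are illustrative assumptions.
def refresh_params(asg_name: str) -> dict:
    return {
        "AutoScalingGroupName": asg_name,
        "Preferences": {
            # keep at least 90% of capacity in service during the swap
            "MinHealthyPercentage": 90,
            # seconds to let new instances warm up before counting them
            "InstanceWarmup": 300,
        },
    }
```

Rolling back is the same call again after reverting the launch template to the original instance type.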
If you manage infrastructure with Terraform, this is a launch template change and an instance refresh trigger. Keep the revert ready before you ship so you can move fast if needed.
Scaling policies are a separate lever
Rightsizing (smaller instances) and scaling policy tuning (fewer or more instances) are both useful for reducing ASG costs. They complement each other. But they're different decisions with different data requirements, and you should evaluate them separately.
This post covers rightsizing only. If your instances are the right size but you're running too many of them, that's a scaling policy problem. If your instances are too big and you're running too many, fix the instance size first, then revisit your scaling thresholds.
Conclusion
Rightsizing an ASG follows the same logic as rightsizing a standalone EC2 instance: check that CPU and memory P99.5 are under 40%, verify EBS and network limits, and look for seasonal spikes. The difference is that you need to check every instance in the group, account for the ephemeral nature of ASG instances, and watch for load balancing issues that might skew individual instance metrics.
One size down within the same family cuts per-instance cost roughly in half. Even if the group scales up slightly to compensate, the math usually works out. Combine it with scaling policy tuning for the full picture on ASG cost optimization.
Automate Your ASG Rightsizing
Infralyst continuously runs these checks across every instance in your Auto Scaling Groups. When it finds savings, you're one click from a ready-to-merge Terraform PR.
Start free with 3 PRs. No credit card required · Read-only IAM role · Your team reviews and merges every change