How to Rightsize EC2 Auto Scaling Groups
Rightsizing an ASG means moving the launch template to a smaller instance type, making every instance in the group smaller and cheaper. Here's how to evaluate whether your group can safely go down a size.

Introduction
Rightsizing a standalone EC2 instance is straightforward. You look at one set of metrics, decide if it's oversized, and change the instance type. With an Auto Scaling Group, the idea is the same but the execution is different. You're not looking at one instance. You're looking at every instance in the group, and they all need to pass the same checks.
This post covers how to evaluate whether an ASG's launch template instance type can safely go down one size. We're talking about single-instance-type ASGs backed by a launch template. Mixed instance type ASGs are a different situation. And we're not covering scaling policies here. Adjusting how many instances you run is a separate lever for cost savings, and it pairs well with rightsizing, but it's a different topic.
If the instance type in your launch template is bigger than it needs to be, you're overpaying on every instance in the group. Go down one size and you cut the per-instance cost roughly in half.
Why ASGs are harder than standalone instances
With a standalone EC2 instance, you have one machine running continuously. Its metrics tell a clear story over time.
ASG instances are ephemeral. They come and go as the group scales. An instance running right now might not have been running last week, and the instance that was running last week might already be terminated. To get a full picture of the group's utilization, you need to look at metrics from all instances that were part of the group during your observation window, not just the ones currently running.
This matters because if you only look at the current set of instances, you might miss periods where the group was under heavier load with different instances active. The same EC2 rightsizing principles apply here, but you need to apply them across the entire group.
Pre-flight checks
Before looking at any metrics, rule out groups that can't be downsized.
Skip any ASG where the launch template already specifies the smallest instance type in its family. There's nowhere to go.
Skip any ASG where the instances use instance store volumes. Instance store data doesn't survive the instance replacement a downsize triggers. This is rare for ASG-backed workloads since most use EBS, but check anyway.
How much data you need
Use at least 30 days of CloudWatch data at the current instance size. 60 days is better. You can look back up to 365 days to catch seasonal patterns, but anything older than a year is stale.
The 30-day minimum matters more for ASGs than standalone instances. Individual instances in the group may only run for hours or days before being replaced. You need enough calendar time for the group to have experienced its full range of normal load patterns.
Look at every instance, not just one
Don't pick one instance from the group, check its metrics, and assume the rest look the same. They should look similar if your load balancer is distributing traffic evenly, but that's exactly the kind of assumption that causes problems.
Check all instances in the group. If most instances are sitting at 15% CPU but one is consistently at 60%, that's probably a load balancing issue, not a rightsizing signal. Fixing the load distribution is a better move than keeping the whole group oversized to accommodate one hot instance.
Also watch out for warm pool instances. If your ASG uses a warm pool, those instances will show near-zero utilization because they're sitting idle waiting to be brought into service.
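A quick way to spot the "one hot instance" pattern is to compare each instance's P99.5 against the group median. The 2x-median factor here is an assumed heuristic, not a standard; tune it to your workload, and exclude warm pool instances before running it so their near-zero readings don't drag the median down.

```python
from statistics import median

# Sketch: flag instances whose P99.5 CPU sits well above the rest of
# the group -- usually a load balancing problem, not a rightsizing
# signal. The 2x-median factor is an assumption; adjust as needed.
def find_hot_instances(p995_by_instance: dict[str, float],
                       factor: float = 2.0) -> list[str]:
    med = median(p995_by_instance.values())
    return [iid for iid, cpu in p995_by_instance.items()
            if cpu > factor * med]
```

For a group sitting at roughly 15% CPU with one instance at 60%, this flags only the hot instance, pointing you at the load distribution rather than the instance type.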
CPU
Use P99.5 CPU utilization as your primary signal. Check it across all instances in the group over the full observation window. If P99.5 is below 40% for every instance, the group is a candidate.
Why P99.5? It captures nearly all real usage while filtering out the handful of one-off spikes from restarts or deploys that don't reflect actual load. Raw max overreacts to blips. Averages and even P95 can smooth over the peaks that actually cause problems after a downsize.
Going down one size within the same family roughly halves CPU capacity. An instance sitting at 40% P99.5 will peak around 80% on the smaller type. That's high but manageable since the average will be much lower, leaving room for normal variation.
Track the average alongside P99.5 for context. The average gives you a feel for typical load, but it shouldn't drive the decision.
Memory
The same threshold applies: P99.5 below 40% across all instances.
EC2 doesn't publish memory metrics by default. You need the CloudWatch Agent installed on your instances. For ASGs, make sure the agent is baked into your AMI or installed via user data so every new instance reports memory automatically. If you don't have memory data, you can still rightsize based on CPU alone, but you're flying partially blind. We cover the setup in our guide to enabling EC2 memory metrics.
Seasonal spike detection
CPU and memory thresholds tell you whether the group is oversized right now. But some workloads spike predictably: end of quarter, holidays, monthly batch runs.
If you have 12 months of data, check whether any recurring peak would push utilization past the threshold on a smaller instance. If it would, hold off. If you have less than 12 months, be aware of the blind spot and revisit after you've accumulated more history.
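A minimal version of that check: project each month's peak onto the smaller type (roughly doubled, per the capacity math above) and see whether it crosses a ceiling. The 85% ceiling is an assumption for illustration; pick whatever peak utilization you're comfortable running at.

```python
# Sketch: which months' recurring peaks would be a problem after a
# downsize? Peaks roughly double on the smaller type; the 85% ceiling
# is an assumed comfort threshold, not an AWS limit.
def seasonal_blockers(monthly_peak_cpu: dict[str, float],
                      ceiling: float = 85.0) -> list[str]:
    return [month for month, peak in monthly_peak_cpu.items()
            if peak * 2 > ceiling]
```

A March peak of 30% projects to 60% and passes; a December peak of 48% projects to 96% and should make you hold off.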
EBS and network limits
A smaller instance type can come with lower EBS throughput and network bandwidth ceilings. This is a bigger deal for ASGs than you might expect.
If your current instance type is close to its EBS or network limits (say 80% of max throughput), going down a size could eliminate your remaining headroom entirely. The smaller type has lower limits, and your workload stays the same.
This is a judgment call. Even if you lose some I/O headroom, going down a size roughly halves the per-instance price. If the group scales up to handle load, you might still come out ahead on cost. But if your workload is I/O-heavy and you're already pushing limits, a downsize could cause throttling that scaling won't fix. Check the specific limits for both the current and target instance types before deciding.
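The headroom check itself is simple arithmetic once you've looked up the real limits. The 80% utilization cap below mirrors the rule of thumb above; the throughput numbers you feed in must come from the EC2 documentation for your specific types, not from this sketch.

```python
# Sketch: does the observed peak EBS (or network) throughput still
# leave headroom on the smaller type? Pass in real limits from the
# EC2 docs -- no limit values are hardcoded here on purpose.
def has_io_headroom(observed_peak_mbps: float,
                    target_limit_mbps: float,
                    max_utilization: float = 0.8) -> bool:
    """True if the peak stays under 80% of the smaller type's limit."""
    return observed_peak_mbps < max_utilization * target_limit_mbps
```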
Desired capacity and scaling headroom
Here's a gotcha specific to ASGs. Going down a size means each instance handles less load, so the group will likely scale up to compensate. If the group's desired capacity is already close to its max (say 80% or above), it might not have room to add those instances.
This doesn't automatically disqualify a downsize. Going down one size roughly halves the per-instance price. Even if the group scales from 4 instances to 6, you're still paying less (6 instances at half price is 75% of the original cost). But if the group is already at or near max and can't add capacity, you'll hit a ceiling during load spikes.
Review your scaling configuration alongside the rightsizing decision. If max capacity needs to increase, that's a separate change to coordinate.
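The cost math above is worth making concrete. With the downsized type at roughly half the per-instance price, the new group's cost as a fraction of the old one is:

```python
# Worked example of the scale-out cost math: going down a size roughly
# halves the per-instance price, so a modest scale-out still saves money.
def relative_cost(old_count: int, new_count: int,
                  price_ratio: float = 0.5) -> float:
    """New group's cost as a fraction of the old group's cost."""
    return (new_count * price_ratio) / old_count

# 4 instances -> 6 smaller instances: 6 * 0.5 / 4 = 0.75,
# i.e. 75% of the original spend even after scaling out.
```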
Rollout and rollback
ASGs have a built-in mechanism for this: instance refresh. It gradually replaces instances with new ones matching the updated launch template. You can control the pace (minimum healthy percentage, warmup time) to avoid taking too much capacity offline at once.
Update the instance type in your launch template, then trigger an instance refresh. Monitor CPU and memory on the new instances as they come into service. If something looks wrong, update the launch template back to the original instance type and trigger another refresh.
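As a sketch, the refresh step might look like the following. The parameters are shown as a plain dict (with boto3 you'd pass them to `autoscaling.start_instance_refresh(**params)`); the 90% healthy floor and 300-second warmup are assumed values to illustrate pacing, not recommendations.

```python
# Sketch: StartInstanceRefresh parameters for a gradual rollout after
# updating the launch template. Values are illustrative assumptions.
def refresh_params(asg_name: str) -> dict:
    return {
        "AutoScalingGroupName": asg_name,
        "Preferences": {
            # keep at least 90% of capacity in service during the swap
            "MinHealthyPercentage": 90,
            # seconds to let new instances warm up before counting them
            "InstanceWarmup": 300,
        },
    }
```

Rolling back is the same call again after reverting the launch template to the original instance type.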
If you manage infrastructure with Terraform, this is a launch template change and an instance refresh trigger. Keep the revert ready before you ship so you can move fast if needed.
Scaling policies are a separate lever
Rightsizing (smaller instances) and scaling policy tuning (fewer or more instances) are both useful for reducing ASG costs. They complement each other. But they're different decisions with different data requirements, and you should evaluate them separately.
This post covers rightsizing only. If your instances are the right size but you're running too many of them, that's a scaling policy problem. If your instances are too big and you're running too many, fix the instance size first, then revisit your scaling thresholds.
Conclusion
Rightsizing an ASG follows the same logic as rightsizing a standalone EC2 instance: check that CPU and memory P99.5 are under 40%, verify EBS and network limits, and look for seasonal spikes. The difference is that you need to check every instance in the group, account for the ephemeral nature of ASG instances, and watch for load balancing issues that might skew individual instance metrics.
One size down within the same family cuts per-instance cost roughly in half. Even if the group scales up slightly to compensate, the math usually works out. Combine it with scaling policy tuning for the full picture on ASG cost optimization.
Automate Your ASG Rightsizing
Infralyst continuously runs these checks across every instance in your Auto Scaling Groups. When it finds savings, you're one click from a ready-to-merge Terraform PR.
Start free with 3 PRs. No credit card required · Read-only IAM role · Your team reviews and merges every change