Best Practices for EC2 Rightsizing
A practical guide to EC2 rightsizing within the same instance family: how much data you need, what thresholds to use, and the guardrails that keep you out of trouble.

Introduction
If you're looking to reduce EC2 costs, downsizing oversized instances is the lowest-risk place to start: no architecture changes, no migration, no AWS cost optimization tool required, just a smaller instance type within the same family.
This post covers how we think about EC2 rightsizing: how much data you need, what to measure, and the guardrails that prevent bad surprises.
Slightly overprovisioned is fine
The goal of EC2 rightsizing isn't to run your instances hot. Some headroom is good. It absorbs traffic spikes, deploy day load, and unplanned surges. The goal is to find instances with way too much headroom and bring them down one notch.
Going down one size within the same family typically halves CPU and memory. An m6i.xlarge (4 vCPUs, 16 GB) becomes an m6i.large (2 vCPUs, 8 GB). There are exceptions, so always check the specific types you're working with, but halving is a useful mental model. It also explains why the thresholds below are what they are: you need enough room for the workload to fit comfortably after losing half its resources.
Pre-flight checks
Before looking at metrics, rule out instances that can't be downsized. Skip anything with instance store volumes (that ephemeral storage gets destroyed on resize) and anything already at the smallest size in its family.
How much data you need
Use at least 30 days of CloudWatch data, preferably 60. You can look back up to 365 days to catch seasonal patterns, but anything older than a year is stale. Make sure the instance actually ran for at least 95% of the observation window so the data represents a real workload, not an instance that was mostly idle.
If you care about short spikes, enable detailed monitoring (1-minute intervals). Basic monitoring uses 5-minute intervals and can smooth over brief peaks.
Avoid acting on data from the first 14 days after a major deploy or a recent upsize. Those metrics won't be representative.
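The 95% uptime requirement can be checked mechanically by comparing how many CloudWatch datapoints came back against how many the window should contain. This is a sketch under assumed names; the function and thresholds are ours, not part of any AWS SDK.

```python
from datetime import timedelta

def sufficient_coverage(datapoint_count: int, window: timedelta,
                        period_seconds: int = 300, min_ratio: float = 0.95) -> bool:
    """Did the instance actually run for >= 95% of the observation window?

    datapoint_count: CloudWatch datapoints returned for the window
    period_seconds: 300 for basic monitoring, 60 for detailed monitoring
    """
    expected = window.total_seconds() / period_seconds
    return datapoint_count / expected >= min_ratio

# 60 days of 5-minute datapoints = 17,280 expected
print(sufficient_coverage(17000, timedelta(days=60)))  # True (~98% coverage)
print(sufficient_coverage(9000, timedelta(days=60)))   # False (~52%)
```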
CPU
Use P99.5 CPU utilization as your primary signal. Most AWS rightsizing advice focuses on averages or P95, but these can miss the peaks that actually cause problems after a downsize. If P99.5 is below 40 percent across the observation window, the instance is a reasonable candidate.
Why P99.5 specifically? It captures nearly all real usage while filtering out the handful of one-off spikes from restarts or deploys that don't reflect actual load. It's less noisy than raw max (which overreacts to blips) but doesn't smooth over the peaks that matter.
Since going down a size roughly doubles utilization, an instance sitting at 40% P99.5 will peak around 80% on the smaller instance. The average will typically be much lower, so there's still headroom for normal variation.
It's worth tracking the average alongside P99.5. The average gives you a feel for typical load, which is useful context when reviewing a change, but it shouldn't be what drives the decision.
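A simple nearest-rank percentile is enough to apply the P99.5 rule, and a synthetic series shows why it beats raw max: a handful of deploy-time spikes dominate the max but barely move P99.5. The data below is made up for illustration.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: small, dependency-free, fine for threshold checks."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Synthetic 5-minute CPU samples: steady ~25%, a few one-off spikes to 90%.
cpu = [25.0] * 990 + [35.0] * 5 + [90.0] * 5

p995 = percentile(cpu, 99.5)
print(f"P99.5 = {p995}%, raw max = {max(cpu)}%")    # P99.5 = 35.0%, raw max = 90.0%
print("candidate" if p995 < 40.0 else "keep size")  # candidate: ~70% projected peak after downsize
```

Raw max would veto this instance at 90%; P99.5 correctly ignores the five spike samples and clears it at 35%.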
Memory
The same threshold applies to memory: P99.5 below 40 percent.
EC2 doesn't publish memory usage by default. You need an agent like the CloudWatch Agent to get this data. If you don't have it, you can still downsize based on CPU alone, but you should be aware that you're flying partially blind. A smaller instance means less RAM, and there's no metric telling you whether that matters.
Even when P99.5 looks fine, watch for sustained swapping (swap in/out activity or consistently high swap used). Swapping means the instance is already under memory pressure regardless of what the percentage says.
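Both memory conditions can be combined into one gate. This sketch assumes the CloudWatch Agent is publishing memory and swap metrics, and the "more than 5% of intervals" cutoff for "sustained" swapping is our assumption, not an AWS guideline.

```python
def memory_downsize_ok(mem_p995: float, swap_in_bytes: list[float],
                       threshold: float = 40.0) -> bool:
    """Memory-side check: P99.5 under the threshold AND no sustained swapping.
    'Sustained' = swap activity in more than 5% of intervals (assumed cutoff)."""
    swapping_intervals = sum(1 for v in swap_in_bytes if v > 0)
    sustained_swap = swapping_intervals > len(swap_in_bytes) * 0.05
    return mem_p995 < threshold and not sustained_swap

print(memory_downsize_ok(32.0, [0.0] * 100))                 # True
print(memory_downsize_ok(32.0, [0.0] * 80 + [4096.0] * 20))  # False: swapping despite low %
```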
Seasonal spike detection
CPU and memory thresholds tell you whether the instance is oversized right now. But some workloads have predictable peaks that only show up at certain times of year: end of quarter traffic, holiday surges, annual batch jobs, or recurring monthly spikes.
If you have 12 months of CloudWatch data, check whether any recurring peak would push CPU or memory past the threshold on a smaller instance. If it would, hold off on the downsize. If you have less than 12 months, just be aware of the blind spot and check again after you've accumulated more history.
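The seasonal check reduces to scanning per-month peaks against the same threshold. The data and function name below are illustrative; per-month P99.5 values would come from your monitoring history.

```python
def risky_months(monthly_peak_p995: dict[str, float],
                 threshold: float = 40.0) -> list[str]:
    """Return months whose P99.5 peak would breach the threshold -- i.e. would
    roughly double past safe headroom on a smaller instance. Empty list means
    no seasonal blocker. Needs ~12 months of history to be meaningful."""
    return [month for month, peak in monthly_peak_p995.items() if peak >= threshold]

peaks = {"Jan": 22.0, "Feb": 21.0, "Mar": 24.0, "Nov": 55.0, "Dec": 58.0}
print(risky_months(peaks))  # ['Nov', 'Dec'] -> hold off on the downsize
```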
Disk and network guardrails
A smaller instance can come with lower network and EBS ceilings.
For EBS, watch for saturation using EBS volume metrics (AWS/EBS), not just CPU. Look at sustained high throughput or IOPS and any persistent queueing. If you're on gp2/st1/sc1, check BurstBalance for burst credit depletion. Verify that the target instance type's EBS throughput limit is comfortably above your observed peak throughput.
For network, check NetworkIn/NetworkOut and packet rates for sustained high utilization.
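The throughput guardrail is a one-line comparison once you've looked up the target type's limits. The 80% margin below is our assumption of "comfortably above", not an AWS recommendation, and the numbers are placeholders.

```python
def throughput_headroom_ok(observed_peak_mbps: float, target_limit_mbps: float,
                           margin: float = 0.8) -> bool:
    """Check the target type's EBS (or network) ceiling against observed peak.
    'Comfortably above' is taken as peak <= 80% of the limit (assumed margin)."""
    return observed_peak_mbps <= target_limit_mbps * margin

# Placeholder numbers -- look up the real limits for your target instance type.
print(throughput_headroom_ok(observed_peak_mbps=400.0, target_limit_mbps=650.0))  # True
print(throughput_headroom_ok(observed_peak_mbps=600.0, target_limit_mbps=650.0))  # False
```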
Burstable instances (T family)
This section only applies to T family instances (t3, t3a, t4g, etc.). If your instances aren't T family, skip ahead.
T instances accumulate CPU credits during low usage and spend them during bursts. Before downsizing a T instance, confirm CPUCreditBalance isn't trending toward zero and that CPUSurplusCreditCharged stays at zero across the lookback. If a workload lives on constant burst credits, it's not a downsizing candidate.
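The two burstable conditions can be checked like this. The trend test is a crude first-half vs second-half average, a sketch rather than a forecast; the 50% collapse cutoff is an assumption.

```python
def t_family_ok(credit_balance: list[float], surplus_charged: list[float]) -> bool:
    """Burstable check over the lookback: no surplus credits charged anywhere,
    and CPUCreditBalance not collapsing toward zero."""
    if any(v > 0 for v in surplus_charged):
        return False  # workload already spends credits it doesn't earn
    half = len(credit_balance) // 2
    early = sum(credit_balance[:half]) / half
    late = sum(credit_balance[half:]) / (len(credit_balance) - half)
    return late >= early * 0.5  # assumed cutoff for "trending toward zero"

print(t_family_ok([500.0] * 10, [0.0] * 10))                  # True: stable balance
print(t_family_ok([500, 400, 300, 200, 100, 50], [0.0] * 6))  # False: draining
```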
Rollout and rollback
Treat a downsize like any other production change: one size down at a time, in a low-risk window. Watch CPU and memory right after to make sure the smaller instance is handling the load.
Have a rollback plan ready. If you manage infrastructure with Terraform, that's just reverting the instance type in code and applying. Keep the revert PR ready before you ship the downsize so you can move fast if something looks wrong. Don't stack multiple changes (deploys, config changes, downsizes) in the same window.
If any single check in this post fails, keep the current size and look again next cycle.
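The all-or-nothing rule can be expressed as a tiny gate over the individual checks; the names below are just labels for the checks described in this post.

```python
def decide(checks: dict[str, bool]) -> str:
    """All-or-nothing gate: every check must pass or the instance keeps its size."""
    failed = [name for name, ok in checks.items() if not ok]
    return "downsize one step" if not failed else f"keep size (failed: {', '.join(failed)})"

print(decide({"cpu": True, "memory": True, "seasonal": True, "ebs": True, "network": True}))
print(decide({"cpu": True, "memory": False, "seasonal": True, "ebs": True, "network": True}))
```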
Conclusion
EC2 cost optimization doesn't need to be complicated. Get at least 30 days of data, check that CPU and memory P99.5 are both under 40 percent, glance at a year of history for seasonal spikes, and verify your disk and network limits. That covers most of what any EC2 rightsizing tool should be doing under the hood.
One size down within the same family is usually enough to capture real savings without breaking anything. If you're managing infrastructure with Terraform, that makes the whole process even simpler: Terraform cost optimization is just changing an instance type in code, reviewing the metrics, and merging. Check back monthly as your monitoring improves.
Automate Your EC2 Rightsizing
Infralyst continuously runs every check you just learned. When it finds savings, you're one click from a ready-to-merge Terraform PR.
Start free with 3 PRs
No credit card required · Read-only IAM role · Your team reviews and merges every change