Best Practices for Downsizing EC2 Instances

Safe, practical methods for downsizing EC2 within the same family using CPU and memory percentiles, lookback windows, and simple guardrails.


Introduction

Downsizing is the quickest way to reduce EC2 spend while keeping performance steady. This post focuses on downsizing within the same instance family, one size at a time, using conservative thresholds.

The goal is simple: gather enough history, read the right metrics, and apply a few safety checks so you can shrink instances without surprises.

How long to observe

Use at least 30 days of data, preferably 90. If you can, also scan 365 days to catch seasonal peaks. Make sure the instance actually ran for more than half of the window so the data is representative. A useful practice is to examine the worst 7-day block inside your window and confirm that it still meets the CPU thresholds below. If you care about short spikes, enable detailed monitoring (1-minute resolution); basic monitoring is 5-minute and can smooth over brief peaks.
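
The worst-week check above is easy to script. A minimal sketch, assuming you have already exported daily CPU averages from CloudWatch into a plain list (the function name and input shape are illustrative, not from any AWS API):

```python
# Find the worst (highest-average) contiguous 7-day block in a daily CPU series.
def worst_week_avg(daily_cpu_avg, block_days=7):
    """daily_cpu_avg: daily average CPU percentages over the lookback window."""
    if len(daily_cpu_avg) < block_days:
        raise ValueError("need at least one full block of data")
    worst = 0.0
    for i in range(len(daily_cpu_avg) - block_days + 1):
        block = daily_cpu_avg[i:i + block_days]
        worst = max(worst, sum(block) / block_days)
    return worst

# Example: a mostly idle 90-day window hiding one busy week.
series = [10.0] * 30 + [60.0] * 7 + [12.0] * 53
print(worst_week_avg(series))  # 60.0 -- the busy week dominates
```

If the worst week's average is well above the overall average, a short lookback is hiding a peak and you should not downsize yet.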

CPU first

CPU is the most common starting signal, but don't ignore the default EC2 metrics you already have (network, disk ops/bytes, and status checks).

For a safe downsize, keep average CPU under 20 percent across the lookback. Check the high end too: p95 should be at or below 50 percent, p99 at or below 70 percent, and the maximum should not exceed about 85 percent once you exclude obvious reboot or deploy blips.

If the worst 7-day block violates these numbers, wait or collect more data.
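
Put together, the CPU checks look like this. A rough sketch with the thresholds from above; the function name is illustrative, the percentile is a crude nearest-index estimate, and you should exclude reboot or deploy blips from the samples before calling it:

```python
# CPU safety gate: average under 20%, p95 <= 50%, p99 <= 70%, max <= 85%.
def cpu_safe_to_downsize(samples):
    """samples: CPU utilization percentages across the lookback window."""
    s = sorted(samples)

    def pct(p):  # crude nearest-index percentile on the sorted samples
        return s[min(len(s) - 1, int(p / 100 * len(s)))]

    avg = sum(s) / len(s)
    return avg < 20 and pct(95) <= 50 and pct(99) <= 70 and s[-1] <= 85

# Mostly idle with a few moderate spikes: passes every threshold.
print(cpu_safe_to_downsize([15.0] * 95 + [45.0] * 4 + [65.0]))  # True
# Regular heavy bursts push the average and max too high.
print(cpu_safe_to_downsize([15.0] * 90 + [90.0] * 10))          # False
```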

CPUUtilization is across all vCPUs (not per core). When you downsize vCPUs, the same absolute workload will show higher CPUUtilization. A simple mental model: if an instance has 2 vCPUs and sits at ~20% CPU, downsizing to 1 vCPU with the same workload will often land around ~40% CPU. It won't be exact (bursts, IO wait, scheduling), but it's a useful first approximation.
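This first approximation is just proportional scaling, which is worth writing down explicitly:

```python
# First-approximation model: CPUUtilization averages across all vCPUs, so the
# same absolute workload on fewer vCPUs shows proportionally higher utilization.
def projected_cpu(current_pct, current_vcpus, target_vcpus):
    return current_pct * current_vcpus / target_vcpus

print(projected_cpu(20.0, 2, 1))  # 40.0 -- the 2-vCPU example above
```

If the projected value would already breach the p95/p99 thresholds, skip the downsize even though current CPU looks low.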

Memory matters

EC2 does not publish memory usage by default, so you only have this signal if you’ve installed an agent. Downsizing without understanding memory usage is risky because a smaller size reduces RAM.

If you have memory metrics, treat p95 under 70 percent and p99 under 85 percent as healthy, and watch for sustained swapping (swap in/out activity or consistently high swap used), which is a red flag for downsizing.

If you do not have memory metrics, proceed only when CPU looks very low and stable, and include a clear note that memory was not observed. Many teams add the CloudWatch Agent later to upgrade this check.
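
A sketch of the memory gate, assuming you collect memory-used-percent percentiles and swap activity via the CloudWatch Agent (the parameter names here are illustrative, not agent field names):

```python
# Memory safety gate: p95 under 70%, p99 under 85%, and no sustained swapping.
def memory_safe_to_downsize(mem_p95, mem_p99, sustained_swapping):
    """mem_p95 / mem_p99: memory used percent at those percentiles."""
    return mem_p95 < 70 and mem_p99 < 85 and not sustained_swapping

print(memory_safe_to_downsize(62.0, 78.0, False))  # True: healthy headroom
print(memory_safe_to_downsize(62.0, 78.0, True))   # False: swapping is a red flag
```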

Disk and network guardrails

A smaller instance can change network and EBS ceilings. Use simple guardrails to avoid crossing obvious limits.

For EBS volumes, watch for signs of saturation using EBS volume metrics (AWS/EBS), not just CPU. Look for sustained high throughput or IOPS, and any persistent queueing. If you’re on gp2/st1/sc1, also monitor BurstBalance to make sure you are not being throttled by burst credit depletion. Finally, check the target instance type's EBS throughput limit and make sure your observed p99 EBS throughput stays comfortably below it (use the EBS volume read/write bytes metrics to estimate this).

For network, sanity-check NetworkIn/NetworkOut and packet rates for sustained high utilization.
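
One way to make these guardrails concrete is to compare observed p99 values against the target type's documented ceilings with a headroom margin. The limits below are hand-written placeholders; look up real numbers in the EC2 instance type documentation rather than trusting this sketch:

```python
# Guardrail: observed p99 must stay under a fraction of each target ceiling.
def fits_ceilings(observed_p99, target_limits, headroom=0.8):
    """Both dicts are keyed by metric name, e.g. 'ebs_mib_s', 'network_gbit_s'.
    Returns True only if every observed p99 sits below headroom * ceiling."""
    return all(observed_p99[k] <= headroom * target_limits[k]
               for k in observed_p99)

observed = {"ebs_mib_s": 220.0, "network_gbit_s": 1.2}
target = {"ebs_mib_s": 593.75, "network_gbit_s": 10.0}  # illustrative values
print(fits_ceilings(observed, target))  # True: comfortable headroom
```

The 80 percent headroom factor is deliberately conservative; tighten or relax it to match your risk tolerance.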

Burstable instances

T family instances accumulate CPU credits during low usage and spend them during bursts. Before downsizing a T instance, confirm CPUCreditBalance isn't trending toward zero and that CPUSurplusCreditCharged stays at zero across the lookback. A workload that lives on constant bursting is a poor downsizing candidate.
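
A rough sketch of those two checks. The real signals are the CPUCreditBalance and CPUSurplusCreditCharged metrics; the input shapes and the trend heuristic here are illustrative assumptions:

```python
# T-family gate: surplus charges must be zero, and the credit balance should
# not be draining across the lookback window.
def burst_credits_healthy(credit_balance_series, surplus_charged_total):
    if surplus_charged_total > 0:  # paid for unlimited-mode bursting
        return False
    # Crude trend check: compare the second half of the window to the first.
    half = len(credit_balance_series) // 2
    first = sum(credit_balance_series[:half]) / half
    second = sum(credit_balance_series[half:]) / (len(credit_balance_series) - half)
    return second >= 0.5 * first  # balance is not collapsing toward zero

print(burst_credits_healthy([140, 142, 138, 141], 0))  # True: stable balance
print(burst_credits_healthy([140, 120, 60, 10], 0))    # False: draining fast
```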

Seasonality and recent changes

Yearly peaks, monthly jobs, and end-of-quarter spikes can hide in shorter windows. That is why a 90-day lookback plus a 365-day scan is ideal when available. Also avoid acting within about 14 days of a major deploy or a recent upsizing, since metrics can be noisy or unrepresentative.

When not to downsize

Do not shrink instances that are memory bound even if CPU is idle. Do not move to a size that removes required instance store NVMe, reduces ENI count below what you use, or lowers network or EBS ceilings below observed p95 needs.

If you rely on instance store, downsizing can destroy that data. Treat it as ephemeral and migrate anything important first.

If any single safeguard fails, keep the current size and recheck next cycle.

Instance type constraints to verify

Before you downsize, confirm the target type does not remove something your workload relies on.

Common gotchas include CPU architecture (x86 vs ARM), required instruction set or licensing constraints, loss of instance store, fewer ENIs or lower network bandwidth, and lower EBS bandwidth or IOPS ceilings.

If you are moving between generations within a family, also confirm compatibility with Nitro and any networking features you depend on.
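
A checklist like this is easy to encode. The spec dicts below are hand-written placeholders for two hypothetical types; in practice you would pull real capability data from `aws ec2 describe-instance-types`:

```python
# Compare a few capability fields of the current vs the target instance type.
def compatible(current, target, enis_in_use, uses_instance_store):
    if target["arch"] != current["arch"]:
        return False, "CPU architecture differs"
    if uses_instance_store and not target["instance_store"]:
        return False, "target removes instance store"
    if enis_in_use > target["max_enis"]:
        return False, "not enough ENIs"
    return True, "ok"

# Illustrative specs -- verify real values before acting on them.
m5_xlarge = {"arch": "x86_64", "instance_store": False, "max_enis": 4}
m5_large = {"arch": "x86_64", "instance_store": False, "max_enis": 3}
print(compatible(m5_xlarge, m5_large, enis_in_use=2, uses_instance_store=False))
```

Extend the dicts with whatever your workload depends on (EBS bandwidth, NVMe, accelerators); the point is that each constraint becomes an explicit, reviewable check.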

Rollout and rollback

Treat a downsize like any other production change. Apply one size reduction at a time, change during a low risk window, and watch service level signals (latency, error rate, queue depth) alongside CPU and memory immediately after.

Have a fast rollback plan to revert to the previous type if you see sustained regressions, and avoid stacking multiple changes (deploys, config changes, downsizes) in the same window.

Conclusion

Keep downsizing simple and conservative. Use 30 to 90 days of history, validate CPU with averages and percentiles, confirm memory headroom when available, and apply a few disk and network guardrails.

One size down within the same family is usually enough to capture savings without drama. Revisit the data monthly and tighten the process as your observability improves.
