Cloud Database Cost Triage: Storage, IOPS, CPU, Replicas
The RDS bill is higher than expected and the instinct to scale up the instance or add a replica is almost always the wrong first move. Cost spikes in cloud databases have four distinct drivers — storage, IOPS, instance class, and replicas — and each requires a different remediation. Acting on the wrong one wastes money and may make the problem worse. The right move is triage first.
Situation
AWS RDS and Aurora bill on four independent cost dimensions: storage consumed, I/O operations performed, the instance class running the engine, and the number of instances attached to the cluster. When a monthly bill grows faster than traffic, it is usually one of these dimensions accelerating — not all four simultaneously.
The problem is that Cost Explorer shows total database spend, not cost per dimension. An engineer looking at a $4,000 line item for “Amazon RDS” cannot tell whether the driver is 2 TB of unclaimed storage, a gp2 volume depleting its burst I/O credits, an over-provisioned db.r6g.2xlarge sitting at 8% CPU, or three read replicas that no longer carry meaningful traffic.
Each of those four scenarios has a different first command to run and a different remediation. Conflating them means you might rightsize the instance when the actual driver is 800 GB of dead tuples waiting on autovacuum.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
| Storage cost growing without traffic growth | AWS Cost Explorer, grouped by usage type | Table bloat, dead tuples, or log accumulation not being reclaimed |
| IOPS charges on a gp2 volume | CloudWatch VolumeReadIOPS and VolumeWriteIOPS | Burst credit balance depleted; every I/O now billed at the gp2 overage rate |
| High instance cost relative to CPU utilization | CloudWatch CPUUtilization p95 over 30 days | Instance class is over-provisioned for the actual workload |
| Replica count grew over time | RDS console — DB instances view | Replicas added reactively without a retirement policy; each one bills at primary instance rates |
| Snapshot retention set to maximum | RDS console — Maintenance and backups | Snapshots older than policy requires accumulate silently at $0.095 per GB-month |
First Five Checks
- Database and table sizes — connect to the PostgreSQL instance and run both queries. The first gives total database size; the second surfaces the top bloat candidates by table.
-- Total database size
SELECT pg_size_pretty(pg_database_size(current_database()));
-- Top 10 tables by total size (including indexes and toast)
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;
If a table’s total size is significantly larger than its live row count implies, dead tuples are accumulating. Cross-reference with pg_stat_user_tables.n_dead_tup.
- Write amplification signal from the background writer — PostgreSQL’s
pg_stat_bgwritertracks how much I/O the background writer and checkpointer are generating. Highbuffers_checkpointrelative tobuffers_cleanorbuffers_backendindicates that checkpointing is driving write I/O, not the application directly.
SELECT
checkpoints_timed,
checkpoints_req,
buffers_checkpoint,
buffers_clean,
buffers_backend,
maxwritten_clean
FROM pg_stat_bgwriter;
AWS documents that RDS gp2 volumes use a credit-based burst model. As documented in the AWS RDS storage documentation, a gp2 volume earns 3 IOPS per GB per second and can burst to 3,000 IOPS until the credit bucket empties. Once depleted, throughput drops to the baseline rate and every operation above baseline is billed at the provisioned IOPS rate. buffers_checkpoint growing while CloudWatch BurstBalance drops toward zero is the signature of this problem.
- IOPS consumption in CloudWatch — pull
VolumeReadIOPSandVolumeWriteIOPSfor the last 30 days with a 1-hour resolution. If the volume is gp2 and you see sustained IOPS above 3,000, the burst balance is gone and you are in the expensive steady state.
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name WriteIOPS \
--dimensions Name=DBInstanceIdentifier,Value=YOUR_DB_ID \
--start-time $(date -u -v-30d +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 3600 \
--statistics Average \
--output table
- CPU utilization p95 over 30 days — pull
CPUUtilizationstatistics. AWS Compute Optimizer evaluates RDS instances and flags over-provisioned instances when p99 CPU stays below 40% over a 14-day observation window. If p95 CPU is consistently below 40%, the instance is a rightsizing candidate.
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=YOUR_DB_ID \
--start-time $(date -u -v-30d +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 3600 \
--statistics p95 \
--output table
Rightsizing down one instance class (e.g., db.r6g.2xlarge to db.r6g.xlarge) typically halves the instance-hour cost while maintaining the same network and storage performance characteristics.
- Replica replication activity — query
pg_stat_replicationon the primary to see what each replica is actually doing.sent_lsnminusreplay_lsnis the replication lag in bytes. If a replica’sstateisstreamingbut it is rarely queried (verify via the replica’s ownpg_stat_activityor CloudWatchDatabaseConnections), it is a cost-only presence.
SELECT
client_addr,
application_name,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
sync_state
FROM pg_stat_replication;
For the broader question of whether read replicas are delivering value relative to their cost, see Read Replicas Are Not Free Scale — which covers the replication lag model and the routing decisions that make replicas worth keeping.
Decision Tree
flowchart TD
A[Bill spike detected] --> B{Storage cost growing?}
B -->|yes| C{Table bloat above 20%?}
C -->|yes| D[Run VACUUM or pg_repack]
C -->|no| E[Audit snapshot retention policy]
B -->|no| F{IOPS charges high?}
F -->|yes| G{gp2 burst balance depleted?}
G -->|yes| H[Migrate volume to gp3]
G -->|no| I[Check pg_stat_bgwriter for write amplification]
F -->|no| J{CPU p95 below 40%?}
J -->|yes| K[Rightsize instance class down]
J -->|no| L{CPU p95 above 70%?}
L -->|yes| M[Optimize queries or scale up]
L -->|no| N{Replica traffic justified?}
N -->|no| O[Remove idle replicas]
N -->|yes| P[No cost action needed — monitor]
Remediation Options
Option 1 — Reclaim storage from table bloat
PostgreSQL’s MVCC model retains dead tuples until autovacuum or manual vacuum cleans them. On RDS, autovacuum runs automatically but can fall behind on high-write tables. Bloat inflates pg_database_size, which directly inflates Aurora storage billing (Aurora charges per GB-month for all allocated storage, including dead tuple space).
For tables where you can tolerate a brief lock, VACUUM FULL rewrites the table and releases space to the OS. For live tables, pg_repack performs the same operation online without a full table lock.
-- Identify bloat candidates
SELECT
schemaname,
tablename,
n_dead_tup,
n_live_tup,
round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
last_autovacuum,
last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC;
-- Reclaim space (causes brief AccessExclusiveLock)
VACUUM FULL VERBOSE your_schema.your_table;
Option 2 — Migrate gp2 to gp3 for explicit IOPS control
AWS documents the gp2 volume type as a burst model: baseline throughput is 3 IOPS/GB, maximum burst is 3,000 IOPS, and burst credits replenish at 3 credits per GB per second. Once the credit bucket empties, the volume returns to baseline and sustained writes above baseline are billed at the gp2 I/O pricing tier.
gp3 eliminates the burst model. Storage and IOPS are provisioned independently: 3,000 IOPS and 125 MiB/s baseline are included at no additional cost, with additional IOPS purchasable at $0.02 per provisioned IOPS-month. For workloads that have depleted their gp2 burst balance, gp3 is typically lower cost at equivalent IOPS.
The migration is online and reversible — RDS performs it as a storage modification with no downtime required.
aws rds modify-db-instance \
--db-instance-identifier YOUR_DB_ID \
--storage-type gp3 \
--iops 3000 \
--apply-immediately
Option 3 — Rightsize the instance class
When CloudWatch CPUUtilization p95 stays below 40% over a 30-day window, the instance class is over-provisioned. AWS Compute Optimizer surfaces RDS rightsizing recommendations automatically; the recommendations include projected savings and a confidence rating based on observed utilization.
Rightsizing down one class within the same instance family (e.g., db.r6g.2xlarge to db.r6g.xlarge) retains the same memory-to-CPU ratio and network performance tier while halving instance-hour cost. Verify that the target instance class can accommodate peak connection count and memory requirements before applying.
# Apply instance class change with minimal downtime (uses MultiAZ failover if enabled)
aws rds modify-db-instance \
--db-instance-identifier YOUR_DB_ID \
--db-instance-class db.r6g.xlarge \
--apply-immediately
Option 4 — Remove idle read replicas
Each RDS or Aurora read replica is a full instance billed at the same rate as the primary. Replicas that carry negligible query traffic (verify via CloudWatch DatabaseConnections on the replica endpoint) are pure cost with no throughput benefit.
Removing a replica is a permanent action — there is no undo. If a replica might be needed for failover, promote it to a standalone instance first, then terminate the original replica relationship. If it is genuinely unused, delete it directly.
# Delete a replica with no promotion needed
aws rds delete-db-instance \
--db-instance-identifier YOUR_REPLICA_ID \
--skip-final-snapshot
Rollback Plan
- Storage VACUUM FULL — not reversible in the traditional sense; the operation releases space. If the lock causes application errors, monitor
pg_stat_activityfor blocking queries. Preferpg_repackon production tables to avoid the lock. - gp2 to gp3 migration — reversible. AWS allows reverting a gp3 volume back to gp2 via another storage modification. Monitor CloudWatch
WriteLatencyandReadLatencyafter the change; if latency increases, revert. - Instance class rightsize — reversible. Scale back up via
modify-db-instance. If using Multi-AZ, the downtime is a failover window (typically under 60 seconds). MonitorDatabaseConnections,FreeableMemory, andCPUUtilizationfor 48 hours after the change. - Replica removal — not reversible. A deleted replica cannot be re-attached. Create a new replica from scratch if needed. Before deleting, capture the replica’s CloudWatch
DatabaseConnectionsover the last 30 days to confirm it was idle.
Automation Opportunity
Cost anomaly detection in AWS Cost Explorer can alert when RDS spend deviates from a predicted baseline. Set a threshold of 10–15% above the trailing 30-day average for the database service line; this catches storage growth and IOPS spikes before the end-of-month invoice.
AWS Compute Optimizer generates RDS rightsizing recommendations on a rolling basis. Export the recommendations weekly via the Compute Optimizer API and route flagged instances to a Slack channel or ticket queue for review. The documented API call is straightforward:
aws compute-optimizer get-rds-database-recommendations \
--filters name=Finding,values=Overprovisioned \
--output json
For replica auditing, a scheduled PostgreSQL query on the primary that writes pg_stat_replication state and replica endpoint DatabaseConnections to a monitoring table gives a weekly audit trail. Flag replicas where the rolling 7-day average connection count on the replica endpoint is below five; those are candidates for removal review.
Leadership Summary
- What broke: The RDS billing line grew faster than traffic because one or more of four cost dimensions — storage bloat, IOPS burst depletion, over-provisioned instance class, or idle replicas — was not monitored against a policy.
- What was done: Each dimension was triaged in order using documented CloudWatch metrics and PostgreSQL system catalog queries; the offending dimension was identified and remediated with a reversible change.
- What prevents recurrence: Compute Optimizer rightsizing alerts, Cost Explorer anomaly detection, and a monthly replica audit ensure each dimension is reviewed before it compounds.
Checklist
- Pull AWS Cost Explorer grouped by RDS usage type to identify which billing dimension is growing.
- Run
SELECT pg_size_pretty(pg_database_size(current_database()))on each RDS instance to establish a storage baseline. - Query
pg_stat_user_tablesfor tables with dead tuple percentages above 20%; scheduleVACUUM FULLorpg_repackfor the top offenders. - Check CloudWatch
BurstBalanceon any gp2 volume; if it is below 50% and trending down, plan a gp3 migration. - Pull 30-day
VolumeWriteIOPSwith 1-hour resolution; compare to gp2 baseline rate for the volume size. - Query
pg_stat_bgwriterto detect write amplification from checkpoint pressure; tunecheckpoint_completion_targetandmax_wal_sizeifcheckpoints_reqis high. - Pull 30-day
CPUUtilizationp95; flag any instance where p95 is below 40% as an over-provisioning candidate. - Review AWS Compute Optimizer recommendations for the RDS cluster; document each flagged instance and projected savings.
- Query
pg_stat_replicationon the primary and cross-reference replica endpointDatabaseConnectionsto identify replicas with no meaningful traffic. - Remove or repurpose idle replicas after confirming they are not required for failover topology.
- Set snapshot retention to match the recovery point objective in the database’s SLA; remove retention beyond policy.
- Enable Cost Explorer anomaly detection for the RDS service line at a 10–15% deviation threshold.
What to Do Next
- Problem: An RDS bill spike triggers the instinct to scale the instance or add replicas — changes that are expensive, slow to take effect, and often targeting the wrong cost dimension entirely.
- Solution: Triage the four cost dimensions in order — storage bloat, IOPS burst depletion, over-provisioned instance class, idle replicas — using CloudWatch metrics and PostgreSQL system catalog queries before making any change.
- Proof: A specific dimension is identified as the driver, a targeted remediation is applied, and the next month’s Cost Explorer line for that dimension is lower — without touching the dimensions that were not the cause.
- Action: This week, enable AWS Compute Optimizer for your RDS instances and set a Cost Explorer anomaly detection alert at 15% above your 30-day RDS baseline — both are free to configure and will surface the next cost spike before it compounds.