Cloud Database Cost Triage: Storage, IOPS, CPU, Replicas

The RDS bill is higher than expected and the instinct to scale up the instance or add a replica is almost always the wrong first move. Cost spikes in cloud databases have four distinct drivers — storage, IOPS, instance class, and replicas — and each requires a different remediation. Acting on the wrong one wastes money and may make the problem worse. The right move is triage first.

Situation

AWS RDS and Aurora bill on four independent cost dimensions: storage consumed, I/O operations performed, the instance class running the engine, and the number of instances attached to the cluster. When a monthly bill grows faster than traffic, it is usually one of these dimensions accelerating — not all four simultaneously.

The problem is that Cost Explorer shows total database spend, not cost per dimension. An engineer looking at a $4,000 line item for “Amazon RDS” cannot tell whether the driver is 2 TB of unclaimed storage, a gp2 volume depleting its burst I/O credits, an over-provisioned db.r6g.2xlarge sitting at 8% CPU, or three read replicas that no longer carry meaningful traffic.

Each of those four scenarios has a different first command to run and a different remediation. Conflating them means you might rightsize the instance when the actual driver is 800 GB of dead tuples waiting on autovacuum.

Symptoms

Signal	Where to see it	What it means
Storage cost growing without traffic growth	AWS Cost Explorer, grouped by usage type	Table bloat, dead tuples, or log accumulation not being reclaimed
IOPS charges on a gp2 volume	CloudWatch `VolumeReadIOPS` and `VolumeWriteIOPS`	Burst credit balance depleted; every I/O now billed at the gp2 overage rate
High instance cost relative to CPU utilization	CloudWatch `CPUUtilization` p95 over 30 days	Instance class is over-provisioned for the actual workload
Replica count grew over time	RDS console — DB instances view	Replicas added reactively without a retirement policy; each one bills at primary instance rates
Snapshot retention set to maximum	RDS console — Maintenance and backups	Snapshots older than policy requires accumulate silently at $0.095 per GB-month

First Five Checks

Database and table sizes — connect to the PostgreSQL instance and run both queries. The first gives total database size; the second surfaces the top bloat candidates by table.

-- Total database size
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Top 10 tables by total size (including indexes and toast)
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
  pg_size_pretty(pg_relation_size(schemaname || '.' || tablename))       AS table_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC
LIMIT 10;

If a table’s total size is significantly larger than its live row count implies, dead tuples are accumulating. Cross-reference with pg_stat_user_tables.n_dead_tup.

Write amplification signal from the background writer — PostgreSQL’s pg_stat_bgwriter tracks how much I/O the background writer and checkpointer are generating. High buffers_checkpoint relative to buffers_clean or buffers_backend indicates that checkpointing is driving write I/O, not the application directly.

SELECT
  checkpoints_timed,
  checkpoints_req,
  buffers_checkpoint,
  buffers_clean,
  buffers_backend,
  maxwritten_clean
FROM pg_stat_bgwriter;

AWS documents that RDS gp2 volumes use a credit-based burst model. As documented in the AWS RDS storage documentation, a gp2 volume earns 3 IOPS per GB per second and can burst to 3,000 IOPS until the credit bucket empties. Once depleted, throughput drops to the baseline rate and every operation above baseline is billed at the provisioned IOPS rate. buffers_checkpoint growing while CloudWatch BurstBalance drops toward zero is the signature of this problem.

IOPS consumption in CloudWatch — pull VolumeReadIOPS and VolumeWriteIOPS for the last 30 days with a 1-hour resolution. If the volume is gp2 and you see sustained IOPS above 3,000, the burst balance is gone and you are in the expensive steady state.

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name WriteIOPS \
  --dimensions Name=DBInstanceIdentifier,Value=YOUR_DB_ID \
  --start-time $(date -u -v-30d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --statistics Average \
  --output table

CPU utilization p95 over 30 days — pull CPUUtilization statistics. AWS Compute Optimizer evaluates RDS instances and flags over-provisioned instances when p99 CPU stays below 40% over a 14-day observation window. If p95 CPU is consistently below 40%, the instance is a rightsizing candidate.

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=YOUR_DB_ID \
  --start-time $(date -u -v-30d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 3600 \
  --statistics p95 \
  --output table

Rightsizing down one instance class (e.g., db.r6g.2xlarge to db.r6g.xlarge) typically halves the instance-hour cost while maintaining the same network and storage performance characteristics.

Replica replication activity — query pg_stat_replication on the primary to see what each replica is actually doing. sent_lsn minus replay_lsn is the replication lag in bytes. If a replica’s state is streaming but it is rarely queried (verify via the replica’s own pg_stat_activity or CloudWatch DatabaseConnections), it is a cost-only presence.

SELECT
  client_addr,
  application_name,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  sync_state
FROM pg_stat_replication;

For the broader question of whether read replicas are delivering value relative to their cost, see Read Replicas Are Not Free Scale — which covers the replication lag model and the routing decisions that make replicas worth keeping.

Decision Tree

flowchart TD
    A[Bill spike detected] --> B{Storage cost growing?}
    B -->|yes| C{Table bloat above 20%?}
    C -->|yes| D[Run VACUUM or pg_repack]
    C -->|no| E[Audit snapshot retention policy]
    B -->|no| F{IOPS charges high?}
    F -->|yes| G{gp2 burst balance depleted?}
    G -->|yes| H[Migrate volume to gp3]
    G -->|no| I[Check pg_stat_bgwriter for write amplification]
    F -->|no| J{CPU p95 below 40%?}
    J -->|yes| K[Rightsize instance class down]
    J -->|no| L{CPU p95 above 70%?}
    L -->|yes| M[Optimize queries or scale up]
    L -->|no| N{Replica traffic justified?}
    N -->|no| O[Remove idle replicas]
    N -->|yes| P[No cost action needed — monitor]

Remediation Options

Option 1 — Reclaim storage from table bloat

PostgreSQL’s MVCC model retains dead tuples until autovacuum or manual vacuum cleans them. On RDS, autovacuum runs automatically but can fall behind on high-write tables. Bloat inflates pg_database_size, which directly inflates Aurora storage billing (Aurora charges per GB-month for all allocated storage, including dead tuple space).

For tables where you can tolerate a brief lock, VACUUM FULL rewrites the table and releases space to the OS. For live tables, pg_repack performs the same operation online without a full table lock.

-- Identify bloat candidates
SELECT
  schemaname,
  tablename,
  n_dead_tup,
  n_live_tup,
  round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
  last_autovacuum,
  last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC;

-- Reclaim space (causes brief AccessExclusiveLock)
VACUUM FULL VERBOSE your_schema.your_table;

Option 2 — Migrate gp2 to gp3 for explicit IOPS control

AWS documents the gp2 volume type as a burst model: baseline throughput is 3 IOPS/GB, maximum burst is 3,000 IOPS, and burst credits replenish at 3 credits per GB per second. Once the credit bucket empties, the volume returns to baseline and sustained writes above baseline are billed at the gp2 I/O pricing tier.

gp3 eliminates the burst model. Storage and IOPS are provisioned independently: 3,000 IOPS and 125 MiB/s baseline are included at no additional cost, with additional IOPS purchasable at $0.02 per provisioned IOPS-month. For workloads that have depleted their gp2 burst balance, gp3 is typically lower cost at equivalent IOPS.

The migration is online and reversible — RDS performs it as a storage modification with no downtime required.

aws rds modify-db-instance \
  --db-instance-identifier YOUR_DB_ID \
  --storage-type gp3 \
  --iops 3000 \
  --apply-immediately

Option 3 — Rightsize the instance class

When CloudWatch CPUUtilization p95 stays below 40% over a 30-day window, the instance class is over-provisioned. AWS Compute Optimizer surfaces RDS rightsizing recommendations automatically; the recommendations include projected savings and a confidence rating based on observed utilization.

Rightsizing down one class within the same instance family (e.g., db.r6g.2xlarge to db.r6g.xlarge) retains the same memory-to-CPU ratio and network performance tier while halving instance-hour cost. Verify that the target instance class can accommodate peak connection count and memory requirements before applying.

# Apply instance class change with minimal downtime (uses MultiAZ failover if enabled)
aws rds modify-db-instance \
  --db-instance-identifier YOUR_DB_ID \
  --db-instance-class db.r6g.xlarge \
  --apply-immediately

Option 4 — Remove idle read replicas

Each RDS or Aurora read replica is a full instance billed at the same rate as the primary. Replicas that carry negligible query traffic (verify via CloudWatch DatabaseConnections on the replica endpoint) are pure cost with no throughput benefit.

Removing a replica is a permanent action — there is no undo. If a replica might be needed for failover, promote it to a standalone instance first, then terminate the original replica relationship. If it is genuinely unused, delete it directly.

# Delete a replica with no promotion needed
aws rds delete-db-instance \
  --db-instance-identifier YOUR_REPLICA_ID \
  --skip-final-snapshot

Rollback Plan

Storage VACUUM FULL — not reversible in the traditional sense; the operation releases space. If the lock causes application errors, monitor pg_stat_activity for blocking queries. Prefer pg_repack on production tables to avoid the lock.
gp2 to gp3 migration — reversible. AWS allows reverting a gp3 volume back to gp2 via another storage modification. Monitor CloudWatch WriteLatency and ReadLatency after the change; if latency increases, revert.
Instance class rightsize — reversible. Scale back up via modify-db-instance. If using Multi-AZ, the downtime is a failover window (typically under 60 seconds). Monitor DatabaseConnections, FreeableMemory, and CPUUtilization for 48 hours after the change.
Replica removal — not reversible. A deleted replica cannot be re-attached. Create a new replica from scratch if needed. Before deleting, capture the replica’s CloudWatch DatabaseConnections over the last 30 days to confirm it was idle.

Automation Opportunity

Cost anomaly detection in AWS Cost Explorer can alert when RDS spend deviates from a predicted baseline. Set a threshold of 10–15% above the trailing 30-day average for the database service line; this catches storage growth and IOPS spikes before the end-of-month invoice.

AWS Compute Optimizer generates RDS rightsizing recommendations on a rolling basis. Export the recommendations weekly via the Compute Optimizer API and route flagged instances to a Slack channel or ticket queue for review. The documented API call is straightforward:

aws compute-optimizer get-rds-database-recommendations \
  --filters name=Finding,values=Overprovisioned \
  --output json

For replica auditing, a scheduled PostgreSQL query on the primary that writes pg_stat_replication state and replica endpoint DatabaseConnections to a monitoring table gives a weekly audit trail. Flag replicas where the rolling 7-day average connection count on the replica endpoint is below five; those are candidates for removal review.

Leadership Summary

What broke: The RDS billing line grew faster than traffic because one or more of four cost dimensions — storage bloat, IOPS burst depletion, over-provisioned instance class, or idle replicas — was not monitored against a policy.
What was done: Each dimension was triaged in order using documented CloudWatch metrics and PostgreSQL system catalog queries; the offending dimension was identified and remediated with a reversible change.
What prevents recurrence: Compute Optimizer rightsizing alerts, Cost Explorer anomaly detection, and a monthly replica audit ensure each dimension is reviewed before it compounds.

Checklist

Pull AWS Cost Explorer grouped by RDS usage type to identify which billing dimension is growing.
Run SELECT pg_size_pretty(pg_database_size(current_database())) on each RDS instance to establish a storage baseline.
Query pg_stat_user_tables for tables with dead tuple percentages above 20%; schedule VACUUM FULL or pg_repack for the top offenders.
Check CloudWatch BurstBalance on any gp2 volume; if it is below 50% and trending down, plan a gp3 migration.
Pull 30-day VolumeWriteIOPS with 1-hour resolution; compare to gp2 baseline rate for the volume size.
Query pg_stat_bgwriter to detect write amplification from checkpoint pressure; tune checkpoint_completion_target and max_wal_size if checkpoints_req is high.
Pull 30-day CPUUtilization p95; flag any instance where p95 is below 40% as an over-provisioning candidate.
Review AWS Compute Optimizer recommendations for the RDS cluster; document each flagged instance and projected savings.
Query pg_stat_replication on the primary and cross-reference replica endpoint DatabaseConnections to identify replicas with no meaningful traffic.
Remove or repurpose idle replicas after confirming they are not required for failover topology.
Set snapshot retention to match the recovery point objective in the database’s SLA; remove retention beyond policy.
Enable Cost Explorer anomaly detection for the RDS service line at a 10–15% deviation threshold.

What to Do Next

Problem: An RDS bill spike triggers the instinct to scale the instance or add replicas — changes that are expensive, slow to take effect, and often targeting the wrong cost dimension entirely.
Solution: Triage the four cost dimensions in order — storage bloat, IOPS burst depletion, over-provisioned instance class, idle replicas — using CloudWatch metrics and PostgreSQL system catalog queries before making any change.
Proof: A specific dimension is identified as the driver, a targeted remediation is applied, and the next month’s Cost Explorer line for that dimension is lower — without touching the dimensions that were not the cause.
Action: This week, enable AWS Compute Optimizer for your RDS instances and set a Cost Explorer anomaly detection alert at 15% above your 30-day RDS baseline — both are free to configure and will surface the next cost spike before it compounds.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

Leadership Summary

Checklist

What to Do Next

Rajiv

Related Posts

Aurora Cost Optimization: The Hidden Database Bill

Aurora Serverless v2: Good Fit, Bad Fit

Aurora Global Database: What It Solves and What It Does Not