Cost Observability: Build Dashboards That Show Waste Before Finance Finds It

If the first time engineering hears about a database cost spike is during a monthly finance review, your observability stack is fundamentally incomplete.

Situation

Database engineering traditionally focuses on two metrics: availability and latency. As long as the database is up and queries are fast, the system is considered healthy. However, in the cloud era, infrastructure is elastic, and cost is the hidden third metric. Managed database services like Amazon RDS, Aurora, and DynamoDB make it incredibly easy to spin up massive, highly available clusters. They also make it incredibly easy to bleed tens of thousands of dollars in hidden waste.

Most monitoring dashboards ignore cost entirely. Engineers look at CPU utilization to ensure it isn’t too high, but they rarely look at CPU utilization to ensure it isn’t too low. When observability is decoupled from cost, teams routinely run development environments on db.r6g.4xlarge instances, leave obsolete manual snapshots accruing backup storage charges for years, and over-provision EBS IOPS for workloads that no longer need them.

Symptoms

Cost inefficiency in cloud databases rarely triggers an immediate outage. Instead, it manifests as silent financial degradation. The symptoms include:

The Idle Giant: A massive database instance sits at 2% CPU utilization and 5% memory usage 24/7.
The IOPS Over-Provision: A database is running on an io2 Block Express volume provisioned for 20,000 IOPS, but CloudWatch shows it has never exceeded 1,000 IOPS in the past month.
The Snapshot Hoard: The AWS bill shows RDS backup storage costs exceeding the actual running instance costs due to years of manual, un-expired snapshots.
The Multi-AZ Dev Environment: Non-production environments are running with Multi-AZ redundancy enabled, roughly doubling the instance cost for workloads that can tolerate an hour of downtime.

First Five Checks

To integrate cost into your operational posture, build a dedicated “Cost Triage” dashboard with these five checks:

Check Peak CPU and Connection Counts (30-Day Window): If an instance has not exceeded 20% CPU utilization and 10% connection pool usage during its highest peak over a 30-day window, it is a prime candidate for downsizing.
Evaluate Provisioned IOPS vs. Consumed IOPS: Compare the sum of the ReadIOPS and WriteIOPS CloudWatch metrics (the AWS/RDS counterparts of the EBS-level VolumeReadOps and VolumeWriteOps) against the provisioned IOPS limit. If peak consumption is a small fraction of the limit, migrate from io2 to gp3 or lower the provisioned io2 ceiling.
Audit Multi-AZ Deployments by Environment Tag: Query your infrastructure state (via AWS Config or your IaC state file) to find any instance tagged env:dev or env:staging that has MultiAZ set to true.
Review Manual Snapshot Age: List all manual RDS snapshots without an expiration tag. Automated backups age out with the retention window; manual snapshots taken “just in case” before a migration live until someone deletes them and are billed continuously as RDS backup storage.
Track CloudWatch Log Ingestion and Retention: Database audit logs, slow query logs, and error logs pushed to CloudWatch Logs can become extremely expensive. Check the retention policy on each log group: CloudWatch Logs keeps data indefinitely unless a retention period is set, so logs that are never expired or exported to cheaper storage such as S3 Glacier keep adding to the bill.
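
The CPU and IOPS checks above can be sketched against CloudWatch roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it expects a boto3-style CloudWatch client to be passed in, reads the AWS/RDS metrics CPUUtilization, ReadIOPS, and WriteIOPS, and the helper names (peak_over_window, cost_triage) and the 25% IOPS threshold are illustrative choices, not values taken from the checks. The connection-count check is left out for brevity.

```python
from datetime import datetime, timedelta, timezone


def peak_over_window(cloudwatch, instance_id, metric_name, days=30, now=None):
    """Return the highest hourly Maximum of an AWS/RDS metric, or None if no data."""
    now = now or datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=3600,  # hourly buckets: 720 datapoints for a 30-day window
        Statistics=["Maximum"],
    )
    peaks = [point["Maximum"] for point in response["Datapoints"]]
    return max(peaks) if peaks else None


def cost_triage(cloudwatch, instance_id, provisioned_iops=None,
                cpu_threshold=20.0, iops_fraction=0.25):
    """List reasons an instance looks over-provisioned (thresholds are assumptions)."""
    findings = []

    peak_cpu = peak_over_window(cloudwatch, instance_id, "CPUUtilization")
    if peak_cpu is not None and peak_cpu < cpu_threshold:
        findings.append(f"peak CPU {peak_cpu:.1f}% is under {cpu_threshold:.0f}%")

    if provisioned_iops:
        read_peak = peak_over_window(cloudwatch, instance_id, "ReadIOPS") or 0.0
        write_peak = peak_over_window(cloudwatch, instance_id, "WriteIOPS") or 0.0
        # Adding the two peaks overstates usage if they happen at different
        # times, so this errs on the side of keeping the provisioned IOPS.
        consumed = read_peak + write_peak
        if consumed < provisioned_iops * iops_fraction:
            findings.append(
                f"peak IOPS {consumed:.0f} is under {iops_fraction:.0%} "
                f"of the provisioned {provisioned_iops}"
            )
    return findings
```

With a real client this would be called as cost_triage(boto3.client("cloudwatch"), "orders-dev", provisioned_iops=20000), where "orders-dev" is a hypothetical instance identifier. Passing the client in keeps the logic testable without AWS credentials.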

Decision Tree

When evaluating a database for cost optimization, use this triage flow to determine the safest remediation path.

flowchart TD
    A[Database Identified as High Cost] --> B{Is it Production?}
    B -->|No| C[Check High-Availability Config]
    C --> C1{Is Multi-AZ Enabled?}
    C1 -->|Yes| C2[Disable Multi-AZ]
    C1 -->|No| C3[Check Uptime Needs]
    C3 -->|Can be stopped| C4[Implement Nightly Stop/Start Schedule]
    
    B -->|Yes| D[Check Utilization Metrics]
    D --> D1{Is Peak CPU < 20%?}
    D1 -->|Yes| D2[Downsize Instance Type]
    D1 -->|No| D3[Check Storage Configuration]
    D3 --> D4{Using Provisioned IOPS io1/io2?}
    D4 -->|Yes| D5[Evaluate Migration to gp3]
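
The same flow can be written as a small function so it can run against an inventory export. This is a sketch only: the DatabaseFacts fields and the triage name are hypothetical, and the "No action from this flow" result is an assumption covering the branches the flowchart leaves open.

```python
from dataclasses import dataclass


@dataclass
class DatabaseFacts:
    """Inputs to the triage flow; field names are illustrative assumptions."""
    is_production: bool
    multi_az: bool
    can_be_stopped: bool
    peak_cpu_percent: float
    storage_type: str  # for example "gp3", "io1", or "io2"


NO_ACTION = "No action from this flow"  # assumption: not part of the flowchart


def triage(db: DatabaseFacts) -> str:
    """Return the first remediation the decision tree reaches for this database."""
    if not db.is_production:
        if db.multi_az:
            return "Disable Multi-AZ"
        if db.can_be_stopped:
            return "Implement nightly stop/start schedule"
        return NO_ACTION

    if db.peak_cpu_percent < 20:
        return "Downsize instance type"
    if db.storage_type in ("io1", "io2"):
        return "Evaluate migration to gp3"
    return NO_ACTION


if __name__ == "__main__":
    dev = DatabaseFacts(False, True, True, 2.0, "gp3")
    prod = DatabaseFacts(True, True, False, 55.0, "io2")
    print(triage(dev))   # Disable Multi-AZ
    print(triage(prod))  # Evaluate migration to gp3
```

Like the flowchart, the function returns one action per pass; rerunning it after each remediation walks a database through the remaining branches.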

Remediation Options

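Instance Downsizing (High Impact, Low Risk): On-demand pricing within an RDS instance family scales roughly linearly with size, so each step down (for example, db.r6g.4xlarge to db.r6g.2xlarge) cuts the compute cost roughly in half.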
- Tradeoff: This requires a brief interruption of service (failover). Ensure the application is resilient to connection drops.
Migrating io1/io2 to gp3 (High Impact, Zero Downtime): Modern gp3 volumes offer baseline performance of 3,000 IOPS and can be scaled up to 16,000 IOPS, which covers the large majority of database workloads at a fraction of the cost of io2. Storage type modifications are generally applied online.
- Tradeoff: Modifying a large volume can take many hours, and sometimes longer, to finish optimizing in the background. Performance may be slightly degraded during that time, and further storage modifications are blocked until it completes.
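Automated Start/Stop for Dev Environments (Medium Impact, Zero Cost Risk): Using AWS Instance Scheduler to shut down dev databases at 6 PM and start them at 8 AM removes 14 of every 24 instance-hours, about 58% of compute cost. Keeping them stopped over weekends as well raises the reduction to roughly 70%. Storage and backups are still billed while an instance is stopped.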
- Tradeoff: Engineers working off-hours will need self-service access to manually restart their environments.
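
The savings from a stop/start schedule come down to counting hours in a 168-hour week. The sketch below works that out for the 6 PM stop and 8 AM start described above; the function name and the weekend option are illustrative assumptions, and the result covers instance-hours only, since storage is billed whether or not the instance is running.

```python
def stopped_fraction(stop_hour=18, start_hour=8, stop_weekends=False):
    """Fraction of a 168-hour week during which the instance is stopped."""
    running_hours_per_day = stop_hour - start_hour
    running_days = 5 if stop_weekends else 7
    running_hours = running_hours_per_day * running_days
    return 1 - running_hours / 168


if __name__ == "__main__":
    # Stopped every night, running 10 hours on all 7 days: 70 of 168 hours.
    print(round(stopped_fraction(), 3))                    # 0.583
    # Stopped nights and weekends, running 10 hours on 5 days: 50 of 168 hours.
    print(round(stopped_fraction(stop_weekends=True), 3))  # 0.702
```

A nightly-only schedule lands just under 60%, so clearing that mark depends on keeping the instances off over the weekend too.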

Rollback Plan

When downsizing a database, always monitor application latency immediately following the cutover. If the smaller instance lacks the CPU or memory to serve queries efficiently, the rollback plan is to immediately issue another modify-instance command, applied immediately rather than at the next maintenance window, to scale back up. Because scaling back up is another instance class change, expect a second interruption: typically a minute or two when a Multi-AZ failover is available, and longer on a Single-AZ instance that has to restart.
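
A rollback trigger along these lines might look like the sketch below. It assumes a boto3-style RDS client is passed in and that the p99 latency figures come from your own application monitoring; the function name and the 1.5x tolerance are hypothetical, not values from this post. The modify_db_instance call and its DBInstanceIdentifier, DBInstanceClass, and ApplyImmediately parameters are the standard boto3 RDS API.

```python
def rollback_if_degraded(rds, instance_id, previous_class,
                         baseline_p99_ms, current_p99_ms, tolerance=1.5):
    """Scale back up when post-cutover p99 latency exceeds tolerance x baseline.

    Returns True if a rollback was requested, False otherwise.
    """
    if current_p99_ms <= baseline_p99_ms * tolerance:
        return False  # latency is within the agreed budget; keep the smaller size

    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBInstanceClass=previous_class,
        ApplyImmediately=True,  # do not wait for the next maintenance window
    )
    return True
```

Record the previous instance class and a latency baseline before the downsize, so the rollback does not depend on anyone remembering what the instance used to be.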

Automation Opportunity

Deploy a Lambda function triggered by EventBridge that runs weekly. The function should scan all RDS snapshots, identify any manual snapshot older than 90 days that does not have a Compliance or LegalHold tag, and automatically delete it. This prevents the “snapshot hoard” from silently inflating the AWS bill over time.
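
A minimal sketch of such a function follows, with some assumptions: it defaults to a dry run and deletes only when the event payload contains "dry_run": false (for example, through constant JSON input on the EventBridge target), it matches the Compliance and LegalHold tags by key only, and it covers RDS instance snapshots, not Aurora cluster snapshots, which use a separate API. The helper name expired_manual_snapshots is illustrative.

```python
from datetime import datetime, timedelta, timezone

PROTECTED_TAGS = {"Compliance", "LegalHold"}


def expired_manual_snapshots(snapshots, now=None, max_age_days=90):
    """Return identifiers of manual snapshots past max_age_days with no protected tag."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    expired = []
    for snap in snapshots:
        if snap.get("SnapshotType") != "manual":
            continue
        tag_keys = {tag["Key"] for tag in snap.get("TagList", [])}
        if tag_keys & PROTECTED_TAGS:
            continue  # never touch snapshots held for compliance or legal reasons
        created = snap.get("SnapshotCreateTime")  # missing while still creating
        if created is not None and created < cutoff:
            expired.append(snap["DBSnapshotIdentifier"])
    return expired


def handler(event, context, rds=None):
    """Lambda entry point; `rds` can be injected for testing."""
    if rds is None:
        import boto3  # bundled with the Lambda Python runtime
        rds = boto3.client("rds")

    dry_run = (event or {}).get("dry_run", True)  # assumption: safe by default

    snapshots = []
    paginator = rds.get_paginator("describe_db_snapshots")
    for page in paginator.paginate(SnapshotType="manual"):
        snapshots.extend(page["DBSnapshots"])

    doomed = expired_manual_snapshots(snapshots)
    if not dry_run:
        for snapshot_id in doomed:
            rds.delete_db_snapshot(DBSnapshotIdentifier=snapshot_id)
    return {"dry_run": dry_run, "snapshots": doomed}
```

Running in dry-run mode for a few weeks and reviewing the returned list is a cheap way to catch snapshots that should have carried a protective tag before anything is deleted.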

Leadership Summary

Cost is an Engineering Metric: Do not treat cost as an external business constraint. Expose cloud costs directly alongside CPU and memory on your engineering dashboards.
Tagging is Operations: You cannot optimize what you cannot identify. Strict enforcement of Environment, Team, and Service tags is the prerequisite for all cost observability.
The Cloud is Elastic, Use It: A database that runs 24/7 at 5% utilization is a failure of cloud architecture. Build your environments to scale down or shut off entirely when not in use.

What to Do Next

Problem: When observability is decoupled from cost, teams routinely over-provision dev environments on db.r6g.4xlarge, hoard manual snapshots for years, and leave io2 volumes provisioned at 20,000 IOPS for workloads that never exceed 1,000. None of this triggers an availability alert, so it surfaces only at the finance review.
Solution: Build a “Database Waste” dashboard ranking instances by lowest peak CPU and highest storage cost, then automate weekly scans for Multi-AZ dev environments and snapshots older than 90 days without a compliance tag.
Proof: Identify one non-production database with Multi-AZ enabled, disable it via Terraform, and show the projected yearly savings. This is the first concrete signal that cost observability is surfacing real waste before finance does.
Action: Run the five checks above against your current RDS fleet this week. Any dev instance at sub-20% peak CPU with Multi-AZ enabled is an immediate win: disable Multi-AZ and schedule a nightly stop/start via Instance Scheduler.


Rajiv
