CloudWatch Database Insights for Aurora and RDS: The New AWS Monitoring Center
If you are still SSH-ing into a bastion host to run top and SHOW PROCESSLIST during an Aurora outage, you are ignoring the richest telemetry plane AWS provides.
Situation
Historically, monitoring a managed database like Amazon RDS or Aurora meant making a choice: rely on the sparse, high-level metrics provided by default CloudWatch, or install a third-party agent that required network access, credential management, and additional compute overhead.
The industry standard has shifted. AWS has unified Performance Insights (PI), Enhanced Monitoring (EM), and CloudWatch into a central observability plane. For teams operating Aurora and RDS at scale, the native AWS monitoring stack now provides enough granularity to diagnose deadlocks, pinpoint bad query plans, and trace I/O saturation without ever leaving the AWS console or writing a custom exporter.
Symptoms
Database failures in Aurora rarely look like hard crashes. They look like creeping degradation. The operational symptoms typically manifest as:
- The Phantom CPU Spike:
CPUUtilizationhits 99%, butDatabaseConnectionsremains flat. The application feels sluggish. - The I/O Ceiling: Queries that normally take 5ms suddenly take 500ms. The
ReadIOPSorWriteIOPSmetrics flatline at the exact provisioned limit. - The Connection Storm:
DatabaseConnectionsspikes vertically, followed immediately by application-side 502 Bad Gateway errors as the connection pool queue fills up. - The Silent Blocker: Application latency increases, but
CPUUtilizationis suspiciously low. Threads are waiting, not working.
First Five Checks
When a paging alert fires for an Aurora or RDS instance, these are the first five checks an engineer should perform using native AWS tools:
-
Check
DBLoadin Performance Insights: This is the single most important metric. DBLoad measures the number of active sessions in the database engine. If DBLoad exceeds the number of vCPUs, the database is bottlenecked. -
Review the
Wait EventsBreakdown: Slice the DBLoad metric bywaits. Are sessions waiting onCPU(working)?io/table/sql/read(I/O bound)? OrLock(contention)? -
Check
FreeableMemoryandSwapUsage(CloudWatch): IfFreeableMemoryplunges near zero andSwapUsagebegins climbing, the instance is thrashing. This often precedes an Out Of Memory (OOM) crash. -
Identify the Top SQL by Load (Performance Insights): Look at the “Top SQL” panel. Is the load caused by a single terrible query plan (one bar dominates), or an aggregate increase in all traffic?
-
Examine
CommitLatencyandDeadlocks(Aurora Specific): For Aurora PostgreSQL, check theCommitLatencymetric. If commit latency spikes while read IOPS are low, the storage volume might be experiencing multi-AZ replication delays.
Decision Tree
When diagnosing an Aurora performance incident, diagnosing the wait event is the critical pivot point.
flowchart TD
A[DBLoad Exceeds vCPUs] --> B{What is the Dominant Wait State?}
B -->|CPU| C[Check Top SQL by Load]
C --> C1{Is it a single query?}
C1 -->|Yes| C2[Missing Index or Bad Plan]
C1 -->|No| C3[Traffic Spike: Scale Up Instance]
B -->|I/O| D[Check IOPS Metrics]
D --> D1{Hitting Provisioned Limits?}
D1 -->|Yes| D2[Increase Provisioned IOPS or EBS Volume Size]
D1 -->|No| D3[Check Buffer Cache Hit Ratio]
B -->|Locks| E[Check Blocking Sessions]
E --> E1[Identify the Blocking PID]
E1 --> E2[Kill Blocker or Refactor Transaction Scope]
Remediation Options
Once the root cause is identified, you have a limited set of remediation paths.
-
Kill the Offending Query (Fastest, High Risk): If a single analytic query is holding an
AccessExclusiveLock, terminating the PID (pg_terminate_backend) immediately restores service.- Tradeoff: The application must handle the failure gracefully. If it immediately retries the exact same bad query, the database will lock up again.
-
Vertical Scaling (Medium Speed, High Cost): Modifying the instance to a larger SKU provides more CPU and memory. For Aurora, this takes minutes.
- Tradeoff: It requires a brief interruption of service (failover) and treats the symptom (lack of resources) rather than the disease (bad queries).
-
Deploy an Emergency Index (Slowest, Permanent Fix): If the Top SQL reveals a missing index causing a sequential scan, building the index
CONCURRENTLYresolves the CPU load.- Tradeoff: Building an index takes time and adds I/O pressure to an already struggling database.
Rollback Plan
If a remediation action worsens the situation (e.g., terminating a session causes a massive rollback that spikes I/O), the immediate rollback plan must be well-defined:
- Stop the application traffic at the load balancer to shed load.
- Wait for the database engine to finish its internal rollback procedures.
- Do not reboot the instance during an active transaction rollback, as it will simply restart the rollback process upon recovery.
Automation Opportunity
CloudWatch allows for automated remediation through Alarms and Systems Manager (SSM) Runbooks. For example, you can create a CloudWatch Alarm that triggers when FreeableMemory drops below 10%. Instead of just paging an engineer, the alarm can trigger an AWS Lambda function that queries Performance Insights, identifies the session consuming the most memory, and automatically terminates it.
Leadership Summary
- Standardize on Performance Insights: Do not rely purely on basic CloudWatch metrics. PI’s
DBLoadis the only metric that accurately reflects database saturation. - Tag Your Queries: Mandate that application teams use SQL comments (e.g.,
/* route=checkout, user=123 */) so that PI can group database load by application feature. - Alert on Saturation, Not Averages: Set alarms on wait events and connection limits, not just 80% CPU utilization.
What to Do Next
- Problem: Engineers SSH into bastion hosts and run
SHOW PROCESSLISTduring Aurora incidents because the default CloudWatch dashboard surfaces host saturation, not database saturation —CPUUtilizationat 40% tells you nothing about 500 sessions waiting on a lock. - Solution: Make
DBLoadsliced by wait event type the primary diagnostic signal in every Aurora incident — it’s the only metric that shows whether the database is blocked, I/O-bound, or genuinely CPU-saturated. - Proof: Simulate an I/O spike in staging and verify the corresponding CloudWatch alarm fires within 2 minutes with the wait event correctly identified — if the alarm fires on CPU and not
DBLoad, the triage workflow hasn’t improved. - Action: Enable Performance Insights at 1-second granularity on all production Aurora clusters, add a
DBLoad > vCPUsalarm with wait-event context, and require “Top SQL by Load” in the next database post-mortem.