If you are still SSH-ing into a bastion host to run top and SHOW PROCESSLIST during an Aurora outage, you are ignoring the richest telemetry plane AWS provides.

Situation

Historically, monitoring a managed database like Amazon RDS or Aurora meant making a choice: rely on the sparse, high-level metrics provided by default CloudWatch, or install a third-party agent that required network access, credential management, and additional compute overhead.

The industry standard has shifted. AWS has unified Performance Insights (PI), Enhanced Monitoring (EM), and CloudWatch into a central observability plane. For teams operating Aurora and RDS at scale, the native AWS monitoring stack now provides enough granularity to diagnose deadlocks, pinpoint bad query plans, and trace I/O saturation without ever leaving the AWS console or writing a custom exporter.

Symptoms

Database failures in Aurora rarely look like hard crashes. They look like creeping degradation. The operational symptoms typically manifest as:

  • The Phantom CPU Spike: CPUUtilization hits 99%, but DatabaseConnections remains flat. The application feels sluggish.
  • The I/O Ceiling: Queries that normally take 5ms suddenly take 500ms. The ReadIOPS or WriteIOPS metrics flatline at the exact provisioned limit.
  • The Connection Storm: DatabaseConnections spikes vertically, followed immediately by application-side 502 Bad Gateway errors as the connection pool queue fills up.
  • The Silent Blocker: Application latency increases, but CPUUtilization is suspiciously low. Threads are waiting, not working.

First Five Checks

When a paging alert fires for an Aurora or RDS instance, these are the first five checks an engineer should perform using native AWS tools:

  1. Check DBLoad in Performance Insights: This is the single most important metric. DBLoad measures the number of active sessions in the database engine. If DBLoad exceeds the number of vCPUs, the database is bottlenecked.

  2. Review the Wait Events Breakdown: Slice the DBLoad metric by waits. Are sessions waiting on CPU (working)? io/table/sql/read (I/O bound)? Or Lock (contention)?

  3. Check FreeableMemory and SwapUsage (CloudWatch): If FreeableMemory plunges near zero and SwapUsage begins climbing, the instance is thrashing. This often precedes an Out Of Memory (OOM) crash.

  4. Identify the Top SQL by Load (Performance Insights): Look at the “Top SQL” panel. Is the load caused by a single terrible query plan (one bar dominates), or an aggregate increase in all traffic?

  5. Examine CommitLatency and Deadlocks (Aurora Specific): For Aurora PostgreSQL, check the CommitLatency metric. If commit latency spikes while read IOPS are low, the storage volume might be experiencing multi-AZ replication delays.

Decision Tree

When diagnosing an Aurora performance incident, diagnosing the wait event is the critical pivot point.

flowchart TD
    A[DBLoad Exceeds vCPUs] --> B{What is the Dominant Wait State?}
    B -->|CPU| C[Check Top SQL by Load]
    C --> C1{Is it a single query?}
    C1 -->|Yes| C2[Missing Index or Bad Plan]
    C1 -->|No| C3[Traffic Spike: Scale Up Instance]
    
    B -->|I/O| D[Check IOPS Metrics]
    D --> D1{Hitting Provisioned Limits?}
    D1 -->|Yes| D2[Increase Provisioned IOPS or EBS Volume Size]
    D1 -->|No| D3[Check Buffer Cache Hit Ratio]
    
    B -->|Locks| E[Check Blocking Sessions]
    E --> E1[Identify the Blocking PID]
    E1 --> E2[Kill Blocker or Refactor Transaction Scope]

Remediation Options

Once the root cause is identified, you have a limited set of remediation paths.

  1. Kill the Offending Query (Fastest, High Risk): If a single analytic query is holding an AccessExclusiveLock, terminating the PID (pg_terminate_backend) immediately restores service.

    • Tradeoff: The application must handle the failure gracefully. If it immediately retries the exact same bad query, the database will lock up again.
  2. Vertical Scaling (Medium Speed, High Cost): Modifying the instance to a larger SKU provides more CPU and memory. For Aurora, this takes minutes.

    • Tradeoff: It requires a brief interruption of service (failover) and treats the symptom (lack of resources) rather than the disease (bad queries).
  3. Deploy an Emergency Index (Slowest, Permanent Fix): If the Top SQL reveals a missing index causing a sequential scan, building the index CONCURRENTLY resolves the CPU load.

    • Tradeoff: Building an index takes time and adds I/O pressure to an already struggling database.

Rollback Plan

If a remediation action worsens the situation (e.g., terminating a session causes a massive rollback that spikes I/O), the immediate rollback plan must be well-defined:

  • Stop the application traffic at the load balancer to shed load.
  • Wait for the database engine to finish its internal rollback procedures.
  • Do not reboot the instance during an active transaction rollback, as it will simply restart the rollback process upon recovery.

Automation Opportunity

CloudWatch allows for automated remediation through Alarms and Systems Manager (SSM) Runbooks. For example, you can create a CloudWatch Alarm that triggers when FreeableMemory drops below 10%. Instead of just paging an engineer, the alarm can trigger an AWS Lambda function that queries Performance Insights, identifies the session consuming the most memory, and automatically terminates it.

Leadership Summary

  • Standardize on Performance Insights: Do not rely purely on basic CloudWatch metrics. PI’s DBLoad is the only metric that accurately reflects database saturation.
  • Tag Your Queries: Mandate that application teams use SQL comments (e.g., /* route=checkout, user=123 */) so that PI can group database load by application feature.
  • Alert on Saturation, Not Averages: Set alarms on wait events and connection limits, not just 80% CPU utilization.

What to Do Next

  • Problem: Engineers SSH into bastion hosts and run SHOW PROCESSLIST during Aurora incidents because the default CloudWatch dashboard surfaces host saturation, not database saturation — CPUUtilization at 40% tells you nothing about 500 sessions waiting on a lock.
  • Solution: Make DBLoad sliced by wait event type the primary diagnostic signal in every Aurora incident — it’s the only metric that shows whether the database is blocked, I/O-bound, or genuinely CPU-saturated.
  • Proof: Simulate an I/O spike in staging and verify the corresponding CloudWatch alarm fires within 2 minutes with the wait event correctly identified — if the alarm fires on CPU and not DBLoad, the triage workflow hasn’t improved.
  • Action: Enable Performance Insights at 1-second granularity on all production Aurora clusters, add a DBLoad > vCPUs alarm with wait-event context, and require “Top SQL by Load” in the next database post-mortem.