Aurora MySQL Writer CPU Spike Workflow

An Aurora MySQL writer CPU spike is almost never just a CPU problem. The writer processes writes exclusively for the cluster, and when CPU spikes, the culprit is usually a query that changed execution plan, a lock contention burst, a batch job running longer than expected, or a sudden increase in connection count. Treating it as a capacity problem and scaling the instance is the expensive, slow-feedback response. The fast response starts with Performance Insights.

Situation

CloudWatch shows Aurora MySQL writer CPUUtilization at 80–95%. Application latency is climbing. The P99 for write endpoints has doubled. The on-call engineer opens the console and sees the CPU metric, the latency metric, and a blinking cursor.

Aurora MySQL separates the writer from the reader cluster endpoints. The writer handles all DML. Readers handle only SELECT queries that have been explicitly routed to the reader endpoint. When the writer is saturated, writes stall, and any reads routed to the writer stall with them. Scaling the writer instance buys time but does not address the root cause — and Aurora Serverless v2 auto-scaling adds latency while scaling happens, which worsens the incident in the short term.

The diagnostic sequence determines whether this resolves in 10 minutes or 2 hours.

Symptoms

Signal	Where to see it	What it means
CPUUtilization 80–100%	CloudWatch — Aurora writer	Writer is bottlenecked; cause unknown
High DBLoad	Performance Insights — DBLoad metric	Confirms sessions waiting; compare DBLoadCPU vs DBLoadNonCPU
One query dominating AAS	Performance Insights — Top SQL	Single query is consuming most writer capacity
Long lock wait in INNODB STATUS	`SHOW ENGINE INNODB STATUS\G`	Lock contention between concurrent transactions
Active connections spike	CloudWatch — DatabaseConnections	Connection pool exhausted or connection storm
PROCESSLIST shows many similar queries	`SHOW FULL PROCESSLIST`	Hot query pattern, not a single rogue query

First Five Checks

Performance Insights — split CPU vs wait — Determine whether the bottleneck is CPU execution or wait events:

Performance Insights DBLoad chart separates db.load.avg into DBLoadCPU (executing on CPU) and DBLoadNonCPU (waiting — on locks, I/O, etc.). If DBLoadNonCPU dominates, the CPU spike is a secondary effect of sessions piling up behind a lock or slow I/O, not pure execution load.

Navigate to: RDS Console → your Aurora cluster → Performance Insights → select DB Load breakdown by wait event.

Top SQL by average active sessions — Identify the specific query driving load:

Performance Insights → Top SQL tab, sorted by Load (AAS). The top query by AAS is the first candidate. Note its digest, get the full SQL text, and examine its execution plan.

-- Run on the Aurora writer — substitute the digest from Performance Insights
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345;

Currently running queries:

SHOW FULL PROCESSLIST;

Look for queries in State: executing or State: Waiting for table metadata lock or State: updating. A large number of identical or similar queries stacking up indicates the query is not returning promptly — the connection pool is filling with in-flight sessions.

InnoDB lock contention:

SHOW ENGINE INNODB STATUS\G

Scroll to the TRANSACTIONS section and look for LOCK WAIT. Lock waits indicate two or more transactions competing for the same row or range. The LATEST DETECTED DEADLOCK section shows the most recent deadlock event — if it is recent and matches the CPU spike timing, lock contention is the primary cause.

Long transactions:

SELECT trx_id, trx_started, trx_query,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_sec
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 10;

Any transaction older than 60 seconds on the writer during a CPU spike is a strong suspect. Long transactions hold row locks longer, block concurrent writes, and generate undo log that increases internal InnoDB maintenance work.

Decision Tree

flowchart TD
    A[Aurora writer CPU spike] --> B{Performance Insights — single query dominant?}
    B -->|yes| C[EXPLAIN the query — check for full scan]
    C --> D{Missing index?}
    D -->|yes| E[Add index — test in staging first]
    D -->|no| F[Check statistics staleness — run ANALYZE TABLE]
    B -->|no| G{DBLoadNonCPU dominant?}
    G -->|yes| H{INNODB STATUS shows lock waits?}
    H -->|yes| I[Find blocking transaction — reduce scope or kill]
    H -->|no| J[Check I/O metrics — consider read offload]
    G -->|no| K{Many connections in PROCESSLIST?}
    K -->|yes| L[Check connection pool config — reduce max connections]
    K -->|no| M{Aurora Serverless v2 scaling in progress?}
    M -->|yes| N[Wait for scale-up — increase minimum ACU to prevent recurrence]
    M -->|no| O[Check recent schema or code deployment]

Remediation Options

Option 1 — Add index for the top query

If Performance Insights identifies a query doing a full scan (type=ALL in EXPLAIN) as the top AAS consumer, adding the right index is the highest-leverage fix:

-- Confirm execution plan before adding index
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345 AND status = 'pending';

-- Add the index (run during low-traffic window or use pt-online-schema-change for large tables)
ALTER TABLE orders ADD INDEX idx_customer_status (customer_id, status);

-- Verify the new plan
EXPLAIN SELECT * FROM orders WHERE customer_id = 12345 AND status = 'pending';

Aurora MySQL supports online DDL for most index additions. For large tables, monitor information_schema.INNODB_ONLINE_DDL for progress.

Option 2 — Route reads to Aurora reader endpoint

If reads are being sent to the writer endpoint — intentionally or by misconfiguration — routing them to the reader reduces writer load immediately:

-- Verify no heavy reads are running on writer
SELECT user, info, time
FROM information_schema.PROCESSLIST
WHERE command != 'Sleep'
  AND info LIKE 'SELECT%'
ORDER BY time DESC
LIMIT 10;

Update application connection configuration to direct SELECT queries to the Aurora reader endpoint (cluster.ro.amazonaws.com). For applications that cannot distinguish read vs write connections, a read-write splitting proxy (ProxySQL, RDS Proxy) is an intermediate step.

Option 3 — Kill long-running blocking transactions

If INFORMATION_SCHEMA.INNODB_TRX shows a transaction blocking others and it has been running longer than its normal expected duration:

-- Identify the blocking thread
SELECT trx_mysql_thread_id, trx_started, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;

-- Kill it
KILL <trx_mysql_thread_id>;

Coordinate with the application team before killing production transactions. For recurring batch jobs that grow too large, the fix is chunking them: process rows in batches of 1,000–10,000 with explicit commits between chunks rather than one large transaction.

Rollback Plan

Index additions: Indexes can be dropped if they cause unexpected plan changes for other queries: ALTER TABLE orders DROP INDEX idx_customer_status. Monitor query plan changes via Performance Insights for 24 hours after index additions.
Read routing changes: Application-level changes to reader endpoint routing can be reverted by changing the connection string back. Stateful connections in the pool drain within one connection TTL cycle.
Killed transactions: The killed transaction rolls back automatically. InnoDB rollback time is proportional to transaction size. Monitor information_schema.INNODB_TRX to confirm completion. No binlog event is written for the rolled-back transaction.

Automation Opportunity

Aurora Performance Insights exposes API access to DB load metrics. A CloudWatch Alarm on DBLoad exceeding the instance’s max_connections-based threshold (typically 2x vCPU count as a conservative threshold) can trigger automated notification before CPU fully saturates.

A more targeted detection: schedule a query every 2 minutes on the writer that checks for long-running transactions and high-AAS queries simultaneously:

-- Long transaction detection (run on writer, schedule via external monitor)
SELECT COUNT(*) AS long_txn_count
FROM information_schema.INNODB_TRX
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 120;

Alert if long_txn_count exceeds 2 during business hours. In most workloads, a transaction running more than 2 minutes on a write-heavy Aurora cluster is either a stuck batch job or a deadlock victim that failed to rollback.

Leadership Summary

What broke: Aurora MySQL writer CPU spiked to 90%+, causing write latency to climb and application error rates to increase. The root cause was a high-AAS query executing a full table scan on a growing table after a recent data volume increase changed the query’s cost model.
What was done: Performance Insights identified the specific query. An index was added targeting the full-scan column. Writer CPU returned to baseline within 5 minutes of the index becoming active.
What prevents recurrence: Performance Insights monitoring with a DBLoad alarm at 4 AAS (writer-size-appropriate threshold) provides early warning. The long-transaction check query is scheduled to run every 2 minutes as a canary for batch job runaway.

Checklist

Open Performance Insights — confirm DBLoad is elevated on the writer, not the reader
Compare DBLoadCPU vs DBLoadNonCPU — determine if wait events or CPU execution dominate
Identify top query by AAS in Performance Insights Top SQL tab
Run EXPLAIN on the top query — look for type=ALL or high rows estimate
Run SHOW FULL PROCESSLIST — check for many stacked identical queries
Run SHOW ENGINE INNODB STATUS\G — look for lock waits and recent deadlocks
Run long-transaction query on INFORMATION_SCHEMA.INNODB_TRX — look for transactions older than 60 seconds
If full scan confirmed — add index in staging, test plan change, deploy to production
If lock contention confirmed — identify blocking transaction, coordinate kill or reduce transaction scope
Verify no SELECT queries are routed to writer endpoint — check connection strings in application config

What to Do Next

Problem: An Aurora MySQL writer CPU spike is treated as a capacity problem, which leads to scaling the instance or adding replicas — changes that are slow, expensive, and do not address a bad query plan, lock contention, or a batch job that outgrew its transaction scope.
Solution: Open Performance Insights first: split DBLoadCPU from DBLoadNonCPU to determine whether the bottleneck is execution or waiting, identify the top AAS query, then follow the decision tree to the targeted remediation.
Proof: CPU returns to baseline and DBLoad drops below the vCPU-count threshold within minutes of addressing the root cause — without any instance scaling.
Action: This week, enable a CloudWatch alarm on DBLoad at a threshold of 2× the instance’s vCPU count, and verify that Performance Insights is enabled on your Aurora writer so the top SQL tab is populated the next time a spike occurs.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

Leadership Summary

Checklist

What to Do Next

Rajiv

Related Posts

Cloud Architecture Review Checklist for Database-Backed Applications

AWS RDS Oracle and SQL Server: The License Cost Nobody Talks About

Autovacuum Is a Capacity Problem, Not a Maintenance Task