MySQL and Aurora Monitoring: The Dashboard That Catches Problems Before Users Do
A MySQL dashboard that shows only CPU and disk IOPS will miss the failures that actually page you at 3 AM: replication stopped because of a single bad row, InnoDB buffer pool thrashing on a cold restart, connection exhaustion from a leaked pool, and a lock chain building behind an ALTER TABLE that forgot LOCK=NONE. The metrics that matter come from INFORMATION_SCHEMA, performance_schema, and the MySQL status variables — not the OS.
Situation
Most MySQL monitoring starts with the infrastructure layer: CPU, memory, disk I/O, network. These are necessary for capacity planning but insufficient for operational health. A MySQL instance with 30% CPU and plenty of free memory can still be moments from an outage: replica lag at 45 minutes, InnoDB buffer pool hit rate at 80% (normal is 99%), connection count at 95% of max_connections, and five sessions blocked behind a lock on a hot row.
Aurora adds its own layer: storage auto-scaling, volume bytes ceiling, cluster-level failover, and replica lag measured differently than MySQL’s Seconds_Behind_Master. Monitoring Aurora with only MySQL queries misses the Aurora-specific failure modes.
The seven metric groups below apply to both self-managed MySQL and Aurora MySQL. Where Aurora differs, the Aurora-specific metric or query is noted.
Symptoms
| Symptom | Likely source | First place to check |
|---|---|---|
| Application queries suddenly slower | Lock contention or plan regression | INFORMATION_SCHEMA.INNODB_TRX, SHOW PROCESSLIST |
| Connection pool exhausted | max_connections hit or leaked connections | SHOW STATUS LIKE 'Threads_connected' |
| Replica reads returning stale data | Replication lag | SHOW SLAVE STATUS / Aurora CloudWatch ReplicaLag |
| Table scan on a previously fast query | Missing index or stale stats | EXPLAIN, information_schema.STATISTICS |
Got error 1040: Too many connections in app logs | Connections near or at limit | SHOW VARIABLES LIKE 'max_connections' vs current threads |
| Disk filling faster than expected | Binary logs not purging or large temp tables | SHOW VARIABLES LIKE 'expire_logs_days' |
| OOM kill on MySQL process | Buffer pool too large for available RAM | innodb_buffer_pool_size vs system RAM |
Lock wait timeout exceeded in app | Long-running transaction holding row locks | INFORMATION_SCHEMA.INNODB_TRX + INNODB_LOCKS |
First Five Checks
Run these in order when something is wrong. Each requires only PROCESS privilege or SELECT on performance_schema.
1. What are active threads doing right now?
SELECT id, user, host, db, command, time, state, LEFT(info, 120) AS query
FROM information_schema.PROCESSLIST
WHERE command != 'Sleep'
AND time > 5
ORDER BY time DESC
LIMIT 20;
Look for threads in Waiting for lock, Sending data, or Copying to tmp table with long durations. Any active query running more than 30 seconds in OLTP deserves investigation. Waiting for lock with a chain of blocked sessions is a reliability event.
2. Is anyone waiting on InnoDB row locks?
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query,
TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
ORDER BY wait_seconds DESC;
For MySQL 8.0+, use performance_schema.data_lock_waits instead of INNODB_LOCK_WAITS (deprecated). A lock wait exceeding 10 seconds on an OLTP system is a reliability event, not a transient blip.
3. How far behind is the replica?
-- MySQL self-managed:
SHOW SLAVE STATUS\G
-- Key fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running, Last_Error
Seconds_Behind_Master reports the difference between the timestamp of the last event the replica’s SQL thread applied and the current timestamp. It goes to NULL when replication is stopped — this is not zero lag, it is broken replication.
For Aurora MySQL: use CloudWatch metric ReplicaLag. Aurora’s lag metric is more accurate because replicas share the same storage volume and lag is measured as I/O apply delay, not binary log position difference.
4. What is the InnoDB buffer pool hit rate?
SELECT
variable_name,
variable_value
FROM performance_schema.global_status
WHERE variable_name IN (
'Innodb_buffer_pool_read_requests',
'Innodb_buffer_pool_reads',
'Innodb_buffer_pool_wait_free',
'Innodb_buffer_pool_pages_dirty',
'Innodb_buffer_pool_pages_total'
);
Compute hit rate: (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests * 100. Below 99% means the buffer pool is too small or the working set exceeds available memory. Innodb_buffer_pool_wait_free > 0 means MySQL had to wait for a clean page — a sign of memory pressure.
5. What does the slow query rate look like?
SHOW STATUS LIKE 'Slow_queries';
SHOW VARIABLES LIKE 'long_query_time';
SHOW VARIABLES LIKE 'slow_query_log%';
If slow_query_log is OFF, turn it on: SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1; (1 second threshold for OLTP). Slow_queries is a cumulative counter since last restart — track the rate of change, not the absolute value.
For performance_schema, query the top queries by total latency:
SELECT schema_name, digest_text,
count_star AS executions,
ROUND(avg_timer_wait / 1e12, 3) AS avg_latency_sec,
ROUND(sum_timer_wait / 1e12, 3) AS total_latency_sec
FROM performance_schema.events_statements_summary_by_digest
WHERE schema_name IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 10;
Decision Tree
flowchart TD
A[Symptom observed] --> B{Active threads check}
B -->|Long-running active queries| C[Run EXPLAIN — plan regression or missing index?]
B -->|Threads in lock wait| D[Find blocking transaction — INNODB_TRX]
B -->|Many Sleep threads| E[Check connection pool — leaked connections or idle timeout not set?]
B -->|All looks normal| F{Check replication}
F -->|Seconds_Behind_Master high or NULL| G[Check Slave_IO_Running and Slave_SQL_Running — IO stopped means network or binlog issue — SQL stopped means error on replica apply]
F -->|Lag acceptable| H{Check InnoDB buffer pool}
H -->|Hit rate below 99%| I[Working set exceeds buffer pool — increase innodb_buffer_pool_size or identify hot tables]
H -->|wait_free above zero| J[Memory pressure — check OS swap and buffer pool size vs available RAM]
H -->|Buffer pool healthy| K{Check slow queries}
K -->|Slow query rate spiking| L[Run EXPLAIN on top queries from performance_schema digest — find index gaps]
K -->|No slow query signal| M{Check connections}
M -->|Threads_connected near max_connections| N[Check for leaked connections — application not closing pool]
M -->|Connections healthy| O[Check InnoDB redo log waits and binary log position]
Remediation Options
| Problem | Immediate action | Durable fix |
|---|---|---|
| Lock chain blocking transactions | KILL <blocking_thread_id> — use with caution, rolls back the transaction | Fix the application transaction that holds locks across slow external calls; add innodb_lock_wait_timeout |
| Replication stopped — SQL thread error | SHOW SLAVE STATUS for Last_SQL_Error; STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; only if the row is truly safe to skip | Fix the root cause (schema drift, unsupported statement in ROW format); never skip without understanding the error |
| InnoDB buffer pool hit rate below 99% | Identify and cache the hot tables; check if a large dump or batch job is evicting the working set | Increase innodb_buffer_pool_size (safe upper bound: 70–80% of total RAM); use buffer pool warmup after restart |
| Connection exhaustion | Kill idle connections: SELECT CONCAT('KILL ', id, ';') FROM information_schema.PROCESSLIST WHERE command='Sleep' AND time > 300; | Set wait_timeout and interactive_timeout; fix application connection pool to return connections after use |
| Slow query regression | Temporarily add an index with CREATE INDEX ... ALGORITHM=INPLACE, LOCK=NONE; or force a plan with FORCE INDEX | Tune the query; rebuild statistics with ANALYZE TABLE; add index permanently after testing |
| Disk filling from binary logs | PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY) | Set expire_logs_days = 7; verify replica is not lagging — purging logs a replica needs will break replication |
Automation Opportunity
Three MySQL checks can be automated into a runbook trigger:
-
Replication watchdog: poll
Seconds_Behind_Masterevery 60 seconds; alert when it exceeds 60 seconds; alert as critical when it isNULL(replication stopped). For Aurora, subscribe to CloudWatchReplicaLagmetric and create the same two-level alarm. -
Connection saturation check: query
Threads_connected / max_connectionsevery 60 seconds. Alert at 70%, page at 85%. This gives the team time to identify the source (pool leak, burst traffic, slow query cascade) before connection errors reach the application. -
Long-running transaction watchdog: query
INFORMATION_SCHEMA.INNODB_TRXevery 60 seconds. Alert if any transaction has been running more than 5 minutes. Auto-terminate transactions running more than 30 minutes with a logged record. Long-running transactions block autovacuum analogs (purge thread), hold row locks, and inflate undo log.
Leadership Summary
MySQL health is not visible in CPU and disk IOPS alone. Replication lag, InnoDB buffer pool utilization, lock chains, and connection exhaustion are the failure modes that cause user-visible errors — and all of them are visible in MySQL status variables and INFORMATION_SCHEMA before CPU shows any anomaly. The most common monitoring gap in MySQL deployments is treating Seconds_Behind_Master = NULL as zero lag instead of broken replication, and setting a single global slow query threshold that fires on legitimate batch queries while missing OLTP regressions. The seven metric groups above require only a PROCESS privilege and a 60-second poll interval.
Where It Breaks
| Failure mode | Why it happens | Fix |
|---|---|---|
Seconds_Behind_Master = NULL treated as healthy | NULL means replication stopped, not zero lag | Alert on NULL as critical, not informational |
| Slow query alert fires on batch jobs | Global long_query_time threshold applies to all queries | Set per-session long_query_time for batch roles; alert on rate from performance_schema digest by schema |
| Buffer pool hit rate appears fine but queries are slow | A large report query is evicting the working set during the report window | Alert on hit rate averaged over 5 minutes; monitor Innodb_buffer_pool_reads rate alongside hit rate |
| Lock wait queries not visible | INNODB_LOCK_WAITS requires MySQL 5.6–5.7 syntax; MySQL 8.0 uses performance_schema.data_lock_waits | Upgrade monitoring queries for MySQL 8.0 |
Aurora Seconds_Behind_Master not available | Aurora replicas don’t expose this variable via SHOW SLAVE STATUS in the same way | Use CloudWatch ReplicaLag metric; do not rely on SHOW SLAVE STATUS for Aurora replica lag |
performance_schema disabled | Default enabled since MySQL 5.7 but can be disabled; digest table empty | Verify performance_schema = ON in my.cnf; enable events_statements_history consumer |
What to Do Next
- Problem: MySQL and Aurora monitoring shows infrastructure metrics but misses the database-level signals that precede outages.
- Solution: Add the seven metric groups above using a
PROCESS-privileged monitoring user and a 60-second poll interval. For Aurora, add CloudWatch alarms forReplicaLag,DatabaseConnections, andFreeStorageSpace. - Proof: Run the five checks above against your production instance right now and confirm replication is not
NULL, buffer pool hit rate is above 99%, and no thread has been blocked on a lock for more than 10 seconds. - Action: This week, create a monitoring role (
GRANT PROCESS, SELECT ON performance_schema.* TO 'monitoring'@'%'), enableslow_query_log, and set a replication lag alert with a 60-second warning threshold.