MySQL and Aurora Monitoring: The Dashboard That Catches Problems Before Users Do

A MySQL dashboard that shows only CPU and disk IOPS will miss the failures that actually page you at 3 AM: replication stopped because of a single bad row, InnoDB buffer pool thrashing on a cold restart, connection exhaustion from a leaked pool, and a lock chain building behind an ALTER TABLE that forgot LOCK=NONE. The metrics that matter come from INFORMATION_SCHEMA, performance_schema, and the MySQL status variables — not the OS.

Situation

Most MySQL monitoring starts with the infrastructure layer: CPU, memory, disk I/O, network. These are necessary for capacity planning but insufficient for operational health. A MySQL instance with 30% CPU and plenty of free memory can still be moments from an outage: replica lag at 45 minutes, InnoDB buffer pool hit rate at 80% (normal is 99%), connection count at 95% of max_connections, and five sessions blocked behind a lock on a hot row.

Aurora adds its own layer: storage auto-scaling, volume bytes ceiling, cluster-level failover, and replica lag measured differently than MySQL’s Seconds_Behind_Master. Monitoring Aurora with only MySQL queries misses the Aurora-specific failure modes.

The seven metric groups below apply to both self-managed MySQL and Aurora MySQL. Where Aurora differs, the Aurora-specific metric or query is noted.

Symptoms

Symptom	Likely source	First place to check
Application queries suddenly slower	Lock contention or plan regression	`INFORMATION_SCHEMA.INNODB_TRX`, `SHOW PROCESSLIST`
Connection pool exhausted	`max_connections` hit or leaked connections	`SHOW STATUS LIKE 'Threads_connected'`
Replica reads returning stale data	Replication lag	`SHOW SLAVE STATUS` / Aurora CloudWatch `ReplicaLag`
Table scan on a previously fast query	Missing index or stale stats	`EXPLAIN`, `information_schema.STATISTICS`
`Got error 1040: Too many connections` in app logs	Connections near or at limit	`SHOW VARIABLES LIKE 'max_connections'` vs current threads
Disk filling faster than expected	Binary logs not purging or large temp tables	`SHOW VARIABLES LIKE 'expire_logs_days'`
OOM kill on MySQL process	Buffer pool too large for available RAM	`innodb_buffer_pool_size` vs system RAM
`Lock wait timeout exceeded` in app	Long-running transaction holding row locks	`INFORMATION_SCHEMA.INNODB_TRX` + `INNODB_LOCKS`

First Five Checks

Run these in order when something is wrong. Each requires only PROCESS privilege or SELECT on performance_schema.

1. What are active threads doing right now?

SELECT id, user, host, db, command, time, state, LEFT(info, 120) AS query
FROM information_schema.PROCESSLIST
WHERE command != 'Sleep'
  AND time > 5
ORDER BY time DESC
LIMIT 20;

Look for threads in Waiting for lock, Sending data, or Copying to tmp table with long durations. Any active query running more than 30 seconds in OLTP deserves investigation. Waiting for lock with a chain of blocked sessions is a reliability event.

2. Is anyone waiting on InnoDB row locks?

SELECT
  r.trx_id AS waiting_trx_id,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx_id,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query,
  TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
ORDER BY wait_seconds DESC;

For MySQL 8.0+, use performance_schema.data_lock_waits instead of INNODB_LOCK_WAITS (deprecated). A lock wait exceeding 10 seconds on an OLTP system is a reliability event, not a transient blip.

3. How far behind is the replica?

-- MySQL self-managed:
SHOW SLAVE STATUS\G
-- Key fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running, Last_Error

Seconds_Behind_Master reports the difference between the timestamp of the last event the replica’s SQL thread applied and the current timestamp. It goes to NULL when replication is stopped — this is not zero lag, it is broken replication.

For Aurora MySQL: use CloudWatch metric ReplicaLag. Aurora’s lag metric is more accurate because replicas share the same storage volume and lag is measured as I/O apply delay, not binary log position difference.

4. What is the InnoDB buffer pool hit rate?

SELECT
  variable_name,
  variable_value
FROM performance_schema.global_status
WHERE variable_name IN (
  'Innodb_buffer_pool_read_requests',
  'Innodb_buffer_pool_reads',
  'Innodb_buffer_pool_wait_free',
  'Innodb_buffer_pool_pages_dirty',
  'Innodb_buffer_pool_pages_total'
);

Compute hit rate: (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests * 100. Below 99% means the buffer pool is too small or the working set exceeds available memory. Innodb_buffer_pool_wait_free > 0 means MySQL had to wait for a clean page — a sign of memory pressure.

5. What does the slow query rate look like?

SHOW STATUS LIKE 'Slow_queries';
SHOW VARIABLES LIKE 'long_query_time';
SHOW VARIABLES LIKE 'slow_query_log%';

If slow_query_log is OFF, turn it on: SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1; (1 second threshold for OLTP). Slow_queries is a cumulative counter since last restart — track the rate of change, not the absolute value.

For performance_schema, query the top queries by total latency:

SELECT schema_name, digest_text,
       count_star AS executions,
       ROUND(avg_timer_wait / 1e12, 3) AS avg_latency_sec,
       ROUND(sum_timer_wait / 1e12, 3) AS total_latency_sec
FROM performance_schema.events_statements_summary_by_digest
WHERE schema_name IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 10;

Decision Tree

flowchart TD
    A[Symptom observed] --> B{Active threads check}
    B -->|Long-running active queries| C[Run EXPLAIN — plan regression or missing index?]
    B -->|Threads in lock wait| D[Find blocking transaction — INNODB_TRX]
    B -->|Many Sleep threads| E[Check connection pool — leaked connections or idle timeout not set?]
    B -->|All looks normal| F{Check replication}
    F -->|Seconds_Behind_Master high or NULL| G[Check Slave_IO_Running and Slave_SQL_Running — IO stopped means network or binlog issue — SQL stopped means error on replica apply]
    F -->|Lag acceptable| H{Check InnoDB buffer pool}
    H -->|Hit rate below 99%| I[Working set exceeds buffer pool — increase innodb_buffer_pool_size or identify hot tables]
    H -->|wait_free above zero| J[Memory pressure — check OS swap and buffer pool size vs available RAM]
    H -->|Buffer pool healthy| K{Check slow queries}
    K -->|Slow query rate spiking| L[Run EXPLAIN on top queries from performance_schema digest — find index gaps]
    K -->|No slow query signal| M{Check connections}
    M -->|Threads_connected near max_connections| N[Check for leaked connections — application not closing pool]
    M -->|Connections healthy| O[Check InnoDB redo log waits and binary log position]

Remediation Options

Problem	Immediate action	Durable fix
Lock chain blocking transactions	`KILL <blocking_thread_id>` — use with caution, rolls back the transaction	Fix the application transaction that holds locks across slow external calls; add `innodb_lock_wait_timeout`
Replication stopped — SQL thread error	`SHOW SLAVE STATUS` for `Last_SQL_Error`; `STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;` only if the row is truly safe to skip	Fix the root cause (schema drift, unsupported statement in ROW format); never skip without understanding the error
InnoDB buffer pool hit rate below 99%	Identify and cache the hot tables; check if a large dump or batch job is evicting the working set	Increase `innodb_buffer_pool_size` (safe upper bound: 70–80% of total RAM); use buffer pool warmup after restart
Connection exhaustion	Kill idle connections: `SELECT CONCAT('KILL ', id, ';') FROM information_schema.PROCESSLIST WHERE command='Sleep' AND time > 300;`	Set `wait_timeout` and `interactive_timeout`; fix application connection pool to return connections after use
Slow query regression	Temporarily add an index with `CREATE INDEX ... ALGORITHM=INPLACE, LOCK=NONE`; or force a plan with `FORCE INDEX`	Tune the query; rebuild statistics with `ANALYZE TABLE`; add index permanently after testing
Disk filling from binary logs	`PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY)`	Set `expire_logs_days = 7`; verify replica is not lagging — purging logs a replica needs will break replication

Automation Opportunity

Three MySQL checks can be automated into a runbook trigger:

Replication watchdog: poll Seconds_Behind_Master every 60 seconds; alert when it exceeds 60 seconds; alert as critical when it is NULL (replication stopped). For Aurora, subscribe to CloudWatch ReplicaLag metric and create the same two-level alarm.
Connection saturation check: query Threads_connected / max_connections every 60 seconds. Alert at 70%, page at 85%. This gives the team time to identify the source (pool leak, burst traffic, slow query cascade) before connection errors reach the application.
Long-running transaction watchdog: query INFORMATION_SCHEMA.INNODB_TRX every 60 seconds. Alert if any transaction has been running more than 5 minutes. Auto-terminate transactions running more than 30 minutes with a logged record. Long-running transactions block autovacuum analogs (purge thread), hold row locks, and inflate undo log.

Leadership Summary

MySQL health is not visible in CPU and disk IOPS alone. Replication lag, InnoDB buffer pool utilization, lock chains, and connection exhaustion are the failure modes that cause user-visible errors — and all of them are visible in MySQL status variables and INFORMATION_SCHEMA before CPU shows any anomaly. The most common monitoring gap in MySQL deployments is treating Seconds_Behind_Master = NULL as zero lag instead of broken replication, and setting a single global slow query threshold that fires on legitimate batch queries while missing OLTP regressions. The seven metric groups above require only a PROCESS privilege and a 60-second poll interval.

Where It Breaks

Failure mode	Why it happens	Fix
`Seconds_Behind_Master = NULL` treated as healthy	NULL means replication stopped, not zero lag	Alert on `NULL` as critical, not informational
Slow query alert fires on batch jobs	Global `long_query_time` threshold applies to all queries	Set per-session `long_query_time` for batch roles; alert on rate from `performance_schema` digest by schema
Buffer pool hit rate appears fine but queries are slow	A large report query is evicting the working set during the report window	Alert on hit rate averaged over 5 minutes; monitor `Innodb_buffer_pool_reads` rate alongside hit rate
Lock wait queries not visible	`INNODB_LOCK_WAITS` requires MySQL 5.6–5.7 syntax; MySQL 8.0 uses `performance_schema.data_lock_waits`	Upgrade monitoring queries for MySQL 8.0
Aurora `Seconds_Behind_Master` not available	Aurora replicas don’t expose this variable via `SHOW SLAVE STATUS` in the same way	Use CloudWatch `ReplicaLag` metric; do not rely on `SHOW SLAVE STATUS` for Aurora replica lag
`performance_schema` disabled	Default enabled since MySQL 5.7 but can be disabled; digest table empty	Verify `performance_schema = ON` in `my.cnf`; enable `events_statements_history` consumer

What to Do Next

Problem: MySQL and Aurora monitoring shows infrastructure metrics but misses the database-level signals that precede outages.
Solution: Add the seven metric groups above using a PROCESS-privileged monitoring user and a 60-second poll interval. For Aurora, add CloudWatch alarms for ReplicaLag, DatabaseConnections, and FreeStorageSpace.
Proof: Run the five checks above against your production instance right now and confirm replication is not NULL, buffer pool hit rate is above 99%, and no thread has been blocked on a lock for more than 10 seconds.
Action: This week, create a monitoring role (GRANT PROCESS, SELECT ON performance_schema.* TO 'monitoring'@'%'), enable slow_query_log, and set a replication lag alert with a 60-second warning threshold.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Automation Opportunity

Leadership Summary

Where It Breaks

What to Do Next

Rajiv

Related Posts

Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup

Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup

Database Alert Design: Thresholds That Fire on Real Problems