A MySQL dashboard that shows only CPU and disk IOPS will miss the failures that actually page you at 3 AM: replication stopped because of a single bad row, InnoDB buffer pool thrashing on a cold restart, connection exhaustion from a leaked pool, and a lock chain building behind an ALTER TABLE that forgot LOCK=NONE. The metrics that matter come from INFORMATION_SCHEMA, performance_schema, and the MySQL status variables — not the OS.

Situation

Most MySQL monitoring starts with the infrastructure layer: CPU, memory, disk I/O, network. These are necessary for capacity planning but insufficient for operational health. A MySQL instance with 30% CPU and plenty of free memory can still be moments from an outage: replica lag at 45 minutes, InnoDB buffer pool hit rate at 80% (normal is 99%), connection count at 95% of max_connections, and five sessions blocked behind a lock on a hot row.

Aurora adds its own layer: storage auto-scaling, volume bytes ceiling, cluster-level failover, and replica lag measured differently than MySQL’s Seconds_Behind_Master. Monitoring Aurora with only MySQL queries misses the Aurora-specific failure modes.

The seven metric groups below apply to both self-managed MySQL and Aurora MySQL. Where Aurora differs, the Aurora-specific metric or query is noted.

Symptoms

SymptomLikely sourceFirst place to check
Application queries suddenly slowerLock contention or plan regressionINFORMATION_SCHEMA.INNODB_TRX, SHOW PROCESSLIST
Connection pool exhaustedmax_connections hit or leaked connectionsSHOW STATUS LIKE 'Threads_connected'
Replica reads returning stale dataReplication lagSHOW SLAVE STATUS / Aurora CloudWatch ReplicaLag
Table scan on a previously fast queryMissing index or stale statsEXPLAIN, information_schema.STATISTICS
Got error 1040: Too many connections in app logsConnections near or at limitSHOW VARIABLES LIKE 'max_connections' vs current threads
Disk filling faster than expectedBinary logs not purging or large temp tablesSHOW VARIABLES LIKE 'expire_logs_days'
OOM kill on MySQL processBuffer pool too large for available RAMinnodb_buffer_pool_size vs system RAM
Lock wait timeout exceeded in appLong-running transaction holding row locksINFORMATION_SCHEMA.INNODB_TRX + INNODB_LOCKS

First Five Checks

Run these in order when something is wrong. Each requires only PROCESS privilege or SELECT on performance_schema.

1. What are active threads doing right now?

SELECT id, user, host, db, command, time, state, LEFT(info, 120) AS query
FROM information_schema.PROCESSLIST
WHERE command != 'Sleep'
  AND time > 5
ORDER BY time DESC
LIMIT 20;

Look for threads in Waiting for lock, Sending data, or Copying to tmp table with long durations. Any active query running more than 30 seconds in OLTP deserves investigation. Waiting for lock with a chain of blocked sessions is a reliability event.

2. Is anyone waiting on InnoDB row locks?

SELECT
  r.trx_id AS waiting_trx_id,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx_id,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query,
  TIMESTAMPDIFF(SECOND, r.trx_wait_started, NOW()) AS wait_seconds
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
ORDER BY wait_seconds DESC;

For MySQL 8.0+, use performance_schema.data_lock_waits instead of INNODB_LOCK_WAITS (deprecated). A lock wait exceeding 10 seconds on an OLTP system is a reliability event, not a transient blip.

3. How far behind is the replica?

-- MySQL self-managed:
SHOW SLAVE STATUS\G
-- Key fields: Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running, Last_Error

Seconds_Behind_Master reports the difference between the timestamp of the last event the replica’s SQL thread applied and the current timestamp. It goes to NULL when replication is stopped — this is not zero lag, it is broken replication.

For Aurora MySQL: use CloudWatch metric ReplicaLag. Aurora’s lag metric is more accurate because replicas share the same storage volume and lag is measured as I/O apply delay, not binary log position difference.

4. What is the InnoDB buffer pool hit rate?

SELECT
  variable_name,
  variable_value
FROM performance_schema.global_status
WHERE variable_name IN (
  'Innodb_buffer_pool_read_requests',
  'Innodb_buffer_pool_reads',
  'Innodb_buffer_pool_wait_free',
  'Innodb_buffer_pool_pages_dirty',
  'Innodb_buffer_pool_pages_total'
);

Compute hit rate: (Innodb_buffer_pool_read_requests - Innodb_buffer_pool_reads) / Innodb_buffer_pool_read_requests * 100. Below 99% means the buffer pool is too small or the working set exceeds available memory. Innodb_buffer_pool_wait_free > 0 means MySQL had to wait for a clean page — a sign of memory pressure.

5. What does the slow query rate look like?

SHOW STATUS LIKE 'Slow_queries';
SHOW VARIABLES LIKE 'long_query_time';
SHOW VARIABLES LIKE 'slow_query_log%';

If slow_query_log is OFF, turn it on: SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1; (1 second threshold for OLTP). Slow_queries is a cumulative counter since last restart — track the rate of change, not the absolute value.

For performance_schema, query the top queries by total latency:

SELECT schema_name, digest_text,
       count_star AS executions,
       ROUND(avg_timer_wait / 1e12, 3) AS avg_latency_sec,
       ROUND(sum_timer_wait / 1e12, 3) AS total_latency_sec
FROM performance_schema.events_statements_summary_by_digest
WHERE schema_name IS NOT NULL
ORDER BY sum_timer_wait DESC
LIMIT 10;

Decision Tree

flowchart TD
    A[Symptom observed] --> B{Active threads check}
    B -->|Long-running active queries| C[Run EXPLAIN — plan regression or missing index?]
    B -->|Threads in lock wait| D[Find blocking transaction — INNODB_TRX]
    B -->|Many Sleep threads| E[Check connection pool — leaked connections or idle timeout not set?]
    B -->|All looks normal| F{Check replication}
    F -->|Seconds_Behind_Master high or NULL| G[Check Slave_IO_Running and Slave_SQL_Running — IO stopped means network or binlog issue — SQL stopped means error on replica apply]
    F -->|Lag acceptable| H{Check InnoDB buffer pool}
    H -->|Hit rate below 99%| I[Working set exceeds buffer pool — increase innodb_buffer_pool_size or identify hot tables]
    H -->|wait_free above zero| J[Memory pressure — check OS swap and buffer pool size vs available RAM]
    H -->|Buffer pool healthy| K{Check slow queries}
    K -->|Slow query rate spiking| L[Run EXPLAIN on top queries from performance_schema digest — find index gaps]
    K -->|No slow query signal| M{Check connections}
    M -->|Threads_connected near max_connections| N[Check for leaked connections — application not closing pool]
    M -->|Connections healthy| O[Check InnoDB redo log waits and binary log position]

Remediation Options

ProblemImmediate actionDurable fix
Lock chain blocking transactionsKILL <blocking_thread_id> — use with caution, rolls back the transactionFix the application transaction that holds locks across slow external calls; add innodb_lock_wait_timeout
Replication stopped — SQL thread errorSHOW SLAVE STATUS for Last_SQL_Error; STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; only if the row is truly safe to skipFix the root cause (schema drift, unsupported statement in ROW format); never skip without understanding the error
InnoDB buffer pool hit rate below 99%Identify and cache the hot tables; check if a large dump or batch job is evicting the working setIncrease innodb_buffer_pool_size (safe upper bound: 70–80% of total RAM); use buffer pool warmup after restart
Connection exhaustionKill idle connections: SELECT CONCAT('KILL ', id, ';') FROM information_schema.PROCESSLIST WHERE command='Sleep' AND time > 300;Set wait_timeout and interactive_timeout; fix application connection pool to return connections after use
Slow query regressionTemporarily add an index with CREATE INDEX ... ALGORITHM=INPLACE, LOCK=NONE; or force a plan with FORCE INDEXTune the query; rebuild statistics with ANALYZE TABLE; add index permanently after testing
Disk filling from binary logsPURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 3 DAY)Set expire_logs_days = 7; verify replica is not lagging — purging logs a replica needs will break replication

Automation Opportunity

Three MySQL checks can be automated into a runbook trigger:

  1. Replication watchdog: poll Seconds_Behind_Master every 60 seconds; alert when it exceeds 60 seconds; alert as critical when it is NULL (replication stopped). For Aurora, subscribe to CloudWatch ReplicaLag metric and create the same two-level alarm.

  2. Connection saturation check: query Threads_connected / max_connections every 60 seconds. Alert at 70%, page at 85%. This gives the team time to identify the source (pool leak, burst traffic, slow query cascade) before connection errors reach the application.

  3. Long-running transaction watchdog: query INFORMATION_SCHEMA.INNODB_TRX every 60 seconds. Alert if any transaction has been running more than 5 minutes. Auto-terminate transactions running more than 30 minutes with a logged record. Long-running transactions block autovacuum analogs (purge thread), hold row locks, and inflate undo log.

Leadership Summary

MySQL health is not visible in CPU and disk IOPS alone. Replication lag, InnoDB buffer pool utilization, lock chains, and connection exhaustion are the failure modes that cause user-visible errors — and all of them are visible in MySQL status variables and INFORMATION_SCHEMA before CPU shows any anomaly. The most common monitoring gap in MySQL deployments is treating Seconds_Behind_Master = NULL as zero lag instead of broken replication, and setting a single global slow query threshold that fires on legitimate batch queries while missing OLTP regressions. The seven metric groups above require only a PROCESS privilege and a 60-second poll interval.

Where It Breaks

Failure modeWhy it happensFix
Seconds_Behind_Master = NULL treated as healthyNULL means replication stopped, not zero lagAlert on NULL as critical, not informational
Slow query alert fires on batch jobsGlobal long_query_time threshold applies to all queriesSet per-session long_query_time for batch roles; alert on rate from performance_schema digest by schema
Buffer pool hit rate appears fine but queries are slowA large report query is evicting the working set during the report windowAlert on hit rate averaged over 5 minutes; monitor Innodb_buffer_pool_reads rate alongside hit rate
Lock wait queries not visibleINNODB_LOCK_WAITS requires MySQL 5.6–5.7 syntax; MySQL 8.0 uses performance_schema.data_lock_waitsUpgrade monitoring queries for MySQL 8.0
Aurora Seconds_Behind_Master not availableAurora replicas don’t expose this variable via SHOW SLAVE STATUS in the same wayUse CloudWatch ReplicaLag metric; do not rely on SHOW SLAVE STATUS for Aurora replica lag
performance_schema disabledDefault enabled since MySQL 5.7 but can be disabled; digest table emptyVerify performance_schema = ON in my.cnf; enable events_statements_history consumer

What to Do Next

  • Problem: MySQL and Aurora monitoring shows infrastructure metrics but misses the database-level signals that precede outages.
  • Solution: Add the seven metric groups above using a PROCESS-privileged monitoring user and a 60-second poll interval. For Aurora, add CloudWatch alarms for ReplicaLag, DatabaseConnections, and FreeStorageSpace.
  • Proof: Run the five checks above against your production instance right now and confirm replication is not NULL, buffer pool hit rate is above 99%, and no thread has been blocked on a lock for more than 10 seconds.
  • Action: This week, create a monitoring role (GRANT PROCESS, SELECT ON performance_schema.* TO 'monitoring'@'%'), enable slow_query_log, and set a replication lag alert with a 60-second warning threshold.