Most MySQL slowdowns have a short list of root causes: a missing index, a lock wait, or stale optimizer statistics. The hard part is not the fix — it is getting from “p99 alert fired” to “I know which query, why it is slow, and what the safe remediation is” without wasting an hour looking at the wrong thing. This playbook gives you that path as a repeatable workflow. Run these checks in order, and you will have a diagnosis before you start guessing.

Situation

The alert fires. Maybe it is a CloudWatch SlowQueries metric spike on RDS, a p99 latency alarm from your application APM, or a PagerDuty page from a long-running query threshold. You open a terminal, connect to the database, and face the standard problem: MySQL is running dozens of queries per second, and you need to identify the one that is costing you.

MySQL gives you several places to look — the slow query log, Performance Schema digest tables, SHOW PROCESSLIST, and InnoDB status — and the right place to start depends on whether the problem is active right now or a pattern you are trying to reconstruct after the fact. This runbook covers both: active incidents where queries are blocking or running hot, and post-incident analysis where you need to find the pattern in aggregated data.

The version context matters. MySQL 8.0 added EXPLAIN ANALYZE, which gives actual row counts alongside estimated ones. If you are on MySQL 5.7 or RDS Aurora MySQL, the same diagnostic steps apply but you will use EXPLAIN FORMAT=JSON without ANALYZE for the execution plan.

Symptoms

SignalWhere to see itWhat it means
Query_time >> Lock_time in slow log entryslow_query_log_file or mysqldumpslow outputQuery is executing slowly independent of locking — likely index or scan issue
High Lock_time in slow logSame sourceTransaction waiting on a row lock before it can execute
rows_examined far exceeds rows_sentSlow log entry or events_statements_summary_by_digestFull or partial table scan — index not covering the WHERE clause
Thread in Waiting for table metadata lock stateSHOW PROCESSLISTAnother connection holds a metadata lock, usually from an open transaction or an ALTER TABLE
High SUM_TIMER_WAIT for a specific digestperformance_schema.events_statements_summary_by_digestA specific query pattern accounts for most DB wall-clock time
LATEST DETECTED DEADLOCK section presentSHOW ENGINE INNODB STATUSTwo transactions deadlocked; one was rolled back

First Five Checks

  1. Enable the slow query log and read it — If the slow log is not already running, turn it on without a restart:

    SET GLOBAL slow_query_log = ON;
    SET GLOBAL long_query_time = 1;
    SET GLOBAL log_output = 'FILE';
    SHOW VARIABLES LIKE 'slow_query_log_file';
    

    Then use mysqldumpslow to aggregate entries. The -s t flag sorts by total time, which surfaces the queries with the most cumulative cost rather than just the single longest run:

    mysqldumpslow -s t -t 10 /var/lib/mysql/hostname-slow.log
    

    Each entry shows Query_time, Lock_time, Rows_sent, and Rows_examined. A rows_examined / rows_sent ratio above 100 is a strong signal of a full or near-full table scan.

  2. Find top queries by total time in Performance Schema — For RDS or environments where you cannot read the log file directly, Performance Schema digest tables give the same aggregate view:

    SELECT
      DIGEST_TEXT,
      COUNT_STAR,
      SUM_TIMER_WAIT / 1000000000000 AS total_sec,
      AVG_TIMER_WAIT / 1000000000000 AS avg_sec,
      SUM_ROWS_EXAMINED,
      SUM_ROWS_SENT
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 10;
    

    The DIGEST_TEXT column normalizes literals to ? placeholders, so you see the query pattern regardless of parameter values. Focus on rows where SUM_ROWS_EXAMINED greatly exceeds SUM_ROWS_SENT.

  3. Check current lock waits — If the incident is active and threads are blocked, identify the blocking transaction immediately. On MySQL 8.0, use performance_schema.data_lock_waits:

    SELECT
      r.trx_id AS waiting_trx_id,
      r.trx_mysql_thread_id AS waiting_thread,
      r.trx_query AS waiting_query,
      b.trx_id AS blocking_trx_id,
      b.trx_mysql_thread_id AS blocking_thread,
      b.trx_query AS blocking_query
    FROM information_schema.innodb_lock_waits w
    INNER JOIN information_schema.innodb_trx b
      ON b.trx_id = w.blocking_trx_id
    INNER JOIN information_schema.innodb_trx r
      ON r.trx_id = w.requesting_trx_id;
    

    The blocking_query column often shows NULL — this means the blocking transaction has already executed its statement and is sitting idle with an open transaction, holding row locks. Check b.trx_started to see how long it has been open.

  4. Check index usage for the affected table — The sys schema surfaces unused indexes, which are candidates for removal, and lets you quickly see what indexes exist:

    -- Indexes that have never been used since last server restart
    SELECT * FROM sys.schema_unused_indexes
    WHERE object_schema = 'your_db';
    
    -- All indexes on the table with cardinality
    SHOW INDEX FROM your_table;
    

    Low Cardinality on a column you are filtering by is a sign the index may not help the optimizer — or that statistics are stale and need updating. A Cardinality of 1 on a column with millions of rows is usually wrong.

  5. Get EXPLAIN for the slow query — Once you have identified the query pattern, capture its execution plan. On MySQL 8.0, EXPLAIN ANALYZE runs the query and returns actual row counts alongside estimates:

    -- MySQL 8.0+ — runs the query and returns actual vs estimated rows
    EXPLAIN ANALYZE
    SELECT user_id, created_at FROM orders
    WHERE status = 'pending' AND created_at > '2022-01-01';
    
    -- All versions — returns JSON with full cost estimates
    EXPLAIN FORMAT=JSON
    SELECT user_id, created_at FROM orders
    WHERE status = 'pending' AND created_at > '2022-01-01';
    

    In the output, look for type: ALL (full table scan), type: index (full index scan), Extra: Using filesort, and Extra: Using temporary. Any of these signals a query that is doing more work than it needs to. The rows column shows the optimizer’s estimate; with EXPLAIN ANALYZE, the actual rows field shows what actually happened.

Decision Tree

flowchart TD
    A[Slow query alert fires] --> B{rows_examined far exceeds rows_sent?}
    B -->|yes| C[Check EXPLAIN for full scan or wrong index]
    C --> D{type=ALL or index in EXPLAIN?}
    D -->|yes| E[Add or modify index based on WHERE clause]
    D -->|no| F[Check for filesort or temporary table in Extra]
    B -->|no| G{lock_time high in slow log?}
    G -->|yes| H[Query innodb_lock_waits for blocking thread]
    H --> I[Kill blocking thread or wait for commit]
    G -->|no| J{Query recently regressed?}
    J -->|yes| K{Cardinality looks wrong in SHOW INDEX?}
    K -->|yes| L[Run ANALYZE TABLE to refresh statistics]
    K -->|no| M[Check for schema change or data distribution shift]
    J -->|no| N{I/O bound — buffer pool hit rate low?}
    N -->|yes| O[Check innodb_buffer_pool hit rate and increase if possible]
    N -->|no| P[Profile with Performance Schema events_stages_summary]

What this diagram shows: A MySQL slow query decision tree — starting with the rows_examined/rows_sent ratio to detect full scans, then lock_time for blocking threads, cardinality estimates for stale statistics, and buffer pool hit rate for I/O saturation — each branch leads to a specific actionable fix.

Remediation Options

Option 1 — Add or modify an index based on EXPLAIN output

When EXPLAIN shows type: ALL or the optimizer is choosing an index that does not cover the WHERE clause, the fix is usually a covering index that includes all columns referenced in the WHERE, ORDER BY, and SELECT list. In MySQL 8.0, ALTER TABLE ... ADD INDEX uses online DDL by default, which means reads and writes continue during the operation:

-- Add a covering index for the query above
ALTER TABLE orders
  ADD INDEX idx_status_created_user (status, created_at, user_id);

-- Verify the optimizer uses it
EXPLAIN SELECT user_id, created_at FROM orders
WHERE status = 'pending' AND created_at > '2022-01-01';

Column order in the index matters. MySQL’s B-tree indexes support leftmost prefix matching — the optimizer can use (status, created_at) for a filter on status alone, but it cannot use (created_at, status) for a filter on status alone. Put the equality predicates first, range predicates last.

Option 2 — Update statistics with ANALYZE TABLE

When the optimizer is choosing a bad plan despite a suitable index, the cause is often stale statistics. This happens after large data loads, bulk deletes, or tables that have grown significantly since the last statistics update. ANALYZE TABLE is non-blocking in InnoDB and safe to run in production:

ANALYZE TABLE orders;

-- Verify cardinality updated
SHOW INDEX FROM orders;

According to the MySQL 8.0 Reference Manual, InnoDB calculates index statistics by sampling random pages — innodb_stats_sample_pages controls sample size. If your table has extremely skewed data distribution, increasing this value can improve plan quality at the cost of more I/O during the statistics update.

Option 3 — Kill the blocking transaction

When lock waits are causing the slowdown, the fastest resolution is to identify and kill the blocking thread. Use the blocking thread ID from the lock wait query in Check 3:

-- Show full information about the blocking thread
SELECT * FROM information_schema.processlist
WHERE id = <blocking_thread_id>;

-- Kill it (this rolls back the blocking transaction)
KILL <blocking_thread_id>;

KILL in MySQL sends a signal to the thread to terminate cleanly. The thread’s current transaction is rolled back. This is the correct tool for a long-running idle transaction holding row locks — not a hard connection reset. After killing, verify the waiting queries resume with SHOW PROCESSLIST.

Rollback Plan

  1. Adding an index — Reversible at any time with DROP INDEX. The online DDL used in MySQL 8.0 InnoDB means the add is also reversible mid-execution by canceling the ALTER (though partial progress is lost and the operation must restart). To remove: ALTER TABLE orders DROP INDEX idx_status_created_user;

  2. ANALYZE TABLE — No rollback needed. ANALYZE TABLE updates statistics but does not change data. If the new statistics produce a worse plan, you can hint the optimizer with USE INDEX (index_name) as a temporary workaround while investigating the plan regression. Statistics will also auto-update over time as InnoDB detects data changes.

  3. KILL thread — The killed transaction is rolled back. There is no undo for the kill itself — the work that transaction had done is lost. Before killing, check trx_query and trx_rows_modified to understand what the transaction was doing. For a long-running OLAP query that was just reading, the only cost is rerunning the query. For a transaction in the middle of writes, the application will see a lost connection error and should retry.

Automation Opportunity

The diagnosis steps in this playbook can be partially automated with two tools.

Percona Toolkit’s pt-query-digest processes slow log files and produces an aggregated report sorted by total time, showing query patterns, execution statistics, and EXPLAIN output. It is the documented standard for batch slow log analysis and handles log rotation correctly:

pt-query-digest /var/lib/mysql/hostname-slow.log > digest_report.txt
pt-query-digest --since='1h' /var/lib/mysql/hostname-slow.log

Percona Toolkit is open-source and documented at percona.com/software/database-tools/percona-toolkit.

Trending with Performance Schema — The digest table retains aggregated data across the server’s uptime. A scheduled query that snapshots SUM_TIMER_WAIT and COUNT_STAR into a monitoring table every 5 minutes gives you a trend line for query cost over time, which is more useful than a point-in-time alert:

-- Snapshot top 20 digests into a monitoring table every 5 minutes
INSERT INTO perf_snapshots (captured_at, digest, total_sec, call_count)
SELECT
  NOW(),
  DIGEST_TEXT,
  SUM_TIMER_WAIT / 1000000000000,
  COUNT_STAR
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;

On RDS, the SlowQueries CloudWatch metric counts queries exceeding long_query_time per minute. Set an alarm at a threshold above your baseline (e.g., more than 5 slow queries per minute) to trigger early before p99 latency is customer-visible.

Leadership Summary

  • A database query exceeded the response time threshold, causing elevated p99 latency visible in application monitoring.
  • The slow query was identified using Performance Schema digest tables and the slow query log; root cause was a missing index causing a full table scan. The index was added using online DDL with no downtime.
  • Automated slow query alerting via CloudWatch and a scheduled Performance Schema snapshot prevents undetected regressions going forward.

Checklist

  1. Confirm slow_query_log = ON and long_query_time is set to a meaningful threshold (1 second is standard; 0.5 on high-volume OLTP).
  2. Run mysqldumpslow -s t -t 10 on the slow log to identify the top queries by total time.
  3. Query performance_schema.events_statements_summary_by_digest sorted by SUM_TIMER_WAIT DESC to confirm the same pattern.
  4. Check information_schema.innodb_lock_waits for any active lock waits involving the slow query’s table.
  5. Run SHOW INDEX FROM <table> and check Cardinality values — anomalously low values indicate stale statistics.
  6. Run EXPLAIN FORMAT=JSON (or EXPLAIN ANALYZE on MySQL 8.0+) on the identified query and look for type: ALL, Using filesort, and Using temporary.
  7. If a full scan is confirmed, design a covering index that places equality predicates first and range predicates last, then test with EXPLAIN before adding.
  8. If lock contention is confirmed, identify the blocking thread using innodb_lock_waits and decide whether to kill it based on transaction age and trx_rows_modified.
  9. If plan is bad despite good indexes, run ANALYZE TABLE to refresh InnoDB statistics.
  10. After adding an index, re-run the original query under load and verify rows_examined drops to near rows_sent in the slow log.
  11. Set up a CloudWatch alarm on SlowQueries above baseline, or configure a Performance Schema snapshot job to trend query cost over time.
  12. Document the root cause, the index added, and the cardinality values before and after for the incident record.

What This Post Does Not Cover

This post covers identifying and resolving an active slow query in MySQL or Aurora MySQL. It does not cover: InnoDB full-text search tuning, ProxySQL query routing and query cache invalidation, Aurora Serverless v2 capacity scaling behavior during query spikes, or MySQL Group Replication lag as a driver of secondary read slowness. Those are distinct triage paths.

What to Do Next

  • Problem: When a slow query alert fires, engineers waste time looking at the wrong signal — checking instance CPU when the real cause is a missing index, or tuning configuration when lock contention is blocking a single thread.
  • Solution: Run the five checks in order — slow log, Performance Schema digest, lock waits, index cardinality, EXPLAIN — before touching any configuration or schema. Each check either confirms the cause or narrows it to the next step.
  • Proof: After applying the fix, rows_examined drops to within 2× of rows_sent in the slow log and SUM_TIMER_WAIT for the affected digest falls out of the top-10 list.
  • Action: This week, confirm slow_query_log = ON and long_query_time <= 1 on every production MySQL instance, and set a CloudWatch SlowQueries alarm above your normal baseline so the next regression is detected before it reaches p99 latency.