MySQL Slow Query Playbook: From Slow Log to Fix
Most MySQL slowdowns have a short list of root causes: a missing index, a lock wait, or stale optimizer statistics. The hard part is not the fix — it is getting from “p99 alert fired” to “I know which query, why it is slow, and what the safe remediation is” without wasting an hour looking at the wrong thing. This playbook gives you that path as a repeatable workflow. Run these checks in order, and you will have a diagnosis before you start guessing.
Situation
The alert fires. Maybe it is a CloudWatch SlowQueries metric spike on RDS, a p99 latency alarm from your application APM, or a PagerDuty page from a long-running query threshold. You open a terminal, connect to the database, and face the standard problem: MySQL is running dozens of queries per second, and you need to identify the one that is costing you.
MySQL gives you several places to look — the slow query log, Performance Schema digest tables, SHOW PROCESSLIST, and InnoDB status — and the right place to start depends on whether the problem is active right now or a pattern you are trying to reconstruct after the fact. This runbook covers both: active incidents where queries are blocking or running hot, and post-incident analysis where you need to find the pattern in aggregated data.
The version context matters. MySQL 8.0 added EXPLAIN ANALYZE, which gives actual row counts alongside estimated ones. If you are on MySQL 5.7 or RDS Aurora MySQL, the same diagnostic steps apply but you will use EXPLAIN FORMAT=JSON without ANALYZE for the execution plan.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
Query_time >> Lock_time in slow log entry | slow_query_log_file or mysqldumpslow output | Query is executing slowly independent of locking — likely index or scan issue |
High Lock_time in slow log | Same source | Transaction waiting on a row lock before it can execute |
rows_examined far exceeds rows_sent | Slow log entry or events_statements_summary_by_digest | Full or partial table scan — index not covering the WHERE clause |
Thread in Waiting for table metadata lock state | SHOW PROCESSLIST | Another connection holds a metadata lock, usually from an open transaction or an ALTER TABLE |
High SUM_TIMER_WAIT for a specific digest | performance_schema.events_statements_summary_by_digest | A specific query pattern accounts for most DB wall-clock time |
LATEST DETECTED DEADLOCK section present | SHOW ENGINE INNODB STATUS | Two transactions deadlocked; one was rolled back |
First Five Checks
-
Enable the slow query log and read it — If the slow log is not already running, turn it on without a restart:
SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 1; SET GLOBAL log_output = 'FILE'; SHOW VARIABLES LIKE 'slow_query_log_file';Then use
mysqldumpslowto aggregate entries. The-s tflag sorts by total time, which surfaces the queries with the most cumulative cost rather than just the single longest run:mysqldumpslow -s t -t 10 /var/lib/mysql/hostname-slow.logEach entry shows
Query_time,Lock_time,Rows_sent, andRows_examined. Arows_examined / rows_sentratio above 100 is a strong signal of a full or near-full table scan. -
Find top queries by total time in Performance Schema — For RDS or environments where you cannot read the log file directly, Performance Schema digest tables give the same aggregate view:
SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT / 1000000000000 AS total_sec, AVG_TIMER_WAIT / 1000000000000 AS avg_sec, SUM_ROWS_EXAMINED, SUM_ROWS_SENT FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;The
DIGEST_TEXTcolumn normalizes literals to?placeholders, so you see the query pattern regardless of parameter values. Focus on rows whereSUM_ROWS_EXAMINEDgreatly exceedsSUM_ROWS_SENT. -
Check current lock waits — If the incident is active and threads are blocked, identify the blocking transaction immediately. On MySQL 8.0, use
performance_schema.data_lock_waits:SELECT r.trx_id AS waiting_trx_id, r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query, b.trx_id AS blocking_trx_id, b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query FROM information_schema.innodb_lock_waits w INNER JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id INNER JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;The
blocking_querycolumn often showsNULL— this means the blocking transaction has already executed its statement and is sitting idle with an open transaction, holding row locks. Checkb.trx_startedto see how long it has been open. -
Check index usage for the affected table — The
sysschema surfaces unused indexes, which are candidates for removal, and lets you quickly see what indexes exist:-- Indexes that have never been used since last server restart SELECT * FROM sys.schema_unused_indexes WHERE object_schema = 'your_db'; -- All indexes on the table with cardinality SHOW INDEX FROM your_table;Low
Cardinalityon a column you are filtering by is a sign the index may not help the optimizer — or that statistics are stale and need updating. ACardinalityof 1 on a column with millions of rows is usually wrong. -
Get EXPLAIN for the slow query — Once you have identified the query pattern, capture its execution plan. On MySQL 8.0,
EXPLAIN ANALYZEruns the query and returns actual row counts alongside estimates:-- MySQL 8.0+ — runs the query and returns actual vs estimated rows EXPLAIN ANALYZE SELECT user_id, created_at FROM orders WHERE status = 'pending' AND created_at > '2022-01-01'; -- All versions — returns JSON with full cost estimates EXPLAIN FORMAT=JSON SELECT user_id, created_at FROM orders WHERE status = 'pending' AND created_at > '2022-01-01';In the output, look for
type: ALL(full table scan),type: index(full index scan),Extra: Using filesort, andExtra: Using temporary. Any of these signals a query that is doing more work than it needs to. Therowscolumn shows the optimizer’s estimate; withEXPLAIN ANALYZE, theactual rowsfield shows what actually happened.
Decision Tree
flowchart TD
A[Slow query alert fires] --> B{rows_examined far exceeds rows_sent?}
B -->|yes| C[Check EXPLAIN for full scan or wrong index]
C --> D{type=ALL or index in EXPLAIN?}
D -->|yes| E[Add or modify index based on WHERE clause]
D -->|no| F[Check for filesort or temporary table in Extra]
B -->|no| G{lock_time high in slow log?}
G -->|yes| H[Query innodb_lock_waits for blocking thread]
H --> I[Kill blocking thread or wait for commit]
G -->|no| J{Query recently regressed?}
J -->|yes| K{Cardinality looks wrong in SHOW INDEX?}
K -->|yes| L[Run ANALYZE TABLE to refresh statistics]
K -->|no| M[Check for schema change or data distribution shift]
J -->|no| N{I/O bound — buffer pool hit rate low?}
N -->|yes| O[Check innodb_buffer_pool hit rate and increase if possible]
N -->|no| P[Profile with Performance Schema events_stages_summary]
What this diagram shows: A MySQL slow query decision tree — starting with the rows_examined/rows_sent ratio to detect full scans, then lock_time for blocking threads, cardinality estimates for stale statistics, and buffer pool hit rate for I/O saturation — each branch leads to a specific actionable fix.
Remediation Options
Option 1 — Add or modify an index based on EXPLAIN output
When EXPLAIN shows type: ALL or the optimizer is choosing an index that does not cover the WHERE clause, the fix is usually a covering index that includes all columns referenced in the WHERE, ORDER BY, and SELECT list. In MySQL 8.0, ALTER TABLE ... ADD INDEX uses online DDL by default, which means reads and writes continue during the operation:
-- Add a covering index for the query above
ALTER TABLE orders
ADD INDEX idx_status_created_user (status, created_at, user_id);
-- Verify the optimizer uses it
EXPLAIN SELECT user_id, created_at FROM orders
WHERE status = 'pending' AND created_at > '2022-01-01';
Column order in the index matters. MySQL’s B-tree indexes support leftmost prefix matching — the optimizer can use (status, created_at) for a filter on status alone, but it cannot use (created_at, status) for a filter on status alone. Put the equality predicates first, range predicates last.
Option 2 — Update statistics with ANALYZE TABLE
When the optimizer is choosing a bad plan despite a suitable index, the cause is often stale statistics. This happens after large data loads, bulk deletes, or tables that have grown significantly since the last statistics update. ANALYZE TABLE is non-blocking in InnoDB and safe to run in production:
ANALYZE TABLE orders;
-- Verify cardinality updated
SHOW INDEX FROM orders;
According to the MySQL 8.0 Reference Manual, InnoDB calculates index statistics by sampling random pages — innodb_stats_sample_pages controls sample size. If your table has extremely skewed data distribution, increasing this value can improve plan quality at the cost of more I/O during the statistics update.
Option 3 — Kill the blocking transaction
When lock waits are causing the slowdown, the fastest resolution is to identify and kill the blocking thread. Use the blocking thread ID from the lock wait query in Check 3:
-- Show full information about the blocking thread
SELECT * FROM information_schema.processlist
WHERE id = <blocking_thread_id>;
-- Kill it (this rolls back the blocking transaction)
KILL <blocking_thread_id>;
KILL in MySQL sends a signal to the thread to terminate cleanly. The thread’s current transaction is rolled back. This is the correct tool for a long-running idle transaction holding row locks — not a hard connection reset. After killing, verify the waiting queries resume with SHOW PROCESSLIST.
Rollback Plan
-
Adding an index — Reversible at any time with
DROP INDEX. The online DDL used in MySQL 8.0 InnoDB means the add is also reversible mid-execution by canceling the ALTER (though partial progress is lost and the operation must restart). To remove:ALTER TABLE orders DROP INDEX idx_status_created_user; -
ANALYZE TABLE — No rollback needed.
ANALYZE TABLEupdates statistics but does not change data. If the new statistics produce a worse plan, you can hint the optimizer withUSE INDEX (index_name)as a temporary workaround while investigating the plan regression. Statistics will also auto-update over time as InnoDB detects data changes. -
KILL thread — The killed transaction is rolled back. There is no undo for the kill itself — the work that transaction had done is lost. Before killing, check
trx_queryandtrx_rows_modifiedto understand what the transaction was doing. For a long-running OLAP query that was just reading, the only cost is rerunning the query. For a transaction in the middle of writes, the application will see a lost connection error and should retry.
Automation Opportunity
The diagnosis steps in this playbook can be partially automated with two tools.
Percona Toolkit’s pt-query-digest processes slow log files and produces an aggregated report sorted by total time, showing query patterns, execution statistics, and EXPLAIN output. It is the documented standard for batch slow log analysis and handles log rotation correctly:
pt-query-digest /var/lib/mysql/hostname-slow.log > digest_report.txt
pt-query-digest --since='1h' /var/lib/mysql/hostname-slow.log
Percona Toolkit is open-source and documented at percona.com/software/database-tools/percona-toolkit.
Trending with Performance Schema — The digest table retains aggregated data across the server’s uptime. A scheduled query that snapshots SUM_TIMER_WAIT and COUNT_STAR into a monitoring table every 5 minutes gives you a trend line for query cost over time, which is more useful than a point-in-time alert:
-- Snapshot top 20 digests into a monitoring table every 5 minutes
INSERT INTO perf_snapshots (captured_at, digest, total_sec, call_count)
SELECT
NOW(),
DIGEST_TEXT,
SUM_TIMER_WAIT / 1000000000000,
COUNT_STAR
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
On RDS, the SlowQueries CloudWatch metric counts queries exceeding long_query_time per minute. Set an alarm at a threshold above your baseline (e.g., more than 5 slow queries per minute) to trigger early before p99 latency is customer-visible.
Leadership Summary
- A database query exceeded the response time threshold, causing elevated p99 latency visible in application monitoring.
- The slow query was identified using Performance Schema digest tables and the slow query log; root cause was a missing index causing a full table scan. The index was added using online DDL with no downtime.
- Automated slow query alerting via CloudWatch and a scheduled Performance Schema snapshot prevents undetected regressions going forward.
Checklist
- Confirm
slow_query_log = ONandlong_query_timeis set to a meaningful threshold (1 second is standard; 0.5 on high-volume OLTP). - Run
mysqldumpslow -s t -t 10on the slow log to identify the top queries by total time. - Query
performance_schema.events_statements_summary_by_digestsorted bySUM_TIMER_WAIT DESCto confirm the same pattern. - Check
information_schema.innodb_lock_waitsfor any active lock waits involving the slow query’s table. - Run
SHOW INDEX FROM <table>and checkCardinalityvalues — anomalously low values indicate stale statistics. - Run
EXPLAIN FORMAT=JSON(orEXPLAIN ANALYZEon MySQL 8.0+) on the identified query and look fortype: ALL,Using filesort, andUsing temporary. - If a full scan is confirmed, design a covering index that places equality predicates first and range predicates last, then test with
EXPLAINbefore adding. - If lock contention is confirmed, identify the blocking thread using
innodb_lock_waitsand decide whether to kill it based on transaction age andtrx_rows_modified. - If plan is bad despite good indexes, run
ANALYZE TABLEto refresh InnoDB statistics. - After adding an index, re-run the original query under load and verify
rows_examineddrops to nearrows_sentin the slow log. - Set up a CloudWatch alarm on
SlowQueriesabove baseline, or configure a Performance Schema snapshot job to trend query cost over time. - Document the root cause, the index added, and the cardinality values before and after for the incident record.
What This Post Does Not Cover
This post covers identifying and resolving an active slow query in MySQL or Aurora MySQL. It does not cover: InnoDB full-text search tuning, ProxySQL query routing and query cache invalidation, Aurora Serverless v2 capacity scaling behavior during query spikes, or MySQL Group Replication lag as a driver of secondary read slowness. Those are distinct triage paths.
What to Do Next
- Problem: When a slow query alert fires, engineers waste time looking at the wrong signal — checking instance CPU when the real cause is a missing index, or tuning configuration when lock contention is blocking a single thread.
- Solution: Run the five checks in order — slow log, Performance Schema digest, lock waits, index cardinality, EXPLAIN — before touching any configuration or schema. Each check either confirms the cause or narrows it to the next step.
- Proof: After applying the fix,
rows_examineddrops to within 2× ofrows_sentin the slow log andSUM_TIMER_WAITfor the affected digest falls out of the top-10 list. - Action: This week, confirm
slow_query_log = ONandlong_query_time <= 1on every production MySQL instance, and set a CloudWatchSlowQueriesalarm above your normal baseline so the next regression is detected before it reaches p99 latency.