MySQL Replication Lag Decision Tree
Replication lag in MySQL is a symptom, not a cause — but the cause is almost always one of five things, and the diagnostic sequence matters. Engineers who start tuning parallel replica workers before they check whether the replica’s SQL thread is even running waste an hour on the wrong problem. This runbook covers the decision tree from first alert to targeted remediation.
Situation
The alert fires: Seconds_Behind_Source is 300 and climbing. Read queries routed to the replica are returning data that is several minutes stale. The application is surfacing incorrect balances, missing recent records, or serving out-of-date inventory counts depending on what is being replicated.
Seconds_Behind_Source measures the timestamp difference between the most recently executed event on the replica and the timestamp recorded in the primary’s binlog for the same event. It is an estimate of how far behind the replica is in applying committed transactions from the primary. When it grows without bound, the replica is applying events slower than the primary is producing them — or it has stopped applying events entirely.
The distinction between “stopped” and “slow” is the first fork in the diagnostic tree.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
Seconds_Behind_Source growing | SHOW REPLICA STATUS\G | Replica is falling behind; does not indicate why |
SQL_Running: No | SHOW REPLICA STATUS\G | SQL thread stopped — replication halted, not just slow |
IO_Running: No | SHOW REPLICA STATUS\G | I/O thread stopped — not receiving new binlog events |
Last_SQL_Error non-empty | SHOW REPLICA STATUS\G | SQL thread encountered an error on a specific event |
| High relay log space | Relay_Log_Space in SHOW REPLICA STATUS | Binlog arriving faster than SQL thread can apply it |
| Long-running transactions on primary | INFORMATION_SCHEMA.INNODB_TRX | Large transactions create large binlog events that take time to apply |
First Five Checks
- Thread status — Verify both replication threads are running before investigating lag causes:
SHOW REPLICA STATUS\G
Look for Replica_IO_Running: Yes and Replica_SQL_Running: Yes. If either is No, read Last_IO_Error or Last_SQL_Error for the stop reason. A stopped thread is not a lag problem — it is a replication failure. Fix the root cause before any lag remediation.
- Long-running transactions on the primary — A single long transaction creates one large binlog event that the replica must apply sequentially:
SELECT trx_id, trx_started, trx_query,
TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS trx_age_sec
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;
Any transaction older than 30–60 seconds is a candidate for blocking replica apply. Check trx_query for the SQL responsible.
- Top queries by wait time on primary — Identify what the primary is spending time on:
SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT,
ROUND(SUM_TIMER_WAIT / COUNT_STAR / 1e12, 3) AS avg_latency_sec
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 5;
High-latency statements generating large binlog events are a common cause of chronic lag. A 10-second DELETE running every minute creates a 10-second replication backlog per cycle.
- Parallel apply configuration — Check whether multi-threaded replica apply is enabled:
SELECT @@replica_parallel_workers, @@replica_parallel_type;
If replica_parallel_workers is 0 or 1, the replica applies one transaction at a time. Modern MySQL supports LOGICAL_CLOCK parallelism, which applies transactions from the same binlog group commit in parallel. On a high-throughput primary, single-threaded apply is the most common cause of chronic lag.
- Relay log space — Check if the relay log backlog is growing:
SHOW REPLICA STATUS\G
Look at Relay_Log_Space. If this is large and growing, the I/O thread is receiving binlog events faster than the SQL thread processes them — confirming a slow-apply bottleneck rather than a network or connectivity issue.
Decision Tree
flowchart TD
A[Seconds_Behind_Source growing] --> B{SQL_Running = YES?}
B -->|no| C[Read Last_SQL_Error]
C --> D[Fix SQL error — skip or repair event]
B -->|yes| E{IO_Running = YES?}
E -->|no| F[Read Last_IO_Error]
F --> G[Fix network or auth issue]
E -->|yes| H{Long transaction on primary?}
H -->|yes| I[Reduce transaction size on primary]
H -->|no| J{parallel_workers is 0 or 1?}
J -->|yes| K[Enable LOGICAL_CLOCK parallel apply]
J -->|no| L{Relay log space growing?}
L -->|yes| M[Increase relay_log_space_limit or scale replica]
L -->|no| N[Check primary write volume vs replica capacity]
Remediation Options
Option 1 — Enable parallel replica apply
Single-threaded apply is the most common cause of lag on busy primaries. Enable multi-threaded apply using the LOGICAL_CLOCK algorithm, which replicates the parallelism from the primary’s binlog group commit:
SET GLOBAL replica_parallel_workers = 4;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
-- Required for crash-safe parallel apply
SET GLOBAL replica_preserve_commit_order = 1;
Restart the SQL thread to apply:
STOP REPLICA SQL_THREAD;
START REPLICA SQL_THREAD;
Monitor Seconds_Behind_Source to confirm the replica is catching up. The MySQL documentation recommends replica_preserve_commit_order = 1 when using parallel apply to maintain consistent external visibility.
Option 2 — Kill blocking long transactions on the primary
If a single large transaction is generating a binlog event that takes minutes to apply, identify and interrupt it:
-- On the primary
SELECT trx_id, trx_started, trx_mysql_thread_id
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;
KILL <trx_mysql_thread_id>;
After killing the transaction, verify it rolls back cleanly. This is disruptive — validate that the transaction is truly blocking before killing it. If the transaction is a scheduled batch job, coordinate with the application team to reduce its scope (process in smaller batches) or schedule it during low-replication-sensitivity windows.
Option 3 — Promote replica or add a new downstream replica
If the primary’s write volume consistently exceeds what a single replica can apply even with parallel workers, the architecture has reached a scale limit. Options:
- Promote the lagging replica to primary and demote the original (for planned maintenance or topology change)
- Add a second-tier replica that replicates from a relay replica closer to the primary
- Evaluate whether reads can be sharded or moved to a read-optimized layer
This is not a quick fix — it is an architectural response to sustained primary write volume exceeding replica apply capacity.
Rollback Plan
-
For parallel apply changes: Disable by setting
replica_parallel_workers = 0and restarting the SQL thread. The change is non-destructive — disabling parallel apply reverts to sequential mode immediately. -
For killed transactions on primary: The transaction will roll back automatically. Monitor
information_schema.INNODB_TRXto confirm the rollback completes. If the transaction was large, rollback can take as long as the original execution. No binlog event is emitted for the rolled-back transaction. -
For relay log space changes: Increasing
relay_log_space_limitis non-destructive and can be done at runtime withSET GLOBAL. Decreasing it requires waiting for relay log consumption to catch up first.
Automation Opportunity
Replication lag monitoring lends itself to a simple alerting script. The core signal — Seconds_Behind_Source above a threshold — can be captured from SHOW REPLICA STATUS via any MySQL-compatible monitoring tool (Percona Monitoring and Management, CloudWatch RDS Enhanced Monitoring, or a custom cron-driven script).
A more targeted automation: schedule a query on the primary every 5 minutes to check for transactions older than 60 seconds and write the result to a monitoring table. Any row in that table with trx_age_sec > 300 is a candidate for alerting before it generates a multi-minute binlog event that stalls the replica.
-- Scheduled check for long-running transactions (run on primary)
SELECT COUNT(*) AS long_txn_count
FROM information_schema.INNODB_TRX
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 60;
If this returns nonzero during steady-state operation, the replication lag root cause is already present even when lag is not yet visible.
Leadership Summary
- What broke: MySQL replication lag caused read replicas to serve stale data. The replica was applying committed transactions slower than the primary was producing them.
- What was done: Identified the root cause (long transactions or single-threaded apply), enabled parallel replica apply or reduced transaction scope on the primary, and verified
Seconds_Behind_Sourcereturned to near zero. - What prevents recurrence: Parallel apply configured with
LOGICAL_CLOCKhandles normal write volume. Long-transaction alerting on the primary gives early warning before binlog events stall the replica apply thread.
Checklist
- Run
SHOW REPLICA STATUS\Gand confirm bothReplica_IO_RunningandReplica_SQL_RunningareYes - Read
Last_SQL_ErrorandLast_IO_Error— if either is non-empty, address the error before diagnosing lag - Check
Seconds_Behind_Sourcetrend — is it growing, stable, or recovering? - Query
INFORMATION_SCHEMA.INNODB_TRXon primary for transactions older than 30 seconds - Run
performance_schema.events_statements_summary_by_digeston primary for top wait-time queries - Check
SELECT @@replica_parallel_workers, @@replica_parallel_type— if workers is 0 or 1, evaluate enabling parallel apply - Check
Relay_Log_SpacefromSHOW REPLICA STATUS— large growing relay log confirms slow-apply bottleneck - If enabling parallel apply, set
replica_preserve_commit_order = 1before restarting the SQL thread - After any change, monitor
Seconds_Behind_Sourcefor 10–15 minutes to confirm the trend reverses - Document the root cause and resolution in your incident log for pattern tracking
What to Do Next
- Problem:
Seconds_Behind_Sourcegrows during an incident and the natural instinct is to tune parallel workers — but if the SQL thread has stopped or there is a long transaction blocking apply, that tuning changes nothing. - Solution: Follow the decision tree: check thread status first, long transactions second, parallel apply configuration third, relay log space last. Each check either identifies the cause or rules it out before the next step.
- Proof: After the correct remediation,
Seconds_Behind_Sourcestops growing and trends back toward zero within a few minutes, confirming the apply bottleneck was addressed. - Action: This week, run
SELECT @@replica_parallel_workers, @@replica_parallel_typeon every replica in your fleet — if any replica hasparallel_workers = 0or1, evaluate enablingLOGICAL_CLOCKparallel apply before the next high-write event.