MySQL Replication Lag Decision Tree

Replication lag in MySQL is a symptom, not a cause — but the cause is almost always one of five things, and the diagnostic sequence matters. Engineers who start tuning parallel replica workers before they check whether the replica’s SQL thread is even running waste an hour on the wrong problem. This runbook covers the decision tree from first alert to targeted remediation.

Situation

The alert fires: Seconds_Behind_Source is 300 and climbing. Read queries routed to the replica are returning data that is several minutes stale. The application is surfacing incorrect balances, missing recent records, or serving out-of-date inventory counts depending on what is being replicated.

Seconds_Behind_Source measures the timestamp difference between the most recently executed event on the replica and the timestamp recorded in the primary’s binlog for the same event. It is an estimate of how far behind the replica is in applying committed transactions from the primary. When it grows without bound, the replica is applying events slower than the primary is producing them — or it has stopped applying events entirely.

The distinction between “stopped” and “slow” is the first fork in the diagnostic tree.

Symptoms

Signal	Where to see it	What it means
`Seconds_Behind_Source` growing	`SHOW REPLICA STATUS\G`	Replica is falling behind; does not indicate why
`SQL_Running: No`	`SHOW REPLICA STATUS\G`	SQL thread stopped — replication halted, not just slow
`IO_Running: No`	`SHOW REPLICA STATUS\G`	I/O thread stopped — not receiving new binlog events
`Last_SQL_Error` non-empty	`SHOW REPLICA STATUS\G`	SQL thread encountered an error on a specific event
High relay log space	`Relay_Log_Space` in SHOW REPLICA STATUS	Binlog arriving faster than SQL thread can apply it
Long-running transactions on primary	`INFORMATION_SCHEMA.INNODB_TRX`	Large transactions create large binlog events that take time to apply

First Five Checks

Thread status — Verify both replication threads are running before investigating lag causes:

SHOW REPLICA STATUS\G

Look for Replica_IO_Running: Yes and Replica_SQL_Running: Yes. If either is No, read Last_IO_Error or Last_SQL_Error for the stop reason. A stopped thread is not a lag problem — it is a replication failure. Fix the root cause before any lag remediation.

Long-running transactions on the primary — A single long transaction creates one large binlog event that the replica must apply sequentially:

SELECT trx_id, trx_started, trx_query,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS trx_age_sec
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;

Any transaction older than 30–60 seconds is a candidate for blocking replica apply. Check trx_query for the SQL responsible.

Top queries by wait time on primary — Identify what the primary is spending time on:

SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT,
       ROUND(SUM_TIMER_WAIT / COUNT_STAR / 1e12, 3) AS avg_latency_sec
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 5;

High-latency statements generating large binlog events are a common cause of chronic lag. A 10-second DELETE running every minute creates a 10-second replication backlog per cycle.

Parallel apply configuration — Check whether multi-threaded replica apply is enabled:

SELECT @@replica_parallel_workers, @@replica_parallel_type;

If replica_parallel_workers is 0 or 1, the replica applies one transaction at a time. Modern MySQL supports LOGICAL_CLOCK parallelism, which applies transactions from the same binlog group commit in parallel. On a high-throughput primary, single-threaded apply is the most common cause of chronic lag.

Relay log space — Check if the relay log backlog is growing:

SHOW REPLICA STATUS\G

Look at Relay_Log_Space. If this is large and growing, the I/O thread is receiving binlog events faster than the SQL thread processes them — confirming a slow-apply bottleneck rather than a network or connectivity issue.

Decision Tree

flowchart TD
    A[Seconds_Behind_Source growing] --> B{SQL_Running = YES?}
    B -->|no| C[Read Last_SQL_Error]
    C --> D[Fix SQL error — skip or repair event]
    B -->|yes| E{IO_Running = YES?}
    E -->|no| F[Read Last_IO_Error]
    F --> G[Fix network or auth issue]
    E -->|yes| H{Long transaction on primary?}
    H -->|yes| I[Reduce transaction size on primary]
    H -->|no| J{parallel_workers is 0 or 1?}
    J -->|yes| K[Enable LOGICAL_CLOCK parallel apply]
    J -->|no| L{Relay log space growing?}
    L -->|yes| M[Increase relay_log_space_limit or scale replica]
    L -->|no| N[Check primary write volume vs replica capacity]

Remediation Options

Option 1 — Enable parallel replica apply

Single-threaded apply is the most common cause of lag on busy primaries. Enable multi-threaded apply using the LOGICAL_CLOCK algorithm, which replicates the parallelism from the primary’s binlog group commit:

SET GLOBAL replica_parallel_workers = 4;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';

-- Required for crash-safe parallel apply
SET GLOBAL replica_preserve_commit_order = 1;

Restart the SQL thread to apply:

STOP REPLICA SQL_THREAD;
START REPLICA SQL_THREAD;

Monitor Seconds_Behind_Source to confirm the replica is catching up. The MySQL documentation recommends replica_preserve_commit_order = 1 when using parallel apply to maintain consistent external visibility.

Option 2 — Kill blocking long transactions on the primary

If a single large transaction is generating a binlog event that takes minutes to apply, identify and interrupt it:

-- On the primary
SELECT trx_id, trx_started, trx_mysql_thread_id
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;

KILL <trx_mysql_thread_id>;

After killing the transaction, verify it rolls back cleanly. This is disruptive — validate that the transaction is truly blocking before killing it. If the transaction is a scheduled batch job, coordinate with the application team to reduce its scope (process in smaller batches) or schedule it during low-replication-sensitivity windows.

Option 3 — Promote replica or add a new downstream replica

If the primary’s write volume consistently exceeds what a single replica can apply even with parallel workers, the architecture has reached a scale limit. Options:

Promote the lagging replica to primary and demote the original (for planned maintenance or topology change)
Add a second-tier replica that replicates from a relay replica closer to the primary
Evaluate whether reads can be sharded or moved to a read-optimized layer

This is not a quick fix — it is an architectural response to sustained primary write volume exceeding replica apply capacity.

Rollback Plan

For parallel apply changes: Disable by setting replica_parallel_workers = 0 and restarting the SQL thread. The change is non-destructive — disabling parallel apply reverts to sequential mode immediately.
For killed transactions on primary: The transaction will roll back automatically. Monitor information_schema.INNODB_TRX to confirm the rollback completes. If the transaction was large, rollback can take as long as the original execution. No binlog event is emitted for the rolled-back transaction.
For relay log space changes: Increasing relay_log_space_limit is non-destructive and can be done at runtime with SET GLOBAL. Decreasing it requires waiting for relay log consumption to catch up first.

Automation Opportunity

Replication lag monitoring lends itself to a simple alerting script. The core signal — Seconds_Behind_Source above a threshold — can be captured from SHOW REPLICA STATUS via any MySQL-compatible monitoring tool (Percona Monitoring and Management, CloudWatch RDS Enhanced Monitoring, or a custom cron-driven script).

A more targeted automation: schedule a query on the primary every 5 minutes to check for transactions older than 60 seconds and write the result to a monitoring table. Any row in that table with trx_age_sec > 300 is a candidate for alerting before it generates a multi-minute binlog event that stalls the replica.

-- Scheduled check for long-running transactions (run on primary)
SELECT COUNT(*) AS long_txn_count
FROM information_schema.INNODB_TRX
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 60;

If this returns nonzero during steady-state operation, the replication lag root cause is already present even when lag is not yet visible.

Leadership Summary

What broke: MySQL replication lag caused read replicas to serve stale data. The replica was applying committed transactions slower than the primary was producing them.
What was done: Identified the root cause (long transactions or single-threaded apply), enabled parallel replica apply or reduced transaction scope on the primary, and verified Seconds_Behind_Source returned to near zero.
What prevents recurrence: Parallel apply configured with LOGICAL_CLOCK handles normal write volume. Long-transaction alerting on the primary gives early warning before binlog events stall the replica apply thread.

Checklist

Run SHOW REPLICA STATUS\G and confirm both Replica_IO_Running and Replica_SQL_Running are Yes
Read Last_SQL_Error and Last_IO_Error — if either is non-empty, address the error before diagnosing lag
Check Seconds_Behind_Source trend — is it growing, stable, or recovering?
Query INFORMATION_SCHEMA.INNODB_TRX on primary for transactions older than 30 seconds
Run performance_schema.events_statements_summary_by_digest on primary for top wait-time queries
Check SELECT @@replica_parallel_workers, @@replica_parallel_type — if workers is 0 or 1, evaluate enabling parallel apply
Check Relay_Log_Space from SHOW REPLICA STATUS — large growing relay log confirms slow-apply bottleneck
If enabling parallel apply, set replica_preserve_commit_order = 1 before restarting the SQL thread
After any change, monitor Seconds_Behind_Source for 10–15 minutes to confirm the trend reverses
Document the root cause and resolution in your incident log for pattern tracking

What to Do Next

Problem: Seconds_Behind_Source grows during an incident and the natural instinct is to tune parallel workers — but if the SQL thread has stopped or there is a long transaction blocking apply, that tuning changes nothing.
Solution: Follow the decision tree: check thread status first, long transactions second, parallel apply configuration third, relay log space last. Each check either identifies the cause or rules it out before the next step.
Proof: After the correct remediation, Seconds_Behind_Source stops growing and trends back toward zero within a few minutes, confirming the apply bottleneck was addressed.
Action: This week, run SELECT @@replica_parallel_workers, @@replica_parallel_type on every replica in your fleet — if any replica has parallel_workers = 0 or 1, evaluate enabling LOGICAL_CLOCK parallel apply before the next high-write event.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

Leadership Summary

Checklist

What to Do Next

Rajiv

Related Posts

Autovacuum Is a Capacity Problem, Not a Maintenance Task

When Autovacuum Becomes a Backpressure Signal

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works