Replication Lag Explained

Replication lag is not one number — it is three. Write lag, flush lag, and replay lag measure different things, fail in different ways, and require different interventions. Monitoring only total lag means you cannot tell whether the standby is slow to receive, slow to confirm, or slow to apply.

Situation

PostgreSQL’s pg_stat_replication view (PostgreSQL 10 and later) exposes three lag components for each connected standby: write_lag, flush_lag, and replay_lag. Most monitoring systems expose only the largest, typically replay_lag, and alert on it as a single number. That number is correct but incomplete.

Replication lag is the delay between a change being committed on the primary and being available on the standby. But “available” means different things depending on what you are protecting against.

The Problem

An alert fires: replication lag on the standby has reached 45 seconds. The on-call engineer does not know: is the primary sending WAL slowly? Is the standby receiving but not flushing? Is the standby flushing but not replaying? Each has a different root cause and a different fix. Without understanding the three components, you cannot triage the alert correctly.

What do the three lag components actually measure, and which one is relevant to your RPO?

The Three Components

PostgreSQL measures each lag as the time between the primary flushing recent WAL locally and the primary receiving confirmation that the standby has completed a given stage. All three are measured from the same starting point, so each later value includes the earlier stages:

Write lag: time until the standby confirms it has written the WAL to its operating system, without yet flushing it to disk. This mostly reflects network latency and standby receive throughput.

Flush lag: time until the standby confirms it has flushed the WAL to disk. The gap between write lag and flush lag reflects the standby’s I/O performance for WAL writes.

Replay lag: time until the standby confirms it has replayed the WAL, making the change visible to queries on the standby. Replay can fall behind under high write volume, or when long-running queries on the standby conflict with replay and force it to wait.

-- On the primary: all three lag components per standby
SELECT application_name,
       write_lag,
       flush_lag,
       replay_lag,
       state,
       sync_state
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;

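-- On the standby: time since the last replayed transaction was committed on the primary
-- Caveat: this keeps growing while the primary is idle, even if the standby is fully caught up
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;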

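For RPO purposes, flush_lag is the component to watch: WAL that has been flushed on the standby survives a primary failure, and a promoted standby replays all the WAL it has received before it accepts writes. replay_lag matters for a different reason: it determines how stale reads on the standby are and how long promotion takes, because the replay backlog has to be cleared first.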

In Practice

In physical streaming replication, write_lag and flush_lag are typically small (milliseconds in a well-connected environment), and replay_lag is usually the dominant component. Replay lag grows when: the standby is I/O constrained applying data pages; the standby has long-running read queries that block recovery (hot standby conflict); or the primary is generating WAL faster than the standby can replay.

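With synchronous replication configured (synchronous_standby_names set), synchronous_commit = remote_apply makes the primary wait until the synchronous standby has replayed the commit’s WAL before acknowledging it, so every commit pays the standby’s replay time in latency. synchronous_commit = remote_write waits only until the standby has written the WAL to its operating system. That lowers commit latency but weakens the guarantee: the data can still be lost if the standby’s operating system crashes before the flush.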
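The relationship between synchronous_commit levels and the lag components can be written down as a small lookup. This is only a sketch for reasoning about commit latency: the names WAITS_FOR and commit_wait_component are hypothetical, and the mapping assumes at least one synchronous standby is configured.

```python
# Which standby stage the primary waits for at each synchronous_commit level,
# assuming synchronous_standby_names lists at least one standby.
WAITS_FOR = {
    "off": None,                   # no wait, not even for the local flush
    "local": None,                 # local flush only, no standby wait
    "remote_write": "write_lag",   # standby has written WAL to its OS
    "on": "flush_lag",             # standby has flushed WAL to disk
    "remote_apply": "replay_lag",  # standby has replayed the WAL
}


def commit_wait_component(level):
    """Return the lag component that adds to commit latency, or None."""
    if level not in WAITS_FOR:
        raise ValueError(f"unknown synchronous_commit level: {level!r}")
    return WAITS_FOR[level]


print(commit_wait_component("remote_apply"))
```

Read this way, the setting is a choice of which of the three lags you are willing to pay on every commit.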

Where It Breaks

Lag component growing	Root cause	Fix
Write lag	Network congestion or bandwidth saturation	Investigate network path; consider WAL compression
Flush lag	Standby I/O pressure (disk writes slow)	Upgrade standby storage; separate WAL to faster device
Replay lag	Long-running queries on standby causing hot standby conflicts	Tune max_standby_streaming_delay; cancel conflicting queries
All three	Primary generating WAL faster than standby can process	Scale up the standby; reduce primary write throughput
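The table can be turned into a first-pass triage helper. The sketch below is illustrative, not a definitive diagnostic: it assumes the three values have already been read from pg_stat_replication as Python timedelta objects, and the names stage_costs, dominant_stage, and HINTS are hypothetical. Because all three lags are measured from the same starting point on the primary, it subtracts adjacent components to estimate how much time each stage adds on its own, clamping at zero in case the samples are not strictly ordered.

```python
from datetime import timedelta

# Hypothetical hint text, paraphrasing the table above.
HINTS = {
    "write": "network path between primary and standby",
    "flush": "standby WAL disk I/O",
    "replay": "standby replay (conflicting queries or apply throughput)",
}


def stage_costs(write_lag, flush_lag, replay_lag):
    """Split the cumulative lags into the time each stage adds."""
    zero = timedelta(0)
    return {
        "write": max(write_lag, zero),
        "flush": max(flush_lag - write_lag, zero),
        "replay": max(replay_lag - flush_lag, zero),
    }


def dominant_stage(write_lag, flush_lag, replay_lag):
    """Return the stage contributing the most lag and where to look first."""
    costs = stage_costs(write_lag, flush_lag, replay_lag)
    stage = max(costs, key=costs.get)
    return stage, HINTS[stage]


stage, hint = dominant_stage(
    timedelta(milliseconds=20),
    timedelta(milliseconds=35),
    timedelta(seconds=45),
)
print(stage, "->", hint)
```

A single sample only says which stage is largest right now. Comparing samples over time is what shows which one is growing.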

What to Do Next

Problem: Monitoring a single lag number does not distinguish between a network problem, a standby I/O problem, and a replay conflict — three very different operational responses.
Solution: Monitor all three components separately; alert on flush_lag > RPO_threshold for durability, since WAL flushed on the standby survives a primary failure; alert on replay_lag for read staleness and failover time; alert on flush_lag > write_lag * 5 to detect standby I/O problems specifically.
Proof: With per-component monitoring in place, a lag spike shows directly which component is growing, so triage starts at the right subsystem instead of with guesswork.
Action: Run the pg_stat_replication query above right now on your primary and capture the three lag values as your baseline — if you have never looked at them before, you likely do not know which component your standby’s lag comes from.
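Per-component alerting along these lines can be sketched in a few lines of Python. Treat it as a starting point under stated assumptions: the function name lag_alerts, the dictionary layout, and the example thresholds are all hypothetical, and a lag value may be None because pg_stat_replication reports NULL once a caught-up standby has been idle for a while.

```python
from datetime import timedelta


def lag_alerts(lags, limits, io_ratio=5.0):
    """Return alert messages for one standby.

    lags and limits map component names ("write_lag", "flush_lag",
    "replay_lag") to timedelta values. limits may omit components,
    and lag values may be None (NULL in pg_stat_replication).
    """
    alerts = []
    for name, limit in limits.items():
        value = lags.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name} {value} exceeds {limit}")

    write, flush = lags.get("write_lag"), lags.get("flush_lag")
    # Ratio rule from the text: flush lag far above write lag
    # points at standby I/O rather than the network.
    if (write is not None and flush is not None
            and write > timedelta(0) and flush > write * io_ratio):
        alerts.append(f"flush_lag exceeds {io_ratio}x write_lag: check standby I/O")
    return alerts


sample = {
    "write_lag": timedelta(milliseconds=10),
    "flush_lag": timedelta(milliseconds=80),
    "replay_lag": timedelta(seconds=45),
}
# Example thresholds only; pick values that match your own RPO and tolerance.
limits = {"flush_lag": timedelta(seconds=5), "replay_lag": timedelta(seconds=30)}
for message in lag_alerts(sample, limits):
    print(message)
```

Keeping the thresholds in a dictionary makes it easy to set a different limit per component instead of one number for everything.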

Rajiv
