Replication lag is not one number — it is three. Write lag, flush lag, and replay lag measure different things, fail in different ways, and require different interventions. Monitoring only total lag means you cannot tell whether the standby is slow to receive, slow to confirm, or slow to apply.

Situation

PostgreSQL’s pg_stat_replication view (PostgreSQL 10 and later) exposes three lag components for each connected standby: write_lag, flush_lag, and replay_lag. Many monitoring setups surface only the largest, typically replay_lag, and alert on it as a single number. That number is correct but incomplete.

Replication lag is the delay between a change being committed on the primary and being available on the standby. But “available” means different things depending on what you are protecting against.

The Problem

An alert fires: replication lag on the standby has reached 45 seconds. The on-call engineer does not know: is the primary sending WAL slowly? Is the standby receiving but not flushing? Is the standby flushing but not replaying? Each has a different root cause and a different fix. Without understanding the three components, you cannot triage the alert correctly.

What do the three lag components actually measure, and which one is relevant to your RPO?

The Three Components

PostgreSQL reports each component as the time elapsed between the primary flushing recent WAL locally and the primary receiving notification that the standby has completed a given stage:

Write lag: time until the standby confirms it has written the WAL, meaning it has handed the data to its operating system but not yet flushed it to disk. This mostly reflects network latency and standby receive throughput.

Flush lag: time until the standby confirms it has flushed the WAL to durable storage. The gap between flush lag and write lag reflects the standby’s I/O performance for WAL writes.

Replay lag: time until the standby confirms it has replayed (applied) the WAL, which makes the change visible to queries on the standby. Replay can fall behind under high write volume, or when it is paused waiting on long-running standby queries that conflict with recovery.

-- On the primary: all three lag components per standby
SELECT application_name,
       write_lag,
       flush_lag,
       replay_lag,
       state,
       sync_state
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;

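-- On the standby: time since the last replayed transaction.
-- Report zero when all received WAL has been replayed; otherwise an idle
-- primary makes this value grow even though nothing is waiting to be applied.
SELECT CASE
         WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
           THEN interval '0'
         ELSE now() - pg_last_xact_replay_timestamp()
       END AS replication_lag;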
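
Time-based lag can be complemented with byte-based lag computed from the LSN columns of the same view. The query below is a minimal sketch, assuming PostgreSQL 10 or later and a session on the primary; the output column aliases are illustrative names chosen here, not part of the view.

```sql
-- On the primary: WAL bytes outstanding at each stage, per standby.
-- Each column isolates one hop, so a large value points at that hop.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS unsent_bytes,
       pg_wal_lsn_diff(sent_lsn,  write_lsn)           AS sent_not_written_bytes,
       pg_wal_lsn_diff(write_lsn, flush_lsn)           AS written_not_flushed_bytes,
       pg_wal_lsn_diff(flush_lsn, replay_lsn)          AS flushed_not_replayed_bytes
FROM pg_stat_replication;
```

The byte view is useful alongside the time view because the three lag columns show NULL once a standby has fully caught up and there has been no WAL activity for a short while, whereas the LSN columns remain populated.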

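For RPO purposes, flush_lag is the closest measure: WAL that the standby has already flushed survives a primary failure and is replayed before promotion completes, so only WAL not yet durable on the standby can be lost. replay_lag matters for a different reason: it determines how stale reads on the standby are and how much replay remains before a promoted standby can accept writes.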

In Practice

In typical physical streaming replication setups, write_lag and flush_lag are small (milliseconds in a well-connected environment) and replay_lag is the dominant component. Replay lag grows when the standby is I/O constrained while applying changes to data pages, when long-running read queries on the standby block recovery (hot standby conflicts), or when the primary generates WAL faster than the standby can replay it.
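
To check whether standby queries are the reason replay is falling behind, the standby’s own statistics views can help. The following is a rough sketch to run on the standby, assuming PostgreSQL 10 or later; it is a starting point for investigation, not a complete diagnostic.

```sql
-- On the standby: queries cancelled because of recovery conflicts, per database.
SELECT datname,
       confl_snapshot,
       confl_lock,
       confl_bufferpin,
       confl_tablespace,
       confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY confl_snapshot + confl_lock + confl_bufferpin DESC;

-- On the standby: longest-running client transactions, which are the
-- usual candidates for holding back replay.
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       left(query, 80)    AS query_start
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- How long replay is allowed to wait for conflicting queries.
SHOW max_standby_streaming_delay;
```

The conflict counters only increase when a query is actually cancelled, so a standby with a generous max_standby_streaming_delay can show growing replay lag while these counters stay flat.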

With synchronous_standby_names configured, synchronous_commit = remote_apply makes the primary wait until the synchronous standby has replayed the commit’s WAL before acknowledging the commit, so commit latency includes the standby’s replay time. synchronous_commit = remote_write waits only until the standby has written the WAL to its operating system, without waiting for a flush, which gives weaker durability guarantees but lower commit latency.
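
As an illustration of how these settings are applied, here is a hedged sketch run on the primary. The standby name standby1 is a placeholder for whatever application_name your standby reports, and the choice of 'FIRST 1' is an assumption for a single synchronous standby.

```sql
-- On the primary. 'standby1' is a placeholder, not a real server name.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();

-- synchronous_commit can also be lowered for a single transaction
-- that tolerates weaker guarantees in exchange for lower latency.
BEGIN;
SET LOCAL synchronous_commit = 'remote_write';
-- ... statements for this transaction ...
COMMIT;

-- Confirm which standbys the primary treats as synchronous.
SELECT application_name, sync_state
FROM pg_stat_replication;
```

Setting the level per transaction keeps the stricter default for critical writes while letting bulk or low-value writes avoid waiting on replay.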

Where It Breaks

| Lag component growing | Root cause                                                   | Fix                                                         |
|-----------------------|--------------------------------------------------------------|-------------------------------------------------------------|
| Write lag             | Network congestion or bandwidth saturation                   | Investigate network path; consider WAL compression          |
| Flush lag             | Standby I/O pressure (disk writes slow)                      | Upgrade standby storage; separate WAL to faster device      |
| Replay lag            | Long-running queries on standby causing hot standby conflicts | Tune max_standby_streaming_delay; cancel conflicting queries |
| All three             | Primary generating WAL faster than standby can process       | Vertical scale of standby; reduce primary write throughput  |
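
The mapping above can be approximated in a single query by splitting the cumulative lag values into the time spent in each stage. This is a minimal sketch: the column aliases and the labels in the CASE expression are names invented here, and the per-stage subtraction is an approximation because each lag value is sampled independently.

```sql
-- On the primary: attribute lag to the stage that contributes the most.
WITH stages AS (
  SELECT application_name,
         write_lag              AS receive_time,  -- network + receive
         flush_lag  - write_lag AS flush_time,    -- standby WAL I/O
         replay_lag - flush_lag AS replay_time    -- apply on standby
  FROM pg_stat_replication
  WHERE write_lag  IS NOT NULL
    AND flush_lag  IS NOT NULL
    AND replay_lag IS NOT NULL
)
SELECT application_name,
       receive_time,
       flush_time,
       replay_time,
       CASE
         WHEN replay_time >= GREATEST(receive_time, flush_time) THEN 'replay'
         WHEN flush_time  >= receive_time                       THEN 'flush'
         ELSE 'write'
       END AS dominant_stage
FROM stages;
```

Standbys that are fully caught up report NULL lag and are filtered out, so an empty result usually means there is nothing to triage.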

What to Do Next

  • Problem: Monitoring a single lag number does not distinguish between a network problem, a standby I/O problem, and a replay conflict — three very different operational responses.
  • Solution: Monitor all three components separately; alert on flush_lag > RPO_threshold for durability (flushed WAL survives promotion) and on replay_lag for read staleness and failover time; alert on flush_lag > write_lag * 5 to detect standby I/O problems specifically.
  • Proof: With per-component monitoring in place, a lag spike shows which component is growing, so the on-call engineer can start with the matching row of the table above instead of working through all three possibilities.
  • Action: Run the pg_stat_replication query above right now on your primary and capture the three lag values as your baseline — if you have never looked at them before, you likely do not know which component your standby’s lag comes from.
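
The baseline capture and the alert conditions in the list above could be sketched as follows. The table name replication_lag_baseline, the 30 second threshold, and the 100 millisecond floor are all assumptions made for illustration; substitute values derived from your own RPO.

```sql
-- On the primary: store a timestamped baseline of the three components.
-- replication_lag_baseline is a hypothetical table name.
CREATE TABLE IF NOT EXISTS replication_lag_baseline AS
SELECT now() AS captured_at,
       application_name,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;

-- On the primary: evaluate the alert conditions per standby.
-- Thresholds are illustrative assumptions, not recommendations.
SELECT application_name,
       flush_lag  > interval '30 seconds' AS flush_over_threshold,
       replay_lag > interval '30 seconds' AS replay_over_threshold,
       (flush_lag > write_lag * 5
        AND flush_lag > interval '100 milliseconds') AS standby_io_suspect
FROM pg_stat_replication;
```

The floor on flush_lag is there because the ratio test alone fires easily when write_lag is close to zero. A NULL in any flag means the standby is caught up and idle, which a monitoring system should treat as healthy rather than as missing data.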