Logical Replication Failure Workflow

Logical replication lag does not announce itself with an error message — it accumulates silently in the WAL retention on the publisher, and the subscriber falls further and further behind until either the replication slot fills the disk or you notice the data is hours stale. Unlike streaming replication, which breaks loudly, logical replication degrades quietly: the subscription stays connected, the apply worker reports running, and the divergence grows until something downstream catches it.

Situation

PostgreSQL logical replication works by decoding WAL changes on the publisher into a row-level change stream, which the subscriber applies table by table. This is fundamentally different from physical replication, which ships binary WAL blocks. Logical replication lets you replicate subsets of tables, replicate across major versions, and fan out to multiple subscribers — but it introduces failure modes that streaming replication does not have.

The most common operational problems: a subscription falls behind because the apply worker hit a conflict (an update arriving for a row that does not exist on the subscriber); the subscription is technically active but the apply worker is stalled waiting for a lock; the publisher and subscriber diverge on schema, causing the apply worker to crash with a type mismatch; or the replication slot on the publisher accumulates enough unreleased WAL to fill the disk.

The diagnostic workflow must cover all four of these. They share symptoms but have different root causes and different remediations.

Symptoms

Signal	Where to see it	What it means
Increasing lag between publisher and subscriber	`pg_replication_slots.confirmed_flush_lsn` vs `pg_current_wal_lsn()`	Apply worker not keeping up — lag in bytes growing
Replication slot holding excessive WAL	`pg_replication_slots` — slot not advancing	Subscriber disconnected or stalled; disk risk if slot persists
Apply worker process absent from `pg_stat_subscription`	`pg_stat_activity`, `pg_stat_subscription`	Apply worker crashed — check PostgreSQL error log
Subscription state `e` (error) in `pg_subscription_rel`	`pg_subscription_rel.srsubstate`	Specific table failed to apply — conflict or schema mismatch
Error message in logs — “conflict in logical replication”	`postgresql.log`	Row-level conflict on insert, update, or delete
Schema-related error in logs — “column X of relation Y does not exist”	`postgresql.log`	DDL executed on publisher without matching DDL on subscriber

First Five Checks

Replication lag in bytes — the most immediate measure of how far behind the subscriber is:

-- Run on the publisher
SELECT
  slot_name,
  plugin,
  active,
  confirmed_flush_lsn,
  pg_current_wal_lsn() - confirmed_flush_lsn AS lag_bytes,
  pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) AS lag_human
FROM pg_replication_slots
WHERE slot_type = 'logical';

A growing lag_bytes means the subscriber is not applying changes as fast as they are being generated. A slot that is not active (no connected subscriber) is holding WAL indefinitely — disk risk. A slot that is active but lag_bytes is growing means the apply worker is falling behind.

Subscription status — verify the subscription is enabled and the apply worker is running:

-- Run on the subscriber
SELECT
  subname,
  subenabled,
  subpublications,
  subconninfo
FROM pg_subscription;

subenabled = false means the subscription was manually disabled. It will not apply changes until re-enabled. This is the most common cause of lag that looks like a network issue but is actually an administrative action that was forgotten.

Per-table replication state — identify which tables are in which state:

-- Run on the subscriber
SELECT
  srrelid::regclass AS tablename,
  srsubstate,
  srsublsn
FROM pg_subscription_rel
ORDER BY srsubstate;

State codes: i = initialize, d = data copy in progress, s = synchronized, r = ready, e = error. A table in state e has failed to apply changes — check the error log for the specific conflict or error. A table stuck in state d for an extended period means the initial data copy is running slowly or stalled.

Apply worker activity — check what the apply worker is currently doing:

-- Run on the subscriber
SELECT
  pid,
  application_name,
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  now() - backend_start AS worker_age
FROM pg_stat_replication;

-- Also check the subscription worker directly
SELECT
  subname,
  pid,
  received_lsn,
  last_msg_send_time,
  last_msg_receipt_time,
  latest_end_lsn,
  latest_end_time
FROM pg_stat_subscription;

A pid that is NULL in pg_stat_subscription means no worker is running for that subscription. Check the PostgreSQL log for the crash reason.

Error log review — the log contains the exact conflict type and LSN:

# Find conflict-related errors in the PostgreSQL log
grep -E 'ERROR|conflict|replication' /var/log/postgresql/postgresql.log | tail -50

# More targeted
grep 'logical replication' /var/log/postgresql/postgresql.log | tail -20

The log will contain lines like ERROR: duplicate key value violates unique constraint or ERROR: could not find row for updating — these identify the conflict type. The log also shows the LSN at which the conflict occurred, which is needed for the SKIP remediation in Option 1 below.

Decision Tree

flowchart TD
    A[Logical replication lag growing] --> B{Subscription enabled?}
    B -->|no| C[ALTER SUBSCRIPTION sub ENABLE]
    B -->|yes| D{Apply worker running?}
    D -->|no — pid null| E[Check pg_subscription_rel for error state]
    E --> F{Table in error state?}
    F -->|yes| G{Conflict type?}
    G -->|insert conflict| H[ALTER SUBSCRIPTION sub SKIP lsn]
    G -->|update or delete missing row| I[ALTER SUBSCRIPTION sub SKIP lsn]
    G -->|schema mismatch| J[Apply DDL to subscriber — re-enable]
    D -->|yes — worker running| K{Lag growing despite active worker?}
    K -->|yes| L{Publisher write rate too high?}
    L -->|yes| M[Tune max_logical_replication_workers]
    L -->|no| N{Lock wait on subscriber?}
    N -->|yes| O[Identify blocking query on subscriber]
    N -->|no| P[Check network throughput publisher to subscriber]
    F -->|no — stuck in data copy| Q[Check disk and I/O on subscriber]

Remediation Options

Option 1 — Skip a conflicting transaction

When the apply worker fails due to a row conflict — an update or delete targeting a row that does not exist on the subscriber, or an insert violating a unique constraint — the correct resolution is to identify the LSN of the conflicting transaction and skip it:

-- On the subscriber, find the last received LSN from pg_stat_subscription
SELECT received_lsn FROM pg_stat_subscription WHERE subname = 'my_subscription';

-- Skip the conflicting transaction (PostgreSQL 15+)
ALTER SUBSCRIPTION my_subscription SKIP (lsn = 'LSN_VALUE');

-- For PostgreSQL 14 and earlier, use:
SELECT pg_replication_origin_advance('pg_16399', 'LSN_VALUE');
-- where 16399 is the subscription OID from pg_subscription

After skipping, re-enable the subscription if it was auto-disabled:

ALTER SUBSCRIPTION my_subscription ENABLE;

The skipped transaction is permanently lost on the subscriber. Before skipping, verify the row conflict is expected — for example, the subscriber already has the correct version of that row through another path. If data integrity is critical, investigate why the divergence occurred before skipping blindly.

Option 2 — Resync after schema drift

When a schema change (DDL) was applied to the publisher without also being applied to the subscriber, the apply worker will crash with a column or type mismatch error. The fix is to apply the matching DDL to the subscriber, then re-enable the subscription:

-- On the subscriber: apply the matching DDL
ALTER TABLE orders ADD COLUMN shipped_at timestamptz;

-- Re-enable the subscription
ALTER SUBSCRIPTION my_subscription ENABLE;

-- Verify lag starts recovering
SELECT pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn)
FROM pg_replication_slots
WHERE slot_name = 'my_subscription';  -- check on publisher

Logical replication does not replicate DDL. Every schema change on the publisher must be manually applied to the subscriber in the correct order before re-enabling the subscription.

Option 3 — Full resync of a specific table

When the data divergence is too large to resolve by skipping individual transactions, resync the affected table:

-- On the subscriber: refresh the subscription for a specific table
ALTER SUBSCRIPTION my_subscription REFRESH PUBLICATION FOR ALL TABLES;

-- Or drop and recreate with initial data copy
ALTER SUBSCRIPTION my_subscription DISABLE;
DROP SUBSCRIPTION my_subscription;
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=publisher port=5432 dbname=mydb'
  PUBLICATION my_publication
  WITH (copy_data = true, create_slot = true);

A full resync will re-copy all data for subscribed tables. On large tables this can take hours. During resync, the subscriber is in an inconsistent state. If downstream applications read from the subscriber during resync, they should be aware the data is being rebuilt.

Rollback Plan

ALTER SUBSCRIPTION sub ENABLE and DISABLE are immediately reversible — toggle between them as needed. No data is lost.
ALTER SUBSCRIPTION sub SKIP (lsn) is irreversible — the skipped transaction is permanently lost on the subscriber. There is no undo. The only recovery if the skipped data was needed is a full table resync.
DDL applied to the subscriber for schema drift: cannot be automatically undone — but the DDL itself can be reversed (e.g., ALTER TABLE DROP COLUMN) if the column is not yet populated. Coordinate DDL rollback with the publisher-side change.
DROP SUBSCRIPTION followed by CREATE SUBSCRIPTION: dropping a subscription removes the replication slot on the publisher. The slot must be recreated (it happens automatically with create_slot = true). Once dropped, WAL that was retained for the old slot is released.

Automation Opportunity

Replication lag monitoring should be a first-class alert, not a periodic check. The key metric is the byte lag at the replication slot:

-- Scheduled query to capture slot lag for alerting
SELECT cron.schedule('replication-lag-monitor', '*/5 * * * *', $$
  INSERT INTO ops.replication_lag (slot_name, lag_bytes, active, captured_at)
  SELECT
    slot_name,
    pg_current_wal_lsn() - confirmed_flush_lsn,
    active,
    now()
  FROM pg_replication_slots
  WHERE slot_type = 'logical';
$$);

Alert thresholds: lag exceeding 1 GB warrants a warning; lag exceeding 10 GB is an incident — the publisher is retaining that much WAL, and disk exhaustion is a real risk. A slot that becomes active = false for more than 5 minutes outside a maintenance window should page immediately.

In Practice

The PostgreSQL logical replication documentation describes conflict handling behavior: when an apply worker encounters a conflict (e.g., a unique constraint violation), it pauses the apply process and waits for manual intervention. The documented resolution is either to skip the conflicting transaction using ALTER SUBSCRIPTION ... SKIP (PostgreSQL 15+) or to use pg_replication_origin_advance on earlier versions. The documentation explicitly states that skipping is a destructive operation — the skipped changes are permanently absent from the subscriber.

The documented constraint on logical replication and DDL is unambiguous: DDL changes are not replicated. The PostgreSQL replication documentation requires that schema changes be applied to all subscribers before or simultaneously with the publisher, depending on whether the change is backward-compatible. Adding a nullable column with a default is backward-compatible and can be applied to the subscriber after the publisher; removing a column is not backward-compatible and must be applied to both simultaneously.

Where It Breaks

Failure mode	Trigger	Fix
Replication slot fills disk on publisher	Subscriber disconnected for hours while high-write workload runs	Monitor slot lag; set `max_slot_wal_keep_size` to cap WAL retention
Apply worker stuck waiting for lock	Long-running query on subscriber table being replicated	Identify and terminate the blocking query on subscriber
`SKIP` causes downstream data inconsistency	Skipped row was a critical update needed for referential integrity	Resync the table after skip; audit downstream data for orphaned rows
Schema divergence not caught until conflict	Publisher DDL run without notifying the subscriber	Add subscriber DDL to publisher migration scripts; use migration locking tools
`max_wal_senders` exceeded	Too many replication connections — logical and physical combined	Increase `max_wal_senders` in `postgresql.conf`; requires restart

What to Do Next

Problem: Logical replication lag accumulates silently, WAL retention grows on the publisher, and by the time the disk alert fires, the subscriber is hours behind with no fast path to catch up.
Solution: Add active monitoring on replication slot lag bytes with an alert threshold at 1 GB, set max_slot_wal_keep_size as a disk safety cap, and treat any pg_subscription_rel table in e state as an incident requiring same-day resolution.
Proof: After resolving a conflict and re-enabling the subscription, pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) from the publisher should decrease steadily — the subscriber is catching up.
Action: Run Check 1 on the publisher this week. If any replication slot shows lag_bytes > 1 GB or active = false, treat it as an open incident. If lag is normal, add a monitoring alert so you know before it becomes critical.

Checklist

Query pg_replication_slots on publisher — check active status and lag_bytes for each logical slot
Query pg_subscription on subscriber — verify subenabled = true for each subscription
Query pg_subscription_rel on subscriber — check srsubstate for any tables in e (error) state
Query pg_stat_subscription on subscriber — confirm pid is not NULL for each subscription
Review PostgreSQL log on subscriber for conflict type and LSN
If table in error state with row conflict: use ALTER SUBSCRIPTION sub SKIP (lsn) to unblock
If schema mismatch: apply matching DDL to subscriber, then re-enable subscription
If apply worker stalled on lock: identify and resolve the blocking query on subscriber
After resolution, monitor lag_bytes decreasing — confirm subscriber is catching up
Set max_slot_wal_keep_size on publisher to cap disk usage from stalled slots
Add monitoring alert at lag > 1 GB per logical replication slot
Document schema change protocol — every publisher DDL must have a matching subscriber DDL step

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

In Practice

Where It Breaks

What to Do Next

Checklist

Rajiv

Related Posts

Autovacuum Is a Capacity Problem, Not a Maintenance Task

When Autovacuum Becomes a Backpressure Signal

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works