Logical Replication Failure Workflow
Logical replication lag does not announce itself with an error message — it accumulates silently in the WAL retention on the publisher, and the subscriber falls further and further behind until either the replication slot fills the disk or you notice the data is hours stale. Unlike streaming replication, which breaks loudly, logical replication degrades quietly: the subscription stays connected, the apply worker reports running, and the divergence grows until something downstream catches it.
Situation
PostgreSQL logical replication works by decoding WAL changes on the publisher into a row-level change stream, which the subscriber applies table by table. This is fundamentally different from physical replication, which ships binary WAL blocks. Logical replication lets you replicate subsets of tables, replicate across major versions, and fan out to multiple subscribers — but it introduces failure modes that streaming replication does not have.
The most common operational problems: a subscription falls behind because the apply worker hit a conflict (an update arriving for a row that does not exist on the subscriber); the subscription is technically active but the apply worker is stalled waiting for a lock; the publisher and subscriber diverge on schema, causing the apply worker to crash with a type mismatch; or the replication slot on the publisher accumulates enough unreleased WAL to fill the disk.
The diagnostic workflow must cover all four of these. They share symptoms but have different root causes and different remediations.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
| Increasing lag between publisher and subscriber | pg_replication_slots.confirmed_flush_lsn vs pg_current_wal_lsn() | Apply worker not keeping up — lag in bytes growing |
| Replication slot holding excessive WAL | pg_replication_slots — slot not advancing | Subscriber disconnected or stalled; disk risk if slot persists |
Apply worker process absent from pg_stat_subscription | pg_stat_activity, pg_stat_subscription | Apply worker crashed — check PostgreSQL error log |
Subscription state e (error) in pg_subscription_rel | pg_subscription_rel.srsubstate | Specific table failed to apply — conflict or schema mismatch |
| Error message in logs — “conflict in logical replication” | postgresql.log | Row-level conflict on insert, update, or delete |
| Schema-related error in logs — “column X of relation Y does not exist” | postgresql.log | DDL executed on publisher without matching DDL on subscriber |
First Five Checks
- Replication lag in bytes — the most immediate measure of how far behind the subscriber is:
-- Run on the publisher
SELECT
slot_name,
plugin,
active,
confirmed_flush_lsn,
pg_current_wal_lsn() - confirmed_flush_lsn AS lag_bytes,
pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn) AS lag_human
FROM pg_replication_slots
WHERE slot_type = 'logical';
A growing lag_bytes means the subscriber is not applying changes as fast as they are being generated. A slot that is not active (no connected subscriber) is holding WAL indefinitely — disk risk. A slot that is active but lag_bytes is growing means the apply worker is falling behind.
- Subscription status — verify the subscription is enabled and the apply worker is running:
-- Run on the subscriber
SELECT
subname,
subenabled,
subpublications,
subconninfo
FROM pg_subscription;
subenabled = false means the subscription was manually disabled. It will not apply changes until re-enabled. This is the most common cause of lag that looks like a network issue but is actually an administrative action that was forgotten.
- Per-table replication state — identify which tables are in which state:
-- Run on the subscriber
SELECT
srrelid::regclass AS tablename,
srsubstate,
srsublsn
FROM pg_subscription_rel
ORDER BY srsubstate;
State codes: i = initialize, d = data copy in progress, s = synchronized, r = ready, e = error. A table in state e has failed to apply changes — check the error log for the specific conflict or error. A table stuck in state d for an extended period means the initial data copy is running slowly or stalled.
- Apply worker activity — check what the apply worker is currently doing:
-- Run on the subscriber
SELECT
pid,
application_name,
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
now() - backend_start AS worker_age
FROM pg_stat_replication;
-- Also check the subscription worker directly
SELECT
subname,
pid,
received_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_lsn,
latest_end_time
FROM pg_stat_subscription;
A pid that is NULL in pg_stat_subscription means no worker is running for that subscription. Check the PostgreSQL log for the crash reason.
- Error log review — the log contains the exact conflict type and LSN:
# Find conflict-related errors in the PostgreSQL log
grep -E 'ERROR|conflict|replication' /var/log/postgresql/postgresql.log | tail -50
# More targeted
grep 'logical replication' /var/log/postgresql/postgresql.log | tail -20
The log will contain lines like ERROR: duplicate key value violates unique constraint or ERROR: could not find row for updating — these identify the conflict type. The log also shows the LSN at which the conflict occurred, which is needed for the SKIP remediation in Option 1 below.
Decision Tree
flowchart TD
A[Logical replication lag growing] --> B{Subscription enabled?}
B -->|no| C[ALTER SUBSCRIPTION sub ENABLE]
B -->|yes| D{Apply worker running?}
D -->|no — pid null| E[Check pg_subscription_rel for error state]
E --> F{Table in error state?}
F -->|yes| G{Conflict type?}
G -->|insert conflict| H[ALTER SUBSCRIPTION sub SKIP lsn]
G -->|update or delete missing row| I[ALTER SUBSCRIPTION sub SKIP lsn]
G -->|schema mismatch| J[Apply DDL to subscriber — re-enable]
D -->|yes — worker running| K{Lag growing despite active worker?}
K -->|yes| L{Publisher write rate too high?}
L -->|yes| M[Tune max_logical_replication_workers]
L -->|no| N{Lock wait on subscriber?}
N -->|yes| O[Identify blocking query on subscriber]
N -->|no| P[Check network throughput publisher to subscriber]
F -->|no — stuck in data copy| Q[Check disk and I/O on subscriber]
Remediation Options
Option 1 — Skip a conflicting transaction
When the apply worker fails due to a row conflict — an update or delete targeting a row that does not exist on the subscriber, or an insert violating a unique constraint — the correct resolution is to identify the LSN of the conflicting transaction and skip it:
-- On the subscriber, find the last received LSN from pg_stat_subscription
SELECT received_lsn FROM pg_stat_subscription WHERE subname = 'my_subscription';
-- Skip the conflicting transaction (PostgreSQL 15+)
ALTER SUBSCRIPTION my_subscription SKIP (lsn = 'LSN_VALUE');
-- For PostgreSQL 14 and earlier, use:
SELECT pg_replication_origin_advance('pg_16399', 'LSN_VALUE');
-- where 16399 is the subscription OID from pg_subscription
After skipping, re-enable the subscription if it was auto-disabled:
ALTER SUBSCRIPTION my_subscription ENABLE;
The skipped transaction is permanently lost on the subscriber. Before skipping, verify the row conflict is expected — for example, the subscriber already has the correct version of that row through another path. If data integrity is critical, investigate why the divergence occurred before skipping blindly.
Option 2 — Resync after schema drift
When a schema change (DDL) was applied to the publisher without also being applied to the subscriber, the apply worker will crash with a column or type mismatch error. The fix is to apply the matching DDL to the subscriber, then re-enable the subscription:
-- On the subscriber: apply the matching DDL
ALTER TABLE orders ADD COLUMN shipped_at timestamptz;
-- Re-enable the subscription
ALTER SUBSCRIPTION my_subscription ENABLE;
-- Verify lag starts recovering
SELECT pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn)
FROM pg_replication_slots
WHERE slot_name = 'my_subscription'; -- check on publisher
Logical replication does not replicate DDL. Every schema change on the publisher must be manually applied to the subscriber in the correct order before re-enabling the subscription.
Option 3 — Full resync of a specific table
When the data divergence is too large to resolve by skipping individual transactions, resync the affected table:
-- On the subscriber: refresh the subscription for a specific table
ALTER SUBSCRIPTION my_subscription REFRESH PUBLICATION FOR ALL TABLES;
-- Or drop and recreate with initial data copy
ALTER SUBSCRIPTION my_subscription DISABLE;
DROP SUBSCRIPTION my_subscription;
CREATE SUBSCRIPTION my_subscription
CONNECTION 'host=publisher port=5432 dbname=mydb'
PUBLICATION my_publication
WITH (copy_data = true, create_slot = true);
A full resync will re-copy all data for subscribed tables. On large tables this can take hours. During resync, the subscriber is in an inconsistent state. If downstream applications read from the subscriber during resync, they should be aware the data is being rebuilt.
Rollback Plan
ALTER SUBSCRIPTION sub ENABLEandDISABLEare immediately reversible — toggle between them as needed. No data is lost.ALTER SUBSCRIPTION sub SKIP (lsn)is irreversible — the skipped transaction is permanently lost on the subscriber. There is no undo. The only recovery if the skipped data was needed is a full table resync.- DDL applied to the subscriber for schema drift: cannot be automatically undone — but the DDL itself can be reversed (e.g.,
ALTER TABLE DROP COLUMN) if the column is not yet populated. Coordinate DDL rollback with the publisher-side change. DROP SUBSCRIPTIONfollowed byCREATE SUBSCRIPTION: dropping a subscription removes the replication slot on the publisher. The slot must be recreated (it happens automatically withcreate_slot = true). Once dropped, WAL that was retained for the old slot is released.
Automation Opportunity
Replication lag monitoring should be a first-class alert, not a periodic check. The key metric is the byte lag at the replication slot:
-- Scheduled query to capture slot lag for alerting
SELECT cron.schedule('replication-lag-monitor', '*/5 * * * *', $$
INSERT INTO ops.replication_lag (slot_name, lag_bytes, active, captured_at)
SELECT
slot_name,
pg_current_wal_lsn() - confirmed_flush_lsn,
active,
now()
FROM pg_replication_slots
WHERE slot_type = 'logical';
$$);
Alert thresholds: lag exceeding 1 GB warrants a warning; lag exceeding 10 GB is an incident — the publisher is retaining that much WAL, and disk exhaustion is a real risk. A slot that becomes active = false for more than 5 minutes outside a maintenance window should page immediately.
In Practice
The PostgreSQL logical replication documentation describes conflict handling behavior: when an apply worker encounters a conflict (e.g., a unique constraint violation), it pauses the apply process and waits for manual intervention. The documented resolution is either to skip the conflicting transaction using ALTER SUBSCRIPTION ... SKIP (PostgreSQL 15+) or to use pg_replication_origin_advance on earlier versions. The documentation explicitly states that skipping is a destructive operation — the skipped changes are permanently absent from the subscriber.
The documented constraint on logical replication and DDL is unambiguous: DDL changes are not replicated. The PostgreSQL replication documentation requires that schema changes be applied to all subscribers before or simultaneously with the publisher, depending on whether the change is backward-compatible. Adding a nullable column with a default is backward-compatible and can be applied to the subscriber after the publisher; removing a column is not backward-compatible and must be applied to both simultaneously.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Replication slot fills disk on publisher | Subscriber disconnected for hours while high-write workload runs | Monitor slot lag; set max_slot_wal_keep_size to cap WAL retention |
| Apply worker stuck waiting for lock | Long-running query on subscriber table being replicated | Identify and terminate the blocking query on subscriber |
SKIP causes downstream data inconsistency | Skipped row was a critical update needed for referential integrity | Resync the table after skip; audit downstream data for orphaned rows |
| Schema divergence not caught until conflict | Publisher DDL run without notifying the subscriber | Add subscriber DDL to publisher migration scripts; use migration locking tools |
max_wal_senders exceeded | Too many replication connections — logical and physical combined | Increase max_wal_senders in postgresql.conf; requires restart |
What to Do Next
- Problem: Logical replication lag accumulates silently, WAL retention grows on the publisher, and by the time the disk alert fires, the subscriber is hours behind with no fast path to catch up.
- Solution: Add active monitoring on replication slot lag bytes with an alert threshold at 1 GB, set
max_slot_wal_keep_sizeas a disk safety cap, and treat anypg_subscription_reltable inestate as an incident requiring same-day resolution. - Proof: After resolving a conflict and re-enabling the subscription,
pg_size_pretty(pg_current_wal_lsn() - confirmed_flush_lsn)from the publisher should decrease steadily — the subscriber is catching up. - Action: Run Check 1 on the publisher this week. If any replication slot shows
lag_bytes > 1 GBoractive = false, treat it as an open incident. If lag is normal, add a monitoring alert so you know before it becomes critical.
Checklist
- Query
pg_replication_slotson publisher — checkactivestatus andlag_bytesfor each logical slot - Query
pg_subscriptionon subscriber — verifysubenabled = truefor each subscription - Query
pg_subscription_relon subscriber — checksrsubstatefor any tables ine(error) state - Query
pg_stat_subscriptionon subscriber — confirmpidis not NULL for each subscription - Review PostgreSQL log on subscriber for conflict type and LSN
- If table in error state with row conflict: use
ALTER SUBSCRIPTION sub SKIP (lsn)to unblock - If schema mismatch: apply matching DDL to subscriber, then re-enable subscription
- If apply worker stalled on lock: identify and resolve the blocking query on subscriber
- After resolution, monitor
lag_bytesdecreasing — confirm subscriber is catching up - Set
max_slot_wal_keep_sizeon publisher to cap disk usage from stalled slots - Add monitoring alert at lag > 1 GB per logical replication slot
- Document schema change protocol — every publisher DDL must have a matching subscriber DDL step