PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans

If you treat PostgreSQL like a black box that only consumes CPU and Memory, you will eventually be crushed by the invisible weight of its MVCC architecture.

Situation

PostgreSQL’s Multi-Version Concurrency Control (MVCC) is powerful, but it requires continuous internal maintenance. Every UPDATE creates a new row version, and every DELETE marks an old row as a “dead tuple.” The autovacuum daemon must eventually clean up these dead tuples to prevent table bloat and transaction ID wraparound.

When teams migrate to PostgreSQL from other database engines, they often bring their generic monitoring dashboards with them. They alert on CPU spikes or memory exhaustion. But in PostgreSQL, the most dangerous failures are silent. An aggressive transaction holds a lock for too long, replication falls silently behind, or autovacuum is misconfigured and gives up on heavily updated tables. By the time these issues manifest as CPU spikes, the database is already deeply unhealthy.

Symptoms

A failing PostgreSQL instance leaves distinct operational footprints before it fully collapses:

The Bloat Spiral: Queries that used to return in milliseconds now take seconds. The table size on disk has doubled, but the actual row count hasn’t changed.
The Stale Stats Fallacy: The query planner suddenly switches from a fast Index Scan to a catastrophic Sequential Scan because the table statistics are out of date.
The Lock Cascade: Application monitoring shows massive latency spikes across unrelated endpoints because a long-running reporting query is holding an AccessShareLock that blocks an AccessExclusiveLock requested by a schema migration, which in turn blocks all subsequent SELECT queries.
Replication Desync: The primary database is healthy, but read-heavy applications serving from replicas are displaying data that is five minutes old.

First Five Checks

When a PostgreSQL incident begins, these are the queries and metrics you must check first:

Check for Blocking Sessions (pg_locks):

SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_query,
       blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted AND blocking_locks.granted;

Check Dead Tuples and Autovacuum Status (pg_stat_user_tables): Look at n_dead_tup vs n_live_tup. Check last_autovacuum to see if the daemon is actually completing its work.
Check Replication Lag (pg_stat_replication): Compare pg_current_wal_lsn() with the replay_lsn of the standby to calculate the byte lag.
Identify Long-Running Transactions (pg_stat_activity): Transactions sitting in idle in transaction for hours are holding locks and preventing dead tuples from being vacuumed.
Examine Query Plan Regressions (pg_stat_statements): If a specific query is suddenly slow, use EXPLAIN (ANALYZE, BUFFERS) to see if it is executing a sequential scan due to stale statistics.

Decision Tree

When diagnosing sudden latency in PostgreSQL, the triage path branches quickly based on locks vs. load.

flowchart TD
    A[Latency Spike Detected] --> B{Are there blocking sessions?}
    B -->|Yes| C[Identify Blocking PID]
    C --> C1{Is the blocker idle in transaction?}
    C1 -->|Yes| C2[Terminate Blocker]
    C1 -->|No| C3[Evaluate Impact: Terminate or Wait]
    
    B -->|No| D{Are queries using Sequential Scans?}
    D -->|Yes| D1[Check n_dead_tup]
    D1 -->|High| D2[Run VACUUM ANALYZE manually]
    D1 -->|Low| D3[Update pg_statistic via ANALYZE]
    
    D -->|No| E[Check Connection Pool]
    E --> E1[If saturated, increase pool size or shed load]

Remediation Options

Kill the Blocking Session (Fast, Disruptive): Using pg_terminate_backend(pid) will immediately release locks.
- Tradeoff: The terminated application transaction will fail and must be retried.
Manual VACUUM ANALYZE (Medium Speed, High I/O): If a table has massive bloat and stale stats, forcing a manual vacuum updates the planner.
- Tradeoff: This generates significant disk I/O and can degrade performance further while it runs.
Tuning autovacuum_vacuum_scale_factor (Slow, Permanent Fix): If large tables are never being vacuumed, lower the scale factor for those specific tables using ALTER TABLE ... SET (autovacuum_vacuum_scale_factor = 0.01).
- Tradeoff: Requires understanding the write velocity of the specific table to tune correctly.

Rollback Plan

If you execute a manual VACUUM FULL attempting to reclaim disk space, remember that it takes an AccessExclusiveLock on the entire table. If this blocks production traffic unexpectedly, the rollback plan is to immediately cancel the VACUUM FULL command. PostgreSQL will safely release the lock and revert to the previous state, though no space will have been reclaimed.

Automation Opportunity

Deploy an agent or cron job that explicitly alerts on “Transactions older than 1 hour” and “Idle in transaction older than 15 minutes.” These are almost always application bugs (leaked connections) and they are the primary cause of autovacuum failing to clean up dead tuples.

Leadership Summary

Vacuum is a Feature, Not a Chore: Do not disable or restrict autovacuum. If it is consuming too much I/O, tune it to run more frequently but less aggressively.
Alert on the Right Metrics: Stop alerting purely on CPU. Alert on replication lag, connection saturation, and long-running locks.
Monitor Query Plans: Use pg_stat_statements to track the average execution time of your top queries to catch regressions before they cause outages.

What to Do Next

Problem: PostgreSQL’s most dangerous failures — bloat spirals, lock cascades, replication desync — are invisible on CPU and memory dashboards until the database is already deeply unhealthy. By the time CPU spikes from bloat, the table has been unvacuumed long enough to cause query plan regressions.
Solution: Add lock chain detection, dead tuple ratio, replication byte lag, and long transaction age as continuously scraped metrics alongside host metrics — these are the leading indicators CPU can never provide.
Proof: Introduce a sleeping idle in transaction connection in staging and verify it appears on the “Transactions older than 15 minutes” alert before it blocks a schema migration — if the alert doesn’t fire, the monitoring gap is real.
Action: Add lock_timeout = '5s' to all schema migration scripts this sprint, and create a Grafana panel tracking n_dead_tup / (n_live_tup + n_dead_tup) per table to catch bloat before it affects query plans.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

Leadership Summary

What to Do Next

Rajiv

Related Posts

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

Cassandra Observability: Compaction, Tombstones, Repair, Latency, and Hot Partitions

The Database Observability Baseline: What Every DBA Dashboard Must Show