PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans
If you treat PostgreSQL like a black box that only consumes CPU and Memory, you will eventually be crushed by the invisible weight of its MVCC architecture.
Situation
PostgreSQL’s Multi-Version Concurrency Control (MVCC) is powerful, but it requires continuous internal maintenance. Every UPDATE creates a new row version, and every DELETE marks an old row as a “dead tuple.” The autovacuum daemon must eventually clean up these dead tuples to prevent table bloat and transaction ID wraparound.
When teams migrate to PostgreSQL from other database engines, they often bring their generic monitoring dashboards with them. They alert on CPU spikes or memory exhaustion. But in PostgreSQL, the most dangerous failures are silent. An aggressive transaction holds a lock for too long, replication falls silently behind, or autovacuum is misconfigured and gives up on heavily updated tables. By the time these issues manifest as CPU spikes, the database is already deeply unhealthy.
Symptoms
A failing PostgreSQL instance leaves distinct operational footprints before it fully collapses:
- The Bloat Spiral: Queries that used to return in milliseconds now take seconds. The table size on disk has doubled, but the actual row count hasn’t changed.
- The Stale Stats Fallacy: The query planner suddenly switches from a fast Index Scan to a catastrophic Sequential Scan because the table statistics are out of date.
- The Lock Cascade: Application monitoring shows massive latency spikes across unrelated endpoints because a long-running reporting query is holding an
AccessShareLockthat blocks anAccessExclusiveLockrequested by a schema migration, which in turn blocks all subsequentSELECTqueries. - Replication Desync: The primary database is healthy, but read-heavy applications serving from replicas are displaying data that is five minutes old.
First Five Checks
When a PostgreSQL incident begins, these are the queries and metrics you must check first:
-
Check for Blocking Sessions (
pg_locks):SELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_query FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.granted AND blocking_locks.granted; -
Check Dead Tuples and Autovacuum Status (
pg_stat_user_tables): Look atn_dead_tupvsn_live_tup. Checklast_autovacuumto see if the daemon is actually completing its work. -
Check Replication Lag (
pg_stat_replication): Comparepg_current_wal_lsn()with thereplay_lsnof the standby to calculate the byte lag. -
Identify Long-Running Transactions (
pg_stat_activity): Transactions sitting inidle in transactionfor hours are holding locks and preventing dead tuples from being vacuumed. -
Examine Query Plan Regressions (
pg_stat_statements): If a specific query is suddenly slow, useEXPLAIN (ANALYZE, BUFFERS)to see if it is executing a sequential scan due to stale statistics.
Decision Tree
When diagnosing sudden latency in PostgreSQL, the triage path branches quickly based on locks vs. load.
flowchart TD
A[Latency Spike Detected] --> B{Are there blocking sessions?}
B -->|Yes| C[Identify Blocking PID]
C --> C1{Is the blocker idle in transaction?}
C1 -->|Yes| C2[Terminate Blocker]
C1 -->|No| C3[Evaluate Impact: Terminate or Wait]
B -->|No| D{Are queries using Sequential Scans?}
D -->|Yes| D1[Check n_dead_tup]
D1 -->|High| D2[Run VACUUM ANALYZE manually]
D1 -->|Low| D3[Update pg_statistic via ANALYZE]
D -->|No| E[Check Connection Pool]
E --> E1[If saturated, increase pool size or shed load]
Remediation Options
-
Kill the Blocking Session (Fast, Disruptive): Using
pg_terminate_backend(pid)will immediately release locks.- Tradeoff: The terminated application transaction will fail and must be retried.
-
Manual
VACUUM ANALYZE(Medium Speed, High I/O): If a table has massive bloat and stale stats, forcing a manual vacuum updates the planner.- Tradeoff: This generates significant disk I/O and can degrade performance further while it runs.
-
Tuning
autovacuum_vacuum_scale_factor(Slow, Permanent Fix): If large tables are never being vacuumed, lower the scale factor for those specific tables usingALTER TABLE ... SET (autovacuum_vacuum_scale_factor = 0.01).- Tradeoff: Requires understanding the write velocity of the specific table to tune correctly.
Rollback Plan
If you execute a manual VACUUM FULL attempting to reclaim disk space, remember that it takes an AccessExclusiveLock on the entire table. If this blocks production traffic unexpectedly, the rollback plan is to immediately cancel the VACUUM FULL command. PostgreSQL will safely release the lock and revert to the previous state, though no space will have been reclaimed.
Automation Opportunity
Deploy an agent or cron job that explicitly alerts on “Transactions older than 1 hour” and “Idle in transaction older than 15 minutes.” These are almost always application bugs (leaked connections) and they are the primary cause of autovacuum failing to clean up dead tuples.
Leadership Summary
- Vacuum is a Feature, Not a Chore: Do not disable or restrict autovacuum. If it is consuming too much I/O, tune it to run more frequently but less aggressively.
- Alert on the Right Metrics: Stop alerting purely on CPU. Alert on replication lag, connection saturation, and long-running locks.
- Monitor Query Plans: Use
pg_stat_statementsto track the average execution time of your top queries to catch regressions before they cause outages.
What to Do Next
- Problem: PostgreSQL’s most dangerous failures — bloat spirals, lock cascades, replication desync — are invisible on CPU and memory dashboards until the database is already deeply unhealthy. By the time CPU spikes from bloat, the table has been unvacuumed long enough to cause query plan regressions.
- Solution: Add lock chain detection, dead tuple ratio, replication byte lag, and long transaction age as continuously scraped metrics alongside host metrics — these are the leading indicators CPU can never provide.
- Proof: Introduce a sleeping
idle in transactionconnection in staging and verify it appears on the “Transactions older than 15 minutes” alert before it blocks a schema migration — if the alert doesn’t fire, the monitoring gap is real. - Action: Add
lock_timeout = '5s'to all schema migration scripts this sprint, and create a Grafana panel trackingn_dead_tup / (n_live_tup + n_dead_tup)per table to catch bloat before it affects query plans.