Autovacuum is not background housekeeping; in a write-heavy PostgreSQL system, delayed vacuum is a backpressure signal from Multi-Version Concurrency Control before the application admits it is overloaded.

Situation

PostgreSQL’s default approach is to let autovacuum clean dead row versions in the background while application traffic continues. The alternative is to treat vacuum health as part of the write path: measured, alerted, tuned per table, and included in incident triage.

ApproachWhat it assumesWhat production eventually proves
Default autovacuumTable churn is moderate and cleanup can trail safelyHigh-update tables create cleanup debt faster than defaults can retire it
Manual emergency vacuumOperators can intervene after latency spikesThe database is already paying interest on bloat by then
Vacuum as backpressure telemetryDead tuples, transaction age, locks, and vacuum progress are monitored togetherThe incident is visible before p95 latency becomes the alert

The Problem

Autovacuum is often blamed because it is visible during the outage. That is usually too shallow. In PostgreSQL, UPDATE and DELETE create dead row versions under Multi-Version Concurrency Control; VACUUM can only remove versions no active snapshot can still see. A single old transaction can hold back the cleanup horizon through backend_xmin, which PostgreSQL exposes in pg_stat_activity.

Failure pointWhat breaksWhy it matters
Long transaction ageVacuum cannot remove dead tuples still visible to an old snapshotBloat grows even while autovacuum appears active
Idle transaction sessionsstate = 'idle in transaction' keeps a snapshot open without doing useful workOne abandoned app connection can pin cleanup behind thousands of writes
High-churn tables on defaultsautovacuum_vacuum_scale_factor = 0.2 waits for 20 percent table churn plus thresholdOn a 200M-row table, that can mean tens of millions of dead tuples before cleanup starts
Lock conflictsPlain VACUUM uses ShareUpdateExclusiveLock; VACUUM FULL takes AccessExclusiveLockConfusing the two during an incident can turn a slowdown into an outage
Dead tuple percent aloneSmall tables, append-heavy tables, and partitioned tables distort the signalAlerts need relation size, last vacuum age, transaction age, and latency together

PostgreSQL’s own documentation is explicit about the mechanics: routine vacuuming removes dead row versions and prevents transaction ID wraparound, while old open transactions can block cleanup progress. The operational question is not “is autovacuum running?” The question is: which workload condition is forcing it to fall behind?

Treat Autovacuum as Backpressure Telemetry

The right architecture is a vacuum control loop: observe the cleanup horizon, identify blockers, tune the few hot tables, and validate under write load. Do not start by changing global autovacuum settings across the cluster. That is how a maintenance problem becomes an I/O scheduling problem.

flowchart TD
    App[application writes] --> MVCC[MVCC row versions]
    MVCC --> Dead[dead tuples accumulate]
    Txn[old transaction xmin] --> Horizon[cleanup horizon held back]
    Dead --> Auto[autovacuum worker]
    Horizon --> Auto
    Auto --> Locks[ShareUpdateExclusiveLock]
    DDL[DDL or index maintenance] --> Locks
    Locks --> Lag[vacuum lag]
    Lag --> Bloat[table and index bloat]
    Bloat --> Planner[slower plans and more IO]
    Planner --> App
    Lag --> Alert[backpressure alert]
  1. Build a vacuum incident view.

    Include active vacuum progress, oldest transaction age, idle-in-transaction sessions, dead tuple counts, table size, and blockers. pg_stat_progress_vacuum has existed since PostgreSQL 9.6 and reports active vacuum workers, including autovacuum workers.

    Verification: during a load test, you can name the table being vacuumed, its phase, heap blocks scanned, and any blocking backend in under one minute.

  2. Alert on cleanup debt, not just dead tuple percentage.

    A 40 percent dead tuple ratio on a 5 MB table is noise. Five percent on a 900 GB high-update table may be a serious future incident. Use a composite signal: n_dead_tup, pg_total_relation_size, last_autovacuum, oldest backend_xmin, and query latency for the table’s top statements.

    Verification: every alert points to one table, one suspected blocker class, and one next action.

  3. Tune high-churn tables per table.

    Lower scale factors on tables such as orders, sessions, and job queues. A setting like autovacuum_vacuum_scale_factor = 0.01 with a fixed threshold can make cleanup continuous instead of bursty. Keep cost delay and cost limit workload-aware; aggressive cleanup still competes for disk and cache.

    Verification: after tuning, n_dead_tup forms a sawtooth with a lower ceiling under production-like write load.

  4. Fix transaction hygiene before killing vacuum.

    Terminating autovacuum can reduce immediate pressure when it is competing with foreground work, but repeated termination increases bloat debt. The durable fix is shorter transactions, timeouts for idle sessions, safer migration locks, and partition or index maintenance where needed.

    Verification: oldest transaction age remains bounded during peak traffic, not only during maintenance windows.

A useful runbook query starts here:

SELECT
  pid,
  usename,
  application_name,
  state,
  wait_event_type,
  wait_event,
  age(clock_timestamp(), xact_start) AS xact_age,
  age(clock_timestamp(), query_start) AS query_age,
  backend_xmin,
  left(query, 160) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start NULLS LAST;

In Practice

The most useful public case study is not an anonymous war story; it is the AWS Database Blog write-up on tuning autovacuum for Amazon RDS for PostgreSQL 9.6.3 after an Oracle-to-PostgreSQL OLTP migration. The database was provisioned for 30,000 IOPS. During the first weeks after migration, several databases saw Read IOPS spike as high as 25,000 without a matching increase in application load. The visible symptom was not one slow query. It was cleanup work arriving late, in large chunks, on already-bloated tables.

The concrete numbers are the part worth carrying into a runbook:

Published observationValueOperational reading
table1 live tuples450,398,643Large enough that percentage-based thresholds delay cleanup
table1 dead tuples459,406,616More dead tuples than estimated live tuples
table2 dead tuples1,919,230,596Vacuum debt was not isolated to one table
table3 dead tuples4,642,232,802Cluster-level worker saturation becomes plausible
Longest autovacuum session2 days 16:03 on sh.table1Vacuum was active but not converging fast enough
Blocking session stateidle in transaction for 2 days 22:25 on table1The cleanup horizon was pinned by transaction hygiene
RDS setting called outautovacuum_vacuum_scale_factor = 0.1, autovacuum_max_workers = 3Millions of dead tuples accumulated before work started
Tuning result reportedautovacuum_max_workers = 8, autovacuum_vacuum_cost_limit = 4800Read IOPS during concurrent autovacuum was brought to about 10,000, one-third of provisioned capacity

That case is useful because it separates three failure modes operators often collapse into one. First, the trigger threshold was too high for tables with hundreds of millions of rows. Second, the default worker count meant a few large tables could occupy all autovacuum workers while other tables continued to accumulate dead tuples. Third, an idle in transaction session kept old tuple versions visible, so autovacuum could run and still fail to reclaim enough space.

The lock behavior is documented, not folklore. PostgreSQL’s explicit locking documentation states that plain VACUUM acquires ShareUpdateExclusiveLock, while VACUUM FULL requires AccessExclusiveLock. That distinction matters at 03:00. Plain vacuum is designed to coexist with normal reads and writes; VACUUM FULL rewrites the table and blocks concurrent access. Reaching for it during a live checkout incident is usually the database equivalent of fixing a smoke alarm with a hammer.

A separate public PGConf/OtterTune autovacuum case connects the same mechanics to request latency. The case describes an update-heavy workload where long-running queries blocked autovacuum, dead tuples accumulated by 600x, blocks read increased by 375x, non-HOT updates reached 100 percent, update latency increased from 12 ms to 710 ms, throughput dropped by 25 percent during the spike, and query latency spiked by 90x. The exact schema is less important than the shape of the failure: stale tuple versions made ordinary updates read and write far more than the application expected.

The practical pattern is visible in named system behavior:

System behaviorOperational implicationSource
Dead row versions remain until no active transaction can see themWatch backend_xmin, not only table sizePostgreSQL routine vacuuming
Autovacuum triggers from threshold plus scale factorLarge tables need per-table thresholdsAutovacuum settings
Plain vacuum and DDL can conflict through table locksIncident views need pg_locks, not only connection countsPostgreSQL explicit locking
Vacuum progress is visible while runningTreat active vacuum as observable work, not mystery loadPostgreSQL progress reporting
Large-table defaults can produce delayed, bursty cleanupTune hot tables before making broad cluster changesAWS RDS autovacuum case study
Long-running queries can turn vacuum lag into latency spikesTrack transaction age beside table bloat and top statement latencyPGConf autovacuum case study

The more interesting production lesson is that vacuum lag is a system signal, not a storage metric. It often points at application behavior: oversized transactions, forgotten cursors, migration scripts without lock timeouts, reporting queries running at REPEATABLE READ, or connection pools that keep sessions open after the request has ended.

Where It Breaks

Failure modeTriggerFix
Autovacuum workers saturatedSeveral large tables cross vacuum thresholds at the same timeTune hot tables individually and review autovacuum_max_workers with disk capacity
Cleanup horizon pinnedOld backend_xmin, prepared transaction, or replication slot prevents tuple removalAlert on transaction age, prepared transactions, and replication slot lag
Foreground latency worsens after tuningLower scale factors create more frequent vacuum I/O under peak writesAdjust cost limit, cost delay, and schedule manual maintenance for cold periods
VACUUM FULL blocks trafficOperator uses it to reclaim disk on a live tablePrefer regular vacuum, REINDEX CONCURRENTLY, partition rotation, or planned maintenance
Bloat estimate misleadsStatistics are stale or relation layout makes estimates noisyPair estimates with pg_stat_user_tables, relation size trends, and query plans
Partitioned table hides hot childParent looks healthy while one partition churns heavilyMonitor child partitions and tune storage parameters per partition

What to Do Next

  • Problem: PostgreSQL vacuum lag becomes dangerous when dead tuples, old snapshots, and lock waits are observed as separate symptoms.
  • Solution: Build a single incident view that joins transaction age, blocked vacuum, table churn, relation size, and active vacuum progress.
  • Proof: A valid signal names the blocker class before p95 query latency crosses the page threshold, and it explains whether the issue is threshold delay, worker saturation, pinned cleanup horizon, or lock conflict.
  • Action: This week, pick the top three write-heavy tables and set table-specific vacuum alerts before changing global autovacuum settings.

Autovacuum is the database telling you how much write-path debt your architecture is carrying; the mature response is to measure the debt before the bill arrives.