When Autovacuum Becomes a Backpressure Signal
Autovacuum is not background housekeeping; in a write-heavy PostgreSQL system, delayed vacuum is a backpressure signal from Multi-Version Concurrency Control before the application admits it is overloaded.
Situation
PostgreSQL’s default approach is to let autovacuum clean dead row versions in the background while application traffic continues. The alternative is to treat vacuum health as part of the write path: measured, alerted, tuned per table, and included in incident triage.
| Approach | What it assumes | What production eventually proves |
|---|---|---|
| Default autovacuum | Table churn is moderate and cleanup can trail safely | High-update tables create cleanup debt faster than defaults can retire it |
| Manual emergency vacuum | Operators can intervene after latency spikes | The database is already paying interest on bloat by then |
| Vacuum as backpressure telemetry | Dead tuples, transaction age, locks, and vacuum progress are monitored together | The incident is visible before p95 latency becomes the alert |
The Problem
Autovacuum is often blamed because it is visible during the outage. That is usually too shallow. In PostgreSQL, UPDATE and DELETE create dead row versions under Multi-Version Concurrency Control; VACUUM can only remove versions no active snapshot can still see. A single old transaction can hold back the cleanup horizon through backend_xmin, which PostgreSQL exposes in pg_stat_activity.
| Failure point | What breaks | Why it matters |
|---|---|---|
| Long transaction age | Vacuum cannot remove dead tuples still visible to an old snapshot | Bloat grows even while autovacuum appears active |
| Idle transaction sessions | state = 'idle in transaction' keeps a snapshot open without doing useful work | One abandoned app connection can pin cleanup behind thousands of writes |
| High-churn tables on defaults | autovacuum_vacuum_scale_factor = 0.2 waits for 20 percent table churn plus threshold | On a 200M-row table, that can mean tens of millions of dead tuples before cleanup starts |
| Lock conflicts | Plain VACUUM uses ShareUpdateExclusiveLock; VACUUM FULL takes AccessExclusiveLock | Confusing the two during an incident can turn a slowdown into an outage |
| Dead tuple percent alone | Small tables, append-heavy tables, and partitioned tables distort the signal | Alerts need relation size, last vacuum age, transaction age, and latency together |
PostgreSQL’s own documentation is explicit about the mechanics: routine vacuuming removes dead row versions and prevents transaction ID wraparound, while old open transactions can block cleanup progress. The operational question is not “is autovacuum running?” The question is: which workload condition is forcing it to fall behind?
Treat Autovacuum as Backpressure Telemetry
The right architecture is a vacuum control loop: observe the cleanup horizon, identify blockers, tune the few hot tables, and validate under write load. Do not start by changing global autovacuum settings across the cluster. That is how a maintenance problem becomes an I/O scheduling problem.
flowchart TD
App[application writes] --> MVCC[MVCC row versions]
MVCC --> Dead[dead tuples accumulate]
Txn[old transaction xmin] --> Horizon[cleanup horizon held back]
Dead --> Auto[autovacuum worker]
Horizon --> Auto
Auto --> Locks[ShareUpdateExclusiveLock]
DDL[DDL or index maintenance] --> Locks
Locks --> Lag[vacuum lag]
Lag --> Bloat[table and index bloat]
Bloat --> Planner[slower plans and more IO]
Planner --> App
Lag --> Alert[backpressure alert]
-
Build a vacuum incident view.
Include active vacuum progress, oldest transaction age, idle-in-transaction sessions, dead tuple counts, table size, and blockers.
pg_stat_progress_vacuumhas existed since PostgreSQL 9.6 and reports active vacuum workers, including autovacuum workers.Verification: during a load test, you can name the table being vacuumed, its phase, heap blocks scanned, and any blocking backend in under one minute.
-
Alert on cleanup debt, not just dead tuple percentage.
A 40 percent dead tuple ratio on a 5 MB table is noise. Five percent on a 900 GB high-update table may be a serious future incident. Use a composite signal:
n_dead_tup,pg_total_relation_size,last_autovacuum, oldestbackend_xmin, and query latency for the table’s top statements.Verification: every alert points to one table, one suspected blocker class, and one next action.
-
Tune high-churn tables per table.
Lower scale factors on tables such as
orders,sessions, and job queues. A setting likeautovacuum_vacuum_scale_factor = 0.01with a fixed threshold can make cleanup continuous instead of bursty. Keep cost delay and cost limit workload-aware; aggressive cleanup still competes for disk and cache.Verification: after tuning,
n_dead_tupforms a sawtooth with a lower ceiling under production-like write load. -
Fix transaction hygiene before killing vacuum.
Terminating autovacuum can reduce immediate pressure when it is competing with foreground work, but repeated termination increases bloat debt. The durable fix is shorter transactions, timeouts for idle sessions, safer migration locks, and partition or index maintenance where needed.
Verification: oldest transaction age remains bounded during peak traffic, not only during maintenance windows.
A useful runbook query starts here:
SELECT
pid,
usename,
application_name,
state,
wait_event_type,
wait_event,
age(clock_timestamp(), xact_start) AS xact_age,
age(clock_timestamp(), query_start) AS query_age,
backend_xmin,
left(query, 160) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start NULLS LAST;
In Practice
The most useful public case study is not an anonymous war story; it is the AWS Database Blog write-up on tuning autovacuum for Amazon RDS for PostgreSQL 9.6.3 after an Oracle-to-PostgreSQL OLTP migration. The database was provisioned for 30,000 IOPS. During the first weeks after migration, several databases saw Read IOPS spike as high as 25,000 without a matching increase in application load. The visible symptom was not one slow query. It was cleanup work arriving late, in large chunks, on already-bloated tables.
The concrete numbers are the part worth carrying into a runbook:
| Published observation | Value | Operational reading |
|---|---|---|
table1 live tuples | 450,398,643 | Large enough that percentage-based thresholds delay cleanup |
table1 dead tuples | 459,406,616 | More dead tuples than estimated live tuples |
table2 dead tuples | 1,919,230,596 | Vacuum debt was not isolated to one table |
table3 dead tuples | 4,642,232,802 | Cluster-level worker saturation becomes plausible |
| Longest autovacuum session | 2 days 16:03 on sh.table1 | Vacuum was active but not converging fast enough |
| Blocking session state | idle in transaction for 2 days 22:25 on table1 | The cleanup horizon was pinned by transaction hygiene |
| RDS setting called out | autovacuum_vacuum_scale_factor = 0.1, autovacuum_max_workers = 3 | Millions of dead tuples accumulated before work started |
| Tuning result reported | autovacuum_max_workers = 8, autovacuum_vacuum_cost_limit = 4800 | Read IOPS during concurrent autovacuum was brought to about 10,000, one-third of provisioned capacity |
That case is useful because it separates three failure modes operators often collapse into one. First, the trigger threshold was too high for tables with hundreds of millions of rows. Second, the default worker count meant a few large tables could occupy all autovacuum workers while other tables continued to accumulate dead tuples. Third, an idle in transaction session kept old tuple versions visible, so autovacuum could run and still fail to reclaim enough space.
The lock behavior is documented, not folklore. PostgreSQL’s explicit locking documentation states that plain VACUUM acquires ShareUpdateExclusiveLock, while VACUUM FULL requires AccessExclusiveLock. That distinction matters at 03:00. Plain vacuum is designed to coexist with normal reads and writes; VACUUM FULL rewrites the table and blocks concurrent access. Reaching for it during a live checkout incident is usually the database equivalent of fixing a smoke alarm with a hammer.
A separate public PGConf/OtterTune autovacuum case connects the same mechanics to request latency. The case describes an update-heavy workload where long-running queries blocked autovacuum, dead tuples accumulated by 600x, blocks read increased by 375x, non-HOT updates reached 100 percent, update latency increased from 12 ms to 710 ms, throughput dropped by 25 percent during the spike, and query latency spiked by 90x. The exact schema is less important than the shape of the failure: stale tuple versions made ordinary updates read and write far more than the application expected.
The practical pattern is visible in named system behavior:
| System behavior | Operational implication | Source |
|---|---|---|
| Dead row versions remain until no active transaction can see them | Watch backend_xmin, not only table size | PostgreSQL routine vacuuming |
| Autovacuum triggers from threshold plus scale factor | Large tables need per-table thresholds | Autovacuum settings |
| Plain vacuum and DDL can conflict through table locks | Incident views need pg_locks, not only connection counts | PostgreSQL explicit locking |
| Vacuum progress is visible while running | Treat active vacuum as observable work, not mystery load | PostgreSQL progress reporting |
| Large-table defaults can produce delayed, bursty cleanup | Tune hot tables before making broad cluster changes | AWS RDS autovacuum case study |
| Long-running queries can turn vacuum lag into latency spikes | Track transaction age beside table bloat and top statement latency | PGConf autovacuum case study |
The more interesting production lesson is that vacuum lag is a system signal, not a storage metric. It often points at application behavior: oversized transactions, forgotten cursors, migration scripts without lock timeouts, reporting queries running at REPEATABLE READ, or connection pools that keep sessions open after the request has ended.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Autovacuum workers saturated | Several large tables cross vacuum thresholds at the same time | Tune hot tables individually and review autovacuum_max_workers with disk capacity |
| Cleanup horizon pinned | Old backend_xmin, prepared transaction, or replication slot prevents tuple removal | Alert on transaction age, prepared transactions, and replication slot lag |
| Foreground latency worsens after tuning | Lower scale factors create more frequent vacuum I/O under peak writes | Adjust cost limit, cost delay, and schedule manual maintenance for cold periods |
VACUUM FULL blocks traffic | Operator uses it to reclaim disk on a live table | Prefer regular vacuum, REINDEX CONCURRENTLY, partition rotation, or planned maintenance |
| Bloat estimate misleads | Statistics are stale or relation layout makes estimates noisy | Pair estimates with pg_stat_user_tables, relation size trends, and query plans |
| Partitioned table hides hot child | Parent looks healthy while one partition churns heavily | Monitor child partitions and tune storage parameters per partition |
What to Do Next
- Problem: PostgreSQL vacuum lag becomes dangerous when dead tuples, old snapshots, and lock waits are observed as separate symptoms.
- Solution: Build a single incident view that joins transaction age, blocked vacuum, table churn, relation size, and active vacuum progress.
- Proof: A valid signal names the blocker class before p95 query latency crosses the page threshold, and it explains whether the issue is threshold delay, worker saturation, pinned cleanup horizon, or lock conflict.
- Action: This week, pick the top three write-heavy tables and set table-specific vacuum alerts before changing global autovacuum settings.
Autovacuum is the database telling you how much write-path debt your architecture is carrying; the mature response is to measure the debt before the bill arrives.