PostgreSQL Statistics Drift Workflow
A query that ran in 8 milliseconds last week and now takes 4 seconds has not changed — but the planner’s model of the data has. PostgreSQL’s query optimizer builds execution plans from table statistics: column value distributions, row counts, and correlation coefficients stored in pg_statistic. When those statistics drift from reality, the optimizer chooses wrong plans with confidence, and the resulting regressions are difficult to catch because no error is raised — just slower queries.
Situation
PostgreSQL uses a cost-based optimizer that estimates how many rows each plan step will process. Those estimates come from statistics gathered by ANALYZE. If statistics are stale — from a bulk load, a large delete, or simply not running ANALYZE for an extended period — the planner’s row estimates diverge from actual counts, and plan choices that were correct for the old data distribution become wrong for the current one.
The most common presentation: a query that joins two tables starts doing a nested loop instead of a hash join because the planner underestimates the inner table’s row count. Or an index scan gets chosen when the data has changed enough that a sequential scan would be faster. Or a partial index gets selected for a query where the filtered row count no longer makes that index selective.
Statistics drift is distinct from index bloat or table bloat. The physical storage might be fine. The problem is that the optimizer’s mental model of the data is wrong, and it is building plans optimized for a database that no longer exists.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
EXPLAIN ANALYZE estimated rows far from actual rows | Query plan output | Statistics are stale or the column distribution is unusual |
last_analyze or last_autoanalyze is days old | pg_stat_user_tables | Automatic statistics updates not running on this table |
| Query plan changed after a bulk load or large delete | Application performance logs | The new data volume or distribution triggered a different plan |
| Planner chooses sequential scan on a selective query | EXPLAIN ANALYZE output | Row count estimate too high; planner thinks index would cost more |
| Planner chooses nested loop for a large result set | EXPLAIN ANALYZE output | Row count estimate too low; planner underestimated join output |
n_distinct in pg_stats shows -1 for a column with few distinct values | pg_stats | Statistics estimate is extrapolated, not exact |
First Five Checks
- Confirm the estimate-vs-actual divergence — the EXPLAIN output is the primary diagnostic:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
AND o.created_at > now() - interval '7 days';
Look for rows where rows=N (actual rows=M) and N is off by more than a factor of 10. A nested loop chosen over a hash join when the actual row count exceeds 10,000 is a clear statistics failure. Note the exact node type (SeqScan, IndexScan, Hash, NestLoop) — this tells you which estimate was wrong.
- Inspect column statistics for the affected table —
pg_statsstores what the planner knows:
SELECT
attname,
n_distinct,
correlation,
null_frac,
avg_width,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'orders'
AND attname IN ('status', 'created_at', 'customer_id')
ORDER BY attname;
n_distinct > 0 means an absolute count; n_distinct < 0 means a fraction of the table. If n_distinct = -1, PostgreSQL is guessing that every row is unique — problematic for low-cardinality columns. Low correlation (near 0) on a column used in a range scan means physical row order does not match logical sort order, which raises index scan costs.
- Check when statistics were last collected — stale analyze timestamps are the first explanation:
SELECT
relname,
last_analyze,
last_autoanalyze,
n_live_tup,
n_dead_tup,
n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers')
ORDER BY last_analyze NULLS LAST;
n_mod_since_analyze is the counter that autovacuum uses to decide whether to run ANALYZE. If it is large relative to n_live_tup, statistics are definitely stale. A last_analyze of NULL means ANALYZE has never run on this table.
- Check for bulk data changes that were not followed by ANALYZE — look at table modification counts:
SELECT
relname,
n_mod_since_analyze,
n_live_tup,
round(n_mod_since_analyze::numeric / nullif(n_live_tup, 0) * 100, 2) AS mod_pct
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0
ORDER BY mod_pct DESC
LIMIT 10;
A mod_pct above 20% means more than 20% of the table has changed since the last statistics collection — the autovacuum analyze_scale_factor default is 0.2, so autovacuum should have triggered, but may not have if the table is very large or autovacuum was busy.
- Check raw statistics storage — to understand what the planner is actually seeing:
SELECT
staattnum,
stakind1,
stavalues1,
stanumbers1
FROM pg_statistic
WHERE starelid = 'orders'::regclass
LIMIT 5;
stakind 1 = most-common-values, 2 = histogram, 3 = correlation. If stavalues1 is sparse or missing, the planner has no useful distribution data for that column. This is the raw form of what pg_stats presents in human-readable form.
Decision Tree
flowchart TD
A[Slow query — plan regression suspected] --> B{EXPLAIN estimated rows match actual?}
B -->|yes — estimates correct| C[Statistics not the problem — check indexes or locks]
B -->|no — large divergence| D{last_analyze recent?}
D -->|no — stale or never| E[ANALYZE tablename — re-check plan]
D -->|yes — but still wrong| F{Column has unusual distribution?}
F -->|yes — skewed or correlated| G[ALTER COLUMN SET STATISTICS 500]
G --> H[ANALYZE tablename — re-check plan]
F -->|no| I{Multiple columns in WHERE clause?}
I -->|yes| J[CREATE STATISTICS for correlated columns]
J --> K[ANALYZE tablename — re-check plan]
I -->|no| L{n_distinct estimate wrong?}
L -->|yes| M[ALTER COLUMN SET n_distinct — explicit override]
L -->|no| N[Check for partial index mismatch or planner bugs]
Remediation Options
Option 1 — Run ANALYZE to refresh statistics
The simplest fix — and always the first step:
-- Analyze a specific table
ANALYZE VERBOSE orders;
-- Analyze multiple tables
ANALYZE VERBOSE orders, customers, order_items;
-- Analyze a specific column (faster on large tables)
ANALYZE orders (status, created_at, customer_id);
ANALYZE VERBOSE prints a summary of rows sampled, which is useful for confirming the statistics update ran successfully. After ANALYZE, re-run EXPLAIN (ANALYZE, BUFFERS) on the slow query to see if the estimates improved.
ANALYZE takes a SHARE UPDATE EXCLUSIVE lock — it blocks DDL but not reads or writes. It is safe to run on production tables at any time.
Option 2 — Increase statistics target for selective columns
The default default_statistics_target = 100 samples 300 * 100 = 30,000 rows for statistics. For columns with many distinct values or highly skewed distributions, this sample may not capture the tail. Increase the per-column target:
-- Increase statistics detail for a specific column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 500;
-- Then refresh statistics
ANALYZE orders;
A statistics target of 500 collects approximately 150,000 rows — 5x the default. The pg_stats documentation notes that n_distinct estimates and histogram bucket counts improve with higher targets, especially for columns where the value distribution has a long tail.
After increasing the target, verify in pg_stats that most_common_vals is more populated and that histogram buckets look representative:
SELECT attname, array_length(most_common_vals, 1) AS mcv_count,
array_length(histogram_bounds, 1) AS histogram_buckets
FROM pg_stats
WHERE tablename = 'orders';
Option 3 — Create extended statistics for correlated columns
When a WHERE clause filters on two columns that are correlated — e.g., status = 'shipped' AND region = 'EU' where shipped orders are disproportionately from EU — the planner multiplies the selectivity of each column independently and underestimates the result set. PostgreSQL 10 introduced extended statistics to model this:
-- Create statistics tracking correlation between two columns
CREATE STATISTICS orders_status_region (dependencies)
ON status, region
FROM orders;
-- Collect the extended statistics
ANALYZE orders;
-- Verify
SELECT stxname, stxkind, stxdefined
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;
Extended statistics with dependencies teaches the planner that the two columns are correlated. The ndistinct option captures combined distinct value counts; mcv captures the most common value combinations. After collecting, re-run EXPLAIN to see if the multi-column estimate improved.
Rollback Plan
ANALYZEis always safe to run and always safe to re-run. It does not modify data. The only rollback consideration is performance: on a very large table with a high statistics target,ANALYZEcan take minutes and create I/O pressure. Run during off-peak hours on tables over 100 GB.ALTER COLUMN SET STATISTICS Nis reversible:ALTER TABLE orders ALTER COLUMN status SET STATISTICS -1returns to the default. NoANALYZEre-run is needed to revert — the change takes effect on the nextANALYZE.CREATE STATISTICSis reversible:DROP STATISTICS orders_status_region. The planner reverts to independent column estimates immediately.ALTER TABLE ... SET (n_distinct = N)— an explicit override that bypasses sampling — is reversible:ALTER TABLE orders ALTER COLUMN col SET (n_distinct = -1)restores to estimated mode.
Automation Opportunity
Stale statistics are predictable: they happen after bulk loads and large deletes. A pattern worth automating is a post-ETL ANALYZE call baked into the data pipeline itself, rather than relying on autovacuum timing:
-- After any bulk insert, run ANALYZE immediately
INSERT INTO orders_archive SELECT * FROM orders WHERE status = 'completed' AND created_at < now() - interval '1 year';
DELETE FROM orders WHERE status = 'completed' AND created_at < now() - interval '1 year';
ANALYZE orders; -- do not skip this
For monitoring, a pg_cron query that alerts when n_mod_since_analyze exceeds a threshold gives advance notice before the planner starts making wrong decisions:
SELECT cron.schedule('stats-staleness-check', '30 * * * *', $$
INSERT INTO ops.stats_alerts (tablename, mod_pct, captured_at)
SELECT
relname,
round(n_mod_since_analyze::numeric / nullif(n_live_tup, 0) * 100, 2),
now()
FROM pg_stat_user_tables
WHERE n_live_tup > 100000
AND n_mod_since_analyze::numeric / nullif(n_live_tup, 0) > 0.15;
$$);
In Practice
The PostgreSQL statistics documentation describes the statistics target as controlling both the number of histogram buckets and the most-common-values list length. The documented relationship is: statistics_target × 300 = rows sampled. For a column where 0.01% of rows have a specific value that is frequently queried, the default 30,000-row sample will often miss that value entirely, producing a histogram-based estimate that is substantially wrong.
The documented behavior of CREATE STATISTICS with dependencies is that it computes functional dependency statistics between columns. Where the selectivity of col_a = 'x' is 0.01 and col_b = 'y' is 0.05, the planner without extended statistics estimates the joint selectivity as 0.01 × 0.05 = 0.0005. With a dependencies statistic showing that col_a = 'x' implies col_b = 'y' with 95% probability, the planner correctly estimates closer to 0.01.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
ANALYZE runs but estimates still wrong | Column has extreme skew — 99% of rows share one value | Increase statistics_target to 1000; use CREATE STATISTICS mcv |
| Extended statistics do not help | Correlation is partial, not functional dependency | Try ndistinct variant of CREATE STATISTICS |
ANALYZE is too slow on large table | Table has 1B+ rows and wide schema | Analyze specific columns only: ANALYZE table (col1, col2) |
| Autovacuum is running ANALYZE but estimates still drift | analyze_scale_factor threshold crossed only after large drift | Lower autovacuum_analyze_scale_factor per-table to 0.01 |
| Plan regression returns after ANALYZE | Statistics are correct but planner constant factors are wrong | Consider pg_hint_plan as a temporary override while investigating |
What to Do Next
- Problem: Stale or low-resolution statistics cause the planner to choose wrong join types and scan methods, producing query regressions that look like load spikes but are actually optimizer failures.
- Solution: Run
ANALYZEafter bulk loads, raisestatistics targetto 500 for join and filter columns on large tables, and create extended statistics for correlated column pairs. - Proof: After
ANALYZE,EXPLAIN (ANALYZE)estimated rows should be within a factor of 2 of actual rows for the primary scan nodes. - Action: Run the
n_mod_since_analyzequery from Check 4 this week. Any table wheremod_pct > 20%needs anANALYZErun today.
Checklist
- Run
EXPLAIN (ANALYZE, BUFFERS)on the slow query — compare estimated vs actual rows at each node - Query
pg_statsfor the filtered columns — checkn_distinct,correlation, andmost_common_vals - Query
pg_stat_user_tablesforlast_analyze,last_autoanalyze, andn_mod_since_analyze - If
last_analyzeis stale or NULL: runANALYZE tablenameimmediately - Re-run
EXPLAIN (ANALYZE)afterANALYZEto verify estimates improved - If estimates still wrong: check for correlated columns in the
WHEREclause - Raise
statistics_targetto 500 for high-cardinality or skewed columns - Create extended statistics with
CREATE STATISTICS (dependencies)for correlated column pairs - Run
ANALYZEagain after any statistics configuration change - Lower
autovacuum_analyze_scale_factorto 0.01 per-table for high-write tables - Add
ANALYZEcalls to ETL pipelines immediately after bulk loads or large deletes - Add a monitoring query on
n_mod_since_analyze— alert whenmod_pct > 15%on production tables