PostgreSQL Statistics Drift Workflow

A query that ran in 8 milliseconds last week and now takes 4 seconds has not changed — but the planner’s model of the data has. PostgreSQL’s query optimizer builds execution plans from table statistics: column value distributions, row counts, and correlation coefficients stored in pg_statistic. When those statistics drift from reality, the optimizer chooses wrong plans with confidence, and the resulting regressions are difficult to catch because no error is raised — just slower queries.

Situation

PostgreSQL uses a cost-based optimizer that estimates how many rows each plan step will process. Those estimates come from statistics gathered by ANALYZE. If statistics are stale — from a bulk load, a large delete, or simply not running ANALYZE for an extended period — the planner’s row estimates diverge from actual counts, and plan choices that were correct for the old data distribution become wrong for the current one.

The most common presentation: a query that joins two tables starts doing a nested loop instead of a hash join because the planner underestimates the inner table’s row count. Or an index scan gets chosen when the data has changed enough that a sequential scan would be faster. Or a partial index gets selected for a query where the filtered row count no longer makes that index selective.

Statistics drift is distinct from index bloat or table bloat. The physical storage might be fine. The problem is that the optimizer’s mental model of the data is wrong, and it is building plans optimized for a database that no longer exists.

Symptoms

Signal	Where to see it	What it means
`EXPLAIN ANALYZE` estimated rows far from actual rows	Query plan output	Statistics are stale or the column distribution is unusual
`last_analyze` or `last_autoanalyze` is days old	`pg_stat_user_tables`	Automatic statistics updates not running on this table
Query plan changed after a bulk load or large delete	Application performance logs	The new data volume or distribution triggered a different plan
Planner chooses sequential scan on a selective query	`EXPLAIN ANALYZE` output	Row count estimate too high; planner thinks index would cost more
Planner chooses nested loop for a large result set	`EXPLAIN ANALYZE` output	Row count estimate too low; planner underestimated join output
`n_distinct` in `pg_stats` shows -1 for a column with few distinct values	`pg_stats`	Statistics estimate is extrapolated, not exact

First Five Checks

Confirm the estimate-vs-actual divergence — the EXPLAIN output is the primary diagnostic:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > now() - interval '7 days';

Look for rows where rows=N (actual rows=M) and N is off by more than a factor of 10. A nested loop chosen over a hash join when the actual row count exceeds 10,000 is a clear statistics failure. Note the exact node type (SeqScan, IndexScan, Hash, NestLoop) — this tells you which estimate was wrong.

Inspect column statistics for the affected table — pg_stats stores what the planner knows:

SELECT
  attname,
  n_distinct,
  correlation,
  null_frac,
  avg_width,
  most_common_vals,
  most_common_freqs
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('status', 'created_at', 'customer_id')
ORDER BY attname;

n_distinct > 0 means an absolute count; n_distinct < 0 means a fraction of the table. If n_distinct = -1, PostgreSQL is guessing that every row is unique — problematic for low-cardinality columns. Low correlation (near 0) on a column used in a range scan means physical row order does not match logical sort order, which raises index scan costs.

Check when statistics were last collected — stale analyze timestamps are the first explanation:

SELECT
  relname,
  last_analyze,
  last_autoanalyze,
  n_live_tup,
  n_dead_tup,
  n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers')
ORDER BY last_analyze NULLS LAST;

n_mod_since_analyze is the counter that autovacuum uses to decide whether to run ANALYZE. If it is large relative to n_live_tup, statistics are definitely stale. A last_analyze of NULL means ANALYZE has never run on this table.

Check for bulk data changes that were not followed by ANALYZE — look at table modification counts:

SELECT
  relname,
  n_mod_since_analyze,
  n_live_tup,
  round(n_mod_since_analyze::numeric / nullif(n_live_tup, 0) * 100, 2) AS mod_pct
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0
ORDER BY mod_pct DESC
LIMIT 10;

A mod_pct above 20% means more than 20% of the table has changed since the last statistics collection — the autovacuum analyze_scale_factor default is 0.2, so autovacuum should have triggered, but may not have if the table is very large or autovacuum was busy.

Check raw statistics storage — to understand what the planner is actually seeing:

SELECT
  staattnum,
  stakind1,
  stavalues1,
  stanumbers1
FROM pg_statistic
WHERE starelid = 'orders'::regclass
LIMIT 5;

stakind 1 = most-common-values, 2 = histogram, 3 = correlation. If stavalues1 is sparse or missing, the planner has no useful distribution data for that column. This is the raw form of what pg_stats presents in human-readable form.

Decision Tree

flowchart TD
    A[Slow query — plan regression suspected] --> B{EXPLAIN estimated rows match actual?}
    B -->|yes — estimates correct| C[Statistics not the problem — check indexes or locks]
    B -->|no — large divergence| D{last_analyze recent?}
    D -->|no — stale or never| E[ANALYZE tablename — re-check plan]
    D -->|yes — but still wrong| F{Column has unusual distribution?}
    F -->|yes — skewed or correlated| G[ALTER COLUMN SET STATISTICS 500]
    G --> H[ANALYZE tablename — re-check plan]
    F -->|no| I{Multiple columns in WHERE clause?}
    I -->|yes| J[CREATE STATISTICS for correlated columns]
    J --> K[ANALYZE tablename — re-check plan]
    I -->|no| L{n_distinct estimate wrong?}
    L -->|yes| M[ALTER COLUMN SET n_distinct — explicit override]
    L -->|no| N[Check for partial index mismatch or planner bugs]

Remediation Options

Option 1 — Run ANALYZE to refresh statistics

The simplest fix — and always the first step:

-- Analyze a specific table
ANALYZE VERBOSE orders;

-- Analyze multiple tables
ANALYZE VERBOSE orders, customers, order_items;

-- Analyze a specific column (faster on large tables)
ANALYZE orders (status, created_at, customer_id);

ANALYZE VERBOSE prints a summary of rows sampled, which is useful for confirming the statistics update ran successfully. After ANALYZE, re-run EXPLAIN (ANALYZE, BUFFERS) on the slow query to see if the estimates improved.

ANALYZE takes a SHARE UPDATE EXCLUSIVE lock — it blocks DDL but not reads or writes. It is safe to run on production tables at any time.

Option 2 — Increase statistics target for selective columns

The default default_statistics_target = 100 samples 300 * 100 = 30,000 rows for statistics. For columns with many distinct values or highly skewed distributions, this sample may not capture the tail. Increase the per-column target:

-- Increase statistics detail for a specific column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 500;

-- Then refresh statistics
ANALYZE orders;

A statistics target of 500 collects approximately 150,000 rows — 5x the default. The pg_stats documentation notes that n_distinct estimates and histogram bucket counts improve with higher targets, especially for columns where the value distribution has a long tail.

After increasing the target, verify in pg_stats that most_common_vals is more populated and that histogram buckets look representative:

SELECT attname, array_length(most_common_vals, 1) AS mcv_count,
       array_length(histogram_bounds, 1) AS histogram_buckets
FROM pg_stats
WHERE tablename = 'orders';

Option 3 — Create extended statistics for correlated columns

When a WHERE clause filters on two columns that are correlated — e.g., status = 'shipped' AND region = 'EU' where shipped orders are disproportionately from EU — the planner multiplies the selectivity of each column independently and underestimates the result set. PostgreSQL 10 introduced extended statistics to model this:

-- Create statistics tracking correlation between two columns
CREATE STATISTICS orders_status_region (dependencies)
ON status, region
FROM orders;

-- Collect the extended statistics
ANALYZE orders;

-- Verify
SELECT stxname, stxkind, stxdefined
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;

Extended statistics with dependencies teaches the planner that the two columns are correlated. The ndistinct option captures combined distinct value counts; mcv captures the most common value combinations. After collecting, re-run EXPLAIN to see if the multi-column estimate improved.

Rollback Plan

ANALYZE is always safe to run and always safe to re-run. It does not modify data. The only rollback consideration is performance: on a very large table with a high statistics target, ANALYZE can take minutes and create I/O pressure. Run during off-peak hours on tables over 100 GB.
ALTER COLUMN SET STATISTICS N is reversible: ALTER TABLE orders ALTER COLUMN status SET STATISTICS -1 returns to the default. No ANALYZE re-run is needed to revert — the change takes effect on the next ANALYZE.
CREATE STATISTICS is reversible: DROP STATISTICS orders_status_region. The planner reverts to independent column estimates immediately.
ALTER TABLE ... SET (n_distinct = N) — an explicit override that bypasses sampling — is reversible: ALTER TABLE orders ALTER COLUMN col SET (n_distinct = -1) restores to estimated mode.

Automation Opportunity

Stale statistics are predictable: they happen after bulk loads and large deletes. A pattern worth automating is a post-ETL ANALYZE call baked into the data pipeline itself, rather than relying on autovacuum timing:

-- After any bulk insert, run ANALYZE immediately
INSERT INTO orders_archive SELECT * FROM orders WHERE status = 'completed' AND created_at < now() - interval '1 year';
DELETE FROM orders WHERE status = 'completed' AND created_at < now() - interval '1 year';
ANALYZE orders;  -- do not skip this

For monitoring, a pg_cron query that alerts when n_mod_since_analyze exceeds a threshold gives advance notice before the planner starts making wrong decisions:

SELECT cron.schedule('stats-staleness-check', '30 * * * *', $$
  INSERT INTO ops.stats_alerts (tablename, mod_pct, captured_at)
  SELECT
    relname,
    round(n_mod_since_analyze::numeric / nullif(n_live_tup, 0) * 100, 2),
    now()
  FROM pg_stat_user_tables
  WHERE n_live_tup > 100000
    AND n_mod_since_analyze::numeric / nullif(n_live_tup, 0) > 0.15;
$$);

In Practice

The PostgreSQL statistics documentation describes the statistics target as controlling both the number of histogram buckets and the most-common-values list length. The documented relationship is: statistics_target × 300 = rows sampled. For a column where 0.01% of rows have a specific value that is frequently queried, the default 30,000-row sample will often miss that value entirely, producing a histogram-based estimate that is substantially wrong.

The documented behavior of CREATE STATISTICS with dependencies is that it computes functional dependency statistics between columns. Where the selectivity of col_a = 'x' is 0.01 and col_b = 'y' is 0.05, the planner without extended statistics estimates the joint selectivity as 0.01 × 0.05 = 0.0005. With a dependencies statistic showing that col_a = 'x' implies col_b = 'y' with 95% probability, the planner correctly estimates closer to 0.01.

Where It Breaks

Failure mode	Trigger	Fix
`ANALYZE` runs but estimates still wrong	Column has extreme skew — 99% of rows share one value	Increase `statistics_target` to 1000; use `CREATE STATISTICS mcv`
Extended statistics do not help	Correlation is partial, not functional dependency	Try `ndistinct` variant of `CREATE STATISTICS`
`ANALYZE` is too slow on large table	Table has 1B+ rows and wide schema	Analyze specific columns only: `ANALYZE table (col1, col2)`
Autovacuum is running ANALYZE but estimates still drift	`analyze_scale_factor` threshold crossed only after large drift	Lower `autovacuum_analyze_scale_factor` per-table to 0.01
Plan regression returns after ANALYZE	Statistics are correct but planner constant factors are wrong	Consider `pg_hint_plan` as a temporary override while investigating

What to Do Next

Problem: Stale or low-resolution statistics cause the planner to choose wrong join types and scan methods, producing query regressions that look like load spikes but are actually optimizer failures.
Solution: Run ANALYZE after bulk loads, raise statistics target to 500 for join and filter columns on large tables, and create extended statistics for correlated column pairs.
Proof: After ANALYZE, EXPLAIN (ANALYZE) estimated rows should be within a factor of 2 of actual rows for the primary scan nodes.
Action: Run the n_mod_since_analyze query from Check 4 this week. Any table where mod_pct > 20% needs an ANALYZE run today.

Checklist

Run EXPLAIN (ANALYZE, BUFFERS) on the slow query — compare estimated vs actual rows at each node
Query pg_stats for the filtered columns — check n_distinct, correlation, and most_common_vals
Query pg_stat_user_tables for last_analyze, last_autoanalyze, and n_mod_since_analyze
If last_analyze is stale or NULL: run ANALYZE tablename immediately
Re-run EXPLAIN (ANALYZE) after ANALYZE to verify estimates improved
If estimates still wrong: check for correlated columns in the WHERE clause
Raise statistics_target to 500 for high-cardinality or skewed columns
Create extended statistics with CREATE STATISTICS (dependencies) for correlated column pairs
Run ANALYZE again after any statistics configuration change
Lower autovacuum_analyze_scale_factor to 0.01 per-table for high-write tables
Add ANALYZE calls to ETL pipelines immediately after bulk loads or large deletes
Add a monitoring query on n_mod_since_analyze — alert when mod_pct > 15% on production tables

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

In Practice

Where It Breaks

What to Do Next

Checklist

Rajiv

Related Posts

Autovacuum Is a Capacity Problem, Not a Maintenance Task

When Autovacuum Becomes a Backpressure Signal

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works