A query that ran in 8 milliseconds last week and now takes 4 seconds has not changed — but the planner’s model of the data has. PostgreSQL’s query optimizer builds execution plans from table statistics: column value distributions, row counts, and correlation coefficients stored in pg_statistic. When those statistics drift from reality, the optimizer chooses wrong plans with confidence, and the resulting regressions are difficult to catch because no error is raised — just slower queries.

Situation

PostgreSQL uses a cost-based optimizer that estimates how many rows each plan step will process. Those estimates come from statistics gathered by ANALYZE. If statistics are stale — from a bulk load, a large delete, or simply not running ANALYZE for an extended period — the planner’s row estimates diverge from actual counts, and plan choices that were correct for the old data distribution become wrong for the current one.

The most common presentation: a query that joins two tables starts doing a nested loop instead of a hash join because the planner underestimates the inner table’s row count. Or an index scan gets chosen when the data has changed enough that a sequential scan would be faster. Or a partial index gets selected for a query where the filtered row count no longer makes that index selective.

Statistics drift is distinct from index bloat or table bloat. The physical storage might be fine. The problem is that the optimizer’s mental model of the data is wrong, and it is building plans optimized for a database that no longer exists.

Symptoms

SignalWhere to see itWhat it means
EXPLAIN ANALYZE estimated rows far from actual rowsQuery plan outputStatistics are stale or the column distribution is unusual
last_analyze or last_autoanalyze is days oldpg_stat_user_tablesAutomatic statistics updates not running on this table
Query plan changed after a bulk load or large deleteApplication performance logsThe new data volume or distribution triggered a different plan
Planner chooses sequential scan on a selective queryEXPLAIN ANALYZE outputRow count estimate too high; planner thinks index would cost more
Planner chooses nested loop for a large result setEXPLAIN ANALYZE outputRow count estimate too low; planner underestimated join output
n_distinct in pg_stats shows -1 for a column with few distinct valuespg_statsStatistics estimate is extrapolated, not exact

First Five Checks

  1. Confirm the estimate-vs-actual divergence — the EXPLAIN output is the primary diagnostic:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
  AND o.created_at > now() - interval '7 days';

Look for rows where rows=N (actual rows=M) and N is off by more than a factor of 10. A nested loop chosen over a hash join when the actual row count exceeds 10,000 is a clear statistics failure. Note the exact node type (SeqScan, IndexScan, Hash, NestLoop) — this tells you which estimate was wrong.

  1. Inspect column statistics for the affected tablepg_stats stores what the planner knows:
SELECT
  attname,
  n_distinct,
  correlation,
  null_frac,
  avg_width,
  most_common_vals,
  most_common_freqs
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('status', 'created_at', 'customer_id')
ORDER BY attname;

n_distinct > 0 means an absolute count; n_distinct < 0 means a fraction of the table. If n_distinct = -1, PostgreSQL is guessing that every row is unique — problematic for low-cardinality columns. Low correlation (near 0) on a column used in a range scan means physical row order does not match logical sort order, which raises index scan costs.

  1. Check when statistics were last collected — stale analyze timestamps are the first explanation:
SELECT
  relname,
  last_analyze,
  last_autoanalyze,
  n_live_tup,
  n_dead_tup,
  n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'customers')
ORDER BY last_analyze NULLS LAST;

n_mod_since_analyze is the counter that autovacuum uses to decide whether to run ANALYZE. If it is large relative to n_live_tup, statistics are definitely stale. A last_analyze of NULL means ANALYZE has never run on this table.

  1. Check for bulk data changes that were not followed by ANALYZE — look at table modification counts:
SELECT
  relname,
  n_mod_since_analyze,
  n_live_tup,
  round(n_mod_since_analyze::numeric / nullif(n_live_tup, 0) * 100, 2) AS mod_pct
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 0
ORDER BY mod_pct DESC
LIMIT 10;

A mod_pct above 20% means more than 20% of the table has changed since the last statistics collection — the autovacuum analyze_scale_factor default is 0.2, so autovacuum should have triggered, but may not have if the table is very large or autovacuum was busy.

  1. Check raw statistics storage — to understand what the planner is actually seeing:
SELECT
  staattnum,
  stakind1,
  stavalues1,
  stanumbers1
FROM pg_statistic
WHERE starelid = 'orders'::regclass
LIMIT 5;

stakind 1 = most-common-values, 2 = histogram, 3 = correlation. If stavalues1 is sparse or missing, the planner has no useful distribution data for that column. This is the raw form of what pg_stats presents in human-readable form.

Decision Tree

flowchart TD
    A[Slow query — plan regression suspected] --> B{EXPLAIN estimated rows match actual?}
    B -->|yes — estimates correct| C[Statistics not the problem — check indexes or locks]
    B -->|no — large divergence| D{last_analyze recent?}
    D -->|no — stale or never| E[ANALYZE tablename — re-check plan]
    D -->|yes — but still wrong| F{Column has unusual distribution?}
    F -->|yes — skewed or correlated| G[ALTER COLUMN SET STATISTICS 500]
    G --> H[ANALYZE tablename — re-check plan]
    F -->|no| I{Multiple columns in WHERE clause?}
    I -->|yes| J[CREATE STATISTICS for correlated columns]
    J --> K[ANALYZE tablename — re-check plan]
    I -->|no| L{n_distinct estimate wrong?}
    L -->|yes| M[ALTER COLUMN SET n_distinct — explicit override]
    L -->|no| N[Check for partial index mismatch or planner bugs]

Remediation Options

Option 1 — Run ANALYZE to refresh statistics

The simplest fix — and always the first step:

-- Analyze a specific table
ANALYZE VERBOSE orders;

-- Analyze multiple tables
ANALYZE VERBOSE orders, customers, order_items;

-- Analyze a specific column (faster on large tables)
ANALYZE orders (status, created_at, customer_id);

ANALYZE VERBOSE prints a summary of rows sampled, which is useful for confirming the statistics update ran successfully. After ANALYZE, re-run EXPLAIN (ANALYZE, BUFFERS) on the slow query to see if the estimates improved.

ANALYZE takes a SHARE UPDATE EXCLUSIVE lock — it blocks DDL but not reads or writes. It is safe to run on production tables at any time.

Option 2 — Increase statistics target for selective columns

The default default_statistics_target = 100 samples 300 * 100 = 30,000 rows for statistics. For columns with many distinct values or highly skewed distributions, this sample may not capture the tail. Increase the per-column target:

-- Increase statistics detail for a specific column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 500;

-- Then refresh statistics
ANALYZE orders;

A statistics target of 500 collects approximately 150,000 rows — 5x the default. The pg_stats documentation notes that n_distinct estimates and histogram bucket counts improve with higher targets, especially for columns where the value distribution has a long tail.

After increasing the target, verify in pg_stats that most_common_vals is more populated and that histogram buckets look representative:

SELECT attname, array_length(most_common_vals, 1) AS mcv_count,
       array_length(histogram_bounds, 1) AS histogram_buckets
FROM pg_stats
WHERE tablename = 'orders';

Option 3 — Create extended statistics for correlated columns

When a WHERE clause filters on two columns that are correlated — e.g., status = 'shipped' AND region = 'EU' where shipped orders are disproportionately from EU — the planner multiplies the selectivity of each column independently and underestimates the result set. PostgreSQL 10 introduced extended statistics to model this:

-- Create statistics tracking correlation between two columns
CREATE STATISTICS orders_status_region (dependencies)
ON status, region
FROM orders;

-- Collect the extended statistics
ANALYZE orders;

-- Verify
SELECT stxname, stxkind, stxdefined
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;

Extended statistics with dependencies teaches the planner that the two columns are correlated. The ndistinct option captures combined distinct value counts; mcv captures the most common value combinations. After collecting, re-run EXPLAIN to see if the multi-column estimate improved.

Rollback Plan

  1. ANALYZE is always safe to run and always safe to re-run. It does not modify data. The only rollback consideration is performance: on a very large table with a high statistics target, ANALYZE can take minutes and create I/O pressure. Run during off-peak hours on tables over 100 GB.
  2. ALTER COLUMN SET STATISTICS N is reversible: ALTER TABLE orders ALTER COLUMN status SET STATISTICS -1 returns to the default. No ANALYZE re-run is needed to revert — the change takes effect on the next ANALYZE.
  3. CREATE STATISTICS is reversible: DROP STATISTICS orders_status_region. The planner reverts to independent column estimates immediately.
  4. ALTER TABLE ... SET (n_distinct = N) — an explicit override that bypasses sampling — is reversible: ALTER TABLE orders ALTER COLUMN col SET (n_distinct = -1) restores to estimated mode.

Automation Opportunity

Stale statistics are predictable: they happen after bulk loads and large deletes. A pattern worth automating is a post-ETL ANALYZE call baked into the data pipeline itself, rather than relying on autovacuum timing:

-- After any bulk insert, run ANALYZE immediately
INSERT INTO orders_archive SELECT * FROM orders WHERE status = 'completed' AND created_at < now() - interval '1 year';
DELETE FROM orders WHERE status = 'completed' AND created_at < now() - interval '1 year';
ANALYZE orders;  -- do not skip this

For monitoring, a pg_cron query that alerts when n_mod_since_analyze exceeds a threshold gives advance notice before the planner starts making wrong decisions:

SELECT cron.schedule('stats-staleness-check', '30 * * * *', $$
  INSERT INTO ops.stats_alerts (tablename, mod_pct, captured_at)
  SELECT
    relname,
    round(n_mod_since_analyze::numeric / nullif(n_live_tup, 0) * 100, 2),
    now()
  FROM pg_stat_user_tables
  WHERE n_live_tup > 100000
    AND n_mod_since_analyze::numeric / nullif(n_live_tup, 0) > 0.15;
$$);

In Practice

The PostgreSQL statistics documentation describes the statistics target as controlling both the number of histogram buckets and the most-common-values list length. The documented relationship is: statistics_target × 300 = rows sampled. For a column where 0.01% of rows have a specific value that is frequently queried, the default 30,000-row sample will often miss that value entirely, producing a histogram-based estimate that is substantially wrong.

The documented behavior of CREATE STATISTICS with dependencies is that it computes functional dependency statistics between columns. Where the selectivity of col_a = 'x' is 0.01 and col_b = 'y' is 0.05, the planner without extended statistics estimates the joint selectivity as 0.01 × 0.05 = 0.0005. With a dependencies statistic showing that col_a = 'x' implies col_b = 'y' with 95% probability, the planner correctly estimates closer to 0.01.

Where It Breaks

Failure modeTriggerFix
ANALYZE runs but estimates still wrongColumn has extreme skew — 99% of rows share one valueIncrease statistics_target to 1000; use CREATE STATISTICS mcv
Extended statistics do not helpCorrelation is partial, not functional dependencyTry ndistinct variant of CREATE STATISTICS
ANALYZE is too slow on large tableTable has 1B+ rows and wide schemaAnalyze specific columns only: ANALYZE table (col1, col2)
Autovacuum is running ANALYZE but estimates still driftanalyze_scale_factor threshold crossed only after large driftLower autovacuum_analyze_scale_factor per-table to 0.01
Plan regression returns after ANALYZEStatistics are correct but planner constant factors are wrongConsider pg_hint_plan as a temporary override while investigating

What to Do Next

  • Problem: Stale or low-resolution statistics cause the planner to choose wrong join types and scan methods, producing query regressions that look like load spikes but are actually optimizer failures.
  • Solution: Run ANALYZE after bulk loads, raise statistics target to 500 for join and filter columns on large tables, and create extended statistics for correlated column pairs.
  • Proof: After ANALYZE, EXPLAIN (ANALYZE) estimated rows should be within a factor of 2 of actual rows for the primary scan nodes.
  • Action: Run the n_mod_since_analyze query from Check 4 this week. Any table where mod_pct > 20% needs an ANALYZE run today.

Checklist

  1. Run EXPLAIN (ANALYZE, BUFFERS) on the slow query — compare estimated vs actual rows at each node
  2. Query pg_stats for the filtered columns — check n_distinct, correlation, and most_common_vals
  3. Query pg_stat_user_tables for last_analyze, last_autoanalyze, and n_mod_since_analyze
  4. If last_analyze is stale or NULL: run ANALYZE tablename immediately
  5. Re-run EXPLAIN (ANALYZE) after ANALYZE to verify estimates improved
  6. If estimates still wrong: check for correlated columns in the WHERE clause
  7. Raise statistics_target to 500 for high-cardinality or skewed columns
  8. Create extended statistics with CREATE STATISTICS (dependencies) for correlated column pairs
  9. Run ANALYZE again after any statistics configuration change
  10. Lower autovacuum_analyze_scale_factor to 0.01 per-table for high-write tables
  11. Add ANALYZE calls to ETL pipelines immediately after bulk loads or large deletes
  12. Add a monitoring query on n_mod_since_analyze — alert when mod_pct > 15% on production tables