PostgreSQL Statistics: Why the Optimizer Gets It Wrong
The PostgreSQL query planner does not look at your data. It looks at statistics about your data — histograms, most-common values, null fractions, and row count estimates stored in pg_statistic. When those statistics are stale, the planner makes wrong decisions: it picks sequential scans over index scans, chooses nested loops over hash joins, and estimates 100 rows for a query that will return 10 million. This is not a bug. It is an expected consequence of how cost-based optimization works, and it is entirely under operator control.
Situation
PostgreSQL builds query plans by estimating the cost of each possible execution path. Cost estimates depend on row count estimates, and row count estimates come from statistics. The statistics are not computed continuously — they are snapshots taken by ANALYZE (or automatically by autovacuum’s analyze pass).
Engineers typically encounter statistics problems in two situations. The first is after a bulk data load: a table that had 10,000 rows now has 10 million, but the planner still thinks it has 10,000 because ANALYZE has not run since the load. The second is on tables with highly skewed distributions — a few values account for most rows, but the planner’s histogram does not have enough resolution to represent that accurately.
The Problem
PostgreSQL stores column statistics in pg_statistic, exposed through the human-readable view pg_stats. The key columns:
most_common_vals— the N most frequent values and their frequencies (most_common_freqs)histogram_bounds— bucket boundaries dividing the non-MCV value range into equal-frequency slicesnull_frac— fraction of rows that are NULLcorrelation— how well physical row order matches logical sort order (1.0 = perfectly sorted; near 0 = random)
The planner combines these to estimate how many rows will pass a given filter condition. When the statistics are accurate, estimates are close to reality. When they are stale, the estimates can be off by orders of magnitude.
The documented failure mode from PostgreSQL’s query planning documentation: after a bulk insert of 10 million rows into a table whose last ANALYZE ran when the table had 1,000 rows, the planner’s reltuples estimate in pg_class will still read approximately 1,000. A query with WHERE id = $1 on a now-large table may generate a sequential scan plan — because the planner believes the table is small and the index overhead is not worth it.
The core question: which statistics settings should you tune, and when should you manually trigger ANALYZE?
How Statistics Collection Works
default_statistics_target controls how much detail is collected per column. The default is 100, meaning PostgreSQL tracks the 100 most common values and uses 100 histogram buckets. The valid range is 1 to 10,000.
Increasing default_statistics_target makes ANALYZE slower and the statistics larger, but improves estimate accuracy for skewed distributions. For most tables, the default is fine. For columns used in highly selective filters — especially foreign keys, status columns with many distinct values, or columns where the top 100 values do not capture the actual distribution — increasing the target at the column level is the right lever:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
You can observe what the planner currently knows about a column:
SELECT
attname,
n_distinct,
most_common_vals,
most_common_freqs,
histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
AND attname = 'status';
n_distinct tells you how many distinct values PostgreSQL believes exist. A value of -0.5 means the planner estimates 50% of rows have distinct values (common for primary keys). A positive value is a raw count. If this number looks wrong, the statistics are stale.
After a bulk load, always run ANALYZE explicitly before the new data receives production query traffic:
ANALYZE orders; -- whole table
ANALYZE orders (status); -- specific column only
Autovacuum’s analyze pass uses autovacuum_analyze_scale_factor (default: 0.2) and autovacuum_analyze_threshold (default: 50). Same structural problem as vacuum thresholds: on a 50-million row table, autovacuum will not trigger ANALYZE until 10 million rows have changed. For large bulk loads, waiting for autovacuum is not safe.
In Practice
PostgreSQL’s query planner documentation (postgresql.org/docs/current/planner-stats.html) describes exactly how the planner uses pg_statistic data: selectivity estimator functions read the statistics to produce row count estimates, and the planner chooses the lowest-cost plan based on those estimates combined with seq_page_cost, random_page_cost, and table and index size from pg_class.
The correlation value in pg_stats is particularly actionable: if correlation for an indexed column is near 1.0 (data is physically sorted by that column), the planner will heavily favor index scans because random I/O effectively becomes sequential. If correlation is near 0 (random physical order), the planner may correctly prefer a sequential scan even for a highly selective query on a large table, because fetching scattered heap pages costs more than scanning the whole table with sequential I/O. Knowing this prevents incorrect index-forcing interventions.
The documented pattern from PostgreSQL extended statistics documentation is that CREATE STATISTICS (available since PostgreSQL 10) allows the planner to model correlations between columns — solving the multi-column selectivity problem that single-column histograms cannot handle. When a query filters on two correlated columns (e.g., country and city), single-column estimates multiply their selectivities independently, producing severely underestimated row counts.
Where It Breaks
| Scenario | What breaks | Why |
|---|---|---|
| Bulk insert without subsequent ANALYZE | Planner uses row counts from before the load; index scans may be abandoned for sequential scans on newly large tables | pg_class.reltuples is only updated by ANALYZE; autovacuum’s analyze threshold may not trigger for hours |
| Correlated columns with single-column statistics | Multi-column filter estimates are too optimistic; wrong join strategy chosen | Planner multiplies per-column selectivities independently, ignoring correlation between columns |
| Partial index with no matching statistics | Planner cannot use the partial index’s selectivity correctly when the WHERE clause of the query partially matches the index predicate | pg_stats does not store per-partial-index statistics; planner falls back to whole-table estimates |
What to Do Next
- Problem: Stale statistics after bulk loads cause the planner to choose wrong execution plans — sequential scans where index scans are needed, or nested loops where hash joins would be correct.
- Solution: Run
ANALYZEexplicitly after every bulk load, reduceautovacuum_analyze_scale_factoron large tables, and raisestatistics_targeton highly selective or skewed columns. - Proof: Use
EXPLAIN (ANALYZE, BUFFERS)before and afterANALYZEon a query affected by a bulk load — the estimated row counts in the plan should converge toward actual row counts. - Action: This week, query
SELECT tablename, last_analyze, last_autoanalyze, n_live_tup FROM pg_stat_user_tables ORDER BY last_analyze ASC NULLS FIRST LIMIT 20;and identify tables where statistics are old relative to write volume.