PostgreSQL Statistics: Why the Optimizer Gets It Wrong

The PostgreSQL query planner does not look at your data. It looks at statistics about your data — histograms, most-common values, null fractions, and row count estimates stored in pg_statistic. When those statistics are stale, the planner makes wrong decisions: it picks sequential scans over index scans, chooses nested loops over hash joins, and estimates 100 rows for a query that will return 10 million. This is not a bug. It is an expected consequence of how cost-based optimization works, and it is entirely under operator control.

Situation

PostgreSQL builds query plans by estimating the cost of each possible execution path. Cost estimates depend on row count estimates, and row count estimates come from statistics. The statistics are not computed continuously — they are snapshots taken by ANALYZE (or automatically by autovacuum’s analyze pass).

Engineers typically encounter statistics problems in two situations. The first is after a bulk data load: a table that had 10,000 rows now has 10 million, but the planner still thinks it has 10,000 because ANALYZE has not run since the load. The second is on tables with highly skewed distributions — a few values account for most rows, but the planner’s histogram does not have enough resolution to represent that accurately.

The Problem

PostgreSQL stores column statistics in pg_statistic, exposed through the human-readable view pg_stats. The key columns:

most_common_vals — the N most frequent values and their frequencies (most_common_freqs)
histogram_bounds — bucket boundaries dividing the non-MCV value range into equal-frequency slices
null_frac — fraction of rows that are NULL
correlation — how well physical row order matches logical sort order (1.0 = perfectly sorted; near 0 = random)

The planner combines these to estimate how many rows will pass a given filter condition. When the statistics are accurate, estimates are close to reality. When they are stale, the estimates can be off by orders of magnitude.

The documented failure mode from PostgreSQL’s query planning documentation: after a bulk insert of 10 million rows into a table whose last ANALYZE ran when the table had 1,000 rows, the planner’s reltuples estimate in pg_class will still read approximately 1,000. A query with WHERE id = $1 on a now-large table may generate a sequential scan plan — because the planner believes the table is small and the index overhead is not worth it.

The core question: which statistics settings should you tune, and when should you manually trigger ANALYZE?

How Statistics Collection Works

default_statistics_target controls how much detail is collected per column. The default is 100, meaning PostgreSQL tracks the 100 most common values and uses 100 histogram buckets. The valid range is 1 to 10,000.

Increasing default_statistics_target makes ANALYZE slower and the statistics larger, but improves estimate accuracy for skewed distributions. For most tables, the default is fine. For columns used in highly selective filters — especially foreign keys, status columns with many distinct values, or columns where the top 100 values do not capture the actual distribution — increasing the target at the column level is the right lever:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

You can observe what the planner currently knows about a column:

SELECT
  attname,
  n_distinct,
  most_common_vals,
  most_common_freqs,
  histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'status';

n_distinct tells you how many distinct values PostgreSQL believes exist. A value of -0.5 means the planner estimates 50% of rows have distinct values (common for primary keys). A positive value is a raw count. If this number looks wrong, the statistics are stale.

After a bulk load, always run ANALYZE explicitly before the new data receives production query traffic:

ANALYZE orders;           -- whole table
ANALYZE orders (status);  -- specific column only

Autovacuum’s analyze pass uses autovacuum_analyze_scale_factor (default: 0.2) and autovacuum_analyze_threshold (default: 50). Same structural problem as vacuum thresholds: on a 50-million row table, autovacuum will not trigger ANALYZE until 10 million rows have changed. For large bulk loads, waiting for autovacuum is not safe.

In Practice

PostgreSQL’s query planner documentation (postgresql.org/docs/current/planner-stats.html) describes exactly how the planner uses pg_statistic data: selectivity estimator functions read the statistics to produce row count estimates, and the planner chooses the lowest-cost plan based on those estimates combined with seq_page_cost, random_page_cost, and table and index size from pg_class.

The correlation value in pg_stats is particularly actionable: if correlation for an indexed column is near 1.0 (data is physically sorted by that column), the planner will heavily favor index scans because random I/O effectively becomes sequential. If correlation is near 0 (random physical order), the planner may correctly prefer a sequential scan even for a highly selective query on a large table, because fetching scattered heap pages costs more than scanning the whole table with sequential I/O. Knowing this prevents incorrect index-forcing interventions.

The documented pattern from PostgreSQL extended statistics documentation is that CREATE STATISTICS (available since PostgreSQL 10) allows the planner to model correlations between columns — solving the multi-column selectivity problem that single-column histograms cannot handle. When a query filters on two correlated columns (e.g., country and city), single-column estimates multiply their selectivities independently, producing severely underestimated row counts.

Where It Breaks

Scenario	What breaks	Why
Bulk insert without subsequent ANALYZE	Planner uses row counts from before the load; index scans may be abandoned for sequential scans on newly large tables	`pg_class.reltuples` is only updated by ANALYZE; autovacuum’s analyze threshold may not trigger for hours
Correlated columns with single-column statistics	Multi-column filter estimates are too optimistic; wrong join strategy chosen	Planner multiplies per-column selectivities independently, ignoring correlation between columns
Partial index with no matching statistics	Planner cannot use the partial index’s selectivity correctly when the WHERE clause of the query partially matches the index predicate	`pg_stats` does not store per-partial-index statistics; planner falls back to whole-table estimates

What to Do Next

Problem: Stale statistics after bulk loads cause the planner to choose wrong execution plans — sequential scans where index scans are needed, or nested loops where hash joins would be correct.
Solution: Run ANALYZE explicitly after every bulk load, reduce autovacuum_analyze_scale_factor on large tables, and raise statistics_target on highly selective or skewed columns.
Proof: Use EXPLAIN (ANALYZE, BUFFERS) before and after ANALYZE on a query affected by a bulk load — the estimated row counts in the plan should converge toward actual row counts.
Action: This week, query SELECT tablename, last_analyze, last_autoanalyze, n_live_tup FROM pg_stat_user_tables ORDER BY last_analyze ASC NULLS FIRST LIMIT 20; and identify tables where statistics are old relative to write volume.

Situation

The Problem

How Statistics Collection Works

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

AWS RDS Oracle and SQL Server: The License Cost Nobody Talks About

Autovacuum Is a Capacity Problem, Not a Maintenance Task

The Semantics AI Misses When Porting Storage Designs