Cardinality Estimation: Why the Query Planner Gets It Wrong

The query planner is a cost-based optimizer, and its cost estimates are only as good as its row count estimates. When the planner picks the wrong join strategy or uses the wrong index, the root cause is almost always a cardinality estimation error — not a missing index.

Situation

PostgreSQL’s query planner uses statistics — stored in pg_statistic and surfaced via pg_stats — to estimate how many rows each condition will match. These estimates drive the choice of join algorithm (hash join vs nested loop vs merge join), the order of joins, and the index selection decision. Bad estimates produce bad plans.

The planner makes estimates using histograms, most-common-value lists, and correlation statistics collected by ANALYZE. For a single table with a single condition, estimates are usually accurate. For multiple conditions on the same table, or joins across multiple tables, estimation errors compound.

The Problem

A query joins three tables and filters on two columns in the same table. The query is slow. EXPLAIN ANALYZE shows that the planner estimated 12 rows from one step but got back 450,000 rows — a 37,000x underestimate. The hash join built on that estimate is catastrophically undersized and spilled to disk.

Why did the planner get it so wrong, and what can engineers actually do about it?

How Estimation Fails

Column correlation: PostgreSQL’s default statistics assume predicate conditions on different columns are independent. If you filter WHERE region = 'West' AND product_category = 'Electronics', the planner multiplies the selectivity of each condition separately. If region and category are correlated (all Electronics orders come from West), the actual row count is much higher than the product of individual selectivities would suggest. This is the most common source of large estimation errors.

Stale statistics: After bulk inserts, large updates, or schema changes, the statistics in pg_statistic no longer reflect the actual data distribution. Autovacuum runs ANALYZE automatically, but if writes are faster than autovacuum can keep up, the statistics become stale.

Skewed distributions: The histogram has a fixed number of buckets (default: 100 per column). If a value appears in 40% of rows, the histogram captures this well. But if values are extremely skewed — 0.001% of rows match a specific condition — the histogram bucket resolution may be too coarse to estimate accurately.

-- Check statistics freshness
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE n_mod_since_analyze > 10000
ORDER BY n_mod_since_analyze DESC;

-- View column statistics
SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders';

-- Force fresh statistics
ANALYZE orders;

-- Increase statistics target for a skewed column
ALTER TABLE orders ALTER COLUMN region SET STATISTICS 500;
ANALYZE orders;

In Practice

The documented PostgreSQL fix for correlated column estimation errors is extended statistics, available since PostgreSQL 10:

-- Create extended statistics for correlated columns
CREATE STATISTICS orders_region_category ON region, product_category FROM orders;
ANALYZE orders;

-- Verify the stats object exists
SELECT stxname, stxkeys, stxkind FROM pg_statistic_ext;

Extended statistics teach the planner that region and product_category are correlated, allowing it to estimate multi-column conditions accurately. Without extended statistics, the independence assumption produces systematically wrong estimates for correlated columns.

The default_statistics_target parameter (default: 100) controls how many values the histogram tracks per column. Increasing it to 500 for columns with highly skewed distributions improves estimation accuracy at the cost of slower ANALYZE runs.

Where It Breaks

Estimation failure	Symptom in EXPLAIN ANALYZE	Fix
Correlated columns	`rows=5 actual rows=200000` on multi-column filter	Create extended statistics on the correlated columns
Stale statistics	`rows=1000 actual rows=9000000` after bulk load	Run `ANALYZE` manually; tune autovacuum for high-write tables
Skewed distribution	Planner ignores partial index that should be selective	Increase `default_statistics_target` for the column
Join order wrong	Outer join processes more rows than inner	`SET join_collapse_limit = 1` and reorder joins manually to test

What to Do Next

Problem: Cardinality estimation errors cause the planner to pick wrong join strategies and wrong indexes, and the errors are invisible without reading EXPLAIN ANALYZE output carefully.
Solution: Compare estimated vs actual row counts in EXPLAIN ANALYZE — any 10x divergence is a signal to investigate statistics quality.
Proof: After adding extended statistics on correlated columns, re-run EXPLAIN ANALYZE — the estimated rows should match actual rows within a factor of 2–3.
Action: Find your slowest query, run EXPLAIN (ANALYZE, BUFFERS), and find the node where estimated rows diverges most from actual rows — that node is where the plan went wrong.

Situation

The Problem

How Estimation Fails

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

Consistency Models Your Application Actually Needs

CAP Theorem in Operational Terms

Caches, Queues, and Databases: When to Use Each