Index Selectivity: Why Cardinality Changes Everything
An index on a boolean column does not help. An index on a status column with three values probably does not help either. Index selectivity — how many distinct values a column has relative to the total row count — determines whether the planner will choose the index or ignore it entirely.
Situation
Database engineers add indexes to slow queries by instinct — the query filters on status, so create an index on status. When the index does not improve performance or is ignored by the planner, the engineer is confused. The planner is not wrong. A low-selectivity index is genuinely worse than a sequential scan for most queries, and the planner knows it.
Selectivity is the fraction of rows a condition matches. A condition that matches 1% of rows has high selectivity (the index is useful). A condition that matches 60% of rows has low selectivity (a sequential scan is likely faster).
The Problem
A table has 10 million orders. Engineers add an index on status to speed up a query filtering for status = 'pending'. The query uses the index in development (where the table has 1,000 rows and 200 are pending). In production (where 7 million of 10 million orders are pending), the query ignores the index and does a sequential scan. The planner is right both times.
How does the planner decide whether an index is worth using, and when is a low-cardinality index harmful?
Selectivity and the Cost Model
The planner estimates the cost of an index scan as: (rows matched by the condition) × (random page read cost). If matched rows is large, random reads add up quickly. Sequential scans read data in order and benefit from operating system read-ahead; random index lookups do not.
For status = 'pending' on a table where 70% of rows are pending:
Estimated index scan cost: 7,000,000 × 4 (random_page_cost) = 28,000,000 cost units
Estimated seq scan cost: table_pages × 1 (seq_page_cost) ≈ 50,000 cost units
The sequential scan wins by a large margin. Adding the index did not slow the query — but it did add write overhead and storage cost for zero benefit.
-- Check distinct values and cardinality for a column
SELECT status, count(*) as row_count,
round(count(*) * 100.0 / sum(count(*)) over (), 2) as pct
FROM orders
GROUP BY status
ORDER BY row_count DESC;
-- What statistics does the planner have?
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
n_distinct = 3 means the planner knows there are 3 distinct status values. With 10 million rows, each value has ~3.3 million rows on average. No single value is selective enough to make the index useful for queries that match a large fraction of rows.
When Low-Cardinality Indexes Work
A partial index solves this by indexing only the rare values that are actually selective:
-- Instead of a full index on status:
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';
If only 0.5% of orders are pending at any given time, this partial index covers a small fraction of rows and is highly selective. The planner will use it for WHERE status = 'pending' queries. It is smaller, faster to update, and more selective than a full index on status.
In Practice
PostgreSQL’s documented statistics collection (ANALYZE) builds histograms and most-common-value lists for each column. The planner uses these to estimate how many rows a condition will return. When statistics are stale — because a table has had many inserts or updates since the last ANALYZE — estimates are wrong and the planner may make a bad choice. PostgreSQL’s autovacuum runs ANALYZE automatically, but on very high-write tables it may not keep up.
The correlation value in pg_stats measures how well the physical order of rows in the heap matches the sort order of the column. A high correlation (near 1.0) means the column’s values are physically ordered and index scans are efficient; a correlation near 0 means index scans require many random reads.
Where It Breaks
| Scenario | Problem | Fix |
|---|---|---|
| Index on low-cardinality column | Planner ignores the index; write overhead remains | Drop index; use partial index on the rare, selective values |
| Stale statistics on skewed data | Planner underestimates matching rows; bad plan | Run ANALYZE manually; tune default_statistics_target |
| Index exists but has wrong correlation | Index used but causes excessive random I/O | Run CLUSTER on the table; or accept the random I/O as the cost of index use |
What to Do Next
- Problem: Low-cardinality indexes add write overhead and storage cost without improving read performance for queries that match a large fraction of rows.
- Solution: Check
pg_stats.n_distinctbefore creating an index; for low-cardinality columns, consider a partial index on the selective values only. - Proof: A partial index on pending orders will appear in
EXPLAINoutput forWHERE status = 'pending'queries and be ignored forWHERE status = 'shipped'queries — exactly the right selectivity-aware behavior. - Action: Run
SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes ORDER BY idx_scan ASC LIMIT 20;today and find your least-used indexes — candidates for review or removal.