Index Selectivity: Why Cardinality Changes Everything

An index on a boolean column does not help. An index on a status column with three values probably does not help either. Index selectivity — how many distinct values a column has relative to the total row count — determines whether the planner will choose the index or ignore it entirely.

Situation

Database engineers add indexes to slow queries by instinct — the query filters on status, so create an index on status. When the index does not improve performance or is ignored by the planner, the engineer is confused. The planner is not wrong. A low-selectivity index is genuinely worse than a sequential scan for most queries, and the planner knows it.

Selectivity is the fraction of rows a condition matches. A condition that matches 1% of rows has high selectivity (the index is useful). A condition that matches 60% of rows has low selectivity (a sequential scan is likely faster).

The Problem

A table has 10 million orders. Engineers add an index on status to speed up a query filtering for status = 'pending'. The query uses the index in development (where the table has 1,000 rows and 200 are pending). In production (where 7 million of 10 million orders are pending), the query ignores the index and does a sequential scan. The planner is right both times.

How does the planner decide whether an index is worth using, and when is a low-cardinality index harmful?

Selectivity and the Cost Model

The planner estimates the cost of an index scan as: (rows matched by the condition) × (random page read cost). If matched rows is large, random reads add up quickly. Sequential scans read data in order and benefit from operating system read-ahead; random index lookups do not.

For status = 'pending' on a table where 70% of rows are pending:

Estimated index scan cost: 7,000,000 × 4 (random_page_cost) = 28,000,000 cost units
Estimated seq scan cost:   table_pages × 1 (seq_page_cost)  ≈ 50,000 cost units

The sequential scan wins by a large margin. Adding the index did not slow the query — but it did add write overhead and storage cost for zero benefit.

-- Check distinct values and cardinality for a column
SELECT status, count(*) as row_count,
       round(count(*) * 100.0 / sum(count(*)) over (), 2) as pct
FROM orders
GROUP BY status
ORDER BY row_count DESC;

-- What statistics does the planner have?
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

n_distinct = 3 means the planner knows there are 3 distinct status values. With 10 million rows, each value has ~3.3 million rows on average. No single value is selective enough to make the index useful for queries that match a large fraction of rows.

When Low-Cardinality Indexes Work

A partial index solves this by indexing only the rare values that are actually selective:

-- Instead of a full index on status:
CREATE INDEX idx_orders_pending ON orders (created_at)
WHERE status = 'pending';

If only 0.5% of orders are pending at any given time, this partial index covers a small fraction of rows and is highly selective. The planner will use it for WHERE status = 'pending' queries. It is smaller, faster to update, and more selective than a full index on status.

In Practice

PostgreSQL’s documented statistics collection (ANALYZE) builds histograms and most-common-value lists for each column. The planner uses these to estimate how many rows a condition will return. When statistics are stale — because a table has had many inserts or updates since the last ANALYZE — estimates are wrong and the planner may make a bad choice. PostgreSQL’s autovacuum runs ANALYZE automatically, but on very high-write tables it may not keep up.

The correlation value in pg_stats measures how well the physical order of rows in the heap matches the sort order of the column. A high correlation (near 1.0) means the column’s values are physically ordered and index scans are efficient; a correlation near 0 means index scans require many random reads.

Where It Breaks

Scenario	Problem	Fix
Index on low-cardinality column	Planner ignores the index; write overhead remains	Drop index; use partial index on the rare, selective values
Stale statistics on skewed data	Planner underestimates matching rows; bad plan	Run `ANALYZE` manually; tune `default_statistics_target`
Index exists but has wrong correlation	Index used but causes excessive random I/O	Run `CLUSTER` on the table; or accept the random I/O as the cost of index use

What to Do Next

Problem: Low-cardinality indexes add write overhead and storage cost without improving read performance for queries that match a large fraction of rows.
Solution: Check pg_stats.n_distinct before creating an index; for low-cardinality columns, consider a partial index on the selective values only.
Proof: A partial index on pending orders will appear in EXPLAIN output for WHERE status = 'pending' queries and be ignored for WHERE status = 'shipped' queries — exactly the right selectivity-aware behavior.
Action: Run SELECT schemaname, tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes ORDER BY idx_scan ASC LIMIT 20; today and find your least-used indexes — candidates for review or removal.

Situation

The Problem

Selectivity and the Cost Model

When Low-Cardinality Indexes Work

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

Consistency Models Your Application Actually Needs

CAP Theorem in Operational Terms

Caches, Queues, and Databases: When to Use Each