MySQL Cardinality and Index Selectivity

MySQL can have a perfectly valid index on a column and still choose a full table scan — not because the optimizer is broken, but because the index is genuinely not worth using. Understanding cardinality and selectivity is what separates engineers who add indexes thoughtfully from those who add them and then wonder why EXPLAIN still shows type=ALL.

Situation

Most engineers learn early that indexes speed up queries. What the introductory materials skip is the optimizer’s decision logic: an index is only used when the optimizer estimates it will be cheaper than not using it. That estimate is driven by selectivity — how many rows the index is expected to filter out. A high-selectivity index on an email column eliminates nearly every row it does not match. A low-selectivity index on a status column with three possible values eliminates almost nothing, and the optimizer correctly concludes that scanning the whole table in a single sequential pass is cheaper than bouncing through the index structure.

This distinction matters most on large tables. On a 200-row test database, the optimizer often uses indexes it would ignore on a 50-million-row production table, because the cost model changes with scale. Engineers who tune queries against small datasets frequently miss the issue until the table grows.

The Problem

The failure mode is specific: you create an index, run EXPLAIN, and see type=ALL. The index exists. The query filters on the indexed column. But the optimizer ignores it. This confuses engineers who expect index presence to imply index use.

The root cause is low selectivity. If a status column has three values — active, inactive, deleted — and 60% of rows are active, an index on status where the query filters WHERE status = 'active' returns 60% of the table. InnoDB’s cost model estimates that reading 60% of a large table via random index lookups is more expensive than a sequential full scan, and it is usually right.

The second failure mode is stale cardinality estimates. InnoDB samples pages to estimate cardinality rather than counting exact distinct values. After a large bulk insert, a table truncate and reload, or months of accumulating rows, the stored cardinality estimate can be wildly wrong, causing the optimizer to make poor choices.

Why does the optimizer choose a full table scan despite an index, and how can engineers design indexes that the database will actually use?

Core Concept

Cardinality is the number of distinct values in an index, as estimated by InnoDB. Selectivity is the ratio of cardinality to total rows, driving the optimizer’s cost model.

flowchart TD
    A[Query filters by status] --> B{MySQL Optimizer}
    B --> C[Evaluate index — High random IO cost]
    B --> D[Evaluate table scan — Sequential IO cost]
    C --> E{Cost Model}
    D --> E
    E --> F[Table scan chosen]
    F --> G[Index ignored]

A selectivity of 0.99 (nearly unique column) is excellent. A selectivity of 0.000003 (three values across a million rows) is almost worthless for filtering.

You can query estimated selectivity directly:

SELECT
  s.INDEX_NAME,
  s.COLUMN_NAME,
  s.CARDINALITY,
  t.TABLE_ROWS,
  ROUND(s.CARDINALITY / t.TABLE_ROWS, 4) AS selectivity
FROM information_schema.STATISTICS s
JOIN information_schema.TABLES t
  ON s.TABLE_SCHEMA = t.TABLE_SCHEMA
  AND s.TABLE_NAME = t.TABLE_NAME
WHERE s.TABLE_SCHEMA = 'your_db'
  AND s.TABLE_NAME = 'your_table';

How InnoDB estimates cardinality: InnoDB uses random page sampling rather than a full scan. The number of pages sampled is controlled by innodb_stats_sample_pages and innodb_stats_persistent_sample_pages. Small samples on large tables with skewed data distributions produce inaccurate estimates.

Refreshing stale estimates: Running ANALYZE TABLE orders; re-runs the sampling process and updates the stored cardinality in mysql.innodb_table_stats. After bulk loads, table rebuilds, or significant data changes, running this is the fastest way to restore accurate optimizer decisions.

Composite indexes and leading column selectivity: A composite index on (status, created_at) is only useful when the query can filter on status first. If status has low selectivity, the optimizer may still prefer a full scan, unless the created_at range is exceptionally narrow.

In Practice

The documented pattern across high-scale engineering teams is to enforce strict index selectivity thresholds during schema reviews. Shopify’s engineering blog explicitly outlines their MySQL indexing strategy, noting that adding an index on a boolean or low-cardinality column is an anti-pattern. They observe that MySQL’s optimizer will frequently ignore these indexes because the random I/O required to fetch rows exceeds the sequential I/O cost of a full table scan.

Similarly, MySQL’s own InnoDB engine relies heavily on innodb_stats_persistent_sample_pages. If the sample pages do not accurately reflect the distribution of data — such as immediately following a massive backfill — the optimizer behaves unpredictably. The established behavior to combat this is hooking ANALYZE TABLE into post-migration automation to ensure the optimizer has fresh cardinality estimates before taking production traffic.

Where It Breaks

Scenario	What breaks	Why
Stale cardinality after bulk load	Optimizer uses wrong index or skips a valid one	Estimate reflects pre-load row distribution
Composite index with low-selectivity leading column	Index not entered even when tail columns are selective	Optimizer evaluates leading column selectivity first
FORCE INDEX overriding a correct low-selectivity decision	Query runs slower than a full scan would	Forces random I/O on a column that benefits from sequential scan

What to Do Next

Problem: An index exists but EXPLAIN shows type=ALL because selectivity is too low for the optimizer to prefer it over a full scan.
Solution: Check selectivity using the formula above; run ANALYZE TABLE after bulk data changes; design composite indexes with the most selective column first.
Proof: Compare EXPLAIN output before and after ANALYZE TABLE on a table with stale stats; watch type change from ALL to ref or range when the estimate is accurate.
Action: This week, run the selectivity query on your largest tables and verify that indexes on low-cardinality columns are intentional.

Situation

The Problem

Core Concept

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

Covering Indexes Are Not Enough Without Visibility

Prometheus + Grafana for Database Engineers: Open-Source Monitoring That Actually Works

Cassandra Observability: Compaction, Tombstones, Repair, Latency, and Hot Partitions