MVCC Explained Like a Database Engineer
Most engineers know that MVCC means “readers don’t block writers.” What they miss is the operational consequence: those non-blocking reads are paid for with storage, and if you stop collecting the debt, the database starts degrading in ways that look nothing like a concurrency problem.
Situation
MVCC — Multi-Version Concurrency Control — is the concurrency model used by PostgreSQL, MySQL InnoDB, Oracle, CockroachDB, and most other production-grade relational databases. Inside a transaction, the database does not show you the current physical state of the rows; it shows a consistent snapshot as it existed at the moment your transaction started.
Engineers rely on this without thinking about it. The property they care about — “I can run a long analytical query on a busy OLTP table without blocking inserts” — comes directly from MVCC. But few have thought through what has to be true at the storage level for that property to hold.
The Problem
The concrete failure mode is table bloat in PostgreSQL after a heavy UPDATE or DELETE workload. Engineers see a table that is 40 GB on disk with only 8 GB of live data and conclude something is wrong with storage. The actual cause is MVCC: every UPDATE leaves the old version in place; every DELETE marks the row dead without removing it. Old versions accumulate until VACUUM reclaims them.
The less visible failure is more dangerous: a long-running read transaction — a reporting query left open, a replication slot that fell behind — prevents VACUUM from advancing. PostgreSQL can eventually hit transaction ID wraparound, an emergency that takes the cluster offline.
Where is the cost of “free” snapshot isolation actually hidden?
How MVCC Works
When a transaction writes a row, the database does not overwrite the existing bytes. It writes a new version stamped with the writer’s transaction ID, leaving the old version in place. Concurrent readers see the version that was current at transaction start. Snapshot isolation without locking — but two systems store those versions very differently, and the difference shapes every operational concern that follows.
PostgreSQL stores all versions — live and dead — directly in the heap files alongside current rows. UPDATE leaves the old version in the page; DELETE flags it dead but does not remove it. VACUUM (or AUTOVACUUM) scans the heap and marks dead tuples as reclaimable. It cannot advance past any row version that is still visible to an open transaction.
You can inspect the version metadata directly. xmin is the transaction ID that created the row; xmax is the transaction ID that deleted or updated it (0 if the row is live). ctid is the physical location in the heap file:
-- Inspect row versions in PostgreSQL
SELECT xmin, xmax, ctid, id
FROM your_table
LIMIT 10;
After a series of updates, you will see multiple heap entries for the same logical row — old versions with non-zero xmax, new versions with xmax = 0. These are the dead tuples VACUUM is responsible for reclaiming.
MySQL InnoDB keeps only the current version in the clustered index. Old versions go to the undo log; when a reader needs an older snapshot, InnoDB reconstructs it by applying undo entries in reverse. A background purge thread reclaims undo space once no active transaction needs those versions. The same pressure applies: long-running reads block the purge thread.
Oracle uses a dedicated undo tablespace. The undo_retention parameter sets a fixed consistency window — simpler cleanup at the cost of a hard expiry (ORA-01555: snapshot too old).
| Database | Where old versions live | Cleanup mechanism | Risk when cleanup stalls |
|---|---|---|---|
| PostgreSQL | Heap files (table data) | VACUUM — explicit or autovacuum | Table bloat, transaction ID wraparound |
| MySQL InnoDB | Undo log segments | Background purge thread | Undo log growth, purge lag |
| Oracle | Undo tablespace | Automatic undo management | ORA-01555 snapshot too old |
In Practice
PostgreSQL’s MVCC documentation (chapter 13, “Concurrency Control”) states directly that dead tuples are not reclaimed until VACUUM runs, and that VACUUM cannot remove a dead tuple if any transaction older than that tuple is still open — the documented mechanism behind bloat from long-running transactions.
MySQL’s InnoDB documentation (“InnoDB Multi-Versioning”) states that the purge thread deletes undo log records no longer needed by any consistent read, and that history list length — in SHOW ENGINE INNODB STATUS — grows when the purge thread falls behind.
Where It Breaks
| Scenario | What breaks | Why |
|---|---|---|
| Long-running read in PostgreSQL | Table bloat; VACUUM cannot advance past the open snapshot | PostgreSQL keeps every row version visible to any active transaction |
| Long-running read in MySQL InnoDB | Undo log grows; purge thread stalls | Purge thread cannot remove records still needed by open transactions |
| Transaction ID wraparound in PostgreSQL | Cluster enters emergency read-only mode | 32-bit XID wraps after ~2 billion transactions; VACUUM must freeze rows before the counter laps |
What to Do Next
- Problem: Long-running transactions block VACUUM and the InnoDB purge thread, causing table bloat and undo log growth that degrades the database without any concurrency alarm firing.
- Solution: Set
idle_in_transaction_session_timeoutin PostgreSQL; monitor InnoDB history list length inSHOW ENGINE INNODB STATUS. - Proof: In PostgreSQL,
pg_stat_activityshows open transactions withstate = 'idle in transaction'; in InnoDB, a rising history list length during write traffic is the direct signal. - Action: Run this query on your PostgreSQL instances this week to surface any sessions holding open transactions without actively executing:
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY duration DESC;
MVCC teaches the same lesson as most database internals: reads that look free are paid for somewhere. Knowing where is what lets you diagnose degradation instead of just observing it.