Redo vs Undo: How Databases Recover from Crashes
When a database crashes mid-transaction, it has two problems: replay every committed change that did not make it to disk, and remove every uncommitted change that did. These are solved by redo and undo, and conflating them is how engineers misread crash recovery timelines.
Situation
Every ACID database must survive a crash and return to a consistent state. After a crash, some committed transactions may not have flushed their data pages to disk (they were in the buffer cache). Some uncommitted transactions may have partially written data pages. The recovery process must handle both cases.
The standard model — used by PostgreSQL, Oracle, MySQL InnoDB, and SQL Server — divides recovery into two phases: redo and undo.
The Problem
Engineers monitoring a database restart after a crash often see recovery take longer than expected and cannot explain why. They see log messages about “replaying WAL” or “applying redo records” and assume that means the database is restoring from backup. It is not. It is doing normal crash recovery — and understanding the two phases explains why the timeline is what it is.
How long should crash recovery take, and what is the database actually doing during that time?
Redo: Bring Committed Changes Forward
Redo uses the write-ahead log (WAL in PostgreSQL, redo log in Oracle/MySQL) to replay every change since the last checkpoint, in log sequence order. The checkpoint is a known consistent point — all data pages at the checkpoint are guaranteed to be on disk.
After a crash, the database scans forward from the last checkpoint and replays each WAL record: insert a row here, update a column there, allocate a page. This brings data files forward to the state they would have been in if the crash had not happened. Redo does not distinguish between committed and uncommitted transactions — it applies all log records first.
-- PostgreSQL: see recovery progress during startup (from another session or log)
-- Check pg_waldump for log record analysis post-crash:
-- pg_waldump -p /var/lib/postgresql/data/pg_wal -s 0/1234ABCD
-- After recovery, confirm the database recovered to the right LSN:
SELECT pg_current_wal_lsn();
Redo is deterministic and bounded: it replays records from the checkpoint LSN to the end of the WAL. Recovery time is proportional to how far the WAL advanced past the last checkpoint — which is controlled by checkpoint_timeout and max_wal_size.
Undo: Roll Back Uncommitted Changes
After redo, the database contains a mix of committed and uncommitted changes. Undo scans the log in reverse and removes every change made by transactions that were not committed at the time of the crash. In PostgreSQL, this is handled implicitly by MVCC — uncommitted transaction row versions are simply invisible to new readers because their xmin was never marked committed. In InnoDB and Oracle, a separate undo log stores the before-images of rows that were modified by uncommitted transactions.
The operational implication: in InnoDB, recovery time includes the undo phase, which can be significant if a long-running uncommitted transaction modified many rows. PostgreSQL’s MVCC approach means undo is lazy — the dead rows persist and are cleaned up by vacuum later, trading immediate undo cost for deferred cleanup cost.
In Practice
PostgreSQL’s documented recovery model confirms that crash recovery replays WAL records from the last checkpoint. The time to recover is bounded by checkpoint_timeout (default: 5 minutes) and how aggressively the database was writing past the checkpoint. Oracle’s documented recovery model uses a dedicated undo tablespace where before-images are stored for rollback; the undo tablespace must be sized for the longest running uncommitted transaction.
Where It Breaks
| Failure | Cause | Fix |
|---|---|---|
| Crash recovery takes 20+ minutes | Long checkpoint interval; heavy WAL generation past last checkpoint | Lower checkpoint_timeout; ensure checkpoints complete before the next starts |
| InnoDB recovery stuck on undo | Large uncommitted transaction at time of crash | Cannot be accelerated; undo must complete before DB opens |
| PostgreSQL bloat after crash | Uncommitted dead tuples not cleaned up | Normal — autovacuum will reclaim after recovery; no action needed |
What to Do Next
- Problem: Long crash recovery is almost always a checkpoint tuning problem — the database is redoing too much WAL because checkpoints were too infrequent.
- Solution: Set
checkpoint_timeoutto 5 minutes or less; monitorpg_stat_bgwriter.checkpoints_timedvscheckpoints_reqto confirm checkpoints complete on schedule. - Proof: After tuning, crash recovery tests in staging should complete in under 2 minutes for typical OLTP loads.
- Action: Check your current
checkpoint_timeoutand calculate the worst-case redo window:SHOW checkpoint_timeout; SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0'));— this bounds your maximum recovery time.