Redo vs Undo: How Databases Recover from Crashes

When a database crashes mid-transaction, it has two problems: replay every committed change that did not make it to disk, and remove every uncommitted change that did. These are solved by redo and undo, and conflating them is how engineers misread crash recovery timelines.

Situation

Every ACID database must survive a crash and return to a consistent state. After a crash, some committed transactions may not have flushed their data pages to disk (they were in the buffer cache). Some uncommitted transactions may have partially written data pages. The recovery process must handle both cases.

The standard model — used by PostgreSQL, Oracle, MySQL InnoDB, and SQL Server — divides recovery into two phases: redo and undo.

The Problem

Engineers monitoring a database restart after a crash often see recovery take longer than expected and cannot explain why. They see log messages about “replaying WAL” or “applying redo records” and assume that means the database is restoring from backup. It is not. It is doing normal crash recovery — and understanding the two phases explains why the timeline is what it is.

How long should crash recovery take, and what is the database actually doing during that time?

Redo: Bring Committed Changes Forward

Redo uses the write-ahead log (WAL in PostgreSQL, redo log in Oracle/MySQL) to replay every change since the last checkpoint, in log sequence order. The checkpoint is a known consistent point — all data pages at the checkpoint are guaranteed to be on disk.

After a crash, the database scans forward from the last checkpoint and replays each WAL record: insert a row here, update a column there, allocate a page. This brings data files forward to the state they would have been in if the crash had not happened. Redo does not distinguish between committed and uncommitted transactions — it applies all log records first.

-- PostgreSQL: see recovery progress during startup (from another session or log)
-- Check pg_waldump for log record analysis post-crash:
-- pg_waldump -p /var/lib/postgresql/data/pg_wal -s 0/1234ABCD

-- After recovery, confirm the database recovered to the right LSN:
SELECT pg_current_wal_lsn();

Redo is deterministic and bounded: it replays records from the checkpoint LSN to the end of the WAL. Recovery time is proportional to how far the WAL advanced past the last checkpoint — which is controlled by checkpoint_timeout and max_wal_size.

Undo: Roll Back Uncommitted Changes

After redo, the database contains a mix of committed and uncommitted changes. Undo scans the log in reverse and removes every change made by transactions that were not committed at the time of the crash. In PostgreSQL, this is handled implicitly by MVCC — uncommitted transaction row versions are simply invisible to new readers because their xmin was never marked committed. In InnoDB and Oracle, a separate undo log stores the before-images of rows that were modified by uncommitted transactions.

The operational implication: in InnoDB, recovery time includes the undo phase, which can be significant if a long-running uncommitted transaction modified many rows. PostgreSQL’s MVCC approach means undo is lazy — the dead rows persist and are cleaned up by vacuum later, trading immediate undo cost for deferred cleanup cost.

In Practice

PostgreSQL’s documented recovery model confirms that crash recovery replays WAL records from the last checkpoint. The time to recover is bounded by checkpoint_timeout (default: 5 minutes) and how aggressively the database was writing past the checkpoint. Oracle’s documented recovery model uses a dedicated undo tablespace where before-images are stored for rollback; the undo tablespace must be sized for the longest running uncommitted transaction.

Where It Breaks

Failure	Cause	Fix
Crash recovery takes 20+ minutes	Long checkpoint interval; heavy WAL generation past last checkpoint	Lower `checkpoint_timeout`; ensure checkpoints complete before the next starts
InnoDB recovery stuck on undo	Large uncommitted transaction at time of crash	Cannot be accelerated; undo must complete before DB opens
PostgreSQL bloat after crash	Uncommitted dead tuples not cleaned up	Normal — autovacuum will reclaim after recovery; no action needed

What to Do Next

Problem: Long crash recovery is almost always a checkpoint tuning problem — the database is redoing too much WAL because checkpoints were too infrequent.
Solution: Set checkpoint_timeout to 5 minutes or less; monitor pg_stat_bgwriter.checkpoints_timed vs checkpoints_req to confirm checkpoints complete on schedule.
Proof: After tuning, crash recovery tests in staging should complete in under 2 minutes for typical OLTP loads.
Action: Check your current checkpoint_timeout and calculate the worst-case redo window: SHOW checkpoint_timeout; SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')); — this bounds your maximum recovery time.

Situation

The Problem

Redo: Bring Committed Changes Forward

Undo: Roll Back Uncommitted Changes

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

Consistency Models Your Application Actually Needs

CAP Theorem in Operational Terms

Caches, Queues, and Databases: When to Use Each