WAL Explained for Database Engineers
Most database failures are not storage failures — they are sequence failures. The write-ahead log is the mechanism that enforces the right sequence, survives crashes, and underpins every form of replication.
Situation
Every write to a PostgreSQL, MySQL, or Oracle database passes through a write-ahead log before touching any data file. In PostgreSQL it is called the WAL. In Oracle and MySQL it is called the redo log. These are not backups. They are an ordered, append-only record of every change the database intends to make, written before the change is applied to data pages.
The WAL exists because durable writes and fast writes are in tension. Flushing a modified data page to disk on every commit is slow because pages are scattered across disk. Flushing a sequential log record is fast. The WAL lets the database acknowledge a commit once the log record is flushed, then write data pages asynchronously.
The Problem
Engineers who manage production databases often treat the WAL as a background detail — something that creates disk pressure and replication lag but is otherwise invisible. That assumption fails at the worst time: during crash recovery, when a replica falls behind, or when a restore from backup fails because the WAL sequence is incomplete.
Why does the WAL exist at the level of protocol, not just implementation — and what does a database engineer actually need to understand to reason about durability and replication?
The Durability Contract
The WAL is a promise: if the log record is flushed to disk, the change survives any subsequent crash. The database can lose the in-memory copy and the unflushed data page. The log record is enough to reconstruct both.
Each record in the WAL has a position — PostgreSQL calls it the LSN (log sequence number), Oracle calls it the SCN. Everything in the database is ordered by this position. Crash recovery replays WAL records in LSN order to bring data files forward from the last checkpoint to the point of failure.
-- PostgreSQL: current WAL write position
SELECT pg_current_wal_lsn();
-- Gap between what has been written and what has been flushed
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), pg_current_wal_flush_lsn()) AS unflushed_bytes;
-- Replication lag for each standby (on the primary)
SELECT application_name, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
Replication works because the WAL is a complete, ordered record of every change. Physical streaming replication ships WAL records from primary to standby, where they are replayed in LSN order. Logical replication decodes those records into SQL operations for cross-version or filtered replication.
In Practice
PostgreSQL’s documented behavior confirms that the WAL flush — not the data page flush — is what makes a commit durable. The synchronous_commit parameter controls this tradeoff explicitly: at on, a commit waits for WAL flush to replica; at local, it waits only for the local flush; at off, it returns before any flush, accepting a small window of data loss on crash. AWS Aurora’s architecture eliminates the data page shipping problem entirely — the primary sends only WAL records to the shared distributed storage layer, which handles durability across six copies without requiring physical standbys to apply full pages.
Where It Breaks
| Failure | Cause | Fix |
|---|---|---|
| Replication lag grows | WAL produced faster than standby replays | Tune standby I/O; investigate long-running transactions on primary |
| Disk full on primary | Inactive replication slot retaining WAL | Drop or advance the stale slot: SELECT pg_drop_replication_slot('name') |
| Crash recovery takes hours | Checkpoint interval too long | Lower checkpoint_timeout; verify checkpoint_completion_target |
What to Do Next
- Problem: WAL accumulation and replication lag are the same upstream pressure: writes that the WAL pipeline cannot drain fast enough.
- Solution: Monitor LSN delta between primary and each standby; alert when the gap exceeds your RPO budget in bytes or time.
- Proof: After adding WAL lag monitoring, lag spikes will correlate with bulk loads, ETL jobs, and autovacuum catch-up cycles.
- Action: Run
SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained FROM pg_replication_slots;today and confirm no inactive slot is silently accumulating WAL on your primary.