Double Write Buffers Fail at the I/O Boundary
Content reflects the state as of February 2025. AI tooling and model capabilities in this area change frequently.
A double write buffer only protects a database if the second write crosses the same durability boundary as the first; port InnoDB’s double write buffer into PostgreSQL without that boundary, and you have built a corruption machine with better comments.
Situation
AI coding agents are now good enough to produce plausible systems code inside mature engines like PostgreSQL. That changes the review problem: the first question is no longer “does it compile?” but “does the generated design preserve the subsystem’s recovery invariants?”
The default PostgreSQL protection is write-ahead log (WAL) full page writes (FPW): after each checkpoint, the first modification of a page writes the whole page image into WAL. The tempting alternative is an InnoDB-style double write buffer (DWB): write a safe copy of the page elsewhere, flush it, then write the page to its final data-file location.
| Approach | Recovery copy | Durability boundary | Primary cost |
|---|---|---|---|
| PostgreSQL FPW | Full 8KB page image in WAL | WAL flush through wal_sync_method | Higher WAL volume after checkpoints |
| InnoDB DWB | Page copy in doublewrite files | DWB flush before final data-file write | Extra data writes and recovery state |
| Naive PostgreSQL DWB port | Page copy in a new buffer area | Often mistaken for smgrwrite() or sync_file_range() | Silent loss of the only safe copy |
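The InnoDB-style ordering in the table can be sketched with ordinary file calls. This is a minimal illustration, assuming a POSIX system where os.pwrite and os.fsync are available; write_page_with_dwb, the single-slot layout, and the file names are hypothetical and are not PostgreSQL or InnoDB code.

```python
import os
import tempfile

PAGE_SIZE = 8192  # PostgreSQL's default block size


def write_page_with_dwb(dwb_fd: int, data_fd: int, block_no: int, page: bytes) -> None:
    """Write one page using the copy, flush, write, flush ordering.

    Hypothetical helper: a real engine would batch slots and store metadata.
    """
    assert len(page) == PAGE_SIZE
    # 1. Write the safe copy into the DWB file (slot 0 in this sketch).
    os.pwrite(dwb_fd, page, 0)
    # 2. Cross the durability boundary for the copy before touching the data file.
    os.fsync(dwb_fd)
    # 3. Only now write the page to its final location.
    os.pwrite(data_fd, page, block_no * PAGE_SIZE)
    # 4. The slot may be reused only after this second fsync returns.
    os.fsync(data_fd)


def demo() -> bytes:
    """Write block 3 through the DWB path and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        dwb_fd = os.open(os.path.join(tmp, "dwb"), os.O_RDWR | os.O_CREAT, 0o600)
        data_fd = os.open(os.path.join(tmp, "rel"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            page = bytes([7]) * PAGE_SIZE
            write_page_with_dwb(dwb_fd, data_fd, block_no=3, page=page)
            return os.pread(data_fd, PAGE_SIZE, 3 * PAGE_SIZE)
        finally:
            os.close(dwb_fd)
            os.close(data_fd)
```

The point of the sketch is the position of the two fsync calls, not the writes: dropping either one turns the second copy into a hint rather than a guarantee.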
The Problem
The non-obvious failure is that InnoDB’s DWB and PostgreSQL’s FPW solve the same torn-page problem under different I/O contracts. MySQL documents InnoDB’s DWB as a storage area written before pages go to their proper locations, with a single fsync() for the doublewrite chunk in the normal design (MySQL 8.0 manual). PostgreSQL documents FPW as necessary because an operating-system crash can leave a page containing a mix of old and new data, and row-level WAL alone cannot repair that page (PostgreSQL WAL settings).
The dangerous part is that the APIs look boring. write(), fsync(), sync_file_range(), background writer, checkpointer. An AI agent can assemble those names into code that resembles a storage feature. The database will still start. Basic tests will still pass. Then the first crash at the wrong microsecond becomes your design review.
| Failure point | What breaks | Why it matters |
|---|---|---|
| smgrwrite() treated as durable | PostgreSQL has handed bytes to the kernel page cache, not necessarily persistent media | A DWB slot can be reused before the destination page is safe |
| sync_file_range() treated as fsync() | Linux documents SYNC_FILE_RANGE_WRITE as asynchronous and warns it is not suitable for data integrity operations (man7) | The code can believe flushing started when recovery needs proof flushing finished |
| BgWriter given synchronous DWB work | bgwriter_delay defaults to 200ms and bgwriter_lru_maxpages bounds per-round writes in PostgreSQL’s background writer design (PostgreSQL resource settings) | A process designed to smooth dirty-buffer pressure becomes an fsync bottleneck |
| FPW removed before DWB proves equivalence | PostgreSQL’s full_page_writes default is on, and docs warn disabling it can cause unrecoverable or silent corruption after failure | You save WAL bytes by deleting the recovery source of truth |
| Slot metadata reused early | The page copy may be durable, but the mapping from page identity to DWB slot is no longer valid | The hardest corruption is not a torn page; it is confidence in a backup you already overwrote |
The core question is not whether PostgreSQL can have a double write buffer. It is whether the design can prove, at every crash point, that either WAL or DWB contains a complete page image newer than the torn data-file page.
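One way to make that question reviewable is to write it as a predicate over what recovery can see for a single page. The sketch below is a simplified model under stated assumptions: CrashSnapshot and its fields are hypothetical names, a flushed WAL full-page image is assumed to be restorable and rolled forward by later WAL records, and a DWB copy counts only if it is complete, fsynced, and at least as new as the version being written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrashSnapshot:
    """What recovery can observe for one page after a crash (hypothetical model)."""
    data_page_torn: bool
    pending_write_lsn: int        # page LSN of the version that was being written
    wal_has_full_image: bool      # flushed full-page image since the last checkpoint
    dwb_image_lsn: Optional[int]  # page LSN of a complete, fsynced DWB copy, if any


def page_is_recoverable(snap: CrashSnapshot) -> bool:
    """True if the page is intact or some durable full image can replace it."""
    if not snap.data_page_torn:
        return True
    if snap.wal_has_full_image:
        # Restore the image from WAL, then replay later records on top of it.
        return True
    # Without FPW, the DWB copy is the only safe source for the torn page.
    return snap.dwb_image_lsn is not None and snap.dwb_image_lsn >= snap.pending_write_lsn
```

A design review can then ask for the snapshot at each crash point and check that this predicate never returns False.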
Core Concept
A correct PostgreSQL DWB design has to be staged around recovery truth, not modeled as an extra function call in FlushBuffer(). The invariant is simple enough to write on a whiteboard: do not reuse the DWB slot until the final page location has been confirmed durable after the page write.
flowchart TD
Dirty[dirty buffer selected] --> Copy[copy page to DWB slot]
Copy --> DwbFsync[fsync DWB file]
DwbFsync --> WalCheck[confirm WAL ordering]
WalCheck --> DataWrite[write page to tablespace]
DataWrite --> DataSync[fsync tablespace file]
DataSync --> Reclaim[reclaim DWB slot]
Crash[crash recovery] --> Inspect[inspect page checksum and LSN]
Inspect -->|page torn| Restore[restore from DWB or WAL]
Inspect -->|page valid| Replay[continue WAL replay]
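The write path in the flowchart can also be read as a small state machine for one DWB slot. The following is a sketch, not an implementation: SlotState, NEXT, and DwbSlot are hypothetical names, and the only property it enforces is that reclaim is legal from exactly one state, the one reached after the tablespace fsync.

```python
from enum import Enum, auto


class SlotState(Enum):
    FREE = auto()
    COPIED = auto()         # page bytes written to the DWB file, not yet durable
    DWB_SYNCED = auto()     # DWB file fsynced
    WAL_CONFIRMED = auto()  # WAL ordering for the page confirmed
    DATA_WRITTEN = auto()   # page handed to the kernel for the tablespace file
    DATA_SYNCED = auto()    # tablespace file fsynced


# Legal forward transitions, mirroring the flowchart; anything else is a bug.
NEXT = {
    SlotState.FREE: SlotState.COPIED,
    SlotState.COPIED: SlotState.DWB_SYNCED,
    SlotState.DWB_SYNCED: SlotState.WAL_CONFIRMED,
    SlotState.WAL_CONFIRMED: SlotState.DATA_WRITTEN,
    SlotState.DATA_WRITTEN: SlotState.DATA_SYNCED,
    SlotState.DATA_SYNCED: SlotState.FREE,
}


class DwbSlot:
    """Tracks one slot and refuses out-of-order transitions."""

    def __init__(self) -> None:
        self.state = SlotState.FREE

    def advance(self, target: SlotState) -> None:
        if NEXT[self.state] is not target:
            raise RuntimeError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target

    def reclaim(self) -> None:
        # Only DATA_SYNCED may move to FREE, so early reuse raises here.
        self.advance(SlotState.FREE)
```

Calling reclaim() while the slot is still in DATA_WRITTEN raises, which is the naive port's bug turned into an assertion.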
- Define the authoritative recovery copy per page version.
  If FPW remains enabled, WAL is authoritative for first-touch pages after checkpoint. If DWB is intended to replace FPW, the DWB slot plus metadata must become authoritative. Verification: write a crash-state matrix for DWB write, DWB fsync, tablespace write, tablespace fsync, checkpoint record, and slot reuse.
- Separate page copy from durability confirmation.
  Copying an 8KB PostgreSQL page into a DWB slot is not the expensive part. The expensive part is proving that copy is on persistent storage, with its page identity, block number, relation fork, page LSN, and checksum intact. Verification: a crash after DWB copy but before DWB fsync must recover from WAL or ignore the incomplete DWB entry.
- Delay slot reuse until the destination file crosses a real sync boundary.
  In PostgreSQL’s buffered I/O model, a successful data-file write is not enough. sync_file_range() can start writeback, but the Linux man page explicitly warns against treating it as a portable crash-safety primitive. Verification: a crash after tablespace write but before tablespace fsync must still find the DWB slot valid.
- Keep synchronous I/O out of the single BgWriter loop.
  PostgreSQL spreads checkpoint writes over time with checkpoint_completion_target, defaulting to 0.9 in current releases, specifically to avoid bursty I/O (PostgreSQL checkpoint settings). A DWB implementation needs a manager, batched slots, and completion accounting, not a per-buffer fsync in the background writer. Verification: track buffers_backend, checkpoint duration, WAL generation, and p99 write latency under pgbench before and after enabling the prototype.
- Make recovery boring.
  Recovery must not infer intent from partially updated state. It should read DWB metadata, validate checksums and LSNs, restore only complete entries, and ignore anything whose durability boundary was not crossed. Verification: run crash injection at every transition, including slot metadata update and slot reuse.
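The "restore only complete entries" rule can be sketched as a strict decoder for one DWB entry. Everything about the layout here is an assumption made for illustration: the 24-byte header, the PageId fields, and the use of CRC32 from zlib (PostgreSQL's own page checksum is a different algorithm). The function names encode_entry and decode_entry are hypothetical.

```python
import struct
import zlib
from typing import NamedTuple, Optional, Tuple

PAGE_SIZE = 8192
# Assumed header layout: relation id, fork, block, page LSN, CRC32 of the page.
HEADER = struct.Struct("<IIIQI")


class PageId(NamedTuple):
    relation: int
    fork: int
    block: int


def encode_entry(page_id: PageId, lsn: int, page: bytes) -> bytes:
    """Serialize identity, LSN, and checksum together with the page bytes."""
    return HEADER.pack(*page_id, lsn, zlib.crc32(page)) + page


def decode_entry(raw: bytes, expected: PageId) -> Optional[Tuple[int, bytes]]:
    """Return (lsn, page) only for a complete entry that matches `expected`."""
    if len(raw) != HEADER.size + PAGE_SIZE:
        return None  # truncated or oversized entry: ignore, never guess
    relation, fork, block, lsn, crc = HEADER.unpack_from(raw)
    page = raw[HEADER.size:]
    if PageId(relation, fork, block) != expected:
        return None  # slot was reused for a different page
    if zlib.crc32(page) != crc:
        return None  # page bytes do not match the recorded checksum
    return lsn, page
```

The decoder returns None for every doubtful case, so recovery falls back to WAL or reports the page as unrecoverable instead of restoring bytes it cannot vouch for.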
In Practice
The documented comparison is already enough to reject the naive port.
PostgreSQL’s own documentation says full_page_writes stores the whole disk page in WAL on the first modification after checkpoint because a torn data page cannot be repaired from row-level WAL alone. It also states the default is on and that disabling it can lead to unrecoverable or silent corruption after a system failure. That is not a tuning hint. That is a contract.
MySQL’s InnoDB documentation describes a different contract: pages flushed from the buffer pool are first written to the doublewrite area, and crash recovery can use that good copy if the final data-file write was interrupted. Since MySQL 8.0.20, those doublewrite pages live in doublewrite files rather than the old system tablespace location; since MySQL 8.0.30, innodb_doublewrite also supports DETECT_AND_RECOVER and DETECT_ONLY. The design is not merely “write the page twice.” It is “write the page twice with ordered recovery metadata and a known flush point.”
The documented pattern is clear: if generated code reclaims a DWB slot after smgrwrite() or after an advisory range flush, it has confused a buffered write with a durable write. That is enough to violate the recovery invariant. The system can lose the durable DWB copy while the data-file page is still only dirty kernel state.
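The difference between the two reclaim rules shows up in a small crash-injection model. This is a toy simulation under explicit assumptions: FPW is off, so the DWB slot is the only safe copy; any write that was not fsynced is treated as torn at crash time; and page A's slot is immediately reused for page B. The function names and step labels are hypothetical.

```python
from typing import List


def schedule(reclaim_after: str) -> List[str]:
    """Order of I/O steps when slot 0 is reused for page B."""
    first = ["dwb_write_A", "dwb_fsync", "data_write_A"]
    reuse = ["dwb_write_B", "dwb_fsync"]
    if reclaim_after == "write":
        # Naive port: slot is freed as soon as the buffered write returns.
        return first + reuse + ["data_fsync_A"]
    # Correct rule: slot is freed only after the tablespace fsync.
    return first + ["data_fsync_A"] + reuse


def recoverable_after_crash(steps_done: List[str]) -> bool:
    """Can page A be recovered if the crash happens after `steps_done`?"""
    slot = None        # durable contents of slot 0
    slot_dirty = None  # page written into the slot but not yet fsynced
    data = "old"       # "old", "in_flight", or "durable"
    for step in steps_done:
        if step.startswith("dwb_write_"):
            slot_dirty = step[-1]
        elif step == "dwb_fsync" and slot_dirty is not None:
            slot, slot_dirty = slot_dirty, None
        elif step == "data_write_A":
            data = "in_flight"
        elif step == "data_fsync_A":
            data = "durable"
    if slot_dirty is not None:
        slot = None  # worst case: the in-flight overwrite tore the slot
    data_torn = data == "in_flight"  # worst case for an unsynced page write
    return (not data_torn) or slot == "A"


def unsafe_crash_points(reclaim_after: str) -> List[int]:
    """Indexes i such that crashing after the first i steps loses page A."""
    steps = schedule(reclaim_after)
    return [i for i in range(len(steps) + 1) if not recoverable_after_crash(steps[:i])]
```

In this model, unsafe_crash_points("write") reports two losing crash points, both inside the window where the slot has been handed to page B while page A is still unsynced, and unsafe_crash_points("fsync") reports none.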
This is exactly where AI-assisted systems work gets risky. Language models are strong at local similarity: InnoDB has a DWB, PostgreSQL has dirty pages, both have write paths, so assemble the bridge. But storage engines are not CRUD apps with worse naming. The important behavior lives between process architecture, kernel writeback, filesystem semantics, WAL ordering, and the crash replay path. The code shape is the least interesting part.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Premature DWB slot reuse | Slot is freed after smgrwrite() returns on PostgreSQL with buffered I/O | Reclaim only after confirmed destination fsync() or equivalent durable sync after the page write |
| False confidence from sync_file_range() | Linux SYNC_FILE_RANGE_WRITE starts asynchronous writeback and does not flush volatile disk caches | Use it only as a writeback hint; keep fsync() or fdatasync() as the durability boundary |
| BgWriter latency collapse | Per-page DWB fsync added to a loop governed by bgwriter_delay and bgwriter_lru_maxpages | Move DWB fsync into batched workers with completion queues and backpressure |
| Checkpoint storms | DWB fsync work prevents dirty buffers from being cleaned ahead of checkpoints | Budget DWB throughput against checkpoint_completion_target, max_wal_size, and observed checkpoint sync time |
| WAL invariant drift | DWB metadata claims protection for a page whose WAL record was not flushed in the expected order | Tie DWB entries to page LSNs and WAL flush state; reject entries recovery cannot order |
| Recovery ambiguity | DWB slot has page bytes but stale relation, fork, block, checksum, or LSN metadata | Make metadata durable with the slot and validate all identifiers before restore |
| Misleading benchmark win | FPW disabled on a clean shutdown benchmark with no crash injection | Require power-fail tests, torn-page injection, and recovery validation before comparing WAL volume |
| Version-specific InnoDB copying | MySQL 8.0.20 moved DWB storage to doublewrite files; older mental models still cite ibdata1 | Treat engine version as part of the design, not trivia |
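The batching fix in the table can be illustrated with a writer that copies many pages and pays for one fsync per batch. This is a sketch assuming a POSIX system; BatchedDwbWriter, its batch_size parameter, and the returned completion list are hypothetical, and a real design would run this in dedicated workers rather than inline.

```python
import os
import tempfile
from typing import List, Tuple

PAGE_SIZE = 8192


class BatchedDwbWriter:
    """Hypothetical batching wrapper: many page copies, one fsync per batch."""

    def __init__(self, fd: int, batch_size: int) -> None:
        self.fd = fd
        self.batch_size = batch_size
        self.pending: List[int] = []  # slots written but not yet fsynced
        self.fsync_calls = 0

    def submit(self, slot: int, page: bytes) -> List[int]:
        """Copy a page into a slot; return the slots made durable by this call."""
        os.pwrite(self.fd, page, slot * PAGE_SIZE)
        self.pending.append(slot)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []

    def flush(self) -> List[int]:
        """Fsync once and report every slot covered by that fsync."""
        if not self.pending:
            return []
        os.fsync(self.fd)
        self.fsync_calls += 1
        done, self.pending = self.pending, []
        return done


def demo() -> Tuple[int, List[int]]:
    """Submit ten pages with a batch size of four and count the fsyncs."""
    with tempfile.TemporaryDirectory() as tmp:
        fd = os.open(os.path.join(tmp, "dwb"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            writer = BatchedDwbWriter(fd, batch_size=4)
            durable: List[int] = []
            for slot in range(10):
                durable += writer.submit(slot, bytes([slot]) * PAGE_SIZE)
            durable += writer.flush()
            return writer.fsync_calls, durable
        finally:
            os.close(fd)
```

Backpressure falls out of the return value: a caller may start the tablespace write for a page only once its slot appears in a completion list, so a slow fsync delays writers instead of silently weakening the guarantee.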
What to Do Next
- Problem: AI-generated storage code can compile while breaking the only invariant that matters: after a crash, one complete page image must exist.
- Solution: Review DWB as a recovery protocol with explicit durable states, not as a write-path optimization.
- Proof: The validation signal is not a passing smoke test; it is crash injection at every transition: DWB write, DWB fsync, WAL flush, tablespace write, tablespace fsync, checkpoint, and slot reuse.
- Action: This week, take one generated systems patch and write its durability matrix: recovery source of truth, sync boundary, reclaim condition, and invalid crash states.
A database does not care that the code looked like the reference architecture; it only cares which bytes survived the crash.