Double Write Buffers Fail at the I/O Boundary

A double write buffer only protects a database if the second write crosses the same durability boundary as the first; port InnoDB’s double write buffer into PostgreSQL without that boundary, and you have built a corruption machine with better comments.

Situation

AI coding agents are now good enough to produce plausible systems code inside mature engines like PostgreSQL. That changes the review problem: the first failure is no longer “does it compile?” but “does the generated design preserve the subsystem’s recovery invariants?”

The default PostgreSQL protection is write-ahead log (WAL) full page writes (FPW): after each checkpoint, the first modification of a page writes the whole page image into WAL. The tempting alternative is an InnoDB-style double write buffer (DWB): write a safe copy of the page elsewhere, flush it, then write the page to its final data-file location.

| Approach | Recovery copy | Durability boundary | Primary cost |
| --- | --- | --- | --- |
| PostgreSQL FPW | Full 8KB page image in WAL | WAL flush through `wal_sync_method` | Higher WAL volume after checkpoints |
| InnoDB DWB | Page copy in doublewrite files | DWB flush before final data-file write | Extra data writes and recovery state |
| Naive PostgreSQL DWB port | Page copy in a new buffer area | Often mistaken for `smgrwrite()` or `sync_file_range()` | Silent loss of the only safe copy |

The Problem

The non-obvious failure is that InnoDB’s DWB and PostgreSQL’s FPW solve the same torn-page problem under different I/O contracts. MySQL documents InnoDB’s DWB as a storage area where pages are written before they go to their proper locations in the data files, with the doublewrite data written as one large sequential chunk and a single `fsync()` call in the default configuration (MySQL 8.0 manual). PostgreSQL documents FPW as necessary because an operating-system crash can leave a page containing a mix of old and new data, and row-level WAL alone cannot repair that page (PostgreSQL WAL settings).

The dangerous part is that the APIs look boring: `write()`, `fsync()`, `sync_file_range()`, the background writer, the checkpointer. An AI agent can assemble those names into code that resembles a storage feature. The database will still start. Basic tests will still pass. Then the first crash at the wrong microsecond becomes your design review.

| Failure point | What breaks | Why it matters |
| --- | --- | --- |
| `smgrwrite()` treated as durable | PostgreSQL has handed bytes to the kernel page cache, not necessarily persistent media | A DWB slot can be reused before the destination page is safe |
| `sync_file_range()` treated as `fsync()` | Linux documents `SYNC_FILE_RANGE_WRITE` as asynchronous and warns it is not suitable for data integrity operations (man7) | The code can believe flushing started when recovery needs proof flushing finished |
| BgWriter given synchronous DWB work | `bgwriter_delay` defaults to 200ms and `bgwriter_lru_maxpages` bounds per-round writes in PostgreSQL’s background writer design (PostgreSQL resource settings) | A process designed to smooth dirty-buffer pressure becomes an fsync bottleneck |
| FPW removed before DWB proves equivalence | PostgreSQL’s `full_page_writes` default is `on`, and docs warn disabling it can cause unrecoverable or silent corruption after failure | You save WAL bytes by deleting the recovery source of truth |
| Slot metadata reused early | The page copy may be durable, but the mapping from page identity to DWB slot is no longer valid | The hardest corruption is not a torn page; it is confidence in a backup you already overwrote |

The core question is not whether PostgreSQL can have a double write buffer. It is whether the design can prove, at every crash point, that either WAL or DWB contains a complete page image newer than the torn data-file page.
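
That question can be asked mechanically. The following is a minimal sketch of a crash-state check, assuming a simplified model in which the DWB fully replaces FPW (so WAL holds no full page image) and a data-file page counts as possibly torn from the moment its write is issued until its fsync completes. The step names and the `recoverable` rule are illustrative assumptions, not PostgreSQL or InnoDB code.

```python
# Hypothetical step orderings; the names are illustrative, not real APIs.
SAFE_ORDER = ["dwb_write", "dwb_fsync", "data_write", "data_fsync", "slot_reclaim"]
NAIVE_ORDER = ["dwb_write", "dwb_fsync", "data_write", "slot_reclaim", "data_fsync"]


def recoverable(done):
    """Return True if a crash after the steps in `done` leaves a complete page image."""
    # Assumption: a page write that was issued but not yet synced may be torn.
    page_may_be_torn = "data_write" in done and "data_fsync" not in done
    # The DWB copy is trustworthy only after its fsync and before slot reuse.
    dwb_copy_durable = "dwb_fsync" in done and "slot_reclaim" not in done
    return (not page_may_be_torn) or dwb_copy_durable


def unsafe_crash_points(order):
    """List every prefix of `order` after which a crash loses the page."""
    bad = []
    for k in range(len(order) + 1):
        completed = order[:k]
        if not recoverable(set(completed)):
            bad.append(tuple(completed))
    return bad


if __name__ == "__main__":
    print("safe order :", unsafe_crash_points(SAFE_ORDER))
    print("naive order:", unsafe_crash_points(NAIVE_ORDER))
```

Under these assumptions the safe ordering has no unsafe crash point, while the naive ordering has exactly one: the window after the slot is reclaimed and before the data file is synced. A real matrix would add WAL flush state and checkpoint records as further steps.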

Core Concept

A correct PostgreSQL DWB design has to be staged around recovery truth, not modeled as an extra function call in `FlushBuffer()`. The invariant is simple enough to write on a whiteboard: do not reuse the DWB slot until the final page location has been confirmed durable after the page write.

flowchart TD
    Dirty[dirty buffer selected] --> Copy[copy page to DWB slot]
    Copy --> DwbFsync[fsync DWB file]
    DwbFsync --> WalCheck[confirm WAL ordering]
    WalCheck --> DataWrite[write page to tablespace]
    DataWrite --> DataSync[fsync tablespace file]
    DataSync --> Reclaim[reclaim DWB slot]
    Crash[crash recovery] --> Inspect[inspect page checksum and LSN]
    Inspect -->|page torn| Restore[restore from DWB or WAL]
    Inspect -->|page valid| Replay[continue WAL replay]
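
The write half of that flow can be sketched with plain POSIX calls. This is a minimal illustration, assuming one DWB file and one data file addressed by fixed-size slots and blocks; `flush_page_via_dwb`, `pwrite_all`, and the `wal_flushed_lsn` parameter are hypothetical names, and none of this is how PostgreSQL's storage manager is actually structured.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


def pwrite_all(fd, data, offset):
    """Write every byte of `data` at `offset`, retrying on short writes."""
    view = memoryview(data)
    while view:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def flush_page_via_dwb(dwb_fd, data_fd, slot, block_no, page,
                       page_lsn, wal_flushed_lsn):
    """Hypothetical write path in flowchart order; returns the reclaimable slot."""
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    # Copy the page into its DWB slot and make that copy durable first.
    pwrite_all(dwb_fd, page, slot * PAGE_SIZE)
    os.fsync(dwb_fd)
    # WAL ordering: the WAL covering this page version must already be flushed.
    if page_lsn > wal_flushed_lsn:
        raise RuntimeError("WAL not flushed up to page LSN")
    # Write the page to its final location; this only reaches the page cache.
    pwrite_all(data_fd, page, block_no * PAGE_SIZE)
    # The destination must cross its own sync boundary before slot reuse.
    os.fsync(data_fd)
    # Only now may the caller hand the slot to another page.
    return slot
```

The point of returning the slot at the very end is that nothing earlier in the function gives the caller permission to reuse it. A per-page pair of fsyncs like this is far too slow for production; it only makes the ordering visible.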

1. Define the authoritative recovery copy per page version.
   If FPW remains enabled, WAL is authoritative for first-touch pages after checkpoint. If DWB is intended to replace FPW, the DWB slot plus metadata must become authoritative. Verification: write a crash-state matrix for DWB write, DWB fsync, tablespace write, tablespace fsync, checkpoint record, and slot reuse.
2. Separate page copy from durability confirmation.
   Copying an 8KB PostgreSQL page into a DWB slot is not the expensive part. The expensive part is proving that copy is on persistent storage, with its page identity, block number, relation fork, page LSN, and checksum intact. Verification: a crash after DWB copy but before DWB fsync must recover from WAL or ignore the incomplete DWB entry.
3. Delay slot reuse until the destination file crosses a real sync boundary.
   In PostgreSQL’s buffered I/O model, a successful data-file write is not enough. `sync_file_range()` can start writeback, but Linux explicitly does not make it a portable crash-safety primitive. Verification: a crash after tablespace write but before tablespace fsync must still find the DWB slot valid.
4. Keep synchronous I/O out of the single BgWriter loop.
   PostgreSQL spreads checkpoint writes over time with `checkpoint_completion_target`, defaulting to 0.9 in current releases, specifically to avoid bursty I/O (PostgreSQL checkpoint settings). A DWB implementation needs a manager, batched slots, and completion accounting, not a per-buffer fsync in the background writer. Verification: track `buffers_backend`, checkpoint duration, WAL generation, and p99 write latency under `pgbench` before and after enabling the prototype.
5. Make recovery boring.
   Recovery must not infer intent from partially updated state. It should read DWB metadata, validate checksums and LSNs, restore only complete entries, and ignore anything whose durability boundary was not crossed. Verification: run crash injection at every transition, including slot metadata update and slot reuse.
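
The recovery rule in the last step can be sketched as a pure decision function. This is a minimal illustration, assuming a made-up slot layout (relation id, fork, block number, page LSN, page bytes, then a CRC-32 over all of it); the layout, `encode_slot`, `decode_slot`, and `page_to_restore` are hypothetical and do not correspond to any InnoDB or PostgreSQL on-disk format.

```python
import struct
import zlib

PAGE_SIZE = 8192
# Hypothetical slot header: relation id, fork number, block number, page LSN.
HEADER = struct.Struct("<IIIQ")
CRC = struct.Struct("<I")


def encode_slot(rel, fork, block, lsn, page):
    """Serialize identity, LSN, and page bytes, followed by a CRC over all of them."""
    body = HEADER.pack(rel, fork, block, lsn) + page
    return body + CRC.pack(zlib.crc32(body))


def decode_slot(raw):
    """Return (rel, fork, block, lsn, page), or None if the slot is incomplete or corrupt."""
    if len(raw) != HEADER.size + PAGE_SIZE + CRC.size:
        return None
    body = raw[:-CRC.size]
    (stored_crc,) = CRC.unpack(raw[-CRC.size:])
    if zlib.crc32(body) != stored_crc:
        return None
    rel, fork, block, lsn = HEADER.unpack(body[:HEADER.size])
    return rel, fork, block, lsn, body[HEADER.size:]


def page_to_restore(slot_raw, expected_id, data_page_ok, wal_end_lsn):
    """Return the page bytes recovery should restore, or None to leave the page alone."""
    entry = decode_slot(slot_raw)
    if entry is None:
        return None  # slot never crossed its durability boundary: ignore it
    rel, fork, block, lsn, page = entry
    if (rel, fork, block) != expected_id:
        return None  # stale mapping: the slot belongs to a different page
    if lsn > wal_end_lsn:
        return None  # recovery cannot order this entry against WAL
    if data_page_ok:
        return None  # data-file page passed its own checksum: nothing to repair
    return page
```

Every branch that is not a clean, fully validated match returns `None`, which is the "boring" property: the function never guesses. Covering the identity fields with the same checksum as the page bytes is one way to keep metadata and page from drifting apart.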

In Practice

The documented comparison is already enough to reject the naive port.

PostgreSQL’s own documentation says `full_page_writes` stores the whole disk page in WAL on the first modification after checkpoint because a torn data page cannot be repaired from row-level WAL alone. It also states the default is `on` and that disabling it can lead to unrecoverable or silent corruption after a system failure. That is not a tuning hint. That is a contract.

MySQL’s InnoDB documentation describes a different contract: pages flushed from the buffer pool are first written to the doublewrite area, and crash recovery can use that good copy if the final data-file write was interrupted. Since MySQL 8.0.20, those doublewrite pages live in doublewrite files rather than the old system tablespace location; since MySQL 8.0.30, `innodb_doublewrite` also supports `DETECT_AND_RECOVER` and `DETECT_ONLY`. The design is not merely “write the page twice.” It is “write the page twice with ordered recovery metadata and a known flush point.”

The documented pattern is clear: if generated code reclaims a DWB slot after `smgrwrite()` or after an advisory range flush, it has confused a buffered write with a durable write. That is enough to violate the recovery invariant. The system can lose the durable DWB copy while the data-file page is still only dirty kernel state.

This is exactly where AI-assisted systems work gets risky. Language models are strong at local similarity: InnoDB has a DWB, PostgreSQL has dirty pages, both have write paths, so assemble the bridge. But storage engines are not CRUD apps with worse naming. The important behavior lives between process architecture, kernel writeback, filesystem semantics, WAL ordering, and the crash replay path. The code shape is the least interesting part.

Where It Breaks

| Failure mode | Trigger | Fix |
| --- | --- | --- |
| Premature DWB slot reuse | Slot is freed after `smgrwrite()` returns on PostgreSQL with buffered I/O | Reclaim only after confirmed destination `fsync()` or equivalent durable sync after the page write |
| False confidence from `sync_file_range()` | Linux `SYNC_FILE_RANGE_WRITE` starts asynchronous writeback and does not flush volatile disk caches | Use it only as a writeback hint; keep `fsync()` or `fdatasync()` as the durability boundary |
| BgWriter latency collapse | Per-page DWB fsync added to a loop governed by `bgwriter_delay` and `bgwriter_lru_maxpages` | Move DWB fsync into batched workers with completion queues and backpressure |
| Checkpoint storms | DWB fsync work prevents dirty buffers from being cleaned ahead of checkpoints | Budget DWB throughput against `checkpoint_completion_target`, `max_wal_size`, and observed checkpoint sync time |
| WAL invariant drift | DWB metadata claims protection for a page whose WAL record was not flushed in the expected order | Tie DWB entries to page LSNs and WAL flush state; reject entries recovery cannot order |
| Recovery ambiguity | DWB slot has page bytes but stale relation, fork, block, checksum, or LSN metadata | Make metadata durable with the slot and validate all identifiers before restore |
| Misleading benchmark win | FPW disabled on a clean shutdown benchmark with no crash injection | Require power-fail tests, torn-page injection, and recovery validation before comparing WAL volume |
| Version-specific InnoDB copying | MySQL 8.0.20 moved DWB storage to doublewrite files; older mental models still cite `ibdata1` | Treat engine version as part of the design, not trivia |
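
The batching fix in the table can be illustrated with a small sketch: amortize the two sync boundaries over a whole batch of pages, and refuse new pages when no slot is free. `BatchedDwb` and its methods are hypothetical names; the sketch assumes a single DWB file and a single data file, ignores short writes, and leaves out the worker processes and completion queues a real design would need.

```python
import os

PAGE_SIZE = 8192


class BatchedDwb:
    """Hypothetical batch flusher: two fsyncs per batch instead of two per page."""

    def __init__(self, dwb_fd, data_fd, n_slots):
        self.dwb_fd = dwb_fd
        self.data_fd = data_fd
        self.free = list(range(n_slots))
        self.pending = []  # (slot, block_no, page) tuples awaiting flush
        self.fsyncs = 0

    def _sync(self, fd):
        os.fsync(fd)
        self.fsyncs += 1

    def add(self, block_no, page):
        """Queue a page; return False when the caller must flush first (backpressure)."""
        if not self.free:
            return False
        self.pending.append((self.free.pop(), block_no, page))
        return True

    def flush(self):
        """Flush the batch in DWB-then-data order; return the number of pages flushed."""
        if not self.pending:
            return 0
        # Stage every page copy, then cross the DWB sync boundary once.
        for slot, _, page in self.pending:
            os.pwrite(self.dwb_fd, page, slot * PAGE_SIZE)
        self._sync(self.dwb_fd)
        # Write final locations, then cross the data-file sync boundary once.
        for _, block_no, page in self.pending:
            os.pwrite(self.data_fd, page, block_no * PAGE_SIZE)
        self._sync(self.data_fd)
        # Slots become reusable only after the data-file fsync has returned.
        self.free.extend(slot for slot, _, _ in self.pending)
        flushed = len(self.pending)
        self.pending.clear()
        return flushed
```

The reclaim step sits after the second `_sync` call on purpose: moving it any earlier recreates the premature-reuse failure in the first row of the table. The `fsyncs` counter exists only so a test or benchmark can confirm the amortization.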

What to Do Next

Problem: AI-generated storage code can compile while breaking the only invariant that matters: after a crash, one complete page image must exist.
Solution: Review DWB as a recovery protocol with explicit durable states, not as a write-path optimization.
Proof: The validation signal is not a passing smoke test; it is crash injection across every DWB, WAL, tablespace write, fsync, checkpoint, and slot-reuse transition.
Action: This week, take one generated systems patch and write its durability matrix: recovery source of truth, sync boundary, reclaim condition, and invalid crash states.

A database does not care that the code looked like the reference architecture; it only cares which bytes survived the crash.

Rajiv
