A double write buffer only protects a database if the second write crosses the same durability boundary as the first; port InnoDB’s double write buffer into PostgreSQL without that boundary, and you have built a corruption machine with better comments.

Situation

AI coding agents are now good enough to produce plausible systems code inside mature engines like PostgreSQL. That changes the review problem: the first failure is no longer “does it compile?” but “does the generated design preserve the subsystem’s recovery invariants?”

The default PostgreSQL protection is write-ahead log (WAL) full page writes (FPW): after each checkpoint, the first modification of a page writes the whole page image into WAL. The tempting alternative is an InnoDB-style double write buffer (DWB): write a safe copy of the page elsewhere, flush it, then write the page to its final data-file location.

| Approach | Recovery copy | Durability boundary | Primary cost |
| --- | --- | --- | --- |
| PostgreSQL FPW | Full 8KB page image in WAL | WAL flush through wal_sync_method | Higher WAL volume after checkpoints |
| InnoDB DWB | Page copy in doublewrite files | DWB flush before final data-file write | Extra data writes and recovery state |
| Naive PostgreSQL DWB port | Page copy in a new buffer area | Often assumed to be the return of smgrwrite() or sync_file_range() | Silent loss of the only safe copy |

The Problem

The non-obvious failure is that InnoDB’s DWB and PostgreSQL’s FPW solve the same torn-page problem under different I/O contracts. MySQL documents InnoDB’s DWB as a storage area written before pages go to their proper locations, with a single fsync() for the doublewrite chunk in the normal design (MySQL 8.0 manual). PostgreSQL documents FPW as necessary because an operating-system crash can leave a page containing a mix of old and new data, and row-level WAL alone cannot repair that page (PostgreSQL WAL settings).

The dangerous part is that the APIs look boring. write(), fsync(), sync_file_range(), background writer, checkpointer. An AI agent can assemble those names into code that resembles a storage feature. The database will still start. Basic tests will still pass. Then the first crash at the wrong microsecond becomes your design review.

| Failure point | What breaks | Why it matters |
| --- | --- | --- |
| smgrwrite() treated as durable | PostgreSQL has handed bytes to the kernel page cache, not necessarily persistent media | A DWB slot can be reused before the destination page is safe |
| sync_file_range() treated as fsync() | Linux documents SYNC_FILE_RANGE_WRITE as asynchronous and warns it is not suitable for data integrity operations (man7) | The code can believe flushing started when recovery needs proof flushing finished |
| BgWriter given synchronous DWB work | bgwriter_delay defaults to 200ms and bgwriter_lru_maxpages bounds per-round writes in PostgreSQL’s background writer design (PostgreSQL resource settings) | A process designed to smooth dirty-buffer pressure becomes an fsync bottleneck |
| FPW removed before DWB proves equivalence | PostgreSQL’s full_page_writes default is on, and docs warn disabling it can cause unrecoverable or silent corruption after failure | You save WAL bytes by deleting the recovery source of truth |
| Slot metadata reused early | The page copy may be durable, but the mapping from page identity to DWB slot is no longer valid | The hardest corruption is not a torn page; it is confidence in a backup you already overwrote |

The core question is not whether PostgreSQL can have a double write buffer. It is whether the design can prove, at every crash point, that either WAL or DWB contains a complete page image newer than the torn data-file page.
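
One way to make that question concrete is to enumerate crash points mechanically. The sketch below is a toy model, not PostgreSQL code: the step names, the `crash_matrix` helper, and the assumptions that FPW is off and that the old on-disk page stays intact until the data-file write begins are all illustrative.

```python
# Toy crash-state model for one page flushed through a hypothetical DWB.
# Assumptions (not from PostgreSQL source): FPW is off, the old on-disk page
# is intact until data_write starts, and a crash happens right after an event.
STEPS = ["dwb_write", "dwb_fsync", "data_write", "data_fsync"]


def crash_matrix(reclaim_after):
    """Map each event to True if a complete page image survives a crash there."""
    sequence = []
    for step in STEPS:
        sequence.append(step)
        if step == reclaim_after:
            sequence.append("slot_reuse")  # slot overwritten by another page

    dwb_copy_durable = False
    data_page = "old"  # "old", "maybe_torn", or "new_durable"
    matrix = {}
    for event in sequence:
        if event == "dwb_fsync":
            dwb_copy_durable = True
        elif event == "data_write":
            data_page = "maybe_torn"  # kernel may persist any subset of sectors
        elif event == "data_fsync":
            data_page = "new_durable"
        elif event == "slot_reuse":
            dwb_copy_durable = False
        # Recoverable if the DWB copy survives or the data page is not torn.
        matrix[event] = dwb_copy_durable or data_page != "maybe_torn"
    return matrix


if __name__ == "__main__":
    for policy in ("data_write", "data_fsync"):
        print(policy, crash_matrix(policy))
```

Under these assumptions, reclaiming after `data_write` leaves exactly one unrecoverable state (the crash right after slot reuse), while reclaiming after `data_fsync` leaves none. A real matrix would also need rows for WAL flush, checkpoint records, and slot metadata updates.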

Core Concept

A correct PostgreSQL DWB design has to be staged around recovery truth, not modeled as an extra function call in FlushBuffer(). The invariant is simple enough to write on a whiteboard: do not reuse the DWB slot until the final page location has been confirmed durable after the page write.
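
The invariant can be sketched as an ordered sequence of writes and syncs. This is a minimal illustration using plain POSIX file calls from Python, not PostgreSQL's storage manager: `flush_page_via_dwb`, the file names, and the slot layout are hypothetical, the WAL ordering check is omitted, and error handling for a failed fsync is left out.

```python
import os
import tempfile

PAGE_SIZE = 8192  # PostgreSQL's default block size


def flush_page_via_dwb(dwb_fd, data_fd, slot, block_no, page):
    """Write one page through a hypothetical DWB slot.

    Returns True only once the slot may be reclaimed.
    """
    assert len(page) == PAGE_SIZE
    os.pwrite(dwb_fd, page, slot * PAGE_SIZE)        # 1. copy page into DWB slot
    os.fsync(dwb_fd)                                 # 2. DWB copy crosses the sync boundary
    os.pwrite(data_fd, page, block_no * PAGE_SIZE)   # 3. buffered write to final location
    os.fsync(data_fd)                                # 4. final location crosses the sync boundary
    return True                                      # 5. only now is slot reuse allowed


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        dwb_fd = os.open(os.path.join(tmp, "dwb"), os.O_RDWR | os.O_CREAT, 0o600)
        data_fd = os.open(os.path.join(tmp, "rel"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            page = bytes([7]) * PAGE_SIZE
            reclaimable = flush_page_via_dwb(dwb_fd, data_fd, slot=0, block_no=3, page=page)
            stored = os.pread(data_fd, PAGE_SIZE, 3 * PAGE_SIZE)
            return reclaimable, stored == page
        finally:
            os.close(dwb_fd)
            os.close(data_fd)
```

The point of the shape is what is absent: there is no path that returns before step 4. A per-page fsync like this would be far too slow in practice, which is why batching matters later.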

flowchart TD
    Dirty[dirty buffer selected] --> Copy[copy page to DWB slot]
    Copy --> DwbFsync[fsync DWB file]
    DwbFsync --> WalCheck[confirm WAL ordering]
    WalCheck --> DataWrite[write page to tablespace]
    DataWrite --> DataSync[fsync tablespace file]
    DataSync --> Reclaim[reclaim DWB slot]
    Crash[crash recovery] --> Inspect[inspect page checksum and LSN]
    Inspect -->|page torn| Restore[restore from DWB or WAL]
    Inspect -->|page valid| Replay[continue WAL replay]
  1. Define the authoritative recovery copy per page version.
    If FPW remains enabled, WAL is authoritative for first-touch pages after checkpoint. If DWB is intended to replace FPW, the DWB slot plus metadata must become authoritative. Verification: write a crash-state matrix for DWB write, DWB fsync, tablespace write, tablespace fsync, checkpoint record, and slot reuse.

  2. Separate page copy from durability confirmation.
    Copying an 8KB PostgreSQL page into a DWB slot is not the expensive part. The expensive part is proving that copy is on persistent storage, with its page identity, block number, relation fork, page LSN, and checksum intact. Verification: a crash after DWB copy but before DWB fsync must recover from WAL or ignore the incomplete DWB entry.

  3. Delay slot reuse until the destination file crosses a real sync boundary.
    In PostgreSQL’s buffered I/O model, a successful data-file write is not enough. sync_file_range() can start writeback, but the Linux man page warns that SYNC_FILE_RANGE_WRITE alone is asynchronous and not suitable for data integrity operations. Verification: a crash after tablespace write but before tablespace fsync must still find the DWB slot valid.

  4. Keep synchronous I/O out of the single BgWriter loop.
    PostgreSQL spreads checkpoint writes over time with checkpoint_completion_target, which has defaulted to 0.9 since PostgreSQL 14, specifically to avoid bursty I/O (PostgreSQL checkpoint settings). A DWB implementation needs a manager, batched slots, and completion accounting, not a per-buffer fsync in the background writer. Verification: track buffers_backend, checkpoint duration, WAL generation, and p99 write latency under pgbench before and after enabling the prototype.

  5. Make recovery boring.
    Recovery must not infer intent from partially updated state. It should read DWB metadata, validate checksums and LSNs, restore only complete entries, and ignore anything whose durability boundary was not crossed. Verification: run crash injection at every transition, including slot metadata update and slot reuse.
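
Step 5 can be sketched as a validation routine that refuses anything it cannot prove complete. The entry layout below is an assumption made for illustration: the header fields, `encode_entry`, `decode_entry`, and `page_to_restore` are hypothetical, and CRC-32 stands in for whatever checksum a real design would use. LSN ordering against WAL is not modeled.

```python
import struct
import zlib

PAGE_SIZE = 8192
# Hypothetical slot header: relation OID, fork number, block number, page LSN.
IDENT = struct.Struct("<IHIQ")
CRC = struct.Struct("<I")


def encode_entry(rel_oid, fork, block, lsn, page):
    """Serialize identity + CRC + page so they become durable as one unit."""
    ident = IDENT.pack(rel_oid, fork, block, lsn)
    return ident + CRC.pack(zlib.crc32(ident + page)) + page


def decode_entry(raw):
    """Return (rel_oid, fork, block, lsn, page), or None if incomplete or corrupt."""
    if len(raw) != IDENT.size + CRC.size + PAGE_SIZE:
        return None
    ident = raw[:IDENT.size]
    (crc,) = CRC.unpack_from(raw, IDENT.size)
    page = raw[IDENT.size + CRC.size:]
    if zlib.crc32(ident + page) != crc:
        return None
    return IDENT.unpack(ident) + (page,)


def page_to_restore(raw, target, data_page_torn):
    """Return the DWB page only for a torn target page and a matching, valid entry."""
    entry = decode_entry(raw)
    if entry is None or not data_page_torn:
        return None
    if entry[:3] != target:  # (rel_oid, fork, block) must match exactly
        return None
    return entry[4]
```

Because the checksum covers the identity fields as well as the page bytes, a slot whose metadata was reused early fails validation instead of restoring the wrong page.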

In Practice

The documented comparison is already enough to reject the naive port.

PostgreSQL’s own documentation says full_page_writes stores the whole disk page in WAL on the first modification after checkpoint because a torn data page cannot be repaired from row-level WAL alone. It also states the default is on and that disabling it can lead to unrecoverable or silent corruption after a system failure. That is not a tuning hint. That is a contract.

MySQL’s InnoDB documentation describes a different contract: pages flushed from the buffer pool are first written to the doublewrite area, and crash recovery can use that good copy if the final data-file write was interrupted. Since MySQL 8.0.20, those doublewrite pages live in doublewrite files rather than the old system tablespace location; since MySQL 8.0.30, innodb_doublewrite also supports DETECT_AND_RECOVER and DETECT_ONLY. The design is not merely “write the page twice.” It is “write the page twice with ordered recovery metadata and a known flush point.”

The documented pattern is clear: if generated code reclaims a DWB slot after smgrwrite() or after an advisory range flush, it has confused a buffered write with a durable write. That is enough to violate the recovery invariant. The system can lose the durable DWB copy while the data-file page is still only dirty kernel state.

This is exactly where AI-assisted systems work gets risky. Language models are strong at local similarity: InnoDB has a DWB, PostgreSQL has dirty pages, both have write paths, so assemble the bridge. But storage engines are not CRUD apps with worse naming. The important behavior lives between process architecture, kernel writeback, filesystem semantics, WAL ordering, and the crash replay path. The code shape is the least interesting part.

Where It Breaks

| Failure mode | Trigger | Fix |
| --- | --- | --- |
| Premature DWB slot reuse | Slot is freed after smgrwrite() returns on PostgreSQL with buffered I/O | Reclaim only after confirmed destination fsync() or equivalent durable sync after the page write |
| False confidence from sync_file_range() | Linux SYNC_FILE_RANGE_WRITE starts asynchronous writeback and does not flush volatile disk caches | Use it only as a writeback hint; keep fsync() or fdatasync() as the durability boundary |
| BgWriter latency collapse | Per-page DWB fsync added to a loop governed by bgwriter_delay and bgwriter_lru_maxpages | Move DWB fsync into batched workers with completion queues and backpressure |
| Checkpoint storms | DWB fsync work prevents dirty buffers from being cleaned ahead of checkpoints | Budget DWB throughput against checkpoint_completion_target, max_wal_size, and observed checkpoint sync time |
| WAL invariant drift | DWB metadata claims protection for a page whose WAL record was not flushed in the expected order | Tie DWB entries to page LSNs and WAL flush state; reject entries recovery cannot order |
| Recovery ambiguity | DWB slot has page bytes but stale relation, fork, block, checksum, or LSN metadata | Make metadata durable with the slot and validate all identifiers before restore |
| Misleading benchmark win | FPW disabled on a clean shutdown benchmark with no crash injection | Require power-fail tests, torn-page injection, and recovery validation before comparing WAL volume |
| Version-specific InnoDB copying | MySQL 8.0.20 moved DWB storage to doublewrite files; older mental models still cite ibdata1 | Treat engine version as part of the design, not trivia |
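
The batching fix for BgWriter latency collapse can be illustrated with a small accounting sketch. `DwbBatch` and `demo_batch` are hypothetical names, the code is single-threaded, and it covers only the DWB side: a real design would still need completion queues, backpressure, and the destination-file fsync before any slot is reused.

```python
import os
import tempfile

PAGE_SIZE = 8192


class DwbBatch:
    """Hypothetical batch of DWB slot writes sharing one fsync."""

    def __init__(self, fd):
        self.fd = fd
        self.pending = []     # slots written to the page cache, not yet durable
        self.durable = set()  # slots covered by a completed fsync
        self.fsync_calls = 0

    def stage(self, slot, page):
        os.pwrite(self.fd, page, slot * PAGE_SIZE)
        self.pending.append(slot)

    def commit(self):
        """One fsync covers every staged slot; only then are they marked durable."""
        os.fsync(self.fd)
        self.fsync_calls += 1
        self.durable.update(self.pending)
        self.pending.clear()


def demo_batch(n_pages=16):
    with tempfile.TemporaryDirectory() as tmp:
        fd = os.open(os.path.join(tmp, "dwb"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            batch = DwbBatch(fd)
            for slot in range(n_pages):
                batch.stage(slot, bytes([slot]) * PAGE_SIZE)
            durable_before_commit = set(batch.durable)
            batch.commit()
            return durable_before_commit, batch.durable, batch.fsync_calls
        finally:
            os.close(fd)
```

The separation between `pending` and `durable` is the design choice that matters: staging a page never implies protection, and the fsync cost is paid once per batch rather than once per buffer.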

What to Do Next

  • Problem: AI-generated storage code can compile while breaking the only invariant that matters: after a crash, one complete page image must exist.
  • Solution: Review DWB as a recovery protocol with explicit durable states, not as a write-path optimization.
  • Proof: The validation signal is not a passing smoke test; it is crash injection across every DWB, WAL, tablespace write, fsync, checkpoint, and slot-reuse transition.
  • Action: This week, take one generated systems patch and write its durability matrix: recovery source of truth, sync boundary, reclaim condition, and invalid crash states.

A database does not care that the code looked like the reference architecture; it only cares which bytes survived the crash.