The Semantics AI Misses When Porting Storage Designs

AI can copy the shape of a storage design and still miss the contract that makes it correct. A double write buffer is not an extra write path; it is a durability boundary.

Situation

AI coding agents are now good enough to produce plausible database internals patches: new structs, recovery hooks, background workers, tests, and code that compiles. That changes the review problem. The risk is no longer only “does the code build?” The risk is “did the agent preserve the invisible contract between the database, kernel, filesystem, block device, and recovery algorithm?”

The source experiment is a useful failure: a Claude Code prototype attempted to port an InnoDB-style double write buffer into PostgreSQL. The implementation followed the surface pattern. Write page to double write buffer. Write page to the real data file. Reuse the slot. The failure was semantic: PostgreSQL and InnoDB do not share the same I/O model, process model, or recovery trust boundary.

Mechanism	Default trust boundary	What protects against torn pages	Review question
PostgreSQL full page writes	Write-ahead log, or WAL, flush	Full image of each 8KB page, logged on its first modification after a checkpoint	Is the WAL image durable before recovery needs it?
InnoDB doublewrite buffer	Doublewrite file flush	Page copy written before final tablespace overwrite	Is the doublewrite copy durable before the destination page can tear?
Naive AI port	Function names and control flow	Assumed equivalence between writes	Did the patch prove the same crash states are recoverable?

The lesson generalizes beyond databases. AI-generated infrastructure code often calls the right APIs in the wrong contract order.

The Problem

A double write buffer, or DWB, protects a database page from a torn write by writing a complete copy somewhere else before overwriting the page at its final location. InnoDB documents this directly: pages flushed from the buffer pool are written to the doublewrite buffer before their proper locations, so crash recovery can find a good copy if the final page write is torn. MySQL 8.4 documentation names that as the purpose of the feature.

PostgreSQL solves the same class of failure differently. With full_page_writes=on, PostgreSQL writes the entire page to WAL during the first modification after each checkpoint. The PostgreSQL docs are explicit: without that page image, a crash during a page write can leave mixed old and new data, and normal row-level WAL records are not enough to reconstruct the page. The current PostgreSQL documentation also warns that turning the setting off can lead to unrecoverable or silent data corruption after a system failure.

The bug in the AI-generated design was treating those mechanisms as interchangeable.

Failure point	What breaks	Why it matters
`write()` treated as durable	PostgreSQL writes dirty buffers through the operating system page cache; the kernel can accept the bytes before media persistence	A DWB slot reused after `smgrwrite()` can destroy the only good recovery copy
`sync_file_range()` treated as `fsync()`	Linux documents `SYNC_FILE_RANGE_WRITE` as asynchronous and not suitable for data integrity operations; it also does not flush volatile disk write caches	Advisory writeback is performance plumbing, not a crash recovery guarantee
BgWriter path gets synchronous durability work	PostgreSQL’s background writer is tuned around cheap dirty-page writes and checkpoint-spread I/O	Per-page DWB fsync turns an amortized background path into a latency amplifier
Full page writes disabled too early	WAL no longer contains first-dirtied page images after checkpoint	Recovery must trust a DWB copy that may not actually be durable or current
Slot lifecycle lacks LSN accounting	DWB slot reuse is disconnected from destination file fsync progress	Crash recovery can observe a stale tablespace page and an overwritten DWB slot

The core question is not “can PostgreSQL be given a DWB?” It is: what additional durability accounting would make a DWB at least as trustworthy as PostgreSQL’s existing WAL full page image boundary?

A Crash-State Contract for Double Write Buffering

The right design starts with crash states, not code generation. If the system crashes at any boundary, recovery must have at least one complete page image with a known log sequence number, or LSN. Anything less is wishful thinking with structs.

flowchart TD
    Dirty[dirty PostgreSQL buffer — page LSN known] --> WAL[WAL record — optional full page image]
    Dirty --> DWBWrite[DWB slot write — buffered copy]
    DWBWrite --> DWBFlush[DWB file fsync — durable recovery copy]
    DWBFlush --> DataWrite[tablespace write — page cache accepted]
    DataWrite --> DataFlush[tablespace fsync — final page durable]
    DataFlush --> Reclaim[DWB slot reclaim — safe reuse]
    WAL --> Recovery[crash recovery — choose trusted image]
    DWBFlush --> Recovery
    DataFlush --> Recovery

The invariant is narrow:

State	DWB slot reusable?	Recovery source	Reason
Before DWB fsync	No	WAL full page image	DWB copy may not exist after power loss
After DWB fsync, before tablespace write	No	DWB or WAL	DWB copy is durable, destination is old
After tablespace write, before tablespace fsync	No	DWB	Destination may be stale or torn
After tablespace fsync	Yes	Tablespace	Final copy is durable through the filesystem boundary
After checkpoint and slot reclaim	Yes	Tablespace plus WAL from checkpoint	Recovery no longer depends on that DWB slot

That table is the design. The implementation follows from it.
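One way to keep that table honest is to encode it as data and check it mechanically. The sketch below is illustrative only: the state names, the `CONTRACT` mapping, and the source labels are hypothetical and exist nowhere in PostgreSQL. It simply restates the rows above and asserts that no state is both reusable and dependent on the DWB.

```python
from enum import Enum, auto


class PageState(Enum):
    BEFORE_DWB_FSYNC = auto()
    DWB_DURABLE = auto()    # after DWB fsync, before tablespace write
    DATA_WRITTEN = auto()   # after tablespace write, before tablespace fsync
    DATA_DURABLE = auto()   # after tablespace fsync
    RECLAIMED = auto()      # after checkpoint and slot reclaim


# state -> (slot reusable?, recovery sources), copied from the table above
CONTRACT = {
    PageState.BEFORE_DWB_FSYNC: (False, ("wal_fpi",)),
    PageState.DWB_DURABLE: (False, ("dwb", "wal_fpi")),
    PageState.DATA_WRITTEN: (False, ("dwb",)),
    PageState.DATA_DURABLE: (True, ("tablespace",)),
    PageState.RECLAIMED: (True, ("tablespace", "wal_from_checkpoint")),
}


def slot_reusable(state: PageState) -> bool:
    return CONTRACT[state][0]


def recovery_sources(state: PageState) -> tuple:
    return CONTRACT[state][1]


def check_contract() -> None:
    """Every state needs a recovery source; a slot recovery may read is never reusable."""
    for state, (reusable, sources) in CONTRACT.items():
        assert sources, f"{state.name} has no recovery source"
        assert not (reusable and "dwb" in sources), f"{state.name} reuses a needed slot"
```

Calling `check_contract()` in a unit test makes a future edit to the table fail loudly if it ever marks a slot reusable while recovery still depends on it.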

Keep full_page_writes=on while developing the DWB path.

A prototype that disables full page writes before proving DWB recovery has removed PostgreSQL’s existing safety net. PostgreSQL’s documented default is full_page_writes=on, and the reason is exactly torn-page recovery after OS crashes. The first implementation should run DWB as a redundant mechanism, then compare recovery decisions against WAL.

Verification: after crash recovery, report every page where WAL full page image and DWB recovery would have chosen different page contents or LSNs.
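A shadow-mode comparison could look roughly like the following. This is a minimal sketch, assuming the test harness can produce, for each page, the image that recovery through WAL would leave behind and the image that recovery through the DWB would leave behind. The `PageId` key, `PageImage` type, and `shadow_mismatches` function are hypothetical names for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple

PageId = Tuple[int, int]  # (relation id, block number); hypothetical key


class PageImage(NamedTuple):
    lsn: int
    data: bytes


def shadow_mismatches(wal_result: Dict[PageId, PageImage],
                      dwb_result: Dict[PageId, PageImage]) -> List[str]:
    """Report every page where the two recovery paths disagree."""
    report = []
    for page_id in sorted(wal_result.keys() | dwb_result.keys()):
        wal = wal_result.get(page_id)
        dwb = dwb_result.get(page_id)
        if wal is None or dwb is None:
            report.append(f"{page_id}: only one mechanism produced an image")
        elif wal.lsn != dwb.lsn:
            report.append(f"{page_id}: LSN differs (wal={wal.lsn}, dwb={dwb.lsn})")
        elif wal.data != dwb.data:
            report.append(f"{page_id}: contents differ at LSN {wal.lsn}")
    return report
```

An empty report is the only acceptable result before anyone considers turning full page writes off.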

Treat DWB slot state as a durability state machine.

A slot is not “free” after the page is copied. It is not free after the destination write(). It is free only after the destination relation file has been synced past the page’s write. That requires at least: relation identifier, fork, block number, page LSN, DWB slot identifier, DWB fsync generation, and destination fsync generation.

Verification: inject crashes at each transition and assert that no slot with tablespace_fsync_lsn < page_lsn is reused.
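The slot lifecycle can be sketched as an explicit state machine. This is an assumption-laden illustration, not PostgreSQL code: `DwbSlot`, `SlotState`, and the method names are hypothetical, the fsync generation fields are omitted for brevity, and `tablespace_fsync_lsn` stands for the highest page LSN known to be covered by a completed destination fsync.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SlotState(Enum):
    FREE = "free"
    COPIED = "copied"              # page copied into the slot, DWB not yet fsynced
    DWB_DURABLE = "dwb_durable"    # a DWB file fsync has covered this slot
    DATA_WRITTEN = "data_written"  # destination write() accepted by the kernel


@dataclass
class DwbSlot:
    slot_id: int
    state: SlotState = SlotState.FREE
    rel_id: Optional[int] = None
    fork: Optional[int] = None
    block_no: Optional[int] = None
    page_lsn: Optional[int] = None

    def _advance(self, expected: SlotState, new: SlotState) -> None:
        if self.state is not expected:
            raise RuntimeError(
                f"slot {self.slot_id}: {self.state.value} -> {new.value} not allowed")
        self.state = new

    def fill(self, rel_id: int, fork: int, block_no: int, page_lsn: int) -> None:
        self._advance(SlotState.FREE, SlotState.COPIED)
        self.rel_id, self.fork = rel_id, fork
        self.block_no, self.page_lsn = block_no, page_lsn

    def dwb_fsynced(self) -> None:
        self._advance(SlotState.COPIED, SlotState.DWB_DURABLE)

    def data_written(self) -> None:
        self._advance(SlotState.DWB_DURABLE, SlotState.DATA_WRITTEN)

    def try_reclaim(self, tablespace_fsync_lsn: int) -> bool:
        """Free the slot only once the destination fsync has covered page_lsn."""
        if self.state is not SlotState.DATA_WRITTEN:
            return False
        if tablespace_fsync_lsn < self.page_lsn:
            return False
        self.state = SlotState.FREE
        self.rel_id = self.fork = self.block_no = self.page_lsn = None
        return True
```

The point of the design is that the only path back to `FREE` runs through `try_reclaim`, so the reuse rule lives in one place that a crash-injection test can target.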

Batch fsyncs around files, not pages.

A naive per-page fsync(dwb_fd) will collapse throughput on ordinary SSDs and will be theatrical on network block devices. The DWB write path needs group commit semantics: append many page copies to DWB storage, issue one durable flush, then schedule destination writes. The destination side also needs file-level fsync grouping by relation segment, because PostgreSQL relations are spread across segment files.

Verification: expose counters for pages per DWB fsync, relation files per destination fsync batch, p50 and p99 fsync latency, and backend buffer eviction waits.
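The group commit idea on the DWB side can be illustrated with plain file descriptors. This is a minimal sketch under stated assumptions: `DwbBatchWriter` is a hypothetical name, the DWB is modeled as an append-only file, and `os.pwrite` requires a Unix-like platform. It shows many page copies sharing one `fsync`, plus the pages-per-fsync counters the verification step asks for.

```python
import os
import tempfile

PAGE_SIZE = 8192


class DwbBatchWriter:
    """Append page copies to a DWB file and make them durable in batches."""

    def __init__(self, fd: int) -> None:
        self.fd = fd
        self.next_offset = 0
        self.pending = 0         # pages written since the last fsync
        self.fsync_calls = 0
        self.pages_flushed = 0

    def append(self, page: bytes) -> int:
        if len(page) != PAGE_SIZE:
            raise ValueError("page must be exactly PAGE_SIZE bytes")
        offset = self.next_offset
        if os.pwrite(self.fd, page, offset) != PAGE_SIZE:
            raise OSError("short write to DWB file")
        self.next_offset += PAGE_SIZE
        self.pending += 1
        return offset            # buffered only; not yet a recovery source

    def flush(self) -> int:
        """One fsync covers every page appended since the previous flush."""
        if self.pending == 0:
            return 0
        os.fsync(self.fd)
        flushed, self.pending = self.pending, 0
        self.fsync_calls += 1
        self.pages_flushed += flushed
        return flushed


def demo() -> tuple:
    with tempfile.TemporaryFile() as f:
        writer = DwbBatchWriter(f.fileno())
        for i in range(4):
            writer.append(bytes([i]) * PAGE_SIZE)
        writer.flush()
        return writer.fsync_calls, writer.pages_flushed
```

Destination writes for a batch would be scheduled only after `flush()` returns; dividing `pages_flushed` by `fsync_calls` gives the pages-per-fsync figure.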

Move synchronous work out of FlushBuffer().

FlushBuffer() is the wrong abstraction boundary for the whole protocol. It can mark that a page needs protection, enqueue the copy, and coordinate state. It should not become a per-page durability transaction. PostgreSQL already separates WAL writer, background writer, and checkpointer roles; a DWB design needs a manager that coordinates DWB slots, DWB fsync completion, destination writes, and reclaim.

Verification: run write-heavy workloads with bgwriter_lru_maxpages, checkpoint_timeout, checkpoint_completion_target, and checkpoint_flush_after visible in logs; confirm backend writes do not spike because DWB workers are saturated.
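The split between the flush path and a separate manager might be sketched as a bounded queue. Everything here is hypothetical (`DwbManager`, `enqueue`, `run_once`, and the `durable_flush` callback are illustration names, not PostgreSQL APIs). The flush path only enqueues and never waits on durability; a separate worker drains batches and performs the durable flush.

```python
import queue
from typing import Callable, List


class DwbManager:
    def __init__(self, durable_flush: Callable[[List[bytes]], None],
                 capacity: int = 64, max_batch: int = 16) -> None:
        self._queue: "queue.Queue[bytes]" = queue.Queue(maxsize=capacity)
        self._durable_flush = durable_flush
        self._max_batch = max_batch
        self.rejected = 0  # backpressure counter: enqueue attempts that found the queue full

    def enqueue(self, page: bytes) -> bool:
        """Called from the flush path; never blocks on I/O."""
        try:
            self._queue.put_nowait(page)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def run_once(self) -> int:
        """Called from a worker; drains one batch and makes it durable."""
        batch: List[bytes] = []
        while len(batch) < self._max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._durable_flush(batch)
        return len(batch)
```

A rising `rejected` count is the signal that DWB workers are saturated and that backends are about to absorb the cost.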

Make recovery distrustful by default.

During startup, recovery must validate DWB records by checksum, relation identity, block number, page LSN, and DWB fsync generation. A DWB record without proof of durable completion is a hint, not a recovery source. PostgreSQL page checksums, when enabled, help detect torn pages, but detection is not repair.

Verification: corrupt DWB records, destination pages, and WAL records independently in test images; recovery must either repair from a proven source or fail loudly.
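Distrustful validation can be sketched as a function that returns reasons to reject a record rather than a boolean. The record layout and names below are hypothetical. `zlib.crc32` is only a stand-in for whatever checksum the real format would use (PostgreSQL's page checksum is a different algorithm), and `durable_generation` and `min_lsn` are assumed to come from recovery metadata that this sketch does not define.

```python
import zlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DwbRecord:
    rel_id: int
    block_no: int
    page_lsn: int
    fsync_generation: int
    checksum: int
    page: bytes


def make_record(rel_id: int, block_no: int, page_lsn: int,
                fsync_generation: int, page: bytes) -> DwbRecord:
    return DwbRecord(rel_id, block_no, page_lsn, fsync_generation,
                     zlib.crc32(page), page)


def rejection_reasons(rec: DwbRecord, rel_id: int, block_no: int,
                      durable_generation: int, min_lsn: int) -> List[str]:
    """Return every reason this record cannot be a recovery source."""
    reasons = []
    if zlib.crc32(rec.page) != rec.checksum:
        reasons.append("checksum mismatch")
    if (rec.rel_id, rec.block_no) != (rel_id, block_no):
        reasons.append("wrong relation or block")
    if rec.fsync_generation > durable_generation:
        # Written, but no proof the covering DWB fsync ever completed.
        reasons.append("no proof of durable completion")
    if rec.page_lsn < min_lsn:
        reasons.append("stale page LSN")
    return reasons
```

An empty list means the record may be used; anything else should be logged and the record treated as a hint at most.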

Test against the actual storage stack.

PostgreSQL deployments differ by wal_sync_method, filesystem, cloud block device, hypervisor cache mode, RAID controller cache, and mount options. PostgreSQL documents several WAL sync methods, including fdatasync, fsync, open_sync, and open_datasync; Linux is not the whole production universe. The DWB claim is only meaningful on the stack where it is measured.

Verification: repeat crash-injection tests on the production-like filesystem and block layer, including VM-level kill, host reboot where available, and forced process termination.

In Practice

The public evidence points in one direction: the prototype failed because it copied an algorithm without copying the assumptions that make the algorithm true.

Evidence	Type	Engineering implication
InnoDB documents the doublewrite buffer as a separate area written before pages reach their final data-file positions	Public documented design	The protection comes from write ordering plus recovery lookup, not from an extra copy alone
PostgreSQL documents `full_page_writes` as writing the entire disk page to WAL on first modification after checkpoint	Public documented design	PostgreSQL’s trust boundary is WAL durability, not destination data-file durability
PostgreSQL documents `wal_sync_method` choices and warns that crash-safe configuration depends on system configuration	Public documented design	A DWB replacement must be validated under the configured sync method and storage layer
Linux documents `SYNC_FILE_RANGE_WRITE` as asynchronous and “not suitable for data integrity operations”	System behavior	Code that treats it as a durability boundary is wrong even if smoke tests pass
PostgreSQL checkpoint settings include `checkpoint_flush_after`, which attempts to push dirty data to storage to reduce later stalls	System behavior	PostgreSQL already distinguishes writeback pressure from confirmed persistence
JIN’s Claude Code experiment compiled and passed basic smoke tests before semantic review exposed the DWB flaw	Documented source experiment	Build success is not evidence of crash-state correctness

The deeper point is that storage correctness is usually hidden behind boring verbs: write, flush, sync, checkpoint, recover. Those verbs are not portable across systems.

write() to a regular file usually means “the kernel accepted bytes.” It does not mean “the bytes survived power loss.” sync_file_range() can start writeback and can be useful for reducing dirty-page backlog, but the Linux man page explicitly separates that from data integrity. fsync() is closer to the boundary PostgreSQL recovery cares about, but even then the real guarantee depends on the filesystem, block device, drive cache behavior, and whether the stack lies about flush completion.
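The gap between those verbs is easy to show in a few lines. The sketch below is a generic POSIX-style illustration, not PostgreSQL code, and `durable_write` is a hypothetical helper. It separates the point where the kernel accepts bytes from the point where persistence is requested, and it also syncs the parent directory so a newly created file's directory entry is covered. Even then, the result is only as strong as the filesystem and device beneath it.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # The kernel has accepted these bytes into the page cache. Nothing more.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push file data and metadata to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)

    # A new file also needs its directory entry persisted (Linux/POSIX behavior).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Code that returns after the `os.write` loop and treats the page as safe has made exactly the assumption the prototype made.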

This is exactly where AI-assisted systems work becomes dangerous. The model sees an InnoDB pattern:

InnoDB-looking step	What the AI can reproduce	What it may miss
Copy page to DWB	Buffer allocation and file write	Whether the copy is durable before final overwrite
Flush DWB	Call a function with “flush” in the name	Whether the function is advisory or a persistence barrier
Write destination page	`smgrwrite()` or equivalent call	Whether the write reached media or page cache
Reclaim slot	Free-list manipulation	Whether recovery still depends on that slot
Disable FPW	Config change or branch bypass	Whether WAL still has a complete first-touch page image

That is not a PostgreSQL-only lesson. The same failure shape appears when agents generate Kafka consumers without understanding offset commit semantics, Kubernetes controllers without understanding finalizers, S3 pipelines without understanding read-after-write boundaries by operation type, or distributed locks without understanding fencing tokens. The API name is the shallow part. The recovery contract is the system.

For this specific DWB design, I have not run the patch at production scale personally. The documented failure mode is enough to reject the architecture as described: if a DWB slot is reused after a buffered destination write but before a confirmed destination fsync, a crash can leave no durable complete image outside WAL. If full page writes have also been disabled, PostgreSQL’s documented repair mechanism has been removed.

The most deceptive benchmark would be a clean-shutdown write throughput test. It might show lower WAL volume and acceptable latency because it never exercises the crash boundary. A correct benchmark has to kill the database and the machine at controlled points: before DWB fsync, after DWB fsync, after destination write, before destination fsync, after destination fsync, and during checkpoint. Then it has to verify page checksums, page LSNs, WAL replay behavior, and DWB reclaim metadata. Anything else is testing formatting.
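Before building a real power-loss rig, the crash matrix can be explored with a toy model. The sketch below is a deliberate simplification and proves nothing about a real storage stack. It assumes a destination page is potentially torn from its `write()` until its `fsync()`, that reclaiming a slot destroys the DWB copy, and that an intact destination page can be recovered by ordinary WAL replay. The step names and functions are hypothetical.

```python
from typing import List, Optional

CORRECT = ["dwb_write", "dwb_fsync", "data_write", "data_fsync", "reclaim"]
# The flawed ordering: the slot is reused before the destination fsync.
NAIVE = ["dwb_write", "dwb_fsync", "data_write", "reclaim", "data_fsync"]


def recovery_source(steps: List[str], crash_after: int,
                    full_page_writes: bool) -> Optional[str]:
    """Return which copy recovery can trust if power fails after N steps."""
    dwb_written = False
    dwb_durable = False
    data_torn = False
    for step in steps[:crash_after]:
        if step == "dwb_write":
            dwb_written = True
        elif step == "dwb_fsync":
            dwb_durable = dwb_written
        elif step == "data_write":
            data_torn = True      # worst case until the destination fsync
        elif step == "data_fsync":
            data_torn = False
        elif step == "reclaim":
            dwb_written = dwb_durable = False  # slot overwritten by another page
    if not data_torn:
        return "tablespace"       # intact page; normal WAL replay applies
    if dwb_durable:
        return "dwb"
    if full_page_writes:
        return "wal_fpi"
    return None


def unrecoverable_points(steps: List[str], full_page_writes: bool) -> List[int]:
    return [n for n in range(len(steps) + 1)
            if recovery_source(steps, n, full_page_writes) is None]
```

In this model the correct ordering has no unrecoverable crash point even with full page writes off, while the naive ordering has one, right after the early reclaim, and only full page writes cover it.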

Where It Breaks

Failure mode	Trigger	Fix
DWB slot reused too early	Slot freed after `smgrwrite()` or `sync_file_range()` instead of after destination `fsync()`	Track destination fsync generation per relation segment and reclaim only when `tablespace_fsync_lsn >= page_lsn`
WAL safety removed before DWB is proven	`full_page_writes=off` during prototype or benchmark runs	Run DWB in shadow mode first; compare recovery choices against WAL full page images
BgWriter stalls under durability work	Per-page DWB fsync inside dirty buffer eviction path	Use DWB workers, group commit, and file-level batching outside the critical buffer eviction path
Checkpoint I/O becomes spiky	DWB backlog prevents pages from becoming safely reclaimable before checkpoint pressure rises	Coordinate DWB manager with checkpointer progress and expose backlog metrics tied to checkpoint cycles
Advisory flush mistaken for crash safety	Linux `sync_file_range()` or PostgreSQL writeback hints treated as persistence	Reserve advisory writeback for latency smoothing; require `fsync`, `fdatasync`, or platform-equivalent durability boundary
Storage stack changes invalidate assumptions	Moving from local NVMe to EBS, Azure managed disks, GCP Persistent Disk, ZFS, ext4, XFS, or a controller with volatile cache	Certify the crash matrix per production stack and keep the result with the deployment profile
Recovery accepts stale DWB records	DWB metadata lacks relation identity, block number, checksum, page LSN, or fsync generation	Validate DWB records as recovery artifacts; reject ambiguous records loudly
Benchmark hides corruption	Tests use clean shutdown, process kill only, or no filesystem fault injection	Add power-loss style crash testing and page verification after replay

What to Do Next

Problem: AI-generated systems code can preserve code shape while breaking the durability, scheduling, and recovery contracts underneath it.
Solution: Review infrastructure patches by crash-state matrix first, then by code diff.
Proof: A PostgreSQL DWB design is not credible until every page state between DWB write, DWB fsync, destination write, destination fsync, checkpoint, and slot reclaim has a verified recovery source.
Action: This week, take one AI-generated infrastructure patch and write its hidden contract table: API call, assumed guarantee, actual guarantee, failure if the assumption is false.
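A hidden contract table does not need tooling, but keeping it as data next to the patch makes it reviewable. The sketch below is one possible shape; `ContractRow` and `render` are hypothetical names, and the two rows are taken from the failure analysis earlier in this article.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass(frozen=True)
class ContractRow:
    api_call: str
    assumed: str
    actual: str
    failure: str


ROWS = [
    ContractRow("write()", "bytes are durable",
                "kernel accepted bytes into the page cache",
                "DWB slot reused while it holds the only good copy"),
    ContractRow("sync_file_range()", "equivalent to fsync()",
                "asynchronous writeback, not for data integrity",
                "advisory flush treated as a crash recovery guarantee"),
]


def render(rows: List[ContractRow]) -> str:
    header = ("API call", "Assumed guarantee", "Actual guarantee",
              "Failure if the assumption is false")
    lines = ["\t".join(header)]
    lines += ["\t".join(astuple(row)) for row in rows]
    return "\n".join(lines)
```

Printing `render(ROWS)` gives a tab-separated table that can be pasted into a review comment.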

The hard part of storage engineering is not making the second write happen; it is knowing exactly which copy the system is allowed to trust after the lights come back on.

Rajiv
