The Semantics AI Misses When Porting Storage Designs
Content reflects the state as of August 2025. AI tooling and model capabilities in this area change frequently.
AI can copy the shape of a storage design and still miss the contract that makes it correct: a double write buffer is not an extra write path; it is a durability boundary.
Situation
AI coding agents are now good enough to produce plausible database internals patches: new structs, recovery hooks, background workers, tests, and code that compiles. That changes the review problem. The risk is no longer only “does the code build?” The risk is “did the agent preserve the invisible contract between the database, kernel, filesystem, block device, and recovery algorithm?”
The source experiment is a useful failure: a Claude Code prototype attempted to port an InnoDB-style double write buffer into PostgreSQL. The implementation followed the surface pattern. Write page to double write buffer. Write page to the real data file. Reuse the slot. The failure was semantic: PostgreSQL and InnoDB do not share the same I/O model, process model, or recovery trust boundary.
| Mechanism | Default trust boundary | What protects against torn pages | Review question |
|---|---|---|---|
| PostgreSQL full page writes | Write-ahead log, or WAL, flush | First modified 8KB page image after checkpoint | Is the WAL image durable before recovery needs it? |
| InnoDB doublewrite buffer | Doublewrite file flush | Page copy written before final tablespace overwrite | Is the doublewrite copy durable before the destination page can tear? |
| Naive AI port | Function names and control flow | Assumed equivalence between writes | Did the patch prove the same crash states are recoverable? |
The lesson generalizes beyond databases. AI-generated infrastructure code often calls the right APIs in the wrong contract order.
The Problem
A double write buffer, or DWB, protects a database page from a torn write by writing a complete copy somewhere else before overwriting the page at its final location. InnoDB documents this directly: pages flushed from the buffer pool are written to the doublewrite buffer before their proper locations, so crash recovery can find a good copy if the final page write is torn. MySQL 8.4 documentation names that as the purpose of the feature.
PostgreSQL solves the same class of failure differently. With full_page_writes=on, PostgreSQL writes the entire page to WAL during the first modification after each checkpoint. The PostgreSQL docs are explicit: without that page image, a crash during a page write can leave mixed old and new data, and normal row-level WAL records are not enough to reconstruct the page. The current PostgreSQL documentation also warns that turning the setting off can lead to unrecoverable or silent corruption after a system failure.
The bug in the AI-generated design was treating those mechanisms as interchangeable.
| Failure point | What breaks | Why it matters |
|---|---|---|
| write() treated as durable | PostgreSQL writes dirty buffers through the operating system page cache; the kernel can accept the bytes before media persistence | A DWB slot reused after smgrwrite() can destroy the only good recovery copy |
| sync_file_range() treated as fsync() | Linux documents SYNC_FILE_RANGE_WRITE as asynchronous and not suitable for data integrity operations; it also does not flush volatile disk write caches | Advisory writeback is performance plumbing, not a crash recovery guarantee |
| BgWriter path gets synchronous durability work | PostgreSQL’s background writer is tuned around cheap dirty-page writes and checkpoint-spread I/O | Per-page DWB fsync turns an amortized background path into a latency amplifier |
| Full page writes disabled too early | WAL no longer contains first-dirtied page images after checkpoint | Recovery must trust a DWB copy that may not actually be durable or current |
| Slot lifecycle lacks LSN accounting | DWB slot reuse is disconnected from destination file fsync progress | Crash recovery can observe a stale tablespace page and an overwritten DWB slot |
The core question is not “can PostgreSQL be given a DWB?” It is: what additional durability accounting would make a DWB at least as trustworthy as PostgreSQL’s existing WAL full page image boundary?
A Crash-State Contract for Double Write Buffering
The right design starts with crash states, not code generation. If the system crashes at any boundary, recovery must still have at least one complete page image with a known log sequence number, or LSN. Anything less is wishful thinking with structs.
flowchart TD
Dirty[dirty PostgreSQL buffer — page LSN known] --> WAL[WAL record — optional full page image]
Dirty --> DWBWrite[DWB slot write — buffered copy]
DWBWrite --> DWBFlush[DWB file fsync — durable recovery copy]
DWBFlush --> DataWrite[tablespace write — page cache accepted]
DataWrite --> DataFlush[tablespace fsync — final page durable]
DataFlush --> Reclaim[DWB slot reclaim — safe reuse]
WAL --> Recovery[crash recovery — choose trusted image]
DWBFlush --> Recovery
DataFlush --> Recovery
The invariant is narrow:
| State | DWB slot reusable? | Recovery source | Reason |
|---|---|---|---|
| Before DWB fsync | No | WAL full page image | DWB copy may not exist after power loss |
| After DWB fsync, before tablespace write | No | DWB or WAL | DWB copy is durable, destination is old |
| After tablespace write, before tablespace fsync | No | DWB | Destination may be stale or torn |
| After tablespace fsync | Yes | Tablespace | Final copy is durable through the filesystem boundary |
| After checkpoint and slot reclaim | Yes | Tablespace plus WAL from checkpoint | Recovery no longer depends on that DWB slot |
That table is the design. The implementation follows from it.
- Keep full_page_writes=on while developing the DWB path.
  A prototype that disables full page writes before proving DWB recovery has removed PostgreSQL’s existing safety net. PostgreSQL’s documented default is full_page_writes=on, and the reason is exactly torn-page recovery after OS crashes. The first implementation should run DWB as a redundant mechanism, then compare recovery decisions against WAL.
  Verification: after crash recovery, report every page where WAL full page image and DWB recovery would have chosen different page contents or LSNs.
- Treat DWB slot state as a durability state machine.
  A slot is not “free” after the page is copied. It is not free after the destination write(). It is free only after the destination relation file has been synced past the page’s write. That requires at least: relation identifier, fork, block number, page LSN, DWB slot identifier, DWB fsync generation, and destination fsync generation.
  Verification: inject crashes at each transition and assert that no slot with tablespace_fsync_lsn < page_lsn is reused.
- Batch fsyncs around files, not pages.
  A naive per-page fsync(dwb_fd) will collapse throughput on ordinary SSDs and will be theatrical on network block devices. The DWB write path needs group commit semantics: append many page copies to DWB storage, issue one durable flush, then schedule destination writes. The destination side also needs file-level fsync grouping by relation segment, because PostgreSQL relations are spread across segment files.
  Verification: expose counters for pages per DWB fsync, relation files per destination fsync batch, p50 and p99 fsync latency, and backend buffer eviction waits.
- Move synchronous work out of FlushBuffer().
  FlushBuffer() is the wrong abstraction boundary for the whole protocol. It can mark that a page needs protection, enqueue the copy, and coordinate state. It should not become a per-page durability transaction. PostgreSQL already separates WAL writer, background writer, and checkpointer roles; a DWB design needs a manager that coordinates DWB slots, DWB fsync completion, destination writes, and reclaim.
  Verification: run write-heavy workloads with bgwriter_lru_maxpages, checkpoint_timeout, checkpoint_completion_target, and checkpoint_flush_after visible in logs; confirm backend writes do not spike because DWB workers are saturated.
- Make recovery distrustful by default.
  During startup, recovery must validate DWB records by checksum, relation identity, block number, page LSN, and DWB fsync generation. A DWB record without proof of durable completion is a hint, not a recovery source. PostgreSQL page checksums, when enabled, help detect torn pages, but detection is not repair.
  Verification: corrupt DWB records, destination pages, and WAL records independently in test images; recovery must either repair from a proven source or fail loudly.
- Test against the actual storage stack.
  PostgreSQL deployments differ by wal_sync_method, filesystem, cloud block device, hypervisor cache mode, RAID controller cache, and mount options. PostgreSQL documents several WAL sync methods, including fdatasync, fsync, open_sync, and open_datasync; Linux is not the whole production universe. The DWB claim is only meaningful on the stack where it is measured.
  Verification: repeat crash-injection tests on the production-like filesystem and block layer, including VM-level kill, host reboot where available, and forced process termination.
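The slot lifecycle and batching steps above can be sketched as a small in-memory state machine. This is an illustration under stated assumptions, not PostgreSQL code: `DwbManager`, `SlotState`, and every method name are hypothetical, and fsyncs are modelled as counters rather than real I/O. Slot states stand in for the generation and LSN comparisons a real design would need.

```python
from dataclasses import dataclass
from enum import Enum, auto


class SlotState(Enum):
    STAGED = auto()        # copied into the DWB, not yet fsynced
    DWB_DURABLE = auto()   # DWB fsync completed
    DEST_WRITTEN = auto()  # destination write() accepted by the kernel
    DEST_DURABLE = auto()  # destination file fsynced after this write


@dataclass
class Slot:
    rel_segment: str  # hypothetical relation segment file name
    block: int
    page_lsn: int
    state: SlotState = SlotState.STAGED


class DwbManager:
    """Hypothetical in-memory model; fsyncs are counted, not performed."""

    def __init__(self):
        self.slots = {}
        self.next_id = 0
        self.dwb_fsyncs = 0
        self.dest_fsyncs = 0

    def stage(self, rel_segment, block, page_lsn):
        slot_id = self.next_id
        self.next_id += 1
        self.slots[slot_id] = Slot(rel_segment, block, page_lsn)
        return slot_id

    def flush_dwb(self):
        # Group commit: one fsync covers every staged copy.
        staged = [s for s in self.slots.values() if s.state is SlotState.STAGED]
        if staged:
            self.dwb_fsyncs += 1
            for s in staged:
                s.state = SlotState.DWB_DURABLE
        return len(staged)

    def write_destination(self, slot_id):
        slot = self.slots[slot_id]
        if slot.state is not SlotState.DWB_DURABLE:
            raise RuntimeError("destination write before DWB copy is durable")
        slot.state = SlotState.DEST_WRITTEN

    def fsync_destination(self, rel_segment):
        # One fsync per relation segment file covers all its written pages.
        self.dest_fsyncs += 1
        for s in self.slots.values():
            if s.rel_segment == rel_segment and s.state is SlotState.DEST_WRITTEN:
                s.state = SlotState.DEST_DURABLE

    def reclaim(self, slot_id):
        # A slot is reusable only once the destination copy is durable.
        if self.slots[slot_id].state is not SlotState.DEST_DURABLE:
            return False
        del self.slots[slot_id]
        return True
```

In this model, `reclaim()` returns False for any slot whose destination file has not been synced, which is the same rule as refusing reuse while tablespace_fsync_lsn < page_lsn. The `dwb_fsyncs` and `dest_fsyncs` counters are the kind of numbers the verification steps ask to expose.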
In Practice
The public evidence points in one direction: the prototype failed because it copied an algorithm without copying the assumptions that make the algorithm true.
| Evidence | Type | Engineering implication |
|---|---|---|
| InnoDB documents the doublewrite buffer as a separate area written before pages reach their final data-file positions | Public documented design | The protection comes from write ordering plus recovery lookup, not from an extra copy alone |
| PostgreSQL documents full_page_writes as writing the entire disk page to WAL on first modification after checkpoint | Public documented design | PostgreSQL’s trust boundary is WAL durability, not destination data-file durability |
| PostgreSQL documents wal_sync_method choices and warns that crash-safe configuration depends on system configuration | Public documented design | A DWB replacement must be validated under the configured sync method and storage layer |
| Linux documents SYNC_FILE_RANGE_WRITE as asynchronous and “not suitable for data integrity operations” | System behavior | Code that treats it as a durability boundary is wrong even if smoke tests pass |
| PostgreSQL checkpoint settings include checkpoint_flush_after, which attempts to push dirty data to storage to reduce later stalls | System behavior | PostgreSQL already distinguishes writeback pressure from confirmed persistence |
| JIN’s Claude Code experiment compiled and passed basic smoke tests before semantic review exposed the DWB flaw | Documented source experiment | Build success is not evidence of crash-state correctness |
The deeper point is that storage correctness is usually hidden behind boring verbs: write, flush, sync, checkpoint, recover. Those verbs are not portable across systems.
write() to a regular file usually means “the kernel accepted bytes.” It does not mean “the bytes survived power loss.” sync_file_range() can start writeback and can be useful for reducing dirty-page backlog, but the Linux man page explicitly separates that from data integrity. fsync() is closer to the boundary PostgreSQL recovery cares about, but even then the real guarantee depends on the filesystem, block device, drive cache behavior, and whether the stack lies about flush completion.
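The gap between "accepted" and "durable" is easier to see when each step is written out. The following is a minimal sketch, assuming a POSIX system and Python's `os` module; `protected_page_write`, the slot and block layout, and the file descriptors are hypothetical, and `os.fsync()` is only as strong as the filesystem and device beneath it.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


def _pwrite_full(fd, data, offset):
    # pwrite() may write fewer bytes than requested; treat that as an error here.
    written = os.pwrite(fd, data, offset)
    if written != len(data):
        raise OSError("short write")


def protected_page_write(dwb_fd, data_fd, slot, block, page):
    """Write one page with the DWB copy made durable first (hypothetical helper)."""
    if len(page) != PAGE_SIZE:
        raise ValueError("page must be exactly PAGE_SIZE bytes")
    # 1. Copy to the DWB slot. When pwrite() returns, the kernel has accepted
    #    the bytes; nothing is known yet about the storage media.
    _pwrite_full(dwb_fd, page, slot * PAGE_SIZE)
    # 2. Persistence barrier for the recovery copy.
    os.fsync(dwb_fd)
    # 3. Only now overwrite the destination page.
    _pwrite_full(data_fd, page, block * PAGE_SIZE)
    # 4. Persistence barrier for the destination. Until this returns,
    #    the DWB slot must not be reused.
    os.fsync(data_fd)
    return True  # the slot may now be reclaimed
```

This shows ordering only. It issues two fsyncs per page, which is the per-page pattern a real design would replace with batched flushes.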
This is exactly where AI-assisted systems work becomes dangerous. The model sees an InnoDB pattern:
| InnoDB-looking step | What the AI can reproduce | What it may miss |
|---|---|---|
| Copy page to DWB | Buffer allocation and file write | Whether the copy is durable before final overwrite |
| Flush DWB | Call a function with “flush” in the name | Whether the function is advisory or a persistence barrier |
| Write destination page | smgrwrite() or equivalent call | Whether the write reached media or page cache |
| Reclaim slot | Free-list manipulation | Whether recovery still depends on that slot |
| Disable FPW | Config change or branch bypass | Whether WAL still has a complete first-touch page image |
That is not a PostgreSQL-only lesson. The same failure shape appears when agents generate Kafka consumers without understanding offset commit semantics, Kubernetes controllers without understanding finalizers, S3 pipelines without understanding read-after-write boundaries by operation type, or distributed locks without understanding fencing tokens. The API name is the shallow part. The recovery contract is the system.
For this specific DWB design, I have not run the patch at production scale personally. The documented failure mode is enough to reject the architecture as described: if a DWB slot is reused after a buffered destination write but before a confirmed destination fsync, a crash can leave no durable complete image outside WAL. If full page writes have also been disabled, PostgreSQL’s documented repair mechanism has been removed.
The most deceptive benchmark would be a clean-shutdown write throughput test. It might show lower WAL volume and acceptable latency because it never exercises the crash boundary. A correct benchmark has to kill the database and the machine at controlled points: before DWB fsync, after DWB fsync, after destination write, before destination fsync, after destination fsync, and during checkpoint. Then it has to verify page checksums, page LSNs, WAL replay behavior, and DWB reclaim metadata. Anything else is testing formatting.
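The crash points listed above can be written down as an executable matrix before any real fault injection exists. This is a simplified model, not PostgreSQL recovery: `CrashPoint`, `recovery_sources`, and `unrecoverable_points` are hypothetical names, and the model assumes the destination page stays intact until the destination write begins.

```python
from enum import Enum


class CrashPoint(Enum):
    BEFORE_DWB_FSYNC = 1
    AFTER_DWB_FSYNC = 2
    AFTER_DEST_WRITE = 3  # destination write() done, not yet fsynced
    AFTER_DEST_FSYNC = 4


def recovery_sources(point, full_page_writes, reuse_after_write):
    """Return the complete, durable page images available after a crash.

    reuse_after_write models the flawed design that frees the DWB slot as
    soon as the destination write() returns.
    """
    sources = set()
    if full_page_writes:
        # WAL is flushed before the data page is written.
        sources.add("wal_fpi")
    if point.value < CrashPoint.AFTER_DEST_WRITE.value:
        # Assumption: the destination has not been touched, so the
        # pre-write page is still intact.
        sources.add("old_page")
    dwb_durable = point.value >= CrashPoint.AFTER_DWB_FSYNC.value
    slot_reused = (reuse_after_write
                   and point.value >= CrashPoint.AFTER_DEST_WRITE.value)
    if dwb_durable and not slot_reused:
        sources.add("dwb")
    if point is CrashPoint.AFTER_DEST_FSYNC:
        sources.add("new_page")
    return sources


def unrecoverable_points(full_page_writes, reuse_after_write):
    """List crash points at which no trusted page image exists."""
    return [p for p in CrashPoint
            if not recovery_sources(p, full_page_writes, reuse_after_write)]
```

Under these assumptions, the only combination that leaves a crash point with no recovery source is early slot reuse with full page writes disabled, and the failing point is the window between destination write and destination fsync. A clean-shutdown benchmark never enters that window.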
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| DWB slot reused too early | Slot freed after smgrwrite() or sync_file_range() instead of after destination fsync() | Track destination fsync generation per relation segment and reclaim only when tablespace_fsync_lsn >= page_lsn |
| WAL safety removed before DWB is proven | full_page_writes=off during prototype or benchmark runs | Run DWB in shadow mode first; compare recovery choices against WAL full page images |
| BgWriter stalls under durability work | Per-page DWB fsync inside dirty buffer eviction path | Use DWB workers, group commit, and file-level batching outside the critical buffer eviction path |
| Checkpoint I/O becomes spiky | DWB backlog prevents pages from becoming safely reclaimable before checkpoint pressure rises | Coordinate DWB manager with checkpointer progress and expose backlog metrics tied to checkpoint cycles |
| Advisory flush mistaken for crash safety | Linux sync_file_range() or PostgreSQL writeback hints treated as persistence | Reserve advisory writeback for latency smoothing; require fsync, fdatasync, or platform-equivalent durability boundary |
| Storage stack changes invalidate assumptions | Moving from local NVMe to EBS, Azure managed disks, GCP Persistent Disk, ZFS, ext4, XFS, or a controller with volatile cache | Certify the crash matrix per production stack and keep the result with the deployment profile |
| Recovery accepts stale DWB records | DWB metadata lacks relation identity, block number, checksum, page LSN, or fsync generation | Validate DWB records as recovery artifacts; reject ambiguous records loudly |
| Benchmark hides corruption | Tests use clean shutdown, process kill only, or no filesystem fault injection | Add power-loss style crash testing and page verification after replay |
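The "reject ambiguous records loudly" fix can be sketched as a validator that recovery calls before trusting any DWB copy. The record layout, `DwbRecord`, `DwbRecordRejected`, and `validate` are hypothetical, and `zlib.crc32` stands in for a real page checksum; PostgreSQL's own page checksum algorithm is different.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DwbRecord:
    rel: str
    block: int
    page_lsn: int
    fsync_gen: int  # generation of the DWB fsync that covered this record
    checksum: int
    page: bytes


class DwbRecordRejected(Exception):
    pass


def validate(record, expect_rel, expect_block, durable_gen, dest_page_lsn):
    """Return the page only if every check passes; otherwise raise.

    durable_gen: highest DWB fsync generation known to have completed.
    dest_page_lsn: LSN of an intact destination page, or None if it is torn.
    """
    if (record.rel, record.block) != (expect_rel, expect_block):
        raise DwbRecordRejected("identity mismatch")
    if zlib.crc32(record.page) != record.checksum:
        raise DwbRecordRejected("checksum mismatch")
    if record.fsync_gen > durable_gen:
        raise DwbRecordRejected("no proof of durable DWB flush")
    if dest_page_lsn is not None and record.page_lsn < dest_page_lsn:
        raise DwbRecordRejected("stale: destination already holds a newer page")
    return record.page
```

Raising instead of returning a default keeps the failure loud: a caller that cannot prove a record is durable and current has to fall back to another source or stop recovery.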
What to Do Next
- Problem: AI-generated systems code can preserve code shape while breaking the durability, scheduling, and recovery contracts underneath it.
- Solution: Review infrastructure patches by crash-state matrix first, then by code diff.
- Proof: A PostgreSQL DWB design is not credible until every page state between DWB write, DWB fsync, destination write, destination fsync, checkpoint, and slot reclaim has a verified recovery source.
- Action: This week, take one AI-generated infrastructure patch and write its hidden contract table: API call, assumed guarantee, actual guarantee, failure if the assumption is false.
The hard part of storage engineering is not making the second write happen; it is knowing exactly which copy the system is allowed to trust after the lights come back on.