The expensive part of torn-page protection is not the extra write; it is where the extra write lands: PostgreSQL’s Full Page Write puts the copy on the foreground Write-Ahead Log path, while InnoDB’s Doublewrite Buffer moves the copy into the background flush path.

Situation

Database durability still lives below the abstraction line most application engineers prefer to ignore. That works until a write-heavy system hits checkpoint pressure, latency doubles, and the answer is not a missing index but an 8 KB page being protected from a 4 KB failure.

PostgreSQL protects against torn pages with Full Page Write (FPW): after each checkpoint, the first modification of a data page writes the entire page image into the Write-Ahead Log (WAL). MySQL’s InnoDB protects against the same class of failure with a Doublewrite Buffer (DWB): dirty pages are first written to a dedicated area, synced, and only then written to their final data-file locations.

| Design | Protection copy lives in | Request path impact | Recovery behavior |
| --- | --- | --- | --- |
| PostgreSQL FPW | WAL stream | The first post-checkpoint dirtying of each page expands foreground WAL | Recovery restores the full page image from WAL, then replays later WAL records |
| InnoDB DWB | Doublewrite files | Dirty-page copy is paid by flush machinery, not directly by SQL execution | Recovery repairs torn data pages from the doublewrite copy |
| Atomic-write storage | Storage layer | Database may avoid software copy only if the whole stack actually guarantees page atomicity | Recovery depends on the storage contract being true |

PostgreSQL’s own documentation says full_page_writes writes the entire disk page to WAL on first modification after checkpoint and warns that turning it off can cause unrecoverable or silent corruption after failure. The MySQL 8.4 manual describes InnoDB’s doublewrite buffer as a storage area written before final data-file placement and notes that the large sequential write usually avoids doubling I/O operations one-for-one. See the PostgreSQL WAL settings documentation and MySQL InnoDB doublewrite documentation for the baseline behavior: PostgreSQL full_page_writes, MySQL 8.4 Doublewrite Buffer.

The Problem

A torn page is not a logical transaction problem. It is a physical write atomicity problem. PostgreSQL pages are normally 8 KB; MySQL InnoDB pages are commonly 16 KB; operating systems and devices often expose smaller practical atomic write units such as 4 KB sectors. If power loss or kernel failure interrupts a database page write, recovery may find a page that is half old and half new.
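
The failure can be reproduced in miniature. The sketch below is a toy model, not PostgreSQL code: it assumes an 8 KB page, a 4 KB atomic write unit, and uses CRC-32 as a stand-in for a page checksum (PostgreSQL's real page checksum is a different algorithm). The name `write_page` and the crash parameter are hypothetical.

```python
import zlib

PAGE_SIZE = 8192      # PostgreSQL default page size
SECTOR_SIZE = 4096    # assumed atomic write unit of the device


def write_page(disk: bytearray, offset: int, new_page: bytes, sectors_before_crash: int) -> None:
    """Write a page sector by sector, stopping early to simulate power loss."""
    for i in range(sectors_before_crash):
        start = i * SECTOR_SIZE
        disk[offset + start:offset + start + SECTOR_SIZE] = new_page[start:start + SECTOR_SIZE]


old_page = bytes([0xAA]) * PAGE_SIZE
new_page = bytes([0xBB]) * PAGE_SIZE

disk = bytearray(old_page)
write_page(disk, 0, new_page, sectors_before_crash=1)  # crash after the first 4 KB

on_disk = bytes(disk)
is_torn = on_disk != old_page and on_disk != new_page
first_half_new = on_disk[:SECTOR_SIZE] == new_page[:SECTOR_SIZE]
second_half_old = on_disk[SECTOR_SIZE:] == old_page[SECTOR_SIZE:]

# A checksum over the intended page no longer matches: this detects the tear,
# but it does not say how to rebuild the page.
checksum_matches = zlib.crc32(on_disk) == zlib.crc32(new_page)
print(is_torn, first_half_new, second_half_old, checksum_matches)
```

The page on disk is neither the old version nor the new one, which is exactly the state a repair copy (in WAL or in a doublewrite area) exists to fix.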

That matters because PostgreSQL WAL records are usually physiological: they identify a physical page, then describe a logical change inside it. If the page cannot be parsed after a crash, the redo record may not have a sane object to apply to. The PostgreSQL wiki explains the problem directly: recovery needs a readable page with valid structure before logical page changes can be replayed. See PostgreSQL wiki: Full page writes.

| Failure point | What breaks | Why it matters |
| --- | --- | --- |
| First dirty page after checkpoint in PostgreSQL 16, 17, or 18 | The WAL record may include an 8 KB full page image instead of only the logical change | Write-heavy workloads see WAL volume jump immediately after checkpoint |
| checkpoint_timeout too low, such as the documented minimum of 30 seconds | Pages become “first dirty after checkpoint” more often | Lower recovery distance increases foreground WAL amplification |
| max_wal_size too low under write load | PostgreSQL triggers size-driven checkpoints earlier than the time schedule | A workload can enter a loop of checkpoint, FPW surge, WAL growth, checkpoint |
| wal_compression=off with highly compressible page images | Full page images are stored without compression | The storage bill moves from CPU to WAL bandwidth; compression can help but adds CPU on WAL insert and replay |
| Data checksums enabled | With data checksums (or wal_log_hints) enabled, the first hint-bit-only change to a page after a checkpoint also writes a full page image to WAL | Checksums detect corruption; they do not remove the need for torn-page protection |
| Benchmark with full_page_writes=off | Throughput improves while the system is no longer protected against the same crash class | This is a measurement mode, not a production durability design |

PostgreSQL checkpoints are started by checkpoint_timeout or when max_wal_size is about to be exceeded. That means FPW makes checkpoint frequency a durability-performance coupling: shorter intervals reduce crash-recovery distance but increase the rate at which pages become eligible for full-page images again.
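
The coupling can be made concrete with a small model. This is a sketch under stated assumptions, not a measurement: the 100-byte record size, the uncompressed 8 KB image, the 500-page hot set, and the function name `wal_bytes` are all hypothetical, and checkpoints are counted in updates rather than seconds to keep the example deterministic.

```python
PAGE_SIZE = 8192          # bytes per full page image (uncompressed, assumed)
RECORD_SIZE = 100         # assumed size of an ordinary WAL record, in bytes


def wal_bytes(updates: list, checkpoint_every: int) -> int:
    """Total WAL bytes for a stream of page numbers, with a checkpoint
    every `checkpoint_every` updates. The first touch of a page after each
    checkpoint costs a full page image on top of the normal record."""
    imaged = set()
    total = 0
    for i, page in enumerate(updates):
        if i % checkpoint_every == 0:
            imaged.clear()              # checkpoint: every page is "first dirty" again
        total += RECORD_SIZE
        if page not in imaged:
            total += PAGE_SIZE
            imaged.add(page)
    return total


# Hypothetical workload: 10,000 updates cycling over a 500-page hot set.
updates = [i % 500 for i in range(10_000)]

frequent = wal_bytes(updates, checkpoint_every=1_000)   # 10 checkpoints
rare = wal_bytes(updates, checkpoint_every=10_000)      # 1 checkpoint
print(frequent, rare, round(frequent / rare, 2))        # 41960000 5096000 8.23
```

In this toy workload the logical change volume is identical in both runs; only the number of times each hot page becomes "first dirty" differs, and that alone multiplies WAL volume roughly eightfold.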

The core question is not whether FPW or DWB performs “two writes.” The question is whether the durability copy blocks the foreground commit path, or whether the system can batch it behind dirty-page flushing without weakening crash recovery.

Move Torn-Page Copies Off the Foreground Path

The right architecture is not “turn off full-page writes and hope the storage behaves.” The right architecture is to separate two responsibilities that FPW intentionally combines: WAL should preserve transaction order, while the torn-page protection copy should be paid by the page-flush path.

flowchart TD
    SQL[SQL transaction] --> Buffer[shared buffer page dirtied]
    Buffer --> WAL[WAL foreground path — logical record]
    Buffer --> Checkpoint[checkpoint boundary]
    Checkpoint --> FPW[PostgreSQL FPW — first dirty page image in WAL]
    Buffer --> Flusher[background dirty page flusher]
    Flusher --> DWB[Doublewrite area — sequential page copies]
    DWB --> Sync[fsync doublewrite area]
    Sync --> DataFiles[scatter write final data files]
    FPW --> Recovery[crash recovery — restore page then replay WAL]
    DataFiles --> Recovery
    DWB --> Recovery

The important distinction is scheduling. FPW pays the copy at WAL insertion time for the first page modification after checkpoint. DWB pays the copy when dirty pages leave the buffer pool. Both protect against torn pages; they do not put the pressure on the same queue.

  1. Keep WAL responsible for transaction ordering, not page-copy transport.

    In PostgreSQL, WAL must be flushed before dirty data pages reach durable storage. That ordering is non-negotiable. A DWB prototype should not weaken WAL-before-data; it should remove full page images from the normal WAL record path only when the doublewrite mechanism can guarantee a complete repair copy before final page placement.

    Verification: crash after WAL flush but before final data-file write; recovery must replay WAL without reading an unrecoverable torn page.

  2. Insert a doublewrite stage into the dirty-page flush path.

    The flush path should write dirty buffers into a sequential doublewrite area, force that area durable, then write the same pages to their final relation files. The doublewrite area needs enough metadata to map page identity back to relation fork and block number after restart.

    Verification: force a partial final data-file page write and confirm restart repairs it from the doublewrite copy before normal redo continues.

  3. Preserve checkpoint semantics explicitly.

    A checkpoint cannot simply assume pages are safe because they were scheduled for writeback. It needs a durable boundary: either the final page reached storage intact, or the doublewrite copy did. Otherwise the checkpoint can advertise a recovery point that depends on a page image which exists only in kernel cache.

    Verification: kill the postmaster during checkpoint completion, restart, and verify that checkpoint redo location never advances past unprotected dirty pages.

  4. Measure WAL bytes, data-file bytes, fsync latency, and tail latency separately.

    A DWB design can reduce foreground WAL pressure while increasing background writeback pressure. That is a good trade only if latency-critical SQL stops waiting and the background system does not fall behind. Use pg_current_wal_lsn() deltas, the wal_fpi and wal_bytes counters in pg_stat_wal (PostgreSQL 14 and later), pg_stat_bgwriter (checkpoint counters moved to pg_stat_checkpointer in PostgreSQL 17), pg_stat_io in PostgreSQL 16 and later, filesystem writeback metrics, and storage latency histograms.

    Verification: compare p50, p95, and p99 transaction latency across checkpoint_timeout, max_wal_size, and shared_buffers, not only aggregate transactions per second.

  5. Treat AI-assisted kernel work as scaffolding, not proof.

    Zongzhi Chen’s 2026 experiment reported a PostgreSQL prototype where Claude Code helped replace FPW with a DWB-style mechanism, with DWB outperforming FPW in an I/O-bound pgbench workload. That is an interesting engineering signal, especially because the patch touches real storage-engine paths. It is not enough to declare the design production-safe. Storage bugs are excellent at passing normal tests and failing only when the machine dies at precisely the wrong time. See the source experiment here: Zongzhi Chen, 2026.

    Verification: run crash-restart loops with forced partial writes, checksum validation, logical consistency checks, and comparisons against a known-good source.
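
The ordering in steps 2 and 3 can be sketched as a toy simulation. Everything here is an illustrative assumption rather than the prototype's code: the in-memory `state` dictionary stands in for storage, CRC-32 stands in for a slot checksum, the doublewrite write and its fsync are collapsed into one step, and WAL redo is not modeled at all. The names `flush_with_doublewrite`, `recover`, and the crash labels are hypothetical.

```python
import zlib
from typing import Optional

PAGE = 8192
HALF = PAGE // 2   # assumed atomic write unit (4 KB)


class Crash(Exception):
    """Raised to simulate power loss at a chosen step."""


def flush_with_doublewrite(state: dict, block: int, new_page: bytes,
                           crash_at: Optional[str]) -> None:
    """Flush one dirty page: doublewrite copy first, then final placement."""
    crc = zlib.crc32(new_page)
    if crash_at == "during_dwb":
        state["dwb"] = (block, new_page[:HALF] + bytes(HALF), crc)  # torn repair copy
        raise Crash
    state["dwb"] = (block, new_page, crc)   # sequential write + fsync, modeled as one step
    if crash_at == "during_final":
        old = state["data"][block]
        state["data"][block] = new_page[:HALF] + old[HALF:]         # torn final page
        raise Crash
    state["data"][block] = new_page


def recover(state: dict) -> None:
    """Repair the data page from the doublewrite copy, if the copy is intact."""
    if state["dwb"] is None:
        return
    block, copy, crc = state["dwb"]
    if zlib.crc32(copy) == crc:
        state["data"][block] = copy
    # An incomplete copy is ignored: the final write never started, so the
    # data page still holds the old, intact version for redo to work on.


old_page, new_page = b"\x01" * PAGE, b"\x02" * PAGE
results = {}
for crash_at in (None, "during_dwb", "during_final"):
    state = {"data": {7: old_page}, "dwb": None}
    try:
        flush_with_doublewrite(state, 7, new_page, crash_at)
    except Crash:
        pass
    recover(state)
    results[crash_at] = state["data"][7]
```

After recovery, every crash point leaves block 7 as either the complete old page or the complete new page, never a mix. A real implementation would need the same property to hold across many pages, real fsync semantics, and the checkpoint boundary described in step 3.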

In Practice

The documented PostgreSQL pattern is that FPW is checkpoint-coupled. The PostgreSQL documentation states that the first modification of a page after checkpoint writes the full page image to WAL, and that increasing checkpoint interval parameters can reduce that cost. That is not an implementation footnote; it is the operational reason write latency often worsens around checkpoint-heavy workloads.
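
One way to observe that coupling is to sample pg_current_wal_lsn() before and after the same fixed batch of writes, once long after a checkpoint and once immediately after one, and compare the byte deltas. The helper below is a minimal sketch that converts the textual LSN form (two hexadecimal numbers separated by a slash, high 32 bits then low 32 bits) into a byte position; the sample LSN values are hypothetical, and PostgreSQL's own pg_wal_lsn_diff() does the same subtraction server-side.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN in 'hi/lo' hex text form to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_bytes_between(start_lsn: str, end_lsn: str) -> int:
    """Bytes of WAL generated between two pg_current_wal_lsn() samples."""
    return lsn_to_int(end_lsn) - lsn_to_int(start_lsn)


# Hypothetical samples taken around the same fixed-size write batch.
warm = wal_bytes_between("0/3000000", "0/3400000")            # long after the last checkpoint
after_checkpoint = wal_bytes_between("0/5000000", "0/6C00000")  # right after CHECKPOINT

print(warm, after_checkpoint, after_checkpoint / warm)        # 4194304 29360128 7.0
```

If the ratio between the two deltas is large for an identical batch, full page images are the likely cause, and the wal_fpi counter in pg_stat_wal can confirm it.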

| Documented behavior | Production implication | Validation signal |
| --- | --- | --- |
| full_page_writes=on is the default in PostgreSQL and protects against partially completed page writes | Disabling it for throughput changes the crash-safety contract | SHOW full_page_writes; must be treated as a durability check, not a tuning curiosity |
| Full page images occur on first page modification after checkpoint | Checkpoint cadence directly affects WAL amplification | WAL growth should be measured before and after CHECKPOINT under the same write workload |
| wal_compression can compress full page images with pglz, lz4, or zstd when compiled in | Compression shifts cost from WAL bandwidth to CPU and replay decompression | Compare WAL bytes and CPU saturation with each compression method |
| pg_checksums can verify checksums offline when checksums are enabled | Checksums detect page corruption; they do not repair missing torn-page protection by themselves | Restart, stop cleanly, run pg_checksums --check against the cluster |
| InnoDB DWB writes pages to doublewrite files before final placement | InnoDB pays an extra page-copy step outside the user transaction’s immediate WAL insert path | Monitor page cleaner activity, doublewrite files, fsync latency, and data-file writeback |

The documented InnoDB pattern is different. MySQL 8.4 says InnoDB writes flushed buffer-pool pages to doublewrite storage before writing to final data files, and crash recovery can use the doublewrite copy if the final page write was interrupted. The same documentation also says data is written twice, but not necessarily at twice the I/O operation cost, because the doublewrite write is a large sequential chunk with a single fsync() in normal configurations.

That distinction is the architecture lesson. Equal total bytes do not imply equal user-visible latency. A foreground WAL write competes with commit progress. A background doublewrite stage competes with page flushing, eviction, checkpoint completion, and storage bandwidth. Both queues can saturate; they fail differently.

The source experiment’s reported pgbench numbers are consistent with this mechanism. In the reported write-only 128-thread result, FPW-on delivered 14,857 transactions per second, while the DWB prototype delivered 33,814 transactions per second. The interesting result is not “DWB is 2.3x faster” as a universal claim. The interesting result is that moving the copy away from foreground WAL changed where the bottleneck surfaced.

For production builders, the deeper lesson is about validation. A storage-engine change is not proven by a five-minute pgbench run. It needs a crash matrix.

| Test class | What it proves | Minimum bar |
| --- | --- | --- |
| Forced partial final-page write | DWB can repair a torn data page | Inject half-page writes and confirm recovery restores the page |
| Crash after doublewrite sync before final scatter write | Durable repair copy exists before final placement | Restart must complete without checksum failure |
| Crash during doublewrite write | Recovery ignores incomplete doublewrite entries | Restart must not restore from a corrupt doublewrite slot |
| Checkpoint boundary crash | Recovery point is not advanced beyond protected pages | Repeated kill during checkpoint must preserve logical contents |
| Replica and backup interaction | WAL stream remains sufficient for replicas and point-in-time recovery expectations | Physical replica, base backup, and restore tests must pass |
| Device diversity | Sequential-write assumptions hold on real storage | Test local NVMe, network-attached block storage, and throttled cloud volumes |

I have not run this PostgreSQL DWB prototype at scale personally. The documented failure mode is clear anyway: if a DWB design acknowledges a checkpoint or allows final data-file writes before the repair copy is durable, it can create a database that looks faster until the first badly timed crash. That is the least charming kind of benchmark.

Where It Breaks

| Failure mode | Trigger | Fix |
| --- | --- | --- |
| Doublewrite area becomes the new bottleneck | High dirty-page churn with shared_buffers large enough to delay eviction, then sudden checkpoint pressure | Size the doublewrite area for flush bursts; track fsync latency and dirty buffer age |
| Recovery restores the wrong page version | Doublewrite metadata does not encode relation identity, fork, block number, and page LSN safely | Treat DWB metadata as recovery-critical; checksum the slot header and page body |
| Checkpoint completes too early | Prototype marks pages safe after scheduling writeback instead of after durable doublewrite or durable final write | Checkpoint accounting must wait for a durable protection point |
| Cloud block storage reorders or stalls writes | Network-attached volumes with variable latency and opaque cache behavior | Test under the actual storage class; do not extrapolate from local NVMe |
| WAL compression already solves enough of the pain | PostgreSQL workload has compressible full page images and CPU headroom | Benchmark wal_compression=zstd or lz4 before changing storage architecture |
| Full-page images help replica recovery behavior | Large working sets where WAL page images reduce random data-page reads during replay | Measure replica replay lag and recovery prefetch behavior, not only primary throughput |
| DWB increases write amplification under cold churn | Workload dirties pages once and evicts them without repeated updates | Compare physical bytes written per committed transaction across FPW and DWB |
| AI-generated kernel patch misses crash edge cases | Normal regression tests pass because they rarely interrupt I/O at durability boundaries | Add fault injection, checksum validation, crash loops, and page-level corruption tests |
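
The metadata fix in the table ("checksum the slot header and page body") could look roughly like the sketch below. The slot layout is entirely hypothetical: the field widths, the little-endian packing, the use of CRC-32, and the names `encode_slot`, `decode_slot`, and `should_restore` are assumptions for illustration, not InnoDB's on-disk format or the prototype's.

```python
import struct
import zlib
from typing import Optional, Tuple

PAGE = 8192
# Hypothetical slot header: relation OID, fork number, block number, page LSN, body CRC.
HEADER = struct.Struct("<IIIQI")


def encode_slot(rel: int, fork: int, block: int, lsn: int, page: bytes) -> bytes:
    """Serialize a doublewrite slot: header, header CRC, then the page body."""
    header = HEADER.pack(rel, fork, block, lsn, zlib.crc32(page))
    return header + struct.pack("<I", zlib.crc32(header)) + page


def decode_slot(slot: bytes) -> Optional[Tuple[int, int, int, int, bytes]]:
    """Return (rel, fork, block, lsn, page), or None if the slot is incomplete or corrupt."""
    if len(slot) != HEADER.size + 4 + PAGE:
        return None
    header, rest = slot[:HEADER.size], slot[HEADER.size:]
    (header_crc,) = struct.unpack("<I", rest[:4])
    page = rest[4:]
    if zlib.crc32(header) != header_crc:
        return None                      # page identity cannot be trusted
    rel, fork, block, lsn, body_crc = HEADER.unpack(header)
    if zlib.crc32(page) != body_crc:
        return None                      # torn or corrupt repair copy
    return rel, fork, block, lsn, page


def should_restore(slot_lsn: int, data_page_ok: bool, data_page_lsn: int) -> bool:
    """Restore when the data page is torn, or intact but older than the slot.
    An intact data page that is newer than the slot means the slot is stale."""
    return (not data_page_ok) or data_page_lsn < slot_lsn


slot = encode_slot(rel=16384, fork=0, block=42, lsn=0x16B374D848, page=b"\x07" * PAGE)
decoded = decode_slot(slot)
torn = decode_slot(slot[:-4096] + bytes(4096))   # second half of the body never landed
```

Here `decoded` carries the full page identity back out, while `torn` is rejected instead of being restored over a data page. The LSN comparison is one possible guard against the "wrong page version" failure; a real design would need to justify it against its own write ordering.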

What to Do Next

  • Problem: Treating all durability writes as equivalent hides the queue that users actually wait on.
  • Solution: Keep transaction ordering in WAL, but move torn-page repair copies to a durable background flush mechanism when the storage engine can prove the ordering.
  • Proof: A credible result is not one pgbench chart; it is lower foreground WAL amplification plus successful crash recovery across forced partial writes and checkpoint-boundary failures.
  • Action: This week, measure your PostgreSQL WAL growth around CHECKPOINT with full_page_writes=on, test wal_compression, and record p95 commit latency alongside pg_stat_bgwriter and pg_stat_io.
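
For the latency part of that action, a minimal sketch of why percentiles belong next to averages. The sample data is invented (97 fast commits and 3 that stall behind a checkpoint), and `percentile` is a plain nearest-rank helper written for this example, not a library function.

```python
import math
import statistics


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical commit latencies in milliseconds.
latencies = [2.0] * 97 + [80.0] * 3

mean = statistics.fmean(latencies)
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
print(round(mean, 2), p50, p95, p99)   # 4.34 2.0 2.0 80.0
```

In this made-up sample the mean and even p95 look healthy while p99 shows the stall, which is why it helps to record several percentiles per checkpoint interval rather than a single aggregate.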

A storage engine is allowed to be faster only after it has earned the right to crash badly and come back boring.