Torn Page Protection Belongs Off the Foreground Path

The expensive part of torn-page protection is not the extra write; it is where the extra write lands. PostgreSQL’s Full Page Write puts the copy on the foreground Write-Ahead Log path, while InnoDB’s Doublewrite Buffer moves the copy into the background flush path.

Situation

Database durability still lives below the abstraction line most application engineers prefer to ignore. That works until a write-heavy system hits checkpoint pressure, latency doubles, and the answer is not a missing index but an 8 KB page being protected from a 4 KB failure.

PostgreSQL protects against torn pages with Full Page Write (FPW): after each checkpoint, the first modification of a data page writes the entire page image into the Write-Ahead Log (WAL). MySQL’s InnoDB protects against the same class of failure with a Doublewrite Buffer (DWB): dirty pages are first written to a dedicated area, synced, then written to their final data-file locations.

Design	Protection copy lives in	Request path impact	Recovery behavior
PostgreSQL FPW	WAL stream	The first post-checkpoint dirtying of each page expands foreground WAL	Recovery restores the full page image from WAL, then replays later WAL records
InnoDB DWB	Doublewrite files	Dirty-page copy is paid by flush machinery, not directly by SQL execution	Recovery repairs torn data pages from the doublewrite copy
Atomic-write storage	Storage layer	Database may avoid software copy only if the whole stack actually guarantees page atomicity	Recovery depends on the storage contract being true

PostgreSQL’s own documentation says full_page_writes writes the entire disk page to WAL on first modification after checkpoint and warns that turning it off can cause unrecoverable or silent corruption after failure. The MySQL 8.4 manual describes InnoDB’s doublewrite buffer as a storage area written before final data-file placement and notes that the large sequential write usually avoids doubling I/O operations one-for-one. See the PostgreSQL WAL settings documentation and MySQL InnoDB doublewrite documentation for the baseline behavior: PostgreSQL full_page_writes, MySQL 8.4 Doublewrite Buffer.

The Problem

A torn page is not a logical transaction problem. It is a physical write atomicity problem. PostgreSQL pages are normally 8 KB; MySQL InnoDB pages are commonly 16 KB; operating systems and devices often expose smaller practical atomic write units such as 4 KB sectors. If power loss or kernel failure interrupts a database page write, recovery may find a page that is half old and half new.

That matters because PostgreSQL WAL records are usually physiological: they identify a physical page, then describe a logical change inside it. If the page cannot be parsed after a crash, the redo record may not have a sane object to apply to. The PostgreSQL wiki explains the problem directly: recovery needs a readable page with valid structure before logical page changes can be replayed (see PostgreSQL wiki: Full page writes).
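A torn write is easy to picture in a few lines of Python. This is a minimal sketch, assuming a 4 KB atomic write unit and using CRC32 as a stand-in for a real page checksum (PostgreSQL uses its own algorithm); `torn_write` is a hypothetical helper, not a database API.

```python
import zlib

PAGE_SIZE = 8192     # PostgreSQL default block size
SECTOR_SIZE = 4096   # assumed atomic write unit of the device


def checksum(page: bytes) -> int:
    # CRC32 is only a stand-in for a real page checksum.
    return zlib.crc32(page)


def torn_write(old: bytes, new: bytes, sectors_completed: int) -> bytes:
    """Return the on-disk image if only the first N sectors of `new` landed."""
    cut = sectors_completed * SECTOR_SIZE
    return new[:cut] + old[cut:]


old_page = bytes([0xAA]) * PAGE_SIZE
new_page = bytes([0xBB]) * PAGE_SIZE

# Power is lost after the first 4 KB sector of an 8 KB page write.
on_disk = torn_write(old_page, new_page, sectors_completed=1)

first_half_is_new = on_disk[:SECTOR_SIZE] == new_page[:SECTOR_SIZE]
second_half_is_old = on_disk[SECTOR_SIZE:] == old_page[SECTOR_SIZE:]
matches_either_version = checksum(on_disk) in (checksum(old_page), checksum(new_page))

print(first_half_is_new, second_half_is_old, matches_either_version)
```

The resulting image is neither the old page nor the new one, which is why a redo record that says "change this tuple inside the page" has nothing trustworthy to apply to.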

Failure point	What breaks	Why it matters
First dirty page after checkpoint in PostgreSQL 16, 17, or 18	The WAL record may include an 8 KB full page image instead of only the logical change	Write-heavy workloads see WAL volume jump immediately after checkpoint
`checkpoint_timeout` too low, such as the documented minimum of 30 seconds	Pages become “first dirty after checkpoint” more often	Lower recovery distance increases foreground WAL amplification
`max_wal_size` too low under write load	PostgreSQL triggers size-driven checkpoints earlier than the time schedule	A workload can enter a loop of checkpoint, FPW surge, WAL growth, checkpoint
`wal_compression=off` with highly compressible page images	Full page images are stored without compression	The storage bill moves from CPU to WAL bandwidth; compression can help but adds CPU on WAL insert and replay
Data checksums enabled	With checksums on, hint-bit updates are WAL-logged, so the first hint-bit-only change to a page after a checkpoint also emits a full page image	Checksums detect corruption; they do not remove the need for torn-page protection
Benchmark with `full_page_writes=off`	Throughput improves while the system is no longer protected against the same crash class	This is a measurement mode, not a production durability design

PostgreSQL checkpoints are started by checkpoint_timeout or when max_wal_size is about to be exceeded. That means FPW makes checkpoint frequency a durability-performance coupling: shorter intervals reduce crash-recovery distance but increase the rate at which pages become eligible for full-page images again.
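The coupling can be made concrete with a rough model. The sketch below assumes updates land uniformly at random on a fixed hot set of pages, and it ignores WAL record headers, full-page-image hole removal, and compression. Every number in it is a hypothetical input, not a measurement.

```python
PAGE_SIZE = 8192


def expected_distinct_pages(hot_pages: int, updates: int) -> float:
    """Expected number of distinct pages hit by `updates` uniform random updates."""
    return hot_pages * (1.0 - (1.0 - 1.0 / hot_pages) ** updates)


def wal_bytes_per_second(hot_pages, updates_per_sec, record_bytes, checkpoint_secs):
    """Average WAL rate when each distinct page pays one full page image per interval."""
    updates = updates_per_sec * checkpoint_secs
    fpi_bytes = expected_distinct_pages(hot_pages, updates) * PAGE_SIZE
    record_total = updates * record_bytes
    return (fpi_bytes + record_total) / checkpoint_secs


# Hypothetical workload: 100,000 hot pages, 2,000 updates/s, 100-byte records.
for secs in (30, 300, 900):
    rate = wal_bytes_per_second(100_000, 2_000, 100, secs)
    print(f"checkpoint every {secs:>3}s: about {rate / 1e6:.1f} MB/s of WAL")
```

In this model the logical record rate stays constant while the full-page-image share shrinks as the interval grows, because each page pays for its image once per interval regardless of how many times it is updated afterwards.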

The core question is not whether FPW or DWB performs “two writes.” The question is whether the durability copy blocks the foreground commit path, or whether the system can batch it behind dirty-page flushing without weakening crash recovery.

Move Torn-Page Copies Off the Foreground Path

The right architecture is not “turn off full-page writes and hope the storage behaves.” The right architecture is to separate two responsibilities that FPW intentionally combines: WAL should preserve transaction order, while the torn-page protection copy should be paid by the page-flush path.

flowchart TD
    SQL[SQL transaction] --> Buffer[shared buffer page dirtied]
    Buffer --> WAL[WAL foreground path — logical record]
    Buffer --> Checkpoint[checkpoint boundary]
    Checkpoint --> FPW[PostgreSQL FPW — first dirty page image in WAL]
    Buffer --> Flusher[background dirty page flusher]
    Flusher --> DWB[Doublewrite area — sequential page copies]
    DWB --> Sync[fsync doublewrite area]
    Sync --> DataFiles[scatter write final data files]
    FPW --> Recovery[crash recovery — restore page then replay WAL]
    DataFiles --> Recovery
    DWB --> Recovery

The important distinction is scheduling. FPW pays the copy at WAL insertion time for the first page modification after checkpoint. DWB pays the copy when dirty pages leave the buffer pool. Both protect against torn pages; they do not put the pressure on the same queue.

Keep WAL responsible for transaction ordering, not page-copy transport.

In PostgreSQL, WAL must be flushed before dirty data pages reach durable storage. That ordering is non-negotiable. A DWB prototype should not weaken WAL-before-data; it should remove full page images from the normal WAL record path only when the doublewrite mechanism can guarantee a complete repair copy before final page placement.

Verification: crash after WAL flush but before final data-file write; recovery must replay WAL without reading an unrecoverable torn page.
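The WAL-before-data rule itself is small enough to sketch. This is an illustrative model, not PostgreSQL's buffer manager: `Wal`, `Buffer`, and `flush_buffer` are hypothetical names, and the "flush" only records an event instead of calling fsync.

```python
from dataclasses import dataclass

events = []  # ordered trace of what reached "durable storage" in this model


class Wal:
    def __init__(self, flushed_lsn: int = 0):
        self.flushed_lsn = flushed_lsn

    def flush(self, upto: int) -> None:
        # Stand-in for forcing the log to disk up to the given LSN.
        if upto > self.flushed_lsn:
            self.flushed_lsn = upto
            events.append(("wal_flush", upto))


@dataclass
class Buffer:
    block: int
    page_lsn: int  # LSN of the newest WAL record that changed this page


def flush_buffer(buf: Buffer, wal: Wal) -> None:
    # WAL-before-data: the log must be durable up to the page's LSN
    # before the page itself is allowed to leave the buffer pool.
    if buf.page_lsn > wal.flushed_lsn:
        wal.flush(buf.page_lsn)
    events.append(("page_write", buf.block))


wal = Wal(flushed_lsn=100)
flush_buffer(Buffer(block=7, page_lsn=250), wal)  # forces a WAL flush first
flush_buffer(Buffer(block=8, page_lsn=120), wal)  # WAL already covers LSN 120
print(events)
```

A doublewrite stage would slot in where the `page_write` event is recorded; the LSN check in front of it stays exactly as it is.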

Insert a doublewrite stage into the dirty-page flush path.

The flush path should write dirty buffers into a sequential doublewrite area, force that area durable, then write the same pages to their final relation files. The doublewrite area needs enough metadata to map page identity back to relation fork and block number after restart.

Verification: force a partial final data-file page write and confirm restart repairs it from the doublewrite copy before normal redo continues.
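The flush and repair sequence can be sketched against ordinary files. This is a minimal sketch under loud assumptions: one relation file, a hypothetical slot layout (relation id, fork, block, page LSN, CRC32), and a demo page format whose last four bytes hold a CRC32. None of this is PostgreSQL's or InnoDB's on-disk format.

```python
import os
import struct
import tempfile
import zlib

PAGE_SIZE = 8192
META = struct.Struct("<IIIQ")   # rel_id, fork, block, page_lsn (hypothetical slot layout)
CRC = struct.Struct("<I")       # CRC32 over metadata + page body
SLOT_SIZE = META.size + CRC.size + PAGE_SIZE


def make_page(fill: int) -> bytes:
    """Demo page whose last 4 bytes hold a CRC32 of the rest (not a real page format)."""
    body = bytes([fill]) * (PAGE_SIZE - 4)
    return body + CRC.pack(zlib.crc32(body))


def page_is_valid(page: bytes) -> bool:
    return len(page) == PAGE_SIZE and CRC.unpack(page[-4:])[0] == zlib.crc32(page[:-4])


def encode_slot(rel_id, fork, block, lsn, page):
    meta = META.pack(rel_id, fork, block, lsn)
    return meta + CRC.pack(zlib.crc32(meta + page)) + page


def decode_slot(raw):
    """Return (rel_id, fork, block, lsn, page), or None for a short or corrupt slot."""
    if len(raw) != SLOT_SIZE:
        return None
    meta, crc, page = raw[:META.size], raw[META.size:-PAGE_SIZE], raw[-PAGE_SIZE:]
    if CRC.unpack(crc)[0] != zlib.crc32(meta + page):
        return None
    return (*META.unpack(meta), page)


def flush_batch(dw_path, data_path, pages):
    """pages: list of (rel_id, fork, block, lsn, page_bytes) for one relation file."""
    with open(dw_path, "wb") as dw:            # 1. sequential doublewrite copies
        for entry in pages:
            dw.write(encode_slot(*entry))
        dw.flush()
        os.fsync(dw.fileno())                  # 2. repair copies are durable
    with open(data_path, "r+b") as data:       # 3. scatter write to final locations
        for _rel, _fork, block, _lsn, page in pages:
            data.seek(block * PAGE_SIZE)
            data.write(page)
        data.flush()
        os.fsync(data.fileno())


def repair_from_doublewrite(dw_path, data_path):
    """Restore invalid final pages from valid doublewrite slots; return repaired blocks."""
    repaired = []
    with open(dw_path, "rb") as dw, open(data_path, "r+b") as data:
        while raw := dw.read(SLOT_SIZE):
            slot = decode_slot(raw)
            if slot is None:
                continue                       # ignore incomplete or corrupt slots
            _rel, _fork, block, _lsn, page = slot
            data.seek(block * PAGE_SIZE)
            if not page_is_valid(data.read(PAGE_SIZE)):
                data.seek(block * PAGE_SIZE)
                data.write(page)
                repaired.append(block)
        data.flush()
        os.fsync(data.fileno())
    return repaired


with tempfile.TemporaryDirectory() as tmp:
    dw_path, data_path = os.path.join(tmp, "dw"), os.path.join(tmp, "rel")
    old, new = make_page(0xAA), make_page(0xBB)
    with open(data_path, "wb") as f:
        f.write(old * 4)                       # four intact old pages
    flush_batch(dw_path, data_path, [(1, 0, 1, 200, new), (1, 0, 3, 201, new)])
    with open(data_path, "r+b") as f:          # simulate a torn final write of block 3
        f.seek(3 * PAGE_SIZE)
        f.write(new[:4096] + old[4096:])
    repaired = repair_from_doublewrite(dw_path, data_path)
    with open(data_path, "rb") as f:
        f.seek(3 * PAGE_SIZE)
        block3_after = f.read(PAGE_SIZE)
    print("repaired blocks:", repaired)
```

The ordering is the whole point: the doublewrite file is fsynced before any final location is touched, so a torn final write always has an intact copy behind it. A real implementation would also have to run this repair before redo starts and key slots by relation, not only by block number.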

Preserve checkpoint semantics explicitly.

A checkpoint cannot simply assume pages are safe because they were scheduled for writeback. It needs a durable boundary: either the final page reached storage intact, or the doublewrite copy did. Otherwise the checkpoint can advertise a recovery point that depends on a page image which exists only in kernel cache.

Verification: kill the postmaster during checkpoint completion, restart, and verify that checkpoint redo location never advances past unprotected dirty pages.
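One way to express that boundary is as explicit per-page state. The sketch below is a toy accounting model with hypothetical names (`Protection`, `Checkpoint`); it only shows that "scheduled for writeback" must not count as protected. It also assumes a doublewrite slot is never reused before the matching final write is durable, which a real design would have to enforce.

```python
from enum import Enum


class Protection(Enum):
    DIRTY = "only in shared buffers"
    SCHEDULED = "write issued, may still sit in kernel cache"
    DW_DURABLE = "doublewrite copy fsynced"
    FINAL_DURABLE = "final data-file write fsynced"


# Only these states may count toward checkpoint completion.
DURABLE = {Protection.DW_DURABLE, Protection.FINAL_DURABLE}


class Checkpoint:
    def __init__(self, redo_lsn, dirty_blocks):
        self.redo_lsn = redo_lsn
        # Every page dirty at checkpoint start begins unprotected.
        self.pages = {block: Protection.DIRTY for block in dirty_blocks}

    def mark(self, block, state):
        self.pages[block] = state

    def unprotected(self):
        return sorted(b for b, s in self.pages.items() if s not in DURABLE)

    def can_complete(self):
        return not self.unprotected()


ckpt = Checkpoint(redo_lsn=5000, dirty_blocks=[3, 9])
ckpt.mark(3, Protection.FINAL_DURABLE)
ckpt.mark(9, Protection.SCHEDULED)      # issued, but not yet durable anywhere
blocked_on = ckpt.unprotected()         # checkpoint must keep waiting on these
ckpt.mark(9, Protection.DW_DURABLE)
done = ckpt.can_complete()
print(blocked_on, done)
```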

Measure WAL bytes, data-file bytes, fsync latency, and tail latency separately.

A DWB design can reduce foreground WAL pressure while increasing background writeback pressure. That is a good trade only if latency-critical SQL stops waiting and the background system does not fall behind. Use pg_current_wal_lsn() deltas, the wal_fpi and wal_bytes counters in pg_stat_wal, pg_stat_bgwriter (checkpoint counters moved to pg_stat_checkpointer in PostgreSQL 17), pg_stat_io (PostgreSQL 16 and later), filesystem writeback metrics, and storage latency histograms.

Verification: compare p50, p95, and p99 transaction latency across checkpoint_timeout, max_wal_size, and shared_buffers, not only aggregate transactions per second.
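Two pieces of that measurement are plain arithmetic. The sketch below turns two `pg_current_wal_lsn()` readings (text form `hi/lo`, both hexadecimal) into a byte delta and computes nearest-rank percentiles over latency samples. The helper names and sample values are illustrative; on the server side, `pg_wal_lsn_diff()` does the same subtraction.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/16B3740' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_bytes_between(start_lsn: str, end_lsn: str) -> int:
    return lsn_to_int(end_lsn) - lsn_to_int(start_lsn)


def percentile(samples, pct: int):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Hypothetical readings taken before and after a one-minute window.
wal_written = wal_bytes_between("3/A0000000", "3/C8000000")
print(f"WAL written: {wal_written / 2**20:.0f} MiB")

# Hypothetical commit latencies in milliseconds.
latencies_ms = [2, 3, 3, 4, 4, 5, 5, 6, 40, 180]
for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

Taking the LSN delta on each side of a `CHECKPOINT` under the same workload isolates the full-page-image surge, and the percentiles show whether it reaches commit latency.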

Treat AI-assisted kernel work as scaffolding, not proof.

Zongzhi Chen’s 2026 experiment reported a PostgreSQL prototype where Claude Code helped replace FPW with a DWB-style mechanism, with DWB outperforming FPW in an I/O-bound pgbench workload. That is interesting engineering signal, especially because the patch touches real storage-engine paths. It is not enough to declare the design production-safe. Storage bugs are excellent at passing normal tests and failing only when the machine dies at precisely the wrong time. See the source experiment here: Zongzhi Chen, 2026.

Verification: run crash-restart loops with forced partial writes, checksum validation, logical consistency checks, and comparisons against a known-good source.

In Practice

The documented PostgreSQL pattern is that FPW is checkpoint-coupled. The PostgreSQL documentation states that the first modification of a page after checkpoint writes the full page image to WAL, and that increasing checkpoint interval parameters can reduce that cost. That is not an implementation footnote; it is the operational reason write latency often worsens around checkpoint-heavy workloads.

Documented behavior	Production implication	Validation signal
`full_page_writes=on` is the default in PostgreSQL and protects against partially completed page writes	Disabling it for throughput changes the crash-safety contract	`SHOW full_page_writes;` must be treated as a durability check, not a tuning curiosity
Full page images occur on first page modification after checkpoint	Checkpoint cadence directly affects WAL amplification	WAL growth should be measured before and after `CHECKPOINT` under the same write workload
`wal_compression` can compress full page images with `pglz`, `lz4`, or `zstd` when compiled in	Compression shifts cost from WAL bandwidth to CPU and replay decompression	Compare WAL bytes and CPU saturation with each compression method
`pg_checksums` can verify checksums offline when checksums are enabled	Checksums detect page corruption; they do not repair missing torn-page protection by themselves	Restart, stop cleanly, run `pg_checksums --check` against the cluster
InnoDB DWB writes pages to doublewrite files before final placement	InnoDB pays an extra page-copy step outside the user transaction’s immediate redo-log write path	Monitor page cleaner activity, doublewrite files, fsync latency, and data-file writeback

The documented InnoDB pattern is different. MySQL 8.4 says InnoDB writes flushed buffer-pool pages to doublewrite storage before writing to final data files, and crash recovery can use the doublewrite copy if the final page write was interrupted. The same documentation also says data is written twice, but not necessarily at twice the I/O operation cost, because the doublewrite write is a large sequential chunk with a single fsync() in normal configurations.

That distinction is the architecture lesson. Equal total bytes do not imply equal user-visible latency. A foreground WAL write competes with commit progress. A background doublewrite stage competes with page flushing, eviction, checkpoint completion, and storage bandwidth. Both queues can saturate; they fail differently.

The source experiment’s reported pgbench numbers are consistent with this mechanism. In the reported write-only 128-thread result, FPW-on delivered 14,857 transactions per second, while the DWB prototype delivered 33,814 transactions per second. The interesting result is not “DWB is 2.3x faster” as a universal claim. The interesting result is that moving the copy away from foreground WAL changed where the bottleneck surfaced.

For production builders, the deeper lesson is about validation. A storage-engine change is not proven by a five-minute pgbench run. It needs a crash matrix.

Test class	What it proves	Minimum bar
Forced partial final-page write	DWB can repair a torn data page	Inject half-page writes and confirm recovery restores the page
Crash after doublewrite sync before final scatter write	Durable repair copy exists before final placement	Restart must complete without checksum failure
Crash during doublewrite write	Recovery ignores incomplete doublewrite entries	Restart must not restore from a corrupt doublewrite slot
Checkpoint boundary crash	Recovery point is not advanced beyond protected pages	Repeated kill during checkpoint must preserve logical contents
Replica and backup interaction	WAL stream remains sufficient for replicas and point-in-time recovery expectations	Physical replica, base backup, and restore tests must pass
Device diversity	Sequential-write assumptions hold on real storage	Test local NVMe, network-attached block storage, and throttled cloud volumes
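The first three rows of the matrix can be rehearsed in memory before any real fault injection exists. This is a toy model under stated assumptions: a page is a `(lsn, value)` pair, `TORN` stands for any image that fails validation, and the crash points are hypothetical labels. It checks the recovery logic only and says nothing about real device behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    lsn: int
    value: str


TORN = None  # stands for a page or slot image that fails validation

CRASH_POINTS = ["during_dw_write", "after_dw_sync", "during_final_write", "no_crash"]


def flush_with_crash(data, dw, block, new_page, crash_at):
    """Doublewrite flush of one page, interrupted at the named point."""
    if crash_at == "during_dw_write":
        dw[block] = TORN            # incomplete doublewrite slot, data file untouched
        return
    dw[block] = new_page            # doublewrite copy written and synced
    if crash_at == "after_dw_sync":
        return
    if crash_at == "during_final_write":
        data[block] = TORN          # torn final page
        return
    data[block] = new_page


def recover(data, dw, wal):
    # Step 1: repair torn data pages from valid doublewrite copies.
    for block, copy in dw.items():
        if copy is not TORN and data[block] is TORN:
            data[block] = copy
    # Step 2: ordinary redo, skipping records the page already contains.
    for lsn, block, value in wal:
        page = data[block]
        if page is TORN:
            raise RuntimeError(f"block {block} is torn and has no repair copy")
        if page.lsn < lsn:
            data[block] = Page(lsn, value)
    return data


def run_matrix():
    results = {}
    for crash_at in CRASH_POINTS:
        data = {7: Page(lsn=100, value="old")}
        dw = {}
        wal = [(200, 7, "new")]     # WAL was flushed before the page flush began
        flush_with_crash(data, dw, 7, Page(200, "new"), crash_at)
        results[crash_at] = recover(data, dw, wal)[7]
    return results


results = run_matrix()
for point, page in results.items():
    print(f"{point:>20}: {page}")
```

Every crash point should converge on the same page. Removing the doublewrite copy from the `during_final_write` case makes `recover` raise, which is the failure the first row of the table exists to catch.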

I have not run this PostgreSQL DWB prototype at scale personally. The documented failure mode is clear anyway: if a DWB design acknowledges a checkpoint, or allows final data-file writes, before the repair copy is durable, it can create a database that looks faster until the first badly timed crash. That is the least charming kind of benchmark.

Where It Breaks

Failure mode	Trigger	Fix
Doublewrite area becomes the new bottleneck	High dirty-page churn with `shared_buffers` large enough to delay eviction, then sudden checkpoint pressure	Size the doublewrite area for flush bursts; track fsync latency and dirty buffer age
Recovery restores the wrong page version	Doublewrite metadata does not encode relation identity, fork, block number, and page LSN safely	Treat DWB metadata as recovery-critical; checksum the slot header and page body
Checkpoint completes too early	Prototype marks pages safe after scheduling writeback instead of after durable doublewrite or durable final write	Checkpoint accounting must wait for a durable protection point
Cloud block storage reorders or stalls writes	Network-attached volumes with variable latency and opaque cache behavior	Test under the actual storage class; do not extrapolate from local NVMe
WAL compression already solves enough of the pain	PostgreSQL workload has compressible full page images and CPU headroom	Benchmark `wal_compression=zstd` or `lz4` before changing storage architecture
Full-page images help replica recovery behavior	Large working sets where WAL page images reduce random data-page reads during replay	Measure replica replay lag and recovery prefetch behavior, not only primary throughput
DWB increases write amplification under cold churn	Workload dirties pages once and evicts them without repeated updates	Compare physical bytes written per committed transaction across FPW and DWB
AI-generated kernel patch misses crash edge cases	Normal regression tests pass because they rarely interrupt I/O at durability boundaries	Add fault injection, checksum validation, crash loops, and page-level corruption tests
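For the WAL compression row, a quick feel for how compressible a page image is can come from any general-purpose compressor. The sketch below uses Python's `zlib` purely as a stand-in; it is not `pglz`, `lz4`, or `zstd`, and the page contents are made up, so it illustrates the compressibility question rather than predicting real `wal_compression` ratios.

```python
import os
import zlib

PAGE_SIZE = 8192


def compressed_size(page: bytes) -> int:
    # zlib level 1 is only a stand-in for a fast WAL compression method.
    return len(zlib.compress(page, 1))


# A mostly empty page with repetitive rows (hypothetical contents).
sparse_page = (b"row-0001|status=active|" * 40).ljust(PAGE_SIZE, b"\x00")

# A page full of high-entropy data, such as encrypted or already compressed values.
random_page = os.urandom(PAGE_SIZE)

for name, page in (("sparse", sparse_page), ("random", random_page)):
    size = compressed_size(page)
    print(f"{name}: {PAGE_SIZE} -> {size} bytes ({size / PAGE_SIZE:.0%})")
```

If production pages look like the first case, compression may remove most of the full-page-image cost for some CPU. If they look like the second, it spends CPU for nothing, and the architectural question comes back.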

What to Do Next

Problem: Treating all durability writes as equivalent hides the queue that users actually wait on.
Solution: Keep transaction ordering in WAL, but move torn-page repair copies to a durable background flush mechanism when the storage engine can prove the ordering.
Proof: A credible result is not one pgbench chart; it is lower foreground WAL amplification plus successful crash recovery across forced partial writes and checkpoint-boundary failures.
Action: This week, measure your PostgreSQL WAL growth around CHECKPOINT with full_page_writes=on, test wal_compression, and record p95 commit latency alongside pg_stat_bgwriter and pg_stat_io.

A storage engine is allowed to be faster only after it has earned the right to crash badly and come back boring.

Rajiv
