Torn Page Protection Belongs Off the Foreground Path
Content reflects the state as of October 2025. AI tooling and model capabilities in this area change frequently.
The expensive part of torn-page protection is not the extra write; it is where the extra write lands: PostgreSQL’s Full Page Write puts the copy on the foreground Write-Ahead Log path, while InnoDB’s Doublewrite Buffer moves the copy into the background flush path.
Situation
Database durability still lives below the abstraction line most application engineers prefer to ignore. That works until a write-heavy system hits checkpoint pressure, latency doubles, and the answer is not a missing index but an 8 KB page being protected from a 4 KB failure.
PostgreSQL protects against torn pages with Full Page Write (FPW): after each checkpoint, the first modification of a data page writes the entire page image into the Write-Ahead Log (WAL). MySQL’s InnoDB protects against the same class of failure with a Doublewrite Buffer (DWB): dirty pages are first written to a dedicated area, synced, then written to their final data-file locations.
| Design | Protection copy lives in | Request path impact | Recovery behavior |
|---|---|---|---|
| PostgreSQL FPW | WAL stream | The first post-checkpoint dirtying of each page expands foreground WAL | Recovery restores the full page image from WAL, then replays later WAL records |
| InnoDB DWB | Doublewrite files | Dirty-page copy is paid by flush machinery, not directly by SQL execution | Recovery repairs torn data pages from the doublewrite copy |
| Atomic-write storage | Storage layer | Database may avoid software copy only if the whole stack actually guarantees page atomicity | Recovery depends on the storage contract being true |
PostgreSQL’s own documentation says full_page_writes writes the entire disk page to WAL on first modification after checkpoint and warns that turning it off can cause unrecoverable or silent corruption after failure. The MySQL 8.4 manual describes InnoDB’s doublewrite buffer as a storage area written before final data-file placement and notes that, because the doublewrite copy is one large sequential write, it usually does not double the number of I/O operations. For the baseline behavior, see the PostgreSQL WAL settings documentation (full_page_writes) and the MySQL 8.4 InnoDB documentation (Doublewrite Buffer).
The Problem
A torn page is not a logical transaction problem. It is a physical write atomicity problem. PostgreSQL pages are normally 8 KB; MySQL InnoDB pages are commonly 16 KB; operating systems and devices often expose smaller practical atomic write units such as 4 KB sectors. If power loss or kernel failure interrupts a database page write, recovery may find a page that is half old and half new.
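To make the failure concrete, here is a minimal sketch of a torn write. It assumes an 8 KB page and a 4 KB atomic write unit, and the helper name `write_page_with_crash` is hypothetical; it models the outcome on disk rather than any real device behavior.

```python
PAGE_SIZE = 8192    # PostgreSQL default page size
SECTOR_SIZE = 4096  # assumed atomic write unit for this illustration


def write_page_with_crash(old_page: bytes, new_page: bytes, sectors_written: int) -> bytes:
    """Return the on-disk page if power fails after `sectors_written` sectors land."""
    cut = sectors_written * SECTOR_SIZE
    return new_page[:cut] + old_page[cut:]


old_page = bytes([0xAA]) * PAGE_SIZE
new_page = bytes([0xBB]) * PAGE_SIZE

# Crash after the first 4 KB sector of the 8 KB page reached storage.
torn = write_page_with_crash(old_page, new_page, sectors_written=1)

print(torn == old_page, torn == new_page)             # False False
print(torn[:SECTOR_SIZE] == new_page[:SECTOR_SIZE])   # True: first half is new
print(torn[SECTOR_SIZE:] == old_page[SECTOR_SIZE:])   # True: second half is old
```

The page on disk is neither version, which is why a redo record that expects a coherent page has nothing safe to apply to.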
That matters because PostgreSQL WAL records are usually physiological: they identify a physical page, then describe a logical change inside it. If the page cannot be parsed after a crash, the redo record may not have a sane object to apply to. The PostgreSQL wiki explains the problem directly: recovery needs a readable page with valid structure before logical page changes can be replayed (PostgreSQL wiki: Full page writes).
| Failure point | What breaks | Why it matters |
|---|---|---|
| First dirty page after checkpoint in PostgreSQL 16, 17, or 18 | The WAL record may include an 8 KB full page image instead of only the logical change | Write-heavy workloads see WAL volume jump immediately after checkpoint |
| checkpoint_timeout too low, such as the documented minimum of 30 seconds | Pages become “first dirty after checkpoint” more often | Lower recovery distance increases foreground WAL amplification |
| max_wal_size too low under write load | PostgreSQL triggers size-driven checkpoints earlier than the time schedule | A workload can enter a loop of checkpoint, FPW surge, WAL growth, checkpoint |
| wal_compression=off with highly compressible page images | Full page images are stored without compression | The storage bill moves from CPU to WAL bandwidth; compression can help but adds CPU on WAL insert and replay |
| Data checksums enabled | Hint-bit updates are WAL-logged, so the first hint-bit-only change to a page after a checkpoint can also emit a full page image | Checksums detect corruption; they do not remove the need for torn-page protection |
| Benchmark with full_page_writes=off | Throughput improves while the system is no longer protected against the same crash class | This is a measurement mode, not a production durability design |
PostgreSQL checkpoints are started by checkpoint_timeout or when max_wal_size is about to be exceeded. That means FPW makes checkpoint frequency a durability-performance coupling: shorter intervals reduce crash-recovery distance but increase the rate at which pages become eligible for full-page images again.
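The coupling can be illustrated with a toy model. This is a sketch under loud assumptions: `RECORD_SIZE`, the workload of 100 hot pages, and the rule that every first touch adds one uncompressed 8 KB image on top of the ordinary record are all simplifications chosen for illustration, not PostgreSQL's actual WAL record format.

```python
PAGE_SIZE = 8192      # full page image size, uncompressed
RECORD_SIZE = 100     # hypothetical size of an ordinary WAL record, in bytes


def wal_bytes(page_updates: list[int], updates_per_checkpoint: int) -> int:
    """Model WAL volume when a checkpoint happens every N updates."""
    total = 0
    seen_since_checkpoint: set[int] = set()
    for i, page in enumerate(page_updates):
        if i % updates_per_checkpoint == 0:
            seen_since_checkpoint.clear()      # checkpoint: every page is "first dirty" again
        if page not in seen_since_checkpoint:
            total += PAGE_SIZE                 # first touch carries a full page image
            seen_since_checkpoint.add(page)
        total += RECORD_SIZE                   # the logical change itself
    return total


# 10,000 updates cycling over 100 hot pages.
updates = [i % 100 for i in range(10_000)]

frequent = wal_bytes(updates, updates_per_checkpoint=500)
rare = wal_bytes(updates, updates_per_checkpoint=5_000)
print(frequent, rare)   # 17384000 2638400
```

In this model the same logical work produces roughly 6.6 times more WAL when checkpoints arrive ten times as often, because the hot pages keep re-qualifying for full page images.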
The core question is not whether FPW or DWB performs “two writes.” The question is whether the durability copy blocks the foreground commit path, or whether the system can batch it behind dirty-page flushing without weakening crash recovery.
Move Torn-Page Copies Off the Foreground Path
The right architecture is not “turn off full-page writes and hope the storage behaves.” The right architecture is to separate two responsibilities that FPW intentionally combines: WAL should preserve transaction order, while the torn-page protection copy should be paid by the page-flush path.
```mermaid
flowchart TD
    SQL[SQL transaction] --> Buffer[shared buffer page dirtied]
    Buffer --> WAL[WAL foreground path - logical record]
    Buffer --> Checkpoint[checkpoint boundary]
    Checkpoint --> FPW[PostgreSQL FPW - first dirty page image in WAL]
    Buffer --> Flusher[background dirty page flusher]
    Flusher --> DWB[Doublewrite area - sequential page copies]
    DWB --> Sync[fsync doublewrite area]
    Sync --> DataFiles[scatter write final data files]
    FPW --> Recovery[crash recovery - restore page then replay WAL]
    DataFiles --> Recovery
    DWB --> Recovery
```
The important distinction is scheduling. FPW pays the copy at WAL insertion time for the first page modification after checkpoint. DWB pays the copy when dirty pages leave the buffer pool. Both protect against torn pages; they do not put the pressure on the same queue.
- Keep WAL responsible for transaction ordering, not page-copy transport.
  In PostgreSQL, WAL must be flushed before dirty data pages reach durable storage. That ordering is non-negotiable. A DWB prototype should not weaken WAL-before-data; it should remove full page images from the normal WAL record path only when the doublewrite mechanism can guarantee a complete repair copy before final page placement.
  Verification: crash after WAL flush but before final data-file write; recovery must replay WAL without reading an unrecoverable torn page.
- Insert a doublewrite stage into the dirty-page flush path.
  The flush path should write dirty buffers into a sequential doublewrite area, force that area durable, then write the same pages to their final relation files. The doublewrite area needs enough metadata to map page identity back to relation fork and block number after restart.
  Verification: force a partial final data-file page write and confirm restart repairs it from the doublewrite copy before normal redo continues.
- Preserve checkpoint semantics explicitly.
  A checkpoint cannot simply assume pages are safe because they were scheduled for writeback. It needs a durable boundary: either the final page reached storage intact, or the doublewrite copy did. Otherwise the checkpoint can advertise a recovery point that depends on a page image which exists only in kernel cache.
  Verification: kill the postmaster during checkpoint completion, restart, and verify that checkpoint redo location never advances past unprotected dirty pages.
- Measure WAL bytes, data-file bytes, fsync latency, and tail latency separately.
  A DWB design can reduce foreground WAL pressure while increasing background writeback pressure. That is a good trade only if latency-critical SQL stops waiting and the background system does not fall behind. Use pg_current_wal_lsn() deltas, pg_stat_bgwriter, pg_stat_io (PostgreSQL 16 and later), filesystem writeback metrics, and storage latency histograms.
  Verification: compare p50, p95, and p99 transaction latency across checkpoint_timeout, max_wal_size, and shared_buffers, not only aggregate transactions per second.
- Treat AI-assisted kernel work as scaffolding, not proof.
  Zongzhi Chen’s 2026 experiment reported a PostgreSQL prototype where Claude Code helped replace FPW with a DWB-style mechanism, with DWB outperforming FPW in an I/O-bound pgbench workload. That is interesting engineering signal, especially because the patch touches real storage-engine paths. It is not enough to declare the design production-safe. Storage bugs are excellent at passing normal tests and failing only when the machine dies at precisely the wrong time. See the source experiment here: Zongzhi Chen, 2026.
  Verification: run crash-restart loops with forced partial writes, checksum validation, logical consistency checks, and comparisons against a known-good source.
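The doublewrite stage and its partial-write verification from the list above can be sketched in a few lines. This is a toy under stated assumptions, not the prototype's code or InnoDB's implementation: the slot layout (`SLOT_HEADER`), the `tear_block` crash injection, and the single-relation data file are all hypothetical. A real engine would validate the data page's own checksum rather than compare bytes, and would carry full relation and fork identity in each slot.

```python
import os
import struct
import tempfile
import zlib

PAGE_SIZE = 8192
SLOT_HEADER = struct.Struct("<II")   # hypothetical slot layout: block number, CRC32 of page


def flush_batch(dwb_path, data_path, dirty, tear_block=None):
    """Doublewrite first, sync, then write final locations.

    `tear_block` simulates a crash that leaves half of that block written.
    """
    with open(dwb_path, "wb") as dwb:
        for block, page in dirty.items():           # one sequential stream of copies
            dwb.write(SLOT_HEADER.pack(block, zlib.crc32(page)) + page)
        dwb.flush()
        os.fsync(dwb.fileno())                      # repair copies are durable from here on

    with open(data_path, "r+b") as data:
        for block, page in dirty.items():
            data.seek(block * PAGE_SIZE)
            if block == tear_block:
                data.write(page[: PAGE_SIZE // 2])  # torn write, then simulated power loss
                return
            data.write(page)
        data.flush()
        os.fsync(data.fileno())


def recover(dwb_path, data_path):
    """Repair any data page that does not match its intact doublewrite copy."""
    repaired = []
    slot_size = SLOT_HEADER.size + PAGE_SIZE
    with open(dwb_path, "rb") as dwb, open(data_path, "r+b") as data:
        while len(slot := dwb.read(slot_size)) == slot_size:
            block, crc = SLOT_HEADER.unpack_from(slot)
            page = slot[SLOT_HEADER.size:]
            if zlib.crc32(page) != crc:
                continue                            # damaged slot: never restore from it
            data.seek(block * PAGE_SIZE)
            if data.read(PAGE_SIZE) != page:
                data.seek(block * PAGE_SIZE)
                data.write(page)
                repaired.append(block)
        data.flush()
        os.fsync(data.fileno())
    return repaired


with tempfile.TemporaryDirectory() as tmp:
    dwb_path = os.path.join(tmp, "doublewrite")
    data_path = os.path.join(tmp, "relation")
    with open(data_path, "wb") as f:
        f.write(b"\x00" * PAGE_SIZE * 4)            # four old pages
    dirty = {1: b"\x11" * PAGE_SIZE, 3: b"\x33" * PAGE_SIZE}
    flush_batch(dwb_path, data_path, dirty, tear_block=3)
    repaired = recover(dwb_path, data_path)
    with open(data_path, "rb") as f:
        final = f.read()

print(repaired)   # [3]
```

The ordering is the point: the final data-file write only starts after the doublewrite copy has been synced, so a torn block 3 always has an intact copy to repair from before redo continues.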
In Practice
The documented PostgreSQL pattern is that FPW is checkpoint-coupled. The PostgreSQL documentation states that the first modification of a page after checkpoint writes the full page image to WAL, and that increasing checkpoint interval parameters can reduce that cost. That is not an implementation footnote; it is the operational reason write latency often worsens around checkpoint-heavy workloads.
| Documented behavior | Production implication | Validation signal |
|---|---|---|
| full_page_writes=on is the default in PostgreSQL and protects against partially completed page writes | Disabling it for throughput changes the crash-safety contract | SHOW full_page_writes; must be treated as a durability check, not a tuning curiosity |
| Full page images occur on first page modification after checkpoint | Checkpoint cadence directly affects WAL amplification | WAL growth should be measured before and after CHECKPOINT under the same write workload |
| wal_compression can compress full page images with pglz, lz4, or zstd when compiled in | Compression shifts cost from WAL bandwidth to CPU and replay decompression | Compare WAL bytes and CPU saturation with each compression method |
| pg_checksums can verify checksums offline when checksums are enabled | Checksums detect page corruption; they do not repair missing torn-page protection by themselves | Restart, stop cleanly, run pg_checksums --check against the cluster |
| InnoDB DWB writes pages to doublewrite files before final placement | InnoDB pays an extra page-copy step in the flush path, outside the user transaction’s immediate redo log write path | Monitor page cleaner activity, doublewrite files, fsync latency, and data-file writeback |
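For the WAL-growth validation signal in the table, a client-side helper can turn two pg_current_wal_lsn() readings into a byte count. This is a minimal sketch: the LSN values below are hypothetical sample readings, and the helper names are illustrative. PostgreSQL can do the same subtraction server-side with pg_wal_lsn_diff().

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a pg_lsn string such as '0/16B3748' into an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_delta(before: str, after: str) -> int:
    """Bytes of WAL generated between two LSN readings."""
    return lsn_to_bytes(after) - lsn_to_bytes(before)


# Hypothetical readings taken before and after the same write workload.
print(wal_delta("0/16B3748", "0/1EB3748"))     # 8388608
print(wal_delta("0/FFFFFF00", "1/00000100"))   # 512
```

Taking one pair of readings just after a CHECKPOINT and another pair later in the same checkpoint cycle, under an identical workload, separates the full-page-image surge from steady-state WAL volume.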
The documented InnoDB pattern is different. MySQL 8.4 says InnoDB writes flushed buffer-pool pages to doublewrite storage before writing to final data files, and crash recovery can use the doublewrite copy if the final page write was interrupted. The same documentation also says data is written twice, but not necessarily at twice the I/O operation cost, because the doublewrite write is a large sequential chunk with a single fsync() in normal configurations.
That distinction is the architecture lesson. Equal total bytes do not imply equal user-visible latency. A foreground WAL write competes with commit progress. A background doublewrite stage competes with page flushing, eviction, checkpoint completion, and storage bandwidth. Both queues can saturate; they fail differently.
The source experiment’s reported pgbench numbers are consistent with this mechanism. In the reported write-only 128-thread result, FPW-on delivered 14,857 transactions per second, while the DWB prototype delivered 33,814 transactions per second. The interesting result is not “DWB is 2.3x faster” as a universal claim. The interesting result is that moving the copy away from foreground WAL changed where the bottleneck surfaced.
For production builders, the deeper lesson is about validation. A storage-engine change is not proven by a five-minute pgbench run. It needs a crash matrix.
| Test class | What it proves | Minimum bar |
|---|---|---|
| Forced partial final-page write | DWB can repair a torn data page | Inject half-page writes and confirm recovery restores the page |
| Crash after doublewrite sync before final scatter write | Durable repair copy exists before final placement | Restart must complete without checksum failure |
| Crash during doublewrite write | Recovery ignores incomplete doublewrite entries | Restart must not restore from a corrupt doublewrite slot |
| Checkpoint boundary crash | Recovery point is not advanced beyond protected pages | Repeated kill during checkpoint must preserve logical contents |
| Replica and backup interaction | WAL stream remains sufficient for replicas and point-in-time recovery expectations | Physical replica, base backup, and restore tests must pass |
| Device diversity | Sequential-write assumptions hold on real storage | Test local NVMe, network-attached block storage, and throttled cloud volumes |
I have not run this PostgreSQL DWB prototype at scale personally. The documented failure mode is clear anyway: if a DWB design acknowledges a checkpoint or allows final data-file writes before the repair copy is durable, it can create a database that looks faster until the first badly timed crash. That is the least charming kind of benchmark.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Doublewrite area becomes the new bottleneck | High dirty-page churn with shared_buffers large enough to delay eviction, then sudden checkpoint pressure | Size the doublewrite area for flush bursts; track fsync latency and dirty buffer age |
| Recovery restores the wrong page version | Doublewrite metadata does not encode relation identity, fork, block number, and page LSN safely | Treat DWB metadata as recovery-critical; checksum the slot header and page body |
| Checkpoint completes too early | Prototype marks pages safe after scheduling writeback instead of after durable doublewrite or durable final write | Checkpoint accounting must wait for a durable protection point |
| Cloud block storage reorders or stalls writes | Network-attached volumes with variable latency and opaque cache behavior | Test under the actual storage class; do not extrapolate from local NVMe |
| WAL compression already solves enough of the pain | PostgreSQL workload has compressible full page images and CPU headroom | Benchmark wal_compression=zstd or lz4 before changing storage architecture |
| Full-page images help replica recovery behavior | Large working sets where WAL page images reduce random data-page reads during replay | Measure replica replay lag and recovery prefetch behavior, not only primary throughput |
| DWB increases write amplification under cold churn | Workload dirties pages once and evicts them without repeated updates | Compare physical bytes written per committed transaction across FPW and DWB |
| AI-generated kernel patch misses crash edge cases | Normal regression tests pass because they rarely interrupt I/O at durability boundaries | Add fault injection, checksum validation, crash loops, and page-level corruption tests |
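The "wrong page version" row suggests checksumming the slot header together with the page body. One way that could look is sketched below; the field layout (relation id, fork number, block number, page LSN) and the function names are assumptions for illustration, not a format from PostgreSQL, InnoDB, or the source experiment.

```python
import struct
import zlib

PAGE_SIZE = 8192
# Hypothetical slot header: relation id, fork number, block number, page LSN.
HEADER = struct.Struct("<IIIQ")
CRC = struct.Struct("<I")


def encode_slot(rel, fork, block, lsn, page):
    """Pack identity and page together, then checksum both as one unit."""
    body = HEADER.pack(rel, fork, block, lsn) + page
    return body + CRC.pack(zlib.crc32(body))


def decode_slot(slot):
    """Return (rel, fork, block, lsn, page), or None if the slot is damaged."""
    if len(slot) != HEADER.size + PAGE_SIZE + CRC.size:
        return None
    body = slot[:-CRC.size]
    (crc,) = CRC.unpack(slot[-CRC.size:])
    if zlib.crc32(body) != crc:
        return None
    return (*HEADER.unpack_from(body), body[HEADER.size:])


page = b"\x42" * PAGE_SIZE
slot = encode_slot(rel=16384, fork=0, block=7, lsn=0x16B3748, page=page)

# Corrupt only the block-number field: the slot must be rejected, not misapplied.
damaged = bytearray(slot)
damaged[8] ^= 0x01

print(decode_slot(slot)[:4])          # (16384, 0, 7, 23803720)
print(decode_slot(bytes(damaged)))    # None
```

Because the checksum covers the identity fields, a slot whose block number was damaged is discarded instead of being restored over the wrong page.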
What to Do Next
- Problem: Treating all durability writes as equivalent hides the queue that users actually wait on.
- Solution: Keep transaction ordering in WAL, but move torn-page repair copies to a durable background flush mechanism when the storage engine can prove the ordering.
- Proof: A credible result is not one pgbench chart; it is lower foreground WAL amplification plus successful crash recovery across forced partial writes and checkpoint-boundary failures.
- Action: This week, measure your PostgreSQL WAL growth around CHECKPOINT with full_page_writes=on, test wal_compression, and record p95 commit latency alongside pg_stat_bgwriter and pg_stat_io.
A storage engine is allowed to be faster only after it has earned the right to crash badly and come back boring.