Checkpoint and Flush: What Your Database Does Before It Can Rest

A checkpoint is not a pause — it is the database settling its accounts. Everything written to the buffer cache since the last checkpoint must be flushed to disk so that crash recovery has a known starting point. Getting checkpoint timing wrong turns a 30-second restart into a 20-minute recovery.

Situation

PostgreSQL and most other ACID databases use checkpoints to bound crash recovery time. Between checkpoints, the database accumulates dirty pages in the buffer cache — pages that have been modified in memory but not yet written to their data files on disk. At a checkpoint, all dirty pages are flushed.

After a crash, the database only needs to replay WAL records that were written after the last successful checkpoint. If checkpoints are frequent, less WAL needs to be replayed. If checkpoints are infrequent, recovery takes longer.

The Problem

Engineers often observe I/O spikes on their database hosts that correlate with checkpoint activity and assume something is wrong. The database is not misbehaving — it is doing its job. But poorly tuned checkpoints create two distinct problems: if too frequent, the database constantly flushes dirty pages and saturates I/O; if too infrequent, crash recovery takes too long and dirty pages accumulate in the buffer cache past useful limits.

What is actually happening during a checkpoint, and what parameters control it?

What a Checkpoint Does

When PostgreSQL triggers a checkpoint, it:

Records the current WAL position as the checkpoint LSN.
Identifies all dirty pages in the shared buffer cache.
Writes those pages to their data files on disk, spread across the checkpoint interval.
Flushes the WAL up to the checkpoint LSN.
Updates pg_control to record the checkpoint as complete.

The spreading is controlled by checkpoint_completion_target (default: 0.9), which tells PostgreSQL to spread dirty page writes over 90% of the checkpoint interval. This prevents a large I/O burst at the start of each checkpoint.

-- See checkpoint activity since last restart
SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend,
       checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;

-- checkpoints_req being high means checkpoints are being forced by WAL volume,
-- not by time — usually means max_wal_size is too small

checkpoints_req being significantly higher than checkpoints_timed is a signal that max_wal_size is too small and the database is triggering emergency checkpoints to prevent WAL from exceeding the limit.

In Practice

PostgreSQL’s documented guidance is that checkpoint_timeout should be long enough that checkpoint I/O does not saturate the storage system, but short enough that recovery after a crash completes within the acceptable window. The relationship: worst-case recovery time ≈ checkpoint_timeout × write throughput. For a database writing 500MB/min of WAL with a 10-minute checkpoint timeout, recovery could replay up to 5GB of WAL.

buffers_backend in pg_stat_bgwriter counts pages that were written directly by backend processes rather than the background writer. A high buffers_backend count means the background writer is not keeping up with dirty page accumulation — backends are being forced to flush their own dirty pages before the checkpointer gets to them. This creates latency spikes for application queries.

Where It Breaks

Symptom	Cause	Fix
I/O spike every N minutes	Checkpoint spreading not working; `checkpoint_completion_target` too low	Increase `checkpoint_completion_target` to 0.9
`checkpoints_req` high	WAL volume exceeds `max_wal_size` limit	Increase `max_wal_size`; or reduce write throughput
High `buffers_backend`	Background writer not keeping up	Tune `bgwriter_lru_maxpages` and `bgwriter_delay`
Long crash recovery	Checkpoint interval too long	Reduce `checkpoint_timeout` to 5 minutes

What to Do Next

Problem: Checkpoint timing that is either too aggressive or too infrequent creates I/O spikes or long recovery windows — both are preventable with correct parameter tuning.
Solution: Set checkpoint_timeout = 5min, checkpoint_completion_target = 0.9, and max_wal_size to a value that allows at least 2–3 checkpoint intervals of WAL accumulation without forcing early checkpoints.
Proof: After tuning, checkpoints_req should approach zero and checkpoint_write_time should show smooth, gradual I/O rather than spikes.
Action: Run SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter; today — if checkpoints_req is more than 20% of checkpoints_timed, your max_wal_size is undersized.

Situation

The Problem

What a Checkpoint Does

In Practice

Where It Breaks

What to Do Next

Rajiv

Related Posts

Consistency Models Your Application Actually Needs

CAP Theorem in Operational Terms

Caches, Queues, and Databases: When to Use Each