Checkpoint and Flush: What Your Database Does Before It Can Rest
A checkpoint is not a pause — it is the database settling its accounts. Everything written to the buffer cache since the last checkpoint must be flushed to disk so that crash recovery has a known starting point. Getting checkpoint timing wrong turns a 30-second restart into a 20-minute recovery.
Situation
PostgreSQL and most other ACID databases use checkpoints to bound crash recovery time. Between checkpoints, the database accumulates dirty pages in the buffer cache — pages that have been modified in memory but not yet written to their data files on disk. At a checkpoint, all dirty pages are flushed.
After a crash, the database only needs to replay WAL records that were written after the last successful checkpoint. If checkpoints are frequent, less WAL needs to be replayed. If checkpoints are infrequent, recovery takes longer.
The Problem
Engineers often observe I/O spikes on their database hosts that correlate with checkpoint activity and assume something is wrong. The database is not misbehaving — it is doing its job. But poorly tuned checkpoints create two distinct problems: if too frequent, the database constantly flushes dirty pages and saturates I/O; if too infrequent, crash recovery takes too long and dirty pages accumulate in the buffer cache past useful limits.
What is actually happening during a checkpoint, and what parameters control it?
What a Checkpoint Does
When PostgreSQL triggers a checkpoint, it:
- Records the current WAL position as the checkpoint LSN.
- Identifies all dirty pages in the shared buffer cache.
- Writes those pages to their data files on disk, spread across the checkpoint interval.
- Flushes the WAL up to the checkpoint LSN.
- Updates
pg_controlto record the checkpoint as complete.
The spreading is controlled by checkpoint_completion_target (default: 0.9), which tells PostgreSQL to spread dirty page writes over 90% of the checkpoint interval. This prevents a large I/O burst at the start of each checkpoint.
-- See checkpoint activity since last restart
SELECT checkpoints_timed, checkpoints_req,
buffers_checkpoint, buffers_clean, buffers_backend,
checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;
-- checkpoints_req being high means checkpoints are being forced by WAL volume,
-- not by time — usually means max_wal_size is too small
checkpoints_req being significantly higher than checkpoints_timed is a signal that max_wal_size is too small and the database is triggering emergency checkpoints to prevent WAL from exceeding the limit.
In Practice
PostgreSQL’s documented guidance is that checkpoint_timeout should be long enough that checkpoint I/O does not saturate the storage system, but short enough that recovery after a crash completes within the acceptable window. The relationship: worst-case recovery time ≈ checkpoint_timeout × write throughput. For a database writing 500MB/min of WAL with a 10-minute checkpoint timeout, recovery could replay up to 5GB of WAL.
buffers_backend in pg_stat_bgwriter counts pages that were written directly by backend processes rather than the background writer. A high buffers_backend count means the background writer is not keeping up with dirty page accumulation — backends are being forced to flush their own dirty pages before the checkpointer gets to them. This creates latency spikes for application queries.
Where It Breaks
| Symptom | Cause | Fix |
|---|---|---|
| I/O spike every N minutes | Checkpoint spreading not working; checkpoint_completion_target too low | Increase checkpoint_completion_target to 0.9 |
checkpoints_req high | WAL volume exceeds max_wal_size limit | Increase max_wal_size; or reduce write throughput |
High buffers_backend | Background writer not keeping up | Tune bgwriter_lru_maxpages and bgwriter_delay |
| Long crash recovery | Checkpoint interval too long | Reduce checkpoint_timeout to 5 minutes |
What to Do Next
- Problem: Checkpoint timing that is either too aggressive or too infrequent creates I/O spikes or long recovery windows — both are preventable with correct parameter tuning.
- Solution: Set
checkpoint_timeout = 5min,checkpoint_completion_target = 0.9, andmax_wal_sizeto a value that allows at least 2–3 checkpoint intervals of WAL accumulation without forcing early checkpoints. - Proof: After tuning,
checkpoints_reqshould approach zero andcheckpoint_write_timeshould show smooth, gradual I/O rather than spikes. - Action: Run
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;today — ifcheckpoints_reqis more than 20% ofcheckpoints_timed, yourmax_wal_sizeis undersized.