Database Backup Validation Workflow
A backup that has never been restored is a hypothesis, not a safety net. The job of a backup validation workflow is not to confirm that backup files exist — it is to prove that a recoverable database can be produced from them within your documented RTO, on demand, and on a schedule that keeps that proof fresh.
Situation
Most teams reach a point where backup jobs are running nightly, retention windows are configured, and monitoring shows no failures. The backup checkbox is green. What is rarely true is that anyone has measured how long a restore actually takes, or whether the restored database is consistent enough to serve traffic.
The gap between “backups are running” and “we can recover from backups” is where most recovery failures live. That gap expands silently: schema migrations add tables that the restore script does not verify, sequences drift out of sync, foreign key constraints that were dropped for a bulk load never get re-added, and PITR windows shrink as WAL archiving falls behind. None of these register as a backup failure. They register as a recovery failure — at 3am, under incident pressure, with customers waiting.
This runbook operationalizes the difference. The goal is a weekly validation cycle that produces a measured RTO, a verified consistent restore, and documented PITR coverage — before you need any of them.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
| No documented restore time | Runbook or incident playbook | RTO is aspirational, not measured |
| Backup job shows “succeeded” but restore has never been tested | CI logs, backup tool dashboard | File integrity is confirmed; recoverability is not |
| Backup files exist but manifest or catalog is unverified | pg_dump output, S3 bucket listing | Partial or corrupt dump may silently pass a file-size check |
| Last restore test was more than 90 days ago | Backup validation log, calendar | Schema and data drift since last test may invalidate assumptions |
| RTO and RPO are in the SLA doc but not measured | SLA document, incident retrospectives | Numbers were estimated at design time and never validated |
| pg_stat_archiver shows gaps or lag | PostgreSQL system view | WAL archive is falling behind; PITR window is narrowing |
First Five Checks
-
Verify backup file integrity
For a PostgreSQL logical dump, verify the catalog without performing a full restore:
pg_restore --list backup.dump > /dev/null && echo "catalog OK"The
--listflag reads the table of contents from a custom-format dump. If the dump is corrupt or truncated, this fails immediately. A clean exit with “catalog OK” confirms the file is structurally valid. It does not confirm data integrity — that requires a restore.For Aurora RDS snapshots, check snapshot status and progress via the CLI:
aws rds describe-db-snapshots \ --db-instance-identifier mydb \ --query 'DBSnapshots[*].[DBSnapshotIdentifier,Status,PercentProgress]' \ --output tableAny snapshot not in
availablestatus cannot be used for restore. ThePercentProgressfield indicates whether an automated snapshot is still in progress. -
Check backup age and frequency
For PostgreSQL with WAL archiving, query the archiver process state:
SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time, stats_reset FROM pg_stat_archiver;The documented behavior of
pg_stat_archiver(PostgreSQL documentation, §28.2) is thatlast_archived_timereflects when the most recent WAL segment was successfully archived. Afailed_countgreater than zero with a recentlast_failed_timemeans the archive pipeline is broken and your PITR window has stopped advancing.archived_countresetting unexpectedly can indicate a statistics reset, not necessarily a problem — checkstats_reset.For RDS, list recent snapshots with a date filter:
aws rds describe-db-snapshots \ --db-instance-identifier mydb \ --query 'DBSnapshots[?SnapshotCreateTime>=`2023-05-08`].[DBSnapshotIdentifier,SnapshotCreateTime,Status]' \ --output table -
Time a restore to a test instance
Record the start time, execute the restore, and record the end time. This is your measured RTO. Do not estimate — measure:
RESTORE_START=$(date -u +%Y-%m-%dT%H:%M:%SZ) echo "Restore started: $RESTORE_START" # PostgreSQL logical restore to a test instance pg_restore \ --host=test-db.internal \ --port=5432 \ --username=restore_user \ --dbname=restore_target \ --verbose \ backup.dump RESTORE_END=$(date -u +%Y-%m-%dT%H:%M:%SZ) echo "Restore completed: $RESTORE_END"For Aurora, restore from a snapshot using the AWS CLI:
aws rds restore-db-instance-from-db-snapshot \ --db-instance-identifier mydb-validation-$(date +%Y%m%d) \ --db-snapshot-identifier mydb-snapshot-id \ --db-instance-class db.t3.medium \ --no-multi-az \ --no-publicly-accessibleLog start and end times. The elapsed wall-clock time is your real RTO for this backup type and database size.
-
Verify data consistency post-restore
Row counts on critical tables catch gross data loss. Sequence values confirm identity columns are in sync. Foreign key constraints confirm referential integrity was preserved:
-- Row counts on high-value tables SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables WHERE schemaname = 'public' ORDER BY n_live_tup DESC LIMIT 20; -- Check current sequence values SELECT sequence_name, last_value FROM information_schema.sequences WHERE sequence_schema = 'public'; -- Verify foreign key constraints are present SELECT conname, contype, conrelid::regclass AS table_name FROM pg_constraint WHERE contype = 'f' LIMIT 20;The expected output is that row counts roughly match production (accounting for any lag), sequences are ahead of the maximum id values in their respective tables, and all foreign key constraints are present. A missing constraint row indicates the constraint was dropped and not re-added before the backup was taken.
-
Test point-in-time recovery
For PostgreSQL, a PITR test restores to a target LSN or timestamp rather than the latest checkpoint. This verifies that WAL segments are intact and readable:
# In recovery.conf (Postgres 11 and earlier) or postgresql.conf (12+): # recovery_target_time = '2023-05-14 22:00:00 UTC' # restore_command = 'cp /mnt/wal_archive/%f %p' # For Aurora, restore to a point in time one hour before present: aws rds restore-db-instance-to-point-in-time \ --source-db-instance-identifier mydb \ --target-db-instance-identifier mydb-pitr-validation-$(date +%Y%m%d) \ --restore-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \ --db-instance-class db.t3.medium \ --no-publicly-accessibleThe AWS Aurora PITR documentation specifies that the
--restore-timeparameter accepts an ISO 8601 timestamp. The restored instance should come up in a consistent state at the target time. Verify by checking a table that had known writes in the hour before the target timestamp.
Decision Tree
flowchart TD
A[Backup exists in storage] --> B{Integrity verified?}
B -->|no| C[Re-run backup — check for errors]
B -->|yes| D{Restore timed in last 30 days?}
D -->|no| E[Run restore drill — record start and end time]
E --> F{Measured RTO within SLA?}
F -->|no| G[Escalate — switch to physical backup or optimize]
F -->|yes| H{Data consistency verified?}
D -->|yes| H
H -->|no| I[Investigate — row counts, constraints, sequences]
H -->|yes| J{PITR tested in last 30 days?}
J -->|no| K[Run PITR drill — restore to timestamp minus 1 hour]
K --> L{PITR restore succeeded?}
L -->|no| M[Check WAL archive — review pg_stat_archiver]
L -->|yes| N[Mark validation complete — log date and RTO]
J -->|yes| N
Remediation Options
Option 1 — Switch from logical to physical backup for faster RTO
PostgreSQL pg_dump produces a portable logical backup but restore time scales with database size and is limited by the single-threaded restore process for custom-format dumps (parallel restore with -j helps but still requires full data transfer). For large databases where RTO is failing its SLA target, switching to a physical backup method — pg_basebackup for self-managed PostgreSQL, or Aurora snapshots which use storage-level cloning — typically reduces restore time significantly because physical restores do not need to re-execute every INSERT.
# Physical base backup for self-managed PostgreSQL
pg_basebackup \
--host=primary.internal \
--username=replication_user \
--pgdata=/var/lib/postgresql/base_backup \
--format=tar \
--gzip \
--progress \
--wal-method=stream
Use when: logical restore times consistently exceed RTO targets and the database is large enough that parallel restore does not close the gap.
Risk: physical backups are not portable across major PostgreSQL versions and require the same OS page size as the source.
Option 2 — Automate weekly restore drill to an isolated test instance
Manual restore drills get deferred. An automated weekly drill that spins up a test instance, runs consistency checks, logs the RTO, and terminates the instance provides continuous validation without engineer attention. The pattern works for both self-managed PostgreSQL (via cron + pg_restore + psql checks) and Aurora (via AWS Lambda + EventBridge + the RDS API).
# Shell skeleton for a self-managed weekly drill
#!/bin/bash
set -euo pipefail
BACKUP_FILE="/backups/latest.dump"
TEST_HOST="test-restore.internal"
LOG_FILE="/var/log/backup_validation/$(date +%Y%m%d).log"
START=$(date +%s)
pg_restore --host="$TEST_HOST" --dbname=restore_target "$BACKUP_FILE" >> "$LOG_FILE" 2>&1
END=$(date +%s)
ELAPSED=$((END - START))
echo "RTO measured: ${ELAPSED}s" >> "$LOG_FILE"
psql --host="$TEST_HOST" --dbname=restore_target \
-c "SELECT count(*) FROM pg_stat_user_tables;" >> "$LOG_FILE"
echo "Validation complete: $(date -u)" >> "$LOG_FILE"
Use when: restore drills are happening less than monthly, or the team wants evidence of RTO measurements for compliance purposes.
Risk: the test instance must be isolated from production network paths to avoid accidental writes.
Option 3 — Add catalog verification to CI/CD for schema migrations
Schema migrations are the most common way a logical backup becomes silently unrestorable — a migration drops and re-creates a constraint, a sequence, or a table in a way that the backup catalog does not reflect. Adding pg_restore --list verification as a post-migration CI check confirms that the dump catalog matches expected objects after every migration run.
# In CI pipeline, after migration:
pg_dump \
--format=custom \
--schema-only \
--file=schema_backup.dump \
"$DATABASE_URL"
pg_restore --list schema_backup.dump | grep -E "TABLE|SEQUENCE|CONSTRAINT" | sort > /tmp/current_objects.txt
# Diff against expected objects baseline
diff /tmp/expected_objects.txt /tmp/current_objects.txt
Use when: the team runs frequent migrations and wants early warning before a corrupt backup reaches the weekly restore drill.
Risk: schema-only catalog verification does not catch data integrity issues — it only confirms structural completeness.
Rollback Plan
The backup validation workflow is entirely read-only on production. All restore operations target isolated test instances. There is nothing to roll back from the validation process itself.
-
If Option 1 (physical backup) causes issues: The original logical backup schedule is unchanged. Run both in parallel for one validation cycle before cutting over. Revert by disabling the
pg_basebackupcron job and monitoring the next scheduled logical backup. -
If Option 2 (automated restore drill) causes unexpected resource usage: The EventBridge or cron schedule can be disabled immediately. If a test instance was not terminated by the script, terminate it manually via
aws rds delete-db-instance --db-instance-identifier mydb-validation-YYYYMMDD --skip-final-snapshot. -
If Option 3 (CI catalog check) produces false positives after a migration: Regenerate the
expected_objects.txtbaseline from the current schema and commit it. The diff will be clean on the next run.
Automation Opportunity
The most impactful automation for this runbook is a weekly restore drill that requires no engineer involvement. The AWS pattern for Aurora uses EventBridge to trigger a Lambda function once per week. The Lambda calls restore-db-instance-from-db-snapshot using the most recent available snapshot, polls the instance status until it reaches available, runs row count checks via the RDS Data API or a temporary Lambda-to-RDS connection, logs the elapsed time and results to CloudWatch Logs, then calls delete-db-instance to terminate the test instance.
For a 100 GB Aurora database, the AWS RDS pricing documentation indicates that snapshot restore charges apply at the storage rate for the duration the instance is running. A validation instance that runs for two hours per week at db.t3.medium pricing (on-demand) costs approximately $0.34 per week at current us-east-1 rates — less than the cost of one engineer-hour spent on a manual drill. The actual cost depends on instance class, storage provisioned, and region.
For self-managed PostgreSQL, a pg_cron job or a systemd timer can trigger the shell skeleton from Option 2. The key instrumentation addition is writing the elapsed RTO and row count results to a table in a monitoring database so that trend data is available — a restore time that grows month over month as the database grows is a signal to revisit backup type before it breaches SLA.
Leadership Summary
- What broke: Backup jobs were succeeding but restorability had never been tested, meaning the team’s documented RTO had no measured basis and recovery from a real incident would be slower and less certain than assumed.
- What was done: A validation workflow was implemented that measures actual restore time, verifies data consistency post-restore, and tests point-in-time recovery on a documented schedule.
- What prevents recurrence: Automated weekly restore drills log measured RTO to a persistent store, and a CI catalog check flags schema migrations that would make a backup unrestorable before they reach production.
Checklist
- Verify backup file integrity using
pg_restore --list(PostgreSQL) oraws rds describe-db-snapshots(Aurora) — confirm no errors before proceeding - Check backup age: confirm the most recent backup is within the expected retention window and frequency
- Query
pg_stat_archiverand confirmfailed_countis zero andlast_archived_timeis recent - Run a timed restore to an isolated test instance and record wall-clock start and end times as the measured RTO
- Compare measured RTO against documented SLA target — escalate if over threshold
- Run row counts on the top 20 tables by size on the restored instance and compare to production baseline
- Verify sequence values are ahead of their respective table maximum id values
- Query
pg_constrainton the restored instance and confirm all expected foreign key constraints are present - Run a PITR drill to a timestamp 1 hour before the current time — confirm the instance comes up and data at the target time is present
- Document the validation date, measured RTO, PITR result, and any anomalies in the validation log
- Set a calendar reminder or automate a trigger to repeat this cycle within 30 days
- If measured RTO exceeds SLA: open a ticket to evaluate physical backup method or restore parallelism before the next scheduled drill
What to Do Next
- Problem: Backup jobs report success but the team has never measured actual restore time or verified data consistency — meaning the documented RTO is a guess and a real recovery event will be slower and less certain than expected.
- Solution: Run a timed restore to an isolated test instance, verify row counts and foreign key constraints post-restore, and test PITR to a target timestamp — on a schedule that keeps the measurement fresh.
- Proof: A logged RTO that fits inside the SLA target, verified by wall-clock start and end times from the last restore drill, plus a confirmed PITR result within the last 30 days.
- Action: This week, run
pg_restore --list backup.dump(oraws rds describe-db-snapshots) to verify your most recent backup file is structurally intact, then schedule the first timed restore drill if one has not been run in the past 30 days.