Database Backup Validation Workflow

A backup that has never been restored is a hypothesis, not a safety net. The job of a backup validation workflow is not to confirm that backup files exist — it is to prove that a recoverable database can be produced from them within your documented RTO, on demand, and on a schedule that keeps that proof fresh.

Situation

Most teams reach a point where backup jobs are running nightly, retention windows are configured, and monitoring shows no failures. The backup checkbox is green. What is rarely true is that anyone has measured how long a restore actually takes, or whether the restored database is consistent enough to serve traffic.

The gap between “backups are running” and “we can recover from backups” is where most recovery failures live. That gap expands silently: schema migrations add tables that the restore script does not verify, sequences drift out of sync, foreign key constraints that were dropped for a bulk load never get re-added, and PITR windows shrink as WAL archiving falls behind. None of these register as a backup failure. They register as a recovery failure — at 3am, under incident pressure, with customers waiting.

This runbook operationalizes the difference. The goal is a weekly validation cycle that produces a measured RTO, a verified consistent restore, and documented PITR coverage — before you need any of them.

Symptoms

Signal	Where to see it	What it means
No documented restore time	Runbook or incident playbook	RTO is aspirational, not measured
Backup job shows “succeeded” but restore has never been tested	CI logs, backup tool dashboard	File integrity is confirmed; recoverability is not
Backup files exist but manifest or catalog is unverified	pg_dump output, S3 bucket listing	Partial or corrupt dump may silently pass a file-size check
Last restore test was more than 90 days ago	Backup validation log, calendar	Schema and data drift since last test may invalidate assumptions
RTO and RPO are in the SLA doc but not measured	SLA document, incident retrospectives	Numbers were estimated at design time and never validated
pg_stat_archiver shows gaps or lag	PostgreSQL system view	WAL archive is falling behind; PITR window is narrowing

First Five Checks

Verify backup file integrity

For a PostgreSQL logical dump, verify the catalog without performing a full restore:
```
pg_restore --list backup.dump > /dev/null && echo "catalog OK"
```
The --list flag reads the table of contents from a custom-format dump. If the dump is corrupt or truncated, this fails immediately. A clean exit with “catalog OK” confirms the file is structurally valid. It does not confirm data integrity — that requires a restore.

For Aurora RDS snapshots, check snapshot status and progress via the CLI:
```
aws rds describe-db-snapshots \
  --db-instance-identifier mydb \
  --query 'DBSnapshots[*].[DBSnapshotIdentifier,Status,PercentProgress]' \
  --output table
```
Any snapshot not in available status cannot be used for restore. The PercentProgress field indicates whether an automated snapshot is still in progress.
Check backup age and frequency

For PostgreSQL with WAL archiving, query the archiver process state:
```
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time,
       stats_reset
FROM pg_stat_archiver;
```
The documented behavior of pg_stat_archiver (PostgreSQL documentation, §28.2) is that last_archived_time reflects when the most recent WAL segment was successfully archived. A failed_count greater than zero with a recent last_failed_time means the archive pipeline is broken and your PITR window has stopped advancing. archived_count resetting unexpectedly can indicate a statistics reset, not necessarily a problem — check stats_reset.

For RDS, list recent snapshots with a date filter:
```
aws rds describe-db-snapshots \
  --db-instance-identifier mydb \
  --query 'DBSnapshots[?SnapshotCreateTime>=`2023-05-08`].[DBSnapshotIdentifier,SnapshotCreateTime,Status]' \
  --output table
```

Time a restore to a test instance

Record the start time, execute the restore, and record the end time. This is your measured RTO. Do not estimate — measure:

RESTORE_START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Restore started: $RESTORE_START"

# PostgreSQL logical restore to a test instance
pg_restore \
  --host=test-db.internal \
  --port=5432 \
  --username=restore_user \
  --dbname=restore_target \
  --verbose \
  backup.dump

RESTORE_END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
echo "Restore completed: $RESTORE_END"

For Aurora, restore from a snapshot using the AWS CLI:

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier mydb-validation-$(date +%Y%m%d) \
  --db-snapshot-identifier mydb-snapshot-id \
  --db-instance-class db.t3.medium \
  --no-multi-az \
  --no-publicly-accessible

Log start and end times. The elapsed wall-clock time is your real RTO for this backup type and database size.

Verify data consistency post-restore

Row counts on critical tables catch gross data loss. Sequence values confirm identity columns are in sync. Foreign key constraints confirm referential integrity was preserved:
```
-- Row counts on high-value tables
SELECT schemaname, tablename, n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_live_tup DESC
LIMIT 20;

-- Check current sequence values
SELECT sequence_name, last_value
FROM information_schema.sequences
WHERE sequence_schema = 'public';

-- Verify foreign key constraints are present
SELECT conname, contype, conrelid::regclass AS table_name
FROM pg_constraint
WHERE contype = 'f'
LIMIT 20;
```
The expected output is that row counts roughly match production (accounting for any lag), sequences are ahead of the maximum id values in their respective tables, and all foreign key constraints are present. A missing constraint row indicates the constraint was dropped and not re-added before the backup was taken.

Test point-in-time recovery

For PostgreSQL, a PITR test restores to a target LSN or timestamp rather than the latest checkpoint. This verifies that WAL segments are intact and readable:

# In recovery.conf (Postgres 11 and earlier) or postgresql.conf (12+):
# recovery_target_time = '2023-05-14 22:00:00 UTC'
# restore_command = 'cp /mnt/wal_archive/%f %p'

# For Aurora, restore to a point in time one hour before present:
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-pitr-validation-$(date +%Y%m%d) \
  --restore-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --db-instance-class db.t3.medium \
  --no-publicly-accessible

The AWS Aurora PITR documentation specifies that the --restore-time parameter accepts an ISO 8601 timestamp. The restored instance should come up in a consistent state at the target time. Verify by checking a table that had known writes in the hour before the target timestamp.

Decision Tree

flowchart TD
    A[Backup exists in storage] --> B{Integrity verified?}
    B -->|no| C[Re-run backup — check for errors]
    B -->|yes| D{Restore timed in last 30 days?}
    D -->|no| E[Run restore drill — record start and end time]
    E --> F{Measured RTO within SLA?}
    F -->|no| G[Escalate — switch to physical backup or optimize]
    F -->|yes| H{Data consistency verified?}
    D -->|yes| H
    H -->|no| I[Investigate — row counts, constraints, sequences]
    H -->|yes| J{PITR tested in last 30 days?}
    J -->|no| K[Run PITR drill — restore to timestamp minus 1 hour]
    K --> L{PITR restore succeeded?}
    L -->|no| M[Check WAL archive — review pg_stat_archiver]
    L -->|yes| N[Mark validation complete — log date and RTO]
    J -->|yes| N

Remediation Options

Option 1 — Switch from logical to physical backup for faster RTO

PostgreSQL pg_dump produces a portable logical backup but restore time scales with database size and is limited by the single-threaded restore process for custom-format dumps (parallel restore with -j helps but still requires full data transfer). For large databases where RTO is failing its SLA target, switching to a physical backup method — pg_basebackup for self-managed PostgreSQL, or Aurora snapshots which use storage-level cloning — typically reduces restore time significantly because physical restores do not need to re-execute every INSERT.

# Physical base backup for self-managed PostgreSQL
pg_basebackup \
  --host=primary.internal \
  --username=replication_user \
  --pgdata=/var/lib/postgresql/base_backup \
  --format=tar \
  --gzip \
  --progress \
  --wal-method=stream

Use when: logical restore times consistently exceed RTO targets and the database is large enough that parallel restore does not close the gap.

Risk: physical backups are not portable across major PostgreSQL versions and require the same OS page size as the source.

Option 2 — Automate weekly restore drill to an isolated test instance

Manual restore drills get deferred. An automated weekly drill that spins up a test instance, runs consistency checks, logs the RTO, and terminates the instance provides continuous validation without engineer attention. The pattern works for both self-managed PostgreSQL (via cron + pg_restore + psql checks) and Aurora (via AWS Lambda + EventBridge + the RDS API).

# Shell skeleton for a self-managed weekly drill
#!/bin/bash
set -euo pipefail

BACKUP_FILE="/backups/latest.dump"
TEST_HOST="test-restore.internal"
LOG_FILE="/var/log/backup_validation/$(date +%Y%m%d).log"

START=$(date +%s)
pg_restore --host="$TEST_HOST" --dbname=restore_target "$BACKUP_FILE" >> "$LOG_FILE" 2>&1
END=$(date +%s)

ELAPSED=$((END - START))
echo "RTO measured: ${ELAPSED}s" >> "$LOG_FILE"

psql --host="$TEST_HOST" --dbname=restore_target \
  -c "SELECT count(*) FROM pg_stat_user_tables;" >> "$LOG_FILE"

echo "Validation complete: $(date -u)" >> "$LOG_FILE"

Use when: restore drills are happening less than monthly, or the team wants evidence of RTO measurements for compliance purposes.

Risk: the test instance must be isolated from production network paths to avoid accidental writes.

Option 3 — Add catalog verification to CI/CD for schema migrations

Schema migrations are the most common way a logical backup becomes silently unrestorable — a migration drops and re-creates a constraint, a sequence, or a table in a way that the backup catalog does not reflect. Adding pg_restore --list verification as a post-migration CI check confirms that the dump catalog matches expected objects after every migration run.

# In CI pipeline, after migration:
pg_dump \
  --format=custom \
  --schema-only \
  --file=schema_backup.dump \
  "$DATABASE_URL"

pg_restore --list schema_backup.dump | grep -E "TABLE|SEQUENCE|CONSTRAINT" | sort > /tmp/current_objects.txt

# Diff against expected objects baseline
diff /tmp/expected_objects.txt /tmp/current_objects.txt

Use when: the team runs frequent migrations and wants early warning before a corrupt backup reaches the weekly restore drill.

Risk: schema-only catalog verification does not catch data integrity issues — it only confirms structural completeness.

Rollback Plan

The backup validation workflow is entirely read-only on production. All restore operations target isolated test instances. There is nothing to roll back from the validation process itself.

If Option 1 (physical backup) causes issues: The original logical backup schedule is unchanged. Run both in parallel for one validation cycle before cutting over. Revert by disabling the pg_basebackup cron job and monitoring the next scheduled logical backup.
If Option 2 (automated restore drill) causes unexpected resource usage: The EventBridge or cron schedule can be disabled immediately. If a test instance was not terminated by the script, terminate it manually via aws rds delete-db-instance --db-instance-identifier mydb-validation-YYYYMMDD --skip-final-snapshot.
If Option 3 (CI catalog check) produces false positives after a migration: Regenerate the expected_objects.txt baseline from the current schema and commit it. The diff will be clean on the next run.

Automation Opportunity

The most impactful automation for this runbook is a weekly restore drill that requires no engineer involvement. The AWS pattern for Aurora uses EventBridge to trigger a Lambda function once per week. The Lambda calls restore-db-instance-from-db-snapshot using the most recent available snapshot, polls the instance status until it reaches available, runs row count checks via the RDS Data API or a temporary Lambda-to-RDS connection, logs the elapsed time and results to CloudWatch Logs, then calls delete-db-instance to terminate the test instance.

For a 100 GB Aurora database, the AWS RDS pricing documentation indicates that snapshot restore charges apply at the storage rate for the duration the instance is running. A validation instance that runs for two hours per week at db.t3.medium pricing (on-demand) costs approximately $0.34 per week at current us-east-1 rates — less than the cost of one engineer-hour spent on a manual drill. The actual cost depends on instance class, storage provisioned, and region.

For self-managed PostgreSQL, a pg_cron job or a systemd timer can trigger the shell skeleton from Option 2. The key instrumentation addition is writing the elapsed RTO and row count results to a table in a monitoring database so that trend data is available — a restore time that grows month over month as the database grows is a signal to revisit backup type before it breaches SLA.

Leadership Summary

What broke: Backup jobs were succeeding but restorability had never been tested, meaning the team’s documented RTO had no measured basis and recovery from a real incident would be slower and less certain than assumed.
What was done: A validation workflow was implemented that measures actual restore time, verifies data consistency post-restore, and tests point-in-time recovery on a documented schedule.
What prevents recurrence: Automated weekly restore drills log measured RTO to a persistent store, and a CI catalog check flags schema migrations that would make a backup unrestorable before they reach production.

Checklist

Verify backup file integrity using pg_restore --list (PostgreSQL) or aws rds describe-db-snapshots (Aurora) — confirm no errors before proceeding
Check backup age: confirm the most recent backup is within the expected retention window and frequency
Query pg_stat_archiver and confirm failed_count is zero and last_archived_time is recent
Run a timed restore to an isolated test instance and record wall-clock start and end times as the measured RTO
Compare measured RTO against documented SLA target — escalate if over threshold
Run row counts on the top 20 tables by size on the restored instance and compare to production baseline
Verify sequence values are ahead of their respective table maximum id values
Query pg_constraint on the restored instance and confirm all expected foreign key constraints are present
Run a PITR drill to a timestamp 1 hour before the current time — confirm the instance comes up and data at the target time is present
Document the validation date, measured RTO, PITR result, and any anomalies in the validation log
Set a calendar reminder or automate a trigger to repeat this cycle within 30 days
If measured RTO exceeds SLA: open a ticket to evaluate physical backup method or restore parallelism before the next scheduled drill

What to Do Next

Problem: Backup jobs report success but the team has never measured actual restore time or verified data consistency — meaning the documented RTO is a guess and a real recovery event will be slower and less certain than expected.
Solution: Run a timed restore to an isolated test instance, verify row counts and foreign key constraints post-restore, and test PITR to a target timestamp — on a schedule that keeps the measurement fresh.
Proof: A logged RTO that fits inside the SLA target, verified by wall-clock start and end times from the last restore drill, plus a confirmed PITR result within the last 30 days.
Action: This week, run pg_restore --list backup.dump (or aws rds describe-db-snapshots) to verify your most recent backup file is structurally intact, then schedule the first timed restore drill if one has not been run in the past 30 days.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Rollback Plan

Automation Opportunity

Leadership Summary

Checklist

What to Do Next

Rajiv

Related Posts

Backups Are Not Recovery: The DBA Rule Everyone Learns Late

MySQL Slow Query Playbook: From Slow Log to Fix

PostgreSQL Slow Query Triage Workflow