Backups Are Not Recovery: The DBA Rule Everyone Learns Late
A backup file is not proof of recoverability. It is proof that data was written to storage at a point in time. Recovery is the separate process of taking that file and producing a running, consistent database on a different system within your RTO. Engineers who conflate the two discover the gap during an actual incident — the worst possible time to find it.
Situation
Most teams running production databases configure some form of backup. Nightly pg_dump jobs, Aurora snapshots, xtrabackup runs around low-traffic windows — the mechanics are straightforward. Monitoring confirms the job completed without error.
That confirmation covers one half of the contract. It says data left the system. It says nothing about restore time, or whether WAL segments and encryption keys are available in the same failure scenario that just took down the primary.
The Problem
The typical failure mode: a team runs nightly pg_dump, stores the output in S3, and considers its backup strategy complete. During a corruption event, the team starts a restore and discovers that loading a pg_dump archive re-executes SQL against an empty instance: every row is reloaded and every index is rebuilt, which takes hours on a large database. With no WAL archive stored, there is no PITR capability either.
The backup was real. The recovery was not viable within their RTO.
The question every team must answer before an incident: have you timed a full restore on target hardware, and does that number fit inside your recovery time objective?
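A minimal sketch of how that timing could be captured, assuming the restore can be started from a single command. The names `time_restore` and `DrillResult` and the four-hour RTO are illustrative, and the placeholder command stands in for whatever restore procedure the team actually uses.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class DrillResult:
    elapsed_seconds: float
    exit_code: int
    within_rto: bool


def time_restore(command: list[str], rto_seconds: float) -> DrillResult:
    """Run a restore command and compare its wall-clock time to the RTO."""
    started = time.monotonic()
    completed = subprocess.run(command, check=False)
    elapsed = time.monotonic() - started
    # A failed restore never counts as meeting the RTO, however fast it exits.
    ok = completed.returncode == 0 and elapsed <= rto_seconds
    return DrillResult(elapsed, completed.returncode, ok)


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; swap in the real restore.
    placeholder = [sys.executable, "-c", "print('restore placeholder')"]
    result = time_restore(placeholder, rto_seconds=4 * 3600)
    print(f"restore took {result.elapsed_seconds:.1f}s, within RTO: {result.within_rto}")
```

The measured time only means something if the drill runs on hardware comparable to the recovery target and ends with a query against the restored database, not just a zero exit code.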
Core Concept
RPO and RTO are different constraints governed by different mechanics.
RPO (Recovery Point Objective) is how much data loss is acceptable. A nightly backup gives an RPO of up to 24 hours. An RPO of minutes requires continuous WAL archiving (PostgreSQL) or binary log shipping (MySQL). Aurora documents this explicitly — PITR to any second within the retention window is only possible because Aurora streams redo logs continuously, not because snapshots run frequently.
RTO (Recovery Time Objective) is how long you can be down. It is determined by restore speed, not backup frequency.
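The two constraints can be estimated with simple arithmetic before any drill. The sketch below is a rough model, not a measurement: the function names, the 2 TB size, and the 200 MB/s restore throughput are assumptions chosen for illustration.

```python
from datetime import timedelta
from typing import Optional


def worst_case_rpo(backup_interval: timedelta,
                   log_archive_interval: Optional[timedelta] = None) -> timedelta:
    """Worst-case data loss: the longest gap between recoverable points."""
    if log_archive_interval is None:
        # Without log archiving, the last backup is the only recoverable point.
        return backup_interval
    return min(backup_interval, log_archive_interval)


def estimated_restore_time(size_gb: float,
                           throughput_mb_per_s: float,
                           log_replay: timedelta = timedelta(0)) -> timedelta:
    """Rough RTO floor: time to copy the data back plus time to replay logs."""
    copy_seconds = size_gb * 1024 / throughput_mb_per_s
    return timedelta(seconds=copy_seconds) + log_replay


if __name__ == "__main__":
    print("RPO, nightly only:", worst_case_rpo(timedelta(hours=24)))
    print("RPO, with 1-minute log archiving:",
          worst_case_rpo(timedelta(hours=24), timedelta(minutes=1)))
    print("Restore estimate:", estimated_restore_time(2048, 200))
```

With these assumed numbers the copy alone takes just under three hours, and more frequent backups would not shorten it. Only a faster restore path changes the second figure.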
flowchart TD
A[Primary Database] -->|Writes data| B[Base Backup]
A -->|Streams changes| C[WAL Archive]
B --> D[Disaster Recovery Target]
C -->|Replays until PITR| D
D --> E[Recovered Database]
| Backup type | Restore speed | PITR capable |
|---|---|---|
| Logical (pg_dump, mysqldump) | Slow: replays SQL row by row | No, without WAL or binlog archiving |
| Physical (pg_basebackup, xtrabackup) | Fast: copies raw data files | Yes, when WAL or binlog archiving is configured |
| Cloud snapshot (Aurora, RDS) | Fast: clones at storage layer | Yes, when continuous backup is enabled |
PostgreSQL’s documentation for pg_basebackup describes its output as a binary copy of the data directory that a new instance can start from directly — bypassing the replay overhead that makes logical restores slow. For large databases, the difference is not marginal.
Three additional gaps complete the trap:
- Same-region backup storage. A regional disruption takes out both the database and the S3 bucket if they share a region. A backup unavailable during the failure it is meant to cover is not a recovery asset.
- Logical backup without WAL archiving. A pg_dump taken at 2:00 AM returns you to the 2:00 AM state. If corruption happens at 11:58 PM that night, nearly 22 hours of data are gone. PITR requires WAL archiving in PostgreSQL or binary logs in MySQL, and those logs must be configured and retained separately from the dump.
- Encryption key in the failed system. If the key lives in the same environment that just failed or was compromised, the backup cannot be decrypted. Key management must be independent of the system being protected.
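The three gaps above lend themselves to a mechanical check. The following is a sketch under stated assumptions: `BackupPlan` and its fields are a hypothetical inventory record, not a structure any backup tool provides, and the region and environment names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPlan:
    database_region: str
    backup_region: str
    log_archiving_enabled: bool   # WAL archive, binlog archive, or continuous backup
    database_environment: str     # e.g. cloud account or cluster holding the database
    key_environment: str          # where the backup encryption key is managed


def find_gaps(plan: BackupPlan) -> list[str]:
    """Return the recovery gaps this plan leaves open."""
    gaps = []
    if plan.backup_region == plan.database_region:
        gaps.append("backup storage shares a region with the database")
    if not plan.log_archiving_enabled:
        gaps.append("no log archiving, so no point-in-time recovery")
    if plan.key_environment == plan.database_environment:
        gaps.append("encryption key lives in the environment being protected")
    return gaps


if __name__ == "__main__":
    plan = BackupPlan("us-east-1", "us-east-1", False, "prod-account", "prod-account")
    for gap in find_gaps(plan):
        print("GAP:", gap)
```

An empty result does not prove recoverability. It only rules out these three failure modes; the timed restore remains the real test.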
In Practice
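PostgreSQL’s pg_basebackup documentation notes that the WAL generated while the backup runs is required to make the copied data directory consistent. Recovering to any later point additionally requires every WAL segment archived after the backup, which is why WAL archiving is the prerequisite for any PITR capability in self-managed PostgreSQL.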
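One way to check that prerequisite is to inspect the three settings that govern archiving: `wal_level`, `archive_mode`, and `archive_command` (or `archive_library` on newer releases). The sketch below uses a deliberately simplified parser for `postgresql.conf` text; it ignores include files and would mishandle a `#` inside a quoted value. The function names and the archive path in the sample are assumptions.

```python
import re

_SETTING = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=?\s*(.*?)\s*$")


def parse_conf(text: str) -> dict[str, str]:
    """Parse 'name = value' lines, dropping comments and surrounding quotes."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = _SETTING.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2).strip().strip("'")
    return settings


def pitr_problems(settings: dict[str, str]) -> list[str]:
    """List the settings that would prevent WAL archiving."""
    problems = []
    # Defaults mirror PostgreSQL's: wal_level=replica, archive_mode=off, empty command.
    if settings.get("wal_level", "replica") not in ("replica", "logical"):
        problems.append("wal_level must be replica or logical")
    if settings.get("archive_mode", "off") not in ("on", "always"):
        problems.append("archive_mode is not enabled")
    if not settings.get("archive_command") and not settings.get("archive_library"):
        problems.append("no archive_command or archive_library is set")
    return problems


if __name__ == "__main__":
    sample = """
    wal_level = replica
    archive_mode = on            # requires a server restart
    archive_command = 'cp %p /mnt/archive/%f'
    """
    print(pitr_problems(parse_conf(sample)) or "archiving settings look complete")
```

This checks configuration on disk only. Whether segments are actually reaching the archive is a separate question, one the `pg_stat_archiver` view on the running server can help answer.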
Percona’s XtraBackup documentation describes a hot physical backup that does not block writes. It records the binary log position at the backup’s end — the anchor required for point-in-time recovery in MySQL and MariaDB.
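To illustrate how that anchor is used, the sketch below parses the binary log coordinates and builds a `mysqlbinlog` invocation that replays changes from that position up to a chosen moment. It assumes the coordinates come from the `xtrabackup_binlog_info` file as whitespace-separated fields (file name, position, optional GTID set); the function names and the stop time are hypothetical.

```python
from typing import Optional, TypedDict


class BinlogInfo(TypedDict):
    file: str
    position: int
    gtid: Optional[str]


def parse_binlog_info(text: str) -> BinlogInfo:
    """Parse 'file position [gtid-set]' as whitespace-separated fields."""
    fields = text.split()
    if len(fields) < 2:
        raise ValueError("expected a binlog file name and a position")
    return {
        "file": fields[0],
        "position": int(fields[1]),
        "gtid": "".join(fields[2:]) or None,
    }


def replay_command(info: BinlogInfo, stop_datetime: str) -> list[str]:
    """Build a mysqlbinlog command covering changes made after the backup."""
    return [
        "mysqlbinlog",
        f"--start-position={info['position']}",
        f"--stop-datetime={stop_datetime}",
        info["file"],
    ]


if __name__ == "__main__":
    info = parse_binlog_info("mysql-bin.000003\t57\n")
    print(" ".join(replay_command(info, "2024-01-01 23:57:00")))
```

In a real recovery the output would be piped into the restored server, and any later binary log files would be appended to the command in order. Both steps are left out here to keep the sketch short.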
Amazon Aurora’s PITR documentation states that restores create a new DB cluster, not an in-place restoration. Applications must re-point to the new endpoint after a PITR restore — a step that surprises engineers who have never run the procedure under pressure.
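The new-cluster behavior is visible in the request itself, which needs both a source identifier and a distinct target identifier. The sketch below builds the parameters for boto3's `restore_db_cluster_to_point_in_time` call. The helper names and cluster names are assumptions, and the actual API call is kept in a separate function so the parameter logic can be checked without AWS credentials.

```python
from datetime import datetime, timezone
from typing import Optional


def build_pitr_request(source_cluster: str, new_cluster: str,
                       restore_to: Optional[datetime] = None) -> dict:
    """Build keyword arguments for an Aurora point-in-time restore."""
    if new_cluster == source_cluster:
        raise ValueError("PITR creates a new cluster; it needs its own identifier")
    params = {
        "SourceDBClusterIdentifier": source_cluster,
        "DBClusterIdentifier": new_cluster,
    }
    if restore_to is None:
        params["UseLatestRestorableTime"] = True
    else:
        if restore_to.tzinfo is None:
            raise ValueError("restore_to must be timezone-aware")
        params["RestoreToTime"] = restore_to.astimezone(timezone.utc)
    return params


def restore_cluster(params: dict) -> Optional[str]:
    """Start the restore and return the new cluster endpoint, if reported."""
    import boto3  # third-party dependency, imported only when a restore is run

    rds = boto3.client("rds")
    response = rds.restore_db_cluster_to_point_in_time(**params)
    return response["DBCluster"].get("Endpoint")


if __name__ == "__main__":
    when = datetime(2024, 1, 1, 23, 57, tzinfo=timezone.utc)
    print(build_pitr_request("prod-cluster", "prod-cluster-restored", when))
```

Two follow-up steps belong in the runbook: the API call restores the cluster only, so DB instances have to be created in it afterwards, and the application's connection settings have to be switched to the new endpoint.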
Where It Breaks
| Scenario | What breaks | Why |
|---|---|---|
| Untested restore | RTO is unknown until the incident | Restore time was assumed, never measured on comparable hardware |
| Same-region backup storage | Backup unavailable during regional failure | S3 bucket and database instance share the same AWS region |
| Logical backup without WAL archiving | No PITR capability | pg_dump is a point-in-time snapshot; intermediate recovery requires WAL or binlog |
| Encryption key in the same environment | Cannot decrypt backup during recovery | Key management system is part of the failed or compromised system |
What to Do Next
- Problem: A backup job completing successfully does not mean recovery is possible within your RTO.
- Solution: Treat backup and recovery as separate contracts — configure WAL archiving for PITR, store backups cross-region, and time a full restore on comparable hardware.
- Proof: A timed restore drill producing a running, queryable database at a point in time before a simulated event, completed inside your documented RTO.
- Action: This week, identify your largest production database and determine how long a full restore would take with your current backup type. If you have never timed it, schedule the drill now.
The backup proves data was written somewhere. The only thing that proves recovery is doing it.