Backups Are Not Recovery: The DBA Rule Everyone Learns Late

A backup file is not proof of recoverability. It is proof that data was written to storage at a point in time. Recovery is the separate process of taking that file and producing a running, consistent database on a different system within your RTO. Engineers who conflate the two discover the gap during an actual incident — the worst possible time to find it.

Situation

Most teams running production databases configure some form of backup: nightly pg_dump jobs, Aurora snapshots, or xtrabackup runs scheduled for low-traffic windows. The mechanics are straightforward. Monitoring confirms the job completed without error.

That confirmation covers one half of the contract. It says data left the system. It says nothing about restore time, or whether WAL segments and encryption keys are available in the same failure scenario that just took down the primary.

The Problem

A common failure mode: a team runs nightly pg_dump, stores the output in S3, and considers its backup strategy complete. During a corruption event, they start a restore and discover that a logical dump has to be reloaded as SQL into a cold instance, with every row re-inserted and every index rebuilt. On a large database, that is hours of work. With no WAL archives stored, there is no PITR capability either.

The backup was real. The recovery was not viable within their RTO.

The question every team must answer before an incident: have you timed a full restore on target hardware, and does that number fit inside your recovery time objective?
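One way to get that number is to wrap the restore in a timer and compare the result with the RTO. The following is a minimal sketch, not a finished drill harness: `timed_restore` is a hypothetical helper, and the command shown in the usage section is a stand-in for whatever restore script you actually run.

```python
import subprocess
import sys
import time
from datetime import timedelta
from typing import Sequence


def timed_restore(command: Sequence[str], rto: timedelta) -> dict:
    """Run a restore command, measure wall-clock time, and compare it to the RTO."""
    started = time.monotonic()
    result = subprocess.run(list(command), check=False)
    elapsed = timedelta(seconds=time.monotonic() - started)
    succeeded = result.returncode == 0
    return {
        "succeeded": succeeded,
        "elapsed": elapsed,
        # A failed restore never counts as meeting the RTO, however fast it failed.
        "within_rto": succeeded and elapsed <= rto,
    }


if __name__ == "__main__":
    # Placeholder command (assumption): replace with your real restore script.
    report = timed_restore([sys.executable, "-c", "pass"], rto=timedelta(hours=1))
    print(report)
```

The measurement only means something if the command restores onto hardware comparable to what you would use in a real incident.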

Core Concept

RPO and RTO are different constraints governed by different mechanics.

RPO (Recovery Point Objective) is how much data loss is acceptable. A nightly backup gives an RPO of up to 24 hours. An RPO of minutes requires continuous WAL archiving (PostgreSQL) or binary log shipping (MySQL). Aurora documents this explicitly — PITR to any second within the retention window is only possible because Aurora streams redo logs continuously, not because snapshots run frequently.

RTO (Recovery Time Objective) is how long you can be down. It is determined by restore speed, not backup frequency.
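The two constraints can be put into rough arithmetic. This is a sketch under simplifying assumptions: the function names are hypothetical, restore throughput is treated as constant, and the figures in the usage lines are illustrative inputs, not measurements.

```python
from datetime import timedelta
from typing import Optional


def worst_case_rpo(backup_interval: timedelta,
                   log_archive_lag: Optional[timedelta] = None) -> timedelta:
    """Worst-case data loss.

    With continuous log archiving, the exposure is the archiving lag.
    Without it, the exposure is the full gap between two backups.
    """
    return log_archive_lag if log_archive_lag is not None else backup_interval


def estimated_restore_time(data_gb: float, restore_gb_per_hour: float,
                           log_replay: timedelta = timedelta(0)) -> timedelta:
    """Rough RTO estimate: time to load the base backup plus log replay time."""
    return timedelta(hours=data_gb / restore_gb_per_hour) + log_replay


if __name__ == "__main__":
    print(worst_case_rpo(timedelta(hours=24)))                        # nightly backup only
    print(worst_case_rpo(timedelta(hours=24), timedelta(minutes=5)))  # with log archiving
    print(estimated_restore_time(2000, 500, timedelta(minutes=30)))   # assumed throughput
```

Note that backup frequency appears only in the RPO function. The restore estimate depends on data size and throughput, which is why it has to be measured rather than assumed.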

flowchart TD
    A[Primary Database] -->|Writes data| B[Base Backup]
    A -->|Streams changes| C[WAL Archive]
    B --> D[Disaster Recovery Target]
    C -->|Replays up to the recovery target| D
    D --> E[Recovered Database]

Backup type	Restore speed	PITR capable
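Logical (`pg_dump`, `mysqldump`)	Slow: reloads every row as SQL and rebuilds indexes	No for `pg_dump` (WAL cannot be replayed onto a dump); `mysqldump` only if binlog coordinates were recorded and binlogs are retained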
Physical — `pg_basebackup`, `xtrabackup`	Fast — copies raw data files	Yes, when WAL or binlog archiving is configured
Cloud snapshot — Aurora, RDS	Fast — clones at storage layer	Yes, when continuous backup is enabled

PostgreSQL’s documentation for pg_basebackup describes its output as a binary copy of the data directory that a new instance can start from directly — bypassing the replay overhead that makes logical restores slow. For large databases, the difference is not marginal.

Three further gaps turn a successful backup into a failed recovery:

Same-region backup storage. A regional disruption takes out both the database and the S3 bucket if they share a region. A backup unavailable during the failure it is meant to cover is not a recovery asset.

Logical backup without WAL archiving. A pg_dump taken at 2:00 AM returns you to the 2:00 AM state. If corruption happens at 11:58 PM, nearly 22 hours of data are gone. PITR requires a physical base backup plus WAL archiving in PostgreSQL, or retained binary logs in MySQL, and a logical dump provides neither on its own.

Encryption key in the failed system. If the key lives in the same environment that just failed or was compromised, the backup cannot be decrypted. Key management must be independent of the system being protected.
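The three gaps above can be checked mechanically once the setup is written down. A minimal sketch, assuming a hypothetical `BackupSetup` record that you fill in by hand or from your own inventory; the field names are illustrative and not taken from any tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BackupSetup:
    database_region: str
    backup_region: str
    log_archiving_enabled: bool   # WAL archiving or retained binary logs
    database_environment: str     # e.g. the account or cluster that hosts the database
    key_environment: str          # where the backup encryption key is managed


def find_gaps(setup: BackupSetup) -> List[str]:
    """Return the recovery gaps present in a backup setup."""
    gaps = []
    if setup.backup_region == setup.database_region:
        gaps.append("same-region backup storage")
    if not setup.log_archiving_enabled:
        gaps.append("no log archiving, so no PITR")
    if setup.key_environment == setup.database_environment:
        gaps.append("encryption key lives in the protected environment")
    return gaps


if __name__ == "__main__":
    risky = BackupSetup("us-east-1", "us-east-1", False, "prod-account", "prod-account")
    for gap in find_gaps(risky):
        print(gap)
```

An empty list does not prove recoverability. It only says these three known gaps are closed.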

In Practice

PostgreSQL’s pg_basebackup documentation notes that the WAL generated while the backup runs is required to bring the copy to a consistent state. WAL archived after the backup is what allows replay to a later point, so WAL archiving is the prerequisite for any PITR capability in self-managed PostgreSQL.
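Whether archiving is actually configured can be checked against three or four server settings. The sketch below assumes you have already collected the values (for example from `SHOW` queries) into a plain dictionary; `wal_archiving_problems` is a hypothetical helper, and it only inspects configuration, not whether archived segments are arriving.

```python
from typing import Dict, List


def wal_archiving_problems(settings: Dict[str, str]) -> List[str]:
    """Check PostgreSQL settings that WAL archiving depends on."""
    problems = []
    # wal_level = minimal does not write enough WAL for archiving.
    if settings.get("wal_level") not in ("replica", "logical"):
        problems.append("wal_level must be replica or logical")
    if settings.get("archive_mode") not in ("on", "always"):
        problems.append("archive_mode must be on or always")
    # archive_library is the alternative to archive_command in PostgreSQL 15+.
    command = settings.get("archive_command", "")
    library = settings.get("archive_library", "")
    if command in ("", "(disabled)") and not library:
        problems.append("archive_command (or archive_library) is not set")
    return problems


if __name__ == "__main__":
    # Illustrative values (assumption), not output from a real server.
    print(wal_archiving_problems({"wal_level": "minimal", "archive_mode": "off"}))
```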

Percona’s XtraBackup documentation describes a hot physical backup that does not block writes. It records the binary log position at the backup’s end — the anchor required for point-in-time recovery in MySQL and MariaDB.
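A restore procedure needs that position in machine-readable form. The sketch below assumes the commonly seen layout of the `xtrabackup_binlog_info` file in the backup directory: binlog file name, position, and an optional GTID set, separated by whitespace. Verify this against the output of your XtraBackup version before relying on it; `parse_binlog_info` is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class BinlogPosition(NamedTuple):
    file: str
    position: int
    gtid_set: Optional[str]


def parse_binlog_info(text: str) -> BinlogPosition:
    """Parse binlog coordinates from the assumed xtrabackup_binlog_info layout."""
    fields = text.split()
    if len(fields) < 2:
        raise ValueError("expected at least a binlog file name and a position")
    # A GTID set may wrap across lines; rejoin the remaining pieces.
    gtid_set = "".join(fields[2:]) or None
    return BinlogPosition(fields[0], int(fields[1]), gtid_set)


if __name__ == "__main__":
    # Illustrative content (assumption), not taken from a real backup.
    print(parse_binlog_info("mysql-bin.000003\t57\n"))
```

The parsed file name and position are the starting point for replaying binary logs up to the moment just before the incident.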

Amazon Aurora’s PITR documentation states that restores create a new DB cluster, not an in-place restoration. Applications must re-point to the new endpoint after a PITR restore — a step that surprises engineers who have never run the procedure under pressure.
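The "new cluster" behavior is visible in the shape of the API request itself. The sketch below only builds the parameters for the RDS `RestoreDBClusterToPointInTime` call and does not contact AWS; the cluster names are hypothetical, and the request would typically be sent with `boto3.client("rds").restore_db_cluster_to_point_in_time(**params)`.

```python
from datetime import datetime, timezone
from typing import Optional


def aurora_pitr_request(source_cluster: str, new_cluster: str,
                        restore_to: Optional[datetime] = None) -> dict:
    """Build parameters for an Aurora point-in-time restore into a NEW cluster."""
    if new_cluster == source_cluster:
        # PITR never restores in place, so the target needs its own identifier.
        raise ValueError("the restored cluster needs a different identifier")
    params = {
        "SourceDBClusterIdentifier": source_cluster,
        "DBClusterIdentifier": new_cluster,
    }
    if restore_to is None:
        params["UseLatestRestorableTime"] = True
    else:
        if restore_to.tzinfo is None:
            raise ValueError("restore_to must be timezone-aware")
        params["RestoreToTime"] = restore_to.astimezone(timezone.utc)
    return params


if __name__ == "__main__":
    # Hypothetical identifiers and timestamp, for illustration only.
    when = datetime(2024, 1, 15, 23, 57, tzinfo=timezone.utc)
    print(aurora_pitr_request("orders-prod", "orders-prod-restored", when))
```

Because the restored cluster has a new identifier, it also has a new endpoint. A runbook should include the step that re-points applications, whether through DNS, configuration, or a connection proxy.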

Where It Breaks

Scenario	What breaks	Why
Untested restore	RTO is unknown until the incident	Restore time was assumed, never measured on comparable hardware
Same-region backup storage	Backup unavailable during regional failure	S3 bucket and database instance share the same AWS region
Logical backup without WAL archiving	No PITR capability	`pg_dump` is a point-in-time snapshot; intermediate recovery requires WAL or binlog
Encryption key in the same environment	Cannot decrypt backup during recovery	Key management system is part of the failed or compromised system

What to Do Next

Problem: A backup job completing successfully does not mean recovery is possible within your RTO.
Solution: Treat backup and recovery as separate contracts — configure WAL archiving for PITR, store backups cross-region, and time a full restore on comparable hardware.
Proof: A timed restore drill producing a running, queryable database at a point in time before a simulated event, completed inside your documented RTO.
Action: This week, identify your largest production database and determine how long a full restore would take with your current backup type. If you have never timed it, schedule the drill now.

The backup proves data was written somewhere. The only thing that proves recovery is doing it.

Rajiv