Database Reliability Playbook

Operational workflows for diagnosing, recovering, and preventing database incidents across slow queries, backups, replication, autovacuum, migrations, and capacity pressure.

57 posts Databases

Who This Is For

DBAs, SREs, and backend engineers who own production databases and need structured triage workflows: field-tested runbooks, not tutorials, for when something is already on fire.

What You Will Be Able to Do

  • Run a structured triage when a query suddenly slows down or a replica falls behind
  • Validate backup integrity before you need it, not during an incident
  • Diagnose autovacuum lag and table bloat before they slow queries down
  • Run schema migrations against live traffic without downtime

Prerequisites

Basic familiarity with PostgreSQL or MySQL. You should know what a query plan is, even if you can't read one fluently yet.

1 Diagnostics

How to find what's slow and why the optimizer chose a bad plan — before you start changing things.

2 Incident Runbooks

Structured triage for when something is already on fire — connection storms, autovacuum failures, replication lag, and CPU spikes.

3 Prevention & Capacity

Proactive work that stops incidents before they happen — backups you can actually restore, connection pools that don't kill the primary, and migrations with a rollback path.

Additional Posts

Related posts matched to this series by topic, tags, and keywords.