Database Reliability Playbook

PostgreSQL Slow Query Triage Workflow

A structured runbook for diagnosing slow query root causes in PostgreSQL — missing indexes, stale statistics, lock contention, and I/O saturation — in the order that wastes the least time.

May 23, 2022 11 min read

L3 Reference Guide

MySQL Slow Query Playbook: From Slow Log to Fix

A repeatable workflow for diagnosing MySQL slow queries — from enabling the slow log through reading EXPLAIN output to committing a safe fix.

2 Incident Runbooks

Structured triage for when something is already on fire — connection storms, autovacuum failures, replication lag, and CPU spikes.

Apr 3, 2023 10 min read

L3 Reference Guide

PostgreSQL Connection Storm Runbook

Diagnosing and resolving connection exhaustion in PostgreSQL: too many clients, idle-in-transaction accumulation, and the case for connection pooling.

Jan 16, 2023 9 min read

L2 Deep Dive

PostgreSQL Autovacuum Failure Workflow

A step-by-step runbook for diagnosing and resolving autovacuum failures: dead tuple accumulation, bloat, and transaction ID wraparound risk.

Mar 6, 2023 8 min read

L2 Deep Dive

Aurora MySQL Writer CPU Spike Workflow

A systematic runbook for diagnosing Aurora MySQL writer CPU spikes — from Performance Insights through lock contention, long transactions, and read offload.

#databases #cloud #checklist #failures

Feb 6, 2023 8 min read

L2 Deep Dive

MySQL Replication Lag Decision Tree

A systematic runbook for diagnosing MySQL replication lag — from initial SHOW REPLICA STATUS to parallel apply, long transactions, and relay log space.

Jul 17, 2023 10 min read

L3 Reference Guide

Logical Replication Failure Workflow

A diagnostic runbook for logical replication lag, apply worker failures, replication conflicts, and schema drift between publisher and subscriber.

3 Prevention & Capacity

Proactive work that stops incidents before they happen — backups you can actually restore, connection pools that don't kill the primary, and migrations with a rollback path.

Apr 11, 2022 5 min read

L1 Field Note

PostgreSQL Autovacuum: What Every Engineer Should Know

Autovacuum is not optional maintenance — it is the mechanism that prevents table bloat and transaction ID wraparound from taking your database offline.

#databases #checklist

Nov 14, 2022 5 min read

L2 Deep Dive

Backups Are Not Recovery: The DBA Rule Everyone Learns Late

A backup file proves you captured data. Recovery is the process of producing a running, consistent database on a different system inside your RTO. They are not the same thing, and confusing them is how incidents get worse.

#databases #failures #checklist

May 15, 2023 11 min read

L3 Reference Guide

Database Backup Validation Workflow

A repeatable runbook for proving that your database backups are actually restorable — with exact commands, decision tree, and automation patterns.

Jul 16, 2024 7 min read

L2 Deep Dive

Database Changes in CI/CD: Migrations, Backfills, Expand-Contract, and Verification

Database changes in CI/CD require separate gates for schema migrations, backfills, and expand-contract patterns — not just a shell command before deployment.

#databases #architecture

Additional Posts

Related posts matched to this series by topic, tags, and keywords.

Jun 5, 2023 10 min read

L3 Reference Guide

Cloud Database Cost Triage: Storage, IOPS, CPU, Replicas

A structured runbook for identifying which cost dimension is driving your AWS RDS or Aurora bill before making any changes.

#databases #cloud #checklist

Aug 20, 2024 5 min read

L2 Deep Dive

PostgreSQL Observability: Vacuum, Bloat, Locks, Replication Lag, and Query Plans

Monitoring PostgreSQL requires looking past the operating system and into the internal bookkeeping of MVCC, autovacuum, and replication streams.

#databases #architecture #failures

Feb 18, 2025 5 min read

L2 Deep Dive

AI Engineering

AI-Assisted Incident Triage: From Alert Noise to Root-Cause Hypotheses

How generative AI tools like CloudWatch Investigations shift the operational burden from reading raw dashboards to validating machine-generated hypotheses.

#ai-engineering #failures #cloud

Apr 16, 2026 2 min read

L1 Field Note

#ai-engineering #failures #architecture #checklist

SQL Server to PostgreSQL Migration Cost Defense Checklist

A pragmatic checklist to defend the business case for migrating away from Microsoft SQL Server.

#checklist #databases

May 27, 2026 7 min read

L2 Deep Dive

AI Engineering

AI Cost Incident Runbook: What to Do When Monthly Token Spend Suddenly Doubles

An operational playbook for triaging and containing LLM token spend spikes, from the first alert to root cause within 30 minutes.

Jan 20, 2026 8 min read

L2 Deep Dive

AI Engineering

AI Agent Observability: Monitor Tool Calls, Token Spend, Latency, and Failure Loops

Why monitoring autonomous SRE agents requires tracking tool-call hallucinations, context window saturation, and recursive retry loops, rather than just basic CPU metrics.

#ai-engineering #architecture #failures #system-design

Sep 26, 2022 7 min read

L2 Deep Dive

MongoDB Query Performance Workflow

A systematic runbook for diagnosing slow MongoDB queries — from explain output through COLLSCAN, index selectivity, in-memory sort, and WiredTiger cache pressure.

Jan 10, 2023 4 min read

L1 Field Note

Engineering Fundamentals

Replication Lag Explained

What replication lag actually measures in PostgreSQL, the three distinct lag components that most monitoring tools conflate, and which one matters for your RPO.

#databases #fundamentals

May 8, 2023 6 min read

L2 Deep Dive

Logical Replication vs Physical Replication in PostgreSQL

Physical replication copies bytes; logical replication copies row changes — and confusing the two causes silent schema drift, sequence divergence, and failed zero-downtime upgrades.

#databases #architecture

Aug 19, 2023 8 min read

L2 Deep Dive

OCI for Oracle-Heavy Enterprises: Migration Pattern, Risk Boundary, and Cost Model

OCI migration risk model for Oracle-heavy enterprises — where the lift-and-shift boundary shifts from the database tier into dependent application contracts.

#architecture #databases #cloud

Jan 16, 2024 7 min read

L2 Deep Dive

Checkout Failure Triage: Payment, Inventory, Order Write, or Downstream Event

Triage checklist for isolating checkout failures across payment gateway, inventory reservation, order write, and event propagation boundaries.

#architecture #failures #cloud

Feb 26, 2024 10 min read

L3 Reference Guide

PostgreSQL Statistics Drift Workflow

When the query planner gets row estimates wrong, queries regress silently. This runbook diagnoses statistics drift and restores accurate plans.

Jun 14, 2024 7 min read

L2 Deep Dive

Search Index Drift Workflow: Rebuilds, Dual Writes, CDC, and User-Visible Staleness

Search index drift is a truth-management failure: when to rebuild vs. dual-write vs. CDC, and how to bound user-visible staleness.

#architecture #databases #cloud

Sep 17, 2024 6 min read

L2 Deep Dive

Argo CD Deployment Workflow: Sync Waves, Health Checks, Rollbacks, and Drift

Argo CD sync waves, health check gates, rollback triggers, and drift detection — the four mechanisms that separate GitOps deployments from applied YAML.

#architecture #cloud #failures

Oct 15, 2024 7 min read

L2 Deep Dive

CI/CD Observability: Queue Time, Flake Rate, Lead Time, Failure Domains, and Change Risk

Queue time, flake rate, lead time, failure domains, and change risk as CI/CD signals that reveal whether a delivery system is becoming safer or just busier.

#architecture #failures #cloud

Apr 8, 2025 7 min read

L2 Deep Dive

Python Automation Framework for DB and Cloud Ops: Architecture and Failure Model

DB and cloud automation fails when partial failures leave the database, cloud account, and ticketing system describing different operation states.

#architecture #cloud #databases

Jul 5, 2025 9 min read

L2 Deep Dive

When Autovacuum Becomes a Backpressure Signal

PostgreSQL vacuum stalls are often symptoms of lock pressure, table bloat, and missing operational visibility.

#databases #failures #checklist

Sep 13, 2025 10 min read

L3 Reference Guide

Autovacuum Is a Capacity Problem, Not a Maintenance Task

PostgreSQL vacuum failures often start with blocked cleanup, table bloat, and weak lock observability during peak load.

#databases #failures #checklist

Oct 7, 2025 13 min read

L2 Deep Dive

PostgreSQL 18 Replication Upgrade Opportunities

What changes in replication when upgrading from PostgreSQL 14–16 to PostgreSQL 18: parallel apply, pg_createsubscriber, and improved conflict visibility.

#databases #architecture #checklist

Dec 20, 2025 8 min read

L2 Deep Dive

#databases #ai-engineering #architecture

Automated Reliability Across the Stack: Database Backups, Platform Observability, and SQL Quality (November 2025)

Three November 2025 open-source releases eliminate manual work from three engineering reliability tasks — multi-database backup verification, self-hosted log and trace collection, and SQL static analysis in CI pipelines.

Jan 30, 2026 4 min read

L1 Field Note

#databases #ai-engineering #architecture #checklist

Database Runbooks as Agent Contracts

A reference operating model for turning human database runbooks into machine-usable agent contracts.

May 16, 2026 6 min read

L2 Deep Dive

#ai-engineering #databases #architecture

Stop Writing Ad-Hoc Queries: Build a Skill Backbone for Your DB Engineering Workflows

How to codify repetitive DB tasks into testable, reusable Claude skills that produce consistent SQL, runbooks, and migration outputs instead of one-off chat prompts.

May 28, 2026 17 min read

L3 Reference Guide

Per-App Postgres on Kubernetes Changes the Failure Boundary

How CloudNativePG, GitOps, and external secrets make per-application Postgres viable without hiding the operational cost.

#databases #cloud #architecture

Nov 9, 2021 8 min read

L2 Deep Dive

Runbook to Pipeline: How to Convert Manual Operations Without Creating Risk

Converting a runbook into an automated pipeline is not a transcription exercise — a human operator can stop at bad preconditions, and a pipeline must explicitly encode every check that was previously implicit in that judgment.

Dec 14, 2021 7 min read

L2 Deep Dive

Automation Incident Review: When the Tool Worked and the System Failed

The hardest automation incidents are not broken tools — they happen when every tool executes exactly as asked while the surrounding system loses the ability to evaluate whether that action is still safe.

Jan 11, 2022 8 min read

L2 Deep Dive

System Design Starts With Failure Modes, Not Boxes and Arrows

The first system design question is not 'what are the services' — it is 'what breaks, how fast does it spread, and what evidence tells us the damage is contained.' A framework for failure-mode-first design.

Jul 12, 2022 8 min read

L2 Deep Dive

Terraform Drift Triage Workflow: Detect, Classify, Reconcile, Prevent

Terraform drift is a control-plane integrity problem — how to detect it, classify whether it is an emergency or acceptable deviation, reconcile state safely, and prevent future splits without blocking legitimate out-of-band changes.

Aug 9, 2022 6 min read

L2 Deep Dive

Terraform Import Workflow: Bringing Existing Cloud Resources Under Control

Terraform import's dangerous moment is not the command — it is when a team mistakes 'now in state' for 'now under control.' A safe import workflow covering targeted plans, drift checks, and state file validation before any apply.

Oct 8, 2022 7 min read

L2 Deep Dive

AWS Database Cost Triage: RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch

Database bills grow when ownership, workload shape, and control loops drift apart — a structured triage approach for RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch spend before it becomes an emergency.

Jan 21, 2023 7 min read

L2 Deep Dive

Azure Database Reliability Review: Failover Groups, Backups, and Geo-Replication

Azure database recovery beyond 'we have backups': failover group cutover, geo-replication lag, and backup restore testing as the real reliability floor.

May 21, 2023 7 min read

L2 Deep Dive

GCP Multi-Region Architecture: Global Load Balancing, Spanner, Pub/Sub, and Failure Testing

Control plane coupling, Spanner split boundaries, and untested Pub/Sub failover are why GCP multi-region architectures break before the region goes dark.

Jul 11, 2023 7 min read

L2 Deep Dive

Ownership Metadata: The Small Catalog Field That Fixes Incidents

Ownership fields in the service catalog make the responsible team discoverable at alert time — the missing link that shortens incident duration.

Aug 4, 2023 7 min read

L2 Deep Dive

OCI Disaster Recovery Review: Regions, ADs, Backups, Data Guard, and GoldenGate

OCI disaster recovery gaps that emerge when teams rely on regional failover alone, and how Data Guard and GoldenGate address the database replication tier.

Nov 14, 2023 7 min read

L2 Deep Dive

Service Lifecycle Workflow: Create, Promote, Deprecate, Archive, Delete

Service lifecycle management — from creation through deprecation and safe deletion — requires a control system beyond the deployment pipeline.

Jan 31, 2024 7 min read

L2 Deep Dive

Inventory Consistency Playbook: Reservation, Release, Reconciliation, and Oversell Risk

Reservation, release, and reconciliation for inventory systems where carts, payments, and retries generate conflicting stock counts across writes.

Feb 13, 2024 7 min read

L2 Deep Dive

Service Catalog Incident Workflow: Find Owner, Blast Radius, Dependencies, and Last Change

Service catalog fields for owner, dependency graph, blast radius, and last deploy that cut incident triage time before Slack threads spiral.

Feb 15, 2024 8 min read

L2 Deep Dive

Catalog Sync Workflow: Database, Search Index, CDN, and Cache Invalidation

Propagating a catalog update from database commit through Elasticsearch, CDN edge cache, and application cache without stranding stale reads downstream.

Apr 30, 2024 7 min read

L2 Deep Dive

API Gateway Incident Workflow: Auth, Rate Limits, Routing, and Downstream Saturation

API gateway incidents are misdiagnosed when teams treat them as proxy failures instead of control-plane failures with downstream saturation blast radius.

May 15, 2024 7 min read

L2 Deep Dive

Cache Incident Workflow: Hit Rate Collapse, Stampede, TTLs, and Database Protection

Cache hit-rate collapse, stampedes, TTL misconfiguration, and unprotected database load: a workflow for diagnosing each failure in sequence.

May 30, 2024 7 min read

L2 Deep Dive

Queue Backlog Workflow: Producer Spike, Consumer Lag, Poison Messages, and Retry Storms

Producer spikes, consumer lag, poison messages, and retry storms each need a different intervention — the diagnosis order matters as much as the fix.

Jul 14, 2024 7 min read

L2 Deep Dive

Cloud Cost Triage Workflow: Compute, Storage, Data Transfer, Logs, and Managed Services

Cloud cost triage across compute, storage, data transfer, logs, and managed services — a repeatable workflow for finding runaway spend before the bill arrives.

Jul 29, 2024 8 min read

L2 Deep Dive

Database Migration Cutover Workflow: Dual Writes, CDC, Backfill, Freeze, and Rollback

Database migration cutover using dual writes, CDC, backfill, and freeze phases — with rollback boundaries for when 'almost synchronized' is not an operational state.

Aug 20, 2024 7 min read

L2 Deep Dive

GitHub Actions for Platform Teams: Reusable Workflows, OIDC, Environments, and Audit

GitHub Actions reusable workflows, OIDC credential federation, and environment approval gates — preventing per-repo credential sprawl across a platform.

Jan 14, 2025 7 min read

L2 Deep Dive

Building a Safe Python Migration Runner for Operational Data Changes

A Python migration runner for live operational data needs idempotency guards, dry-run modes, and rollback hooks that schema migrations skip by default.

Jun 10, 2025 7 min read

L2 Deep Dive

DB Team Automation Roadmap: Backups, Patching, Refreshes, Provisioning, and Guardrails

A sequenced roadmap for database teams to automate backups, patching, refreshes, and provisioning — with guardrails that prevent automation from becoming a risk multiplier.

Jul 15, 2025 7 min read

L2 Deep Dive

Automation Rollback Playbook: Disable, Revert, Repair State, and Reconcile Reality

How to roll back automation safely when it misfires — the four-stage playbook: disable the automation, revert the change, repair state, and reconcile system reality with declared intent.