PostgreSQL Slow Query Triage Workflow
A structured runbook for diagnosing slow query root causes in PostgreSQL — missing indexes, stale statistics, lock contention, and I/O saturation — in the order that wastes the least time.
Series / Databases
Operational workflows for diagnosing, recovering, and preventing database incidents across slow queries, backups, replication, autovacuum, migrations, and capacity pressure.
DBAs, SREs, and backend engineers who own databases in production and need structured triage workflows — not tutorials, but field-tested runbooks for when something is already on fire.
Basic familiarity with PostgreSQL or MySQL. You should know what a query plan is, even if you don't read them fluently yet.
How to find what's slow and why the optimizer chose a bad plan — before you start changing things.
A structured runbook for diagnosing slow query root causes in PostgreSQL — missing indexes, stale statistics, lock contention, and I/O saturation — in the order that wastes the least time.
A repeatable workflow for diagnosing MySQL slow queries — from enabling the slow log through reading EXPLAIN output to committing a safe fix.
Structured triage for when something is already on fire — connection storms, autovacuum failures, replication lag, and CPU spikes.
Diagnosing and resolving connection exhaustion in PostgreSQL: too many clients, idle-in-transaction accumulation, and the case for connection pooling.
A step-by-step runbook for diagnosing and resolving autovacuum failures: dead tuple accumulation, bloat, and transaction ID wraparound risk.
A systematic runbook for diagnosing Aurora MySQL writer CPU spikes — from Performance Insights through lock contention, long transactions, and read offload.
A systematic runbook for diagnosing MySQL replication lag — from initial SHOW REPLICA STATUS to parallel apply, long transactions, and relay log space.
A diagnostic runbook for logical replication lag, apply worker failures, replication conflicts, and schema drift between publisher and subscriber.
Proactive work that stops incidents before they happen — backups you can actually restore, connection pools that don't kill the primary, and migrations with a rollback path.
Autovacuum is not optional maintenance — it is the mechanism that prevents table bloat and transaction ID wraparound from taking your database offline.
A backup file proves you captured data. Recovery is the process of producing a running, consistent database on a different system inside your RTO. They are not the same thing, and confusing them is how incidents get worse.
A repeatable runbook for proving that your database backups are actually restorable — with exact commands, decision tree, and automation patterns.
Database changes in CI/CD require separate gates for schema migrations, backfills, and expand-contract patterns — not just a shell command before deployment.
Related posts matched to this series by topic, tags, and keywords.
A structured runbook for identifying which cost dimension is driving your AWS RDS or Aurora bill before making any changes.
Monitoring PostgreSQL requires looking past the operating system and into the internal bookkeeping of MVCC, autovacuum, and replication streams.
How generative AI tools like CloudWatch Investigations shift the operational burden from reading raw dashboards to validating machine-generated hypotheses.
A pragmatic checklist to defend the business case for migrating away from Microsoft SQL Server.
An operational playbook for triaging and containing LLM token spend spikes — from alert fire to root cause within 30 minutes.
Why monitoring autonomous SRE agents requires tracking tool-call hallucinations, context window saturation, and recursive retry loops, rather than just basic CPU metrics.
A systematic runbook for diagnosing slow MongoDB queries — from explain output through COLLSCAN, index selectivity, in-memory sort, and WiredTiger cache pressure.
What replication lag actually measures in PostgreSQL, the three distinct lag components that most monitoring tools conflate, and which one matters for your RPO.
Physical replication copies bytes; logical replication copies row changes — and confusing the two causes silent schema drift, sequence divergence, and failed zero-downtime upgrades.
OCI migration risk model for Oracle-heavy enterprises — where the lift-and-shift boundary shifts from the database tier into dependent application contracts.
Triage checklist for isolating checkout failures across payment gateway, inventory reservation, order write, and event propagation boundaries.
When the query planner gets row estimates wrong, queries regress silently. This runbook diagnoses statistics drift and restores accurate plans.
Search index drift is a truth-management failure: when to rebuild vs. dual-write vs. CDC, and how to bound user-visible staleness.
Argo CD sync waves, health check gates, rollback triggers, and drift detection — the four mechanisms that separate GitOps deployments from applied YAML.
Queue time, flake rate, lead time, failure domains, and change risk as CI/CD signals that reveal whether a delivery system is becoming safer or just busier.
DB and cloud automation fails when partial failures leave the database, cloud account, and ticketing system describing different operation states.
PostgreSQL vacuum stalls are often symptoms of lock pressure, table bloat, and missing operational visibility.
PostgreSQL vacuum failures often start with blocked cleanup, table bloat, and weak lock observability during peak load.
What changes in replication when upgrading from PostgreSQL 14–16 to PostgreSQL 18: parallel apply, pg_createsubscriber, and surfaced conflict visibility.
Three November 2025 open-source releases eliminate manual work from three engineering reliability tasks — multi-database backup verification, self-hosted log and trace collection, and SQL static analysis in CI pipelines.
A reference operating model for turning human database runbooks into machine-usable agent contracts.
How to codify repetitive DB tasks into testable, reusable Claude skills that produce consistent SQL, runbooks, and migration outputs instead of one-off chat prompts.
How CloudNativePG, GitOps, and external secrets make per-application Postgres viable without hiding the operational cost.
Converting a runbook into an automated pipeline is not a transcription exercise — a human operator can stop at bad preconditions, and a pipeline must explicitly encode every check that was previously implicit in that judgment.
The hardest automation incidents are not broken tools — they happen when every tool executes exactly as asked while the surrounding system loses the ability to evaluate whether that action is still safe.
The first system design question is not 'what are the services' — it is 'what breaks, how fast does it spread, and what evidence tells us the damage is contained.' A framework for failure-mode-first design.
Terraform drift is a control-plane integrity problem — how to detect it, classify whether it is an emergency or acceptable deviation, reconcile state safely, and prevent future splits without blocking legitimate out-of-band changes.
Terraform import's dangerous moment is not the command — it is when a team mistakes 'now in state' for 'now under control.' A safe import workflow covering targeted plans, drift checks, and state file validation before any apply.
Database bills grow when ownership, workload shape, and control loops drift apart — a structured triage approach for RDS, Aurora, DynamoDB, ElastiCache, and OpenSearch spend before it becomes an emergency.
Azure database recovery beyond 'we have backups': failover group cutover, geo-replication lag, and backup restore testing as the real reliability floor.
Control plane coupling, Spanner split boundaries, and untested Pub/Sub failover are why GCP multi-region architectures break before the region goes dark.
Ownership fields in the service catalog make the responsible team discoverable at alert time — the missing link that shortens incident duration.
OCI disaster recovery gaps that emerge when teams rely on regional failover alone, and how Data Guard and GoldenGate address the database replication tier.
Service lifecycle management — from creation through deprecation and safe deletion — requires a control system beyond the deployment pipeline.
Reservation, release, and reconciliation for inventory systems where carts, payments, and retries generate conflicting stock counts across writes.
Service catalog fields for owner, dependency graph, blast radius, and last deploy that cut incident triage time before Slack threads spiral.
Propagating a catalog update from database commit through Elasticsearch, CDN edge cache, and application cache without stranding stale reads downstream.
API gateway incidents are misdiagnosed when teams treat them as proxy failures instead of control-plane failures with downstream saturation blast radius.
Cache hit-rate collapse leads to stampede, TTL misconfiguration, and unprotected database load — a workflow for diagnosing each failure in sequence.
Producer spikes, consumer lag, poison messages, and retry storms each need a different intervention — the diagnosis order matters as much as the fix.
Cloud cost triage across compute, storage, data transfer, logs, and managed services — a repeatable workflow for finding runaway spend before the bill arrives.
Database migration cutover using dual writes, CDC, backfill, and freeze phases — with rollback boundaries for when 'almost synchronized' is not an operational state.
GitHub Actions reusable workflows, OIDC credential federation, and environment approval gates — preventing per-repo credential sprawl across a platform.
A Python migration runner for live operational data needs idempotency guards, dry-run modes, and rollback hooks that schema migrations skip by default.
A sequenced roadmap for database teams to automate backups, patching, refreshes, and provisioning — with guardrails that prevent automation from becoming a risk multiplier.
How to roll back automation safely when it misfires — the four-stage playbook: disable the automation, revert the change, repair state, and reconcile system reality with declared intent.