PostgreSQL Slow Query Triage Workflow
When p95 latency spikes and the on-call alert fires, most engineers open the slow query log and immediately jump to the biggest query by average execution time. That is the wrong move. The query that shows up longest in pg_stat_statements is often not the query that caused the spike — it is the query that was already slow. The blocking transaction, the missing index on a newly-deployed code path, or autovacuum being interrupted mid-table are the usual culprits. This runbook gives you the order to check that actually closes incidents.
Situation
A p95 latency spike lands in monitoring. The graphs show it clearly: something changed in the last five to fifteen minutes. The application is returning slow responses. Your first instinct is to check the dashboard, which shows elevated CPU and read latency on the database host. pg_stat_activity has more active connections than usual. The alert threshold on slow queries crossed.
At this point, engineers split into two groups. The first opens the slow query log, picks the worst query, and starts trying to add an index or rewrite the SQL. The second checks what PostgreSQL is actually doing right now — what is blocked, what is waiting, and what happened to statistics or autovacuum in the last hour. The second group resolves the incident faster because they are reading system state rather than historical averages.
The problem with jumping straight to the slow query log is that pg_stat_statements accumulates over time. A query that has always been slow will look exactly like a query that just started being slow because of a table scan it previously avoided. You need the current state first, then the cumulative data as context.
PostgreSQL exposes the information you need through its system catalog views. The triage workflow below uses five queries — in order — to eliminate root causes before you start making changes.
Symptoms
| Signal | Where to see it | What it means |
|---|---|---|
| Active query count above baseline | pg_stat_activity, CloudWatch connections metric | Connection pressure or query backup — check for lock waits first |
| Queries appearing in slow query log with new query shapes | pg_stat_statements, auto_explain log output | New code path or table growth crossed a plan-change threshold |
| Sequential scan on a large table in explain output | EXPLAIN (ANALYZE, BUFFERS) output | Missing index or statistics too stale to use an existing one |
waiting column true for multiple queries | pg_stat_activity | Lock contention — one transaction is blocking others |
| High read I/O on the database host | CloudWatch read latency, Datadog disk metrics | Table or index bloat forcing extra page reads; autovacuum may be behind |
last_autoanalyze timestamp hours or days old on active table | pg_stat_user_tables | Stale statistics — planner is working from outdated row estimates |
First Five Checks
- Find currently running slow queries — This is always first. Before looking at anything historical, see what PostgreSQL is doing right now. Queries held open for more than five seconds are either blocked, doing real work, or stuck. The
statecolumn tells you whether they are actively executing or waiting.
SELECT
pid,
now() - query_start AS duration,
state,
wait_event_type,
wait_event,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;
Look at wait_event_type. If it reads Lock, you have a lock contention issue. If it reads IO, the query is waiting on disk. If it is null, the query is actively executing — check the plan next.
- Find top queries by cumulative execution time — Once you know what is running now, pull the historical picture from
pg_stat_statements. This extension is documented in the PostgreSQLpg_stat_statementsmodule reference and accumulates statistics since the last reset. Sort bytotal_exec_timeto find queries that are expensive in aggregate, not just occasionally slow.
SELECT
query,
calls,
total_exec_time / calls AS avg_ms,
total_exec_time,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
A query with high avg_ms but low calls is an outlier. A query with moderate avg_ms but millions of calls is a throughput problem. Both need attention, but the right fix differs.
- Check for lock waits — If check 1 showed any
wait_event_type = 'Lock'rows, this query identifies the full blocking chain.pg_blocking_pids()is a PostgreSQL built-in that returns the PIDs of sessions blocking a given session.
SELECT
blocked.pid,
blocked.query,
blocked.wait_event_type,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
blocking.state AS blocking_state,
now() - blocking.query_start AS blocking_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));
The blocking_query column often reveals the transaction holding the lock. An idle-in-transaction connection is a common culprit: a transaction that opened, ran one query, and then paused while the application did something else — holding its lock the whole time.
- Check table statistics age — If lock waits are not the issue, check whether the planner is working from stale statistics. PostgreSQL uses statistics collected by
ANALYZEto estimate row counts and choose access paths. When statistics fall behind the actual table state — after a large data load, a batch delete, or a period when autovacuum was interrupted — the planner can choose a sequential scan where an index would be far faster.
SELECT
schemaname,
tablename,
last_analyze,
last_autoanalyze,
n_live_tup,
n_dead_tup,
n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
A table with a last_autoanalyze timestamp more than a few hours old on a high-write workload, or a dead_ratio above 10–20%, is a candidate. The autovacuum capacity implications of this pattern are covered in depth in Autovacuum Is a Capacity Problem, Not a Maintenance Task.
- Get EXPLAIN ANALYZE for the slow query — Once you have identified the specific query from checks 1 or 2, pull the execution plan with buffer statistics.
BUFFERSoutput shows how many shared buffer hits versus disk reads the query required, which distinguishes a missing index (high shared hits, no index scan) from an I/O problem (high disk reads).
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
<paste slow query here>;
Look for: Seq Scan on a table with high rows= estimates, rows=1 estimates on nodes where the actual rows are in the thousands (stale statistics), and Buffers: shared read= values that are high relative to table size.
Decision Tree
flowchart TD
A[Slow query alert fires] --> B{pg_stat_activity — queries waiting on Lock?}
B -->|yes| C[Check blocking chain — kill or wait out blocker]
B -->|no| D{EXPLAIN shows Seq Scan on large table?}
D -->|yes| E{Index exists for this predicate?}
E -->|no| F[Add index with CREATE INDEX CONCURRENTLY]
E -->|yes| G{Statistics stale — last_autoanalyze old?}
G -->|yes| H[Run ANALYZE on table — recheck plan]
G -->|no| I{High Buffers: shared read in EXPLAIN?}
I -->|yes| J[Check table bloat and autovacuum lag]
I -->|no| K{Connection count near pool limit?}
K -->|yes| L[Check pool settings and idle-in-transaction connections]
K -->|no| M[Profile query logic — may be algorithmic]
What this diagram shows: A decision tree for PostgreSQL slow query triage — starting with active lock waits, then sequential scans on large tables, missing indexes, stale statistics (last_autoanalyze), high shared buffer reads indicating bloat, and connection pool saturation — in the order that eliminates the most common root causes first.
Remediation Options
Option 1 — Add a missing index
When EXPLAIN shows a sequential scan on a large table and no index covers the query predicate, create one online. CREATE INDEX CONCURRENTLY builds the index without blocking reads or writes. It takes longer than a standard index build, and it can fail if the transaction load is very high, but it is the safe choice for production.
CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at)
WHERE status != 'cancelled';
Partial indexes (the WHERE clause above) reduce size and improve selectivity when the query always filters on a stable condition. After creation, run EXPLAIN again to confirm the planner picks up the new index. If it does not, check that the statistics are current — ANALYZE orders; and re-examine.
Option 2 — Refresh stale statistics
When EXPLAIN shows row estimates that are far off from actual rows — typically rows=1 or a small number where the actual is thousands — and pg_stat_user_tables shows a stale last_autoanalyze, run ANALYZE manually.
ANALYZE VERBOSE orders;
ANALYZE is always safe. It takes a SHARE UPDATE EXCLUSIVE lock, which does not block reads or writes. It completes quickly on most tables. After it finishes, run EXPLAIN again. If the plan does not change, the statistics were not the issue — move to the next check.
If autovacuum is consistently falling behind on this table, the default autovacuum_analyze_scale_factor of 20% is too coarse for large or frequently-modified tables. Lower it per-table:
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);
Option 3 — Resolve lock contention
When the blocking chain query from check 3 shows a long-running transaction holding a lock that others are waiting on, you have two choices: wait for it to finish, or terminate it.
Terminate with care. pg_terminate_backend() sends SIGTERM to the backend process; the transaction rolls back and its locks are released immediately. Use it when the blocking transaction has been idle for longer than your incident SLA, or when it is clearly stuck.
SELECT pg_terminate_backend(blocking_pid)
FROM (
SELECT DISTINCT blocking.pid AS blocking_pid
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocking.state = 'idle in transaction'
AND now() - blocking.query_start > interval '2 minutes'
) sub;
After terminating, investigate why the transaction stayed open. Idle-in-transaction connections usually point to application-side connection pool misconfiguration or missing error handling that closes transactions on exception.
Option 4 — Address bloat and autovacuum lag
When EXPLAIN shows high Buffers: shared read= values disproportionate to the query’s logical data needs, and pg_stat_user_tables shows high n_dead_tup on the relevant table, dead row versions are inflating the table and causing unnecessary disk reads.
-- Check bloat on a specific table
SELECT
n_live_tup,
n_dead_tup,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- Force vacuum manually during the incident
VACUUM (VERBOSE, ANALYZE) orders;
Standard VACUUM — as opposed to VACUUM FULL — does not block reads or writes. It reclaims dead tuple space and updates statistics. VACUUM FULL requires an exclusive lock and rewrites the table; it should not be used on production tables during an incident.
Rollback Plan
-
Created index with
CREATE INDEX CONCURRENTLY— Drop it withDROP INDEX CONCURRENTLY. The drop is also online and does not block queries. If the index was a partial index, dropping it has no data impact. -
Ran
ANALYZE— No rollback needed.ANALYZEupdates statistics only. The planner reverts to the previous plan at the next statistics collection if the table state reverts. There is no mechanism to restore old statistics directly. -
Killed a blocking transaction — The killed transaction rolls back automatically. Any work it had done is undone. Monitor
pg_stat_activityto confirm the blocked queries resume. If they do not, check for a new blocking chain. -
Ran
VACUUM— No rollback needed. Vacuum is additive: it reclaims space but does not modify live rows. Re-enable autovacuum if it was disabled during the incident.
Automation Opportunity
Two automation patterns are worth implementing before the next incident rather than after.
The first is continuous slow query capture. PostgreSQL’s auto_explain extension logs execution plans automatically when a query exceeds a duration threshold. Add these settings to postgresql.conf (or as session-level settings for testing):
-- Load the extension (requires restart or ALTER SYSTEM)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;
With auto_explain active, every query over one second logs its plan to the PostgreSQL log. Feed those logs to a log aggregator and you will have plan history before the incident rather than needing to reconstruct it after.
The second is a scheduled pg_stat_activity snapshot. Use pg_cron to capture long-running queries every minute to a local table. This gives you a timeline to review post-incident that pg_stat_statements alone cannot provide, since pg_stat_statements aggregates across time but does not record when queries were running.
-- Requires pg_cron extension
SELECT cron.schedule(
'capture-slow-queries',
'* * * * *',
$$
INSERT INTO slow_query_log (captured_at, pid, duration, state, query)
SELECT now(), pid, now() - query_start, state, query
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '10 seconds';
$$
);
Alert on this table when row counts spike: that is an early signal that something is blocking normal query throughput before the application-side p95 alert fires.
Leadership Summary
- What broke: Queries slowed because of lock contention from a long-running transaction, or because the query planner chose a sequential scan after table statistics fell out of date.
- What was done: Identified the root cause using PostgreSQL system catalog queries, terminated the blocking connection or added a missing index, and ran
ANALYZEto refresh planner statistics. - What prevents recurrence:
auto_explainnow captures slow query plans automatically; per-table autovacuum thresholds are set for high-write tables; apg_cronjob snapshots long-running queries every minute for post-incident review.
Checklist
- Pull currently running queries from
pg_stat_activity— checkwait_event_typebefore anything else - Identify any sessions with
wait_event_type = 'Lock'and trace the blocking chain - Pull top queries by
total_exec_timefrompg_stat_statements— distinguish outliers from throughput problems - Run
EXPLAIN (ANALYZE, BUFFERS)on the specific slow query — look for Seq Scan and row estimate mismatches - Check
pg_stat_user_tablesfor tables with stalelast_autoanalyzeor highn_dead_tup - If lock contention: terminate idle-in-transaction connections blocking others for more than two minutes
- If missing index: create with
CREATE INDEX CONCURRENTLY— confirm plan change withEXPLAINafterward - If stale statistics: run
ANALYZEon the affected table — always safe, non-blocking - If bloat: run
VACUUM (VERBOSE, ANALYZE)— do not useVACUUM FULLduring an incident - After resolving: lower
autovacuum_analyze_scale_factoron high-write tables to prevent recurrence - Enable
auto_explainwithlog_min_durationset to your slow query threshold - Schedule a
pg_cronjob to snapshotpg_stat_activityfor future post-incident timelines
What This Post Does Not Cover
This post covers triage of an active slow query incident. It does not cover: pg_partman partition pruning for large tables, physical replication lag as a source of slow reads on replicas, connection pooler (PgBouncer) saturation that precedes the slow query symptom, or schema migration locking analysis. Each of those is a distinct failure mode with its own triage path.
What to Do Next
- Problem: A slow query alert fires and the on-call engineer spends 30 minutes checking the wrong root cause — stale statistics were the issue, not the query they were tuning.
- Solution: Work through the five checks in order: current activity first, then historical aggregates, then lock contention, then statistics age, then the execution plan.
- Proof: Running
pg_stat_activitybefore touching anything else shows whether the incident is lock-driven within 60 seconds — that confirmation eliminates half the possible root causes immediately. - Action: Add
pg_stat_statementsandauto_explainto your PostgreSQL configuration this week; validate they are collecting data; add the five check queries to your team’s runbook.