When p95 latency spikes and the on-call alert fires, most engineers open the slow query log and immediately jump to the biggest query by average execution time. That is the wrong move. The query that shows up longest in pg_stat_statements is often not the query that caused the spike — it is the query that was already slow. The blocking transaction, the missing index on a newly-deployed code path, or autovacuum being interrupted mid-table are the usual culprits. This runbook gives you the order to check that actually closes incidents.

Situation

A p95 latency spike lands in monitoring. The graphs show it clearly: something changed in the last five to fifteen minutes. The application is returning slow responses. Your first instinct is to check the dashboard, which shows elevated CPU and read latency on the database host. pg_stat_activity has more active connections than usual. The alert threshold on slow queries crossed.

At this point, engineers split into two groups. The first opens the slow query log, picks the worst query, and starts trying to add an index or rewrite the SQL. The second checks what PostgreSQL is actually doing right now — what is blocked, what is waiting, and what happened to statistics or autovacuum in the last hour. The second group resolves the incident faster because they are reading system state rather than historical averages.

The problem with jumping straight to the slow query log is that pg_stat_statements accumulates over time. A query that has always been slow will look exactly like a query that just started being slow because of a table scan it previously avoided. You need the current state first, then the cumulative data as context.

PostgreSQL exposes the information you need through its system catalog views. The triage workflow below uses five queries — in order — to eliminate root causes before you start making changes.

Symptoms

SignalWhere to see itWhat it means
Active query count above baselinepg_stat_activity, CloudWatch connections metricConnection pressure or query backup — check for lock waits first
Queries appearing in slow query log with new query shapespg_stat_statements, auto_explain log outputNew code path or table growth crossed a plan-change threshold
Sequential scan on a large table in explain outputEXPLAIN (ANALYZE, BUFFERS) outputMissing index or statistics too stale to use an existing one
waiting column true for multiple queriespg_stat_activityLock contention — one transaction is blocking others
High read I/O on the database hostCloudWatch read latency, Datadog disk metricsTable or index bloat forcing extra page reads; autovacuum may be behind
last_autoanalyze timestamp hours or days old on active tablepg_stat_user_tablesStale statistics — planner is working from outdated row estimates

First Five Checks

  1. Find currently running slow queries — This is always first. Before looking at anything historical, see what PostgreSQL is doing right now. Queries held open for more than five seconds are either blocked, doing real work, or stuck. The state column tells you whether they are actively executing or waiting.
SELECT
  pid,
  now() - query_start AS duration,
  state,
  wait_event_type,
  wait_event,
  query
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;

Look at wait_event_type. If it reads Lock, you have a lock contention issue. If it reads IO, the query is waiting on disk. If it is null, the query is actively executing — check the plan next.

  1. Find top queries by cumulative execution time — Once you know what is running now, pull the historical picture from pg_stat_statements. This extension is documented in the PostgreSQL pg_stat_statements module reference and accumulates statistics since the last reset. Sort by total_exec_time to find queries that are expensive in aggregate, not just occasionally slow.
SELECT
  query,
  calls,
  total_exec_time / calls AS avg_ms,
  total_exec_time,
  rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

A query with high avg_ms but low calls is an outlier. A query with moderate avg_ms but millions of calls is a throughput problem. Both need attention, but the right fix differs.

  1. Check for lock waits — If check 1 showed any wait_event_type = 'Lock' rows, this query identifies the full blocking chain. pg_blocking_pids() is a PostgreSQL built-in that returns the PIDs of sessions blocking a given session.
SELECT
  blocked.pid,
  blocked.query,
  blocked.wait_event_type,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query,
  blocking.state AS blocking_state,
  now() - blocking.query_start AS blocking_duration
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

The blocking_query column often reveals the transaction holding the lock. An idle-in-transaction connection is a common culprit: a transaction that opened, ran one query, and then paused while the application did something else — holding its lock the whole time.

  1. Check table statistics age — If lock waits are not the issue, check whether the planner is working from stale statistics. PostgreSQL uses statistics collected by ANALYZE to estimate row counts and choose access paths. When statistics fall behind the actual table state — after a large data load, a batch delete, or a period when autovacuum was interrupted — the planner can choose a sequential scan where an index would be far faster.
SELECT
  schemaname,
  tablename,
  last_analyze,
  last_autoanalyze,
  n_live_tup,
  n_dead_tup,
  n_dead_tup::float / NULLIF(n_live_tup + n_dead_tup, 0) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

A table with a last_autoanalyze timestamp more than a few hours old on a high-write workload, or a dead_ratio above 10–20%, is a candidate. The autovacuum capacity implications of this pattern are covered in depth in Autovacuum Is a Capacity Problem, Not a Maintenance Task.

  1. Get EXPLAIN ANALYZE for the slow query — Once you have identified the specific query from checks 1 or 2, pull the execution plan with buffer statistics. BUFFERS output shows how many shared buffer hits versus disk reads the query required, which distinguishes a missing index (high shared hits, no index scan) from an I/O problem (high disk reads).
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
<paste slow query here>;

Look for: Seq Scan on a table with high rows= estimates, rows=1 estimates on nodes where the actual rows are in the thousands (stale statistics), and Buffers: shared read= values that are high relative to table size.

Decision Tree

flowchart TD
    A[Slow query alert fires] --> B{pg_stat_activity — queries waiting on Lock?}
    B -->|yes| C[Check blocking chain — kill or wait out blocker]
    B -->|no| D{EXPLAIN shows Seq Scan on large table?}
    D -->|yes| E{Index exists for this predicate?}
    E -->|no| F[Add index with CREATE INDEX CONCURRENTLY]
    E -->|yes| G{Statistics stale — last_autoanalyze old?}
    G -->|yes| H[Run ANALYZE on table — recheck plan]
    G -->|no| I{High Buffers: shared read in EXPLAIN?}
    I -->|yes| J[Check table bloat and autovacuum lag]
    I -->|no| K{Connection count near pool limit?}
    K -->|yes| L[Check pool settings and idle-in-transaction connections]
    K -->|no| M[Profile query logic — may be algorithmic]

What this diagram shows: A decision tree for PostgreSQL slow query triage — starting with active lock waits, then sequential scans on large tables, missing indexes, stale statistics (last_autoanalyze), high shared buffer reads indicating bloat, and connection pool saturation — in the order that eliminates the most common root causes first.

Remediation Options

Option 1 — Add a missing index

When EXPLAIN shows a sequential scan on a large table and no index covers the query predicate, create one online. CREATE INDEX CONCURRENTLY builds the index without blocking reads or writes. It takes longer than a standard index build, and it can fail if the transaction load is very high, but it is the safe choice for production.

CREATE INDEX CONCURRENTLY idx_orders_customer_created
  ON orders (customer_id, created_at)
  WHERE status != 'cancelled';

Partial indexes (the WHERE clause above) reduce size and improve selectivity when the query always filters on a stable condition. After creation, run EXPLAIN again to confirm the planner picks up the new index. If it does not, check that the statistics are current — ANALYZE orders; and re-examine.

Option 2 — Refresh stale statistics

When EXPLAIN shows row estimates that are far off from actual rows — typically rows=1 or a small number where the actual is thousands — and pg_stat_user_tables shows a stale last_autoanalyze, run ANALYZE manually.

ANALYZE VERBOSE orders;

ANALYZE is always safe. It takes a SHARE UPDATE EXCLUSIVE lock, which does not block reads or writes. It completes quickly on most tables. After it finishes, run EXPLAIN again. If the plan does not change, the statistics were not the issue — move to the next check.

If autovacuum is consistently falling behind on this table, the default autovacuum_analyze_scale_factor of 20% is too coarse for large or frequently-modified tables. Lower it per-table:

ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);

Option 3 — Resolve lock contention

When the blocking chain query from check 3 shows a long-running transaction holding a lock that others are waiting on, you have two choices: wait for it to finish, or terminate it.

Terminate with care. pg_terminate_backend() sends SIGTERM to the backend process; the transaction rolls back and its locks are released immediately. Use it when the blocking transaction has been idle for longer than your incident SLA, or when it is clearly stuck.

SELECT pg_terminate_backend(blocking_pid)
FROM (
  SELECT DISTINCT blocking.pid AS blocking_pid
  FROM pg_stat_activity blocked
  JOIN pg_stat_activity blocking
    ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
  WHERE blocking.state = 'idle in transaction'
    AND now() - blocking.query_start > interval '2 minutes'
) sub;

After terminating, investigate why the transaction stayed open. Idle-in-transaction connections usually point to application-side connection pool misconfiguration or missing error handling that closes transactions on exception.

Option 4 — Address bloat and autovacuum lag

When EXPLAIN shows high Buffers: shared read= values disproportionate to the query’s logical data needs, and pg_stat_user_tables shows high n_dead_tup on the relevant table, dead row versions are inflating the table and causing unnecessary disk reads.

-- Check bloat on a specific table
SELECT
  n_live_tup,
  n_dead_tup,
  last_autovacuum,
  last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- Force vacuum manually during the incident
VACUUM (VERBOSE, ANALYZE) orders;

Standard VACUUM — as opposed to VACUUM FULL — does not block reads or writes. It reclaims dead tuple space and updates statistics. VACUUM FULL requires an exclusive lock and rewrites the table; it should not be used on production tables during an incident.

Rollback Plan

  1. Created index with CREATE INDEX CONCURRENTLY — Drop it with DROP INDEX CONCURRENTLY. The drop is also online and does not block queries. If the index was a partial index, dropping it has no data impact.

  2. Ran ANALYZE — No rollback needed. ANALYZE updates statistics only. The planner reverts to the previous plan at the next statistics collection if the table state reverts. There is no mechanism to restore old statistics directly.

  3. Killed a blocking transaction — The killed transaction rolls back automatically. Any work it had done is undone. Monitor pg_stat_activity to confirm the blocked queries resume. If they do not, check for a new blocking chain.

  4. Ran VACUUM — No rollback needed. Vacuum is additive: it reclaims space but does not modify live rows. Re-enable autovacuum if it was disabled during the incident.

Automation Opportunity

Two automation patterns are worth implementing before the next incident rather than after.

The first is continuous slow query capture. PostgreSQL’s auto_explain extension logs execution plans automatically when a query exceeds a duration threshold. Add these settings to postgresql.conf (or as session-level settings for testing):

-- Load the extension (requires restart or ALTER SYSTEM)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze = true;
SET auto_explain.log_buffers = true;

With auto_explain active, every query over one second logs its plan to the PostgreSQL log. Feed those logs to a log aggregator and you will have plan history before the incident rather than needing to reconstruct it after.

The second is a scheduled pg_stat_activity snapshot. Use pg_cron to capture long-running queries every minute to a local table. This gives you a timeline to review post-incident that pg_stat_statements alone cannot provide, since pg_stat_statements aggregates across time but does not record when queries were running.

-- Requires pg_cron extension
SELECT cron.schedule(
  'capture-slow-queries',
  '* * * * *',
  $$
    INSERT INTO slow_query_log (captured_at, pid, duration, state, query)
    SELECT now(), pid, now() - query_start, state, query
    FROM pg_stat_activity
    WHERE state != 'idle'
      AND query_start < now() - interval '10 seconds';
  $$
);

Alert on this table when row counts spike: that is an early signal that something is blocking normal query throughput before the application-side p95 alert fires.

Leadership Summary

  • What broke: Queries slowed because of lock contention from a long-running transaction, or because the query planner chose a sequential scan after table statistics fell out of date.
  • What was done: Identified the root cause using PostgreSQL system catalog queries, terminated the blocking connection or added a missing index, and ran ANALYZE to refresh planner statistics.
  • What prevents recurrence: auto_explain now captures slow query plans automatically; per-table autovacuum thresholds are set for high-write tables; a pg_cron job snapshots long-running queries every minute for post-incident review.

Checklist

  1. Pull currently running queries from pg_stat_activity — check wait_event_type before anything else
  2. Identify any sessions with wait_event_type = 'Lock' and trace the blocking chain
  3. Pull top queries by total_exec_time from pg_stat_statements — distinguish outliers from throughput problems
  4. Run EXPLAIN (ANALYZE, BUFFERS) on the specific slow query — look for Seq Scan and row estimate mismatches
  5. Check pg_stat_user_tables for tables with stale last_autoanalyze or high n_dead_tup
  6. If lock contention: terminate idle-in-transaction connections blocking others for more than two minutes
  7. If missing index: create with CREATE INDEX CONCURRENTLY — confirm plan change with EXPLAIN afterward
  8. If stale statistics: run ANALYZE on the affected table — always safe, non-blocking
  9. If bloat: run VACUUM (VERBOSE, ANALYZE) — do not use VACUUM FULL during an incident
  10. After resolving: lower autovacuum_analyze_scale_factor on high-write tables to prevent recurrence
  11. Enable auto_explain with log_min_duration set to your slow query threshold
  12. Schedule a pg_cron job to snapshot pg_stat_activity for future post-incident timelines

What This Post Does Not Cover

This post covers triage of an active slow query incident. It does not cover: pg_partman partition pruning for large tables, physical replication lag as a source of slow reads on replicas, connection pooler (PgBouncer) saturation that precedes the slow query symptom, or schema migration locking analysis. Each of those is a distinct failure mode with its own triage path.

What to Do Next

  • Problem: A slow query alert fires and the on-call engineer spends 30 minutes checking the wrong root cause — stale statistics were the issue, not the query they were tuning.
  • Solution: Work through the five checks in order: current activity first, then historical aggregates, then lock contention, then statistics age, then the execution plan.
  • Proof: Running pg_stat_activity before touching anything else shows whether the incident is lock-driven within 60 seconds — that confirmation eliminates half the possible root causes immediately.
  • Action: Add pg_stat_statements and auto_explain to your PostgreSQL configuration this week; validate they are collecting data; add the five check queries to your team’s runbook.