A PostgreSQL dashboard that only shows CPU and memory is a late warning system. The database tells you about problems in its own catalog — in pg_stat_activity, pg_stat_statements, pg_stat_replication, and pg_stat_bgwriter — before they surface as user-visible errors. The question is whether you’re reading those catalogs before or after the incident page fires.

Situation

Most PostgreSQL monitoring setups start with the OS metrics the infrastructure team already collects: CPU, memory, disk I/O, network. Those metrics are necessary but not sufficient. A database with 20% CPU and 60% memory can still be in deep trouble: connection pools exhausted, replica 45 minutes behind, autovacuum fighting bloat on the largest tables, and a lock chain building behind a slow migration.

The eight PostgreSQL metric groups below come from the database itself. Most can be collected by any monitoring agent — Datadog, Prometheus + postgres_exporter, CloudWatch with Enhanced Monitoring, or direct queries from a read-only monitoring role.

Symptoms

SymptomLikely sourceFirst catalog to check
Application queries suddenly slowerLock contention or bad planpg_stat_activity, pg_locks
Connection pool exhaustedIdle-in-transaction or max_connections hitpg_stat_activity filtered by state
Replica reads returning stale dataReplication lagpg_stat_replication
Table scan on a previously fast queryBloat has made statistics stalepg_stat_user_tables
Checkpoint warnings in server logbgwriter pressurepg_stat_bgwriter
Application sees deadlock errorsWrite contention on hot rowspg_locks + server log
Disk filling faster than expectedOrphaned temp files or unarchived WALpg_stat_bgwriter, WAL directory
OOM kill on the database serverWork_mem overrun from parallel queriespg_stat_activity + work_mem setting

First Five Checks

Run these in order when something is wrong. Each check requires only read access to system catalogs.

1. What are active sessions doing right now?

SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state, wait_event_type, wait_event, usename
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC
LIMIT 20;

Look for sessions in idle in transaction (holding locks while waiting on an application) or active with long durations. Any query running more than 30 seconds in OLTP deserves investigation.

2. Is anyone waiting on locks?

SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query,
       blocking.usename AS blocking_user,
       now() - blocked.query_start AS blocked_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocking_locks
     ON blocking_locks.transactionid = blocked_locks.transactionid
     AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted
ORDER BY blocked_duration DESC;

A lock chain longer than 10 seconds is a reliability event, not a monitoring blip.

3. How far behind is the replica?

-- On the primary:
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS replication_lag_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

For seconds of lag: pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 16384 * (wal_block_size / 16384) approximates byte lag. Many monitoring agents compute this directly. Alert at 60 seconds; page at 300 seconds for read-replica-dependent applications.

4. Is autovacuum keeping up?

SELECT relname,
       n_dead_tup,
       n_live_tup,
       ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

Dead tuple ratio over 20% on a high-traffic table means autovacuum is behind. Tables not autovacuumed in 24 hours are candidates for bloat investigation.

5. What is checkpoint pressure?

SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time / 1000.0 AS write_secs,
       checkpoint_sync_time / 1000.0 AS sync_secs,
       buffers_checkpoint, buffers_clean, buffers_backend,
       buffers_alloc,
       stats_reset
FROM pg_stat_bgwriter;

checkpoints_req above zero means PostgreSQL is forcing checkpoints faster than checkpoint_completion_target can absorb. buffers_backend above zero means application processes are doing work that bgwriter should handle — a sign of write pressure.

Decision Tree

flowchart TD
    A[Symptom observed] --> B{Active sessions check}
    B -->|Long-running active queries| C[Check pg_stat_statements — plan regression or new query?]
    B -->|Idle in transaction sessions| D[Find the application holding transactions open]
    B -->|Lock waits| E[Kill blocking session or escalate to application team]
    B -->|All looks normal| F{Check replication}
    F -->|Replica lag above threshold| G[Identify write pressure source — high-volume writes or bloated WAL archiving?]
    F -->|Lag acceptable| H{Check autovacuum}
    H -->|Dead tuples high| I[Manual VACUUM on table or increase autovacuum_vacuum_scale_factor]
    H -->|Autovacuum absent| J[Check autovacuum_max_workers and pg_stat_activity for autovacuum processes]
    H -->|No autovacuum issues| K{Check checkpoint pressure}
    K -->|checkpoints_req high| L[Increase max_wal_size or spread write workload]
    K -->|buffers_backend high| M[Tune bgwriter_lru_maxpages or review write amplification]

Remediation Options

ProblemImmediate actionDurable fix
Long-running idle-in-transactionSELECT pg_terminate_backend(pid) on sessions over thresholdSet idle_in_transaction_session_timeout on the application role
Lock chainIdentify and terminate the root blocking sessionFix the application transaction that holds locks across slow external calls
Replica lagCheck for write burst or long transaction on primaryAdd streaming replication slot monitoring; tune wal_level and replica apply workers
High dead tuplesVACUUM (VERBOSE) tablename; directlyLower autovacuum_vacuum_scale_factor for high-traffic tables; increase autovacuum_max_workers
Checkpoint pressureIncrease max_wal_size (default 1GB, common to set 4–16GB)Review write amplification from bulk loads; separate OLAP workloads to replicas
Cache hit ratio below 95%Review shared_buffers sizing (target 25% of RAM, not more)Identify tables with sequential scans using pg_statio_user_tables

Automation Opportunity

Three PostgreSQL checks can be automated into a runbook trigger:

  1. Idle-in-transaction watchdog: query pg_stat_activity every 60 seconds; alert if any session has been idle in transaction for more than 5 minutes. Auto-terminate sessions over 30 minutes with a logged record.

  2. Replica lag SLO: collect pg_stat_replication.replay_lag as a gauge metric; alert at 60s, page at 5 minutes, trigger write traffic rerouting away from reader endpoint at 10 minutes.

  3. Autovacuum health check: daily scheduled query against pg_stat_user_tables; flag tables where last_autovacuum is null or more than 48 hours old AND n_live_tup > 100000. Output as a structured JSON payload to the operations channel.

Leadership Summary

PostgreSQL health is not visible in CPU and memory alone. The database catalogs tell you about lock chains, replica lag, bloat accumulation, and checkpoint pressure — all of which affect user-visible latency before CPU crosses 80%. The metrics above require a read-only monitoring role and a scrape interval of 60 seconds or less. The most common monitoring gap in PostgreSQL deployments is not the absence of metrics but the absence of thresholds: teams collect data without defining what “bad” looks like until they are in an incident trying to find historical baselines.

Where It Breaks

Failure modeWhy it happensFix
Alert on every autovacuum completionautovacuum runs are logged as activity; thresholds not tuned to table sizeAlert on dead tuple ratio, not autovacuum frequency
Lock alert fires during schema migrationIntentional DDL lock causes alert stormSuppress lock alerts during maintenance windows; use lock_timeout on migrations
Replica lag alert on writesSingle large write causes temporary lag; recovers in secondsUse 60-second averages, not point-in-time values
pg_stat_statements not populatedpg_stat_statements not in shared_preload_librariesAdd to shared_preload_libraries, restart, set track_activity_query_size
Monitoring role missingAgent lacks read access to catalogsCreate a dedicated monitoring role with pg_monitor system role (PG 10+)
Timestamp drift on replicasLag reported in bytes, not secondsUse replay_lag column directly (PG 10+) or compute from LSN difference

What This Post Does Not Cover

This post covers catalog-level PostgreSQL monitoring from inside the database. It does not cover: Prometheus exporter configuration and recording rules (covered in the Prometheus and Grafana post in this series), CloudWatch Enhanced Monitoring for RDS/Aurora, PgBouncer pool metrics, or logical replication slot lag as a distinct monitoring dimension. Each of those has a dedicated post in this series.

What to Do Next

  • Problem: PostgreSQL is reporting problems through its catalogs, but your dashboard only shows OS-level metrics.
  • Solution: Add the eight metric groups above to your monitoring stack using pg_monitor role and a 60-second scrape interval.
  • Proof: Run the five checks above against your production instance right now and note whether any sessions are idle-in-transaction, whether replicas are within SLO, and whether any table has a dead tuple ratio above 10%.
  • Action: This week, create a monitoring role with GRANT pg_monitor TO monitoring, add it to your Datadog, Prometheus, or CloudWatch configuration, and set a replica lag alert with a 60-second threshold.