PostgreSQL Monitoring: The Dashboard That Surfaces Problems Before Users Do

A PostgreSQL dashboard that only shows CPU and memory is a late warning system. The database tells you about problems in its own catalog — in pg_stat_activity, pg_stat_statements, pg_stat_replication, and pg_stat_bgwriter — before they surface as user-visible errors. The question is whether you’re reading those catalogs before or after the incident page fires.

Situation

Most PostgreSQL monitoring setups start with the OS metrics the infrastructure team already collects: CPU, memory, disk I/O, network. Those metrics are necessary but not sufficient. A database with 20% CPU and 60% memory can still be in deep trouble: connection pools exhausted, replica 45 minutes behind, autovacuum fighting bloat on the largest tables, and a lock chain building behind a slow migration.

The eight PostgreSQL metric groups below come from the database itself. Most can be collected by any monitoring agent — Datadog, Prometheus + postgres_exporter, CloudWatch with Enhanced Monitoring, or direct queries from a read-only monitoring role.

Symptoms

Symptom	Likely source	First catalog to check
Application queries suddenly slower	Lock contention or bad plan	`pg_stat_activity`, `pg_locks`
Connection pool exhausted	Idle-in-transaction or max_connections hit	`pg_stat_activity` filtered by state
Replica reads returning stale data	Replication lag	`pg_stat_replication`
Table scan on a previously fast query	Bloat has made statistics stale	`pg_stat_user_tables`
Checkpoint warnings in server log	bgwriter pressure	`pg_stat_bgwriter`
Application sees deadlock errors	Write contention on hot rows	`pg_locks` + server log
Disk filling faster than expected	Orphaned temp files or unarchived WAL	`pg_stat_bgwriter`, WAL directory
OOM kill on the database server	Work_mem overrun from parallel queries	`pg_stat_activity` + `work_mem` setting

First Five Checks

Run these in order when something is wrong. Each check requires only read access to system catalogs.

1. What are active sessions doing right now?

SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state, wait_event_type, wait_event, usename
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC
LIMIT 20;

Look for sessions in idle in transaction (holding locks while waiting on an application) or active with long durations. Any query running more than 30 seconds in OLTP deserves investigation.

2. Is anyone waiting on locks?

SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query,
       blocking.usename AS blocking_user,
       now() - blocked.query_start AS blocked_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocking_locks
     ON blocking_locks.transactionid = blocked_locks.transactionid
     AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted
ORDER BY blocked_duration DESC;

A lock chain longer than 10 seconds is a reliability event, not a monitoring blip.

3. How far behind is the replica?

-- On the primary:
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS replication_lag_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

For seconds of lag: pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 16384 * (wal_block_size / 16384) approximates byte lag. Many monitoring agents compute this directly. Alert at 60 seconds; page at 300 seconds for read-replica-dependent applications.

4. Is autovacuum keeping up?

SELECT relname,
       n_dead_tup,
       n_live_tup,
       ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

Dead tuple ratio over 20% on a high-traffic table means autovacuum is behind. Tables not autovacuumed in 24 hours are candidates for bloat investigation.

5. What is checkpoint pressure?

SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time / 1000.0 AS write_secs,
       checkpoint_sync_time / 1000.0 AS sync_secs,
       buffers_checkpoint, buffers_clean, buffers_backend,
       buffers_alloc,
       stats_reset
FROM pg_stat_bgwriter;

checkpoints_req above zero means PostgreSQL is forcing checkpoints faster than checkpoint_completion_target can absorb. buffers_backend above zero means application processes are doing work that bgwriter should handle — a sign of write pressure.

Decision Tree

flowchart TD
    A[Symptom observed] --> B{Active sessions check}
    B -->|Long-running active queries| C[Check pg_stat_statements — plan regression or new query?]
    B -->|Idle in transaction sessions| D[Find the application holding transactions open]
    B -->|Lock waits| E[Kill blocking session or escalate to application team]
    B -->|All looks normal| F{Check replication}
    F -->|Replica lag above threshold| G[Identify write pressure source — high-volume writes or bloated WAL archiving?]
    F -->|Lag acceptable| H{Check autovacuum}
    H -->|Dead tuples high| I[Manual VACUUM on table or increase autovacuum_vacuum_scale_factor]
    H -->|Autovacuum absent| J[Check autovacuum_max_workers and pg_stat_activity for autovacuum processes]
    H -->|No autovacuum issues| K{Check checkpoint pressure}
    K -->|checkpoints_req high| L[Increase max_wal_size or spread write workload]
    K -->|buffers_backend high| M[Tune bgwriter_lru_maxpages or review write amplification]

Remediation Options

Problem	Immediate action	Durable fix
Long-running idle-in-transaction	`SELECT pg_terminate_backend(pid)` on sessions over threshold	Set `idle_in_transaction_session_timeout` on the application role
Lock chain	Identify and terminate the root blocking session	Fix the application transaction that holds locks across slow external calls
Replica lag	Check for write burst or long transaction on primary	Add streaming replication slot monitoring; tune `wal_level` and replica apply workers
High dead tuples	`VACUUM (VERBOSE) tablename;` directly	Lower `autovacuum_vacuum_scale_factor` for high-traffic tables; increase `autovacuum_max_workers`
Checkpoint pressure	Increase `max_wal_size` (default 1GB, common to set 4–16GB)	Review write amplification from bulk loads; separate OLAP workloads to replicas
Cache hit ratio below 95%	Review `shared_buffers` sizing (target 25% of RAM, not more)	Identify tables with sequential scans using `pg_statio_user_tables`

Automation Opportunity

Three PostgreSQL checks can be automated into a runbook trigger:

Idle-in-transaction watchdog: query pg_stat_activity every 60 seconds; alert if any session has been idle in transaction for more than 5 minutes. Auto-terminate sessions over 30 minutes with a logged record.
Replica lag SLO: collect pg_stat_replication.replay_lag as a gauge metric; alert at 60s, page at 5 minutes, trigger write traffic rerouting away from reader endpoint at 10 minutes.
Autovacuum health check: daily scheduled query against pg_stat_user_tables; flag tables where last_autovacuum is null or more than 48 hours old AND n_live_tup > 100000. Output as a structured JSON payload to the operations channel.

Leadership Summary

PostgreSQL health is not visible in CPU and memory alone. The database catalogs tell you about lock chains, replica lag, bloat accumulation, and checkpoint pressure — all of which affect user-visible latency before CPU crosses 80%. The metrics above require a read-only monitoring role and a scrape interval of 60 seconds or less. The most common monitoring gap in PostgreSQL deployments is not the absence of metrics but the absence of thresholds: teams collect data without defining what “bad” looks like until they are in an incident trying to find historical baselines.

Where It Breaks

Failure mode	Why it happens	Fix
Alert on every autovacuum completion	autovacuum runs are logged as activity; thresholds not tuned to table size	Alert on dead tuple ratio, not autovacuum frequency
Lock alert fires during schema migration	Intentional DDL lock causes alert storm	Suppress lock alerts during maintenance windows; use `lock_timeout` on migrations
Replica lag alert on writes	Single large write causes temporary lag; recovers in seconds	Use 60-second averages, not point-in-time values
`pg_stat_statements` not populated	`pg_stat_statements` not in `shared_preload_libraries`	Add to `shared_preload_libraries`, restart, set `track_activity_query_size`
Monitoring role missing	Agent lacks read access to catalogs	Create a dedicated `monitoring` role with `pg_monitor` system role (PG 10+)
Timestamp drift on replicas	Lag reported in bytes, not seconds	Use `replay_lag` column directly (PG 10+) or compute from LSN difference

What This Post Does Not Cover

This post covers catalog-level PostgreSQL monitoring from inside the database. It does not cover: Prometheus exporter configuration and recording rules (covered in the Prometheus and Grafana post in this series), CloudWatch Enhanced Monitoring for RDS/Aurora, PgBouncer pool metrics, or logical replication slot lag as a distinct monitoring dimension. Each of those has a dedicated post in this series.

What to Do Next

Problem: PostgreSQL is reporting problems through its catalogs, but your dashboard only shows OS-level metrics.
Solution: Add the eight metric groups above to your monitoring stack using pg_monitor role and a 60-second scrape interval.
Proof: Run the five checks above against your production instance right now and note whether any sessions are idle-in-transaction, whether replicas are within SLO, and whether any table has a dead tuple ratio above 10%.
Action: This week, create a monitoring role with GRANT pg_monitor TO monitoring, add it to your Datadog, Prometheus, or CloudWatch configuration, and set a replica lag alert with a 60-second threshold.

Situation

Symptoms

First Five Checks

Decision Tree

Remediation Options

Automation Opportunity

Leadership Summary

Where It Breaks

What This Post Does Not Cover

What to Do Next

Rajiv

Related Posts

Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup

Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup

Database Alert Design: Thresholds That Fire on Real Problems