PostgreSQL Monitoring: The Dashboard That Surfaces Problems Before Users Do
A PostgreSQL dashboard that only shows CPU and memory is a late warning system. The database tells you about problems in its own catalog — in pg_stat_activity, pg_stat_statements, pg_stat_replication, and pg_stat_bgwriter — before they surface as user-visible errors. The question is whether you’re reading those catalogs before or after the incident page fires.
Situation
Most PostgreSQL monitoring setups start with the OS metrics the infrastructure team already collects: CPU, memory, disk I/O, network. Those metrics are necessary but not sufficient. A database with 20% CPU and 60% memory can still be in deep trouble: connection pools exhausted, replica 45 minutes behind, autovacuum fighting bloat on the largest tables, and a lock chain building behind a slow migration.
The eight PostgreSQL metric groups below come from the database itself. Most can be collected by any monitoring agent — Datadog, Prometheus + postgres_exporter, CloudWatch with Enhanced Monitoring, or direct queries from a read-only monitoring role.
Symptoms
| Symptom | Likely source | First catalog to check |
|---|---|---|
| Application queries suddenly slower | Lock contention or bad plan | pg_stat_activity, pg_locks |
| Connection pool exhausted | Idle-in-transaction or max_connections hit | pg_stat_activity filtered by state |
| Replica reads returning stale data | Replication lag | pg_stat_replication |
| Table scan on a previously fast query | Bloat has made statistics stale | pg_stat_user_tables |
| Checkpoint warnings in server log | bgwriter pressure | pg_stat_bgwriter |
| Application sees deadlock errors | Write contention on hot rows | pg_locks + server log |
| Disk filling faster than expected | Orphaned temp files or unarchived WAL | pg_stat_bgwriter, WAL directory |
| OOM kill on the database server | Work_mem overrun from parallel queries | pg_stat_activity + work_mem setting |
First Five Checks
Run these in order when something is wrong. Each check requires only read access to system catalogs.
1. What are active sessions doing right now?
SELECT pid, now() - pg_stat_activity.query_start AS duration,
query, state, wait_event_type, wait_event, usename
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC
LIMIT 20;
Look for sessions in idle in transaction (holding locks while waiting on an application) or active with long durations. Any query running more than 30 seconds in OLTP deserves investigation.
2. Is anyone waiting on locks?
SELECT blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
blocking.usename AS blocking_user,
now() - blocked.query_start AS blocked_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocking_locks
ON blocking_locks.transactionid = blocked_locks.transactionid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE NOT blocked_locks.granted
ORDER BY blocked_duration DESC;
A lock chain longer than 10 seconds is a reliability event, not a monitoring blip.
3. How far behind is the replica?
-- On the primary:
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
(sent_lsn - replay_lsn) AS replication_lag_bytes,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
For seconds of lag: pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 16384 * (wal_block_size / 16384) approximates byte lag. Many monitoring agents compute this directly. Alert at 60 seconds; page at 300 seconds for read-replica-dependent applications.
4. Is autovacuum keeping up?
SELECT relname,
n_dead_tup,
n_live_tup,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;
Dead tuple ratio over 20% on a high-traffic table means autovacuum is behind. Tables not autovacuumed in 24 hours are candidates for bloat investigation.
5. What is checkpoint pressure?
SELECT checkpoints_timed, checkpoints_req,
checkpoint_write_time / 1000.0 AS write_secs,
checkpoint_sync_time / 1000.0 AS sync_secs,
buffers_checkpoint, buffers_clean, buffers_backend,
buffers_alloc,
stats_reset
FROM pg_stat_bgwriter;
checkpoints_req above zero means PostgreSQL is forcing checkpoints faster than checkpoint_completion_target can absorb. buffers_backend above zero means application processes are doing work that bgwriter should handle — a sign of write pressure.
Decision Tree
flowchart TD
A[Symptom observed] --> B{Active sessions check}
B -->|Long-running active queries| C[Check pg_stat_statements — plan regression or new query?]
B -->|Idle in transaction sessions| D[Find the application holding transactions open]
B -->|Lock waits| E[Kill blocking session or escalate to application team]
B -->|All looks normal| F{Check replication}
F -->|Replica lag above threshold| G[Identify write pressure source — high-volume writes or bloated WAL archiving?]
F -->|Lag acceptable| H{Check autovacuum}
H -->|Dead tuples high| I[Manual VACUUM on table or increase autovacuum_vacuum_scale_factor]
H -->|Autovacuum absent| J[Check autovacuum_max_workers and pg_stat_activity for autovacuum processes]
H -->|No autovacuum issues| K{Check checkpoint pressure}
K -->|checkpoints_req high| L[Increase max_wal_size or spread write workload]
K -->|buffers_backend high| M[Tune bgwriter_lru_maxpages or review write amplification]
Remediation Options
| Problem | Immediate action | Durable fix |
|---|---|---|
| Long-running idle-in-transaction | SELECT pg_terminate_backend(pid) on sessions over threshold | Set idle_in_transaction_session_timeout on the application role |
| Lock chain | Identify and terminate the root blocking session | Fix the application transaction that holds locks across slow external calls |
| Replica lag | Check for write burst or long transaction on primary | Add streaming replication slot monitoring; tune wal_level and replica apply workers |
| High dead tuples | VACUUM (VERBOSE) tablename; directly | Lower autovacuum_vacuum_scale_factor for high-traffic tables; increase autovacuum_max_workers |
| Checkpoint pressure | Increase max_wal_size (default 1GB, common to set 4–16GB) | Review write amplification from bulk loads; separate OLAP workloads to replicas |
| Cache hit ratio below 95% | Review shared_buffers sizing (target 25% of RAM, not more) | Identify tables with sequential scans using pg_statio_user_tables |
Automation Opportunity
Three PostgreSQL checks can be automated into a runbook trigger:
-
Idle-in-transaction watchdog: query
pg_stat_activityevery 60 seconds; alert if any session has beenidle in transactionfor more than 5 minutes. Auto-terminate sessions over 30 minutes with a logged record. -
Replica lag SLO: collect
pg_stat_replication.replay_lagas a gauge metric; alert at 60s, page at 5 minutes, trigger write traffic rerouting away from reader endpoint at 10 minutes. -
Autovacuum health check: daily scheduled query against
pg_stat_user_tables; flag tables wherelast_autovacuumis null or more than 48 hours old ANDn_live_tup > 100000. Output as a structured JSON payload to the operations channel.
Leadership Summary
PostgreSQL health is not visible in CPU and memory alone. The database catalogs tell you about lock chains, replica lag, bloat accumulation, and checkpoint pressure — all of which affect user-visible latency before CPU crosses 80%. The metrics above require a read-only monitoring role and a scrape interval of 60 seconds or less. The most common monitoring gap in PostgreSQL deployments is not the absence of metrics but the absence of thresholds: teams collect data without defining what “bad” looks like until they are in an incident trying to find historical baselines.
Where It Breaks
| Failure mode | Why it happens | Fix |
|---|---|---|
| Alert on every autovacuum completion | autovacuum runs are logged as activity; thresholds not tuned to table size | Alert on dead tuple ratio, not autovacuum frequency |
| Lock alert fires during schema migration | Intentional DDL lock causes alert storm | Suppress lock alerts during maintenance windows; use lock_timeout on migrations |
| Replica lag alert on writes | Single large write causes temporary lag; recovers in seconds | Use 60-second averages, not point-in-time values |
pg_stat_statements not populated | pg_stat_statements not in shared_preload_libraries | Add to shared_preload_libraries, restart, set track_activity_query_size |
| Monitoring role missing | Agent lacks read access to catalogs | Create a dedicated monitoring role with pg_monitor system role (PG 10+) |
| Timestamp drift on replicas | Lag reported in bytes, not seconds | Use replay_lag column directly (PG 10+) or compute from LSN difference |
What This Post Does Not Cover
This post covers catalog-level PostgreSQL monitoring from inside the database. It does not cover: Prometheus exporter configuration and recording rules (covered in the Prometheus and Grafana post in this series), CloudWatch Enhanced Monitoring for RDS/Aurora, PgBouncer pool metrics, or logical replication slot lag as a distinct monitoring dimension. Each of those has a dedicated post in this series.
What to Do Next
- Problem: PostgreSQL is reporting problems through its catalogs, but your dashboard only shows OS-level metrics.
- Solution: Add the eight metric groups above to your monitoring stack using
pg_monitorrole and a 60-second scrape interval. - Proof: Run the five checks above against your production instance right now and note whether any sessions are idle-in-transaction, whether replicas are within SLO, and whether any table has a dead tuple ratio above 10%.
- Action: This week, create a
monitoringrole withGRANT pg_monitor TO monitoring, add it to your Datadog, Prometheus, or CloudWatch configuration, and set a replica lag alert with a 60-second threshold.