The Database Observability Baseline: What Every DBA Dashboard Must Show
If your primary database monitoring signal is a CPU spike, your telemetry is designed to tell you when the application is already broken, rather than telling you why the database is about to break.
Situation
Most engineering teams rely on default cloud dashboards that prioritize host-level metrics: CPU utilization, memory consumption, and disk I/O. While these metrics matter for capacity planning, they are lag indicators for database health. A CPU spike is the result of a problem—a bad query plan, a missing index, or a connection storm—not the problem itself.
As teams move toward automated operations and AI-assisted triage, the agentic systems investigating incidents need granular telemetry. You cannot build a reliable AI SRE if the only context it receives is “CPU is at 99%.” The foundation of database observability must shift from host-level symptoms to engine-level state.
The Problem
When a database fails, it usually does so in one of three ways: it runs out of connections, it gets blocked by a lock, or it falls behind on maintenance tasks (like replication or vacuuming) until performance collapses.
Default dashboards rarely surface these states clearly. Engineers spend critical incident minutes running ad-hoc SQL queries to figure out what is currently executing, who is blocking whom, and whether the connection pool is saturated. If your observability strategy relies on engineers SSH-ing into a bastion or running pg_stat_activity manually during an outage, your time-to-mitigation will never improve.
The Saturation and Contention Baseline
Every database dashboard must surface three categories of engine-level telemetry:
- Saturation Metrics: Active connections vs. maximum allowed, thread pool utilization, and cache hit ratios. You must know if the database is refusing work.
- Contention Metrics: Row locks, table locks, and wait events. In PostgreSQL, this means tracking
wait_event_type. In MySQL, it means watching InnoDB row lock waits. - Lag Metrics: Replication lag (in bytes and seconds) and maintenance lag (e.g., autovacuum backlog, compaction queue depth).
A baseline SQL query for PostgreSQL contention that should be converted into a constant metric looks like this:
SELECT
wait_event_type,
wait_event,
count(*) as waiting_sessions
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY waiting_sessions DESC;
If your dashboard shows a spike in Lock wait events alongside a drop in cache hit ratio, you immediately know you have a query contention issue, saving 15 minutes of triage.
In Practice
The documented pattern for robust observability involves turning engine-state queries into time-series data.
Context: PostgreSQL’s lock architecture means that sessions waiting for a lock consume zero CPU — a blocked process is simply parked, not working. This makes host-level monitoring blind to lock-induced latency. The PostgreSQL documentation describes pg_stat_activity.wait_event_type as the authoritative source for what a session is waiting on, with Lock as the wait event type for sessions blocked behind another session’s hold (PostgreSQL docs: pg_stat_activity).
Action: The documented operational pattern is to export pg_stat_activity wait event counts as a time-series metric polled every 10–15 seconds, so that lock contention spikes appear on dashboards alongside — and often well ahead of — latency metrics.
Result: This approach surfaces AccessExclusiveLock spikes from DDL operations — TRUNCATE, VACUUM FULL, schema migrations — that block all concurrent readers without generating any CPU activity on the database host.
Learning: PostgreSQL lock waits are invisible to infrastructure monitoring. The only signal is in the engine itself: wait_event_type = 'Lock' in pg_stat_activity is the diagnostic that turns a “CPU looks fine, why is the app slow?” incident into a sub-minute diagnosis.
Where It Breaks
Relying entirely on custom engine metrics introduces its own set of tradeoffs:
| Approach | Advantage | Disadvantage | Failure Mode |
|---|---|---|---|
| High-Frequency Polling | Catches micro-spikes in locks and connection exhaustion. | Puts continuous load on the database just to monitor it. | The monitoring query itself times out when the database is fully saturated. |
| Log-Based Telemetry | Zero additional query load; captures exact slow queries. | High ingestion costs and delayed parsing times. | Log volumes spike during an incident, delaying the very telemetry needed to diagnose it. |
| Cloud Provider Insights (e.g., PI) | Managed, low-overhead, deep integration with the hypervisor. | Locked into the vendor’s UI; harder to expose to internal AI agents. | The data cannot be easily correlated with external application traces. |
What to Do Next
- Problem: Default cloud dashboards report CPU and memory — lag indicators that fire after the database is already broken, not before. Lock-induced latency produces zero CPU signal.
- Solution: Add a “What is Waiting?” panel tracking
pg_stat_activitywait event counts, active lock counts, connection pool saturation, and replication byte lag as continuously scraped time-series metrics. - Proof: A staging game day that artificially locks a row should fire an alert within 60 seconds based on wait events — if it doesn’t, the telemetry foundation is incomplete and the next production incident will look exactly like the current one.
- Action: Deploy a PostgreSQL exporter polling
pg_stat_activityevery 15 seconds and add a dashboard panel forLockwait event counts this week.