Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup
Prometheus and Grafana are the right default for database monitoring when the team already runs them for infrastructure. The mistake is treating database exporters as install-and-forget: they require scope decisions, scrape tuning, recording rules for expensive queries, and panels aligned to operational questions rather than metric availability.
Situation
Prometheus with postgres_exporter or mysqld_exporter gives a team database metrics in the same system they use for Kubernetes, application, and infrastructure metrics. That consistency matters during incidents: one tool, one query language, one dashboard system.
The challenge is setup quality. Both exporters expose hundreds of metrics by default. Without scope decisions and recording rules, the result is a Prometheus instance ingesting metrics that nobody queries, Grafana dashboards that show every metric but answer no operational question, and a scrape interval too infrequent to catch short-duration failures.
Symptoms
| Symptom | Likely cause |
|---|---|
| Grafana database dashboard shows data but engineer can’t tell if system is healthy | Dashboard shows metrics, not answers — no thresholds, no anomaly detection |
| Prometheus scrape latency is high | Exporter is running expensive queries during scrape; needs collector filtering |
| Database monitoring is absent during Prometheus downtime | No remote write or long-term storage — single point of failure |
| Alert fires but metric data is missing | Scrape interval too long for the alert evaluation window |
| Exporter crashes after database restart | Exporter not configured to retry connections |
First Five Checks
1. Is postgres_exporter running with appropriate collector scope?
postgres_exporter \
--collector.stat_activity_autovacuum \
--collector.stat_statements \
--collector.stat_bgwriter \
--collector.stat_replication \
--collector.replication_slot \
--no-collector.wal \
--no-collector.database_wraparound \
--web.listen-address=:9187
Disable expensive collectors you do not need. database_wraparound queries age(datfrozenxid) on every database and can be slow on instances with many databases. Enable only the collectors you have dashboard panels for.
2. Is the scrape interval appropriate?
For OLTP databases, scrape every 30 seconds. For analytics-heavy workloads with slow collector queries, 60 seconds is acceptable. Shorter than 30 seconds risks accumulating scrape delays during high-load periods.
In prometheus.yml:
scrape_configs:
- job_name: 'postgres'
scrape_interval: 30s
scrape_timeout: 20s
static_configs:
- targets: ['postgres-exporter:9187']
labels:
env: 'production'
db_engine: 'postgres'
cluster: 'primary'
3. Are recording rules defined for expensive derived metrics?
PromQL queries that compute ratios from raw counters on every dashboard load are expensive at query time. Move them into recording rules evaluated once per scrape.
# prometheus/rules/database.yaml
groups:
- name: database_derived
interval: 60s
rules:
- record: postgres:cache_hit_ratio
expr: |
rate(pg_statio_user_tables_heap_blks_hit[5m]) /
(rate(pg_statio_user_tables_heap_blks_hit[5m]) +
rate(pg_statio_user_tables_heap_blks_read[5m]))
- record: postgres:connections_pct
expr: |
pg_stat_activity_count{state!="idle"} /
pg_settings_max_connections * 100
- record: postgres:replication_lag_seconds
expr: |
pg_replication_lag
4. Are alert rules configured with meaningful labels?
groups:
- name: postgres_alerts
rules:
- alert: PostgresReplicaLagHigh
expr: pg_replication_lag > 60
for: 2m
labels:
severity: warning
team: database
annotations:
summary: "PostgreSQL replica lag above 60s on {{ $labels.instance }}"
runbook_url: "https://wiki.example.com/runbooks/postgres-replica-lag"
- alert: PostgresConnectionsNearLimit
expr: postgres:connections_pct > 85
for: 5m
labels:
severity: critical
team: database
annotations:
summary: "PostgreSQL connections at {{ $value | humanize }}% on {{ $labels.instance }}"
5. Is mysqld_exporter configured with the right user grants?
CREATE USER 'prometheus'@'%' IDENTIFIED BY 'use-secret-manager-here';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'prometheus'@'%';
-- For performance_schema access:
GRANT SELECT ON performance_schema.* TO 'prometheus'@'%';
FLUSH PRIVILEGES;
The exporter connects as this user. Grant only what the collectors actually need — not SUPER.
Decision Tree
flowchart TD
A[Set up database monitoring with Prometheus] --> B[Install exporter]
B --> C{Scope collectors}
C -->|High-traffic OLTP| D[Enable: stat_activity, stat_statements, stat_bgwriter, stat_replication, locks]
C -->|Analytics replica| E[Enable: stat_statements, replication_slot, database_size]
D --> F[Set scrape interval 30s]
E --> F
F --> G[Define recording rules for ratios]
G --> H[Build Grafana panels by operational question]
H --> I{Alert rules}
I -->|Define warning + critical| J[Set runbook URL on every alert]
J --> K[Test alert with simulated failure in staging]
Core Grafana Panel Design
Build panels that answer operational questions, not panels that display metrics.
| Question | Panel type | PromQL |
|---|---|---|
| Is replica lag within SLO? | Gauge + threshold | pg_replication_lag{instance="$instance"} |
| How close are we to connection limit? | Gauge + threshold | postgres:connections_pct{instance="$instance"} |
| Which queries are slowest right now? | Table | topk(10, rate(pg_stat_statements_total_time[5m])) |
| Is cache hit ratio healthy? | Time series | postgres:cache_hit_ratio{instance="$instance"} |
| Which tables have the most dead tuples? | Bar chart | topk(10, pg_stat_user_tables_n_dead_tup) |
| Is checkpoint behavior normal? | Time series | rate(pg_stat_bgwriter_checkpoints_req[5m]) |
For MySQL:
| Question | PromQL |
|---|---|
| Replication lag | mysql_slave_status_seconds_behind_master |
| Threads running | mysql_global_status_threads_running |
| InnoDB buffer pool wait | rate(mysql_global_status_innodb_buffer_pool_wait_free[5m]) |
| Slow queries per second | rate(mysql_global_status_slow_queries[5m]) |
| Open tables vs cache | mysql_global_status_open_tables / mysql_global_variables_table_open_cache |
Rollback Plan
If the exporter is causing database load:
- Disable the problematic collector immediately: restart the exporter with
--no-collector.<name>. - Check
pg_stat_activityfor exporter sessions with long durations. - Increase
scrape_timeoutto avoid Prometheus treating slow scrapes as failed. - If the database is degraded, disable the exporter entirely and fall back to CloudWatch or basic OS metrics until the database is stable.
Automation Opportunity
-
Dashboards as code: store Grafana dashboard JSON in Git and use
grafana-dashboard-exporteror Terraform to provision dashboards. This prevents dashboard drift between environments. -
Exporter configuration templates: manage
postgres_exporterconfiguration through a Helm chart or Ansible role with environment-specific variables. The monitoring role credentials and scrape endpoints should be provisioned through the same credential management pipeline as application secrets. -
Alert rule testing: use
promtool test rulesto write unit tests for alert rules. Test that alerts fire correctly given synthetic metric data — before deploying the rules to production.
promtool test rules tests/database_alerts_test.yaml
Leadership Summary
Prometheus and Grafana database monitoring is operationally complete only when it has four properties: appropriate collector scope (not every metric, only the ones with panels and alerts), recording rules for derived metrics (not computed on every dashboard load), alert rules with runbook links (not raw metric thresholds with no context), and tested alert coverage (simulated failures verified the alerts fire). An exporter that is installed but not tuned produces more cardinality than signal and slows down Prometheus at query time.
Where It Breaks
| Failure mode | Trigger | Fix |
|---|---|---|
| Exporter queries slow the database | Default collectors include expensive queries (e.g., bloat estimation) | Disable unused collectors; enable only what has dashboard panels |
| Alert fires too often | Scrape every 15s, alert window is 1m — transient spikes trigger alert | Increase for duration to 2–5 minutes for metric volatility |
| Dashboard has 40 panels, no one knows what to look at | Metrics-first design instead of question-first | Redesign from operational questions, not metric availability |
| Exporter loses database connection silently | PostgreSQL restart drops exporter connection; exporter does not reconnect | Set --web.config.file reconnect policy; use Kubernetes liveness probe |
| Alert runbook link is dead | Wiki reorganized, link not updated | Store runbook URL as a configmap value; validate links in CI |
What to Do Next
- Problem: Database monitoring uses Prometheus but panels show raw metrics, not operational health.
- Solution: Add recording rules for derived metrics, build question-first panels, and add alert rules with runbook URLs.
- Proof: Walk through an incident simulation: kill one replica, verify the lag alert fires within 2 minutes, confirm the runbook link points to the correct procedure.
- Action: This week, define three recording rules (connection utilization, replica lag, cache hit ratio), create an alert for each at the critical threshold, and add a Grafana time series panel for each.