Prometheus and Grafana for Database Monitoring: PostgreSQL and MySQL Setup

Prometheus and Grafana are the right default for database monitoring when the team already runs them for infrastructure. The mistake is treating database exporters as install-and-forget: they require scope decisions, scrape tuning, recording rules for expensive queries, and panels aligned to operational questions rather than metric availability.

Situation

Prometheus with postgres_exporter or mysqld_exporter gives a team database metrics in the same system they use for Kubernetes, application, and infrastructure metrics. That consistency matters during incidents: one tool, one query language, one dashboard system.

The challenge is setup quality. Both exporters expose hundreds of metrics by default. Without scope decisions and recording rules, the result is a Prometheus instance ingesting metrics that nobody queries, Grafana dashboards that show every metric but answer no operational question, and a scrape interval too infrequent to catch short-duration failures.

Symptoms

Symptom	Likely cause
Grafana database dashboard shows data but engineer can’t tell if system is healthy	Dashboard shows metrics, not answers — no thresholds, no anomaly detection
Prometheus scrape latency is high	Exporter is running expensive queries during scrape; needs collector filtering
Database monitoring is absent during Prometheus downtime	No remote write or long-term storage — single point of failure
Alert fires but metric data is missing	Scrape interval too long for the alert evaluation window
Exporter crashes after database restart	Exporter not configured to retry connections

First Five Checks

1. Is postgres_exporter running with appropriate collector scope?

postgres_exporter \
  --collector.stat_activity_autovacuum \
  --collector.stat_statements \
  --collector.stat_bgwriter \
  --collector.stat_replication \
  --collector.replication_slot \
  --no-collector.wal \
  --no-collector.database_wraparound \
  --web.listen-address=:9187

Disable expensive collectors you do not need. database_wraparound queries age(datfrozenxid) on every database and can be slow on instances with many databases. Enable only the collectors you have dashboard panels for.

2. Is the scrape interval appropriate?

For OLTP databases, scrape every 30 seconds. For analytics-heavy workloads with slow collector queries, 60 seconds is acceptable. Shorter than 30 seconds risks accumulating scrape delays during high-load periods.

In prometheus.yml:

scrape_configs:
  - job_name: 'postgres'
    scrape_interval: 30s
    scrape_timeout: 20s
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          env: 'production'
          db_engine: 'postgres'
          cluster: 'primary'

3. Are recording rules defined for expensive derived metrics?

PromQL queries that compute ratios from raw counters on every dashboard load are expensive at query time. Move them into recording rules evaluated once per scrape.

# prometheus/rules/database.yaml
groups:
  - name: database_derived
    interval: 60s
    rules:
      - record: postgres:cache_hit_ratio
        expr: |
          rate(pg_statio_user_tables_heap_blks_hit[5m]) /
          (rate(pg_statio_user_tables_heap_blks_hit[5m]) +
           rate(pg_statio_user_tables_heap_blks_read[5m]))

      - record: postgres:connections_pct
        expr: |
          pg_stat_activity_count{state!="idle"} /
          pg_settings_max_connections * 100

      - record: postgres:replication_lag_seconds
        expr: |
          pg_replication_lag

4. Are alert rules configured with meaningful labels?

groups:
  - name: postgres_alerts
    rules:
      - alert: PostgresReplicaLagHigh
        expr: pg_replication_lag > 60
        for: 2m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "PostgreSQL replica lag above 60s on {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbooks/postgres-replica-lag"

      - alert: PostgresConnectionsNearLimit
        expr: postgres:connections_pct > 85
        for: 5m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "PostgreSQL connections at {{ $value | humanize }}% on {{ $labels.instance }}"

5. Is mysqld_exporter configured with the right user grants?

CREATE USER 'prometheus'@'%' IDENTIFIED BY 'use-secret-manager-here';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'prometheus'@'%';
-- For performance_schema access:
GRANT SELECT ON performance_schema.* TO 'prometheus'@'%';
FLUSH PRIVILEGES;

The exporter connects as this user. Grant only what the collectors actually need — not SUPER.

Decision Tree

flowchart TD
    A[Set up database monitoring with Prometheus] --> B[Install exporter]
    B --> C{Scope collectors}
    C -->|High-traffic OLTP| D[Enable: stat_activity, stat_statements, stat_bgwriter, stat_replication, locks]
    C -->|Analytics replica| E[Enable: stat_statements, replication_slot, database_size]
    D --> F[Set scrape interval 30s]
    E --> F
    F --> G[Define recording rules for ratios]
    G --> H[Build Grafana panels by operational question]
    H --> I{Alert rules}
    I -->|Define warning + critical| J[Set runbook URL on every alert]
    J --> K[Test alert with simulated failure in staging]

Core Grafana Panel Design

Build panels that answer operational questions, not panels that display metrics.

Question	Panel type	PromQL
Is replica lag within SLO?	Gauge + threshold	`pg_replication_lag{instance="$instance"}`
How close are we to connection limit?	Gauge + threshold	`postgres:connections_pct{instance="$instance"}`
Which queries are slowest right now?	Table	`topk(10, rate(pg_stat_statements_total_time[5m]))`
Is cache hit ratio healthy?	Time series	`postgres:cache_hit_ratio{instance="$instance"}`
Which tables have the most dead tuples?	Bar chart	`topk(10, pg_stat_user_tables_n_dead_tup)`
Is checkpoint behavior normal?	Time series	`rate(pg_stat_bgwriter_checkpoints_req[5m])`

For MySQL:

Question	PromQL
Replication lag	`mysql_slave_status_seconds_behind_master`
Threads running	`mysql_global_status_threads_running`
InnoDB buffer pool wait	`rate(mysql_global_status_innodb_buffer_pool_wait_free[5m])`
Slow queries per second	`rate(mysql_global_status_slow_queries[5m])`
Open tables vs cache	`mysql_global_status_open_tables / mysql_global_variables_table_open_cache`

Rollback Plan

If the exporter is causing database load:

Disable the problematic collector immediately: restart the exporter with --no-collector.<name>.
Check pg_stat_activity for exporter sessions with long durations.
Increase scrape_timeout to avoid Prometheus treating slow scrapes as failed.
If the database is degraded, disable the exporter entirely and fall back to CloudWatch or basic OS metrics until the database is stable.

Automation Opportunity

Dashboards as code: store Grafana dashboard JSON in Git and use grafana-dashboard-exporter or Terraform to provision dashboards. This prevents dashboard drift between environments.
Exporter configuration templates: manage postgres_exporter configuration through a Helm chart or Ansible role with environment-specific variables. The monitoring role credentials and scrape endpoints should be provisioned through the same credential management pipeline as application secrets.
Alert rule testing: use promtool test rules to write unit tests for alert rules. Test that alerts fire correctly given synthetic metric data — before deploying the rules to production.

promtool test rules tests/database_alerts_test.yaml

Leadership Summary

Prometheus and Grafana database monitoring is operationally complete only when it has four properties: appropriate collector scope (not every metric, only the ones with panels and alerts), recording rules for derived metrics (not computed on every dashboard load), alert rules with runbook links (not raw metric thresholds with no context), and tested alert coverage (simulated failures verified the alerts fire). An exporter that is installed but not tuned produces more cardinality than signal and slows down Prometheus at query time.

Where It Breaks

Failure mode	Trigger	Fix
Exporter queries slow the database	Default collectors include expensive queries (e.g., bloat estimation)	Disable unused collectors; enable only what has dashboard panels
Alert fires too often	Scrape every 15s, alert window is 1m — transient spikes trigger alert	Increase `for` duration to 2–5 minutes for metric volatility
Dashboard has 40 panels, no one knows what to look at	Metrics-first design instead of question-first	Redesign from operational questions, not metric availability
Exporter loses database connection silently	PostgreSQL restart drops exporter connection; exporter does not reconnect	Set `--web.config.file` reconnect policy; use Kubernetes liveness probe
Alert runbook link is dead	Wiki reorganized, link not updated	Store runbook URL as a configmap value; validate links in CI

What to Do Next

Problem: Database monitoring uses Prometheus but panels show raw metrics, not operational health.
Solution: Add recording rules for derived metrics, build question-first panels, and add alert rules with runbook URLs.
Proof: Walk through an incident simulation: kill one replica, verify the lag alert fires within 2 minutes, confirm the runbook link points to the correct procedure.
Action: This week, define three recording rules (connection utilization, replica lag, cache hit ratio), create an alert for each at the critical threshold, and add a Grafana time series panel for each.

Situation

Symptoms

First Five Checks

Decision Tree

Core Grafana Panel Design

Rollback Plan

Automation Opportunity

Leadership Summary

Where It Breaks

What to Do Next

Rajiv

Related Posts

Datadog Database Monitoring: PostgreSQL, MySQL, and Aurora Setup

Database Alert Design: Thresholds That Fire on Real Problems

MySQL and Aurora Monitoring: The Dashboard That Catches Problems Before Users Do