Prometheus and Grafana are the right default for database monitoring when the team already runs them for infrastructure. The mistake is treating database exporters as install-and-forget: they require scope decisions, scrape tuning, recording rules for expensive queries, and panels aligned to operational questions rather than metric availability.

Situation

Prometheus with postgres_exporter or mysqld_exporter gives a team database metrics in the same system they use for Kubernetes, application, and infrastructure metrics. That consistency matters during incidents: one tool, one query language, one dashboard system.

The challenge is setup quality. Both exporters expose hundreds of metrics by default. Without scope decisions and recording rules, the result is a Prometheus instance ingesting metrics that nobody queries, Grafana dashboards that show every metric but answer no operational question, and a scrape interval too infrequent to catch short-duration failures.

Symptoms

SymptomLikely cause
Grafana database dashboard shows data but engineer can’t tell if system is healthyDashboard shows metrics, not answers — no thresholds, no anomaly detection
Prometheus scrape latency is highExporter is running expensive queries during scrape; needs collector filtering
Database monitoring is absent during Prometheus downtimeNo remote write or long-term storage — single point of failure
Alert fires but metric data is missingScrape interval too long for the alert evaluation window
Exporter crashes after database restartExporter not configured to retry connections

First Five Checks

1. Is postgres_exporter running with appropriate collector scope?

postgres_exporter \
  --collector.stat_activity_autovacuum \
  --collector.stat_statements \
  --collector.stat_bgwriter \
  --collector.stat_replication \
  --collector.replication_slot \
  --no-collector.wal \
  --no-collector.database_wraparound \
  --web.listen-address=:9187

Disable expensive collectors you do not need. database_wraparound queries age(datfrozenxid) on every database and can be slow on instances with many databases. Enable only the collectors you have dashboard panels for.

2. Is the scrape interval appropriate?

For OLTP databases, scrape every 30 seconds. For analytics-heavy workloads with slow collector queries, 60 seconds is acceptable. Shorter than 30 seconds risks accumulating scrape delays during high-load periods.

In prometheus.yml:

scrape_configs:
  - job_name: 'postgres'
    scrape_interval: 30s
    scrape_timeout: 20s
    static_configs:
      - targets: ['postgres-exporter:9187']
        labels:
          env: 'production'
          db_engine: 'postgres'
          cluster: 'primary'

3. Are recording rules defined for expensive derived metrics?

PromQL queries that compute ratios from raw counters on every dashboard load are expensive at query time. Move them into recording rules evaluated once per scrape.

# prometheus/rules/database.yaml
groups:
  - name: database_derived
    interval: 60s
    rules:
      - record: postgres:cache_hit_ratio
        expr: |
          rate(pg_statio_user_tables_heap_blks_hit[5m]) /
          (rate(pg_statio_user_tables_heap_blks_hit[5m]) +
           rate(pg_statio_user_tables_heap_blks_read[5m]))

      - record: postgres:connections_pct
        expr: |
          pg_stat_activity_count{state!="idle"} /
          pg_settings_max_connections * 100

      - record: postgres:replication_lag_seconds
        expr: |
          pg_replication_lag

4. Are alert rules configured with meaningful labels?

groups:
  - name: postgres_alerts
    rules:
      - alert: PostgresReplicaLagHigh
        expr: pg_replication_lag > 60
        for: 2m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "PostgreSQL replica lag above 60s on {{ $labels.instance }}"
          runbook_url: "https://wiki.example.com/runbooks/postgres-replica-lag"

      - alert: PostgresConnectionsNearLimit
        expr: postgres:connections_pct > 85
        for: 5m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "PostgreSQL connections at {{ $value | humanize }}% on {{ $labels.instance }}"

5. Is mysqld_exporter configured with the right user grants?

CREATE USER 'prometheus'@'%' IDENTIFIED BY 'use-secret-manager-here';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'prometheus'@'%';
-- For performance_schema access:
GRANT SELECT ON performance_schema.* TO 'prometheus'@'%';
FLUSH PRIVILEGES;

The exporter connects as this user. Grant only what the collectors actually need — not SUPER.

Decision Tree

flowchart TD
    A[Set up database monitoring with Prometheus] --> B[Install exporter]
    B --> C{Scope collectors}
    C -->|High-traffic OLTP| D[Enable: stat_activity, stat_statements, stat_bgwriter, stat_replication, locks]
    C -->|Analytics replica| E[Enable: stat_statements, replication_slot, database_size]
    D --> F[Set scrape interval 30s]
    E --> F
    F --> G[Define recording rules for ratios]
    G --> H[Build Grafana panels by operational question]
    H --> I{Alert rules}
    I -->|Define warning + critical| J[Set runbook URL on every alert]
    J --> K[Test alert with simulated failure in staging]

Core Grafana Panel Design

Build panels that answer operational questions, not panels that display metrics.

QuestionPanel typePromQL
Is replica lag within SLO?Gauge + thresholdpg_replication_lag{instance="$instance"}
How close are we to connection limit?Gauge + thresholdpostgres:connections_pct{instance="$instance"}
Which queries are slowest right now?Tabletopk(10, rate(pg_stat_statements_total_time[5m]))
Is cache hit ratio healthy?Time seriespostgres:cache_hit_ratio{instance="$instance"}
Which tables have the most dead tuples?Bar charttopk(10, pg_stat_user_tables_n_dead_tup)
Is checkpoint behavior normal?Time seriesrate(pg_stat_bgwriter_checkpoints_req[5m])

For MySQL:

QuestionPromQL
Replication lagmysql_slave_status_seconds_behind_master
Threads runningmysql_global_status_threads_running
InnoDB buffer pool waitrate(mysql_global_status_innodb_buffer_pool_wait_free[5m])
Slow queries per secondrate(mysql_global_status_slow_queries[5m])
Open tables vs cachemysql_global_status_open_tables / mysql_global_variables_table_open_cache

Rollback Plan

If the exporter is causing database load:

  1. Disable the problematic collector immediately: restart the exporter with --no-collector.<name>.
  2. Check pg_stat_activity for exporter sessions with long durations.
  3. Increase scrape_timeout to avoid Prometheus treating slow scrapes as failed.
  4. If the database is degraded, disable the exporter entirely and fall back to CloudWatch or basic OS metrics until the database is stable.

Automation Opportunity

  1. Dashboards as code: store Grafana dashboard JSON in Git and use grafana-dashboard-exporter or Terraform to provision dashboards. This prevents dashboard drift between environments.

  2. Exporter configuration templates: manage postgres_exporter configuration through a Helm chart or Ansible role with environment-specific variables. The monitoring role credentials and scrape endpoints should be provisioned through the same credential management pipeline as application secrets.

  3. Alert rule testing: use promtool test rules to write unit tests for alert rules. Test that alerts fire correctly given synthetic metric data — before deploying the rules to production.

promtool test rules tests/database_alerts_test.yaml

Leadership Summary

Prometheus and Grafana database monitoring is operationally complete only when it has four properties: appropriate collector scope (not every metric, only the ones with panels and alerts), recording rules for derived metrics (not computed on every dashboard load), alert rules with runbook links (not raw metric thresholds with no context), and tested alert coverage (simulated failures verified the alerts fire). An exporter that is installed but not tuned produces more cardinality than signal and slows down Prometheus at query time.

Where It Breaks

Failure modeTriggerFix
Exporter queries slow the databaseDefault collectors include expensive queries (e.g., bloat estimation)Disable unused collectors; enable only what has dashboard panels
Alert fires too oftenScrape every 15s, alert window is 1m — transient spikes trigger alertIncrease for duration to 2–5 minutes for metric volatility
Dashboard has 40 panels, no one knows what to look atMetrics-first design instead of question-firstRedesign from operational questions, not metric availability
Exporter loses database connection silentlyPostgreSQL restart drops exporter connection; exporter does not reconnectSet --web.config.file reconnect policy; use Kubernetes liveness probe
Alert runbook link is deadWiki reorganized, link not updatedStore runbook URL as a configmap value; validate links in CI

What to Do Next

  • Problem: Database monitoring uses Prometheus but panels show raw metrics, not operational health.
  • Solution: Add recording rules for derived metrics, build question-first panels, and add alert rules with runbook URLs.
  • Proof: Walk through an incident simulation: kill one replica, verify the lag alert fires within 2 minutes, confirm the runbook link points to the correct procedure.
  • Action: This week, define three recording rules (connection utilization, replica lag, cache hit ratio), create an alert for each at the critical threshold, and add a Grafana time series panel for each.