Datadog Database Monitoring is not just metrics collection with a nicer UI — it ships query-level explain plans, wait event breakdown, and connection pool visibility without requiring pg_stat_statements configuration or custom PromQL recording rules. The mistake is enabling it and leaving all sampling and explain plan collection at defaults, which produces query data that is too sparse to diagnose production slowdowns.

Situation

Teams running Datadog for application performance monitoring have a strong reason to use it for database monitoring too: one dashboard, one query language, and automatic correlation between slow application traces and the database queries those traces hit. The alternative — running a separate Prometheus stack with postgres_exporter, custom recording rules, and Grafana — is operationally heavier for teams that are not already Prometheus-native.

Datadog Database Monitoring (DBM) covers PostgreSQL, MySQL, Aurora PostgreSQL, Aurora MySQL, SQL Server, and Oracle. This post focuses on PostgreSQL and MySQL/Aurora MySQL — the two most common open-source targets.

The challenge is not installation. The challenge is that defaults produce incomplete data: explain plans are sampled at a low rate, wait event tracking requires explicit enabling, and the Agent needs database-side configuration (a dedicated monitoring user with the right grants) that Datadog’s quickstart guide underspecifies.

Symptoms

Symptom in Datadog DBMLikely cause
Query samples show “no explain plan available”pg_stat_statements not in shared_preload_libraries, or explain plan sampling rate is too low
Slow query visible in APM but not in DBMQuery duration is below DBM’s configured min duration threshold
Wait events show only “ClientRead”track_activity_query_size too small; truncating queries before DBM can match them
Aurora read replicas not appearing in DBMAgent not configured to connect to the reader endpoint separately
High DBM Agent CPU on the database hostExplain plan collection running too frequently; throttle via explain_statement_min_duration
Connection count in DBM does not match pg_stat_activityDBM is reading from pg_stat_activity but the monitoring user lacks pg_monitor role

First Five Checks

1. Is the monitoring user configured with the right grants?

For PostgreSQL:

CREATE USER datadog WITH password 'use-secret-manager-here';
GRANT pg_monitor TO datadog;

-- Required for query samples and explain plans:
CREATE SCHEMA datadog;
GRANT USAGE ON SCHEMA datadog TO datadog;
GRANT USAGE ON SCHEMA public TO datadog;
GRANT pg_read_all_stats TO datadog;

-- Function required for DBM explain plan collection:
CREATE OR REPLACE FUNCTION datadog.explain_statement(
   l_query TEXT,
   OUT explain JSON
)
RETURNS SETOF JSON AS $$
DECLARE
curs REFCURSOR;
plan JSON;
BEGIN
   OPEN curs FOR EXECUTE pg_catalog.concat('EXPLAIN (FORMAT JSON) ', l_query);
   FETCH curs INTO plan;
   CLOSE curs;
   RETURN QUERY SELECT plan;
END;
$$
LANGUAGE 'plpgsql'
RETURNS NULL ON NULL INPUT
SECURITY DEFINER;

The SECURITY DEFINER function is required because DBM collects explain plans for queries run by other users — the monitoring role does not have execution rights on arbitrary user queries.

For MySQL/Aurora MySQL:

CREATE USER 'datadog'@'%' IDENTIFIED WITH mysql_native_password BY 'use-secret-manager-here';
GRANT REPLICATION CLIENT ON *.* TO 'datadog'@'%';
GRANT PROCESS ON *.* TO 'datadog'@'%';
GRANT SELECT ON performance_schema.* TO 'datadog'@'%';
-- For explain plan collection:
GRANT SELECT ON sys.* TO 'datadog'@'%';

2. Is pg_stat_statements enabled?

SHOW shared_preload_libraries;
-- Must include 'pg_stat_statements'

-- If missing, add to postgresql.conf and restart:
-- shared_preload_libraries = 'pg_stat_statements'

-- After restart, verify:
SELECT * FROM pg_extension WHERE extname = 'pg_stat_statements';
-- If absent: CREATE EXTENSION pg_stat_statements;

-- Tune:
ALTER SYSTEM SET pg_stat_statements.max = 10000;
ALTER SYSTEM SET pg_stat_statements.track = 'all';
ALTER SYSTEM SET track_activity_query_size = 4096;
SELECT pg_reload_conf();

track_activity_query_size defaults to 1024 bytes in PostgreSQL 13 and earlier. Queries longer than this are truncated in pg_stat_activity, which prevents DBM from matching query samples to their explain plans.

3. Is the Datadog Agent configured for DBM?

In /etc/datadog-agent/conf.d/postgres.d/conf.yaml:

init_config:

instances:
  - host: your-db-host
    port: 5432
    username: datadog
    password: ENC[your-secret]   # use Datadog secret management
    dbname: your_database
    
    # Enable Database Monitoring:
    dbm: true
    
    # Query metrics — increase statement cache:
    query_metrics:
      enabled: true
    
    # Query samples — how often to collect explain plans:
    query_samples:
      enabled: true
      explain_statement_min_duration: 500   # ms — only collect plans for queries over 500ms
      samples_per_second: 1                  # Reduce if CPU pressure on the Agent host
    
    # Wait events (PostgreSQL 9.6+):
    query_activity:
      enabled: true
      collection_interval: 10    # seconds
    
    tags:
      - env:production
      - service:your-app
      - db_engine:postgres

For MySQL:

instances:
  - host: your-mysql-host
    user: datadog
    pass: ENC[your-secret]
    port: 3306
    dbm: true
    query_metrics:
      enabled: true
    query_samples:
      enabled: true
      explain_statement_min_duration: 500
    query_activity:
      enabled: true

4. Are explain plans being collected?

In Datadog UI: APM → Database Monitoring → Query Samples. Filter to your database host. If queries show “no explain plan,” verify:

  • The datadog.explain_statement function exists in the target database
  • explain_statement_min_duration is not set too high (default 5000ms misses most slow OLTP queries — set to 500ms)
  • The query is not a DDL or COPY statement (explain plans are not collected for these)
  • The Agent’s datadog user has USAGE on the schema where the queried tables live

5. Are wait events visible?

In Datadog UI: Database Monitoring → Query Metrics → click a query → Wait Events tab. If the tab is empty:

  • Verify query_activity.enabled: true in conf.yaml
  • Verify the datadog user has pg_monitor role
  • Check Agent logs: datadog-agent check postgres — look for errors on the pg_stat_activity collection

Decision Tree

flowchart TD
    A[Set up Datadog DBM] --> B[Create monitoring user with correct grants]
    B --> C{PostgreSQL or MySQL?}
    C -->|PostgreSQL| D[Enable pg_stat_statements — add to shared_preload_libraries]
    C -->|MySQL| E[Grant SELECT on performance_schema and sys]
    D --> F[Create datadog.explain_statement SECURITY DEFINER function]
    E --> G[Set dbm:true in Agent conf.yaml]
    F --> G
    G --> H[Set explain_statement_min_duration to 500ms]
    H --> I[Enable query_activity for wait events]
    I --> J{Verify data appears}
    J -->|Query samples empty| K[Check pg_stat_statements.track — set to all — check track_activity_query_size]
    J -->|No explain plans| L[Verify explain_statement function — check USAGE grant on all schemas]
    J -->|No wait events| M[Verify pg_monitor grant — check query_activity.enabled in conf.yaml]
    J -->|All data visible| N[Set alert thresholds on p99 query latency and connection saturation]

Rollback Plan

If DBM is causing database load:

  1. Reduce query_samples.samples_per_second to 0.1 or disable query sampling entirely: query_samples.enabled: false. Query metrics (without explain plans) have minimal database impact.
  2. Increase explain_statement_min_duration to 2000ms to reduce explain plan frequency.
  3. If the monitoring connection itself is causing connection count pressure, reduce Agent check frequency: min_collection_interval: 30 (seconds).
  4. Disable query_activity collection if the pg_stat_activity query is slow on instances with many databases or connections.
  5. The datadog.explain_statement function runs EXPLAIN on sampled queries. On very high-throughput databases, this adds measurable load. Disable plan collection and rely on query metrics only if the database is already under pressure.

Automation Opportunity

  1. Provision monitoring user via Terraform: manage the datadog PostgreSQL user and grants through the same Terraform module that provisions the database. Store the password in AWS Secrets Manager or Vault, not in the Agent config file directly.

  2. Agent configuration as code: manage conf.yaml through Ansible or a Helm chart value. The explain_statement_min_duration threshold and collection_interval settings should be tunable per environment without touching the Agent host directly.

  3. Alert from DBM metrics: create Datadog monitors on:

    • postgresql.connections > 80% of max_connections — warning; 90% critical
    • postgresql.replication.delay > 60s warning; 300s critical
    • postgresql.queries.avg_time P99 spike > 2× baseline — warning
    • mysql.replication.seconds_behind_master > 30s warning; null = critical (broken replication)

Leadership Summary

Datadog Database Monitoring closes the gap between APM traces and database behavior. When an application trace is slow, DBM lets the team click through to the specific SQL, its explain plan at the time of the slowdown, and the wait events that show what the database was waiting on. Without DBM configured correctly — with the right grants, pg_stat_statements enabled, track_activity_query_size large enough, and explain plan sampling at a useful threshold — the team gets query metrics but not query diagnostics. The setup work is one-time; the operational benefit is continuous.

Where It Breaks

Failure modeTriggerFix
Explain plans absent for short queriesexplain_statement_min_duration set to 5000ms (default)Lower to 500ms for OLTP databases
Truncated queries in DBMtrack_activity_query_size too smallSet to 4096 in postgresql.conf
Aurora read replicas not in DBMEach endpoint is a separate instanceAdd a separate instances: entry for the reader endpoint in conf.yaml
SECURITY DEFINER function security concernFunction runs EXPLAIN as superuser equivalentLimit the function to read-only plans only — the function only calls EXPLAIN, not EXECUTE
DBM adds one extra connection per AgentOn databases near max_connections, Agent connection pushes over the limitReserve connections for monitoring: set max_connections 10 higher than application pool max
pg_stat_statements reset on restartCumulative counters reset; DBM shows spikeSet pg_stat_statements.save = on; use rate metrics in Datadog, not raw counters

What to Do Next

  • Problem: Your database is visible in Datadog as infrastructure metrics but slow queries are not linked to their explain plans or wait events.
  • Solution: Enable DBM with the monitoring user grants above, set explain_statement_min_duration to 500ms, and verify pg_stat_statements is loaded.
  • Proof: After setup, trigger a known slow query and verify it appears in Query Samples with an explain plan attached within 60 seconds.
  • Action: This week, create the datadog monitoring user, add the SECURITY DEFINER explain function, and set dbm: true in the Agent config. Restart the Agent and verify query samples appear in the Datadog UI within 5 minutes.